REGEXP_EXTRACT_OR_NULL
Description​
Extract the first substring that matches the target regular expression pattern from the text string, and extract a specific group from it based on the expression group index.
It should be noted that when handling character set matching, Utf-8 standard character classes should be used. This ensures that functions can correctly identify and process various characters from different languages.
Support since Apache Doris 3.0.2
If the 'pattern' is not allowed regexp regular,throw error
Support character match classes : https://github.com/google/re2/wiki/Syntax
Syntax​
REGEXP_EXTRACT_OR_NULL(<str>, <pattern>, <pos>)
Parameters​
Parameter | Description |
---|---|
<str> | A string parameter. It represents the text string in which the regular expression matching will be performed. This string can contain any combination of characters, and the function will search within it for substrings that match the |
<pattern> | A string parameter. It is the target regular expression pattern. This pattern can include various regular expression metacharacters and character classes, which precisely define the rules for the substring to be matched |
<pos> | An integer parameter. It indicates the index of the expression group to be extracted. The indexing starts from 1. If |
Return Value​
Return a string type, with the result being the part that matches <pattern>
.
If the input <pos>
is 0, return the entire first matching substring.
If the input <pos>
is invalid (negative or exceeds the number of expression groups), return NULL.
If the regular expression fails to match, return NULL.
If the <pos>
< 0,return NULL;
If the pos
> the length of <str>
,return NULL;
Example​
Extracting a specific group from a match. Explanation: The regular expression ([[:lower:]]+)C([[:lower:]]+) looks for sequences of one or more lowercase letters separated by 'C'. The group with index 1 corresponds to the first sequence of lowercase letters, so 'b' is returned.
SELECT REGEXP_EXTRACT_OR_NULL('123AbCdExCx', '([[:lower:]]+)C([[:lower:]]+)', 1);
+---------------------------------------------------------------------------+
| REGEXP_EXTRACT_OR_NULL('123AbCdExCx', '([[:lower:]]+)C([[:lower:]]+)', 1) |
+---------------------------------------------------------------------------+
| b |
+---------------------------------------------------------------------------+
Returning the entire matched substring.When
SELECT REGEXP_EXTRACT_OR_NULL('123AbCdExCx', '([[:lower:]]+)C([[:lower:]]+)', 0);
+---------------------------------------------------------------------------+
| REGEXP_EXTRACT_OR_NULL('123AbCdExCx', '([[:lower:]]+)C([[:lower:]]+)', 0) |
+---------------------------------------------------------------------------+
| bCd |
+---------------------------------------------------------------------------+
An invalid group index.Since the pattern has only 2 groups, an index of 5 is out of range, so NULL is returned.
SELECT REGEXP_EXTRACT_OR_NULL('123AbCdExCx', '([[:lower:]]+)C([[:lower:]]+)', 5);
+---------------------------------------------------------------------------+
| REGEXP_EXTRACT_OR_NULL('123AbCdExCx', '([[:lower:]]+)C([[:lower:]]+)', 5) |
+---------------------------------------------------------------------------+
| NULL |
+---------------------------------------------------------------------------+
A non - matching regular expression.There is no part of the string 'AbCdE' that fully matches the pattern
SELECT REGEXP_EXTRACT_OR_NULL('AbCdE', '([[:lower:]]+)C([[:upper:]]+)', 1);
+---------------------------------------------------------------------+
| REGEXP_EXTRACT_OR_NULL('AbCdE', '([[:lower:]]+)C([[:upper:]]+)', 1) |
+---------------------------------------------------------------------+
| NULL |
+---------------------------------------------------------------------+
Chinese character matching.The pattern (\p{Han}+)(.+) first matches one or more Chinese characters and then any remaining characters. The group with index 2 represents the non - Chinese part of the string after the Chinese characters.
select REGEXP_EXTRACT_OR_NULL('è¿™æ˜¯ä¸€æ®µä¸æ–‡ This is a passage in English 1234567', '(\\p{Han}+)(.+)', 2);
+---------------------------------------------------------------------------------------------------------+
| REGEXP_EXTRACT_OR_NULL('è¿™æ˜¯ä¸€æ®µä¸æ–‡ This is a passage in English 1234567', '(\\p{Han}+)(.+)', 2) |
+---------------------------------------------------------------------------------------------------------+
| This is a passage in English 1234567 |
+---------------------------------------------------------------------------------------------------------+
Inserting data into a table and performing extraction
CREATE TABLE test_regexp_extract_or_null (
id INT,
text_column VARCHAR(255),
pattern_column VARCHAR(255),
position_column INT
) PROPERTIES ("replication_num"="1");
INSERT INTO test_regexp_extract_or_null VALUES
(1, 'abc123def', '([a-z]+)([0-9]+)([a-z]+)', 2),
(2, 'Hello World', '([A-Z][a-z]+) ([A-Z][a-z]+)', 0),
(3, '123-456-789', '([0-9]{3})-([0-9]{3})-([0-9]{3})', 3),
(4, 'example@example.com', '([a-z]+)@([a-z]+)\\.([a-z]+)', 1),
(5, '测试文本 test text', '(\\p{Han}+) (.+)', 1);
SELECT id, REGEXP_EXTRACT_OR_NULL(text_column, pattern_column, position_column) AS extracted_result
FROM test_regexp_extract_or_null
ORDER BY id;
+------+-----------------+
| id | extracted_result|
+------+-----------------+
| 1 | 123 |
| 2 | Hello World |
| 3 | 789 |
| 4 | example |
| 5 | 测试文本 |
+------+-----------------+
Emoji case
SELECT regexp_extract_or_null('😀😊😎', '😀|😊|😎',0);
+------------------------------------------------------------+
| regexp_extract_or_null('😀😊😎', '😀|😊|😎',0) |
+------------------------------------------------------------+
| 😀 |
+------------------------------------------------------------+
'str' is NULL,return NULL
SELECT REGEXP_EXTRACT_OR_NULL(NULL, '([a-z]+)', 1);
+---------------------------------------------+
| REGEXP_EXTRACT_OR_NULL(NULL, '([a-z]+)', 1) |
+---------------------------------------------+
| NULL |
+---------------------------------------------+
'pattern' is NULL,return NULL
SELECT REGEXP_EXTRACT_OR_NULL('Hello World', NULL, 1);
+------------------------------------------------+
| REGEXP_EXTRACT_OR_NULL('Hello World', NULL, 1) |
+------------------------------------------------+
| NULL |
+------------------------------------------------+
'pos' is NULL ,return NULL
SELECT REGEXP_EXTRACT_OR_NULL('Hello World', '([a-z]+)', NULL);
+---------------------------------------------------------+
| REGEXP_EXTRACT_OR_NULL('Hello World', '([a-z]+)', NULL) |
+---------------------------------------------------------+
| NULL |
+---------------------------------------------------------+
ALL parameters are NULL,return NULL;
SELECT REGEXP_EXTRACT_OR_NULL(NULL,NULL,NULL);
+----------------------------------------+
| REGEXP_EXTRACT_OR_NULL(NULL,NULL,NULL) |
+----------------------------------------+
| NULL |
+----------------------------------------+
If the pattern
is not allowed regexp regular,throw error;
mysql> SELECT REGEXP_EXTRACT_OR_NULL('123AbCdExCx', '([[:lower:]]+)C([[]ower:]]+)', 1);
ERROR 1105 (HY000): errCode = 2, detailMessage = (10.16.10.2)[INVALID_ARGUMENT]Could not compile regexp pattern: ([[:lower:]]+)C([[:lower:]+)
Error: missing ]: [[:lower:]+)