SPLIT_BY_REGEXP
Description
The SPLIT_BY_REGEXP function splits a string into an array of strings based on a specified regular expression pattern. Unlike SPLIT_BY_STRING, this function supports complex regular expression matching for more flexible splitting rules. It optionally supports a maximum split count limit, which is useful for structured text processing, data cleansing, and pattern matching.
Syntax
SPLIT_BY_REGEXP(<str>, <pattern> [, <max_limit>])
Parameters
| Parameter | Description |
|---|---|
<str> | The source string to be split. Type: VARCHAR |
<pattern> | Regular expression pattern used as the delimiter. Type: VARCHAR |
<max_limit> | Optional parameter, limits the maximum number of elements in the returned array. Type: INT |
Return Value
Returns ARRAY
Splitting rules:
- Uses regular expression pattern to match split points
- Supports standard regular expression syntax
- Empty string pattern will split the string into individual characters
- If pattern doesn't match anything, returns single-element array containing the original string
- max_limit restricts the maximum length of the result array
Special cases:
- If any parameter is NULL, returns NULL
- If string is empty, returns single-element array containing empty string
- If regular expression is empty string, splits by characters
- If max_limit is 0 or negative, no limit is applied
- Consecutive matches produce empty string elements
Examples
- Empty pattern splits by characters
SELECT SPLIT_BY_REGEXP('abcde', '');
+------------------------------+
| SPLIT_BY_REGEXP('abcde', '') |
+------------------------------+
| ["a", "b", "c", "d", "e"] |
+------------------------------+
- Digit pattern splitting
SELECT SPLIT_BY_REGEXP('a12bc23de345f', '\\d+');
+-----------------------------------------+
| SPLIT_BY_REGEXP('a12bc23de345f', '\d+') |
+-----------------------------------------+
| ["a", "bc", "de", "f"] |
+-----------------------------------------+
- NULL value handling
SELECT SPLIT_BY_REGEXP(NULL, '\\d+'), SPLIT_BY_REGEXP('test', NULL);
+--------------------------------+--------------------------------+
| SPLIT_BY_REGEXP(NULL, '\d+') | SPLIT_BY_REGEXP('test', NULL) |
+--------------------------------+--------------------------------+
| NULL | NULL |
+--------------------------------+--------------------------------+
- Empty string handling
SELECT SPLIT_BY_REGEXP('', ','), SPLIT_BY_REGEXP('hello', 'xyz');
+---------------------------+-------------------------------+
| SPLIT_BY_REGEXP('', ',') | SPLIT_BY_REGEXP('hello', 'xyz') |
+---------------------------+-------------------------------+
| [""] | ["hello"] |
+---------------------------+-------------------------------+
- Using maximum limit parameter
SELECT SPLIT_BY_REGEXP('a,b,c,d,e', ',', 3), SPLIT_BY_REGEXP('1-2-3-4-5', '-', 2);
+--------------------------------------+--------------------------------------+
| SPLIT_BY_REGEXP('a,b,c,d,e', ',', 3) | SPLIT_BY_REGEXP('1-2-3-4-5', '-', 2) |
+--------------------------------------+--------------------------------------+
| ["a", "b", "c,d,e"] | ["1", "2-3-4-5"] |
+--------------------------------------+--------------------------------------+
- Whitespace pattern
SELECT SPLIT_BY_REGEXP('hello world test', '\\s+'), SPLIT_BY_REGEXP('a\tb\nc\rd', '\\s');
+------------------------------------------+------------------------------------+
| SPLIT_BY_REGEXP('hello world test', '\s+') | SPLIT_BY_REGEXP('a\tb\nc\rd', '\s') |
+------------------------------------------+------------------------------------+
| ["hello", "world", "test"] | ["a", "b", "c", "d"] |
+------------------------------------------+------------------------------------+
- Special characters and escaping
SELECT SPLIT_BY_REGEXP('a.b.c.d', '\\.'), SPLIT_BY_REGEXP('x(y)z[w]', '[\\(\\)\\[\\]]');
+----------------------------------+--------------------------------------------+
| SPLIT_BY_REGEXP('a.b.c.d', '\.') | SPLIT_BY_REGEXP('x(y)z[w]', '[\(\)\[\]]') |
+----------------------------------+--------------------------------------------+
| ["a", "b", "c", "d"] | ["x", "y", "z", "w"] |
+----------------------------------+--------------------------------------------+
- Word boundaries and complex patterns
SELECT SPLIT_BY_REGEXP('TheQuickBrownFox', '[A-Z]'), SPLIT_BY_REGEXP('user@example.com', '@|\\.');
+------------------------------------------+-------------------------------------------+
| SPLIT_BY_REGEXP('TheQuickBrownFox', '[A-Z]') | SPLIT_BY_REGEXP('user@example.com', '@|\.') |
+------------------------------------------+-------------------------------------------+
| ["", "he", "uick", "rown", "ox"] | ["user", "example", "com"] |
+------------------------------------------+-------------------------------------------+
- UTF-8 multi-byte characters
SELECT SPLIT_BY_REGEXP('ṭṛì→ḍḍumai→hello', '→'), SPLIT_BY_REGEXP('αβγδε', '[βδ]');
+------------------------------------------+----------------------------------+
| SPLIT_BY_REGEXP('ṭṛì→ḍḍumai→hello', '→') | SPLIT_BY_REGEXP('αβγδε', '[βδ]') |
+------------------------------------------+----------------------------------+
| ["ṭṛì", "ḍḍumai", "hello"] | ["α", "γ", "ε"] |
+------------------------------------------+----------------------------------+
- Consecutive matches and empty elements
SELECT SPLIT_BY_REGEXP('a,,b,c', ','), SPLIT_BY_REGEXP('123abc456def', '[a-z]+');
+-------------------------------+---------------------------------------+
| SPLIT_BY_REGEXP('a,,b,c', ',') | SPLIT_BY_REGEXP('123abc456def', '[a-z]+') |
+-------------------------------+---------------------------------------+
| ["a", "", "b", "c"] | ["123", "456", ""] |
+-------------------------------+---------------------------------------+
Keywords
SPLIT_BY_REGEXP, SPLIT, REGEXP