UNICODE_NORMALIZE
Description
Performs Unicode normalization on the input string.
Unicode normalization is the process of converting equivalent Unicode character sequences into a unified form. For example, the character "é" can be represented by a single code point (U+00E9) or by "e" + a combining acute accent (U+0065 + U+0301). Normalization ensures that these equivalent representations are handled uniformly.
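The two representations of "é" described above can be inspected outside the database. The following is a minimal sketch using Python's standard `unicodedata` module as a stand-in; it only illustrates the Unicode behavior and is not the implementation behind `UNICODE_NORMALIZE`. The helper `code_points` is a hypothetical name introduced for illustration.

```python
import unicodedata

composed = "\u00e9"      # é as a single code point
decomposed = "e\u0301"   # e followed by a combining acute accent


def code_points(text: str) -> list:
    """Return the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(composed))     # ['U+00E9']
print(code_points(decomposed))   # ['U+0065', 'U+0301']

# The two strings render identically but are not equal byte for byte.
print(composed == decomposed)    # False

# NFC composes, NFD decomposes, so each form converts into the other.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True
print(unicodedata.normalize("NFD", composed) == decomposed)   # True
```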
Syntax
UNICODE_NORMALIZE(<str>, <mode>)
Parameters
| Parameter | Description |
|---|---|
| `<str>` | The input string to be normalized. Type: VARCHAR |
| `<mode>` | The normalization mode. Must be a constant string (case-insensitive). Supported modes:<br>- `NFC`: canonical decomposition, followed by canonical composition<br>- `NFD`: canonical decomposition<br>- `NFKC`: compatibility decomposition, followed by canonical composition<br>- `NFKD`: compatibility decomposition<br>- `NFKC_CF`: NFKC followed by case folding |
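To see how the modes differ on one input, the sketch below applies each of them to a string containing a ligature, a circled digit, and an accented letter. It uses Python's `unicodedata` as a stand-in, not the engine's own code. The `normalize` helper is hypothetical, and its `NFKC_CF` branch is only an approximation: it combines NFKC with `str.casefold()`, whereas the Unicode NFKC_Casefold mapping also removes default-ignorable code points.

```python
import unicodedata


def normalize(text: str, mode: str) -> str:
    """Hypothetical helper: apply a normalization mode, case-insensitively."""
    mode = mode.upper()
    if mode == "NFKC_CF":
        # Approximation only: NFKC, then case folding, then NFKC again.
        folded = unicodedata.normalize("NFKC", text).casefold()
        return unicodedata.normalize("NFKC", folded)
    if mode not in {"NFC", "NFD", "NFKC", "NFKD"}:
        raise ValueError(f"unsupported mode: {mode}")
    return unicodedata.normalize(mode, text)


# U+FB01 (fi ligature), U+2460 (circled digit one), U+00C5 (A with ring above)
sample = "\ufb01 \u2460 \u00c5"

for mode in ("NFC", "NFD", "NFKC", "NFKD", "nfkc_cf"):
    result = normalize(sample, mode)
    print(f"{mode:8}", [f"U+{ord(ch):04X}" for ch in result])
```

The canonical modes (NFC, NFD) leave the ligature and the circled digit untouched and only compose or decompose the accented letter. The compatibility modes (NFKC, NFKD) additionally replace the ligature with "fi" and the circled digit with "1".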
Return Value
Returns a VARCHAR value containing the normalized form of the input string.
Examples
- Difference between NFC and NFD (composed vs decomposed characters)
-- If the é in 'Café' is in composed form (U+00E9), NFD decomposes it into e + combining acute accent (U+0065 + U+0301)
SELECT length(unicode_normalize('Café', 'NFC')) AS nfc_len, length(unicode_normalize('Café', 'NFD')) AS nfd_len;
+---------+---------+
| nfc_len | nfd_len |
+---------+---------+
|       4 |       5 |
+---------+---------+
- NFKC_CF for case folding
SELECT unicode_normalize('ABC 123', 'nfkc_cf') AS result;
+---------+
| result  |
+---------+
| abc 123 |
+---------+
- NFKC handling fullwidth characters (compatibility decomposition)
-- Fullwidth digits '１２３' are converted to halfwidth '123'
SELECT unicode_normalize('１２３ABC', 'NFKC') AS result;
+--------+
| result |
+--------+
| 123ABC |
+--------+
- NFKD handling special symbols (compatibility decomposition)
-- ℃ (degree Celsius symbol) will be decomposed to °C
SELECT unicode_normalize('25℃', 'NFKD') AS result;
+--------+
| result |
+--------+
| 25°C   |
+--------+
- Handling circled numbers
-- ① ② ③ circled numbers will be converted to regular digits
SELECT unicode_normalize('①②③', 'NFKC') AS result;
+--------+
| result |
+--------+
| 123    |
+--------+
- Comparing different modes on the same string
-- 'ﬁ' is the single-character ligature U+FB01: NFC leaves it unchanged, NFKC expands it to the two letters 'fi'
SELECT
unicode_normalize('ﬁ', 'NFC') AS nfc_result,
unicode_normalize('ﬁ', 'NFKC') AS nfkc_result;
+------------+-------------+
| nfc_result | nfkc_result |
+------------+-------------+
| ﬁ          | fi          |
+------------+-------------+
- String equality comparison scenario
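-- Use normalization to compare visually identical but differently encoded strings
-- The first literal uses the precomposed é (U+00E9); the second uses e + combining acute accent (U+0065 + U+0301)
SELECT unicode_normalize('café', 'NFC') = unicode_normalize('café', 'NFC') AS is_equal;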
+----------+
| is_equal |
+----------+
|        1 |
+----------+
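The same comparison idea extends to grouping or deduplicating values by their normalized form. The sketch below shows this in Python with `unicodedata` as a stand-in for the SQL function. The helper `group_by_normal_form` and the sample data are hypothetical and serve only to illustrate the technique.

```python
import unicodedata
from collections import defaultdict


def group_by_normal_form(values, mode="NFC"):
    """Group strings whose normalized forms are identical."""
    groups = defaultdict(list)
    for value in values:
        groups[unicodedata.normalize(mode, value)].append(value)
    return dict(groups)


# Precomposed é, e + combining acute accent, and plain unaccented "cafe".
names = ["caf\u00e9", "cafe\u0301", "cafe"]
groups = group_by_normal_form(names)

print(names[0] == names[1])   # False: the raw strings differ
print(len(groups))            # 2: the two accented spellings share one key
```

Normalizing both sides to the same form before comparing is what makes the two accented spellings collapse into one group, while the unaccented "cafe" stays separate because normalization does not remove accents.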