Custom Analyzer
A Custom Analyzer combines a character filter (char_filter), a tokenizer, and a token filter (token_filter) so that you can precisely define how text is split into searchable terms based on business needs. It overcomes the limitations of built-in analyzers and directly affects search relevance and the accuracy of data analysis.

Applicable Scenarios
| Scenario | Recommended Combination |
|---|---|
| Multilingual mixed-text search | icu tokenizer + icu_normalizer filter |
| Phone number / code prefix matching | edge_ngram tokenizer |
| Splitting complex English text (camel case, hyphens, etc.) | standard tokenizer + word_delimiter filter |
| Auto-completion / input suggestions | edge_ngram tokenizer + lowercase filter |
| Exact match (preserve original term) | keyword tokenizer + lowercase / asciifolding filter |
Core Components
A custom analyzer consists of four kinds of objects, applied to the original text in order:
Original text ──► [char_filter] ──► [tokenizer] ──► [token_filter] ──► Terms
| Component | Purpose | Count |
|---|---|---|
char_filter | Pre-processes characters before tokenization (such as replacement and normalization) | 0 ~ N |
tokenizer | Splits text into terms | 1 |
token_filter | Processes the resulting terms (such as lowercasing or ASCII folding) | 0 ~ N |
analyzer | Assembles the components above into a complete tokenization pipeline | 1 |
Creating a Custom Analyzer
1. Character Filter (char_filter)
CREATE INVERTED INDEX CHAR_FILTER IF NOT EXISTS x_char_filter
PROPERTIES (
"type" = "char_replace"
-- See below for other parameters
);
Supported character filter types:
-
char_replace: replaces specified characters with target characters before tokenization.char_filter_pattern: list of characters to replacechar_filter_replacement: replacement character (default: space)
-
icu_normalizer: pre-processes text using ICU normalization.name: normalization form (defaultnfkc_cf). Options:nfc,nfkc,nfkc_cf,nfd,nfkdmode: normalization mode (defaultcompose). Options:compose,decomposeunicode_set_filter: specifies the character set to normalize (such as[a-z])
2. Tokenizer
CREATE INVERTED INDEX TOKENIZER IF NOT EXISTS x_tokenizer
PROPERTIES (
"type" = "standard"
);
Supported tokenizer types:
| Type | Description | Main Parameters |
|---|---|---|
standard | Standard tokenization (follows Unicode text segmentation), suitable for most languages | None |
ngram | Splits by N-grams | min_ngram, max_ngram, token_chars |
edge_ngram | Generates N-grams starting from the beginning of the word | min_ngram, max_ngram, token_chars |
keyword | Outputs the entire text as a single term, often combined with token_filter | None |
char_group | Splits by the given characters | tokenize_on_chars |
basic | Simple English / digit / Chinese / Unicode tokenization | extra_chars |
icu | ICU internationalized tokenization, supports complex scripts in multiple languages | None |
Parameter descriptions:
min_ngram: minimum length (default 1)max_ngram: maximum length (default 2)token_chars: character categories to keep (default: keep all). Options:letter,digit,whitespace,punctuation,symboltokenize_on_chars: a character list or category. Categories supportwhitespace,letter,digit,punctuation,symbol,cjkextra_chars: additional ASCII characters to split on (such as[]().)
3. Token Filter
CREATE INVERTED INDEX TOKEN_FILTER IF NOT EXISTS x_token_filter
PROPERTIES (
"type" = "word_delimiter"
);
Supported token filter types:
| Type | Purpose |
|---|---|
word_delimiter | Splits at non-alphanumeric characters and can perform normalization |
ascii_folding | Maps non-ASCII characters to their ASCII equivalents |
lowercase | Converts token text to lowercase |
icu_normalizer | Processes tokens using ICU normalization |
word_delimiter Details
Default behavior:
- Uses non-alphanumeric characters as delimiters (for example:
Super-DupertoSuper,Duper) - Removes leading and trailing delimiters from a token (for example:
XL---42+'Autocoder'toXL,42,Autocoder) - Splits at case transitions (for example:
PowerShottoPower,Shot) - Splits at the boundary between letters and digits (for example:
XL500toXL,500) - Removes the English possessive
's(for example:Neil'stoNeil)
Optional parameters:
| Parameter | Default | Description |
|---|---|---|
generate_number_parts | true | Whether to output the numeric parts |
generate_word_parts | true | Whether to output the word parts |
protected_words | None | Protected words that are not split |
split_on_case_change | true | Whether to split at case changes |
split_on_numerics | true | Whether to split at the boundary between letters and digits |
stem_english_possessive | true | Whether to remove the English possessive 's |
type_table | None | Custom character type mapping table |
type_table supports the following mapping types:
ALPHA(letter)ALPHANUM(alphanumeric)DIGIT(digit)LOWER(lowercase letter)SUBWORD_DELIM(non-alphanumeric delimiter)UPPER(uppercase letter)
Example: ["+ => ALPHA", "- => ALPHA"] treats + and - as letters so that they are not used as split points.
icu_normalizer Parameters
name: normalization form (defaultnfkc_cf). Options:nfc,nfkc,nfkc_cf,nfd,nfkdunicode_set_filter: specifies the character set to normalize
4. Analyzer
Assembles the components above into a complete tokenization pipeline:
CREATE INVERTED INDEX ANALYZER IF NOT EXISTS x_analyzer
PROPERTIES (
"tokenizer" = "x_tokenizer", -- A single tokenizer
"token_filter" = "x_filter1, x_filter2" -- One or more token_filters, executed in order
);
Using a Custom Analyzer in a Table
Reference a created custom analyzer through the analyzer field in the index PROPERTIES:
CREATE TABLE tbl (
`a` bigint NOT NULL AUTO_INCREMENT(1),
`ch` text NULL,
INDEX idx_ch (`ch`) USING INVERTED PROPERTIES("analyzer" = "x_custom_analyzer", "support_phrase" = "true")
)
table_properties;
Notes:
- A custom analyzer is set in the index
PROPERTIESthroughanalyzer. - The only property that can be used together with
analyzerissupport_phrase.
Multiple Analyzer Indexes on a Single Column
Doris supports creating multiple inverted indexes that use different tokenizers on the same column, so that the same data can be retrieved with different tokenization strategies.
Use Cases
- Multilingual support: use tokenizers for different languages on the same text column.
- Balancing search precision and recall: use a keyword tokenizer for exact matching and a standard tokenizer for fuzzy search.
- Auto-completion: use an edge_ngram tokenizer for prefix matching and a standard tokenizer for regular search.
Creating Multiple Indexes
-- 1. Create tokenizers with different tokenization strategies
CREATE INVERTED INDEX ANALYZER IF NOT EXISTS std_analyzer
PROPERTIES ("tokenizer" = "standard", "token_filter" = "lowercase");
CREATE INVERTED INDEX ANALYZER IF NOT EXISTS kw_analyzer
PROPERTIES ("tokenizer" = "keyword", "token_filter" = "lowercase");
CREATE INVERTED INDEX TOKENIZER IF NOT EXISTS edge_ngram_tokenizer
PROPERTIES (
"type" = "edge_ngram",
"min_gram" = "1",
"max_gram" = "20",
"token_chars" = "letter"
);
CREATE INVERTED INDEX ANALYZER IF NOT EXISTS ngram_analyzer
PROPERTIES ("tokenizer" = "edge_ngram_tokenizer", "token_filter" = "lowercase");
-- 2. Create multiple indexes on the same column
CREATE TABLE articles (
id INT,
content TEXT,
-- Standard tokenizer for tokenized search
INDEX idx_content_std (content) USING INVERTED
PROPERTIES("analyzer" = "std_analyzer", "support_phrase" = "true"),
-- Keyword tokenizer for exact matching
INDEX idx_content_kw (content) USING INVERTED
PROPERTIES("analyzer" = "kw_analyzer"),
-- edge n-gram tokenizer for auto-completion
INDEX idx_content_ngram (content) USING INVERTED
PROPERTIES("analyzer" = "ngram_analyzer")
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 1
PROPERTIES ("replication_allocation" = "tag.location.default: 1");
Specifying an Analyzer at Query Time
Use the USING ANALYZER clause to specify which index to use:
-- Insert test data
INSERT INTO articles VALUES
(1, 'hello world'),
(2, 'hello'),
(3, 'world'),
(4, 'hello world test');
-- Tokenized search: matches rows containing the term 'hello'
-- Returns: 1, 2, 4
SELECT id FROM articles WHERE content MATCH 'hello' USING ANALYZER std_analyzer ORDER BY id;
-- Exact match: matches only the exact 'hello' string
-- Returns: 2
SELECT id FROM articles WHERE content MATCH 'hello' USING ANALYZER kw_analyzer ORDER BY id;
-- Use edge n-gram for prefix matching
-- Returns: 1, 2, 4 (all rows starting with 'hel')
SELECT id FROM articles WHERE content MATCH 'hel' USING ANALYZER ngram_analyzer ORDER BY id;
You can also use built-in tokenizers directly:
SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER standard;
SELECT * FROM articles WHERE content MATCH 'hello' USING ANALYZER none;
SELECT * FROM articles WHERE content MATCH 'Hello' USING ANALYZER chinese;
Adding an Index to an Existing Table
-- Add a new index that uses a different tokenizer
ALTER TABLE articles ADD INDEX idx_content_chinese (content)
USING INVERTED PROPERTIES("parser" = "chinese");
-- Wait for the schema change to complete
SHOW ALTER TABLE COLUMN WHERE TableName='articles';
Building an Index
After adding an index, you must build the index for existing data:
-- Build a specific index (non-cloud mode)
BUILD INDEX idx_content_chinese ON articles;
-- Build all indexes (cloud mode)
BUILD INDEX ON articles;
-- Check the build progress
SHOW BUILD INDEX WHERE TableName='articles';
Key Notes
- Tokenizer identity: two tokenizers with the same
tokenizerandtoken_filterconfiguration are considered identical. You cannot create multiple indexes on the same column that share the same tokenizer identity. - Index selection behavior:
- When
USING ANALYZERis specified, the index for the specified tokenizer is used if it exists and has been built. - If the index is not built, the query falls back to the non-index path (results are correct, but performance is slower).
- When
USING ANALYZERis not specified, any available index may be used.
- When
- Performance considerations:
- Each additional index increases storage space and write overhead.
- Choose tokenizers based on your actual query patterns.
- If your query patterns are predictable, consider using fewer indexes.
Management and Maintenance
View
SHOW INVERTED INDEX TOKENIZER;
SHOW INVERTED INDEX TOKEN_FILTER;
SHOW INVERTED INDEX ANALYZER;
Delete
DROP INVERTED INDEX TOKENIZER IF EXISTS x_tokenizer;
DROP INVERTED INDEX TOKEN_FILTER IF EXISTS x_token_filter;
DROP INVERTED INDEX ANALYZER IF EXISTS x_analyzer;
Complete Examples
Example 1: Phone Number Prefix Matching
Use edge_ngram to generate all prefix fragments of a phone number, enabling search-as-you-type.
CREATE INVERTED INDEX TOKENIZER IF NOT EXISTS edge_ngram_phone_number_tokenizer
PROPERTIES
(
"type" = "edge_ngram",
"min_gram" = "3",
"max_gram" = "10",
"token_chars" = "digit"
);
CREATE INVERTED INDEX ANALYZER IF NOT EXISTS edge_ngram_phone_number
PROPERTIES
(
"tokenizer" = "edge_ngram_phone_number_tokenizer"
);
CREATE TABLE tbl (
`a` bigint NOT NULL AUTO_INCREMENT(1),
`ch` text NULL,
INDEX idx_ch (`ch`) USING INVERTED PROPERTIES("support_phrase" = "true", "analyzer" = "edge_ngram_phone_number")
) ENGINE=OLAP
DUPLICATE KEY(`a`)
DISTRIBUTED BY RANDOM BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);
SELECT tokenize('13891972631', '"analyzer"="edge_ngram_phone_number"');
Result:
[
{"token":"138"},
{"token":"1389"},
{"token":"13891"},
{"token":"138919"},
{"token":"1389197"},
{"token":"13891972"},
{"token":"138919726"},
{"token":"1389197263"}
]
Example 2: Fine-Grained Tokenization for Complex English Text
Use the standard tokenizer together with word_delimiter for finer-grained tokenization, plus case normalization and ASCII folding.
CREATE INVERTED INDEX TOKEN_FILTER IF NOT EXISTS word_splitter
PROPERTIES
(
"type" = "word_delimiter",
"split_on_numerics" = "false",
"split_on_case_change" = "false"
);
CREATE INVERTED INDEX ANALYZER IF NOT EXISTS lowercase_delimited
PROPERTIES
(
"tokenizer" = "standard",
"token_filter" = "asciifolding, word_splitter, lowercase"
);
CREATE TABLE tbl (
`a` bigint NOT NULL AUTO_INCREMENT(1),
`ch` text NULL,
INDEX idx_ch (`ch`) USING INVERTED PROPERTIES("support_phrase" = "true", "analyzer" = "lowercase_delimited")
) ENGINE=OLAP
DUPLICATE KEY(`a`)
DISTRIBUTED BY RANDOM BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);
SELECT tokenize('The server at IP 192.168.1.15 sent a confirmation to user_123@example.com, requiring a quickResponse before the deadline.', '"analyzer"="lowercase_delimited"');
Result:
[
{"token":"the"},
{"token":"server"},
{"token":"at"},
{"token":"ip"},
{"token":"192"},
{"token":"168"},
{"token":"1"},
{"token":"15"},
{"token":"sent"},
{"token":"a"},
{"token":"confirmation"},
{"token":"to"},
{"token":"user"},
{"token":"123"},
{"token":"example"},
{"token":"com"},
{"token":"requiring"},
{"token":"a"},
{"token":"quickresponse"},
{"token":"before"},
{"token":"the"},
{"token":"deadline"}
]
Example 3: Exact Match with Original Term Preserved
Use the keyword tokenizer to keep the entire text intact, then apply lowercase and asciifolding for normalization. This is commonly used for case-insensitive exact matching of strings.
CREATE INVERTED INDEX ANALYZER IF NOT EXISTS keyword_lowercase
PROPERTIES
(
"tokenizer" = "keyword",
"token_filter" = "asciifolding, lowercase"
);
CREATE TABLE tbl (
`a` bigint NOT NULL AUTO_INCREMENT(1),
`ch` text NULL,
INDEX idx_ch (`ch`) USING INVERTED PROPERTIES("support_phrase" = "true", "analyzer" = "keyword_lowercase")
) ENGINE=OLAP
DUPLICATE KEY(`a`)
DISTRIBUTED BY RANDOM BUCKETS 1
PROPERTIES (
"replication_allocation" = "tag.location.default: 1"
);
SELECT tokenize('hÉllo World', '"analyzer"="keyword_lowercase"');
Result:
[
{"token":"hello world"}
]
Limitations
- The
typeand parameters intokenizerandtoken_filtercan only use currently supported tokenizers and token filters; otherwise, table creation fails. - An
analyzercan be dropped only when no table is using it. - A
tokenizerortoken_filtercan be dropped only when noanalyzeris using it. - Custom analyzer DDL is synchronized to the BE 10 seconds after execution; subsequent imports do not produce errors.
Notes
- Nesting multiple components in a custom
analyzermay degrade tokenization performance. - The
select tokenizetokenization function supports custom analyzers and can be used to debug tokenization results. - Only one of the predefined
built_in_analyzerand a customanalyzercan exist on the same index.