Skip to main content

VARIANT

VARIANT

Overview

The VARIANT type stores semi-structured JSON data. It can contain different primitive types (integers, strings, booleans, etc.), one-dimensional arrays, and nested objects. On write, Doris infers the structure and type of sub-paths based on JSON paths and materializes frequent paths as independent subcolumns, leveraging columnar storage and vectorized execution for both flexibility and performance.

Using VARIANT

Create table syntax

Declare a VARIANT column when creating a table:

CREATE TABLE IF NOT EXISTS ${table_name} (
k BIGINT,
v VARIANT
)
PROPERTIES("replication_num" = "1");

Constrain certain paths with a Schema Template (see “Extended types”):

CREATE TABLE IF NOT EXISTS ${table_name} (
k BIGINT,
v VARIANT <
'id' : INT, -- restrict path id to INT
'message*' : STRING, -- restrict message* prefix to STRING
'tags*' : ARRAY<TEXT> -- restrict tags* prefix to ARRAY<TEXT>
>
)
PROPERTIES("replication_num" = "1");

Query syntax

-- Access nested fields (returns VARIANT; explicit or implicit CAST is required for aggregation/comparison)
SELECT v['properties']['title'] FROM ${table_name};

-- CAST to a concrete type before aggregation
SELECT CAST(v['properties']['title'] AS STRING) AS title
FROM ${table_name}
GROUP BY title;

-- Query arrays
SELECT *
FROM ${table_name}
WHERE ARRAY_CONTAINS(CAST(v['tags'] AS ARRAY<TEXT>), 'Doris');

Primitive types

VARIANT infers subcolumn types automatically. Supported types include:

Supported types
TinyInt
NULL (equivalent to JSON null)
BigInt (64 bit)
Double
String (Text)
Jsonb
Variant (nested object)
Array<T> (one-dimensional only)

Simple INSERT example:

INSERT INTO vartab VALUES
(1, 'null'),
(2, NULL),
(3, 'true'),
(4, '-17'),
(5, '123.12'),
(6, '1.912'),
(7, '"A quote"'),
(8, '[-1, 12, false]'),
(9, '{ "x": "abc", "y": false, "z": 10 }'),
(10, '"2021-01-01"');

Tip: Non-standard JSON types such as date/time will be stored as strings unless a Schema Template is provided. For better computation efficiency, consider extracting them to static columns or declaring their types via a Schema Template.

Extended types (Schema Template)

Besides primitive types, VARIANT supports the following extended types via Schema Template:

  • Number (extended)
    • Decimal: Decimal32 / Decimal64 / Decimal128 / Decimal256
    • LargeInt
  • Datetime
  • Date
  • IPV4 / IPV6
  • Boolean
  • ARRAY<T> (T can be any of the above, one-dimensional only)

Note: Predefined Schema can only be specified at table creation. ALTER is currently not supported (future versions may support adding new subcolumn definitions, but changing an existing subcolumn type is not supported).

Example:

CREATE TABLE test_var_schema (
id BIGINT NOT NULL,
v1 VARIANT<
'large_int_val': LARGEINT,
'string_val': STRING,
'decimal_val': DECIMAL(38, 9),
'datetime_val': DATETIME,
'ip_val': IPV4
> NULL
)
PROPERTIES ("replication_num" = "1");

INSERT INTO test_var_schema VALUES (1, '{
"large_int_val" : "123222222222222222222222",
"string_val" : "Hello World",
"decimal_val" : 1.11111111,
"datetime_val" : "2025-05-16 11:11:11",
"ip_val" : "127.0.0.1"
}');

SELECT variant_type(v1) FROM test_var_schema;

+----------------------------------------------------------------------------------------------------------------------------+
| variant_type(v1) |
+----------------------------------------------------------------------------------------------------------------------------+
| {"datetime_val":"datetimev2","decimal_val":"decimal128i","ip_val":"ipv4","large_int_val":"largeint","string_val":"string"} |
+----------------------------------------------------------------------------------------------------------------------------+

{"date": 2020-01-01} and {"ip": 127.0.0.1} are invalid JSON texts; the correct format is {"date": "2020-01-01"} and {"ip": "127.0.0.1"}.

Once a Schema Template is specified, if a JSON value conflicts with the declared type and cannot be converted, it will be stored as NULL. For example:

INSERT INTO test_var_schema VALUES (1, '{
"decimal_val" : "1.11111111",
"ip_val" : "127.xxxxxx.xxxx",
"large_int_val" : "aaabbccc"
}');

-- Only decimal_val remains
SELECT * FROM test_var_schema;

+------+-----------------------------+
| id | v1 |
+------+-----------------------------+
| 1 | {"decimal_val":1.111111110} |
+------+-----------------------------+

Schema only guides the persisted storage type. During query execution, the effective type depends on actual data at runtime:

-- At runtime v['a'] may still be STRING
SELECT variant_type(CAST('{"a" : "12345"}' AS VARIANT<'a' : INT>)['a']);

Wildcard matching and order:

CREATE TABLE test_var_schema (
id BIGINT NOT NULL,
v1 VARIANT<
'enumString*' : STRING,
'enum*' : ARRAY<TEXT>,
'ip*' : IPV6
> NULL
)
PROPERTIES ("replication_num" = "1");

-- If enumString1 matches both patterns, the first matching pattern in definition order (STRING) is used

If a column name contains * and you want to match it by its literal name (not as a prefix wildcard), use:

v1 VARIANT<
MATCH_NAME 'enumString*' : STRING
> NULL

Matched subpaths are materialized as columns by default. If too many paths match and generate excessive columns, consider enabling variant_enable_typed_paths_to_sparse (see “Configuration”).

Type conflicts and promotion rules

When incompatible types appear on the same path (e.g., the same field shows up as both integer and string), the type is promoted to JSONB to avoid information loss:

{"a" : 12345678}
{"a" : "HelloWorld"}
-- a will be promoted to JSONB

Promotion rules:

Source typeCurrent typeFinal type
TinyIntBigIntBigInt
TinyIntDoubleDouble
TinyIntStringJSONB
TinyIntArrayJSONB
BigIntDoubleJSONB
BigIntStringJSONB
BigIntArrayJSONB
DoubleStringJSONB
DoubleArrayJSONB
Array<Double>Array<String>Array<Jsonb>

If you need strict types (for stable indexing and storage), declare them via Schema Template.

Variant indexes

Choosing indexes

VARIANT supports BloomFilter and Inverted Index on subpaths.

  • High-cardinality equality/IN filters: prefer BloomFilter (sparser index, better write performance).
  • Tokenization/phrase/range search: use Inverted Index and set proper parser/analyzer properties.
...  
PROPERTIES("replication_num" = "1", "bloom_filter_columns" = "v");

-- Use BloomFilter for equality/IN filters
SELECT * FROM tbl WHERE v['id'] = 12345678;
SELECT * FROM tbl WHERE v['id'] IN (1, 2, 3);

Once an inverted index is created on a VARIANT column, all subpaths inherit the same index properties (e.g., parser):

CREATE TABLE IF NOT EXISTS tbl (
k BIGINT,
v VARIANT,
INDEX idx_v(v) USING INVERTED PROPERTIES("parser" = "english")
);

-- All subpaths inherit the english parser
SELECT * FROM tbl WHERE v['id_1'] MATCH 'Doris';
SELECT * FROM tbl WHERE v['id_2'] MATCH 'Apache';

Index by subpath

In 3.1.x/4.0 and later, you can specify index properties for certain VARIANT subpaths, and even configure both tokenized and non-tokenized inverted indexes for the same path. Path-specific indexes require the path type to be declared via Schema Template.

-- Common properties: field_pattern (target path), analyzer, parser, support_phrase, etc.
CREATE TABLE IF NOT EXISTS tbl (
k BIGINT,
v VARIANT<'content' : STRING>,
INDEX idx_tokenized(v) USING INVERTED PROPERTIES("parser" = "english", "field_pattern" = "content"),
INDEX idx_v(v) USING INVERTED PROPERTIES("field_pattern" = "content")
);

-- v.content has both tokenized and non-tokenized inverted indexes
SELECT * FROM tbl WHERE v['content'] MATCH 'Doris';
SELECT * FROM tbl WHERE v['content'] = 'Doris';

Wildcard path indexing:

CREATE TABLE IF NOT EXISTS tbl (
k BIGINT,
v VARIANT<'pattern_*' : STRING>,
INDEX idx_tokenized(v) USING INVERTED PROPERTIES("parser" = "english", "field_pattern" = "pattern_*"),
INDEX idx_v(v) USING INVERTED -- global non-tokenized inverted index
);

SELECT * FROM tbl WHERE v['pattern_1'] MATCH 'Doris';
SELECT * FROM tbl WHERE v['pattern_1'] = 'Doris';

Note: 2.1.7+ supports only InvertedIndex V2 properties (fewer files, lower write IOPS; suitable for disaggregated storage/compute). 2.1.8+ removes offline Build Index.

When indexes don’t work

  1. Type changes cause index loss: if a subpath changes to an incompatible type (e.g., INT → JSONB), the index is lost. Fix by pinning types and indexes via Schema Template.
  2. Query type mismatch:
    -- v['id'] is actually STRING; using INT equality causes index not to be used
    SELECT * FROM tbl WHERE v['id'] = 123456;
  3. Misconfigured index: indexes apply to subpaths, not the entire VARIANT column.
    -- VARIANT itself cannot be indexed as a whole
    SELECT * FROM tbl WHERE v MATCH 'Doris';

    -- If whole-JSON search is needed, store a duplicate STRING column and index it
    CREATE TABLE IF NOT EXISTS tbl (
    k BIGINT,
    v VARIANT,
    v_str STRING,
    INDEX idx_v_str(v_str) USING INVERTED PROPERTIES("parser" = "english")
    );
    SELECT * FROM tbl WHERE v_str MATCH 'Doris';

INSERT and load

INSERT INTO VALUES

CREATE TABLE IF NOT EXISTS variant_tbl (
k BIGINT,
v VARIANT
) PROPERTIES("replication_num" = "1");

INSERT INTO variant_tbl VALUES (1, '{"a" : 123}');

select * from variant_tbl;
+------+-----------+
| k | v |
+------+-----------+
| 1 | {"a":123} |
+------+-----------+

-- v['a'] is a VARIANT
select v['a'] from variant_tbl;
+--------+
| v['a'] |
+--------+
| 123 |
+--------+

-- Accessing a non-existent key returns NULL
select v['a']['no_such_key'] from variant_tbl;;
+-----------------------+
| v['a']['no_such_key'] |
+-----------------------+
| NULL |
+-----------------------+

Load (Stream Load)

# Line-delimited JSON (one JSON record per line)
curl --location-trusted -u root: -T gh_2022-11-07-3.json \
-H "read_json_by_line:true" -H "format:json" \
http://127.0.0.1:8030/api/test_variant/github_events/_stream_load

See also: https://doris.apache.org/docs/dev/data-operate/import/complex-types/variant

After loading, verify with SELECT count(*) or sample with SELECT * ... LIMIT 1. For high-throughput ingestion, prefer RANDOM bucketing and enable Group Commit.

Supported operations and CAST rules

  • VARIANT cannot be compared/operated directly with other types; comparisons between two VARIANTs are not supported either.
  • For comparison, filtering, aggregation, and ordering, CAST subpaths to concrete types (explicitly or implicitly).
-- Explicit CAST
SELECT CAST(v['arr'] AS ARRAY<TEXT>) FROM tbl;
SELECT * FROM tbl WHERE CAST(v['decimal'] AS DECIMAL(27, 9)) = 1.111111111;
SELECT * FROM tbl WHERE CAST(v['date'] AS DATE) = '2021-01-02';

-- Implicit CAST
SELECT * FROM tbl WHERE v['bool'];
SELECT * FROM tbl WHERE v['str'] MATCH 'Doris';
  • VARIANT itself cannot be used directly in ORDER BY, GROUP BY, as a JOIN KEY, or as an aggregate argument; CAST subpaths instead.
  • Strings can be implicitly converted to VARIANT.
VARIANTCastableCoercibleConversion Function
ARRAY
BOOLEAN
DATE/DATETIME
FLOAT
IPV4/IPV6
DECIMAL
MAP
TIMESTAMP
VARCHARPARSE_TO_JSON
JSON

Limitations

  • variant_max_subcolumns_count: default 0 (no limit). In production, set to 2048 (tablet level) to control the number of materialized paths. Above the threshold, low-frequency/sparse paths are moved to a shared data structure; reading from it may be slower (see “Configuration”).
  • If a path type is specified via Schema Template, that path will be forced to be materialized; when variant_enable_typed_paths_to_sparse = true, it also counts toward the threshold and may be moved to the shared structure.
  • JSON key length ≤ 255.
  • Cannot be a primary key or sort key.
  • Cannot be nested within other types (e.g., Array<Variant>, Struct<Variant>).
  • Reading the entire VARIANT column scans all subpaths. If a column has many subpaths, consider storing the original JSON string in an extra STRING/JSONB column for whole-object searches like LIKE:
CREATE TABLE example_table (
id INT,
data_variant VARIANT
);
SELECT * FROM example_table WHERE data_variant LIKE '%doris%';

-- Better: keep the original JSON string for whole-object matching
CREATE TABLE example_table (
id INT,
data_string STRING,
data_variant VARIANT
);
SELECT * FROM example_table WHERE data_string LIKE '%doris%';

Configuration

Starting from 3.1+, VARIANT supports type-level properties on columns:

CREATE TABLE example_table (
id INT,
data_variant VARIANT<
'path_1' : INT,
'path_2' : STRING,
properties(
'variant_max_subcolumns_count' = '2048',
'variant_enable_typed_paths_to_sparse' = 'true'
)
>
);
Property
Description
`variant_max_subcolumns_count`
Max number of materialized paths. Above the threshold, new paths may be stored in a shared data structure. Default 0 (unlimited). Recommended 2048; do not exceed 10000.
`variant_enable_typed_paths_to_sparse`
By default, typed paths are always materialized (and do not count against `variant_max_subcolumns_count`). When set to `true`, typed paths also count toward the threshold and may be moved to the shared structure.

Behavior at limits and tuning suggestions:

  1. After exceeding the threshold, new paths are written into the shared structure; Rowset merges may also recycle some paths into the shared structure.
  2. The system prefers to keep paths with higher non-null ratios and higher access frequencies materialized.
  3. Close to 10,000 materialized paths requires strong hardware (≥128G RAM, ≥32C per node recommended).
  4. Ingestion tuning: increase client batch_size appropriately, or use Group Commit (increase group_commit_interval_ms/group_commit_data_bytes as needed).
  5. If partition pruning is not needed, consider RANDOM bucketing and enabling single-tablet loading to reduce compaction write amplification.
  6. BE tuning knobs: max_cumu_compaction_threads (≥8), vertical_compaction_num_columns_per_group=500 (improves vertical compaction but increases memory), segment_cache_memory_percentage=20 (improves metadata cache efficiency).
  7. Watch Compaction Score; if it keeps rising, compaction is lagging—reduce ingestion pressure.
  8. Avoid large SELECT * on VARIANT; prefer specific projections like SELECT v['path'].

Note: If you see Stream Load error [DATA_QUALITY_ERROR]Reached max column size limit 2048 (only on 2.1.x and 3.0.x), it means the merged tablet schema reached its column limit. You may increase variant_max_merged_tablet_schema_size (not recommended beyond 4096; requires strong hardware).

Inspect number of columns and types

Approach 1: use variant_type to inspect per-row schema (more precise, higher cost):

SELECT variant_type(v) FROM variant_tbl;

Approach 2: extended DESC to show materialized subpaths (only those extracted):

SET describe_extend_variant_column = true;
DESC variant_tbl;

Use both: Approach 1 is precise; Approach 2 is efficient.

Compared with JSON type

  • Storage: JSON is stored as JSONB (row-oriented). VARIANT is inferred and materialized into columns on write (higher compression, smaller size).
  • Query: JSON requires parsing. VARIANT scans columns directly and is usually much faster.

ClickBench (43 queries):

  • Storage: VARIANT saves ~65% vs JSON.
  • Query: VARIANT is 8x+ faster than JSON, close to predefined static columns.

Storage space

TypeSize
Predefined columns12.618 GB
VARIANT12.718 GB
JSON35.711 GB

~65% space savings

RunPredefinedVARIANTJSON
First (cold)233.79s248.66sMost timed out
Second (hot)86.02s94.82s789.24s
Third (hot)83.03s92.29s743.69s

FAQ

  1. Are null in VARIANT and SQL NULL different?
    • No. They are equivalent.
  2. Why doesn’t my query/index work?
    • Check whether you CAST paths to the correct types; whether the type was promoted to JSONB due to conflicts; or whether you mistakenly expect an index on the whole VARIANT instead of on subpaths.