Skip to main content

VARIANT

VARIANT

Overview

The VARIANT type stores semi-structured JSON data. It can contain different primitive types (integers, strings, booleans, etc.), one-dimensional arrays, and nested objects. On write, Doris infers the structure and type of sub-paths based on JSON paths and performs Subcolumnization on frequent paths, exposing them as independent columnar subcolumns for both flexibility and performance.

Why VARIANT

VARIANT is a good fit when document shape changes over time but queries still focus on a small set of hot paths.

  • Hot paths participate in Subcolumnization, so they benefit from columnar performance, file pruning, and vectorized execution.
  • Key paths can use path-level indexes, full-text search, and still benefit from Doris sparse-index pruning.
  • Wide-column optimizations keep automatic Subcolumnization practical at 10k-scale subcolumns. When paths participating in Subcolumnization approach 10,000, hardware requirements rise quickly, so DOC mode is usually the safer starting point.

If you still need to choose between default behavior, sparse columns, DOC mode, and Schema Template, start with Variant Workload Guide. This page is the reference for syntax, type rules, indexes, limits, and configuration.

Using VARIANT

Create table syntax

Declare a VARIANT column when creating a table:

CREATE TABLE IF NOT EXISTS ${table_name} (
k BIGINT,
v VARIANT
)
PROPERTIES("replication_num" = "1");

Constrain certain paths with a Schema Template (see “Extended types”):

CREATE TABLE IF NOT EXISTS ${table_name} (
k BIGINT,
v VARIANT <
'id' : INT, -- restrict path id to INT
'message*' : STRING, -- restrict message* prefix to STRING
'tags*' : ARRAY<TEXT> -- restrict tags* prefix to ARRAY<TEXT>
>
)
PROPERTIES("replication_num" = "1");

Query syntax

-- Access nested fields (returns VARIANT; explicit or implicit CAST is required for aggregation/comparison)
SELECT v['properties']['title'] FROM ${table_name};

-- CAST to a concrete type before aggregation
SELECT CAST(v['properties']['title'] AS STRING) AS title
FROM ${table_name}
GROUP BY title;

-- Query arrays
SELECT *
FROM ${table_name}
WHERE ARRAY_CONTAINS(CAST(v['tags'] AS ARRAY<TEXT>), 'Doris');

Primitive types

VARIANT infers subcolumn types automatically. Supported types include:

Supported types
TinyInt
NULL (equivalent to JSON null)
BigInt (64 bit)
Double
String (Text)
Jsonb
Variant (nested object)
Array<T> (one-dimensional only)

Simple INSERT example:

INSERT INTO vartab VALUES
(1, 'null'),
(2, NULL),
(3, 'true'),
(4, '-17'),
(5, '123.12'),
(6, '1.912'),
(7, '"A quote"'),
(8, '[-1, 12, false]'),
(9, '{ "x": "abc", "y": false, "z": 10 }'),
(10, '"2021-01-01"');

Tip: Non-standard JSON types such as date/time will be stored as strings unless a Schema Template is provided. For better computation efficiency, consider extracting them to static columns or declaring their types via a Schema Template.

Extended types (Schema Template)

Besides primitive types, VARIANT supports the following extended types via Schema Template:

  • Number (extended)
    • Decimal: Decimal32 / Decimal64 / Decimal128 / Decimal256
    • LargeInt
  • Datetime
  • Timestamptz
  • Date
  • IPV4 / IPV6
  • Boolean
  • ARRAY<T> (T can be any of the above, one-dimensional only)

Note: Predefined Schema can only be specified at table creation. ALTER is currently not supported (future versions may support adding new subcolumn definitions, but changing an existing subcolumn type is not supported).

Example:

CREATE TABLE test_var_schema (
id BIGINT NOT NULL,
v1 VARIANT<
'large_int_val': LARGEINT,
'string_val': STRING,
'decimal_val': DECIMAL(38, 9),
'datetime_val': DATETIME,
'tz_val': TIMESTAMPTZ,
'ip_val': IPV4
> NULL
)
PROPERTIES ("replication_num" = "1");

INSERT INTO test_var_schema VALUES (1, '{
"large_int_val" : "123222222222222222222222",
"string_val" : "Hello World",
"decimal_val" : 1.11111111,
"datetime_val" : "2025-05-16 11:11:11",
"tz_val" : "2025-05-16 11:11:11+08:00",
"ip_val" : "127.0.0.1"
}');

SELECT variant_type(v1) FROM test_var_schema;

+---------------------------------------------------------------------------------------------------------------------------------------------------+
| variant_type(v1) |
+---------------------------------------------------------------------------------------------------------------------------------------------------+
| {"datetime_val":"datetimev2","decimal_val":"decimal128i","ip_val":"ipv4","large_int_val":"largeint","string_val":"string","tz_val":"timestamptz"} |
+---------------------------------------------------------------------------------------------------------------------------------------------------+

{"date": 2020-01-01} and {"ip": 127.0.0.1} are invalid JSON texts; the correct format is {"date": "2020-01-01"} and {"ip": "127.0.0.1"}.

Once a Schema Template is specified, if a JSON value conflicts with the declared type and cannot be converted, it will be stored as NULL. For example:

INSERT INTO test_var_schema VALUES (1, '{
"decimal_val" : "1.11111111",
"ip_val" : "127.xxxxxx.xxxx",
"large_int_val" : "aaabbccc"
}');

-- Only decimal_val remains
SELECT * FROM test_var_schema;

+------+-----------------------------+
| id | v1 |
+------+-----------------------------+
| 1 | {"decimal_val":1.111111110} |
+------+-----------------------------+

Schema only guides the persisted storage type. During query execution, the effective type depends on actual data at runtime:

-- At runtime v['a'] may still be STRING
SELECT variant_type(CAST('{"a" : "12345"}' AS VARIANT<'a' : INT>)['a']);

Wildcard matching and order:

CREATE TABLE test_var_schema (
id BIGINT NOT NULL,
v1 VARIANT<
'enumString*' : STRING,
'enum*' : ARRAY<TEXT>,
'ip*' : IPV6
> NULL
)
PROPERTIES ("replication_num" = "1");

-- If enumString1 matches both patterns, the first matching pattern in definition order (STRING) is used

If a column name contains * and you want to match it by its literal name (not as a prefix wildcard), use:

v1 VARIANT<
MATCH_NAME 'enumString*' : STRING
> NULL

Matched subpaths participate in Subcolumnization by default and are exposed as columns. If too many paths match and generate excessive columns, consider enabling variant_enable_typed_paths_to_sparse (see “Configuration”).

Type conflicts and promotion rules

When incompatible types appear on the same path (e.g., the same field shows up as both integer and string), the type is promoted to JSONB to avoid information loss:

{"a" : 12345678}
{"a" : "HelloWorld"}
-- a will be promoted to JSONB

Promotion rules:

Source typeCurrent typeFinal type
TinyIntBigIntBigInt
TinyIntDoubleDouble
TinyIntStringJSONB
TinyIntArrayJSONB
BigIntDoubleJSONB
BigIntStringJSONB
BigIntArrayJSONB
DoubleStringJSONB
DoubleArrayJSONB
Array<Double>Array<String>Array<Jsonb>

If you need strict types (for stable indexing and storage), declare them via Schema Template.

Variant indexes

Choosing indexes

VARIANT supports BloomFilter and Inverted Index on subpaths.

  • High-cardinality equality/IN filters: prefer BloomFilter (sparser index, better write performance).
  • Tokenization/phrase/range search: use Inverted Index and set proper parser/analyzer properties.
...  
PROPERTIES("replication_num" = "1", "bloom_filter_columns" = "v");

-- Use BloomFilter for equality/IN filters
SELECT * FROM tbl WHERE v['id'] = 12345678;
SELECT * FROM tbl WHERE v['id'] IN (1, 2, 3);

Once an inverted index is created on a VARIANT column, all subpaths inherit the same index properties (e.g., parser):

CREATE TABLE IF NOT EXISTS tbl (
k BIGINT,
v VARIANT,
INDEX idx_v(v) USING INVERTED PROPERTIES("parser" = "english")
);

-- All subpaths inherit the english parser
SELECT * FROM tbl WHERE v['id_1'] MATCH 'Doris';
SELECT * FROM tbl WHERE v['id_2'] MATCH 'Apache';

Index by subpath

In 3.1.x/4.0 and later, you can specify index properties for certain VARIANT subpaths, and even configure both tokenized and non-tokenized inverted indexes for the same path. Path-specific indexes require the path type to be declared via Schema Template.

-- Common properties: field_pattern (target path), analyzer, parser, support_phrase, etc.
CREATE TABLE IF NOT EXISTS tbl (
k BIGINT,
v VARIANT<'content' : STRING>,
INDEX idx_tokenized(v) USING INVERTED PROPERTIES("parser" = "english", "field_pattern" = "content"),
INDEX idx_v(v) USING INVERTED PROPERTIES("field_pattern" = "content")
);

-- v.content has both tokenized and non-tokenized inverted indexes
SELECT * FROM tbl WHERE v['content'] MATCH 'Doris';
SELECT * FROM tbl WHERE v['content'] = 'Doris';

Wildcard path indexing:

CREATE TABLE IF NOT EXISTS tbl (
k BIGINT,
v VARIANT<'pattern_*' : STRING>,
INDEX idx_tokenized(v) USING INVERTED PROPERTIES("parser" = "english", "field_pattern" = "pattern_*"),
INDEX idx_v(v) USING INVERTED -- global non-tokenized inverted index
);

SELECT * FROM tbl WHERE v['pattern_1'] MATCH 'Doris';
SELECT * FROM tbl WHERE v['pattern_1'] = 'Doris';

Note: 2.1.7+ supports only InvertedIndex V2 properties (fewer files, lower write IOPS; suitable for disaggregated storage/compute). 2.1.8+ removes offline Build Index.

When indexes don’t work

  1. Type changes cause index loss: if a subpath changes to an incompatible type (e.g., INT → JSONB), the index is lost. Fix by pinning types and indexes via Schema Template.
  2. Query type mismatch:
    -- v['id'] is actually STRING; using INT equality causes index not to be used
    SELECT * FROM tbl WHERE v['id'] = 123456;
  3. Misconfigured index: indexes apply to subpaths, not the entire VARIANT column.
    -- VARIANT itself cannot be indexed as a whole
    SELECT * FROM tbl WHERE v MATCH 'Doris';

    -- If whole-JSON search is needed, store a duplicate STRING column and index it
    CREATE TABLE IF NOT EXISTS tbl (
    k BIGINT,
    v VARIANT,
    v_str STRING,
    INDEX idx_v_str(v_str) USING INVERTED PROPERTIES("parser" = "english")
    );
    SELECT * FROM tbl WHERE v_str MATCH 'Doris';

INSERT and load

INSERT INTO VALUES

CREATE TABLE IF NOT EXISTS variant_tbl (
k BIGINT,
v VARIANT
) PROPERTIES("replication_num" = "1");

INSERT INTO variant_tbl VALUES (1, '{"a" : 123}');

select * from variant_tbl;
+------+-----------+
| k | v |
+------+-----------+
| 1 | {"a":123} |
+------+-----------+

-- v['a'] is a VARIANT
select v['a'] from variant_tbl;
+--------+
| v['a'] |
+--------+
| 123 |
+--------+

-- Accessing a non-existent key returns NULL
select v['a']['no_such_key'] from variant_tbl;;
+-----------------------+
| v['a']['no_such_key'] |
+-----------------------+
| NULL |
+-----------------------+

Load (Stream Load)

# Line-delimited JSON (one JSON record per line)
curl --location-trusted -u root: -T gh_2022-11-07-3.json \
-H "read_json_by_line:true" -H "format:json" \
http://127.0.0.1:8030/api/test_variant/github_events/_stream_load

See also: https://doris.apache.org/docs/dev/data-operate/import/complex-types/variant

After loading, verify with SELECT count(*) or sample with SELECT * ... LIMIT 1. For high-throughput ingestion, prefer RANDOM bucketing and enable Group Commit.

Supported operations and CAST rules

  • VARIANT cannot be compared/operated directly with other types; comparisons between two VARIANTs are not supported either.
  • For comparison, filtering, aggregation, and ordering, CAST subpaths to concrete types (explicitly or implicitly).
-- Explicit CAST
SELECT CAST(v['arr'] AS ARRAY<TEXT>) FROM tbl;
SELECT * FROM tbl WHERE CAST(v['decimal'] AS DECIMAL(27, 9)) = 1.111111111;
SELECT * FROM tbl WHERE CAST(v['date'] AS DATE) = '2021-01-02';

-- Implicit CAST
SELECT * FROM tbl WHERE v['bool'];
SELECT * FROM tbl WHERE v['str'] MATCH 'Doris';
  • VARIANT itself cannot be used directly in ORDER BY, GROUP BY, as a JOIN KEY, or as an aggregate argument; CAST subpaths instead.
  • Strings can be implicitly converted to VARIANT.
VARIANTCastableCoercible
ARRAY
BOOLEAN
DATE/DATETIME
FLOAT
IPV4/IPV6
DECIMAL
MAP
TIMESTAMP
VARCHAR
JSON

Wide columns

When ingested data contains many distinct JSON keys, the number of subcolumns produced by Subcolumnization can grow rapidly; at scale this may cause metadata bloat, higher write/merge cost, and query slowdowns. To address “wide columns” (too many subcolumns), VARIANT provides two mechanisms: Sparse columns and DOC encoding.

For workload selection guidance, see Variant Workload Guide. This section only explains the mechanisms and their related properties.

Note: these two mechanisms are mutually exclusive—enabling DOC encoding disables sparse columns, and vice versa.

Sparse columns

How it works

  • The system ranks paths by non-null ratio / sparsity: high-frequency (less-sparse) paths go through Subcolumnization and are stored as independent subcolumns; remaining low-frequency (sparse) paths are merged and stored in sparse columns. The maximum number of extracted subcolumns is controlled by variant_max_subcolumns_count.
  • If a path is declared in a Schema Template, by default it will not be moved into sparse columns; set variant_enable_typed_paths_to_sparse to allow typed paths to be moved into sparse columns.
  • Sparse columns support sharding: distribute sparse subpaths across multiple sparse columns to reduce per-column read overhead and improve read efficiency. Use variant_sparse_hash_shard_count to specify how many sparse columns are physically stored.

Reference notes

  • If most keys have similar non-null ratios (little sparsity contrast), it’s hard to identify truly sparse paths and the benefit of sparse columns is reduced.
  • variant_max_subcolumns_count defaults to 2048, which is already enough for most workloads. Avoid raising it aggressively just to pre-allocate more extracted subcolumns; if the workload truly needs large-scale Subcolumnization, prefer DOC mode. The practical upper bound is still recommended to stay ≤ 10000.
  • variant_sparse_hash_shard_count can be roughly estimated as “number of sparse paths / 128”. Example: total JSON keys ≈ 10,000, variant_max_subcolumns_count = 2000, then sparse paths ≈ 8000, so variant_sparse_hash_shard_count can start around 8000/128.

DOC encoding (DOC mode)

How it works

  • Paths can still go through Subcolumnization for path-based queries, and the original JSON is additionally stored as a stored field to return the full JSON document efficiently.
  • DOC encoding supports sharding: the original JSON is split into multiple columns for storage and reassembled when querying the full JSON. Use variant_doc_hash_shard_count to specify the number of DOC shards.
  • For small-batch writes, Subcolumnization can be skipped and deferred to later merges. This is controlled by variant_doc_materialization_min_rows. For example, if variant_doc_materialization_min_rows = 10000, writes below 10,000 rows will only store the original JSON and won’t trigger Subcolumnization for that batch.
  • For ultra-wide workloads, DOC mode is also the more stable choice when Subcolumnization scale approaches ten-thousand subcolumns. Compared with default eager Subcolumnization, compaction memory can drop by about two-thirds, and sparse wide-column ingest throughput can improve by about 5-10x.
  • When a VARIANT column is very wide and queries often read the whole document, DOC mode can improve SELECT variant_col performance by orders of magnitude compared with reconstructing the document from many subcolumns.

Reference notes

  • DOC mode requires variant_enable_doc_mode = true.
  • In DOC mode, typed paths declared via Schema Template are limited to numeric, string, and array types.
  • variant_doc_hash_shard_count can be roughly estimated as “total JSON keys / 128”.

See the “Configuration” section below for the full property list.

Limitations

  • Wide tables optimization: For wide tables with a large number of dynamic sub-columns (e.g., more than 2000 columns) generated by the VARIANT type, it is highly recommended to enable Storage Format V3 by specifying "storage_format" = "V3" in the table PROPERTIES. This decouples column metadata from the Segment Footer, speeding up file opening and reducing memory overhead.
  • JSON key length ≤ 255.
  • Cannot be a primary key or sort key.
  • Cannot be nested within other types (e.g., Array<Variant>, Struct<Variant>).
  • Outside DOC mode, reading the entire VARIANT column scans all subpaths. For very wide columns, direct SELECT variant_col is generally not recommended unless DOC mode is enabled. If a column has many subpaths, consider storing the original JSON string in an extra STRING/JSONB column for whole-object searches like LIKE:
CREATE TABLE example_table (
id INT,
data_variant VARIANT
);
SELECT * FROM example_table WHERE data_variant LIKE '%doris%';

-- Better: keep the original JSON string for whole-object matching
CREATE TABLE example_table (
id INT,
data_string STRING,
data_variant VARIANT
);
SELECT * FROM example_table WHERE data_string LIKE '%doris%';

Configuration

Starting from 3.1+, VARIANT supports type-level properties on columns:

CREATE TABLE example_table (
id INT,
data_variant VARIANT<
'path_1' : INT,
'path_2' : STRING,
properties(
'variant_max_subcolumns_count' = '2048',
'variant_enable_typed_paths_to_sparse' = 'true',
'variant_sparse_hash_shard_count' = '64'
)
>
);
PropertyDescription
`variant_max_subcolumns_count`Max number of paths that can go through Subcolumnization. Above the threshold, new paths may be stored in a shared data structure. Default: 2048 (Recommended), which is already enough for most workloads. Avoid setting it too large. If the workload truly needs very large extracted-subcolumn scale, prefer DOC mode. 0 means no limit; do not exceed 10000.
`variant_enable_typed_paths_to_sparse`By default, typed paths always participate in Subcolumnization (and do not count against `variant_max_subcolumns_count`). When set to `true`, typed paths also count toward the threshold and may be moved to the shared structure.
`variant_sparse_hash_shard_count`Shard count for sparse columns. Distributes sparse subpaths across multiple sparse columns to improve read performance. Default: 1; tune based on the number of sparse subpaths.
CREATE TABLE example_table (
id INT,
data_variant VARIANT<
'path_1' : INT,
'path_2' : STRING,
properties(
'variant_enable_doc_mode' = 'true',
'variant_doc_materialization_min_rows' = '10000',
'variant_doc_hash_shard_count' = '64'
)
>
);
PropertyDescription
`variant_enable_doc_mode`Enable DOC encoding mode. When `true`, the original JSON is stored as a stored field to quickly return the whole JSON document. DOC mode is mutually exclusive with sparse columns. Default: `false`.
`variant_doc_materialization_min_rows`Minimum row threshold to trigger Subcolumnization in DOC mode. When rows are below this value, only the original JSON is stored; after compaction merges files to reach the threshold, Subcolumnization is performed. Helps reduce overhead for small-batch writes.
`variant_doc_hash_shard_count`Shard count for DOC encoding. The original JSON is split into the specified number of columns for storage and reassembled when querying the whole JSON. Default: 64; tune based on JSON size and concurrency.

Behavior at limits and tuning suggestions:

  1. After exceeding the threshold, new paths are written into the shared structure; Rowset merges may also recycle some paths into the shared structure.
  2. The system prefers to keep paths with higher non-null ratios and higher access frequencies in Subcolumnization.
  3. Close to 10,000 paths in Subcolumnization requires strong hardware (≥128G RAM, ≥32C per node recommended). If the workload is already near this range, prefer evaluating DOC mode first.
  4. Ingestion tuning: increase client batch_size appropriately, or use Group Commit (increase group_commit_interval_ms/group_commit_data_bytes as needed).
  5. If partition pruning is not needed, consider RANDOM bucketing and enabling single-tablet loading to reduce compaction write amplification.
  6. BE tuning knobs: max_cumu_compaction_threads (≥8), vertical_compaction_num_columns_per_group=500 (improves vertical compaction but increases memory), segment_cache_memory_percentage=20 (improves metadata cache efficiency).
  7. Watch Compaction Score; if it keeps rising, compaction is lagging—reduce ingestion pressure.
  8. Avoid large SELECT * on VARIANT; prefer specific projections like SELECT v['path'].

Note: If you see Stream Load error [DATA_QUALITY_ERROR]Reached max column size limit 2048 (only on 2.1.x and 3.0.x), it means the merged tablet schema reached its column limit. You may increase variant_max_merged_tablet_schema_size (not recommended beyond 4096; requires strong hardware).

Inspect number of columns and types

Approach 1: use variant_type to inspect per-row schema (more precise, higher cost):

SELECT variant_type(v) FROM variant_tbl;

Approach 2: extended DESC to show subpaths extracted through Subcolumnization:

SET describe_extend_variant_column = true;
DESC variant_tbl;
DESCRIBE ${table_name} PARTITION ($partition_name);

Use both: Approach 1 is precise; Approach 2 is efficient.

Compared with JSON type

  • Storage: JSON is stored as JSONB (row-oriented). VARIANT uses Subcolumnization on write (higher compression, smaller size).
  • Query: JSON requires parsing. VARIANT scans columns directly and is usually much faster.

ClickBench (43 queries):

  • Storage: VARIANT saves ~65% vs JSON.
  • Query: VARIANT is 8x+ faster than JSON, close to predefined static columns.

Storage space

TypeSize
Predefined columns12.618 GB
VARIANT12.718 GB
JSON35.711 GB

~65% space savings

RunPredefinedVARIANTJSON
First (cold)233.79s248.66sMost timed out
Second (hot)86.02s94.82s789.24s
Third (hot)83.03s92.29s743.69s

FAQ

  1. Are null in VARIANT and SQL NULL different?
    • No. They are equivalent.
  2. Why doesn’t my query/index work?
    • Check whether you CAST paths to the correct types; whether the type was promoted to JSONB due to conflicts; or whether you mistakenly expect an index on the whole VARIANT instead of on subpaths.