JSON
This document explains how to load JSON format data files into Doris. Doris supports loading standard JSON format data and can flexibly handle different JSON data structures through parameter configuration, supporting field extraction from JSON data and handling nested structures.
Loading Methodsβ
The following loading methods support JSON format data:
Supported JSON Formatsβ
Doris supports the following three JSON formats:
Multiple Rows Represented as Arrayβ
Suitable for batch loading multiple rows of data, requirements:
- Root node must be an array
- Each element in the array is an object representing a row of data
- Must set
strip_outer_array=true
Example data:
[
{"id": 123, "city": "beijing"},
{"id": 456, "city": "shanghai"}
]
// Supports nested structures
[
{"id": 123, "city": {"name": "beijing", "region": "haidian"}},
{"id": 456, "city": {"name": "beijing", "region": "chaoyang"}}
]
Single Row Represented as Objectβ
Suitable for loading single row data, requirements:
- Root node must be an object
- The entire object represents one row of data
Example data:
{"id": 123, "city": "beijing"}
// Supports nested structures
{"id": 123, "city": {"name": "beijing", "region": "haidian"}}
Typically used with Routine Load method, such as single messages in Kafka.
Multiple Object Rows Separated by Delimiterβ
Suitable for batch loading multiple rows of data, requirements:
- Each line is a complete JSON object
- Must set
read_json_by_line=true
- Line delimiter can be specified using
line_delimiter
parameter, default is\n
Example data:
{"id": 123, "city": "beijing"}
{"id": 456, "city": "shanghai"}
Parameter Configurationβ
Parameter Supportβ
The following table lists the JSON format parameters supported by various loading methods:
Parameter | Default Value | Stream Load | Broker Load | Routine Load | TVF |
---|---|---|---|---|---|
json paths | None | jsonpaths | properties.jsonpaths | properties.jsonpaths | jsonpaths |
json root | None | json_root | properties.json_root | properties.json_root | json_root |
strip outer array | false | strip_outer_array | properties.strip_outer_array | properties.strip_outer_array | strip_outer_array |
read json by line | false | read_json_by_line | Always true | Not supported | read_json_by_line, default true |
fuzzy parse | false | fuzzy_parse | properties.fuzzy_parse | Not supported | fuzzy_parse |
num as string | false | num_as_string | properties.num_as_string | properties.num_as_string | num_as_string |
- Stream Load: Parameters are specified directly through HTTP Headers, e.g.,
-H "jsonpaths: $.data"
- Broker Load: Parameters are specified through
PROPERTIES
, e.g.,PROPERTIES("jsonpaths"="$.data")
- Routine Load: Parameters are specified through
PROPERTIES
, e.g.,PROPERTIES("jsonpaths"="$.data")
- TVF: Parameters are specified in TVF statements, e.g.,
S3("jsonpaths"="$.data")
Parameter Descriptionβ
JSON Pathβ
- Purpose: Specifies how to extract fields from JSON data
- Type: String array
- Default Value: None, defaults to matching column names
- Usage Examples:
-- Basic usage
["$.id", "$.city"]
-- Nested structures
["$.id", "$.info.city", "$.data[0].name"]
JSON Rootβ
- Purpose: Specifies the parsing starting point for JSON data
- Type: String
- Default Value: None, defaults to parsing from root node
- Usage Example:
-- Original data
{
"data": {
"id": 123,
"city": "beijing"
}
}
-- Set json_root
json_root = $.data
Strip Outer Arrayβ
- Purpose: Specifies whether to remove the outermost array structure
- Type: Boolean
- Default Value: false
- Usage Example:
-- Original data
[
{"id": 1, "city": "beijing"},
{"id": 2, "city": "shanghai"}
]
-- Set strip_outer_array=true
Read JSON By Lineβ
- Purpose: Specifies whether to read JSON data line by line
- Type: Boolean
- Default Value: false
- Usage Example:
-- Original data (one complete JSON object per line)
{"id": 1, "city": "beijing"}
{"id": 2, "city": "shanghai"}
-- Set read_json_by_line=true
Fuzzy Parseβ
- Purpose: Accelerates JSON data loading efficiency
- Type: Boolean
- Default Value: false
- Limitations:
- Field order in each row of the Array must be identical
- Usually used with strip_outer_array
- Performance: Can improve loading efficiency by 3-5 times
Num As Stringβ
- Purpose: Specifies whether to parse JSON numeric types as strings
- Type: Boolean
- Default Value: false
- Use Cases:
- Handling numbers outside numeric range
- Avoiding numeric precision loss
- Usage Example:
-- Original data
{
"id": "12345678901234567890",
"price": "99999999.999999"
}
-- Set num_as_string=true, price field will be parsed as string
Relationship between JSON Path and Columnsβ
During data loading, JSON Path and Columns serve different responsibilities:
JSON Path: Defines data extraction rules
- Extracts fields from JSON data according to specified paths
- Extracted fields are reordered according to the order defined in JSON Path
Columns: Defines data mapping rules
- Maps extracted fields to target table columns
- Can perform column reordering and transformation
These two parameters are processed serially: first, JSON Path extracts fields from source data and forms an ordered dataset, then Columns maps these data to table columns. If Columns is not specified, extracted fields will be mapped directly according to table column order.
Usage Examplesβ
Using JSON Path Onlyβ
Table structure and data:
-- Table structure
CREATE TABLE example_table (
k2 int,
k1 int
);
-- JSON data
{"k1": 1, "k2": 2}
Load command:
curl -v ... -H "format: json" \
-H "jsonpaths: [\"$.k2\", \"$.k1\"]" \
-T example.json \
http://<fe_host>:<fe_http_port>/api/db_name/table_name/_stream_load
Load result:
+------+------+
| k1 | k2 |
+------+------+
| 2 | 1 |
+------+------+
Using JSON Path + Columnsβ
Using the same table structure and data, adding columns parameter:
Load command:
curl -v ... -H "format: json" \
-H "jsonpaths: [\"$.k2\", \"$.k1\"]" \
-H "columns: k2, k1" \
-T example.json \
http://<fe_host>:<fe_http_port>/api/db_name/table_name/_stream_load
Load result:
+------+------+
| k1 | k2 |
+------+------+
| 1 | 2 |
+------+------+
Field Reuseβ
Table structure and data:
-- Table structure
CREATE TABLE example_table (
k2 int,
k1 int,
k1_copy int
);
-- JSON data
{"k1": 1, "k2": 2}
Load command:
curl -v ... -H "format: json" \
-H "jsonpaths: [\"$.k2\", \"$.k1\", \"$.k1\"]" \
-H "columns: k2, k1, k1_copy" \
-T example.json \
http://<fe_host>:<fe_http_port>/api/db_name/table_name/_stream_load
Load result:
+------+------+---------+
| k2 | k1 | k1_copy |
+------+------+---------+
| 2 | 1 | 1 |
+------+------+---------+
Nested Field Mappingβ
Table structure and data:
-- Table structure
CREATE TABLE example_table (
k2 int,
k1 int,
k1_nested1 int,
k1_nested2 int
);
-- JSON data
{
"k1": 1,
"k2": 2,
"k3": {
"k1": 31,
"k1_nested": {
"k1": 32
}
}
}
Load command:
curl -v ... -H "format: json" \
-H "jsonpaths: [\"$.k2\", \"$.k1\", \"$.k3.k1\", \"$.k3.k1_nested.k1\"]" \
-H "columns: k2, k1, k1_nested1, k1_nested2" \
-T example.json \
http://<fe_host>:<fe_http_port>/api/db_name/table_name/_stream_load
Load result:
+------+------+------------+------------+
| k2 | k1 | k1_nested1 | k1_nested2 |
+------+------+------------+------------+
| 2 | 1 | 31 | 32 |
+------+------+------------+------------+
Usage Examplesβ
This section demonstrates the usage of JSON format in different loading methods.
Stream Loadβ
# Using JSON Path
curl --location-trusted -u <user>:<passwd> \
-H "format: json" \
-H "jsonpaths: [\"$.id\", \"$.city\"]" \
-T example.json \
http://<fe_host>:<fe_http_port>/api/example_db/example_table/_stream_load
# Specifying JSON root
curl --location-trusted -u <user>:<passwd> \
-H "format: json" \
-H "json_root: $.events" \
-T example.json \
http://<fe_host>:<fe_http_port>/api/example_db/example_table/_stream_load
# Reading JSON by line
curl --location-trusted -u <user>:<passwd> \
-H "format: json" \
-H "read_json_by_line: true" \
-T example.json \
http://<fe_host>:<fe_http_port>/api/example_db/example_table/_stream_load
Broker Loadβ
-- Using JSON Path
LOAD LABEL example_db.example_label
(
DATA INFILE("s3://bucket/path/example.json")
INTO TABLE example_table
FORMAT AS "json"
PROPERTIES
(
"jsonpaths" = "[\"$.id\", \"$.city\"]"
)
)
WITH S3
(
...
);
-- Specifying JSON root
LOAD LABEL example_db.example_label
(
DATA INFILE("s3://bucket/path/example.json")
INTO TABLE example_table
FORMAT AS "json"
PROPERTIES
(
"json_root" = "$.events"
)
)
WITH S3
(
...
);
Routine Loadβ
-- Using JSON Path
CREATE ROUTINE LOAD example_db.example_job ON example_table
PROPERTIES
(
"format" = "json",
"jsonpaths" = "[\"$.id\", \"$.city\"]"
)
FROM KAFKA
(
...
);
TVF Loadβ
-- Using JSON Path
INSERT INTO example_table
SELECT *
FROM S3
(
"path" = "s3://bucket/example.json",
"format" = "json",
"jsonpaths" = "[\"$.id\", \"$.city\"]",
...
);
-- Specifying JSON root
INSERT INTO example_table
SELECT *
FROM S3
(
"path" = "s3://bucket/example.json",
"format" = "json",
"json_root" = "$.events",
...
);