Text/CSV/JSON
This document introduces the support for reading and writing text file formats in Doris.
Text/CSV
- Catalog
  - Supports reading Hive tables in the org.apache.hadoop.mapred.TextInputFormat format.
  - Supports reading Hive tables in the org.apache.hadoop.hive.serde2.OpenCSVSerde format (supported since version 2.1.7).
- Table Valued Function
- Import
  Import functionality supports Text/CSV formats. See the import documentation for details.
- Export
  Export functionality supports Text/CSV formats. See the export documentation for details.
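To illustrate the Catalog case, the following is a minimal sketch of a Hive table stored with OpenCSVSerde and then queried from Doris. The catalog, database, and table names (hive_catalog, demo_db, csv_demo) are hypothetical, and the sketch assumes a Hive catalog has already been created in Doris. Note that OpenCSVSerde exposes every column as a string on the Hive side.

```sql
-- Run in Hive: a CSV table backed by OpenCSVSerde (names are hypothetical).
CREATE TABLE demo_db.csv_demo (
    id   STRING,
    name STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
    'separatorChar' = ',',
    'quoteChar'     = '"'
)
STORED AS TEXTFILE;

-- Run in Doris (2.1.7 or later), assuming a Hive catalog named hive_catalog exists.
SELECT id, name
FROM hive_catalog.demo_db.csv_demo
LIMIT 10;
```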
Supported Compression Formats
- uncompressed
- gzip
- deflate
- bzip2
- zstd
- lz4
- snappy
- lzo
JSON
Catalog
- Hive table in org.apache.hive.hcatalog.data.JsonSerDe format (supported since version 3.0.4)
  - Supports both primitive and complex types.
  - Does not support the timestamp.formats SERDEPROPERTIES.
- Hive table in org.openx.data.jsonserde.JsonSerDe format (supported since version 3.0.6)
  - Supports both primitive and complex types.
  - SERDEPROPERTIES: Only ignore.malformed.json is supported, and it behaves the same as in this JsonSerDe. Other SERDEPROPERTIES have no effect.
  - Does not support Using Arrays (similar to the Text/CSV format, where all column data is placed into a single array).
  - Does not support Promoting a Scalar to an Array (promoting a scalar to a single-element array).
  - By default, Doris can correctly recognize the table schema. However, because some parameters are not supported, automatic schema recognition might fail. In this case, you can set read_hive_json_in_one_column = true to place the entire JSON row into the first column, so that the original data is read in full. Users can then process it manually. This feature requires the first column's data type to be String.
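As a rough sketch of the fallback described above, the statements below read each JSON row into the first column and then extract a field manually. This assumes read_hive_json_in_one_column is set as a session variable; the catalog, database, table, and column names (hive_catalog, demo_db, json_events, raw) and the JSON path are hypothetical.

```sql
-- Assumption: the option is applied as a session variable.
SET read_hive_json_in_one_column = true;

-- Hypothetical table whose first column `raw` is of type String.
-- Each row's full JSON text lands in `raw`; fields are then extracted by path.
SELECT
    raw,
    get_json_string(raw, '$.name') AS name
FROM hive_catalog.demo_db.json_events
LIMIT 10;
```

Extracting fields at query time keeps the original JSON intact, which is useful when automatic schema recognition cannot be trusted.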
Import
Import functionality supports JSON formats. See the import documentation for details.
Character Set
Currently, Doris supports only the UTF-8 character set encoding. However, some data, such as the data in Hive Text-format tables, may contain content that is not UTF-8 encoded. Reading such content fails with the following error:
Only support csv data in utf8 codec
In this case, you can set the session variable as follows:
SET enable_text_validate_utf8 = false
This will ignore the UTF-8 encoding check, allowing you to read this content. Note that this parameter is only used to skip the check, and non-UTF-8 encoded content will still be displayed as garbled text.
This parameter has been supported since version 3.0.4.
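A minimal sketch of how this might look in a session follows. The catalog, database, and table names (hive_catalog, demo_db, text_logs) are hypothetical.

```sql
-- Skip the UTF-8 check for the current session (Doris 3.0.4 or later).
SET enable_text_validate_utf8 = false;

-- Hypothetical Hive Text-format table containing some non-UTF-8 bytes.
-- The query no longer fails, but those bytes may still appear garbled.
SELECT *
FROM hive_catalog.demo_db.text_logs
LIMIT 10;
```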