CSV
This document explains how to load CSV format data files into Doris. Doris supports flexible CSV format configuration, including custom delimiters, field enclosures, and provides various loading methods to meet data loading requirements in different scenarios.
Loading Methods
Doris supports the following methods to load CSV format data:
Parameter Configuration
Parameter Support
The following table lists the CSV format parameters supported by various loading methods:
Parameter | Default Value | Stream Load | Broker Load | Routine Load | MySQL Load | TVF |
---|---|---|---|---|---|---|
Line Delimiter | \n | line_delimiter | LINES TERMINATED BY | Not supported | LINES TERMINATED BY | line_delimiter |
Column Delimiter | \t | column_separator | COLUMNS TERMINATED BY | COLUMNS TERMINATED BY | COLUMNS TERMINATED BY | column_separator |
Enclosure | None | enclose | PROPERTIES.enclose | PROPERTIES.enclose | PROPERTIES.enclose | enclose |
Escape Character | \ | escape | PROPERTIES.escape | PROPERTIES.escape | PROPERTIES.escape | escape |
Skip Lines | 0 | skip_lines | PROPERTIES.skip_lines | Not supported | IGNORE LINES | skip_lines |
Trim Double Quotes | false | trim_double_quotes | Not supported | PROPERTIES.trim_double_quotes | Not supported | trim_double_quotes |
Compression Format | plain | compress_type | PROPERTIES.compress_type | Not supported | Not supported | compress_type |
Note
- Stream Load: Parameters are specified directly through HTTP Headers, e.g.,
-H "line_delimiter:\n"
- Broker Load: Parameters are specified in SQL statements, where:
- Delimiters are specified through
COLUMNS TERMINATED BY
,LINES TERMINATED BY
- Other parameters are specified through
PROPERTIES
, e.g.,PROPERTIES("compress_type"="gz")
- Delimiters are specified through
- Routine Load: Parameters are specified in SQL statements, where:
- Delimiters are specified through
COLUMNS TERMINATED BY
- Other parameters are specified through
PROPERTIES
, e.g.,PROPERTIES("enclose"="\"")
- Delimiters are specified through
- MySQL Load: Parameters are specified in SQL statements, where:
- Delimiters are specified through
LINES TERMINATED BY
,COLUMNS TERMINATED BY
- Other parameters are specified through
PROPERTIES
, e.g.,PROPERTIES("escape"="\\")
- Delimiters are specified through
- TVF: Parameters are specified in TVF statements, e.g.,
S3("line_delimiter"="\n")
Parameter Description
Line Delimiter
- Purpose: Specifies the line break character in the data file
- Default Value:
\n
- Features: Supports multiple characters combination as line break
- Use Cases and Examples:
- Linux/Unix System Files:
Data File:
John,25\n
Mary,30\n
Parameter Setting:
line_delimiter: \n (default value, can be omitted) - Windows System Files:
Data File:
John,25\r\n
Mary,30\r\n
Parameter Setting:
line_delimiter: \r\n - Special Program Generated Files:
Data File:
John,25\r
Mary,30\r
Parameter Setting:
line_delimiter: \r - Custom Multi-character Delimiter:
Data File:
John,25||
Mary,30||
Parameter Setting:
line_delimiter: ||
- Linux/Unix System Files:
Column Delimiter
- Purpose: Specifies the column delimiter in the data file
- Default Value:
\t
(tab) - Features:
- Supports visible and invisible characters
- Supports multiple character combinations
- Invisible characters need to use hex representation with
\x
prefix
- Special Handling for MySQL Protocol:
- Invisible characters need an extra backslash
- For example, Hive's
\x01
needs to be written as\\x01
in Broker Load
- Use Cases and Examples:
- Common Visible Characters:
Data File:
John,25,New York
Mary,30,Los Angeles
Parameter Setting:
column_separator: , - Tab (Default):
Data File:
John 25 New York
Mary 30 Los Angeles
Parameter Setting:
column_separator: \t (default value, can be omitted) - Hive Files (Stream Load):
Data File:
John\x0125\x01New York
Mary\x0130\x01Los Angeles
Parameter Setting:
column_separator: \x01 - Hive Files (Broker Load):
Data File:
John\x0125\x01New York
Mary\x0130\x01Los Angeles
Parameter Setting:
PROPERTIES("column_separator"="\\x01") - Multi-character Delimiter:
Data File:
John||25||New York
Mary||30||Los Angeles
Parameter Setting:
column_separator: ||
- Common Visible Characters:
Enclosure
- Purpose: Protects fields containing special characters from being incorrectly parsed
- Limitation: Only supports single-byte characters
- Common Characters:
- Single quote:
'
- Double quote:
"
- Single quote:
- Use Cases and Examples:
- Field Contains Column Delimiter:
Data: a,'b,c',d
Column Delimiter: ,
Enclosure: '
Parsing Result: 3 fields [a] [b,c] [d] - Field Contains Line Delimiter:
Data: a,'b\nc',d
Column Delimiter: ,
Enclosure: '
Parsing Result: 3 fields [a] [b\nc] [d] - Field Contains Both Column and Line Delimiters:
Data: a,'b,c\nd,e',f
Column Delimiter: ,
Enclosure: '
Parsing Result: 3 fields [a] [b,c\nd,e] [f]
- Field Contains Column Delimiter:
Escape Character
- Purpose: Escapes characters in fields that are the same as the enclosure character
- Limitation: Only supports single-byte characters, default is
\
- Use Cases and Examples:
- Field Contains Enclosure Character:
Data: a,'b,\'c',d
Column Delimiter: ,
Enclosure: '
Escape Character: \
Parsing Result: 3 fields [a] [b,'c] [d] - Field Contains Multiple Enclosure Characters:
Data: a,"b,\"c\"d",e
Column Delimiter: ,
Enclosure: "
Escape Character: \
Parsing Result: 3 fields [a] [b,"c"d] [e] - Field Contains Escape Character Itself:
Data: a,'b\\c',d
Column Delimiter: ,
Enclosure: '
Escape Character: \
Parsing Result: 3 fields [a] [b\c] [d]
- Field Contains Enclosure Character:
Skip Lines
- Purpose: Skips the first few lines of the CSV file
- Type: Integer
- Default Value: 0
- Special Note:
- When format is
csv_with_names
, the system automatically skips the first line (column names), ignoring the skip_lines parameter - When format is
csv_with_names_and_types
, the system automatically skips the first two lines (column names and types), ignoring the skip_lines parameter
- When format is
- Use Cases and Examples:
- Skip Title Line:
Data File:
Name,Age,City
John,25,New York
Mary,30,Los Angeles
Parameter Setting:
skip_lines: 1
Result: Skip title line, load subsequent data - Skip Comment Lines:
Data File:
# User Information Table
# Created Time: 2024-01-01
John,25,New York
Mary,30,Los Angeles
Parameter Setting:
skip_lines: 2
Result: Skip comment lines, load subsequent data - Use csv_with_names Format:
Data File:
name,age,city
John,25,New York
Mary,30,Los Angeles
Parameter Setting:
format: csv_with_names
Result: System automatically skips the first line of column names - Use csv_with_names_and_types Format:
Data File:
name,age,city
string,int,string
John,25,New York
Mary,30,Los Angeles
Parameter Setting:
format: csv_with_names_and_types
Result: System automatically skips the first two lines of column names and types
- Skip Title Line:
Trim Double Quotes
- Purpose: Trims the outermost double quotes from each field in the CSV file
- Type: Boolean
- Default Value: false
- Use Cases and Examples:
- Trim Double Quotes:
Data File:
"John","25","New York"
"Mary","30","Los Angeles"
Parameter Setting:
trim_double_quotes: true
Result:
John,25,New York
Mary,30,Los Angeles
- Trim Double Quotes:
Compression Format
- Purpose: Specifies the compression format of the data file
- Type: String, ignoring case
- Default Value: plain
- Supported Compression Formats:
- plain: No compression (default)
- bz2: BZIP2 compression
- deflate: DEFLATE compression
- gz: GZIP compression
- lz4: LZ4 Frame format compression
- lz4_block: LZ4 Block format compression
- lzo: LZO compression
- lzop: LZOP compression
- snappy_block: SNAPPY Block format compression
- Note:
- tar is a file packaging format, not a compression format, so it is not supported
- If you need to use tar packaged files, please unpack them first before loading
Usage Examples
This section demonstrates the usage of CSV format in different loading methods.
Stream Load
# Specify delimiter
curl --location-trusted -u root: \
-H "column_separator:," \
-H "line_delimiter:\n" \
-T example.csv \
http://<fe_host>:<fe_http_port>/api/test_db/test_table/_stream_load
# Handle quoted data
curl --location-trusted -u root: \
-H "column_separator:," \
-H "enclose:\"" \
-T example.csv \
http://<fe_host>:<fe_http_port>/api/test_db/test_table/_stream_load
# Load compressed file
curl --location-trusted -u root: \
-H "compress_type:gz" \
-T example.csv.gz \
http://<fe_host>:<fe_http_port>/api/test_db/test_table/_stream_load
Broker Load
-- Specify delimiter
LOAD LABEL test_db.test_label
(
DATA INFILE("s3://bucket/example.csv")
INTO TABLE test_table
COLUMNS TERMINATED BY ","
LINES TERMINATED BY "\n"
...
);
-- Handle quoted data
LOAD LABEL test_db.test_label
(
DATA INFILE("s3://bucket/example.csv")
INTO TABLE test_table
COLUMNS TERMINATED BY ","
LINES TERMINATED BY "\n"
PROPERTIES
(
"enclose" = "\"",
"escape" = "\\"
)
...
);
-- Load compressed file
LOAD LABEL test_db.test_label
(
DATA INFILE("s3://bucket/example.csv.gz")
INTO TABLE test_table
COLUMNS TERMINATED BY ","
LINES TERMINATED BY "\n"
PROPERTIES
(
"compress_type" = "gz"
)
...
);
Routine Load
-- Specify delimiter
CREATE ROUTINE LOAD test_db.test_job ON test_table
COLUMNS TERMINATED BY ","
FROM KAFKA
(
...
);
-- Handle quoted data
CREATE ROUTINE LOAD test_db.test_job ON test_table
COLUMNS TERMINATED BY ","
PROPERTIES
(
"enclose" = "\"",
"escape" = "\\"
)
FROM KAFKA
(
...
);
MySQL Load
-- Specify delimiter
LOAD DATA LOCAL INFILE 'example.csv'
INTO TABLE test_table
COLUMNS TERMINATED BY ','
LINES TERMINATED BY '\n';
-- Handle quoted data
LOAD DATA LOCAL INFILE 'example.csv'
INTO TABLE test_table
COLUMNS TERMINATED BY ','
LINES TERMINATED BY '\n'
PROPERTIES
(
"enclose" = "\"",
"escape" = "\\"
);
-- Skip table header
LOAD DATA LOCAL INFILE 'example.csv'
INTO TABLE test_table
COLUMNS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
TVF Load
-- Specify delimiter
INSERT INTO test_table
SELECT *
FROM S3
(
"path" = "s3://bucket/example.csv",
"column_separator" = ",",
"line_delimiter" = "\n",
...
);
-- Handle quoted data
INSERT INTO test_table
SELECT *
FROM S3
(
"path" = "s3://bucket/example.csv",
"column_separator" = ",",
"enclose" = "\"",
"escape" = "\\",
...
);
-- Load compressed file
INSERT INTO test_table
SELECT *
FROM S3
(
"path" = "s3://bucket/example.csv.gz",
"column_separator" = ",",
"line_delimiter" = "\n",
"compress_type" = "gz",
...
);