Doris Streamloader
Doris Streamloader is a dedicated client tool for loading data into Apache Doris. Compared with issuing a single Stream Load request through curl, it imports with multiple concurrent workers, which significantly reduces the time required to load large data volumes.
Core Features
| Feature | Description |
|---|---|
| Concurrent import | Performs Stream Load with multiple concurrent workers. The concurrency level is set with the workers parameter |
| Multi-file import | Imports multiple files and directories in a single task. Supports wildcard matching and automatically traverses all files under a directory recursively |
| Resumable transfer | If a partial failure occurs during import, the tool can resume from the failure point |
| Automatic retry | After an import failure, no manual retry is needed. The tool retries automatically up to a configurable number of times (3 by default). If the import still fails, it prints a command for manual retry |
Use Cases
- Batch loading large data volumes (GB to TB scale) into Doris
- Batch importing of multiple files and multiple directories
- Latency-sensitive imports that need concurrency to improve throughput
- Stable import workflows that require resumable transfers and automatic recovery from failures
Download and Installation
| Resource | Address |
|---|---|
| Source code | https://github.com/apache/doris-streamloader |
| Binary download | https://doris.apache.org/download |
The download is an executable binary. No additional compilation or installation is required.
Usage
Basic Command Format
doris-streamloader \
--source_file={FILE_LIST} \
--url={FE_OR_BE_SERVER_URL}:{PORT} \
--header={STREAMLOAD_HEADER} \
--db={TARGET_DATABASE} \
--table={TARGET_TABLE}
Required Parameters
| Parameter | Meaning |
|---|---|
| --source_file | The list of data files to import. Supports a single file, a directory, wildcards, and a comma-separated list |
| --url | The service address of Doris FE or BE, in the format http://host:port |
| --header | The header parameters for Stream Load. Multiple parameters are separated by ? |
| --db | The name of the target database |
| --table | The name of the target table |
Formats Supported by source_file
The --source_file parameter accepts the following five forms. Pick whichever fits your scenario.
1. A Single File
For example, to import a single file file.csv:
doris-streamloader --source_file="file.csv" --url="http://localhost:8330" --header="column_separator:|?columns:col1,col2" --db="testdb" --table="testtbl"
2. A Single Directory
For example, to import the directory dir:
doris-streamloader --source_file="dir" --url="http://localhost:8330" --header="column_separator:|?columns:col1,col2" --db="testdb" --table="testtbl"
3. A File Name with Wildcards (Must Be Quoted)
For example, to import file0.csv, file1.csv, and file2.csv:
doris-streamloader --source_file="file*" --url="http://localhost:8330" --header="column_separator:|?columns:col1,col2" --db="testdb" --table="testtbl"
4. A Comma-Separated List of File Names
For example, to import file0.csv, file1.csv, and file2.csv:
doris-streamloader --source_file="file0.csv,file1.csv,file2.csv" --url="http://localhost:8330" --header="column_separator:|?columns:col1,col2" --db="testdb" --table="testtbl"
5. A Comma-Separated List of Directories
For example, to import dir1, dir2, and dir3:
doris-streamloader --source_file="dir1,dir2,dir3" --url="http://localhost:8330" --header="column_separator:|?columns:col1,col2" --db="testdb" --table="testtbl"
Header Parameter
--header accepts every Stream Load header. Write each one as key:value and separate multiple entries with ?.
Example:
doris-streamloader --source_file="data.csv" --url="http://localhost:8330" --header="column_separator:|?columns:col1,col2" --db="testdb" --table="testtbl"
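When a load needs several headers, assembling the --header value by hand is error-prone. The sketch below joins key:value pairs with ?, following the format described in this section. join_headers is a hypothetical helper, not part of the tool, and format:csv is included only to illustrate a third Stream Load header.

```shell
# Hypothetical helper: join Stream Load headers (key:value) with "?".
# The function body runs in a subshell so the IFS change does not leak.
join_headers() (
  IFS='?'
  printf '%s\n' "$*"
)

header=$(join_headers "column_separator:|" "columns:col1,col2" "format:csv")
printf '%s\n' "$header"
# prints: column_separator:|?columns:col1,col2?format:csv
```

The resulting string can then be passed as --header="$header" in any of the commands in this document.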
Optional Parameters
In addition to the required parameters above, the tool provides optional parameters for fine-grained control of import behavior. The tables below group them by function.
Authentication and Transport
| Parameter | Meaning | Default | Recommendation |
|---|---|---|---|
| --u | Database user name | root | - |
| --p | Password for the database user | Empty string | - |
| --compress | Whether data is compressed during HTTP transport | false | Keep the default. Enabling compression adds CPU pressure on both the tool and Doris BE for compression and decompression. Enable it only when the network bandwidth on the data source machine is the bottleneck |
| --timeout | Timeout for HTTP requests sent to Doris, in seconds | 36000 (10 hours) | Keep the default |
Batch and Concurrency
| Parameter | Meaning | Default | Recommendation |
|---|---|---|---|
| --batch | Granularity for batch reading and sending of files, in rows | 4096 | Keep the default |
| --batch_byte | Granularity for batch reading and sending of files, in bytes | 943718400 (900 MB) | Keep the default |
| --workers | Concurrency level for the import | 0 | When set to 0, the tool runs in automatic mode and computes the value based on the size of the imported data, the disk throughput, and the Stream Load import speed. You can also set it manually. For high-performance clusters, you may increase it appropriately, but preferably no more than 10. If the import memory is too high (observed via Memtracker or Exceed logs), you can lower it appropriately |
| --disk_throughput | Disk throughput, in MB/s | 800 | Usually keep the default. This value participates in the automatic calculation of --workers. If you want the tool to compute an appropriate workers count, you can set this based on the actual disk throughput |
| --streamload_throughput | Actual Stream Load import throughput, in MB/s | 100 | Usually keep the default. This value participates in the automatic calculation of --workers. The default value is derived from a daily performance test environment. If you want the tool to compute an appropriate workers count, you can set this based on the measured throughput, using the formula: (LoadBytes*1000) / (LoadTimeMs*1024*1024) |
| --max_byte_per_task | Upper limit on the data volume per import task. When exceeded, the data is split into a new task | 107374182400 (100 GB) | A larger value is recommended to reduce the number of import versions. However, if you encounter a body exceed max size error and do not want to adjust streaming_load_max_mb (which requires restarting the BE), or if you encounter -238 TOO MANY SEGMENT, you can lower it temporarily |
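The throughput formula given for --streamload_throughput can be evaluated from the LoadBytes and LoadTimeMs fields of an earlier Load Result. This is a minimal sketch that assumes awk is available. The function name streamload_throughput_mbps and the sample figures are hypothetical.

```shell
# Hypothetical helper: convert LoadBytes and LoadTimeMs into MB/s using
# (LoadBytes*1000) / (LoadTimeMs*1024*1024).
streamload_throughput_mbps() {
  awk -v bytes="$1" -v ms="$2" \
    'BEGIN { printf "%.1f\n", (bytes * 1000) / (ms * 1024 * 1024) }'
}

# Illustrative values only: 10737418240 bytes loaded in 100000 ms.
streamload_throughput_mbps 10737418240 100000
# prints: 102.4
```

A figure measured this way on your own cluster is a more realistic input for the automatic --workers calculation than the default of 100.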
Data Validation and Logs
| Parameter | Meaning | Default | Recommendation |
|---|---|---|---|
| --check_utf8 | Whether to check the encoding of the imported data: false skips the check and imports the raw data; true replaces non-UTF-8 characters with � | true | Keep the default |
| --debug | Whether to print debug logs | false | Keep the default |
| --log_filename | Where logs are stored | "" | Logs are output to the console by default. To write logs to a file, specify a path, for example --log_filename="/var/log" |
Failure Retry
| Parameter | Meaning | Default | Recommendation |
|---|---|---|---|
| --auto_retry | The list of worker and task numbers to retry automatically | Empty string | Use this only when the import fails. You do not need to set it during normal imports. On failure, the specific parameter values are printed. Just copy and run them. For example, --auto_retry="1,1,2,1" means the first task of the first worker and the first task of the second worker need to be retried |
| --auto_retry_times | Number of automatic retries | 3 | Keep the default. To disable retries, set it to 0 |
| --auto_retry_interval | Interval between automatic retries, in seconds | 60 | Keep the default. If failures are caused by Doris being down, set this based on the actual restart time |
Result
Whether the import succeeds or fails, the tool prints a final result when it finishes.
Result Fields
| Field | Description |
|---|---|
| Status | The import status. Success means success and Failed means failure |
| TotalRows | The total number of rows in the files to be imported |
| FailLoadRows | The number of rows that were intended to be imported but were not |
| LoadedRows | The number of rows actually imported into Doris |
| FilteredRows | The number of rows filtered out by Doris during import |
| UnselectedRows | The number of rows ignored by Doris during import |
| LoadBytes | The number of bytes actually imported |
| LoadTimeMs | The actual import duration, in milliseconds |
| LoadFiles | The list of files actually imported |
Success Example
When the import succeeds, the output is as follows:
Load Result: {
    "Status": "Success",
    "TotalRows": 120,
    "FailLoadRows": 0,
    "LoadedRows": 120,
    "FilteredRows": 0,
    "UnselectedRows": 0,
    "LoadBytes": 40632,
    "LoadTimeMs": 971,
    "LoadFiles": [
        "basic.csv",
        "basic_data1.csv",
        "basic_data2.csv",
        "dir1/basic_data.csv",
        "dir1/basic_data.csv.1",
        "dir1/basic_data1.csv"
    ]
}
Failure Example
If part of the data fails to import, the tool first prints the retry command:
load has some error, and auto retry failed, you can retry by :
./doris-streamloader --source_file /mnt/disk1/laihui/doris/tools/tpch-tools/bin/tpch-data/lineitem.tbl.1 --url="http://127.0.0.1:8239" --header="column_separator:|?columns: l_orderkey, l_partkey, l_suppkey, l_linenumber, l_quantity, l_extendedprice, l_discount, l_tax, l_returnflag,l_linestatus, l_shipdate,l_commitdate,l_receiptdate,l_shipinstruct,l_shipmode,l_comment,temp" --db="db" --table="lineitem1" -u root -p "" --compress=false --timeout=36000 --workers=3 --batch=4096 --batch_byte=943718400 --max_byte_per_task=1073741824 --check_utf8=true --report_duration=1 --auto_retry="2,1;1,1;0,1" --auto_retry_times=0 --auto_retry_interval=60
Copying and running this command performs the manual retry. The meaning of auto_retry is described in the parameter section above. The failure result is then printed:
Load Result: {
    "Status": "Failed",
    "TotalRows": 1,
    "FailLoadRows": 1,
    "LoadedRows": 0,
    "FilteredRows": 0,
    "UnselectedRows": 0,
    "LoadBytes": 0,
    "LoadTimeMs": 104,
    "LoadFiles": [
        "/mnt/disk1/laihui/doris/tools/tpch-tools/bin/tpch-data/lineitem.tbl.1"
    ]
}
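In scripted pipelines it can be useful to branch on the Status field. This document does not state which exit code the tool returns, so the sketch below inspects the printed Load Result instead. load_succeeded is a hypothetical helper, and the sample text stands in for output captured from a real run.

```shell
# Hypothetical helper: succeed only if the text on stdin contains
# "Status": "Success" as printed in the Load Result block.
load_succeeded() {
  grep -Eq '"Status":[[:space:]]*"Success"'
}

# Sample text standing in for captured doris-streamloader output.
sample_output='Load Result: {
"Status": "Failed",
"TotalRows": 1,
"FailLoadRows": 1
}'

if printf '%s\n' "$sample_output" | load_succeeded; then
  echo "import ok"
else
  echo "import failed: rerun the printed retry command"
fi
```

Matching on the printed text keeps the check independent of exit-code behavior, at the cost of depending on the output format shown in the examples.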
Best Practices
Recommended Parameters
- Required parameters: The following parameters must be configured:
  --source_file=FILE_LIST
  --url=FE_OR_BE_SERVER_URL_WITH_PORT
  --header=STREAMLOAD_HEADER
  --db=TARGET_DATABASE
  --table=TARGET_TABLE
  To import multiple files, specify them together in source_file.
- workers: With the default of 0, the tool chooses the concurrency automatically. If you set it manually, do not size it to the CPU core count: on machines with many cores (such as 96), that produces too much concurrency. A common recommendation is 8.
- max_byte_per_task: A larger value reduces the number of import versions. However, if you encounter a body exceed max size error and do not want to adjust streaming_load_max_mb (which requires restarting the BE), or if you encounter the -238 TOO MANY SEGMENT error, you can lower this value temporarily. The default usually works.
- Two key parameters that affect the number of versions:

| Parameter | Effect | Recommendation |
|---|---|---|
| workers | More workers means more versions and higher concurrency | Usually use 8 |
| max_byte_per_task | A larger value means more data per version and fewer versions, but a value that is too large can trigger -238 TOO MANY SEGMENT | Usually use the default |
Recommended Command
Setting the required parameters and --workers=8 is sufficient for most scenarios:
./doris-streamloader \
--source_file="demo.csv,demoFile*.csv,demoDir" \
--url="http://127.0.0.1:8030" \
--header="column_separator:," \
--db="demo" \
--table="test_load" \
--u="root" \
--workers=8
FAQ
1. What should I do if some subtasks fail during import?
The tool retries automatically. If the retries still fail, it prints a manual retry command. Just copy and run it. There is no need to drop the table and reimport.
2. What if a single import exceeds the BE default streaming_load_max_mb threshold?
The tool's default upper limit per import is 100 GB, which may exceed the BE streaming_load_max_mb threshold. To avoid restarting the BE, lower the --max_byte_per_task parameter.
To check the value of streaming_load_max_mb:
curl "http://127.0.0.1:8040/api/show_config"
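streaming_load_max_mb is expressed in megabytes, while --max_byte_per_task takes bytes, so the value needs converting before the two can be compared. This is a rough sketch that assumes 1 MB means 1024*1024 bytes, consistent with the 900 MB and 100 GB defaults listed in the parameter tables. mb_to_bytes is a hypothetical helper, and 10240 is only an illustrative value: substitute the one your BE reports.

```shell
# Hypothetical helper: convert a size in MB to bytes (1 MB = 1024*1024 bytes).
mb_to_bytes() {
  echo $(( $1 * 1024 * 1024 ))
}

# Illustrative value; replace with the streaming_load_max_mb your BE reports.
limit_mb=10240
max_bytes=$(mb_to_bytes "$limit_mb")

# Print a flag whose value matches the BE limit.
printf '%s\n' "--max_byte_per_task=$max_bytes"
# prints: --max_byte_per_task=10737418240
```

Choosing a --max_byte_per_task at or below the converted limit is one way to keep each task under the BE threshold without restarting the BE.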
3. What should I do when the -238 TOO MANY SEGMENT error occurs?
Lowering the --max_byte_per_task parameter mitigates this issue.