Load Overview
Apache Doris provides multiple data load and integration methods to help you write data into the database from different sources. Starting from typical business scenarios, this document explains how to choose the most suitable solution among four categories: real-time writes, streaming sync, batch loading, and external data source integration.
Quick Navigation
Use the following table to find the recommended load method for your data source and timeliness requirements:
| Business scenario | Data source | Recommended load method |
|---|---|---|
| Application real-time write (very small volume, about once every 5 minutes) | JDBC client | JDBC INSERT |
| Application high-concurrency or high-frequency small-batch write | JDBC / HTTP | Group Commit + JDBC INSERT or Stream Load |
| Application high-throughput write | HTTP | Stream Load |
| Real-time data stream ingestion | Flink | Flink Doris Connector |
| Real-time message queue ingestion | Kafka | Routine Load or Doris Kafka Connector |
| Transactional database real-time sync (no external components) | MySQL / PostgreSQL | Streaming Job continuous load |
| Transactional database CDC sync | MySQL / PostgreSQL, etc. | Flink CDC or DataX |
| Object storage continuous load (automatic incremental file load) | S3 | Streaming Job continuous load |
| Object storage / HDFS file batch load | S3 / OSS / HDFS | Broker Load or INSERT INTO SELECT |
| Local file batch load | Local disk | Stream Load or Doris Streamloader |
| External data source (data lake / external table) query and load | Hive / Iceberg / JDBC, etc. | Catalog + INSERT INTO SELECT |
Each load in Doris is by default an implicit transaction. For more transaction-related information, see Transaction.
Choosing a Load Method by Scenario
Real-Time Write: Direct Application Write
This applies to scenarios where applications write data to Doris tables in real time through HTTP or JDBC, typically to support real-time analysis and queries.
- Very small volume of data (about once every 5 minutes): Use JDBC INSERT to write data.
- High concurrency or high frequency (concurrency above 20, or multiple writes within 1 minute): Enable Group Commit and use it together with JDBC INSERT or Stream Load.
- High-throughput write: Use Stream Load to write data over the HTTP protocol.
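As a rough illustration of the Stream Load options above, the following Python sketch assembles the URL and headers for a Stream Load request without sending it. The `/api/{db}/{table}/_stream_load` path and the `label`, `format`, `column_separator`, and `group_commit` headers follow the Stream Load HTTP interface as commonly documented. The host, port, database, table, and credentials are placeholder assumptions, and the choice to drop the label when Group Commit is enabled is an assumption to verify against the Doris version in use.

```python
import base64


def build_stream_load_request(fe_host, http_port, db, table, user, password,
                              label=None, group_commit=None,
                              column_separator=","):
    """Return (url, headers) for a Stream Load PUT request; nothing is sent."""
    url = f"http://{fe_host}:{http_port}/api/{db}/{table}/_stream_load"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {
        "Authorization": f"Basic {token}",
        "Expect": "100-continue",
        "format": "csv",
        "column_separator": column_separator,
    }
    if group_commit:
        # Assumption: with Group Commit the server batches writes itself,
        # so this sketch does not send a client-side label.
        headers["group_commit"] = group_commit
    elif label:
        headers["label"] = label
    return url, headers


# Hypothetical values, for illustration only.
url, headers = build_stream_load_request(
    "127.0.0.1", 8030, "demo", "events", "root", "",
    group_commit="async_mode",
)
```

The request body would be the file content, sent with any HTTP client that supports `PUT` and follows the redirect from the FE to a BE node (for example, `curl --location-trusted`).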
Streaming Sync: Real-Time Data Stream Ingestion
This applies to scenarios where data is continuously synchronized to Doris tables through real-time data streams (such as Flink, Kafka, or transactional database CDC).
- Flink real-time data stream: Use the Flink Doris Connector to write Flink real-time data streams into Doris tables.
- Kafka real-time data stream: Choose between Routine Load and the Doris Kafka Connector. The differences are as follows:
  | Method | Data flow direction | Supported formats |
  |---|---|---|
  | Routine Load | Doris actively pulls data from Kafka | csv, json |
  | Kafka Connector | Kafka actively pushes data into Doris | avro, json, csv, protobuf |
- Transactional database CDC sync: Use Flink CDC or DataX to write CDC data streams from transactional databases into Doris.
- Streaming Job continuous load (no external components): Through the built-in Streaming Job in Doris, you can continuously read data from sources such as MySQL, PostgreSQL, and S3 and write it into Doris, without depending on external components such as Flink or Kafka. Two sync methods are supported:
  | Sync method | Underlying mechanism | Auto table creation | Semantic guarantee | Typical scenario |
  |---|---|---|---|---|
  | Table-level sync | Job + TVF (INSERT INTO SELECT) | Pre-creation needed | exactly-once | Cases requiring column pruning, field renaming, type conversion, or conditional filtering |
  | Database-level sync | Job + native whole-database DDL | Auto-created on first run | at-least-once | Mirror replication of an entire database or a group of tables, with downstream table schemas automatically following the upstream |
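For the Kafka option in the list above, a Routine Load job is defined with a SQL statement submitted to Doris. The Python sketch below only renders such a statement as text. The `CREATE ROUTINE LOAD ... FROM KAFKA` form and the `format`, `kafka_broker_list`, and `kafka_topic` properties are the commonly documented ones, while the job, table, broker, and topic names are hypothetical.

```python
# Formats the document lists as supported by Routine Load.
ROUTINE_LOAD_FORMATS = ("csv", "json")


def routine_load_sql(db, job, table, brokers, topic, fmt="json"):
    """Render a minimal CREATE ROUTINE LOAD statement as a string."""
    if fmt not in ROUTINE_LOAD_FORMATS:
        raise ValueError(f"Routine Load format must be one of {ROUTINE_LOAD_FORMATS}")
    return (
        f"CREATE ROUTINE LOAD {db}.{job} ON {table}\n"
        f'PROPERTIES ("format" = "{fmt}")\n'
        "FROM KAFKA (\n"
        f'    "kafka_broker_list" = "{brokers}",\n'
        f'    "kafka_topic" = "{topic}"\n'
        ");"
    )


# Hypothetical names, for illustration only.
statement = routine_load_sql("demo", "events_job", "events",
                             "127.0.0.1:9092", "events_topic")
```

A format such as avro is rejected here because, per the comparison above, it is handled by the Doris Kafka Connector rather than by Routine Load.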
Batch Load: Loading Files from External Storage
This applies to non-real-time scenarios where files in external storage systems (such as object storage, HDFS, local files, or NAS) are loaded in batches into Doris tables.
- Object storage / HDFS files: Use Broker Load to write data into Doris.
- Object storage / HDFS / NAS files (synchronous or asynchronous): Use INSERT INTO SELECT for synchronous writes. For asynchronous execution, combine it with JOB scheduling.
- Local files: Use Stream Load or Doris Streamloader to write data into Doris.
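The synchronous and asynchronous variants of INSERT INTO SELECT described above might look like the statements rendered by this Python sketch. It assumes the `S3` table-valued function with `uri` and `format` properties and a `CREATE JOB ... ON SCHEDULE EVERY ... DO` wrapper. The bucket, endpoint, table, and job names are hypothetical, and the exact property names should be checked against the Doris version in use.

```python
def insert_from_s3_sql(target, uri, fmt, s3_props):
    """Render a synchronous INSERT INTO SELECT that reads files through the S3 TVF."""
    props = {"uri": uri, "format": fmt, **s3_props}
    body = ",\n".join(f'    "{key}" = "{value}"' for key, value in props.items())
    return f"INSERT INTO {target}\nSELECT * FROM S3(\n{body}\n)"


def as_scheduled_job(job_name, every, unit, statement):
    """Wrap a statement in a periodic JOB so it runs asynchronously."""
    if unit not in ("MINUTE", "HOUR", "DAY", "WEEK"):
        raise ValueError("unsupported schedule unit in this sketch")
    return f"CREATE JOB {job_name} ON SCHEDULE EVERY {every} {unit} DO\n{statement};"


# Hypothetical values; credentials are deliberately left as placeholders.
sync_sql = insert_from_s3_sql(
    "demo.orders",
    "s3://example-bucket/orders/*.parquet",
    "parquet",
    {"s3.endpoint": "s3.example.com", "s3.region": "us-east-1"},
)
async_sql = as_scheduled_job("load_orders_daily", 1, "DAY", sync_sql)
```

Running `sync_sql` (with a terminating semicolon) blocks until the load finishes, whereas `async_sql` only registers the job and returns.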
External Data Source Integration: Catalog Federated Query and Load
This applies to scenarios where you integrate with external data sources (such as Hive, JDBC, or Iceberg) to query external data and load it on demand into Doris tables.
- Create a Catalog to read data from external data sources.
- Use INSERT INTO SELECT to synchronously write data from the external data source into Doris. For asynchronous execution, combine it with JOB scheduling.
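The two steps above can be sketched as a pair of SQL statements, rendered here by small Python helpers. The sketch assumes a Hive Metastore catalog (`"type" = "hms"` with `hive.metastore.uris`) and the built-in `internal` catalog for Doris tables. The catalog, database, table names, and the filter are hypothetical.

```python
def create_hive_catalog_sql(catalog, metastore_uri):
    """Render a CREATE CATALOG statement for a Hive Metastore source."""
    return (
        f"CREATE CATALOG {catalog} PROPERTIES (\n"
        '    "type" = "hms",\n'
        f'    "hive.metastore.uris" = "{metastore_uri}"\n'
        ");"
    )


def load_from_catalog_sql(target, source, columns=("*",), where=None):
    """Render an INSERT INTO SELECT that copies rows from an external table."""
    sql = f"INSERT INTO {target}\nSELECT {', '.join(columns)}\nFROM {source}"
    if where:
        sql += f"\nWHERE {where}"
    return sql + ";"


# Hypothetical names, for illustration only.
step_1 = create_hive_catalog_sql("hive_catalog", "thrift://127.0.0.1:9083")
step_2 = load_from_catalog_sql(
    "internal.demo.orders",
    "hive_catalog.sales.orders",
    columns=("order_id", "amount"),
    where="dt = '2024-01-01'",
)
```

Selecting only the needed columns and filtering in the `WHERE` clause keeps the load "on demand" rather than copying the whole external table.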
Partial Column Update During Load
Doris supports partial column updates during data loading, which allows you to update only specific columns in a table without providing values for all columns. This capability is especially useful in the following scenarios:
- Updating a small number of fields in a wide table
- Performing incremental updates (Upsert on partial columns)
For details on how to perform partial column updates on Unique Key tables and Aggregate tables, see Column Update.
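As one possible illustration, a partial column update through Stream Load is typically requested with the `partial_columns` header plus a `columns` header naming the key columns and the columns being updated. The Python sketch below builds only those headers. The column names are hypothetical, and it assumes a Unique Key table that supports this feature.

```python
def partial_update_headers(key_columns, updated_columns, fmt="csv",
                           column_separator=","):
    """Build Stream Load headers that update only the listed columns."""
    # Key columns must be present so Doris can locate the rows to update.
    columns = list(key_columns) + [
        name for name in updated_columns if name not in key_columns
    ]
    headers = {
        "format": fmt,
        "partial_columns": "true",
        "columns": ",".join(columns),
    }
    if fmt == "csv":
        headers["column_separator"] = column_separator
    return headers


# Hypothetical wide table keyed by order_id; only status is rewritten.
headers = partial_update_headers(["order_id"], ["status"])
```

The load file then carries just `order_id` and `status` for each row, and the remaining columns of the wide table keep their existing values.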
Load Method Overview
Loading in Doris involves several aspects, including data sources, data formats, load methods, error data handling, data transformation, and transactions. The following table summarizes the suitable scenarios, supported file formats, and load modes for each method:
| Load method | Use case | Supported file formats | Load mode |
|---|---|---|---|
| Stream Load | Loading local files or application writes | csv, json, parquet, orc | Synchronous |
| Broker Load | Loading from object storage, HDFS, etc. | csv, json, parquet, orc | Asynchronous |
| INSERT INTO VALUES | Loading through interfaces such as JDBC | SQL | Synchronous |
| INSERT INTO SELECT | Loading from external tables, object storage, or HDFS | SQL | Synchronous |
| Routine Load | Real-time loading from Kafka | csv, json | Asynchronous |
| MySQL Load | Loading from local data | csv | Synchronous |
| Group Commit | High-frequency small-batch loading | Depends on the load method used | - |
| Streaming Job | Continuous loading from sources such as MySQL, PostgreSQL, and S3 | Depends on the data source | Asynchronous |
FAQ
Q1: Which load method should be chosen for high-concurrency, small-batch writes?
Enable Group Commit and use it together with JDBC INSERT or Stream Load. When concurrency exceeds 20, or when multiple writes occur within 1 minute, Group Commit can significantly reduce load pressure.
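On the JDBC side, Group Commit is usually enabled per session before issuing small INSERT statements. The Python sketch below shows that ordering against any DB-API style cursor connected over the MySQL protocol. The `SET group_commit = async_mode` statement and the `%s` parameter style are assumptions to verify against your Doris version and driver, and `RecordingCursor` is a hypothetical stand-in so the example runs without a database.

```python
class RecordingCursor:
    """Stand-in for a DB-API cursor; records statements instead of running them."""

    def __init__(self):
        self.statements = []

    def execute(self, sql, params=None):
        self.statements.append((sql, params))


def insert_with_group_commit(cursor, table, columns, rows):
    """Enable Group Commit for the session, then issue one INSERT per row."""
    # Assumption: this session variable turns on asynchronous Group Commit.
    cursor.execute("SET group_commit = async_mode")
    placeholders = ", ".join(["%s"] * len(columns))
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    for row in rows:
        cursor.execute(sql, tuple(row))
    return len(rows)


# Hypothetical table and rows, for illustration only.
cursor = RecordingCursor()
written = insert_with_group_commit(
    cursor, "demo.events", ["id", "msg"], [(1, "a"), (2, "b")]
)
```

With a real driver, the many small INSERT statements are merged by Doris into fewer internal commits, which is what reduces load pressure at high concurrency.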
Q2: What is the difference between Routine Load and the Doris Kafka Connector?
- Routine Load: Doris schedules tasks that actively pull data from Kafka. Supports csv and json formats.
- Doris Kafka Connector: Kafka actively pushes data into Doris. Supports avro, json, csv, and protobuf formats.
Q3: How do you load local files into Doris?
You can use Stream Load (suitable for small and medium files) or Doris Streamloader (suitable for large-file batch scenarios).
Q4: Can data from external data sources such as Hive or Iceberg be loaded into Doris?
Yes. First connect to the external data source through a Catalog, then use INSERT INTO SELECT to synchronously write the data into Doris. For asynchronous execution, combine it with JOB scheduling.
Q5: Are there transactional guarantees for loading?
Yes. Each load in Doris is by default an implicit transaction. For details, see Transaction.
Q6: How do you choose between Streaming Job and Flink CDC?
- Streaming Job: A built-in capability of Doris that does not depend on external components such as Flink or Kafka. It supports table-level and database-level sync for MySQL and PostgreSQL, as well as continuous loading from S3. Database-level sync can automatically create tables, while table-level sync provides exactly-once semantics and supports SQL processing.
- Flink CDC: Requires deploying a Flink cluster. It is suitable for scenarios that already have a Flink stream-processing system, require complex ETL processing, or need multi-target sync.
If you only need to continuously synchronize MySQL or PostgreSQL data into Doris and have no external stream-processing requirements, prefer Streaming Job. For details, see Continuous Load Overview.