Load Overview

Apache Doris provides multiple data load and integration methods to help you write data into the database from different sources. Starting from typical business scenarios, this document explains how to choose the most suitable solution among four categories: real-time writes, streaming sync, batch loading, and external data source integration.

Quick Navigation

Use the following table to find the recommended load method for your data source and timeliness requirements:

| Business scenario | Data source | Recommended load method |
| --- | --- | --- |
| Application real-time write (very small volume, every 5 minutes) | JDBC client | JDBC INSERT |
| Application high-concurrency or high-frequency small-batch write | JDBC / HTTP | Group Commit + JDBC INSERT or Stream Load |
| Application high-throughput write | HTTP | Stream Load |
| Real-time data stream ingestion | Flink | Flink Doris Connector |
| Real-time message queue ingestion | Kafka | Routine Load or Doris Kafka Connector |
| Transactional database real-time sync (no external components) | MySQL / PostgreSQL | Streaming Job continuous load |
| Transactional database CDC sync | MySQL / PostgreSQL, etc. | Flink CDC or DataX |
| Object storage continuous load (automatic incremental file load) | S3 | Streaming Job continuous load |
| Object storage / HDFS file batch load | S3 / OSS / HDFS | Broker Load or INSERT INTO SELECT |
| Local file batch load | Local disk | Stream Load or Doris Streamloader |
| External data source (data lake / external table) query and load | Hive / Iceberg / JDBC, etc. | Catalog + INSERT INTO SELECT |

Each load in Doris is by default an implicit transaction. For more transaction-related information, see Transaction.

Choosing a Load Method by Scenario

Real-Time Write: Direct Application Write

This applies when applications write data to Doris tables in real time over HTTP or JDBC, typically for workloads that need real-time analysis and queries.

  • Very small volume of data (about once every 5 minutes): Use JDBC INSERT to write data.
  • High concurrency or high frequency (more than 20 concurrent writes, or multiple writes within 1 minute): Enable Group Commit and use it together with JDBC INSERT or Stream Load.
  • High-throughput write: Use Stream Load to write data over the HTTP protocol.
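
The three cases above can be read as a small decision rule. The following is a minimal sketch in Python, not part of Doris: the function name and parameters are hypothetical, "multiple writes within 1 minute" is interpreted as more than one write per minute, and checking high throughput first is an assumption, since the text does not rank the cases.

```python
def choose_realtime_write_method(concurrent_writes: int,
                                 writes_per_minute: float,
                                 high_throughput: bool = False) -> str:
    """Return the load method suggested by the bullets above.

    Thresholds come from the text: more than 20 concurrent writes,
    or more than one write per minute, points to Group Commit.
    """
    # Assumption: a high-throughput workload takes precedence over the
    # concurrency/frequency check; the text does not state an order.
    if high_throughput:
        return "Stream Load"
    if concurrent_writes > 20 or writes_per_minute > 1:
        return "Group Commit + JDBC INSERT or Stream Load"
    # Very small volume, for example about one write every 5 minutes.
    return "JDBC INSERT"


if __name__ == "__main__":
    # One writer, one write every 5 minutes (0.2 writes per minute).
    print(choose_realtime_write_method(1, 0.2))
    # 50 concurrent writes.
    print(choose_realtime_write_method(50, 0.2))
```

Treat the result as a starting point only; actual row sizes and latency requirements may shift the choice.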

Streaming Sync: Real-Time Data Stream Ingestion

This applies to scenarios where data is continuously synchronized to Doris tables through real-time data streams (such as Flink, Kafka, or transactional database CDC).

  1. Flink real-time data stream

    Use the Flink Doris Connector to write Flink real-time data streams into Doris tables.

  2. Kafka real-time data stream

    Choose between Routine Load and the Doris Kafka Connector. The differences are as follows:

    | Method | Data flow direction | Supported formats |
    | --- | --- | --- |
    | Routine Load | Doris actively pulls data from Kafka | csv, json |
    | Kafka Connector | Kafka actively pushes data into Doris | avro, json, csv, protobuf |

  3. Transactional database CDC sync

    Use Flink CDC or DataX to write CDC data streams from transactional databases into Doris.

  4. Streaming Job continuous load (no external components)

    The built-in Streaming Job in Doris continuously reads data from sources such as MySQL, PostgreSQL, and S3 and writes it into Doris, without depending on external components such as Flink or Kafka. Two sync methods are supported:

    | Sync method | Underlying mechanism | Auto table creation | Semantic guarantee | Typical scenario |
    | --- | --- | --- | --- | --- |
    | Table-level sync | Job + TVF (INSERT INTO SELECT) | Pre-creation needed | exactly-once | Cases requiring column pruning, field renaming, type conversion, or conditional filtering |
    | Database-level sync | Job + native whole-database DDL | Auto-created on first run | at-least-once | Mirror replication of an entire database or a group of tables, with downstream table schemas automatically following the upstream |
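
For the Kafka case (item 2 in the list above), a Routine Load job is defined with a SQL statement that tells Doris which topic to pull from. The sketch below builds such a statement as a string. It is illustrative only: the helper name, database, job, table, broker, and topic names are hypothetical, and the exact set of properties a given Doris version accepts should be checked against its Routine Load reference.

```python
def routine_load_sql(db: str, job: str, table: str,
                     brokers: str, topic: str, fmt: str = "csv") -> str:
    """Build a CREATE ROUTINE LOAD statement for a Kafka topic.

    Only csv and json are accepted, matching the formats the text
    lists for Routine Load.
    """
    if fmt not in ("csv", "json"):
        raise ValueError(f"Routine Load format must be csv or json, got {fmt!r}")

    lines = [f"CREATE ROUTINE LOAD {db}.{job} ON {table}"]
    if fmt == "csv":
        # Assumption: comma-separated input; adjust to the real delimiter.
        lines.append('COLUMNS TERMINATED BY ","')
    lines += [
        f'PROPERTIES ("format" = "{fmt}")',
        "FROM KAFKA (",
        f'    "kafka_broker_list" = "{brokers}",',
        f'    "kafka_topic" = "{topic}"',
        ");",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical names throughout.
    print(routine_load_sql("demo_db", "kafka_job", "events",
                           "broker1:9092", "events_topic", fmt="json"))
```

The generated statement would be submitted through any MySQL-protocol client connected to Doris; formats such as avro or protobuf are rejected here because the text assigns them to the Doris Kafka Connector instead.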

Batch Load: Loading Files from External Storage

This applies to non-real-time scenarios where files in external storage systems (such as object storage, HDFS, local files, or NAS) are loaded in batches into Doris tables.

  • Object storage / HDFS files: Use Broker Load to write data into Doris.
  • Object storage / HDFS / NAS files (synchronous or asynchronous): Use INSERT INTO SELECT for synchronous writes. For asynchronous execution, combine it with JOB scheduling.
  • Local files: Use Stream Load or Doris Streamloader to write data into Doris.
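
As an illustration of the INSERT INTO SELECT option above, the following sketch builds a statement that reads a file through the S3 table-valued function and writes it into a Doris table. The helper name, table name, bucket path, endpoint, region, and credential placeholders are all hypothetical, and the property names reflect the S3 TVF as commonly documented; verify them against your Doris version before use.

```python
def s3_insert_select_sql(target_table: str, uri: str, fmt: str,
                         endpoint: str, region: str,
                         access_key: str, secret_key: str) -> str:
    """Build INSERT INTO ... SELECT * FROM S3(...) as a string."""
    props = {
        "uri": uri,
        "format": fmt,
        "s3.endpoint": endpoint,
        "s3.region": region,
        "s3.access_key": access_key,
        "s3.secret_key": secret_key,
    }
    # Render each property as "key" = "value", one per line.
    rendered = ",\n".join(f'    "{k}" = "{v}"' for k, v in props.items())
    return f"INSERT INTO {target_table}\nSELECT * FROM S3(\n{rendered}\n);"


if __name__ == "__main__":
    # Placeholder values only; never hard-code real credentials.
    print(s3_insert_select_sql(
        "demo_db.orders",
        "s3://example-bucket/orders/2024-01-01.parquet",
        "parquet",
        "s3.example-region.example.com",
        "example-region",
        "<access_key>",
        "<secret_key>",
    ))
```

Run synchronously, the statement returns once the load finishes. For the asynchronous variant the text describes, the same statement would be wrapped in a scheduled JOB.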

External Data Source Integration: Catalog Federated Query and Load

This applies to scenarios where you integrate with external data sources (such as Hive, JDBC, or Iceberg) to query external data and load it on demand into Doris tables.

  • Create a Catalog to read data from external data sources.
  • Use INSERT INTO SELECT to synchronously write data from the external data source into Doris. For asynchronous execution, combine it with JOB scheduling.
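
The two steps above can be sketched as a pair of SQL statements, built here as strings. This is a minimal example assuming a Hive Metastore source: the helper name, catalog name, metastore URI, and table names are hypothetical, and other source types (Iceberg, JDBC) need different catalog properties.

```python
from typing import List


def hive_catalog_load_sql(catalog: str, metastore_uri: str,
                          source_table: str, target_table: str) -> List[str]:
    """Return the statements for a Catalog-based load.

    Step 1 creates a Catalog that points at a Hive Metastore.
    Step 2 copies rows from the external table into a Doris table.
    """
    create_catalog = (
        f"CREATE CATALOG {catalog} PROPERTIES (\n"
        f'    "type" = "hms",\n'
        f'    "hive.metastore.uris" = "{metastore_uri}"\n'
        f");"
    )
    # source_table is given as db.table inside the external catalog;
    # target_table is a fully qualified Doris table.
    insert_select = (
        f"INSERT INTO {target_table}\n"
        f"SELECT * FROM {catalog}.{source_table};"
    )
    return [create_catalog, insert_select]


if __name__ == "__main__":
    for stmt in hive_catalog_load_sql(
            "hive_demo", "thrift://metastore.example.internal:9083",
            "sales.orders", "internal.demo_db.orders"):
        print(stmt)
        print()
```

In practice the SELECT would usually name columns and filter partitions rather than copy the whole table, so that only the data needed on demand is loaded.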

Partial Column Update During Load

Doris supports partial column updates during data loading, which allows you to update only specific columns in a table without providing values for all columns. This capability is especially useful in the following scenarios:

  • Updating a small number of fields in a wide table
  • Performing incremental updates (Upsert on partial columns)

For details on how to perform partial column updates on Unique Key tables and Aggregate tables, see Column Update.
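
As a rough illustration, a partial column update through Stream Load is typically requested with HTTP headers that name the columns present in the file and switch on partial-update mode. The sketch below only assembles those headers. The helper name and column names are hypothetical, and the `partial_columns` and `columns` header names are an assumption based on common Stream Load usage; confirm them in the Column Update documentation for your version.

```python
from typing import Dict, Sequence


def partial_update_headers(columns: Sequence[str],
                           separator: str = ",") -> Dict[str, str]:
    """Build Stream Load headers for updating only the given columns.

    The column list is expected to include the table's key columns
    so that Doris can locate the rows to update.
    """
    if not columns:
        raise ValueError("at least one column is required")
    return {
        "format": "csv",
        "column_separator": separator,
        # Columns carried by the uploaded file, in file order.
        "columns": ",".join(columns),
        # Assumption: this header enables partial column update.
        "partial_columns": "true",
    }


if __name__ == "__main__":
    # Update only `status` for rows identified by `order_id`.
    print(partial_update_headers(["order_id", "status"]))
```

Columns that are not listed keep their existing values, which is what makes this useful for wide tables where only a few fields change.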

Load Method Overview

Loading in Doris involves several aspects, including data sources, data formats, load methods, error data handling, data transformation, and transactions. The following table summarizes the suitable scenarios, supported file formats, and load modes for each method:

| Load method | Use case | Supported file formats | Load mode |
| --- | --- | --- | --- |
| Stream Load | Loading local files or application writes | csv, json, parquet, orc | Synchronous |
| Broker Load | Loading from object storage, HDFS, etc. | csv, json, parquet, orc | Asynchronous |
| INSERT INTO VALUES | Loading through interfaces such as JDBC | SQL | Synchronous |
| INSERT INTO SELECT | Loading from external tables, object storage, or HDFS | SQL | Synchronous |
| Routine Load | Real-time loading from Kafka | csv, json | Asynchronous |
| MySQL Load | Loading from local data | csv | Synchronous |
| Group Commit | High-frequency small-batch loading | Depends on the load method used | - |
| Streaming Job | Continuous loading from sources such as MySQL, PostgreSQL, and S3 | Depends on the data source | Asynchronous |

FAQ

Q1: Which load method should be chosen for high-concurrency, small-batch writes?

Enable Group Commit and use it together with JDBC INSERT or Stream Load. When concurrency exceeds 20, or when multiple writes occur within 1 minute, Group Commit can significantly reduce load pressure.
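
For the Stream Load path, Group Commit is usually switched on per request with an HTTP header. The following sketch assembles the URL and headers without sending anything. The helper name, host, database, table, and user are hypothetical; port 8030 is assumed to be the FE HTTP port, and the `group_commit` header values `async_mode` and `sync_mode` are an assumption to verify against the Group Commit documentation. No `label` header is set, on the assumption that Group Commit manages labels itself.

```python
import base64
from typing import Dict, Tuple


def group_commit_stream_load(fe_host: str, db: str, table: str,
                             user: str, password: str,
                             http_port: int = 8030,
                             mode: str = "async_mode") -> Tuple[str, Dict[str, str]]:
    """Return (url, headers) for a Stream Load request with Group Commit."""
    if mode not in ("async_mode", "sync_mode"):
        raise ValueError(f"unsupported group commit mode: {mode!r}")

    # HTTP Basic authentication, as used by Stream Load.
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    url = f"http://{fe_host}:{http_port}/api/{db}/{table}/_stream_load"
    headers = {
        "Authorization": f"Basic {token}",
        "Expect": "100-continue",
        "format": "csv",
        "column_separator": ",",
        # Assumption: this header enables Group Commit for the request.
        "group_commit": mode,
    }
    return url, headers


if __name__ == "__main__":
    url, headers = group_commit_stream_load(
        "fe.example.internal", "demo_db", "events", "load_user", "secret")
    print(url)
    print(headers["group_commit"])
```

The request body would be the CSV payload sent with HTTP PUT. For the JDBC INSERT path, the equivalent switch is generally a session-level setting rather than a header.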

Q2: What is the difference between Routine Load and the Doris Kafka Connector?

  • Routine Load: Doris schedules tasks that actively pull data from Kafka. Supports csv and json formats.
  • Doris Kafka Connector: Kafka actively pushes data into Doris. Supports avro, json, csv, and protobuf formats.

Q3: How do you load local files into Doris?

You can use Stream Load (suitable for small and medium files) or Doris Streamloader (suitable for large-file batch scenarios).

Q4: Can data from external data sources such as Hive or Iceberg be loaded into Doris?

Yes. First connect to the external data source through a Catalog, then use INSERT INTO SELECT to synchronously write the data into Doris. For asynchronous execution, combine it with JOB scheduling.

Q5: Are there transactional guarantees for loading?

Yes. Each load in Doris is by default an implicit transaction. For details, see Transaction.

Q6: How do you choose between Streaming Job and Flink CDC?

  • Streaming Job: A built-in capability of Doris that does not depend on external components such as Flink or Kafka. It supports table-level and database-level sync for MySQL and PostgreSQL, as well as continuous loading from S3. Database-level sync can automatically create tables, while table-level sync provides exactly-once semantics and supports SQL processing.
  • Flink CDC: Requires deploying a Flink cluster. It is suitable for scenarios that already have a Flink stream-processing system, require complex ETL processing, or need multi-target sync.

If you only need to continuously synchronize MySQL or PostgreSQL data into Doris and have no external stream-processing requirements, prefer Streaming Job. For details, see Continuous Load Overview.