Skip to main content

System Architecture

This document describes the system architecture of Apache Doris, including the core components and their interaction logic in two deployment modes:

  • Integrated storage-compute architecture: The classic FE + BE architecture, where data storage and computation are integrated.
  • Decoupled storage-compute architecture: A three-layer separation of metadata, compute, and storage.

Use cases: Architecture selection, architecture learning, and operational understanding.

Integrated Storage-Compute Architecture

The integrated storage-compute architecture is the classic deployment mode of Apache Doris. It consists of two types of processes: Frontend (FE) and Backend (BE).

Integrated storage-compute architecture

Core Components

Frontend (FE) Node

The FE is the entry node of Apache Doris and handles coordination and control:

ResponsibilityDescription
User request handlingCompatible with the MySQL protocol and supports standard SQL
Query parsing and planningLexical analysis -> semantic analysis -> logical plan generation -> CBO optimization -> execution dispatch
Metadata managementDatabase and table schemas, replica distribution, user privileges, cluster topology, and load status
Node managementHeartbeat detection, load balancing, replica repair, and scale-out and scale-in management

The FE uses BDB JE as the metadata storage engine and supports transactional features. It does not depend on external components such as ZooKeeper, which simplifies deployment and maintenance.

Backend (BE) Node

The BE is the compute and storage node:

FeatureDescription
Columnar storageData is organized by column, combined with encoding and compression to improve I/O efficiency
Data sharding (Tablet)Data is sharded horizontally; the tablet is the smallest unit of replica scheduling
Multiple replicasEach tablet has 3 replicas by default, distributed across different BE nodes
Vectorized executionColumnar memory layout combined with SIMD acceleration delivers 5 to 10x performance on wide-table aggregations
Pipeline engineMulti-core parallelism with thread count limits to prevent thread explosion

FE High Availability

In production environments, multiple FE nodes are deployed. The role types are as follows:

RoleResponsibilityParticipates in election
MasterReads and writes metadata, and synchronizes to Followers and ObserversYes
FollowerReads metadata and participates in election when the Master failsYes
ObserverReads metadata, only extending query concurrency capabilityNo

Metadata changes require confirmation from a majority of nodes to ensure consistency.

Architectural Characteristics

  • Simple and easy to maintain: Only two types of processes, FE and BE.
  • High performance: Compute nodes access local storage directly, with low network overhead.
  • High availability: Multiple replicas combined with automatic fault isolation.
  • Horizontal scaling: Both FE and BE support online scale-out.

Decoupled Storage-Compute Architecture

Introduced in version 3.0, this architecture fully separates the compute layer from the storage layer, supporting independent elastic scaling.

Decoupled storage-compute architecture

Core Components

In the decoupled storage-compute architecture, the FE node is retained and continues to handle user request entry and query parsing. A new Meta Service is added to manage metadata at the data layer.

FE Frontend Node

The responsibilities of the FE remain unchanged in the decoupled storage-compute architecture:

ResponsibilityDescription
User request handlingCompatible with the MySQL protocol, supports standard SQL, and handles authentication and privilege validation
Query parsing and planningLexical analysis -> semantic analysis -> logical plan -> CBO optimization -> execution dispatch
SQL-layer metadataDatabase and table schemas, user privileges, and cluster topology

Metadata Layer (Meta Service)

A stateless service that scales horizontally.

ResponsibilityDescription
Data ingestion transactionsVersion management and conflict detection
Tablet metadataData versions and file lists
Rowset metadataIncremental data information used for recovery and garbage collection
Cluster resourcesResource allocation and scheduling for compute groups

Compute Layer

The compute layer consists of multiple compute clusters, each containing several stateless BE nodes. Compute clusters share the same data but have independent compute resources.

FeatureDescription
Independent resourcesEach compute cluster serves a different business workload independently
Stateless BEDoes not store data persistently; only caches hot data
Elastic scalingAdding or removing nodes does not affect other compute clusters
Local cacheHot data is cached using an LRU policy to reduce access latency

Shared Storage Layer

The storage layer persists all data files, including segment files and inverted index files.

Supported typeExamples
Object storageS3, OSS, COS, OBS, MinIO
Distributed file systemHDFS
CephRGW, CephFS

Architectural Characteristics

  • Elastic compute: Resources scale on demand, suitable for workloads with peaks and troughs.
  • Workload isolation: Different business teams share data while keeping compute resources independent.
  • Low storage cost: Low-cost options such as object storage are available.
  • Minute-level scaling: Compute resources can be adjusted quickly.

Comparison and Selection

Component Function Comparison

The two architectures differ in component responsibilities as follows:

Comparison itemIntegrated storage-computeDecoupled storage-compute
FE nodeRetained, stores all metadataRetained, stores only SQL-layer metadata
BE nodeStateful (stores data)Stateless (only caches hot data)
Data storage locationBE local diskShared storage layer, with BE local disk used as cache
Scaling focusStorage and compute scale togetherCompute and storage can scale independently
Storage costHigher (SSD)Lower (object storage)
Operational complexityLowerHigher (depends on external storage)
Query latencyLower (local I/O)Slightly higher (on cache miss). On cache hit, latency is the same as the integrated storage-compute architecture.

Selection Guidance

Choose the appropriate architecture based on your scenario:

ScenarioIntegrated storage-computeDecoupled storage-compute
Development and test environments, quick experimentation
No shared storage available (HDFS / Ceph / object storage)
No dedicated DBA, with multiple teams maintaining independently
No need for elastic scaling or Kubernetes containerization
Already deployed on a public cloud
Reliable shared storage system available
Kubernetes containerization or private cloud elasticity required
Multi-compute-cluster shared data scenarios
Dedicated platform team for maintenance

Core Technical Modules

Storage Engine

Columnar storage and compression

Data is organized by column, so only the columns involved in a query are read, reducing I/O. Combined with algorithms such as dictionary encoding, bitmap compression, and RLE, this achieves a high compression ratio.

Index structures

Index typeUse case
Sorted composite key (up to 3 columns)Pruning for high-concurrency reports
Min/MaxEquality and range filtering on numeric types
BloomFilterEquality filtering on high-cardinality columns
Inverted indexFast search and full-text search on arbitrary fields

Data models

ModelCharacteristicsUse case
Duplicate modelRetains detailed dataDetail storage for fact tables
Primary Key modelUnique primary key; rows with the same primary key are overwrittenRow-level updates
Aggregate modelRows with the same primary key are aggregated automaticallyPre-aggregation acceleration

Query Engine

MPP distributed queries

Complex queries are decomposed into multiple stages and processed in parallel across multiple BE nodes. Distributed shuffle joins are supported to handle joins on large tables efficiently.

Vectorized execution

A columnar memory layout combined with SIMD instructions delivers 5 to 10x performance on wide-table aggregations.

Pipeline execution engine

Multi-core parallelism with thread count limits prevents thread explosion. Data copies and memory allocation overhead between operators are reduced.

Query optimizer

OptimizerStrategy
RBOConstant folding, subquery rewriting, and predicate pushdown
CBOCost estimation and join reordering
HBOUses historical queries to accelerate repeated queries

Runtime Filter

Filters are generated dynamically at runtime and pushed down to scan nodes to reduce the amount of data processed. Supported types include In, Min/Max, and BloomFilter.

High Availability Mechanism

Multiple replicas and the quorum protocol

  • 3 replicas are stored by default.
  • A write must be confirmed by a majority (such as 2 replicas) to succeed.
  • Partial node failures do not lead to data loss, and the cluster remains available.

Automatic fault isolation

  1. Detect node heartbeat timeouts or replica corruption.
  2. Mark the node as unavailable and stop dispatching tasks to it.
  3. Automatically rebuild missing replicas from healthy replicas.
  4. After recovery, automatically synchronize incremental data.

FE high availability

A Paxos-like consensus protocol ensures metadata consistency. When the Master fails, Followers automatically elect a new Master, transparent to the user.


FAQ

Q: What is the core difference between the integrated and decoupled storage-compute architectures?

In the integrated storage-compute architecture, BE nodes handle both data storage and computation, and data is stored on local disks. In the decoupled storage-compute architecture, data is stored in a shared storage layer (such as S3 or HDFS), and BE nodes act as stateless compute nodes that accelerate queries through a local cache. The decoupled architecture supports independent elastic scaling of compute and storage resources.

Q: When should you choose the decoupled storage-compute architecture?

The decoupled storage-compute architecture is suitable for the following scenarios: existing public cloud deployments, Kubernetes containerization requirements, the availability of a reliable shared storage system (HDFS, Ceph, or object storage), the need for multiple compute clusters to share data, and a dedicated platform team for maintenance. For simple use cases or development and test environments, the integrated storage-compute architecture is more suitable.

Q: Do BE nodes store data in the decoupled storage-compute architecture?

In the decoupled storage-compute architecture, BE nodes are stateless and do not persist data themselves. However, BE nodes use a local SSD to cache hot data (with an LRU eviction policy) to overcome the latency caused by the relatively poor random read performance of object storage and the network transfer overhead.

Q: What is the role of the FE in the decoupled storage-compute architecture?

The FE is retained in the decoupled storage-compute architecture and is mainly responsible for user request entry, SQL parsing and planning, and SQL-layer metadata management (database and table schemas, user privileges, and cluster topology). Metadata management at the data layer (such as tablet metadata and rowset metadata) is handled by the newly added Meta Service.

Q: How is high availability implemented in Doris?

Doris implements high availability through multiple replicas and a quorum protocol: each tablet has 3 replicas by default, and a write must be confirmed by a majority. When some nodes fail, they are automatically isolated and replicas are rebuilt from healthy replicas. The FE uses a Paxos-like consensus protocol to ensure metadata consistency, and a new Master is automatically elected when the current Master fails.

Q: What is the role of Meta Service in the decoupled storage-compute architecture?

Meta Service is a stateless service in the decoupled storage-compute architecture that is dedicated to managing metadata at the data layer. Its responsibilities include data ingestion transaction processing (version management and conflict detection), tablet and rowset metadata management, and resource allocation and scheduling for compute clusters. Because it is stateless, it scales horizontally to improve metadata processing capacity.


Summary

Apache Doris provides two architectures to fit different needs:

ArchitectureUse case
Integrated storage-computePerformance-first, limited operational resources, and manageable scale
Decoupled storage-computeCloud-native, elastic scaling, and shared data across multiple teams

Both architectures provide a complete high availability mechanism to ensure data reliability and service stability.