The Apache Software Foundation
Apache Doris
GitHub repository

Apache Doris (incubating)

Doris is a MPP-based interactive SQL data warehousing for reporting and analysis.

1. Overview

Doris’s implementation consists of two daemons: Frontend (FE) and Backend (BE).


Frontend daemon consists of query coordinator and catalog manager. Query coordinator is responsible for receiving users’ sql queries, compiling queries and managing queries execution. Catalog manager is responsible for managing metadata such as databases, tables, partitions, replicas and etc. Several frontend daemons could be deployed to guarantee fault-tolerance, and load balancing.

Backend daemon stores the data and executes the query fragments. Many backend daemons could also be deployed to provide scalability and fault-tolerance.

A typical Doris cluster generally composes of several frontend daemons and dozens to hundreds of backend daemons.

Users can use MySQL client tools to connect any frontend daemon to submit SQL query. Frontend receives the query and compiles it into query plans executable by the Backend. Then Frontend sends the query plan fragments to Backend. Backend will build a query execution DAG. Data is fetched and pipelined into the DAG. The final result response is sent to client via Frontend. The distribution of query fragment execution takes minimizing data movement and maximizing scan locality as the main goal.

2. Background

At Baidu, Prior to Doris, different tools were deployed to solve diverse requirements in many ways. And when a use case requires the simultaneous availability of capabilities that cannot all be provided by a single tool, users were forced to build hybrid architectures that stitch multiple tools together, but we believe that they shouldn’t need to accept such inherent complexity. A storage system built to provide great performance across a broad range of workloads provides a more elegant solution to the problems that hybrid architectures aim to solve. Doris is the solution.

Doris is designed to be a simple and single tightly coupled system, not depending on other systems. Doris provides high concurrent low latency point query performance, but also provides high throughput queries of ad-hoc analysis. Doris provides bulk-batch data loading, but also provides near real-time mini-batch data loading. Doris also provides high availability, reliability, fault tolerance, and scalability.

3. Rationale

Doris mainly integrates the technology of Google Mesa and Apache Impala. Unlike other popular SQL-on-Hadoop systems, it is designed to be a simple and single tightly coupled system, not depending on other systems.

Mesa is a highly scalable analytic data storage system that stores critical measurement data related to Google’s Internet advertising business. Mesa is designed to satisfy complex and challenging set of users’ and systems’ requirements, including near real-time data ingestion and query ability, as well as high availability, reliability, fault tolerance, and scalability for large data and query volumes.

Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. At present, by virtue of its superior performance and rich functionality, Impala has been comparable to many commercial MPP database query engine. Mesa can satisfy the needs of many of our storage requirements, however Mesa itself does not provide a SQL query engine; Impala is a very good MPP SQL query engine, but the lack of a perfect distributed storage engine. So in the end we chose the combination of these two technologies.

Learning from Mesa’s data model, we developed a distributed storage engine. Unlike Mesa, this storage engine does not rely on any distributed file system. Then we deeply integrate this storage engine with Impala query engine. Query compiling, query execution coordination and catalog management of storage engine are integrated to be frontend daemon; query execution and data storage are integrated to be backend daemon. With this integration, we implemented a single, full-featured, high performance state the art of MPP database, as well as maintaining the simplicity.