Skip to main content

EXPLAIN

Description​

The EXPLAIN statement displays Doris's query execution plan for a given query. Doris's query optimizer aims to create an efficient plan using statistical data, data characteristics, and features like HASH JOIN, partitioning, and bucketing. However, due to theoretical and practical constraints, the plan may sometimes under perform.

To improve performance, it's essential to analyze the current plan. This article teaches how to use the EXPLAIN statement for optimization.

Syntax​

{EXPLAIN | DESC} [VERBOSE] <query_block>

Required Parameters​

<query_block>

This is the query statement for which you want the explain plan.

Optional Parameters​

[VERBOSE]

Whether to display detailed information is determined by the VERBOSE specification. With VERBOSE, comprehensive details are shown, including specifics on each operator, the tuple IDs they use, and detailed descriptions for each tuple. Without it, concise information is provided.

Return Results​

Basic Concepts​

To better understand the information displayed by EXPLAIN, let's introduce a few core concepts of the Doris execution plan.

NameExplanation
PLANExecution plan. A query is translated into an execution plan by the execution planner, which is then executed by the execution engine.
FRAGMENTExecution fragment. Since Doris is a distributed execution engine, a complete execution plan is divided into multiple single-node execution fragments. A FRAGMENT table represents a complete single-node execution fragment. Multiple FRAGMENTS combine to form a complete PLAN.
PLAN NODEOperator. The smallest unit of the execution plan. A FRAGMENT consists of multiple operators. Each operator is responsible for a specific execution logic, such as aggregation, joins, etc.

Return Result Structure​

The result of the Doris EXPLAIN statement is a complete PLAN. Within the PLAN, FRAGMENTS are ordered from back to front based on the execution sequence. Within each FRAGMENT, operators (PLAN NODES) are also ordered from back to front based on the execution sequence.

An example is provided below:

+--------------------------------------------------+
| Explain String(Nereids Planner) |
+--------------------------------------------------+
| PLAN FRAGMENT 0 |
| OUTPUT EXPRS: |
| cnt[#10] |
| cnt[#11] |
| PARTITION: UNPARTITIONED |
| |
| HAS_COLO_PLAN_NODE: false |
| |
| VRESULT SINK |
| MYSQL_PROTOCAL |
| |
| 7:VEXCHANGE |
| offset: 0 |
| distribute expr lists: |
| |
| PLAN FRAGMENT 1 |
| |
| PARTITION: RANDOM |
| |
| HAS_COLO_PLAN_NODE: false |
| |
| STREAM DATA SINK |
| EXCHANGE ID: 07 |
| UNPARTITIONED |
| |
| 6:VHASH JOIN(354) |
| | join op: INNER JOIN(BROADCAST)[] |
| | equal join conjunct: cnt[#7] = cnt[#5] |
| | cardinality=1 |
| | vec output tuple id: 8 |
| | vIntermediate tuple ids: 7 |
| | hash output slot ids: 5 7 |
| | distribute expr lists: |
| | distribute expr lists: |
| | |
| |----4:VEXCHANGE |
| | offset: 0 |
| | distribute expr lists: |
| | |
| 5:VEXCHANGE |
| offset: 0 |
| distribute expr lists: |
| |
| PLAN FRAGMENT 2 |
| ... |
| |
| PLAN FRAGMENT 3 |
| ... |
+--------------------------------------------------+

Operators are linked to their child nodes with dashed lines. When an operator has multiple children, they are arranged vertically, representing a right-to-left order. In the example above, operator 6 (VHASH JOIN) has operator 5 (EXCHANGE) as its left child and operator 4 (EXCHANGE) as its right child.

Fragment Field Descriptions​

NameDescription
PARTITIONDisplays the data distribution of the current Fragment
HAS_COLO_PLAN_NODEIndicates if the fragment contains colocate operators
SinkThe method of fragment data output, see the table below for details

Sink Methods

NameDescription
STREAM DATA SINKOutputs data to the next Fragment. It includes two lines of information.
First line: The downstream EXCHANGE NODE to which data is sent.
Second line: The method of data distribution.
- UNPARTITIONED means each downstream instance receives the full data set. This typically occurs in broadcast joins or when single-instance logic is required, such as global limit or order by.
- RANDOM means each downstream instance receives a random subset of data without repetition.
- HASH_PARTITIONED uses the listed slots as keys to hash and send data shards to the same downstream instance. This is often used upstream of partition hash joins or the second stage of two-phase aggregations.
RESULT SINKSends result data to the FE. The first line indicates the protocol used for data transmission, currently supporting MySQL and arrow protocols.
OLAP TABLE SINKWrites data to an OLAP table.
MultiCastDataSinksA multicast operator that contains multiple STREAM DATA SINKs. Each STREAM DATA SINK sends the full data set to its downstream.

Tuple Information Description​

When using VERBOSE mode, Tuple information is output. Tuple information describes the SLOT details within a row of data, including SLOT type, nullable status, etc.

The output contains multiple TupleDescriptors, each containing multiple SlotDescriptors. An example is shown below:

Tuples:
TupleDescriptor{id=0, tbl=t1}
SlotDescriptor{id=0, col=c1, colUniqueId=0, type=int, nullable=true, isAutoIncrement=false, subColPath=null}
SlotDescriptor{id=2, col=c3, colUniqueId=2, type=int, nullable=true, isAutoIncrement=false, subColPath=null}

TupleDescriptor​

NameDescription
idThe id of the tuple descriptor
tblThe corresponding table of the tuple, or null if not applicable

SlotDescriptor​

NameDescription
idThe id of the slot descriptor
colThe corresponding column of the slot, or left blank if not applicable
colUniqueIdThe unique id of the corresponding column, or -1 if not applicable
typeThe type of the slot
nullableIndicates if the corresponding data can be null
isAutoIncrementIndicates if the column is auto-incremented
subColPathThe sub-column path within the column, currently only applies to variant types

Operator Descriptions​

Operator List​

NameDescription
AGGREGATEAggregation operator
ANALYTICWindow function operator
ASSERT NUMBER OF ROWSOperator to check the number of downstream output rows
EXCHANGEData exchange receiver operator
MERGING-EXCHANGEData exchange receiver with sorting and row limit functionality
HASH JOINHash join operator
NESTED LOOP JOINNested loop join operator
PartitionTopNIntra-partition data pre-filtering operator
REPEAT_NODEData replication operator
DataGenScanNodeTable-valued function operator
EsScanNodeES table scan operator
HIVE_SCAN_NODEHive table scan operator
HUDI_SCAN_NODEHudi table scan operator
ICEBERG_SCAN_NODEIceberg table scan operator
PAIMON_SCAN_NODEPaimon table scan operator
JdbcScanNodeJdbc table scan operator
OlapScanNodeOlap table scan operator
SELECTFiltering operator
UNIONSet union operator
EXCEPTSet difference operator
INTERSECTSet intersection operator
SORTSorting operator
TOP-NSort and return top N results operator
TABLE FUNCTION NODETable function operator (lateral view)

Common Fields​

NameDescription
limitLimits the number of output rows
offsetNumber of rows to skip before outputting
conjunctsFilters the results of the current node. Executed before projections.
projectionsProjection operations after the current operator. Executed after conjuncts.
project output tuple idThe output tuple after projection. The slot arrangement within the data tuple can be seen via tuple desc.
cardinalityEstimated row count by the optimizer
distribute expr listsThe original data distribution method for the child nodes of the current node
Expression's slot idThe specific slot corresponding to the slot id can be found in the tuple list in verbose mode. This list provides information such as slot type and nullable attributes. Represented as [#5] after the expression.

AGGREGATE​

NameDescription
(Aggregation Phase)The aggregation phase is represented by two terms.
The first term can be either update (local aggregation) or merge (global aggregation).
The second term indicates whether the current data is serialized (serialize) or has completed final calculations (finalize).
STREAMINGOnly local aggregation operators in multi-stage aggregation truncation have this flag. Indicates that the current aggregation node may use STREAMING mode, where input data is passed directly to the next stage of aggregation without actual computation.
outputThe output of the current aggregation operator. All local pre-aggregation functions are prefixed with partial.
group byThe key for aggregation

ANALYTIC​

NameDescription
functionsThe name of the current window function
partition byCorresponds to the partition by clause in the over clause of the window function. Windowing expression.
order bySorting expression and order within the window
windowWindow range

ASSERT NUMBER OF ROWS​

NameDescription
EQThe downstream output must match this row count constraint

HASH JOIN​

NameDescription
join opType of join
equal join conjunctEquality condition in the join condition
other join predicatesConditions in the join condition, excluding equality
mark join predicatesConditions used in mark join
other predicatesFiltering predicates after join execution
runtime filtersGenerated runtime filters
output slot idsList of final output slots
hash output slot idsList of output slots after hash join execution, but before other join conditions are applied
isMarkJoinIndicates whether it is a mark join

NESTED LOOP JOIN​

NameDescription
join opType of join operation
join conjunctsConditions for joining
mark join predicatesConditions used in mark join
predicatesFilter predicates after join
runtime filtersGenerated runtime filters
output slot idsList of final output slots
isMarkJoinWhether it is a mark join

PartitionTopN​

NameDescription
functionsWindow functions applying grouped filter optimization
has global limitPresence of a global limit on the number of rows
partition limitLimit on the number of rows within each partition
partition topn phaseCurrent phase: TWO_PHASE_GLOBAL_PTOPN for global phase after shuffling by partition key, TWO_PHASE_LOCAL_PTOPN for local phase before shuffling by partition key

REPEAT_NODE​

NameDescription
repeatNumber of repetitions for each row and corresponding slot ids for aggregation columns
exprsList of expressions for output data after repetition

DataGenScanNode​

NameDescription
table value functionTable function name

EsScanNode​

NameDescription
SORT COLUMNColumns for sorting results
LOCAL_PREDICATESFilters executed within Doris
REMOTE_PREDICATESFilters executed within ES
ES index/typeES index and type for querying

HIVE_SCAN_NODE​

NameDescription
inputSplitNumNumber of scan splits
totalFileSizeTotal file size being scanned
scanRangesInformation on scan splits
partitionNumber of partitions being scanned
backendsSpecific data info for each BE to scan
cardinalityEstimated number of rows by optimizer
avgRowSizeEstimated average row size by optimizer
numNodesNumber of BEs used by the current operator
pushdown aggAggregations pushed down to scan

HUDI_SCAN_NODE​

NameDescription
inputSplitNumNumber of scan splits
totalFileSizeTotal file size being scanned
scanRangesInformation on scan splits
partitionNumber of partitions being scanned
backendsSpecific data info for each BE to scan
cardinalityEstimated number of rows by optimizer
avgRowSizeEstimated average row size by optimizer
numNodesNumber of BEs used by the current operator
pushdown aggAggregations pushed down to scan
hudiNativeReadSplitsNumber of splits read using native method

ICEBERG_SCAN_NODE​

NameDescription
inputSplitNumNumber of scan splits
totalFileSizeTotal file size being scanned
scanRangesInformation on scan splits
partitionNumber of partitions being scanned
backendsSpecific data info for each BE to scan
cardinalityEstimated number of rows by optimizer
avgRowSizeEstimated average row size by optimizer
numNodesNumber of BEs used by the current operator
pushdown aggAggregations pushed down to scan
icebergPredicatePushdownFilters pushed down to iceberg API

PAIMON_SCAN_NODE​

NameDescription
inputSplitNumNumber of scan splits
totalFileSizeTotal file size being scanned
scanRangesInformation on scan splits
partitionNumber of partitions being scanned
backendsSpecific data info for each BE to scan
cardinalityEstimated number of rows by optimizer
avgRowSizeEstimated average row size by optimizer
numNodesNumber of BEs used by the current operator
pushdown aggAggregations pushed down to scan
paimonNativeReadSplitsNumber of splits read using native method

NESTED LOOP JOIN​

NameDescription
join opType of join operation
join conjunctsConditions for joining
mark join predicatesConditions used in mark join
predicatesFilter predicates after join
runtime filtersGenerated runtime filters
output slot idsList of final output slots
isMarkJoinWhether it is a mark join

PartitionTopN​

NameDescription
functionsWindow functions applying grouped filter optimization
has global limitPresence of a global limit on the number of rows
partition limitLimit on the number of rows within each partition
partition topn phaseCurrent phase: TWO_PHASE_GLOBAL_PTOPN for global phase after shuffling by partition key, TWO_PHASE_LOCAL_PTOPN for local phase before shuffling by partition key

REPEAT_NODE​

NameDescription
repeatNumber of repetitions for each row and corresponding slot ids for aggregation columns
exprsList of expressions for output data after repetition

DataGenScanNode​

NameDescription
table value functionTable function name

EsScanNode​

NameDescription
SORT COLUMNColumns for sorting results
LOCAL_PREDICATESFilters executed within Doris
REMOTE_PREDICATESFilters executed within ES
ES index/typeES index and type for querying

HIVE_SCAN_NODE​

NameDescription
inputSplitNumNumber of scan splits
totalFileSizeTotal file size being scanned
scanRangesInformation on scan splits
partitionNumber of partitions being scanned
backendsSpecific data info for each BE to scan
cardinalityEstimated number of rows by optimizer
avgRowSizeEstimated average row size by optimizer
numNodesNumber of BEs used by the current operator
pushdown aggAggregations pushed down to scan

HUDI_SCAN_NODE​

NameDescription
inputSplitNumNumber of scan splits
totalFileSizeTotal file size being scanned
scanRangesInformation on scan splits
partitionNumber of partitions being scanned
backendsSpecific data info for each BE to scan
cardinalityEstimated number of rows by optimizer
avgRowSizeEstimated average row size by optimizer
numNodesNumber of BEs used by the current operator
pushdown aggAggregations pushed down to scan
hudiNativeReadSplitsNumber of splits read using native method

ICEBERG_SCAN_NODE​

NameDescription
inputSplitNumNumber of scan splits
totalFileSizeTotal file size being scanned
scanRangesInformation on scan splits
partitionNumber of partitions being scanned
backendsSpecific data info for each BE to scan
cardinalityEstimated number of rows by optimizer
avgRowSizeEstimated average row size by optimizer
numNodesNumber of BEs used by the current operator
pushdown aggAggregations pushed down to scan
icebergPredicatePushdownFilters pushed down to iceberg API

PAIMON_SCAN_NODE​

NameDescription
inputSplitNumNumber of scan splits
totalFileSizeTotal file size being scanned
scanRangesInformation on scan splits
partitionNumber of partitions being scanned
backendsSpecific data info for each BE to scan
cardinalityEstimated number of rows by optimizer
avgRowSizeEstimated average row size by optimizer
numNodesNumber of BEs used by the current operator
pushdown aggAggregations pushed down to scan
paimonNativeReadSplitsNumber of splits read using native method

JdbcScanNode​

NameDescription
TABLEJDBC-side table name to scan
QUERYQuery used for scanning

OlapScanNode​

NameDescription
TABLETable being scanned. Parentheses indicate the name of the hit synchronized materialized view.
SORT INFOPresent when SCAN pre-sorting is planned. Indicates partial pre-sorting and pre-truncation of SCAN output.
SORT LIMITPresent when SCAN pre-sorting is planned. Indicates the truncation length for pre-truncation.
TOPN OPTPresent when TOP-N Runtime Filter is planned.
PREAGGREGATIONIndicates whether pre-aggregation is enabled. Relevant for MOR aggregation and primary key models. ON means data at the storage layer satisfies upper-layer needs without extra aggregation. OFF means extra aggregation is performed.
partitionsNumber of partitions currently scanned, total partitions, and list of scanned partition names.
tabletsNumber of tablets scanned and total tablets in the table.
tabletListList of tablets scanned.
avgRowSizeEstimated row size by the optimizer.
numNodesNumber of BEs assigned to the current scan.
pushAggOpResults are returned by reading zonemap metadata. Supports MIN, MAX, COUNT aggregation information.

UNION​

NameDescription
constant exprsList of constant expressions to be included in the output.
child exprsChildren's outputs projected through this expression list as input to the set operator.

EXCEPT​

NameDescription
child exprsChildren's outputs projected through this expression list as input to the set operator.

INTERSECT​

NameDescription
child exprsChildren's outputs projected through this expression list as input to the set operator.

SORT​

NameDescription
order bySorting key and specific sort order.

TABLE FUNCTION NODE​

NameDescription
table functionName of the table function used.
lateral view tuple idTuple ID corresponding to newly generated columns.
output slot idList of slot IDs for columns output after column pruning.

TOP-N​

NameDescription
order bySorting key and specific sort order.
TOPN OPTPresent when TOP-N runtime filter optimization is hit.
OPT TWO PHASEPresent when TOP-N deferred materialization is hit.