Skip to main content

Partition Pruning Optimization: Doris Query Performance Tuning

Partition Pruning is a query optimization technique: it intelligently identifies the relevant partitions based on the query conditions, scans only those partitions, and skips the irrelevant ones.

Through partition pruning, Doris can significantly reduce I/O and computation, accelerating queries on large tables.

Applicability checklist:

  • The table is partitioned by a business column (such as date).
  • The query conditions include filters on the partition column (for example, WHERE date BETWEEN ...).
  • You want to reduce the number of partitions scanned and lower I/O.
  • You need to verify with EXPLAIN whether pruning takes effect.

Case: a sales table partitioned by date

The following case demonstrates the partition pruning capability of Doris.

1. Create a table: range partitioning by date

Goal: Create a sales data table sales partitioned by date, with one partition per month.

Command:

CREATE TABLE sales (
date DATE,
product VARCHAR(50),
amount DECIMAL(10, 2)
)
PARTITION BY RANGE(date) (
PARTITION p1 VALUES LESS THAN ('2023-01-01'),
PARTITION p2 VALUES LESS THAN ('2023-02-01'),
PARTITION p3 VALUES LESS THAN ('2023-03-01'),
PARTITION p4 VALUES LESS THAN ('2023-04-01')
)
DISTRIBUTED BY HASH(date) BUCKETS 16
PROPERTIES
(
"replication_num" = "1"
);

Description: The partition column is date. There are 4 partitions in total, each covering one month of data.

2. Query: with a filter on the partition column

Goal: Query the total sales amount between January 15, 2023 and February 15, 2023.

Command:

SELECT SUM(amount) AS total_amount
FROM sales
WHERE date BETWEEN '2023-01-15' AND '2023-02-15';

Description: The WHERE clause contains a range filter on the partition column date, which is the key to triggering partition pruning.

3. Partition pruning execution process

StepDoris behaviorResult
1Analyze the partition column date in the query conditionsIdentify the date range 2023-01-15 to 2023-02-15
2Compare the query conditions with the partition definitionsHit partitions p2 and p3
3Automatically skip irrelevant partitionsSkip p1 and p4
4Run the scan and aggregation only on the hit partitionsReturn the result quickly

4. Verify pruning with EXPLAIN

Goal: Use the EXPLAIN command to view the execution plan and confirm the actual number of partitions scanned.

Command:

EXPLAIN SELECT SUM(amount) AS total_amount
FROM sales
WHERE date BETWEEN '2023-01-15' AND '2023-02-15';

Key output:

|   0:VOlapScanNode(212)                                                     |
| TABLE: cir.sales(sales), PREAGGREGATION: ON |
| PREDICATES: (date[#0] >= '2023-01-15') AND (date[#0] <= '2023-02-15') |
| partitions=2/4 (p2,p3) |

Description: The partitions=2/4 (p2,p3) field on the OlapScanNode indicates that only 2 out of 4 partitions (p2 and p3) are scanned, which means partition pruning has taken effect.

Comparison: pruning effective vs not effective

DimensionPartition pruning effectivePartition pruning not effective
Query conditionsInclude a filter on the partition columnMissing partition column conditions, or a function is applied to the partition column
Number of partitions scannedOnly the hit partitions are scannedAll partitions are scanned
I/O overheadLowHigh
EXPLAIN outputpartitions=N/M (N < M)partitions=M/M

FAQ

Q1: The query is slow and I suspect partition pruning is not effective. How do I confirm this?

Run EXPLAIN <query> and check the partitions=N/M field on the OlapScanNode. If N == M, all partitions are scanned and pruning is not effective.

Q2: Why does the query still scan all partitions even though WHERE includes the partition column?

Common reasons:

  • A function is applied to the partition column (for example, DATE_FORMAT(date, ...)), so the optimizer cannot derive the range.
  • Type mismatch (for example, the partition column is DATE while the filter value is a string that cannot be implicitly converted).
  • An OR connects a non-partition-column condition, so the condition cannot be pushed down.

Q3: What is the difference between partition pruning and bucket pruning?

  • Partition Pruning: prunes partitions based on the PARTITION BY column.
  • Bucket Pruning / Tablet Pruning: prunes tablets based on equality conditions on the DISTRIBUTED BY HASH column. The two can be combined to further reduce the amount of data scanned.

Summary

  • Partition pruning automatically identifies the mapping between query conditions and partitions, and scans only the necessary partitions.
  • Key prerequisites: the table is partitioned by a business column, and the query includes pushdown-capable filter conditions on the partition column.
  • The partitions=N/M field in EXPLAIN lets you quickly verify whether pruning takes effect.
  • Properly leveraging partition pruning can significantly reduce I/O and computation overhead, accelerating queries on massive datasets.