Parquet Reader Optimization

TL;DR Apache Doris ships a native C++ vectorized Parquet reader for external catalogs and TVFs, avoiding any Java or Arrow round trip. The reader prunes whole row groups and individual pages from the file's own statistics, evaluates predicates on dictionary IDs, and reads filter columns before payload columns. Remote reads are coalesced into fewer requests. Together, these steps let a selective query skip fetching and decoding most of the data it doesn't need.


Why use the Parquet reader in Apache Doris?

The Apache Doris native Parquet reader exists to cut the three dominant lakehouse scan costs: fetching bytes from object storage, decompressing and decoding pages, and converting values into the engine's runtime representation. A Hive or Iceberg table on S3 is, in the end, a pile of Parquet files, and a naive reader pays all three costs on every query.

That cost shows up in shapes anyone who has tuned a lakehouse query has hit:

A WHERE order_date >= '2026-04-01' filter scans every row group of every Parquet file because the reader never looked at the file's per-row-group min/max.
A point lookup like WHERE user_id = 12345 reads the entire user_id column when only one row group could possibly contain that ID.
A SELECT * over a wide event table on S3 issues hundreds of small HTTP GETs per file, one per column chunk.
A predicate like country = 'JP' decompresses and decodes UTF-8 across millions of rows, even though the column is dictionary-encoded and the answer could be one integer comparison per row.

The Apache Doris Parquet reader is built to avoid each of these, in most cases before any data page leaves storage.

What is the Apache Doris Parquet reader?

The Apache Doris Parquet reader is a native C++ component that reads Parquet metadata, prunes at every available granularity (row group, page, dictionary, column), and decodes values directly into the same Block format the vectorized executor consumes. There is no Java library or Arrow round trip in the path.

Key terms

Row group: the unit a Parquet file is partitioned into for parallel reads. Each one carries min/max and null-count statistics per column.
PageIndex: an optional Parquet structure (ColumnIndex + OffsetIndex) that exposes the same statistics at page granularity, plus the byte offset of each page so a reader can seek directly to it.
Lazy materialization: a two-pass read that decodes filter columns first, then fetches payload columns only for the rows that survived.
Dictionary filtering: evaluating a predicate against a column chunk's dictionary page once, then comparing rows against the resulting set of dictionary IDs.
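
To make the dictionary filtering definition concrete, here is a minimal Python sketch of the idea. It is an illustration only, not the Doris C++ implementation; the function name dictionary_filter and the in-memory lists standing in for a dictionary page and a data page are assumptions made for the example.

```python
def dictionary_filter(dictionary, ids, predicate):
    """Build a per-row keep-mask while evaluating the predicate once per dictionary entry.

    dictionary: distinct column values (stand-in for a decoded dictionary page)
    ids:        one dictionary ID per row (stand-in for a decoded data page)
    """
    # Evaluate the predicate once per distinct value, not once per row.
    matching_ids = {i for i, value in enumerate(dictionary) if predicate(value)}
    if not matching_ids:
        # No dictionary entry matches, so no row in this chunk can match.
        return None
    # Per-row work is now an integer membership check; no string is decoded.
    return [i in matching_ids for i in ids]


dictionary = ["CN", "JP", "US"]
ids = [0, 1, 1, 2, 0, 1]

mask = dictionary_filter(dictionary, ids, lambda v: v == "JP")
print(mask)  # [False, True, True, False, False, True]
```

The string comparison runs three times (once per dictionary entry) rather than six times (once per row). On a real column chunk with millions of rows and a few hundred distinct values, that ratio is where the saving comes from.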

How does the Apache Doris Parquet reader work?

The Apache Doris Parquet reader runs a five-stage pipeline where each step shrinks what the next step has to look at.

Row-group pruning. When the file is opened, the reader walks each row group's column statistics and drops the whole group if the predicate cannot match its min/max range. The same step probes the Parquet bloom filter when the file has one. Both checks run on metadata only, before any column page is fetched. Controlled by enable_parquet_filter_by_min_max and enable_parquet_filter_by_bloom_filter.
Page pruning via PageIndex. For each surviving row group, the reader parses ColumnIndex and OffsetIndex to drop individual pages whose min/max cannot match. OffsetIndex gives the byte range of each surviving page, so the reader skips straight to the next one without scanning the rejected bytes. Toggle with the BE config enable_parquet_page_index.
Dictionary filtering. When a column chunk is dictionary-encoded and the predicate is a single-slot expression, the reader decodes only the dictionary page, evaluates the predicate against it, and rewrites the predicate to operate on dictionary IDs. Filtering the data pages then reduces to integer comparisons on the bit-packed dictionary IDs, with no string decoding.
Lazy materialization. Surviving rows still have to be assembled. The reader splits columns into filter columns (referenced by predicates) and payload columns, decodes the filter columns to build a row-selection bitmap, and only then fetches the payload columns for the surviving rows. Controlled by the session variable enable_parquet_lazy_mat, on by default.
IO coalescing and direct decode. Adjacent column chunks within a row group are merged into a single ranged read, so the engine issues one HTTP GET instead of many. Decoded values land directly in Doris vectorized columns, with no Arrow round trip.
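
The first two stages apply the same min/max test at two granularities. The sketch below shows that test in Python under simplifying assumptions: statistics are plain ISO date strings (which compare correctly as text), and the Stats and Page classes and the may_match_range helper are hypothetical names for this example, not Doris or Parquet APIs.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    min: str
    max: str


@dataclass
class Page:
    stats: Stats
    offset: int  # byte offset of the page, as an OffsetIndex would record it
    length: int  # compressed size of the page in bytes


def may_match_range(stats, lower=None, upper=None):
    """Return False only when [stats.min, stats.max] cannot overlap [lower, upper]."""
    if lower is not None and stats.max < lower:
        return False
    if upper is not None and stats.min > upper:
        return False
    return True


# Stage 1: row-group pruning for order_date >= '2026-04-01'.
row_groups = [
    Stats("2026-01-01", "2026-01-31"),
    Stats("2026-03-15", "2026-04-10"),
    Stats("2026-04-11", "2026-05-02"),
]
surviving = [i for i, s in enumerate(row_groups)
             if may_match_range(s, lower="2026-04-01")]
print(surviving)  # [1, 2]

# Stage 2: page pruning inside row group 1, keeping only the byte ranges to read.
pages = [
    Page(Stats("2026-03-15", "2026-03-31"), offset=0, length=4096),
    Page(Stats("2026-04-01", "2026-04-10"), offset=4096, length=4096),
]
byte_ranges = [(p.offset, p.length) for p in pages
               if may_match_range(p.stats, lower="2026-04-01")]
print(byte_ranges)  # [(4096, 4096)]
```

Note that the test is conservative: it can only prove that a range cannot match. A surviving row group or page may still contain zero matching rows, which is why the later stages filter row by row.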

The result: rows that survive to the executor have already been confirmed to match, and only the columns the query asked for were ever decoded.
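
The lazy materialization stage is the one that keeps payload columns from being decoded for rejected rows. A minimal Python sketch of the two-pass read follows; lazy_read, make_reader, and the decoded log are hypothetical names invented for this illustration, and the in-memory lists stand in for column data that a real reader would fetch and decode from pages.

```python
def lazy_read(filter_columns, payload_readers, predicate):
    """Two-pass read: decode filter columns, then fetch payload for surviving rows only."""
    num_rows = len(next(iter(filter_columns.values())))

    # Pass 1: evaluate the predicate using only the filter columns.
    selected = [
        r for r in range(num_rows)
        if predicate({name: col[r] for name, col in filter_columns.items()})
    ]

    result = {name: [col[r] for r in selected]
              for name, col in filter_columns.items()}

    # Pass 2: payload columns are read only for the selected row indices.
    for name, reader in payload_readers.items():
        result[name] = reader(selected) if selected else []
    return result


decoded = []  # records every payload value that actually gets decoded


def make_reader(name, values):
    def read(rows):
        decoded.extend((name, r) for r in rows)
        return [values[r] for r in rows]
    return read


filter_columns = {"status": ["PAID", "VOID", "PAID", "OPEN"]}
payload_readers = {
    "region": make_reader("region", ["EU", "US", "JP", "EU"]),
    "amount": make_reader("amount", [10, 20, 30, 40]),
}

out = lazy_read(filter_columns, payload_readers,
                lambda row: row["status"] == "PAID")
print(out)           # {'status': ['PAID', 'PAID'], 'region': ['EU', 'JP'], 'amount': [10, 30]}
print(len(decoded))  # 4
```

Only 4 of the 8 payload values are decoded here because half the rows fail the filter. The more selective the predicate and the wider the payload, the larger the share of decoding that is skipped.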

Quick start

SELECT region, count(*) AS orders, sum(amount) AS revenue
FROM s3(
  "uri"    = "s3://demo-bucket/orders/year=2026/*.parquet",
  "format" = "parquet",
  "s3.endpoint"   = "https://s3.us-east-1.amazonaws.com",
  "s3.access_key" = "AK...",
  "s3.secret_key" = "SK..."
)
WHERE order_date >= '2026-04-01' AND status = 'PAID'
GROUP BY region;

Expected result (profile excerpt)

VFileScanNode
  FilteredRowGroups:           18 / 24
  FilteredPages:               210 / 980
  LazyReadFilteredRows:        92.1%
  RemoteIOMergedRanges:        14 (from 96 chunks)

The reader dropped 18 of 24 row groups on the date predicate, then dropped 210 pages inside the survivors using PageIndex on status. Lazy materialization skipped decoding region and amount for the 92.1% of rows that didn't match. Ninety-six column chunks were merged into 14 ranged GETs against S3.
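
The merged-range figure in the profile is the result of coalescing nearby byte ranges before issuing requests. The Python sketch below shows one simple way such merging can work. It is an assumption-laden illustration: the coalesce function and its max_gap parameter are hypothetical, and the actual merge thresholds Doris uses are not described in this document.

```python
def coalesce(ranges, max_gap=0):
    """Merge (offset, length) byte ranges whose gap is at most max_gap bytes."""
    merged = []
    for offset, length in sorted(ranges):
        if merged and offset - (merged[-1][0] + merged[-1][1]) <= max_gap:
            # Extend the previous range to cover this one (and any gap between them).
            start, prev_length = merged[-1]
            end = max(start + prev_length, offset + length)
            merged[-1] = (start, end - start)
        else:
            merged.append((offset, length))
    return merged


# Five column chunks: three are adjacent, two are separated by a small gap.
chunks = [(0, 100), (100, 50), (150, 200), (1000, 80), (1100, 20)]

print(coalesce(chunks))              # [(0, 350), (1000, 80), (1100, 20)]
print(coalesce(chunks, max_gap=64))  # [(0, 350), (1000, 120)]
```

Allowing a small gap trades a few wasted bytes for one fewer request, which usually pays off on object storage where per-request latency dominates small reads.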

When should you use the Apache Doris Parquet reader?

The Apache Doris Parquet reader is used automatically whenever a query reads Parquet, so the question is how to make sure it can do its job.

Good fit

Hive, Iceberg, Hudi, and Paimon catalogs whose data files are Parquet.
Direct file analysis through the s3() and hdfs() TVFs.
Selective queries on wide tables: predicate-heavy filters give lazy materialization the most to skip.
Equality and range predicates on columns that the writer recorded statistics for.
Dictionary-encoded string columns with point or IN predicates.

Not a good fit

Tables stored as thousands of tiny Parquet files. Each footer must be parsed and each chunk is at least one ranged read; per-file overhead dominates. Compact toward 128 MB to 1 GB files.
Files written without column statistics or PageIndex. The reader has nothing to prune on and falls back to full scans. Enable statistics in the writer (Spark 3.2+ and parquet-mr 1.11+ both write PageIndex).
Selective predicates inside deeply nested struct fields. Some optimizations (page index, dictionary filtering) are disabled for complex types; the reader walks the surviving rows in full. Flatten the access path or use top-level columns when possible.
DATETIME columns written as INT96 by older Hive writers. The Parquet format defines no sort order for INT96, so min/max statistics on these columns are unreliable, and the reader skips min/max pruning on them. If you control the writer, write INT64 timestamps. See the Hive catalog notes on INT96.
