
Efficient Deduplication

Deduplication (counting distinct values) is one of the most resource-intensive operations in analytical workloads. Apache Doris provides two dedicated data types, BITMAP and HLL, as alternatives to COUNT DISTINCT that deduplicate with lower memory use and lower latency. Choose BITMAP when you need exact results; choose HLL when you can accept an error of roughly 1%–2% in exchange for smaller storage.
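To make the trade-off concrete, the following is a minimal Python sketch of the two ideas, not Doris's implementation. The helper names (`bitmap_of`, `bitmap_count`, `HLLSketch`) and the register count (`p=12`) are assumptions chosen for illustration. It shows why both structures suit pre-aggregation: partial results can be merged without rescanning the raw rows.

```python
import hashlib
import math


def bitmap_of(ids):
    """Pack non-negative integer IDs into one Python int used as a bitset."""
    bits = 0
    for i in ids:
        bits |= 1 << i  # setting the same bit twice has no effect
    return bits


def bitmap_count(bits):
    """Exact distinct count: the number of set bits."""
    return bin(bits).count("1")


class HLLSketch:
    """Illustrative HyperLogLog with 2**p registers (p=12 is an assumption)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value):
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")       # 64-bit hash
        idx = h >> (64 - self.p)                    # first p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        rank = (64 - self.p) - rest.bit_length() + 1  # leading zeros + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        """Combine two sketches by taking the register-wise maximum."""
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction (linear counting)
            return self.m * math.log(self.m / zeros)
        return raw


# Exact: merge two per-day bitmaps with OR, then count the set bits.
day1 = bitmap_of([1, 2, 3, 3])
day2 = bitmap_of([3, 4])
exact_total = bitmap_count(day1 | day2)  # 4 distinct IDs: 1, 2, 3, 4

# Approximate: fixed-size sketches merged by register-wise max.
sketch_a, sketch_b = HLLSketch(), HLLSketch()
for user_id in range(0, 30000):
    sketch_a.add(user_id)
for user_id in range(20000, 50000):  # overlaps sketch_a on 20000..29999
    sketch_b.add(user_id)
sketch_a.merge(sketch_b)
approx_total = sketch_a.estimate()  # close to 50000, but not exact
```

The bitmap grows with the range of IDs it stores but always returns the exact count. The sketch keeps a fixed number of registers regardless of input size, which is where the storage saving and the small estimation error both come from.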

Exact Deduplication

Approximate Deduplication