Skip to main content

Vector Quantization Algorithm Survey and Selection

This document introduces common vector quantization algorithms from both an educational and engineering perspective, and provides selection guidance based on Apache Doris ANN use cases. Apache Doris currently uses an optimized version of Faiss as the core implementation for ANN vector indexing and retrieval, so the SQ/PQ mechanism descriptions below map directly to actual Doris behavior.

Quick Navigation

Target readers and corresponding sections:

Problem you want to solveJump to
Do not know why quantization is neededWhy Vector Quantization Is Needed
Want to quickly pick a quantizerDoris Selection Guidance
Want to understand how SQ8/SQ4 workScalar Quantization (SQ)
Want to understand the principles and advantages of PQProduct Quantization (PQ)
Want to run a fair benchmark comparisonBenchmarking Considerations
Common questionsFAQ

One-Sentence Definitions

  • Vector quantization: Encodes high-precision vectors (such as float32) into a low-precision representation, trading acceptable recall loss for lower memory footprint and faster retrieval.
  • SQ (Scalar Quantization): Reduces precision independently for each dimension.
  • PQ (Product Quantization): Splits a vector into multiple sub-vectors and performs clustering-based encoding within each sub-space.

Why Vector Quantization Is Needed

In ANN scenarios (especially HNSW), indexes are often constrained by memory. The core idea of quantization is to encode high-precision vectors such as float32 into a low-precision representation, trading acceptable recall loss for lower memory footprint.

In Doris, the ANN index controls the quantization method through quantizer:

quantizerMeaningCompression ratio relative to float32
flatNo quantization (highest quality, highest memory)1x
sq88-bit scalar quantizationAbout 4x
sq44-bit scalar quantizationAbout 8x
pqProduct quantizationDepends on pq_m and pq_nbits

Minimal example (HNSW + quantizer):

CREATE TABLE vector_tbl (
id BIGINT,
embedding ARRAY<FLOAT>,
INDEX ann_idx (embedding) USING ANN PROPERTIES (
"index_type" = "hnsw",
"metric_type" = "l2_distance",
"dim" = "768",
"quantizer" = "sq8"
)
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 8
PROPERTIES ("replication_num" = "3");

Algorithm Overview

MethodCore ideaTypical benefitMain cost
SQ (Scalar Quantization)Quantize each dimension independentlySignificant memory reduction, simple to implementBuild cost is higher than FLAT; stronger compression tends to lower recall
PQ (Product Quantization)Split into sub-vectors and quantize per groupBetter balance of compression and query speed in common scenariosHigher training/encoding cost, parameters require tuning

Scalar Quantization (SQ)

Principle

SQ does not change the vector dimension. It only reduces the numeric precision of each dimension.

A common min-max quantization mapping:

  • max_code = (1 << b) - 1
  • scale = (max_val - min_val) / max_code
  • code = round((x - min_val) / scale)

There are two main types of SQ in Faiss:

TypeRange statistics methodApplicable scenario
UniformAll dimensions share one set of min/maxNumeric scales are similar across dimensions
Non-uniformEach dimension is statistically analyzed separatelyNumeric scales differ noticeably across dimensions, with smaller reconstruction error

Characteristics

Strengths:

  • Direct implementation with stable behavior.
  • Predictable compression ratio (relative to float32 values, sq8 is about 4x and sq4 is about 8x).

Limitations:

  • Fundamentally still fixed-step bucketing.
  • If the distribution of a single dimension is clearly non-uniform (for example, a long-tail distribution), the error increases.

Faiss Source Code Highlights (SQ)

In the optimized Faiss implementation path used by Doris, SQ training first computes the minimum and maximum, then slightly expands the range as needed to reduce the risk of out-of-range values during the subsequent add stage. The simplified form is as follows:

void train_Uniform(..., const float* x, std::vector<float>& trained) {
trained.resize(2);
float& vmin = trained[0];
float& vmax = trained[1];
// Scan samples to obtain min/max
// Then expand the range based on rs_arg
}

For non-uniform SQ, Faiss computes statistics per dimension instead of using a single global range, which usually works better on data where numeric scales differ noticeably across dimensions.

Practical Observations

In internal 128D/256D HNSW tests:

  • sq8 recall is usually noticeably better than sq4.
  • SQ build/encoding time is significantly higher than FLAT.
  • sq8 query latency typically does not change much, while sq4 shows more obvious recall degradation.

The bar charts below are based on example benchmark data:

SQ build time vs rows (128D)

SQ memory usage vs rows (128D)

Product Quantization (PQ)

Principle

PQ splits a D-dimensional vector into M sub-vectors (each sub-vector is D/M-dimensional) and applies k-means quantization in each sub-space.

Key parameters:

ParameterMeaningTuning direction
pq_mNumber of sub-quantizersLarger values improve precision but increase training and encoding cost
pq_nbitsNumber of encoding bits per sub-vectorDetermines the codebook size of each sub-space (2^pq_nbits)

Why PQ Queries Can Be Faster

PQ can use a LUT (lookup table) for distance approximation:

  1. Pre-compute distances from each query sub-vector to the centroids of each sub-space.
  2. Estimate the overall distance at query time by table lookup and accumulation.

This avoids full reconstruction and reduces CPU cost during the search stage in many scenarios.

Faiss Source Code Highlights (PQ)

Within the same implementation path, the Faiss ProductQuantizer trains a codebook over each sub-space and stores the centroids in contiguous memory. The simplified form is as follows:

void ProductQuantizer::train(size_t n, const float* x) {
Clustering clus(dsub, ksub, cp);
IndexFlatL2 index(dsub);
clus.train(n * M, x, index);
for (int m = 0; m < M; m++) {
set_params(clus.centroids.data(), m);
}
}

The centroid layout can be understood as (M, ksub, dsub):

  • M: Number of sub-quantizers.
  • ksub: Codebook size of each sub-space (2^pq_nbits).
  • dsub: Sub-vector dimension (D / M).

Practical Observations

In the same internal tests:

  • PQ delivers clear positive gains on compression.
  • PQ has higher training/encoding cost.
  • Compared with SQ, PQ can often achieve better query-stage speed by leveraging the LUT, but recall and build cost still depend on the data distribution and parameter combination.

The bar charts below are based on example benchmark data:

PQ on-disk index size vs rows (128D/256D)

PQ build time vs rows (128D/256D)

PQ search time vs rows (128D/256D)

Doris Selection Guidance

Selection by Scenario

ScenarioRecommended quantizerRationale
Sufficient memory and recall is the priorityflatNo precision loss, highest quality ceiling
Want to reduce memory at low risk with more stable qualitysq8Compression of about 4x, controllable recall drop
Severe memory pressure and lower recall is acceptablesq4Compression of about 8x, but recall drops more noticeably
Pursue a balance of compression and performance and accept tuningpqLUT accelerates queries, requires parameter tuning
  1. First build a baseline with flat and record Recall@K and query latency.
  2. Test sq8 first and compare Recall against P95/P99 latency.
  3. If memory is still insufficient, test pq (you can start from pq_m = D/2).
  4. Consider sq4 only when memory has higher priority than recall.

Benchmarking Considerations

Absolute latency is strongly correlated with hardware, thread count, and dataset. When running side-by-side comparisons, fix the following variables to ensure conclusions are comparable:

CategoryVariables to fix
DataVector dimension, dataset distribution, row count
IndexIndex parameters, segment size
QueryQuery set and ground-truth set

Evaluation metrics should cover all of the following at the same time:

  • Recall@K
  • Index size
  • Build time
  • Query latency (P50/P95/P99)

FAQ

Q1: How to choose between sq8 and sq4?

Prefer sq8. sq8 compresses by about 4x with a controllable recall drop. sq4 compresses by about 8x but the recall drop is more noticeable, so use it only when memory is extremely tight.

Q2: When should you choose PQ over SQ?

Choose PQ when you need a higher compression ratio or want to use the LUT to achieve better query-stage speed. The trade-off is higher training/encoding cost, and pq_m and pq_nbits need tuning based on the data distribution.

Q3: Does quantization affect build time?

Yes. The build/encoding time of every quantizer is significantly higher than flat, and the training stage of PQ has the highest cost. Record build time separately during evaluation.

Q4: Is non-uniform SQ always better than uniform SQ?

Not necessarily. When numeric scales differ noticeably across dimensions (for example, when some dimensions are inherently larger in magnitude), non-uniform has smaller reconstruction error. If scales are similar across dimensions, the gap between the two is limited.

Q5: How to set the starting value of pq_m?

You can start from pq_m = D/2, then adjust on either side based on Recall@K and query latency. A larger pq_m improves precision but increases training and encoding cost.