Practical Guide
This guide provides a production-oriented workflow for Apache Doris ANN vector search, from schema design to tuning and troubleshooting.
1. Scope and Typical Scenarios
Apache Doris 4.x supports ANN indexing on high-dimensional vectors for scenarios such as:
- Semantic search
- RAG retrieval
- Recommendation
- Image or multimodal retrieval
- Outlier detection
Supported index types:
- hnsw: high recall and online query performance
- ivf: lower memory and faster builds at large scale
Supported approximate distance functions:
- l2_distance_approximate (ORDER BY ... ASC)
- inner_product_approximate (ORDER BY ... DESC)
Cosine note:
- The ANN index does not support metric_type = "cosine" directly.
- For cosine-based retrieval, normalize vectors first, then use inner_product.
2. Prerequisites and Constraints
Before using ANN indexes, confirm the following:
- Doris version: >= 4.0.0
- Table model: only DUPLICATE KEY is supported for ANN
- Vector column: must be ARRAY<FLOAT> NOT NULL
- Dimension consistency: the input vector dimension must match the index dim
Example table model:
CREATE TABLE document_vectors (
id BIGINT NOT NULL,
embedding ARRAY<FLOAT> NOT NULL
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 8
PROPERTIES ("replication_num" = "1");
2.1 Using Cosine Similarity in Doris ANN
If your ranking metric is cosine similarity, use this pattern:
- Normalize every vector to unit length before ingestion.
- Build the ANN index with metric_type = "inner_product".
- Query with inner_product_approximate(...) and ORDER BY ... DESC.
Reason:
- cos(x, y) = (x · y) / (||x|| ||y||)
- After normalization, ||x|| = ||y|| = 1, so cos(x, y) = x · y
That is why cosine ranking can be implemented through inner product in Doris ANN.
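The pattern above can be sketched in SQL as follows, reusing the document_vectors example from this guide; the dimension and query vector are illustrative placeholders:

```sql
-- Assumes every embedding is L2-normalized before ingestion.
CREATE INDEX idx_embedding ON document_vectors (embedding) USING ANN PROPERTIES (
    "index_type" = "hnsw",
    "metric_type" = "inner_product",
    "dim" = "768"
);

-- With unit-length vectors, inner product equals cosine similarity,
-- so larger scores mean closer matches: order descending.
SELECT id,
       inner_product_approximate(embedding, [0.1, 0.2, ...]) AS score
FROM document_vectors
ORDER BY score DESC
LIMIT 10;
```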
3. End-to-End Workflow
Step 1: Create Table
You can choose one of two patterns:
- Define the ANN index when creating the table.
  - Index is built during ingest.
  - Faster time-to-query after loading.
  - Slower ingest throughput.
- Create the table first, then CREATE INDEX and BUILD INDEX later.
  - Better for large batch imports.
  - More control over compaction and build timing.
Example (index defined in CREATE TABLE):
CREATE TABLE document_vectors (
id BIGINT NOT NULL,
title VARCHAR(500),
content TEXT,
category VARCHAR(100),
embedding ARRAY<FLOAT> NOT NULL,
INDEX idx_embedding (embedding) USING ANN PROPERTIES (
"index_type" = "hnsw",
"metric_type" = "l2_distance",
"dim" = "768"
)
)
ENGINE = OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 8
PROPERTIES ("replication_num" = "1");
Step 2: Configure ANN Index
Common properties:
- index_type: hnsw or ivf
- metric_type: l2_distance or inner_product
- dim: vector dimension
- quantizer: flat, sq8, sq4, or pq (optional)
HNSW-specific:
- max_degree (default 32)
- ef_construction (default 40)
IVF-specific:
- nlist (default 1024)
Example:
CREATE INDEX idx_embedding ON document_vectors (embedding) USING ANN PROPERTIES (
"index_type" = "hnsw",
"metric_type" = "l2_distance",
"dim" = "768",
"max_degree" = "64",
"ef_construction" = "128"
);
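An equivalent sketch for an IVF index; the nlist value here is an illustrative choice to benchmark against, not a recommendation:

```sql
-- IVF variant of the same index; tune nlist against your dataset.
CREATE INDEX idx_embedding ON document_vectors (embedding) USING ANN PROPERTIES (
    "index_type" = "ivf",
    "metric_type" = "l2_distance",
    "dim" = "768",
    "nlist" = "2048"
);
```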
Step 3: Load Data
Recommended order for bulk workloads:
- Create the table (without the ANN index, or without running BUILD INDEX yet)
- Import data in batches (Stream Load, S3 TVF, or SDK)
- Trigger the index build
For production, prefer batch loading approaches such as Stream Load or SDK batch insert.
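For illustration, a small INSERT showing the array-literal format; the values are placeholders, and each real row must supply exactly as many floats as the index dim declares (768 in the examples above):

```sql
-- Placeholder values only; each embedding must contain exactly `dim` floats.
INSERT INTO document_vectors (id, embedding)
VALUES (1, [0.12, 0.34, ...]),
       (2, [0.56, 0.78, ...]);
```

For bulk workloads, the same array-literal format applies to Stream Load and S3 TVF inputs.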
Step 4: Build and Monitor Index
When the index is created after the table, run BUILD INDEX manually and monitor its progress:
BUILD INDEX idx_embedding ON document_vectors;
SHOW BUILD INDEX WHERE TableName = "document_vectors";
Build states include PENDING, RUNNING, FINISHED, and CANCELLED.
4. Query Patterns
TopN search
SELECT id, title,
l2_distance_approximate(embedding, [0.1, 0.2, ...]) AS dist
FROM document_vectors
ORDER BY dist
LIMIT 10;
Range search
SELECT id, title
FROM document_vectors
WHERE l2_distance_approximate(embedding, [0.1, 0.2, ...]) < 0.5;
Search with filters
SELECT id, title,
l2_distance_approximate(embedding, [0.1, 0.2, ...]) AS dist
FROM document_vectors
WHERE category = 'AI'
ORDER BY dist
LIMIT 10;
Doris uses pre-filtering in vector search plans, which helps preserve recall in mixed filter scenarios.
5. Tuning Checklist
Query-side parameters
- HNSW: hnsw_ef_search (higher values raise recall at the cost of latency)
- IVF: nprobe (or ivf_nprobe, depending on version/session variables)
Example:
SET hnsw_ef_search = 100;
SET nprobe = 128;
SET optimize_index_scan_parallelism = true;
Build-side recommendations
- Run compaction before final index build on large datasets.
- Avoid oversized segments when targeting high recall.
- Benchmark several parameter groups (max_degree, ef_construction, hnsw_ef_search) on the same dataset.
Capacity planning
As a practical baseline, estimate vector memory with dim * 4 bytes * row_count, then add ANN structure overhead and reserve memory headroom for non-vector columns and execution operators.
For single-node and distributed sizing references at 10M/100M scale, see Large-scale Performance Benchmark.
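As a worked instance of that baseline, 10M rows of 768-dim FLOAT vectors occupy 768 * 4 = 3072 bytes per row, roughly 28.6 GiB of raw vector data before ANN structure overhead:

```sql
-- Raw vector bytes for 10M rows of 768-dim FLOAT (4 bytes per element).
SELECT 10000000 * 768 * 4.0 / 1024 / 1024 / 1024 AS raw_vector_gib;
```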
6. Index Operations
Common management SQL:
SHOW INDEX FROM document_vectors;
SHOW DATA ALL FROM document_vectors;
ALTER TABLE document_vectors DROP INDEX idx_embedding;
When changing index parameters, use a drop-and-recreate workflow, then rebuild the index.
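The drop-and-recreate workflow can be sketched as follows, reusing the parameter values from the earlier example:

```sql
ALTER TABLE document_vectors DROP INDEX idx_embedding;

CREATE INDEX idx_embedding ON document_vectors (embedding) USING ANN PROPERTIES (
    "index_type" = "hnsw",
    "metric_type" = "l2_distance",
    "dim" = "768",
    "max_degree" = "64",
    "ef_construction" = "128"
);

BUILD INDEX idx_embedding ON document_vectors;
```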
7. Troubleshooting
Index not used
Check:
- Index exists: SHOW INDEX
- Build finished: SHOW BUILD INDEX
- Correct function: use the _approximate functions
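One way to confirm whether the index is picked up is to inspect the query plan; this assumes the TopN query shape from Section 4, and the exact plan output format varies by version:

```sql
EXPLAIN
SELECT id,
       l2_distance_approximate(embedding, [0.1, 0.2, ...]) AS dist
FROM document_vectors
ORDER BY dist
LIMIT 10;
```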
Low recall
Check:
- HNSW parameters (max_degree, ef_construction, hnsw_ef_search)
- IVF probe parameters (nprobe / ivf_nprobe)
- Segment size and post-compaction rebuild
High latency
Check:
- Cold vs warm query behavior (index loading)
- Overly large hnsw_ef_search
- Parallel scan settings
- BE memory pressure
Data import errors
Common causes:
- dimension mismatch (dim vs actual data)
- null vector values
- invalid array format
8. Hybrid Search Pattern
You can combine ANN with text search by defining both ANN and inverted indexes in the same table, then filtering with text predicates and ordering with vector distance. This is a common approach for production RAG pipelines.
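A sketch of this hybrid pattern; the inverted-index clauses (USING INVERTED, the english parser, MATCH_ANY) follow standard Doris inverted-index syntax, but treat the exact combination as an assumption to validate on your version:

```sql
CREATE TABLE hybrid_docs (
    id BIGINT NOT NULL,
    content TEXT,
    embedding ARRAY<FLOAT> NOT NULL,
    INDEX idx_content (content) USING INVERTED PROPERTIES ("parser" = "english"),
    INDEX idx_embedding (embedding) USING ANN PROPERTIES (
        "index_type" = "hnsw",
        "metric_type" = "l2_distance",
        "dim" = "768"
    )
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 8
PROPERTIES ("replication_num" = "1");

-- Filter with the text predicate, rank by vector distance.
SELECT id, content,
       l2_distance_approximate(embedding, [0.1, 0.2, ...]) AS dist
FROM hybrid_docs
WHERE content MATCH_ANY 'vector search'
ORDER BY dist
LIMIT 10;
```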