Practical Guide
This guide provides a production-oriented workflow for Apache Doris ANN vector search, from schema design to tuning and troubleshooting.
1. Scope and Typical Scenarios
Apache Doris 4.x supports ANN indexing on high-dimensional vectors for scenarios such as:
- Semantic search
- RAG retrieval
- Recommendation
- Image or multimodal retrieval
- Outlier detection
Supported index types:
- hnsw: high recall and online query performance
- ivf: lower memory and faster builds at large scale
Supported approximate distance functions:
- l2_distance_approximate (ORDER BY ... ASC)
- inner_product_approximate (ORDER BY ... DESC)
Cosine note:
- The ANN index does not support metric_type = "cosine" directly.
- For cosine-based retrieval, normalize vectors first, then use inner_product.
2. Prerequisites and Constraints
Before using ANN indexes, confirm the following:
- Doris version: >= 4.0.0
- Table model: only DUPLICATE KEY is supported for ANN
- Vector column: must be ARRAY<FLOAT> NOT NULL
- Dimension consistency: the input vector dimension must match the index dim
Example table model:
CREATE TABLE document_vectors (
id BIGINT NOT NULL,
embedding ARRAY<FLOAT> NOT NULL
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 8
PROPERTIES ("replication_num" = "1");
2.1 Using Cosine Similarity in Doris ANN
If your ranking metric is cosine similarity, use this pattern:
- Normalize every vector to unit length before ingestion.
- Build the ANN index with metric_type = "inner_product".
- Query with inner_product_approximate(...) and ORDER BY ... DESC.
Reason:
- cos(x, y) = (x · y) / (||x|| ||y||)
- After normalization, ||x|| = ||y|| = 1, so cos(x, y) = x · y
That is why cosine ranking can be implemented through inner product in Doris ANN.
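The pattern above can be sketched in SQL as follows, reusing the document_vectors example from this guide; the dimension and query vector are illustrative placeholders:

```sql
-- Assumes every embedding is L2-normalized before ingestion.
CREATE INDEX idx_embedding ON document_vectors (embedding) USING ANN PROPERTIES (
    "index_type" = "hnsw",
    "metric_type" = "inner_product",
    "dim" = "768"
);

-- With unit-length vectors, inner product equals cosine similarity,
-- so larger scores mean closer matches: order descending.
SELECT id,
       inner_product_approximate(embedding, [0.1, 0.2, ...]) AS score
FROM document_vectors
ORDER BY score DESC
LIMIT 10;
```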
3. End-to-End Workflow
Step 1: Create Table
You can choose one of two patterns:
- Define the ANN index when creating the table.
  - Index is built during ingest.
  - Faster time-to-query after loading.
  - Slower ingest throughput.
- Create the table first, then CREATE INDEX and BUILD INDEX later.
  - Better for large batch imports.
  - More control over compaction and build timing.
Example (index defined in CREATE TABLE):
CREATE TABLE document_vectors (
id BIGINT NOT NULL,
title VARCHAR(500),
content TEXT,
category VARCHAR(100),
embedding ARRAY<FLOAT> NOT NULL,
INDEX idx_embedding (embedding) USING ANN PROPERTIES (
"index_type" = "hnsw",
"metric_type" = "l2_distance",
"dim" = "768"
)
)
ENGINE = OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 8
PROPERTIES ("replication_num" = "1");
Step 2: Configure ANN Index
Common properties:
- index_type: hnsw or ivf
- metric_type: l2_distance or inner_product
- dim: vector dimension
- quantizer: flat, sq8, sq4, or pq (optional)
HNSW-specific:
- max_degree (default 32)
- ef_construction (default 40)
IVF-specific:
- nlist (default 1024)
Example:
CREATE INDEX idx_embedding ON document_vectors (embedding) USING ANN PROPERTIES (
"index_type" = "hnsw",
"metric_type" = "l2_distance",
"dim" = "768",
"max_degree" = "64",
"ef_construction" = "128"
);
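An equivalent sketch for an IVF index; the nlist value here is an illustrative choice to benchmark against, not a recommendation:

```sql
-- IVF variant of the same index; tune nlist against your dataset.
CREATE INDEX idx_embedding ON document_vectors (embedding) USING ANN PROPERTIES (
    "index_type" = "ivf",
    "metric_type" = "l2_distance",
    "dim" = "768",
    "nlist" = "2048"
);
```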
Step 3: Load Data
Recommended order for bulk workloads:
- Create the table (without the ANN index, or without running BUILD INDEX yet)
- Import data in batches (Stream Load, S3 TVF, or SDK)
- Trigger the index build
For production, prefer batch loading approaches such as Stream Load or SDK batch insert.
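For illustration, a small INSERT showing the array-literal format; the values are placeholders, and each real row must supply exactly as many floats as the index dim declares (768 in the examples above):

```sql
-- Placeholder values only; each embedding must contain exactly `dim` floats.
INSERT INTO document_vectors (id, embedding)
VALUES (1, [0.12, 0.34, ...]),
       (2, [0.56, 0.78, ...]);
```

For bulk workloads, the same array-literal format applies to Stream Load and S3 TVF inputs.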
Step 4: Build and Monitor Index
When the index is created after the table, run BUILD INDEX manually and monitor its progress:
BUILD INDEX idx_embedding ON document_vectors;
SHOW BUILD INDEX WHERE TableName = "document_vectors";
Build states include PENDING, RUNNING, FINISHED, and CANCELLED.
4. Query Patterns
TopN search
SELECT id, title,
l2_distance_approximate(embedding, [0.1, 0.2, ...]) AS dist
FROM document_vectors
ORDER BY dist
LIMIT 10;
Range search
SELECT id, title
FROM document_vectors
WHERE l2_distance_approximate(embedding, [0.1, 0.2, ...]) < 0.5;
Search with filters
SELECT id, title,
l2_distance_approximate(embedding, [0.1, 0.2, ...]) AS dist
FROM document_vectors
WHERE category = 'AI'
ORDER BY dist
LIMIT 10;
Doris uses pre-filtering in vector search plans, which helps preserve recall in mixed filter scenarios.
5. Tuning Checklist
Query-side parameters
- HNSW: hnsw_ef_search (higher values raise recall at the cost of latency)
- IVF: nprobe (or ivf_nprobe, depending on version/session variables)
Example:
SET hnsw_ef_search = 100;
SET nprobe = 128;
SET optimize_index_scan_parallelism = true;
Build-side recommendations
- Run compaction before final index build on large datasets.
- Avoid oversized segments when targeting high recall.
- Benchmark several parameter groups (max_degree, ef_construction, hnsw_ef_search) on the same dataset.
Capacity planning
As a practical baseline, estimate vector memory with dim * 4 bytes * row_count, then add ANN structure overhead and reserve memory headroom for non-vector columns and execution operators.
For single-node and distributed sizing references at 10M/100M scale, see Large-scale Performance Benchmark.
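As a worked instance of that baseline, 10M rows of 768-dim FLOAT vectors occupy 768 * 4 = 3072 bytes per row, roughly 28.6 GiB of raw vector data before ANN structure overhead:

```sql
-- Raw vector bytes for 10M rows of 768-dim FLOAT (4 bytes per element).
SELECT 10000000 * 768 * 4.0 / 1024 / 1024 / 1024 AS raw_vector_gib;
```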
6. Index Operations
Common management SQL:
SHOW INDEX FROM document_vectors;
SHOW DATA ALL FROM document_vectors;
ALTER TABLE document_vectors DROP INDEX idx_embedding;
When changing index parameters, use a drop-and-recreate workflow, then rebuild the index.
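The drop-and-recreate workflow can be sketched as follows, reusing the parameter values from the earlier example:

```sql
ALTER TABLE document_vectors DROP INDEX idx_embedding;

CREATE INDEX idx_embedding ON document_vectors (embedding) USING ANN PROPERTIES (
    "index_type" = "hnsw",
    "metric_type" = "l2_distance",
    "dim" = "768",
    "max_degree" = "64",
    "ef_construction" = "128"
);

BUILD INDEX idx_embedding ON document_vectors;
```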
7. Troubleshooting
Index not used
Check:
- Index exists: SHOW INDEX
- Build finished: SHOW BUILD INDEX
- Correct function: use the _approximate functions
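One way to confirm whether the index is picked up is to inspect the query plan; this assumes the TopN query shape from Section 4, and the exact plan output format varies by version:

```sql
EXPLAIN
SELECT id,
       l2_distance_approximate(embedding, [0.1, 0.2, ...]) AS dist
FROM document_vectors
ORDER BY dist
LIMIT 10;
```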
Low recall
Check:
- HNSW parameters (max_degree, ef_construction, hnsw_ef_search)
- IVF probe parameters (nprobe / ivf_nprobe)
- Segment size and post-compaction rebuild
High latency
Check:
- Cold vs warm query behavior (index loading)
- Overly large hnsw_ef_search
- Parallel scan settings
- BE memory pressure
Data import errors
Common causes:
- dimension mismatch (dim vs actual data)
- null vector values
- invalid array format
8. Hybrid Search Pattern
You can combine ANN with text search by defining both ANN and inverted indexes in the same table, then filtering with text predicates and ordering with vector distance. This is a common approach for production RAG pipelines.
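A sketch of this hybrid pattern; the inverted-index clauses (USING INVERTED, the english parser, MATCH_ANY) follow standard Doris inverted-index syntax, but treat the exact combination as an assumption to validate on your version:

```sql
CREATE TABLE hybrid_docs (
    id BIGINT NOT NULL,
    content TEXT,
    embedding ARRAY<FLOAT> NOT NULL,
    INDEX idx_content (content) USING INVERTED PROPERTIES ("parser" = "english"),
    INDEX idx_embedding (embedding) USING ANN PROPERTIES (
        "index_type" = "hnsw",
        "metric_type" = "l2_distance",
        "dim" = "768"
    )
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 8
PROPERTIES ("replication_num" = "1");

-- Filter with the text predicate, rank by vector distance.
SELECT id, content,
       l2_distance_approximate(embedding, [0.1, 0.2, ...]) AS dist
FROM hybrid_docs
WHERE content MATCH_ANY 'vector search'
ORDER BY dist
LIMIT 10;
```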