
Vector Index Practical Guide

This document is intended for users who need to deploy vector retrieval (ANN) in Apache Doris. It provides a complete operational path from table design to query tuning and troubleshooting. If you are evaluating how to migrate semantic search, RAG, or recommendation recall to Doris, you can follow the steps in this document directly.

Quick Navigation

| What you want to do | Section |
|---|---|
| Confirm whether the Doris version and table model meet the requirements | Prerequisites and Limitations |
| Choose between HNSW and IVF index | Applicable Scenarios and Index Selection |
| Run the full table creation -> ingestion -> query workflow | End-to-End Operational Workflow |
| Sort by cosine similarity | Using Cosine Similarity |
| Increase recall / reduce latency | Query and Build Tuning |
| Troubleshoot index not taking effect / low recall / ingestion failures | Common Troubleshooting |

Applicable Scenarios and Index Selection

Starting from Apache Doris 4.x, ANN (Approximate Nearest Neighbor) vector indexes are supported. Common deployment scenarios include:

  • Semantic search
  • RAG retrieval augmentation
  • Recommendation system recall
  • Image or multimodal retrieval
  • Anomaly detection

Index Type Comparison

| Index type | Recall | Online query performance | Build speed | Memory usage | Applicable scenario |
|---|---|---|---|---|---|
| hnsw | High | Good | Slow | Higher | Online low-latency retrieval |
| ivf | Medium | Better | Fast | More efficient | Large-scale datasets |
| ivf_on_disk | Medium | Medium | Fast | Most efficient | Ultra-large scale, memory-constrained |

Supported Distance Functions

| Function | Sort direction | Description |
|---|---|---|
| l2_distance_approximate | ORDER BY ... ASC | Euclidean distance, smaller distance means more similar |
| inner_product_approximate | ORDER BY ... DESC | Inner product, larger value means more similar |

Cosine similarity cannot be configured directly via metric_type="cosine". It must be implemented by normalizing the vectors and using inner product. For details, see Using Cosine Similarity.


Prerequisites and Limitations

Before using ANN indexes, confirm the following conditions:

| Check item | Requirement |
|---|---|
| Doris version | >= 4.0.0 |
| Table model | Only DUPLICATE KEY is supported |
| Vector column type | ARRAY<FLOAT> NOT NULL |
| Dimension consistency | The dimension of ingested vectors must match the index dim |

Minimal table creation example:

CREATE TABLE document_vectors (
    id BIGINT NOT NULL,
    embedding ARRAY<FLOAT> NOT NULL
)
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 8
PROPERTIES ("replication_num" = "1");

End-to-End Operational Workflow

The complete workflow consists of 4 steps: create table -> configure index -> ingest data -> build and monitor index.

Step 1: Create the Vector Table

There are two ways to create the table. Choose based on the data scale and ingestion mode:

| Method | Pros | Cons | Recommended scenario |
|---|---|---|---|
| Define the ANN index directly when creating the table | Queryable as soon as data is written | Slower ingestion | Small scale, streaming ingestion |
| Create the table and ingest data first, then CREATE INDEX + BUILD INDEX | Faster ingestion, controllable build timing | Requires an extra build step | Large-scale batch ingestion |

Example of defining an ANN index directly when creating the table:

CREATE TABLE document_vectors (
    id BIGINT NOT NULL,
    title VARCHAR(500),
    content TEXT,
    category VARCHAR(100),
    embedding ARRAY<FLOAT> NOT NULL,
    INDEX idx_embedding (embedding) USING ANN PROPERTIES (
        "index_type" = "hnsw",
        "metric_type" = "l2_distance",
        "dim" = "768"
    )
)
ENGINE = OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 8
PROPERTIES ("replication_num" = "1");

Step 2: Configure Vector Index Parameters

Common parameters:

| Parameter | Values | Description |
|---|---|---|
| index_type | hnsw / ivf / ivf_on_disk | Index type |
| metric_type | l2_distance / inner_product | Distance metric |
| dim | Integer | Vector dimension |
| quantizer | flat / sq8 / sq4 / pq | Quantization method (optional) |

HNSW-specific parameters:

| Parameter | Default | Description |
|---|---|---|
| max_degree | 32 | Maximum number of neighbors per node |
| ef_construction | 40 | Search width during build |

IVF-specific parameters (shared by ivf and ivf_on_disk):

| Parameter | Default | Description |
|---|---|---|
| nlist | 1024 | Number of cluster centroids |

Example of creating the index after the table:

CREATE INDEX idx_embedding ON document_vectors (embedding) USING ANN PROPERTIES (
    "index_type" = "hnsw",
    "metric_type" = "l2_distance",
    "dim" = "768",
    "max_degree" = "64",
    "ef_construction" = "128"
);

Step 3: Ingest Data

Recommended order for batch ingestion:

  1. Create the table without an ANN index
  2. Batch-write the data (Stream Load / S3 TVF / SDK)
  3. Create and build the index in one pass after ingestion is complete

In production environments, this batch mode is preferred. It can significantly reduce ingestion time.
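
As a sketch of step 2, the helper below groups rows into newline-delimited JSON payloads that a batch loader could send one request at a time. The function name `iter_json_line_batches` and the batch size are hypothetical, and the sketch assumes the load job is configured to read one JSON object per line. The request itself (endpoint, headers, authentication) is deliberately left out.

```python
import json
from typing import Iterable, Iterator


def iter_json_line_batches(rows: Iterable[dict], batch_size: int) -> Iterator[str]:
    """Yield newline-delimited JSON payloads holding at most batch_size rows each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    lines = []
    for row in rows:
        lines.append(json.dumps(row))
        if len(lines) == batch_size:
            yield "\n".join(lines)
            lines = []
    if lines:
        # Flush the final, possibly smaller, batch.
        yield "\n".join(lines)


# Illustrative rows with 3-dimensional vectors; real embeddings must match the index dim.
rows = [{"id": i, "embedding": [0.1 * i, 0.2 * i, 0.3 * i]} for i in range(5)]
batches = list(iter_json_line_batches(rows, batch_size=2))
```

With five rows and a batch size of 2, this yields three payloads, the last one holding a single row.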

Step 4: Build the Index and Monitor

If you created the index after ingesting the data, trigger the build manually and then check its progress:

BUILD INDEX idx_embedding ON document_vectors;

SHOW BUILD INDEX WHERE TableName = "document_vectors";

Build states include: PENDING, RUNNING, FINISHED, CANCELLED.
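
A build job can be monitored from application code with a simple polling loop. The sketch below is one possible shape, not a Doris client API: `fetch_state` is a hypothetical callable that you would implement by running SHOW BUILD INDEX and returning the job state as a string, and the poll interval and timeout are arbitrary defaults.

```python
import time
from typing import Callable

# Terminal states taken from the list above; PENDING and RUNNING mean "keep waiting".
TERMINAL_STATES = {"FINISHED", "CANCELLED"}


def wait_for_build(fetch_state: Callable[[], str],
                   poll_seconds: float = 5.0,
                   timeout_seconds: float = 3600.0) -> str:
    """Poll until the build reaches a terminal state or the timeout expires."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        state = fetch_state()
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"build still in state {state!r} after {timeout_seconds}s")
        time.sleep(poll_seconds)


# Stand-in for a real status query: replays a fixed sequence of states.
_states = iter(["PENDING", "RUNNING", "RUNNING", "FINISHED"])
final_state = wait_for_build(lambda: next(_states), poll_seconds=0.0)
```

The caller should treat a returned CANCELLED as a failure and only start serving ANN queries once the state is FINISHED.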


Query Patterns

-- Top-K search: return the 10 nearest vectors
SELECT id, title,
       l2_distance_approximate(embedding, [0.1, 0.2, ...]) AS dist
FROM document_vectors
ORDER BY dist
LIMIT 10;

-- Range search: return all vectors within a distance threshold
SELECT id, title
FROM document_vectors
WHERE l2_distance_approximate(embedding, [0.1, 0.2, ...]) < 0.5;
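
In practice the query vector comes from an embedding model in application code and has to be rendered into the SQL text. The following sketch shows one way to do that for the top-K pattern. The helper names `to_array_literal` and `build_topk_query` are hypothetical, and the table and column names are the ones used in this guide.

```python
import math
from typing import Sequence


def to_array_literal(vector: Sequence[float]) -> str:
    """Render a vector as an array literal such as [0.1, 0.2, 0.3]."""
    values = [float(x) for x in vector]
    if not all(math.isfinite(v) for v in values):
        raise ValueError("vector contains NaN or infinity")
    return "[" + ", ".join(repr(v) for v in values) + "]"


def build_topk_query(vector: Sequence[float], k: int) -> str:
    """Build the top-K query shown in this section for a given query vector."""
    literal = to_array_literal(vector)
    return (
        "SELECT id, title,\n"
        f"       l2_distance_approximate(embedding, {literal}) AS dist\n"
        "FROM document_vectors\n"
        "ORDER BY dist\n"
        f"LIMIT {int(k)};"
    )


print(build_topk_query([0.1, 0.2, 0.3], k=10))
```

Converting every element with `float()` and `k` with `int()` keeps arbitrary strings out of the generated SQL.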

Hybrid Search with Filter Conditions

SELECT id, title,
l2_distance_approximate(embedding, [0.1, 0.2, ...]) AS dist
FROM document_vectors
WHERE category = 'AI'
ORDER BY dist
LIMIT 10;

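In hybrid filtering scenarios, Doris uses a pre-filtering strategy: the filter condition is applied before the vector search, so the top-K results are drawn only from rows that satisfy it. This balances performance and recall.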


Using Cosine Similarity

ANN indexes do not support configuring metric_type="cosine" directly. If your business needs to sort by cosine similarity, use the following pattern:

  1. Apply L2 normalization to vectors before ingestion (convert them to unit vectors)
  2. Use metric_type="inner_product" when creating the ANN index
  3. Use inner_product_approximate(...) in queries, and sort by ORDER BY ... DESC

Principle:

  • cos(x, y) = (x · y) / (||x|| · ||y||)
  • After normalization, ||x|| = ||y|| = 1, so cos(x, y) = x · y

In a unit-vector space, cosine sorting is equivalent to inner product sorting.
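
The normalization in step 1 happens on the client side before ingestion. A minimal sketch in plain Python follows; the function names are illustrative, and a production pipeline would more likely use a vectorized library. The final check confirms the principle above on a small example.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (L2 norm equal to 1)."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two vectors of equal length."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity computed directly from the definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


x = [3.0, 4.0]
y = [1.0, 0.0]
x_unit, y_unit = l2_normalize(x), l2_normalize(y)

# The inner product of the unit vectors matches the cosine of the originals.
assert math.isclose(dot(x_unit, y_unit), cosine(x, y))
```

The query vector must be normalized the same way as the stored vectors; otherwise the inner product is no longer the cosine similarity.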


Query and Build Tuning

Query Parameters

| Index type | Tuning parameter | Effect |
|---|---|---|
| HNSW | hnsw_ef_search | Larger value yields higher recall and higher latency |
| IVF | nprobe or ivf_nprobe (depending on version) | Larger value yields higher recall |

SET hnsw_ef_search = 100;
SET nprobe = 128;
SET optimize_index_scan_parallelism = true;

Build Recommendations

  1. For large-scale data, run compaction first, then trigger the final index build
  2. Keep segment size under control, because overly large segments can reduce recall
  3. Run A/B benchmarks on multiple parameter sets against the same dataset
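
For the A/B benchmarks in item 3, recall is usually measured by comparing the IDs returned by the ANN query against the IDs from an exact (brute-force) search for the same query vector. The sketch below assumes you have already collected both ID lists; `recall_at_k` and the sample IDs are illustrative only.

```python
from typing import Hashable, Iterable


def recall_at_k(approx_ids: Iterable[Hashable], exact_ids: Iterable[Hashable]) -> float:
    """Fraction of the exact top-K results that the ANN query also returned."""
    exact = set(exact_ids)
    if not exact:
        raise ValueError("exact_ids must not be empty")
    hits = len(exact & set(approx_ids))
    return hits / len(exact)


# Made-up IDs for one query under two parameter sets (not measured results).
exact_top4 = [11, 42, 7, 93]
ann_low_ef = [11, 42, 58, 60]
ann_high_ef = [11, 42, 7, 60]

recall_low = recall_at_k(ann_low_ef, exact_top4)
recall_high = recall_at_k(ann_high_ef, exact_top4)
```

Averaging this value over a representative set of query vectors, alongside the measured latency, gives a comparable score for each parameter set.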

Capacity Estimation

  • Rough vector memory formula: dim * 4 bytes * row_count
  • Add the overhead of the ANN index structure on top of this
  • Reserve a memory budget for non-vector columns and execution operators
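
The rough formula can be turned into a quick calculation. The sketch below covers only the raw vector data at 4 bytes per value; it does not model index overhead or quantization, and the 768-dimension, 10M-row input is just an example.

```python
def raw_vector_bytes(dim: int, row_count: int, bytes_per_value: int = 4) -> int:
    """Raw storage for the vectors alone: dim * 4 bytes * row_count."""
    return dim * bytes_per_value * row_count


size = raw_vector_bytes(dim=768, row_count=10_000_000)
print(f"{size:,} bytes = {size / 1024**3:.1f} GiB")
# prints "30,720,000,000 bytes = 28.6 GiB"
```

Treat the result as a lower bound and add the index structure and operator memory on top, as listed above.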

For 10M / 100M scale capacity reference on single-node and distributed deployments, see Large-Scale Performance Test.


Index Management

Common management SQL:

-- View the index list
SHOW INDEX FROM document_vectors;

-- View data scale
SHOW DATA ALL FROM document_vectors;

-- Drop the index
ALTER TABLE document_vectors DROP INDEX idx_embedding;

To adjust index parameters, the recommended approach is to drop the old index and rebuild it.


Common Troubleshooting

Index Not Taking Effect

Investigate in this order:

  1. Whether the index exists: run SHOW INDEX
  2. Whether the index has finished building: run SHOW BUILD INDEX
  3. Whether the query uses a distance function with the _approximate suffix

Low Recall

| Investigation direction | Recommendation |
|---|---|
| HNSW parameters | Increase max_degree, ef_construction, hnsw_ef_search |
| IVF probe parameters | Increase nprobe / ivf_nprobe |
| Segment scale | Rebuild the index after compaction |

High Query Latency

| Investigation direction | Recommendation |
|---|---|
| Cold query vs. hot query | Index loading time differs. You can warm up after service startup |
| hnsw_ef_search too large | Reduce it appropriately to lower latency |
| Parallel scan not enabled | Set optimize_index_scan_parallelism = true |
| BE memory pressure | Check BE memory levels and GC behavior |

Ingestion Failure

| Common cause | Recommendation |
|---|---|
| Dimension mismatch | Check that the ingested vector dimension matches the index dim |
| NULL appears in the vector column | Fill or filter out NULL on the business side |
| Invalid vector array format | Validate the JSON / Stream Load payload format |
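
Most of these failures can be caught on the client before the data is sent. The following sketch checks a single vector against the three causes in the table. The function name `vector_problem` is hypothetical, and the exact checks Doris performs during ingestion may differ.

```python
import math
from typing import Optional, Sequence


def vector_problem(vector: Optional[Sequence[float]], dim: int) -> Optional[str]:
    """Return a description of the first problem found, or None if the vector looks valid."""
    if vector is None:
        return "vector is NULL"
    if len(vector) != dim:
        return f"dimension mismatch: expected {dim}, got {len(vector)}"
    for i, x in enumerate(vector):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(x, bool) or not isinstance(x, (int, float)):
            return f"element {i} is not a number"
        if not math.isfinite(x):
            return f"element {i} is not finite"
    return None


# Example with dim = 3; a real pipeline would pass the index dim (768 in this guide).
candidates = [[0.1, 0.2, 0.3], [0.1, 0.2], None, [0.1, float("nan"), 0.3]]
problems = [vector_problem(v, dim=3) for v in candidates]
```

Rows for which the function returns a message can be logged and skipped, so that one bad record does not fail the whole batch.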

FAQ

Q1: Can ANN indexes be used on UNIQUE KEY or AGGREGATE KEY tables?

No. ANN indexes only support the DUPLICATE KEY model.

Q2: Can ANN indexes and inverted indexes be created at the same time?

Yes. You can create both an ANN index and an inverted index on the same table. Combining text filtering with vector sorting enables the hybrid retrieval pattern that is common in online RAG.

Q3: What if I need to use cosine similarity?

ANN does not support metric_type="cosine". Normalize the vectors and use inner_product, and the effect is equivalent. For details, see Using Cosine Similarity.

Q4: What if BUILD INDEX is stuck in RUNNING?

Check the progress with SHOW BUILD INDEX. Building a large table itself takes a long time, so first confirm whether it is still building normally. If there is no progress for a long time, check the BE memory and disk status.

Q5: How do I adjust ANN index parameters?

ANN index parameters do not support in-place modification. The recommendation is to DROP INDEX first, then CREATE INDEX with the new parameters, and finally BUILD INDEX.