IVF On-Disk in Apache Doris
ivf_on_disk is an ANN index type for large-scale vector search in Doris. It keeps IVF inverted-list data on disk, while loading hot data through a dedicated cache at query time. This makes it possible to run high-recall IVF search with a lower steady-state memory footprint than fully in-memory IVF.
Why IVF On-Disk
For pure in-memory IVF, memory usage grows with vector count and dimensions. At tens of millions of vectors, index memory can become the bottleneck. ivf_on_disk is designed for this scenario:
- Retain IVF retrieval behavior (
nlist/nprobetuning model). - Move IVF list payload from mandatory full-memory residency to disk + cache.
- Keep SQL usage and index DDL close to existing ANN workflows.
Scope and User Value
Compared with ivf, ivf_on_disk focuses on a different operating point:
- Better memory elasticity under large data scale.
- More controllable memory-performance tradeoff via cache sizing.
- Similar SQL semantics and index tuning approach for users already familiar with IVF.
It is suitable when memory budget is tight, but brute-force scan is too slow and users still need ANN acceleration.
User-Facing Interfaces
1) Index DDL
Use index_type="ivf_on_disk" in ANN index properties.
CREATE TABLE vec_tbl (
id BIGINT NOT NULL,
embedding ARRAY<FLOAT> NOT NULL,
INDEX idx_emb (embedding) USING ANN PROPERTIES (
"index_type" = "ivf_on_disk",
"metric_type" = "l2_distance",
"dim" = "768",
"nlist" = "1024"
)
) ENGINE=OLAP
DUPLICATE KEY(id)
DISTRIBUTED BY HASH(id) BUCKETS 8
PROPERTIES ("replication_num" = "1");
Notes:
nlistis required for bothivfandivf_on_disk.metric_typesupportsl2_distanceandinner_product.- Query syntax remains the same (
l2_distance_approximate/inner_product_approximate).
2) Query-time IVF parameter
ivf_nprobe continues to be the main query-time parameter:
SET ivf_nprobe = 64;
Larger nprobe usually improves recall while increasing latency.
3) BE cache configuration
ivf_on_disk introduces a dedicated cache for IVF list data:
ann_index_ivf_list_cache_limit(default:70%)ann_index_ivf_list_cache_stale_sweep_time_sec(default:3600)
The percentage value of ann_index_ivf_list_cache_limit is based on process-available memory (mem_limit), not total machine memory.
Observability
For ivf_on_disk, Doris provides dedicated profile counters and BE metrics.
Common profile counters include:
AnnIvfOnDiskLoadCostsAnnIvfOnDiskCacheHitCntAnnIvfOnDiskCacheMissCnt
Common BE metrics include:
ann_ivf_on_disk_fetch_page_costs_msann_ivf_on_disk_fetch_page_cntann_ivf_on_disk_search_costs_msann_ivf_on_disk_search_cntann_ivf_on_disk_cache_hit_cntann_ivf_on_disk_cache_miss_cnt
These metrics help estimate whether the current cache size is appropriate and whether latency comes from cache misses or compute.
Usage Notes
ivf_on_diskfollows the same ANN table constraints as existing ANN indexes (for example, vector column type and key model constraints).- Training quality still depends on data scale and parameter choice (
nlist,ivf_nprobe). ivf_on_disksupports regular ingestion paths such as Stream Load. Use validation on your own workload before production rollout.
Performance Reference
The following results are a reference benchmark snapshot. They show the practical tradeoff between cache coverage, memory usage, and latency under similar recall.
| Scenario | Memory Usage (GB) | AnnIndexIVFListCache Hit Ratio | Max QPS | Recall@100 | Avg Latency (s) | P99 Latency (s) | P95 Latency (s) |
|---|---|---|---|---|---|---|---|
| Brute Force (No Index) | - | - | 0.2922 | 0.0000 | 292.5394 | 307.9490 | 307.9442 |
| IVF In Memory | 32.0 | 100% | 71.8535 | 0.9598 | 0.4167 | 0.5623 | 0.5151 |
| OnDisk Cache 100% | 32.0 | 100% | 72.3649 | 0.9599 | 0.8274 | 1.1236 | 1.0395 |
| OnDisk Cache 79% | 22.0 | 70% | 45.0266 | 0.9599 | 1.9900 | 4.4059 | 3.3568 |
| OnDisk Cache 60% | 16.7 | 55% | 38.3141 | 0.9599 | 2.3281 | 4.0063 | 3.5542 |
How to read this table:
- Under similar recall (~0.96), reducing cache size lowers memory usage but increases tail latency.
- At full cache coverage,
ivf_on_diskkeeps recall close to in-memory IVF with a moderate latency increase. - Cache hit ratio is a key indicator for sizing
ann_index_ivf_list_cache_limit.
Tuning Suggestions
- Start with the same
nlist/ivf_nprobebaseline you use forivf. - Set
ann_index_ivf_list_cache_limitaccording to memory budget, then observe cache hit/miss metrics. - Increase cache first if recall is stable but latency is unstable due to frequent misses.
- Re-tune
nprobeafter cache ratio changes to rebalance recall and latency.