
cuVS example

This page covers practical examples for using the cuVS module in KDB-X.

These examples progress from initial table creation and search to index tuning, VRAM planning, and search performance optimization. Each example is self-contained and includes the setup code needed to run it. The examples assume KDB-X is installed with a cuVS-enabled license and the environment is configured as described in GPU Environment Setup and Install KDB-X and the cuVS Module.

Note

CAGRA requires at least intermediate_graph_degree + 1 rows before the index can build (default minimum: 129 rows). For small datasets, use brute-force search until enough rows have been accumulated.
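
For the small row counts where CAGRA cannot yet build, an exact search in pure q is a workable stopgap and needs nothing from the GPU. The sketch below is an illustration only: the helper name knnBrute is ours, not part of the cuVS API, and it uses squared L2 distance.

```q
/ exact k-NN over a small in-memory matrix (squared L2 distance)
/ data: n x d float matrix; qv: query vector; k: neighbors to return
knnBrute:{[data;qv;k]
    d:sum each (data-\:qv) xexp 2;  / squared distance from qv to every row
    idx:k#iasc d;                   / indices of the k closest rows
    (idx;d idx)                     / indices and their distances
    };

smallData:(100;8)#800?1e;
show knnBrute[smallData;first smallData;3]  / first hit is the query row itself (distance 0)
```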


Note

Throughout this documentation, M denotes millions of vectors (for example, 1M = 1 million, 5M = 5 million).

Create a table with a CAGRA index, insert vectors, and search for nearest neighbors.

Prerequisite

This example uses in-memory data generated at runtime. Ensure the cuVS module is loaded.


.cuvs:use`kx.cuvs;

nTrain:10000;
nTest:50;
dims:1024;
k:64; / k may not exceed 64
GPUID:0;
myPath:hsym `$"index001"; / path for persisting the index to disk

/ generate random training vectors
vecs:{(x;y)#(x*y)?1e};
data:vecs[nTrain;dims];
/ sample nTest rows as queries; answer records their row indices
testVecs:data[answer:neg[nTest]?nTrain];

/ define metric and build algorithm
metric:`CS; / available metrics: `L2`CS`IP
build_algo:`nn_descent;

/ set cagra index parameters
cagraParams:(`metric`intermediate_graph_degree`graph_degree`build_algo`nn_descent_niter`gpuid)!(metric;64;64;build_algo;20;GPUID);

/ initialize the index
cIndex:.cuvs.cagra.init[cagraParams];

/ insert the data
.cuvs.cagra.insert[cIndex;data];
totalCount:.cuvs.cagra.count[cIndex];
show totalCount;

/ define search parameters
paramsSearch:`max_queries`itopk_size`max_iterations`algo`team_size`search_width`min_iterations`thread_block_size`hashmap_mode`hashmap_min_bitlen`hashmap_max_fill_rate`num_random_samplings!(0;64;0;`SINGLE_CTA;0;1;0;0;`HASH;0;0.5;1);

/ perform a search
searchResult:.cuvs.cagra.search[cIndex;testVecs;k;paramsSearch];
show searchResult;

Choosing a build algorithm

The build_algo parameter controls how the initial CAGRA graph is seeded. Choose based on your dataset size, VRAM budget, and recall requirements.

Value                     Best for
AUTO (`AUTO_SELECT)       Prototyping or when dataset size is unknown; cuVS selects the algorithm automatically.
IVF_PQ (default)          Production use, datasets > 1M vectors, shared or memory-constrained GPUs.
nn_descent                Maximum recall, dedicated GPU, datasets < ~5M vectors.
iterative_cagra_search    Iterative graph build and refinement using CAGRA's own search algorithm. Experimental – not recommended for production use.

/ choose build algorithm
build_algo:`IVF_PQ; / available algo: `AUTO_SELECT`IVF_PQ`nn_descent`ITERATIVE_CAGRA_SEARCH

/ set cagra parameters
cagraParams:(`metric`intermediate_graph_degree`graph_degree`build_algo`gpuid)!(`CS; 64; 64; build_algo; 0);

/ initialize the index
cIndex:.cuvs.cagra.init[cagraParams];

Defining parameters

/ Minimal index params (all defaults)
paramsIndex:`dims`metric!(128;`L2);

/ IVF_PQ – production recommended
paramsIndex:(`gpuid`dims`metric`intermediate_graph_degree`graph_degree`build_algo)!(0;128;`L2;64;32;`IVF_PQ);

/ nn_descent – higher recall, higher VRAM, dedicated GPU only
paramsIndex:(`gpuid`dims`metric`intermediate_graph_degree`graph_degree`build_algo)!(0;128;`L2;128;64;`nn_descent);

Index tuning

Adjust graph_degree and intermediate_graph_degree to balance recall, build time, and VRAM usage. Higher values improve recall at the cost of more memory and longer build times.

Prerequisite

This example uses in-memory data generated at runtime. Ensure the cuVS module is loaded.

.cuvs:use`kx.cuvs;

nTrain:10000;
dims:128;
GPUID:0;

vecs:{(x;y)#(x*y)?1e};
data:vecs[nTrain;dims];

/ graph_degree controls the recall/memory trade-off
/ higher graph_degree = better recall, more VRAM
/ intermediate_graph_degree must be >= graph_degree
/ nn_descent_niter only applies when build_algo is `nn_descent

/ low memory – faster build, lower recall
cagraParams:`metric`intermediate_graph_degree`graph_degree`build_algo`gpuid!(`L2;64;32;`IVF_PQ;GPUID);
cIndex:.cuvs.cagra.init[cagraParams];
.cuvs.cagra.insert[cIndex;data];

/ balanced – default recommended settings
cagraParams:`metric`intermediate_graph_degree`graph_degree`build_algo`gpuid!(`L2;128;64;`IVF_PQ;GPUID);
cIndex:.cuvs.cagra.init[cagraParams];
.cuvs.cagra.insert[cIndex;data];

/ maximum recall – dedicated GPU only, higher VRAM
cagraParams:`metric`intermediate_graph_degree`graph_degree`build_algo`nn_descent_niter`gpuid!(`L2;128;64;`nn_descent;20;GPUID);
cIndex:.cuvs.cagra.init[cagraParams];
.cuvs.cagra.insert[cIndex;data];

show .cuvs.cagra.count[cIndex];

VRAM planning

CAGRA holds the full vector dataset and graph structure in GPU memory. Use the following estimates when planning capacity:

Dataset size   Dims   fp32 dataset   CAGRA index (approx)   IVF_PQ peak build   nn_descent peak build
1M vectors     128    0.5 GB         ~0.9 GB                ~3 GB               ~15 GB
10M vectors    64     2.4 GB         ~4.3 GB                ~15 GB              ~78 GB
100M vectors   128    50 GB          ~90 GB                 varies              not recommended

Note

CAGRA currently retains approximately 1.8× the raw vector data size in VRAM during search due to an internal float16 copy. If VRAM is constrained, use IVF_PQ as the build algorithm.

Prerequisite

This example uses in-memory data generated at runtime. Ensure the cuVS module is loaded.

.cuvs:use`kx.cuvs;

/ VRAM estimates for fp32 vectors
/ fp32 dataset  : nVectors * dims * 4 bytes
/ CAGRA index   : ~1.8x fp32 dataset size (includes graph structure + internal float16 copy)
/ IVF_PQ build  : additional peak overhead during build only (released after build completes)
/ nn_descent    : significantly higher peak – avoid on shared GPUs

/ helper: estimate VRAM in GB for a given dataset
cagraVramEstimate:{[nVectors;dims]
    fp32GB  : (nVectors * dims * 4) % 1024 xexp 3;
    indexGB : fp32GB * 1.8;
    `fp32_dataset`cagra_index_approx!(fp32GB;indexGB)
    };

/ examples from planning table
show cagraVramEstimate[1000000;128];     / 1M x 128 – fp32: 0.5GB, index: ~0.9GB
show cagraVramEstimate[10000000;64];     / 10M x 64  – fp32: 2.4GB, index: ~4.3GB
show cagraVramEstimate[100000000;128];   / 100M x 128 – fp32: 50GB, index: ~90GB

/ practical check: build a small index and verify count before scaling
dims:128;
GPUID:0;
nSmall:1000;

vecs:{(x;y)#(x*y)?1e};
data:vecs[nSmall;dims];

cagraParams:`metric`intermediate_graph_degree`graph_degree`build_algo`gpuid!(`L2;128;64;`IVF_PQ;GPUID);
cIndex:.cuvs.cagra.init[cagraParams];
.cuvs.cagra.insert[cIndex;data];
-1 "vectors indexed: ",string .cuvs.cagra.count[cIndex];

Search performance tuning

Tune the CAGRA search algorithm and parameters to balance recall, throughput, and latency for your workload.

Search algorithm

The algo search parameter controls how CAGRA parallelizes beam search across GPU thread blocks:

Value   Algorithm      Best for
0       SINGLE_CTA     Very small batches (1–few queries). Does not scale.
1       MULTI_CTA      Recall-sensitive workloads at 1M+ scale.
2       MULTI_KERNEL
3       AUTO (recommended)   General use; cuVS selects based on batch size.

Note

At dataset sizes of 1M+, SINGLE_CTA can show measurably lower recall than MULTI_CTA because it exhausts its search steps on larger graphs. AUTO optimizes for throughput rather than recall. For recall-sensitive workloads at scale, set algo to `MULTI_CTA explicitly.

Key search parameters

Parameter Description
itopk_size Internal candidate list size. Primary recall/speed trade-off. Max 512 for SINGLE_CTA.
search_width Graph nodes explored in parallel per iteration.
max_queries Pre-allocates internal scratch buffers. Set to your expected batch size to avoid per-call allocation overhead.
algo Search parallelism strategy (refer to the Search algorithm section above).

Prerequisite

This example uses in-memory data generated at runtime. Ensure the cuVS module is loaded.

.cuvs:use`kx.cuvs;

nTrain:10000;
nTest:10;
dims:128;
k:64;
GPUID:0;

vecs:{(x;y)#(x*y)?1e};
data:vecs[nTrain;dims];
testVecs:data[answer:neg[nTest]?nTrain];

cagraParams:`metric`intermediate_graph_degree`graph_degree`build_algo`gpuid!(`L2;128;64;`IVF_PQ;GPUID);
cIndex:.cuvs.cagra.init[cagraParams];
.cuvs.cagra.insert[cIndex;data];

/ default params – SINGLE_CTA, itopk_size 64
/ suitable for small batches, low latency
defaultParams:`max_queries`itopk_size`max_iterations`algo`team_size`search_width`min_iterations`thread_block_size`hashmap_mode`hashmap_min_bitlen`hashmap_max_fill_rate`num_random_samplings!(0;64;0;`SINGLE_CTA;0;1;0;0;`HASH;0;0.5;1);

/ high recall – MULTI_CTA with larger itopk_size
/ better recall at 1M+ scale; k must not exceed itopk_size
highRecallParams:`max_queries`itopk_size`max_iterations`algo`team_size`search_width`min_iterations`thread_block_size`hashmap_mode`hashmap_min_bitlen`hashmap_max_fill_rate`num_random_samplings!(0;128;0;`MULTI_CTA;0;1;0;0;`HASH;0;0.5;1);

/ high throughput – AUTO algo, pre-allocated batch buffer
/ set max_queries to expected batch size to avoid per-call allocation
highThroughputParams:`max_queries`itopk_size`max_iterations`algo`team_size`search_width`min_iterations`thread_block_size`hashmap_mode`hashmap_min_bitlen`hashmap_max_fill_rate`num_random_samplings!(nTest;64;0;`AUTO;0;1;0;0;`AUTO_HASH;0;0.5;1);

/ compare timing across configs
configs:`default`high_recall`high_throughput;
params:(defaultParams;highRecallParams;highThroughputParams);

timeIt:{[idx;vecs;kk;p]
    t:.z.p;
    .cuvs.cagra.search[idx;vecs;kk;p];
    "j"$(.z.p-t)%0D00:00:00.000001
    };

timings:timeIt[cIndex;testVecs;k;] each params;
-1 "timings (microseconds):";
show configs!timings;

Batch size

CAGRA works efficiently with batched queries. Increasing batch size improves GPU utilization and overall throughput. For concurrent workloads with many threads, increasing batch size per thread is more effective than increasing thread count alone.
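As a sketch of the effect, reusing cIndex, testVecs, k, and defaultParams from the search performance tuning example above, compare one batched call with a per-query loop. The loop issues a separate call (and GPU launch) per query, which hurts utilization; this assumes .cuvs.cagra.search accepts a single-row query matrix per call.

```q
/ batched: all queries submitted in one search call
t0:.z.p;
batched:.cuvs.cagra.search[cIndex;testVecs;k;defaultParams];
-1 "batched (us): ",string "j"$(.z.p-t0)%0D00:00:00.000001;

/ per-query loop: one call per query – poor GPU utilization
t0:.z.p;
looped:.cuvs.cagra.search[cIndex;;k;defaultParams] each enlist each testVecs;
-1 "looped  (us): ",string "j"$(.z.p-t0)%0D00:00:00.000001;
```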


Next steps