Nvidia cuVS Integration with KDB.AI
This page covers the cuVS integration available in KDB.AI, including system requirements, index configuration, VRAM planning, and search performance tuning.
cuVS (CUDA Vector Search) is Nvidia's CUDA-accelerated library for similarity search, built to handle large-scale vector data efficiently using GPUs. It includes CAGRA (Cuda ANNS Graph), a graph-based algorithm optimized for fast, scalable, and memory-efficient nearest neighbor search on high-dimensional embeddings. By combining KDB.AI's vector database with cuVS, you can run search-intensive workloads – such as semantic search, recommendation, or anomaly detection – at GPU speed.
Use this guide to configure your environment for cuVS, and deploy KDB.AI Server in a GPU-enabled container optimised for large-scale vector operations.
Why use cuVS with KDB.AI?
When building production-grade applications that depend on fast similarity search, integrating CUDA-based acceleration offers several advantages:
- Performance: Use GPU acceleration to handle millions of vectors with high throughput and low latency.
- Scalability: Offload intensive search operations to the GPU while keeping KDB.AI's memory-efficient data structures on the host.
- Efficiency: Leverage CAGRA's compressed graph format to reduce memory usage without compromising search accuracy.
- Compatibility: KDB.AI integrates natively with cuVS and runs inside GPU-enabled Docker containers.
By deploying KDB.AI with cuVS, you can scale vector workloads across millions of records, reduce CPU load, and accelerate inference pipelines – all while using the familiar KDB.AI interface and APIs.
Getting started
Prerequisites
System requirements
Ensure your host system meets these requirements:
| Component | Requirement |
|---|---|
| Operating System | Linux kernel 4.18 or newer (Ubuntu 20.04+, RHEL 8+, CentOS 8+) |
| CPU | x86_64 (AMD64) architecture; up to 24 cores (Standard Edition limit) |
| GPU | Ampere architecture or newer (e.g. A100/H100, 40 GB+ VRAM) recommended for large-scale datasets. VRAM requirements vary with index size. |
| GPU Driver | Nvidia driver ≥ 580 (Linux). Refer to CUDA compatibility for details |
ARM not supported
Unlike the kdbai-db image, which is multi-architecture (supports both ARM and x86_64), kdbai-db-cuvs requires an x86_64 (AMD64) host. ARM architectures (including Apple Silicon) are not supported.
For a full overview, refer to the CUDA compatibility guide.
Software requirements
Nvidia container toolkit
- Version 1.11 or newer – provides GPU support within your container engine.
- Supported container engines are Docker, Containerd, CRI-O, and Podman. Refer to supported platforms for version requirements.
- Install the container toolkit by following Nvidia's official installation guide.
Once installed, verify your installation:
docker run --rm --gpus all nvidia/cuda:13.1.0-base-ubuntu22.04 nvidia-smi
Ensure your GPUs are listed and the driver version meets requirements.
KDB.AI client
The standard kdbai-client Python package works with kdbai-db-cuvs. No additional client is required. Refer to Prerequisites for installation details.
Account and license
New users
If you haven't signed up yet, follow the KDB.AI Server setup guide – this covers registration, Docker login, and obtaining your license key.
Existing users
If you're already running kdbai-db, your existing KDB_LICENSE_B64 (or KDB_K4LICENSE_B64) from your Welcome email works with kdbai-db-cuvs. No additional license is required. Replace kdbai-db with kdbai-db-cuvs in your Docker run command.
In both cases, export your license key before running the container:
export KDB_LICENSE_B64=<your-license-from-welcome-email>
Run the container
Separate image required
This guide uses the dedicated kdbai-db-cuvs image, not the standard kdbai-db image. It is larger because it bundles all required GPU and cuVS dependencies.
Launch the kdbai-db-cuvs container with GPU support enabled:
docker run -d --name kdbai-gpu \
--gpus all \
-p 8081:8081 \
-p 8082:8082 \
-e KDB_LICENSE_B64="$KDB_LICENSE_B64" \
-v "$PWD/vdbdata":/tmp/kx/data \
portal.dl.kx.com/kdbai-db-cuvs
How CAGRA works
CAGRA builds a directed k-nearest neighbor graph (k-NNG) across your vector dataset entirely on the GPU, then runs a parallelized beam search at query time. Graph construction has two phases:
- Initial graph build – seeds the graph using either `IVF_PQ` (default) or `nn_descent`.
- Graph pruning and optimization – removes redundant edges and improves connectivity.
At query time, CAGRA traverses this graph rather than scanning inverted lists, which gives it significantly higher throughput than CPU-based HNSW.
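Purely as a CPU illustration of the traversal idea (not the cuVS implementation, which is heavily parallelized on the GPU), a graph-based beam search can be sketched in NumPy: start from random entry points, repeatedly expand each candidate's graph neighbors, and keep only the closest `beam` nodes:

```python
import numpy as np

def beam_search(graph, data, query, k=5, beam=16, iters=10):
    """Toy CPU beam search over a k-NN graph (illustrative only).

    graph: (n, degree) int array of neighbor ids per node
    data:  (n, dims) float array of vectors
    """
    n = len(data)
    rng = np.random.default_rng(0)
    cand = rng.choice(n, size=beam, replace=False)   # random entry points
    visited = set(cand.tolist())
    for _ in range(iters):
        # Expand the graph neighbors of the current candidate set
        nbrs = np.unique(graph[cand].ravel())
        nbrs = np.array([i for i in nbrs if i not in visited], dtype=np.int64)
        if nbrs.size == 0:
            break
        visited.update(nbrs.tolist())
        pool = np.concatenate([cand, nbrs])
        dist = np.linalg.norm(data[pool] - query, axis=1)
        cand = pool[np.argsort(dist)[:beam]]         # keep closest `beam` nodes
    dist = np.linalg.norm(data[cand] - query, axis=1)
    return cand[np.argsort(dist)[:k]]                # final top-k by distance
```

The key property this sketch shares with CAGRA is that each step touches only a candidate frontier and its graph neighbors, never the full dataset.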
Note
Throughout this documentation, M denotes millions of vectors (for example, 1M = 1 million, 5M = 5 million).
Build algorithms
The build_algo parameter controls how CAGRA seeds the initial graph:
| Algorithm | Description | Best for |
|---|---|---|
| `IVF_PQ` (default) | Uses IVF clustering + Product Quantization to find approximate neighbors. GPU-native, fast. | Datasets > 1M vectors, production use, build-time-sensitive workloads |
| `nn_descent` | Iterative refinement of a random k-NNG. Slower but higher-quality initial graph. | Datasets < ~5M vectors where maximum recall is required |
| `AUTO` | cuVS selects between `IVF_PQ` and `nn_descent` based on dataset size and available GPU memory. | Prototyping and general use |
| `iterative_cagra_search` | Iterative graph build and refinement using CAGRA search. | When build quality matters more than build speed; dedicated GPU workflows |
Key index parameters
Use the following parameters to tune the index further:
| Parameter | Default | Description |
|---|---|---|
| `graph_degree` | 64 | Edges per node in the final graph. Controls the trade-off between recall and memory usage. |
| `intermediate_graph_degree` | 128 | Degree before pruning. Must be ≥ `graph_degree`. |
| `build_algo` | `IVF_PQ` | Graph construction algorithm (refer to Build algorithms). |
Limitations
- The index must fit in GPU memory. CAGRA loads the full index into VRAM – refer to VRAM planning.
- Best suited for batched queries. For single-query workloads, review the search algorithm settings – refer to Search performance tuning.
- Minimum dataset size required. At least `intermediate_graph_degree + 1` rows are needed before the index can build. Use brute-force search for small datasets.
Quickstart
Python
The following example creates a CAGRA-indexed table in KDB.AI, inserts vectors, and runs a similarity search.
import kdbai_client as kdbai
import numpy as np
# Connect to KDB.AI Server
session = kdbai.Session(endpoint="http://localhost:8082")
db = session.database("default")
# Define schema and CAGRA index
schema = [
{"name": "id", "type": "int64"},
{"name": "vector", "type": "float32s"}
]
indexes = [
{
"name": "cagraIndex",
"type": "cagra",
"column": "vector",
"params": {
"dims": 128,
"metric": "L2",
"graph_degree": 32,
"intermediate_graph_degree": 64,
"build_algo": "IVF_PQ" # IVF_PQ (default, recommended for production)
# nn_descent (higher recall, much higher VRAM)
# AUTO
}
}
]
table = db.create_table("embeddings", schema, indexes=indexes)
# Insert vectors – ensure N > intermediate_graph_degree before index builds
n = 10_000
dims = 128
ids = np.arange(n, dtype=np.int64)
vecs = np.random.random((n, dims)).astype(np.float32)
import pandas as pd
table.insert(pd.DataFrame({"id": ids, "vector": list(vecs)}))
# Search – returns top-10 nearest neighbors
query = np.random.random((1, dims)).astype(np.float32)
results = table.search(
vectors={"cagraIndex": query},
n=10
)[0]
print(results)
Refer to Build algorithms for guidance on choosing build_algo.
q / kdb+
The following example connects to KDB.AI Server over qIPC and creates a CAGRA-indexed table from q.
// Connect to KDB.AI Server
`gw set hopen 8082;
// Define schema
dims:10;
eDims:3;
mySchema:flip `name`type!(`id`myDate`time`tag`price`myScalar`text;`j`d`p`s`E`f`C);
// Define CAGRA index parameters
GPUID:0; // GPU device ID (0 = first GPU)
paramsIndex:(`gpuid`dims`metric`intermediate_graph_degree`graph_degree`build_algo`nn_descent_niter)!(GPUID;dims;`CS;128;64;`IVF_PQ;20);
paramsSearch:`max_queries`itopk_size`max_iterations`algo`team_size`search_width`min_iterations`thread_block_size`hashmap_mode`hashmap_min_bitlen`hashmap_max_fill_rate`num_random_samplings!(0;64;0;`SINGLE_CTA;0;1;0;0;`HASH;0;0.5;1);
idx: `name`column`type`params!(enlist `myVectorIndex;enlist `price;enlist `cagra;enlist paramsIndex);
// Create the table
createResult:gw(`createTable;`database`table`schema`indexes!(`default;`test_cagra;mySchema;flip idx));
show createResult; //gw(`listTables;enlist[`database]!enlist `default);
// Insert vectors – accumulate enough rows before CAGRA builds (N > intermediate_graph_degree)
N:100;
t: ([] id:til N; myDate:2015.01.01 + asc N?100j; time:asc N?0p; tag:N?`aaa`bbb`ccc; price:(N;dims)#(N*dims)?1e; myScalar:N?1f; text:{rand[256]?" "} each til N); // price is length-dims
gw(`insertData;`database`table`payload!(`default;`test_cagra;t));
// Query – top-10 nearest neighbors for a single query vector
resQry:(gw(`query;`database`table!(`default;`test_cagra)))[`result];
show resQry;
// Search – top-10 nearest neighbors for a single query vector
q:sums neg[0.5]+dims?1f;
tqry:enlist[`myVectorIndex]!enlist enlist q;
res:first (gw(`search;`database`table`vectors`n`indexParams!(`default;`test_cagra;tqry;10;paramsSearch)))[`result];
show res;
// Delete table
gw(`deleteTable;`database`table!`default`test_cagra);
Index configuration examples
// Minimal index params (all defaults)
paramsIndex: `name`type`column`params!(
`cagraIndex;
`cagra;
`vector;
`dims`metric!(128; `L2)
)
// Full params with IVF_PQ build algorithm (production recommended)
indexParams: `name`type`column`params!(
`cagraIndex;
`cagra;
`vector;
`dims`metric`graph_degree`intermediate_graph_degree`build_algo!(
128; `L2; 32; 64; `IVF_PQ
)
)
// nn_descent – higher recall, high VRAM, dedicated GPU only
indexParams: `name`type`column`params!(
`cagraIndex;
`cagra;
`vector;
`dims`metric`graph_degree`intermediate_graph_degree`build_algo!(
128; `L2; 64; 128; `nn_descent
)
)
VRAM planning
CAGRA holds the full vector dataset and graph structure in GPU memory. Use the following estimates when planning capacity:
| Dataset size | Dims | fp32 dataset | CAGRA index (approx) | IVF_PQ peak build | nn_descent peak build |
|---|---|---|---|---|---|
| 1M vectors | 128 | 0.5 GB | ~0.9 GB | ~3 GB | ~15 GB |
| 10M vectors | 64 | 2.4 GB | ~4.3 GB | ~15 GB | ~78 GB |
| 100M vectors | 128 | 50 GB | ~90 GB | varies | not recommended |
nn_descent VRAM scaling
nn_descent peak VRAM requirements scale aggressively with dataset size. It is not recommended for datasets above ~5M vectors or on shared GPUs. For large-scale datasets, use IVF_PQ instead. Refer to Troubleshooting for details.
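As a rough back-of-the-envelope check (an assumption for capacity planning, not a published cuVS formula), the steady-state index footprint can be approximated as the fp32 vectors plus int32 graph edges; actual peak usage during build is substantially higher, as the table above shows:

```python
def cagra_vram_estimate(n, dims, graph_degree=64):
    """Rough steady-state VRAM estimate in GB for a CAGRA index.

    Approximates fp32 vectors (4 bytes per value) plus int32 neighbor ids
    (4 bytes per edge). Build-time peaks and internal copies are not modeled,
    so treat the result as a lower bound.
    """
    GB = 1024 ** 3
    dataset = n * dims * 4 / GB           # fp32 vector data
    graph = n * graph_degree * 4 / GB     # int32 graph edges
    return {"dataset_gb": round(dataset, 2), "index_gb": round(dataset + graph, 2)}

print(cagra_vram_estimate(1_000_000, 128))   # dataset_gb ≈ 0.48, index_gb ≈ 0.72
```

For 1M × 128-dim vectors this gives roughly 0.5 GB of raw data, in line with the table; the measured index size (~0.9 GB) is larger because of internal structures the estimate omits.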
Search performance tuning
Search algorithm
The algo search parameter controls how CAGRA parallelizes beam search across GPU thread blocks:
| Value | Algorithm | Best for |
|---|---|---|
| `0` | `SINGLE_CTA` | Very small batches (one to a few queries). Does not scale. |
| `1` | `MULTI_CTA` | Recall-sensitive workloads at 1M+ scale. More GPU blocks per query. |
| `2` | `MULTI_KERNEL` | Searches requiring more than 512 neighbors; used automatically when the SINGLE_CTA limit is exceeded. |
| `3` | `AUTO` (recommended) | General use. cuVS selects based on batch size. |
Recall note
At dataset sizes of 1M+, SINGLE_CTA can show measurably lower recall than MULTI_CTA because it runs out of search steps on larger graphs. AUTO optimizes for throughput (not recall) by switching to SINGLE_CTA for large batches. For recall-sensitive workloads at scale, consider setting algo=1 (MULTI_CTA) explicitly.
CAGRA search parameters reference.
Batch size
CAGRA works efficiently with batched queries. Increasing batch size improves GPU utilization and overall throughput. For concurrent workloads with many threads, increasing batch size per thread is more effective than increasing thread count alone.
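As a sketch of the batching pattern (the helper below is illustrative; the `table` and `cagraIndex` names follow the Python quickstart above), queries can be stacked into matrices and submitted in large slices rather than one at a time:

```python
import numpy as np

def batches(queries, batch_size):
    """Yield fixed-size slices of a (N, dims) query matrix."""
    for i in range(0, len(queries), batch_size):
        yield queries[i:i + batch_size]

queries = np.random.random((1000, 128)).astype(np.float32)

for batch in batches(queries, 256):
    # One search call per batch, as in the quickstart:
    # results = table.search(vectors={"cagraIndex": batch}, n=10)
    pass
```

Four calls of ~256 queries each keep the GPU far busier than 1000 single-query calls, since each call amortizes kernel launch and transfer overhead across the whole batch.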
Key search parameters
| Parameter | Description |
|---|---|
| `itopk_size` | Internal candidate list size. Primary recall/speed trade-off. Maximum 512 for SINGLE_CTA. |
| `search_width` | Graph nodes explored in parallel per iteration. |
| `max_queries` | Pre-allocates internal scratch buffers. Set to your expected batch size to avoid per-call allocation overhead. |
| `algo` | Search parallelism strategy (refer to Search algorithm). |
Delete and update
Delete and update are slow operations on CAGRA indexes as they require a full index rebuild. Avoid frequent deletes and updates where possible, and batch them together when required.
Parameter reference
This section lists all CAGRA-specific index and search parameters.
Index parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `dims` | integer | – | Number of dimensions in the vector embeddings. Must match your dataset. |
| `metric` | string | `L2` | Distance metric. Supported values: `L2` (Euclidean), `CS` (cosine similarity). |
| `graph_degree` | integer | 64 | Edges per node in the final graph. Higher values improve recall at the cost of memory. |
| `intermediate_graph_degree` | integer | 128 | Graph degree before pruning. Must be ≥ `graph_degree`. |
| `build_algo` | string | `IVF_PQ` | Algorithm used to seed the initial graph. Refer to Build algorithms. |
| `nn_descent_niter` | integer | 20 | Number of iterations for `nn_descent`. Higher values improve graph quality but increase build time. Only applies when `build_algo=nn_descent`. |
| `gpuid` | integer | 0 | ID of the GPU to use for index construction. |
Search parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `algo` | integer | 3 (`AUTO`) | Search parallelism strategy. Refer to Search algorithm. |
| `itopk_size` | integer | 64 | Internal candidate list size. Primary recall/speed trade-off. Maximum 512 for SINGLE_CTA. |
| `max_queries` | integer | 0 | Pre-allocates internal scratch buffers. Set to your expected batch size to avoid per-call allocation overhead. |
| `max_iterations` | integer | 0 | Maximum search iterations. 0 means no limit. |
| `min_iterations` | integer | 0 | Minimum search iterations before early exit is allowed. |
| `search_width` | integer | 1 | Graph nodes explored in parallel per iteration. |
| `team_size` | integer | 0 | CUDA thread team size per query. 0 lets cuVS select automatically. |
| `thread_block_size` | integer | 0 | CUDA thread block size. 0 lets cuVS select automatically. |
| `hashmap_mode` | string | `HASH` | Internal hashmap implementation. |
| `hashmap_min_bitlen` | integer | 0 | Minimum bit length for the hashmap. |
| `hashmap_max_fill_rate` | float | 0.5 | Maximum hashmap fill rate before resizing. |
| `num_random_samplings` | integer | 1 | Number of random seed candidates for graph traversal. |
Troubleshooting
Minimum dataset size (N=1 crash)
Inserting into a CAGRA-indexed table when the dataset contains fewer rows than intermediate_graph_degree will cause a GPU illegal memory access error. The CUDA context becomes permanently corrupted after this fault – all subsequent GPU operations fail until the container is restarted.
Mitigation: Always accumulate at least intermediate_graph_degree + 1 rows (default: 129) before allowing CAGRA to build. If your workload involves very small datasets, defer CAGRA indexing or use brute-force search until enough rows have been inserted.
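One way to enforce this on the client side is a small buffering helper (hypothetical, not part of kdbai-client): hold rows until the safe threshold is reached, then flush them to the table in a single insert, after which the index exists and direct inserts are safe:

```python
MIN_ROWS = 129  # default intermediate_graph_degree (128) + 1

class InsertBuffer:
    """Hold rows until the CAGRA minimum is met, then flush in one batch."""

    def __init__(self, flush_fn, min_rows=MIN_ROWS):
        self.flush_fn = flush_fn     # e.g. lambda rows: table.insert(to_df(rows))
        self.min_rows = min_rows
        self.pending = []
        self.primed = False          # True once the first safe flush happened

    def add(self, row):
        if self.primed:              # index already built; insert directly
            self.flush_fn([row])
            return
        self.pending.append(row)
        if len(self.pending) >= self.min_rows:
            self.flush_fn(self.pending)
            self.pending = []
            self.primed = True
```

Here `flush_fn` is whatever your application uses to insert rows (for example, wrapping `table.insert` from the quickstart); the helper only guarantees the first insert carries at least `min_rows` rows.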
nn_descent out of memory on large datasets
nn_descent peak VRAM requirements scale aggressively with dataset size. On a shared GPU where another process is holding a loaded index, cudaMemGetInfo() reports per-process free memory – not system-wide free memory – which can mask the true constraint and cause misleading "X GB free" reports.
Mitigation: Use `IVF_PQ` (`build_algo=IVF_PQ`) for datasets above ~5M vectors, or on any GPU shared with other processes. IVF_PQ achieves 97%+ recall at 10M scale and is the production-recommended algorithm.
nn_descent VRAM requirements
nn_descent has significantly higher peak VRAM requirements than IVF_PQ. For a 10M × 64-dimension fp32 dataset, nn_descent peaks at approximately 78 GB, while IVF_PQ peaks at approximately 15 GB.
Mitigation: On shared GPUs where other processes are already holding VRAM, IVF_PQ is the lower-memory alternative and recommended choice.
VRAM data retention
CAGRA currently retains approximately 1.8× the raw vector data size in VRAM during search due to an internal float16 copy. Nvidia has acknowledged this and plans to fix it in a future cuVS release.
Mitigation: If VRAM is constrained, IVF_PQ is the lower-memory alternative.
Summary
After completing this guide, you can:
- Deploy KDB.AI Server with GPU acceleration using the `kdbai-db-cuvs` image.
- Create CAGRA-indexed tables and run high-throughput similarity search at GPU speed.
- Tune index build and search parameters to balance recall, VRAM usage, and throughput for your workload.
Next steps
- Read more on how CAGRA works.
- Check out our RAG pipeline reference solution with Nvidia microservices on GitHub.
- Read the cuVS overview on the NVIDIA blog.
- Explore the cuVS GitHub repository.