Nvidia cuVS Integration with KDB.AI
This page covers the cuVS integration available in KDB.AI, including system requirements, index configuration, VRAM planning, and search performance tuning.
cuVS (CUDA Vector Search) is Nvidia's CUDA-accelerated library for similarity search, built to handle large-scale vector data efficiently using GPUs. It includes CAGRA (Cuda ANNS Graph), a graph-based algorithm optimized for fast, scalable, and memory-efficient nearest neighbor search on high-dimensional embeddings. By combining KDB.AI's vector database with cuVS, you can run search-intensive workloads – such as semantic search, recommendation, or anomaly detection – at GPU speed.
Use this guide to configure your environment for cuVS, and deploy KDB.AI Server in a GPU-enabled container optimised for large-scale vector operations.
Why use cuVS with KDB.AI?
When building production-grade applications that depend on fast similarity search, integrating CUDA-based acceleration offers several advantages:
- Performance: Use GPU acceleration to handle millions of vectors with high throughput and low latency.
- Scalability: Offload intensive search operations to the GPU while keeping KDB.AI's memory-efficient data structures on the host.
- Efficiency: Leverage CAGRA's compressed graph format to reduce memory usage without compromising search accuracy.
- Compatibility: KDB.AI integrates natively with cuVS and runs inside GPU-enabled Docker containers.
By deploying KDB.AI with cuVS, you can scale vector workloads across millions of records, reduce CPU load, and accelerate inference pipelines – all while using the familiar KDB.AI interface and APIs.
Getting started
Prerequisites
System requirements
Ensure your host system meets these requirements:
| Component | Requirement |
|---|---|
| Operating System | Linux kernel 4.18 or newer (Ubuntu 20.04+, RHEL 8+, CentOS 8+) |
| CPU | x86_64 (AMD64) architecture; up to 24 cores (Standard Edition limit) |
| GPU | Ampere architecture or newer (e.g. A100/H100, 40 GB+ VRAM) recommended for large-scale datasets. VRAM requirements vary with index size. |
| GPU Driver | Nvidia driver ≥ 580 (Linux). Refer to CUDA compatibility for details |
ARM not supported
Unlike the kdbai-db image, which is multi-architecture (supports both ARM and x86_64), kdbai-db-cuvs requires an x86_64 (AMD64) host. ARM architectures (including Apple Silicon) are not supported.
For a full overview, refer to the CUDA compatibility guide.
Software requirements
Nvidia container toolkit
- Version 1.11 or newer – provides GPU support within your container engine.
- Supported container engines are Docker, Containerd, CRI-O, and Podman. Refer to supported platforms for version requirements.
- Install the container toolkit by following Nvidia's official installation guide.
Once installed, verify your installation:
docker run --rm --gpus all nvidia/cuda:13.1.0-base-ubuntu22.04 nvidia-smi
Ensure your GPUs are listed and the driver version meets requirements.
KDB.AI client
The standard kdbai-client Python package works with kdbai-db-cuvs. No additional client is required. Refer to Prerequisites for installation details.
Account and license
New users
If you haven't signed up yet, follow the KDB.AI Server setup guide – this covers registration, Docker login, and obtaining your license key.
Existing users
If you're already running kdbai-db, your existing KDB_LICENSE_B64 (or KDB_K4LICENSE_B64) from your Welcome email works with kdbai-db-cuvs. No additional license is required. Replace kdbai-db with kdbai-db-cuvs in your Docker run command.
In both cases, export your license key before running the container:
export KDB_LICENSE_B64=<your-license-from-welcome-email>
Run the container
Separate image required
This guide uses the dedicated kdbai-db-cuvs image, not the standard kdbai-db image. It is larger because it bundles all required GPU and cuVS dependencies.
Launch the kdbai-db-cuvs container with GPU support enabled:
docker run -d --name kdbai-gpu \
--gpus all \
-p 8081:8081 \
-p 8082:8082 \
-e KDB_LICENSE_B64="$KDB_LICENSE_B64" \
-v "$PWD/vdbdata":/tmp/kx/data \
portal.dl.kx.com/kdbai-db-cuvs
How CAGRA works
CAGRA builds a directed k-nearest neighbor graph (k-NNG) across your vector dataset entirely on the GPU, then runs a parallelized beam search at query time. Graph construction has two phases:
- Initial graph build – seeds the graph using either `IVF_PQ` (default) or `nn_descent`.
- Graph pruning and optimization – removes redundant edges and improves connectivity.
At query time, CAGRA traverses this graph rather than scanning inverted lists, which gives it significantly higher throughput than CPU-based HNSW.
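Purely as a CPU illustration of the traversal idea (not the cuVS implementation, which is heavily parallelized on the GPU), a graph-based beam search can be sketched in NumPy: start from random entry points, repeatedly expand each candidate's graph neighbors, and keep only the closest `beam` nodes:

```python
import numpy as np

def beam_search(graph, data, query, k=5, beam=16, iters=10):
    """Toy CPU beam search over a k-NN graph (illustrative only).

    graph: (n, degree) int array of neighbor ids per node
    data:  (n, dims) float array of vectors
    """
    n = len(data)
    rng = np.random.default_rng(0)
    cand = rng.choice(n, size=beam, replace=False)   # random entry points
    visited = set(cand.tolist())
    for _ in range(iters):
        # Expand the graph neighbors of the current candidate set
        nbrs = np.unique(graph[cand].ravel())
        nbrs = np.array([i for i in nbrs if i not in visited], dtype=np.int64)
        if nbrs.size == 0:
            break
        visited.update(nbrs.tolist())
        pool = np.concatenate([cand, nbrs])
        dist = np.linalg.norm(data[pool] - query, axis=1)
        cand = pool[np.argsort(dist)[:beam]]         # keep closest `beam` nodes
    dist = np.linalg.norm(data[cand] - query, axis=1)
    return cand[np.argsort(dist)[:k]]                # final top-k by distance
```

The key property this sketch shares with CAGRA is that each step touches only a candidate frontier and its graph neighbors, never the full dataset.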
Note
Throughout this documentation, M denotes millions of vectors (for example, 1M = 1 million, 5M = 5 million).
Build algorithms
The build_algo parameter controls how CAGRA seeds the initial graph:
| Algorithm | Description | Best for |
|---|---|---|
| `IVF_PQ` (default) | Uses IVF clustering + Product Quantization to find approximate neighbors. GPU-native, fast. | Datasets > 1M vectors, production use, build-time-sensitive workloads |
| `nn_descent` | Iterative refinement of a random k-NNG. Slower but higher-quality initial graph. | Datasets < ~5M vectors where maximum recall is required |
| `AUTO` | cuVS selects between `IVF_PQ` and `nn_descent` based on dataset size and available GPU memory. | Prototyping and general use |
| `iterative_cagra_search` | Iterative graph build and refinement using CAGRA search. | When build quality matters more than build speed; dedicated GPU workflows |
Key index parameters
Use the following parameters to tune the index further:
| Parameter | Default | Description |
|---|---|---|
| `graph_degree` | 64 | Edges per node in the final graph. Controls the trade-off between recall and memory usage. |
| `intermediate_graph_degree` | 128 | Degree before pruning. Must be ≥ `graph_degree`. |
| `build_algo` | `IVF_PQ` | Graph construction algorithm (refer to Build algorithms). |
Limitations
- The index must fit in GPU memory. CAGRA loads the full index into VRAM – refer to VRAM planning.
- Best suited for batched queries. For single-query workloads, review the search algorithm settings – refer to Search performance tuning.
- Minimum dataset size required. At least `intermediate_graph_degree + 1` rows are needed before the index can build. Use brute-force search for small datasets.
Quickstart
Python
The following example creates a CAGRA-indexed table in KDB.AI, inserts vectors, and runs a similarity search.
import kdbai_client as kdbai
import numpy as np
# Connect to KDB.AI Server
session = kdbai.Session(endpoint="http://localhost:8082")
db = session.database("default")
# Define schema and CAGRA index
schema = [
{"name": "id", "type": "int64"},
{"name": "vector", "type": "float32s"}
]
indexes = [
{
"name": "cagraIndex",
"type": "cagra",
"column": "vector",
"params": {
"dims": 128,
"metric": "L2",
"graph_degree": 32,
"intermediate_graph_degree": 64,
"build_algo": "IVF_PQ" # IVF_PQ (default, recommended for production)
# nn_descent (higher recall, much higher VRAM)
# AUTO
}
}
]
table = db.create_table("embeddings", schema, indexes=indexes)
# Insert vectors – ensure N > intermediate_graph_degree before index builds
n = 10_000
dims = 128
ids = np.arange(n, dtype=np.int64)
vecs = np.random.random((n, dims)).astype(np.float32)
import pandas as pd
table.insert(pd.DataFrame({"id": ids, "vector": list(vecs)}))
# Search – returns top-10 nearest neighbors
query = np.random.random((1, dims)).astype(np.float32)
results = table.search(
vectors={"cagraIndex": query},
n=10
)[0]
print(results)
Refer to Build algorithms for guidance on choosing build_algo.
q / kdb+
The following example connects to KDB.AI Server over qIPC and creates a CAGRA-indexed table from q.
// Connect to KDB.AI Server
`gw set hopen 8082;
// Define schema
dims:10;
eDims:3;
mySchema:flip `name`type!(`id`myDate`time`tag`price`myScalar`text;`j`d`p`s`E`f`C);
// Define CAGRA index parameters
GPUID:0; // GPU device ID (0 = first GPU)
paramsIndex:(`gpuid`dims`metric`intermediate_graph_degree`graph_degree`build_algo`nn_descent_niter)!(GPUID;dims;`CS;128;64;`IVF_PQ;20);
paramsSearch:`max_queries`itopk_size`max_iterations`algo`team_size`search_width`min_iterations`thread_block_size`hashmap_mode`hashmap_min_bitlen`hashmap_max_fill_rate`num_random_samplings!(0;64;0;`SINGLE_CTA;0;1;0;0;`HASH;0;0.5;1);
idx: `name`column`type`params!(enlist `myVectorIndex;enlist `price;enlist `cagra;enlist paramsIndex);
// Create the table
createResult:gw(`createTable;`database`table`schema`indexes!(`default;`test_cagra;mySchema;flip idx));
show createResult; //gw(`listTables;enlist[`database]!enlist `default);
// Insert vectors – accumulate enough rows before CAGRA builds (N > intermediate_graph_degree)
N:100;
t: ([] id:til N; myDate:2015.01.01 + asc N?100j; time:asc N?0p; tag:N?`aaa`bbb`ccc; price:(N;dims)#(N*dims)?1e; myScalar:N?1f; text:{rand[256]?" "} each til N); // price is length-dims
gw(`insertData;`database`table`payload!(`default;`test_cagra;t));
// Query – top-10 nearest neighbors for a single query vector
resQry:(gw(`query;`database`table!(`default;`test_cagra)))[`result];
show resQry;
// Search – top-10 nearest neighbors for a single query vector
q:sums neg[0.5]+dims?1f;
tqry:enlist[`myVectorIndex]!enlist enlist q;
res:first (gw(`search;`database`table`vectors`n`indexParams!(`default;`test_cagra;tqry;10;paramsSearch)))[`result];
show res;
// Delete table
gw(`deleteTable;`database`table!`default`test_cagra);
Index configuration examples
// Minimal index params (all defaults)
paramsIndex: `name`type`column`params!(
`cagraIndex;
`cagra;
`vector;
`dims`metric!(128; `L2)
)
// Full params with IVF_PQ build algorithm (production recommended)
indexParams: `name`type`column`params!(
`cagraIndex;
`cagra;
`vector;
`dims`metric`graph_degree`intermediate_graph_degree`build_algo!(
128; `L2; 32; 64; `IVF_PQ
)
)
// nn_descent – higher recall, high VRAM, dedicated GPU only
indexParams: `name`type`column`params!(
`cagraIndex;
`cagra;
`vector;
`dims`metric`graph_degree`intermediate_graph_degree`build_algo!(
128; `L2; 64; 128; `nn_descent
)
)
VRAM planning
CAGRA holds the full vector dataset and graph structure in GPU memory. Use the following estimates when planning capacity:
| Dataset size | Dims | fp32 dataset | CAGRA index (approx) | IVF_PQ peak build | nn_descent peak build |
|---|---|---|---|---|---|
| 1M vectors | 128 | 0.5 GB | ~0.9 GB | ~3 GB | ~15 GB |
| 10M vectors | 64 | 2.4 GB | ~4.3 GB | ~15 GB | ~78 GB |
| 100M vectors | 128 | 50 GB | ~90 GB | varies | not recommended |
nn_descent VRAM scaling
nn_descent peak VRAM requirements scale aggressively with dataset size. It is not recommended for datasets above ~5M vectors or on shared GPUs. For large-scale datasets, use IVF_PQ instead. Refer to Troubleshooting for details.
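As a rough back-of-the-envelope check (an assumption for capacity planning, not a published cuVS formula), the steady-state index footprint can be approximated as the fp32 vectors plus int32 graph edges; actual peak usage during build is substantially higher, as the table above shows:

```python
def cagra_vram_estimate(n, dims, graph_degree=64):
    """Rough steady-state VRAM estimate in GB for a CAGRA index.

    Approximates fp32 vectors (4 bytes per value) plus int32 neighbor ids
    (4 bytes per edge). Build-time peaks and internal copies are not modeled,
    so treat the result as a lower bound.
    """
    GB = 1024 ** 3
    dataset = n * dims * 4 / GB           # fp32 vector data
    graph = n * graph_degree * 4 / GB     # int32 graph edges
    return {"dataset_gb": round(dataset, 2), "index_gb": round(dataset + graph, 2)}

print(cagra_vram_estimate(1_000_000, 128))   # dataset_gb ≈ 0.48, index_gb ≈ 0.72
```

For 1M × 128-dim vectors this gives roughly 0.5 GB of raw data, in line with the table; the measured index size (~0.9 GB) is larger because of internal structures the estimate omits.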
Search performance tuning
Search algorithm
The algo search parameter controls how CAGRA parallelizes beam search across GPU thread blocks:
| Value | Algorithm | Best for |
|---|---|---|
| `0` | `SINGLE_CTA` | Very small batches (one to a few queries). Does not scale. |
| `1` | `MULTI_CTA` | Recall-sensitive workloads at 1M+ scale. More GPU blocks per query. |
| `2` | `MULTI_KERNEL` | Searches requiring more than 512 neighbors; used automatically when the SINGLE_CTA limit is exceeded. |
| `3` | `AUTO` (recommended) | General use. cuVS selects based on batch size. |
Recall note
At dataset sizes of 1M+, SINGLE_CTA can show measurably lower recall than MULTI_CTA because it runs out of search steps on larger graphs. AUTO optimizes for throughput (not recall) by switching to SINGLE_CTA for large batches. For recall-sensitive workloads at scale, consider setting algo=1 (MULTI_CTA) explicitly.
CAGRA search parameters reference.
Batch size
CAGRA works efficiently with batched queries. Increasing batch size improves GPU utilization and overall throughput. For concurrent workloads with many threads, increasing batch size per thread is more effective than increasing thread count alone.
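As a sketch of the batching pattern (the helper below is illustrative; the `table` and `cagraIndex` names follow the Python quickstart above), queries can be stacked into matrices and submitted in large slices rather than one at a time:

```python
import numpy as np

def batches(queries, batch_size):
    """Yield fixed-size slices of a (N, dims) query matrix."""
    for i in range(0, len(queries), batch_size):
        yield queries[i:i + batch_size]

queries = np.random.random((1000, 128)).astype(np.float32)

for batch in batches(queries, 256):
    # One search call per batch, as in the quickstart:
    # results = table.search(vectors={"cagraIndex": batch}, n=10)
    pass
```

Four calls of ~256 queries each keep the GPU far busier than 1000 single-query calls, since each call amortizes kernel launch and transfer overhead across the whole batch.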
Key search parameters
| Parameter | Description |
|---|---|
| `itopk_size` | Internal candidate list size. Primary recall/speed trade-off. Maximum 512 for SINGLE_CTA. |
| `search_width` | Graph nodes explored in parallel per iteration. |
| `max_queries` | Pre-allocates internal scratch buffers. Set to your expected batch size to avoid per-call allocation overhead. |
| `algo` | Search parallelism strategy (refer to Search algorithm). |
Delete and update
Delete and update are slow operations on CAGRA indexes as they require a full index rebuild. Avoid frequent deletes and updates where possible, and batch them together when required.
Parameter reference
This section lists all CAGRA-specific index and search parameters.
Index parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `dims` | integer | – | Number of dimensions in the vector embeddings. Must match your dataset. |
| `metric` | string | `L2` | Distance metric. Supported values: `L2` (Euclidean), `CS` (cosine similarity). |
| `graph_degree` | integer | 64 | Edges per node in the final graph. Higher values improve recall at the cost of memory. |
| `intermediate_graph_degree` | integer | 128 | Graph degree before pruning. Must be ≥ `graph_degree`. |
| `build_algo` | string | `IVF_PQ` | Algorithm used to seed the initial graph. Refer to Build algorithms. |
| `nn_descent_niter` | integer | 20 | Number of iterations for `nn_descent`. Higher values improve graph quality but increase build time. Only applies when `build_algo=nn_descent`. |
| `gpuid` | integer | 0 | ID of the GPU to use for index construction. |
Search parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| `algo` | integer | 3 (`AUTO`) | Search parallelism strategy. Refer to Search algorithm. |
| `itopk_size` | integer | 64 | Internal candidate list size. Primary recall/speed trade-off. Maximum 512 for SINGLE_CTA. |
| `max_queries` | integer | 0 | Pre-allocates internal scratch buffers. Set to your expected batch size to avoid per-call allocation overhead. |
| `max_iterations` | integer | 0 | Maximum search iterations. 0 means no limit. |
| `min_iterations` | integer | 0 | Minimum search iterations before early exit is allowed. |
| `search_width` | integer | 1 | Graph nodes explored in parallel per iteration. |
| `team_size` | integer | 0 | CUDA thread team size per query. 0 lets cuVS select automatically. |
| `thread_block_size` | integer | 0 | CUDA thread block size. 0 lets cuVS select automatically. |
| `hashmap_mode` | string | `HASH` | Internal hashmap implementation. |
| `hashmap_min_bitlen` | integer | 0 | Minimum bit length for the hashmap. |
| `hashmap_max_fill_rate` | float | 0.5 | Maximum hashmap fill rate before resizing. |
| `num_random_samplings` | integer | 1 | Number of random seed candidates for graph traversal. |
Troubleshooting
Minimum dataset size (N=1 crash)
Inserting into a CAGRA-indexed table when the dataset contains fewer rows than intermediate_graph_degree will cause a GPU illegal memory access error. The CUDA context becomes permanently corrupted after this fault – all subsequent GPU operations fail until the container is restarted.
Mitigation: Always accumulate at least intermediate_graph_degree + 1 rows (default: 129) before allowing CAGRA to build. If your workload involves very small datasets, defer CAGRA indexing or use brute-force search until enough rows have been inserted.
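One way to enforce this on the client side is a small buffering helper (hypothetical, not part of kdbai-client): hold rows until the safe threshold is reached, then flush them to the table in a single insert, after which the index exists and direct inserts are safe:

```python
MIN_ROWS = 129  # default intermediate_graph_degree (128) + 1

class InsertBuffer:
    """Hold rows until the CAGRA minimum is met, then flush in one batch."""

    def __init__(self, flush_fn, min_rows=MIN_ROWS):
        self.flush_fn = flush_fn     # e.g. lambda rows: table.insert(to_df(rows))
        self.min_rows = min_rows
        self.pending = []
        self.primed = False          # True once the first safe flush happened

    def add(self, row):
        if self.primed:              # index already built; insert directly
            self.flush_fn([row])
            return
        self.pending.append(row)
        if len(self.pending) >= self.min_rows:
            self.flush_fn(self.pending)
            self.pending = []
            self.primed = True
```

Here `flush_fn` is whatever your application uses to insert rows (for example, wrapping `table.insert` from the quickstart); the helper only guarantees the first insert carries at least `min_rows` rows.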
nn_descent out of memory on large datasets
nn_descent peak VRAM requirements scale aggressively with dataset size. On a shared GPU where another process is holding a loaded index, cudaMemGetInfo() reports per-process free memory – not system-wide free memory – which can mask the true constraint and cause misleading "X GB free" reports.
Mitigation: Use `IVF_PQ` (`build_algo=IVF_PQ`) for datasets above ~5M vectors, or on any GPU shared with other processes. IVF_PQ achieves 97%+ recall at 10M scale and is the production-recommended algorithm.
nn_descent VRAM requirements
nn_descent has significantly higher peak VRAM requirements than IVF_PQ. For a 10M × 64-dimension fp32 dataset, nn_descent peaks at approximately 78 GB, while IVF_PQ peaks at approximately 15 GB.
Mitigation: On shared GPUs where other processes are already holding VRAM, IVF_PQ is the lower-memory alternative and recommended choice.
VRAM data retention
CAGRA currently retains approximately 1.8× the raw vector data size in VRAM during search due to an internal float16 copy. Nvidia has acknowledged this and plans to fix it in a future cuVS release.
Mitigation: If VRAM is constrained, IVF_PQ is the lower-memory alternative.
Summary
After completing this guide, you can:
- Deploy KDB.AI Server with GPU acceleration using the `kdbai-db-cuvs` image.
- Create CAGRA-indexed tables and run high-throughput similarity search at GPU speed.
- Tune index build and search parameters to balance recall, VRAM usage, and throughput for your workload.
Next steps
- Read more on how CAGRA works.
- Check out our RAG pipeline reference solution with Nvidia microservices on GitHub.
- Read the cuVS overview on the NVIDIA blog.
- Explore the cuVS GitHub repository.