
Nvidia cuVS Integration with KDB.AI

This page covers the cuVS integration available in KDB.AI, including system requirements, index configuration, VRAM planning, and search performance tuning.

cuVS (CUDA Vector Search) is Nvidia's CUDA-accelerated library for similarity search, built to handle large-scale vector data efficiently using GPUs. It includes CAGRA (Cuda ANNS Graph), a graph-based algorithm optimized for fast, scalable, and memory-efficient nearest neighbor search on high-dimensional embeddings. By combining KDB.AI's vector database with cuVS, you can run search-intensive workloads – such as semantic search, recommendation, or anomaly detection – at GPU speed.

Use this guide to configure your environment for cuVS, and deploy KDB.AI Server in a GPU-enabled container optimized for large-scale vector operations.

Why use cuVS with KDB.AI?

When building production-grade applications that depend on fast similarity search, integrating CUDA-based acceleration offers several advantages:

  • Performance: Use GPU acceleration to handle millions of vectors with high throughput and low latency.
  • Scalability: Offload intensive search operations to the GPU while keeping KDB.AI's memory-efficient data structures on the host.
  • Efficiency: Leverage CAGRA's compressed graph format to reduce memory usage without compromising search accuracy.
  • Compatibility: KDB.AI integrates natively with cuVS and runs inside GPU-enabled Docker containers.

By deploying KDB.AI with cuVS, you can scale vector workloads across millions of records, reduce CPU load, and accelerate inference pipelines – all while using the familiar KDB.AI interface and APIs.

Getting started

Prerequisites

System requirements

Ensure your host system meets these requirements:

  • Operating System – Linux kernel 4.18 or newer (Ubuntu 20.04+, RHEL 8+, CentOS 8+).
  • CPU – x86_64 (AMD64) architecture; up to 24 cores (Standard Edition limit).
  • GPU – Ampere architecture or newer (e.g. A100/H100, 40 GB+ VRAM) recommended for large-scale datasets. VRAM requirements vary with index size.
  • GPU Driver – Nvidia driver ≥ 580 (Linux). Refer to CUDA compatibility for details.

ARM not supported

Unlike the kdbai-db image, which is multi-architecture (supports both ARM and x86_64), kdbai-db-cuvs requires an x86_64 (AMD64) host. ARM architectures (including Apple Silicon) are not supported.

For a full overview, refer to the CUDA compatibility guide.

Software requirements

Nvidia container toolkit
  • Version 1.11 or newer – provides GPU support within your container engine.
  • Supported container engines are Docker, Containerd, CRI-O, and Podman. Refer to supported platforms for version requirements.
  • Install the container toolkit by following Nvidia's official installation guide.

Once installed, verify your installation:

docker run --rm --gpus all nvidia/cuda:13.1.0-base-ubuntu22.04 nvidia-smi

Ensure your GPUs are listed and the driver version meets requirements.

KDB.AI client

The standard kdbai-client Python package works with kdbai-db-cuvs. No additional client is required. Refer to Prerequisites for installation details.

Account and license

New users

If you haven't signed up yet, follow the KDB.AI Server setup guide – this covers registration, Docker login, and obtaining your license key.

Existing users

If you're already running kdbai-db, your existing KDB_LICENSE_B64 (or KDB_K4LICENSE_B64) from your Welcome email works with kdbai-db-cuvs. No additional license is required. Replace kdbai-db with kdbai-db-cuvs in your Docker run command.

In both cases, export your license key before running the container:

export KDB_LICENSE_B64=<your-license-from-welcome-email>

Run the container

Separate image required

This guide requires the kdbai-db-cuvs image, which is separate from the standard kdbai-db image. It is larger because it bundles all required GPU and cuVS dependencies.

Launch the kdbai-db-cuvs container with GPU support enabled:

docker run -d --name kdbai-gpu \
  --gpus all \
  -p 8081:8081 \
  -p 8082:8082 \
  -e KDB_LICENSE_B64="$KDB_LICENSE_B64" \
  -v "$PWD/vdbdata":/tmp/kx/data \
  portal.dl.kx.com/kdbai-db-cuvs

How CAGRA works

CAGRA builds a directed k-nearest neighbor graph (k-NNG) across your vector dataset entirely on the GPU, then runs a parallelized beam search at query time. Graph construction has two phases:

  1. Initial graph build – seeds the graph using either IVF-PQ (default) or NN-Descent.
  2. Graph pruning and optimization – removes redundant edges and improves connectivity.

At query time, CAGRA traverses this graph rather than scanning inverted lists, which gives it significantly higher throughput than CPU-based HNSW.
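The traversal can be sketched as a best-first walk over the k-NN graph – a toy, single-threaded illustration, not the cuVS implementation, which parallelizes each step across GPU thread blocks. None of the names below are cuVS APIs:

```python
import heapq
import numpy as np

def greedy_graph_search(graph, data, query, start, itopk_size=8, max_iterations=50):
    """Toy CAGRA-style best-first search over a k-NN graph (illustrative only).

    graph: dict mapping node id -> list of neighbor ids (fixed out-degree)
    data:  (n, dims) array of vectors
    """
    dist = lambda i: float(np.linalg.norm(data[i] - query))
    visited = {start}
    frontier = [(dist(start), start)]   # min-heap of unexpanded candidates
    results = [(dist(start), start)]    # running internal top-k ("itopk")
    for _ in range(max_iterations):
        if not frontier:
            break                       # graph exhausted before hitting the limit
        _, node = heapq.heappop(frontier)
        for nb in graph[node]:          # expand this node's fixed-degree edges
            if nb not in visited:
                visited.add(nb)
                entry = (dist(nb), nb)
                heapq.heappush(frontier, entry)
                results.append(entry)
        results = heapq.nsmallest(itopk_size, results)
    return [n for _, n in sorted(results)]
```

The itopk_size and max_iterations names mirror the search parameters described later in this guide: a larger internal candidate list improves recall at the cost of more work per query.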

Note

Throughout this documentation, M denotes millions of vectors (for example, 1M = 1 million, 5M = 5 million).

Build algorithms

The build_algo parameter controls how CAGRA seeds the initial graph:

  • IVF_PQ (default) – Uses IVF clustering + Product Quantization to find approximate neighbors. GPU-native and fast. Best for: datasets > 1M vectors, production use, build-time-sensitive workloads.
  • nn_descent – Iterative refinement of a random k-NNG. Slower, but a higher-quality initial graph. Best for: datasets < ~5M vectors where maximum recall is required.
  • AUTO – cuVS selects between IVF_PQ and nn_descent based on dataset size and available GPU memory. Best for: prototyping and general use.
  • iterative_cagra_search – Iterative graph build and refinement using CAGRA search. Best for: workflows where build quality matters more than build speed; dedicated GPU workflows.
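As a rule of thumb, the guidance above can be condensed into a small selection helper covering the three common choices (the thresholds and the shared-GPU fallback come from this guide's recommendations, not from cuVS itself):

```python
def choose_build_algo(n_vectors: int, shared_gpu: bool = False,
                      max_recall: bool = False) -> str:
    """Pick a CAGRA build_algo per this guide's guidance (illustrative heuristic)."""
    if shared_gpu:
        return "IVF_PQ"        # lowest peak VRAM; safe next to other GPU processes
    if max_recall and n_vectors < 5_000_000:
        return "nn_descent"    # higher-quality initial graph, much higher VRAM
    if n_vectors > 1_000_000:
        return "IVF_PQ"        # production default at scale
    return "AUTO"              # let cuVS decide for small/prototype datasets
```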

Key index parameters

Use the following parameters to tune the index further:

  • graph_degree (default 64) – Edges per node in the final graph. Controls the trade-off between recall and memory usage.
  • intermediate_graph_degree (default 128) – Degree before pruning. Must be ≥ graph_degree.
  • build_algo (default IVF_PQ) – Graph construction algorithm (refer to Build algorithms).

Limitations

  • The index must fit in GPU memory. CAGRA loads the full index into VRAM – refer to VRAM planning.
  • Best suited for batched queries. For single-query workloads, review the search algorithm settings – refer to Search performance tuning.
  • Minimum dataset size required. At least intermediate_graph_degree + 1 rows are needed before the index can build. Use brute-force search for small datasets.
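In code, the minimum-size constraint amounts to a simple guard before creating the index. A sketch, where "flat" stands in for whatever brute-force index type you fall back to, and 128 is the default intermediate_graph_degree:

```python
def cagra_ready(n_rows: int, intermediate_graph_degree: int = 128) -> bool:
    """True once the dataset is large enough for CAGRA to build its initial graph."""
    return n_rows >= intermediate_graph_degree + 1

def pick_index_type(n_rows: int, intermediate_graph_degree: int = 128) -> str:
    """Fall back to brute-force search while the table is still too small."""
    return "cagra" if cagra_ready(n_rows, intermediate_graph_degree) else "flat"
```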

Quickstart

Python

The following example creates a CAGRA-indexed table in KDB.AI, inserts vectors, and runs a similarity search.

import kdbai_client as kdbai
import numpy as np
import pandas as pd

# Connect to KDB.AI Server
session = kdbai.Session(endpoint="http://localhost:8082")
db = session.database("default")

# Define schema and CAGRA index
schema = [
    {"name": "id",     "type": "int64"},
    {"name": "vector", "type": "float32s"}
]

indexes = [
    {
        "name":   "cagraIndex",
        "type":   "cagra",
        "column": "vector",
        "params": {
            "dims":                      128,
            "metric":                    "L2",
            "graph_degree":              32,
            "intermediate_graph_degree": 64,
            # build_algo options: IVF_PQ (default, recommended for production),
            # nn_descent (higher recall, much higher VRAM), or AUTO
            "build_algo":                "IVF_PQ"
        }
    }
]

table = db.create_table("embeddings", schema, indexes=indexes)

# Insert vectors – ensure N > intermediate_graph_degree before the index builds
n = 10_000
dims = 128
ids = np.arange(n, dtype=np.int64)
vecs = np.random.random((n, dims)).astype(np.float32)

table.insert(pd.DataFrame({"id": ids, "vector": list(vecs)}))

# Search – returns top-10 nearest neighbors
query = np.random.random((1, dims)).astype(np.float32)
results = table.search(
    vectors={"cagraIndex": query},
    n=10
)[0]

print(results)

Refer to Build algorithms for guidance on choosing build_algo.

q / kdb+

The following example connects to KDB.AI Server over qIPC and creates a CAGRA-indexed table from q.

// Connect to KDB.AI Server
`gw set hopen 8082;

// Define schema
dims:10;
eDims:3;
mySchema:flip `name`type!(`id`myDate`time`tag`price`myScalar`text;`j`d`p`s`E`f`C);

// Define CAGRA index parameters
GPUID:0; // for any machine which has GPU
paramsIndex:(`gpuid`dims`metric`intermediate_graph_degree`graph_degree`build_algo`nn_descent_niter)!(GPUID;dims;`CS;128;64;`IVF_PQ;20);
paramsSearch:`max_queries`itopk_size`max_iterations`algo`team_size`search_width`min_iterations`thread_block_size`hashmap_mode`hashmap_min_bitlen`hashmap_max_fill_rate`num_random_samplings!(0;64;0;`SINGLE_CTA;0;1;0;0;`HASH;0;0.5;1);
idx:      `name`column`type`params!(enlist `myVectorIndex;enlist `price;enlist `cagra;enlist paramsIndex);

// Create the table
createResult:gw(`createTable;`database`table`schema`indexes!(`default;`test_cagra;mySchema;flip idx));
show createResult; //gw(`listTables;enlist[`database]!enlist `default);

// Insert vectors – accumulate enough rows before CAGRA builds (N > intermediate_graph_degree)
N:100; 
t:   ([] id:til N; myDate:2015.01.01 + asc N?100j; time:asc N?0p; tag:N?`aaa`bbb`ccc; price:(N;dims)#(N*dims)?1e; myScalar:N?1f; text:{rand[256]?" "} each til N); // price is length-dims
gw(`insertData;`database`table`payload!(`default;`test_cagra;t));

// Query – retrieve the table contents
resQry:(gw(`query;`database`table!(`default;`test_cagra)))[`result];
show resQry;

// Search – top-10 nearest neighbors for a single query vector
q:sums neg[0.5]+dims?1f;
tqry:enlist[`myVectorIndex]!enlist enlist q;
res:first (gw(`search;`database`table`vectors`n`indexParams!(`default;`test_cagra;tqry;10;paramsSearch)))[`result];
show res;

// Delete table
gw(`deleteTable;`database`table!`default`test_cagra);

Index configuration examples

// Minimal index params (all defaults)
paramsIndex: `name`type`column`params!(
    `cagraIndex;
    `cagra;
    `vector;
    `dims`metric!(128; `L2)
)

// Full params with IVF_PQ build algorithm (production recommended)
indexParams: `name`type`column`params!(
    `cagraIndex;
    `cagra;
    `vector;
    `dims`metric`graph_degree`intermediate_graph_degree`build_algo!(
        128; `L2; 32; 64; `IVF_PQ
    )
)

// nn_descent – higher recall, high VRAM, dedicated GPU only
indexParams: `name`type`column`params!(
    `cagraIndex;
    `cagra;
    `vector;
    `dims`metric`graph_degree`intermediate_graph_degree`build_algo!(
        128; `L2; 64; 128; `nn_descent
    )
)

VRAM planning

CAGRA holds the full vector dataset and graph structure in GPU memory. Use the following estimates when planning capacity:

  • 1M vectors, 128 dims – fp32 dataset 0.5 GB; CAGRA index ~0.9 GB; IVF_PQ peak build ~3 GB; nn_descent peak build ~15 GB.
  • 10M vectors, 64 dims – fp32 dataset 2.4 GB; CAGRA index ~4.3 GB; IVF_PQ peak build ~15 GB; nn_descent peak build ~78 GB.
  • 100M vectors, 128 dims – fp32 dataset 50 GB; CAGRA index ~90 GB; IVF_PQ peak build varies; nn_descent not recommended.
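These figures suggest some rough multipliers: the resident index runs about 1.8× the raw fp32 data (consistent with the float16-copy behaviour noted under Troubleshooting), IVF_PQ build peaks around 6×, and nn_descent around 30×. A back-of-envelope estimator fitted to those ratios – an approximation derived from this table, not a cuVS formula:

```python
def cagra_vram_estimate_gb(n_vectors: int, dims: int) -> dict:
    """Rough VRAM estimates (decimal GB) for an fp32 CAGRA index.

    Multipliers are fitted to the capacity table in this guide, not exact.
    """
    dataset_gb = n_vectors * dims * 4 / 1e9   # 4 bytes per fp32 component
    return {
        "dataset": round(dataset_gb, 1),
        "index": round(dataset_gb * 1.8, 1),               # resident index ~1.8x data
        "ivf_pq_peak_build": round(dataset_gb * 6, 1),     # ~6x data at build time
        "nn_descent_peak_build": round(dataset_gb * 30, 1) # ~30x data at build time
    }
```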

nn_descent VRAM scaling

nn_descent peak VRAM requirements scale aggressively with dataset size. It is not recommended for datasets above ~5M vectors or on shared GPUs. For large-scale datasets, use IVF_PQ instead. Refer to Troubleshooting for details.

Search performance tuning

Search algorithm

The algo search parameter controls how CAGRA parallelizes beam search across GPU thread blocks:

  • 0 (SINGLE_CTA) – Very small batches (one to a few queries). Does not scale.
  • 1 (MULTI_CTA) – Recall-sensitive workloads at 1M+ scale. More GPU blocks per query.
  • 2 (MULTI_KERNEL) – Handles searches requiring more than 512 neighbors; used automatically when SINGLE_CTA's limit is exceeded.
  • 3 (AUTO, recommended) – General use. cuVS selects based on batch size.

Recall note

At dataset sizes of 1M+, SINGLE_CTA can show measurably lower recall than MULTI_CTA because it runs out of search steps on larger graphs. AUTO optimizes for throughput (not recall) by switching to SINGLE_CTA for large batches. For recall-sensitive workloads at scale, consider setting algo=1 (MULTI_CTA) explicitly.

For the full list of options, refer to the CAGRA search parameters reference.

Batch size

CAGRA works efficiently with batched queries. Increasing batch size improves GPU utilization and overall throughput. For concurrent workloads with many threads, increasing batch size per thread is more effective than increasing thread count alone.
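With the Python client this means stacking query vectors into one array instead of looping over single-row searches. A small chunking helper – the commented-out search call assumes the cagraIndex table from the quickstart:

```python
import numpy as np

def batch_queries(queries: np.ndarray, batch_size: int = 512):
    """Yield (batch_size, dims) float32 chunks for batched CAGRA search."""
    q = np.ascontiguousarray(queries, dtype=np.float32)
    for start in range(0, len(q), batch_size):
        yield q[start:start + batch_size]

# Usage sketch (assumes a table created as in the quickstart):
# for batch in batch_queries(all_queries, batch_size=512):
#     results = table.search(vectors={"cagraIndex": batch}, n=10)
```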

Key search parameters

  • itopk_size – Internal candidate list size. Primary recall/speed trade-off. Maximum 512 for SINGLE_CTA.
  • search_width – Graph nodes explored in parallel per iteration.
  • max_queries – Pre-allocates internal scratch buffers. Set to your expected batch size to avoid per-call allocation overhead.
  • algo – Search parallelism strategy (refer to Search algorithm).
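The q quickstart's paramsSearch dictionary has a natural Python analogue. The values below are a recall-oriented example rather than the defaults; how the dictionary is passed to search depends on your client version, so check your client's API documentation:

```python
# Recall-oriented CAGRA search parameters (example values, not defaults).
# itopk_size is the primary tuning knob; algo=1 follows the recall note above.
search_params = {
    "algo": 1,            # MULTI_CTA – more GPU blocks per query at 1M+ scale
    "itopk_size": 128,    # larger internal candidate list -> higher recall, slower
    "max_queries": 512,   # pre-allocate scratch buffers for the expected batch size
    "search_width": 2,    # graph nodes explored in parallel per iteration
}
```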

Delete and update

Delete and update are slow operations on CAGRA indexes as they require a full index rebuild. Avoid frequent deletes and updates where possible, and batch them together when required.
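One way to do this is to coalesce mutations client-side and apply them in a single call, so the rebuild cost is paid once per batch. A minimal sketch, in which apply_fn stands in for whatever batched delete or update call your client exposes:

```python
class MutationBatcher:
    """Collect CAGRA-table mutations and flush them in one rebuild-triggering call."""

    def __init__(self, apply_fn, flush_at: int = 1000):
        self.apply_fn = apply_fn    # e.g. a function issuing one batched delete
        self.flush_at = flush_at
        self.pending = []

    def add(self, row_id):
        self.pending.append(row_id)
        if len(self.pending) >= self.flush_at:
            self.flush()

    def flush(self):
        if self.pending:
            self.apply_fn(self.pending)   # one call -> one index rebuild
            self.pending = []
```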

Parameter reference

This section lists all CAGRA-specific index and search parameters.

Index parameters

  • dims (integer, required) – Number of dimensions in the vector embeddings. Must match your dataset.
  • metric (string, default L2) – Distance metric. Supported values: L2 (Euclidean), CS (cosine similarity).
  • graph_degree (integer, default 64) – Edges per node in the final graph. Higher values improve recall at the cost of memory.
  • intermediate_graph_degree (integer, default 128) – Graph degree before pruning. Must be ≥ graph_degree.
  • build_algo (string, default IVF_PQ) – Algorithm used to seed the initial graph. Refer to Build algorithms.
  • nn_descent_niter (integer, default 20) – Number of iterations for nn_descent. Higher values improve graph quality but increase build time. Applies only when build_algo=nn_descent.
  • gpuid (integer, default 0) – ID of the GPU to use for index construction.

Search parameters

  • algo (integer, default 3 = AUTO) – Search parallelism strategy. Refer to Search algorithm.
  • itopk_size (integer, default 64) – Internal candidate list size. Primary recall/speed trade-off. Maximum 512 for SINGLE_CTA.
  • max_queries (integer, default 0) – Pre-allocates internal scratch buffers. Set to your expected batch size to avoid per-call allocation overhead.
  • max_iterations (integer, default 0) – Maximum search iterations. 0 means no limit.
  • min_iterations (integer, default 0) – Minimum search iterations before early exit is allowed.
  • search_width (integer, default 1) – Graph nodes explored in parallel per iteration.
  • team_size (integer, default 0) – CUDA thread team size per query. 0 lets cuVS select automatically.
  • thread_block_size (integer, default 0) – CUDA thread block size. 0 lets cuVS select automatically.
  • hashmap_mode (string, default HASH) – Internal hashmap implementation.
  • hashmap_min_bitlen (integer, default 0) – Minimum bit length for the hashmap.
  • hashmap_max_fill_rate (float, default 0.5) – Maximum hashmap fill rate before resizing.
  • num_random_samplings (integer, default 1) – Number of random seed candidates for graph traversal.

Troubleshooting

Minimum dataset size (N=1 crash)

Inserting into a CAGRA-indexed table when the dataset contains fewer rows than intermediate_graph_degree will cause a GPU illegal memory access error. The CUDA context becomes permanently corrupted after this fault – all subsequent GPU operations fail until the container is restarted.

Mitigation: Always accumulate at least intermediate_graph_degree + 1 rows (default: 129) before allowing CAGRA to build. If your workload involves very small datasets, defer CAGRA indexing or use brute-force search until enough rows have been inserted.
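A client-side sketch of this mitigation: buffer inserts until the threshold is crossed, then insert once. Here insert_fn stands in for table.insert, and the 129-row default corresponds to intermediate_graph_degree + 1:

```python
class SafeInserter:
    """Buffer rows until CAGRA's minimum build size is reached, then insert once."""

    def __init__(self, insert_fn, min_rows: int = 129):  # intermediate_graph_degree + 1
        self.insert_fn = insert_fn
        self.min_rows = min_rows
        self.buffer = []
        self.started = False

    def add(self, rows):
        if self.started:
            self.insert_fn(rows)          # index already built; insert directly
            return
        self.buffer.extend(rows)
        if len(self.buffer) >= self.min_rows:
            self.insert_fn(self.buffer)   # first insert is large enough to build safely
            self.buffer = []
            self.started = True
```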

nn_descent out of memory on large datasets

nn_descent peak VRAM requirements scale aggressively with dataset size. On a shared GPU where another process is holding a loaded index, cudaMemGetInfo() reports per-process free memory – not system-wide free memory – which can mask the true constraint and cause misleading "X GB free" reports.

Mitigation: Use IVF_PQ (build_algo=IVF_PQ) for datasets above ~5M vectors, or on any GPU shared with other processes. IVF_PQ achieves 97%+ recall at 10M scale and is the production-recommended algorithm.

nn_descent VRAM requirements

nn_descent has significantly higher peak VRAM requirements than IVF_PQ. For a 10M × 64-dimension fp32 dataset, nn_descent peaks at approximately 78 GB, while IVF_PQ peaks at approximately 15 GB.

Mitigation: On shared GPUs where other processes are already holding VRAM, IVF_PQ is the lower-memory alternative and recommended choice.

VRAM data retention

CAGRA currently retains approximately 1.8× the raw vector data size in VRAM during search due to an internal float16 copy. Nvidia has acknowledged this and plans to fix it in a future cuVS release.

Mitigation: If VRAM is constrained, IVF_PQ is the lower-memory alternative.

Summary

After completing this guide, you can:

  • Deploy KDB.AI Server with GPU acceleration using the kdbai-db-cuvs image.
  • Create CAGRA-indexed tables and run high-throughput similarity search at GPU speed.
  • Tune index build and search parameters to balance recall, VRAM usage, and throughput for your workload.

Next steps