FAQ - Performance and Optimization

Common performance questions observed across KDB.AI deployments. For setup-specific troubleshooting, see Server Setup FAQ.

Diagnosing slowness

Q: A workload feels slow end-to-end. How do I find which layer is responsible?

Decompose wall-clock time before tuning anything. In the client, measure per-stage durations: data preparation, embedding generation, request serialization, network round-trip, server processing, and result deserialization.

Cross-check against the server log for each request. If the server reports ~10 ms while the client observes seconds or minutes, the bottleneck is client-side serialization, the network path, or backing storage – not the engine. This single comparison resolves the majority of perceived performance issues.
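A minimal sketch of the client-side decomposition described above, using only the Python standard library. `StageTimer` and the stage names are hypothetical, and the stage bodies are placeholders for your own preparation, serialization, and client calls. The server-side figure still has to be read from the server log and passed in by hand.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Collects wall-clock durations for named stages of one request."""

    def __init__(self):
        self.durations = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Accumulate so a stage entered twice is summed, not overwritten.
            elapsed = time.perf_counter() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def total(self):
        return sum(self.durations.values())

    def outside_server(self, server_reported_s):
        """Client-observed time that the server's own figure does not explain."""
        return self.total() - server_reported_s


timer = StageTimer()
with timer.stage("data_preparation"):
    rows = [[float(i), float(i + 1)] for i in range(1000)]  # placeholder work
with timer.stage("serialization"):
    payload = repr(rows).encode("utf-8")  # placeholder for the real encoder
with timer.stage("round_trip"):
    time.sleep(0.01)  # placeholder for the real client call

for name, seconds in timer.durations.items():
    print(f"{name:>18}: {seconds * 1000:8.2f} ms")
```

If `outside_server(...)` accounts for most of the total, the time is being spent before the request reaches the engine or after it leaves.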

Q: What information should I gather before opening a performance investigation?

At minimum:

Index type and exact parameter values (M, efConstruction, efSearch for HNSW)
Embedding dimensionality
Rows per request and batch size
Transport: REST vs qIPC
Whether TLS is in the path and where it is terminated
Schema column types (string vs symbol – see Schema design below)
Whether metadata filtering is in use
Storage: provisioned IOPS and throughput
On-disk table size by partition and index size

Q: How do NUM_WRK and THREADS affect performance?

These two environment variables control parallelism at different levels:

NUM_WRK sets the number of worker processes. Each worker handles requests independently, so increasing this value raises the number of concurrent requests the service can process.
THREADS sets the number of threads available to each worker for parallel computation within a single request (for example, index building, search across partitions, or batch search). For a detailed explanation of how multithreading works, see the multithreading page.

See the Configuration Guide for the full list of available environment variables.
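Because THREADS applies to each worker, the two settings multiply. The sketch below is plain arithmetic, not a KDB.AI API: `thread_budget` is a hypothetical helper, and comparing the product with the host's core count is a general sizing heuristic assumed here, not a documented rule.

```python
def thread_budget(num_wrk, threads, cores):
    """Return the peak compute threads implied by NUM_WRK and THREADS."""
    if num_wrk < 1 or threads < 1 or cores < 1:
        raise ValueError("num_wrk, threads and cores must all be >= 1")
    peak = num_wrk * threads  # every worker may use its full thread allowance
    return {
        "workers": num_wrk,
        "peak_threads": peak,
        # Assumption: more busy threads than cores means contention.
        "oversubscribed": peak > cores,
    }


print(thread_budget(num_wrk=4, threads=8, cores=16))
print(thread_budget(num_wrk=2, threads=8, cores=16))
```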

Index selection and tuning

Q: Which index type should I use?

See Use Indexes for the full comparison table. As a general guide:

Small datasets or exact recall required: use qFlat (on-disk) rather than flat (in-memory) to avoid holding the full index in RAM.
General-purpose ANN, memory is constrained: use qHnsw (on-disk HNSW).
General-purpose ANN, memory is not constrained: use hnsw.
Very large datasets where compression matters more than recall: use ivf or ivfpq.
Lexical / keyword search to complement dense search: use bm25.
GPU-accelerated workloads: see the Nvidia cuVS/CAGRA integration guide.
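The guide above can be written down as a small lookup, which is convenient for checking a deployment's configuration against its intended workload. This is only a sketch: `suggest_indexes` is a hypothetical helper, the order in which the checks are applied is an assumption, and the GPU case is left out because it is covered by the separate integration guide.

```python
def suggest_indexes(*, lexical=False, exact_recall=False, small_dataset=False,
                    compression_over_recall=False, memory_constrained=True):
    """Map workload traits to candidate index types from the guide above.

    The precedence of the checks is an assumption made for this sketch.
    """
    if lexical:
        return ["bm25"]
    if exact_recall or small_dataset:
        return ["qFlat"]
    if compression_over_recall:
        return ["ivf", "ivfpq"]
    # General-purpose ANN: the choice depends only on memory headroom.
    return ["qHnsw"] if memory_constrained else ["hnsw"]


print(suggest_indexes(exact_recall=True))
print(suggest_indexes(memory_constrained=False))
```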

A single table can hold multiple indexes simultaneously, which is the standard pattern for hybrid and multimodal use cases.

A common mistake is deploying flat or hnsw (in-memory) when the dataset is large enough to require qFlat or qHnsw (on-disk). Memory consumption then grows with the dataset until the service degrades or crashes. If you suspect this, verify the index type in the running configuration – not just the design document.

Q: When does it make sense to move to a GPU-accelerated index?

GPU indexes (CAGRA) are worthwhile once the dataset is large enough that CPU-based HNSW search no longer meets your throughput target. Below that scale, CPU HNSW is simpler and lower cost. Key constraints to be aware of before adopting GPU indexes:

The entire index must fit in a single GPU's VRAM – no multi-GPU sharding is supported.
CAGRA indexes are static: inserts after build require a full rebuild. Plan a periodic rebuild cadence for workloads with ongoing ingestion; for streaming inserts, use hnsw or qHnsw on CPU instead.
Peak VRAM during build is higher than the steady-state index size, so size for peak.

See the Nvidia cuVS/CAGRA integration guide for setup and parameter tuning.

Q: How should I tune HNSW parameters?

The three parameters are M, efConstruction, and efSearch.

M and efConstruction are set once at build time. Higher values improve recall but increase build time and memory.
efSearch is the runtime quality lever – tune it per query. Increasing it improves recall at the cost of latency.
If selective metadata filters are in use, raise efSearch to compensate for graph traversals that cannot find enough valid neighbours after filtering.
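One way to tune efSearch is to sweep it against a set of queries whose exact neighbours are known (for example, from a flat index) and pick the smallest value that meets your recall target. The sketch below assumes that workflow. `recall_at_k` and `smallest_ef_meeting` are hypothetical helpers, and `fake_query` is a stand-in for a real search call, not a model of HNSW behaviour.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k neighbours present in the approximate result."""
    if not exact_ids:
        raise ValueError("exact_ids must not be empty")
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


def smallest_ef_meeting(target_recall, candidates, run_query, exact_ids):
    """Return the lowest candidate efSearch that reaches target_recall, or None."""
    for ef in sorted(candidates):
        if recall_at_k(run_query(ef), exact_ids) >= target_recall:
            return ef
    return None


# Stand-in for a real search: pretends a larger efSearch finds more true neighbours.
exact = list(range(10))


def fake_query(ef_search):
    found = min(len(exact), ef_search // 8)
    misses = [1000 + i for i in range(len(exact) - found)]
    return exact[:found] + misses


print(smallest_ef_meeting(0.9, [16, 32, 64, 128], fake_query, exact))
```

In practice, average recall over many queries, and repeat the sweep with your real metadata filters applied, since selective filters change the result.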

Q: Are there fixed limits on index size, embedding dimension, or metadata per row?

There are no fixed product limits – capacity is bounded by the host's memory and storage. Embedding columns are stored as float32; dimensionalities of 1,024 and 1,536 are common in production. Metadata is persisted as an on-disk table, so its limit is your available storage. The practical constraint surfaces at query time: if a query attempts to load more data than the host can hold in memory, performance degrades. Size for the working set (active indexes plus the hottest data), not the total dataset size.
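As a rough sizing aid, the raw footprint of the embeddings follows directly from the float32 storage noted above: rows × dimensions × 4 bytes. The sketch below computes only that lower bound. Index structures (such as HNSW graph links) and metadata add overhead that is not modelled here, and the 10 million row count is an illustrative assumption.

```python
def raw_vector_bytes(rows, dims, bytes_per_value=4):
    """Bytes needed for the raw vectors alone (float32 = 4 bytes per value)."""
    return rows * dims * bytes_per_value


def to_gib(num_bytes):
    return num_bytes / 2**30


for dims in (1024, 1536):
    size = raw_vector_bytes(10_000_000, dims)
    print(f"{dims} dims x 10M rows: {to_gib(size):.1f} GiB of raw float32 vectors")
```

For these inputs the raw vectors come to about 38.1 GiB and 57.2 GiB respectively, before any index overhead.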

Ingestion throughput

Q: Vector inserts are taking much longer than expected. How do I speed them up?

Work through this in order:

1. Confirm the bottleneck is the engine, not the client or storage. Compare the client-observed time to the server-reported processing time in the logs. If the server reports milliseconds while the client observes seconds, the issue is transport, serialization, or storage – not the database.
2. Switch from REST to qIPC where possible. REST adds serialization overhead. qIPC has lower overhead and is the recommended transport for high-throughput workloads.
3. Right-size your batches. Send fewer, larger batches to amortize index update overhead.
4. Check storage provisioning. Each insert loads the vector index from disk, updates it, and writes it back. Un-tiered NFS storage without guaranteed IOPS is a common root cause. Use SSD or NVMe-backed persistent volumes for any production workload.
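The batching advice above can be sketched as a small wrapper around whatever insert call your client exposes. `chunk_rows` and `insert_in_batches` are hypothetical helpers, `insert_fn` stands in for the real insert call, and the batch size of 5,000 is an arbitrary starting point to be tuned by measurement.

```python
from itertools import islice


def chunk_rows(rows, batch_size):
    """Yield successive lists of at most batch_size rows from any iterable."""
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def insert_in_batches(rows, batch_size, insert_fn):
    """Call insert_fn once per batch and return the number of calls made."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    calls = 0
    for batch in chunk_rows(rows, batch_size):
        insert_fn(batch)  # one request, and one index update, per batch
        calls += 1
    return calls


sent = []  # records each batch instead of contacting a server
calls = insert_in_batches(range(10_500), 5_000, sent.append)
print(calls, [len(batch) for batch in sent])
```

Here 10,500 rows go out in three calls rather than 10,500, so the per-insert index update cost is paid three times.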

Q: My dataset is larger than available RAM. How should I build the index?

Ingest in batches and let the index update incrementally rather than attempting to build the full structure in one pass. Use qFlat or qHnsw (on-disk indexes) so the index is not held entirely in memory. For genuinely streaming inserts, hnsw or qHnsw on CPU support incremental updates; GPU-accelerated indexes (CAGRA) are static and require a full rebuild when new data is added – see the Nvidia cuVS/CAGRA integration guide for rebuild strategies.

Transport

Q: When should I use REST versus qIPC?

qIPC is the recommended high-performance transport. REST is the pragmatic fallback when only HTTPS is permitted by the network team. Key differences:

	REST	qIPC
Serialization overhead	Higher	Lower
TLS	Native HTTPS	Terminate at gateway; forward plain TCP
Port	8081	8082

Q: Should I batch queries or send them one at a time?

Batched queries (multiple query vectors per call) significantly improve throughput, particularly for GPU-backed indexes where kernel launch overhead dominates small batches. For latency-sensitive single-query workloads, sending queries one at a time may genuinely be the better choice, so clarify the workload goal before optimizing.

Schema design

Q: How should I choose column types to keep queries fast?

Choose the column type based on cardinality:

Use symbol for columns with repetitive or categorical values (for example, tenant ID, category, status). Symbols are interned, so comparisons and storage are cheap.
Use string for columns with many unique values (for example, free-text fields, document IDs).
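A quick way to apply this rule is to measure how repetitive a column actually is in a sample of your data. The sketch below is illustrative only: `suggest_column_type` is a hypothetical helper, and the 10% distinct-value threshold is an assumption, not a KDB.AI recommendation.

```python
def suggest_column_type(values, max_distinct_ratio=0.1):
    """Suggest 'symbol' for repetitive columns and 'string' for mostly-unique ones."""
    values = list(values)
    if not values:
        raise ValueError("values must not be empty")
    distinct_ratio = len(set(values)) / len(values)
    return "symbol" if distinct_ratio <= max_distinct_ratio else "string"


tenant_ids = ["acme", "globex"] * 50              # 2 distinct values in 100 rows
document_ids = [f"doc-{i}" for i in range(100)]   # every value is unique

print(suggest_column_type(tenant_ids))
print(suggest_column_type(document_ids))
```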

When using the Python client, be aware of the type mapping: a Python str maps to a q symbol, and Python bytes maps to a q char list (a q string). Choose deliberately rather than relying on defaults.

Avoid overly wide tables. Each additional metadata column is read during every filtered query.

Multi-tenancy

Q: For multi-team usage, should I run a single shared instance or one instance per team?

Both patterns are supported. A single shared instance is simpler to operate but couples teams together: a heavy workload from one team can affect others, and maintenance affects everyone at once. Separate instances per team isolate workloads and allow independent upgrades, at the cost of more operational overhead. For teams with significantly different workload profiles or data isolation requirements, separate instances are the safer choice.

Q: I have many tenants sharing one table. Should I build one index per tenant?

Almost never. For workloads such as many users each with a small number of documents, the right pattern is a single index with a tenant_id (or user_id) metadata filter, not one index per tenant. Metadata pre-filters narrow the candidate set during ANN traversal and typically reduce latency. For very selective filters with HNSW, raise efSearch to maintain recall.

Still need help?

For general questions, ask the Slack community.

For performance issues requiring investigation, email support@kdb.ai and include:

Index type and parameters (M, efConstruction, efSearch, etc.)
Embedding dimensionality and dataset size
Transport in use (REST or qIPC)
Client-observed timing vs server-reported processing time (from logs)
Storage type (SSD, NVMe, network-attached)
The exact error messages or symptoms you're seeing