Performance best practices
Query
Attributes and schema
Key to the performance of a kdb+ database is appropriate schema design and choice of attributes.
Additionally, symbols allow rapid access to string data. They should not be used for random or large-cardinality string columns as they cannot be garbage collected when no longer necessary; performance degrades as symbols are added.
Response serialization
The kdb Insights Database supports QIPC binary and REST with JSON and binary payloads.
REST, along with JSON, allows for a convenient query method, but has significant serialization overhead compared to the much faster QIPC binary serialization format.
You can use the QIPC binary format from a REST call by specifying the application/octet-stream
Accept
header:
curl -X POST --header "Content-Type: application/json" \
--header "Accept: application/octet-stream" \
--header "Authorization: Bearer $INSIGHTS_TOKEN" \
-o results.dat \
--data "{\"table\":\"trades\"}" \
"https://${INSIGHTS_HOSTNAME}/servicegateway/kxi/getData"
curl -X POST --header "Content-Type: application/json" \
--header "Accept: application/octet-stream" \
-o results.dat \
--data "{\"table\":\"trades\"}" \
"http://${SERVICE_GATEWAY}/kxi/getData"
getData
and custom APIs vs SQL
SQL offers a convenient method to explore data within an kdb Insights Database, but is much slower in comparison to the getData
API and q custom APIs that are crafted for a specific purpose from q. For best performance, use getData
where possible.
To further reduce performance variability and SQL taking resources from getData
and custom APIs queries, SQL is disabled by default on the non-qe
gateway. SQL is only accessible via the resource-isolated qe
gateway, and the kdb Insights Enterprise UI makes use of this for all queries made within the UI.
Scaling can be done by increasing the number of threads or replicas.
Threads
Query performance benefits from additional threads when a query uses built-in primitives. The number of threads should match the number of CPUs allocated to each DAP. See query scaling to configure threads for DAPs in your deployment.
Replicas
Data Access Processes are individual kdb+ processes, which, while fast in executing queries, are single-threaded. To increase query throughput and the number of concurrent queries, rather than execution of a single query, you can use additional DAP replicas. The gateway makes use of all replicas available when load-balancing query work. See query scaling to configure a number of replicas in your deployment.
Ingestion
Compression
Compression is on by default for all tiers (including the IDB). Compression offers significant storage savings, but for very high data ingestion rates, the CPU overhead of compression can cause SM to not keep up. While observing SM resource utilization during operation, if CPU is being used significantly more than storage IO, disabling compression may help increase ingestion performance.
Batch publishing
Publishing individual rows in each message causes significant overhead in message processing within both SM and DA. To vectorize the ingestion operations, batch publisher messages as much as possible (e.g., publish a message of the form (<table-name>; <many-records>)
rather than (<table-name; <row>)
).
blockSize
The blockSize
configuration within SM controls how frequently data should be written to disk. If this value is too small, the Storage Manager flushes to disk too often. If this value is too high, the RAM limits are less impactful. If the blockSize
is omitted, a default blockSize
of 12 million records (per table) is used.
If SM (particularly EOI) appears to be hanging and not keeping up with the stream, check that disk IO is being performed. Enable trace logging to observe a log message each time SM flushes to storage.
Interval configuration
The SM interval currently controls the frequency at which data is written to new IDB partitions. The choice of interval has a few impacts:
- The interval governs the amount of data held in-memory within the RDB - the larger the interval, the more data is held in-memory. This also impacts how much RAM is required for the RDB.
- The interval governs how much data needs to be replayed when SM or DA start up for recovery. An hour-long interval means that SM and DA may need to replay an hour (or more if unable to keep up) of data. A 5-minute interval has a much faster recovery time.
- The more frequently data is written to the IDB, the more data is divided within IDB partitions. For a 1-hour interval, data is divided into 24 (during regular operation) intervals throughout the day, while a 10 minute interval divides data into 144 intervals. The more intervals, the more files must be interacted with during a query.