Performance best practices
Query
Attributes and schema
Key to the performance of a kdb+ database is appropriate schema design and choice of attributes.
Additionally, symbols allow rapid access to string data. They should not be used for random or large-cardinality string columns as they cannot be garbage collected when no longer necessary; performance degrades as symbols are added.
Response serialization
The kdb Insights Database supports QIPC binary and REST with JSON and binary payloads.
REST, along with JSON, allows for a convenient query method, but has significant serialization overhead compared to the much faster QIPC binary serialization format.
You can use the QIPC binary format from a REST call by specifying the application/octet-stream
Accept
header:
curl -X POST --header "Content-Type: application/json" \
--header "Accept: application/octet-stream" \
--header "Authorization: Bearer $INSIGHTS_TOKEN" \
-o results.dat \
--data "{\"table\":\"trades\"}" \
"https://${INSIGHTS_HOSTNAME}/servicegateway/kxi/getData"
curl -X POST --header "Content-Type: application/json" \
--header "Accept: application/octet-stream" \
-o results.dat \
--data "{\"table\":\"trades\"}" \
"http://${SERVICE_GATEWAY}/kxi/getData"
getData
and custom APIs vs SQL
SQL offers a convenient method to explore data within an kdb Insights Database, but is much slower in comparison to the getData
API and q custom APIs that are crafted for a specific purpose from q. For best performance, use getData
where possible.
To further reduce performance variability and SQL taking resources from getData
and custom APIs queries, SQL is disabled by default on the non-qe
gateway. SQL is only accessible via the resource-isolated qe
gateway, and the kdb Insights Enterprise UI makes use of this for all queries made within the UI.
Scaling can be done by increasing the number of threads or replicas.
Threads
Query performance benefits from additional threads when a query uses built-in primitives. The number of threads should match the number of CPUs allocated to each DAP. See query scaling to configure threads for DAPs in your deployment.
Replicas
Data Access Processes are individual kdb+ processes, which, while fast in executing queries, are single-threaded. To increase query throughput and the number of concurrent queries, rather than execution of a single query, you can use additional DAP replicas. The gateway makes use of all replicas available when load-balancing query work. See query scaling to configure a number of replicas in your deployment.
Ingestion
Compression
Compression is on by default for all tiers (including the IDB). Compression offers significant storage savings, but for very high data ingestion rates, the CPU overhead of compression can cause SM to not keep up. While observing SM resource utilization during operation, if CPU is being used significantly more than storage IO, disabling compression may help increase ingestion performance.
Batch publishing
Publishing individual rows in each message causes significant overhead in message processing within both SM and DA. To vectorize the ingestion operations, batch publisher messages as much as possible (e.g., publish a message of the form (<table-name>; <many-records>)
rather than (<table-name; <row>)
).
blockSize
The blockSize
configuration within SM controls how frequently data should be written to disk. If this value is too small, the Storage Manager flushes to disk too often. If this value is too high, the RAM limits are less impactful. If the blockSize
is omitted, a default blockSize
of 12 million records (per table) is used.
If SM (particularly EOI) appears to be hanging and not keeping up with the stream, check that disk IO is being performed. Enable trace logging to observe a log message each time SM flushes to storage.
Interval configuration
The SM interval currently controls the frequency at which data is written to new IDB partitions. The choice of interval has a few impacts:
- The interval governs the amount of data held in-memory within the RDB - the larger the interval, the more data is held in-memory. This also impacts how much RAM is required for the RDB.
- The interval governs how much data needs to be replayed when SM or DA start up for recovery. An hour-long interval means that SM and DA may need to replay an hour (or more if unable to keep up) of data. A 5-minute interval has a much faster recovery time.
- The more frequently data is written to the IDB, the more data is divided within IDB partitions. For a 1-hour interval, data is divided into 24 (during regular operation) intervals throughout the day, while a 10 minute interval divides data into 144 intervals. The more intervals, the more files must be interacted with during a query.
End of day processing
Moving data from the IDB to HDB can be a time, CPU and memory intensive process due to several factors:
- The data in a table needs to be read and written a few times during the existing EOD process.
- The data may need to be uncompressed and compressed multiple times if the IDB and/or HDB tiers have compression turned on.
- As tables get longer the sort operation (
xasc
) can become expensive in terms of time and hardware resources.
For these reasons, an alternative SM sort method has been developed which will decrease the amount of time needed to complete the EOD process while also reducing the amount of memory it needs.
The method is activated by default and it can be disabled by setting KXI_SM_EOD_SORT
to 0
i.e. it would appear in an assembly as:
elements:
sm:
env:
- name: KXI_SM_EOD_SORT
value: "0"
By default, the new method allocates RAM to the EOD process by considering the SM Kubernetes memory request and limit as well as the physical RAM available on the node. If it becomes necessary to further lower the amount of RAM allocated to the sort process, then the KXI_SM_EOD_SORT_LIMIT
(GB) variable can be set.
- name: KXI_SM_EOD_SORT_LIMIT
value: "96"
As a guide:
-
if
spec.elements.sm.k8sPolicy.limits.memory
andspec.elements.sm.k8sPolicy.requests.memory
are both set then thespec.elements.sm.k8sPolicy.limits.memory
value is used to allocate RAM to the sort. -
if
KXI_SM_EOD_SORT_LIMIT
is set then the RAM allocated is set to the minimum of its value and eitherspec.elements.sm.k8sPolicy.limits.memory
orspec.elements.sm.k8sPolicy.requests.memory
. -
if neither
spec.elements.sm.k8sPolicy.*.memory
norKXI_SM_EOD_SORT_LIMIT
are set then the RAM allocated to the sort process is 10% of the physical RAM of the node upon which theSM
pod is running.
Since the EOD process involves reading, writing and manipulating large vectors, the eodPeachLevel
is set to column
so that several columns can be read or written at the same time. This is especially true if the data in HDB is to be compressed. Since the level is set to column
then also remember to check the SM Kubernetes CPU request value and the number of threads allocated to EOD process(KXI_SM_EOD_THREADS
) to ensure the values in each are not conflicting.
- name: KXI_SM_EOD_THREADS
value: "6"
The alternative sort method takes advantage of the relationship between how the data is sorted in the IDB and HDB and which attributes have been applied to each. Below are a few notes to keep in mind when deciding on the correct schema a table:
-
The most common way to sort data in the HDB is by
sym,time
or in more pipeline-like terminology<an ID column>,<prtnCol>
with aparted
attribute applied to the ID column in the HDB and asorted
attribute applied to theprtnCol
in the IDB. -
The Columnar database and query optimization white paper may be useful when deciding which attributes to apply to your tables.
-
If the goal of a pipeline is to simply ingest and complete the EOD process in the minimum amount of time using the minimum amount of resources then the data should be
sorted
in both IDB and HDB by theprtnCol
associated with the table and thesorted
attribute applied in both cases. This has the added benefit of allowing thegetData
API to automatically optimize a request to ensure that the attributed columns are correctly prioritized when reading the data from the disk backed tiers. -
If attributes are not applied to
partnCol
then you should optimize your query to ensure that the column with the attribute is first in the list of constraints of the request.