Query performance¶
The two main things affecting query performance are the time taken to read the data (from disk or in memory) and the time taken to transmit the data to the client. We'll discuss how to keep both of these to a minimum.
Read times¶
When querying, the ordering of the constraints has the biggest impact on the read time.
The symList parameter¶
The Refinery APIs have a parameter called symList. This is used to select values from the primary "key" column in the table, and Refinery applies some optimisations internally to the data in this column.
In memory, the grouped attribute is applied to this column. This means that kdb+ internally stores a lookup table from each distinct value to the rows in which it appears, allowing much faster lookups on this column.
On disk, the parted attribute is applied to this column. This means that distinct values are stored contiguously on disk; this allows for all the data for a particular value to be retrieved in a single read operation. This enables much faster retrieval from disk.
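Refinery's internals aren't shown here, but the idea behind the parted attribute can be sketched in plain Python: because all rows for a given value sit contiguously, a single (start, end) pair per distinct value is enough to locate every row for that value in one contiguous read.

```python
# Illustrative sketch (not Refinery's actual implementation) of why the
# parted attribute makes lookups cheap: each distinct value maps to one
# contiguous slice of rows, so retrieval is a single contiguous read.

def build_parted_index(column):
    """Map each distinct value to its contiguous (start, end) row range.

    Assumes the column is already parted, i.e. each value's rows
    are stored contiguously.
    """
    index = {}
    for i, value in enumerate(column):
        if value not in index:
            index[value] = [i, i + 1]
        else:
            index[value][1] = i + 1  # extend the contiguous run
    return index

syms = ["AAPL", "AAPL", "GOOG", "GOOG", "GOOG", "MSFT"]
idx = build_parted_index(syms)
start, end = idx["GOOG"]
# All GOOG rows come back from one contiguous range:
rows = list(range(start, end))
```

The lookup table itself is tiny (one entry per distinct value), which is why querying on the symList column is so much cheaper than scanning an arbitrary column.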
Using the symList parameter effectively is one of the best ways to improve query read performance. If you find you are constantly querying for all possible values with symList and then filtering on another column, you should consider reconfiguring the column that symList applies to.
The applyFilter parameter¶
In order to filter data on more columns than the symList column, we can use the applyFilter parameter. Using the applyFilter parameter we can apply any number of additional filters to the data before retrieving it from the server. See here for examples of the types of filters that can be applied.
When using the applyFilter parameter, the biggest performance impact comes from the ordering of the filters. The filters should be ordered from most restrictive to least restrictive. In other words, the filter that will reduce the resulting dataset the most should be applied first.
For example, imagine we are filtering on the VenueID and LimitOrMarket columns in the table below and we want to return the rows where LimitOrMarket=MKT and VenueID=XNYS.
| VenueID | LimitOrMarket | Price |
|---------|---------------|-------|
| XLON | LIM | 9.0 |
| XLON | MKT | 9.3 |
| XLON | MKT | 1.2 |
| XNYS | LIM | 6.5 |
| XNYS | MKT | 1.1 |
| XNYS | LIM | 5.7 |
| XNYS | MKT | 9.7 |
| XAMS | LIM | 7.8 |
| XAMS | MKT | 10.1 |
Filtering on VenueID is the most restrictive, followed by LimitOrMarket. Querying for VenueID=XNYS first reduces the dataset to 4 rows; applying LimitOrMarket=MKT then reduces it to 2 rows.
If we switched the order of these filters, the dataset would be reduced to 5 rows first and then to 2 rows.
When scaled up to production level volumes, these differences can have a large impact.
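The effect of filter ordering can be sketched with the sample table above. This is plain Python, not the Refinery applyFilter mechanism, but it shows how many rows each ordering has to examine:

```python
# Illustrative sketch: applying filters in sequence over the sample table
# and counting how many rows each ordering examines.

table = [
    ("XLON", "LIM", 9.0), ("XLON", "MKT", 9.3), ("XLON", "MKT", 1.2),
    ("XNYS", "LIM", 6.5), ("XNYS", "MKT", 1.1), ("XNYS", "LIM", 5.7),
    ("XNYS", "MKT", 9.7), ("XAMS", "LIM", 7.8), ("XAMS", "MKT", 10.1),
]

def apply_filters(rows, filters):
    """Apply each filter in turn, tracking total rows examined."""
    examined = 0
    for predicate in filters:
        examined += len(rows)
        rows = [r for r in rows if predicate(r)]
    return rows, examined

# Most restrictive first: VenueID=XNYS (4 rows), then LimitOrMarket=MKT.
venue_first, n1 = apply_filters(table, [lambda r: r[0] == "XNYS",
                                        lambda r: r[1] == "MKT"])
# Least restrictive first: LimitOrMarket=MKT (5 rows), then VenueID=XNYS.
side_first, n2 = apply_filters(table, [lambda r: r[1] == "MKT",
                                       lambda r: r[0] == "XNYS"])
# Both orderings return the same 2 rows, but the most-restrictive-first
# ordering examines fewer intermediate rows (9 + 4 = 13 vs 9 + 5 = 14).
```

On nine rows the difference is one row; on billions of rows the intermediate dataset sizes dominate the cost of the query.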
The columns parameter¶
The columns parameter allows you to specify which columns you want to return to the client. kdb+ is a column-based database; on disk, data is stored in a way that optimises accessing specific columns and performing operations on columns as opposed to rows.
The file structure of a date partitioned database is given below. This is the most common database structure used in the Refinery.
date-1/
table-1/
column-1
column-2
...
table-2/
column-1
column-2
...
...
date-2/
table-1/
column-1
column-2
...
table-2/
column-1
column-2
...
...
Tables are stored as folders containing a single file for each individual column in that table.
When we specify the columns we are interested in using the columns parameter, only those column files are accessed and read from disk. By reducing the number of column files that are accessed in each query, we can improve the performance of our queries. This becomes especially apparent when we are querying tables that have 100 or more columns and we are only interested in returning a few important columns to the client.
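A minimal sketch of this column-per-file layout, using ordinary files in place of kdb+ splayed columns (the function names here are illustrative, not part of any Refinery API): selecting columns means opening only those files, and every other column file on disk is never touched.

```python
import os
import tempfile

# Illustrative sketch of column-at-a-time storage: each column lives in
# its own file inside the table's directory, mirroring the layout above.

def write_table(root, table, columns):
    """Write each column of a table to its own file."""
    path = os.path.join(root, table)
    os.makedirs(path, exist_ok=True)
    for name, values in columns.items():
        with open(os.path.join(path, name), "w") as f:
            f.write("\n".join(map(str, values)))

def read_columns(root, table, wanted):
    """Read only the requested column files; others are never opened."""
    result = {}
    for name in wanted:
        with open(os.path.join(root, table, name)) as f:
            result[name] = f.read().splitlines()
    return result

root = tempfile.mkdtemp()
write_table(root, "trades", {"price": [9.0, 9.3],
                             "size": [100, 250],
                             "venue": ["XLON", "XLON"]})
# Only the 'price' file is read; 'size' and 'venue' stay on disk.
data = read_columns(root, "trades", ["price"])
```

With a 100-column table, asking for 3 columns means 3 file reads per partition instead of 100, which is where the large savings come from.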
Further information about partitioned tables can be found here.
The temporality parameter¶
When we specify a start date and time and an end date and time, there are different ways that the data can be retrieved between these points in time.
If you're querying across multiple days with a start time and end time, you can set temporality=slice and the query will only return data between those times on each day.
Another option is to set temporality=continuous, which returns all data from the start time on the first day through to the end time on the last day. For example, suppose we run a query across 3 days with a start time of 09:00:00 and an end time of 11:00:00. If we set temporality=slice, the cells marked with an 'x' below will be retrieved:
| Time | Day 1 | Day 2 | Day 3 |
|---|---|---|---|
| ... | | | |
| 08:00:00 | | | |
| 09:00:00 | x | x | x |
| 10:00:00 | x | x | x |
| 11:00:00 | | | |
| 12:00:00 | | | |
| ... | | | |
On the other hand, if we set temporality=continuous, then the cells returned are:
| Time | Day 1 | Day 2 | Day 3 |
|---|---|---|---|
| ... | | x | x |
| 08:00:00 | | x | x |
| 09:00:00 | x | x | x |
| 10:00:00 | x | x | x |
| 11:00:00 | x | x | |
| 12:00:00 | x | x | |
| ... | x | x | |
If we're only interested in a particular window of data, then using slice temporality can greatly reduce the amount of data that needs to be accessed and read.
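The two modes can be sketched as a timestamp predicate. This is an illustrative Python model of the semantics described above, not the Refinery implementation, and the function name is hypothetical:

```python
from datetime import datetime, time

# Illustrative sketch of the two temporality modes as a membership test
# for a single timestamp. The end time is treated as exclusive, matching
# the tables above.

def in_window(ts, start_date, end_date, start_time, end_time, temporality):
    if not (start_date <= ts.date() <= end_date):
        return False
    if temporality == "slice":
        # Keep only the [start_time, end_time) window on every day.
        return start_time <= ts.time() < end_time
    if temporality == "continuous":
        # Everything from start_time on the first day through to
        # end_time on the last day, including whole middle days.
        if ts.date() == start_date and ts.time() < start_time:
            return False
        if ts.date() == end_date and ts.time() >= end_time:
            return False
        return True
    raise ValueError(f"unknown temporality: {temporality}")

d1 = datetime(2024, 1, 1).date()
d3 = datetime(2024, 1, 3).date()
args = (d1, d3, time(9), time(11))
# 08:00 on day 2: excluded by slice, included by continuous.
slice_hit = in_window(datetime(2024, 1, 2, 8), *args, "slice")
cont_hit = in_window(datetime(2024, 1, 2, 8), *args, "continuous")
```

Under slice temporality the server can skip whole regions of each day's partition, which is what makes it so effective for intraday windows.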
Transmission times¶
Reducing the size of the dataset being returned from the server is the best way to decrease the transmission time.
Reducing the number of rows¶
Filtering the data aggressively using the symList, applyFilter and temporality parameters will help to reduce the number of rows being returned to the client. Avoid returning large numbers of rows to the client by doing as much reduction on the server as possible.
Reducing the number of columns¶
We discussed using the columns parameter above to improve the read times of our queries. This parameter also helps to improve the transmission times of our queries since it reduces the size of our resultant dataset; we don't waste time serializing and deserializing columns that will be ignored when received by the client.
When the number of rows you are returning far exceeds the number of columns in your table, each column you drop saves one cell per row, so reducing the number of columns returned is the most effective way to decrease the size of the dataset.
For example, if you have a charting application, only return the columns that are necessary for the charts you are drawing; don't return all columns and then drop them on the client side.
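A back-of-envelope sketch of the savings (the numbers here are hypothetical, purely for illustration): the payload size scales with rows × columns, so every dropped column saves one cell per returned row.

```python
# Hypothetical back-of-envelope: payload size scales with rows * columns.

def payload_cells(rows, columns):
    """Total cells transmitted for a result of the given shape."""
    return rows * columns

full = payload_cells(1_000_000, 100)   # all 100 columns
trimmed = payload_cells(1_000_000, 5)  # only the 5 columns the chart needs
# Returning 5 of 100 columns shrinks the payload by 95%.
```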
Polling queries¶
If you have a query that is polling Refinery for data on a sufficiently frequent interval (e.g. once a minute), consider switching to having this data streamed to your application instead.
This section will be expanded in future with some metrics about the optimal point at which to switch from polling to streaming.