# Architecture of kdb+ systems

Applications that use kdb+ typically comprise multiple processes

The small footprint of the q interpreter, the interprocess communication baked into q, and the range of interfaces available make it straightforward to incorporate kdb+ into a multi-process application architecture.

Certain kinds of process recur across applications.

## Data feed

This is a source of real-time data; for example, financial quotes and trades from Bloomberg or Refinitiv, or readings from a network of sensors

## Feedhandler

Parses data from the data feed to a format that can be ingested by kdb+.

Kx’s Fusion interfaces connect kdb+ to a range of other technologies, such as R, Apache Kafka, Java, Python and C.

## Tickerplant

Captures the initial data feed, writes it to the log file and publishes these messages to any registered subscribers. Aims for zero-latency. Includes ingesting data in batch mode.

Manages subscriptions: adds and removes subscribers, and sends subscriber table definitions.

Handles end-of-day (EOD) processing.

Best practices for tickerplants

Tickerplants should be lightweight, not capturing data and using very little memory.

For best resilience, and to avoid core resource competition, run them on their own cores.

## Log file

This is the file to which the Tickerplant logs the q messages it receives from the feedhandler. It is used for recovery: if the RDB has to restart, the log file is replayed to return to the current state.

Best practices for log files

The logging process can run on any hardware and OS, from a RaspberryPi to a cloud server.

Store the file on a fast local disk to minimize publication delay and I/O waits.

## Real-time database

Subscribes to messages from the Tickerplant, stores them in memory, and allows this data to be queried intraday.

At end of day usually writes intraday data to the Historical Database, and sends it a new EOD message.

Best practices for real-time databases

RDBs queried intraday should exploit attributes in their tables. For example, a trade table might be marked as sorted by time (s#time) and grouped by sym (g#sym).

RDBs require RAM as they are storing the intraday messages. Calculate how much RAM your RDB needs for a given table:

(Expected max # of messages) * schema cost * flexibility ratio

Schema cost: for a given row, a sum of the datatype size.
Flexibility ratio: 1.5 is a common value

## Real-time subscriber

Subscribes to the intraday messages and typically performs some additional function on receipt of new data – e.g. calculating an order book or maintaining a subtable with the latest price for each instrument.

Best practices for real-time subscribers

Write streaming analytics to compute the required results, rather than timed computations.

Ensure analytics can deal with multiple messages, so there are no dependencies here if the tickerplant runs in batch mode.

Check analytic run time versus expected TP publish intervals to ensure you don’t bottleneck. In general, look to the most busy and stressful market day for this, and add additional scaling factors. E.g. If my TP publishes a message ~every 30ms, my analytic should take less than 30ms to run. To allow for message throughput to double in the TP, the analytic should run in <15ms.

## Historical database

Provides a queryable data store of historical data; for example, for creating customer reports on order execution times, or sensor failure analyses.

Large tables are usually stored on disk partitioned by date, with each column stored as its own file.

The dates are referred to as partitions and this on-disk structure contributes to the high performance of kdb+.

Best practices for historical databases

Attributes are key. Partition tables on disk on the most-queried column.

If the first two columns are time and sym, sorting on time within sym partions is assumed and provides a performance boost.

Can add grouping attribute for other highly-queried columns.

When creating the database schema consider the symbol versus string type choice very carefully:

• Symbol type: Use symbols for columns with highly repeating data that are queried most frequently e.g. sym, exchange, side etc.
• String type: Any highly variable data e.g. order ID

Database sizing follows the same formula as the RDB sizing.

Consider using compression for older data, or less-queried columns, to reduce on-disk size. Typically compression sees ⅕ the space usage. When compressing databases, choose compression algorithm and blocksizes through performance comparisons on typical queries.

## Gateway

The entry point into the kdb+ system. Responsible for routing incoming queries to the appropriate processes, and returning their results.

Can connect both the real-time and historical data to allow users to query across both. In some cases, a gateway will combine the result of a series of queries to different processes.

Best practices for gateways

Run only lightweight code.

Track disconnections and queries submitted.

Return sensible errors when queries fail.

Use the deferred-response feature (V3.6) to avoid additional coding on the side of connecting non-kdb+ processes.