This guide walks you through creating a data warehouse in KX Stream. It introduces each of the basic components used in the system and some of the core frameworks they are built on. The guide builds a data warehouse from scratch for a sample use case; steps are provided for you to extend it for your own applications.
What is a data warehouse?
A data warehouse is a system of components for capturing and storing large amounts of data across varying sources and serving this data to end-clients. Tools are provided to:
- store realtime and historical data
- perform complex enrichment of data streams
- serve the data to end-clients in a consolidated view
A variety of clients can be serviced by a data warehouse through a set of access frameworks, including web and downstream applications, analyst users, and reporting tools.
In KX Stream, data is typically split by age into time buckets, usually calendar days. Historical data is partitioned by date and stored on disk. The current bucket uses a hybrid of in-memory and on-disk storage: the most recent data is held in memory for fast access, and the rest is saved on disk.
This is the intraday writedown approach. It protects against data volumes outgrowing available memory, especially when servicing queries and performing end-of-day (EOD) processing. This approach also uses another process, the log replay, to perform the on-disk writes, which frees up database components for queries or other tasks. The diagram below shows the basic architecture.
KX Stream provides out-of-the-box templates for the different roles, with hooks for you to customize their behavior.
| Abbreviation | Component | Role |
|---|---|---|
| TP | Tickerplant | acts as an entry point for streaming data; logs data streams for failover and distributes to other processes |
| RDB | Realtime database | stores realtime data in-memory for fast access |
| IDB | Intraday database | stores intraday data on-disk |
| HDB | Historical database | data store for historical data from previous windows |
| LR | Log replay | uses the logs generated by the |
| RTE | Realtime engine | cleanses and enriches data streams |
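To make the split between the realtime, intraday, and historical stores concrete, here is a minimal Python sketch of the intraday writedown idea. It is an illustration only, not KX Stream code: the `TieredStore` class, its method names, and the row-count threshold are all hypothetical, and the on-disk stores are modelled as plain Python lists. In the real system the on-disk writes are performed by the log replay process; the sketch collapses that into a single method for brevity.

```python
from collections import defaultdict
from datetime import date, datetime


class TieredStore:
    """Toy model of the RDB / IDB / HDB split (hypothetical, not a KX API)."""

    def __init__(self, max_in_memory: int):
        self.max_in_memory = max_in_memory  # assumed threshold, counted in rows
        self.rdb = []                       # newest rows of the current bucket, in memory
        self.idb = []                       # older rows of the current bucket, stands in for disk
        self.hdb = defaultdict(list)        # previous buckets, partitioned by date

    def insert(self, ts: datetime, row: dict) -> None:
        """Accept a new row and write down once memory use passes the threshold."""
        self.rdb.append((ts, row))
        if len(self.rdb) > self.max_in_memory:
            self.writedown()

    def writedown(self) -> None:
        """Intraday writedown: move in-memory rows to the intraday store."""
        self.idb.extend(self.rdb)
        self.rdb.clear()

    def end_of_day(self, day: date) -> None:
        """EOD: roll the whole current bucket into its date partition."""
        self.hdb[day].extend(self.idb + self.rdb)
        self.idb.clear()
        self.rdb.clear()

    def query_today(self) -> list:
        """Consolidated view of the current bucket across both tiers."""
        return self.idb + self.rdb
```

Because each writedown empties the in-memory tier, memory use stays bounded by the threshold regardless of how much data arrives during the day, while `query_today` still returns the full bucket in arrival order.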
The system is a set of disparate processes; the Messaging framework lets them discover and stream data to each other.
The framework consists of servers and clients. When clients launch, they register with the server and publish metadata describing the topics they want to subscribe to, publish to, or both.
The server stores this metadata and matches publishers to consumers when an overlap of topics occurs. It initiates a handshake between the processes and sets up a subscription between them.
After the handshake, the clients communicate directly.
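The register-and-match flow described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the Messaging framework's actual API: the `Client` and `MessagingServer` classes, the `register` method, and any topic names used with them are hypothetical. The handshake is reduced to recording which publisher feeds which subscriber on which topics.

```python
from dataclasses import dataclass, field


@dataclass
class Client:
    """A process that declares the topics it publishes and subscribes to."""
    name: str
    publishes: set = field(default_factory=set)
    subscribes: set = field(default_factory=set)


class MessagingServer:
    """Stores client metadata and pairs publishers with subscribers (hypothetical)."""

    def __init__(self):
        self.clients = []
        self.subscriptions = []  # (publisher name, subscriber name, shared topics)

    def register(self, client: Client) -> None:
        """Record a new client and match it against every known client."""
        for other in self.clients:
            self._match(other, client)  # existing process publishing to the new one
            self._match(client, other)  # new process publishing to an existing one
        self.clients.append(client)

    def _match(self, publisher: Client, subscriber: Client) -> None:
        """Set up a subscription when published and subscribed topics overlap."""
        topics = publisher.publishes & subscriber.subscribes
        if topics:
            # Stand-in for the handshake; afterwards the two clients talk directly.
            self.subscriptions.append((publisher.name, subscriber.name, topics))
```

The server only brokers the introduction. Once a pair appears in `subscriptions`, the data itself would flow between the two clients without passing through the server, which matches the direct communication described above.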
In KX Control, processes are pre-configured before they are run. The guide defines all processes as service class entities, which let you run many processes from a single definition. The components of the data warehouse can be scaled elastically in response to system load.
The service-class documentation details the benefits of this model; in short, it provides much greater scalability and flexibility.