The KX Insights Storage Manager (SM) handles data persistence and migration.
Storage Manager subscribes to data from an ingestion stream and persists it at scheduled intervals within a day, and again at the end of the day. These processes are referred to as end-of-interval and end-of-day writedown, respectively. The writedown process organizes the data on disk in a fault-tolerant manner, dividing time into days and days into intervals.
To manage the data lifecycle, Storage Manager must also integrate with either KX Insights Data Access and its Data Access Processes, or the custom data access processes of an existing application (for example, RDBs and HDBs in a traditional kdb+ application).
It also migrates data between pre-defined tiers of on-disk storage, for example from SSD to spinning disk to object storage. This migration occurs as the last step of the daily persistence operations.
The Data Lifecycle
The initial phase of the data lifecycle is the movement of data by the Storage Manager from an in-memory store (for example, a Data Access Process or an RDB) to on-disk storage. SM performs this movement without interrupting the running processes. When the movement is complete, SM signals those processes that writedown has finished and that it is safe to purge the corresponding data from memory (more on this below).
Each writedown event performed by SM is treated as a general migration of data between storage tiers. Each data access process exposes the data from a certain mount, which corresponds to one or more storage tiers. The details of mounts and storage tiers are specified in the assembly configuration YAML file.
The first step in moving data from memory to disk involves a Storage Manager concept called the Interval Database, or IDB. This is an on-disk datastore that is written to at a prescribed, configurable interval within a single day. Its purpose is to relieve memory pressure by pushing a portion of the in-memory data to disk at regular intervals throughout the day, rather than waiting for the end-of-day writedown after a full day's ingestion.
The interval is commonly one hour, but it can and should be set based on the expected data volumes of the application and the resources of the hardware it runs on. For example, if data volumes are very high and machine memory is limited, the in-memory datastore has limited capacity, and frequent writes to the IDB are required to avoid memory starvation and its impact on system operations. On the other hand, if data volume is low enough that the RDB processes can hold an entire day's worth of data, the system can be configured without an IDB mount (but with an IDB storage tier). In that case, SM triggers the RDB purge only after the end-of-day writedown, because for a general storage tier migration SM notifies only the processes that use the mounts affected by the migration.
The IDB populates individual partitions on disk at the configured interval throughout the day. At the end-of-day point, Storage Manager persists all of the day's IDB partitions to the HDB, and the intraday interaction between data in memory and the IDB restarts for the following day.
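The cycle described above can be pictured with a small in-process model. This is only an illustrative sketch, not SM's API: the class `ToyWritedown`, its methods, and the way the purge signal is modeled as a direct method call are all assumptions made for the example.

```python
class ToyWritedown:
    """Toy model of the interval / end-of-day cycle (hypothetical, not SM's API)."""

    def __init__(self):
        self.memory = []   # rows held by the in-memory store (e.g. an RDB)
        self.idb = []      # interval partitions written so far today
        self.hdb = {}      # date string -> rows persisted at end of day

    def ingest(self, row):
        self.memory.append(row)

    def end_of_interval(self):
        # Write a copy to the IDB first, and only then signal that a purge is safe.
        self.idb.append(list(self.memory))
        self._on_writedown_complete()

    def _on_writedown_complete(self):
        # Stand-in for the completion signal: the in-memory store drops
        # the rows that are now safely on disk.
        self.memory.clear()

    def end_of_day(self, date):
        # Collapse every interval partition into the day's HDB partition,
        # then start the next day with an empty IDB.
        self.hdb[date] = [row for part in self.idb for row in part]
        self.idb = []


sm = ToyWritedown()
for i in range(5):
    sm.ingest({"sym": "A", "px": i})
    if i % 2 == 1:       # pretend an interval boundary falls after every 2nd row
        sm.end_of_interval()
sm.end_of_interval()     # final interval before the day closes
sm.end_of_day("2024-01-02")
print(len(sm.hdb["2024-01-02"]), len(sm.idb), len(sm.memory))  # prints: 5 0 0
```

The ordering inside `end_of_interval` is the point of the sketch: memory is released only after the interval partition exists, which mirrors the rule that a purge is safe only once writedown has completed.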
Storage Manager consists of four processes that collectively perform its functions. These components are designed to operate together and are not intended, or built, for individual use. They are described at a high level below:
- SM is responsible for coordinating the writedown operations and for exposing the front-end interface through which other microservices communicate with SM.
- EOI is responsible for performing the end-of-interval operation, which persists a portion of an in-memory datastore to disk, stored in IDB partitions.
- EOD is responsible for performing the end-of-day operation, which persists the entirety of the on-disk IDB data to disk, stored in HDB partitions.
- DBM is responsible for migrating data between HDB storage tiers, which are distinct portions of the on-disk database spread across various storage volumes. Such volumes are commonly of different storage types, ranging from high-performance storage (for the most recent, business-critical data) to slower, cheaper storage (for data that is accessed less frequently, potentially in a compressed format).
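As a rough illustration of tier migration, the sketch below assigns each date partition to a tier based on its age and lists the moves that would be needed. The tier names, the age thresholds, and the functions `tier_for` and `plan_migrations` are hypothetical; the actual migration policy is whatever the assembly configuration specifies, and it need not be age-based.

```python
from datetime import date

# Hypothetical tiers, fastest first: (name, maximum partition age in days).
# None means "accept everything older than the previous tiers allow".
TIERS = [("ssd", 7), ("hdd", 90), ("archive", None)]


def tier_for(partition_date, today, tiers=TIERS):
    """Return the name of the first tier whose age limit admits the partition."""
    age_days = (today - partition_date).days
    for name, max_age in tiers:
        if max_age is None or age_days <= max_age:
            return name
    raise ValueError("no tier accepts this partition")


def plan_migrations(current, today):
    """Return (partition_date, from_tier, to_tier) for partitions that must move."""
    moves = []
    for partition_date, tier in sorted(current.items()):
        target = tier_for(partition_date, today)
        if target != tier:
            moves.append((partition_date, tier, target))
    return moves


today = date(2024, 4, 1)
current = {
    date(2024, 3, 30): "ssd",   # 2 days old: stays on the fast tier
    date(2024, 3, 20): "ssd",   # 12 days old: due to move to the slower tier
    date(2023, 12, 1): "hdd",   # 122 days old: due to move to the last tier
}
moves = plan_migrations(current, today)
for partition_date, src, dst in moves:
    print(partition_date, src, "->", dst)
```

Separating the planning step from the actual data movement keeps the policy easy to test on its own, which is why the sketch only computes a list of moves.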
To achieve instant reloading of the HDB and IDB, and full recoverability from any writedown failure, SM creates a loadable kdb+ database in which the table directories are symbolic links to versioned physical table data. If an existing kdb+ database is detected in SM's configured first HDB-tier directory on the first run, it is enhanced with all the symbolic links that SM needs for writedown management going forward.
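The general technique of publishing versioned data through symbolic links can be sketched as follows. This is not SM's implementation: the directory layout, the `.tmp` suffix, and the function `publish_version` are assumptions for illustration, and the example assumes a POSIX filesystem where renaming over an existing symlink is atomic.

```python
import os
import tempfile


def publish_version(db_root, table, version_dir):
    """Point db_root/table at version_dir by atomically replacing a symlink."""
    link = os.path.join(db_root, table)
    tmp_link = link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)        # clear debris left by an interrupted run
    os.symlink(version_dir, tmp_link)
    os.replace(tmp_link, link)     # rename: readers see the old or the new target


# Hypothetical layout: a loadable database directory plus versioned table data.
root = tempfile.mkdtemp()
db = os.path.join(root, "db")
os.mkdir(db)
v1 = os.path.join(root, "data", "trade", "v1")
v2 = os.path.join(root, "data", "trade", "v2")
os.makedirs(v1)
os.makedirs(v2)

publish_version(db, "trade", v1)   # first writedown
publish_version(db, "trade", v2)   # later writedown swaps the link in one step

live = os.path.realpath(os.path.join(db, "trade"))
print(live == os.path.realpath(v2))  # prints: True
```

Because the new version is fully written before the link is swapped, a failure during writedown leaves the previous version in place and still loadable, and the previous version remains on disk after the swap.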
SM writes down all tables it receives from the message stream which are configured in the assembly schema. Currently, each partitioned table must have a timestamp column (configured as prtnCol in the schema) which is used to determine the target partition date for end-of-day writedown.
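To make the role of the partition column concrete, the sketch below groups rows by the calendar date of a timestamp column. The function `partition_by_date`, the column name `time`, and the row layout are hypothetical; only the idea that the timestamp column decides the target date partition comes from the text above.

```python
from collections import defaultdict
from datetime import datetime


def partition_by_date(rows, prtn_col):
    """Group rows by the calendar date of their partition column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[prtn_col].date()].append(row)
    return dict(partitions)


rows = [
    {"time": datetime(2024, 1, 2, 9, 30), "sym": "A"},
    {"time": datetime(2024, 1, 2, 16, 0), "sym": "B"},
    # Late data: arrives with the rows above but belongs to the previous day.
    {"time": datetime(2024, 1, 1, 23, 59), "sym": "A"},
]

parts = partition_by_date(rows, "time")
for day in sorted(parts):
    print(day, len(parts[day]))
```

Routing by the timestamp value rather than by arrival time is what lets a late row land in the partition for the day it is associated with.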
Storage Manager is most appropriate for adoption in cases where you need:
- a process for handling writedown of data that is resilient against failures
- a parallelized writedown process for greater efficiency in persistence operations
- to ensure writedown operations do not impact other system processes and their functions, for example the ability to serve queries
- to move data from on-disk storage to object storage in the cloud
- to run a tiered database and seamlessly move data through the tiers
- the ability to compress certain tiers of your database
Currently Available Features
- Data-agnostic in design and implementation
- Commits ingested data to disk in an organized, fault-tolerant manner that is resilient against failure at any point during writedown operations and efficient for querying
- Manages its own checkpoint state
- Supports transparent periodic data reorganization processing
- Notifies processes affected by data reorganization
- Supports various types of storage (memory, NVMe, SSD, etc.)
- Provides parallelism of maintenance operations where possible, to reduce elapsed time
- Supports late data (data arriving on a day different from that with which it is associated)
- Supports standard kdb+ table types (basic, splayed, partitioned)
- Supports tiered storage with configurable migration and compression policies
- Supports starting off from a pre-existing standard kdb+ partitioned database (which gets converted into SM format)
Planned Future Features
- Migration of historical data to object storage
- Support for dynamic schema changes with zero downtime
- Dynamic scalability
- Export of on-disk data to a standard kdb+ partitioned database