Late Data

As of Refinery 5.5.0, ingestion of late data is supported. In prior versions, all data flowing through the Tickerplant was placed in the current day's partition. Where the loaded data was from more than one day ago, it might not be retrieved correctly from the HDB.

In Refinery 5.5.0+, inbound data is written to the appropriate intraday date partition, then merged into the HDB in the background. This means that historical data can now be loaded simply by publishing to the Tickerplant, just like live data ingestion.
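As a rough illustration of the routing idea, the sketch below groups records by the calendar date of their timestamp and separates the current day's partition from earlier (late) ones. It is plain Python, not Refinery code: the function names `partition_by_date` and `split_late` and the `(timestamp, payload)` record shape are assumptions made for the example.

```python
from collections import defaultdict
from datetime import date, datetime


def partition_by_date(records):
    """Group (timestamp, payload) records by the date of their timestamp."""
    partitions = defaultdict(list)
    for ts, payload in records:
        partitions[ts.date()].append((ts, payload))
    return dict(partitions)


def split_late(partitions, today):
    """Return today's rows and a dict of the earlier (late) date partitions."""
    current = partitions.get(today, [])
    late = {d: rows for d, rows in partitions.items() if d < today}
    return current, late


# Hypothetical usage: two live records and one record timestamped two days ago.
records = [
    (datetime(2024, 1, 3, 9, 0), "live-1"),
    (datetime(2024, 1, 1, 12, 0), "late-1"),
    (datetime(2024, 1, 3, 10, 0), "live-2"),
]
current, late = split_late(partition_by_date(records), date(2024, 1, 3))
```

Here `current` holds the two live records, and `late` holds a single partition for 2024-01-01 that would need merging with the HDB.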

Late data intraday database

A new process, the LDIDB, has been added. Its specific role is to expose late data for queries between the time that end of day occurs and the time the data is merged into the HDB. This process is optional in pipeline configurations, as immediate, 24/7 querying of historical or late data is often not needed.

Intraday behaviour

Without an IDB

If running without an IDB, all ingested data will be accessible in the RDB until end of day (EOD).

Note: there is a risk of a wsfull (out-of-memory) error if too much data is published into the system.

Meanwhile, the IPDB will persist the data into appropriate intraday date partitions.

With an IDB

The RDB will hold data published since the last intraday writedown. The IDB will expose the day's data written prior to the last intraday writedown. This enables large historical data volumes to be ingested and immediately exposed for query.

EOD behaviour

On-disk data lifecycle

Intraday partitions for previous days are migrated into the late directory, a staging area for data that requires merging with existing HDB partitions. The current day's data is sorted and added to the HDB. The RDB is then wiped.

The late data is then merged with the HDB data, day by day, and copied in. This is done in the background and does not affect the ability to query the HDB.
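The first stage of this lifecycle can be sketched as a set of directory moves. The example below is illustrative only: the `intraday/`, `late/` and `hdb/` directory names, the ISO-formatted partition names, and the `end_of_day` function are assumptions, not Refinery's actual on-disk layout or API. Sorting of the current day's data is omitted.

```python
import shutil
from pathlib import Path


def end_of_day(root, today):
    """Stage prior-date intraday partitions in late/ and move today's to hdb/.

    Partition names are assumed to be ISO dates (YYYY-MM-DD), so string
    comparison orders them chronologically. Destination partitions are
    assumed not to exist yet. Returns the late dates staged for merging.
    """
    root = Path(root)
    intraday, late, hdb = root / "intraday", root / "late", root / "hdb"
    late.mkdir(parents=True, exist_ok=True)
    hdb.mkdir(parents=True, exist_ok=True)

    staged = []
    for part in sorted(intraday.iterdir()):
        if part.name == today:
            # Current day: (sort omitted) add straight to the HDB.
            shutil.move(str(part), str(hdb / part.name))
        elif part.name < today:
            # Earlier date: stage for a background merge with the HDB.
            shutil.move(str(part), str(late / part.name))
            staged.append(part.name)
    return staged
```

Each date returned by `end_of_day` would then be merged into its HDB partition in the background, one day at a time.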

With LDIDB

If an LDIDB is running, data in the late directory is available for query while it is being integrated with the HDB data.

Without LDIDB

Between the EOD time (usually midnight) and the time the late data is integrated into the HDB, none of the late data ingested during the prior day will be available for query. As soon as it is merged with an HDB day partition, it will be available for query from the HDB.

Adding a late data IDB to a pipeline

To add a late data IDB to a pipeline, add the following block under processes:

    ldidb:
      timeout: 0

See Creating pipelines.

FAQ

How does this affect the system if none of my data is late?

If no data with timestamps prior to the current day is ingested, the code path and behaviour are completely unchanged from versions prior to Refinery 5.5.0.

Should I have an LDIDB?

An LDIDB is only needed if there is a use case for querying late data immediately after EOD. Otherwise, the late data will be available to query once the merge into the main HDB has taken place.

How long will the late data merge take?

This depends on multiple factors: the number of distinct dates, the data volumes ingested, the size of the data currently in the HDB, and disk and CPU speed. If the data volumes are on the order of megabytes per date, the merge may take only seconds or minutes. As volumes approach hundreds of gigabytes or terabytes, it could take hours.

Why does adding data for a late day take longer than adding data for the current day?

For data timestamped today, all the system needs to do is sort the partition and copy it into the HDB directory. If the late partition does not exist in the HDB, the sort and writedown is just as quick. However, if the partition already exists, then the new data must be merged.

Data cannot simply be upserted into a pre-existing HDB partition as the HDB process is mounted upon the filesystem. Any changes to the underlying files would cause query errors, as kdb keeps an in-memory map of the data.

Therefore, to merge data, the HDB partition is copied to a staging area, the new data is upserted into the copy, and the result is sorted and moved into the main HDB directory. This copy operation is generally the bottleneck, but it is necessary to allow uninterrupted queries on the main HDB. For this reason, the LDIDB exists to service queries for the late data before it is merged into the main HDB.
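The copy, upsert, sort and move sequence can be sketched as follows. This is a simplified Python model, not the Refinery implementation: a partition is represented as a directory holding a single text file of rows, and the function name `merge_late_partition`, the `rows.txt` file name and the directory arguments are all hypothetical. It assumes the staging area and the HDB are on the same filesystem, so the final moves are renames.

```python
import shutil
from pathlib import Path


def merge_late_partition(hdb, late, staging, day, filename="rows.txt"):
    """Merge late/<day> into hdb/<day> without editing live HDB files in place."""
    hdb_part = Path(hdb) / day
    late_part = Path(late) / day
    stage_part = Path(staging) / day
    late_rows = (late_part / filename).read_text().splitlines()

    if not hdb_part.exists():
        # No existing partition: just sort the late data and move it in.
        (late_part / filename).write_text("\n".join(sorted(late_rows)) + "\n")
        Path(hdb).mkdir(parents=True, exist_ok=True)
        late_part.rename(hdb_part)
        return

    # 1. Copy the live partition to the staging area (usually the slow step).
    shutil.copytree(hdb_part, stage_part)
    # 2. Add the late rows to the staged copy, then 3. sort the result.
    rows = (stage_part / filename).read_text().splitlines() + late_rows
    (stage_part / filename).write_text("\n".join(sorted(rows)) + "\n")
    # 4. Swap the merged copy in, then discard the old and late partitions.
    retired = Path(staging) / (day + ".old")
    hdb_part.rename(retired)
    stage_part.rename(hdb_part)
    shutil.rmtree(retired)
    shutil.rmtree(late_part)
```

The live partition is only touched by the two renames in step 4, which is why the full copy in step 1 is the price paid for keeping the HDB queryable throughout.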

What happens if I continue publishing data while there is still data in the LDIDB waiting to be merged?

The IPDB, IDB and LDIDB are decoupled. Because the initial step at EOD moves the late partitions into the late directory, data can continue to be ingested by the IPDB (and exposed by the IDB), even if its timestamp falls on the same date as data ingested yesterday.