Data persistence in Refinery 5

Understanding Refinery's data persistence

The process of storing data has changed between Refinery 4 (R4) and Refinery 5 (R5).

In R4, data arriving over the course of the day was stored in the Real-Time Database (RDB). At End of Day (EOD), at midnight (00:00), the Persistence Database (PDB) wrote and sorted that data into the .z.d-1 (the previous day's) partition in the Historical Database (HDB) directory. The HDB was mounted on this directory, and this was the only means of publishing data to the HDB. While the PDB was writing and sorting the data, it would not accept any new data.

R5 makes this process more dynamic and versatile: the EOD process can be called at any time during the day. If you want to write your data to your HDB straight after it is published via your Tickerplant (TP), you can use a Command Line Interface (CLI) script to do so.

Stepping through Refinery 5's data persistence process

Data Life Cycle

Data ingestion

Data enters the pipeline through feed handlers, which ensure that all incoming data conforms to the correct Refinery schema before publishing it to the Tickerplant (TP). The TP then publishes the data via -25! (async broadcast) to the RDB, where it is held in memory for querying, and to the Intraday Persisting Database (IPDB) [IPDB 1].
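The ingestion step can be sketched in Python as a schema check followed by a fan-out. This is an illustration only, not Refinery code: the schema, the conforms and broadcast functions, and the use of in-process queues as stand-ins for the RDB and IPDB connections are all assumptions.

```python
import pickle
from queue import Queue

# Hypothetical schema: column name -> expected Python type.
SCHEMA = {"time": float, "sym": str, "price": float}


def conforms(row, schema=SCHEMA):
    """Return True if the row has exactly the schema's columns and types."""
    return row.keys() == schema.keys() and all(
        isinstance(row[col], typ) for col, typ in schema.items()
    )


def broadcast(row, subscribers):
    """Validate a row, serialise it once, and hand it to every subscriber."""
    if not conforms(row):
        raise ValueError("row does not match the schema")
    payload = pickle.dumps(row)       # serialise once for all subscribers
    for queue in subscribers:
        queue.put_nowait(payload)     # fire and forget, no reply awaited
```

Here the publisher does not wait for any subscriber to process the message, which is the property that matters for an asynchronous broadcast.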

Intraday rollover

Without IDB

If the pipeline does not have an Intraday Database (IDB), inbound data is written down to a single directory, either at a specified interval or once a configured number of rows has accumulated (referred to as the intraday writedown). Data is wiped from IPDB memory after the upsert.
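A minimal Python sketch of the writedown trigger follows. The IntradayBuffer class, its field names, and the list used as a stand-in for the on-disk directory are hypothetical, and the sketch simply checks whichever thresholds have been set.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IntradayBuffer:
    """In-memory rows awaiting the next intraday writedown (hypothetical)."""
    max_rows: Optional[int] = None       # assumed row-count threshold
    interval_s: Optional[float] = None   # assumed time threshold, in seconds
    last_write: float = 0.0              # time of the previous writedown
    rows: list = field(default_factory=list)

    def should_write(self, now: float) -> bool:
        """Return True when a configured threshold has been reached."""
        by_rows = self.max_rows is not None and len(self.rows) >= self.max_rows
        by_time = (self.interval_s is not None
                   and now - self.last_write >= self.interval_s)
        return by_rows or by_time

    def upsert(self, row, now: float, disk: list) -> bool:
        """Buffer a row; write down and wipe memory if a threshold is hit."""
        self.rows.append(row)
        if not self.should_write(now):
            return False
        disk.extend(self.rows)   # stand-in for the upsert to disk
        self.rows.clear()        # memory is wiped after the upsert
        self.last_write = now
        return True
```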

Data is stored in an intraday partition that matches the date component of the configured time column's timestamp.
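As an illustration of this rule, the Python sketch below groups rows by the date component of their time column. The function name, the default column name "time", and the YYYY.MM.DD partition naming are assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime


def group_by_partition(rows, time_col="time"):
    """Group rows under a partition name derived from each timestamp's date."""
    partitions = defaultdict(list)
    for row in rows:
        # Only the date component decides the partition; time of day is ignored.
        partitions[row[time_col].strftime("%Y.%m.%d")].append(row)
    return dict(partitions)
```

Under this rule, a row stamped just before midnight and a row stamped just after it land in different partitions, regardless of when they arrive.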

With IDB

If the pipeline does have an IDB, there are two persist directories, and queries alternate between reading from one and the other. The IDB is mounted on whichever directory is not currently being written to.

The IPDB holds the previous intraday interval in memory as well as the new one, so that each alternating persist directory "catches up" with the one that was previously written to. The IPDB also holds the last row of each table. Each time an intraday write occurs, it sends these rows to the RDB, which flushes [IPDB 2] everything up to and including them.
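One way to picture the "catch up" behaviour is the toy Python model below. It is an interpretation of the description above, not Refinery's implementation: the class name, the directory labels "a" and "b", and the use of lists in place of on-disk directories are all assumptions.

```python
class AlternatingPersist:
    """Toy model of two persist directories written in alternation."""

    def __init__(self):
        self.dirs = {"a": [], "b": []}  # stand-ins for the two persist directories
        self.write_target = "a"         # directory the next write goes to
        self.mounted = None             # directory the IDB currently serves
        self.previous = []              # interval written on the last writedown

    def intraday_write(self, current):
        """Write the previous and current intervals to the target directory."""
        target = self.write_target
        # The target missed the previous interval (it went to the other
        # directory), so write that interval again along with the new one.
        self.dirs[target].extend(self.previous + list(current))
        self.previous = list(current)
        # Swap roles: the IDB mounts the directory just written, and the
        # next write goes to the other one.
        self.mounted = target
        self.write_target = "b" if target == "a" else "a"
```

In this model, each directory is complete as soon as it has been written, and the directory being served is never the one receiving the write.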

EOD rollover

Because the system supports late data, EOD can happen at any time, and more than once in a day. When the EOD event is triggered, a message is sent and the IPDB immediately performs an intraday write to flush any data still in memory to disk. At this point it holds the last row of each table, which is later used to co-ordinate the RDB flush.
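The last-row handshake can be sketched in Python as follows. The source does not say how the RDB matches the last persisted row, so the "seq" field (a monotonically increasing arrival number) and the function name are hypothetical.

```python
def flush_through(rdb_table, last_persisted_seq):
    """Drop every row up to and including the last persisted one.

    Rows are dicts carrying a hypothetical "seq" arrival number; only rows
    that arrived after the last persisted row remain in memory.
    """
    return [row for row in rdb_table if row["seq"] > last_persisted_seq]
```

Rows that arrive while the write is in progress have a higher sequence number than the marker, so they stay in the RDB until a later flush.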

The EPDB (End of Day Persisting Database) is responsible for handling further operations so that the IPDB can continue to persist incoming data during EOD procedures.

"Late data" is any new data for which a partition already exists within the HDB. If there is late data, the EPDB copies the HDB directory data into the HDB-alt directory [EPDB 3] so that it can merge the new data without affecting HDB operation.

The partitions that were in the IPDB persist directories are moved to the "late data" directory [EPDB 1]. If the pipeline has a Late Data Intraday Database (LDIDB), it then mounts and services queries during the EOD period.

If data is not "late", it is copied from the IPDB persist directory into the HDB directory and sorted [EPDB 2], as this does not affect HDB queries.
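The late or not-late decision reduces to a membership test on partition names. The Python sketch below assumes late partitions go to the late data directory and all others go straight to the HDB. The function name and the result keys are hypothetical.

```python
def route_partitions(persisted, hdb_partitions):
    """Split persisted partitions by whether the HDB already has them."""
    existing = set(hdb_partitions)
    late = [p for p in persisted if p in existing]        # merged via HDB-alt
    fresh = [p for p in persisted if p not in existing]   # copied to the HDB
    return {"late_dir": late, "hdb_dir": fresh}
```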

The data from the late directory partitions is then upserted by the EPDB into the HDB-alt directory partitions and sorted [EPDB 4]. Once an HDB-alt partition is sorted, the HDB partition is moved to a temporary garbage directory [EPDB 5], the HDB-alt partition is moved into the HDB [EPDB 6], and the garbage partition is then deleted. The HDB is synchronously blocked during this move operation to prevent query errors. On unblock, an HDB reload is forced to pick up the new partition. If there is an LDIDB, it is also blocked and reloaded during this operation to prevent query errors.
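The swap sequence can be sketched in Python with ordinary directory renames. This is a sketch under stated assumptions rather than the EPDB's actual code: the function name, the lock standing in for the HDB block, the reload callback, and deleting the garbage partition after the unblock are all choices made for the example.

```python
import shutil
import tempfile
import threading
from pathlib import Path


def swap_partition(hdb: Path, hdb_alt: Path, garbage: Path, date: str,
                   lock: threading.Lock, reload=lambda: None) -> None:
    """Replace hdb/<date> with the merged hdb_alt/<date> partition."""
    with lock:                                    # stand-in for blocking the HDB
        (hdb / date).rename(garbage / date)       # [EPDB 5] old partition out
        (hdb_alt / date).rename(hdb / date)       # [EPDB 6] merged partition in
    reload()                                      # forced reload on unblock
    shutil.rmtree(garbage / date)                 # delete the garbage partition
```

Renaming a directory within one filesystem is quick compared with copying, which keeps the blocked window short. The slow merge and sort happen in HDB-alt beforehand, outside the lock.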

Once the most recent data is in the HDB, a message is sent to the RDB telling it to wipe each table from memory, synchronised with the reload of the HDB.