Fault tolerance and process recovery¶
Refinery setup¶
Does the introduction of NAS for historical storage complicate the ingestion path?¶
NAS (Network-Attached Storage) is a file-level storage architecture that makes stored data accessible to devices on the same network. NAS works effectively in a single-side failure, but be aware that network speed affects the performance of querying and writing.
It is unwise to have more than one instance of a system ingesting into a single NAS data source, as this removes the redundancy of each side being independent. To get around this, have only the primary-path instance ingest data, and have the other instance(s) read the data on the NAS.
Warning
Do not configure all instances to both ingest and read data, as this causes complications when writing down data.
Does KX Refinery run 24/7?¶
KX Refinery should always run in 24/7 mode. If the system is down, issues may occur: configuration changes may not be applied, and rollovers may not take place.
Setting ports for clustered systems¶
Ports can be specified for processes in the pipeline YAML under each type of process. However, for the pipeline to work across all hosts, these port numbers must be the same across all N instances; otherwise the processes won't know which port to connect to on the other hosts. If the TP on one instance is set to port 5000, then all TPs must be set to port 5000.
```yaml
processes:
  tp:
    port: 5000
```
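As a minimal sketch of the same rule applied to other process types, a configuration might look like the following; the `rdb` and `hdb` keys shown here are illustrative assumptions, and each value must be identical across every instance (0..N-1):

```yaml
# Illustrative sketch only: the rdb/hdb keys are assumptions.
# Every instance must use the same port for the same process type.
processes:
  tp:
    port: 5000
  rdb:
    port: 5010
  hdb:
    port: 5020
```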
Is it possible to subscribe to a subset of symbols/tables from a feed?¶
For example, in an RTE on a pipeline where a TP feeds an RDB, is the way to do this to define a manual upd function like:
```q
upd:{[tableName;data]
  / upsert only the tables we want (e.g. wantedTables:`trade`quote); drop everything else
  if[tableName in wantedTables;
    tableName upsert data]
  }
```
and, in this case, does that mean we're forced to accept all data into the process before filtering?
For tables, the subscription function .tp.subscribe[tables] means the process must accept all data sent to it.
This approach greatly reduces TP CPU usage, because less work happens within the system: data is serialised only once and then dispatched (multicast) to all subscribers, rather than being serialised separately for each subscriber. When an RTE is set up with a subset of symbols, the filtering on those symbols begins as soon as the data first enters the RTE. This functionality supports RDB sharding.
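As a minimal sketch of symbol-level filtering (the filter list and the handler shape are illustrative assumptions, not Refinery API), an RTE upd handler could discard unwanted symbols as the data arrives:

```q
/ Illustrative only: symFilter and this handler shape are assumptions.
symFilter:`AAPL`MSFT

upd:{[t;x]
  / keep only rows whose sym is in the wanted subset
  x:select from x where sym in symFilter;
  if[count x; t upsert x]
  }
```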
Data integrity¶
Is there a risk to data integrity when a process fails?¶
Data loss¶
There is no risk of data loss when a single side failure occurs in a hot-hot system. The queries are re-routed from the failed process to one of the N-1 processes in an instant, resulting in no loss of data. Once the original process is back up and running again after a --force-start, the missing data can be copied from the re-routed process to the original process through the use of datacopy.
Duplicate data¶
When a query is sent, it targets only one of each process type (HDB/RDB/etc.) in the data lifecycle, so duplicate data can't be returned by querying both 0.HDB and 1.HDB. When repairing processes through datacopy (realigning the data on the failed process with the rest of the processes), the entire day's data is replaced, ensuring no duplicate data is on disk.
How does the PDB manage datacopy?¶
The PDB applies datacopy after the EOD of the affected day, which copies the HDB date partition and replaces everything that has already been written. Datacopy cannot be applied to intraday data.
Process failure¶
What is the failover time?¶
Failover time is the time it takes to switch the routing of a query to one of the N-1 remaining processes after the primary process fails. This time varies from process to process, but for data processes (RDB/IDB/HDB) the transition is effectively instant, seamless, and automatic. The user won't notice when failover occurs.
The only time this might be an issue is when a query is sent to a process just before the process dies. This is extremely unlikely and results in only a single failed query; the next query is routed to another process and returns the desired information.
Through the use of user layer software above the gateway level (for example, HA proxy), a single entrypoint and top-level failover layer can be achieved. Should an entire host or cluster node fail, the re-routing to another gateway would be seamless.
What is the routing of processes through a hot-hot system?¶
Queries are routed based on the primary routing state of the process; by default this is instance 0.
Time window to avoid data outage when upgrading processes¶
To avoid data outages when upgrading your system, upgrade only one instance at a time; this way when one instance is down the other(s) will take its place and avoid any possibilities of a data outage.
What happens if the Messenger Server dies (specifically for C# client)¶
For the initial startup of Refinery, the Messenger Server (MS) is required, but if the MS fails after the initial startup, the publishing of data is not affected. The MS can be restarted when necessary. The MS shouldn't die unexpectedly, as it only allows Delta Messaging clients to find each other and matches publishers with subscribers; unlike the RDB, it does not store rows upon rows of data.
Note
To restart the MS, run the following CLI command: `refinery workflow --start REFINERY_CORE_A`
Recovery¶
What is failback?¶
Failback is the restarting of a failed process and restoring it to its original system state. It is run manually through CLI commands. It is designed this way so that process failures are acknowledged and the system's administration team can investigate the root cause, preventing the failures from happening again.
Intraday recovery of the RDB/IDB using the TP logs¶
The RDB and IDB can be recovered automatically on restart by enabling this setting in the pipeline YAML file. Even though this feature exists, it is not the recommended approach. If an RDB has failed, it is most likely due to high volumes of data, and restarting the RDB through TP log recovery will cause an error in the pipeline.
To access the data stored in the failed RDB, call the CLI command refinery pipeline --intraday-write <pipeline> --instance <number> to flush the data that is stored in memory in the IPDB to disk so that the IDB has the most recent batch of data.
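As an illustration of the flush command above, it might be invoked as follows; the pipeline name and instance number here are hypothetical placeholders:

```
# Flush the in-memory IPDB data to disk for instance 0 of the
# (hypothetical) "market" pipeline, so the IDB has the latest batch.
refinery pipeline --intraday-write market --instance 0
```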
The IDB doesn't need recovery as it is just looking at the data that has been written down already. As long as the TP and PDB are still running, there won't be a data outage when the IDB is restarted. An IDB is the preferred process when working with high memory intraday volumes.
Intraday recovery times¶
IDB¶
An IDB takes about 1-2 seconds to recover, which is the amount of time that the process requires to boot up via CLI.
RDB¶
The time for an RDB to recover depends on the amount of data in the RDB at the time of failure. If the failure was caused by too much data, restarting the RDB won't help, because the same volume of data will be stored again. To get around this, decrease the interval at which the IPDB writes data from memory to disk. This keeps the maximum row count in the RDB consistently low.
In a clustered/hot-hot system, what happens when RDB.0 and HDB.1 go down?¶
In a case like this, the system re-routes queries through RDB.1 while keeping the flow through HDB.0, which is the default. Failover is handled on a per-process basis.
How do pipelines function if the process manager fails?¶
The Process Manager is only required for booting up processes. Once the processes are online, the Process Manager sits idle until it is needed to restart a failed process. This means the Process Manager can be restarted with no effect on the system.
Is there a way to recover the HDB if the EPDB goes down mid-EOD write down?¶
The EPDB (End-of-day Persisting Database) is responsible for moving intraday partitions and sorting to the HDB. There are no data writes involved with the EPDB, so there won't be any data lost if it fails mid-EOD writedown.
Gateways¶
For KX dashboards, what is the role of the refinery-gw-client in a clustered system and how do they perform load balancing?¶
For querying data in dashboards, the data source connection selected is the Refinery-gw-client. This allows APIs to be called on the gateways that access the data tables. If multiple gw-clients are set up, the Refinery-gw-clients run as a pool of processes: if a connection to one gateway fails, a different gateway is selected via round-robin. The gateway clients can be stopped and started as needed with no effect on dashboard queries, as long as one gateway client is still running.
The pooled Refinery-gw-clients provide scalability and ensure that queries are always returned. When building an application, the application should target the entrypoint gateways and can be run with as many gateways as required. HAProxy or in-built logic can be used to handle failure within the pooled gateways. This functionality is built on the top layer of the application.