Fault tolerance¶
On a single node deployment¶
A single node deployment is a Refinery deployment usually installed on a single host. This means that there is no duplication or redundancy in data ingestion. The table below lists the recommended action in the event of a failure on a single node deployment:
| Failed process | Effect | Recommended user action |
|---|---|---|
| RDB dies | Cannot query real-time data from this process | refinery pipeline --force-start pipeline |
| HDB dies | Cannot query historical data | refinery pipeline --force-start pipeline |
| TP dies | Data flow interrupted, RDB data compromised | refinery pipeline --force-start pipeline |
| Entrypoint dies | Cannot query system | refinery pipeline --force-start pipeline |
| Entire host dies | Total system shutdown | Complete steps for system start-up |
On a clustered deployment¶
A clustered deployment, also referred to as hot-hot, is a system that has duplicated pipelines, normally on separate hosts. This provides redundancy in the system. In a clustered deployment, if the primary control process goes down, the secondary control process is elected leader. Within the KX environment, the failover of the control processes is seamless.
On HDB/RDB failover, the results returned will be based on the contents/state of the secondary host. It is therefore necessary to publish to both primary and secondary hosts.
In general:
Processes can be restarted using refinery pipeline --force-start *pipeline*. The --force-start option boots any processes in a pipeline that are not currently running. A --start command on a partially running pipeline will not start any currently offline processes.
Pipelines can be rerouted to another host using refinery failover --failover --pipeline *pipeline-name* --to-instance *instance-number*.
A data copy can be triggered using refinery datacopy --copy --pipeline *pipeline-name* --instance *instance-number* --date *date*.
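The commands above can be wrapped in small shell functions so that they are easy to review before running. This is a minimal sketch, not part of Refinery: the function names and the `REFINERY_DRY_RUN` variable are hypothetical helpers, and the pipeline name, instance number, and date you pass in are your own values. With dry run enabled (the default here), each command is printed rather than executed.

```shell
#!/bin/sh
# Sketch only: thin wrappers around the commands described above.
# REFINERY_DRY_RUN and the function names are hypothetical helpers,
# not part of the refinery CLI.

REFINERY_DRY_RUN="${REFINERY_DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$REFINERY_DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Boot any processes in the pipeline ($1) that are not currently running.
restart_pipeline() {
    run refinery pipeline --force-start "$1"
}

# Reroute a pipeline ($1) to another instance ($2).
failover_pipeline() {
    run refinery failover --failover --pipeline "$1" --to-instance "$2"
}

# Trigger a data copy for a pipeline ($1), instance ($2), and date ($3).
copy_data() {
    run refinery datacopy --copy --pipeline "$1" --instance "$2" --date "$3"
}
```

For example, `restart_pipeline pipeline` prints `refinery pipeline --force-start pipeline` in dry-run mode. Setting `REFINERY_DRY_RUN=0` makes the wrappers execute the commands instead.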
Process level failure scenarios and actions¶
Pipeline.0.rdb.0 dies¶
Real-time data cannot be queried from this process.
Automatic response¶
The entrypoint routes queries to pipeline.n.rdb.0.
Recommended user action¶
Restart pipeline.0.rdb.0 using refinery pipeline --force-start pipeline
Pipeline.0.hdb.0 dies¶
Historical data cannot be queried via this HDB.
Automatic response¶
If HDBs are clustered, routing continues to pipeline.0.hdb.n. If the system is clustered, routing is swapped to pipeline.1.hdb.n.
Recommended user action¶
If routing has swapped to pipeline.1.hdb.0, restart pipeline.0.hdb.0 using refinery pipeline --force-start pipeline and revert routing to pipeline.0.hdb.0 with refinery failover --failover.
Pipeline.0.pdb.0 dies¶
Data is not written to disk intra-day
Automatic response¶
None
Recommended user action¶
Restart pipeline.0.pdb.0 using refinery pipeline --force-start pipeline. Trigger data copy at EOD using refinery datacopy --copy.
Pipeline.0.tp.0 dies¶
Data flow is interrupted and pipeline.0.rdb.0 data is compromised
Automatic response¶
Routing is swapped to pipeline.1.rdb.0, and the pipeline.0.hdb.0 copy comes from host_n.
Recommended user action¶
Restart pipeline.0.tp.0 using refinery pipeline --force-start pipeline. Revert routing to pipeline.0.rdb.0 using refinery failover --failover. Trigger datacopy at EOD using refinery datacopy --copy.
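The recovery steps for a failed TP can be collected into a short runbook. The sketch below only prints the commands in order for an operator to review; nothing is executed. It assumes the full flag forms listed under "In general". The helper name `tp_recovery_plan` is hypothetical, and this section does not state whether `--instance` on the data copy names the source or the destination, so that value is passed separately and should be checked before running.

```shell
#!/bin/sh
# Sketch: print the recovery steps for a failed TP, in order.
# tp_recovery_plan is a hypothetical helper; it executes nothing.
#   $1 = pipeline name
#   $2 = instance to route back to
#   $3 = instance to pass to the data copy (meaning not stated here)
#   $4 = date to copy
tp_recovery_plan() {
    pipeline="$1"
    route_instance="$2"
    copy_instance="$3"
    copy_date="$4"

    # 1. Boot the processes that are down, including the TP.
    echo "refinery pipeline --force-start $pipeline"
    # 2. Revert routing to the restored instance.
    echo "refinery failover --failover --pipeline $pipeline --to-instance $route_instance"
    # 3. At EOD, trigger the data copy.
    echo "refinery datacopy --copy --pipeline $pipeline --instance $copy_instance --date $copy_date"
}
```

Calling `tp_recovery_plan` with your pipeline name, instance numbers, and date prints the three commands, one per line, in the order they should be run.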
DefaultEntrypoint.0.gw.0 dies¶
System cannot be queried via this gateway
Automatic response¶
The client application should query DefaultEntrypoint.0.gw.N or DefaultEntrypoint.N.gw.0, that is, either the pool of gateways on the same machine or the pool of gateways on the other cluster.
Recommended user action¶
Restart DefaultEntrypoint.0.gw.0 using refinery pipeline --force-start DefaultEntrypoint.
The client application should include logic to select one of the available entrypoint gateways, so that it can swap back at any point once a gateway is restarted.
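One way to structure this client-side logic is an ordered list of candidate gateways and a probe that picks the first one that responds. The following is a minimal sketch under stated assumptions: the `host:port` address form, the function names, and the `nc`-based probe are illustrative and not part of Refinery. Replace `probe` with whatever health check your client already uses.

```shell
#!/bin/sh
# Sketch: pick the first reachable entrypoint gateway from an ordered list.
# The address format and the probe implementation are assumptions.

# Return 0 if the gateway at host:port ($1) accepts a TCP connection.
# Assumes a netcat (nc) that supports -z and -w; swap in your own check.
probe() {
    nc -z -w 2 "${1%:*}" "${1#*:}" >/dev/null 2>&1
}

# Print the first candidate for which probe succeeds.
# Returns 1 if no candidate responds.
pick_gateway() {
    for gw in "$@"; do
        if probe "$gw"; then
            echo "$gw"
            return 0
        fi
    done
    return 1
}
```

Listing the preferred gateway first means that each call to `pick_gateway` returns to it as soon as it has been restarted, and falls through to the remaining gateways while it is down.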
Pipeline.0.idb.0 dies¶
Intraday on-disk data cannot be queried.
Automatic response¶
Routing is swapped to pipeline.1.idb.0
Recommended user action¶
Restart pipeline.0.idb.0 using refinery pipeline --force-start pipeline. Revert routing to pipeline.0.idb.0 using refinery failover --failover.
Entire host dies¶
No data capture or data querying possible
Automatic response¶
The entire system is routed to host_n
Recommended user action¶
Complete steps for system start-up.
Revert routing to pipeline.0 using refinery failover --failover. Trigger data copy at EOD using refinery datacopy --copy.