Fault tolerance

On a single node deployment

A single-node deployment is a Refinery deployment that is usually installed on a single host. This means that there is no duplication or redundancy in data ingestion. See the table below for the recommended action in the event of a failure on a single-node deployment:

| Failed process | Effect | Recommended user action |
| --- | --- | --- |
| RDB dies | Cannot query real-time data from this process | refinery pipeline --force-start *pipeline* |
| HDB dies | Cannot query historical data | refinery pipeline --force-start *pipeline* |
| TP dies | Data flow interrupted, RDB data compromised | refinery pipeline --force-start *pipeline* |
| Entrypoint dies | Cannot query system | refinery pipeline --force-start *pipeline* |
| Entire host dies | Total system shutdown | Complete steps for system start-up |

On a clustered deployment

A clustered deployment is a system that has duplicated pipelines, normally on separate hosts, which provides redundancy in the system. This configuration is also referred to as hot-hot. If the primary control process goes down in a clustered deployment, the secondary control process is elected leader. Within the KX environment, the failover of the control processes is seamless.

On HDB/RDB failover, the results returned will be based on the contents/state of the secondary host. It is therefore necessary to publish to both primary and secondary hosts.
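The requirement to publish to both hosts can be sketched as a simple fan-out. This is a minimal illustration, not a Refinery API: the `send` callable and the host names are hypothetical stand-ins for whatever publishing mechanism the feed uses. The sketch attempts every host even if one fails, so that a dead primary does not stop data from reaching the secondary.

```python
from typing import Any, Callable, Dict, Iterable


def publish_to_all(
    hosts: Iterable[str],
    send: Callable[[str, Any], None],  # hypothetical publisher: send(host, message)
    message: Any,
) -> Dict[str, Exception]:
    """Publish one message to every host and return the failures by host."""
    failures: Dict[str, Exception] = {}
    for host in hosts:
        try:
            send(host, message)
        except Exception as exc:
            # Record the failure and keep going so the other host still gets the data.
            failures[host] = exc
    return failures
```

An empty return value means both hosts received the message. A non-empty one identifies the host whose data may need a later data copy.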

In general:

Processes can be restarted using refinery pipeline --force-start *pipeline*. The --force-start option boots any processes in a pipeline that are not currently running. By contrast, a --start command on a partially running pipeline does not start the processes that are offline.

Pipelines can be rerouted to another host using refinery failover --failover --pipeline *pipeline-name* --to-instance *instance-number*.

A data copy can be triggered using refinery datacopy --copy --pipeline *pipeline-name* --instance *instance-number* --date *date*.
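For scripted recovery, the three commands above can be assembled as argument lists. The sketch below only builds the command lines from the options quoted in this section and does not run anything. The helper names are illustrative, and the date is passed through as an opaque string because the expected date format is not stated here.

```python
import shlex
from typing import List


def force_start_cmd(pipeline: str) -> List[str]:
    """Boot any processes in the pipeline that are not currently running."""
    return ["refinery", "pipeline", "--force-start", pipeline]


def failover_cmd(pipeline: str, to_instance: int) -> List[str]:
    """Reroute a pipeline to the given instance."""
    return [
        "refinery", "failover", "--failover",
        "--pipeline", pipeline,
        "--to-instance", str(to_instance),
    ]


def datacopy_cmd(pipeline: str, instance: int, date: str) -> List[str]:
    """Trigger a data copy; `date` is passed through unchanged (format assumed)."""
    return [
        "refinery", "datacopy", "--copy",
        "--pipeline", pipeline,
        "--instance", str(instance),
        "--date", date,
    ]


def as_shell(argv: List[str]) -> str:
    """Render an argument list as a shell-quoted string for logging or review."""
    return shlex.join(argv)
```

Building lists rather than strings means the result can be handed to `subprocess.run(argv, check=True)` without shell quoting issues once the commands have been reviewed.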

Process level failure scenarios and actions

Pipeline.0.rdb.0 dies

Real-time data cannot be queried from this process.

Automatic response

Queries are routed to pipeline.n.rdb.0 by the entrypoint.

Restart pipeline.0.rdb.0 using refinery pipeline --force-start pipeline

Pipeline.0.hdb.0 dies

Historical data cannot be queried via this HDB.

Automatic response

If HDBs are clustered, routing continues to pipeline.0.hdb.n. If the system is clustered, routing is swapped to pipeline.1.hdb.n.

If routing has swapped to pipeline.1.hdb.0, restart pipeline.0.hdb.0 using refinery pipeline --force-start pipeline and revert routing to pipeline.0.hdb.0 with refinery failover --failover.

Pipeline.0.pdb.0 dies

Data is not written to disk intraday.

Automatic response

None

Restart pipeline.0.pdb.0 using refinery pipeline --force-start pipeline. Trigger data copy at EOD using refinery datacopy --copy.

Pipeline.0.tp.0 dies

Data flow is interrupted and pipeline.0.rdb.0 data is compromised.

Automatic response

Routing is swapped to pipeline.1.rdb.0, and the data copy for pipeline.0.hdb.0 comes from host_n.

Restart pipeline.0.tp.0 using refinery pipeline --force-start pipeline. Revert routing to pipeline.0.rdb.0 using refinery failover --failover. Trigger data copy at EOD using refinery datacopy --copy.

DefaultEntrypoint.0.gw.0 dies

The system cannot be queried via this gateway.

Automatic response

The client application should query DefaultEntrypoint.0.gw.N or DefaultEntrypoint.N.gw.0, that is, either another gateway in the pool on the same machine or a gateway in the pool on the other cluster.

Restart DefaultEntrypoint.0.gw.0 using refinery pipeline --force-start DefaultEntrypoint.

The user's client application should include logic to use any one of the entrypoint gateways, so that it can swap back at any point once a gateway has been restarted.
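The client-side gateway logic described above might look like the following sketch. It is an illustration under assumptions: `run_query` is a hypothetical callable that connects to the named gateway and raises `ConnectionError` when the gateway is unreachable, and the gateway names in the usage note are placeholders.

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


def query_with_failover(
    gateways: Sequence[str],
    run_query: Callable[[str], T],  # hypothetical: runs the query against one gateway
) -> T:
    """Try each gateway in preference order and return the first successful result."""
    last_error: Optional[Exception] = None
    for gateway in gateways:
        try:
            return run_query(gateway)
        except ConnectionError as exc:
            # This gateway is down; fall through to the next one in the list.
            last_error = exc
    raise ConnectionError(f"all gateways failed: {list(gateways)}") from last_error
```

Because every call starts again from the first entry, a list such as `["DefaultEntrypoint.0.gw.0", "DefaultEntrypoint.0.gw.1", "DefaultEntrypoint.1.gw.0"]` returns to the preferred gateway as soon as it has been restarted, with no explicit swap-back step.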

Pipeline.0.idb.0 dies

Intraday on-disk data cannot be queried.

Automatic response

Routing is swapped to pipeline.1.idb.0

Restart pipeline.0.idb.0 using refinery pipeline --force-start pipeline. Revert routing to pipeline.0.idb.0 using refinery failover --failover.

Entire host dies

No data capture or data querying is possible.

Automatic response

The entire system is routed to host_n

Complete steps for system start-up.

Revert routing to pipeline.0 using refinery failover --failover. Trigger data copy at EOD using refinery datacopy --copy.
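The process-level scenarios above can be condensed into a lookup from process type to the ordered manual steps. The sketch below is one possible encoding of this section and is not part of Refinery. It assumes process names follow the four-part pattern seen in the examples (name, instance, type, index), which is inferred from names such as pipeline.0.tp.0 rather than stated.

```python
from typing import Dict, List

# Manual recovery steps per process type, in order, as listed in this section.
RECOVERY_STEPS: Dict[str, List[str]] = {
    "rdb": ["force-start pipeline"],
    "hdb": ["force-start pipeline", "revert routing with failover (if routing swapped)"],
    "pdb": ["force-start pipeline", "trigger data copy at EOD"],
    "tp": ["force-start pipeline", "revert routing with failover", "trigger data copy at EOD"],
    "idb": ["force-start pipeline", "revert routing with failover"],
    "gw": ["force-start DefaultEntrypoint"],
}


def recovery_steps(process_name: str) -> List[str]:
    """Return the manual steps for a failed process such as 'pipeline.0.tp.0'."""
    parts = process_name.split(".")
    if len(parts) != 4:
        # Assumed layout: <name>.<instance>.<type>.<index>
        raise ValueError(f"unexpected process name: {process_name!r}")
    process_type = parts[2].lower()
    if process_type not in RECOVERY_STEPS:
        raise KeyError(f"no recovery steps recorded for type {process_type!r}")
    return list(RECOVERY_STEPS[process_type])
```

A failure of the entire host is not covered by the lookup because it is not a single process: it requires the full system start-up steps first, followed by reverting routing and triggering a data copy at EOD.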