Failover and replication
The Control process is the core component of the Kx Platform. It is responsible for running processes and for storing configuration, code, and so on. To provide resilience in the event of external failures, it runs as a cluster of multiple nodes, usually split across servers. The cluster processes maintain long-lived connections to each other and elect a master, which is responsible for handling client requests and applying state changes. This section describes the various aspects of the Control cluster.
The long-lived connections that the cluster processes maintain to each other are used to determine the health of the other nodes. If one of the processes were to die, the other nodes would detect the dropped connection and trigger a failover. However, this doesn't give a full picture of a process's health. A process could become unresponsive without dropping its handle, and in the case of a network or server failure, the handle isn't dropped for a period of time.
Heartbeats between the processes help to alleviate this by timing out and disconnecting unresponsive nodes.
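The timeout logic can be illustrated with a minimal Python sketch. It is not the Platform's implementation: the `HeartbeatMonitor` class, the node names, and the five-second timeout are all hypothetical, chosen only to show how a node whose connection is still open can be treated as failed once its heartbeats stop arriving.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical helper: tracks the last heartbeat time seen from each peer."""

    timeout_seconds: float
    last_seen: dict[str, float] = field(default_factory=dict)

    def record(self, node: str, now: float) -> None:
        # Called whenever a heartbeat arrives from a peer.
        self.last_seen[node] = now

    def expired(self, now: float) -> list[str]:
        # Peers whose most recent heartbeat is older than the timeout.
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )

    def disconnect_expired(self, now: float) -> list[str]:
        # Stop tracking unresponsive peers, as if their handles had been closed.
        dead = self.expired(now)
        for node in dead:
            del self.last_seen[node]
        return dead


# Assumed values for illustration only.
monitor = HeartbeatMonitor(timeout_seconds=5.0)
monitor.record("control_a", now=0.0)
monitor.record("control_b", now=0.0)
monitor.record("control_a", now=4.0)  # control_b has gone quiet

dropped = monitor.disconnect_expired(now=6.0)
print(dropped)  # ['control_b']
```

In this sketch `control_b` never closed its connection, yet it is dropped because no heartbeat arrived within the timeout. That is the case a plain connection check would miss.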
The Control cluster needs to replicate its internal state to all nodes in the cluster. All state changes made via the public interfaces (Web UI, Process API, etc.) are automatically persisted to a transaction log and streamed to the slaves in real time. Each change is tracked by a UID. When a process starts, it connects to the master and compares its own UID with that of the cluster. If it has fallen behind, it re-syncs with the master and becomes a slave.
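The replication flow can be sketched in Python as follows. This is a simplified model rather than the Platform's implementation: it assumes the UID is a monotonically increasing integer, and the `Node` and `Master` classes, their methods, and the change strings are hypothetical. It does not cover a joining node whose log has diverged from the master's.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical cluster node holding a transaction log of (uid, change) pairs."""

    name: str
    log: list = field(default_factory=list)

    @property
    def uid(self) -> int:
        # UID of the latest applied change; 0 means nothing has been applied yet.
        return self.log[-1][0] if self.log else 0


@dataclass
class Master(Node):
    slaves: list = field(default_factory=list)

    def apply(self, change: str) -> int:
        # Persist the change to the master's log, then stream it to every slave.
        entry = (self.uid + 1, change)
        self.log.append(entry)
        for slave in self.slaves:
            slave.log.append(entry)
        return entry[0]

    def join(self, node: Node) -> int:
        # Compare UIDs and replay whatever the joining node has missed.
        missing = [entry for entry in self.log if entry[0] > node.uid]
        node.log.extend(missing)
        self.slaves.append(node)
        return len(missing)


master = Master("control_a")
master.apply("add process")
master.apply("update config")

late = Node("control_b")       # starts after two changes were made
replayed = master.join(late)   # re-sync: replays the two missed changes
master.apply("deploy package") # streamed to control_b as it happens

print(replayed, late.uid, master.uid)  # 2 3 3
```

After the join, both nodes report the same UID, which is the condition the startup comparison checks for.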
The internal state of the cluster is backed up to disk for resiliency. The persistence section goes into detail on how and where the data is stored, and on recovery in the case of corruption.
There are a couple of considerations when deploying packages to the Control cluster.