Replication

The Control cluster replicates its internal state to every node in the cluster. All state changes made via the public interfaces (Web UI, Process API, etc.) are automatically persisted to a transaction log and streamed to followers in real time. Each change is tracked by a UID. When a process starts, it connects to the leader and compares its own UID with that of the cluster. If it has fallen behind, it re-syncs with the leader and becomes a follower.

The cluster is usually set up via the install script, which writes a CSV file of cluster details. This file is located at ${DELTA_CONFIG}/failover.csv and lists one host and port per line, for example:

servera.domain.com,5000
serverb.domain.com,5000
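
As a rough illustration of the file format shown above, the following Python sketch reads host and port pairs from CSV text. It assumes the file has no header row and exactly two columns, as in the example; the names ClusterNode and parse_failover_csv are hypothetical and not part of the product.

```python
import csv
import io
from typing import List, NamedTuple


class ClusterNode(NamedTuple):
    """One row of failover.csv (hypothetical helper type)."""
    host: str
    port: int


def parse_failover_csv(text: str) -> List[ClusterNode]:
    """Parse 'host,port' rows, skipping blank lines.

    Assumes no header row and two columns per row.
    """
    nodes = []
    for row in csv.reader(io.StringIO(text)):
        if not row:  # csv.reader yields [] for blank lines
            continue
        nodes.append(ClusterNode(row[0].strip(), int(row[1])))
    return nodes


sample = "servera.domain.com,5000\nserverb.domain.com,5000\n"
nodes = parse_failover_csv(sample)
for node in nodes:
    print(node.host, node.port)
```

To read the real file, pass in the contents of ${DELTA_CONFIG}/failover.csv instead of the sample string.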

Start-up

When starting the Control cluster, it's important to start all processes at the same time. This ensures the cluster is fully in sync before further changes are made. The example below illustrates why this is important.

1. At the previous shutdown, the followers were all shut down first.
2. The leader was temporarily left running and state changes occurred.
3. This process is now further ahead than the rest of the cluster.
4. On the next start-up, this process is not started at the same time as the others.
5. The other processes start and elect a leader, but are missing state.
6. When the previous leader starts later, there are two possibilities:
   a. The cluster has progressed beyond the previous leader. In this case the process joins as a follower and all state changes previously made to it are lost.
   b. The cluster is still behind the previous leader. In this case the process refuses to start because its UID is higher than that of the current leader.

In the latter case, it is possible to force the process to start and accept any resulting data loss. Enable this using the setting below in the delta.profile. With this mode enabled, the process forces its start-up and demotes the other processes to follower status.

DELTACONTROL_FOLLOWER_STARTUPOVERWRITE=YES
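
The start-up outcomes described above can be summarised in a small Python sketch. This is only a model of the documented behaviour, not the product's implementation: it assumes UIDs can be compared as increasing integers, treats equal UIDs as "in sync, join as follower", and uses the hypothetical names StartupOutcome and decide_startup.

```python
from enum import Enum


class StartupOutcome(Enum):
    JOIN_AS_FOLLOWER = "join as follower"
    REFUSE_TO_START = "refuse to start"
    FORCE_START = "force start and demote others"


def decide_startup(local_uid: int, leader_uid: int,
                   startup_overwrite: bool = False) -> StartupOutcome:
    """Model the outcome for a process that starts after a leader is elected.

    startup_overwrite stands in for DELTACONTROL_FOLLOWER_STARTUPOVERWRITE=YES.
    UIDs are modelled as integers (an assumption for illustration).
    """
    # The cluster is level with or ahead of this process: re-sync and follow.
    if local_uid <= leader_uid:
        return StartupOutcome.JOIN_AS_FOLLOWER
    # This process is ahead of the elected leader.
    if startup_overwrite:
        return StartupOutcome.FORCE_START
    return StartupOutcome.REFUSE_TO_START


print(decide_startup(local_uid=10, leader_uid=12).value)
print(decide_startup(local_uid=15, leader_uid=12).value)
print(decide_startup(local_uid=15, leader_uid=12, startup_overwrite=True).value)
```

The three calls correspond, in order, to joining as a follower, refusing to start, and forcing start-up with the overwrite setting enabled.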

You can configure the time allowed for a process to determine the leader during start-up using the setting below. If the leader has not been resolved after this period, the process becomes the leader.

DELTACONTROL_FAILOVER_TIMEOUT=60 # Default - 60 seconds
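
To make the timeout behaviour concrete, here is a minimal Python sketch of a leader-resolution loop. It is an illustration under assumptions: the function resolve_role, the find_leader callback, and the one-second polling interval are all hypothetical, and the real process may resolve the leader quite differently.

```python
import time
from typing import Callable, Optional


def resolve_role(find_leader: Callable[[], Optional[str]],
                 timeout_s: float = 60.0,
                 poll_s: float = 1.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Return the leader's address, or "self" if none is found in time.

    timeout_s stands in for DELTACONTROL_FAILOVER_TIMEOUT (seconds).
    find_leader is a hypothetical probe returning None when no leader is known.
    """
    deadline = clock() + timeout_s
    while True:
        leader = find_leader()
        if leader is not None:
            return leader
        if clock() >= deadline:
            # No leader resolved within the allowed period: become leader.
            return "self"
        sleep(poll_s)


# A leader is found on the first probe, so its address is returned.
print(resolve_role(lambda: "servera.domain.com:5000"))
```

The clock and sleep parameters are injectable only so the loop can be exercised without real waiting.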

When connecting to a leader, processes attempt to sync by replaying replication logs. To set a retry interval and timeout, follow the example below. By default, no retry is configured.

DELTACONTROL_SYNCRETRY_INTERVAL=500 # Default - 500 milliseconds
DELTACONTROL_SYNCRETRY_TIMEOUT=0 # Default - 0 milliseconds
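
The interaction between the retry interval and the timeout can be sketched in Python as follows. This is one plausible reading of the settings rather than the actual implementation: it assumes that a timeout of 0 means a single attempt with no retry, and the names sync_with_retry and try_sync are hypothetical.

```python
import time
from typing import Callable


def sync_with_retry(try_sync: Callable[[], bool],
                    interval_ms: int = 500,
                    timeout_ms: int = 0,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Attempt a sync, retrying every interval_ms until timeout_ms is used up.

    interval_ms and timeout_ms stand in for DELTACONTROL_SYNCRETRY_INTERVAL
    and DELTACONTROL_SYNCRETRY_TIMEOUT. With timeout_ms=0 only one attempt
    is made (assumed meaning of "no retry").
    """
    if interval_ms <= 0:
        raise ValueError("interval_ms must be positive")
    waited_ms = 0
    while True:
        if try_sync():
            return True
        # Stop if another wait would exceed the total time allowed.
        if waited_ms + interval_ms > timeout_ms:
            return False
        sleep(interval_ms / 1000.0)
        waited_ms += interval_ms


attempts = []


def flaky_sync() -> bool:
    """Hypothetical sync that succeeds on the third attempt."""
    attempts.append(1)
    return len(attempts) >= 3


# Defaults: one attempt only, so the flaky sync fails.
print(sync_with_retry(flaky_sync, sleep=lambda s: None))
```

Under this reading, raising the timeout to at least twice the interval would allow the third attempt in the example to run.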

Always follower mode

For disaster recovery (DR) purposes, there is often a requirement to have processes in the cluster that only subscribe to state changes and never become the leader. An example might be having two processes running in a separate data centre in case of an outage. The network configuration or latency would make it infeasible for them to ever become leader, but they can act as a backup at a separate site. To enable this for a Control process, set the following in the delta.profile:

DELTACONTROL_ALWAYS_FOLLOWER=YES