Process recovery guide¶
Setting up a robust system with failover¶
This guide walks through setting up a basic working two-host hot-hot system.
Note
A hot-hot system can be made up of N hosts and N instances, but for this demonstration only 2 are used.
There are a few steps to follow to achieve this:
1. Have access to 2 hosts¶
2. Have the same Refinery packages set up on both hosts¶
This must be set up by the onboarding developer. When configuring your Refinery settings, make sure the following config is set correctly:
auto-configure-instance-hostname-a = aaa.host.com
delta-control-master-hostname = aaa.host.com
delta-control-slave-hostname = bbb.host.com
auto-configure-instance-hostname-b = bbb.host.com
...
delta-control-clustering = 1
3. Set up a clustered deployment system¶
The system YAML requires a slight change. The hostnames of the two hosts are added to the layout section (primary-server & secondary-server) of the system YAML.
system:
layout:
-
name: primary-server
nodes:
-
host: aaa.host.com
-
name: secondary-server
nodes:
-
host: bbb.host.com
default-cpu-taskset: 0-256
data-hierarchy:
- region
- data-source
- data-class
- sub-class
delta-messaging-server: DS_MESSAGING_SERVER:refinery_a
timezone: UTC
time-sort: false
Next, the pipeline YAML requires that you add the primary and secondary servers to the proc-layout. This tells each process which server it runs on. Individual process instances can be split independently between the different servers.
pipeline:
name: "DemoPipeline"
type: "realtime"
expose-to-gw: true
proc-layout:
# Example of how process instances can be split across servers
# -
# tp.0: primary-server
# tp.1: secondary-server
# hdb.0: primary-server
# hdb.1: secondary-server
-
all: primary-server
-
all: secondary-server
taxonomy:
region: test
data-source: demo
processes:
tp:
pub-mode: timer
pub-freq-ms: 100
log-to-journal: true
rollover-mode: daily-at-time
rollover-time: "00:00:00.001"
port: 41221
enable-analyst: true
rdb:
port: 41222
timeout: 30
enable-analyst: true
hdb:
port: 41223
timeout: 30
enable-analyst: true
ipdb:
port: 41224
write-freq: 10000
write-row-limit: 0
enable-analyst: true
idb:
port: 41225
timeout: 0
enable-analyst: true
epdb:
timeout: 0
enable-analyst: true
Make sure a table is defined so the data can be stored (see example below).
table:
name: DemoTable
id-col: sym
time-col: time
intra-persist-type: splay
end-persist-type: date-partition
taxonomy:
-
region: global
data-source: demo
columns:
-
name: time
data-type: timestamp
attribute: sorted
-
name: sym
data-type: symbol
attribute: grouped
-
name: price
data-type: float
-
name: volume
data-type: long
Note
Make sure that these changes are applied across both hosts.
4. Start up Refinery¶
Step 1 - start Control¶
Start up Delta Control on the primary server first and then run it on the secondary server.
refinery application --start-control
Check the deltaControl.log file to confirm the two Controls have found each other.
refinery logs --view --process DeltaControl
Step 2 - start Control daemon¶
Start up the Delta Control daemon on the primary server first and then run it on the secondary server.
refinery application --start-daemon
Step 3 - start Process Manager¶
Start up Process Manager on the primary server only. The Process Manager is in charge of running all the pipelines and workflows across both servers.
refinery process-manager --start --wait
Step 4 - start core workflows¶
Start up the core workflows. Because this is a clustered deployment, 2 instances of each workflow are started, one per server (A & B).
refinery workflow --start REFINERY_CORE_A
refinery workflow --start REFINERY_ENTRYPOINT_0_a
refinery workflow --start REFINERY_CORE_B
refinery workflow --start REFINERY_ENTRYPOINT_0_b
Step 5 - start pipelines¶
Start up the pipelines, starting with the default entrypoint pipeline and then the demo pipeline.
refinery pipeline --start DefaultEntrypoint
refinery pipeline --start DemoPipeline
Step 6 - start gateway client¶
Lastly, start the gateway client to complete the Refinery setup.
refinery service-class --start-template refinery-gw-client
5. Publishing data¶
Now that the clustered deployment system is up and running, it's time to publish some data to the system. For this you will need to create a q publisher script. Using the DemoTable that is already deployed to Refinery, the following q script opens a connection to the Tickerplant (TP) on each server and publishes dummy data to the table.
Note
The script below publishes randomly generated data at 1-second intervals to both TPs.
// open handles to the Tickerplant on each host (port 41221 from the pipeline config)
tp1: hopen `:aaa.host.com:41221;
tp2: hopen `:bbb.host.com:41221;
i:0;      // number of batches published so far
n:1000;   // rows per batch
// format the published row count for the log message, e.g. hrf 100000 -> "100,000"
hrf:{reverse "," sv 3 cut reverse string x}
// generate a dummy batch of n rows matching the DemoTable schema
// (volume is left as long to match the table's volume column type)
gen:{[]
  t: flip `time`sym`price`volume!(n#.z.p-1D;n?`$/:.Q.a;`float$n#i;n?1000);
  i+::1;
  t
  };
// timer callback: publish the same batch to both TPs
.z.ts:{
  tab: gen[];
  tp1(`upd;`DemoTable;tab);
  tp2(`upd;`DemoTable;tab);
  show"published number ",string[i]," - total ",hrf[i*n]," rows";
  }
// fire the timer every 1000 ms
\t 1000
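Once enough data has been published, the timer can be disabled from the same q session. A minimal sketch, using the tp1 and tp2 handles opened above:

```q
/ set the timer interval back to 0, stopping publication
\t 0
/ close the connections to both Tickerplants
hclose each (tp1;tp2);
```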
After the dummy data has been sent to the TPs, it is published on to the RDB. As the data streams in, the IPDB writes batches of it to disk (the IDB). The data remains in the IDB until end of day (EOD), when the EPDB sorts it, applies attributes, and moves it to the HDB.
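To confirm the data is flowing end to end, you can query the RDB directly from a q session. This is a minimal sketch, assuming the RDB on the primary host accepts synchronous queries on its configured port (41222 in the pipeline YAML above):

```q
/ open a handle to the primary RDB (port 41222 from the pipeline config)
rdb: hopen `:aaa.host.com:41222;
/ count today's rows per sym in DemoTable
show rdb"select cnt:count i by sym from DemoTable";
/ close the handle when done
hclose rdb;
```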
This completes a basic Refinery system set up with hot-hot recovery enabled, along with a demonstration of how to send data through Refinery.
Failover recovery¶
Failover is the re-routing of queries when a failure occurs in one of the processes on the primary route. The primary routing state is registered to instance 0 by default and can be observed by running the CLI command:
refinery failover --status --pipeline DemoPipeline
When a process in the primary routing state is killed or fails, the automatic failover operation takes action. On noticing the failed process, the system re-routes queries through the secondary instance of that process.
Failed / killed process¶
| Log message type | Specific process name | Process | Message details |
|---|---|---|---|
| WARN | DefaultEntrypoint.0.gw.0-2684 **** 0 | gw | Active downstream process has disconnected [ Process: DemoPipeline.0.rdb.0 ] |
Failover occurring¶
| Log message type | Specific process name | Process | Message details |
|---|---|---|---|
| INFO | DefaultEntrypoint.0.gw.0-2684 **** 0 | gw | Attempting auto-failover to new process instance [ Process: DemoPipeline.0.rdb.0 ] [ Pipeline: DemoPipeline ] [ Instance: 0 ] [ New: 1 ] |
| INFO | DefaultEntrypoint.0.gw.0-2684 **** 0 | gw | Validating new instance process is available [ Process Name: DemoPipeline.1.rdb.0 ] |
| INFO | DefaultEntrypoint.0.gw.0-2684 **** 0 | gw | Updating process primary configuration [ Source: DemoPipeline.0.rdb.0 ] [ New: DemoPipeline.1.rdb.0 ] |
Restarting processes after they've failed¶
To restart a process after it has failed, use Refinery's --force-start CLI command.
refinery pipeline --force-start DemoPipeline --instance N
Note
For this guide, the primary routing path is set to --instance 0.
Once the process has been restarted, it is not automatically re-routed back into the primary routing state. This can be seen clearly in the failover status table, where the primary column no longer says yes for DemoPipeline.0.rdb.0 but instead says yes for DemoPipeline.1.rdb.0.
[ DefaultEntrypoint.0.gw.0 ] Primary Routing State:
processName pipeline pipelineInstance primary registered busy busySince
-------------------- ------------ ---------------- ------- ---------- ---- ---------
DemoPipeline.0.rdb.0 DemoPipeline 0 no no no
DemoPipeline.0.hdb.0 DemoPipeline 0 yes yes no
DemoPipeline.0.idb.0 DemoPipeline 0 yes yes no
DemoPipeline.1.rdb.0 DemoPipeline 1 yes yes no
DemoPipeline.1.hdb.0 DemoPipeline 1 no no no
DemoPipeline.1.idb.0 DemoPipeline 1 no no no
To re-route the primary process back into the primary routing path, run the following failover CLI command:
refinery failover --failover --pipeline DemoPipeline --to-instance 0
The re-routing back to the original primary processes is confirmed in the failover status table seen below:
[ DefaultEntrypoint.0.gw.0 ] Primary Routing State:
processName pipeline pipelineInstance primary registered busy busySince
-------------------- ------------ ---------------- ------- ---------- ---- ---------
DemoPipeline.0.rdb.0 DemoPipeline 0 yes yes no
DemoPipeline.0.hdb.0 DemoPipeline 0 yes yes no
DemoPipeline.0.idb.0 DemoPipeline 0 yes yes no
DemoPipeline.1.rdb.0 DemoPipeline 1 no yes no
DemoPipeline.1.hdb.0 DemoPipeline 1 no no no
DemoPipeline.1.idb.0 DemoPipeline 1 no no no
Note
This process recovery guide can be applied to any process that fails.