RT Hard Reset
A hard reset removes all messages from the RT stream to allow the reliable transport to be restarted with an empty stream.
Warning
This removes all messages from the stream and can result in duplicate or lost data. We recommend you only use this feature if advised to do so by KX Support.
Hard Reset Workflow
The API to execute the Hard Reset can be configured; however, the default behavior is as follows:
- The following directories on the PVCs are deleted: `/s/in/` and `/s/state/`
- The `/s/out/` logs are emptied rather than deleted. This ensures that all messages sent prior to the reset are eventually removed from the stream and from each subscriber.
- The session number of the output merged log files is incremented, i.e. from `log.x.y` to `log.x+1.y`
Note
The session number is maintained in a config map, which is managed outside of the PVCs.
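The session-number roll can be illustrated with a short sketch. The `roll_session` helper below is hypothetical, not part of RT; it only demonstrates the `log.<session>.<sequence>` naming described above:

```python
def roll_session(log_name: str) -> str:
    """Given a merged log file name of the form log.<session>.<sequence>,
    return the name the stream rolls to after a hard reset: the session
    number is incremented and the sequence number restarts at 0."""
    prefix, session, _sequence = log_name.split(".")
    if prefix != "log":
        raise ValueError(f"unexpected log file name: {log_name}")
    return f"log.{int(session) + 1}.0"

print(roll_session("log.0.100"))  # log.1.0
```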
The sequence of events during the Hard Reset is as follows:
- All RT services are stopped, except the supervisor.
- The contents of the `/s/state/` directory are deleted.
- The contents of the `/s/in/` directory are deleted.
- The contents of the `/s/out/` directory are pruned. See the note below.
- The supervisor is taken down; this is picked up by Kubernetes, and the supervisor and its child processes are restarted.
Pruning `/s/out/` directory logs
The effect of pruning the out logs is that, after the reset, each subscriber's corresponding logs are also pruned. This ensures that all messages sent prior to the reset are eventually removed from the stream and from each subscriber.
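The delete-versus-prune distinction can be sketched on a local directory tree standing in for the `/s/in/`, `/s/state/` and `/s/out/` PVC paths. The `hard_reset` helper below is illustrative only, not RT's implementation:

```python
import os
import tempfile

def hard_reset(stream_dir: str) -> None:
    """Sketch of the default hard-reset behaviour: the contents of in/
    and state/ are deleted outright, while each log under out/ is
    truncated (pruned) to zero length, so the file itself survives but
    the messages it held are gone."""
    for sub in ("in", "state"):
        d = os.path.join(stream_dir, sub)
        for name in os.listdir(d):
            os.remove(os.path.join(d, name))
    out = os.path.join(stream_dir, "out")
    for name in os.listdir(out):
        with open(os.path.join(out, name), "w"):
            pass  # opening in "w" mode truncates the file in place

# demo on a throwaway directory tree
root = tempfile.mkdtemp()
for sub in ("in", "state", "out"):
    os.makedirs(os.path.join(root, sub))
    with open(os.path.join(root, sub, "log.0.100"), "w") as f:
        f.write("messages")

hard_reset(root)
print(os.listdir(os.path.join(root, "in")))                      # []
print(os.path.getsize(os.path.join(root, "out", "log.0.100")))   # 0
```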
Consequences of a Hard Reset
Duplicate data
RT keeps track of what data it has processed to avoid sending duplicate data to a subscriber. When executing the hard reset, RT loses knowledge of what it has processed. Therefore, after carrying out a hard reset, RT will have no knowledge of what data has been replicated/merged from a publisher.
If any data resides on the publisher, it will be (re)replicated back into the RT cluster and published on to the subscriber.
To avoid duplicate data, take the following steps in order:
- Take down the publisher
- Remove the client log files
- Execute the hard reset
Warning
Taking down the publisher may result in data loss.
Data loss
Data loss can become a factor if the steps listed in the duplicate data section above are followed.
Besides that, data loss can itself be the reason a hard reset is needed. For example, if an internal RT log becomes corrupted, a hard reset is required to return the system to a healthy state. The hard reset is not the cause of the data loss in this scenario, but the resolution to a corrupted log. This is similar to a tickerplant log becoming corrupted: data can be recovered up to the point at which the log file was corrupted.
Triggering a Hard Reset
You can call an API on the RT supervisor process running in your Kubernetes cluster to trigger a hard reset. This only needs to be performed on one of the nodes in the RT cluster; RT automatically manages the orchestration of the hard reset across the other nodes.
Reset
Call the `hardReset` REST endpoint against port 6000 of one of the RT nodes in an RT cluster:
GET hardReset/
Warning
RT can only handle one reset request at a time. If the RT pod is in the middle of executing a reset command, it responds to an HTTP request with a status of 503 (Service Unavailable).
Status
The `resetStatus` REST request can be called against port 6000 of any of the RT nodes in an RT cluster and returns the reset status of that node. This allows you to identify whether the RT node is in the middle of a hard reset, or whether it is available for one.
GET resetStatus/
Status
The status returned by the `resetStatus` call is `UP` during normal operation. However, once a hard reset is requested, it cycles through a set of operations, updating its state each time. The hard reset is complete when the returned value reverts back to `UP`.
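A caller can poll `resetStatus` until the value returns to `UP`. A minimal sketch, where `get_status` stands in for issuing the HTTP request against port 6000, and the intermediate state names are made up for illustration:

```python
import time

def wait_for_reset(get_status, poll_secs: float = 0.0, timeout: float = 60.0) -> bool:
    """Poll until the node has left "UP" (reset underway) and then
    returned to "UP" (reset complete). get_status is any callable
    returning the node's current status string; against a real node it
    would issue GET resetStatus, with poll_secs set to a second or so."""
    deadline = time.monotonic() + timeout
    left_up = False
    while time.monotonic() < deadline:
        status = get_status()
        if status != "UP":
            left_up = True       # reset in progress
        elif left_up:
            return True          # back to UP: reset complete
        time.sleep(poll_secs)
    return False                 # timed out

# simulate a node cycling through reset states
states = iter(["UP", "Resetting", "Pruning", "UP"])
print(wait_for_reset(lambda: next(states)))  # True
```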
Example execution
To use either API, you must port-forward port 6000 of one of the applicable RT pods to your localhost. The examples below assume remote port 6000 has been port-forwarded to localhost port 6000.
Successful call
```
curl http://127.0.0.1:6000/hardReset
{"hostname":"rt-sdk-sample-assembly-north-2","status":"Resetting"}
```
Unsuccessful call
```
curl http://127.0.0.1:6000/hardReset
{"hostname":"rt-sdk-sample-assembly-north-2","status":"Not available for Reset"}
```
Reset Status
```
curl http://127.0.0.1:6000/resetStatus
{"hostname":"rt-sdk-sample-assembly-north-0","status":"UP"}
```
Effect on a subscriber
When a hard reset occurs, the session number is incremented and the subscriber will be notified of the hard reset occurrence via an event message.
Consider a stream of messages being written to `log.0.100`. The subscriber would be receiving messages from this latest log file, `log.0.100`. Upon a hard reset, the RT log directories are deleted, and the session number and latest RT log file roll to `log.1.0`. The rt.qpk detects the session number increment and triggers a re-subscription to the new log file, e.g. moving the subscription from `log.0.100` to `log.1.0`.
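The re-subscription decision can be sketched as follows. The `next_subscription` helper is hypothetical, not the rt.qpk implementation; it only illustrates reacting to a session-number increment:

```python
def next_subscription(current_log: str, available_logs: list[str]) -> str:
    """If a log file with a higher session number than the current one
    exists, move the subscription to the start of the new session;
    otherwise stay on the current log. Names follow the
    log.<session>.<sequence> pattern."""
    def session(name: str) -> int:
        return int(name.split(".")[1])

    newest = max(session(n) for n in available_logs)
    if newest > session(current_log):
        return f"log.{newest}.0"
    return current_log

print(next_subscription("log.0.100", ["log.0.100", "log.1.0"]))  # log.1.0
```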