Alerts¶
The Monitoring Alerting component lets you send alerts in real time via email and Slack. Alerts are configured by:
- Defining a threshold to track against a metric
- Creating an alert notification template
Install and configure the Monitoring Alerting component before defining alerts.
By default, the following alerts are automatically enabled after installation:
- ProcessOffline: process disconnect
- ProcessNotMonitored: process not registering with the Monitoring Daemon
By consuming the metrics generated from Refinery into your monitoring system, you can use defined thresholds to build alerting. All the values can be customized as required for your application.
Recommended Refinery alerts¶
Common metrics¶
| Metric name | Recommended alert | Threshold range |
|---|---|---|
| ipc.pendingHandleBytes | anyPendingBytes is true | Additional alerting on the bytes per handle |
| process.queryTimeout | timeoutSecs is 0 | Any other timeout value that is erroneous or if the value changes |
| process.connectionOpen | None (Informational) | |
| process.connectionClosed | None (Informational) | |
| process.exit | Every occurrence | |
| process.userAuthFailure | Every occurrence if GW or QR | |
| mem.kdbUsageBytes | heap is 75% of mphy | Available memory on the server |
| process.ping | remoteReceiveTime - daemonSendTime is greater than 1 second | Any time-span value |
| system.timeSync | timeSync is false | |
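As a sketch, two of the rows above could be expressed with the threshold format described under "Defining an alert threshold". The alert names `PendingBytes` and `TimeOutOfSync` are hypothetical, and the example assumes that `anyPendingBytes` and `timeSync` are published as booleans.

```q
/ Assumption: both metric keys hold boolean values
/ Alert names are hypothetical; choose names that suit your application
.alrt.cfg.thresholds[`PendingBytes]:(`ipc.pendingHandleBytes; `anyPendingBytes; =; 1b)
.alrt.cfg.thresholds[`TimeOutOfSync]:(`system.timeSync; `timeSync; =; 0b)
```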
Process-level alerts¶
| Process type | Metric name | Recommended alert | Threshold range |
|---|---|---|---|
| TP | tp.updDayCount | 3 successive updates remain the same and data is expected (30 seconds of no capture) | Update sent every 10 seconds, minimum check is 2 successive updates |
| RDB | consumer.updDayCount | 3 successive updates remain the same and data is expected (30 seconds of no capture) | Update sent every 10 seconds, minimum check is 2 successive updates |
| RDB | rdb.eodFlush | If complete is false | |
| CTP/RTE | consumer.updDayCount | If 3 successive updates remain the same and data is expected (30 seconds of no capture) | Update sent every 10 seconds, minimum check is 2 successive updates |
| HDB | hdb.availableDates | If latest is not yesterday's date | Use all to validate all expected dates |
| HDB | hdb.latestDateRowCounts | If latest is a business day, rowCount.*table* is non-zero | |
| PDB | pdb.rollover | If state does not follow the flow: started -> reloading -> complete | |
| GW | gw.queryStatus | If success is false | |
| QR | qr.queryDispatch | None (Informational) | |
| QR | qr.queryResult | If success is false | |
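The boolean rows in this table map naturally onto the built-in `=` comparison described under "Defining an alert threshold". The following is a sketch only: the alert names are hypothetical, and it assumes that `complete` and `success` are published as booleans.

```q
/ Assumption: complete and success are boolean keys of their metrics
/ Alert names are hypothetical
.alrt.cfg.thresholds[`RdbEodFlushIncomplete]:(`rdb.eodFlush; `complete; =; 0b)
.alrt.cfg.thresholds[`GatewayQueryFailed]:(`gw.queryStatus; `success; =; 0b)
.alrt.cfg.thresholds[`RouterQueryFailed]:(`qr.queryResult; `success; =; 0b)
```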
Recommended system information alerts¶
Server-level alerts¶
| Metric name | Recommended alert | Threshold range |
|---|---|---|
| system.cpu | | |
| system.fs | | |
| system.kernel | | |
| system.load | | |
| system.mem | | |
| system.netproto | | |
| system.nic | | |
| system.process | | |
| system.sys | | |
Process-level alerts¶
| Metric name | Recommended alert | Threshold range |
|---|---|---|
| process.cpu | | |
| process.io | | |
| process.mem | | |
Defining an alert threshold¶
A threshold is the primary configuration of the Alerting component. It specifies when alerts should be generated. All alert thresholds are defined within the .alrt.cfg.thresholds configuration table. This table has the following columns:
- alertName: name of the alert
- metricName: name of the metric to build the alert against
- triggerOn: key within the metric for simple comparison functions, or the null symbol for complex ones
- comparison: comparison function to run: a built-in kdb+ function, or a simple or complex custom function
- threshold: value which causes the alert to be fired
For custom comparison functions, look at the threshold template file:
/etc/kx-refinery-monitoring-api-alrt/alrt-templates/comparisons/alrt.template.q
Built-in simple comparison example¶
The simplest threshold can be set using a built-in q function. Using =, an alert gets fired when any process has a process query timeout of 0.
.alrt.cfg.thresholds[`ProcessTimeout]:(`process.queryTimeout; `timeoutSecs; =; 0)
Custom simple comparison example¶
Define custom comparison functions within:
/etc/kx-refinery-monitoring-api-alrt/alrt-templates/comparisons/
A simple comparison function requires triggerOn and threshold to be set in the threshold configuration. The function is passed two arguments, threshold and currentValue, and returns a boolean indicating whether the alert should be fired. Using this method we can implement the same check as with the built-in q function:
/ alrt.examples.q
.alrt.comp.customEq:{[threshold; currentValue] threshold = currentValue }
/ Configuration
.alrt.cfg.thresholds[`ProcessTimeoutCustom]:
  (`process.queryTimeout; `timeoutSecs; `.alrt.comp.customEq; 0)
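Because a simple comparison function is an ordinary two-argument q function, it can be checked in a plain q session before it is wired into the configuration. The sketch below uses a hypothetical function `.alrt.comp.exceeds` and a hypothetical alert name; the threshold of 300 seconds is illustrative only.

```q
/ Hypothetical comparison: fire when the current value exceeds the threshold
.alrt.comp.exceeds:{[threshold; currentValue] currentValue > threshold }

/ Quick checks in a plain q session (no Alerting component needed)
.alrt.comp.exceeds[300; 301]   / 1b: above the threshold, the alert would fire
.alrt.comp.exceeds[300; 300]   / 0b: not above the threshold

/ Configuration (inside the Alerting process, where .alrt.cfg.thresholds exists):
/ .alrt.cfg.thresholds[`ProcessTimeoutTooHigh]:
/   (`process.queryTimeout; `timeoutSecs; `.alrt.comp.exceeds; 300)
```

Keeping the function free of any dependency on the Alerting component makes it easy to test in isolation.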
Custom complex comparison example¶
Define custom comparison functions within:
/etc/kx-refinery-monitoring-api-alrt/alrt-templates/comparisons/
A complex comparison function has access to all data provided by the metric (triggerOn must be null) and does not necessarily need a threshold. The function is passed a dictionary argument (alertDict) with three keys:
- alertName: the name of the current alert
- threshold: the configured threshold value
- metric: all data provided by the metric
The function returns a dictionary with the following keys:
- alertActive: boolean indicating whether the alert should be fired
- triggerValue: optional value showing the calculated value that breached the threshold
Using this method, we can build an alert that fires when the kdb+ process memory breaches a threshold percentage:
/ alrt.examples.q
.alrt.comp.memUsage:{[alertDict]
  / heap as a percentage of physical memory (mphy)
  memPct:100 * (%). alertDict[`metric]`heap`mphy;
  `alertActive`triggerValue!(memPct > alertDict`threshold; memPct) }
/ Configuration
.alrt.cfg.thresholds[`ProcessMemory]:
  (`mem.kdbUsageBytes; `; `.alrt.comp.memUsage; 50)
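The same pattern can cover the recommended process.ping alert, where the value of interest is the difference between two keys of the metric. This is a minimal sketch rather than a shipped function: `.alrt.comp.pingLatency` and the alert name are hypothetical, and it assumes that `remoteReceiveTime` and `daemonSendTime` are timestamps, so that their difference is a timespan. The mock dictionary mimics the alertDict structure described above so the function can be tried in a plain q session.

```q
/ Hypothetical complex comparison: ping latency above a timespan threshold
/ Assumption: remoteReceiveTime and daemonSendTime are timestamps
.alrt.comp.pingLatency:{[alertDict]
  m:alertDict`metric;
  latency:m[`remoteReceiveTime] - m`daemonSendTime;
  `alertActive`triggerValue!(latency > alertDict`threshold; latency) }

/ Mock input with the three documented keys: alertName, threshold, metric
mockMetric:`remoteReceiveTime`daemonSendTime!(2024.01.01D10:00:02.500000000; 2024.01.01D10:00:00.000000000)
mockAlert:`alertName`threshold`metric!(`ProcessPingLatency; 0D00:00:01; mockMetric)

/ 2.5 seconds of latency against a 1 second threshold: alertActive is 1b
res:.alrt.comp.pingLatency mockAlert

/ Configuration (inside the Alerting process, where .alrt.cfg.thresholds exists):
/ .alrt.cfg.thresholds[`ProcessPingLatency]:
/   (`process.ping; `; `.alrt.comp.pingLatency; 0D00:00:01)
```

Returning the latency as triggerValue means the value that breached the threshold is available to the notification.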
Creating an alert notification template¶
Alert notification templates can be created per alert name. There is a default notification template for any alerts that are fired that do not have an associated template.
When creating a new notification template:
- it must live within /etc/kx-refinery-monitoring-api-alrt/alrt-templates/notifications
- the file name must be *alert-name*.alert.yml
- it must have subject and body YAML tags
Copy the default notification template default.alert.yml and modify it.
As listed in the default template, a number of variables can be defined to be substituted when the alert is fired:
- ~~ALERT_NAME~~: name of the alert that was fired
- ~~ALERT_DATA_COUNT~~: number of values in the alert data
- ~~ALERT_DATE_TIME~~: date and time the alert was generated
- ~~ALERT_DETAIL~~: detail of the alert in table form
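To illustrate what placeholder substitution means for a template line, the sketch below replaces `~~NAME~~` markers in a string using q's `ssr`. It is not the Alerting component's implementation: `fillTemplate`, the sample subject line, and the sample values are all hypothetical and serve only to show the effect of the variables listed above.

```q
/ Hypothetical helper: replace each ~~NAME~~ placeholder with its value
/ repl is a dictionary of symbol names to replacement strings
fillTemplate:{[template; repl]
  {[text; name; val] ssr[text; "~~",string[name],"~~"; val]}/[template; key repl; value repl] }

/ Sample subject line and values (illustrative only)
subject:"~~ALERT_NAME~~ fired at ~~ALERT_DATE_TIME~~"
repl:`ALERT_NAME`ALERT_DATE_TIME!("ProcessMemory"; "2024.01.01D10:00:00")

/ Result: "ProcessMemory fired at 2024.01.01D10:00:00"
filled:fillTemplate[subject; repl]
```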
Other related configuration¶
Alongside the main threshold configuration table (.alrt.cfg.thresholds), other configuration items may be useful:
| Configuration | Function |
|---|---|
| .alrt.cfg.notificationInterval | Determines the frequency of alerts. For each alert, there will be at most one alert notification (if triggered) in this specified interval. |
| .alrt.cfg.subjectPrefix | All notifications are prefixed with this string. |
| .alrt.cfg.generateAlertOnAlertFailure | If set to true and any threshold comparison function fails, an AlertExecutionFailure alert is generated. |
| .alrt.cfg.generateAlertOnStaleMetrics | If set to true and any timer-based metric goes stale, a StaleMetric alert is generated. |
| Specific to email | |
| .alrt.email.cfg.sendAsUser | The from email address |
| .alrt.email.cfg.style | The CSS formatting style sent with each notification email |
| Specific to Slack | |
| .alrt.slack.cfg.sendAsUser | For Slack notifications, this configuration defines who the message is posted from |
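A sketch of how some of these items might be set is shown below. The value types are assumptions inferred from the descriptions above (a string for the prefix, booleans for the two flags), and the prefix text is illustrative; check the installed configuration for the actual defaults and types before overriding them.

```q
/ Assumption: the prefix is a string and the two flags are booleans
.alrt.cfg.subjectPrefix:"[PROD] "
.alrt.cfg.generateAlertOnAlertFailure:1b
.alrt.cfg.generateAlertOnStaleMetrics:1b
```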
Manually reloading alert thresholds¶
If any alert thresholds are manually modified in the process while it is running, call .alrt.reloadThresholds[] to activate them.
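For example, raising the ProcessMemory threshold from the earlier example in a running process might look like the sketch below. The new value of 75 follows the recommended alert for mem.kdbUsageBytes and is otherwise illustrative.

```q
/ Modify the threshold in the running process (75 is an illustrative value)
.alrt.cfg.thresholds[`ProcessMemory]:
  (`mem.kdbUsageBytes; `; `.alrt.comp.memUsage; 75)

/ Activate the modified thresholds
.alrt.reloadThresholds[]
```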