Alerts

Monitoring Alerting components let you send alerts in real time via email and Slack. Alerts are configured by:

  1. Defining a threshold to track against a metric
  2. Creating an alert notification template

Install and configure the Monitoring Alerting component before defining alerts.

By default, the following alerts are enabled automatically after installation:

ProcessOffline        Process disconnect
ProcessNotMonitored   Process not registering with Monitoring Daemon

If you consume the metrics that Refinery generates in your own monitoring system, you can build alerting there using the recommended thresholds listed under Available metrics. All the values can be customized as required for your application.

Available metrics

Common metrics

Metric name               Recommended alert                                            Threshold range
ipc.pendingHandleBytes    anyPendingBytes is true                                      Additional alerting on the bytes per handle
process.queryTimeout      timeoutSecs is 0                                             Any other timeout value that is erroneous or if the value changes
process.connectionOpen    None (Informational)
process.connectionClosed  None (Informational)
process.exit              Every occurrence
process.userAuthFailure   Every occurrence if GW or QR
mem.kdbUsageBytes         heap is 75% of mphy                                          Available memory on the server
process.ping              remoteReceiveTime - daemonSendTime is greater than 1 second  Any time-span value
system.timeSync           timeSync is false

Process-level alerts

Process type  Metric name              Recommended alert                                                                        Threshold range
TP            tp.updDayCount           3 successive updates remain the same and data is expected (30 seconds of no capture)     Update sent every 10 seconds, minimum check is 2 successive updates
RDB           consumer.updDayCount     3 successive updates remain the same and data is expected (30 seconds of no capture)     Update sent every 10 seconds, minimum check is 2 successive updates
RDB           rdb.eodFlush             If complete is false
CTP/RTE       consumer.updDayCount     If 3 successive updates remain the same and data is expected (30 seconds of no capture)  Update sent every 10 seconds, minimum check is 2 successive updates
HDB           hdb.availableDates       If latest is not yesterday's date                                                        Use all to validate all expected dates
HDB           hdb.latestDateRowCounts  If latest is a business day, rowCount.*table* is non-zero
PDB           pdb.rollover             If state does not follow the flow: started -> reloading -> complete
GW            gw.queryStatus           If success is false
QR            qr.queryDispatch         None (Informational)
QR            qr.queryResult           If success is false

Server-level alerts

Metric name Recommended alert Threshold range
system.cpu
system.fs
system.kernel
system.load
system.mem
system.netproto
system.nic
system.process
system.sys

Process-level alerts

Metric name Recommended alert Threshold range
process.cpu
process.io
process.mem

Defining an alert threshold

A threshold is the primary configuration of the Alerting component. It specifies when alerts should be generated. All alert thresholds are defined within the .alrt.cfg.thresholds configuration table. This table has the following columns:

alertName      name of the alert
metricName     name of the metric to build the alert against
triggerOn      key within the metric for simple comparison functions, or the null symbol for complex ones
comparison     comparison function to run: a built-in kdb+ function, or a simple or complex custom function
threshold      value which causes the alert to be fired

For custom comparison functions, look at the threshold template file:

/etc/kx-refinery-monitoring-api-alrt/alrt-templates/comparisons/alrt.template.q

Built-in simple comparison example

The simplest threshold uses a built-in q function as the comparison. Using =, an alert is fired when any process reports a query timeout of 0.

.alrt.cfg.thresholds[`ProcessTimeout]:(`process.queryTimeout; `timeoutSecs; =; 0)
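
The same pattern applies to other boolean-valued metrics listed under Available metrics. A minimal sketch, assuming system.timeSync exposes a timeSync key and gw.queryStatus exposes a success key as those tables suggest; the alert names TimeSyncLost and GatewayQueryFailed are hypothetical:

```q
/ Hypothetical alert names; metric names and keys are taken from the metric tables
/ Fire when a server reports that its clock is not synchronised
.alrt.cfg.thresholds[`TimeSyncLost]:(`system.timeSync; `timeSync; =; 0b)

/ Fire when a gateway query fails
.alrt.cfg.thresholds[`GatewayQueryFailed]:(`gw.queryStatus; `success; =; 0b)
```

Because = is symmetric, these definitions do not depend on the order in which the threshold and the current value are passed to the comparison.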

Custom simple comparison example

Define custom comparison functions within:

/etc/kx-refinery-monitoring-api-alrt/alrt-templates/comparisons/

A simple comparison function requires both triggerOn and threshold to be set in the threshold configuration. The function is passed two arguments, threshold and currentValue, and returns a boolean indicating whether the alert should be fired. Using this method, we can implement the same check as with the built-in q function:

/ alrt.examples.q
.alrt.comp.customEq:{[threshold; currentValue] threshold = currentValue }

/ Configuration
.alrt.cfg.thresholds[`ProcessTimeoutCustom]:
    (`process.queryTimeout; `timeoutSecs; `.alrt.comp.customEq; 0)
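
A custom function is also a convenient place to make the argument order explicit for comparisons that are not symmetric. The sketch below is illustrative only: the function name .alrt.comp.customGt, the alert name ProcessTimeoutTooLong, and the 300-second limit are assumptions, not part of the product.

```q
/ alrt.examples.q (hypothetical addition)
/ Fire when the current value is strictly greater than the threshold
.alrt.comp.customGt:{[threshold; currentValue] currentValue > threshold }

/ Configuration: alert when a process reports a query timeout above 300 seconds
.alrt.cfg.thresholds[`ProcessTimeoutTooLong]:
    (`process.queryTimeout; `timeoutSecs; `.alrt.comp.customGt; 300)
```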

Custom complex comparison example

Define custom comparison functions within:

/etc/kx-refinery-monitoring-api-alrt/alrt-templates/comparisons/

A complex comparison function has access to all data provided by the metric (triggerOn must be null), and it does not have to use a threshold. The function is passed a single dictionary argument (alertDict) with three keys:

alertName    The name of the current alert
threshold    The configured threshold value
metric       All data provided by the metric

The function returns a dictionary with the following keys:

alertActive    boolean: whether the alert should be fired
triggerValue   optional value to show the calculated value that breached the threshold

Using this method, we can build an alert that fires when the kdb+ process memory breaches a threshold percentage:

/ alrt.examples.q
.alrt.comp.memUsage:{[alertDict]
  memPct:100 * (%). alertDict[`metric]`heap`mphy;
  `alertActive`triggerValue!(memPct > alertDict`threshold; memPct) }

/ Configuration
.alrt.cfg.thresholds[`ProcessMemory]:
    (`mem.kdbUsageBytes; `; `.alrt.comp.memUsage; 50)
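
The process.ping recommendation in the common metrics table can be expressed the same way. This is a sketch, assuming the metric dictionary carries daemonSendTime and remoteReceiveTime as temporal values whose difference is a timespan; the function name .alrt.comp.pingLatency and the alert name SlowPing are hypothetical:

```q
/ alrt.examples.q (hypothetical addition)
.alrt.comp.pingLatency:{[alertDict]
  m:alertDict`metric;
  / Assumed: both keys hold timestamps, so the difference is a timespan
  latency:m[`remoteReceiveTime] - m`daemonSendTime;
  `alertActive`triggerValue!(latency > alertDict`threshold; latency) }

/ Configuration: fire when the ping takes longer than 1 second
.alrt.cfg.thresholds[`SlowPing]:
    (`process.ping; `; `.alrt.comp.pingLatency; 0D00:00:01)

/ Local check with hand-built data: a 2-second gap against a 1-second threshold sets alertActive to 1b
.alrt.comp.pingLatency `alertName`threshold`metric!(`SlowPing; 0D00:00:01;
    `daemonSendTime`remoteReceiveTime!2024.01.01D00:00:00 2024.01.01D00:00:02)
```

Returning the computed latency as triggerValue means the notification can show how far the threshold was breached rather than only that it was.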

Creating an alert notification template

Alert notification templates are created per alert name. Any alert that is fired without an associated template uses the default notification template.

When creating a new notification template:

  • it must live within /etc/kx-refinery-monitoring-api-alrt/alrt-templates/notifications
  • the file name must be *alert-name*.alert.yml
  • it must have subject and body YAML tags

Copy the default notification template default.alert.yml and modify it.

As listed in the default template, the following variables are substituted when the alert is fired:

~~ALERT_NAME~~         name of the alert that was fired
~~ALERT_DATA_COUNT~~   number of values in the alert data
~~ALERT_DATE_TIME~~    date and time the alert was generated
~~ALERT_DETAIL~~       detail of the alert in table form
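
As an illustration, a template for an alert named ProcessTimeout might look like the sketch below. Only the subject and body tags and the four variables come from this page; the file name follows the *alert-name*.alert.yml rule, while the wording and the use of a YAML block scalar for the body are assumptions, so compare against default.alert.yml before using it.

```yaml
# ProcessTimeout.alert.yml (hypothetical example; start from a copy of default.alert.yml)
subject: "~~ALERT_NAME~~ fired at ~~ALERT_DATE_TIME~~"
body: |
  Alert ~~ALERT_NAME~~ was generated at ~~ALERT_DATE_TIME~~.
  Number of values in the alert data: ~~ALERT_DATA_COUNT~~

  ~~ALERT_DETAIL~~
```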

Alongside the main threshold configuration table (.alrt.cfg.thresholds), other configuration items may be useful:

Configuration                          Function
.alrt.cfg.notificationInterval         Determines the frequency of alerts. For each alert, at most one notification is sent (if triggered) in this interval.
.alrt.cfg.subjectPrefix                All notifications are prefixed with this string.
.alrt.cfg.generateAlertOnAlertFailure  If set to true and any threshold comparison function fails, an AlertExecutionFailure alert is generated.
.alrt.cfg.generateAlertOnStaleMetrics  If set to true and any timer-based metric goes stale, a StaleMetric alert is generated.

Specific to email

.alrt.email.cfg.sendAsUser             The from email address
.alrt.email.cfg.style                  The CSS formatting style sent with each notification email

Specific to Slack

.alrt.slack.cfg.sendAsUser             Defines who the Slack message is posted from
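
A hedged sketch of overriding some of these items in q. The value types are assumptions inferred from the descriptions (booleans for the two flags, a string for the prefix), and the prefix text is a made-up example; check the installed defaults before relying on them:

```q
/ Assumed types: boolean flags and a string prefix
.alrt.cfg.generateAlertOnAlertFailure:1b
.alrt.cfg.generateAlertOnStaleMetrics:1b
.alrt.cfg.subjectPrefix:"[PROD] "
```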

Manually reloading alert thresholds

If you manually modify any alert thresholds in a running process, call .alrt.reloadThresholds[] to activate the changes.
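
For example, to raise the ProcessMemory threshold from the complex comparison example to the 75% recommended for mem.kdbUsageBytes and activate it without a restart (a sketch, assuming it runs inside the alerting process with .alrt.comp.memUsage already loaded):

```q
/ Update the threshold in the running process
.alrt.cfg.thresholds[`ProcessMemory]:
    (`mem.kdbUsageBytes; `; `.alrt.comp.memUsage; 75)

/ Activate the modified thresholds
.alrt.reloadThresholds[]
```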