kdb Insights Enterprise alerts and notifications
kdb Insights Enterprise provides a set of pre-configured alerts to help you monitor and maintain the health of your kdb Insights Enterprise activity on Azure Marketplace. This guide explains how to enable or disable those alerts, and how to adjust their notifications and thresholds, depending on your needs.
Getting started
- Go to Azure Homepage and click Monitor
- On the left-hand panel, click on Alerts
- On the top panel, click Alert rules
The Alert rules section lists all the pre-packaged alerts that kdb Insights Enterprise provides. For the Customer Managed Plan, these alerts are pre-configured with the thresholds recommended by kdb Insights Enterprise, but they are disabled by default because there is a cost associated with this Azure service. You decide which ones to enable.
The sections below explain how to edit the status, notification type, and threshold value of any alert you choose to enable.
Default Alerts
For the kdb Insights Enterprise KX Managed App Plan, these Alerts are enabled by default and are required by the kdb Insights Enterprise Managed Service Team to provide Support. Customers should not edit or disable Alerts in a kdb Insights Enterprise Managed deployment.
Alerts deployed with the system
The following alerts, with their triggering thresholds, are deployed automatically with kdb Insights Enterprise.
name | description | default threshold |
---|---|---|
quota limit | Creates the service limit or quota request for the specified resource. | no default |
create or update logger | Creates a logger or Updates an existing logger. | no default |
delete logger | Deletes the specified logger. | no default |
CPU Percentage | Aggregated average CPU utilization measured in percentage across the cluster. | Two separate alerts, one threshold ≥ 80% and another one for threshold ≥ 90% |
disk Percentage | Disk space used in percentage by device. | Two separate alerts, one threshold ≥ 80% and another one for threshold ≥ 90% |
failed pod | Count of failed pods by controller, namespace, node and phase. | if > 0 |
OOM killed containers | Count of OOM (Out of Memory) killed containers by controller and Kubernetes namespace. | if > 0 |
EOD process complete | Completion of End of Day process. This is an informative alert. | if ≥ 1 |
PostgreSQL container failed | PostgreSQL container which supports Keycloak is no longer running. | if = 1 |
keycloak pod down | Pod responsible for Keycloak has failed. | if = 1 |
crash loop back off detected | A CrashLoopBackOff has been detected (a pod is failing to restart successfully). | if ≥ 1 |
rook ceph percentage | Aggregated average rook-ceph utilization measured in percentage across the cluster. Only deployed if rook-ceph is deployed. | Two separate alerts, one threshold if > 80% and another one for threshold > 90% |
rook ceph MB | Amount of free space in rook ceph (MB) across the cluster. Only deployed if rook-ceph is deployed. | if < 2000 MB |
rook ceph health status | Ceph health status metric: healthy, warning, error. Only deployed if rook-ceph is deployed. If this metric returns something different from 1 (healthy), the cluster is having critical issues which must be investigated. | if ≠ healthy |
invalid access request | There have been more than 3 failed access attempts in the past 10 minutes. | if > 3 |
kdb stream processor failure | The Stream Processor component has failed or been manually stopped. | if = 1 |
node not in Ready state | A node appears to not be in a ready state. | if ≥ 1 |
pod in unknown state | A pod state has not been obtained. | if ≥ 1 |
rt container down | RT container has either failed or been manually stopped. | if = 1 |
storage manager failure | The Storage Manager container has failed or been manually stopped. | if = 1 |
non RT pv percentage | A PV connected to one or multiple pods has surpassed the threshold value from its total capacity. Specific to non-RT PVs. | Two separate alerts, one threshold if > 60% and another one for threshold > 80% |
RT pv percentage | PV for RT pods has reached 93% of its total capacity. Specific to RT PVs. | if > 93% |
License Expiring | KX license has not renewed after the 7th day and has less than 3 days until expiry. Check for failures. | if > 0 |
License Renew Error | There is an error in the KX license renewal job. Act immediately. | if > 0 |
High Aggregated Errors | Aggregator errors for the last minute. | if > 20 |
High Aggregated Queue Size | Aggregator request queue size for the last minute. | if > 20 |
High SM EOD Time | Time taken for an EOD. | if > 4h |
SM No Records Written During EOI | An End of Interval ran but no records were written | if = 0 |
No DAPs Present | At least one assembly is deployed, but no Resource Coordinator DAPs exist | if = 0 |
Pod Not Ready | Pod in NotReady state for the last minute | if > 0 |
Pod in CrashLoopBackOff | Pod failing to restart for the last minute | if > 0 |
High RC Retries | Resource Coordinator Request retries | if > 20 |
RCs Without DAPs | Resource Coordinators have connected clients but there are no Data Access Processes connected to them | if > 0 |
No RDB Growth | Rate of RDB Growth is 0% | if = 0 |
High SG Pending Queries | Service Gateway pending queries for the last minute are high | if > 20 |
No active RT Leader | There is no leader for the Stream and therefore no messages will be merged and available for the subscribers | if > 0 |
No EOI Records | An End of Interval ran but no records were written | if = 0 |
RC Queue Size | Resource Coordinator queue size is growing | if > 20 |
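To make the threshold column concrete, the following is a minimal Python sketch of how a few of the rows above could be read as comparison rules. It is an illustration only: Azure Monitor evaluates the real alert rules, and the names `AlertRule`, `RULES`, and `fired_alerts` are hypothetical helpers that are not part of kdb Insights Enterprise or Azure. The threshold values are copied from the table.

```python
import operator
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class AlertRule:
    """One row of the table: a name, a comparison, and a threshold."""
    name: str
    compare: Callable[[float, float], bool]
    threshold: float

    def fires(self, value: float) -> bool:
        # The alert fires when `value <compare> threshold` holds.
        return self.compare(value, self.threshold)


# Thresholds taken from the table above; the rule names are illustrative.
RULES: List[AlertRule] = [
    AlertRule("CPU Percentage (>= 80%)", operator.ge, 80),
    AlertRule("CPU Percentage (>= 90%)", operator.ge, 90),
    AlertRule("failed pod", operator.gt, 0),
    AlertRule("rook ceph MB", operator.lt, 2000),
    AlertRule("RT pv percentage", operator.gt, 93),
]


def fired_alerts(name_prefix: str, value: float) -> List[str]:
    """Return the names of matching rules that fire for a metric value."""
    return [
        rule.name
        for rule in RULES
        if rule.name.startswith(name_prefix) and rule.fires(value)
    ]
```

For example, `fired_alerts("CPU Percentage", 85)` returns only the 80% rule, while a value of 90 fires both CPU rules. A PV usage of exactly 93 does not fire "RT pv percentage", because that threshold is a strict "greater than".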
Enable or disable alerts
Once you have accessed Alert rules:
- Find the alert you want to change the status for
- Click on the "..." next to Status
- Select whether to Enable or Disable it
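If you prefer to script this step rather than use the portal, the sketch below builds an Azure CLI command for toggling a rule. It assumes the rule is a metric alert rule managed through `az monitor metrics alert update`. Other rule types, such as activity log or log query alerts, are managed through different command groups, so check the rule type first. The function name `build_toggle_command` and the rule and resource group names in the usage note are hypothetical.

```python
import shlex
from typing import List


def build_toggle_command(rule_name: str, resource_group: str, enabled: bool) -> List[str]:
    """Build (but do not run) an Azure CLI command that enables or
    disables a metric alert rule.

    Assumption: the rule is a metric alert rule; other alert rule
    types are not covered by this command.
    """
    return [
        "az", "monitor", "metrics", "alert", "update",
        "--name", rule_name,
        "--resource-group", resource_group,
        "--enabled", "true" if enabled else "false",
    ]


def as_shell(command: List[str]) -> str:
    """Render the argument list as a safely quoted shell string."""
    return shlex.join(command)
```

Calling `as_shell(build_toggle_command("my-cpu-alert", "my-resource-group", False))` produces a command line you can review before running it in a shell where the Azure CLI is signed in.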
Modify notification
Once you have accessed Alert rules:
- Find the alert you want to change the notification type for
- Click on the "..." next to Status
- Select Edit
- On the screen that opens, scroll down to Actions
- Click on the Action Group
- On the screen that opens, scroll down to Notifications
- Under Notification type, find the element to modify
- Select the pencil
- On the screen that opens, update the information and click OK
- Click on Save Changes
Modify threshold
Once you have accessed Alert rules:
- Click on the alert to modify
- On the screen that opens, scroll down to Condition
- Click on the name of the alert
- On the screen that opens, scroll down to Threshold Value and modify it
- Click Done
- Click Save