kdb Insights Grafana dashboard reference
kdb Insights Grafana dashboards provide visualizations to help you monitor the performance and status of a kdb Insights system. The dashboards can be automatically deployed alongside each kdb Insights Enterprise instance.
Getting started
- Go to the Grafana homepage, select the Toggle menu and choose Dashboards.
- The list of folders includes the namespace into which kdb Insights Enterprise is deployed. The following dashboards are available within the namespace folder:
References to databases
References to databases, both here and in the dashboards, cover assemblies deployed via the kdb Insights CLI as well as databases created from the kdb Insights Enterprise UI.
kdb Insights logging summary
The panels on this dashboard let you drill down into the log messages to identify issues. The dashboard displays the number of log messages by type and lets you group those messages by any label that Kubernetes uses to distinguish the components logging them. For example, you could display only the error messages and group them by database to see which databases are raising errors. You can then filter on a specific database to see the list of messages for that database.
Variables
At the top of the dashboard there is a set of variables that allows you to filter some of the panels to show messages that have particular properties.
variable | description |
---|---|
Log Status | Filters the panels in the second and third rows by log status. The states are: FATAL, ERROR, WARN, INFO, DEBUG and TRACE. |
Group by | Group the second row by any label associated with the messages. This allows you to find components that are logging the messages. For example, choosing insights_kx_com_app groups the messages by database. |
Include Messages | Filter all rows to include only messages that contain the specified text. |
+ | Built-in Grafana option to filter on any label that is included in the messages. |
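
As a rough illustration of how Log Status, Group by and Include Messages combine, the following Python sketch filters and groups a handful of made-up log records. The record layout, the `count_by_label` helper and the database names are hypothetical; only the `insights_kx_com_app` label is taken from the table above. It is a sketch of the idea, not how Grafana evaluates the variables.

```python
from collections import Counter

# Hypothetical log records; field names and values are illustrative only.
records = [
    {"level": "ERROR", "message": "connection refused",
     "labels": {"insights_kx_com_app": "db-a"}},
    {"level": "ERROR", "message": "disk full",
     "labels": {"insights_kx_com_app": "db-b"}},
    {"level": "INFO", "message": "connection opened",
     "labels": {"insights_kx_com_app": "db-a"}},
    {"level": "ERROR", "message": "connection reset",
     "labels": {"insights_kx_com_app": "db-a"}},
]


def count_by_label(records, status, group_by, include_text=""):
    """Count records with the given status, grouped by one label's value."""
    counts = Counter()
    for rec in records:
        if rec["level"] != status:
            continue  # Log Status filter
        if include_text not in rec["message"]:
            continue  # Include Messages filter (empty text matches everything)
        counts[rec["labels"].get(group_by, "<none>")] += 1  # Group by
    return dict(counts)


errors_by_db = count_by_label(records, "ERROR", "insights_kx_com_app")
connection_errors = count_by_label(
    records, "ERROR", "insights_kx_com_app", include_text="connection"
)
```

Here `errors_by_db` counts error messages per database, and `connection_errors` narrows that to messages containing the text "connection".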
Log messages by type
Count of the number of messages per type. On the left there is the total number of messages per type in the selected time range. On the right is a line chart showing the total number of messages per type over time.
Variable filters
The 'Include Messages' and '+' variables are the only ones that filter this row.
Log type details
Count of the number of messages grouped by the selected Group by label. On the left there is the count of the number of messages per label value across the selected time range. On the right is a line chart showing the number of messages per label value over time.
This allows you to find which databases or components are logging the errors and warnings to assist you in determining the root cause of an issue.
Variable filters
All filters are applied to this row.
Messages
Detailed list of all the messages that match the variables selected.
Click on the '>' arrow to drill into a specific log message.
Variable filters
All filters are applied to this row.
Kubernetes Events
List of Kubernetes events raised in the namespace in the time range.
metrics | description |
---|---|
Time | Time of event |
Reason | Reason for the event |
Object | Object raising the event |
Message | Message details |
Variable filters
The 'Include Messages' filter is the only variable that is applied to this row.
kdb Insights Enterprise Database
This dashboard is intended to assist in monitoring the CPU, memory and disk of each database as well as giving details on the logs and alerts associated with the whole namespace.
Variables
At the top of the dashboard there is a set of variables that allows you to filter some of the panels to show messages that have particular properties.
variable | description |
---|---|
Database | A list of all deployed databases. Filters the panels in every row except the first. |
Filters + | Built-in Grafana option to filter on any label that is included in the records. |
Alerts and logs summary
This row shows a high-level overview of all the alerts and logs for the whole namespace. This allows you to view information for all databases and for components that are shared between the databases, for example, the Service Gateway and kdb Insights CLI.
panels | description |
---|---|
Critical Alerts | Total number of critical alerts that have occurred in the time range |
Warning Alerts | Total number of warning alerts that have occurred in the time range |
Info Alerts | Total number of information alerts that have occurred in the time range |
Logs | Total number of log messages per type that have occurred in the time range |
Alerts | Detailed list of all the messages that match the variables selected. Click on the '>' arrow to drill into a specific alert. |
Database Status | Status of each database including Ready and NotReady. If the database is not ready, a reason is included |
Overview
This row shows a high-level overview of the database selected in the Database variable above.
panels | description |
---|---|
HDB Size | Current size of the HDB |
Stream Ingestion | Rate of ingestion of data into each stream associated with the database |
Pods CPU above Requested | Number of pods with CPU above their requested values * |
Pods CPU above Limit | Number of pods with CPU above their limit values * |
Memory CPU above Requested | Number of pods with memory above their requested values * |
Memory CPU above Limit | Number of pods with memory above their limit values * |
* On the dashboard, the CPU and Memory rows provide details of each pod that has breached these limits.
CPU
This row shows the CPU details of each pod in the selected database and a chart that is populated with the details over time for the selected pod. To select a pod, click on the pod name in the grid.
metrics | description | color thresholds |
---|---|---|
CPU Usage | CPU utilization in seconds | |
CPU Requested | CPU seconds requested | |
CPU Req % | Percentage of requested CPU currently being used | Yellow: 80% // Orange: 90% // Red: 100% |
CPU Limit | CPU seconds limit | |
CPU Limit % | Percentage of the CPU limit currently being used | Yellow: 80% // Orange: 90% // Red: 100% |
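
The percentage columns and their color thresholds can be read as a simple calculation. The Python sketch below is illustrative only: the helper names are hypothetical, the input values are made up, and it assumes each threshold is inclusive (for example, exactly 80% is shown as yellow), which the table does not state.

```python
def usage_percent(used, reference):
    """Return `used` as a percentage of `reference` (a request or a limit).

    Both values must be in the same unit. Returns None when no request or
    limit is set, since a percentage is undefined in that case.
    """
    if reference <= 0:
        return None
    return 100.0 * used / reference


def threshold_color(percent):
    """Map a percentage to the color bands listed in the table.

    Assumption: thresholds are inclusive lower bounds.
    """
    if percent is None:
        return "none"
    if percent >= 100:
        return "red"
    if percent >= 90:
        return "orange"
    if percent >= 80:
        return "yellow"
    return "none"


# Made-up example: a pod using 450 units with a request of 500 and a limit of 1000.
req_color = threshold_color(usage_percent(450, 500))     # 90% of the request
limit_color = threshold_color(usage_percent(450, 1000))  # 45% of the limit
```

With these example values the pod is at 90% of its request (orange) but only 45% of its limit (no color), which is why the two percentage columns can show different colors for the same pod.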
Memory
This row shows the memory details of each pod in the selected database and a chart that is populated with the details of the selected pod over time. To select a pod, click on the pod name in the grid.
metrics | description | color thresholds |
---|---|---|
Memory Usage (MB) | Memory utilization in MBs | |
Memory Requested (MB) | Memory requested in MBs | |
Memory Req (%) | Percentage of requested memory currently being used | Yellow: 80% // Orange: 90% // Red: 100% |
Memory Limit (MB) | Memory limit in MBs | |
Memory Limit (%) | Percentage of memory limit currently being used | Yellow: 80% // Orange: 90% // Red: 100% |
Disk
This row shows the persistent volume claim (PVC) disk usage of each PVC in the selected database and a chart that is populated with the details of the selected PVC over time. To select a PVC, click on the PVC name in the grid.
metrics | description | color thresholds |
---|---|---|
PVC (GB) | PVC size | |
PVC Used (GB) | Amount of the PVC being used | |
Used % | Percentage of the PVC being used | Yellow: 80% // Orange: 90% // Red: 100% |
1 Day Growth (GB) | Growth in the last 24 hours | |
2 Day Growth (GB) | Growth in the last 48 hours | |
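
To make the relationship between the disk columns concrete, the following Python sketch derives the used percentage and the two growth figures from a few usage samples. The `pvc_summary` helper and the sample values are hypothetical, and it assumes growth is the difference between current usage and usage 24 or 48 hours earlier; the dashboard's actual queries may differ.

```python
def pvc_summary(capacity_gb, samples):
    """Summarize PVC usage.

    `samples` is a list of (hours_ago, used_gb) pairs and must include
    entries for 0, 24 and 48 hours ago.
    """
    used = dict(samples)
    now = used[0]
    return {
        "used_pct": 100.0 * now / capacity_gb,
        "growth_1d_gb": now - used[24],  # growth in the last 24 hours
        "growth_2d_gb": now - used[48],  # growth in the last 48 hours
    }


# Made-up example: a 100 GB PVC that held 70 GB two days ago and 85 GB now.
summary = pvc_summary(100, [(0, 85), (24, 80), (48, 70)])
```

In this example the PVC is 85% full, so it would fall in the yellow band, and comparing the 1-day and 2-day growth figures shows whether growth is speeding up or slowing down.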
kdb Insights detail
This dashboard is intended to assist in monitoring the whole of your kdb Insights deployment. It provides in-depth details on the components, and gives information about the logs and alerts associated with the namespace.
Alerts
This row shows all the alerts raised in the whole namespace.
panels | description |
---|---|
Critical Alerts | Total number of critical alerts that have occurred in the time range |
Warning Alerts | Total number of warning alerts that have occurred in the time range |
Info Alerts | Total number of information alerts that have occurred in the time range |
Alerts | Detailed list of all the alerts. Click on the '>' arrow to drill into a specific alert. The alerts list is ordered alphabetically. |
Base infrastructure
This row shows general information about the status of the databases and pods.
Deployment status
This panel provides a list of databases / assemblies and reasons why they are not ready. Each query environment has its own record.
metrics | description |
---|---|
Database | Name of database |
Ready | true if the database is ready |
Not Ready | true if the database is not ready |
Reason | The reason the database is not ready |
License status
This panel allows you to see if any of your pod licenses are expiring.
metrics | description |
---|---|
Pod | Pod linked to the license |
Process Cores | Number of CPU cores running in the cluster |
Release Date | Date when the license was issued |
Release Version | Version the license was released on |
License Expiry | Date when the license expires |
StatefulSet status
StatefulSets are workload API objects used to manage stateful applications. They manage the deployment and scaling of a set of pods that are based on an identical container, and provide guarantees about the ordering and uniqueness of these pods.
This panel shows StatefulSets that may not have all the requested replicas available.
metrics | description |
---|---|
StatefulSet | Name of the StatefulSet |
Requested | The number of replicas requested |
Available | The number of replicas available |
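
The condition this panel highlights, fewer available replicas than requested, can be sketched in a few lines of Python. The `under_replicated` helper, the record layout and the StatefulSet names are all hypothetical and only illustrate the comparison between the Requested and Available columns.

```python
def under_replicated(workloads):
    """Return the names of workloads with fewer available replicas than requested."""
    return [w["name"] for w in workloads if w["available"] < w["requested"]]


# Made-up example data mirroring the StatefulSet, Requested and Available columns.
statefulsets = [
    {"name": "example-rt", "requested": 3, "available": 3},
    {"name": "example-sm", "requested": 1, "available": 0},
]

needs_attention = under_replicated(statefulsets)
```

The same comparison applies to any table that pairs a requested replica count with an available one.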
Deployment status
Deployments provide declarative updates for pods and ReplicaSets.
This panel shows deployments that may not have all the requested replicas available.
metrics | description |
---|---|
Deployment | Name of the resource object responsible for keeping a set of pods running |
Requested | The number of replicas requested |
Available | The number of replicas available |
Pods not available
This panel shows details of all the pods that are not available and the reason.
metrics | description |
---|---|
Pod | Pod identifier name |
Ready | Readiness of the pod. 0 means the pod is not ready. |
Restarts | Number of times the pod has restarted, trying to successfully become ready. |
Reason 1 | Short summary on the reason why the pod is not available |
Reason 2 | Detailed technical reason why the pod is not available |
Persistent volume claim usage
This panel shows disk usage details for all PVCs.
metrics | description |
---|---|
PVC | Name of the persistent volume claim |
Used (GB) | Disk space used |
Capacity (GB) | Disk space available |
Used (%) | Percentage of the disk space used |
Ingest
This row shows details of each pod involved in data ingestion and how much data they are processing.
RT Services
This panel shows details of the messages being ingested by each RT pod.
metrics | description |
---|---|
RT Pod | Name of the specific reliable transport pod |
Leader | Leadership status of the pod. There should always be one leader per RT service. |
Node Index | The node index from the hostname |
In Msg/s | Incoming messages per second * |
Message Queue Size | Number of messages in the queue * |
In Bytes/s | Incoming bytes per second * |
* These metrics are only recorded for the leader node.
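
Because each RT service is expected to have exactly one leader, a quick consistency check over the Leader column can flag services that need attention. The Python sketch below is a hypothetical illustration: the record layout, the `service` field and the pod names are assumptions, not values read from the dashboard.

```python
from collections import defaultdict


def leader_counts(pods):
    """Count the leaders reported for each RT service."""
    counts = defaultdict(int)
    for pod in pods:
        counts[pod["service"]] += 1 if pod["leader"] else 0
    return dict(counts)


def services_without_single_leader(pods):
    """Return the services that do not have exactly one leader."""
    return sorted(s for s, n in leader_counts(pods).items() if n != 1)


# Made-up example: one healthy service and one with no leader elected.
rt_pods = [
    {"pod": "rt-north-0", "service": "rt-north", "leader": True},
    {"pod": "rt-north-1", "service": "rt-north", "leader": False},
    {"pod": "rt-south-0", "service": "rt-south", "leader": False},
    {"pod": "rt-south-1", "service": "rt-south", "leader": False},
]

problem_services = services_without_single_leader(rt_pods)
```

A service with no leader would also show no values in the starred columns, since those metrics are only recorded for the leader node.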
RT Publishers Messages In
This panel shows details of the messages being ingested by each RT pod per publisher.
metrics | description |
---|---|
RT Pod | Name of the specific reliable transport pod |
Publisher | Name of the directory the publisher is publishing to |
In Bytes/s | Incoming bytes per second from the publisher |
RT Publishers Messages Out
This panel shows details of the messages being sent by each RT pod to each subscriber.
metrics | description |
---|---|
RT Pod | Name of the specific reliable transport pod |
Publisher | Name of the directory the subscriber is subscribing to |
Out Msg/s | Outgoing messages per second to the subscriber |
DAP ingest
This panel shows details of the DAPs including their purview time range, their ingestion rate and how many records they retain after a purge.
metrics | description |
---|---|
Pod | DAP pod identifier |
Instance Type | Type of Data Access Process instance (rdb, idb, hdb) |
Purview Start | Start timestamp of Data Access Purview |
Purview End | End timestamp of Data Access Purview |
Records/s | Inbound records received by the Data Access Process per second |
Stream Pos | Current subscriber stream position |
Records Post Purge | Number of records left in the Data Access Process after purge |
Storage Manager ingest
This panel shows details of the Storage Manager clients, ingestion, and EOI and EOD status.
metrics | description |
---|---|
Pod | Storage Manager pod identifier |
Connected Clients | Number of connected clients |
Stream Records | Number of records held by the stream |
Stream Msgs | Number of messages streamed by the stream |
EOI Stream position | End of interval stream position |
EODs Pending | Number of end of day requests pending |
Data persistence
This row shows details of each pod storing symbols and the symbol growth rate.
Symbols
metrics | description |
---|---|
Pod | Pod identifier name |
Symbols | Number of symbols for the component container |
Sym growth (1d) | Daily growth of symbols for the component container |
Sym growth (7d) | Weekly growth of symbols for the component container |
EOI by shard
metrics | description |
---|---|
Pod | Pod identifier name |
Last EOI duration (s) | Number of seconds the last end of interval lasted |
Last EOI records written | Number of records written during the last end of interval |
Pending EOIs | Number of EOI requests awaiting completion |
EOD by shard
metrics | description |
---|---|
Pod | Pod identifier name |
Last EOD duration (s) | Number of seconds the last end of day lasted |
Last EOD records written | Number of records written into hdb at end of day |
HDB Partitions | Number of partitions in the historical database |
HDB Size (MB) | Size in MB of the historical database |
Pending EODs | Number of EOD requests awaiting completion |
Query
Gateway query status
metrics | description |
---|---|
Pod | Pod identifier name |
Service | Service identifier name |
Pending Queries | Number of pending queries (both HTTP and IPC) |
IPC Requests/s | Number of incoming IPC requests per second |
Connected Clients | Number of connected clients |
Connected Aggs | Number of connected aggregators |
Connected DAPs | Number of connected Data Access Processes |
Resource coordinator query status
metrics | description |
---|---|
Service | Service identifier name |
Pod | Pod identifier name |
Queue size | Length of the outstanding request queue |
Avg Response (ms) | Average response time in milliseconds |
Requests/s | Number of incoming requests per second |
Success Query/s | Number of successful queries per second |
Retry Rate/s | Number of retries per second |
Connected Aggs | Number of connected Aggregators |
Connected DAPs | Number of connected Data Access Processes |
Agg Query Status
metrics | description |
---|---|
Pod | Pod identifier name |
Request/s | Number of incoming requests per second |
Errors/s | Number of errors received per second |
Timeouts/s | Number of timeouts per second |
Active Queries | Number of queries being executed now |
Avg Response (ms) | Average response time in milliseconds |
DAP Request Status
metrics | description |
---|---|
Pod | Pod identifier name |
Endpoint | Database type that the Data Access Process points to |
Success Query/s | Number of successful queries per second |
Failed Query/s | Number of failed queries per second |
Failure (%) | Percentage of queries that failed |
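
One plausible reading of the Failure (%) column is the share of failed queries among all queries in the same interval. The Python sketch below assumes exactly that formula and treats an idle DAP as 0% failure; both the formula and the `failure_percent` helper are assumptions for illustration, as the table does not spell out the calculation.

```python
def failure_percent(success_per_s, failed_per_s):
    """Return failed queries as a percentage of all queries.

    Assumption: total = successful + failed. An idle DAP (no queries)
    is reported as 0.0 rather than raising a division error.
    """
    total = success_per_s + failed_per_s
    if total == 0:
        return 0.0
    return 100.0 * failed_per_s / total


# Made-up example: 45 successful and 5 failed queries per second.
example_failure = failure_percent(45, 5)
```

Reading the percentage alongside the two rate columns helps distinguish a DAP that fails a few queries under heavy load from one that fails most of a small number of queries.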
Kubernetes Events
List of Kubernetes events raised in the namespace in the time range.
metrics | description |
---|---|
Time | Time of event |
Reason | Reason for the event |
Object | Object raising the event |
Message | Message details |