kdb Insights Grafana dashboard reference

kdb Insights Grafana dashboards provide visualizations to help you monitor the performance and status of a kdb Insights system. The dashboards can be automatically deployed alongside each kdb Insights Enterprise instance.

Getting started

Go to the Grafana homepage, select the Toggle menu and choose Dashboards.
The list of folders includes the namespace into which kdb Insights Enterprise is deployed. The following dashboards are available within the namespace folder: kdb Insights logging summary, kdb Insights Enterprise Database, and kdb Insights detail.

References to databases

References to databases, both here and in the dashboards, cover assemblies deployed via the kdb Insights CLI as well as databases created from the kdb Insights Enterprise UI.

kdb Insights logging summary

The panels on this dashboard let you drill down into log messages to identify issues. The dashboard displays the number of log messages by type and lets you group those messages by any label that Kubernetes uses to distinguish the components logging them. For example, you can display only the error messages and group them by database to see which databases are raising errors. You can then filter on a specific database to see the list of messages for that database.

Variables

At the top of the dashboard there is a set of variables that allows you to filter some of the panels to show messages that have particular properties.

variable	description
Log Status	The log status filters the panels in the second and third row. The states are: FATAL, ERROR, WARN, INFO, DEBUG and TRACE.
Group by	Group the second row by any label associated with the messages. This allows you to find components that are logging the messages. For example, choosing `insights_kx_com_app` groups the messages by database.
Include Messages	Filter all rows to include only messages that contain the specified text.
+	Built-in Grafana option to filter on any label that is included in the messages.
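The way the Log Status, Group by and Include Messages variables combine can be sketched in a few lines of Python. This is only an illustration of the filtering logic, not how the dashboard queries its data source: the record layout, the `count_by_label` helper, the single-value status match and the case-sensitive text match are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical log records: a level, a message and the Kubernetes labels
# attached to the component that logged the message.
LOGS = [
    {"level": "ERROR", "message": "disk full", "labels": {"insights_kx_com_app": "db-a"}},
    {"level": "ERROR", "message": "connection lost", "labels": {"insights_kx_com_app": "db-b"}},
    {"level": "WARN", "message": "disk nearly full", "labels": {"insights_kx_com_app": "db-a"}},
    {"level": "ERROR", "message": "disk full", "labels": {"insights_kx_com_app": "db-a"}},
]


def count_by_label(logs, status, group_by, include=""):
    """Count messages of one status, grouped by the value of one label.

    status   -> plays the role of the Log Status variable
    group_by -> plays the role of the Group by variable
    include  -> plays the role of the Include Messages variable
    """
    counts = Counter()
    for record in logs:
        if record["level"] != status:
            continue  # wrong log status
        if include not in record["message"]:
            continue  # message does not contain the requested text
        # Records without the chosen label are grouped under a placeholder.
        counts[record["labels"].get(group_by, "<none>")] += 1
    return dict(counts)


print(count_by_label(LOGS, "ERROR", "insights_kx_com_app"))
# → {'db-a': 2, 'db-b': 1}
```

Passing `include="disk"` narrows the result to `{'db-a': 2}`, which mirrors filtering on a specific message text after grouping by database.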

Log messages by type

Count of the number of messages per type. On the left there is the total number of messages per type in the selected time range. On the right is a line chart showing the total number of messages per type over time.

Variable filters

The 'Include Messages' and '+' variables are the only ones that filter this row.

Log type details

Count of the number of messages grouped by the label selected in the Group by variable. On the left there is the count of the number of messages per label value across the selected time range. On the right is a line chart showing the number of messages per label value over time.

This allows you to find which databases or components are logging the errors and warnings to assist you in determining the root cause of an issue.

Variable filters

All filters are applied to this row.

Messages

Detailed list of all the messages that match the variables selected.

Click on the '>' arrow to drill into a specific log message.

Variable filters

All filters are applied to this row.

Kubernetes Events

List of Kubernetes events raised in the namespace in the time range.

metrics	description
Time	Time of event
Reason	Reason for the event
Object	Object raising the event
Message	Message details

Variable filters

The 'Include Messages' filter is the only variable that is applied to this row.

kdb Insights Enterprise Database

This dashboard is intended to assist in monitoring the CPU, memory and disk of each database as well as giving details on the logs and alerts associated with the whole namespace.

Variables

At the top of the dashboard there is a set of variables that allows you to filter some of the panels to show records that have particular properties.

variable	description
Database	A list of all deployed databases. Filters the panels in every row except the first.
Filters +	Built-in Grafana option to filter on any label that is included in the records.

Alerts and logs summary

This row shows a high-level overview of all the alerts and logs for the whole namespace. This allows you to view information for all databases and for components that are shared between the databases, for example, the Service Gateway and kdb Insights CLI.

panels	description
Critical Alerts	Total number of critical alerts that have occurred in the time range
Warning Alerts	Total number of warning alerts that have occurred in the time range
Info Alerts	Total number of information alerts that have occurred in the time range
Logs	Total number of log messages per type that have occurred in the time range
Alerts	Detailed list of all the messages that match the variables selected. Click on the '>' arrow to drill into a specific alert.
Database Status	Status of each database including Ready and NotReady. If the database is not ready, a reason is included

Overview

This row shows a high-level overview of the database selected in the Database variable above.

panels	description
HDB Size	Current size of the HDB
Stream Ingestion	Rate of ingestion of data into each stream associated with the database
Pods CPU above Requested	Number of pods with CPU above their requested values *
Pods CPU above Limit	Number of pods with CPU above their limit values *
Pods Memory above Requested	Number of pods with memory above their requested values *
Pods Memory above Limit	Number of pods with memory above their limit values *

* On the dashboard, the CPU and Memory rows provide details of each pod that has breached these values.
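As an illustration of what the "above Requested" and "above Limit" counts represent, the following Python sketch compares each pod's usage with its requested and limit values. The pod names, the figures and the `pods_above` helper are hypothetical, and the strict greater-than comparison is an assumption about how the panels decide that a value has been exceeded.

```python
# Hypothetical per-pod CPU figures in millicores (any consistent unit works).
PODS = [
    {"pod": "dap-0", "usage": 900, "requested": 500, "limit": 1000},
    {"pod": "sm-0", "usage": 1200, "requested": 1000, "limit": 1000},
    {"pod": "rt-0", "usage": 100, "requested": 500, "limit": 1000},
]


def pods_above(pods, field):
    """Return the names of pods whose usage exceeds the given field.

    field is either "requested" or "limit".
    """
    return [p["pod"] for p in pods if p["usage"] > p[field]]


print(len(pods_above(PODS, "requested")))  # pods above their requested CPU → 2
print(len(pods_above(PODS, "limit")))      # pods above their CPU limit → 1
```

The same comparison applies to memory by swapping in memory usage, requested and limit values.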

CPU

This row shows the CPU details of each pod in the selected database and a chart that is populated with the details over time for the selected pod. To select a pod, click on the pod name in the grid.

metrics	description	color thresholds
CPU Usage	CPU utilization in seconds
CPU Requested	CPU seconds requested
CPU Req %	Percentage of requested CPU currently being used	Yellow: 80% // Orange: 90% // Red: 100%
CPU Limit	CPU seconds limit
CPU Limit %	Percentage of CPU limit currently being used	Yellow: 80% // Orange: 90% // Red: 100%
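The percentage columns and their color thresholds can be reproduced with a small calculation. The sketch below is an assumption-laden illustration: the `percent_of` and `threshold_color` helpers are hypothetical, the thresholds are treated as inclusive lower bounds, and values below 80% are assumed to have no highlight.

```python
def percent_of(usage, reference):
    """Usage as a percentage of a reference value (requested or limit)."""
    return 100.0 * usage / reference


def threshold_color(percent):
    """Map a percentage to the color thresholds listed in the table.

    Assumes each threshold applies from its value upwards and that
    anything under 80% is not highlighted (returned as None).
    """
    if percent >= 100:
        return "red"
    if percent >= 90:
        return "orange"
    if percent >= 80:
        return "yellow"
    return None


# A pod using 450 millicores against a request of 500 millicores.
cpu_req_pct = percent_of(450, 500)
print(cpu_req_pct, threshold_color(cpu_req_pct))  # → 90.0 orange
```

The same mapping is used by the Memory and Disk rows, which list identical thresholds.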

Memory

This row shows the memory details of each pod in the selected database and a chart that is populated with the details of the selected pod over time. To select a pod, click on the pod name in the grid.

metrics	description	color thresholds
Memory Usage (MB)	Memory utilization in MBs
Memory Requested (MB)	Memory requested in MBs
Memory Req (%)	Percentage of requested memory currently being used	Yellow: 80% // Orange: 90% // Red: 100%
Memory Limit (MB)	Memory limit in MBs
Memory Limit (%)	Percentage of memory limit currently being used	Yellow: 80% // Orange: 90% // Red: 100%

Disk

This row shows the persistent volume claim (PVC) disk usage of each PVC in the selected database and a chart that is populated with the details of the selected PVC over time. To select a PVC, click on the PVC name in the grid.

metrics	description	color thresholds
PVC (GB)	PVC size
PVC Used (GB)	Amount of the PVC being used
Used %	Percentage of the PVC being used	Yellow: 80% // Orange: 90% // Red: 100%
1 Day Growth (GB)	Growth in the last 24 hours
2 Day Growth (GB)	Growth in the last 48 hours
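One way to think about the growth columns is as the difference between the most recent usage sample and the sample taken 24 or 48 hours earlier. The Python sketch below illustrates that reading only; the `growth_gb` helper and the sample data are hypothetical, and the dashboard's own query may compute growth differently.

```python
from datetime import datetime, timedelta


def growth_gb(samples, window):
    """Growth in GB over a window ending at the latest sample.

    samples is a list of (timestamp, used_gb) pairs. Returns None when no
    sample exists at or before the start of the window.
    """
    samples = sorted(samples)
    latest_time, latest_used = samples[-1]
    cutoff = latest_time - window
    # Usage values recorded at or before the start of the window.
    earlier = [used for ts, used in samples if ts <= cutoff]
    if not earlier:
        return None
    return latest_used - earlier[-1]


t0 = datetime(2024, 1, 1)
SAMPLES = [
    (t0, 10.0),
    (t0 + timedelta(days=1), 12.5),
    (t0 + timedelta(days=2), 13.0),
]

print(growth_gb(SAMPLES, timedelta(days=1)))  # 1 Day Growth → 0.5
print(growth_gb(SAMPLES, timedelta(days=2)))  # 2 Day Growth → 3.0
```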

kdb Insights detail

This dashboard is intended to assist in monitoring the whole of your kdb Insights deployment. It provides in-depth details on the components, and gives information about the logs and alerts associated with the namespace.

Alerts

This row shows all the alerts raised in the whole namespace.

panels	description
Critical Alerts	Total number of critical alerts that have occurred in the time range
Warning Alerts	Total number of warning alerts that have occurred in the time range
Info Alerts	Total number of information alerts that have occurred in the time range
Alerts	Detailed list of all the alerts. Click on the '>' arrow to drill into a specific alert. The alerts list is ordered alphabetically.

Base infrastructure

This row shows general information about the status of the databases and pods.

Deployment status

This panel provides a list of databases / assemblies and reasons why they are not ready. Each query environment has its own record.

metrics	description
Database	Name of database
Ready	`true` if the database is ready
Not Ready	`true` if the database is not ready
Reason	The reason the database is not ready

License status

This panel allows you to see if any of your pod licenses are expiring.

metrics	description
Pod	Pod linked to the license
Process Cores	Number of CPU cores running in the cluster
Release Date	Date when the license was issued
Release Version	Version the license was released on
License Expiry	Date when the license expires
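A check for licenses that are close to expiring can be sketched from the Pod and License Expiry columns. The pod names, the dates, the 30-day window and the helper functions below are all hypothetical choices for the example; the panel itself only lists the expiry dates.

```python
from datetime import date

# Hypothetical rows taken from the Pod and License Expiry columns.
LICENSES = [
    {"pod": "dap-0", "expiry": date(2025, 1, 31)},
    {"pod": "sm-0", "expiry": date(2025, 6, 30)},
]


def days_until_expiry(expiry, today):
    """Whole days remaining before the license expires (negative if expired)."""
    return (expiry - today).days


def expiring_soon(licenses, today, within_days=30):
    """Names of pods whose license expires within the given number of days."""
    return [
        lic["pod"]
        for lic in licenses
        if days_until_expiry(lic["expiry"], today) <= within_days
    ]


print(expiring_soon(LICENSES, today=date(2025, 1, 1)))  # → ['dap-0']
```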

StatefulSet status

StatefulSets are workload API objects used to manage stateful applications. They manage the deployment and scaling of a set of pods that are based on an identical container, and provide guarantees about the ordering and uniqueness of these pods.

This panel shows StatefulSets that may not have all the requested replicas available.

metrics	description
StatefulSet	Name of the StatefulSet
Requested	The number of replicas requested
Available	The number of replicas available
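The condition this panel highlights, fewer available replicas than requested, can be expressed as a simple comparison. The sketch below uses hypothetical StatefulSet names and a hypothetical `under_replicated` helper; the same comparison applies to the deployment panel that follows.

```python
# Hypothetical rows from the Requested and Available columns.
STATEFULSETS = [
    {"name": "rt-north", "requested": 3, "available": 3},
    {"name": "sm", "requested": 1, "available": 0},
    {"name": "dap-hdb", "requested": 2, "available": 1},
]


def under_replicated(workloads):
    """Return the workloads with fewer available replicas than requested."""
    return [w["name"] for w in workloads if w["available"] < w["requested"]]


print(under_replicated(STATEFULSETS))  # → ['sm', 'dap-hdb']
```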

Deployment status

Deployments provide declarative updates for pods and ReplicaSets.

This panel shows deployments that may not have all the requested replicas available.

metrics	description
Deployment	Name of the resource object responsible for keeping a set of pods running
Requested	The number of replicas requested
Available	The number of replicas available

Pods not available

This panel shows details of all the pods that are not available and the reason.

metrics	description
Pod	Pod identifier name
Ready	Readiness of the pod. 0 means the pod is not ready.
Restarts	Number of times the pod has restarted while trying to become ready.
Reason 1	Short summary on the reason why the pod is not available
Reason 2	Detailed technical reason why the pod is not available

Persistent volume claim usage

This panel shows disk usage details for all PVCs.

metrics	description
PVC	Name of the persistent volume claim
Used (GB)	Disk space used
Capacity (GB)	Disk space available
Used (%)	Percentage of the disk space used

Ingest

This row shows details of each pod involved in data ingestion and how much data they are processing.

RT Services

This panel shows details of the messages being ingested by each RT pod.

metrics	description
RT Pod	Name of the specific reliable transport pod
Leader	Leadership status of the pod. There should always be one leader per RT service.
Node Index	The node index from the hostname
In Msg/s	Incoming messages per second *
Message Queue Size	Number of messages in the queue *
In Bytes/s	Incoming bytes per second *

* These metrics are only recorded for the leader node.
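Because each RT service is expected to have exactly one leader, the Leader column can be used as a quick health check. The following Python sketch is a hypothetical illustration: it assumes pod names end in `-<node index>` and that stripping that suffix yields the RT service name, which may not match the naming in your deployment.

```python
from collections import defaultdict

# Hypothetical rows from the RT Pod and Leader columns.
RT_PODS = [
    {"pod": "rt-a-0", "leader": True},
    {"pod": "rt-a-1", "leader": False},
    {"pod": "rt-a-2", "leader": False},
    {"pod": "rt-b-0", "leader": False},
    {"pod": "rt-b-1", "leader": False},
    {"pod": "rt-b-2", "leader": False},
]


def leaders_per_service(rows):
    """Count leaders per RT service.

    Assumes the service name is the pod name without its trailing
    "-<node index>" suffix.
    """
    counts = defaultdict(int)
    for row in rows:
        service = row["pod"].rsplit("-", 1)[0]
        counts[service] += 1 if row["leader"] else 0
    return dict(counts)


def services_without_single_leader(rows):
    """Services that do not have exactly one leader."""
    return sorted(s for s, n in leaders_per_service(rows).items() if n != 1)


print(services_without_single_leader(RT_PODS))  # → ['rt-b']
```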

RT Publishers Messages In

This panel shows details of the messages being ingested by each RT pod per publisher.

metrics	description
RT Pod	Name of the specific reliable transport pod
Publisher	Name of the directory the publisher is publishing to
In Bytes/s	Incoming bytes per second from the publisher

RT Publishers Messages Out

This panel shows details of the messages being sent by each RT pod to each subscriber.

metrics	description
RT Pod	Name of the specific reliable transport pod
Publisher	Name of the directory the subscriber is subscribing to
Out Msg/s	Outgoing messages per second to the subscriber

DAP ingest

This panel shows details of the DAPs including their purview time range, their ingestion rate and how many records they retain after a purge.

metrics	description
Pod	DAP pod identifier
Instance Type	Data Access Processor type of instance (rdb, idb, hdb)
Purview Start	Start timestamp of Data Access Purview
Purview End	End timestamp of Data Access Purview
Records/s	Inbound records received by the Data Access Processor per second
Stream Pos	Current subscriber stream position
Records Post Purge	Number of records left in the Data Access Processor after purge

Storage Manager ingest

This panel shows details of the Storage Manager clients, ingestion and EOI and EOD status.

metrics	description
Pod	Storage Manager pod identifier
Connected Clients	Number of connected clients
Stream Records	Number of records held by the stream
Stream Msgs	Number of messages streamed by the stream
EOI Stream position	End of interval stream position
EODs Pending	Number of end of day requests pending

Data persistence

This row shows details of each pod storing symbols and the symbol growth rate.

Symbols

metrics	description
Pod	Pod identifier name
Symbols	Number of symbols for the component container
Sym growth (1d)	Daily growth of symbols for the component container
Sym growth (7d)	Weekly growth of symbols for the component container

EOI by shard

metrics	description
Pod	Pod identifier name
Last EOI duration (s)	Number of seconds the last end of interval lasted
Last EOI records written	Number of records written during the last end of interval
Pending EOIs	Number of EOI requests awaiting completion

EOD by shard

metrics	description
Pod	Pod identifier name
Last EOD duration (s)	Number of seconds the last end of day lasted
Last EOD records written	Number of records written into hdb at end of day
HDB Partitions	Number of partitions in the historical database
HDB Size (MB)	Size in MB of the historical database
Pending EODs	Number of EOD requests awaiting completion

Query

Gateway query status

metrics	description
Pod	Pod identifier name
Service	Service identifier name
Pending Queries	Number of pending queries (both HTTP and IPC)
IPC Requests/s	Number of incoming IPC requests per second
Connected Clients	Number of connected clients
Connected Aggs	Number of connected aggregators
Connected DAPs	Number of connected Data Access Processors

Resource coordinator query status

metrics	description
Service	Service identifier name
Pod	Pod identifier name
Queue size	Length of the outstanding request queue
Avg Response (ms)	Average response time in milliseconds
Requests/s	Number of incoming requests per second
Success Query/s	Number of successful queries per second
Retry Rate/s	Number of retries per second
Connected Aggs	Number of connected Aggregators
Connected DAPs	Number of connected Data Access Processors

Agg Query Status

metrics	description
Pod	Pod identifier name
Request/s	Number of incoming requests per second
Errors/s	Number of errors received per second
Timeouts/s	Number of timeouts per second
Active Queries	Number of queries being executed now
Avg Response (ms)	Average response time in milliseconds

DAP Request Status

metrics	description
Pod	Pod identifier name
Endpoint	Database type that the Data Access Processor points at
Success Query/s	Number of successful queries per second
Failed Query/s	Number of failed queries per second
Failure (%)	Percentage of queries that failed
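The Failure (%) column is presumably derived from the success and failure rates in the same table. The sketch below assumes it is the failed rate divided by the sum of the successful and failed rates; that formula and the `failure_percent` helper are assumptions for illustration, not a statement of how the dashboard query is written.

```python
def failure_percent(success_per_s, failed_per_s):
    """Percentage of queries that failed, given the two per-second rates.

    Returns 0.0 when no queries were observed, to avoid dividing by zero.
    """
    total = success_per_s + failed_per_s
    if total == 0:
        return 0.0
    return 100.0 * failed_per_s / total


print(failure_percent(9, 1))  # → 10.0
print(failure_percent(0, 0))  # → 0.0
```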

Kubernetes Events

List of Kubernetes events raised in the namespace in the time range.

metrics	description
Time	Time of event
Reason	Reason for the event
Object	Object raising the event
Message	Message details