kdb Insights Enterprise Azure Monitoring Workbooks

The kdb Insights Enterprise Workbooks on Azure compile the metrics needed to monitor the performance and status of kdb Insights Enterprise and the underlying Azure cloud infrastructure in one central place.

Azure Workbooks are built on Microsoft Azure Log Analytics data, which lets the user obtain performance statistics from the system and integrates tightly across Microsoft-supported deployments.

The kdb Insights Workbooks are deployed automatically alongside each kdb Insights Enterprise deployment to help monitor its performance.

Getting started

Go to the Azure homepage and click Resource Group.
Select KX Insights Workbook.

Navigate the Workbook

Because the Workbook tracks multiple deployments, the user can switch between Subscriptions without leaving the screen.

The user needs to select Subscription, Cluster Name, Workspace and Time-Range.

Make your selection on the main tab.

A set of tabs below will help the user navigate each metric category.
Make your selection on the sub tabs

Tabs

Cluster Overview

Metrics shown on this tab provide a general health overview of the hardware underlying Azure and kdb Insights Enterprise. The tab gives a Kubernetes cluster-level overview of CPU, memory and disk usage.

metric	description	risk
Cluster CPU	Percentage of CPU utilised by the product.
Cluster Memory	Percentage of available memory utilised by the product.
Cluster Disk Usage %	Percentage of available disk utilised by the product.

Nodes

This tab shows details of the nodes that are part of the Kubernetes cluster. Each node is a virtual or a physical machine. The cluster can be identified as “kxinsightaks” (KX Insights Azure Kubernetes Service).

metric	description	risk
Node CPU	Percentage of CPU utilised to run the system and functions.
Node Memory	Percentage of memory (RAM) utilised to run the system and functions.
Node Disk Usage %	Percentage of disk utilised to store ingested data.

metric	description
Received Bytes	Rate at which data is received.
Sent Bytes	Rate at which data is sent.
Disk Busy %	Utilization of Disk by transactions and access requests.
Read Bytes	Data read from disk.
Written Bytes	Data written to disk.
Disk IOPs	Total Input/Output operations being performed.
Network Bytes IN	Network bytes received.
Network Bytes OUT	Network bytes transmitted.
Errors IN	Total errors receiving data.
Errors OUT	Total errors transmitting data.
IOPs in progress	Input/Output operations that are currently being executed.

Pods

A pod is a group of one or more running containers, each of which can run one or more processes. This tab shows the pods of both the kdb Insights Enterprise deployment and Azure Kubernetes Service (AKS).

Azure Kubernetes Service (AKS) relies on controllers to monitor and manage pods and to coordinate resources for software applications. Namespaces provide a mechanism for isolating groups of resources within a single cluster. Names of resources need to be unique within a namespace, but not across namespaces. Namespace-based scoping applies only to namespaced objects (e.g. Deployments, Services) and not to cluster-wide objects (e.g. StorageClass, Nodes, PersistentVolumes).
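
The naming rule can be illustrated with a small sketch. The registry class below is hypothetical and is not part of Kubernetes or kdb Insights Enterprise. It only shows that a name must be unique within a namespace, while the same name may be reused in a different namespace. The namespace and resource names are invented for the example.

```python
class NamespacedRegistry:
    """Hypothetical in-memory registry keyed by (namespace, name)."""

    def __init__(self):
        self._objects = {}

    def add(self, namespace, name, kind):
        # The key combines namespace and name, so uniqueness is per namespace.
        key = (namespace, name)
        if key in self._objects:
            raise ValueError(f"{name!r} already exists in namespace {namespace!r}")
        self._objects[key] = kind

    def count(self):
        return len(self._objects)


registry = NamespacedRegistry()
registry.add("insights", "gateway", "Service")    # illustrative names only
registry.add("monitoring", "gateway", "Service")  # same name, other namespace: accepted
```

Adding a second "gateway" to the "insights" namespace would raise a ValueError, which mirrors the uniqueness rule described above.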

metric	description	risk
CPU Used	CPU being used.
Memory Used	Memory used by every pod.
Pod Status by Node	List of all Pods and their current Health: Total, Running, Succeeded, Pending, Failed, Unknown.
Controllers per Namespace	Total number of controllers deployed on each namespace within the deployment.

Rook-Ceph

Information shown on this tab relates to Rook-Ceph, a storage management tool used on Kubernetes. It automates the storage management processes of the system, making storage self-healing, self-managing and self-scaling.

Note

These charts are only populated if the user chose to deploy Rook-Ceph when configuring kdb Insights Enterprise. If Rook-Ceph was not selected at deployment time, only the metrics for the alternative storage class, Azure NFS, are available in the Workbook and the Rook-Ceph charts are empty.

Rook-Ceph uses Object Storage Daemons (OSDs) to manage devices and ensure data can be accessed, relies on Pools for resilience to data loss, and uses Objects to store data.

metric	description	risk
Cluster State	System’s health status: Healthy, Warning, Error.
Number of OSDs	Number of Object Storage Daemons deployed on Ceph.
Number of OSDs Up	Number of OSDs running.	If Number of OSDs ≠ Number of OSDs Up, the Cluster State changes status.
Number of Pools	Number of Pools deployed.
Number of Objects	Number of Objects deployed.
Cluster Disk usage %	Percentage of Disk utilised.
Read/Write bytes	Total amount of data written and read by the OSDs of Ceph.
Read/Writes	Number of read and write operations.
Pool stored %	Disk space by Pool.
Pool Stored bytes written	Rate at which each Pool writes data.
Pool Stored bytes read	Rate at which data is read from each Pool.
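
The comparison in the risk column for Number of OSDs Up can be expressed as a simple check. This is a minimal sketch. The function name is hypothetical, and the two counts are assumed to have been read from the Workbook. The function does not query Ceph itself.

```python
def osds_not_up(number_of_osds: int, number_of_osds_up: int) -> int:
    """Return how many deployed OSDs are not running (0 means all are up)."""
    if number_of_osds < 0 or number_of_osds_up < 0:
        raise ValueError("OSD counts cannot be negative")
    if number_of_osds_up > number_of_osds:
        raise ValueError("more OSDs reported up than deployed")
    return number_of_osds - number_of_osds_up
```

A non-zero result corresponds to the condition under which the Cluster State is expected to change.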

Assemblies

This section is used to monitor data flow through each assembly in kdb Insights Enterprise. An assembly is the entity that represents the resources needed to ingest data into kdb Insights Enterprise, transform it and store it in the database. An assembly includes a schema and a database, alongside one or more streams and pipelines.

This tab provides information about the volume of data and number of messages flowing through each assembly. It also provides lower-level details about the messages/sec, bytes/sec and the average message size per Assembly and Stream. A stream, also known as a Reliable Transport (or RT), is a component which transports data into kdb Insights Enterprise and between components of kdb Insights Enterprise.

All Assemblies view

metric	description	risk
Stream Messages In bytes/sec	Rate at which data is passed into a Stream at a given time. This may be from an external source, or from a Stream Processor.
Stream Messages Out bytes/sec	Rate at which data is passed out of a Stream. This may be to the Stream Processor or the Storage Manager.	Data is expected to flow through a Stream, so Stream Messages Out is expected to equal Stream Messages In unless filtering rules are in place to filter out certain messages.
Database Messages In/sec	Rate of data flow from a Stream into each of the database tiers.

Stream Details

metric	description	risk
Stream Messages In bytes/sec	Rate at which data is passed into a Stream at a given time. This may be from an external source or from the Stream Processor.
Stream Messages In/sec	Number of messages passed into a Stream per second.
Average Size of Messages In bytes	Average size of each message passed into a Stream.
Stream Messages Out bytes/sec	Rate at which data is passed from the Reliable Transport (RT) to kdb Insights Enterprise or the Storage Manager.	If Stream Messages Out ≠ Stream Messages In, data may be trapped.
Stream Messages Out/sec	Rate at which messages are passed out of a Stream. This may be to the Stream Processor or the Storage Manager.	Data is expected to flow through the Stream. The rate of "Stream Messages Out" may differ from the rate of "Stream Messages In" only if filtering rules are in place to filter out certain messages.
Average Size of Messages Out bytes	Average size of each message flowing out of a Stream.
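
The stream metrics relate to each other arithmetically, which the following sketch illustrates. It assumes that the average message size is the byte rate divided by the message rate over the same window, and it uses a 5% tolerance when comparing the In and Out rates. Both function names and the tolerance are assumptions made for this example, not values taken from the Workbook.

```python
def average_message_size(bytes_per_sec: float, messages_per_sec: float) -> float:
    """Average bytes per message over the same sampling window."""
    if messages_per_sec == 0:
        return 0.0  # no messages, so there is no meaningful average
    return bytes_per_sec / messages_per_sec


def stream_rates_diverge(in_rate: float, out_rate: float, tolerance: float = 0.05) -> bool:
    """True if Out differs from In by more than the (assumed) relative tolerance."""
    if in_rate == 0:
        return out_rate != 0
    return abs(in_rate - out_rate) / in_rate > tolerance
```

For example, a Stream receiving 2048 bytes/sec in 16 messages/sec has an average message size of 128 bytes. If no filtering rules are in place, a True result from stream_rates_diverge suggests that data may be trapped in the Stream.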

Database Tier details

Quick summary of the amount of data being passed from the Stream into the Database Tier.

metric	description
Records In/sec	Rate of records entering each database tier.
Messages In/sec	Total messages entering the different database tiers.
Records per message	Total records contained in each message.

DB Ingestion

This section provides a deeper look into how data is passed through the different database tiers.

Real Time Stream

This tab depicts the number of data messages received by each tier from a Stream.

metric	description
Messages per sec	Total Messages entering the different database tiers.
Records per message	Number of records inside each message at a given point in time entering the different database tiers.

Note

It is expected that each tier receives the same number of messages.
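
This expectation can be checked with a small sketch, assuming the message counts per tier have been read from the chart for the same time window. The function name and the tier labels in the usage note are hypothetical.

```python
def tier_lag(messages_by_tier: dict) -> dict:
    """Return, for each tier that is behind, how many messages it lags the leading tier by."""
    if not messages_by_tier:
        return {}
    highest = max(messages_by_tier.values())
    return {
        tier: highest - count
        for tier, count in messages_by_tier.items()
        if count != highest
    }
```

An empty result means every tier received the same number of messages. With the counts {"rdb": 100, "idb": 97, "hdb": 100}, the result is {"idb": 3}, indicating that the idb tier has received three fewer messages than the others.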

Intraday

This tab depicts how data moves from a Real Time Database (RDB) to an Intraday Database (IDB). This occurs at regular intervals throughout the day. By default, the interval is 10 minutes.

During an End of Interval (EOI) process, data for the last interval (10 minutes by default) is transferred to the IDB, where it is held on disk temporarily until it is persisted to a historical database (HDB) partition at the end of the day (EOD).

metric	description	risk
Duration of last EOI transition	Duration of the most recent End of Interval process.
Records written during last EOI	Amount of data held in RDB that has been written to IDB during the last EOI.	If the data stream has a steady data flow then the number of written records between each transition should be consistent.
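
As a rough illustration of what "consistent" might mean here, the sketch below counts the EOI transitions in a day and compares the spread of the record counts against their mean. The function names and the 20% threshold are assumptions chosen for the example. The Workbook does not define a threshold.

```python
from statistics import mean, pstdev


def eois_per_day(interval_minutes: int = 10) -> int:
    """Number of EOI transitions in 24 hours for a given interval."""
    return (24 * 60) // interval_minutes


def eoi_counts_consistent(records_per_eoi: list, max_relative_spread: float = 0.2) -> bool:
    """True if the standard deviation is within the (assumed) fraction of the mean."""
    if len(records_per_eoi) < 2:
        return True  # nothing to compare against
    average = mean(records_per_eoi)
    if average == 0:
        return True  # non-negative counts with a zero mean are all zero
    return pstdev(records_per_eoi) / average <= max_relative_spread
```

With the default 10 minute interval there are 144 EOI transitions per day. A sequence such as [1000, 1010, 990, 1005] is treated as consistent, whereas [1000, 1000, 100, 1000] is not.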

Historical Database

This tab depicts how the historical database grows with each End of Day process (EOD). By default this occurs once a day.

metric	description	risk
HDB Size	Current size of the HDB.
Number of HDB Partitions	Current number of partitions in HDB.
Records Written During Last EOD Transition	Amount of data transferred to the HDB during an EOD process.	If the data stream has a steady data flow then the number of written records between each transition should be consistent.

DB Queries

This section provides information about all queries requested by processes that are either internal or external to kdb Insights Enterprise.

These queries are actioned by the following components: Data Access Process (DAP), Resource Coordinator, Service Gateway and Aggregators.

Data Access Request

This tab shows how quickly and how successfully each database tier actions a request.

metric	description
Request Duration by Database tier	Time taken to complete a request on each database tier.
Failed Requests by Database Tier/sec	Number of failed data requests on each database tier.

Resource Coordinator

The Resource Coordinator takes each request and sends it on to each database tier that needs to provide data to return the results of the query.

metric	description	risk
Request Completion Time	Time taken by the system to complete requests.	An increase in this could indicate several things: a large number of requests is putting the system under pressure, some requests are expecting a large volume of data, or there is a resource issue in kdb Insights Enterprise.
Queue Length	Total number of requests that are queued with the Resource Coordinator and have not yet been processed.	If this is high, or is increasing, the system is under pressure and requests are building up.
Connected Components	Number of components connected to the Resource Coordinator, including DAPs and Aggregators.
Retry Count	Number of retries for the requests.	If the retry count is not zero then resources could be under pressure, or an error is occurring when trying to run the request.
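
One way to decide whether Queue Length "is increasing" is to look at the most recent readings. The sketch below is a minimal example under stated assumptions: the function name is hypothetical, the readings are assumed to be ordered oldest to newest, and requiring three strictly increasing samples is an arbitrary choice.

```python
def queue_is_growing(samples: list, min_samples: int = 3) -> bool:
    """True if the last `min_samples` queue-length readings are strictly increasing."""
    recent = samples[-min_samples:]
    if len(recent) < min_samples:
        return False  # not enough readings to judge a trend
    return all(earlier < later for earlier, later in zip(recent, recent[1:]))
```

For the readings [2, 1, 3, 5, 8] the last three values rise steadily, so the function returns True. For [5, 4, 6, 6] it returns False because the queue has stopped growing.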

Service Gateway

The Service Gateway bridges network access and external access requests.

metric	description	risk
Connected Components	Number of components currently connected to the Service Gateway.	A high number of connected components may coincide with a high value for the pending requests if the volume of requests is high.
Pending Requests	Number of requests the Service Gateway has not yet processed.	A rise in this metric may indicate a performance issue as the Service Gateway has a backlog of requests to action.
HTTP Requests and Responses	Number of HTTP requests and responses.	If Requests ≠ Responses, system is not processing Requests correctly.
IPC Requests and Responses	Number of IPC requests and responses.	If Requests ≠ Responses, system is not processing Requests correctly.

Aggregator

The Aggregator combines data from multiple database tiers and tables.

metric	description	risk
Requests in Progress by Pod	Number of aggregation requests being processed by each aggregator.
Errors and Timeouts	Number of aggregation requests that have failed.
Requests by type	Total number of requests by type.
Aggregation Duration	Time taken by each aggregator to complete a request.