KX Insights Azure Monitoring Workbooks

The KX Insights Workbooks on Azure bring together the metrics the user needs to monitor the performance and status of the KX Insights Platform and the Azure Cloud infrastructure in one centralized view.

Azure Workbooks are built on Microsoft Azure Log Analytics data, which lets the user obtain performance statistics from the system and integrates tightly across Microsoft-supported deployments.

The KX Insights Workbooks are automatically deployed alongside each KX Insights Platform to assist with monitoring the performance of the platform.

Getting started

  1. Go to Azure Homepage and click Resource Group

  2. Select KX Insights Workbook

Because the Workbook tracks multiple deployments, the user can switch between Subscriptions without leaving the screen.

The user needs to select Subscription, Cluster Name, Workspace and Time-Range.

  1. Make your selection on the main tab.

    A set of tabs below will help the user navigate each metric category.

  2. Make your selection on the sub tabs.

Tabs

Cluster Overview

Metrics shown on this tab provide a general health overview of the hardware underlying Azure and the KX Insights Platform. The tab gives a Kubernetes cluster-level overview of CPU, memory and disk usage.

| metric | description | risk |
| Cluster CPU | Percentage of CPU utilised by the platform. | |
| Cluster Memory | Percentage of available memory utilised by the platform. | |
| Cluster Disk Usage % | Percentage of available disk utilised by the platform. | |

Nodes

Information shown on this tab covers the nodes that belong to the selected Kubernetes cluster, where each node is a virtual or physical machine. It can be identified as “kxinsightaks” (KX Insights Azure Kubernetes Service).

| metric | description | risk |
| Node CPU | Percentage of CPU utilised to run the system and functions. | |
| Node Memory | Percentage of memory (RAM) utilised to run the system and functions. | |
| Node Disk Usage % | Percentage of disk utilised to store ingested data. | |

| metric | description |
| Received Bytes | Rate at which data is received. |
| Sent Bytes | Rate at which data is sent. |
| Disk Busy % | Utilisation of the disk by transactions and access requests. |
| Read Bytes | Data read from disk. |
| Written Bytes | Data written to disk. |
| Disk IOPs | Total input/output operations being performed. |
| Network Bytes IN | Network bytes received. |
| Network Bytes OUT | Network bytes transmitted. |
| Errors IN | Total errors receiving data. |
| Errors OUT | Total errors transmitting data. |
| IOPs in progress | Input/output operations currently in progress. |

Pods

A pod is a group of one or more running containers (each container can run one or more processes). Information shown on this tab relates to the pods of both the KX Insights Platform deployment and Azure Kubernetes Service (AKS).

Azure Kubernetes Service (AKS) relies on controllers to monitor and manage pods and to coordinate resources for software applications. Namespaces provide a mechanism for isolating groups of resources within a single cluster. Resource names must be unique within a namespace, but not across namespaces. Namespace-based scoping applies only to namespaced objects (e.g. Deployments, Services, etc.) and not to cluster-wide objects (e.g. StorageClass, Nodes, PersistentVolumes, etc.).

| metric | description | risk |
| CPU Used | CPU used by each pod. | |
| Memory Used | Memory used by each pod. | |
| Pod Status by Node | List of all pods and their current health: Total, Running, Succeeded, Pending, Failed, Unknown. | |
| Controllers per Namespace | Total number of controllers deployed in each namespace within the deployment. | |

Rook-Ceph

Information shown on this tab relates to Rook-Ceph, a storage management tool used on Kubernetes. It automates the storage management processes of the system, making storage self-healing, self-managing and self-scaling.

Note

These charts are only populated if the user chose to deploy Rook-Ceph during the configuration of their KX Insights Platform. If the user does not select Rook-Ceph at deployment time, only the metrics for the alternative storage class, Azure NFS, are available in the Workbook and the Rook-Ceph charts are empty.

Rook-Ceph uses Object Storage Daemons (OSDs) to manage devices and ensure data can be accessed, relies on Pools for resilience against data loss, and stores data as Objects.

| metric | description | risk |
| Cluster State | System’s health status: Healthy, Warning, Error. | |
| Number of OSDs | Number of Object Storage Daemons deployed on Ceph. | |
| Number of OSDs Up | Number of OSDs running. | If Number of OSDs ≠ Number of OSDs Up, the Cluster State changes status. |
| Number of Pools | Number of Pools deployed. | |
| Number of Objects | Number of Objects deployed. | |
| Cluster Disk usage % | Percentage of disk utilised. | |
| Read/Write bytes | Total amount of data written and read by the OSDs of Ceph. | |
| Read/Writes | Number of read and write operations. | |
| Pool stored % | Disk space by Pool. | |
| Pool Stored bytes written | Rate at which each Pool writes data. | |
| Pool Stored bytes read | Rate at which data is read. | |
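
The relationship between Number of OSDs and Number of OSDs Up can be expressed as a small check. This is a minimal sketch only: the function name `osd_status` and its two inputs are hypothetical values read off the Workbook by hand, not part of any KX Insights or Ceph API.

```python
def osd_status(num_osds: int, num_osds_up: int) -> str:
    """Compare deployed OSDs with running OSDs (both values are assumed
    to be read from the Rook-Ceph tab)."""
    if num_osds_up < 0 or num_osds_up > num_osds:
        raise ValueError("num_osds_up must be between 0 and num_osds")
    # Every deployed OSD is running: no change in Cluster State is expected.
    if num_osds_up == num_osds:
        return "ok"
    # At least one OSD is down: the Cluster State is expected to change.
    return "degraded"


print(osd_status(3, 3))
print(osd_status(3, 2))
```

A "degraded" result here only mirrors the rule in the table; the Cluster State chart remains the place to confirm the actual status.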

Assemblies

This section is used to monitor data flow through each assembly in the KX Insights Platform. An assembly is the entity that represents the resources needed to ingest data into the platform, transform it and store it in the database. An assembly includes a schema and a database, alongside one or more streams and pipelines.

This tab provides information about the volume of data and number of messages flowing through each assembly. It also provides lower-level details about the messages/sec, bytes/sec and the average message size per Assembly and Stream. A stream, also known as a Reliable Transport (or RT), is a component which transports data into the platform and between components of the platform.

All Assemblies view

| metric | description | risk |
| Stream Messages In bytes/sec | Rate at which data is passed into a Stream at a given time. This may be from an external source or from a Stream Processor. | |
| Stream Messages Out bytes/sec | Rate at which data is passed out of a Stream. This may be to the Stream Processor or the Storage Manager. | Data is expected to flow through a Stream, so Stream Messages Out should equal Stream Messages In unless filtering rules are in place to filter out certain messages. |
| Database Messages In/sec | Rate of data flow from a Stream into each of the database tiers. | |

Stream Details

| metric | description | risk |
| Stream Messages In bytes/sec | Rate at which data is passed into a Stream at a given time. This may be from an external source or from the Stream Processor. | |
| Stream Messages In/sec | Total number of messages being passed through a Stream. | |
| Average Size of Messages in bytes | | |
| Stream Messages Out bytes/sec | Rate at which data is passed from the Stream Transport (RT) to KX Insights or the Storage Manager. | If Stream Messages Out ≠ Stream Messages In, data may be trapped. |
| Stream Messages Out/sec | Rate at which data is passed out of a Stream. This may be to the Stream Processor or the Storage Manager. | Data is expected to flow through the Stream. The rate of "Stream Messages Out" may differ from the rate of "Stream Messages In" only if filtering rules are in place to filter out certain messages. |
| Average Size of Messages Out bytes | Average size of each message flowing through a Stream. | |
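
The comparison between Stream Messages In and Stream Messages Out can be sketched as follows. The function name, its parameters and the 5% tolerance are assumptions chosen for illustration; the Workbook itself defines no such threshold.

```python
def stream_out_lags_in(in_bytes_per_sec: float,
                       out_bytes_per_sec: float,
                       tolerance: float = 0.05) -> bool:
    """Return True when the out rate falls short of the in rate by more
    than `tolerance` (a hypothetical relative threshold)."""
    if in_bytes_per_sec <= 0:
        # Nothing is flowing in, so nothing can be held up in the Stream.
        return False
    shortfall = (in_bytes_per_sec - out_bytes_per_sec) / in_bytes_per_sec
    return shortfall > tolerance


print(stream_out_lags_in(1000.0, 1000.0))
print(stream_out_lags_in(1000.0, 900.0))
```

A True result is only a prompt to investigate: where filtering rules are in place, a lower out rate is expected.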

Database Tier details

Quick summary of the amount of data being passed from the Stream into the Database Tier.

| metric | description |
| Records In/sec | Rate of records entering each database tier. |
| Messages In/sec | Total messages entering the different database tiers. |
| Records per message | Total records contained in each message. |
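
One way to relate these three metrics is to divide the record rate by the message rate, which gives an average number of records per message. This is a sketch under that assumption; the source does not state how the Workbook derives the value, and the function name is hypothetical.

```python
def records_per_message(records_per_sec: float, messages_per_sec: float) -> float:
    """Average records carried by each message over the same interval."""
    if messages_per_sec <= 0:
        # No messages arrived, so there is no meaningful average.
        return 0.0
    return records_per_sec / messages_per_sec


# Example: 5000 records/sec arriving in 50 messages/sec.
print(records_per_message(5000.0, 50.0))
```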

DB Ingestion

This section provides a deeper look into how data is passed through the different database tiers.

Real Time Stream

This tab depicts the number of data messages received by each tier from a Stream.

| metric | description |
| Messages per sec | Total messages entering the different database tiers. |
| Records per message | Number of records inside each message at a given point in time entering the different database tiers. |

Note

It is expected that each tier receives the same number of messages.
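
The expectation that every tier receives the same number of messages can be checked with a short helper. This is an illustrative sketch: the function name and the tier labels in the example are assumptions, and the counts would be read from the Workbook.

```python
from typing import Dict, List


def tiers_behind(messages_by_tier: Dict[str, int]) -> List[str]:
    """Return the tiers that received fewer messages than the busiest tier."""
    if not messages_by_tier:
        return []
    expected = max(messages_by_tier.values())
    return sorted(tier for tier, count in messages_by_tier.items()
                  if count < expected)


print(tiers_behind({"rdb": 120, "idb": 118, "hdb": 120}))
```

An empty list means all tiers are in step; any tier named in the result received fewer messages than the others over the same period.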

Intraday

This tab depicts how data moves from a Real Time Database (RDB) to an Intraday Database (IDB). This occurs at regular intervals throughout the day; by default, every 10 minutes.

During an End of Interval process (EOI), data for the last 10 minutes is transferred to the IDB, where it is held on disk temporarily until it is written to a historical database (HDB) partition at the end of the day (EOD).

| metric | description | risk |
| Duration of last EOI transition | Length of each End of Interval process. | |
| Records written during last EOI | Amount of data held in RDB that has been written to IDB during the last EOI. | If the data stream has a steady data flow then the number of written records between each transition should be consistent. |
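
As a worked example, the default 10-minute interval gives 24 × 60 / 10 = 144 EOI transitions per day. The sketch below computes that figure and tests whether a series of records-written values is consistent. The function names and the 20% deviation limit are assumptions for illustration, not values defined by the platform.

```python
from statistics import mean
from typing import Sequence


def eoi_runs_per_day(interval_minutes: int = 10) -> int:
    """Number of End of Interval transitions in a 24-hour day."""
    return (24 * 60) // interval_minutes


def is_steady(records_per_eoi: Sequence[int],
              max_relative_deviation: float = 0.2) -> bool:
    """True when every EOI wrote within the allowed deviation of the mean."""
    if not records_per_eoi:
        return True
    avg = mean(records_per_eoi)
    if avg == 0:
        return all(r == 0 for r in records_per_eoi)
    return all(abs(r - avg) / avg <= max_relative_deviation
               for r in records_per_eoi)


print(eoi_runs_per_day())
print(is_steady([1000, 1050, 980]))
```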

Historical Database

This tab depicts how the historical database grows with each End of Day process (EOD). By default this occurs once a day.

| metric | description | risk |
| HDB Size | Current size of the HDB. | |
| Number of HDB Partitions | Current number of partitions in HDB. | |
| Records Written During Last EOD Transition | Amount of data transferred to the HDB during an EOD process. | If the data stream has a steady data flow then the number of written records between each transition should be consistent. |

DB Queries

This section provides information about all queries requested by processes that are either internal or external to the platform.

These queries are actioned by the following components: Data Access Process (DAP), Resource Coordinator, Service Gateway and Aggregators.

Data Access Request

This tab shows how quickly and how successfully each database tier actions a request.

| metric | description |
| Request Duration by Database tier | Time taken to complete a request on each database tier. |
| Failed Requests by Database Tier/sec | Number of failed data requests on each database tier. |

Resource Coordinator

The Resource Coordinator takes each request and sends it on to each database tier that needs to provide data to return the results of the query.

| metric | description | risk |
| Request Completion Time | Speed at which the system completes requests. | An increase could indicate several things: a large number of requests is putting the system under pressure, some requests expect a large volume of data, or there is a resource issue in the platform. |
| Queue Length | Total number of requests that are queued with the Resource Coordinator and have not yet been processed. | If this is high or increasing, the system is under pressure and requests are building up. |
| Connected Components | Number of components connected to the Resource Coordinator, including DAPs and Aggregators. | |
| Retry Count | Number of retries for the requests. | If the retry count is not zero, resources could be under pressure or an error is occurring when running the request. |
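
The Queue Length and Retry Count risks can be combined into one check. This is a minimal sketch: the function names and the three-sample window are hypothetical, and the inputs are assumed to be values sampled from the Workbook over time.

```python
from typing import List, Sequence


def queue_is_building(queue_lengths: Sequence[int], window: int = 3) -> bool:
    """True when the last `window` samples are strictly increasing."""
    if window < 2:
        return False
    recent = list(queue_lengths[-window:])
    if len(recent) < window:
        return False
    return all(a < b for a, b in zip(recent, recent[1:]))


def coordinator_warnings(queue_lengths: Sequence[int],
                         retry_count: int) -> List[str]:
    """Collect the risk conditions described for the Resource Coordinator."""
    warnings = []
    if queue_is_building(queue_lengths):
        warnings.append("queue length is increasing")
    if retry_count > 0:
        warnings.append("retry count is not zero")
    return warnings


print(coordinator_warnings([2, 2, 3, 5, 9], retry_count=0))
```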

Service Gateway

The Service Gateway bridges network access and external access requests.

| metric | description | risk |
| Connected Components | Number of components currently connected to the Service Gateway. | A high number of connected components may coincide with a high value for the pending requests if the volume of requests is high. |
| Pending Requests | Number of requests the Service Gateway has not yet processed. | A rise in this metric may indicate a performance issue, as the Service Gateway has a backlog of requests to action. |
| HTTP Requests and Responses | Number of HTTP requests and responses. | If Requests ≠ Responses, the system is not processing requests correctly. |
| IPC Requests and Responses | Number of IPC requests and responses. | If Requests ≠ Responses, the system is not processing requests correctly. |
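
The Requests ≠ Responses rule for HTTP and IPC traffic can be sketched as a per-protocol comparison. The function name and the input layout (a mapping from protocol to a requests/responses pair) are assumptions for illustration.

```python
from typing import Dict, Tuple


def gateway_mismatches(counts: Dict[str, Tuple[int, int]]) -> Dict[str, int]:
    """Return requests minus responses for every protocol where they differ."""
    return {protocol: requests - responses
            for protocol, (requests, responses) in counts.items()
            if requests != responses}


# Example: HTTP is balanced, IPC has three requests without a response.
print(gateway_mismatches({"HTTP": (100, 100), "IPC": (40, 37)}))
```

An empty result means every request counted so far has a matching response.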

Aggregator

The Aggregator combines data from multiple database tiers and tables.

| metric | description | risk |
| Requests in Progress by Pod | Number of aggregation requests being processed by each aggregator. | |
| Errors and Timeouts | Number of aggregation requests that have failed. | |
| Requests by type | Total number of requests by type. | |
| Aggregation Duration | Speed at which each aggregator completes a request. | |