Health
This section provides information on how to act when issues arise with the health of your pipelines and databases.
kdb Insights Enterprise enables you to view the health and age of the pipelines and databases of your deployment. This gives you a real-time view of the state of key parts of your data ingestion pipeline, providing early indicators of issues that may impact your system.
The Database and Pipeline index pages in the UI display the health of your databases and pipelines. This section explains how to interpret the warning messages shown there and outlines actions you can take to resolve them.
Warning and remediating actions
The following table lists the warning messages displayed on the Database and Pipeline index pages. The Degraded State column contains links to the steps you can take to correct or improve the situation.
| Warning Message | Degraded State |
|---|---|
| Encountered schema mismatch for table {table} | Shape Mismatch |
| Storage is lagging, waiting on {number} pending reload signals | Storage Lagging |
| Encountering memory pressure and triggering emergency interval writedown | Emergency EOI |
| There is a recognized schema mismatch for [table name] | Stream Integrity Error |
| Log truncated before expected truncation frequency | Truncation before expected interval |
| Encountered {error} error in operator {operator} | Processing Error |
| Pipeline error | Controller initialization fails |
| Pipeline error | Worker initialization fails |
| Pipeline degraded | Missed heartbeats |
| Pipeline degraded | Failed to process data |
The following sections explain why these errors are generated and the steps you can take to address them.
Getting Support
If your system encounters an issue not covered in this section, report the details to KX via the Support Portal to help us enhance our health coverage.
Shape Mismatch
The Data Access Process (DAP) and/or Storage Manager (SM) has received data from the Reliable Transport (RT) stream that does not match the schema definition of the database, and cannot proceed with ingesting data.
A shape mismatch warning can be caused by one of the following:

- Column name mismatch for a given table, indicated by a `Column name mismatch for table {table}` message.
- Column type mismatch, indicated by a `Column type mismatch for table {table}` message.
- Nested column type mismatch, indicated by a `Nested column type mismatch for table {table}` message.
- Array column type mismatch, indicated by an `Array column type mismatch for table {table}` message.
The first action to take is to review the application logs to confirm the type of mismatch. For example, a column name mismatch error occurs if you push data to a table with a column titled `trade`, but the schema defines that column as `trades`.
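To make that example concrete, here is a minimal q sketch (generic q, not the Insights ingestion code) showing how such a mismatch can be detected by comparing an incoming batch against the schema definition:

```q
/ Minimal sketch: compare an incoming batch against a schema definition
schema:([] time:`timestamp$(); sym:`symbol$(); trades:`float$())  / schema defines column `trades
batch:([] time:1#.z.p; sym:1#`AAPL; trade:1#100f)                 / incoming data uses `trade
cols[schema]~cols batch    / 0b here flags a column name mismatch
meta[schema]~meta batch    / comparing meta also surfaces column type mismatches
```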
Also, consider that the issue could be caused by:
- the data client streaming data to kdb Insights Enterprise
- a transform operation being performed by the Stream Processor during ingest
- the definition of the schema at the storage level itself.
The course of action depends on what you determine the cause of the issue to be:

- The schema definition at the storage layer is wrong:
    - In the UI, tear down your database and edit your schema to the correct definition, then save and redeploy, as outlined in Modifying a schema. The Storage Manager applies the schema change automatically, after which you can bring up all processes to resume operation.
    - Using the CLI, tear down your package, edit the schema in the package, then save and redeploy.
- The source data is incorrect: investigate and determine the necessary code changes in your data client.
- The Stream Processor is performing an incorrect transformation: review the transformation logic in your pipeline and correct it before redeploying.
Storage Lagging
A storage lagging warning means the Storage Manager is falling behind on reload signals (DAPs are told to reload after each end-of-interval (EOI) write to disk completes). Data is not written down to disk while the system is in this state.
There are two options for remediation:

- Monitor your system closely. This may be a temporary state while the system is under pressure, and kdb Insights Enterprise can catch up from it on its own.
- If you encounter this issue repeatedly or over an extended period, investigate further:
    - Review the Storage Manager logs to assess if there are any errors.
    - Check the cluster to determine the following:
        - Is the CPU maxed out? The Database Overview screen indicates if the DAPs, SM, or RT are maxed out.
        - Are there sufficient threads assigned to the EOI process? (See the sketch after this list.)
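As a quick check, if you can attach to the relevant q process, generic q introspection (not an Insights-specific health API) reports the threads available to it:

```q
/ Minimal sketch using generic q introspection; run inside the process you are checking
system"s"    / secondary threads the process was started with (-s); 0 means single-threaded
```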
Emergency EOI
If a DAP does not have enough RAM to hold an entire interval of data in memory, it can trigger an emergency EOI to release memory pressure. This may indicate that there is a spike in data volume or the system is falling behind in ingesting data.
This is an early warning sign but not necessarily indicative of a problem. Emergency EOIs are specifically designed to protect the system by intervening before a memory issue occurs.
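If you have access to the affected q process, a rough memory-headroom check is possible with the built-in `.Q.w` utility; again, this is a generic q sketch, not an Insights-specific health check:

```q
/ Minimal sketch: how close is this q process to its -w memory limit?
w:.Q.w[]                          / memory stats in bytes: used, heap, peak, wmax, ...
$[0<w`wmax; w[`heap]%w`wmax; 0n]  / fraction of the cap in use; 0n when no -w limit is set
```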
In the event of this warning:

- Monitor your system closely. If this is a one-off occurrence, it is likely the system received a temporary spike in data volumes or experienced temporary operational sluggishness.
- If this is a repeating warning, investigate further:
    - Have your data volumes changed?
    - Have you reduced the resources allocated to your DAP and SM pods?

You might need to adjust your system resource allocation to reflect the new operational state.
Stream Integrity Error
If a DAP or SM receives a `badmsg`, `badtail`, or `skip-forward` event from RT, data has been lost. For more information on these events, refer to Other events.
Truncation Before Expected Interval
RT log truncation occurs at a specified frequency, but can occur ahead of schedule if the disk is under pressure. If truncation occurs before an expected interval, the infrastructure may be misconfigured.
RT works by moving log files between clients and the server. To stop these log files consuming too much disk space, RT truncates and archives them. An RT log file can grow to 1GB in size before it is rolled, that is, a new log file is created and subsequent messages are appended to it.
The rate of log file truncation is controlled by three different configurations:

- Time: the log file is truncated when a certain time threshold has been met. This threshold is the time since the log file was rolled.
- Disk: the amount of disk space consumed by RT log files before they are truncated.
- Limit: the cumulative size the log files can consume before they are truncated.
In a healthy system, the time configuration is the value determining when the log files are truncated. If log files are being truncated due to either disk or limit, this is considered to be a misconfigured system and RT is marked as being in a degraded state.
To avoid this, ensure there is adequate space allocated to the location where RT logs are written. If the system does go into a degraded state, RT continues to truncate the log files; to come out of the degraded state, you must resize your PVC. Refer to RT stream log archival for more detail on managing logs and PVC sizing.
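To see how much space the logs currently consume, you can total the file sizes on the RT log volume. A minimal q sketch, assuming a hypothetical mount path:

```q
/ Minimal sketch; the directory below is a hypothetical mount point for RT logs
dir:hsym`$"/s/rt/logs"       / adjust to wherever your RT log PVC is mounted
files:` sv'dir,'key dir      / build full paths for each entry in the directory
sum hcount each files        / total bytes consumed, to compare with PVC capacity
```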
Controller initialization fails
This can occur while setting up the pipeline workers due to network issues or other transient failures.
Check the controller logs for more information.
Worker initialization fails
This occurs when initializing the worker and typically indicates a problem loading the pipeline spec. It could also be a connectivity problem between the worker and the controller.
Check the worker logs.
Missed heartbeats
This typically occurs because the worker is busy processing data, and it usually corrects itself after a period of time.
Check pipeline status and worker logs.
Failed to process data
This is typically due to an issue with the incoming data, for example, a shape or type issue. You can also debug the pipeline using a Full or Quick Test.
Check the worker logs.
Configure Health Monitoring
As of kdb Insights Enterprise version 1.11.0, this monitoring is enabled by default, but it can also be explicitly turned on or off with the following setting in your values.yaml file:
```yaml
healthz:
  enabled: true   # set to false to disable
```