Troubleshooting¶

This page describes common errors and information to help you troubleshoot data flow and general Reliable Transport (RT) issues.

Use the specific APIs documented and linked below to analyze and diagnose any issues with data flow and management. This guide assumes familiarity with both RT architecture and the q interface.

A troubleshooting overview is provided, with links to specific issues and questions, but you can also jump to common errors and solutions, and frequently asked questions.

Risk of data loss

Always check and confirm steps with your KX Support team before performing actions that may result in data loss.

Troubleshooting overview¶

The diagram below illustrates the troubleshooting process for RT data flow issues. The following questions should help you diagnose and remedy specific problems. Each decision point links to a detailed section below.

Is the publisher running?¶

You can use the rt-clients API to query clients, which includes publishers and subscribers.

RT client APIs

RT is decoupled from its clients (which could be publishers or subscribers), which means these APIs enable you to identify where in the process the data flow has stalled or broken down. The database is both a publisher and a subscriber to RT in kdb Insights Enterprise.

Call `rt-clients`¶

The following example shows how to run the rt-clients API, which returns information about the publishers and subscribers connected to the RT:

$ # user has port forwarded the RT rest port of 6000 to run the API below
$ curl 0:6000/rt-clients | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  2416  100  2416    0     0  60400      0 --:--:-- --:--:-- --:--:-- 58926
{
  "result": {
    "node": "rt-kxi-py-transport-0",
    "publishers": [
      {
        "kxi-py-sm-0": [
          {
            "active": true,
            "connect_time": "2025-06-06T10:07:46.673z",
            "last_message": "2025-06-06T10:10:01.013z",
            "replicated_pos": 1052,
            "total_bytes": 1086,
            "transfer_rate": 0,
            "target_directory": "/s/in/kxi-py-sm-0.kxi-py-transport"
          },
          {
            "active": true,
            "connect_time": "2025-06-06T10:07:46.730z",
            "last_message": "2025-06-06T10:07:46.734z",
            "replicated_pos": 0,
            "total_bytes": 34,
            "transfer_rate": 0,
            "target_directory": "/s/in/kxi-py-sm-0.sm-batchIngest"
          },
          {
            "active": true,
            "connect_time": "2025-06-06T10:07:46.730z",
            "last_message": "2025-06-06T10:07:46.790z",
            "replicated_pos": 0,
            "total_bytes": 34,
            "transfer_rate": 0,
            "target_directory": "/s/in/kxi-py-sm-0.sm-batchDelete"
          },
          {
            "active": true,
            "connect_time": "2025-06-06T10:07:46.817z",
            "last_message": "2025-06-06T10:10:00.100z",
            "replicated_pos": 438,
            "total_bytes": 472,
            "transfer_rate": 0,
            "target_directory": "/s/in/kxi-py-sm-0.sm-prtnend-eoi.dedup"
          },
          {
            "active": true,
            "connect_time": "2025-06-06T10:07:46.837z",
            "last_message": "2025-06-06T10:07:46.841z",
            "replicated_pos": 0,
            "total_bytes": 34,
            "transfer_rate": 0,
            "target_directory": "/s/in/kxi-py-sm-0.sm-schemaChange"
          },
          {
            "active": true,
            "connect_time": "2025-06-06T10:07:46.913z",
            "last_message": "2025-06-06T10:07:46.916z",
            "replicated_pos": 0,
            "total_bytes": 34,
            "transfer_rate": 0,
            "target_directory": "/s/in/kxi-py-sm-0.sm-prtnend-eod.dedup"
          }
        ]
      },

Storage Manager output

While the database is both a publisher and a subscriber to RT in kdb Insights Enterprise, for simplicity the output is limited to the publishers key. By default, the Storage Manager (SM) publishes an EOI event to RT every 10 minutes. For more information, see Storage tier types.

Look for the active: true entry under the publisher you are troubleshooting. If active, you know the publisher is running and can move on to the next stage: is the publisher sending data?

If the publisher shows active: false, check the publisher configuration. For more information, see Publishers.

Is the publisher sending data?¶

Use the rt-clients API to confirm whether the publisher is not only connected, but actively pushing data into RT.

What to look for¶

active: true for the publisher entry in the publishers list.
A recent and advancing last_message timestamp.
Increasing total_bytes values.
Advancing replicated_pos values.
A non-zero transfer_rate when the publisher is sending continuously.

A publisher can be connected yet idle. If the publisher is active but total_bytes and replicated_pos remain static after successive API calls, the publisher is not sending new messages.

Troubleshooting steps¶

Run curl 0:6000/rt-clients | jq and find the publisher entry for the component you are troubleshooting.
Check last_message, total_bytes, and replicated_pos.
Repeat the call after a short interval to verify those values are advancing.
If they advance, the publisher is sending data and you can continue to Is RT merging data?
If they do not advance, inspect the publisher configuration and logs for errors, blocked output, or deduplication issues.
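
The comparison in these steps can be scripted. The following is a minimal Python sketch, not part of the RT tooling. It assumes you have captured two rt-clients responses a short interval apart and parsed them as JSON, and that they have the shape shown in the example output above. The function name advancing_streams is hypothetical.

```python
def advancing_streams(before, after, publisher):
    """Map each target_directory of `publisher` to True if it moved between snapshots."""
    def index(snapshot):
        # Collect the publisher's streams keyed by target_directory.
        streams = {}
        for entry in snapshot["result"]["publishers"]:
            for stream in entry.get(publisher, []):
                streams[stream["target_directory"]] = stream
        return streams

    old, new = index(before), index(after)
    moved = {}
    for directory, stream in new.items():
        prev = old.get(directory)
        moved[directory] = prev is not None and (
            stream["total_bytes"] > prev["total_bytes"]
            or stream["replicated_pos"] > prev["replicated_pos"]
        )
    return moved


# Illustrative data only, trimmed to the fields the check reads.
first = {"result": {"publishers": [{"kxi-py-sm-0": [
    {"target_directory": "/s/in/kxi-py-sm-0.kxi-py-transport",
     "total_bytes": 1086, "replicated_pos": 1052},
    {"target_directory": "/s/in/kxi-py-sm-0.sm-batchIngest",
     "total_bytes": 34, "replicated_pos": 0},
]}]}}
second = {"result": {"publishers": [{"kxi-py-sm-0": [
    {"target_directory": "/s/in/kxi-py-sm-0.kxi-py-transport",
     "total_bytes": 1120, "replicated_pos": 1086},
    {"target_directory": "/s/in/kxi-py-sm-0.sm-batchIngest",
     "total_bytes": 34, "replicated_pos": 0},
]}]}}

print(advancing_streams(first, second, "kxi-py-sm-0"))
```

A stream that maps to False either did not move between the two snapshots or appears only in the second one.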

When the publisher is active but not sending data¶

If the publisher is connected but shows no data movement, check:

publisher logs for startup or runtime errors
RT target directory configuration
sequence ID / deduplication settings
whether the publisher is producing messages at all

If the publisher is inactive, restart it and review its configuration. For publisher-specific details, see Publishers.

Is RT merging data?¶

If RT is receiving data from publisher clients but there is still a problem, there may be an issue that prevents data from being merged by RT.

Call `latest-out-position`¶

The following example shows how to run the latest-out-position API to confirm if RT is actively merging data. This API returns the latest position in the RT stream, which you can use to confirm that the position is advancing. For more information, see Latest output position.

$ curl 0:6000/latest-out-position | jq
{
  "latestOutPosition": 633318697601211
}

Run the latest-out-position API twice and check it is incrementing.

If the latest-out-position doesn't change between runs, no messages are being merged. See Why is RT not merging data?
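
This check can be automated. The sketch below is a minimal Python example under stated assumptions: the RT REST port is forwarded to localhost:6000 as in the curl examples, the response has the shape shown above, and the 30-second interval is an arbitrary choice. The function names are hypothetical.

```python
import json
import time
import urllib.request


def fetch_position(base_url="http://localhost:6000"):
    """Return latestOutPosition from the latest-out-position endpoint."""
    with urllib.request.urlopen(f"{base_url}/latest-out-position") as resp:
        return json.load(resp)["latestOutPosition"]


def is_merging(first, second):
    """True if the output position advanced between two readings."""
    return second > first


def check(base_url="http://localhost:6000", interval=30):
    """Read the position twice, `interval` seconds apart, and compare."""
    first = fetch_position(base_url)
    time.sleep(interval)
    return is_merging(first, fetch_position(base_url))
```

On a quiet system, choose an interval long enough for at least one publisher to have sent a message. Otherwise an unchanged position may only mean that nothing was published.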

Why is RT not merging data?¶

There are a number of potential causes for RT not merging data:

Is there an issue with deduplication?
Is there a disk issue?
Is the merge queue growing?

Is there an issue with deduplication?¶

Deduplication issues can cause problems with data flow and merging. For more information on deduplication, see Deduplication.

The following examples show how to use the dedup-rt-clients API to identify active publishers pushing data into a deduplication stream, and any potential deduplication issues. For more information on the API, see Reliable Transport (RT) REST API.

$ curl http://0:6000/dedup-rt-clients
{
    "result": [
        {
            "dedup_id": "sm-prtnend-eoi",
            "watermark": 8,
            "publishers": [
                {
                    "kxi-py-sm-0": {
                        "last_message_merged": "2025-06-06T11:10:00.121914360",
                        "msgs_merged": 8,
                        "seq_num_merged": 8,
                        "last_message_pushed": "2025-06-06T11:10:00.121914360",
                        "seq_num_pushed": 8
                    }
                }
            ]
        }
    ]
}

Run the dedup-rt-clients API and check that the watermark value is increasing.

When the publisher initializes a connection to RT, it defines the dedup_id. Each message the publisher sends includes a sequence ID in .rt.id. This sequence ID must be greater than the watermark, otherwise the message is treated as a duplicate and ignored. If the sequence ID is not greater than the watermark, reset the deduplication sequence ID.
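
The rule can be modelled in a few lines. The following Python sketch is a simplified illustration, not RT's implementation: it assumes the watermark advances to each accepted sequence ID, and the function name apply_dedup is hypothetical.

```python
def apply_dedup(watermark, seq_ids):
    """Return the final watermark and the sequence IDs that would be merged."""
    merged = []
    for seq_id in seq_ids:
        if seq_id > watermark:
            # Accepted: the message is merged and the watermark moves up.
            merged.append(seq_id)
            watermark = seq_id
        # Otherwise the message is treated as a duplicate and dropped.
    return watermark, merged


print(apply_dedup(8, [7, 8, 9, 9, 10]))  # (10, [9, 10])
```

In this model, a publisher that restarts with a sequence ID below the watermark has every message dropped until its sequence ID passes the watermark.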

Another way to troubleshoot deduplication issues is to run the dedup-rt-clients API twice.

As the following example shows, if the watermark remains unchanged between runs while seq_num continues to increase, this indicates that an active publisher is sending data, but the data is being treated as duplicate, so it is not merged or forwarded to subscribers.

1st run

 "result": 
    {
      "dedup_id": "north-compliance",
      "watermark": 2633,
      "publishers": 
        {
          "sp-client-north-compliance": {
            "last_message": "2025-01-15T14:04:27.769532009",
            "msgs_merged": 2505,
            "seq_num": 2505   // this got reset as the SP checkpoint was deleted
          }
        }
    }

2nd run

 "result": 
    {
      "dedup_id": "north-compliance",
      "watermark": 2633,
      "publishers": 
        {
          "sp-client-north-compliance": {
            "last_message": "2025-01-15T14:04:27.769532009",
            "msgs_merged": 2505,
            "seq_num": 2512   // this got reset as the SP checkpoint was deleted
          }
        }
    }

In the above examples, the seq_num increases from 2505 to 2512. The watermark remains at 2633, which indicates duplicate messages are being received.
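
The same comparison can be expressed as a small helper. This is a minimal Python sketch that assumes you have already extracted the watermark and the publisher's seq_num from each of the two dedup-rt-clients responses. The function name and the returned labels are hypothetical.

```python
def diagnose_dedup(first, second):
    """Classify a publisher's dedup state from two readings taken in sequence."""
    watermark_moved = second["watermark"] > first["watermark"]
    seq_moved = second["seq_num"] > first["seq_num"]
    if not seq_moved:
        return "idle"        # the publisher sent nothing new between runs
    if not watermark_moved:
        return "duplicate"   # the publisher is sending, but RT is discarding
    return "ok"


# Values taken from the two runs above.
run1 = {"watermark": 2633, "seq_num": 2505}
run2 = {"watermark": 2633, "seq_num": 2512}
print(diagnose_dedup(run1, run2))  # duplicate
```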

Reset deduplication¶

To resolve deduplication issues, see the appropriate resolution for the publisher type. For more information on types, see Publishers.

Database: Reset deduplication sequence ID

For database publishers, the following example shows how to use the Storage Manager (SM) API to override the sequence ID being sent to RT. Use the dedup-rt-clients API as shown above to work out the correct sequence ID, then adapt the following example to increase the sequence ID.

Risk of permanent data loss

Always check with KX Support before running this API to reset the sequence ID. Incorrect use can result in permanent data loss.

curl "http://${SM_HOSTPORT}/idbstatus" -H 'Content-Type: application/json' -H 'Accept: application/json' -d '{"seqid":123}'
{"success":true}

# the SM typically runs on port 10001
curl http://localhost:10001/idbstatus -H 'Content-Type: application/json' -H 'Accept: application/json' -d '{"seqid": 33}'
{"success":true}

Stream Processor (SP): Reset deduplication sequence ID

Another deduplication issue can occur if you write your own Stream Processor Helm chart but don't set the Stream Processor group ID, which is used for deduplication. In this case, a new pipeline can end up using the same dedup_id as the old one, so its data isn't promoted.

To avoid this, ensure that KXI_SP_GROUP is set correctly.

Initialize publisher deduplication ID

To enable deduplication, a publisher must initialize its connection to RT with a dedup_id key, as shown:

h:.rt.pub `topic_prefix`stream`publisher_id`dedup_id!("rt-";"mystream";"pub1";"finance")

The publisher sends a sequence ID to RT as part of the RT message header .rt.id. When you enable deduplication, RT maintains a high watermark with dedup_id as the key.

The following example shows a sample client and RT server interaction that enables deduplication:

q)// client
q).pub.getSeqID:{
  id:@[get;hsym`$"/tmp/seqID";0N];
  if[null id; show"No Sequence ID recovered. Is this a new publisher?"];
  :id;
  }
q).pub.setSeqID:{
  set[hsym`$"/tmp/seqID";.pub.id];
  }  
q).z.exit:.pub.setSeqID;  
q)init:{
  .pub.id:0^.pub.getSeqID[];  // start from 0 when no sequence ID was recovered, so that +:1 works
  p:`topic_prefix`stream`publisher_id`dedup_id!("rt-";"mystream";"pub1";"finance");
  .pub.h:.rt.pub p;
  }
q)pub:{
  .pub.id+:1; 
  .rt.id:.pub.id;  .rt.ts:.z.p; .rt.on:.z.h;
  .pub.h(`upd;`trade;`time`sym`price!(.z.p;`FDP;134.0));
  }
q)init[]  // initialize the connection to RT
q)pub[]  // send a payload to RT including an incrementing sequence ID
q).z.ts:pub
q)\t 5000


// RT server side
$ curl http://localhost:6000/dedup-rt-clients
 "result": 
    {
      "dedup_id": "finance",
      "watermark": 1234,
      "publishers": 
        {
          "pub1-finance": {
            "last_message": "2025-01-15T14:04:27.769532009",
            "msgs_merged": 1234,
            "seq_num": 1234
          }
        }
    }

Database

The database publisher stores the RT sequence ID in the idbStatus internal table. You can use the above API to edit this table.

Is the merge queue growing?¶

Each merge instruction must be acknowledged by all nodes in the cluster before it's executed. If nodes can't keep up, instructions queue up. The kxi_rt_merge_queue_size metric tracks the number of instructions waiting to be executed — a persistently growing value indicates that merging has stalled. For more information on available metrics, see Reliable Transport Monitoring.

If the merge queue isn't draining, check for disk issues on any node in the cluster. See Is there a disk issue?

Is there a disk issue?¶

If a node in the cluster runs out of disk space, this can cause problems with merging data. RT requires a majority of nodes to be healthy in order to merge data. If 2 of 3 nodes in a 3 node RT cluster are affected by disk issues, this can cause all nodes in the cluster to stop merging, including those without disk issues. You might see disk-related errors in the pod logs.
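
The majority rule can be written as a one-line check. This Python sketch only illustrates the arithmetic described above, and the function name can_merge is hypothetical.

```python
def can_merge(total_nodes, healthy_nodes):
    """True if a strict majority of the cluster's nodes is healthy."""
    return healthy_nodes > total_nodes // 2


print(can_merge(3, 2))  # True: one unhealthy node out of three is tolerated
print(can_merge(3, 1))  # False: two unhealthy nodes stop merging for all
```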

To check whether merging has stalled across the cluster, search for fsync errors in the pod logs of each RT pod and verify that the merge position is advancing.

Resizing persistent volumes using Kubernetes

You can increase the size of the PVC (PersistentVolumeClaim) in the StatefulSet (STS) by updating its yaml config and recreating the STS. Any PVC with a pending resize is flagged with FileSystemResizePending until restarted. Note that PVC size cannot be decreased. For more information, see Resizing persistent volumes using Kubernetes.

For more detail on merge instructions and how they're created, see Sequencer.

You can do a soft or hard reset to address any merging problems due to disk sizing.

Sizing guide

For detailed advice and guidance on sizing kdb Insights deployments, see Sizing kdb Insights.

Soft reset¶

Triggering a soft reset is a good strategy when you have an offline node that is struggling to rejoin the cluster, or merging problems due to disk sizing.

Performing a soft reset also deletes the RAFT log and triggers RT to use a snapshot to recover. Snapshots are taken after merge instructions are executed.

A snapshot isn't taken after every merge instruction. As a result, if the RAFT log is deleted, some merge instructions are replayed, potentially causing duplicate data to be received by downstream subscribers.

Hard reset¶

You should only trigger a hard reset when you have no other option. A hard reset removes all messages from the RT stream to allow RT to be restarted with an empty stream. For more information, see Hard reset.

Risk of data loss

A hard reset removes all messages from the stream and can result in duplicate or lost data. We recommend you only use this feature if advised to do so by KX Support.

Is RT sending data to subscribers?¶

Use the rt-clients API to check whether data is being sent to subscribers.

The following example shows how to run the rt-clients API, which returns information about the publishers and subscribers connected to the RT. For more information, including detailed examples and responses, see RT clients observability.

Look for an "active": true entry to confirm there is an active subscriber, as this example output shows:

$ curl http://localhost:6000/rt-clients

    "subscribers": [
      {
        "kxi-py-dap-da-0": [
          {
            "active": true,
            "connect_time": "2025-06-09T13:20:00.775z",
            "last_message": "2025-06-09T13:20:06.602z",
            "replicated_pos": 668503069688893,
            "total_bytes": 1167,
            "transfer_rate": 0
          }
        ]
      },
      {
        "kxi-py-sm-0": [
          {
            "active": true,
            "connect_time": "2025-06-09T13:20:03.968z",
            "last_message": "2025-06-09T13:20:06.602z",
            "replicated_pos": 668503069688893,
            "total_bytes": 1167,
            "transfer_rate": 0
          }
        ]
      }
    ]
  }

EOI as a health signal

The SM publishes an EOI event to RT every 10 minutes by default. If the DAP's last_message advances at roughly this interval, this is a good indicator that the DAP is successfully receiving updates.
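
One way to apply this signal is to flag a subscriber whose last_message is older than the EOI interval. The following Python sketch assumes the timestamp format shown in the rt-clients output above and that you supply the current time in the same time zone as those timestamps. The function names and the one-minute slack are hypothetical choices.

```python
from datetime import datetime, timedelta


def parse_rt_timestamp(value):
    """Parse a timestamp such as 2025-06-09T13:20:06.602z (no time zone conversion)."""
    return datetime.strptime(value.rstrip("zZ"), "%Y-%m-%dT%H:%M:%S.%f")


def is_stale(last_message, now,
             max_age=timedelta(minutes=10), slack=timedelta(minutes=1)):
    """True if last_message is older than the EOI interval plus some slack."""
    return now - parse_rt_timestamp(last_message) > max_age + slack


now = datetime(2025, 6, 9, 13, 25, 0)
print(is_stale("2025-06-09T13:20:06.602z", now))  # False
```

If you have changed the EOI interval from its default, pass the configured value as max_age.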

Is my client receiving updates?¶

If you've confirmed RT is sending data to subscribers, your next step is to use the q API to confirm the last message processed by a q client such as a DAP or Stream Processor. In some cases, RT can push updates to the client node but the subscribed q process may not receive updates.

Call `get_rt_position`¶

The following example shows how to use the get_rt_position q API to check if the q client has received updates:

q) // Open a socket connection to my DAP 
q).z.h
`kxi-py-dap-da-0
q)rdb:hopen 5083
q)rdb(`.rt.get_rt_position;`)
stream                                   | hostname        last_message_received         position
-----------------------------------------| -------------------------------------------------------------
/logs/rt/kxi-py-dap-da-0/kxi-py-transport| kxi-py-dap-da-0 2025.06.09D14:10:00.535389752 668503069692092

This shows that RT is delivering the data successfully to the client, which is the database in this case. If the data from a simple insert is not making it into the table, the most likely cause is a data schema mismatch. Check the database's standard output (stdout) logs.

Investigate logs¶

This section describes the most important steps for investigating RT log issues and points you to the detailed diagnostics below.

Confirm whether the problem is on the publisher side or the subscriber side.
Use Call rt-clients to inspect publisher connection status, transfer_rate, replicated_pos, and whether the publisher is actively sending data.
Use Call get_rt_position to verify that a subscribed q client has received and processed the latest RT messages.
If an RT log is present but the q layer is not consuming data, see Why is data not reaching the q layer? for log corruption and re-replication guidance.
If RT reports pending files or an unhealthy state, review RT unhealthy with pending log files.
For broader failure modes and semantics, check Common errors and solutions.

Why is data not reaching the q layer?¶

Corrupt subscriber RT log

If the RT log for the SM grows but data isn’t received by the SM’s q layer, the log may be corrupt. For example, a power cut on the SM could corrupt the RT log, which means the q layer cannot deserialize the data in the logs.

To resolve this problem, you can truncate the latest RT log on the subscriber, which causes the RT log to be re-replicated from RT back onto the subscriber node for processing.

Common errors and solutions¶

Stream integrity error¶

If a DAP or SM receives a badmsg, badtail or a skip-forward event from RT, data has been lost or, in the case of skip-forward, it has been archived by the RT server. For more information on these events, including how to troubleshoot and resolve errors, refer to Other events.

Truncation before expected interval¶

RT log truncation occurs at a specified frequency but can occur ahead of schedule if the disk is under pressure. If truncation occurs before an expected interval, the infrastructure may be misconfigured.

RT works by moving log files between clients and server. To prevent these log files from consuming too much disk space, RT truncates and archives log files. An RT log file can grow to 1 GiB in size before the log file is rolled. That is, another log is created and messages are appended to this new log file.

The rate of log file truncation is controlled by three different configurations:

Time: Log files are truncated when a certain time threshold has been met. This time threshold is measured from when the log file was rolled.
Disk: Log files are truncated when RT's total disk space consumption exceeds a percentage threshold of the overall disk space on the server.
Limit: Log files are truncated when their cumulative size exceeds a configured limit.

In a healthy system, the time configuration is the value determining when the log files are truncated. If log files are being truncated due to either disk or limit, this indicates a misconfigured system and RT is marked as being in a degraded state.
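
The three triggers and the degraded state can be pictured with a short model. This Python sketch is an illustration only: the parameter names and threshold values are hypothetical and are not RT configuration keys, and the real archiver may evaluate the conditions differently.

```python
from datetime import timedelta

GIB = 1024 ** 3


def truncation_triggers(age_since_roll, log_bytes, rt_disk_bytes, disk_bytes,
                        *, max_age, disk_pct, limit_bytes):
    """Return which of the time, disk, and limit conditions currently hold."""
    triggers = []
    if age_since_roll >= max_age:
        triggers.append("time")    # time threshold, measured from the roll
    if rt_disk_bytes > disk_bytes * disk_pct / 100:
        triggers.append("disk")    # RT uses too large a share of the disk
    if log_bytes > limit_bytes:
        triggers.append("limit")   # cumulative log size exceeds the limit
    return triggers


def is_degraded(triggers):
    """Truncation driven by disk or limit indicates a misconfigured system."""
    return "disk" in triggers or "limit" in triggers


triggers = truncation_triggers(
    timedelta(hours=2), log_bytes=5 * GIB, rt_disk_bytes=9 * GIB,
    disk_bytes=10 * GIB, max_age=timedelta(hours=24), disk_pct=80,
    limit_bytes=20 * GIB)
print(triggers, is_degraded(triggers))  # ['disk'] True
```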

To avoid this, ensure there is adequate space allocated to the location where RT logs are being written. If the system goes into a degraded state, RT continues to truncate the log files. To recover from a degraded state, resize your PVC. Refer to RT stream log archival for more details on managing logs and PVC sizing.

RT unhealthy with pending log files¶

This state can occur transiently after a leader change while the new leader reconstructs its state from the latest snapshot and RAFT log. If the cluster recovers on its own within a short time, no further action is needed. If the UNHEALTHY state persists, use the following diagnostic steps.

The rt-clients API may return a message indicating that RT is UNHEALTHY and that log files are pending, similar to this example:

RT is UNHEALTHY with 7 pending log files

If you see this message, connect to the RT leader and run the following code:

seq:hopen 4000
seq({select from .fw.t where any dir like/:("*eoi*";"*eod*")};`)
seq({select from lens_upd where any d like/:("*eoi*";"*eod*")};`)

One symptom to look for is multiple entries for the same log session, similar to below:

q)seq({select from .fw.t where any dir like/:("*eoi*";"*eod*")};`)
dir                                sess no| wd gs                          len   t
------------------------------------------| ------------------------------------------------------------------
kxodh-asm-sm-0.sm-prtnend-eoi.dedup 1521 0 |    `n1`n0`n2!15521 15521 15521 15521 2026.04.30D09:56:53.066501051
kxodh-asm-sm-0.sm-prtnend-eoi.dedup 1522 0 |    `n2`n1`n0!437 437 437       437   2026.04.30D09:56:53.066550570
kxodh-asm-sm-0.sm-prtnend-eod.dedup 1522 0 |    `n2`n1`n0!26 26 26          26    2026.04.30D09:56:53.066556108
kxodh-asm-sm-0.sm-prtnend-eod.dedup 1523 0 |    `n1`n2`n0!26 26 26          26    2026.04.30D09:56:53.066594468

This can indicate that the publisher and sequencer are subscribed to different instances (note the multiple session IDs). Check rt-clients for a non-zero transfer_rate:

{
            "active": true,
            "connect_time": "2026-04-30T09:25:02.831z",
            "last_message": "2026-04-30T09:25:27.857z",
            "replicated_pos": 26792899345645950,
            "total_bytes": 36970,
            "transfer_rate": 0,
            "target_directory": "/s/in/kxodh-asm-sm-0.sm-prtnend-eoi.dedup"
          },

(prd@'20 14 30#'2)vs  26792899345645950
1523 0 382
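
The q expression above splits replicated_pos into three fields of 20, 14, and 30 bits. The following Python sketch performs the same decoding. The field names are an interpretation based on the sess and no columns of the sequencer output, not documented names, and decode_position is a hypothetical helper.

```python
SESSION_BITS, NUMBER_BITS, OFFSET_BITS = 20, 14, 30


def decode_position(position):
    """Split an RT position into (session, log number, offset)."""
    offset = position & ((1 << OFFSET_BITS) - 1)
    number = (position >> OFFSET_BITS) & ((1 << NUMBER_BITS) - 1)
    session = (position >> (OFFSET_BITS + NUMBER_BITS)) & ((1 << SESSION_BITS) - 1)
    return session, number, offset


print(decode_position(26792899345645950))  # (1523, 0, 382)
```

The first value, 1523, is the session to compare against the sess column returned by the sequencer query.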

Try restarting RT and rerunning the first code example:

q)seq:hopen 4000
q)seq({select from .fw.t where any dir like/:("*eoi*";"*eod*")};`)
dir                                sess no| wd gs                       len  t
------------------------------------------| --------------------------------------------------------------
kxodh-asm-sm-0.sm-prtnend-eod.dedup 1    0 |    `n2`n0`n1!-1 -1 -1       0    2026.04.30D10:22:52.723110614
kxodh-asm-sm-0.sm-prtnend-eoi.dedup 1    0 |    `n2`n0`n1!-1 -1 -1       0    2026.04.30D10:22:52.723115791
kxodh-asm-sm-0.sm-prtnend-eod.dedup 0    0 |    `n2`n0`n1!-1 -1 -1       0    2026.04.30D10:22:52.723237071
kxodh-asm-sm-0.sm-prtnend-eoi.dedup 0    0 |    `n2`n0`n1!-1 -1 -1       0    2026.04.30D10:22:52.723243826
kxodh-asm-sm-0.sm-prtnend-eoi.dedup 1523 0 |    `n2`n0`n1!1067 1067 1067 1067 2026.04.30D10:22:52.723312809
q)seq({select from lens_upd where any d like/:("*eoi*";"*eod*")};`)
nd d                                 | s    n t
-------------------------------------| ------------------------------------
n0 kxodh-asm-sm-0.sm-prtnend-eod.dedup| 0    0 2026.04.30D10:23:12.722878115
n1 kxodh-asm-sm-0.sm-prtnend-eod.dedup| 0    0 2026.04.30D10:23:12.722337807
n2 kxodh-asm-sm-0.sm-prtnend-eod.dedup| 0    0 2026.04.30D10:23:12.721849147
n0 kxodh-asm-sm-0.sm-prtnend-eoi.dedup| 1523 0 2026.04.30D10:23:12.722878115
n1 kxodh-asm-sm-0.sm-prtnend-eoi.dedup| 1523 0 2026.04.30D10:23:12.722337807
n2 kxodh-asm-sm-0.sm-prtnend-eoi.dedup| 1523 0 2026.04.30D10:23:12.721849147

If you compare these results to the first results, you should notice less variation in the session IDs and the EOI jobs should run more cleanly.

Frequently asked questions (FAQ)¶

What causes SM replication state corruption ("Inconsistent file detected") errors?¶

This error is thrown by RT replicators when a publisher has deleted log files. This typically happens if the PVC that the RT client was using to write its logs has been deleted and the client is then restarted.

In the case of the Stream Processor, change the name of the pipeline to fix this. This forces the introduction of a new publisher ID to RT, which RT treats as a new publisher.

What conditions typically lead to skip-forward events?¶

RT archives its log files once certain thresholds are met, which means the RT logs that clients subscribe to are removed. If a client subscribes to a log file which RT has archived, then the client receives a skip-forward event. The client will receive the earliest available log file that it has subscribed to which has not been archived. For more information on archiving, see RT Archiver.

If clients are configured to back up log files to object storage before archiving, see Recovering archived logs for information on retrieving them.

Next steps¶

You can also view specific troubleshooting topics for: