Troubleshooting

This guide is meant to help users diagnose system and configuration issues when starting a system for the first time.

General tips

Ensure you're working with the proper versions of images. The DA, RC, GW, and Agg images should always be the same version; mismatched versions may cause unexpected problems.
The log correlator can be used to trace a query through the logs. By default, the GW generates a random GUID as the correlator, but you can supply your own in the request to make scanning the logs easier.
```
(`myAPI;`my`args!1 2;`myCallback;``logCorr!(`;"myLogCorrelator"))
```
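Once a request carries a known correlator, pulling its trail out of a log file is a simple filter. The following Python sketch is illustrative only: the helper names are hypothetical, the sample lines use made-up values in the message formats shown later in this guide, and it assumes the correlator appears in the logs as `logCorr=<value>`.

```python
import re
import uuid


def new_correlator() -> str:
    """Return a fresh GUID string that could be supplied as logCorr."""
    return str(uuid.uuid4())


def lines_for_correlator(log_lines, correlator):
    """Keep only the lines whose logCorr field equals the given correlator."""
    # Require whitespace or end-of-line after the value so that
    # "abc" does not also match "abc2".
    pattern = re.compile(r"logCorr=" + re.escape(correlator) + r"(?=\s|$)")
    return [line for line in log_lines if pattern.search(line)]


# Made-up sample lines for illustration.
sample = [
    "DEBUG [KXI-SG-RC-sg-rc] Received request for resources, api=myAPI logCorr=myLogCorrelator",
    "DEBUG [KXI-SG-RC-sg-rc] Received request for resources, api=myAPI logCorr=myLogCorrelator2",
    "DEBUG [KXI-SG-RC-sg-rc] Enqueuing 1 request portion(s), logCorr=myLogCorrelator",
]
matches = lines_for_correlator(sample, "myLogCorrelator")
```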

Queries

My queries get refused

If the GW returns immediately with an error of the form

'Resource Coordinator connection not established

then the GW has not connected to the RC. You can also check the GW logs for the following messages:

// Error
INFO  com.kx.sg.IPCService - Coordinator connection establish error: java.net.ConnectException: Connection refused (Connection refused). Will try again in 5 seconds...

// Success
INFO  com.kx.sg.IPCService - Connected to coordinator <rc_address>

If the Error message appears multiple times without the Success message, then the GW is unable to connect to the RC. Check the connection details in the GW configuration.
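The rule above (repeated errors with no success message) can be checked mechanically. This is a minimal Python sketch, assuming the GW log is available as a list of lines and that the two message fragments below appear verbatim; the function name is hypothetical.

```python
ERROR_MARK = "Coordinator connection establish error"
SUCCESS_MARK = "Connected to coordinator"


def gw_connection_state(log_lines):
    """Return (state, failures) where state is 'connected', 'failing' or 'unknown'.

    The most recent matching message decides the state; failures counts the
    error messages seen since the last success.
    """
    state, failures = "unknown", 0
    for line in log_lines:
        if SUCCESS_MARK in line:
            state, failures = "connected", 0
        elif ERROR_MARK in line:
            state = "failing"
            failures += 1
    return state, failures
```

A result of `("failing", n)` with a growing `n` suggests the GW's RC connection details need checking.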

My queries always fail

This can happen for multiple reasons. The first element in the response is the response header (see "header" page) and the rc, ac, and ai fields (see "codes" page) contain details on the error. The following table summarizes the possible routing errors from the RC.

rc	ac	ai	cause	solution
NOT_READY (58)	NOT_READY (20)	"No resources connected"	No DAPs (or other RCs) are connected to the RC.	See RC-DA and RC-RC.
NOT_SUPPORTED (12)	NOT_SUPPORTED (17)	"SQL library not loaded"	Attempting to make a SQL query in unsupported q version.	Upgrade q.
APP (5)	*	"SQL parse error"	Error parsing SQL statement.	Fix query.
APP (5)	ARG (23)	"bad args purview dimensions"	Invalid routing arguments.	Check the query: labels must be symbols, and `startTS` and `endTS` must be timestamps.
APP (5)	ARG (23)	"startTS >= endTS"	Invalid request timestamps.	Start time must be less than end time.
NOT_SUPPORTED (12)	DOMAIN (27)	"outside all RC taxonomies"	Request for unknown label values.	Check that labels in query match labels of desired DAPs (assembly file). If this looks correct, then the desired DAPs or RCs may not be connected properly. See RC-DA and RC-RC.
NO_DEST (56)	NO_DEST (17)	"no Agg"	Can't find an aggregator.	See RC-Agg.
TIMEOUT (45)	ERR (10)		Request timed out before completing	See Timeouts.
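If you triage these errors in a script, the table can be turned into a lookup. The sketch below is only an illustration: it assumes you have already extracted the `ai` field from the response header as a string, and the function name is hypothetical. The timeout row has no `ai` text, so it is not covered here.

```python
# Hints keyed by the ai text from the table above.
ROUTING_HINTS = {
    "No resources connected": "No DAPs (or other RCs) are connected to the RC. See RC-DA and RC-RC.",
    "SQL library not loaded": "SQL query in an unsupported q version. Upgrade q.",
    "SQL parse error": "Error parsing the SQL statement. Fix the query.",
    "bad args purview dimensions": "Invalid routing arguments. Check labels, startTS and endTS.",
    "startTS >= endTS": "Start time must be less than end time.",
    "outside all RC taxonomies": "Unknown label values. Check labels against the DAP assemblies, then see RC-DA and RC-RC.",
    "no Agg": "Can't find an aggregator. See RC-Agg.",
}


def routing_hint(ai):
    """Return the hint whose key occurs in the ai string, or None if none match."""
    for fragment, hint in ROUTING_HINTS.items():
        if fragment in ai:
            return hint
    return None
```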

Timeouts

Constant timeouts are usually due to the RC being unable to find a DAP to cover a temporal region of the request, which would point to a DAP not correctly connecting with the RC. You should be able to confirm that this is the case by scanning the RC logs for queued requests that are not dequeued.

// Receive the request
DEBUG [KXI-SG-RC-sg-rc] Received request for resources, api=<api> logCorr=<log_correlator>

// Allocate what it can do
DEBUG [KXI-SG-RC-sg-rc] Allocating resources, logCorr=<log_correlator> gw=<gw_address> daps=<list_daps_enlisted>

// Enqueue what it can't do
DEBUG [KXI-SG-RC-sg-rc] Enqueuing <number> request portion(s), logCorr=<log_correlator>

// On a successful dequeue
DEBUG [KXI-SG-RC-sg-rc] Dequeueing request, dap=<dap_to_dequeue_to> logCorr=<log_correlator>
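Scanning for queued requests that are never dequeued can be automated by tallying the messages per correlator. This is a hedged Python sketch: it assumes the message formats shown above, one "Dequeueing" message per dequeued portion, and that you pass in the lines from every RC involved in the request. The function name is hypothetical.

```python
import re
from collections import defaultdict

ENQUEUED = re.compile(r"Enqueuing (\d+) request portion\(s\), logCorr=(\S+)")
DEQUEUED = re.compile(r"Dequeueing request, dap=.*?logCorr=(\S+)")


def pending_portions(log_lines):
    """Map each correlator to the number of enqueued portions not yet dequeued."""
    enqueued = defaultdict(int)
    dequeued = defaultdict(int)
    for line in log_lines:
        match = ENQUEUED.search(line)
        if match:
            enqueued[match.group(2)] += int(match.group(1))
            continue
        match = DEQUEUED.search(line)
        if match:
            dequeued[match.group(1)] += 1
    # Report only correlators that still have portions waiting.
    return {
        corr: count - dequeued[corr]
        for corr, count in enqueued.items()
        if count > dequeued[corr]
    }
```

Any correlator in the result still has portions waiting for a DAP.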

Note that the receiving RC can also send portions of the request to another RC if the required DAPs are registered with that RC. In this case, the following messages should appear in the logs.

// In the RC that originally received the request
DEBUG [KXI-SG-RC-sg-rc] Sending request portion to RC(s), logCorr=<log_correlator> handle=<handle_number>

// In the secondary RC
DEBUG [KXI-SG-RC-sg-rc] Received request from peer RC, logCorr=<log_correlator> handle=<handle_number> rpID=<internal_IDs>

... (allocating, enqueueing, dequeueing messages as above)

If the number of request portions enqueued is greater than the number of "Dequeueing ..." messages (in any RC participating in the request), then this confirms one or more RCs cannot find a DAP to service a portion of the temporal interval of the request. The problem is likely that one or more DAPs are not connecting correctly to the RC (see RC-DA). To narrow down which DAP is missing:

If you can attach or open a handle to the RC process and there are no active requests, run the following command:
```
.sgrc.i.summarizeDAPs[]
```
This returns a table with the labels, startTS, and endTS for each contiguous set of label values. If any label combination does not cover the full range from startTS=-0Wp to endTS=0Wp, one or more of the DAPs with those labels is missing.

Scan the logs of the RC and DAPs to see which DAP did not register or update its purview correctly (see RC-DA for details).
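The coverage rule for the summary table can be expressed as a small gap finder. This Python sketch works on a copy of the table's rows, not on the RC itself. It assumes each row is a `(labels, startTS, endTS)` tuple with timestamps converted to numbers, and it uses `-inf` and `inf` as stand-ins for `-0Wp` and `0Wp`. All names are hypothetical.

```python
import math


def coverage_gaps(rows):
    """Return {labels: [(gap_start, gap_end), ...]} for label sets with gaps.

    rows is an iterable of (labels, start, end); labels must be hashable.
    """
    spans_by_labels = {}
    for labels, start, end in rows:
        spans_by_labels.setdefault(labels, []).append((start, end))

    gaps = {}
    for labels, spans in spans_by_labels.items():
        cursor = -math.inf  # everything before cursor is covered
        missing = []
        for start, end in sorted(spans):
            if start > cursor:
                missing.append((cursor, start))
            cursor = max(cursor, end)
        if cursor < math.inf:
            missing.append((cursor, math.inf))
        if missing:
            gaps[labels] = missing
    return gaps
```

Each reported gap is a time range for which no DAP with those labels has registered a purview.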

RC-DA

At startup, a DA reaches out to its configured RC and initiates a connection. The DA logs should have a message of the form:

INFO  [rdb] DA Connected to GW on <handle_number>

If this message does not appear, the DA did not connect to the RC. Verify connection details:

If using discovery, the gwAssembly value in the DA's assembly should be the same as the KXI_NAME environment variable in the RC.
- Note: if discovery is properly configured, both the DA and RC processes should have a log message of the form:
```
[DISCOVERY] INFO  Registering uid <host>:<port> for service KXI-<process_type>
```
If this message is absent, discovery is misconfigured.
If not using discovery, the gwEndpoints value in the DA's assembly should be the host:port of the RC.

If the connection is established, the RC logs should contain a message of the form:

INFO [KXI-SG-RC-sg-rc] SGRC Registering DAP, asm=<dap_assembly_name> host=<dap_host> port=<dap_port> handle=<handle> avail=<availability> purview=<purview>

Ensure that the purview matches the expected values. The RC uses these values (except ver) to route requests; they must match the labels in your request.

If there is no registration message, look for one of the following error messages in the RC log:

// Incorrect args to the registration call.
ERROR [KXI-SG-RC-sg-rc] SGRC Incorrect DAP registration param types

// Bad purview.
ERROR [KXI-SG-RC-sg-rc] Invalid DAP purview registration

// Invalid metadata.
ERROR [KXI-SG-RC-sg-rc] Invalid DAP metadata

// Invalid schema definitions.
ERROR [KXI-SG-RC-sg-rc] Invalid DAP schemas

// Unexpected update.
ERROR [KXI-SG-RC-sg-rc] Unknown DAP update

// Invalid purview on update.
ERROR [KXI-SG-RC-sg-rc] Invalid purview update, rejecting DAP

Any of these messages indicates a bug. If none of them appear, look for error messages in the DAP or RC logs indicating that initialization failed before registration was attempted, which would also indicate a bug. In either case, contact technical support.
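To check an RC log for the registration message and the error messages listed above in one pass, something like the following Python sketch may help. It assumes the message fragments appear verbatim in the log lines; the function name is hypothetical.

```python
REGISTERED = "SGRC Registering DAP"
REGISTRATION_ERRORS = (
    "SGRC Incorrect DAP registration param types",
    "Invalid DAP purview registration",
    "Invalid DAP metadata",
    "Invalid DAP schemas",
    "Unknown DAP update",
    "Invalid purview update, rejecting DAP",
)


def registration_report(log_lines):
    """Split RC log lines into (registration messages, registration errors)."""
    registered = [line for line in log_lines if REGISTERED in line]
    errors = [
        line
        for line in log_lines
        if any(fragment in line for fragment in REGISTRATION_ERRORS)
    ]
    return registered, errors
```

An empty first list together with an empty second list points at a failure before registration was attempted.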

RC-RC

At startup, if using multiple RCs, one RC in every RC pair should initiate a connection to the other. To confirm the RCs are configured to find each other, check that the following message appears.

INFO  [KXI-SG-RC-sg-rc] SGMRC Setting multi-RC mode to <mode>

If not, one of the following messages should appear:

INFO  [KXI-SG-RC-sg-rc] SGRC Setting mono-RC mode

// OR

FATAL  [KXI-SG-RC-sg-rc] SGRC Unrecognized 'KXI_SG_DISC' mode: <mode>

These indicate the KXI_SG_DISC environment variable is either unset (former) or set to an unrecognized value (latter). See "SG configuration page" for details.
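The three startup messages can be told apart with a short scan. This Python sketch assumes the message formats shown above and returns on the first one it finds; the function name and the returned labels are hypothetical.

```python
import re

MULTI = re.compile(r"SGMRC Setting multi-RC mode to (\S+)")
MONO = "SGRC Setting mono-RC mode"
UNRECOGNIZED = re.compile(r"SGRC Unrecognized 'KXI_SG_DISC' mode: (\S*)")


def rc_discovery_mode(log_lines):
    """Return (kind, mode) based on the first mode message in the RC log."""
    for line in log_lines:
        match = MULTI.search(line)
        if match:
            return ("multi", match.group(1))
        if MONO in line:
            return ("mono", None)  # KXI_SG_DISC is unset
        match = UNRECOGNIZED.search(line)
        if match:
            return ("unrecognized", match.group(1))
    return ("unknown", None)
```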

Assuming discovery mode is correctly set, for every pair of RCs, one RC should attempt to connect to the other (which one is not guaranteed). Suppose RC_0 connects to RC_1. Check that the following messages appear in the logs:

// RC_0
INFO  [KXI-SG-RC-sg-rc-0] SGRC Attempting connection to RC, name=<rc_names> hostport=<rc_host:ports>

// RC_1
INFO  [KXI-SG-RC-sg-rc-1] SGRC Registering RC, name=<rc_0_name> host=<rc_0_host> port=<rc_0_port>
INFO  [KXI-SG-RC-sg-rc-1] SGRC Reciprocating registration with RC

// RC_0
INFO  [KXI-SG-RC-sg-rc-0] SGRC Registering RC, name=<rc_1_name> host=<rc_1_host> port=<rc_1_port>

In the event of a communication failure, there may be a log message of the form "Error sending: <error>". The RCs retry after a few seconds, so these messages can be ignored unless they repeat constantly.

If these messages do not appear in either RC, then one or both are incorrectly configured.

If using "kubernetes" discovery mode, ensure that the annotations are correctly set in the pod metadata:

kind: Pod
metadata:
  annotations:
    kxi-kdisc: "${Name of the RC container}: sc=KXI-SG-RC port=${Container port number}" # Check these!

The name of the RC container in the annotation should match the spec.containers.name value of the RC container:

spec:
   containers:
   - name: ${Name of the RC container}
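The annotation and container name can be compared in a quick sanity check. The Python sketch below assumes the annotation value follows exactly the `<name>: sc=<service> port=<port>` shape of the template above; the real parser may be more lenient. The function name and example values are hypothetical.

```python
import re

ANNOTATION = re.compile(r"^(?P<name>[^:\s]+):\s*sc=(?P<sc>\S+)\s+port=(?P<port>\d+)$")


def check_kdisc(annotation, container_names):
    """Return a list of problems found in a kxi-kdisc annotation value."""
    match = ANNOTATION.match(annotation.strip())
    if not match:
        return ["annotation does not look like '<name>: sc=<service> port=<port>'"]
    problems = []
    if match["name"] not in container_names:
        problems.append("container name not found in spec.containers")
    if match["sc"] != "KXI-SG-RC":
        problems.append("sc is not KXI-SG-RC")
    if not 1 <= int(match["port"]) <= 65535:
        problems.append("port is out of range")
    return problems
```

For example, `check_kdisc("sg-rc: sc=KXI-SG-RC port=5050", ["sg-rc"])` returns an empty list.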

RC-Agg

At startup, the Agg should open a connection to its RC. The following messages should appear in the Agg's logs:

INFO [KXI-SG-AGG-sg-agg] SGAGG Attempting to register with RC, hp=:<rc_host>:<rc_port>
INFO [KXI-SG-AGG-sg-agg] SGAGG Connected to RC, handle=<handle_number>

If, instead, the following message appears:

FATAL [KXI-SG-AGG-sg-agg] Must define KXI_SG_RC_ADDR env variable

then the KXI_SG_RC_ADDR environment variable is not set (it should be KXI_SG_RC_ADDR="<rc_host>:<rc_port>"). If the following messages appear repeatedly:

INFO [KXI-SG-AGG-sg-agg] SGAGG Attempting to register with RC, hp=:<rc_host>:<rc_port>
WARN [KXI-SG-AGG-sg-agg] SGAGG Unable to connect to RC

Check that the RC's host and port are set correctly.
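A malformed KXI_SG_RC_ADDR value can be caught before starting the Agg. This Python sketch only validates the `<rc_host>:<rc_port>` shape described above; it does not test whether the RC is reachable. The function name is hypothetical.

```python
import os


def rc_addr_problem(value):
    """Return a description of what is wrong with value, or None if it looks valid."""
    if not value:
        return "KXI_SG_RC_ADDR is not set"
    host, sep, port = value.rpartition(":")
    if not sep or not host:
        return "expected <rc_host>:<rc_port>"
    if not (port.isascii() and port.isdigit()) or not 1 <= int(port) <= 65535:
        return "port must be an integer between 1 and 65535"
    return None


if __name__ == "__main__":
    # Check the value in the current environment.
    print(rc_addr_problem(os.environ.get("KXI_SG_RC_ADDR")))
```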

Once the connection is established, a confirmation message should appear in the RC's logs:

INFO [KXI-SG-AGG-sg-agg] Registering Agg, host=<agg_host> port=<agg_port> handle=<handle_number>