Troubleshooting
This guide is meant to help users diagnose system and configuration issues while running the kdb Insights Database.
General tips
- Use matching image versions. DA, RC, GW, Agg, and SM must all be the same version; mismatched versions may cause unexpected problems.
- Use the log correlator to trace a query through the logs. By default, the GW generates a random GUID as the correlator, but you can supply your own in the request to make scanning the logs easier.
(`myAPI;`my`args!1 2;`myCallback;``logCorr!(`;"myLogCorrelator"))
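Once a custom correlator is in use, the logs gathered from each process can be filtered for it. The following is a minimal sketch, assuming the logs are available as plain-text lines; `new_correlator` and `lines_for_correlator` are hypothetical helper names, and the sample log lines are made up for illustration.

```python
import uuid


def new_correlator(prefix="debug"):
    """Build a unique, easy-to-spot correlator string (hypothetical helper)."""
    return f"{prefix}-{uuid.uuid4()}"


def lines_for_correlator(log_lines, correlator):
    """Return only the log lines that mention the given correlator."""
    return [line for line in log_lines if correlator in line]


if __name__ == "__main__":
    corr = new_correlator()
    # Assumed sample lines; real log layouts differ between GW, RC, DA and Agg.
    logs = [
        f"INFO GW request received corr={corr}",
        "INFO GW request received corr=something-else",
        f"INFO DA executing API corr={corr}",
    ]
    for line in lines_for_correlator(logs, corr):
        print(line)
```

A recognizable prefix makes the correlator easy to find even when scanning logs by eye.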
DAP configuration issues
For reference, the elements section of DAP assembly configuration should look similar to this:
elements:                 # Elements section of assembly
  dap:                    # Start of DAP configuration section
    instances:
      hdb:                # Start of config for DAPs with KXI_SC env set to "hdb"
        mountName: hdb    # Mount this DAP provides read access to (must match a name in the `mounts` section)
      rdb:                # Start of config for DAPs with KXI_SC env set to "rdb"
        mountName: rdb
Missing DAP from elements section of assembly
This issue occurs when the assembly file does not have a dap section under the elements section of the assembly. This can be due to the section being missing, or a simple typo as in the example below.
elements:
  dapp:                   # Typo in "dap"
    instances:
      hdb:
        mountName: hdb
      rdb:
        mountName: rdb
Missing service class config under dap instances of assembly
If this startup error occurs, it is because the KXI_SC environment variable set for the DAP does not match any name under elements.dap.instances of the assembly file. To resolve this, compare the KXI_SC set for the DAP with the names in its assembly file.
Mount does not exist within assembly mounts
DAPs fail to start if the mount they are configured to provide access to does not exist within the assembly file. This mount is set by elements.dap.instances.*.mountName. For example, in the above snippet the DAP hdb has a mount called hdb, and the DAP rdb has a mount called rdb.
To resolve this error, ensure that a mounts section exists in the assembly file, and that the mount names there match those defined in the DAP config. An example mounts section for the above might look like this:
mounts:
  rdb:                    # This line defines the name of the mount that `mountName` references
    type: stream
    baseURI: none
    partition: none
  hdb:
    type: local
    baseURI: file:///data/hdb
    partition: date
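The three startup errors above (missing dap section, unknown KXI_SC, unknown mount) can be checked before deployment. The following is a minimal sketch, assuming the assembly has already been loaded into a Python dictionary (for example by a YAML parser); `check_dap_config` is a hypothetical helper and is not part of kdb Insights.

```python
def check_dap_config(assembly, kxi_sc):
    """Return a list of problems for the DAP whose KXI_SC value is `kxi_sc`."""
    elements = assembly.get("elements") or {}
    dap = elements.get("dap")
    if dap is None:
        # Covers both a missing section and a typo such as "dapp".
        return ["no 'dap' section under 'elements'"]

    instances = dap.get("instances") or {}
    instance = instances.get(kxi_sc)
    if instance is None:
        known = ", ".join(sorted(instances)) or "none"
        return [f"KXI_SC '{kxi_sc}' not under elements.dap.instances (found: {known})"]

    mounts = assembly.get("mounts") or {}
    mount_name = instance.get("mountName")
    if mount_name not in mounts:
        return [f"mountName '{mount_name}' of '{kxi_sc}' not defined under mounts"]
    return []


if __name__ == "__main__":
    assembly = {
        "elements": {"dap": {"instances": {
            "hdb": {"mountName": "hdb"},
            "rdb": {"mountName": "rdb"},
        }}},
        "mounts": {"hdb": {"type": "local"}},  # "rdb" mount deliberately missing
    }
    for sc in ("hdb", "rdb", "idb"):
        print(sc, check_dap_config(assembly, sc))
```

The checks run in the same order as the sections above, so the first problem reported corresponds to the first error the DAP would be expected to hit.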
Error mounting database: .DS_Store
.DS_Store is a hidden file automatically created on macOS and used to store custom attributes of a given folder. The presence of these files results in a FATAL error when restarting the database. To resolve this issue, use the following command to delete all such files inside the database folder.
find "$DB_PATH" -name .DS_Store -type f -exec rm {} \;
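Where it is preferable to review the files before deleting them, the same cleanup can be sketched in Python with a dry-run mode. The helper names `find_ds_store` and `remove_ds_store` are hypothetical, and the example reads the database location from a `DB_PATH` environment variable, mirroring the command above.

```python
import os
from pathlib import Path


def find_ds_store(db_path):
    """Return every regular file named .DS_Store below db_path, sorted."""
    return sorted(p for p in Path(db_path).rglob(".DS_Store") if p.is_file())


def remove_ds_store(db_path, dry_run=True):
    """Return the files found; delete them only when dry_run is False."""
    found = find_ds_store(db_path)
    if not dry_run:
        for path in found:
            path.unlink()
    return found


if __name__ == "__main__":
    db_path = os.environ.get("DB_PATH")
    if db_path:
        # Dry run by default: only list what would be removed.
        for path in remove_ds_store(db_path, dry_run=True):
            print("would remove", path)
```

Calling `remove_ds_store(db_path, dry_run=False)` performs the deletion once the listing looks right.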
No mount dir defined - set baseURI of associated mount
The mounts configuration needs to have the baseURI set for the mount so that the DAP can find where to mount the on-disk data from.
Invalid table configuration, exiting
DAPs fail to start if one or more schemas are misconfigured. Before failing with the above error the DAP outputs any detected errors with the schema definition. Possible errors are summarized below:
Error Message | Remediation |
---|---|
assembly.table.%s has invalid name, consider using | Table name is invalid and must be changed |
assembly table.%s missing required key(s): | Add required key to table |
assembly.table.%s columns missing required key(s): | Columns must have a name and type. Add the missing key |
assembly.table.%s has no columns | Remove table, or add columns |
assembly.table.%s column name: '%s' is invalid, please consider using '%s' | Change noted column name to suggestion or something else |
assembly.table.%s.%s column type: %s is invalid | Column type is invalid, change to supported type |
assembly.table.%s missing required key(s): `prtnCol | Partitioned tables require a prtnCol key |
assembly.table.%s.%s prtnCol is not timestamp type | The prtnCol key must be of a timestamp type |
assembly.table.%s %s: %s conflict with %s: %s. Can NOT apply %s# to %s column of %s table. | Add the referenced column with attribute to referenced sortCols key |
assembly.table.%s: %s conflict with parted attribute: Can NOT apply parted attribute to vector columns %d | A vector column type cannot have a parted attribute applied to it |
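A few of the checks in the table can be approximated before the assembly is deployed. The following is a rough sketch only: it assumes each table is a dictionary with a `columns` list of `name`/`type` entries, that partitioned tables are marked with `type: partitioned`, and that the partition column must have the type `timestamp`. `check_table` is a hypothetical helper, and the DAP performs more checks than are shown here.

```python
def check_table(name, table):
    """Return a list of problems for one table definition (subset of the DAP's checks)."""
    problems = []
    columns = table.get("columns") or []
    if not columns:
        problems.append(f"{name}: has no columns")

    for index, column in enumerate(columns):
        missing = [key for key in ("name", "type") if key not in column]
        if missing:
            problems.append(
                f"{name}: column {index} missing required key(s): {', '.join(missing)}"
            )

    # Assumption: partitioned tables are marked with type "partitioned".
    if table.get("type") == "partitioned":
        prtn_col = table.get("prtnCol")
        if prtn_col is None:
            problems.append(f"{name}: missing required key(s): prtnCol")
        else:
            types = {c.get("name"): c.get("type") for c in columns}
            if types.get(prtn_col) != "timestamp":
                problems.append(f"{name}: prtnCol '{prtn_col}' is not timestamp type")
    return problems


if __name__ == "__main__":
    trade = {
        "type": "partitioned",
        "prtnCol": "time",
        "columns": [{"name": "time", "type": "timestamp"}, {"name": "sym"}],
    }
    for problem in check_table("trade", trade):
        print(problem)
```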
No assembly labels defined - set labels in assembly
The RC's routing of DAPs relies on there being labels defined for the assembly, so DAP/RC registration expects that the DAP has some labels attached to the assembly. If none are defined, then the DAP startup fails. To fix this, add at least one label to the labels section of the assembly. For example, to add a region label of eu, add this to the assembly:
labels:
  region: eu
Queries
My queries are not returning data
When no data is returned, either the query succeeded but found no data matching the parameters of the request, or the request itself failed. To tell the difference, look at the rc, ac, and ai fields in the header of the response. If rc and ac are both 0, then from the DAP's perspective the query succeeded and there was no data to return. If either rc or ac is non-zero, ai should give more information about the specific error encountered.
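The decision described above can be written as a small check. This is a sketch that assumes the response header has been converted to a Python dictionary with `rc`, `ac`, and `ai` keys; `diagnose_header` is a hypothetical helper name.

```python
def diagnose_header(header):
    """Classify a response header as a successful (possibly empty) query or a failure."""
    rc = header.get("rc")
    ac = header.get("ac")
    if rc == 0 and ac == 0:
        return "ok: request succeeded; an empty result means no data matched"
    # Any non-zero code means the request failed; ai carries the detail.
    return f"failed: rc={rc} ac={ac} ai={header.get('ai', '')!r}"


if __name__ == "__main__":
    print(diagnose_header({"rc": 0, "ac": 0, "ai": ""}))
    print(diagnose_header({"rc": 5, "ac": 23, "ai": "startTS >= endTS"}))
```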
My queries are refused
If the GW returns immediately with an error of the form:
'Resource Coordinator connection not established
then the GW has not connected to the RC. You can also check the GW logs for the following messages:
// Error
INFO com.kx.sg.IPCService - Coordinator connection establish error: java.net.ConnectException: Connection refused (Connection refused). Will try again in 5 seconds...
// Success
INFO com.kx.sg.IPCService - Connected to coordinator <rc_address>
If the Error message appears multiple times without the Success message, then the GW is unable to connect to the RC. Check the connection details in the GW configuration.
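To apply this check to a long GW log, the two messages can be counted programmatically. The following is a minimal sketch that matches on the message text shown above; `gw_connection_state` is a hypothetical helper, and the log is assumed to be available as a sequence of text lines.

```python
ERROR_MARKER = "Coordinator connection establish error"
SUCCESS_MARKER = "Connected to coordinator"


def gw_connection_state(log_lines):
    """Return (connected, failures) where failures counts errors since the last success."""
    connected = False
    failures = 0
    for line in log_lines:
        if SUCCESS_MARKER in line:
            connected, failures = True, 0
        elif ERROR_MARKER in line:
            connected = False
            failures += 1
    return connected, failures


if __name__ == "__main__":
    sample = [
        "INFO com.kx.sg.IPCService - Coordinator connection establish error: "
        "java.net.ConnectException: Connection refused (Connection refused). "
        "Will try again in 5 seconds...",
    ] * 3
    connected, failures = gw_connection_state(sample)
    print("connected" if connected else f"not connected after {failures} attempts")
```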
My queries always fail
This can happen for multiple reasons. The first element in the response is the response header (see "header" page), and its rc, ac, and ai fields (see "codes" page) contain details on the error. The following table summarizes the possible routing errors from the RC.
rc | ac | ai | cause | solution |
---|---|---|---|---|
NOT_READY (58) | NOT_READY (20) | "No resources connected" | No DAPs (or other RCs) are connected to the RC. | See RC-DAP and RC-RC. |
NOT_SUPPORTED (12) | NOT_SUPPORTED (17) | "SQL library not loaded" | Attempting to make a SQL query in an unsupported q version. | Upgrade q. |
APP (5) | * | "SQL parse error" | Error parsing SQL statement. | Fix query. |
APP (5) | ARG (23) | "bad args purview dimensions" | Invalid routing arguments. | Check the query: labels must be symbols, and startTS and endTS must be timestamps. |
APP (5) | ARG (23) | "startTS >= endTS" | Invalid request timestamps. | Start time must be less than end time. |
NOT_SUPPORTED (12) | DOMAIN (27) | "outside all RC taxonomies" | Request for unknown label values. | Check that labels in query match labels of desired DAPs (assembly file). If this looks correct, then the desired DAPs or RCs may not be connected properly. See RC-DAP and RC-RC. |
NO_DEST (56) | NO_DEST (17) | "no Agg" | Can't find an aggregator. | See RC-Agg. |
TIMEOUT (45) | ERR (10) | | Request timed out before completing. | See Timeouts. |
Timeouts
Constant timeouts are usually due to the RC being unable to find a DAP to cover a temporal region of the request, which points to a DAP not connecting correctly with the RC. When the RC encounters a timeout, it augments the log and the ai field of the request with metadata about the portions of the request that did not complete.
The metadata includes the RC responsible for the request, the current status of the request, and details of items in the queue. The queue details include the labels and temporal range for the request being serviced, and the reason the portion was not sent to a DAP to be serviced.
No DAP covers labels/time range
The RC reports this in cases where no DAP has registered with the appropriate time ranges or labels. If you can attach or open an IPC handle to the RC process and there are no active requests, run the following command on the process:
.sgrc.i.summarizeDAPs[]
This returns a table with the labels, startTS, and endTS for each contiguous set of label values. If any label combination does not cover from startTS=-0Wp to endTS=0Wp, one or more of the DAPs with those labels is missing.
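The coverage rule can be illustrated with a short sketch that looks for gaps per label combination. It assumes the rows of the summary table have been exported as `(labels, startTS, endTS)` tuples, uses plain numbers for timestamps, and uses `-math.inf` and `math.inf` as stand-ins for `-0Wp` and `0Wp`; `uncovered_ranges` is a hypothetical helper.

```python
import math
from collections import defaultdict


def uncovered_ranges(rows):
    """Map each label combination to the time ranges that no row covers."""
    spans_by_labels = defaultdict(list)
    for labels, start, end in rows:
        spans_by_labels[tuple(sorted(labels.items()))].append((start, end))

    gaps = {}
    for key, spans in spans_by_labels.items():
        cursor = -math.inf  # everything before the cursor is covered
        missing = []
        for start, end in sorted(spans):
            if start > cursor:
                missing.append((cursor, start))
            cursor = max(cursor, end)
        if cursor < math.inf:
            missing.append((cursor, math.inf))
        if missing:
            gaps[key] = missing
    return gaps


if __name__ == "__main__":
    rows = [
        ({"region": "eu"}, -math.inf, 10),
        ({"region": "eu"}, 20, math.inf),  # nothing covers 10..20
        ({"region": "us"}, -math.inf, math.inf),
    ]
    for labels, missing in uncovered_ranges(rows).items():
        print(dict(labels), "is missing", missing)
```

A gap reported for a label combination suggests which DAP to look for: the one whose purview should cover that time range.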
Busy executing another API
Occurs when one or more DAPs needed to service the request were too busy servicing another API. Depending on the request parameters, there might be nothing wrong other than the system not being sufficiently resourced to handle the query load. If this issue occurs often, possible remediation steps are to check whether the query code can be optimized (in the case of user-defined analytics), or to add more DAPs to the system to service queries.
DAP reference vintage %n does not match %s reference vintage %n
DAPs keep track of where they are in the RT stream any time they ingest reference data, and they report this position (or vintage) to the RC. When sending a request to DAPs, the RC checks that the vintages of the portions of the request match, so as to minimize disparities in the data reported. This timeout error can occur when one or more DAPs are falling behind in ingestion, and so the reference vintage they report is lagging behind the global value. When this occurs, it is best to check the logs and determine the cause of any ingestion issue that might be affecting particular DAPs.
Unavailable for unspecified reasons
When this reason is specified, the DAP has marked itself unavailable to the RC. This usually occurs when the DAP is acting on a reload signal and is thus performing in-memory data purges or garbage collection calls. The DAP logs contain details of what it was doing at the time of the request.
This can occur in cases where the DAP has started and registered with the RC, but has remained marked as "unavailable". In this case, the issue could be that it has not ingested its end-of-replay marker. In a healthy system you should be able to see the DAP log both of these:
INFO DA Injecting end of log replay message, marker
INFO DA Finished RT log replay
In cases where there is an ingestion issue, you see only the first log message. If you have access to the DAP console, you can confirm it has not received the end-of-replay message by querying .da.i.eorReceived in the unavailable DAP and confirming the value is 0b.
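When console access is not available, the two log messages above can be checked from a saved DAP log. This is a minimal sketch that matches on the message text quoted above; `replay_status` is a hypothetical helper.

```python
REPLAY_START = "Injecting end of log replay message"
REPLAY_DONE = "Finished RT log replay"


def replay_status(log_lines):
    """Report how far the DAP got through RT log replay, based on its log."""
    lines = list(log_lines)
    if any(REPLAY_DONE in line for line in lines):
        return "finished"
    if any(REPLAY_START in line for line in lines):
        # Only the first message is present: the likely ingestion-issue case.
        return "waiting for end-of-replay marker"
    return "not started"


if __name__ == "__main__":
    log = ["INFO DA Injecting end of log replay message, marker"]
    print(replay_status(log))
```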
Unknown reason for not assigning request portion to DAP(s)
Occurs when the RC does not know why the request portion was unable to be served. This requires deeper investigation into the logs and state of the system.
RC-DAP
At startup, a DA reaches out to its configured RC and initiates connection. The DA logs should have a message of the form:
[rdb] INFO DA Initializing RC connection, rc=<KXI_SG_RC_ADDR>
[rdb] INFO KXDSCONN Attempting to connect to rc at <KXI_SG_RC_ADDR>
[rdb] INFO KXDSCONN Connected to proc=rc addr=<KXI_SG_RC_ADDR> handle=<handle>
[rdb] INFO DA Registering with RC
If these messages do not appear, the DA did not connect to the RC. Verify the connection details.
If connection is established, the RC logs should contain a message of the form:
[KXI-SG-RC-sg-rc] INFO SGRC Received DAP registration request, handle=<handle>
[KXI-SG-RC-sg-rc] INFO SGRC Registering valid DAP, handle=<handle> asm=<dap_assembly_name> instance=<KXI_SC> addr=<KXI_NAME>:<KXI_PORT>
Ensure that the purview matches the expected values. The RC uses these values (except ver) to route requests; they must match the labels in your request.
[KXI-SG-RC-sg-rc] INFO SGRC Adding new purview ID, id=1 labels=[labelKey=`labelValue]
If there is no registration message, look for one of the following error messages in the RC log:
// Incorrect args to the registration call.
ERROR [KXI-SG-RC-sg-rc] SGRC Incorrect DAP registration param types
// Bad purview.
ERROR [KXI-SG-RC-sg-rc] Invalid DAP purview registration
// Invalid metadata.
ERROR [KXI-SG-RC-sg-rc] Invalid DAP metadata
// Invalid schema definitions.
ERROR [KXI-SG-RC-sg-rc] Invalid DAP schemas
// Unexpected update.
ERROR [KXI-SG-RC-sg-rc] Unknown DAP update
// Invalid purview on update.
ERROR [KXI-SG-RC-sg-rc] Invalid purview update, rejecting DAP
Any of these messages indicates a bug. Failing any of the above, look for any error messages in the DAP or RC that would indicate that initialization failed before the registration was attempted, which would also indicate a bug. Contact technical support.
RC-RC
At startup, if using multiple RCs, for every RC pair, one RC should initiate a connection to the other. To ensure the RCs are configured to find each other, look for the following message:
INFO [KXI-SG-RC-sg-rc] SGMRC Setting multi-RC mode to <mode>
If the message is not present, one of the following messages should appear:
INFO [KXI-SG-RC-sg-rc] SGRC Setting mono-RC mode
// OR
FATAL [KXI-SG-RC-sg-rc] SGRC Unrecognized 'KXI_DISC_MODE' mode: <mode>
These indicate that the KXI_DISC_MODE environment variable is either unset (former) or set to an unrecognized value (latter). See "SG configuration page" for details.
Assuming discovery mode is correctly set, for every pair of RCs, one RC should attempt to connect to the other (which one is not guaranteed). Say RC_0 connects to RC_1. Ensure the following messages appear in the logs:
// RC_0
INFO [KXI-SG-RC-sg-rc-0] SGRC Attempting connection to RC, name=<rc_names> hostport=<rc_host:ports>
// RC_1
INFO [KXI-SG-RC-sg-rc-1] SGRC Registering RC, name=<rc_0_name> host=<rc_0_host> port=<rc_0_port>
INFO [KXI-SG-RC-sg-rc-1] SGRC Reciprocating registration with RC
// RC_0
INFO [KXI-SG-RC-sg-rc-0] SGRC Registering RC, name=<rc_1_name> host=<rc_1_host> port=<rc_1_port>
In the event of communication failure, there may be a log message of the form "Error sending: <error>", but the RCs should retry in a few seconds, so these can be ignored unless they repeat constantly.
If these messages do not appear in either RC, then one or both are incorrectly configured.
- If using "kubernetes" discovery mode, ensure that the annotations are correctly set in the pod metadata:
    kind: Pod
    metadata:
      annotations:
        kxi-kdisc: "${Name of the RC container}: sc=KXI-SG-RC port=${Container port number}" # Check these!
  The name of the RC container should match the name spec.containers.name of the RC container:
    spec:
      containers:
      - name: ${Name of the RC container}
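The annotation check can be sketched as follows, assuming the pod manifest has been loaded into a Python dictionary. The regular expression reflects only the annotation layout shown above and may not cover every form the discovery service accepts; `check_rc_pod` is a hypothetical helper.

```python
import re

ANNOTATION_KEY = "kxi-kdisc"
# Assumed layout: "<container name>: sc=<service class> port=<port>"
ANNOTATION_RE = re.compile(r"^(?P<name>[^:\s]+):\s+sc=(?P<sc>\S+)\s+port=(?P<port>\d+)$")


def check_rc_pod(pod):
    """Return a list of problems with the RC discovery annotation of a pod manifest."""
    annotations = (pod.get("metadata") or {}).get("annotations") or {}
    value = annotations.get(ANNOTATION_KEY)
    if value is None:
        return [f"missing '{ANNOTATION_KEY}' annotation"]

    match = ANNOTATION_RE.match(value.strip())
    if match is None:
        return [f"unexpected annotation layout: {value!r}"]

    problems = []
    containers = (pod.get("spec") or {}).get("containers") or []
    names = [container.get("name") for container in containers]
    if match["name"] not in names:
        problems.append(f"annotation names '{match['name']}' but containers are {names}")
    if match["sc"] != "KXI-SG-RC":
        problems.append(f"sc is '{match['sc']}', expected 'KXI-SG-RC'")
    return problems


if __name__ == "__main__":
    pod = {
        "kind": "Pod",
        "metadata": {"annotations": {"kxi-kdisc": "sg-rc: sc=KXI-SG-RC port=5050"}},
        "spec": {"containers": [{"name": "rc"}]},  # name does not match the annotation
    }
    print(check_rc_pod(pod))
```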
RC-Agg
At startup, the Agg should open a connection to its RC. The following message should appear in the Agg's logs:
INFO [KXI-SG-AGG-sg-agg] SGAGG Attempting to register with RC, hp=:<rc_host>:<rc_port>
INFO [KXI-SG-AGG-sg-agg] SGAGG Connected to RC, handle=<handle_number>
If, instead, the following message appears:
FATAL [KXI-SG-AGG-sg-agg] Must define KXI_SG_RC_ADDR env variable
then the KXI_SG_RC_ADDR environment variable is not set (should be KXI_SG_RC_ADDR="<rc_host>:<rc_port>"). If the following messages continually appear:
INFO [KXI-SG-AGG-sg-agg] SGAGG Attempting to register with RC, hp=:<rc_host>:<rc_port>
WARN [KXI-SG-AGG-sg-agg] SGAGG Unable to connect to RC
check that the RC's host and port are set correctly in KXI_SG_RC_ADDR.
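Both failure modes can be checked from the environment the Agg runs in. The following is a sketch that validates the "<rc_host>:<rc_port>" form and then attempts a plain TCP connection; `parse_rc_addr` and `can_connect` are hypothetical helpers, and a successful TCP connection only shows that the port is reachable, not that an RC is listening on it.

```python
import os
import socket


def parse_rc_addr(value):
    """Split '<rc_host>:<rc_port>' into (host, port); raise ValueError if malformed."""
    if not value:
        raise ValueError("KXI_SG_RC_ADDR is not set")
    host, sep, port = value.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected '<rc_host>:<rc_port>', got {value!r}")
    return host, int(port)


def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    try:
        host, port = parse_rc_addr(os.environ.get("KXI_SG_RC_ADDR"))
    except ValueError as err:
        print("configuration problem:", err)
    else:
        print("reachable" if can_connect(host, port) else "unreachable", f"{host}:{port}")
```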
Once connection is established, a confirmation message should appear in the RC's logs:
INFO [KXI-SG-AGG-sg-agg] Registering Agg, host=<agg_host> port=<agg_port> handle=<handle_number>