Data Access configuration

In its most basic form, Data Access is a set of Docker images that are combined using minimal configuration. Below is an explanation of the images required, the configuration parameters that need to be defined, and some example configurations.

Images

Two images are provided for deploying Data Access Processes:

Single mount

To deploy a Data Access Process providing query access to a single mount (e.g., an RDB, IDB, or HDB), use the registry.dl.kx.com/kxi-da image.

Additionally, specify the mount to use with mountName: <mount> in the assembly for that DAP instance.

For example,

elements:
  dap:
    rdb:
      mountName: stream

Multiple mounts

To deploy a Data Access Process providing query access to multiple mounts within a single image (to share compute resources across tiers), use the registry.dl.kx.com/kxi-da-single image.

Additionally, specify the list of mounts to use with mountList: [<mounts>] in the assembly. For example,

elements:
  dap:
    db:
      mountList: [stream, intraday, historical]

Environment variables

The DA microservice relies on certain environment variables to be defined in the containers. The variables are described below.

variable	required	containers	description
KXI_NAME	Yes	DA	Process name.
KXI_PORT	No	DA	Port. Can also be started with `"-p $KXI_PORT"`.
KXI_SC	Yes	DA	Service Class type for data access (e.g. RDB, IDB, HDB).
KXI_LOG_FORMAT	No	DA, sidecar	Message format (see qlog documentation).
KXI_LOG_DEST	No	DA, sidecar	Endpoints (see qlog documentation).
KXI_LOG_LEVELS	No	DA, sidecar	Component routing (see qlog documentation).
KXI_ASSEMBLY_FILE	Yes	DA	Assembly YAML file.
KXI_CONFIG_FILE	Yes	sidecar	Discovery configuration file (see KXI Service Discovery documentation).
KXI_CUSTOM_FILE	No	DA	File containing custom code to load in DA processes.
KXI_DAP_SANDBOX	No	DA	Whether this DAP is a sandbox.
SBX_MAX_ROWS	No	DA	Maximum number of rows, per partitioned table, to store in memory.
KXI_ALLOWED_SBX_APIS	No	DA	Comma-delimited list of sandbox APIs to allow in non-sandbox DAPs (e.g. ".kxi.sql,.kxi.qsql").
KXI_DA_RELOAD_STAGGER	No	DA	Time in seconds between DAPs of the same class reloading after an EOX (default: `30`).
KXI_DA_USE_REAPER	No	DA	Whether to use KX Reaper and the object storage cache; see the Object store config section (default: `false`).
KXI_MAX_RECORD_INTV	No	DA	Maximum number of records in an interval before triggering an emergency reload (default: `unlimited`).
KXI_SAPI_HB_FREQ	No	DA	Time in milliseconds to run the heartbeat to connected processes (default is `30,000`).
KXI_SAPI_HB_TOL	No	DA	Number of heartbeat intervals a process can miss before being disconnected (default is `2`).

See example section below.
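
As a rough illustration of how the required variables fit together, a Docker Compose service for a single-mount RDB might look like the sketch below. The service name, the KXI_DA_VERSION variable used for the image tag, the port, and the file paths are assumptions made for this example, not values taken from this guide.

```yaml
services:
  rdb:
    # Single-mount DA image; KXI_DA_VERSION is a hypothetical variable for the tag
    image: registry.dl.kx.com/kxi-da:${KXI_DA_VERSION}
    environment:
      KXI_NAME: rdb                                  # process name (hypothetical)
      KXI_SC: RDB                                    # service class
      KXI_PORT: "5080"                               # optional; hypothetical port
      KXI_ASSEMBLY_FILE: /opt/kx/cfg/assembly.yaml   # hypothetical path
    volumes:
      # Makes the assembly file available inside the container
      - ./cfg:/opt/kx/cfg
```

Optional variables such as KXI_DA_RELOAD_STAGGER or KXI_CUSTOM_FILE would be added to the same environment block.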

Object store config

The Data Access HDB processes are able to cache and reap object storage results to avoid repeated downloads of the same data.

Note: Be sure to configure RT log archiving so that it does not overlap with the cache.

If using RT and the RT log volume, size the RT log volume to leave additional room for the object storage cache.

The following environment variables should be set:

variable	required	containers	description
KX_OBJSTOR_CACHE_PATH	Yes (unless Platform)	DA	Path to where the object storage cache should be. This uses the RT Log Volume in Platform.
KX_OBJSTOR_CACHE_SIZE	Yes	DA	Size of the object storage cache in MB. Increase the RT Log Volume by this amount in Platform.

For Platform, the RT Log Volume is used for the object storage cache. Since all RT Log Volumes must be sized identically for log archiving, increase the RT Log Volume by the object storage cache size. For example, for a 20Gi log volume and a desired 5Gi object storage cache, set the RT Log Volume size for the HDB to 25Gi and set KXI_DA_USE_REAPER to "true" for the HDB DAP element.
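
Outside of Platform, the same settings could be supplied directly to an HDB container. The following Docker Compose fragment is a minimal sketch: the service name, volume name, cache path, and KXI_DA_VERSION variable are hypothetical, the other required DA variables are omitted for brevity, and the size assumes a 5Gi cache expressed as 5 * 1024 MB.

```yaml
services:
  hdb:
    image: registry.dl.kx.com/kxi-da:${KXI_DA_VERSION}   # hypothetical tag variable
    environment:
      KXI_SC: HDB
      KXI_DA_USE_REAPER: "true"               # enable KX Reaper and the cache
      KX_OBJSTOR_CACHE_PATH: /opt/kx/cache    # hypothetical cache location
      KX_OBJSTOR_CACHE_SIZE: "5120"           # size in MB (assumes 5Gi = 5120 MB)
    volumes:
      - objstor-cache:/opt/kx/cache           # hypothetical named volume backing the cache

volumes:
  objstor-cache:
```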

Names

Data Access process names determine the order in which RDB processes reload, so that processes do not all reload at once. This is handled by Kubernetes StatefulSet configuration, which names Pods as pod-name-<ordinal>, and by Docker Compose, which names containers as container-name_<ordinal>. If this naming convention isn't followed, either explicitly or via Kubernetes/Docker Compose, the reloads are immediate with no staggering. Use the KXI_DA_RELOAD_STAGGER environment variable to control the time period between reloads.

Assembly

The assembly configuration is a YAML file that defines the DA configuration, i.e. what data it is expected to offer and how it responds to queries. Assemblies are used in all KX Insights microservices.

field	required	description
name	Yes	Assembly name.
description	No	Description of the assembly.
labels	Yes	Labels (i.e. dimensions) along which the data is partitioned in the DAs, and possible values (see Labels).
tables	Yes	Schema for tables to be loaded into DAs.
bus	No	Messaging protocol to be used by streaming DAs.
mounts	Yes	Mount points that a DA is expected to surface data for. In-memory mounts are referred to as `stream`.
elements	Yes	Additional, service-specific configuration (see Elements).

See Labels/Elements or Example for example assembly yaml configurations. The assembly yaml file must be included in the Docker container.

Labels

Labels are used to define the DA purview. That is, the data that it grants access to. If using the KX Insights Service Gateway, these are the values reported as the DAP's purview (see "Service Gateway" page).

Below are some examples.

Example 1 - Provides FX data for America.

labels:
    region: amer
    assetClass: fx

Example 2 - Provides electrical, weekly billing for residential customers.

labels:
    sensorType: electric
    clientType: residential
    billing: weekly

Tables

A Table schema has the following structure:

description: String describing the purpose of this table. Optional.
type: String; one of {splayed, partitioned}.
primaryKeys: List of names of primary key columns. Optional.
partCol: Name of a column to be used for storage partitioning. Optional.
blockSize: Integer; Number of rows to keep in-memory before SM writes to disk. Optional.
updTsCol: Name of the arrival timestamp column. Optional.
columns: List of column schemas.

A column schema has the following structure:

name: Name of the column.
description: String describing the purpose of this column. Optional.
type: Q type name.
foreign: This column is a foreign key into another table in this assembly of the form table.column. Optional.
attrMem: String; column attribute when stored in memory. Optional.
attrDisk: String; column attribute when stored on disk. Optional.
attrOrd: String; column attribute when stored on disk with an ordinal partition scheme. Optional.
attrObj: String; column attribute when stored in Object store (e.g. S3). Optional.
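
To make the two schemas concrete, here is a hedged sketch of a tables section. It assumes that tables is a dictionary keyed by table name; the table name, column names, and the attrMem value are illustrative assumptions rather than values defined in this guide.

```yaml
tables:
  trade:                          # hypothetical table name
    description: Trade data
    type: partitioned
    partCol: time                 # column used for storage partitioning
    columns:
      - name: time
        description: Time the trade occurred
        type: timestamp
      - name: sym
        type: symbol
        attrMem: grouped          # assumed attribute value, applied in memory
      - name: price
        type: float
      - name: size
        type: long
```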

Bus

Data Access ingests data from an event stream; a Bus contains the information necessary to subscribe to that stream.

The bus section consists of a dictionary of bus entries. Each bus entry provides several fields:

name: For DAPs the name is expected to be stream when the environment variable KXI_RT_LIB is defined.
protocol: Short string indicating the protocol of the messaging system. Currently, the only valid choices for this protocol are custom and rt. A protocol of custom indicates that custom Q code should be loaded from the path given by an environment variable KXI_RT_LIB. A protocol of rt indicates that the data access process will be using the Insights Realtime Transport protocol.
topic: String indicating the subset of messages in this stream that consumers are interested in.
nodes: List of one or more connection strings to machines/services which can be used for subscribing to this bus. In the case of the custom protocol, this list should contain a single hostname:port string.
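
A bus entry using the custom protocol might be written as in the sketch below. It assumes the entry name is the dictionary key; the topic and the hostname:port value are hypothetical.

```yaml
bus:
  stream:                  # entry name; expected to be "stream" when KXI_RT_LIB is defined
    protocol: custom       # custom q code is loaded from the path in KXI_RT_LIB
    topic: dataStream      # hypothetical topic
    nodes: ["tp:5000"]     # a single hostname:port string (hypothetical) for the custom protocol
```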

Mounts

Data Access can mount data from any of the supported tiers, each with its own locality and format. Loosely speaking, the type of Data Access process is defined by the type of mount: stream is similar to a traditional kdb+ RDB, and local is equivalent to an HDB. The object tier is unique to cloud-based storage.

The Mounts section is a dictionary mapping user-defined names of storage locations to dictionaries with the following fields:

type: String; one of {stream, local, object}.
baseURI: String URI representing where that data can be mounted by other services. Presently this supports the file:// URI scheme, or object storage URIs.
partition: Partitioning scheme for this mount. One of:
    none: do not partition; store in the order it arrives.
    ordinal: partition by a numeric virtual column which increments according to a corresponding storage tier's schedule and resets when the subsequent tier (if any) rolls over.
    date: partition by each table's partCol column, interpreted as a date.
sym: (Object storage only) A file:// URI or object storage URI path to a sym file
par: (Object storage only) A file:// URI or object storage URI path to a par.txt file
storageURI: (Object storage only) An object storage URI that points to a database.

Notes:

A mount of type stream must be partition:none.
A mount of type local or object must be partition:ordinal or partition:date.
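
The rules above can be illustrated with a sketch of a mounts section covering three tiers. The mount names and file paths are hypothetical, and the stream entry shows only the fields whose values this guide states.

```yaml
mounts:
  rdb:                                  # hypothetical mount name
    type: stream
    partition: none                     # stream mounts must not be partitioned
  idb:
    type: local
    baseURI: file:///data/db/idb        # hypothetical path
    partition: ordinal                  # local mounts use ordinal or date
  hdb:
    type: local
    baseURI: file:///data/db/hdb        # hypothetical path
    partition: date                     # partitioned by each table's partCol
```

A DAP instance then refers to one of these names through its mountName setting.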

Elements

Assemblies coordinate a number of processes and/or microservices, which we call elements of the assembly. The elements section provides configuration details which are only relevant to individual services. This guide will focus on the configuration options for Data Access, which go in the dap entry of elements.

The dap element configuration has the following configuration parameters:

sgArch: Architecture of the service gateway process. Supported values are traditional and asymmetric. Default is asymmetric if unspecified.
rcEndpoints: List of hostname:port strings of known resource coordinators to connect to if the discovery service is unavailable.
rcName: The name of the resource coordinator for the DAP to connect to, as defined by its KXI_NAME environment variable.
smEndpoints: The hostname:port strings of the storage manager service for the data access process to connect to.
tableLoad: How to populate in-memory database tables. Supported values are empty, splay, and links. Default is empty.
mountName: Name of mount from mounts section of assembly for DA to mount and provide access to.
mapPartitions: Whether a local mount should map partitions after a remount. See the kdb+ documentation.
purview: Inclusive start, exclusive end purview for startup of the DA process.
enforceSchema: Whether a stream DAP should validate all incoming table data against what's defined in the schema. Enabling this has a performance cost.
pctMemThreshold: Percentage of available memory to allocate to ingestion of a single interval. Decimal value between 0 and 1.
allowPartialResults: Whether an HDB DAP should return a successful response if it has entered low memory mode and stopped ingesting late data. Default is true.

Within the assembly, per-instance configuration is placed under the instances key of the dap element. Configuration that applies to all DAPs is indented one level above the instances themselves, directly under dap, and can be overridden at the instance level.

elements:
  dap:
    # These configs apply to all DA below
    rcName: sg_rc # Used with discovery to determine resource coordinator to connect to
    instances:
      RDB:
        # Config specific to DAPs with a KXI_SC of RDB
        mountName: rdb # Must match name of mount in "mounts" section
      IDB:
        # Config specific to DAPs with a KXI_SC of IDB
        mountName: idb
      HDB:
        # Config specific to DAPs with a KXI_SC of HDB
        mountName: hdb

Discovery

By default, the database microservices (SG, DA, SM) use environment variables to connect to one another. An example of using environment variables is outlined in the deployment example. In this mode, the dynamic processes connect to the static processes (DAs connect to SG and SM), so processes can still come and go despite being explicitly configured.

Alternatively, all database microservices can use KX Insights Service Discovery so that processes discover and connect with each other dynamically (see the KXI Service Discovery documentation). When using service discovery, all images must be configured to use discovery; the two modes cannot be intermixed. The images required for this are as follows.

process	description	image
sidecar	Discovery sidecar.	kxi_sidecar
discovery	Discovery client. Configure one, which all processes seamlessly connect to.	kxi-eureka-discovery
proxy	Discovery proxy.	discovery_proxy

Custom file

The DA processes load the q file pointed to by the KXI_CUSTOM_FILE environment variable. In this file, you can load any custom APIs/functions that you want accessible by the DA processes. Note that while DA only supports loading a single file, you can load other files from within this file using \l (allowing you to control load order). The current working directory (pwd) at load time is the base directory of the file.

This can be combined with the Service Gateway microservice (which allows custom aggregation functions) to create full custom API support within KX Insights (see "Service Gateway" for details).

Note: It's recommended to avoid .da* namespaces to avoid colliding with DA functions.

To make an API executable within DA, use the .sgagg.registerAPI API, whose signature is as follows:

api - symbol - Aggregation function name.
metadata - list|string|dictionary - Aggregation function metadata (see "SAPI - Metadata" documentation).

API functions MUST be registered with .sgagg.registerAPI in order to be invocable by the DA processes. See the Custom file example below.

If using the Service Gateway microservice, you can see which APIs are available (and in which DAP) using the .kxi.getMeta API (see "SG - APIs").

When creating custom analytics that access data, there is a helper function, .kxi.selectTable, which understands the data model within each DAP and can help select from the tables necessary to return the appropriate records. Its interface is as follows:

Name	Type	Description
tn	symbol	Name of table to retrieve data from
ts	timestamp[2]	Time period of interest
wc	list[]	Where clause of what to select
bc	dict/boolean	By clause for select
cn	symbol	Names of columns to select for. Include any columns needed in aggregations
agg	dict	Select clause/aggregations to apply to table

EOX Event Hooks

When loading a custom file into a Data Access Process, there are two functions which are intended to be overwritten to augment the DAP's EOX event handling. These functions are .da.prtnEndCB and .da.reloadCB.

The function .da.prtnEndCB is invoked on receipt of the _prtnEnd table published by Storage Manager to mark the end of an interval. This callback function is invoked after the DAP has adjusted any receive filters and redirected updates to any delta tables.

Name	Type	Description
startTS	timestamp	Start timestamp of interval
endTS	timestamp	End timestamp of interval
opts	dictionary	Dictionary of additional options (detailed below)

Where the options can have these keys:

Name	Type	Description
date	date	Date of interval
partNo	long	EOI partition number
soiTS	timestamp	Start of interval timestamp
intv	int	Interval length

The function .da.reloadCB is invoked by Storage Manager notifying the DAPs that the EOX has been finished and committed. The callback function is invoked after any database has been reloaded, tables have been purged, but before the DAP has marked itself as available to the Resource Coordinator. The function takes a dictionary of arguments with the following keys:

Name	Type	Description
ts	timestamp	Timestamp of reload event
minTS	timestamp	Inclusive lower bound of this DAP's purview
maxTS	timestamp	Inclusive upper bound of this DAP's purview
startTS	timestamp	Start time of interval
endTS	timestamp	End time of interval
pos	int	Position of _prtnEnd event that triggered this EOX

Sandbox Mode

If a data access process is started with the environment variable KXI_DAP_SANDBOX set to "true", it runs in a "sandboxed" mode. In this mode the DAP does not initialize connections to the resource coordinator or the storage manager. In addition, local mount types load any splayed tables into memory.

For stream mounts there is an additional environment variable, SBX_MAX_ROWS, which the DAP uses to limit the number of rows a partitioned table has in memory. When it is set, only the last SBX_MAX_ROWS records received/updated are kept in memory.
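
As an illustration, a sandboxed stream DAP could be configured in Docker Compose along the lines of the sketch below. The service name, process name, assembly path, row limit, and KXI_DA_VERSION variable are all assumptions chosen for the example.

```yaml
services:
  rdb-sandbox:
    image: registry.dl.kx.com/kxi-da:${KXI_DA_VERSION}   # hypothetical tag variable
    environment:
      KXI_NAME: rdb-sandbox                          # hypothetical process name
      KXI_SC: RDB
      KXI_ASSEMBLY_FILE: /opt/kx/cfg/assembly.yaml   # hypothetical path
      KXI_DAP_SANDBOX: "true"                        # start in sandboxed mode
      SBX_MAX_ROWS: "100000"                         # hypothetical per-table row limit
```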

Example

A full example of an integrated deployment using Docker Compose is available here.