Configuration

A set of services that together make up a data ingestion, storage, and access pipeline is called an Assembly. To help these services work together within an assembly, KXI components share a common configuration file format, the Assembly Configuration file (AC). An AC is a YAML document; each service reads it from the path given by its KXI_ASSEMBLY_FILE environment variable.

The Storage Manager expects the following top-level structure in an AC:

  • name: String giving a short name for this assembly.
  • description: String describing the purpose of this assembly. Optional.
  • tables: Dictionary of schemas for the tables operated upon within the assembly.
  • mounts: Dictionary of mount points for stored data.
  • bus: Dictionary containing the configuration of the message bus used for coordination between elements.
  • elements: Dictionary of services that should run within this assembly, and any configuration they each require.
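
As a rough sketch of the top-level layout listed above, an AC might start out as follows. The assembly name and description are hypothetical, and the empty dictionaries are placeholders for the sections described in the rest of this guide.

```yaml
# Skeleton of an Assembly Configuration file (illustrative values only).
name: example-assembly        # hypothetical short name
description: Sample assembly showing the top-level layout   # optional
tables: {}                    # table schemas (see Tables)
mounts: {}                    # mount points (see Mounts)
bus: {}                       # message bus entries (see Bus)
elements:
  sm: {}                      # Storage Manager configuration (see Elements)
```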

mounts[X].uri, elements.sm.source, and elements.sm.tiers[N].store permit URIs; these may presently use the file:// or s3:// URI schemes. Other schemes may be supported in the future.

Tables

A Table schema has the following structure:

  • description: String describing the purpose of this table. Optional.
  • type: String; one of {splayed, partitioned}.
  • primaryKeys: List of names of primary key columns. Optional.
  • prtnCol: Name of a column to be used for storage partitioning. Optional.
  • shards: Integer; shard count. Optional.
  • partitions: Integer; partition count. Optional.
  • blockSize: Integer; block size. Optional.
  • updTsCol: Name of the arrival timestamp column. Optional.
  • columns: List of column schemas.

A column schema has the following structure:

  • name: Name of the column.
  • description: String describing the purpose of this column. Optional.
  • type: Q type name.
  • foreign: Marks this column as a foreign key into another table in this assembly, given in the form table.column. Optional.
  • attrMem: String; column attribute when stored in memory. Optional.
  • attrDisk: String; column attribute when stored on disk. Optional.
  • attrOrd: String; column attribute when stored on disk with an ordinal partition scheme. Optional.
  • attrObj: String; column attribute when stored in Object store (e.g. S3). Optional.
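
The table and column fields above might be combined as in the following sketch. The table name, column names, and descriptions are hypothetical, the sketch assumes that entries under tables are keyed by table name, and the attribute strings (grouped, parted) are assumed spellings that this guide does not confirm.

```yaml
tables:
  trade:                      # hypothetical table name
    description: Trade events
    type: partitioned
    prtnCol: time             # column used for storage partitioning
    columns:
      - name: time
        type: timestamp       # Q type name
      - name: sym
        description: Instrument symbol
        type: symbol
        attrMem: grouped      # assumed attribute spelling
        attrDisk: parted      # assumed attribute spelling
      - name: price
        type: float
```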

Mounts

The Storage Manager migrates data between a hierarchy of tiers, each with its own locality, segmentation format, and rollover configuration. Mounts describe where other services can then access this data.

The Mounts section is a dictionary mapping user-defined names of storage locations to dictionaries with the following fields:

  • type: String; one of {stream, local, object}.
  • uri: String URI representing where that data can be mounted by other services.
  • partition: Partitioning scheme for this mount. One of:
      • none: do not partition; store in the order it arrives.
      • ordinal: partition by a numeric virtual column which increments according to a corresponding storage tier's schedule and resets when the subsequent tier (if any) rolls over.
      • date: partition by each table's prtnCol column, interpreted as a date.

Notes:

  • A mount of type stream must be partition:none.
  • A mount of type local or object must be partition:ordinal or partition:date.
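
A minimal sketch of a mounts section that respects the two rules above: the stream mount uses partition none, and the local mounts use ordinal or date. The mount names and URI paths are hypothetical.

```yaml
mounts:
  rdb:                        # hypothetical mount name
    type: stream
    uri: file://stream        # hypothetical URI
    partition: none           # required for type stream
  idb:
    type: local
    uri: file://data/db/idb   # hypothetical path
    partition: ordinal
  hdb:
    type: local
    uri: file://data/db/hdb   # hypothetical path
    partition: date           # uses each table's prtnCol
```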

Bus

The Storage Manager ingests data from an event stream; a Bus contains the information necessary to subscribe to that stream.

The bus section consists of a dictionary of bus entries. Each bus entry provides several fields:

  • protocol: Short string indicating the protocol of the messaging system. Currently, the only valid choice for this protocol is custom, which indicates that custom Q code should be loaded from the path given by an environment variable KXI_RT_LIB.
  • topic: String indicating the subset of messages in this stream that consumers are interested in. Optional.
  • nodes: List of one or more connection strings to machines/services which can be used for subscribing to this bus. In the case of the custom protocol, this list should contain a single hostname:port string.
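
For illustration, a bus section with a single entry using the custom protocol could look like the sketch below. The entry name, topic, and host:port value are hypothetical; the custom protocol additionally relies on the KXI_RT_LIB environment variable being set, as described above.

```yaml
bus:
  stream:                     # hypothetical entry name
    protocol: custom          # currently the only valid protocol
    topic: dataStream         # optional; hypothetical topic
    nodes:
      - localhost:5000        # single hostname:port for the custom protocol
```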

Elements

Assemblies coordinate a number of processes and/or microservices, which we call elements of the assembly. The elements section provides configuration details which are only relevant to individual services. This guide will focus on the configuration options for the Storage Manager, which go in the sm entry of elements.

The sm element configuration has the following structure:

  • tiers: List of storage tiers. Required.
  • enforceSchema: Boolean, defaults to false. If true, enforce table schemas when persisting (with performance penalty; for debugging).
  • disableREST: Boolean, defaults to false. If true, disables the REST interface, leaving only qIPC support.
  • disableDiscovery: Boolean, defaults to false. If true, disables registration with discovery.
  • chunkSize: Integer, defaults to 500000. The chunk size used for writing tables.
  • sortLimitGB: Integer, defaults to 10. Memory limit when sorting splayed tables or partitions on disk, in GB.
  • waitTm: Integer, defaults to 250. Time to wait between connection attempts in milliseconds.
  • eodPeachLevel: Level at which end-of-day (EOD) processing uses peach to parallelise HDB table processing. A list containing any combination of {part,table}.
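
The options above can be sketched as follows. The scalar values shown are simply the documented defaults written out explicitly, the eodPeachLevel value is an illustrative choice, and the single tier is a placeholder that assumes a mount named rdb is defined under mounts.

```yaml
elements:
  sm:
    enforceSchema: false      # default
    disableREST: false        # default
    disableDiscovery: false   # default
    chunkSize: 500000         # default
    sortLimitGB: 10           # default, in GB
    waitTm: 250               # default, in milliseconds
    eodPeachLevel: [part, table]   # illustrative choice
    tiers:                    # required
      - name: rdb             # placeholder tier
        mount: rdb            # assumes a mount named rdb exists
```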

A storage tier has the following structure:

  • name: String used to refer to a particular tier.
  • store: URI describing where this tier will physically store data. If not specified, the uri field of the corresponding mount is used.

  • mount: Name of a corresponding mounts entry at which data in this tier may be accessed.

  • schedule: Dictionary describing a policy for when rollovers should be considered. If present, this dictionary must have the following keys:
      • freq: Timespan, in Q notation. How often should this tier roll data over into the next tier? For example, 00:10:00 means rollover happens every 10 minutes.
      • snap: Time, in Q notation. At what whole multiples of time should rollovers be scheduled? For example, 01:00:00 means rollover will happen at the beginning of an hour.

  • retain: Dictionary describing a policy for how much data should be stored in this tier before it is rolled over into the next tier. This dictionary may have one or more of the following keys. If multiple keys are set, they are interpreted in an inclusive-OR fashion, that is, rollover applies as soon as any one condition is met:
      • time: A timespan consisting of a number followed by a unit: {Years,Months,Weeks,Days,Hours,Minutes}, e.g. 2 Years. Rollover occurs for data which has been stored for this length of time.
      • sizePct: A size as percentage of total storage of corresponding mount, specified as a number from 1 to 100.

  • compression: Dictionary describing a policy for compression of data, if any. If present, contains the following keys:
      • algorithm: Compression algorithm: {none, qipc, gzip, snappy, lz4hc}
      • block: Block size
      • level: Compression level

Notes:

  • If retain is not specified, all data will be transferred to the next tier when rollover occurs. If retain is not specified on the final tier, data will be preserved at that tier indefinitely. Conversely, if the final tier has a retain policy, data which rolls over will be destroyed!
  • A mount partitioned as ordinal, or of type stream, cannot be used with a storage tier that has a retain policy.
  • The compression policy currently only applies to tiers associated with a mount which is of type:local and partition:date.
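
Putting the tier fields and the notes above together, a three-tier configuration might be sketched as follows. It assumes mounts named rdb (stream), idb (local, ordinal), and hdb (local, date) are defined under mounts. The store path, schedule values, and retain values are hypothetical. The block and level values are placeholders that assume the settings follow kdb+ compression conventions, which this guide does not state.

```yaml
elements:
  sm:
    tiers:
      - name: rdb
        mount: rdb            # stream mount: no retain policy allowed
      - name: idb
        mount: idb            # ordinal mount: no retain policy allowed
        schedule:
          freq: 00:10:00      # roll over every 10 minutes
          snap: 00:10:00      # hypothetical alignment
      - name: hdb
        mount: hdb            # local, date-partitioned mount
        store: file:///data/db/hdb   # hypothetical; optional
        schedule:
          freq: 1D00:00:00    # hypothetical: once per day
          snap: 01:00:00
        retain:               # final tier: data that rolls over is destroyed
          time: 2 Years
          sizePct: 80         # either condition triggers rollover
        compression:          # only applies to type local, partition date
          algorithm: gzip
          block: 17           # placeholder value
          level: 6            # placeholder value
```

Only the final tier carries retain and compression here, because both the stream and ordinal mounts are excluded from retain policies and compression currently applies only to local, date-partitioned mounts.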