Database: tables in the filesystem

Roughly speaking, KDB-X is what happens when q tables are persisted and then mapped back into memory for operations.
— Jeffry A. Borror, Q for Mortals

Tables are first-class entities in q. Smaller q tables can be held in memory, but we need to persist them if they are too large for memory or if we want to keep them after the process terminates. We might also want to run operations directly on these persisted tables.

This document lists the options for saving tables to the filesystem in KDB-X.

How we serialize a table depends on its size and how we need to use it.

Serialization	Representation	Recommended use case
flat table	single binary file	small and most queries use most columns
splayed table	directory of column files	up to 100 million rows
partitioned table	table partitioned by e.g. date, with a splayed table for each date	more than 100 million records; or growing steadily
segmented database	partitioned tables distributed across disks	tables larger than disks; or you need to parallelize access

Flat table

q will serialize and file any object as a single binary file – the simplest way to persist a table.

A database with tables trades and quotes, and a sym list:

db/
├── quotes
├── sym
└── trades

If needed, we can also use save to export the table in other formats (for example, .csv, .xls).
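As a minimal sketch of the above, the following writes a table as a flat file, reads it back, and exports it as CSV. The table contents and the `flatdb` directory are hypothetical and chosen only for illustration.

```q
/ hypothetical in-memory table, for illustration only
trades:([] time:09:30:00 09:31:00 09:32:00; sym:`ibm`msft`ibm; price:101.5 52.25 101.75; vol:100 200 150)

/ set serializes the table to a single binary file and returns the file symbol
`:flatdb/trades set trades

/ get deserializes the whole file back into memory
t:get `:flatdb/trades
t~trades    / 1b when the round trip preserves the table

/ save writes the global variable named by the file; the extension selects the format
save `:flatdb/trades.csv
```

Note that save takes the variable name from the file name, so the table here must be a global called trades.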

Splayed table

A table is splayed by storing each of its columns as a single file. The table is stored as a directory.

db/
├── quotes/
│   ├── time
│   ├── sym
│   └── price
└── trades/
    ├── time
    ├── sym
    ├── price
    └── vol

With a splayed table, each column file is only read into memory when a query requires it.

Consider splaying a table if most queries on a reasonably sized table do not need all the columns.
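A splayed table could be written along the following lines. This is a sketch under assumptions: the table contents and the `splaydb` root are hypothetical, and symbol columns are enumerated with .Q.en before writing, which splayed tables require.

```q
/ hypothetical in-memory table, for illustration only
trades:([] time:09:30:00 09:31:00 09:32:00; sym:`ibm`msft`ibm; price:101.5 52.25 101.75; vol:100 200 150)

/ .Q.en enumerates the symbol columns against the sym file in the root `:splaydb
/ the trailing slash on the target tells set to splay the table into a directory
`:splaydb/trades/ set .Q.en[`:splaydb] trades

/ get on the directory maps the table rather than copying every column
t:get `:splaydb/trades/

/ this query names only two of the four columns
select sym,price from t
```

The resulting directory should contain one file per column, plus a .d file that records the column order.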

Partitioned table

The records of a partitioned table are divided in its root directory between multiple partition directories. The table is partitioned by the values of a single column. Each partition contains records that have the same value in the partitioning column. With time series data, this is most commonly a date or time.

db/
├── 2020.10.03/
│   ├── quotes/
│   │   ├── price
│   │   ├── sym
│   │   └── time
│   └── trades/
│       ├── price
│       ├── sym
│       ├── time
│       └── vol
├── 2020.10.05/
│   ├── quotes/
│   │   ├── price
│   │   ├── sym
│   │   └── time
│   └── trades/
│       ├── price
│       ├── sym
│       ├── time
│       └── vol
└── sym

The partition directory is named for its partition value and contains a splayed table with just the records that have that value.
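One way to produce such a layout is .Q.dpft, which splays a global table into a named partition. The sketch below assumes a hypothetical `partdb` root and reuses the same illustrative rows for both dates.

```q
/ hypothetical in-memory table, for illustration only
trades:([] time:09:30:00 09:31:00; sym:`ibm`msft; price:101.5 52.25; vol:100 200)

/ .Q.dpft[root;partition;parted field;table name]
/ enumerates symbols against root/sym, sorts on the field, and splays into root/partition/table/
.Q.dpft[`:partdb;2020.10.03;`sym;`trades]
.Q.dpft[`:partdb;2020.10.05;`sym;`trades]   / same rows reused purely for illustration

/ loading the root maps every partition; this also changes the working directory
system "l partdb"

/ date is a virtual column derived from the partition directory names
select sym,price from trades where date=2020.10.05
```

Constraining on the partitioning column, as in the last line, lets q read only the matching partition directory.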

Consider partitioning a table if any of the following conditions apply:

It grows over time.
It contains more than 100 million records.
It includes columns that exceed the maximum object size allowed in memory.

Segmented database

The root directory of a segmented database contains only two files:

par.txt: a text file listing the paths to the segments
the sym file for enumerated symbol columns

Segments are stored outside the root, usually on various volumes. Each segment contains a partitioned table.

DISK 0             DISK 1                     DISK 2
db/                db/                       db/
├── par.txt        ├── 2020.10.03/           ├── 2020.10.04/
└── sym            │   ├── quotes/           │   ├── quotes/
                   │   │   ├── .d            │   │   ├── .d
                   │   │   ├── price         │   │   ├── price
                   │   │   ├── sym           │   │   ├── sym
                   │   │   └── time          │   │   └── time
                   │   └── trades/           │   └── trades/
                   │       ├── .d            │       ├── .d
                   │       ├── price         │       ├── price
                   │       ├── sym           │       ├── sym
                   │       ├── time          │       ├── time
                   │       └── vol           │       └── vol
                   ├── 2020.10.05/           ├── 2020.10.06/
                   │   ├── quotes/           │   ├── quotes/
                   ..                        ..
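The par.txt file is plain text with one segment path per line, so it can be written from q with 0:. The segment paths below are hypothetical, and the sketch assumes the working directory is the database root.

```q
/ hypothetical segment paths; in practice each would sit on its own volume
segs:("/disk1/db";"/disk2/db")

/ 0: writes a list of strings to a text file, one string per line
`:par.txt 0: segs

/ read0 returns the lines of the file as a list of strings
read0 `:par.txt
```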

Consider segmenting a table across multiple storage devices if any of the following conditions apply:

The table exceeds the capacity of a single storage device.
You need to parallelize access to the table.
You want to divide the data by a criterion that cannot serve as a partition value, such as ranges of symbols.

Dividing the table between storage devices lets us

store very large tables
parallelize queries
optimize updates

Queries on serialized tables

Deserialization and reserialization are implicit in qSQL queries.

q)select city,pop,country.code from `:linked/cities
city     pop      code
----------------------
Tokyo    37435191 81
Delhi    29399141 91
Shanghai 26317104 86

q)`:linked/countries upsert (`Brazil;`$"South America";55)
`:linked/countries
q)get`:linked/countries
country| cont          code
-------| ------------------
China  | Asia          86
India  | Asia          91
Japan  | Asia          81
Brazil | South America 55

Operations on serialized tables

Not every operator or keyword accepts every kind of serialized table.

For example, cols works on tables in memory or mapped to memory as well as on file symbols for splayed tables, but not on serialized flat tables.
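The difference can be sketched as follows, assuming a hypothetical `opsdb` directory and a small table with no symbol columns, so that no enumeration is needed before splaying.

```q
/ hypothetical table without symbol columns
t:([] a:1 2 3; b:4 5 6f)

`:opsdb/flat set t       / flat: a single binary file
`:opsdb/splay/ set t     / splayed: a directory of column files

/ cols applied to the file symbol of the splayed table
cols `:opsdb/splay

/ for the flat file, deserialize with get first, then apply cols
cols get `:opsdb/flat
```

Applying get first is a general fallback for a flat file: once the table is in memory, every table operation is available.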

Serializing objects (including flat tables)
Splayed tables
Partitioned tables
Segmented databases

Q for Mortals §14. Introduction to KDB-X