1.6 Kdb+ Database

Many kdb+ newcomers find the notion of a kdb+ database shrouded in mystery. In this section, we answer many newbie questions:

What is a kdb+ database?
How is it different from an RDBMS?
How do I create a kdb+ database?
How do I start up a kdb+ database?
What happens at startup?

1.6.1 Comparing kdb+ to an RDBMS

While q-sql makes certain aspects of a kdb+ database seem familiar to a SQL programmer, the two are quite different under the covers. Folks coming into the world of q and kdb+ from a traditional RDBMS environment may feel as if they have entered the Matrix after taking the red pill from Morpheus.

The fundamental differences are:

1. Kdb+ is based on lists, which are ordered collections allowing duplicates, whereas SQL is based on sets, which are unordered collections of distinct elements.
2. Kdb+ stores data as contiguous items in column lists, whereas an RDBMS stores data as fields within non-contiguous rows. Neo in kdb+ would say, “There are no rows.”
3. Kdb+ table operations are vector operations on columns, whereas SQL operates on individual fields and rows. (See the brief sketch after this list.)
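
To make these points concrete, here is a brief console sketch (the table t and its values are hypothetical, not taken from the examples later in this section):

    t:([] sym:`ibm`msft`ibm; px:101 33 101.5)  / ordered columns; duplicate symbols allowed
    t `px                                      / a column is a single contiguous vector
101 33 101.5
    2*t `px                                    / table operations are vector operations
202 66 203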

We summarize the major differences between an RDBMS and kdb+ in the following chart:

Table Creation
    Traditional RDBMS: Tables defined declaratively using DDL and created on disk.
    Kdb+ Database: Tables created functionally in the q language.
Data Persistence
    Traditional RDBMS: Tables and related metadata held in an opaque repository. Tables are stored by row.
    Kdb+ Database: Serialized q entities stored in the O/S file system. No separate table metadata. Tables are stored by column.
Data Access
    Traditional RDBMS: Access to stored information is via DDL for metadata and SQL for data. Must retrieve via a query into a program.
    Kdb+ Database: Data directly accessible in q. q-sql provides query forms for table manipulation.
Memory Residency
    Traditional RDBMS: Tables reside on disk; query result sets reside in program memory.
    Kdb+ Database: Tables live in memory but can be persisted to disk. Column subsets are page-faulted into memory for mapped tables.
Data Format
    Traditional RDBMS: Based on sets, which are unordered collections of distinct items. Data is stored in fields within rows, which are not contiguous.
    Kdb+ Database: Based on lists, which are ordered collections allowing duplicates. Data is stored as contiguous items in column lists.
Data Modification
    Traditional RDBMS: Persisted tables modifiable via SQL (INSERT, UPDATE, etc.).
    Kdb+ Database: Memory-resident tables modifiable via q and q-sql. Persisted tables modifiable only by append (upsert).
Data Programming
    Traditional RDBMS: SQL is declarative relational. Programs, called stored procedures, written in a proprietary procedural language.
    Kdb+ Database: Programs written in the integrated vector functional language q. Tables are first-class entities in q.
Transactions
    Traditional RDBMS: Support for transactions via COMMIT and ROLLBACK.
    Kdb+ Database: No built-in transaction support.

1.6.2 The Physical Layout of a kdb+ Database

What is a kdb+ database? This is a Zen koan. At the most basic level, a kdb+ database is a file system directory (and subdirectories) holding q entities. This directory is the root of the database. All constituents of the database are simply q entities saved to disk files. Most entities reside at some level under the root; those which do not are pointed to from the root.

We describe the various kdb+ database components in this section and show examples in the following section. In our descriptions we use /db as our root, but this is arbitrary.

1.6.2.1 The sym File

The sym file is an optional serialized q data file containing a list of unique symbols used as the domain for symbol enumeration. Placing the sym file in the root guarantees that it will be loaded into memory at q startup.

/db
    sym
    …

There is no strict requirement that there be only one enumeration domain, but conventionally all symbol columns from all tables are enumerated over a single domain sym. The .Q utilities that handle symbol enumeration assume this. If you choose to have multiple enumeration domains, be aware that symbols in the resulting enumerations will not be comparable with ~ unless you resolve them (e.g., apply value), since they have different types.
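
For example, a brief sketch with two hypothetical in-memory domains d1 and d2:

    d1:`a`b`c
    d2:`c`b`a
    (`d1$`a)~`d2$`a               / enumerations over different domains do not match
0b
    (value `d1$`a)~value `d2$`a   / the resolved symbols do match
1b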

Important: Pay very careful attention to the sym file, as it is a single point of failure. Corrupting or losing it will result in all your symbol columns being irresolvable. Good practice is to use conditional enumeration (e.g., `:/db/sym?) when you enumerate symbols manually or use .Q utilities that do this for you.

The sym file will not be a choke point when loading historical data from an external source into a kdb+ database, because conditional enumeration (and the .Q utilities) use file locking to mediate concurrent updates. Alternatively, if the symbol domain is known in advance, you can load the sym list into memory and use unconditional enumeration, i.e., `sym$. For example, one approach is to create a preprocessing utility in Perl to extract all symbols from the source data, import the result as a list in q and place the distinct items in sym. Then create the historical partitions with concurrent processes that use unconditional enumeration. Be mindful that an unconditional enumeration fails for a symbol not in the domain.
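
A sketch of the two forms (assuming /db/sym already exists; `new and `notthere are hypothetical symbols):

    s1:`:/db/sym?`ibm`msft`new   / conditional: appends `new to /db/sym if it is not already there
    sym:get `:/db/sym            / load the domain, then enumerate unconditionally
    s2:`sym$`ibm`msft            / fast, with no file update
    `sym$`notthere               / a symbol outside the domain signals an error
'cast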

1.6.2.2 Other Serialized Files in the Root

While it is seldom done in practice (perhaps because few kdb+ programmers are aware of the possibility), it is possible to place any serialized q entity into the root and have it loaded into an eponymous variable. This can be more efficient than using a script to initialize variables in the workspace.

/db
    [sym]
    var1
    var2
    …

One particular type of serialized file that can be initialized in this fashion is a reference table (or keyed table) of modest size (up to millions of rows). Such a table will be loaded into memory in its entirety, so lookups or joins will be quite fast.
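
For instance, a sketch in which a hypothetical trade table is enriched against such a reference table:

    lookup:([s:`a`b`c] v:1 2 3)           / the keyed reference table from the root
    trade:([] s:`a`b`a; qty:100 200 300)  / hypothetical table to enrich
    trade lj lookup                       / in-memory left join
s qty v
-------
a 100 1
b 200 2
a 300 3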

If you follow best practice and locate all global variables in the root context, it is easy to initialize the entire workspace from a snapshot: serialize the root context to disk, then read it back at startup and reassign its entries as globals.
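
One possible sketch of such a snapshot (not from the original text; the snapshot location /snapshots/root is hypothetical, and only the data variables listed by \v are captured):

/ snapshot the root-context globals
nms:system "v"
`:/snapshots/root set nms!value each nms

/ restore at startup, e.g. from a startup script
d:get `:/snapshots/root
{x set y}'[key d;value d];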

1.6.2.3 Scripts

A script in the root can hold arbitrary q code. In particular, functions defined in such a script can be viewed as stored procedures for the database.

While it is possible to have multiple q scripts in the root, you probably want precise control over the order in which the scripts are executed. Good practice is to keep in the root one startup script that loads scripts residing in libraries elsewhere.

/db
    …
    init.q

1.6.2.4 Splayed Tables

Reference tables of intermediate size (up to tens of millions of rows) can be splayed under the root. Such a splayed directory must be immediately under the root.

/db
    [sym]
    …
    /splay1
    /splay2
    …

Because splayed tables are mapped and columns are page-faulted into memory as needed, no more than a few columns will be simultaneously resident in the workspace. Assuming that the server has adequate memory available to kdb+, q should expect to find recently accessed columns cached by the OS. In this scenario, performance of lookups against splayed tables can be close to that of in-memory tables.

1.6.2.5 Partitioned Tables

The partition directories of non-segmented tables must be immediately under the root. Every partition directory contains a subdirectory for each table sharing the partition; in each subdirectory is the splayed slice for that table's partition value. For example, for daily partitioned trades and quotes, a day’s directory would contain a splayed directory for that day’s trades and another splayed directory for that day’s quotes.

/db
    [sym]
    …
    /part1
        /trade
        /quote
    /part2
        /trade
        /quote
    …

There need not be a directory for every partition value, but the partition directories must have a uniform structure. That is, any partition directory that is present must contain splayed directories for all tables in the partition. Splay an empty schema for a table having no records for a specific partition value.
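
One way to do this (a sketch; the date, table name and schema are hypothetical, and .Q.en and .Q.chk are the utilities covered in §1.5):

    `:/db/2009.01.03/trade/ set .Q.en[`:/db] ([] ti:`second$(); sym:`symbol$(); p:`float$())
`:/db/2009.01.03/trade/

Alternatively, .Q.chk`:/db fills any missing table directories across all partitions with empty schemas.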

1.6.2.6 Segmented Tables

In contrast to simple partitioned tables (i.e., non-segmented tables) in which all records live under the root, no records of a segmented table reside under the root. Instead, the root contains a text file par.txt having one entry per line. Each entry represents an OS path for a segment directory containing the data in that segment; symlinks are permissible.

Each segment directory contains partition directories whose structure is completely analogous to that of a regular partitioned table. There is one directory per partition value and beneath that, one subdirectory per table. Note that depending on the nature of the segmentation, multiple segment directories may contain directories for the same partition value. The same restriction about consistency across partition directories applies.

/db
    [sym]
    …
    par.txt
    …

== drive 1 =============
/seg1
    /part1
    /part2
    …

== drive 2 =============
/seg2
    /parta
    /partb
    …

== … =============

1.6.3 Creating and Loading a Kdb+ Database

Here we demonstrate how to create and load various kdb+ database configurations. Although the examples are simplistic, they cover most common situations.

There are no wizards for laying out and populating a kdb+ database. Instead, you must write each component in the correct form to the correct location. There are utilities to help with the process but it is your responsibility to ensure that everything is valid. Improperly structured directories or incorrectly written data files will generate run-time errors the first time q code attempts to access them.

Once you have created a kdb+ database, start a q process pointing at the root directory of the database. Alternately, in a running q instance you can issue \l on the root directory.
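
For example, with /db as the root:

>q /db
KDB+ …

or, in a running session:

    \l /db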

1.6.3.0 What happens at startup?

This question has a deceptively simple answer. When you point q at a directory or file, it applies a single rule: whatever it recognizes as its own, it does. Thus, a serialized q entity is loaded; a splayed, partitioned or segmented table is mapped; a script is executed.

Caution: Anything that q does not recognize causes it to abort that portion of startup. In particular, if q discovers foreign or undecipherable files in splayed, partitioned or segmented directories, it will not map the tables contained therein. Even unexpected hidden files (e.g., those written by the Mac OS Finder) will abort the map.

When you point q startup at a directory, that directory becomes the root directory (see §1.6.2) for the kdb+ database. We shall refer to this scenario as kdb+ startup to distinguish it from an arbitrary q session. Kdb+ startup handles the items it finds under the root in a specific order, so we shall cover them in that order:

1. Serialized q entities
2. Splayed tables
3. Partitioned or segmented tables
4. Scripts

Note: File handle paths that are not fully qualified—i.e., relative paths—are interpreted relative to the q home directory.

1.6.3.1 Serialized q Entities

When kdb+ startup recognizes a serialized q data file, it loads that data into an eponymous variable. The canonical example is the sym file in the root, containing the serialized list of (unique) symbols in the enumeration domain for (all) symbol columns. Multiple serialized entities are loaded in the order in which the OS file system lists them.

In our first example, we manually write out a sym file, exit q and then restart q pointing at the root.

    `:/db/sym set `a`b`c`z`x`y
`:/db/sym

    \\

>q /db
KDB+ ….

    sym
`a`b`c`z`x`y

Note: In all following examples, we shall elide the exit and restart sequence.

Starting with a fresh /db, we place a defined “constant,” a utility function and a serialized keyed table in the root.

    `:/db/LIFE set 42
`:/db/LIFE
    `:/db/f set {x*y}
`:/db/f
    `:/db/lookup set ([s:`a`b`c] v:1 2 3)
`:/db/lookup

…      / exit and restart q

    LIFE
42

    f
{x*y}

    lookup
s| v
-| -
a| 1
b| 2
c| 3

Note: As of this writing (Jul 2010), q will not look for serialized data in directories beneath the root. Nor will it automatically load serialized data whose file names have extensions. However, you can load such data files manually using \l or get.
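
For example (file names and locations are hypothetical; the first line loads the file into the variable extra):

    \l /db/lib/extra
    histo:get `:/db/data/histo.dat    / a file with an extension, loaded and assigned explicitly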

1.6.3.2 Splayed Tables

If kdb+ startup finds a subdirectory immediately beneath the root that it recognizes as a splayed table, it maps the table into memory.

Note: All symbol columns must be enumerated for a table to be splayed. Ensure that the sym file is in the root directory.

Starting with a fresh /db, we splay two tables under the root. The first does not have symbol columns; the second does.

    `:/db/tref/ set ([] c1:1 2 3; c2:1.1 2.2 3.3)
`:/db/tref/
    `:/db/cust/ set ([] sym:`:/db/sym?`ibm`msft`goog;
  name:`:/db/sym?`oracle`microsoft`google)
`:/db/cust/

…      / exit and restart

    tref
c1 c2
------
1  1.1
2  2.2
3  3.3
    cust
sym  name
--------------
ibm  oracle
msft microsoft
goog google

Note: As of this writing (Jul 2010), q will not recognize splayed tables in directories not parented directly to the root.
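
Incidentally, instead of enumerating by hand with `:/db/sym? as above, the .Q.en utility from §1.5 produces the same splayed result; a sketch:

    `:/db/cust/ set .Q.en[`:/db] ([] sym:`ibm`msft`goog; name:`oracle`microsoft`google)
`:/db/cust/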

1.6.3.3 Partitioned Tables

If kdb+ startup finds subdirectories immediately beneath the root whose names constitute valid partition values, it examines them for splayed tables comprising partition slices. Valid partitioned tables are mapped.

The presence of partitioned tables is independent of the presence of (plain) splayed tables. Partitioned tables and segmented tables are mutually exclusive.

Note: All symbol columns in every partition slice must be enumerated. Ensure that the sym file is in the root directory.

We recreate the example of §2.3 in which multiple tables share a daily partition.

    `:/db/2009.01.01/t/ set
        ([] ti:09:30:00 09:31:00;
            sym:`:/db/sym?`ibm`msft; p:101 33f)
`:/db/2009.01.01/t/
    `:/db/2009.01.02/t/ set
        ([] ti:09:30:00 09:31:00;
            sym:`:/db/sym?`ibm`msft; p:101.5 33.5)
`:/db/2009.01.02/t/

    `:/db/2009.01.01/q/ set
        ([] ti:09:30:00 09:31:00;
            sym:`:/db/sym?`ibm`msft; b:100.75 32.75; a:101.25 33.25f)
`:/db/2009.01.01/q/
    `:/db/2009.01.02/q/ set
        ([] ti:09:30:00 09:30:00;
            sym:`:/db/sym?`ibm`msft; b:101.25 33.25; a:101.75 33.75)
`:/db/2009.01.02/q/

…      /exit and restart

    select from t where date within 2009.01.01 2009.01.02

date       ti       sym  p
------------------------------
2009.01.01 09:30:00 ibm  101
2009.01.01 09:31:00 msft 33
2009.01.02 09:30:00 ibm  101.5
2009.01.02 09:31:00 msft 33.5

    select from q where date within 2009.01.01 2009.01.02
date       ti       sym  b      a
--------------------------------------
2009.01.01 09:30:00 ibm  100.75 101.25
2009.01.01 09:31:00 msft 32.75  33.25
2009.01.02 09:30:00 ibm  101.25 101.75
2009.01.02 09:30:00 msft 33.25  33.75

1.6.3.4 Segmented Tables

If kdb+ startup finds a par.txt file in the root, it interprets each line as a segment location and it examines the locations for valid segments of partitioned tables. Valid tables are mapped.

Important: Ensure that the segment directories are not located under the root. If you place them under the root, bad things will happen.

The presence of segmented tables is independent of the presence of plain splayed tables. Segmented tables and partitioned tables are mutually exclusive.

Note: All symbol columns in every segment must be enumerated. Ensure that the sym file is in the root directory.

We recreate an example of §1.4.3 in which multiple tables share a segmentation by symbol range.

    extr:{[t;r] select from t where (`$1#'string sym) within r}

    t:([] ti:09:30:00 09:31:00; sym:`:/db/sym?`ibm`t; p:101 17f)
    q:([] ti:09:29:59 09:29:59 09:30:00; sym:`:/db/sym?`ibm`t`ibm;
        b:100.75 16.9 100.8;a:101.25 17.1 101.1)

    `:/am/2009.01.01/t/ set extr[t;`a`m]
`:/am/2009.01.01/t/
    `:/nz/2009.01.01/t/ set extr[t;`n`z]
`:/nz/2009.01.01/t/
    `:/am/2009.01.01/q/ set extr[q;`a`m]
`:/am/2009.01.01/q/
    `:/nz/2009.01.01/q/ set extr[q;`n`z]
`:/nz/2009.01.01/q/

    t:([] ti:09:30:00 09:31:00; sym:`:/db/sym?`t`ibm; p:17.1 100.9)
    q:([] ti:09:29:59 09:29:59 09:30:00; sym:`:/db/sym?`t`ibm`t;
        b:17 100.7 17.1;a:17.2 101.25 17.25)

    `:/am/2009.01.02/t/ set extr[t;`a`m]
`:/am/2009.01.02/t/
    `:/nz/2009.01.02/t/ set extr[t;`n`z]
`:/nz/2009.01.02/t/
    `:/am/2009.01.02/q/ set extr[q;`a`m]
`:/am/2009.01.02/q/
    `:/nz/2009.01.02/q/ set extr[q;`n`z]
`:/nz/2009.01.02/q/

    `:/db/par.txt 0: ("/am"; "/nz")
`:/db/par.txt

    select from t where date within 2009.01.01 2009.01.02
date       ti       sym p
-----------------------------
2009.01.01 09:30:00 ibm 101
2009.01.01 09:31:00 t   17
2009.01.02 09:31:00 ibm 100.9
2009.01.02 09:30:00 t   17.1

    select from q where date within 2009.01.01 2009.01.02
date       ti       sym b      a
-------------------------------------
2009.01.01 09:29:59 ibm 100.75 101.25
2009.01.01 09:30:00 ibm 100.8  101.1
2009.01.01 09:29:59 t   16.9   17.1
2009.01.02 09:29:59 ibm 100.7  101.25
2009.01.02 09:29:59 t   17     17.2
2009.01.02 09:30:00 t   17.1   17.25

1.6.3.5 Scripts

Files with extension .q are interpreted as q scripts; those with extension .k as k scripts. Scripts found in the root are loaded and executed in alphabetical order. You will probably want better control over the order of execution, so it is best to have in the root a single script that loads other scripts in the desired sequence.

Note: An invalid expression in a script causes the entire script loading procedure to abort. Locating the invalid entry can be non-trivial since most console display is suppressed. Judicious use of 0N! can be helpful. Sometimes using binary search via block comments is an effective approach to locating the offending line. Bottom line: a good reason to keep scripts short!
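
For instance, a sketch of instrumenting a script with 0N! (the messages are hypothetical):

0N!"loading reference data";
lookup:([s:`a`b`c] v:1 2 3)
0N!"reference data loaded";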

In our first example, we initialize the workspace using a startup script as an alternative to using serialized variables shown in §1.6.3.1. Create the following and save it to /db/init.q using any text editor:

LIFE:42
f:{x*y}
lookup:([s:`a`b`c] v:1 2 3)

Now start up kdb+ with root /db and execute the following at the console.

    LIFE
42
    f
{x*y}
    lookup
s| v
-| -
a| 1
b| 2
c| 3

We observe that since scripts are loaded as the last step in kdb+ startup, any loaded script can rely on the serialized variables and mapped tables from startup being present. For example, in a fresh /db we create a splayed table and a startup script from within q.

    `:/db/cust/ set
        ([] sym:`:/db/sym?`ibm`msft`goog;
           name:`:/db/sym?`oracle`microsoft`google)
`:/db/cust/

    `:/db/init.q 0: ("sym"; "show select from cust")
`:/db/init.q

…      / exit and restart echoes the following to console

`oracle`microsoft`google`ibm`msft`goog
sym  name
--------------
ibm  oracle
msft microsoft
goog google

Now we demonstrate a script that loads other scripts. Start with a fresh /db. Using a text editor, create and save the following file as /lib/math.q.

.jab.lib.math.f1:{x+y}
.jab.lib.math.f2:{2*x}

Similarly, save the following as /lib/expr.q:

.jab.lib.expr.subst:{x@y}

Finally save the following startup script as /db/init.q:

TheUniverse:6*7

\l /lib/math.q

\l /lib/expr.q

Now start up kdb+ with root /db.

>q /db
KDB+ …

Executing the following at the q console demonstrates that the scripts were executed:

    TheUniverse
42
    \f .jab.lib.math
`s#`f1`f2

    \f .jab.lib.expr
,`subst

1.6.3.6 Putting It All Together

In this section, we combine the various constituents of a kdb+ database demonstrated in the previous sections to give two archetypal examples. The first has partitioned tables, while the second has segmented tables.

To construct our partitioned database, we combine the steps in §1.6.3.1, §1.6.3.2, §1.6.3.3 and §1.6.3.5 to obtain the following meta-script. (We assume that the script libraries in /lib already exist.)

/ create serialized variables
`:/db/LIFE set 42
`:/db/f set {x*y}
`:/db/lookup set ([s:`a`b`c] v:1 2 3)

/ create splayed tables
`:/db/tref/ set ([] c1:1 2 3; c2:1.1 2.2 3.3)
`:/db/cust/ set
    ([] sym:`:/db/sym?`ibm`msft`goog;
        name:`:/db/sym?`oracle`microsoft`google)

/ create partitioned tables
`:/db/2009.01.01/t/ set
    ([] ti:09:30:00 09:31:00;
        sym:`:/db/sym?`ibm`msft; p:101 33f)
`:/db/2009.01.02/t/ set
    ([] ti:09:30:00 09:31:00;
        sym:`:/db/sym?`ibm`msft; p:101.5 33.5)

`:/db/2009.01.01/q/ set
    ([] ti:09:30:00 09:31:00;
        sym:`:/db/sym?`ibm`msft; b:100.75 32.75;
        a:101.25 33.25f)
`:/db/2009.01.02/q/ set
    ([] ti:09:30:00 09:30:00;
        sym:`:/db/sym?`ibm`msft; b:101.25 33.25;
        a:101.75 33.75)

/ create load script
`:/db/init.q 0: ("TheUniverse:42";"\\l /lib/math.q";
    "\\l /lib/expr.q")

Now execute this script to create the database. The resulting partitioned database has the following layout:

/db
    sym
    LIFE
    f
    lookup
    /tref
        .d
        c1
        c2
    /cust
        .d
        sym
        name
    /2009.01.01
        /t
            .d
            ti
            sym
            p
        /q
            .d
            ti
            sym
            b
            a
    /2009.01.02
        /t
            .d
            ti
            sym
            p
        /q
            .d
            ti
            sym
            b
            a
    init.q

To construct a segmented database, we combine the steps in §1.6.3.1, §1.6.3.2, §1.6.3.4 and §1.6.3.5 to obtain the following meta-script. (We again assume that the script libraries in /lib already exist.)

/ create serialized variables
`:/db/LIFE set 42
`:/db/f set {x*y}
`:/db/lookup set ([s:`a`b`c] v:1 2 3)

/ create splayed tables
`:/db/tref/ set ([] c1:1 2 3; c2:1.1 2.2 3.3)
`:/db/cust/ set
    ([] sym:`:/db/sym?`ibm`msft`goog;
       name:`:/db/sym?`oracle`microsoft`google)

/ create segmented tables
extr:{[t;r] select from t where (`$1#'string sym)
            within r}

t:([] ti:09:30:00 09:31:00; sym:`:/db/sym?`ibm`t;
    p:101 17f)

q:([] ti:09:29:59 09:29:59 09:30:00;
    sym:`:/db/sym?`ibm`t`ibm;
    b:100.75 16.9 100.8;a:101.25 17.1 101.1)

`:/am/2009.01.01/t/ set extr[t;`a`m]
`:/nz/2009.01.01/t/ set extr[t;`n`z]
`:/am/2009.01.01/q/ set extr[q;`a`m]
`:/nz/2009.01.01/q/ set extr[q;`n`z]

t:([] ti:09:30:00 09:31:00; sym:`:/db/sym?`t`ibm;
    p:17.1 100.9)
q:([] ti:09:29:59 09:29:59 09:30:00;
    sym:`:/db/sym?`t`ibm`t;
    b:17 100.7 17.1;a:17.2 101.25 17.25)

`:/am/2009.01.02/t/ set extr[t;`a`m]
`:/nz/2009.01.02/t/ set extr[t;`n`z]
`:/am/2009.01.02/q/ set extr[q;`a`m]
`:/nz/2009.01.02/q/ set extr[q;`n`z]

`:/db/par.txt 0: ("/am"; "/nz")

/ create load script
`:/db/init.q 0: ("TheUniverse:6*7";"\\l /lib/math.q";
    "\\l /lib/expr.q")

Now execute this script to create the database. The resulting segmented database has the following layout:

/db
    sym
    LIFE
    f
    lookup
    /tref
        .d
        c1
        c2
    /cust
        .d
        sym
        name
    par.txt
    init.q

== drive 1 ==

/am
    /2009.01.01
        /t
            .d
            ti
            sym
            p
        /q
            .d
            ti
            sym
            b
            a
    /2009.01.02
        /t
            .d
            ti
            sym
            p
        /q
            .d
            ti
            sym
            b
            a

== drive 2 ==
/nz
    /2009.01.01
        /t
            .d
            ti
            sym
            p
        /q
            .d
            ti
            sym
            b
            a
    /2009.01.02
        /t
            .d
            ti
            sym
            p
        /q
            .d
            ti
            sym
            b
            a

1.6.4 There's No Place Like QHOME

Here we describe how kdb+ interacts with the OS environment when it looks for files.

1.6.4.1 The Environment Variables

The three environment variables QHOME, QLIC and QINIT are used by kdb+ at startup.

The variable QHOME specifies the directory where kdb+ expects to find the bootstrap file q.k. By default, it also looks there for the license file k4.lic. If QHOME is not defined, kdb+ falls back to $HOME/q for unix-based systems and c:\q for Windows.

The variable QLIC overrides the default location for the license file. If QLIC is not defined, kdb+ falls back to QHOME (or its fallback).

The variable QINIT specifies the name of the file that is executed immediately after the load of q.k. If QINIT is not defined, kdb+ attempts to load the file q.q from QHOME. If QHOME is not defined or q.q is not found, no error is reported.

Note: The file in QINIT is executed in the root context (i.e., \d .)
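
These settings can be inspected from within a running q session with getenv; the values are environment-specific, so output is omitted:

    getenv each `QHOME`QLIC`QINIT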

1.6.4.2 Where is q?

After startup, an easy way to see where in the file system the q executable resides is to use a function such as the niladic whereami below. Recall that the naked command \cd acts like the DOS cd command by returning the current working directory.

   whereami:{-1 "cd ~ ",system "cd";}

Upon startup, the current directory is the directory from which the q session was started.

Loading a script that resides elsewhere does not in itself change the current directory; of course, if that script executes a \cd then the current directory may change. In particular, the current directory is not temporarily changed to the location of the target script.

Loading a database, either as part of the startup invocation or via a \l command, changes the current directory to the database root. All subsequent relative access is based from there. This is especially important when starting q with -u as it determines the file hierarchy visibility.
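
A brief console sketch of this behavior (output is environment-specific and omitted):

    system "cd"      / the directory from which q was started
    \l /db
    system "cd"      / now the database root; relative file handles resolve from here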

When asked to load a file whose name is not fully qualified, q first searches the current directory. If the file is not found, it next searches QHOME if it is defined and $HOME otherwise.


Prev: 1.5 Using .Q Utilities for Splayed and Partitioned Tables

Table of Contents
