KdbplusForMortals/dotQ utilities

From Kx Wiki
Jump to: navigation, search

Contents

1.5 Using .Q Utilities for Splayed and Partitioned Tables

The .Q namespace contains useful functions for creating and maintaining splayed and partitioned tables. In this section, we describe the more commonly used entries and demonstrate their use.

1.5.1 .Q.qp

The monadic .Q.qp asks whether a table is partitioned. It returns 1b if its argument is a partitioned table mapped into memory and 0b otherwise.

Note: Since segmented tables are partitioned, .Q.qp will return 1b for segmented tables.

The following examples show the results of .Q.qp on in-memory, splayed, partitioned and a segmented tables.

    tmem:([] c1:1 2 3; c2:1.1 2.2 3.3)
    .Q.qp tmem
0

    `:/db/tsplay/ set tmem
`:/db/tsplay/
    \l /db
    .Q.qp tsplay
0b

    `:/db/2009.01.01/t/ set tmem
`:/db/2009.01.01/t/
    \l /db
    .Q.qp t
1b

    `:/p1/2009.01.01/t/ set tmem
`:/p1/2009.01.01/t/
    `:/p2/2009.01.01/t/ set 100+tmem
`:/p2/2009.01.01/t/
    `:/db2/par.txt 0: ("/p1"; "/p2")
`:/db2/par.txt
    \l /db2
    .Q.qp t
1b

We examine the results for partitioned tables more closely.

    t1:t
    .Q.qp t1
1b
    .Q.qp select from t
0

The first test demonstrates that assigning a mapped table is done via reference counting rather than copying. The second shows that the result of a select against a mapped table is held in memory.

1.5.2 .Q.en

Recall that all symbol columns in a splayed table must be enumerated. We saw in §1.2 how to splay a table with symbol columns manually. The dyadic function .Q.en assists in splaying a table containing symbol column(s) by automating the enumeration. Since partitioned and segmented tables also have splayed slices, .Q.en is useful for them too.

The dyadic .Q.en takes a symbolic file handle of a database root as its first argument and a table as its second argument. It creates a list sym comprising the unique items across all symbol columns and writes it to an eponymous file in the specified directory. It returns a table obtained from the original table by enumerating its symbol column(s) over sym.

The following example shows how to splay the original sample table from §1.2 in one step using .Q.en. Observe that we use a projected form of .Q.en for readability.

    `:db/t/ set .Q.en[`:/db;] ([]s:`a`c`b; p:101 102 103f)
`:/db/t/

Now splay another table into the same root directory.

    `:/db/u/ set .Q.en[`:/db;] ([]s:`x`y`z; p:101.1 102.2 103.3)
`:/db/u/

Examine the directory listing for /db and you will see the splayed subdirectories for t and u, as well as a single sym file. Observe that the sym file was created as needed. Now examine the contents of the sym file.

    get `:/db/sym
`a`c`b`x`y`z

Using .Q.en to splay multiple tables into a common root directory results in a consolidated sym file. All symbol columns in the splayed tables are enumerated over the consolidated domain sym.

Now we use .Q.en to splay a table having multiple symbol columns. As expected, it creates a consolidated enumeration domain.

    `:/db/tm/ set .Q.en[`:/db/;] ([]s:`x`y`z; v:10 20 30; n:`ab`cde`fg)
`:/db/t/
    get `:/db/sym
`x`y`z`ab`cde`fg

1.5.3 .Q.ind

As we saw in §2.3, the virtual column i reflects the relative row number in a partitioned table. That is, the value of i is the row number within the partition slice. How to retrieve records by absolute row in a partitioned or segmented table?

A masochist might use a q expression to determine the partition and relative row for the absolute row number x:

    select first date, first rr from
        (select date,rr:(sums rc)-x from
            select rc:count i by date from t) where rr>=0
date       rr
-------------
2009.01.02 1

It is less painful to use the dyadic .Q.ind, whose first argument is a partitioned table and whose second argument is a list of long values representing absolute row numbers. The result is a table (in memory).

    .Q.ind[t;1 2j]
date       ti       p
-------------------------
2009.01.01 09:31:00 102
2009.01.02 09:30:00 101.5

Observe that you must enlist a single row number.

    .Q.ind[t;enlist 3j]
date       ti       p
-------------------------
2009.01.02 09:31:00 102.5

1.5.4 .Q.dpft

The function .Q.dpft assists in creating partitioned and segmented tables by incorporating the functionality of .Q.en at a slightly higher level. It is convenient when partitions are loaded and written out iteratively. The first argument is the symbolic file handle of the database root directory. The second argument is the q data value corresponding to the partition subdirectory. The third argument is the name of the field to which the `p# attribute is applied (usually `sym for trades and quotes). The last argument is the table name.

The .Q.dpft function rearranges the named table so that the partition column is first and the parted column is second, and then splays it into the specified partition in the specified directory. When appropriate, it saves a sym list in the root and enumerates all symbol columns over sym. The result is the table name if the operation is successful.

Notes:

  1. Because the final parameter is a table name, this function cannot be applied to a local variable.
2. The table columns must be simple or compound lists.
3. The target cannot be a keyed table.

We recycle our favorite example, using .Q.dpft to write it as a partitioned table into an fresh /db.

    t:([] ti:09:30:00 09:31:00; sym:`ibm`msft; p:101.5 33.5)
    .Q.dpft[`:/db;2009.01.01;`sym;`t]
`t

    t:([] ti:09:31:00 09:32:00; sym:`ibm`msft; p:101 33f)
    .Q.dpft[`:/db;2009.01.02;`sym;`t]
`t

    \l /db
    select from t where date in 2009.01.01 2009.01.02
date       sym  ti       p
------------------------------
2009.01.01 ibm  09:30:00 101.5
2009.01.01 msft 09:31:00 33.5
2009.01.02 ibm  09:31:00 101
2009.01.02 msft 09:32:00 33

Most would agree that this is simpler than performing the detail steps manually.

1.5.5 .Q.fs

In many cases, splayed tables (as well as partitioned or segmented slices) are too large to fit into memory. Thus, we face the dilemma of how to create such tables, since splaying requires the table to be in memory. The utility .Q.fs comes to the rescue by allowing us to process text files in "chunks."

Note: The chunk size used by .Q.fs is hard-coded. You may wish to copy the k source code to your own function and experiment with different values. The optimal chunk size will vary for each application.

Loosely, .Q.fs applies a function to the records of a text file in “chunks” instead of processing the entire file in one gulp. It iterates the function over a number of bite-sized record lists. As an example, we create a text file containing the first 100000 non-negative integers, one per line and we chunk the calculation of their squares. (The final call to 0N! enables us to see the result of each chunk as it is processed.)

    `:/data/L 0: string til 100000

    .Q.fs[{0N!v*v:"J"$x};`:/data/L]
0 1 4 9 16 25 36 49 64 81 100 121 144 169 196 225 256 289 324 361 400 441 484..
412130601 412171204 412211809 412252416 412293025 412333636 412374249 4124148..
1522170225 1522248256 1522326289 1522404324 1522482361 1522560400 1522638441 ..
3332637441 3332752900 3332868361 3332983824 3333099289 3333214756 3333330225 ..
5843532249 5843685136 5843838025 5843990916 5844143809 5844296704 5844449601 ..
9054854649 9055044964 9055235281 9055425600 9055615921 9055806244 9055996569 ..
688890j

While chunking provides no material benefit in this trivial example, it is easily adaptable to larger problems. Assume we have a csv file containing one day’s pseudo-trade records (here we manufacture a trivial one).

    ct:00:00:00.000+(10*(til 100000))mod `int$24:00:00.000
    ix:100000?til 2
    cs:(`ibm`ms) ix
    cp:(115 25@ix)+-1.+100000?1.
    trade:([] ti:ct; sym:cs; px:cp)
    save `:/data/trade.csv
`:/temp/trade.csv
    read0 `:/data/trade.csv
"ti,sym,px"
"00:00:00.000,ibm,114.8385"
"00:00:00.010,ibm,114.181"
"00:00:00.020,ibm,114.4171"
"00:00:00.030,ms,24.27091"

We intend to save this data into a kdb+ daily partition slice, but suppose that naively loading it with the following expression results in a WSFULL error because our q process has insufficient memory.

    flip `time`sym`price!("TSF";",") 0: `:/data/trade.csv

Not enough storage is available to process this command.
Wsfull

We demonstrate how to read the csv file and write out to the partition slice in chunks. For convenience, we wrap the read and symbol enumeration into a function ldchunk. Then we use .Q.fs to iterate ldchunk over the chunks, upserting each result to the partition directory. (Again we instrument with 0N! to see each chunk result.)

    ldchunk:{.Q.en[`:/db] flip `time`sym`price!("TSF";",") 0: x}
    .Q.fs[{0N!.[`:/db/2009.01.01/t/;();,;ldchunk x]}] `:/data/trade.csv
`:/db/2009.01.01/t/
`:/db/2009.01.01/t/
`:/db/2009.01.01/t/
…
2638673j

We verify that the partition is properly written:

    \l /db
    select from t
date       time         sym price
------------------------------------
2009.01.01              sym
2009.01.01 00:00:00.000 ibm 114.8385
2009.01.01 00:00:00.010 ibm 114.181
2009.01.01 00:00:00.020 ibm 114.4171
2009.01.01 00:00:00.030 ms  24.27091
…

Where performance matters in the real world, we would proceed to sort the completed partition slice by time within sym and apply the `p# attribute to the sym column.

    `sym`time xasc `:/db/2009.01.01/t
`:/db/2009.01.01/t
    @[`:/db/2009.01.01/t; `sym; `p#]
`:/db/2009.01.01/t

    \l /db

     meta t
c    | t f a
-----| -----
date | d
time | t
sym  | s   p
price| f

1.5.6 .Q.chk

The utility .Q.chk is a monadic function whose argument is the symbolic file handle of the root directory. It examines each partition subdirectory in the root and writes an empty splayed table of the appropriate form where necessary.

We reconstruct the example in §1.3 having multiple tables sharing a partition, this time using .Q.dpft instead of performing the steps manually.

    t:([] ti:09:30:00 09:31:00; sym:`ibm`msft; p:101 33f)
    .Q.dpft[`:/db;2009.01.01;`sym;`t]
`t
    t:([] ti:09:30:00 09:31:00; sym:`ibm`msft; p:101.5 33.5)
    .Q.dpft[`:/db;2009.01.02;`sym;`t]
`t
    q:([] ti:09:30:00 09:31:00; sym:`ibm`msft;
        b:100.75 32.75; a:101.25 33.25f)
    .Q.dpft[`:/db;2009.01.01;`sym;`q]
`q
    q:([] ti:09:30:00 09:31:00; sym:`ibm`msft;
        b:101.25 33.25; a:101.75 33.75)
    .Q.dpft[`:/db;2009.01.02;`sym;`q]
`q

Now we introduce two new partitions, each containing splayed data of just one table.

    t:([] ti:09:30:00 09:31:00; sym:`ibm`msft; p:102 34f)
    .Q.dpft[`:/db;2009.01.04;`sym;`t]
`t

    q:([] ti:09:30:00 09:31:00; sym:`ibm`msft;
        b:101. 33.; a:101.5 34f)
    .Q.dpft[`:/db;2009.01.03;`sym;`q]
`q

As currently stored, the database in /db cannot be properly mapped because neither of the two tables spans all the partitions. We previously saw how to rectify this situation manually by saving an empty schema wherever a table is missing. This approach requires examining every partition to determine what is missing, which is cumbersome and error prone.

Fortunately, the utility .Q.chk automates this process. In our example:

    .Q.chk `:/db
()
()
,`:/db/2009.01.03
,`:/db/2009.01.04

    \l /db
    select from t
date       sym  ti       p
------------------------------
2000.01.04 ibm  09:30:00 102
2000.01.04 msft 09:31:00 34
2009.01.01 ibm  09:30:00 101
2009.01.01 msft 09:31:00 33
2009.01.02 ibm  09:30:00 101.5
2009.01.02 msft 09:31:00 33.5

    select from q
date       sym  ti       b      a
--------------------------------------
2009.01.01 ibm  09:30:00 100.75 101.25
2009.01.01 msft 09:31:00 32.75  33.25
2009.01.02 ibm  09:30:00 101.25 101.75
2009.01.02 msft 09:31:00 33.25  33.75
2009.01.03 ibm  09:30:00 101    101.5
2009.01.03 msft 09:31:00 33     34

The display of .Q.chk shows the list of any tables it fixes in each partition as it goes. We can verify that Q.chk has indeed inserted the missing splayed tables. For example, we now find an empty slice for t in the 2009.01.03 partition,

    get `:/db/2009.01.03/t
s ti p
------

Note: In case you want a different behavior, you can easily rewrite .Q.chk so that it simply displays which partition slices are missing.

1.5.7 .Q.view

The monadic .Q.view is handy when you are executing queries against partitioned or segmented tables. Recall that while there can be only one partitioning in a kdb+ database, multiple tables can share the partitioning. The argument of .Q.view is a list of partition values that acts as a filter for all queries against any partitioned table in the database. Otherwise put, the practical effect of applying .Q.view is to insert its argument as a constraint in the first sub-phrase of the where clause of each query.

In the following example, we create a daily partitioned table with three days of data. By using .Q.view to restrict the view, we reign in the unqualified query.

    `:/db/2009.01.01/t/ set ([] ti:09:30:00 09:31:00; p:101 102f)
`:/db/2009.01.01/t/

    `:/db/2009.01.02/t/ set ([] ti:09:30:00 09:31:00; p:100 100f)
`:/db/2009.01.02/t/

    `:/db/2009.01.03/t/ set ([] ti:09:30:00 09:31:00; p:103 104f)
`:/db/2009.01.03/t/

    \l /db

    .Q.view 2009.01.01 2009.01.02

    select from t        / <- normally dangerous!
date       ti       p
-----------------------
2009.01.01 09:30:00 101
2009.01.01 09:31:00 102
2009.01.02 09:30:00 100
2009.01.02 09:31:00 100

This is especially useful when you are testing or when you expose large historical databases to users who are prone to forget that they must qualify their queries by date.

To reset the default view to all partitions, invoke .Q.view niladically.

    .Q.view[]
    select from t        / <- once again dangerous
date       ti       p
-----------------------
2009.01.01 09:30:00 101
2009.01.01 09:31:00 102
2009.01.02 09:30:00 100
2009.01.02 09:31:00 100
2009.01.03 09:30:00 103
2009.01.03 09:31:00 104

You can use the virtual partition column in the argument to .Q.view.

    .Q.view -1#date
    select from t        / <- normally dangerous
date       ti       p
-----------------------
2009.01.03 09:30:00 103
2009.01.03 09:31:00 104

Prev: 1.4 Segments Next: 1.6 Kdb+ Database

Table of Contents

Personal tools
Namespaces
Variants
Actions
Navigation
Toolbox