QforMortals2/queries q sql

From Kx Wiki
Jump to: navigation, search

Contents

Queries: q-sql

Overview

Q has a collection of functions for manipulating tables that are similar to their counterparts in SQL. This collection, which we call q-sql, includes the usual suspects such as insert, select, update, etc., as well as functionality that is not available in traditional SQL. While q-sql provides a superset of SQL functionality, there are some significant differences in the syntax and behavior.

The first important difference is that a q table has well-defined record and column orders. This is particularly useful in dealing with the situation in which records are inserted in a canonical order. Subsequent actions against the table will then retrieve records in this order. For example, a time series can be created by inserting (in time order) pairs consisting of a time (or date, or datetime) value and data value(s). The result of any select will then be in time order, without requiring a sort.

A second difference is that every q table is stored physically as a collection of column vectors. This means that operations on column data are easy and fast since atomic, aggregate or uniform functions applied to columns are optimized vector operations.

A third difference is that q-sql provides upsert semantics. This means that one dataset can be applied to another without the need to separate inserts from updates. Upsert can simplify operations significantly in practice.

In this chapter, we cover the important features of q-sql, including all the basic operations in kdb+. We demonstrate each feature with a simple example. Gradually, more complex examples are introduced.

Many examples are based on the sp.q distribution script. The schemas for the tables in the script are,

        s:([s:()]name:();status:();city:())
        p:([p:()]name:();color:();weight:();city:())
        sp:([]s:`s$();p:`p$();qty:())

The contents of the tables are,

      s
s | name  status city
--| -------------------
s1| smith 20     london
s2| jones 10     paris
s3| blake 30     paris
s4| clark 20     london
s5| adams 30     athens

        p
p | name  color weight city
--| -------------------------
p1| nut   red   12     london
p2| bolt  green 17     paris
p3| screw blue  17     rome
p4| screw red   14     london
p5| cam   blue  12     paris
p6| cog   red   19     london

        sp
s  p  qty
---------
s1 p1 300
s1 p2 200
s1 p3 400
s1 p4 200
s4 p5 100
s1 p6 100
s2 p1 300
s2 p2 400
s3 p2 200
s4 p2 200
s4 p4 300
s1 p5 400

Insert

Insert appends records to a table or keyed table.

Basic Insert

To add records to a table, use the dyadic function insert,

insert[st;L]

where st is a symbol containing the name of a table (target) and L is a list whose items correspond to records of target. The result of insert is a list of int representing the positions of the new record(s).

Warning.png Note: Since the items in L are appended to the column vectors of st, each value must type-match the corresponding column vector.

For a regular (i.e., non-keyed) table, the effect of insert is to append a new record holding the specified values. Let's use our simple example.

        t:([] name:`Dent`Beeblebrox`Prefect; iq:42 98 126)

Insert a record into t as follows,

        insert[`t;(`Slartibartfast;156)]
,3

        t
name            iq
-------------------
Dent            42
Beeblebrox      98
Prefect        126
Slartibartfast 156

Alternate Forms

Since the dyadic insert is also a verb, it can take various notational forms. For example, the previous insert can be written as a binary operator.

        `t insert (`Slartibartfast; 156)

It can also be expressed as a projection onto the first argument with juxtaposition of the second argument.

        insert[`t] (`Slartibartfast; 156)

You may find one of these more readable. We shall use them interchangeably.

You can also insert a record, as opposed to a list of row values.

        `t insert `name`iq!(`Slartibartfast; 156)

This is useful when you wish to insert a table which is the result of a select.

Repeated Inserts

For a (non-keyed) table, repeatedly inserting the same data is permissible and it results in duplicate records.

        t:([] name:`Dent`Beeblebrox`Prefect; iq:42 98 126)
        `t insert (`Slartibartfast; 156)                 / one form
,3

        insert[`t] (`Slartibartfast; 156)                / equivalent form
,4

        t
name            iq
-------------------
Dent            42
Beeblebrox      98
Prefect         126
Slartibartfast 156
Slartibartfast 156

Columnar Bulk Insert

In the preceding, we have considered the case when the list in an insert represents a set of values for a single row. Each item is an atom destined for the corresponding column in the table. It is also possible to bulk insert multiple entries.

Recall that a table is a dictionary of columns. So in the example,

        t:([] name:`Dent`Beeblebrox; iq:98 42)
        `t insert (`Prefect;126)
,2

the right operand looks like a row, but is in fact a list of column values. With this perspective, a bulk insert can be achieved with a compound list, each of whose items is a list of column values destined for the corresponding column in the table.

        t:([] name:`Dent`Beeblebrox; iq:98 42)
        `t insert (`Prefect`Mickey;126 1024)
2 3

Table Insert

It is also possible to bulk insert records (i.e., rows). A table can be viewed as a list of records (and vice versa), so it is reasonable to insert one table into another provided the columns are compatible.

        t:([] name:`Dent`Beeblebrox`Prefect; iq:98 42 126)
        tnew:([] name:`Slartibartfast`Mickey; iq:158 1042)
        `t insert tnew
3 4

        t
name           iq
-------------------
Dent           98
Beeblebrox     42
Prefect        126
Slartibartfast 158
Mickey         1042

Insert into Keyed Tables

Inserting data into a keyed table works just like inserting data into a regular table, with the additional requirement that the key must not already exist in the table. Using our previous example of a keyed table,

        t:([eid:1001 1002] name:`Dent`Beeblebrox; iq:98 42)
        t
eid | name       iq
----| -------------
1001| Dent       98
1002| Beeblebrox 42

        `t insert (1004; `Slartibartfast; 158)
,2

        t
eid | name            iq
----| -------------------
1001| Dent            98
1002| Beeblebrox      42
1004| Slartibartfast 158

The following insert fails because the key 1004 already exists in t,

        `t insert (1004; `Slartibartfast; 158)
'insert

Observe that, by default, the records in a keyed table are stored in insert order rather than key order.

        `t insert (1003; `Prefect; 126)
        t
eid | name            iq
----| -------------------
1001| Dent            98
1002| Beeblebrox      42
1004| Slartibartfast  158
1003| Prefect         126

Insert into Empty Tables

We consider the situation of an empty table with no column types specified. The column types are inferred from the first insert.

        t:([] name:(); iq:())
        type t.name
0h

        type t.iq
0h

        `t insert (`Dent; 98)
,0

        type t.name
11h

        type t.iq
6h

If you define an empty table without types, be especially careful to get the first insert correct.

        `t insert (98; `Dent)
                .
                .
                .

        `t insert (`Beeblebrox; 42)
`type

It is advantageous to define an empty table with types. In our example,

        t:([] name:`symbol$(); iq:`int$())
        t:([] name:0#`; iq:0#0)        / an equivalent way

        type t.name
11h
        type t.iq
6h

Insert and Foreign Keys

When inserting data into a table that has a foreign key, everything works as for a regular table, except that a value destined for a foreign key column must already exist as a key in the corresponding primary key table.

Warning.png Note: This last requirement is how q implements referential integrity.

Returning to our example of the previous section,

        kt:([eid:1001 1002 1003] name:`Dent`Beeblebrox`Prefect; iq:98 42 126)
        tdetails:([] eid:`kt$1003 1002 1001 1002 1001; sc:126 36 92 39 98)
        kt
eid | name       iq
----| --------------
1001| Dent       98
1002| Beeblebrox 42
1003| Prefect    126

        tdetails
eid  sc
--------
1003 126
1002 36
1001 92
1002 39
1001 98

        `tdetails insert (1002; 42)
,5

         tdetails
eid  sc
--------
1003 126
1002 36
1001 92
1002 39
1001 98
1002 42

The following insert fails because the key 1004 does not exist in kt.

        `tdetails insert (1004; 158)
'cast

The select and exec Templates

In this section, we investigate the general form of select, which we met briefly in Basic select. We present select as a template having required and optional elements. The template elements, in turn, contain phrases whose expressions involve column values. The q interpreter applies the template against the specified table to produce a result table. While the syntax and results resemble those of the analogous SQL statement, the underlying mechanics are quite different.

We shall examine each of the constituents of the select template in detail. Our approach is to introduce the concepts with illustrative examples using trivial tables and then to proceed with more meaningful examples using time series. Here are our sample table definitions:

        tk:([eid:1001 1002 1003] name:`Dent`Beeblebrox`Prefect; iq:98 42 126)
        tdetails:([] eid:`tk$1003 1002 1001 1002 1001 1002; sc:126 36 92 39 98 42)

Syntax

The select template has the following form, where elements enclosed in matching angle brackets (<...>) are optional.

select <ps> <by pb> from texp <where pw>

The select and from keywords are required, as is texp, which is a q expression whose result is a table or keyed table. The elements ps, pb and pw are the select, the by and the where phrases, respectively. The result of select is a list of records or, equivalently, a table.

Warning.png Note: If where is present and texp is itself the result of a select, the expression that produces texp must be enclosed in parentheses.

Some simple examples follow.

        select from tk
eid | name       iq
----| --------------
1001| Dent       98
1002| Beeblebrox 42
1003| Prefect    126

        select eid,name from tk where name=`Dent
eid  name
---------
1001 Dent

        select cnt:count sc by eid.name from tdetails
name      | cnt
----------| ---
Beeblebrox| 3
Dent      | 2
Prefect   | 1

        select topsc:max sc, cnt:count sc by eid.name from tdetails where eid.name<>`Prefect
name      | topsc cnt
----------| ---------
Beeblebrox| 42    2
Dent      | 98    2

The order of execution for select is:

(1) from expression texp,
(2) where phrase pw
(3) by phrase pb
(4) select phrase ps

In particular, the from expression is always evaluated first and the select phrase last.

Warning.png Note: If ps is absent, all columns are returned. There is no need for the * wildcard of SQL.

Each phrase in the select template is a comma-separated list of subphrases. A subphrase is an expression involving columns of texp or virtual columns if a table related to texp via foreign key. The subphrases within a phrase are evaluated left-to-right, but each expression comprising a subphrase is parsed right-to-left, like any q expression.

Warning.png Important: The commas separating the subphrases are separators, meaning that it is not necessary to enclose a subphrase in parentheses. However, any expression containing the join operator ( , ) must be enclosed in parentheses to distinguish it from the separator.

The where Phrase

The where phrase controls which records appear in the result. The action of this phrase is a generalization of the built-in where function (See Appendix A).

Each subphrase is a criterion on columns. It produces a boolean result vector corresponding to records passing or failing the criterion. The effect of a where subphrase is to select only the records that pass its criterion.

The individual where subphrases are applied from left-to-right. Each step produces a result whose rows are a subset of the previous one. The net effect is a series of progressively narrowed interim tables.

        select from tk where iq <100
eid  name       iq
------------------
1001 Dent       98
1002 Beeblebrox 42

        select from tdetails where eid=1002
eid  sc
-------
1002 36
1002 39
1002 42

        select from tdetails where eid=1002,sc<eid.iq
eid  sc
-------
1002 36
1002 39

         select from tdetails where (eid=1002)&sc<eid.iq
eid  sc
-------
1002 36
1002 39

We point out that the last two queries return the same result but execute differently; we shall see more about this later. Also observe that the parentheses in the last query are necessary due to right-to-left evaluation of expressions.

The select Phrase

The select phrase controls which columns appear in the result. Each select subphrase produces a column. The name of the result column from each subphrase is taken from the last underlying column referenced in the subphrase evaluation unless the result is renamed by assignment.

        select LastName:name,iq from tk
LastName   iq
--------------
Dent       98
Beeblebrox 42
Prefect    126

If a column is repeated in the select phrase, it appears more than once in the result. This behaves like SQL SELECT.

        select iq,iq from tk
iq  iq
-------
98  98
42  42
126 126
Warning.png Important: A virtual column i holding the position of each record is implicitly available in the select phrase. This is useful, for example, in aggregation if you want a column with record counts without reference to a specific column name.
        select cnt:count i by eid from tdetails
eid | cnt
----| ---
1001| 2
1002| 3
1003| 1

In this situation, i plays a role somewhat similar to * in SQL, but is more useful since it can be used to select specific records. For example, criteria on i can be used to fill only one page of results when you do not wish to transmit an entire result set. Here is the second page of detail records for a page size of 3, noting that the within function includes both its endpoints (see Appendix A).

        select from tdetails where (3<=i) and i<6
eid  sc
-------
1002 39
1001 98
1002 42

This is difficult to do in SQL and vendors have added proprietary extensions to handle it.

The by Phrase

The by phrase controls how rows are grouped in the result. The action of this phrase is a generalization of the built-in group function (See Appendix A).

Each by subphrase is an expression involving a column. It produces a grouping criterion for that column. The columns resulting from the by phrase become the primary keys of the select result. Multiple subphrases in the by phrase result in a compound primary key in the result.

Warning.png Note: If the by phrase is included, the result of select is a keyed table; if it is omitted, the result is a table.
Warning.png Important: Every column included in the by phrase is automatically included in the result and should not be included separately in the select phrase.

It is possible to group without aggregation. The result is a table with non-simple lists for columns - that is, non-atomic column values. (See Complex Column Data for more on tables with non-simple column lists.)

        select sc by eid from tdetails
eid | sc
----| --------
1001| 92 98
1002| 36 39 42
1003| ,126

This cannot be achieved easily with GROUP BY in SQL.

The function ungroup can be used to normalize the result of grouping back to a flat table.

        seid:select sc by eid from tdetails
        seid
eid | sc
----| -----
1001| 92 98
1002| 36 39 42
1003| ,126

       ungroup seid
eid  sc
--------
1001 92
1001 98
1002 36
1002 39
1002 42
1003 126

The exec Template

The syntax of the exec template is identical to select.

exec <ps> <by pb> from texp <where pw>

The difference from select is that the result is not a table.

If only one column is produced by the select phrase, the result of exec is a list containing the column values produced. This contrasts with select, which produces a table with a single column in this situation.

With tk as above,

        tk
eid | name       iq
----| --------------
1001| Dent       98
1002| Beeblebrox 42
1003| Prefect    126

        select name from tk
name
----------
Dent
Beeblebrox
Prefect

        exec name from tk
`Dent`Beeblebrox`Prefect

Using exec to extract a single column of a table (as opposed to a keyed table) is more powerful than other mechanisms to extract the column because you can apply constraints on other columns.

        tdetails.sc
126 36 92 39 98 42
        tdetails[`sc]
126 36 92 39 98 42
        exec sc from tdetails
126 36 92 39 98 42
        exec sc from tdetails where eid in 1001 1002
36 92 39 98 42

If more than one column is produced by the select phrase, the result of exec is a dictionary mapping the column names to the values produced. This contrasts with select, which produces a table with the specified columns.

        select eid,name from tk
eid  name
---------------
1001 Dent
1002 Beeblebrox
1003 Prefect

        exec eid,name from tk
eid | 1001 1002       1003
name| Dent Beeblebrox Prefect

Using distinct in select and exec

The built-in distinct function (see Appendix A) applied to a source table returns a table containing the unique records in the source.

        tdup:([]c1:10 20 10 30 10 20 40 30;c2:`a`b`a`c`z`b`d`c)
        tdup
c1 c2
-----
10 a
20 b
10 a
30 c
10 z
20 b
40 d
30 c

        distinct tdup
c1 c2
-----
10 a
20 b
30 c
10 z
40 d

By including distinct in the select phrase of a select or exec query, you can similarly suppress duplicates from the result.

        select distinct c1 from tdup
c1
--
10
20
30
40

        exec distinct c2 from tdup
`a`b`c`z`d
Warning.png Note: When distinct is used in select, it appears immediately after ‘select’ and is applied across all the specified columns, meaning that it returns rows with distinct values in those columns. By contrast, in exec, distinct can apply to any column and the result will be a non-rectangular in general.
        select distinct c2,c1 from tdup
c2 c1
-----
a  10
b  20
c  30
z  10
d  40

        exec distinct c2,c1 from tdup
c2| `a`b`c`z`d
c1| 10 20 10 30 10 20 40 30

        exec distinct c2,distinct c1 from tdup
c2| `a`b`c`z`d
c1| 10 20 30 40

One way to understand this behavior is as follows. The result of select is a table, which is rectangular; hence distinct must produce full rows. The result of exec is a dictionary, so each column name (i.e., key) can have a different number of values.

Using each in where

If a function or operator used in a where criterion is not atomic or uniform in its argument, you must use an each adverb. This is because the criterion is applied across the column vector(s).

        ts:([]f:1.1 2.2 3.3;s:("abc";"d";"ef"))
        select from ts where  s~"abc"
f s
---

        select from ts where  ("abc"~) each s
f   s
---------
1.1 "abc"

The first select does not achieve the desired result because it asks if the entire column matches the specified string. The second select works correctly because it is the projection of the binary match operator applied to each item of the column.

Nested where

As was mentioned in The where Phrase the criteria in the subphrases of a where phrase are applied to the records of the table sequentially from left to right. Consequently, the final list of records is obtained via a succession of intermediate results, each of which is narrowed by the following subphrase criterion. Otherwise put, the where subphrases constitute a nested set of criteria.

The order of the subphrases in a nested where can have significant performance implications for queries against large tables. Whenever possible, list the subphrases in order of decreasing restrictiveness. That is, choose the subphrase at each position to be the one that results in the greatest narrowing. Each intermediate table will be smallest and consequently less processing will be required at the next step.

Warning.png Note: If there is one where subphrase that will always result in a significantly smaller result set, it should be placed first in the sequence. In the case of a partitioned table, place any constraint on the partition column first.

A typical example is a series of measurements for entities with an identifier. This could be real-time stock prices, daily bond yields, yearly batting averages, test scores, etc. Say there are many different identifier values and you want to select certain records for a given identifier. It is better to filter on the identifier first since this will immediately restrict the result set to a small subset of the original. This can lead an order of magnitude improvement.

Let's take our trivial example of IQ test scores and imagine that the table contains the result of SAT scores for all high school seniors in the United States. In this case, there will be several million students with only a few records per student. Clearly if you want to perform an analysis on the scores of an individual, it is best to limit the result by student first, since the initial and subsequent intermediate tables will be tiny.

Imagine the following table containing millions of student social security numbers and scores.

        tscores:([]ssn:0#`;sc:0#0)
        `tscores insert (`$"111-11-1111"; 999)
        `tscores insert (`$"222-22-2222"; 1242)
        `tscores insert (`$"333-33-3333"; 735)
        `tscores insert (`$"444-44-4444"; 1600)
        `tscores insert (`$"555-55-5555"; 1178)
        `tscores insert (`$"111-11-1111"; 1021)
        `tscores insert (`$"666-66-6666"; 882)
         .
         .
         .

Since each student takes the test only a few times, the following query,

        select from tscores where ssn=`$"111-11-1111",0<deltas sc

executes significantly faster than,

        select from tscores where (ssn=`$"111-11-1111")&0<deltas sc

We point out that any nested where phrase is logically equivalent to an unnested phrase in which each of the subphrases is joined by & . In our example, the nested query produces the am results as either,

        select from tscores where (ssn=`$"111-11-1111")&0<deltas sc

        / or

        select from tscores where (0<deltas sc)&ssn=`$"111-11-1111"

However, both unnested versions will execute more slowly since the compound criterion is applied against all records in the table.

select[n]

You can return the first or last n records of a select result using function parameter syntax on the select. A positive parameter returns the first n records specified by the select body, while a negative parameter returns the last records.

        select[2] from tk
eid | name       iq
----| -------------
1001| Dent       98
1002| Beeblebrox 42

        select[-1] from tk
eid | name    iq
----| -----------
1003| Prefect 126

fby

It is sometimes desirable to use an aggregate function in the where phrase of select. For example, suppose we are given a table with a foreign key and we wish to determine which key values have more than one entry in the table. A first attempt might be to place a condition in the where phrase that filters on the count being greater than 1. In our example of tdetails, this would be something like,

        select distinct eid from tdetails where 1<count eid
eid
----
1003
1002
1001

You can see this doesn't work, as the record for eid value 1003 is included even though it has only a single entry in tdetails. What went wrong?

The better question is, what does this where expression actually do? Since count is an aggregate function, it is applied against the list of column values for eid. It cannot select individual rows since it does not return a boolean vector result. Indeed, it returns the scalar 5, the number of items in the column vector.

You could achieve the desired result with a correlated subquery. The inner query counts the records for each key value using aggregation and grouping.

        q1:select cnt:count eid by eid from tdetails
        q1
eid | cnt
----| ---
1001| 2
1002| 3
1003| 1

The outer query selects the records with the desired count.

        select eid from q1 where 1<cnt
eid
----
1001
1002

An easier way to accomplish this result is to use fby in the where phrase. Placing fby in a where subphrase allows an aggregate function to be used to select individual rows. The action is similar to the grouping of by, with the specified aggregate function applied across the grouped values. (Hence the name "fby" which is short for "function by").

The use of fby is somewhat more abstract than other elements of the select template. It is a binary operator of the form,

(fagg;expcol) fby c

The left operand is a two-item list consisting of an aggregate function fagg and a column expression exp'col on which the function will be applied. The right operand c is a symbol containing the name of the column whose values are grouped to form lists for the aggregate function.

Inclusion of fby in a where subphrase selects those records whose group passes the subphrase criterion specified by the aggregate function. This means that all records in a group either pass or fail together, depending on the result of the aggregation on the group.

In our example above, we can achieve the desired result with an un-nested select using fby. First, we verify that fby does indeed accomplish what we want. Remember to evaluate the where criterion right-to-left.

        select eid from tdetails where 1<(count;eid) fby eid
eid
----
1002
1001
1002
1001
1002

Now we eliminate the duplicates.

        select distinct eid from tdetails where 1<(count;eid) fby eid
eid
----
1002
1001
Warning.png Note: Multiple columns in the right operand of fby must be encapsulated in a table. To do this is, create an anonymous empty table with the desired column names only.
        t:([]sym:`IBM`IBM`MSFT`IBM`MSFT;
		ex:`N`O`N`N`N;
		time:12:10:00.0 12:30:00.0 12:45:00.0 12:50:00.0 13:30:00.0;
		price:82.1 81.95 23.45 82.05 23.40)
        t
sym  ex time         price
--------------------------
IBM  N  12:10:00.000 82.1
IBM  O  12:30:00.000 81.95
MSFT N  12:45:00.000 23.45
IBM  N  12:50:00.000 82.05
MSFT N  13:30:00.000 23.4

        select from t where price=(max;price) fby ([]sym;ex)
sym  ex time         price
--------------------------
IBM  N  12:10:00.000 82.1
IBM  O  12:30:00.000 81.95
MSFT N  12:45:00.000 23.45

It may take a while to get used to this notation.

The update Template

Basic update

The update template has the same form as the select template.

update <pu> <by pb> from texp <where pw>

The difference is that column assignments in the update phrase pu represent modifications to columns instead of column name aliases.

        t:([] c1:`one`two`three; c2:10 20 30)
        update c1:`third,c2:33 from t where c1=`three
c1    c2
--------
one   10
two   20
third 33
Warning.png Important: In order to modify the contents of texp you must refer to a table by name.

After execution of the query above, we still find,

        t
c1    c2
--------
one   10
two   20
three 30

However, t can be modified in place by referring to the table by name.

        update c1:`third,c2:33 from `t where c1=`three
`t
        t
c1    c2
--------
one   10
two   20
third 33
Warning.png Note: Unlike updates in SQL, update can add a new column.
        t:([] c1:20 10 30 20; c2:`z`y`x`a)
        t
c1 c2
-----
20 z
10 y
30 x
20 a

        update c3:100+c1 from `t
`t

        t
c1 c2 c3
---------
20 z  120
10 y  110
30 x  130
20 a  120

update-by

When the by phrase is present, update can be used to create new columns from the grouped values. When an aggregate function is used, it is applied to each group of values and the result is assigned to all records in the group.

        t:([] n:`a`b`a`c`c`b; p:10 15 12 20 25 14)
        t
n p
----
a 10
b 15
a 12
c 20
c 25
b 14

        update av:avg p by n from t
n p  av
---------
a 10 11
b 15 14.5
a 12 11
c 20 22.5
c 25 22.5
b 14 14.5

If a uniform function is used, it is applied across the grouped values and the result is assigned in sequence to the records in the group. With t as above,

        update s:sums p by n from t
n p  s
-------
a 10 10
b 15 15
a 12 22
c 20 20
c 25 45
b 14 29

upsert

The dyadic function upsert is an alternate name for join ( , ) on tables and keyed tables.

For keyed tables, the match is done by key value.

        kt:([k:`one`two`three] c:10 20 30)
        kt
k    | c
-----| --
one  | 10
two  | 20
three| 30

        ku:([k:`three`four]; c:300 400)
        ku
k    | c
-----| ---
three| 300
four | 400

        kt upsert ku
k    | c
-----| ---
one  | 10
two  | 20
three| 300
four | 400

For regular (non-keyed) tables, the records are appended.

        t:([]c1:`one`two`three;c2:10 20 30)
        t
c1    c2
--------
one   10
two   20
three 30

        u:([]c1:`three`four;c2:30 40)
        u
c1    c2
--------
three 30
four  40

        t upsert u
c1    c2
--------
one   10
two   20
three 30
three 30
four  40
Warning.png Note: The upsert expressions above do not affect the original table. You must refer to the table by name to modify the original.

delete

The syntax of the delete template is simpler than that of select, with the added restriction that either pcols or pw can be present but not both.

delete <pcols> from texp <where pw>

If pcols is present as a symbol list of column names, the result is a table derived from texp in which the secified columns are removed. If pw is present, the result is a table derived from texp in which records meeting the criteria of pw are removed.

        t:([]c1:`a`b`c;c2:`x`y`z)
        t
c1 c2
-----
a  x
b  y
c  z

        delete c1 from t
c2
--
x
y
z
        delete from t where c2=`z
c1 c2
-----
a  x
b  y
Warning.png Important: In order to modify the contents of texp you must refer to the table by name.

Thus, after execution of the last query above, we still find,

        t
c1 c2
-----
a  x
b  y
c  z

However, t can be modified in place with,

        delete from `t where c2=`z
't
        t
c1 c2
-----
a  x
b  y

Grouping and Aggregation

Aggregation is the result of applying an aggregate function - one that produces an atom from a list - to a column.

SQL Aggregation

In traditional SQL, aggregation and grouping are limited and cumbersome. Aggregation and grouping are bound together: only columns that appear in the GROUP BY can participate in the SELECT result. Moreover, there is a limited collection of built-in aggregation functions.

In q, grouping and aggregation can be used independently or together.

Grouping without Aggregation

Grouping in q collects rows having a common value in the group domain. Unlike SQL, any column can participate in the select result when grouping. Moreover, the columns in the by phrase are automatically included in the result as keys.

When a column not in the by phrase is explicitly specified in the select phrase, the result of grouping without aggregation has a corresponding column of non-simple type. There will be one item in the value list for each record matching a given group domain value.

For example, we can group order quantities by supplier in the sp.q script sample tables.

        select qty by s from sp
s | qty
-----| -----------------------
s1| 300 200 400 200 100 4000
s2| 300 400
s3| ,200
s4| 100 200 300

You can group by the result of a function applied to a column. For example, the following query groups all products meeting a certain order quantity threshold.

        select distinct p by thrsh:qty>200 from sp
thrsh| p
-----| ------------------
0    | `p$`p2`p4`p5`p6
1    | `p$`p1`p3`p2`p4`p5

You can also group by virtual columns from foreign keys.

        select sname:s.name, qty by pname:p.name from sp
pname| sname                    qty
-----| ----------------------------------------
bolt | `smith`jones`blake`clark 200 400 200 200
cam  | `clark`smith             100 400
cog  | ,`smith                  ,100
nut  | `smith`jones             300 300
screw| `smith`smith`clark       400 200 300
Warning.png Important: When no columns are explicitly specified in the select phrase, the result of grouping without aggregation has columns of simple type. The value for each result column is obtained by picking the value of the last record matching the group domain value.

For example, the following query,

        select by p from sp
p | s  qty
--| ------
p1| s2 300
p2| s4 200
p3| s1 400
p4| s4 300
p5| s1 400
p6| s1 100

is equivalent to the following query using the aggregate last on each non-grouped column,

        select last s, last qty by p from sp
p | s  qty
--| ------
p1| s2 300
p2| s4 200
p3| s1 400
p4| s4 300
p5| s1 400
p6| s1 100

One way to obtain all the remaining columns in a grouping without explicitly listing them in a select is to use the xgroup function. It takes column symbol(s) as the left operand and a table as its right operand. The result is a keyed table that is that same as listing all the non-grouped columns in the comparable select.

Using the distibution example,

        `p xgroup sp
p | s               qty
--| -------------------------------
p1| `s$`s1`s2       300 300
p2| `s$`s1`s2`s3`s4 200 400 200 200
p3| `s$,`s1         ,400
p4| `s$`s1`s4       200 300
p5| `s$`s4`s1       100 400
p6| `s$,`s1         ,100

Aggregation without Grouping

Aggregation can be applied against a column of non-simple type in any table. The aggregate function can be any function that processes a list of the appropriate form and produces an atom. While q has many built-in aggregates, you can also define and use your own.

We calculate the average order quantity in the sp.q script sample tables by using the built-in aggregate avg.

        select totq:sum qty, avgq:avg qty from sp
totq avgq
-------------
3100 258.3333

Grouping with Aggregation

The equivalent of SQL aggregation is achieved in q by combining grouping with aggregation.

Continuing with the sp.q script example, we combine grouping and aggregation to compute the average order quantity by supplier.

        select avgqty:avg qty by s.name from sp
name | avgqty
-----| --------
blake| 200
clark| 200
jones| 350
smith| 266.6667

Using Uniform and Aggregate Functions

Any uniform or aggregate function can by applied directly to columns in aggregation.

Again using the sp.q distribution example, for each salesperson we can find the cumulative low quantity at the same time with the average and high.

        select cumlo:mins qty, av:avg qty, hi:max qty by s.name from sp
name | cumlo                   av       hi
-----| ------------------------------------
blake| 200                     200      200
clark| 100 100 100             200      300
jones| 300 300                 350      400
smith| 300 200 200 200 100 100 266.6667 400

Using each

If the data in a column is not atomic (that is, the column has a list of values for each row), you must use the each modifier to apply an aggregate.

In our sp.q example, suppose we define a table of intermediate results as,

        o:select qty by p.name from sp
        o
name | qty
-----| ---------------
bolt | 200 400 200 200
cam  | 100 400
cog  | 100
nut  | 300 300
screw| 400 200 300

We must use each to compute the average order size for each product in o.

        select name, avqty:avg qty from o
'length

        select name, avqty:avg each qty from o
name  avqty
-----------
bolt  250
cam   250
cog   100
nut   300
screw 300

Using ungroup

The monadic function ungroup is a partial inverse to the resultant keyed tables of select and xgroup. It unwinds the keyed table into a table whose records have the same format as the original table. How closely its output resembles the original table depends on whether information has been collapsed in the grouping.

We use the sp table from the distribution script for our examples.

        sp
s  p  qty
---------
s1 p1 300
s1 p2 200
s1 p3 400
s1 p4 200
s4 p5 100
s1 p6 100
s2 p1 300
s2 p2 400
s3 p2 200
s4 p2 200
s4 p4 300
s1 p5 400

        `p xgroup sp

p  s  qty
---------
p1 s1 300
p1 s2 300
p2 s1 200
p2 s2 400
p2 s3 200
p2 s4 200
p3 s1 400
p4 s1 200
p4 s4 300
p5 s4 100
p5 s1 400
p6 s1 100

Since no aggregation has been performed and all non-key columns are present, the result of ungroup is the same as the original table with the rows sorted by the group column(s).

        ungroup `p xgroup sp
p  s  qty
---------
p1 s1 300
p1 s2 300
p2 s1 200
p2 s2 400
p2 s3 200
p2 s4 200
p3 s1 400
p4 s1 200
p4 s4 300
p5 s4 100
p5 s1 400
p6 s1 100

If aggregation has been performed or columns have been omitted, then only the selected values will be reflected after the ungroup. For example, we omit the s column in the following grouping, so it is also missing after the ungroup.

        select qty by p from sp
p | qty
--| ---------------
p1| 300 300
p2| 200 400 200 200
p3| ,400
p4| 200 300
p5| 100 400
p6| ,100

        ungroup select qty by p from sp
p  qty
------
p1 300
p1 300
p2 200
p2 400
p2 200
p2 200
p3 400
p4 200
p4 300
p5 100
p5 400
p6 100
Warning.png Note: The result of a select in which grouping is specified but no columns are explicitly listed is not a keyed table of the proper form for ungroup. You will receive an error if you apply ungroup to the result of such a query.

8.7 Sorting

Recall that tables and keyed tables are lists of records and therefore have an inherent order. A table or keyed table can be sorted by the values of any column(s).

We use the following table definition in this section.

        t:([]c1:20 10 30 20;c2:`z`y`x`a)
        t
c1 c2
-----
20 z
10 y
30 x
20 a

xasc

The dyadic xasc takes a scalar or list of symbols containing column names as its left argument and a table as its right argument. It returns the records of the table sorted in ascending order of the items in the specified column(s). The order of the column names indicates the sort order, from major to minor.

        `c1 xasc t
c1 c2
-----
10 y
20 z
20 a
30 x

       `c2 xasc t
c1 c2
-----
20 a
30 x
10 y
20 z

       `c1`c2 xasc t
c1 c2
-----
10 y
20 a
20 z
30 x
Warning.png Important: In order to modify the contents of a table you must refer to the table by name.

After execution of the expressions above, we still find,

       t
c1 c2
-----
20 z
10 y
30 x
20 a

However, t can be sorted in place with,

         `c1`c2 xasc `t
`t
        t
c1 c2
-----
10 y
20 a
20 z
30 x

xdesc

The dyadic xdesc behaves exactly as xasc, except that the sort is performed in descending order.

        t
c1 c2
-----
20 z
10 y
30 x
20 a

        `c1`c2 xdesc t
c1 c2
-----
30 x
20 z
20 a
10 y

Renaming and Rearranging Columns

Since a table is the flip of a column dictionary, its columns are named and ordered by the list of symbols in the dictionary domain. It is sometimes necessary to rename or reorder the columns. This is accomplished using the dyadic functions xcol and xcols.

We use the following table definition in this section.

        t:([]c1:20 10 30 20;c2:`z`y`x`a;c3:101.1 202.2 303.3 404.4)
        t
c1 c2 c3
-----------
20 z  101.1
10 y  202.2
30 x  303.3
20 a  404.4

xcol

The dyadic xcol takes a scalar or list of symbols containing column names as its left argument (names) and a table (source) as its right argument. The count of names must be less than or equal to the number of columns in source. The result is a table obtained from source by renaming the columns, in order, using the symbols in names.

For example,

        `id`name`val xcol t
id name val
-------------
20 z    101.1
10 y    202.2
30 x    303.3
20 a    404.4
Warning.png Important: The function xcol does not modify its table operand.

After execution of the expressions above, we still find,

        t
c1 c2 c3
-----------
20 z  101.1
10 y  202.2
30 x  303.3
20 a  404.4

However, t can effectively be renamed with,

        t:`id`name`val xcol t
        t
id name val
-------------
20 z    101.1
10 y    202.2
30 x    303.3
20 a    404.4

If the count of names is less than the number of columns in source, the remaining columns are unaffected. Returning to the original definition of t,

        `id`name xcol t
id name c3
-------------
20 z    101.1
10 y    202.2
30 x    303.3
20 a    404.4

xcols

The dyadic xcols takes a scalar or list of symbols containing column names as its left argument ("names") and a table ("source") as its right argument. The count of "names" must be less than or equal to the number of columns in "source". It returns a table obtained from "source" by reordering the columns according to the symbols in "names".

Warning.png Note: The source operand can not be a keyed table.

For example,

        `c3`c2`c1 xcols t
c3    c2 c1
-----------
101.1 z  20
202.2 y  10
303.3 x  30
404.4 a  20
Warning.png Important: The function xcols does not modify its source.

After execution of the expressions above, we still find,

        t
c1 c2 c3
-----------
20 z  101.1
10 y  202.2
30 x  303.3
20 a  404.4

However, t can effectively be reordered with,

        t: `c3`c2`c1 xcols t
        t
c3    c2 c1
-----------
101.1 z  20
202.2 y  10
303.3 x  30
404.4 a  20

If the count of names is less than the number of columns in source, the specified columns are reordered at the beginning of the column list and the remaining columns are left unchanged. Returning to the original definition of t,

        `c3`c1 xcols t
c3    c1 c2
-----------
101.1 20 z
202.2 10 y
303.3 30 x
404.4 20 a

Joins

It is common in SQL to reassemble normalized data by joining a table having a foreign key (source) to its primary key table along common key values. This situation occurs when the tables have a master-detail relation, or when the values of a field have been factored into a lookup table. Such an inner join with equals in the join criterion is called an equal join or an equijoin. In an equijoin, the join can be specified in either order, and there will be exactly one record it the result for each record in the source.

An inner join combines two tables having compatible columns by selecting a subset of the Cartesian product along matching column values. In a left inner join, each row from the first table (source) is paired with any matching rows from the second table. In a right inner join, each row from the second table (source) is paired with any matching rows from the first table. The match columns do not need to be key columns. In an inner join, there may be no rows or multiple rows in the result for each row in the source.

SQL also has outer joins, in which each element of one table (source) is paired with all matching elements of the other table. The match columns do not need to be key columns. In an outer join, there is at least one row in the result for each row in the source.

Equijoin on Foreign Key

Given a primary key table m, foreign key table d and common key column k, an equijoin can be expressed in various SQL notations, among them,

        m,d WHERE m.k = d.k

        m INNER JOIN d ON m.k = d.k

A SELECT statement for this join refers to columns in the join by using dot notation based on the constituent tables.

SELECT d.cold, m.colm FROM m,d WHERE m.k = d.k

As we saw in Foreign Keys and Relations a join along a foreign key is accomplished with an enumeration in q. The join is implicit in the following select on the detail table.

select cold, k.colm from d

This generalizes to the situation where d has multiple foreign keys. Say d has foreign keys k1, k2, ... ,kn referring to primary key tables m1, m2, ... ,mn. Columns from the n-way join of d to the primary key tables are accessed via a select of the form,

select cold, k1.colm1, k2.colm2, ... , kn.colmn from d

For example, in the sp.q distribution script,

        select sname:s.name,pname:p.name,qty from sp
sname pname qty
---------------
smith nut   300
smith bolt  200
smith screw 400
smith screw 200
clark cam   100
smith cog   100
jones nut   300
jones bolt  400
blake bolt  200
clark bolt  200
clark screw 300
smith cam   400

Multi-way equijoins also arise when m and d are as above and additionally d has a primary key l. If s is a table with a foreign key whose enumeration domain is l, then m, d and s can be joined. In SQL,

SELECT m.colm, d.cold, s.cols from m,d,s WHERE m.k=d.k AND d.l=s.l

In q this is

select l.k.colm,l.cold,cols from s

Pseudo Join

It is possible to lookup a table's values in a keyed table even if there is no foreign key relationship defined. One method to achieve this is to perform a dictionary lookup in select. There is no requirement for column names to match and the result will be a left outer join.

In the following example, observe that we must transform the column to be looked up into the proper shape.

        kt:([k:101 102 103] v:`a`b`c)
        t:([] c1:101 103 104)
        select c1, v:kt[flip enlist c1;`v] from t
c1  v
-----
101 a
103 c
104

Here is an example with compound keys.

        t:([]c1:`a`b`c; c2:`x`x`z)
        ktc:([k1:`a`b`a; k2:`y`x`x] v:`one`two`three)
        select c1, c2, v:txf[ktc;(c1;c2);`v] from t
c1 c2 v
-----------
a  x  three
b  x  two
c  z

Ad hoc Left Join

You can also create a left outer join using the dyadic lj. The right operand is a keyed table (lookup) and the left operand is a table (source) having column(s) that match the key column(s) in lookup. In particular, source can have a foreign key defined over lookup. The ad hoc join lj uses lookup to map the records of the appropriate source column(s) and upserts source with the value column(s) from lookup.

In our example,

        tdetails lj tk
eid  sc  name       iq
-----------------------
1003 126 Prefect    126
1002 36  Beeblebrox 42
1001 92  Dent       98
1002 39  Beeblebrox 42
1001 98  Dent       98
1002 42  Beeblebrox 42

The same result can be obtained with a foreign key join by listing all the columns

        select eid, sc, eid.name, eid.iq from tdetails
eid  sc  name       iq
-----------------------
1003 126 Prefect    126
1002 36  Beeblebrox 42
1001 92  Dent       98
1002 39  Beeblebrox 42
1001 98  Dent       98
1002 42  Beeblebrox 42
Warning.png Note: The performance of an equijoin on a key is approximately 2.5 times faster than an ad hoc left join.

In contrast to the equijoin, an ad hoc left join does not require a column in the source table to be defined explicitly as a foreign key into the lookup keyed table.

        td:([] eid:1003 1001 1002 1001 1002; sc:126 36 92 39 98)
        td lj tk
eid  sc  name       iq
-----------------------
1003 126 Prefect    126
1001 36  Dent       98
1002 92  Beeblebrox 42
1001 39  Dent       98
1002 98  Beeblebrox 42
Warning.png Note: If the column(s) for the join are not foreign key(s) into the keyed table, the name(s) must match the key name(s).

Let's examine the general result of lj closely. Say t is the source table and kt is the lookup keyed table. For each record in t, the result has at least one record. If there are no records in kt whose values in the join column(s) match those in the corresponding column(s) of t, the t columns are present in the result and the remaining columns are null. If there are matching records in kt, for each match the result has a record comprising the catenation of the matching records.

        kt:([k:1 2 3] b:100 200 300)
        kt
k| b
-| ---
1| 100
2| 200
3| 300

         t:([]k:1 1 2 2 3 4; a:10 11 20 21 30 40)
         t
k a
----
1 10
1 11
2 20
2 21
3 30
4 40

         t lj kt
k a  b
--------
1 10 100
1 11 100
2 20 200
2 21 200
3 30 300
4 40
Information.png Advanced: The behavior of lj differs from that of a SQL outer join when there are duplicate columns in the two tables. The SQL left outer join will display both columns, whereas lj upserts the appropriate column items of the source table with those of the lookup keyed table.
         t2:([]k:1 2 3;b:10 20 30)
         t2
k  b
------
1 10
2 20
3 30

        kt2:([k:1 2 3 4]b:100 200 300 400)
        kt2
k| b
-| ---
1| 100
2| 200
3| 300
4| 400

        t2 lj kt2
k b
-----
1 100
2 200
3 300

Plus Join

The plus join pj is a type of left join that is useful for adding matching values in tables containing numeric data. As with an ad hoc join, the right operand of plus join is a keyed table (lookup) and the left operand is a table (source) having column(s) that match the key column(s) in lookup. The plus join pj uses lookup to map the records of the appropriate source column(s), zero filling nulls in the result from the lookup value column(s). It then performs a table add of this interim result into the source table.

For example,

        kt:([k1:1 2; k2:`x`y] a:10 20; b:1.1 2.2)
        t:([]k1:1 2 3; k2:`x`y`z; a:100 200 300)
        t pj kt
k1 k2 a   b
-------------
1  x  110 1.1
2  y  220 2.2
3  z  300 0

We examine the result of pj more closely. Each record of t has a corresponding record in the result.

Along the matching rows, the value columns from lookup kt are added to those of source t. In our example, this means that columns a and b are added into t on matching rows. Since a exists in both tables, corresponding values are added. According to the rules of table arithmetic, since b does not exist in t, it is implicitly assumed to have 0 values in t for the addition.

For non-matching rows, the values of the source t are extended with 0 in the columns of lookup.

Information.png Advanced: Note that the result in our example can also be obtained by the expression,
        t+0^kt[`k1`k2#t]
k1 k2 a   b
-------------
1  x  110 1.1
2  y  220 2.2
3  z  300 0

Union Join

The union join uj combines any two tables. In the result, the rows and columns of the left operand appear before those of the right operand. Column value lists are joined for common columns. For non-common columns, the value lists are extended with nulls so that they are the same length. The column value lists of the left operand have nulls appended, whereas those of the right operand have nulls prepended.

        t1:([]c1:1 2 3;c2:101 102 103;c3:`x`y`z)
        t2:([]c2:103 104 105 106;c4:`a`b`c`d)
        t1
c1 c2  c3
---------
1  101 x
2  102 y
3  103 z

        t2
c2  c4
------
103 a
104 b
105 c
106 d

        t1 uj t2
c1 c2  c3 c4
------------
1  101 x
2  102 y
3  103 z
   103    a
   104    b
   105    c
   106    d

Asof Join

The asof join is so-named because it is often used to join tables along time columns, but this is not a restriction. In general, the triadic function aj can be used to join two tables along common columns. Significantly, there is no requirement for any of the join columns to be keys. The syntax of asof join is,

aj[c1...cn;t1;t2]

where c1...cn is a symbol list of common column names for the join and t1 and t2 are the tables to be joined. The result is a table containing records from the left outer join of t1 and t2 along the specified columns.

For each record in t1, the result has one record containing all the items in t1. If there is no record in t2 whose values in the specified columns match those in the corresponding columns of t1, there are no further items in the result record. If there are matching records in t2, the items of the last (in row order) matching record are appended to those of the t1 record in the result.

For example,

        t:([]ti:10:01:01 10:01:03 10:01:04;sym:`msft`ibm`ge;qty:100 200 150)
        t
ti       sym  qty
-----------------
10:01:01 msft 100
10:01:03 ibm  200
10:01:04 ge   150

        q:([]ti:10:01:00 10:01:01 10:01:01 10:01:03;sym:`ibm`msft`msft`ibm; px:100 99 101 98)
        q
ti       sym  px
-----------------
10:01:00 ibm  100
10:01:01 msft 99
10:01:01 msft 101
10:01:03 ibm  98

        aj[`ti`sym;t;q]
ti       sym  qty px
---------------------
10:01:01 msft 100 101
10:01:03 ibm  200 98
10:01:04 ge   150

Parameterized Queries

Relational databases have the concept of stored procedures, which are programs that operate on tables via SQL statements. The programming languages that extend SQL are not part of the SQL standard, differ across vendors and the capabilities of the programming environments vary greatly.

This situation forces a programmer to make a difficult choice: pay a steep price in programming power to place functionality close to the data, or extract the data into an application server in order to perform calculations. Multi-tier architectures with separate database and application servers have evolved largely to address this problem, but they increase cost and complexity.

This choice is obviated in kdb+, since the q programming environment has all the power and performance you need. In fact, q is much faster than traditional database programming environments for retrieval and calculations on large time series. Other components of the application can perform their data retrieval and manipulation by making calls to q.

Traditional calls to a database are made via stored procedures, which are programs executed by the database manager. Often the stored procedure has parameters that supply specific values to the queries. Such parameters are limited to the basic data types of SQL.

Any q program can serve as a stored procedure; there is no distinction between data retrieval and calculations. Any valid q expression that operates on tables or dictionaries can be invoked in a function. Function parameters can be used to supply specific values for queries. In particular, the select, update and delete templates can be invoked within a function by using parameters to pass specific values to the query. Such a function is called a parameterized query.

Warning.png Important: Parameterized queries have restrictions. First, a parameterized query cannot use implicit function parameters. Second, columns cannot be passed as parameters.

In the following example using our tdetails table, we pass a specific value for a foreign key match criterion.

        getScByEid:{[e] select from tdetails where eid=e}
        getScByEid 1003
eid  sc
--------
1003 126

This example can be generalized to handle a scalar or list argument.

        getScByEid:{[e] select from tdetails where eid in ((),e)}
        getScByEid 1001
eid  sc
-------
1001 92
1001 98

        getScByEid 1001 1003
eid  sc
--------
1003 126
1001 92
1001 98

The last expression in the revised function definition warrants closer examination. The empty-list join turns a scalar argument into a list and has no effect on a list. It must be enclosed in parentheses because it appears in a phrase in select, otherwise the comma would be interpreted as a separator.

You can pass a table as a parameter to a stored procedure. Suppose we have multiple trade tables, all having at the columns px (price) and date in common. The following parameterized query returns the maximum price over a specified date range from any trade table.

        maxpx:{[t;range] select max px from t where date within range}

Here t is a trade table and range is a list of two dates in increasing order.

Information.png Advanced: You can effectively parameterize column names in two ways. First, you can mimic a common technique from SQL in which the query is created dynamically: build the query text in a string and then pass the string to value for execution. There is a performance penalty for this approach. Also, you must remember to escape special characters in the string.

The second method is to use the functional form of the query, which has no performance penalty. In the functional form, all columns are referred to by name, so columns names are passed as symbols.

Views

In SQL, a view is essentially a stored procedure whose result set is used like a table. Views are used to encapsulate such data transformations as hiding data columns or rows, renaming columns, or simplifying complex queries. Q-sql implements a view as an alias to a query.

View

A view is a named query created as an alias with the double assignment (::) operator. In the following, the double–colon signifies that v is an alias for the query rather than the current result of the query.

        t:([] c1:`a`b`c; c2:1 2 3)
        v::select c1 from t where c2=2
        v
c1
--
b

When the content of the underlying table changes, the result will be reflected in the view. This is not true of the equivalent single assignment.

        r:select c1 from t where c2=2
        `t insert (`d;2)
,3

        t
c1 c2
-----
a  1
b  2
c  3
d  2

        r
c1
--
b

        v
c1
--
b
d

Functional Forms

The functional forms of select, update and delete can be used in any situation but are especially useful for programmatically generated queries, such as when column names are dynamically produced. The functional forms are,

        ?[t;c;b;a]                / select

        ![t;c;b;a]                / update and delete

where t is a table, a is a dictionary of aggregates, b is a dictionary of groupbys and c is a list of constraints.

Warning.png Note: All q entities in a, b and c must be referenced by name, meaning they appear as symbols containing the entity names.

The q interpreter parses the syntactic forms of select, exec, update and delete into their equivalent functional forms, so there is no performance difference.

Information.png Advanced: The function parse can be applied to a string containing a query template to produce a parse tree whose items are close to the arguments of the equivalent functional form. See the description of parse in Appendix A for more details.

Functional select

Let's start with a simple select example.

        t:([]n:`x`y`x`z`z`y;p:0 15 12 20 25 14)
        t
n p
----
x 0
y 15
x 12
z 20
z 25
y 14

        select m:max p,s:sum p by name:n from t where p>0,n in `x`y
name| m  s
----| -----
x   | 12 12
y   | 15 29

Following is the equivalent functional form. Note the use of enlist to create singletons, ensuring that appropriate entities are lists.

        c: ((>;`p;0);(in;`n;enlist `x`y))
        b: (enlist `name)!enlist `n
        a: `m`s!((max;`p);(sum;`p))
        ?[t;c;b;a]
name| m  s
----| -----
x   | 12 12
y   | 15 29

Of course, the functional form can be written without the intermediate variables a, b and c. We leave this as an exercise to the macho coder.

The general form of functional select is,

         ?[t;c;b;a]

where t is a table, c is a list of where specifications (constraints), b is a dictionary of grouping specifications (by phrase), and a is a dictionary of select specifications (aggregations).

Every item in c is a triple consisting of a boolean or int valued dyadic function together with its arguments, each an expression containing column names and other variables. The function is applied to the two arguments, producing a boolean vector. The resulting boolean vector selects the rows that yield non-zero results. The selection is performed in the order of the items in c, from left to right.

The domain of b is a list of symbols that are the key names for the grouping. The range of b is a list of column expressions whose results are used to construct the groups. The grouping is ordered by the domain elements, from major to minor.

The domain of a is a list of symbols containing the names of the produced columns. Each element of the range of a is an evaluation list consisting of a function and its argument(s), each of which is a column name or another such result list. For each evaluation list, the function is applied to the specified value(s) for each row and the result is returned. The evaluation lists are resolved recursively when operations are nested.

Warning.png Note: Here are the degenerate cases: For no constraints, make c the empty (general) list For no grouping make b a boolean 0b To produce all columns of the original table in the result, make a the empty list

For example,

        select from t                / is equivalent to functional form
        ?[t;();0b;()]                   / degenerate case for c, b, a

Functional exec

The functional form of exec is a simplified form of select. Since the constraint parameter is the same as in select, we omit it in the following.

In the simplest example of a single result column, the groupby parameter is the empty list and the aggregate parameter is a symbol atom.

	exec n from t
`x`y`x`z`z`y
	?[t;();();`n] 	/ same as previous exec
`x`y`x`z`z`y

In the same query with multiple columns, the groupby parameter is the empty list and the aggregate parameter is a dictionary as it would be in a select. Remember that the result is a dictionary rather than a table.

	exec n,p from t
n| x y  x  z  z  y
p| 0 15 12 20 25 14
	?[t;();();`n`p!`n`p] 	/ same as previous exec
n| x y  x  z  z  y
p| 0 15 12 20 25 14

If you wish to group by a single column, specify it as a symbol atom.

	exec p by n from t
x| 0  12
y| 15 14
z| 20 25
	?[t;();`n;`p] 		/ same as previous exec
x| 0  12
y| 15 14
z| 20 25

More complex examples of exec seem to reduce to the equivalent select.

Functional update

The functional form of update is completely analogous to that of select. Again note the use of enlist to create singletons to ensure that appropriate entities are lists.

        update p:max p by n from t where p>0
n p
----
x 0
y 15
x 12
z 25
z 25
y 15

        c: enlist (>;`p;0)
        b: (enlist `n)!enlist `n
        a: (enlist `p)!enlist (max;`p)
        ![t;c;b;a]
n p
----
x 0
y 15
x 12
z 25
z 25
y 15
Warning.png Note: The degenerate cases are the same as in functional select.

Functional delete

The functional form of delete is a simplified form of functional update,

         ![t;c;0b;a]

where t is a table, c is a list of where specifications (constraints) and a is a list of column names. Either c or a, but not both, must be present. The list of constraints, which has the same format as in functional select and update, chooses which rows will be removed. The aggregates argument is a simple list of symbols with the names of columns to be removed.

In the following examples, note the use of enlist to create singletons to ensure that appropriate entities are lists.

        t:([]c1:`a`b`c;c2:`x`y`z)

        / following is: delete c2 from t
        ![t;();0b;enlist `c2]
c1
--
a
b
c

        / following is: delete from t where c2 = `y
        ![t;enlist (=;`c2; enlist `y);0b;`symbol$()]
c1 c2
-----
a  x
c  z

Examples

In this section we demonstrate many of the capabilities of q-sql using semi-serious examples taken from the world of finance. We create a sample table representing a month's worth of trades for a small set of American stocks. To make things easy, we treat all trades as buys.

The Table Schemas

Our vastly over-simplified trading example involves two tables. The instrument table is a reference keyed table that contains basic information about the companies whose financial instruments (stocks in our case) are traded. Its schema has fields for the stock symbol, the name of the company and the industry classification of the company.

	instrument:([sym:`symbol$()] name:`symbol$(); industry:`symbol$())
        instrument
sym| name industry
---| -------------

The trade table represents a collection of trades. Each trade record comprises: the symbol of the instrument; the date and time of the trade; the quantity—i.e. number of shares traded; and the price of the trade.

	trade:([] sym:`instrument$(); date:`date$(); time:`time$(); quant:`int$();px:`float$())
	trade
sym date time quant px
----------------------
Warning.png Note: In practice, the trade table would likely be partitioned by day on disk and only the current day's trades would be stored in memory.

Creating the Tables

Populating the instrument reference table is done via simple inserts.

 `instrument insert (`ibm; `$"International Business Machines"; `$"Computer Services")
 `instrument insert (`msft; `$"Microsoft"; `$"Software")
 `instrument insert (`g; `$"Google"; `$"Internet")
 `instrument insert (`intc; `$"Intel"; `$"Semiconductors")
 `instrument insert (`gm; `$"General Motors"; `$"Automobiles")
 `instrument insert (`ge; `$"General Electric"; `$"Diversified Industries")

Here is the console display of instrument,

	instrument
sym | name                            industry
----| ------------------------------------------------------
ibm | International Business Machines Computer Services
msft| Microsoft                       Software
g   | Google                          Internet
intc| Intel                           Semiconductors
gm  | General Motors                  Automobiles
ge  | General Electric                Diversified Industries

In order to populate the trade table with somewhat realistic data, we create an auxiliary function. The filltrade function takes the name of the target trade table, a stock symbol, a median price and a count. It populates the named table with simulated trade data for the month of Jan 2007. The trades are randomly distributed across days and times. The quantities occur in multiples of 10. The prices are uniformly distributed around the median price. We do not claim that this represents realistic trade data; only that it is sufficient to serve our query examples.

filltrade:{[tname;s;p;n]
	// tname is name of target table
	// s is stock symbol
	// p is median price
	// n is count of items
	//
	/ sym column duplicates stock symbol n times
	sc:n#s;
	/ date column has n random days in Jan 2007
	dc:2007.01.01+n?31;
	/ time column has n random times
	tc:n?24:00:00.000;
	/ quantity column has n random multiples of 10
	qc:10*n?1000;
	/ price column has n random prices that are
	/ distributed uniformly around p
	/ prices are in pennies
	pc:.01*floor (.9*p)+n?.2*p*:100;
	/ bulk insert columns into target table
	tname insert (sc;dc;tc;qc;pc)
	}

	filltrade[`trade;`ibm;115;10000]
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 ..

	trade
sym date       time         quant px
----------------------------------------
ibm 2007.01.15 02:32:54.217 9280  111.59
ibm 2007.01.20 08:56:05.985 9960  110.69
ibm 2007.01.24 19:20:17.727 5970  114.58
ibm 2007.01.21 08:44:50.939 1090  113.32
..

We invoke filltrade on each of the remaining instruments.

	filltrade[`trade;`msft;30;5000]
10000 10001 10002 10003 10004 10005 10006 10007 10008 10009 10010..
	filltrade[`trade;`g;540;12000]
15000 15001 15002 15003 15004 15005 15006 15007 15008 15009 15010..
	filltrade[`trade;`intc;25;4000]
27000 27001 27002 27003 27004 27005 27006 27007 27008 27009 27010..
	filltrade[`trade;`ge;40;9000]
31000 31001 31002 31003 31004 31005 31006 31007 31008 31009 31010..
	filltrade[`trade;`gm;35;3000]
40000 40001 40002 40003 40004 40005 40006 40007 40008 40009 40010..

Finally, we sort trade by date and time so that it represents trades as they came in.

	`date`time xasc `trade
`trade
	trade
sym  date       time         quant px
-----------------------------------------
intc 2007.01.01 00:00:04.569 5440  26.63
ge   2007.01.01 00:02:24.871 8280  40.11
gm   2007.01.01 00:02:43.419 4280  32.13
ibm  2007.01.01 00:03:06.278 5070  105.73
intc 2007.01.01 00:03:24.229 1740  24.47
gm   2007.01.01 00:04:17.590 830   36.53
gm   2007.01.01 00:04:18.227 5060  33.02
ge   2007.01.01 00:04:18.772 8290  43.73
msft 2007.01.01 00:06:01.424 5170  27.71
..

Basic Queries

In this section, we demonstrate the use of basic q-sql to query the trade and instrument tables we have created.

We can count the total number of trades in several ways.

	count trade
43000
	select count i from trade
x
-----
43000
	exec count i from trade
43000

We can count the number of trades for an individual symbol.

	exec count i from trade where sym=`ibm
10000
	count select from trade where sym=`ibm
10000

Observe that the former retrieves only a single record from the query whereas the latter retrieves all matching records and then counts them.

We can count the number of trades across all symbols.

	select count i by sym from trade
sym | x
----| -----
g   | 12000
ge  | 9000
gm  | 3000
ibm | 10000
intc| 4000
msft| 5000

	() xkey select count i by sym from trade
sym  x
----------
g    12000
ge   9000
gm   3000
ibm  10000
intc 4000
msft 5000

Observe that the former retrieves the results as a keyed table and the latter removes the key.

We find one day's trades for GM.

	select from trade where sym=`gm, date=2007.01.07
sym date       time         quant px
---------------------------------------
gm  2007.01.07 00:29:31.311 4390  32.24
gm  2007.01.07 00:29:57.886 1270  38.08
gm  2007.01.07 00:30:35.671 3370  35.67
gm  2007.01.07 00:30:43.216 8090  36.77
gm  2007.01.07 00:44:26.336 1800  35.03
..

We find all lunch hour trades for GM.

	select from trade where sym=`gm, time within (12:00:00;13:00:00)

sym date       time         quant px
---------------------------------------
gm  2007.01.01 12:01:32.133 7960  33.61
gm  2007.01.01 12:37:45.021 8480  31.84
gm  2007.01.01 12:39:46.197 5350  32.34
gm  2007.01.01 12:57:13.215 1090  33.34
gm  2007.01.02 12:53:06.764 1080  31.63
..

We find the maximum daily price for GE. Due to our simplistic construction, it is statistically constant.

	select maxpx:max px by date from trade where sym=`ge
date      | maxpx
----------| -----
2007.01.01| 43.97
2007.01.02| 43.99
2007.01.03| 43.99
2007.01.04| 43.98
..

We find the minimum and maximum trade price over the time span for each symbol and display the result by company name. The latter resolves the foreign key to the instrument table with an implicit inner join.

	select lo:min px, hi:max px by sym.name from trade

name                           | lo    hi
-------------------------------| ------------
General Electric               | 36    43.99
General Motors                 | 31.5  38.49
Google                         | 486   593.99
Intel                          | 22.5  27.49
International Business Machines| 103.5 126.49
Microsoft                      | 27    32.99

We find the total and average trade volume for three symbols. Due to our simplistic construction, the latter are statistically the same.

	select totq:sum quant, avgq:avg quant by sym from trade where sym in`ibm`msft`g
sym | totq     avgq
----| -----------------
g   | 59748830 4979.069
ibm | 49983940 4998.394
msft| 24988910 4997.782

We find the daily volume weighted average price for Intel.

	select vwap:quant wavg px by date from trade where sym=`intc
date      | vwap
----------| --------
2007.01.01| 24.86849
2007.01.02| 25.00113
2007.01.03| 24.82538
2007.01.04| 24.98049
2007.01.05| 25.27898
..

We find the high, low and close over one minute intervals for Intel.

	select hi:max px,lo:min px,close:last px by date, time.minute from trade where sym=`intc

date       minute| hi    lo    close
-----------------| -----------------
2007.01.01 00:12 | 23.3  23.3  23.3
2007.01.01 00:17 | 24.03 24.03 24.03
2007.01.01 00:26 | 24.45 24.45 24.45
2007.01.01 00:51 | 25.73 25.73 25.73
2007.01.01 00:55 | 25.34 25.34 25.34
..

We demonstrate how to use your own functions in queries. Suppose we define a funky average that weights items by their position.

	favg:{(sum x*1+til count x)%(count x)*count x}

Then we can apply this just as we did the built-in q function avg.

	select favgpx:favg px by sym from trade
sym | favgpx
----| --------
g   | 270.0021
ge  | 19.99897
gm  | 17.51145
ibm | 57.53255
intc| 12.48081
msft| 15.00309

Meaty Queries

In this section, we demonstrate more interesting q-sql against the trade table.

We find the volume weighted average price over 5 minute intervals for intel.

	select vwap:quant wavg px by date, bucket:5 xbar time.minute from trade where sym=`intc
date       bucket| vwap
-----------------| --------
2007.01.01 00:10 | 23.3
2007.01.01 00:15 | 24.03
2007.01.01 00:25 | 24.45
2007.01.01 00:50 | 25.73
2007.01.01 00:55 | 25.34
..

We use favg from the previous section to demonstrate that user functions can appear in any phrase of the query.

	select from trade where px<2*(favg;px) fby sym
sym  date       time         quant px
-----------------------------------------
gm   2007.01.01 00:06:02.168 5270  33.6
g    2007.01.01 00:07:36.023 9340  527.71
g    2007.01.01 00:09:46.313 3640  491.6
intc 2007.01.01 00:12:05.909 610   23.3
ibm  2007.01.01 00:12:17.056 6410  112.92
..

We find the average daily volume and price for all instruments and store the result for the next example.

	atrades:select avgqt:avg quant, avgpx:avg px by sym, date from trade
	atrades
sym date      | avgqt    avgpx
--------------| -----------------
g   2007.01.01| 5098.892 542.3796
g   2007.01.02| 5021.136 538.6672
g   2007.01.03| 5114     539.1208
g   2007.01.04| 4712.385 541.5371
g   2007.01.05| 5202.108 539.6128
..

We find the days when the average price went up. Note that we must explicitly exclude the first day because deltas is funky on its first value. Observe that the avpx column scrolls off the page.

	select date, avgpx by sym from atrades where 0<{0,1_deltas x} avgpx
sym | date
----| -------------------------------------------...
g   | 2007.01.03 2007.01.04 2007.01.06 2007.01.08...
ge  | 2007.01.02 2007.01.04 2007.01.06 2007.01.08...
gm  | 2007.01.02 2007.01.04 2007.01.05 2007.01.07...
ibm | 2007.01.01 2007.01.03 2007.01.05 2007.01.08...
intc| 2007.01.04 2007.01.05 2007.01.08 2007.01.10...
msft| 2007.01.01 2007.01.02 2007.01.04 2007.01.07...

To see a more representative display, take only the first few field values.

	select 2#date, 2#avgpx by sym from atrades where 0<{0,1_deltas x} avgpx
sym | date                  avgpx
----| ---------------------------------------
g   | 2007.01.03 2007.01.04 539.1208 541.5371
ge  | 2007.01.02 2007.01.04 39.98092 40.115
gm  | 2007.01.02 2007.01.04 35.13107 35.25371
ibm | 2007.01.01 2007.01.03 115.1667 115.1036
intc| 2007.01.04 2007.01.05 24.83024 25.18836
msft| 2007.01.01 2007.01.02 29.73195 30.03784

We can denormalize trade to obtain a keyed table with one row and complex columns for each symbol. We display the first two items of each field to make the structure more evident.

	dntrades:select date,time,quant,px by sym from trade
	select 2#date,2#time,2#quant,2#px by sym from trade
sym | date                  time                      quant     px
----| -----------------------------------------------------------------------
g   | 2007.01.01 2007.01.01 00:09:54.444 00:12:34.851 4670 3080 591.05 523.08
ge  | 2007.01.01 2007.01.01 00:02:24.871 00:04:18.772 8280 8290 40.11  43.73
gm  | 2007.01.01 2007.01.01 00:02:43.419 00:04:17.590 4280 830  32.13  36.53
ibm | 2007.01.01 2007.01.01 00:03:06.278 00:06:27.951 5070 9740 105.73 117.76
intc| 2007.01.01 2007.01.01 00:00:04.569 00:03:24.229 5440 1740 26.63  24.47
msft| 2007.01.01 2007.01.01 00:06:01.424 00:23:28.908 5170 1370 27.71  29.86

In such a complex table or keyed table, you must use each to apply a monadic (unary) function across the items in a field.

	select sym,cnt:count each date, avgpx:avg each px from dntrade

	/ or the following alternate notation is equivalent

 	select sym,cnt:each[count] date, avgpx: each[avg] px from dntrade

sym  cnt   avgpx
-------------------
g    12000 540.0778
ge   9000  39.99574
gm   3000  34.98716
ibm  10000 114.978
intc 4000  24.96621
msft 5000  29.98583

We can also apply our own monadic favg function with each.

	select sym, favgpx:favg each px from dntrades
sym  favgpx
-------------
g    269.94
ge   19.98121
gm   17.48443
ibm  57.49667
intc 12.48413
msft 15.0314

We find the volume weighted average price by applying the dyadic wavg. In this case we must use the each-both adverb '. Observe that our simplistic construction makes the average price and volume weighted average price statistically the same.

	select sym, vwap:quant wavg' px from dntrade
	/ is equivalent to the alternate notation
	select sym, vwap:wavg'[quant;px] from dntrade	
sym  vwap
-------------
g    540.1832
ge   40.00807
gm   34.95398
ibm  114.9836
intc 24.97542
msft 29.96661

Note that the latter form generalizes to n-adic functions for any n>1.

We find the profit of the ideal transaction over the month for each symbol. This is the maximum amount of money that could be made with 20-20 hindsight. In other words, find the largest profit obtainable by buying at any traded price and selling at the highest subsequently traded price. To solve this, we reverse the perspective. For each traded price, we look at the minimum prices that preceeded it. The largest such difference is our answer.

	select max px-mins px by sym from trade
sym | px
----| ------
g   | 107.99
ge  | 7.99
gm  | 6.99
ibm | 22.99
intc| 4.99
msft| 5.99

Remote Queries

In this section, we demonstrate how to execute q-sql queries against a remote server. We assume that our sample tables have been created in a q instance (the server) that is listening on some port, say 5042. We also assume that we have another q process (the client) with an open handle h to the server. See IO for details on how to connect to remote processes in q. The following expressions are all executed on the client.

We can ask the server to list its tables.

	h "tables `."
`dntrades`instrument`trade

We can ask the server for the count of its trade table.

	h "count trade"
43000

We look up a name by sym. Observe the result is a vector.

	h "exec sym from instrument where name=`Intel"
,`intc

We can look up a sym by name. Observe the necessity of escaping the double quotes inside the dynamic q-sql string.

        h "exec name from instrument where name=`$\"General Electric\""
,`General Electric

We can construct a query on the client and send it to the server along with parameters to be executed.

	qdaily:{[s;d] select from trade where sym=s, date=d}
	h (qdaily;`g;2007.01.12)
sym date       time         quant px
----------------------------------------
g   2007.01.12 00:03:24.082 3570  507.44
g   2007.01.12 00:05:31.920 2900  588.99
..

We can construct the same query on the server and execute it remotely.

        h "qdaily:{[s;d] select from trade where sym=s, date=d}"
	/ verify that it's there
	h "qdaily"
{[s;d] select from trade where sym=s, date=d}

	/ execute it
	h "qdaily[`msft;2007.01.31]"
sym  date       time         quant px
----------------------------------------
msft 2007.01.31 00:00:41.237 9940  29.65
msft 2007.01.31 00:01:36.508 580   27.19
..

Prev: Tables Next: Execution Control

Table of Contents

©2006-2007 Kx Systems, Inc. and Continuux LLC. All rights reserved.

Personal tools
Namespaces
Variants
Actions
Navigation
Toolbox