QforMortals/queries q sql

From Kx Wiki
Jump to: navigation, search

Contents

Queries: q-sql

Overview

Q has a collection of functions for manipulating tables that are similar to their counterparts in SQL. This collection, which we call q-sql, includes the usual suspects such as insert, select, update, etc., as well as functionality that is not available in traditional SQL. While q-sql provides a superset of SQL functionality, there are some significant differences in the syntax and behavior.

The first important difference is that a q table has well-defined record and column orders. This is particularly useful in dealing with the situation in which records are inserted in a canonical order. Subsequent actions against the table retrieve records in insert order. For example, a time series can be created by inserting, in time order, pairs consisting of a datetime value and a data value. The result of a select will be in time order, without a sort.

A second difference is that every q table is stored physically as a collection of column vectors. This means that operations on column data are easy and fast since atomic, aggregate or uniform functions applied to columns are vector operations.

A third difference is that q-sql provides upsert semantics. This means that one dataset can be applied to another without the need to separate inserts from updates. This can simplify things significantly in practice.

In this chapter, we cover the important features of q-sql, which includes all the basic operations in kdb+. We demonstrate each feature with a trivial example. Gradually examples that are more complex are introduced.

We shall present some examples based on the sp.q distribution script. The schemas for the tables in the script are,

	s:([s:()]name:();status:();city:())
	p:([p:()]name:();color:();weight:();city:())
	sp:([]s:`s$();p:`p$();qty:())

The contents of the tables are,

	show s
s | name  status city
--| -------------------
s1| smith 20     london
s2| jones 10     paris
s3| blake 30     paris
s4| clark 20     london
s5| adams 30     athens
	show p
p | name  color weight city
--| -------------------------
p1| nut   red   12     london
p2| bolt  green 17     paris
p3| screw blue  17     rome
p4| screw red   14     london
p5| cam   blue  12     paris
p6| cog   red   19     london
	show sp
s  p  qty
---------
s1 p1 300
s1 p2 200
s1 p3 400
s1 p4 200
s4 p5 100
s1 p6 100
s2 p1 300
s2 p2 400
s3 p2 200
s4 p2 200
s4 p4 300
s1 p5 400

Insert

Insert appends records to a table.

Basic Insert

To add records to a table, use the dyadic function insert.

  insert[st;L]

where st is a symbol containing the name of a table (target) and L is a list whose items correspond to a record of the table. The result of insert is a list of int representing the positions of the new record(s).

Information.png Since the items in L are appended to the column vectors, each value must match the type of the corresponding column vector.

For a regular table, the effect of insert is to append a new record holding the specified values. Let's use our simple example,

	t:([] name:`Dent`Beeblebrox`Prefect; iq:42 98 126)

Insert a record into t as follows,

	insert[`t;(`Slaartibartfast;156)]
,3
	show t
name            iq
-------------------
Dent            42
Beeblebrox      98
Prefect         126
Slaartibartfast 156

Alternate Forms

Since the dyadic insert is also a verb, it can take various notational forms. For example, the previous insert can also be written as a binary operator,

	`t insert (`Slaartibartfast; 156)

It can also be expressed as a projection onto the first argument with juxtaposition of the second argument,

	insert[`t] (`Slaartibartfast; 156)

You may find one of these more readable. We shall use them interchangeably.

You can also insert a record, as opposed to a list of row values.

	`t insert `name`iq!(`Slaartibartfast; 156)

This is useful when you wish to insert a table of the results of a select.

Repeated Inserts

In a (non-keyed) table, repeatedly inserting the same data is permissible, but it results in duplicate records.

	t:([] name:`Dent`Beeblebrox`Prefect; iq:42 98 126)
	`t insert (`Slaartibartfast; 156)                    / one form
	insert[`t] (`Slaartibartfast; 156)                   / equivalent form
	show t
name            iq
-------------------
Dent            42
Beeblebrox      98
Prefect         126
Slaartibartfast 156
Slaartibartfast 156

Bulk Insert

Since a table is a list of records, it is possible to bulk insert one table into another provided the columns are compatible.

	t:([] name:`Dent`Beeblebrox`Prefect; iq:42 98 126)
	tnew:([] name:`Slaartibartfast`Mickey; iq:158 1042)
	`t insert tnew
3 4
	show t
name            iq
--------------------
Dent            42
Beeblebrox      98
Prefect         126
Slaartibartfast 158
Mickey          1042

Insert into Keyed Tables

Inserting data into a keyed table works just like inserting data into a regular table, with the additional requirement that the key must not already exist in the table. Using our previous example of a keyed table,

	t:([eid:1001 1002] name:`Dent`Beeblebrox; iq:42 98)
	show t
eid | name       iq
----| -------------
1001| Dent       42
1002| Beeblebrox 98
	`t insert (1004; `Slaartibartfast; 158)
	show t
eid | name            iq
----| -------------------
1001| Dent            42
1002| Beeblebrox      98
1004| Slaartibartfast 158

The following insert fails because the key 1004 already exists in t,

	`t insert (1004; `Slaartibartfast; 158)
'insert

Observe that by default the records in a keyed table are stored in insert order rather than key order.

	`t insert (1003; `Prefect; 126)
	show t
eid | name            iq
----| -------------------
1001| Dent            42
1002| Beeblebrox      98
1004| Slaartibartfast 158
1003| Prefect         126

Insert into Empty Tables

We consider the situation of an empty table with no column types specified. The column types are inferred from the first insert.

	t:([] name:(); iq:())
	type t.name
0h
	type t.iq
0h
	`t insert (`Dent; 42)
,0
	type t.name
11h
	type t.iq
6h

If you define an empty table without types, be especially careful to get the first insert correct.

	`t insert (42; `Dent)
	`t insert (`Beelbebrox; 98)
'type

It is preferrable to define the empty table with types. In our example,

	t:([] name:0#`; iq:0#0)
	type t.name
11h
	type t.iq
6h

Insert and Foreign Keys

When inserting data into a table that has a foreign key, everything works as for a regular table, except that a value destined for a foreign key column must already exist as a key in the corresponding primary key table. Returning to our example of the previous section,

	kt:([eid:1001 1002 1003] name:`Dent`Beeblebrox`Prefect; iq:42 98 126)
	tdetails:([] eid:`kt$1003 1001 1002 1001 1002; sc:126 36 92 39 98)
	show kt
eid | name       iq
----| --------------
1001| Dent       42
1002| Beeblebrox 98
1003| Prefect    126
	show tdetails
eid  sc
--------
1003 126
1001 36
1002 92
1001 39
1002 98
	`tdetails insert (1001; 42)
	show tdetails
eid  sc
--------
1003 126
1001 36
1002 92
1001 39
1002 98
1001 42

The following insert fails because the key 1004 does not exist in kt.

	`tdetails insert (1004; 158)
'cast

The select and exec Templates

In this section, we investigate the general form of select, which we met briefly. We present select as a template having required and optional elements. The template elements, in turn, contain phrases with expressions involving column values. The q interpreter applies the template against the specified table to produce a result table. While the syntax and results resemble those of the analogous SQL statement, the underlying mechanics are different.

Syntax

We shall examine each of the constituents of the select template in detail. Our approach is to introduce the concepts with illustrative examples using trivial tables and then to proceed with more meaningful examples using time series. Here are our simple table definitions:

	tk:([eid:1001 1002 1003] name:`Dent`Beeblebrox`Prefect; iq:42 98 126)
	tdetails:([] eid:`tk$1003 1001 1002 1001 1002 1001; sc:126 36 92 39 98 42)

The select template has the following form, where elements enclosed in matching angle brackets (<...>) are optional,

select <ps> <by pb> from texp <where pw>

The select and from keywords are required, as is texp, which is any q expression whose result is a table. The elements ps, pb, and pw are the select, the by and the where phrases, respectively. The result of select is a list of records or, equivalently, a table.

Warning.png If where is present and texp is the result of select, the expression that produces texp must be enclosed in parentheses.

Some simple examples follow.

	show select from tk
eid | name       iq
----| --------------
1001| Dent       42
1002| Beeblebrox 98
1003| Prefect    126
	show select eid,name from tk where name=`Dent
eid  name
---------
1001 Dent
	show select cnt:count sc by eid.name from tdetails
name      | cnt
----------| ---
Beeblebrox| 2
Dent      | 3
Prefect   | 1
	show select topsc:max sc, cnt:count sc by eid.name from tdetails where eid.name<>`Prefect
name      | topsc cnt
----------| ---------
Beeblebrox| 98    2
Dent      | 39    2

The order of execution for select is:

  1. from expression texp
2. where phrase pw
3. by phrase pb
4. select phrase ps

In particular, the from expression is always evaluated first and the select phrase last.

Information.png If ps is absent, all columns are returned. There is no need for the * wildcard of SQL.

Each phrase in the select template is a comma-separated list of subphrases. A subphrase is an expression involving columns of texp or a table related to texp via foreign key. The subphrases within a phrase are evaluated left-to-right, but each expression comprising a subphrase is parsed right-to-left, like any q expression.

Warning.png The commas separating the subphrases are separators, meaning that it is not necessary to enclose a subphrase in parentheses. However, any expression containing the join operator (,) must be enclosed in parentheses to distinguish it from the separator.

The where Phrase

The where phrase controls which records appear in the result. The action of this phrase is a generalization of the built-in where function.

Each subphrase is a criterion on columns. It produces a boolean result vector corresponding to records passing or failing the criterion. The effect of a where subphrase is to select only the records that pass its criterion.

The individual where subphrases are applied from left-to-right. Each step results in an interim table extracted from the previous one by the where subphrase. The net effect is a series of progressively narrowed interim tables.

	show select from tk where iq <100
eid  name       iq
------------------
1001 Dent       42
1002 Beeblebrox 98
	show select from tdetails where eid=1001
eid  sc
-------
1001 36
1001 39
1001 42
	show select from tdetails where eid=1001,sc<eid.iq
eid  sc
-------
1001 36
1001 39
	show select from tdetails where (eid=1001)&sc<eid.iq
eid  sc
-------
1001 36
1001 39

We point out that the last two queries return the same result but execute differently; we shall see more about this later. Also, note that the parentheses in the last query are necessary due to right-to-left evaluation of expressions.

The select Phrase

The select phrase controls which columns appear in the result. Each select subphrase produces a column. The name of the result column from each subphrase is taken from the last underlying column referenced in the subphrase evaluation unless it is aliased by assignment.

	show select LastName:name,iq from kt
LastName   iq
--------------
Dent       42
Beeblebrox 98
Prefect    126

If a column is repeated in the select phrase, it appears more than once in the result. This behaves like SQL SELECT.

	show select iq,iq from kt
iq  iq
-------
42  42
98  98
126 126
Warning.png A virtual column i holding the position of each record is implicitly available in the select phrase. This is useful, for example, in aggregation if you want a column with record counts without reference to a specific column name.
	show select cnt:count i by eid from tdetails
eid | cnt
----| ---
1001| 3
1002| 2
1003| 1

In this situation, i plays a role somewhat similar to * in SQL, but is more useful since it can be used to select specific records. For example, criteria on i can be used to fill only one page of results when you do not wish to transmit an entire result set. Here is the second page of detail records for a page size of 3.

	show select from tdetails where (3<=i) and i<6
eid  sc
-------
1001 39
1002 98
1001 42

This is difficult to do in SQL and vendors have added proprietary extensions to handle it.

The by Phrase

The by phrase controls how rows are grouped in the result. The action of this phrase is a generalization of the built-in group function.

Each by subphrase is an expression involving a column. It produces a grouping criterion for that column. The columns resulting from the by phrase become the primary keys of the select result. Multiple subphrases in the by phrase result in a compound primary key in the result.

Information.png If the by phrase is omitted, the result of select has no primary key.
Information.png Every column included in the by phrase is automatically included in the result and should not by included separately in the select phrase.

It is possible to group without aggregation. The result is a table with non-simple lists for columns—that is, non-atomic column values.

	show select sc by eid from tdetails
eid | sc
----| --------
1001| 36 39 42
1002| 92 98
1003| 126

This cannot be achieved easily with GROUP BY in SQL.

It is more common to aggregate in tandem with grouping using aggregation functions. For example,

	show select maxsc:max sc, avgsc:avg sc by eid from tdetails
eid | maxsc avgsc
----| ----------
1001| 42    39
1002| 98    95
1003| 126   126

The exec Template

The syntax of the exec template is identical to select.

exec <ps> <by pb> from texp <where pw>

The difference is that the result is not a table.

If only one column is produced by the select phrase, the result of exec is a list containing the column values produced. This contrasts with select, which produces a table with a single column in this situation. With kt as above,

	show kt
eid | name       iq
----| --------------
1001| Dent       42
1002| Beeblebrox 98
1003| Prefect    126
	show select name from kt
name
----------
Dent
Beeblebrox
Prefect
	exec name from kt
`Dent`Beeblebrox`Prefect

If more than one column is produced by the select phrase, the result of exec is a column dictionary containing the column values produced. This contrasts with select, which produces a table with the columns.

	show select eid,name from tk
eid  name
---------------
1001 Dent
1002 Beeblebrox
1003 Prefect
	exec eid,name from tk
`eid`name!(1001 1002 1003;`Dent`Beeblebrox`Prefect)

Using distinct in select

The built-in distinct function applied to a source table returns a table containing the unique records in the source. By including distinct in the select phrase of a select or exec query, you can similarly suppress duplicates from the result.

	show tdup
c1 c2
-----
10 a
20 b
10 a
30 c
10 z
20 b
40 d
30 c
	show distinct tdup
c1 c2
-----
10 a
20 b
30 c
10 z
40 d
	show select distinct c2 from tdup
c2
--
a
b
c
z
d

Nested where

As was mentioned, the criteria in the subphrases of a where phrase are applied to the records of the table sequentially from left to right. Consequently, the final list of records is essentially obtained via a succession of intermediate results, each of which is narrowed by the next subphrase criterion. Otherwise put, the where subphrases form a nested set of criteria.

The order of the subphrases in a nested where can have significant performance implications for queries against large tables. Whenever possible, list the subphrases in order of decreasing restrictiveness. That is, choose the subphrase at each position to be the one that results in the greatest narrowing. Each resulting intermediate table will be smallest and consequently less processing will be required at the next step.

Information.png If there is one where subphrase that will always result in a significantly smaller result set, it should be placed first in the sequence.

A typical example is a series of measurements for entities with an identifier. This could be real-time stock prices, daily bond yields, yearly batting averages, test scores, etc. Say there are many different identifier values and you want to select certain records for a given identifier. It is better to filter on the identifier first since this will immediately restrict the result set to a small subset of the original. This can lead an order of magnitude improvement.

Let’s take our trivial example of IQ test scores and imagine that the table contains the result of SAT scores for all high school seniors in the United States. In this case, there will be several million students with only a few records per student. Clearly if you want to perform an analysis on the scores of an individual, it is best to limit the result by student first, since the initial and subsequent intermediate tables will be tiny.

Imagine the following table containing millions of student social security numbers and scores.

	tscores:(ssn:0#`;sc:0#int)
	`tscores insert (`$"111-11-1111"; 999)
	`tscores insert (`$"222-22-2222"; 1242)
	`tscores insert (`$"333-33-3333"; 735)
	`tscores insert (`$"444-44-4444"; 1600)
	`tscores insert (`$"555-55-5555"; 1178)
	`tscores insert (`$"111-11-1111"; 1021)
	`tscores insert (`$"666-66-6666"; 882)
	...

Since each student takes the test only a few times, the following query,

	select from tscores where ssn=`$"111-11-1111",0<deltas sc

executes significantly faster than,

	select from tscores where 0<deltas sc,ssn=`$"111-11-1111"

We point out that a nested where phrase is logically equivalent to a unnested phrase in which each of the subphrases is joined by &. In our example,

	select from tscores where (ssn=`$"111-11-1111")&0<deltas sc
	
	/ or
	
	select from tscores where (0<deltas sc)&ssn=`$"111-11-1111"

However, both unnested versions will execute more slowly since the compound criterion is applied against all records in the table.

fby

It is sometimes desirable to use an aggregate function in the where phrase of select. Suppose we are given a table with a foreign key and we wish to determine which key values have more than one entry in the table. A first attempt might be to place a condition in the where phrase that filters on the count being greater than 1. In our example of tdetails, this would be something like,

	show select distinct eid from tdetails where 1<count eid
eid
----
1003
1001
1002

You can see this doesn’t work, as the record for eid value 1003 is included even though it has only a single entry in tdetails. What went wrong?

The question should be, what does this where expression actually do? Since count is an aggregate function, it is applied against the list of column values for eid. It cannot select individual rows since it always returns the same result, namely the number of items in the column vector, which is 5.

You could achieve the desired result with a correlated subquery. The inner query counts the records for each key value using aggregation and grouping,

	show q1:select cnt:count eid by eid from tdetails
eid | cnt
----| ---
1001| 2
1002| 2
1003| 1

The outer query selects the records with the desired count.

	show select eid from q1 where 1<cnt
eid
----
1001
1002

An easier way to accomplish this result is to use fby in the where phrase. Placing fby in a where subphrase allows an aggregate function to be used to select individual rows. The action is similar to the grouping of by, with the specified aggregate function applied across the grouped values. (Hence the name "fby" which is short for "function by").

The use of fby is somewhat more abstract than other elements of the select template. It is a binary operator of the form,

	(f,,agg,,;exp,,col,,) fby c

The left operand is a two-item list consisting of an aggregate function fagg and a column expression expcol on which the function will be applied. The right operand c is a symbol containing the name of the column whose values are grouped to form lists for the aggregate function.

Inclusion of fby in a where subphrase selects those records whose group passes the subphrase criterion specified by the aggregate function. This means that all records in a group either pass or fail together, depending on the result of the aggregation on the group.

In our example above, we can achieve the desired result with an un-nested select using fby. First, we verify that fby does indeed accomplish what we want,

	show select eid from tdetails where 1<(count;eid) fby eid
eid
----
1001
1002
1001
1002

Now we eliminate the duplicates,

	show select distinct eid from tdetails where 1<(count;eid) fby eid
eid
----
1001
1002
Information.png Multiple columns in the right operand of fby must be encapsulated in a table. To do this is, create an anonymous empty table with the desired column names only.
	t:([]sym:`IBM`IBM`MSFT`IBM`MSFT;
	     ex:`N`O`N`N`N;
	     time:12:10:00.0 12:30:00.0 12:45:00.0 12:50:00.0 13:30:00.0;
	     price:82.1 81.95 23.45 82.05 23.40)
	show t
sym  ex time         price
--------------------------
IBM  N  12:10:00.000 82.1
IBM  O  12:30:00.000 81.95
MSFT N  12:45:00.000 23.45
IBM  N  12:50:00.000 82.05
MSFT N  13:30:00.000 23.4
	show select from t where price=(max;price) fby ([]sym;ex)
sym  ex time         price
--------------------------
IBM  N  12:10:00.000 82.1
IBM  O  12:30:00.000 81.95
MSFT N  12:45:00.000 23.45

It may take a while to get used to this notation

The update Template

Basic update

The update template has the same form as the select template,

update <pu> <by pb> from texp <where pw>

The difference is that column assignments in the update phrase pu represent modifications to columns instead of column name aliases.

	show update c1:`third,c2:33 from t where c1=`three
c1    c2
--------
one   10
two   20
third 33
Warning.png In order to modify the contents of texp you must refer to a table by name.

After execution of the query above, we still find,

	show t
c1    c2
--------
one   10
two   20
three 30

However, t can be modified in place with,

	update c1:`third,c2:33 from `t where c1=`three
`t
	show t
c1    c2
--------
one   10
two   20
third 33
Information.png Unlike updates in SQL, update can add a new column.
	show t
c1 c2
-----
20 z
10 y
30 x
20 a
	update c3:100+c1 from `t
t
	show t
c1 c2 c3
---------
20 z  120
10 y  110
30 x  130
20 a  120

update-by

When the by phrase is present, update can be used to create new columns from the grouped values. When an aggregate function is used, it is applied to each group of values and the result is assigned to all records in the group.

	show t
n p
----
a 10
b 15
a 12
c 20
c 25
b 14
	show update av:avg p by n from t
n p  av
---------
a 10 11
b 15 14.5
a 12 11
c 20 22.5
c 25 22.5
b 14 14.5

When a uniform function is used, it is applied across the grouped values and the result is assigned in sequence to the records in the group. With t as above,

	show update s:sums p by n from t
n p  s
-------
a 10 10
b 15 15
a 12 22
c 20 20
c 25 45
b 14 29

upsert

The dyadic function upsert is an alternate name for join (,) on tables and keyed tables.

For keyed tables, the match is done by key value.

	show kt
k    | c
-----| --
one  | 10
two  | 20
three| 30
	show ku
k    | c
-----| ---
three| 300
four | 400
	show kt upsert ku
k    | c
-----| ---
one  | 10
two  | 20
three| 300
four | 400

For regular (unkeyed) tables, the records are appended.

	t:([]c1:`one`two`three;c2:10 20 30)
	show t
c1    c2
--------
one   10
two   20
three 30
	u:([]c1:`three`four;c2:30 40)  show u
c1    c2
--------
three 30
four  40
	show t upsert u
c1    c2
--------
one   10
two   20
three 30
three 30
four  40

delete

The format of the delete template is simpler than that of select with only an optional where phrase.

delete from texp <where pw>

The result is a table derived from texp in which records meeting the criteria of pw are removed.

	show t
c1 c2
-----
a  x
b  y
c  z
	td:delete from t where c2=`z
	show td
c1 c2
-----
a  x
b  y
Warning.png In order to modify the contents of texp you must refer to the table by name.

Thus, after execution of the query above, we still find,

	show t
c1 c2
-----
a  x
b  y
c  z

However, t can be modified in place with,

	delete from `t where c2=`z
t
show t
c1 c2
-----
a  x
b  y

Aggregation

Aggregation is a result of applying an aggregate function—one that produces a scalar from a list—to a column.

SQL Aggregation

In traditional SQL, aggregation and grouping are limited and cumbersome. Aggregation and grouping are bound together: only columns that appear in the GROUP BY can participate in the SELECT result. Moreover, there is a limited collection of built-in aggregation functions, which cannot be extended.

In q, grouping and aggregation can b used independently or together.

Grouping

Grouping produces a list of values for each value in the group domain. The result of grouping (without aggregation) is a table with non-simple column vectors. Unlike SQL, any column can participate in the select result when grouping. Moreover, the columns in the by phrase are automatically included in the result as keys.

We can group order quantities by supplier in the sp.q script sample tables,

	show select qty by s.name from sp
name | qty
-----| -----------------------
blake| 200
clark| 100 200 300
jones| 300 400
smith| 300 200 400 200 100 400

Aggregation

Aggregation can be applied against a column of non-simple type in any table. An aggregate function is any function that processes a list and produces a result of an appropriate form. While q has many built-in aggregates, you can also define and use your own.

We calculate the average order quantity in the sp.q script sample tables by using the built-in aggregate avg,

	show select avgqty:avg qty from sp
avgqty
--------
258.3333

Grouping with Aggregation

The equivalent of SQL aggregation is achieved in q by combining grouping with aggregation.

Continuing with the sp.q script example, we combine grouping and aggregation to compute the average order quantity by supplier.

	show select avgqty:avg qty by s.name from sp
name | avgqty
-----| --------
blake| 200
clark| 200
jones| 350
smith| 266.6667

Using Uniform and Aggregate Functions

Any uniform or aggregate function can by applied directly to columns in aggregation.

Again using the sp.q distribution example,

	show select lo:min qty,av:avg qty,hi:max qty by s.name from sp
name | lo  av       hi
-----| ----------------
blake| 200 200      200
clark| 100 200      300
jones| 300 350      400
smith| 100 266.6667 400

Using each

If the data in a column is not atomic (that is, the column has a list of values for each row), you must the each modifier to apply an aggregate.

In our sp.q example, suppose we define a table of intermediate results as,

	o:select qty by p.name from sp
	show o
name | qty
-----| ---------------
bolt | 200 400 200 200
cam  | 100 400
cog  | 100
nut  | 300 300
screw| 400 200 300

We must use each to compute the average order size for each product,

	select avqty:avg qty from o
'length
	show select avqty:avg each qty from o
avqty
---
250
250
100
300
300

Sorting

Recall that tables and keyed tables are lists of records and therefore have an inherent order. A table or keyed table can be sorted by the values of any columns.

We use the following table definition in this section,

	t:([]c1:20 10 30 20;c2:`z`y`x`a)
	show t
c1 c2
-----
20 z
10 y
30 x
20 a

xasc

The dyadic xasc takes a scalar or list of symbols containing column names as its left argument and a table as its right argument. It returns the records of the table sorted in ascending order of the items in the specified column(s). The order of the column names indicates the sort order, from major to minor.

	show `c1 xasc t
c1 c2
-----
10 y
20 z
20 a
30 x
	show `c2 xasc t
c1 c2
-----
20 a
30 x
10 y
20 z
	show `c1`c2 xasc t
c1 c2
-----
10 y
20 a
20 z
30 x
Warning.png In order to modify the contents of a table you must refer to the table by name.

Thus, after execution of the expressions above, we still find,

	show t
c1 c2
-----
20 z
10 y
30 x
20 a

However, t can be sorted in place with,

	`c1`c2 xasc `t
t
	show t
c1 c2
-----
10 y
20 a
20 z
30 x

xdesc

The dyadic xdesc behaves exactly as xasc, except that the sort is performed in descending order.

	show t
c1 c2
-----
20 z
10 y
30 x
20 a
	show `c1`c2 xdesc t
c1 c2
-----
30 x
20 z
20 a
10 y

Renaming and Rearranging Columns

Since a table is the flip of a column dictionary, its columns are named and ordered by the list of symbols in the dictionary domain. It is sometimes convenient to rename or reorder the columns. This is accomplished using the dyadic functions xcol and xcols.

We use the following table definition in this section,

	t:([]c1:20 10 30 20;c2:`z`y`x`a;c3:101.1 202.2 303.3 404.4)
	show t
c1 c2 c3
-----------
20 z  101.1
10 y  202.2
30 x  303.3
20 a  404.4

xcol

The dyadic xcol takes a scalar or list of symbols containing column names as its left argument (names) and a table (source) as its right argument. The count of names must be less than or equal to the number of columns in source. It result is a table obtained from source by renaming the columns, in order, using the symbols in names.

For example,

	show `id`name`val xcol t
id name val
-------------
20 z    101.1
10 y    202.2
30 x    303.3
20 a    404.4
Warning.png The function xcols does not modify its table operand.

Thus, after execution of the expressions above, we still find,

	show t
c1 c2 c3
-----------
20 z  101.1
10 y  202.2
30 x  303.3
20 a  404.4

However, t can effectively be renamed with,

	t:`id`name xcol `t

If the count of names is less than the number of columns in source, the remaining columns are not renamed.

	show `id`name xcol t
id name c3
-------------
20 z    101.1
10 y    202.2
30 x    303.3
20 a    404.4

xcols

The dyadic xcols takes a scalar or list of symbols containing column names as its left argument (names) and a table (source) as its right argument. The count of names must be less than or equal to the number of columns in source. It returns a table obtained from source by reordering the columns according to the symbols in names. Any symbols in names that are not column names in source are ignored.

For example,

	show `c3`c2`c1 xcols t
c3    c2 c1
-----------
101.1 z  20
202.2 y  10
303.3 x  30
404.4 a  20
Warning.png The function xcols does not modify its source.

Thus, after execution of the expressions above, we still find,

	show t
c1 c2 c3
-----------
20 z  101.1
10 y  202.2
30 x  303.3
20 a  404.4

However, t can effectively be reordered with,

	t: `c3`c2`c1 xcols t

If the count of names is less than the number of columns in source, the specified columns are reordered at the beginning of the column list and the remaining columns are left unchanged.

	show `c3`c1 xcols t
c3    c1 c2
-----------
101.1 20 z
202.2 10 y
303.3 30 x
404.4 20 a

Joins

It is common in SQL to reassemble data that has normalized by joining a table having a foreign key (source) to its primary key table along common key values. This situation occurs, for example, when the tables have a master-detail relation, or when the values of a field are factored into a lookup table. Such an inner join with equals in the join criterion is called an equal join or an equijoin. In an equijoin, the join can be specified in either order, and there will be exactly one record it the result for each record in the source.

An inner join combines two tables having columns with compatible columns by selecting a subset of the Cartesian product along matching column values. In a left inner join, each row from the first table (source) s paired with any matching rows from the second table. In a right inner join, each row from the second table (source) is paired with any matching rows from the first table. The match columns do not need to be key columns. In an inner join, there may be no rows or multiple rows in the result for each row in the source.

SQL also has outer joins, in which each element of one table (source) is paired with all matching elements of the other table. The match columns do not need to be key columns. In an outer join, there is at least one row in the result for each row in the source.

Equijoin on Key

Given a primary key table m, foreign key table d and common key column k, this join can be expressed in various SQL notations, among them,

m,d WHERE m.k = d.k
m INNER JOIN d ON m.k = d.k

A SELECT statement for this join refers to columns in the join by using dot notation based on the constituent tables,

SELECT d.cold, m.colm FROM m, d WHERE m.k = d.k

A join along a foreign key is accomplished with an enumeration in q. The join is implicit in the following select on the detail table,

select cold, k.colm from d

This generalizes to the situation where d has multiple foreign keys. Say d has foreign keys k1,k2,...,kn referring to primary key tables m1,m2,...,mn. Columns from the n-way join of d to the primary key tables are accessed via a select of the form,

select cold, k1.colm1, k2.colm2, ... , kn.colmn from d

For example, in the sp.q distribution script,

	show select sname:s.name,pname:p.name,qty from sp
sname pname qty
---------------
smith nut   300
smith bolt  200
smith screw 400
smith screw 200
clark cam   100
smith cog   100
jones nut   300
jones bolt  400
blake bolt  200
clark bolt  200
clark screw 300
smith cam   400

Multi-way equijoins also arise when m and d are as above and additionally d has a primary key l. If s is a table with a foreign key whose enumeration domain is l, then m, d and s can be joined. In SQL

#!sql
	SELECT m.colm, d.cold, s.cols from m,d,s WHERE m.k=d.k AND d.l=s.l

In q this is

	select l.k.colm,l.cold, cols from s

Ad hoc Left Join

You can explicitly create a left join using the dyadic lj. The right operand is a keyed table and the left operand is a table having a foreign key into the keyed table.

In our example,

	show tdetails lj kt
eid  sc  name       iq
-----------------------
1003 126 Prefect    126
1001 36  Dent       42
1002 92  Beeblebrox 98
1001 39  Dent       42
1002 98  Beeblebrox 98
1001 42  Dent       42

The same result can be obtained with a foreign key join by listing all the columns

	show select eid, sc, eid.name, eid.iq from tdetails
eid  sc  name       iq
-----------------------
1003 126 Prefect    126
1001 36  Dent       42
1002 92  Beeblebrox 98
1001 39  Dent       42
1002 98  Beeblebrox 98
1001 42  Dent       42
Information.png The performance of an equijoin on a key is approximately 2.5 times faster than an ad hoc left join.

Union Join

The union join uj combines any two tables. The rows and columns of the left operand appear before those of the right operand. Column value lists are joined for common columns. For non-common columns, the value lists are extended with nulls so that they are the same length. The column value lists of the left operand have nulls appended, whereas those of the right operand have nulls prepended.

	t1:([]c1:1 2 3;c2:101 102 103;c3:`x`y`z)
	t2:([]c2:103 104 105 106;c4:`a`b`c`d)
	show t1
c1 c2  c3
---------
1  101 x
2  102 y
3  103 z
	show t2
c2  c4
------
103 a
104 b
105 c
106 d
	show t1 uj t2
c1 c2  c3 c4
------------
1  101 x
2  102 y
3  103 z
	103    a
	104    b
	105    c
	106    d

Parameterized Queries

Relational databases have the concept of stored procedures, which are programs that operate on tables using SQL statements. The programming languages that extend SQL are not part of the SQL standard and differ across vendors. The capabilities of the programming environments also vary greatly.

This situation forces a programmer to make a difficult choice: pay a steep price in programming power to place functionality close to the data, or extract the data into an application server in order to perform aggregations and calculations. Multi-tier architectures with separate database and application servers have evolved largely to address this problem, but they increase cost and complexity.

In kdb+, this choice is unnecessary since the q programming environment has all the power and performance you need. In fact, q is much faster than traditional database programming environments for retrieval and calculations on large time series. Other components of the application can perform their data retrieval and manipulation by making calls to q.

Traditional calls to a database are made via stored procedures, which are programs executed by the database manager. Often the stored procedure has parameters that supply specific values to the queries. The parameters are limited to the basic data types of SQL

Any q program can serve as a stored procedure, with no distinction made between data retrieval and calculations. Any valid q expression that operates on tables or dictionaries can be invoked in a function. Function parameters can be used to supply specific values for queries. In particular, the select, update and delete templates can be invoked within a function by using parameters to pass specific values to the query. Such a function is called a parameterized query.

Warning.png Parameterized queries have restrictions. First, a parameterized query cannot use implicit function parameters. Second, columns cannot be passed as parameters.

In the following example with our tdetails table, we pass a specific value for a foreign key match criterion.

	getScByEid:{[e] select from tdetails where eid=e}
	show getScByEid 101
eid sc
------
101 36
101 39
101 42

This example can be generalized to handle a scalar or list argument.

	getScByEid:{[e] select from tdetails where eid in ((),e)}
	show getScByEid 101
eid sc
------
101 36
101 39
101 42
	show getScByEid 101 103
eid sc
-------
103 126
101 36
101 39
101 42

The last expression in the revised function definition warrants closer examination. The empty-list join turns a scalar argument into a list and has no effect on a list. It must be enclosed in parentheses because it appears in a phrase in select.

You can pass a table as a parameter. Suppose we have multiple trade tables, all having at the columns px (price) and date in common. The following parameterized query returns the maximum price over a specified date range from any trade table.

	maxpx:{[t,range] select max px from t where date within range}

Here t is a trades table and range is a list of two dates in increasing order.

Information.png You can effectively parameterize column names in two ways. First, you can mimic a common technique from SQL in which the query is built dynamically: build the query text in a string and then pass the string to value for execution. There is a performance penalty for this approach. Even q demigods use the functional forms of select and update, which have no performance penalty. In the functional form, all columns are referred to by name, so columns names are passed as symbols.

Functional Forms of select and update

The functional forms of select and update are useful for generated programmatically queries, such as when column names are dynamically produced. The functional forms are,

	?[t;c;b;a]                / select
	![t;c;b;a]                / update

where t is a table, a is a dictionary of aggregates, b is a dictionary of groupbys and c is a list of constraints.

Information.png All q entities in a, b and c must be referenced by name, meaning as symbols containing the entity names.
Information.png The syntactic forms of select and update are parsed into their equivalent functional forms by the q interpreter, so there is no performance difference between the two forms.

Functional select

Let's start with a simple select example.

	t:([]n:`x`y`x`z`z`y;p:0 15 12 20 25 14)
	show t
n p
----
x 0
y 15
x 12
z 20
z 25
y 14
	show select m:max p,s:sum p by name:n from t where p  >  0,n in `x`y
name| m  s
----| -----
x   | 12 12
y   | 15 29

Following is the equivalent functional form. Note the use of enlist to create singletons to ensure that appropriate entities are lists.

	c: ((  >  ;`p;0);(in;`n;enlist `x`y))
	b: (enlist `name)!enlist `n
	a: `m`s!((max;`p);(sum;`p))
	show ?[t;c;b;a]
name| m  s
----| -----
x   | 12 12
y   | 15 29

Of course, the functional form can be written without the intermediate variables a, b and c by macho types.

The general form of functional select is,

	?[t;c;b;a]

where t is a table, c is a list of where subphrase (constraint) specifications, b is a dictionary of by subphrase specifications, and a is a dictionary of select phrase (aggregate) specifications.

Each item in c is a triple consisting of a boolean or int valued dyadic function together with its arguments, each of which is an expression containing column names and other variables. The function is applied to the two arguments, producing a boolean vector. The resulting boolean vector selects the rows that yield non-zero results. The selection is performed in the order of the items in c, from left to right.

The domain of b is a list of symbols that are the key names for the grouping. The range of b is a list of column expressions whose results are used for grouping. The grouping is ordered by the range elements, from major to minor.

The domain of a is a list of symbols containing the names of the selected columns. Each element of the range of a is a pair consisting of a function and its argument, which is a symbol containing a column name. The function is applied to the specified column value for each row (or values, if grouping has been performed) and the result is used to filter the rows.

Functional update

The functional form of update is completely analogous to that of select. Again note the use of enlist to create singletons to ensure that appropriate entities are lists.

	show update p:max p by n from t where p  >  0
n p
----
x 0
y 15
x 12
z 25
z 25
y 15
	c: enlist (  >  ;`p;0)  b: (enlist `p)!enlist `p  a: (enlist `p)!enlist (max;`p)
	show ![t;c;b;a]
n p
----
x 0
y 15
x 12
z 20
z 25
y 14

Prev: Tables, Next: Execution Control

©2006 Kx Systems, Inc. and Continuux LLC. All rights reserved.

Personal tools
Namespaces
Variants
Actions
Navigation
Toolbox