7. Transforming Data¶
7.1 Types¶
Casting is a way to convert a value of one type to a compatible type. Sometimes the conversion is exact; other times information is lost. Enumeration and parsing in q also fit into the cast pattern.
7.1.1 Basic Data Types¶
A type can be specified in three equivalent ways: a char, a short and a symbol. For convenience, we repeat the data types table from Chapter 1.
type | type symbol | type char | type num |
---|---|---|---|
boolean | `boolean | B | 1h |
guid | `guid | G | 2h |
byte | `byte | X | 4h |
short | `short | H | 5h |
int | `int | I | 6h |
long | `long | J | 7h |
real | `real | E | 8h |
float | `float | F | 9h |
char | `char | C | 10h |
symbol | ` | S | 11h |
timestamp | `timestamp | P | 12h |
month | `month | M | 13h |
date | `date | D | 14h |
datetime | `datetime | Z | 15h |
timespan | `timespan | N | 16h |
minute | `minute | U | 17h |
second | `second | V | 18h |
time | `time | T | 19h |
7.1.2 The type Operator¶
The non-atomic unary function type can be applied to any entity in q to return its data type expressed as a short. It is a "feature" of q that the data type of an atom is negative whereas the type of a simple list is positive.
q)type 42
-7h
q)type 10 20 30
7h
q)type 98.6
-9h
q)type 1.1 2.2 3.3
_
q)type `a
-11h
q)type `a`b`c
_
q)type "z"
-10h
q)type "abc"
_
Observe that infinities and nulls have their respective types.
q)type 0W
-7h
q)type 0N
_
q)type -0w
-9h
q)type 0n
_
q)type `
-11h
The type of any general list is 0h.
q)type (42h; 42i; 42j)
0h
q)type (1 2 3; 10 20 30)
_
q)type ()
_
The type of any dictionary, including a keyed table, is 99h.
q)type (`a`b`c!10 20 30)
99h
q)type ([k:`a`b`c] v:10 20 30)
_
The type of any table is 98h.
q)type ([] c1:`a`b`c; c2:10 20 30)
98h
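The sign and magnitude conventions above can be folded into a small helper. The following is a minimal sketch; the name kind and its category symbols are hypothetical, chosen only for illustration, and anything it does not recognize (functions, enumerations and so on) is lumped into `other.

```q
/ kind: hypothetical helper that classifies a value by its type number
kind:{t:type x; $[t<0;`atom; t=0;`general; t<20;`simple; t=98;`table; t=99;`dict; `other]}

kind 42                     / `atom
kind 10 20 30               / `simple
kind (42h;42i;42j)          / `general
kind `a`b`c!10 20 30        / `dict
kind ([] c1:`a`b`c)         / `table
```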
7.1.3 Type of a Variable¶
Since q is a dynamically typed language, a variable has the type of the value currently assigned to it. Unlike statically typed languages, there is no predeclaration of type.
q)a:42
q)type a
-7h
q)a:"abc"
q)type a
10h
In more detail, the variable a above is an association between the name `a and its assigned value. Since this is the first time that the variable is assigned, the q interpreter creates an entry for the key `a in its dictionary of global variables and associates it with the value 42.
q)get `.
a| 42
On the subsequent assignment, there is already an entry for the key `a in the dictionary, so the value "abc" is upserted.
q)get `.
a| "abc"
The type of a variable is the type of the value associated with the variable's name.
Global variables are stored in ordinary q dictionaries with special names. For example, the symbol `. is the name of the default global dictionary. Start a fresh q session and observe the life of this dictionary using value.
q)value `.
q)a:42
q)value `.
a| 42
q)f:{x*x}
q)value `.
a| 42
f| {x*x}
..
Since q internally uses the same dictionary structures available to us, it has been said (crassly) that q eats its own dog food.
7.2 Cast¶
Casting renders an underlying value of one type into another compatible type. For example, a short integer has a natural interpretation as a long integer. In the Mathematics Refresher we discussed the implicit identifications of naturals with their rational counterparts and rationals with their repeating decimal real counterparts. These identifications are actually examples of casting, since the numeric values representing the different types are the same in essence.
Since q is dynamically typed, casting occurs at run-time using the binary operator $, which is atomic in both operands. The right operand is the source value and the left operand specifies the target type. There are three ways to specify the target type, corresponding to the last three columns of the type table at the beginning of this chapter.
- A (positive) numeric short type value
- A char type value
- A type name symbol
7.2.1 Casts that Widen¶
In these examples, no information is lost in the cast: either the target type is wider than the source type, or the value is small enough to be represented exactly in the target. Here are examples using the short type specification in the target.
q)7h$42i / int to long
42
q)6h$42 / long to int
42i
q)9h$42 / long to float
42f
It is arguably more readable to use the type char. Here are the same examples recast (ouch) to use the type char.
q)"j"$42i
42
q)"i"$42
42i
q)"f"$42
42f
It is arguably most readable to use the symbolic type name.
q)`int$42
_
q)`long$42i
_
q)`float$42
_
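As a quick check that the three ways of naming the target type are interchangeable, the sketch below casts the same int to long in each form and compares the results with match. The variable names x, r1, r2 and r3 are placeholders for illustration.

```q
x:42i
r1:7h$x                / numeric short type
r2:"j"$x               / type char
r3:`long$x             / type name symbol
(r1~r2) and r2~r3      / 1b: all three produce the same long
```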
7.2.2 Casts across Disparate Types¶
It may come as a surprise that q allows casting between superficially disparate types. When the underlying values are the same, the types are essentially just different representation formats, so why not allow a cast? We use the symbolic names in the following examples; the other formats work equally well.
The underlying value of a char is its position in the ASCII collation sequence, so we can cast char to and from integers, provided the integer is less than 256.
q)`char$42
"*"
q)`long$"\n"
_
The underlying value of a date is its count of days from the millennium, so we can cast to and from an int.
q)`date$0
2000.01.01
q)`int$2001.01.01 / millennium occurred on leap year
_
The underlying value of a timespan is its count of nanoseconds from midnight, so we can cast it to and from long.
q)`long$12:00:00.0000000000
43200000000000
q)`timespan$0
_
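Because these casts only reinterpret the same underlying value, a round trip should return the original. A minimal sketch, with d, n and c as illustrative names:

```q
d:2015.01.02
n:`int$d              / days since 2000.01.01, as an int
d~`date$n             / 1b: the cast back recovers the date

c:"*"
c~`char$`long$c       / 1b: char to long and back
```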
7.2.3 Casts that Narrow¶
Some casts lose information. This includes the usual suspects of float to integer and wider integers to narrower ones.
q)`long$12.345
12
q)`short$123456789
32767h
Cast any numeric to a boolean using the C philosophy that zero is 0b and anything else is 1b.
q)`boolean$0
0b
q)`boolean$0.0
0b
q)`boolean$123
1b
q)`boolean$-12.345
1b
We can also extract constituents from complex types.
q)`date$2015.01.02D10:20:30.123456789
2015.01.02
q)`year$2015.01.02
2015i
q)`month$2015.01.02
2015.01m
q)`mm$2015.01.02
1i
q)`dd$2015.01.02
2i
q)`hh$10:20:30.123456789
10i
q)`minute$10:20:30.123456789
10:20
q)`uu$10:20:30.123456789
20i
q)`second$10:20:30.123456789
10:20:30
q)`ss$10:20:30.123456789
30i
This is to be preferred over dot notation since the latter does not work inside functions.
7.2.4 Casting Integral Infinities¶
When integral infinities are cast to integers of wider type, their underlying bit patterns are simply reinterpreted. Since these bit patterns are legitimate values for the wider type, the cast results in a finite value.
q)`int$0Wh
32767i
q)`int$-0Wh
-32767i
q)`long$0Wi
2147483647
q)`long$-0Wi
-2147483647
7.2.5 Coercing Types¶
Casting can be used to coerce type-safe assignment. Recall that assignment into a simple list must strictly match the type.
q)L:10 20 30 40
q)L[1]:42h
'type
q)L,:43h
'type
This situation can arise when the list and the assignment value are created dynamically. Coerce the type by casting it to that of the target, provided of course that the cast is legitimate.
q)L[1]:(type L)$42h
q)L,:(type L)$43h
q)L
_
The type of a simple list is positive to make this construct work.
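The coercion idiom can be wrapped in a function. This is only a sketch: app is a hypothetical name, it assumes x is a non-empty simple list and y is castable to its type, and it uses a plain join rather than in-place append.

```q
/ app: hypothetical helper; join y to simple list x after casting y to x's type
app:{x,(type x)$y}

type 10 20 30,42h          / 0h: a plain join of long and short gives a general list
type app[10 20 30;42h]     / 7h: the coerced join stays a simple list of long
```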
7.2.6 Cast is Atomic¶
Cast is atomic in the right operand.
q)"i"$10 20 30
10 20 30i
q)`float$(42j; 42i; 42j)
42 42 42f
Cast is atomic in the left operand.
q)`short`int`long$42
42h
42i
42
q)"ijf"$98.6
99i
99
98.6
Cast is atomic in both operands simultaneously.
q)"ijf"$10 20 30
10i
20
30f
7.3 Data to and from Text¶
Recall that a q string is a simple list of char. All grown-up programming languages have a mechanism for translating literal text into values and vice versa. It isn’t strictly correct to say that converting a value to a string and parsing a string are casts since the value is only implicit in the text, but the operations are closely related and use the same operator in q.
7.3.1 Data to Strings¶
The function string can be applied to any q entity to produce a text representation suitable for console display or storage in a file. Here are the key features of string.
- The result is always a list of char, never a single char. Thus you will see singleton char lists from single digits.
- The result contains no q type indicators or other decorations. In general, the result is the most compact representation of the input, which may not actually be convertible (i.e., parsed) back to the original value.
- Applying string to an actual string (i.e., list of char) probably will not give you what you want.
Following are some examples.
q)string 42
"42"
q)string 4
,"4"
q)string 42i
"42"
q)a:2.0
q)string a
,"2"
q)f:{x*x}
q)string f
"{x*x}"
The string function is clearly not atomic – for example, it takes the atom 42 to the list "42". However, it is pseudo-atomic, in that it recurses into its argument and applies to the individual atoms. As such, the result conforms to the input, which explains its behavior on strings.
q)string 1 2 3
,"1"
,"2"
,"3"
q)string "string"
,"s"
,"t"
,"r"
,"i"
,"n"
,"g"
q)string (1 2 3; 10 20 30)
,"1" ,"2" ,"3"
"10" "20" "30"
Use string to convert a list (or column) of symbols to strings.
q)string `Life`the`Universe`and`Everything
_
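A common use of string is assembling display text by joining its result with other strings. The sketch below assumes nothing beyond the built-ins string and sv; the variable n is a placeholder.

```q
n:42
"count=",string n                    / "count=42"
", " sv string `Life`the`Universe    / "Life, the, Universe"
```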
7.3.2 Creating Symbols from Strings¶
Casting from a string (i.e., a list of char) to a symbol is a foolproof way to create symbols. It is the only way to create symbols with embedded blanks or other special characters that cannot be entered into a literal symbol. To cast a char or a string to a symbol, use `$.
q)`$"abc"
`abc
q)`$"Hello World"
`_
Do not use `symbol$ for this as it generates an error. This is a common qbie mistake.
You can include any characters in a symbol this way but you may need to escape them into the string.
q)`$"Zaphod \"Z\""
`Zaphod "Z"
q)`$"Zaphod \n"
`Zaphod
The source string is left- and right-trimmed during the cast.
The author knows no workaround to force leading or trailing blanks into a symbol. Why would you want them there, anyway?
q)string `$" abc "
"abc"
The unary `$ is atomic and will thus convert an entire list (or column) of strings to symbols.
q)`$("Life";"the";"Universe";"and";"Everything")
`Life`the`Universe`and`Everything
7.3.3 Parsing Data from Strings¶
The $ operator is overloaded to parse strings into data of any type exactly as the q interpreter does. This overload is invoked by using an uppercase type char as the target left operand and a string in the right operand. If the specified parse cannot be performed, a null of the target type is returned – i.e., missing or bad data – instead of an exception.
q)"J"$"42"
42
q)"F"$"42"
42f
q)"F"$"42.0"
42f
q)"I"$"42.0"
0Ni
q)"I"$" "
0Ni
Date parsing is flexible with respect to the format of the date.
q)"D"$"12.31.2014"
2014.12.31
q)"D"$"12-31-2014"
_
q)"D"$"12/31/2014"
_
q)"D"$"12/1/2014"
_
q)"D"$"2014/12/31"
_
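Since a failed parse returns a null rather than raising an error, bad input can be detected with null after the fact. The sketch below also pairs a list of uppercase type chars with a list of strings to parse one text record field by field; fields, rec and bad are illustrative names, and the record itself is made up.

```q
fields:("42";"98.6";"ibm")
rec:"JFS"$fields        / (42;98.6;`ibm)
type each rec           / -7 -9 -11h

bad:"J"$"forty-two"     / not a number, so the parse yields a null long
null bad                / 1b
```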
To create a function from a string, use the built-in value, which is the q interpreter, or parse, which is the parse step of the interpreter.
q)value "{x*x}"
{x*x}
q)parse "{x*x}"
_
7.4 Creating Typed Empty Lists¶
The general empty list has type 0 since it has no items and therefore affords no canonical way to determine a type.
q)type ()
0h
An issue arises when appending an item in place to a general empty list. Namely, if that item is an atom, the resulting singleton list is now a simple list of the type of the atom.
q)L:()
q)type L
0h
q)L,:42
q)type L
7h
Should this list be a column in a table, this can be problematic. For example, suppose you intend the column to be a list of floats but the first row appended to the table happens to have a long in the field for this column. Then the errant field is accepted, the type of the column is set to long, and all subsequent appends of float raise a 'type error.
q)c1:()
q)c1,:42
q)c1,:98.6
'type
To avoid this, cast the empty list using the name of the desired type, which makes it an empty simple list of that type. Now only atoms of the specified type can be appended in place.
q)c1:`float$()
q)c1,:42
'type
q)c1,:98.6
q)c1
_
Notice that an operation that yields a simple list retains the type on an empty result.
q)0#10 20 30
_
This yields a succinct idiom to create typed empty lists.
q)0#0
`long$()
q)0#0.0
_
q)0#`
_
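Typed empty lists are most often seen as the columns of an empty table, where they fix the column types before any rows arrive. A minimal sketch follows; the table name trades and its columns name and px are hypothetical.

```q
/ empty table whose column types are fixed up front
trades:([] name:`symbol$(); px:`float$())
exec t from meta trades        / "sf"

`trades insert (`ibm;98.6)     / a row of matching types is accepted
exec t from meta trades        / still "sf"
```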
There is no way in q to type nested empty lists.
7.5 Enumerations¶
We have seen that the binary Cast operator $ transforms its right operand into a conforming entity of the type specified by the left operand. In the basic Cast form, the left operand can be a char type abbreviation, a type short, or a symbol type name. In this section, casting is extended to user-defined target domains, providing a functional version of enumerated types.
7.5.1 Traditional Enumerations¶
To begin, recall that in traditional languages, an enumerated type is a way of associating a series of names with a corresponding set of integral values. Often the sequence of numbers is consecutive and begins with 0. The association is usually given a name and represents a new type.
A traditional enumerated type serves multiple purposes.
- It allows a descriptive name to be used instead of an arbitrary number – e.g., 'red', 'green', 'blue' instead of 0, 1 and 2.
- It enables type checking to ensure that only permissible values are supplied – e.g., choosing a color name from a list instead of remembering its number is less prone to error.
- It provides namespacing, meaning the same name can be reused in different domains without fear of confusion – e.g., color.blue and note.blue (the flatted fifth).
There is also a subtler, more powerful use of enumerations: normalizing data.
7.5.2 Data Normalization¶
Broadly speaking, data normalization seeks to eliminate duplication, retaining only the minimum required data. In the archetypal example, suppose you know that you will have a list of text entries taken from a fixed and reasonably short set of values – e.g., stock exchange ticker symbols. Storing a long list of such strings verbatim presents two problems.
- Values of variable length complicate storage management and make retrieval inefficient.
- There is potentially much duplication of data arising from repeated values. This is hard to keep in sync when values change.
Let’s see how an enumeration solves both problems. The key ingredients are a (presumably repetitive) list v of symbols drawn from a unique list of symbols u. As in the case of ticker symbols, it may be that we know the list u in advance.
q)u:`g`aapl`msft`ibm
q)v:1000000?u
q)v
`g`g`msft`aapl`msft`aapl`msft`ibm`msft`aapl`g`ibm`aapl`msft`msft`aapl`g..
Or it may be that we are given v and need to determine u.
q)v
`jha`jha`fna`fed`fna`fna`jha`jha`jgc`pkh`pkh`pkh`fna`fed`jha`cpi`pkh`pkh`igb`
q)u:distinct v
q)u
`jha`fna`fed`jgc`pkh`cpi`igb`hln`mjh`ooj
In any case, we have a list of symbols v drawn from a unique list u.
Consider a new list k that represents the position in u of each item of v. This can be generated by our old friend the Find operator ?.
q)u:`g`aapl`msft`ibm
q)v:1000000?u
q)k:u?v
q)k
2 1 1 3 3 1 0 0 0 3 0 2 2 1 2 3 1 0 1 1 2 1 2 0 2 1 1 0 1 1 3 0..
The key observation is that u and k together carry precisely the same information as v; indeed, they can be used to reconstitute it.
q)u[k]
`msft`aapl`aapl`ibm`ibm`aapl`g`g`g`ibm`g`msft`msft`aapl`msft`ibm`aapl..
q)v~u[k]
1b
After a bit of Zen, you will recognize that u and k exactly constitute a traditional enumeration as discussed above. Indeed, u is the list of names, while the associated values are (implicitly) the indices of the items in u. Every symbol in v can be replaced by the corresponding index in k – exactly what a traditional compiler does for an enumeration.
Why would we want to do this trade? Easy-peasy-lemon-squeezy: speed and compactness. First, v is a list of variable length text, which is time-consuming to search, whereas k is a uniform list of integers, which is very fast to traverse. Moreover, u and k normalize the data of v. In general, v will have many repetitions of each symbol, but u stores each symbol once. Reading and writing the index list k from/to disk is a block operation that will be very fast.
Advanced
Extra credit for recognizing that in terms of maps, v is the composite map u·k. For all you category theorists, we have factored the map of the non-unique list v through the unique list u via the index map k. Mathematically,
v = u·k
Let’s examine the storage requirements for a list of symbols in more detail. Say that the count of u is a and the maximum width of the text inside the symbols in u is b. For a list v of variable count x, the amount of storage required is potentially
b*x
For the indexed form, the storage is known to be
a*b+4*x
which represents the fixed amount of storage for u plus the variable amount of storage for the simple integer list k. If a is small and b is even moderately large, the factorization is significantly smaller.
This can be seen by comparing the sizes of v, u and k in a slightly modified version of our example.
q)v:`ccccccc`bbbbbbb`aaaaaaa`ccccccc`ccccccc`bbbbbbb
q)u:distinct v
q)u
`ccccccc`bbbbbbb`aaaaaaa
q)k:u?v
q)k
0 1 2 0 0 1
Now imagine v and k to be much longer.
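To put numbers on the two storage formulas, the sketch below evaluates them for the symbols in this example (a=3 distinct symbols of width b=7) with a hypothetical count of one million items. The function names rawBytes and enumBytes are made up for illustration, and the figures are the text's estimates, not measurements of what q actually allocates.

```q
rawBytes:{[b;x] b*x}               / verbatim storage: b*x
enumBytes:{[a;b;x] (a*b)+4*x}      / indexed storage: a*b+4*x

rawBytes[7;1000000]                / 7000000
enumBytes[3;7;1000000]             / 4000021
```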
7.5.3 Enumerating Symbols¶
The process of converting a list of symbols to the equivalent list of indices described in the previous section is called enumeration in q. It uses (yet another overload of) $ with the name of the variable holding the unique symbols as the left operand and a list of symbols drawn from that domain on the right.
Under the covers, $ does the indexing operation in the previous section and then replaces each symbol with its index. Fortunately, you don’t have to see how the sausage is made – i.e., q hides all this from you and displays the enumerated symbols in their reconstituted form with the name of the unique domain as an annotation. Continuing the example of the previous section:
q)`u$v
`u$`msft`aapl`aapl`ibm`ibm`aapl`g`g`g`ibm`g`msft`msft`aapl`msft..
You can recover the underlying integer values (i.e., k above) by casting to an integer.
q)ev:`u$v
q)`int$ev
2 1 1 3 3 1 0 0 0 3 0 2 2 1 2 3 1 0 1 1 2 1 2 0 2 1 1 0 1 1 3 ..
Let’s summarize. The basic form of an enumerated symbol is,
`u$v
where u is a simple list of unique symbols and v is either an atom appearing in u or a (possibly nested) list of such. We call u the domain of the enumeration and the projection `u$ is enumeration over u. Under the covers, applying the enumeration `u$ to a vector v produces the index list k as above.
For this style of enumeration, all potential values must be in the list u; otherwise you will get a 'cast error when trying to enumerate.
q)u:`a`b`c
q)`u$`d
'cast
We shall see in §7.5.7 an alternate approach when the full extent of u is not known in advance.
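The relationship between an enumeration and the index list from Find can be checked directly. In this sketch u, v, k and ev play the same roles as in the text, on a short made-up list so the results are easy to follow; both sides are cast to int before comparing because Find returns longs.

```q
u:`g`aapl`msft`ibm
v:`msft`aapl`aapl`ibm`g
k:u?v                     / positions of each item of v in u
ev:`u$v                   / enumeration of v over the domain u

(`int$k)~`int$ev          / 1b: the enumeration carries the same indices
v~u k                     / 1b: indexing the domain reconstitutes v
```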
When working with tables in kdb+, by convention all symbol columns in all tables are enumerated over a common domain sym. You will hear this referred to as the sym list or the sym file, depending on where it resides.
Although integers are 64-bit in q3+, for reasons known to the q gods, enumerations are 32-bit.
7.5.4 Using Enumerated Symbols¶
We continue with the example of the previous section, renamed to use the standard sym domain.
q)sym:`g`aapl`msft`ibm
q)v:1000000?sym
q)ev:`sym$v
The enumerated ev can be substituted for the original v in nearly all situations.
q)v[3]
`aapl
q)ev[3]
`sym$`aapl
q)v[3]:`ibm
q)ev[3]:`ibm
q)v=`ibm
000100010010011101000010010100000000100100000001000000001100001001011..
q)ev=`ibm
000100010010011101000010010100000000100100000001000000001100001001011..
q)where v=`aapl
4 5 19 20 21 31 33 34 41 42 43 49 58 59 61 74 81 83 90 94 95 98 114..
q)where ev=`aapl
4 5 19 20 21 31 33 34 41 42 43 49 58 59 61 74 81 83 90 94 95 98 114..
q)v?`aapl
4
q)ev?`aapl
4
q)v in `ibm`aapl
000111010010011101011110010100010110100101110001010000001111011001011..
q)ev in `ibm`aapl
000111010010011101011110010100010110100101110001010000001111011001011..
While the enumerated version is item-wise equal to the original, the entities are not identical.
q)all v=ev
1b
q)v~ev
0b
This is because the types matter with ~.
7.5.5 Type of Enumerations¶
Each enumeration is assigned a new numeric data type, beginning with 20h. Starting with q version 3.2, the type 20h is reserved for the conventional enumeration domain sym, whether you use it or not (you should). The types of other enumerations you create will begin with 21h and proceed sequentially. The convention of negative type for atoms and positive type for simple lists still holds. In a fresh q session we see the following.
q)sym1:`g`aapl`msft`ibm
q)type `sym1$1000000?sym1
21h
q)sym2:`a`b`c
q)type `sym2$`c
-22h
The above was true in kdb+ V3.2. In later versions the type remains 20h for all enumerations.
In contrast, the sym domain has type 20h even if created after another enumeration. Continuing in the previous session,
q)sym:`b`c`a
q)type `sym$100?sym
20h
Enumerations with different domains are distinct, even when all the constituents are the same.
q)sym1:`c`b`a
q)sym2:`c`b`a
q)ev1:`sym1$`a`b`a`c`a
q)ev2:`sym2$`a`b`a`c`a
q)ev1=ev2
11111b
q)ev1~ev2
0b
7.5.6 Updating an Enumerated List¶
The normalization provided by an enumeration reduces updating all occurrences of a given value to a single operation. This can have significant performance implications for large lists with many repetitions. Continuing with our example above, suppose the list sym contains the items in a stock index and we wish to change one of the constituents. A single update to sym suffices.
q)sym:`g`aapl`msft`ibm
q)ev:`sym$`g`g`msft`ibm`aapl`aapl`msft`ibm`msft`g`ibm`g..
q)sym[0]:`twit
q)sym
`twit`aapl`msft`ibm
q)ev
`sym$`twit`twit`msft`ibm`aapl`aapl`msft`ibm`msft`twit`ibm`twit..
In contrast, making the equivalent update to v requires changing every occurrence.
q)v
`g`g`msft`ibm`aapl`aapl`msft`ibm`msft`g`ibm`g…
q)@[v; where v=`g; :; `twit]
_
Be extremely cautious about modifying the sym list manually. (Better not to do it at all.)
Should you corrupt the sym list, your entire database will be scrambled! Make a persistent copy/backup before modifying the list, else update your CV after.
7.5.7 Dynamically Appending to an Enumeration Domain¶
One situation in which an enumeration is more complicated than working with the denormalized data is when you want to add a new value. Continuing with the example above, appending a new item to an ordinary list of symbols is a single operation. We saw in §7.5.3 this is not the case when the new value is not in the enumeration domain.
q)sym:`g`aapl`msft`ibm
q)v:1000000?sym
q)ev:`sym$v
q)v,:`twtr
q)ev,:`twtr
'cast
The new value must first be added to the unique list.
q)sym,:`twtr
q)ev,:`twtr
In practice, to use $ with dynamically generated values, you must test to see if the value you intend to append is already in the enumeration domain and, if not, append it there first. Fortunately, q has anticipated this situation.
If you cannot know the full extent of the enumeration domain in advance, you can use (yet another overload of) ? to create the domain on the fly. The syntax of ? is the same as the enumeration overload of $ – i.e., the name of a (unique) list of symbols as left operand and a source symbol or list of symbols as right operand.
This application of ? has a side effect: it first checks whether the source symbols are in the domain named by the left operand and appends any that aren't. In any case, it returns the enumerated version of the source just like $.
You can build the enumeration domain from scratch.
q)sym:()
q)`sym$`g
'cast
q)`sym?`g
`sym$`g
q)sym
,`g
q)`sym?`ibm`aapl
`sym$`ibm`aapl
q)sym
`g`ibm`aapl
q)`sym?`g`msft
`sym$`g`msft
q)sym
`g`ibm`aapl`msft
Our previous example now works, with ? in place of $.
q)ev,:`sym?`twtr
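To make the side effect concrete, the sketch below does by hand what the text describes (extend the domain with any unseen symbols, then enumerate with $) and compares the outcome with a single application of ?. The names src, domA, domB, evA and evB are hypothetical.

```q
src:`twtr`g`msft`twtr

/ manual approach: append unseen symbols to the domain, then enumerate with $
domA:`g`aapl
domA,:distinct src except domA
evA:`domA$src

/ the same effect in one step with ?
domB:`g`aapl
evB:`domB?src

domA~domB                 / 1b: both domains grew identically
(`int$evA)~`int$evB       / 1b: both enumerations hold the same indices
```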
7.5.8 Resolving an Enumeration¶
An enumerated symbol can be substituted for its equivalent symbol value in most expressions. However, there are some situations in which you need non-enumerated values. One case is converting from one enumeration domain to another, which happens when copying from one kdb+ database to another or in merging two databases.
Given an enumerated symbol, or a list of such, you can recover the un-enumerated value(s) by applying the built-in value. In our on-going example,
q)sym:`g`aapl`msft`ibm
q)v:1000000?sym
q)ev:`sym$v
q)value ev
`aapl`g`msft`msft`ibm`msft`msft`msft`msft`msft`g`ibm`ibm`ibm..
q)v~value ev
1b
This is another overload of value, the function that is essentially the q interpreter.
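Moving enumerated data from one domain to another, as mentioned at the start of this section, can be sketched by resolving with value and re-enumerating with ?. The domains symA and symB and the lists evA and evB are made-up names standing in for the sym lists of two databases.

```q
symA:`g`aapl`msft
evA:`symA$`msft`g`msft       / enumerated over the source domain

symB:`ibm`msft               / target domain with different contents
evB:`symB?value evA          / resolve, then enumerate over symB, extending it as needed

symB                         / `ibm`msft`g
value evB                    / `msft`g`msft
```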