7. Transforming Data
7.1 Types and Cast
Casting is a way to convert a value of one type to a compatible type. Sometimes the conversion is exact; other times information is lost. Enumerations and data parsing in q also fit into the cast pattern.
7.1.1 Basic Types
A type can be specified in three equivalent ways: a char, a short and a symbol. For convenience, we repeat the data types table from Chapter 1.
| Type | Type Symbol | Type Char | Type Num |
|---|---|---|---|
| boolean | `boolean | b | 1h |
| guid | `guid | g | 2h |
| byte | `byte | x | 4h |
| short | `short | h | 5h |
| int | `int | i | 6h |
| long | `long | j | 7h |
| real | `real | e | 8h |
| float | `float | f | 9h |
| char | `char | c | 10h |
| symbol | `symbol | s | 11h |
| timestamp | `timestamp | p | 12h |
| month | `month | m | 13h |
| date | `date | d | 14h |
| datetime | `datetime | z | 15h |
| timespan | `timespan | n | 16h |
| minute | `minute | u | 17h |
| second | `second | v | 18h |
| time | `time | t | 19h |
7.1.2 The type Operator
The non-atomic unary function type can be applied to any entity in q to return its data type expressed as a short. It is a quirk of q that the data type of an atom is negative whereas the type of the corresponding simple list is positive.
Tip
See 7.2.5 for an explanation of why this is.
q)type 42
-7h
q)type 10 20 30
7h
q)type 98.6
-9h
q)type 1.1 2.2 3.3
9h
q)type `a
-11h
q)type `a`b`c
11h
q)type "z"
-10h
q)type "abc"
10h
Observe that infinities and nulls have their respective types.
q)type 0W
-7h
q)type 0N
-7h
q)type -0w
-9h
q)type 0n
-9h
q)type `
-11h
The type of any general list is 0h.
q)type (42h; 42i; 42j)
0h
q)type (1 2 3; 10 20 30)
0h
q)type ()
0h
The type of any dictionary, including a keyed table, is 99h.
q)type (`a`b`c!10 20 30)
99h
q)type ([a:10; b:20; c:30])
99h
q)type ([k:`a`b`c] v:10 20 30)
99h
The type of any non-keyed table is 98h.
q)type ([] c1:`a`b`c; c2:10 20 30)
98h
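Because atoms carry negative type numbers and simple lists positive ones, the sign of the result of type is enough to tell them apart. The following is a minimal sketch of that idea; the helper names isAtom and isSimple are hypothetical and are not q built-ins. Results are shown in trailing comments.

```q
/ hypothetical helpers, not part of q
isAtom:{0>type x}                  / atoms have negative type
isSimple:{(type x) within 1 19h}   / simple lists of the basic types in the table above

isAtom 42            / 1b
isAtom 10 20 30      / 0b
isSimple 10 20 30    / 1b
isSimple (42h;42i)   / 0b  a general list has type 0h
```

A test such as isSimple can be handy before an in-place append, where the type of the target list matters.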
7.1.3 Type of a Variable
Since q is a dynamically typed language, a variable has the type of the value currently assigned to it. Unlike statically typed languages, there is no predeclaration of type.
q)a:42
q)type a
-7h
q)a:"abc"
q)type a
10h
In more detail, the variable a above is an association between the name `a and its assigned value. Since this is the first time that the variable is assigned, the q interpreter creates an entry for the key `a in its dictionary of global variables and associates it with the value 42. You can display this relation with get on the symbolic name of a context. In this case the root context holds globals not assigned to a specific namespace. In a fresh q session,
q)a:42
q)get `.
a| 42
On a subsequent assignment, there is already an entry for the key `a in the dictionary, so the value "abc" is upserted.
q)a:"abc"
q)get `.
a| "abc"
The type of a variable is the type of the (current) value associated with the variable's name.
Advanced
Global variables are stored in ordinary q dictionaries with special names. For example, the symbol `. is the name of the default global dictionary. Start a fresh q session and observe the life of this dictionary using value.
q)a:42
q)value `.
a| 42
q)f:{x*x}
q)value `.
a| 42
f| {x*x}
Since q internally uses the same dictionary structures available to us, it has been said (crassly) that q eats its own dog food.
7.2 Cast ($)
Casting renders an underlying value of one type into another compatible type. For example, a short integer has a natural interpretation as a long integer. In the Mathematical Refresher we discussed the implicit identifications of naturals with their rational counterparts and rationals with their repeating decimal real counterparts. These identifications are examples of casting, since the numeric values representing the different types are essentially the same.
Since q is dynamically typed, casting occurs at run-time. The binary cast operator $ is atomic in both operands. The right operand is the source value and the left operand specifies the target type. There are three ways to specify the target type, corresponding to the last three columns of the type table at the beginning of this chapter.
- A (positive) numeric short type value
- A char type value
- A type name symbol
7.2.1 Casts that Widen
In these examples, no information is lost in the cast, either because the target type is wider than the source type or because the value fits comfortably in the target type. Here are examples using the type short to specify the target.
q)7h$42i
42
q)6h$42
42i
q)9h$42
42f
It is arguably more readable to use the type char. Here are the same examples recast (ouch!) to use the type char.
q)"j"$42i
42
q)"i"$42
42i
q)"f"$42
42f
It is arguably most readable to use the symbolic type name.
q)`int$42
42i
q)`long$42i
42
q)`float$42
42f
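As a quick check that the three ways of naming the target type really are interchangeable, the sketch below performs the same cast each way and compares the results with match. The variable names r1, r2 and r3 are only for illustration.

```q
/ the same cast written three ways: type short, type char, type name
r1:7h$42i
r2:"j"$42i
r3:`long$42i
(r1~r2) and r2~r3      / 1b  all three results are identical
type r1                / -7h
```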
7.2.2 Casts across Disparate Types
It may come as a surprise that q allows casting between superficially disparate types. When the underlying values are the same, the types are essentially just differently formatted representations, so why not allow a cast? We use the symbolic names in the following examples; the other formats work equally.
The underlying value of a char is its offset in the ASCII collation sequence, so we can cast char to and from integers, provided the integer is less than 256.
q)`char$42
"*"
q)`long$"\n"
10
The underlying value of a date is its count of days from the millennium, so we can cast to and from an int.
q)`date$0
2000.01.01
q)`int$2001.01.01 / millennium started on leap year
366i
The underlying value of a timespan is its count of nanoseconds from midnight, so we can cast it to and from long.
q)`long$12:00:00.000000000
43200000000000
q)`timespan$0
0D00:00:00.000000000
7.2.3 Casts that Narrow
Some casts lose information. This includes the usual suspects of float to integer types and wider integers to narrower ones. Notice that unlike in earlier versions, q now caps values at the appropriate infinity for narrowing casts.
q)`long$12.345
12
q)`short$12345
12345h
q)`short$123456789
0Wh
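One detail worth keeping in mind is that a cast from float to an integer type rounds to the nearest integer rather than truncating. The sketch below contrasts the cast with floor; results are shown in trailing comments.

```q
`long$12.7     / 13   cast rounds to the nearest integer
floor 12.7     / 12   floor rounds toward negative infinity
`long$-12.7    / -13
floor -12.7    / -13
```

If truncation is what you intend, apply floor (or another explicit rounding function) instead of relying on the cast.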
Cast any numeric to a boolean by the rule that zero is 0b and anything else is 1b.
q)`boolean$0
0b
q)`boolean$0.0
0b
q)`boolean$123
1b
q)`boolean$-12.345
1b
We can also extract constituents from complex types.
q)`date$2025.01.02D10:20:30.123456789
2025.01.02
q)`year$2025.01.02
2025i
q)`month$2025.01.02
2025.01m
q)`mm$2025.01.02
1i
q)`dd$2025.01.02
2i
q)`hh$10:20:30.123456789
10i
q)`minute$10:20:30.123456789
10:20
q)`uu$10:20:30.123456789
20i
q)`second$10:20:30.123456789
10:20:30
q)`ss$10:20:30.123456789
30i
7.2.4 Casting Integral Infinities
When integral infinities are cast to integers of wider type, their underlying bit patterns are reinterpreted. Since these bit patterns are legitimate values for the wider type, the cast results in a finite value.
q)`int$0Wh
32767i
q)`int$-0Wh
-32767i
q)`long$0Wi
2147483647
q)`long$-0Wi
-2147483647
7.2.5 Coercing Types
Casting can be used to coerce type-safe assignment. Recall that assignment into a simple list must strictly match the type.
q)L:10 20 30 40
q)L[1]:42h
'type
q)L,:43h
'type
This situation can arise when the list and the assignment value are created dynamically. Coerce the type by casting it to that of the target, provided of course that the cast is legitimate.
q)L[1]:(type L)$42h
q)L,:(type L)$43h
q)L
10 42 30 40 43
Note
This example answers the age-old question about q: why is the type of a simple list positive and the type of an atom negative? Answer: to facilitate the coercive cast to the type of the list during an append in place. (A good tidbit to show off to your questioner in a q test.)
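The coercion idiom can be wrapped in a small function. The sketch below assumes a hypothetical helper named appendAs, which is not a q built-in; it casts the new value to the type of the list before joining.

```q
/ hypothetical helper: append y to simple list x after coercing y to x's type
appendAs:{x,(type x)$y}

L:10 20 30
L:appendAs[L;42h]     / the short 42h is cast to long before the join
L                     / 10 20 30 42
type L                / 7h  still a simple long list
appendAs[1.5 2.5;3]   / 1.5 2.5 3  the long 3 becomes a float
```

Note that a plain join such as 10 20 30,42h does not raise an error; it quietly yields a general list, which is why the explicit cast is useful here.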
7.2.6 Cast Is Atomic
Cast is atomic in the right operand.
q)"i"$10 20 30
10 20 30i
q)`float$(42j; 42i; 42j)
42 42 42f
Cast is atomic in the left operand.
q)`short`int`long$42
42h
42i
42
q)"ijf"$98.6
99i
99
98.6
Cast is atomic in both operands simultaneously.
q)"ijf"$10 20 30
10i
20
30f
7.3 Data to and from Text
Recall that a q string is a simple list of char. All grown-up programming languages have a mechanism for translating literal text into values and vice versa. It isn't strictly correct to say that converting a value to a string and parsing a string are casts since the value is only implicit in the text. The operations are closely related and use the same operator in q.
7.3.1 Data to Strings
The function string can be applied to any q entity to produce a text representation suitable for console display or storage in a file. Here are the key features of string.
- The result is always a list of char, never a single char. Thus you will see singleton char lists from single chars.
- The result contains no q type indicators or other decorations. In general, the result is the most compact representation of the input, which may not actually be convertible (i.e., parsed) back to the original value.
- Applying string to an actual string (i.e., list of char) probably will not give you what you want.
Following are some examples.
q)string 42
"42"
q)string 4
,"4"
q)string 42i
"42"
q)a:2.0
q)string a
,"2"
q)f:{x*x}
q)string f
"{x*x}"
The string function differs from the atomic operators we have seen thus far in that it does not return an atom for an atom; for example, it takes the atom 42 to the list "42". However, it is atomic in the sense that the i-th item of its result on a list is its result on the i-th item of that list. As such, the result conforms to the input, which explains its behavior on strings.
q)string 1 2 3
,"1"
,"2"
,"3"
q)string "string"
,"s"
,"t"
,"r"
,"i"
,"n"
,"g"
q)string (1 2 3; 10 20 30)
,"1" ,"2" ,"3"
"10" "20" "30"
Tip
Use string to convert a list (or column) of symbols to strings.
q)string `Life`the`Universe`and`Everything
"Life"
"the"
"Universe"
"and"
"Everything"
7.3.2 Creating Symbols from Strings
"Casting" from a string (i.e., a list of char) to a symbol is a foolproof way to create symbols. It is the only way to create
symbols with embedded blanks or other special characters that cannot be entered into a literal symbol. To cast a char or a string to a symbol, use `$.
q)`$"abc"
`abc
q)`$"Hello World"
`Hello World
Tip
Do not attempt to use `symbol$ for this as it generates an error. This is a common newbie mistake.
You can include any characters in a symbol this way but you may need to escape them into the string. Symbols handle Unicode better than strings.
q)`$"Zaphod \"Z\""
`Zaphod "Z"
q)`$"Zaphod \n"
`Zaphod
q)`$"\312\211"
`ʉ
q)`$"\340\270\217"
`ฏ
q)`$"\340\270\255\340\270\262\340\270\243\340\271\214\340\270\212\340\270\265\340\271\210"
`อาร์ชี่
Note
The source string is left and right trimmed during the cast. There is no real workaround to force leading or trailing blanks into a symbol. Why would you want them, anyway?
q)string `$" abc "
"abc"
Tip
The unary `$ will convert an entire column of strings to symbols without needing each.
q)`$("Life";"the";"Universe";"and";"Everything")
`Life`the`Universe`and`Everything
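Since string and `$ both work item-wise on lists, they can be combined into a simple round trip. The sketch below uses illustrative variable names syms and strs, and also uses sv to join the strings before creating a single symbol.

```q
syms:`Life`the`Universe
strs:string syms             / ("Life";"the";"Universe")
syms~`$strs                  / 1b  the round trip recovers the symbols
`$"_" sv strs                / `Life_the_Universe  join with "_" then make one symbol
```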
7.3.3 Parsing Data from Strings
The $ operator is overloaded to parse strings into data of any type exactly as the q interpreter does. This overload is invoked by using an uppercase type char as the target left operand and a string in the right operand. If the specified parse cannot be performed, a null of the target type is returned – i.e., missing or bad data – instead of an exception.
q)"F"$"42"
42f
q)"F"$"42.0"
42f
q)"I"$"42.0"
0Ni
q)"I"$" "
0Ni
Date parsing is flexible with respect to the format of the date, but two digits are mandatory in the month and day fields.
q)"D"$"12.31.2024"
2024.12.31
q)"D"$"12-31-2024"
2024.12.31
q)"D"$"12/31/2024"
2024.12.31
q)"D"$"2024/12/31"
2024.12.31
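Because a failed parse yields a null rather than an error, a common pattern is to parse a whole list of strings and then use null to find the entries that did not parse. The following is a minimal sketch with made-up input; the names raw, parsed and bad are illustrative only.

```q
raw:("42";"abc";"")         / made-up input strings
parsed:"I"$raw              / 42 0N 0Ni  unparseable items become null ints
bad:raw where null parsed   / keep the source strings that failed to parse
bad                         / ("abc";"")
```

Keep in mind that this cannot distinguish bad data from a source string that legitimately represents a null.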
Advanced
To create a function from text in a string, use the built-in parse, which is a feature of the q interpreter.
q)parse "{x*x}"
{x*x}
7.4 Creating Typed Empty Lists
The general empty list has type 0 since it has no items and therefore affords no canonical way to determine a type.
q)type ()
0h
An issue arises when appending an item in place to a general empty list. Namely, if that item is an atom, the resulting singleton list is now a simple list of the type of the atom.
q)L:()
q)type L
0h
q)L,:42
q)type L
7h
Should this list be a column in a table, this can be problematic. For example, suppose you intend the column to be a list of floats but the first row appended to the table happens to have a long in the field for this column. When the errant field is accepted, the type of the column is set to long, and all subsequent appends of float raise a 'type error.
q)c1:()
q)c1,:42
q)c1,:98.6
'type
To avoid this, cast the empty list using the name of the desired type, which makes it an empty simple list of that type. Now only atoms of the specified type can be appended in place.
q)c1:`float$()
q)c1,:42
'type
q)c1,:98.6
q)c1
,98.6
Notice that an operation that yields a simple list retains the type on an empty result.
q)0#10 20 30
`long$()
Tip
This yields a succinct idiom to create typed empty lists.
q)0#0
`long$()
q)0#0.0
`float$()
q)0#`
`symbol$()
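The same technique applies when defining an empty table whose columns should have fixed types from the outset. The sketch below assumes a hypothetical table t with two columns; it is an illustration of the idea rather than a recommended schema.

```q
/ hypothetical table with typed empty columns
t:([] c1:`float$(); c2:`symbol$())
type t`c1              / 9h  c1 is a float column before any row arrives
`t insert (98.6;`a)    / append one row whose fields match the column types
count t                / 1
type t`c1              / 9h  the column type is unchanged
```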
7.5 Enumerations
We have seen that the binary cast operator $ transforms its right operand into a conforming entity of type specified by the left operand. In the basic cast form, the left operand can be a char type abbreviation, a type short, or a symbol type name. In this section, casting is extended to user-defined target domains, providing a functional version of enumerated types.
7.5.1 Traditional Enumerations
To begin, recall that in traditional languages, an enumerated type is a way of associating a series of names with a corresponding set of integral values. Often the sequence of numbers is implicit, is consecutive and begins with 0. The association is usually given a name and represents a new type.
A traditional enumerated type serves multiple purposes.
- It allows a descriptive name to be used instead of an arbitrary number - e.g., 'red', 'green', 'blue' instead of 0, 1 and 2.
- It enables type checking to ensure that only permissible values are supplied - e.g., choosing a color name from a list instead of remembering its number is less prone to error.
- It provides name spacing, meaning the same name can be reused in different domains without fear of confusion - e.g., color.blue and note.blue (the flatted fifth).
There is also a subtler, more powerful use of enumerations: normalizing data.
7.5.2 Data Normalization
Broadly speaking, data normalization seeks to eliminate duplication, retaining only the minimum required data. In the archetypal example, suppose you know that you will have a list of text entries taken from a fixed and reasonably short set of values – e.g., stock exchange ticker symbols. Storing a long list of such strings verbatim presents two problems.
- Values of variable length complicate storage management and make retrieval inefficient.
- There is potentially much duplication of data arising from repeated values. This is hard to keep in sync when a ticker symbol changes.
Let's see how an enumeration solves both problems. The key ingredients are a (repetitive) list v of symbols drawn from a unique list of symbols u. As in the case of ticker symbols, it may be that we know the list u in advance.
q)u:`goog`aapl`msft`ibm
q)v:1000000?u
q)v
`goog`goog`msft`aapl`msft`aapl`msft`ibm`msft`aapl`goog`ibm`aapl…
Or it may be that we are given v and need to determine u.
q)v
`jha`jha`fna`fed`fna`fna`jha`jha`jgc`pkh`pkh`pkh`fna`fed`jha`cpi`pkh`pkh`igb
q)u:distinct v
q)u
`jha`fna`fed`jgc`pkh`cpi`igb`hln`mjh`ooj`
In any case, we start with a list of symbols v drawn from a unique list u.
Consider a new list k that represents the position in u of each item of v. This can be generated by our good friend the find operator ?.
q)u:`goog`aapl`msft`ibm
q)v:1000000?u
q)show k:u?v
2 0 2 0 1 3 2 0 0 0 2 2 1 0 0 3 0 1 3 0 1 3 0 3 0 2 1 0 0 2 0 2 …
The key observation is that u and k together carry precisely the same information as v; indeed, they can be used to reconstitute it.
q)u[k]
`msft`goog`msft`goog`aapl`ibm`msft`goog`goog`goog`msft`msft …
q)v~u[k]
1b
After a brief Zen meditation, you will recognize that u and k exactly constitute a traditional enumeration discussed above. Indeed, u is the list of names, while the associated values are (implicitly) the indices of the items in u. Every symbol in v can be replaced by the corresponding index in k – exactly what a traditional compiler does for an enumeration.
Why would we want to do this trade? Easy-peasy-lemon-squeezy: speed and compactness. First, v is a list of variable length text, which is time-consuming to search, whereas k is a uniform list of integers, which is very fast to traverse. Moreover, u and k normalize the data of v. In general, v will have many repetitions of each symbol, but u stores each symbol once. Reading and writing the index list k from/to disk is a block operation that will be very fast.
Advanced
- Extra credit for recognizing that in terms of maps, v is the composite map u·k. As category theorists would say, we have factored the map of the non-unique list v through the unique list u via the index map k.

  $$v = u \cdot k$$

- Let's examine the storage requirements for a list of symbols in more detail. Say that the count of u is a and the maximum width of the text inside the symbols in u is b. For a list v of variable count x, the amount of storage required is potentially

  $$b \times x$$

  For the indexed form, the storage is known to be

  $$a \times b + 4 \times x$$

  which represents the fixed amount of storage for u plus the variable amount of storage for the simple integer list k. If a is small and b is even moderately large, the factorization is significantly smaller.
This can be seen by comparing the sizes of v, u and k in a slightly modified version of our example.
q)vv:`ccccccc`bbbbbbb`aaaaaaa`ccccccc`ccccccc`bbbbbbb
q)uu:distinct vv
q)uu
`ccccccc`bbbbbbb`aaaaaaa
q)kk:uu?vv
q)kk
0 1 2 0 0 1
7.5.3 Enumerating Symbols
The process of converting a list of symbols to the equivalent list of indices described in the previous section is called enumeration in q. It uses (yet another overload of) $ with the name of the variable holding the unique symbols as the left operand and a list of symbols drawn from that domain on the right.
Under the covers, $ performs the indexing operation in the previous section and then replaces each symbol with its index. Fortunately, you don't have to see how the sausage is made - i.e., q hides all this from you and displays the enumerated symbols in their reconstituted form with the name of the unique domain as an annotation. Continuing the example of the previous section:
q)ev:`u$v
q)ev
`u$`msft`goog`msft`goog`aapl`ibm`msft`goog`goog`goog`msft`msft ..
You can recover the underlying long integer values (i.e., k above) by casting to long.
q)`long$ev
2 0 2 0 1 3 2 0 0 0 2 2 1 0 0 3 0 1 3 0 1 3 0 3 0 2 1 0 0 2 0 2..
q)k~`long$ev
1b
Let’s summarize. The basic form of an enumerated symbol is,
`u$v
where u is a simple list of unique symbols and v is either an atom appearing in u or a (possibly nested) list of such. We call u the domain of the enumeration and the projection (`u$) is enumeration over u. Under the covers, applying the enumeration (`u$) to a vector v produces the index list k as above.
Important
For this style of enumeration, all potential values must be in the list u; otherwise you will get a 'cast error when trying to enumerate.
q)u1:`a`b`c
q)`u1$`d
'cast
We shall see in 7.5.7 an alternative approach when the full extent of u is not known in advance.
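Until then, one way to avoid the 'cast error is to check membership with in before enumerating. The following is a minimal sketch; the list cand is a made-up set of incoming values.

```q
u1:`a`b`c
cand:`a`d`b                      / hypothetical incoming values
cand where not cand in u1        / ,`d  values missing from the domain
`u1$cand where cand in u1        / `u1$`a`b  enumerate only the known values
```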
When working with tables in kdb, normally all symbol columns in all tables are enumerated over a common domain sym. You will hear this referred to as the sym list or the sym file, depending on where it resides.
7.5.4 Using Enumerated Symbols
We reconstrue the example of the previous section to use the standard name sym for the enumeration domain.
q)sym:`goog`aapl`msft`ibm
q)v:1000000?sym
q)ev:`sym$v
The enumerated ev can be substituted for the original v in nearly all situations.
q)ev:`sym$v
q)v[3]
`aapl
q)ev[3]
`sym$`aapl
q)v[3]:`ibm
q)ev[3]:`ibm
q)v=`ibm
110100000000000000011000001000001101100000000100110000011100010..
q)ev=`ibm
110100000000000000011000001000001101100000000100110000011100010..
q)where v=`aapl
4 8 9 10 14 22 23 27 28 29 34 38 39 40 42 59 62 67 68 70 73 83 ..
q)where ev=`aapl
4 8 9 10 14 22 23 27 28 29 34 38 39 40 42 59 62 67 68 70 73 83 ..
q)v?`aapl
4
q)ev?`aapl
4
q)v in `ibm`aapl
11011000111000100001101100111100111110111010010011000001110101 ..
q)ev in `ibm`aapl
11011000111000100001101100111100111110111010010011000001110101 ..
Note
While the enumerated version is item-wise equal to the original, the entities are not identical.
q)all v=ev
1b
q)v~ev
0b
This is because the types matter with ~.
7.5.5 Type of Enumerations
In q3.6 and later, enumerations over a named domain - i.e., a variable containing a list of symbols or a keyed table - all have type 20h. They are distinguished by the name of their domain, which is included in their display. We also point out what was implicit in our presentation: the underlying indices of an enumeration are normal long integers which are 64 bits. The convention of negative type for atoms and positive type for simple lists still holds for enumerations.
Tip
There is no practical limit to how many different enumeration domains you can have. However, you should rethink your design if you think you need more than a few (maybe even more than one) as this will complicate program maintenance exponentially.
Note
In versions of q before 3.6, even when all integral values defaulted to 64 bits, enumerations were 32-bit and enumerations over different domains got different types starting at 21h. If you have been chosen by fate to own a system that was developed prior to q3.6, see the KX Documentation for details on the previous enumeration mechanism.
Continuing with our example from the previous section,
q)type ev
20h
q)type `sym1$1000000?sym1:`goog`aapl`msft`ibm
20h
q)sym2:`a`b`c
q)type `sym2$`c
-20h
Enumerations with different domains are distinct, even when their constituents are all equal.
q)sym1:`c`b`a
q)sym2:`c`b`a
q)ev1:`sym1$`a`b`a`c`a
q)ev2:`sym2$`a`b`a`c`a
q)ev1=ev2
11111b
q)ev1~ev2
0b
This is not usually a problem in practice but, should it be, use value to un-enumerate.
q)value ev1
`a`b`a`c`a
7.5.6 Updating an Enumerated List
The normalization provided by an enumeration reduces updating all occurrences of a value to a single operation. This can have significant performance implications for large lists with many repetitions.
Continuing with our example above, suppose the list sym contains the items in a stock index and we wish to change one constituent name. A single update to sym suffices.
q)sym:`goog`aapl`msft`ibm
q)v:1000000?sym
q)ev:`sym$v
q)sym[0]:`g
q)sym
`g`aapl`msft`ibm
q)ev
`sym$`msft`msft`g`msft`g`aapl`g`aapl`aapl`g`msft`aapl`ibm`g ..
In contrast, to make the equivalent update to v requires changing every occurrence.
q)v
`msft`msft`goog`msft`goog`aapl`goog`aapl`aapl`goog`msft`aapl ..
q)@[v; where v=`goog; :; `g]
`msft`msft`g`msft`g`aapl`g`aapl`aapl`g`msft`aapl`ibm`g`msft ..
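Note that the functional form above returns a modified copy and leaves v itself unchanged. To change the list in place, one option is indexed assignment, sketched here on a small stand-in list w (a hypothetical name) so the result is easy to read.

```q
w:`goog`aapl`goog`ibm        / small stand-in for v
w[where w=`goog]:`g          / assign `g at every matching index
w                            / `g`aapl`g`ibm
```

Either way, every matching item must be visited, whereas the enumerated form needs only the single change to sym.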
Important
Be extremely cautious about modifying the sym list manually (better not to do it at all). Should you corrupt the sym list, your entire database will be scrambled! Make a persistent copy/backup before modifying the list, else update your CV after.
7.5.7 Dynamically Appending to an Enumeration Domain
One situation in which an enumeration is more complicated than working with the unnormalized data is when you want to add a new value. Continuing with the example above, appending a new item to an ordinary list of symbols is a single operation. We saw in §7.5.3 that this is not the case when the new value is not in the enumeration domain.
q)sym:`goog`aapl`msft`ibm
q)v:1000000?sym
q)ev:`sym$v
q)v,:`nuba
q)ev,:`nuba
'cast
The new value must first be added to the unique list.
q)sym,:`nuba
q)ev,:`nuba
In practice, to use $ with dynamically generated values, you would have to test if the value you intend to append is already in the enumeration domain and, if not, append it there first. Fortunately, q will do this for you.
If you cannot know the full extent of the enumeration domain in advance, you can use (yet another overload of) ? to create and maintain the domain on the fly. The syntax of ? is the same as the enumeration overload of $: the name of a (unique) list of symbols as left operand and a source symbol or list of symbols as right operand.
This application of ? has the side effect of first checking to see if the source symbols are in the domain named by the left operand, appending any that aren't. In any case, it returns the enumerated version of the source just like $.
Tip
You can build the enumeration domain from scratch.
q)sym:()
q)`sym$`goog
'type
q)`sym?`goog
`sym$`goog
q)sym
,`goog
q)`sym?`ibm`aapl
`sym$`ibm`aapl
q)`sym?`goog`msft
`sym$`goog`msft
q)sym
`goog`ibm`aapl`msft
Our previous example now works, with ? in place of $.
q)ev,:`sym?`nuba
q)sym
`goog`ibm`aapl`msft`nuba
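To see the whole life cycle in one place, the sketch below starts from an empty typed domain and grows both the domain and the enumerated list with ?. The names dom and e are hypothetical; results are shown in trailing comments.

```q
dom:0#`                  / hypothetical domain, starts as an empty symbol list
e:`dom?`x`y`x            / x and y are appended to dom the first time they appear
dom                      / `x`y
`long$e                  / 0 1 0
e,:`dom?`z               / z is added to dom, then appended to e
dom                      / `x`y`z
value e                  / `x`y`x`z
```

Notice that the repeated x is stored once in dom, so the domain remains a list of unique symbols.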
7.5.8 Resolving an Enumeration
An enumerated symbol can be substituted for its equivalent symbol value in most expressions. However, there are some situations in which you need non-enumerated values. One case is converting from one enumeration domain to another, which happens when merging two kdb databases.
Given an enumerated symbol, or a list of such, you can recover the un-enumerated value(s) by applying the built-in value. In our on-going example,
q)sym:`goog`aapl`msft`ibm
q)v:1000000?sym
q)ev:`sym$v
q)value ev
`goog`ibm`aapl`aapl`ibm`aapl`ibm`ibm`aapl`ibm`goog`aapl`ibm ..
q)v~value ev
1b
This is a legitimate use of value, the function that is essentially the q interpreter.
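The domain conversion mentioned at the start of this section can be sketched by resolving with value and then enumerating over the second domain with ?. The domains d1 and d2 below are hypothetical and deliberately ordered differently.

```q
d1:`c`b`a
d2:`a`b                        / second hypothetical domain, missing `c
e1:`d1$`a`b`a`c
e2:`d2?value e1                / resolve, then enumerate over d2, extending it
d2                             / `a`b`c
(value e1)~value e2            / 1b  same symbols
(`long$e1)~`long$e2            / 0b  different underlying indices
```

The underlying indices differ because each enumeration stores positions in its own domain, which is why the values must be resolved before moving between domains.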