QforMortals2/casting and enumerations
Casting and Enumerations
Types and Cast
Casting manifests the malleability of data. In some cases, such as changing a string to a symbol, this is obvious and straightforward. Converting a char to its underlying ASCII code or converting a datetime to a float requires a little more consideration. Enumerations also fit into the cast pattern.
Basic Types
Every atom has both an associated numeric and symbolic data type. For convenience we repeat the data types table from atoms.
type | type symbol | type char | type num |
boolean | `boolean | b | 1h |
byte | `byte | x | 4h |
short | `short | h | 5h |
int | `int | i | 6h |
long | `long | j | 7h |
real | `real | e | 8h |
float | `float | f | 9h |
char | `char | c | 10h |
symbol | ` | s | 11h |
month | `month | m | 13h |
date | `date | d | 14h |
datetime | `datetime | z | 15h |
minute | `minute | u | 17h |
second | `second | v | 18h |
time | `time | t | 19h |
type
The monadic function type can be applied to any entity in q to find its (numeric) short data type. It is a quirk of q that the data type of an atom is a short with the negative of the value in the fourth column above.
        type 42
    -6h
        type 1b
    -1h
        type 4.2
    -9h
        type 4h
    -5h
        type `42
    -11h
        type "4"
    -10h
        type 2007.04.02
    -14h
Observe that infinities also carry a type.
        type 0W
    -6h
        type -0w
    -9h
The type of a simple list is a short containing the positive value of the type of its constituent atoms.
        type 1 2 3
    6h
        type "abc"
    10h
        type 1 2 3f
    9h
The type of any general list is 0h.
        type (1;2h;3j)
    0h
        type (1;2;(3 4))
    0h
        type (`1;"2";3)
    0h
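The three cases above (negative for atoms, positive for simple lists, 0h for general lists) can be folded into a small classifier. This is only a sketch: kind is a hypothetical helper name, not a built-in, and the 20h cutoff assumes the basic types are exactly those in the table above.

```q
/ kind: hypothetical helper that classifies an entity by its type number
/ assumption: type numbers 1h through 19h are the simple list types
kind:{t:type x; $[t<0h;`atom; t=0h;`general; t<20h;`simple; `other]}

kind 42          / `atom
kind 1 2 3       / `simple
kind "abc"       / `simple
kind (1;2h;3j)   / `general
```

Anything that is neither an atom nor a list of a basic type falls through to `other.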
Type of a Variable
How q handles the type of a variable may be confusing to those coming from verbose languages. In many typed languages, the variable's type must be specified before the variable is assigned a value - that is, when it is declared. In q, a variable is assigned without declaration. The variable can subsequently be reassigned a new value of a different type.
        a:42
        type a
    -6h
        a:98.6
        type a
    -9h
This is easier to understand once you see that q treats a variable as a name (symbol) associated with a value. The association is made upon assignment. A variable has the type of the value associated with its name.
In the example at hand, a variable with name 'a' is created when the initial assignment is made. Since this is the first time that the name 'a' is assigned, the q interpreter creates an entry for 'a' in its dictionary of variable names and associates it with the int value 42. On the second assignment, there is already an entry for 'a' in the dictionary, so this name is simply re-associated with the float value 98.6.
When you ask q for the type of a variable, it returns the type of the value associated with the variable's name. Thus, when you reassign the variable, the type of the variable reflects the type of its new value.
Cast ($)
As in verbose languages, it is possible to cast an entity from one type to another, provided the underlying values are compatible. Such a cast informs the compiler that you want it to consider the variable to be of the specified type for subsequent operations. Such a cast may result in a compile-time or run-time error if it cannot be performed.
The q cast operator, denoted $, is a binary verb that is atomic in its right operand (the source value) and whose left operand is the target type. The target can be specified using any of the three type designators in the Basic Types table.
- The type's (positive) numeric short value
- A char type value
- A type name symbol
First, examples using the numeric type.
        5h$42
    42h
        6h$4.2
    4
This form is useful when the target type is obtained programmatically using the type function.
It is arguably more readable to use the type's char in a cast.
        "i"$4.2
    4
        "x"$42
    0x2a
        "d"$2004.04.02T04:02:24.042
    2004.04.02
The most readable (but longest) form uses the symbolic type name.
        `int$4.2
    4
        `short$42
    42h
        `date$2004.04.02T04:02:24.042
    2004.04.02
The result of casting between superficially distinct types can be understood by considering the underlying numeric values. Chars correspond to their underlying ASCII codes; dates to their offset in days from Jan 1, 2000; and times to their count of milliseconds since midnight.
        "c"$0x42
    "B"
        `date$42
    2000.02.12
Because cast is atomic in its right operand, it is extended item-wise to a list.
        "x"$(10 20 30;255)
    0x0a141e
    0xff
Cast is also atomic in its left operand.
        5 6 7h$42
    42h
    42
    42j
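The correspondence between a type and its underlying numeric value, described earlier in this section, can also be seen by casting in the other direction. A minimal sketch; the comments give the underlying values, and later q versions may display them with an explicit type suffix.

```q
/ casting to int exposes the number stored underneath each value
`int$"B"            / 66, the ASCII code of B
`int$2000.02.12     / 42, days since 2000.01.01
`int$00:00:01.000   / 1000, milliseconds since midnight
```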
Advanced: When integral infinities are cast to integers of wider type, they are considered to be their underlying bit patterns. Since these bit patterns are legitimate values for the wider type, the cast results in a finite value.
        "i"$0Wh
    32767
        "i"$-0Wh
    -32767
        "j"$-0W
    -2147483647j
        "j"$0W
    2147483647j
Creating Symbols from Strings
Casting from a string (i.e., a list of char) to a symbol is a convenient way to create symbols. It is the preferred way to create symbols with embedded blanks or other special characters. To cast a char or a string to a symbol, use the empty symbol ( ` ) as the target domain.
        `$"z"
    `z
        `$"Zaphod Beeblebrox"
    `Zaphod Beeblebrox
        `$("Life";"the";"Universe";"and";"Everything")
    `Life`the`Universe`and`Everything
Cast is atomic in both operands.
A string is trimmed as part of the cast.
        `$" abc "
    `abc
        string `$" abc "
    "abc"
Parsing Strings to Data
Cast can also be used to parse data from a string by using an upper case type char in the left argument.
        "I"$"4267"
    4267
        "T"$"23:59:59.999"
    23:59:59.999
Date string parsing is flexible with respect to the format of the date.
        "D"$"2007-04-24"
    2007.04.24
        "D"$"12/25/2006"
    2006.12.25
        "D"$"07/04/06"
    2006.07.04
Coercing Types
Casting can be used to coerce type-safe assignment. Recall that assignment into a simple list must strictly match the type.
        c:10 20 30 40
        c[1]:42h
    'type
This situation can arise when the list and the assignment value are created dynamically. You can coerce the type by casting it to that of the target.
        c[1]:(type c)$42h
        c
    10 42 30 40
        c[0 1 3]:(type c)$(1.1; 42j; 0x2a)
        c
    1 42 30 42
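The coercion pattern above can be wrapped in a function when assignments are generated dynamically. The following is a sketch only: setAt is a hypothetical helper name, and it assumes its first argument is a simple list, so that its type is a valid cast target.

```q
/ setAt: hypothetical helper
/ casts v to the type of the simple list l, assigns at index i, returns the new list
setAt:{[l;i;v] l[i]:(type l)$v; l}

c:10 20 30 40
c:setAt[c;1;42h]                    / a short is coerced to the type of c
c                                   / 10 42 30 40
c:setAt[c;0 1 3;(1.1;42j;0x2a)]     / mixed source types, several indices
c                                   / 1 42 30 42
```

Because the function works on its own copy of the list, the result must be assigned back to c.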
Creating Typed Empty Lists
We met the empty list in lists. Observe that it has type 0h, meaning that it is a general list whose elements have no specific type.
        type ()
    0h
This empty list can be considered as the degenerate case of a general list, so we call it the general empty list. In situations where type enforcement is desired, it is necessary to have an empty list with a specific type. Casting the general empty list using a symbolic type name makes this clear.
        L1:`int$()
        type L1
    6h
        L2:`float$()
        type L2
    9h
        L3:`$()
        type L3
    11h
A typed empty list is the degenerate case of a simple list of the specified type. This is useful because type matching is enforced when you append items.
        L1,:4.2
    'type
        L1,:42
        L1
    ,42
Enumerations
We have seen that the dyadic casting operator ( $ ) transforms its right operand into a conforming entity of type specified by the left operand. In the basic operation, the left operand can be a char type abbreviation, a type short, or a symbol type name. In this section, casting is extended to user-defined target domains, providing a functional version of enumerated types.
Traditional Enumerations
To begin, recall that in some verbose languages, an enumerated type is a way of associating a series of names with a corresponding set of integral values. Often the sequence of numbers is consecutive and begins with 0. The specific set of names/values is called the domain of the enumerated type and its name identifies the enumeration.
A traditional enumerated type serves multiple purposes.
- It allows a descriptive name to be used instead of an arbitrary number - e.g., 'blue' instead of 3.
- It permits strong type checking to ensure that only permissible values are supplied - i.e., choosing a named color from a list instead of remembering a number is less prone to error.
- It can provide name spaces, meaning the same name can be reused in different domains without fear of confusion - e.g., color.blue and mood.blue.
There is a subtler, more powerful use: an enumeration normalizes data.
Data Normalization
Broadly speaking, data normalization seeks to eliminate duplicates and retain the minimum amount of data. Suppose you know that you will have a list—in either the colloquial or q sense—of text entries taken from a fixed and reasonably short set of values. Storing a long list of such strings verbatim presents two problems.
- Values of variable length complicate storage management for the list
- There is potentially much duplication of data in the list arising from repeated values
An enumeration solves both problems.
To see how, we start with the case of a q list v containing arbitrary symbols representing character values. Let u be the unique values in v. This is achieved with the distinct function (See Appendix A for a detailed description).
u:distinct v
Let's try a simple example.
        v:`c`b`a`c`c`b`a`b`a`a`a`c
        u:distinct v
        u
    `c`b`a
Observe that order of the items in u is the order of their first appearances in v.
Now consider a new list k that represents the positions in u of each of the items in v. This is achieved with the find (?) operator (See Find).
        k:u?v
        k
    0 1 2 0 0 1 2 1 2 2 2 0
Then we have,
        u[k]
    `c`b`a`c`c`b`a`b`a`a`a`c
        v~u[k]
    1b
We observe that u and k indeed normalize the data of v. In general, v will have many repetitions of each of the underlying values, but u stores each value once. Changing an underlying value requires only one operation in the normalized version but potentially many updates to the non-unique list.
Extra credit for recognizing that v is simply the composite map u◦k. Effectively, we have factored the non-unique list v through the unique list u via the index map k.
v = u◦k
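The factorization just described can be packaged as a function that returns both pieces. This is a sketch under the assumptions of the text; factor is a hypothetical helper name, not part of q.

```q
/ factor: hypothetical helper returning (unique values; index list) for a list v
factor:{[v] u:distinct v; (u;u?v)}

v:`c`b`a`c`c`b`a`b`a`a`a`c
r:factor v
u:r 0        / `c`b`a
k:r 1        / 0 1 2 0 0 1 2 1 2 2 2 0
v~u k        / 1b: indexing u by k reconstitutes v
```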
Why would we want to do this? Easy: compactness and speed.
Advanced: Let's say that the count of u is a and the maximum width (in the colloquial sense) of the symbols in u is b. For a list v of variable count x, the amount of storage required is potentially
b*x
For the factored form, the storage is known to be
a*b+4*x
which represents the fixed amount of storage for u plus the variable amount of storage for the simple integer list k. If a is small and b is even moderately large, the factorization is significantly smaller.
This can be seen by comparing the sizes of v, u and k in a slightly modified version of our example.
        v:`ccccccc`bbbbbbb`aaaaaaa`ccccccc`ccccccc`bbbbbbb
        u:distinct v
        u
    `ccccccc`bbbbbbb`aaaaaaa
        k:u?v
        k
    0 1 2 0 0 1
Now imagine v and k to be much longer.
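The storage estimates above can be worked through for concrete numbers. The values below are assumptions chosen for illustration (three unique symbols of width 7, as in the modified example, and a list of one million items). Note that q evaluates expressions right to left, so the product a*b must be parenthesized.

```q
a:3          / count of u (assumed)
b:7          / maximum symbol width in chars (assumed)
x:1000000    / count of v (assumed)

raw:b*x               / 7000000 for the unfactored list
factored:(a*b)+4*x    / 4000021 for u plus the integer list k
factored<raw          / 1b
```

Written without parentheses, a*b+4*x would be evaluated as a*(b+4*x), which is not the intended formula.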
Reading and writing the factored index list from/to disk is a block operation that will be very fast.
Assuming that items of v are symbols stored in a hash-table, item indexing in the un-factored list requires looking up each symbol. Indexing into the factored list can be done directly via position since it is a uniform list of integers. This will be faster.
Enumerations
Enumeration encapsulates the above factorization of an arbitrary list of symbols through a list of unique values. An enumeration uses the binary cast operator ($) and is a generalization of the basic cast between types.
The general form of an enumerated value is,
`u$v
where u is a simple list of unique symbol values and v is either an atom in u or a list of such. The projection `u$ is the enumeration, u is the domain of the enumeration and `u$v represents the enumerated value(s).
Under the covers, applying the enumeration `u$ to a vector v actually factors v through u as in the previous section. The resulting index list k is stored internally and the lookup is performed automatically.
Working with an Enumeration
We recast our factorization example as an enumeration,
        u:`c`b`a
        v:`c`b`a`c`c`b`a`b`a`a`a`c
        ev:`u$v
        ev
    `u$`c`b`a`c`c`b`a`b`a`a`a`c
While the display of the enumeration ev shows the values of v within the domain u, only the implicit int index list is actually stored.
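One way to look at the stored index list is to cast the enumeration to int. The sketch below assumes, as in the q versions I am aware of, that casting an enumerated value to int returns its indices into the domain; the name k is reused from the factorization example.

```q
u:`c`b`a
v:`c`b`a`c`c`b`a`b`a`a`a`c
ev:`u$v

k:`int$ev       / the underlying indices into u
k~`int$u?v      / 1b: the same indices that find produces
v~u k           / 1b: indexing the domain by k recovers v
```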
The enumeration ev acts just like the original v.
        v[3]
    `c
        ev[3]
    `u$`c
        v[3]:`b
        v
    `c`b`a`b`c`b`a`b`a`a`a`c
        ev[3]:`b
        ev
    `u$`c`b`a`b`c`b`a`b`a`a`a`c
        v=`a
    001000101110b
        ev=`a
    001000101110b
        v in `a`b
    011101111110b
        ev in `a`b
    011101111110b
Note: While the enumeration is item-wise equal to - and can be freely substituted for - the original, they are not identical.
        v=ev
    111111111111b
        v~ev
    0b
The find operator ( ? ) can be used with an enumeration to locate the first position of specific values.
        v?`a
    2
        ev?`a
    2
The function where can be used to find all occurrences of a specific value.
        where v=`a
    2 6 8 9 10
        where ev=`a
    2 6 8 9 10
Updating an Enumeration
The normalization provided by an enumeration reduces updating all occurrences of a value into a single operation. This can have significant performance implications for large lists with many repetitions.
With u, v and ev as originally defined above,
        u[1]:`x
        ev
    `u$`c`x`a`c`c`x`a`x`a`a`a`c
        v
    `c`b`a`c`c`b`a`b`a`a`a`c
To make the equivalent update to v, it is necessary to change every occurrence.
        v[where v=`b]:`x
        v
    `c`x`a`c`c`x`a`x`a`a`a`c
Appending to an Enumeration
One situation in which an enumeration is more complicated than working with the denormalized data is when you want to add a new value. Continuing with the example above, appending a new item to v is a single operation, but this is not the case for the corresponding enumeration ev.
        u:`c`b`a
        v:`c`b`a`c`c`b`a`b`a`a`a`c
        ev:`u$v
        v,:`d
        v
    `c`b`a`c`c`b`a`b`a`a`a`c`d
        ev,:`d
    'cast
What went wrong? The new value must first be added to the unique list.
        u,:`d
        ev,:`d
        ev
    `u$`c`b`a`c`c`b`a`b`a`a`a`c`d
You may have already recognized that this presents a complication in practice. Because you may not know whether the value you are appending to v is already in u, in order to maintain uniqueness in u you must test this before appending.
Fortunately, q has anticipated this situation. When dyadic ? is used with the name of a (simple) list of symbols as its left argument and a symbol as its right argument, it appends the symbol to the list if and only if it is not an item in the list.
        u
    `c`b`a`d
        `u?`a
    `u$`a
        u
    `c`b`a`d
        `u?`e
    `u$`e
        u
    `c`b`a`d`e
If you wish to append items to an enumerated value programmatically, simply add to the unique list using ? before appending to the enumerated value.
        u:`c`b`a
        v:`c`b`a`c`c`b`a`b`a`a`a`c
        ev:`u$v
        `u?`e
    `u$`e
        ev,:`e
        u
    `c`b`a`e
        ev
    `u$`c`b`a`c`c`b`a`b`a`a`a`c`e
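The same idea extends to appending several items at once. The sketch below rests on two assumptions not stated in the text: that ? with a list name on the left also accepts a list of symbols on the right (as it does in current q releases), and that its already-enumerated result can be appended directly to ev. The name items is chosen for illustration.

```q
u:`c`b`a
ev:`u$`c`b`a`c
items:`e`a`f        / `a is already in the domain; `e and `f are not

ev,:`u?items        / extend u where needed, then append the enumerated result
u                   / `c`b`a`e`f
value ev            / `c`b`a`c`e`a`f
```

Only the two new symbols are added to u, so the domain stays unique.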
Resolving an Enumeration
If you are given an enumerated value, you can recover the original value by applying value. In our example,
        ev
    `u$`c`b`a`c`c`b`a`b`a`a`a`c
        value ev
    `c`b`a`c`c`b`a`b`a`a`a`c
Type of an Enumeration
Each enumeration is assigned a new numeric data type, beginning with 20h. If you start a new q session and load no script files, you will observe the following.
        u1:`c`b`a
        u2:`2`4`6`8
        u3:`a`b`c
        u4:`c`b`a
        type `u1$`c`a`c`b`b`a
    20h
        type `u1$`a`a`b`b`c`c
    20h
        type `u2$`8`8`4`2`6`4
    21h
        type `u3$`c`a`c`b`b`a
    22h
        type `u4$`c`a`c`b`b`a
    23h
Note: Enumerations with distinct domains are distinct, even when the domains match.
        u1~u4
    1b
        v:`c`a`c`b`b`a
        (`u1$v)~`u4$v
    0b
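When two enumerated values over different domains need to be compared by content, one option is to resolve both with value first. A brief sketch using the names from the example above:

```q
u1:`c`b`a
u4:`c`b`a
v:`c`a`c`b`b`a

(`u1$v)~`u4$v                 / 0b: distinct domains, so not identical
(value `u1$v)~value `u4$v     / 1b: the resolved symbol lists match
```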
©2006-2007 Kx Systems, Inc. and Continuux LLC. All rights reserved.