QforMortals2/overview

Overview

The Evolution of q

The q programming language and its database kdb+ were developed by Arthur Whitney and released by Kx Systems, Inc. in 2003. The primary design objectives of q are expressiveness, speed and efficiency. In these, it is beyond compare. The design tradeoff is a terseness that can be disconcerting to programmers coming from more verbose database programming environments, e.g., C++, Java or C# combined with SQL. While the q programming gods revel in programs resembling an ASCII core dump, this manual is for the rest of us.

Q evolved from APL (A Programming Language), which was first invented as a mathematical notation by Kenneth Iverson at Harvard University in the 1950s. APL became one of the first computer languages when it was introduced by IBM as a vector programming language, meaning that it was able to process lists of numbers in a single operation. It became successful in finance and other industries that required heavy number crunching.

Since q is a vector processing language by birth, it is well suited to performing complex calculations quickly on large volumes of data. What's new in q is that it can also process large volumes of data very efficiently in the relational paradigm. Its syntax allows select expressions that are similar to SQL 92, and its collection of built-in functions forms a rich superset of those in SQL 92.

There is also some LISP in q's genes. In fact, the fundamental data construct of q is a list. The notation and terminology are different, but the functionality is there and is arguably simpler. For those so inclined, writing compilers is a snap in q.

The DNA sequencing of q also shows the influence of functional programming. While q is not purely functional, it is arguably as functional as C++, Java and C# are object-oriented.

Philosophy

A proficient q developer thinks differently than in conventional RDBMS programming environments such as C++, Java and C#, henceforth referred to as "verbose programming". In order to get you into the correct mindset, we summarize some of the potential discontinuities for the q newbie.

There are three major issues in verbose database programming:

Much of verbose programming design is spent getting the various representations correct, and much of verbose programming code is spent marshaling resources and synchronizing the different representations. These issues disappear in q.

In Memory Database: One way to think of kdb+ is as an in-memory database with persistent backing. The form in which entities are held in memory is virtually identical to the way they are stored on disk and transported. Since data manipulation is performed in memory with q, there is no separate stored procedure language. This is somewhat akin to disconnected record sets in ADO.NET, but there is no separation between the language used to construct the table objects (C#) and that used to manipulate the data in the tables on disk (SQL).

Interpreted: Q is interpreted instead of compiled. During execution, data and functions live in an in-memory workspace. Iterations of the development cycle tend to be quick because all information needed to debug is available in the workspace. Q programs are stored and executed as scripts. In addition, q functions can be created as strings and executed dynamically, so it is possible to write self-modifying code.
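
As a minimal sketch of dynamic execution, the built-in value can evaluate a string as q code. The names s and f below are chosen only for illustration.

```q
/ hypothetical illustration: code held as a string, then evaluated
s:"6*7"
value s            / evaluates the string as q code: 42
f:value "{x*x}"    / build a function from its source text
f 3                / 9
```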

Ordered Lists: Because classical SQL is based on unordered sets, the order of rows in a table is not defined. In q, ordered lists are the foundation of all non-trivial data structures, so table rows have an order. This makes processing large volumes of time series data easy and fast. Very fast.
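
Because rows have an order, "the previous row" is a meaningful notion. A small sketch with a made-up price list px, using the built-ins prev and deltas:

```q
/ px is a hypothetical list of prices in time order
px:10 12 11 15f
prev px        / each item's predecessor: 0n 10 12 11
deltas px      / change from the previous item: 10 2 -1 4
```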

Evaluation Order: While q is written left-to-right, expressions are evaluated right-to-left or, as the q gods prefer, left of right, meaning that the function or operator to the left executes on what is to the right of it. There is no operator precedence, so parentheses are rarely needed to resolve operation order.
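
A short illustration of left-of-right evaluation, which may surprise anyone expecting multiplication to bind more tightly than addition:

```q
2*3+4        / 14: 3+4 is evaluated first, then 2*7
(2*3)+4      / 10: parentheses force the product first
```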

Table Oriented: Give up objects, ye who enter here. In contrast to the languages mentioned above, q does not implement such concepts of object-oriented programming as classes, inheritance and virtual methods. Instead, q builds complexity through the construction and mapping of ordered lists, which are actually sequences or vectors in mathematical parlance. The higher-level constructs for data manipulation in q are dictionaries and tables. A function in q can be named globally in the workspace, or defined anonymously within another function. Variables can be global or local to a function.

Column Oriented: SQL tables store data in rows and an operation applies to a field one row at a time. Q stores tables as columns and applies an operation to an entire column vector.
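
A rough sketch of what column orientation looks like in practice. The table t and its columns sym and px are invented for this example.

```q
/ t is a hypothetical two-column table
t:([] sym:`a`b`c; px:10 20 30f)
t`px                     / the whole px column as a single float vector
2*t`px                   / one operation applied to the entire column: 20 40 60
update px:2*px from t    / the same idea in q's SQL-like syntax
```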

Types: Q is a strongly typed, dynamically checked language, but its typing is less cumbersome than that of many typed languages. Each variable has a value of well-defined type and type promotion for operations is automatic. A variable's type is not explicitly declared; instead, the type of a variable name reflects the value assigned to it. Lists that have been assigned with a homogeneous data type will not accept or promote other types.
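
The following sketch contrasts automatic promotion in arithmetic with the stricter behavior of a homogeneous list. The name L is hypothetical, and the last line is expected to fail, so it is placed at the end.

```q
1+2.5          / 3.5: the integer is promoted for the operation
L:1 2 3        / a homogeneous list of integers
L[1]:42        / fine: the new item has the same type
L[1]:4.5       / signals a 'type error: the list will not accept a float
```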

Null Values: In classical SQL, the value NULL represents missing data for a field of any type. In q, types have separate null values. Infinite and null values can participate in arithmetic and other operations with reasonable results.
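
A few one-liners, offered as a sketch, showing typed nulls and infinities taking part in ordinary operations. Here 0N is the integer null, 0n the float null and 0w positive float infinity.

```q
null 0N        / 1b: 0N is recognized as a null
null 0n        / 1b: so is the float null
1+0N           / the result is still null
0w+1           / the result is still infinite
0N=0N          / 1b: unlike SQL NULL, a q null equals itself
```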

Integrated I/O: I/O is done through handles that act as functional windows to the outside world. Once such a handle is set up, retrieving the handle's value results in a read and passing a value to the handle is a write.

Mathematical Functions Refresher

In order to understand q, it is important to have a clear grasp of the basic concepts and terminology of mathematical functions. There is no shortcut. In fact, nearly all the constructs of q can be understood as function mappings. The following refresher may help those who are unfamiliar or rusty with mathematical functions.

In mathematics, a function associates a unique output value with each input value. The collection of all input values is the domain of the function and the range is the collection from which the output values are chosen. A function is also called a map (or mapping) from the domain to the range.

The output value that a function f associates to an input value x is read "f of x." More verbosely, we say that the output is the result of applying f to the input parameter(s), or that the output value is f evaluated at x. In mathematics and most programming languages, the output value of a function is represented with the function name to the left of its arguments. The arguments are usually enclosed in matching parentheses or brackets and are separated by commas or semicolons.

There are two basic ways to define a function: an algorithm or a graph. You can specify an algorithm as a list of formulas that perform a sequence of operations on an input value to arrive at the corresponding output value. For example, we define the squaring function, over the domain and range of real numbers, to assign as output value the input value times itself. Alternatively, you can define a function by explicitly listing all the input-output associations. The collection of associated inputs and outputs is the graph of the function.

As you will no doubt recall from many bucolic hours in high-school math class, a function defined by formula can always be converted to a graph by feeding in input values, cranking out the associated outputs, and collecting the results into a table. In general, there is no explicit formula to calculate the values for an arbitrary input-output graph. If it is possible to define a function via a formula, this is usually the preferred way to specify it since it is compact, but there is no guarantee that the formula will be easy or quick to compute.

Here are the two forms for the squaring function over the domain of integers 0 through 3, as you might recall them from school,

f(x) = x²

I O
0 0
1 1
2 4
3 9

When graphing a function, we normally think of the I/O table as a list of (x,y) pairs

        (0, 0)
        (1, 1)
        (2, 4)
        ...

However, it can also be viewed as a pair of columns in which there is a positional correspondence between the input column and the output column.

        0  --> 0
        1  --> 1
        2  --> 4
        ...

The latter perspective will prove very useful.
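
To make the two views concrete, here is a small sketch in q. The formula becomes a function and the pair of columns becomes a dictionary built with !. The names sq, I, O and g are chosen only for this illustration.

```q
sq:{x*x}       / the map defined by a formula
I:0 1 2 3      / input column
O:sq I         / output column: 0 1 4 9
g:I!O          / the graph: inputs paired positionally with outputs
g 2            / 4, looked up rather than computed
```

Both sq 2 and g 2 yield 4; the first computes the value and the second retrieves it by position.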

The number of arguments to a function is called its valence. Some valences are common enough to have their own terminology. A function of valence 1 (i.e., defined by an algorithm that has one parameter) is said to be monadic. An example is neg(x) that takes a number and returns the negative of the number (i.e., -1 times the number). A function of valence 2 (i.e., two parameters) is said to be dyadic. An example is sum(x, y) that takes two numbers and adds them to get the result. A function with no parameters is niladic; for example, a function with no arguments that returns the constant 3.
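
In q terms, the three valences might look like the following sketch. The anonymous functions are illustrative only; neg is a q built-in.

```q
neg 5          / monadic: -5
{x+y}[2;3]     / dyadic: 5
2+3            / the dyadic operator + written infix: 5
{3}[]          / niladic: 3
```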

Given functions f and g for which the range of g is (contained in) the domain of f, the composite of f and g, denoted f◦g, is the function obtained by chaining the output of g into f. That is, the composite assigns to an input x the output value f(g(x)). Pictorially, we can see that the composite chains the output of g into the input of f,

           g        f
        x --> g(x) --> f(g(x))

The domain of the composite is the domain of g and its range is the range of f.
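
A minimal sketch of composition in q, using two made-up maps f and g and a hypothetical name fg for their composite:

```q
g:{x+1}
f:{x*x}
f g 3          / f(g(3)) = 16
fg:{f g x}     / the composite as a function in its own right
fg 3           / 16
```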

A recursive function is a function over an enumerable domain, usually the positive integers, whose definition has a special form. It is defined on some initial value; then for other values, it is defined in terms of previously defined values. In the most common case, a recursive function is defined explicitly for the input value 0 (the initial case) and its value for any n>0 is specified in terms of its values up to n-1. Often, but not always, the value for n is defined in terms of its value for n-1 only. In some situations the initial case will correspond to 1 instead of 0. Many definitions and operations on lists in q will be presented recursively.
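
The classic example is factorial, sketched here in q. The name fact is our own, and the conditional form $[condition; then; else] is used ahead of its proper introduction.

```q
/ initial case: 0 maps to 1; otherwise n is defined via the value for n-1
fact:{$[x=0; 1; x*fact x-1]}
fact 5         / 120
```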

Important: In the remainder of this document, we shall use the term map, or mapping, to refer to a mathematical function and will always mean a q function when we write "function" without a modifier.

We hope this trip down mathematics memory lane is not new territory for you. If it is, we strongly advise that you linger here until you're comfortable with the material before proceeding. There is no escaping the fact that q is a language whose foundation is mathematical functions. If you build on shaky ground, your understanding will certainly collapse under the weight of what is to come.

Getting Started

Starting Q

The installation places the q executable in $HOME/q (or $QHOME) on Unix-based systems, or in the q directory on the c: drive on Windows.

Start a q session by typing 'q' on the command line. You should see a new window with the Kx Systems copyright notice followed by a q command line. You will see a leading q) on the command line. This is the q console. Type '6*7' and press Enter to see the result.

        q)6*7
        42
        q)

Note: In this manual, to increase readability, we omit the q prompt in all our snippets, showing the input you type indented and the response left-justified.

        6*7
42
        _

Here the '_' represents the blinking cursor awaiting your next input.

Variables

A variable is a name that is associated with some q entity. Declaring a variable and assigning its value are done in a single step with ':', called amend (or assign). Note that assignment does not misuse '=' as many languages do. To assign the integer value 42 to the variable a, write

        a:42

A variable name must start with an alphabetic character, which can be followed by alphabetic characters, digits or underscores. Some folks read the assignment operation succinctly as "gets."

Naming Recommendations:
1. Choose a name long enough to make the purpose of the entity evident, but no longer. The purpose of a name is to communicate to another reader, and a longer name does not necessarily make code easier to read. For example, ‘chkDsk’ is clearer than ‘cd’ and no less clear than ‘checkDisk’.
2. Use verbs for operators and functions; use nouns for data.
3. Be consistent in your use of abbreviations. Be mindful that even “obvious” abbreviations may be opaque to readers whose native language is different from yours.
4. Be consistent in your use of capitalization, such as initial caps, camel casing, etc. Pick a style and stick to it.
5. Use contexts for namespacing.
6. Do not use names such as ‘int’, ‘float’ or other words that are used by q. While not reserved, some carry special meaning when used as arguments for certain q operators.
7. Refrain from using the underscore character in q names. If you insist on using underscore in names, do not use it as the last character. Expressions involving the built-in _ operator and names with underscore will be difficult to read.

Whitespace

In general, q permits, but does not require, whitespace around operators, separators, brackets, braces, etc. You could also write the above expression as

        a : 42

or,

        a: 42

Because the q gods prefer compact code, you will often see programs with no superfluous whitespace ... none, zilch, zip, nada. To help you get accustomed to this terseness, we will use whitespace sparingly, mainly in juxtaposition and after semicolon and comma separators. You should feel free to add whitespace for readability where it is permitted, but be consistent in its use or omission. We will point out where whitespace is required or forbidden.

The q Console

Once you type your preferred version of the above assignment into the q console (which you should do now), the only response you will see is the cursor awaiting input on the next line. To see the value of a, type its name and press Enter.

        a:42
        a
42

You may wonder why the q console does not echo the value of an assignment. This is simply a design feature of the q console.

Note: One noticeable change in release 2.4 of q is the console display of lists, dictionaries and tables. For those accustomed to the k-like display in 2.3, the console representation of complex data types in 2.4 is that of the show function. See the section on .z.pi in Chapter 12 Commands and System Variables on how to alter the console display.

Comments

In q, the forward-slash character (/) is used to indicate the beginning of a comment. Otherwise put, / instructs the interpreter to ignore anything to the end of the line.

Note: At least one whitespace character must separate / from any text to the left of it on a line.

In the following example, no definition of b is processed, so an error occurs.

        a: 42         / nothing here counts     b:6*7
        b
'b

And the following generates an error,

        a:42/ intended to be a comment
'

Recommendation: The q gods have no need for explanatory error messages or comments since their q code is always correct and self-documenting. Mortals spend many hours poring over cryptic q error messages such as the one above indicating b is undefined. Moreover, many mortals eschew comments in misguided misanthropic coding macho. Don't.

Assignment Value

A variable is not explicitly declared or typed. Instead, the value assigned to a variable carries the type. In our example, the expression to the right of the assignment is syntactically an integer value, so the name 'a' is associated with a value of type int. (The q types will be covered in Chapter 1 Atoms.)

The fact that variables are not declared before assignment means that an assignment can be interpreted either as the initial assignment or as a re-assignment, depending on the context. It is perfectly permissible to reassign a variable with a value of different type. Once this is done, the name will reflect the new type of the value assigned to it.

Warning: You can unintentionally change the type of a variable with a wayward assignment. Or you can inadvertently reuse a variable name and wipe out any data in the variable. An undetected typo can result in data being sent to a black hole. Be careful to enter variable names correctly.
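
A brief sketch of a reassignment that changes a variable's type, using the built-in type to inspect the result. The exact code reported for the integer depends on the q version, so it is not shown.

```q
a:42
type a           / a negative code, indicating an integer atom
a:"forty-two"    / same name, now bound to a list of characters
type a           / 10h
```

Nothing warns you about the switch, which is exactly why wayward assignments are worth guarding against.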

Some verbose languages permit only a variable name to the left of an assignment. In q, as in C, an assignment carries the value being assigned and can be used as part of a larger expression. So we find,

        1+a:42
43

Or,

        b:1+a:42
        b
43

Order of Parsing

The interpreter evaluates the above specification of b by parsing the expression from right to left (more on this in Chapter 3 Primitive Operations). Spelled out verbosely,

The integer 42 is assigned to a variable named a, then this value is added to the integer 1, then this result is assigned to a variable named b

Because the interpreter always parses expressions from right to left, programmers can read q expressions from left to right,

The variable b gets the value of the integer 1 plus the value assigned to the variable a, which gets the integer 42

The ability to use the results of assignments in expressions permits a single line of q code to perform the work of an entire verbose program. Such an expression may execute more quickly than an equivalent version with the assignments split onto multiple statements, but the tradeoff is a reduction in readability and maintainability. The q gods carry terseness to the extreme. This choice of programming style should be avoided by mortals, as it can easily lead to write-only code.
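
For comparison, here is the same computation written the mortal way, with one assignment per line. This is a sketch of the style tradeoff only; no claim is made about relative speed.

```q
a:42
b:1+a
b          / 43
```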

Sample Q Program

Now that we know how q works and how to start it up, let's examine some real code that shows the power of q. The following program reads a csv file of time-stamped symbols and prices, places the data into a table and computes the maximum price for each day. It then opens a socket connection to a q process on another machine and retrieves a similar daily aggregate. Finally, it merges the two intermediate tables and appends the result to an existing file.

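sample:{
 t:("DSF"; enlist ",") 0: `:c:/q/data/px.csv;          / load the csv as date, symbol and float columns, named by its header row
 tmpx:select mpx:max Price by Date,Sym from t;         / local maximum price by date and symbol
 h:hopen `:aerowing:5042;                              / open a connection to the q process on host aerowing, port 5042
 rtmpx:h "select mpx:max Price by Date,Sym from tpx";  / run the equivalent aggregate on the remote process
 hclose h;                                             / close the connection
 .[`:c:/q/data/tpx.dat; (); ,; rtmpx,tmpx]             / merge the two results and append them to the existing file
}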

Most people have two immediate reactions upon seeing q code for the first time. First, they are amazed at how much can be done with so little code. Second, they wonder if they will ever be able to read it! We promise that by the end of this tutorial, this program will be easy, and you'll feel right as rain.



©2006-2007 Kx Systems, Inc. and Continuux LLC. All rights reserved.
