Unicode

This page demonstrates how to handle Unicode-encoded data in KDB-X, including storing, displaying, and processing text in various encodings such as UTF-8 and GBK.

Unicode text can be stored in symbol, byte list and character list (string) data types.
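As a minimal sketch of these three storage options, the session below holds the same text as a string, a byte list, and a symbol, and converts between them. The variable names s, b, and y are illustrative only, and the session is assumed to use UTF-8.

```q
q)s:"香蕉"            / character list (string)
q)b:`byte$s           / byte list
q)y:`$s               / symbol
q)type each (s;b;y)
10 4 -11h
q)s~`char$b           / the bytes cast back to the same string
1b
q)s~string y          / so does the symbol
1b
```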

Since the data is simply a sequence of bytes, you can store text in any encoding. However, it's best to use an encoding that extends 7-bit ASCII, such as UTF-8 or GBK. In these encodings, any byte in the range 00–7f represents the same character as in ASCII.

To display the data correctly, the console must use a matching code page. For example, if you store text in UTF-8, make sure the display code page is also set to UTF-8.

Examples of processing Unicode data

Store UTF-8 in a char vector

The two Chinese characters 香蕉 each use 3 bytes in UTF-8. In this example, the two Chinese characters are stored in a char vector, which is then shown to use six 1-byte characters (that is, 2 x 3 bytes).

When displayed in the console, the contents appear as octal escape sequences, showing six bytes. Comparison with the original UTF-8 characters returns true (1b). When written to stdout using -1, the characters appear in their UTF-8 form.

q)t:"香蕉"
q)type t
10h
q)count t
6
q)t
"\351\246\231\350\225\211"
q)t~"香蕉"
1b
q)-1 t;
香蕉
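Because count reports bytes rather than characters, a character count has to skip UTF-8 continuation bytes, which lie in the range 0x80 to 0xbf (decimal 128 to 191). The following is a sketch under that assumption. The helper name ulen is hypothetical and not part of q, and it assumes the input is valid UTF-8.

```q
q)`byte$"香蕉"                                  / the six underlying bytes
0xe9a699e89589
q)ulen:{sum not (`int$`byte$x) within 128 191}  / count the bytes that start a character
q)2=ulen "香蕉"
1b
q)5=ulen "a香b蕉c"
1b
```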

Store data in tables

Table and column names should be plain ASCII, but the data they hold can be any text. For example, the following table has Chinese characters in its symbol and character columns:

sym:`apples`bananas`oranges
name:(`$"蘋果";`$"香蕉";`$"橙")
text:("每日一蘋果, 醫生遠離我";"香蕉船是一道可口的甜品";"從佛羅里達州來的鮮橙很甜美")
t:([]sym;name;text)

You can work with this table as usual, but note that the q console displays entries in the character column (text) as octal escape sequences:

q)select sym,name from t
sym     name
--------------
apples  蘋果
bananas 香蕉
oranges 橙

q)select from t where name=`$"香蕉"
sym     name   text                                      ..
---------------------------------------------------------..
bananas 香蕉 "\351\246\231\350\225\211\350\210\271\346\..

Writing to stdout with -1 shows the formatted text:

q)-1 text 0;
每日一蘋果, 醫生遠離我
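String columns can also be filtered by content. The queries below are a sketch that uses like on the table t defined above. It assumes the pattern is compared with the stored text byte by byte, so the wildcard ? would stand for a single byte rather than a whole multi-byte character.

```q
q)select sym,name from t where text like "*香蕉*"   / rows whose text contains 香蕉
q)select sym from t where text like "每日*"          / rows whose text starts with 每日
```

For that reason, * is likely the safer wildcard when the pattern spans non-ASCII characters.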

Use external interfaces

You can send non-ASCII data using various programming interfaces, such as C or Python. The following example uses the C interface to connect over TCP and set two variables, each a char vector holding the same UTF-8 string.

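#include <unistd.h>  /* close */
#include "k.h"       /* C API: khp, k, ks, kp */

int main(){
  int c=khp("localhost",5001);          /* connect to localhost:5001 */
  if(c<=0)return 1;                     /* connection failed */
  k(c,"set",ks("a"),kp("香蕉"),(K)0);    /* UTF-8 literal in the source file */
  k(c,"set",ks("b"),kp("\351\246\231\350\225\211"),(K)0); /* same bytes as octal escapes */
  close(c);
  return 0;
}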

Handle Unicode scripts or statements

KDB-X can load a script saved in an encoding that extends 7-bit ASCII, such as UTF-8 or GBK, but it does not support other formats. If you use these encodings, make sure the script does not begin with a byte-order mark (BOM).
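If a script might carry a UTF-8 BOM (the three bytes ef bb bf), one way to check for it and remove it is to work on the raw bytes. The sketch below uses an in-memory byte vector. The helper name stripBom is hypothetical and not part of q.

```q
q)stripBom:{$[0xefbbbf~3#x;3_x;x]}   / drop a leading UTF-8 BOM if present
q)b:0xefbbbf,`byte$"a:1"             / a BOM followed by the ASCII text a:1
q)`char$stripBom b
"a:1"
q)`char$stripBom `byte$"a:1"         / input without a BOM is returned unchanged
"a:1"
```

Applied to a file, the same helper could be combined with read1 and 1:, for example `:clean.q 1: stripBom read1 `:script.q, where script.q and clean.q are placeholder paths.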

The q language itself uses only 7-bit ASCII. For example, the statement 2+3 should be given as the three decimal bytes 50 43 51, as in:

q)`char$50 43 51
"2+3"

Using value to evaluate the statement 2+3 results in 5:

q)value `char$50 43 51
5

Encodings that do not extend 7-bit ASCII, such as UTF-16 and UTF-32, cannot be used. For example, in little-endian UTF-16, the expression 2+3 would be represented by six decimal bytes: 50 0 43 0 51 0. The q language does not recognize this format.

q)value `char$50 0 43 0 51 0
'char