Type mapping between Arrow and kdb+¶
The data layout of an Arrow table is defined by its schema. The schema is composed of a list of fields, one for each column in the table. Each field describes the column's name and its datatype. This page examines each of these and details how they are mapped in kdb+.
Arrow datatypes¶
Currently Arrow supports over 35 datatypes including concrete, parameterized and nested datatypes.
Similar to the C++ Arrow library and PyArrow, arrowkdb exposes the Arrow datatype constructors to q. When one of these constructors is called, it returns an integer datatype identifier which can then be passed to other functions, e.g. when creating a field.
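For example, a datatype can be constructed and inspected as follows (a brief sketch using the `.arrowkdb.dt` constructor namespace; printed output is indicative):

```q
q)int_dt:.arrowkdb.dt.int64[]          / constructor returns an integer datatype identifier
q).arrowkdb.dt.printDatatype[int_dt]   / display the datatype the identifier refers to
int64
```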
Concrete¶
Concrete datatypes have a single fixed representation.
arrow datatype | description | kdb+ representation |
---|---|---|
na | NULL type having no physical storage | mixed list of empty lists |
boolean | boolean as 1 bit, LSB bit-packed ordering | 1h |
uint8 | unsigned 8-bit little-endian integer | 4h |
int8 | signed 8-bit little-endian integer | 4h |
uint16 | unsigned 16-bit little-endian integer | 5h |
int16 | signed 16-bit little-endian integer | 5h |
uint32 | unsigned 32-bit little-endian integer | 6h |
int32 | signed 32-bit little-endian integer | 6h |
uint64 | unsigned 64-bit little-endian integer | 7h |
int64 | signed 64-bit little-endian integer | 7h |
float16 | 2-byte floating point value (populated from uint16_t) | 5h |
float32 | 4-byte floating point value | 8h |
float64 | 8-byte floating point value | 9h |
utf8 | UTF8 variable-length string | mixed list of 10h |
large_utf8 | large UTF8 variable-length string | mixed list of 10h |
binary | variable-length bytes (no guarantee of UTF8-ness) | mixed list of 4h |
large_binary | large variable-length bytes (no guarantee of UTF8-ness) | mixed list of 4h |
date32 | int32_t days since the Unix epoch | 14h (with automatic epoch offsetting) |
date64 | int64_t milliseconds since the Unix epoch | 12h (with automatic epoch offsetting and ms scaling) |
month_interval | interval described as a number of months | 13h |
day_time_interval | interval described as number of days and milliseconds | 16h (with automatic ns scaling) |
Parameterized¶
Parameterized datatypes represent multiple logical interpretations of the underlying physical data, where each parameterized interpretation is a distinct datatype in its own right.
arrow datatype | description | kdb+ representation |
---|---|---|
fixed_size_binary (byte_width) | fixed-size binary: each value occupies the same number of bytes | mixed list of 4h |
timestamp (time_unit) | exact timestamp encoded with int64_t (as number of seconds, milliseconds, microseconds or nanoseconds since Unix epoch) | 12h (with automatic epoch offsetting and TimeUnit scaling) |
time32 (time_unit) | time as signed 32-bit integer, representing either seconds or milliseconds since midnight | 19h (with automatic TimeUnit scaling) |
time64 (time_unit) | time as signed 64-bit integer, representing either microseconds or nanoseconds since midnight | 16h (with automatic TimeUnit scaling) |
duration (time_unit) | measure of elapsed time in either seconds, milliseconds, microseconds or nanoseconds | 16h (with automatic TimeUnit scaling) |
decimal128 (precision, scale) | precision- and scale-based signed 128-bit integer in twos complement | mixed list of 4h (each of length 16) |
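A parameterized datatype takes its parameters as constructor arguments; for example (a sketch, with time units passed as symbols such as `` `nano ``; printed output is indicative):

```q
q)ts_dt:.arrowkdb.dt.timestamp[`nano]    / timestamp parameterized by its TimeUnit
q).arrowkdb.dt.printDatatype[ts_dt]
timestamp[ns]
q)dec_dt:.arrowkdb.dt.decimal128[38;2]   / precision 38, scale 2
```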
Nested¶
Nested datatypes define higher-level groupings of either their child datatypes or their constituent fields. (A field specifies its datatype and the field’s name.)
arrow datatype | description | kdb+ representation |
---|---|---|
list (datatype_id) | list datatype specified in terms of its child datatype | mixed list for the parent list array containing a set of sublists (of type determined by the child datatype), one for each of the list value sets |
large_list (datatype_id) | large list datatype specified in terms of its child datatype | mixed list for the parent list array containing a set of sublists (of type determined by the child datatype), one for each of the list value sets |
fixed_size_list (datatype_id, list_size) | fixed size list datatype specified in terms of its child datatype and the fixed size of each of the child lists | same as variable-length lists, except each of the sublists must be of length equal to the list_size |
map (key_datatype_id, item_datatype_id) | map datatype specified in terms of its key and item child datatypes | mixed list for the parent map array, with a dictionary for each map value set |
struct (field_ids) | struct datatype specified in terms of a list of its constituent child field identifiers | mixed list for the parent struct array, containing child lists for each field in the struct |
dictionary (value_datatype_id, index_datatype_id) | a dictionary datatype specified in terms of its value and index datatypes, similar to pandas categorical | two-item mixed list: values and indexes lists |
sparse_union (field_ids) | union datatype specified in terms of a list of its constituent child field identifiers | similar to a struct array except the mixed list has an additional type_id array (5h) at the start which identifies the live field in each union value set |
dense_union (field_ids) | union datatype specified in terms of a list of its constituent child field identifiers | similar to a struct array except the mixed list has an additional type_id array (5h) at the start which identifies the live field in each union value set |
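Nested constructors take the child datatype or field identifiers as arguments, and the corresponding kdb+ array data follows the representations in the table above. A sketch:

```q
q)list_dt:.arrowkdb.dt.list[.arrowkdb.dt.int64[]]   / list datatype with an int64 child
q)/ its kdb+ representation: a mixed list of int64 sublists, e.g.
q)list_data:(1 2j;enlist 3j;4 5 6j)
q)/ a struct datatype is built from child field identifiers
q)struct_dt:.arrowkdb.dt.struct[(.arrowkdb.fd.field[`x;.arrowkdb.dt.int64[]];.arrowkdb.fd.field[`y;.arrowkdb.dt.utf8[]])]
```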
Inferred¶
You can have arrowkdb infer a suitable Arrow datatype from the type of a kdb+ list.
Similarly, Arrow schemas can be inferred from a kdb+ table.
This approach is easier to use but supports only a subset of the Arrow datatypes and is considerably less flexible.
Infer Arrow datatypes if you are less familiar with Arrow or do not wish to use the more complex or nested Arrow datatypes.
kdb+ list type | inferred Arrow datatype | notes |
---|---|---|
1h | boolean | |
2h | fixed_size_binary (16) | writing path only, reads as mixed list of 4h |
4h | int8 | |
5h | int16 | |
6h | int32 | |
7h | int64 | |
8h | float32 | |
9h | float64 | |
10h | int8 | writing path only, reads as 4h |
11h | utf8 | writing path only, reads as mixed list of 10h |
12h | timestamp (nano) | |
13h | month_interval | |
14h | date32 | |
15h | NA | cast in q with `timestamp$ |
16h | time64 (nano) | |
17h | NA | cast in q with `time$ |
18h | NA | cast in q with `time$ |
19h | time32 (milli) | |
mixed list of 4h | binary | |
mixed list of 10h | utf8 | |
The inference works only for trivial kdb+ lists containing simple datatypes. Only mixed lists of char arrays or byte arrays are supported, mapped to the Arrow utf8 and binary datatypes respectively. Other mixed-list structures (e.g. those used by the nested Arrow datatypes) cannot be interpreted; if required, create the datatype manually using the datatype constructors.
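For example, inference maps a simple 7h list to int64 and a mixed list of char arrays to utf8 (a sketch; printed output is indicative):

```q
q)dt:.arrowkdb.dt.inferDatatype[1 2 3j]   / 7h list infers int64
q).arrowkdb.dt.printDatatype[dt]
int64
q).arrowkdb.dt.printDatatype .arrowkdb.dt.inferDatatype ("hello";"world")   / mixed list of 10h
string
```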
Parquet datatype limitations¶
The Parquet file format is less fully featured than Arrow; consequently the Arrow/Parquet file writer currently does not support certain datatypes, or represents them using a different datatype:
Arrow datatype | status as of apache-arrow-2.0.0 |
---|---|
float16 | unsupported |
month_interval | unsupported |
day_time_interval | unsupported |
duration | unsupported |
large_utf8 | unsupported |
large_binary | unsupported |
sparse_union | unsupported |
dense_union | unsupported |
date64 | mapped to date32 (days) |
fixed_size_list | mapped to list |
dictionary | categorical representation stored |
uint32 | Parquet v2.0 only, otherwise mapped to int64 |
timestamp (nano) | Parquet v2.0 only, otherwise mapped to timestamp (milli) |
Arrow fields¶
An Arrow field describes a column in the table and is composed of a datatype and a string field name.
Similar to the C++ Arrow library and PyArrow, arrowkdb exposes the Arrow field constructor to q. The field constructor takes the field name and its datatype identifier and returns an integer field identifier which can then be passed to other functions, e.g. when creating a schema.
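For example (a sketch; the printed field description comes from Arrow, which displays float64 as double):

```q
q)price_fd:.arrowkdb.fd.field[`price;.arrowkdb.dt.float64[]]   / field name + datatype identifier
q).arrowkdb.fd.printField[price_fd]
price: double
```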
Arrow schemas¶
An Arrow schema is built up from a list of fields and is used when working with table data. The datatype of each field in the schema determines the array data layout for that column in the table.
Similar to the C++ Arrow library and PyArrow, arrowkdb exposes the Arrow schema constructor to q. The schema constructor takes a list of field identifiers and returns an integer schema identifier which can then be passed to other functions, e.g. when writing Arrow or Parquet files.
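For example, a two-field schema can be built up as follows (a sketch; printed output is indicative):

```q
q)sym_fd:.arrowkdb.fd.field[`sym;.arrowkdb.dt.utf8[]]
q)price_fd:.arrowkdb.fd.field[`price;.arrowkdb.dt.float64[]]
q)schema:.arrowkdb.sc.schema[(sym_fd;price_fd)]
q).arrowkdb.sc.printSchema[schema]
sym: string
price: double
```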
If you are less familiar with Arrow or do not wish to use the more complex or nested Arrow datatypes, arrowkdb can infer the schema from a kdb+ table. Each column in the table is mapped to a field in the schema. The column’s name is used as the field name and the field’s datatype is inferred from the column’s kdb+ type.
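For example, the same schema as above can be inferred directly from a kdb+ table (a sketch; printed output is indicative):

```q
q)table:([]sym:("msft";"ibm");price:12.34 56.78)     / mixed list of 10h infers utf8, 9h infers float64
q)schema:.arrowkdb.sc.inferSchema[table]
q).arrowkdb.sc.printSchema[schema]
sym: string
price: double
```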
Arrow tables¶
An Arrow table is composed of a schema and a mixed list of Arrow array data kdb+ objects:
- The array data for each column in the table is then populated using a builder object specific to the field’s datatype
- Similarly, datatype-specific reader objects are used to interpret and inspect the array data for each column in the table
The mixed list of Arrow array data kdb+ objects should be ordered by schema field number. Each kdb+ object representing one of the arrays must be structured according to the field's datatype. This required array data structure is detailed above for each of the datatypes.
Alternatively, separate APIs are provided where the Arrow table is created from a kdb+ table using an inferred schema.
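For example, the inferred-schema convenience APIs allow a Parquet round trip directly from a kdb+ table (a sketch; `::` passes default options):

```q
q)table:([]col1:1 2 3j;col2:1.1 2.2 3.3)
q).arrowkdb.pq.writeParquetFromTable["table.parquet";table;::]   / schema inferred from the table
q)new_table:.arrowkdb.pq.readParquetToTable["table.parquet";::]  / read back as a kdb+ table
```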