Decoders

Decode an external data format into Stream Processor

Decoding allows data to be converted into a format that can be processed directly within the Stream Processor. Decoders need to be used when ingesting data from an external data format before performing other transformations.

Interfaces

A Python interface is included alongside the q interface and can be used if PyKX is enabled. See the Python API for equivalent APIs.

A drag and drop UI is included with kdb Insights Enterprise for building pipelines. See the UI documentation for equivalent UI functions.

.qsp.decode.arrow

(Beta Feature) Decodes Arrow streams

Beta Features

To enable beta features, set the environment variable KXI_SP_BETA_FEATURES to true.

.qsp.decode.arrow[]
.qsp.decode.arrow[.qsp.use (!) . flip enlist (`asList; asList)]

options:

name type description default
asList bool If true, the decoded result is a list of arrays, corresponding only to the Arrow stream data. Otherwise, by default the decoded result is a table corresponding to both the schema and data in the Arrow stream. 0b

For all common arguments, refer to configuring operators

This operator decodes an Arrow stream into either a kdb+ table or list of arrays.
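
The `.qsp.use (!) . flip ...` idiom in the signatures on this page builds an ordinary q dictionary of option names to values. The sketch below uses plain q only (no pipeline is started) to show what that expression evaluates to. The option names are taken from the tables on this page; the values are illustrative.

```q
/ One option: flip turns the (name; value) pair into (names; values),
/ then (!) . applies the dictionary constructor to those two lists
opts1: (!) . flip enlist (`asList; 1b)
opts1 ~ enlist[`asList]!enlist 1b        / 1b

/ Several options: one (name; value) pair per row
opts2: (!) . flip ((`header; `none); (`encoding; `ASCII))
opts2 ~ `header`encoding!`none`ASCII     / 1b
```

The `enlist` is only needed for a single pair, where `flip` would otherwise have no list of rows to transpose.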

Decode an Arrow stream into a kdb+ table:

table:([]c1: til 5; c2: 10.1 10.2 10.3 10.4 10.5; c3: (enlist "a";"bbb";"ccc";"ddd";enlist "e"))
serialized: .arrowkdb.ipc.serializeArrowFromTable[table;::]

/ -- Reading to a kdb+ table
input: serialized

.qsp.run
    .qsp.read.fromCallback[`publish]
    .qsp.decode.arrow[]
    .qsp.write.toVariable[`output]

publish input
output
c1 c2   c3
-------------
0  10.1 ,"a"
1  10.2 "bbb"
2  10.3 "ccc"
3  10.4 "ddd"
4  10.5 ,"e"

Decode an Arrow stream into a list of arrays:

table:([]c1: til 5; c2: 10.1 10.2 10.3 10.4 10.5; c3: (enlist "a";"bbb";"ccc";"ddd";enlist "e"))
serialized: .arrowkdb.ipc.serializeArrowFromTable[table;::]

/ -- Reading to a list of arrays
input: serialized

.qsp.run
    .qsp.read.fromCallback[`publish]
    .qsp.decode.arrow[.qsp.use (!) . flip enlist (`asList; 1b)]
    .qsp.write.toVariable[`output]

publish input
output
0    1     2     3     4
10.1 10.2  10.3  10.4  10.5
,"a" "bbb" "ccc" "ddd" ,"e"
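
Judging by the two outputs above, the `asList` result has the same shape as the column values of the decoded table, with the column names dropped. The following plain q sketch shows that relationship and how names can be re-attached; it assumes the column names are known from elsewhere, since the list of arrays does not carry them.

```q
table: ([] c1: til 5; c2: 10.1 10.2 10.3 10.4 10.5; c3: (enlist "a";"bbb";"ccc";"ddd";enlist "e"))

/ Column values only, in column order: the shape shown for asList above
arrays: value flip table

/ Re-attach column names to turn the list of arrays back into a table
rebuilt: flip `c1`c2`c3!arrays
rebuilt ~ table    / 1b
```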

.qsp.decode.csv

Parse CSV data to a table

.qsp.decode.csv[schema]
.qsp.decode.csv[schema; delimiter]
.qsp.decode.csv[schema; delimiter; .qsp.use (!) . flip (
    (`header     ; header);
    (`exclude    ; exclude);
    (`schemaType ; schemaType);
    (`encoding   ; encoding);
    (`newlines   ; newlines))]

Parameters:

name type description default
schema table, dict or string A table with the desired output schema, a dictionary mapping column names to type characters, a string of type characters, or "" to treat all columns as strings. ""
delimiter character Field separator for the records in the encoded data. ","

options:

name type description default
header symbol Whether encoded data starts with a header row, either none, first or always. first
exclude symbol[] or int[] A list of columns to exclude from the output. ()
schemaType symbol How to interpret the provided schema object. By default the schema is treated as the desired literal output; alternatively, this can be set to schema. When using schema, the provided schema object must be a table of the form ([] name:`$(); datatype:`short$()). literal
encoding symbol How the data is expected to be encoded when being consumed. Currently supported options are ASCII and UTF8. UTF8
newlines boolean Indicates whether line-returns may be embedded in strings 0b

For all common arguments, refer to configuring operators

This operator decodes CSV data from strings or bytes with delimiter-separated values into tables in the given schema.
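
For readers familiar with q's built-in text parser, the roles of the type characters, the delimiter and the header row can be illustrated with `0:`. This is an analogy in plain q, not a statement about how the operator is implemented, and the sample records are made up for illustration.

```q
/ Three pipe-delimited records, the first being a header row
rows: ("sym|qty"; "abc|1"; "def|2")

/ enlist on the delimiter tells 0: to read column names from the first row
withHeader: ("SJ"; enlist "|") 0: rows
withHeader ~ ([] sym: `abc`def; qty: 1 2)      / 1b

/ A bare delimiter treats every row as data and returns a list of columns
noHeader: ("SJ"; "|") 0: 1 _ rows
noHeader ~ (`abc`def; 1 2)                     / 1b
```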

Byte Order Marks

When dealing with non-ASCII encoding schemes, the CSV decoding logic will check for and remove byte order mark prefixes on the incoming data. Depending on how string data is presented, the byte order mark may or may not be visible. This can lead to mysterious errors that are hard to track down, so be sure to use the UTF8 encoding option when processing data prefixed with a byte order mark.

q)bom: "c"$0xEFBBBF;
q)ascii: "Hello, World!";
q)utf8: bom,"Hello, World!";
q)`$ascii
`Hello, World!
q)`$utf8
`Hello, World!
q)(`$ascii) ~ `$utf8
0b
q)ascii
"Hello, World!"
q)utf8
"\357\273\277Hello, World!"

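If a byte order mark has to be handled outside the decoder, for example when comparing strings before they reach the pipeline, it can be dropped by hand. The sketch below is a minimal plain q helper; the name `stripBom` is hypothetical and it only handles the UTF-8 mark on char vectors.

```q
bom: "c"$0xEFBBBF

/ Drop a leading UTF-8 byte order mark from a string, if one is present
stripBom: {$[bom ~ 3 # x; 3 _ x; x]}

stripBom[bom, "Hello, World!"] ~ "Hello, World!"   / 1b
stripBom["Hello, World!"] ~ "Hello, World!"        / 1b
```
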
Accepted type formats

Certain target types have strict requirements around the string format in order to be parsed correctly. Please see the Apply Schema documentation for details.
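
The schema forms used in the examples on this page can be derived from one another. The sketch below starts from an empty table and builds the dictionary form and the name/datatype table used with schemaType schema. It is plain q; the variable names are illustrative, and the use of negative (atom) type numbers follows the schemaType example on this page rather than a documented rule.

```q
/ Table form: an empty table with the desired column types
schema: ([] date:`date$(); price:`int$(); item:`symbol$(); description:(); quantity:`long$())

/ Dictionary form: column names mapped to type characters.
/ meta reports untyped (general list) columns as " ", so map those to "*"
chars: exec t from meta schema
dictSchema: cols[schema]!?[chars = " "; "*"; chars]
dictSchema ~ `date`price`item`description`quantity!"dis*j"    / 1b

/ Name/datatype table: negating the list type numbers gives atom types
typeSchema: ([] name: cols schema; datatype: neg type each value flip schema)
typeSchema[`datatype] ~ -14 -6 -11 0 -7h                      / 1b
```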

Newlines Parameter

Handling embedded newlines carries a performance penalty, even when the input data contains no newlines inside strings, so the newlines option is disabled by default.

Decode from a CSV file:

// Generate a random table of data and store it in an inventory file.
n: 10000
t: ([]
  date: n?.z.d + neg til 10;
  price: (n?1000)div 100;
  item: n?`3;
  description: {rand[100]?.Q.an}each til n;
  quantity: n?10000)

`:/tmp/inventory.csv 0: csv 0: t

// Read and parse the data from a file
schema: ([] date: `date$(); price:`int$();
  item:`symbol$(); description: (); quantity:`long$());

.qsp.run
  .qsp.read.fromFile["/tmp/inventory.csv"]
  .qsp.decode.csv[schema]
  .qsp.write.toConsole[]
                             | date       price item description                      quantity
-----------------------------| ---------------------------------------------------------------
2021.07.16D19:45:03.929480200| 2021.07.16 7     ehm  "3qupWqmNh6y8TeTzJlW49NlRzv0_0"  2659
2021.07.16D19:45:03.929480200| 2021.07.14 2     iif  "_eB_lq"                         8257
2021.07.16D19:45:03.929480200| 2021.07.12 7     eod  "GhUgGe3PH9Ie2NOw"               3907
2021.07.16D19:45:03.929480200| 2021.07.11 0     goj  "Dvmemf3H2P"                     6100
2021.07.16D19:45:03.929480200| 2021.07.09 1     bpm  "GbSjldDmUprmfiBa0UI8I"          367
..

Decode from a CSV file using a dictionary schema:

// Generate a random table of data and store it in an inventory file.
n: 10000
t: ([]
  date: n?.z.d + neg til 10;
  price: (n?1000)div 100;
  item: n?`3;
  description: {rand[100]?.Q.an}each til n;
  quantity: n?10000)

`:/tmp/inventory.csv 0: csv 0: t

// Read and parse the data from a file
schema: `date`price`item`description`quantity!"dis*j";

.qsp.run
  .qsp.read.fromFile["/tmp/inventory.csv"]
  .qsp.decode.csv[schema]
  .qsp.write.toConsole[]
                             | date       price item description                      quantity
-----------------------------| ---------------------------------------------------------------
2021.07.16D19:45:03.929480200| 2021.07.16 7     ehm  "3qupWqmNh6y8TeTzJlW49NlRzv0_0"  2659
2021.07.16D19:45:03.929480200| 2021.07.14 2     iif  "_eB_lq"                         8257
2021.07.16D19:45:03.929480200| 2021.07.12 7     eod  "GhUgGe3PH9Ie2NOw"               3907
2021.07.16D19:45:03.929480200| 2021.07.11 0     goj  "Dvmemf3H2P"                     6100
2021.07.16D19:45:03.929480200| 2021.07.09 1     bpm  "GbSjldDmUprmfiBa0UI8I"          367
..

Decode from a CSV file using schemaType schema:

// Generate a random table of data and store it in an inventory file.
n: 10000
t: ([]
  date: n?.z.d + neg til 10;
  price: (n?1000)div 100;
  item: n?`3;
  description: {rand[100]?.Q.an}each til n;
  quantity: n?10000)

`:/tmp/inventory.csv 0: csv 0: t

// Read and parse the data from a file
schema: ([] name:`date`price`item`description`quantity; datatype:-14 -7 -11 0 -7h);

.qsp.run
  .qsp.read.fromFile["/tmp/inventory.csv"]
  .qsp.decode.csv[schema;.qsp.use``schemaType!(::;`schema)]
  .qsp.write.toConsole[]
                             | date       price item description                      quantity
-----------------------------| ---------------------------------------------------------------
2021.07.16D19:45:03.929480200| 2021.07.16 7     ehm  "3qupWqmNh6y8TeTzJlW49NlRzv0_0"  2659
2021.07.16D19:45:03.929480200| 2021.07.14 2     iif  "_eB_lq"                         8257
2021.07.16D19:45:03.929480200| 2021.07.12 7     eod  "GhUgGe3PH9Ie2NOw"               3907
2021.07.16D19:45:03.929480200| 2021.07.11 0     goj  "Dvmemf3H2P"                     6100
2021.07.16D19:45:03.929480200| 2021.07.09 1     bpm  "GbSjldDmUprmfiBa0UI8I"          367
..

.qsp.decode.gzip

(Beta Feature) Inflates (decompresses) gzipped data

Beta Features

To enable beta features, set the environment variable KXI_SP_BETA_FEATURES to true.

Inflates a gzipped stream of bytes into an uncompressed stream of bytes. This decoder will inflate as much data as is available in the inbound stream and buffer any trailing data until the next byte buffer is received. Once data has been inflated, it is passed to the next node in the pipeline.

Fault Tolerance

The gzip decoder is marked as beta because it does not yet support fault-tolerant replay. If a pipeline fails and is forced to replay data, the gzip decoder will fail with an incomplete byte stream. Fault tolerance support is planned for a future release.

.qsp.decode.gzip[]

For all common arguments, refer to configuring operators

This operator inflates a gzipped byte or char stream while preserving the shape and type of the incoming data.

Decode a stream of gzipped CSV data:

Permissions

When running this example inside a kxi-sp-worker image, you must first run system "cd /tmp" to avoid encountering permissions errors.

`:table.csv 0: csv 0: ([] date: .z.d; sym: 100?3?`3; price: 100?1000f; quantity: 100?100)
system "gzip table.csv"

.qsp.run
    .qsp.read.fromFile[`:table.csv.gz]
    .qsp.decode.gzip[]
    .qsp.decode.csv["DSFJ"]
    .qsp.write.toVariable[`output]

output
date       sym price    quantity
--------------------------------
2022.10.26 aci 106.9924 90
2022.10.26 lcn 422.2224 73
2022.10.26 lcn 767.486  90
2022.10.26 bdp 885.1612 43
2022.10.26 bdp 435.8676 90
2022.10.26 lcn 77.88199 84
..

.qsp.decode.json

Parse JSON data

.qsp.decode.json[]
.qsp.decode.json[.qsp.use enlist[`decodeEach]!enlist decodeEach]

options:

name type description default
decodeEach boolean By default messages passed to the decoder are treated as a single JSON object. Setting decodeEach to true indicates that parsing should be done on each value of a message. This is useful when decoding data that has objects separated by newlines. This allows the pipeline to process partial sets of the JSON file without requiring the entire block to be in memory. 0b

For all common arguments, refer to configuring operators

This operator parses JSON strings to native q types, usually either a dictionary or a table.
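
The difference between parsing a message as one JSON document and parsing each value separately can be seen with q's built-in `.j.k`. This is a plain q sketch of the idea, not a description of the operator's internals. It also shows why dates appear as strings and numbers as floats in the outputs on this page.

```q
/ A single JSON object parses to a dictionary; JSON numbers become floats
.j.k "{\"date\":\"2021-10-05\",\"price\":22.5}"

/ Parsing each string separately yields one dictionary per message.
/ q treats a list of dictionaries with identical keys as a table
msgs: .j.j each ([] date: 2021.10.05 2021.10.06; price: 22.5 51.25)
decoded: .j.k each msgs
98h ~ type decoded                                  / 1b: a table
decoded[`date] ~ ("2021-10-05"; "2021-10-06")       / 1b: dates stay as strings
```

To recover typed columns, cast after decoding, for example `"D"$decoded`date`.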

Decode JSON from a file:

// Generate a random table of data and write it as JSON data
n: 10000;
t: ([]
  date: n?.z.d + neg til 10;
  price: (n?1000)div 100;
  item: n?`3;
  description: {rand[100]?.Q.an}each til n;
  quantity: n?10000);

`:/tmp/inventory.json 0: enlist .j.j t;

.qsp.run
  .qsp.read.fromFile["/tmp/inventory.json"]
  .qsp.decode.json[]
  .qsp.write.toConsole[];
                             | date         price item  description                                        quantity
-----------------------------| ------------------------------------------------------------------------------------
2021.10.05D19:40:04.536274000| "2021-10-01" 8     "eke" "PlND7JnZejE5j8aKJxSmqLTJycOsxkgTgqz2dB6mH3Q"      5963
2021.10.05D19:40:04.536274000| "2021-10-05" 0     "ldc" "ngctTMTD5PkkTSTOZ_3pwgy2vISuvnJYy"                3057
2021.10.05D19:40:04.536274000| "2021-10-05" 7     "ikb" "nFBU7"                                            8986
2021.10.05D19:40:04.536274000| "2021-09-28" 9     "lhp" "JH9NSxL7UNBGRZ49MYDX9qu_BUYmZoGu11G_GSV"          9488
2021.10.05D19:40:04.536274000| "2021-10-05" 3     "eoi" "E0hp_zZUBAfKERSPvdz_UZnKX07iBe2sd9TgH4mJmFtsLyap" 1301
..

Decode a stream of JSON:

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.decode.json[.qsp.use``decodeEach!11b]
  .qsp.write.toConsole[];

publish .j.j each ([] date: .z.d + til 10; price: 10?100f)
                             | date         price
-----------------------------| ---------------------
2021.10.05D19:47:01.576948900| "2021-10-05" 22.56381
2021.10.05D19:47:01.576948900| "2021-10-06" 51.2789
2021.10.05D19:47:01.576948900| "2021-10-07" 34.48978
2021.10.05D19:47:01.576948900| "2021-10-08" 69.06853
2021.10.05D19:47:01.576948900| "2021-10-09" 71.53166
..

.qsp.decode.pcap

Decodes pcap files

    .qsp.decode.pcap[columns]

options:

name type description default
columns symbol[] The columns to include. Omit to include all columns. ()

The available columns are: ack, dstip, dstmac, dstport, flags, id, ip_checksum, ip_length, offset, payload, proto, proto_checksum, seq, srcmac, srcport, timestamp, tos, ttl, udp_length, urgptr, windowsize.

If no columns are supplied, all columns are decoded.

For all common arguments, refer to configuring operators

Limitations: The pcap decoder does not yet support IPv6; any IPv6 packets in the file are skipped. Chunking also does not appear to be possible with the current decoder: every pcap file begins with a mandatory file header, so a file cannot be parsed by simply partitioning it. To split pcap files, see the guides at https://www.wireshark.org/.

This operator decodes a pcap file into a kdb+ table.

Decode a pcap file into a kdb+ table:

// Reading to a kdb+ table

.qsp.run
    .qsp.read.fromFile["test/data/decode/test_data/20180127_IEXTP1_DEEP1.0.pcap"]
    .qsp.decode.pcap[`timestamp`tos`ip_length`id`offset`proto`ip_checksum]
    .qsp.write.toVariable[`output]

1#output
timestamp                     tos ip_length id   offset proto ip_checksum
--------------------------------------------------------------------------
2023.05.08D20:17:50.720432000 00  200       9155 0      11    -10827

.qsp.decode.protobuf

Parse Protocol Buffer messages to a dictionary or list

.qsp.decode.protobuf[message]
.qsp.decode.protobuf[message; .qsp.use (!) . flip (
    (`file         ; file);
    (`format       ; format);
    (`asList       ; asList))]

Parameters:

name type description default
message string or symbol The name of the Protocol Buffer message type to decode. Required

options:

name type description default
file symbol The path to a .proto file containing the message type definition. Either format or file must be provided. `
format string A string definition of the Protocol Buffer message format to decode. ""
asList boolean By default, the output is a dictionary that includes the field names. If this option is set to true, only the list of values is output. 0b

For all common arguments, refer to configuring operators

This operator decodes protobuf-encoded messages of the chosen message type, given a protobuf schema containing that message type. The protobuf schema can be provided either as a file using the file option, or as a string using the format option. Decoded messages are output as either a dictionary or a list, depending on the value of the asList option.

Import paths

To import your .proto file, the folder containing the .proto file is added as an import path. This means the folder will be scanned when importing future .proto files, so it is important that you avoid having .proto files with the same filename present in import paths you use.

Decode protobuf messages using a Person.proto file:

// Person.proto
syntax="proto3";
message Person {
  string name = 1;
  int32 id = 2;
  string email = 3;
}

.qsp.run
    .qsp.read.fromCallback[`publish]
    .qsp.decode.protobuf[`Person; .qsp.use enlist[`file]!enlist `$"Person.proto"]
    .qsp.write.toConsole[];

// The bytes listed below are an example encoded Protocol Buffer payload
publish 0x0a046e616d6510651a0f656d61696c40656d61696c2e636f6d;
2021.11.09D19:05:29.933331149 | name | "name"
2021.11.09D19:05:29.933331149 | id   | 101i
2021.11.09D19:05:29.933331149 | email| "email@email.com"
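
To see how the example payload maps onto the Person message, the bytes can be walked by hand in plain q. The sketch below relies on the standard Protocol Buffers wire format, where each field starts with a tag byte holding the field number (tag div 8) and wire type (tag mod 8). It assumes single-byte tags, lengths and varints, which holds for this payload but not in general.

```q
payload: 0x0a046e616d6510651a0f656d61696c40656d61696c2e636f6d

/ First field: tag 0x0a is field 1, wire type 2 (length-delimited)
tag: "i"$payload 0
(tag div 8; tag mod 8) ~ 1 2i                         / 1b
name: "c"$payload 2 + til "i"$payload 1               / "name"

/ Second field: tag 0x10 is field 2, wire type 0 (varint); value 0x65
id: "i"$payload 7                                     / 101i

/ Third field: tag 0x1a is field 3, wire type 2, followed by a length byte
email: "c"$payload 10 + til "i"$payload 9             / "email@email.com"
```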

Decode protobuf messages using format into lists:

format: "syntax=\"proto3\";
 message Person {
    string name = 1;
    int32 id = 2;
    string email = 3;
 }";

.qsp.run
    .qsp.read.fromCallback[`publish]
    .qsp.decode.protobuf[.qsp.use `message`format`asList!(`Person;format;1b)]
    .qsp.write.toConsole[];

// The bytes listed below are an example encoded Protocol Buffer payload
publish 0x0a046e616d6510651a0f656d61696c40656d61696c2e636f6d;
2021.11.09D19:11:40.422536637 | "name"
2021.11.09D19:11:40.422536637 | 101i
2021.11.09D19:11:40.422536637 | "email@email.com"