Compressing Data on Disk in KDB-X

Learn how to save disk space and improve query performance by compressing KDB-X data on the fly. This guide provides practical steps for writing, reading, and configuring compressed files to suit your specific needs.

Overview

This guide explains the key aspects of data compression in KDB-X:

  • Prerequisites: Ensure your system has the necessary libraries for advanced compression algorithms.
  • Core operations: Learn the fundamental tasks of writing, reading, appending to, and inspecting compressed files.
  • Configuration: Set up system-wide default compression settings.
  • Choosing an algorithm: Understand the trade-offs between different compression algorithms to help you select the best one for your use case.
  • Advanced topics: Explore important technical details, performance tuning tips, and resource management considerations.

Prerequisites

To use certain compression algorithms, you must first install the corresponding external libraries. KDB-X dynamically links to these libraries when they are required. The library must match the architecture of your KDB-X process: 64-bit KDB-X requires 64-bit libraries, and 32-bit KDB-X requires 32-bit libraries.

If you need help installing these libraries, consult your system administrator.

Gzip (algorithm 2)

  • Linux: libz.so.1
  • macOS: libz.dylib (pre-installed)
  • Windows: zlibwapi.dll (available from WinImage)

Snappy (algorithm 3)

  • Linux: libsnappy.so.1
  • macOS: libsnappy.dylib (via Homebrew or MacPorts)
  • Windows: snappy.dll

LZ4 (algorithm 4)

  • Linux: liblz4.so.1
  • macOS: liblz4.dylib (via Homebrew or MacPorts)
  • Windows: liblz4.dll (build from source)

Zstd (algorithm 5)

  • Linux: libzstd.so.1
  • macOS: libzstd.1.dylib (via Homebrew or MacPorts)
  • Windows: libzstd.dll
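
If you're unsure whether a library is correctly installed, a simple check is to attempt a small write with that algorithm and trap the resulting error. The following is a minimal sketch; the file name is illustrative.

// Try a tiny zstd (algorithm 5) write; a missing libzstd raises an error
q).[set;((`:libcheck;17;5;1);til 10);{-1"zstd unavailable: ",x;`failed}]
`:libcheck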

Core operations

This section covers the fundamental tasks for working with compressed files.

Understand compression parameters

To compress a file, provide a list of three integer arguments: (logicalBlockSize; algorithm; compressionLevel).

  • logicalBlockSize: A power of 2 between 12 (4 KB) and 20 (1 MB). This argument defines the size of the data chunks that are compressed independently. Larger blocks can improve the compression ratio but may be slower to process. Choose a value that works across all platforms that access the files to avoid a disk compression - bad logicalBlockSize error.
  • algorithm: An integer that specifies the compression algorithm to use. For a complete list of supported algorithms, see the Available algorithms and libraries section.
  • compressionLevel: An algorithm-specific integer that controls the trade-off between compression ratio and speed. Higher levels typically result in smaller files but take more CPU time to compress.

Write a compressed file

Use the set function with the compression parameters on the left-hand side to write data to a compressed file. For a splayed table, you can specify compression settings for each column individually, as shown in the second example below.

The following example demonstrates writing an uncompressed file, reading it back, and writing it to a new file using gzip compression.

// 1. Create a sample uncompressed file
q)`:uncompressed_data set 1000#enlist asc 1000?10
`:uncompressed_data

// 2. Read the file and write it back with compression settings:
//    - logicalBlockSize: 17 (128 KB)
//    - algorithm: 2 (gzip)
//    - compressionLevel: 9 (max)
q)(`:compressed_data;17;2;9) set get `:uncompressed_data
`:compressed_data

// 3. Verify the contents are identical
q)get[`:uncompressed_data] ~ get`:compressed_data
1b

Performance tip

For better performance, place the source and target files on different physical disk drives. Compression involves reading from the source and writing to the target simultaneously, which can cause significant disk seek contention on a single drive.
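
As noted above, for a splayed table you can supply a dictionary that maps column names to compression parameter lists, with the empty-symbol key providing the default for unlisted columns (supported in recent KDB-X versions). The following is a minimal sketch; the directory, table, and settings are illustrative.

// A small splayed table with no symbol columns (no enumeration required)
q)t:([] ts:.z.p+til 1000; price:1000?100f; size:1000?1000)

// Default gzip for unlisted columns; zstd for price, lz4hc for size
q)(`:db/t/;``price`size!(17 2 6;17 5 1;17 4 5)) set t
`:db/t/

// Each column file carries its own settings
q)-21!`:db/t/price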

Read from a compressed file

Decompression in KDB-X is automatic. Any q operation that reads a file will also read a compressed file transparently, without special syntax. KDB-X only decompresses the data blocks needed for a query, and it caches the decompressed data in memory for the duration of that operation.

To create an uncompressed copy of a compressed file, read it with get and then write it to a new file without providing any compression parameters.

// KDB-X reads compressed files transparently
q)data: get `:compressed_data

// To create a decompressed copy, simply read and write without compression arguments
q)`:new_uncompressed_data set get `:compressed_data
`:new_uncompressed_data

Append to a compressed file

To append data to an existing compressed file (or a compressed splayed table), use the upsert keyword.

// 1. Create an initial compressed file
q)(`:zippedTest;17;2;6) set 100000?10
`:zippedTest

// 2. Append more data to the file
q)`:zippedTest upsert 100000?10
`:zippedTest

Appending with attributes

If you append to a compressed file that has an attribute (for example, a p# attribute on a symbol column), KDB-X will read and rewrite the entire file.

Check compression statistics

Use the internal function -21! to retrieve a dictionary of compression statistics for a compressed file. If the file is uncompressed, this function returns an empty dictionary. You can use hcount to get the uncompressed size of the file.

// Get compression statistics for the file
q)-21!`:zippedTest
compressedLength  | 148946
uncompressedLength| 1600016
algorithm         | 2i
logicalBlockSize  | 17i
zipLevel          | 6i

// Get the uncompressed size
q)hcount `:zippedTest
1600016
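
The compression ratio follows directly from the statistics. A minimal sketch, continuing with the file above:

// Ratio of compressed to uncompressed size
q)stats:-21!`:zippedTest
q)stats[`compressedLength]%stats[`uncompressedLength]
0.09309032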

Configuration

This section describes how to apply compression settings system-wide.

Enable compression by default

To enable compression by default for all files written with set that have no file extension, set the system variable .z.zd to the desired compression argument list.

// Set default compression: 128KB blocks, gzip algorithm, level 6
.z.zd: 17 2 6

To disable default compression, set .z.zd to a list of zeroes or expunge the variable. By default, .z.zd is undefined, and files are written uncompressed.

// Method 1: Set to zeroes to disable
.z.zd: 3#0

// Method 2: Expunge the variable to disable
\x .z.zd
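
To confirm the default is applied, write a file without explicit compression parameters and inspect it. A minimal sketch; the file name is illustrative.

q).z.zd:17 2 6
q)`:defaulted set til 100000
`:defaulted

// A non-empty statistics dictionary confirms the file is compressed
q)count -21!`:defaulted
5

q)\x .z.zd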

Apply compression selectively

You do not need to compress all of your data. Since q can read compressed and uncompressed files transparently, you can leave certain files uncompressed if they compress poorly or if compression would degrade their performance. This way, you apply compression only where it provides a clear benefit.
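
For example, you might compress a well-ordered numeric column while leaving high-entropy data, such as random GUIDs, uncompressed. A minimal sketch with illustrative file names:

q)(`:prices;17;2;6) set asc 1000000?100f   / sorted floats compress well
`:prices
q)`:ids set 1000000?0Ng                    / random GUIDs compress poorly; leave raw
`:ids

// Reads are transparent either way
q)(count get`:prices;count get`:ids)
1000000 1000000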

Choose an algorithm

To select the right compression algorithm, balance three factors: compression ratio, compression speed, and decompression speed.

Performance trade-offs

  • Compression ratio: The amount of size reduction. A higher ratio means smaller files, which reduces storage costs and can improve query times on slow storage systems. For example, with real NYSE trade data, gzip at level 9 compressed a file to 15% of its original size.
  • Compression speed: The rate at which data is compressed. Faster compression speed is crucial for high-volume data ingestion systems, as it reduces CPU load and write latency.
  • Decompression speed: The rate at which data is read from a compressed file. Faster decompression speed leads to faster query execution. A single thread on a modern CPU core can decompress data at approximately 300 MB/s, though this varies by algorithm and data.

No single algorithm is the best for all use cases. Choose an algorithm based on your primary goal, whether it is minimizing storage costs, maximizing ingestion rate, or achieving the fastest query performance.

Available algorithms and libraries

The algorithm argument accepts one of the following integer codes.

Code  Algorithm  Supported levels  Version introduced
0     none       0
1     q IPC      0
2     gzip       0-9
3     snappy     0                 V3.4
4     lz4hc      0-16              V3.6
5     zstd       -7 to 22          V4.1

LZ4hc level behavior

For the lz4hc algorithm, a level of 0 uses the default compression, and any level above 16 behaves the same as level 16.
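
You can verify this clamping by comparing compressed sizes at level 16 and above. A minimal sketch with illustrative file names:

q)d:asc 1000000?100
q)(`:lz16;17;4;16) set d
`:lz16
q)(`:lz20;17;4;20) set d
`:lz20

// Levels above 16 produce identical output
q)((-21!`:lz16)`compressedLength)=(-21!`:lz20)`compressedLength
1b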

Algorithm characteristics

  • q IPC (1): Balances performance and compression. A good default choice for typical financial time-series data.
  • Gzip (2): Offers a high compression ratio but has average compression and decompression speeds. If write speed is important, avoid high compression levels like 8 and 9. Level 5 is a good general-purpose setting.
  • Snappy (3): Provides excellent compression and decompression speed, making it a strong choice when query and ingestion performance are the top priorities. Its compression ratio is lower than that of the other algorithms.
  • LZ4 (4): Excels at decompression speed and offers an average compression ratio. The compression level significantly impacts compression speed; level 5 is a good choice for balancing query speed and storage costs. Avoid levels above 11.
  • Zstd (5): Provides an outstanding compression ratio, especially for low-entropy data. Use low compression levels (like 1) to optimize for write speed, and increase the level for a better ratio. Avoid levels above 14.
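
Because the right trade-off depends on your data, it helps to measure compression ratios empirically. The following sketch writes the same sample vector with each algorithm and reports the resulting ratios; the file names are illustrative, and codes 3, 4, and 5 assume the libraries from the Prerequisites section are installed.

// Compressed size as a fraction of uncompressed size, per algorithm
q)data:asc 1000000?1000
q)ratio:{f:hsym`$"bench_",string x 1;(f,x) set y;(%/)(-21!f)`compressedLength`uncompressedLength}
q)ps:(17 2 6;17 3 0;17 4 5;17 5 1)   / gzip, snappy, lz4hc, zstd
q)ps!ratio[;data] each ps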

Advanced topics

This section explains important technical limitations, performance tuning, and resource management considerations.

Important cautions

The following limitations and potential issues apply when working with compressed files.

Concurrency hazard

Don't read from or write to the same compressed file from multiple threads concurrently. However, you can safely access multiple different compressed files simultaneously, with each file handled by its own thread.

Don't compress log files

Don't use streaming compression for log files. Streaming compression keeps the last block of data in memory and only writes it to disk when the file handle is closed. If your system crashes, that final block and its metadata are never flushed, leaving the log file truncated and unusable for replay.

Incompatible lz4 versions

Certain versions of the lz4 library do not work correctly with KDB-X.

  • lz4-1.7.5 fails to compress.
  • lz4-1.8.0 can cause the KDB-X process to hang.

KDB-X requires at least lz4-r129. Version lz4-1.8.3 is known to work, but it is recommended to use the latest stable release.

Legacy issue: Appending to enum files

In KDB-X v3.0 (2012.05.17), appending to compressed enum files was blocked due to potential concurrency issues. Don't compress enum files if you use this version and need to append to them.

Nested data compression

When a column file containing nested data (for example, name) is compressed, its companion metadata file (name# or name##) is also automatically compressed. Don't attempt to compress the metadata file explicitly.
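
For example (a minimal sketch with an illustrative file name), writing a compressed nested list creates the companion file automatically, and it reports its own compression statistics:

q)(`:nested;17;2;6) set 1000#enlist"abcdef"
`:nested

// The companion file was compressed along with the column file
q)count -21!hsym`$"nested#"
5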

Use KDB-X compression, not external tools

Use the set function with compression arguments to create compressed files for KDB-X. Don't use external tools like the gzip command-line utility, as they produce a different file format that KDB-X cannot read with random access.

Performance tuning

To understand the impact of compression, test with your own data, hardware, and queries.

Benchmarking

When benchmarking, be aware that the disk cache can skew results. Flush the disk cache before each test run to ensure you measure actual disk I/O performance.

On Linux:

sync; echo 3 | sudo tee /proc/sys/vm/drop_caches

On macOS:

sudo purge

Logical block size

The logicalBlockSize parameter affects both compression ratio and performance. It determines the minimum amount of data that must be decompressed to retrieve even a single byte from a file. For example, with a logicalBlockSize of 128 KB, reading a single value from a column requires decompressing a 128 KB block. This overhead is generally acceptable for typical analytical queries that access contiguous chunks of data. Experiment to find the optimal block size for your workload.
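
To gauge the effect on your own data, write the same vector at two block sizes and compare the compressed sizes. A minimal sketch; the file names are illustrative.

q)d:asc 1000000?100
q)(`:blk12;12;2;6) set d   / 4 KB logical blocks
`:blk12
q)(`:blk20;20;2;6) set d   / 1 MB logical blocks
`:blk20

// Larger blocks typically yield a smaller compressedLength
q)((-21!`:blk12)`compressedLength;(-21!`:blk20)`compressedLength)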

Kernel settings

On Linux, consider tuning virtual memory settings like vm.dirty_background_ratio and vm.dirty_ratio for potential performance improvements. The ideal values depend on the total size and number of compressed files, as well as whether your access patterns are random or sequential.

Resource management

Concurrently open files

A KDB-X process can open as many files concurrently as the operating system allows. You may need to increase the open file descriptor limit (for example, using ulimit -n).

File descriptor limits

In KDB-X 3.2 and later, each compressed file uses two file descriptors, which may require a higher ulimit -n setting than in older versions. (Before version 3.1 (2013.02.21), a process could not open more than 4096 compressed files at once.)

Memory allocation

When reading a compressed list, KDB-X reserves enough virtual address space to hold the list's entire uncompressed content. This is necessary because, unlike memory-mapped uncompressed files, the decompressed data pages have no backing file on disk.

Note that reserving virtual memory does not immediately allocate physical memory; the operating system only needs to be able to swap the pages out if necessary, so ensure you have sufficient swap space configured. If you encounter wsfull errors, check for user-level virtual memory limits (for example, via ulimit -v).

Virtual memory reservation

On Linux, you can control the kernel's memory allocation behavior with the overcommit_memory and overcommit_ratio settings.

Summary

This guide explained how to:

  • Write, read, and append to compressed files on disk.
  • Configure default compression settings for your system.
  • Choose the best compression algorithm for your specific needs.
  • Check compression statistics to verify your settings.
  • Understand key performance considerations and limitations when working with compressed data.