Object storage
This page explains how the object storage module works in kdb+.
The object storage module allows kdb+ to read files kept in cloud storage.
For example, a partitioned database can be stored in S3 and a qSQL statement can be run against it using the same commands as for a database stored locally.
Authentication
Authenticate with cloud credentials using Kurl in order to get native access to cloud object storage.
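Kurl typically picks up credentials from the environment. A minimal sketch, assuming standard AWS environment-variable credentials (placeholder values) exported before starting q:

```q
/ assumption: credentials are supplied via standard AWS environment variables,
/ exported before starting q (values are placeholders):
/   export AWS_ACCESS_KEY_ID=AKIA...
/   export AWS_SECRET_ACCESS_KEY=...
/   export AWS_REGION=eu-west-1
q)getenv`AWS_REGION          / confirm the region visible to the kdb+ process
"eu-west-1"
```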
File handle format
Access to files stored on cloud storage is differentiated by the following file prefixes:
- AWS: `:s3://
- GCP: `:gs://
- Azure: `:ms://
Objects can be accessed using the [prefix]/[bucket]/[key-name] format for file handles.
For example, running hcount on an object with key data/2025.02.02/tbl/b in bucket mybucket within S3 would be:

```q
hcount `:s3://mybucket/data/2025.02.02/tbl/b
```
Reading data
- Start kdb+ with the object store library loaded.
- Delve deeper into the buckets. For the AWS example, AWS_REGION must be set to eu-west-1:

```q
# set AWS_REGION=eu-west-1 before starting `q`
q)key`:s3://
`s#`kxs-prd-cxt-twg-roinsightsdemo/kxinsights-marketplace-data
q)key`$":s3://kxs-prd-cxt-twg-roinsightsdemo/kxinsights-marketplace-data/"
`s#`_inventory`db`par.txt`sym`symlinks
q)key`$":s3://kxs-prd-cxt-twg-roinsightsdemo/kxinsights-marketplace-data/db/"
`s#`2020.01.01`2020.01.02`2020.01.03`2020.01.06`2020.01.07...
q)key`$":s3://kxs-prd-cxt-twg-roinsightsdemo/kxinsights-marketplace-data/db/2020.01.01/trade/"
`s#`.d`cond`ex`price`size`stop`sym`time
q)get`$":s3://kxs-prd-cxt-twg-roinsightsdemo/kxinsights-marketplace-data/db/2020.01.01/trade/.d"
`sym`time`price`size`stop`cond`ex
```

```q
# set AZURE_STORAGE_ACCOUNT=kxinsightsmarketplace
# set AZURE_STORAGE_SHARED_KEY to value returned by below
# az storage account keys list --account-name kxinsightsmarketplace --resource-group kxinsightsmarketplace
q)key`:ms://
,`data
q)key`$":ms://data/"
`s#`_inventory`db`par.txt`sym
q)key`$":ms://data/db/"
`s#`2020.01.01`2020.01.02`2020.01.03`2020.01.06`2020.01.07...
q)key`$":ms://data/db/2020.01.01/trade/"
`s#`.d`cond`ex`price`size`stop`sym`time
q)get`$":ms://data/db/2020.01.01/trade/.d"
`sym`time`price`size`stop`cond`ex
```

```q
q)key`:gs://
`s#`kxinsights-marketplace-data
q)key`$":gs://kxinsights-marketplace-data/"
`s#`_inventory`db`par.txt`sym
q)key`$":gs://kxinsights-marketplace-data/db"
`s#`2020.01.01`2020.01.02`2020.01.03`2020.01.06`2020.01.07`2020.01.08`2020.01..
q)key`$":gs://kxinsights-marketplace-data/db/2020.01.01/trade/"
`s#`.d`cond`ex`price`size`stop`sym`time
q)get`$":gs://kxinsights-marketplace-data/db/2020.01.01/trade/.d"
`sym`time`price`size`stop`cond`ex
```
- Other read operations work as if the file were on block storage.
```q
q)hcount `$":s3://kxs-prd-cxt-twg-roinsightsdemo/kxinsights-marketplace-data/db/2020.01.01/trade/sym"
2955832
q)-21!`$":s3://kxs-prd-cxt-twg-roinsightsdemo/kxinsights-marketplace-data/db/2020.01.01/trade/sym"
compressedLength  | 69520
uncompressedLength| 2955832
algorithm         | 2i
logicalBlockSize  | 17i
zipLevel          | 6i
q)read1`$":s3://kxs-prd-cxt-twg-roinsightsdemo/kxinsights-marketplace-data/db/2020.01.01/trade/.d"
0xff010b000000000073796d0074696d650070726963650073697a650073746f7000636f6e640..
```
```q
q)hcount `$":ms://data/db/2020.01.01/trade/sym"
2955832
q)-21!`$":ms://data/db/2020.01.01/trade/sym"
compressedLength  | 69520
uncompressedLength| 2955832
algorithm         | 2i
logicalBlockSize  | 17i
zipLevel          | 6i
q)read1`$":ms://data/db/2020.01.01/trade/.d"
0xff010b000000000073796d0074696d650070726963650073697a650073746f7000636f6e640..
```
```q
q)hcount `$":gs://kxinsights-marketplace-data/db/2020.01.01/trade/sym"
2955832
q)-21!`$":gs://kxinsights-marketplace-data/db/2020.01.01/trade/sym"
compressedLength  | 69520
uncompressedLength| 2955832
algorithm         | 2i
logicalBlockSize  | 17i
zipLevel          | 6i
q)read1`$":gs://kxinsights-marketplace-data/db/2020.01.01/trade/.d"
0xff010b000000000073796d0074696d650070726963650073697a650073746f7000636f6e640..
```
Mounting an HDB directly
To mount an HDB directly from AWS object storage:
- Set up a partition file locally as follows:

```bash
$ cat db/par.txt
s3://kxs-prd-cxt-twg-roinsightsdemo/kxinsights-marketplace-data/db
```
- Any number of paths can be specified in par.txt, mixing both cloud and block storage paths. There should be no trailing / after the object store path.
- The HDB sym file should also be located in the db dir. The sym file in this database directory should be the enum domain, which is a list of symbols. The partitions listed in par.txt should not include this enum domain in their root.
- The local directory tree should look like this:

```
db/
├── par.txt
└── sym
```
- The HDB can now be mounted as follows:

```q
q)\l db
```
kdb+ changes the working directory to db. Therefore the following will not work:

```q
q)\l s3://kxs-prd-cxt-twg-roinsightsdemo/kxinsights-marketplace-data/db
```

because the S3 storage is not presented as a POSIX filesystem. Hence the only way presently to load a database is via par.txt.
- Splayed tables resident on object storage can be mapped and queried directly using:
- The standard query mechanism:
```q
q)t:get`$":s3://kxs-prd-cxt-twg-roinsightsdemo/kxinsights-marketplace-data/db/2020.01.01/trade/"
q)select from t
sym time         price    size stop cond ex
-------------------------------------------
0   09:32:37.334 18.54794 496  1    G    E
0   09:36:09.052 19.20235 360  1    D    H
0   09:38:12.338 13.56027 142  0    B    J
0   09:44:26.045 11.5101  671  0    O    C
0   09:44:32.357 15.55949 896  1    Y    J
...
```
- Or, the shorter version:
```q
q)select from `$":s3://kxs-prd-cxt-twg-roinsightsdemo/kxinsights-marketplace-data/db/2020.01.01/trade/"
sym time         price    size stop cond ex
-------------------------------------------
0   09:32:37.334 18.54794 496  1    G    E
0   09:36:09.052 19.20235 360  1    D    H
0   09:38:12.338 13.56027 142  0    B    J
0   09:44:26.045 11.5101  671  0    O    C
0   09:44:32.357 15.55949 896  1    Y    J
...
```
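Once the HDB is mounted via par.txt, partitioned tables can be queried like any local database. A brief illustrative query against the demo trade table (output omitted):

```q
/ illustrative only: a date-partitioned query against the mounted demo HDB
q)select count i by sym from trade where date=2020.01.01
```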
Metadata
All keys in a bucket are cached in memory, along with each object’s size.
When a bucket is accessed for the first time, all keys are retrieved.
To trigger a reload of this metadata, use a path of _ after the bucket, indicating ‘drop’. For example:
```q
q)key`:s3://mybucketname/_
q)key`:ms://mybucketname/_
q)key`:gs://mybucketname/_
```
Performance can be improved by using an inventory file.
Read only
The objstor library allows read-only access to object storage. Objects should be created using the cloud vendor’s standard CLI tooling to copy data from block storage to the cloud, e.g.
```bash
aws s3 cp "/path/to/file.txt" s3://kxs-prd-cxt-twg-roinsightsdemo/kxinsights-marketplace-data/ --recursive
```

```bash
azcopy cp "/path/to/file.txt" "https://[account].blob.core.windows.net/[container]/[path/to/blob]"
```

```bash
gsutil cp -r "/path/to/file.txt" gs://kxinsights-marketplace-data/
```
Environment variables
KX_TRACE_OBJSTR
- The URIs requested from the cloud are printed to STDERR if this environment variable is set, e.g. export KX_TRACE_OBJSTR=1

KX_OBJSTR_INVENTORY_FILE
- Location of the optional inventory file
Although the Kurl module can detect most of the necessary components for credentials from the environment, the following additional environment variables are required.
AWS_REGION
- If a region is not selected, us-east-1 is used by default.

KX_S3_USE_PATH_REQUEST_STYLE
- Alters the URL constructed for requests to S3-compatible storage. By default, virtual-host style is used: the bucket name is used to construct the hostname from which data is requested, in the format https://[bucket-name].s3.[region].amazonaws.com/[key-name]. Setting KX_S3_USE_PATH_REQUEST_STYLE=1 changes to path style, using the format https://s3.[region].amazonaws.com/[bucket-name]/[key-name], where all requests go to the same host. This is often used with S3-compatible object stores that are alternatives to Amazon S3 (for example, MinIO).

KX_S3_ENDPOINT
- Changes the endpoint URL used for requests to S3-compatible storage, for example KX_S3_ENDPOINT=https://play.min.io:9000. When used with KX_S3_USE_PATH_REQUEST_STYLE=1, requests use the format [KX_S3_ENDPOINT]/[bucket-name]/[key-name]. This is often used with S3-compatible object stores that are alternatives to Amazon S3 (for example, MinIO).

AZURE_STORAGE_ACCOUNT
- The DNS prefix for your storage account; e.g. for mystorage.blob.core.windows.net the name would be mystorage. The list of your storage accounts can be displayed using the Azure CLI tool az:

```bash
az storage account list | jq -r '.[] | .name'
```

GCLOUD_PROJECT_ID
- A unique, user-assigned ID that can be used as the request header x-goog-project-id in Google APIs. It specifies which project you are working on and may be any valid project number or name. This request header tells Cloud Storage which project to create a bucket in or which project to list buckets for. Examples:

```
000111222333
my-project-name
example.com:my-google-apps-for-work-project-name
```

The list of your projects can be displayed using the Google CLI tool, gcloud, via:

```bash
gcloud projects list
```
For other important environment variables, refer to the REST client documentation.
Performance
Cloud object storage is high-latency and low-bandwidth, and will perform significantly worse than block storage.
Caching
Due to the high latency of cloud storage, kdb Insights Core offers the ability to configure a cache that stores request results locally on high-performance disk.
Cloud vendors charge for object storage as a combination of volume stored, per retrieval request, and volume egress. Using the built-in compression and the cache can help to reduce these costs.
Secondary threads
The way to achieve concurrency with these high-latency queries is with secondary threads, via the command-line option -s. It is expected that the larger the number of secondary threads, irrespective of CPU core count, the better the performance of object storage. Conversely, the performance of cached data appears to be better if the secondary-thread count matches the CPU core count; a balance is to be found. We expect to improve the thread usage for these requests in future.
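A minimal sketch, assuming the db directory from the mounting example above: enable secondary threads at startup and confirm the setting from the q session.

```q
/ start the HDB with secondary threads (shell command shown as a comment):
/   q db -s 8
q)\s          / confirm the secondary-thread count
8i
```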
The impact of threading is seen mainly in two use cases:
Kdb+ V4.0 has an optimization whereby columns used in a query are mapped in parallel at the start of a select, so the number of secondary threads should be at least the number of columns selected. Assuming peach is not already being used in an outer function, select performance improves because the columns are mapped in parallel.
```q
q)// -s 8
q)\t select from quote where date=2018.09.07
1083
q)// -s 0
q)\t select from quote where date=2018.09.07
6594
```
Multithreaded primitives show improved performance when running against long vectors, triggering concurrent requests to the object store.
```q
q)// -s 8
q)\t select max bid from quote where date=2018.09.07
12443
q)// -s 0
q)\t select max bid from quote where date=2018.09.07
81693
```
Compression
Due to storage costs, possible egress costs, high latency, and low bandwidth, we recommend storing data on cloud object storage using compression.
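As a minimal sketch, assuming an in-memory table named trade and a local db directory that will later be copied to object storage with the vendor CLI, the data can be written compressed; the parameters shown match the -21! output above (2^17-byte logical blocks, gzip, level 6):

```q
/ assumption: `trade` is an in-memory table; db is a local directory to be uploaded later
q).z.zd:17 2 6                                    / default compression for subsequent writes
q)`:db/2020.01.01/trade/ set .Q.en[`:db;] trade   / splay the table; columns are written compressed
```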
Metadata load times
Metadata load times for an HDB process can be improved by adding an inventory file to the storage account. The file must be gzipped JSON, as an array of {Key:string,Size:int} objects. An example is shown below:
```json
[
{
"Key": "db/2020.12.30/trade/size",
"Size": 563829
},
{
"Key": "db/2020.12.30/trade/stop",
"Size": 49731
},
{
"Key": "db/2020.12.30/trade/sym",
"Size": 69520
},
{
"Key": "db/2020.12.30/trade/time",
"Size": 1099583
}
]
```
The inventory file can be created and uploaded to the storage account using the following commands.
```bash
aws --output json s3api list-objects --bucket kxs-prd-cxt-twg-roinsightsdemo/kxinsights-marketplace-data --prefix 'db/' --query 'Contents[].{Key: Key, Size: Size}' | gzip -9 -c > aws.json.gz
aws s3 cp aws.json.gz s3://kxs-prd-cxt-twg-roinsightsdemo/kxinsights-marketplace-data/_inventory/all.json.gz
```

```bash
az storage blob list --account-name kxinsightsmarketplace --container-name data | jq '.[] | {Key: .name , Size: .properties.contentLength }' | jq -s '.' | gzip -9 -c > azure.json.gz
az storage blob upload --account-name kxinsightsmarketplace \
  --container-name data --name _inventory/all.json.gz --file azure.json.gz
```

```bash
gsutil ls -lr gs://kxinsights-marketplace-data/db/*/*/* | awk '{printf "{ \"Key\": \"%s\" , \"Size\": %s }\n", $3, $1}' | head -n -1 | jq -s '.' | sed 's/gs:\/\/kxinsights-marketplace-data\/db\///g' | gzip -9 -c > gcp.json.gz
gsutil cp gcp.json.gz gs://kxinsights-marketplace-data/_inventory/all.json.gz
```
The user can control which file is used as the inventory via an environment variable:

```bash
export KX_OBJSTR_INVENTORY_FILE=_inventory/all.json.gz
```
Reading of the inventory file bypasses the cache and, to avoid cache-invalidation issues, the file itself is not explicitly readable.
Symbolic links
Symbolic links can be simulated through an optional "Path" field in an entry, which represents the realpath that the Key should resolve to. The Path must reside in the same bucket, and the Size field must represent that of the Path. For example:
```json
[
{
"Key": "db/2020.12.30/trade/size",
"Size": 563829,
"Path": "newdb/2020.12.30/trade/size"
},
{
"Key": "db/2020.12.30/trade/stop",
"Size": 49731,
"Path": "newdb/2020.12.30/trade/stop"
},
{
"Key": "db/2020.12.30/trade/sym",
"Size": 69520,
"Path": "newdb/2020.12.30/trade/sym"
},
{
"Key": "db/2020.12.30/trade/time",
"Size": 1099583,
"Path": "newdb/2020.12.30/trade/time"
}
]
```
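An illustrative read through such a simulated link, assuming a hypothetical bucket named mybucket that contains the inventory above; the db/... key is served from the corresponding newdb/... object and hcount reports the recorded Size:

```q
q)hcount`$":s3://mybucket/db/2020.12.30/trade/size"   / hypothetical bucket name
563829
```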
S3-compatible object storage
For this section, we use MinIO as an example, although other S3-compatible stores should also work.
An explicit endpoint can be embedded into the S3 URI using the following pattern:

```q
`:s3://http[s]://hostname[:port]/bucket/path
```

For example, using the MinIO playground:

```q
`:s3://https://play.min.io:9000/kxs-prd-cxt-twg-roinsightsdemo/kxinsights-marketplace-data
```
Alternatively, as the endpoint for S3 requests is https://play.min.io:9000 (instead of an Amazon host) and bucket names are not used to form the destination addresses (one destination host is used for all buckets), we can set the environment variables:
```bash
export KX_S3_ENDPOINT=https://play.min.io:9000
export KX_S3_USE_PATH_REQUEST_STYLE=1
```
which allows MinIO to be used as the default implicit endpoint for all relative S3 URIs, such as:

```q
`:s3://kxs-prd-cxt-twg-roinsightsdemo/kxinsights-marketplace-data
```
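With these variables exported before starting q, the same relative URI can be browsed against the MinIO endpoint (a brief illustrative check, output omitted):

```q
/ relative S3 URIs now resolve against the MinIO endpoint using path-style requests
q)key`$":s3://kxs-prd-cxt-twg-roinsightsdemo/kxinsights-marketplace-data/"
```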