Object storage
Authenticate with cloud credentials via Kurl to get native access to cloud object storage.
Example: a single bucket on each of the clouds, using the prefixes:
aws `:s3://
gcp `:gs://
azure `:ms://
Start kdb+ with the object store library loaded. AWS_REGION must be set to us-east-2 for this example bucket.
KDB+ cloud 4.0 2021.01.15 Copyright (C) 1993-2020 Kx Systems
l64/ 8(16)core 32005MB ...
q)key`:s3://
`s#`kxinsights-marketplace-data
q)key`:gs://
`s#`kxinsights-marketplace-data
q)key`:ms://
`s#`kxinsightsmarketplacedata
Delve deeper into the buckets
q)key`:gs://kxinsights-marketplace-data/
`s#`db`sym
q)key`:gs://kxinsights-marketplace-data/db
`s#`2018.09.04`2018.09.05`2018.09.06`2018.09.07`2018.09.10
q)key`:gs://kxinsights-marketplace-data/db/2018.09.04/trade/
`s#`.d`cond`ex`price`size`stop`sym`time
and get the contents of the file.
q)get`:gs://kxinsights-marketplace-data//db/2018.09.04/trade/.d
`time`sym`cond`ex`price`size`stop
Other read operations work as if the file were on block storage.
q)hcount `:gs://kxinsights-marketplace-data//db/2018.09.04/trade/sym
276955208
q)-21!`:gs://kxinsights-marketplace-data//db/2018.09.04/trade/sym
compressedLength | 579442
uncompressedLength| 276955208
algorithm | 2i
logicalBlockSize | 17i
zipLevel | 6i
q)read1`:gs://kxinsights-marketplace-data//db/2018.09.04/trade/.d
0xff010b000700000074696d650073796d00636f6e640065780070726963650073697a6500737..
To mount an HDB directly from object storage, we set up a partition file locally as
$ more db/par.txt
s3://kxinsights-marketplace-data//db
with the HDB sym file local too. Any number of paths can be specified in par.txt, mixing both cloud and block storage paths. There should be no trailing / after the object store path!
The local directory tree should look like
db/
├── par.txt
└── sym
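This layout can also be created from within q; a minimal sketch, assuming a POSIX shell is available for mkdir and that the sym file has already been copied down from the bucket:
q)system"mkdir -p db"                / create the local HDB root
q)`:db/par.txt 0: enlist "s3://kxinsights-marketplace-data//db"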
The sym file
The sym file in this db directory should be the enum domain, i.e. a symbol list. The partitions listed in par.txt should not contain this enum domain in their root.
The HDB can then be mounted via the usual
q)\l db
Since kdb+ changes the working directory to db, the following will not work
q)\l s3://kxinsights-marketplace-data/db
because the S3 storage is not presented as a POSIX filesystem. Hence the only way presently to load such a database is via par.txt.
Splayed tables resident on object storage can be mapped and queried directly:
q)t:get`:s3://kxinsights-marketplace-data/db/2018.09.04/trade/
q)select from t
time sym cond ex price size stop
-----------------------------------------
07:17:49.434 A T P 67.56 40 0
09:30:00.329 A I Z 67.23 1 0
09:30:00.329 A I Z 67.23 1 0
09:30:01.004 A O N 67.34 13674 0
...
or the shorter
q)select from `:s3://kxinsights-marketplace-data/db/2018.09.04/trade/
time sym cond ex price size stop
-----------------------------------------
07:17:49.434 A T P 67.56 40 0
09:30:00.329 A I Z 67.23 1 0
09:30:00.329 A I Z 67.23 1 0
09:30:01.004 A O N 67.34 13674 0
...
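Since each column of a splayed table is a separate object, restricting a query to the columns it needs reduces the volume of data retrieved. A sketch against the same table (the filter is illustrative):
q)select time,price from `:s3://kxinsights-marketplace-data/db/2018.09.04/trade/ where size>1000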
Metadata
All keys in a bucket are cached in memory, along with each object’s size. The first time a bucket is read from, all keys will be retrieved. To trigger a reload of this metadata, use a path of _ (indicating ‘drop’) after the bucket, e.g.
q)key`:s3://mybucketname/_
q)key`:ms://mybucketname/_
q)key`:gs://mybucketname/_
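A small helper can build the reload path from a bucket handle; the name flushbucket is ours, not part of the library:
q)flushbucket:{key `$string[x],"/_"}
q)flushbucket`:s3://mybucketname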
Read only
The objstor library allows read-only access to object storage. Objects should be created using the cloud vendor’s standard CLI tooling to copy data from block storage to the cloud, e.g.
aws s3 cp "/path/to/dir" s3://kxinsights-marketplace-data/ --recursive
azcopy cp "/path/to/file.txt" "https://[account].blob.core.windows.net/[container]/[path/to/blob]"
gsutil cp -r "/path/to/dir" gs://kxinsights-marketplace-data/
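From q itself, any attempt to write will fail; for example, a set such as the following should signal an error rather than create an object (exact error text omitted):
q)`:s3://kxinsights-marketplace-data/newfile set til 3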
Environment variables
Although the Kurl module can detect most of the necessary components for credentials from the environment, the following additional environment variables are required.
AWS_REGION - If a region is not selected, us-east-1 is used by default.
The URIs requested to the cloud are printed to STDERR if the following environment variable is set, e.g.
export KX_TRACE_OBJSTR=1
AZURE_STORAGE_ACCOUNT - The DNS prefix for your storage account; e.g. for mystorage.blob.core.windows.net the name would be mystorage. The list of your storage accounts can be displayed using the Azure CLI tool az:
az storage account list | jq -r '.[] | .name'
GCLOUD_PROJECT_ID - A unique, user-assigned ID that can be used as the request header x-goog-project-id in Google APIs. It specifies which project you are working on, and may be any valid project number or name; it tells Cloud Storage which project to create a bucket in or which project to list buckets for. Examples:
000111222333
my-project-name
example.com:my-google-apps-for-work-project-name
The list of your projects can be displayed using the Google CLI tool gcloud, via
gcloud projects list
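These variables can also be inspected, or set before first use, from within a q session; values below are illustrative:
q)getenv`AWS_REGION
"us-east-2"
q)`AZURE_STORAGE_ACCOUNT setenv "mystorage"     / equivalent to export in the shell
q)`GCLOUD_PROJECT_ID setenv "my-project-name"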
For other important environment variables, refer to the REST client (Kurl) documentation.
Performance
Cloud object storage is high-latency and low-bandwidth, and will perform significantly worse than block storage.
Cache
Due to the high latency of cloud storage, the kdbs3.so library offers to cache the results of requests on a local high-performance disk, the path to which should be specified in the environment variable KX_OBJSTR_CACHE_PATH, e.g.
export KX_OBJSTR_CACHE_PATH=/myfastssd/kxs3cache
kdb+ will then create the cached files under the subdirectory $KX_OBJSTR_CACHE_PATH/objects.
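Once set, the cache directory can be inspected from q like any local path, e.g.
q)key hsym`$getenv[`KX_OBJSTR_CACHE_PATH],"/objects"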
Shared cache
This cache area on disk is designed to be shared by multiple kdb+ processes, which only ever populate it. Eviction from this cache folder is managed by the kxreaper process.
Bear in mind that cloud vendors charge for object storage as a combination of volume stored, retrieval requests, and volume of egress. Using the built-in compression and the cache can help to reduce these costs.
Secondary threads
The way to achieve concurrency with these high-latency queries is with secondary threads, via the command-line option -s. It is expected that the larger the number of secondary threads, irrespective of CPU core count, the better the performance of object storage. Conversely, the performance of cached data appears better when the secondary-thread count matches the CPU core count. A balance must be found. We expect to improve the thread usage for these requests in future.
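The secondary-thread count can be inspected and adjusted at runtime, up to the limit given at startup, which helps when switching between cold object-store reads and cached data:
q)\s          / show current secondary-thread count
q)\s 4        / lower it, e.g. for cached data; cannot exceed the -s startup value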
The impact of threading is seen mainly in two use cases:
kdb+ 4.0 has an optimization whereby the columns used in a query are mapped in parallel at the start of a select, so the number of secondary threads should be at least the number of columns selected. Assuming peach is not already being used in an outer function, select performance improves because columns are memory-mapped in parallel.
q)// -s 8
q)\t select from quote where date=2018.09.07
1083
q)// -s 0
q)\t select from quote where date=2018.09.07
6594
Multithreaded primitives show improved performance when running against long vectors, triggering concurrent requests to the object store.
q)// -s 8
q)\t select max bid from quote where date=2018.09.07
12443
q)// -s 0
q)\t select max bid from quote where date=2018.09.07
81693
Compression
Due to the cost of storage, possible egress costs, high latency and low bandwidth, we recommend storing data on cloud object storage using compression.
HDB load times
Load times for an HDB process can be improved by adding an inventory file to the storage account. The file must be gzipped JSON, as an array of {Key:string,Size:int} objects. An example is shown below:
[
{
"Key": "2020.12.30/trade/size",
"Size": 563829
},
{
"Key": "2020.12.30/trade/stop",
"Size": 49731
},
{
"Key": "2020.12.30/trade/sym",
"Size": 69520
},
{
"Key": "2020.12.30/trade/time",
"Size": 1099583
}
]
The inventory file can be created and uploaded to the storage account using the following commands.
aws --output json s3api list-objects --bucket kxinsights-marketplace-data --prefix 'db/' --query 'Contents[].{Key: Key, Size: Size}' | gzip > aws.json.gz
aws s3 cp aws.json.gz s3://kxinsights-marketplace-data/_inventory/aws.json.gz
az storage blob list --account-name kxinsightsmarketplace --container-name data | jq '.[] | {Key: .name , Size: .properties.contentLength }' | jq -s '.' | gzip > azure.json.gz
az storage blob upload --account-name kxinsightsmarketplace \
--container-name data --name _inventory/azure.json.gz --file azure.json.gz
gsutil ls -lr gs://kxinsights-marketplace-data/db/*/*/* | awk '{printf "{ \"Key\": \"%s\" , \"Size\": %s }\n", $3, $1}' | head -n -1 | jq -s '.' | sed 's/gs:\/\/kxinsights-marketplace-data\/db\///g' | gzip > gcp.json.gz
gsutil cp gcp.json.gz gs://kxinsights-marketplace-data/_inventory/gcp.json.gz
The user can control which file is used as the inventory via an environment variable, e.g.
export KX_OBJSTR_INVENTORY_FILE=_inventory/all.json.gz
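The same can be set from within q before loading the HDB:
q)`KX_OBJSTR_INVENTORY_FILE setenv "_inventory/all.json.gz"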
Reading the inventory file bypasses the cache and, to avoid cache-invalidation issues, the inventory file itself cannot be read explicitly.
S3-compatible object storage
For this section we use MinIO as an example, although other S3-compatible stores should also work.
An explicit endpoint can be embedded into the S3 URI using the following pattern
`:s3://http[s]://hostname[:port]/bucket/path
e.g. using min.io playground
`:s3://https://play.min.io:9000/kxinsights-marketplace-data
N.B. remember to set the AWS Access Key ID and AWS Secret Access Key.
The implicit endpoint used for all relative S3 URIs, such as
`:s3://kxinsights-marketplace-data
defaults to
https://bucket-name.s3.Region.amazonaws.com/keyname
and can be overridden via the environment variable KX_S3_ENDPOINT, e.g.
export KX_S3_ENDPOINT=https://play.min.io:9000
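For example, a sketch pointing a session at the MinIO playground and listing a bucket (credentials must already be set in the environment):
q)`KX_S3_ENDPOINT setenv "https://play.min.io:9000"
q)key`:s3://kxinsights-marketplace-data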
S3 requests are sent by default as a virtual-hosted–style request, where the bucket name is part of the domain name in the URL. Amazon S3 virtual-hosted-style URLs use the following format:
https://bucket-name.s3.Region.amazonaws.com/key name
In this example, my-bucket is the bucket name, US West (Oregon) is the region, and puppy.png is the key name:
https://my-bucket.s3.us-west-2.amazonaws.com/puppy.png
Other S3-compatible storage vendors may not support virtual-hosted-style requests, or they may be onerous to configure server-side, so kdb+ also supports sending path-style requests, where URLs use the following format:
`:s3://https://s3.Region.amazonaws.com/bucket-name/key name
For example, if you create a bucket named mybucket
in the US West (Oregon) region, and you want to access the puppy.jpg
object in that bucket, you can use the following path-style URL:
`:s3://https://s3.us-west-2.amazonaws.com/mybucket/puppy.jpg
kdb+ will use path-style requests for S3 if the following environment variable is set
export KX_S3_USE_PATH_REQUEST_STYLE=1
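A sketch combining the flag with an explicit path-style URI (bucket and region are illustrative):
q)`KX_S3_USE_PATH_REQUEST_STYLE setenv "1"
q)key`:s3://https://s3.us-west-2.amazonaws.com/mybucket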