
Ingesting data from object storage

Object storage is a cloud-native storage architecture designed for massive scalability, flexible data management, and easy access over HTTP. Organizations use it to store vast amounts of unstructured data for processing, archiving, and global distribution.

Popular managed services:

Cloud   Service                       URI prefix
AWS     Amazon S3                     s3://
GCP     Google Cloud Storage (GCS)    gs://
Azure   Azure Blob Storage            ms://
  • kdb Insights uses the ms:// scheme instead of the full HTTPS URL for Azure blobs.

The kdb Insights Stream Processor provides separate readers tailored to each of these cloud providers. The readers expose many options that control how data is read and can be configured to build a fully customized solution. There are currently two versions; Version 2 is recommended for almost every use case. This page covers the capabilities and differences of each version; for details on the object storage reader APIs themselves, see the API reference.

Version 1 of the readers uses HTTP requests to read files synchronously, directly into memory. It is slower than Version 2 but has no dependency on disk space.

Version 2 of the readers leverages cloud platform SDKs (e.g., the AWS SDK and the Google Cloud SDK) to handle downloads, offering faster transfers, multiple concurrent downloads, and improved overall performance. Compared to Version 1, this allows significantly higher throughput and more efficient ingestion.

Key Differences from V1

Cloud Platform SDKs

Version 2 leverages cloud platform SDKs, optimized for large-scale cloud data transfers, offering faster and more efficient downloads. Multiple downloads now occur in parallel, improving throughput and reducing ingestion time.

Note

Internal testing of Version 2 shows vast improvements across all readers and scenarios, with a baseline improvement of up to 80% in ingest time.

Environment

New environment variables have been introduced in Version 2 to configure download behavior and resource allocation; a combined example follows the list below.

  • KXI_SP_DOWNLOAD_BUFFER: The download algorithm reserves a buffer as a percentage of remaining disk space to ensure that the disk is never entirely filled. This buffer can be modified using the KXI_SP_DOWNLOAD_BUFFER environment variable; the default value is 0.05 (5%).

  • KXI_SP_DOWNLOAD_NUMBER: The maximum number of concurrent downloads is set to 2 by default. This can be modified using the KXI_SP_DOWNLOAD_NUMBER environment variable.

  • KXI_SP_DOWNLOAD_DIR: The directory where downloaded files are staged before being read into memory is controlled by the KXI_SP_DOWNLOAD_DIR environment variable. It defaults to $KXI_SP_CHECKPOINT_DIR/downloads; since KXI_SP_CHECKPOINT_DIR itself defaults to /sp/checkpoints, the default staging directory is /sp/checkpoints/downloads.
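
For example, the following sketch overrides all three settings in the worker's environment before it starts; the values shown are illustrative, not documented defaults.

export KXI_SP_DOWNLOAD_BUFFER=0.10             # reserve 10% of remaining disk instead of the default 5%
export KXI_SP_DOWNLOAD_NUMBER=4                # allow up to 4 concurrent downloads (default is 2)
export KXI_SP_DOWNLOAD_DIR=/tmp/sp-downloads   # stage files outside the checkpoint directory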

Disk Requirements

Version 2 introduces a two-stage read process with a configurable buffer. In contrast to Version 1, Version 2 downloads the full file to local disk. The available disk space directly influences the number of concurrent downloads. The system uses a custom algorithm to optimize this process, balancing disk space and concurrency.

To ensure smooth operation, sufficient disk space must be allocated to handle concurrent downloads and the temporary storage of files before they are loaded into memory.

Note

As a general rule, available disk space should be at least the size of the largest file plus buffer space. Allocating at least 2-3 times the size of the largest file is recommended to accommodate buffering and concurrent downloads. If the available space is exhausted, the reader raises an error.
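
As a hypothetical illustration of this rule (the numbers are not from the documentation), a worker ingesting objects of up to 10 GB on a 40 GB volume would hold back roughly 2 GB via the default 5% buffer and still comfortably satisfy the 2-3x guideline:

# illustrative sizing sketch; largest object is 10 GB, volume is 40 GB
free_gb=40
largest_gb=10
buffer_gb=$(echo "$free_gb * 0.05" | bc)                    # ~2 GB reserved by KXI_SP_DOWNLOAD_BUFFER
echo "usable for staged downloads: $(echo "$free_gb - $buffer_gb" | bc) GB"
echo "recommended minimum disk: $((largest_gb * 2))-$((largest_gb * 3)) GB"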

Free Space Algorithm & Buffer

The readers now preallocate a percentage of remaining disk space to be used as buffer space, ensuring that downloaded files never fill all remaining space on disk. They also incorporate a custom algorithm that manages the allocated space, intelligently downloading and cleaning up files to maximize throughput.

The custom download algorithm dynamically scans disk space and limits the number of concurrent downloads based on the remaining space. Files are downloaded in parallel, with each download handled asynchronously, allowing multiple downloads to occur simultaneously. The algorithm intelligently schedules downloads so that many smaller files do not block larger files from completing, pausing downloads where appropriate. Once written to disk, files are read into memory using the reader configuration options (i.e. chunked reads or text/binary modes). When a file has been read into memory and processed, it is removed from disk to make space for the next file.

Amazon S3

Usage

.qsp.run
  .qsp.read.fromAmazonS3[
      `:s3://my-bucket/logs/2025-06-*.csv          / path (wildcards allowed)
      ;"us-east-1"                                 / AWS region
  ]
  .qsp.map[{"F"$x}]                                / parse each CSV line
  .qsp.write.toConsole[]

Note

.qsp.read.fromAmazonS3 implicitly resolves to Version 1. It can also be invoked explicitly as .qsp.v1.read.fromAmazonS3.

.qsp.run
  .qsp.v2.read.fromAmazonS3[
      `:s3://my-bucket/logs/2025-06-*.csv          / path (wildcards allowed)
      ;"us-east-1"                                 / AWS region
  ]
  .qsp.map[{"F"$x}]                                / parse each CSV line
  .qsp.write.toConsole[]

Authentication

Required Permissions

This reader requires the ability to list and retrieve objects from the specified S3 bucket, so the credentials provided must grant the s3:ListBucket, s3:GetObject and s3:ListObjects permissions; see here for more information.
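
A quick way to confirm those permissions before deploying a pipeline is to exercise them with the AWS CLI using the same credentials; this is only a sanity-check sketch, and the bucket and key names are placeholders.

# list a few objects, then fetch one, using the credentials the reader will use
aws s3api list-objects-v2 --bucket my-bucket --max-items 5
aws s3api get-object --bucket my-bucket --key logs/2025-06-01.csv /tmp/sample.csv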

Version 1 follows the standard AWS credential chain:

  1. Instance role / container IAM role (automatic when running on AWS)
  2. Environment variables
export AWS_ACCESS_KEY_ID=AKIA...
export AWS_SECRET_ACCESS_KEY=...
export AWS_SESSION_TOKEN=...        # if using STS

These can be fetched from the AWS cloud console, or using the following command with the aws cli: aws configure export-credentials --format env. See here for more information.

  3. Shared credentials/config file (~/.aws/credentials)
  4. Explicit secret reference in the reader options.

Version 1 uses kurl for credential discovery. See credential discovery in the AWS tab for more information.

Docker

For Docker based configurations, set the variables in the worker image configuration.

docker-compose:

version: "3.3"
services:
  worker:
    image: portal.dl.kx.com/kxi-sp-worker:1.14.0
    environment:
      AWS_ACCESS_KEY_ID: "abcd"
      AWS_SECRET_ACCESS_KEY: "iamasecret"

Docker credentials file for authentication:

To load a custom configuration file into a Docker deployment, mount the credentials file directory into the container and set KXI_SP_CONFIG_PATH to point to the configuration directory.

version: "3.3"
services:
  worker:
    image: portal.dl.kx.com/kxi-sp-worker:1.14.0
    volumes:
      - $HOME/.aws/:/config/awscreds
    environment:
      KXI_SP_CONFIG_PATH: "/config"

Next, add the secret name to the S3 reader configuration

.qsp.read.fromAmazonS3[`:s3://bucket/hello; "us-east-1"; .qsp.use``credentials!``awscreds]

Now deploying the pipeline will read the credentials from the AWS credentials file. Note that the credentials file name must be credentials.

Kubernetes

For Kubernetes deployments, environment variables can be passed via the REST request sent when launching a job.

curl -X POST http://localhost:5000/pipeline/create -d \
    "$(jq -n  --arg spec "$(cat spec.q)" \
    '{
        name     : "s3-reader",
        type     : "spec",
        config   : { content: $spec },
        settings : { maxWorkers: "10" },
        env      : { AWS_ACCESS_KEY_ID: "abcd", AWS_SECRET_ACCESS_KEY: "iamasecret" }
    }' | jq -asR .)"

Kubernetes secrets for authentication:

When using a Kubernetes deployment, Kubernetes secrets can be used to install credentials into the worker.

First, create a secret using an AWS credentials file. Take care to ensure the secret is created in the correct namespace.

kubectl create secret generic --from-file credentials=$HOME/.aws/credentials awscreds

Note that the secret must be formatted so that the variable names are lower case and there is a space on either side of the equals sign, like so:

[default]
aws_access_key_id = abcd1234
aws_secret_access_key = abcd1234

Next, add the secret name to the S3 reader configuration

.qsp.read.fromAmazonS3[`:s3://bucket/hello; "us-east-1"; .qsp.use``credentials!``awscreds]

Lastly, when deploying the worker, add a secret to the Kubernetes configuration in the request.

curl -X POST http://localhost:5000/pipeline/create -d \
    "$(jq -n  --arg spec "$(cat spec.q)" \
    '{
        name       : "s3-reader",
        type       : "spec",
        config     : { content: $spec },
        settings   : { maxWorkers: "10" },
        kubeConfig : { secrets : ["awscreds"] }
    }' | jq -asR .)"

Version 2 supports only environment variables; it does not follow the standard credential chain or support credential discovery:

Environment variables

export AWS_ACCESS_KEY_ID=AKIA...
export AWS_SECRET_ACCESS_KEY=...
export AWS_SESSION_TOKEN=...        # if using STS

These can be fetched from the AWS cloud console, or using the following command with the aws cli: aws configure export-credentials --format env. See here for more information.
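
As a usage sketch, the exported credentials can be loaded straight into the current shell before starting a worker:

# evaluate the env-format output to populate AWS_ACCESS_KEY_ID and friends
eval "$(aws configure export-credentials --format env)"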

**"Docker"

For Docker based configurations, set the variables in the worker image configuration.

docker-compose:

version: "3.3"
services:
  worker:
    image: portal.dl.kx.com/kxi-sp-worker:1.14.0
    environment:
      AWS_ACCESS_KEY_ID: "abcd"
      AWS_SECRET_ACCESS_KEY: "iamasecret"

Kubernetes

For Kubernetes deployments, environment variables can be passed via the REST request sent when launching a job.

curl -X POST http://localhost:5000/pipeline/create -d \
    "$(jq -n  --arg spec "$(cat spec.q)" \
    '{
        name     : "s3-reader",
        type     : "spec",
        config   : { content: $spec },
        settings : { maxWorkers: "10" },
        env      : { AWS_ACCESS_KEY_ID: "abcd", AWS_SECRET_ACCESS_KEY: "iamasecret" }
    }' | jq -asR .)"

Google Cloud Storage

Usage

.qsp.run
  .qsp.read.fromGoogleStorage[
      `:gs://my-bucket/raw/*.json                   / glob over many objects
  ]
  .qsp.write.toConsole[]

Note

.qsp.read.fromGoogleStorage implicitly resolves to Version 1. It can also be invoked explicitly as .qsp.v1.read.fromGoogleStorage.

.qsp.run
  .qsp.v2.read.fromGoogleStorage[
      `:gs://my-bucket/raw/*.json                   / glob over many objects
  ]
  .qsp.write.toConsole[]

Authentication

Required Permissions

This reader requires the ability to list and retrieve objects from the specified Google Cloud Storage bucket, so the credentials provided must have the storage.objects.list, storage.objects.get and storage.buckets.get IAM permissions; see here for more information.
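
To sanity-check those permissions before deploying, the same listing and retrieval can be exercised with gsutil; this is only an illustrative sketch, and the bucket and object names are placeholders.

# list objects, then read one, using the active gcloud credentials
gsutil ls gs://my-bucket/raw/
gsutil cat gs://my-bucket/raw/example.json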

Version 1 uses Google’s Application Default Credentials.

Environment variables

To set up authentication using an environment variable, set GOOGLE_STORAGE_TOKEN to the output of running gcloud auth print-access-token. To install the Google SDK, refer to these instructions.

export GOOGLE_STORAGE_TOKEN=$(gcloud auth print-access-token)

The service account (or token subject) needs storage.objects.get (and optionally list) permissions on the bucket.

Version 1 uses kurl for credential discovery. See credential discovery in the GCP tab for more information. When running on Google provisioned cloud infrastructure, the credential discovery will automatically use the credentials of the user that launched the instance.

Docker

For Docker based configurations, set the variables in the worker image configuration.

docker-compose:

version: "3.3"
services:
  worker:
    image: portal.dl.kx.com/kxi-sp-worker:1.14.0
    environment:
      GOOGLE_STORAGE_TOKEN: "123abc"


Kubernetes

For Kubernetes deployments, environment variables can be passed via the REST request sent when launching a job.

curl -X POST http://localhost:5000/pipeline/create -d \
    "$(jq -n  --arg spec "$(cat spec.q)" \
    '{
        name     : "gs-reader",
        type     : "spec",
        config   : { content: $spec },
        settings : { maxWorkers: "10" },
        env      : { GOOGLE_STORAGE_TOKEN: "123abc" }
    }' | jq -asR .)"

Kubernetes secrets for authentication:

When using a Kubernetes deployment, Kubernetes secrets can be used to install credentials into the worker.

First, create a secret using a generic secret. Take care to ensure the secret is created in the correct namespace.

kubectl create secret generic --from-literal token=$(gcloud auth print-access-token) gscreds

Next, add the secret name to the Google Cloud Storage reader configuration.

.qsp.read.fromGoogleStorage[`:gs://bucket/hello; .qsp.use``credentials!``gscreds]

Lastly, when deploying the worker, add a secret to the Kubernetes configuration in the request.

curl -X POST http://localhost:5000/pipeline/create -d \
    "$(jq -n  --arg spec "$(cat spec.q)" \
    '{
        name       : "gs-reader",
        type       : "spec",
        config     : { content: $spec },
        settings   : { maxWorkers: "10" },
        kubeConfig : { secrets : ["gscreds"] }
    }' | jq -asR .)"

Version 2 uses Google’s Application Default Credentials. It does not support credential discovery.

Both of the following need to be set, along with the default credential file existing on disk:

export GOOGLE_APPLICATION_CREDENTIALS=~/.config/gcloud
export GOOGLE_STORAGE_TOKEN=$(gcloud auth print-access-token)

GOOGLE_STORAGE_TOKEN: This is the access token required for authentication. It can be retrieved by running gcloud auth print-access-token using the gcloud CLI.

GOOGLE_APPLICATION_CREDENTIALS: This variable points to the location of the application_default_credentials.json file on disk. The credentials file must be mounted into the worker container separately. It can be retrieved by running gcloud auth application-default login. The default path for this file is ~/.config/gcloud.

Docker

For Docker based configurations, set the variables in the worker image configuration.

docker-compose:

version: "3.3"
services:
  worker:
    image: portal.dl.kx.com/kxi-sp-worker:1.14.0
    environment:
      GOOGLE_STORAGE_TOKEN: "123abc"

Kubernetes

For Kubernetes deployments, environment variables can be passed via the REST request sent when launching a job.

curl -X POST http://localhost:5000/pipeline/create -d \
    "$(jq -n  --arg spec "$(cat spec.q)" \
    '{
        name     : "gs-reader",
        type     : "spec",
        config   : { content: $spec },
        settings : { maxWorkers: "10" },
        env      : { GOOGLE_STORAGE_TOKEN: "123abc" }
    }' | jq -asR .)"

Microsoft Azure Blob Storage

Usage

.qsp.run
  .qsp.read.fromAzureStorage[
      `:ms://mycontainer/data.csv                   / ms://container/blob
      ;"myStorageAccount"                           / storage account name
      ;.qsp.use mode:`text                          / optional reader opts
  ]
  .qsp.write.toConsole[]

Note

.qsp.read.fromAzureStorage implicitly resolves to Version 1. It can also be invoked explicitly as .qsp.v1.read.fromAzureStorage.

.qsp.run
  .qsp.v2.read.fromAzureStorage[
      `:ms://mycontainer/data.csv                   / ms://container/blob
      ;"myStorageAccount"                           / storage account name
      ;.qsp.use mode:`text                          / optional reader opts
  ]
  .qsp.write.toConsole[]

Authentication

Required Permissions

This reader requires the ability to list and retrieve objects from the specified Azure Blob Storage containers, so the credentials provided must have access to the List Blobs and Get Blob operations; follow the supplied links for more information.
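
Those operations can be sanity-checked with the Azure CLI before deploying; this is only an illustrative sketch, and the account, container, and blob names are placeholders.

# list blobs, then download one, using the storage account's shared key
az storage blob list --account-name myStorageAccount --container-name mycontainer --output table
az storage blob download --account-name myStorageAccount --container-name mycontainer \
    --name data.csv --file /tmp/data.csv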

Provide a storage‑account shared key or use Azure‑hosted managed identity.

Environment variables

export AZURE_STORAGE_ACCOUNT=myStorageAccount
export AZURE_STORAGE_SHARED_KEY=Eby8vdM02xNO...

To set up authentication using environment variables, set AZURE_STORAGE_ACCOUNT to the name of the storage account to read from and AZURE_STORAGE_SHARED_KEY to one of the keys for that account. To use shared keys, refer to these instructions.

The key grants List & Read access to all blobs in the account. In Kubernetes, mount these as a Secret and reference them from your KX deployment.

Version 1 uses kurl for credential discovery. See credential discovery in the Azure tab for more information. When running on Azure provisioned cloud infrastructure, the credential discovery will automatically use the credentials of the user that launched the instance.

Docker

For Docker based configurations, set the variables in the worker image configuration.

docker-compose:

version: "3.3"
services:
  worker:
    image: portal.dl.kx.com/kxi-sp-worker:1.14.0
    environment:
      AZURE_STORAGE_ACCOUNT: "123abc"
      AZURE_STORAGE_SHARED_KEY: "aAabBbcCcdDd111222333"

Kubernetes

For Kubernetes deployments, environment variables can be passed via the REST request sent when launching a job.

curl -X POST http://localhost:5000/pipeline/create -d \
    "$(jq -n  --arg spec "$(cat spec.q)" \
    '{
        name     : "ms-reader",
        type     : "spec",
        config   : { content: $spec },
        settings : { maxWorkers: "10" },
        env      : {
            AZURE_STORAGE_ACCOUNT: "123abc",
            AZURE_STORAGE_SHARED_KEY: "aAabBbcCcdDd111222333"
        }
    }' | jq -asR .)"

Kubernetes secrets for authentication:

When using a Kubernetes deployment, Kubernetes secrets can be used to install credentials into the worker.

First, create a secret using a generic secret. Take care to ensure the secret is created in the correct namespace.

kubectl create secret generic --from-literal token="aAabBbcCcdDd111222333" mscreds

Next, add the secret name and the account name to the Azure Storage reader configuration.

.qsp.read.fromAzureStorage[`:ms://bucket/hello; "abc123"; .qsp.use``credentials!``mscreds]

Lastly, when deploying the worker, add a secret to the Kubernetes configuration in the request.

curl -X POST http://localhost:5000/pipeline/create -d \
    "$(jq -n  --arg spec "$(cat spec.q)" \
    '{
        name       : "ms-reader",
        type       : "spec",
        config     : { content: $spec },
        settings   : { maxWorkers: "10" },
        kubeConfig : { secrets : ["mscreds"] }
    }' | jq -asR .)"

Version 2 supports only shared keys set as environment variables. It does not use credential discovery. To use shared keys, refer to these instructions.

Environment variables

export AZURE_STORAGE_ACCOUNT=myStorageAccount
export AZURE_STORAGE_SHARED_KEY=Eby8vdM02xNO...

In addition, Version 2 uses a different credential hierarchy. Variables are read in the following order (an example combination is shown after the list):

  1. AZURE_STORAGE_CONNECTION_STRING or

  2. (AZURE_STORAGE_SERVICE_ENDPOINT or AZURE_STORAGE_ACCOUNT) and (AZURE_STORAGE_KEY or AZURE_STORAGE_SAS_TOKEN)
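
For example, either of the following would satisfy the hierarchy; the values are placeholders, not working credentials.

# option 1: a single connection string
export AZURE_STORAGE_CONNECTION_STRING="DefaultEndpointsProtocol=https;AccountName=myStorageAccount;AccountKey=...;EndpointSuffix=core.windows.net"
# option 2: an account name plus a key (or a SAS token)
export AZURE_STORAGE_ACCOUNT=myStorageAccount
export AZURE_STORAGE_KEY=Eby8vdM02xNO...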

Docker

For Docker based configurations, set the variables in the worker image configuration.

docker-compose:

version: "3.3"
services:
  worker:
    image: portal.dl.kx.com/kxi-sp-worker:1.14.0
    environment:
      AZURE_STORAGE_ACCOUNT: "123abc"
      AZURE_STORAGE_SHARED_KEY: "aAabBbcCcdDd111222333"

Kubernetes

For Kubernetes deployments, environment variables can be passed via the REST request sent when launching a job.

curl -X POST http://localhost:5000/pipeline/create -d \
    "$(jq -n  --arg spec "$(cat spec.q)" \
    '{
        name     : "ms-reader",
        type     : "spec",
        config   : { content: $spec },
        settings : { maxWorkers: "10" },
        env      : {
            AZURE_STORAGE_ACCOUNT: "123abc",
            AZURE_STORAGE_SHARED_KEY: "aAabBbcCcdDd111222333"
        }
    }' | jq -asR .)"