Persisting to object storage

kdb Insights Enterprise can persist data to object storage for inexpensive, durable, long-term data storage. Persisting to object storage can significantly reduce storage costs and should be considered for older data sets.

Immutable data

Data in object storage cannot be modified, so only data intended for long-term storage should be written there. If the data is modified or deleted externally, restart any HDB pods and drop the cache.
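
For example, on Kubernetes you could restart the HDB pods with a rollout restart; the workload name below is hypothetical, so substitute your own HDB deployment or stateful set:

# Restart the HDB pods so they stop serving the externally changed data
kubectl rollout restart statefulset/insights-hdb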

Object storage tiers vs mounting an object storage database

An object storage tier is a read-write location and is configured differently from a read-only database. See querying object storage for an example of how to query an existing object storage database.

An object storage tier has to be the final tier in your storage configuration.

Authentication

You will need to provide authentication details to access your private buckets. Object storage tiers can be configured using environment variables for routing details and authentication. Authentication can also be configured using service accounts.

For more information on service accounts see automatic registration and environment variables.

variable              | description                                                                                                                                    | default
--------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------- | ---------
AWS_REGION            | AWS S3 buckets can be tied to a specific region or set of regions. This value indicates which regional endpoint to use to read and write data from the bucket. | us-east-1
AWS_ACCESS_KEY_ID     | AWS account access key ID.                                                                                                                      |
AWS_SECRET_ACCESS_KEY | AWS account secret access key.                                                                                                                  |

See this guide on service accounts in EKS for how to use an IAM role for authorization.
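
As a sketch, service account registration in EKS relies on an annotation tying the account to an IAM role; the names below are hypothetical:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: insights-sm                    # hypothetical service account name
  annotations:
    # IAM role granting read/write access to the S3 bucket
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/insights-s3-writer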

variable                 | description                                                               | default
------------------------ | ------------------------------------------------------------------------- | -------
AZURE_STORAGE_ACCOUNT    | The storage account name to use for routing Azure blob storage requests.  |
AZURE_STORAGE_SHARED_KEY | A shared key for the blob storage container to write data to.             |

See this guide on service accounts in AKS for how to use a managed identity for authorization.
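
As a sketch, with AKS workload identity the service account carries the client ID of a managed identity that has access to the storage account; the values below are hypothetical:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: insights-sm                    # hypothetical service account name
  annotations:
    # client ID of a user-assigned managed identity with blob storage access
    azure.workload.identity/client-id: 00000000-0000-0000-0000-000000000000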

variable          | description                                                              | default
----------------- | ------------------------------------------------------------------------ | -------
GCLOUD_PROJECT_ID | The Google Cloud project ID to use when interfacing with cloud storage.  |

See this guide on service accounts in GKE for how to use an IAM service account for authorization.
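
As a sketch, with GKE Workload Identity the Kubernetes service account is bound to a Google service account that can write to the bucket; the names below are hypothetical:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: insights-sm                    # hypothetical service account name
  annotations:
    # Google service account with storage access on the bucket
    iam.gke.io/gcp-service-account: insights-writer@my-project.iam.gserviceaccount.com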

Setting environment variables

In a Docker Compose configuration, the environment variables for authentication need to be set on the Storage Manager and Data Access processes.

services:
  dap:
    image: ${kxi_da_single}
    env_file: .env
    environment:
      - AWS_REGION=us-east-2
      - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
  sm:
    image: ${kxi_sm_single}
    env_file: .env
    environment:
      - AWS_REGION=us-east-2
      - AWS_ACCESS_KEY_ID=${AWS_ACCESS_KEY_ID}
      - AWS_SECRET_ACCESS_KEY=${AWS_SECRET_ACCESS_KEY}
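
The ${...} references above are resolved from the .env file that Docker Compose loads, which also supplies the image variables. A minimal sketch, with placeholder values:

# .env -- placeholder credentials, never commit real keys
AWS_ACCESS_KEY_ID=AKIAXXXXXXXXXXXXXXXX
AWS_SECRET_ACCESS_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx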

Performance considerations

The layout of your table should be considered before uploading to object storage. Without column attributes, sorted symbols, and date partitioning, any query can degrade into a linear scan that pulls back the whole data set.

Attributes and table layouts are specified in the tables section of the assembly.
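
As a sketch, a tables entry that partitions by date, sorts on sym, and applies a parted attribute might look like the following; the table and column names are illustrative, so check the assembly schema reference for the exact fields supported by your version:

tables:
  trade:
    type: partitioned
    prtnCol: time              # partition the table by date on the time column
    sortColsOrd: [sym]         # sort by sym at each interval writedown
    sortColsDisk: [sym]        # sort by sym on disk before migration
    columns:
      - name: time
        type: timestamp
      - name: sym
        type: symbol
        attrOrd: parted        # parted attribute on ordinal partitions
        attrDisk: parted       # parted attribute on disk and object storage
      - name: price
        type: float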

Deployment

To persist data to cloud storage, you must deploy an assembly with an HDB tier that has a store key. This setting creates a segmented database where the HDB loads on-disk and cloud storage data together seamlessly.

Data in object storage is immutable, and cannot be refactored after it has been written. For this reason, date-based partitions bound for object storage can only be written down once per day.

For example, this Storage Manager tier configuration will:

  • Store data in memory for 10 minutes
  • Keep the last 2 days of data on disk
  • Migrate data older than 2 days to object storage
sm:
  tiers:
    - name: streaming
      mount: rdb
    - name: interval
      mount: idb
      schedule:
        freq: 00:10:00
    - name: ondisk
      mount: hdb
      schedule:
        freq: 1D00:00:00
        snap: 01:35:00
      retain:
        time: 2 days
    - name: s3
      mount: hdb
      store: s3://examplebucket/db

Mounts

You may see other kdb Insights Enterprise examples that refer to object type mounts. Those mounts are for reading from existing cloud storage. When writing your own database to storage, no object mounts are involved; only the HDB tier settings above are needed to make your HDB a segmented database.

For a guide on reading from an existing object storage database, see querying object storage.
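
For contrast, a read-only object mount, as used when querying an existing cloud database rather than writing one, is a sketch along these lines:

mounts:
  object:
    type: object                     # read-only mount of an existing database
    baseURI: s3://examplebucket/db
    partition: date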

Example using explicit credentials

This example assumes your data is published into a topic called south.

The name of the example bucket is s3://examplebucket, and we want to save our database to a root folder called db. Our example uses the AWS_REGION us-east-2. Stream Processor and table schema sections have been omitted.

First create a new secret with your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

kubectl create secret generic aws-access-secret \
    --from-literal=AWS_ACCESS_KEY_ID=${MY_KEY} \
    --from-literal=AWS_SECRET_ACCESS_KEY=${MY_SECRET}
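
You can confirm the secret exists before referencing it from the assembly (the values remain base64-encoded):

kubectl get secret aws-access-secret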

Now create your assembly, setting sm.tiers to write to an S3 store, and environment variables for storage.

spec:
  env:
    - name: AWS_REGION
      value: us-east-2
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: aws-access-secret
          key: AWS_ACCESS_KEY_ID
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: aws-access-secret
          key: AWS_SECRET_ACCESS_KEY
  mounts:
    rdb:
      type: stream
      baseURI: none
      partition: none
      dependency:
        - idb
    idb:
      type: local
      baseURI: file:///data/db/idb
      partition: ordinal
    hdb:
      type: local
      baseURI: file:///data/db/hdb
      partition: date
      dependency:
        - idb
  sm:
    source: south
    tiers:
      - name: streaming
        mount: rdb
      - name: interval
        mount: idb
        schedule:
          freq: 00:10:00
      - name: ondisk
        mount: hdb
        schedule:
          freq: 1D00:00:00
          snap: 01:35:00
        retain:
          time: 2 days
      - name: s3
        mount: hdb
        store: s3://examplebucket/db
  dap:
    instances:
      idb:
        mountName: idb
      hdb:
        mountName: hdb
      rdb:
        tableLoad: empty
        mountName: rdb
        source: south

The name of the example storage container is ms://mycontainer, and we want to save our database to a folder within the container called db. Our example uses the AZURE_STORAGE_ACCOUNT mystorageaccount: we will write our database to a folder called db, inside the container mycontainer, for the storage account mystorageaccount. Stream Processor and table schema sections have been omitted.

First create a new secret with your AZURE_STORAGE_SHARED_KEY.

kubectl create secret generic azure-storage-secret --from-literal=AZURE_STORAGE_SHARED_KEY=${MY_KEY}

Now create your assembly, setting sm.tiers to write to an Azure store container, and environment variables for storage.

spec:
  env:
    - name: AZURE_STORAGE_ACCOUNT
      value: mystorageaccount
    - name: AZURE_STORAGE_SHARED_KEY
      valueFrom:
        secretKeyRef:
          name: azure-storage-secret
          key: AZURE_STORAGE_SHARED_KEY
  mounts:
    rdb:
      type: stream
      baseURI: none
      partition: none
      dependency:
        - idb
    idb:
      type: local
      baseURI: file:///data/db/idb
      partition: ordinal
    hdb:
      type: local
      baseURI: file:///data/db/hdb
      partition: date
      dependency:
        - idb
  sm:
    source: south
    tiers:
      - name: streaming
        mount: rdb
      - name: interval
        mount: idb
        schedule:
          freq: 00:10:00
      - name: ondisk
        mount: hdb
        schedule:
          freq: 1D00:00:00
          snap: 01:35:00
        retain:
          time: 2 days
      - name: s3
        mount: hdb
        store: ms://mycontainer/db
  dap:
    instances:
      idb:
        mountName: idb
      hdb:
        mountName: hdb
      rdb:
        tableLoad: empty
        mountName: rdb
        source: south

Apply the assembly using the kdb Insights CLI:

kxi assembly deploy --filepath object-tier.yml
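
Once applied, you can confirm the assembly is running, assuming your version of the CLI provides the listing subcommand:

kxi assembly list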