Persisting to Object Storage

The KX Insights Platform can persist ingested data to object storage.

Persisting to object storage can significantly reduce storage costs, and should be considered for older datasets.

For an example of reading data that has already been persisted to object storage, see querying from existing object storage.

Immutable data

Data in object storage cannot be modified, so it is best suited to data meant for long-term storage. If the data is modified or deleted externally, you should restart any HDB pods and drop the cache.

Prerequisites

You will need to configure environment variables for object storage.

You will need to provide credentials or utilize service accounts to access your private buckets.

Authentication

For more information on service accounts see automatic registration and environment variables.

To configure environment variables in an assembly, you may include them underneath the sm.env, dbm.env, eoi.env, eod.env, and dap.instances.*.env components as a list.
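
As a rough sketch of the per-component form, the fragment below sets one variable on the storage manager and another on a single data access instance. It assumes that sm and dap sit directly under spec, as in the assembly examples later on this page, and that each env entry takes the same name/value list shape as spec.env. The choice of variables and of the hdb instance is illustrative only.

```yaml
spec:
  sm:
    env:                      # applies to the storage manager only
      - name: KX_TRACE_S3
        value: "1"
  dap:
    instances:
      hdb:
        env:                  # applies to the hdb DAP instance only
          - name: AWS_REGION
            value: us-east-2
```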

To set environment variables for all pods, you may set spec.env.

For example, to enable trace logging for objstor and set AWS_REGION:

spec:
  env:
    - name: AWS_REGION
      value: us-east-2
    - name: KX_TRACE_S3
      value: "1"

Performance considerations

Table layout

The layout of your table should be considered before uploading to object storage.

Neglecting to apply attributes, sort symbols, or partition by date can turn any query into a linear scan that pulls back the whole dataset.

Attributes and table layouts are specified in the spec.table section of the assembly YAML.
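
The fragment below is a minimal sketch of what such a layout might look like, with a date-partitioned table sorted by symbol and attributes applied to the symbol column. The table and column names are hypothetical. The key names (tables, type, prtnCol, sortColsDisk, attrMem, attrDisk) are assumptions not taken from this page, so check them against the schema reference for your version before use.

```yaml
spec:
  tables:                       # assumed key name for the table section
    trade:                      # hypothetical table name
      type: partitioned         # assumed: table is partitioned by date
      prtnCol: time             # assumed: timestamp column used to derive the date partition
      sortColsDisk: sym         # assumed: sort by symbol when written to disk
      columns:
        - name: sym
          type: symbol
          attrMem: grouped      # assumed attribute fields
          attrDisk: parted
        - name: time
          type: timestamp
        - name: price
          type: float
```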

Deployment

To persist data to cloud storage, you must deploy an assembly with an HDB tier that has a storage key.

This setting creates a segmented database in which the HDB seamlessly loads on-disk and cloud storage data together.

Data in object storage is immutable and cannot be refactored after it has been written. For this reason, date-based partitions bound for object storage can only be written down once per day.

For example, this sm tiering configuration will:

store data in memory for 10 minutes
keep the last 2 days of data on disk
and keep the remainder in object storage

sm:
  tiers:
    - name: streaming
      mount: rdb
    - name: interval
      mount: idb
      schedule:
        freq: 00:10:00
        snap: 00:00:00
    - name: ondisk
      mount: hdb
      schedule:
        freq: 1D00:00:00
        snap:   01:35:00
      retain:
        time: 2 Days
    - name: s3
      mount: hdb
      store: s3://examplebucket/db

Mounts

You may see other KX Insights Platform examples referring to object type mounts. Those mounts are for reading from existing cloud storage. For writing your own database to storage, no object mounts are involved, only HDB tier settings to make your HDB a segmented database.

For a guide on reading from an existing object storage database, see querying object storage.

Examples

Deploying using explicit credentials

All examples assume your data is published into a topic called "south".

AWS

The name of the example bucket is s3://examplebucket, and we want to save our database to a root folder called db.

Our example uses the AWS_REGION: us-east-2.

Stream processor and table schema sections have been omitted.

First create a new secret with your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.

kubectl create secret generic aws-access-secret \
    --from-literal=AWS_ACCESS_KEY_ID="${MY_KEY}" \
    --from-literal=AWS_SECRET_ACCESS_KEY="${MY_SECRET}"
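
If you prefer to keep the secret in a manifest rather than create it imperatively, a standard Kubernetes Secret with the same name and keys would look roughly like the sketch below. The placeholder values are hypothetical; avoid committing real credentials to source control.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: aws-access-secret
type: Opaque
stringData:                     # plain-text input; Kubernetes stores it base64-encoded
  AWS_ACCESS_KEY_ID: "<your-access-key-id>"
  AWS_SECRET_ACCESS_KEY: "<your-secret-access-key>"
```

The manifest can be applied with kubectl apply -f, using whatever file name you saved it under.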

Now create your assembly, setting sm.tiers to write to an S3 store, and environment variables for storage.

spec:
  env:
    - name: AWS_REGION
      value: us-east-2
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: aws-access-secret
          key: AWS_ACCESS_KEY_ID
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: aws-access-secret
          key: AWS_SECRET_ACCESS_KEY
  mounts:
    rdb:
      type: stream
      baseURI: none
      partition: none
      dependency:
        - idb
    idb:
      type: local
      baseURI: file:///data/db/idb
      partition: ordinal
    hdb:
      type: local
      baseURI: file:///data/db/hdb
      partition: date
      dependency:
        - idb
  sm:
    source: south
    tiers:
      - name: streaming
        mount: rdb
      - name: interval
        mount: idb
        schedule:
          freq: 00:10:00
          snap: 00:00:00
      - name: ondisk
        mount: hdb
        schedule:
          freq: 1D00:00:00
          snap:   01:35:00
        retain:
          time: 2 Days
      - name: s3
        mount: hdb
        store: s3://examplebucket/db
  dap:
    instances:
      idb:
        mountName: idb
      hdb:
        mountName: hdb
      rdb:
        tableLoad: empty
        mountName: rdb
        source: south

Azure

The name of the example storage container is ms://mycontainer, and we want to save our database to a folder within the container called db.

Our example uses the AZURE_STORAGE_ACCOUNT: iamanexample. This means we will write our database to a folder called 'db', inside the container 'mycontainer' for the storage account iamanexample.

Stream processor and table schema sections have been omitted.

First create a new secret with your AZURE_STORAGE_SHARED_KEY.

kubectl create secret generic azure-storage-secret --from-literal=AZURE_STORAGE_SHARED_KEY=${MY_KEY}
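
The same secret can also be expressed as a standard Kubernetes Secret manifest, sketched below with a hypothetical placeholder value. The name and key match the secretKeyRef used in the assembly.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: azure-storage-secret
type: Opaque
stringData:                     # plain-text input; Kubernetes stores it base64-encoded
  AZURE_STORAGE_SHARED_KEY: "<your-shared-key>"
```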

Now create your assembly, setting sm.tiers to write to an Azure storage container, and environment variables for storage.

spec:
  env:
    - name: AZURE_STORAGE_ACCOUNT
      value: iamanexample
    - name: AZURE_STORAGE_SHARED_KEY
      valueFrom:
        secretKeyRef:
          name: azure-storage-secret
          key: AZURE_STORAGE_SHARED_KEY
  mounts:
    rdb:
      type: stream
      baseURI: none
      partition: none
      dependency:
        - idb
    idb:
      type: local
      baseURI: file:///data/db/idb
      partition: ordinal
    hdb:
      type: local
      baseURI: file:///data/db/hdb
      partition: date
      dependency:
        - idb
  sm:
    source: south
    tiers:
      - name: streaming
        mount: rdb
      - name: interval
        mount: idb
        schedule:
          freq: 00:10:00
          snap: 00:00:00
      - name: ondisk
        mount: hdb
        schedule:
          freq: 1D00:00:00
          snap:   01:35:00
        retain:
          time: 2 Days
      - name: s3
        mount: hdb
        store: ms://mycontainer/db
  dap:
    instances:
      idb:
        mountName: idb
      hdb:
        mountName: hdb
      rdb:
        tableLoad: empty
        mountName: rdb
        source: south

Apply the assembly with kubectl apply -f object-tier.yml.