Persisting to Object Storage
The KX Insights Platform can persist ingested data to object storage.
Persisting to object storage has major cost-saving implications and should be considered for older datasets.
For an example of reading data that has already been persisted to object storage, see querying from existing object storage.
Immutable data
Data in object storage cannot be modified. Only data intended for long-term storage should be written here. If the data is modified or deleted externally, you should restart any HDB pods and drop the cache.
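If the data has changed externally, restarting the HDB pods forces them to remount the database. A minimal sketch, assuming the hypothetical namespace my-namespace and that your HDB pods carry an app=hdb label (check your deployment's actual labels):
# delete the HDB pods so they restart and remount the database
# (the namespace and label selector here are assumptions, not fixed names)
kubectl delete pods -n my-namespace -l app=hdb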
Prerequisites
- You will need to configure environment variables for object storage.
- You will need to provide credentials or utilize service accounts to access your private buckets.
Authentication
For more information on service accounts, see automatic registration and environment variables.
To configure environment variables in an assembly, you may include them underneath the sm.env, dbm.env, eoi.env, eod.env, and dap.instances.*.env components as a list.
To set the environment variables for all pods, you may set spec.env.
For example, to set trace logging for objstor and an AWS_REGION:
spec:
  env:
    - name: AWS_REGION
      value: us-east-2
    - name: KX_TRACE_S3
      value: "1"
Performance considerations
Table layout
The layout of your table should be considered before uploading to object storage.
Neglecting to apply attributes, sort symbols, or partition by date can result in any query becoming a linear scan that pulls back the whole dataset.
Attributes and table layouts are specified in assembly YAML, and configured in the spec.tables section of the assembly.
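As an illustrative sketch of such a layout (the trade table, its columns, and the exact schema keys are assumptions to check against your platform version), a date-partitioned table whose sym column is sorted and parted on disk might be declared as:
spec:
  tables:
    trade:
      type: partitioned
      prtnCol: realTime        # partition on this timestamp column
      sortColsDisk: [sym]      # sort symbols on disk so the parted attribute can apply
      columns:
        - name: realTime
          type: timestamp
        - name: sym
          type: symbol
          attrDisk: parted     # parted attribute avoids linear scans on symbol lookups
        - name: price
          type: float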
Deployment
To persist data to cloud storage, you must deploy an assembly with an HDB tier that has a storage key.
This setting will create a segmented database where the HDB loads on-disk and cloud storage data together seamlessly.
Data in object storage is immutable, and cannot be refactored after it has been written. For this reason, date based partitions bound for object storage can only be written down once per day.
For example, this sm tiering configuration will:
- store data in memory for 10 minutes
- keep the last 2 days of data on disk
- and keep the remainder in object storage
sm:
  tiers:
    - name: streaming
      mount: rdb
    - name: interval
      mount: idb
      schedule:
        freq: 00:10:00
        snap: 00:00:00
    - name: ondisk
      mount: hdb
      schedule:
        freq: 1D00:00:00
        snap: 01:35:00
      retain:
        time: 2 Days
    - name: s3
      mount: hdb
      store: s3://examplebucket/db
Mounts
You may see other KX Insights Platform examples referring to object type mounts. Those mounts are for reading from existing cloud storage.
For writing your own database to storage, no object mounts are involved; only the HDB tier settings are needed to make your HDB a segmented database.
For a guide on reading from an existing object storage database, see querying object storage.
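For contrast, a read-only object mount used when querying an existing cloud database (not when writing one) might look like the following minimal sketch; the mount name, bucket, and path are hypothetical, and the exact keys should be checked against the querying documentation:
spec:
  mounts:
    s3db:
      type: object                    # object mounts read existing cloud storage only
      baseURI: s3://examplebucket/db
      partition: none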
Examples
Deploying using explicit credentials
All examples assume your data is published into a topic called "south".
AWS
The name of the example bucket is s3://examplebucket, and we want to save our database to a root folder called db. Our example uses the AWS_REGION: us-east-2.
Stream processor and table schema sections have been omitted.
First create a new secret with your AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
kubectl create secret generic aws-access-secret \
  --from-literal=AWS_ACCESS_KEY_ID=${MY_KEY} \
  --from-literal=AWS_SECRET_ACCESS_KEY=${MY_SECRET}
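As a quick sanity check, you can confirm the secret exists and holds both keys:
# shows the key names and byte sizes without revealing the values
kubectl describe secret aws-access-secret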
Now create your assembly, setting sm.tiers to write to an S3 store, and environment variables for storage.
spec:
  env:
    - name: AWS_REGION
      value: us-east-2
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: aws-access-secret
          key: AWS_ACCESS_KEY_ID
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: aws-access-secret
          key: AWS_SECRET_ACCESS_KEY
  mounts:
    rdb:
      type: stream
      baseURI: none
      partition: none
      dependency:
        - idb
    idb:
      type: local
      baseURI: file:///data/db/idb
      partition: ordinal
    hdb:
      type: local
      baseURI: file:///data/db/hdb
      partition: date
      dependency:
        - idb
  sm:
    source: south
    tiers:
      - name: streaming
        mount: rdb
      - name: interval
        mount: idb
        schedule:
          freq: 00:10:00
          snap: 00:00:00
      - name: ondisk
        mount: hdb
        schedule:
          freq: 1D00:00:00
          snap: 01:35:00
        retain:
          time: 2 Days
      - name: s3
        mount: hdb
        store: s3://examplebucket/db
  dap:
    instances:
      idb:
        mountName: idb
      hdb:
        mountName: hdb
      rdb:
        tableLoad: empty
        mountName: rdb
        source: south
Azure
The name of the example storage container is ms://mycontainer, and we want to save our database to a folder within the container called db.
Our example uses the AZURE_STORAGE_ACCOUNT: iamanexample. This means we will write our database to a folder called db, inside the container mycontainer, for the storage account iamanexample.
Stream processor and table schema sections have been omitted.
First create a new secret with your AZURE_STORAGE_SHARED_KEY.
kubectl create secret generic azure-storage-secret --from-literal=AZURE_STORAGE_SHARED_KEY=${MY_KEY}
Now create your assembly, setting sm.tiers to write to an Azure storage container, and environment variables for storage.
spec:
  env:
    - name: AZURE_STORAGE_ACCOUNT
      value: iamanexample
    - name: AZURE_STORAGE_SHARED_KEY
      valueFrom:
        secretKeyRef:
          name: azure-storage-secret
          key: AZURE_STORAGE_SHARED_KEY
  mounts:
    rdb:
      type: stream
      baseURI: none
      partition: none
      dependency:
        - idb
    idb:
      type: local
      baseURI: file:///data/db/idb
      partition: ordinal
    hdb:
      type: local
      baseURI: file:///data/db/hdb
      partition: date
      dependency:
        - idb
  sm:
    source: south
    tiers:
      - name: streaming
        mount: rdb
      - name: interval
        mount: idb
        schedule:
          freq: 00:10:00
          snap: 00:00:00
      - name: ondisk
        mount: hdb
        schedule:
          freq: 1D00:00:00
          snap: 01:35:00
        retain:
          time: 2 Days
      - name: s3
        mount: hdb
        store: ms://mycontainer/db
  dap:
    instances:
      idb:
        mountName: idb
      hdb:
        mountName: hdb
      rdb:
        tableLoad: empty
        mountName: rdb
        source: south
Apply the assembly with kubectl apply -f object-tier.yml.
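Once applied, you can watch the assembly's pods come up; a quick check, assuming assemblies are exposed as a custom resource in your cluster:
# watch the SM, DAP, and other pods created by the assembly start up
kubectl get pods -w
# inspect the assembly resource itself (assumes the custom resource's plural is 'assemblies')
kubectl get assemblies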