Initial import process
To import data using this method, follow the steps below. Ensure you have met the prerequisites before attempting an initial import.
- Create a tier.
- Create a copy pod.
- Create a schema.
- Create the database.
- Deploy the database.
- Verify the import.
Read our troubleshooting and considerations sections for guidance if you run into any issues.
Create a tier
This is the default mount configuration used when adding a database:
mounts:
  rdb:
    type: stream
    partition: none
    baseURI: none
  idb:
    type: local
    partition: ordinal
    baseURI: file:///data/idb
  hdb:
    type: local
    partition: date
    baseURI: file:///data/hdb
The data is added to the hdb mount and can be imported into one or both of the below tiers:
- Local HDB tier backed by a Persistent Volume Claim (PVC)
- Object Store tier
Local HDB tier
This tier is automatically added when a database is added, but you must hydrate it before the database is started. To do this, you must:
- name the PVC
- configure the size
- determine the appropriate storage class
- define access modes
- create the PVC on the cluster
Name the PVC
The PVC name must be the same as the database name followed by -hdb. This ensures consistency between the PVC and the database it serves, while also distinguishing it as the PVC used for the database. The PVC name should be structured as <database-name>-hdb. For more information on packages, see the kdb Insights Package Documentation.
Example: Database name
metadata:
  name: <database-name>-hdb
Replace <database-name> with the name of your package. For example, if the database name is my-app, you would enter:
metadata:
  name: my-app-hdb
Configure the size
Specify the size in the PVC manifest under the resources section using the appropriate unit (for example, Gi for Gigabytes, Mi for Megabytes). The size of the PVC should be configured based on the storage requirements of the database being imported.
Example: Configure size
spec:
  resources:
    requests:
      storage: 20Gi
In this example, the size is set to 20Gi, but you should adjust the size based on your needs.
Determine the storage class
The storage class defines the type of storage to be used for the PVC (for example, SSD, HDD). To determine which storage class to use, check the available storage classes on your cluster with the following command:
kubectl get storageclass
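For illustration, the output on a typical cluster might look like the following; the class names and provisioners shown here are only examples and vary by cloud provider:

NAME                 PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE
standard (default)   kubernetes.io/gce-pd    Delete          Immediate              true                   30d
premium-rwo          pd.csi.storage.gke.io   Delete          WaitForFirstConsumer   true                   30d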
Then select an appropriate class by specifying it in the storageClassName field in the PVC manifest.
Example: Storage class
spec:
  storageClassName: standard
Replace standard with the appropriate storage class name for your use case, based on the available storage classes in your cluster.
Access mode
The access mode defines how the volume can be mounted by your Pods. For example, in this guide, we use ReadWriteMany (RWX), which allows multiple Pods or nodes to mount the volume with both read and write access simultaneously.
This access mode is crucial if you need shared access between different Pods, such as in a distributed application or for shared storage.
Example: Access mode
spec:
  accessModes:
    - ReadWriteMany
The available access modes are:
- Read Write Once (RWO): Volume can be mounted as read-write by only one node.
- Read Only Many (ROX): Volume can be mounted as read-only by many nodes.
- Read Write Many (RWX): Volume can be mounted as read-write by many nodes (used here for shared access).
Example: PVC configuration for a database import
Access mode is set to ReadWriteMany.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-hdb
spec:
  storageClassName: standard
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
Once the PVC configuration is ready, apply it to your Kubernetes cluster to make the claim.
- Save the PVC YAML to a file: create a YAML file, for example my-app-hdb.yaml, and paste the PVC configuration into the file.
- Apply the PVC: to create the PVC on the cluster, run the following command in your terminal:
kubectl apply -f my-app-hdb.yaml
This sends the PVC configuration to your Kubernetes API server, where it binds to an available Persistent Volume.
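You can confirm that the claim has been created and bound by checking its status; note that with some storage classes the claim remains Pending until a Pod first mounts it:

kubectl get pvc my-app-hdb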
Object store tier
Instead of importing the data onto a local storage device, you can import it straight to object store. To do this:
- add the tier configurations
- create an object store bucket or account
- copy the sym file to the local tier
- copy hdb partitions to the bucket or account
Object store can be directly mounted to an HDB tier by adding a tier entry to the Storage Manager configuration. The examples below add the hdbobj tier to the hdb mount:
GCP:

- name: hdbobj
  mount: hdb
  store: gs://gcp-data/db/
  inventory:
    enabled: true
    location: inventory_db/inventory.tgz

AWS:

- name: hdbobj
  mount: hdb
  store: s3://bucket-name/db/
  inventory:
    enabled: true
    location: inventory_db/inventory.tgz

Azure:

- name: hdbobj
  mount: hdb
  store: ms://azuredata/db/
  inventory:
    enabled: true
    location: inventory_db/inventory.tgz
Using an inventory file is optional; however, it greatly improves the performance of object storage operations. Only the path needs to be specified; Storage Manager creates the inventory file automatically.
Refer to the objstor example for how to create a bucket or storage account. For details on how to create one using the cloud vendor's UI or CLI, check the respective cloud vendor's documentation.
Even if only object storage is used for the import, a local date-partitioned tier must exist (usually hdb) and the sym file must be placed in the root of this tier. If the schema contains any non-partitioned tables, they must be in the local tier root. Read the instructions above for how to create the local tier.
The resulting local tier data directory should look like the below:
tree /data/db/hdb/data
data/hdb/data
└── sym
Database structure
Partitions only in object storage
This scenario resembles the Simple partitioned database scenario, except that the location pointed to by the first HDB-based tier contains only the sym file (if applicable): all the partitions exist in object storage. SM adds an entry for it in the generated par.txt.
For example, a bucket containing only date partitions might look like this:

aws s3 ls s3://historical-data/db
    PRE 2024.01.01/
    PRE 2024.01.02/

You are now ready to deploy the package. Follow the instructions in the packaging documentation.
Create copy pod
This section provides detailed instructions on how to copy a kdb+ database to a Persistent Volume Claim (PVC) mounted on a Kubernetes Pod. You'll need to create a YAML configuration file that specifies how the PVC should be mounted to your Kubernetes Pod. Once the file is created, you'll apply it to your cluster to deploy the Pod and attach the PVC.
Step 1: Create the YAML Configuration File
Create a file (for example, my-app-pod.yaml) with the following content to specify your Pod and PVC configuration:
apiVersion: v1                      # API version for defining a Kubernetes Pod
kind: Pod                           # Specifies that this resource is a Pod
metadata:
  name: my-app-pod                  # Name of the Pod
spec:
  securityContext:
    fsGroup: 2000
  containers:
    - name: my-app-container        # Name of the container in the Pod
      image: amazonlinux:2          # Container image to run
      stdin: true
      command: [ "/bin/bash", "-c" ]
      args:
        - |
          yum install -y rsync
          mkdir /data/db/hdb/data/
          while true; do sleep 30; done;
      volumeMounts:
        - mountPath: /data/db/hdb   # Directory inside the container where the PVC is mounted
          name: my-app-storage      # Name of the volume (linked to the PVC)
  volumes:
    - name: my-app-storage          # Volume name, referred to in the container volumeMounts
      persistentVolumeClaim:
        claimName: my-app-hdb       # Name of the PVC being mounted to the Pod
Step 2: Apply the YAML Configuration
To deploy the Pod and mount the PVC, apply the YAML file using the following command:
kubectl apply -f my-app-pod.yaml
Step 3: Verify the Pod Status
After applying the YAML configuration, verify that the Pod is running and the PVC is successfully mounted using the following command:
kubectl get pod my-app-pod
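To additionally confirm that the volume is mounted inside the container, you can check the mount from within the Pod once it is in the Running state (a quick sanity check, not a required step):

kubectl exec my-app-pod -- df -h /data/db/hdb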
Copy data
Copy the sym file
The sym file must always be copied to the data directory of the first hdb tier.
kubectl cp sym my-app-pod:/data/db/hdb/data/
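To verify the copy, you can list the tier's data directory from inside the Pod:

kubectl exec my-app-pod -- ls -l /data/db/hdb/data/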
Copy partitions to local tier (optional)
Step 1: Create the krsync.sh Script
This script allows rsync to use kubectl to transfer files to a Pod. It includes batching for larger files and retry logic to handle transfer failures. Save this script as krsync.sh:
#!/bin/bash

# Check if the script is being run as the rsync remote shell helper (via --rsh)
if [ -z "$RSYNC_RUNNING" ]; then
    # Check for arguments: local directory, pod name, and target directory
    if [ "$#" -ne 3 ]; then
        echo "Usage: $0 <local_directory> <pod_name> <target_directory_in_pod>"
        exit 1
    fi

    local_dir=$1
    pod=$2
    target_dir=$3

    # Set the maximum number of retries
    max_retries=5
    retry_count=0
    success=0

    # Export variable to indicate the script is in rsync mode
    export RSYNC_RUNNING=true

    # Retry loop
    while [ $retry_count -lt $max_retries ]; do
        echo "Attempt $(($retry_count + 1)) of $max_retries..."

        # Run rsync, passing in the arguments
        rsync -av --progress --stats -e "bash $0" "$local_dir" "$pod:$target_dir"

        # Check if the rsync command was successful
        if [ $? -eq 0 ]; then
            echo "Rsync successful on attempt $(($retry_count + 1))"
            success=1
            break
        else
            echo "Rsync failed on attempt $(($retry_count + 1))"
        fi

        # Increment retry counter and wait for a bit before retrying
        retry_count=$(($retry_count + 1))
        sleep 10
    done

    # If rsync failed after all attempts
    if [ $success -eq 0 ]; then
        echo "Rsync failed after $max_retries attempts"
        exit 1
    fi

    exit 0
fi

# This part only runs when the script is invoked as the remote shell helper for rsync
pod=$1
shift
kubectl exec -i $pod -- "$@"
Step 2: Run the krsync.sh Script
To copy files from your local machine to the Pod, use the krsync.sh script with the following command:
bash krsync.sh <local-files> <pod-name> <mount-path>
This command ensures that files are transferred between your local machine and the Kubernetes Pod efficiently, with retry functionality and batching for larger transfers.
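For example, to copy a single date partition from a local HDB into the tier's data directory, you might run the following; the local path here is illustrative and depends on where your existing database lives:

bash krsync.sh ./hdb/2024.01.01 my-app-pod /data/db/hdb/data/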
In this section, you have learned how to mount a Persistent Volume Claim (PVC) to a Kubernetes Pod, verify its deployment, and transfer data using a custom rsync script. By following these steps, you can ensure that your data is successfully copied to the PVC and available for use within your Kubernetes cluster.
Copy partitions to object store (optional)
See objstor example for an example of how to copy data to a bucket or storage account.
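As a rough sketch, copying local date partitions to an S3 bucket with the AWS CLI might look like the following; the bucket name and local path are illustrative, and the sym file stays on the local tier rather than in the bucket:

aws s3 sync /data/db/hdb/data/ s3://aws-data/db/ --exclude "sym"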
The resulting bucket could look something like the below:
$ gsutil ls gs://gcp-data/db/ | head -5
gs://gcp-data/db/2020.01.01/
gs://gcp-data/db/2020.01.02/
gs://gcp-data/db/2020.01.03/
gs://gcp-data/db/2020.01.06/
gs://gcp-data/db/2020.01.07/
$ aws s3 ls s3://aws-data/db/ | head -5
PRE 2020.01.01/
PRE 2020.01.02/
PRE 2020.01.03/
PRE 2020.01.06/
PRE 2020.01.07/
$ az storage blob list --account-name azuredata \
--container-name azuredata | jq -r '.[] | .name' | tail -5
db/2020.12.30/trade/size
db/2020.12.30/trade/stop
db/2020.12.30/trade/sym
db/2020.12.30/trade/time
Create schema
To use an existing kdb+ database in kdb Insights, ensure you have the configuration files defining the schema and other table attributes. You can create them manually by following the documentation, or generate them automatically with this script.
The output needs to be manually edited before feeding it into kdb Insights, as some necessary information cannot be automatically determined.
Run the script
Before you run the script, ensure you have an environment with kdb+ installed. Copy the script into a file called create-schema.q and invoke it with q create-schema.q <options>.
Use the output
The output of this script is not a valid JSON/YAML document. This is because some of the configuration information required by kdb Insights cannot be deduced from the HDB on disk alone. The output begins with a plain English header explaining the changes to be made. Make sure this is deleted once you have finished configuring the document, otherwise it won't be valid. The document continues with @EDITME@ in place of values that must be manually added. If you are using the -fmt package option, every output file must be edited individually.
Example schema definition
tables:
  trade:
    description: Trade data
    type: partitioned
    prtnCol: time
    sortColsOrd: sym
    sortColsDisk: sym
    columns:
      - name: time
        description: Time
        type: timestamp
      - name: sym
        description: Symbol name
        type: symbol
        attrMem: grouped
        attrDisk: parted
        attrOrd: parted
      - name: price
        description: Price
        type: float
      - name: size
        description: Size
        type: long
  exchange:
    description: Exchange
    type: splayed
    primaryKeys: [id]
    columns:
      - name: id
        description: ID
        type: symbol
      - name: descr
        description: Description
        type: string
  instrument:
    description: Instrument
    type: basic
    primaryKeys: [id]
    columns:
      - name: id
        description: Key
        type: symbol
      - name: descr
        description: Description
        type: string
      - name: currency
        description: Currency
        type: symbol
Create the database
When using the yaml output format, after you finish editing the generated schema files you must add them to an assembly file used by the kdb Insights SDK. Refer to the schema configuration instructions.
Deploy the database
To deploy the database, use Docker Compose as outlined in the Docker Deployment Guide.
For Initial Import, an extra step must be completed before running the deployment: copy your existing kdb+ database to the db/ directory established here and name it data.
Database validation
The SM validates the database against the schema configuration within the assembly to ensure that it conforms and is operational. If the SM validation finds any issues with the database, it provides details in the logs on which validation failed and what needs to be addressed before terminating. In this scenario, you can take SM offline and resolve the validation failures locally before attempting to re-initialize SM.
SM checks the size of the database prior to carrying out the validation. The size is measured by the total number of files the database has under its root. By default, this threshold is set to 1,000,000 files. If this threshold is exceeded, the validation carries out spot checks on a reduced number of partitions: for example, for 1 year of partitions, 50% of partitions are validated; for 50 years, 5% of partitions are validated. The threshold can be overridden by setting the KXI_VALIDATION_MAX_FILES environment variable. To enable a full database validation, KXI_VALIDATION_MAX_FILES can be set to either 0W or infinity.
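For example, to force a full validation in a local Docker Compose deployment you could export the variable before starting Storage Manager; where exactly you set it depends on your deployment method, so treat this as a sketch:

export KXI_VALIDATION_MAX_FILES=0W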
Verify Import
Database status
To check the status of the database using the REST API, use the /database/{}/status endpoint:
curl -L "https://${INSIGHTS_HOSTNAME}/servicegateway/database/example-db/status"
If successful, it returns a response such as:
{"state":"normal","encryption":"decrypted","progress":{...},"memory":{...}}
The state field being normal indicates that SM has started up without any errors.
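To extract just the state field, you can pipe the response through jq (assuming jq is available on your machine):

curl -sL "https://${INSIGHTS_HOSTNAME}/servicegateway/database/example-db/status" | jq -r .state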
Query imported data
See Querying Data for details on how to query data.
It is possible to have partitions located in object storage: set the store property of the last HDB-based tier to point to it, for example s3://historical-data/db, and SM adds an entry for it in the generated par.txt.
Summary
Now that you have imported your database and queried it, you can use it to provide analytics. If you ran into any issues, refer to our troubleshooting page for help.