Initial Import Process

To import data using this method, follow the steps below. Ensure you have met the prerequisites before attempting an initial import.

  1. Create a tier.
  2. Create a copy pod.
  3. Create a schema.
  4. Create the database.
  5. Deploy the database.
  6. Verify the import.

Read our troubleshooting and considerations sections for guidance if you run into any issues.

Create a tier

This is the default mount configuration used when adding a database:

mounts:
  rdb:
    type: stream
    partition: none
    baseURI: none
  idb:
    type: local
    partition: ordinal
    baseURI: file:///data/idb
  hdb:
    type: local
    partition: date
    baseURI: file:///data/hdb

The data is added to the hdb mount and can be imported into one or both of the following tiers:

  1. Local HDB tier backed by a Persistent Volume Claim (PVC)
  2. Object Store tier

Local HDB tier

This tier is added automatically when a database is added, but you must hydrate it before the database is started. To do this, you must:

  • name the PVC
  • configure the size
  • determine the appropriate storage class
  • define access modes
  • create the PVC on the cluster

Name the PVC

The PVC name must be the database name followed by -hdb, that is, <database-name>-hdb. This ties the PVC to the database it serves and distinguishes it as the PVC used for that database. For more information on packages, see the kdb Insights Package Documentation.

Example: Database name

metadata:
  name: <database-name>-hdb

Replace <database-name> with the name of your database. For example, if the database name is my-app, you would enter:

metadata:
  name: my-app-hdb

Configure the size

Specify the size in the PVC manifest under the resources section using the appropriate unit (for example, Gi for gibibytes, Mi for mebibytes). Configure the size of the PVC based on the storage requirements of the database being imported.

Example: Configure size

spec:
  resources:
    requests:
      storage: 20Gi

In this example, the size is set to 20Gi, but you should adjust the size based on your needs.
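
If you are migrating an existing HDB, a quick way to choose a starting size is to measure the database on disk and add headroom. A minimal sketch, assuming the existing database lives at /path/to/hdb (an illustrative path):

du -sh /path/to/hdb    # total on-disk size of the existing database
# Size the PVC with headroom for growth, for example 1.5-2x the current size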

Determine the storage class

The storage class defines the type of storage to be used for the PVC (for example, SSD, HDD). To determine which storage class to use, check the available storage classes on your cluster with the following command:

kubectl get storageclass

Then select an appropriate class by specifying it in the storageClassName field in the PVC manifest.

Example: Storage class

spec:
  storageClassName: standard

Replace standard with the appropriate storage class name for your use case, based on the available storage classes in your cluster.
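
To confirm what a candidate class provides before committing to it, you can inspect its provisioner and parameters (the class name standard here is illustrative):

kubectl describe storageclass standard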

Access mode

The access mode defines how the volume can be mounted by your Pods. For example, in this guide, we use ReadWriteMany (RWX), which allows multiple Pods or nodes to mount the volume with both read and write access simultaneously.

This access mode is crucial if you need shared access between different Pods, such as in a distributed application or for shared storage.

Example: Access mode

spec:
  accessModes:
    - ReadWriteMany

The available access modes are:

  • ReadWriteOnce (RWO): the volume can be mounted as read-write by only one node.
  • ReadOnlyMany (ROX): the volume can be mounted as read-only by many nodes.
  • ReadWriteMany (RWX): the volume can be mounted as read-write by many nodes (used here for shared access).

Example: PVC configuration for a database import

Access mode is set to ReadWriteMany.

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-app-hdb
spec:
  storageClassName: standard
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi

Once the PVC configuration is ready, apply it to your Kubernetes cluster to make the claim.

  1. Save the PVC configuration to a YAML file, for example, my-app-hdb.yaml.
  2. Create the PVC on the cluster by running the following command in your terminal:

kubectl apply -f my-app-hdb.yaml

This sends the PVC configuration to your Kubernetes API server, where it binds to an available Persistent Volume.
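
Before moving on, you can confirm that the claim was created and bound (the PVC name assumes the earlier example):

kubectl get pvc my-app-hdb

The STATUS column should read Bound. Note that some storage classes use volumeBindingMode: WaitForFirstConsumer, in which case the claim stays Pending until the first Pod mounts it.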

Object store tier

Instead of importing the data onto a local storage device, it may be useful to import data straight to object store. To do this:

  • add the tier configurations
  • create an object store bucket or account
  • copy the sym file to the local tier
  • copy hdb partitions to the bucket or account

Object store can be directly mounted to an HDB tier by adding a tier entry to the Storage Manager configuration. The examples below add an hdbobj tier to the hdb mount, one per cloud vendor:

# Google Cloud Storage
- name: hdbobj
  mount: hdb
  store: gs://gcp-data/db/
  inventory:
    enabled: true
    location: inventory_db/inventory.tgz

# Amazon S3
- name: hdbobj
  mount: hdb
  store: s3://bucket-name/db/
  inventory:
    enabled: true
    location: inventory_db/inventory.tgz

# Microsoft Azure Blob Storage
- name: hdbobj
  mount: hdb
  store: ms://azuredata/db/
  inventory:
    enabled: true
    location: inventory_db/inventory.tgz

Using an inventory file is optional; however, it greatly improves the performance of object storage operations. Only the path needs to be specified: Storage Manager creates the inventory file automatically.

Refer to the objstor example for how to create a bucket or storage account. For details on creating one using the cloud vendor's UI or CLI, check the respective vendor's documentation.

Even if only object storage is used for the import, a local date-partitioned tier must exist (usually hdb) and the sym file must be placed in the root of this tier. If the schema contains any non-partitioned tables, they must be in the local tier root. Read the instructions above for how to create the local tier.

The resulting local tier data directory should look like this:

tree /data/db/hdb/data
/data/db/hdb/data
└── sym

Database structure

Partitions only in object storage

This scenario resembles the Simple partitioned database scenario, except that the location pointed to by the first HDB-based tier contains only the sym file (if applicable): all the partitions exist in object storage, and SM adds an entry for them in the generated par.txt.

You are now ready to deploy the package. Follow the instructions in the packaging documentation.

For example, a bucket containing only partitions lists like this:

aws s3 ls s3://historical-data/db
                           PRE 2024.01.01/
                           PRE 2024.01.02/

Create a copy pod

This section provides detailed instructions on how to copy a kdb+ database to a Persistent Volume Claim (PVC) mounted on a Kubernetes Pod. You'll need to create a YAML configuration file that specifies how the PVC should be mounted to your Kubernetes Pod. Once the file is created, you'll apply it to your cluster to deploy the Pod and attach the PVC.

Step 1: Create the YAML Configuration File

Create a file (for example, my-app-pod.yaml) with the following content to specify your Pod and PVC configuration:

apiVersion: v1                        # API version for defining a Kubernetes Pod
kind: Pod                             # Specifies that this resource is a Pod
metadata:
  name: my-app-pod                    # Name of the Pod
spec:
  securityContext:
    fsGroup: 2000
  containers:
    - name: my-app-container          # Name of the container in the Pod
      image: amazonlinux:2            # Container image to run
      stdin: true
      command: [ "/bin/bash", "-c" ]
      args:
        - |
          yum install -y rsync
          mkdir -p /data/db/hdb/data/
          while true; do sleep 30; done;
      volumeMounts:
        - mountPath: /data/db/hdb     # Directory inside the container where the PVC is mounted
          name: my-app-storage        # Name of the volume (linked to the PVC)
  volumes:
    - name: my-app-storage            # Volume name, referenced in the container volumeMounts
      persistentVolumeClaim:
        claimName: my-app-hdb         # Name of the PVC being mounted to the Pod

Step 2: Apply the YAML Configuration

To deploy the Pod and mount the PVC, apply the YAML file using the following command:

kubectl apply -f my-app-pod.yaml

Step 3: Verify the Pod Status

After applying the YAML configuration, verify that the Pod is running and the PVC is successfully mounted using the following command:

kubectl get pod my-app-pod

Ensure the status is "Running" and that the PVC is correctly attached.
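
To double-check the mount from inside the Pod, you can inspect the filesystem directly (names assume the example Pod above):

kubectl exec my-app-pod -- df -h /data/db/hdb
kubectl exec my-app-pod -- ls /data/db/hdb/data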

Copy data

Copy the sym file

The sym file must always be copied to the data directory of the first hdb tier.

kubectl cp sym my-app-pod:/data/db/hdb/data/
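
You can verify that the file landed in the right place (Pod name as above):

kubectl exec my-app-pod -- ls -l /data/db/hdb/data/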

Copy partitions to local tier (optional)

Step 1: Create the krsync.sh Script

This script allows rsync to use kubectl as its transport when transferring files to a Pod, and includes retry logic to handle transfer failures. Save this script as krsync.sh:

#!/bin/bash

# Check if the script is being run as the rsync remote shell helper (via --rsh)
if [ -z "$RSYNC_RUNNING" ]; then
    # Check for arguments: local directory, pod name, and target directory
    if [ "$#" -ne 3 ]; then
        echo "Usage: $0 <local_directory> <pod_name> <target_directory_in_pod>"
        exit 1
    fi

    local_dir=$1
    pod=$2
    target_dir=$3

    # Set the maximum number of retries
    max_retries=5
    retry_count=0
    success=0

    # Export variable to indicate the script is in rsync mode
    export RSYNC_RUNNING=true

    # Retry loop
    while [ $retry_count -lt $max_retries ]; do
        echo "Attempt $(($retry_count + 1)) of $max_retries..."

        # Run rsync, passing in the arguments
        rsync -av --progress --stats -e "bash $0" "$local_dir" "$pod:$target_dir"

        # Check if the rsync command was successful
        if [ $? -eq 0 ]; then
            echo "Rsync successful on attempt $(($retry_count + 1))"
            success=1
            break
        else
            echo "Rsync failed on attempt $(($retry_count + 1))"
        fi

        # Increment retry counter and wait for a bit before retrying
        retry_count=$(($retry_count + 1))
        sleep 10
    done

    # If rsync failed after all attempts
    if [ $success -eq 0 ]; then
        echo "Rsync failed after $max_retries attempts"
        exit 1
    fi

    exit 0
fi

# This part only runs when the script is invoked as the remote shell helper for rsync
pod=$1
shift
kubectl exec -i $pod -- "$@"

Step 2: Run the krsync.sh Script

To copy files from your local machine to the Pod, use the krsync.sh script with the following command:

bash krsync.sh <local-directory> <pod-name> <target-directory-in-pod>

This command transfers files from your local machine to the Kubernetes Pod, retrying automatically if a transfer fails.
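
For example, to copy all date partitions from a local HDB directory into the Pod's data directory (paths are illustrative; the trailing slash copies the directory contents rather than the directory itself):

bash krsync.sh /path/to/hdb/ my-app-pod /data/db/hdb/data/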

In this section, you have learned how to mount a Persistent Volume Claim (PVC) to a Kubernetes Pod, verify its deployment, and transfer data using a custom rsync script. By following these steps, you can ensure that your data is successfully copied to the PVC and available for use within your Kubernetes cluster.

Copy partitions to object store (optional)

See the objstor example for how to copy data to a bucket or storage account.

The resulting bucket or container could look something like this:

$ gsutil ls gs://gcp-data/db/ | head -5
gs://gcp-data/db/2020.01.01/
gs://gcp-data/db/2020.01.02/
gs://gcp-data/db/2020.01.03/
gs://gcp-data/db/2020.01.06/
gs://gcp-data/db/2020.01.07/
$ aws s3 ls s3://aws-data/db/ | head -5
PRE 2020.01.01/
PRE 2020.01.02/
PRE 2020.01.03/
PRE 2020.01.06/
PRE 2020.01.07/
$ az storage blob list --account-name azuredata \
  --container-name azuredata | jq -r '.[] | .name' | tail -5
db/2020.12.30/trade/size
db/2020.12.30/trade/stop
db/2020.12.30/trade/sym
db/2020.12.30/trade/time

Create a schema

To use an existing kdb+ database in kdb Insights, you need configuration files defining the schema and other table attributes. You can write these manually by following the documentation, or generate them automatically with this script.

The output needs to be manually edited before feeding it into kdb Insights, as some necessary information cannot be automatically determined.

Run the script

Before you run the script, ensure you have an environment with kdb+ installed. Copy the script into a file called create-schema.q and invoke it with q create-schema.q <options>.

Use the output

The output of this script is not a valid JSON/YAML document, because some of the configuration information required by kdb Insights cannot be deduced from the HDB on disk alone. The output begins with a plain-English header explaining the changes to be made; delete this header once you have finished configuring the document, otherwise the document is not valid. The document continues with @EDITME@ in place of values that must be added manually. If you are using the -fmt package option, every output file must be edited individually.

Example schema definition

tables:
  trade:
    description: Trade data
    type: partitioned
    prtnCol: time
    sortColsOrd: sym
    sortColsDisk: sym
    columns:
      - name: time
        description: Time
        type: timestamp
      - name: sym
        description: Symbol name
        type: symbol
        attrMem: grouped
        attrDisk: parted
        attrOrd: parted
      - name: price
        description: Price
        type: float
      - name: size
        description: Size
        type: long
  exchange:
    description: Exchange
    type: splayed
    primaryKeys: [id]
    columns:
      - name: id
        description: ID
        type: symbol
      - name: descr
        description: Description
        type: string
  instrument:
    description: Instrument
    type: basic
    primaryKeys: [id]
    columns:
      - name: id
        description: Key
        type: symbol
      - name: descr
        description: Description
        type: string
      - name: currency
        description: Currency
        type: symbol

Create the database

If you are using the package output format, after you have finished editing all the generated schema files, you need to put them into a package. Refer to the Packaging Introduction to understand the steps of creating and deploying a package.

The following example uses a package called mypkg, a database in that package called mydb, and assumes that the generated schemas are stored in a directory called path/to/schemas.

  1. First, navigate to an appropriate working directory and run the following commands:

    kxi package init mypkg
    kxi package add --to mypkg database --name mydb
    
  2. Next, you must add each table with the kxi package add --to mypkg table --name <tablename> command. You can do this manually one at a time, or automate the process with shell commands:

    basename -s .yaml path/to/schemas/*.yaml | xargs -n 1 kxi package add --to mypkg table --name
    
  3. Then, copy your edited schema files and overwrite the default files created by kxi package add:

    cp path/to/schemas/* mypkg/tables/
    
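Before deploying, you can sanity-check that every schema file is now in place under the package (directory names assume the example above):

ls mypkg/tables/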

Alternatively, the schema can be created in the web interface using "Code View". To learn how to create the JSON format using the helper script, read the documentation.

Deploy the database

Refer to the Packaging Introduction to understand the steps of creating and deploying a package.

Database validation

SM validates the database against the schema configuration within the assembly to ensure that it conforms and is operational. If validation finds any issues with the database, SM logs details of which validation failed and what needs to be addressed, then terminates. In this scenario, you can take SM offline and resolve the validation failures locally before re-initializing SM.

SM checks the size of the database prior to carrying out the validation. The size is measured as the total number of files the database has under its root; by default, the threshold is set to 1,000,000 files. If this threshold is exceeded, the validation carries out spot checks on a reduced number of partitions: for example, for 1 year of partitions, 50% of partitions are validated; for 50 years, 5% of partitions are validated. The threshold can be overridden by setting the KXI_VALIDATION_MAX_FILES environment variable. To enable a full database validation, set KXI_VALIDATION_MAX_FILES to either 0W or infinity.
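
Where this variable is set depends on how Storage Manager is deployed in your environment. As a minimal sketch, an environment entry on the SM container could look like the following (placement is illustrative; consult your deployment configuration):

env:
  - name: KXI_VALIDATION_MAX_FILES
    value: "0W"   # forces a full validation regardless of database size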

Verify the import

Database status

First, see Requesting an access token for how to get the value for $INSIGHTS_TOKEN.

To check the status of the database using the REST API, use the /database/<database-name>/status endpoint:

curl -L "https://${INSIGHTS_HOSTNAME}/servicegateway/api/v1/database/example-db/status" \
-H "Authorization: Bearer $INSIGHTS_TOKEN"

If successful, a response similar to the following is returned:

{"state":"normal","encryption":"decrypted","progress":{...},"memory":{...}}

The state field being normal indicates that SM has started up without any errors.

In the web interface, look for a green check mark next to the database name that indicates the database has been successfully deployed.

Query imported data

See Querying Data for details on how to query data.
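
As a quick smoke test, you can run a small query against one of the imported tables over REST. This is a sketch only, assuming the getData API described in Querying Data and the example trade table; the table name and timestamps are illustrative:

curl -X POST "https://${INSIGHTS_HOSTNAME}/servicegateway/kxi/getData" \
  -H "Authorization: Bearer $INSIGHTS_TOKEN" \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -d '{"table":"trade","startTS":"2020-01-01T00:00:00.000000000","endTS":"2020-01-02T00:00:00.000000000"}'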

It is possible to have partitions located in object storage: set the store property of the last HDB-based tier to point to it, for example s3://historical-data/db, and SM adds an entry for it in the generated par.txt.
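
The generated par.txt is a plain text file listing one partition location per line. With a local tier plus an object store tier, it could look something like this (paths are illustrative):

/data/db/hdb/data
s3://historical-data/db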

Summary

Now that you have imported your database and queried it, you can use it to provide analytics. If you ran into any issues, refer to our troubleshooting page for help.