Batch ingest

Batch ingest lets you backfill static data directly into the historical tier of an existing database. It is a useful option for reducing the memory footprint when importing large amounts of static data. Batch ingest works by replacing partitions of a given partitioned table with a static copy provided to the database, so it is best used for replacing empty partitions with backfilled data.

Data replacement

Batch ingest replaces partitions in the HDB with a new version.

Initial import

Batch ingest is similar to an initial import, except that it is intended for an existing database rather than an empty one.

Data organization

For batch ingestion, data must be located in the staging directory in the HDB root. This location is pointed to by the baseURI of the HDB mount. Each top-level directory inside staging is treated as the session name for an ingestion. The content of that directory should be a partitioned database containing only the tables related to the ingestion. The example below shows a schema definition followed by a matching directory layout.

Example schema definition

tables:
  trace:
    description: Manufacturing trace data
    type: partitioned
    blockSize: 10000
    prtnCol: updateTS
    sortColsOrd: [sensorID]
    sortColsDisk: [sensorID]
    columns:
      - name: sensorID
        description: Sensor Identifier
        type: int
        attrMem: grouped
        attrOrd: parted
        attrDisk: parted
      - name: readTS
        description: Reading timestamp
        type: timestamp
      - name: captureTS
        description: Capture timestamp
        type: timestamp
      - name: valFloat
        description: Sensor value
        type: float
      - name: qual
        description: Reading quality
        type: byte
      - name: alarm
        description: Enumerated alarm flag
        type: byte
      - name: updateTS
        description: Ingestion timestamp
        type: timestamp

/data/db/hdb/staging/backfill
├── 2023.01.01
│   └── trace
│       ├── alarm
│       ├── qual
│       ├── readTS
│       ├── sensorID
│       ├── updateTS
│       └── valFloat
├── 2023.01.03
│   └── trace
│       ├── alarm
│       ├── qual
│       ├── readTS
│       ├── sensorID
│       ├── updateTS
│       └── valFloat
└── sym

In this scenario, the table trace for the dates 2023.01.01 and 2023.01.03 will be overwritten with the content of these directories when a session is started for the backfill directory.
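Because a session overwrites whole partitions, it can help to preview which date and table pairs a session directory contains before starting it. The following is a minimal sketch using only the Python standard library; the function name `partitions_to_replace` is hypothetical and not part of kdb Insights Enterprise, and it assumes date partitions are named `YYYY.MM.DD` as in the layout above.

```python
import re
from pathlib import Path

# Date partition directories are named YYYY.MM.DD, as in the example layout.
DATE_PARTITION = re.compile(r"^\d{4}\.\d{2}\.\d{2}$")


def partitions_to_replace(session_dir):
    """Return sorted (date, table) pairs found under a staging session directory.

    Hypothetical helper for illustration; it only inspects the file system.
    """
    pairs = []
    for date_dir in Path(session_dir).iterdir():
        # Skip entries that are not date partitions, such as the sym file.
        if not (date_dir.is_dir() and DATE_PARTITION.match(date_dir.name)):
            continue
        for table_dir in date_dir.iterdir():
            if table_dir.is_dir():
                pairs.append((date_dir.name, table_dir.name))
    return sorted(pairs)
```

For the example layout, `partitions_to_replace("/data/db/hdb/staging/backfill")` would return `[("2023.01.01", "trace"), ("2023.01.03", "trace")]`.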

Running a batch ingest

Batch ingest sessions in kdb Insights Enterprise are managed by the Stream Processor. To perform a batch ingest, use either .qsp.write.toDatabase or the UI writer.

Use a batch source

Batch ingest currently supports writing only from a batch data source, such as Amazon S3, Azure Blob Storage, Google Cloud Storage, or static files.

Cleaning up a batch ingest

If a batch ingestion session completes successfully, the session directory is cleared automatically. If the ingestion fails, the session directory is left on disk for the user to clean up. An error is reported to the client that triggered the ingest; use it to debug the failure before making another attempt.
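Manual cleanup of a failed session amounts to deleting its directory from staging. The sketch below shows one cautious way to do that with the Python standard library. The function name `remove_failed_session` is hypothetical, and the code assumes that deleting the directory is all the cleanup your deployment requires; confirm that before running anything like it against a real HDB.

```python
import shutil
from pathlib import Path


def remove_failed_session(staging_dir, session_name):
    """Delete a leftover session directory; return True if something was removed.

    Hypothetical helper for illustration, not part of kdb Insights Enterprise.
    """
    staging = Path(staging_dir).resolve()
    session = (staging / session_name).resolve()
    # Refuse anything that is not a direct child of the staging directory.
    if session.parent != staging:
        raise ValueError(f"{session_name!r} is not a session inside {staging}")
    if not session.is_dir():
        return False  # nothing left on disk, e.g. after a successful ingest
    shutil.rmtree(session)
    return True
```

With the example layout, `remove_failed_session("/data/db/hdb/staging", "backfill")` removes the `backfill` session directory and everything under it. The parent check guards against names such as `../hdb` that would escape the staging directory.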