Batch ingest
Batch ingest allows you to backfill static data directly to the historical tier of an existing database. This is a useful option for reducing memory footprint when importing large amounts of static data. Batch ingest works by replacing partitions for a given partitioned table with a static copy provided to the database. This is best used for replacing empty partitions with backfilled data.
Data replacement
Batch ingest replaces partitions in the HDB with a new version.
Initial import
Batch ingest is similar to an initial import except it is intended for an existing database instead of an empty database.
Data organization
For batch ingestion, data must be located in the staging directory in the HDB root. This location is pointed to by the baseURI of the HDB mount. The top-level directory is treated as the session name for the ingestion. The content within the directory should be a partitioned database containing only the tables related to the ingestion. Below is an example schema definition and directory layout.
Example schema definition
tables:
  trace:
    description: Manufacturing trace data
    type: partitioned
    blockSize: 10000
    prtnCol: updateTS
    sortColsOrd: [sensorID]
    sortColsDisk: [sensorID]
    columns:
      - name: sensorID
        description: Sensor Identifier
        type: int
        attrMem: grouped
        attrOrd: parted
        attrDisk: parted
      - name: readTS
        description: Reading timestamp
        type: timestamp
      - name: captureTS
        description: Capture timestamp
        type: timestamp
      - name: valFloat
        description: Sensor value
        type: float
      - name: qual
        description: Reading quality
        type: byte
      - name: alarm
        description: Enumerated alarm flag
        type: byte
      - name: updateTS
        description: Ingestion timestamp
        type: timestamp
/data/db/hdb/staging/backfill
├── 2023.01.01
│   └── trace
│       ├── alarm
│       ├── qual
│       ├── readTS
│       ├── sensorID
│       ├── updateTS
│       └── valFloat
├── 2023.01.03
│   └── trace
│       ├── alarm
│       ├── qual
│       ├── readTS
│       ├── sensorID
│       ├── updateTS
│       └── valFloat
└── sym
In this scenario, the table trace for the dates 2023.01.01 and 2023.01.03 will be overwritten with the content of these directories when a session is started for the backfill directory.
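Before starting a session, the staging layout can be sanity-checked with ordinary shell commands. The sketch below builds a throwaway copy of the example layout in a temporary directory and verifies that every date partition contains the trace table; the temp-dir path is a stand-in, not a real deployment location.

```shell
# Build a throwaway copy of the example staging layout under a temp dir.
SESSION=$(mktemp -d)/backfill
for d in 2023.01.01 2023.01.03; do
  mkdir -p "$SESSION/$d/trace"
done
touch "$SESSION/sym"

# Check that every date partition contains the table to be ingested.
ok=1
for part in "$SESSION"/2023.*; do
  [ -d "$part/trace" ] || ok=0
done
echo "layout ok: $ok"
```

The same loop pointed at the real staging directory catches partitions that are missing the table before the Storage Manager ever sees them.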
Running a batch ingest
Once data has been written to the staging directory, a batch ingest can be triggered using the REST interface provided by the Storage Manager (SM) container.
Creating an ingest session
To start an ingestion session, a POST request must be issued to the SM container's /ingest endpoint. The body of the request is a JSON object that indicates the name of the ingest session. The response is 200 if the session started successfully, 404 if the session directory was not found, or 409 if a session with the given name already exists.
Storage Manager URL
In the example below, $SM is intended to be the hostname and port of the SM container. If running in Docker, this could be localhost with a forwarded port. If running in Kubernetes, this should be the SM pod name and the port of the SM process.
curl -X POST "$SM/ingest" -H 'Content-Type: application/json' -d '{"name":"backfill"}'
The default mode for batch ingest is overwrite, which, as the name suggests, overwrites existing data for the same date. The other mode of operation is merge, which appends incoming data to any data already ingested.
curl -X POST "$SM/ingest" -H 'Content-Type: application/json' -d '{"name":"backfill2", "mode":"merge"}'
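The request bodies above can be assembled with a small helper when scripting; the ingest_body function below is illustrative only and not part of the Storage Manager API. It emits the default (overwrite) body when no mode is given.

```shell
# Illustrative helper: build the JSON body for POST /ingest.
# $1 = session name, $2 = optional mode ("overwrite" is the default).
ingest_body() {
  if [ -n "${2:-}" ]; then
    printf '{"name":"%s", "mode":"%s"}\n' "$1" "$2"
  else
    printf '{"name":"%s"}\n' "$1"
  fi
}

ingest_body backfill           # {"name":"backfill"}
ingest_body backfill2 merge    # {"name":"backfill2", "mode":"merge"}
```

The output can then be passed to curl, e.g. -d "$(ingest_body backfill2 merge)".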
Checking an ingest session status
Once the session is running, the /ingest/{name} endpoint can be used to check the status. This endpoint reports which stage the ingestion is in, whether it has completed, and whether an error occurred.
curl -X GET "$SM/ingest/backfill"
{
  "status": "completed",
  "error": ""
}
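In a script, the status field can be extracted from this response to decide whether to keep polling. The sketch below parses the sample body shown above with sed, so it runs without a live Storage Manager; against a real deployment the response would come from curl instead.

```shell
# Sample response body, as shown above; a live check would fetch it with:
#   resp=$(curl -s "$SM/ingest/backfill")
resp='{"status": "completed", "error": ""}'

# Extract the value of the "status" field.
status=$(printf '%s' "$resp" | sed -n 's/.*"status": *"\([^"]*\)".*/\1/p')

case "$status" in
  completed) echo "ingest finished" ;;
  errored)   echo "ingest failed" ;;
  *)         echo "still running: $status" ;;
esac
```

Wrapping the fetch-and-case logic in a loop with a sleep gives a simple poller that exits on completed or errored.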
Error response
If an ingestion session has errored, the status API will return a code of 200 but will indicate the error in the JSON response.
{
  "status": "errored",
  "error": "No such directory 'backfill'"
}
Cleaning up a batch ingest
If a batch ingestion session completes successfully, the session directory is cleared automatically. If the ingestion fails, the session directory is left on disk for the user to clean up. An error is reported to the client that triggered the ingest, which can help debug the failure before another attempt.
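The manual cleanup after a failed session is an ordinary directory removal. The sketch below uses a throwaway temp directory as a stand-in for the real staging session path (which depends on your deployment): inspect what was staged, then remove it before staging a corrected copy.

```shell
# Stand-in for a leftover session directory after a failed ingest;
# in practice this would be <HDB root>/staging/<session name>.
SESSION=$(mktemp -d)/backfill
mkdir -p "$SESSION/2023.01.01/trace"

ls -R "$SESSION"   # inspect what was staged before discarding it
rm -r "$SESSION"   # remove the session directory before retrying

[ -d "$SESSION" ] || echo "session directory removed"
```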