
Ingest Data

This section covers how to insert randomly generated data into your KDB.AI database.

An optional list of metadata columns is included for each document; this enhances filtering during querying and searching.

Real-world applications often involve substantially higher-dimensional vectors than our example.

Generate data

To begin the data ingestion process, first establish a connection with the table and then prepare the data for insertion.

import numpy as np
import pandas as pd

# Assumes an existing KDB.AI session, for example:
# import kdbai_client as kdbai
# session = kdbai.Session(endpoint='http://localhost:8082')

# Connect with the KDB.AI table
documents = session.table('documents')

# Generate data
n_rows = 2000

data = pd.DataFrame({
    'id': np.arange(n_rows, dtype='int16'),
    'tag': np.random.choice([True, False], n_rows),
    'author': [f'author{i}' for i in range(n_rows)],
    'length': np.random.randint(0, 1000, n_rows, dtype='int32'),
    'content': [f'document{i}' for i in range(n_rows)],
    'createdDate': pd.date_range(start='2020-01-01', periods=n_rows, freq='1D'),
    'embeddings': [np.random.rand(12).astype('float32') for _ in range(n_rows)]
})

JSON data can be generated in any language. Below, .j.j from q is used to generate example data; its output serves as the reference for how table data should be formatted.

n: 2000;
// Note: for simplicity we do not randomize content, createdDate, or embeddings
rows:.j.j flip `id`tag`author`length`content`createdDate`embeddings!(n?0h;n?0b;n?`bob`joe`jill;n?1000;n#enlist .Q.an;n#.z.p;n#enlist "e"$til 12);

An example preview of this data, with n set to 2, would be:

[
  {
    "id": 21212,
    "tag": true,
    "author": "jill",
    "length": 68,
    "content": "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789",
    "createdDate": "2023-10-11T00:00:00.000000000",
    "embeddings": [0,1,2,3,4,5,6,7,8,9,10,11]
  },
  {
    "id": 19376,
    "tag": false,
    "author": "joe",
    "length": 626,
    "content": "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789",
    "createdDate": "2023-10-11T00:00:00.000000000",
    "embeddings": [0,1,2,3,4,5,6,7,8,9,10,11]
  }
]

For the insert, only include fields that exist in the target table. KDB.AI handles null values in metadata columns gracefully; however, nulls are not accepted in vector columns.

Ensure precise alignment of data types with the table schema to avoid exceptions.
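For example, if a dataframe was built with pandas' default 64-bit dtypes, you can coerce it to match the example schema before inserting. This is a minimal sketch using pandas' astype; the column names and dtypes are taken from the schema used above:

# Coerce columns to the dtypes the example schema expects; default
# 64-bit dtypes (int64/float64) would otherwise risk a type exception
data = data.astype({'id': 'int16', 'tag': 'bool', 'length': 'int32'})
data['embeddings'] = data['embeddings'].apply(
    lambda v: np.asarray(v, dtype='float32'))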

JSON row format

JSON supports the following types:

  • String
  • Number
  • Object
  • Array
  • Boolean
  • Null

JSON row data must be formatted as one of the above types based on the database's column schema.

For example, a table with a boolean column requires the row data to be a literal boolean of true or false. Parsing the value from the strings "true" or "false", or from the numbers 0 and 1, is not supported.
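One way to guarantee correct literals is to build rows as native Python values and serialize them with the standard json module. This is a minimal sketch, independent of any KDB.AI API:

import json

# json.dumps emits native JSON types: Python's True becomes the literal
# true, never the string "true" or the number 1
row = {'id': 1, 'tag': True, 'embeddings': [0.0] * 12}
print(json.dumps(row))  # {"id": 1, "tag": true, ...}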

For the following database types, the corresponding JSON formats must be used:

Database type ID | Database type name | JSON data format | Description | Example
-1h | boolean | Boolean | A Boolean literal of true or false | true
-2h | guid | String | A 36 character UUIDv4 formatted String | "77579e36-71e7-d395-5551-5a4221e86e2b"
-4h | byte | String | A 2 character hex string | "ff"
-5h | short | Number | A 16 bit number | 32767
-6h | int | Number | A 32 bit number | 2147483647
-7h | long | Number | A 64 bit number | 4611686018427387904
-8h | real | Number | A 32 bit floating point number | 3.14159265
-9h | float | Number | A 64 bit floating point number | 3.14159265
-10h | char | String | A string of a single character | "a"
-11h | symbol | String | A string representing a symbol | "BTC"
-12h | timestamp | String | An ISO date-time without an offset, in format yyyy-MM-dd'T'HH:mm:ss with optional 0-9 decimals for nanoseconds | "2023-01-01T00:00:00.000000000"
-13h | month | String | A string representing a month in format yyyy-MM | "2023-10"
-14h | date | String | A string representing an ISO date without an offset, in format yyyy-MM-dd | "2023-10-01"
-15h | datetime | String | An ISO date-time without an offset, in format yyyy-MM-dd'T'HH:mm:ss with optional 0-3 decimals for milliseconds | "2023-01-01T00:00:00.000"
-16h | timespan | String | A duration of time in format d'D'HH:mm:ss with optional 0-9 decimals for nanoseconds | "0D00:00:00.000000005"
-17h | minute | String | A duration of time in HH:mm | "22:59"
-18h | second | String | A duration of time in HH:mm:ss | "22:59:13"
-19h | time | String | A duration of time in HH:mm:ss with optional 0-9 decimals for nanosecond precision | "22:59:13.000000000"
1h | booleans | Array<Boolean> | Array of boolean values | [true, false]
2h | guids | Array<String> | Array of guid values | ["09c4f826-b3f2-e699-c7ff-5195d89a0925", "c0b9ec94-87b6-b0e9-3427-312d62aaec9c"]
4h | bytes | Array<String> | Array of byte values | ["ff", "0e"]
5h | shorts | Array<Number> | Array of short values | [32767, -32767]
6h | ints | Array<Number> | Array of int values | [2147483647, -2147483647]
7h | longs | Array<Number> | Array of long values | [4611686018427387904, -4611686018427387904]
8h | reals | Array<Number> | Array of real values | [1.1, 1.2]
9h | floats | Array<Number> | Array of float values | [1.1, 1.2]
10h | string | String | A string of text | "abcdef"
11h | symbols | Array<String> | Array of symbol values | ["BTC", "MSFT"]
12h | timestamps | Array<String> | Array of timestamp values | ["2023-01-01T00:00:00.000000000", "2023-01-02T00:00:00.000000000"]
13h | months | Array<String> | Array of month values | ["2023-10", "2023-11"]
14h | dates | Array<String> | Array of date values | ["2023-10-01", "2023-10-02"]
15h | datetimes | Array<String> | Array of datetime values | ["2023-01-01T00:00:00.000", "2023-01-02T00:00:00.000"]
16h | timespans | Array<String> | Array of timespan values | ["0D00:00:00.000000005", "0D00:00:00.000000006"]
17h | minutes | Array<String> | Array of minute values | ["22:59", "23:00"]
18h | seconds | Array<String> | Array of second values | ["22:59:13", "22:59:14"]
19h | times | Array<String> | Array of time values | ["22:59:13.000000000", "22:59:14.000000000"]

If null is provided for a value, it is replaced by the type's relevant empty string or largest negative number.

JSON null is not supported within arrays. If you require a data point to be numerically null, set the value to the largest negative number appropriate.
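For example, for an array of 16 bit numbers, the substitute value would be the most negative int16. A minimal sketch, assuming NumPy is available:

import numpy as np

# JSON null is not supported inside arrays, so substitute the largest
# negative value for the target dtype (here int16) before inserting
NULL_INT16 = np.iinfo(np.int16).min  # -32768
values = [3, None, 7]
clean = [NULL_INT16 if v is None else v for v in values]  # [3, -32768, 7]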

For columns containing multiple data types, no type conversion takes place and all numbers are stored as float. This results in suboptimal data storage and compression.

Insert

Now you can populate your table with data.

Populate the documents table with the above dataframe.

documents.insert(data)

Alternatively, populate the documents table with a curl HTTP request.

Save the following to a local file named insert.json:

{
  "table": "documents",
  "rows": [
    {
      "id": 21212,
      "tag": true,
      "author": "jill",
      "length": 68,
      "content": "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789",
      "createdDate": "2023-10-11T00:00:00.000000000",
      "embeddings": [0,1,2,3,4,5,6,7,8,9,10,11]
    },
    {
      "id": 19376,
      "tag": false,
      "author": "joe",
      "length": 626,
      "content": "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789",
      "createdDate": "2023-10-11T00:00:00.000000000",
      "embeddings": [0,1,2,3,4,5,6,7,8,9,10,11]
    }
  ]
}

Ensure that the @ sign is included before the filename; otherwise, curl sends the filename itself as the request body instead of the file contents.

curl -H 'Content-Type: application/json' -d @insert.json localhost:8082/api/v1/insert

If the supplied embeddings are not the same dimension as the table, the operation fails.
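A quick client-side check can catch dimension mismatches before the round trip. A sketch, where 12 is the embedding dimension used in the example data:

# Verify every embedding matches the table's vector dimension
dims = 12
bad = [i for i, v in enumerate(data['embeddings']) if len(v) != dims]
assert not bad, f'rows with wrong embedding dimension: {bad}'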

When working with text, chunking refers to how source material is broken up before being passed to a model to create embeddings. For example, you can choose between:

  • Processing an entire document and generating a single embedding
  • Separating the document into paragraphs or sentences and generating embeddings at that level

Typically, you would chunk sentences, which results in multiple rows per document.
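As an illustration, a naive sentence-level chunker might look like the following sketch, where embed() is a hypothetical stand-in for your embedding model:

def chunk_sentences(doc_id, text):
    # Naive split on full stops; real applications would use a proper
    # sentence tokenizer. One row is produced per sentence.
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    return pd.DataFrame({
        'id': [doc_id] * len(sentences),
        'content': sentences,
        'embeddings': [embed(s) for s in sentences],  # embed() is hypothetical
    })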

The KDB.AI Cloud client has a maximum insert size limit of 10MB.

Batch inserts

KX recommends inserting data in batches by passing a dataframe to the insert method; this can significantly improve performance in some cases. For large dataframes, split the dataframe into batches. The fewer batches sent, the more efficient the inserts are.

from tqdm import tqdm 
n = 200  # batch size

# Insert the dataframe in batches of n rows
for i in tqdm(range(0, data.shape[0], n)):
    documents.insert(data.iloc[i:i+n].reset_index(drop=True))
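To size batches against the Cloud client's 10MB limit, you can estimate bytes per row from the dataframe's in-memory footprint. This is only a rough sketch; memory_usage approximates, but does not equal, the serialized payload size:

# Derive a batch size from average bytes per row, with 20% headroom
bytes_per_row = data.memory_usage(deep=True).sum() / len(data)
n = max(1, int(10_000_000 * 0.8 // bytes_per_row))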

Insert data with caution: the current version of KDB.AI does not support individual record updates or deletions.

Note that batch and chunk are not synonymous:

  • Chunking is determined by how the source text is broken down for embedding purposes.
  • Batching is determined by the total size of your dataframe and the number of rows that can be inserted into KDB.AI at a time.

Tip: Chunk as appropriate, but pack as much data into each batch you insert as possible.

Train index

The IVF and IVFPQ indexes must be trained before they can be used for similarity searching. When creating a new table with either of these indexes, you must include the additional parameter trainingVectors within the vectorIndex attribute of the schema. This parameter specifies the number of vectors required for training.

The train function operates in a manner similar to the insert function, with the key distinction being that it caches the inserted data until the specified threshold, set with trainingVectors, is reached. Once this threshold is met, the training of the index occurs, followed by the insertion of the data.

# Following a similar schema but with the index set as ivfpq
schema = {'columns': [
        {'name': 'id', 'pytype': 'int16'},
        {'name': 'tag', 'pytype': 'bool'},
        {'name': 'author', 'pytype': 'str'},
        {'name': 'length', 'pytype': 'int32'},
        {'name': 'content', 'pytype': 'str'},
        {'name': 'createdDate', 'pytype': 'datetime64[ns]'},
        {'name': 'embeddings',
            'vectorIndex': {'type': 'ivfpq', 
                            'metric': 'L2', 
                            'nsplits':4,
                            'trainingVectors': 1000}}]}

documentsIVFPQ = session.create_table('documentsIVFPQ', schema)   

# Add a subset of the data
documentsIVFPQ.train(data[:500])

# Return: '500 records cached for training, threshold is 1000'

# Add the rest of the training data to surpass the training threshold
documentsIVFPQ.train(data[500:1100])

# Return: 'Index training complete, 100 extra rows discarded'

Training data is provided identically to an insert, except that the URL is /api/v1/train.

To train an index using the same data from the insert example (insert.json), first create a table with a trainable index. Save the following schema to a local file named trainSchema.json:

{
  "type": "splayed",
  "columns": [
    {"name": "id", "type": "short"},
    {"name": "tag", "type": "boolean"},
    {"name": "author", "type": "char"},
    {"name": "length", "type": "int"},
    {"name": "content", "type": "char"},
    {"name": "createdDate", "type": "timestamp"},
    {
      "name": "embeddings",
      "type": "reals",
      "vectorIndex": {
        "type": "ivfpq",
        "metric": "L2",
        "nsplits": 4,
        "trainingVectors": 4
      }
    }
  ]
}

# Create the table from the schema
curl -H 'Content-Type: application/json' -d @trainSchema.json localhost:8082/api/v1/config/table/documentsIVFPQ

# Update the table name in insert.json to documentsIVFPQ, then train
curl -H 'Content-Type: application/json' -d @insert.json localhost:8082/api/v1/train

Training must occur before data insertion. Vectors used for training are not included in the index and they won't be available for search operations.
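Because training vectors are discarded, insert the data you want to be searchable once training completes. A sketch continuing the earlier Python example:

# Rows inserted after training are indexed and become searchable
documentsIVFPQ.insert(data.reset_index(drop=True))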

Next steps

Now that you have data in your table, you can query it.