Ingest Data
This section covers how to insert randomly generated data into your KDB.AI database.
An optional list of metadata columns is included for each document; this enhances filtering during querying and searching.
Real-world applications often involve substantially higher-dimensional vectors than our example.
Generate data
To begin the data ingestion process, first establish a connection with the table and then prepare the data for insertion.
import numpy as np
import pandas as pd
# Connect to the existing KDB.AI table (assumes a session has already been created)
documents = session.table('documents')
# Generate data
n_rows = 2000
data = pd.DataFrame({
'id': np.arange(n_rows, dtype='int16'),
'tag': np.random.choice([True, False], n_rows),
'author': [f'author{i}' for i in range(n_rows)],
'length': np.random.randint(0, 1000, n_rows, dtype='int32'),
'content': [f'document{i}' for i in range(n_rows)],
'createdDate': pd.date_range(start='2020-01-01', periods=n_rows, freq='1D'),
'embeddings': [np.random.rand(12).astype('float32') for _ in range(n_rows)]
})
JSON data can be generated in any language. Below, .j.j from q is used to generate example data. The output of .j.j serves as the standard for how table data should be formatted.
n: 2000;
// Note: for simplicity we do not randomize content, createdDate, or embeddings
rows:.j.j flip `id`tag`author`length`content`createdDate`embeddings!(n?0h;n?0b;n?`bob`joe`jill;n?1000;n#enlist .Q.an;n#.z.p;n#enlist "e"$til 12);
An example preview of this data, where n is 2, would be:
[
{
"id": 21212,
"tag": true,
"author": "jill",
"length": 68,
"content": "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789",
"createdDate": "2023-10-11T00:00:00.000000000",
"embeddings": [0,1,2,3,4,5,6,7,8,9,10,11]
},
{
"id": 19376,
"tag": false,
"author": "joe",
"length": 626,
"content": "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789",
"createdDate": "2023-10-11T00:00:00.000000000",
"embeddings": [0,1,2,3,4,5,6,7,8,9,10,11]
}
]
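The same row format can also be produced directly from a pandas dataframe. A minimal sketch, assuming nanosecond-precision ISO timestamps as in the preview above; the converter function is illustrative and not part of the KDB.AI client:

```python
import json

import numpy as np
import pandas as pd

def to_json_rows(df: pd.DataFrame) -> str:
    """Serialize dataframe rows into the JSON row format shown above."""
    def convert(value):
        if isinstance(value, pd.Timestamp):
            # ISO date-time without an offset, padded to nanosecond precision
            frac = value.microsecond * 1000 + value.nanosecond
            return f"{value.strftime('%Y-%m-%dT%H:%M:%S')}.{frac:09d}"
        if isinstance(value, np.ndarray):
            return value.tolist()
        if isinstance(value, (np.bool_, np.integer, np.floating)):
            return value.item()  # unwrap numpy scalars for json
        return value
    records = df.to_dict(orient='records')
    return json.dumps([{k: convert(v) for k, v in r.items()} for r in records])

df = pd.DataFrame({
    'id': np.array([21212, 19376], dtype='int16'),
    'tag': [True, False],
    'createdDate': pd.to_datetime(['2023-10-11', '2023-10-11']),
    'embeddings': [np.arange(12, dtype='float32')] * 2,
})
print(to_json_rows(df))
```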
For the insert, include only fields that exist in the target table. KDB.AI handles null values in metadata columns gracefully; however, vector columns do not accept nulls.
Ensure data types align precisely with the table schema to avoid exceptions.
JSON row format
JSON supports the following types:
- String
- Number
- Object
- Array
- Boolean
- Null
JSON row data must be formatted as one of the above types based on the database's column schema.
For example, a table with a boolean column requires the row data to be a literal boolean of true or false.
Parsing the value from the string "true" or "false", or from the number 0 or 1, is not supported.
For the following database types, the corresponding JSON formats must be used:
Database type ID | Database type name | JSON data format | Description | Example |
---|---|---|---|---|
-1h | boolean | Boolean | A Boolean literal of true or false | true |
-2h | guid | String | A 36 character UUIDv4 formatted string | "77579e36-71e7-d395-5551-5a4221e86e2b" |
-4h | byte | String | A 2 character hex string | "ff" |
-5h | short | Number | A 16 bit number | 32767 |
-6h | int | Number | A 32 bit number | 2147483647 |
-7h | long | Number | A 64 bit number | 4611686018427387904 |
-8h | real | Number | A 32 bit floating point number | 3.14159265 |
-9h | float | Number | A 64 bit floating point number | 3.14159265 |
-10h | char | String | A string of a single character | "a" |
-11h | symbol | String | A string representing a symbol | "BTC" |
-12h | timestamp | String | An ISO date-time without an offset in format yyyy-MM-dd'T'HH:mm:ss with optional 0-9 decimals for nanoseconds | "2023-01-01T00:00:00.000000000" |
-13h | month | String | A string representing a month in format yyyy-MM | "2023-10" |
-14h | date | String | A string representing an ISO date without an offset in format yyyy-MM-dd | "2023-10-01" |
-15h | datetime | String | An ISO date-time without an offset in format yyyy-MM-dd'T'HH:mm:ss with optional 0-3 decimals for milliseconds | "2023-01-01T00:00:00.000" |
-16h | timespan | String | A duration of time with units in nanoseconds in format d'D'HH:mm:ss with optional 0-9 decimals for nanoseconds | "0D00:00:00.000000005" |
-17h | minute | String | A duration of time in HH:mm | "22:59" |
-18h | second | String | A duration of time in HH:mm:ss | "22:59:13" |
-19h | time | String | A duration of time in HH:mm:ss with optional 0-9 decimals for nanosecond precision | "22:59:13.000000000" |
1h | booleans | Array<Boolean> | Array of boolean values | [true, false] |
2h | guids | Array<String> | Array of guid values | ["09c4f826-b3f2-e699-c7ff-5195d89a0925", "c0b9ec94-87b6-b0e9-3427-312d62aaec9c"] |
4h | bytes | Array<String> | Array of byte values | ["ff", "0e"] |
5h | shorts | Array<Number> | Array of short values | [32767, -32767] |
6h | ints | Array<Number> | Array of int values | [2147483647, -2147483647] |
7h | longs | Array<Number> | Array of long values | [4611686018427387904, -4611686018427387904] |
8h | reals | Array<Number> | Array of real values | [1.1, 1.2] |
9h | floats | Array<Number> | Array of float values | [1.1, 1.2] |
10h | string | String | A string of text | "abcdef" |
11h | symbols | Array<String> | Array of symbol values | ["BTC", "MSFT"] |
12h | timestamps | Array<String> | Array of timestamp values | ["2023-01-01T00:00:00.000000000", "2023-01-02T00:00:00.000000000"] |
13h | months | Array<String> | Array of month values | ["2023-10", "2023-11"] |
14h | dates | Array<String> | Array of date values | ["2023-10-01", "2023-10-02"] |
15h | datetimes | Array<String> | Array of datetime values | ["2023-01-01T00:00:00.000", "2023-01-02T00:00:00.000"] |
16h | timespans | Array<String> | Array of timespan values | ["0D00:00:00.000000005", "0D00:00:00.000000006"] |
17h | minutes | Array<String> | Array of minute values | ["22:59", "23:00"] |
18h | seconds | Array<String> | Array of second values | ["22:59:13", "22:59:14"] |
19h | times | Array<String> | Array of time values | ["22:59:13.000000000", "22:59:14.000000000"] |
If null is provided for a value, it is replaced by the relevant empty string or largest negative number for its type.
JSON null is not supported within arrays. If you require a data point to be numerically null, set the value to the largest negative number appropriate for its type.
For columns containing multiple data types, no conversion takes place and all numbers are stored as float. This results in suboptimal data storage and compression.
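Client-side, that rule can be applied before serialization. A sketch under stated assumptions: the sentinels below take "largest negative number" to mean the minimum representable value for each width, and the helper is not part of the KDB.AI client; confirm the sentinel convention against your deployment:

```python
# Sentinel values: the minimum representable value for each numeric width,
# one reading of "largest negative number". Confirm against your deployment.
NULL_SENTINELS = {
    'short': -32768,
    'int': -2147483648,
    'long': -9223372036854775808,
}

def replace_nulls(row: dict, column_types: dict) -> dict:
    """Replace None with a numeric sentinel, or '' for non-numeric columns."""
    return {
        name: (NULL_SENTINELS.get(column_types[name], '') if value is None else value)
        for name, value in row.items()
    }

row = {'id': 7, 'length': None, 'author': None}
types = {'id': 'short', 'length': 'int', 'author': 'symbol'}
print(replace_nulls(row, types))
```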
Insert
Now you can populate your table with data. Populate the documents table with the above dataframe:
documents.insert(data)
Populate the documents table with a curl HTTP request. Save the following to a local file named insert.json:
{
"table": "documents",
"rows": [
{
"id": 21212,
"tag": true,
"author": "jill",
"length": 68,
"content": "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789",
"createdDate": "2023-10-11T00:00:00.000000000",
"embeddings": [0,1,2,3,4,5,6,7,8,9,10,11]
},
{
"id": 19376,
"tag": false,
"author": "joe",
"length": 626,
"content": "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789",
"createdDate": "2023-10-11T00:00:00.000000000",
"embeddings": [0,1,2,3,4,5,6,7,8,9,10,11]
}
]
}
Ensure that the @ sign is included before the filename; otherwise, curl sends the literal string insert.json as the request body instead of the file's contents.
curl -H 'Content-Type: application/json' -d @insert.json localhost:8082/api/v1/insert
If the supplied embeddings are not the same dimension as the table, the operation fails.
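A cheap client-side check can catch dimension mismatches before the request is sent. A minimal sketch; EXPECTED_DIMS is an assumption matching the 12-dimensional embeddings used above, and the helper is not part of the KDB.AI client:

```python
import numpy as np
import pandas as pd

EXPECTED_DIMS = 12  # assumed: must match the dims declared in the table schema

def check_embedding_dims(df: pd.DataFrame, column: str = 'embeddings') -> None:
    """Raise before insert if any embedding has the wrong dimension."""
    bad = [i for i, vec in enumerate(df[column]) if len(vec) != EXPECTED_DIMS]
    if bad:
        raise ValueError(f'rows {bad[:5]} do not have {EXPECTED_DIMS}-dim embeddings')

df = pd.DataFrame({'embeddings': [np.zeros(12, dtype='float32') for _ in range(3)]})
check_embedding_dims(df)  # passes silently for well-formed data
```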
When working with text, chunking refers to how source material is broken up before being passed to a model to create embeddings. For example, you can choose between:
- Processing an entire document and generating a single embedding
- Separating the document into paragraphs or sentences and generating embeddings at that level
Typically, you would chunk sentences, which results in multiple rows per document.
The KDB.AI Cloud client has a maximum insert size limit of 10MB.
Batch inserts
KX recommends inserting data in batches by passing a dataframe to the insert method; this can significantly improve performance in some cases. For large dataframes, split the dataframe into smaller batches. The fewer batches sent, the more efficient the inserts are.
from tqdm import tqdm

n = 200  # batch size
for i in tqdm(range(0, data.shape[0], n)):
    documents.insert(data[i:i+n].reset_index(drop=True))
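When payloads approach the 10MB client limit noted earlier, the batch size can be derived from the serialized size of a representative row rather than a fixed count. A sketch; the 0.9 headroom factor is an arbitrary safety margin, and the helper is an illustration, not part of the KDB.AI client:

```python
import json
import math

MAX_PAYLOAD = 10 * 1024 * 1024  # KDB.AI Cloud client insert limit (10MB)

def batch_size_for(sample_row: dict, max_bytes: int = MAX_PAYLOAD) -> int:
    """Estimate rows per batch from the JSON size of a representative row."""
    row_bytes = len(json.dumps(sample_row).encode('utf-8'))
    # 0.9 leaves headroom for the request envelope and per-row variation
    return max(1, math.floor(0.9 * max_bytes / row_bytes))

sample = {'id': 0, 'tag': True, 'author': 'author0', 'length': 500,
          'content': 'document0',
          'createdDate': '2020-01-01T00:00:00.000000000',
          'embeddings': [0.0] * 12}
print(batch_size_for(sample))
```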
Insert data with caution: the current version of KDB.AI does not support individual record updates or deletions.
Note that batch and chunk are not synonymous:
- Chunking is determined by the decision taken on how the source text is broken down for embedding purposes.
- Batching is determined by the total size of your dataframe and the number of rows that can be inserted into KDB.AI at a time.
Tip: Chunk as appropriate, but pack as much data into each batch you insert as possible.
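The chunk-then-batch flow can be sketched end to end. Here chunk_sentences and embed_stub are hypothetical placeholders for a real tokenizer and embedding model:

```python
import zlib

import numpy as np
import pandas as pd

def chunk_sentences(text: str) -> list:
    """Naive sentence chunker; a real pipeline would use a proper tokenizer."""
    return [s.strip() for s in text.split('.') if s.strip()]

def embed_stub(sentence: str, dims: int = 12) -> np.ndarray:
    """Deterministic placeholder embedding; substitute a real model here."""
    rng = np.random.default_rng(zlib.crc32(sentence.encode()))
    return rng.random(dims).astype('float32')

docs = {'doc0': 'First sentence. Second sentence.',
        'doc1': 'Only one sentence here.'}

# One row per chunk: the chunking decision fixes the row count,
# the batching decision fixes how many rows go into each insert call.
chunked = pd.DataFrame(
    [{'content': s, 'author': doc_id, 'embeddings': embed_stub(s)}
     for doc_id, text in docs.items()
     for s in chunk_sentences(text)])
print(len(chunked))  # 3 chunks from 2 documents
```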
Train index
The IVF and IVFPQ indexes must be trained before they can be used for similarity searching. The train function operates in a manner similar to the insert function, except that the payload is used solely to train the index; the data in the payload is not inserted into the index for querying.
# Following a similar schema but with the index set as ivfpq
schema = {'columns': [
{'name': 'id', 'pytype': 'int16'},
{'name': 'tag', 'pytype': 'bool'},
{'name': 'author', 'pytype': 'str'},
{'name': 'length', 'pytype': 'int32'},
{'name': 'content', 'pytype': 'str'},
{'name': 'createdDate', 'pytype': 'datetime64[ns]'},
{'name': 'embeddings',
'vectorIndex': {'type': 'ivfpq',
'metric': 'L2',
'nsplits':4}}]}
documentsIVFPQ = session.create_table('documentsIVFPQ', schema)
# Train on a subset of the data
documentsIVFPQ.train(data[:1000])
# Returns: True
Training data is provided identically to an insert, except that the URL is /api/v1/train.
To train an index using the same data from the insert example, insert.json, save the following schema for the new table to a local file named trainSchema.json, then create the table and run the train request:
{
"type": "splayed",
"columns": [
{"name": "id", "type": "short"},
{"name": "tag", "type": "boolean"},
{"name": "author", "type": "char"},
{"name": "length", "type": "int"},
{"name": "content", "type": "char"},
{"name": "createdDate", "type": "timestamp"},
{
"name": "embeddings",
"type": "reals",
"vectorIndex": {
"type": "ivfpq",
"metric": "L2",
"nsplits": 4
}
}
]
}
curl -H 'Content-Type: application/json' -d @trainSchema.json localhost:8082/api/v1/config/table/documentsIVFPQ
curl -H 'Content-Type: application/json' -d @insert.json localhost:8082/api/v1/train
Training must occur before data insertion. Vectors used for training are not included in the index, so they are not available for search operations.
Next steps
Now that you have data in your table, you can query it.