How to perform a hybrid search
This section details how to execute a hybrid search in KDB.AI.
Added in v1.1.0.
Hybrid search in KDB.AI
In KDB.AI, a hybrid search returns results based on both dense and sparse vector search. The process includes querying and re-ranking of results across both vector sets. A parameter named alpha allows you to control the weight given to each of the two searches.
Before we dive in, go to the understanding hybrid search page to make sure you're familiar with dense vectors, sparse vectors, tokenization, and the Best Matching 25 (BM25) algorithm.
A hybrid search blends a dense search with a sparse search. Therefore, you need to perform the following:
Setup
Before you start, make sure you have:
- An active KDB.AI Cloud or Server license
- Installed the latest version of KDB.AI Cloud or Server
- A valid API key if you're using KDB.AI Cloud
- Python Client
1. Dense search
For details on how to perform a standalone dense search, go to the How to conduct a similarity search page.
2. Sparse search
This section details how to perform a standalone sparse search.
2.1 Create table with a sparse index in KDB.AI
To use sparse vectors, you need to create a table with a sparse index.
- Parameters
Option | Description | Type | Required | Default |
---|---|---|---|---|
k | Term saturation | real | false | 1.25 |
b | Document length impact on relevance | real | false | 0.75 |
The values of k and b can be modified in place at search time.
# The 'text' column holds the sparse vectors and carries the sparse index options
schema = {'columns': [
    {'name': 'id', 'pytype': 'int32'},
    {'name': 'sym', 'pytype': 'str'},
    {'name': 'time', 'pytype': 'datetime64[ns]'},
    {'name': 'text', 'pytype': 'dict',
     'sparseIndex': {'k': 1.25, 'b': 0.75}}]}
{
"type": "splayed",
"columns": [
{"name": "id", "type": "int"},
{"name": "sym", "type": "char"},
{"name": "time", "type": "timestamp"},
{
"name": "text",
"type": "",
"sparseIndex": {
"k": 1.25,
"b": 0.75,
}
}
]
}
You can use these schemas to create the tables as shown on the Manage Tables page.
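To build some intuition for the k and b index options, the sketch below uses the textbook BM25 term-weight formula with the IDF factor left out. It is an illustration only: the function name `bm25_term_weight` is hypothetical, and whether KDB.AI's internal scoring matches this exact form is an assumption.

```python
def bm25_term_weight(tf, doc_len, avg_doc_len, k=1.25, b=0.75):
    """Textbook BM25 term-frequency component (IDF factor omitted)."""
    # b controls how strongly the document length is normalized
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # k controls how quickly repeated occurrences of a term saturate
    return tf * (k + 1) / (tf + k * length_norm)

# Term saturation: each extra occurrence adds less, and the weight stays below k + 1
for tf in (1, 2, 5, 20):
    print(tf, round(bm25_term_weight(tf, doc_len=100, avg_doc_len=100), 3))

# Document length impact: with b > 0, a longer-than-average document scores lower
short_doc = bm25_term_weight(2, doc_len=50, avg_doc_len=100)
long_doc = bm25_term_weight(2, doc_len=200, avg_doc_len=100)
print(short_doc > long_doc)
```

In this form, a larger k lets repeated terms keep adding weight for longer, and setting b to 0 removes the document length effect entirely.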
2.2 Insert sparse vectors
Before the data is inserted, the sparse vectors should be prepared as dictionaries with integer keys and values:
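As a minimal sketch of that preparation, the snippet below counts hypothetical token IDs (such as a tokenizer might produce) into a dictionary with integer keys and integer values. The helper name `to_sparse_vector` and the token IDs are illustrative assumptions, not part of the KDB.AI client.

```python
from collections import Counter

def to_sparse_vector(token_ids):
    """Return {token_id: count} with plain Python int keys and values."""
    counts = Counter(int(t) for t in token_ids)
    return dict(counts)

# Hypothetical token IDs for one document; 1996 appears twice
token_ids = [101, 1996, 11190, 5598, 1996, 2058, 4231, 102]
sparse = to_sparse_vector(token_ids)
print(sparse)
```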
Here is an example of generating a pandas dataframe with sparse vectors as the payload for insert:
import numpy as np
import pandas as pd

# Connect with the KDB.AI table (assumes an open `session`)
documents = session.table('documents')

# Generate data
n_rows = 2000
data = pd.DataFrame({
    'id': np.arange(n_rows, dtype='int16'),
    'tag': np.random.choice([True, False], n_rows),
    'author': [f'author{i}' for i in range(n_rows)],
    'length': np.random.randint(0, 1000, n_rows, dtype='int32'),
    'content': [f'document{i}' for i in range(n_rows)],
    'createdDate': pd.date_range(start='2020-01-01', periods=n_rows, freq='1D'),
    # One sparse vector per row: 1 to 120 random integer keys, each with a value of 1
    'sparse': [{int(y+1): 1 for y in np.random.choice(range(1200), x+1, replace=False)}
               for x in np.random.choice(range(120), n_rows)]})
Here is an example of generating a JSON file with sparse vectors as the payload for insert:
[
{
"id": 21212,
"tag": true,
"author": "jill",
"length": 68,
"content": "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789",
"createdDate": "2023-10-11T00:00:00.000000000",
"sparse": {'1996': 2, '101': 1, '11190': 1, '5598': 1, '2058': 1, '4231': 1, '102': 1}
},
{
"id": 19376,
"tag": false,
"author": "joe",
"length": 626,
"content": "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789",
"createdDate": "2023-10-11T00:00:00.000000000",
"sparse": {'1996': 1, '11190': 1, '2058': 1, '4231': 1, '102': 1}
}
]
Now you can populate your table with data.
Populate the documents table with the above dataframe.
documents.insert(data)
Populate the documents table with a curl HTTP request.
Save the following to a local file named insert.json:
{
"table": "documents",
"rows": [
{
"id": 21212,
"tag": true,
"author": "jill",
"length": 68,
"content": "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789",
"createdDate": "2023-10-11T00:00:00.000000000",
"sparse": {'1996': 2, '101': 1, '11190': 1, '5598': 1, '2058': 1, '4231': 1, '102': 1}
},
{
"id": 19376,
"tag": false,
"author": "joe",
"length": 626,
"content": "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789",
"createdDate": "2023-10-11T00:00:00.000000000",
"sparse": {'1996': 1, '5598': 1, '2058': 1, '4231': 1, '102': 1}
}
]
}
Ensure that the @ sign is included before the filename; otherwise, curl sends the literal text insert.json as the request body instead of the file's contents.
curl -H 'Content-Type: application/json' -d @insert.json localhost:8082/api/v1/insert
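If the rows already exist as Python objects, one way to produce the contents of insert.json is to serialize them with the standard json module, which writes double-quoted strings and converts integer dictionary keys to strings. This is a sketch under that assumption; the helper name `build_insert_payload` is hypothetical, and the "table" and "rows" field names are taken from the example above.

```python
import json

def build_insert_payload(table, rows):
    """Serialize rows into the JSON text expected in insert.json."""
    return json.dumps({"table": table, "rows": rows}, indent=2)

rows = [{
    "id": 21212,
    "tag": True,
    "author": "jill",
    "length": 68,
    "content": "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789",
    "createdDate": "2023-10-11T00:00:00.000000000",
    # Integer keys in Python become string keys in the JSON output
    "sparse": {1996: 2, 101: 1, 11190: 1, 5598: 1, 2058: 1, 4231: 1, 102: 1},
}]

payload_text = build_insert_payload("documents", rows)
print(payload_text)
# To save it: open("insert.json", "w").write(payload_text)
```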
2.3 Perform a sparse search
The search endpoint is overloaded to accept lists of sparse vectors:
- Parameters Index Options
Option | Description | Type | Required | Default |
---|---|---|---|---|
k | Term saturation | real | false | 1.25 |
b | Document length impact on relevance | real | false | 0.75 |
table.search(vectors=[{1:2,3:1}],n=3)
curl -s -H "Content-Type: application/json" localhost:8082/api/v1/kxi/search \
-d '{"table":"documents","n":3,"vectors":[{"1":2,"3":1}]}'
3. Hybrid search
Combining the sparse and dense searches is the basis of the hybrid search. You can use the parameter alpha to determine the weighting of each leg of the search. A value of 0 means only the sparse leg is used and a value of 1 means only the dense leg is used.
For intermediate values, the following algorithm demonstrates an example that assumes:
- the number of matches is 4.
- alpha is 0.6.
- the dense search returned vectors in order 3, 2, 1, 5.
- the sparse search returned vectors in order 4, 3, 2, 1.
Vector | Sparse | Dense | Re-ranked |
---|---|---|---|
Vector 1 | 3 | 4 | 3 |
Vector 2 | 2 | 3 | 2 |
Vector 3 | 1 | 2 | 1 |
Vector 4 | null | 1 | 4 |
Vector 5 | 5 | null | 5 |
We deduced the combined ranking (re-ranked) from the reciprocal rank fusion scores.
We computed the scores as below:
Since vector 1 occurs in the sparse search in position 3 and in the dense search in position 4, it receives a score of 0.22.
Since vector 2 occurs in the sparse search in position 2 and in the dense search in position 3, it receives a score of 0.28.
Since vector 3 occurs in the sparse search in position 1 and in the dense search in position 2, it receives a score of 0.4.
Since vector 4 doesn't occur in the sparse search and occurs in the dense search in position 1, it receives a score of 0.2.
Since vector 5 occurs in the sparse search in position 5 and doesn't occur in the dense search, it receives a score of 0.08.
Comparing these five scores, the top four results are returned based on the sorted value.
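The scores quoted above for vectors 1, 2, and 3 are consistent with a weighted reciprocal rank fusion of the form alpha / (1 + dense position) + (1 - alpha) / (1 + sparse position). The sketch below assumes that form; it is inferred from the worked numbers rather than taken from the KDB.AI implementation, the function name `weighted_rrf` is hypothetical, and it covers only vectors that appear in both searches. How a vector missing from one leg is scored isn't modeled here.

```python
def weighted_rrf(dense_position, sparse_position, alpha):
    """Assumed weighted reciprocal rank fusion score for a vector found in both legs."""
    return alpha / (1 + dense_position) + (1 - alpha) / (1 + sparse_position)

alpha = 0.6
# (sparse position, dense position) as quoted in the text for vectors 1 to 3
positions = {1: (3, 4), 2: (2, 3), 3: (1, 2)}
scores = {
    vector: weighted_rrf(dense, sparse, alpha)
    for vector, (sparse, dense) in positions.items()
}

# Higher score means a better combined rank
for vector, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"vector {vector}: {score:.2f}")
```

Under this assumption, alpha set to 1 leaves only the dense term and alpha set to 0 leaves only the sparse term, which matches the description of alpha given earlier.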
Summary
To perform a hybrid search, you need to define the dense and sparse vectors in the schema. See the Schema section on the Manage Tables page.
schema = {'columns': [
{'name': 'id', 'pytype': 'int32'},
{'name': 'sym', 'pytype': 'str'},
{'name': 'time', 'pytype': 'datetime64[ns]'},
{'name': 'dense', 'pytype': 'float32',
'vectorIndex': {'type': 'hnsw', 'dims': 10, 'metric': 'L2'}},
{'name': 'sparse', 'pytype': 'dict',
'sparseIndex': {'k': 1.25, 'b': 0.75}}]}
{
"type": "splayed",
"columns": [
{"name": "id", "type": "int"},
{"name": "sym", "type": "char"},
{"name": "time", "type": "timestamp"},
{
"name": "dense",
"type": "reals",
"vectorIndex": {
"type": "hnsw",
"metric": "L2",
"dims": 10}
},
{
"name": "sparse",
"type": "",
"sparseIndex": {
"k": 1.25,
"b": 0.75}
}
]
}
Assuming a documents table exists with the correct schema, you can insert the data via Python (as a pandas dataframe) or via REST (as a JSON file):
Here is an example of generating a pandas dataframe with sparse and dense vectors as the payload for insert:
import numpy as np
import pandas as pd

# Connect with the KDB.AI table (assumes an open `session`)
documents = session.table('documents')

# Generate data
n_rows = 2000
data = pd.DataFrame({
    'id': np.arange(n_rows, dtype='int16'),
    'tag': np.random.choice([True, False], n_rows),
    'author': [f'author{i}' for i in range(n_rows)],
    'length': np.random.randint(0, 1000, n_rows, dtype='int32'),
    'content': [f'document{i}' for i in range(n_rows)],
    'createdDate': pd.date_range(start='2020-01-01', periods=n_rows, freq='1D'),
    # Dense vectors have 10 dimensions to match 'dims': 10 in the schema above
    'dense': [np.random.rand(10).astype('float32') for _ in range(n_rows)],
    # One sparse vector per row: 1 to 120 random integer keys, each with a value of 1
    'sparse': [{int(y+1): 1 for y in np.random.choice(range(1200), x+1, replace=False)}
               for x in np.random.choice(range(120), n_rows)]})
Here is an example of generating a JSON file with sparse and dense vectors as the payload for insert:
[
{
"id": 21212,
"tag": true,
"author": "jill",
"length": 68,
"content": "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789",
"createdDate": "2023-10-11T00:00:00.000000000",
"dense": [0,1,2,3,4,5,6,7,8,9,10,11],
"sparse": {'1996': 2, '101': 1, '11190': 1, '5598': 1, '2058': 1, '4231': 1, '102': 1}
},
{
"id": 19376,
"tag": false,
"author": "joe",
"length": 626,
"content": "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ_0123456789",
"createdDate": "2023-10-11T00:00:00.000000000",
"dense": [0,1,2,3,4,5,6,7,8,9,10,11],
"sparse": {'1996': 1, '11190': 1, '2058': 1, '4231': 1, '102': 1}
}
]
Assuming a documents table exists with a compliant schema, once you insert the data via Python or REST, perform the hybrid search:
table.hybrid_search(dense_vectors=[[0,0,0,0,0,0,0,0,0,0]],sparse_vectors=[{1:2,3:1}],n=3)
curl -s -H "Content-Type: application/json" localhost:8082/api/v1/kxi/hybridSearch \
-d '{"table":"documents","n":3,"sparseVectors":[{"1":2,"3":1}],"denseVectors":[[0,0,0,0,0,0,0,0,0,0]]}'
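Because the dense query vector has to match the dims value of the dense index and the sparse query uses integer keys, it can help to check both before calling the search. The helper below is a hypothetical client-side check, not part of the KDB.AI client, and it assumes the 10-dimensional schema shown above.

```python
def check_hybrid_query(dense_vectors, sparse_vectors, dims):
    """Raise ValueError if a query vector doesn't fit the assumed schema."""
    for i, dense in enumerate(dense_vectors):
        if len(dense) != dims:
            raise ValueError(f"dense vector {i} has {len(dense)} values, expected {dims}")
    for i, sparse in enumerate(sparse_vectors):
        for key, value in sparse.items():
            if not isinstance(key, int) or not isinstance(value, int):
                raise ValueError(f"sparse vector {i} needs integer keys and values")

# Same query as in the hybrid search call above
dense_query = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
sparse_query = [{1: 2, 3: 1}]
check_hybrid_query(dense_query, sparse_query, dims=10)
print("query vectors look consistent with the schema")
```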
Code example
This example demonstrates how to perform each of the three searches on some dummy data.
import pandas as pd
import numpy as np
import kdbai_client as kdbai
### start session
session = kdbai.Session()
### create table
schema = {'columns': [
{'name': 'id', 'pytype': 'int32'},
{'name': 'sym', 'pytype': 'str'},
{'name': 'dense', 'pytype': 'float32',
'vectorIndex': {'type': 'hnsw', 'dims': 10, 'metric': 'L2'}},
{'name': 'sparse', 'pytype': 'dict',
'sparseIndex': {'k': 1.25, 'b': 0.75}}]}
table = session.create_table("example", schema)
### insert data
df = pd.DataFrame({
'id': range(1000),
'sym': np.random.choice(['AAA', 'BBB'], 1000),
'dense': [x.astype('float32') for x in np.random.rand(1000, 10)],
'sparse': [{int(y+1):1 for y in np.random.choice(range(1200),x+1,replace=False)} for x in np.random.choice(range(120),1000)]})
table.insert(df)
### Dense search
result = table.search([[0.1,0,0,0,0,0,0,0,0,0]],n=3)
print(result)
### Sparse search
result = table.search([{1:1,4:1,34:2,2:1}],n=3)
print(result)
### Hybrid search
result = table.hybrid_search(dense_vectors=[[0.1,0,0,0,0,0,0,0,0,0]],sparse_vectors=[{1:1,4:1,34:2,2:1}],n=3)
print(result)
### cleanup
session.table('example').drop()
Next steps
Now that you're familiar with hybrid search, you can do the following:
- Check out the use cases and learn how to improve accuracy with hybrid search from the Learning hub.
- Visit our GitHub repo.
- Open the sample.
- Run the notebook in Google Colab.
- Learn how to perform a Transformed Temporal Similarity Search.
- Learn how to perform a Non-Transformed Temporal Similarity Search.