Customize Filters
This section details how to apply custom filters to your search query.
KDB.AI combines vector database functionality with traditional kdb database queries. In particular, you can combine a vector similarity search with traditional where
clauses. The filter
parameter is used for applying custom filtering to the query. This is specified using a list of triples where each triple has the form (operator; column name; values).
[("<=", "valFloat", 100), ("within", "qual", [0,2])]
[["<=", "valFloat", 100], ["within", "qual", [0,2]]]
The filters are applied in the order they are defined when selecting from the table.
Supported filter functions
Function | Parameters | Example |
---|---|---|
in |
Filters data that is not in the list of possible alternatives. | ["in", "sym", ["AAPL", "MSFT"]] keeps symbols that are either AAPL or MSFT . |
within |
Keeps numeric data that is within the bounds of a range using inclusive limits. | ["within", "price", [100, 200]] keeps prices that are greater than or equal to 100 but less than or equal to 200. |
< |
Keeps numeric data that is less than a threshold. | ["<", "price", 100] keeps data that is less than 100. |
> |
Keeps numeric data that is greater than a threshold. | [">", "price", 100] keeps data that is greater than 100. |
<= |
Keeps numeric data that is less than or equal to a threshold. | ["<=", "price", 100] keeps data that is less than or equal to 100. |
>= |
Keeps numeric data that is greater than or equal to a threshold. | [">=", "price", 100] keeps data that is greater than or equal to 100. |
= |
Keeps numeric data equals another value. | ["=", "sym", "AAPL"] keeps only AAPL data. |
<> |
Keeps data that is not equal to a value. | ["<>", "sym", "AAPL"] keeps all data that is not AAPL . |
like |
Filters string data that matches a simple expression | ["like", "sym", "A*"] matches any symbols that start with an A . |
fuzzy |
Filters string data based on approximate matching using edit distance. | ["fuzzy", "sym", [["APL", 1]]] matches symbols like APL with an edit distance of 1, allowing minor variations. |
Filtered search example
This example demonstrates a filtered query where the similarity search is run against vectors satisfying the filter condition.
documents.search(vectors=[[1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0]], n=3, filter=[("within","createdDate",["2020.07.10D15:00:00.0000", "2021.07.10D15:00:00.0000"]),("<=", "length", 100)])
curl -s -H "Content-Type: application/json" localhost:8082/api/v1/kxi/search \
-d '{"table":"documents","n":3,"vectors":[[1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0]],"filter":[["within","createdDate",["2020.07.10D15:00:00.0000", "2021.07.10D15:00:00.0000"]],["<=", "length", 100]]}'
How to improve your filtering performance in KDB.AI
To enhance filtering performance in KDB.AI, leverage metadata filtering. Each vector stored in the KDB.AI vector database can have associated metadata attached. By specifying filters on this metadata, you can effectively shrink the search space and reduce the number of vectors that need to be searched1. Here are some steps you can take:
- Metadata Filtering: Attach relevant metadata to your vectors (e.g., genre, release year, director). Use this metadata to filter your search queries, narrowing down the results. This approach improves both the speed and precision of your searches.
- Batch Insertion: When ingesting data, insert it in batches using the insert method. Split large dataframes into smaller batches for more efficient inserts.
- Compress Input Data: Consider using the product quantization method (IVFPQ) before applying the IVF schema. IVFPQ reduces the index size in memory and improves search speeds3.
Explore the metadata filtering demo in Google Colab for practical examples.
How to use fuzzy filters
In KDB.AI, you can apply fuzzy filters on the metadata columns for query
and the following search
operations:
- Similarity search
- Temporal search (transformed/non-transformed)
- Hybrid search
By adding a fuzzy parameter to the filtering options, you allow approximate string matching based on a specified edit distance.
Important! Supported metadata column types
Fuzzy search only supports searches of the following column types: string
, symbol
, and enumeration
. Numeric
, boolean
, date
, (or collections thereof) are not supported.
Pre-requisites
Before using fuzzy filters with KDB.AI, ensure you have the following:
- Python 3 (versions 3.8 to 3.11), Pip, and Git installed
- Active KDB.AI Cloud or Server license
- Valid API key for KDB.AI Cloud
- Know how to work with vector databases and embedding models
- Understand how to setup the necessary configurations for interacting with either KDB.AI Cloud or Server
How to run a similarity search with fuzzy filters on metadata columns
Conduct similarity search with a specified fuzzy filter to apply. You can apply a fuzzy filter within the filter expression of the search, by using the following arguments:
Parameter | Description | Type | Required | Default |
---|---|---|---|---|
fuzzy | Fuzzy filter keyword function | string | yes | none |
colToScan | Column to scan | string | yes | none |
params | Triple of search string, edit distance, and distance metric (optional) | list | no | none |
Example: How to run a similarity search or query with fuzzy filters on meta
# Connect to you KDB.AI session
import kdbai_client as kdbai
import pandas as pd
import numpy as np
# Generate dummy data data with 1000 rows
n_rows = 1000
data = pd.DataFrame({
'id': np.arange(n_rows, dtype='int32'), # Unique identifier for each row
'time': pd.date_range(start='2020-01-01', periods=n_rows, freq='1MIN'), # Timestamp for each row
'embeddings': [np.random.rand(12).astype('float32') for _ in range(n_rows)], # Random 12-dimensional embeddings
})
# List of stock tickers
tickers = [
'AAPL', 'MSFT', 'GOOGL', 'AMZN', 'TSLA', 'FB', 'BRK.B', 'V',
'JNJ', 'WMT', 'JPM', 'NVDA', 'PYPL', 'NFLX', 'DIS', 'ADBE',
'PFE', 'INTC', 'KO', 'CSCO'
]
# Assign random tickers to the 'sym' column
data['sym'] = np.random.choice(tickers, size=n_rows)
# Define the schema for the table
schema = {'columns': [
{'name': 'id', 'pytype': 'int32'},
{'name': 'sym', 'pytype': 'str'},
{'name': 'time', 'pytype': 'datetime64[ns]'},
{'name': 'embeddings', 'pytype': 'float32','vectorIndex': {'dims': 12, 'type': 'hnsw', 'metric': 'L2', 'efConstruction': 8, 'M': 8}}]}
# Create a table named "tickers" with the defined schema
table = session.create_table("tickers",schema)
# Insert the generated data into the table
table.insert(data)
# Query the table (example query)
table.query()
# By setting the edit distance to zero we get all the exact matches for 'AMZN'
table.query(filter=("fuzzy","sym",[["AMZN", 0]]))
# Perform a similarity search with a vector and an exact match filter for 'AMZN'
table.search(vectors=[[0.0,1,2,3,4,0,1,2,3,4,1,2]], filter=("fuzzy","sym",[["AMZN", 0]]))
# Run similarity search with multiple filters
table.search(vectors=[[0.0,1,2,3,4,0,1,2,3,4,1,2]], filter=[("<=", "id", 300),("fuzzy","sym",[["AMZN", 0]])])
# Increase the edit distance allows fuzzy to accept more variations of the target string
table.search(vectors=[[0.0,1,2,3,4,0,1,2,3,4,1,2]], filter=("fuzzy","sym",[["AMN", 1]]))
# Choose different distance metric
table.search(vectors=[[0.0,1,2,3,4,0,1,2,3,4,1,2]], filter=("fuzzy","sym",[["AM Z", 2,"hamming"]]))
# Use curl to send a POST request to the KDB.AI REST
curl -s -H "Content-Type: application/json" localhost:8082/api/v1/kxi/search \
-d '{"table":"tickers","n":3,"vectors":[[0.0,1,2,3,4,0,1,2,3,4,1,2]],"filter":[["fuzzy","sym",[["AMN", 1]]]]}'
Supported distance metrics
If no distance metric is specified then Levenshtein is used. Alternatively, you can specify one of the fuzzy parameters listed below. Note that the parameter is case sensitive.
Fuzzy parameter | Name | Description | Notes |
---|---|---|---|
levenshtein |
Levenshtein | Min number of single-character edits required to change string 1 into string 2. | Allows to insert, delete, replace. |
hamming |
Hamming | Min substitutions required to turn one string into another. | Allows to replace. Only strings of the same length. |
jaro |
Jaro | Measures similarity between two strings based on the matching and swapping of characters. | Focuses on the order and number of common characters. |
jaro_winkler |
Jaro-Winkler | Similar to Jaro. Gives higher scores to strings that match from the start. | Like Jaro but with a prefix scale factor. |
damerau_levenshtein |
Damerau-Levenshtein | Number of operations needed to change string 1 into string 2. | Allows to insert, delete, replace, adjacent swap. |
lcs |
Longest Common Subsequence (LCS) | Finds the longest subsequence shared by two strings. | Only to delete, insert. Not to replace. |
osa |
Optimal String Alignment (OSA) | Similar to Damerau-Levenshtein but you can only edit substrings once. | Insert, delete, replace, swap. |
prefix |
Prefix | The edits needed to change similarity or dissimilarity at the beginning of the strings. | For “unhappy” and “unhealthy,” how many edits change “unhap” to “unhea.” |
postfix |
Postfix | The edits needed to change similarity or dissimilarity at the end of the strings. | For “unhappy” and “unhealthy,” how many edits change “ppy” to “thy.” |
Warning
Larger edit distances could make searches take longer or return a very large subset of the table.
Best practices
To make fuzziness “fuzzier” and enhance search flexibility, consider tweaking fuzziness parameters as follows:
- Customize Fuzziness: Specify the edit distance threshold (for example, >= 1).
- Increase Thresholds: Allow more edits for shorter strings.
Experiment with these settings to achieve the desired outcome.
Next steps
Now that you're familiar with filters, you can do the following:
- Optimize vector search with metadata filtering.
- Visit our GitHub repo, open the sample or run the notebook directly in Google Colab.