If you are not redirected, click here.

How to Customize Filters in KDB.AI

This page details how to apply custom filters to queries and/or any type of searches in KDB.AI.

Tip: For the best experience, start by reading about KDB.AI filters.

KDB.AI combines vector database functionality with traditional kdb database queries. In particular, you can combine a vector similarity search with traditional where clauses. The filter parameter is used for applying custom filtering to the query and search methods. This is specified using a list of triples where each triple has the form (operator; column name; values).

PythonJSONq

[("<=", "valFloat", 100), ("within", "qual", [0,2])]

[["<=", "valFloat", 100], ["within", "qual", [0,2]]]

((<=;`valFloat;100);(within;`qual;(0;2)))

The filters are applied in the order they are defined when selecting from the table.

Supported filter functions

Function	Parameters	Example
`in`	Filters data that is not in the list of possible alternatives.	`["in", "sym", ["AAPL", "MSFT"]]` keeps symbols that are either `AAPL` or `MSFT`.
`within`	Keeps numeric data that is within the bounds of a range using inclusive limits.	`["within", "price", [100, 200]]` keeps prices that are greater than or equal to 100 but less than or equal to 200.
`<`	Keeps numeric data that is less than a threshold.	`["<", "price", 100]` keeps data that is less than 100.
`>`	Keeps numeric data that is greater than a threshold.	`[">", "price", 100]` keeps data that is greater than 100.
`<=`	Keeps numeric data that is less than or equal to a threshold.	`["<=", "price", 100]` keeps data that is less than or equal to 100.
`>=`	Keeps numeric data that is greater than or equal to a threshold.	`[">=", "price", 100]` keeps data that is greater than or equal to 100.
`=`	Keeps numeric data equals another value.	`["=", "sym", "AAPL"]` keeps only `AAPL` data.
`<>`	Keeps data that is not equal to a value.	`["<>", "sym", "AAPL"]` keeps all data that is not `AAPL`.
`like`	Filters string data that matches a simple expression	`["like", "sym", "A*"]` matches any symbols that start with an `A`.
`fuzzy`	Filters string data based on approximate matching using distance.	`["fuzzy", "sym", [["APL", 1]]]` matches symbols like APL with a distance of 1, allowing minor variations.

Filtered search example

This example demonstrates a filtered query where the similarity search is run against vectors satisfying the filter condition.

PythonRESTq

index_name = 'vector_index'
filter_conditions = [("within", "createdDate", [datetime.datetime(2020, 7, 10, 15), datetime.datetime(2021, 7, 10, 15)]),("<=", "length", 100)]
documents.search(vectors={index_name: [[1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0]]}, n=3, filter=filter_conditions)
documents.query(filter=filter_conditions)

curl -s -H "Content-Type: application/json" localhost:8082/api/v2/databases/default/tables/documents/search \
-d '{"n": 3, "vectors": {"vector_index": [[1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0]]}, "filter": [["within", "createdDate", ["2020.07.10D15:00:00.0000", "2021.07.10D15:00:00.0000"]], ["<=", "length", 100]]}'

curl -s -H "Content-Type: application/json" localhost:8082/api/v2/databases/default/tables/documents/query \
-d '{"filter": [["within", "createdDate", ["2020.07.10D15:00:00.0000", "2021.07.10D15:00:00.0000"]], ["<=", "length", 100]]}'

// gw is a handler to the gateway
vectors: enlist[`vector_index]!enlist[enlist[1.0 0.0 1.0 1.0 0.0 1.0 1.0 0.0 1.0 1.0 0.0 1.0]]
filters: ((within;`createdDate;(2020.07.10D15:00:00.0000;2021.07.10D15:00:00.0000));(<=;`length;100))
gw(`search;`database`table`n`vectors`filter!(`default;`documents;3;vectors;filters))
gw(`query;`database`table`filter!(`default;`documents;filters))

How to improve your filtering performance in KDB.AI

To enhance filtering performance in KDB.AI, leverage metadata filtering. Each vector stored in the KDB.AI vector database can have associated metadata attached. By specifying filters on this metadata, you can effectively shrink the search space and reduce the number of vectors that need to be searched. Here are some steps you can take:

Metadata Filtering: Attach relevant metadata to your vectors (e.g., genre, release year, director). Use this metadata to filter your search queries, narrowing down the results. This approach improves both the speed and precision of your searches.
Batch Insertion: When ingesting data, insert it in batches using the insert method.
Compress Input Data: Consider using the product quantization method (IVFPQ) before applying the IVF schema. IVFPQ reduces the index size in memory and improves search speeds.

Explore the metadata filtering demo in Google Colab for practical examples.

How to use fuzzy filters

In KDB.AI, you can apply fuzzy filters on the metadata columns for query and the following search operations:

Similarity search
Temporal search (transformed/non-transformed)
Hybrid search

By adding a fuzzy parameter to the filtering options, you allow approximate string matching based on a specified edit distance.

Important! Supported metadata column types

Fuzzy search only supports searches of the following column types: string, symbol, and enumeration. Numeric, boolean, date, (or collections thereof) are not supported.

Pre-requisites

Before using fuzzy filters with KDB.AI, ensure you have the following:

Python 3 (versions 3.8 to 3.11), Pip, and Git installed
Active KDB.AI Cloud or Server license
Valid API key for KDB.AI Cloud
Know how to work with vector databases and embedding models
Understand how to setup the necessary configurations for interacting with either KDB.AI Cloud or Server

How to run a similarity search with fuzzy filters on metadata columns

Conduct similarity search with a specified fuzzy filter to apply. You can apply a fuzzy filter within the filter expression of the search, by using the following arguments:

Parameter	Description	Type	Required	Default
fuzzy	Fuzzy filter keyword function	string	yes	none
colToScan	Column to scan	string	yes	none
params	Triple of search string, distance, and distance metric (optional)	list	no	none

Example: How to run a similarity search or query with fuzzy filters on meta

PythonRESTq

# Connect to you KDB.AI session

import kdbai_client as kdbai
import pandas as pd
import numpy as np

# Generate dummy data data with 1000 rows

n_rows = 1000
data = pd.DataFrame({
    'id': np.arange(n_rows, dtype='int32'), # Unique identifier for each row
    'time': pd.date_range(start='2020-01-01', periods=n_rows, freq='1MIN'), # Timestamp for each row
    'embeddings': [np.random.rand(12).astype('float32') for _ in range(n_rows)], # Random 12-dimensional embeddings
})

# List of stock tickers

tickers = [
    'AAPL', 'MSFT', 'GOOGL', 'AMZN', 'TSLA', 'FB', 'BRK.B', 'V',
    'JNJ', 'WMT', 'JPM', 'NVDA', 'PYPL', 'NFLX', 'DIS', 'ADBE',
    'PFE', 'INTC', 'KO', 'CSCO'
]

# Assign random tickers to the 'sym' column

data['sym'] = np.random.choice(tickers, size=n_rows)

# Define the schema for the table

schema = [
    {'name': 'id', 'type': 'int32'},
    {'name': 'sym', 'type': 'str'},
    {'name': 'time', 'type': 'datetime64[ns]'},
    {'name': 'embeddings', 'type': 'float32s'}]

index_name = 'vector_index'
indexes = [{'name': index_name, 'type': 'hnsw', 'column': 'embeddings', 'params': {'dims': 12, 'metric': 'L2', 'efConstruction': 8, 'M': 8}}]

# Create a table named "tickers" with the defined schema

session = kdbai.Session()  # by default, creates session to localhost:8082 using QIPC connection
database = session.database('default')
table = database.create_table('tickers', schema=schema, indexes=indexes)

# Insert the generated data into the table

table.insert(data)

# Query the table (example query)

table.query()

# By setting the distance to zero we get all the exact matches  for 'AMZN'

table.query(filter=[('fuzzy', 'sym', [['AMZN', 0]])])

# Perform a similarity search with a vector and an exact match filter for 'AMZN'

table.search(vectors={index_name: [[0,1,2,3,4,0,1,2,3,4,1,2]]}, filter=[('fuzzy', 'sym', [['AMZN', 0]])])

# Run similarity search with multiple filters

table.search(vectors={index_name: [[0,1,2,3,4,0,1,2,3,4,1,2]]}, filter=[('<=', 'id', 300),('fuzzy', 'sym', [['AMZN', 0]])])

# Increase the edit distance allows fuzzy to accept more variations of the target string

table.search(vectors={index_name: [[0,1,2,3,4,0,1,2,3,4,1,2]]}, filter=[('fuzzy', 'sym', [['AMN', 1]])])

# Choose different distance metric

table.search(vectors={index_name: [[0,1,2,3,4,0,1,2,3,4,1,2]]}, filter=[('fuzzy', 'sym', [['AM Z', 2, 'hamming']])])

# Use curl to send a POST request to the KDB.AI REST

curl -s -H "Content-Type: application/json" localhost:8082/api/v2/databases/default/tables/tickers/search \
-d '{"n": 3, "vectors": {"vector_index": [[0,1,2,3,4,0,1,2,3,4,1,2]]}, "filter": [["fuzzy", "sym", [["AMN", 1]]]]}'

// gw is a handler to the gateway
vectors: enlist[`vector_index]!enlist[enlist[0 1 2 3 4 0 1 2 3 4 1 2]]
filters: enlist (`fuzzy;`sym;enlist (`AMN;1))
gw(`search;`database`table`n`vectors`filter!(`default;`tickers;3;vectors;filters))

Supported distance metrics

If no distance metric is specified then Levenshtein is used. Alternatively, you can specify one of the fuzzy parameters listed below. Note that the parameter is case sensitive.

Fuzzy parameter	Name	Description	Notes
`levenshtein`	Levenshtein	Min number of single-character edits required to change string 1 into string 2.	Allows to insert, delete, replace.
`hamming`	Hamming	Min substitutions required to turn one string into another.	Allows to replace. Only strings of the same length.
`jaro`	Jaro	Measures similarity between two strings based on the matching and swapping of characters.	Focuses on the order and number of common characters.
`jaro_winkler`	Jaro-Winkler	Similar to Jaro. Gives higher scores to strings that match from the start.	Like Jaro but with a prefix scale factor.
`damerau_levenshtein`	Damerau-Levenshtein	Number of operations needed to change string 1 into string 2.	Allows to insert, delete, replace, adjacent swap.
`lcs`	Longest Common Subsequence (LCS)	Finds the longest subsequence shared by two strings.	Only to delete, insert. Not to replace.
`osa`	Optimal String Alignment (OSA)	Similar to Damerau-Levenshtein but you can only edit substrings once.	Insert, delete, replace, swap.
`prefix`	Prefix	The edits needed to change similarity or dissimilarity at the beginning of the strings.	For “unhappy” and “unhealthy,” how many edits change “unhap” to “unhea.”
`postfix`	Postfix	The edits needed to change similarity or dissimilarity at the end of the strings.	For “unhappy” and “unhealthy,” how many edits change “ppy” to “thy.”

Warning

Larger edit distances could make searches take longer or return a very large subset of the table.

Best practices

To make fuzziness “fuzzier” and enhance search flexibility, consider tweaking fuzziness parameters as follows:

Customize Fuzziness: Specify the distance threshold (for example, >= 1).
Increase Thresholds: Allow more edits for shorter strings.

Experiment with these settings to achieve the desired outcome.

Next steps

Now that you're familiar with filters, you can do the following:

Try querying and fuzzy filtering in our sample project.
Review our articles on metadata filtering and fuzzy filtering for more context.
Watch our video to learn more about querying in vector databases.
Visit our GitHub repo, open the sample or run the notebook directly in Google Colab.

New Documentation Site!

How to Customize Filters in KDB.AI

Supported filter functions

Filtered search example

How to improve your filtering performance in KDB.AI

How to use fuzzy filters

Pre-requisites

How to run a similarity search with fuzzy filters on metadata columns

Supported distance metrics

Best practices

Next steps