Skip to content

Customize Filters

This section details how to apply custom filters to your search query.

KDB.AI combines vector database functionality with traditional kdb database queries. In particular, you can combine a vector similarity search with traditional where clauses. The filter parameter is used for applying custom filtering to the query. This is specified using a list of triples where each triple has the form (operator; column name; values).

[("<=", "valFloat", 100), ("within", "qual", [0,2])]
[["<=", "valFloat", 100], ["within", "qual", [0,2]]]

The filters are applied in the order they are defined when selecting from the table.

Supported filter functions

Function Parameters Example
in Filters data that is not in the list of possible alternatives. ["in", "sym", ["AAPL", "MSFT"]] keeps symbols that are either AAPL or MSFT.
within Keeps numeric data that is within the bounds of a range using inclusive limits. ["within", "price", [100, 200]] keeps prices that are greater than or equal to 100 but less than or equal to 200.
< Keeps numeric data that is less than a threshold. ["<", "price", 100] keeps data that is less than 100.
> Keeps numeric data that is greater than a threshold. [">", "price", 100] keeps data that is greater than 100.
<= Keeps numeric data that is less than or equal to a threshold. ["<=", "price", 100] keeps data that is less than or equal to 100.
>= Keeps numeric data that is greater than or equal to a threshold. [">=", "price", 100] keeps data that is greater than or equal to 100.
= Keeps numeric data equals another value. ["=", "sym", "AAPL"] keeps only AAPL data.
<> Keeps data that is not equal to a value. ["<>", "sym", "AAPL"] keeps all data that is not AAPL.
like Filters string data that matches a simple expression ["like", "sym", "A*"] matches any symbols that start with an A.
fuzzy Filters string data based on approximate matching using edit distance. ["fuzzy", "sym", [["APL", 1]]] matches symbols like APL with an edit distance of 1, allowing minor variations.

Filtered search example

This example demonstrates a filtered query where the similarity search is run against vectors satisfying the filter condition.

documents.search(vectors=[[1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0]], n=3, filter=[("within","createdDate",["2020.07.10D15:00:00.0000", "2021.07.10D15:00:00.0000"]),("<=", "length", 100)])
curl -s -H "Content-Type: application/json" localhost:8082/api/v1/kxi/search \
-d '{"table":"documents","n":3,"vectors":[[1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0]],"filter":[["within","createdDate",["2020.07.10D15:00:00.0000", "2021.07.10D15:00:00.0000"]],["<=", "length", 100]]}'

How to improve your filtering performance in KDB.AI

To enhance filtering performance in KDB.AI, leverage metadata filtering. Each vector stored in the KDB.AI vector database can have associated metadata attached. By specifying filters on this metadata, you can effectively shrink the search space and reduce the number of vectors that need to be searched1. Here are some steps you can take:

  • Metadata Filtering: Attach relevant metadata to your vectors (e.g., genre, release year, director). Use this metadata to filter your search queries, narrowing down the results. This approach improves both the speed and precision of your searches.
  • Batch Insertion: When ingesting data, insert it in batches using the insert method. Split large dataframes into smaller batches for more efficient inserts.
  • Compress Input Data: Consider using the product quantization method (IVFPQ) before applying the IVF schema. IVFPQ reduces the index size in memory and improves search speeds3.

Explore the metadata filtering demo in Google Colab for practical examples.

How to use fuzzy filters

In KDB.AI, you can apply fuzzy filters on the metadata columns for query and the following search operations:

  1. Similarity search
  2. Temporal search (transformed/non-transformed)
  3. Hybrid search

By adding a fuzzy parameter to the filtering options, you allow approximate string matching based on a specified edit distance.

Important! Supported metadata column types

Fuzzy search only supports searches of the following column types: string, symbol, and enumeration. Numeric, boolean, date, (or collections thereof) are not supported.

Pre-requisites

Before using fuzzy filters with KDB.AI, ensure you have the following:

How to run a similarity search with fuzzy filters on metadata columns

Conduct similarity search with a specified fuzzy filter to apply. You can apply a fuzzy filter within the filter expression of the search, by using the following arguments:

Parameter Description Type Required Default
fuzzy Fuzzy filter keyword function string yes none
colToScan Column to scan string yes none
params Triple of search string, edit distance, and distance metric (optional) list no none
Example: How to run a similarity search or query with fuzzy filters on meta
# Connect to you KDB.AI session

import kdbai_client as kdbai
import pandas as pd
import numpy as np

# Generate dummy data data with 1000 rows

n_rows = 1000
data = pd.DataFrame({
    'id': np.arange(n_rows, dtype='int32'), # Unique identifier for each row
    'time': pd.date_range(start='2020-01-01', periods=n_rows, freq='1MIN'), # Timestamp for each row
    'embeddings': [np.random.rand(12).astype('float32') for _ in range(n_rows)], # Random 12-dimensional embeddings
})

# List of stock tickers

tickers = [
    'AAPL', 'MSFT', 'GOOGL', 'AMZN', 'TSLA', 'FB', 'BRK.B', 'V', 
    'JNJ', 'WMT', 'JPM', 'NVDA', 'PYPL', 'NFLX', 'DIS', 'ADBE', 
    'PFE', 'INTC', 'KO', 'CSCO'
]

# Assign random tickers to the 'sym' column

data['sym'] = np.random.choice(tickers, size=n_rows)

# Define the schema for the table

schema = {'columns': [
    {'name': 'id', 'pytype': 'int32'},
    {'name': 'sym', 'pytype': 'str'},
    {'name': 'time', 'pytype': 'datetime64[ns]'},
    {'name': 'embeddings', 'pytype': 'float32','vectorIndex': {'dims': 12, 'type': 'hnsw', 'metric': 'L2', 'efConstruction': 8, 'M': 8}}]}

# Create a table named "tickers" with the defined schema

table = session.create_table("tickers",schema)

# Insert the generated data into the table

table.insert(data)

# Query the table (example query)

table.query()

# By setting the edit distance to zero we get all the exact matches  for 'AMZN'

table.query(filter=("fuzzy","sym",[["AMZN", 0]])) 

# Perform a similarity search with a vector and an exact match filter for 'AMZN'

table.search(vectors=[[0.0,1,2,3,4,0,1,2,3,4,1,2]], filter=("fuzzy","sym",[["AMZN", 0]])) 

# Run similarity search with multiple filters

table.search(vectors=[[0.0,1,2,3,4,0,1,2,3,4,1,2]], filter=[("<=", "id", 300),("fuzzy","sym",[["AMZN", 0]])]) 

# Increase the edit distance allows fuzzy to accept more variations of the target string

table.search(vectors=[[0.0,1,2,3,4,0,1,2,3,4,1,2]], filter=("fuzzy","sym",[["AMN", 1]])) 

# Choose different distance metric

table.search(vectors=[[0.0,1,2,3,4,0,1,2,3,4,1,2]], filter=("fuzzy","sym",[["AM Z", 2,"hamming"]])) 
# Use curl to send a POST request to the KDB.AI REST 

curl -s -H "Content-Type: application/json" localhost:8082/api/v1/kxi/search \
-d '{"table":"tickers","n":3,"vectors":[[0.0,1,2,3,4,0,1,2,3,4,1,2]],"filter":[["fuzzy","sym",[["AMN", 1]]]]}'

Supported distance metrics

If no distance metric is specified then Levenshtein is used. Alternatively, you can specify one of the fuzzy parameters listed below. Note that the parameter is case sensitive.

Fuzzy parameter Name Description Notes
levenshtein Levenshtein Min number of single-character edits required to change string 1 into string 2. Allows to insert, delete, replace.
hamming Hamming Min substitutions required to turn one string into another. Allows to replace. Only strings of the same length.
jaro Jaro Measures similarity between two strings based on the matching and swapping of characters. Focuses on the order and number of common characters.
jaro_winkler     Jaro-Winkler Similar to Jaro. Gives higher scores to strings that match from the start. Like Jaro but with a prefix scale factor.
damerau_levenshtein Damerau-Levenshtein Number of operations needed to change string 1 into string 2. Allows to insert, delete, replace, adjacent swap.
lcs Longest Common Subsequence (LCS) Finds the longest subsequence shared by two strings. Only to delete, insert. Not to replace.
osa Optimal String Alignment (OSA) Similar to Damerau-Levenshtein but you can only edit substrings once. Insert, delete, replace, swap.
prefix Prefix The edits needed to change similarity or dissimilarity at the beginning of the strings. For “unhappy” and “unhealthy,” how many edits change “unhap” to “unhea.”
postfix Postfix The edits needed to change similarity or dissimilarity at the end of the strings. For “unhappy” and “unhealthy,” how many edits change “ppy” to “thy.”

Warning

Larger edit distances could make searches take longer or return a very large subset of the table.

Best practices

To make fuzziness “fuzzier” and enhance search flexibility, consider tweaking fuzziness parameters as follows:

  • Customize Fuzziness: Specify the edit distance threshold (for example, >= 1).
  • Increase Thresholds: Allow more edits for shorter strings.

Experiment with these settings to achieve the desired outcome.

Next steps

Now that you're familiar with filters, you can do the following: