
Fuzzy filters

This page explains fuzzy filters in KDB.AI.

Fuzzy filters narrow down the candidates for a vector search by matching metadata strings approximately rather than exactly. This improves recall: relevant rows are still returned when the filter term is imprecise or “fuzzy,” for example because it contains a typo.

Benefits of fuzzy filters

Fuzzy filters are particularly useful when using data that may contain errors, typos, or variations. For example, if you apply fuzzy filters to searches, you can find documents that contain similar terms to a specified query, even if they're not exactly the same.

The addition of fuzzy filters provides the following benefits:

Handling Typos: fuzzy filters are essential for handling approximate or imprecise search terms. Users often make typos or minor spelling errors, and fuzzy search helps retrieve relevant results despite these variations.
User-Friendly Experience: by accommodating slight variations in search terms, fuzzy filters ensure a better user experience. Users don’t need to input the exact term to get meaningful results.
Robustness: fuzzy filters make your search system more robust by accounting for real-world data imperfections.
Increased Recall: fuzzy filters increase recall by capturing relevant documents that might have been missed due to minor differences in spelling or input errors.

How do fuzzy filters work?

Fuzzy filter algorithms often rely on the concept of edit distance, which measures the minimum number of operations (insertions, deletions, substitutions) required to transform one string into another.

Common algorithms include Levenshtein distance and Damerau-Levenshtein distance. For example, the Levenshtein distance between “cat” and “cot” is 1 (substitute ‘a’ with ‘o’). Fuzzy queries expand the search to include terms within a specified edit distance.

The key concepts to understand fuzzy filters are:

Edit distance
Levenshtein distance

Edit distance

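An edit is an insertion, a deletion, or a replacement of a character. Some metrics, such as Damerau-Levenshtein, also count swapping two adjacent characters as a single edit; under plain Levenshtein, a swap costs two edits. In the context of fuzzy filters, edit distance is the number of one-character changes needed to transform one term into another. These changes can include: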

Replacing a character: For example, turn “box” into “fox.”
Deleting a character: For instance, change “black” to “lack.”
Inserting a character: For example, turn “all” to “ball.”
Swapping two adjacent characters: For instance, change “act” to “cat.”

Fuzzy queries use edit distance (most commonly measured by the Levenshtein distance) to find terms similar to a search term. The query creates a set of all possible variations within a specified edit distance and returns exact matches for each expansion.
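The expansion step can be illustrated with a minimal sketch that generates every string one edit away from a term and then keeps only the variations that exist in a vocabulary. This is a conceptual illustration, not KDB.AI's implementation; the function name `single_edit_variations`, the lowercase alphabet, and the sample vocabulary are assumptions made for the example.

```python
import string


def single_edit_variations(term, alphabet=string.ascii_lowercase):
    """Return every string exactly one edit away from `term`.

    Hypothetical helper for illustration only. The four edit types are
    the ones listed above: delete, swap, replace, insert.
    """
    splits = [(term[:i], term[i:]) for i in range(len(term) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    swaps = {
        left + right[1] + right[0] + right[2:]
        for left, right in splits
        if len(right) > 1
    }
    replaces = {
        left + ch + right[1:] for left, right in splits if right for ch in alphabet
    }
    inserts = {left + ch + right for left, right in splits for ch in alphabet}
    # Drop the term itself, which appears when a character is replaced
    # by itself or two identical neighbours are swapped.
    return (deletes | swaps | replaces | inserts) - {term}


# Assumed sample vocabulary; only exact matches of an expansion are kept.
vocabulary = {"fox", "lack", "ball", "cat", "dog"}

for query in ("box", "black", "all", "act"):
    print(query, sorted(single_edit_variations(query) & vocabulary))
# box ['fox']
# black ['lack']
# all ['ball']
# act ['cat']
```

The number of variations grows quickly with the term length and the allowed distance, which is one reason to keep the edit distance small.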

The Levenshtein distance

The Levenshtein distance is a number that measures how different two strings are. The higher the Levenshtein distance, the more different the strings are.

For example, the Levenshtein distance between “bitten” and “fitting” is 3 because it takes 3 text edits to change one into the other:

bitten → fitten (replace "b" with "f")
fitten → fittin (replace "e" with "i")
fittin → fitting (insert "g" at the end).
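The calculation above can be reproduced with the standard dynamic-programming formulation of the Levenshtein distance. The following is a minimal sketch for illustration; the function name `levenshtein` is chosen for this example and is not part of the KDB.AI client.

```python
def levenshtein(a, b):
    """Return the Levenshtein distance between strings `a` and `b`."""
    # previous[j] holds the distance between the processed prefix of `a`
    # and the first j characters of `b`.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # replace, or keep if equal
                )
            )
        previous = current
    return previous[-1]


print(levenshtein("bitten", "fitting"))  # 3
print(levenshtein("cat", "cot"))         # 1
```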

Check out the full list of distance metrics you can choose from.

Use cases

Fuzzy filters are a powerful tool with various use cases across different domains. Here are some key applications of fuzzy filters:

Spell Checking: fuzzy filters are commonly used in algorithms to suggest corrections for misspelled words. They help users find the correct term even if there are minor typos or spelling mistakes.
Data Cleaning: in data deduplication processes, fuzzy filters identify and merge records that are likely duplicates but may have slight variations in names, addresses, or other fields.
Autocomplete and Suggestions: search engines, websites, or applications rely on fuzzy filters to predict and suggest relevant search terms, even if the user input contains errors or incomplete information.
Information Retrieval: fuzzy filters allow users to find documents, articles, or records that closely match their queries, even when the exact terms are not used.
Product Matching in E-Commerce: e-commerce platforms use fuzzy filters to improve product matching. Customers searching for a product may receive relevant results, even if they use synonyms or alternative terms.
Name Matching in Databases: fuzzy filters can identify and link records with similar names but slight variations, reducing the chance of missing relevant information.
Geographic Search: when searching for locations or addresses, fuzzy filters can accommodate variations in spelling or abbreviations, ensuring accurate results in geographic searches.
Code Search: in software development, fuzzy filters assist programmers in quickly locating code snippets, functions, or methods, especially when they only remember parts of the code.

How to use fuzzy filters

In KDB.AI, you can apply fuzzy filters to metadata columns in queries and in the following search operations:

Similarity search
Temporal search (transformed/non-transformed)
Hybrid search

By adding a fuzzy parameter to the filtering options, you allow approximate string matching based on a specified edit distance.

Important! Supported metadata column types

Fuzzy search supports only the following column types: string, symbol, and enumeration. Numeric, boolean, and date columns (or collections thereof) are not supported.

Pre-requisites

Before using fuzzy filters with KDB.AI, ensure you have the following:

Python 3 (versions 3.8 to 3.11), Pip, and Git installed
Active KDB.AI Cloud or Server license
Valid API key for KDB.AI Cloud
Familiarity with vector databases and embedding models
An understanding of how to set up the configuration needed to interact with KDB.AI Cloud or Server

How to run a similarity search with fuzzy filters on metadata columns

Step 1. Conduct similarity search

Step 2. Apply fuzzy filter

You can apply a fuzzy filter within the filter expression of the search, by using the following arguments:

Parameter	Description	Type	Required	Default
fuzzy	Fuzzy filter keyword function	string	yes	none
colToScan	Column to scan	string	yes	none
params	List containing the search string, the edit distance, and, optionally, the distance metric	list	no	none
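As a rough sketch, the three arguments in the table can be assembled into a filter entry with a small helper. The helper name `fuzzy_filter` is hypothetical and not part of `kdbai_client`; the tuple shape it returns simply mirrors the filter expressions used in the example on this page.

```python
def fuzzy_filter(column, term, max_distance, metric=None):
    """Build one fuzzy filter entry (hypothetical helper, for illustration).

    The distance metric is optional; when it is omitted, the params list
    holds only the search string and the edit distance.
    """
    params = [term, max_distance]
    if metric is not None:
        params.append(metric)
    return ("fuzzy", column, [params])


print(fuzzy_filter("sym", "AMZN", 0))
# ('fuzzy', 'sym', [['AMZN', 0]])
print(fuzzy_filter("sym", "AM Z", 2, "hamming"))
# ('fuzzy', 'sym', [['AM Z', 2, 'hamming']])
```

The returned tuple can then be placed in the `filter` list alongside other filter entries, as in the searches shown in the example.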

Example: How to run a similarity search or query with fuzzy filters on metadata columns

Python

# Import the KDB.AI client and supporting libraries

import kdbai_client as kdbai
import pandas as pd
import numpy as np

# Generate dummy data with 1000 rows

n_rows = 1000
data = pd.DataFrame({
    'id': np.arange(n_rows, dtype='int32'), # Unique identifier for each row
    'time': pd.date_range(start='2020-01-01', periods=n_rows, freq='1MIN'), # Timestamp for each row
    'embeddings': [np.random.rand(12).astype('float32') for _ in range(n_rows)], # Random 12-dimensional embeddings
})

# List of stock tickers

tickers = [
    'AAPL', 'MSFT', 'GOOGL', 'AMZN', 'TSLA', 'FB', 'BRK.B', 'V',
    'JNJ', 'WMT', 'JPM', 'NVDA', 'PYPL', 'NFLX', 'DIS', 'ADBE',
    'PFE', 'INTC', 'KO', 'CSCO'
]

# Assign random tickers to the 'sym' column

data['sym'] = np.random.choice(tickers, size=n_rows)

# Define the schema for the table

schema = [
    {'name': 'id', 'type': 'int32'},
    {'name': 'sym', 'type': 'str'},
    {'name': 'time', 'type': 'datetime64[ns]'},
    {'name': 'embeddings', 'type': 'float32s'}]

index_name = 'vector_index'
indexes = [{'name': index_name, 'type': 'hnsw', 'column': 'embeddings', 'params': {'dims': 12, 'metric': 'L2', 'efConstruction': 8, 'M': 8}}]

# Create a table named "tickers" with the defined schema

session = kdbai.Session()  # by default, creates session to localhost:8082 using QIPC connection
database = session.database('default')
table = database.create_table('tickers', schema=schema, indexes=indexes)

# Insert the generated data into the table

table.insert(data)

# Query the table (example query)

table.query()

# Setting the edit distance to zero returns only the exact matches for 'AMZN'

table.query(filter=[('fuzzy', 'sym', [['AMZN', 0]])])

# Perform a similarity search with a vector and an exact match filter for 'AMZN'

table.search(vectors={index_name: [[0,1,2,3,4,0,1,2,3,4,1,2]]}, filter=[('fuzzy', 'sym', [['AMZN', 0]])])

# Run similarity search with multiple filters

table.search(vectors={index_name: [[0,1,2,3,4,0,1,2,3,4,1,2]]}, filter=[('<=', 'id', 300),('fuzzy', 'sym', [['AMZN', 0]])])

# Increasing the edit distance lets the fuzzy filter accept more variations of the target string

table.search(vectors={index_name: [[0,1,2,3,4,0,1,2,3,4,1,2]]}, filter=[('fuzzy', 'sym', [['AMN', 1]])])

# Choose a different distance metric

table.search(vectors={index_name: [[0,1,2,3,4,0,1,2,3,4,1,2]]}, filter=[('fuzzy', 'sym', [['AM Z', 2, 'hamming']])])

REST

# Use curl to send a POST request to the KDB.AI REST API

curl -s -H "Content-Type: application/json" localhost:8082/api/v2/databases/default/tables/tickers/search \
-d '{"n": 3, "vectors": {"vector_index": [[0,1,2,3,4,0,1,2,3,4,1,2]]}, "filter": [["fuzzy", "sym", [["AMN", 1]]]]}'

q

// gw is a handle to the gateway
vectors: enlist[`vector_index]!enlist[enlist[0 1 2 3 4 0 1 2 3 4 1 2]]
filters: enlist (`fuzzy;`sym;enlist (`AMN;1))
gw(`search;`database`table`n`vectors`filter!(`default;`tickers;3;vectors;filters))

Supported distance metrics

You can choose which distance metric the fuzzy filter uses. The default is Levenshtein. The full set of supported distance metrics is:

Fuzzy parameter	Name	Description	Notes
`levenshtein`	Levenshtein	Minimum number of single-character edits required to change string 1 into string 2.	Edits: insert, delete, replace.
`hamming`	Hamming	Minimum number of substitutions required to turn one string into another.	Edits: replace only. Applies only to strings of the same length.
`jaro`	Jaro	Measures similarity between two strings based on the matching and swapping of characters.	Focuses on the order and number of common characters.
`jaro_winkler`	Jaro-Winkler	Similar to Jaro, but gives higher scores to strings that match from the start.	Like Jaro, with a prefix scale factor.
`damerau_levenshtein`	Damerau-Levenshtein	Minimum number of operations needed to change string 1 into string 2.	Edits: insert, delete, replace, adjacent swap.
`lcs`	Longest Common Subsequence (LCS)	Based on the longest subsequence shared by two strings.	Edits: insert and delete only, no replace.
`osa`	Optimal String Alignment (OSA)	Similar to Damerau-Levenshtein, but each substring can be edited only once.	Edits: insert, delete, replace, adjacent swap.
`prefix`	Prefix	The edits needed to make the beginnings of the strings match.	For “unhappy” and “unhealthy,” how many edits change “unhap” to “unhea.”
`postfix`	Postfix	The edits needed to make the ends of the strings match.	For “unhappy” and “unhealthy,” how many edits change “ppy” to “thy.”
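To make the difference between metrics concrete, the sketch below implements two of them in plain Python: Hamming distance and an LCS-based distance. These are textbook formulations written for illustration and are not taken from KDB.AI; the function names are assumptions, and raising an error for strings of unequal length is a choice made for this sketch only.

```python
def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # Assumption for this sketch: unequal lengths are rejected.
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(char_a != char_b for char_a, char_b in zip(a, b))


def lcs_length(a, b):
    """Return the length of the longest common subsequence of `a` and `b`."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def lcs_distance(a, b):
    """Insertions plus deletions needed to turn `a` into `b` (no replace)."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


print(hamming("AM Z", "AMZN"))     # 2
print(lcs_distance("cat", "cot"))  # 2
```

The second result shows why the choice of metric matters: “cat” and “cot” are one edit apart under Levenshtein, but two edits apart under an LCS-based distance, because a replacement has to be expressed as a deletion followed by an insertion.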

Warning

Larger edit distances could make searches take longer or return a very large subset of the table.
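The effect of the edit distance on result size can be seen with a small offline sketch that applies a Levenshtein threshold to the ticker list from the example above. It runs in plain Python without a KDB.AI session; the helper names `levenshtein` and `fuzzy_matches` are assumptions made for this illustration.

```python
def levenshtein(a, b):
    """Return the Levenshtein distance between strings `a` and `b`."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


# Ticker list taken from the example on this page.
tickers = [
    "AAPL", "MSFT", "GOOGL", "AMZN", "TSLA", "FB", "BRK.B", "V",
    "JNJ", "WMT", "JPM", "NVDA", "PYPL", "NFLX", "DIS", "ADBE",
    "PFE", "INTC", "KO", "CSCO",
]


def fuzzy_matches(term, max_distance):
    """Return the tickers within `max_distance` edits of `term`."""
    return [t for t in tickers if levenshtein(term, t) <= max_distance]


# Print how many tickers match 'AMN' as the threshold grows.
for max_distance in range(4):
    matches = fuzzy_matches("AMN", max_distance)
    print(max_distance, len(matches), matches)
```

With a threshold of 1, only 'AMZN' matches 'AMN'. With a threshold of 3, every ticker of three or fewer characters matches, because any such string can be reached by replacing or deleting each character of the three-character search term.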

Best practices

To make fuzziness “fuzzier” and enhance search flexibility, consider tweaking fuzziness parameters as follows:

Customize Fuzziness: Specify the edit distance threshold (for example, >= 1).
Increase Thresholds: Allow more edits for longer strings, and keep the threshold small for short strings, where each edit changes a larger share of the term.
Prefix Length: Adjust the prefix length for better autocompletion.

Experiment with these settings to achieve the desired outcome.

Next steps

Now that you're familiar with fuzzy filters, you can do the following:

Optimize vector search with metadata filtering.
Visit our GitHub repo, open the sample or run the notebook directly in Google Colab.
