Fuzzy filters
This page explains fuzzy filters in KDB.AI.
Fuzzy filters improve vector search efficiency and recall by narrowing down candidates based on specific criteria. Using fuzzy filters leads to more accurate results, even when dealing with imprecise or “fuzzy” queries.
Benefits of fuzzy filters
Fuzzy filters are particularly useful when using data that may contain errors, typos, or variations. For example, if you apply fuzzy filters to searches, you can find documents that contain similar terms to a specified query, even if they're not exactly the same.
The addition of fuzzy filters provides the following benefits:
- Handling Typos: fuzzy filters are essential for handling approximate or imprecise search terms. Users often make typos or minor spelling errors, and fuzzy search helps retrieve relevant results despite these variations.
- User-Friendly Experience: by accommodating slight variations in search terms, fuzzy filters ensure a better user experience. Users don’t need to input the exact term to get meaningful results.
- Robustness: fuzzy filters make your search system more robust by accounting for real-world data imperfections.
- Increased Recall: fuzzy filters increase recall by capturing relevant documents that might have been missed due to minor differences in spelling or input errors.
How do fuzzy filters work?
Fuzzy filter algorithms often rely on the concept of edit distance, which measures the minimum number of operations (insertions, deletions, substitutions) required to transform one string into another.
Common algorithms include Levenshtein distance and Damerau-Levenshtein distance. For example, the Levenshtein distance between “cat” and “cot” is 1 (substitute ‘a’ with ‘o’). Fuzzy queries expand the search to include terms within a specified edit distance.
The key concepts to understand fuzzy filters are:
Edit distance
An edit is an insertion, a deletion, or a replacement of a character; some metrics, such as Damerau-Levenshtein, also count a swap of two adjacent characters as a single edit. In the context of fuzzy filters, edit distance is the number of one-character changes needed to transform one term into another. These changes can include:
- Replacing a character: For example, turn “box” into “fox.”
- Deleting a character: For instance, change “black” to “lack.”
- Inserting a character: For example, turn “all” to “ball.”
- Swapping two adjacent characters: For instance, change “act” to “cat.”
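The four operations listed above can be sketched in plain Python. This is a minimal illustration of an optimal string alignment style distance (insert, delete, replace, adjacent swap); the function name `osa_distance` is hypothetical and the code is not KDB.AI's internal implementation.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance allowing insert, delete, replace, and adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between a[:i] and b[:j]
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # replacement (or match)
            )
            # adjacent swap: the last two characters are transposed
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]


for source, target in [("box", "fox"), ("black", "lack"), ("all", "ball"), ("act", "cat")]:
    print(source, "->", target, osa_distance(source, target))
```

Each of the four pairs is one edit apart under this metric. A metric without the swap operation, such as plain Levenshtein, would count “act” to “cat” as two replacements.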
Fuzzy queries use edit distance (most commonly measured by the Levenshtein distance) to find terms similar to a search term. The query creates a set of all possible variations within a specified edit distance and returns exact matches for each expansion.
The Levenshtein distance
The Levenshtein distance is a number that measures how different two strings are. The higher the Levenshtein distance, the more different the strings are.
For example, the Levenshtein distance between “bitten” and “fitting” is 3 because it takes 3 text edits to change one into the other:
- bitten → fitten (replace "b" with "f")
- fitten → fittin (replace "e" with "i")
- fittin → fitting (insert "g" at the end).
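The worked example above can be checked with a short dynamic-programming sketch. The function name `levenshtein` is chosen here for illustration; it shows the standard algorithm, not necessarily how KDB.AI computes the metric internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and replacements turning a into b."""
    # prev[j] is the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # replace, or keep if equal
            ))
        prev = curr
    return prev[-1]


print(levenshtein("bitten", "fitting"))  # 3
print(levenshtein("cat", "cot"))         # 1
```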
Check out the full list of distance metrics you can choose from.
Use cases
Fuzzy filters are a powerful tool with various use cases across different domains. Here are some key applications of fuzzy filters:
- Spell Checking: fuzzy filters are commonly used in algorithms to suggest corrections for misspelled words. They help users find the correct term even if there are minor typos or spelling mistakes.
- Data Cleaning: in data deduplication processes, fuzzy filters identify and merge records that are likely duplicates but may have slight variations in names, addresses, or other fields.
- Autocomplete and Suggestions: search engines, websites, or applications rely on fuzzy filters to predict and suggest relevant search terms, even if the user input contains errors or incomplete information.
- Information Retrieval: fuzzy filters allow users to find documents, articles, or records that closely match their queries, even when the exact terms are not used.
- Product Matching in E-Commerce: e-commerce platforms use fuzzy filters to improve product matching. Customers searching for a product may receive relevant results, even if they use synonyms or alternative terms.
- Name Matching in Databases: fuzzy filters can identify and link records with similar names but slight variations, reducing the chance of missing relevant information.
- Geographic Search: when searching for locations or addresses, fuzzy filters can accommodate variations in spelling or abbreviations, ensuring accurate results in geographic searches.
- Code Search: in software development, fuzzy filters assist programmers in quickly locating code snippets, functions, or methods, especially when they only remember parts of the code.
How to use fuzzy filters
In KDB.AI, you can apply fuzzy filters on the metadata columns for query and the following search operations:
- Similarity search
- Temporal search (transformed/non-transformed)
- Hybrid search
By adding a fuzzy parameter to the filtering options, you allow approximate string matching based on a specified edit distance.
Important! Supported metadata column types
Fuzzy search only supports searches of the following column types: string, symbol, and enumeration. Numeric, boolean, and date columns (or collections thereof) are not supported.
Pre-requisites
Before using fuzzy filters with KDB.AI, ensure you have the following:
- Python 3 (versions 3.8 to 3.11), Pip, and Git installed
- Active KDB.AI Cloud or Server license
- Valid API key for KDB.AI Cloud
- Familiarity with vector databases and embedding models
- An understanding of how to set up the necessary configurations for interacting with either KDB.AI Cloud or Server
How to run a similarity search with fuzzy filters on metadata columns
Step 1. Conduct similarity search
Step 2. Apply fuzzy filter
You can apply a fuzzy filter within the filter expression of the search, by using the following arguments:
Parameter | Description | Type | Required | Default |
---|---|---|---|---|
fuzzy | Fuzzy filter keyword function | string | yes | none |
colToScan | Column to scan | string | yes | none |
params | Triple of search string, edit distance, and distance metric (optional) | list | no | none |
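The three arguments in the table combine into a single filter entry. As a minimal sketch, the helper below assembles that entry in the same shape as the Python examples on this page. The name `fuzzy_filter` and its parameters are hypothetical and are not part of `kdbai_client`.

```python
from typing import Optional


def fuzzy_filter(column: str, term: str, max_edits: int,
                 metric: Optional[str] = None) -> tuple:
    """Build a ('fuzzy', column, [[term, max_edits, metric?]]) filter entry."""
    params = [term, max_edits]
    if metric is not None:
        # The distance metric is optional; omit it to use the default
        params.append(metric)
    return ("fuzzy", column, [params])


# Exact match on 'AMZN' (edit distance 0)
print(fuzzy_filter("sym", "AMZN", 0))
# Up to two substitutions using the Hamming metric
print(fuzzy_filter("sym", "AM Z", 2, "hamming"))
```

The returned tuple can be placed in the `filter` list of a query or search call alongside other filters.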
Example: How to run a similarity search or query with fuzzy filters on metadata columns
# Connect to your KDB.AI session
import kdbai_client as kdbai
import pandas as pd
import numpy as np
# Generate dummy data with 1000 rows
n_rows = 1000
data = pd.DataFrame({
'id': np.arange(n_rows, dtype='int32'), # Unique identifier for each row
'time': pd.date_range(start='2020-01-01', periods=n_rows, freq='1MIN'), # Timestamp for each row
'embeddings': [np.random.rand(12).astype('float32') for _ in range(n_rows)], # Random 12-dimensional embeddings
})
# List of stock tickers
tickers = [
'AAPL', 'MSFT', 'GOOGL', 'AMZN', 'TSLA', 'FB', 'BRK.B', 'V',
'JNJ', 'WMT', 'JPM', 'NVDA', 'PYPL', 'NFLX', 'DIS', 'ADBE',
'PFE', 'INTC', 'KO', 'CSCO'
]
# Assign random tickers to the 'sym' column
data['sym'] = np.random.choice(tickers, size=n_rows)
# Define the schema for the table
schema = [
{'name': 'id', 'type': 'int32'},
{'name': 'sym', 'type': 'str'},
{'name': 'time', 'type': 'datetime64[ns]'},
{'name': 'embeddings', 'type': 'float32s'}]
index_name = 'vector_index'
indexes = [{'name': index_name, 'type': 'hnsw', 'column': 'embeddings', 'params': {'dims': 12, 'metric': 'L2', 'efConstruction': 8, 'M': 8}}]
# Create a table named "tickers" with the defined schema
session = kdbai.Session() # by default, creates session to localhost:8082 using QIPC connection
database = session.database('default')
table = database.create_table('tickers', schema=schema, indexes=indexes)
# Insert the generated data into the table
table.insert(data)
# Query the table (example query)
table.query()
# By setting the edit distance to zero we get all the exact matches for 'AMZN'
table.query(filter=[('fuzzy', 'sym', [['AMZN', 0]])])
# Perform a similarity search with a vector and an exact match filter for 'AMZN'
table.search(vectors={index_name: [[0,1,2,3,4,0,1,2,3,4,1,2]]}, filter=[('fuzzy', 'sym', [['AMZN', 0]])])
# Run similarity search with multiple filters
table.search(vectors={index_name: [[0,1,2,3,4,0,1,2,3,4,1,2]]}, filter=[('<=', 'id', 300),('fuzzy', 'sym', [['AMZN', 0]])])
# Increasing the edit distance allows fuzzy to accept more variations of the target string
table.search(vectors={index_name: [[0,1,2,3,4,0,1,2,3,4,1,2]]}, filter=[('fuzzy', 'sym', [['AMN', 1]])])
# Choose a different distance metric
table.search(vectors={index_name: [[0,1,2,3,4,0,1,2,3,4,1,2]]}, filter=[('fuzzy', 'sym', [['AM Z', 2, 'hamming']])])
# Use curl to send a POST request to the KDB.AI REST API
curl -s -H "Content-Type: application/json" localhost:8082/api/v2/databases/default/tables/tickers/search \
-d '{"n": 3, "vectors": {"vector_index": [[0,1,2,3,4,0,1,2,3,4,1,2]]}, "filter": [["fuzzy", "sym", [["AMN", 1]]]]}'
// gw is a handler to the gateway
vectors: enlist[`vector_index]!enlist[enlist[0 1 2 3 4 0 1 2 3 4 1 2]]
filters: enlist (`fuzzy;`sym;enlist (`AMN;1))
gw(`search;`database`table`n`vectors`filter!(`default;`tickers;3;vectors;filters))
Supported distance metrics
You can set a distance metric of preference. The default is Levenshtein. Below is the extended set of distance metrics you can choose from:
Fuzzy parameter | Name | Description | Notes |
---|---|---|---|
levenshtein | Levenshtein | Minimum number of single-character edits required to change string 1 into string 2. | Allows insert, delete, replace. |
hamming | Hamming | Minimum number of substitutions required to turn one string into another. | Allows replace only. Strings must be the same length. |
jaro | Jaro | Measures similarity between two strings based on the matching and swapping of characters. | Focuses on the order and number of common characters. |
jaro_winkler | Jaro-Winkler | Similar to Jaro. Gives higher scores to strings that match from the start. | Like Jaro but with a prefix scale factor. |
damerau_levenshtein | Damerau-Levenshtein | Number of operations needed to change string 1 into string 2. | Allows insert, delete, replace, adjacent swap. |
lcs | Longest Common Subsequence (LCS) | Finds the longest subsequence shared by two strings. | Allows delete and insert only, not replace. |
osa | Optimal String Alignment (OSA) | Similar to Damerau-Levenshtein, but each substring can be edited only once. | Allows insert, delete, replace, adjacent swap. |
prefix | Prefix | Measures similarity or dissimilarity at the beginning of the strings, in terms of the edits needed. | For “unhappy” and “unhealthy,” how many edits change “unhap” to “unhea.” |
postfix | Postfix | Measures similarity or dissimilarity at the end of the strings, in terms of the edits needed. | For “unhappy” and “unhealthy,” how many edits change “ppy” to “thy.” |
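To make two of the more restrictive metrics in the table concrete, here is a minimal sketch of a Hamming distance (replace only, equal lengths) and an LCS-based distance (insert and delete only). The function names are illustrative, and whether KDB.AI's `hamming` and `lcs` options compute exactly these values is an assumption.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of the same length")
    return sum(x != y for x, y in zip(a, b))


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_distance(a: str, b: str) -> int:
    """Edits needed when only insertions and deletions are allowed."""
    return len(a) + len(b) - 2 * lcs_length(a, b)


print(hamming("AM Z", "AMZN"))      # 2: the last two positions differ
print(lcs_distance("cat", "cot"))   # 2: delete 'a', insert 'o'
print(lcs_distance("AMN", "AMZN"))  # 1: insert 'Z'
```

Because replacement is not available under the insert/delete-only distance, “cat” to “cot” costs two edits there, compared with one under Levenshtein.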
Warning
Larger edit distances could make searches take longer or return a very large subset of the table.
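The effect described in this warning can be seen locally, without a KDB.AI server. The sketch below reuses the ticker list from the example above and counts how many symbols fall within a growing Levenshtein distance of 'AMN'. The helpers `levenshtein` and `matches` are illustrative stand-ins for the server-side filter, not its implementation.

```python
TICKERS = [
    'AAPL', 'MSFT', 'GOOGL', 'AMZN', 'TSLA', 'FB', 'BRK.B', 'V',
    'JNJ', 'WMT', 'JPM', 'NVDA', 'PYPL', 'NFLX', 'DIS', 'ADBE',
    'PFE', 'INTC', 'KO', 'CSCO',
]


def levenshtein(a: str, b: str) -> int:
    """Standard insert/delete/replace edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def matches(term: str, max_edits: int) -> list:
    """Tickers within max_edits of term, in list order."""
    return [t for t in TICKERS if levenshtein(t, term) <= max_edits]


# The candidate set can only grow as the allowed distance grows
for max_edits in range(6):
    print(max_edits, len(matches('AMN', max_edits)))
```

With a distance of 1 only 'AMZN' matches, while a distance of 5 accepts every ticker in the list, so the filter no longer narrows the search at all.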
Best practices
To make fuzziness “fuzzier” and enhance search flexibility, consider tweaking fuzziness parameters as follows:
- Customize Fuzziness: Specify the edit distance threshold (for example, >= 1).
- Increase Thresholds: Allow more edits for longer strings, because a large edit distance on a short string matches almost any term.
- Prefix Length: Adjust the prefix length for better autocompletion.
Experiment with these settings to achieve the desired outcome.
Next steps
Now that you're familiar with fuzzy filters, you can do the following:
- Optimize vector search with metadata filtering.
- Visit our GitHub repo, open the sample or run the notebook directly in Google Colab.