Best Matching 25 (BM25)

This page describes the functions and parameters of the Best Matching 25 (BM25) ranking algorithm in the KDB-X AI libraries.

The BM25 algorithm is a scoring method developed for document retrieval. BM25 is part of a family of ranking functions, with roots in the probabilistic information retrieval model.

For each token in the sparse query vector, the algorithm gives a positive weight to every document in the collection that contains that token. The weight increases with the rarity of the token in the collection and depends on two search parameters, k1 and b, which the functions on this page take as ck and cb.
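The weighting described above can be sketched in plain q. This is a minimal illustration of the commonly published Lucene-style formula, not the library's implementation: each query token contributes idf * tf / (tf + k1 * (1 - b + b * dl / avgdl)), where idf = ln(1 + (N - n + 0.5) / (n + 0.5)). Some formulations also multiply by the constant (k1 + 1), which does not change the ranking. The names docs, idf, and bm25 and the toy corpus are hypothetical.

```q
/ Toy corpus (hypothetical): each document is a list of token IDs
docs:(1 2 2 3;2 3 3 3 4;1 1 5)
N:count docs                / number of documents
avgdl:avg count each docs   / average document length

/ Lucene-style IDF, where n is the number of documents containing token t
idf:{[t] n:sum t in/:docs; log 1+(0.5+N-n)%n+0.5}

/ BM25 score of document d for the distinct query tokens q
bm25:{[ck;cb;q;d]
  tf:sum each d=/:q;                    / frequency of each query token in d
  norm:ck*(1-cb)+cb*count[d]%avgdl;     / length-normalised saturation term
  sum (idf each q)*tf%tf+norm }

scores:bm25[1.25;0.75;2 3] each docs
idesc scores                / document indices, best match first
```

Raising ck lets repeated occurrences of a token keep adding to the score for longer, while raising cb penalises documents that are longer than average more strongly. A document that contains none of the query tokens scores 0.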

.ai.bm25.psearch

The .ai.bm25.psearch function returns the top k documents for a query, searching the indexes stored across multiple partitions.

This function performs a partitioned search across multiple indexes using the BM25 ranking algorithm. It returns the top k most relevant documents and is especially useful for large-scale datasets where information is spread across multiple storage segments.

Note

KDB-X uses the Lucene variant of BM25.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| `indexName` | symbol | The name of the loaded `bm25` partitioned table |
| `q` | dict \| long[] | The sparse query vector |
| `k` | short \| int \| long | The number of top-scoring documents to return |
| `ck` | real \| float | The term frequency saturation parameter (k1) |
| `cb` | real \| float | The document length normalization parameter (b) |
| `parts` | long[] \| date[] \| month[] | The partitions to query |

Returns

| Type | Description |
| --- | --- |
| (real[]; long[]) | The `bm25` scores and the matching document indices. The returned indices are ready to use with `.Q.ind`. |

Example

q).ai:use`kx.ai
q)sparse:(100;100)#10000?200j;
q)index:.ai.bm25.put[()!();1.25e;0.75e;sparse];
q).ai.bm25.search[index;first sparse;5;1.25e;0.75e];
q)dates:reverse .z.D-til 3;
q)path:`:db;
q)paths:` sv/:path,/:`$string dates;
q)indexName:`test;
q).ai.bm25.write[;index;indexName] each paths;
q).Q.lo[`:db;0;0];
q).ai.bm25.psearch[`test;first sparse;5;1.25e;0.75e;dates]

82.95999 82.95999 82.95999 43.1496 43.1496
0        100      200      25      125

This example builds a BM25 index from a synthetic sparse matrix, writes three dated partitions of that index to :db, reloads the database, and then runs a partitioned search (psearch) across those dates. It queries with the first sparse vector, requests the top 5 results, and uses BM25 parameters k1=1.25, b=0.75. The two output rows are the scores (top) and their corresponding document IDs (bottom), aggregated across partitions.
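The indices in the output can be related back to partitions with simple arithmetic. The sketch below assumes that indices are numbered consecutively across partitions in the order they are stored, which is consistent with the top score repeating at 0, 100, and 200 (the same 100-document index was written to each of the three partitions). The dates and the names n, part, and row are hypothetical; in practice .Q.ind is the intended way to look the rows up.

```q
/ Indices copied from the example output; dates are hypothetical
dates:2024.01.01+til 3
ids:0 100 200 25 125
n:100                      / documents per partition in this example
part:dates ids div n       / partition each hit came from
row:ids mod n              / row of the hit within its partition
([]part;row)
```

This only works when every partition holds the same number of documents. For partitions of different sizes, the offsets would have to come from the per-partition row counts.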

.ai.bm25.put

The .ai.bm25.put function inserts sparse vectors into a bm25 index.

This function inserts sparse vector representations of documents into a BM25 index. By storing these sparse vectors, the index can later be used to perform relevance-based ranking and retrieval.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| `index` | dict | The `bm25` object |
| `ck` | real \| float | The term frequency saturation parameter (k1) |
| `cb` | real \| float | The document length normalization parameter (b) |
| `sparse` | dict \| long[] | A list of tokenizer input IDs, or a dictionary of grouped and counted tokenizer input IDs |

Returns

| Type | Description |
| --- | --- |
| table \| dict | The updated `bm25` object |

Example

q).ai:use`kx.ai
q)sparse:(100;100)#10000?200j;
q)index:.ai.bm25.put[()!();1.25e;0.75e;sparse]

token   | +`token`document`occurs`noccurs!(`g#0 1 2 3 4 5 6 7 8 9 10 11 12 13..
document| +`dlen`denoms!(100 100 100 100 100 100 100 100 100 100 100 100 100 ..
stats   | +`ck`cb!(,1.25e;,0.75e)

This example inserts a 100x100 sparse matrix into a fresh BM25 index with k1=1.25 and b=0.75. The returned dictionary holds three tables that make up the index: token (postings by term), document (per-document stats such as length and denominators), and stats (global BM25 parameters). It demonstrates how a raw sparse representation becomes a queryable BM25 index.
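The sparse parameter is described as either a list of tokenizer input IDs or the same IDs grouped and counted. A minimal sketch of how one form relates to the other in plain q is shown below. The input IDs are made up, and whether the library expects exactly this dictionary layout is an assumption.

```q
/ Hypothetical tokenizer output for one document: a list of input IDs
ids:3 7 7 9 3 3
/ Group equal IDs and count them: a dictionary of input ID to occurrences
counted:count each group ids
counted
```

Here counted maps 3 to 3, 7 to 2, and 9 to 1, so each distinct token appears once alongside its term frequency.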

.ai.bm25.score

The .ai.bm25.score function calculates BM25 scores for a query against the contents of an index.

This function computes a BM25 score for every document in an index against a given query. The scores quantify how relevant each document is to the query. They are the underlying ranking values that drive search results, which makes the function useful for debugging, analysis, or custom ranking workflows.

Note

KDB-X uses the Lucene variant of BM25.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| `index` | dict \| symbol | The `bm25` object or on-disk name |
| `q` | dict \| long[] | The sparse query vector |
| `ck` | real \| float | The term frequency saturation parameter (k1) |
| `cb` | real \| float | The document length normalization parameter (b) |

Returns

| Type | Description |
| --- | --- |
| real[] | The `bm25` scores |

Example

q).ai:use`kx.ai
q)sparse:(100;100)#10000?200j;
q)index:.ai.bm25.put[()!();1.25e;0.75e;sparse];
q).ai.bm25.score[index;first sparse;1.25e;0.75e]

81.89743 28.61314 33.78403 33.34681 34.98316 24.59848 36.47417 27.12549 29.11884 33.15578 30.1717..

After building an index from the sparse data, the example computes BM25 scores for every document against a single query vector (the first row). Unlike a top-k search, it returns a full score array, one value per document. This is useful for custom ranking, thresholding, or debugging the effect of k1/b.
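Custom ranking and thresholding over a full score array need only q primitives. The sketch below uses a short, made-up score vector in place of the output of .ai.bm25.score; the names scores and top are hypothetical.

```q
/ Hypothetical score vector, standing in for the output of .ai.bm25.score
scores:81.9 28.6 33.8 33.3 35e
top:3#idesc scores          / indices of the 3 highest scores: 0 4 2
scores top                  / their scores, in rank order
where scores>34e            / indices of documents above a threshold: 0 4
```

Taking the first k items of idesc gives the same kind of result as a top-k search, while where keeps every document that clears a fixed score regardless of how many there are.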

.ai.bm25.search

The .ai.bm25.search function returns the top k highest-scoring documents for a sparse query.

This function scores the documents in a single index against the query using BM25 and returns the k highest-scoring ones, ranked from most to least relevant. Use it for ranked retrieval over a single index; for data spread across partitions, use .ai.bm25.psearch.

Note

KDB-X uses the Lucene variant of BM25.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| `index` | dict \| symbol | The `bm25` object or on-disk name |
| `q` | dict \| long[] | The sparse query vector |
| `k` | short \| int \| long | The number of top-scoring documents to return |
| `ck` | real \| float | The term frequency saturation parameter (k1) |
| `cb` | real \| float | The document length normalization parameter (b) |

Returns

| Type | Description |
| --- | --- |
| (real[]; long[]) | The `bm25` scores and the matching document indices |

Example

q).ai:use`kx.ai
q)sparse:(100;100)#10000?200j;
q)index:.ai.bm25.put[()!();1.25e;0.75e;sparse];
q).ai.bm25.search[index;first sparse;5;1.25e;0.75e]

81.73318 44.41976 41.04497 40.98081 39.71086
0        54       66       29       21

Here the index is queried with the first sparse vector and the top 5 nearest documents are returned using BM25 scoring (k1=1.25, b=0.75). The output's first line lists the relevance scores, and the second line lists the matching document IDs in rank order. It illustrates the standard retrieval workflow for ranked results.
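Because the result is a pair of parallel lists, it can be convenient to turn it into a table before filtering or joining. A small sketch follows, using the values from the example output; the names res, t, score, and docId are hypothetical.

```q
/ Result copied from the example output: (scores; document indices)
res:(81.73318 44.41976 41.04497 40.98081 39.71086e;0 54 66 29 21)
t:flip `score`docId!res     / one row per hit, already in rank order
select from t where score>41e
```

The table keeps the rank order of the search result, so the first row is always the best match.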

.ai.bm25.write

The .ai.bm25.write function saves a bm25 index to disk as three tables.

This function writes a BM25 index to disk by splitting it into three separate tables (stats, token, and document). Persisting the index lets you reload and reuse it across sessions without rebuilding it, which avoids repeating the indexing work for large collections.

Parameters

| Name | Type | Description |
| --- | --- | --- |
| `path` | symbol | The file handle of the directory to save to |
| `index` | dict | The `bm25` index |
| `indexName` | symbol | The name to save the index as on disk |

Returns

| Type | Description |
| --- | --- |
| symbol[] | The file handles of the saved components |

Example

q).ai:use`kx.ai
q)sparse:(100;100)#10000?200j;
q)index:.ai.bm25.put[()!();1.25e;0.75e;sparse];
q).ai.bm25.write[`:db;index;`test]

`:db/teststats/`:db/testtoken/`:db/testdocument/

This example persists the in-memory BM25 index to disk under :db with the logical name test. The output shows the three persisted tables (teststats/, testtoken/, and testdocument/), which together reconstruct the index on reload. It demonstrates how to save an index for reuse across sessions or deployments.
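The reload step for a non-partitioned index is not shown on this page. As a sketch only, assuming .Q.lo loads the saved tables the same way it does in the .ai.bm25.psearch example, and assuming the saved name can then be passed as the index argument (the search and score parameter tables list symbol as an accepted type), the round trip might look like this:

```q
.ai:use`kx.ai
sparse:(100;100)#10000?200j
index:.ai.bm25.put[()!();1.25e;0.75e;sparse]
.ai.bm25.write[`:db;index;`test]
.Q.lo[`:db;0;0]             / load the saved tables, as in the psearch example
/ Assumption: the on-disk name can be passed in place of the in-memory index
res:.ai.bm25.search[`test;first sparse;5;1.25e;0.75e]
```

If the assumption holds, res has the same shape as the result of searching the in-memory index: a list of scores and a list of document indices.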

Next steps

Read the Hybrid Search with BM25 in KDB-X AI Libraries tutorial on Medium.