Best Matching 25 (BM25)
This page describes the functions and parameters for the Best Matching 25 (BM25) ranking algorithm in the AI libs.
The BM25 algorithm is a scoring method developed for document retrieval. BM25 is part of a family of ranking functions, with roots in the probabilistic information retrieval model.
For each token in the sparse query vector, the algorithm gives a positive weight to any element of the collection that contains that token. Each weight increases with the rarity of the token in the collection and depends on two search parameters: k1 (the term saturation, passed as ck) and b (the impact of document length on relevance, passed as cb).
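As a reference point for the weighting described above, the Lucene-style BM25 score is commonly written as follows. This is a sketch of the usual formulation, not a statement of the library's internals: whether the AI libs implementation matches it term by term is an assumption. Here $N$ is the number of documents, $n_t$ the number of documents containing token $t$, $f_{t,d}$ the number of occurrences of $t$ in document $d$, $|d|$ the document length, and $\mathrm{avgdl}$ the average document length.

```latex
\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{idf}(t) \cdot
  \frac{f_{t,d}}{f_{t,d} + k_1 \left(1 - b + b \, \frac{|d|}{\mathrm{avgdl}}\right)},
\qquad
\mathrm{idf}(t) = \ln\!\left(1 + \frac{N - n_t + 0.5}{n_t + 0.5}\right)
```

The classic Okapi form multiplies the fraction by the constant $(k_1 + 1)$; recent Lucene versions omit that factor, which rescales the scores but does not change the ranking.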
.ai.bm25.psearch
The .ai.bm25.psearch function searches BM25 indexes across partitions and returns the top k documents.
This function runs a BM25 search against the index stored in each requested partition and aggregates the results into a single top k list. It is especially useful for large datasets whose indexes are spread across multiple partitions.
Note
KDB-X uses the Lucene variant of BM25.
Parameters
| Name | Type | Description |
|---|---|---|
| indexName | symbol | The name of the loaded bm25 partitioned table |
| q | dict \| long[] | The query sparse object |
| k | short \| int \| long | The number of nearest neighbors to retrieve |
| ck | real \| float | The term saturation |
| cb | real \| float | The document length impact on relevance |
| parts | long[] \| date[] \| month[] | The partitions to query from |
Returns
| Type | Description |
|---|---|
| (real[]; long[]) | The bm25 scores and indices. The returned indices are ready to use with .Q.ind. |
Example
q).ai:use`kx.ai
q)sparse:(100;100)#10000?200j;
q)index:.ai.bm25.put[()!();1.25e;0.75e;sparse];
q).ai.bm25.search[index;first sparse;5;1.25e;0.75e];
q)dates:reverse .z.D-til 3;
q)path:`:db;
q)paths:` sv/:path,/:`$string dates;
q)indexName:`test;
q).ai.bm25.write[;index;indexName] each paths;
q).Q.lo[`:db;0;0];
q).ai.bm25.psearch[`test;first sparse;5;1.25e;0.75e;dates]
82.95999 82.95999 82.95999 43.1496 43.1496
0 100 200 25 125
This example builds a BM25 index from a synthetic sparse matrix, writes three dated partitions of that index to :db, reloads the database, and then runs a partitioned search (psearch) across those dates. It queries with the first sparse vector, requests the top 5 results, and uses BM25 parameters k1=1.25, b=0.75. The two output rows are the scores (top) and their corresponding document IDs (bottom), aggregated across partitions.
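The returned indices can be read as row numbers that run consecutively across the queried partitions, which is the form .Q.ind expects. The sketch below decodes them, assuming 100 rows per partition (as in this example, where the same 100-document index is written to each date). The table name `docs` in the final comment is hypothetical and stands for a partitioned table holding the source documents.

```q
idx:0 100 200 25 125     / indices returned by psearch in the example above
idx div 100              / position of the partition: 0 1 2 0 1
idx mod 100              / row within that partition: 0 0 0 25 25
/ .Q.ind[docs;idx]       / hypothetical: fetch those rows from a partitioned table docs
```

Rows 0, 100, and 200 are the same document in each of the three partitions, which is why they share one score.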
.ai.bm25.put
The .ai.bm25.put function inserts sparse vectors into a bm25 index.
Each sparse vector represents one document. Once the vectors are stored, the index can be used for relevance-based ranking and retrieval.
Parameters
| Name | Type | Description |
|---|---|---|
| index | dict | The bm25 object |
| ck | real \| float | The term saturation |
| cb | real \| float | The document length impact on relevance |
| sparse | dict \| long[] | A list of tokenizer input IDs, or the same IDs grouped and counted |
Returns
| Type | Description |
|---|---|
| table \| dict | The updated bm25 object |
Example
q).ai:use`kx.ai
q)sparse:(100;100)#10000?200j;
q)index:.ai.bm25.put[()!();1.25e;0.75e;sparse]
token | +`token`document`occurs`noccurs!(`g#0 1 2 3 4 5 6 7 8 9 10 11 12 13..
document| +`dlen`denoms!(100 100 100 100 100 100 100 100 100 100 100 100 100 ..
stats | +`ck`cb!(,1.25e;,0.75e)
This example inserts 100 synthetic documents, each a list of 100 token IDs, into a fresh BM25 index with k1=1.25 and b=0.75. The returned dictionary holds three tables that make up the index: token (postings by term), document (per-document stats such as length and denominators), and stats (global BM25 parameters). It demonstrates how a raw sparse representation becomes a queryable BM25 index.
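The sparse parameter accepts either a list of token IDs or a dictionary. The sketch below shows one plausible reading of the "grouped and counted" dictionary form, built with core q primitives. The token IDs are hypothetical, and the assumption that the library expects exactly this shape is not confirmed here.

```q
ids:3 7 3 9 7 3               / hypothetical token IDs for one document
counts:count each group ids   / token ID -> number of occurrences
counts                        / 3 7 9!3 2 1
```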
.ai.bm25.score
The .ai.bm25.score function calculates BM25 scores for a query against the contents of an index.
This function returns one score per document in the index, quantifying how relevant that document is to the query. These are the underlying ranking values that drive search results, which makes the function useful for debugging, analysis, or custom ranking workflows.
Note
KDB-X uses the Lucene variant of BM25.
Parameters
| Name | Type | Description |
|---|---|---|
| index | dict \| symbol | The bm25 object or on-disk name |
| q | dict \| long[] | The query sparse object |
| ck | real \| float | The term saturation |
| cb | real \| float | The document length impact on relevance |
Returns
| Type | Description |
|---|---|
| real[] | The bm25 scores |
Example
q).ai:use`kx.ai
q)sparse:(100;100)#10000?200j;
q)index:.ai.bm25.put[()!();1.25e;0.75e;sparse];
q).ai.bm25.score[index;first sparse;1.25e;0.75e]
81.89743 28.61314 33.78403 33.34681 34.98316 24.59848 36.47417 27.12549 29.11884 33.15578 30.1717..
After building an index from the sparse data, the example computes BM25 scores for every document against a single query vector (the first row). Unlike a top-k search, it returns a full score array, one value per document. This is useful for custom ranking, thresholding, or debugging the effect of k1/b.
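To see how k1 and b shape the scores, it can help to experiment on a toy corpus outside the library. The following is a minimal sketch of Lucene-style BM25 in plain q. It is not the library's implementation, and the corpus and all names (docs, idf, termScore, bm25) are hypothetical.

```q
/ toy corpus (hypothetical): each document is a list of token IDs
docs:(1 2;1 2 3 3 3 3;4 5)
N:count docs                  / number of documents
dl:count each docs            / document lengths
avgdl:avg dl                  / average document length

/ Lucene-style idf of token t; n is the number of documents containing t
idf:{[t] n:sum t in/: docs; log 1+((N-n)+0.5)%n+0.5}

/ score of every document for one token t, given k1 and b
termScore:{[k1;b;t] f:sum each docs=t; idf[t]*f%f+k1*(1-b)+b*dl%avgdl}

/ query score: sum of the per-token scores
bm25:{[k1;b;qry] sum termScore[k1;b] each qry}

bm25[1.25;0.75;1 2]           / the short first document outscores the longer second one
bm25[1.25;0f;1 2]             / b=0 ignores length: the first two scores are equal
```

Both of the first two documents contain tokens 1 and 2 exactly once, so any difference between their scores comes from the length normalization controlled by b.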
.ai.bm25.search
The .ai.bm25.search function returns the top k nearest neighbors for a sparse query.
This function scores the query against the index with BM25 and returns the k most relevant documents in rank order, together with their scores.
Note
KDB-X uses the Lucene variant of BM25.
Parameters
| Name | Type | Description |
|---|---|---|
| index | dict \| symbol | The bm25 object |
| q | dict \| long[] | The query sparse object |
| k | short \| int \| long | The number of nearest neighbors to retrieve |
| ck | real \| float | The term saturation |
| cb | real \| float | The document length impact on relevance |
Returns
| Type | Description |
|---|---|
| (real[]; long[]) | The bm25 scores and the corresponding indices |
Example
q).ai:use`kx.ai
q)sparse:(100;100)#10000?200j;
q)index:.ai.bm25.put[()!();1.25e;0.75e;sparse];
q).ai.bm25.search[index;first sparse;5;1.25e;0.75e]
81.73318 44.41976 41.04497 40.98081 39.71086
0 54 66 29 21
Here the index is queried with the first sparse vector and the top 5 nearest documents are returned using BM25 scoring (k1=1.25, b=0.75). The output's first line lists the relevance scores, and the second line lists the matching document IDs in rank order. It illustrates the standard retrieval workflow for ranked results.
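Conceptually, a top-k search can be viewed as scoring every document and keeping the k best. The sketch below does this by hand on a short score vector (the first five values from the .ai.bm25.score example on this page). Whether .ai.bm25.search works this way internally is not stated here, so treat it as an illustration of the output shape only.

```q
scores:81.89743 28.61314 33.78403 33.34681 34.98316e   / sample real-typed scores
top:3#idesc scores       / indices of the 3 highest scores: 0 4 2
(scores top;top)         / (scores; indices), the same shape that search returns
where scores>33.5        / thresholding instead of top-k: 0 2 4
```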
.ai.bm25.write
The .ai.bm25.write function saves a bm25 index to disk as three tables.
This function splits the index into three tables (stats, token, and document) and saves each one under the given path, with the index name as a prefix. A persisted index can be reloaded and reused across sessions without being rebuilt.
Parameters
| Name | Type | Description |
|---|---|---|
| path | symbol | The file handle of the save location |
| index | dict | The bm25 index |
| indexName | symbol | The name to save the index as on disk |
Returns
| Type | Description |
|---|---|
| symbol[] | The file handles of the saved components |
Example
q).ai:use`kx.ai
q)sparse:(100;100)#10000?200j;
q)index:.ai.bm25.put[()!();1.25e;0.75e;sparse];
q).ai.bm25.write[`:db;index;`test]
`:db/teststats/`:db/testtoken/`:db/testdocument/
This example persists the in-memory BM25 index to disk under :db with the logical name test. The output shows the three persisted tables (teststats/, testtoken/, and testdocument/), which together reconstruct the index on reload. It demonstrates how to save an index for reuse across sessions or deployments.
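A later session might reuse the saved index along the following lines. This sketch mirrors the load step from the psearch example and relies on the index parameter of .ai.bm25.search accepting a symbol. That the symbol resolves to the tables saved under the name test is an assumption, and the query token IDs are hypothetical.

```q
/ new session; assumes .ai.bm25.write[`:db;index;`test] has already been run
.ai:use`kx.ai
.Q.lo[`:db;0;0]                            / load the saved tables, as in the psearch example
qry:0 1 2 3j                               / hypothetical query token IDs
.ai.bm25.search[`test;qry;5;1.25e;0.75e]   / assumption: the symbol selects the on-disk index
```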
Next steps
- Read the Hybrid Search with BM25 in KDB-X AI Libraries tutorial on Medium.