How to perform a Transformed Temporal Similarity Search
This page details how to execute a Transformed TSS search in KDB.AI.
Added in v1.1.0.
Before we dive in, go to the Understanding Transformed TSS search page to make sure you're familiar with concepts like dimensionality reduction, timeseries windows, and fast moving vectors.
Setup
To perform a Transformed TSS search, make sure you have:
- An active KDB.AI Cloud or Server license
- The latest version of KDB.AI Cloud or Server installed
- A valid API key if you're using KDB.AI Cloud
- Python Client
1. Define your schema
To use the Transformed TSS method, you need to add an extra embedding attribute to your vectorIndex column. For general schema setup details, see the Manage Tables page.
2. Embedding attribute parameters
Option | Description | Type | Required | Default | Allowed values |
---|---|---|---|---|---|
dims | Desired reduced dimensionality of the data | int | true | 8 | 1, 2, 3, ... |
type | Embedding method to use | str | true | None | 'tsc' |
on_insert_error | Action to take if some records would error on insertion: either reject the whole batch or skip the erroneous record | str | false | 'reject_all' | 'reject_all', 'skip_row' |
A window fails to insert if its dimensionality is already less than the dimensionality specified by dims.
The selection of dims depends largely on the complexity of the data in the window; the more movement in the window, the larger this number should be.
schema = {'columns': [
    {'name': 'index', 'pytype': 'int32'},
    {'name': 'sym', 'pytype': 'str'},
    {'name': 'time', 'pytype': 'datetime64[ns]'},
    {'name': 'price',
     'vectorIndex': {'type': 'flat', 'metric': 'L2'},
     'embedding': {'dims': 8, 'type': 'tsc', 'on_insert_error': 'reject_all'}}]}
{
    "type": "splayed",
    "columns": [
        {"name": "index", "type": "int"},
        {"name": "sym", "type": "char"},
        {"name": "time", "type": "timestamp"},
        {
            "name": "price",
            "vectorIndex": {
                "type": "flat",
                "metric": "L2"
            },
            "embedding": {
                "dims": 8,
                "type": "tsc",
                "on_insert_error": "reject_all"
            }
        }
    ]
}
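Because a window shorter than dims fails to insert, it can help to check window lengths on the client before sending a batch. The sketch below is a minimal illustration only: `split_by_window_length` is a hypothetical helper, not part of the KDB.AI client, and it assumes that a window's "dimensionality" is simply its length.

```python
def split_by_window_length(windows, dims):
    """Hypothetical client-side pre-check (not a KDB.AI API).

    Returns two lists of positions: windows long enough to be reduced
    to `dims` values, and windows that are shorter than `dims`.
    """
    keep, too_short = [], []
    for pos, window in enumerate(windows):
        if len(window) >= dims:
            keep.append(pos)
        else:
            too_short.append(pos)
    return keep, too_short


# Three toy windows of length 1000, 5 and 8, checked against dims=8
windows = [[0.0] * 1000, [0.0] * 5, [0.0] * 8]
keep, too_short = split_by_window_length(windows, dims=8)
print("insertable:", keep)
print("too short:", too_short)
```

With on_insert_error set to 'reject_all', a single short window would reject the whole batch, so filtering out the positions in `too_short` beforehand is one way to avoid losing valid rows.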
3. Follow our index recommendations
Transformed TSS produces the best results using HNSW, IVF, and FLAT indexes. The added layer of product quantisation used in IVFPQ severely impacts the quality of the results returned.
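The recommendation above can be turned into a small check that runs before a table is created. This is a sketch under assumptions: `check_tss_index` is a hypothetical helper rather than a KDB.AI function, and the index type strings 'hnsw', 'ivf' and 'ivfpq' are assumed to be the lowercase forms of the index names ('flat' is the only one shown on this page).

```python
# Index types recommended for Transformed TSS (string forms are an assumption)
RECOMMENDED_INDEX_TYPES = {"flat", "hnsw", "ivf"}


def check_tss_index(schema):
    """Hypothetical helper: list columns that pair an embedding
    attribute with an index type outside the recommended set."""
    problems = []
    for column in schema["columns"]:
        embedding = column.get("embedding")
        index = column.get("vectorIndex")
        if not embedding or not index:
            continue  # Only columns using Transformed TSS are checked
        index_type = str(index.get("type", "")).lower()
        if index_type not in RECOMMENDED_INDEX_TYPES:
            problems.append(
                f"column {column['name']!r}: index type {index_type!r} "
                "is not recommended for Transformed TSS"
            )
    return problems


doc_schema = {"columns": [
    {"name": "sym", "pytype": "str"},
    {"name": "price",
     "vectorIndex": {"type": "flat", "metric": "L2"},
     "embedding": {"dims": 8, "type": "tsc"}},
]}
ivfpq_schema = {"columns": [
    {"name": "price",
     "vectorIndex": {"type": "ivfpq", "metric": "L2"},
     "embedding": {"dims": 8, "type": "tsc"}},
]}

print(check_tss_index(doc_schema))
print(check_tss_index(ivfpq_schema))
```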
Example: Transformed TSS search
# In this example we generate dummy market data using PyKX, create an index using the TSC feature exposed by KDB.AI,
# and search for a pattern of interest in the data
# Imports
# KX Dependencies
import pykx as kx
import kdbai_client as kdbai
# Other Dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from numpy.lib.stride_tricks import sliding_window_view
from matplotlib.text import Text
from tqdm import tqdm
import psutil
# Ignore Warnings
import warnings
warnings.filterwarnings("ignore")
# Report system-wide memory in use (covers Python + KDB.AI when both run on this machine)
def get_memory_usage():
    virtual_memory = psutil.virtual_memory()
    return virtual_memory.used / (1024 ** 2)  # Memory usage in megabytes
# Generate Dummy Data
D = 1000 # Sliding Window Size
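# Raw strings keep the q backslashes intact and avoid invalid-escape warnings in Python
kx.q(r'\S 100')  # Seed q's random number generator so the data is reproducible
kx.q(r'''
gen_marketTrades: { [num;ticker]
    :marketTrades:: update price:{max(abs -0.5 + x + y;5.0)}\[first price; count[i]?1.0] from
        `time xasc ([] time:(.z.d-30)+num?15D00:00:00;
            sym:num#ticker;
            qty:1000*1+num?10;
            price:num#25.0)
    };
''')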
df1 = kx.q('gen_marketTrades[50000;`AAA]').pd() # Create 50k records for AAA
df2 = kx.q('gen_marketTrades[50000;`BBB]').pd() # Create 50k records for BBB
# Concatenate the DataFrames
df = pd.concat([df1, df2], ignore_index=True)
print(f"System Memory Usage: {get_memory_usage():.2f} MB")
# Method 1: Only Windowing - This method is shown as a point of comparison for the TSC method
# Vector Construction
# Create the vector column
vecdf = df.groupby(['sym']).apply(
    lambda x: pd.DataFrame({
        'sym': x['sym'].iloc[0],
        'time': sliding_window_view(x['time'], D)[:, 0],  # Timestamp of the first row in each window
        'price': list(sliding_window_view(x['price'], D))
    })
).reset_index(drop=True).reset_index()
vecdf.head()
print(f"System Memory Usage: {get_memory_usage():.2f} MB")
# Index Construction
session = kdbai.Session()
# If we're re-running this, remove the old trade table
if 'trade' in session.list():
    table = session.table('trade')
    table.drop()
session.list()
schema = dict(
    columns=[
        dict(
            name='index',
            pytype='int32'
        ),
        dict(
            name='sym',
            pytype='str'
        ),
        dict(
            name='time',
            pytype='datetime64[ns]'
        ),
        dict(
            name='price',
            pytype='float32',
            vectorIndex=dict(
                type='flat',
                metric='L2',
                dims=1000
            )
        )
    ]
)
table = session.create_table('trade', schema)
# Index Population
n = 1000 # chunk row size
for i in tqdm(range(0, vecdf.shape[0], n)):
    table.insert(vecdf[i:i+n].reset_index(drop=True))
# Search
# We take the vector at index 100 and use it as our search vector
q = vecdf['price'][100].tolist()
plt.plot(q) # See what the search vector looks like
plt.grid(True)
plt.title('Query Vector')
res = table.search(vectors=[q], n=10)[0]
print(res)
print(f"System Memory Usage: {get_memory_usage():.2f} MB")
# Non-Trivial Solutions
# The closest matches are windows that overlap the query almost entirely. To see non-trivial matches, we increase
# the number of nearest neighbors and look at the tail end of the returned DataFrame
non_trivial = table.search(vectors=[q], n=200)[0].iloc[170:]
print(non_trivial)
vectors = non_trivial['price'].tolist()
for i, vector in enumerate(vectors):
    vector = np.asarray(vector)  # Ensure array arithmetic works even if the column holds plain lists
    plt.plot(vector - vector[0], color='blue', label='Non Trivial Nearest Neighbors' if i == 0 else "_nolegend_")
plt.plot(np.array(q)-q[0], color='red', label='Query')
plt.legend()
plt.grid(True)
plt.title('Raw Retrieval of Nearest Neighbors 171-200')  # iloc[170:] keeps neighbors 171 to 200
# Method 2: TSC
# This method uses foreign keys to look up rows in the vector table held in the memory of the running Python process
# Index Construction
# If we're re-running this, remove the old trade table
if 'trade' in session.list():
    table = session.table('trade')
    table.drop()
session.list()
schema = dict(
    columns=[
        dict(
            name='index',
            pytype='int32'
        ),
        dict(
            name='sym',
            pytype='str'
        ),
        dict(
            name='time',
            pytype='datetime64[ns]'
        ),
        dict(
            name='price',
            pytype='float32',
            vectorIndex=dict(
                type='flat',
                metric='L2',
            ),
            embedding=dict(
                dims=8,
                type='tsc',
                on_insert_error='reject_all',
            )
        )
    ]
)
table = session.create_table('trade', schema)
print(f"System Memory Usage: {get_memory_usage():.2f} MB")
# Index Population
n = 1000 # chunk row size
for i in tqdm(range(0, vecdf.shape[0], n)):
    table.insert(vecdf[i:i+n].reset_index(drop=True))
# Search
# We reuse the search vector q from Method 1 (the vector at index 100)
res = table.search(vectors=[q], n=10)[0]
print(res)
# **N.B.** The returned indexes lie on either side of the index of our query vector. This passes our sanity check, as vectors offset
# by a few indexes are simply the query pattern shifted slightly left or right.
# For more meaningful results, we can either increase the number of nearest neighbors returned or build the sliding windows
# so that they don't largely overlap.
# Foreign Key Lookup
method_2 = res.merge(vecdf, on=['index','sym','time'], how='left')
print(method_2)
# Non-Trivial Solutions
# Expanding on the above note, we can see the non-trivial solutions by upping the nearest neighbors and looking at the tail end of the returned DataFrame
non_trivial = table.search(vectors=[q], n=200)[0].iloc[170:]
non_trivial.merge(vecdf, on=['index','sym','time'], how='left')
vectors = non_trivial.merge(vecdf, on=['index','sym','time'], how='left')['price'].tolist()
for i, vector in enumerate(vectors):
    vector = np.asarray(vector)  # Ensure array arithmetic works even if the column holds plain lists
    plt.plot(vector - vector[0], color='blue', label='Non Trivial Nearest Neighbors' if i == 0 else "_nolegend_")
plt.plot(np.array(q)-q[0], color='red', label='Query')
plt.legend()
plt.grid(True)
plt.title('TSC Retrieval of Nearest Neighbors 171-200')  # iloc[170:] keeps neighbors 171 to 200
# We can clearly see the high correlation between the query vector and the non-trivial nearest neighbors
print(f"System Memory Usage: {get_memory_usage():.2f} MB")
# Conclusion
# We see that memory usage increases drastically when we insert raw vectors into KDB.AI. The TSC method requires less memory
# and gives faster response times. The size of the improvement correlates with the level of dimensionality reduction taking place, that is, $\frac{D}{\text{dims}}$.
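To make the $\frac{D}{\text{dims}}$ ratio concrete, the sketch below works through the numbers used in this example (50,000 rows per symbol, D = 1000, dims = 8). It is illustrative arithmetic only: it assumes each window is stored as dims values after the transform and ignores any index or metadata overhead, so it is not a measurement of KDB.AI's actual memory use.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Toy check of the window count: a series of length n yields n - D + 1 windows
toy_prices = np.arange(10.0)
toy_windows = sliding_window_view(toy_prices, 4)
print(toy_windows.shape)  # 10 - 4 + 1 = 7 windows of length 4

# Numbers from the example above
N = 50_000   # rows generated per symbol
D = 1000     # sliding window size
dims = 8     # reduced dimensionality in the embedding attribute

n_windows = N - D + 1             # windows per symbol
raw_values = n_windows * D        # floats stored per symbol without TSC
reduced_values = n_windows * dims # floats stored per symbol, assuming dims values per window
ratio = raw_values / reduced_values

print(f"windows per symbol: {n_windows}")
print(f"raw values: {raw_values:,}  reduced values: {reduced_values:,}")
print(f"reduction factor D/dims: {ratio}")
```

The window count cancels out of the ratio, which is why the expected saving depends only on D and dims.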
Next steps
Now that you're familiar with a Transformed TSS search, try the following:
- Explore best practices and use cases on the KDB.AI Learning hub.
- Discover our GitHub repo.
- Open the sample.
- Run the notebook in Google Colab.
- Learn how to perform a Non-Transformed Temporal Similarity Search.