Time-Series & Historical Analysis
Welcome to this example where we'll demonstrate how to work with large datasets in KDB-X to analyze time-series data.
One of the key features of KDB-X is its ability to handle huge volumes of data with exceptional speed and efficiency. Whether it's reading massive datasets, performing time-based aggregations, or joining data from different sources, KDB-X excels at time-series analysis. By the end of this example, you'll have a clear understanding of how to create, manipulate, store, and analyze data using q/KDB-X. Along the way, we'll introduce several key concepts that are fundamental to working with q/KDB-X.
Here, we'll cover:
- Creating a large time-series dataset from scratch
- Saving this data to a database on disk
- Streamlining ingestion and save-down using functions
- Performing time-based aggregations to analyze trends over time
- Using asof joins (aj) to combine time-series data (e.g., matching trades to quotes)
1. Install q
Before we can start q, it must first be installed. If you have yet to install q, you can follow the installation guide. Once you have q installed and have started your first q process, you can continue with this example.
2. Create a Time-Series Dataset
Let’s start by creating a sample dataset to work with. This dataset will simulate trade data over a period of time, with random values for price, size, and symbols. We’ll generate 5 million rows of trade data for a single day.
n:5000000
day:2025.01.02
trade:([]
time:asc (`timestamp$day) + n?24:00:00.000000000; / Start from midnight, spread across 24h
sym:n?`AAPL`MSFT`GOOG`AMZN; / Random symbols
price:n?100f; / Random prices
size:n?1000 / Random trade sizes
)
Here's a breakdown of what's happening above:
- `n:5000000` sets the number of rows we want to generate
- We define a new table with table notation `([] col1:<values>; col2:<values>; ...)`
- We use `?` to generate random values for the 4 columns: `time` is populated with timestamps starting from midnight and increasing across a 24-hour period, with a random offset to simulate a spread of trades; `sym` is populated with random symbols like AAPL, MSFT, etc., selected from a list; `price` and `size` are randomly generated
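If you'd like to see how `?` behaves on its own, here are a few standalone examples (illustrative; your random values will differ):
5?100f / 5 random floats between 0 and 100
5?1000 / 5 random longs between 0 and 999
3?`AAPL`MSFT`GOOG`AMZN / 3 random symbols drawn from the list
2?24:00:00.000000000 / 2 random timespans within 24 hours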
This table is now available in memory to investigate and query. Let's take a quick look at the row count, the schema details with `meta`, and the first 10 rows using `sublist`.
These simple commands are essential when exploring your data quickly in q/KDB-X.
count trade / get row count
meta trade / get table schema details - datatypes, column names etc.
The following columns are produced when we run meta:
- c: column name
- t: column type
- f: foreign keys
- a: attributes - modifiers applied for performance characteristics
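For the trade table above, meta returns something along these lines (illustrative; the s attribute on time comes from sorting with asc):
c    | t f a
-----| -----
time | p   s
sym  | s
price| f
size | j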
10 sublist trade / get first 10 rows
3. Save Data to Disk
Once the data is generated, you’ll likely want to save it to disk for persistent storage.
Because we want the ability to scale, partitioning by date will be a good approach for this dataset. Without partitioning, queries that span large time periods would require scanning entire datasets, which can be very slow and resource-intensive. By partitioning data, KDB-X can limit the query scope to the relevant partitions, significantly speeding up the process.
To partition by date we can use the inbuilt function .Q.dpft.
dbDir:"/tmp/kdb/data" / Define database location
dbPath:hsym `$dbDir
.Q.dpft[dbPath;day;`sym;`trade] / Save data as a partitioned database
In the above:
- `hsym`: prefixes the directory location with a colon to make it a file handle
- `.Q.dpft[d;p;f;t]`: persists to a (d)atabase location with a specific (p)artition, with data from a (t)able and an associated (f)ield
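On disk, the resulting layout looks roughly like this (a sketch; the sym enumeration file sits at the database root, with one directory per partition):
/tmp/kdb/data
    sym             / enumeration file for symbol values
    2025.01.02/
        trade/
            .d      / stores column order
            time
            sym
            price
            size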
Once persisted, the table name is returned. We can test it worked as expected by deleting the trade table we have in memory and reloading the database from disk.
delete trade from `. / Delete in memory table
system"l ",dbDir / Load the partitioned database
meta trade / Check it exists
KDB-X offers a number of different methods to store tables, allowing efficient storage and querying for different-sized datasets: flat, splayed, partitioned and segmented.
A general rule of thumb around which format to choose depends on three things:
- Will the table continue to grow at a fast rate?
- Am I working in a RAM constrained environment?
- What level of performance do I want?
To learn more about these types and when to choose which see here. --ADD LINK--
4. Scaling Data Ingestion with Functions
If you want to scale the ingestion of data to many days, it’s helpful to create a reusable function. Let’s create a function createTrade that generates trade data for specific dates and saves it to the database.
createTrade:{[date]
trade::([] time:asc (`timestamp$date) + n?24:00:00.000000000; / Start from midnight, spread across 24h
sym:n?`AAPL`MSFT`GOOG`AMZN; / Random symbols
price:n?100f; / Random prices
size:n?1000); / Random trade sizes
.Q.dpft[dbPath;date;`sym;`trade] / Save data as a partitioned database
}
days:2025.02.01 + til 5
createTrade each days
In the above:
- The function `createTrade` generates trade data for a given date, and then saves it to disk.
- We generate data for multiple days (2025.02.01 to 2025.02.05), using the `til` operator as a quick, handy way to generate a list of dates.
- We then loop over the dates using `each`.
Iterators
Iterators like each are the primary means of iteration in q, and in almost all cases the most efficient way to iterate. Loops are rare in q programs and are almost always candidates for optimization.
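For instance, to apply a function to every item of a list (a couple of quick standalone examples):
{x*x} each 1 2 3 4 / square each item -> 1 4 9 16
count each (1 2;3 4 5) / count items in each sublist -> 2 3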
After running this function, the data will be partitioned and stored for each specific day. Again, let's delete our in-memory trade table and reload our database to pick up these new additions.
delete trade from `. / Delete in memory table
system"l ",dbDir / Load the partitioned database
select count i by date from trade / Count num rows by date after partitioning 5 days of data
5. Time-Series Analytics
Now that we have some data, let's dive into some basic time-series analytics.
Total Trade Volume Every Hour for AAPL
select sum size by date, 60 xbar time.minute from trade where sym=`AAPL
qSQL & Temporal Arithmetic
Here we are using qSQL, the inbuilt table query language in KDB-X. If you have used SQL, you will find the syntax of qSQL queries very similar.
- Just as in SQL, table results are called using `select` and `from`, and can be filtered by expressions following a `where`
- Multiple filter criteria, separated by `,`, are evaluated starting from the left
- To group similar values together we can use the `by` clause. This is particularly useful in combination with an aggregation like `sum`, `max`, `min` etc.
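For example, in the query below (a sketch against our trade table) the date filter is evaluated first, narrowing the scan to one partition before the sym and size filters are applied:
select from trade where date=2025.02.01, sym=`AAPL, size>500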
q/KDB-X supports several temporal types and arithmetic between them. See here for a summary of datatypes. In this example:
- The `time` column in the data has a type of timestamp, which includes both date and time values.
- We convert the `time` values to their minute values (including hours and minutes) with `time.minute`.
- We then aggregate further on time by using `xbar` to bucket the minutes into hours (60-unit buckets), as shown below.
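To see xbar in isolation, it rounds each value down to the nearest multiple of its left argument:
60 xbar 10:17 11:45 13:03 / bucket minutes into hours -> 10:00 11:00 13:00
15 xbar 09:07 09:22 / 15-minute buckets -> 09:00 09:15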
Weighted Average Price and Last Trade Price Every 15 Minutes for MSFT
select LastPrice:last price, WeightedPrice:size wavg price by date,15 xbar time.minute from trade where sym=`MSFT
This is similar to the previous analytic, but this time we make use of the built-in wavg function to find the weighted average price over time intervals.
In finance, volume-weighted averages give a more accurate reflection of a stock’s price movement by incorporating trading volume at different price levels. This can be especially useful in understanding whether a price move is supported by strong market participation or is just a result of a few trades.
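As a quick sanity check of wavg, it weights each value in its right argument by the corresponding item of its left argument:
1 2 3 wavg 10f 20f 30f / (1*10 + 2*20 + 3*30) % (1+2+3) -> 23.33333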
Let's time this analytic with \t to see how long it takes in milliseconds to crunch through 30 million records.
\t select LastPrice:last price, WeightedPrice:size wavg price by date,15 xbar time.minute from trade where sym=`MSFT
This query processes 30+ million records in ~100ms (based on testing), efficiently aggregating LastPrice and WeightedPrice for MSFT trades. The use of by date, 15 xbar time.minute optimizes the grouping, making the computation fast. This demonstrates the power of KDB-X/q for high-speed time-series analytics.
While these are just basic analytics, they showcase q/KDB-X's ability to handle large-scale time-series data and perform aggregations quickly.
6. Asof Join – Matching Trades with Quotes
One of the most powerful features in q/KDB-X is the asof join (aj), which is designed to match records from two tables based on the most recent timestamp. Unlike a standard SQL join, where records must match exactly on a key, an asof join finds the most recent match at or before each timestamp.
Why Use Asof Joins? In time-series data, we often deal with information arriving at different intervals. For example:
- Trade and Quote Data: A trade occurs at a given time, and we want to match it with the latest available quote.
- Sensor Data: A sensor records temperature every second, while another logs environmental data every 10 seconds; matching the closest reading is crucial.
Note
q/KDB-X optimizes asof joins to handle large datasets efficiently, making it a key tool in real-time analytics and historical data analysis.
Generate synthetic quote data for one day
n:2000000
today:last days
quote:([]
time:asc (`timestamp$today) + n?86400000000000; / Random timestamps
sym:n?`AAPL`MSFT`GOOG`AMZN; / Symbols
bid:n?100f; / Random bid prices
ask:n?100f / Random ask prices
)
As we're keeping this table in memory, we need to perform one extra step before joining: we apply the parted (`p#`) attribute to the sym column of the quote table. Our trade table on disk already has the parted attribute on the sym column; we can see this in column a when we run meta trade.
meta trade
This is crucial for optimizing asof joins, as it ensures faster lookups when performing symbol-based joins. Before applying parted to quote, we first sort the table by sym using xasc, as the parted attribute requires the column to be sorted for it to work efficiently.
quote:`sym xasc quote / sorting sym in ascending order
quote:update `p#sym from quote / apply parted attribute on sym
In the above:
- `xasc` sorts the quote table by sym in ascending order.
- `#` applies the parted attribute to sym, optimizing symbol-based lookups.
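We can confirm the attribute took effect by inspecting the column directly:
attr quote`sym / returns `p, confirming the parted attribute
meta quote / column a now shows p for sym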
Perform Asof Join
We now match each trade with the most recent available quote for today's date.
tradequote:aj[`sym`time;
select from trade where date=today;
quote]
tradequote
In the above:
- `aj` performs an asof join on the `sym` and `time` columns
- Each trade record gets matched with the latest available quote at or before the trade's timestamp.
- We can see this means the first few `bid` and `ask` values are empty because there was no quote data prior to those trades.
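To verify that last point, a quick check (not part of the original workflow) counts the trades that had no preceding quote:
select count i from tradequote where null bid / trades with no quote before them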
This approach ensures that for every trade, we have the best available quote information, allowing traders to analyze trade execution relative to the prevailing bid/ask spread at the time.
Conclusions and Next Steps
In this example, you learned how to:
- Create a dataset of trade data.
- Save and load data using partitioned storage.
- Scale data ingestion with reusable functions.
- Perform basic time-series analytics on the data.
- Use an asof join to match trades with quotes.
To take your learning further, consider exploring the following: