Downsample Time-Series Data

Use the Ramer-Douglas-Peucker (RDP) algorithm to downsample time-series data efficiently. This method preserves important peaks and valleys while removing noise, and an iterative implementation avoids the stack errors common with recursive approaches.

Overview

Analyze bucketing limitations

KDB-X often uses "bucketing" to group data into fixed time windows, like 1-minute or 15-minute bars. This is fast, but it hides details.

Identify bucketing limitations

A simple average flattens the data. It hides peaks and valleys.

q)select avg price by 15 xbar time.minute from trade where sym=`AAPL

OHLC (Open-High-Low-Close) bars are a common alternative. They capture the range of movement:

q)select o:first price, h:max price, l:min price, c:last price 
  by 15 xbar time.minute from trade where sym=`AAPL

OHLC bars show the range, but hide the exact timing of changes. You lose the timestamp of each spike. Figure 1 shows this: the blue line is the original data, the orange line is the average, and the bars show the OHLC range.

%%{init: { "theme": "base", "themeVariables": { "xyChart": { "plotColorPalette": "#616161, #D50000, #00C853, #616161, #FFFFFF, #2962FF, #FF6D00", "backgroundColor": "#ffffff" } } } }%%
xychart-beta
    title "Bucketing vs. OHLC"
    x-axis "Time" ["t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9"]
    y-axis "Price" 40 --> 160
    bar [0, 0, 150, 0, 0, 0, 0, 100, 0, 0]
    bar [0, 0, 130, 0, 0, 0, 0, 0, 0, 0]
    bar [0, 0, 0, 0, 0, 0, 0, 90, 0, 0]
    bar [0, 0, 120, 0, 0, 0, 0, 80, 0, 0]
    bar [0, 0, 100, 0, 0, 0, 0, 50, 0, 0]
    line [130, 110, 150, 100, 120, 80, 100, 60, 50, 90]
    line [122, 122, 122, 122, 122, 76, 76, 76, 76, 76]

Figure 1: Limitations of Bucketing

Understand RDP advantages

The RDP algorithm simplifies the line based on its shape. It ignores fixed time intervals. It keeps points that deviate from the trend. This preserves:

  1. Peaks and Valleys: Keeps true volatility
  2. Precise Timing: Events keep their original timestamps
  3. Visual Shape: Looks like the original data

Figure 2 compares the results: the blue line is the original data and the orange line is the RDP result, which ignores small noise but keeps the big peaks.

%%{init: { "theme": "base", "themeVariables": { "xyChart": { "plotColorPalette": "#2962FF, #D50000", "backgroundColor": "#ffffff" } } } }%%
xychart-beta
    title "RDP Solution"
    x-axis "Time" ["t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9"]
    y-axis "Price" 40 --> 160
    line [130, 110, 150, 100, 120, 80, 100, 60, 50, 90]
    line [130, 140, 150, 127, 103, 80, 100, 75, 50, 90]

Figure 2: RDP Solution

The algorithm steps are:

  1. Draw a line from the start to the end
  2. Find the point furthest from that line
  3. Keep the point if it is far enough away from the line (distance > tolerance)
  4. Split the data at that point and repeat

flowchart TD
    Start([Start]) --> Find[Find Max Deviation]
    Find --> Check{Dist > Tol?}
    Check -- Yes --> Keep[Keep Point]
    Keep --> Split[Split]
    Split --> Repeat[Repeat]
    Repeat --> Find
    Check -- No --> Discard[Discard]
    Discard --> Done([Done])

Figure 3: RDP Algorithm Logic

Deep dive: recursion vs. iteration

A recursive function (one that calls itself) can crash on large datasets, because each split consumes stack memory. With millions of points, q hits its recursion limit and signals a 'stack error.
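
To illustrate, here is a minimal recursive sketch (rdpRec is a hypothetical name used only in this guide; it is not the implementation built below, and it assumes a segment's start and end points differ):

rdpRec:{[tol;x;y]
  if[3>count x;:(x;y)];                        / nothing between the endpoints
  dx:last[x]-x 0; dy:last[y]-y 0;
  num:abs(dx*y[0]-y)-(x[0]-x)*dy;              / cross-product numerator, as in pDist below
  d:num%sqrt(dx*dx)+dy*dy;                     / perpendicular distance of every point
  m:max 1_-1_d;                                / largest interior deviation
  i:d?m;                                       / index of that point
  $[m>tol;
    / keep point i: simplify each half, then join, dropping the duplicated split point
    .z.s[tol;(i+1)#x;(i+1)#y] ,' 1_'.z.s[tol;i _ x;i _ y];
    / otherwise keep only the two endpoints
    (first[x],last x;first[y],last y)]
 };

Every call to .z.s (the function calling itself) adds a stack frame, so deeply nested splits on a long, jagged series eventually exhaust the stack.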

To fix this, use an iterative approach that keeps a queue of pending segments instead of recursing. This is safe for large datasets.

Generate volatile test data

Use a "Cauchy distribution" to make jagged data. This tests if the algorithm keeps peaks.

Open a q session and run the following code to create 10,000 points:

q)// Define Random Walk Generator
q)PI:acos -1;
q)rcauchy:{[n;loc;scale] loc+scale*tan PI*(n?1.0)-0.5};
q)
q)// Generate 10,000 datapoints
q)n:10000;
q)trade:([]time:`timestamp$.z.d+00:15*til n; sym:n#`AAPL;price:100+sums rcauchy[n;0;1]);
q)
q)// Check first 5 rows
q)5#trade

The data looks something like this:

%%{init: { "theme": "base", "themeVariables": { "xyChart": { "plotColorPalette": "#2962FF", "backgroundColor": "#ffffff", "titleColor": "#333333", "xAxisLabelColor": "#333333", "xAxisTitleColor": "#333333", "yAxisLabelColor": "#333333", "yAxisTitleColor": "#333333" } } } }%%
xychart-beta
    title "Volatile Data"
    x-axis "Time" ["09:30", "09:45", "10:00", "10:15", "10:30", "10:45", "11:00", "11:15", "11:30", "11:45", "12:00"]
    y-axis "Price" 40 --> 160
    line [100, 105, 140, 135, 160, 120, 125, 50, 40, 45, 100]

Figure 4: Volatile Data

Implement the RDP algorithm

Implement the RDP algorithm iteratively to downsample the data and remove noise.

The implementation has three parts:

  1. pDist: Computes the perpendicular distance from points to a line segment
  2. iter: Processes one segment from the queue per call
  3. rdpIter: Initializes the queue and runs iter until it is empty

Define the distance helper function

pDist finds the perpendicular distance between a point (px, py) and the line through (x1, y1) and (x2, y2). The px and py arguments can be vectors, so one call scores every intermediate point of a segment. This identifies outlier points.

Safe calculation

This function avoids division-by-zero errors when a segment's start and end points coincide (a zero denominator). In that case it falls back to the plain distance from the point to the start point.

pDist:{[x1;y1;x2;y2;px;py]
  / Calculate numerator (cross product area)
  numerator:abs((x2-x1)*(y1-py))-((x1-px)*(y2-y1));
  / Calculate denominator (segment length)
  denominator:sqrt((x2-x1) xexp 2)+(y2-y1) xexp 2;
  / Handle single-point case (denominator=0) vs normal case
  ?[denominator=0f;sqrt((px-x1) xexp 2)+(py-y1) xexp 2;numerator%denominator]
 };
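
As a quick sanity check (a hypothetical example, not part of the original listing), score two points against the horizontal segment from (0;0) to (10;0). The perpendicular distances are simply the absolute y-values:

q)pDist[0f;0f;10f;0f;5 2f;3 1f]   / points (5;3) and (2;1) vs the segment (0;0)->(10;0)
3 1f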

Define the iterator function

Logic breakdown

iter steps:

  1. Get the queue of segments
  2. Check the first segment
  3. Find the point with max distance
  4. Split the segment if distance > tolerance
  5. Discard points if distance <= tolerance

iter:{[tol;xVec;yVec;state]
  queue:state 0;
  mask:state 1;

  / Exit if queue is empty
  if[0=count queue;:state];

  / Pop the first range
  startIdx:first key queue;
  endIdx:first value queue;
  remainingQueue:1_queue;

  / Identify points between start (startIdx) and end (endIdx)
  rangeIdx:startIdx+1+til (endIdx-startIdx)-1;

  / If no intermediate points, return updated queue
  if[0=count rangeIdx;:(remainingQueue;mask)];

  / Calculate distances for all intermediate points
  dists:pDist[xVec startIdx;yVec startIdx;xVec endIdx;yVec endIdx;xVec rangeIdx;yVec rangeIdx];
  maxDist:max dists;
  maxDistIdx:rangeIdx dists?maxDist;

  / Split if max distance > tolerance, else discard points (update mask)
  $[maxDist>tol;
    (remainingQueue,(startIdx,maxDistIdx)!(maxDistIdx,endIdx);mask);
    (remainingQueue;@[mask;rangeIdx;:;0b])
  ]
 };
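
To see one pass of the loop, run a single step by hand on a hypothetical five-point spike (toy values, not the test data). The only queued segment runs from index 0 to index 4 and splits at the peak at index 2, so two sub-ranges replace it in the queue and the mask is untouched:

q)xs:0 1 2 3 4f; ys:0 0 5 0 0f
q)iter[1.0;xs;ys;((enlist 0)!enlist 4;5#1b)]   / returns (0 2!2 4;11111b)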

Define the main wrapper function

rdpIter sets up the queue and the boolean keep-mask. It runs iter repeatedly with over, and stops once the queue is empty and the state no longer changes.

rdpIter:{[tol;x;y]
  / Initialize tracker: (Queue of ranges; Boolean Keep Mask)
  tracker:((enlist 0)!enlist count[x]-1;count[x]#1b);

  / Run until queue is empty
  res:iter[tol;x;y] over tracker;

  / Return filtered x and y vectors
  (x;y)@\:where res 1
 };
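
Before running it on the full dataset, you can check it against the same hypothetical five-point spike; with a tolerance of 1, only the two endpoints and the peak survive:

q)rdpIter[1.0;0 1 2 3 4f;0 0 5 0 0f]   / returns (0 2 4f;0 5 0f)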

Run and visualize the results

Run the function on test data. Change tol (tolerance) to tune the result:

  • Low Tolerance (for example, 1): More detail
  • High Tolerance (for example, 10): Less noise. Shows main trends

q)// Run RDP with different tolerances
q)resHigh:rdpIter[1;`float$trade`time;trade`price]
q)resTrend:rdpIter[10;`float$trade`time;trade`price]
q)
q)// Format results into tables for comparison
q)tblHigh:([]time:"p"$resHigh 0;price:resHigh 1;label:`HighFidelity)
q)tblTrend:([]time:"p"$resTrend 0;price:resTrend 1;label:`TrendOnly)

Compare the results:

Figure 5 shows the high-fidelity result compared to the original data. Figure 6 shows the trend-only version compared to the original data.

%%{init: { "theme": "base", "themeVariables": { "xyChart": { "plotColorPalette": "#2962FF, #FF6D00", "backgroundColor": "#ffffff", "titleColor": "#333333", "xAxisLabelColor": "#333333", "xAxisTitleColor": "#333333", "yAxisLabelColor": "#333333", "yAxisTitleColor": "#333333" } } } }%%
xychart-beta
    title "High Fidelity (Tol=1)"
    x-axis "Time" ["09:30", "09:45", "10:00", "10:15", "10:30", "10:45", "11:00", "11:15", "11:30", "11:45", "12:00"]
    y-axis "Price" 40 --> 160
    line [100, 105, 140, 135, 160, 120, 125, 50, 40, 45, 100]
    line [100, 120, 140, 150, 160, 120, 85, 50, 40, 70, 100]

Figure 5: High Fidelity

%%{init: { "theme": "base", "themeVariables": { "xyChart": { "plotColorPalette": "#2962FF, #FF6D00", "backgroundColor": "#ffffff", "titleColor": "#333333", "xAxisLabelColor": "#333333", "xAxisTitleColor": "#333333", "yAxisLabelColor": "#333333", "yAxisTitleColor": "#333333" } } } }%%
xychart-beta
    title "Trend Only (Tol=10)"
    x-axis "Time" ["09:30", "09:45", "10:00", "10:15", "10:30", "10:45", "11:00", "11:15", "11:30", "11:45", "12:00"]
    y-axis "Price" 40 --> 160
    line [100, 105, 140, 135, 160, 120, 125, 50, 40, 45, 100]
    line [100, 115, 130, 145, 160, 130, 100, 70, 40, 70, 100]

Figure 6: Trend Only

Quantify storage savings

Count the rows to measure savings. You will see a large drop in size.

/ Row counts for each configuration
countOrig:count trade
countHigh:count tblHigh
countTrend:count tblTrend

/ Summary table with the percentage reduction relative to the original
([]
  Configuration:`Original`HighFidelity`TrendOnly;
  Row_Count:(countOrig; countHigh; countTrend);
  Reduction_Pct:100*1-(countOrig; countHigh; countTrend)%countOrig
 )

Typical Outcomes

Configuration  Row_Count  Reduction_Pct
Original       10000      0.0
HighFidelity   4200       58.0
TrendOnly      300        97.0

Explore advanced methodology

This guide uses a loop. To learn about the recursive method, read the Dynamically Shrinking Big Data white paper. It covers:

  • Recursion: Why the original algorithm crashes
  • Comparisons: Bucketing vs. RDP
  • Case Studies: More on Cauchy random walks

Summary

In this guide, you:

  • Generated jagged data
  • Implemented a stack-safe RDP algorithm
  • Tuned tolerance to balance detail and size