Downsample Time-Series Data

Use the Ramer-Douglas-Peucker (RDP) algorithm to downsample time-series data efficiently. This method preserves important peaks and valleys while removing noise, and an iterative implementation avoids the stack errors common with recursive approaches.

Overview

Analyze bucketing limitations

KDB-X often uses "bucketing" to group data into fixed time windows, like 1-minute or 15-minute bars. This is fast, but it hides details.

Identify bucketing limitations

A simple average flattens the data. It hides peaks and valleys.

q)select avg price by 15 xbar time.minute from trade where sym=`AAPL

OHLC (Open-High-Low-Close) bars are a common alternative. They capture the range of movement:

q)select o:first price, h:max price, l:min price, c:last price 
  by 15 xbar time.minute from trade where sym=`AAPL

OHLC bars show the range, but hide the exact timing of changes. You lose the timestamp of each spike. Figure 1 shows this: the blue line is the original data, the orange line is the average, and the bars show the OHLC range.

%%{init: { "theme": "base", "themeVariables": { "xyChart": { "plotColorPalette": "#616161, #D50000, #00C853, #616161, #FFFFFF, #2962FF, #FF6D00", "backgroundColor": "#ffffff" } } } }%%
xychart-beta
    title "Bucketing vs. OHLC"
    x-axis "Time" ["t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9"]
    y-axis "Price" 40 --> 160
    bar [0, 0, 150, 0, 0, 0, 0, 100, 0, 0]
    bar [0, 0, 130, 0, 0, 0, 0, 0, 0, 0]
    bar [0, 0, 0, 0, 0, 0, 0, 90, 0, 0]
    bar [0, 0, 120, 0, 0, 0, 0, 80, 0, 0]
    bar [0, 0, 100, 0, 0, 0, 0, 50, 0, 0]
    line [130, 110, 150, 100, 120, 80, 100, 60, 50, 90]
    line [122, 122, 122, 122, 122, 76, 76, 76, 76, 76]

Figure 1: Limitations of Bucketing

Understand RDP advantages

The RDP algorithm simplifies the line based on its shape. It ignores fixed time intervals. It keeps points that deviate from the trend. This preserves:

  1. Peaks and Valleys: Keeps true volatility
  2. Precise Timing: Events keep their original timestamps
  3. Visual Shape: Looks like the original data

Figure 2 compares the results: the blue line is the original data and the orange line is the RDP result, which ignores small noise but keeps the big peaks.

%%{init: { "theme": "base", "themeVariables": { "xyChart": { "plotColorPalette": "#2962FF, #D50000", "backgroundColor": "#ffffff" } } } }%%
xychart-beta
    title "RDP Solution"
    x-axis "Time" ["t0", "t1", "t2", "t3", "t4", "t5", "t6", "t7", "t8", "t9"]
    y-axis "Price" 40 --> 160
    line [130, 110, 150, 100, 120, 80, 100, 60, 50, 90]
    line [130, 140, 150, 127, 103, 80, 100, 75, 50, 90]

Figure 2: RDP Solution

The algorithm steps are:

  1. Draw a line from the start to the end
  2. Find the point furthest from that line
  3. Keep the point if it is far enough away from the line (distance > tolerance)
  4. Split the data at that point and repeat

flowchart TD
    Start([Start]) --> Find[Find Max Deviation]
    Find --> Check{Dist > Tol?}
    Check -- Yes --> Keep[Keep Point]
    Keep --> Split[Split]
    Split --> Repeat[Repeat]
    Repeat --> Find
    Check -- No --> Discard[Discard]
    Discard --> Done([Done])

Figure 3: RDP Algorithm Logic

Deep dive: recursion vs. iteration

A recursive function (one that calls itself) can crash on large datasets, because each split consumes stack memory. With millions of points, q hits its recursion limit and signals a 'stack error.
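
To illustrate, here is a minimal recursive sketch (rdpRec is a hypothetical name used only in this guide; it is not the implementation built below, and it assumes a segment's start and end points differ):

rdpRec:{[tol;x;y]
  if[3>count x;:(x;y)];                        / nothing between the endpoints
  dx:last[x]-x 0; dy:last[y]-y 0;
  num:abs(dx*y[0]-y)-(x[0]-x)*dy;              / cross-product numerator, as in pDist below
  d:num%sqrt(dx*dx)+dy*dy;                     / perpendicular distance of every point
  m:max 1_-1_d;                                / largest interior deviation
  i:d?m;                                       / index of that point
  $[m>tol;
    / keep point i: simplify each half, then join, dropping the duplicated split point
    .z.s[tol;(i+1)#x;(i+1)#y] ,' 1_'.z.s[tol;i _ x;i _ y];
    / otherwise keep only the two endpoints
    (first[x],last x;first[y],last y)]
 };

Every call to .z.s (the function calling itself) adds a stack frame, so deeply nested splits on a long, jagged series eventually exhaust the stack.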

To fix this, use an iterative approach that keeps a queue of pending segments instead of recursing. This is safe for large datasets.

Generate volatile test data

Use a "Cauchy distribution" to make jagged data. This tests if the algorithm keeps peaks.

Open a q session and run the following code to create 10,000 points:

q)// Define Random Walk Generator
q)PI:acos -1;
q)rcauchy:{[n;loc;scale] loc+scale*tan PI*(n?1.0)-0.5};
q)
q)// Generate 10,000 datapoints
q)n:10000;
q)trade:([]time:`timestamp$.z.d+00:15*til n; sym:n#`AAPL;price:100+sums rcauchy[n;0;1]);
q)
q)// Check first 5 rows
q)5#trade

The data looks something like this:

%%{init: { "theme": "base", "themeVariables": { "xyChart": { "plotColorPalette": "#2962FF", "backgroundColor": "#ffffff", "titleColor": "#333333", "xAxisLabelColor": "#333333", "xAxisTitleColor": "#333333", "yAxisLabelColor": "#333333", "yAxisTitleColor": "#333333" } } } }%%
xychart-beta
    title "Volatile Data"
    x-axis "Time" ["09:30", "09:45", "10:00", "10:15", "10:30", "10:45", "11:00", "11:15", "11:30", "11:45", "12:00"]
    y-axis "Price" 40 --> 160
    line [100, 105, 140, 135, 160, 120, 125, 50, 40, 45, 100]

Figure 4: Volatile Data

Implement the RDP algorithm

Implement the RDP algorithm iteratively to downsample the data and remove noise.

The implementation has three parts:

  1. pDist: Computes the perpendicular distance from points to a line segment
  2. iter: Processes one segment from the queue per call
  3. rdpIter: Initializes the queue and runs iter until it is empty

Define the distance helper function

pDist finds the perpendicular distance between a point (px, py) and the line through (x1, y1) and (x2, y2). The px and py arguments can be vectors, so one call scores every intermediate point of a segment. This identifies outlier points.

Safe calculation

This function avoids division-by-zero errors when a segment's start and end points coincide (a zero denominator). In that case it falls back to the plain distance from the point to the start point.

pDist:{[x1;y1;x2;y2;px;py]
  / Calculate numerator (cross product area)
  numerator:abs((x2-x1)*(y1-py))-((x1-px)*(y2-y1));
  / Calculate denominator (segment length)
  denominator:sqrt((x2-x1) xexp 2)+(y2-y1) xexp 2;
  / Handle single-point case (denominator=0) vs normal case
  ?[denominator=0f;sqrt((px-x1) xexp 2)+(py-y1) xexp 2;numerator%denominator]
 };
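
As a quick sanity check (a hypothetical example, not part of the original listing), score two points against the horizontal segment from (0;0) to (10;0). The perpendicular distances are simply the absolute y-values:

q)pDist[0f;0f;10f;0f;5 2f;3 1f]   / points (5;3) and (2;1) vs the segment (0;0)->(10;0)
3 1f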

Define the iterator function

Logic breakdown

iter steps:

  1. Get the queue of segments
  2. Check the first segment
  3. Find the point with max distance
  4. Split the segment if distance > tolerance
  5. Discard points if distance <= tolerance

iter:{[tol;xVec;yVec;state]
  queue:state 0;
  mask:state 1;

  / Exit if queue is empty
  if[0=count queue;:state];

  / Pop the first range
  startIdx:first key queue;
  endIdx:first value queue;
  remainingQueue:1_queue;

  / Identify points between start (startIdx) and end (endIdx)
  rangeIdx:startIdx+1+til (endIdx-startIdx)-1;

  / If no intermediate points, return updated queue
  if[0=count rangeIdx;:(remainingQueue;mask)];

  / Calculate distances for all intermediate points
  dists:pDist[xVec startIdx;yVec startIdx;xVec endIdx;yVec endIdx;xVec rangeIdx;yVec rangeIdx];
  maxDist:max dists;
  maxDistIdx:rangeIdx dists?maxDist;

  / Split if max distance > tolerance, else discard points (update mask)
  $[maxDist>tol;
    (remainingQueue,(startIdx,maxDistIdx)!(maxDistIdx,endIdx);mask);
    (remainingQueue;@[mask;rangeIdx;:;0b])
  ]
 };
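
To see one pass of the loop, run a single step by hand on a hypothetical five-point spike (toy values, not the test data). The only queued segment runs from index 0 to index 4 and splits at the peak at index 2, so two sub-ranges replace it in the queue and the mask is untouched:

q)xs:0 1 2 3 4f; ys:0 0 5 0 0f
q)iter[1.0;xs;ys;((enlist 0)!enlist 4;5#1b)]   / returns (0 2!2 4;11111b)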

Define the main wrapper function

rdpIter sets up the queue and the boolean keep-mask. It runs iter repeatedly with over, and stops once the queue is empty and the state no longer changes.

rdpIter:{[tol;x;y]
  / Initialize tracker: (Queue of ranges; Boolean Keep Mask)
  tracker:((enlist 0)!enlist count[x]-1;count[x]#1b);

  / Run until queue is empty
  res:iter[tol;x;y] over tracker;

  / Return filtered x and y vectors
  (x;y)@\:where res 1
 };
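
Before running it on the full dataset, you can check it against the same hypothetical five-point spike; with a tolerance of 1, only the two endpoints and the peak survive:

q)rdpIter[1.0;0 1 2 3 4f;0 0 5 0 0f]   / returns (0 2 4f;0 5 0f)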

Run and visualize the results

Run the function on test data. Change tol (tolerance) to tune the result:

  • Low Tolerance (for example, 1): More detail
  • High Tolerance (for example, 10): Less noise. Shows main trends

q)// Run RDP with different tolerances
q)resHigh:rdpIter[1;`float$trade`time;trade`price]
q)resTrend:rdpIter[10;`float$trade`time;trade`price]
q)
q)// Format results into tables for comparison
q)tblHigh:([]time:"p"$resHigh 0;price:resHigh 1;label:`HighFidelity)
q)tblTrend:([]time:"p"$resTrend 0;price:resTrend 1;label:`TrendOnly)

Compare the results:

Figure 5 shows the high-fidelity result compared to the original data. Figure 6 shows the trend-only version compared to the original data.

%%{init: { "theme": "base", "themeVariables": { "xyChart": { "plotColorPalette": "#2962FF, #FF6D00", "backgroundColor": "#ffffff", "titleColor": "#333333", "xAxisLabelColor": "#333333", "xAxisTitleColor": "#333333", "yAxisLabelColor": "#333333", "yAxisTitleColor": "#333333" } } } }%%
xychart-beta
    title "High Fidelity (Tol=1)"
    x-axis "Time" ["09:30", "09:45", "10:00", "10:15", "10:30", "10:45", "11:00", "11:15", "11:30", "11:45", "12:00"]
    y-axis "Price" 40 --> 160
    line [100, 105, 140, 135, 160, 120, 125, 50, 40, 45, 100]
    line [100, 120, 140, 150, 160, 120, 85, 50, 40, 70, 100]

Figure 5: High Fidelity

%%{init: { "theme": "base", "themeVariables": { "xyChart": { "plotColorPalette": "#2962FF, #FF6D00", "backgroundColor": "#ffffff", "titleColor": "#333333", "xAxisLabelColor": "#333333", "xAxisTitleColor": "#333333", "yAxisLabelColor": "#333333", "yAxisTitleColor": "#333333" } } } }%%
xychart-beta
    title "Trend Only (Tol=10)"
    x-axis "Time" ["09:30", "09:45", "10:00", "10:15", "10:30", "10:45", "11:00", "11:15", "11:30", "11:45", "12:00"]
    y-axis "Price" 40 --> 160
    line [100, 105, 140, 135, 160, 120, 125, 50, 40, 45, 100]
    line [100, 115, 130, 145, 160, 130, 100, 70, 40, 70, 100]

Figure 6: Trend Only

Quantify storage savings

Count the rows to measure savings. You will see a large drop in size.

/ Row counts for each configuration
countOrig:count trade
countHigh:count tblHigh
countTrend:count tblTrend

/ Summary table with the percentage reduction relative to the original
([]
  Configuration:`Original`HighFidelity`TrendOnly;
  Row_Count:(countOrig; countHigh; countTrend);
  Reduction_Pct:100*1-(countOrig; countHigh; countTrend)%countOrig
 )

Typical Outcomes

Configuration  Row_Count  Reduction_Pct
Original       10000      0.0
HighFidelity   4200       58.0
TrendOnly      300        97.0

Explore advanced methodology

This guide uses a loop. To learn about the recursive method, read the Dynamically Shrinking Big Data white paper. It covers:

  • Recursion: Why the original algorithm crashes
  • Comparisons: Bucketing vs. RDP
  • Case Studies: More on Cauchy random walks

Summary

In this guide, you:

  • Generated jagged data
  • Implemented a stack-safe RDP algorithm
  • Tuned tolerance to balance detail and size