Stats
This page details the statistical functions available in the Stream Processor.
Describe
Computes the requested descriptive statistics on the provided columns.
.qsp.stats.describe[fields; stats]
Parameters:
name | type | description | default |
---|---|---|---|
fields | symbol or symbol[] | A list of column names to compute statistics on | Required |
stats | symbol, symbol[], or list of tuples and symbols | A list of statistics which should be computed | Required |
Statistic Options
name | type | description |
---|---|---|
minimum | symbol | Computes the minimum of each provided column |
maximum | symbol | Computes the maximum of each provided column |
range | symbol | Computes the range of each provided column |
length | symbol | Counts the length of the batch provided |
total | symbol | Computes the total sum of each provided column |
average | symbol | Computes the average of each provided column |
numDistinct | symbol | Counts the number of distinct elements in each provided column |
numNull | symbol | Counts the number of null elements in each provided column |
numInfinity | symbol | Counts the number of infinite elements in each provided column |
median | symbol | Computes the median of each provided column |
quartiles | symbol | Computes the quartiles of each provided column |
frequency | symbol | Creates a frequency dictionary for each provided column |
mode | symbol | Computes all modes of each provided column |
sampleVar | symbol | Computes the sample variance of each provided column |
sampleStd | symbol | Computes the sample standard deviation of each provided column |
populationVar | symbol | Computes the population variance of each provided column |
populationStd | symbol | Computes the population standard deviation of each provided column |
standardError | symbol | Computes the standard error of each provided column |
skew | symbol | Computes the Fisher-Pearson coefficient of skewness of each provided column |
percentiles | tuple | Computes the specified percentiles on each provided column |
Note: some statistics do not support categorical data and return a generic null for such data.
For all common arguments, refer to configuring operators
This example computes the minimum, maximum, and average of column y on a batch of data.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.stats.describe[`y; `minimum`maximum`average]
  .qsp.write.toVariable[`output];
publish ([] x: til 5; y: 10 13 1 9 8)
output
Expected output: ([] minimum_y: enlist 1; maximum_y: enlist 13; average_y: enlist 8.2)
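To check the expected output by hand, the three statistics for column y can be recomputed in plain Python. This is only a sketch and does not use the Stream Processor: the helper name `describe_column` is hypothetical, and the `<statistic>_<column>` keys simply mirror the column names shown in the expected output above.

```python
from statistics import mean

def describe_column(name, values):
    """Return a one-row summary keyed as <statistic>_<column> (hypothetical helper)."""
    return {
        f"minimum_{name}": min(values),
        f"maximum_{name}": max(values),
        f"average_{name}": mean(values),
    }

# Same y values as the batch published in the example above
y = [10, 13, 1, 9, 8]
summary = describe_column("y", y)
print(summary)
```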
This example demonstrates how to use the percentiles option. The operator below computes the mode and skew along with the 90th, 95th, and 99th percentiles.
Enlist for percentiles
If only percentiles are to be computed, the tuple must be enlisted.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.stats.describe[`x; (`mode; `skew; (`percentiles; 0.9 0.95 0.99))]
  .qsp.write.toVariable[`output];
publish ([] x: til 100)
output
sp.stats.describe('price', 'average')
Parameters:
name | type | description | default |
---|---|---|---|
fields | symbol or symbol[] | A list of column names on which to compute the statistics | Required |
stats | symbol, symbol[], or list of tuples and symbols | A list of statistics that should be computed | Required |
Returns:
A pipeline comprised of a describe operator, which can be joined to other pipelines.
A list of all supported statistic options can be found below:
name | type | description |
---|---|---|
minimum | string | Computes the minimum of each provided column |
maximum | string | Computes the maximum of each provided column |
range | string | Computes the range of each provided column |
length | string | Counts the length of the batch provided |
total | string | Computes the total sum of each provided column |
average | string | Computes the average of each provided column |
numDistinct | string | Counts the number of distinct elements in each provided column |
numNull | string | Counts the number of null elements in each provided column |
numInfinity | string | Counts the number of infinite elements in each provided column |
median | string | Computes the median of each provided column |
quartiles | string | Computes the quartiles of each provided column |
frequency | string | Creates a frequency dictionary for each provided column |
mode | string | Computes all modes of each provided column |
sampleVar | string | Computes the sample variance of each provided column |
sampleStd | string | Computes the sample standard deviation of each provided column |
populationVar | string | Computes the population variance of each provided column |
populationStd | string | Computes the population standard deviation of each provided column |
standardError | string | Computes the standard error of each provided column |
skew* | string | Computes the skewness of each provided column |
percentiles | tuple | Computes the specified percentiles on each provided column |
*calculated using the Fisher-Pearson coefficient of skewness
Categorical Data
Some statistics do not support categorical data and return a generic null for such data.
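As a rough illustration of what the counting-based options in the table mean (frequency, mode, and numDistinct), the sketch below computes them in plain Python on a small categorical column. It shows the definitions only: the helper names are hypothetical, and this page does not state which statistics the operator supports for categorical data or how it orders multiple modes.

```python
from collections import Counter

def frequency(values):
    """Map each distinct value to the number of times it occurs."""
    return dict(Counter(values))

def modes(values):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return [value for value, count in counts.items() if count == top]

colours = ["red", "blue", "red", "green", "blue"]
freq = frequency(colours)
all_modes = modes(colours)
num_distinct = len(set(colours))
print(freq, all_modes, num_distinct)
```

Because "red" and "blue" both occur twice, both are reported, which is what "all modes" means when a column has no single most frequent value.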
>>> from kxi import sp
>>> import pykx as kx
>>> import pandas as pd
>>> sp.run(sp.read.from_callback('publish')
...     | sp.stats.describe('x', 'average')
...     | sp.write.to_variable('out'))
>>> data = pd.DataFrame({
...     'x': [5, 1, 4, 2, 3],
...     'y': [100, 100, 200, 50, 50]
... })
>>> kx.q('publish', data)
average_x
---------
3
Using percentiles along with other stats
>>> from kxi import sp
>>> import pykx as kx
>>> sp.run(sp.read.from_expr('([] x: 1 2 2 3 3 3 4 4 4 4)')
...     | sp.stats.describe('x', ['mode', 'skew', ('percentiles', [0.9, 0.95, 0.99])])
...     | sp.write.to_variable('out'))
>>> kx.q('out')
mode_x skew_x percentile_0.9_x percentile_0.95_x percentile_0.99_x
---------------------------------------------------------------------
4 -0.512289 4 4 4
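The skew_x value in the output above is consistent with dividing the third central moment by the cube of the sample standard deviation (divisor n-1). That reading is inferred from this one example, not from a statement of how the operator is implemented, so treat the following plain-Python sketch as a plausibility check; `skew_sample_std` is a hypothetical helper name.

```python
def skew_sample_std(values):
    """Third central moment divided by the cube of the sample standard deviation.

    Assumption: the variance in the denominator uses the n-1 divisor.
    """
    n = len(values)
    mean = sum(values) / n
    third_moment = sum((v - mean) ** 3 for v in values) / n
    sample_var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return third_moment / sample_var ** 1.5

# Same x values as the from_expr table in the example above
x = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4]
skew_x = skew_sample_std(x)
print(round(skew_x, 6))
```

The negative sign reflects the longer tail toward the small values: most of the mass sits at 3 and 4, with a single 1 pulling the left side out.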
Exponential Moving Average
Calculates the exponential moving average.
.qsp.stats.ema[X; alpha; y]
Parameters:
name | type | description | default |
---|---|---|---|
X | symbol or symbol[] | A list of column names on which to compute the average | Required |
alpha | float | The decay rate | Required |
y | symbol or symbol[] | The columns to write to. These can overwrite existing columns | The same as X |
For all common arguments, refer to configuring operators
This example replaces the columns x and y with their exponential moving averages.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.stats.ema[`x`y; .33]
  .qsp.write.toConsole[];
publish ([] x: til 10; y: 0 1 4 2 5 3 6 7 9 8)
sp.stats.ema('volume', 0.33, 'res')
Parameters:
name | type | description | default |
---|---|---|---|
X | symbol or symbol[] | A single column name or list of column names on which to compute the statistics | Required |
alpha | float | The decay rate to use | Required |
y | symbol or symbol[] | A single column name or list of column names to output results to | The same as X |
Number of input/output columns
The number of source and destination columns must match
Returns:
A pipeline comprised of an ema operator, which can be joined to other pipelines.
>>> from kxi import sp
>>> import pandas as pd
>>> import pykx as kx
>>> sp.run(sp.read.from_callback('publish')
...     | sp.stats.ema('x', 0.33, 'res')
...     | sp.write.to_variable('out'))
>>> data = pd.DataFrame({
...     'x': [1, 50, 3, 4, 5, 6]
... })
>>> kx.q('publish', data)
x res
-----------
1 1
50 17.17
3 12.4939
4 9.690913
5 8.142912
6 7.435751
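The res column above matches the usual recursive definition of an exponential moving average: the first result equals the first value, and each later result is `alpha * value + (1 - alpha) * previous_result`. The plain-Python sketch below reproduces the numbers under that assumption; it is not the operator's implementation, and the function name `ema` here is just a local helper.

```python
def ema(values, alpha):
    """Exponential moving average seeded with the first value."""
    out = []
    for value in values:
        if not out:
            # Assumption: the first record is passed through unchanged
            out.append(float(value))
        else:
            out.append(alpha * value + (1 - alpha) * out[-1])
    return out

# Same x values and decay rate as the example above
res = ema([1, 50, 3, 4, 5, 6], 0.33)
print([round(r, 6) for r in res])
```

A larger alpha makes the average react faster to new values; with alpha at 0.33 the spike of 50 is still visible several records later.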
Simple Moving Average
Computes a moving average by record count.
.qsp.stats.sma[X; n; y]
Parameters:
name | type | description | default |
---|---|---|---|
X | symbol or symbol[] | A list of column names on which to compute the average | Required |
n | long | The number of records to include in the average | Required |
y | symbol or symbol[] | The columns to write to. These can overwrite existing columns | The same as X |
For all common arguments, refer to configuring operators
This calculates, for each data point, the arithmetic mean of a moving window including that point and the n-1 prior data points.
This example replaces each value in y with the simple moving average of that value and the nine prior values.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.stats.sma[`y; 10]
  .qsp.write.toConsole[];
publish ([] x: til 10; y: 0 1 4 2 5 3 6 7 9 8)
sp.stats.sma('price', 60, 'movingAvgPrice')
Parameters:
name | type | description | default |
---|---|---|---|
X | symbol or symbol[] | A single column name or list of column names on which to compute the statistics | Required |
window | long | The size of the window which should be used to calculate the average | Required |
y | symbol or symbol[] | A single column name or list of column names to output results to | The same as X |
Number of input/output columns
The number of source and destination columns must match
Returns:
A pipeline comprised of an sma operator, which can be joined to other pipelines.
>>> from kxi import sp
>>> import pandas as pd
>>> import pykx as kx
>>> sp.run(sp.read.from_callback('publish')
...     | sp.stats.sma('x', 3, 'res')
...     | sp.write.to_variable('out'))
>>> data = pd.DataFrame({
...     'x': [1, 50, 3, 4, 5, 6]
... })
>>> kx.q('publish', data)
x res
-------
1 1
50 25.5
3 18
4 19
5 4
6 5
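The res column above can be reproduced with a short plain-Python sketch of a record-count moving average. It assumes, as the first two output rows suggest, that early records are averaged over however many values are available so far rather than being padded or left null; the function name `sma` is a local helper, not the operator itself.

```python
def sma(values, n):
    """Mean of each value together with up to n-1 prior values."""
    out = []
    for i in range(len(values)):
        # The window is shorter than n until n records have been seen
        window = values[max(0, i - n + 1): i + 1]
        out.append(sum(window) / len(window))
    return out

# Same x values and window size as the example above
res = sma([1, 50, 3, 4, 5, 6], 3)
print(res)
```

Once the value 50 falls out of the three-record window (from the fifth record on), the average drops back to the level of the surrounding data.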
Time Weighted Average
Computes a running time-weighted average.
.qsp.stats.twa[X; times; range; y]
Parameters:
name | type | description | default |
---|---|---|---|
X | symbol or symbol[] | A list of column names on which to compute the average | Required |
times | symbol | The name of the column containing the time data | Required |
range | long, int or short | The number of records to include in the average | Required |
y | symbol or symbol[] | The columns to write to. These can overwrite existing columns | Same as X |
For all common arguments, refer to configuring operators
This calculates, for each data point, the arithmetic mean of a moving window including that point and the n-1 prior data points weighted by the time deltas found in times.
Data must be sorted
The incoming data must be sorted, because the average is calculated using the deltas between each timestamp. Out of order data would cause negative weight to be applied to the calculation.
This example replaces each value in data with the time weighted average of that value and the nine prior values, using weights derived from the time column.
.qsp.run
  .qsp.read.fromCallback[`publish]
  // The windowing is to ensure that records are sorted by timestamp
  .qsp.window.tumbling[00:01:00; `time; .qsp.use `sort`lateness!(1b; 00:00:10)]
  .qsp.stats.twa[`data; `time; 10]
  .qsp.write.toConsole[]
publish ([] time: 0p + 00:00:01 * 0 5 6 17 14 21 57 58 71;
  data: 10 20 10 9 11 8 21 10 9)
This example writes the time weighted averages of columns a and b to columns c and d respectively. Each average covers the current value and the four prior values, weighted using the time column.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.window.tumbling[00:00:01; `time; .qsp.use `sort`lateness!(1b; 00:00:01)]
  .qsp.stats.twa[`a`b; `time; 5; `c`d]
  .qsp.write.toConsole[];
publish ([] time: 0p + 00:00:00.1 * 0 8 13 17 19 21; a: 1 7 8 7 7 8; b: til 6);
sp.stats.twa('price', 'time', 60, '1minMovingAvgPrice')
Parameters:
name | type | description | default |
---|---|---|---|
X | symbol or symbol[] | A single column name or list of column names on which to compute the statistics | Required |
times | symbol or symbol[] | A list of times to be used for weighting | Required |
window | timespan | The size of the window which should be used to calculate the average | Required |
y | symbol or symbol[] | A single column name or list of column names to output results to | Same as X |
Number of input/output columns
The number of source and destination columns must match
This calculates, for each data point, the arithmetic mean of a moving window including that point and the n-1 prior data points weighted by the time deltas found in times.
Data must be sorted
The incoming data must be sorted, because the average is calculated using the deltas between each timestamp. Out of order data would cause negative weight to be applied to the calculation.
Returns:
A pipeline comprised of a twa operator, which can be joined to other pipelines.
Examples:
>>> from kxi import sp
>>> from datetime import timedelta
>>> import pandas as pd
>>> import pykx as kx
>>> sp.run(sp.read.from_callback('publish')
...     | sp.stats.twa('x', 'time', 3, 'res')
...     | sp.write.to_variable('out'))
>>> data = pd.DataFrame({
...     'x': range(1,6),
...     'time': [timedelta(seconds=x) for x in [0, 5, 6, 14, 17]]
... })
>>> kx.q('publish', data)
x time res
--------------------------------
1 0D00:00:00.000000000 1
2 0D00:00:05.000000000 2
3 0D00:00:06.000000000 2.166667
4 0D00:00:14.000000000 3.214286
5 0D00:00:17.000000000 4.166667
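The res column above is consistent with the following weighting scheme: each record is weighted by the time elapsed since the previous record, the first record gets weight 0, and a window whose total weight is 0 falls back to the current value. These rules are inferred from the example output, not taken from the operator's implementation, so the plain-Python sketch below is a plausibility check only; the function name `twa` is a local helper and times are given in seconds.

```python
def twa(values, times, n):
    """Time weighted moving average over the current and n-1 prior records.

    Assumptions (inferred from the example output): a record's weight is
    the time since the previous record, the first record has weight 0,
    and a zero total weight returns the current value unchanged.
    """
    weights = [0.0] + [t1 - t0 for t0, t1 in zip(times, times[1:])]
    out = []
    for i in range(len(values)):
        lo = max(0, i - n + 1)
        window_weights = weights[lo:i + 1]
        total = sum(window_weights)
        if total == 0:
            out.append(float(values[i]))
        else:
            weighted = sum(w * v for w, v in zip(window_weights, values[lo:i + 1]))
            out.append(weighted / total)
    return out

# Same x values, times (in seconds), and window size as the example above
res = twa([1, 2, 3, 4, 5], [0, 5, 6, 14, 17], 3)
print([round(r, 6) for r in res])
```

Under this scheme an out-of-order timestamp produces a negative delta and therefore a negative weight, which is why the incoming data must be sorted by time.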