Machine Learning

Fresh
freshCreate turns batches of data into features based on aggregated statistics

Classification
adaBoostClassifier fits an adaBoost classification model
decisionTreeClassifier fits a decision tree classification model
gaussianNB fits a Gaussian naive Bayes model
kNeighborsClassifier fits a k-nearest neighbors classification model
logClassifier fits a logistic classification model using stochastic gradient descent
quadraticDiscriminantAnalysis fits a quadratic discriminant analysis model
randomForestClassifier fits a random forest classification model

Clustering
affinityPropagation fits an affinity propagation clustering model
birch fits a BIRCH clustering model
cure fits a CURE clustering model
dbscan fits a DBSCAN clustering model
sequentialKMeans fits a sequential k-means model

Regression
adaBoostRegressor fits an adaBoost regression model
gradientBoostingRegressor fits a gradient boosting regression model
kNeighborsRegressor fits a k-nearest neighbors regression model
lasso fits a lasso-linear regression model
linearRegression fits a linear regression model
randomForestRegressor fits a random forest regression model

Metrics
score evaluates a model's predictions

Preprocessing
dropConstant drops constant columns from incoming data
featureHasher encodes categorical data as numeric vectors
labelEncode encodes symbolic data into numerical values
minMaxScaler min-max scales a supplied dataset
oneHot replaces symbolic values with numerical vector representations
standardize standardizes a supplied dataset

Registry
registry.fit fits a model to batches of data, saving a model to a registry
registry.predict predicts a target variable using a trained model from the registry
registry.update trains a model incrementally, returning predictions for all records

Note: All ML operators act solely on unkeyed tables (type 98).

Turns batches of data into features using aggregated statistics.

.qsp.ml.freshCreate[X;features]
.qsp.ml.freshCreate[X;features;.qsp.use enlist[`warn]!enlist warn]


name type description default
X symbol or symbol[] Name of the column(s) in the data to use for FRESH feature generation. Required
features :: or symbol or symbol[] Name of the FRESH feature(s) we want to define from the data. A full list of these features can be found here. Required


name type description default
warn boolean Whether to show warnings (1b) or suppress them (0b). 0b

For all common arguments, refer to configuring operators


type description
table Returns a table containing the specified aggregated FRESH feature columns for each selected column in the input table.

Converts each chosen column into a collection of feature values based on the supplied FRESH features. Typically, the operator is intended to be used in conjunction with the windowing operators that provide regular batches of data from which we engineer features. The aggregate statistics used to create these features can be as simple as max/min/count.

The features parameter accepts the following special values:
:: applies all features.
noHyperparameters applies all features except those that take hyperparameters.
noPython applies all features that do not rely on Python.

As this aggregates a batch to a single row of aggregated statistics, the output table does not include the original columns.
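The collapse of a batch into a single row of statistics can be sketched in plain q, without the stream processor. This is a minimal illustration, not the operator's implementation: the `batch` table, the `absEnergy` helper and the output column names are hypothetical, and absolute energy is taken to be the sum of squared values.

```q
// Hypothetical batch, standing in for one window of data
batch:([] x:1 2 3f; y:10 20 30f)

// Absolute energy, taken here as the sum of squared values
absEnergy:{sum x*x}

// Collapse column x to one value per feature; column names are illustrative
features:enlist `x_absEnergy`x_max!(absEnergy batch`x; max batch`x)
```

Here `features` is a one-row table holding only the aggregated values, so the original `x` and `y` columns no longer appear.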

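Build two features, absEnergy and max.

// Reader and writer shown are assumed
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.window.tumbling[00:01:00; `time]
  .qsp.ml.freshCreate[`x; `absEnergy`max]
  .qsp.write.toConsole[];

publish ([] time: .z.p+00:00:01 * til 500; x: 500?1f);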

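Build the min and max features over count windows of 100 records.

// Reader and writer shown are assumed
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.window.count[100]
  .qsp.ml.freshCreate[`x; `min`max]
  .qsp.write.toConsole[];

publish ([] x: 500?1f; y: 500?100);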

Multi-Layer Perceptron Classifier[X;y;udf;.qsp.use (!) . flip (
    (`hiddenLayerSizes; hiddenLayerSizes);
    (`activation      ; activation);
    (`solver          ; solver);
    (`alpha           ; alpha);
    (`batchSize       ; batchSize);
    (`learningRate    ; learningRate);
    (`learningRateInit; learningRateInit);
    (`powerT          ; powerT);
    (`maxIter         ; maxIter);
    (`registry        ; registry);
    (`experiment      ; experiment);
    (`model           ; model);
    (`bufferSize      ; bufferSize))]


name type description default
X symbol or symbol[] or function Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. Required
y symbol or function Can be the name of the column containing the data's labels OR a user-defined function that returns the target values to use. Required
udf symbol or function Can be the name of the column which is to house the model's predicted label values for each data record OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. Required


name type description default
hiddenLayerSizes int[] List of the number of neurons in each hidden layer in the neural network. Minimum size of each layer is 1. enlist 100
activation string Activation function used to transform the output of the hidden layers into a single scalar value. This value can be identity to use a linear activation function, logistic to use a sigmoid activation function, tanh to use a hyperbolic tangent activation function, or relu to use a rectified linear unit function. relu
solver string Optimization function used to search for the inputs that minimize/maximize the results of the model function. This value can be lbfgs to use a limited-memory BFGS, sgd to use stochastic gradient descent, or adam to use adaptive moment estimation. adam
alpha float Strength of the L2 regularization term. The L2 regularization term is divided by the sample size when added to the loss function and is used to reduce the chance of model overfitting. Minimum value is 0.0. 0.0001
batchSize int Number of training examples used in each stochastic optimization iteration. Minimum value is 1. auto
learningRate string Learning rate schedule for updating the weights of the neural network. Only used when the optimization function is set to sgd. This value can be constant for a constant learning rate, optimal for the optimal learning rate, invscaling to use an inverse scaling learning rate, or adaptive for an adaptive learning rate. constant
learningRateInit float Starting learning rate value. This controls the step-size used when updating the neural network weights. Not used when the optimization function is set to lbfgs. Minimum value is 0.0. 0.001
powerT float Exponent used to update the learning rate when the learning rate is set to invscaling and the optimization function is set to sgd. 0.5
maxIter int Maximum number of optimization epochs/iterations. The model will iterate until it converges or until it completes this number of iterations. Minimum value is 1. 200
registry string Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under 'unnamedExperiments'. ::
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. ()!()
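The four activation choices named in the table can be written out in plain q. This is a sketch of the standard mathematical definitions only; the function names `linear`, `logistic`, `tanhAct` and `relu` are hypothetical and are not part of the operator's API.

```q
// identity: linear activation, returns its input unchanged
linear:{x}

// logistic: sigmoid, maps any real value into the interval (0,1)
logistic:{1%1+exp neg x}

// tanh: hyperbolic tangent, maps any real value into (-1,1)
tanhAct:{e:exp 2*x; (e-1)%e+1}

// relu: rectified linear unit, replaces negative values with zero
relu:{0f|x}

relu -1 0 2f      // 0 0 2f
logistic 0f       // 0.5
```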


type description
table Returns the input data with an additional column containing the model's predicted class labels.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.
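The buffering behavior described above can be sketched in plain q, without the stream processor. This is a minimal illustration under stated assumptions: `fit`, `predict`, `onBatch` and the majority-label "model" are hypothetical stand-ins, and what the real operator emits at the moment of fitting is assumed rather than taken from this reference.

```q
bufferSize:10                              // records to collect before fitting
buffer:([] x:`float$(); y:`boolean$())     // rows held back for training
fitted:0b
model:0b

// Hypothetical stand-ins for the real fit and predict steps:
// the "model" is simply the majority label seen during training
fit:{[t] 0.5<avg t`y}
predict:{[m;t] update yHat:m from t}

// Handle one incoming batch; returns an empty list while still buffering
onBatch:{[t]
  if[fitted; :predict[model;t]];           // already fitted: predict only
  buffer::buffer,t;                        // otherwise keep collecting rows
  if[bufferSize>=count buffer; :()];       // not yet past bufferSize
  model::fit buffer;                       // fit once on the buffered rows
  fitted::1b;
  predict[model;buffer]}
```

With batches of six records, the first call to `onBatch` returns an empty list, the second fits on twelve buffered rows, and later calls only predict with that fitted model.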

Examples: The following spec.q files outline the use of the functionality described above.

Example 1: Fit, Update and Predict with a multi-layer perceptron classifier model

// Generate packet of data
n:100;  // batch size, an assumed value
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);

// Running a pipeline[`publish][`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]

First we pass a batch of data to the stream processor to fit the model

q)publish data
Then we can retrieve predictions by passing new data.

AdaBoost Classifier

.qsp.ml.adaBoostClassifier[X;y;udf;.qsp.use (!) . flip (
    (`nEstimators ; nEstimators);
    (`learningRate; learningRate);
    (`algorithm   ; algorithm);
    (`registry    ; registry);
    (`experiment  ; experiment);
    (`model       ; model);
    (`bufferSize  ; bufferSize))]


name type description default
X symbol or symbol[] or function Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. Required
y symbol or function Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use. Required
udf symbol or function Can be the name of the column which is to house the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. Required


name type description default
nEstimators int Maximum number of estimators to train in each boosting iteration. Each estimator is fit on the dataset and adjusted to focus on difficult classification cases. If we already have a perfect fit, we will not create this maximum number. Minimum value is 1. 50
learningRate float Controls the loss function used to set the weight of each classifier at each boosting iteration. The higher this value, the more each classifier will contribute to our final model. This value depends highly on the maximum number of estimators. Minimum value is 0.0. 1.0
algorithm string Multi-class AdaBoost function used to extend the AdaBoost operator to have multi-class capabilities. This value can be SAMME for stagewise additive modeling or SAMME.R for real-valued stagewise additive modeling. SAMME.R
registry string Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For the full list of acceptable arguments see here. ()!()


type description
table Returns the input data with an additional column containing the model's predicted class labels.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.

Examples: The following spec.q files outline the use of the functionality described above.

Example 1: Fit, Update and Predict with an adaBoost classification model

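// Generate packet of data
n:100;  // batch size, an assumed value
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);

// Run a pipeline; the reader and writer shown are assumed
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.adaBoostClassifier[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
  .qsp.write.toConsole[];

First we pass a batch of data to the stream processor to fit the model.

q)publish data

Then we can retrieve predictions by passing new data.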

Decision Tree Classifier

.qsp.ml.decisionTreeClassifier[X;y;udf;.qsp.use (!) . flip (
    (`criterion      ; criterion);
    (`splitter       ; splitter);
    (`maxDepth       ; maxDepth);
    (`minSamplesSplit; minSamplesSplit);
    (`minSamplesLeaf ; minSamplesLeaf);
    (`registry       ; registry);
    (`experiment     ; experiment);
    (`model          ; model);
    (`bufferSize     ; bufferSize))]


name type description default
X symbol or symbol[] or function Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. Required
y symbol or function Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use. Required
udf symbol or function Can be the name of the column which is to house the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. Required


name type description default
criterion string Criteria function used to measure the quality of a split each time a decision tree node is split into children. This can be gini, to use the Gini impurity measure, or entropy, to use the information gain measure. gini
splitter string Strategy used to split the nodes in the tree. This can be best to choose the best split or random to choose the best random split. best
maxDepth int Maximum depth of the decision tree, measured as the longest path from the tree root to a leaf. If set to ::, the tree will expand until all leaves are pure or contain fewer records than minSamplesSplit. ::
minSamplesSplit int Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is 2. 2
minSamplesLeaf int Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is 1. 1
registry string or dict Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
config dict Configurations used for fitting the model. ()!()
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. ()!()
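The two split criteria named in the table, Gini impurity and entropy, can be computed for a vector of class labels in plain q. This is a sketch of the textbook definitions, not the operator's internal code; the helper names `props`, `gini` and `entropy` are hypothetical.

```q
// Proportion of records falling in each class
props:{(count each group x)%count x}

// Gini impurity: 1 minus the sum of squared class proportions
gini:{p:props x; 1-sum p*p}

// Entropy in bits: sum of p*log2(1/p) over the classes
entropy:{p:props x; sum p*2 xlog 1%p}

gini 0011b       // 0.5, an evenly mixed node
gini 0000b       // 0f, a pure node
entropy 0011b    // 1f
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is why a lower value after a split indicates a better split.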


type description
table Returns the input data with an additional column containing the model's predicted class labels.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.


The following spec.q files outline the use of the functionality described above.

Example 1: Fit a decision tree classifier model on data and store the model in local registry.

// Generate packet of data
n:100;  // batch size, an assumed value
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);

// Run a pipeline; the reader and writer shown are assumed
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.decisionTreeClassifier[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"DTCModel")]
  .qsp.write.toConsole[];

First we pass a batch of data to the stream processor to fit the model.

q)publish data

We can see that the model is saved by calling the get model store function.

Then we can retrieve predictions by passing new data.

Gaussian Naive Bayes

.qsp.ml.gaussianNB[X;y;udf;.qsp.use (!) . flip (
    (`priors      ; priors);
    (`varSmoothing; varSmoothing);
    (`registry    ; registry);
    (`experiment  ; experiment);
    (`model       ; model);
    (`bufferSize  ; bufferSize))]


name type description default
X symbol or symbol[] or function Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. Required
y symbol or function Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use. Required
udf symbol or function Can be the name of the column which is to house the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. Required


name type description default
priors float[] List of the prior probabilities for each class. This refers to the probability that a random data record is an instance of the given class before any evidence or other factors are considered. Minimum value for each prior is 0.0. If set to ::, the priors will be adjusted according to the data. ::
varSmoothing float Value added to the variance of each Gaussian distribution to widen the curve and account for more samples further away from the distribution's mean. Minimum value is 0. 1e-9
registry string Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For the full list of acceptable arguments see here. ()!()


type description
table Returns the input data with an additional column containing the model's predicted class labels.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.

Examples: The following spec.q files outline the use of the functionality described above.

Example 1: Fit, Update and Predict with a gaussian naive bayes model

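// Generate packet of data
n:100;  // batch size, an assumed value
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);

// Run a pipeline; the reader and writer shown are assumed
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.gaussianNB[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
  .qsp.write.toConsole[];

First we pass a batch of data to the stream processor to fit the model.

q)publish data

Then we can retrieve predictions by passing new data.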

K-Nearest Neighbors Classifier

.qsp.ml.kNeighborsClassifier[X;y;udf;.qsp.use (!) . flip (
    (`nNeighbors; nNeighbors);
    (`weights   ; weights);
    (`algorithm ; algorithm);
    (`leafSize  ; leafSize);
    (`p         ; p);
    (`metric    ; metric);
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`bufferSize; bufferSize))]


name type description default
X symbol or symbol[] or function Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. Required
y symbol or function Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use. Required
udf symbol or function Can be the name of the column which is to house the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. Required


name type description default
nNeighbors int Number of already classified points, which lie closest to a given unclassified point (neighbors), to factor in when predicting the point's class. Minimum value is 1. 5
weights string Weight function used to decide how much weight to give to the classes of each of the neighboring points when predicting a point's class. Can be uniform to weight each neighbor's class equally or distance to weight each neighbor's class based on its distance to the point. uniform
algorithm string Algorithm used to parse the vector space and decide which points are the nearest neighbors to a given unclassified point. This algorithm can be a ball_tree algorithm, kd_tree algorithm, brute force distance measure approach, or an auto choice based on the data. auto
leafSize int If ball_tree or kd_tree is selected as the algorithm, this is the minimum number of points in a given leaf node, after which point, brute force algorithm will be used to find the nearest neighbors. Setting this value either very close to 1 or very close to the total number of points in the data may have a noticeable impact on model runtime. Minimum value is 1. 30
p int Power parameter used when the distance metric minkowski is selected. Minimum value is 0. 2
metric string Distance metric used to measure the distance between points. This value can be minkowski, euclidean, manhattan, etc. minkowski
registry string Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For the full list of acceptable arguments see here. ()!()
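The role of the power parameter p in the minkowski metric can be shown with a short plain-q sketch. The function name `minkowski` below is a hypothetical helper, not an operator API; it implements the standard definition, in which p=1 gives the Manhattan distance and p=2 the Euclidean distance.

```q
// Minkowski distance of order p between two equal-length vectors a and b
minkowski:{[p;a;b] (sum abs[a-b] xexp p) xexp 1%p}

minkowski[1;0 0f;3 4f]    // 7f, Manhattan distance
minkowski[2;0 0f;3 4f]    // 5f, Euclidean distance
```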


type description
table Returns the input data with an additional column containing the model's predicted class labels.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.

Examples: The following spec.q files outline the use of the functionality described above.

Example 1: Fit, Update and Predict with a k-nearest neighbors classification model

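// Generate packet of data
n:100;  // batch size, an assumed value
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);

// Run a pipeline; the reader and writer shown are assumed
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.kNeighborsClassifier[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
  .qsp.write.toConsole[];

First we pass a batch of data to the stream processor to fit the model.

q)publish data

Then we can retrieve predictions by passing new data.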

Logistic classifier fit using stochastic gradient descent

.qsp.ml.logClassifier[X;y;udf]
.qsp.ml.logClassifier[X;y;udf; .qsp.use (!) . flip (
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`config    ; config);
    (`trend     ; trend);
    (`alpha     ; alpha);
    (`maxIter   ; maxIter);
    (`gTol      ; gTol);
    (`seed      ; seed);
    (`penalty   ; penalty);
    (`lambda    ; lambda);
    (`l1Ratio   ; l1Ratio);
    (`decay     ; decay);
    (`p         ; p);
    (`bufferSize; bufferSize))]


name type description default
X symbol or symbol[] or function Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. Required
y symbol or function Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use. Required
udf symbol or function Can be the name of the column which is to house the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. Required


name type description default
registry string Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
config any Configuration used for fitting the model. ()!()
trend boolean Whether to add a constant value (intercept) to the classification function - c in y=mx+c. 1b
alpha float Learning rate value used in the optimization function to dictate the step size taken towards the minimum of the loss function at each iteration. A high value will override information about previous data more in favor of newly acquired information. Generally, this value is set to be very small. Minimum value is 0.0. 0.01
maxIter long Maximum number of iterations before model training is terminated. The model will iterate until it converges or until it completes this number of iterations. Minimum value is 1. 100
gTol float Tolerance value required to stop searching for the global minimum/maximum value. This is achieved once you get close enough to this global value. Minimum value is 0.0. 1e-5
seed long Integer value used to control the randomness of the model's initialization state. Specifying this allows for reproducible results across function calls. If a value is not supplied, the randomness is based on the current timestamp. 0
penalty symbol Penalty term used to shrink the coefficients of the less contributive variables. Can be l1 to add an L1 penalty term, l2 to add an L2 penalty term, or elasticNet to add both L1 and L2 penalty terms. l2
lambda float Lambda value used to define the strength of the regularization applied. The higher this value is, the stronger the regularization will be. Minimum value is 0.0. 0.001
l1Ratio float If Elastic Net is chosen as the regularization method, this parameter determines the balance between the L1 and L2 penalty terms. If this value is set to 0, this is the same as using L2 regularization, if this value is set to 1, this is the same as using L1 regularization. This value must lie in the range [0.0, 1.0]. 0.5
decay float Describes how much weight to give to historical predictions from previously fit iterations. The higher this value, the less important historic predictions will be. Minimum value is 0.0. 0f
p float Coefficient used to help accelerate the gradient vectors in the right direction, leading to faster convergence. Minimum value is 0.0. 0f
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0

For all common arguments, refer to configuring operators
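How lambda and l1Ratio combine under the elasticNet penalty can be sketched in plain q. The exact form used by the operator is not stated here, so this follows one common convention (an L1 term plus half the squared L2 term) as an assumption; the function name `penalty` and the weight vector are hypothetical.

```q
// Elastic net penalty for a weight vector w:
// lambda scales the whole term, l1Ratio blends the L1 and L2 parts
penalty:{[lambda;l1Ratio;w] lambda*(l1Ratio*sum abs w)+(1-l1Ratio)*0.5*sum w*w}

w:3 -4f
penalty[0.001;1f;w]     // l1Ratio of 1: only the L1 part contributes
penalty[0.001;0f;w]     // l1Ratio of 0: only the L2 part contributes
penalty[0.001;0.5;w]    // an even blend of the two
```

Raising lambda scales the penalty and so strengthens the regularization, while moving l1Ratio between 0 and 1 shifts the balance from the L2 term to the L1 term.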


type description
table Returns the input data with an additional column containing the model's predicted class labels.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. As this is an online model, if subsequent data is passed to the stream, each new collection of data points will be used to update the classifier model and a prediction will be made for each record.

Performance Limitations

Use of this functionality in high-throughput environments is not currently encouraged. Prediction times for this function are on the order of milliseconds. Further optimizations are expected in later releases.

Fit, update, and predict with a logistic classification model.

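// Generate packet of data
n:500;  // batch size, an assumed value: two batches fill the 1000-record buffer
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);

// Run a pipeline; the reader and writer shown are assumed
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.logClassifier[`x;`y;`yHat; .qsp.use `modelArgs`bufferSize!((1b;()!());1000)]
  .qsp.write.toConsole[];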

// Data will be buffered for training until the buffer size is reached,
// during which time no batches will be emitted.
publish data;

// When the buffer size is reached, buffered data will be used for training,
// and will itself be classified and emitted.
publish data;

// The operator can now be used to make predictions.
// Subsequent data will not be used for training, as the bufferSize has been exceeded.
publish data;

Quadratic Discriminant Analysis

.qsp.ml.quadraticDiscriminantAnalysis[X;y;udf;.qsp.use (!) . flip (
    (`priors    ; priors);
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`bufferSize; bufferSize))]


name type description default
X symbol or symbol[] or function Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. Required
y symbol or function Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use. Required
udf symbol or function Can be the name of the column which is to house the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. Required


name type description default
priors float[] List of the prior probabilities for each class. This refers to the probability that a random data record is an instance of the given class before any evidence or other factors are considered. Minimum value for each prior is 0.0. If set to ::, the priors will be adjusted according to the data. ::
registry string Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. ()!()


type description
table Returns the input data with an additional column containing the model's predicted class labels.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.

Examples: The following spec.q files outline the use of the functionality described above.

Example 1: Fit, Update and Predict with a quadratic discriminant analysis model

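// Generate packet of data
n:100;  // batch size, an assumed value
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);

// Run a pipeline; the reader and writer shown are assumed
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.quadraticDiscriminantAnalysis[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
  .qsp.write.toConsole[];

First we pass a batch of data to the stream processor to fit the model.

q)publish data

Then we can retrieve predictions by passing new data.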

Random Forest Classifier

.qsp.ml.randomForestClassifier[X;y;udf;.qsp.use (!) . flip (
    (`nEstimators          ; nEstimators);
    (`criterion            ; criterion);
    (`maxDepth             ; maxDepth);
    (`minSamplesSplit      ; minSamplesSplit);
    (`minSamplesLeaf       ; minSamplesLeaf);
    (`minWeightFractionLeaf; minWeightFractionLeaf);
    (`maxFeatures          ; maxFeatures);
    (`maxLeafNodes         ; maxLeafNodes);
    (`minImpurityDecrease  ; minImpurityDecrease);
    (`bootstrap            ; bootstrap);
    (`registry             ; registry);
    (`experiment           ; experiment);
    (`model                ; model);
    (`bufferSize           ; bufferSize))]


name type description default
X symbol or symbol[] or function Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. Required
y symbol or function Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. Required
udf symbol or function Can be the name of the column which is to house the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. Required


name type description default
nEstimators int Number of decision tree estimators in the forest. Each tree is fit on the dataset (or on a bootstrap sample of it) and the forest's prediction combines the predictions of all trees. Minimum value is 1. 100
criterion string Criteria function used to measure the quality of a split each time a decision tree node is split into children. This can be gini to use the Gini impurity measure or entropy to use the information gain measure. gini
maxDepth int Maximum depth of the decision tree, measured as the longest path from the tree root to a leaf. If set to ::, the tree will expand until all leaves are pure or contain fewer than minSamplesSplit records. Minimum value is 1. ::
minSamplesSplit int Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is 2. 2
minSamplesLeaf int Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is 1. 1
minWeightFractionLeaf float Minimum proportion of sample weight required to be at any leaf node relative to the total weight of all samples in the tree. When the sample_weight argument is not set using the modelInit parameter, each sample carries equal weight. This value must lie in the range [0.0, 1.0]. 0.0
maxFeatures string Maximum number of features to consider when looking for the best way to split a node. This value can be sqrt for the square root of all features, log2 for log to the base 2 of all features, or auto to automatically select the number of features to consider. auto
maxLeafNodes int Maximum number of leaf nodes in each decision tree. This forces the tree to grow in a best-first fashion with the best nodes based on their relative reduction in impurity. If set to ::, there may be unlimited leaf nodes. Minimum value is 1. ::
minImpurityDecrease float Minimum impurity decrease value required to split a node. If the tree impurity would not decrease by more than this value, the node will not be split. Minimum value is 0.0. 0.0
bootstrap boolean Whether bootstrap samples are used when building trees. If 0b, the whole dataset is used to build each tree. 1b
registry string Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. ()!()
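
To make the criterion and minImpurityDecrease options more concrete, here is a minimal plain-Python sketch of the two impurity measures and of the impurity decrease produced by a candidate split. The function names are hypothetical and the decrease is computed for a node that holds the whole dataset; the underlying Python model may weight it differently for deeper nodes.

```python
from collections import Counter
from math import log2


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits, the basis of the information gain measure."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def impurity_decrease(parent, left, right, measure=gini):
    """Parent impurity minus the size-weighted impurity of the children."""
    n = len(parent)
    weighted = (len(left) / n) * measure(left) + (len(right) / n) * measure(right)
    return measure(parent) - weighted
```

A split would only be kept when `impurity_decrease` is at least the configured minImpurityDecrease value, and when both children satisfy minSamplesLeaf.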


type description
table Returns the input data with an additional column containing the model's predicted class labels.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.

Examples: The following spec.q files outline the use of the functionality described above.

Example 1: Fit, Update and Predict with a random forest classification model

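// Generate packet of data (n is the number of records; 100 is an assumed value)
n:100;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);

// Running a pipeline
.qsp.run .qsp.read.fromCallback[`publish]
  .qsp.ml.randomForestClassifier[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
  .qsp.write.toConsole[];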

First we pass a batch of data to the stream processor to fit the model

q)publish data
Then we can retrieve predictions by passing new data.

Affinity Propagation Clustering Algorithm.

.qsp.ml.affinityPropagation[X;udf;.qsp.use (!) . flip (
    (`damping        ; damping);
    (`maxIter        ; maxIter);
    (`convergenceIter; convergenceIter);
    (`affinity       ; affinity);
    (`randomState    ; randomState);
    (`registry       ; registry);
    (`experiment     ; experiment);
    (`model          ; model);
    (`config         ; config);
    (`bufferSize     ; bufferSize))]


name type description default
X symbol or symbol[] or function Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. Required
udf symbol or function Can be the name of the column which is to house the model's predicted cluster labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. Required


name type description default
damping float Provides numerical stabilization and limits oscillations and “overshooting” of parameters by controlling the extent to which the current value is maintained relative to incoming values. This value must lie in the range [0.5, 1.0). 0.5
maxIter int Maximum number of iterations before model training is terminated. The model will iterate until it converges or until it completes this number of iterations. Minimum value is 1. 200
convergenceIter int Number of iterations, during which there is no change in the number of estimated clusters, needed to stop the convergence. Minimum value is 1. 15
affinity string Statistical measure used to define similarities between the representative points. This value can be euclidean to use negative squared Euclidean distance or precomputed to use the values in the data's distance matrix. euclidean
randomState int Integer value used to control the state of the random generator used in this model. Specifying this allows for reproducible results across function calls. If set to ::, the randomness is based off the current timestamp. ::
registry string or dict Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
config dict Configuration used for fitting the model. ()!()
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. ()!()
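
The affinity and damping options can be illustrated with a short plain-Python sketch. It shows only the similarity measure selected by `euclidean` and the damped blending of an old value with an incoming one; it is not a full affinity propagation implementation, and the function names are hypothetical.

```python
def neg_sq_euclidean(a, b):
    """Similarity used by the euclidean affinity: negative squared distance."""
    return -sum((x - y) ** 2 for x, y in zip(a, b))


def similarity_matrix(points):
    """Pairwise similarities between all points."""
    return [[neg_sq_euclidean(p, q) for q in points] for p in points]


def damped_update(old, incoming, damping=0.5):
    """Blend the current value with the incoming one.

    A damping close to 1.0 keeps most of the current value, which limits
    oscillation between iterations.
    """
    if not 0.5 <= damping < 1.0:
        raise ValueError("damping must lie in the range [0.5, 1.0)")
    return damping * old + (1.0 - damping) * incoming
```

For example, with `damping=0.9` an old value of 10.0 and an incoming value of 0.0 blend to 9.0, so a single iteration moves the value only a tenth of the way.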


type description
table Returns the input data with an additional column containing the model's predicted cluster labels.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator outputs the original data table together with clusters added.


The following spec.q files outline the use of the functionality described above.

Example 1: Fit an affinityPropagation clustering model, storing the result in a registry.

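// Generate packet of data (assumed shape: 100 random records in feature columns x, x1 and x2)
data:([]x:100?1f;x1:100?1f;x2:100?1f);
// Running a pipeline
.qsp.run .qsp.read.fromCallback[`publish]
  .qsp.ml.affinityPropagation[`x`x1`x2;`cluster; .qsp.use `registry`model!("/tmp";"AffinityPropagationModel")]
  .qsp.write.toConsole[];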

First we pass a batch of data to the stream processor to fit the model

q)publish data
We can see that the model is saved by calling the get model store function.
Then we can retrieve predictions by passing new data
q)publish ([]5?1f;5?1f;5?1f)
                             | x           x1         x2        cluster
-----------------------------| ----------------------------------------
2022.03.01D09:26:44.376050100| 0.3065473   0.7141816  0.5130882 1
2022.03.01D09:26:44.376050100| 0.5817309   0.6165058  0.2164453 0
2022.03.01D09:26:44.376050100| 0.004154821 0.8229675  0.514663  1
2022.03.01D09:26:44.376050100| 0.7639509   0.07025696 0.1601784 0
2022.03.01D09:26:44.376050100| 0.3417209   0.59064    0.6708373 1

Birch Clustering Algorithm.

.qsp.ml.birch[X;udf;.qsp.use (!) . flip (
    (`threshold      ; threshold);
    (`branchingFactor; branchingFactor);
    (`nClusters      ; nClusters);
    (`registry       ; registry);
    (`experiment     ; experiment);
    (`model          ; model);
    (`config         ; config);
    (`bufferSize     ; bufferSize))]


name type description default
X symbol or symbol[] or function Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. Required
udf symbol or function Can be the name of the column which is to house the model's predicted cluster labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. Required


name type description default
threshold float Maximum cluster radius allowed for a new sample to be merged into its closest subcluster. If adding this point to the subcluster would cause that subcluster's radius to exceed this maximum, the new point is not added and instead becomes a new subcluster. Minimum value is 0.0. 0.5
branchingFactor int Maximum number of subclusters in each node in the tree, where each leaf node contains a subcluster. If a new sample arrives causing the number of subclusters to exceed this value for a given node, the node is split into two nodes. Minimum value is 1. 50
nClusters int Final number of clusters to be defined by the model. Minimum value is 2. 3
registry string or dict Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
config dict Configuration for fitting the model. ()!()
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. ()!()
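
The threshold rule can be sketched in plain Python as follows. This is a simplified illustration: it stores the raw points of a subcluster, whereas BIRCH implementations normally keep only summary statistics, and the helper names `radius` and `absorb` are hypothetical.

```python
from math import sqrt


def radius(points):
    """Root mean squared distance of the points from their centroid."""
    n = len(points)
    dims = len(points[0])
    centroid = [sum(p[d] for p in points) / n for d in range(dims)]
    mean_sq = sum(
        sum((p[d] - centroid[d]) ** 2 for d in range(dims)) for p in points
    ) / n
    return sqrt(mean_sq)


def absorb(subcluster, point, threshold=0.5):
    """Try to merge point into subcluster.

    Returns (subcluster, new_subcluster): the point is merged when the
    resulting radius stays within threshold, otherwise it starts a new
    subcluster of its own.
    """
    candidate = subcluster + [point]
    if radius(candidate) <= threshold:
        return candidate, None
    return subcluster, [point]
```

A nearby point is merged and leaves the second return value as `None`, while a distant point leaves the original subcluster untouched and is returned as a new single-point subcluster.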


type description
table Returns the input data with an additional column containing the model's predicted cluster labels.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator outputs the original data table together with clusters added.


The following spec.q files outline the use of the functionality described above.

Example 1: Fit a BIRCH clustering model, storing the result in a registry.

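// Generate packet of data (assumed shape: 100 random records in feature columns x, x1 and x2)
data:([]x:100?1f;x1:100?1f;x2:100?1f);
// Running a pipeline
.qsp.run .qsp.read.fromCallback[`publish]
  .qsp.ml.birch[`x`x1`x2;`cluster; .qsp.use `registry`model!("/tmp";"BirchModel")]
  .qsp.write.toConsole[];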

First we pass a batch of data to the stream processor to fit the model

q)publish data
We can see that the model is saved by calling the get model store function.
Then we can retrieve predictions by passing new data
q)publish ([]5?1f;5?1f;5?1f)
                             | x           x1         x2        cluster
-----------------------------| ----------------------------------------
2022.03.01D09:26:44.376050100| 0.3065473   0.7141816  0.5130882 1
2022.03.01D09:26:44.376050100| 0.5817309   0.6165058  0.2164453 0
2022.03.01D09:26:44.376050100| 0.004154821 0.8229675  0.514663  1
2022.03.01D09:26:44.376050100| 0.7639509   0.07025696 0.1601784 0
2022.03.01D09:26:44.376050100| 0.3417209   0.59064    0.6708373 1

CURE Clustering Algorithm.

.qsp.ml.cure[X;udf;.qsp.use (!) . flip (
    (`df        ; df);
    (`n         ; n);
    (`c         ; c);
    (`cutDict   ; cutDict);
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`config    ; config);
    (`bufferSize; bufferSize))]


name type description default
X symbol or symbol[] or function Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. Required
udf symbol or function Can be the name of the column which is to house the model's predicted cluster labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. Required


name type description default
df symbol Distance function used to measure the distance between points when clustering. This can be edist for Euclidean distance, e2dist for squared Euclidean distance, nege2dist for negative squared Euclidean distance, mdist for Manhattan distance, or cshev for Chebyshev distance. edist
n int Number of representative points to choose from each cluster to compare the similarity of clusters for the purposes of potentially merging them. Minimum value is 1. 2
c float Compression factor used for grouping the representative points together. Minimum value is 0.0. 0.0
k int Final number of clusters to be defined by the model. Minimum value is 2. The distance used when cutting the dendrogram will be adjusted to fit this number so only specify one of the parameters k or dist. If set to ::, the dist parameter will be used. If both are set to ::, the cutDict parameter will be used. ::
dist float Distance between leaves at which to cut the dendrogram to define the clusters. Minimum value is 0.0. The number of clusters will be dynamic based on this distance so only specify one of the parameters k or dist. If set to ::, the k parameter will be used. If both are set to ::, the cutDict parameter will be used. ::
cutDict dict A dictionary that defines the cutting algorithm used when splitting the data into clusters. This can be used to define a k value or a dist value (documentation for these above). enlist[`k]!enlist 3
registry string or dict Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
config dict Configuration for fitting the model. ()!()
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
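
The five distance functions accepted by the df option can be written out in plain Python as below. The names mirror the option values, but the code is only an illustration of the formulas and is not taken from the library.

```python
from math import sqrt


def e2dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def edist(a, b):
    """Euclidean distance."""
    return sqrt(e2dist(a, b))


def nege2dist(a, b):
    """Negative squared Euclidean distance."""
    return -e2dist(a, b)


def mdist(a, b):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def cshev(a, b):
    """Chebyshev distance: largest absolute coordinate difference."""
    return max(abs(x - y) for x, y in zip(a, b))


# Lookup by option value, for example DISTANCES["mdist"]((0, 0), (3, 4)).
DISTANCES = {f.__name__: f for f in (edist, e2dist, nege2dist, mdist, cshev)}
```

For the points (0, 0) and (3, 4) these give 5.0, 25, -25, 7 and 4 respectively, which shows how strongly the choice of df changes the scale of the distances being compared.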


type description
table Returns the input data with an additional column containing the model's predicted cluster labels.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator outputs the original data table together with clusters added.


The following spec.q files outline the use of the functionality described above.

Example 1: Fit a CURE clustering model, storing the result in a registry.

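// Generate packet of data (assumed shape: 100 random records in feature columns x, x1 and x2)
data:([]x:100?1f;x1:100?1f;x2:100?1f);
// Running a pipeline
.qsp.run .qsp.read.fromCallback[`publish]
  .qsp.ml.cure[`x`x1`x2;`cluster; .qsp.use `registry`model!("/tmp";"cureModel")]
  .qsp.write.toConsole[];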

First we pass a batch of data to the stream processor to fit the model

q)publish data
We can see that the model is saved by calling the get model store function.
Then we can retrieve predictions by passing new data
q)publish ([]5?1f;5?1f;5?1f)
                             | x           x1         x2        cluster
-----------------------------| ----------------------------------------
2022.03.01D09:26:44.376050100| 0.3065473   0.7141816  0.5130882 1
2022.03.01D09:26:44.376050100| 0.5817309   0.6165058  0.2164453 0
2022.03.01D09:26:44.376050100| 0.004154821 0.8229675  0.514663  1
2022.03.01D09:26:44.376050100| 0.7639509   0.07025696 0.1601784 0
2022.03.01D09:26:44.376050100| 0.3417209   0.59064    0.6708373 1

DBSCAN Clustering Algorithm.

.qsp.ml.dbscan[X;udf;.qsp.use (!) . flip (
    (`df        ; df);
    (`minPts    ; minPts);
    (`eps       ; eps);
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`config    ; config);
    (`bufferSize; bufferSize))]


name type description default
X symbol or symbol[] or function Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. Required
udf symbol or function Can be the name of the column which is to house the model's predicted cluster labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. Required


name type description default
df symbol Distance function used to measure the distance between points when clustering. This can be edist for Euclidean distance, e2dist for squared Euclidean distance, nege2dist for negative squared Euclidean distance, mdist for Manhattan distance, or cshev for Chebyshev distance. edist
minPts int Minimum number of points required to be close together before this group of points is defined as a cluster. The distance between these points must be less than or equal to the eps parameter. Minimum value is 1. 2
eps float Maximum distance points are allowed to be away from one another to still be classed as close enough to be in the same cluster. Minimum value is 0.0. 1.0
registry string or dict Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
config dict Configuration for fitting the model. ()!()
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
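
How minPts and eps interact can be seen in a small plain-Python sketch that finds the points dense enough to seed a cluster. It uses Euclidean distance and assumes that a point counts as its own neighbour; the operator's exact counting convention is not stated here, and the function names are hypothetical.

```python
from math import dist


def neighbours(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def core_points(points, eps=1.0, min_pts=2):
    """Indices of points that have at least min_pts neighbours within eps."""
    return [
        i for i in range(len(points))
        if len(neighbours(points, i, eps)) >= min_pts
    ]
```

Two points half a unit apart are both core points under the defaults, whereas an isolated point far from everything else is not and would be treated as noise.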


type description
table Returns the input data with an additional column containing the model's predicted cluster labels.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator outputs the original data table together with clusters added.


The following spec.q files outline the use of the functionality described above.

Example 1: Fit a DBSCAN clustering model, storing the result in a registry.

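// Generate packet of data (assumed shape: 100 random records in feature columns x, x1 and x2)
data:([]x:100?1f;x1:100?1f;x2:100?1f);
// Running a pipeline
.qsp.run .qsp.read.fromCallback[`publish]
  .qsp.ml.dbscan[`x`x1`x2;`cluster; .qsp.use `registry`model!("/tmp";"dbscanModel")]
  .qsp.write.toConsole[];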

First we pass a batch of data to the stream processor to fit the model

q)publish data
We can see that the model is saved by calling the get model store function.
Then we can retrieve predictions by passing new data
q)publish ([]5?1f;5?1f;5?1f)
                             | x           x1         x2        cluster
-----------------------------| ----------------------------------------
2022.03.01D09:26:44.376050100| 0.3065473   0.7141816  0.5130882 1
2022.03.01D09:26:44.376050100| 0.5817309   0.6165058  0.2164453 0
2022.03.01D09:26:44.376050100| 0.004154821 0.8229675  0.514663  1
2022.03.01D09:26:44.376050100| 0.7639509   0.07025696 0.1601784 0
2022.03.01D09:26:44.376050100| 0.3417209   0.59064    0.6708373 1

Sequential K-Means clustering.

.qsp.ml.sequentialKMeans[X]
.qsp.ml.sequentialKMeans[X; .qsp.use (!) . flip (
    (`df        ; df);
    (`k         ; k);
    (`centers   ; centers);
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`config    ; config);
    (`init      ; init);
    (`alpha     ; alpha);
    (`forgetful ; forgetful);
    (`bufferSize; bufferSize))]


name type description default
X symbol or symbol[] or function Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. Required


name type description default
df symbol Distance function used to measure the distance between points when clustering. This can be edist for Euclidean distance, e2dist for squared Euclidean distance, nege2dist for negative squared Euclidean distance, mdist for Manhattan distance, or cshev for Chebyshev distance. edist
k long Final number of clusters to be defined by the model. Minimum value is 2. 3
centers dictionary or :: A dictionary mapping each cluster to the cluster centroid value that we want these clusters to initialize with. ::
registry string Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
config any Configuration used for fitting the model. ()!()
init bool Initialization method for the cluster centroids. This value can either be K-means++ (1b) or randomized initialization (0b). 1b
alpha float Rate at which forgetting is applied within the algorithm. If forgetful Sequential K-Means is applied, this value defines how much past cluster centroid information is retained. Otherwise it is set to 1/(n+1), where n is the number of points in a given cluster. This value must lie in the range [0.0, 1.0]. 0.1
forgetful bool Whether to apply forgetful Sequential K-Means (1b) or normal Sequential K-Means (0b). Forgetful Sequential K-Means will allow the model to evolve its cluster boundaries over time by forgetting about old data as new data comes in. 1b
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
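
The K-means++ initialization selected by init set to 1b can be sketched in plain Python as follows. This is a generic version of the technique and not the library's code; the function name and the `seed` argument are hypothetical.

```python
import random


def kmeans_pp(points, k, seed=0):
    """Pick k initial centers, favouring points far from those already chosen."""
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [
            min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
            for p in points
        ]
        if sum(d2) == 0:
            # Every point coincides with a center: fall back to a uniform pick.
            centers.append(rng.choice(points))
        else:
            # Sample the next center with probability proportional to d2.
            centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers
```

Because an already chosen point has zero weight, distinct input points yield distinct centers, which is the main advantage over purely random initialization.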

For all common arguments, refer to configuring operators


type description
table or :: Null during initial fitting. Afterwards returns the input data with an additional column containing the model's predicted cluster labels.

The sequential K-Means algorithm is applied within a streaming framework. When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. As this is an online model, if subsequent data is passed to the stream, each new collection of data points is used to update the current cluster centers, and a prediction is made of the cluster to which each point belongs.
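
The online update of a single cluster center can be sketched in plain Python. The sketch assumes that alpha acts as the weight given to the incoming point, which is consistent with the 1/(n+1) running-mean rate used in the non-forgetful case; the function name and signature are hypothetical.

```python
def update_center(center, point, count, alpha=0.1, forgetful=True):
    """Move a center towards a newly assigned point.

    count is the number of points already in the cluster. Returns the new
    center and the new count.
    """
    # Forgetful: constant rate, so old points fade out over time.
    # Non-forgetful: 1/(count+1), which reproduces the running mean.
    rate = alpha if forgetful else 1.0 / (count + 1)
    new_center = [c + rate * (x - c) for c, x in zip(center, point)]
    return new_center, count + 1
```

With `forgetful=False`, a center at 0.0 that has seen one point moves to 1.0 after receiving 2.0, the mean of both points. With `forgetful=True` and `alpha=0.1` it moves only to 0.2, but it would keep moving at that rate however many points had been seen before.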


Fit, update, and predict with the sequential K-Means model.

// Running a pipeline
.qsp.run .qsp.read.fromCallback[`publish]
  .qsp.ml.sequentialKMeans[`x`x1`x2; .qsp.use enlist[`bufferSize]!enlist 100]
  .qsp.write.toConsole[];

publish ([]100?1f;100?1f;100?1f);
publish ([] 50?1f; 50?1f; 50?1f);

AdaBoost Regressor.

.qsp.ml.adaBoostRegressor[X;y;udf]
.qsp.ml.adaBoostRegressor[X;y;udf;.qsp.use (!) . flip (
    (`nEstimators ; nEstimators);
    (`learningRate; learningRate);
    (`loss        ; loss);
    (`registry    ; registry);
    (`experiment  ; experiment);
    (`model       ; model);
    (`modelInit   ; modelInit);
    (`bufferSize  ; bufferSize))]


name type description default
X symbol or symbol[] or function Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. Required
y symbol or function Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. Required
udf symbol or function Can be the name of the column which is to house the model's predicted target labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. Required


name type description default
nEstimators int Maximum number of estimators at which boosting is terminated. Each estimator is fit on the dataset and adjusted to focus on difficult prediction cases. If a perfect fit is reached earlier, fewer estimators are created. Minimum value is 1. 50
learningRate float Weight applied to each regressor at each boosting iteration. The higher this value, the more each regressor will contribute to our final model. This value depends highly on the maximum number of estimators. This value must lie in the range (0.0, inf). 1.0
loss string Loss function used to update the contributing weights of the regressors after each boosting iteration. This can be linear for linear loss, square for mean squared error, or exponential for exponential loss. "linear"
registry string or dict Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. ()!()
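
The three loss options can be sketched in plain Python, assuming the AdaBoost.R2 formulation in which each absolute error is first scaled by the largest error in the batch. The function name is hypothetical and the underlying Python model may differ in detail.

```python
from math import exp


def boosting_losses(y_true, y_pred, loss="linear"):
    """Per-sample loss in [0, 1] used to reweight samples after an iteration."""
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    largest = max(errors)
    if largest == 0:
        return [0.0] * len(errors)   # perfect fit: nothing to reweight
    scaled = [e / largest for e in errors]
    if loss == "linear":
        return scaled
    if loss == "square":
        return [e ** 2 for e in scaled]
    if loss == "exponential":
        return [1.0 - exp(-e) for e in scaled]
    raise ValueError("loss must be linear, square or exponential")
```

Samples with a larger loss receive more weight in the next boosting iteration, so `square` concentrates attention on the worst predictions more sharply than `linear` does.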


type description
table Returns the input data with an additional column containing the model's predicted target values.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.


The following spec.q files outline the use of the functionality described above.

Example 1: Fit an adaBoost regression model on data and store model in local registry.

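// Generate packet of data (n is the number of records; 100 is an assumed value)
n:100;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);

// Running a pipeline
.qsp.run .qsp.read.fromCallback[`publish]
  .qsp.ml.adaBoostRegressor[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"AdaModel")]
  .qsp.write.toConsole[];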

First we pass a batch of data to the stream processor to fit the model

q)publish data
We can see that the model is saved by calling the get model store function.
Then we can retrieve predictions by passing new data
q)publish ([]5?1f;5?1f;5?1f;y:0n)
                             | x         x1          x2        y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248   0.591584    0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808    0.3408518   0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502   0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611   0.2372084   0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186   0.1082126   0.9550528

Decision Tree Regressor.

.qsp.ml.decisionTreeRegressor[X;y;udf]
.qsp.ml.decisionTreeRegressor[X;y;udf;.qsp.use (!) . flip (
    (`criterion      ; criterion);
    (`splitter       ; splitter);
    (`maxDepth       ; maxDepth);
    (`minSamplesSplit; minSamplesSplit);
    (`minSamplesLeaf ; minSamplesLeaf);
    (`registry       ; registry);
    (`experiment     ; experiment);
    (`model          ; model);
    (`modelInit      ; modelInit);
    (`bufferSize     ; bufferSize))]


name type description default
X symbol or symbol[] or function Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. Required
y symbol or function Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. Required
udf symbol or function Can be the name of the column which is to house the model's predicted target labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. Required


name type description default
criterion string Criteria function used to measure the quality of a split each time a decision tree node is split into children. This can be squared_error for mean squared error, friedman_mse for mean squared error with Friedman's improvement score, absolute_error for mean absolute error, or poisson for Poisson deviance. "squared_error"
splitter string Strategy used to split the nodes in the tree. This can be best to choose the best split or random to choose the best random split. "best"
minSamplesSplit int Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is 2. 2
minSamplesLeaf int Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is 1. 1
maxDepth int Maximum depth of the decision tree - measured as the longest path from the tree root to a leaf. If set to ::, the tree will expand until all leaves are pure or contain fewer than minSamplesSplit records. Minimum value is 1. ::
registry string or dict Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. ()!()


type description
table Returns the input data with an additional column containing the model's predicted target values.

When data is fed into this operator via a stream, the underlying model is fit only once the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch that causes the buffered data to exceed n records is included in fitting. If subsequent data is passed to the stream, the operator outputs a prediction for each record using the model fitted on the buffered records.
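The buffering rule above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the operator's implementation: the class name `BufferedRegressor` and its `on_batch` method are hypothetical, the "model" is a stand-in that predicts the mean of the training targets, and emitting nothing while buffering is an assumption.

```python
from statistics import mean


class BufferedRegressor:
    """Buffer records, fit once the buffer exceeds buffer_size, then only predict."""

    def __init__(self, buffer_size=0):
        self.buffer_size = buffer_size
        self.buffer = []      # (x, y) records held back until the model is fit
        self.model = None     # stand-in model: the mean of the training targets

    def on_batch(self, batch):
        """Take a list of (x, y) pairs; return (x, y, yHat) rows, or None while buffering."""
        if self.model is None:
            self.buffer.extend(batch)
            if len(self.buffer) <= self.buffer_size:
                return None   # assumption: nothing is emitted while buffering
            # The batch that pushed the buffer past buffer_size is part of the fit.
            self.model = mean(y for _, y in self.buffer)
            rows, self.buffer = self.buffer, []
            return [(x, y, self.model) for x, y in rows]
        # Already fitted: later batches are predicted on, never trained on.
        return [(x, y, self.model) for x, y in batch]


op = BufferedRegressor(buffer_size=3)
first = op.on_batch([(0.1, 1.0), (0.2, 2.0)])     # 2 records: still buffering
second = op.on_batch([(0.3, 3.0), (0.4, 6.0)])    # 4 records > 3: fit on all four
third = op.on_batch([(0.9, None)])                # prediction only
```

With a buffer size of 0, the same logic fits on the first batch, which matches the bufferSize description in the options table.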


The following spec.q files outline the use of the functionality described above.

Example 1: Fit a decision tree regression model on data and store model in local registry.

// Generate packet of data
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);

// Running a pipeline[`publish][`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"DTModel")]

First we pass a batch of data to the stream processor to fit the model

q)publish data
We can see that the model is saved by calling the get model store function.
Then we can retrieve predictions by passing new data
q)publish ([]5?1f;5?1f;5?1f;y:0n)
                             | x         x1          x2        y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248   0.591584    0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808    0.3408518   0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502   0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611   0.2372084   0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186   0.1082126   0.9550528

Gradient Boosting Regressor.

[X;y;udf]
[X;y;udf;.qsp.use (!) . flip (
    (`loss           ; loss);
    (`learningRate   ; learningRate);
    (`nEstimators    ; nEstimators);
    (`minSamplesSplit; minSamplesSplit);
    (`minSamplesLeaf ; minSamplesLeaf);
    (`maxDepth       ; maxDepth);
    (`registry       ; registry);
    (`experiment     ; experiment);
    (`model          ; model);
    (`modelInit      ; modelInit);
    (`bufferSize     ; bufferSize))]


name type description default
X symbol or symbol[] or function Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. Required
y symbol or function Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. Required
udf symbol or function Can be the name of the column which is to house the model's predicted target labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. Required


name type description default
loss string Loss function that is optimized using gradient descent to get the best model fit. Can be squared_error, absolute_error, huber which is a combination of squared_error and absolute_error, or quantile which allows for quantile regression (using conditional median). "squared_error"
learningRate float Controls the loss function used to set the weight of each regressor at each boosting iteration. The higher this value, the more each regressor will contribute to our final model. This value depends highly on the maximum number of estimators. Minimum value is 0.0. 0.1
nEstimators int Maximum number of tree estimators to train. Each estimator is fit on the dataset and adjusted to focus on difficult prediction cases. If we already have a perfect fit, we will not create this maximum number. Minimum value is 1. 100
minSamplesSplit int Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is 2. 2
minSamplesLeaf int Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is 1. 1
maxDepth int Maximum depth of the decision tree - measured as the longest path from the tree root to a leaf. If set to ::, the tree will expand until all leaves are pure or contain fewer than minSamplesSplit records. Minimum value is 1. 3
registry string or dict Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
config dict Configuration for fitting the model. ()!()
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. ()!()
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0


type description
table Returns the input data with an additional column containing the model's predicted target values.

When data is fed into this operator via a stream, the underlying model is fit only once the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch that causes the buffered data to exceed n records is included in fitting. If subsequent data is passed to the stream, the operator outputs a prediction for each record using the model fitted on the buffered records.


The following spec.q files outline the use of the functionality described above.

Example 1: Fit a gradient boosting regression model on data and store model in local registry.

// Generate packet of data
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);

// Running a pipeline[`publish][`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"GbModel")]

First we pass a batch of data to the stream processor to fit the model

q)publish data
We can see that the model is saved by calling the get model store function.
Then we can retrieve predictions by passing new data
q)publish ([]5?1f;5?1f;5?1f;y:0n)
                             | x         x1          x2        y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248   0.591584    0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808    0.3408518   0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502   0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611   0.2372084   0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186   0.1082126   0.9550528

k Nearest Neighbors Regressor.

[X;y;udf]
[X;y;udf;.qsp.use (!) . flip (
    (`nNeighbors; nNeighbors);
    (`weights   ; weights);
    (`metric    ; metric);
    (`algorithm ; algorithm);
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`modelInit ; modelInit);
    (`bufferSize; bufferSize))]


name type description default
X symbol or symbol[] or function Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. Required
y symbol or function Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. Required
udf symbol or function Can be the name of the column which is to house the model's predicted target labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. Required


name type description default
nNeighbors int Number of points already labeled or predicted, which lie closest to a given unlabeled point (neighbors), to factor in when predicting a value for the point. Minimum value is 1. 5
metric string The distance metric to be used for the tree. The default metric is minkowski, see here for available metrics. "minkowski"
weights string Weight function used to decide how much weight to give to each of the neighboring points when predicting the target of a point. Can be uniform, to weight each neighbor's target equally, or distance, to weight each neighbor's target based on its distance to the point. "uniform"
algorithm string Algorithm used to search the vector space and decide which points are the nearest neighbors. Can be ball_tree, kd_tree, brute for a brute-force distance search, or auto to choose based on the data. "auto"
registry string or dict Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
config dict Configuration for fitting the model. ()!()
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. ()!()


type description
table Returns the input data with an additional column containing the model's predicted target values.

When data is fed into this operator via a stream, the underlying model is fit only once the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch that causes the buffered data to exceed n records is included in fitting. If subsequent data is passed to the stream, the operator outputs a prediction for each record using the model fitted on the buffered records.


The following spec.q files outline the use of the functionality described above.

Example 1: Fit a k-nearest neighbors regression model on data and store model in local registry.

// Generate packet of data
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);

// Running a pipeline[`publish][`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"k-nearest neighborsModel")]

First we pass a batch of data to the stream processor to fit the model

q)publish data
We can see that the model is saved by calling the get model store function.
Then we can retrieve predictions by passing new data
q)publish ([]5?1f;5?1f;5?1f;y:0n)
                             | x         x1          x2        y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248   0.591584    0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808    0.3408518   0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502   0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611   0.2372084   0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186   0.1082126   0.9550528

Lasso.

[X;y;udf]
[X;y;udf;.qsp.use (!) . flip (
    (`alpha       ; alpha);
    (`fitIntercept; fitIntercept);
    (`maxIter     ; maxIter);
    (`tol         ; tol);
    (`registry    ; registry);
    (`experiment  ; experiment);
    (`model       ; model);
    (`modelInit   ; modelInit);
    (`bufferSize  ; bufferSize))]


name type description default
X symbol or symbol[] or function Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. Required
y symbol or function Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. Required
udf symbol or function Can be the name of the column which is to house the model's predicted target labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. Required


name type description default
alpha float Constant that controls the regularization strength by multiplying the L1 regularization term. Minimum value is 0.0. 1.0
fitIntercept boolean Whether to add a constant value (intercept) to the regression function - c in y=mx+c. 1b
maxIter int Maximum number of iterations before model training is terminated. The model will iterate until it converges or until it completes this number of iterations. Minimum value is 1. 1000
tol float Tolerance value required to stop searching for the global minimum/maximum value. This is achieved once you get close enough to this global value. Minimum value is 0.0. 1e-4
registry string or dict Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
config dict Configurations for fitting the model. ()!()
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. ()!()


type description
table Returns the input data with an additional column containing the model's predicted target values.

When data is fed into this operator via a stream, the underlying model is fit only once the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch that causes the buffered data to exceed n records is included in fitting. If subsequent data is passed to the stream, the operator outputs a prediction for each record using the model fitted on the buffered records.


The following spec.q files outline the use of the functionality described above.

Example 1: Fit a lasso regression model on data and store model in local registry.

// Generate packet of data
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);

// Running a pipeline[`publish][`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"LassoModel")]

First we pass a batch of data to the stream processor to fit the model

q)publish data
We can see that the model is saved by calling the get model store function.
Then we can retrieve predictions by passing new data
q)publish ([]5?1f;5?1f;5?1f;y:0n)
                             | x         x1          x2        y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248   0.591584    0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808    0.3408518   0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502   0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611   0.2372084   0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186   0.1082126   0.9550528

Linear regressor fit using stochastic gradient descent

[X;y;udf]
[X;y;udf;.qsp.use (!) . flip (
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`config    ; config);
    (`trend     ; trend);
    (`alpha     ; alpha);
    (`maxIter   ; maxIter);
    (`gTol      ; gTol);
    (`seed      ; seed);
    (`penalty   ; penalty);
    (`lambda    ; lambda);
    (`l1Ratio   ; l1Ratio);
    (`decay     ; decay);
    (`p         ; p);
    (`bufferSize; bufferSize))]


name type description default
X symbol or symbol[] or function Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. Required
y symbol or function Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. Required
udf symbol or function Can be the name of the column which is to house the model's predicted target labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. Required


name type description default
registry string Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
config any Configuration for fitting the model. ()!()
trend boolean Whether to add a constant value (intercept) to the regression function - c in y=mx+c. 1b
alpha float Learning rate value used in the optimization function to dictate the step size taken towards the minimum of the loss function at each iteration. A high value will override information about previous data more in favor of newly acquired information. Generally, this value is set to be very small. Minimum value is 0.0. 0.01
maxIter long Maximum number of iterations before model training is terminated. The model will iterate until it converges or until it completes this number of iterations. Minimum value is 1. 100
gTol float Tolerance value required to stop searching for the global minimum/maximum value. This is achieved once you get close enough to this global value. Minimum value is 0.0. 1e-5
seed long Integer value used to control the randomness of the model's initialization state. Specifying this allows for reproducible results across function calls. If set to ::, the randomness is based on the current timestamp. 0
penalty symbol Penalty term used to shrink the coefficients of the less contributive variables. Can be l1 to add an L1 penalty term, l2 to add an L2 penalty term, or elasticNet to add both L1 and L2 penalty terms. l2
lambda float Lambda value used to define the strength of the regularization applied. The higher this value is, the stronger the regularization will be. Minimum value is 0.0. 0.001
l1Ratio float If elasticNet is used as the regularization method, this parameter determines the balance between the L1 and L2 penalty terms. If this value is set to 0, this is the same as using L2 regularization, if this value is set to 1, this is the same as using L1 regularization. This value must lie in the range [0.0, 1.0]. 0.5
decay float Describes how much weight to give to historical predictions from previously fit iterations. The higher this value, the less important historic predictions will be. Minimum value is 0.0. 0f
p float Coefficient used to help accelerate the gradient vectors in the right direction, leading to faster convergence. Minimum value is 0.0. 0f
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0

For all common arguments, refer to configuring operators

type description
table Returns the input data with an additional column containing the model's predicted target values.

When data is fed into this operator via a stream, the underlying model is fit only once the number of records received has exceeded the value of the bufferSize parameter (n). All data in the batch that causes the buffered data to exceed n records is included in the fit. As this is an online model, each subsequent batch passed to the stream is used to update the regression model, and a prediction is made for each record. The operator outputs the original data table with the predictions appended.
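The difference between the initial fit and the later online updates can be illustrated with a small Python sketch. It is not the operator's implementation: `sgd_pass` and `OnlineLinearRegressor` are hypothetical names, the model has a single feature and no penalty term, and updating before predicting is an assumption, since the text does not state the order.

```python
def sgd_pass(weights, batch, alpha):
    """One stochastic gradient descent pass over (x, y) pairs for y ~ w0 + w1 * x."""
    w0, w1 = weights
    for x, y in batch:
        error = (w0 + w1 * x) - y     # derivative of 0.5 * error**2 w.r.t. the prediction
        w0 -= alpha * error
        w1 -= alpha * error * x
    return w0, w1


class OnlineLinearRegressor:
    def __init__(self, alpha=0.01, max_iter=100):
        self.alpha = alpha            # learning rate
        self.max_iter = max_iter      # passes over the data in the initial fit
        self.weights = None

    def on_batch(self, batch):
        """Fit on the first (buffered) batch, update on every later one, always predict."""
        if self.weights is None:
            weights = (0.0, 0.0)
            for _ in range(self.max_iter):      # initial fit: repeated passes
                weights = sgd_pass(weights, batch, self.alpha)
            self.weights = weights
        else:
            # Online update: a single pass over the new batch only.
            self.weights = sgd_pass(self.weights, batch, self.alpha)
        w0, w1 = self.weights
        return [w0 + w1 * x for x, _ in batch]


train = [(i / 50, 1.0 + 2.0 * i / 50) for i in range(50)]     # y = 1 + 2x
model = OnlineLinearRegressor(alpha=0.1, max_iter=200)
fitted = model.on_batch(train)
before = model.weights
shifted = [(i / 50, 2.0 + 2.0 * i / 50) for i in range(50)]   # intercept drifts to 2
model.on_batch(shifted)
after = model.weights
```

After the second batch the intercept has moved towards the new data, which is the behavior that distinguishes this operator from the fit-once regressors above.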

Fit, update, and predict with a linear regression model.

// Running a pipeline[`publish][`x;`y;`yHat; .qsp.use `modelArgs`bufferSize!((1b;()!());10000)]

// Data will be buffered for training until the buffer size is reached,
// during which time no batches will be emitted.
publish ([] x:asc 5000?1f; y:asc 5000?1f);

// When the buffer size is reached, buffered data will be used for training,
// and predictions for it will be emitted.
publish ([] x:asc 5000?1f; y:asc 5000?1f);

// The operator can now be used to make predictions.
// As this is an online model, subsequent batches are also used to update the fitted model.
publish ([] x:asc 100?1f; y:asc 100?1f);

Random Forest Regressor.

[X;y;udf]
[X;y;udf;.qsp.use (!) . flip (
    (`nEstimators    ; nEstimators);
    (`criterion      ; criterion);
    (`minSamplesSplit; minSamplesSplit);
    (`minSamplesLeaf ; minSamplesLeaf);
    (`maxDepth       ; maxDepth);
    (`registry       ; registry);
    (`experiment     ; experiment);
    (`model          ; model);
    (`modelInit      ; modelInit);
    (`bufferSize     ; bufferSize))]


name type description default
X symbol or symbol[] or function Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. Required
y symbol or function Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. Required
udf symbol or function Can be the name of the column which is to house the model's predicted target labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. Required


name type description default
nEstimators int Number of decision tree estimators to train and use in the forest. The forest's prediction is the average of the individual trees' predictions. Minimum value is 1. 100
criterion string Criteria function used to measure the quality of a split each time a decision tree node is split into children. This can be squared_error for mean squared error, friedman_mse for mean squared error with Friedman's improvement score, absolute_error for mean absolute error, or poisson for Poisson deviance. "squared_error"
minSamplesSplit int Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is 2. 2
minSamplesLeaf int Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is 1. 1
maxDepth int Maximum depth of the decision tree - measured as the longest path from the tree root to a leaf. If set to ::, the tree will expand until all leaves are pure or contain fewer than minSamplesSplit records. Minimum value is 1. ::
registry string or dict Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
config dict Configuration for fitting the model. ()!()
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. ()!()


type description
table Returns the input data with an additional column containing the model's predicted target values.

When data is fed into this operator via a stream, the underlying model is fit only once the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch that causes the buffered data to exceed n records is included in fitting. If subsequent data is passed to the stream, the operator outputs a prediction for each record using the model fitted on the buffered records.


The following spec.q files outline the use of the functionality described above.

Example 1: Fit a random forest regression model on data and store model in local registry.

// Generate packet of data
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);

// Running a pipeline[`publish][`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"RafrModel")]

First we pass a batch of data to the stream processor to fit the model

q)publish data
We can see that the model is saved by calling the get model store function.
Then we can retrieve predictions by passing new data
q)publish ([]5?1f;5?1f;5?1f;y:0n)
                             | x         x1          x2        y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248   0.591584    0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808    0.3408518   0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502   0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611   0.2372084   0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186   0.1082126   0.9550528

Score the performance of a model

[y;predictions;metric]


name type description default
y symbol or function The column name of the target variable, or a function to generate the target variable from the batch. Required
predictions symbol or function The column name of the predictions, or a function to generate the predictions from the batch. Required
metric symbol The metric on which to evaluate model performance. Required

For all common arguments, refer to configuring operators


type description
any The score given by the metric.

Score the performance of a model over time allowing changes in model performance to be evaluated. The values returned are the cumulative scores, rather than scores for the individual batches.

The following metrics are currently supported:

  • f1
  • accuracy
  • mse
  • rmse


Example 1: This example fits a scikit-learn model, then the pipeline predicts y and calculates the cumulative F1 score of the model on receipt of new data.

// Retrieve a dataset and format appropriately
data: ([] y: y) ,' flip (`$"x",/:string til count first X)!flip X;

// Split data into training and testing set
temp: (floor .8 * count data) cut data;
training: temp 0;
testing : temp 1;

features:flip value flip delete y from training;
targets :training`y;

// Train the model
clf:clf[`max_depth pykw 3];

// Set model within existing registry
    {delete y from x};
    .qsp.use enlist[`model]!enlist"skModel"][`y; `pred; `f1]

publish testing;

Example 2: This example first fits a q model, then the pipeline predicts y and scores the cumulative accuracy on receipt of new data.

// Retrieve a dataset and format appropriately
data: ([] y: y) ,' flip (`$"x",/:string til count first X)!flip X;

// Split the data into training and testing sets
temp: (floor .8 * count data) cut data;
training: temp 0;
testing : temp 1;

features:flip value flip delete y from training;
targets :training`y;

// Train the model[features;targets;1b;::];

// Add the model to the existing registry
    {delete y from x};
    .qsp.use enlist[`model]!enlist"myModel"][`y; `pred; `accuracy]

publish testing

Drops columns with constant values

[X]
[X; .qsp.use enlist[`bufferSize]!enlist bufferSize]


name type description default
X symbol or symbol[] or dictionary or :: Name of the column(s) in the input table to remove because they contain a constant value throughout. Can also be a dictionary mapping column names to their associated constant values whereby only columns with these names and values will be dropped. If set to ::, the operator will be applied to all columns in the data that contain a constant value throughout. Required


name type description default
bufferSize long Number of records to observe before dropping the constant columns from the data. If set to 0, the operator will be applied on the first batch. Minimum value is 0. 0

For all common arguments, refer to configuring operators


type description
table Returns the input data with the constant valued columns no longer in the table.

The columns to be removed from the data are either specified by the user beforehand, through a list or dictionary, or these columns are determined using the .ml.dropConstant function. This function checks the data for columns that contain a constant value throughout. If a non-constant column is supplied, an error is thrown.
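The three ways of specifying X can be sketched in Python as follows. This is an illustration of the described checks, not the .ml.dropConstant implementation: the function names are hypothetical, a table is modeled as a list of dictionaries, and raising an error when a constant differs from the expected value in a dictionary is an assumption.

```python
def constant_columns(rows, names=None):
    """Names of the columns that hold a single distinct value across all rows."""
    names = list(rows[0]) if names is None else list(names)
    return [c for c in names if len({row[c] for row in rows}) == 1]


def drop_constant(rows, spec=None):
    """spec is None (detect columns), a name, a list of names, or {name: expected value}."""
    if isinstance(spec, str):
        spec = [spec]
    if spec is None:
        drop = constant_columns(rows)
    else:
        drop = list(spec)                 # list of names, or the keys of a dict
        constant = constant_columns(rows, drop)
        varying = [c for c in drop if c not in constant]
        if varying:
            raise ValueError("non-constant column(s): " + ", ".join(varying))
        if isinstance(spec, dict):
            # Assumption: a constant that differs from the expected value is an error.
            wrong = [c for c in drop if rows[0][c] != spec[c]]
            if wrong:
                raise ValueError("unexpected constant in: " + ", ".join(wrong))
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


rows = [{"protocol": "TCP", "response": 200, "latency": i * 0.5} for i in range(4)]
detected = drop_constant(rows)                     # drops protocol and response
checked = drop_constant(rows, {"response": 200})   # drops response only
```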

Example 1: Drop the constant columns protocol and response.

[`publish][`protocol`response]

publish ([] protocol: `TCP; response: 200i; latency: 10?5f; size: 10?10000);

Example 2: Drop the columns id and ratio, checking that their values match the expected constant values.

[`publish][`id`ratio!(1; 2f)]

publish ([] id: 1; ratio: 2f; data: 10?10f);

Example 3: Drop columns whose value is constant for all buffered records.

[`publish][::;.qsp.use enlist[`bufferSize]!enlist 100]

publish ([] motorID: 0; rpms: 1000 + 200?10; temp: 60 + 200?5)

(Beta Feature) Encodes categorical data across several numeric columns

Beta Features

To enable beta features, set the environment variable KXI_SP_BETA_FEATURES to true.

[X;n]


name type description default
X symbol or symbol[] Symbol or list of symbols indicating the columns to act on. Required
n long The number of numeric columns used to represent a variable. Required

For all common arguments, refer to configuring operators


type description
table New table with a column for each feature/hash value pair, with the columns specified by X removed.

This operator is used to encode categorical variables numerically. It is similar to one-hot encoding, but does not require the categories or number of categories to be known in advance.

It converts each chosen column into n columns, mapping each string or symbol to its truncated hash value. The hash function employed is the signed 32-bit version of MurmurHash3.

As the mapping between values and their hashed representations is effectively random, collisions are possible, and the hash space must be made large enough to reduce collisions to an acceptable level.
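The hashing trick described above can be sketched in Python. Two things here are assumptions rather than the operator's behavior: `zlib.crc32` is swapped in for MurmurHash3 because the latter is not in the standard library, and each value is encoded as a single 1 in the column chosen by its hash modulo n. The function name `hash_encode` is hypothetical.

```python
import zlib


def hash_encode(values, n):
    """Encode each categorical value as n numbers with a 1 at index hash(value) mod n."""
    rows = []
    for value in values:
        digest = zlib.crc32(str(value).encode("utf-8"))   # stand-in hash, NOT MurmurHash3
        row = [0] * n
        row[digest % n] = 1
        rows.append(row)
    return rows


cities = ["london", "paris", "berlin", "miami", "paris"]
encoded = hash_encode(cities, 10)

# With more categories than columns, at least two categories must share a column.
crowded = hash_encode(["a", "b", "c", "d", "e"], 4)
distinct_rows = len({tuple(row) for row in crowded})
```

No list of categories is needed up front: a value never seen before still lands in one of the n columns, at the cost of possibly colliding with an existing one.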


Example 1: Encode a single categorical column[`publish][`location; 10]

publish ([] location: 20?`london`paris`berlin`miami; num: til 20);

Example 2: Hash multiple columns[`publish][`srcIP`destIP; 14]

IPs: "." sv/: string 4 cut 100?256;
publish ([] srcIP: 100?IPs; destIP: 100?IPs; latency: 100?10; size: 100?10000);

Encodes symbolic columns as numeric data[X][X; .qsp.use enlist[`bufferSize]!enlist bufferSize]


name type description default
X symbol or symbol[] or dictionary or :: Name of the column(s) in the input table whose labels we want to encode. Can also be a dictionary mapping column names to their expected label values whereby only columns with these names and values will be encoded. If set to ::, all categorical columns will be encoded as numeric values. Required


name type description default
bufferSize long Number of records to observe before label encoding the symbol columns in the data. If set to 0, the operator will be applied on the first batch. Minimum value is 0. 0

For all common arguments, refer to configuring operators


type description
table Returns the input data with the symbol columns in the data now having been label encoded as numeric values.

This operator encodes symbolic columns within input data as numeric representations. When data is fed into this operator via a stream, the encoding algorithm is only run once the number of records received has exceeded the value of the bufferSize parameter. Once this happens, the specified symbol columns are encoded and the mapping of each symbol to its encoded number is stored as the operator's state. If new symbols appear in subsequent batches, the state is updated to include them.
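The stateful mapping described above can be sketched in plain q. This is a minimal sketch under stated assumptions: encode is a hypothetical helper, the state is simply the list of symbols seen so far, and a symbol's code is its position in that list. The operator's actual code assignment (for example, whether symbols are sorted first) may differ.

```q
// Hypothetical helper: extend the state with any new symbols,
// then encode each value as its position in the state
encode:{[state;x] state:state union x; (state;state?x)}

r1:encode[`symbol$();`a`b`a`c]
r1 0     // `a`b`c       state after the first batch
r1 1     // 0 1 0 2      encoded first batch

r2:encode[r1 0;`c`d`a]
r2 0     // `a`b`c`d     state grows to include the new symbol d
r2 1     // 2 3 0
```

Existing symbols keep their codes when the state grows, which is what makes encodings comparable across batches.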


Example 1: Encode all symbol columns within the data.[`publish][::]

publish ([]10?`a`b`c;10?`d`e`f;10?1f);

Example 2: Encode symbols in column x.[`publish][`x]

publish ([]10?`a`b`c;10?`d`e`f;10?1f);

Example 3: Encode the symbols in the encoded column with the mapping specified.[`publish][(enlist `encoded)!enlist `small`medium`large]

data: 10?`small`medium`large;
publish ([] original: data; encoded: data);

Apply min-max scaling to streaming data[X][X; .qsp.use (!) . flip (
    (`bufferSize; bufferSize);
    (`rangeError; rangeError))]


name type description default
X symbol or symbol[] or dictionary or :: Name of the column(s) in the input table whose values we want to scale. Can also be a dictionary mapping column names to the minimum and maximum values to use when scaling. If set to ::, all numeric columns will be scaled. Required


name type description default
bufferSize long Number of records to observe before scaling the numeric columns in the data. If set to 0, the operator will be applied on the first batch. Minimum value is 0. 0
rangeError boolean Whether to raise a range error if new input data falls outside the minimum and maximum data range observed during the initialization of the operator. 0b

For all common arguments, refer to configuring operators


type description
table Returns the input data with the numeric columns now being scaled so their values lie between 0 and 1.

This operator scales a set of numeric columns based on a user-supplied data range, or on the minimum and maximum values in the data when the operator is applied. The operator is only applied, and the minimum/maximum values fixed, once the number of data points given to the operator exceeds the value of the bufferSize parameter. The operator can also be configured to raise an error if data supplied after the ranges have been set falls outside them.
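The fit-then-apply behaviour can be sketched in plain q for a single column. This is an illustration under assumptions, not the operator's code: mmFit, mmScale and mmCheck are hypothetical helpers, and mmCheck only mimics the effect of the rangeError option.

```q
// Hypothetical helpers: fix the range once, then reuse it for later batches
mmFit:{[x] `mn`mx!(min x;max x)}
mmScale:{[s;x] (x-s`mn)%s[`mx]-s`mn}

// Hypothetical check mirroring rangeError: signal if a value leaves the fitted range
mmCheck:{[s;x] if[any (x<s`mn) or x>s`mx; '"range"]; x}

s:mmFit 2 4 6 10f
mmScale[s;2 4 6 10f]                 // 0 0.25 0.5 1
mmScale[s;12f]                       // 1.25, above 1 because 12 exceeds the fitted maximum
@[mmCheck[s];12f;{"error: ",x}]      // "error: range"
```

Without a range check, later values outside the fitted range simply scale to values below 0 or above 1.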


Example 1: Apply min-max scaling on all data.[`publish][::]

publish ([]20?5;20?5;20?10)

Example 2: Apply min-max scaling on the specified columns.[`publish][`x`x1]

publish ([]20?5;20?5;20?10)

Example 3: Apply min-max scaling on columns rating and cost, with supplied minimum and maximum values for one column and the other based on a buffer.[`publish][`rating`cost!(0 10;::); .qsp.use enlist[`bufferSize]!enlist 200]

publish ([] rating: 3 + 250?5; cost: 250?1000f)

Example 4: Error when passed batches containing data outside the min-max bounds.[`publish][::;.qsp.use enlist[`rangeError]!enlist 1b]

// As no buffer is specified, the min and max values are fit using the initial batch
publish ([]100?5;100?5;100?10)

// As `rangeError` has been set, this batch will cause an error by exceeding the
// expected maximum values
publish 1+([]100?5;100?5;100?10)

One-hot encodes relevant columns[X][X; .qsp.use enlist[`bufferSize]!enlist bufferSize]


name type description default
X symbol or symbol[] or dictionary or :: Name of the column(s) in the input table to one-hot encode. Can also be a dictionary mapping column names to their expected values whereby only columns with these names and values will be encoded. If set to ::, all categorical columns will be encoded as numeric values. Required


name type description default
bufferSize long Number of records to observe before one-hot encoding the symbol columns in the data. If set to 0, the operator will be applied on the first batch. Minimum value is 0. 0

For all common arguments, refer to configuring operators


type description
table Returns the input data with the symbol columns in the data now each being represented by multiple numeric columns populated by 0s and 1s.

Encodes symbolic and string data as numeric representations. When data is fed into the operator via a stream, the algorithm will only be applied to the data when the number of records received has exceeded the value of the bufferSize parameter. When this happens, the buffered data is one-hot encoded. If subsequent data is passed which contains symbols that were not present at the time of the original fitting, these symbols will be mapped to 0.
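The treatment of unseen symbols can be sketched in plain q. This is a minimal sketch assuming the categories are fixed at fit time; oneHotEnc is a hypothetical helper and the column naming here (one column per category symbol) is an assumption, not the operator's output schema.

```q
// Hypothetical helper: one indicator column per category fixed at fit time
oneHotEnc:{[cats;x] flip cats!`long$x=/:cats}

cats:`a`b`c                    // categories seen while fitting
oneHotEnc[cats;`a`c`d`b]
// a b c
// -----
// 1 0 0
// 0 0 1
// 0 0 0    <- d was not seen at fit time, so every indicator is 0
// 0 1 0
```

Supplying the expected values up front, as in the dictionary form of X, avoids losing categories that happen to be absent from the buffered data.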


Example 1: Encode all the symbolic or string columns.[`publish][::]

publish ([] action: 10?`upload`download; fileType: 10?("image";"audio";"document"); size: 10?100000)

Example 2: Encode column x[`publish][`x]

publish ([] x:10?`a`b`c; y:10?1f)

Example 3: Encode columns x and x1 with a required buffer[`publish][`x`x1;.qsp.use ``bufferSize!(`;200)]

publish ([] 250?`a`b`c; 250?`d`e`f`j; 250?0b)

Example 4: Encode the columns axis and status using given values. This is useful when the categories are known in advance, but may not be present in the training data.[`publish][`axis`status!(`x`y`z; `normal`error)]

publish ([] axis: 100?`x`y`z; status: `normal; position: 100?50f)

Example 5: Encode column axis and status using hybrid method[`publish][`axis`status!(::; `normal`error)]

publish ([] axis: 100?`x`y`z; status: `normal; position: 100?50f)

Apply standardization to streaming data[X][X; .qsp.use enlist[`bufferSize]!enlist bufferSize]


name type description default
X symbol or symbol[] or :: Name of the column(s) in the input table to standardize. If set to ::, all numeric columns will be standardized. Required


name type description default
bufferSize long Number of records to observe before standardizing the numerical columns in the data. If set to 0, the operator will be applied on the first batch. Minimum value is 0. 0

For all common arguments, refer to configuring operators


type description
table Returns the input data with the numeric columns now having a mean value of 0 and a standard deviation of 1.

Standardizes a user-specified set of columns in an input table. When data is fed into this operator via a stream, the algorithm only scales the data once the number of records received has exceeded the value of the bufferSize parameter. Once this happens, the mean and standard deviation of each column are computed. These statistics are then used on subsequent batches, which are normalized by subtracting the mean and dividing the result by the standard deviation.
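The calculation can be sketched in plain q for a single column. This is an illustration under assumptions: stdFit and stdScale are hypothetical helpers, and the population standard deviation (dev) is used here, which may not match the operator's choice.

```q
// Hypothetical helpers: compute the statistics once, then reuse them
stdFit:{[x] `mu`sd!(avg x;dev x)}
stdScale:{[s;x] (x-s`mu)%s`sd}

s:stdFit 2 4 4 4 5 5 7 9f        // mean 5, standard deviation 2
stdScale[s;2 4 4 4 5 5 7 9f]     // -1.5 -0.5 -0.5 -0.5 0 0 1 2
stdScale[s;11f]                  // 3f, a later batch reuses the stored statistics
```

Only the data used for fitting is guaranteed to end up with mean 0 and standard deviation 1; later batches are scaled with the same stored statistics.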


Example 1: Apply standardization to all data[`publish][::]

publish ([]100?5;100?5;100?10)

Example 2: Apply standardization to specified columns.[`publish][`x`x1]

publish ([]100?5;100?5;100?10)

Example 3: Apply standardization to all columns based on a buffer.[`publish][::; .qsp.use enlist[`bufferSize]!enlist 200]

publish ([] length: 100 + 250?2f; width: 10 + 250?1f);

Fit model to batch of data and predict target for future batches[X;y;untrained;modelType;udf][X;y;untrained;modelType;udf; .qsp.use (!) . flip (
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`config    ; config);
    (`modelArgs ; modelArgs);
    (`bufferSize; bufferSize))]


name type description default
X symbol[] or function The predictor variable's column names or a function to generate the predictors from the batch. Required
y symbol or function or :: The target variable's column name or a function to generate the target from the batch. This must be :: when training an unsupervised model. Required
untrained function An untrained q/sklearn model. Required
modelType string Indication as to whether a model is "q" or "sklearn". Required
udf function or symbol A function to score the quality of the model or join predictions into the batch. If this is a symbol, the predictions are appended to the batch as a new column with that name. Required
Functional UDF requirements

The udf parameter for the operator is a function with the following parameters:

    {[data;y;predictions;modelInfo]
        update yhat: predictions from data
        }

name type description
data any The batch passed to the operator, only the data not the metadata.
y symbol | function | :: The target variable, as extracted by the y parameter. In the unsupervised case this is populated with nulls.
predictions list The predictions for each record in the batch.
modelInfo :: Currently unused and always set to ::.
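The two uses of udf named above, joining predictions and scoring, can be tried outside a pipeline by calling the function directly. This is a minimal sketch: joinUdf, scoreUdf, the sample batch and the prediction values are all hypothetical, and the return shape the operator expects from a scoring function is not specified here.

```q
// A UDF that joins the predictions onto the batch as column yhat
joinUdf:{[data;y;predictions;modelInfo] update yhat:predictions from data}

// A UDF that scores the model instead: fraction of predictions matching the target
scoreUdf:{[data;y;predictions;modelInfo] avg y=predictions}

batch:([] x:1 2 3f; y:0 1 0)
joinUdf[batch;batch`y;0 1 1;::]     // batch with an extra yhat column
scoreUdf[batch;batch`y;0 1 1;::]    // 0.6666667
```

Both functions accept all four arguments even though modelInfo is unused, matching the parameter list above.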


name type description default
registry string The registry to load from. ::
experiment string The experiment name. ::
model string The model name in the registry. ::
config any The config parameter for .ml.registry.set.model ()!()
modelArgs list A list of arguments to pass to the model after X and y. ::
bufferSize long Number of records to buffer before training a model. If 0, the model will be fit on the first batch. If the buffer size is exceeded, the additional records in that batch will also be included when training. 0

For all common arguments, refer to configuring operators


type description
any The current batch, modified in accordance with the udf parameter.

Fits a model to a batch or buffer of data, saves the model to the registry, and predicts the target variable for future batches once the model has been trained.

N.B. This is only for models that cannot be trained incrementally. For models that can, registry.update should be used.

Fit a q model on a batch.

// Generate initial data to be used for fitting

// Define optional variables
optVals:(::;::;"sgdLR";(1b; `maxIter`gTol`seed!(100;-0w;42)))

// Define execution pipeline[`publish][
    {delete y from x};
    .qsp.use opt

publish data

// View model stored in registry

Fit an sklearn model.

// Generate initial data to be used for fitting
data:([]x:asc 100?1f;x1:100?1f;y:desc 100?5)

// Populate a random forest classifier expected
rfc:.p.import[`sklearn.ensemble][`:RandomForestClassifier][`max_depth pykw 2]

// Define execution pipeline[`publish][
     {delete y from x};
     {exec y from x};

publish data

Fit an unsupervised model.[`publish][
    .qsp.use enlist[`modelArgs]!enlist(`e2dist;3;::)

publish ([]x:1000?1f;x1:1000?1f;x2:1000?1f)

Predict a target variable using a model[X;udf];[X;udf; .qsp.use (!) . flip (
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`version   ; version))]


name type description default
X symbol[] or function Either the name(s) of the column(s) containing the features, or a user-defined function that returns the feature values to use. Required
udf function or symbol Either the name of the column that will hold the model's predicted class/cluster/target values, or a user-defined function that takes the predictions and processes them, for example by joining them into the batch. Required
Functional UDF requirements

The udf parameter for the operator is a function with the following parameters:

    {[data;y;predictions;modelInfo]
        update yhat: predictions from data
        }

name type description
data any The batch passed to the operator, only the data not the metadata.
y symbol or function or :: The target variable, as extracted by the y parameter.
predictions list The predictions for each record in the batch.
modelInfo :: Currently unused and always set to ::.


name type description default
registry string Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model we want to load is stored under. If set to ::, the model will be loaded from unnamedExperiments. ::
model string Name of the fitted model we want to load in the registry. If set to ::, the most recently uploaded model will be loaded. ::
version float Version of the fitted model we want to load in the registry. If set to ::, the latest version of the model will be loaded. ::

For all common arguments, refer to configuring operators


type description
any Returns the input data with an additional column containing the model's predicted label values for each data point.

This operator predicts the target value for each record in the batch, using a model from the registry.

The user-defined function udf can join these predictions into the data, or do any arbitrary computation. Note that below data is the whole batch, not just those fields extracted by X. Additionally, modelInfo is a catch-all for any model-specific outputs.[X; {[data;y;predictions;modelInfo]
    update temperature: predictions from data
    }; .qsp.use `registry`experiment`model`version!(registry;experiment;model;version)]

In lieu of a user-defined function, this parameter can also be the name of a new column to create, or of an existing column to overwrite.[X;`temperature;


Example 1: Predict using an sklearn model, adding predictions to the initial data.

data:([]x:asc N?1f;x1:desc N?10;x2:N?1f;y:asc N?5)

features:flip value flip delete y from data

clf1:clf1[`max_depth pykw 3];

// Set the model within the existing registry
    {delete y from x};
    .qsp.use enlist[`model]!enlist"skModel"]

publish data

Example 2: Predict using a q model, adding predictions to the initial data.

// Define data for fitting the model

// Fit a model[data`x`x1`x2;`e2dist;6;enlist[`iter]!enlist 1000]

// Set the model within existing registry
    .qsp.use enlist[`model]!enlist"kmeansModel"]

publish data

Train a model incrementally returning predictions for each record in a batch[X;y;udf][X;y;udf; .qsp.use (!) . flip (
    (`registry     ; registry);
    (`experiment   ; experiment);
    (`model        ; model);
    (`version      ; version);
    (`config       ; config);
    (`supervised   ; supervised);
    (`untrained    ; untrained);
    (`modelType    ; modelType);
    (`modelArgs    ; modelArgs))]


name type description default
X symbol[] | function The predictor variable's column names or a function to generate the predictors from the batch. Required
y symbol | function The target variable's column name or a function to generate this from the batch. Required
udf function | symbol A function to score the quality of the model or join predictions into the batch. Required
Functional UDF requirements

The udf parameter for the operator is a function with the following parameters:

    {[data;y;predictions;modelInfo]
        update yhat: predictions from data
        }

name type description
data any The batch passed to the operator, only the data not the metadata.
y symbol or function or :: The target variable, as extracted by the y parameter.
predictions list The predictions for each record in the batch.
modelInfo :: Currently unused and always set to ::.


name type description default
registry string Registry to load/store model from. ::
experiment string Experiment name under which to load/store model. ::
model string Model name. ::
version long[] The version to load. ::
config any Config for storage of the initial fit model. ()!()
supervised boolean Indicates whether the model is supervised. 1b
untrained function | embedpy An untrained ML model. ::
modelType string One of "q" or "sklearn" defining the type of model. ::
modelArgs list A list of arguments to pass to the model after X and y. ::

For all common arguments, refer to configuring operators


type description
any The current batch, modified in accordance with the udf parameter.

Trains a model incrementally, returning predictions for each record in a batch. A user-defined function can be used to join these predictions into the data, or to do any arbitrary computation.

Python support

Currently this functionality is only supported for q models. Support for deployment of online learning models written in Python is scheduled for a later release.


Example 1: Fit an untrained q model which can be updated, adding predictions to the initial data.

// Initialise functionality and data required for running example
    {delete y from x};
    {exec y from x};

publish data;