Machine Learning

.qsp.ml

Fresh
  freshCreate - turns batches of data into features based on aggregated statistics

Classification
  adaBoostClassifier - fits an adaBoost classification model
  decisionTreeClassifier - fits a decision tree classification model
  gaussianNB - fits a Gaussian naive Bayes model
  kNeighborsClassifier - fits a k-nearest neighbors classification model
  logClassifier - fits a logistic classification model using stochastic gradient descent
  quadraticDiscriminantAnalysis - fits a quadratic discriminant analysis model
  randomForestClassifier - fits a random forest classification model

Clustering
  affinityPropagation - fits an affinity propagation clustering model
  birch - fits a BIRCH clustering model
  cure - fits a CURE clustering model
  dbscan - fits a DBSCAN clustering model
  sequentialKMeans - fits a sequential k-means model

Regression
  adaBoostRegressor - fits an adaBoost regression model
  gradientBoostingRegressor - fits a gradient boosting regression model
  kNeighborsRegressor - fits a k-nearest neighbors regression model
  lasso - fits a lasso-linear regression model
  linearRegression - fits a linear regression model
  randomForestRegressor - fits a random forest regression model

Metrics
  score - evaluates a model's predictions

Preprocessing
  dropConstant - drops constant columns from incoming data
  featureHasher - encodes categorical data as numeric vectors
  labelEncode - encodes symbolic data into numerical values
  minMaxScaler - min-max scales a supplied dataset
  oneHot - replaces symbolic values with numerical vector representations
  standardize - standardizes a supplied dataset

Registry
  registry.fit - fits a model to batches of data, saving a model to a registry
  registry.predict - predicts a target variable using a trained model from the registry
  registry.update - trains a model incrementally, returning predictions for all records

Note: All ML operators act solely on unkeyed tables (type 98).

.qsp.ml.freshCreate

Turns batches of data into features using aggregated statistics

.qsp.ml.freshCreate[X;features]
.qsp.ml.freshCreate[X;features;.qsp.use enlist[`warn]!enlist warn]

Parameters:

name type description default
X symbol or symbol[] Name of the column(s) in the data to use for FRESH feature generation. Required
features symbol, symbol[], or :: Name of the FRESH feature(s) we want to define from the data. A full list of these features can be found here. Required

options:

name type description default
warn boolean Whether to show warnings (1b) or suppress them (0b). 0b

For all common arguments, refer to configuring operators.

Returns:

type description
table Returns a table containing the specified aggregated FRESH feature columns for each selected column in the input table.

Converts each chosen column into a collection of feature values based on the supplied FRESH features. Typically, the operator is intended to be used in conjunction with the windowing operators that provide regular batches of data from which we engineer features. The aggregate statistics used to create these features can be as simple as max/min/count.

The features parameter also accepts the following special values:

:: - all features are applied.
noHyperparameters - all features except those that take hyperparameters are applied.
noPython - all features that don't rely on Python are applied.

As this aggregates a batch to a single row of aggregated statistics, the output table does not include the original columns.
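
To make the aggregation concrete, the following is a minimal sketch in plain q of what a single batch is reduced to for the absEnergy and max features. It assumes absEnergy is the sum of squared values (the usual FRESH definition); the batch and the output column names are illustrative and may differ from the names the operator generates.

```q
// Hypothetical batch standing in for one window of data
batch:([] x:1 2 3f);

// Assumption: absEnergy is the sum of squared values; max is the largest value
features:select x_absEnergy:sum x*x, x_max:max x from batch;

// A single row of aggregated statistics; the original x column is not carried over
features   // x_absEnergy is 14f, x_max is 3f
```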

Examples:

The following examples outline the use of the functionality described above.

Example 1: Build two features - absEnergy and max.

// Define and run a stream processor pipeline using the freshCreate operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.window.tumbling[00:01:00; `time]
  .qsp.ml.freshCreate[`x; `absEnergy`max]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to create the fresh features
publish ([] time: .z.p+00:00:01 * til 500; x: 500?1f);

Example 2: Build all features.

// Define and run a stream processor pipeline using the freshCreate operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.window.count[100]
  .qsp.ml.freshCreate[`x; ::]
  .qsp.write.toVariable[`output];

// Pass a batch of data to the stream processor to create the fresh features
publish ([] x: 500?1f; y: 500?100);

.qsp.ml.MLPClassifier

Multi-Layer Perceptron Classifier

.qsp.ml.MLPClassifier[X;y;prediction;.qsp.use (!) . flip (
    (`hiddenLayerSizes; hiddenLayerSizes);
    (`activation      ; activation);
    (`solver          ; solver);
    (`alpha           ; alpha);
    (`batchSize       ; batchSize);
    (`learningRate    ; learningRate);
    (`learningRateInit; learningRateInit);
    (`powerT          ; powerT);
    (`maxIter         ; maxIter);
    (`bufferSize      ; bufferSize);
    (`modelInit       ; modelInit);
    (`model           ; model);
    (`registry        ; registry);
    (`experiment      ; experiment);
    (`config          ; registryConfig))]

Parameters:

name type description default
X symbol, symbol[], function, or :: Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to ::, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to :: and the model's y parameter is a function, an error will occur. Required
y symbol or function Can be the name of the column containing the data's labels OR a user-defined function that returns the target values to use. Required
prediction symbol, function, or :: Can be the name of the column which is to house the model's predicted label values for each data record OR a function which takes the predictions, does something with them, and then assigns them to a variable. If set to ::, the default symbol prediction is used. Required

options:

name type description default
hiddenLayerSizes int[] List of the number of neurons in each hidden layer in the neural network. Minimum size of each layer is 1. enlist 100
activation string Activation function used to transform the output of the hidden layers into a single scalar value. This value can be identity to use a linear activation function, logistic to use a sigmoid activation function, tanh to use a hyperbolic tangent activation function, or relu to use a rectified linear unit function. relu
solver string Optimization function used to search for the inputs that minimize/maximize the results of the model function. This value can be lbfgs to use a limited-memory BFGS, sgd to use stochastic gradient descent, or adam to use adaptive moment estimation. adam
alpha float Strength of the L2 regularization term. The L2 regularization term is divided by the sample size when added to the loss function and is used to reduce the chance of model overfitting. Minimum value is 0.0. 0.0001
batchSize int Number of training examples used in each stochastic optimization iteration. Minimum value is 1. auto
learningRate string Learning rate schedule for updating the weights of the neural network. Only used when the optimization function is set to sgd. This value can be constant for a constant learning rate, optimal for the optimal learning rate, invscaling to use an inverse scaling learning rate, or adaptive for an adaptive learning rate. constant
learningRateInit float Starting learning rate value. This controls the step-size used when updating the neural network weights. Not used when the optimization function is set to lbfgs. Minimum value is 0.0. 0.001
powerT float Exponent used to update the learning rate when the learning rate is set to invscaling and the optimization function is set to sgd. 0.5
maxIter int Maximum number of optimization epochs/iterations. The model will iterate until it converges or until it completes this number of iterations. Minimum value is 1. 200
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. ()!()
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
registry string Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under 'unnamedExperiments'. ::
config any Dictionary used to configure additional settings when saving the model to the registry. ()!()

Returns:

type description
table Returns the input data with an additional column containing the model's predicted class labels.

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions take one of two forms, depending on whether they are passed as the X or y parameter or as the prediction parameter.

Functions for the X and y model parameters take one argument:

name type description
data any Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter take four arguments:

name type description
data any Batch passed to the operator, only the data not the metadata.
y symbol, function, or :: Target variable supplied to the model as the y parameter.
predictions list Model's predictions for each record in the batch.
modelInfo :: Information about the model. Currently not used and always set to ::.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
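
The two function forms above can be tried outside a pipeline. The sketch below uses a small hypothetical table, and the names xFunc, yFunc, and predFunc are illustrative; it only shows what each function returns, not how the operator consumes the result.

```q
// Small hypothetical batch with two feature columns and a boolean target
data:([] x:0.1 0.2 0.3; x1:0.4 0.5 0.6; y:010b);

// X function: one argument, returns a table of feature columns
xFunc:{[data] select x, x1 from data};

// y function: one argument, exec returns the target column as a plain list
yFunc:{[data] exec y from data};

// prediction function: four arguments, appends the predictions as a new column
predFunc:{[data;y;predictions;modelInfo] update newPred:predictions from data};

xFunc data                  // table with columns x and x1
yFunc data                  // 010b
predFunc[data;`y;101b;::]   // data plus a newPred column
```

Here select keeps the result as a table while exec returns a list, which covers both return shapes mentioned above.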

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit, Update and Predict with a multi-layer perceptron classifier model.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);

// Define and run a stream processor pipeline using the MLPClassifier operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.MLPClassifier[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]

We can retrieve predictions from the fitted model by passing new data.

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);

// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
  delete x,x1,x2 from data  // this is the same as 'select y from data' as data only has 4 columns
  };
predFunc: {[data;y;predictions;modelInfo]
  update newPred: predictions from data
  };

// Define and run a stream processor pipeline using the MLPClassifier operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.MLPClassifier[xFunc;yFunc;predFunc]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

.qsp.ml.adaBoostClassifier

AdaBoost Classifier

.qsp.ml.adaBoostClassifier[X;y;prediction;.qsp.use (!) . flip (
    (`nEstimators ; nEstimators);
    (`learningRate; learningRate);
    (`algorithm   ; algorithm);
    (`bufferSize  ; bufferSize);
    (`modelInit   ; modelInit);
    (`model       ; model);
    (`registry    ; registry);
    (`experiment  ; experiment);
    (`config      ; registryConfig))]

Parameters:

name type description default
X symbol, symbol[], function, or :: Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to ::, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to :: and the model's y parameter is a function, an error will occur. Required
y symbol or function Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use. Required
prediction symbol, function, or :: Can be the name of the generated column containing the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to ::, the default symbol prediction is used. Required

options:

name type description default
nEstimators int Maximum number of estimators to train in each boosting iteration. Each estimator is fit on the dataset and adjusted to focus on difficult classification cases. If a perfect fit is reached earlier, fewer estimators than this maximum are created. Minimum value is 1. 50
learningRate float Controls the loss function used to set the weight of each classifier at each boosting iteration. The higher this value, the more each classifier will contribute to our final model. This value depends highly on the maximum number of estimators. Minimum value is 0.0. 1.0
algorithm string Multi-class AdaBoost function used to extend the AdaBoost operator to have multi-class capabilities. This value can be SAMME for stagewise additive modeling or SAMME.R for real-valued stagewise additive modeling. SAMME.R
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For the full list of acceptable arguments see here. ()!()
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
registry string Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
config dict Dictionary used to configure additional settings when saving the model to the registry. ()!()

Returns:

type description
table Returns the input data with an additional column containing the model's predicted class labels.

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions take one of two forms, depending on whether they are passed as the X or y parameter or as the prediction parameter.

Functions for the X and y model parameters take one argument:

name type description
data any Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter take four arguments:

name type description
data any Batch passed to the operator, only the data not the metadata.
y symbol, function, or :: Target variable supplied to the model as the y parameter.
predictions list Model's predictions for each record in the batch.
modelInfo :: Information about the model. Currently not used and always set to ::.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.
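
As a sketch of the buffering behavior described above, the pipeline below sets bufferSize to 1000 and publishes several batches. The batch helper and the batch sizes are hypothetical, and what the operator emits for batches received before the model is fit is not specified here, so no output is shown.

```q
// Fit only once more than 1000 records have been received (illustrative value)
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.adaBoostClassifier[`x`x1;`y;`yHat; .qsp.use enlist[`bufferSize]!enlist 1000]
  .qsp.write.toConsole[];

// Hypothetical helper that generates a batch of n records
batch:{[n] ([] x:n?1f; x1:n?1f; y:n?0b)};

publish batch 600;   // 600 records received: still below the buffer threshold
publish batch 600;   // 1200 records received: the threshold is crossed, so the model can be fit
publish batch 10;    // later batches are scored with the already fitted model
```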

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit, Update and Predict with an adaBoost classification model.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);

// Define and run a stream processor pipeline using the adaBoostClassifier operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.adaBoostClassifier[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]

We can retrieve predictions from the fitted model by passing new data.

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);

// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
  delete x,x1,x2 from data  // this is the same as 'select y from data' as data only has 4 columns
  };
predFunc: {[data;y;predictions;modelInfo]
  update newPred: predictions from data
  };

// Define and run a stream processor pipeline using the adaBoostClassifier operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.adaBoostClassifier[xFunc;yFunc;predFunc]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

.qsp.ml.decisionTreeClassifier

Decision Tree Classifier.

.qsp.ml.decisionTreeClassifier[X;y;prediction;.qsp.use (!) . flip (
    (`criterion      ; criterion);
    (`splitter       ; splitter);
    (`maxDepth       ; maxDepth);
    (`minSamplesSplit; minSamplesSplit);
    (`minSamplesLeaf ; minSamplesLeaf);
    (`bufferSize     ; bufferSize);
    (`modelInit      ; modelInit);
    (`model          ; model);
    (`registry       ; registry);
    (`experiment     ; experiment);
    (`config         ; registryConfig))]

Parameters:

name type description default
X symbol, symbol[], function, or :: Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to ::, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to :: and the model's y parameter is a function, an error will occur. Required
y symbol or function Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use. Required
prediction symbol, function, or :: Can be the name of the generated column containing the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to ::, the default symbol prediction is used. Required

options:

name type description default
criterion string Criteria function used to measure the quality of a split each time a decision tree node is split into children. This can be gini, to use the Gini impurity measure, or entropy, to use the information gain measure. gini
splitter string Strategy used to split the nodes in the tree. This can be best to choose the best split or random to choose the best random split. best
maxDepth int Maximum depth of the decision tree, measured as the longest path from the tree root to a leaf. If set to ::, the tree will expand until all leaves are pure or contain fewer records than the minSamplesSplit value. ::
minSamplesSplit int Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is 2. 2
minSamplesLeaf int Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is 1. 1
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. ()!()
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
registry string or dict Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
config dict Dictionary used to configure additional settings when saving the model to the registry. ()!()

Returns:

type description
table Returns the input data with an additional column containing the model's predicted class labels.

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions take one of two forms, depending on whether they are passed as the X or y parameter or as the prediction parameter.

Functions for the X and y model parameters take one argument:

name type description
data any Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter take four arguments:

name type description
data any Batch passed to the operator, only the data not the metadata.
y symbol, function, or :: Target variable supplied to the model as the y parameter.
predictions list Model's predictions for each record in the batch.
modelInfo :: Information about the model. Currently not used and always set to ::.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit a decision tree classifier model on data and store the model in a local registry.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);

// Define and run a stream processor pipeline using the decisionTreeClassifier operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.decisionTreeClassifier[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"DTCModel")]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

// Call the get model store function to show the model has been saved to the registry.
.ml.registry.get.modelStore["/tmp";::]

We can retrieve predictions from the fitted model by passing new data.
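
A minimal sketch of that follow-up step, assuming the pipeline from Example 1 is still running and publish is the callback it defined. The newData table is hypothetical and simply mirrors the schema of the training data.

```q
// Hypothetical new batch with the same columns as the data used for fitting
newData:([] x:5?1f; x1:5?1f; x2:5?1f; y:5?0b);

// Pass the new batch through the running pipeline; the console output
// should contain the input columns plus the yHat prediction column
publish newData;
```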

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);

// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
  delete x,x1,x2 from data  // this is the same as 'select y from data' as data only has 4 columns
  };
predFunc: {[data;y;predictions;modelInfo]
  update newPred: predictions from data
  };

// Define and run a stream processor pipeline using the decisionTreeClassifier operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.decisionTreeClassifier[xFunc;yFunc;predFunc]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

.qsp.ml.gaussianNB

Gaussian Naive Bayes

.qsp.ml.gaussianNB[X;y;prediction;.qsp.use (!) . flip (
    (`priors      ; priors);
    (`varSmoothing; varSmoothing);
    (`bufferSize  ; bufferSize);
    (`modelInit   ; modelInit);
    (`model       ; model);
    (`registry    ; registry);
    (`experiment  ; experiment);
    (`config      ; registryConfig))]

Parameters:

name type description default
X symbol, symbol[], function, or :: Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to ::, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to :: and the model's y parameter is a function, an error will occur. Required
y symbol or function Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use. Required
prediction symbol, function, or :: Can be the name of the generated column containing the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to ::, the default symbol prediction is used. Required

options:

name type description default
priors float[] List of the prior probabilities for each class. This refers to the probability that a random data record is an instance of the given class before any evidence or other factors are considered. Minimum value for each prior is 0.0. If set to ::, the priors will be adjusted according to the data. ::
varSmoothing float Value added to the Gaussian distribution's variance to widen the curve and account for more samples further away from the distribution's mean. Minimum value is 0. 1e-9
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For the full list of acceptable arguments see here. ()!()
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
registry string Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
config dict Dictionary used to configure additional settings when saving the model to the registry. ()!()
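
As an illustration of how the options above are supplied, the sketch below sets priors and varSmoothing through .qsp.use. The values are arbitrary: the priors assume a two-class target with equal prior probability, and the smoothing value is chosen only to differ from the default.

```q
// Two equally likely classes and a slightly larger smoothing term (illustrative values)
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.gaussianNB[`x`x1;`y;`yHat; .qsp.use `priors`varSmoothing!(0.5 0.5; 1e-8)]
  .qsp.write.toConsole[];

// Hypothetical batch with a boolean (two-class) target
publish ([] x:100?1f; x1:100?1f; y:100?0b);
```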

Returns:

type description
table Returns the input data with an additional column containing the model's predicted class labels.

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions take one of two forms, depending on whether they are passed as the X or y parameter or as the prediction parameter.

Functions for the X and y model parameters take one argument:

name type description
data any Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter take four arguments:

name type description
data any Batch passed to the operator, only the data not the metadata.
y symbol, function, or :: Target variable supplied to the model as the y parameter.
predictions list Model's predictions for each record in the batch.
modelInfo :: Information about the model. Currently not used and always set to ::.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit, Update and Predict with a Gaussian naive Bayes model.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);

// Define and run a stream processor pipeline using the gaussianNB operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.gaussianNB[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]

We can retrieve predictions from the fitted model by passing new data.

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);

// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
  delete x,x1,x2 from data  // this is the same as 'select y from data' as data only has 4 columns
  };
predFunc: {[data;y;predictions;modelInfo]
  update newPred: predictions from data
  };

// Define and run a stream processor pipeline using the gaussianNB operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.gaussianNB[xFunc;yFunc;predFunc]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

.qsp.ml.kNeighborsClassifier

K-Nearest Neighbors Classifier

.qsp.ml.kNeighborsClassifier[X;y;prediction;.qsp.use (!) . flip (
    (`nNeighbors; nNeighbors);
    (`weights   ; weights);
    (`algorithm ; algorithm);
    (`leafSize  ; leafSize);
    (`p         ; p);
    (`metric    ; metric);
    (`bufferSize; bufferSize);
    (`modelInit ; modelInit);
    (`model     ; model);
    (`registry  ; registry);
    (`experiment; experiment);
    (`config    ; registryConfig))]

Parameters:

name type description default
X symbol, symbol[], function, or :: Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to ::, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to :: and the model's y parameter is a function, an error will occur. Required
y symbol or function Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use. Required
prediction symbol, function, or :: Can be the name of the generated column containing the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to ::, the default symbol prediction is used. Required

options:

name type description default
nNeighbors int Number of already classified points, which lie closest to a given unclassified point (neighbors), to factor in when predicting the point's class. Minimum value is 1. 5
weights string Weight function used to decide how much weight to give to the classes of each of the neighboring points when predicting a point's class. Can be uniform to weight each neighbor's class equally or distance to weight each neighbor's class based on its distance to the point. uniform
algorithm string Algorithm used to parse the vector space and decide which points are the nearest neighbors to a given unclassified point. This algorithm can be a ball_tree algorithm, kd_tree algorithm, brute force distance measure approach, or an auto choice based on the data. auto
leafSize int If ball_tree or kd_tree is selected as the algorithm, this is the minimum number of points in a given leaf node, after which a brute-force algorithm is used to find the nearest neighbors. Setting this value either very close to 1 or very close to the total number of points in the data may have a noticeable impact on model runtime. Minimum value is 1. 30
p int Power parameter used when the distance metric minkowski is selected. Minimum value is 0. 2
metric string Distance metric used to measure the distance between points. This value can be minkowski, euclidean, manhattan, etc. minkowski
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For the full list of acceptable arguments see here. ()!()
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
registry string Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
config dict Dictionary used to configure additional settings when saving the model to the registry. ()!()
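To make the roles of nNeighbors, p, and metric concrete, here is a small plain-q sketch of a brute-force nearest-neighbor vote using the Minkowski distance with uniform weights. It is an illustration under stated assumptions, not the operator's implementation: minkowski and knnPredict are hypothetical helper names, and the operator itself delegates to an underlying Python model.

```q
// Minkowski distance of order p between two numeric vectors (illustrative helper)
minkowski:{[p;a;b] (sum abs[a-b] xexp p) xexp 1%p};

// Brute-force k-nearest-neighbors vote with uniform weights (hypothetical sketch)
// k: number of neighbors; p: Minkowski power; X: list of feature rows; y: labels; pt: query point
knnPredict:{[k;p;X;y;pt]
  d:minkowski[p;pt] each X;      // distance from pt to every training row
  nn:k#iasc d;                   // indices of the k closest rows
  votes:count each group y nn;   // class label -> number of neighbors with that label
  first key desc votes           // most common label among the neighbors
  };

X:(0 0f;0 1f;1 0f;5 5f;5 6f;6 5f);
y:`a`a`a`b`b`b;

minkowski[1;0 0f;3 4f]        // p=1 is the Manhattan distance: 7
minkowski[2;0 0f;3 4f]        // p=2 is the Euclidean distance: 5
knnPredict[3;2;X;y;0.2 0.2]   // `a
knnPredict[3;2;X;y;5.5 5.5]   // `b
```

With weights set to distance, each neighbor's vote would instead be scaled by the inverse of its distance, so closer neighbors count for more.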

Returns:

type description
table Returns the input data with an additional column containing the model's predicted class labels.
Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameters or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name type description
data any Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter take four arguments:

name type description
data any Batch passed to the operator, only the data not the metadata.
y symbol, function, or :: Target variable supplied to the model as the y parameter.
predictions list Model's predictions for each record in the batch.
modelInfo :: Information about the model. Currently not used and always set to ::.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
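The two function forms can be tried directly in a q session before wiring them into a pipeline. The sketch below calls them by hand on a toy table; the column names mirror the examples in this section, while the table contents and the prediction vector 011b are made up for illustration and stand in for real model output.

```q
// Toy batch with the same column names as the examples in this section
data:([] x:0.1 0.2 0.3; x1:1 2 3f; x2:4 5 6f; y:010b);

// X function: one argument, returns the feature columns as a table
xFunc:{[data] select x,x1 from data};
// y function: one argument, returns the target values as a list
yFunc:{[data] exec y from data};
// prediction function: four arguments, attaches the predictions under a custom column name
predFunc:{[data;y;predictions;modelInfo] update yHat:predictions from data};

xFunc data                   // table with columns x and x1
yFunc data                   // 010b
predFunc[data;`y;011b;::]    // input table plus a yHat column
```

Calling the functions by hand like this is a quick way to confirm that each one returns the shape expected by the parameter it will be passed as.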

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit, update, and predict with a k-nearest neighbors classification model.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);

// Define and run a stream processor pipeline using the kNeighborsClassifier operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.kNeighborsClassifier[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
We can retrieve predictions using this fit model by passing new data.

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);

// Define the functions that will be passed as the model arguments
xFunc: {[data]
  select x, x1 from data
  };
yFunc: {[data]
  delete x,x1,x2 from data  // this is the same as 'select y from data' as data only has 4 columns
  };
predFunc: {[data;y;predictions;modelInfo]
  update newPred: predictions from data
  };

// Define and run a stream processor pipeline using the kNeighborsClassifier operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.kNeighborsClassifier[xFunc;yFunc;predFunc]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

.qsp.ml.logClassifier

Logistic classifier fit using stochastic gradient descent

.qsp.ml.logClassifier[X;y;prediction]
.qsp.ml.logClassifier[X;y;prediction; .qsp.use (!) . flip (
    (`trend     ; trend);
    (`alpha     ; alpha);
    (`maxIter   ; maxIter);
    (`gTol      ; gTol);
    (`seed      ; seed);
    (`penalty   ; penalty);
    (`lambda    ; lambda);
    (`l1Ratio   ; l1Ratio);
    (`decay     ; decay);
    (`p         ; p);
    (`bufferSize; bufferSize);
    (`model     ; model);
    (`registry  ; registry);
    (`experiment; experiment);
    (`config    ; registryConfig))]

Parameters:

name type description default
X symbol, symbol[], function, or :: Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to ::, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to :: and the model's y parameter is a function, an error will occur. Required
y symbol or function Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use. Required
prediction symbol, function, or :: Can be the name of the generated column containing the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to ::, the default symbol prediction is used. Required

options:

name type description default
trend boolean Whether to add a constant value (intercept) to the classification function - c in y=mx+c. 1b
alpha float Learning rate value used in the optimization function to dictate the step size taken towards the minimum of the loss function at each iteration. A high value will override information about previous data more in favor of newly acquired information. Generally, this value is set to be very small. Minimum value is 0.0. 0.01
maxIter long Maximum number of iterations before model training is terminated. The model will iterate until it converges or until it completes this number of iterations. Minimum value is 1. 100
gTol float Tolerance value required to stop searching for the global minimum/maximum value. This is achieved once you get close enough to this global value. Minimum value is 0.0. 1e-5
seed long Integer value used to control the randomness of the model's initialization state. Specifying this allows for reproducible results across function calls. If a value is not supplied, the randomness is based off the current timestamp. 0
penalty symbol Penalty term used to shrink the coefficients of the less contributive variables. Can be l1 to add an L1 penalty term, l2 to add an L2 penalty term, or elasticNet to add both L1 and L2 penalty terms. l2
lambda float Lambda value used to define the strength of the regularization applied. The higher this value is, the stronger the regularization will be. Minimum value is 0.0. 0.001
l1Ratio float If Elastic Net is chosen as the regularization method, this parameter determines the balance between the L1 and L2 penalty terms. If this value is set to 0, this is the same as using L2 regularization, if this value is set to 1, this is the same as using L1 regularization. This value must lie in the range [0.0, 1.0]. 0.5
decay float Describes how much weight to give to historical predictions from previously fit iterations. The higher this value, the less important historic predictions will be. Minimum value is 0.0. 0f
p float Coefficient used to help accelerate the gradient vectors in the right direction, leading to faster convergence. Minimum value is 0.0. 0f
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
registry string Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
config any Dictionary used to configure additional settings when saving the model to the registry. ()!()

For all common arguments, refer to configuring operators
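As a rough guide to how trend, alpha, and lambda interact, the following plain-q sketch performs gradient steps for a logistic model with an L2 penalty. It is a simplified illustration, not the operator's implementation: sigmoid and sgdStep are hypothetical helper names, the update uses the whole batch per step, the penalty is also applied to the intercept for brevity, and the decay, p, and l1Ratio options are not modelled.

```q
// Logistic function mapping any real value into (0,1)
sigmoid:{1%1+exp neg x};

// One gradient step on a batch (hypothetical sketch)
// alpha: learning rate; lambda: L2 penalty strength
// X: list of feature rows; y: 0/1 labels as floats; w: weights, intercept first
sgdStep:{[alpha;lambda;X;y;w]
  Xb:1f,/:X;                      // prepend 1 to each row for the intercept (trend=1b)
  p:sigmoid Xb mmu w;             // predicted probabilities
  g:(flip[Xb] mmu p-y)%count y;   // gradient of the mean log-loss
  w-alpha*g+lambda*w              // step against the gradient, with L2 shrinkage
  };

xs:0.1*til 10;
X:enlist each xs;                 // one feature per row
y:`float$xs>0.45;                 // label is 1 for the larger half of xs

w1:sgdStep[0.1;0.001;X;y;0 0f];           // a single step from zero weights
w:sgdStep[0.1;0.001;X;y]/[500;0 0f];      // 500 steps from zero weights
```

A larger alpha moves the weights further on each step, and a larger lambda pulls them more strongly towards zero, which matches the parameter descriptions in the table above.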

Returns:

type description
table Returns the input data with an additional column containing the model's predicted class labels.
Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameters or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name type description
data any Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter take four arguments:

name type description
data any Batch passed to the operator, only the data not the metadata.
y symbol, function, or :: Target variable supplied to the model as the y parameter.
predictions list Model's predictions for each record in the batch.
modelInfo :: Information about the model. Currently not used and always set to ::.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. As this is an online model, if subsequent data is passed to the stream, each new collection of data points will be used to update the classifier model and a prediction will be made for each record.

Performance Limitations

This functionality is not currently encouraged for use in high throughput environments. Prediction times for this function are on the order of milliseconds. Further optimizations are expected in later releases.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit, update, and predict with a logistic classification model.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);

// Define and run a stream processor pipeline using the logClassifier operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.logClassifier[`x;`y;`yHat; .qsp.use `modelArgs`bufferSize!((1b;()!());1000)]
  .qsp.write.toVariable[`output];

// Data will be buffered for training until the buffer size is reached,
// during which time no batches will be emitted.
publish data;

// When the buffer size is reached, buffered data will be used for training,
// and will itself be classified and emitted.
publish data;

// The operator can now be used to make predictions.
// Subsequent data will not be used for training, as the bufferSize has been exceeded.
publish data;
We can retrieve predictions using this fit model by passing new data.

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);

// Define the functions that will be passed as the model arguments
xFunc: {[data]
  select x, x1 from data
  };
yFunc: {[data]
  delete x,x1,x2 from data  // this is the same as 'select y from data' as data only has 4 columns
  };
predFunc: {[data;y;predictions;modelInfo]
  update newPred: predictions from data
  };

// Define and run a stream processor pipeline using the logClassifier operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.logClassifier[xFunc;yFunc;predFunc]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

.qsp.ml.quadraticDiscriminantAnalysis

Quadratic Discriminant Analysis

.qsp.ml.quadraticDiscriminantAnalysis[X;y;prediction;.qsp.use (!) . flip (
    (`priors    ; priors);
    (`bufferSize; bufferSize);
    (`modelInit ; modelInit);
    (`model     ; model);
    (`registry  ; registry);
    (`experiment; experiment);
    (`config    ; registryConfig))]

Parameters:

name type description default
X symbol, symbol[], function, or :: Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to ::, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to :: and the model's y parameter is a function, an error will occur. Required
y symbol or function Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use. Required
prediction symbol, function, or :: Can be the name of the generated column containing the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to ::, the default symbol prediction is used. Required

options:

name type description default
priors float[] List of the prior probabilities for each class. This refers to the probability that a random data record is an instance of the given class before any evidence or other factors are considered. Minimum value for each prior is 0.0. If set to ::, the priors will be adjusted according to the data. ::
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. ()!()
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
registry string Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
config any Dictionary used to configure additional settings when saving the model to the registry. ()!()
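For orientation, the standard textbook form of the quadratic discriminant score is shown below. It is included as background on where the priors enter the decision and is not taken from the operator's source. Here $\pi_k$ is the prior for class $k$, and $\mu_k$ and $\Sigma_k$ are the class mean and covariance estimated from the training data.

```latex
\delta_k(x) = -\tfrac{1}{2}\log\lvert\Sigma_k\rvert
              - \tfrac{1}{2}(x-\mu_k)^{\top}\Sigma_k^{-1}(x-\mu_k)
              + \log\pi_k,
\qquad
\hat{y}(x) = \arg\max_k \delta_k(x)
```

When priors is left as ::, a common choice (assumed here rather than confirmed by this page) is the empirical class frequency $\pi_k = n_k / n$, where $n_k$ is the number of training records in class $k$.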

Returns:

type description
table Returns the input data with an additional column containing the model's predicted class labels.
Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameters or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name type description
data any Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter take four arguments:

name type description
data any Batch passed to the operator, only the data not the metadata.
y symbol, function, or :: Target variable supplied to the model as the y parameter.
predictions list Model's predictions for each record in the batch.
modelInfo :: Information about the model. Currently not used and always set to ::.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit, update, and predict with a quadratic discriminant analysis model.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);

// Define and run a stream processor pipeline using the quadraticDiscriminantAnalysis operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.quadraticDiscriminantAnalysis[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
We can retrieve predictions using this fit model by passing new data.

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);

// Define the functions that will be passed as the model arguments
xFunc: {[data]
  select x, x1 from data
  };
yFunc: {[data]
  delete x,x1,x2 from data  // this is the same as 'select y from data' as data only has 4 columns
  };
predFunc: {[data;y;predictions;modelInfo]
  update newPred: predictions from data
  };

// Define and run a stream processor pipeline using the quadraticDiscriminantAnalysis operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.quadraticDiscriminantAnalysis[xFunc;yFunc;predFunc]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

.qsp.ml.randomForestClassifier

Random Forest Classifier

.qsp.ml.randomForestClassifier[X;y;prediction;.qsp.use (!) . flip (
    (`nEstimators          ; nEstimators);
    (`criterion            ; criterion);
    (`maxDepth             ; maxDepth);
    (`minSamplesSplit      ; minSamplesSplit);
    (`minSamplesLeaf       ; minSamplesLeaf);
    (`minWeightFractionLeaf; minWeightFractionLeaf);
    (`maxFeatures          ; maxFeatures);
    (`maxLeafNodes         ; maxLeafNodes);
    (`minImpurityDecrease  ; minImpurityDecrease);
    (`bootstrap            ; bootstrap);
    (`bufferSize           ; bufferSize);
    (`modelInit            ; modelInit);
    (`model                ; model);
    (`registry             ; registry);
    (`experiment           ; experiment);
    (`config               ; registryConfig))]

Parameters:

name type description default
X symbol, symbol[], function, or :: Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to ::, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to :: and the model's y parameter is a function, an error will occur. Required
y symbol or function Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. Required
prediction symbol, function, or :: Can be the name of the generated column containing the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to ::, the default symbol prediction is used. Required

options:

name type description default
nEstimators int Number of decision trees to train in the forest. The predictions of the individual trees are combined to produce the final class label. Minimum value is 1. 100
criterion string Criteria function used to measure the quality of a split each time a decision tree node is split into children. This can be gini to use the Gini impurity measure or entropy to use the information gain measure. gini
maxDepth int Maximum depth of the decision tree - measured as the longest path from the tree root to a leaf. If set to ::, the tree will expand until all leaves are pure or contain fewer than minSamplesSplit records. Minimum value is 1. ::
minSamplesSplit int Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is 2. 2
minSamplesLeaf int Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is 1. 1
minWeightFractionLeaf float Minimum proportion of sample weight required to be at any leaf node relative to the total weight of all samples in the tree. When the sample_weight argument is not set using the modelInit parameter, each sample carries equal weight. This value must lie in the range [0.0, 1.0]. 0.0
maxFeatures string Maximum number of features to consider when looking for the best way to split a node. This value can be sqrt for the square root of all features, log2 for log to the base 2 of all features, or auto to automatically select the number of features to consider. auto
maxLeafNodes int Maximum number of leaf nodes in each decision tree. This forces the tree to grow in a best-first fashion with the best nodes based on their relative reduction in impurity. If set to ::, there may be unlimited leaf nodes. Minimum value is 1. ::
minImpurityDecrease float Minimum impurity decrease value required to split a node. If the tree impurity would not decrease by more than this value, the node will not be split. Minimum value is 0.0. 0.0
bootstrap boolean Whether bootstrap samples are used when building trees. If 0b, the whole dataset is used to build each tree. 1b
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. ()!()
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
registry string Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
config any Dictionary used to configure additional settings when saving the model to the registry. ()!()
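The two criterion choices and the bootstrap option can be illustrated with a few lines of plain q. These are textbook definitions written as hypothetical helpers (props, gini, entropy, bootstrapSample) and are not the operator's internals, which rely on an underlying Python model.

```q
// Class proportions of a label vector
props:{[y] (count each group y)%count y};
// Gini impurity: 1 minus the sum of squared class proportions
gini:{[y] p:props y; 1-sum p*p};
// Entropy in bits: minus the sum of p*log2 p over the classes
entropy:{[y] p:props y; neg sum p*2 xlog p};
// Bootstrap sample: draw as many rows as the table has, with replacement
bootstrapSample:{[t] n:count t; t[n?n]};

gini `a`a`b`b       // 0.5, an evenly mixed node
entropy `a`a`b`b    // 1 bit, an evenly mixed node
gini `a`a`a`a       // 0, a pure node

t:([] x:1 2 3 4f; y:`a`a`b`b);
bootstrapSample t   // same row count as t, rows may repeat
```

A split is preferred when it lowers the chosen impurity measure in the child nodes relative to the parent; minImpurityDecrease sets the smallest such reduction that is still worth a split.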

Returns:

type description
table Returns the input data with an additional column containing the model's predicted class labels.
Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameters or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name type description
data any Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter take four arguments:

name type description
data any Batch passed to the operator, only the data not the metadata.
y symbol, function, or :: Target variable supplied to the model as the y parameter.
predictions list Model's predictions for each record in the batch.
modelInfo :: Information about the model. Currently not used and always set to ::.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit, update, and predict with a random forest classification model.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);

// Define and run a stream processor pipeline using the randomForestClassifier operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.randomForestClassifier[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]
We can retrieve predictions using this fit model by passing new data.

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);

// Define the functions that will be passed as the model arguments
xFunc: {[data]
  select x, x1 from data
  };
yFunc: {[data]
  delete x,x1,x2 from data  // this is the same as 'select y from data' as data only has 4 columns
  };
predFunc: {[data;y;predictions;modelInfo]
  update newPred: predictions from data
  };

// Define and run a stream processor pipeline using the randomForestClassifier operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.randomForestClassifier[xFunc;yFunc;predFunc]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

.qsp.ml.affinityPropagation

Affinity Propagation Clustering Algorithm.

.qsp.ml.affinityPropagation[X;cluster;.qsp.use (!) . flip (
    (`damping        ; damping);
    (`maxIter        ; maxIter);
    (`convergenceIter; convergenceIter);
    (`affinity       ; affinity);
    (`randomState    ; randomState);
    (`bufferSize     ; bufferSize);
    (`modelInit      ; modelInit);
    (`model          ; model);
    (`registry       ; registry);
    (`experiment     ; experiment);
    (`config         ; registryConfig))]

Parameters:

name type description default
X symbol, symbol[], function, or :: Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to ::, the operator tries to infer the feature columns by using all non-categorical columns. Required
cluster symbol, function, or :: Can be the name of the generated column containing the model's predicted cluster labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to ::, the default symbol cluster is used. Required

options:

name type description default
damping float Provides numerical stabilization and limits oscillations and “overshooting” of parameters by controlling the extent to which the current value is maintained relative to incoming values. This value must lie in the range [0.5, 1.0). 0.5
maxIter int Maximum number of iterations before model training is terminated. The model will iterate until it converges or until it completes this number of iterations. Minimum value is 1. 200
convergenceIter int Number of iterations, during which there is no change in the number of estimated clusters, needed to stop the convergence. Minimum value is 1. 15
affinity string Statistical measure used to define similarities between the representative points. This value can be euclidean to use negative squared Euclidean distance or precomputed to use the values in the data's distance matrix. euclidean
randomState int Integer value used to control the state of the random generator used in this model. Specifying this allows for reproducible results across function calls. If set to ::, the randomness is based off the current timestamp. ::
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. ()!()
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
registry string or dict Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
config dict Dictionary used to configure additional settings when saving the model to the registry. ()!()
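
The effect of the damping option can be pictured with a minimal sketch in plain q. It assumes the usual affinity propagation convention, where each updated value is a weighted blend of the previous value and the incoming one; `damp` is a hypothetical helper for illustration and is not part of `.qsp.ml`.

```q
// Damped update: keep a fraction `lambda` of the old value and take the
// remainder from the incoming value. Hypothetical helper, for illustration only.
damp:{[lambda;old;new] (lambda*old)+(1-lambda)*new};

old:1 2f;
new:3 6f;
half:damp[0.5;old;new];    / default damping: halfway between old and incoming
heavy:damp[0.9;old;new];   / heavier damping: moves less far from the old value
```

Values closer to 1.0 change more slowly between iterations, which is why larger damping limits oscillation at the cost of slower convergence.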

Returns:

type description
table Returns the input data with an additional column containing the model's predicted cluster labels.

Passing functions as the values for the model parameters

Functions can be passed as the value for the X or cluster model parameters. These functions take different forms depending on whether they are values for the X parameter or for the cluster parameter.

Functions for the X model parameter take one argument:

name type description
data any Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the cluster model parameter take three arguments:

name type description
data any Batch passed to the operator, only the data not the metadata.
clusters list Model's predictions for each record in the batch.
modelInfo :: Information about the model. Currently not used and always set to ::.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;clusters;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
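
The two function forms can be tried outside a pipeline on a small table. This is a sketch in plain q: the three-row table and the cluster labels `0 0 1` are made-up stand-ins for a published batch and for the model's output.

```q
// Small stand-in batch; in a pipeline this is the table passed to the operator
data:([] x:0.1 0.2 0.9; x1:0.3 0.1 0.8; x2:0.5 0.5 0.5);

// X function: one argument, returns the feature columns to cluster on
xFunc:{[data] select x, x1 from data};

// cluster function: three arguments, returns the table with labels attached
clustFunc:{[data;clusters;modelInfo] update newClust:clusters from data};

feats:xFunc data;                / table with columns x and x1 only
out:clustFunc[data;0 0 1;::];    / input columns plus newClust
```

Calling the functions directly like this is a quick way to check their output shape before passing them to the operator.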

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator outputs the original data table together with clusters added.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit an affinityPropagation clustering model, storing the result in a registry.

// Generate initial data to be used for fitting
n:100000;
data:([]x:n?1f;x1:n?1f;x2:n?1f);

// Define and run a stream processor pipeline using the affinityPropagation operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.affinityPropagation[`x`x1`x2;`cluster; .qsp.use `registry`model!("/tmp";"AffinityPropagationModel")]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]

// We can retrieve predictions using this fit model by passing new data
publish ([]5?1f;5?1f;5?1f)
                             | x           x1         x2        cluster
-----------------------------| ----------------------------------------
2022.03.01D09:26:44.376050100| 0.3065473   0.7141816  0.5130882 1
2022.03.01D09:26:44.376050100| 0.5817309   0.6165058  0.2164453 0
2022.03.01D09:26:44.376050100| 0.004154821 0.8229675  0.514663  1
2022.03.01D09:26:44.376050100| 0.7639509   0.07025696 0.1601784 0
2022.03.01D09:26:44.376050100| 0.3417209   0.59064    0.6708373 1

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);

// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
clustFunc: {[data;clusters;modelInfo]
  update newClust: clusters from data
  };

// Define and run a stream processor pipeline using the affinityPropagation operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.affinityPropagation[xFunc;clustFunc]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

.qsp.ml.birch

Birch Clustering Algorithm.

.qsp.ml.birch[X;cluster;.qsp.use (!) . flip (
    (`threshold      ; threshold);
    (`branchingFactor; branchingFactor);
    (`nClusters      ; nClusters);
    (`bufferSize     ; bufferSize);
    (`modelInit      ; modelInit);
    (`model          ; model);
    (`registry       ; registry);
    (`experiment     ; experiment);
    (`config         ; config))]

Parameters:

name type description default
X symbol, symbol[], function, or :: Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to ::, the operator tries to infer the feature columns by using all non-categorical columns. Required
cluster symbol, ::, or function Can be the name of the generated column containing the model's predicted cluster labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to ::, the default symbol cluster is used. Required

options:

name type description default
threshold float Maximum cluster radius allowed for a new sample to be merged into its closest subcluster. If adding this point to the cluster would cause that cluster's radius to exceed this maximum, the new point is not added and instead becomes a new subcluster. Minimum value is 0.0. 0.5
branchingFactor int Maximum number of subclusters in each node in the tree, where each leaf node contains a subcluster. If a new sample arrives causing the number of subclusters to exceed this value for a given node, the node is split into two nodes. Minimum value is 1. 50
nClusters int Final number of clusters to be defined by the model. Minimum value is 2. 3
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. ()!()
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
registry string or dict Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
config dict Configuration for fitting the model. ()!()
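
As a sketch of how several of these options might be combined, the pipeline below sets a tighter threshold, a smaller branching factor, four final clusters, and a buffer of 1000 records. The values are arbitrary, and the `i` suffixes assume the int-typed options expect q ints.

```q
// Birch pipeline with non-default options (illustrative values only)
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.birch[`x`x1`x2; `cluster;
    .qsp.use `threshold`branchingFactor`nClusters`bufferSize!(0.3; 25i; 4i; 1000)]
  .qsp.write.toConsole[];
```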

Returns:

type description
table Returns the input data with an additional column containing the model's predicted cluster labels.

Passing functions as the values for the model parameters

Functions can be passed as the value for the X or cluster model parameters. These functions take different forms depending on whether they are values for the X parameter or for the cluster parameter.

Functions for the X model parameter take one argument:

name type description
data any Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the cluster model parameter take three arguments:

name type description
data any Batch passed to the operator, only the data not the metadata.
clusters list Model's predictions for each record in the batch.
modelInfo :: Information about the model. Currently not used and always set to ::.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;clusters;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator outputs the original data table together with clusters added.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit a Birch clustering model storing the result in a registry.

// Generate initial data to be used for fitting
n:100000;
data:([]x:n?1f;x1:n?1f;x2:n?1f);

// Define and run a stream processor pipeline using the birch operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.birch[`x`x1`x2;`cluster; .qsp.use `registry`model!("/tmp";"BirchModel")]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]

// We can retrieve predictions using this fit model by passing new data
publish ([]5?1f;5?1f;5?1f)
                             | x           x1         x2        cluster
-----------------------------| ----------------------------------------
2022.03.01D09:26:44.376050100| 0.3065473   0.7141816  0.5130882 1
2022.03.01D09:26:44.376050100| 0.5817309   0.6165058  0.2164453 0
2022.03.01D09:26:44.376050100| 0.004154821 0.8229675  0.514663  1
2022.03.01D09:26:44.376050100| 0.7639509   0.07025696 0.1601784 0
2022.03.01D09:26:44.376050100| 0.3417209   0.59064    0.6708373 1

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);

// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
clustFunc: {[data;clusters;modelInfo]
  update newClust: clusters from data
  };

// Define and run a stream processor pipeline using the birch operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.birch[xFunc;clustFunc]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

.qsp.ml.cure

CURE Clustering Algorithm.

.qsp.ml.cure[X;cluster;.qsp.use (!) . flip (
    (`df        ; df);
    (`n         ; n);
    (`c         ; c);
    (`cutDict   ; cutDict);
    (`bufferSize; bufferSize);
    (`model     ; model);
    (`registry  ; registry);
    (`experiment; experiment);
    (`config    ; registryConfig))]

Parameters:

name type description default
X symbol, symbol[], function, or :: Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to ::, the operator tries to infer the feature columns by using all non-categorical columns. Required
cluster symbol, function, or :: Can be the name of the generated column containing the model's predicted cluster labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to ::, the default symbol cluster is used. Required

options:

name type description default
df symbol Distance function used to measure the distance between points when clustering. This can be edist for Euclidean distance, e2dist for squared Euclidean distance, nege2dist for negative squared Euclidean distance, mdist for Manhattan distance, or cshev for Chebyshev distance. edist
n int Number of representative points to choose from each cluster to compare the similarity of clusters for the purposes of potentially merging them. Minimum value is 1. 2
c float Compression factor used for grouping the representative points together. Minimum value is 0.0. 0.0
k int Final number of clusters to be defined by the model. Minimum value is 2. The distance used when cutting the dendrogram will be adjusted to fit this number so only specify one of the parameters k or dist. If set to ::, the dist parameter will be used. If both are set to ::, the cutDict parameter will be used. ::
dist float Distance between leaves at which to cut the dendrogram to define the clusters. Minimum value is 0.0. The number of clusters will be dynamic based on this distance so only specify one of the parameters k or dist. If set to ::, the k parameter will be used. If both are set to ::, the cutDict parameter will be used. ::
cutDict dict A dictionary that defines the cutting algorithm used when splitting the data into clusters. This can be used to define a k value or a dist value (documentation for these above). enlist[`k]!enlist 3
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
registry string or dict Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
config dict Dictionary used to configure additional settings when saving the model to the registry. ()!()
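
The sketch below shows one way the cutting behaviour might be switched from a fixed cluster count to a distance threshold. It assumes that cutDict accepts a `dist` key in the same shape as the default `k` dictionary; the option values are arbitrary.

```q
// CURE pipeline using Manhattan distance, five representative points per cluster,
// and a distance-based cut (assumes cutDict accepts a `dist key)
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.cure[`x`x1`x2; `cluster;
    .qsp.use `df`n`c`cutDict!(`mdist; 5i; 0.2; enlist[`dist]!enlist 0.5)]
  .qsp.write.toConsole[];
```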

Returns:

type description
table Returns the input data with an additional column containing the model's predicted cluster labels.

Passing functions as the values for the model parameters

Functions can be passed as the value for the X or cluster model parameters. These functions take different forms depending on whether they are values for the X parameter or for the cluster parameter.

Functions for the X model parameter take one argument:

name type description
data any Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the cluster model parameter take three arguments:

name type description
data any Batch passed to the operator, only the data not the metadata.
clusters list Model's predictions for each record in the batch.
modelInfo :: Information about the model. Currently not used and always set to ::.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;clusters;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator outputs the original data table together with clusters added.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit a cure clustering model storing the result in a registry.

// Generate initial data to be used for fitting
n:100000;
data:([]x:n?1f;x1:n?1f;x2:n?1f);

// Define and run a stream processor pipeline using the cure operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.cure[`x`x1`x2;`cluster; .qsp.use `registry`model!("/tmp";"cureModel")]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]

// We can retrieve predictions using this fit model by passing new data
publish ([]5?1f;5?1f;5?1f)
                             | x           x1         x2        cluster
-----------------------------| ----------------------------------------
2022.03.01D09:26:44.376050100| 0.3065473   0.7141816  0.5130882 1
2022.03.01D09:26:44.376050100| 0.5817309   0.6165058  0.2164453 0
2022.03.01D09:26:44.376050100| 0.004154821 0.8229675  0.514663  1
2022.03.01D09:26:44.376050100| 0.7639509   0.07025696 0.1601784 0
2022.03.01D09:26:44.376050100| 0.3417209   0.59064    0.6708373 1

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);

// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
clustFunc: {[data;clusters;modelInfo]
  update newClust: clusters from data
  };

// Define and run a stream processor pipeline using the cure operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.cure[xFunc;clustFunc]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

.qsp.ml.dbscan

DBSCAN Clustering Algorithm.

.qsp.ml.dbscan[X;cluster;.qsp.use (!) . flip (
    (`df        ; df);
    (`minPts    ; minPts);
    (`eps       ; eps);
    (`bufferSize; bufferSize);
    (`model     ; model);
    (`registry  ; registry);
    (`experiment; experiment);
    (`config    ; config))]

Parameters:

name type description default
X symbol, symbol[], function, or :: Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to ::, the operator tries to infer the feature columns by using all non-categorical columns. Required
cluster symbol, function, or :: Can be the name of the generated column containing the model's predicted cluster labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to ::, the default symbol cluster is used. Required

options:

name type description default
df symbol Distance function used to measure the distance between points when clustering. This can be edist for Euclidean distance, e2dist for squared Euclidean distance, nege2dist for negative squared Euclidean distance, mdist for Manhattan distance, or cshev for Chebyshev distance. edist
minPts int Minimum number of points required to be close together before this group of points is defined as a cluster. The distance between these points must be less than or equal to the eps parameter. Minimum value is 1. 2
eps float Maximum distance points are allowed to be away from one another to still be classed as close enough to be in the same cluster. Minimum value is 0.0. 1.0
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
registry string or dict Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
config dict Configuration for fitting the model. ()!()
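
How minPts and eps interact can be illustrated with a small sketch in plain q. It is not the operator's implementation: `edist` here is a hypothetical helper standing in for the edist option, and the sketch assumes a point counts as its own neighbour when comparing against minPts.

```q
// Euclidean distance between two points (hypothetical helper)
edist:{[a;b] sqrt sum d*d:a-b};

pts:(0 0f;0 0.1;0.1 0f;5 5f);   / three points close together and one far away
eps:0.2;                        / neighbourhood radius
minPts:2;                       / neighbours required, counting the point itself

dm:edist/:\:[pts;pts];          / pairwise distance matrix
nb:sum each dm<=eps;            / number of points within eps of each point
core:nb>=minPts;                / 1110b: the isolated point is not a core point
```

The first three points lie within eps of one another and so can seed a cluster, while the fourth has only itself in its neighbourhood.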

Returns:

type description
table Returns the input data with an additional column containing the model's predicted cluster labels.

Passing functions as the values for the model parameters

Functions can be passed as the value for the X or cluster model parameters. These functions take different forms depending on whether they are values for the X parameter or for the cluster parameter.

Functions for the X model parameter take one argument:

name type description
data any Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the cluster model parameter take three arguments:

name type description
data any Batch passed to the operator, only the data not the metadata.
clusters list Model's predictions for each record in the batch.
modelInfo :: Information about the model. Currently not used and always set to ::.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;clusters;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator outputs the original data table together with clusters added.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit a dbscan clustering model storing the result in a registry.

// Generate initial data to be used for fitting
n:100000;
data:([]x:n?1f;x1:n?1f;x2:n?1f);

// Define and run a stream processor pipeline using the dbscan operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.dbscan[`x`x1`x2;`cluster; .qsp.use `registry`model!("/tmp";"dbscanModel")]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]

// We can retrieve predictions using this fit model by passing new data
publish ([]5?1f;5?1f;5?1f)
                             | x           x1         x2        cluster
-----------------------------| ----------------------------------------
2022.03.01D09:26:44.376050100| 0.3065473   0.7141816  0.5130882 1
2022.03.01D09:26:44.376050100| 0.5817309   0.6165058  0.2164453 0
2022.03.01D09:26:44.376050100| 0.004154821 0.8229675  0.514663  1
2022.03.01D09:26:44.376050100| 0.7639509   0.07025696 0.1601784 0
2022.03.01D09:26:44.376050100| 0.3417209   0.59064    0.6708373 1

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);

// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
clustFunc: {[data;clusters;modelInfo]
  update newClust: clusters from data
  };

// Define and run a stream processor pipeline using the dbscan operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.dbscan[xFunc;clustFunc]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

.qsp.ml.sequentialKMeans

Sequential K-Means Clustering Algorithm.

.qsp.ml.sequentialKMeans[X;cluster]
.qsp.ml.sequentialKMeans[X;cluster; .qsp.use (!) . flip (
    (`df        ; df);
    (`k         ; k);
    (`centers   ; centers);
    (`init      ; init);
    (`alpha     ; alpha);
    (`forgetful ; forgetful);
    (`bufferSize; bufferSize);
    (`model     ; model);
    (`registry  ; registry);
    (`experiment; experiment);
    (`config    ; config))]

Parameters:

name type description default
X symbol, symbol[], function, or :: Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to ::, all non-categorical columns are extracted. Required
cluster symbol or :: Name of the column which is to house the model's predicted cluster labels. If set to ::, the default symbol cluster is used. Required

options:

name type description default
df symbol Distance function used to measure the distance between points when clustering. This can be edist for Euclidean distance, e2dist for squared Euclidean distance, nege2dist for negative squared Euclidean distance, mdist for Manhattan distance, or cshev for Chebyshev distance. edist
k long Final number of clusters to be defined by the model. Minimum value is 2. 3
centers dictionary or :: A dictionary mapping each cluster to the cluster centroid value that we want these clusters to initialize with. ::
init bool Initialization method for the cluster centroids. This value can either be K-means++ (1b) or randomized initialization (0b). 1b
alpha float Controls the rate at which the concept of forgetfulness is applied within the algorithm. If forgetful Sequential K-Means is applied, this value defines how much past cluster centroid information is retained; if not, it is set to 1/(n+1), where n is the number of points in a given cluster. This value must lie in the range [0.0, 1.0]. 0.1
forgetful bool Whether to apply forgetful Sequential K-Means (1b) or normal Sequential K-Means (0b). Forgetful Sequential K-Means will allow the model to evolve its cluster boundaries over time by forgetting about old data as new data comes in. 1b
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
registry string Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
config any Configuration used for fitting the model. ()!()

For all common arguments, refer to configuring operators

Returns:

type description
table or :: Null during initial fitting. Afterwards returns the input data with an additional column containing the model's predicted cluster labels.

Passing functions as the values for the model parameters

Functions can be passed as the value for the X or cluster model parameters. These functions take different forms depending on whether they are values for the X parameter or for the cluster parameter.

Functions for the X model parameter take one argument:

name type description
data any Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the cluster model parameter take three arguments:

name type description
data any Batch passed to the operator, only the data not the metadata.
clusters list Model's predictions for each record in the batch.
modelInfo :: Information about the model. Currently not used and always set to ::.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;clusters;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

The sequential K-Means algorithm is applied within a streaming framework. When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. As this is an online model, if subsequent data is passed to the stream, each new collection of data points is used to update the current cluster centers, and predictions are made as to which cluster each point belongs to.
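
The way new points update a cluster center can be sketched in plain q using the textbook sequential k-means rule. This is an illustration rather than the operator's own code: `step` is a hypothetical helper, and it assumes alpha is the weight given to the incoming point, which is consistent with the 1/(n+1) value used in the non-forgetful case.

```q
// Move a center towards a new point by a fraction alpha (hypothetical helper)
step:{[c;x;alpha] c+alpha*x-c};

// Forgetful update: alpha is fixed, so the influence of older points fades
forgetfulC:step[0 0f;1 1f;0.1];

// Standard update: alpha is 1/(n+1), so the center remains the running mean
n:3;                              / points already assigned to this cluster
meanC:step[2 4f;6 8f;1%n+1];      / 3 5f if 2 4 was the mean of three points
```

With a fixed alpha the center keeps tracking recent data, whereas with 1/(n+1) each new point has less influence as the cluster grows.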

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit, update, and predict with the sequential K-Means model.

// Define and run a stream processor pipeline using the sequentialKMeans operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.sequentialKMeans[`x`x1`x2; `cluster; .qsp.use enlist[`bufferSize]!enlist 100]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish ([]100?1f;100?1f;100?1f);

// Now that the bufferSize has been reached, we can retrieve predictions using this fit model by passing new data
publish ([] 50?1f; 50?1f; 50?1f);

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);

// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
clustFunc: {[data;clusters;modelInfo]
  update newClust: clusters from data
  };

// Define and run a stream processor pipeline using the sequentialKMeans operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.sequentialKMeans[xFunc;clustFunc]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

.qsp.ml.adaBoostRegressor

AdaBoost Regressor.

.qsp.ml.adaBoostRegressor[X;y;prediction]
.qsp.ml.adaBoostRegressor[X;y;prediction;.qsp.use (!) . flip (
    (`nEstimators ; nEstimators);
    (`learningRate; learningRate);
    (`loss        ; loss);
    (`modelInit   ; modelInit);
    (`bufferSize  ; bufferSize);
    (`model       ; model);
    (`registry    ; registry);
    (`experiment  ; experiment);
    (`config      ; registryConfig))]

Parameters:

name type description default
X symbol, symbol[], function, or :: Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to ::, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to :: and the model's y parameter is a function, an error will occur. Required
y symbol or function Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. Required
prediction symbol, function, or :: Can be the name of the generated column containing the model's predicted target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to ::, the default symbol prediction is used. Required

options:

name type description default
nEstimators int Maximum number of estimators to train in each boosting iteration. Each estimator is fit on the dataset and adjusted to focus on difficult prediction cases. If we already have a perfect fit, we will not create this maximum number. Minimum value is 1. 50
learningRate float Weight applied to each regressor at each boosting iteration. The higher this value, the more each regressor will contribute to our final model. This value depends highly on the maximum number of estimators. This value must lie in the range (0.0, inf). 1.0
loss string Loss function used to update the contributing weights of the regressors after each boosting iteration. This can be linear for linear loss, square for mean squared error, or exponential for exponential loss. "linear"
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. ()!()
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
registry string or dict Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
config any Dictionary used to configure additional settings when saving the model to the registry. ()!()
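
As a rough illustration of the three loss options, the sketch below computes per-record losses the way the AdaBoost.R2 algorithm defines them. It assumes the underlying Python model follows that scheme, which this page does not state explicitly. The names err, d, lossLinear, lossSquare and lossExp are hypothetical and are not part of .qsp.ml.

```q
// Absolute prediction errors for three records (made-up values)
err:abs 0.5 -1 2f;

// Normalize by the largest error so that each loss lies in [0, 1]
d:max err;

lossLinear:err%d;             // linear loss
lossSquare:(err%d) xexp 2;    // square loss
lossExp:1-exp neg err%d;      // exponential loss
```

Under this assumption, records with larger losses are the "difficult prediction cases" that later estimators concentrate on.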

Returns:

type description
table Returns the input data with an additional column containing the model's predicted target values.

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name type description
data any Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter take four arguments:

name type description
data any Batch passed to the operator, only the data not the metadata.
y symbol, function, or :: Target variable supplied to the model as the y parameter.
predictions list Model's predictions for each record in the batch.
modelInfo :: Information about the model. Currently not used and always set to ::.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
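
To make the two function forms concrete, the following sketch applies them by hand to a small table, outside any pipeline, so the returned shapes can be inspected. The data values, the stand-in predictions and the column names yHat and absErr are made up for illustration; in a pipeline the operator itself would call these functions.

```q
// Small made-up batch; column names are illustrative
data:([] x:0.1 0.2 0.3; x1:1 2 3f; y:1 2 3f);

// X function: return the feature columns as a table
xFunc:{[data] select x,x1 from data};
// y function: return the target values as a list
yFunc:{[data] exec y from data};
// prediction function: attach the predictions and their absolute error
predFunc:{[data;y;predictions;modelInfo]
  err:abs predictions-data`y;
  update yHat:predictions, absErr:err from data
  };

// Apply by hand with stand-in predictions to inspect what each returns
features:xFunc data;
target:yFunc data;
out:predFunc[data;`y;1.1 1.9 3.2;::];
```

Here select returns a table and exec returns a list, the two return shapes mentioned above.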

When data is fed into this operator via a stream, the algorithm fits the underlying model only once the number of records received exceeds the value of the bufferSize parameter (n). The whole of the batch that pushes the buffered data past n records is included in fitting. Data passed to the stream after that point is not used for training; the operator outputs predictions for each record using the model fitted on the buffered data.
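
The buffering rule can be sketched in plain q by tracking record counts only. This is a simplified model of the behavior described above, not the operator's implementation; init, step and trace are hypothetical names.

```q
// Illustrative state: records buffered so far, whether a fit has happened,
// and how many records the fit used
init:`buffered`fitted`nTrain!(0;0b;0);

// step is a hypothetical helper, not part of .qsp.ml
step:{[n;st;batchCount]
  if[st`fitted; :st];                  // after the fit, batches are only scored
  seen:batchCount+st`buffered;         // the whole batch joins the buffer
  `buffered`fitted`nTrain!(seen;n<seen;$[n<seen;seen;0])
  };

// bufferSize of 300 with batches of 100, 150, 200 and 50 records
trace:step[300]\[init;100 150 200 50];
```

In this trace the fit is triggered by the third batch and uses all 450 buffered records; the fourth batch leaves the state unchanged because it would only be scored.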

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit an adaBoost regression model on data and store model in local registry.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);

// Define and run a stream processor pipeline using the adaBoostRegressor operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.adaBoostRegressor[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"AdaModel")]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]

// We can retrieve predictions using this fit model by passing new data
publish ([]5?1f;5?1f;5?1f;y:0n)
                             | x         x1          x2        y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248   0.591584    0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808    0.3408518   0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502   0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611   0.2372084   0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186   0.1082126   0.9550528

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);

// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
  delete x,x1,x2 from data  // this is the same as 'select y from data' as data only has 4 columns
  };
predFunc: {[data;y;predictions;modelInfo]
  update newPred: predictions from data
  };

// Define and run a stream processor pipeline using the adaBoostRegressor operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.adaBoostRegressor[xFunc;yFunc;predFunc]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

.qsp.ml.decisionTreeRegressor

Decision Tree Regressor.

.qsp.ml.decisionTreeRegressor[X;y;prediction]
.qsp.ml.decisionTreeRegressor[X;y;prediction;.qsp.use (!) . flip (
    (`criterion      ; criterion);
    (`splitter       ; splitter);
    (`maxDepth       ; maxDepth);
    (`minSamplesSplit; minSamplesSplit);
    (`minSamplesLeaf ; minSamplesLeaf);
    (`bufferSize     ; bufferSize);
    (`modelInit      ; modelInit);
    (`model          ; model);
    (`registry       ; registry);
    (`experiment     ; experiment);
    (`config         ; registryConfig))]
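
The (!) . flip (...) form in the signature is a plain q idiom that turns a list of (name;value) pairs into a dictionary. A minimal sketch, with illustrative option values (check the types against the options table below before using them):

```q
// Build an options dictionary from (name;value) pairs
opts:(!) . flip (
    (`criterion ; "squared_error");
    (`maxDepth  ; 5);
    (`bufferSize; 1000));

// The result is a dictionary keyed by option name
optNames:key opts;

// In a pipeline it would be passed as the fourth argument, for example:
// .qsp.ml.decisionTreeRegressor[`x`x1`x2;`y;`yHat;.qsp.use opts]
```

Building the dictionary separately makes it easy to inspect or reuse the same options across operators.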

Parameters:

name type description default
X symbol, symbol[], function, or :: Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to ::, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to :: and the model's y parameter is a function, an error will occur. Required
y symbol or function Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. Required
prediction symbol, function, or :: Can be the name of the generated column containing the model's predicted target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to ::, the default symbol prediction is used. Required

options:

name type description default
criterion string Criteria function used to measure the quality of a split each time a decision tree node is split into children. This can be squared_error for mean squared error, friedman_mse for mean squared error with Friedman's improvement score, absolute_error for mean absolute error, or poisson for Poisson deviance. "squared_error"
splitter string Strategy used to split the nodes in the tree. This can be best to choose the best split or random to choose the best random split. "best"
minSamplesSplit int Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is 2. 2
minSamplesLeaf int Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is 1. 1
maxDepth int Maximum depth of the decision tree, measured as the longest path from the tree root to a leaf. If set to ::, the tree will expand until all leaves are pure or contain fewer than minSamplesSplit records. Minimum value is 1. ::
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. ()!()
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
registry string or dict Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
config any Dictionary used to configure additional settings when saving the model to the registry. ()!()
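
As a worked illustration of how a split is scored under the squared_error criterion, the sketch below compares the mean squared error of a parent node with the size-weighted error of its two children. It assumes the standard regression-tree definition of node error (variance around the node mean); the data and all variable names are made up.

```q
// Made-up feature and target values
x:1 2 3 4 5 6f;
y:1 2 3 10 11 12f;

// Candidate split: x<=3 goes left, the rest goes right
left:y where x<=3;
right:y where x>3;

// var is the population variance, i.e. the mean squared error around the node mean
parent:var y;
children:((count[left]*var left)+count[right]*var right)%count y;

// Reduction in squared error achieved by the split
gain:parent-children;
```

A best splitter would, under this assumption, pick the candidate split with the largest gain.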

Returns:

type description
table Returns the input data with an additional column containing the model's predicted target values.

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name type description
data any Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter take four arguments:

name type description
data any Batch passed to the operator, only the data not the metadata.
y symbol, function, or :: Target variable supplied to the model as the y parameter.
predictions list Model's predictions for each record in the batch.
modelInfo :: Information about the model. Currently not used and always set to ::.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm fits the underlying model only once the number of records received exceeds the value of the bufferSize parameter (n). The whole of the batch that pushes the buffered data past n records is included in fitting. Data passed to the stream after that point is not used for training; the operator outputs predictions for each record using the model fitted on the buffered data.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit a decision tree regression model on data and store model in local registry.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);

// Define and run a stream processor pipeline using the decisionTreeRegressor operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.decisionTreeRegressor[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"DTModel")]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]

// We can retrieve predictions using this fit model by passing new data
publish ([]5?1f;5?1f;5?1f;y:0n)
                             | x         x1          x2        y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248   0.591584    0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808    0.3408518   0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502   0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611   0.2372084   0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186   0.1082126   0.9550528

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);

// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
  delete x,x1,x2 from data  // this is the same as 'select y from data' as data only has 4 columns
  };
predFunc: {[data;y;predictions;modelInfo]
  update newPred: predictions from data
  };

// Define and run a stream processor pipeline using the decisionTreeRegressor operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.decisionTreeRegressor[xFunc;yFunc;predFunc]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

.qsp.ml.gradientBoostingRegressor

Gradient Boosting Regressor.

.qsp.ml.gradientBoostingRegressor[X;y;prediction]
.qsp.ml.gradientBoostingRegressor[X;y;prediction;.qsp.use (!) . flip (
    (`loss           ; loss);
    (`learningRate   ; learningRate);
    (`nEstimators    ; nEstimators);
    (`minSamplesSplit; minSamplesSplit);
    (`minSamplesLeaf ; minSamplesLeaf);
    (`maxDepth       ; maxDepth);
    (`modelInit      ; modelInit);
    (`bufferSize     ; bufferSize);
    (`model          ; model);
    (`registry       ; registry);
    (`experiment     ; experiment);
    (`config         ; registryConfig))]

Parameters:

name type description default
X symbol, symbol[], function, or :: Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to ::, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to :: and the model's y parameter is a function, an error will occur. Required
y symbol or function Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. Required
prediction symbol, function, or :: Can be the name of the generated column containing the model's predicted target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to ::, the default symbol prediction is used. Required

options:

name type description default
loss string Loss function that is optimized using gradient descent to get the best model fit. Can be squared_error, absolute_error, huber which is a combination of squared_error and absolute_error, or quantile which allows for quantile regression (using conditional median). "squared_error"
learningRate float Controls the loss function used to set the weight of each regressor at each boosting iteration. The higher this value, the more each regressor will contribute to our final model. This value depends highly on the maximum number of estimators. Minimum value is 0.0. 0.1
nEstimators int Maximum number of tree estimators to train. Each estimator is fit on the dataset and adjusted to focus on difficult prediction cases. If we already have a perfect fit, we will not create this maximum number. Minimum value is 1. 100
minSamplesSplit int Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is 2. 2
minSamplesLeaf int Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is 1. 1
maxDepth int Maximum depth of the decision tree, measured as the longest path from the tree root to a leaf. If set to ::, the tree will expand until all leaves are pure or contain fewer than minSamplesSplit records. Minimum value is 1. 3
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. ()!()
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
registry string or dict Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
config any Dictionary used to configure additional settings when saving the model to the registry. ()!()
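
For orientation, the way learningRate, nEstimators and maxDepth interact can be written in the standard gradient boosting form. This is the textbook formulation and is assumed, not confirmed by this page, to match the underlying Python model. Here $\nu$ stands for learningRate, $M$ for nEstimators, and each $h_m$ is a tree of depth at most maxDepth fitted to the negative gradient of the chosen loss:

```latex
F_0(x) = \arg\min_{c} \sum_{i=1}^{n} L(y_i, c), \qquad
F_m(x) = F_{m-1}(x) + \nu \, h_m(x), \quad m = 1, \dots, M
```

A smaller $\nu$ shrinks the contribution of each tree, which is why it usually needs to be balanced against a larger $M$.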

Returns:

type description
table Returns the input data with an additional column containing the model's predicted target values.

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name type description
data any Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter take four arguments:

name type description
data any Batch passed to the operator, only the data not the metadata.
y symbol, function, or :: Target variable supplied to the model as the y parameter.
predictions list Model's predictions for each record in the batch.
modelInfo :: Information about the model. Currently not used and always set to ::.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm fits the underlying model only once the number of records received exceeds the value of the bufferSize parameter (n). The whole of the batch that pushes the buffered data past n records is included in fitting. Data passed to the stream after that point is not used for training; the operator outputs predictions for each record using the model fitted on the buffered data.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit a gradient boosting regression model on data and store model in local registry.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);

// Define and run a stream processor pipeline using the gradientBoostingRegressor operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.gradientBoostingRegressor[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"GbModel")]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]

// We can retrieve predictions using this fit model by passing new data
publish ([]5?1f;5?1f;5?1f;y:0n)
                             | x         x1          x2        y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248   0.591584    0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808    0.3408518   0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502   0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611   0.2372084   0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186   0.1082126   0.9550528

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);

// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
  delete x,x1,x2 from data  // this is the same as 'select y from data' as data only has 4 columns
  };
predFunc: {[data;y;predictions;modelInfo]
  update newPred: predictions from data
  };

// Define and run a stream processor pipeline using the gradientBoostingRegressor operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.gradientBoostingRegressor[xFunc;yFunc;predFunc]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

.qsp.ml.kNeighborsRegressor

k Nearest Neighbors Regressor.

.qsp.ml.kNeighborsRegressor[X;y;prediction]
.qsp.ml.kNeighborsRegressor[X;y;prediction;.qsp.use (!) . flip (
    (`nNeighbors; nNeighbors);
    (`weights   ; weights);
    (`metric    ; metric);
    (`algorithm ; algorithm);
    (`bufferSize; bufferSize);
    (`modelInit ; modelInit);
    (`model     ; model);
    (`registry  ; registry);
    (`experiment; experiment);
    (`config    ; registryConfig))]

Parameters:

name type description default
X symbol, symbol[], function, or :: Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to ::, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to :: and the model's y parameter is a function, an error will occur. Required
y symbol or function Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. Required
prediction symbol, function, or :: Can be the name of the generated column containing the model's predicted target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to ::, the default symbol prediction is used. Required

options:

name type description default
nNeighbors int Number of points already labeled or predicted, which lie closest to a given unlabeled point (neighbors), to factor in when predicting a value for the point. Minimum value is 1. 5
metric string The distance metric to be used for the tree. The default metric is minkowski, see here for available metrics. "minkowski"
weights string Weight function used to decide how much weight to give to each of the neighboring points when predicting the target of a point. Can be uniform, to weight each neighbor's target equally, or distance, to weight each neighbor's target based on its distance to the point. "uniform"
algorithm string Algorithm used to search the vector space and decide which points are the nearest neighbors. Can be ball_tree, kd_tree, a brute-force distance search, or auto to choose an approach based on the data. "auto"
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. ()!()
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
registry string or dict Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
config dict Dictionary used to configure additional settings when saving the model to the registry. ()!()
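
The difference between the two weights options can be seen on a tiny worked example. The sketch assumes that distance weighting means inverse-distance weighting, as in common k-nearest neighbors implementations; the distances, targets and variable names are made up.

```q
// Distances from the query point to its 3 nearest neighbors, and their targets
dist:1 2 4f;
target:10 20 40f;

// uniform: every neighbor counts equally
predUniform:avg target;

// distance: neighbors are weighted by the inverse of their distance
predDistance:(1%dist) wavg target;
```

With distance weighting the closest neighbor (target 10) pulls the prediction down relative to the plain average.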

Returns:

type description
table Returns the input data with an additional column containing the model's predicted target values.

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name type description
data any Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter take four arguments:

name type description
data any Batch passed to the operator, only the data not the metadata.
y symbol, function, or :: Target variable supplied to the model as the y parameter.
predictions list Model's predictions for each record in the batch.
modelInfo :: Information about the model. Currently not used and always set to ::.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm fits the underlying model only once the number of records received exceeds the value of the bufferSize parameter (n). The whole of the batch that pushes the buffered data past n records is included in fitting. Data passed to the stream after that point is not used for training; the operator outputs predictions for each record using the model fitted on the buffered data.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit a k-nearest neighbors regression model on data and store model in local registry.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);

// Define and run a stream processor pipeline using the kNeighborsRegressor operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.kNeighborsRegressor[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"KnnModel")]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]

// We can retrieve predictions using this fit model by passing new data
publish ([]5?1f;5?1f;5?1f;y:0n)
                             | x         x1          x2        y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248   0.591584    0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808    0.3408518   0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502   0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611   0.2372084   0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186   0.1082126   0.9550528

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);

// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
  delete x,x1,x2 from data  // this is the same as 'select y from data' as data only has 4 columns
  };
predFunc: {[data;y;predictions;modelInfo]
  update newPred: predictions from data
  };

// Define and run a stream processor pipeline using the kNeighborsRegressor operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.kNeighborsRegressor[xFunc;yFunc;predFunc]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

.qsp.ml.lasso

Lasso.

.qsp.ml.lasso[X;y;prediction]
.qsp.ml.lasso[X;y;prediction;.qsp.use (!) . flip (
    (`alpha       ; alpha);
    (`fitIntercept; fitIntercept);
    (`maxIter     ; maxIter);
    (`tol         ; tol);
    (`bufferSize  ; bufferSize);
    (`modelInit   ; modelInit);
    (`model       ; model);
    (`registry    ; registry);
    (`experiment  ; experiment);
    (`config      ; registryConfig))]

Parameters:

name type description default
X symbol, symbol[], function, or :: Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to ::, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to :: and the model's y parameter is a function, an error will occur. Required
y symbol or function Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. Required
prediction symbol, function, or :: Can be the name of the generated column containing the model's predicted target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to ::, the default symbol prediction is used. Required

options:

name type description default
alpha float Constant that controls the regularization strength by multiplying the L1 regularization term. Minimum value is 0.0. 1.0
fitIntercept boolean Whether to add a constant value (intercept) to the regression function - c in y=mx+c. 1b
maxIter int Maximum number of iterations before model training is terminated. The model will iterate until it converges or until it completes this number of iterations. Minimum value is 1. 1000
tol float Tolerance value required to stop searching for the global minimum/maximum value. This is achieved once you get close enough to this global value. Minimum value is 0.0. 1e-4
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. ()!()
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
registry string or dict Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
config dict Dictionary used to configure additional settings when saving the model to the registry. ()!()
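
To show where alpha and fitIntercept enter the fit, the lasso objective can be written as below. This is the usual formulation and is assumed, not confirmed by this page, to be what the underlying Python model minimizes. Here $n$ is the number of buffered records, $w$ the coefficient vector, and $c$ the intercept, which is fixed at 0 when fitIntercept is 0b:

```latex
\min_{w,\,c} \; \frac{1}{2n} \lVert y - Xw - c \rVert_2^2 + \alpha \lVert w \rVert_1
```

Larger values of $\alpha$ push more coefficients to exactly zero, while $\alpha = 0$ reduces the problem to ordinary least squares.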

Returns:

type description
table Returns the input data with an additional column containing the model's predicted target values.

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name type description
data any Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter take four arguments:

name type description
data any Batch passed to the operator, only the data not the metadata.
y symbol, function, or :: Target variable supplied to the model as the y parameter.
predictions list Model's predictions for each record in the batch.
modelInfo :: Information about the model. Currently not used and always set to ::.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model once the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n records is included in fitting. If subsequent data is passed to the stream, the operator will output predictions for each record using the model fitted on the buffered data.
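The buffering behaviour described above can be sketched outside the Stream Processor. The following is a minimal Python illustration, not the operator's implementation: the BufferedModel class, its fit and predict callables, and the toy mean model are all hypothetical. It assumes that fitting is triggered once the buffered record count reaches the buffer size, and that predictions for the buffered records are emitted at that point.

```python
class BufferedModel:
    """Buffer records until buffer_size is reached, fit once, then only predict."""

    def __init__(self, buffer_size, fit, predict):
        self.buffer_size = buffer_size  # plays the role of bufferSize (n)
        self.fit = fit                  # hypothetical: rows -> fitted model
        self.predict = predict          # hypothetical: (model, rows) -> predictions
        self.buffer = []
        self.model = None

    def on_batch(self, rows):
        if self.model is None:
            self.buffer.extend(rows)
            if len(self.buffer) < self.buffer_size:
                return None  # still buffering, nothing is emitted
            # The whole batch that crossed the threshold is included in the fit
            self.model = self.fit(self.buffer)
            rows, self.buffer = self.buffer, []
        return self.predict(self.model, rows)


# Toy "model": the mean of the targets seen during fitting
fit_mean = lambda rows: sum(y for _, y in rows) / len(rows)
predict_mean = lambda model, rows: [model for _ in rows]

op = BufferedModel(4, fit_mean, predict_mean)
first = op.on_batch([(0.1, 1.0), (0.2, 2.0), (0.3, 3.0)])  # 3 records: buffered
second = op.on_batch([(0.4, 4.0), (0.5, 5.0)])             # 5 records: fit on all 5
third = op.on_batch([(0.6, 100.0)])                        # prediction only, no refit
print(first, second, third)
```

The third batch does not change the fitted model, which mirrors the statement that later data is only scored against the model fitted on the buffered records.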

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit a lasso regression model on data and store model in local registry.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);

// Define and run a stream processor pipeline using the lasso operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.lasso[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"LassoModel")]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]

// We can retrieve predictions using this fit model by passing new data
publish ([] 5?1f;5?1f;5?1f;y:0n)
                             | x         x1          x2        y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248   0.591584    0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808    0.3408518   0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502   0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611   0.2372084   0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186   0.1082126   0.9550528

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);

// Define the functions that will be passed as the model arguments
xFunc: {[data]
  select x, x1 from data
  };
yFunc: {[data]
  delete x,x1,x2 from data  // this is the same as 'select y from data' as data only has 4 columns
  };
predFunc: {[data;y;predictions;modelInfo]
  update newPred: predictions from data
  };

// Define and run a stream processor pipeline using the lasso operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.lasso[xFunc;yFunc;predFunc]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

.qsp.ml.linearRegression

Linear regressor fit using stochastic gradient descent

.qsp.ml.linearRegression[X;y;prediction]
.qsp.ml.linearRegression[X;y;prediction;.qsp.use (!) . flip (
    (`trend     ; trend);
    (`alpha     ; alpha);
    (`maxIter   ; maxIter);
    (`gTol      ; gTol);
    (`seed      ; seed);
    (`penalty   ; penalty);
    (`lambda    ; lambda);
    (`l1Ratio   ; l1Ratio);
    (`decay     ; decay);
    (`p         ; p);
    (`bufferSize; bufferSize);
    (`model     ; model);
    (`registry  ; registry);
    (`experiment; experiment);
    (`config    ; config))]

Parameters:

name type description default
X symbol, symbol[], function, or :: Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to ::, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to :: and the model's y parameter is a function, an error will occur. Required
y symbol or function Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. Required
prediction symbol, function, or :: Can be the name of the generated column containing the model's predicted target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to ::, the default symbol prediction is used. Required

options:

name type description default
trend boolean Whether to add a constant value (intercept) to the regression function - c in y=mx+c. 1b
alpha float Learning rate value used in the optimization function to dictate the step size taken towards the minimum of the loss function at each iteration. A high value will override information about previous data more in favor of newly acquired information. Generally, this value is set to be very small. Minimum value is 0.0. 0.01
maxIter long Maximum number of iterations before model training is terminated. The model will iterate until it converges or until it completes this number of iterations. Minimum value is 1. 100
gTol float Tolerance value required to stop searching for the global minimum/maximum value. This is achieved once you get close enough to this global value. Minimum value is 0.0. 1e-5
seed long Integer value used to control the randomness of the model's initialization state. Specifying this allows for reproducible results across function calls. If set to ::, the randomness is based on the current timestamp. 0
penalty symbol Penalty term used to shrink the coefficients of the less contributive variables. Can be l1 to add an L1 penalty term, l2 to add an L2 penalty term, or elasticNet to add both L1 and L2 penalty terms. l2
lambda float Lambda value used to define the strength of the regularization applied. The higher this value is, the stronger the regularization will be. Minimum value is 0.0. 0.001
l1Ratio float If elasticNet is used as the regularization method, this parameter determines the balance between the L1 and L2 penalty terms. If this value is set to 0, this is the same as using L2 regularization, if this value is set to 1, this is the same as using L1 regularization. This value must lie in the range [0.0, 1.0]. 0.5
decay float Describes how much weight to give to historical predictions from previously fit iterations. The higher this value, the less important historic predictions will be. Minimum value is 0.0. 0f
p float Coefficient used to help accelerate the gradient vectors in the right direction, leading to faster convergence. Minimum value is 0.0. 0f
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
registry string Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
config any Dictionary used to configure additional settings when saving the model to the registry. ()!()

For all common arguments, refer to configuring operators

Returns:

type description
table Returns the input data with an additional column containing the model's predicted target values.

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions take one of two forms, depending on whether they are values for the X and y parameters or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name type description
data any Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter take four arguments:

name type description
data any Batch passed to the operator, only the data not the metadata.
y symbol, function, or :: Target variable supplied to the model as the y parameter.
predictions list Model's predictions for each record in the batch.
modelInfo :: Information about the model. Currently not used and always set to ::.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model once the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n records is included in fitting. As this is an online model, each subsequent batch passed to the stream is used to update the regression model, and a prediction is also made for each record.
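To make the online behaviour concrete, here is a minimal Python sketch of a stochastic gradient descent update for a linear model. It is an illustration under simplifying assumptions, not the operator's implementation: the function names are hypothetical, only the roles of alpha, lambda (here lam, applied as an L2 penalty to every weight) and the trend intercept are modelled, and decay, p, gTol and maxIter are left out.

```python
def sgd_update(weights, batch, alpha=0.01, lam=0.001):
    """One pass of stochastic gradient descent over a batch of (features, target) pairs.

    weights[0] is the intercept (the trend term); lam is an L2 penalty strength.
    """
    for features, target in batch:
        inputs = [1.0] + list(features)        # prepend 1.0 for the intercept
        predicted = sum(w * v for w, v in zip(weights, inputs))
        error = predicted - target
        weights = [
            w - alpha * (error * v + lam * w)  # squared-error gradient plus L2 shrinkage
            for w, v in zip(weights, inputs)
        ]
    return weights


def predict(weights, features):
    return weights[0] + sum(w * v for w, v in zip(weights[1:], features))


# Stream batches drawn from y = 2x + 1; each batch refines the same weights
weights = [0.0, 0.0]
errors = []
for _ in range(200):
    batch = [((x / 10.0,), 2.0 * (x / 10.0) + 1.0) for x in range(11)]
    weights = sgd_update(weights, batch, alpha=0.1, lam=0.0)
    errors.append(abs(predict(weights, (0.5,)) - 2.0))

print(weights, errors[0], errors[-1])
```

Because the weights carry over from one batch to the next, the prediction error shrinks as more batches arrive, which is the property that distinguishes this operator from the fit-once regressors on this page.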

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit, update, and predict with a linear regression model.

// Define and run a stream processor pipeline using the linearRegression operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.linearRegression[`x;`y;`yHat; .qsp.use `trend`bufferSize!(1b;10000)]
  .qsp.write.toVariable[`output];

// Data will be buffered for training until the buffer size is reached,
// during which time no batches will be emitted.
publish ([] x:asc 5000?1f; y:asc 5000?1f);

// When the buffer size is reached, the buffered data is used for training,
// and predictions for the buffered records are emitted.
publish ([] x:asc 5000?1f; y:asc 5000?1f);

// The operator can now be used to make predictions.
// As this is an online model, subsequent batches are also used to update the fitted model.
publish ([] x:asc 100?1f; y:asc 100?1f);

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);

// Define the functions that will be passed as the model arguments
xFunc: {[data]
  select x, x1 from data
  };
yFunc: {[data]
  delete x,x1,x2 from data  // this is the same as 'select y from data' as data only has 4 columns
  };
predFunc: {[data;y;predictions;modelInfo]
  update newPred: predictions from data
  };

// Define and run a stream processor pipeline using the linearRegression operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.linearRegression[xFunc;yFunc;predFunc]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

.qsp.ml.randomForestRegressor

Random Forest Regressor.

.qsp.ml.randomForestRegressor[X;y;prediction]
.qsp.ml.randomForestRegressor[X;y;prediction;.qsp.use (!) . flip (
    (`nEstimators    ; nEstimators);
    (`criterion      ; criterion);
    (`minSamplesSplit; minSamplesSplit);
    (`minSamplesLeaf ; minSamplesLeaf);
    (`maxDepth       ; maxDepth);
    (`bufferSize     ; bufferSize);
    (`modelInit      ; modelInit);
    (`model          ; model);
    (`registry       ; registry);
    (`experiment     ; experiment);
    (`config         ; config))]

Parameters:

name type description default
X symbol, symbol[], function, or :: Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to ::, the operator tries to infer the feature columns as the non-categorical and non-target columns. If set to :: and the model's y parameter is a function, an error will occur. Required
y symbol or function Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. Required
prediction symbol, function, or :: Can be the name of the generated column containing the model's predicted target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to ::, the default symbol prediction is used. Required

options:

name type description default
nEstimators int Number of decision tree estimators to train and use in the forest. The forest's prediction is the average of the predictions of the individual trees. Minimum value is 1. 100
criterion string Criteria function used to measure the quality of a split each time a decision tree node is split into children. This can be squared_error for mean squared error, friedman_mse for mean squared error with Friedman's improvement score, absolute_error for mean absolute error, or poisson for Poisson deviance. "squared_error"
minSamplesSplit int Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is 2. 2
minSamplesLeaf int Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is 1. 1
maxDepth int Maximum depth of the decision tree - measured as the longest path from the tree root to a leaf. If set to ::, the tree will expand until all leaves are pure or contain fewer than minSamplesSplit records. Minimum value is 1. ::
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelInit dict A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here. ()!()
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
registry string or dict Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under unnamedExperiments. ::
config dict Dictionary used to configure additional settings when saving the model to the registry. ()!()

Returns:

type description
table Returns the input data with an additional column containing the model's predicted target values.

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions take one of two forms, depending on whether they are values for the X and y parameters or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name type description
data any Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter take four arguments:

name type description
data any Batch passed to the operator, only the data not the metadata.
y symbol, function, or :: Target variable supplied to the model as the y parameter.
predictions list Model's predictions for each record in the batch.
modelInfo :: Information about the model. Currently not used and always set to ::.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model once the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n records is included in fitting. If subsequent data is passed to the stream, the operator will output predictions for each record using the model fitted on the buffered data.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit a random forest regression model on data and store model in local registry.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);

// Define and run a stream processor pipeline using the randomForestRegressor operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.randomForestRegressor[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"RafrModel")]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore["/tmp";::]

// We can retrieve predictions using this fit model by passing new data
publish ([]5?1f;5?1f;5?1f;y:0n)
                             | x         x1          x2        y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248   0.591584    0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808    0.3408518   0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502   0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611   0.2372084   0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186   0.1082126   0.9550528

Example 2: Pass functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data:([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);

// Define the functions that will be passed as the model arguments
xFunc: {[data]
  select x, x1 from data
  };
yFunc: {[data]
  delete x,x1,x2 from data  // this is the same as 'select y from data' as data only has 4 columns
  };
predFunc: {[data;y;predictions;modelInfo]
  update newPred: predictions from data
  };

// Define and run a stream processor pipeline using the randomForestRegressor operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.randomForestRegressor[xFunc;yFunc;predFunc]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

.qsp.ml.score

Score the performance of a model

.qsp.ml.score[y;predictions;metric]

Parameters:

name type description default
y symbol or function Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use. Required
predictions symbol or function Can be the name of the column which houses the model's predictions OR a user-defined function that will generate predictions from the input data. Required
metric symbol Metric to use to compare the predictions with the y target values. Required

For all common arguments, refer to configuring operators

Returns:

type description
any The evaluation score given by the metric.

Scores the performance of a model over time, allowing changes in model performance to be evaluated. The values returned are cumulative scores, rather than scores for the individual batches.

The following metrics are currently supported:

  • f1
  • accuracy
  • mse
  • rmse
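
The difference between a cumulative score and a per-batch score can be shown with a small Python sketch. This is only an illustration: the CumulativeScore class is hypothetical, and it assumes that "cumulative" means the metric is computed over every record seen so far. Only accuracy, mse and rmse are modelled here.

```python
import math


class CumulativeScore:
    """Accumulate sufficient statistics so each batch updates an all-time score."""

    def __init__(self):
        self.count = 0            # records seen so far
        self.correct = 0          # exact matches, for accuracy
        self.squared_error = 0.0  # running sum of squared errors, for mse/rmse

    def update(self, y, predictions):
        for actual, predicted in zip(y, predictions):
            self.count += 1
            self.correct += int(actual == predicted)
            self.squared_error += (actual - predicted) ** 2
        return self

    def accuracy(self):
        return self.correct / self.count

    def mse(self):
        return self.squared_error / self.count

    def rmse(self):
        return math.sqrt(self.mse())


score = CumulativeScore()
score.update([1, 0, 1, 1], [1, 0, 0, 1])  # batch 1: 3 of 4 correct
after_first = score.accuracy()
score.update([0, 0, 1, 1], [0, 1, 0, 1])  # batch 2: 2 of 4 correct
after_second = score.accuracy()           # 5 of 8 correct overall, not 2 of 4
print(after_first, after_second, score.mse())
```

Keeping running totals rather than the raw records means the score can be updated on each batch without storing the stream.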

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit a scikit-learn model, predict y, and calculate the cumulative F1 score of the model on receipt of new data.

// Retrieve a predefined dataset and format it appropriately
dataset:.p.import[`sklearn.datasets;`:load_breast_cancer][];
X:dataset[`:data]`;
y:dataset[`:target]`;
data: ([] y: y) ,' flip (`$"x",/:string til count first X)!flip X;

// Split the data into a training and testing set
temp: (floor .8 * count data) cut data;
training: temp 0;
testing : temp 1;

// Train the model
features:flip value flip delete y from training;
targets :training`y;
clf:.p.import[`sklearn.tree]`:DecisionTreeClassifier;
clf:clf[`max_depth pykw 3];
clf[`:fit][features;targets];

// Set model within the existing registry
.ml.registry.set.model[::;::;clf;"skModel";"sklearn";::];

// Define and run a stream processor pipeline using the score operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.predict[
    {delete y from x};
    `pred;
    .qsp.use enlist[`model]!enlist"skModel"]
  .qsp.ml.score[`y; `pred; `f1]
  .qsp.write.toConsole[];

// Pass the test data to the stream processor to evaluate the predictive performance of the model
publish testing;

Example 2: Fit a q model, predict y, and score the cumulative accuracy on receipt of new data.

// Retrieve a predefined dataset and format it appropriately
dataset:.p.import[`sklearn.datasets;`:load_breast_cancer][];
X:dataset[`:data]`;
y:dataset[`:target]`;
data: ([] y: y) ,' flip (`$"x",/:string til count first X)!flip X;

// Split the data into training and testing sets
temp: (floor .8 * count data) cut data;
training: temp 0;
testing : temp 1;

// Train the model
features:flip value flip delete y from training;
targets:training`y;
model:.ml.online.sgd.logClassifier.fit[features;targets;1b;::];

// Set model within the existing registry
.ml.registry.set.model[::;::;model;"myModel";"q";::]

// Define and run a stream processor pipeline using the score operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.predict[
    {delete y from x};
    `pred;
    .qsp.use enlist[`model]!enlist"myModel"]
  .qsp.ml.score[`y; `pred; `accuracy]
  .qsp.write.toConsole[]

// Pass the test data to the stream processor to evaluate the predictive performance of the model
publish testing

.qsp.ml.dropConstant

Drops columns with constant values

.qsp.ml.dropConstant[X]
.qsp.ml.dropConstant[X; .qsp.use enlist[`bufferSize]!enlist bufferSize]

Parameters:

name type description default
X symbol, symbol[], dictionary, or :: Name of the column(s) in the input table to remove because they contain a constant value throughout. Can also be a dictionary mapping column names to their associated constant values whereby only columns with these names and values will be dropped. If set to ::, the operator will be applied to all columns in the data that contain a constant value throughout. Required

options:

name type description default
bufferSize long Number of records to observe before dropping the constant columns from the data. If set to 0, the operator will be applied on the first batch. Minimum value is 0. 0

For all common arguments, refer to configuring operators

Returns:

type description
table Returns the input data with the constant valued columns no longer in the table.

The columns to remove are either specified by the user beforehand, as a list or dictionary, or determined using the .ml.dropConstant function, which checks the data for columns that contain a constant value throughout. If a non-constant column is supplied, an error is thrown.
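
The three ways of choosing columns can be sketched in Python as follows. This is an illustration rather than the .ml.dropConstant implementation: the function names and the dictionary-of-lists table are hypothetical, and raising an error when a column does not hold its expected constant is an assumption of this sketch.

```python
def constant_columns(table):
    """Names of columns whose values are all identical."""
    return [name for name, values in table.items() if len(set(values)) <= 1]


def drop_constant(table, columns=None):
    """Drop constant columns from a table given as {column name: list of values}.

    columns may be None (infer), a list of names, or a dict of name -> expected constant.
    """
    if columns is None:
        columns = constant_columns(table)
    expected = columns if isinstance(columns, dict) else None
    for name in columns:
        values = set(table[name])
        if len(values) > 1:
            raise ValueError(f"column {name!r} is not constant")
        # Assumption: a mismatch with the expected constant is treated as an error
        if expected is not None and values != {expected[name]}:
            raise ValueError(f"column {name!r} does not hold {expected[name]!r}")
    return {name: values for name, values in table.items() if name not in columns}


table = {"protocol": ["TCP"] * 3, "response": [200] * 3, "latency": [1.5, 0.2, 3.1]}
inferred = drop_constant(table)                   # infer the constant columns
checked = drop_constant(table, {"response": 200}) # drop only response, checking its value
print(list(inferred), list(checked))
```

Passing a non-constant column such as latency by name raises an error, matching the behaviour described above.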

Examples:

The following examples outline the use of the functionality described above.

Example 1: Drop the constant columns protocol and response.

// Define and run a stream processor pipeline using the dropConstant operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.dropConstant[`protocol`response]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to drop its constant values
publish ([] protocol: `TCP; response: 200i; latency: 10?5f; size: 10?10000);

Example 2: Drop the columns id and ratio, checking that their values match the expected constant values.

// Define and run a stream processor pipeline using the dropConstant operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.dropConstant[`id`ratio!(1; 2f)]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to drop its constant values
publish ([] id: 1; ratio: 2f; data: 10?10f);

Example 3: Drop columns whose value is constant for all buffered records.

// Define and run a stream processor pipeline using the dropConstant operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.dropConstant[::;.qsp.use enlist[`bufferSize]!enlist 100]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to drop its constant values
publish ([] motorID: 0; rpms: 1000 + 200?10; temp: 60 + 200?5)

.qsp.ml.featureHasher

(Beta Feature) Encodes categorical data across several numeric columns

Beta Features

To enable beta features, set the environment variable KXI_SP_BETA_FEATURES to true.

.qsp.ml.featureHasher[X;n]

Parameters:

name type description default
X symbol or symbol[] Name of the column(s) in the data to perform the feature hashing on. Required
n long Number of resulting numeric columns to represent each specified column. Required

For all common arguments, refer to configuring operators

Returns:

type description
table Returns the input table with each specified column now replaced by n new numeric columns which contain the hashed feature values.

This operator is used to encode categorical variables numerically. It is similar to one-hot encoding, but does not require the categories or number of categories to be known in advance.

It converts each chosen column into n columns, sending each string/symbol to its truncated hash value. The hash function employed is the signed 32-bit version of MurmurHash3.

As the mapping between values and their hashed representations is effectively random, collisions are possible, and the hash space must be made large enough to reduce collisions to an acceptable level.
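
The following Python sketch shows one common form of the hashing trick. It is not the operator's algorithm: the function names are hypothetical, CRC32 from the standard library stands in for MurmurHash3, and the bucket and sign scheme (bucket from the hash magnitude, sign from the hash) is an assumption borrowed from typical feature-hashing implementations.

```python
import zlib


def signed_hash32(value):
    """Stand-in 32-bit signed hash. CRC32 is used only because it ships with
    Python; the operator itself is documented as using MurmurHash3."""
    unsigned = zlib.crc32(str(value).encode("utf-8"))
    return unsigned - (1 << 32) if unsigned >= (1 << 31) else unsigned


def hash_feature(value, n):
    """Encode one categorical value as a length-n numeric vector."""
    h = signed_hash32(value)
    vector = [0] * n
    vector[abs(h) % n] = 1 if h >= 0 else -1  # bucket from magnitude, sign from the hash
    return vector


def hash_column(values, n):
    return [hash_feature(v, n) for v in values]


rows = hash_column(["london", "paris", "berlin", "london"], 10)
for row in rows:
    print(row)
```

Equal values always map to the same vector, and no list of categories is needed up front. Two different values can still land in the same bucket, which is why n has to be large enough for the number of distinct categories expected.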

Examples:

The following examples outline the use of the functionality described above.

Example 1: Encode a single categorical column.

// Define and run a stream processor pipeline using the featureHasher operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.featureHasher[`location; 10]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to encode its features through hashing
publish ([] location: 20?`london`paris`berlin`miami; num: til 20);

Example 2: Encode multiple categorical columns.

// Define and run a stream processor pipeline using the featureHasher operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.featureHasher[`srcIP`destIP; 14]
  .qsp.write.toVariable[`output];

// Pass a batch of data to the stream processor to encode its features through hashing
IPs: "." sv/: string 4 cut 100?256;
publish ([] srcIP: 100?IPs; destIP: 100?IPs; latency: 100?10; size: 100?10000);

.qsp.ml.labelEncode

Encodes symbolic columns as numeric data

.qsp.ml.labelEncode[X]

Parameters:

name type description default
X symbol, symbol[], dictionary, or :: Name of the column(s) in the input table whose labels we want to encode. Can also be a dictionary mapping column names to their expected label values whereby the list of values will be used as the first set of encoding values for each column. If set to ::, all categorical columns will be encoded as numeric values. Required

For all common arguments, refer to configuring operators

Returns:

type description
table Returns the input data with the symbol columns in the data now having been label encoded as numeric values.

This operator encodes symbolic columns within input data as numeric representations. When data is fed into this operator via a stream, the specified symbol columns are encoded and the mapping of each symbol to its respective encoded number is stored as the state. If new symbols appear in subsequent batches, the state will be updated to reflect this.
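
The stateful mapping can be illustrated with a short Python sketch. The LabelEncoder class below is hypothetical and only models the idea of a mapping that persists between batches; assigning codes from 0 in order of first appearance is an assumption, and the operator may number labels differently.

```python
class LabelEncoder:
    """Map each distinct symbol to an integer, extending the mapping as new symbols arrive."""

    def __init__(self, initial=()):
        # Optional expected labels take the first codes, in the order given
        self.mapping = {label: code for code, label in enumerate(initial)}

    def encode(self, values):
        encoded = []
        for value in values:
            if value not in self.mapping:
                self.mapping[value] = len(self.mapping)  # next unused code
            encoded.append(self.mapping[value])
        return encoded


encoder = LabelEncoder(["small", "medium", "large"])
batch1 = encoder.encode(["large", "small", "small"])
batch2 = encoder.encode(["medium", "huge", "large"])  # "huge" is new in this batch
print(batch1, batch2, encoder.mapping)
```

Symbols seen in earlier batches keep their codes, so encodings stay consistent across the stream while new symbols extend the state.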

Examples:

Example 1: Encode all symbol columns within the data.

// Define and run a stream processor pipeline using the labelEncode operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.labelEncode[::]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to encode its columns
publish ([]10?`a`b`c;10?`d`e`f;10?1f);

Example 2: Encode symbols in column x.

// Define and run a stream processor pipeline using the labelEncode operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.labelEncode[`x]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to encode its columns
publish ([]10?`a`b`c;10?`d`e`f;10?1f);

Example 3: Encode the symbols in the encoded column with the mapping specified.

// Define and run a stream processor pipeline using the labelEncode operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.labelEncode[(enlist `encoded)!enlist `small`medium`large]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to encode its columns
data: 10?`small`medium`large;
publish ([] original: data; encoded: data);

.qsp.ml.minMaxScaler

Apply min-max scaling to streaming data

.qsp.ml.minMaxScaler[X]
.qsp.ml.minMaxScaler[X; .qsp.use (!) . flip (
    (`bufferSize; bufferSize);
    (`rangeError; rangeError))]

Parameters:

name type description default
X symbol, symbol[], dictionary, or :: Name of the column(s) in the input table whose values we want to scale. Can also be a dictionary mapping column names to the minimum and maximum values to use when scaling. If set to ::, all numeric columns will be scaled. Required

options:

name type description default
bufferSize long Number of records to observe before scaling the numeric columns in the data. If set to 0, the operator will be applied on the first batch. Minimum value is 0. 0
rangeError boolean Whether to raise a range error if new input data falls outside the minimum and maximum data range observed during the initialization of the operator. 0b

For all common arguments, refer to configuring operators

Returns:

type description
table Returns the input data with the numeric columns now being scaled so their values lie between 0 and 1.

This operator scales a set of numeric columns based on a user-supplied data range or based on the minimum and maximum values in the data when the operator is applied. The operator will only be applied, and the minimum/maximum values decided upon, once the number of data points given to the operator exceeds the value of the bufferSize parameter. The operator can also be configured to error if data supplied after the ranges have been set falls outside this range.
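
As a rough illustration of the scaling rule, the Python sketch below applies (value - min) / (max - min) with a range that is fixed once and then reused. The MinMaxScaler class is hypothetical and works on a single column; the handling of a constant column is an assumption of the sketch, not documented behaviour.

```python
class MinMaxScaler:
    """Scale values with a range fixed at initialization, optionally rejecting outliers."""

    def __init__(self, low=None, high=None, range_error=False):
        self.low, self.high = low, high  # user-supplied range, if any
        self.range_error = range_error   # plays the role of rangeError

    def fit(self, values):
        # Only derive the range from data when the user did not supply one
        if self.low is None or self.high is None:
            self.low, self.high = min(values), max(values)
        return self

    def transform(self, values):
        if self.range_error and any(v < self.low or v > self.high for v in values):
            raise ValueError("value outside the range set at initialization")
        span = (self.high - self.low) or 1  # constant column: avoid dividing by zero
        return [(v - self.low) / span for v in values]


scaled = MinMaxScaler().fit([3, 5, 7]).transform([3, 5, 7])     # range taken from the data
rated = MinMaxScaler(low=0, high=10).fit([3, 7]).transform([3, 7])  # range supplied by the user
later = MinMaxScaler().fit([3, 5, 7]).transform([9])            # outside the range, no error
print(scaled, rated, later)
```

Without the range check, values outside the fitted range simply scale to numbers below 0 or above 1; with range_error=True the same input raises an error instead.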

Examples:

The following examples outline the use of the functionality described above.

Example 1: Apply min-max scaling on all data.

// Define and run a stream processor pipeline using the minMaxScaler operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.minMaxScaler[::]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to min-max scale its columns
publish ([]20?5;20?5;20?10)

Example 2: Apply min-max scaling on the specified columns.

// Define and run a stream processor pipeline using the minMaxScaler operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.minMaxScaler[`x`x1]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to min-max scale its columns
publish ([]20?5;20?5;20?10)

Example 3: Apply min-max scaling on columns rating and cost, with supplied minimum and maximum values for one column and the other based on a buffer.

// Define and run a stream processor pipeline using the minMaxScaler operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.minMaxScaler[`rating`cost!(0 10;::); .qsp.use enlist[`bufferSize]!enlist 200]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to min-max scale its columns
publish ([] rating: 3 + 250?5; cost: 250?1000f)

Example 4: Error when passed batches containing data outside the min-max bounds.

// Define and run a stream processor pipeline using the minMaxScaler operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.minMaxScaler[::;.qsp.use enlist[`rangeError]!enlist 1b]
  .qsp.write.toConsole[]

// As no buffer is specified, the min and max values are fit using the initial batch
publish ([]100?5;100?5;100?10)

// As `rangeError` has been set, this batch will cause an error by exceeding the
// expected maximum values
publish 1+([]100?5;100?5;100?10)

.qsp.ml.oneHot

One hot encodes relevant columns

.qsp.ml.oneHot[X]
.qsp.ml.oneHot[X; .qsp.use enlist[`bufferSize]!enlist bufferSize]

Parameters:

name type description default
X symbol, symbol[], dictionary, or :: Name of the column(s) in the input table to one-hot encode. Can also be a dictionary mapping column names to their expected values whereby only columns with these names and values will be encoded. If set to ::, all categorical columns will be encoded as numeric values. Required

options:

name type description default
bufferSize long Number of records to observe before one-hot encoding the symbol columns in the data. If set to 0, the operator will be applied on the first batch. Minimum value is 0. 0

For all common arguments, refer to configuring operators

Returns:

type description
table Returns the input data with the symbol columns in the data now each being represented by multiple numeric columns populated by 0s and 1s.

Encodes symbolic and string data as numeric representations. When data is fed into the operator via a stream, the algorithm will only be applied to the data when the number of records received has exceeded the value of the bufferSize parameter. When this happens, the buffered data is one-hot encoded. If subsequent data is passed which contains symbols that were not present at the time of the original fitting, these symbols will be mapped to 0.
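
The encoding step can be sketched in plain q as follows. This is an illustration under stated assumptions, not the operator's implementation: oneHotCol is a hypothetical helper, and the output column naming (column name, underscore, category) is an assumption that may differ from what the operator produces.

```q
// Hypothetical helper for illustration; not part of .qsp.ml
// Build one 0/1 column per category fitted for a single symbol column
oneHotCol:{[colName;cats;col]
  colNames:`$(string[colName],"_"),/:string cats;
  flip colNames!`long$col=/:cats
  };

// Categories `a`b`c were seen when fitting; `d arrives in a later batch
t:oneHotCol[`x;`a`b`c;`a`c`d`b];
```

The third row, whose value `d was not present at fitting time, contains 0 in every encoded column.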

Examples:

Example 1: Encode all the symbolic or string columns.

// Define and run a stream processor pipeline using the oneHot operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.oneHot[::]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to oneHot encode its columns
publish ([] action: 10?`upload`download; fileType: 10?("image";"audio";"document"); size: 10?100000)

Example 2: Encode column x.

// Define and run a stream processor pipeline using the oneHot operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.oneHot[`x]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to oneHot encode its columns
publish ([] x:10?`a`b`c; y:10?1f)

Example 3: Encode columns x and x1 with a required buffer.

// Define and run a stream processor pipeline using the oneHot operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.oneHot[`x`x1; .qsp.use enlist[`bufferSize]!enlist 200]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to oneHot encode its columns
publish ([] 250?`a`b`c; 250?`d`e`f`j; 250?0b)

Example 4: Encode the columns axis and status using given values. This is useful when the categories are known in advance, but may not be present in the training data.

// Define and run a stream processor pipeline using the oneHot operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.oneHot[`axis`status!(`x`y`z; `normal`error)]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to oneHot encode its columns
publish ([] axis: 100?`x`y`z; status: `normal; position: 100?50f)

Example 5: Encode the columns axis and status using a hybrid method, where the values for axis are inferred from the data and the values for status are given.

// Define and run a stream processor pipeline using the oneHot operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.oneHot[`axis`status!(::; `normal`error)]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to oneHot encode its columns
publish ([] axis: 100?`x`y`z; status: `normal; position: 100?50f)

.qsp.ml.standardize

Apply standardization to streaming data

.qsp.ml.standardize[X]
.qsp.ml.standardize[X; .qsp.use enlist[`bufferSize]!enlist bufferSize]

Parameters:

name type description default
X symbol or symbol[] or :: Name of the column(s) in the input table to standardize. If set to ::, all numeric columns will be standardized. Required

options:

name type description default
bufferSize long Number of records to observe before standardizing the numerical columns in the data. If set to 0, the operator will be applied on the first batch. Minimum value is 0. 0

For all common arguments, refer to configuring operators

Returns:

type description
table Returns the input data with the numeric columns now having a mean value of 0 and a standard deviation of 1.

Standardizes a user-specified set of columns in an input table. When data is fed into this operator via a stream, the algorithm only scales the data once the number of records received has exceeded the value of the bufferSize parameter. At that point, the mean and standard deviation of each column are computed. These statistics are then applied to subsequent batches, which are normalized by subtracting the mean and dividing the result by the standard deviation.
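
The arithmetic can be sketched in plain q. This is a minimal illustration, not the operator's implementation: fitStats and applyStats are hypothetical helpers, and the use of q's dev (the population standard deviation) is an assumption about which deviation is computed.

```q
// Hypothetical helpers for illustration; these are not part of .qsp.ml
// Record the mean and (population) standard deviation of a numeric column
fitStats:{[col] `mu`sigma!(avg col;dev col)};
// Subtract the fitted mean, then divide by the fitted standard deviation
applyStats:{[s;col] (col-s`mu)%s`sigma};

// Fit on the first batch (or buffer): the mean is 5 and the deviation is 2
s:fitStats 2 4 4 4 5 5 7 9f;

// Apply the fitted statistics to a later batch
zs:applyStats[s;3 5 9f];
```

With these inputs zs is -1 0 2. Only the data used for fitting is guaranteed a mean of 0 and a standard deviation of 1; later batches are scaled by the same fitted statistics.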

Examples:

The following examples outline the use of the functionality described above.

Example 1: Apply standardization to all data.

// Define and run a stream processor pipeline using the standardize operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.standardize[::]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to standardize its columns
publish ([]100?5;100?5;100?10)

Example 2: Apply standardization to specified columns.

// Define and run a stream processor pipeline using the standardize operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.standardize[`x`x1]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to standardize its columns
publish ([]100?5;100?5;100?10)

Example 3: Apply standardization to all columns based on a buffer.

// Define and run a stream processor pipeline using the standardize operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.standardize[::; .qsp.use enlist[`bufferSize]!enlist 200]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to standardize its columns
publish ([] length: 100 + 250?2f; width: 10 + 250?1f);

.qsp.ml.registry.fit

Fit model to batch of data and predict target for future batches

.qsp.ml.registry.fit[X;y;untrained;modelType;prediction]
.qsp.ml.registry.fit[X;y;untrained;modelType;prediction; .qsp.use (!) . flip (
    (`bufferSize; bufferSize);
    (`modelArgs ; modelArgs);
    (`model     ; model);
    (`registry  ; registry);
    (`experiment; experiment);
    (`config    ; config))]

Parameters:

name type description default
X symbol[], ::, or function Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to ::, the features are inferred from the y value. Required
y symbol, function, or :: Can be the name of the column containing the data’s target labels OR a user-defined function that returns the target values to use. This must be :: when training an unsupervised model. Required
untrained function An untrained q/sklearn model that we want to fit. Required
modelType string Whether the model we are fitting is a "q" model or an "sklearn" model. Required
prediction symbol, ::, or function Can be the name of the column which is to house the model’s predicted class/cluster/target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to ::, the symbol prediction is used. Required

options:

name type description default
bufferSize long Number of records to observe before fitting the model. If set to 0, the model will be fit on the first batch. Minimum value is 0. 0
modelArgs list List of arguments passed to the model to help configure the fitting process. ::
model string Name of the model to be stored in the registry. If set to ::, the model will not be stored in the registry. ::
registry string Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory. ::
experiment string Name of the experiment in the registry that the fitted model is to be stored under. If set to ::, the model will be stored under 'unnamedExperiments'. ::
config any Dictionary used to configure additional settings when saving the model to the registry. ()!()

For all common arguments, refer to configuring operators

Returns:

type description
any The current batch, modified in accordance with the prediction parameter.

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name type description
data any Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter take four arguments:

name type description
data any Batch passed to the operator, only the data not the metadata.
y symbol, function, or :: Target variable supplied to the model as the y parameter. In unsupervised models, this value is set to ::.
predictions list Model's predictions for each record in the batch.
modelInfo :: Information about the model. Currently not used and always set to ::.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.
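
Because these are ordinary q functions, they can be checked against a sample table before being placed in a pipeline. The sketch below is illustrative only: the sample table, the prediction values, and the choice to pass `y and :: as the y and modelInfo arguments are assumptions made for the test call.

```q
// A small sample batch, invented for illustration
data:([] x:1 2 3f; x1:4 5 6f; y:5 7 9f);

// Feature function: returns a table of the feature columns
xFunc:{[data] select x,x1 from data};
// Target function: returns the target values as a list
yFunc:{[data] exec y from data};
// Prediction function: appends the predictions as a new column
predFunc:{[data;y;predictions;modelInfo] update yhat:predictions from data};

// Call each function directly with hand-written values
feat:xFunc data;
targ:yFunc data;
out:predFunc[data;`y;5.1 6.9 9.2;::];
```

Here feat holds the columns x and x1, targ is the list 5 7 9, and out is the original table with an extra yhat column.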

Fits a model to a batch or buffer of data, saving the model to the registry, and predicting the target variable for future batches after the model has been trained.

N.B. This is only for models that cannot be trained incrementally. For other models, .qsp.ml.registry.update should be used.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit a q model.

// Generate initial data to be used for fitting
a:500?1f;
b:500?1f;
data:([]x:a;x1:b;y:a+b);

// Define optional model fitting parameters
optKeys:`model`registry`experiment`modelArgs;
optVals:("sgdLR";::;::;(1b; `maxIter`gTol`seed!(100;-0w;42)));
opt:optKeys!optVals;

// Define and run a stream processor pipeline using the fit model operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.fit[
    `x`x1;
    `y;
    .ml.online.sgd.linearRegression;
    "q";
    `yhat;
    .qsp.use opt
    ]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

// Call the get model store function to show the model has been saved to the registry
.ml.registry.get.modelStore[::;::]

Example 2: Fit an sklearn model.

// Generate initial data to be used for fitting
data:([]x:asc 100?1f;x1:100?1f;y:desc 100?5);

// Define an untrained sklearn model
rfc:.p.import[`sklearn.ensemble][`:RandomForestClassifier][`max_depth pykw 2];

// Define and run a stream processor pipeline using the fit model operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.fit[
    `x`x1;
    `y;
    rfc;
    "sklearn";
    `yhat
    ]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

Example 3: Fit an unsupervised q model.

// Generate initial data to be used for fitting
data:([]x:1000?1f;x1:1000?1f;x2:1000?1f);

// Define and run a stream processor pipeline using the fit model operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.fit[
    `x`x1`x2;
    ::;
    .ml.clust.kmeans;
    "q";
    `cluster;
    .qsp.use enlist[`modelArgs]!enlist(`e2dist;3;::)
    ]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data;

Example 4: Fit a model while passing functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data: ([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);

// Define the functions that will be passed as the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
  delete x,x1,x2 from data  // this is the same as 'select y from data' as data only has 4 columns
  };
predFunc: {[data;y;predictions;modelInfo]
  update newPred: predictions from data
  };

// Define and run a stream processor pipeline using the fit model operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.fit[
    xFunc;
    yFunc;
    .ml.online.sgd.linearRegression;
    "q";
    predFunc;
    .qsp.use enlist[`modelArgs]!enlist(1b; `maxIter`gTol`seed!(100;-0w;42))
    ]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data

.qsp.ml.registry.predict

Predict a target variable using a model

.qsp.ml.registry.predict[X;prediction]
.qsp.ml.registry.predict[X;prediction; .qsp.use (!) . flip (
    (`model     ; model);
    (`registry  ; registry);
    (`experiment; experiment);
    (`version   ; version))]

Parameters:

name type description default
X symbol[], ::, or function Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to ::, all non-categorical columns are used. Required
prediction symbol, ::, or function Can be the name of the column which is to house the model's predicted class/cluster/target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to ::, the symbol prediction is used. Required

options:

name type description default
model string Name of the fitted model we want to load from the registry. If set to ::, the most recently uploaded model will be loaded. ::
registry string Location of the registry where the fitted model is to be loaded from. This can be a local path or a cloud storage path. If set to ::, the local registry in the present working directory will be used. ::
experiment string Name of the experiment in the registry that the fitted model we want to load is stored under. If set to ::, the model will be loaded from unnamedExperiments. ::
version float Version of the fitted model we want to load from the registry. If set to ::, the latest version of the model will be loaded. ::

For all common arguments, refer to configuring operators

Returns:

type description
any Returns the input data with an additional column containing the model's predicted label values for each data point.

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name type description
data any Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter take four arguments:

name type description
data any Batch passed to the operator, only the data not the metadata.
y symbol, function, or :: Target variable supplied to the model as the y parameter.
predictions list Model's predictions for each record in the batch.
modelInfo :: Information about the model. Currently not used and always set to ::.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

.qsp.ml.registry.predict will predict the target value for each record in the batch, using a model from the registry.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Get predictions from an sklearn model.

// Generate initial data to be used for fitting
n:100000;
data:([]x:asc n?1f;x1:desc n?10;x2:n?1f;y:asc n?5);

// Define and fit an sklearn model
features:flip value flip delete y from data;
clf1:.p.import[`sklearn.tree]`:DecisionTreeClassifier;
clf1:clf1[`max_depth pykw 3];
clf1[`:fit][features;data`y];

// Set the model within the existing registry
.ml.registry.set.model[::;::;clf1;"skModel";"sklearn";::];

// Define and run a stream processor pipeline using the predict model operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.predict[
    `x`x1`x2;
    `yhat;
    .qsp.use enlist[`model]!enlist["skModel"]
    ]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to get predictions from the model
publish data;

Example 2: Get predictions from a q model.

// Generate initial data to be used for fitting
n:100000;
data:([]x:n?1f;x1:n?1f;x2:n?1f);

// Define and fit a q model
features:data`x`x1`x2;
kmeansModel:.ml.clust.kmeans.fit[features;`e2dist;6;enlist[`iter]!enlist[1000]];

// Set the model within the existing registry
.ml.registry.set.model[::;::;kmeansModel;"kmeansModel";"q";::];

// Define and run a stream processor pipeline using the predict model operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.predict[
    `x`x1`x2;
    `yhat;
    .qsp.use enlist[`model]!enlist["kmeansModel"]
    ]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to get predictions from the model
publish data;

Example 3: Get predictions from a q model by passing functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data: ([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);

// Define the functions to be passed to the model arguments
xFunc: {[data]
 select x, x1 from data
 };
clustFunc: {[data;clusters;modelInfo]
  update newClust: clusters from data
  };

// Define and run a stream processor pipeline using the predict model operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.predict[
    xFunc;
    clustFunc;
    .qsp.use enlist[`model]!enlist["kmeansModel"]
    ]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to get predictions from the model
publish data

.qsp.ml.registry.update

Train a model incrementally, returning predictions for each record in a batch

.qsp.ml.registry.update[X;y;prediction]
.qsp.ml.registry.update[X;y;prediction; .qsp.use (!) . flip (
    (`untrained    ; untrained);
    (`modelArgs    ; modelArgs);
    (`model        ; model);
    (`registry     ; registry);
    (`experiment   ; experiment);
    (`version      ; version);
    (`config       ; config))]

Parameters:

name type description default
X symbol[], ::, or function Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use. If set to ::, all non-categorical and non-target columns are used. Required
y symbol or function Can be the name of the column containing the data’s target labels OR a user-defined function that returns the target values to use. Required
prediction symbol, ::, or function Can be the name of the column which is to house the model’s predicted class/cluster/target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable. If set to ::, the symbol prediction is used. Required

options:

name type description default
untrained function or embedpy An untrained q model that we want to update. If set to (), the registry related parameters will be used to load a model from the registry to be updated. ()
bufferSize long Number of records to observe before updating the model. If set to 0, the model will be updated on the first batch. Minimum value is 0. 0
modelArgs list List of arguments passed to the model to help configure the updating process. ::
model string Name of the model to be loaded from/stored in the registry. If no value is supplied, the model will not be loaded from/stored in the registry. ::
registry string Location of the registry where the model to be loaded is stored/the updated model is to be stored. This can be a local path or a cloud storage path. If set to ::, a local registry will be created in the present working directory for the model to be stored. ::
experiment string Name of the experiment in the registry that the model is to be loaded from/the updated model is to be stored under. If no value is supplied, the model will be loaded from/stored under unnamedExperiments. ::
version long[] Version of the model we want to load from the registry. If set to ::, the latest version of the model will be loaded. ::
config any Dictionary used to configure additional settings when saving the model to the registry. ()!()

For all common arguments, refer to configuring operators

Returns:

type description
any The current batch, modified in accordance with the prediction parameter.

Passing functions as the values for the model parameters

Functions can be passed as the value for the X, y, or prediction model parameters. These functions have two different forms depending on whether they are values for the X and y parameter or for the prediction parameter.

Functions for the X and y model parameters take one argument:

name type description
data any Batch passed to the operator, only the data not the metadata.

This function is used to extract lists of values from the input data and takes the following form:

func: {[data]
  ...
  }

Functions for the prediction model parameter take four arguments:

name type description
data any Batch passed to the operator, only the data not the metadata.
y symbol, function, or :: Target variable supplied to the model as the y parameter.
predictions list Model's predictions for each record in the batch.
modelInfo :: Information about the model. Currently not used and always set to ::.

This function is used to add a set of aggregate predictions to the output table and takes the following form:

func: {[data;y;predictions;modelInfo]
  ...
  }

select, exec, update, and delete statements can be used in these functions to return a list or table which will be used as the value for whichever model parameter the function is passed as.

Trains a model incrementally, returning predictions for each record in a batch.
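
To illustrate what incremental training means, the sketch below applies one gradient step of least-squares linear regression per batch in plain q. It is a generic illustration, not the implementation behind this operator or behind .ml.online.sgd.linearRegression: sgdStep is a hypothetical helper, and the learning rate, sample data, and number of steps are arbitrary choices.

```q
// Hypothetical helper for illustration; not the operator's implementation
// One gradient step of least-squares linear regression on a single batch
//   lr: learning rate, w: weights (bias first), X: list of feature rows, y: targets
sgdStep:{[lr;w;X;y]
  // Prepend a bias term to every row
  Xb:1f,/:X;
  // Prediction error for this batch under the current weights
  err:(Xb mmu w)-y;
  // Move the weights against the mean gradient
  w-lr*(flip[Xb] mmu err)%count y
  };

// A tiny batch generated by y = x0 + x1
X:(0 0f;1 0f;0 1f;1 1f);
y:0 1 1 2f;

// Start from zero weights and feed the batch in 200 times, standing in for a stream
w:sgdStep[0.5;;X;y]/[200;3#0f];

// Predictions for the batch using the updated weights
yhat:(1f,/:X) mmu w;
```

Each call consumes one batch and returns updated weights, so the model improves as batches arrive instead of being fit once. After these steps w is close to 0 1 1, the bias and coefficients used to generate the data.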

Python support

Currently this functionality is only supported for q models. Support for deployment of online learning models written in Python is scheduled for a later release.

Examples:

The following examples outline the use of the functionality described above.

Example 1: Fit an untrained q model which can be updated.

// Generate initial data to be used for fitting
a:500?1f;
b:500?1f;
data:([]x:a;x1:b;y:a+b);

// Define and run a stream processor pipeline using the update model operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.update[
    `x`x1;
    `y;
    `yhat;
    .qsp.use enlist[`untrained]!enlist[.ml.online.sgd.linearRegression]
    ]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to update the model
publish data;

Example 2: Fit an untrained model by passing functions as the model arguments.

// Generate initial data to be used for fitting
n:100000;
data: ([] x:asc n?1f; x1:n?1f; x2:n?1f; y:asc n?0b);

// Define the functions to be passed to the model arguments
xFunc: {[data]
 select x, x1 from data
 };
yFunc: {[data]
  delete x,x1,x2 from data  // this is the same as 'select y from data' as data only has 4 columns
  };
predFunc: {[data;y;predictions;modelInfo]
  update newPred: predictions from data
  };

// Define and run a stream processor pipeline using the update model operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.update[
    xFunc;
    yFunc;
    predFunc;
    .qsp.use enlist[`untrained]!enlist[.ml.online.sgd.linearRegression]
    ]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to fit the model
publish data

Example 3: Update a q model from the registry.

// Generate initial data to be used for fitting
n:100000;
data:([]x:n?1f;x1:n?1f;x2:n?1f);

// Generate data to be used for updating
n:100000;
data2:([]x:n?1f;x1:n?1f;x2:n?1f);

// Define and fit a q model
features:data`x`x1`x2;
kmeansModel:.ml.clust.kmeans.fit[features;`e2dist;6;enlist[`iter]!enlist[1000]];

// Set the model within the existing registry
.ml.registry.set.model[::;::;kmeansModel;"kmeansModel";"q";::];

// Define optional model fitting parameters
optKeys:`model`registry`experiment;
optVals:("kmeansModel";::;::);
opt:optKeys!optVals;

// Define and run a stream processor pipeline using the update model operator
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.update[
    `x`x1;
    `y;
    `yhat;
    .qsp.use opt
    ]
  .qsp.write.toConsole[];

// Pass a batch of data to the stream processor to update the model
publish data2;

// Call the get model store function to show the original and updated models have been saved to the registry
.ml.registry.get.modelStore[::;::]