Machine Learning

.qsp.ml

Fresh
freshCreate	turns batches of data into features based on aggregated statistics

Classification
adaBoostClassifier	fits an AdaBoost classification model
decisionTreeClassifier	fits a decision tree classification model
gaussianNB	fits a Gaussian naive Bayes model
kNeighborsClassifier	fits a k-nearest neighbors classification model
logClassifier	fits a logistic classification model using stochastic gradient descent
quadraticDiscriminantAnalysis	fits a quadratic discriminant analysis model
randomForestClassifier	fits a random forest classification model

Clustering
affinityPropagation	fits an affinity propagation clustering model
birch	fits a BIRCH clustering model
cure	fits a CURE clustering model
dbscan	fits a DBSCAN clustering model
sequentialKMeans	fits a sequential k-means model

Regression
adaBoostRegressor	fits an AdaBoost regression model
gradientBoostingRegressor	fits a gradient boosting regression model
kNeighborsRegressor	fits a k-nearest neighbors regression model
lasso	fits a lasso linear regression model
linearRegression	fits a linear regression model
randomForestRegressor	fits a random forest regression model

Metrics
score	evaluates a model's predictions

Preprocessing
dropConstant	drops constant columns from incoming data
featureHasher	encodes categorical data as numeric vectors
labelEncode	encodes symbolic data into numerical values
minMaxScaler	min-max scales a supplied dataset
oneHot	replaces symbolic values with numerical vector representations
standardize	standardizes a supplied dataset

Registry
registry.fit	fits a model to batches of data, saving a model to a registry
registry.predict	predicts a target variable using a trained model from the registry
registry.update	trains a model incrementally, returning predictions for all records

Note: All ML operators act solely on unkeyed tables (type 98).

`.qsp.ml.freshCreate`

Turns batches of data into features using aggregated statistics

.qsp.ml.freshCreate[X;features]
.qsp.ml.freshCreate[X;features;.qsp.use enlist[`warn]!enlist warn]

Parameters:

name	type	description	default
`X`	`symbol or symbol[]`	Name of the column(s) in the data to use for FRESH feature generation.	Required
`features`	`:: or symbol or symbol[]`	Name of the FRESH feature(s) we want to define from the data. A full list of these features can be found here.	Required

options:

name	type	description	default
`warn`	`boolean`	Whether to show warnings: `1b` shows them, `0b` suppresses them.	`0b`

Returns:

type	description
`table`	Returns a table containing the specified aggregated FRESH feature columns for each selected column in the input table.

Converts each chosen column into a collection of feature values based on the supplied FRESH features. Typically, the operator is intended to be used in conjunction with the windowing operators that provide regular batches of data from which we engineer features. The aggregate statistics used to create these features can be as simple as max/min/count.

The `features` parameter also accepts three special values:

`::` - all features are applied.
`noHyperparameters` - all features except those that take hyperparameters are applied.
`noPython` - all features that don't rely on Python are applied.

As this aggregates a batch to a single row of aggregated statistics, the output table does not include the original columns.
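To make the shape of the output concrete, the following plain q sketch collapses one batch into a single row of aggregates, without using the operator. It assumes that `absEnergy` is the sum of squared values, and the output column names `xAbsEnergy` and `xMax` are hypothetical; the names produced by the operator may differ.

```q
// One batch of data, as a windowing operator might deliver it (illustrative values)
batch:([] x:1 2 3 4 5f);

// Collapse the batch into a single row of aggregated statistics.
// absEnergy is assumed here to be the sum of squared values.
// The column names xAbsEnergy and xMax are hypothetical.
features:select xAbsEnergy:sum x*x, xMax:max x from batch;
```

Note that `features` holds only the aggregates; the original `x` column is not carried through.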

Build two features, absEnergy and max.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.window.tumbling[00:01:00; `time]
  .qsp.ml.freshCreate[`x; `absEnergy`max]
  .qsp.write.toConsole[];

publish ([] time: .z.p+00:00:01 * til 500; x: 500?1f);

Build all features.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.window.count[100]
  .qsp.ml.freshCreate[`x; ::]
  .qsp.write.toVariable[`output];

publish ([] x: 500?1f; y: 500?100);

`.qsp.ml.MLPClassifier`

Multi-Layer Perceptron Classifier

.qsp.ml.MLPClassifier[X;y;udf;.qsp.use (!) . flip (
    (`hiddenLayerSizes; hiddenLayerSizes);
    (`activation      ; activation);
    (`solver          ; solver);
    (`alpha           ; alpha);
    (`batchSize       ; batchSize);
    (`learningRate    ; learningRate);
    (`learningRateInit; learningRateInit);
    (`powerT          ; powerT);
    (`maxIter         ; maxIter);
    (`registry        ; registry);
    (`experiment      ; experiment);
    (`model           ; model);
    (`bufferSize      ; bufferSize))]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or function`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's labels OR a user-defined function that returns the target values to use.	Required
`udf`	`symbol or function`	Can be the name of the column which is to house the model's predicted label values for each data record OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable.	Required

options:

name	type	description	default
`hiddenLayerSizes`	`int[]`	List of the number of neurons in each hidden layer in the neural network. Minimum size of each layer is `1`.	`enlist 100`
`activation`	`string`	Activation function used to transform the output of the hidden layers into a single scalar value. This value can be `identity` to use a linear activation function, `logistic` to use a sigmoid activation function, `tanh` to use a hyperbolic tangent activation function, or `relu` to use a rectified linear unit function.	`relu`
`solver`	`string`	Optimization function used to search for the inputs that minimize/maximize the results of the model function. This value can be `lbfgs` to use a limited-memory BFGS, `sgd` to use stochastic gradient descent, or `adam` to use adaptive moment estimation.	`adam`
`alpha`	`float`	Strength of the L2 regularization term. The L2 regularization term is divided by the sample size when added to the loss function and is used to reduce the chance of model overfitting. Minimum value is `0.0`.	`0.0001`
`batchSize`	`int`	Number of training examples used in each stochastic optimization iteration. Minimum value is `1`.	`auto`
`learningRate`	`string`	Learning rate schedule for updating the weights of the neural network. Only used when the optimization function is set to `sgd`. This value can be `constant` for a constant learning rate, `optimal` for the optimal learning rate, `invscaling` to use an inverse scaling learning rate, or `adaptive` for an adaptive learning rate.	`constant`
`learningRateInit`	`float`	Starting learning rate value. This controls the step-size used when updating the neural network weights. Not used when the optimization function is set to `lbfgs`. Minimum value is `0.0`.	`0.001`
`powerT`	`float`	Exponent used to update the learning rate when the learning rate is set to `invscaling` and the optimization function is set to `sgd`.	`0.5`
`maxIter`	`int`	Maximum number of optimization epochs/iterations. The model will iterate until it converges or until it completes this number of iterations. Minimum value is `1`.	`200`
`registry`	`string`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted class labels.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.
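As an illustration of how `learningRateInit` and `powerT` interact when `learningRate` is `invscaling`, the sketch below computes the effective rate at a few steps in plain q. The formula (initial rate divided by the step number raised to `powerT`) is the one documented for scikit-learn's MLP models; that the underlying Python model here follows it is an assumption, and `invScaling` is a hypothetical helper, not part of the operator.

```q
// Inverse-scaling schedule: effective rate at step t is learningRateInit % t^powerT.
// Formula as documented for scikit-learn's MLP models (assumed to apply here).
invScaling:{[learningRateInit;powerT;t] learningRateInit%t xexp powerT};

// Effective learning rates at steps 1, 4, 16 and 100 using the default option values
rates:invScaling[0.001;0.5;1 4 16 100f];
```

With the default `powerT` of `0.5`, the rate halves each time the step count quadruples.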

Examples: The following spec.q files outline the use of the functionality described above.

Example 1: Fit, Update and Predict with a multi-layer perceptron classifier model

// Generate packet of data
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);

// Running a pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.MLPClassifier[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
  .qsp.write.toConsole[];

First we pass a batch of data to the stream processor to fit the model

q)publish data

Then we can retrieve predictions by passing new data.

`.qsp.ml.adaBoostClassifier`

AdaBoost Classifier

.qsp.ml.adaBoostClassifier[X;y;udf;.qsp.use (!) . flip (
    (`nEstimators ; nEstimators);
    (`learningRate; learningRate);
    (`algorithm   ; algorithm);
    (`registry    ; registry);
    (`experiment  ; experiment);
    (`model       ; model);
    (`bufferSize  ; bufferSize))]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or function`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use.	Required
`udf`	`symbol or function`	Can be the name of the column which is to house the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable.	Required

options:

name	type	description	default
`nEstimators`	`int`	Maximum number of estimators to train in each boosting iteration. Each estimator is fit on the dataset and adjusted to focus on difficult classification cases. If we already have a perfect fit, we will not create this maximum number. Minimum value is `1`.	`50`
`learningRate`	`float`	Controls the loss function used to set the weight of each classifier at each boosting iteration. The higher this value, the more each classifier will contribute to our final model. This value depends highly on the maximum number of estimators. Minimum value is `0.0`.	`1.0`
`algorithm`	`string`	Multi-class AdaBoost function used to extend the AdaBoost operator to have multi-class capabilities. This value can be `SAMME` for stagewise additive modeling or `SAMME.R` for real-valued stagewise additive modeling.	`SAMME.R`
`registry`	`string`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For the full list of acceptable arguments see here.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted class labels.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.
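The buffering behavior described above can be pictured with a small stand-alone q sketch. Everything in it is hypothetical: `process`, `fit`, and the toy majority-label "model" are stand-ins rather than the operator's implementation, and returning an empty list while buffering is an assumption about what is emitted before the model is fitted.

```q
bufferSize:10;
buffer:();    // records held back until the model is fitted
fitted:0b;
model:0b;     // placeholder until fitting happens

// Toy "fit": remember the most frequent label in the buffered records
fit:{[t] first idesc count each group t`y};

// Process one batch: buffer until more than bufferSize records have arrived,
// fit once, then label every later batch with the frozen model
process:{[t]
  if[fitted; :update yHat:model from t];
  buffer::$[count buffer; buffer,t; t];
  if[bufferSize>=count buffer; :()];
  model::fit buffer;
  fitted::1b;
  update yHat:model from buffer}
```

For example, with `bufferSize` set to 10, a first batch of 5 records is only buffered, a second batch of 6 records triggers the fit on all 11 buffered records, and any later batch is labeled without refitting.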

Examples: The following spec.q files outline the use of the functionality described above.

Example 1: Fit, Update and Predict with an adaBoost classification model

// Generate packet of data
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);

// Running a pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.adaBoostClassifier[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
  .qsp.write.toConsole[];

First we pass a batch of data to the stream processor to fit the model

q)publish data

Then we can retrieve predictions by passing new data.

`.qsp.ml.decisionTreeClassifier`

Decision Tree Classifier.

.qsp.ml.decisionTreeClassifier[X;y;udf;.qsp.use (!) . flip (
    (`criterion      ; criterion);
    (`splitter       ; splitter);
    (`maxDepth       ; maxDepth);
    (`minSamplesSplit; minSamplesSplit);
    (`minSamplesLeaf ; minSamplesLeaf);
    (`registry       ; registry);
    (`experiment     ; experiment);
    (`model          ; model);
    (`bufferSize     ; bufferSize))]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or function`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use.	Required
`udf`	`symbol or function`	Can be the name of the column which is to house the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable.	Required

options:

name	type	description	default
`criterion`	`string`	Criteria function used to measure the quality of a split each time a decision tree node is split into children. This can be `gini`, to use the Gini impurity measure, or `entropy`, to use the information gain measure.	`gini`
`splitter`	`string`	Strategy used to split the nodes in the tree. This can be `best` to choose the best split or `random` to choose the best random split.	`best`
`maxDepth`	`int`	Maximum depth of the decision tree, measured as the longest path from the tree root to a leaf. If set to `::`, the tree will expand until all leaves are pure or contain fewer records than the `minSamplesSplit` value.	`::`
`minSamplesSplit`	`int`	Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is `2`.	`2`
`minSamplesLeaf`	`int`	Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is `1`.	`1`
`registry`	`string or dict`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`config`	`dict`	Configurations used for fitting the model.	`()!()`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted class labels.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.
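To show what the two `criterion` choices measure, here is a minimal q sketch of Gini impurity and entropy for the labels at a single node. These are the standard textbook definitions; `props`, `gini`, and `entropy` are hypothetical helper names, and the sketch makes no claim about how the underlying model computes them internally.

```q
// Proportion of each class among the labels at a node
props:{[y] value (count each group y)%count y};

// Gini impurity: 1 minus the sum of squared class proportions
gini:{[y] p:props y; 1-sum p*p};

// Entropy in bits: negative sum of p*log2(p) over the classes
entropy:{[y] p:props y; neg sum p*2 xlog p};

mixed:0011b;   // evenly split node
pure:1111b;    // node containing a single class
```

Both measures are zero for `pure` and largest for an even split such as `mixed`, which is why a split that lowers them is considered an improvement.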

Examples:

The following spec.q files outline the use of the functionality described above.

Example 1: Fit a decision tree classifier model on data and store the model in local registry.

// Generate packet of data
n:1000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);

// Running a pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.decisionTreeClassifier[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"DTCModel")]
  .qsp.write.toConsole[];

First we pass a batch of data to the stream processor to fit the model

q)publish data

We can see that the model is saved by calling the get model store function.

q).ml.registry.get.modelStore["/tmp";::]

Then we can retrieve predictions by passing new data

`.qsp.ml.gaussianNB`

Gaussian Naive Bayes

.qsp.ml.gaussianNB[X;y;udf;.qsp.use (!) . flip (
    (`priors      ; priors);
    (`varSmoothing; varSmoothing);
    (`registry    ; registry);
    (`experiment  ; experiment);
    (`model       ; model);
    (`bufferSize  ; bufferSize))]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or function`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use.	Required
`udf`	`symbol or function`	Can be the name of the column which is to house the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable.	Required

options:

name	type	description	default
`priors`	`float[]`	List of the prior probabilities for each class. This refers to the probability that a random data record is an instance of the given class before any evidence or other factors are considered. Minimum value for each prior is `0.0`. If set to `::`, the priors will be adjusted according to the data.	`::`
`varSmoothing`	`float`	Value added to the Gaussian distribution's variance to widen the curve and account for more samples further away from the distribution's mean. Minimum value is `0`.	`1e-9`
`registry`	`string`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For the full list of acceptable arguments see here.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted class labels.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.
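When `priors` is left as `::`, the priors are derived from the data. A minimal q sketch of one way to do this, assuming the priors are simply the observed class frequencies (the convention used by scikit-learn's Gaussian naive Bayes; that this operator does the same is an assumption):

```q
// Illustrative training labels
y:`a`a`a`b;

// Empirical priors: the fraction of records belonging to each class
priors:(count each group y)%count y;
```

Here `priors` is a dictionary keyed by class label whose values sum to 1.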

Examples: The following spec.q files outline the use of the functionality described above.

Example 1: Fit, Update and Predict with a gaussian naive bayes model

// Generate packet of data
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);

// Running a pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.gaussianNB[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
  .qsp.write.toConsole[];

First we pass a batch of data to the stream processor to fit the model

q)publish data

Then we can retrieve predictions by passing new data.

`.qsp.ml.kNeighborsClassifier`

K-Nearest Neighbors Classifier

.qsp.ml.kNeighborsClassifier[X;y;udf;.qsp.use (!) . flip (
    (`nNeighbors; nNeighbors);
    (`weights   ; weights);
    (`algorithm ; algorithm);
    (`leafSize  ; leafSize);
    (`p         ; p);
    (`metric    ; metric);
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`bufferSize; bufferSize))]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or function`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use.	Required
`udf`	`symbol or function`	Can be the name of the column which is to house the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable.	Required

options:

name	type	description	default
`nNeighbors`	`int`	Number of already classified points, which lie closest to a given unclassified point (neighbors), to factor in when predicting the point's class. Minimum value is `1`.	`5`
`weights`	`string`	Weight function used to decide how much weight to give to the classes of each of the neighboring points when predicting a point's class. Can be `uniform` to weight each neighbor's class equally or `distance` to weight each neighbor's class based on its distance to the point.	`uniform`
`algorithm`	`string`	Algorithm used to parse the vector space and decide which points are the nearest neighbors to a given unclassified point. This algorithm can be a `ball_tree` algorithm, `kd_tree` algorithm, `brute` force distance measure approach, or an `auto` choice based on the data.	`auto`
`leafSize`	`int`	If `ball_tree` or `kd_tree` is selected as the `algorithm`, this is the minimum number of points in a given leaf node, after which point, brute force algorithm will be used to find the nearest neighbors. Setting this value either very close to `1` or very close to the total number of points in the data may have a noticeable impact on model runtime. Minimum value is `1`.	`30`
`p`	`int`	Power parameter used when the distance metric `minkowski` is selected. Minimum value is `0`.	`2`
`metric`	`string`	Distance metric used to measure the distance between points. This value can be `minkowski`, `euclidean`, `manhattan`, etc.	`minkowski`
`registry`	`string`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For the full list of acceptable arguments see here.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted class labels.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.
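The relationship between the `metric` and `p` options can be seen in a short q sketch of the Minkowski distance, using its standard definition. The function name `minkowski` is a hypothetical helper for illustration only.

```q
// Minkowski distance of order p between two points a and b:
// the p-th root of the sum of absolute coordinate differences raised to p
minkowski:{[p;a;b] (sum abs[a-b] xexp p) xexp 1%p};

a:0 0f;
b:3 4f;
dEuclidean:minkowski[2;a;b];   // p=2 corresponds to the euclidean metric
dManhattan:minkowski[1;a;b];   // p=1 corresponds to the manhattan metric
```

With the default `p` of `2`, the `minkowski` metric therefore reduces to the familiar Euclidean distance.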

Examples: The following spec.q files outline the use of the functionality described above.

Example 1: Fit, Update and Predict with a k-nearest neighbors classification model

// Generate packet of data
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);

// Running a pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.kNeighborsClassifier[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
  .qsp.write.toConsole[];

First we pass a batch of data to the stream processor to fit the model

q)publish data

Then we can retrieve predictions by passing new data.

`.qsp.ml.logClassifier`

Logistic classifier fit using stochastic gradient descent

.qsp.ml.logClassifier[X;y;udf]
.qsp.ml.logClassifier[X;y;udf; .qsp.use (!) . flip (
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`config    ; config);
    (`trend     ; trend);
    (`alpha     ; alpha);
    (`maxIter   ; maxIter);
    (`gTol      ; gTol);
    (`seed      ; seed);
    (`penalty   ; penalty);
    (`lambda    ; lambda);
    (`l1Ratio   ; l1Ratio);
    (`decay     ; decay);
    (`p         ; p);
    (`bufferSize; bufferSize))]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or function`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use.	Required
`udf`	`symbol or function`	Can be the name of the column which is to house the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable.	Required

options:

name	type	description	default
`registry`	`string`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`config`	`any`	Configuration used for fitting the model.	`()!()`
`trend`	`boolean`	Whether to add a constant value (intercept) to the classification function - `c` in `y=mx+c`.	`1b`
`alpha`	`float`	Learning rate value used in the optimization function to dictate the step size taken towards the minimum of the loss function at each iteration. A high value will override information about previous data more in favor of newly acquired information. Generally, this value is set to be very small. Minimum value is `0.0`.	`0.01`
`maxIter`	`long`	Maximum number of iterations before model training is terminated. The model will iterate until it converges or until it completes this number of iterations. Minimum value is `1`.	`100`
`gTol`	`float`	Tolerance value required to stop searching for the global minimum/maximum value. This is achieved once you get close enough to this global value. Minimum value is `0.0`.	`1e-5`
`seed`	`long`	Integer value used to control the randomness of the model's initialization state. Specifying this allows for reproducible results across function calls. If a value is not supplied, the randomness is based on the current timestamp.	`0`
`penalty`	`symbol`	Penalty term used to shrink the coefficients of the less contributive variables. Can be `l1` to add an L1 penalty term, `l2` to add an L2 penalty term, or `elasticNet` to add both L1 and L2 penalty terms.	`l2`
`lambda`	`float`	Lambda value used to define the strength of the regularization applied. The higher this value is, the stronger the regularization will be. Minimum value is `0.0`.	`0.001`
`l1Ratio`	`float`	If `elasticNet` is chosen as the penalty, this parameter determines the balance between the L1 and L2 penalty terms. Setting this value to `0` is the same as using L2 regularization; setting it to `1` is the same as using L1 regularization. This value must lie in the range `[0.0, 1.0]`.	`0.5`
`decay`	`float`	Describes how much weight to give to historical predictions from previously fit iterations. The higher this value, the less important historic predictions will be. Minimum value is `0.0`.	`0f`
`p`	`float`	Coefficient used to help accelerate the gradient vectors in the right direction, leading to faster convergence. Minimum value is `0.0`.	`0f`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted class labels.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. As this is an online model, if subsequent data is passed to the stream, each new collection of data points will be used to update the classifier model and a prediction will be made for each record.
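To illustrate how the `penalty`, `lambda`, and `l1Ratio` options relate, the q sketch below evaluates an elastic-net style penalty for a weight vector. It uses one common convention (a weighted sum of the L1 norm and the squared L2 norm); the exact form used by the operator is not given in this document, so the formula, the helper name `elasticNet`, and the parameter name `lam` (standing in for the `lambda` option) are assumptions.

```q
// Elastic-net style penalty: lam * (l1Ratio*L1 + (1-l1Ratio)*L2),
// where L1 is the sum of absolute weights and L2 the sum of squared weights.
// One common convention; assumed for illustration only.
elasticNet:{[lam;l1Ratio;w] lam*(l1Ratio*sum abs w)+(1-l1Ratio)*sum w*w};

w:3 -4f;                          // illustrative model coefficients
onlyL1:elasticNet[0.001;1f;w];    // l1Ratio of 1 leaves only the L1 term
onlyL2:elasticNet[0.001;0f;w];    // l1Ratio of 0 leaves only the L2 term
mixed:elasticNet[0.001;0.5;w];    // equal mix of both terms
```

A larger `lambda` scales the whole penalty up, while `l1Ratio` only shifts the balance between the two terms.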

Performance Limitations

This functionality is not currently recommended for use in high-throughput environments. Prediction times for this function are on the order of milliseconds. Further optimizations are expected in later releases.

Fit, update, and predict with a logistic classification model.

// Generate packet of data
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);

// Running a pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.logClassifier[`x;`y;`yHat; .qsp.use `trend`bufferSize!(1b;1000)]
  .qsp.write.toVariable[`output];

// Data will be buffered for training until the buffer size is reached,
// during which time no batches will be emitted.
publish data;

// When the buffer size is reached, buffered data will be used for training,
// and will itself be classified and emitted.
publish data;

// The operator can now be used to make predictions.
// As this is an online model, subsequent batches are also used to update it.
publish data;

`.qsp.ml.quadraticDiscriminantAnalysis`

Quadratic Discriminant Analysis

.qsp.ml.quadraticDiscriminantAnalysis[X;y;udf;.qsp.use (!) . flip (
    (`priors    ; priors);
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`bufferSize; bufferSize))]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or function`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's class labels OR a user-defined function that returns the target values to use.	Required
`udf`	`symbol or function`	Can be the name of the column which is to house the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable.	Required

options:

name	type	description	default
`priors`	`float[]`	List of the prior probabilities for each class. This refers to the probability that a random data record is an instance of the given class before any evidence or other factors are considered. Minimum value for each prior is `0.0`. If set to `::`, the priors will be adjusted according to the data.	`::`
`registry`	`string`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying Python model to initialize it. For a full list of acceptable arguments, see the documentation of the underlying model.	`()!()`
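
As an illustration of what data-driven priors can look like, the sketch below computes class frequencies from a label column. `empiricalPriors` is a hypothetical helper, not part of the `.qsp` API, and treating class frequencies as the adjusted priors is an assumption rather than a statement of how the operator derives them.

```q
// Hypothetical helper (not part of .qsp): class frequencies as one plausible
// data-driven choice of priors. y is a list of class labels.
empiricalPriors:{[y] (count each group y)%count y}

// Three records of class 0b and one of class 1b
empiricalPriors 0001b    / 0b -> 0.75, 1b -> 0.25
```

A list such as `0.75 0.25` could then be supplied through the `priors` option when fixed priors are preferred.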

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted class labels.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.
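
The buffering rule described above can be pictured with a small q sketch. `fittedAfter` is a hypothetical helper, not part of the `.qsp` API, and it assumes that "exceeded" means strictly greater than `bufferSize`. Given the record counts of successive batches, it reports whether the model has been fitted once each batch has arrived.

```q
// Hypothetical helper (not part of .qsp): models the buffering rule only.
// bufferSize: value of the bufferSize option
// sizes: record counts of successive batches
// Returns one boolean per batch: 1b once the running record count is
// strictly greater than bufferSize (an assumption about "exceeded").
fittedAfter:{[bufferSize;sizes] bufferSize<sums sizes}

fittedAfter[100;100 50 25]   / 011b
fittedAfter[0;5 5]           / 11b
```

With `bufferSize` set to `0`, the first batch already exceeds the threshold, which agrees with the documented behavior of fitting on the first batch.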

Examples: The following spec.q files outline the use of the functionality described above.

Example 1: Fit and predict with a quadratic discriminant analysis model

// Generate packet of data
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);

// Running a pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.quadraticDiscriminantAnalysis[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
  .qsp.write.toConsole[];

First we pass a batch of data to the stream processor to fit the model

q)publish data

Then we can retrieve predictions by passing new data.

`.qsp.ml.randomForestClassifier`

Random Forest Classifier

.qsp.ml.randomForestClassifier[X;y;udf;.qsp.use (!) . flip (
    (`nEstimators          ; nEstimators);
    (`criterion            ; criterion);
    (`maxDepth             ; maxDepth);
    (`minSamplesSplit      ; minSamplesSplit);
    (`minSamplesLeaf       ; minSamplesLeaf);
    (`minWeightFractionLeaf; minWeightFractionLeaf);
    (`maxFeatures          ; maxFeatures);
    (`maxLeafNodes         ; maxLeafNodes);
    (`minImpurityDecrease  ; minImpurityDecrease);
    (`bootstrap            ; bootstrap);
    (`registry             ; registry);
    (`experiment           ; experiment);
    (`model                ; model);
    (`bufferSize           ; bufferSize))]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or function`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use.	Required
`udf`	`symbol or function`	Can be the name of the column which is to house the model's predicted class labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable.	Required

options:

name	type	description	default
`nEstimators`	`int`	Number of decision tree estimators to train and use in the forest. Minimum value is `1`.	`100`
`criterion`	`string`	Criterion function used to measure the quality of a split each time a decision tree node is split into children. This can be `gini` to use the Gini impurity measure or `entropy` to use the information gain measure.	`gini`
`maxDepth`	`int`	Maximum depth of the decision tree - measured as the longest path from the tree root to a leaf. If set to `::`, the tree will expand until all leaves are pure or contain fewer than `minSamplesSplit` records. Minimum value is `1`.	`::`
`minSamplesSplit`	`int`	Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is `2`.	`2`
`minSamplesLeaf`	`int`	Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is `1`.	`1`
`minWeightFractionLeaf`	`float`	Minimum proportion of sample weight required to be at any leaf node relative to the total weight of all samples in the tree. When the `sample_weight` argument is not set using the `modelInit` parameter, each sample carries equal weight. This value must lie in the range `[0.0, 1.0]`.	`0.0`
`maxFeatures`	`string`	Maximum number of features to consider when looking for the best way to split a node. This value can be `sqrt` for the square root of all features, `log2` for log to the base 2 of all features, or `auto` to automatically select the number of features to consider.	`auto`
`maxLeafNodes`	`int`	Maximum number of leaf nodes in each decision tree. This forces the tree to grow in a best-first fashion with the best nodes based on their relative reduction in impurity. If set to `::`, there may be unlimited leaf nodes. Minimum value is `1`.	`::`
`minImpurityDecrease`	`float`	Minimum impurity decrease value required to split a node. If the tree impurity would not decrease by more than this value, the node will not be split. Minimum value is `0.0`.	`0.0`
`bootstrap`	`boolean`	Whether bootstrap samples are used when building trees. If `0b`, the whole dataset is used to build each tree.	`1b`
`registry`	`string`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying Python model to initialize it. For a full list of acceptable arguments, see the documentation of the underlying model.	`()!()`
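
The two `criterion` values correspond to standard impurity measures. The sketch below gives illustrative q definitions of both for the labels at a single tree node. The names `props`, `gini` and `entropy` are hypothetical helpers introduced here, not part of the `.qsp` API or a description of the operator's internals.

```q
// Illustrative definitions only; not the operator's internals.
// labels: list of class labels at a tree node
// props: proportion of records belonging to each class present
props:{[labels] (count each value group labels)%count labels}
// Gini impurity: 1 minus the sum of squared class proportions
gini:{[labels] 1-sum p*p:props labels}
// Entropy in bits: negative sum of p*log2 p over the classes present
entropy:{[labels] neg sum p*2 xlog p:props labels}

gini 0 0 1 1      / 0.5
entropy 0 0 1 1   / 1f
gini 1 1 1 1      / 0f
```

Both measures are zero for a pure node and largest when the classes are evenly mixed, which is why a split that lowers them is considered an improvement.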

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted class labels.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). This ensures that a minimum of n samples are used to train the model. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on the first n samples.

Examples: The following spec.q files outline the use of the functionality described above.

Example 1: Fit and predict with a random forest classification model

// Generate packet of data
n:100000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?0b);

// Running a pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.randomForestClassifier[`x;`y;`yHat; .qsp.use `registry`bufferSize!("/tmp";10)]
  .qsp.write.toConsole[];

First we pass a batch of data to the stream processor to fit the model

q)publish data

Then we can retrieve predictions by passing new data.

`.qsp.ml.affinityPropagation`

Affinity Propagation Clustering Algorithm.

.qsp.ml.affinityPropagation[X;udf;.qsp.use (!) . flip (
    (`damping        ; damping);
    (`maxIter        ; maxIter);
    (`convergenceIter; convergenceIter);
    (`affinity       ; affinity);
    (`randomState    ; randomState);
    (`registry       ; registry);
    (`experiment     ; experiment);
    (`model          ; model);
    (`config         ; config);
    (`bufferSize     ; bufferSize))]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or function`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use.	Required
`udf`	`symbol or function`	Can be the name of the column which is to house the model's predicted cluster labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable.	Required

options:

name	type	description	default
`damping`	`float`	Provides numerical stabilization and limits oscillations and “overshooting” of parameters by controlling the extent to which the current value is maintained relative to incoming values. This value must lie in the range `[0.5, 1.0)`.	`0.5`
`maxIter`	`int`	Maximum number of iterations before model training is terminated. The model will iterate until it converges or until it completes this number of iterations. Minimum value is `1`.	`200`
`convergenceIter`	`int`	Number of iterations, during which there is no change in the number of estimated clusters, needed to stop the convergence. Minimum value is `1`.	`15`
`affinity`	`string`	Statistical measure used to define similarities between the representative points. This value can be `euclidean` to use negative squared Euclidean distance or `precomputed` to use the values in the data's distance matrix.	`euclidean`
`randomState`	`int`	Integer value used to control the state of the random generator used in this model. Specifying this allows for reproducible results across function calls. If set to `::`, the randomness is based off the current timestamp.	`::`
`registry`	`string or dict`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`config`	`dict`	Configuration used for fitting the model.	`()!()`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying Python model to initialize it. For a full list of acceptable arguments, see the documentation of the underlying model.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted cluster labels.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator outputs the original data table together with clusters added.

Examples:

The following spec.q files outline the use of the functionality described above.

Example 1: Fit an affinity propagation clustering model, storing the result in a registry.

// Generate packet of data
n:1000;
data:([]x:n?1f;x1:n?1f;x2:n?1f);

// Running a pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.affinityPropagation[`x`x1`x2;`cluster; .qsp.use `registry`model!("/tmp";"AffinityPropagationModel")]
  .qsp.write.toConsole[];

First we pass a batch of data to the stream processor to fit the model

q)publish data

We can see that the model is saved by calling the get model store function.

q).ml.registry.get.modelStore["/tmp";::]

Then we can retrieve predictions by passing new data

q)publish ([]5?1f;5?1f;5?1f)
                             | x           x1         x2        cluster
-----------------------------| ----------------------------------------
2022.03.01D09:26:44.376050100| 0.3065473   0.7141816  0.5130882 1
2022.03.01D09:26:44.376050100| 0.5817309   0.6165058  0.2164453 0
2022.03.01D09:26:44.376050100| 0.004154821 0.8229675  0.514663  1
2022.03.01D09:26:44.376050100| 0.7639509   0.07025696 0.1601784 0
2022.03.01D09:26:44.376050100| 0.3417209   0.59064    0.6708373 1

`.qsp.ml.birch`

BIRCH Clustering Algorithm.

.qsp.ml.birch[X;udf;.qsp.use (!) . flip (
    (`threshold      ; threshold);
    (`branchingFactor; branchingFactor);
    (`nClusters      ; nClusters);
    (`registry       ; registry);
    (`experiment     ; experiment);
    (`model          ; model);
    (`config         ; config);
    (`bufferSize     ; bufferSize))]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or function`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use.	Required
`udf`	`symbol or function`	Can be the name of the column which is to house the model's predicted cluster labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable.	Required

options:

name	type	description	default
`threshold`	`float`	Maximum cluster radius allowed for a new sample to be merged into its closest subcluster. If adding this point to the cluster would cause that cluster's radius to exceed this maximum, the new point is not added and instead becomes a new subcluster. Minimum value is `0.0`.	`0.5`
`branchingFactor`	`int`	Maximum number of subclusters in each node in the tree, where each leaf node contains a subcluster. If a new sample arrives causing the number of subclusters to exceed this value for a given node, the node is split into two nodes. Minimum value is `1`.	`50`
`nClusters`	`int`	Final number of clusters to be defined by the model. Minimum value is `2`.	`3`
`registry`	`string or dict`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`config`	`dict`	Configuration for fitting the model.	`()!()`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying Python model to initialize it. For a full list of acceptable arguments, see the documentation of the underlying model.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted cluster labels.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator outputs the original data table together with clusters added.

Examples:

The following spec.q files outline the use of the functionality described above.

Example 1: Fit a BIRCH clustering model, storing the result in a registry.

// Generate packet of data
n:1000;
data:([]x:n?1f;x1:n?1f;x2:n?1f);

// Running a pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.birch[`x`x1`x2;`cluster; .qsp.use `registry`model!("/tmp";"BirchModel")]
  .qsp.write.toConsole[];

First we pass a batch of data to the stream processor to fit the model

q)publish data

We can see that the model is saved by calling the get model store function.

q).ml.registry.get.modelStore["/tmp";::]

Then we can retrieve predictions by passing new data

q)publish ([]5?1f;5?1f;5?1f)
                             | x           x1         x2        cluster
-----------------------------| ----------------------------------------
2022.03.01D09:26:44.376050100| 0.3065473   0.7141816  0.5130882 1
2022.03.01D09:26:44.376050100| 0.5817309   0.6165058  0.2164453 0
2022.03.01D09:26:44.376050100| 0.004154821 0.8229675  0.514663  1
2022.03.01D09:26:44.376050100| 0.7639509   0.07025696 0.1601784 0
2022.03.01D09:26:44.376050100| 0.3417209   0.59064    0.6708373 1

`.qsp.ml.cure`

CURE Clustering Algorithm.

.qsp.ml.cure[X;udf;.qsp.use (!) . flip (
    (`df        ; df);
    (`n         ; n);
    (`c         ; c);
    (`cutDict   ; cutDict);
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`config    ; config);
    (`bufferSize; bufferSize))]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or function`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use.	Required
`udf`	`symbol or function`	Can be the name of the column which is to house the model's predicted cluster labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable.	Required

options:

name	type	description	default
`df`	`symbol`	Distance function used to measure the distance between points when clustering. This can be `edist` for Euclidean distance, `e2dist` for squared Euclidean distance, `nege2dist` for negative squared Euclidean distance, `mdist` for Manhattan distance, or `cshev` for Chebyshev distance.	`edist`
`n`	`int`	Number of representative points to choose from each cluster to compare the similarity of clusters for the purposes of potentially merging them. Minimum value is `1`.	`2`
`c`	`float`	Compression factor used for grouping the representative points together. Minimum value is `0.0`.	`0.0`
`k`	`int`	Final number of clusters to be defined by the model. Minimum value is `2`. The distance used when cutting the dendrogram will be adjusted to fit this number so only specify one of the parameters `k` or `dist`. If set to `::`, the `dist` parameter will be used. If both are set to `::`, the `cutDict` parameter will be used.	`::`
`dist`	`float`	Distance between leaves at which to cut the dendrogram to define the clusters. Minimum value is `0.0`. The number of clusters will be dynamic based on this distance so only specify one of the parameters `k` or `dist`. If set to `::`, the `k` parameter will be used. If both are set to `::`, the `cutDict` parameter will be used.	`::`
`cutDict`	`dict`	A dictionary that defines the cutting algorithm used when splitting the data into clusters. This can be used to define a `k` value or a `dist` value (documentation for these above).	enlist[`k]!enlist 3
`registry`	`string or dict`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`config`	`dict`	Configuration for fitting the model.	`()!()`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
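
For reference, the distance measures named by the `df` option can be written in a few lines of q. The definitions below are illustrative, using the same names as the option values for readability; the operator's own implementations may differ.

```q
// Illustrative definitions of the named distance functions between two
// points x and y (float vectors); the operator's implementations may differ.
edist:{[x;y] sqrt sum d*d:x-y}        / Euclidean
e2dist:{[x;y] sum d*d:x-y}            / squared Euclidean
nege2dist:{[x;y] neg sum d*d:x-y}     / negative squared Euclidean
mdist:{[x;y] sum abs x-y}             / Manhattan
cshev:{[x;y] max abs x-y}             / Chebyshev

edist[0 0f;3 4f]    / 5f
mdist[0 0f;3 4f]    / 7f
cshev[0 0f;3 4f]    / 4f
```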

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted cluster labels.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator outputs the original data table together with clusters added.

Examples:

The following spec.q files outline the use of the functionality described above.

Example 1: Fit a CURE clustering model, storing the result in a registry.

// Generate packet of data
n:1000;
data:([]x:n?1f;x1:n?1f;x2:n?1f);

// Running a pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.cure[`x`x1`x2;`cluster; .qsp.use `registry`model!("/tmp";"cureModel")]
  .qsp.write.toConsole[];

First we pass a batch of data to the stream processor to fit the model

q)publish data

We can see that the model is saved by calling the get model store function.

q).ml.registry.get.modelStore["/tmp";::]

Then we can retrieve predictions by passing new data

q)publish ([]5?1f;5?1f;5?1f)
                             | x           x1         x2        cluster
-----------------------------| ----------------------------------------
2022.03.01D09:26:44.376050100| 0.3065473   0.7141816  0.5130882 1
2022.03.01D09:26:44.376050100| 0.5817309   0.6165058  0.2164453 0
2022.03.01D09:26:44.376050100| 0.004154821 0.8229675  0.514663  1
2022.03.01D09:26:44.376050100| 0.7639509   0.07025696 0.1601784 0
2022.03.01D09:26:44.376050100| 0.3417209   0.59064    0.6708373 1

`.qsp.ml.dbscan`

DBSCAN Clustering Algorithm.

.qsp.ml.dbscan[X;udf;.qsp.use (!) . flip (
    (`df        ; df);
    (`minPts    ; minPts);
    (`eps       ; eps);
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`config    ; config);
    (`bufferSize; bufferSize))]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or function`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use.	Required
`udf`	`symbol or function`	Can be the name of the column which is to house the model's predicted cluster labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable.	Required

options:

name	type	description	default
`df`	`symbol`	Distance function used to measure the distance between points when clustering. This can be `edist` for Euclidean distance, `e2dist` for squared Euclidean distance, `nege2dist` for negative squared Euclidean distance, `mdist` for Manhattan distance, or `cshev` for Chebyshev distance.	`edist`
`minPts`	`int`	Minimum number of points required to be close together before this group of points is defined as a cluster. The distance between these points must be less than or equal to the `eps` parameter. Minimum value is `1`.	`2`
`eps`	`float`	Maximum distance points are allowed to be away from one another to still be classed as close enough to be in the same cluster. Minimum value is `0.0`.	`1.0`
`registry`	`string or dict`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`config`	`dict`	Configuration for fitting the model.	`()!()`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
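
To show how `minPts` and `eps` interact, the sketch below marks the points that have enough close neighbors to seed a cluster. It assumes Euclidean distance and the common DBSCAN convention that a point counts as its own neighbor. `isCore` is a hypothetical helper, not part of the `.qsp` API, and the operator's implementation may apply different conventions.

```q
// Illustrative sketch only; not the operator's implementation.
// Assumes Euclidean distance and that a point counts as its own neighbor.
edist:{[x;y] sqrt sum d*d:x-y}
// pts: list of points (float vectors)
// Returns 1b for each point with at least minPts points within distance eps
isCore:{[pts;minPts;eps] minPts<=sum each eps>=edist/:\:[pts;pts]}

pts:(0 0f;0 1f;5 5f)
isCore[pts;2;1f]    / 110b
```

The first two points lie within `eps` of each other and so satisfy `minPts`, while the third point has no neighbor other than itself.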

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted cluster labels.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator outputs the original data table together with clusters added.

Examples:

The following spec.q files outline the use of the functionality described above.

Example 1: Fit a DBSCAN clustering model, storing the result in a registry.

// Generate packet of data
n:1000;
data:([]x:n?1f;x1:n?1f;x2:n?1f);

// Running a pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.dbscan[`x`x1`x2;`cluster; .qsp.use `registry`model!("/tmp";"dbscanModel")]
  .qsp.write.toConsole[];

First we pass a batch of data to the stream processor to fit the model

q)publish data

We can see that the model is saved by calling the get model store function.

q).ml.registry.get.modelStore["/tmp";::]

Then we can retrieve predictions by passing new data

q)publish ([]5?1f;5?1f;5?1f)
                             | x           x1         x2        cluster
-----------------------------| ----------------------------------------
2022.03.01D09:26:44.376050100| 0.3065473   0.7141816  0.5130882 1
2022.03.01D09:26:44.376050100| 0.5817309   0.6165058  0.2164453 0
2022.03.01D09:26:44.376050100| 0.004154821 0.8229675  0.514663  1
2022.03.01D09:26:44.376050100| 0.7639509   0.07025696 0.1601784 0
2022.03.01D09:26:44.376050100| 0.3417209   0.59064    0.6708373 1

`.qsp.ml.sequentialKMeans`

Sequential K-Means clustering.

.qsp.ml.sequentialKMeans[X]
.qsp.ml.sequentialKMeans[X; .qsp.use (!) . flip (
    (`df        ; df);
    (`k         ; k);
    (`centers   ; centers);
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`config    ; config);
    (`init      ; init);
    (`alpha     ; alpha);
    (`forgetful ; forgetful);
    (`bufferSize; bufferSize))]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or function`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use.	Required

options:

name	type	description	default
`df`	`symbol`	Distance function used to measure the distance between points when clustering. This can be `edist` for Euclidean distance, `e2dist` for squared Euclidean distance, `nege2dist` for negative squared Euclidean distance, `mdist` for Manhattan distance, or `cshev` for Chebyshev distance.	`edist`
`k`	`long`	Final number of clusters to be defined by the model. Minimum value is `2`.	`3`
`centers`	`dictionary or ::`	A dictionary mapping each cluster to the cluster centroid value that we want these clusters to initialize with.	`::`
`registry`	`string`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`config`	`any`	Configuration used for fitting the model.	`()!()`
`init`	`bool`	Initialization method for the cluster centroids. This value can either be K-means++ (`1b`) or randomized initialization (`0b`).	`1b`
`alpha`	`float`	Controls the rate at which the concept of forgetfulness is applied within the algorithm. If forgetful Sequential K-Means is applied, this value defines how much past cluster centroid information is retained; otherwise, the rate is set to `1/(n+1)`, where `n` is the number of points in a given cluster. This value must lie in the range `[0.0, 1.0]`.	`0.1`
`forgetful`	`bool`	Whether to apply forgetful Sequential K-Means (`1b`) or normal Sequential K-Means (`0b`). Forgetful Sequential K-Means will allow the model to evolve its cluster boundaries over time by forgetting about old data as new data comes in.	`1b`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`

Returns:

type	description
`table or ::`	Null during initial fitting. Afterwards returns the input data with an additional column containing the model's predicted cluster labels.

The sequential K-Means algorithm is applied within a streaming framework. When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. As this is an online model, if subsequent data is passed to the stream, each new collection of data points is used to update the current cluster centers, and predictions are made as to which cluster each point belongs.
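
The effect of `alpha` and `forgetful` on a cluster center can be sketched with the textbook sequential K-Means update, in which a center moves a fraction of the way towards each new point. `upd` is a hypothetical helper, and whether the operator uses exactly this form is an assumption.

```q
// Textbook sequential K-Means centroid update, for illustration only;
// whether the operator uses exactly this form is an assumption.
// c: current centroid, x: new point, a: update rate
upd:{[c;x;a] c+a*x-c}

upd[0 0f;1 2f;0.1]      / forgetful, alpha 0.1: 0.1 0.2
upd[0 0f;1 2f;1%1+3]    / non-forgetful, cluster of n=3 points: 0.25 0.5
```

With a fixed rate, recent points keep the same influence however much data has been seen, so the centers can drift over time. With the rate `1/(n+1)`, each new point matters less as the cluster grows.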

Examples:

Fit, update, and predict with the sequential K-Means model.

// Running a pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.sequentialKMeans[`x`x1`x2; .qsp.use enlist[`bufferSize]!enlist 100]
  .qsp.write.toConsole[];

publish ([]100?1f;100?1f;100?1f);
publish ([] 50?1f; 50?1f; 50?1f);

`.qsp.ml.adaBoostRegressor`

AdaBoost Regressor.

.qsp.ml.adaBoostRegressor[X;y;udf]
.qsp.ml.adaBoostRegressor[X;y;udf;.qsp.use (!) . flip (
    (`nEstimators ; nEstimators);
    (`learningRate; learningRate);
    (`loss        ; loss);
    (`registry    ; registry);
    (`experiment  ; experiment);
    (`model       ; model);
    (`modelInit   ; modelInit);
    (`bufferSize  ; bufferSize))]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or function`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use.	Required
`udf`	`symbol or function`	Can be the name of the column which is to house the model's predicted target labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable.	Required

options:

name	type	description	default
`nEstimators`	`int`	Maximum number of estimators to train in each boosting iteration. Each estimator is fit on the dataset and adjusted to focus on difficult prediction cases. If we already have a perfect fit, we will not create this maximum number. Minimum value is `1`.	`50`
`learningRate`	`float`	Weight applied to each regressor at each boosting iteration. The higher this value, the more each regressor will contribute to our final model. This value depends highly on the maximum number of estimators. This value must lie in the range `(0.0, inf)`.	`1.0`
`loss`	`string`	Loss function used to update the contributing weights of the regressors after each boosting iteration. This can be `linear` for linear loss, `square` for mean squared error, or `exponential` for exponential loss.	`"linear"`
`registry`	`string or dict`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here.	`()!()`
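
As a hedged illustration of how the options in this table might be combined, the sketch below passes a few of them through `.qsp.use`, following the call signature shown above. The option values are arbitrary choices for illustration, and writing `nEstimators` as a q int (`100i`) to match the table's `int` type is an assumption rather than a documented requirement.

```q
// Sketch only: option values are illustrative, not recommendations
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.adaBoostRegressor[`x`x1`x2;`y;`yHat;.qsp.use (!) . flip (
    (`nEstimators ; 100i);      // assumed to be passed as a q int, per the table's type column
    (`learningRate; 0.5);
    (`loss        ; "square");
    (`bufferSize  ; 500))]
  .qsp.write.toConsole[];
```

With these settings, fitting would be deferred until more than 500 records have been received, as described for `bufferSize`.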

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted target values.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the `bufferSize` option (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on that buffered data.
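
The buffering rule can be sketched in plain q. The `batchActions` helper below is hypothetical and is not part of the `.qsp.ml` API; it takes the word "exceeded" literally (a strict comparison, which is an assumption) and reports what would happen to each batch in a sequence of batch sizes.

```q
// Hypothetical helper (not part of .qsp.ml): label each incoming batch
// n     - value of the bufferSize option
// sizes - number of records in each successive batch
batchActions:{[n;sizes]
  total:sums sizes;                           // running count of records received
  fitIdx:count[sizes]^first where total>n;    // first batch pushing the count past n, if any
  i:til count sizes;
  ?[i<fitIdx;`buffered;?[i=fitIdx;`fit;`predict]]
 }

batchActions[1500;1000 1000 5]    // batch 2 triggers the fit; batch 3 is prediction only
batchActions[0;1000 5]            // bufferSize 0: the first batch triggers the fit
```

Batches labelled `buffered` produce no output under this reading, the `fit` batch trains the model, and every later batch is only scored.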

Examples:

The following spec.q files outline the use of the functionality described above.

Example 1: Fit an adaBoost regression model on data and store model in local registry.

// Generate packet of data
n:1000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);

// Running a pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.adaBoostRegressor[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"AdaModel")]
  .qsp.write.toConsole[];

First, we pass a batch of data to the stream processor to fit the model.

q)publish data

We can see that the model is saved by calling the get model store function.

q).ml.registry.get.modelStore["/tmp";::]

Then we can retrieve predictions by passing new data.

q)publish ([]5?1f;5?1f;5?1f;y:0n)
                             | x         x1          x2        y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248   0.591584    0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808    0.3408518   0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502   0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611   0.2372084   0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186   0.1082126   0.9550528

`.qsp.ml.decisionTreeRegressor`

Decision Tree Regressor.

.qsp.ml.decisionTreeRegressor[X;y;udf]
.qsp.ml.decisionTreeRegressor[X;y;udf;.qsp.use (!) . flip (
    (`criterion      ; criterion);
    (`splitter       ; splitter);
    (`maxDepth       ; maxDepth);
    (`minSamplesSplit; minSamplesSplit);
    (`minSamplesLeaf ; minSamplesLeaf);
    (`registry       ; registry);
    (`experiment     ; experiment);
    (`model          ; model);
    (`modelInit      ; modelInit);
    (`bufferSize     ; bufferSize))]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or function`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use.	Required
`udf`	`symbol or function`	Can be the name of the column which is to house the model's predicted target labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable.	Required

options:

name	type	description	default
`criterion`	`string`	Criteria function used to measure the quality of a split each time a decision tree node is split into children. This can be `squared_error` for mean squared error, `friedman_mse` for mean squared error with Friedman's improvement score, `absolute_error` for mean absolute error, or `poisson` for Poisson deviance.	`"squared_error"`
`splitter`	`string`	Strategy used to split the nodes in the tree. This can be `best` to choose the best split or `random` to choose the best random split.	`"best"`
`minSamplesSplit`	`int`	Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is `2`.	`2`
`minSamplesLeaf`	`int`	Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is `1`.	`1`
`maxDepth`	`int`	Maximum depth of the decision tree, measured as the longest path from the tree root to a leaf. If set to `::`, the tree will expand until all leaves are pure or contain fewer than `minSamplesSplit` records. Minimum value is `1`.	`::`
`registry`	`string or dict`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here.	`()!()`
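
To make the `squared_error` criterion more concrete, the plain-q sketch below scores every candidate threshold on a single feature by the summed squared error of the two resulting child nodes and keeps the lowest-scoring one. The `sse` and `bestSplit` helpers are hypothetical names introduced here; this is not how the underlying Python model is implemented in detail, and the sketch ignores `minSamplesLeaf`, `maxDepth`, and the other criteria.

```q
// Hypothetical helpers, for illustration only
sse:{d:x-avg x; sum d*d}             // squared error of a node around its mean

// x - one feature column (floats), y - target column; returns the best threshold t,
// where records with x<t go to the left child and the rest go to the right child
bestSplit:{[x;y]
  c:1_asc distinct x;                                              // candidate thresholds
  cost:{[x;y;t] sse[y where x<t]+sse y where x>=t}[x;y] each c;    // total error per candidate
  c cost?min cost
 }

bestSplit[1 2 3 10 11 12f;1 1 1 5 5 5f]    // the split separates the two clusters
```

A real tree repeats this search over every feature and then recurses into each child node, subject to the limits set by the options above.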

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted target values.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the `bufferSize` option (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on that buffered data.

Examples:

The following spec.q files outline the use of the functionality described above.

Example 1: Fit a decision tree regression model on data and store model in local registry.

// Generate packet of data
n:1000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);

// Running a pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.decisionTreeRegressor[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"DTModel")]
  .qsp.write.toConsole[];

First, we pass a batch of data to the stream processor to fit the model.

q)publish data

We can see that the model is saved by calling the get model store function.

q).ml.registry.get.modelStore["/tmp";::]

Then we can retrieve predictions by passing new data.

q)publish ([]5?1f;5?1f;5?1f;y:0n)
                             | x         x1          x2        y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248   0.591584    0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808    0.3408518   0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502   0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611   0.2372084   0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186   0.1082126   0.9550528

`.qsp.ml.gradientBoostingRegressor`

Gradient Boosting Regressor.

.qsp.ml.gradientBoostingRegressor[X;y;udf]
.qsp.ml.gradientBoostingRegressor[X;y;udf;.qsp.use (!) . flip (
    (`loss           ; loss);
    (`learningRate   ; learningRate);
    (`nEstimators    ; nEstimators);
    (`minSamplesSplit; minSamplesSplit);
    (`minSamplesLeaf ; minSamplesLeaf);
    (`maxDepth       ; maxDepth);
    (`registry       ; registry);
    (`experiment     ; experiment);
    (`model          ; model);
    (`modelInit      ; modelInit);
    (`bufferSize     ; bufferSize))]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or function`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use.	Required
`udf`	`symbol or function`	Can be the name of the column which is to house the model's predicted target labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable.	Required

options:

name	type	description	default
`loss`	`string`	Loss function that is optimized using gradient descent to get the best model fit. Can be `squared_error`, `absolute_error`, `huber` which is a combination of `squared_error` and `absolute_error`, or `quantile` which allows for quantile regression (using conditional median).	`"squared_error"`
`learningRate`	`float`	Weight applied to the contribution of each regressor at each boosting iteration. The higher this value, the more each regressor will contribute to our final model. This value depends highly on the maximum number of estimators. Minimum value is `0.0`.	`0.1`
`nEstimators`	`int`	Maximum number of tree estimators to train. Each estimator is fit on the dataset and adjusted to focus on difficult prediction cases. If we already have a perfect fit, we will not create this maximum number. Minimum value is `1`.	`100`
`minSamplesSplit`	`int`	Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is `2`.	`2`
`minSamplesLeaf`	`int`	Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is `1`.	`1`
`maxDepth`	`int`	Maximum depth of the decision tree, measured as the longest path from the tree root to a leaf. If set to `::`, the tree will expand until all leaves are pure or contain fewer than `minSamplesSplit` records. Minimum value is `1`.	`3`
`registry`	`string or dict`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`config`	`dict`	Configuration for fitting the model.	`()!()`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here.	`()!()`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted target values.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the `bufferSize` option (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on that buffered data.

Examples:

The following spec.q files outline the use of the functionality described above.

Example 1: Fit a gradient boosting regression model on data and store model in local registry.

// Generate packet of data
n:1000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);

// Running a pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.gradientBoostingRegressor[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"GbModel")]
  .qsp.write.toConsole[];

First, we pass a batch of data to the stream processor to fit the model.

q)publish data

We can see that the model is saved by calling the get model store function.

q).ml.registry.get.modelStore["/tmp";::]

Then we can retrieve predictions by passing new data.

q)publish ([]5?1f;5?1f;5?1f;y:0n)
                             | x         x1          x2        y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248   0.591584    0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808    0.3408518   0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502   0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611   0.2372084   0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186   0.1082126   0.9550528

`.qsp.ml.kNeighborsRegressor`

k-Nearest Neighbors Regressor.

.qsp.ml.kNeighborsRegressor[X;y;udf]
.qsp.ml.kNeighborsRegressor[X;y;udf;.qsp.use (!) . flip (
    (`nNeighbors; nNeighbors);
    (`weights   ; weights);
    (`metric    ; metric);
    (`algorithm ; algorithm);
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`modelInit ; modelInit);
    (`bufferSize; bufferSize))]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or function`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use.	Required
`udf`	`symbol or function`	Can be the name of the column which is to house the model's predicted target labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable.	Required

options:

name	type	description	default
`nNeighbors`	`int`	Number of points already labeled or predicted, which lie closest to a given unlabeled point (neighbors), to factor in when predicting a value for the point. Minimum value is `1`.	`5`
`metric`	`string`	The distance metric to be used for the tree. The default metric is `minkowski`, see here for available metrics.	`"minkowski"`
`weights`	`string`	Weight function used to decide how much weight to give to each of the neighboring points when predicting the target of a point. Can be `uniform`, to weight each neighbor's target equally, or `distance`, to weight each neighbor's target based on its distance to the point.	`"uniform"`
`algorithm`	`string`	Algorithm used to parse the vector space and decide which points are the nearest neighbors. Can be `ball_tree`, `kd_tree`, `brute` for a brute-force distance search, or `auto` to choose an algorithm based on the data.	`"auto"`
`registry`	`string or dict`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`config`	`dict`	Configuration for fitting the model.	`()!()`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here.	`()!()`
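
The effect of the `weights` option can be illustrated with a small plain-q sketch. The `knnPredict` helper below is hypothetical and handles a single feature only; it assumes `k` does not exceed the number of training points and that the query point does not coincide exactly with a training point (which would give a zero distance). With `"distance"`, each neighbor is weighted by the inverse of its distance, which is the usual meaning of that setting.

```q
// Hypothetical helper: predict the target at point p from its k nearest training points
// weights - "uniform" or "distance", mirroring the option of the same name
knnPredict:{[k;weights;xs;ys;p]
  d:abs xs-p;                              // distance from p to every training point
  idx:k#iasc d;                            // indices of the k closest points
  w:$[weights~"distance";1%d idx;k#1f];    // inverse-distance or equal weights
  w wavg ys idx                            // weighted mean of the neighbors' targets
 }

xs:0 1 2 3f; ys:0 10 20 30f;
knnPredict[2;"uniform";xs;ys;1.25]     // plain mean of the two nearest targets
knnPredict[2;"distance";xs;ys;1.25]    // pulled towards the closer neighbor at x=1
```

The two calls use the same neighbors and differ only in how much each neighbor's target counts towards the prediction.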

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted target values.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the `bufferSize` option (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on that buffered data.

Examples:

The following spec.q files outline the use of the functionality described above.

Example 1: Fit a k-nearest neighbors regression model on data and store model in local registry.

// Generate packet of data
n:1000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);

// Running a pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.kNeighborsRegressor[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"KnnModel")]
  .qsp.write.toConsole[];

First, we pass a batch of data to the stream processor to fit the model.

q)publish data

We can see that the model is saved by calling the get model store function.

q).ml.registry.get.modelStore["/tmp";::]

Then we can retrieve predictions by passing new data.

q)publish ([]5?1f;5?1f;5?1f;y:0n)
                             | x         x1          x2        y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248   0.591584    0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808    0.3408518   0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502   0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611   0.2372084   0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186   0.1082126   0.9550528

`.qsp.ml.lasso`

Lasso.

.qsp.ml.lasso[X;y;udf]
.qsp.ml.lasso[X;y;udf;.qsp.use (!) . flip (
    (`alpha       ; alpha);
    (`fitIntercept; fitIntercept);
    (`maxIter     ; maxIter);
    (`tol         ; tol);
    (`registry    ; registry);
    (`experiment  ; experiment);
    (`model       ; model);
    (`modelInit   ; modelInit);
    (`bufferSize  ; bufferSize))]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or function`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use.	Required
`udf`	`symbol or function`	Can be the name of the column which is to house the model's predicted target labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable.	Required

options:

name	type	description	default
`alpha`	`float`	Constant that controls the regularization strength by multiplying the L1 regularization term. Minimum value is `0.0`.	`1.0`
`fitIntercept`	`boolean`	Whether to add a constant value (intercept) to the regression function - `c` in `y=mx+c`.	`1b`
`maxIter`	`int`	Maximum number of iterations before model training is terminated. The model will iterate until it converges or until it completes this number of iterations. Minimum value is `1`.	`1000`
`tol`	`float`	Tolerance value required to stop searching for the global minimum/maximum value. This is achieved once you get close enough to this global value. Minimum value is `0.0`.	`1e-4`
`registry`	`string or dict`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`config`	`dict`	Configuration for fitting the model.	`()!()`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted target values.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the `bufferSize` option (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. If subsequent data is passed to the stream, the operator will output predictions for each sample using the model fitted on that buffered data.

Examples:

The following spec.q files outline the use of the functionality described above.

Example 1: Fit a lasso regression model on data and store model in local registry.

// Generate packet of data
n:1000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);

// Running a pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.lasso[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"LassoModel")]
  .qsp.write.toConsole[];

First, we pass a batch of data to the stream processor to fit the model.

q)publish data

We can see that the model is saved by calling the get model store function.

q).ml.registry.get.modelStore["/tmp";::]

Then we can retrieve predictions by passing new data.

q)publish ([]5?1f;5?1f;5?1f;y:0n)
                             | x         x1          x2        y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248   0.591584    0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808    0.3408518   0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502   0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611   0.2372084   0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186   0.1082126   0.9550528

`.qsp.ml.linearRegression`

Linear regressor fit using stochastic gradient descent.

.qsp.ml.linearRegression[X;y;udf]
.qsp.ml.linearRegression[X;y;udf;.qsp.use (!) . flip (
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`config    ; config);
    (`trend     ; trend);
    (`alpha     ; alpha);
    (`maxIter   ; maxIter);
    (`gTol      ; gTol);
    (`seed      ; seed);
    (`penalty   ; penalty);
    (`lambda    ; lambda);
    (`l1Ratio   ; l1Ratio);
    (`decay     ; decay);
    (`p         ; p);
    (`bufferSize; bufferSize))]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or function`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use.	Required
`udf`	`symbol or function`	Can be the name of the column which is to house the model's predicted target labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable.	Required

options:

name	type	description	default
`registry`	`string`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`config`	`any`	Configuration for fitting the model.	`()!()`
`trend`	`boolean`	Whether to add a constant value (intercept) to the regression function - `c` in `y=mx+c`.	`1b`
`alpha`	`float`	Learning rate value used in the optimization function to dictate the step size taken towards the minimum of the loss function at each iteration. A high value will override information about previous data more in favor of newly acquired information. Generally, this value is set to be very small. Minimum value is `0.0`.	`0.01`
`maxIter`	`long`	Maximum number of iterations before model training is terminated. The model will iterate until it converges or until it completes this number of iterations. Minimum value is `1`.	`100`
`gTol`	`float`	Tolerance value required to stop searching for the global minimum/maximum value. This is achieved once you get close enough to this global value. Minimum value is `0.0`.	`1e-5`
`seed`	`long`	Integer value used to control the randomness of the model's initialization state. Specifying this allows for reproducible results across function calls. If set to `::`, the randomness is based on the current timestamp.	`0`
`penalty`	`symbol`	Penalty term used to shrink the coefficients of the less contributive variables. Can be `l1` to add an L1 penalty term, `l2` to add an L2 penalty term, or `elasticNet` to add both L1 and L2 penalty terms.	`l2`
`lambda`	`float`	Lambda value used to define the strength of the regularization applied. The higher this value is, the stronger the regularization will be. Minimum value is `0.0`.	`0.001`
`l1Ratio`	`float`	If `elasticNet` is used as the regularization method, this parameter determines the balance between the L1 and L2 penalty terms. If this value is set to `0`, this is the same as using L2 regularization, if this value is set to `1`, this is the same as using L1 regularization. This value must lie in the range `[0.0, 1.0]`.	`0.5`
`decay`	`float`	Describes how much weight to give to historical predictions from previously fit iterations. The higher this value, the less important historic predictions will be. Minimum value is `0.0`.	`0f`
`p`	`float`	Coefficient used to help accelerate the gradient vectors in the right direction, leading to faster convergence. Minimum value is `0.0`.	`0f`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted target values.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model when the number of records received has exceeded the value of the `bufferSize` option (n). When training, all data in the batch which causes the buffered data to exceed n elements is included in fitting. As this is an online model, each subsequent batch of data passed to the stream is used to update the regression model, and a prediction is also made for each record. The operator outputs the original data table with the predictions appended.

Examples:

Example 1: Fit, update, and predict with a linear regression model.

// Running a pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.linearRegression[`x;`y;`yHat; .qsp.use `trend`bufferSize!(1b;10000)]
  .qsp.write.toVariable[`output];

// Data will be buffered for training until the buffer size is reached,
// during which time no batches will be emitted.
publish ([] x:asc 5000?1f; y:asc 5000?1f);

// When the buffer size is reached, buffered data will be used for training,
// and predictions for it will be emitted.
publish ([] x:asc 5000?1f; y:asc 5000?1f);

// The operator can now be used to make predictions.
// As this is an online model, subsequent data is also used to update the model.
publish ([] x:asc 100?1f; y:asc 100?1f);
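
To illustrate what fitting and then incrementally updating a linear model can look like, the plain-q sketch below applies repeated gradient steps on a first batch and then a single further step when a later batch arrives. The `gradStep` helper is hypothetical; it uses a plain per-batch gradient step under squared-error loss and does not model the operator's actual stochastic gradient descent or its `penalty`, `decay`, and `p` options.

```q
// Hypothetical helper: one gradient step for y = c + m*x under squared-error loss
// alpha - step size, playing the role of the alpha option
// theta - current coefficients (c;m)
gradStep:{[alpha;theta;x;y]
  X:1f,'x;                                   // prepend an intercept term to each record
  err:(X mmu theta)-y;                       // residuals under the current coefficients
  theta-alpha*(flip[X] mmu err)%count y      // move against the mean gradient
 }

// Initial fit: iterate over the first batch, which lies on y = 1 + 2x
fit:gradStep[0.1;;0 1 2 3f;1 3 5 7f];
theta:fit/[2000;0 0f];

// Online update: a later batch nudges the existing coefficients
theta:gradStep[0.1;theta;0.5 1.5f;2 4f];
theta                                        // close to 1 2f
```

Because the later batch lies on the same line, the update leaves the coefficients essentially unchanged; a batch drawn from a different relationship would pull them towards it at a rate set by `alpha`.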

`.qsp.ml.randomForestRegressor`

Random Forest Regressor.

.qsp.ml.randomForestRegressor[X;y;udf]
.qsp.ml.randomForestRegressor[X;y;udf;.qsp.use (!) . flip (
    (`nEstimators    ; nEstimators);
    (`criterion      ; criterion);
    (`minSamplesSplit; minSamplesSplit);
    (`minSamplesLeaf ; minSamplesLeaf);
    (`maxDepth       ; maxDepth);
    (`registry       ; registry);
    (`experiment     ; experiment);
    (`model          ; model);
    (`modelInit      ; modelInit);
    (`bufferSize     ; bufferSize))]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or function`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use.	Required
`y`	`symbol or function`	Can be the name of the column containing the data's target labels OR a user-defined function that returns the target values to use.	Required
`udf`	`symbol or function`	Can be the name of the column which is to house the model's predicted target labels OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable.	Required

options:

name	type	description	default
`nEstimators`	`int`	Number of decision tree estimators to train and use. By default, each estimator is fit on a bootstrap sample of the dataset and the predictions of all estimators are averaged. Minimum value is `1`.	`100`
`criterion`	`string`	Criteria function used to measure the quality of a split each time a decision tree node is split into children. This can be `squared_error` for mean squared error, `friedman_mse` for mean squared error with Friedman's improvement score, `absolute_error` for mean absolute error, or `poisson` for Poisson deviance.	`"squared_error"`
`minSamplesSplit`	`int`	Minimum number of data records required at a node in the tree to split this node again into multiple child nodes. Minimum value is `2`.	`2`
`minSamplesLeaf`	`int`	Minimum number of data records required at each leaf node in the tree. A split will only take place if the resulting child nodes will each have this minimum number of data records. Minimum value is `1`.	`1`
`maxDepth`	`int`	Maximum depth of the decision tree, measured as the longest path from the tree root to a leaf. If set to `::`, the tree will expand until all leaves are pure or contain fewer than `minSamplesSplit` records. Minimum value is `1`.	`::`
`registry`	`string or dict`	Location of the registry where the fitted model is to be stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model is to be stored under. If set to `::`, the model will be stored under `unnamedExperiments`.	`::`
`model`	`string`	Name of the model to be stored in the registry. If set to `::`, the model will not be stored in the registry.	`::`
`config`	`dict`	Configuration for fitting the model.	`()!()`
`bufferSize`	`long`	Number of records to observe before fitting the model. If set to `0`, the model will be fit on the first batch. Minimum value is `0`.	`0`
`modelInit`	`dict`	A dictionary of parameter names and their corresponding values which are passed to the underlying python model to initialize it. For a full list of acceptable arguments see here.	`()!()`

Returns:

type	description
`table`	Returns the input data with an additional column containing the model's predicted target values.

When data is fed into this operator via a stream, the algorithm will only fit the underlying model once the number of records received has exceeded the value of the bufferSize parameter (n). When training, all data in the batch which causes the buffered data to exceed n records is included in fitting. If subsequent data is passed to the stream, the operator will output predictions for each record using the model fitted on the buffered data.
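The buffering rule described above can be sketched in plain q, without the Stream Processor. This is only an illustration under stated assumptions: the function `step`, the state dictionary, and the stand-in "model" (the mean of `y`) are hypothetical and not part of `.qsp.ml`. What the real operator emits for batches consumed while buffering is not stated here, so the sketch emits an empty table for them.

```q
// Illustrative sketch only: step and the mean-of-y "model" are hypothetical,
// not part of .qsp.ml
// st is a dictionary with keys buffer (records seen so far) and model (:: until fitted)
step:{[bufferSize;st;batch]
  m:st`model;
  // Already fitted: predict for every record in the batch
  if[not (::)~m; :(st; update yHat:m from batch)];
  // Still buffering: keep the whole batch, even if it overshoots bufferSize
  st[`buffer]:st[`buffer],batch;
  // Fit once the buffered record count exceeds bufferSize
  if[bufferSize<count st`buffer; st[`model]:avg st[`buffer]`y];
  // Assumption: nothing is emitted for batches consumed while buffering
  (st; 0#batch)}

st0:`buffer`model!(();::)
r1:step[5; st0; ([] x:1 2 3f; y:1 2 3f)]        // 3 records buffered, no model yet
r2:step[5; r1 0; ([] x:4 5 6 7f; y:4 5 6 7f)]   // 7 records exceed 5: fit on all 7
r3:step[5; r2 0; ([] x:8 9f; y:0n 0n)]          // later batch receives predictions
```

Each call returns a pair of the new state and the emitted table, so `r3 1` holds the last batch with a `yHat` column filled from the model fitted on all seven buffered records.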

Examples:

The following spec.q files outline the use of the functionality described above.

Example 1: Fit a random forest regression model on data and store model in local registry.

// Generate packet of data
n:1000;
data:([]x:asc n?1f;x1:n?1f;x2:n?1f;y:asc n?1f);

// Running a pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.randomForestRegressor[`x`x1`x2;`y;`yHat; .qsp.use `registry`model!("/tmp";"RafrModel")]
  .qsp.write.toConsole[];

First we pass a batch of data to the stream processor to fit the model.

q)publish data

We can see that the model is saved by calling the get model store function.

q).ml.registry.get.modelStore["/tmp";::]

Then we can retrieve predictions by passing new data.

q)publish ([]5?1f;5?1f;5?1f;y:0n)
                             | x         x1          x2        y yHat
-----------------------------| -------------------------------------------
2022.03.01D09:37:35.552468100| 0.4396505 0.1823248   0.591584    0.4310479
2022.03.01D09:37:35.552468100| 0.2864931 0.953808    0.3408518   0.3047388
2022.03.01D09:37:35.552468100| 0.2663074 0.001459365 0.2480502   0.2638261
2022.03.01D09:37:35.552468100| 0.8727333 0.1277611   0.2372084   0.9198592
2022.03.01D09:37:35.552468100| 0.9739936 0.6642186   0.1082126   0.9550528

`.qsp.ml.score`

Score the performance of a model

.qsp.ml.score[y;predictions;metric]

Parameters:

name	type	description	default
`y`	`symbol or function`	The column name of the target variable, or a function to generate the target variable from the batch.	Required
`predictions`	`symbol or function`	The column name of the predictions, or a function to generate the predictions from the batch.	Required
`metric`	`symbol`	The metric on which to evaluate model performance.	Required

Returns:

type	description
`any`	The score given by the metric.

Score the performance of a model over time allowing changes in model performance to be evaluated. The values returned are the cumulative scores, rather than scores for the individual batches.

The following metrics are currently supported:

f1
accuracy
mse
rmse
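To make "cumulative" concrete, here is a minimal sketch in plain q of how running totals can yield a score over every record seen so far rather than per batch. It is not the `.qsp.ml.score` implementation; `initScore`, `updScore`, and `metrics` are hypothetical names, and F1 is omitted for brevity.

```q
// Illustrative sketch only; these names are hypothetical, not part of .qsp.ml
// Running totals: records seen, correct predictions, sum of squared errors
initScore:`n`correct`sse!(0;0;0f)

// Fold one batch of targets y and predictions p into the running totals
updScore:{[s;y;p]
  s[`n]+:count y;
  s[`correct]+:sum y=p;
  s[`sse]+:sum d*d:`float$y-p;
  s}

// Cumulative metrics derived from the totals
metrics:{[s] `accuracy`mse`rmse!(s[`correct]%s`n; s[`sse]%s`n; sqrt s[`sse]%s`n)}

s1:updScore[initScore; 1 0 1 1; 1 0 0 1]   // batch 1: 3 of 4 correct
s2:updScore[s1; 0 1; 0 1]                  // batch 2: 2 of 2 correct
metrics s2                                 // accuracy is 5%6 over all 6 records
```

After the second batch the accuracy is 5 of 6 rather than the 2 of 2 that the batch alone would give, which is the behaviour the paragraph above describes.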

Examples:

Example 1: This example fits a scikit-learn model, then the pipeline predicts y and calculates the cumulative F1 score of the model on receipt of new data.

// Retrieve a dataset and format appropriately
dataset:.p.import[`sklearn.datasets;`:load_breast_cancer][];
X:dataset[`:data]`;
y:dataset[`:target]`;
data: ([] y: y) ,' flip (`$"x",/:string til count first X)!flip X;

// Split data into training and testing set
temp: (floor .8 * count data) cut data;
training: temp 0;
testing : temp 1;

features:flip value flip delete y from training;
targets :training`y;

// Train the model
clf:.p.import[`sklearn.tree]`:DecisionTreeClassifier;
clf:clf[`max_depth pykw 3];
clf[`:fit][features;targets];

// Set model within existing registry
.ml.registry.set.model[::;::;clf;"skModel";"sklearn";::];

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.predict[
    {delete y from x};
    `pred;
    .qsp.use enlist[`model]!enlist"skModel"]
  .qsp.ml.score[`y; `pred; `f1]
  .qsp.write.toConsole[];

publish testing;

Example 2: This example first fits a q model, then the pipeline predicts y and scores the cumulative accuracy on receipt of new data.

// Retrieve a dataset and format appropriately
dataset:.p.import[`sklearn.datasets;`:load_breast_cancer][];
X:dataset[`:data]`;
y:dataset[`:target]`;
data: ([] y: y) ,' flip (`$"x",/:string til count first X)!flip X;

// Split the data into training and testing sets
temp: (floor .8 * count data) cut data;
training: temp 0;
testing : temp 1;

features:flip value flip delete y from training;
targets :training`y;

// Train the model
model:.ml.online.sgd.logClassifier.fit[features;targets;1b;::];

// Add the model to the existing registry
.ml.registry.set.model[::;::;model;"myModel";"q";::]

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.predict[
    {delete y from x};
    `pred;
    .qsp.use enlist[`model]!enlist"myModel"]
  .qsp.ml.score[`y; `pred; `accuracy]
  .qsp.write.toConsole[]

publish testing

`.qsp.ml.dropConstant`

Drops columns with constant values

.qsp.ml.dropConstant[X]
.qsp.ml.dropConstant[X; .qsp.use enlist[`bufferSize]!enlist bufferSize]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or dictionary or ::`	Name of the column(s) in the input table to remove because they contain a constant value throughout. Can also be a dictionary mapping column names to their associated constant values whereby only columns with these names and values will be dropped. If set to `::`, the operator will be applied to all columns in the data that contain a constant value throughout.	Required

options:

name	type	description	default
`bufferSize`	`long`	Number of records to observe before dropping the constant columns from the data. If set to `0`, the operator will be applied on the first batch. Minimum value is `0`.	`0`

Returns:

type	description
`table`	Returns the input data with the constant valued columns no longer in the table.

The columns to be removed from the data are either specified by the user beforehand, through a list or dictionary, or these columns are determined using the .ml.dropConstant function. This function checks the data for columns that contain a constant value throughout. If a non-constant column is supplied, an error is thrown.
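The check described above can be sketched in plain q. This is a hedged illustration, not the `.ml.dropConstant` implementation; `constCols` and `dropCols` are hypothetical names, and the error message is made up for the example.

```q
// Illustrative sketch only; these names are hypothetical, not part of .ml or .qsp.ml
// Names of the columns that hold a single distinct value
constCols:{[t] where 1=count each distinct each flip t}

// Drop the named columns, signalling an error if any of them is not constant
dropCols:{[t;c]
  c:(),c;
  if[not all c in constCols t; '"non-constant column supplied"];
  (cols[t] except c)#t}

t:([] protocol:5#`TCP; response:5#200i; latency:1 2 3 4 5f)
constCols t                  // `protocol`response
dropCols[t; constCols t]     // only the latency column remains
```

Calling `dropCols[t;`latency]` signals the error, mirroring the rule that a user-supplied column must be constant.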

Example 1: Drop the constant columns protocol and response.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.dropConstant[`protocol`response]
  .qsp.write.toConsole[];
publish ([] protocol: `TCP; response: 200i; latency: 10?5f; size: 10?10000);

Example 2: Drop the columns id and ratio, checking that their values match the expected constant values.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.dropConstant[`id`ratio!(1; 2f)]
  .qsp.write.toConsole[];

publish ([] id: 1; ratio: 2f; data: 10?10f);

Example 3: Drop columns whose value is constant for all buffered records.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.dropConstant[::;.qsp.use enlist[`bufferSize]!enlist 100]
  .qsp.write.toConsole[];

publish ([] motorID: 0; rpms: 1000 + 200?10; temp: 60 + 200?5)

`.qsp.ml.featureHasher`

(Beta Feature) Encodes categorical data across several numeric columns

Beta Features

To enable beta features, set the environment variable KXI_SP_BETA_FEATURES to true.

.qsp.ml.featureHasher[X;n]

Parameters:

name	type	description	default
`X`	`symbol or symbol[]`	Symbol or list of symbols indicating the columns to act on.	Required
`n`	`long`	The number of numeric columns used to represent a variable.	Required

Returns:

type	description
`table`	New table with a column for each feature/hash value pair, with the columns specified by `X` removed.

This operator is used to encode categorical variables numerically. It is similar to one-hot encoding, but does not require the categories or number of categories to be known in advance.

It converts each chosen column into n columns, sending each string/symbol to its truncated hash value. The hash function employed is the signed 32-bit version of MurmurHash3.

As the mapping between values and their hashed representations is effectively random, collisions are possible, and the hash space must be made large enough to reduce collisions to an acceptable level.
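The idea of truncating a hash into n columns can be sketched in plain q. This is an assumption-laden toy: q has no built-in MurmurHash3, so the sketch substitutes a sum of `md5` bytes as the hash, and `bucket` and `hashCol` are hypothetical names. It shows only how values land in one of n buckets and how collisions arise; it will not reproduce the operator's actual output.

```q
// Illustrative sketch only; md5 is a stand-in for MurmurHash3 and
// bucket/hashCol are hypothetical names, not part of .qsp.ml
// Map a string to one of n buckets by truncating a hash with mod
bucket:{[n;s] (sum `long$md5 s) mod n}

// Replace a symbol column with n indicator columns named <name>_0 .. <name>_(n-1)
hashCol:{[n;name;vals]
  idx:bucket[n] each string vals;
  flip (`$string[name],/:"_",/:string til n)!flip idx=\:til n}

r:hashCol[10; `location; `london`paris`london`berlin]
```

Equal inputs always share a bucket, while distinct inputs share one only by collision, which becomes less likely as n grows.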

Examples:

Example 1: Encode a single categorical column

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.featureHasher[`location; 10]
  .qsp.write.toConsole[];

publish ([] location: 20?`london`paris`berlin`miami; num: til 20);

Example 2: Here is a similar example where we hash multiple columns

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.featureHasher[`srcIP`destIP; 14]
  .qsp.write.toVariable[`output];

IPs: "." sv/: string 4 cut 100?256;
publish ([] srcIP: 100?IPs; destIP: 100?IPs; latency: 100?10; size: 100?10000);

`.qsp.ml.labelEncode`

Encodes symbolic columns as numeric data

.qsp.ml.labelEncode[X]
.qsp.ml.labelEncode[X; .qsp.use enlist[`bufferSize]!enlist bufferSize]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or dictionary or ::`	Name of the column(s) in the input table whose labels we want to encode. Can also be a dictionary mapping column names to their expected label values whereby only columns with these names and values will be encoded. If set to `::`, all categorical columns will be encoded as numeric values.	Required

options:

name	type	description	default
`bufferSize`	`long`	Number of records to observe before label encoding the symbol columns in the data. If set to `0`, the operator will be applied on the first batch. Minimum value is `0`.	`0`

Returns:

type	description
`table`	Returns the input data with the symbol columns in the data now having been label encoded as numeric values.

This operator encodes symbolic columns within input data as numeric representations. When data is fed into this operator via a stream, the encoding algorithm will only be run on the data when the number of records received has exceeded the value of the bufferSize. Once this happens, the specified symbol columns are encoded and the mapping of each symbol to its respective encoded number is stored as the state. If new symbols appear in subsequent batches, the state will be updated to reflect this.
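The stateful mapping described above can be sketched in plain q. This is a minimal illustration, not the operator's implementation: `encode` is a hypothetical name, and the choice to number symbols in order of first appearance is an assumption made for the sketch.

```q
// Illustrative sketch only; encode is a hypothetical name, not part of .qsp.ml
// mapping is a list of symbols; a symbol's code is its position in that list
encode:{[mapping;vals]
  // Assumption: unseen symbols are appended in order of first appearance
  mapping:distinct mapping,vals;
  (mapping; mapping?vals)}

r1:encode[`$(); `a`b`a`c]    // state `a`b`c, codes 0 1 0 2
r2:encode[r1 0; `c`d`a]      // new symbol `d extends the state, codes 2 3 0
```

Passing the returned mapping into the next call plays the role of the operator state, so codes assigned in earlier batches stay stable.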

Examples:

Example 1: Encode all symbol columns within the data.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.labelEncode[::]
  .qsp.write.toConsole[];

publish ([]10?`a`b`c;10?`d`e`f;10?1f);

Example 2: Encode symbols in column x.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.labelEncode[`x]
  .qsp.write.toConsole[]

publish ([]10?`a`b`c;10?`d`e`f;10?1f);

Example 3: Encode the symbols in the encoded column with the mapping specified.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.labelEncode[(enlist `encoded)!enlist `small`medium`large]
  .qsp.write.toConsole[];

data: 10?`small`medium`large;
publish ([] original: data; encoded: data);

`.qsp.ml.minMaxScaler`

Apply min-max scaling to streaming data

.qsp.ml.minMaxScaler[X]
.qsp.ml.minMaxScaler[X; .qsp.use (!) . flip (
    (`bufferSize; bufferSize);
    (`rangeError; rangeError))]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or dictionary or ::`	Name of the column(s) in the input table whose values we want to scale. Can also be a dictionary mapping column names to the minimum and maximum values to use when scaling. If set to `::`, all numeric columns will be scaled.	Required

options:

name	type	description	default
`bufferSize`	`long`	Number of records to observe before scaling the numeric columns in the data. If set to `0`, the operator will be applied on the first batch. Minimum value is `0`.	`0`
`rangeError`	`boolean`	Whether to raise a range error if new input data falls outside the minimum and maximum data range observed during the initialization of the operator.	`0b`

Returns:

type	description
`table`	Returns the input data with the numeric columns now being scaled so their values lie between `0` and `1`.

This operator scales a set of numeric columns based on a user-supplied data range or based on the minimum and maximum values in the data when the operator is applied. The operator will only be applied, and the minimum/maximum values decided upon, once the number of data points received by the operator exceeds the value of the bufferSize parameter. This operator can also be configured to error if data supplied after the ranges have been set falls outside this range.
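A minimal sketch of the scaling rule in plain q, assuming the usual min-max formula `(v - min) % (max - min)`. The names `fitRange` and `scale` and the error text are hypothetical, not part of `.qsp.ml`.

```q
// Illustrative sketch only; these names are hypothetical, not part of .qsp.ml
// Record the range observed when the scaler is fitted
fitRange:{[v] `min`max!(min v;max v)}

// Scale later values with the fitted range, optionally erroring on out-of-range data
scale:{[r;rangeError;v]
  if[rangeError and any (v<r`min) or v>r`max; '"data outside fitted range"];
  (v-r`min)%r[`max]-r`min}

r:fitRange 0 5 10f
scale[r; 0b; 0 5 10f]     // 0 0.5 1
scale[r; 0b; 20 -10f]     // 2 -1: values outside the range fall outside 0 to 1
```

With `rangeError` set to `1b`, the last call would signal an error instead of returning values outside the interval.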

Examples:

Example 1: Apply min-max scaling on all data.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.minMaxScaler[::]
  .qsp.write.toConsole[];

publish ([]20?5;20?5;20?10)

Example 2: Apply min-max scaling on the specified columns.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.minMaxScaler[`x`x1]
  .qsp.write.toConsole[];

publish ([]20?5;20?5;20?10)

Example 3: Apply min-max scaling on columns rating and cost, with supplied minimum and maximum values for one column and the other based on a buffer.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.minMaxScaler[`rating`cost!(0 10;::); .qsp.use enlist[`bufferSize]!enlist 200]
  .qsp.write.toConsole[];

publish ([] rating: 3 + 250?5; cost: 250?1000f)

Example 4: Error when passed batches containing data outside the min-max bounds.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.minMaxScaler[::;.qsp.use enlist[`rangeError]!enlist 1b]
  .qsp.write.toConsole[]

// As no buffer is specified, the min and max values are fit using the initial batch
publish ([]100?5;100?5;100?10)

// As `rangeError` has been set, this batch will cause an error by exceeding the
// expected maximum values
publish 1+([]100?5;100?5;100?10)

`.qsp.ml.oneHot`

One hot encodes relevant columns

.qsp.ml.oneHot[X]
.qsp.ml.oneHot[X; .qsp.use enlist[`bufferSize]!enlist bufferSize]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or dictionary or ::`	Name of the column(s) in the input table to one-hot encode. Can also be a dictionary mapping column names to their expected values whereby only columns with these names and values will be encoded. If set to `::`, all categorical columns will be encoded as numeric values.	Required

options:

name	type	description	default
`bufferSize`	`long`	Number of records to observe before one-hot encoding the symbol columns in the data. If set to `0`, the operator will be applied on the first batch. Minimum value is `0`.	`0`

Returns:

type	description
`table`	Returns the input data with the symbol columns in the data now each being represented by multiple numeric columns populated by `0`s and `1`s.

Encodes symbolic and string data as numeric representations. When data is fed into the operator via a stream, the algorithm will only be applied to the data when the number of records received has exceeded the value of the bufferSize parameter. When this happens, the buffered data is one-hot encoded. If subsequent data is passed which contains symbols that were not present at the time of the original fitting, these symbols will be mapped to 0.
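The encoding, including the treatment of unseen symbols, can be sketched in plain q. This is an illustration only: `oneHot` here is a hypothetical local function (not `.qsp.ml.oneHot`), and the `<name>_<category>` column naming is an assumption for the example.

```q
// Illustrative sketch only; a hypothetical local function, not .qsp.ml.oneHot
// cats are the categories known at fitting time; one 0/1 column per category
oneHot:{[name;cats;vals]
  flip (`$string[name],/:"_",/:string cats)!`long$vals=/:cats}

r:oneHot[`x; `a`b`c; `a`c`z`b]
// x_a is 1 0 0 0, x_b is 0 0 0 1, x_c is 0 1 0 0
// the unseen symbol `z maps to 0 in every column
```

Because the column set is fixed by `cats`, later batches keep the same schema regardless of which symbols they contain.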

Examples:

Example 1: Encode all the symbolic or string columns.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.oneHot[::]
  .qsp.write.toConsole[];

publish ([] action: 10?`upload`download; fileType: 10?("image";"audio";"document"); size: 10?100000)

Example 2: Encode column x

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.oneHot[`x]
  .qsp.write.toConsole[];

publish ([] x:10?`a`b`c; y:10?1f)

Example 3: Encode columns x and x1 with a required buffer

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.oneHot[`x`x1;.qsp.use enlist[`bufferSize]!enlist 200]
  .qsp.write.toConsole[];

publish ([] 250?`a`b`c; 250?`d`e`f`j; 250?0b)

Example 4: Encode the columns axis and status using given values. This is useful when the categories are known in advance, but may not be present in the training data.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.oneHot[`axis`status!(`x`y`z; `normal`error)]
  .qsp.write.toConsole[];

publish ([] axis: 100?`x`y`z; status: `normal; position: 100?50f)

Example 5: Encode column axis and status using hybrid method

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.oneHot[`axis`status!(::; `normal`error)]
  .qsp.write.toConsole[];

publish ([] axis: 100?`x`y`z; status: `normal; position: 100?50f)

`.qsp.ml.standardize`

Apply standardization to streaming data

.qsp.ml.standardize[X]
.qsp.ml.standardize[X; .qsp.use enlist[`bufferSize]!enlist bufferSize]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or ::`	Name of the column(s) in the input table to standardize. If set to `::`, all numeric columns will be standardized.	Required

options:

name	type	description	default
`bufferSize`	`long`	Number of records to observe before standardizing the numerical columns in the data. If set to `0`, the operator will be applied on the first batch. Minimum value is `0`.	`0`

Returns:

type	description
`table`	Returns the input data with the numeric columns now having a mean value of `0` and a standard deviation of `1`.

Standardize a user-specified set of columns in an input table. When data is fed into this operator via a stream, the algorithm will only scale the data when the number of records received has exceeded the value of the bufferSize parameter. Once this happens, the mean and standard deviation of each column is computed. These statistics are then used on subsequent batches which are normalized by subtracting this mean value and dividing the result by the standard deviation value.
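The two-phase behaviour (compute statistics once, reuse them on later batches) can be sketched in plain q. The names `fitStats` and `standardize` are hypothetical local functions, and the use of `dev` (population standard deviation) is an assumption; the operator may use a different estimator.

```q
// Illustrative sketch only; hypothetical local functions, not part of .qsp.ml
// Assumption: population standard deviation (dev) is used
fitStats:{[v] `mean`sd!(avg v;dev v)}

// Apply previously fitted statistics to a batch
standardize:{[s;v] (v-s`mean)%s`sd}

s:fitStats 2 4 4 4 5 5 7 9f      // mean 5, standard deviation 2
standardize[s; 2 4 4 4 5 5 7 9f] // -1.5 -0.5 -0.5 -0.5 0 0 1 2
standardize[s; 11 3f]            // later batch scaled with the same statistics: 3 -1
```

A constant column has a standard deviation of zero, so this sketch would divide by zero for it; such columns would need separate handling.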

Examples:

Example 1: Apply standardization to all data.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.standardize[::]
  .qsp.write.toConsole[];

publish ([]100?5;100?5;100?10)

Example 2: Apply standardization to specified columns.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.standardize[`x`x1]
  .qsp.write.toConsole[];

publish ([]100?5;100?5;100?10)

Example 3: This pipeline applies standardization to all columns based on a buffer.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.standardize[::; .qsp.use enlist[`bufferSize]!enlist 200]
  .qsp.write.toConsole[];

publish ([] length: 100 + 250?2f; width: 10 + 250?1f);

`.qsp.ml.registry.fit`

Fit model to batch of data and predict target for future batches

.qsp.ml.registry.fit[X;y;untrained;modelType;udf]
.qsp.ml.registry.fit[X;y;untrained;modelType;udf; .qsp.use (!) . flip (
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`config    ; config);
    (`modelArgs ; modelArgs);
    (`bufferSize; bufferSize))]

Parameters:

name	type	description	default
`X`	`symbol[] or function`	The predictor variable's column names or a function to generate the predictors from the batch.	Required
`y`	`symbol or function or ::`	The target variable's column name or a function to generate the target variable from the batch. This must be `::` when training an unsupervised model.	Required
`untrained`	`function`	An untrained q/sklearn model.	Required
`modelType`	`string`	Indication as to whether a model is `"q"` or `"sklearn"`.	Required
`udf`	`function or symbol`	A function to score the quality of the model or join predictions into the batch. If this is a symbol, the predictions are appended to the batch as a new column with that name.	Required

Functional UDF requirements

The udf parameter for the .qsp.ml.registry.fit operator is a function with the following parameters:

udf:{[data;y;predictions;modelInfo]
    update yhat: predictions from data
    }

name	type	description
`data`	`any`	The batch passed to the operator, only the data not the metadata.
`y`	`symbol or function or ::`	The target variable, as extracted by the `y` parameter. In the unsupervised case this is populated with nulls.
`predictions`	`list`	The predictions for each record in the batch.
`modelInfo`	`::`	Currently unused and always set to `::`.
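As a sketch of a udf that scores as well as joins, the function below adds the predictions and a per-record squared error to the batch. The name `scoreUdf` and the columns `yhat` and `sqErr` are hypothetical; only the four-parameter signature comes from the table above. It assumes `y` arrives as a numeric list aligned with `predictions`.

```q
// Illustrative sketch only; scoreUdf, yhat and sqErr are hypothetical names
// Assumes y is a numeric list with one value per record in data
scoreUdf:{[data;y;predictions;modelInfo]
  e:y-predictions;
  update yhat:predictions, sqErr:e*e from data}

// Called directly here to show the result shape, outside any pipeline
out:scoreUdf[([] a:1 2f; b:3 4f); 3 5f; 2.5 5f; ::]
```

Within a pipeline, such a function would be passed in the `udf` position in place of a column name.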

options:

name	type	description	default
`registry`	`string`	The registry to load from.	`::`
`experiment`	`string`	The experiment name.	`::`
`model`	`string`	The model name in the registry.	`::`
`config`	`any`	The config parameter for `.ml.registry.set.model`.	`()!()`
`modelArgs`	`list`	A list of arguments to pass to the model after `X` and `y`.	`::`
`bufferSize`	`long`	Number of records to buffer before training a model. If 0, the model will be fit on the first batch. If the buffer size is exceeded, additional records in that batch will also be included when training.	`0`

Returns:

type	description
`any`	The current batch, modified in accordance with the `udf` parameter.

Fits a model to a batch or buffer of data, saving the model to the registry, and predicting the target variable for future batches after the model has been trained.

N.B. This is only for models that cannot be trained incrementally. For other models, .qsp.ml.registry.update should be used.

Fit a q model on a batch.

// Generate initial data to be used for fitting
a:500?1f
b:500?1f
data:([]a;b;y:a+b)

// Define optional variables
optKeys:`registry`experiment`model`modelArgs
optVals:(::;::;"sgdLR";(1b; `maxIter`gTol`seed!(100;-0w;42)))
opt:optKeys!optVals

// Define execution pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.fit[
    {delete y from x};
    `y;
    .ml.online.sgd.linearRegression;
    "q";
    `yhat;
    .qsp.use opt
    ]
  .qsp.write.toConsole[]

publish data

// View model stored in registry
.ml.registry.get.modelStore[::;::]

Fit an sklearn model.

// Generate initial data to be used for fitting
data:([]x:asc 100?1f;x1:100?1f;y:desc 100?5)

// Create an untrained random forest classifier
rfc:.p.import[`sklearn.ensemble][`:RandomForestClassifier][`max_depth pykw 2]

// Define execution pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.fit[
    {delete y from x};
    {exec y from x};
    rfc;
    "sklearn";
    `yhat]
  .qsp.write.toConsole[]

publish data

Fit an unsupervised model.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.fit[
    `x`x1`x2;
    ::;
    .ml.clust.kmeans;
    "q";
    `cluster;
    .qsp.use enlist[`modelArgs]!enlist(`e2dist;3;::)
    ]
  .qsp.write.toConsole[]

publish ([]x:1000?1f;x1:1000?1f;x2:1000?1f)

`.qsp.ml.registry.predict`

Predict a target variable using a model

.qsp.ml.registry.predict[X;udf]
.qsp.ml.registry.predict[X;udf; .qsp.use (!) . flip (
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`version   ; version))]

Parameters:

name	type	description	default
`X`	`symbol[] or function`	Can be the name of the column(s) containing the features from the data OR a user-defined function that returns the feature values to use.	Required
`udf`	`function or symbol`	Can be the name of the column which is to house the model's predicted class/cluster/target values OR a user-defined function which takes the predictions, does something with them, and then assigns them to a variable.	Required

Functional UDF requirements

The udf parameter for the .qsp.ml.registry.predict operator is a function with the following parameters:

udf:{[data;y;predictions;modelInfo]
    update yhat: predictions from data
    }

name	type	description
`data`	`any`	The batch passed to the operator, only the data not the metadata.
`y`	`symbol or function or ::`	The target variable, as extracted by the `y` parameter.
`predictions`	`list`	The predictions for each record in the batch.
`modelInfo`	`::`	Currently unused and always set to `::`.

options:

name	type	description	default
`registry`	`string`	Location of the registry where the fitted model is stored. This can be a local path or a cloud storage path. If set to `::`, a local registry will be created in the present working directory.	`::`
`experiment`	`string`	Name of the experiment in the registry that the fitted model we want to load is stored under. If set to `::`, the model will be loaded from `unnamedExperiments`.	`::`
`model`	`string`	Name of the fitted model we want to load in the registry. If set to `::`, the most recently uploaded model will be loaded.	`::`
`version`	`float`	Version of the fitted model we want to load in the registry. If set to `::`, the latest version of the model will be loaded.	`::`

Returns:

type	description
`any`	Returns the input data with an additional column containing the model's predicted label values for each data point.

.qsp.ml.registry.predict will predict the target value for each record in the batch, using a model from the registry.

The user-defined function udf can join these predictions into the data, or do any arbitrary computation. Note that below data is the whole batch, not just those fields extracted by X. Additionally, modelInfo is a catch-all for any model-specific outputs.

.qsp.ml.registry.predict[X; {[data;y;predictions;modelInfo]
    update temperature: predictions from data
    }; .qsp.use `registry`experiment`model`version!(registry;experiment;model;version)]

In lieu of a user-defined function, this parameter can also just be the name of a new column or the name of an existing column to overwrite it.

.qsp.ml.registry.predict[X;`temperature;
  .qsp.use`registry`experiment`model`version!(registry;experiment;model;version)]

Examples:

Example 1: Predict using an sklearn model, adding predictions to the initial data.

N:1000
data:([]x:asc N?1f;x1:desc N?10;x2:N?1f;y:asc N?5)

features:flip value flip delete y from data

clf1:.p.import[`sklearn.tree]`:DecisionTreeClassifier;
clf1:clf1[`max_depth pykw 3];
clf1[`:fit][features;data`y];

// Set the model within the existing registry
.ml.registry.set.model[::;::;clf1;"skModel";"sklearn";::]

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.predict[
    {delete y from x};
    `yhat;
    .qsp.use enlist[`model]!enlist"skModel"]
  .qsp.write.toConsole[]

publish data

Example 2: Predict using a q model, adding predictions to the initial data.

// Define data for fitting the model
N:1000;
data:([]x:N?1f;x1:N?1f;x2:N?1f);

// Fit a model
kmeansModel:.ml.clust.kmeans.fit[data`x`x1`x2;`e2dist;6;enlist[`iter]!enlist 1000]

// Set the model within existing registry
.ml.registry.set.model[::;::;kmeansModel;"kmeansModel";"q";::]

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.predict[
    `x`x1`x2;
    `yhat;
    .qsp.use enlist[`model]!enlist"kmeansModel"]
  .qsp.write.toConsole[]

publish data

`.qsp.ml.registry.update`

Train a model incrementally, returning predictions for each record in a batch

.qsp.ml.registry.update[X;y;udf]
.qsp.ml.registry.update[X;y;udf; .qsp.use (!) . flip (
    (`registry     ; registry);
    (`experiment   ; experiment);
    (`model        ; model);
    (`version      ; version);
    (`config       ; config);
    (`supervised   ; supervised);
    (`untrained    ; untrained);
    (`modelType    ; modelType);
    (`modelArgs    ; modelArgs))]

Parameters:

name	type	description	default
`X`	`symbol[] or function`	The predictor variable's column names or a function to generate the predictors from the batch.	Required
`y`	`symbol or function`	The target variable's column name or a function to generate this from the batch.	Required
`udf`	`function or symbol`	A function to score the quality of the model or join predictions into the batch.	Required

Functional UDF requirements

The udf parameter for the .qsp.ml.registry.update operator is a function with the following parameters:

udf:{[data;y;predictions;modelInfo]
    update yhat: predictions from data
    }

name	type	description
`data`	`any`	The batch passed to the operator, only the data not the metadata.
`y`	`symbol or function or ::`	The target variable, as extracted by the `y` parameter.
`predictions`	`list`	The predictions for each record in the batch.
`modelInfo`	`::`	Currently unused and always set to `::`.

options:

name	type	description	default
`registry`	`string`	Registry to load/store model from.	`::`
`experiment`	`string`	Experiment name under which to load/store model.	`::`
`model`	`string`	Model name.	`::`
`version`	`long[]`	The version to load.	`::`
`config`	`any`	Config for storage of the initial fit model.	`()!()`
`supervised`	`boolean`	Indicates whether the model is supervised. Set to `0b` for an unsupervised model.	`1b`
`untrained`	`function or embedpy`	An untrained ML model e.g. `.ml.online.sgd.linearRegression`.	`::`
`modelType`	`string`	One of `"q"` or `"sklearn"` defining the type of model.	`::`
`modelArgs`	`list`	A list of arguments to pass to the model after `X` and `y`.	`::`

Returns:

type	description
`any`	The current batch, modified in accordance with the `udf` parameter.

Train a model incrementally, returning predictions for each record in a batch. A user-defined function can be used to join these predictions into the data, or do any arbitrary computation.

Python support

Currently this functionality is only supported for q models. Support for deployment of online learning models written in Python is scheduled for a later release.

Examples:

Example 1: Fit an untrained q model which can be updated, adding predictions to the initial data.

// Initialise functionality and data required for running example
a:500?1f
b:500?1f
data:([]a;b;y:a+b)

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.update[
    {delete y from x};
    {exec y from x};
    `yhat;
    .qsp.use
      `untrained`modelType`modelArgs!(.ml.online.sgd.linearRegression;"q";(1b;()!()))]
  .qsp.write.toConsole[]

publish data;