Machine Learning

.qsp.ml
  freshCreate        turns batches of data into features based on aggregated statistics
  logClassifier      fits a logistic classification model on batches of data using stochastic gradient descent
  sequentialKMeans   fits a sequential k-means model on batches of data
  linearRegression   fits a linear regression model to batches of data
  score              evaluates a model's predictions
  dropConstant       drops constant columns from incoming data
  featureHasher      hashes feature names into sparse matrices
  labelEncode        encodes symbolic data into numerical values
  minMaxScaler       min-max scales a supplied dataset
  oneHot             replaces symbolic values with numerical vector representations
  standardize        standardizes a supplied dataset
  registry.fit       fits a model to batches of data, saving a model to a registry
  registry.predict   predicts a target variable using a trained model from the registry
  registry.update    trains a model incrementally, returning predictions for all records

`.qsp.ml.freshCreate`

Turns batches of data into features using aggregated statistics

.qsp.ml.freshCreate[X;features]

Parameters:

name	type	description	default
`X`	`symbol or symbol[]`	The columns to use for feature generation.	Required
`features`	`:: or symbol or symbol[]`	The list of features to apply to the columns.	Required

Returns:

type	description
`table`	Table with a column for each new feature.

Converts each chosen column into a collection of feature values based on the supplied FRESH features. The operator is typically used with the windowing operators, which provide regular batches of data from which features are engineered. The aggregate statistics used to create these features can be as simple as max/min/count.

The features parameter accepts the following special values:

:: - all features are applied.
noHyperparameters - all features except those that take hyperparameters are applied.
noPython - all features that don't rely on Python are applied.

As this aggregates a batch to a single row of aggregated statistics, the output table does not include the original columns.
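To make the batch-to-row aggregation concrete, the following is a minimal sketch in plain Python rather than q. It assumes absEnergy follows the usual FRESH definition (the sum of squared values). The helper `aggregate_batch` and the `column_feature` output naming are hypothetical and are not part of the operator's API.

```python
def aggregate_batch(batch, column, features):
    """Collapse one batch (a dict of column name -> list) into a single row of features."""
    values = batch[column]
    # Assumed feature definitions; the operator's own implementations may differ.
    available = {
        "absEnergy": lambda v: sum(x * x for x in v),
        "max": max,
        "min": min,
        "count": len,
    }
    # The output holds only the new features, not the original column.
    return {f"{column}_{name}": available[name](values) for name in features}


row = aggregate_batch({"x": [1.0, 2.0, 3.0]}, "x", ["absEnergy", "max"])
print(row)  # {'x_absEnergy': 14.0, 'x_max': 3.0}
```

However many records the batch holds, the result is one row, which is why the original columns do not appear in the output.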

Build two features, absEnergy and max.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.window.tumbling[00:01:00; `time]
  .qsp.ml.freshCreate[`x; `absEnergy`max]
  .qsp.write.toConsole[];

publish ([] time: .z.p+00:00:01 * til 500; x: 500?1f);

Build all features.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.window.count[100]
  .qsp.ml.freshCreate[`x; ::]
  .qsp.write.toVariable[`output];

publish ([] x: 500?1f; y: 500?100);

`.qsp.ml.logClassifier`

Logistic classifier fit using stochastic gradient descent

.qsp.ml.logClassifier[X;y;udf]
.qsp.ml.logClassifier[X;y;udf; .qsp.use (!) . flip (
    (`modelArgs ; modelArgs);
    (`bufferSize; bufferSize))]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or function`	Column names or user defined function to extract features.	Required
`y`	`symbol or function`	Column name or user defined function to extract labels.	Required
`udf`	`symbol or function`	Column name or user defined function to append predictions.	Required

options:

name	type	description	default
`modelArgs`	`list`	A length two list of trend and configuration.	`(1b;()!())`
`bufferSize`	`long`	Number of elements to buffer before fitting, or 0 to fit on the first batch. If a batch causes the buffer size to be exceeded, the additional records in that batch will also be included when training.	`0`

Returns:

type	description
`table`	Returns data with predictions appended.

The model is fit on the first n records in the stream, where n is given by the buffer size. After the model has been fit, subsequent data is used to update the model in an online fashion. The operator outputs the original data as a table with the predictions appended.
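The buffering behaviour described by the bufferSize option can be sketched as follows. This is plain Python written only to illustrate the control flow, not the operator's implementation; the class `BufferedFitter` and its state names are hypothetical.

```python
class BufferedFitter:
    """Buffer records until buffer_size is reached, then fit once on everything buffered."""

    def __init__(self, buffer_size):
        self.buffer_size = buffer_size
        self.buffer = []
        self.fitted = False
        self.trained_on = 0

    def on_batch(self, batch):
        if self.fitted:
            return "emit"        # model already fit: the batch goes to the fitted model
        self.buffer.extend(batch)
        if len(self.buffer) < self.buffer_size:
            return "buffering"   # nothing is emitted while the buffer fills
        # The whole buffer is used for fitting, including records beyond
        # buffer_size that arrived in the batch that crossed the threshold.
        self.trained_on = len(self.buffer)
        self.fitted = True
        return "fit"


op = BufferedFitter(1000)
states = [op.on_batch([0] * 500), op.on_batch([0] * 600), op.on_batch([0] * 10)]
print(states, op.trained_on)  # ['buffering', 'fit', 'emit'] 1100
```

With a buffer size of 0 the first batch already satisfies the threshold, so fitting happens on the first batch.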

Performance Limitations

This functionality is not currently encouraged for use in high throughput environments. Prediction times for this function are on the order of milliseconds. Further optimizations are expected in later releases.

Fit, update, and predict with a logistic classification model.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.logClassifier[`x;`y;`yHat; .qsp.use `modelArgs`bufferSize!((1b;()!());1000)]
  .qsp.write.toVariable[`output];

// Data will be buffered for training until the buffer size is reached,
// during which time no batches will be emitted.
publish ([] x:asc 500?1f; y:asc 500?0b);

// When the buffer size is reached, buffered data will be used for training,
// and will itself be classified and emitted.
publish ([] x:asc 500?1f; y:asc 500?0b);

// The operator can now be used to make predictions.
// Subsequent data will not be used for training, as the bufferSize has been exceeded.
publish ([] x:asc 10?1f; y:asc 10?0b);

`.qsp.ml.sequentialKMeans`

Sequential K-Means clustering

.qsp.ml.sequentialKMeans[X]
.qsp.ml.sequentialKMeans[X; .qsp.use (!) . flip (
    (`df        ; df);
    (`k         ; k);
    (`centers   ; centers);
    (`config    ; config);
    (`bufferSize; bufferSize))]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or function`	Column names or user defined function to extract features.	Required

options:

name	type	description	default
`df`	`symbol`	Distance function used in clustering.	`edist`
`k`	`long`	The number of clusters.	`3`
`centers`	`dictionary or ::`	Initial cluster centers.	`::`
`config`	`dictionary`	Configuration for sequential K-Means clustering cf. `.ml.online.clust.sequentialKMeans.fit`.	`()!()`
`bufferSize`	`long`	Number of elements to buffer before fitting, or 0 to fit on the first batch. If a batch causes the buffer size to be exceeded, the additional records in that batch will also be included when training.	`0`

Returns:

type	description
`table or ::`	Null during initial fitting. Afterwards returns data with clusters appended.

The sequential K-Means algorithm is applied within a streaming framework. The first points, up to the buffer size, are used to fit the model. After this, each new collection of data points is used to update the current cluster centers, and predictions are made as to which cluster each point belongs.
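The update step can be illustrated with the textbook sequential K-Means rule, in which the nearest center moves toward each new point. This plain-Python sketch assumes a learning rate of 1/count (a running mean of the points assigned to a center). The operator's actual rate is governed by its config option and may differ, and the function names here are hypothetical.

```python
def nearest(centers, point):
    """Index of the center with the smallest squared Euclidean distance to point."""
    dists = [sum((c - p) ** 2 for c, p in zip(center, point)) for center in centers]
    return dists.index(min(dists))


def sequential_update(centers, counts, point):
    """Assign point to its nearest center, then move that center toward the point."""
    i = nearest(centers, point)
    counts[i] += 1
    rate = 1.0 / counts[i]  # assumed learning rate: running mean of assigned points
    centers[i] = [c + rate * (p - c) for c, p in zip(centers[i], point)]
    return i  # the predicted cluster for this point


centers = [[0.0, 0.0], [10.0, 10.0]]
counts = [1, 1]
cluster = sequential_update(centers, counts, [2.0, 0.0])
print(cluster, centers[0])  # 0 [1.0, 0.0]
```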

Fit, update, and predict with the sequential K-Means model.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.sequentialKMeans[`x`x1`x2; .qsp.use enlist[`bufferSize]!enlist 100]
  .qsp.write.toConsole[];

publish ([]100?1f;100?1f;100?1f);
publish ([] 50?1f; 50?1f; 50?1f);

`.qsp.ml.linearRegression`

Linear regressor fit using stochastic gradient descent

.qsp.ml.linearRegression[X;y;udf]
.qsp.ml.linearRegression[X;y;udf; .qsp.use (!) . flip (
    (`modelArgs ; modelArgs);
    (`bufferSize; bufferSize))]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or function`	Column names or user defined function to extract features.	Required
`y`	`symbol or function`	Column name or user defined function to extract the target variable.	Required
`udf`	`symbol or function`	Column name or user defined function to append predictions.	Required

options:

name	type	description	default
`modelArgs`	`list`	A length two list of trend and configuration.	`(1b;()!())`
`bufferSize`	`long`	Number of elements to buffer before fitting, or 0 to fit on the first batch. If a batch causes the buffer size to be exceeded, the additional records in that batch will also be included when training.	`0`

Returns:

type	description
`table`	Returns data with predictions appended.

The model is fit on the first n records in the stream, where n is given by the buffer size. After the model has been fit, subsequent data is used to update the model in an online fashion. The operator outputs the original data as a table with the predictions appended.

Fit, update, and predict with a linear regression model.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.linearRegression[`x;`y;`yHat; .qsp.use `modelArgs`bufferSize!((1b;()!());10000)]
  .qsp.write.toVariable[`output];

// Data will be buffered for training until the buffer size is reached,
// during which time no batches will be emitted.
publish ([] x:asc 5000?1f; y:asc 5000?1f);

// When the buffer size is reached, buffered data will be used for training,
// and predictions for that data will be emitted.
publish ([] x:asc 5000?1f; y:asc 5000?1f);

// The operator can now be used to make predictions.
// Subsequent data will not be used for training, as the bufferSize has been exceeded.
publish ([] x:asc 100?1f; y:asc 100?1f);

`.qsp.ml.score`

Score the performance of a model

.qsp.ml.score[y;predictions;metric]

Parameters:

name	type	description	default
`y`	`symbol or function`	The column name of the target variable, or a function to generate the target variable from the batch.	Required
`predictions`	`symbol or function`	The column name of the predictions, or a function to generate the predictions from the batch.	Required
`metric`	`symbol`	The metric on which to evaluate model performance.	Required

Returns:

type	description
`any`	The score given by the metric.

Score the performance of a model over time, allowing changes in model performance to be evaluated. The values returned are the cumulative scores, rather than scores for the individual batches.

The following metrics are currently supported:

f1
accuracy
mse
rmse
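The difference between a cumulative score and a per-batch score can be shown with accuracy. The sketch below is plain Python and assumes the cumulative score is computed over every record seen so far; how the operator accumulates its state internally is not specified here, and the class name is hypothetical.

```python
class CumulativeAccuracy:
    """Track accuracy over all records seen so far, not just the latest batch."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, y_true, y_pred):
        self.correct += sum(t == p for t, p in zip(y_true, y_pred))
        self.total += len(y_true)
        return self.correct / self.total


scorer = CumulativeAccuracy()
first = scorer.update([1, 0, 1, 1], [1, 0, 0, 1])  # 3 of 4 correct -> 0.75
second = scorer.update([0, 0], [0, 0])             # 5 of 6 correct overall
```

The second batch is perfectly predicted, yet the value returned is 5/6 rather than 1, because the earlier error still counts toward the cumulative score.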

This example fits a scikit-learn model, then the pipeline predicts y and calculates the cumulative F1 score of the model on receipt of new data.

// Retrieve a dataset and format appropriately
dataset:.p.import[`sklearn.datasets;`:load_breast_cancer][];
X:dataset[`:data]`;
y:dataset[`:target]`;
data: ([] y: y) ,' flip (`$"x",/:string til count first X)!flip X;

// Split data into training and testing set
temp: (floor .8 * count data) cut data;
training: temp 0;
testing : temp 1;

features:flip value flip delete y from training;
targets :training`y;

// Train the model
clf:.p.import[`sklearn.tree]`:DecisionTreeClassifier;
clf:clf[`max_depth pykw 3];
clf[`:fit][features;targets];

// Set model within existing registry
.ml.registry.set.model[::;::;clf;"skModel";"sklearn";::];

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.predict[
    {delete y from x};
    `pred;
    .qsp.use enlist[`model]!enlist"skModel"]
  .qsp.ml.score[`y; `pred; `f1]
  .qsp.write.toConsole[];

publish testing;

This example first fits a q model, then the pipeline predicts y and scores the cumulative accuracy on receipt of new data.

// Retrieve a dataset and format appropriately
dataset:.p.import[`sklearn.datasets;`:load_breast_cancer][];
X:dataset[`:data]`;
y:dataset[`:target]`;
data: ([] y: y) ,' flip (`$"x",/:string til count first X)!flip X;

// Split the data into training and testing sets
temp: (floor .8 * count data) cut data;
training: temp 0;
testing : temp 1;

features:flip value flip delete y from training;
targets :training`y;

// Train the model
model:.ml.online.sgd.logClassifier.fit[features;targets;1b;::];

// Add the model to the existing registry
.ml.registry.set.model[::;::;model;"myModel";"q";::]

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.predict[
    {delete y from x};
    `pred;
    .qsp.use enlist[`model]!enlist"myModel"]
  .qsp.ml.score[`y; `pred; `accuracy]
  .qsp.write.toConsole[]

publish testing

`.qsp.ml.dropConstant`

Drops columns with constant values

.qsp.ml.dropConstant[X]
.qsp.ml.dropConstant[X; .qsp.use enlist[`bufferSize]!enlist bufferSize]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or dictionary or ::`	Symbol or list of symbols indicating the columns to drop. If dictionary is passed in then columns with the associated value are dropped. If null then a buffer is used to identify constant columns.	Required

options:

name	type	description	default
`bufferSize`	`long`	Number of elements to buffer before fitting, or 0 to fit on the first batch. If a batch causes the buffer size to be exceeded, the additional records in that batch will also be included when training.	`0`

Returns:

type	description
`table`	New table with constant columns dropped.

The columns to be removed are either specified by the user beforehand in the form of a dictionary/list, or are determined using the .ml.dropConstant function on a batch of a specified size. If a column that is thought to be constant gives a non-constant value, an error is thrown.
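As an illustration of the two stages (identify constant columns on a buffer, then drop and check them on every batch), here is a minimal plain-Python sketch. Tables are modelled as dicts of column name to list. The helpers `find_constant` and `drop_constant` are hypothetical and do not reflect how `.ml.dropConstant` is implemented.

```python
def find_constant(table):
    """Map each constant column in the buffered table to its single value."""
    return {name: col[0] for name, col in table.items() if len(set(col)) == 1}


def drop_constant(table, expected):
    """Drop the expected-constant columns, raising if any of them has changed."""
    for name, value in expected.items():
        if any(v != value for v in table[name]):
            raise ValueError(f"column {name!r} is no longer constant")
    return {name: col for name, col in table.items() if name not in expected}


buffered = {"motorID": [0, 0, 0], "rpms": [1000, 1003, 1001]}
expected = find_constant(buffered)          # {'motorID': 0}
result = drop_constant(buffered, expected)  # {'rpms': [1000, 1003, 1001]}
```

Passing a dictionary to the operator corresponds to supplying `expected` directly instead of deriving it from a buffer.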

Drop the constant columns protocol and response.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.dropConstant[`protocol`response]
  .qsp.write.toConsole[];
publish ([] protocol: `TCP; response: 200i; latency: 10?5f; size: 10?10000);

Drop the columns id and ratio, checking that their values match the expected constant values.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.dropConstant[`id`ratio!(1; 2f)]
  .qsp.write.toConsole[];

publish ([] id: 1; ratio: 2f; data: 10?10f);

Drop columns whose value is constant for all buffered records.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.dropConstant[::;.qsp.use enlist[`bufferSize]!enlist 100]
  .qsp.write.toConsole[];

publish ([] motorID: 0; rpms: 1000 + 200?10; temp: 60 + 200?5)

`.qsp.ml.featureHasher`

(Beta Feature) Encodes categorical data across several numeric columns

Beta Features

To enable beta features, set the environment variable KXI_SP_BETA_FEATURES to true.

.qsp.ml.featureHasher[X;n]

Parameters:

name	type	description	default
`X`	`symbol or symbol[]`	Symbol or list of symbols indicating the columns to act on.	Required
`n`	`long`	The number of numeric columns used to represent a variable.	Required

Returns:

type	description
`table`	New table with a column for each feature/hash value pair, with the columns specified by `X` removed.

This operator is used to encode categorical variables numerically. It is similar to one-hot encoding, but does not require the categories or number of categories to be known in advance.

It converts each chosen column into n columns, sending each string/symbol to its truncated hash value. The hash function employed is the signed 32-bit version of Murmurhash3.

As the mapping between values and their hashed representations is effectively random, collisions are possible, and the hash space must be made large enough to reduce collisions to an acceptable level.

This functionality operates exclusively on string/symbol columns; numeric columns are not supported.
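The hashing idea can be sketched in plain Python. Note the substitutions: `zlib.crc32` stands in for MurmurHash3 because it is available in the standard library, and the sketch ignores any sign handling, so the slots it picks will not match the operator's output. The function `hash_feature` is hypothetical.

```python
import zlib


def hash_feature(value, n):
    """Return a length-n row with a 1 in the slot chosen by the hash of value."""
    # zlib.crc32 is a stand-in; the operator uses signed 32-bit MurmurHash3.
    slot = zlib.crc32(value.encode("utf-8")) % n
    row = [0] * n
    row[slot] = 1
    return row


rows = {city: hash_feature(city, 10) for city in ["london", "paris", "berlin", "miami"]}
for city, row in rows.items():
    print(city, row)
```

No list of categories is needed up front: any new string hashes to one of the n slots. Two distinct strings can land in the same slot, which is the collision risk mentioned above, and a larger n makes this less likely.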

Encode a list of categorical values.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.featureHasher[`location; 10]
  .qsp.write.toConsole[];

publish ([] location: 20?`london`paris`berlin`miami; num: til 20);

Hash multiple columns.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.featureHasher[`srcIP`destIP; 14]
  .qsp.write.toVariable[`output];

IPs: "." sv/: string 4 cut 100?256;
publish ([] srcIP: 100?IPs; destIP: 100?IPs; latency: 100?10; size: 100?10000);

`.qsp.ml.labelEncode`

Encodes symbolic columns as numeric data

.qsp.ml.labelEncode[X]
.qsp.ml.labelEncode[X; .qsp.use enlist[`bufferSize]!enlist bufferSize]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or dictionary or ::`	Symbol or list of symbols indicating the columns to act on, or a dictionary containing the columns and their original expected values. If null, all symbol columns are encoded.	Required

options:

name	type	description	default
`bufferSize`	`long`	Number of elements to buffer before fitting, or 0 to fit on the first batch. If a batch causes the buffer size to be exceeded, the additional records in that batch will also be included when training.	`0`

Returns:

type	description
`table`	New table with symbols encoded.

This operator encodes symbolic columns within input data. Data is first collected in a buffer until the required size is reached. The specified symbol columns in the buffered data are then encoded, and the mapping is stored as the operator's state. If additional values appear in subsequent batches, the state is updated to reflect this.
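A minimal plain-Python sketch of a label encoder whose stored mapping grows as new values arrive is shown below. It assumes integers are assigned in order of first appearance; the operator may order its mapping differently, and the class here is hypothetical.

```python
class LabelEncoder:
    """Encode values as integers, keeping the mapping as state between batches."""

    def __init__(self):
        self.mapping = {}

    def encode(self, values):
        for v in values:
            # Unseen values extend the stored mapping with the next free integer.
            self.mapping.setdefault(v, len(self.mapping))
        return [self.mapping[v] for v in values]


enc = LabelEncoder()
first = enc.encode(["a", "b", "a"])  # [0, 1, 0]
second = enc.encode(["c", "a"])      # [2, 0]: "c" is new, "a" keeps its code
```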

Encode all symbol columns within the data.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.labelEncode[::]
  .qsp.write.toConsole[];

publish ([]10?`a`b`c;10?`d`e`f;10?1f);

Encode symbols in column x.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.labelEncode[`x]
  .qsp.write.toConsole[]

publish ([]10?`a`b`c;10?`d`e`f;10?1f);

Encode the symbols in the encoded column with the mapping specified.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.labelEncode[(enlist `encoded)!enlist `small`medium`large]
  .qsp.write.toConsole[];

data: 10?`small`medium`large;
publish ([] original: data; encoded: data);

`.qsp.ml.minMaxScaler`

Apply min-max scaling to streaming data

.qsp.ml.minMaxScaler[X]
.qsp.ml.minMaxScaler[X; .qsp.use (!) . flip (
    (`bufferSize; bufferSize);
    (`rangeError; rangeError))]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or dictionary or ::`	Symbol or list of symbols indicating the columns to be min-max scaled. If null, then all columns will be min-max scaled. Alternatively, a dictionary of column names and ranges can be specified, where `::` can be used in place of the range to scale based on the initial buffer.	Required

options:

name	type	description	default
`bufferSize`	`long`	Number of elements to buffer before fitting, or 0 to fit on the first batch. If a batch causes the buffer size to be exceeded, the additional records in that batch will also be included when training.	`0`
`rangeError`	`boolean`	Whether to raise an error if new data falls outside the data range specified by the first batch.	`0b`

Returns:

type	description
`table`	New table with defined columns min-max scaled.

Apply min-max scaling to user specified columns, where scaling is based on the user supplied limits for the minimum and maximum values, or determined by the first buffered batch of data. This function can be configured to error if data supplied falls outside this range and to accumulate a buffer of data prior to determining the minimum/maximum values from the data.
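The scaling itself is the standard min-max formula, (x - min) / (max - min). The following plain-Python sketch shows fitting the range on a first batch and the optional range check. The function names are hypothetical, and the sketch assumes the fitted maximum is strictly greater than the minimum.

```python
def fit_range(values):
    """Determine the scaling range from the first (buffered) batch."""
    return min(values), max(values)


def min_max_scale(values, lo, hi, range_error=False):
    """Scale values so that lo maps to 0 and hi maps to 1 (assumes hi > lo)."""
    if range_error and any(v < lo or v > hi for v in values):
        raise ValueError("value outside the fitted range")
    return [(v - lo) / (hi - lo) for v in values]


lo, hi = fit_range([2, 4, 6])
scaled = min_max_scale([2, 4, 6], lo, hi)  # [0.0, 0.5, 1.0]
outside = min_max_scale([8], lo, hi)       # [1.5]: allowed when range_error is off
```

With the range check disabled, later values outside the fitted range simply scale to values below 0 or above 1.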

Apply min-max scaling on all data.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.minMaxScaler[::]
  .qsp.write.toConsole[];

publish ([]20?5;20?5;20?10)

Apply min-max scaling on the specified columns.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.minMaxScaler[`x`x1]
  .qsp.write.toConsole[];

publish ([]20?5;20?5;20?10)

Apply min-max scaling on the columns rating and cost, with supplied minimum and maximum values for rating and a range for cost based on a buffer.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.minMaxScaler[`rating`cost!(0 10;::); .qsp.use enlist[`bufferSize]!enlist 200]
  .qsp.write.toConsole[];

publish ([] rating: 3 + 250?5; cost: 250?1000f)

Error when passed batches containing data outside the min-max bounds.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.minMaxScaler[::;.qsp.use enlist[`rangeError]!enlist 1b]
  .qsp.write.toConsole[]

// As no buffer is specified, the min and max values are fit using the initial batch
publish ([]100?5;100?5;100?10)

// As `rangeError` has been set, this batch will cause an error by exceeding the
// expected maximum values
publish 1+([]100?5;100?5;100?10)

`.qsp.ml.oneHot`

One hot encodes relevant columns

.qsp.ml.oneHot[X]
.qsp.ml.oneHot[X; .qsp.use enlist[`bufferSize]!enlist bufferSize]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or dictionary or ::`	Symbol, list of symbols, dictionary, or null indicating the columns to encode. If dictionary is passed in then associated values are used for the encoding.	Required

options:

name	type	description	default
`bufferSize`	`long`	Number of elements to buffer before fitting, or 0 to fit on the first batch. If a batch causes the buffer size to be exceeded, the additional records in that batch will also be included when training.	`0`

Returns:

type	description
`table`	New table with defined symbol columns encoded.

Allows symbolic or string data to be encoded into numeric representations. The algorithm works by first collecting data in a buffer to the specified size then:

If the columns are specified by a symbol or list of symbols, the algorithm fits the one hot encoding on the buffer.
If null is passed, all symbol columns are selected and the algorithm again fits the one hot encoding on the buffer.
If a dictionary is passed, the keys with symbol[] values are used directly to fit the encoding, while keys with a null value are fit on the buffer.

Once the algorithm has been fit, all data in the stream is transformed using the given one hot encoding. If data in a subsequent batch contains symbols that were not present at the time of fitting, these symbols will be mapped to zero.
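The fit-then-transform behaviour, including the handling of unseen symbols, can be sketched in plain Python. The sketch interprets "mapped to zero" as a row of all zeros, which is an assumption, and the helper names are hypothetical.

```python
def fit_one_hot(values):
    """Fix the category list from the buffered data (or from user-supplied values)."""
    return sorted(set(values))


def one_hot(value, categories):
    """One column per fitted category; a value outside the categories gives all zeros."""
    return [1 if value == c else 0 for c in categories]


cats = fit_one_hot(["b", "a", "c", "a"])  # ['a', 'b', 'c']
seen = one_hot("b", cats)                 # [0, 1, 0]
unseen = one_hot("z", cats)               # [0, 0, 0]
```

Supplying the categories through a dictionary corresponds to skipping `fit_one_hot` and passing the known list directly, which helps when some categories may be absent from the buffered data.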

Encode all the symbolic or string columns.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.oneHot[::]
  .qsp.write.toConsole[];

publish ([] action: 10?`upload`download; fileType: 10?("image";"audio";"document"); size: 10?100000)

Encode the column x.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.oneHot[`x]
  .qsp.write.toConsole[];

publish ([] x:10?`a`b`c; y:10?1f)

Encode the columns x and x1 with a required buffer.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.oneHot[`x`x1; .qsp.use enlist[`bufferSize]!enlist 200]
  .qsp.write.toConsole[];

publish ([] 250?`a`b`c; 250?`d`e`f`j; 250?0b)

Encode the columns axis and status using given values. This is useful when the categories are known in advance, but may not be present in the training data.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.oneHot[`axis`status!(`x`y`z; `normal`error)]
  .qsp.write.toConsole[];

publish ([] axis: 100?`x`y`z; status: `normal; position: 100?50f)

Encode the columns axis and status using a hybrid method: axis is fit on the buffer, while status uses the given values.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.oneHot[`axis`status!(::; `normal`error)]
  .qsp.write.toConsole[];

publish ([] axis: 100?`x`y`z; status: `normal; position: 100?50f)

`.qsp.ml.standardize`

Apply standardization to streaming data

.qsp.ml.standardize[X]
.qsp.ml.standardize[X; .qsp.use enlist[`bufferSize]!enlist bufferSize]

Parameters:

name	type	description	default
`X`	`symbol or symbol[] or ::`	Symbol or list of symbols indicating the columns to be standardized. If null, then all columns will be standardized.	Required

options:

name	type	description	default
`bufferSize`	`long`	Number of elements to buffer before fitting, or 0 to fit on the first batch. If a batch causes the buffer size to be exceeded, the additional records in that batch will also be included when training.	`0`

Returns:

type	description
`table`	New table with defined columns standardized.

Apply standardization to user specified columns, where scaling is determined by the buffered data, or the first batch if bufferSize is 0. The mean and standard deviation are computed on this data. These statistics are then used to standardize subsequent batches, by removing the mean and scaling by the standard deviation.
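A plain-Python sketch of this fit-once, apply-later pattern follows. It assumes the population standard deviation is used, which is not confirmed for the operator, and the function names are hypothetical.

```python
import statistics


def fit_standardize(values):
    """Compute the statistics once, on the buffered data or the first batch."""
    # Population standard deviation is an assumption about the operator.
    return statistics.mean(values), statistics.pstdev(values)


def standardize(values, mean, std):
    """Remove the fitted mean and scale by the fitted standard deviation."""
    return [(v - mean) / std for v in values]


mean, std = fit_standardize([2.0, 4.0, 6.0])
z = standardize([2.0, 4.0, 6.0], mean, std)  # zero mean and unit deviation on the fitting data
later = standardize([10.0], mean, std)       # later batches reuse the same statistics
```

Because the statistics are fixed at fit time, later batches are not guaranteed to have zero mean or unit deviation themselves.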

Apply standardization to all data.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.standardize[::]
  .qsp.write.toConsole[];

publish ([]100?5;100?5;100?10)

Apply standardization to specified columns.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.standardize[`x`x1]
  .qsp.write.toConsole[];

publish ([]100?5;100?5;100?10)

This pipeline applies standardization to all columns based on a buffer.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.standardize[::; .qsp.use enlist[`bufferSize]!enlist 200]
  .qsp.write.toConsole[];

publish ([] length: 100 + 250?2f; width: 10 + 250?1f);

`.qsp.ml.registry.fit`

Fit model to batch of data and predict target for future batches

.qsp.ml.registry.fit[X;y;untrained;modelType;udf]
.qsp.ml.registry.fit[X;y;untrained;modelType;udf; .qsp.use (!) . flip (
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`config    ; config);
    (`modelArgs ; modelArgs);
    (`bufferSize; bufferSize))]

Parameters:

name	type	description	default
`X`	`symbol[] or function`	The predictor variable's column names or a function to generate the predictors from the batch.	Required
`y`	`symbol or function or ::`	The target variable's column name or a function to generate the target from the batch. This must be `::` when training an unsupervised model.	Required
`untrained`	`function`	An untrained q/sklearn model.	Required
`modelType`	`string`	Indication as to whether a model is `"q"` or `"sklearn"`.	Required
`udf`	`function or symbol`	A function to score the quality of the model or join predictions into the batch. If this is a symbol, the predictions are appended to the batch as a new column.	Required

Functional UDF requirements

The udf parameter for the .qsp.ml.registry.fit operator is a function with the following parameters:

udf:{[data;y;predictions;modelInfo]
    update yhat: predictions from data
    }

name	type	description
`data`	`any`	The batch passed to the operator, only the data not the metadata.
`y`	`symbol or function or ::`	The target variable, as extracted by the `y` parameter. In the unsupervised case this is populated with nulls.
`predictions`	`list`	The predictions for each record in the batch.
`modelInfo`	`::`	Currently unused and always set to `::`.

options:

name	type	description	default
`registry`	`string`	The registry to load from.	`::`
`experiment`	`string`	The experiment name.	`::`
`model`	`string`	The model name in the registry.	`::`
`config`	`any`	The config parameter for `.ml.registry.set.model`.	`()!()`
`modelArgs`	`list`	A list of arguments to pass to the model after `X` and `y`.	`::`
`bufferSize`	`long`	Number of records to buffer before training a model. If 0, the model will be fit on the first batch. If a batch causes the buffer size to be exceeded, the additional records in that batch will also be included when training.	`0`

Returns:

type	description
`any`	The current batch, modified in accordance with the `udf` parameter.

Fits a model to a batch or buffer of data, saving the model to the registry, and predicting the target variable for future batches after the model has been trained.

N.B. This is only for models that cannot be trained incrementally. For other models, .qsp.ml.registry.update should be used.

Fit a q model on a batch.

// Generate initial data to be used for fitting
a:500?1f
b:500?1f
data:([]a;b;y:a+b)

// Define optional variables
optKeys:`registry`experiment`model`modelArgs
optVals:(::;::;"sgdLR";(1b; `maxIter`gTol`seed!(100;-0w;42)))
opt:optKeys!optVals

// Define execution pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.fit[
    {delete y from x};
    `y;
    .ml.online.sgd.linearRegression;
    "q";
    `yhat;
    .qsp.use opt
    ]
  .qsp.write.toConsole[]

publish data

// View model stored in registry
.ml.registry.get.modelStore[::;::]

Fit an sklearn model.

// Generate initial data to be used for fitting
data:([]x:asc 100?1f;x1:100?1f;y:desc 100?5)

// Define an untrained random forest classifier
rfc:.p.import[`sklearn.ensemble][`:RandomForestClassifier][`max_depth pykw 2]

// Define execution pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.fit[
    {delete y from x};
    {exec y from x};
    rfc;
    "sklearn";
    `yhat]
  .qsp.write.toConsole[]

publish data

Fit an unsupervised model.

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.fit[
    `x`x1`x2;
    ::;
    .ml.clust.kmeans;
    "q";
    `cluster;
    .qsp.use enlist[`modelArgs]!enlist(`e2dist;3;::)
    ]
  .qsp.write.toConsole[]

publish ([]x:1000?1f;x1:1000?1f;x2:1000?1f)

`.qsp.ml.registry.predict`

Predict a target variable using a model

.qsp.ml.registry.predict[X;udf]
.qsp.ml.registry.predict[X;udf; .qsp.use (!) . flip (
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`version   ; version))]

Parameters:

name	type	description	default
`X`	`symbol[] or function`	The predictor variable's column names or a function to generate the predictors from the batch.	Required
`udf`	`function or symbol`	A user-defined function for integrating the predictions into the batch, or a column name to join them to the table as a new column.	Required

Functional UDF requirements

The udf parameter for the .qsp.ml.registry.predict operator is a function with the following parameters:

udf:{[data;y;predictions;modelInfo]
    update yhat: predictions from data
    }

name	type	description
`data`	`any`	The batch passed to the operator, only the data not the metadata.
`y`	`symbol or function or ::`	The target variable, as extracted by the `y` parameter.
`predictions`	`list`	The predictions for each record in the batch.
`modelInfo`	`::`	Currently unused and always set to `::`.

options:

name	type	description	default
`registry`	`string`	The registry to load from.	`::`
`experiment`	`string`	The experiment name.	`::`
`model`	`string`	The model name in the registry.	`::`
`version`	`float`	The version to load.	`::`

Returns:

type	description
`any`	The current batch, modified in accordance with the `udf` parameter.

.qsp.ml.registry.predict will predict the target value for each record in the batch, using a model from the registry.

The user-defined function udf can join these predictions into the data, or do any arbitrary computation. Note that below data is the whole batch, not just those fields extracted by X. Additionally, modelInfo is a catch-all for any model-specific outputs.

.qsp.ml.registry.predict[X; {[data;y;predictions;modelInfo]
    update temperature: predictions from data
    }; .qsp.use `registry`experiment`model`version!(registry;experiment;model;version)]

In lieu of a user-defined function, this parameter can also just be the name of a new column, or the name of an existing column to overwrite it.

.qsp.ml.registry.predict[X;`temperature;
  .qsp.use`registry`experiment`model`version!(registry;experiment;model;version)]

Predict using an sklearn model, adding predictions to the initial data.

N:1000
data:([]x:asc N?1f;x1:desc N?10;x2:N?1f;y:asc N?5)

features:flip value flip delete y from data

clf1:.p.import[`sklearn.tree]`:DecisionTreeClassifier
clf1:clf1[`max_depth pykw 3]
clf1[`:fit][features;data`y]

// Set the model within the existing registry
.ml.registry.set.model[::;::;clf1;"skModel";"sklearn";::]

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.predict[
    {delete y from x};
    `yhat;
    .qsp.use enlist[`model]!enlist"skModel"]
  .qsp.write.toConsole[]

publish data

Predict using a q model, adding predictions to the initial data.

// Define data for fitting the model
N:1000
data:([]x:N?1f;x1:N?1f;x2:N?1f)

// Fit a model
kmeansModel:.ml.clust.kmeans.fit[data`x`x1`x2;`e2dist;6;enlist[`iter]!enlist 1000]

// Set the model within existing registry
.ml.registry.set.model[::;::;kmeansModel;"kmeansModel";"q";::]

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.predict[
    `x`x1`x2;
    `yhat;
    .qsp.use enlist[`model]!enlist"kmeansModel"]
  .qsp.write.toConsole[]

publish data

`.qsp.ml.registry.update`

Train a model incrementally, returning predictions for each record in a batch

.qsp.ml.registry.update[X;y;udf]
.qsp.ml.registry.update[X;y;udf; .qsp.use (!) . flip (
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`version   ; version);
    (`supervised; supervised);
    (`untrained ; untrained);
    (`modelType ; modelType);
    (`modelArgs ; modelArgs))]

Parameters:

name	type	description	default
`X`	`symbol[] or function`	The predictor variable's column names or a function to generate the predictors from the batch.	Required
`y`	`symbol or function`	The target variable's column name or a function to generate this from the batch.	Required
`udf`	`function or symbol`	A function to score the quality of the model or join predictions into the batch.	Required

Functional UDF requirements

The udf parameter for the .qsp.ml.registry.update operator is a function with the following parameters:

udf:{[data;y;predictions;modelInfo]
    update yhat: predictions from data
    }

name	type	description
`data`	`any`	The batch passed to the operator, only the data not the metadata.
`y`	`symbol or function or ::`	The target variable, as extracted by the `y` parameter.
`predictions`	`list`	The predictions for each record in the batch.
`modelInfo`	`::`	Currently unused and always set to `::`.

options:

name	type	description	default
`registry`	`string`	The registry to load from.	`::`
`experiment`	`string`	The experiment name.	`::`
`model`	`string`	The model name in the registry.	`::`
`version`	`float`	The version to load.	`::`
`supervised`	`boolean`	Indicates whether the model is supervised.	`1b`
`untrained`	`function or embedpy`	An untrained ML model e.g. `.ml.online.sgd.linearRegression`.	`::`
`modelType`	`string`	One of `"q"` or `"sklearn"` defining the type of model.	`::`
`modelArgs`	`list`	A list of arguments to pass to the model after `X` and `y`.	`::`

Returns:

type	description
`any`	The current batch, modified in accordance with the `udf` parameter.

Train a model incrementally, returning predictions for each record in a batch. A user-defined function can be used to join these predictions into the data, or do any arbitrary computation.

Python support

Currently this functionality is only supported for q models. Support for deployment of online learning models written in Python is scheduled for a later release.

Fit an untrained q model which can be updated, adding predictions to the initial data.

// Initialise functionality and data required for running example
a:500?1f
b:500?1f
data:([]a;b;y:a+b)

.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.update[
    {delete y from x};
    {exec y from x};
    `yhat;
    .qsp.use
      `untrained`modelType`modelArgs!(.ml.online.sgd.linearRegression;"q";(1b;()!()))]
  .qsp.write.toConsole[]

publish data;