Machine Learning
.qsp.ml
operator | description |
---|---|
freshCreate | turns batches of data into features based on aggregated statistics |
logClassifier | fits a logistic classification model on batches of data using stochastic gradient descent |
sequentialKMeans | fits a sequential k-means model on batches of data |
linearRegression | fits a linear regression model to batches of data |
score | evaluates a model's predictions |
dropConstant | drops constant columns from incoming data |
featureHasher | hashes feature names into sparse matrices |
labelEncode | encodes symbolic data into numerical values |
minMaxScaler | min-max scales a supplied dataset |
oneHot | replaces symbolic values with numerical vector representations |
standardize | standardizes a supplied dataset |
registry.fit | fits a model to batches of data, saving a model to a registry |
registry.predict | predicts a target variable using a trained model from the registry |
registry.update | trains a model incrementally, returning predictions for all records |
.qsp.ml.freshCreate
Turns batches of data into features using aggregated statistics
.qsp.ml.freshCreate[X;features]
Parameters:
name | type | description | default |
---|---|---|---|
X | symbol or symbol[] | The columns to use for feature generation. | Required |
features | :: or symbol or symbol[] | The list of features to apply to the columns. | Required |
For all common arguments, refer to configuring operators
Returns:
type | description |
---|---|
table | Table with a column for each new feature. |
Converts each chosen column into a collection of feature values based on the supplied FRESH features. Typically, the operator is intended to be used in conjunction with the windowing operators that provide regular batches of data from which we engineer features. The aggregate statistics used to create these features can be as simple as max/min/count.
For the features parameter, if it is set to:
- :: applies all features.
- noHyperparameters applies all features except those that take hyperparameters.
- noPython applies all features that don't rely on Python.
As this aggregates a batch to a single row of aggregated statistics, the output table does not include the original columns.
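To make the aggregation concrete, here is a minimal sketch in plain q of how a batch collapses to a single row of features. It is not the operator's implementation: the helper names (absEnergy, feats, aggregate), the column naming scheme, and the definition of absEnergy as the sum of squared values are assumptions made for illustration.

```q
// Hypothetical feature functions, for illustration only
absEnergy:{sum x*x}                      // assumed definition: sum of squared values
feats:`absEnergy`max!(absEnergy; max)

// Collapse one column of a batch into a single row of named features
aggregate:{[batch;col]
  vals:batch col;
  enlist (`$string[col],/:"_",/:string key feats)!(value feats)@\:vals }

batch:([] x: 1 2 3f; y: 10 20 30)
aggregate[batch;`x]    // one row with columns x_absEnergy and x_max
```

The result has one row per batch and no trace of the original x and y columns, which matches the behaviour described above.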
Build two features, absEnergy and max.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.window.tumbling[00:01:00; `time]
  .qsp.ml.freshCreate[`x; `absEnergy`max]
  .qsp.write.toConsole[];
publish ([] time: .z.p+00:00:01 * til 500; x: 500?1f);
Build all features.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.window.count[100]
  .qsp.ml.freshCreate[`x; ::]
  .qsp.write.toVariable[`output];
publish ([] x: 500?1f; y: 500?100);
.qsp.ml.logClassifier
Logistic classifier fit using stochastic gradient descent
.qsp.ml.logClassifier[X;y;udf]
.qsp.ml.logClassifier[X;y;udf; .qsp.use (!) . flip (
    (`modelArgs ; modelArgs);
    (`bufferSize; bufferSize))]
Parameters:
name | type | description | default |
---|---|---|---|
X | symbol or symbol[] or function | Column names or user-defined function to extract features. | Required |
y | symbol or function | Column name or user-defined function to extract labels. | Required |
udf | symbol or function | Column name or user-defined function to append predictions. | Required |
options:
name | type | description | default |
---|---|---|---|
modelArgs | list | A length two list of trend and configuration. | (1b;()!()) |
bufferSize | long | Number of elements to buffer before fitting, or 0 to fit on the first batch. If the buffer size is exceeded, additional records in that batch will also be included when training. | 0 |
For all common arguments, refer to configuring operators
Returns:
type | description |
---|---|
table | Returns data with predictions appended. |
The algorithm is fit on the first n elements in the stream, where n is the number given by the buffer size. After the model has been fit, subsequent data is used to update the model in an online fashion. Note that when data is passed to the stream, the operator outputs a table of the original data with predictions appended.
Performance Limitations
This functionality is not currently encouraged for use in high-throughput environments. Prediction times for this function are on the order of milliseconds. Further optimizations are expected in later releases.
Fit, update, and predict with a logistic classification model.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.logClassifier[`x;`y;`yHat; .qsp.use `modelArgs`bufferSize!((1b;()!());1000)]
  .qsp.write.toVariable[`output];
// Data will be buffered for training until the buffer size is reached,
// during which time no batches will be emitted.
publish ([] x:asc 500?1f; y:asc 500?0b);
// When the buffer size is reached, buffered data will be used for training,
// and will itself be classified and emitted.
publish ([] x:asc 500?1f; y:asc 500?0b);
// The operator can now be used to make predictions.
// Subsequent data will not be used for training, as the bufferSize has been exceeded.
publish ([] x:asc 10?1f; y:asc 10?0b);
.qsp.ml.sequentialKMeans
Sequential K-Means clustering
.qsp.ml.sequentialKMeans[X]
.qsp.ml.sequentialKMeans[X; .qsp.use (!) . flip (
    (`df        ; df);
    (`k         ; k);
    (`centers   ; centers);
    (`config    ; config);
    (`bufferSize; bufferSize))]
Parameters:
name | type | description | default |
---|---|---|---|
X | symbol or symbol[] or function | Column names or user-defined function to extract features. | Required |
options:
name | type | description | default |
---|---|---|---|
df | symbol | Distance function used in clustering. | edist |
k | long | The number of clusters. | 3 |
centers | dictionary or :: | Initial cluster centers. | :: |
config | dictionary | Configuration for sequential K-Means clustering, cf. .ml.online.clust.sequentialKMeans.fit. | ()!() |
bufferSize | long | Number of elements to buffer before fitting, or 0 to fit on the first batch. If the buffer size is exceeded, additional records in that batch will also be included when training. | 0 |
For all common arguments, refer to configuring operators
Returns:
type | description |
---|---|
table or :: | Null during initial fitting. Afterwards returns data with clusters appended. |
The sequential K-Means algorithm is applied within a streaming framework. The first points, up to the buffer size, are used to fit the model. After this, each new collection of data points is used to update the current cluster centers, and predictions are made as to which cluster each point belongs to.
Fit, update, and predict with the sequential K-Means model.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.sequentialKMeans[`x`x1`x2; .qsp.use enlist[`bufferSize]!enlist 100]
  .qsp.write.toConsole[];
publish ([]100?1f;100?1f;100?1f);
publish ([] 50?1f; 50?1f; 50?1f);
.qsp.ml.linearRegression
Linear regressor fit using stochastic gradient descent
.qsp.ml.linearRegression[X;y;udf]
.qsp.ml.linearRegression[X;y;udf; .qsp.use (!) . flip (
    (`modelArgs ; modelArgs);
    (`bufferSize; bufferSize))]
Parameters:
name | type | description | default |
---|---|---|---|
X | symbol or symbol[] or function | Column names or user-defined function to extract features. | Required |
y | symbol or function | Column name or user-defined function to extract the target variable. | Required |
udf | symbol or function | Column name or user-defined function to append predictions. | Required |
options:
name | type | description | default |
---|---|---|---|
modelArgs | list | A length two list of trend and configuration. | (1b;()!()) |
bufferSize | long | Number of elements to buffer before fitting, or 0 to fit on the first batch. If the buffer size is exceeded, additional records in that batch will also be included when training. | 0 |
For all common arguments, refer to configuring operators
Returns:
type | description |
---|---|
table | Returns data with predictions appended. |
The algorithm is fit on the first n elements in the stream, where n is the number given by the buffer size. After the model has been fit, subsequent data is used to update the model in an online fashion. When data is passed to the stream, the operator outputs a table of the original data with predictions appended.
Fit, update, and predict with a linear regression model.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.linearRegression[`x;`y;`yHat; .qsp.use `modelArgs`bufferSize!((1b;()!());10000)]
  .qsp.write.toVariable[`output];
// Data will be buffered for training until the buffer size is reached,
// during which time no batches will be emitted.
publish ([] x:asc 5000?1f; y:asc 5000?1f);
// When the buffer size is reached, buffered data will be used for training,
// and predictions for it will be emitted.
publish ([] x:asc 5000?1f; y:asc 5000?1f);
// The operator can now be used to make predictions.
// Subsequent data will not be used for training, as the bufferSize has been exceeded.
publish ([] x:asc 100?1f; y:asc 100?1f);
.qsp.ml.score
Score the performance of a model
.qsp.ml.score[y;predictions;metric]
Parameters:
name | type | description | default |
---|---|---|---|
y | symbol or function | The column name of the target variable, or a function to generate the target variable from the batch. | Required |
predictions | symbol or function | The column name of the predictions, or a function to generate the predictions from the batch. | Required |
metric | symbol | The metric on which to evaluate model performance. | Required |
For all common arguments, refer to configuring operators
Returns:
type | description |
---|---|
any | The score given by the metric. |
Score the performance of a model over time allowing changes in model performance to be evaluated. The values returned are the cumulative scores, rather than scores for the individual batches.
The following metrics are currently supported:
- f1
- accuracy
- mse
- rmse
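The difference between a cumulative score and a per-batch score can be sketched in plain q. This is only an illustration under stated assumptions, not the operator's implementation: the step function and the (correct; total) state layout are hypothetical names introduced here, and accuracy is used as the metric.

```q
// Fold one batch into the running totals (correct; total)
step:{[st;batch] st+(sum batch[`y]=batch`pred; count batch)}

b1:([] y:1 0 1 1; pred:1 0 0 1)    // 3 of 4 correct
b2:([] y:0 0 1 1; pred:0 0 1 1)    // 4 of 4 correct

totals:step\[0 0; (b1;b2)]         // running totals after each batch
{x[0]%x 1} each totals             // 0.75 0.875
```

The second batch on its own scores 1, but the cumulative accuracy after it is 0.875 because the earlier batch still counts.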
This example fits a scikit-learn model, then the pipeline predicts y and calculates the cumulative F1 score of the model on receipt of new data.
// Retrieve a dataset and format appropriately
dataset:.p.import[`sklearn.datasets;`:load_breast_cancer][];
X:dataset[`:data]`;
y:dataset[`:target]`;
data: ([] y: y) ,' flip (`$"x",/:string til count first X)!flip X;
// Split data into training and testing set
temp: (floor .8 * count data) cut data;
training: temp 0;
testing : temp 1;
features:flip value flip delete y from training;
targets :training`y;
// Train the model
clf:.p.import[`sklearn.tree]`:DecisionTreeClassifier;
clf:clf[`max_depth pykw 3];
clf[`:fit][features;targets];
// Set model within existing registry
.ml.registry.set.model[::;::;clf;"skModel";"sklearn";::];
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.predict[
    {delete y from x};
    `pred;
    .qsp.use enlist[`model]!enlist"skModel"]
  .qsp.ml.score[`y; `pred; `f1]
  .qsp.write.toConsole[];
publish testing;
This example first fits a q model, then the pipeline predicts y and scores the cumulative accuracy on receipt of new data.
// Retrieve a dataset and format appropriately
dataset:.p.import[`sklearn.datasets;`:load_breast_cancer][];
X:dataset[`:data]`;
y:dataset[`:target]`;
data: ([] y: y) ,' flip (`$"x",/:string til count first X)!flip X;
// Split the data into training and testing sets
temp: (floor .8 * count data) cut data;
training: temp 0;
testing : temp 1;
features:flip value flip delete y from training;
targets :training`y;
// Train the model
model:.ml.online.sgd.logClassifier.fit[features;targets;1b;::];
// Add the model to the existing registry
.ml.registry.set.model[::;::;model;"myModel";"q";::]
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.predict[
    {delete y from x};
    `pred;
    .qsp.use enlist[`model]!enlist"myModel"]
  .qsp.ml.score[`y; `pred; `accuracy]
  .qsp.write.toConsole[]
publish testing
.qsp.ml.dropConstant
Drops columns with constant values
.qsp.ml.dropConstant[X]
.qsp.ml.dropConstant[X; .qsp.use enlist[`bufferSize]!enlist bufferSize]
Parameters:
name | type | description | default |
---|---|---|---|
X | symbol or symbol[] or dictionary or :: | Symbol or list of symbols indicating the columns to drop. If a dictionary is passed in, then columns with the associated value are dropped. If null, then a buffer is used to identify constant columns. | Required |
options:
name | type | description | default |
---|---|---|---|
bufferSize | long | Number of elements to buffer before fitting, or 0 to fit on the first batch. If the buffer size is exceeded, additional records in that batch will also be included when training. | 0 |
For all common arguments, refer to configuring operators
Returns:
type | description |
---|---|
table | New table with constant columns dropped. |
The columns to be removed are either specified by the user beforehand in the form of a dictionary/list, or are determined using the .ml.dropConstant function on a batch of a specified size. If a column that is thought to be constant gives a non-constant value, an error is thrown.
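As a rough illustration of the buffered case, constant columns can be identified in plain q by counting the distinct values in each column. This is a sketch only: constCols is a hypothetical helper, not .ml.dropConstant itself, and it omits the later check that dropped columns stay constant.

```q
// Names of the columns that hold a single distinct value in the buffered batch
constCols:{[t] where 1=count each distinct each flip t}

buffer:([] motorID: 5#0; rpms: 1000 1003 1001 1007 1002; temp: 60 62 61 60 64)
drop:constCols buffer     // only motorID is constant
drop _ buffer             // the same table without motorID
```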
Drop the constant columns protocol and response.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.dropConstant[`protocol`response]
  .qsp.write.toConsole[];
publish ([] protocol: `TCP; response: 200i; latency: 10?5f; size: 10?10000);
Drop the columns id and ratio, checking that their values match the expected constant values.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.dropConstant[`id`ratio!(1; 2f)]
  .qsp.write.toConsole[];
publish ([] id: 1; ratio: 2f; data: 10?10f);
Drop columns whose value is constant for all buffered records.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.dropConstant[::;.qsp.use enlist[`bufferSize]!enlist 100]
  .qsp.write.toConsole[];
publish ([] motorID: 0; rpms: 1000 + 200?10; temp: 60 + 200?5)
.qsp.ml.featureHasher
(Beta Feature) Encodes categorical data across several numeric columns
Beta Features
To enable beta features, set the environment variable KXI_SP_BETA_FEATURES to true.
.qsp.ml.featureHasher[X;n]
Parameters:
name | type | description | default |
---|---|---|---|
X | symbol or symbol[] | Symbol or list of symbols indicating the columns to act on. | Required |
n | long | The number of numeric columns used to represent a variable. | Required |
For all common arguments, refer to configuring operators
Returns:
type | description |
---|---|
table | New table with a column for each feature/hash value pair, with the columns specified by X removed. |
This operator is used to encode categorical variables numerically. It is similar to one-hot encoding, but does not require the categories or number of categories to be known in advance.
It converts each chosen column into n columns, sending each string/symbol to its truncated hash value. The hash function employed is the signed 32-bit version of Murmurhash3.
As the mapping between values and their hashed representations is effectively random, collisions are possible, and the hash space must be made large enough to reduce collisions to an acceptable level.
This functionality operates exclusively on string/symbol columns; numeric columns are not supported.
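The idea of hashing categories into a fixed number of columns can be sketched in plain q. The sketch below makes several assumptions: it uses a sum of md5 bytes as a stand-in hash (this is not MurmurHash3), it ignores the sign of the hash, and hashStr, hashCols and the h0..h3 column names are hypothetical names chosen for illustration.

```q
// Stand-in hash for illustration only (NOT MurmurHash3):
// sum the md5 bytes of the string
hashStr:{[s] sum "j"$md5 s}

// Send each symbol to one of n buckets, with one indicator column per bucket
hashCols:{[n;syms]
  b:(hashStr each string syms) mod n;
  flip (`$"h",/:string til n)!b=/:til n }

hashCols[4; `london`paris`berlin`london]
```

Equal values always land in the same bucket, while distinct values may collide, which is why n has to be large enough for the number of categories expected.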
Encode a list of categorical values.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.featureHasher[`location; 10]
  .qsp.write.toConsole[];
publish ([] location: 20?`london`paris`berlin`miami; num: til 20);
Hash multiple columns.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.featureHasher[`srcIP`destIP; 14]
  .qsp.write.toVariable[`output];
IPs: "." sv/: string 4 cut 100?256;
publish ([] srcIP: 100?IPs; destIP: 100?IPs; latency: 100?10; size: 100?10000);
.qsp.ml.labelEncode
Encodes symbolic columns as numeric data
.qsp.ml.labelEncode[X]
.qsp.ml.labelEncode[X; .qsp.use enlist[`bufferSize]!enlist bufferSize]
Parameters:
name | type | description | default |
---|---|---|---|
X | symbol or symbol[] or dictionary or :: | Symbol or list of symbols indicating the columns to act on, or a dictionary containing the columns and their original expected values. If null, all symbol columns are encoded. | Required |
options:
name | type | description | default |
---|---|---|---|
bufferSize | long | Number of elements to buffer before fitting, or 0 to fit on the first batch. If the buffer size is exceeded, additional records in that batch will also be included when training. | 0 |
For all common arguments, refer to configuring operators
Returns:
type | description |
---|---|
table | New table with symbols encoded. |
This operator encodes symbolic columns within input data. Data is first collected in a buffer until the required size is reached. The specified symbol columns of the buffered data are then encoded, and the mapping is stored as the state. If additional values appear in subsequent batches, the state is updated to reflect this.
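A minimal sketch of this stateful mapping in plain q follows. It assumes that a symbol's code is its position in the list of symbols seen so far; the order in which the operator actually assigns codes is not stated here, and the variable names are illustrative.

```q
// State: the distinct symbols seen so far; a symbol's code is its position
mapping:distinct `b`a`b`c       // `b`a`c, fitted on the first batch
mapping?`b`a`b`c                // 0 1 0 2

// A later batch contains the unseen symbol `d: extend the state, then encode
batch:`c`d`a
mapping:mapping union batch     // `b`a`c`d
mapping?batch                   // 2 3 1
```

Existing codes are unchanged when the state grows, so values encoded from earlier batches remain valid.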
Encode all symbol columns within the data.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.labelEncode[::]
  .qsp.write.toConsole[];
publish ([]10?`a`b`c;10?`d`e`f;10?1f);
Encode symbols in column x.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.labelEncode[`x]
  .qsp.write.toConsole[]
publish ([]10?`a`b`c;10?`d`e`f;10?1f);
Encode the symbols in the encoded column with the mapping specified.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.labelEncode[(enlist `encoded)!enlist `small`medium`large]
  .qsp.write.toConsole[];
data: 10?`small`medium`large;
publish ([] original: data; encoded: data);
.qsp.ml.minMaxScaler
Apply min-max scaling to streaming data
.qsp.ml.minMaxScaler[X]
.qsp.ml.minMaxScaler[X; .qsp.use (!) . flip (
    (`bufferSize; bufferSize);
    (`rangeError; rangeError))]
Parameters:
name | type | description | default |
---|---|---|---|
X | symbol or symbol[] or dictionary or :: | Symbol or list of symbols indicating the columns to be min-max scaled. If null, then all columns will be min-max scaled. Alternatively, a dictionary of column names and ranges can be specified, where :: can be used in place of the range to scale based on the initial buffer. | Required |
options:
name | type | description | default |
---|---|---|---|
bufferSize | long | Number of elements to buffer before fitting, or 0 to fit on the first batch. If the buffer size is exceeded, additional records in that batch will also be included when training. | 0 |
rangeError | boolean | Whether to raise an error if new data falls outside the data range specified by the first batch. | 0b |
For all common arguments, refer to configuring operators
Returns:
type | description |
---|---|
table | New table with defined columns min-max scaled. |
Apply min-max scaling to user specified columns, where scaling is based on the user supplied limits for the minimum and maximum values, or determined by the first buffered batch of data. This function can be configured to error if data supplied falls outside this range and to accumulate a buffer of data prior to determining the minimum/maximum values from the data.
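The fit-then-apply behaviour can be sketched in plain q for a single column. This is an illustration under assumptions rather than the operator's code: fitRange, scale and inRange are hypothetical helpers, and the target range is taken to be 0 to 1.

```q
// Fit: record the minimum and maximum of the first batch (or buffer)
fitRange:{[x] (min x; max x)}
// Transform: rescale so that the fitted range maps onto 0..1
scale:{[r;x] (x-r 0)%r[1]-r 0}
// Check whether every value lies inside the fitted range
inRange:{[r;x] all x within r}

r:fitRange 2 4 6 10f
scale[r; 2 4 6 10f]     // 0 0.25 0.5 1
scale[r; 12f]           // 1.25, outside 0..1
inRange[r; 12f]         // 0b: the situation rangeError is meant to catch
```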
Apply min-max scaling on all data.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.minMaxScaler[::]
  .qsp.write.toConsole[];
publish ([]20?5;20?5;20?10)
Apply min-max scaling on the specified columns.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.minMaxScaler[`x`x1]
  .qsp.write.toConsole[];
publish ([]20?5;20?5;20?10)
Apply min-max scaling on columns rating and cost, with supplied minimum and maximum values for rating, and with the range for cost based on a buffer.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.minMaxScaler[`rating`cost!(0 10;::); .qsp.use enlist[`bufferSize]!enlist 200]
  .qsp.write.toConsole[];
publish ([] rating: 3 + 250?5; cost: 250?1000f)
Error when passed batches containing data outside the min-max bounds.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.minMaxScaler[::;.qsp.use enlist[`rangeError]!enlist 1b]
  .qsp.write.toConsole[]
// As no buffer is specified, the min and max values are fit using the initial batch
publish ([]100?5;100?5;100?10)
// As `rangeError` has been set, this batch will cause an error by exceeding the
// expected maximum values
publish 1+([]100?5;100?5;100?10)
.qsp.ml.oneHot
One hot encodes relevant columns
.qsp.ml.oneHot[X]
.qsp.ml.oneHot[X; .qsp.use enlist[`bufferSize]!enlist bufferSize]
Parameters:
name | type | description | default |
---|---|---|---|
X | symbol or symbol[] or dictionary or :: | Symbol, list of symbols, dictionary, or null indicating the columns to encode. If a dictionary is passed in, then the associated values are used for the encoding. | Required |
options:
name | type | description | default |
---|---|---|---|
bufferSize | long | Number of elements to buffer before fitting, or 0 to fit on the first batch. If the buffer size is exceeded, additional records in that batch will also be included when training. | 0 |
For all common arguments, refer to configuring operators
Returns:
type | description |
---|---|
table | New table with defined symbol columns encoded. |
Allows symbolic or string data to be encoded into numeric representations. The algorithm works by first collecting data in a buffer to the specified size, then:
- If the columns are specified by a symbol or list of symbols, the algorithm fits the one hot encoding on the buffer.
- If all symbol columns are selected with ::, the algorithm again fits the one hot encoding on the buffer.
- If a dictionary is presented, the keys with symbol[] values are used directly to fit the encoding, while keys with a null value are fit on the buffer.
Once the algorithm has been fit, all data in the stream is transformed using the given one hot encoding. If data in some subsequent batch contains symbols that were not present at the time of fitting, these symbols will be mapped to zero.
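The handling of unseen symbols can be illustrated with a small plain-q sketch. It is not the operator's implementation: the oneHot helper defined here, its prefix argument, and the x_a style column names are assumptions for illustration.

```q
// Fit: the categories are the distinct symbols in the buffer (or supplied up front)
cats:distinct `a`b`c`a

// Transform: one boolean column per fitted category
oneHot:{[cats;prefix;x] flip (`$(prefix,"_"),/:string cats)!x=/:cats}

oneHot[cats;"x";`a`c`b]   // one 1b per row, in columns x_a, x_c and x_b respectively
oneHot[cats;"x";`a`d]     // `d was not seen when fitting, so its row is all zeros
```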
Encode all the symbolic or string columns.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.oneHot[::]
  .qsp.write.toConsole[];
publish ([] action: 10?`upload`download; fileType: 10?("image";"audio";"document"); size: 10?100000)
Encode the column x.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.oneHot[`x]
  .qsp.write.toConsole[];
publish ([] x:10?`a`b`c; y:10?1f)
Encode the columns x and x1 with a required buffer.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.oneHot[`x`x1;.qsp.use enlist[`bufferSize]!enlist 200]
  .qsp.write.toConsole[];
publish ([] 250?`a`b`c; 250?`d`e`f`j; 250?0b)
Encode the columns axis and status using given values. This is useful when the categories are known in advance, but may not be present in the training data.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.oneHot[`axis`status!(`x`y`z; `normal`error)]
  .qsp.write.toConsole[];
publish ([] axis: 100?`x`y`z; status: `normal; position: 100?50f)
Encode the columns axis and status using a hybrid method: axis is fit on the data, while status uses the given values.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.oneHot[`axis`status!(::; `normal`error)]
  .qsp.write.toConsole[];
publish ([] axis: 100?`x`y`z; status: `normal; position: 100?50f)
.qsp.ml.standardize
Apply standardization to streaming data
.qsp.ml.standardize[X]
.qsp.ml.standardize[X; .qsp.use enlist[`bufferSize]!enlist bufferSize]
Parameters:
name | type | description | default |
---|---|---|---|
X | symbol or symbol[] or :: | Symbol or list of symbols indicating the columns to be standardized. If null, then all columns will be scaled. | Required |
options:
name | type | description | default |
---|---|---|---|
bufferSize | long | Number of elements to buffer before fitting, or 0 to fit on the first batch. If the buffer size is exceeded, additional records in that batch will also be included when training. | 0 |
For all common arguments, refer to configuring operators
Returns:
type | description |
---|---|
table | New table with defined columns standardized. |
Apply standardization to user specified columns, where scaling is determined by the buffered data, or the first batch if bufferSize is 0. On this batch the mean and standard deviation are computed. These statistics are then used to standardize subsequent batches, removing the mean and dividing by the standard deviation.
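For a single column, the computation can be sketched in plain q as follows. The helper names fitStats and standardize are illustrative, and the use of the population standard deviation (q's dev) is an assumption; the operator may compute it differently.

```q
// Fit: mean and standard deviation of the first batch (or buffer)
fitStats:{[x] (avg x; dev x)}
// Transform: remove the mean, then divide by the standard deviation
standardize:{[s;x] (x-s 0)%s 1}

s:fitStats 2 4 4 4 5 5 7 9f          // mean 5, standard deviation 2
standardize[s; 2 4 4 4 5 5 7 9f]     // -1.5 -0.5 -0.5 -0.5 0 0 1 2
standardize[s; 11f]                  // 3: later batches reuse the fitted statistics
```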
This pipeline applies standardization to all data.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.standardize[::]
  .qsp.write.toConsole[];
publish ([]100?5;100?5;100?10)
Apply standardization to specified columns.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.standardize[`x`x1]
  .qsp.write.toConsole[];
publish ([]100?5;100?5;100?10)
This pipeline applies standardization to all columns based on a buffer.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.standardize[::; .qsp.use enlist[`bufferSize]!enlist 200]
  .qsp.write.toConsole[];
publish ([] length: 100 + 250?2f; width: 10 + 250?1f);
.qsp.ml.registry.fit
Fit model to batch of data and predict target for future batches
.qsp.ml.registry.fit[X;y;untrained;modelType;udf]
.qsp.ml.registry.fit[X;y;untrained;modelType;udf; .qsp.use (!) . flip (
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`config    ; config);
    (`modelArgs ; modelArgs);
    (`bufferSize; bufferSize))]
Parameters:
name | type | description | default |
---|---|---|---|
X | symbol[] or function | The predictor variable's column names or a function to generate the predictors from the batch. | Required |
y | symbol or function or :: | The target variable's column name or a function to generate the target from the batch. This must be :: when training an unsupervised model. | Required |
untrained | function | An untrained q/sklearn model. | Required |
modelType | string | Indication as to whether a model is "q" or "sklearn". | Required |
udf | function or symbol | A function to score the quality of the model or join predictions into the batch. In the case that this is a symbol, the predictions are appended to the batch as a new column. | Required |
Functional UDF requirements
The udf parameter for the .qsp.ml.registry.fit operator is a function with the following parameters:
udf:{[data;y;predictions;modelInfo]
  update yhat: predictions from data
  }
name | type | description |
---|---|---|
data | any | The batch passed to the operator, only the data not the metadata. |
y | symbol or function or :: | The target variable, as extracted by the y parameter. In the unsupervised case this is populated with nulls. |
predictions | list | The predictions for each record in the batch. |
modelInfo | :: | Currently unused and always set to ::. |
options:
name | type | description | default |
---|---|---|---|
registry | string | The registry to load from. | :: |
experiment | string | The experiment name. | :: |
model | string | The model name in the registry. | :: |
config | any | The config parameter for .ml.registry.set.model. | ()!() |
modelArgs | list | A list of arguments to pass to the model after X and y. | :: |
bufferSize | long | Number of records to buffer before training a model. If 0, the model will be fit on the first batch. If the buffer size is exceeded, additional records in that batch will also be included when training. | 0 |
For all common arguments, refer to configuring operators
Returns:
type | description |
---|---|
any | The current batch, modified in accordance with the udf parameter. |
Fits a model to a batch or buffer of data, saving the model to the registry, and predicting the target variable for future batches after the model has been trained.
N.B. This is only for models that cannot be trained incrementally. For other models, .qsp.ml.registry.update should be used.
Fit a q model on a batch.
// Generate initial data to be used for fitting
a:500?1f
b:500?1f
data:([]a;b;y:a+b)
// Define optional variables
optKeys:`registry`experiment`model`modelArgs
optVals:(::;::;"sgdLR";(1b; `maxIter`gTol`seed!(100;-0w;42)))
opt:optKeys!optVals
// Define execution pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.fit[
    {delete y from x};
    `y;
    .ml.online.sgd.linearRegression;
    "q";
    `yhat;
    .qsp.use opt
    ]
  .qsp.write.toConsole[]
publish data
// View model stored in registry
.ml.registry.get.modelStore[::;::]
Fit an sklearn model.
// Generate initial data to be used for fitting
data:([]x:asc 100?1f;x1:100?1f;y:desc 100?5)
// Create an untrained random forest classifier
rfc:.p.import[`sklearn.ensemble][`:RandomForestClassifier][`max_depth pykw 2]
// Define execution pipeline
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.fit[
    {delete y from x};
    {exec y from x};
    rfc;
    "sklearn";
    `yhat]
  .qsp.write.toConsole[]
publish data
Fit an unsupervised model.
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.fit[
    `x`x1`x2;
    ::;
    .ml.clust.kmeans;
    "q";
    `cluster;
    .qsp.use enlist[`modelArgs]!enlist(`e2dist;3;::)
    ]
  .qsp.write.toConsole[]
publish ([]x:1000?1f;x1:1000?1f;x2:1000?1f)
.qsp.ml.registry.predict
Predict a target variable using a model
.qsp.ml.registry.predict[X;udf]
.qsp.ml.registry.predict[X;udf; .qsp.use (!) . flip (
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`version   ; version))]
Parameters:
name | type | description | default |
---|---|---|---|
X | symbol[] or function | The predictor variable's column names or a function to generate the predictors from the batch. | Required |
udf | function or symbol | A user-defined function for integrating the predictions into the batch, or a column name to join them to the table as a new column. | Required |
Functional UDF requirements
The udf parameter for the .qsp.ml.registry.predict operator is a function with the following parameters:
udf:{[data;y;predictions;modelInfo]
  update yhat: predictions from data
  }
name | type | description |
---|---|---|
data | any | The batch passed to the operator, only the data not the metadata. |
y | symbol or function or :: | The target variable, as extracted by the y parameter. |
predictions | list | The predictions for each record in the batch. |
modelInfo | :: | Currently unused and always set to ::. |
options:
name | type | description | default |
---|---|---|---|
registry | string | The registry to load from. | :: |
experiment | string | The experiment name. | :: |
model | string | The model name in the registry. | :: |
version | float | The version to load. | :: |
For all common arguments, refer to configuring operators
Returns:
type | description |
---|---|
any | The current batch, modified in accordance with the udf parameter. |
.qsp.ml.registry.predict will predict the target value for each record in the batch, using a model from the registry.
The user-defined function udf can join these predictions into the data, or do any arbitrary computation. Note that below data is the whole batch, not just those fields extracted by X. Additionally, modelInfo is a catch-all for any model-specific outputs.
.qsp.ml.registry.predict[X; {[data;y;predictions;modelInfo]
  update temperature: predictions from data
  }; .qsp.use `registry`experiment`model`version!(registry;experiment;model;version)]
In lieu of a user-defined function, this parameter can also just be the name of a new column, or the name of an existing column to overwrite it.
.qsp.ml.registry.predict[X;`temperature;
  .qsp.use`registry`experiment`model`version!(registry;experiment;model;version)]
Predict using an sklearn model, adding predictions to the initial data.
N:1000
data:([]x:asc N?1f;x1:desc N?10;x2:N?1f;y:asc N?5)
features:flip value flip delete y from data
clf1:.p.import[`sklearn.tree]`:DecisionTreeClassifier
clf1:clf1[`max_depth pykw 3]
clf1[`:fit][features;data`y]
// Set the model within the existing registry
.ml.registry.set.model[::;::;clf1;"skModel";"sklearn";::]
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.predict[
    {delete y from x};
    `yhat;
    .qsp.use enlist[`model]!enlist"skModel"]
  .qsp.write.toConsole[]
publish data
Predict using a q model, adding predictions to the initial data.
// Define data for fitting the model
N:1000
data:([]x:N?1f;x1:N?1f;x2:N?1f)
// Fit a model
kmeansModel:.ml.clust.kmeans.fit[data`x`x1`x2;`e2dist;6;enlist[`iter]!enlist 1000]
// Set the model within existing registry
.ml.registry.set.model[::;::;kmeansModel;"kmeansModel";"q";::]
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.predict[
    `x`x1`x2;
    `yhat;
    .qsp.use enlist[`model]!enlist"kmeansModel"]
  .qsp.write.toConsole[]
publish data
.qsp.ml.registry.update
Train a model incrementally returning predictions for each record in a batch
.qsp.ml.registry.update[X;y;udf]
.qsp.ml.registry.update[X;y;udf; .qsp.use (!) . flip (
    (`registry  ; registry);
    (`experiment; experiment);
    (`model     ; model);
    (`version   ; version);
    (`supervised; supervised);
    (`untrained ; untrained);
    (`modelType ; modelType);
    (`modelArgs ; modelArgs))]
Parameters:
name | type | description | default |
---|---|---|---|
X | symbol[] or function | The predictor variable's column names or a function to generate the predictors from the batch. | Required |
y | symbol or function | The target variable's column name or a function to generate this from the batch. | Required |
udf | function or symbol | A function to score the quality of the model or join predictions into the batch. | Required |
Functional UDF requirements
The udf parameter for the .qsp.ml.registry.update operator is a function with the following parameters:
udf:{[data;y;predictions;modelInfo]
  update yhat: predictions from data
  }
name | type | description |
---|---|---|
data | any | The batch passed to the operator, only the data not the metadata. |
y | symbol or function or :: | The target variable, as extracted by the y parameter. |
predictions | list | The predictions for each record in the batch. |
modelInfo | :: | Currently unused and always set to ::. |
options:
name | type | description | default |
---|---|---|---|
registry | string | The registry to load from. | :: |
experiment | string | The experiment name. | :: |
model | string | The model name in the registry. | :: |
version | float | The version to load. | :: |
supervised | boolean | Indicates whether the model is supervised. | 1b |
untrained | function or embedpy | An untrained ML model, e.g. .ml.online.sgd.linearRegression. | :: |
modelType | string | One of "q" or "sklearn" defining the type of model. | :: |
modelArgs | list | A list of arguments to pass to the model after X and y. | :: |
For all common arguments, refer to configuring operators
Returns:
type | description |
---|---|
any | The current batch, modified in accordance with the udf parameter. |
Train a model incrementally returning predictions for each record in a batch. A user-defined function can be used to join these predictions into the data, or do any arbitrary computation.
Python support
Currently this functionality is only supported for q models. Support for deployment of online learning models written in Python is scheduled for a later release.
Fit an untrained q model which can be updated, adding predictions to the initial data.
// Initialise functionality and data required for running example
a:500?1f
b:500?1f
data:([]a;b;y:a+b)
.qsp.run
  .qsp.read.fromCallback[`publish]
  .qsp.ml.registry.update[
    {delete y from x};
    {exec y from x};
    `yhat;
    .qsp.use
      `untrained`modelType`modelArgs!(.ml.online.sgd.linearRegression;"q";(1b;()!()))]
  .qsp.write.toConsole[]
publish data;