Parameters that change the default behavior of .automl.fit

AutoML user-modifiable parameters

aggregationColumns             Aggregation columns (FRESH only)
crossValidationFunction        Cross validation function
crossValidationArgument        Number of folds/percentage of data in validation set
functions                      Functions to be applied for feature extraction
gridSearchFunction             Grid search function
gridSearchArgument             Number of folds/percentage of data in validation set
holdoutSize                    Size of holdout set used
hyperparameterSearchType       Form of hyperparameter search to perform
loggingDir                     Directory to save log files in
loggingFile                    Name of logging file produced for a run
numberTrials                   Number of random/sobol hyperparameters to generate
overWriteFiles                 Overwrite any saved models or log files that exist
predictionFunction             Fit-predict function to be applied
pythonWarning                  Should Python warning be displayed
randomSearchFunction           Random search function
randomSearchArgument           Number of folds/percentage of data in validation set
savedModelName                 Name assigned to a run of AutoML
saveOption                     Option for what is to be saved to disk during a run
scoringFunctionClassification  Scoring functions for classification tasks
scoringFunctionRegression      Scoring functions for regression tasks
seed                           Random seed to be used
significantFeatures            Feature significance procedure to be applied to data
targetLimit                    Ignore NN models when above this number of targets
testingSize                    Size of testing set on which final model is tested
trainTestSplit                 Train-test split function to be applied
w2v                            Word2Vec embedding methodology used (NLP only)

The other sections describe the default behavior of the framework, when the last argument of .automl.fit is the generic null (::).

The argument can be used to change the default behavior. Replace the null with either

• a dictionary
• path to a JSON file

of non-default parameter values. The parameter names above are the keys of either the dictionary or the JSON object.
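As a minimal sketch, the dictionary form might be built as follows. The keys are parameter names from the table above; the values are illustrative only. A single-parameter dictionary needs `enlist` on both sides, because a q dictionary maps a list of keys to a list of values.

```q
// Several parameters: keys are a symbol list, values a general list
params:`seed`holdoutSize`saveOption!(42;0.1;0)

// One parameter: enlist both the key and the value
single:enlist[`seed]!enlist 42

// Both are dictionaries (type 99h) and could be passed as the last argument:
// .automl.fit[features;target;ftype;ptype;params]
type each (params;single)
```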

## JSON files

The parameters are illustrated below as q dictionary entries. They can also be set in JSON files.

The defaults are defined in

automl/code/customization/configuration/default.json

You can modify this file.

Or make one or more custom parameter sets: save versions of default.json in sibling folder customConfig:

automl
└── code
    └── customization
        └── configuration
            ├── default.json
            └── customConfig
                ├── custom1.json
                └── custom2.json

To use a custom parameter set, pass its file name (as a symbol, string or file symbol) as the last argument to .automl.fit.

// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)

// Multi-classification target
target:100?5

// Feature extraction type
ftype:`normal

// Problem type
ptype:`class

// Custom configuration file
params:"custom2.json"

// Run AutoML
.automl.fit[features;target;ftype;ptype;params]

## aggregationColumns

Columns to be used for aggregations in FRESH

By default the aggregation column for any FRESH based feature extraction is assumed to be the first column in the dataset. In certain circumstances, this may not be sufficient and a more complex aggregation setup may be required, as outlined below.

// Characteristic vector
v:100?50

// FRESH feature table
features:([]timestamp:"p"$v;v:v;100?1f;100?1f;100?1f)

// Target vector
target:count[distinct v]?1f

// Feature extraction type
ftype:`fresh

// Problem type
ptype:`reg

// In this example we want `timestamp`v as aggregation columns
params:enlist[`aggregationColumns]!enlist`timestamp`v

// Run AutoML
.automl.fit[features;target;ftype;ptype;params]

## crossValidationFunction/Argument

Cross-validation function and number of folds/percentage of data in validation set

crossValidationFunction is the name of the cross-validation function to apply, as a symbol. crossValidationArgument is the associated argument: either the number of folds to apply or the percentage of data in the validation set.

By default, the cross-validation procedure is a 5-fold shuffled cross validation using the function .ml.xv.kfShuff. You can augment this for different use cases.

For example, you could change crossValidationFunction to .ml.xv.tsRolls to suit a more timeseries-specific problem and change crossValidationArgument to 7 to split the data into more folds than the default configuration.

For simplicity, where possible, use the functions within the .ml.xv namespace for this task.

// Non-timeseries (normal) feature data
features:([]asc 100?1f;100?1f;100?1f)

// Target vector
target:asc 100?1f

// Feature extraction type
ftype:`normal

// Problem type
ptype:`reg

// Change cross validation procedure
// Use percentage split, with 20% data in the testing set
params:`crossValidationFunction`crossValidationArgument!
  (`.ml.xv.pcSplit;.2)

// Run AutoML
.automl.fit[features;target;ftype;ptype;params]

Custom cross-validation function

To add a custom cross-validation function to those provided, follow the guidelines for function definition.

Contact ai@kx.com with questions on this: it is more complicated than other customizations.
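To make the meaning of the fold argument concrete, the sketch below shows in plain q how row indices could be divided into shuffled folds. The helper `kfoldIdx` is hypothetical and is not part of AutoML or the ML Toolkit, whose implementations may differ; it only illustrates what "5-fold shuffled" means.

```q
// Split n row indices into k shuffled folds (illustrative helper only)
kfoldIdx:{[k;n] (k;0N)#neg[n]?n}

folds:kfoldIdx[5;100]

// Each fold in turn is the validation set; the other folds form the training set
splits:{[folds;i]
  `train`validate!(raze folds (til count folds) except i;folds i)
  }[folds] each til count folds

// Number of row indices in each fold, and in the first train/validate split
count each folds
count each splits 0
```

With 100 rows and five folds, every fold holds 20 indices, so each split trains on 80 rows and validates on 20.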
## functions

Functions to be applied for feature extraction

FRESH

By default, the feature-extraction functions applied for any FRESH-based problem are those contained in .ml.fresh.params. This comprises approximately 60 functions. To augment or apply a subset of these functions see the example below and the instructions.

Normal

By default, normal feature extraction simply entails the decomposition of any temporal types into their component parts. You can augment this to add new functionality by supplying a list of functions, each of which must take and return a simple table.

NLP

By default, feature-extraction steps taken for NLP models include parsing the text data using .nlp.newParser and applying sentiment analysis, regular-expression searching and named-entity recognition tagging. The text is then vectorized using a Word2Vec model and concatenated with the created features. Normal feature extraction is then applied to any remaining non-textual columns. As with normal feature extraction, you can augment this step.

// Characteristic vector
v:100?50

// Feature table
features:([]tm:"t"$v;asc 100?1f;100?1f;100?1f;100?1f)

// FRESH target vector
target:count[distinct v]?1f

// Feature extraction type
ftype:`fresh

// Problem type
ptype:`reg

// Select functions which only take data as input with no extra parameters
dataFuncs:select from .ml.fresh.params where pnum=0
params:enlist[`functions]!enlist dataFuncs

// Run feature extraction using user defined function table for FRESH
.automl.fit[features;target;ftype;ptype;params]

A user-defined function for feature extraction should take a simple table as input and return a simple table with the desired feature-extraction procedures applied.

Changing the number of rows in the dataset will cause errors in the pipeline.

## gridSearchFunction/Argument

Grid search function and number of folds/percentage of data in validation set

gridSearchFunction is the name of the grid-search function to apply as a symbol, while gridSearchArgument is an argument associated with this function defining either the number of folds to apply or the percentage of data in the validation set.

By default, the grid-search procedure being implemented is a 5-fold shuffled grid search using the function .automl.gs.kfshuff. You can augment this for different use cases.

For example, when using timeseries data, you could use a method like chain-forward grid search, .automl.gs.tschain, in the ML Toolkit, paired with three folds.

For simplicity, use the functions within the .automl.gs namespace for this task.

// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)

// Regression target
target:100?1f

// Feature extraction type
ftype:`normal

// Problem type
ptype:`reg

// Change hyperparameter search procedure
// Use roll-forward grid search with 6 folds
params:`gridSearchFunction`gridSearchArgument!
  (`.ml.gs.tsRolls;6)

// Run AutoML
.automl.fit[features;target;ftype;ptype;params]

Custom grid-search function

To add a custom grid search function, follow the guidelines for function definition.

Contact ai@kx.com with questions on this: it is more complicated than other customizations.

## holdoutSize

Size of holdout set used to validate the models run

By default the holdout set across all problem types is set to 20%. For problems with a small number of data points, you may wish to increase the number of data points being trained on. The opposite may be true for larger datasets.

// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)

// Regression target
target:100?1f

// Feature extraction type
ftype:`normal

// Problem type
ptype:`reg

// Set the holdout set to contain 10% of the dataset
params:enlist[`holdoutSize]!enlist .1

// Run AutoML
.automl.fit[features;target;ftype;ptype;params]

## hyperparameterSearchType

Type of hyperparameter search to perform

By default, an exhaustive grid search is applied to the best model found for a given dataset. Random or Sobol-random methods are also available within AutoML and can be applied by changing the parameter hyperparameterSearchType.
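The difference between an exhaustive grid and a random search can be sketched in plain q. The hyperparameter space `hp` and the helper `randomSets` below are hypothetical and do not reflect how AutoML generates candidates internally: the grid enumerates every combination, while the random search draws a fixed number of sets.

```q
// Hypothetical hyperparameter space (names and values are illustrative)
hp:`alpha`depth!(0.1 1 10f;2 4)

// Exhaustive grid: every combination of the candidate values
grid:flip key[hp]!flip (cross/) value hp

// Random search: draw n sets, sampling each hyperparameter independently
randomSets:{[hp;n] flip key[hp]!n?/:value hp}

count grid                // 6 combinations: 3 alphas x 2 depths
count randomSets[hp;4]    // 4 randomly drawn sets
```

The grid grows multiplicatively with every hyperparameter added, whereas the number of random sets stays whatever is requested.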

// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)
// Regression target
target:100?1f
// Feature extraction type
ftype:`normal
// Problem type
ptype:`reg

// Change hyperparameter search procedure
// Use random search
params:enlist[`hyperparameterSearchType]!enlist`random
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]

// Change hyperparameter search procedure
// Use Sobol-random search
params:enlist[`hyperparameterSearchType]!enlist`sobol
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]

## loggingDir

Directory to store logging files

When .automl.utils.logging is 1b, this parameter sets (relative to the current directory) where a log file is stored.

By default, the log file is saved to the same directory in which the reports, models, metadata and images are stored.

// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)

// Regression target
target:100?1f

// Feature extraction type
ftype:`normal

// Problem type
ptype:`reg

// Update logging function
.automl.updateLogging[]
q)// Check to ensure logging is enabled
q).automl.utils.logging
1b

q)// Set the logging directory to logDir
q)params:enlist[`loggingDir]!enlist"logDir"

q)// Run AutoML
q).automl.fit[features;target;ftype;ptype;params]

## loggingFile

Name of saved logging file

When .automl.utils.logging is 1b, this is the name of the saved log file.

By default, the log file is named logFile_date_time.txt.

// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)

// Regression target
target:100?1f

// Feature extraction type
ftype:`normal

// Problem type
ptype:`reg

// Update logging function
.automl.updateLogging[]
q)// Check to ensure logging is enabled
q).automl.utils.logging
1b

q)// Define the name of the logging file
q)params:enlist[`loggingFile]!enlist"logFileNew"

q)// Run AutoML
q).automl.fit[features;target;ftype;ptype;params]

## numberTrials

Number of random/Sobol-random hyperparameters to generate

For the random and Sobol-random hyperparameter methods, a user specified number of hyperparameter sets are generated for a given hyperparameter space.

For Sobol, the number of trials must equal $2^n$ for some integer $n$, while for random, any number of distinct sets can be generated.

The default for both cases is 256.
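Before requesting a Sobol search it can help to confirm that a candidate number of trials is a power of two. The helpers `isPow2` and `floorPow2` below are hypothetical conveniences written in plain q; they are not part of AutoML.

```q
// True when x is a positive power of two, i.e. exactly one bit is set
isPow2:{(x>0) and 1=sum 0b vs x}

// Largest power of two not exceeding x, using integer arithmetic only
floorPow2:{p:1,prds 61#2; last p where x>=p}

isPow2 each 256 300 512    // 101b
floorPow2 300              // 256
```

A value such as 300 could therefore be rounded down to 256 before being passed as numberTrials for a Sobol search.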

// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)
// Regression target
target:100?1f
// Feature extraction type
ftype:`normal
// Problem type
ptype:`reg

// Random search - set number of hyperparameter sets
params:`hyperparameterSearchType`numberTrials!(`random;10)
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]

// Sobol-random search - set number of hyperparameter sets to equal 2^n
q)show n:"j"$xexp[2;9]
512
q)params:`hyperparameterSearchType`numberTrials!(`sobol;n)

q)// Run AutoML
q).automl.fit[features;target;ftype;ptype;params]

## overWriteFiles

Overwrite any saved models or log files that exist

If a defined savedModelName or loggingFile of the same name already exists in the system, setting this parameter to 1b will allow .automl.fit to overwrite these files.

By default the value is 0b and the code will exit with a warning message if the files already exist.

q)// Non-timeseries (normal) feature table
q)features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)

q)// Binary-classification target
q)target:100?0b

q)// Feature extraction type
q)ftype:`normal

q)// Problem type
q)ptype:`class

q)// Use a savedModelName that already exists
q)params:enlist[`savedModelName]!enlist"test"

q)// AutoML returns an error because the savePath already exists
q).automl.fit[features;target;ftype;ptype;params]
Error: The savePath chosen already exists, this run will be exited

q)// Set overWriteFiles to 1b
q)show params,:enlist[`overWriteFiles]!enlist 1b
savedModelName| "test"
overWriteFiles| 1b

q)// Run AutoML
q).automl.fit[features;target;ftype;ptype;params]
modelInfo| `startDate`startTime`featureExtractionType`problemType..
predict  | {[config;features]
original_print:utils.printing;
utils.printi..

## predictionFunction

Fit-predict function to be applied

Ternary function that fits a model and makes predictions during cross validation and hyperparameter search. In both cases the model is fitted on a training set, and its predictions on the validation set are returned so that they can be scored with the supplied scoring function.

Syntax:

myFun[func;hyperParam;data]

Where

• func is a scoring function that takes parameters and data as input and returns appropriate score
• hyperParam is a dictionary of hyperparameters to be searched
• data is data split into training and testing sets of format ((xtrn;ytrn);(xval;yval))

myFun returns the predicted and true validation values.

By default .automl.utils.fitPredict is used.

// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)

// Regression target
target:100?1f

// Feature extraction type
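ftype:`normal

// Problem type
ptype:`reg

// Define updated prediction function
fitPredictUpd:{[func;hyperParam;data]
  // numpy's array constructor, accessed through embedPy
  numpyArray:.p.import[`numpy]`:array;
  // Fit on the training set (data 0), then predict on the validation features (data[1]0)
  preds:@[.[func[][hyperParam]`:fit;numpyArray data 0]`:predict;data[1]0]`;
  // Return the predicted and true validation values
  (preds;data[1]1)
  }

params:enlist[`predictionFunction]!enlist fitPredictUpd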

// Run AutoML
.automl.fit[features;target;ftype;ptype;params]

## pythonWarning

Display Python warnings

Boolean atom: whether Python warning messages are to be displayed to standard output.

By default this is 0b.

// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)

// Binary-classification target
target:100?0b

// Feature extraction type
ftype:`normal

// Problem type
ptype:`class

// Set python warnings to display to standard output
params:enlist[`pythonWarning]!enlist 1b
q).automl.fit[features;target;ftype;ptype;params]
Executing node: automlConfig
Executing node: configuration
Executing node: targetDataConfig
Executing node: targetData
Executing node: featureDataConfig
Executing node: featureData
Executing node: dataCheck
Executing node: featureDescription
Executing node: dataPreprocessing
Executing node: featureCreation
Executing node: labelEncode
Executing node: featureSignificance
Executing node: trainTestSplit
Executing node: modelGeneration
Executing node: selectModels
Executing node: runModels
Executing node: optimizeModels
/lib/python3.7/site-packages/sklearn/neural_network/_multilayer_pe...
/lib/python3.7/site-packages/sklearn/neural_network/_multilayer_pe...
...

## randomSearchFunction/Argument

Random search function and number of folds/percentage of data in validation set

randomSearchFunction is the name of the random search function to apply as a symbol, while randomSearchArgument is an argument associated with this function defining either the number of folds to apply or the percentage of data in the validation set.

By default, the random search procedure being implemented (assuming hyperparameterSearchType is set to random) is a 5-fold shuffled random search using the function .automl.rs.kfshuff. You can augment this for different use cases.

For example, when using timeseries data, you might wish to use a method like chain-forward random search, .automl.rs.tschain, in the ML Toolkit, paired with three folds.

For simplicity, use the functions within the .automl.rs namespace for this task.

// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)
// Regression target
target:100?1f
// Feature extraction type
ftype:`normal
// Problem type
ptype:`reg

// Change hyperparameter search procedure
// Use percentage split random search with 20% validation set
params:`hyperparameterSearchType`randomSearchFunction`randomSearchArgument!
  (`random;`.ml.rs.pcSplit;.2)
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]

// Use chain-forward Sobol-random search function with 6-folds
params:`hyperparameterSearchType`randomSearchFunction`randomSearchArgument!
  (`sobol;`.ml.rs.tsChain;6)
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]

Custom random/Sobol-random search function

To add a custom random/Sobol-random search function, follow the guidelines for function definition.

Contact ai@kx.com with questions on this: it is more complicated than other customizations.

## savedModelName

Folder name where all outputs related to a run will be saved

The folder created is saved in /outputs/namedModels/.

By default, the outputs are saved in a folder named for the start date and time of the run, in the format /outputs/dateTimeModels/date/run_time.

// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)

// Regression target
target:100?1f

// Feature extraction type
ftype:`normal

// Problem type
ptype:`reg

// Define the folder name where outputs are to be saved
params:enlist[`savedModelName]!enlist"exampleModel"

// Run AutoML
.automl.fit[features;target;ftype;ptype;params]

## saveOption

Option defining what is to be saved to disk during a run

There are three options.

0    Save nothing: the models run, but nothing is persisted to disk
1    Save model/metadata only: images and report are not generated
2    Save all: reports, images, metadata and models are saved to disk

The default is 2: save everything.

Example:

// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)

// Regression target
target:100?1f

// Feature extraction type
ftype:`normal

// Problem type
ptype:`reg

// Save only the minimal outputs
params:enlist[`saveOption]!enlist 1
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]

// No outputs saved
params:enlist[`saveOption]!enlist 0
// Run AutoML
.automl.fit[features;target;ftype;ptype;params]

## scoringFunctionClassification/Regression

Scoring functions used in model validation and optimization

The scoring metrics used to evaluate model performance for regression and classification tasks are defined respectively by the parameters

scoringFunctionClassification
scoringFunctionRegression

The following functions are currently supported within the framework. Each is listed with the score ordering used to choose the best model, as defined in

automl/code/customization/scoring/scoring.json

.ml statistical analysis metrics with AutoML score order

function     description                            score order
accuracy     accuracy of classification results     desc
mae          mean absolute error                    asc
mape         mean absolute percentage error         desc
matthewCorr  Matthews correlation coefficient       desc
mse          mean square error                      asc
rmse         root mean square error                 asc
rmsle        root mean square logarithmic error     asc
r2Score      r2-score                               desc
smape        symmetric mean absolute error          desc
sse          sum squared error                      asc

The default values for these two parameters are

scoringFunctionRegression      .ml.mse
scoringFunctionClassification  .ml.accuracy

Example: modifying the regression metric

// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)

// Regression target
target:100?1f

// Feature extraction type
ftype:`normal

// Problem type
ptype:`reg

// Use mean absolute error as scoring function
params:enlist[`scoringFunctionRegression]!enlist`.ml.mae

// Run AutoML
.automl.fit[features;target;ftype;ptype;params]

To use a custom scoring metric function, define it in the central process and add it to automl/code/customization/scoring/scoring.json.

The function must be a binary with vector arguments:

1. predicted labels
2. true labels

and return the score.

Functions within the ML Toolkit which take additional parameters, such as .ml.f1Score, can be accessed in this way and could be defined as a projection.
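The required shape of a custom metric, and the use of a projection to fix an extra parameter, can be sketched in plain q. Both `medAbsErr` and `withinTol` are hypothetical metrics invented for illustration; neither is part of the ML Toolkit, and either would still need an entry in scoring.json as described above.

```q
// Hypothetical custom metric: median absolute error (lower is better)
medAbsErr:{[pred;actual] med abs pred-actual}

// Hypothetical metric with an extra parameter:
// share of predictions within a tolerance of the true value
withinTol:{[pred;actual;tol] avg tol>=abs pred-actual}

// Fix the extra parameter by projection to obtain a binary function
withinTol10:withinTol[;;0.1]

pred:1 2 3 4f
actual:1.05 2.5 3 3.95
medAbsErr[pred;actual]
withinTol10[pred;actual]    // 0.75: three of four predictions are within 0.1
```

The projection `withinTol[;;0.1]` leaves the first two arguments open, so it can be called exactly like a binary scoring function.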

## seed

The seed used to ensure model reruns are consistent

By default, each run of the framework is completed with a ‘random’ seed derived from the time of a run. The seed can be set to a user-specified value to ensure results are consistent across runs, thus allowing the impact of modifications to the pipeline to be accurately monitored.
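The effect of fixing a seed can be seen in plain q, independently of AutoML. This sketch only illustrates the principle using q's own random number generator; how AutoML applies the seed internally is not shown here.

```q
// Seed q's random number generator and draw five numbers
system"S 42"
first5:5?100

// Reseed with the same value and draw again
system"S 42"
again5:5?100

first5~again5    // 1b: identical draws from identical seeds
```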

// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)

// Regression target
target:100?1f

// Feature extraction type
ftype:`normal

// Problem type
ptype:`reg

// User-defined seed
params:enlist[`seed]!enlist 42

// Run AutoML - can run twice to show consistency
.automl.fit[features;target;ftype;ptype;params]

Reproducing results

For full reproducibility between q processes of the NLP word2vec implementation, set the PYTHONHASHSEED environment variable upon initializing q.

Linux, macOS:

PYTHONHASHSEED=0 q

Windows (before starting q):

set PYTHONHASHSEED=0

## significantFeatures

Feature significance function to be applied to data to reduce feature set

By default, AutoML applies the feature-significance tests in

.automl.featureSignificance.significance

The function uses the Benjamini-Hochberg-Yekutieli (BHY) procedure to identify significant features within the dataset. If no significant columns are returned, the top 25th percentile of features will be selected.

Users can alter AutoML to apply different significance tests as shown below.

// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)

// Regression target
target:100?1f

// Feature extraction type
ftype:`normal

// Problem type
ptype:`reg

// Define the function to be applied for feature significance tests
newSigFeats:{.ml.fresh.significantFeatures[x;y;.ml.fresh.kSigFeat 2]}

// Pass in new function as a symbol
params:enlist[`significantFeatures]!enlist`newSigFeats

// Run AutoML
.automl.fit[features;target;ftype;ptype;params]

An alternative function must be binary, with arguments

1. simple feature table
2. target vector

and return a list of table columns deemed significant.
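A function meeting this contract does not need the ML Toolkit. The sketch below is a hypothetical alternative written in plain q: it keeps the columns whose absolute Pearson correlation with the target exceeds 0.5. The name `corrSigFeats` and the threshold are assumptions made for illustration, and the function assumes every column is numeric.

```q
// Hypothetical significance function: keep columns strongly correlated with the target
corrSigFeats:{[tab;tgt]
  c:cols tab;
  c where 0.5<abs tgt cor/:value flip tab
  }

// Column a tracks the target exactly; column b merely alternates in sign
tab:([]a:`float$til 10;b:10#1 -1f)
tgt:`float$til 10
corrSigFeats[tab;tgt]    // ,`a

// It would be supplied by name, as in the example above
params:enlist[`significantFeatures]!enlist`corrSigFeats
```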

## targetLimit

Number of targets above which long-running models are removed

If the number of targets in the dataset exceeds this, the following models will be removed from the processing stage: keras, svm, neuralNetwork

The default value is 10,000.

// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)

// Regression target
target:100?1f

// Feature extraction type
ftype:`normal

// Problem type
ptype:`reg

// Lower the target limit
params:enlist[`targetLimit]!enlist 1000

// Run AutoML
.automl.fit[features;target;ftype;ptype;params]

## testingSize

Size of testing set on which final model is tested

By default the testing set across all problem types is set to 20%. For problems with a small number of data points, you may wish to increase the number of data points being trained on. The opposite may be true for larger datasets.

// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)

// Regression target
target:100?1f

// Feature extraction type
ftype:`normal

// Problem type
ptype:`reg

// Set the testing set to contain 30% of the dataset
params:enlist[`testingSize]!enlist .3

// Run AutoML
.automl.fit[features;target;ftype;ptype;params]

## trainTestSplit

Function used to split the data into training and testing sets

Default functions for splitting the data into training and testing sets:

problem type  function             description
Normal        .ml.trainTestSplit   Shuffle the dataset and split it into training and testing sets with a defined percentage in each
FRESH         .automl.ttsNonShuff  Split the dataset, without shuffling, into training and testing sets with a defined percentage in each, to ensure no time leakage
NLP           .ml.trainTestSplit   Shuffle the dataset and split it into training and testing sets with a defined percentage in each

For specific use cases this may not be sufficient. For example, to split the data so that the target classes are equally distributed between the training and testing sets, you could implement the split as follows.

// Non-timeseries (normal) feature table
features:([]100?1f;asc 100?1f;100?1f;100?1f;100?1f)

// Multi-classification target
target:100?5

// Feature extraction type
ftype:`normal

// Problem type
ptype:`class

// Helper assumed by ttsStrat: random permutation of the indices of x
shuffle:{neg[n]?n:count x}

// Create new TTS function
ttsStrat:{[x;y;sz]
  `xtrain`ytrain`xtest`ytest!
    raze(x;y)@\:/:r@'shuffle each r:(,'/){
      x@(0,floor n*1-y)_neg[n]?n:count x
      }[;sz]each value n@'shuffle each n:group y
  }

// Update parameters
params:enlist[`trainTestSplit]!enlist`ttsStrat

// Run AutoML
.automl.fit[features;target;ftype;ptype;params]

An alternative function for this must take arguments

1. simple table
2. target vector
3. size-splitting criteria used (number folds/percentage of data in validating model)

and return a dictionary with keys `xtrain`ytrain`xtest`ytest, where the x components are tables containing the split data and the y components are the associated target vectors.
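The contract can be checked with a much simpler function than the stratified example. The sketch below is a hypothetical order-preserving split, named `ttsOrdered` for illustration only: the last `sz` share of rows becomes the testing set and no shuffling takes place.

```q
// Hypothetical split that keeps row order: the final sz share of rows is the test set
ttsOrdered:{[x;y;sz]
  n:floor count[y]*1-sz;
  `xtrain`ytrain`xtest`ytest!(n#x;n#y;n _ x;n _ y)
  }

x:([]a:til 8;b:8?1f)
y:til 8
r:ttsOrdered[x;y;.25]

key r           // `xtrain`ytrain`xtest`ytest
count each r    // 6 training rows and 2 testing rows for both x and y
```

Checking the keys and the row counts in this way is a quick test that a custom function returns what the pipeline expects before it is passed to .automl.fit.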

## w2v

Word2Vec method used for NLP models

Methods:

0   Continuous-Bag-of-Words (default)
1   skip-gram
q)// NLP feature table
q)3#table
comment                                                                      ..
-----------------------------------------------------------------------------..
"If you like plot turns, this is your movie. It is impossible at any moment t..
"It's a real challenge to make a movie about a baby being devoured by wild ca..
"What a good film! Made Men is a great action movie with lots of twists and t..

q)// Binary-classification target
q)target:count[table]?0b

q)// Feature extraction type
q)ftype:`nlp

q)// Problem type
q)ptype:`class

q)// Apply skip-gram (1)
q)params:enlist[`w2v]!enlist 1

q)// Run AutoML
q).automl.fit[table;target;ftype;ptype;params]