Cross validation procedures

.ml.gs Grid-search functions

• kfshuff K-Fold cross validation with randomized indices
• kfsplit K-Fold cross validation with sequential indices
• kfstrat K-Fold cross validation with stratified indices
• mcsplit Monte-Carlo cross validation with random split indices
• pcsplit Percentage-split cross validation
• tschain Chain-forward cross validation
• tsrolls Roll-forward cross validation

.ml.rs Random-search functions

• kfshuff K-Fold cross validation with randomized indices
• kfsplit K-Fold cross validation with sequential indices
• kfstrat K-Fold cross validation with stratified indices
• mcsplit Monte-Carlo cross validation with random split indices
• pcsplit Percentage-split cross validation
• tschain Chain-forward cross validation
• tsrolls Roll-forward cross validation

.ml.xv Cross-validation functions

• kfshuff K-Fold cross validation with randomized indices
• kfsplit K-Fold cross validation with sequential indices
• kfstrat K-Fold cross validation with stratified indices
• mcsplit Monte-Carlo cross validation with random split indices
• pcsplit Percentage-split cross validation
• tschain Chain-forward cross validation
• tsrolls Roll-forward cross validation

The .ml.xv, .ml.gs and .ml.rs namespaces contain functions related to cross validation, grid search, and random/Sobol search respectively. These algorithms are used in machine learning to test how robust or stable a model is to changes in the volume of data or to the specific subsets of data used for model generation.

Within the following examples, .ml.xv.fitscore is used extensively to fit models and return the score achieved on validation/test data. This function can be replaced by a user-defined alternative for tailored applications, e.g. a function to fit on training data and predict outputs for new data.
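As an illustration, a user-defined replacement might be sketched as below. This is a sketch only: the layout of the data argument (taken here to be ((xtrain;ytrain);(xtest;ytest)), matching the toolkit's own fit-scoring functions) and the embedPy instantiation should be checked against the toolkit source before use.

```q
/ hypothetical replacement for .ml.xv.fitscore: fit on training data and
/ return predictions for the validation features rather than a score
/ assumes data arrives as ((xtrain;ytrain);(xtest;ytest))
fitpredict:{[mdl;p;data]
  clf:mdl[][pykwargs p];      / instantiate the sklearn model with hyperparameters p
  clf[`:fit]. data 0;         / fit on (xtrain;ytrain)
  clf[`:predict][data[1]0]`   / predict on xtest and return q values
  }
/ used in place of .ml.xv.fitscore cf, e.g.
/ .ml.gs.kfsplit[5;1;x;yc;fitpredict cf;pc;0]
```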

As of toolkit version 0.1.3, the distribution of cross-validation functions is invoked at console initialization. If a process is started with q -s -4 -p 4321 and xval.q is loaded into the process, the cross-validation library will automatically make 4 worker processes available to execute jobs.
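As a sketch, assuming a standard toolkit installation (the load path and the .ml.loadfile convention may differ between setups):

```q
/ start q with 4 worker processes and a main port
/ $ q ml/ml.q -s -4 -p 4321
q).ml.loadfile`:xval.q   / the library picks up the -s setting and uses the workers
```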

Interactive notebook implementations

Interactive notebook implementations of many of the functions outlined here are available within KxSystems/mlnotebooks.

The most common method of performing hyperparameter optimization in machine learning is a grid search: an exhaustive search across all possible combinations of the hyperparameters provided by a user.

Grid Search Hyperparameter Dictionary

When applying the grid-search functionality provided within the toolkit, a user must provide a dictionary mapping the names of all the hyperparameters to be searched to the possible values to be considered for each. The following example shows valid hyperparameter dictionaries for an sklearn AdaBoostRegressor and DecisionTreeClassifier.

/ grid search hyperparameter dictionary for an AdaBoostRegressor
q)p:`n_estimators`learning_rate!(10 20 50 100;0.1 0.5 0.9)
/ grid search hyperparameter dictionary for a DecisionTreeClassifier
q)p:enlist[`max_depth]!enlist(::;1;2;3;4;5)


.ml.gs.kfshuff

Cross-validated parameter grid search applied to data with shuffled split indices

Syntax: .ml.gs.kfshuff[k;n;x;y;f;p;h]

Where

• k is the integer number of folds
• n is the integer number of repetitions of the procedure
• x is a matrix of features
• y is a vector of targets
• f is a function that takes parameters and data as input and returns a score
• p is a dictionary of hyperparameters to be searched - see section Grid Search Hyperparameter Dictionary
• h is a float value denoting the size of the holdout set used in a fitted grid search, where the best model is fit to the holdout set. If 0 the function will return scores for each fold for the given hyperparameters. If negative the data will be shuffled prior to designation of the holdout set.

returns the scores for hyperparameter sets on each of the k folds for all values of h and additionally returns the best hyperparameters and score on the holdout set for 0 < h <= 1.

q)m:10000
q)x:(m;10)#(m*10)?1f
q)yr:x[;0]+m?.05
// algo
q)rf:{.p.import[`sklearn.linear_model]`:LinearRegression}
// params
q)pr:`fit_intercept`normalize!(01b;01b)
// 4 fold cross-validation no holdout
q).ml.gs.kfshuff[4;1;x;yr;.ml.xv.fitscore rf;pr;0]
fit_intercept normalize|
-----------------------| ---------------------------------------
0             0        | 0.997356  0.9973086 0.9971782 0.9973688
0             1        | 0.9974428 0.9972132 0.9972411 0.9973183
1             0        | 0.9975179 0.997513  0.9975894 0.99757
1             1        | 0.99761   0.9974926 0.997565  0.997517
// 5 fold cross-validated grid search fitted on 20% holdout set
q).ml.gs.kfshuff[5;1;x;yr;.ml.xv.fitscore rf;pr;.2]
(+`fit_intercept`normalize!(0011b;0101b))!(0.9972156 0.9974328 0.9972923 0.99..
`fit_intercept`normalize!11b
0.9975171


.ml.gs.kfsplit

Cross-validated parameter grid search applied to data with ascending split indices

Syntax: .ml.gs.kfsplit[k;n;x;y;f;p;h]

Where

• k is the integer number of folds
• n is the integer number of repetitions of the procedure
• x is a matrix of features
• y is a vector of targets
• f is a function that takes parameters and data as input and returns a score
• p is a dictionary of hyperparameters to be searched - see section Grid Search Hyperparameter Dictionary
• h is a float value denoting the size of the holdout set used in a fitted grid search, where the best model is fit to the holdout set. If 0 the function will return scores for each fold for the given hyperparameters. If negative the data will be shuffled prior to designation of the holdout set.

returns the scores for hyperparameter sets on each of the k folds for all values of h and additionally returns the best hyperparameters and score on the holdout set for 0 < h <= 1.

q)m:10000
q)x:(m;10)#(m*10)?1f
q)yc:(raze flip(0N;4)#m#`a`b`c`d)rank x[;0]
// algos
q)cf:{.p.import[`sklearn.tree]`:DecisionTreeClassifier}
// params
q)pc:enlist[`max_depth]!enlist(::;1;2;3;4;5)
// 5 fold cross-validation no holdout
q).ml.gs.kfsplit[5;1;x;yc;.ml.xv.fitscore cf;pc;0]
max_depth|
---------| ---------------------------------
::       | 1     0.999  1      1      0.9995
1        | 0.484 0.4925 0.4855 0.4895 0.4895
2        | 1     0.742  1      0.7385 0.738
3        | 1     0.999  1      1      0.9995
4        | 1     0.999  1      1      0.9995
5        | 1     0.999  1      1      0.9995
// 5 fold cross-validated grid search fitted on 20% holdout set
q).ml.gs.kfsplit[5;1;x;yc;.ml.xv.fitscore cf;pc;.2]
(+(,`max_depth)!,(::;1;2;3;4;5))!(0.99875 1 0.999375 0.999375 1;0.500625 0.48..
(,`max_depth)!,::
1f
// 10 fold cross-validated grid search fitted on 10% holdout
// with initial data shuffle
q).ml.gs.kfsplit[10;1;x;yc;.ml.xv.fitscore cf;pc;-.1]
(+(,`max_depth)!,(::;1;2;3;4;5))!(1 1 1 1 1 1 1 1 0.9988889 0.9977778;0.48555..
(,`max_depth)!,::
0.998


.ml.gs.kfstrat

Cross-validated parameter grid search applied to data with an equal distribution of targets per fold

Syntax: .ml.gs.kfstrat[k;n;x;y;f;p;h]

Where

• k is the integer number of folds
• n is the integer number of repetitions of the procedure
• x is a matrix of features
• y is a vector of targets
• f is a function that takes parameters and data as input and returns a score
• p is a dictionary of hyperparameters to be searched - see section Grid Search Hyperparameter Dictionary
• h is a float value denoting the size of the holdout set used in a fitted grid search, where the best model is fit to the holdout set. If 0 the function will return scores for each fold for the given hyperparameters. If negative the data will be shuffled prior to designation of the holdout set.

returns the scores for hyperparameter sets on each of the k folds for all values of h and additionally returns the best hyperparameters and score on the holdout set for 0 < h <= 1.

q)m:10000
q)x:(m;10)#(m*10)?1f
q)yc:(raze flip(0N;4)#m#`a`b`c`d)rank x[;0]
// algos
q)cf:{.p.import[`sklearn.tree]`:DecisionTreeClassifier}
// params
q)pc:enlist[`max_depth]!enlist(::;1;2;3;4;5)
// 5 fold cross-validation no holdout
q).ml.gs.kfstrat[5;1;x;yc;.ml.xv.fitscore cf;pc;0]
max_depth|
---------| -------------------------------
::       | 1      0.9995 1   0.9985 1
1        | 0.4995 0.5    0.5 0.5    0.5
2        | 1      1      1   0.9995 1
3        | 1      1      1   1      0.9995
4        | 1      1      1   0.9995 1
5        | 1      0.9995 1   0.999  0.999
// 4 fold cross-validated grid search fitted on 20% holdout set
q).ml.gs.kfstrat[4;1;x;yc;.ml.xv.fitscore cf;pc;.2]
(+(,`max_depth)!,(::;1;2;3;4;5))!(1 0.9995 1 1;0.5052474 0.505 0.5052474 0.50..
(,`max_depth)!,3
1f


.ml.gs.mcsplit

Cross-validated parameter grid search applied to randomly shuffled data and validated on a percentage holdout set

Syntax: .ml.gs.mcsplit[pc;n;x;y;f;p;h]

Where

• pc is a float between 0 and 1 denoting the percentage of data in the holdout validation set
• n is the integer number of repetitions of the procedure
• x is a matrix of features
• y is a vector of targets
• f is a function that takes parameters and data as input and returns a score
• p is a dictionary of hyperparameters to be searched - see section Grid Search Hyperparameter Dictionary
• h is a float value denoting the size of the holdout set used in a fitted grid search, where the best model is fit to the holdout set. If 0 the function will return scores for each fold for the given hyperparameters. If negative the data will be shuffled prior to designation of the holdout set.

returns the scores for hyperparameter sets on each of the k folds for all values of h and additionally returns the best hyperparameters and score on the holdout set for 0 < h <= 1.

q)m:10000
q)x:(m;10)#(m*10)?1f
q)yr:x[;0]+m?.05
// algo
q)rf:{.p.import[`sklearn.linear_model]`:LinearRegression}
// params
q)pr:`fit_intercept`normalize!(01b;01b)
// 20% validation set with 5 repetitions, no fit on holdout
q).ml.gs.mcsplit[0.2;5;x;yr;.ml.xv.fitscore rf;pr;0]
fit_intercept normalize|
-----------------------| -------------------------------------------------
0             0        | 0.9972352 0.9972425 0.9971611 0.9971591 0.997231
0             1        | 0.9972606 0.9972371 0.9972327 0.9971915 0.9972252
1             0        | 0.997512  0.9974875 0.9975111 0.9974552 0.9974987
1             1        | 0.9975357 0.9975999 0.9975074 0.9974429 0.9975411
// 10% validation set with 3 repetitions, fit on 20% holdout set
q).ml.gs.mcsplit[0.1;3;x;yr;.ml.xv.fitscore rf;pr;.2]
(+`fit_intercept`normalize!(0011b;0101b))!(0.9973292 0.997217 0.9973233;0.997..
`fit_intercept`normalize!10b
0.997549


.ml.gs.pcsplit

Cross-validated parameter grid search applied to percentage split dataset

Syntax: .ml.gs.pcsplit[pc;n;x;y;f;p;h]

Where

• pc is a float between 0 and 1 denoting the percentage of data in the holdout validation set
• n is the integer number of repetitions of the procedure
• x is a matrix of features
• y is a vector of targets
• f is a function that takes parameters and data as input and returns a score
• p is a dictionary of hyperparameters to be searched - see section Grid Search Hyperparameter Dictionary
• h is a float value denoting the size of the holdout set used in a fitted grid search, where the best model is fit to the holdout set. If 0 the function will return scores for each fold for the given hyperparameters. If negative the data will be shuffled prior to designation of the holdout set.

returns the scores for hyperparameter sets on each of the k folds for all values of h and additionally returns the best hyperparameters and score on the holdout set for 0 < h <= 1.

q)m:10000
q)x:(m;10)#(m*10)?1f
q)yr:x[;0]+m?.05
// algo
q)rf:{.p.import[`sklearn.linear_model]`:LinearRegression}
// params
q)pr:`fit_intercept`normalize!(01b;01b)
// 20% validation set with 1 repetition, no fit on holdout
q).ml.gs.pcsplit[0.2;1;x;yr;.ml.xv.fitscore rf;pr;0]
fit_intercept normalize|
-----------------------| ---------
0             0        | 0.9971796
0             1        | 0.9971796
1             0        | 0.9974919
1             1        | 0.9974919
// 10% validation set with 3 repetitions, fit on 20% holdout set
q).ml.gs.pcsplit[0.1;3;x;yr;.ml.xv.fitscore rf;pr;.2]
(+`fit_intercept`normalize!(0011b;0101b))!(0.9972762 0.9972762 0.9972762;0.99..
`fit_intercept`normalize!10b
0.9974919


This form of cross validation is also known as repeated random sub-sampling validation. It has advantages over k-fold methods where grouping observations into equal-sized bins is not desired, or where outliers could heavily bias a classifier.
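The index generation behind such a split can be sketched as follows (an illustrative stand-alone example, not the toolkit's internal implementation):

```q
/ one Monte-Carlo split of m datapoints, holding out fraction pc for validation
mcidx:{[pc;m]i:neg[m]?m;n:floor m*1-pc;(n#i;n _ i)}   / returns (train;validation)
idx:mcidx[0.2;10000]
count each idx    / 8000 2000
```

Repeating this with a fresh shuffle on each repetition gives the Monte-Carlo behaviour described above.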

.ml.gs.tschain

Cross-validated parameter grid search applied to chain forward time-series sets

Syntax: .ml.gs.tschain[k;n;x;y;f;p;h]

Where

• k is the integer number of folds
• n is the integer number of repetitions of the procedure
• x is a matrix of features
• y is a vector of targets
• f is a function that takes parameters and data as input and returns a score
• p is a dictionary of hyperparameters to be searched - see section Grid Search Hyperparameter Dictionary
• h is a float value denoting the size of the holdout set used in a fitted grid search, where the best model is fit to the holdout set. If 0 the function will return scores for each fold for the given hyperparameters. If negative the data will be shuffled prior to designation of the holdout set.

returns the scores for hyperparameter sets on each of the k folds for all values of h and additionally returns the best hyperparameters and score on the holdout set for 0 < h <= 1.

q)m:10000
q)x:(m;10)#(m*10)?1f
q)yc:(raze flip(0N;4)#m#`a`b`c`d)rank x[;0]
// algos
q)cf:{.p.import[`sklearn.tree]`:DecisionTreeClassifier}
// params
q)pc:enlist[`max_depth]!enlist(::;1;2;3;4;5)
// 5 fold cross-validation no holdout
q).ml.gs.tschain[5;1;x;yc;.ml.xv.fitscore cf;pc;0]
max_depth|
---------| --------------------------
::       | 0.999  1     0.9995 1
1        | 0.4985 0.492 0.476  0.4885
2        | 0.758  0.745 0.7345 0.743
3        | 0.999  1     0.9995 1
4        | 0.999  1     0.9995 1
5        | 0.999  1     0.9995 1
// 4 fold cross-validated grid search fitted on 20% holdout set
q).ml.gs.tschain[4;1;x;yc;.ml.xv.fitscore cf;pc;.2]
(+(,`max_depth)!,(::;1;2;3;4;5))!(0.999 1 0.9995;0.4985 0.492 0.476;0.758 0.7..
(,`max_depth)!,::
1f


Chain-forward cross validation works as follows: the data is split into equi-sized bins, with increasing amounts of the data incorporated into the training set at each step. This avoids testing a model on historical information, which would be counter-productive for time-series forecasting. It also allows users to test the robustness of the model when passed increasing volumes of data.
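The chained train/validation indices can be sketched in q as follows (a small illustrative example, not the toolkit's internal code):

```q
/ chain-forward folds over k equal bins: fold j trains on bins 0..j-1
/ and validates on bin j
k:5
bins:(k;0N)#til 20                            / 20 points split into 5 bins of 4
chain:{(raze x til y;x y)}[bins]each 1_til k  / list of (train;validation) pairs
first chain    / (0 1 2 3;4 5 6 7): train on bin 0, validate on bin 1
```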

.ml.gs.tsrolls

Cross-validated parameter grid search applied to roll forward time-series sets

Syntax: .ml.gs.tsrolls[k;n;x;y;f;p;h]

Where

• k is the integer number of folds
• n is the integer number of repetitions of the procedure
• x is a matrix of features
• y is a vector of targets
• f is a function that takes parameters and data as input and returns a score
• p is a dictionary of hyperparameters to be searched - see section Grid Search Hyperparameter Dictionary
• h is a float value denoting the size of the holdout set used in a fitted grid search, where the best model is fit to the holdout set. If 0 the function will return scores for each fold for the given hyperparameters. If negative the data will be shuffled prior to designation of the holdout set.

returns the scores for hyperparameter sets on each of the k folds for all values of h and additionally returns the best hyperparameters and score on the holdout set for 0 < h <= 1.

q)m:10000
q)x:(m;10)#(m*10)?1f
q)yc:(raze flip(0N;4)#m#`a`b`c`d)rank x[;0]
// algos
q)cf:{.p.import[`sklearn.tree]`:DecisionTreeClassifier}
// params
q)pc:enlist[`max_depth]!enlist(::;1;2;3;4;5)
// 6-fold cross validation no holdout
q).ml.gs.tsrolls[6;1;x;yc;.ml.xv.fitscore cf;pc;0]
max_depth|
---------| -------------------------------------------------
::       | 0.9994001 0.9994001 1         0.9994001 0.9988002
1        | 0.4985003 0.5056989 0.5162065 0.4943011 0.4997001
2        | 0.7438512 0.7492501 1         0.7582484 0.7378524
3        | 0.9994001 0.9994001 1         0.9994001 0.9988002
4        | 0.9994001 0.9994001 1         0.9994001 0.9988002
5        | 0.9994001 0.9994001 1         0.9994001 0.9988002
// 4 fold cross-validated grid search fitted on 20% holdout set
q).ml.gs.tsrolls[4;1;x;yc;.ml.xv.fitscore cf;pc;.2]
(+(,`max_depth)!,(::;1;2;3;4;5))!(1 0.999 0.999;0.486 0.4825 0.5005;0.7315 0...
(,`max_depth)!,::
0.999


Roll-forward cross validation works as follows: successive equi-sized bins are taken as training and validation sets at each step. This avoids testing a model on historical information, which would be counter-productive for time-series forecasting.
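The rolling train/validation indices can be sketched in q as follows (a small illustrative example, not the toolkit's internal code):

```q
/ roll-forward folds over k equal bins: fold j trains on bin j only
/ and validates on bin j+1
k:5
bins:(k;0N)#til 20                      / 20 points split into 5 bins of 4
rolls:{(x y;x y+1)}[bins]each til k-1   / list of (train;validation) pairs
first rolls    / (0 1 2 3;4 5 6 7): train on bin 0, validate on bin 1
```

Unlike chain-forward validation, each fold here uses a fixed-size training window, so earlier data eventually rolls out of scope.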

Random and quasi-random search methods replace the need for exhaustive searching across a hyperparameter space, as used by grid search, by selecting combinations of hyperparameters randomly. This random selection can take place across a discrete space, as when choosing randomly from a list of possible values, or over a continuous space bounded by user input.

Such methods commonly outperform grid search at finding optimal parameters, particularly where a small number of hyperparameters disproportionately affect the performance of the machine learning algorithm. This library contains two implementations following the ethos of random search, namely:

1. Random search

The completely random selection of hyperparameters across a defined continuous or discrete search space.

2. Quasi-random sobol sequence search

The selection of hyperparameters across a user-defined continuous or discrete space using a quasi-random selection method based on Sobol sequences. This method ensures that the hyperparameters searched give a more even representation of the hyperparameter space than purely random or grid search methods. More information can be found here

Random Search Hyperparameter Dictionary

The random and Sobol searching methods follow the same syntax as grid search, with the exception of the p parameter. In order to perform these two searching methods, extra information is needed and the parameter dictionary must have the following format:

• typ is the type of random search to perform, as a symbol: `random or `sobol
• random_state is the seed to apply during cross validation. If the generic null, (::), is passed, the default seed, 42, will be applied.
• n is the number of hyperparameter sets to produce.

• Note: for sobol this number must be a power of 2, e.g. 4, 8, 16, etc.
• p is a dictionary of hyperparameters to be searched which must have the following forms:

• Numerical:

enlist[`hyperparam_name]!enlist(space_type;lower_bound;upper_bound;hp_type)

where space_type is `uniform or `loguniform, lower_bound and upper_bound are the limits of the hyperparameter space and hp_type is the type to cast the hyperparameters to.

• Symbol:

enlist[`hyperparam_name]!enlist(`symbol;symbols_to_search)

where the type `symbol is followed by the list of possible symbol values.

• Boolean:

enlist[`hyperparam_name]!enlist`boolean

A practical example is provided below.

// Example random dictionary for a Decision Tree Classifier
q)typ:`random
q)random_state:283
q)n:10
q)p:`max_depth`criterion!((`uniform;1;30;"j");(`symbol;`gini`entropy))
q)pr:`typ`random_state`n`p!(typ;random_state;n;p)
// Example sobol dictionary for an SGD Classifier
q)typ:`sobol
q)random_state:93
q)n:512  // must be a power of 2
q)p:`average`l1_ratio`alpha!(`boolean;(`uniform;0;1;"f");(`loguniform;-5;2;"f"))
q)ps:`typ`random_state`n`p!(typ;random_state;n;p)


.ml.rs.kfshuff

Cross-validated parameter random search applied to data with shuffled split indices

Syntax: .ml.rs.kfshuff[k;n;x;y;f;p;h]

Where

• k is the integer number of folds
• n is the integer number of repetitions of the procedure
• x is a matrix of features
• y is a vector of targets
• f is a function that takes parameters and data as input and returns a score
• p is a dictionary of hyperparameters to be searched - see section Random Search Hyperparameter Dictionary
• h is a float value denoting the size of the holdout set used in a fitted random or sobol search, where the best model is fit to the holdout set. If 0 the function will return scores for each fold for the given hyperparameters. If negative the data will be shuffled prior to designation of the holdout set.

returns the scores for hyperparameter sets on each of the k folds for all values of h and additionally returns the best hyperparameters and score on the holdout set for 0 < h <= 1.

q)m:10000
q)x:(m;10)#(m*10)?1f
q)yc:(raze flip(0N;4)#m#`a`b`c`d)rank x[;0]
// algos
q)cf:{.p.import[`sklearn.tree]`:DecisionTreeClassifier}
// params
q)typ:`random
q)random_state:82
q)n:5
q)p:enlist[`max_depth]!enlist(`uniform;1;20;"j")
// random hyperparameter dictionary
q)pr:`typ`random_state`n`p!(typ;random_state;n;p)
// 6-fold cross validation no holdout
q).ml.rs.kfshuff[6;1;x;yc;.ml.xv.fitscore cf;pr;0]
max_depth|
---------| ---------------------------------------------------
9        | 0.9994001 1 0.9993998 1         1         1
12       | 1         1 1         1         0.9994001 0.9981993
19       | 0.9994001 1 0.9993998 1         1         0.9993998
5        | 0.9994001 1 0.9993998 0.9994001 1         1
7        | 1         1 0.9987995 1         1         1
// 5 fold cross-validated random search fitted on 20% holdout set
q).ml.rs.kfshuff[5;1;x;yc;.ml.xv.fitscore cf;pr;.2]
(+(,`max_depth)!,9 12 19 5 7)!(1 1 0.999375 1 0.99875;1 1 0.999375 0.999375 0..
(,`max_depth)!,5
1f


.ml.rs.kfsplit

Cross-validated parameter random search applied to data with ascending split indices

Syntax: .ml.rs.kfsplit[k;n;x;y;f;p;h]

Where

• k is the integer number of folds
• n is the integer number of repetitions of the procedure
• x is a matrix of features
• y is a vector of targets
• f is a function that takes parameters and data as input and returns a score
• p is a dictionary of hyperparameters to be searched - see section Random Search Hyperparameter Dictionary
• h is a float value denoting the size of the holdout set used in a fitted random or sobol search, where the best model is fit to the holdout set. If 0 the function will return scores for each fold for the given hyperparameters. If negative the data will be shuffled prior to designation of the holdout set.

returns the scores for hyperparameter sets on each of the k folds for all values of h and additionally returns the best hyperparameters and score on the holdout set for 0 < h <= 1.

q)m:10000
q)x:(m;10)#(m*10)?1f
q)yc:(raze flip(0N;4)#m#`a`b`c`d)rank x[;0]
// algos
q)cf:{.p.import[`sklearn.linear_model]`:SGDClassifier}
// params
q)typ:`sobol
q)random_state:374
q)n:8  // must be a power of 2
q)p:`average`l1_ratio`alpha!(`boolean;(`uniform;0;1;"f");(`loguniform;-5;2;"f"))
// sobol hyperparameter dictionary
q)ps:`typ`random_state`n`p!(typ;random_state;n;p)
// 4-fold cross validation no holdout
q).ml.rs.kfsplit[4;1;x;yc;.ml.xv.fitscore cf;ps;0]
average l1_ratio alpha        |
------------------------------| ---------------------------
1       0.5      0.03162278   | 0.5072 0.5148 0.512  0.4884
0       0.75     0.0005623413 | 0.7612 0.6736 0.7232 0.678
0       0.25     1.778279     | 0.506  0.5128 0.5072 0.5392
1       0.375    0.004216965  | 0.64   0.6408 0.6576 0.624
0       0.875    13.33521     | 0.2608 0.2916 0.2604 0.2544
1       0.625    7.498942e-005| 0.898  0.9032 0.9108 0.9116
1       0.125    0.2371374    | 0.5044 0.5148 0.5116 0.4888
0       0.1875   0.001539927  | 0.646  0.6788 0.6476 0.6268
// 5 fold cross-validated sobol search fitted on 20% holdout set
q).ml.rs.kfsplit[5;1;x;yc;.ml.xv.fitscore cf;ps;.2]
(+`average`l1_ratio`alpha!(10010110b;0.5 0.75 0.25 0.375 0.875 0.625 0.125 0...
`average`l1_ratio`alpha!(1b;0.625;7.498942e-005)
0.9105
// 10 fold cross-validated sobol search fitted on 10% holdout
// with initial data shuffle
q).ml.rs.kfsplit[10;1;x;yc;.ml.xv.fitscore cf;ps;-.1]
(+`average`l1_ratio`alpha!(10010110b;0.5 0.75 0.25 0.375 0.875 0.625 0.125 0...
`average`l1_ratio`alpha!(1b;0.625;7.498942e-005)
0.894


.ml.rs.kfstrat

Cross-validated parameter random search applied to data with an equal distribution of targets per fold

Syntax: .ml.rs.kfstrat[k;n;x;y;f;p;h]

Where

• k is the integer number of folds
• n is the integer number of repetitions of the procedure
• x is a matrix of features
• y is a vector of targets
• f is a function that takes parameters and data as input and returns a score
• p is a dictionary of hyperparameters to be searched - see section Random Search Hyperparameter Dictionary
• h is a float value denoting the size of the holdout set used in a fitted random or sobol search, where the best model is fit to the holdout set. If 0 the function will return scores for each fold for the given hyperparameters. If negative the data will be shuffled prior to designation of the holdout set.

returns the scores for hyperparameter sets on each of the k folds for all values of h and additionally returns the best hyperparameters and score on the holdout set for 0 < h <= 1.

q)m:10000
q)x:(m;10)#(m*10)?1f
q)yc:(raze flip(0N;4)#m#`a`b`c`d)rank x[;0]
// algos
q)cf:{.p.import[`sklearn.ensemble]`:AdaBoostClassifier}
// params
q)typ:`random
q)random_state:363
q)n:6
q)p:`n_estimators`learning_rate!((`uniform;1;2;"j");(`loguniform;-3;0;"f"))
// random hyperparameter dictionary
q)pr:`typ`random_state`n`p!(typ;random_state;n;p)
// 3-fold cross validation no holdout
q).ml.rs.kfstrat[3;1;x;yc;.ml.xv.fitscore cf;pr;0]
n_estimators learning_rate|
--------------------------| -----------------------------
1            0.1590291    | 0.5       0.4996999 0.4996999
1            0.007257519  | 0.5       0.4996999 0.5
1            0.2363748    | 0.5       0.4996999 0.5
1            0.9187224    | 0.5       0.4996999 0.4996999
1            0.01076973   | 0.4994005 0.4996999 0.5
1            0.6294208    | 0.5       0.4993998 0.4996999
// 4 fold cross-validated random search fitted on 20% holdout set
q).ml.rs.kfstrat[4;1;x;yc;.ml.xv.fitscore cf;pr;.2]
(+`n_estimators`learning_rate!(1 1 1 1 1 1;0.1590291 0.007257519 0.2363748 0...
`n_estimators`learning_rate!(1;0.007257519)
0.491


.ml.rs.mcsplit

Cross-validated parameter random search applied to randomly shuffled data and validated on a percentage holdout set

Syntax: .ml.rs.mcsplit[pc;n;x;y;f;p;h]

Where

• pc is a float between 0 and 1 denoting the percentage of data in the holdout validation set
• n is the integer number of repetitions of the procedure
• x is a matrix of features
• y is a vector of targets
• f is a function that takes parameters and data as input and returns a score
• p is a dictionary of hyperparameters to be searched - see section Random Search Hyperparameter Dictionary
• h is a float value denoting the size of the holdout set used in a fitted random or sobol search, where the best model is fit to the holdout set. If 0 the function will return scores for each fold for the given hyperparameters. If negative the data will be shuffled prior to designation of the holdout set.

returns the scores for hyperparameter sets on each of the k folds for all values of h and additionally returns the best hyperparameters and score on the holdout set for 0 < h <= 1.

q)m:10000
q)x:(m;10)#(m*10)?1f
q)yr:x[;0]+m?.05
// algos
q)dt:{.p.import[`sklearn.tree]`:DecisionTreeRegressor}
// params
q)typ:`sobol
q)random_state:73
q)n:8  // must be a power of 2
q)p:enlist[`max_depth]!enlist(`uniform;1;20;"j")
// sobol hyperparameter dictionary
q)ps:`typ`random_state`n`p!(typ;random_state;n;p)
// 20% validation set with 5 repetitions, no fit on holdout
q).ml.rs.mcsplit[0.2;5;x;yr;.ml.xv.fitscore dt;ps;0]
max_depth|
---------| -------------------------------------------------
11       | 0.9960566 0.9958836 0.9958455 0.9959367 0.9961012
15       | 0.9952471 0.9953858 0.9950434 0.9951379 0.9951459
6        | 0.9971643 0.9972604 0.9973471 0.9971362 0.9972258
8        | 0.9969521 0.9970162 0.9970413 0.9969478 0.9969298
18       | 0.99497   0.9946913 0.9949457 0.9947844 0.9952239
13       | 0.9953671 0.9951547 0.9955295 0.9954282 0.9952482
3        | 0.9812511 0.9810006 0.9819113 0.9826636 0.9819451
5        | 0.9966442 0.996488  0.9964846 0.9965193 0.9965324
// 10% validation set with 3 repetitions, fit on 20% holdout set
q).ml.rs.mcsplit[0.1;3;x;yr;.ml.xv.fitscore dt;ps;.2]
(+(,`max_depth)!,11 15 6 8 18 13 3 5)!(0.9963612 0.9960823 0.9959418;0.995198..
(,`max_depth)!,6
0.9970801


.ml.rs.pcsplit

Cross-validated parameter random search applied to percentage split dataset

Syntax: .ml.rs.pcsplit[pc;n;x;y;f;p;h]

Where

• pc is a float between 0 and 1 denoting the percentage of data in the holdout validation set
• n is the integer number of repetitions of the procedure
• x is a matrix of features
• y is a vector of targets
• f is a function that takes parameters and data as input and returns a score
• p is a dictionary of hyperparameters to be searched - see section Random Search Hyperparameter Dictionary
• h is a float value denoting the size of the holdout set used in a fitted random or sobol search, where the best model is fit to the holdout set. If 0 the function will return scores for each fold for the given hyperparameters. If negative the data will be shuffled prior to designation of the holdout set.

returns the scores for hyperparameter sets on each of the k folds for all values of h and additionally returns the best hyperparameters and score on the holdout set for 0 < h <= 1.

q)m:10000
q)x:(m;10)#(m*10)?1f
q)yr:x[;0]+m?.05
// algos
q)dt:{.p.import[`sklearn.tree]`:DecisionTreeRegressor}
// params
q)typ:`random
q)random_state:92
q)n:6
q)p:enlist[`max_depth]!enlist(`uniform;1;20;"j")
// random hyperparameter dictionary
q)pr:`typ`random_state`n`p!(typ;random_state;n;p)
// 20% validation set with 1 repetition, no fit on holdout
q).ml.rs.pcsplit[.2;1;x;yr;.ml.xv.fitscore dt;pr;0]
max_depth|
---------| ---------
7        | 0.9972803
1        | 0.7437862
19       | 0.9949442
17       | 0.9950088
16       | 0.9950008
11       | 0.9959856
// 10% validation set with 3 repetitions, fit on 20% holdout set
q).ml.rs.pcsplit[.1;3;x;yr;.ml.xv.fitscore dt;pr;.2]
(+(,`max_depth)!,7 1 19 17 16 11)!(0.9971323 0.9971323 0.9971323;0.7563344 0...
(,`max_depth)!,7
0.9972803


This form of cross validation is also known as repeated random sub-sampling validation. It has advantages over k-fold methods where grouping observations into equal-sized bins is not desired, or where outliers could heavily bias a classifier.

.ml.rs.tschain

Cross-validated parameter random search applied to chain forward time-series sets

Syntax: .ml.rs.tschain[k;n;x;y;f;p;h]

Where

• k is the integer number of folds
• n is the integer number of repetitions of the procedure
• x is a matrix of features
• y is a vector of targets
• f is a function that takes parameters and data as input and returns a score
• p is a dictionary of hyperparameters to be searched - see section Random Search Hyperparameter Dictionary
• h is a float value denoting the size of the holdout set used in a fitted random or sobol search, where the best model is fit to the holdout set. If 0 the function will return scores for each fold for the given hyperparameters. If negative the data will be shuffled prior to designation of the holdout set.

returns the scores for hyperparameter sets on each of the k folds for all values of h and additionally returns the best hyperparameters and score on the holdout set for 0 < h <= 1.

q)m:10000
q)x:(m;10)#(m*10)?1f
q)yc:(raze flip(0N;4)#m#`a`b`c`d)rank x[;0]
// algos
q)cf:{.p.import[`sklearn.ensemble]`:AdaBoostClassifier}
// params
q)typ:`sobol
q)random_state:15
q)n:8  // must be a power of 2
q)p:`n_estimators`learning_rate!((`uniform;1;5;"j");(`loguniform;-3;0;"f"))
// sobol hyperparameter dictionary
q)ps:`typ`random_state`n`p!(typ;random_state;n;p)
// 3-fold cross validation no holdout
q).ml.rs.tschain[3;1;x;yc;.ml.xv.fitscore cf;ps;0]
n_estimators learning_rate|
--------------------------| -------------------
3            0.03162278   | 1         0.9991002
4            0.005623413  | 0.7491749 0.9991002
2            0.1778279    | 0.7344734 0.7474505
3            0.01333521   | 1         0.9991002
5            0.4216965    | 1         0.9991002
4            0.002371374  | 0.7491749 0.7420516
2            0.07498942   | 0.7344734 0.7474505
2            0.008659643  | 0.7344734 0.7474505
// 4 fold cross-validated random search fitted on 20% holdout set
q).ml.rs.tschain[4;1;x;yc;.ml.xv.fitscore cf;ps;.2]
(+`n_estimators`learning_rate!(3 4 2 3 5 4 2 2;0.03162278 0.005623413 0.17782..
`n_estimators`learning_rate!(3;0.03162278)
0.4775


This works as shown in the following image:

The data is split into equi-sized bins with increasing amounts of the data incorporated into the testing set at each step. This avoids testing a model on historical information which would be counter-productive for time-series forecasting. It also allows users to test the robustness of the model when passed increasing volumes of data.
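
The chain-forward index logic described above can be sketched in Python. This is an illustrative reimplementation for clarity, not the toolkit's q code: the rows are cut into k sequential bins, and at step i the model trains on bins 0..i-1 and validates on bin i, so validation data always lies after the training data in time.

```python
# Illustrative sketch of chain-forward ("tschain"-style) splitting:
# k sequential bins, training set grows by one bin at each step.
def chain_forward_splits(n, k):
    """Yield (train_idx, val_idx) pairs over n rows cut into k bins."""
    bins = [list(range(i * n // k, (i + 1) * n // k)) for i in range(k)]
    for i in range(1, k):
        train = [j for b in bins[:i] for j in b]
        yield train, bins[i]

splits = list(chain_forward_splits(10, 5))
# 5 bins of 2 rows -> 4 train/validate steps with growing training sets
```

Note that k bins yield only k-1 scores, which matches the 4 scores returned by the 5-fold q examples below.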

.ml.rs.tsrolls

Cross-validated parameter random search applied to roll-forward time-series sets

Syntax: .ml.rs.tsrolls[k;n;x;y;f;p;h]

Where

• k is the integer number of folds
• n is the integer number of repetitions of the procedure
• x is a matrix of features
• y is a vector of targets
• f is a function that takes parameters and data as input and returns a score
• p is a dictionary of hyperparameters to be searched - see section Random Search Hyperparameter Dictionary
• h is a float denoting the size of the holdout set used in a fitted random or sobol search, where the best model found is fit to the holdout set. If 0, the function returns only the scores for each fold for the given hyperparameters. If negative, the data is shuffled prior to designation of the holdout set.

returns the scores achieved by each hyperparameter set on each of the k folds. For 0 < h <= 1 it additionally returns the best hyperparameters and their score on the holdout set.

q)m:10000
q)x:(m;10)#(m*10)?1f
q)yc:(raze flip(0N;4)#m#`a`b`c`d)rank x[;0]
// algos
q)cf:{.p.import[`sklearn.ensemble][`:AdaBoostClassifier]}
// params
q)typ:`random
q)random_state:116
q)n:7
q)p:`n_estimators`learning_rate!((`uniform;1;20;"j");(`loguniform;-3;1;"f"))
// random hyperparameter dictionary
q)pr:`typ`random_state`n`p!(typ;random_state;n;p)
// 5-fold cross validation no holdout
q).ml.rs.tsrolls[5;1;x;yc;.ml.xv.fitscore cf;pr;0]
n_estimators learning_rate|
--------------------------| -------------------------
16           0.002382789  | 0.998 0.51   0.5005 1
11           0.1853371    | 0.998 0.9995 0.9995 1
12           7.029101     | 0.998 0.757  0.7465 1
6            8.910367     | 0.998 0.51   0.7465 1
7            0.04554741   | 0.998 0.51   0.5005 1
10           0.02318011   | 0.998 0.51   0.5005 1
2            5.920482     | 0.74  0.7595 0.7465 0.735
// 6 fold cross-validated random search fitted on 20% holdout set
q).ml.rs.tsrolls[6;1;x;yc;.ml.xv.fitscore cf;pr;.2]
(+`n_estimators`learning_rate!(16 11 12 6 7 10 2;0.002382789 0.1853371 7.0291..
`n_estimators`learning_rate!(11;0.1853371)
1f


This works as shown in the following image:

Successive equi-sized bins are taken as training and validation sets at each step. This avoids testing a model on historical information which would be counter-productive for time-series forecasting.

Cross Validation

Cross-validation is a technique used to gain a statistical understanding of how well a machine learning model generalizes to independent datasets. This is used to limit overfitting and selection bias, especially when dealing with small datasets.
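
The basic k-fold scheme with randomized indices (as in kfshuff) can be sketched in Python. This is an illustrative sketch, not the toolkit's implementation: the row indices are shuffled once, cut into k equi-sized folds, and each fold in turn acts as the validation set while the remainder forms the training set.

```python
import random

# Illustrative k-fold index generation with shuffled, non-repeating
# indices: every row appears in exactly one validation fold.
def kfold_indices(n, k, seed=42):
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k equi-sized folds
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]
```

Each of the k steps produces one score, so k=5 below yields five scores.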

.ml.xv.kfshuff

K-Fold cross validation for randomized non-repeating indices

Syntax: .ml.xv.kfshuff[k;n;x;y;f]

Where

• k is the integer number of folds
• n is the integer number of repetitions of the procedure
• x is a matrix of features
• y is a vector of targets
• f is a function which takes data as input

returns output of f applied to each of the cross-validation splits.

q)m:10000
q)x:(m;10)#(m*10)?1f
q)yr:x[;0]+m?.05
q)k:5
q)n:1
q)ar:{.p.import[`sklearn.linear_model][`:LinearRegression]}
q)mdlfn:.ml.xv.fitscore[ar][]
q).ml.xv.kfshuff[k;n;x;yr;mdlfn]
0.9999935 0.9999934 0.9999935 0.9999935 0.9999935


.ml.xv.kfsplit

K-Fold cross-validation with sequential (ascending) indices split into k folds

Syntax: .ml.xv.kfsplit[k;n;x;y;f]

Where

• k is the integer number of folds
• n is the integer number of repetitions of the procedure
• x is a matrix of features
• y is a vector of targets
• f is a function which takes data as input

returns output of f applied to each of the cross-validation splits.

q)m:10000
q)x:(m;10)#(m*10)?1f
q)yr:x[;0]+m?.05
q)k:5
q)n:1
q)ar:{.p.import[`sklearn.linear_model][`:LinearRegression]}
q)mdlfn:.ml.xv.fitscore[ar][]
q).ml.xv.kfsplit[k;n;x;yr;mdlfn]
0.9953383 0.9995422 0.9985156 0.9995144 0.9952133


.ml.xv.kfstrat

Stratified K-Fold cross-validation with an approximately equal distribution of classes per fold

Syntax: .ml.xv.kfstrat[k;n;x;y;f]

Where

• k is the integer number of folds
• n is the integer number of repetitions of the procedure
• x is a matrix of features
• y is a vector of targets
• f is a function which takes data as input

returns output of f applied to each of the cross-validation splits.

q)m:10000
q)x:(m;10)#(m*10)?1f
q)yc:(raze flip(0N;4)#m#`a`b`c`d)rank x[;0]
q)k:5
q)n:1
q)ac:{.p.import[`sklearn.tree][`:DecisionTreeClassifier]}
q)mdlfn:.ml.xv.fitscore[ac][]
q).ml.xv.kfstrat[k;n;x;yc;mdlfn]
0.9995 0.9995 0.9995 1 1


This is used extensively where the distribution of classes in the data is unbalanced.
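
One way to build stratified folds can be sketched in Python (an illustrative scheme, not the toolkit's q code): indices are grouped by class label and each class is dealt round-robin across the k folds, so every fold keeps approximately the overall class distribution.

```python
from collections import defaultdict

# Illustrative stratified fold construction: group indices by label,
# then deal each group round-robin across the k folds.
def stratified_folds(y, k):
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for j, i in enumerate(indices):
            folds[j % k].append(i)
    return folds

y = ["a"] * 6 + ["b"] * 3   # imbalanced 2:1 class ratio
folds = stratified_folds(y, 3)
# each of the 3 folds holds two "a" rows and one "b" row
```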

.ml.xv.mcsplit

Monte-Carlo cross validation using randomized non-repeating indices

Syntax: .ml.xv.mcsplit[p;n;x;y;f]

Where

• p is a float between 0 and 1 representing the percentage of data within the validation set
• n is the integer number of repetitions of the procedure
• x is a matrix of features
• y is a vector of targets
• f is a function which takes data as input

returns output of f applied to each of the cross-validation splits.

q)m:10000
q)x:(m;10)#(m*10)?1f
q)yr:x[;0]+m?.05
q)p:0.2
q)n:5
q)ar:{.p.import[`sklearn.linear_model][`:LinearRegression]}
q)mdlfn:.ml.xv.fitscore[ar][]
q).ml.xv.mcsplit[p;n;x;yr;mdlfn]
0.9999905 0.9999906 0.9999905 0.9999905 0.9999905


This form of cross validation is also known as repeated random sub-sampling validation. This has advantages over k-fold when equi-sized bins of observations are not wanted or where outliers could heavily bias the classifier.
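
The Monte-Carlo scheme can be sketched in Python (an illustrative sketch, not the toolkit's q code): on each of n repetitions a fresh random permutation of the indices is cut at proportion p, producing a new validation set every time; unlike k-fold, a given row may appear in the validation set of several repetitions.

```python
import random

# Illustrative Monte-Carlo (repeated random sub-sampling) splitting:
# each repetition re-shuffles all indices and cuts at proportion p.
def mc_splits(n_rows, p, n_reps, seed=0):
    rng = random.Random(seed)
    for _ in range(n_reps):
        idx = list(range(n_rows))
        rng.shuffle(idx)
        cut = int(n_rows * p)
        yield idx[cut:], idx[:cut]   # (train, validation)
```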

.ml.xv.pcsplit

Percentage split cross-validation procedure

Syntax: .ml.xv.pcsplit[p;n;x;y;f]

Where

• p is a float between 0 and 1 representing the percentage of data within the validation set
• n is the integer number of repetitions of the procedure
• x is a matrix of features
• y is a vector of targets
• f is a function which takes data as input

returns output of f applied to each of the cross-validation splits.

q)m:10000
q)x:(m;10)#(m*10)?1f
q)yr:x[;0]+m?.05
q)p:0.2
q)n:4
q)ar:{.p.import[`sklearn.linear_model][`:LinearRegression]}
q)mdlfn:.ml.xv.fitscore[ar][]
q).ml.xv.pcsplit[p;n;x;yr;mdlfn]
0.9975171 0.9975171 0.9975171 0.9975171


.ml.xv.tschain

Chain-forward cross-validation procedure

Syntax: .ml.xv.tschain[k;n;x;y;f]

Where

• k is the integer number of folds
• n is the integer number of repetitions of the procedure
• x is a matrix of features
• y is a vector of targets
• f is a function which takes data as input

returns output of f applied to each of the chained iterations.

q)m:10000
q)x:(m;10)#(m*10)?1f
q)yr:x[;0]+m?.05
q)k:5
q)n:1
q)ar:{.p.import[`sklearn.linear_model][`:LinearRegression]}
q)mdlfn:.ml.xv.fitscore[ar][]
q).ml.xv.tschain[k;n;x;yr;mdlfn]
0.9973771 0.9992741 0.9996898 0.9997031


This works as shown in the following image.

The data is split into equi-sized bins with increasing amounts of the data incorporated in the testing set at each step. This avoids testing a model on historical information which would be counter-productive for time-series forecasting. It also allows users to test the robustness of the model with data of increasing volumes.

.ml.xv.tsrolls

Roll-forward cross-validation procedure

Syntax: .ml.xv.tsrolls[k;n;x;y;f]

Where

• k is the integer number of folds
• n is the integer number of repetitions of the procedure
• x is a matrix of features
• y is a vector of targets
• f is a function which takes data as input

returns output of f applied to each of the roll-forward iterations.

q)m:10000
q)x:(m;10)#(m*10)?1f
q)yc:(raze flip(0N;4)#m#`a`b`c`d)rank x[;0]
q)k:5
q)n:1
q)ac:{.p.import[`sklearn.tree][`:DecisionTreeClassifier]}
q)mdlfn:.ml.xv.fitscore[ac][]
q).ml.xv.tsrolls[k;n;x;yc;mdlfn]
0.9973771 0.9995615 0.9999869 0.999965

This works as shown in the following image.

Successive equi-sized bins are taken as validation and training sets for each step. This avoids testing a model on historical information which would be counter-productive for time-series forecasting.
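
The roll-forward scheme can be sketched in Python (an illustrative sketch, not the toolkit's q code): the rows are cut into k sequential bins and each adjacent pair forms one step, training on bin i and validating on bin i+1, so the training window rolls forward instead of growing as in the chain-forward case.

```python
# Illustrative roll-forward ("tsrolls"-style) splitting: train on one
# bin, validate on the next, then slide both forward by one bin.
def roll_forward_splits(n, k):
    bins = [list(range(i * n // k, (i + 1) * n // k)) for i in range(k)]
    for i in range(k - 1):
        yield bins[i], bins[i + 1]

splits = list(roll_forward_splits(10, 5))
# 5 bins of 2 rows -> 4 steps, each training on the bin immediately
# preceding its validation bin
```

As with chain-forward splitting, k bins yield k-1 scores, matching the four scores in the 5-fold q example above.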