# Automated data processing

The procedures outlined below describe the steps required to prepare extracted features for training a model, perform cross validation to determine the most generalizable model, and optimize that model using grid search. These steps follow on from the data preprocessing methods.

The following are the procedures completed when the default system configuration is deployed:

1. Feature extraction is performed using either FRESH or normal feature extraction methods.
2. Significance tests are applied to extracted features to determine those most relevant for model training.
3. Data is split into training, validation and testing sets.
4. Cross-validation procedures are performed on a selection of models.
5. Models are scored using a predefined performance metric, based on the problem type (classification/regression), with the best model selected and scored on the validation set.
6. The best model is saved following optimization using grid-search procedures.
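
In practice, all of these steps are triggered through a single entry point, .automl.run. The minimal sketch below deploys the default configuration on a toy regression problem; the table and target are placeholders, and the flags mirror the worked examples later in this document.

```q
// minimal sketch: deploy the default pipeline on a toy 'normal' regression problem
q)tb:([]100?1f;100?1f;100?1f)            / placeholder feature table
q)tgt:100?1f                             / placeholder continuous target
q).automl.run[tb;tgt;`normal;`reg;(::)]  / (::) requests the default configuration
```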

## Feature extraction

Feature extraction is the process of building derived or aggregate features from a dataset in order to provide a more suitable or useful input for machine-learning algorithms. Within the automated machine learning framework, there are currently two types of feature extraction available: FRESH and normal feature extraction.

### FRESH

The FRESH (FeatuRe Extraction and Scalable Hypothesis testing) algorithm is used specifically for time-series datasets. The data passed to FRESH should contain an identifying (ID) column, which groups the time-series into subsets from which features can be extracted. Note that each subset will have an associated target. The feature extraction functions applied within the FRESH procedure are defined within the ML Toolkit. A full explanation of this algorithm is also provided there.

The example below shows the application of FRESH to a time-series dataset with a date-time ID column and 4 columns from which we derive features.

```q
q)5#tb:([]d:100?2001.01.01+til 5;100?1.;100?1.;100?100;100?10)
d          x         x1        x2 x3
------------------------------------
2001.01.02 0.2966022 0.1671079 52 2
2001.01.03 0.153475  0.5256145 38 2
2001.01.03 0.9069809 0.1577109 22 8
2001.01.04 0.7556175 0.2924597 79 9
2001.01.04 0.8512139 0.6712163 51 0
// no. of features before feature extraction
q)count 1_cols tb
4
// apply FRESH feature extraction
q)show freshfeats:.ml.fresh.createfeatures[tb;`d;1_cols tb;.ml.fresh.params]
d         | x_absenergy x_abssumchange x_count x_countabovemean x_countbelowm..
----------| -----------------------------------------------------------------..
2001.01.01| 5.984048    8.764463       23      9                14           ..
2001.01.02| 6.028151    7.374439       18      8                10           ..
2001.01.03| 8.588659    7.321894       23      14               9            ..
2001.01.04| 8.098905    6.28368        17      9                8            ..
2001.01.05| 7.946673    5.334232       19      11               8            ..
// no. of features after feature extraction
q)count 1_cols freshfeats
1132
```

Note

When running .automl.run for FRESH data, by default the first column of the dataset is defined as the identifying (ID) column; see the instructions on how to amend this if a different column is required.
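
For example, running the pipeline on the FRESH table above picks up column d (the first column) as the ID column automatically. The target below is a hypothetical placeholder with one value per date subset, and the `fresh flag is assumed here, mirroring the `normal flag used in the later examples.

```q
// minimal sketch: full pipeline on the FRESH table tb from above
q)tgt:5?1f                              / hypothetical target: one value per ID group
q).automl.run[tb;tgt;`fresh;`reg;(::)]
```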

### Normal

Normal feature extraction can be applied to ‘normal’ problems, which are independent of time and have one target value per row. The current implementation of normal feature extraction splits time/date type columns into their component parts.
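
The component parts can be obtained with q's temporal casts; the toy lines below show a single time value split into the hour, minute and second components that appear as x_hh, x_uu and x_ss in the example that follows.

```q
// illustrative sketch: temporal casts extract the components of a time value
q)t:09:15:30.000
q)(`hh$t;`uu$t;`ss$t)
9 15 30i
```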

```q
q)5#tb:([]asc 100?0t;100?1.;100?1.;100?1.;100?100;100?10)
x            x1        x2        x3        x4 x5
------------------------------------------------
00:27:44.097 0.4370286 0.1024512 0.6908697 75 1
00:37:16.725 0.7071204 0.6892404 0.3227387 58 5
00:52:06.423 0.3053207 0.5529379 0.7147758 4  9
00:55:03.447 0.5192614 0.7592389 0.4201505 23 7
01:03:26.039 0.5691797 0.3437228 0.3924408 22 0
// no. of features before feature extraction
q)count cols tb
6
// Define the functions to be applied in normal feature creation
q)funcs:enlist[`funcs]!enlist .automl.prep.i.default
// apply normal feature extraction
q)show normfeat:flip first .automl.prep.normalcreate[tb;funcs]
x1        | 0.1312281  0.3667245  0.1415448  0.8186288  0.5973116 0.2413466  ..
x2        | 0.5068552  0.9532577  0.8768543  0.6690883  0.8108164 0.07458746 ..
x3        | 0.9967246  0.07262924 0.1245068  0.5506067  0.1330347 0.1533462  ..
x4        | 68         28         77         78         47        8          ..
x5        | 3          2          1          6          8         3          ..
x_hh      | 0          0          0          0          0         0          ..
x_uu      | 0          11         13         30         31        45         ..
x_ss      | 32         31         48         2          12        50         ..
// no. of features after feature extraction
q)count normfeat
8
```

The early-stage releases of this repository limit the feature extraction procedures that are performed by default on the tabular data for a number of reasons.

1. The naïve application of many relevant feature-extraction procedures (truncated singular-value decomposition, bulk transforms), while potentially informative, can expand memory usage and computation time beyond an acceptable level.

2. Procedures applied in one field of use may not be relevant in another. As such, the framework allows a user to complete domain-specific feature extraction if required, rather than assuming that the procedures to be applied are ubiquitous in all cases.

Over time, the system will be updated to perform these tasks in a way that is cognizant of the above limitations, applying general frameworks where they can be assumed to be informative.

## Feature selection

In order to reduce dimensionality in the data following feature extraction, significance tests are performed by default using the FRESH feature significance function contained within the ML Toolkit. When default parameters are used, the top 25th percentile of features are selected. The regression example below demonstrates the steps required to extract significant features within the AutoML pipeline.

```q
q)5#tb:([]100?1f;100?1f;100?1f;100?0x;100?(5?1f;5?1f);100?`A`B`C;100?10)
x         x1         x2         x3 x4                                                 x5 x6
-------------------------------------------------------------------------------------------
0.7856908 0.6066074  0.03652362 00 0.7456247 0.1210367  0.8790503 0.4303547 0.8694497 A  4
0.634117  0.7527148  0.3851139  00 0.841519  0.04855508 0.4478229 0.2979644 0.4757531 A  3
0.837205  0.938271   0.8312463  00 0.7456247 0.1210367  0.8790503 0.4303547 0.8694497 B  1
0.9516154 0.06719988 0.2472061  00 0.841519  0.04855508 0.4478229 0.2979644 0.4757531 B  0
0.7042753 0.8352592  0.9241053  00 0.841519  0.04855508 0.4478229 0.2979644 0.4757531 A  7
// regression example
q)tgt:100?1f
...
// preprocessing steps taken in automl pipeline
...
// features created
q)5#tb
x         x1         x2         x6 x5_A x5_B x5_C xx1_trsvd xx2_trsvd xx6_trsvd  xx5_A_trsvd..
--------------------------------------------------------------------------------------------..
0.7856908 0.6066074  0.03652362 4  1    0    0    0.9847321 0.5715507 4.051885   1.265838   ..
0.634117  0.7527148  0.3851139  3  1    0    0    0.9804841 0.7173443 3.042683   1.161163   ..
0.837205  0.938271   0.8312463  1  0    1    0    1.255319  1.179496  1.06683    0.5781603  ..
0.9516154 0.06719988 0.2472061  0  0    1    0    0.7215467 0.8383892 0.07998563 0.6571703  ..
0.7042753 0.8352592  0.9241053  7  1    0    0    1.088445  1.154103  7.034425   1.209613   ..
q)count cols tb
28
// select top 25th percentile of extracted features
q)show feats:.automl.prep.freshsignificance[tb;tgt]
`xx1_trsvd`xx5_A_trsvd`x1x5_A_trsvd`x2x5_A_trsvd`x5_Ax5_B_trsvd`x5_Ax5_C_trsvd`x5_A
q)count feats
7
```

In the example above, the columns x3 and x4 were first removed due to type restrictions. Following this, 28 features were created using normal feature extraction, of which 7 were selected for model training by the FRESH significance tests.
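
The type-based removal can be sketched in plain q; this is a toy illustration of the idea for a raw input table tab, not the library's internal code:

```q
// illustrative sketch: keep only simple, non-byte columns; in meta, byte
// columns show type "x", while nested columns show an uppercase or blank type
q)keep:exec c from meta tab where t in .Q.a except "x"
q)tab:keep#tab
```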

## Train-test split

Once the most significant features have been selected, the data is split into a training and testing set. The testing set is required later in the process for optimizing the best model.

In order to split the data, a number of train-test split procedures are implemented for the different problem types:

| problem type | function | description |
| --- | --- | --- |
| Normal | .ml.traintestsplit | Shuffle the dataset and split it into training and testing sets, with a defined percentage in each |
| FRESH | .ml.ttsnonshuff | Split the dataset into training and testing sets, with a defined percentage in each, without shuffling, to ensure no time leakage |

The example below shows .ml.traintestsplit being used within the automated pipeline.

```q
q)5#tb:([]100?1f;100?1f;100?1f;100?0x;100?(5?1f;5?1f))
x         x1        x2         x3 x4                                         ..
-----------------------------------------------------------------------------..
0.8477153 0.3496333 0.01611913 00 0.8200469 0.9857311 0.4629496 0.8518719 0.9..
0.927854  0.6202434 0.1841156  00 0.8200469 0.9857311 0.4629496 0.8518719 0.9..
0.8390122 0.8223493 0.4659199  00 0.8200469 0.9857311 0.4629496 0.8518719 0.9..
0.681369  0.5082821 0.3996036  00 0.8200469 0.9857311 0.4629496 0.8518719 0.9..
0.8917179 0.7072718 0.970522   00 0.8200469 0.9857311 0.4629496 0.8518719 0.9..
// classification problem
q)tgt:100?10?`1
// split into training and testing sets
q)show tts:.ml.traintestsplit[tb;tgt;.2]
xtrain| +`x`x1`x2`x3`x4!(0.6646666 0.7103449 0.2865525 0.08858732 0.4145759 0..
ytrain| `n`i`e`b`o`b`o`k`c`k`e`e`o`e`k`d`c`p`d`n`b`e`c`o`e`p`c`o`o`d`p`k`e`k..
xtest | +`x`x1`x2`x3`x4!(0.9274054 0.9187341 0.927854 0.9512318 0.9469828 0.5..
ytest | `n`d`b`d`e`e`c`c`i`b`b`c`o`k`c`k`k`c`c`d
// count features and targets in each set
q)count each tts
xtrain| 80
ytrain| 80
xtest | 20
ytest | 20
```

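For FRESH data, the non-shuffled function from the table above is used instead, so that training data always precedes testing data in time. Assuming the same argument convention as .ml.traintestsplit, a minimal sketch is:

```q
// minimal sketch: non-shuffled 80/20 split, preserving temporal order
q)tts:.ml.ttsnonshuff[tb;tgt;.2]
```
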
For classification problems such as the one above, it cannot be guaranteed that all of the distinct target classes will appear in both the training and testing sets. This is an issue for the Keras neural-network models, which require that a sample from each target class is present in both splits of the data.

For this reason, the utility function .automl.i.kerascheck is applied to the data split prior to model training. The function determines whether all classes are present in each split of the data. If they are not, the Keras models are removed from the list of models to be tested.
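
The idea behind this check can be sketched in a couple of lines of q; this is an illustration of the logic rather than the library's implementation:

```q
// illustrative sketch: do both splits of the target contain the same classes?
q)sameclasses:{(asc distinct x)~asc distinct y}
q)sameclasses[tts`ytrain;tts`ytest]
```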

The output below is to be expected from .automl.run when not all classes are present in each split of the data.

```q
q).automl.run[tb;tgt;`normal;`class;(::)]
...
Test set does not contain examples of each class. Removed MultiKeras from models
...
```

At this stage it is possible to begin model training and validation using cross-validation procedures. To do so, the data which is currently in the training set must be further split into training and validation sets. Following the same process as before, the Keras check must be applied again if Keras models are still present.

## Cross validation

Cross-validation procedures are commonly used in machine learning as a means of testing how robust or stable a model is to changes in the volume of data or the specific subsets of data used for validation. The technique ensures that the best model selected at the end of the process is the one that best generalizes to new unseen data.

The cross-validation techniques implemented in the automated machine-learning platform are those contained within the ML Toolkit. The specific type of cross validation applied to the training data in each case is a shuffled 5-fold cross-validation procedure.

For clarity, the splitting of data up to this point, as implemented in the default configuration, is as follows: 20% of the overall dataset is used as the testing set, 20% of the remaining data is used as the validation set, and the remaining data is split into 5 folds for cross validation.
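
For a 100-row dataset, these default proportions work out as follows:

```q
// split sizes for a 100-row dataset under the default configuration
q)n:100
q)ntest:"j"$.2*n         / 20 rows held out for final testing
q)nval:"j"$.2*n-ntest    / 16 rows held out for validation
q)n-ntest+nval           / 64 rows remain for 5-fold cross validation
64
```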

## Model selection

Once the relevant models have undergone cross validation using the training data, scores are calculated using the relevant scoring metric for the problem type (classification or regression). The file scoring.txt, provided within the platform, specifies the possible scoring metrics and how results should be ordered (ascending/descending) for each metric, ensuring that the best model is returned. This file can be altered by the user if necessary, either to expand the number of metrics available or to add custom metrics. The default metric for classification problems is accuracy, while mean squared error (MSE) is used for regression. An extensive list of example metric functions implemented in kdb+ can be found in the ML Toolkit.
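
Both default metrics are implemented in the ML Toolkit; the toy sketch below applies them to random values purely to show the calling convention.

```q
// toy usage of the default scoring metrics (values are random placeholders)
q)p:100?1f; y:100?1f
q).ml.mse[p;y]          / regression default: mean squared error
q)pc:100?3; yc:100?3
q).ml.accuracy[pc;yc]   / classification default: accuracy
```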

Below is an example of the expected output for a regression problem following cross validation. All models are scored using the default regression scoring metric, with the best scoring model selected and used to make predictions on the validation set.

```q
q)5#tb:([]100?1f;100?1f;100?1f;100?0x;100?(5?1f;5?1f);100?`A`B`C;100?10)
x          x1         x2        x3 x4                                        ..
-----------------------------------------------------------------------------..
0.06119115 0.1102437  0.9265934 00 0.002184472 0.06670537 0.6918339 0.4301331..
0.5510458  0.08516535 0.307924  00 0.002184472 0.06670537 0.6918339 0.4301331..
0.456294   0.3164178  0.1045391 00 0.002184472 0.06670537 0.6918339 0.4301331..
0.6581265  0.2706949  0.8376952 00 0.002184472 0.06670537 0.6918339 0.4301331..
0.9841029  0.3510406  0.7041609 00 0.002184472 0.06670537 0.6918339 0.4301331..
// regression example
q)tgt:100?1f
q).automl.run[tb;tgt;`normal;`reg;(::)]
...
Scores for all models, using .ml.mse
LinearRegression         | 0.08938869
KNeighborsRegressor      | 0.09325071
Lasso                    | 0.09390912
RandomForestRegressor    | 0.1206891
MLPRegressor             | 0.1398568
regkeras                 | 0.3604642

Best scoring model = LinearRegression
Score for validation predictions using best model = 0.1263564
```

## Optimization

In order to optimize the best model, grid-search procedures are implemented. As with the cross-validation methods above, this grid-search functionality is sourced from the ML Toolkit.

In the default configuration, grid search is applied to the best model using the combined training and validation data. The parameters changed for each model in the default configuration are those listed below. The specific values for each parameter are contained within the flat file hyperparam.txt and can be altered by the user if required.

Models and default grid-search hyperparameters:

| model | hyperparameters |
| --- | --- |
| Gradient Boosting Regressor | criterion, learning_rate, loss |
| KNeighbors Regressor | n_neighbors, weights |
| Lasso | alpha, max_iter, normalize, tol |
| MLP Regressor | activation, alpha, learning_rate_init, solver |
| Random Forest Regressor | criterion, min_samples_leaf, n_estimators |
| Gradient Boosting Classifier | criterion, learning_rate, loss, n_estimators |
| KNeighbors Classifier | leaf_size, metric, n_neighbors |
| Linear SVC | C, tol |
| Logistic Regression | C, penalty, tol |
| MLP Classifier | activation, alpha, learning_rate_init, solver |
| Random Forest Classifier | criterion, min_samples_leaf, min_samples_split |
| SVC | C, degree, tol |

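As an illustration of what such a grid looks like in q, the dictionary below sketches a possible search space for the Random Forest Regressor. The parameter names come from the table above; the values are invented examples, not the defaults shipped in hyperparam.txt.

```q
// hypothetical hyperparameter grid for the random-forest regressor
// (example values only, not the shipped defaults)
q)hp:`criterion`min_samples_leaf`n_estimators!(("mse";"mae");1 2 5;10 50 100)
```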

Once the grid search has been performed, the optimized model is tested using the testing set, with the final score returned and the best model saved down. A normal classification example is shown below.

```q
q)5#tb:([]100?1f;100?1f;100?1f;100?0x;100?(5?1f;5?1f);100?`A`B`C;100?10)
x         x1         x2         x3 x4                                        ..
-----------------------------------------------------------------------------..
0.734461  0.3435291  0.6891676  00 0.8372026 0.4100884 0.7181567 0.7001034 0...
0.9741231 0.8266936  0.1658819  00 0.8372026 0.4100884 0.7181567 0.7001034 0...
0.6970151 0.1212986  0.7434879  00 0.8372026 0.4100884 0.7181567 0.7001034 0...
0.1726593 0.5145226  0.9719498  00 0.8372026 0.4100884 0.7181567 0.7001034 0...
0.2092798 0.02959764 0.03549935 00 0.3442338 0.1319948 0.6779861 0.2621923 0...
// classification example
q)tgt:100?5
q).automl.run[tb;tgt;`normal;`class;::]
...
```