Stochastic Gradient Descent
Stochastic Gradient Descent (SGD) is an optimization technique used to iteratively update the parameters of suitable machine learning models.
Gradient descent attempts to optimize the parameters of a function by traversing in the direction of steepest descent, as defined by the negative of the gradient of a cost function. This technique is used when the optimal parameters of a solution cannot be found by setting the function's slope to 0, for example in linear regression.
Updates to the parameters at each iteration are defined by the following formula:

\(\theta := \theta - \alpha \nabla_\theta J(\theta)\)
Where:
Parameter | Description |
---|---|
\(\theta\) | Parameters/Weights of a function |
\(\alpha\) | The learning rate |
\(J\) | The cost/loss function |
When using gradient descent as described above, the entire dataset is used at each iteration to calculate the gradient update. This can result in a significant number of computations, which may have a major impact on the time taken to train a model.
To offset the cost of these calculations, SGD can be used. At each iteration a single random point, or a batch of random points, is chosen from the dataset to update the parameters of the function. This decreases the training time and reduces the risk of overfitting when the model is applied to new, unseen data. Additionally, because the model can be updated iteratively, models which support SGD can be trained on data too large to fit in memory, or updated on streaming data.
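To illustrate the idea, the following is a minimal sketch of a single mini-batch SGD update for a one-feature linear model with no intercept. It is illustrative only and is not the implementation used by this library; the name sgdStep is hypothetical.
// Hypothetical helper applying one mini-batch SGD update to a
// single-feature, no-intercept linear model (illustration only)
q)sgdStep:{[X;y;th;a;k]i:neg[k]?count X;th-a*avg(X i)*(th[0]*X i)-y i}
// Noisy linear data with slope 3
q)X:100?1f
q)y:3*X+100?0.1
// Repeatedly applying the update moves the weight towards 3
q)th:enlist 0f
q)do[200;th:sgdStep[X;y;th;0.1;10]]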
SGD is often employed in optimizing neural networks, along with other machine learning algorithms such as logistic/linear regression models, SVMs, etc.
SGD forms the basis of the Linear Regression and Logistic Classification models provided in this section. The model outlined below underpins each of these models and is provided to allow users familiar with this technique to use the .ml.online.sgd.fit
function as the starting point for their own solutions.
.ml.online.sgd.fit
Fit a model using stochastic gradient descent
.ml.online.sgd.fit[X;y;trend;theta;gradFunc;paramDict]
Parameters:
name | type | description |
---|---|---|
X | any | Input/training data of N dimensions. |
y | any | Output/target regression data. |
trend | boolean | Whether a trend is to be accounted for. |
theta | float[] | The initial weight(s). |
gradFunc | function | Gradient function to be applied. |
paramDict | dictionary | Any modifications to be applied during the fitting process of SGD (see Configurable parameters below for more details). |
Returns:
Returns a dictionary containing all information collected during the fitting of a model, including:
name | description |
---|---|
theta | The weights calculated during the process. |
iter | The number of iterations applied during the process. |
diff | The difference between the final theta values and the preceding values. |
trend | Whether or not a trend value was fitted during the process. |
paramDict | The parameter dictionary used during the process. |
inputType | The data type of each column of the input data. |
Examples:
Example 1: Fit the default linear gradient function
// Create data with a strong but noisy correlation
q)X:8*100?1f
q)y:4+3*X+100?1f
// Use the default linear gradient function
q)gradFunc:.ml.online.sgd.util.gradFuncLinear
// Set the initial model weights to 0
q)theta:enlist 0f
// Set the maximum iteration and alpha values
q)dict:`maxIter`alpha!(50;0.01)
q)show paramDict:.ml.online.sgd.util.updDefault[dict;X;0b]
alpha | 0.01
maxIter | 50
gTol | 1e-05
theta | ,0f
k | 100
seed | 38576953
batchType | `shuffle
gradArgs | ()
...
q).ml.online.sgd.fit[X;y;0b;theta;gradFunc;paramDict]
| theta iter diff trend paramDict ..
---------| ------------------------------------------------------------------..
modelInfo| 4.152746 2 0 0 `alpha`maxIter`gTol`theta`k`seed`batchTyp.
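The fitted weights and other information can then be retrieved by indexing into the returned dictionary. The following is a sketch assuming the structure shown in the output above, where all fit information is stored under the modelInfo key:
// Index into the returned dictionary to inspect the fit
// (assumes the modelInfo structure shown in the output above)
q)model:.ml.online.sgd.fit[X;y;0b;theta;gradFunc;paramDict]
q)model[`modelInfo;`theta]
q)model[`modelInfo;`iter]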
Configurable parameters
In the above function, the following are the optional configurable entries for paramDict:
name | type | default | description |
---|---|---|---|
alpha | float | 0.01 | Applied learning rate. |
maxIter | integer | 100 | Maximum possible number of iterations before the run is terminated; this does not guarantee convergence. |
gTol | float | 1e-5 | If the difference in gradient falls below this value, the run is terminated. |
theta | float | 0 | Initial starting weights. |
k | integer | *n | Number of batches used, or number of random points chosen, each iteration. |
seed | integer | random | Random seed. |
batchType | symbol | shuffle | Batch type - `single`shuffle`shuffleRep`nonShuffle`noBatch. |
penalty | symbol | l2 | Penalty/regularization term - `l1`l2`elasticNet. |
lambda | float | 0.001 | Penalty term coefficient. |
l1Ratio | float | 0.5 | Elastic net mixing parameter, only used if the penalty type is elasticNet. |
decay | float | 0 | Decay coefficient. |
p | float | 0 | Momentum coefficient. |
verbose | boolean | 0b | Whether information about the fitting process is printed after every epoch. |
accumulation | boolean | 0b | Whether the theta value after each epoch is returned as part of the output. |
thresholdFunc | list | () | Threshold function and value to apply when using updateSecure. |
In the above table, *n is the length of the dataset.
A number of batchTypes can be applied when fitting a model using SGD. The supported types, and how each makes use of the k parameter, are explained below, followed by a short usage sketch:
options:
name | description |
---|---|
noBatch | No batching occurs and all data points are used (regular gradient descent). |
nonShuffle | Data split into k batches with no shuffling applied. |
shuffle | Data shuffled into k batches. Each data point appears once. |
shuffleRep | Data shuffled into k batches. Data points can appear more than once and not all data points may be used. |
single | k random points are chosen each iteration. |
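As a short usage sketch, a specific batch type can be requested through paramDict before fitting. This reuses X, y, theta and gradFunc from Example 1 above; the values chosen here are arbitrary and for illustration only.
// Request 5 non-shuffled batches instead of the default shuffled batching
q)dict:`maxIter`alpha`batchType`k!(50;0.01;`nonShuffle;5)
q)paramDict:.ml.online.sgd.util.updDefault[dict;X;0b]
q).ml.online.sgd.fit[X;y;0b;theta;gradFunc;paramDict]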