# Stochastic Gradient Descent

*An optimization technique used in the iterative updating of appropriately formatted machine-learning algorithms*

Gradient descent attempts to optimize the parameters of a function by traversing in the direction of steepest descent, as defined by the negative of the gradient of a cost function. This technique is used when the optimal parameters of a solution cannot be found directly by setting the function's slope to 0, for example in linear regression.

Updates to the parameters at each iteration are defined by the following formula:

\[\theta := \theta - \alpha \nabla_\theta J(\theta)\]

Where:

- \(\theta\) are the parameters/weights of the function
- \(\alpha\) is the learning rate
- \(J\) is the cost/loss function

When using gradient descent as described above, the entire dataset is used at each iteration when calculating the gradient. This can lead to a significant amount of computation, with a major impact on the training time of the model.
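The full-batch update described above can be sketched in q. This is an illustrative example only, not the library implementation; the data, learning rate, and gradient formula here are assumptions for a single-parameter linear model without an intercept:

```
// Sketch: one full-batch gradient-descent update for y ~ theta*X,
// using the mean-squared-error gradient; every data point contributes
q)X:8*100?1f               / 100 random inputs
q)y:3*X                    / targets with true weight 3
q)theta:0f                 / initial weight
q)alpha:0.01               / learning rate
q)grad:avg X*(theta*X)-y   / gradient of the MSE cost over ALL points
q)theta:theta-alpha*grad   / move against the gradient
```

Repeating the last two lines moves `theta` toward the true weight, but each repetition touches all 100 points.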

To offset the cost of these calculations, SGD can be used. At each iteration a single random point, or a batch of random points, is chosen from the dataset to update the parameters of the function. This decreases the training time and also reduces the risk of overfitting when the model is applied to new unseen data. Additionally, because the model can be updated iteratively, models which support SGD can be trained on data too large to fit in memory, or updated on streaming data.
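In equation form, each iteration replaces the full-dataset gradient with the gradient over a randomly chosen batch \(B\) (this is the standard formulation of the technique, not library-specific notation):

\[\theta := \theta - \alpha \nabla_\theta J\big(\theta; \{(x_i, y_i)\}_{i \in B}\big)\]

When \(B\) contains a single point this is classic SGD; larger batches give mini-batch SGD.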

SGD is often used in optimizing neural networks along with other machine learning algorithms such as logistic/linear regression models, SVMs etc.

SGD forms the basis of the Linear Regression and Logistic Classification models in this section. The model outlined below underpins each of these models and is provided to allow users familiar with this technique to use the `.ml.online.sgd.fit` function as the basis for their own solutions.

## `.ml.online.sgd.fit`

*Fit a model using stochastic gradient descent*

```
.ml.online.sgd.fit[X;y;trend;theta;gradFunc;paramDict]
```

Where

- `X` is the input/training data of \(n\) dimensions
- `y` is the output/target regression data
- `trend` is whether a trend is to be accounted for (boolean)
- `theta` is the initial weight(s)
- `gradFunc` is the gradient function to be applied
- `paramDict` is the configurable dictionary defining any modifications to be applied during the fitting process of SGD

returns a dictionary of information collected during the fitting of a model:

```
theta | weights calculated during the process
iter | number of iterations applied during the process
diff | difference between the final theta values and the preceding values
trend | whether or not a trend value was fitted during the process
paramDict | parameter dictionary used during the process
inputType | data type of each column of the input data
```

```
// Create data with a strong but noisy correlation
q)X:8*100?1f
q)y:4+3*X+100?1f
// Use the default linear gradient function
q)gradFunc:.ml.online.sgd.i.gradFuncLinear
// Set the initial model weights to 0
q)theta:enlist 0f
// Set the maximum iteration and alpha values
q)dict:`maxIter`alpha!(50;0.01)
q)show paramDict:.ml.online.sgd.i.updDefault[dict;X;0b]
alpha | 0.01
maxIter | 50
gTol | 1e-05
theta | ,0f
k | 100
seed | 38576953
batchType | `shuffle
gradArgs | ()
...
q).ml.online.sgd.fit[X;y;0b;theta;gradFunc;paramDict]
| theta iter diff trend paramDict ..
---------| ------------------------------------------------------------------..
modelInfo| 4.152746 2 0 0 `alpha`maxIter`gTol`theta`k`seed`batchTyp.
```

## Configurable parameters

In the above function, the following are the optional configurable entries for `paramDict`:

| Key | Type | Default | Description |
|---|---|---|---|
| alpha | float | 0.01 | The learning rate applied |
| maxIter | integer | 100 | The maximum number of iterations before the run is terminated; this does not guarantee convergence |
| gTol | float | 1e-5 | If the difference in gradient falls below this value, the run is terminated |
| theta | float | 0 | The initial starting weights |
| k | integer | *n | The number of batches used, or random points chosen, each iteration |
| seed | integer | random | The random seed |
| batchType | symbol | shuffle | The batch type (`` `single`shuffle`shuffleRep`nonShuffle`noBatch ``) |
| penalty | symbol | l2 | The penalty/regularization term (`` `l1`l2`elasticNet ``) |
| lambda | float | 0.001 | The penalty-term coefficient |
| l1Ratio | float | 0.5 | The elastic-net mixing parameter (only used if penalty `` `elasticNet `` is applied) |
| decay | float | 0 | The decay coefficient |
| p | float | 0 | The momentum coefficient |
| verbose | boolean | 0b | Whether information about the fitting process is printed after every epoch |
| accumulation | boolean | 0b | Whether the theta value after each epoch is returned as the output |
| thresholdFunc | list | () | The threshold function and value (optional) to apply when using `updateSecure` |

In the above table `*n` is the length of the dataset.
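For reference, the penalty and momentum parameters above are conventionally defined as follows; these are the standard formulations, and the library's exact implementation may differ. With penalty coefficient \(\lambda\) (`lambda`) and mixing parameter \(r\) (`l1Ratio`), the elastic-net penalty added to the cost is

\[\lambda \left( r \lVert \theta \rVert_1 + \tfrac{1-r}{2} \lVert \theta \rVert_2^2 \right)\]

which reduces to the \(l1\) penalty at \(r=1\) and the \(l2\) penalty at \(r=0\). With momentum coefficient \(p\) (`p`), the update typically maintains a velocity term:

\[v := p\,v - \alpha \nabla_\theta J(\theta), \qquad \theta := \theta + v\]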

A number of batch types (`batchType`) can be applied when fitting a model using SGD:

| option | description |
|---|---|
| noBatch | No batching occurs and all data points are used (regular gradient descent) |
| nonShuffle | The data is split into `k` batches with no shuffling applied |
| shuffle | The data is shuffled into `k` batches; each data point appears once |
| shuffleRep | The data is shuffled into `k` batches; data points can appear more than once and not all data points may be used |
| single | `k` random points are chosen each iteration |
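The batching schemes above can be illustrated by how indices into an \(n\)-point dataset might be generated. This is a sketch of the idea only, not the library's internal code; here \(n=10\) and the batch size is 5:

```
q)n:10
// nonShuffle: contiguous batches in the original order
q)5 cut til n
// shuffle: a random permutation split into batches; each point appears once
q)5 cut neg[n]?n
// shuffleRep: sampling with replacement; points can repeat or be omitted
q)5 cut n?n
// single: one batch of random points per iteration
q)5?n
```

The distinction between `neg[n]?n` (deal: a permutation, no repeats) and `n?n` (roll: sampling with replacement) is what separates `shuffle` from `shuffleRep`.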