Stochastic Gradient Descent (SGD) is an optimization technique that iteratively updates the parameters of machine learning models which can be fitted by following the gradient of a cost function.

Gradient descent attempts to optimize the parameters of a function by moving in the direction of steepest descent, defined by the negative of the gradient of a cost function. It is used when the optimal parameters cannot be found analytically by setting the function's slope to 0, as can be done, for example, in linear regression.

Updates to the parameters at each iteration are defined by the following formula:

$\theta_{upd} = \theta_{old} - \alpha \cdot \frac{\partial J}{\partial \theta}$

Where:

| Parameter | Description |
|---|---|
| $\theta$ | Parameters/weights of a function |
| $\alpha$ | The learning rate |
| $J$ | The cost/loss function |
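The update rule can be sketched in plain q as follows. This is a minimal illustration rather than the ML Toolkit implementation: the helper `gdStep`, the two-parameter linear model and the mean squared error cost are all assumptions made for the example.

```q
// Hypothetical helper (not part of the ML Toolkit): one gradient descent
// update for the model y ~ theta[0] + theta[1]*x, with
// cost J = 0.5 * avg (prediction - y) xexp 2
gdStep:{[alpha;x;y;theta]
  err:(theta[0]+theta[1]*x)-y;   // prediction error for every data point
  grad:(avg err;avg err*x);      // partial derivatives of J w.r.t. theta
  theta-alpha*grad}              // theta_upd = theta_old - alpha * dJ/dtheta

// Noise-free data generated from y = 4 + 3x
x:0.1*til 50
y:4+3*x

// Apply 5000 updates starting from zero weights
theta:gdStep[0.05;x;y]/[5000;0 0f]
```

Because the data here contain no noise, `theta` should approach the generating values of 4 and 3. Every update uses all 50 points.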

When using gradient descent as described above, the entire dataset is used at each iteration to calculate the gradient of the cost function. This can require a large number of computations and may significantly increase the time taken to train a model.

To offset the cost of these calculations, SGD can be used. At each iteration a single random point, or a batch of random points, is chosen from the dataset to update the parameters of the function. This decreases the training time and can reduce the risk of overfitting when the model is applied to new, unseen data. Additionally, because the model is updated iteratively, models which support SGD can be trained on data too large to fit in memory or updated on streaming data.
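The difference from full gradient descent can be sketched in plain q. This is an illustrative assumption, not the ML Toolkit implementation: the helper `sgdStep`, the two-parameter linear model and the choice to sample `k` points without replacement are made up for the example.

```q
// Hypothetical helper (not part of the ML Toolkit): one SGD update using
// k data points sampled at random, without replacement
sgdStep:{[alpha;k;x;y;theta]
  i:neg[k]?count x;                 // indices of k distinct random points
  err:(theta[0]+theta[1]*x i)-y i;  // errors on the sampled points only
  theta-alpha*(avg err;avg err*x i)}

// Noise-free data generated from y = 4 + 3x
x:0.1*til 50
y:4+3*x

// 5000 updates, each computed from 10 of the 50 points
theta:sgdStep[0.05;10;x;y]/[5000;0 0f]
```

Each update touches only `k` rows of the data, which is why the same approach can be applied to data arriving in chunks.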

SGD is often employed in optimizing neural networks, along with other machine learning algorithms such as logistic/linear regression models and SVMs.

SGD forms the basis of the Linear Regression and Logistic Classification models provided in this section. The function outlined below underlies each of these models and is provided so that users familiar with the technique can use .ml.online.sgd.fit as the basis for their own solutions.

.ml.online.sgd.fit

Fit a model using stochastic gradient descent

.ml.online.sgd.fit[X;y;trend;theta;gradFunc;paramDict]

Parameters:

| name | type | description |
|---|---|---|
| X | any | Input/training data of N dimensions. |
| y | any | Output/target regression data. |
| trend | boolean | Is a trend to be accounted for? |
| theta | float[] | The initial weight(s). |
| gradFunc | function | Gradient function to be applied. |
| paramDict | dictionary | Any modifications to be applied during the fitting process of SGD (see Configurable parameters below). |

Returns:

A dictionary containing the information collected during the fitting of a model, including:

| name | description |
|---|---|
| theta | The weights calculated during the process. |
| iter | The number of iterations applied during the process. |
| diff | The difference between the final theta values and the preceding values. |
| trend | Whether or not a trend value was fitted during the process. |
| paramDict | The parameter dictionary used during the process. |
| inputType | The data type of each column of the input data. |

Examples:

Example 1: Fit the default linear gradient function

// Create data with a strong but noisy correlation
q)X:8*100?1f
q)y:4+3*X+100?1f

// Use the default linear gradient function

// Set the initial model weights to 0
q)theta:enlist 0f

// Set the maximum iteration and alpha values
q)dict:`maxIter`alpha!(50;0.01)
q)show paramDict:.ml.online.sgd.util.updDefault[dict;X;0b]
alpha      | 0.01
maxIter    | 50
gTol       | 1e-05
theta      | ,0f
k          | 100
seed       | 38576953
batchType  | `shuffle
...

         | theta    iter diff trend paramDict                                ..
---------| ------------------------------------------------------------------..
modelInfo| 4.152746 2    0    0     `alpha`maxIter`gTol`theta`k`seed`batchTyp..
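One way to sanity-check a weight fitted without a trend is to compare it with the least-squares value, which for a single weight can be found directly by setting the slope of the squared-error cost to 0. The sketch below is plain q and does not call the ML Toolkit; the name `thetaLS` is hypothetical. A weight fitted by SGD need not match it exactly, since the fit is affected by the penalty term, the learning rate and the number of iterations.

```q
// Data generated as in Example 1
X:8*100?1f
y:4+3*X+100?1f

// Least-squares weight for the no-trend model y ~ theta*X,
// obtained by setting the slope of the squared-error cost to 0
thetaLS:(sum X*y)%sum X*X
```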

Configurable parameters

In the above function, the following are the optional configurable entries for paramDict:

| name | type | description | default |
|---|---|---|---|
| alpha | float | Applied learning rate. | 0.01 |
| maxIter | integer | Maximum number of iterations before the run is terminated. Reaching this limit does not guarantee convergence. | 100 |
| gTol | float | If the difference in gradient falls below this value the run is terminated. | 1e-5 |
| theta | float | Initial starting weights. | 0 |
| k | integer | Number of batches used or random points chosen each iteration. | *n |
| seed | integer | Random seed. | random |
| batchType | symbol | Batch type: single, shuffle, shuffleRep, nonShuffle or noBatch. | shuffle |
| penalty | symbol | Penalty/regularization term: l1, l2 or elasticNet. | l2 |
| lambda | float | Penalty term coefficient. | 0.001 |
| l1Ratio | float | Elastic net mixing parameter, only used if the penalty type is elasticNet. | 0.5 |
| decay | float | Decay coefficient. | 0 |
| p | float | Momentum coefficient. | 0 |
| verbose | boolean | Whether information about the fitting process is printed after every epoch. | 0b |
| accumulation | boolean | Whether the theta value after each epoch is returned as the output. | 0b |
| thresholdFunc | list | Threshold function and value to apply when using updateSecure. | () |

In the above table *n is the length of the dataset.
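As a sketch of how these entries might be overridden, the dictionary below sets a handful of the keys from the table. The values are purely illustrative, not recommendations. The commented-out call mirrors the use of .ml.online.sgd.util.updDefault in Example 1 and assumes the ML Toolkit is loaded and `X` is defined.

```q
// Override a subset of the configurable entries; values are illustrative only
dict:`alpha`maxIter`batchType`k`penalty`lambda!(0.05;200;`shuffleRep;10;`l1;0.01)

// As in Example 1, the remaining entries can then be filled with defaults
// (requires the ML Toolkit, so it is left commented out here)
// paramDict:.ml.online.sgd.util.updDefault[dict;X;0b]
```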

Several batch types can be applied when fitting a model using SGD. The supported types, and how each uses the k parameter, are described below:

options:

| name | description |
|---|---|
| noBatch | No batching occurs and all data points are used (regular gradient descent). |
| nonShuffle | Data split into k batches with no shuffling applied. |
| shuffle | Data shuffled into k batches. Each data point appears once. |
| shuffleRep | Data shuffled into k batches. Data points can appear more than once and not all data points may be used. |
| single | k random points are chosen each iteration. |
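The batch types can be pictured as different ways of building lists of row indices. The sketch below is plain q and is an assumption about one reasonable reading of the descriptions above, not the ML Toolkit implementation. In particular, whether `single` samples with or without replacement is not stated in the source.

```q
// Hypothetical illustration (not the ML Toolkit implementation): index
// batches for n data points under each batchType, given the parameter k
n:12
k:3

noBatch:enlist til n          // one batch holding every index
nonShuffle:(k;0N)#til n       // k contiguous batches in the original order
shuffle:(k;0N)#neg[n]?n       // k batches drawn from a random permutation
shuffleRep:(k;0N)#n?n         // k batches sampled with replacement
single:enlist neg[k]?n        // k random indices (assumed distinct here)
```

With `n` divisible by `k`, each of `nonShuffle`, `shuffle` and `shuffleRep` holds `k` batches of equal size. Only `shuffleRep` can repeat an index or leave one out.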