# Diving into stochastic gradient descent

Stochastic Gradient Descent (let’s call it SGD from now on, because that feels a bit less repetitive) is an online optimisation algorithm that’s received a lot of attention in recent years for several reasons (and is notable for its applications in optimising neural networks). One of the most attractive things about it is its scalability - because of how it differs from other gradient descent algorithms, SGD lends itself neatly to massively parallel architectures and streaming data. SGD is also a very capable online learning algorithm, too - so it’s quite a useful technique to know.

## A recap of Gradient Descent

First off, let’s start with a recap of our good ol’ buddy gradient descent
(GD). GD is an iterative optimisation algorithm that takes small steps towards
a minimum of some function (notice that I say *a* minimum instead of *the*
minimum - GD can get caught in local minima if the starting point is poorly
chosen or the function is highly nonconvex). It works by evaluating the output
of the objective function at a particular set of parameter values - and then
taking partial derivatives of the output with respect to each of the
parameters. Using these partial derivatives, it’s possible to determine which
parameters can be changed to minimise the value of the objective function.
Basically, the algorithm finds the direction that most minimises the
objective… and then takes a step in that direction. Rinse and repeat until
convergence!

### A quick peek under the bonnet…

Let’s jump into some of the maths. The GD algorithm’s objective is to minimise some function $J$ by determining the optimal values of some parameter set $\theta$. This is done iteratively, so a single iteration (a single update of the parameters $\theta$) would look something like this;

$\theta = \theta_{old} - \alpha \nabla_{\theta} E\left[ J(\theta; X, y) \right]$

The interesting thing to note here is that GD requires computing the *expected
value* of the function $J(\theta; X, y)$ over the whole dataset (i.e. $E\left[
J(\theta; X, y) \right]$). That is, the objective function is evaluated over
the entire dataset and then averaged - and this happens during each iteration
of the GD algorithm. This is fine for smaller optimisation tasks, but this can
run into significant problems at scale - evaluating a function over a whole
dataset can prove intractable.

### An example with Linear Regression

Right. Remember linear regression? Minimising mean squared error? In case your
memory is a little fuzzy (like mine), let’s look at the objective function for
a linear regression (with a grateful nod to Cosma Shalizi’s excellent lecture
notes on this topic^{1}).

In this case, our learning problem has $n$ examples and $m$ features. Let’s define some things up front, so that our objective function makes sense.

- Let $\underset{n \times m}{X}$ be our design matrix - with rows for each training example and columns for each feature (and an additional column for the bias term)
- Let $\underset{n \times 1}{y}$ be a vector holding our target variable - we want to predict $y$ given the data in our matrix $X$.
- Let $\underset{m \times 1}{\theta}$ be our parameter vector - an $m$-length vector containing the coefficients for each feature and the bias term.

In this case, the objective function over a dataset $X$ is defined as follows;

$J(\theta; X, y) = \frac{1}{2n} (y - X\theta)^{T}(y - X\theta)$

Note that sometimes machine learning practitioners divide this function by a factor of two (i.e. the $\frac{1}{2n}$ at the beginning) so that the derivative is a little bit neater. This isn’t strictly necessary and doesn’t change the overall shape of the function - it’s just a mathematical convenience.

To use gradient descent to optimise $J(\theta; X, y)$, we’d want to begin by setting $\theta$ to some arbitrary values. We then evaluate the function’s gradient at that point to get an idea of the slope of the local area. We then focus on taking a number of small steps “downhill” to update $\theta$ by nudging the weights in the direction that minimises the objective. By taking enough steps this way, we will eventually arrive at an optimum.

One thing to note here is that the linear regression objective function is quadratic (like the one pictured below). It is therefore convex and has a single global optimum — this isn’t always the case with more complex functions.

#### Figuring out which way is downhill

Let’s compute the derivative for this objective function - once we’ve done
that, we can start stepping down the hill. This involves a little bit of matrix
calculus, so I’ve left some extra steps in so that it’s clear what’s happening.
Both Cosma Shalizi’s lecture notes and the Matrix Cookbook^{2} are really
handy here.

$\begin{aligned} \frac{\partial}{\partial \theta} J(\theta; X, y) & = \frac{\partial}{\partial \theta} \frac{1}{2n} (y - X\theta)(y - X\theta) \\ & = \frac{1}{2n} \frac{\partial}{\partial \theta} \big( y^{T}y - 2\theta^{T}X^{T}y + \theta^{T}X^{T}X\theta \big) \\ & = \frac{1}{2n} \big(-2X^{T}y + \frac{\partial}{\partial \theta} \theta^{T}X^{T}X\theta) \big) \\ & = \frac{1}{2n} \big(-2X^{T}y + 2X^{T}X\theta \big) \\ & = \frac{1}{n} \big(-X^{T}y + X^{T}X\theta \big) \\ & = \frac{1}{n} \big(X^{T}X\theta - X^{T}y \big) \end{aligned}$

#### Putting it all together

Now that the derivative of this function is known, we can plop it directly into
the gradient descent algorithm. The objective function that we determined
earlier computes the *expected value* over the entire dataset, so we can simply
drop it in place. A single round of parameter updates would look like this:

$\theta = \theta - \alpha \big( \frac{1}{n} X^{T}X\theta - X^{T}y \big)$

I applied this parameter update to the Boston housing dataset^{3} and then ran
GD for 200 iterations with $\alpha = 0.05$. Below is a plot showing the overall
error on each iteration - you can see the error begin to converge to a stable
state after about 25 iterations;

Now that we know GD works on our linear regression problem, let’s head back over to SGD-land and start tinkering with that instead.

## Now… back to SGD!

Right. Now that we’ve recapped bog-standard gradient descent, let’s dive right
into the interesting stuff. Luckily, you now know everything you need to
understand SGD because it’s only a slight modification to the standard
algorithm. The update procedure for SGD is almost exactly the same as for
standard GD - except we don’t take the expected value of the objective function
over the entire dataset, we simply compute it for *a single row*^{4}. That is,
we just slice out a single row of our training data, and use that in a single
iteration of the SGD algorithm.

$\theta = \theta_{old} - \alpha \nabla_{\theta} J(\theta; X_{(i)}, y_{(i)})$

Of course, as we’re only computing the gradient based on a single example at a time, we’re going to need far more iterations to converge upon a sensible result. It’s quite a natural way to think about presenting data to SGD in terms of “epochs” - essentially presenting as many examples as were in the original dataset. So, in this case, 200 epochs is approximately equal to 200 passes of the gradient descent algorithm above.

After 200 epochs (i.e. seeing the *entire* dataset 200 times), the MSE
converges to a reasonably stable figure - similar to the unmodified gradient
descent algorithm above.

However - as you may expect, updating the parameter vector based on a *single
training example* is going to play havoc with the initial variance in the
training error (and the gradient of the parameter vector). This has some
significant implications for setting the learning rate - because the variance
is so high, a considerably smaller learning rate is required.

During the initial few iterations of SGD, both the parameter vector and the MSE will initially have far greater variance as the parameters are being updated directly with each training example. Over time, as the number of training examples analysed grows larger, the influence of each individual example diminishes and the parameters will start to converge to a steady state.

To demonstrate the high variance in SGD, here’s a comparison of the error rate
convergence during the first 50 iterations of standard GD (left) and SGD
(right) - note that each GD iteration analyses *the entire dataset*, while each
SGD iteration only analyses *a single example* (this is why standard GD seems
to converge much faster in this plot).

## Disadvantages and traps for SGD

### Slow convergence in ravines and valleys

A downside of SGD is that it can struggle on certain types of topographic features, such as functions in “ravines” - a long shallow path surrounded by steep walls. An unfortunate sequence of training examples can cause SGD to update the parameters $\theta$ such that the objective essentially “bounces” off the walls of the ravine - that is, it oscillates between the steep sections but never arrives at the local minimum in the middle.

Typically this is addressed by modifying the algorithm to take into account “momentum”. That is, once the parameters start to converge in a particular direction, a certain amount of “force” is required to overcome that momentum. This helps overcome some of the effects of noisy data by applying a kind of smoothing effect to the parameters, and effectively dampens the oscillations that can occur in ravines and other challenging topographical features.

The implementation of momentum is simple enough. We keep track of a vector $v$ which contains information about the algorithm’s current momentum - and we use a parameter $\gamma \in \[0, 1\)$ which controls the exponential weighting of past values in the momentum vector $v$. Then, instead of directly updating the parameters with the gradient vector, we update $\theta$ by subtracting our momentum-augmented vector $v$ instead.

$\begin{aligned} v &= \gamma v + \alpha \nabla_{\theta}J(\theta; X_{(i)}, y_{(i)}) \\ \theta &= \theta - v \end{aligned}$

### Choosing the correct learning rate

The sensitivity of SGD to individual training examples also means that it’s much more important to choosing a sensible learning rate, $\alpha$. Setting $\alpha$ too high means the algorithm won’t converge smoothly (or that it might oscillate between several sub-optimal states), and setting it too low means that it’ll take far longer to converge than necessary (if ever!).

There are several heuristic approaches for determining $\alpha$ that seem to be
relatively common in practice. One such example is Leon Bottou’s heuristic^{5}
(used in the SGD implementation in `sklearn`

), while several other learning
schedules are mentioned in the Stanford Deep Learning materials (this
includes approaches like reducing the learning rate as the algorithm begins to
converge, or using an optimisation technique like simulated annealing to
periodically choose a better learning rate).

### Scaling

Something that caught me out when working on my own implementation of this
algorithm was *poorly scaled input*. It turns out that one of the most
important things to do before using SGD is to *scale* your input data to
sensible values. Scaling to zero mean and unit variance (typically referred to
as “standardisation” in the machine learning literature) is as straightforward
as $X_{scaled} = \frac{(X - \bar{X})}{\sigma_X^{2}}$.

Using gradient descent algorithms with unscaled data can cause highly suboptimal results - in the best case scenario, it’ll just cause slow convergence; in the worst case, the gradient will become unmanageable due to numerical precision errors (i.e. the “exploding gradient” or “vanishing gradient”) - which stops the algorithm converging at all.

### Dataset ordering

Finally, the ordering of the dataset can bias the way that the gradients develop as SGD is exposed to more of the training data. If there’s some meaningful ordering to the data (e.g. think of horse racing results where race winners are listed in the first row for each race). If this happens over an appreciable number of training examples, then this can cause the algorithm to choose vastly sub-optimal parameter values. There are several ways around this, but the easiest solution is simply to shuffle the dataset before using SGD.

## The solution? Mini-batch gradient descent

Mini-batch gradient descent (MBGD)uses, surprisingly enough, small bundles of
training examples to optimise the parameters in batches. This is actually quite
a neat formulation, as it addresses some of the problems of *both* vanilla GD
and SGD outlined above. Firstly, performing optimisation in small batches of
training examples allows programmers to use a vectorised implementation of
gradient descent - which is considerably faster than the iterative approach in
SGD. However, MBGD is still very easy to parallelise as it will typically use a
relatively small batch size (e.g. it’s perfectly feasible to use a batch size
of 100 to 200 examples).

Secondly, because training happens in batches (even if they’re only small ones) this has the handy side-effect of substantially reducing the variance between iterative applications of the algorithm. Note that this won’t remove all of the variance between iterations - but it will certainly help. Again, a single parameter update with MBGD would look something like this;

$\theta = \theta_{old} - \alpha \frac{1}{b} \nabla_{\theta} J(\theta; X_{[i, i + b]}, y_{[i, i + b]})$

Note that the batch size here is denoted by $b$ - and the indices on $X$ and
$y$ simply indicate a *slice* of the dataset (i.e. a mini-batch).

## In summary

So, that’s it - SGD in a nutshell. We looked at gradient descent from first principles, and applied it to a standard least-squares linear regression problem (which we also derived from scratch). We then implemented stochastic gradient descent for comparison, and achieved similar results (but with some small trade-offs). We looked at some caveats and traps that you should keep in mind when implementing SGD yourself, and finally, we proposed mini-batch GD as a potential best-of-both-worlds compromise between vanilla GD and SGD. Check out the further reading below for more information and further reading. Until next time!