
Gradient Descent Optimisation Algorithms - raibosome
https://towardsdatascience.com/10-gradient-descent-optimisation-algorithms-86989510b5e9
======
Riverheart
For those unfamiliar with the concept, courtesy of Wikipedia:

[https://en.m.wikipedia.org/wiki/Gradient_descent](https://en.m.wikipedia.org/wiki/Gradient_descent)

The basic intuition behind gradient descent can be illustrated by a
hypothetical scenario. A person is stuck in the mountains and is trying to get
down (i.e. trying to find the minima). There is heavy fog such that visibility
is extremely low. Therefore, the path down the mountain is not visible, so he
must use local information to find the minima. He can use the method of
gradient descent, which involves looking at the steepness of the hill at his
current position, then proceeding in the direction with the steepest descent
(i.e. downhill). If he were trying to find the top of the mountain (i.e. the
maxima), then he would proceed in the direction of steepest ascent (i.e. uphill).
Using this method, he would eventually find his way down the mountain.
However, assume also that the steepness of the hill is not immediately obvious
with simple observation, but rather it requires a sophisticated instrument to
measure, which the person happens to have at the moment. It takes quite some
time to measure the steepness of the hill with the instrument, thus he should
minimize his use of the instrument if he wanted to get down the mountain
before sunset. The difficulty then is choosing the frequency at which he
should measure the steepness of the hill so not to go off track.
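The hiker's procedure maps directly onto the update rule: measure the local slope (the gradient), take a step downhill, repeat. A minimal sketch on a hypothetical bowl-shaped surface (a toy example, not from the article):

```python
# A minimal sketch of gradient descent on a hypothetical bowl-shaped
# surface f(x, y) = x**2 + 2*y**2 (a toy example, not from the article).

def grad(x, y):
    # Analytic gradient of f: the "steepness" the hiker measures.
    return 2 * x, 4 * y

x, y = 3.0, 2.0   # starting position on the mountain
lr = 0.1          # step size: how far to walk between measurements

for _ in range(100):
    gx, gy = grad(x, y)
    x -= lr * gx  # step in the direction of steepest descent
    y -= lr * gy

# x and y have decayed toward the minimum at (0, 0)
```

The step size plays the role of how often the hiker stops to measure: too large and he overshoots the valley, too small and he doesn't get down before sunset.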

~~~
raibosome
Sweet! Another intuition for gradient descent:

You regularly update your parameters using an educated guess, and that
educated guess is the gradient value.

------
raibosome
At the end of this post you will get a cheat sheet of the 10 common gradient
descent optimisation algorithms.

Using more readable notation, I will walk you through how vanilla stochastic
gradient descent slowly evolved into the popular Adam optimiser and others. I
also came up with an ‘evolutionary map’ of the optimisers to visualise this.

The motivation for writing this post is that there is a lack of simple-to-read
parameter-update equations and of a compiled list of these optimisers.

Hopefully this benefits the community.

------
usernomnomnom
This is very helpful! If I may make a shameless self-plug, this would be even
better as something that is dynamic and can be interactively played with. A
few years ago I made this IPython notebook for similar didactic purposes:
[https://github.com/turingbirds/gradient_descent/blob/master/...](https://github.com/turingbirds/gradient_descent/blob/master/gradient_descent.ipynb)

~~~
raibosome
This is great! I'd love to do something similar in JavaScript.

------
raibosome
If I may, I had also built a simple demo of linear regression using gradient
descent before writing this post:
[https://raiboso.me/backpropagation-demo/](https://raiboso.me/backpropagation-demo/)

This demo allows you to choose between four optimisers, and lets you track the
values of your variables during training.

Compare your runs with different optimisers using the graph at the bottom of
the page.

------
_jamesm_
It's always useful to see different SGD methods written with a consistent
nomenclature. A few thoughts:

1. Is the 1999 Qian paper on momentum really the most appropriate one, given
that its publication date postdates NAG? As even a cursory reading of the
paper reveals, momentum had been in use long before 1999!

2. Similarly, the original NAG paper isn't about stochastic gradient descent
and doesn't really use the equation as written. A more appropriate reference
is the Sutskever, Martens, Dahl and Hinton paper of 2013
[http://proceedings.mlr.press/v28/sutskever13.html](http://proceedings.mlr.press/v28/sutskever13.html)
which is the publication that described/reworked NAG in this way.

3. It's worth noting the caveats about AMSGrad:
[https://www.fast.ai/2018/07/02/adam-weight-decay/](https://www.fast.ai/2018/07/02/adam-weight-decay/)

~~~
raibosome
Thank you for pointing these out! I have made the necessary edits to the
citations for (1) and (2) and republished the article.

For (1), the paper by Sutskever et al., 2013
([http://proceedings.mlr.press/v28/sutskever13.pdf](http://proceedings.mlr.press/v28/sutskever13.pdf))
attributed classical momentum to Polyak, 1964
([https://www.researchgate.net/publication/243648538_Some_meth...](https://www.researchgate.net/publication/243648538_Some_methods_of_speeding_up_the_convergence_of_iteration_methods)).
A Distill article on momentum
([https://distill.pub/2017/momentum/](https://distill.pub/2017/momentum/))
also cited Polyak's paper, along with a much earlier 1959 publication by
Rutishauser
([https://doi.org/10.1007/978-3-0348-7224-9_2](https://doi.org/10.1007/978-3-0348-7224-9_2)),
but I will just make reference to Polyak's.

~~~
_jamesm_
Cool, glad to have helped. It seems I have caused a further minor point of
confusion though, so a correction to the correction:

The original Nesterov Accelerated Gradient paper _is_ about gradient descent,
it's just not about _stochastic_ gradient descent. It's useful to make the
distinction between "traditional" optimization methods like Newton's method,
Conjugate Gradient, BFGS and so on, which are all gradient descent methods in
the sense that they require at least one full gradient calculation per
iteration, and a lot of the algorithms mentioned in the article, which are
suitable for stochastic gradient descent and originate with the deep learning
community (there's nothing to stop them being used elsewhere, it just doesn't
seem that common).
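The distinction is easy to see in code. A toy sketch on hypothetical noiseless 1-D least-squares data (illustrative step sizes): a full-gradient step touches every example once per update, while a stochastic step uses a single randomly drawn example:

```python
import random

random.seed(0)  # for reproducibility

# Hypothetical noiseless 1-D least-squares data: fit w so that w*x ≈ y.
data = [(float(x), 2.0 * x) for x in range(1, 11)]  # true slope w = 2

def grad_one(w, x, y):
    # Gradient of the per-example loss 0.5*(w*x - y)**2 w.r.t. w.
    return (w * x - y) * x

def full_gradient_step(w, lr=1e-3):
    # "Traditional" step: one update touches the entire dataset.
    g = sum(grad_one(w, x, y) for x, y in data) / len(data)
    return w - lr * g

def sgd_step(w, lr=1e-3):
    # Stochastic step: one update per randomly drawn example.
    x, y = random.choice(data)
    return w - lr * grad_one(w, x, y)

w_full = w_sgd = 0.0
for _ in range(2000):
    w_full = full_gradient_step(w_full)
    w_sgd = sgd_step(w_sgd)

# both w_full and w_sgd approach the true slope w = 2
```

With a dataset of millions of examples instead of ten, the cost gap per update is what makes the stochastic variant attractive.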

Some extra (unnecessary) detail on NAG to put things into a bit more context,
if you are so inclined:

Although NAG has received a fair amount of theoretical attention, as far as I
know it isn't widely used in practice because its convergence properties rely
on an exact line search and a rather specific schedule for its momentum-like
term.

The Sutskever contribution is interesting because, first, it expressed the
NAG formula in a way that machine learning practitioners could easily
understand. Then, by shifting the procedure by a half step, they showed you
could think of it as a momentum step followed by a gradient descent step.
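For the curious, here is a toy sketch of the two update rules side by side, on a hypothetical 1-D quadratic with illustrative hyperparameters. Classical momentum evaluates the gradient at the current point; the Sutskever form of NAG evaluates it at the look-ahead point w + mu*v:

```python
# Toy comparison of classical momentum and the Sutskever form of NAG on
# a hypothetical 1-D quadratic f(w) = 0.5*w**2 (illustrative lr and mu).

def grad(w):
    return w  # gradient of 0.5*w**2

def momentum_step(w, v, lr=0.1, mu=0.9):
    # Classical momentum: gradient evaluated at the current point w.
    v = mu * v - lr * grad(w)
    return w + v, v

def nag_step(w, v, lr=0.1, mu=0.9):
    # NAG (Sutskever et al., 2013): gradient evaluated at the
    # look-ahead point w + mu*v -- momentum first, then a gradient
    # correction from where the momentum step would land.
    v = mu * v - lr * grad(w + mu * v)
    return w + v, v

w_m = w_n = 5.0
v_m = v_n = 0.0
for _ in range(500):
    w_m, v_m = momentum_step(w_m, v_m)
    w_n, v_n = nag_step(w_n, v_n)

# both decay toward the minimum at w = 0
```

The only difference between the two functions is where the gradient is measured, which is exactly the half-step shift described above.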

------
hnuser355
And for anyone who wants to know why unmodified gradient descent may be
considered a piece of shit in certain circumstances

[http://wikipedia.org/wiki/Rosenbrock_function](http://wikipedia.org/wiki/Rosenbrock_function)

Gradient descent with a good line search (Wolfe conditions) applied to the
multidimensional case should converge to the minimum, but it might take you
thousands of iterations. Newton's method or something similar might take <50.

But machine learning practitioners will know why first-order gradient
algorithms are often preferred despite this.
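To make this concrete, a toy sketch comparing the two on the Rosenbrock function, whose minimum is at (1, 1). Step sizes and iteration counts are illustrative, and the Newton loop here is pure Newton with no line search or safeguards:

```python
# Toy comparison on the Rosenbrock function, minimum at (1, 1).
# Step sizes and iteration counts are illustrative; the Newton loop is
# pure Newton with no line search or safeguards.

def rosen(x, y):
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def grad(x, y):
    gx = -2 * (1 - x) - 400 * x * (y - x ** 2)
    gy = 200 * (y - x ** 2)
    return gx, gy

def hessian(x, y):
    a = 2 - 400 * y + 1200 * x ** 2  # d2f/dx2
    b = -400 * x                     # d2f/dxdy
    c = 200.0                        # d2f/dy2
    return a, b, c

# Plain gradient descent: small fixed step, thousands of iterations,
# and it is typically still crawling along the curved valley afterwards.
x, y = -1.2, 1.0
for _ in range(10_000):
    gx, gy = grad(x, y)
    x -= 5e-4 * gx
    y -= 5e-4 * gy

# Pure Newton: solve H d = -g at each step (Cramer's rule for the
# 2x2 system); it reaches (1, 1) in a handful of iterations.
xn, yn = -1.2, 1.0
for _ in range(20):
    gx, gy = grad(xn, yn)
    a, b, c = hessian(xn, yn)
    det = a * c - b * b
    xn -= (gx * c - gy * b) / det
    yn -= (a * gy - b * gx) / det
```

The catch, of course, is that the Newton step needs the Hessian, which is what stops it from scaling to millions of neural network parameters.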

~~~
marcosdumay
It's hard to propagate Newton's method over layers on a neural network.

~~~
stochastic_monk
There are also a lot of conditions required for Newton’s method to work that
neural networks don’t satisfy.

~~~
marcosdumay
Is there some condition that makes the method not work at all?

I could never find a showstopper (granted, I only thought about this for a
few hours when first studying the subject), only things that slowed it down
enough that gradient descent became the better option (and honestly, I am
still not sure that cannot be fixed).

------
hoseja
Gradient descent is the abstraction of so many real-world problems it's not
even funny. From protein folding to machine intelligence, gradient descent is
everywhere...

