
An overview of gradient descent optimization algorithms - azuajef
https://arxiv.org/abs/1609.04747
======
thearn4
It isn't mentioned in the abstract, but this seems to be more of an overview
of ML-specific notions of gradient descent, where batching is possible
because you're taking gradients of a fixed prediction architecture, with
respect to its tunable weights, over a large set of training data.

So each of those training points represents a sort of separable or
parallelizable piece of the whole process, giving you a ton of freedom in
how you actually execute the gradient stepping (with one training point,
several of them, or all of them). As I understand it, the stochasticity in
this process interestingly adds enough "noise" that local minima seem to be
avoided in many cases.
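
To make that batching freedom concrete, here's a minimal numpy sketch (a toy least-squares problem, with all names and data made up for illustration) where the same update loop covers full-batch, minibatch, and single-point stochastic stepping just by changing how the indices are split:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a fixed "architecture" (here just a linear model) and many training points.
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=1000)

def grad(w, Xb, yb):
    # Gradient of the mean squared error over whatever batch of points you hand it.
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

w = np.zeros(5)
lr = 0.1
n_batches = 10  # 1 -> full batch, 10 -> minibatches of 100,
                # 1000 -> one point at a time (you'd typically drop the step size)
for epoch in range(20):
    idx = rng.permutation(len(y))
    for batch in np.array_split(idx, n_batches):
        w -= lr * grad(w, X[batch], y[batch])
```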

In more general applications of non-linear gradient-based optimization (say
for optimizing parametric models in physical engineering), this doesn't
necessarily come into play.

~~~
sdenton4
Do you have any sense of whether tricks like momentum, which afaik came out
of improving SGD for neural network training, have found application in other
arenas where batching is less reasonable?

~~~
thearn4
I think there is room to exchange ideas between the nonlinear programming and
ML communities for sure.

Specifically for momentum, if I understand it right, it's a particular way of
perturbing the step size and gradient direction to prevent oscillation. There
are some other good examples of this kind of trick used by many
gradient-descent optimizers. For example:

[https://www.cs.cmu.edu/~ggordon/10725-F12/scribes/10725_Lect...](https://www.cs.cmu.edu/~ggordon/10725-F12/scribes/10725_Lecture5.pdf)
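
To make the momentum part concrete, here's a minimal sketch of the classical heavy-ball / momentum update as I understand it (not any particular library's implementation; the toy quadratic is just for illustration):

```python
import numpy as np

def momentum_descent(grad_fn, w0, lr=0.01, beta=0.9, steps=200):
    """Classical (heavy-ball) momentum: the velocity is a decaying sum of past
    gradients, so components that keep flipping sign largely cancel, while
    consistently-downhill components build up speed."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v - lr * grad_fn(w)
        w = w + v
    return w

# Toy ill-conditioned quadratic 0.5 * (w0**2 + 100 * w1**2): with this step
# size, plain gradient descent crawls along the shallow w0 direction, while
# momentum accumulates speed there.
grad = lambda w: np.array([1.0, 100.0]) * w
print(momentum_descent(grad, [1.0, 1.0]))
```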

There's part of me that wonders if one interesting way forward for deep
learning is a minibatch form of BFGS or SNOPT.
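
As a cartoon of what I mean (a very naive sketch; re-sampling the batch between runs makes the curvature information noisy, which is presumably where the real research problem lies), something like running scipy's L-BFGS on successive minibatch losses, warm-starting each run from the last solution:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
# Same toy least-squares setup as in my sketch above, purely for illustration.
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=1000)

def loss_and_grad(w, Xb, yb):
    # Returns the minibatch mean-squared-error loss and its gradient.
    r = Xb @ w - yb
    return float(r @ r / len(yb)), 2.0 * Xb.T @ r / len(yb)

w = np.zeros(5)
for _ in range(10):
    batch = rng.choice(len(y), size=100, replace=False)
    # A short L-BFGS run on one minibatch's loss, warm-started from the last solution.
    res = minimize(loss_and_grad, w, args=(X[batch], y[batch]),
                   jac=True, method="L-BFGS-B", options={"maxiter": 20})
    w = res.x
```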

------
pacmansyyu
Here[1] is an article covering the same material, written by the author himself.

[1]: [http://ruder.io/optimizing-gradient-descent/index.html](http://ruder.io/optimizing-gradient-descent/index.html)

------
jabl
Stupid Q: Assuming "gradient descent" is roughly similar to the classical
"steepest descent" optimization algorithm (???), why aren't deep learning
researchers looking into more advanced algorithms from classical non-linear
optimization theory, like, say, (preconditioned) conjugate gradient or
quasi-Newton methods such as BFGS?

------
abakus
SGD tends to generalize better than adaptive methods, according to this:

[https://people.eecs.berkeley.edu/~brecht/papers/17.WilEtAl.A...](https://people.eecs.berkeley.edu/~brecht/papers/17.WilEtAl.Ada.pdf)

