
An overview of gradient descent optimization algorithms - tim_sw
http://sebastianruder.com/optimizing-gradient-descent/
======
chestervonwinch
Why are algorithms other than gradient descent (e.g., BFGS, Newton's method,
conjugate gradient, etc.) always seemingly absent from discussions on fitting
neural nets? Are there not batch / stochastic versions of these methods? Is it
simply ease of understanding / implementation? I seem to only ever see
"gradient descent + tricks", and I wonder what the reason for this is.

~~~
imurray
Stochastic gradient descent methods are the most-used for neural net fitting,
so the most discussed.

However, the methods you mention are sometimes used too. For the closest thing
to Newton's method applied to neural nets, search for "Hessian-free
optimization". Hacky minibatch updates based on batch methods have been tried,
and can work ok
[http://ai.stanford.edu/~quocle/LeNgiCoaLahProNg11.pdf](http://ai.stanford.edu/~quocle/LeNgiCoaLahProNg11.pdf).
Any method could also potentially be used with growing batch sizes:
[https://papers.nips.cc/paper/4308-statistical-tests-for-optimization-efficiency](https://papers.nips.cc/paper/4308-statistical-tests-for-optimization-efficiency)

Minibatches are good for parallelizing computation, especially on GPUs. And
batch methods can be less of a fiddle to tune. So the alternatives you mention
are popular in some commercial settings, for cost functions with nasty
curvature, or to make it easier to write fast code. I guess most people's
experience is that the dead simple SGD methods often work well though, and
that's why they keep returning to them.
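
For concreteness, here is a minimal sketch of the plain minibatch SGD loop I
mean (NumPy; the `grad` function, the data array, and the hyperparameters are
all placeholders for illustration, not anything from the article):

    # Plain minibatch SGD sketch; grad(params, batch) is an assumed
    # user-supplied gradient function, and data is a NumPy array of examples.
    import numpy as np

    def sgd(params, grad, data, lr=0.01, batch_size=32, epochs=10, seed=0):
        rng = np.random.default_rng(seed)
        n = len(data)
        for _ in range(epochs):
            order = rng.permutation(n)          # reshuffle each epoch
            for start in range(0, n, batch_size):
                batch = data[order[start:start + batch_size]]
                params = params - lr * grad(params, batch)   # SGD step
        return params

The tricks the article surveys (momentum, Adam, and the rest) are essentially
variations on the update in the inner loop.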

~~~
nimish
I wonder what the quasi-Newton methods would look like if adapted to this
framework

~~~
imurray
BFGS is an example of a quasi-Newton method. L-BFGS is the popular "low-
memory" variant. Any such method can be used with the two references I gave
above. One could also attempt to approximate each of its internal computations
to within some tolerance with minibatches. Work in that sort of direction
here:
[https://ei.is.tuebingen.mpg.de/publications/mahhen2015](https://ei.is.tuebingen.mpg.de/publications/mahhen2015)
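
For a concrete feel of the batch starting point, here is a toy full-batch
L-BFGS run via SciPy on a made-up least-squares problem (the data and loss are
invented for illustration; the minibatch/growing-batch variants in the papers
above are more involved):

    # Toy full-batch L-BFGS with SciPy; the quadratic loss and random data
    # are made up purely for illustration.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))                      # toy design matrix
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

    def loss_and_grad(w):
        r = X @ w - y                                  # residuals
        return 0.5 * r @ r, X.T @ r                    # loss value and gradient

    res = minimize(loss_and_grad, np.zeros(5), jac=True, method="L-BFGS-B")
    print(res.x)                                       # recovered weights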

Perhaps surprisingly, given how fundamental optimization is to so many fields,
there isn't consensus on what's best. And as I previously hinted above, the
answer may change with the hardware landscape.

~~~
davidkretch
> Perhaps surprisingly, given how fundamental optimization is to so many
> fields, there isn't consensus on what's best. And as I previously hinted
> above, the answer may change with the hardware landscape.

Not entirely to your point, but related: my understanding is that there is no
best at everything. Methods that work well in one field won't necessarily work
in others. See
[https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization](https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_optimization)

~~~
imurray
I agree. But part of it is sociology: people use what their friends use.

------
graycat
There were some standard, old techniques I didn't see.

For the set of real numbers R, a positive integer n, and a function f: R^n -->
R, we seek x in R^n to minimize f(x). Let z in R^n denote the value of x that
minimizes f(x).

Borrowing from D. Knuth's mathematical typesetting system TeX, we let x_i
denote component i of x. More generally we use the underscore to denote
subscript.

Assume that function f is differentiable and let D_x f(x) be the derivative,
that is, the gradient, that is, the vector where component i is the partial
derivative of f with respect to component i of x.

(1) Line search. Suppose we have iterations j = 1, 2, ... where x^j is our jth
estimate of z.

Then, for a positive real number s, our _step size_, the usual gradient
descent update is

x^{j + 1} = x^j - s D_x f(x^j)

Well, with this approach, it need not be that

f(x^{j + 1}) < f(x^j)

So, an improvement is, for each iteration j, to do a _line search_ to find the
step size s. If f is convex, then there is a simple, standard approach to
adjusting s in this line search (a sketch appears below, after (4)).

(2) Conjugate Gradients. Gradient descent is vulnerable to a lot of movement
in directions that are nearly orthogonal to the best direction. The classic
_conjugate gradients_ technique improves this situation after n iterations.

(3) Newton Iteration. The simple Newton iteration for, say, square root
generalizes to a function of n variables but, of course, requires finding
second derivatives of f.

(4) Quasi-Newton. Look up quasi-Newton methods, which estimate the second
derivatives needed by the Newton iteration from the gradients computed over
the iterations.
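
Here is a small NumPy sketch of (1), gradient descent with a backtracking
(Armijo) line search; the convex quadratic test function is invented for
illustration, and the Armijo rule is just one standard choice of line search:

    # Gradient descent with a backtracking (Armijo) line search.
    # The quadratic test function below is invented for illustration.
    import numpy as np

    def backtracking_gd(f, grad_f, x, s0=1.0, beta=0.5, c=1e-4, iters=100):
        for _ in range(iters):
            g = grad_f(x)
            s = s0
            # Shrink s until the step gives a sufficient decrease in f.
            while f(x - s * g) > f(x) - c * s * (g @ g):
                s *= beta
            x = x - s * g          # decreases f whenever the gradient is nonzero
        return x

    A = np.array([[3.0, 0.2], [0.2, 1.0]])               # convex quadratic
    f = lambda x: 0.5 * x @ A @ x
    grad_f = lambda x: A @ x
    print(backtracking_gd(f, grad_f, np.array([5.0, -3.0])))  # approaches [0, 0]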

------
imurray
For a mathy academic survey of gradient descent methods (from this month, June
2016), I recommend:
[https://arxiv.org/abs/1606.04838](https://arxiv.org/abs/1606.04838)

(It was submitted by someone else a few days ago, but saw no traction.)

~~~
Bootvis
If I understand HN policy correctly, interesting links may be reposted, which
I now did. Hopefully, it will get some traction now.

------
gjm11
See also: ftp://ftp.sas.com/pub/neural/kangaroos.txt (less mathematics, more
kangaroos).

~~~
jacquesm
Fun read. Thanks!

------
antman
Very nice introductory content, but painful for me to read with the light grey
letters on the white background.

~~~
avodonosov
Yes, and this contrast fails WCAG AAA validation. Gray text is a harmful
fashion at the moment.

------
SubiculumCode
As someone who is trying to educate himself on concepts in machine learning, I
appreciated the 101 overview on gradient descent optimizations. I love how
most of the approaches to these optimizations are relatively straightforward,
conceptually, to understand, even if their implementations and rigorous proofs
will require more work to get through.

------
dharma1
I came across this yesterday - proposing non-negative matrix factorisation as
an alternative to backpropagation

[https://arxiv.org/pdf/1605.04639v1.pdf](https://arxiv.org/pdf/1605.04639v1.pdf)

------
choosername
I have a theory that submissions like this get to the front page because there
are a lot of procrastinators who will always upvote to save for later reading,
and they never actually read the articles, so they keep saving them. I admit I
do.

