
An overview of gradient descent optimization algorithms - babelouc
http://www.datasciencecentral.com/profiles/blogs/an-overview-of-gradient-descent-optimization-algorithms
======
highd
Blog spam; better to go straight to the original: [http://sebastianruder.com/optimizing-gradient-descent/](http://sebastianruder.com/optimizing-gradient-descent/)

Useful reference, 6 months old.

Side note: is anybody aware of any implementations of the "learning to learn
by gradient descent by gradient descent" work [0]? I'd love to cut my training
time, but I'm worried that implementing it myself will just leave me with an
additional set of hyperparameters to tune.

[0] [https://arxiv.org/abs/1606.04474](https://arxiv.org/abs/1606.04474)
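
In case it helps anyone gauge the effort: the core loop is small enough to sketch. Here's a minimal toy version, assuming PyTorch, with the optimizee replaced by a random quadratic. The paper's gradient preprocessing and other details are omitted, so treat it as a sketch rather than a faithful reimplementation:

```python
import torch
import torch.nn as nn

HIDDEN = 20

class LSTMOptimizer(nn.Module):
    """Coordinate-wise LSTM mapping a gradient to a parameter update."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTMCell(1, HIDDEN)
        self.out = nn.Linear(HIDDEN, 1)

    def forward(self, grad, state):
        # grad: (p, 1) -- one "batch" row per parameter coordinate
        h, c = self.lstm(grad, state)
        return self.out(h), (h, c)

p, unroll = 10, 20
meta_opt = LSTMOptimizer()
adam = torch.optim.Adam(meta_opt.parameters(), lr=1e-3)

for _ in range(100):                    # meta-training iterations
    A, b = torch.randn(p, p), torch.randn(p)   # random quadratic optimizee
    theta = torch.randn(p, requires_grad=True)
    state = (torch.zeros(p, HIDDEN), torch.zeros(p, HIDDEN))
    meta_loss = 0.0
    for t in range(unroll):             # unrolled inner optimization
        loss = ((A @ theta - b) ** 2).sum()
        # As in the paper, the gradient fed to the learned optimizer is
        # detached, so meta-training avoids second derivatives.
        grad, = torch.autograd.grad(loss, theta, retain_graph=True)
        update, state = meta_opt(grad.detach().unsqueeze(1), state)
        theta = theta + update.squeeze(1)
        meta_loss = meta_loss + loss    # meta-objective: summed inner losses
    adam.zero_grad()
    meta_loss.backward()
    adam.step()
```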

~~~
zo7
Is it possible to swap the link on this post? It's kind of silly that they
just copied the original. Understanding the different optimizers and why they
work is important, though; people often use them without considering what
they're actually doing.
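
For what it's worth, the core of an optimizer like Adam (probably the most common default) is only a few lines. A minimal NumPy sketch of the update rule, using the standard default hyperparameters rather than anything from the linked post:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad        # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad**2     # second-moment (uncentered variance)
    m_hat = m / (1 - b1**t)             # bias correction for zero init
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Usage: minimize f(x) = x^2
theta = np.array([5.0])
m = v = np.zeros_like(theta)
for t in range(1, 5001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)  # -> close to 0
```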

------
graycat
I didn't see mention of conjugate gradients, Newton iteration, or quasi-Newton
methods.

And I saw no mention of constraints.

~~~
highd
This is primarily about deep neural nets, where second-order methods are too
expensive: just storing the Hessian is O(p^2) in the number of parameters p,
and p is on the order of millions.
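
A quick back-of-envelope calculation makes the point (numbers are illustrative, not from the article):

```python
# Dense Hessian storage for p = 10 million parameters, 4-byte floats:
p = 10**7
print(p * p * 4 / 1e12, "TB")  # -> 400.0 TB, far beyond any single machine
```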

I'm also not sure conjugate gradient methods see much use, given the non-
convexity of the merit functions, and hard constraints are rare in this
setting.

~~~
partykid92
Quasi-Newton methods need not be quadratic in the dimension of the problem
(think limited-memory BFGS, i.e. L-BFGS), and can run in time and memory
linear in p. In my experience, however, they're 2-3x slower than first-order
methods like Adam.
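
If anyone wants to try that comparison themselves, PyTorch ships an L-BFGS optimizer. A minimal sketch on a made-up toy problem (the model, data, and history size here are illustrative assumptions):

```python
import torch

# Toy regression problem, invented for illustration.
model = torch.nn.Linear(100, 1)
x, y = torch.randn(512, 100), torch.randn(512, 1)
loss_fn = torch.nn.MSELoss()

# L-BFGS keeps only `history_size` gradient/step pairs, so memory is
# O(history_size * p) rather than the O(p^2) of a dense quasi-Newton method.
opt = torch.optim.LBFGS(model.parameters(), history_size=10, max_iter=20)

def closure():
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    return loss

for _ in range(10):
    opt.step(closure)

# For the Adam baseline, swap in torch.optim.Adam(model.parameters())
# and use the usual zero_grad() / backward() / step() loop.
```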

