
Optimization for Deep Learning Highlights - stablemap
http://ruder.io/deep-learning-optimization-2017/index.html
======
legel
This is an important update for practitioners: beware of Adam; it has
fundamental flaws. In practice, carefully tuned learning rate schedules for
basic SGD, as discussed in the article, have gotten me much better results on
very large, very complicated data sets... SGD is clearly a brute-force
approach requiring more effort from the programmer, but with sufficient data
and sufficient computing power it's still the best starting point, because it
assumes the least about the constraints of the data.
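
For concreteness, here's a minimal sketch (illustrative only, not my exact
setup) of plain SGD with a hand-tuned step-decay schedule in NumPy; the base
rate, decay factor, and drop interval are placeholders you'd tune per problem:

    import numpy as np

    def step_decay_lr(base_lr, epoch, drop=0.5, epochs_per_drop=30):
        """Cut the learning rate by `drop` every `epochs_per_drop` epochs
        (all of these constants are hand-picked per dataset)."""
        return base_lr * (drop ** (epoch // epochs_per_drop))

    def sgd_step(params, grads, lr):
        """One vanilla SGD update: theta <- theta - lr * grad."""
        return [p - lr * g for p, g in zip(params, grads)]

    # Toy usage: a least-squares fit, just to show the schedule driving
    # the step size over epochs.
    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(100, 5)), rng.normal(size=100)
    w = np.zeros(5)
    for epoch in range(90):
        lr = step_decay_lr(base_lr=0.1, epoch=epoch)
        grad = 2 * X.T @ (X @ w - y) / len(y)
        (w,) = sgd_step([w], [grad], lr)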

As a related aside, my current interest is in the stable, efficient use of
higher-order derivatives for topological inference in optimization, and I wish
more energy were invested in that kind of math...

~~~
nerdponx
Any references on (or reproducible examples for) the flaws in Adam?

~~~
stochastician
See https://arxiv.org/abs/1705.08292 from our group.

------
naturalgradient
No mention of Kronecker factorization or any work on second-order methods?

This would be more aptly titled 'Tips and tricks for Adam'...
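
For anyone unfamiliar, here is a rough toy illustration (my own, not from the
article or any particular implementation) of the Kronecker identity those
methods exploit: a layer's curvature block is approximated as A ⊗ G, so the
inverse-times-gradient needs only the two small factors, never the full
Kronecker matrix:

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy stand-ins for the per-layer factors: A ~ covariance of layer inputs,
    # G ~ covariance of backpropagated gradients (both made SPD here).
    A = rng.normal(size=(3, 3))
    A = A @ A.T + 3 * np.eye(3)
    G = rng.normal(size=(4, 4))
    G = G @ G.T + 3 * np.eye(4)
    V = rng.normal(size=(4, 3))  # the layer's gradient, shape (fan_out, fan_in)

    # Naive route: build and invert the full block (12x12 here, but
    # (fan_in * fan_out)^2 in general -- exactly what the factorization avoids).
    naive = np.linalg.solve(np.kron(A, G), V.reshape(-1, order="F"))

    # Kronecker identity: (A ⊗ G)^{-1} vec(V) = vec(G^{-1} V A^{-T}),
    # so only the two small factors ever need to be inverted.
    cheap = np.linalg.solve(G, V) @ np.linalg.inv(A).T
    assert np.allclose(naive, cheap.reshape(-1, order="F"))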

------
zzleeper
The text is unreadable on Windows Chrome. The font is too gray or thin.

------
tzahola
TL;DR: a few new heuristics to tweak the step size of gradient descent.

The more I read about this kind of machine learning research, the more
convinced I become that we're very, very far from the Kurzweilian singularity...

~~~
amelius
But what can we expect? We still have trouble solving large, general systems
of linear equations, even after decades of research. And now we're trying to
optimize large nonlinear systems...

~~~
Houshalter
Finding the global optimum is a lot harder than just finding a good local
optimum quickly, which is good enough for ML.

