Hacker News
Optimization for Deep Learning Highlights (ruder.io)
102 points by stablemap 6 months ago | 9 comments

This is an important update for practitioners: beware of Adam, it has fundamental flaws. In practice, carefully tuned learning rate schedules for basic SGD, as discussed here, have gotten me much better results on very large, very complicated data sets... SGD is clearly a brute-force approach requiring more effort from the programmer, but with sufficient data and sufficient computing power it's still the best starting point, because it makes the fewest assumptions about the data.
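To make the tuned-schedule point concrete, here is a minimal sketch of plain SGD with a step-decay learning rate schedule on a toy quadratic objective. All names and hyperparameters here are made up for illustration; the point is only that the schedule itself (base rate, decay factor, decay interval) is what gets tuned.

```python
def grad(w):
    # Gradient of the toy objective f(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

def sgd_step_decay(w0, base_lr=0.1, decay=0.5, decay_every=50, steps=200):
    """Vanilla SGD with a step-decay schedule: the learning rate is
    multiplied by `decay` every `decay_every` steps. Hyperparameters
    are illustrative, not recommendations."""
    w = w0
    for t in range(steps):
        lr = base_lr * (decay ** (t // decay_every))  # piecewise-constant decay
        w -= lr * grad(w)
    return w

w_final = sgd_step_decay(w0=0.0)
```

In practice one would sweep `base_lr`, `decay`, and `decay_every` (or use a warmup/cosine schedule instead); the tuning effort is the "brute force" the parent comment mentions.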

As a related aside, my current interest is in the stable, efficient use of higher-order derivatives for topological inference in optimization, and I wish more energy were invested in that kind of math...

Any references on (or reproducible examples for) the flaws in Adam?

TL;DR: a few new heuristics to tweak the step size of gradient descent.

The more I read about this kind of machine learning research, the more convinced I become that we're very, very far from the Kurzweilian singularity...

The AI effect is real. The second you understand something, it stops being magical.

But what can we expect? We still have trouble solving a large general system of linear equations, even after decades of research. And now we're trying to optimize large nonlinear systems ...

Finding the global optimum is a lot harder than quickly finding a good local optimum, which is good enough for ML.

No mention of Kronecker factorization or any work on second order methods?

This would be more aptly titled 'Tips and tricks for Adam'.

The text is unreadable on Windows Chrome. The font is too gray or thin.
