This is an important update for practitioners: beware of Adam; it has fundamental flaws. In practice, carefully tuned learning rate schedules for basic SGD, as discussed here, have gotten me much better results on very large, very complicated data sets... SGD is clearly a brute-force approach requiring more effort from the programmer, but with sufficient data and sufficient computing power it's still the best starting point, because it makes the fewest assumptions about the data.
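For anyone wondering what "carefully tuned learning rate schedule" means in practice, here is a minimal sketch of step-decay SGD on a toy ill-conditioned quadratic (NumPy; the initial rate, decay factor, and milestones are illustrative values I picked, not anything from the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    A = np.diag(np.linspace(1.0, 50.0, 10))      # ill-conditioned quadratic: f(w) = 0.5 * w^T A w
    w = rng.normal(size=10)

    lr, decay, milestones = 0.02, 0.1, {2000, 4000}  # illustrative schedule
    for step in range(6000):
        if step in milestones:                   # drop the step size at fixed points
            lr *= decay
        grad = A @ w + 0.01 * rng.normal(size=10)  # noisy gradient to mimic minibatch SGD
        w -= lr * grad                           # plain SGD update
    print("final loss:", 0.5 * w @ A @ w)

The large initial rate makes fast progress along the well-conditioned directions; the later drops let the iterates settle despite the gradient noise. The "tuning" is exactly in choosing those drop points and factors for your data set.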
As a related aside, my current interest is in the stable, efficient use of higher-order derivatives for topological inference in optimization, and I wish more energy were invested in such math...
TL;DR: a few new heuristics to tweak the step size of gradient descent.
The more I read about this kind of machine learning research, the more convinced I become that we're very, very far from the Kurzweilian singularity...
But what can we expect? We still have trouble solving large general systems of linear equations, even after decades of research. And now we're trying to optimize large nonlinear systems...