
An overview of gradient descent optimization algorithms - bjourne
http://ruder.io/optimizing-gradient-descent/index.html
======
mnw21cam
Note that this is very much geared toward machine learning. In many other
applications of descent algorithms, it would be sensible to use the Conjugate
Gradient method, which this site doesn't mention.
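
A minimal sketch of the linear Conjugate Gradient method for a quadratic
objective f(x) = 0.5 x^T A x - b^T x; the matrix, right-hand side, and
tolerance below are illustrative assumptions, not taken from the article:

    import numpy as np

    def conjugate_gradient(A, b, x0=None, tol=1e-8, max_iter=1000):
        # Minimize 0.5 x^T A x - b^T x for symmetric positive definite A,
        # which is equivalent to solving A x = b.
        x = np.zeros_like(b) if x0 is None else x0.copy()
        r = b - A @ x            # residual = negative gradient
        p = r.copy()             # initial search direction
        rs_old = r @ r
        for _ in range(max_iter):
            Ap = A @ p
            alpha = rs_old / (p @ Ap)      # exact line search along p
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol:
                break
            p = r + (rs_new / rs_old) * p  # next A-conjugate direction
            rs_old = rs_new
        return x

    # Small SPD example; the exact solution is roughly [0.0909, 0.6364].
    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    print(conjugate_gradient(A, b))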

~~~
nimish
Of course, there's tons of theory and practice in nonlinear continuous
optimization.

It's interesting to see what machine learning practitioners will rediscover or
reinvent next. Quasi-Newton methods appear to be surfacing again.

But you do get some very interesting improvements to vanilla gradient descent!
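
As one concrete illustration of a quasi-Newton method, a hedged sketch using
SciPy's L-BFGS-B on the Rosenbrock test function; the test function and
starting point are illustrative choices, not something the article or the
parent comment specifies:

    import numpy as np
    from scipy.optimize import minimize, rosen, rosen_der

    x0 = np.array([-1.2, 1.0])
    result = minimize(rosen, x0, method='L-BFGS-B', jac=rosen_der)
    print(result.x)    # converges near the true minimum at [1, 1]
    print(result.nit)  # number of iterations taken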

------
gHosts
A long, long time ago I tried every gradient-based optimization method
available at the time on fitting a model to noisy / non-smooth data with many
local minima... and found all of them lacking in robustness.

In the end the only truly robust algorithm I could rely on was a combination
of Nelder and Mead's downhill simplex method with simulated annealing. I'm
curious that the overview mentions neither...

Have they been supplanted by something better? Or is he only considering
"nice" functions (i.e. smooth even in higher derivatives, with no local
minima)?
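
Not the exact annealed-simplex combination described above, but a related
sketch: SciPy's basin hopping (random perturbations with a Metropolis-style
acceptance rule) using Nelder-Mead as the local search, on a highly multimodal
test function. The function, starting point, and parameters here are
illustrative assumptions:

    import numpy as np
    from scipy.optimize import basinhopping

    def rastrigin(x):
        # Standard highly multimodal test function; global minimum at the origin.
        return 10 * len(x) + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

    x0 = np.array([3.5, -2.0])
    result = basinhopping(rastrigin, x0,
                          minimizer_kwargs={"method": "Nelder-Mead"},
                          niter=200)
    print(result.x, result.fun)  # with enough hops, often ends up near [0, 0]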

