An overview of gradient descent optimization algorithms
168 points by tim_sw on June 21, 2016 | 21 comments

Why are algorithms other than gradient descent (e.g., BFGS, Newton's method, conjugate gradient, etc.) seemingly always absent from discussions on fitting neural nets? Are there not batch/stochastic versions of these methods? Is it simply ease of understanding/implementation? I only ever seem to see "gradient descent + tricks", and I wonder what the reason for this is.

Stochastic gradient descent methods are the most used for neural-net fitting, so they are the most discussed. However, the methods you mention are sometimes used too. For the closest thing to Newton's method applied to neural nets, search for "Hessian-free optimization". Hacky minibatch updates based on batch methods have been tried, and can work OK: http://ai.stanford.edu/~quocle/LeNgiCoaLahProNg11.pdf . Any method could also potentially be used with growing batch sizes: https://papers.nips.cc/paper/4308-statistical-tests-for-opti...

Minibatches are good for parallelizing computation, especially on GPUs. And batch methods can be less of a fiddle to tune. So the alternatives you mention are popular in some commercial settings, for cost functions with nasty curvature, or to make it easier to write fast code. I guess most people's experience is that the dead-simple SGD methods often work well, though, and that's why they keep returning to them.

I wonder what the quasi-Newton methods would look like if adapted to this framework.

BFGS is an example of a quasi-Newton method. L-BFGS is the popular "low-memory" variant. Any such method can be used with the two references I gave above. One could also attempt to approximate each of its internal computations to within some tolerance with minibatches.
Work in that sort of direction here: https://ei.is.tuebingen.mpg.de/publications/mahhen2015

Perhaps surprisingly, given how fundamental optimization is to so many fields, there isn't consensus on what's best. And as I previously hinted above, the answer may change with the hardware landscape.

> Perhaps surprisingly, given how fundamental optimization is to so many fields, there isn't consensus on what's best. And as I previously hinted above, the answer may change with the hardware landscape.

Not entirely to your point, but related: my understanding is that there is no method that is best at everything; methods that work well in one field won't necessarily work in others. See https://en.wikipedia.org/wiki/No_free_lunch_in_search_and_op...

I agree. But part of it is sociology: people use what their friends use.

I always assumed that when people talk about gradient descent, they actually mean L-BFGS, i.e., a quasi-Hessian method. After all, it can be used as a black-box algorithm that only needs to be told the gradient. At least in quantum optimization, the "simple" (non-quasi-Newton) gradient approach almost never works, whereas L-BFGS does just fine. So I'm confused why, in machine learning, the situation would be different, or why one would choose not to use the "for free" enhancement that L-BFGS provides.

In machine learning, "evaluating the gradient" means sweeping over the whole training set. For simple models, stochastic gradient descent will have found a good fit after one or two passes over the dataset. (For neural nets, tens to hundreds of passes over the dataset may be needed.) It's hard for batch L-BFGS to do much in the same number of gradient and function evaluations. No one uses batch steepest gradient descent, which would be the worst of all worlds.

The discussion above was about making stochastic or minibatch versions of algorithms like L-BFGS. It's possible, but not entirely straightforward, which is why "dumb" stochastic gradient descent is often used.
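To make the cost comparison above concrete, here is a small sketch (the least-squares problem, data, and all constants are invented for illustration) fitting the same model two ways: batch L-BFGS via SciPy, where each objective call is a full sweep over the data, and plain minibatch SGD, which makes many cheap updates per pass.

```python
import numpy as np
from scipy.optimize import minimize

# Toy noiseless least-squares problem; X, y, and the constants below are
# invented for this sketch, not taken from the thread.
rng = np.random.default_rng(0)
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true

def f_and_grad(w):
    """Full-batch objective and gradient: one call = one sweep over the data."""
    r = X @ w - y
    return 0.5 * r @ r / n, X.T @ r / n

# Batch L-BFGS as a black box: it only needs f and its gradient.
res = minimize(f_and_grad, np.zeros(d), jac=True, method="L-BFGS-B")

# "Dumb" minibatch SGD: many small updates per pass over the dataset.
w, lr, batch = np.zeros(d), 0.1, 32
for epoch in range(20):
    for _ in range(n // batch):
        b = rng.choice(n, batch, replace=False)
        w -= lr * X[b].T @ (X[b] @ w - y[b]) / batch  # minibatch gradient step

print(np.allclose(res.x, w_true, atol=1e-4),
      np.allclose(w, w_true, atol=1e-2))
```

On this easy, noiseless problem both recover the true weights; the point of the sketch is only the difference in how the data is touched per update.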
Because it is extremely easy to conceptualize and implement, and it works (usually poorly) with few assumptions.

BFGS -- the Broyden, Fletcher, Goldfarb, Shanno update for quasi-Newton methods?

Yep.

There were some standard, old techniques I didn't see.

For the set of real numbers R, a positive integer n, and a function f: R^n --> R, we seek x in R^n to minimize f(x). Let z in R^n denote the value of x that minimizes f(x).

Borrowing from D. Knuth's mathematical typesetting system TeX, we let x_i denote component i of x. More generally, we use the underscore to denote a subscript.

Assume that function f is differentiable, and let D_x f(x) be the derivative, that is, the gradient: the vector whose component i is the partial derivative of f with respect to component i of x.

(1) Line search. Suppose we have iterations j = 1, 2, ... where x^j is our jth estimate of z. Then, for a positive real number s, our step size, the usual gradient descent iteration is

x^{j + 1} = x^j - s D_x f(x^j)

Well, with this approach alone, it need not be that

f(x^{j + 1}) < f(x^j)

So an improvement is, for each iteration j, to do a line search to find the step size s. If f is convex, then there is a simple, standard approach to adjusting s in this line search.

(2) Conjugate gradients. Gradient descent is vulnerable to a lot of movement in directions that are nearly orthogonal to the best direction. The classic technique of conjugate gradients improves this situation after n iterations.

(3) Newton iteration. The simple Newton iteration for, say, square roots generalizes to a function of n variables but, of course, requires finding second derivatives of f.

(4) Quasi-Newton. Look up quasi-Newton methods, which estimate the second derivatives of the Newton iteration from the gradients of the iterations.

For a mathy academic survey of gradient descent methods (from this month, June 2016), I recommend: https://arxiv.org/abs/1606.04838 (It was submitted by someone else a few days ago, but saw no traction.)
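Point (1) above can be sketched in a few lines: gradient descent where the step size s is found by a backtracking line search so that each step actually decreases f. The quadratic test function and the backtracking constants (halving s, Armijo constant 1e-4) are illustrative choices, not anything from the comment.

```python
import numpy as np

# Sketch of (1): gradient descent with a backtracking line search that
# guarantees f(x^{j+1}) < f(x^j). Test function and constants are illustrative.
def f(x):
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 2.0) ** 2

def grad(x):  # D_x f(x)
    return np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)])

x = np.zeros(2)
for j in range(50):
    g = grad(x)
    if g @ g == 0.0:          # stationary point reached; stop
        break
    s = 1.0
    # shrink s until the Armijo sufficient-decrease condition holds
    while f(x - s * g) > f(x) - 1e-4 * s * (g @ g):
        s *= 0.5
    x = x - s * g             # x^{j+1} = x^j - s D_x f(x^j)

print(x)  # approaches the minimizer z = (1, -2)
```

Without the line search (fixed s = 1), the very first step here would overshoot and increase f, which is exactly the failure mode the comment describes.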
If I understand HN policy correctly, interesting links may be reposted, which I now did. Hopefully it will get some traction now.

See also: ftp://ftp.sas.com/pub/neural/kangaroos.txt (less mathematics, more kangaroos).

Fun read. Thanks!

Very nice introductory content, but painful for me to read with the light grey letters on the white background.

Yes, and this contrast fails WCAG AAA validation. Gray text is a harmful fashion at the moment.

As someone who is trying to educate himself on concepts in machine learning, I appreciated the 101 overview of gradient descent optimizations. I love how most of the approaches to these optimizations are relatively straightforward to understand conceptually, even if their implementations and rigorous proofs will require more work to get through.

I have a theory that submissions like this get to the front page because there are a lot of procrastinators who will always upvote to save articles for later reading; they never actually read them, so they keep saving them. I admit I do this.

I came across this yesterday - proposing non-negative matrix factorisation as an alternative to backpropagation
