Why “Gradient” Descent? 105 points by JunaidB 3 months ago | hide | past | web | favorite | 26 comments

 The reason the gradient is preferred (as I understand it) is actually computational considerations. Assuming you can actually compute the both of them, a Newton method (uses the gradient and the derivative of the gradient, the Hessian) usually has faster convergence (quadratic instead of linear). However that Hessian can be big and difficult to compute, and then you need a linear solve with it.However in most ML applications, you don’t even have the full gradient. You have a stochastic estimate since your loss is generally additive in the data. So you’re not (as far as I know) even going to bother trying to form a Hessian. I believe many have investigated quasi-Newton methods based on estimate gradients but I haven’t investigated that thoroughly.
 This is a really important point and I wish I'd mentioned it. The computational considerations (as you've said) make the classic Gradient Descent method infeasible in practice. Therefore we resort to stochastic estimates or Quasi Newton approaches (which I'm still looking into).My main objective was to highlight is that given that we are performing the classic gradient descent, the gradient will yield the greatest reduction in the function value. Essentially it was a point to highlight the underlying calculus. Wayne Winston in his book Operations Research: Applications and Algorithms has an interesting passage where he discusses the gradient being the direction of maximum increase (he was looking as steepest ascent).
 Your point about convergence to the exact minima over training loss not guaranteeing the best generalization loss reminds me of the point made in this lecture here https://www.youtube.com/watch?v=k3AiUhwHQ28.You also made an interesting comment about work not catching on outside of the optimization community - can you recommend some resources or websites to follow in order to see what the optimization community is working on? I've developed an interest in the area but don't really know where to go for "up to date" information.
 I’m not that other guy and I also haven’t read this paper but it seems quite thorough
 This seems like an excellent review. I'll check it out. Thanks very much!
 I always find it interesting to see the different paths people take toward learning the same thing. When I first did multivariable calculus, I learned that the gradient points uphill, and the negative of the gradient points downhill. I'm definitely a spacial learner, and mostly thought of surfaces the way one walks over hills. The idea of using gradient descent to find a local minimum is the simplest part of neural networks to me.It's interesting to see someone first write an article about nearest neighbor classifiers (a topic I really don't know much about), and then, 2-3 months later, figure out why we use gradient descent.
 Yeah thats what the gradient literally is defined as = Rate of change. Not trying to be snarky but for me this means not taking the time to learn the basics before jumping into far far advanced concepts. Sadly this seems to be a pattern in a lot of machine learning curriculum today.
 When private schools offer a 1-year course to become a "Machine Learning Consultant" with no prior mathematics or programming knowledge required, you know that something has to be off ...
 I mean, a quality program will include both numerical analysis and optimization methods classes in their program, both which will (hopefully) go through things like gradient descent in rigour.And long before that students will hopefully have a solid fundamental knowledge in what rate of change means - not just as in banging out derivatives on paper.
 Thank you for taking the time to look through the post. When I learned this formally it was introduced as steepest descent (Cauchy's variant). Like yourself, it didn't become concrete to me until I looked at the surface plots. I concur with your point that it's interesting to see the different paths people take toward learning the same thing.
 I too, like commenter above, find it fascinating how various concepts are quickly accessible (or not!) for different people. Usually I blame it on the teacher's presentation (for my part, I am completely unable to explain pointers in (say, C) to people who haven't already been able to get it), but sometimes it is also the building blocks that the learner already has in place.So - maybe a rude question - but had you take Calculus, Differential Equations or Vector Calculus (div, grad and curl and all that) before you started in on this more integrated material?
 Not rude at all! Thanks for the question. My background is in Statistics not Machine learning (that came later) so I covered these topics without reference to ML applications. I suppose I never learned to connect these ideas when I was learning about ML, it was only when I went back to my old material I realised how these ideas relate. I agree with your point about teacher material, when I was learning about ML it was separated from raw Calculus.A short answer to your question - yes I took those courses before I started looking at the more integrated material.
 I think there may be some confusion as to the terminology in the article. The gradient does not necessarily lead to the greatest decrease in the function. Specifically, if our function of interest is f and we're currently at the point x, then f(x - alpha grad f(x)) for some alpha may or may not be less that f(x + beta dx) for a different dx and beta.For example, consider the quadratic f(x) = 0.5 x' A x - b' x where ' denotes transpose and we have implicit multiplication. Let A = [2 1;1 2], b = [3; 4], x = [5;6]. This is convex and quadratic with a global minima of [2/3;5/3], inv(A) b. Now, grad f(x) = A x - b = [13;13] and moving in the direction [-13;-13] will not get us there in a single step. However, dx_newton = -inv(hess f(x)) grad f(x) = [-4 1/3, -4 1/3], which brings us to the global minima in a single step.The value in gradient descent is that, combined with an appropriate globalization technique such as a trust region or a line search, it guarantees convergence to a local minima. Newton's method does not unless close enough to the minima. As such, most good, fast optimization algorithms based on differentiable functions use the steepest descent direction as a metric, or fallback, to guarantee convergence and then use a different direction, most likely a truncated-Newton method, to converge quickly. Meaning, the gradient descent direction rarely leads to the greatest decrease. Unless, of course, we want to make an argument in an infinitesimal sense, which fine, but I'd denote that explicitly.
 That's a great point and to be honest I could have been a lot tighter with the terminology. Good advice to take on board for next time - thanks!Your point about combining optimisation techniques is interesting and I'd love to learn about it a little more. When you say "As such, most good, fast optimization algorithms based on differentiable functions use the steepest descent direction as a metric, or fallback, to guarantee convergence and then use a different direction, most likely a truncated-Newton method, to converge quickly", does this mean that both algorithms are being used together? So first steepest descent is run for a few iterations and then the truncated-Newton method takes over?If you have some resources where I could read up on this it would be much appreciated!
 Thank you for taking the time to write a thorough and considerate response. I have been working through the Engineering Optimization Methods and Applications by Ravindran, Ragsdell and Reklaitis so far but I will spend some time in the coming few weeks with Nocedal and Wright in accordance with your recommendation.I intend to write more about what I learn in this area and I'd be honoured if you would contribute like you did here with your comments/ corrections and suggestions! Thank you for the help and reference, I will definitely be following up.
 What I find interesting is the implicit assumption that the underlying function being learned is differentiable or continuous to begin with. That's not always the case; for example, we often work with "categorically labeled" discrete binning problems.
 In those cases, people tend to use something like a categorical cross-entropy loss where you assign a continuous likelihood score to each discrete possibility, thereby making things differentiable again.
 Data normalization is the 101 of any respectable machine learning course.
 Yeah but it’s still enough to stall and hinder newbies (me) that deal primarily with categorical data until you can start to intuitively map the continuous back into the discrete.Nature of the beast it seems but still kind of a pain.
 > it’s not obvious to me at all that the same direction as the (negative) gradient leads to the largest decrease in the value of the function f(x)What am I missing here? This is straight up the definition of the gradient.
 The gradient is defined as a limit. That it points to the direction of greatest increase is a consequence of the definition, not the definition itself.
 Small side note, don't forget to escape your trig functions in TeX! Else it will render your "cos" in italitcs (product of variables c, o and s) as in this article, and thousands of others; the only thing more surprising than people not noticing this is TeX not giving a warning about it.
 Please try this exercise and report back :) Suppose it happened that when setting up your parameter space you found yourself working with ξ and η instead of x and y, where the relationship is given simply by (ξ, η) = A (x, y) for A an invertible linear mapping (2×2 matrix). This could easily happen in practice. Is gradient descent in (ξ, η) the same procedure as gradient descent in (x, y)? What should we make of any difference?
 This article made me wonder if it is worth going in the positive direction on occasion, just to check that it gets worse.

Search: