However, in most ML applications you don't even have the full gradient. You have a stochastic estimate, since your loss is generally additive in the data. So you're not (as far as I know) even going to bother trying to form a Hessian. I believe many have investigated quasi-Newton methods based on estimated gradients, but I haven't investigated that thoroughly.
My main objective was to highlight that, given we are performing classic gradient descent, the negative gradient direction yields the greatest reduction in the function value. Essentially it was a point to highlight the underlying calculus. Wayne Winston, in his book Operations Research: Applications and Algorithms, has an interesting passage where he discusses the gradient being the direction of maximum increase (he was looking at steepest ascent).
Just to be clear, additive loss doesn't imply a stochastic gradient estimate. Rather, because the loss function is additive, stochastic gradient estimates of the loss become possible. But this, of course, does not mean one has to use stochastic gradient estimates.
It's just that it's easier to update and monitor progress this way, rather than computing the gradient term for every single example in the training set and then taking a descent step. The surprising thing is that stochastic gradient descent converges quickly in practice relative to full gradient descent. All of the justification and whatnot for SGD in ML is largely post hoc, because it works so unreasonably well and is so intuitive to anyone who has taken calculus.
The other aspect (with respect to the context of optimization in machine learning) is that this optimization is performed over a loss on a training dataset, for which you don't really even want convergence to an exact minimum of the training loss. What you really care about is the expected generalization loss. Convergence to the exact minimum of the training loss doesn't necessarily give the best generalization loss. I mention this because it contributes to the general aloofness towards optimization convergence rates in ML.
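To make the additive-loss point concrete, here's a minimal sketch on a hypothetical toy 1-D least-squares problem (the data and step size are made up for illustration): because the loss is an average over examples, the gradient of one randomly drawn example is an unbiased estimate of the full gradient, and plain SGD just uses that estimate per step.

```python
import random

# Toy 1-D least-squares problem: loss(w) = (1/n) * sum_i (w*x_i - y_i)^2.
# The loss is additive over examples, which is exactly what makes the
# stochastic estimate below possible in the first place.
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]  # true w = 2

def full_grad(w):
    # Exact gradient: the average of the per-example gradients.
    return sum(2.0 * (w * x - y) * x for x, y in data) / len(data)

def stochastic_grad(w):
    # Unbiased estimate of full_grad(w) from one randomly drawn example.
    x, y = random.choice(data)
    return 2.0 * (w * x - y) * x

# Plain SGD with a fixed step size; no guarantees claimed, just the mechanics.
w = 0.0
for _ in range(2000):
    w -= 0.01 * stochastic_grad(w)
```

On this noiseless toy problem w drifts to 2; in general SGD trades exact descent per step for much cheaper steps.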
> I believe many have investigated quasi-Newton methods based on estimate gradients but I haven’t investigated that thoroughly.
Until semi-recently, quasi-Newton methods were not explored in the stochastic setting because of the question of how to extend the Wolfe conditions to this arena. There's been a bit of work on this, but I don't think it's caught on outside of the optimization community (not that it necessarily should, considering the points above).
You also made an interesting comment about work not catching on outside of the optimization community - can you recommend some resources or websites to follow in order to see what the optimization community is working on? I've developed an interest in the area but don't really know where to go for "up to date" information.
It's interesting to see someone first write an article about nearest neighbor classifiers (a topic I really don't know much about), and then, 2-3 months later, figure out why we use gradient descent.
And long before that students will hopefully have a solid fundamental understanding of what rate of change means - not just as in banging out derivatives on paper.
So - maybe a rude question - but had you taken Calculus, Differential Equations, or Vector Calculus (div, grad and curl and all that) before you started in on this more integrated material?
A short answer to your question - yes I took those courses before I started looking at the more integrated material.
For example, consider the quadratic f(x) = 0.5 x' A x - b' x, where ' denotes transpose and we have implicit multiplication. Let A = [2 1;1 2], b = [3; 4], x = [5;6]. This is convex and quadratic with a global minimum at inv(A) b = [2/3; 5/3]. Now, grad f(x) = A x - b = [13;13], and moving in the direction [-13;-13] will not get us there in a single step. However, dx_newton = -inv(hess f(x)) grad f(x) = [-13/3; -13/3], which brings us to the global minimum in a single step.
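For anyone who wants to check those numbers, here's a small self-contained sketch (plain Python, with the 2x2 linear algebra done by hand):

```python
# f(x) = 0.5 x'Ax - b'x with the A, b, x from above.
A = [[2.0, 1.0], [1.0, 2.0]]
b = [3.0, 4.0]
x = [5.0, 6.0]

def matvec(M, v):
    return [M[0][0] * v[0] + M[0][1] * v[1],
            M[1][0] * v[0] + M[1][1] * v[1]]

grad = [matvec(A, x)[i] - b[i] for i in range(2)]       # A x - b = [13, 13]

# Newton step: solve (hess f) dx = -grad; for a quadratic, hess f = A.
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
Ainv = [[ A[1][1] / det, -A[0][1] / det],
        [-A[1][0] / det,  A[0][0] / det]]
dx_newton = [-matvec(Ainv, grad)[i] for i in range(2)]  # [-13/3, -13/3]

x_new = [x[i] + dx_newton[i] for i in range(2)]         # [2/3, 5/3], the minimum
```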
The value in gradient descent is that, combined with an appropriate globalization technique such as a trust region or a line search, it guarantees convergence to a local minimum. Newton's method does not, unless it starts close enough to the minimum. As such, most good, fast optimization algorithms based on differentiable functions use the steepest descent direction as a metric, or fallback, to guarantee convergence, and then use a different direction, most likely from a truncated-Newton method, to converge quickly. Meaning, the gradient descent direction rarely leads to the greatest decrease. Unless, of course, we want to make an argument in an infinitesimal sense, which is fine, but I'd denote that explicitly.
Your point about combining optimisation techniques is interesting and I'd love to learn about it a little more. When you say "As such, most good, fast optimization algorithms based on differentiable functions use the steepest descent direction as a metric, or fallback, to guarantee convergence and then use a different direction, most likely a truncated-Newton method, to converge quickly", does this mean that both algorithms are being used together? So first steepest descent is run for a few iterations and then the truncated-Newton method takes over?
If you have some resources where I could read up on this it would be much appreciated!
As far as switching back and forth between the Newton and gradient descent steps, this is largely done in a class of algorithms called dogleg methods. Essentially, the Newton step is tried against some acceptance criterion. If it satisfies this criterion, the algorithm takes that step. If not, the step is reduced until eventually it assumes the gradient descent step. I'll contend that truncated CG (Steihaug-Toint CG) does this, but better. Essentially, it's a modified conjugate gradient algorithm for solving the Newton system that maintains a descent direction. The first Krylov vector this method generates is the gradient descent step, so it eventually reduces to this step if convergence proves difficult.
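As an illustration of the dogleg idea (not Steihaug-Toint itself, which works on Krylov vectors), here's a hypothetical sketch of a single dogleg step: take the Newton step if it fits inside the trust region, clip the steepest-descent (Cauchy) step if even that pokes out, and otherwise walk from the Cauchy point toward the Newton point until hitting the boundary:

```python
import math

def norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def dogleg_step(newton_step, cauchy_step, delta):
    # If the full Newton step fits inside the trust region, take it.
    if norm(newton_step) <= delta:
        return newton_step
    # If even the Cauchy (steepest-descent) step pokes out, clip it.
    if norm(cauchy_step) >= delta:
        s = delta / norm(cauchy_step)
        return [s * ci for ci in cauchy_step]
    # Otherwise walk from the Cauchy point toward the Newton point and
    # stop at the boundary: solve ||c + t (n - c)|| = delta for t in (0, 1).
    c, n = cauchy_step, newton_step
    d = [ni - ci for ni, ci in zip(n, c)]
    a = sum(di * di for di in d)
    b = 2.0 * sum(ci * di for ci, di in zip(c, d))
    cc = sum(ci * ci for ci in c) - delta * delta
    t = (-b + math.sqrt(b * b - 4.0 * a * cc)) / (2.0 * a)
    return [ci + t * di for ci, di in zip(c, d)]
```

A shrinking trust radius smoothly interpolates the step from Newton toward steepest descent, which is the "reduces itself" behavior described above.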
More broadly, there's a question of whether all of the trouble of using second-order information (Hessians) is worth it away from the optimal solution. I will contend, strongly, yes. I base this on experience, but there are some simple thought experiments as well.

For example, say we have the gradient descent direction. How far should we travel in this direction? Certainly, we can conduct a line search or play with a "learning parameter". (Also, if you do this, please use a line search, because it will provide vastly better convergence guarantees and performance.) However, if we have the second derivative, we have a model to determine how far we need to go. Recall, a Taylor series tells us that f(x + dx) ~= f(x) + grad f(x)'dx + 0.5 dx' hess f(x) dx. We can use this to figure out how far to travel in this direction by finding an optimal alpha for J(alpha) = f(x + alpha dx) ~= f(x) + alpha grad f(x)'dx + (alpha^2/2) dx' hess f(x) dx. If dx' hess f(x) dx > 0, the model is convex along this direction and we can simply look for when J'(alpha) = 0, which occurs when alpha = -grad f(x)' dx / (dx' hess f(x) dx). When dx' hess f(x) dx < 0, this implies that we should take a really long step, as the model predicts the gradient will become even more negative in this direction the farther we go.

Though both methods must be safeguarded (the easiest is to just halve the step if we don't get descent), the point is that the Hessian provides information that the gradient did not, and this information is useful. This is only one place where this information can be used; others include the direction calculation itself, which is what truncated CG does.
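That alpha formula is easy to turn into code. A minimal sketch, assuming you can form the Hessian-vector product H dx (the negative-curvature case just signals "go far", which a real method would safeguard as described):

```python
def model_step_length(grad, dx, Hdx):
    # alpha minimizing the local quadratic model
    #   J(alpha) = f(x) + alpha * g'dx + (alpha^2 / 2) * dx'H dx,
    # which is only valid when the model is convex along dx (dx'H dx > 0).
    gTdx = sum(g * d for g, d in zip(grad, dx))
    dTHd = sum(d * h for d, h in zip(dx, Hdx))
    if dTHd <= 0.0:
        # Negative curvature: the model says the step should be very long;
        # a real method safeguards this (e.g. halve until descent).
        return None
    return -gTdx / dTHd

# On the quadratic from earlier (grad = [13,13], dx = -grad, H dx = A dx):
alpha = model_step_length([13.0, 13.0], [-13.0, -13.0], [-39.0, -39.0])
```

For a quadratic this alpha is exact: it is the classic steepest-descent step length g'g / (g'Ag), here 1/3.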
As a brief aside, the full Hessian is rarely, if ever, computed. Hessian-vector products are enough, which allows the problem to scale to really anything that a gradient descent method can scale to.
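One standard way to get a Hessian-vector product without a Hessian is a finite difference of gradients, H v ~= (grad f(x + eps v) - grad f(x)) / eps; automatic differentiation gives an exact version, but this sketch shows the idea:

```python
def hessvec(grad_f, x, v, eps=1e-6):
    # Forward-difference approximation of the Hessian-vector product:
    # H v ~= (grad f(x + eps*v) - grad f(x)) / eps.
    # Costs one extra gradient evaluation; the Hessian is never formed.
    gx = grad_f(x)
    gxe = grad_f([xi + eps * vi for xi, vi in zip(x, v)])
    return [(ge - g) / eps for ge, g in zip(gxe, gx)]

# Example: gradient of the quadratic from this thread, grad f(x) = A x - b.
def grad_quadratic(x):
    return [2.0 * x[0] + x[1] - 3.0, x[0] + 2.0 * x[1] - 4.0]

hv = hessvec(grad_quadratic, [5.0, 6.0], [1.0, 0.0])  # ~= first column of A
```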
As one final comment, the angle observation that you make in the blog post is important. It comes in a different form when proving convergence of methods, which can be seen in Theorem 3.2 within Numerical Optimization, which uses expression 3.12. Essentially, to guarantee convergence, the angle between the gradient descent direction and whatever we choose must be controlled.
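In code, that angle condition is just a cosine check between the chosen direction and the negative gradient (the lower bound delta below is a hypothetical constant, not a value from the book):

```python
import math

def cos_to_steepest_descent(grad, d):
    # cos(theta) between d and -grad; convergence proofs typically require
    # this to stay bounded away from zero: cos(theta) >= delta > 0.
    gTd = sum(g * di for g, di in zip(grad, d))
    ng = math.sqrt(sum(g * g for g in grad))
    nd = math.sqrt(sum(di * di for di in d))
    return -gTd / (ng * nd)
```

A direction with cos(theta) near zero is nearly orthogonal to the gradient and makes almost no first-order progress, which is why the angle must be controlled.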
I intend to write more about what I learn in this area and I'd be honoured if you would contribute like you did here with your comments/corrections and suggestions! Thank you for the help and the reference, I will definitely be following up.
Nature of the beast it seems but still kind of a pain.
What am I missing here? This is straight up the definition of the gradient.