
Why “Gradient” Descent? - JunaidB
https://scienceofdata.org/2019/11/24/why-gradient-descent/
======
smallcharleston
The reason the gradient is preferred (as I understand it) comes down to
computational considerations. Assuming you can actually compute both of
them, a Newton method (uses the gradient and the derivative of the gradient,
the Hessian) usually has faster convergence (quadratic instead of linear).
However that Hessian can be big and difficult to compute, and then you need a
linear solve with it.
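
A minimal sketch of the linear-versus-quadratic gap, assuming a hypothetical
1-D objective f(x) = exp(x) - 2x (minimum at ln 2), a fixed step size of 0.1
for gradient descent, and a tolerance of 1e-10 on |f'(x)|. None of these
choices come from the article. In one dimension the "linear solve" with the
Hessian is just a division.

```python
import math

def f_prime(x):
    return math.exp(x) - 2.0    # derivative of the hypothetical f(x) = exp(x) - 2x

def f_second(x):
    return math.exp(x)          # second derivative, the 1-D "Hessian"

def iterations_to_converge(step, x0=1.0, tol=1e-10, max_iter=10_000):
    """Count iterations until |f'(x)| < tol using the given update rule."""
    x = x0
    for k in range(max_iter):
        if abs(f_prime(x)) < tol:
            return k, x
        x = step(x)
    return max_iter, x

# Gradient descent with a fixed (assumed) step size of 0.1.
gd_iters, gd_x = iterations_to_converge(lambda x: x - 0.1 * f_prime(x))

# Newton's method: scale the gradient by the inverse second derivative.
newton_iters, newton_x = iterations_to_converge(
    lambda x: x - f_prime(x) / f_second(x)
)

print("gradient descent iterations:", gd_iters)
print("Newton iterations:", newton_iters)
```

Newton needs far fewer iterations here, but each one costs a second
derivative, which is the trade-off described above.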

However in most ML applications, you don’t even have the full gradient. You
have a stochastic estimate since your loss is generally additive in the data.
So you’re not (as far as I know) even going to bother trying to form a
Hessian. I believe many have investigated quasi-Newton methods based on
estimated gradients but I haven’t investigated that thoroughly.

~~~
chestervonwinch
> You have a stochastic estimate since your loss is generally additive in the
> data.

Just to be clear, an additive loss doesn't _imply_ a stochastic gradient
estimate. Rather, because the loss function is additive, stochastic gradient
estimates of the loss become possible. But this of course does not mean one
has to use stochastic gradient estimates.
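
A minimal sketch of why additivity makes this possible, assuming a
hypothetical least-squares loss on synthetic data (the data, sizes and names
below are illustrative only). Each minibatch gradient is an estimate of the
full gradient, and averaging the estimates over an equal-size partition of
the data recovers the full gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical problem: loss(w) = mean_i 0.5 * (x_i . w - y_i)^2
n, d = 1000, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)
w = np.zeros(d)

def grad_on(idx, w):
    """Gradient of the mean loss restricted to the examples in idx."""
    residual = X[idx] @ w - y[idx]
    return X[idx].T @ residual / len(idx)

# "Proper" gradient: uses every example.
full_grad = grad_on(np.arange(n), w)

# Stochastic estimates: one gradient per minibatch of 50 examples.
batches = np.split(rng.permutation(n), 20)
minibatch_grads = np.array([grad_on(b, w) for b in batches])

# Because the loss is a sum over examples, the estimates average out
# to the full gradient over an equal-size partition of the data.
mean_of_minibatch_grads = minibatch_grads.mean(axis=0)

print(full_grad)
print(mean_of_minibatch_grads)
```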

It's just that it's easier to update and monitor progress this way, rather
than computing the gradient term for every single example in the training set
and _then_ taking a descent step. The surprising thing is that stochastic
gradient descent converges quickly in practice relative to proper gradient
descent. All of the justification and whatnot for SGD for ML is largely post-
hoc because it works so unreasonably well and is so intuitive to anyone having
taken calculus.

The other aspect (with respect to the context of optimization in machine
learning) is that this optimization is performed over a loss over a _training_
dataset for which you really don't even want convergence to an exact minimum
of the training loss. What you really care about is the expected
generalization loss. Convergence to the exact minimum of the training loss
doesn't necessarily guarantee the best generalization loss. I mention this
because it contributes to the general aloofness towards optimization
convergence rates in ML.

> I believe many have investigated quasi-Newton methods based on estimate
> gradients but I haven’t investigated that thoroughly.

Until semi-recently, quasi-Newton was not explored in the stochastic setting
because of the question of how to extend the Wolfe conditions to this arena.
There's been a bit of work on this [1], but I don't think it's caught on
outside of the optimization community (not that it necessarily _should_
considering the points above).

[1]: [https://arxiv.org/abs/1401.7020](https://arxiv.org/abs/1401.7020)

~~~
JunaidB
Your point about convergence to the exact minimum of the training loss not
guaranteeing the best generalization loss reminds me of the point made in this
lecture here
[https://www.youtube.com/watch?v=k3AiUhwHQ28](https://www.youtube.com/watch?v=k3AiUhwHQ28).

You also made an interesting comment about work not catching on outside of the
optimization community - can you recommend some resources or websites to
follow in order to see what the optimization community is working on? I've
developed an interest in the area but don't really know where to go for "up to
date" information.

~~~
smallcharleston
I’m not that other guy and I also haven’t read this paper but it seems quite
thorough

[https://arxiv.org/abs/1606.04838](https://arxiv.org/abs/1606.04838)

~~~
JunaidB
This seems like an excellent review. I'll check it out. Thanks very much!

------
timerol
I always find it interesting to see the different paths people take toward
learning the same thing. When I first did multivariable calculus, I learned
that the gradient points uphill, and the negative of the gradient points
downhill. I'm definitely a spatial learner, and mostly thought of surfaces the
way one walks over hills. The idea of using gradient descent to find a local
minimum is the simplest part of neural networks to me.

It's interesting to see someone first write an article about nearest neighbor
classifiers (a topic I really don't know much about), and then, 2-3 months
later, figure out why we use gradient descent.

~~~
nodemaker
Yeah, that's literally what the gradient is defined as: the rate of change.
Not trying to be snarky, but to me this means not taking the time to learn the
basics before jumping into far, far more advanced concepts. Sadly this seems
to be a pattern in a lot of machine learning curricula today.

~~~
fxtentacle
When private schools offer a 1-year course to become a "Machine Learning
Consultant" with no prior mathematics or programming knowledge required, you
know that something has to be off ...

------
kxyvr
I think there may be some confusion as to the terminology in the article. The
gradient does _not_ necessarily lead to the greatest decrease in the function.
Specifically, if our function of interest is f and we're currently at the
point x, then f(x - alpha grad f(x)) for some alpha may or may not be less
than f(x + beta dx) for a different dx and beta.

For example, consider the quadratic f(x) = 0.5 x' A x - b' x where ' denotes
transpose and we have implicit multiplication. Let A = [2 1;1 2], b = [3; 4],
x = [5;6]. This is convex and quadratic with a global minimum at inv(A) b =
[2/3;5/3]. Now, grad f(x) = A x - b = [13;13] and taking the step [-13;-13]
will not get us there. However, dx_newton = -inv(hess f(x)) grad f(x) =
[-13/3;-13/3], which brings us to the global minimum in a single step.
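
The numbers in this example can be checked with a short NumPy sketch (variable
names are illustrative). One caveat worth stating: for these particular values
the gradient [13;13] happens to be an eigenvector of A, so the Newton step is
the negative gradient scaled by 1/3. The contrast below is therefore between
the Newton step and a unit step along the negative gradient, not between two
different directions.

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 2.0]])
b = np.array([3.0, 4.0])
x = np.array([5.0, 6.0])

def f(x):
    """The quadratic f(x) = 0.5 x' A x - b' x."""
    return 0.5 * x @ A @ x - b @ x

x_star = np.linalg.solve(A, b)          # global minimizer, inv(A) b
grad = A @ x - b                        # gradient at x
dx_newton = -np.linalg.solve(A, grad)   # Newton step; hess f(x) = A

x_after_newton = x + dx_newton
x_after_unit_gradient_step = x - grad   # alpha = 1 along the negative gradient

print("minimizer:", x_star)
print("gradient:", grad)
print("Newton step:", dx_newton)
print("f at x, after Newton, after unit gradient step:",
      f(x), f(x_after_newton), f(x_after_unit_gradient_step))
```

With a step length of 1 the gradient step overshoots and f increases, which
is the practical reason a line search or trust region is needed.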

The value in gradient descent is that, combined with an appropriate
globalization technique such as a trust region or a line search, it guarantees
convergence to a local minimum. Newton's method does not unless close enough to
the minimum. As such, most good, fast optimization algorithms based on
differentiable functions use the steepest descent direction as a metric, or
fallback, to guarantee convergence and then use a different direction, most
likely a truncated-Newton method, to converge quickly. Meaning, the gradient
descent direction rarely leads to the greatest decrease. Unless, of course, we
want to make an argument in an infinitesimal sense, which is fine, but I'd
denote that explicitly.

~~~
JunaidB
That's a great point and to be honest I could have been a lot tighter with the
terminology. Good advice to take on board for next time - thanks!

Your point about combining optimisation techniques is interesting and I'd love
to learn about it a little more. When you say "As such, most good, fast
optimization algorithms based on differentiable functions use the steepest
descent direction as a metric, or fallback, to guarantee convergence and then
use a different direction, most likely a truncated-Newton method, to converge
quickly", does this mean that both algorithms are being used together? So
first steepest descent is run for a few iterations and then the truncated-
Newton method takes over?

If you have some resources where I could read up on this it would be much
appreciated!

~~~
kxyvr
Though I have complaints with it, Numerical Optimization by Nocedal and Wright
is probably the best reference for modern optimization techniques. My
complaint with it is that they also present many historical techniques that I
would argue should not be used and don't provide clear guidance as to what are
the modern, robust algorithms. And, to be sure, arguments can be made for all
sorts of algorithms, but I will contend: (unconstrained) trust-region
Newton-CG [Algorithm 7.2 in Numerical Optimization], (equality) composite-step
SQP method [Algorithm 15.4.1 in Trust-Region Methods by Conn, Gould, and
Toint], (inequality) NITRO interior point algorithm [Algorithm 19.4 in
Numerical Optimization], (equality and inequality) combination of the above.
There are
many implementation nuances with these algorithms and they can be made better
than their presentation, but I believe them to be a good starting point for
modern, fast algorithms.

As far as switching back and forth between the Newton and gradient descent
steps, this is largely done in a class of algorithms called dogleg methods.
Essentially, the Newton step is tried against some convergence criterion. If
it satisfies this criterion, it takes a step. If not, it reduces itself until
eventually it assumes the gradient descent step. I'll contend that
truncated-CG (Steihaug-Toint CG) does this, but better. Essentially, it's a
modified
conjugate gradient algorithm to solve the Newton system that maintains a
descent direction. The first Krylov vector this method generates is the
gradient descent step, so it eventually reduces to this step if convergence
proves difficult.
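
A rough sketch of the truncated-CG idea, assuming the usual textbook shape of
the Steihaug-Toint iteration (function and argument names here are my own, and
a production implementation would add preconditioning and a relative
tolerance). It only needs Hessian-vector products, and its first search
direction is the negative gradient.

```python
import numpy as np

def _to_boundary(p, d, radius):
    """Positive tau such that ||p + tau * d|| = radius (assumes ||p|| < radius)."""
    a = d @ d
    b = 2.0 * (p @ d)
    c = p @ p - radius**2
    return (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

def steihaug_cg(grad, hess_vec, radius, tol=1e-10, max_iter=100):
    """Approximately minimize grad'p + 0.5 p'Hp subject to ||p|| <= radius."""
    p = np.zeros_like(grad)
    r = grad.copy()           # residual of the Newton system H p = -grad
    d = -r                    # first direction: steepest descent
    if np.linalg.norm(r) < tol:
        return p
    for _ in range(max_iter):
        Hd = hess_vec(d)
        curvature = d @ Hd
        if curvature <= 0:
            # Negative curvature: follow d out to the trust-region boundary.
            return p + _to_boundary(p, d, radius) * d
        alpha = (r @ r) / curvature
        p_next = p + alpha * d
        if np.linalg.norm(p_next) >= radius:
            # The CG step would leave the trust region: stop on the boundary.
            return p + _to_boundary(p, d, radius) * d
        r_next = r + alpha * Hd
        if np.linalg.norm(r_next) < tol:
            return p_next
        beta = (r_next @ r_next) / (r @ r)
        d = -r_next + beta * d
        p, r = p_next, r_next
    return p

# Usage on the quadratic from earlier in the thread.
A = np.array([[2.0, 1.0], [1.0, 2.0]])
g = np.array([13.0, 13.0])
step_large = steihaug_cg(g, lambda v: A @ v, radius=100.0)  # roomy region
step_small = steihaug_cg(g, lambda v: A @ v, radius=1.0)    # tight region
print(step_large, step_small)
```

With a roomy trust region the result is the Newton step; with a tight one the
iteration stops on the boundary along the negative gradient, which is the
fallback behaviour described above.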

More broadly, there's a question of whether all of the trouble of using
second-order information (Hessians) is worth it away from the optimal
solution. I will contend, strongly, yes. I base this on experience, but there
are some simple thought experiments as well. For example, say we have the
gradient descent direction. How far should we travel in this direction?
Certainly, we can conduct a line-search or play with a "learning parameter".
Also, if you do this, please use a line-search because it will provide vastly
better convergence guarantees and performance. However, if we have the second
derivative, we have a model to determine how far we need to go. Recall, a
Taylor series tells us that f(x + dx) ~= f(x) + grad f(x)'dx + 0.5 dx' hess
f(x) dx. We can use this to figure out how far to travel in this direction
where we try to find an optimal alpha such that J(alpha) = f(x + alpha dx) ~=
f(x) + alpha grad f(x)'dx + (alpha^2/2) dx' hess f(x) dx. If dx' hess f(x) dx
> 0, the model is convex in alpha and we can simply look for when J'(alpha) =
0, which occurs when alpha = -grad f(x)' dx / (dx' hess f(x) dx). When dx'
hess f(x) dx < 0, this implies that we should take a really long step as this
is predicting the slope will be even more negative in this direction the
farther we go. Though both methods must be safeguarded (the easiest is to just
halve the step if we don't get descent), the point is that the Hessian
provides information that the gradient did not and this information is useful.
This is only one place where this information can be used; others include the
direction calculation itself, which is what truncated-CG does.
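
The step-length formula can be sketched as follows, assuming the quadratic
model J(alpha) = f(x) + alpha grad f(x)'dx + 0.5 alpha^2 dx' hess f(x) dx. The
function name is hypothetical, and returning None for non-positive curvature
is just one way to signal that the model has no finite minimizer along dx.

```python
import numpy as np

def model_step_length(grad, hess_vec, dx):
    """Minimizer over alpha of alpha * grad'dx + 0.5 * alpha**2 * dx'H dx.

    Returns None when dx'H dx <= 0, where the model is unbounded below
    (or flat) along dx and a safeguarded long step is needed instead.
    """
    curvature = dx @ hess_vec(dx)
    if curvature <= 0:
        return None
    return -(grad @ dx) / curvature

# Usage on the quadratic from earlier in the thread, along the negative gradient.
A = np.array([[2.0, 1.0], [1.0, 2.0]])
b = np.array([3.0, 4.0])
x = np.array([5.0, 6.0])
grad = A @ x - b
dx = -grad

alpha = model_step_length(grad, lambda v: A @ v, dx)
x_new = x + alpha * dx

# A direction of negative curvature (hypothetical Hessian -I) has no minimizer.
alpha_negative_curvature = model_step_length(grad, lambda v: -v, dx)
print(alpha, x_new, alpha_negative_curvature)
```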

As a brief aside, the full Hessian is rarely, if ever, computed. Hessian-
vector products are enough, which allows the problem to scale to really
anything that a gradient descent method can scale to.
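
One way to get Hessian-vector products without forming the Hessian is to
difference the gradient, sketched below. This is an assumption on my part
rather than the method the comment has in mind (automatic differentiation is
the other common route), and the test function and eps value are illustrative.

```python
import numpy as np

def hess_vec_fd(grad_fn, x, v, eps=1e-6):
    """Central-difference approximation to hess f(x) @ v from two gradient calls."""
    return (grad_fn(x + eps * v) - grad_fn(x - eps * v)) / (2.0 * eps)

# Hypothetical test function: f(x) = sum(x**4) / 4 + 0.5 x' A x
A = np.array([[2.0, 1.0], [1.0, 2.0]])

def grad_fn(x):
    return x**3 + A @ x

x = np.array([5.0, 6.0])
v = np.array([1.0, -1.0])

approx = hess_vec_fd(grad_fn, x, v)

# Explicit Hessian, formed here only to check the approximation.
exact = (np.diag(3.0 * x**2) + A) @ v
print(approx, exact)
```

The cost is two gradient evaluations per product, regardless of dimension.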

As one final comment, the angle observation that you make in the blog post is
important. It comes in a different form when proving convergence of methods,
which can be seen in Theorem 3.2 within Numerical Optimization, which uses
expression 3.12. Essentially, to guarantee convergence, the angle between the
gradient descent direction and whatever we choose must be controlled.
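
For reference, a sketch of the quantities involved, written in my own
notation with p_k as the chosen search direction at iterate x_k. The
summability statement is Zoutendijk's condition, which holds under a Wolfe
line search when f is bounded below and has a Lipschitz-continuous gradient.

```latex
\cos\theta_k
  = \frac{-\nabla f(x_k)^{\top} p_k}
         {\lVert \nabla f(x_k) \rVert \, \lVert p_k \rVert},
\qquad
\sum_{k \ge 0} \cos^2\theta_k \, \lVert \nabla f(x_k) \rVert^2 < \infty .
```

If cos(theta_k) stays bounded away from zero, the sum can only be finite when
the gradient norms go to zero, which is the sense in which the angle must be
controlled.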

~~~
JunaidB
Thank you for taking the time to write a thorough and considerate response. I
have been working through Engineering Optimization: Methods and
Applications by Ravindran, Ragsdell and Reklaitis so far but I will spend some
time in the coming few weeks with Nocedal and Wright in accordance with your
recommendation.

I intend to write more about what I learn in this area and I'd be honoured if
you would contribute like you did here with your comments, corrections and
suggestions! Thank you for the help and reference, I will definitely be
following up.

------
FisDugthop
What I find interesting is the implicit assumption that the underlying
function being learned is differentiable or continuous to begin with. That's
not always the case; for example, we often work with "categorically labeled"
discrete binning problems.

~~~
julienreszka
Data normalization is the 101 of any respectable machine learning course.

~~~
jcims
Yeah but it’s still enough to stall and hinder newbies (me) that deal
primarily with categorical data until you can start to intuitively map the
continuous back into the discrete.

Nature of the beast it seems but still kind of a pain.

------
kenferry
> it’s not obvious to me at all that the same direction as the (negative)
> gradient leads to the largest decrease in the value of the function f(x)

What am I missing here? This is straight up the definition of the gradient.

~~~
chestervonwinch
The gradient is defined as a limit. That it points to the direction of
greatest increase is a consequence of the definition, not the definition
itself.
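
A short sketch of that consequence, assuming f is differentiable at x and u
is a unit vector (notation mine). The directional derivative is the limit,
and the Cauchy-Schwarz inequality bounds it by the gradient norm:

```latex
D_u f(x)
  = \lim_{h \to 0} \frac{f(x + h u) - f(x)}{h}
  = \nabla f(x)^{\top} u
  \le \lVert \nabla f(x) \rVert \, \lVert u \rVert
  = \lVert \nabla f(x) \rVert ,
\qquad
\text{with equality iff } u = \frac{\nabla f(x)}{\lVert \nabla f(x) \rVert}
\quad (\nabla f(x) \ne 0).
```

Replacing u by -u gives the matching statement for the direction of greatest
instantaneous decrease.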

------
pixelpoet
Small side note, don't forget to escape your trig functions in TeX! Else it
will render your "cos" in italics (product of variables c, o and s) as in
this article, and thousands of others; the only thing more surprising than
people not noticing this is TeX not giving a warning about it.
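
For anyone unsure what escaping means here, a minimal example. The last line
assumes the amsmath package is loaded; the first two work in plain LaTeX.

```latex
% Typeset as the product of three italic variables c, o and s:
$cos(\theta)$

% Typeset as the upright operator name:
$\cos(\theta)$

% For names without a built-in macro (assumes \usepackage{amsmath}):
$\operatorname{argmin}_x f(x)$
```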

------
alimw
Please try this exercise and report back :) Suppose it happened that when
setting up your parameter space you found yourself working with ξ and η
instead of x and y, where the relationship is given simply by (ξ, η) = A (x,
y) for A an invertible linear mapping (2×2 matrix). This could easily happen
in practice. Is gradient descent in (ξ, η) the same procedure as gradient
descent in (x, y)? What should we make of any difference?
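
Spoiler for the exercise: a numerical sketch, assuming a hypothetical
objective f(x, y) = x^2 + 10 y^2, a hypothetical diagonal A and a step size
of 0.05. By the chain rule, one gradient step taken in (ξ, η) and mapped back
is a step along inv(A' A) grad f in (x, y), not along grad f.

```python
import numpy as np

def grad_f(p):
    """Gradient of the hypothetical f(x, y) = x^2 + 10 y^2."""
    x, y = p
    return np.array([2.0 * x, 20.0 * y])

A = np.array([[1.0, 0.0], [0.0, 3.0]])   # (xi, eta) = A (x, y), hypothetical
A_inv = np.linalg.inv(A)

def grad_g(q):
    """Gradient of g(q) = f(A_inv q) with respect to q = (xi, eta)."""
    return A_inv.T @ grad_f(A_inv @ q)

p0 = np.array([1.0, 1.0])
alpha = 0.05

# One gradient descent step taken directly in (x, y).
step_in_xy = p0 - alpha * grad_f(p0)

# One step taken in (xi, eta), then mapped back to (x, y).
step_in_xieta = A_inv @ (A @ p0 - alpha * grad_g(A @ p0))

# Chain-rule prediction: the same step preconditioned by inv(A' A).
predicted = p0 - alpha * np.linalg.inv(A.T @ A) @ grad_f(p0)

print(step_in_xy, step_in_xieta, predicted)
```

The two procedures agree only when A' A is a multiple of the identity, so
gradient descent depends on the choice of coordinates. Newton's method does
not, which is one way to read the difference.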

------
mdonahoe
This article made me wonder if it is worth going in the positive direction on
occasion, just to check that it gets worse.

