
More Descent, Less Gradient - cantdutchthis
https://koaning.io/posts/more-descent-less-gradient/
======
benrbray
The author of this post has rediscovered some fundamental ideas in
optimization just by tinkering around! Neat!

> The objective function for linear regression is quadratic (and convex). This
> explains why Newton's method finds the optimum in one step--the local
> quadratic approximation turns out to be a global approximation!

> Notice, however, that Newton's method requires us to invert the Hessian
> matrix (f''), which is expensive in general. I don't think autograd software
> can get around this limitation.

> The KeepStepping optimizer is performing "gradient descent with line
> search", which is a more Googleable term.

For more about optimization for solving linear systems (as is the case for
linear regression), I recommend Shewchuk 1994, "Conjugate Gradient without the
Agonizing Pain" [1], which has some nice geometric insight.

[1]: http://www.cs.cmu.edu/~quake-papers/painless-conjugate-gradient.pdf
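
To make the "one step" point concrete, here is a minimal numpy sketch (my
own, not from the post): because the least-squares objective is exactly
quadratic, a single Newton step from any starting point lands on the optimum.

    # One Newton step on f(w) = 0.5 * ||Xw - y||^2 lands exactly on the
    # least-squares solution, since the Hessian X^T X is constant and the
    # local quadratic model is the function itself.
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

    w = np.zeros(3)                       # arbitrary starting point
    grad = X.T @ (X @ w - y)              # gradient of the objective
    hess = X.T @ X                        # Hessian (constant in w)
    w_newton = w - np.linalg.solve(hess, grad)

    print(np.allclose(w_newton, np.linalg.lstsq(X, y, rcond=None)[0]))  # True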

~~~
LudwigNagasena
> The author of this post has rediscovered some fundamental ideas in
> optimization just by tinkering around! Neat!

I checked the guy's LinkedIn; he has five years of econometrics and data
science education. I am honestly perplexed how anything in this article could
come off as new to someone with that background.

~~~
currymj
totally possible to learn a lot of economics and statistics and never have
formal coursework in numerical methods for optimization.

~~~
benrbray
Seems to be the case here. Regardless of the author's educational background,
kudos to them for continuing to dive deeper into the math. After a math degree
and three years of graduate school, I still run into blind spots almost every
day, for topics that people across the hall from me would consider elementary.

------
rsp1984
_You can apply calculus to estimate the stepsize you need for the current
iteration. If you do this in a principled way then my results suggests that
for linear regression you may only need one gradient calculation._

Yes, it's called Linear Least Squares. Gauss discovered it in 1795.

~~~
19f191ty
It's Newton's method that he rediscovers. Are people not taught numerical
methods these days? We learned it quite early, in the much more general
context of finding zeros.

------
zmk_
The author rediscovered Newton's method. Next in line is the realisation that
the Hessian can be approximated, and then we will see the rediscovery of
BFGS, L-BFGS, etc.

~~~
gspr
It reminds me of this story from a decade back:
[https://science.slashdot.org/story/10/12/06/0416250/medical-researcher-rediscovers-integration](https://science.slashdot.org/story/10/12/06/0416250/medical-researcher-rediscovers-integration)

I somehow find it more worrisome that someone who works as a "research
advocate", and previously as a "data scientist", thinks he's had novel
insight into optimization over the weekend than that a medical doctor can't
recognize that he's rediscovered basic numerical integration.

~~~
gunshai
That is hilarious.

"The total area under a curve is computed by dividing the area under the curve
between two designated values on the X-axis (abscissas) into small segments
(rectangles and triangles) whose areas can be accurately calculated from their
respective geometrical formulas."

~~~
gspr
It's a shame that it only works for glucose curves. Imagine if someone could
generalize it :-)

------
TFortunato
If you are interested in this kind of thing, there are a whole host of
"derivative free" optimization methods out there. One in particular that I've
seen used at a previous job, the Nelder-Mead (downhill simplex) method,
always struck me as pretty elegant and intuitive once you've seen it
explained.
[https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method](https://en.wikipedia.org/wiki/Nelder%E2%80%93Mead_method)
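
For a feel of how little it asks of the objective, here's a minimal scipy
sketch (my own, not from the article): Nelder-Mead only ever evaluates the
function, with no gradients anywhere.

    # Minimizing the Rosenbrock function with Nelder-Mead, using function
    # evaluations only (no derivatives).
    import numpy as np
    from scipy.optimize import minimize

    def rosenbrock(p):
        x, y = p
        return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

    result = minimize(rosenbrock, x0=np.array([-1.0, 2.0]),
                      method="Nelder-Mead")
    print(result.x)  # close to the minimum at [1, 1]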

~~~
twic
There's a family of such methods invented by Michael Powell:

[https://en.wikipedia.org/wiki/PDFO](https://en.wikipedia.org/wiki/PDFO)

From what I remember, they are "trust region" methods, which work by fitting
a quadratic function to the local neighbourhood and then analytically finding
the minimum of that. The clever stuff is in how you form and update the trust
region efficiently.

~~~
Enginerrrd
Powell's methods and Nelder-Mead are often the only things I can get to work
when the quantity being optimized is derived from integrating a system of
differential equations. Not sure why that is.

------
rotskoff
As many have pointed out, this rediscovers Newton's method. The reason this
type of approach with Hessian pre-conditioning is not widely used in practice
is that computing the Hessian is costly. Avoiding that additional computation
is the idea underlying "quasi-Newton" methods like BFGS and (more loosely)
popular methods like Adagrad.

------
nestorD
Many comments here are about the first section (which is indeed rediscovering
the basics of optimisation), but I find the second section much more
interesting.

He notes that computing the gradient is much more expensive than doing a
single evaluation, and that the gradient tends to stay in the same overall
direction. Thus he proposes to compute the gradient only a fraction of the
time and to use normal evaluations in the meantime.

Following an old gradient and only updating it when it becomes clearly
obsolete seems like an interesting idea to me.
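
A rough sketch of how I read that idea (my own pseudo-implementation, not the
author's code): keep stepping along the last computed gradient while the loss
keeps improving, and only pay for a fresh gradient when progress stalls.

    # Reuse a stale gradient direction; cheap loss evaluations decide when it
    # has become obsolete and needs to be recomputed.
    import numpy as np

    def lazy_gradient_descent(loss, grad, w, lr=0.1, n_iters=100):
        direction = grad(w)                  # one "expensive" gradient
        for _ in range(n_iters):
            candidate = w - lr * direction
            if loss(candidate) < loss(w):    # cheap evaluation only
                w = candidate
            else:
                direction = grad(w)          # gradient is stale: refresh it
        return w

    # example on a simple quadratic bowl
    loss = lambda w: float(np.sum(w ** 2))
    grad = lambda w: 2 * w
    print(lazy_gradient_descent(loss, grad, np.array([3.0, -4.0])))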

~~~
cl3misch
> Following an old gradient and only updating when it becomes clearly obsolete

This sounds exactly like a line search:
[https://en.m.wikipedia.org/wiki/Line_search](https://en.m.wikipedia.org/wiki/Line_search)

------
kragen
Doesn't the Hessian take O(N²) space in the number of independent variables?
Like, if you have one dependent variable and a million independent variables,
isn't the Hessian a trillion-element matrix, albeit a symmetric one?

~~~
xscott
There are "limited memory" algorithms which use low rank approximations.

[https://en.wikipedia.org/wiki/Limited-memory_BFGS](https://en.wikipedia.org/wiki/Limited-memory_BFGS)
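
A hedged scipy sketch (the toy problem and parameters are mine): L-BFGS-B
keeps only a handful of recent gradient/step pairs, so it never stores a full
Hessian even with many variables.

    import numpy as np
    from scipy.optimize import minimize

    def f(w):
        return np.sum((w - 3.0) ** 2)

    def grad_f(w):
        return 2.0 * (w - 3.0)

    # 1000 variables, but only ~10 stored correction pairs for curvature
    result = minimize(f, x0=np.zeros(1000), jac=grad_f, method="L-BFGS-B",
                      options={"maxcor": 10})
    print(result.x[:3])  # close to [3, 3, 3]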

~~~
jleahy
There are also techniques that only require the product of the Hessian with
some arbitrary vector, which is itself just a vector. That can be cheap to
compute for some problems.
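
For what it's worth, autodiff makes this easy; a hedged JAX sketch (my
example, not the parent's): a Hessian-vector product via forward-over-reverse
differentiation, without ever materializing the Hessian.

    import jax
    import jax.numpy as jnp

    def f(w):
        return jnp.sum(jnp.sin(w) ** 2)

    def hvp(f, w, v):
        # differentiate grad(f) along the direction v: returns H(w) @ v
        return jax.jvp(jax.grad(f), (w,), (v,))[1]

    w = jnp.arange(4.0)
    v = jnp.ones(4)
    print(hvp(f, w, v))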

------
jl2718
[https://en.wikipedia.org/wiki/Wolfe_conditions](https://en.wikipedia.org/wiki/Wolfe_conditions)

This is what you want for the step length in a line search for most smooth,
continuous, unconstrained optimization problems. It's really a lot more
important than I had expected. The post is about linear least squares, which
is a special case.
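
A hedged scipy sketch (my example): line_search picks a step length along a
given descent direction that satisfies the (strong) Wolfe conditions.

    import numpy as np
    from scipy.optimize import line_search

    f = lambda w: np.sum(w ** 2)
    grad = lambda w: 2 * w

    w = np.array([2.0, -1.0])
    direction = -grad(w)                 # steepest descent direction
    alpha = line_search(f, grad, w, direction)[0]
    w_next = w + alpha * direction
    # the exact minimizer along this direction is alpha = 0.5; the Wolfe
    # search returns an acceptable step near it
    print(alpha, w_next)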

------
cultus
A lot of adaptive GD algorithms work by roughly approximating the inverse
Hessian, which is usually too expensive to compute at every step. Adagrad (my
go-to) essentially finds a diagonal approximation to the Hessian, which often
works well in practice.

There are fancier techniques that take low-rank approximations and that sort
of thing, but I feel like they're usually more trouble than they're worth.
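
For reference, the Adagrad update itself is tiny; a hedged sketch (my own):
per-coordinate step sizes scaled by the accumulated squared gradients, a
cheap diagonal stand-in for curvature.

    import numpy as np

    def adagrad(grad, w, lr=1.0, n_iters=100, eps=1e-8):
        accum = np.zeros_like(w)
        for _ in range(n_iters):
            g = grad(w)
            accum += g ** 2                     # running sum of squared grads
            w = w - lr * g / (np.sqrt(accum) + eps)
        return w

    # badly scaled quadratic, where per-coordinate scaling helps
    grad = lambda w: np.array([100.0, 1.0]) * w
    print(adagrad(grad, np.array([1.0, 1.0])))  # close to the origin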

------
lidHanteyk
I've seen this before; it was called "over-relaxing the step", I think. The
idea is that even if a step is too big, we can still show that the step was in
the right direction. That's the same as the author's technique of taking
multiple normal-sized steps in a single direction.

I don't know the name for the other part of the author's technique, where
several Newton-Raphson steps are taken, but I've seen it before, too.
Crucially, the author takes ratios of first and second derivatives, rather
than of the original function and its first derivative, which is the right
thing to do for minimization but is probably more expensive than the original
gradient computation.
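
A small 1-D illustration of that distinction (my own example): Newton-Raphson
root finding steps by f/f', while Newton's method for minimization steps by
f'/f'', i.e. it finds a root of the derivative.

    def newton_minimize(df, d2f, x, n=10):
        for _ in range(n):
            x = x - df(x) / d2f(x)   # ratio of first and second derivatives
        return x

    # f(x) = (x - 2)^2 + 1 has no root, but its minimum is at x = 2
    df = lambda x: 2 * (x - 2)
    d2f = lambda x: 2.0
    print(newton_minimize(df, d2f, x=10.0))  # 2.0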

------
abeppu
For the first section, where he uses a Hessian and finds he really only needs
one step for linear regression -- isn't he basically doing Newton's method,
and it works more or less perfectly because with the usual squared error term
(L2 norm) his problem is exactly quadratic? I haven't unpacked his math
carefully, but I think (quasi-)Newton methods are often worth a look if (a)
your dimension is small enough that the D^2 Hessian is OK, and (b) your data
is small enough that you can afford to use exact derivatives rather than
mini-batch estimates.

~~~
nil-sec
Actually there is no need for gradient descent at all if you have a linear
regression problem, because you can find the min/max exactly by solving the
normal equations (using the pseudo-inverse if the design matrix is
degenerate). Additionally, second-order optimisation, i.e. using the Hessian
in addition to the gradient, is a well studied problem in the literature. My
understanding is that it is in general not worth it because calculating the
Hessian is very expensive (see e.g.
[https://arxiv.org/abs/2002.09018](https://arxiv.org/abs/2002.09018)).
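
A minimal numpy sketch of that direct route (my example, not from the post):
solve the same regression with a least-squares solver or the Moore-Penrose
pseudo-inverse, no iterations at all.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 5))
    y = X @ rng.normal(size=5) + 0.05 * rng.normal(size=200)

    w_pinv = np.linalg.pinv(X) @ y                  # Moore-Penrose pseudo-inverse
    w_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]  # preferred in practice
    print(np.allclose(w_pinv, w_lstsq))             # True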

~~~
James_Henry
It seems to me that gradient descent would often be a better solution than
the Moore-Penrose inverse. Is it not more efficient in most cases?

~~~
zwaps
I always thought you'd use Gram–Schmidt for linear regression

[https://en.m.wikipedia.org/wiki/Gram–Schmidt_process](https://en.m.wikipedia.org/wiki/Gram–Schmidt_process)
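
Gram–Schmidt is one way to compute a QR factorization, and QR is indeed a
standard, numerically stable route to least squares; a hedged numpy/scipy
sketch (my own example):

    import numpy as np
    from scipy.linalg import solve_triangular

    rng = np.random.default_rng(2)
    X = rng.normal(size=(50, 3))
    y = rng.normal(size=50)

    Q, R = np.linalg.qr(X)             # X = QR with orthonormal columns in Q
    w = solve_triangular(R, Q.T @ y)   # back-substitution on R w = Q^T y
    print(np.allclose(w, np.linalg.lstsq(X, y, rcond=None)[0]))  # True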

------
sandoooo
Let's say I have a hyperparameter optimization task where I have to tune a
simulation to some spec by varying 2-4 input parameters, and the output is a
single number. I have no analytical gradient, though it's probably OK to
assume the domain is smooth. Each sim takes hours to run, the entire search
could take days, and I would like something that works well in parallel so I
can speed up the search. What's the state of the art here? Is there anything
close to state-of-the-art that's usable out of the box? I've read a few
papers but they don't tend to come with software.

~~~
mrdmnd
I think Bayesian optimization is what you're looking for here. There's an
internal tool at Google ("Vizier") for which a white paper has been published
that solves this exact problem. I don't know if there are any public
implementations of Vizier, but you could probably reverse engineer some of it
from the white paper:
[https://static.googleusercontent.com/media/research.google.c...](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46180.pdf)
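
If it helps, here's a hedged sketch using the scikit-optimize package (my
suggestion, assuming it's available; it is not Vizier): Gaussian-process
Bayesian optimization of an expensive black box over a small box of
parameters. `expensive_simulation` is a stand-in for the hours-long sim in
the parent comment.

    from skopt import gp_minimize

    def expensive_simulation(params):
        x, y = params                   # placeholder for the real simulation
        return (x - 0.3) ** 2 + (y + 0.7) ** 2

    result = gp_minimize(
        expensive_simulation,
        dimensions=[(-2.0, 2.0), (-2.0, 2.0)],  # search box for 2 parameters
        n_calls=30,                             # total (expensive) evaluations
        random_state=0,
    )
    print(result.x, result.fun)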

------
ur-whale
Read it hoping the topic was going to be sub-gradient optimization (or any
kind of technique that follows an "interesting" direction when the gradient is
either not available or "boring").

It's unfortunately not presenting anything new.

------
tel
The next step to investigate here is line search.

~~~
stellalo
I think that’s pretty much what he’s doing in the second part of the post

