
Backpropogation Is Just Steepest Descent with Automatic Differentiation (2013) - adamnemecek
https://idontgetoutmuch.wordpress.com/2013/10/13/backpropogation-is-just-steepest-descent-with-automatic-differentiation-2/
======
graycat
I've always wondered how neural network parameter fitting got away with
using just steepest descent: In such fully or largely unconstrained
optimization, going back decades, steepest descent was often seen as poor.
The usual first analogy was finding the bottom of a convex bowl that was
long and narrow. Then steepest descent would keep going downhill essentially
parallel to the short axis of the bowl and make nearly no progress along the
long axis. So, sure, the idea was to make a quadratic approximation to the
bowl, find the orthogonal eigenvectors and eigenvalues, and then make some
rapid progress to the bottom. That could be conjugate gradients or Newton
iteration. Then one could build up an approximation to the quadratic
approximation and get quasi-Newton iteration. And there were other ideas.
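
A minimal numpy sketch of that long, narrow bowl (the quadratic, step size,
and iteration count are made up purely for illustration): steepest descent
settles the steep axis quickly but crawls along the shallow one.

```python
import numpy as np

# Hypothetical ill-conditioned bowl: f(x) = 0.5 * x^T A x, minimum at the origin.
A = np.diag([1000.0, 1.0])     # very steep short axis, very shallow long axis
x = np.array([1.0, 1.0])       # starting point

lr = 0.0018                    # step must stay small to be stable on the steep axis
for _ in range(100):
    grad = A @ x               # gradient of the quadratic
    x = x - lr * grad          # steepest-descent step

print(x)  # steep coordinate is essentially 0; shallow coordinate has barely moved
```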

I would guess that somewhere among the hundreds or more parameters to be
adjusted, there would be at least two that would determine a bowl such as I
described.

But, but, okay, I can guess: The neural network has so many parameters that
not doing well on a few of them doesn't make much difference.

~~~
lawrenceyan
Local minima are basically equivalent to the global minimum in the high
dimensionalities achieved in deep learning. It's not even just a conjecture,
by the way; this has been mathematically proven.

~~~
MereInterest
Is there a name for this theorem? I'd be interested in reading more about it.

~~~
_0ffh
Maybe GP is talking about the hypothesis that in high-dimensional space you
will probably almost always find /some/ direction which decreases the error
function. IIRC this hypothesis is well supported, but not yet proven. So the
real problem may not be local minima but saddle points that just /appear/ to
be minima due to slow optimisation progress. For more details, check out
[https://arxiv.org/abs/1406.2572](https://arxiv.org/abs/1406.2572)
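
A toy illustration of that stall (function and numbers invented): plain
gradient descent near a saddle with one weakly negative curvature direction
can sit on a loss plateau for a long time and look as if it has converged.

```python
import numpy as np

# Hypothetical saddle: f(x, y) = x**2 - 0.01 * y**2
# Strong positive curvature in x, weak negative curvature in y.
def grad(p):
    x, y = p
    return np.array([2.0 * x, -0.02 * y])

p = np.array([1.0, 1e-3])   # start almost exactly on the saddle's stable manifold
lr = 0.1
for step in range(2001):
    if step % 500 == 0:
        x, y = p
        print(step, x**2 - 0.01 * y**2)  # loss flattens near 0, looking like a minimum
    p = p - lr * grad(p)
# Escaping along the y direction takes many more iterations than this.
```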

------
brilee
This title is really confused - backpropagation is one implementation of
automatic differentiation, and yields a gradient. What you do with that
gradient next is up to you - you can just follow it (steepest descent), or try
other fancier second order techniques.
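
A rough sketch of that separation for a tiny made-up one-hidden-layer network
(names, shapes, and step size are all invented): the backward pass is the
chain rule applied layer by layer and produces a gradient; what you do with
that gradient is an independent choice.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(1, 4))   # tiny toy network
x, y = rng.normal(size=(3, 1)), np.array([[1.0]])

# Forward pass
h = np.tanh(W1 @ x)
y_hat = W2 @ h
loss = 0.5 * (y_hat - y) ** 2

# Backward pass: reverse-mode differentiation by hand, i.e. the chain rule
d_yhat = y_hat - y                    # dL/dy_hat
dW2 = d_yhat @ h.T                    # dL/dW2
dh = W2.T @ d_yhat                    # dL/dh
dW1 = (dh * (1 - h ** 2)) @ x.T       # dL/dW1, through the tanh

# What happens next is a separate decision; the simplest choice is a plain
# steepest-descent step, but the same gradients could feed any other update rule.
lr = 0.1
W1 -= lr * dW1
W2 -= lr * dW2
```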

~~~
electricslpnsld
Can you get second order with only the gradient? Even fancy accelerated
gradient descents are still first order. I suppose there are BFGS-like
quasi-Newton methods, but those are still building up estimates of the
second-order info over multiple iterations.

~~~
vbarrielle
You can get closer to second order if your error is a sum of squares of the
form 0.5 * f^T * f, in which case you can compute the Jacobian J of the
fitness f, and approximate your Hessian as J^T * J. That's the Gauss-Newton
method and it's very effective if a linear approximation of f holds locally.

That's an approximation, but it's arguably a second-order method, since it
acknowledges the nonlinearity coming from the squared norm of f.
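
A bare-bones sketch of that Gauss-Newton iteration (the residual model, data,
and starting point are made up; a finite-difference Jacobian stands in for
what an AD tool would provide):

```python
import numpy as np

def residuals(theta, t, y_obs):
    # Toy exponential-decay model: residuals of y_obs - a * exp(-b * t)
    a, b = theta
    return y_obs - a * np.exp(-b * t)

def jacobian(theta, t, y_obs, eps=1e-6):
    # Central-difference Jacobian J of the residuals
    J = np.zeros((len(t), len(theta)))
    for j in range(len(theta)):
        d = np.zeros(len(theta))
        d[j] = eps
        J[:, j] = (residuals(theta + d, t, y_obs) - residuals(theta - d, t, y_obs)) / (2 * eps)
    return J

t = np.linspace(0.0, 2.0, 20)
y_obs = 3.0 * np.exp(-1.5 * t)        # clean synthetic data
theta = np.array([2.0, 1.0])          # initial guess

for _ in range(10):
    f = residuals(theta, t, y_obs)
    J = jacobian(theta, t, y_obs)
    # Gauss-Newton: approximate the Hessian of 0.5 * f^T f by J^T J
    step = np.linalg.solve(J.T @ J, -J.T @ f)
    theta = theta + step

print(theta)   # should approach (3.0, 1.5) on this clean data
```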

------
jefft255
The article seems good (although dated). See
[http://www.jmlr.org/papers/v18/17-468.html](http://www.jmlr.org/papers/v18/17-468.html)
for an in-depth survey from last year.

But reading the title, I was like "well... sure?". Bad title for this article.

------
andbberger
tangent: all the backprop mysticism drives me nuts. it's just the chain
rule...

~~~
bitL
Most people couldn't pass calculus 101, so it's mysticism for them. Operations
research people, on the other hand, must be shaking their heads that somebody
still uses it, and probably can't believe the whole state of the art is based
on their most stupid method.

~~~
jefft255
You know, ML researchers are aware of the enormous body of research in
optimization. A lot of these methods simply do not work or scale well to deep
neural networks. Second order methods have been tried and are not worth the
cost.

~~~
siekmanj
I mean, don’t most researchers use Adam? As far as I know that is a second-
order method.

~~~
jefft255
Nope! Sorry I can’t take more time to explain but there is no second
derivative used in Adam.

~~~
andbberger
From the paper

> We propose Adam, a method for efficient stochastic optimization that only
> requires first-order gradients with little memory requirement. The method
> computes individual adaptive learning rates for different parameters from
> estimates of first and second moments of the gradients

Most of the popular variants of SGD use approximations of the Hessian in one
way or another.
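
For what it's worth, a bare-bones sketch of the Adam update rule from the
paper (default hyperparameters as usually quoted): the "second moment" is a
running average of squared gradients, not a second derivative, so only
first-order information is ever used.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: nothing here touches a Hessian."""
    m = beta1 * m + (1 - beta1) * grad         # first moment: running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2    # second moment: running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)               # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy use: minimize f(theta) = ||theta||^2 given only its gradient 2 * theta
theta = np.array([1.0, -2.0])
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.01)
print(theta)   # ends up close to the minimum at the origin
```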

~~~
jefft255
Not to be pedantic, but I don't know if approximating the Hessian using the
gradient counts as a second-order method. I was talking about "full-blown"
second-order methods where you compute the Hessian through AD.

Furthermore, I don't think that by "moments of the gradients" they actually
mean second derivatives.

Also from the paper: "We introduce Adam, an algorithm for first-order
gradient-based optimization of stochastic objective functions..."

It's written right in the abstract that the authors consider it a first-order
method.

~~~
andbberger
Seems legit

------
m0zg
To a practitioner the title sounds like "water is wet" or "air is breathable".

------
xiaodai
Interesting that at the end of the blog there is mention of native support
for AD appearing in Julia and Swift.

------
conjectures
Here, lemme fix that for you:

'Backpropogation Is Just Steepest Descent'

