
The limitations of gradient descent as a principle of brain function - mpweiher
https://arxiv.org/abs/1805.11851
======
LolWolf
The "money" quote is right here,

> Steepest descent or gradient descent depends on a choice of ruler (or
> Riemannian metric) for the parameter space of interest. The Euclidean metric
> is rarely a natural choice, especially for spaces with dimensions that carry
> different physical units, such as the example shown here.

But I think this, to many ML researchers or biophysicists, is true by what I
lovingly call the NSH[0]. The argument posed (in a slightly anthropomorphized
way) goes as follows:

-

We have some function, f, which we want to minimize/maximize: usually f will
be the negative log of some probability distribution, or whatever. Now, hey,
look, this function could be anything that's differentiable, all we know is
that the minimum of this function is _really_ useful.

But what happens if we replace this function with another function, g, whose
minimum is at the same point x, but for which g(x + ∆) ≈ g(x) even for fairly
large ∆? We run into a problem: minimizing g (which is, admittedly, an
unnatural representation of f) is bad because it's really hard to distinguish
points around x from x itself (which is the point we want) using only gradient
information, since the gradient is very close to zero. In other words, while g
preserves many of the same properties as f (in fact, the only one we care
about, the minimum), the trajectory of gradient descent on g is going to be
mostly pretty dumb compared to what it could be on f.

In the case that f _is_ the negative log-likelihood of some empirical
distribution, then the Fisher information turns out to be a natural choice for
the curvature of the distribution (see Information Geometry, if you're
interested in this idea) and running what is essentially a Newton method on f
will yield similar trajectories to running it on g, which is what we want.
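To sketch the Newton point with the same toy functions as above (again mine, not the paper's): rescaling the gradient by the local curvature undoes the bad representation, so the "flat" g is no longer a problem.

```python
# Newton's method divides the gradient by the second derivative, so it
# recovers fast convergence even on the badly-scaled g(x) = x^6.

def newton(grad, hess, x0, steps=100):
    x = x0
    for _ in range(steps):
        x = x - grad(x) / hess(x)
    return x

# f(x) = x^2: a single Newton step lands exactly on the minimizer.
x_f = newton(lambda x: 2 * x, lambda x: 2.0, x0=1.0, steps=1)

# g(x) = x^6: each Newton step contracts x by a constant factor
# (x - 6x^5 / 30x^4 = 0.8x), so convergence is geometric, not a crawl.
x_g = newton(lambda x: 6 * x**5, lambda x: 30 * x**4, x0=1.0)
```

Compare this with plain gradient descent on g, which stalls: the curvature correction is exactly the "choice of ruler" the quoted passage is talking about.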

-

The paper makes no particularly strong argument of note, except that, yes, of
course, "the brain" doesn't actually "run" gradient descent _as stated on
paper_, which is apparently an assumption some neuroscience models use to make
predictions. This is... not really a claim anyone is arguing for, as far as I
know (the paper cites [12,13], but taking a quick peek, only [12] mentions this
idea directly, and even there it invokes the "natural gradient" argument that
the authors here propose, and only mentions in passing that the mechanism is
"biologically plausible").

\---

[0] More specifically, the No-Shit Hypothesis.

~~~
kurthr
I liked this quote for giving a feel for why the Euclidean metric for the
gradient might be wrong.

[https://hips.seas.harvard.edu/blog/2013/01/25/the-natural-
gr...](https://hips.seas.harvard.edu/blog/2013/01/25/the-natural-gradient/)

"As a concrete example, imagine two people standing on two different mountain
tops. If one of the people is Superman (or anybody else who can fly) then the
distance they would fly directly to the other person is the Euclidean distance
(in R^3). If both people were normal and needed to walk on the surface of the
Earth the Riemannian metric tensor tells us what this distance is from the
Euclidean distance. Don’t take this illustration too seriously since
technically all of this needs to take place in a differential patch (a small
rectangle whose side lengths go to 0)."
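A minimal numeric version of that picture (unit sphere, two points of my own choosing): the Euclidean distance is the straight chord through the sphere, while the on-surface (Riemannian/geodesic) distance is the great-circle arc, which is always at least as long.

```python
import math

# Two "mountain tops" on the unit sphere, 90 degrees apart.
a = (1.0, 0.0, 0.0)
b = (0.0, 1.0, 0.0)

# Superman's distance: the straight chord through R^3.
chord = math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

# Walking distance: the great-circle arc along the surface.
dot = sum(ai * bi for ai, bi in zip(a, b))
arc = math.acos(dot)

print(chord)  # sqrt(2) ≈ 1.414
print(arc)    # pi/2   ≈ 1.571
```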

------
zerostar07
Note that the example they use from Rui Costa et al. 2017 (a great paper btw,
recommended reading) is not any kind of machine learning result; it is a fit
of data from previous experiments showing pre- and post-synaptic changes
during short-term LTP experiments. So I'm not sure what the authors are
rambling on about here.

------
api
I really can't believe anyone argues that brains can just be gradient descent.
It's like people have never heard of a fitness valley or a sparsely connected
graph region.

~~~
mr_toad
I hear the converse (neural networks can’t be like brains because they do use
gradient descent) argued quite a lot.

~~~
api
That is also weird. Brains may use gradient descent, but that can't be the
only way they learn. Gradient descent alone is almost completely incapable of
escaping a local minimum.

~~~
dingo_bat
Can't you run gradient descent at a higher level of abstraction and escape
local minima? People routinely get stuck thinking in a certain way, even when
an observer looking at the problem from afar can tell that they are stuck.

~~~
api
You can if you know what the higher level abstraction is. That kind of thing
doesn't work when your learning algorithm doesn't have a programmer /
mathematician to babysit it who knows more than it does.

... unless God is sitting there hacking our brains' code in real time to kick
us out of local minima. "Damn it... they're stuck on the mind/body dichotomy
again... let me see what happens if I bolt a dimension reduction transform on
there and then run another gradient descent on top. Shit now they're
worshipping cheese... undo, undo, undo..."

------
joe_the_user
I can't see a PDF here, I'm shown a 46 byte file to download.

~~~
jimfleming
The previous version has a PDF[0] but you're right, the current version shows
none.

[0]
[https://arxiv.org/pdf/1805.11851v1.pdf](https://arxiv.org/pdf/1805.11851v1.pdf)

~~~
alexanderchr
The paper was withdrawn with v2

Comments: We were asked by an author of a criticized paper to withdraw the
submission in order to allow them to respond to our criticism in private

~~~
hshehehjdjdjd
Is that commonly done? I would guess this means the withdrawn paper contains
an egregious error. Does anyone know?

------
simonster
tl;dr ordinary gradient descent is sensitive to the parametrization of the
problem but natural gradient is not. This is an important fact (and one that
is fairly well-known within the ML community), but it is not totally clear to
me why it should be particularly relevant to neuroscience.
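A small numeric sketch of exactly that (my toy setup: a Bernoulli likelihood with 3 successes in 10 trials, parametrized either by p or by the logit θ): a plain gradient step taken in θ lands somewhere quite different from the step taken in p, while the natural-gradient steps agree to first order in the step size.

```python
import math

n, k = 10, 3              # toy data: 3 successes in 10 Bernoulli trials
p0, lr = 0.5, 0.01
theta0 = math.log(p0 / (1 - p0))  # logit parametrization of the same point

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def grad_p(p):
    # Gradient of the negative log-likelihood with respect to p.
    return -k / p + (n - k) / (1 - p)

g = grad_p(p0)
grad_theta = g * p0 * (1 - p0)       # chain rule: dp/dtheta = p(1-p)

# Plain gradient steps, taken in each parametrization, compared in p-space.
p_plain = p0 - lr * g
p_from_theta_plain = sigmoid(theta0 - lr * grad_theta)

# Natural-gradient steps: precondition by the inverse Fisher information.
fisher_p = n / (p0 * (1 - p0))
fisher_theta = n * p0 * (1 - p0)
p_nat = p0 - lr * g / fisher_p
p_from_theta_nat = sigmoid(theta0 - lr * grad_theta / fisher_theta)

print(abs(p_plain - p_from_theta_plain))  # large mismatch
print(abs(p_nat - p_from_theta_nat))      # tiny mismatch, O(lr^2)
```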

~~~
bjourne
Isn't it only the _trajectory_ that is sensitive to the parametrization?
Afaik, the locations of the minima do not depend on it.

~~~
simonster
Yep, the minima are in the same locations. However, if the problem has
multiple minima, then the parametrization can affect which minimum gradient
descent actually reaches.
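To make the first half of that concrete (a toy double well of my own choosing): rescaling the parameter changes every gradient-descent iterate, even though the minima sit in the same places and, in this mild case, both runs end at the same one. With a more aggressive rescaling the effective step can overshoot the ridge between basins, which is where the "which minimum" sensitivity comes from.

```python
# Double well f(x) = (x^2 - 1)^2, minima at x = +1 and x = -1.
# Run gradient descent on f directly, and on h(u) = f(2u), i.e. the
# same function under the rescaled parametrization x = 2u.

def f_prime(x):
    return 4 * x * (x**2 - 1)

def descend(grad, z0, lr, steps=300):
    zs = [z0]
    for _ in range(steps):
        zs.append(zs[-1] - lr * grad(zs[-1]))
    return zs

lr = 0.02
xs_direct = descend(f_prime, 0.5, lr)

# In u-space: h'(u) = 2 f'(2u); track the induced trajectory x = 2u.
us = descend(lambda u: 2 * f_prime(2 * u), 0.25, lr)
xs_reparam = [2 * u for u in us]

print(xs_direct[1], xs_reparam[1])    # 0.53 vs 0.62: iterates differ at once
print(xs_direct[-1], xs_reparam[-1])  # both settle at the x = +1 minimum
```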

