
Deep Learning Without Poor Local Minima - hacker42
https://arxiv.org/abs/1605.07110
======
philth
There is a great post [1] on Cross Validated (Stack Exchange) that might help contextualise this
development.

[1]
[http://stats.stackexchange.com/questions/203288/understanding-almost-all-local-minimum-have-very-similar-function-value-to-the/203300](http://stats.stackexchange.com/questions/203288/understanding-almost-all-local-minimum-have-very-similar-function-value-to-the/203300)

~~~
charleshmartin
That is helpful, thanks.

------
charleshmartin
It is difficult to understand the implications of these assumptions and whether
they really apply to supervised deep learning nets.

It has been known for a very long time that simple models, like the Random
Energy Model (REM), display a spin glass transition at low temperature. (The REM is
the p -> infinity limit of the mean-field p-spherical spin glass used in the earlier
papers by LeCun.) So it is expected that a random network may also display this
kind of behavior, and, therefore, there could exist a large number of global
minima, separated by very large barriers.

see, for example
[http://guava.physics.uiuc.edu/~nigel/courses/563/essays2000/...](http://guava.physics.uiuc.edu/~nigel/courses/563/essays2000/pogorelov.pdf)

However, it has been argued that this is unphysical for very strongly
correlated systems like proteins and random hetero-polymers (with non-local
contacts). Instead, one would have a spin glass with minimal frustration. This
would lead to a highly funneled energy landscape with a single minimum (or a few
local minima), giving a kind of rugged convexity.

[https://charlesmartin14.wordpress.com/2015/03/25/why-does-deep-learning-work/](https://charlesmartin14.wordpress.com/2015/03/25/why-does-deep-learning-work/)

It would be helpful if some practitioners could comment on the reliability of
the assumptions in this paper.

~~~
avallet
As I understand it, the main result of the paper relies on these 4
assumptions:

\- That the dimensionality of the output of the network is smaller than that
of the input. That is usually the case in image recognition, where the image
is width x height x channels dimensional, while the output is usually a much
smaller number of label-wise probabilities. It probably isn't the case when
you generate data from some smaller representation, e.g. with autoencoders,
image generation, etc.

\- That the input data is decorrelated, and that the input data is
uncorrelated with the output ground truth. The former can easily be obtained
via a whitening transformation in many cases in practice. I am not quite sure
about the latter.

\- That whether a connection in the network is activated or not is random with
the same probability of success across the network. Active means the ReLU
activation function has output greater than 0. Many people initialize weights
in the network with some 0-mean random variable and some constant bias, in
which case that assumption should hold true at the beginning of training.
Whether that assumption holds throughout training could easily be verified
empirically, i.e. by monitoring the network's activations (see the sketch after this list).

\- That the network activations are independent of the input, the weights and
each other. That's obviously not completely true - the network activations of
a given layer are a function of the previous layer's activations and weights,
and ultimately of the input in the first layer. With large enough networks,
this may hold sufficiently in practice - any single activation should not
depend very significantly on any other single variable.
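
Here is a minimal NumPy sketch of what I mean by monitoring the activations (a toy
fully-connected ReLU net with made-up layer sizes, nothing from the paper); rerunning
it on the current weights during training would show whether the fraction of active
units stays roughly constant:

    # Toy sketch: fraction of active ReLU units per layer, for a hypothetical
    # fully-connected net with random weights (shapes chosen arbitrarily).
    import numpy as np

    rng = np.random.default_rng(0)
    layer_sizes = [64, 128, 128, 10]                  # made-up architecture
    weights = [rng.normal(0.0, np.sqrt(2.0 / m), size=(m, n))
               for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]

    def activation_rates(x, weights):
        """Fraction of positive (active) ReLU pre-activations in each hidden layer."""
        rates = []
        h = x
        for w in weights[:-1]:                        # hidden layers only
            z = h @ w
            rates.append(float((z > 0).mean()))
            h = np.maximum(z, 0.0)                    # ReLU
        return rates

    x = rng.normal(size=(1000, layer_sizes[0]))       # stand-in for a data batch
    print(activation_rates(x, weights))               # roughly 0.5 per layer at init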

I may have missed something in interpreting the maths; any comment is
appreciated. From a practical standpoint, especially for computer vision,
these assumptions seem quite reasonable. I am not qualified however to comment
on the proof of this result, so I would wait on peer review. Still, it is
heartening to see the theory of deep learning finally catching up with
practice!

~~~
imh
>..and that the input data is uncorrelated with the output ground truth.

I haven't read it yet. Is that just linear correlation, or full independence?
If it's independence, then there's no signal, right?

~~~
avallet
Linear correlation. Actually, it seems to be a bit more general than just
uncorrelated, i.e. if the input is an m by n matrix X and the ground truth a k by n
matrix Y, the author requires that XX^T and XY^T be full-rank. A whitening
transformation would yield the identity matrix for XX^T, but that's a bit
stronger than what's strictly necessary. My interpretation of XY^T being full-
rank as meaning X and Y are uncorrelated might indeed be mistaken.
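
To make that concrete, a small NumPy sketch with random stand-in data (the
dimensions are made up): whiten X so that XX^T becomes the identity, then check the
two rank conditions directly:

    # Random stand-in data; m input dims, k output dims, n samples (all made up).
    import numpy as np

    rng = np.random.default_rng(0)
    m, k, n = 5, 3, 100
    X = rng.normal(size=(m, n))
    Y = rng.normal(size=(k, n))

    # Whitening via the inverse square root of XX^T (assumes XX^T is invertible).
    U, s, _ = np.linalg.svd(X @ X.T)
    X_white = U @ np.diag(1.0 / np.sqrt(s)) @ U.T @ X

    print(np.allclose(X_white @ X_white.T, np.eye(m)))    # True: XX^T is now identity
    print(np.linalg.matrix_rank(X @ X.T) == m)            # XX^T full-rank
    print(np.linalg.matrix_rank(X @ Y.T) == min(m, k))    # XY^T full-rank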

~~~
pedrosorio
Could you clarify what is the mathematical definition of "X and Y are
uncorrelated" for two matrices?

~~~
avallet
I am not quite sure there is such a thing. :p I was playing a bit loose with
the mathematics here, and trying to find some more intuitive way to explain
"XY^T is full-rank", but it got confusing. Sorry about that. I will edit my
initial post accordingly. (Ah, can't edit it seems, oh well)

------
mrkgnao
I have a feeling, albeit one not informed by deep knowledge of either field,
that there are already-known pure mathematical results that can cause a
<insert your favorite way of saying "paradigm shift"> in machine learning,
just "waiting to be discovered" by people in the latter field.

All this talk of maxima and minima reminds me of Morse theory, for instance
(and that Wikipedia page is more than what I know about it). [1]

Is there any sense in what I'm saying?

[1]:
[https://en.wikipedia.org/wiki/Morse_theory](https://en.wikipedia.org/wiki/Morse_theory)

------
argonaut
For context, if true (the paper has not been peer reviewed yet), this confirms
what has already been suspected and empirically evaluated about deep learning:
that the non-convexity of neural networks is not an issue.

------
rudyl313
If these claims are true (specifically, that every local minimum is a global
minimum), then why did the earlier neural networks have poor performance? Why
did we need advancements like pretraining via stacked RBMs and dropout in
order to make deep learning converge on usable/better models?

~~~
argonaut
A quadratic function is a convex function with a global minimum. Doesn't mean
it's a good model for much.

~~~
rudyl313
But the tricks/advancements I mentioned do not change the function. They
change the initial weights and how the cost function is explored.

~~~
aab0
But they do change the function. You use ReLU now, instead of pretraining.

~~~
rudyl313
Using ReLU units is a newer advancement and I agree that changing the
activation function does change the cost function. However, before Hinton got
all excited about ReLU units, he was still showing huge improvements just by
using pretraining and later by using dropout, which shouldn't change the cost
function.

~~~
argonaut
Dropout helps with convergence/optimization, sure. The existence of a global
minimum says nothing about the time required to reach it. Important to note
that dropout isn't as common anymore; it's not a huge win.

------
mjw
My main quibble with this paper is:

> For deeper networks, Corollary 2.4 states that there exist “bad” saddle
> points in the sense that the Hessian at the point has no negative
> eigenvalue.

To me these sound just as bad as local minima. Also, I don't think it's
standard to call something a saddle point unless the Hessian has negative as
well as positive eigenvalues. Otherwise there's no "saddle"; it's more like a
valley or plateau.

They claim that these can be escaped with some perturbation:

> From the proof of Theorem 2.3, we see that some perturbation is sufficient
> to escape such bad saddle points.

I haven't read through the (long!) proof in detail but it doesn't seem obvious
to me why these would be any easier to escape via perturbation than a local
minimum would be, and I think this could use some extra explanation as it
seems like an important point for the result to be useful. Did anyone figure
this bit out?

~~~
banskiachtar
A saddle is a critical point that's not a local extremum; the Hessian could
just be zero, for example, as for x^4 - y^4 at (0, 0).
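
A quick SymPy sanity check of that example:

    # x**4 - y**4 has a critical point at (0, 0) with an identically zero Hessian,
    # yet it is not a local extremum: it increases along x and decreases along y.
    import sympy as sp

    x, y = sp.symbols('x y')
    f = x**4 - y**4
    grad = [sp.diff(f, v) for v in (x, y)]
    H = sp.hessian(f, (x, y))

    print([g.subs({x: 0, y: 0}) for g in grad])   # [0, 0] -> critical point
    print(H.subs({x: 0, y: 0}))                   # zero matrix -> no negative eigenvalue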

~~~
mjw
Ah yep, true. I'd forgotten you can still get the saddle effect from higher-
order derivatives; the Hessian eigenvalues aren't enough to characterise it.

I was thinking of examples like (x-y)^2 at zero, although I guess that's still
a local minimum, just not a unique local minimum in any neighbourhood.
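
For what it's worth, a quick SymPy check of that one too:

    # (x - y)**2 has a positive semi-definite but singular Hessian at the origin
    # (eigenvalues 0 and 4), so the second-order test is inconclusive; every point
    # on the line x = y is a non-strict minimum.
    import sympy as sp

    x, y = sp.symbols('x y')
    H = sp.hessian((x - y)**2, (x, y))
    print(H)                 # Matrix([[2, -2], [-2, 2]])
    print(H.eigenvals())     # {0: 1, 4: 1}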

------
thomasahle

        2) every local minimum is a global minimum
        3) every critical point that is not a global minimum is a saddle point, and
    

Are these not the same thing?

~~~
coherentpony
f(x) = (1 - x^2)^2

This satisfies 2) but not 3).

The point x = 0 is a critical point but it is not a global minimum and it is
not a saddle point.
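
A quick SymPy check of the critical points, if anyone wants to verify:

    # f(x) = (1 - x**2)**2: critical points at x = -1, 0, 1. x = +/-1 are the global
    # minima (f = 0, f'' = 8 > 0); x = 0 is a local maximum (f'' = -4 < 0), i.e. a
    # critical point that is neither a global minimum nor a saddle.
    import sympy as sp

    x = sp.symbols('x')
    f = (1 - x**2)**2
    crit = sp.solve(sp.diff(f, x), x)
    print(crit)                                                    # [-1, 0, 1]
    print([(c, f.subs(x, c), sp.diff(f, x, 2).subs(x, c)) for c in crit])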

~~~
n4r9
Could 3) be replaced by "there are no local maxima"?

------
conjectures
This paper would be much improved by cutting the material on linear models,
which AFAIK is not new, and by properly explaining the assumptions (Axa...)
they are trying to do away with for nonlinear models, rather than referring the
reader to previous work.

------
alpineidyll3
Similar results have been known for years, but this is good stuff. I guess
I'll keep reading news.y....

------
nshm
Oh yeah, our deep learning is a super cool technology, no need to study math,
believe us, just buy more GPUs and they will help you, throw the data on the
fan. Or, even better, give us your data, we'll decide what to do with it for
you, upload your data to our cloud and relax.

Consider the following question: how many layers do you need in a ReLU
network to approximate the function x^2 on [0, 1] with 1e-6 accuracy?

~~~
mikey_g
Actually not so much, I was wondering about similar things some time ago.

[https://github.com/ghostFaceKillah/tensorflow-
experiments/bl...](https://github.com/ghostFaceKillah/tensorflow-
experiments/blob/master/basic-functions/quadratic_function_nn_learner.py)

~~~
nshm
Exactly, this simple example demonstrates how hard is to learn even a simple
function. Playing with your code, 10 hidden units give error 0.6 in 1000
iterations, 1000 hidden units give error 0.02 in 10000 iterations so it takes
way longer to train. This is not an easy technology.
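
For reference, a minimal NumPy version of that kind of experiment (one hidden ReLU
layer, arbitrary architecture and hyperparameters, not the linked TensorFlow code); it
fits y = x^2 on [0, 1] with full-batch gradient descent and prints the mean squared
error as it trains:

    # One-hidden-layer ReLU network fit to y = x**2 on [0, 1] by gradient descent.
    # Architecture and hyperparameters are arbitrary; results will vary.
    import numpy as np

    rng = np.random.default_rng(0)
    N, H, lr, steps = 256, 50, 0.1, 5000

    x = np.linspace(0.0, 1.0, N).reshape(-1, 1)
    y = x ** 2

    W1 = rng.normal(size=(1, H))
    b1 = rng.normal(size=H) * 0.1
    W2 = rng.normal(size=(H, 1)) * 0.1
    b2 = 0.0

    for step in range(steps):
        z = x @ W1 + b1                  # pre-activations, shape (N, H)
        h = np.maximum(z, 0.0)           # ReLU
        pred = h @ W2 + b2
        err = pred - y
        if step % 1000 == 0:
            print(step, float(np.mean(err ** 2)))

        # Backpropagation for the mean squared error loss.
        d_pred = 2.0 * err / N
        dW2 = h.T @ d_pred
        db2 = d_pred.sum()
        dz = (d_pred @ W2.T) * (z > 0)
        dW1 = x.T @ dz
        db1 = dz.sum(axis=0)

        W1 -= lr * dW1
        b1 -= lr * db1
        W2 -= lr * dW2
        b2 -= lr * db2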

