
Gradient Descent Finds Global Minima of Deep Neural Networks - superfx
https://arxiv.org/abs/1811.03804
======
fwilliams
It's worth noting that the primary result of this paper concerns only the
error on the _training_ data under empirical risk minimization. Zero training
error =/= a model that generalizes. For any optimization problem over a finite
training set, you can always add enough parameters to achieve zero error
(imagine introducing enough variables to fully memorize the map from inputs to
labels).
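
(As a toy illustration of that memorization point, a minimal Python sketch;
this "model" is hypothetical and unrelated to the paper's construction:)

    # Toy "model" with one parameter per training example: zero
    # training error by construction, but no generalization at all.
    train = {(0.1, 0.2): 1, (0.5, 0.9): 0, (0.3, 0.3): 1}

    def memorizing_model(x):
        return train.get(x, 0)  # perfect on the training set, arbitrary elsewhere

    assert all(memorizing_model(x) == y for x, y in train.items())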

The major contribution of the work is showing that a ResNet needs a number of
parameters that is only polynomial in the dataset size and depth to converge
to a global optimum, in contrast to traditional fully-connected nets, for
which the analysis requires a number of parameters exponential in the depth.

~~~
sytelus
There is a bit of a difference between fitting a dataset with some convenient
parameterized function and finding the global minimum of a non-convex
function. Also, the paper claims this can be done in polynomial time.

> The current paper proves gradient descent achieves zero training loss in
> polynomial time for a deep over-parameterized neural network with residual
> connections (ResNet).

~~~
p1esk
_There is bit of difference between fitting dataset to some convenient
parameterized function vs finding global minima of non-convex function_

What's the difference? Any point where the loss is zero is a global minimum.

------
iandanforth
My hope is that as these bounds are refined we can start to do
back-of-the-envelope (BotE) calculations such as: "I have 50k training images
of 512x512x3 and 1k classes; this means I'll need a convolutional ResNet of at
most 12 layers and 12M params to fit the training data, so let's start at half
that" rather than today's "let's use ResNet-101 and see if that works." (A toy
sketch of what such a calculator might look like is below.)
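
(The functional form and constants here are pure placeholders, not anything
from the paper, whose bounds are far too loose to give usable numbers:)

    # Hypothetical back-of-the-envelope width estimate. The exponent `a`
    # and constant `c` are made up for illustration only.
    def required_width(n_samples: int, depth: int, a: float = 2.0, c: float = 1e-3) -> float:
        # assume a bound of the form m >= c * n^a / H (illustrative only)
        return c * n_samples ** a / depth

    print(f"{required_width(50_000, 12):.3g}")  # ~2.08e+05 hidden units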

~~~
trevyn
You have to include some sense of what the classes encode for this to have
meaning — for example, “pictures of correct mathematical proofs” vs “pictures
of incorrect mathematical proofs” is going to require a much different
architecture than “pictures of squares” vs “pictures of circles”.

~~~
iandanforth
Interesting classes! To use the heuristic Andrew Ng proposed: if a human could
tell the difference between correct and incorrect proofs in one second, then
this problem is likely no harder than most image-recognition problems. If,
instead, we're talking about analysis that requires symbolic manipulation,
then we're pretty far outside the current capabilities of the
convolutional/residual/fully-connected nets for which the paper provides
bounds.

------
ramgorur
I did not understand the paper very well.

1. It's theoretically impossible to guarantee convergence to a global optimum
using gradient descent if the function is non-convex.

2. The only way to guarantee it is to restart gradient descent from different
points in the search space, or to try different step sizes if the algorithm
always starts from the same point.

3. Also, does "achieving zero training loss" mean the network has converged to
a global optimum? As far as I knew, you could get zero training loss at a
local minimum as well.

Please correct me if I am wrong.

~~~
LolWolf
1. > It's theoretically impossible to guarantee a convergence to global
optima using gradient descent if the function is non-convex.

This is false. See, e.g., [0][1].

2. I'm not really sure what the question is here.

3. If your loss is bounded from below by 0 (it is a squared norm) and you
achieve 0 loss, then 0 is a global optimum, since by definition no other
objective value can be smaller than this number.
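
(In symbols, a minimal version of the argument:)

    % nonnegative loss + a point attaining zero => global minimizer
    L(\theta) \ge 0 \ \ \forall \theta, \qquad L(\theta^\star) = 0
    \;\Longrightarrow\; L(\theta^\star) \le L(\theta) \ \ \forall \theta.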

---

[0] Theorem A.2 in Udell et al.'s _Generalized Low Rank Models_ paper:
[https://arxiv.org/pdf/1410.0342.pdf](https://arxiv.org/pdf/1410.0342.pdf)

[1] B&V _Convex Optimization_
([https://web.stanford.edu/~boyd/cvxbook/](https://web.stanford.edu/~boyd/cvxbook/)),
Appendix B.1. In fact, I can't find the reference right now, but you can
easily prove that GD with an appropriate step-size converges to a global
optimum on this problem when initialized at (0,0), even though the problem is
non-convex.

~~~
DoctorOetker
That reference you can't find right now seems rather pertinent?

I think the OP intended and should have written:

"It's theoretically impossible to guarantee a convergence to global optima
using gradient descent _for an arbitrary non-convex function_."

For example, consider the function f(x) = sin^2(pi * x) + sin^2(pi * N / x).
This function has multiple global minima at the divisors of N, where
f(x) == 0; if x or N/x is non-integer, it is guaranteed to be positive...
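
(A quick numerical check of this construction, with N = 15 picked
arbitrarily:)

    import math

    def f(x, N=15):
        # zero exactly when both x and N/x are integers, i.e. when x divides N
        return math.sin(math.pi * x) ** 2 + math.sin(math.pi * N / x) ** 2

    for x in (1, 3, 5, 15, 2, 4.7):
        print(x, f(x))  # divisors of 15 give ~0; other points are strictly positive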

I am not taking a stance on whether gradient descent does or does not
guarantee finding global minima, and is thus able to factorize
cryptographic-grade RSA products of primes, but the claim does appear to imply
it.

Edit: the multiplication asterisks turned some of the text italic.

~~~
LolWolf
> that reference you can't find right now seems rather pertinent?

Here it is:
[https://arxiv.org/pdf/1707.08706.pdf](https://arxiv.org/pdf/1707.08706.pdf)
(This isn't quite the one I was thinking of, so I'll dig a little deeper, but
it covers the idea).

Some slightly more technical conditions have to hold in order to have vanilla
GD work (since the function is non-differentiable at points), but a (very!)
slightly generalized variant for convex functions—sub-gradient descent—works.

> _for an arbitrary non-convex function._

Sure, but this is also obvious, since it is NP-hard to reach global optima in
arbitrary non-convex problems. Additionally, specifically in the case of GD, I
can give simple examples that _always_ fail: consider f(x) = 1 everywhere
except f(0) = 0. GD always fails whenever the initial point x_0 ≠ 0, since the
gradient is zero everywhere except at one point; and, picking initializations
randomly, we reach the global minimum with probability 0 whenever the
initialization distribution has support with non-empty interior.
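
(A sketch of that failure mode; the zero gradient is hard-coded, since the
derivative of the flat region is 0 wherever it exists:)

    import random

    # f(x) = 1 for x != 0, f(0) = 0: the derivative is 0 wherever it
    # exists, so gradient descent never moves from its starting point.
    x = random.gauss(0.0, 1.0)  # x != 0 with probability 1
    for _ in range(1000):
        grad = 0.0              # gradient of the constant region
        x -= 0.1 * grad         # the iterate never changes
    print(x)                    # still the initial point, not the minimizer 0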

I'm afraid I disagree that this is what the OP intended, though, and I also
disagree that the paper's claim implies what you've said, since they only
study a _very_ specific subproblem (e.g. minimization of empirical loss on
ResNets).

The relative "ease" of this task vs. solving arbitrary NP-hard problems is not
difficult to believe, since, given a bunch of training examples, I can always
generate a ResNet that fits those examples perfectly (i.e. with zero loss) in
poly-space in a very dumb way: first, generate a circuit that matches the
look-up table of the training samples (which is poly-space in the number of
samples and can be done in poly-time), then map that circuit to an NN.
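
(Not the circuit construction above, but a related classic one: for scalar
inputs, a one-hidden-layer ReLU net with n - 1 units interpolates n points
exactly. A sketch, with the helper name `memorize` made up for illustration;
it assumes distinct inputs:)

    import numpy as np

    def memorize(xs, ys):
        # One-hidden-layer ReLU net f(x) = y_1 + sum_i b_i * relu(x - x_i)
        # that fits n scalar points exactly with n - 1 hidden units.
        xs, ys = np.asarray(xs, float), np.asarray(ys, float)
        order = np.argsort(xs)
        xs, ys = xs[order], ys[order]
        slopes = np.diff(ys) / np.diff(xs)
        b = np.concatenate([[slopes[0]], np.diff(slopes)])  # kink coefficients
        return lambda x: ys[0] + np.sum(b * np.maximum(0.0, x - xs[:-1]))

    xs, ys = [0.0, 1.0, 2.5, 4.0], [1.0, -1.0, 0.5, 2.0]
    f = memorize(xs, ys)
    print([float(round(f(x), 10)) for x in xs])  # recovers ys exactly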

~~~
DoctorOetker
> Some slightly more technical conditions have to hold in order to have
> vanilla GD work (since the function is non-differentiable at points)

Which function is non-differentiable at points? If you refer to my example, it
is only non-differentiable at x = 0 and x = inf, which are both uninteresting
points since they aren't divisors of N; everywhere else, the f(x) I gave is
differentiable and Lipschitz continuous of order infinity.

This is in contrast to your pathological example of f(x) = { 1 (x != 0);
0 (x == 0) }... of course GD cannot work there, and I wouldn't fault the paper
for it...

Don't misunderstand me, the paper _is_ interesting, but the title and certain
phrasings are very misleading IMHO.

Still, I think the approach taken by others is more interesting: fix an
underlying NN as "ground truth", train a second NN of the same architecture to
match it, and observe the absolute error between the two.

~~~
LolWolf
> which function is non-differentiable at points?

Sorry, this was referring to the construction provided in the referenced
paper.

I do agree that the title is somewhat misleading, since when I first read it
(and thought, "this is probably wrong"), I imagined it proved that _given any
ResNet_, you can show convergence to the global optimum via GD, not just "a
ResNet of a given size converges to a global optimum, via GD, for a specific
training set."

That being said, the paper does not prove (nor claim to prove) _general_,
globally optimal convergence of GD, which is what I think you're saying
(given, for example, what you mentioned about finding the factorization of a
semiprime in the GGP, and your specific function construction), and which is
what I was pushing back against a bit. In particular, even in the title, they
only claim to prove this for a specific class of problems (i.e. NNs).

> Still I think the approach by others is more interesting: by looking at the
> absolute error between a fixed underlying NN as "ground truth" and observing
> the error of the training NN (of same architecture as ground truth NN)
> trained to match the underlying NN

I'm afraid I haven't seen this approach, but it would be interesting. Do you
have references?

~~~
DoctorOetker
Thanks for your added comments; it's really helpful to see a more candid
breakdown of others' views of a paper.

> I'm afraid I haven't seen this approach, but it would be interesting. Do you
> have references?

They are referenced in the paper, in the following section:

> Another way to attack this problem is to study the dynamics of a specific
> algorithm for a specific neural network architecture. Our paper also belongs
> to this category. Many previous works put assumptions on the input
> distribution and assume the label is generated according to a planted neural
> network. Based on these assumptions, one can obtain global convergence of
> gradient descent for some shallow neural networks [Tian, 2017,
> Soltanolkotabi, 2017, Brutzkus and Globerson, 2017, Du et al., 2018a, Li and
> Yuan, 2017, Du et al., 2017b]. Some local convergence results have also been
> proved [Zhong et al., 2017a,b, Zhang et al., 2018]. In comparison, our paper
> does not try to recover the underlying neural network. Instead, we focus on
> the empirical loss minimization problem and rigorously prove that randomly
> initialized gradient descent can achieve zero training loss.

I had this idea independently but never pursued it due to lack of time. It's
another reason I like this paper: it references this approach. I still haven't
had time to read those references, but from the description in this section it
appears they investigate nearly, if not exactly, what I wanted to investigate
"some day" :)

------
charleshmartin
An excellent paper, which uses (some of) the results we have also found
studying the weight matrices of neural networks: namely, that they rarely
undergo rank collapse.

[https://calculatedcontent.com/2018/09/21/rank-collapse-in-deep-learning/](https://calculatedcontent.com/2018/09/21/rank-collapse-in-deep-learning/)

But they miss something: the weight matrices also display power-law behavior.

[https://calculatedcontent.com/2018/09/09/power-laws-in-deep-learning/](https://calculatedcontent.com/2018/09/09/power-laws-in-deep-learning/)

This is also important because it was suggested in the early 90s that Heavy
Tailed Spin Glasses would have a single local minimum.

This fact is the basis of my early suggestion that DNNs would exhibit a spin
funnel.

------
gogut
This paper appeared in November but, in fact, Allen-Zhu (MSR,
[http://people.csail.mit.edu/zeyuan/](http://people.csail.mit.edu/zeyuan/))
had already posted his result in October. This is their first paper, from
October:
[https://arxiv.org/pdf/1810.12065.pdf](https://arxiv.org/pdf/1810.12065.pdf),
and this is their second, from November:
[https://arxiv.org/pdf/1811.03962.pdf](https://arxiv.org/pdf/1811.03962.pdf).
In the MSR October paper, they proved how to train an RNN (which is even
harder than a DNN). In their November paper, they proved how to train a DNN.
Compared to the October one, the November one is actually much easier: in an
RNN every layer shares the same weight matrix, whereas in a DNN every layer
can have a different weight matrix. Originally they were not planning to write
this DNN paper; they did it because someone complained that an RNN is not a
multilayer neural network.

In summary, the difference between the MSR paper and this paper is the
following: let H denote the number of layers and m the number of hidden nodes.
The MSR paper shows that we only need to assume m > poly(H) for SGD to find
the global optimum. Du et al. have a similar result, but they have to assume
m > 2^{O(H)}. Compared to the MSR paper, Du et al.'s paper is actually pretty
trivial.

------
brentjanderson
Although I'm no expert, isn't this result an incredibly important
contribution? This paper claims to prove that:

> The current paper proves gradient descent achieves zero training loss in
> polynomial time for a deep over-parameterized neural network with residual
> connections (ResNet).

If this variant of gradient descent is able to reach global minima in
polynomial time, and if neural networks are proven to approximate any
function, then ostensibly this technique could be used to guarantee the lowest
error possible in approximating any function. This seems incredibly important.
Can someone correct my reading of the abstract?

~~~
hooloovoo_zoo
Well, without reading the whole paper, two important things strike me.

1. Zero training loss is impossible in most networks, because the last layer
can only reach the targets asymptotically.

2. Zero training loss means nothing from a practical standpoint. We've had
algorithms capable of this for a long time (k-NN with k=1, decision trees,
etc.).
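
(For instance, assuming scikit-learn is available, 1-NN "fits" any training
set with distinct inputs perfectly:)

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    X = np.random.randn(200, 5)            # distinct inputs (almost surely)
    y = np.random.randint(0, 2, size=200)  # even purely random labels
    knn = KNeighborsClassifier(n_neighbors=1).fit(X, y)
    print(knn.score(X, y))                 # 1.0: zero training error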

~~~
skeptic_69
1. People overfit the baby datasets (e.g. MNIST) to zero training loss all the
time. Maybe you meant a "hard" dataset.

2. You clearly have no idea what you are talking about. This paper is trying
to argue a bit about why neural networks generalize well, by showing with math
that a NN under their conditions converges to zero training loss. It isn't
remotely meant to be practical. IT IS A THEORETICAL PAPER.

And comparing it to nearest neighbors with k=1 is so, so silly it isn't even
wrong.

Edit: #1 is actually an entire research direction in the theory of machine
learning, FYI.

It is possible to get neural networks that massively overfit but still
generalize (which is weird):

[https://arxiv.org/pdf/1611.03530.pdf](https://arxiv.org/pdf/1611.03530.pdf)

That paper was really famous. It showed you can get zero training loss on data
even after replacing the labels with random noise.

Edit 2: I am sorry to be harsh. It is just hard to read such arrant nonsense.

~~~
nilkn
I don't see how you really addressed the concerns. You say the paper is
"trying to argue a bit about why neural networks generalize well," but in fact
I don't see anything in this paper about generalization or test error. The
first item of future work in Section 7 is to look at test error instead of
training error:

"The current paper focuses on the train loss, but does not address the test
loss. It would be an important problem to show that gradient descent can also
find solutions of low test loss. In particular, existing work only demonstrate
that gradient descent works under the same situations as kernel methods and
random feature methods [Daniely, 2017, Li and Liang, 2018]."

~~~
skeptic_69
Have you heard of something called ERM? Uniform convergence?

The typical way of showing generalization in ML is to show that if we have
some low- or zero-error solution on the training dataset then, for a large
enough dataset, with high probability, the error on our training dataset is
close to the error on the real, unknown distribution. The first step,
"find a low-error hypothesis on the training data," is called the ERM
principle.
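
(Schematically, writing L_S(h) for the error of hypothesis h on the sample S
and L_D(h) for its error on the underlying distribution D, uniform convergence
says that with probability at least 1 - delta over the draw of a large enough
sample S:)

    \sup_{h \in \mathcal{H}} \left| L_S(h) - L_{\mathcal{D}}(h) \right| \le \varepsilon(n, \delta)

so a hypothesis with zero training error also has true error at most epsilon.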

In practice we observe that stochastic gradient descent works pretty well at
solving the ERM problem, and the solutions generalize well (perform well when
deployed).

This is very weird, since neural networks are really weird objects with very
non-linear and non-convex behavior, and gradient descent shouldn't play well
with weird bumps and curves and valleys.

People want to show mathematically that stochastic gradient descent does well
on neural networks.

This paper claims gradient descent is effective at minimizing the quadratic
loss on the training data.
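
(That is, a loss of roughly this form, where f(W, x) denotes the network
output; this is the generic quadratic loss, not necessarily the paper's exact
notation:)

    % empirical quadratic (squared) loss over n training pairs
    L(\mathbf{W}) = \frac{1}{2} \sum_{i=1}^{n} \left\| f(\mathbf{W}, \mathbf{x}_i) - \mathbf{y}_i \right\|_2^2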

If we could improve the results to show that we also have low loss on the true
distribution, that would be compelling evidence that gradient descent
converges to the minimum-error solution.

None of this is explicitly stated, since it is a well-understood part of the
basic literature in learning theory.

Showing that an algorithm can do ERM on the hypothesis class is the first and
(easier) part of showing generalization.

If you want a good reference that explains this in a more coherent way, I
recommend the first 4 chapters of _Understanding Machine Learning_ by Shai
Shalev-Shwartz and Shai Ben-David.

If you still think the comments I was responding to are not totally
incoherent, take note of the fact that the very first sentence of the paper is
"One of the mysteries in deep learning is random initialized first order
methods like gradient descent achieve zero training loss"

------
juskrey
How do you gradually detect a Dirac stick?

------
pj_mukh
With ML algorithms this widely deployed, it's still funny to see papers begin
with "One of the mysteries of deep learning is that..." and then go on to lay
out the multiple ways we have no idea why some of these DL techniques work.

------
orf
> One of the mysteries in deep learning is random initialized first order
> methods like gradient descent achieve zero training loss, even if the labels
> are arbitrary

Can someone expand on this? I've never heard of this before, at least not in
the general case.

~~~
currymj
The paper “Understanding Deep Learning Requires Rethinking Generalization” was
where this was first pointed out, I think.

They shuffled the labels on their datasets, so there can’t possibly be
anything to learn, yet got zero training loss, meaning the network must be
severely overfitting. Yet the same network trained with the actual labels
shows quite good generalization. So the usual intuition about overfitting and
the bias-variance tradeoff doesn’t seem to apply.
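
(A toy version of that experiment: an over-parameterized two-layer net trained
by plain gradient descent on pure-noise labels. The sizes and step size below
are ad hoc and may need tweaking; assumes numpy:)

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, m = 20, 10, 1000                 # 20 samples, width 1000: heavily over-parameterized
    X = rng.standard_normal((n, d)) / np.sqrt(d)
    y = rng.choice([-1.0, 1.0], size=n)    # random labels: nothing to "learn"
    W1 = rng.standard_normal((d, m)) * np.sqrt(2.0 / d)
    w2 = rng.standard_normal(m) * np.sqrt(1.0 / m)
    lr = 0.02                              # chosen by hand; reduce if it diverges

    for step in range(501):
        H = np.maximum(0.0, X @ W1)        # hidden ReLU activations
        pred = H @ w2
        g = (pred - y) / n                 # d(loss)/d(pred) for 0.5 * mean sq. error
        if step % 100 == 0:
            print(step, 0.5 * np.mean((pred - y) ** 2))
        grad_w2 = H.T @ g
        grad_W1 = X.T @ (np.outer(g, w2) * (H > 0.0))
        w2 -= lr * grad_w2
        W1 -= lr * grad_W1
    # the training loss heads toward zero even though the labels are pure noise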

~~~
elcomet
This seems quite intuitive to me:

When there is nothing to learn, the network has to memorize the data. But when
there is structure, it is easier to memorize the structure, so the network
will learn it first (and memorize the rest afterwards).

~~~
srean
But that's the crux of the question: why and how does it not just memorize,
when we know it can do so easily?

~~~
elcomet
Maybe just because it's easier to find patterns than to memorize (if you have
a lot of data).

~~~
currymj
That sounds like it’s probably right to me. But so do lots of things that turn
out to be wrong. I wish we had a better grasp of what is happening, not just
plausible stories. I’m already sick of doing alchemical tinkering to find a
model that works.

~~~
elcomet
Yeah, I agree it would be nice to have some theoretical guarantees on the
architecture we need based on the problem, and the size of the data

------
lbj
I cant claim to fully understand the proof, but these guys have done an
amazing job in terms of furthering our understanding of deep nets.

