
Understanding deep learning requires re-thinking generalization - mpweiher
https://blog.acolyer.org/2017/05/11/understanding-deep-learning-requires-re-thinking-generalization/
======
mannigfaltig
Previous discussions:

[https://news.ycombinator.com/item?id=13566917](https://news.ycombinator.com/item?id=13566917)

[https://openreview.net/pdf?id=rJv6ZgHYg](https://openreview.net/pdf?id=rJv6ZgHYg)

[https://www.reddit.com/r/MachineLearning/comments/6ailoh/r_u...](https://www.reddit.com/r/MachineLearning/comments/6ailoh/r_understanding_deep_learning_requires_rethinking/)

[https://www.reddit.com/r/MachineLearning/comments/5kfs23/r_u...](https://www.reddit.com/r/MachineLearning/comments/5kfs23/r_understanding_deep_learning_requires_rethinking/)

[https://www.reddit.com/r/MachineLearning/comments/5cw3lr/r_1...](https://www.reddit.com/r/MachineLearning/comments/5cw3lr/r_161103530_understanding_deep_learning_requires/)

------
andreyk
Here's the TLDR:

"As the authors succinctly put it, “Deep neural networks easily fit random
labels.” Here are three key observations from this first experiment:

- The effective capacity of neural networks is sufficient for memorising the entire data set.

- Even optimisation on random labels remains easy. In fact, training time increases by only a small constant factor compared with training on the true labels.

- Randomising labels is solely a data transformation, leaving all other properties of the learning problem unchanged."

And the conclusion:

" This situation poses a conceptual challenge to statistical learning theory
as traditional measures of model complexity struggle to explain the
generalization ability of large artificial neural networks. We argue that we
have yet to discover a precise formal measure under which these enormous
models are simple. Another insight resulting from our experiments is that
optimization continues to be empirically easy even if the resulting model does
not generalize. This shows that the reasons for why optimization is
empirically easy must be different from the true cause of generalization. "
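
The randomization test quoted above is easy to reproduce in outline. Here is a
minimal sketch, assuming PyTorch and torchvision are available; the small fully
connected model and the hyperparameters are illustrative placeholders rather
than the architectures or settings used in the paper:

    import torch
    import torch.nn as nn
    import torchvision
    import torchvision.transforms as T

    train_set = torchvision.datasets.CIFAR10(
        root="./data", train=True, download=True, transform=T.ToTensor())

    # Destroy any genuine correspondence between images and labels.
    train_set.targets = torch.randint(0, 10, (len(train_set.targets),)).tolist()
    loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

    # A small fully connected net, big enough to memorise the 50k examples.
    model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 512), nn.ReLU(),
                          nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()

    for epoch in range(100):  # given enough epochs, training accuracy approaches 1.0
        correct, total = 0, 0
        for x, y in loader:
            opt.zero_grad()
            out = model(x)
            loss_fn(out, y).backward()
            opt.step()
            correct += (out.argmax(1) == y).sum().item()
            total += y.numel()
        print(f"epoch {epoch}: train accuracy {correct / total:.3f}")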

This paper was pretty hyped when it came out for seeming to discuss general
properties of deep learning, but the details of it are a little disappointing
- okay, so sufficiently big/deep networks can overfit to training data, and
that's exciting how?... It's a curious finding, but not one that's all that
hard to believe or all that informative. Or so it seems to me. I don't see how
they justify claiming that "we show how these traditional approaches fail to
explain why large neural networks generalize well in practice."

I suppose the notion is that memorizing random labels implies memorization
should also work on non-random labels (and thereby no generalization to the
test set is needed), but it seems intuitive that proper labels and gradients
with regularization will find the answer that generalizes, because that is the
steepest optimization path available. I have not read it all that deeply, and
not in a while, so perhaps their arguments are stronger than they appear to me,
though.

~~~
Eridrus
I haven't read the paper, but I think it only really makes sense in context:

a) The traditional view of generalization would argue that neural nets are
"too complicated"/have too many parameters/etc to generalize well. And that
for generalization you need more limited models.

b) To reconcile this with the practical results of CNNs, some people tried to
argue that while neural nets have a lot of actual parameters, the structure of
neural nets reduces their effective capacity to something smaller than the
parameter counts would imply. The same argument was made for regularization.

c) This paper shows that those arguments are not satisfactory, since these
networks really can fit random labels.

------
stared
As a point of reference, it is good to know that MNIST (handwritten digits)
can be solved extremely well with a nearest-neighbour classifier; see:
[http://yann.lecun.com/exdb/mnist/](http://yann.lecun.com/exdb/mnist/)

It should not be surprising that "memorise all inputs and interpolate between
them" strategies are powerful, whether as part of neural networks or of other
easy-to-overfit techniques (such as random forests).
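
A rough way to see this for yourself (a sketch only, using scikit-learn's small
built-in digits dataset as a stand-in for full MNIST, so the numbers are merely
illustrative):

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_digits(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # "Memorise all inputs and interpolate between them": k-NN does exactly this,
    # storing the training set and voting among the nearest stored examples.
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, y_train)
    print("test accuracy:", knn.score(X_test, y_test))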

------
taneq
> Take the same training data, but this time randomly jumble the labels (i.e.,
> such that there is no longer any genuine correspondence between the label
> and what’s in the image).

What's a 'genuine correspondence'? The network has clearly picked out some
image features that correspond to the assigned labels. Just because they're
not the features you're thinking of doesn't mean they don't exist.

~~~
robert_tweed
If the labels are random they don't correspond to anything, so any "features"
are essentially noise.

The classifier is, in effect, memorising every element in the training set.
It's training a compression algorithm for storing that data.

It should be noted that "training a compression algorithm" isn't always a bad
thing per se, because that's how autoencoders work, which is one of the main
ways to do deep learning.
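
For what it's worth, the compression view is quite literal in that case. A
minimal sketch, assuming PyTorch, with random tensors standing in for flattened
images:

    import torch
    import torch.nn as nn

    # An autoencoder is trained to squeeze its input through a narrow bottleneck
    # and then reconstruct it, i.e. to learn a (lossy) compression of the data.
    encoder = nn.Sequential(nn.Linear(784, 32), nn.ReLU())      # 784 -> 32
    decoder = nn.Sequential(nn.Linear(32, 784), nn.Sigmoid())   # 32 -> 784
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()))

    x = torch.rand(64, 784)  # stand-in for a batch of flattened images
    for step in range(1000):
        opt.zero_grad()
        loss = nn.functional.mse_loss(decoder(encoder(x)), x)  # reconstruction error
        loss.backward()
        opt.step()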

The key term in the article is "the effective capacity" of the model. If you
have a big enough network, it can simply memorise everything you give it. This
makes it difficult to know whether such a model will generalise. A much smaller
model won't overfit in the same way, but also might not perform as well as a
larger, more sophisticated model. The problem in deep learning is that nobody
can tell how much of the training data has simply been saved somewhere in the
model (in an obfuscated and compressed way).

There is some related research about reconstructing the training data from
deep networks (which has privacy implications), but I don't have a link handy.

------
throw_away_777
This statement: "Or in other words: the model, its size, hyperparameters, and
the optimiser cannot explain the generalisation performance of state-of-the-
art neural networks." is not true and very misleading. Careful selection of
hyperparameters and the model can clearly improve generalization - the article
is making a mistake in assuming that getting to zero training error is a good
thing or a desirable thing. In fact a large part of hyperparameter
optimization are choices that ensure generalization, and some of the
fundamental choices such as early stopping and many others do determine how
well the model generalizes. If your model has zero training error you have
likely made poor choices.
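
Early stopping is a good concrete example of such a choice. A minimal,
self-contained sketch assuming PyTorch, with toy regression data in place of a
real task:

    import torch
    import torch.nn as nn

    # Toy data: a noisy regression problem split into train and validation sets.
    X = torch.randn(1000, 20)
    y = X[:, :1] + 0.5 * torch.randn(1000, 1)
    X_tr, y_tr, X_val, y_val = X[:800], y[:800], X[800:], y[800:]

    model = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)

    best_val, best_state, patience, bad = float("inf"), None, 10, 0
    for epoch in range(500):
        opt.zero_grad()
        nn.functional.mse_loss(model(X_tr), y_tr).backward()
        opt.step()
        with torch.no_grad():
            val = nn.functional.mse_loss(model(X_val), y_val).item()
        if val < best_val:
            best_val, bad = val, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            bad += 1
            if bad >= patience:  # validation loss stopped improving: stop early,
                break            # before the net drives training error to zero
    model.load_state_dict(best_state)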

~~~
l3robot
Where does the article state that zero training error is a good thing? The
authors only show that almost every modern neural network can reach 0 training
error, even if the labels are randomized (making generalization impossible).
Hence, they can learn the dataset by heart. From that, the authors can use the
test error as a generalization indicator.

Indeed, careful hyperparameter choice is now the only key to good
generalization. As I understood it, the goal here is more to show that the
correlation between the regularization of the network and its generalization
power is far from being as clear as it is for other ML algorithms like SVMs.

In short, NN hyperparameters help to reach generalization, but cannot
"explain" it. It's the key difference here between practice and theory.

------
yters
The VC dimension of neural networks is at least O(E), if not O(E^2) or worse.
E is the number of edge parameters. With billion parameter networks trained on
billion item datasets, there is no theoretical reason why deep learning should
generalize. This means deep learning is just memorizing the training data.
Evidence of this is the ease with which deep learning models are fooled.

~~~
backpropaganda
It's _at most_ O(E), not at least. The capacity of a deep network can be much
smaller than the number of weights, and this is where VC theory stops being
useful.

Deep networks can generalise to situations where even humans cannot. So the
memorizing narrative doesn't survive any scrutiny.

~~~
yters
Can you cite a source? It depends on the activation function, but as far as I
know only the perceptron has a decent VC dimension due to its use of the sign
function. The tanh and sigmoid result in O(E) and O(E^2) according to
Wikipedia.

~~~
backpropaganda
I don't really have a source, and am speaking from what is hearsay in the deep
learning community. The results you cite are valid only for shallow networks.
As you increase depth, you don't get the same increase in capacity, so even
though millions of params are being used in deep networks, the capacity is not
O(million).

The capacity of a million-parameter shallow net might be O(million), but no
one's using such a model.

~~~
yters
I saw the formula in Abu-Mostafa's Learning from Data. I don't think it only
applied to single-hidden-layer networks, but I may be wrong. Additionally, the
book said the VC dimension is infinite in the general case.

I asked the question on CS Stack Exchange and no one took issue with the
statement that DL has such a large VC dimension. The only counter-response was
that it doesn't matter in practice due to DL's good error scores. But that
still doesn't mean DL is generalizing: good error is only a necessary
condition for generalization, not a sufficient one.

------
infinity0
Is the answer not just simply "the generalisation is encoded in the data"? And
that is exactly why deep learning models need huge amounts of data.

The model, size, hyperparameters, optimiser, etc - all they do is convert the
data into a form that can be used to make predictions.

------
asavinov
> Understanding deep learning requires re-thinking generalization

Deep learning is called deep because it is based on multiple levels of
features corresponding to a hierarchy of notions. Of course, this can change
how generalization (as well as other operations) is performed, but the way
generalization is done is not a specific feature of deep learning.

------
naiveattack
Here's a lay thought.

The network must not have the capacity to hold all the data; it should have a
capacity proportional to the number of classes (instead of the number of
samples).

Another way to arrive at this may be: take a trained network and run it in
inference on the training set. Group the nodes of the network into equally
sized groups. As inference happens, train a smaller new group of nodes
corresponding to each previous group by looking only at the inputs and outputs
it exercises. Put the new subnetworks together by looking purely at the edges
between the previous subnetworks. The new network is now constructed.

I have not built this. But would something like this work?

~~~
andreyk
People have demonstrated that similar ideas are effective for optimizing
network size - after training a highly redundant big model, it's often possible
to reduce it down to 1/10 of the parameters without significantly impacting
performance by doing stuff like this (or even simpler, I think pruning is often
effective).
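
A minimal sketch of the magnitude-pruning variant, assuming PyTorch's
torch.nn.utils.prune; the 90% figure just mirrors the "1/10 of the parameters"
point above and is not a recommendation:

    import torch
    import torch.nn as nn
    import torch.nn.utils.prune as prune

    model = nn.Sequential(nn.Linear(784, 1024), nn.ReLU(), nn.Linear(1024, 10))
    # ... train the (redundant) model here ...

    for module in model.modules():
        if isinstance(module, nn.Linear):
            # Zero out the 90% of weights with the smallest magnitude.
            prune.l1_unstructured(module, name="weight", amount=0.9)
            prune.remove(module, "weight")  # make the pruning permanent

    remaining = sum((p != 0).sum().item() for p in model.parameters())
    print("non-zero parameters after pruning:", remaining)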

------
Abtin88
I have this talk from ICLR '17, which I've just uploaded to YouTube!
[https://youtu.be/kCj51pTQPKI](https://youtu.be/kCj51pTQPKI)

------
RichardHeart
Would it be possible to use machine learning to do this job better? Meaning,
could the machines look at what other machines are doing and better translate
to us what's going on?

~~~
fooker
Use a system we don't understand to understand another system we don't
understand; what could go wrong?

~~~
jacquesm
> what could go wrong?

That's already known: there are inputs to the network that do not make sense
and yet will trigger strong responses. Think of them as inputs that have the
same effect on NNs that optical illusions have on the human brain. We infer
something that isn't there.

I suspect that as network architectures get better and parameter counts drop,
these will get harder and harder to construct.
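
For reference, one standard way such inputs are constructed in the literature
is the fast gradient sign method; a minimal sketch assuming PyTorch, with an
untrained toy model standing in for a real classifier:

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(),
                          nn.Linear(128, 10))
    model.eval()

    x = torch.rand(1, 1, 28, 28, requires_grad=True)  # stand-in for an input image
    label = torch.tensor([3])                         # the model's current prediction

    loss = nn.functional.cross_entropy(model(x), label)
    loss.backward()

    # Nudge every pixel a tiny amount in the direction that increases the loss:
    # x_adv looks almost identical to x but can flip the model's prediction.
    epsilon = 0.05
    x_adv = (x + epsilon * x.grad.sign()).clamp(0, 1).detach()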

