
Understanding deep learning requires rethinking generalization - visionscaper
https://arxiv.org/abs/1611.03530
======
bmh_ca
I remember an AI professor I had who once asked the class to define "the
number 3".

The answer he chose, which stuck with me (if I recall the nuance correctly),
is that the number three is the set of all things in the universe of which
there are three; three is that which they have in common.

Where it became interesting for me is observing our children growing up,
especially learning colours and shapes. They exhibited a pattern of learning
based upon observations of common patterns in communication by vocalization.

For example, children decided things were "red" based upon that trait being
in common with other things we called red, and circles based upon other
things we called circles.

It's really quite a fascinating phenomenon to observe in children, and I
expect there is a key atomicity of association from which more complex
patterns - up to consciousness - can be created. Too fine grained and the
patterns will be noise; too large and certain higher order structures will
never form - a "Goldilocks" zone for the complex system of interpreting
reality by observational exposure and initially arbitrary relation.

~~~
TheOtherHobbes
The prof's definition begs the question.

Humans can be observed recognising patterns, but that doesn't say anything
useful about the process of recognition.

It seems we have experiences first, and generalise from them later _in terms
of our experience._ So we have common experiences of threeness, circleness,
and redness, and from them we generalise what "three", "circle", and "red"
means.

But what defines an experience of threeness? Is it really atomic, or is it
made of further component parts/relationships? Is it learned, or innate? How
does the labelling process influence the experience/generalisation process?
(I'm not sure if it's an anthropological myth, but supposedly there are
primitive tribes where counting goes "One, two, many..." What's their
experience of threeness?)

The exact mechanics of this pattern recognition remain mysterious. It turns
out that when we build NNs to recognise patterns, the process is still
mysterious. Which seems mysterious in itself. It's _remarkable_ that we have
these tools and they seem to work to an extent, but really no one understands
why.

~~~
alimw
> The prof's definition begs the question.

It was good enough for Russell so I doubt that. See
[https://en.wikipedia.org/wiki/Set-theoretic_definition_of_na...](https://en.wikipedia.org/wiki/Set-theoretic_definition_of_natural_numbers#Definition_by_Frege_and_Russell)
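
For reference, a sketch of the Frege-Russell construction from that article (my paraphrase in set-builder notation, not a formal treatment): the "number of" a class is the class of all classes equinumerous with it, so three is the class of all three-membered classes.

```latex
% Equinumerosity: A ~ B iff there is a bijection between A and B.
% The number of a class A is the class of all classes equinumerous with A:
%   N(A) = { B : B ~ A }
% In particular, three is the class of all three-membered classes:
\[
  3 \;=\; \bigl\{\, X \;\bigm|\; \exists a\,\exists b\,\exists c\,
    \bigl(a \neq b \wedge a \neq c \wedge b \neq c \wedge X = \{a, b, c\}\bigr) \,\bigr\}
\]
```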

~~~
bmh_ca
❤️ <-- Unicode heart

------
gambler
Good to see someone testing the limits of neural nets, rather than just
squeezing a few percent of performance out of an artificial benchmark.

That said, is this result really all that surprising? Especially given the
results demonstrated in that 2015 paper on fooling DNNs and the visualization
experiments à la Deep Dream.

Unless you believe in networks "painting" stuff, Deep Dream demonstrated that
neural networks capture and store certain chunks of their training data and
you can get those back out if you're clever enough.

That other paper[1] demonstrated that a trained DNN can classify noise as a
particular label with very high confidence, as long as you construct that
noise carefully enough. This hints at the fact that DNNs may do matching by
applying some complex transformation that _usually_ results in the correct
answer, but does not necessarily capture the underlying patterns. (Kind of
like guessing about the weather by telltale signs, without knowing anything
about air pressure, currents and so on.)

[1] - [http://www.evolvingai.org/fooling](http://www.evolvingai.org/fooling)

~~~
dbecker
Adversarial noise isn't specific to deep learning models. Most models working
on high-dimensional input will misclassify noise with high confidence if you
construct the noise right.

The original "adversarial pixel" paper demonstrates this with logistic
regression.

[https://arxiv.org/pdf/1412.6572v3.pdf](https://arxiv.org/pdf/1412.6572v3.pdf)
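
A minimal sketch of that idea (a toy example of the fast-gradient-sign construction from the linked paper, not the paper's own code): fit a logistic regression on high-dimensional data, then nudge a pure-noise input by a small signed step along the learned weights and watch the predicted probability jump.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy high-dimensional binary problem.
d = 1000
w_true = rng.normal(size=d)
X = rng.normal(size=(2000, d))
y = (X @ w_true > 0).astype(int)

clf = LogisticRegression(max_iter=2000).fit(X, y)
w = clf.coef_[0]

# Start from pure noise, then add a tiny signed perturbation.  Each
# coordinate moves by only eps, but the logit shift accumulates across
# all 1000 dimensions, so the classifier becomes very confident.
noise = rng.normal(size=d)
eps = 0.25
crafted = noise + eps * np.sign(w)   # push the logit toward class 1

print("P(class 1 | plain noise):   %.3f" % clf.predict_proba(noise[None])[0, 1])
print("P(class 1 | crafted noise): %.3f" % clf.predict_proba(crafted[None])[0, 1])
```

The point is the one the linked paper makes: in high dimensions, many tiny per-coordinate changes add up to a large change in a linear score, so even a simple model can be pushed to high confidence on something that still looks like noise.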

------
AlexCoventry
We discussed this paper in our reading group last week[0]. I think the key to
understanding what's going on here is figure 1(a). The fastest learning
happens with true labels, and the slowest with random labels. Shuffled pixels
is the second fastest. I believe the reason this is happening is that given
training data composed of structured images, the convolutional architecture
heavily favors learning filters which reflect geometric features, as opposed
to random filters which can memorize the data. This results in fastest
learning with the true labels because the geometric features correspond to the
learning target, but for memorizing random labels, geometric features have
lower capacity than random filters. On the other hand, it learns shuffled
pixels pretty fast because the convolutional architecture makes it easy to
capture a color histogram and learn off that.
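
One way to see why the histogram story is plausible (a toy check of my own, not from the paper): a fixed pixel permutation destroys all spatial structure but leaves each image's color histogram untouched, so anything computable from the histogram survives the "shuffled pixels" corruption.

```python
import numpy as np

rng = np.random.default_rng(0)

# A fake 32x32 RGB image and one fixed permutation of its 1024 pixel
# positions, applied identically to every image (the "shuffled pixels"
# setting in the paper).
img = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)
perm = rng.permutation(32 * 32)

flat = img.reshape(-1, 3)                   # (1024, 3) pixels
shuffled = flat[perm].reshape(32, 32, 3)    # spatial structure destroyed

# Per-channel histograms are identical, since shuffling only reorders pixels.
for c in range(3):
    h_orig = np.bincount(flat[:, c], minlength=256)
    h_shuf = np.bincount(shuffled.reshape(-1, 3)[:, c], minlength=256)
    assert np.array_equal(h_orig, h_shuf)

print("Color histograms are unchanged by the pixel shuffle.")
```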

[0] This week we discussed the AlphaGo paper. URL for that, although we don't
generally advertise our meetings unless we think there's going to be broad
interest: [https://www.meetup.com/Cambridge-Artificial-Intelligence-Mee...](https://www.meetup.com/Cambridge-Artificial-Intelligence-Meetup/events/237183581/)

~~~
argonaut
> I believe the reason this is happening is that given training data composed
> of structured images, the convolutional architecture heavily favors learning
> filters which reflect geometric features, as opposed to random filters which
> can memorize the data

An interesting idea, but you _really_ need to test this before assigning much
confidence to it actually being what is happening.

------
maxander
My halfway-informed interpretation, just from the abstract: it turns out that
modern image-recognition networks are capable of learning labels randomly
assigned to sets of random images, which means that it's still mysterious why
they learn labels with intelligible meaning when given non-random images
(rather than just memorizing the training set via some nonsense model).

I'd guess the resolution would have to involve an ordering over possible
models, where (for well-designed networks) intelligible models are preferred
over unintelligible ones. Filing this away to read later.

~~~
sgt101
I think it depends on what you mean by "learning". Creating a mapping to a
training set where you have a sufficiently expressive representation is
trivial - simply create a list, and then find the most efficient
representation of the list; zip it, for example. The point of learning is to
generalize from the training set to unseen examples via the representation,
and figure 1 of the paper (part c) shows that this is not what is claimed for
corrupted or randomized examples in this paper.

How the non-generalised claims then extend to the discussions in section 2
leaves me flailing, but that's not unusual! However, my thought is that the
difficulty is measuring the structural risk of a deep network, where network
weights very subtly encode information, or don't, depending on the network.
Perhaps sweeping the networks and then setting weights below a threshold to 0
and measuring the impact on generalisation error would be a way to measure
what in the network is useful encoding and what isn't? The rest of the
network could then simply be "measured" as bits to encode?

------
dkarapetyan
> Brute-force memorization is typically not thought of as an effective form of
> learning. At the same time, it’s possible that sheer memorization can in
> part be an effective problem-solving strategy for natural tasks.

I like the conclusion. Basically neural nets are just beasts with too many
parameters, and the paper even shows you don't need that many parameters to
fit any data set of size n. This is one reason I think neural nets are kind
of a dead end. People don't understand them and it is impossible to get any
explanatory results from them, and based on these results that kind of makes
sense. Neural nets don't learn, they just memorize.

~~~
hiddencost
[https://arxiv.org/abs/1311.2901](https://arxiv.org/abs/1311.2901)

Check out figure 2. The network learns a hierarchy of features, from
fundamental shapes and gradients up to compositional ones. It's kind of
awe-inspiring.

Not NN specific, but some more work on explanations:
[https://homes.cs.washington.edu/~marcotcr/blog/lime/](https://homes.cs.washington.edu/~marcotcr/blog/lime/)

Visualization of attention mechanisms is pretty cool for explanations:
[https://arxiv.org/pdf/1502.03044.pdf](https://arxiv.org/pdf/1502.03044.pdf)
[http://torch.ch/blog/2015/09/21/rmva.html](http://torch.ch/blog/2015/09/21/rmva.html)

~~~
dkarapetyan
The local linear approximation is a cool idea in this context, even though
most non-linear systems are modeled in exactly this way. You take a
complicated thing and linearize it to understand it. I'll have to look
further into LIME.
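
For what it's worth, the core move can be sketched in a few lines (my own toy version of the idea, not the lime library's API): sample perturbations around one instance, query the black box, and fit a distance-weighted linear model whose coefficients act as the local explanation.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# A "black box" we pretend we cannot inspect.
def black_box(X):
    return np.sin(3 * X[:, 0]) + X[:, 1] ** 2 - 0.5 * X[:, 2]

x0 = np.array([0.2, -0.4, 0.7])           # instance to explain

# Sample points near x0, query the black box, weight samples by proximity.
Z = x0 + 0.1 * rng.normal(size=(500, 3))
y = black_box(Z)
weights = np.exp(-np.sum((Z - x0) ** 2, axis=1) / 0.01)

# The ridge coefficients approximate the black box's local sensitivity
# to each feature around x0.
surrogate = Ridge(alpha=1e-3).fit(Z - x0, y, sample_weight=weights)
print("local feature weights:   ", surrogate.coef_)
print("analytic gradient at x0: ", [3 * np.cos(3 * 0.2), 2 * -0.4, -0.5])
```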

------
yazr
> number of parameters exceeds the number of data points as it usually does in
> practice

I don't get this part.

In reality, isn't the dataset much larger than the parameters of the neural
net?

~~~
hiddencost
They're not saying "size of the data set in bits", they're saying "number of
items in the dataset". In speech and image recognition, it's normal to have
more parameters than data points. The example is a bit old, although it's
still a very good architecture: GoogLeNet [0] has around 10M parameters and
was trained on 1.2M images.
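
To make the comparison concrete, here's a back-of-the-envelope count for a deliberately small convnet of my own (a toy architecture, not GoogLeNet): even a modest model ends up with more parameters than CIFAR-10 has training images.

```python
import torch.nn as nn

# A small convnet for 32x32 RGB inputs.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(4),
    nn.Flatten(),
    nn.Linear(128 * 4 * 4, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

n_params = sum(p.numel() for p in model.parameters())
n_train_images = 50_000        # CIFAR-10 training set size
print(f"parameters: {n_params:,}  vs  training images: {n_train_images:,}")
# ~600k parameters against 50k images: more than 10 parameters per example.
```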

In fact, the 2016 winner of a bunch of the ILSVRC challenges [1,2] was
topologically basically the same as GoogLeNet.

EDIT: There's a perspective on machine learning which is basically just: "what
if your model learns a hash-map". Check out Vapnik-Chervonenkis dimension.

[0] [https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf](https://www.cs.unc.edu/~wliu/papers/GoogLeNet.pdf)
[1] [https://arxiv.org/pdf/1601.05150v2.pdf](https://arxiv.org/pdf/1601.05150v2.pdf)
[2] [http://image-net.org/challenges/LSVRC/2016/results](http://image-net.org/challenges/LSVRC/2016/results) (CUImage)

------
miles7
Is it possible that although neural nets can overfit, as this paper shows,
practitioners just stop training early before this happens? And/or use a
validation set? Would that be enough to explain the good generalization
despite the huge number of parameters?

~~~
argonaut
That doesn't explain why the network learns anything in the first place (e.g.
why it doesn't just overfit from the start, or learn some weak signals and
then start overfitting).

