
Measuring the Tendency of CNNs to Learn Surface Statistical Regularities - tim_sw
https://arxiv.org/abs/1711.11561
======
sytelus
Adversarial noise is becoming quite a crisis in the deep learning field and is
causing a lot of heated debate. There have been efforts at defenses using
distillation and denoising auto-encoders, but the only thing that has worked
well is actually training on adversarial examples themselves.
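As a rough illustration of what "adversarial noise" means here: the fast gradient sign method (FGSM) perturbs an input in the direction that increases the loss. A minimal numpy sketch on a toy logistic model rather than a real CNN (all weights and data below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image" classifier: logistic regression on flattened pixels,
# standing in for a trained network.
w = rng.normal(size=64)          # "learned" weights
x = rng.normal(size=64)          # an input image (flattened)
y = 1.0                          # its true label

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict(x):
    return sigmoid(w @ x)        # P(label = 1)

# Gradient of the cross-entropy loss w.r.t. the *input* x;
# for logistic regression this is (p - y) * w.
grad_x = (predict(x) - y) * w

# FGSM: take a small, uniformly bounded step that increases the loss.
eps = 0.25
x_adv = x + eps * np.sign(grad_x)

print(predict(x), predict(x_adv))   # confidence in the true label drops
```

Adversarial training, the defense mentioned above, just mixes such `x_adv` examples (with the correct labels) into the training set.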

However, the bigger insight is that deep networks aren't learning high-level
features. This paper presents very strong evidence that deep networks are just
fancier statistical models, as opposed to developing more abstract
representations. If you make a network deeper, you are just building a
higher-capacity model, not one with higher-level features. This is
quite damning... If you were hoping the current form of deep learning will
open the door to AGI, your hopes should be shattered by now.

Some things to investigate are (1) can proper regularization help turn
this around (for example, forcing gradients to be as small as possible), and
(2) are there other architectures trained by gradient descent that might
actually develop abstract features?

This is definitely crisis time in the field and therefore more interesting
than ever :).

~~~
red75prime
It would also be interesting to know whether there are people with
apperceptive visual agnosia who are susceptible to adversarial examples.

~~~
sharemywin
aren't optical illusions human adversarial examples?

~~~
hackinthebochs
It's probably fair to see them as such, but they operate in the semantic
domain rather than the pixel domain.

~~~
taeric
Do they? Consider the graffiti-style illusions that appear 3D to people.
Those seem to work precisely by lining up the "pixels" you see.

~~~
hackinthebochs
It's not seemingly random (meaningless) perturbations that cause
misclassification/misrecognition, but a very specific (meaningful) tuning of
features that serves to trigger some misrecognition. The distinction here is
whether the modified features themselves carry semantic information (i.e. have
a representation in the semantic domain with respect to the content of the
image). In the case of graffiti illusions, it seems they do: the
misrecognition is due to a particular kind of coherence with the observer's
perspective and the relative positioning/alignment of the features in the
image. Both perspective and relative alignment of features are meaningful.

~~~
taeric
I think you are somewhat shifting the goalposts, though. For adversarial
images, you are changing what you know the system is looking at, in such a
way that it is confused into "seeing" something else.

For humans, we just happen to somewhat understand the image that is being used
against us. How is this different?

~~~
hackinthebochs
The difficulty here is operationalizing the terms involved to make the
semantic-domain vs. pixel-domain point meaningful at all. But I think this
makes it a substantive distinction: having a representation in the semantic
domain with respect to the content of the image.

There are various semantic features of an image, and we presume that our
visual system is extracting and operating on these semantic features. So it
seems important to ask whether the features causing the misclassification can
be represented in a semantic domain of the image. Shapes, (relative) sizes,
gradients, high-level patterns, etc. are examples of semantic domains. There
seems to be a distinction between the kinds of optical illusions humans are
susceptible to and the kinds CNNs are. In the graffiti illusions, the
placement of the image that causes our 3D object recognition system to kick in
can be described through angles, lines, focal points, focal lengths, etc.
These are all semantic features of the scene depicted in an image. Contrast
this with CNN adversarial perturbations, which have zero semantic features
that we recognize. This seems like an important result. It means that our
visual system is robust against certain classes of adversarial images, namely
the small imperceptible deltas that trip up CNNs. To trip up our visual system
requires certain combinations of semantic input, which are harder to exploit
(a larger delta is harder to exploit).

~~~
taeric
Then I offer the ease with which we see faces where there aren't any. Or
constellations, for that matter?

I get your point to an extent. However, I think it is presented more strongly
than it is. Specifically, the semantics of imagery are not much more than a
non-frequency analysis of the pixel data your eye sees.

------
eggie5
Interesting paper that looks at the generalization abilities of deep CNN
architectures. They begin by highlighting the generalization gap, i.e. the
difference between the learning curves on the train and test sets, and how,
in deep CNN architectures, this gap is typically relatively small. They then
go on to highlight how this small generalization gap is often attributed to
the claim that deep CNNs learn high-level semantic meaning. They counter that
common notion with the recent and popular research on adversarial examples,
noting the high sensitivity to adversarial perturbations. If a CNN is
learning semantic meaning, then why does adding static to the image break it?
Also related is the recent research by Zhang, Chiyuan, et al.,
"Understanding deep learning requires rethinking generalization," in which
they show that CNN architectures can perfectly fit random labels, which leads
further toward the conclusion that the generalization capabilities of CNNs
are currently not well understood by the community. They then go on to
introduce their experiment to try to isolate what CNNs are doing.

[http://www.eggie5.com/129-Paper-Review-Measuring-the-tendenc...](http://www.eggie5.com/129-Paper-Review-Measuring-the-tendency-of-CNNs-to-Learn-Surface-Statistical-Regularities)
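The Zhang et al. random-label result doesn't even require a deep network to reproduce in miniature: any model with more parameters than training points can memorize arbitrary labels. A toy numpy sketch (the sizes here are arbitrary, and a linear model stands in for a CNN):

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 50, 200                      # more parameters than samples
X = rng.normal(size=(n, d))         # random "images"
y = rng.integers(0, 2, size=n)      # completely random labels

# Least squares; with d > n the system is underdetermined, so there is
# (almost surely) a weight vector fitting the labels exactly.
w, *_ = np.linalg.lstsq(X, y.astype(float), rcond=None)

preds = (X @ w > 0.5).astype(int)
train_acc = (preds == y).mean()
print(train_acc)                    # 1.0 -- random labels fit perfectly
```

Of course this memorization says nothing about test accuracy, which is exactly the point: capacity alone explains the fit, no abstraction needed.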

------
anon8
After stating their hypothesis, the authors disclaim the issues with CNNs as
follows:

 _... we feel the need to stress that it is not fair to compare the
generalization performance of CNN to a human being. In contrast to a CNN, a
human being is exposed to an incredibly diverse range of lighting conditions,
viewpoint variations, occlusions, among a myriad of other factors._

Why is it that evolutionary heritage is so rarely stated as an important
factor in these comparisons (as far as I have seen)? I appreciate that
evolution can be framed as another form of learning, but it is much more
powerful than that employed by standard CNNs, in that it can change the
network structure (and the I/O interface) as well as the network weights.

------
moultano
Humans do well with blurred images because our vision is blurry everywhere
except our fovea and our current focal plane. So I'm a bit skeptical that
this tells us something about the difference between CNNs and humans,
especially since the models did seem to adapt well to the modified domain
when trained on it.

~~~
hackinthebochs
>Humans do well with blurred images because our vision is blurry everywhere
except our fovea and our current focal plane.

I don't see how this follows. What we're not focused on is blurred, yes, but
what we're focused on is at least partially related to our ability to
recognize patterns (e.g. reading from peripheral vision is very difficult).

The fact that CNNs are tricked by imperceptible perturbations while the
semantic content is held constant is highly informative; don't reject it for
superficial reasons.

The standard fare of CNN+ReLU+residuals makes for very powerful modelling
tools, but that also means they're prone to modelling degenerate regularities
if they exist. This paper shows that such regularities do exist and that
these models are picking up on them to at least some significant extent.

~~~
moultano
>I don't see how this follows. What we're not focused on is blurred, yes, but
what we're focused on is at least partially related to our ability to
recognize patterns (e.g. reading from peripheral vision is very difficult).

Object detection from peripheral vision is not difficult though. I think
you're overestimating how much of our vision is actually clear.

>The fact that CNNs are tricked by imperceptible perturbations while the
semantic content is held constant is highly informative information, don't
reject it for superficial reasons.

Yes, but that is not this paper. These perturbations are not imperceptible,
and unlike other adversarial examples the model adapted well when allowed to
train on them.

Also, it looks like the model did reasonably well on the randomly filtered
versions, only failing on the blurred versions. The randomly filtered images
looked much more corrupted to me than the blurred ones, which is consistent
with blurred images being part of the training regime for the human visual
system, but not randomly filtered ones.

~~~
yorwba
The radial masking in the Fourier domain is pretty much imperceptible for me,
but it seems to cause the most problems for the CNNs.

~~~
moultano
I'm pretty sure that's because you are constantly trained on that type of
image in your day-to-day life. "Radial masking in the Fourier domain" is
pretty much identical to an image that is out of focus or seen with a
low-resolution part of your eye (most of your vision).
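For concreteness: radial masking in the Fourier domain means zeroing every Fourier coefficient beyond some radius from the DC component, which is exactly an ideal low-pass filter, i.e. a blur. A numpy sketch on a random stand-in image (the image size and cutoff radius here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.random((64, 64))                # stand-in for a grayscale image

# Transform to the (centered) Fourier domain.
f = np.fft.fftshift(np.fft.fft2(img))

# Radial mask: keep only frequencies within `radius` of the DC component.
h, w = img.shape
yy, xx = np.mgrid[:h, :w]
r = np.hypot(yy - h // 2, xx - w // 2)    # distance from the center
radius = 16
mask = r <= radius

# Zero the high frequencies and invert: an ideal low-pass filter (a blur).
filtered = np.fft.ifft2(np.fft.ifftshift(f * mask)).real
```

Everything outside the retained disk is gone, yet `filtered` typically still looks like a slightly soft copy of the original to a human, which is the perceptual point being made above.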

