
Why do CNNs generalize so poorly to small image transformations? - mathattack
https://arxiv.org/abs/1805.12177
======
Eridrus
This paper is definitely quite interesting, but before this thread turns into
a bunch of NN hate, here is another take:

"Do CIFAR-10 Classifiers Generalize to CIFAR-10?" \-
[https://arxiv.org/abs/1806.00451](https://arxiv.org/abs/1806.00451)

They follow the same procedure used to construct CIFAR-10 to build a new test
set, then evaluate a bunch of state-of-the-art models on it.

They see a generalization gap, but the relative order of SoTA results remains
(roughly) the same.

So, yes, test set validation is an overestimate of real-world performance of
these systems, but progress on test sets is indicative of progress in real-
world settings.

And to keep this in context, no one is asking "do traditional computer vision
systems understand images", because those systems were explicitly just looking
at image statistics.

~~~
Cybiote
As you point out, the fact that the relative ordering remained stable means
the progress being made is not simply overfitting to the test set. The
negative finding from the paper you link is how much accuracy drops
considering how slight the modifications to the test set were. In their own
words:

> We view this gap as the result of a small distribution shift between the
> original CIFAR-10 dataset and our new test set. The fact that this gap is
> large, affects all models, and occurs despite our efforts to replicate the
> CIFAR-10 creation process is concerning.

> Nevertheless, the accuracy of all models drops by 4 - 15% and the relative
> increase in error rates is up to 3×. This indicates that current CIFAR-10
> classifiers have difficulty generalizing to natural variations in image
> data.

It remains to be seen whether others can think up ways of maintaining
performance that the authors did not manage to find.

------
denzil_correa
One of the cited papers, "Measuring the tendency of CNNs to Learn Surface
Statistical Regularities", is also a great insight into this phenomenon [0].

> Our main finding is that CNNs exhibit a tendency to latch onto the Fourier
> image statistics of the training dataset, sometimes exhibiting up to a 28%
> generalization gap across the various test sets. Moreover, we observe that
> significantly increasing the depth of a network has a very marginal impact
> on closing the aforementioned generalization gap. Thus we provide
> quantitative evidence supporting the hypothesis that deep CNNs tend to learn
> surface statistical regularities in the dataset rather than higher-level
> abstract concepts.

[0] [https://arxiv.org/abs/1711.11561](https://arxiv.org/abs/1711.11561)
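
To make "Fourier image statistics" concrete, here is a minimal numpy sketch
(my own illustration, not that paper's pipeline) of a radial low-pass filter:
it leaves an image looking nearly the same to a human while altering exactly
the high-frequency statistics a CNN may have latched onto.

    import numpy as np

    def radial_low_pass(img, cutoff=0.25):
        # Zero out spatial frequencies beyond `cutoff` (fraction of Nyquist).
        # img: 2D float array (grayscale). Illustrative sketch only.
        f = np.fft.fftshift(np.fft.fft2(img))
        h, w = img.shape
        yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
        radius = np.sqrt((yy / (h / 2)) ** 2 + (xx / (w / 2)) ** 2)
        f[radius > cutoff] = 0  # discard high-frequency content
        return np.real(np.fft.ifft2(np.fft.ifftshift(f)))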

------
gwern
It's interesting that they identify striding as the culprit. Striding is also,
according to some people at Google AI, the reason why VGG is one of the few
good CNNs for doing style transfer (which is otherwise quite mysterious:
[https://www.reddit.com/r/MachineLearning/comments/7rrrk3/d_e...](https://www.reddit.com/r/MachineLearning/comments/7rrrk3/d_eat_your_vggtables_or_why_does_neural_style/)).

------
fallingfrog
Time shift invariance, which is used when you assume that past patterns
predict future results, is one form of invariance. Space shift invariance and
rotation invariance are others. You have to specifically program your NN to
look for time shift invariance; it stands to reason you'd want to design your
architecture for the spatial invariances too. In other words: it's not
reasonable to expect a neural net to derive space and rotation invariance from
first principles, and I'd be very surprised to learn that the human brain
didn't have special purpose hardware for accomplishing those things.
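
To make "programming the invariance in" concrete, here is a minimal numpy
sketch (my illustration, not from the thread): circular convolution is
shift-equivariant, i.e. shifting the input just shifts the output. That is
the property convolutional layers bake in for space, the same way weight
sharing across time steps bakes it in for sequences.

    import numpy as np

    def circ_conv(x, k):
        # Circular convolution via FFT, so the shift property holds exactly.
        return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(k, len(x))))

    x = np.random.randn(64)
    kernel = np.array([1.0, -2.0, 1.0])
    y = circ_conv(x, kernel)
    y_of_shifted = circ_conv(np.roll(x, 5), kernel)
    assert np.allclose(np.roll(y, 5), y_of_shifted)  # conv(shift) == shift(conv)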

~~~
dr_zoidberg
You made me think of the Margaret Thatcher illusion, the best example of
which I found features Dr. Phil Plait[0]. Funny thing, the idea of Capsule
Networks (Hinton et al.)[1, 2] seems to tackle some of these issues, though
they're a bit young yet and need more work.

[0] [https://www.opticalspy.com/opticals/dr-phil-plait-optical-illusion](https://www.opticalspy.com/opticals/dr-phil-plait-optical-illusion)

[1] [https://arxiv.org/abs/1710.09829](https://arxiv.org/abs/1710.09829)

[2] [https://hackernoon.com/what-is-a-capsnet-or-capsule-network-2bfbe48769cc](https://hackernoon.com/what-is-a-capsnet-or-capsule-network-2bfbe48769cc)

~~~
fallingfrog
I think that convolutional networks are really mostly a way to address space
shift invariance too; and I even wonder if the same job could be done for
rotations by some special-purpose code that just runs the underlying neural
net on a bunch of different rotations of the same image. That's probably how
they do it now... I feel like that's probably close to the optimal approach.
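
That "run it under a bunch of rotations" trick exists as test-time
augmentation. A minimal sketch, assuming a `model` callable that maps a batch
of images to class probabilities (the names here are mine):

    import numpy as np

    def predict_rotation_averaged(model, image, angles=(0, 90, 180, 270)):
        # model: callable taking (N, H, W, C) -> (N, num_classes).
        # image: a single (H, W, C) array. Right-angle rotations via
        # np.rot90 keep the pixel grid exact; other angles need resampling.
        batch = np.stack([np.rot90(image, k=a // 90) for a in angles])
        return model(batch).mean(axis=0)  # average over rotated copies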

------
calebh
I've recently become interested in permutation invariant neural networks.
There has been very little work in this area - just PointNet and a few
derivatives.
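
For anyone curious, the core idea fits in a few lines: run the same per-point
network over every point, then collapse with a symmetric pooling op, so no
permutation of the input can change the output. A toy numpy sketch in the
PointNet spirit (my own, not the PointNet code):

    import numpy as np

    def set_embed(points, W1, W2):
        # points: (N, d); W1: (d, h); W2: (h, k). The max over the point
        # axis is symmetric, so shuffling rows of `points` can't change it.
        h = np.maximum(points @ W1, 0)  # shared per-point MLP (ReLU)
        h = np.maximum(h @ W2, 0)
        return h.max(axis=0)            # symmetric pooling over points

    rng = np.random.default_rng(0)
    pts = rng.normal(size=(128, 3))
    W1, W2 = rng.normal(size=(3, 64)), rng.normal(size=(64, 32))
    assert np.allclose(set_embed(pts, W1, W2),
                       set_embed(rng.permutation(pts), W1, W2))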

Anyway, I think that neural networks are now entering the trough of
disillusionment as people begin to discover the limitations. Maybe in the
future, somebody will come up with a new machine learning architecture that
has better generalization. I'm not expecting gradient descent to give us
general AI.

~~~
salawat
The main cause of the generic brittleness of neural networks is probably in
the way they are utilized. Biological neural nets never really stop learning.
They slow down, even "forget" in order to restructure, but they change
constantly. A static neural net is basically a snapshot of its environment
(training data).

Very interesting consequences for the ML field if my hunch has anything
remotely resembling a kernel of truth to it.

~~~
xapata
That's a factor, but it doesn't explain the phenomenon that human-
imperceptible transformations of an image can dramatically shift a NN's
outputs.

~~~
salawat
Again, look to biology. That thing you are modeling.

Humanly "imperceptible" is a very loaded term. Human perception has billions
upon billions of networks worth of filtering going before we even boil down
our environment to the "interesting" stuff.

Furthermore, if you take a snapshot of that network after training, you're
fit to the training data. The network has lost its plasticity. Take a potato,
put it on the ground, train the network on other potato shots. Now show it a
potato-shaped asteroid. Now show it a French fry. What is the potato-ness that
this potato-detector is ACTUALLY homing in on? Keep in mind, this structure is
trained on digital encodings of maps of light and color. The function may not
be a perfect semantic detector of potato-ness. It just knows what patterns of
bits MIGHT be potatoes. And when you are working on bit-level encodings, one
bit translates to a lot of change, even if it is imperceptible to a human
looking at a rendering on a screen.

Heck, there is no guarantee that the function it's emulating is well defined
outside the training data set.

Neural networks are GOING to be fickle. You're trying to coerce "reliable,
repeatable, generalizable results" out of a simulation of the same stuff that
drives five year olds and emotional people. Consider yourself lucky the
program hasn't opened the CD tray and demanded you insert crayons.

~~~
xapata
> You're trying to coerce "reliable, repeatable, generalizable results" out
> of a simulation of the same stuff that drives five year olds and emotional
> people.

The "neural network" machine learning technique is not a simulation of
biology. It's a nice marketing phrase. The technique is just math, maybe
"inspired" by someone thinking about neurons.

Don't be misled by branding. For a good explanation of why some of these
fanciful science terms come about, read Bellman's explanation of why he called
his research "dynamic programming".

~~~
salawat
It isn't just branding though. It is a mathematical representation of a
synaptic action potential.

The main difference, of course, being that you are implementing it in silicon
rather than carbon, and having no bloody grammar or conception of how in the
heck to explain why 'X set of nodes with Y set of weights and Z activation
threshold function' = something useful.

They called them neural networks for a reason. It's been quite a mystery on
both the biology (in terms of gray matter) and computing (in terms of ANNs)
fronts why it works at all. But it does.

I think I've read Bellman's work before, but I'll take a look. Thanks for the
pointer.

~~~
xapata
It's got some loose relationship to action potential, but to a network of
neurons? It's a leaky abstraction.

~~~
salawat
It's been a while since I've been in the literature (about 7 years to be
exact).

Just can't find enough hours in the day to keep on top of the state-of-the-
art, and unfortunately the career hasn't budged me in the direction of
anywhere I could weasel work on it into my day.

C'est la vie.

------
milani
I expected a reference to deformable convnets and their variants[1], which
try to learn natural transformations.

[1] [https://arxiv.org/abs/1703.06211](https://arxiv.org/abs/1703.06211)

------
candiodari
It's funny but looking at those prediction graphs bouncing up and down like
crazy ... one immediately thinks "yep I know, that's exactly what they said
would happen if I used polynomials for fitting functions". And yes, that's
what polynomials do.

They don't have good local behavior: if a -> f(a) and b -> f(b) are fitted
with a high-degree polynomial, then between a and b the fitted curve can swing
to arbitrarily large values (Runge's phenomenon). At very high degrees (i.e.
deep networks) there will probably be many such wild swings.
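
The classic demonstration (a standard Runge's phenomenon example, sketched
here in numpy) is to interpolate a perfectly smooth function with a
high-degree polynomial through equally spaced points and watch the error
explode between the samples:

    import numpy as np

    f = lambda x: 1.0 / (1.0 + 25.0 * x ** 2)  # Runge's function
    xs = np.linspace(-1, 1, 15)
    coeffs = np.polyfit(xs, f(xs), deg=len(xs) - 1)  # degree-14 interpolant

    dense = np.linspace(-1, 1, 1001)
    err = np.abs(np.polyval(coeffs, dense) - f(dense))
    print(err.max())  # large near the interval edges, despite exact fit at xs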

Intuitively I think of it like this: if you look at the "real world" as a
function F(x), you can make a couple of very general observations. F(x) itself
doesn't have very much information about the world, and it's very hard to make
sense of; most of it just doesn't seem relevant. dF/dx ... much more relevant.
d^2F/dx^2 ... also pretty interesting. d^3F/dx^3 less interesting but
occasionally important. d^4F/dx^4 ... nobody cares (also, if you take camera
images and calculate this, it'll be almost exclusively zeroes).

Secondly, there are strong "domains" in the real world that we just seem to be
unwilling to accept. Polynomials are good in the sense that if you get the
equation for a stone dropping onto your foot really, really tightly fitted,
that equation holds up for the movement of an entire planet, which is great.
But why bother? It is much more valuable to be able to predict whether a
stone will fall on my foot than how Venus will move. That's if you get it
right. If you get the polynomial degree of your equation wrong... it makes
utterly ridiculous predictions. Many other approximation methods don't suffer
from this problem.

This doesn't happen with splines or Béziers; even Taylor approximations have
better behavior.

------
klausjensen
For those who (like me) did not know what CNN is:

In machine learning, a convolutional neural network (CNN, or ConvNet) is a
class of deep, feed-forward artificial neural networks, most commonly applied
to analyzing visual imagery.

(Source:
[https://en.wikipedia.org/wiki/Convolutional_neural_network](https://en.wikipedia.org/wiki/Convolutional_neural_network))

------
candiodari
"How do humans do small image transformations ?"

Perhaps the answer is simple : REM (the awake variant), and
[https://www.youtube.com/watch?v=quJEyTvDdfY](https://www.youtube.com/watch?v=quJEyTvDdfY)

------
amelius
IIUC, CNNs are just a computational trick to reduce the number of parameters
in the network and to train for all possible translations at once.

So how would a network perform w.r.t. translations if it was expanded, i.e.,
topology similar to the CNN but with all parameters expanded, and trained on
translated images?
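
For scale, here is a back-of-the-envelope sketch (illustrative numbers, all
mine) of what that expansion implies: a conv layer applies one small shared
filter at every position, while the expanded, untied version would store a
separate filter per output location.

    H = W = 32           # input size
    K = 5                # filter size
    out = H - K + 1      # output size of a "valid" convolution

    shared = K * K               # one 5x5 filter reused at every position
    untied = out * out * K * K   # a separate 5x5 filter per output pixel
    print(shared, untied)        # 25 vs 19600 parameters, per channel pair

Training the untied version on translated copies of the data would
essentially be asking it to rediscover the weight tying that the conv layer
hard-codes.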

------
d--b
Perhaps, I'm saying just perhaps, there is a reason why humans are good at
discerning things in the very narrow field of vision that's straight ahead,
and not very good at peripheral vision.

Pointing first and then classifying may help solve those issues. Maybe?

~~~
PeterisP
The reason why humans are good at discerning things in the very narrow field
of vision that's straight ahead, and not very good at peripheral vision, is
biological: there's simply a much lower density of receptors ("pixels") in the
periphery, so there's much less sensory information there for the brain to
work with.

------
zer0faith
I thought this reference was for the CNN news network. Silly me...

------
PredictorY
It's worth noting that the actual title of that essay is, "Why do deep
convolutional networks generalize so poorly to small image transformations?"

~~~
sctb
Thanks! We've un-generalized the submission title from “Why do neural networks
generalize so poorly?”.

------
bjornsing
> Taken together our results suggest that the performance of CNNs in object
> recognition falls far short of the generalization capabilities of humans.

No shit! An earth shattering result. :P

------
John_KZ
Let me save you some time on why:

> While VGG16 has 5 pooling operations in its 16 layers, Resnet50 has only
> one pooling operation among its 50 intermediate layers and InceptionResnetV2
> has only 5 among its intermediate 134 layers.

In other words, being computationally cheap with pooling causes weird
sampling (aliasing) issues.
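
To see the sampling issue concretely, here is a tiny numpy sketch (my
illustration, not from the paper): stride-2 max pooling maps a signal and its
one-sample shift to outputs that are not shifts of each other, so a tiny
translation changes what downstream layers see.

    import numpy as np

    def max_pool_stride2(x):
        # 1D max pooling, window 2, stride 2.
        return x.reshape(-1, 2).max(axis=1)

    x = np.array([0, 0, 1, 9, 1, 0, 0, 0], dtype=float)
    print(max_pool_stride2(x))              # [0. 9. 1. 0.]
    print(max_pool_stride2(np.roll(x, 1)))  # [0. 1. 9. 0.] -- not a shift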

The paper has a terribly pompous title that's just wrong. Modern CNNs
generalize wonderfully to small translations. They just managed to break a
couple of old CNNs and went on to claim they broke AI research or something.

~~~
microtherion
The paper claims the opposite: "[...] jaggedness is greater for the modern,
deeper, networks compared to the less modern VGG16 network. While the deeper
networks have better test accuracy, they are also less invariant."

