
Machine Learning Confronts the Elephant in the Room - TheAuditor
https://www.quantamagazine.org/machine-learning-confronts-the-elephant-in-the-room-20180920/
======
ivan_gammel
Please correct me if I'm wrong, but isn't the whole machine learning testing
approach based on recognizing the whole image at once, in full resolution, in
one attempt? If so, that's not how animals see and observe the world, and this
could be a key difference between AI and living beings.

What if an algorithm actually had focused sight? Look at the brightest
feature, blur the rest, then try to find other distinctive details, looking
closer into parts of the image until most of it becomes clear. Let another
neural network control the scanning process and loop until it develops
some confidence in the result. Can it work that way?
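
To make that concrete, here's a minimal sketch of such a loop in Python/numpy. Everything in it is a hypothetical illustration: "classify" stands in for whatever trained model would actually be plugged in, and "brightest region" is just the simplest possible saliency heuristic.

    import numpy as np

    def glimpse_loop(image, classify, patch=32, max_steps=5, threshold=0.9):
        # Repeatedly crop around the brightest not-yet-explored region and
        # reclassify, stopping once the model reports enough confidence.
        seen = np.zeros(image.shape[:2], dtype=bool)
        label, conf = classify(image)               # first look: whole scene
        for _ in range(max_steps):
            if conf >= threshold:
                break
            saliency = image.mean(axis=-1) * ~seen  # crude "brightest feature"
            y, x = np.unravel_index(saliency.argmax(), saliency.shape)
            y0, x0 = max(0, y - patch // 2), max(0, x - patch // 2)
            crop = image[y0:y0 + patch, x0:x0 + patch]
            seen[y0:y0 + patch, x0:x0 + patch] = True
            label, conf = classify(crop)            # take a closer look
        return label, conf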

~~~
ThePhysicist
Yes, but children can pass these tests when presented with a single photo as
well (if I understand the setup correctly), so I'd say it's a fair comparison.
The algorithm can also "look" at the picture as long as it wants, e.g. using
many different convolutional filters that can "see" features in different
parts of the image and then combining these in later stages of the analysis.

~~~
ivan_gammel
How close are those convolutional filters to the physical sight of animals?
Were they designed specifically to mimic nature, or are they an artificial
concept with a similar purpose?

~~~
TeMPOraL
They may be close to the first stages of the sight pipeline. Maybe even close
to the full pipeline for some simple animals. For humans, they're pretty much
the input-processing stage. Human vision (and presumably animal vision too)
has _runtime feedback_ that uses the the output of the vision process to fix
up the input. That's what makes you suddenly see things if someone tells you
what to look for, or suddenly unsee things. That's why you didn't notice I
repeated the the word "the" in a few places in this comment.

As the intro to the the article says (you now noticed, didn't you?), the way
DNN vision systems are different is that "unlike humans, they can’t do a
double take". And those "double takes" are what our vision pipeline does all
the time.

~~~
joejerryronnie
Very clever young man, but it's turtles all the way down!

------
hydrox24
I know very little about machine vision, so forgive the naïveté of this
question:

> an ability humans have that AI lacks: the ability to understand when a scene
> is confusing and thus go back for a second glance.

Wouldn't the machine return a low confidence score when a scene is confusing?
If not, why is this difficult to get around?

If so, why can't we just call this a computational-difficulty problem, one
simply requiring a way to go back and spend more effort later when required? A
problem like this would simply need better computers, and patience.

In other words, is this problem as deeply rooted as the article suggests? Or
is it simply a problem with the popular approach to machine vision?

~~~
skierscott
> a low confidence score

Neural nets should return a low confidence score, but the popular approach
(described below) ignores it. Neural nets ignore confidence because of a
technique called softmax [1].

This happens as the final operation of a neural net, and is required for
training.

Softmax is a tool to make an array of positive numbers look like a probability
distribution:

    out = x / x.sum()

x[i] is a class prediction score, but x.sum() != 1. Say the network is
uncertain: x[cat, dog] = [0.03, 0.01]. These are small values that do not
imply great confidence (the network was trained on vectors with out.sum() = 1).
The network would still predict "cat" using softmax, because out[cat] = 0.75 >
0.25 = out[dog].

But then in inference/prediction, the confidence is ignored. What if x.sum()
is small? That would imply that the network is uncertain.
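
A toy numpy illustration of that point (my own sketch, not any particular framework's API): the normalized output looks confident even when the raw scores are tiny.

    import numpy as np

    x = np.array([0.03, 0.01])  # raw scores for [cat, dog]: tiny, i.e. uncertain
    out = x / x.sum()           # -> [0.75, 0.25], looks confident
    print(out.argmax())         # 0 ("cat"), while x.sum() == 0.04 is ignored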

[1]:
[https://en.m.wikipedia.org/wiki/Softmax_function](https://en.m.wikipedia.org/wiki/Softmax_function)

~~~
TeMPOraL
Is this what softmax is? Simply dividing a vector by the sum of its
components? If so, then how does it deserve a _name_, not to mention a long
Wikipedia page full of formulas?

~~~
CodesInChaos
Softmax has two components:

1\. Transform the components to e^x. This allows the neural network to work
with logarithmic probabilities instead of ordinary probabilities, turning the
common operation of multiplying probabilities into addition, which is far more
natural for the linear-algebra-based structure of neural networks.

2\. Normalize their sum to 1, since that's the total probability we need.

One important consequence of this is that Bayes' theorem is very natural to
such a network, since it's just multiplication of probabilities normalized by
the denominator.

The trivial case of a single layer network with softmax activation is
equivalent to logistic regression.

The special case of a two-component softmax is equivalent to sigmoid
activation, which is thus popular when there are only two classes. In
multi-class classification, softmax is used if the classes are mutually
exclusive, and component-wise sigmoid is used if they are independent.
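
For reference, a minimal numpy version of what's described above (my own sketch, not from the article):

    import numpy as np

    def softmax(z):
        e = np.exp(z - z.max())  # exponentiate (shifted for numerical stability)
        return e / e.sum()       # normalize so the outputs sum to 1

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    z = np.array([2.0, -1.0])    # logits for two classes
    print(softmax(z)[0])         # 0.9525...
    print(sigmoid(z[0] - z[1]))  # 0.9525... -- two-component softmax reduces
                                 # to a sigmoid of the logit difference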

~~~
TeMPOraL
Thanks for the detailed explanation!

------
PavlikPaja
First, the chair wasn't replaced with a couch - you can see there is yet
another rectangle just a few pixels to the left, that likely says "chair".

Second, even many people are surprisingly bad at decoding incongruous scenes,
which is why hidden object games are a thing.

~~~
CodesInChaos
The disappearing cup and book are caused by a similar issue. They "only show
detection results with confidence value that exceeds 0.5". The cup is shown at
50% in the original image, so it "disappears" if the presence of the elephant
lowers its score even slightly. Same for the disappearing books.

Presumably the couch showed up because it rose from <0.5 to 0.57. They don't
tell us what the score is for the chair, it might even be higher than the
score for the couch, since their visualization clearly doesn't sort by
probability.
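
In other words, the visualization applies a hard cutoff, something like this toy snippet (all scores made up for illustration):

    before = {"cup": 0.51, "book": 0.52, "couch": 0.44}
    after = {"cup": 0.49, "book": 0.47, "couch": 0.57}  # elephant pasted in

    def shown(scores, threshold=0.5):
        # Only detections whose confidence exceeds the threshold get drawn.
        return [label for label, score in scores.items() if score > threshold]

    print(shown(before))  # ['cup', 'book'] -- couch stays hidden
    print(shown(after))   # ['couch'] -- cup and book "disappear"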

------
weregiraffe
Why can't you just train the network on artificially altered photos, with
"elephants" randomly scattered around, until it is robust to them?

~~~
hnhg
Because there's always another type of 'elephant' to be thrown in. The
networks are just not inherently robust against unexpected features (I'll call
them outliers for want of a better term), and it's this lack of robustness
that needs to be fixed, not the variety and frequency of outliers in the
training data.

~~~
weregiraffe
>Because there's always another type of 'elephant' to be thrown in.

Well, you could randomly generate and render 3D objects, and paste them into
photos. This will give you a huge space of 'elephants'.
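
Something like this, perhaps - a toy numpy sketch of pasting a rendered RGBA distractor into a photo (every name and detail here is made up; a real pipeline would also randomize scale, lighting, and so on):

    import numpy as np

    def paste_distractor(photo, distractor, rng=np.random.default_rng()):
        # Alpha-blend a rendered RGBA "elephant" at a random position.
        H, W, _ = photo.shape
        h, w, _ = distractor.shape
        y = rng.integers(0, H - h + 1)
        x = rng.integers(0, W - w + 1)
        rgb, alpha = distractor[..., :3], distractor[..., 3:] / 255.0
        out = photo.astype(float).copy()
        region = out[y:y + h, x:x + w]
        out[y:y + h, x:x + w] = alpha * rgb + (1.0 - alpha) * region
        return out.astype(photo.dtype)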

~~~
YeGoblynQueenne
If what you propose were possible - if it were possible to overcome the
problem described in the above article by using 3d models of real-world
objects - then it would also be possible to train on those models in the first
place, i.e. we wouldn't need to train machine vision algorithms like CNNs on
large datasets of digital stills or photographs; we'd just generate millions
of 3d models of objects and scenes of interest, already fully annotated for
supervised learning, and train on those. But, as you can probably tell, this
is not what happens, and instead we have to make do with "real" images,
painstakingly collected and annotated "by hand" with the classes of objects
within them.

What you propose is, essentially, training on a simulation. That doesn't work
very well, because statistical machine learning algorithms generalise very
poorly from training data to unseen data; when your training data is a
simulation (such as a 3d model of an object) and the unseen data is the real
world (such as a video still, or a digital photograph of that object), the
resulting model will be very bad at predicting the unseen data, i.e. at
identifying objects in images of the real world.

It might seem to you (and to me!) that modern 3d simulations (as in high-end
CGI and video games) are a very faithful simulation of the real world, but
that's primarily because we look at modern 3d through human eyes and perceive
it with our (presumably) human brains. We don't know how that works exactly,
so we can't reproduce the process in computers yet; therefore machine vision
systems essentially "see" and "perceive" something completely different than
what we do. In the end, what looks like a very close approximation to you is
pretty much useless for them - or in any case, they are not able to learn to
generalise from that to real objects.

~~~
weregiraffe
>It might seem to you (and to me!) that modern 3d simulations (as in high-end
CGI and video games) are a very faithful simulation of the real world, but
that's primarily because we look at modern 3d through human eyes and perceive
it with our (presumably) human brains. We don't know how that works exactly,
so we can't reproduce the process in computers yet; therefore machine vision
systems essentially "see" and "perceive" something completely different than
what we do. In the end, what looks like a very close approximation to you is
pretty much useless for them - or in any case, they are not able to learn to
generalise from that to real objects.

This really is not what I expected, being a layman.

Do you have any more data on this topic of training on a simulation?

~~~
YeGoblynQueenne
"Data", not so much, but you can explore the subject by searching for "reality
gap in machine learning". To be honest there are not many references to it
that I could find online, but this post by the Google AI blog starts with a
very good summary (then proceeds to propose one way to overcome the
difficulties of training on simulations, though of course the problem is still
far from solved):

[https://ai.googleblog.com/2017/10/closing-simulation-to-real...](https://ai.googleblog.com/2017/10/closing-simulation-to-reality-gap-for.html)

 _Simulating many years of robotic interaction is quite feasible with modern
parallel computing, physics simulation, and rendering technology. Moreover,
the resulting data comes with automatically-generated annotations, which is
particularly important for tasks where success is hard to infer automatically.
The challenge with simulated training is that even the best available
simulators do not perfectly capture reality. Models trained purely on
synthetic data fail to generalize to the real world, as there is a discrepancy
between simulated and real environments, in terms of both visual and physical
properties. In fact, the more we increase the fidelity of our simulations, the
more effort we have to expend in order to build them, both in terms of
implementing complex physical phenomena and in terms of creating the content
(e.g., objects, backgrounds) to populate these simulations. This difficulty is
compounded by the fact that powerful optimization methods based on deep
learning are exceptionally proficient at exploiting simulator flaws: the more
powerful the machine learning algorithm, the more likely it is to discover how
to "cheat" the simulator to succeed in ways that are infeasible in the real
world. The question then becomes: how can a robot utilize simulation to enable
it to perform useful tasks in the real world?_

 _The difficulty of transferring simulated experience into the real world is
often called the "reality gap." The reality gap is a subtle but important
discrepancy between reality and simulation that prevents simulated robotic
experience from directly enabling effective real-world performance. Visual
perception often constitutes the widest part of the reality gap: while
simulated images continue to improve in fidelity, the peculiar and
pathological regularities of synthetic pictures, and the wide, unpredictable
diversity of real-world images, makes bridging the reality gap particularly
difficult when the robot must use vision to perceive the world, as is the case
for example in many manipulation tasks._

Note the bit about deep learning algorithms being very proficient "cheaters",
which I missed in my comment above. Indeed, one way to fail to generalise from
a simulation to the real world is to "overfit" to the defects in the
simulation!

Like the linked blog post, most material you are likely to find online focuses
on training robots with (deep) Reinforcement Learning, I think because that
just happens to be one domain where it is even harder to collect training data
than in good old supervised learning for image recognition. I can find
virtually no source referring to the "reality gap" in the context of pure
machine vision research - it's just not the done thing to train vision
algorithms on simulated data, for the reasons described above, so consequently
it's very difficult to find hard data on why it's not done.

A good related source (with a bazillion references) is the following blog
post, discussing the difficulties of deep RL, which is primarily trained on
simulated environments:

[https://www.alexirpan.com/2018/02/14/rl-hard.html](https://www.alexirpan.com/2018/02/14/rl-hard.html)

The post makes only passing reference to the "reality gap", but it should give
a good idea about the ins and outs of training in simulated environments.

~~~
YeGoblynQueenne
Oh, and, regarding the issue of discrepancies in perception between humans and
machine vision systems, this is a good source, albeit on the short side (well,
the bit that talks about perception anyway):

[https://rodneybrooks.com/forai-steps-toward-super-intelligen...](https://rodneybrooks.com/forai-steps-toward-super-intelligence-iii-hard-things-today/)

------
vinayms
I am not an expert in AI/ML, just a casual observer, but I didn't like the
tone of the article. Not only does it fixate on the speed of processing, it
also seems smug and acts as if we know how the human brain sees and processes
images, while all we have is a bunch of conjectures. That was distracting.

~~~
mcguire
Well, we do know that the brain does something that CV systems don't, and that
this gap played a part in a recent self-driving car accident.

------
kakarot
I'm confused about what's so special here.

This is analogous to a person from the early days of photography having no
knowledge of the possibility of image doctoring, and us feeling "smug" because
a photo like this would elicit confusion.

It seems to me you just have to run your NN on pairs of doctored and
undoctored images, and create a separation of concerns between scene
continuity and object recognition. It's likely that a lot of implementations
currently rely too much on surrounding context for successful categorization.

It's also worth exploring certain image analysis techniques during
pre-processing to draw out certain attenuated aspects. Edge detection is par for
the course, but things like shading are also important to attenuate so that
the NN has an easier time learning.

If we're dealing with multiple frames and not just a single image, this opens
up a whole dimension of temporal analysis that should make it trivial to
separate the background from juxtaposed images.
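
For instance, edge detection as a pre-processing pass can be as simple as a Sobel filter. A plain numpy sketch (far slower than any real implementation, just to show the idea):

    import numpy as np

    def sobel_edges(gray):
        # Gradient magnitude via 3x3 Sobel kernels; high values mark edges.
        kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
        ky = kx.T
        h, w = gray.shape
        out = np.zeros((h - 2, w - 2))
        for i in range(h - 2):
            for j in range(w - 2):
                window = gray[i:i + 3, j:j + 3]
                out[i, j] = np.hypot((window * kx).sum(), (window * ky).sum())
        return out  # could be stacked with the raw image as extra NN input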

~~~
PurpleBoxDragon
>It seems to me you just have to run your NN on pairs of doctored and
undoctored images, and create a separation of concerns between scene
continuity and object recognition.

This may fix this problem, but what other visual issues will we uncover that
might be fixable with the right kind of training, but which we haven't yet
accounted for?

For example, we see so many optical illusions because of how our eyes and
brain (and the neurons between them) work. These are oftentimes caused by our
sensation or perception picking up data incorrectly, but on purpose (well, as
much purpose as an evolved system can have). For example, neurons in the eye
exaggerate differences in brightness, and we are so oversensitive to
perceiving faces that we see them even where they don't exist. What impact
does the lack of these have on an artificial vision system? For starters, it
likely means the system won't recognize the optical illusion unless trained
for it, but are there any other impacts that we need to take into
consideration, especially when human lives depend upon the system?

------
deltron3030
Can't they implement a "virtual representation" - a sort of ideal or
archetypal scene in an AI's memory that's quickly accessible - and then use
that to diff "new unexpected stuff" and refocus on that difference somehow?

~~~
dmichulke
If we had such a representation and the mapping from real-world images to it,
we'd basically have solved all image recognition tasks, because the above is
the definition of an image recognition task.

~~~
deltron3030
In the real world you wouldn't need to know the exact shape of an object to
determine what it is; unique features could be enough. For an elephant this
could be just a range of skin colors that are unique to elephants.

The problem with videos and images vs. the real world is fidelity: you can't
trust images or videos previously captured through other equipment. This
barrier wouldn't exist if the AI could "see" into the real world; the skin
colors of animals can be trusted there!

------
mcguire
This is an important, and worrying, problem, but I'd be more impressed if the
elephant hadn't been added as a bad Photoshop job. That kind of manipulation
is going to screw up a lot of the processing.

------
fizixer
> A visual prank exposes an Achilles’ heel of computer vision systems: Unlike
> humans, they can’t do a double take.

Thanks for giving us ML researchers a TODO. We'll get to work right away.

------
dukoid
Wouldn't humans who haven't seen elephants before, or who are not trained to
recognize the general category of mammals/animals, be prone to making similar
errors?

~~~
sgt101
I don't think so - I think the human may fail to find the correct label for
the elephant (although if they have some knowledge about elephants they
might), but the machine fails to see it - to identify it as a thing at all -
and it also changes the labels it gives other things, so cats become dogs,
mugs become books, and so on. Humans are not prone to this.

------
lolc
Somehow the right image, where according to the description an elephant was
introduced, is now identical to the left photo in the article.

~~~
proto-n
The elephant is on top of the guy's head, it's hard to see with all the boxes
laid over the image.

------
sorokod
Is this not a whole new attack surface?

~~~
QML
Adversarially generated inputs have been known for some time [1], so I
wouldn't call it a new class of attacks, unless 1. you count it as a distinct
class from other, older ways of attacking neural networks, or 2. "new" means
within the last couple of years.

[https://blog.openai.com/adversarial-example-research/](https://blog.openai.com/adversarial-example-research/)

~~~
p1esk
Yes, this can be used for generating adversarial inputs, and yes, this is a
new attack surface. From the paper:

 _The images generated here could be viewed as a variant of adversarial
examples, in which small image perturbations (imperceptible to humans) cause a
large shift in the network’s output. The images we generate are of a somewhat
opposite flavor: while we do not limit the magnitude of the difference between
the original and modified image, the detectors are sometimes “blind” to the
inserted object. In addition, our examples are not “targeted” in the sense
that no optimization process is required to generate them; they seem prevalent
enough so that a simple scan of transplanting translated versions of one
object in the other can give rise to multiple wrong interpretations._

------
crimsonalucard
I'm aware of the vision algorithm in my head to a certain extent, and I'm not
sure if machine vision does the same. You can run simple thought experiments
to see what your brain actually does to analyze an image. First of all, when I
look at a scene I am 100% aware of geometry. Regardless of meaning, words, and
symbols, I can trace out the three-dimensional shape of things without
associations to words.

How do I know I can do this? Simple. Every scene I look at, I can basically
translate or imagine in my head as a wireframe scene or some low-poly scene,
as if it were generated by a computer. Similarly, if I look at a wireframe
scene generated by a computer, my mind can translate it into a scene that
looks real. Try it, you can do it.

Second, I can look at an actual low-poly wireframe model of an elephant and
associate it with the word 'elephant.' I do not need color or detail to know
it's an elephant. In fact, with color and detail alone it is harder for me to
identify an elephant. For example, if someone takes many very close-up
photographs of parts of an elephant, like its eye, skin, ear, etc., and asks
me to guess the subject by interpreting the pictures... I become fully aware
that I would be accessing a slower, different part of my brain to deduce the
meaning. This is a stark contrast to the instantaneous word association
established when I look at a wireframe model of an elephant. The speed
difference between the two ways of identifying an elephant indicates to me
that geometric interpretation is the primary driver behind our visual
analysis, and details like color or texture are tertiary when it comes to the
identification of an elephant. I believe the visual cortex determines shape
first, then subsequently determines word from shape.

If you feed a white sculpture of an elephant or a wireframe of an elephant
into one of these deep learning networks, it is unlikely you will get the word
'elephant' as output. But if you feed it a real picture of an elephant, it can
correctly identify the elephant (assuming it was trained against photos of an
elephant). Because the delta between a white sculpture of an elephant and an
actual picture of an elephant is just color and detail, this indicates to me
that when you train these deep learning networks to recognize an elephant, you
are training the network to recognize details. It's a form of overfitting; the
training is not general enough to catch geometry. It is correlating blobs of
pixels, color, and detail with an elephant rather than associating a
three-dimensional model of it with the word... the opposite of what humans do.
In fact, I bet you that if you took those very close-up photographs of an
elephant and fed them into the network, it'd do a better job at recognition
than with the picture of a white sculpture of an elephant.

This indicates to me that to improve our vision algorithms, the algorithm must
first associate pixels with geometry and then identify the word associated
with the geometry, rather than trying to associate blobs of pixels with words.
Train geometry recognition before word association.

My guess is that our minds have specific, genetically determined, built-in
geometry recognition algorithms honed to turn a 2d image into a 3d shape. We
do not learn to translate 2d to 3d; we are born with that ability hardwired.
Where learning comes in is the translation of this shape to a word. Whereas
most of the machine learning we focus on in research is image recognition, I
believe the brain is actually learning shape and geometry recognition.

------
doombolt
Wrong. They can do a second take. You should just code it in.

~~~
majos
Is this sarcasm? The point of the second take is that humans often know
they're confused and will go back to think about the image and remedy the
confusion, whereas having a neural network just look at the image again isn't
going to do anything (plus, existing architectures don't seem to have any
capacity to say "I'm confused" anyway).

~~~
antpls
Parent is not being sarcastic. Automated recognition systems are _systems_,
which means they are made of several components (sensors, databases, hardware,
software) working together. Neural networks are one of those sub-components,
and no one ever claimed that neural networks are 100% accurate and sufficient
to build automated decision systems.

The problem described in the article is taken into account when building
systems (like autonomous cars) using neural networks.

~~~
sorokod
_The problem described in the article is taken into account when building
systems (like autonomous cars) using neural networks._

Where can I find more information on this?

~~~
antpls
This is a vast engineering field, studied since before the existence of neural
networks.

Keywords are "control theory", "sensor fusion", and "automated decision-making
under uncertainty", which you can look up on Google, Wikipedia, and arXiv.
Also "simultaneous localization and mapping" (SLAM), which is a good example
of using uncertain data points from different sensors to build a
representation of reality.

In those systems, a neural network is just another sensor providing augmented
information.
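
A textbook example is inverse-variance weighting (the scalar Kalman update): two uncertain estimates of the same quantity are combined into one that is more certain than either. A minimal sketch, with made-up numbers:

    def fuse(x1, var1, x2, var2):
        # Weight each estimate by the inverse of its variance.
        w1, w2 = 1.0 / var1, 1.0 / var2
        x = (w1 * x1 + w2 * x2) / (w1 + w2)
        var = 1.0 / (w1 + w2)  # fused variance is smaller than either input
        return x, var

    # e.g. a lidar range and a vision-based depth estimate of the same object
    print(fuse(10.2, 0.04, 9.8, 0.25))  # -> (10.14..., 0.034...)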

~~~
sorokod
Thank you, I will have a look. Just out of curiosity, what do you mean by
'uncertain' in this context?

~~~
Qworg
Errors in precision and accuracy.

~~~
sorokod
Hmm... there is no doubt about the sensor input, the elephant is there. The
issue is with precision and accuracy of the model itself.

Edit: I need to rephrase. The interesting case for _me_ (and the one the
article is describing) is when the model fails on accurate input data due to
assumptions that are intrinsic to the model itself.

~~~
unstuckdev
Optical sensors don't have anywhere near the dynamic range of a retina. I
think poor low-light performance has been implicated in at least one
self-driving crash. Eyes can at least render objects and shapes in extreme low
light, even if the brain doesn't have enough information to identify them. The
tiny sensors in a camera compact enough to be practical still have a long way
to go.

------
antpls
The title is a bit misleading / clickbait. It should be "Neural Networks
Confront...". Machine learning isn't all about neural networks and deep
learning.

As another comment said, a "second take" is not what neural networks are made
for. Neural networks are a building block of more complex decision systems,
where the weaknesses of the neural networks are taken into account before
automatically committing to decisions.

Otherwise, I guess the article is good at pointing out the current limitations
of neural networks alone.

~~~
dpwm
> It should be "Neural Networks Confront...".

I disagree, for the following reasons:

\- The paper is called The Elephant in the Room. Technically the _paper_
confronted the popular machine learning approaches, so the reversal would have
been a more eye-catching headline. "The Elephant in the Room confronts Machine
Learning" is a rare opportunity to honestly play with word order in a way that
would arguably be more attention-grabbing.

\- Machine learning is more general than Convolutional Neural Networks, but
less general than AI. This seems appropriate for this type of popular science
publication.

\- There are approaches that use attention and RNNs, and they are interesting,
but they are by no means mainstream, nor do they produce state-of-the-art
classification accuracies. Actually, a second take is exactly what an
attention-based RNN can do.

\- The last time I checked the state-of-the-art solutions to image
classification involve Convolutional Neural Networks.

If there were an approach that came even close to ConvNet performance, I would
agree that it is perhaps unfair to condemn the entire field. Even then, we
would be in the territory of generalizations in headlines, which are often
fair. "{party} proposing legislation" is probably a bigger generalization in a
headline that would generally pass without criticism.

~~~
antpls
> \- The paper is called The Elephant in the Room.

How is this argument related to the confusion of the terms Machine Learning
and Neural Networks...?

> \- Machine learning is more general than Convolutional Neural Networks, but
> less general than AI. This seems appropriate for this type of popular
> science publication.

The most appropriate term is the most precise one, in this case: neural
networks.

> \- There are approaches that use attention and RNNs, and they are
> interesting, but they are by no means mainstream, nor do they produce
> state-of-the-art classification accuracies. Actually, a second take is
> exactly what an attention-based RNN can do.

Correct, but you are still talking about neural networks...

> \- The last time I checked the state-of-the-art solutions to image
> classification involve Convolutional Neural Networks.

And again neural networks.

~~~
dpwm
> How is this argument related to the confusion of the terms Machine Learning
> and Neural Networks...?

We're talking about the title.

> The most appropriate term is the most precise one, in this case : neural
> networks

That is generalizing RNNs and CNNs into "neural networks". If I take a popular
CNN architecture, chop off the final dense layer and activation and attach an
SVM classifier, is it an SVM or a convnet I am using?

Somehow the title "The Elephant in the Room confronts
faster_rcnn_inception_resnet_v2_atrous_coco, faster_rcnn_nas_coco,
ssd_mobilenet_v1_coco, mask_rcnn_inception_resnet_v2_atrous_coco,
mask_rcnn_resnet101_atrous_coco." (or the reverse) doesn't have the same ring
to it and is probably inappropriate for the audience.

------
calhoun137
In this article, learn the secret AI researchers don't want you to know - my
non-tech friends love talking about it! Gonna have to agree with everyone else
here, the title is clickbait and you can just code around it.

There is nothing fatal here, it's just one more problem to solve. Haven't we
heard all about this issue of image recognition being trickable 1000 times?
Why is this the top post?

