
Suddenly, a leopard print sofa appears - anglerfish
http://rocknrollnerd.github.io/ml/2015/05/27/leopard-sofa.html
======
karpathy
This article would not come as a surprise to anyone who works with ConvNets.
Sadly, that might not be the case for those outside the field, largely due to
media's inadequate coverage of our advances (but this is common outside our
field too). No one in the field really believes ConvNets see better than
humans. They are very good single-glance texture recognizers. It's as if you
flashed an image and looked at it for a split second without giving yourself a
chance to look around and take some time to gain any higher-level scene
understanding. If you tried this with this image you might also think you had
seen a leopard. Another point worth making is on the data side rather than the
modeling side. If, in the training data, the leopard texture is highly
indicative of the leopard class, then the ConvNet will learn to strongly
associate the two. As the
article mentions, a quick hack would be to make sure that your training data
contains many leopard-textured items of different classes. You might then
expect the ConvNet to seek other features to latch on to and become less
reliant on the texture itself.

Also, we carried out an experiment on ImageNet and the outcome was that "One
human labeler (me, incidentally) with a fixed amount of training and a
slightly-above average determination reached ~5% top-5 error on a subset of
ImageNet test set". The media sees this and it immediately gets spun to "AI
now Super-Human. And we're all going to die." It makes a lot of us cringe
every time.

Many people in Computer Vision now consider ImageNet "squeezed" out of juice -
we're good at texture recognition and at recognizing objects in plain view, and
we're now searching for harder tasks and more dynamic range with respect to
human performance, in areas such as harder 3D/Spatial tasks, Image Captioning,
Visual Q&A, etc. The hope is that these harder datasets might in turn guide us
in developing models with more nuanced understanding.

~~~
_dps
Might you by chance be familiar with Rodney Brooks' work on subsumption
architectures [1]? If not, I would summarize the underlying idea (my words not
his) as "don't try to jump too many layers of abstraction in one go" [2].

So I wonder to what extent you would consider this a predictable outcome from
the classifier in question not being part of a subsumptive architecture ---
which at a guess would look like

    
    
      - glance/texture responses fed into
      - boundary-recognition layers fed into
      - object persistence/tracking layers fed into
      - abstract scene reasoning
    

It seems to me, as a non-vision researcher (I mainly worked in planning and
control), that the most obvious counterargument to the image being a spotted
cat is based on boundary/object/scene reasoning, and that it's "reasonable"
for the texture/glance layer to say "looks a lot like a cat texture".

[1]
[https://en.wikipedia.org/?title=Subsumption_architecture](https://en.wikipedia.org/?title=Subsumption_architecture)

[2] I realize this may seem, superficially, anathema to deep network research,
which advocates letting the network find its own intermediate levels of
abstraction. But it's actually compatible in my view because Brooks advocates
(again, paraphrasing quite a bit) that the separate layers should have
different objective functions, and that in fact the need for different
objective functions (in a prioritized order) is the cause of emergent layering
in nature. "First, don't die. Second, find shelter. Third, find food etc." So
one can imagine deep networks each finding their own _locally useful_
abstractions for each objective function in the "Maslow" chain, while still
having some macro architecture that tracks human-imposed design principles.
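Paraphrasing that idea as code, a prioritized cascade might look like the sketch below (a toy illustration, not Brooks' actual design; every layer here is a placeholder heuristic):

```python
# Toy subsumption-style cascade: layers run in priority order (low -> high
# abstraction), and a higher layer that asserts an opinion overrides
# ("subsumes") whatever the layers below proposed.

def run_cascade(observation, layers):
    verdict = None
    for layer in layers:
        proposal = layer(observation, verdict)
        if proposal is not None:
            verdict = proposal
    return verdict

# Placeholder layers for the leopard-sofa case:
def texture_layer(obs, _lower):
    # single-glance texture response
    return "leopard" if obs.get("texture") == "rosettes" else None

def shape_layer(obs, _lower):
    # boundary/shape reasoning can veto the texture guess
    return "sofa" if obs.get("silhouette") == "sofa" else None

print(run_cascade({"texture": "rosettes", "silhouette": "sofa"},
                  [texture_layer, shape_layer]))  # -> sofa
```

Each layer keeps its own local objective; the macro ordering is the human-imposed design principle.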

~~~
LoSboccacc
the limit is that training cannot force abstraction. you can only reach
abstraction if you have enough neuron space and the data set is big enough to
avoid over-fitting textures.

the problem is.. human vision doesn't work just by feeding in a bitmap. we
have structure to decode spatial relationships, shapes and maybe even
shadow/light relations. no way a classifier working on raw color arrays is
going to match our vision capabilities

~~~
Retric
Seems simple enough to feed an NN with that abstracted data.

However, the advantage to the texture approach is it's abstracted from a lot
of other information. You don't want a classifier to say sofa, when it's a
picture of a person on a sofa.

~~~
LoSboccacc
but then you're biasing it toward your perception:

[http://www.bespokesofalondon.co.uk/assets/Uploads/bespoke-
so...](http://www.bespokesofalondon.co.uk/assets/Uploads/bespoke-sofa-2.jpg)

anyway it does work perfectly if that's what you need, but most proponents are
trying to use deep NNs to classify 'as good as humans do'

------
fla
Somewhat tangential, but this made me think about this quote found on HN last year:

Context: Evolutionary algorithms and analog electronic circuits

> One thing stands out when you try playing with evolutionary systems.
> Evolution is _really_ good at gaming the system. Unless you are very careful
> at specifying all of the constraints that you care about you can end up with
> a solution that is very clever but not quite what you had in mind. Here
> power consumption is the issue. If you tried to evolve a sturdy chair you
> might end up with something that is 1mm tall. or maybe a fuel efficient car
> that exploits continental drift.

I think it's the same here: the net is never gonna be better than what it
needs to be, and it is probably always gonna take the easy route.

~~~
FranOntanaya
There's an alife program called DarwinBots where small bots powered by
mutating code compete against each other to survive and reproduce.

Given enough time, you'd expect them to develop clever behaviors, but instead
they just fuzz-tested the sim and locked in on exploits of bugs or environment
settings. They only got a bit more clever when connecting different sims
running on different conditions.

Eyes already use different kinds and densities of sensors optimized for either
detail and color or movement/edges. I wouldn't expect a single learning
method, even after optimizing it to its limits, to be above what two or more
layers of different methods could do, especially when trying to avoid exploits
like the tank story.

~~~
stcredzero
 _Given enough time, you'd expect them to develop clever behaviors, but
instead they just fuzz-tested the sim and locked in on exploits of bugs or
environment settings._

Classic A-life! Also, not so different from the spirit of actual biology.

 _They only got a bit more clever when connecting different sims running on
different conditions._

Diversity is very important for evolution on many levels. What many don't
realize (especially, I note, evolution deniers) is that the ecosystem as a
whole provides a very complex and continually varying epiphenomenal fitness
function to any given organism.

~~~
rdlecler1
If you don't have a sufficiently complex genotype-phenotype mapping and the
system is not evolvable (see Günter Wagner's work), then you shouldn't expect
a more complex phenotype. Understanding genetic representation is going to be
an important step toward open-ended evolutionary systems.

------
lambda
> So I guess, there's still a lot of work to be done.

And I think this is the most interesting part.

One of the most depressing things about all of the "this image recognition
algorithm performs better than humans on this task" stories is the idea that we've
pretty much solved the problem, and it's just a matter of some more
optimization and tweaking to handle a few edge cases.

This kind of problem, where the dominant solution simply gets it so wrong, and
the problem cases are uncommon enough that any statistical solution is
generally going to treat them as noise, reveals that in fact there is
likely plenty of room for entirely new, novel ways of approaching the problem
to handle these kinds of cases better.

It's actually more exciting that there's so much more to be done, than to say
"well, it's basically a solved problem, we just need to do some tweaking and
optimization."

~~~
jhundal
Definitely agree here; I want to add a link to this paper, which shows how far
we still have to go (and questions whether the current models will ever
replicate human vision):
[http://arxiv.org/abs/1412.1897](http://arxiv.org/abs/1412.1897)

------
jweir
[https://neil.fraser.name/writing/tank/](https://neil.fraser.name/writing/tank/)

This is a classic story of a neural net failure.

The net was able to find tanks hiding in the trees with amazing accuracy. Too
amazing. It turned out the photos of the hidden tanks had all been taken on a
cloudy day, and the images without tanks on a clear day.

------
fpgaminer
Thank you for this article; very thought provoking.

My nitpick:

> When each student was given a heavy book of MNIST database, hundreds of
> pages filled with endless hand-written digit series, 60000 total, written in
> different styles, bold or italic, distinctly or sketchy.
>
> ...
>
> So, are you going to say that was not the case?

I understand the point the author is making. Human brains are really good at
taking limited examples and correctly extrapolating them to new cases. That
is, of course, the goal of intelligence. Machine Learning has gotten better at
this generalization, but has a long way to go. And ConvNets as they exist
today will not achieve that, no matter how much training you perform on them.

This specific example is inaccurate though. Let us aggressively simplify and
low-ball by saying that humans see at 24fps. Humans of course don't see in
discrete frames, but this simplification doesn't detract from my argument and
makes quantifying easier. So, if you give a human a single page of numbers,
and they look at it for an hour, they have now seen >86k examples. That's 86k
examples with twitching saccades, and from both eyes. That's in just an hour
of looking at numbers.

Prior to being given that page of numbers, most children will have been alive
for 4-5 years. That's 3 billion examples from a wide variety of subjects (we
ignore sleeping cycles, because we're already low-balling this fps figure, and
because the brain is still learning and visualizing during sleep).
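For what it's worth, the arithmetic above checks out under the stated assumptions:

```python
# Back-of-envelope check of the frame counts above (24 fps is the
# deliberately low-balled assumption).
fps = 24
frames_per_hour = fps * 60 * 60
print(frames_per_hour)               # 86400 -- the ">86k examples" in an hour

seconds_per_year = 60 * 60 * 24 * 365
frames_in_4_5_years = fps * seconds_per_year * 4.5
print(f"{frames_in_4_5_years:.1e}")  # ~3.4e9, i.e. roughly 3 billion
```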

And humans are born with a pre-built visual cortex. Edge detection, gradient
detection, etc. are all already built for us. CNNs learn that from scratch.

The author's real point is still valid, though, don't get me wrong. I'm just
nitpicking.

~~~
alextgordon
Forget seeing a symbol once, you can recognise and represent a symbol without
_ever_ having seen it.

Test your humanness; draw these symbols:

 _" Like an E but rotated so the prongs point upwards"_

 _" Like a snake but with two heads. Snakes down, up, down, up, down."_

 _" Like a walking stick with the handle pointing left and looping back
around."_

(answer for A: Russian letter Sha) (answer for B: Kannada letter Uu) (answer
for C: Tamil vowel sign I)

~~~
Houshalter
Convnets can do this. Geoffrey Hinton has a wonderful lecture, where he
trained a digit recognizer on everything but 7's and 8's (IIRC.) He then let
another convnet tell it which numbers looked more or less like 7's and 8's.
E.g. "that 9 looks kind of like an 8. That 1 looks kind of like a 7", etc.

And then it was able to correctly recognize 7's and 8's, despite never having
actually seen one. I'm simplifying somewhat, but it was super cool.

I don't know why people are so focused on one-shot learning, or think that NNs
can't do it. Neural networks learn features from lots of (possibly unlabelled)
data. That's the whole point. Once you have those features, you can use them
for all sorts of things. You can show it an image, and then measure how close
other images are to it, thereby learning from a single example.
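That feature-then-nearest-neighbor recipe fits in a few lines; `embed` below is a stand-in for a pretrained feature extractor, and the labels and vectors are invented for illustration:

```python
import numpy as np

def embed(x):
    # placeholder for a pretrained feature extractor
    return np.asarray(x, dtype=float)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def one_shot_classify(query, prototypes):
    """prototypes: {label: single example}; pick the nearest in feature space."""
    q = embed(query)
    return max(prototypes, key=lambda lbl: cosine(q, embed(prototypes[lbl])))

# one labeled example per class is enough once the features are good
protos = {"seven-ish": [1.0, 0.0, 0.9], "eight-ish": [0.0, 1.0, 0.1]}
print(one_shot_classify([0.9, 0.1, 0.8], protos))  # -> seven-ish
```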

------
qbrass
On the other hand, nobody has been upset that humans are constantly
misidentifying that jaguar print sofa as a leopard print.

~~~
vacri
Well-played!

I myself have researched leopard spots since I painted our toilet floor in
them. It's a lead sheet, and the paint had worn off, which probably wasn't the
healthiest thing. My housemates had filled the toilet with memorabilia from an
African trip, so leopard-print paintjob it was.

Which entailed looking up leopardprint online. Very little of which actually
looks like leopard rosettes, and now I have a problem with almost anything
trying to pass itself off as leopardprint. Anyway, I can't say that my
paintjob is a particularly good reproduction, but at least it's 'spiritually
correct'... :)

------
Mz
_Leopards (or jaguars) are complex 3-dimensional shapes with quite a lot of
degrees of freedom (considering all the body parts that can move
independently). These shapes can produce a lot of different 2d contours_

My son keeps telling me that infants are fine with, say, a truck transforming
into a clown (when it emerges from the other side of a visual barrier) but not
with it transforming into TWO of something. Apparently, babies subjectively
experience this (visual transformation) all the time -- mom moves a plate and
what seemed like a big circle is now a flat line or whatever.

So humans apparently get tons and tons of experience with visually mapping 3d
reality to mere 2d imagery. I have been thinking somewhat about this of late,
in terms of physical attractiveness or "image" -- that pictures of a woman
posted on a blog capture a 2d version of her, but people interacting with her
are interacting with a 3d living, moving creature who also has a smell and a
voice, and whose movements may or may not be elegant. Which is a
thought process relevant to a project of mine, something people here surely
will have no interest in. But where it is relevant to this article is that we
are doing this wrong: Humans have thousands of hours of practice of looking at
3d reality and figuring out how to interpret 2d images as representative of
that 3d reality. Image recognition software is just dealing with 2d images. I
don't see how it can hope to compete. Humans don't come preinstalled with the
software to make that distinction. We acquire it with enormous repetition.

When do we make a robot, give it some baseline parameters and a learning
algorithm, and set it loose in 3d reality to learn? That is when we can get
scared about human-like AI that can compete on image recognition.

~~~
etangent
Indeed, arguments like "My mother didn't have to purchase 30,000 mugs to teach
me what one looks like" miss the fact that we humans spent so much time
(almost all of our waking time in fact) interpreting 3D reality, an endless
stream of repetitive tasks.

------
xyproto
[https://www.imageidentify.com/](https://www.imageidentify.com/) correctly
identified the image as "a small sofa". I think rotating the image is
questionable, since there could be an algorithm that first orients the image
correctly based on light and shadow, and then runs the image recognition.

[https://www.imageidentify.com/result/1ixb9603m9ix1](https://www.imageidentify.com/result/1ixb9603m9ix1)

~~~
quasiresearcher
But those algorithms would be very limited in usefulness. Systems such as
imageidentify.com should at least be trying an ensemble of algorithms, many of
which, I suppose, should be invariant under translation and rotation.

Edit: There's a comment about invariance in this thread [1] and apparently
CNNs are not invariant under rotation.

[1]
[https://news.ycombinator.com/item?id=9750133](https://news.ycombinator.com/item?id=9750133)
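An easy way to probe this on any classifier is to feed it rotated copies of the same image and compare the outputs. `predict` below is just a toy stand-in (a pooled horizontal-edge response), not a real CNN, but it shows the probing pattern:

```python
import numpy as np

def predict(img):
    # toy "classifier": pooled response of a horizontal-edge filter
    return float(np.abs(np.diff(img, axis=0)).sum())

img = np.zeros((8, 8))
img[4:, :] = 1.0  # image with one horizontal edge across the middle

# score the same image under all four 90-degree rotations
scores = {k * 90: predict(np.rot90(img, k)) for k in range(4)}
print(scores)  # 0 and 180 degrees agree; 90 and 270 give a different score
```

Swapping in a real model for `predict` gives a quick empirical test of rotation invariance.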

------
shpx
Suddenly, a shaded rock appears.

[https://upload.wikimedia.org/wikipedia/commons/7/77/Martian_...](https://upload.wikimedia.org/wikipedia/commons/7/77/Martian_face_viking_cropped.jpg)

We're doing humans wrong. Maybe not all wrong, and of course, humans are
extremely useful things, but think about it: sometimes it almost looks like
we're already there. There's always going to be an anomaly; lots of them,
actually, considering all the things shaded in different patterns. Something
has to change.

I agree that we aren't there, but we'll never be there; every system can be
fooled. It's just a question of 95%, 99% or 99.99%.

~~~
benplumley
> We're doing humans wrong.

I don't follow. If you asked a human what the linked image looked like, they'd
likely say a face, but if you then asked them what it actually was, they'd
change their answer to a rock, even specifically a rock on Mars (if given a
colour version of this image).

It's true that humans see patterns that aren't there, but does that detract
from our ability to recognise objects?

------
TimFogarty
This is fascinating and well written.

I tried the unrotated sofa image on Wolfram's ImageIdentify and it correctly
identified a settee [1]. So it presumably gathered that from the shape of the
image rather than the pattern. It is peculiar though that it can't see the
shape under a simple rotation. Or perhaps the margin between the confidence
levels for sofa and leopard was so narrow that a rotation was enough to tip it
in favour of the leopard? I'd be interested to see the inner workings of this.

[1] [http://i.imgur.com/6f6Co5O.png](http://i.imgur.com/6f6Co5O.png)

~~~
glaberficken
I tried Wolfram ImageIdentify with a bunch of bicycle photos and it insisted
on identifying them as "Bicycle Saddle".

I kept trying different ones and it kept identifying as "Bicycle Saddle"...

~~~
fortyeight
To be fair that is one of the few things on a bike that isn't a triangle.

~~~
glaberficken
hmmm, good point!

------
castratikron
Really wasn't expecting a Terry Davis comment at the bottom.

~~~
briandear
Who is Terry Davis?

~~~
cgriswald
I didn't know either, but found this: [http://motherboard.vice.com/read/gods-
lonely-programmer](http://motherboard.vice.com/read/gods-lonely-programmer)

Scary and fascinating.

~~~
YoukaiCountry
I felt like Alice down the rabbit hole after getting sucked into reading about
TempleOS. Quite an unexpected side-effect of reading an article on neural
networks!

------
vanderZwan
The MNIST analogy reminds me of the "Teaching Me Softly" article that was
posted here last year:

> _When Vladimir Vapnik teaches his computers to recognize handwriting, he
> does something similar. While there’s no whispering involved, Vapnik does
> harness the power of “privileged information.” Passed from student to
> teacher, parent to child, or colleague to colleague, privileged information
> encodes knowledge derived from experience. That is what Vapnik was after
> when he asked Natalia Pavlovich, a professor of Russian poetry, to write
> poems describing the numbers 5 and 8, for consumption by his learning
> algorithms. The result sounded like nothing any programmer would write. One
> of her poems on the number 5 read,_

> _He is running. He is flying. He is looking ahead. He is swift. He is
> throwing a spear ahead. He is dangerous. It is slanted to the right. Good
> snaked-ness. The snake is attacking. It is going to jump and bite. It is
> free and absolutely open to anything. It shows itself, no kidding._

> _All told, Pavlovich wrote 100 such poems, each on a different example of a
> handwritten 5 or 8, as shown in the figure to the right. Some had excellent
> penmanship, others were squiggles. One 5 was, “a regular nice creature.
> Strong, optimistic and good,” while another seemed “ready to rush forward
> and attack somebody.” Pavlovich then graded each of the 5s and 8s on 21
> different attributes derived from her poems. For example, one handwritten
> example could have an ‘‘aggressiveness” rating of 2 out of 2, while another
> could show “stability” to a strength of 2 out of 3._

> _So instructed, Vapnik’s computer was able to recognize handwritten numbers
> with far less training than is conventionally required. A learning process
> that might have required 100,000 samples might now require only 300. The
> speedup was also independent of the style of the poetry used. When Pavlovich
> wrote a second set of poems based on Ying-Yang opposites, it worked about
> equally well. Vapnik is not even certain the teacher has to be right—though
> consistency seems to count._

[http://nautil.us/issue/6/secret-codes/teaching-me-
softly](http://nautil.us/issue/6/secret-codes/teaching-me-softly)

That article in turn reminded me strongly of "Metaphors We Live By" by Lakoff
& Johnson, and the works they have written since, where they claim that humans
make sense of the world using systems of rich, conceptual metaphors. As I
understand it, their work is well known to machine learning researchers.

------
jameshart
Obviously these classifiers do often focus on patterns, rather than shapes,
and that's probably something that could be worked on, but I don't think an
image classifier can possibly be expected to, at the level it is operating,
identify the leopard-print sofa all on its own. Clearly there's a higher order
process at work than image recognition here - after all, when a human is faced
with a sofa-shaped object with a leopardskin pattern on it, there are two
hypotheses that need to be evaluated: 1) this is a sofa patterned to look like
a leopard; or 2) this is a leopard, shaped like a sofa. Rejecting the less
plausible of those two scenarios is obviously a higher-order activity. If the
image classifier is at least firing off the concepts 'leopard' and 'sofa' with
some level of probability, it's doing its job pretty well.

~~~
visarga
Then we need to integrate higher order knowledge about the world collected
from text (Wikipedia and the like).

------
anglerfish
OP here, and thank you, kind sirs and ladies, for your feedback.

I'd just like to answer the recurring objection: yes, our visual experience
contains a lot of frames, and that seemingly refutes my MNIST example;
however, you forget about the other part of a supervised dataset, namely
labels. Do we have a label provided for each thing we see in our life?
Obviously not. How much time do you need to familiarize yourself with a new
entity, like an unknown glyph or symbol? I can't provide a concrete example,
but I guess a single math class was enough for all of you to recognize all the
digits the next day. You can test it right now by looking at some unknown
alphabet and then looking at it again upside down - you'll recognize it
perfectly, except for mental rotation issues (which occur even for well-known
letters and symbols).

~~~
vidarh
> I guess a single math class was enough for all of you to recognize all the
> digits the next day.

I'm curious what makes you think that. My experience with what's going on at
my son's school tells me that the children spend a massive amount of time on
getting recognition of digits and letters right.

~~~
prewett
When I was living in China I had difficulty recognizing handwritten 9's and
1's. Their 9's are less half-circular than Westerners' are, and the 1's have a
long stroke at the top that looks like a sloppily written 7 to me. I would
frequently look at a handwritten number and have to analyze what it was.

------
rndn
That's right; CNNs tend to be more concerned with textures than with overall
shapes. One problem is that CNNs discard a lot of pose information about the
detected features during pooling. Another problem is that there is no top-
down verification such as "hmm, leopards always have {heads, legs, tails ...},
I should scan the input for these... nope, it doesn't fit at all, I should
exclude everything that has {heads, legs, tails ...} from my interpretation."
In a human brain that likely happens in some distributed fashion without
considering individual classes, but by just inhibiting everything that can't
be verified upon looking twice (or more times).
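A crude version of that top-down check, with entirely made-up part lists and a hypothetical `detected_parts` input standing in for real part detectors:

```python
# Required parts per class (illustrative only)
REQUIRED_PARTS = {
    "leopard": {"head", "legs", "tail"},
    "sofa": {"cushions", "armrest"},
}

def verify(ranked_guesses, detected_parts):
    """Walk the bottom-up guesses best-first; keep the first whose
    expected parts can actually be found in the image."""
    for label in ranked_guesses:
        needed = REQUIRED_PARTS.get(label, set())
        if not needed or needed & detected_parts:
            return label
    return ranked_guesses[-1]  # nothing verified: fall back

# bottom-up net says "leopard", but only sofa parts are found
print(verify(["leopard", "sofa"], {"cushions", "armrest"}))  # -> sofa
```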

------
Houshalter
Awhile ago I tried an interesting experiment. I fed a famous psychology image
into a bunch of different image recognition systems.

This image: [https://i.imgur.com/2aCqMx2.png](https://i.imgur.com/2aCqMx2.png)

And here are the results:
[https://imgur.com/a/8ndyq](https://imgur.com/a/8ndyq)

This doesn’t really prove anything, but I thought it was interesting. It is,
of course, unreasonable to expect ML algorithms to perform decently so far
outside of the space they were trained on.

But I suspect that part of the reason they don’t do well is that they are
purely feed forward. Humans also don’t see the image at first. It takes time
to find the pattern, and then everything clicks into place and you can’t unsee
it.

This might have something to do with recurrency. But more importantly,
information feeds down the hierarchy as well as up. Features above, give
information back down to features below. So once you see the dog, that tells
the lower level features that they are seeing legs and heads, which says they
are seeing outlines of more basic 3 dimensional shapes, and so on.

I think it also requires a decent understanding of 3d space, to fit the
observed pattern to 3d models which could have produced it. I’m not certain
that regular NNs observing static images are optimal for learning that.

More here:
[https://www.reddit.com/r/MachineLearning/comments/399ooe/tes...](https://www.reddit.com/r/MachineLearning/comments/399ooe/testing_image_recognition_platforms_on_a_famous/)

------
dlss
For context, ImageNet _does_ have a sofa category for labels :/

[http://image-net.org/search?q=sofa](http://image-net.org/search?q=sofa)

~~~
sbodenstein
The Caffe models are only trained on the 1000 category subset of ImageNet used
for the competition: [http://image-net.org/challenges/LSVRC/2014/browse-
synsets](http://image-net.org/challenges/LSVRC/2014/browse-synsets)

There are no sofas in this list; the closest thing I can find is a "studio
couch, day bed":
[http://imagenet.stanford.edu/synset?wnid=n04344873](http://imagenet.stanford.edu/synset?wnid=n04344873)

~~~
dlss
Right. The "studio couches" are couches.

------
sosuke
Just to be sure: Wolfram Alpha Image Identify was correct, down to identifying
the particular type of sofa.

[https://www.imageidentify.com/result/1ixb9603m9ix1](https://www.imageidentify.com/result/1ixb9603m9ix1)

------
andreyf
Google image search says

    
    
      Best guess for this image: cat bed furniture
    

[https://goo.gl/vXwSaj](https://goo.gl/vXwSaj)

~~~
comex
That's cheating though; you can see the first few results are for the exact
same image with a dog pasted on it, and contain those terms.

------
quantombone
ConvNets have gotten popular because of their strong empirical results. All
the recent work on visualizing CNNs suggests that the community working on
Deep Learning still has a lot to learn about their own algorithms.

But high-level notions like "a jaguar is a cat-like animal" aren't necessary
to perform well on an N-way classification task like ImageNet.

What's more important to note is that everybody knows there's plenty wrong with a
pure appearance-based approach like CNNs. Every few years a new approach pops
up that is based on ontologies, an approach inspired by Plato, etc, but these
systems require a lot of time and effort. More importantly, they don't perform
as well on large-scale benchmarks. In the publish-or-perish world, you can
jump on the CNN bandwagon or start reading Aristotle's metaphysics and never
earn your PhD.

~~~
sgt101
I doubt that reviewers for NIPS would reject a paper with a novel approach
because it didn't perform at a best-in-class level, provided it offered a way
forward.

If it doesn't work at all, or isn't a new idea, that's different.

------
nartz
I have seen some kaggle competitions do image transformations and put the data
back into the training set to increase the robustness of the classifier. For
instance, rotating images, slightly skewing them, etc.

I would propose that for this leopard problem, instead of just skewing the
images, you also perform transformations on the COLOR and put the images back
into the training set.

Maybe applying certain filters, such as dimming the saturation or contrast of
images so that the contrast of the leopard spots was less visible (i.e. "a
leopard in low lighting") - maybe this would force the neural net to learn
more than just its print.

Knowing the right set of color filters to apply to all images could be tricky
though.

------
yellowbkpk
Let's say I didn't want to use the ImageNet or CaffeNet pre-trained models but
wanted to train my own model (say, of thousands of images of sofas, leopards,
jaguars, and cheetahs); are there any tutorials that walk through the process
of building a CNN on your own data?

(I've seen the comments like
[https://news.ycombinator.com/item?id=9584325](https://news.ycombinator.com/item?id=9584325)
and watched the lectures and youtube walkthroughs, but they're all theoretical
and I'm looking for documented code to go along with that theory)

~~~
discardorama
For doing image training, Caffe is pretty good. Here's a starting point:
[http://caffe.berkeleyvision.org/tutorial/](http://caffe.berkeleyvision.org/tutorial/)

------
raverbashing
Yes, ConvNets are limited and results are pretty arbitrary sometimes.

The net correctly identified "leopard". Was it taught about sofas? Who knows;
maybe sofa had a high score on the output as well.

Or, look at the Dalmatian/Cherry picture. The net identified "Dalmatian"
_which is a 100% valid response!_ But whoever labeled it wanted "cherry". The
picture is 50% cherry 50% dalmatian.

Pictures often have more than one element, and a pure ConvNet is "one picture
to one label".

~~~
sova
Ha! I was wondering this exact thing. How could it identify a sofa if it never
learned about one? It seems very contrived. At the same time, the example is
kind of clear, but we are making neural nets that are emulating Pollock and
Dali, not Monet. If you can dig it. The whole beauty is the overlapping
interstitial matrix of weighted values that leads to these beautiful
discoveries. To rank such a fractal-like algorithm on whether or not it
predicts a label satisfactory to the average human is to mis-apply the
elegance and potential of these mathematical wonders, in my humble opinion.

------
geophile
This is the best HN submission I've seen in a very long time. Really thought-
provoking.

~~~
vonklaus
I agree. I don't work with machine learning or neural nets, and thus don't
have more than the most cursory layman's understanding of them. This article
read really well and was quite informative about the problems this technology
is facing.

------
deet
Is there any work on building self-verification into these types of networks?
For example, based on hierarchical categories of concepts?

If part of the network is trained on the concept of a cat, and whether or not
an image is a cat is fed into the training of the leopard classifier, it seems
like these problems would be avoided. Or is the notion that, with enough
training data and deep enough networks, the concept of "leopard is a cat" will
be learned?

------
rdlecler1
We spend so much effort trying to engineer intelligence, when we would get a
lot farther reverse engineering intelligence. Whenever AI makes a big advance
the analog was already known by neuroscientists. There is also clearly no
comprehension of the importance of the topology (circuitry) defining a neural
network. We always assume a fully connected network, and draw it out as such,
but we don't stop to consider that many of those Wijk interactions
are completely spurious, meaning they have no information bearing role. If you
strip them away you'll start to reveal the underlying circuit at work. I've
published theoretical results using artificial gene networks, but the results
should be similar for ANNs.
[http://m.msb.embopress.org/content/4/1/213.abstract](http://m.msb.embopress.org/content/4/1/213.abstract)

~~~
abrichr
 _We spend so much effort trying to engineer intelligence, when we would get a
lot farther reverse engineering intelligence. Whenever AI makes a big advance
the analog was already known by neuroscientists._

The problem with attempting to understand intelligence by reverse engineering
the human brain is that we cannot know a priori which aspects of the human
brain are necessary for intelligence to arise, and which are merely
consequences/side effects of biology and chemistry. Once we discover some
technique that works in a practical setting (e.g. on ImageNet), then it is
fairly straightforward to find the biological analogy in the brain.

In fact, Geoff Hinton explicitly advocates an approach of "try things, keep
what works, and figure out how it relates to the brain". The inverse is like
finding a needle in a haystack.

 _There is also clearly no comprehension of the importance of the topology
(circuitry) defining a neural network. We always assume a fully connected
network, and draw them out as such, but we don't stop to consider that many of
those Wijk interactions are completely spurious, meaning they have no
information-bearing role._

The purpose of training a deep neural network from data is to automatically
discover what the topological circuitry of the network should be, rather than
engineering it by hand. In the brain, some prior knowledge is encoded via
genetics, while the rest is learned. The effect of sparsity of the weights in
deep neural networks is an active area of research [1].
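The sparsity point can be illustrated with a toy numpy example: prune the small-magnitude weights of a layer and only the entries carrying signal survive, exposing an underlying "circuit". The weight values here are invented for the example.

```python
import numpy as np

# Invented weight matrix: most entries are near-zero noise, two carry signal.
W = np.array([[0.02,  2.00, -0.03,  0.01],
              [0.05, -0.02,  0.04,  0.00],
              [0.01,  0.03, -0.05, -1.50],
              [0.02,  0.00,  0.04,  0.03]])

mask = np.abs(W) > 0.5           # keep only large-magnitude weights
W_sparse = np.where(mask, W, 0.0)
print(int(mask.sum()), "of", W.size, "weights survive pruning")  # 2 of 16
```

Magnitude pruning is only the crudest version of this; the sparse-coding literature cited above studies ways to learn such sparse structure directly.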

 _If you strip them away you'll start to reveal the underlying circuit at
work. I've published theoretical results using artificial gene networks, but
the results should be similar for ANNs._

Very interesting. If I understand correctly, the cost you are attempting to
minimize is phenotypic variation, which you measure as the gross cost of
perturbation (GCP). Would this cost be analogous to sensitivity to adversarial
examples in the case of convolutional neural networks [2]?

[1]
[http://www.jmlr.org/papers/volume14/thom13a/thom13a.pdf](http://www.jmlr.org/papers/volume14/thom13a/thom13a.pdf)

[2] [http://arxiv.org/abs/1412.6572](http://arxiv.org/abs/1412.6572)

------
thret
Isn't the solution to simply have two NNs? One trained to identify leopards,
another trained to identify sofas.

Regardless of how computationally expensive NNs may be now, wait a few years,
and then train millions of them on different classes of objects and run them
concurrently to identify new pictures.
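A sketch of thret's two-detector idea, with stand-in scoring functions (real versions would be trained one-vs-rest networks, and the scores are invented). What it reveals is the catch: on a leopard-print sofa both detectors fire, so the ensemble surfaces the ambiguity rather than resolving it.

```python
# Stand-in binary detectors; real ones would be trained one-vs-rest nets.

def leopard_detector(features):
    # hypothetical: fires on leopard texture anywhere in the image
    return 0.9 if "leopard_texture" in features else 0.1

def sofa_detector(features):
    # hypothetical: fires on sofa-like shape
    return 0.8 if "sofa_shape" in features else 0.1

detectors = {"leopard": leopard_detector, "sofa": sofa_detector}

# The troublesome image exhibits both kinds of evidence.
features = {"leopard_texture", "sofa_shape"}
scores = {name: d(features) for name, d in detectors.items()}
print(scores)  # both detectors fire; the ambiguity is explicit, not resolved
```

Running the detectors concurrently is cheap; deciding what to do when several of them fire on the same image is the hard part.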

------
rcfox
I don't work with NNs at all, but it kind of seems like the author set up
their CNN with a large enough filter to see a whole spot at once, but not
large enough to see a whole cat at once, and then complains that it doesn't
know what a cat looks like. Would it be possible to make a larger filter with
a lower resolution, such that the overhead is the same as the smaller
filter's but it can get a higher-level view of the image?

Also, the author spends the first section of the article determining that it
is in fact a jaguar-print sofa (which the model also confirms) but continues
to throw around the word "leopard". They're not making it any easier for the
future machine learning algorithms that try to identify an image by the text
surrounding it. ;)
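The larger-filter-at-lower-resolution idea can be sketched as average-pooling the input before convolving; the numbers below are illustrative only.

```python
import numpy as np

def downsample(img, factor):
    """Average-pool by `factor` along both axes (dimensions must divide evenly)."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

img = np.arange(64, dtype=float).reshape(8, 8)
small = downsample(img, 2)

# A 3x3 filter on `small` now covers a 6x6 patch of the original image
# for the same compute cost as a 3x3 filter on the full-resolution input.
print(small.shape)  # (4, 4)
```

This is essentially what pooling layers already do inside a CNN: each successive layer's filters see a larger effective patch of the original image.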

------
mirimir
TinEye tells me that the leopard print "sofa" is actually a dog bed ;) [0]

[0]
[http://www.tineye.com/search/4c4ce7b6558e8d3c4dd443439e80556...](http://www.tineye.com/search/4c4ce7b6558e8d3c4dd443439e805561ee7d4f26)

------
stcredzero
Really, humans are not so different. Online and in person, we pattern-match on
a few scant signals, sometimes jumping to ridiculous conclusions as a result.
(1) Granted, we have much better machinery for recognizing the kinematics of
fellow animals. There's compelling evolutionary reasons for animals to become
really, really good at this.

If one paired the present classifiers with Amazon Mechanical Turk, just
providing one bit of information -- "is-it-an-animal?" -- I wonder how well
the current classifiers would fare in relation to human beings?

(1) - Ironically, the more "cosmopolitan" people become, the _quicker_ they
are to jump to such conclusions!

------
pilooch
Well, you can try this one, it is open source and based on Caffe:
[http://imgdetect.alexgirard.com/#](http://imgdetect.alexgirard.com/#) The
image link:
[http://rocknrollnerd.github.io/assets/article_images/2015-05...](http://rocknrollnerd.github.io/assets/article_images/2015-05-27-leopard-sofa/sofa.jpg)
It says 'studio couch, day bed' with 97% confidence. Very likely because it is
part of ImageNet.

------
Artemis2
[http://cloudsightapi.com/](http://cloudsightapi.com/)

This, given the leopard couch, returns "brown leopard print couch".

------
TheEzEzz
> the problem won't be solved by collecting even larger datasets and using
> more GPUs, because leopard print sofas are inevitable.

The models have room for improvement, but it's not clear to me that larger
datasets won't solve the problem. Larger datasets and more processing power
are exactly why neural nets have surged in effectiveness recently. Who knows
how much further current models can go with more data and processing power?

~~~
kristjankalm
Read the Hinton paper cited in the OP -- no amount of processing power will
make CNNs represent structured latent variables.

------
jmount
That is neat. Just for fun I tried to figure out how to set up Caffe on EC2. I
got it to work, but I didn't get CUDA up and it probably would have been
faster with Anaconda. But for what it is worth: here are my current notes
[https://github.com/JohnMount/CaffeECSExample](https://github.com/JohnMount/CaffeECSExample)

------
BrandonM
Sorry to go off-topic, but the way that font combined 'st' (with a loop back
from the 't' to the top-center of the 's') was very visually distracting to
me. Did that bother anyone else?

------
acd
How about detecting vector edge shapes and unifying that result with the
existing classifier? Surely a leopard sofa cannot have the same edge vector
shape as a real big cat.

~~~
Animats
That's an idea from 1970-1980s AI, called the "primal sketch" model. The
concept was to take an image and try to turn it into a line drawing, then
extract the topology and geometry.[1] Further processing might yield a 3D
model.

This sort of works in simple situations without too much edge noise. It's
been used for industrial robot vision, where what matters are the outside
edges of the part. It's not too useful when there's clutter, occlusion, or
noisy textures.

More recent thinking is to find surfaces, rather than edges. This works well
if you have a 3D imager, such as a Kinect. You can get a 3D model of the
scene. Occlusion remains a problem, but texture noise doesn't hurt.

[1]
[http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/GOME...](http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/GOMES1/marr.html#Marr82)
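The first step of that primal-sketch pipeline (image to edge map) can be sketched with finite differences; the bright-square image below is a made-up example.

```python
import numpy as np

def edge_magnitude(img):
    """Edge strength from finite-difference gradients."""
    gy = np.abs(np.diff(img, axis=0))[:, :-1]   # vertical changes
    gx = np.abs(np.diff(img, axis=1))[:-1, :]   # horizontal changes
    return np.hypot(gx, gy)

# A bright square on a dark background: edge response only at the border.
img = np.zeros((6, 6))
img[2:4, 2:4] = 1.0
edges = edge_magnitude(img)
print(edges.shape)  # (5, 5), nonzero only around the square's border
```

On a clean synthetic image like this, thresholding `edges` recovers the square's outline; on a leopard texture, nearly every pixel is an "edge", which is exactly the clutter problem described above.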

------
jng
I'm happy somebody is trying to put some sense into the whole absurdly
overblown machine learning field.

~~~
ghaff
I'm not sure it's absurdly overblown but individual advances/findings can be
way overhyped or at least over-generalized. ML/AI has been incrementally
delivering pretty impressive results within certain constraints. That's great
but there's then a widespread tendency to extrapolate those results to the
broader case--and then absurd/stupid-looking results happen.

We certainly see the same thing with autonomous vehicles. Given very accurate
mapping and a particular set of environmental and type-of-road conditions,
cars can do so well that it's tempting to say they're 95% of the way to fully-
autonomous. But dump them in a Boston snowstorm and you see they're really not
even close. (Which isn't to say that bounded use cases can't be very useful.)

~~~
leereeves
Overhype has always been the enemy of AI/ML, leading to unrealistic
expectations, then disappointment, then distrust.

But it might be different this time...

------
NHQ
Maybe the algorithm did detect a face in the couch, as we see them in trees
(wood sprites!).

------
arthurcolle
Suddenly Terrence Davis appears!

------
jostmey
After rotating the image 90 degrees, the predicted result changes
substantially. The author should not be surprised. A convolutional neural
network is translationally invariant, not rotationally invariant.

~~~
sbodenstein
Standard convnets do not contain explicit rotational invariance (unless you
include a layer such as this: arXiv:1506.02025v1). They can however learn
rotational invariance if you feed them rotated images.
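The augmentation sbodenstein describes can be sketched in a few lines: train on rotated copies so the network has a chance to learn rotational invariance from the data (the 90-degree steps here are just the simplest case).

```python
import numpy as np

def augment_with_rotations(image):
    """One image becomes four training examples: 0, 90, 180, 270 degrees."""
    return [np.rot90(image, k) for k in range(4)]

img = np.arange(9).reshape(3, 3)
batch = augment_with_rotations(img)
print(len(batch))  # 4
```

Arbitrary-angle rotations need interpolation and padding decisions, which is why most augmentation pipelines expose rotation range as a tunable parameter rather than hard-coding right angles.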

------
simonmd
Fascinating read.

------
davyjones
gi/go

------
robbrown451
I want that sofa.

