
How close are we to solving vision? - avivo
http://blog.piekniewski.info/2016/08/12/how-close-are-we-to-vision/
======
IsaacL
Here's an interesting test case for "context-aware" machine perception: the
UK's Hazard Perception test for new drivers. You sit at a computer and are
shown 15 recordings of everyday road scenes, from the driver's view, and you
have to click when a "developing hazard" appears. It's quite an annoying test,
since what counts as a "developing hazard" is not clearly defined, but the
intention is that practicing for the test trains your subconscious to scan for
minor cues -- e.g., a car rapidly approaching from a side road (which will
likely pull out without stopping), or two pedestrians finishing their
conversation and one turning to face the street (they will likely step out
into the road).

[https://www.gov.uk/theory-test/hazard-perception-
test](https://www.gov.uk/theory-test/hazard-perception-test)
[https://www.youtube.com/watch?v=SdQRkmdhwJs](https://www.youtube.com/watch?v=SdQRkmdhwJs)

------
unlikelymordant
The visual cortex in the brain (which is the physical analogue of the conv
nets) is not the whole brain; it is really just a feature extractor for the
rest of the brain. So saying 'vision is not solved' and 'so much for superhuman
performance' is not really news: the performance is superhuman on that
particular dataset, and that is all. You still need the rest of the brain to
reason about the inputs and correct weird errors. These results are a stepping
stone to better results on harder datasets. I have a feeling people will be
saying 'vision is not solved' until general AI is realised (which in a sense
is true, but discounts the very real progress being made today).

~~~
erikpukinskis
I would go further and say that the visual cortex and the rest of the brain
are really just feature extractors for the body to use. And without a lifetime
of embodied experience, neither can "vision" be solved nor can "general AI" be
realized.

It's impossible to really see something unless you have the opportunity to
interact with it (or things like it). It's impossible to understand something
unless you have the opportunity to interact with it (or things like it).
Without interaction there can be no intelligence. The best you can do is have
an intelligent being (the human researcher) train a computer to mimic
intelligent action in a constrained environment. But there is no path from
there to general intelligence without interaction.

And there is no interaction without a body of some shape. The body can be
virtual, and needn't look like a human body, but there needs to be a set of
actuators which cause realtime changes in the input space.

~~~
unlikelymordant
> Without interaction there can be no intelligence. And there is no
> interaction without a body of some shape.

I disagree fully. Can a person born quadriplegic not be considered intelligent
simply because they can't interact with things? They can interact by
communicating, same as a computer can.

~~~
erikpukinskis
A quadriplegic has a body, they can move their head, mouth, eyes, etc, and
those things can influence the world. That constitutes a perception-action
loop, which is a basic requirement for intelligence.

An AI without any actuators is more like a fully paralyzed person.

A baby which was born totally paralyzed would in fact be profoundly retarded
and they would be unable to communicate or form anything resembling
intelligence. See [http://io9.gizmodo.com/the-seriously-creepy-two-kitten-
exper...](http://io9.gizmodo.com/the-seriously-creepy-two-kitten-
experiment-1442107174)

An AI with vocal cords and a microphone in the real world has a body.

------
catwell
This is the same kind of question as: "are computers better than humans at
math?" They are obviously better at some things related to scale: they can
compute things much faster and derive solutions to equations much more easily.
But the issue is: they don't really "understand" what they do. And that is why
computers are still not better than humans at discovering things
independently, even though a lot of proofs are now machine-assisted.

Similarly, machines are becoming better than us at recognizing some specific
instances within categories of objects because they can know more of them,
i.e. they have larger "databases". But they are still bad at learning new
concepts on their own, even though there has been much progress on that front
in recent years.

In general, I don't think it is a good idea to consider limitations in
Computer Vision as "vision" issues; instead we should consider them as wider
AI issues. Basically, ask ourselves: "could a blind human solve this problem
with the information that our current vision algorithms have?"

I wrote a more detailed response along those lines over three years ago at
Quora, when I was still working in CV and the field was making its switch
to neural nets and deep learning. I still think it is mostly relevant today.

[https://www.quora.com/What-are-the-major-open-problems-in-
co...](https://www.quora.com/What-are-the-major-open-problems-in-computer-
vision/answer/Pierre-Chapuis)

~~~
TheOtherHobbes
It's not obvious that humans really "understand" math. Or at least, only a
very tiny minority of humans understand math well enough to improvise with it.

Most humans are only able to learn a small handful of "cookbook" math
practices.

This is a standard trope in AI - AIs are compared with the sum total skill of
human culture as a whole, not with the relatively weak skills of individual
humans. (We have individuals with stand-out skills in specific domains, but
there are no - at least virtually no - individuals with stand-out skills in
many domains.)

Perhaps future approaches to AI will be collective. Instead of a single smart
all-powerful monoAI we'll build evolving problem-solving polyAI cultures, and
skim off the skills and insights they develop.

So "solving vision" isn't a useful measure. AI vision is getting close to
classifying photos with human-like levels of consistency. 3D vision is still a
problem, but will probably come with time.

But then what? Non-blind humans can all recognise familiar people, pick out
strangers as strangers, identify a standard selection of objects, make
educated guesses about non-familiar objects, and so on.

But humans can also appreciate art, identify memes and find them amusing,
respond to font choices and colours, describe and label spatial relationships
and views, and point to the location of objects/places that are not currently
in view.

Trained artists and architects can identify and name specific proportions and
identify cultural references.

Etc. How many of these are necessary to "solve vision"?

------
jules
Could these problems be solved with bigger networks, or do you really need to
improve the algorithm beyond that?

~~~
kristjankalm
No, not really. The "structure of the world" such nets learn is based on
bottom-up processing of data -- moving from basic features such as
orientations and colours to more complex features. As a result a net will make
famously absurd predictions, like mistaking a spotty fur coat for an actual
leopard: it has no "model" of a leopard, in the sense that people reason "this
is an absurd place for a leopard to sit, so it's most certainly a fur coat
rather than an animal". Or, to use a more technical term, it has no prior
probability of a leopard given the observed data. Hence a standard convnet (if
there is such a thing) will massively overestimate the probabilities of such
"adversarial" stimuli.

------
Houshalter
The progress that has been made in the past 5 years has been amazing. Five
years ago, no one would have predicted superhuman performance on ImageNet by
2016. Hell, no one predicted it would be even close for much simpler datasets
like CIFAR-10,
which is just low res images of 10 types of objects. This is amazing progress,
don't let the AI Effect ruin it
([https://en.wikipedia.org/wiki/AI_effect](https://en.wikipedia.org/wiki/AI_effect)).

Second you can't measure progress without a quantitative benchmark. Feeding a
net random images from a different dataset, and then just noticing it makes
some mistakes, is not scientific. Sure, I agree, ImageNet has been beaten and
we need something better to compare with humans. We need bigger and harder
datasets. We need more interesting tasks than classification. We need to work
more on video than static images. And researchers are working on this. It's
not going to happen overnight, but if the current rate of progress continues,
it won't be that long.

Also I question whether this focus on machine vision is actually that
productive. Originally, the developments in vision generalized to many other
domains. But now they are increasingly focused on little tricks and
optimizations that only apply to that specific task. I don't think it's
contributing towards general AI any more.

The human brain has evolved to do vision well. It probably uses a huge number
of tricks and optimizations to do as well as it does. NNs may eventually get
that good, but it's interesting they can do so well _without_ being so highly
task specialized. This makes them very general and applicable to many other
kinds of problems.

Lastly, half the problem is just computing resources. The biggest nets are
still roughly comparable to insect brains (more synapses, but fewer neurons).
It's really amazing that we can get such good results with such underpowered
computers. Much better machine vision might be possible if we had more
computing power to train with. Training on big datasets, high resolution
images, and especially video, can be really expensive.

>My point here is different: notice that the mistakes that those models make
are completely ridiculous to humans. They are not off by some minor degree,
they are just totally off.

I wonder if the algorithms think the mistakes humans make are equally
ridiculous? Hinton once found some crazy errors NNs made, and then pointed out
that the image actually does kind of look like that thing, if you squint.

>In my next post in a few days I will go deeper into the problems of deep nets
and analyse the so called adversarial examples. These special stimuli reveal a
lot about how convolutional nets work and what their limitations are.

This is a super overblown issue. It's been shown that every machine learning
algorithm is vulnerable to adversarial examples. Linear models are especially
vulnerable; NNs are actually more resistant. We don't know that humans aren't
vulnerable to them - no one's ever opened up a human brain and backpropagated
to the inputs. Adversarial examples are astronomically unlikely to occur by
chance.
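
For what it's worth, "backpropagating to the inputs" is exactly how the
standard adversarial-example construction works. Here is a toy sketch of the
fast gradient sign method on a two-weight logistic model (all weights and
inputs are made-up illustration values; real attacks do the same thing against
a deep net):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fgsm(x, y, w, b, eps):
    """Fast gradient sign method for a logistic model p = sigmoid(w.x + b):
    step the *input* in the direction that increases the loss."""
    p = sigmoid(w @ x + b)
    grad_x = (p - y) * w          # d(cross-entropy loss)/dx for this model
    return x + eps * np.sign(grad_x)

w = np.array([2.0, -3.0])         # hypothetical trained weights
b = 0.0
x = np.array([1.0, 0.5])          # clean input, classified as class 1
y = 1.0                           # true label

p_clean = sigmoid(w @ x + b)      # confident and correct on the clean input
x_adv = fgsm(x, y, w, b, eps=0.4)
p_adv = sigmoid(w @ x_adv + b)    # same model, tiny perturbation, wrong side of 0.5
```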

~~~
eli_gottlieb
>This is a super overblown issue. It's been shown that every machine learning
algorithm is vulnerable to adversarial examples. Linear models are especially
vulnerable; NNs are actually more resistant. We don't know that humans aren't
vulnerable to them - no one's ever opened up a human brain and backpropagated
to the inputs. Adversarial examples are astronomically unlikely to occur by
chance.

Generative models are only vulnerable to adversarial examples _that are
actually unlikely_ in the data distribution. They do not have patterns or
filters that can be added to ordinary images to cause wild misclassifications.

The brain, as far as we know, uses generative modeling.

So yeah.

~~~
oifnwoivnoinvo
"Generative models are only vulnerable to adversarial examples that are
actually unlikely in the data distribution. They do not have patterns or
filters that can be added to ordinary images to cause wild
misclassifications."

^ Actually, you are wrong on this. See this recent paper "Universal
adversarial perturbations"

[https://arxiv.org/pdf/1610.08401v1.pdf](https://arxiv.org/pdf/1610.08401v1.pdf)

~~~
eli_gottlieb
That paper deals with non-stochastic deep neural networks, and its
mathematical analysis deals with discriminative classification. It doesn't
deal with generative models, which model the joint probability distribution of
classes and data instances rather than just taking a maximum-a-posteriori
estimate from the posterior.
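
The distinction can be sketched with a toy generative classifier (illustrative
1-D Gaussians, nothing from the paper): because it models the joint p(x, c),
it can also score the marginal p(x) and notice that an off-distribution input
is itself unlikely, which a pure discriminative MAP estimate never reports.

```python
import numpy as np

# Hypothetical class-conditional densities: p(x | c) is Gaussian per class.
means = {"cat": 0.0, "dog": 4.0}
sigma, prior = 1.0, 0.5           # shared width, uniform class prior

def gauss(x, mu):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def classify(x):
    """Return the MAP label *and* the evidence p(x) = sum_c p(x | c) p(c)."""
    joint = {c: gauss(x, mu) * prior for c, mu in means.items()}
    evidence = sum(joint.values())    # marginal likelihood of the input itself
    label = max(joint, key=joint.get)
    return label, evidence

label, px = classify(2.1)     # ordinary input near the "dog" class
_, px_odd = classify(40.0)    # adversarial-style input far from all classes
# px_odd << px: the generative model flags the odd input as improbable.
```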

------
mrfusion
One thing computer vision is missing is making a depth map from a 2D image.
You can look at a photograph and describe it as a 3D scene. This will be
important for many fields.

~~~
nicklo
This problem has already been solved with decent success using deep learning.

See:
[https://homes.cs.washington.edu/~jxie/pdf/deep3d.pdf](https://homes.cs.washington.edu/~jxie/pdf/deep3d.pdf)

~~~
mrfusion
That's a good start. I was thinking you could generate unlimited training data
by using a game engine. You'd have the actual 3D model for every single frame.
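
As a toy illustration of that idea (a fake one-sphere "engine", not any real
dataset): ray-cast a scene and keep the z-buffer, and every rendered pixel
comes with an exact depth label for free.

```python
import numpy as np

def render_sphere(size=64, radius=0.6):
    """Orthographic render of a sphere: returns a shaded image and its
    exact depth buffer - a perfectly labelled (image, depth) training pair.
    All scene parameters are made up for illustration."""
    ys, xs = np.mgrid[-1:1:size * 1j, -1:1:size * 1j]
    r2 = xs ** 2 + ys ** 2
    hit = r2 < radius ** 2
    z = np.zeros((size, size))
    z[hit] = np.sqrt(radius ** 2 - r2[hit])   # sphere surface height
    depth = 1.0 - z                           # distance from the camera plane
    image = z / radius                        # simple front-lit shading
    return image, depth

image, depth = render_sphere()
# The sphere's centre is closer to the camera than the background corners,
# and `depth` is the ground truth for every pixel of `image`.
```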

~~~
phorese
Yup, the community is on it!

[http://www.cv-
foundation.org/openaccess/content_cvpr_2016/ht...](http://www.cv-
foundation.org/openaccess/content_cvpr_2016/html/Ros_The_SYNTHIA_Dataset_CVPR_2016_paper.html)

[http://www.cv-
foundation.org/openaccess/content_cvpr_2016/ht...](http://www.cv-
foundation.org/openaccess/content_cvpr_2016/html/Mayer_A_Large_Dataset_CVPR_2016_paper.html)

[https://link.springer.com/chapter/10.1007/978-3-319-46475-6_...](https://link.springer.com/chapter/10.1007/978-3-319-46475-6_7)

And there's more every week... Blender, Unity Engine, Unreal Engine, you name
it. (Disclaimer: am author on one of these papers)

------
KayEss
Is each frame looked at separately? Given what is shown, there seems to be no
memory building up context and pruning the options. Is that really hard to add?

~~~
omginternets
There are networks called "attentional neural networks" that attempt to do
this. They tend to do very well at reading natural language, IIRC, but I've
also seen them applied to video.
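
A minimal sketch of the attention idea (illustrative numbers, not any specific
paper's architecture): a query scores each timestep of a sequence, and a
softmax over the scores decides where the network "looks".

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())       # subtract max for numerical stability
    return e / e.sum()

def attend(query, keys, values):
    """Dot-product attention: weight each timestep's value by how well
    its key matches the query."""
    scores = keys @ query          # similarity of the query to each step
    weights = softmax(scores)      # attention distribution over steps
    return weights @ values, weights

# Three timesteps with hypothetical key/value vectors.
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
values = np.array([[10.0], [20.0], [30.0]])
query = np.array([0.0, 2.0])       # query that matches steps 2 and 3

context, weights = attend(query, keys, values)
# `context` is a weighted summary dominated by the attended steps.
```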

------
davesque
Wouldn't some kind of recurrent network give better results for restricted
fields of vision like this?

