
The state of Computer Vision and AI: we are really, really far (2012) - harperlee
http://karpathy.github.io/2012/10/22/state-of-computer-vision/
======
1971genocide
One of the problems is we still do not have the hardware to begin doing the
needed computation to solve the problem.

The brain in reality is quite slow. We know neurons require at least 5 ms to
do any computation. But the real power lies in the sheer parallelism.

Essentially biology set a limit of 5 ms, but evolution worked around it by
creating billions of neurons. Even if they are slower, because there are so
many of them, they can do more computation in that 5 ms gap than all the
computers in the world combined! It's truly marvellous when you sit back and
think about that.
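
To put rough numbers on the parallelism argument, here is a back-of-envelope
sketch in Python; the 5 ms step is the figure above, while the neuron and
synapse counts are commonly cited approximations, not measurements:

    # Back-of-envelope estimate of the brain's parallel throughput.
    # All counts are rough, commonly cited approximations.
    NEURONS = 1e11        # ~100 billion neurons
    SYNAPSES = 1e14       # ~100 trillion synaptic connections
    STEP_S = 5e-3         # the ~5 ms per "computation" mentioned above

    # If every synapse can contribute one event per 5 ms step:
    events_per_second = SYNAPSES / STEP_S
    print(f"~{events_per_second:.0e} synaptic events per second")  # ~2e+16

Whether 2e16 events per second really beats "all computers in the world
combined" depends on what you count as a computation, but it shows how
billions of slow units add up to enormous aggregate throughput.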

When you think about how all of that gets done, modern computer architecture
works completely differently. What that means is we do not even have adequate
hardware to start working on this problem.

We can push the current limit of computation to solve sub-problems, and it
seems that under that constraint we have done very well. But it's a slow
evolution. GPUs gave us a clue, now we have FPGAs, and soon we will have
better hardware to create "better" intelligent machines. The machinery that is
the brain is vastly complex and beautiful, but not well understood. It's a
slow and incremental process, but we will get there sometime this century or
the next.

~~~
danieltillett
One of the most important things to remember is that the efficiency of the
human brain is layered. The ancient parts (vision, for example) have been
honed to an amazing level of efficiency, so even the most powerful computers
can't come close, while other modules (say, simple arithmetic) are so
inefficient that a $2 calculator needs only 1/100,000th the energy to solve
the same problem. The general rule is that anything we humans find easy to
accomplish is amazingly hard and ancient, while anything we find hard is
relatively easy and recent (evolutionarily speaking).
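
A rough sanity check of that ratio, with every number a loose assumption
rather than a measurement:

    # Crude energy-per-problem comparison; all figures are loose assumptions.
    BRAIN_POWER_W = 20.0    # whole-brain power draw, a commonly cited figure
    MENTAL_TIME_S = 2.0     # assumed seconds of mental effort per sum
    CALC_POWER_W = 100e-6   # ~100 microwatts for a small solar calculator
    CALC_TIME_S = 0.1       # assumed time for the calculator to answer

    brain_j = BRAIN_POWER_W * MENTAL_TIME_S  # bills the whole brain to one sum
    calc_j = CALC_POWER_W * CALC_TIME_S
    print(f"brain ~{brain_j:.0f} J, calculator ~{calc_j:.0e} J, "
          f"ratio ~{brain_j / calc_j:.0e}x")

Crudely billing the whole brain's draw to a single sum overstates things, but
even so the result lands within an order of magnitude or two of the 1/100,000
figure.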

~~~
1971genocide
One thing to point out is that even if the "module" that does arithmetic is
slow, the brain understands the context of that calculation.

2 + 2 can be viewed as a computation that happens in nature and in our brains
all the time.

But the real magic doesn't lie in the pure computation. Our brain understands
what that 2 means in the context of all the knowledge of the entire Universe.
2 cows, 2 sheep, 2 planets?

Understanding context is what we are really good at; as the article says,
"prior" knowledge. So it's not surprising that a cheap calculator can outsmart
me in computation when it doesn't actually grasp the context of its
computation, which is sort of cheating.

~~~
danieltillett
I was trying to stay away from "meaning" since it gets messy, but if you just
compare the energy used by sub-conscious human arithmetic with that expended
by a simple CPU, then the CPU is massively more efficient. We find mathematics
far more mentally taxing than visual observation purely because we are so bad
at mathematics, not because visual processing is easy.

------
dave_sullivan
He wrote this more recently when convnets surpassed "human level performance":
[http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/](http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/)

Gains have continued in 2015.

Also, more recent reflections on his research, which I think give a bit more
color to the OP:
[http://karpathy.github.io/2014/07/03/feature-learning-escapades/](http://karpathy.github.io/2014/07/03/feature-learning-escapades/)

Kind of like with the rise of electricity, the microprocessor, the PC, or the
internet: in the beginning, only the people building it understood what all
the fuss was about. But that changed quickly over the course of N years (where
N ends up being smaller than everyone thinks). If you had started a career in
any of those fields before they were obvious in hindsight, you would probably
have done quite well.

The author of the post has not quit to go start a photo app as far as I know,
he's still doing research on the cutting edge of deep learning because that's
where the most promise is.

~~~
joe_the_user
I don't think his later comments really negate his earlier comments.

Neural networks continue to make progress in the narrow fields they are
designed for, and researching them, I'm sure, continues to be interesting.
That doesn't change the point that humans don't interpret an image as a couple
of annotations but as a rich fabric of information far beyond what computer
vision currently does.

Basically, there is an ocean of interesting, useful, and exciting things
computers can do before they arrive at what humans can do.

------
Netcob
I'm not sure how well we can estimate how difficult a task is just based on
our amazement at "iceberg complexity".

Just the fact that I was able to read that article the way I did is the
product of something enormous. Discovery of electricity and semiconductors,
mathematics, development of CPUs and memory, global computer networks, the
entire software stack that allows me to display what someone wrote three years
ago on a glowing rectangle with no involvement of any paper at all. The entire
social and economic development that allows me to read this instead of walking
through the woods and trying to impale some creature on an arrow.

That's a _lot_ of solved problems, most of them figured out during the past
century. I don't think some image segmentation, pose estimation, and semantic
reasoning are going to take quite as long, especially with more and more
people working on it.

------
emddudley
Advances in computer vision are being made every day. Image understanding is a
big challenge, and the way we tackle big challenges is in little steps.

I was impressed by some video segmentation and object classification results
that Microsoft showed off the other day at its Ignite conference. We're a lot
farther along than some people realize.

Picture here:
[https://twitter.com/MS_Ignite/status/595365048547180545](https://twitter.com/MS_Ignite/status/595365048547180545)

Video clip at the 1:02:00 mark:
[https://channel9.msdn.com/Events/Ignite/2015/KEY02](https://channel9.msdn.com/Events/Ignite/2015/KEY02)

~~~
balakk
In fact, this blog's author published a paper pretty recently on scene
understanding.

[http://cs.stanford.edu/people/karpathy/deepimagesent/](http://cs.stanford.edu/people/karpathy/deepimagesent/)

This got a lot of press:

[http://www.nytimes.com/2014/11/18/science/researchers-announce-breakthrough-in-content-recognition-software.html](http://www.nytimes.com/2014/11/18/science/researchers-announce-breakthrough-in-content-recognition-software.html)

------
tehchromic
The article implies that we will or ought to arrive at the point where a
machine can appreciate why an image is funny in the same complex way that a
human can. Not to discount the value of AI Vision research and technology, but
at some point we must ask "why".

Partly the question goes to the difference between artificial intelligence and
artificial consciousness. According to some definitions, the former is the
ability to produce relevant information, while the latter is an autonomous
system capable of using that information for self-interest.

For example, no matter how complex a system like Watson is, it's no more
conscious than a rock. Meanwhile, a rudimentary life form is something we are
nowhere near replicating artificially. This distinction is quite important,
and very intelligent AI pundits seem to fail to make it (or understand it?).

While we are quite capable of producing intelligence artificially, the
capacities associated with consciousness are easily confused with it in the
mind. While some intelligence is easy to produce with machines, there are some
problems of experience that simply cannot be solved by artificial intelligence
and require consciousness.

But let's be clear: producing an artificial consciousness is orders of
magnitude more complex an engineering challenge than building an artificial
intelligence machine.

It is also potentially enormously dangerous: many times more difficult to
create, and at least as destructive as the atomic bomb.

------
tehwalrus
_> How can we even begin to go about writing an algorithm that can reason
about the scene like I did?_

You are doing AI wrong. AI should _learn_ all of that context by itself, from
a large amount of stimulus. If it were a good one, it might be able to learn
enough in less than N years, where N is the age of a human who would laugh at
the photo.

~~~
scott_s
I think the author was using a linguistic shorthand. He is an active
researcher in the field:
[http://cs.stanford.edu/people/karpathy/](http://cs.stanford.edu/people/karpathy/)

------
charlysisto
The flash you get out of a joke is built on a huge framework of human
interaction that takes literally years for the "dedicated" biological
intelligence of a brain to acquire. The fact that AI is only starting to reach
the power of facial recognition, which for us is a seemingly "low-level
automatic" function, tells us how much learning is left for an AI to come
close to our understanding of the world. Not only in terms of computational
power but in the length and diversity of the learning process.

Assuming sufficiently powerful neural networks, they will probably go through
years of learning our world through interaction, just like a kid does, before
they "get" the joke. That doesn't mean it's impossible, and that's quite scary
(in a good and bad sense, I guess).

~~~
nileshtrivedi
Years _worth_ of learning. They could learn by watching videos, or playbacks
of sensory data from early AI bots, or by playing video games created for the
specific purpose of training them, which would be much more efficient.

~~~
charlysisto
Good point. The thought actually popped into my mind after I wrote the post:
"Wall E" and many more films and books have suggested this accelerated
learning path.

However, this is information only, not interaction; I suspect this will have a
serious distortion effect on how the AI "perceives". It's a wild guess, but I
believe interaction is at the root of understanding.

Edit: yes, you also mention interaction through video games, which I skipped
when I scanned your comment. But then again, video games might still be far
from the depth of real-world interaction, more of a learning reinforcer than
the source of it, like books are for us....

~~~
fapjacks
But there is already so much content on the internet that accelerated
learning could still happen by way of proxied interactions visible in "old"
content. Does your AI _have_ to interact with people to learn how to interact?
Or is watching interaction good enough?

~~~
charlysisto
In a nutshell I don't see how any kind of adaptive intelligence can bypass the
reinforcement process of trial and error through interaction. Then again you
could have simulated interaction, but that may be the equivalent of the
machine dreaming :-)

~~~
fapjacks
You can learn to not touch a hotplate by watching someone burn their hand on
it.

------
juanuys
A related post on the front page right now:

"Neural network chip built using memristors (arstechnica.com)"
[https://news.ycombinator.com/item?id=9501119](https://news.ycombinator.com/item?id=9501119)

------
JDDunn9
To be fair, the author is evaluating a computer on a task our brains have
specifically evolved to be good at: facial recognition, social interaction,
and familiar settings. You could turn it around and find tasks that computers
are good at, e.g. watch a month's worth of highway videos and count how many
cars passed.

------
mollerhoj
The argument in the article seems to be that we are really far away from
having an AI with social intelligence. However, AIs do not need social
intelligence to be interesting or dangerous. My best guess would be that
creating artificial humour is much harder than creating a machine that is
dangerously intelligent.

~~~
erikb
A great argument! In fact, humour is a tough task for humans as well. We do
it all the time, but only a few people can be consistently funny in the eyes
of other people. Also, different cultures experience humour differently;
e.g., German jokes aren't funny to Chinese audiences and the other way around.
The two senses of humour are nearly 100% mutually exclusive.

------
Simp
Three years later, computers now outperform humans on ImageNet. While we
still have a lot of work ahead, it shows how fast things can change in the
exponential world of computers.

~~~
quonn
Can you please give a link showing an exponential improvement between 2012 and
2015?

~~~
twelfthnight
[http://image-net.org/challenges/LSVRC/2012/results.html](http://image-net.org/challenges/LSVRC/2012/results.html)
One team at 15%

[http://www.image-net.org/challenges/LSVRC/2014/results](http://www.image-net.org/challenges/LSVRC/2014/results)
Most teams are below 15%, GoogLeNet is at 6%

[http://www.image-net.org/challenges/LSVRC/2014/results](http://www.image-net.org/challenges/LSVRC/2014/results)
Microsoft is now below 5%

[http://arxiv.org/pdf/1502.03167.pdf](http://arxiv.org/pdf/1502.03167.pdf)
Google is now below 5%

I can't really argue whether that's exponential or not, but it's amazing
progress in a short amount of time.
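
Using only the round numbers above, one can at least check whether the error
shrank by a roughly constant factor per year, which is what "exponential"
would mean on this metric:

    # Year-over-year shrink factor of ImageNet top-5 error,
    # using only the round numbers quoted above.
    errors = {2012: 0.15, 2014: 0.06, 2015: 0.05}

    years = sorted(errors)
    for a, b in zip(years, years[1:]):
        factor = (errors[b] / errors[a]) ** (1 / (b - a))
        print(f"{a} -> {b}: error shrinks x{factor:.2f} per year")
    # 2012 -> 2014: error shrinks x0.63 per year
    # 2014 -> 2015: error shrinks x0.83 per year

Three data points are too few to settle the question, but a constant-ish
multiplicative shrink per year is the signature one would look for.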

------
dicroce
We might be really far from an ASI that completely groks every nuance of that
picture... but we already have AI that can understand pieces of the picture...
and those AIs can be used to do interesting things today. This isn't an
all-or-nothing endeavor.

~~~
loopbit
A properly trained AI would have recognized Obama in that picture much faster
than I did. As for the foot on the scale, I only noticed it when the author
mentioned it.

So much for the 'quick glance'. Which brings me to another matter: one of the
reasons the author can extract all that information from that picture is that
all the elements in it have been 'seen' already. A machine might not be able
to extract the whole context, but things like the people involved and that
they seem happy? Easy (-ish).

------
vishnuharidas
The image produces totally different ideas in a baby and in an adult. A baby
doesn't understand most of its parts, but an adult does, because adults have
learned everything from experience.

We need systems that learn from experience.

~~~
gldalmaso
Though in order to build systems that learn from experience, we first need to
grasp a whole lot more clearly just how learning works for us, and how exactly
it is that we persist our experience data.

To me it seems we are still at the stage where we have to understand ourselves
better, because we simply can't devise an algorithm to solve a problem we
cannot solve ourselves.

We can do clever stuff with statistics and math, but that seems to me more
like a hack. For instance, people create models from very small datasets; if
you have kids, you can surely watch in amazement just how efficiently we are
wired in that regard. We try to mimic that by feeding huge datasets to
algorithms, but it still pales in comparison.

~~~
vishnuharidas
True, we still don't know how "learning" works in the brain. But it works in
our neuronal network. What about combining some real neuronal cells with
technology?

I remember a documentary on the Discovery Channel where some scientists used
rat brain cells cultured on a circuit to build a small robot, which learned to
avoid obstacles in its path.

A similar video is here:
[https://www.youtube.com/watch?v=1-0eZytv6Qk](https://www.youtube.com/watch?v=1-0eZytv6Qk)

------
bra-ket
Just step out of the 'Machine Learning' bubble for a sec.

there have been decades of research into 'Cognitive Architectures'
([http://en.wikipedia.org/wiki/Cognitive_architecture](http://en.wikipedia.org/wiki/Cognitive_architecture))
and 'Artificial Consciousness'
([http://en.wikipedia.org/wiki/Artificial_consciousness](http://en.wikipedia.org/wiki/Artificial_consciousness))

There is a massive amount of experimental observation on learning and
cognition in neuroscience and the cognitive sciences (from neurophysiology to
psychology) that is largely ignored by the Artificial General Intelligence and
Machine Learning communities.

On the other hand, the progress in Deep Learning, Computer Vision, NLP and
Robotics is largely ignored by neuroscientists, because these learning models
do not respect biological constraints.

There is a whole group of narrow domains, like Formal Concept Analysis,
Statistical Relational Learning, Inductive Logic Programming, Commonsense
Reasoning, and Probabilistic Graphical Models, that don't talk to each other
but all deal with cognition and conceptual reasoning using different tools.

I think we have a chance to make progress if these fragmented domains
converge.

~~~
1971genocide
Not really.

There are researchers in all the different fields whose sole job is to report
what other communities are doing and to be agents of cross-pollination.

Everyone agrees that artificial general intelligence is a difficult problem.

Practically, it's not possible to converge all the different fields, and
besides, what would be the point?

Each researcher is interested in solving their own set of problems, whatever
they find interesting or have the motivation to be part of the solution to.

Progress is being made, maybe not at the rate of Silicon Valley start-ups, but
hard problems require time to solve.

It would not be ideal for Computer Vision people to suddenly stop doing their
research and take the massive risk of putting all their eggs into the Deep
Learning basket.

People doing Computer Vision have their own sets of constraints and goals. If
tomorrow the garbage man, the cleaner, the cook, etc. all stopped working and
said "we are all going to work on deep learning", the world would stop
working.

As absurd as that sounds, that is what the implications would be if these
separate fields tried to converge. Even if we did solve the problem of AGI
today, what direct change or improvement in the human condition would we see
tomorrow?

When that AGI needs to be integrated into a framework like computer vision,
robotics, or a search engine, you need the domain experts and practitioners in
those various fields to still exist tomorrow to maximize the economic benefit
of such a technology.

~~~
bra-ket
I'm not suggesting experts should drop what they're good at and work on
integration; that's a job for engineers. Think 'Apollo 11': the integration of
different sciences into a single working product.

------
Houshalter
The author wrote this in 2011:

>My impression from this exercise is that it will be hard to go above 80%, but
I suspect improvements might be possible up to range of about 85-90%,
depending on how wrong I am about the lack of training data. (2015 update:
Obviously this prediction was way off, with state of the art now in 95%, as
seen in this Kaggle competition leaderboard. I'm impressed!)

That's more impressive than it sounds, because each percentage point is
exponentially harder than the last. Getting 95% accuracy is not 5% harder than
getting 90%.
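
A tiny illustration of the point, reading accuracy gains as error-rate
reductions:

    # Each accuracy jump is better measured by the fraction of
    # remaining mistakes it eliminates.
    def error_reduction(acc_from: float, acc_to: float) -> float:
        """Fraction of remaining errors eliminated by the jump."""
        return 1 - (1 - acc_to) / (1 - acc_from)

    print(error_reduction(0.90, 0.95))   # 0.5: half of all mistakes fixed
    print(error_reduction(0.95, 0.975))  # 0.5: half again, for only +2.5 points

Going from 90% to 95% halves the errors, and each further halving buys fewer
and fewer headline percentage points.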

Just recently, machine vision started beating humans on ImageNet. ImageNet
has 1,000 classes of high-resolution images taken randomly from the internet.
No one would have predicted that a few years ago.

Sometimes a notable researcher like Hinton says that something like
transcription of images into sentences might be possible in five years, only
for researchers to demonstrate it in five months.

I remember reading that early engineers working on computers were extremely
skeptical of the rate of computer advancement. They were so focused on narrow
technical problems that they didn't see the big picture.

------
ohitsdom
Why are we grouping together computer vision and AI? They are two distinctly
different fields, in my opinion. Computer vision is really hard. Great work
has been done in the last 5 years with tech like the Kinect, but there's
certainly a long way to go. AI has made small steps with projects like IBM's
Watson, but still seems to be in its infancy (and even that may be generous).

~~~
natch
Is there a good term for a system that integrates both fields to achieve the
goal of, say, matching human performance at visual scene understanding?

------
noreasonw
To understand the picture, the machine would have to reason something like
this.

First: Look for the focus; detect the zone the eyes are looking at. Result:
The focus is on the man at the right, since everyone's eyes are directed at
him.

Second: Sentiment analysis. Whatever he is doing, people find it funny.

People like to play jokes that make you experience strange things that you
can't understand in the moment.

Hypothesis: The man B near the main character A is interacting with A. It
seems that the foot of B is interacting with the machine M. So B interacts
with M, which interacts with A.

Generated question: What kind of interaction could be exerted by B on M to
cause M to get A confused?

Hypothesis: B's foot is making the machine malfunction in such a way as to
give a false message to A.

Hypothesis: To make this image more noticeable, the men are well-known or
famous people.

Data: The more serious the role these people play in society, the funnier the
image, since their behavior is more unexpected.

Reasoning like this, the machine could arrive at a plausible hypothesis of
what is happening in the scene:

A famous man B, who probably plays a serious and responsible role in society,
is playing a trick on a man A by making a machine M malfunction in such a way
that it gives a false message or information to A. The malfunction is caused
by putting a foot on the machine. To make it funnier, the main character can't
see that, ...

Having a general context like this, the machine could look for machines and
people that could play such a role and give a heavier weight to those that
make the joke a better one.

The next generation of machines could modify the image to make the joke
better, since they would understand the context and purpose of it perfectly.
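
A hypothetical skeleton of such a pipeline might look like the sketch below;
every stage, name, and score in it is invented for illustration, with the
perception steps stubbed out for this one photo:

    # Hypothetical skeleton of the hypothesis-driven reasoning described above.
    # All stages, names, and scores are invented; perception calls are stubs.
    from dataclasses import dataclass

    @dataclass
    class Hypothesis:
        claim: str
        score: float  # plausibility weight

    def detect_gaze_focus(image):      # "First": where is everyone looking?
        return "man on the scale"      # stub

    def sentiment_of_faces(image):     # "Second": what do bystanders feel?
        return "amused"                # stub

    def generate_hypotheses(focus, mood):
        return [
            Hypothesis("B's foot makes machine M give A a false reading", 0.6),
            Hypothesis("A is reacting to something off-camera", 0.2),
        ]

    def fame_prior(h):                 # "the more serious the role... the funnier"
        return 1.5 if "foot" in h.claim else 1.0

    def best_hypothesis(image):
        focus = detect_gaze_focus(image)
        mood = sentiment_of_faces(image)
        candidates = generate_hypotheses(focus, mood)
        for h in candidates:
            h.score *= fame_prior(h)
        return max(candidates, key=lambda h: h.score)

    print(best_hypothesis(None).claim)

The hard part, of course, is everything the stubs hide.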

------
abetusk
At the risk of stating the obvious: even if this were true and we were
actually really far away from AI or good computer vision classification,
Moore's law should take effect. For every order of magnitude of difficulty, by
whatever measure, we should only need to wait a fixed additional amount of
time.
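
A quick sketch of that arithmetic, assuming (roughly in line with Moore's law
historically) that available compute doubles every two years:

    import math

    # Under exponential growth, each extra order of magnitude of required
    # compute costs only a fixed additional wait. Assumes a 2-year doubling.
    DOUBLING_YEARS = 2.0

    def wait_for(factor: float) -> float:
        """Years until available compute grows by `factor`."""
        return math.log2(factor) * DOUBLING_YEARS

    for magnitude in (10, 100, 1000):
        print(f"{magnitude:>5}x harder -> wait ~{wait_for(magnitude):.1f} years")
    # 10x -> ~6.6 years, 100x -> ~13.3 years, 1000x -> ~19.9 years

So a problem 1000x beyond today's hardware is, on this assumption, a
twenty-year wait rather than a thousand-year one.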

Also, notice that for this particular picture a lot of what the article is
talking about doesn't matter:

    
    
      - Does it matter that they're in a hallway?
      - Does it matter there are mirrors?
      - Does it matter that one person is the president?
    

The main gag is that one person is standing on a scale with the intent of
making a weight measurement while the other is subverting that intent,
presumably with the knowledge of bystanders. My bet is that this picture could
be classified as funny, with the joke even correctly labelled, if the picture
were tagged and a classifier were run on a large database of other tagged
pictures.

~~~
arbitrage
There's the explicit gag itself, and it's an old one, of stepping on a scale.

The real joke, however, is that it is the president. It's more humorous
because it shows the president as a real person. It's unexpected, and out of
keeping with how we typically expect the holder of the office of President of
the United States to behave.

That's the main joke. So yes, the finer details are important. It's not just a
funny picture.

------
coco1989
There is no reason for a "computer" to do all that, and if it could, it would
not. Everything he talked about is about dealing with the limitations of our
brains. The machine that is our mind would not care, because it would not care
about its limits. We are, as a computer, one big limited system constrained by
the process that brought us into being: chance and natural selection. An AI is
not constrained by this any more than a dog is constrained from finding the
same thing funny. Humor is not a problem for us because we are part of the
problem; we are a constantly derived huge word puzzle. An AI would not be
that; an AI would have a purpose.

------
zk00006
The article nicely illustrates the complexity present in a single image, and
I agree that we are far from automatically and fully understanding images like
this. I believe we would need to replicate the full brain to do that. But
still, there are many interesting applications that are possible with
state-of-the-art computer vision. The question is whether the state of the art
is useful for industry, not whether the ultimate goal is close or far.

------
ilaksh
The article is from a few years ago and missed the AGI research existing at
the time. Deep learning is now mainstream, and awareness of AGI research that
has been going on for years is slowly creeping into mainstream AI.

Anyway, he mentioned at the very end of the article what needs to be done,
which is to emulate the embodied human development process, and there is quite
a lot of progress in that area.

------
nobody_nowhere
It's not just vision.

~~~
solve
I have no hesitation in arguing that our textual understanding is currently
far worse.

E.g. modern deep-learning classifiers can get an F1-score in the high 90s on
identifying entities in images, but what's our capability for recognizing,
e.g., product entities being mentioned in Amazon reviews? Good luck getting
even a 70% F1-score. It's still incredibly awful.

And this is just talking about very basic entity recognition in realistic
settings. Let's not even consider relation extraction and more meaningful
tasks.
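
For reference, the F1-score being compared is just the harmonic mean of
precision and recall; a minimal definition, with made-up illustrative values:

    def f1_score(precision: float, recall: float) -> float:
        """Harmonic mean of precision and recall."""
        return 2 * precision * recall / (precision + recall)

    print(f1_score(0.96, 0.94))  # ~0.95: the "high 90s" image-entity regime
    print(f1_score(0.75, 0.65))  # ~0.70: the struggling product-entity regime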

------
ChuckMcM
Presumably the author is familiar with the Google paper -
[http://arxiv.org/pdf/1411.4555v2.pdf](http://arxiv.org/pdf/1411.4555v2.pdf)
which should get the people, scale, and laughing bits out of the image. There
are a remarkable number of technologies converging here.

------
putzdown
You know, there are some problems that are simply unsolvable. Star Trek-style
teleportation, I'm inclined to think, will never happen.

I think "true" AI of the sort described in this article is, likewise a pipe
dream. Not in a 100 million years of scientific effort could we create the
hardware and software necessary to do this. If you think otherwise, that's
because you have an enduring and overriding faith in the power of science to
overcome all barriers and achieve all goals. But on what basis could you place
that faith? Sure, science is generally good at solving problems and advancing
technology, but there are some "technologies" that are simply not attainable:
indeed, not even clearly definable.

I think that deep AI of this kind is one of those. We think we know what it
would mean for a computer to think like a person. It's not so clear that we
do. Given that we deeply don't understand how we ourselves not only think, but
also feel, want, assess, moralize, and experience, we're unlikely to produce
thinking machines.

"Ah, but the Power of Science, may it be praised forever!" you say. "Science
will help us to understand how we understand!" Yeah, maybe, nah, I don't think
so. Just look at the construction itself. It's loopy. "Understand how we
understand"? I doubt it's possible for any instrument to ever really
comprehend itself; in other words, for any thinker to ever think accurately
about how it thinks.

We should keep AI research within the realm of observable, measurable, useful
utilities. "Comprehension" of any kind... it's never going to happen.

There I've said it. Now roast me at the stake, O ye of great faith.

------
bsaul
The funny thing is, you could just as well think about the complexity of any
kind of animal behavior. AI isn't just very far from human intelligence; it is
also very far from any kind of living intelligence.

~~~
peter303
Roundworm elegans has just 302 fully mapped nerve cells. A people are still
trying to figure out how that small system works.

~~~
juliangregorian
If you're interested you can join in!
[http://www.openworm.org/](http://www.openworm.org/)

------
jctrope
>I have a really cool idea for a mobile local social iPhone app.

I wonder if this was tongue-in-cheek, since I thought people didn't start
making fun of mo-lo-so until "Silicon Valley" aired:
[https://www.youtube.com/watch?v=J-GVd_HLlps](https://www.youtube.com/watch?v=J-GVd_HLlps).

------
CyberDildonics
The state of Computer Vision and AI: we are really, really far _away_

------
ColinWright
Something related - the title is:

    
    
        The state of Computer Vision and AI:
        we are really, really far. (2012)
    

To me, that means we are really close to achieving it, because we are really,
really far along the path. But reading the article immediately creates a
cognitive dissonance - that can't be right, can it?

No, the author means "we are really, really _far away from our objective_ ,"
which is not the same thing at all.

Was this deliberate, or did the author, in his focus on the question of
interpreting images, simply not notice that his text was ambiguous as well?

Or is it just me?

~~~
raverbashing
I think it's just you (from your username you're probably a native English
speaker, still)

"we are far" to me means "we are distant (from the objective)"

[https://en.wiktionary.org/wiki/far#Adverb](https://en.wiktionary.org/wiki/far#Adverb)

If it was "far ahead", or "we've come far" I'd agree with you

~~~
slight
It's not just him. I'm a British English speaker and read it the same way,
as "We are far [along our course]". It came across as a bit of an odd phrase,
but I don't think it's really good English either way; it needs an object to
be clear.

~~~
vutekst
I'm from the US and I read it the same way, as a statement that we are far
along our course.

------
jere
Oh, but Ray Kurzweil says we'll have an AI beating the Turing test in 14
years, so I guess we're fine.

------
meric
I don't find the picture funny, even though I understand all the bullet points
from looking at the picture.

Yes, the man is confused because he's being pranked by the president. It would
be funny if I was there.

But looking at this photo? It's not funny to me.

I suppose I would rank this picture a lot higher than a picture of a
skyscraper in the list of funniest pictures, but neither would be above the
"laugh" threshold.

Does everyone else here find it funny?

~~~
sweezyjeezy
You're missing the point entirely. You can explain why people might find it
funny; that's the point.

~~~
meric
Thanks...I thought it was slam-dunk laugh out loud funny for everyone except
me and something was wrong with my sense of humour...

