
Researchers Announce Advance in Image-Recognition Software - mxfh
http://www.nytimes.com/2014/11/18/science/researchers-announce-breakthrough-in-content-recognition-software.html
======
karpathy
Hi all, I'm the author of the Stanford paper, so I can answer any more
detailed/technical questions. The article doesn't mention this, but there
is closely related work (if anyone is interested) that I've only very
recently become aware of:

From UofT:
[http://arxiv.org/pdf/1411.2539v1.pdf](http://arxiv.org/pdf/1411.2539v1.pdf)

From Baidu/UCLA:
[http://arxiv.org/pdf/1410.1090v1.pdf](http://arxiv.org/pdf/1410.1090v1.pdf)

From Google:
[http://googleresearch.blogspot.com/2014/11/a-picture-is-worth-thousand-coherent.html](http://googleresearch.blogspot.com/2014/11/a-picture-is-worth-thousand-coherent.html)

From Stanford (our work):
[http://cs.stanford.edu/people/karpathy/deepimagesent/](http://cs.stanford.edu/people/karpathy/deepimagesent/)

Edit: And seemingly also from Berkeley
[http://arxiv.org/pdf/1411.4389v1.pdf](http://arxiv.org/pdf/1411.4389v1.pdf)

At least from my perspective, the main motivation behind this work is in
thinking of natural language as a rich label space. As humans, we've very
successfully used language for communication and knowledge representation, and
I think our algorithms should do the same. This also enables much more
straightforward I/O interactions between a human and machine because humans
speak text fluently. They don't speak a set of arbitrarily devised and
assigned labels.

So the hope is that in the future you can, for example, search your photos
based on text queries, or ask the computer questions about images, and get
answers back in natural language. There's a lot of exciting work to be done!

~~~
pfisch
I don't understand why you are doing this and I don't understand why other
people in this thread are praising you.

I understand that this problem is interesting to work on but seriously, this
is obviously going to fall into the hands of people who are going to use it to
make the world worse. If you don't think computer vision is going to be used
to kill people and to take away freedom you are lying to yourself.

For every positive use this could have there are far worse negative uses and
it is so clear that they will happen.

There are so many other fields you could work in that make use of machine
learning algorithms. Why would you work to advance a field with so many
obviously evil applications that will definitely happen?

~~~
Houshalter
The main application of computer vision is robotics and automation.

If you don't like war or mass surveillance, stop voting for it. It's like
blaming the Wright brothers for the city bombings of World War II.

Do you really believe that if these researchers didn't work on it, the
technology just wouldn't be invented at all?

And I think the military applications are sort of overstated. We already have
drones and effective mass surveillance. Machine vision can help, but it
doesn't make or break it.

~~~
pfisch
Mass surveillance is almost worthless without computers to monitor all of
the video/audio feeds. You would need something like a million people
listening to the feeds, and even then you couldn't monitor 99% of the data.

------
sillysaurus3
Direct link to the paper, "Deep Visual-Semantic Alignments for Generating
Image Descriptions":
[http://cs.stanford.edu/people/karpathy/deepimagesent/devisag...](http://cs.stanford.edu/people/karpathy/deepimagesent/devisagen.pdf)

Some impressive examples of what the algorithm can do:
[http://cs.stanford.edu/people/karpathy/deepimagesent/](http://cs.stanford.edu/people/karpathy/deepimagesent/)

Relevant xkcd: [http://xkcd.com/1425/](http://xkcd.com/1425/)

~~~
sp332
Flickr's response to the xkcd:
[https://code.flickr.net/2014/10/20/introducing-flickr-park-or-bird/](https://code.flickr.net/2014/10/20/introducing-flickr-park-or-bird/)

------
zellyn
A bit more info:
[https://plus.google.com/118227548810368513262/posts/KjUwmgHa...](https://plus.google.com/118227548810368513262/posts/KjUwmgHaRSi)

The examples of incorrect captions at
The examples of incorrect captions at
[http://googleresearch.blogspot.com/2014/11/a-picture-is-worth-thousand-coherent.html](http://googleresearch.blogspot.com/2014/11/a-picture-is-worth-thousand-coherent.html)
are particularly interesting.

------
gear54rus
So what does it say when it's shown a Rorschach test? The test is supposed
to suggest different shapes to different people; what does this machine
have in mind? :)

------
ChuckMcM
Presumably most of the images on the internet could be captioned "A cat" :-)
This is really impressive research though, with lots of applications in even
mundane things like figuring out Street View images. The really interesting
test will be Google Image search, once the auto-caption text is more
accurate than the human-supplied image description.

------
technopessimist
I suspect that regardless of how far this technology advances, doctors will
still be manually reading X-rays and MRIs for decades to come--and of course,
in the US, collecting obscene salaries for doing so.

Sure, they may adopt the technology to make their diagnoses more accurate, but
they will still demand that they have final say and that the law effectively
make them its gatekeepers.

Even if these systems prove more accurate, doctors will still fear-monger
(successfully) about the danger of trusting algorithms and stress the need for
competent, highly-educated MDs being involved at every step along the way.

Medicine will be the last field automated.

~~~
rl3
Until we have _Elysium_-style re-atomization machines that magically heal
people, I'm perfectly happy having a doctor in the loop.

~~~
taeric
I'd put it a different way. As long as having a doctor in the loop doesn't
harm outcomes, I'm happy having one in the loop.

------
benanne
Looks like the Toronto group have been working on something very similar as
well:
[http://lanl.arxiv.org/abs/1411.2539](http://lanl.arxiv.org/abs/1411.2539)

Has anybody been able to find the Google paper? The article says it's on
arxiv, but I can't seem to find it there. All that seems to be published so
far is this blog post: [http://googleresearch.blogspot.be/2014/11/a-picture-is-worth-thousand-coherent.html](http://googleresearch.blogspot.be/2014/11/a-picture-is-worth-thousand-coherent.html)

~~~
dumitrue
It was just posted:
[http://arxiv.org/abs/1411.4555](http://arxiv.org/abs/1411.4555)

~~~
iandanforth
I'm curious to hear your thoughts about learning object saliency from these
datasets. Most human-generated images have built-in biases toward framing
things humans care about, and all of the captions will reflect the relative
importance (to humans) of the pictured objects.

Captioning images, for humans, is a subset of a much more general skill set.
Humans can scan a broad visual scene for salient components, focus on those
while ignoring non-salient objects, and then organize their thoughts about
what has been seen in such a way as to produce an extremely low dimensional
description of the scene (a descriptive sentence).

Humans also have the advantage of immediate feedback on their generated
descriptions from peers or parents.

I haven't seen much work that has attempted to tackle datasets that aren't
pre-framed by humans, or ones that try to scale reinforcement learning. I'd
love to hear your thoughts or get suggested reading if any pops to mind.

------
Houshalter
Machine vision is advancing rapidly. This year the ImageNet machine vision
challenge winner got 6.7% top-5 classification error; in 2013 the best was
11%, and in 2012 it was 15%. Since there is a hard floor of 0% error and
each percentage point is harder to shave off than the last, this is
arguably better-than-exponential progress. It's also estimated to be about
human level, at least on that specific competition, and the human baseline
required practice and lots of time spent looking for similar images to
match against.
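
For anyone unfamiliar with the metric: top-5 error counts a prediction as
wrong only if the true label is missing from the model's five
highest-scoring classes. A minimal sketch with toy data (not the actual
ImageNet evaluation code):

    import numpy as np

    def top5_error(scores, labels):
        # scores: N x C class scores; labels: N true class indices
        top5 = np.argsort(scores, axis=1)[:, -5:]    # five best classes per row
        hit = (top5 == labels[:, None]).any(axis=1)  # true label among them?
        return 1.0 - hit.mean()

    scores = np.random.randn(3, 10)  # toy: 3 examples, 10 classes
    labels = np.array([2, 7, 0])
    print(top5_error(scores, labels))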

------
Animats
I'm enormously impressed, amazed that this works. I'm reading the papers to
understand how they do it.

Phase I is segmenting the image into rectangular regions which contain an
"object". That's done using the approach in this paper.

[http://koen.me/research/pub/uijlings-ijcv2013-draft.pdf](http://koen.me/research/pub/uijlings-ijcv2013-draft.pdf)

The regions can overlap, and one region can be completely contained in
another. Regions found may turn out to be useless, so the goal is to find too
many, rather than too few. That part is reasonably straightforward, and you
can see how well it's working visually.
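
Here's a minimal sketch of that phase using the selective-search
implementation in OpenCV's contrib module as a stand-in for the authors'
original code (the input file name is made up):

    import cv2  # needs opencv-contrib-python for the ximgproc module

    img = cv2.imread("scene.jpg")  # hypothetical input image
    ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
    ss.setBaseImage(img)
    ss.switchToSelectiveSearchFast()  # the faster, lower-recall setting
    rects = ss.process()              # array of (x, y, w, h) proposals
    print(len(rects), "candidate regions")  # typically on the order of 2000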

Phase II is described in this paper:

[http://arxiv.org/pdf/1311.2524v5.pdf](http://arxiv.org/pdf/1311.2524v5.pdf)

Each rectangular region is rescaled into a 227 x 227 pixel square.

Then they apply a convolutional neural network to each square:

"We extract a 4096-dimensional feature vector from each region proposal using
the Caffe implementation of the CNN described by Krizhevsky et al. Features
are computed by forward propagating a mean-subtracted RGB image through five
convolutional layers and two fully connected layers."

So now they have a 4K vector of values for each region. I thought CNNs had to
be trained, but what would you train this against? The 4K vector has no
explicit meaning yet.
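
Per the R-CNN paper, the answer is supervised pre-training: the network is
first trained for 1000-way ImageNet classification, and the fc7 activations
are then reused as generic features. Here's a rough sketch of the
extraction step using torchvision's pretrained AlexNet as a stand-in for
the Caffe model the paper uses (the 224 x 224 input size is torchvision's
convention, not the paper's):

    import torch
    import torchvision.models as models
    import torchvision.transforms as T

    alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
    alexnet.eval()
    # everything in the classifier except the final 1000-way scoring layer
    fc7 = torch.nn.Sequential(*list(alexnet.classifier.children())[:-1])

    preprocess = T.Compose([
        T.Resize((224, 224)),  # warp the region to the fixed input size
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])

    def region_features(pil_region):
        x = preprocess(pil_region).unsqueeze(0)        # 1 x 3 x 224 x 224
        with torch.no_grad():
            conv = alexnet.features(x)                 # five conv layers
            pooled = alexnet.avgpool(conv).flatten(1)  # 1 x 9216
            return fc7(pooled)                         # 1 x 4096 feature vector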

Phase III, in the same paper, uses a manually labeled training set and
support vector machines to classify those 4K feature vectors. The output of
this is a set of tags for each region, each with a confidence score. This
seems to be a straightforward machine learning system.
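
A minimal sketch of that stage with scikit-learn, assuming the 4096-d
features have already been extracted (the file names and the single "dog"
class are hypothetical; the paper trains one binary SVM per class):

    import numpy as np
    from sklearn.svm import LinearSVC

    X = np.load("region_fc7.npy")  # N x 4096 features (hypothetical file)
    y = np.load("is_dog.npy")      # 1 if a region shows a dog, else 0

    clf = LinearSVC(C=0.001)           # one binary SVM per object class
    clf.fit(X, y)
    scores = clf.decision_function(X)  # a confidence, not a true probability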

The new Stanford paper takes that data and, using training data from captioned
images, turns those tags into sentences. (I'm more interested in the object
recognition part than the natural language part, so it's the upstream steps
I'm trying to understand.)

I'm amazed that this works, and have no clue how one debugs such a thing. None
of this is all that big in terms of lines of code, but it's not at all clear
what will work.

There's been a lot of progress in the last few years.

~~~
Umn55
Check out Bohm's idea on the issue:

Bohm’s basic assumption is that “elementary particles are actually systems of
extremely complicated internal structure, acting essentially as amplifiers of
information contained in a quantum wave.” As a consequence, he has evolved a
new and controversial theory of the universe. A new model of reality that Bohm
calls the “Implicate Order.”

The theory of the Implicate Order contains an ultra-holistic cosmic view; it
connects everything with everything else. In principle, any individual element
could reveal “detailed information about every other element in the universe.”
The central underlying theme of Bohm’s theory is the “unbroken wholeness of
the totality of existence as an undivided flowing movement without borders.”

[http://www.scienceandnonduality.com/david-bohm-implicate-order-and-holomovement/](http://www.scienceandnonduality.com/david-bohm-implicate-order-and-holomovement/)

[http://www.fromquarkstoquasars.com/david-bohm-and-the-holographic-universe/](http://www.fromquarkstoquasars.com/david-bohm-and-the-holographic-universe/)

~~~
Animats
The above reply seems to be to the wrong article.

------
nathanathan
A browser extension that uses this to fill in missing image alt text could be
really useful for the vision impaired.

------
chuckcode
Great work! I'm a novice in the field of image recognition, but do you
think that using neural networks to represent both text and image features
in an h-dimensional embedding space could be extended to additional
features like sound, touch, or images in other spectra like infrared? For
example, could we do better image recognition if there were also sound
input from the scene, or infrared in addition to the visible spectrum?
Naively, it seems like lots of different input data could be processed and
then compared in this way. Thanks for taking the time to share your work
and results.
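
To make the question concrete, here's a toy sketch of the kind of shared
embedding space I mean, with every dimension invented for illustration:

    import torch
    import torch.nn as nn

    h = 1024                       # shared embedding size (illustrative)
    img_proj = nn.Linear(4096, h)  # e.g. CNN image features -> shared space
    snd_proj = nn.Linear(128, h)   # e.g. audio features -> shared space

    img = torch.randn(8, 4096)     # batch of image feature vectors
    snd = torch.randn(8, 128)      # batch of sound feature vectors

    sim = nn.functional.cosine_similarity(img_proj(img), snd_proj(snd))
    # train with a ranking loss so co-occurring pairs score higher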

------
aruggirello
I'm impressed - I didn't even notice the Frisbee in the first picture myself.

This one was funny though...

Human: “A green monster kite soaring in a sunny sky.”

Computer model: “A man flying through the air while riding a snowboard.”

------
taeric
I can't help but think it would be hilarious to feed some classic "photo
bomb" situations into these algorithms to see how the computer interprets
them versus a human.

------
thom
The natural language stuff is cool, but I really hope people are working on
getting structured data out of things like this. Having that coupled with
reliable recognition from smartphone-quality or even CCTV-quality images will
be a game-changer in terms of smart sensors and security.

~~~
contingencies
_...in terms of smart sensors and security._

Or perhaps, "in terms of the rapidly closing vice of state-sponsored dystopian
surveillance?" Dense, authoritarian and technocratic cities like London,
Singapore, Hong Kong and South Korea's will be some of the first to suffer
under the all-seeing eye of omnipresent computer-annotated, database-retained
vision. Corporate cubicle workers will slip further apart from the masses in
to the corporate acultural dystopias predicted by science fiction... and in
the case of those already _housed_ by their employer (eg. many in South Korean
'chebol' megacorps), who's to say residential zone behavioural profiling will
not begin in earnest?

------
Sandhya22
An important milestone in the area of image processing and Natural Language
Generation. I am really looking forward to the details of your work as it is
quite related to my research area.

------
ppymou
Out of curiosity, how well would these models work on computer-generated
graphics (say, Skyrim)?

My gut feeling says they would work, but I'm curious whether the models
need to be trained separately for those.

------
petervandijck
That is very impressive, congrats.

------
jdimov
Google Glass might finally become useful. Image -> text -> speech. Hugely
beneficial to the visually impaired. I can also see applications in
physical security and law enforcement.

