
Neural Networks That Describe Images - vpanyam
http://cs.stanford.edu/people/karpathy/deepimagesent/?hn
======
kolbe
Presentations like these make me realize how close we are to developing law
enforcement (/police state) technology that will be very effective. I figure
when the kinks are smoothed out, that we could run this on a video feed, and
have crimes prevented right as they're about to happen. It's almost scary.
Imagine 20 years from now, some guy pulls a gun on you, and a video feed
identifies his action, and immediately shoots a tranquilizer straight into his
jugular with perfect aim.

~~~
bprater
Effective and also scary. That same video feed could be archiving (for
eternity) every single citizen's movement, action, or even words:

"Citizen #5135: In 2015, spit on sidewalks 28x this year, jaywalked 49x,
etc..."

~~~
visarga
Even if this is not working now, it can still be applied later on video shot
today. So we are already en route to being judged by such an automated
guardian.

------
jcr
At the Bay Area Vision Meeting in 2013 [1], Fei-Fei Li and Olga Russakovsky
gave a related talk on, "Analysis of Large-Scale Visual Recognition" [2,3]

[1] [http://bavm2013.splashthat.com/](http://bavm2013.splashthat.com/)

[2] video:
[http://www.youtube.com/watch?v=DK6KfUsVN8w](http://www.youtube.com/watch?v=DK6KfUsVN8w)

[3] slides:
[http://bavm2013.splashthat.com/img/events/46439/assets/a10b....](http://bavm2013.splashthat.com/img/events/46439/assets/a10b.stanford.pdf)

------
Zeebrommer
I'm a bit skeptical. They have several very good examples, and those are really
impressive. Everything about how often these striking results occur is 'coming
soon' though. And since it is neural network-based, being lucky sometimes
doesn't say anything about the statistical performance. What percentage of the
test dataset was labelled correctly?

~~~
userbinator
I'd like to see the worst-case behaviour too - e.g. hilariously wrong results.
Seeing the failures (and their rate) makes it more believable and enables a
more unbiased evaluation of the capabilities of the system.

It's like those "up to XX% better" claims - "_up to_", not "_at least_"
being the key phrase here.

------
arjie
Fascinating. Also interesting to see the failure modes. Any human would
quickly realize that the "boy doing backflip on wakeboard" is actually playing
on a trampoline. Or the "two young girls playing with legos toy". Great stuff!

~~~
jameshart
Right - impressive though this is, also the cat isn't black, the woman with
the bananas might well not be a woman, the girl isn't swinging on a swing, the
construction worker is probably not in the road, and the guitarist is tuning
the guitar, not playing it. It's pretty much wrong -in detail- on every count.

The weird thing is how at a glance, it seems pretty much correct. And how some
people here are willing to look at that and think 'automated law enforcement
is clearly imminently possible'.

~~~
negamax
Wow.. talk about high expectations. This is really well done. Identifying
multiple concepts in an image and describing them to a good degree of accuracy.
And this will only get better.

~~~
jameshart
No no! Don't misunderstand me! I agree: It's _phenomenally_ impressive! It's
also still _completely useless at this level of accuracy_.

At this point, this software is as useful at describing photographs as a
disinterested teenager who is busy trying to text. "This is a picture of my
mom with some dude playing I dunno like tennis or something. Whatever."

Which is really impressive! Seriously!

But closing the gap to accuracy is really important, and it's a hard hard
problem.

------
aurelius
Neural networks are impressive only in that they are able to give any kind of
meaningful results at all. In the end, they are only a poor mimicry of real
machine intelligence, and not much better, conceptually, than plain old
nonlinear regression.

Nobody has been able to determine what the structure of a neural network
should look like for any given problem (network type, number of nodes, layers,
activation functions), how many iterations of the parameter optimization
algorithm are needed to achieve "optimal" results, and how "learning" is
actually stored in the network.

Statistical learning methods are obviously still useful, but I think the field
is still wide open for something to emerge that is closer to true machine
intelligence.

~~~
xanderjanz
Please, try implementing non-linear regression to understand images. Tell me
how it goes.

Also, 'nobody knows how learning is stored'? You very clearly have never
worked with neural nets before. Experience is stored in the form of weight
values.
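
To make "stored in the form of weight values" concrete, here is a minimal
sketch. It is a toy: a single perceptron learning logical AND, nothing like
the network in the linked work, and `train_perceptron` and `predict` are
hypothetical names made up for illustration. Everything the model "knows"
after training is two weights and a bias:

```python
# Toy illustration only: one perceptron learning logical AND.
# Integer weights and a learning rate of 1 keep the arithmetic exact.

def predict(weights, bias, x):
    """Fire (1) if the weighted sum of the inputs exceeds zero."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0

def train_perceptron(samples, epochs=20):
    """Classic perceptron rule: nudge weights toward each misclassified sample."""
    weights = [0, 0]
    bias = 0
    for _ in range(epochs):
        for x, target in samples:
            error = target - predict(weights, bias, x)
            weights = [w + error * xi for w, xi in zip(weights, x)]
            bias += error
    return weights, bias

AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_SAMPLES)

# The learned behaviour lives entirely in these numbers.
print("weights:", weights, "bias:", bias)
for x, target in AND_SAMPLES:
    print(x, "->", predict(weights, bias, x), "(expected", target, ")")
```

The printed weights and bias are the whole learned state. In a large network
the same holds, only with millions of such numbers instead of three.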

~~~
jameshart
Okay, you've got a neural net that does a really good job on identifying types
of animals in pictures. Unfortunately, whenever you show it a picture of a
horse, it says 'fish'. Everything else, it's great at - marmosets, capybaras,
dolphins, kangaroos; but it's got a complete blindspot for horses.

Where's the incorrect data stored? How can you fix it? It's in the weight
values, somewhere, but you can't go and change the weight values to fix the
horse/fish cascade without breaking everything else it knows.

Yes, we know 'where' the data is stored. But it's diffuse, not discrete, so we
can't separate it from other data.

~~~
xanderjanz
Umm, even so, you can train it on more horse photos in order to increase its
performance specifically on horses. Furthermore, you can study neuron
activation levels on said horse training data in order to reverse-engineer the
neural "ravines" which the activations settle into, and run comparison tests
on those ravines against the ravines for, say, zebras.

This is something actively being done by NN researchers. And it lets us do
things like take the low-level audio processing part of a neural net trained
on English voice data, and use it to train smarter neural nets on Portuguese
voice data than you could have without the English voice recordings.
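
The layer-reuse idea can be sketched in a few lines. This is a toy under loud
assumptions: the "pretrained" hidden layer below is hand-set to stand in for
layers learned on a source task (it is not trained here), and all names
(`FROZEN_HIDDEN`, `frozen_features`, `train_output_layer`) are hypothetical.
Only the new output layer is trained on the target task (XOR), which a single
layer could not learn from the raw inputs alone:

```python
# Toy transfer-learning sketch: reuse a frozen hidden layer,
# train only a new output layer on the target task.

# ASSUMPTION: these hidden-layer weights are hand-picked stand-ins for a layer
# that was already trained on some source task. They are never updated below.
FROZEN_HIDDEN = [
    ([2, 2], -1),  # unit fires when at least one input is on
    ([2, 2], -3),  # unit fires only when both inputs are on
]

def step(total):
    return 1 if total > 0 else 0

def frozen_features(x):
    """Run the input through the reused (frozen) layer."""
    return [step(sum(w * xi for w, xi in zip(ws, x)) + b) for ws, b in FROZEN_HIDDEN]

def predict(out_weights, out_bias, x):
    h = frozen_features(x)
    return step(sum(w * hi for w, hi in zip(out_weights, h)) + out_bias)

def train_output_layer(samples, epochs=20):
    """Perceptron rule applied to the output layer only."""
    out_weights = [0, 0]
    out_bias = 0
    for _ in range(epochs):
        for x, target in samples:
            h = frozen_features(x)
            error = target - predict(out_weights, out_bias, x)
            out_weights = [w + error * hi for w, hi in zip(out_weights, h)]
            out_bias += error
    return out_weights, out_bias

# Target task: XOR, which is not linearly separable on the raw inputs.
XOR_SAMPLES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
out_weights, out_bias = train_output_layer(XOR_SAMPLES)

for x, target in XOR_SAMPLES:
    print(x, "->", predict(out_weights, out_bias, x), "(expected", target, ")")
```

Real systems do the same thing at scale: keep the early layers from the source
model and fit (or fine-tune) the later ones on the new data.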

------
31reasons
Something doesn't seem right. If we are this good at image recognition, how
come we are still using captchas? Are we so good at image recognition that it
can identify a young girl, a bunch of bananas, a guitar, etc. just by training
on an image set of just a few thousand images? What's the catch? On one hand,
Google says their self-driving car can't understand all the situations on the
road, while this algorithm can identify so many details from an image, like a
strong AI would. Feels very strange.

~~~
vidarh
Captchas largely fight the lowest common denominator, by turning away those
who don't care enough (or don't have the knowledge) to work around them when
they can just go elsewhere, so that you can invest your human resources
fighting the more sophisticated attackers that actually target you.

They are reasonably successful because spammers have enough other targets that
not many see it as worth the extra effort (and clock cycles) to break them,
not because most of them are particularly hard to beat any more.

~~~
indymike
Breaking captchas also has an incentive that has been attractive enough to
create three entire industries -- spam, malware, and security products to
deal with spam and malware.

------
ed
This isn't really a breakthrough in object identification, as much as it is a
clever pairing of identification with (mostly existing) language systems, is
that right?

Wondering whether there's any merit to sibling comments speculating this is
the future of e.g. surveillance

~~~
xanderjanz
Well, this is doing more than just object recognition+language labeling.
Notice how it understands that the girl is 'in' the white dress. That is a lot
more complicated than identifying a white dress and a girl.

------
zvanness
This is one of the coolest things I've seen in a while. I'd guess this is
super similar to what the folks on the DeepMind team at Google are working on
now, with the overall vision of being able to classify images that have no
metadata and add them to a dynamically learning knowledge graph:

[http://www.newscientist.com/article/dn24946-google-buys-ai-f...](http://www.newscientist.com/article/dn24946-google-buys-ai-firm-deepmind-to-boost-image-search.html)

[http://en.wikipedia.org/wiki/Knowledge_Graph](http://en.wikipedia.org/wiki/Knowledge_Graph)

[http://appft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=H...](http://appft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=/netahtml/PTO/search-bool.html&r=1&f=G&l=50&co1=AND&d=PG01&s1=20140019484.PGNR.&OS=DN/20140019484&RS=DN/20140019484)

[http://appft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=H...](http://appft.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=HITOFF&p=1&u=/netahtml/PTO/search-bool.html&r=1&f=G&l=50&co1=AND&d=PG01&s1=20140019431&OS=20140019431&RS=20140019431)

------
tiler
I really appreciate that the Stanford group is always willing to post
mistakes/mislabels. Kudos.

~~~
sigterm
The face detection on her research group's front page has one glaring mislabel
([http://vision.stanford.edu/](http://vision.stanford.edu/)).

I always find that quite amusing.

------
xanderjanz
Sharing pre-built models is so cool, and definitely important to the
advancement of machine learning science. Especially when you consider how
mixing weight layers allows you to do things like understand Portuguese text
better through English text.

------
thomasahle
It would be great to have this for image search!

~~~
visarga
It will revolutionize porn search, for sure.

~~~
dmritard96
Unfortunately, I came to the same conclusion. Could be a serious buzzkill,
though, when it is wrong.

------
yihyeh
Before getting into object recognition in images, I'm curious about how
they managed to put together a perfect sentence for the caption.

Are we already at the point where an NN can arrange a perfect sentence when we
throw a bunch of words into it?

~~~
ogrisel
Yes, there is a recent trend to use Recurrent Neural Networks to model the
structure and semantics of sentences. This is used in particular for machine
translation research by people at Google and the University of Montreal:
[http://scholar.google.fr/scholar?q=rnn+lstm+machine+translat...](http://scholar.google.fr/scholar?q=rnn+lstm+machine+translation)

------
yuncun
Idk if this is a daft question, but in the Visual-Semantic Alignment section,
are those objects in the colored boxes actually being directly recognized by
the software? Or are they inputted in some other way?

~~~
tjr
From the paper: _Our core insight is that we can leverage these large
image-sentence datasets by treating the sentences as weak labels, in which
contiguous segments of words correspond to some particular, but unknown
location in the image. Our approach is to infer these alignments and use them
to learn a generative model of descriptions._

~~~
karpathy
This gets a little more detailed into the work, but compared to other papers
that have sprung up in this area recently, our paper slightly frowns on the
idea of distilling a complex image into a single short sentence description.
In that sense we are a little more ambitious and we're trying to produce
snippets of text that cover the full image with descriptions on the level of
image regions. I would call our results encouraging, but there is certainly
more work to be done here. And I think one of the limitations right now to
doing a good job is the amount of training data available to us.

------
tawan
I'm impressed but skeptical, too. Laypeople might not get an accurate
impression of what is possible. Consider the image with the guy in a black
shirt playing guitar. What if it was a picture of a naked man standing next to
a black shirt that is hanging on a clothes hook, and next to it a guitar is
mounted on a rack? My guess is that the computer will still spit out: Guy in
black shirt playing guitar, just because this sentence is very plausible in
the underlying language model.

------
fmax30
I remember doing something like this (without the neural networks, that is),
but my results were very, very bad. If I find that project on my old laptop,
I'll post it to GitHub.

------
shultays
I am pretty sure some of those examples are correct only by chance. "little
girl is eating piece of cake."? The only thing that marks her as a girl is a
hair clip; did the software really see that?

"woman is holding bunch of bananas." Hell, I (hopefully a human) would
recognize her as a male at first glance.

------
tintor
It is interesting how the neural network labeled the woman in the lower right
photo with the rectangle that includes the body only, without the head.

------
hugozap
They forgot to put the link to the npm module ;)

------
notastartup
This is unsettling and amazing. We're not too far from having robots that are
aware of what's going on around them.

------
nulldata
Fascinating that we've come this far, also:
[http://xkcd.com/1425/](http://xkcd.com/1425/)

~~~
cLeEOGPw
Well, five years have probably already passed since the comic's release, so it
isn't surprising.

~~~
kolinko
This comic is from this year, if I'm not mistaken.

