
Building a natural description of images - bpierre
https://googleresearch.blogspot.com/2014/11/a-picture-is-worth-thousand-coherent.html
======
etiam
Note that many of the errors are much more understandable if one considers
that the convolutional net's pooling destroys many of the spatial relations
in the pictures.

I imagine I might make similar errors if I only got little jumbled fragments
to work from. Given those conditions, the cat "laying on a couch" or the dog
"jumping to catch a frisbee" hardly even seem like errors to me.

This is going to get radically better when someone works out an efficient way
to keep the spatial relations.
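The point about pooling can be illustrated with a toy sketch (not the actual model from the post): a global max pool maps two images with very different spatial layouts to the same feature value, so anything downstream of the pooled features cannot tell the layouts apart.

```python
# Toy illustration: global max pooling discards position information.
def global_max_pool(feature_map):
    """Collapse a 2D map (list of rows) to a single value per map."""
    return max(max(row) for row in feature_map)

a = [[1, 0], [0, 0]]  # bright spot top-left
b = [[0, 0], [0, 1]]  # same bright spot, bottom-right

# Identical after pooling, even though the spatial layouts differ.
print(global_max_pool(a) == global_max_pool(b))  # True
```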

~~~
davmre
Geoff Hinton gave a talk last week at Berkeley on exactly this problem - in
pixel space, object identities are all tangled up with location/pose
information in a very nonlinear way; it would be nice to find a representation
that actually preserves both components while disentangling them
("equivariance") instead of just throwing away all of the spatial information
("invariance", what convnets do). He's done some work on this, a lot of which
is apparently unpublished, but gave a reference to one older paper covering
some of the ideas:
[https://www.cs.toronto.edu/~hinton/absps/transauto6.pdf](https://www.cs.toronto.edu/~hinton/absps/transauto6.pdf)
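One way to picture the "equivariance" idea (a loose sketch, not Hinton's capsules): represent a detected feature as a (what, where) pair, so that shifting the input shifts the pose component instead of throwing it away.

```python
# Illustrative sketch of an equivariant representation: identity + pose.
def detect(image):
    """Return (what, where) for the brightest pixel in a 2D image."""
    h, w = len(image), len(image[0])
    y, x = max(((r, c) for r in range(h) for c in range(w)),
               key=lambda rc: image[rc[0]][rc[1]])
    return ("bright_spot", (y, x))

a = [[1, 0], [0, 0]]
b = [[0, 0], [0, 1]]  # the same feature, translated

print(detect(a))  # ('bright_spot', (0, 0))
print(detect(b))  # ('bright_spot', (1, 1))
```

The identity component is unchanged by the translation while the pose component moves with it, which is the disentangling the comment describes.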

------
SammoJ
A relevant, very similar paper (in terms of input/output) from our resident
karpathy, with a detailed discussion in the comments:
[https://news.ycombinator.com/item?id=8621658](https://news.ycombinator.com/item?id=8621658)

------
Trufa
I am very surprised it got the color of the motorcycle wrong; it seems like
the easiest thing to detect...

~~~
Xophmeister
It may not have a sophisticated enough vocabulary to distinguish 'pink' when
'red' was close enough. This effect is manifest in human languages which
classify colours differently: say, for example, a language may have no word
for 'blue', so the sky is 'green' to its speakers; it's still perceptually
different to them, of course, but the lack of fidelity means it can't be
communicated better than "sky green" or "grass green".

~~~
Trufa
I understand there are a lot of subtleties, but it's still surprising to me
that it can recognize a parked motorcycle from an awkward angle but can't
distinguish normal English pink from red.

Just to be clear, I don't mean it as a criticism; it just seems to be the
easier part.

~~~
ajuc
Most probably there was no pink motorcycle in the training dataset, and the
neural net may have failed to generalize colors between objects.

A similar error: a yellow passenger car is described as a yellow school bus;
school buses are more commonly yellow.

------
rspeer
This is interesting. I think that natural language generation is a largely
overlooked task outside of machine translation -- perhaps because most tasks
that might require it can get away with the much stupider, much easier job of
filling in templates like a form letter. It's cool to see Google attempting
the real thing, on top of the image recognition.
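The "template filling" approach being contrasted with real generation can be sketched like this (the slot names and template are invented for illustration, not taken from any actual system):

```python
# Hypothetical form-letter-style caption template.
TEMPLATE = "A photo of {count} {color} {noun}(s) {location}."

def fill_template(count, color, noun, location):
    """Fill fixed slots; no grammar, agreement, or word choice involved."""
    return TEMPLATE.format(count=count, color=color,
                           noun=noun, location=location)

print(fill_template(2, "red", "motorcycle", "parked on a street"))
# A photo of 2 red motorcycle(s) parked on a street.
```

The "(s)" hack shows why this is the stupider, easier job: the template sidesteps everything (pluralization, agreement, fluency) that genuine language generation has to solve.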

That said, I don't expect particularly high accuracy from the composition of
an image recognition system and natural language generation. The first actual
demo of this is going to be a source of utter hilarity. I hope they're okay
with that.

------
teddyh
Duplicate of
[https://news.ycombinator.com/item?id=8623095](https://news.ycombinator.com/item?id=8623095)

~~~
Bjoern
It's the other way around: the link you gave is actually a duplicate of this
post.

~~~
bennetthi
Here is the HN post from yesterday that also points to the NYTimes article:
[https://news.ycombinator.com/item?id=8621658](https://news.ycombinator.com/item?id=8621658)

