
Show and Tell: Image captioning open sourced in TensorFlow - runesoerensen
https://research.googleblog.com/2016/09/show-and-tell-image-captioning-open.html
======
bahro
Why on earth would they distribute this without a trained model? Several weeks
of training time on a multi-thousand dollar piece of specialized hardware are
required to actually run this.

Google clearly has many different trained versions of this network sitting
around. It must have been a conscious decision not to release them. Is the
point to artificially create a barrier to entry for hobbyists that might want
to apply this research? If so, why bother releasing it at all? I'm really
scratching my head here.

~~~
Kabukks
I feel the same. Maybe we should pool some money to train it on AWS. Is there
a community where Machine Learning hobbyists can pool money to train models
that are open sourced afterwards?

~~~
bahro
I haven't heard of one, but this is a good idea!

------
visarga
This network shows how it is possible to represent meaning in a vector of
100-600 real numbers: first mapping images to vectors, then vectors to text.
Philosophers have always wondered about the nature of thought, but this
representation model creates, for the first time, the ability to work with
sensorially grounded abstract concepts in AI. It's not so mysterious after
all: it is possible to merge multiple sensory modalities in a common meaning
space.

In order to get full AI we need to add behavior and embodiment to these
meaning vectors. They need to be trained by reinforcement learning, to learn
the behavior that maximizes rewards. Meaning vectors are just a small part of
the final system, equivalent to our ability to see and speak. The most
difficult part is that of learning behavior.
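
The image-to-vector-to-text pipeline described above can be sketched as a toy
shared embedding space. Everything here is an illustrative assumption (random
linear encoders standing in for im2txt's CNN and LSTM, made-up feature sizes),
not the released architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

DIM = 256  # shared "meaning vector" size, in the 100-600 range mentioned

# Hypothetical linear encoders standing in for the real CNN and RNN.
W_image = rng.standard_normal((DIM, 2048)) * 0.01   # image features -> meaning
W_text = rng.standard_normal((DIM, 10000)) * 0.01   # bag-of-words -> meaning

def embed(x, W):
    v = W @ x
    return v / np.linalg.norm(v)  # unit-length meaning vector

image_features = rng.standard_normal(2048)   # e.g. CNN activations
text_counts = rng.standard_normal(10000)     # e.g. caption word counts

v_img = embed(image_features, W_image)
v_txt = embed(text_counts, W_text)

# In a trained model, a matching image/caption pair would score high here;
# with random weights the cosine similarity is near zero.
similarity = float(v_img @ v_txt)
print(v_img.shape, similarity)
```

The point of the sketch is only that once both modalities land in the same
vector space, comparing an image to a sentence is a single dot product.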

~~~
bbctol
I'd be much more hesitant to talk about "meaning" in this context. Are we
working with abstract concepts? We're working with images and text, and
creating connections between the two. It may be that this approach, ramped up
in processing power and complexity, can completely mimic a human's response to
images; we may also hit a wall where new techniques are needed to address what
you'd call "meaning."

~~~
visarga
Yes, could be. Word vectors and other kinds of embeddings seem promising, a
little too good to be true. There might be a glass ceiling we're not seeing
yet.

------
tkinom
Wondering if one can train TensorFlow to catch bugs in source code.

Train it on github.com's commits and logs, auto-learn what software bugs look
like, and scan for new ones....

~~~
visarga
There is something like this:

Automatic Patch Generation by Learning Correct Code

[https://people.csail.mit.edu/rinard/paper/popl16.pdf](https://people.csail.mit.edu/rinard/paper/popl16.pdf)

------
angerbot
Very cool. I've been toying with the idea of using something like this or
perhaps the cloud vision API to automatically generate image captions for
screen readers (e.g. through a browser extension) but the cost to run
something like an EC2 GPU unit is prohibitive for a project like that which I
wouldn't want to charge for.

Running it locally on the user's machine would take far too long to train,
especially as you would have to use the CPU in the majority of cases since
many people don't have a separate GPU.

~~~
GrantS
While you would never do this kind of training on your user's machines (which
takes multiple weeks even with a powerful GPU), you should be able to apply
the trained model to a single photo nearly instantaneously. So the real
roadblock is mostly that they don't appear to have included a completely
pre-trained model with this release, and it will take you as a developer a lot
of GPU time to train one. But your users would not necessarily have a problem
captioning images on their machines.
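
The training/inference asymmetry above can be put in back-of-envelope numbers.
Every figure below is an assumption for illustration, not im2txt's real cost:

```python
# Back-of-envelope sketch of why per-image inference is cheap relative
# to training. All numbers are illustrative assumptions.
params = 10_000_000            # weights in a mid-sized captioning model
flops_per_image = 2 * params   # ~2 FLOPs per weight for one forward pass

dataset = 600_000              # MSCOCO-scale count of captioned images
epochs = 100                   # assumed passes over the data

# Training needs a forward + backward pass (~3x a forward pass)
# for every image in every epoch; inference is one forward pass.
training_flops = dataset * epochs * 3 * flops_per_image
inference_flops = flops_per_image

print(training_flops // inference_flops)  # ratio of total work
```

Under these assumptions training does on the order of a hundred million times
the work of captioning one photo, which is why the weeks of GPU time fall on
the developer, not the end user.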

~~~
angerbot
I hadn't considered that (this is really out of my depth). Any idea what the
actual size of a trained model would be to distribute? Taking 150 GB on the
user's hard drive is probably out as well.

~~~
dharma1
Depends on the model and dataset. Inception v3 trained on ImageNet is about
150 MB, but you can quantise the weights to 8-bit and prune the model much
smaller without affecting performance much.
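
A minimal sketch of the 8-bit quantisation mentioned, using only numpy and a
single per-tensor scale/offset (real toolchains are more sophisticated, e.g.
per-channel scales), just to show the roughly 4x storage saving:

```python
import numpy as np

rng = np.random.default_rng(1)
weights = rng.standard_normal(1000).astype(np.float32)  # stand-in weights

# Map the float32 range onto 256 integer levels.
lo, hi = weights.min(), weights.max()
scale = (hi - lo) / 255.0
quantised = np.round((weights - lo) / scale).astype(np.uint8)

# Dequantise at inference time.
restored = quantised.astype(np.float32) * scale + lo

max_err = float(np.abs(weights - restored).max())
print(quantised.nbytes, weights.nbytes, max_err)
```

Each weight shrinks from 4 bytes to 1, and the worst-case rounding error is
bounded by half a quantisation step, which is why accuracy barely moves.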

------
infinitone
It would be much more useful if they released a pretrained model. Most people
don't have the hardware required to train their own, unless they're willing to
wait months.

------
mungoman2
This is super awesome! It would be nice if a pretrained model was also
available to be able to play with this without spending weeks of training.

------
Omnipresent
Would it be possible to run this on a MBP, or would it require significantly
more computing power?

~~~
danialtz
From the note on their GitHub page [1]:

> The time required to train the Show and Tell model depends on your specific
> hardware and computational capacity. In this guide we assume you will be
> running training on a single machine with a GPU. In our experience on an
> NVIDIA Tesla K20m GPU the initial training phase takes 1-2 weeks. The second
> training phase may take several additional weeks to achieve peak performance
> (but you can stop this phase early and still get reasonable results).

> It is possible to achieve a speed-up by implementing distributed training
> across a cluster of machines with GPUs, but that is not covered in this
> guide.

> Whilst it is possible to run this code on a CPU, beware that this may be
> approximately 10 times slower.

So, I assume it will take a veeery long time to train it on a MBP, unless they
publish their pre-trained model.

[1]
[https://github.com/tensorflow/models/tree/master/im2txt#a-note-on-hardware-and-training-time](https://github.com/tensorflow/models/tree/master/im2txt#a-note-on-hardware-and-training-time)

------
georgehm
If someone has the model running already, could you please share the captions
generated for the images on page 25 of
[http://cims.nyu.edu/~brenden/1604.00289v2.pdf](http://cims.nyu.edu/~brenden/1604.00289v2.pdf)?

------
NasKe
I wonder if, as someone learning ML, I should try to train and run it on my
PC.

