
Text to Image Synthesis Using Thought Vectors - piyush8311
https://github.com/paarthneekhara/text-to-image
======
gwern
It's a little tricky getting this to work because you need two separate models
working together, but I tried it out. Here are some of the samples I generated:

[https://imgur.com/Uwp1wfu](https://imgur.com/Uwp1wfu)

[https://imgur.com/yuW9Yre](https://imgur.com/yuW9Yre)

[https://imgur.com/oZ4wzdC](https://imgur.com/oZ4wzdC) some definite
weaknesses in the natural language embedding

[https://imgur.com/MAupphr](https://imgur.com/MAupphr) roses in general don't
seem to work well; there must not have been many in the dataset

You can see that it works better than one would expect, but there are
definitely limits to the understanding. The flower and COCO datasets are,
ultimately, not that big. What would be exciting is if you could train it on
some extremely large and well-annotated dataset like Danbooru.

~~~
paarthn
One possible improvement would be training the text embeddings along with the
entire model (instead of using pretrained embeddings like skip-thought
vectors). It's on my to-do list; I'll try it out.
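
A minimal sketch of what that could look like, in PyTorch rather than the
repo's TensorFlow, with all names hypothetical: a small recurrent encoder
produces the caption embedding, and its parameters go into the same optimizer
as the generator's, so the GAN loss shapes the embedding space end-to-end.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Learned caption encoder; would replace the frozen skip-thought vectors."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer-coded caption
        _, h = self.rnn(self.embed(token_ids))
        return h.squeeze(0)  # (batch, hidden_dim) caption embedding

encoder = TextEncoder(vocab_size=5000)
# Joint training: one optimizer over generator *and* encoder parameters, so
# GAN-loss gradients reach the embeddings (the conditional generator itself
# is omitted here):
# opt_g = torch.optim.Adam(
#     list(generator.parameters()) + list(encoder.parameters()), lr=2e-4)
```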

------
radarsat1
I think the idea is interesting but I'm not convinced it really "synthesizes
ideas" so much as treats the neural network like a database of images that it
mixes.

Now, I could be wrong, but the way the results are presented doesn't show me
that it's any good at picking up the meaning of the phrase. The results show a
single phrase and a set of images it generates: "white flower with yellow
center", and a bunch of images of white flowers.

But if it can synthesize the idea properly, one should be able to generate
flowers from a variety of descriptions: yellow flower with blue center, red
flower with yellow center, blue flower with black edges and black center,
etc.

From the way they describe the functionality, it should be able to do these
things, so in a way I don't doubt it, but I want to see how it performs on
phrases that induce combinations of ideas well outside the training set, yet
refer to individual ideas within the training set.
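
One cheap way to run that probe, assuming you can feed arbitrary captions to
the sampling script: enumerate attribute combinations that never co-occur in
the training captions and generate from each. A sketch (the caption template
is made up, not from the dataset):

```python
from itertools import product

# Captions recombining attributes the model only ever saw separately.
colors = ["red", "yellow", "blue", "white", "black"]
captions = [
    f"a {petal} flower with a {center} center"
    for petal, center in product(colors, repeat=2)
    if petal != center
]

for caption in captions:
    print(caption)  # feed each one to the model's caption-to-image sampler
```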

~~~
taneq
How do you "synthesize ideas" if not by combining parts of your own personal
database of images/concepts? Even in your example of "X flower with Y center
[and Z features]" you start with a mental picture of a flower you've seen (or
a generalisation from many flowers you've seen) and then modify it with your
mental picture of colours X and Y and features Z.

~~~
jcannell
>How do you "synthesize ideas" if not by combining parts of your own personal
database of images/concepts?

Procedural generation can be far more complex than just linear blending, which
is all that a shallow net can do. For example, consider the full generative
process that creates a frame of the game No Man's Sky. It is enormously more
complex than a simple shallow net doing linear blends of previous examples:
many, many nonlinear processing steps to go from a small random seed, through
intermediate databases for terrain and objects, and finally down to pixels.

If you look at the actual net design used here, it's only a few layers deep,
and not very big: much, much closer to 'linear blending' than to what our
brains do (which is presumably vaguely closer to what No Man's Sky does).
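
For a sense of scale, the generator in this family of models is roughly a
DCGAN-style stack: a handful of transposed convolutions going straight from a
seed vector to 64x64 pixels. A rough PyTorch sketch, with illustrative
hyperparameters rather than the repo's:

```python
import torch.nn as nn

# Five nonlinear upsampling steps from a 100-d seed to a 64x64 RGB image.
generator = nn.Sequential(
    nn.ConvTranspose2d(100, 512, 4, 1, 0), nn.BatchNorm2d(512), nn.ReLU(),  # -> 4x4
    nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.BatchNorm2d(256), nn.ReLU(),  # -> 8x8
    nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.BatchNorm2d(128), nn.ReLU(),  # -> 16x16
    nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.BatchNorm2d(64), nn.ReLU(),    # -> 32x32
    nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),                          # -> 64x64
)
# Compare: a game engine's procedural pipeline runs vastly more nonlinear
# stages (noise octaves, terrain meshing, object placement, lighting).
```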

------
failrate
This is lovely. As a lazy programmer, I would appreciate this as a web
service. Instead of googling for an image to steal as placeholder art, I could
request a uniquely generated image.

~~~
AJ007
I think few have grasped how much of future output will be machine-generated
from existing work, and that, rather than violating copyright, it will almost
be a necessity for ensuring that a copyright somewhere is not broken.

~~~
failrate
That's an interesting angle that I hadn't considered. I was discussing
overfitting with a coworker who is into machine learning: in a flawed
implementation, the output could perfectly match the input.

~~~
iaw
I'm literally working on this now as a side project. If the avenue I'm
exploring is successful, the concept of hiring artists will be completely
changed.

------
Y_Y
I can see this being useful for police sketch artists.

~~~
paarthn
There is still a long way to go before it can do that. The model currently
generates 64x64 pictures and is trained on a very specific flower image
dataset. Nevertheless, it would be a great idea to experiment with such a
dataset (of sketches and descriptions) if one were available.

~~~
taneq
But the thing about machine learning is that once it works at all, "a long way
to go" generally means "add more training data" rather than "we require
significant conceptual breakthroughs".

~~~
jcannell
That is not generally true. For example, GANs work great on MNIST, pretty well
on the flower dataset, and ok on bedrooms.

But the same techniques currently fail on ImageNet - which actually is a much
larger dataset. "Add more training data" is not a magic solution that
overcomes limitations of your model.

In particular, if you look at the generative models these GANs map to, it
makes sense that they can learn 2D shapes and texture patterns, but rendering
a complex 3D scene with significant depth complexity and lighting interactions
is an entirely different beast. That problem has been studied deeply in 3D
computer graphics, and the generative programs successful there are vastly
more complex than current GANs.

~~~
hughperkins
Seen this?
[https://arxiv.org/abs/1605.09304](https://arxiv.org/abs/1605.09304) Awesome
generated images, from ImageNet.

------
viach
It would be cool to implement text to pizza image synthesis.

~~~
taneq
Hey, if you're gonna go in that direction why not just implement text-to-pizza
synthesis where you say "I want a mushroom and jalapeno pizza with sundried
tomatoes" and then it makes one for you.

------
ash9r
What GPU was used to train this model?

~~~
paarthn
I trained it on an AWS instance with a GRID K520 GPU.

