Text to Image Generation (github.com/lucidrains)
467 points by indigodaddy 19 days ago | 88 comments

"The Notorious B.I.G. raps H.P. Lovecraft's Nemesis with AI"

Oh man, this video is crazy.. awesome.. crazy - not sure. The images are creepy, creepy as hell.

I've dabbled in trying to make something more... substantial, with these techniques.


This is probably the best Notorious B.I.G. song that has ever existed.

For anyone who isn't familiar, DALL-E is the state of the art for text-to-image generation. It's closed source, but it's astonishing: https://openai.com/blog/dall-e/

I have a working repo for that


It just needs to be trained

Hey Phil, not quite on topic but I've used some of your implementations (both for hobby projects and for work) and the conciseness and clarity of your code is a delight. Many thanks!

Thanks for the kind words and glad you found them useful :)

I love ur transformer repos, its always joy to re-implement stuff from ur code its easy to read and clean :D

Thanks :D

This is the first time I've seen new tech and felt a twinge of fear instead of thinking "oh cool".

They already had a small drone war. Azerbaijan vs Armenia. Armenia had 1990s soviet stuff. Azerbaijan had recently purchased drone tech. Armenia got totally destroyed.


These are previews of future wars, much like the wars in the decades before WW1 previewed the utter slaughter that was to come thanks to the new technological era of mechanized warfare.

I don't see how this is scarier than the nuclear bombs that already exist.

Less collateral damage.

Much more accessible to regular people.

Ability to be anonymous and hide in a crowd of drones.

yeah exactly

> It's closed source, but it's astonishing

Really? Their name says OpenAI.

Anyway, how do we know their examples were not cherry-picked? Do they have an online demo?

OpenAI has a reputation and a history for pushing compute forward (AI in Dota2, GPT, CLIP, DALL-E, etc.).

They have everything to lose by lying. If they say that these examples are not cherry-picked, then we have no reason (a priori) to doubt them.

On a side-note, the fact that you could doubt the results are real is telling: each of their compute-heavy experiments shakes our belief and further reinforces their reputation.


The singularity approaches.

Small pet peeve of mine: The package name is `deep-daze` but the executable is called `imagine`.

This bothers me. More than it should.

ImageMagick used to do the same thing, but they judiciously renamed their `convert` executable to `magick`. Still not perfect, but an improvement.

If your package introduces one command-line executable, it should always be called the same as your package.

This was a big pet peeve of mine early in my Linux usage as well (still is, really). Especially since it is not exactly straightforward to figure out what binaries get installed when you install a package. It almost feels like there should be some namespacing like pkg_name->cmd_name so you can at least tab-complete package binaries easily.

Arch makes it pretty easy to recall:

    # Who owns binary?
    pacman -Qo /bin/ls

    # What files come with package?
    pacman -Ql coreutils | grep ......

Yes, Debian has the same types of things, but maybe spread across a few commands. I actually just prototyped what I am thinking of here, if anyone is interested: https://github.com/seiferteric/pkg_namespace

It's just `apt-file search /path/to/file` actually. I think there's a dpkg command that is limited to installed packages as well.
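
For Python packages specifically (which is how `deep-daze` is installed via pip), the standard library can list the commands a distribution registers. A minimal sketch, assuming Python 3.8+ and that the package declares its executables as `console_scripts` entry points; the helper name `console_scripts` is made up for illustration:

```python
from importlib import metadata


def console_scripts(dist_name):
    """Return (command, target) pairs that an installed distribution registers.

    Returns an empty list when the distribution is not installed.
    """
    try:
        dist = metadata.distribution(dist_name)
    except metadata.PackageNotFoundError:
        return []
    return [
        (ep.name, ep.value)
        for ep in dist.entry_points
        if ep.group == "console_scripts"
    ]


if __name__ == "__main__":
    # "deep-daze" is the package discussed in this thread; any installed name works.
    for command, target in console_scripts("deep-daze"):
        print(f"{command} -> {target}")
```

This only covers pip-installed Python packages; for system packages the pacman and apt-file commands above are the right tools.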

I disagree with that. The primary command-line executable provided by systemd is systemctl. This is an apt name for something that interfaces with and controls the system daemon, which is an executable it also provides, but not one that is intended to be invoked directly from the command line.

ImageMagick is also far more than the command-line utility that it provides to interface with its library. Say that the utility had been written by a third party, but did the same thing using the ImageMagick library as a dependency; would it then be fine for it to have a different name?

Ryan Murdock/advadnoun here: Glad to see some interest in the project!

I've written a few follow-ups to this, with some public notebooks as well that produce qualitatively different results. If anyone is interested, it should be pretty easy to find the BigSleep method, which steers BigGAN in a very similar way to this, as well as the Aleph notebooks, which use the DALL-E decoder or Taming Transformers VQGAN to generate images with CLIP, depending on the version of the notebook.

One thing that has fascinated me about this is the possibility of translation across dissimilar categories.

What I mean is we have style transfer in the domain of text, and we can style transfer with images. And we can generate images from text. Can we style transfer from image to text or vice versa? Can prose be rewritten in a manner that, in some sense, adheres to aesthetic principles of impressionist painting?

Presumably there would be some kind of informational representation of text style discernable to an image generation system. And just like an artistic style can be extracted from a painting and transposed to a photograph, perhaps an interpretation of textual style could be applied to a photograph despite them being different mediums.

What would that even look like? I don't know, but I find the possibilities fascinating.

The temptation, I think, is to make a first pass at answering this question in a frustrating, cartoonishly shallow way. And I think systems will possibly be developed that just go ahead and do it before people are culturally ready to understand it in a non-frivolous way. Everyone needs to get those reactions out of their system, I guess, but there's a more nuanced possibility here that might allow clashing of dissimilar categories in ways never previously contemplated.

Nice! This points me back to my favorite mental model of machine learning/NNs. It’s always about shuffling in and out of a number of dimensions and the mappings between them.

When you make this please name it SynesthesAI

How does copyright work in those cases? For example, if you train your model using copyrighted images, wouldn't the result be a derivative of the works used in the dataset? If the result comes from a "sum" of different images, how can you calculate the split for royalties? Is it possible to "reverse engineer" the result to see which data points contributed to the final result, and in what proportion?

It is a grey area. There was a huge drama in the furry world some time ago about this: https://www.reddit.com/r/HobbyDrama/comments/gfam2y/furries_...

For some reason Colab let me run this all last night. Wow just wow. Eventually I got the feeling that it matters if I'm watching.

Each of these links has about 9 images and the prompts that made them. Sometimes the image does not look like an animal right off the bat, then it seems like you asked for something the network had to say.

https://postimg.cc/vDvYdBhC https://postimg.cc/XpHnpzT8 https://postimg.cc/HVspWgPn

That Syd Mead style ... mwa

How long did it take for you.. I just did one text to image and Colab was churning away for a few hours but then choked after maybe 300ish image iterations saying something about some kind of space limit being reached..

Were you able to complete the 1050 iterations, and how long did it take?

Not the OP, but some tips:

- the model usually locks in within 200-300 iterations, so if you don’t like the result by then, retry

- in fact, you can tell if the model is off to a good start within 25-50 iterations and I encourage you to cherry-pick runs early and often; don’t be afraid to restart

- time to render depends on which GPU you get from colab, but I usually run the renders for 10 minutes a pop. About 1-2 minutes if I run them on a 3090 locally

- the prompt plays a big role in the quality of the result; “A painting of a dog playing fetch” will usually turn out better than “dog playing fetch”

- lucidrains/bigsleep produces better results generally than lucidrains/deepdaze (this is my subjective preference)

- the colabs linked to from the big-sleep GitHub repo produce poorer results than running them as a python package locally (this one might honestly be placebo)
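
The "restart early" tips above can be sketched as a small selection loop. This is an illustration only: `ToyRun`, `step`, and the toy score are hypothetical stand-ins, not part of deep-daze or big-sleep, and in practice the "score" is usually a human glancing at the preview image after a few dozen iterations.

```python
import random


class ToyRun:
    """Hypothetical stand-in for one render: its score creeps toward a seed-dependent ceiling."""

    def __init__(self, seed):
        rng = random.Random(seed)
        self.ceiling = rng.uniform(0.2, 0.9)  # how good this seed can ever get
        self.score = 0.0

    def step(self):
        # Move 5% of the remaining distance toward the ceiling per iteration.
        self.score += 0.05 * (self.ceiling - self.score)
        return self.score


def pick_promising_seed(make_run, seeds, probe_steps=50):
    """Run each seed for a short probe and return (best_seed, best_score)."""
    best_seed, best_score = None, float("-inf")
    for seed in seeds:
        run = make_run(seed)
        score = 0.0
        for _ in range(probe_steps):
            score = run.step()
        if score > best_score:
            best_seed, best_score = seed, score
    return best_seed, best_score


if __name__ == "__main__":
    seed, score = pick_promising_seed(ToyRun, seeds=range(8))
    print(f"continue with seed {seed} (probe score {score:.3f})")
```

The point of the design is that a 50-iteration probe is cheap compared to a full render, so trying several seeds and keeping only the best start costs little.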

> - the prompt plays a big role in the quality of the result; “A painting of a dog playing fetch” will usually turn out better than “dog playing fetch”

However, it can get taken very literally, in that you might get a picture that features a frame around the painting.


Another thing that comes to mind as a corollary is that the AI seems to like being constrained in its outputs. So adding something like “in the style of Monet” to the end of the prompt will return much more coherent results.

True. For the article, I experimented with a number of 'in the style of' prompts. Where there's a distinctive visual iconography with strong key features, BigSleep [1] does an amazing job of abstracting and reproducing that style. Besides artists, it also does very well with iconic movies like Blade Runner and The Matrix.

[1] The Deep-Daze author's most popular T2I mashup: https://github.com/lucidrains/big-sleep


Very helpful feedback, thanks very much!

Wow! Fascinating images. The “Chinese ink painting made of poetry” one is particularly interesting and beautiful.

If anyone is interested in trying this without Google Colab, I have a site that takes a text prompt and renders it for you: https://dank.xyz

One of the models is lucidrains’ implementation of the excellent Big Sleep model by Ryan Murdock. The other models are mostly based on the work of Federico Galatolo.

The queue is temporarily paused as I’m upgrading the hardware to a better GPU, but I encourage you to browse the existing renders or submit your own for when the server is back online.

The gallery is truly valuable as a reference - thank you.

Wow, I submitted one, then scrolled down the queue ... there are a /lot/ of porn requests, and some ... rather disturbing ones as well!

This uses CLIP to optimize a GAN's input to generate an output matching a text description. Optimization is very slow, it's basically the same process as training. DALL-E uses a feedforward network to directly predict an image from text. But that model hasn't been published yet.
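
To make the distinction concrete, here is a toy sketch of the "optimize the input" loop. Everything in it is an assumption for illustration: a fixed random linear map `W` stands in for the generator followed by CLIP's image encoder, `text_emb` stands in for the CLIP text embedding, and plain gradient ascent on cosine similarity replaces the real optimizer. It does not use the actual CLIP or GAN code.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def embed(W, z):
    """Toy 'generate an image, then encode it': a single linear map."""
    return [dot(row, z) for row in W]


def cosine_grad_wrt_z(W, z, text_emb):
    """Gradient of cosine(embed(W, z), text_emb) with respect to z."""
    u = embed(W, z)
    nu, nt = math.sqrt(dot(u, u)), math.sqrt(dot(text_emb, text_emb))
    ut = dot(u, text_emb)
    # d cos / d u = t / (|u||t|) - (u.t) u / (|u|^3 |t|)
    g_u = [t / (nu * nt) - ut * ui / (nu ** 3 * nt) for ui, t in zip(u, text_emb)]
    # Chain rule through u = W z: d cos / d z = W^T g_u
    return [sum(W[i][j] * g_u[i] for i in range(len(W))) for j in range(len(z))]


def optimize_latent(W, text_emb, z, steps=300, lr=0.05):
    """Gradient ascent on the latent z; the 'model' W is never updated."""
    for _ in range(steps):
        grad = cosine_grad_wrt_z(W, z, text_emb)
        z = [zi + lr * gi for zi, gi in zip(z, grad)]
    return z


if __name__ == "__main__":
    rng = random.Random(0)
    W = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]
    text_emb = [rng.gauss(0, 1) for _ in range(8)]
    z0 = [rng.gauss(0, 1) for _ in range(16)]
    z1 = optimize_latent(W, text_emb, z0)
    print("similarity before:", round(cosine(embed(W, z0), text_emb), 3))
    print("similarity after: ", round(cosine(embed(W, z1), text_emb), 3))
```

Every new prompt needs its own loop like this, which is why the approach is slow compared to a single forward pass through a feedforward text-to-image model.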

I think it'd be neat to take an old text only terminal game, and just auto generate the pictures based off responses.

Then I might try Zork....

Honestly, I probably wouldn't, I don't want my mental vagaries of what a Grue looks like to be set by some random image.

This is not really related but if you want to play Zork over Gemini I set up a whole gemini server and domain just for that. gemini://zork.club

.club was on sale for $1.17

Domain registrars will often make the initial registration of a domain cheaper than the renewal costs. Beware the renewal costs, or be prepared to rename to a new domain.

But how about making lovecraftian horrors come into being!

I tried making a grue with CLIP+FFT and got 1. a lot of construction cranes 2. some kind of bat thing.


An interesting thing about CLIP is that when it doesn't know what something looks like, it instead generates pictures with the search text in them. That's why it confuses "an iPod" with "a piece of paper with iPod written on it".

I ran some Lovecraft through Story2Hallucination[1] which uses Big Sleep to make videos from text.

The results were quite something - https://m.imgur.com/tfWLsSR

[1] https://github.com/lots-of-things/Story2Hallucination

Maybe I should try hooking it up to Nethack!

> AssertionError: CUDA must be available in order to use Deep Daze

And here I was hoping to generate extensive phallic imagery on this most auspicious of nights.

This “CUDA” is Nvidia only if memory serves, correct?

Yes, but you can engage in your phallic pursuits, for free, using Google Colab.

Yeah, got the same thing and remembered reading "This will require that you have an Nvidia GPU". SOL, but still pretty cool.
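
A quick way to check up front whether a local run can work; a minimal sketch, assuming PyTorch is the framework in use (`torch.cuda.is_available()` is PyTorch's own call; the wrapper name `cuda_available` is made up here):

```python
def cuda_available():
    """Return True if PyTorch is installed and can see a CUDA device, else False."""
    try:
        import torch
    except ImportError:
        # No PyTorch at all, so no CUDA-backed run is possible from this environment.
        return False
    return torch.cuda.is_available()


if __name__ == "__main__":
    if cuda_available():
        print("CUDA device found; local runs should work.")
    else:
        print("No usable CUDA device; consider a hosted GPU such as Google Colab.")
```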

I wrote a piece on this, including a chat with the creator of this notebook/GitHub (also commenting in this thread), a couple of weeks ago: https://rossdawson.com/futurist/implications-of-ai/future-of...

There are many similar apps and projects. Many of them are Google Colab notebooks, which can be used for free via a web browser. See "List of sites/programs/projects that use OpenAI's CLIP neural network for steering image/video creation to match a text description" at https://www.reddit.com/r/MachineLearning/comments/ldc6oc/p_l...

Disclosure: I am the author of this list.

For a non-neural AI take on text-to-scene generation (statistical parsing, symbolic representation, rule-based, 3d models with inverse kinematics), check out Bob Coyne's WordsEye:




I would be interested in reading an explanation of how this works, which I can't see on the repo.

I'm familiar with SIRENs and CLIP, but not immediately obvious how the two are utilised here.

Is it just me, or are the examples just not that good?

The quality of the image really depends on the quality of the prompt, and a LOT of cherry picking.

I find that big sleep is also a better model than the one linked here (deep daze), generally.

I’ve generated several hundred images myself and found a few real treasures. Here’s a few of my personal favourites:

“A painting of a murder in the style of Monet” [0]

“A photo of fellas in Paris” [1]

“A painting of Thanos wearing the Infinity Gauntlet in the style of Rembrandt” [2]

I definitely agree that in the general case the examples are underwhelming, but I believe there is a lot of potential here. Personally I’m super excited to unlock the potential of human-guided, AI-assisted creative tooling. Some Colab notebooks let you actively explore the latent space of a model to direct the results where you want them to go. As the generate-adjust feedback loop gets tighter we’re gonna see some crazy things.

[0]: https://www.reddit.com/r/MediaSynthesis/comments/l4hbkl/text...

[1]: https://www.reddit.com/r/MediaSynthesis/comments/l4eg64/text...

[2]: https://www.reddit.com/r/deepdream/comments/l4hq22/texttoima...

It would be interesting to assemble frames from a movie, say at scene changes, etc. and have this thing "narrate" a movie. It would be similar in concept (though not in content) to how narration for the visually impaired is offered as a form of assisted media viewing.

In fact, you could probably train it with existing visual descriptions from movies.

> Install
> $ pip install deep-daze
> Examples
> $ imagine "a house in the forest"
> That's it.

if only! just spent an hour on fresh installs of debian 10 and ubuntu 20.04 with python 2.7 and 3.6 alternatively - not having it

I understand it has a lot of required packages but please, write a bloody install guide

This is amazing. I highly recommend submitting something via the Google Colab notebook and actually seeing how the code generates the images over time... I'm currently waiting and watching "a lizard king wielding a sword" form, and the actual formation of the image is really interesting as well.

If anyone is interested, here's the lizard king progression: https://twitter.com/Wojciech/status/1376371663542087682

Computer hallucinating words. Reminds me of teratoma, a tumor that can develop hair, teeth, muscles and bone.

> This is just a teaser. We will be able to generate images, sound, anything at will, with natural language. The holodeck is about to become real in our lifetimes.

Does anyone have any similar resources for other forms of media generated via natural language inputs?

Kudos on making an accessible notebook to play around with this. I currently have one running 'baby jumps over a house' on my machine. It's at iteration 50 and it's starting to take shape, literally. Will post here if it turns out to be decent.

I wrote a script a while back that pairs clip art to class names (ex. MicrophoneRecorder -> pictures of a microphone and tape recorder). The goal was to add a visual component to naming your abstractions.

Definitely going to update with this!

Is there a way to get it to use my GPU more aggressively? I have a 3090, and right now it's using ~5% of its capacity according to Task Manager, looking at the GPU usage of the browser window it's running in.

"super crab" ... it made a shrine with figurines and a logo https://postimg.cc/5jhgGtCv

Going to predict that someone will find this out and sell some generated text-to-images as NFTs and hype it all over Twitter.

That will then cause another wave of copycats selling generated text-to-images as NFTs.

"the answer to life the universe and everything" https://postimg.cc/w3g4TMvp

"minimalist pumpkin" https://postimg.cc/1fvHzzq8

Remarkable! It's not quite there but it can only get better.

This is why I like HN on a Monday morning.

This looks like something I’ll be addicted to.

I wouldn’t have thought that was possible. Very interesting.

And we said that an AI couldn't produce art..

Has anyone tried to sell their work as NFT?

I find the titles much more “artsy” by reading the one above. Can easily picture a couple of them in a gallery

arg, needs cuda

What is the point of this work? Other than that it is cool.

Despite the downvotes, I'm still curious about the problem that this solves.

I guess it is supposed to help us find tumors and whatnot, synthesize images rather than rely on an artist to do it, etc. But if you listen to a lot of ML talks they sometimes say, "we trained this on images but the technique works on any 2D or 1D signal" so it's applicable to signal problems in general.

What those are, I have no idea.

This is incredible... The first few images could easily be used as album cover art.

Is there a way to perform a similar translation with music? For example, if you play in D Minor (the saddest of all keys), is there a way to map the key or some other musical characteristic to a word and have the images be generated with the intermediate being the primary source? Or would the approach be to map images to certain characteristics of music directly?

I wonder if you could use another model that describes music and feed that text into this one?

Even something based on spotify's music labeling api would be super interesting!

I will get excited when I see this making images bigger than 256 pixels.

Currently both Big Sleep and Deep Daze are generating 512x512. These ones are representative: https://postimg.cc/HVspWgPn Few are "collages"; most images have full-area coherence.

> as album cover art

Indeed, what I see is an album cover generator.

“A man painting a completely red image” is very much a dadaist collage. The only complaint is that the ‘man’ could be rather more recognizable as such.
