DALL-E Paper and Code (github.com/openai)
49 points by david2016 on Feb 24, 2021 | 13 comments



Note that this is just the VAE component used to help train the model and generate images; it will not let you create crazy images from natural language as shown in the blog post (https://openai.com/blog/dall-e/).

More specifically from that link:

> [...] the image is represented using 1024 tokens with a vocabulary size of 8192.

> The images are preprocessed to 256x256 resolution during training. Similar to VQVAE, each image is compressed to a 32x32 grid of discrete latent codes using a discrete VAE that we pretrained using a continuous relaxation.

OpenAI also provides the encoder and decoder models and their weights.

However, with the decoder model released, it's now possible to, say, train a text-encoding model that links up to that decoder (training on an annotated image dataset, for example) to get something close to the DALL-E demo OpenAI posted. Or something even better!
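
For anyone who wants to see what the released pieces actually do, the encode/decode round trip looks roughly like this. This is just a sketch based on my reading of the repo's usage notebook -- the load_model/map_pixels/unmap_pixels helpers and the CDN weight URLs are taken from there, so double-check against the repo before relying on it:

    # Round-trip an image through the released dVAE (sketch, not tested here).
    import torch
    import torch.nn.functional as F
    import torchvision.transforms as T
    from PIL import Image
    from dall_e import load_model, map_pixels, unmap_pixels

    dev = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    enc = load_model("https://cdn.openai.com/dall-e/encoder.pkl", dev)
    dec = load_model("https://cdn.openai.com/dall-e/decoder.pkl", dev)

    # Preprocess to 256x256 and shift pixels into the range the encoder expects.
    img = Image.open("input.png").convert("RGB")
    x = T.Compose([T.Resize(256), T.CenterCrop(256), T.ToTensor()])(img)[None].to(dev)
    x = map_pixels(x)

    # Encode to a 32x32 grid of discrete codes drawn from a vocabulary of 8192.
    z = torch.argmax(enc(x), dim=1)                               # shape (1, 32, 32)
    z = F.one_hot(z, num_classes=8192).permute(0, 3, 1, 2).float()

    # Decode the token grid back to pixels.
    x_rec = unmap_pixels(torch.sigmoid(dec(z)[:, :3]))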


Yeah, unfortunately OpenAI has only released the weaker ResNets and vision transformers they trained.

Some brilliant folks (Ryan Murdock [@advadnoun], Phil Wang [@lucidrains]) have tried to replicate their results with projects like big-sleep [0], with decent output, but even with this improved VAE we're still a ways from DALL-E-quality results.

If anyone would like to play with the model, check out either the Google Colab [1] (if you want to run it on Google's cloud) or my site [2] (if you want a simplified UI).

[0]: https://github.com/lucidrains/big-sleep/

[1]: https://colab.research.google.com/drive/1MEWKbm-driRNF8PrU7o...

[2]: https://dank.xyz


Can someone explain what this even is for folks reading the description and going "this means nothing to me"?


About a month ago, OpenAI released information on their latest project -- a neural network which aims to generate images from text [1]. The results were impressive and the work received a lot of attention in the ML community. The repo linked in this post includes a small portion of the code used in the model. Perhaps I'm missing some context as well, but the code itself appears to be remarkably generic, not particularly interesting, and basically useless on its own. Perhaps the repo is in its early stages and more interesting developments may come later. I'd assume it's simply being upvoted because of all the OpenAI fanboys. Feel free to correct me if I'm mistaken and there is something remotely useful in this repo.

[1] https://openai.com/blog/dall-e/


Right?

What's DALL·E? What's a VAE?

A post to Hacker News doesn't target a specific niche community.


DALL-E (DALL·E?) is a machine learning model which, given a caption and either the top portion of an image (not necessarily "the top half", just some amount of the top) or nothing at all, will generate an image that continues what it was given in a way that is supposed to match the caption. From the results they published, it looks quite convincing. You can do things like specify the art style, or ask for various novel combinations of things, and it will often produce a good-looking image of that thing.

One of the examples shown was a chair that looks like an avocado. It produced a number of such images.

It seems rather impressive imo.

_____

A VAE is a "variational auto-encoder". As I understand it, a VAE is a neural net with an encoder part and a decoder part: the encoder takes as input the full thing (in this context, the picture, or maybe a small square within a picture?), and that same space is what the decoder outputs into. The space the encoder outputs into is much smaller than the other side.

Err, let me rephrase that.

There's a high-dimensional space, like the space of possible pictures, but you want to reduce it to a low-dimensional space corresponding to the sort of pictures in your dataset. So you have the neural net take in a picture, map it into some low-dimensional space (this is the encoder), and then map it back to the original high-dimensional space (the decoder), trying to make the output it gets after doing that as close as possible to the input that went into the encoder. Once you've gotten that working well, you can just grab random locations in the low-dimensional space, decode them, and they should look like pictures of the same type that came from your dataset?

uh, I'm simplifying as a result of not knowing all the details myself.
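
If it helps, here's a toy sketch of just the encoder/decoder shape of that idea -- a plain autoencoder in PyTorch, leaving out the "variational" part and the discrete codes the dVAE in this repo uses, and nothing like the real model's architecture:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyAutoencoder(nn.Module):
        def __init__(self, latent_dim=32):
            super().__init__()
            # Encoder: big space (a 64x64 grayscale image) -> small latent space.
            self.encoder = nn.Sequential(
                nn.Flatten(),
                nn.Linear(64 * 64, 256), nn.ReLU(),
                nn.Linear(256, latent_dim))
            # Decoder: small latent space -> back into image space.
            self.decoder = nn.Sequential(
                nn.Linear(latent_dim, 256), nn.ReLU(),
                nn.Linear(256, 64 * 64), nn.Sigmoid())

        def forward(self, x):
            z = self.encoder(x)                          # compress
            return self.decoder(z).view(-1, 1, 64, 64)   # reconstruct

    # Training pushes the reconstruction to match the input.
    model = TinyAutoencoder()
    x = torch.rand(8, 1, 64, 64)
    loss = F.mse_loss(model(x), x)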


The title should be updated: this doesn't include the paper, and it's not the code for DALL-E but only for its VAE component.



So, uhh, where's the paper? The link in the readme isn't active.



Has anyone tried this out?


The linked repository is just a part of the entire model, so it can't be used as-is.

That said, there is a complete implementation by lucidrains [1] with some results; the only missing component now is the dataset.

[1]: https://github.com/lucidrains/DALLE-pytorch
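
For anyone who wants to poke at it, the basic usage from that repo's README looks something like the following. I'm writing this from memory, so treat the exact constructor arguments as approximate and check the repo:

    import torch
    from dalle_pytorch import DiscreteVAE, DALLE

    # Discrete VAE that turns 256x256 images into a grid of tokens
    # (vocabulary size 8192, matching the OpenAI blog post).
    vae = DiscreteVAE(
        image_size=256,
        num_layers=3,
        num_tokens=8192,
        codebook_dim=512,
        hidden_dim=64)

    # Transformer that models text tokens followed by image tokens.
    dalle = DALLE(
        dim=512,
        vae=vae,
        num_text_tokens=10000,
        text_seq_len=256,
        depth=6,
        heads=8)

    # Dummy batch: tokenized captions plus matching images.
    text = torch.randint(0, 10000, (4, 256))
    images = torch.randn(4, 3, 256, 256)
    mask = torch.ones_like(text).bool()

    loss = dalle(text, images, mask=mask, return_loss=True)
    loss.backward()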


A thread of examples from the provided notebook: https://twitter.com/ak92501/status/1364666124919447558

Note that these just demonstrate that arbitrary input images, after being encoded and decoded, match the originals, which is what you'd expect from a VAE.



