I had a fair bit of fun with DALL-E, but it's very expensive and I found the product too much like a toy - a pair of plastic scissors made for children. Many times I've had my prompts blocked by apparent "Scunthorpe"-style filtering.
Also the creativity is a bit muted, and I'm convinced anything an American would describe as un-Christian has been purged from its training data, leaving an air of vapidity - I found myself wasting many prompts trying to get it to generate vomit, for example.
And of course there's the ironically named "OpenAI" being beaten to the punch at actually releasing something that isn't just an API.
That was one of the most disappointing parts of DALL-E 2 -- not only is its domain space limited (justifiable IMO) but they then severely limited the outputs and coupled it with a very opaque abuse policy. Their press releases are all "we're gonna revolutionize art" but what they sell is in every way targeted towards the visual equivalent of elevator music.
With DALL-E mini, the classic opening sentence of The Dark Tower - "the man in black fled across the desert, and the gunslinger followed" - produces evocative (if abstract) artwork. With DALL-E 2? You get a message that your prompt is inappropriate and your account will be reviewed for termination if you continue. So they don't want guns, okay, whatever.
But it eventually impacts pretty much any concept you want to try. I was exploring sci-fi art and wanted to see what it would do if I tried to get a stylistic fusion of old-school Soviet spacecraft aesthetics (exposed structures with bulbous pressure vessels housing the controls [among other things]) with Western equivalents. Fusing disparate design philosophies in a way that actually feels creative is a task at which DALL-E 2 occasionally performs superbly, and I was really looking forward to playing with the concept -- dreaming through the machine of a long-lost potential future.
Nope! Turns out "Soviet", "USSR", and IIRC now "Russia" are straight up banned words. I burned up quite a few prompts before figuring out that was the trigger.
And then there's the cost. During the closed beta they gave us an amount of access that now works out to several hundred dollars a month, and that translated to less than an hour a day of exploring DALL-E. It felt inadequate when it was free, and now they're asking for hundreds of dollars a month for it? No.
The pricing model makes sense if you're trying to generate bland corporate artwork for some random webpage, but as a creative tool it just ensures that the only people really engaging with it have substantial financial backing for producing exclusively milquetoast "artwork."
I noticed this with GPT-3 as well: its output is extremely sterile. Interestingly, DALL-E Mini (Craiyon) doesn't shy away from crazy stuff in the training data, so you can end up with stuff like this:
https://www.reddit.com/r/weirddalle/comments/v8bsvo/gender_r...
But if you tell it to generate boobs, it doesn't know about nipples, so it gives you what is obviously just a bikini shot with the bra replaced with uniform skin colour.
So 9/11 is cool, but boobs apparently not. What a world we live in.
"Stable diffusion runs on under 10 GB of VRAM on consumer GPUs, generating images at 512x512 pixels in a few seconds. This will allow both researchers and soon the public to run this under a range of conditions, democratizing image generation."
I mean - someone can probably get it running on a Commodore 64 eventually. But I wonder how well it's going to run. If it runs on the GPU then it might be plausible but don't models like this have a 100x or so slowdown on CPU?
Someone else said GPU PyTorch on the M1 is far from ready so I'm wondering if this will be CPU instead.
That's not bad. A rough back-of-the-envelope puts that at a 30x slowdown compared to the times on Discord, but then I've got no idea how fast it will be on my own kit.
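If you want a ballpark for your own kit before queueing up a long run, something like the snippet below gives a rough CPU-vs-GPU ratio. It just times a big matmul on each device, which is only a crude proxy for diffusion-model throughput, and it's my own sketch, not anything from the repo:

    import time
    import torch

    def bench(device: str, n: int = 2048, iters: int = 20) -> float:
        # Returns average seconds per n x n matmul on the given device.
        x = torch.randn(n, n, device=device)
        for _ in range(3):          # warm-up
            x @ x
        if device == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            x @ x
        if device == "cuda":
            torch.cuda.synchronize()
        return (time.perf_counter() - start) / iters

    cpu_t = bench("cpu")
    if torch.cuda.is_available():
        gpu_t = bench("cuda")
        print(f"CPU is roughly {cpu_t / gpu_t:.0f}x slower than GPU on this box")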
I get in the region of 2-3 minutes from Disco Diffusion etc. on a mobile 3080.
Looking at the repo[1], it uses PyTorch, so you may be able to run it on the M1 GPU. PyTorch released GPU acceleration for Apple M1 earlier this year in v1.12, but looking again at the environments.yaml file they pin PyTorch v1.11, so upgrading might not be issue-free.
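Assuming you do get it onto PyTorch >= 1.12, device selection itself is the easy part; a sketch like the one below (mine, not from the repo) picks the Apple MPS backend when it exists and falls back to CUDA or CPU otherwise. Whether every op the model needs is actually supported on MPS is a separate question:

    import torch

    def pick_device() -> torch.device:
        if torch.cuda.is_available():
            return torch.device("cuda")
        # torch.backends.mps only exists from PyTorch 1.12 onwards.
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return torch.device("mps")
        return torch.device("cpu")

    device = pick_device()
    print(f"running on {device}")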
Despite being 10x slower, is it still doable with a few cups of coffee and some waiting? I have no idea how long generating these images takes in wall-clock time.
Super excited about this. A tool that's comparable to or even outperforms DALL-E 2, but is open to the public, open source, and without the insane content restrictions or draconian banning policies.
It's great that the code is open source. However, the magic is in the models. I wouldn't be so sure that model access won't have the same restrictions as OpenAI.
From the repo:
"To prevent misuse and harm, we currently provide access to the checkpoints only for academic research purposes upon request. This is an experiment in safe and community-driven publication of a capable and general text-to-image model. We are working on a public release with a more permissive license that also incorporates ethical considerations."
Right now the model people use in the closed beta is really permissive; it can create NSFW content, famous people, gory images, etc. I don't think (I hope) the model will change much. They will probably release all the weights from the different training phases.
The most interesting thing is that the model is relatively small compared to Imagen, DALL-E 2, and Parti. They have specifically trained this model so that people can easily use it on their own GPUs. I think StabilityAI will train a larger version of Stable Diffusion, perhaps with a larger text encoder, since the one used here is quite small and I think it is the biggest bottleneck; Imagen shows that scaling the text encoder actually matters more than scaling the generator.
In the end, the architecture is not very different from the LDM-400m that CompVis had already trained, except that it is conditioned on CLIP text embeddings instead of text tokens, the autoencoder was trained at 512 instead of 256, and of course Stable Diffusion was trained for much longer.
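To illustrate what "conditioned on CLIP text embeddings" means mechanically: the denoising U-Net's spatial features cross-attend to the output of the frozen text encoder. The toy below only shows the shape of that interaction; the widths are illustrative and none of this is the real Stable Diffusion code:

    import torch
    import torch.nn as nn

    d_model = 320        # channel width of one U-Net block (illustrative)
    text_dim = 768       # width of CLIP ViT-L/14 text embeddings
    n_tokens = 77        # CLIP's padded prompt length

    # Project the text embeddings into the U-Net's working width, then let the
    # flattened latent feature map attend to them.
    to_kv = nn.Linear(text_dim, d_model)
    cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=8, batch_first=True)

    latent_tokens = torch.randn(1, 64 * 64, d_model)      # flattened 64x64 latent features
    text_embeddings = torch.randn(1, n_tokens, text_dim)  # stand-in for the frozen text encoder output

    kv = to_kv(text_embeddings)
    conditioned, _ = cross_attn(query=latent_tokens, key=kv, value=kv)
    print(conditioned.shape)  # torch.Size([1, 4096, 320])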
I've been diving deep into this space over the past few months. The democratization of research, code, and applications has been so profound.
Big kudos to GitHub, Papers-with-code, Huggingface, Google Colab, Replicate, Discord, probably many other tools, and then everyone playing a part (especially the disco diffusion crowd).
I especially want to call out Google Colab. Simply having a Gmail account gives you the ability to run most of the state-of-the-art models (and much else) very rapidly, along with an environment that integrates with things like GitHub and Drive for storage.
The outputs from this model are super impressive... Would be cool for it to be usable with Google Colab soon!
Unrelated, but multimodal.art has been doing very cool work on building a whole little app you can run from a colab. But their models are pretty underwhelming at the moment.
MidJourney is fantastic when it comes to creating AI art. DALL-E is better at some things (it seems to understand depth better, can draw hands, better at cartoon characters), but Stability looks fantastic and I'm really excited to try it.
I think I heard the Stable Diffusion folks call DALL-E the McDonald's of AI art, and based on my experience, I agree.
For someone only tangentially familiar with this space, how is this different than e.g. https://github.com/nerdyrodent/VQGAN-CLIP which you can also run at home? Is it the quality of the generated images?
Stable Diffusion produces substantially higher quality images in most contexts, but is much more expensive to create. The genius of VQGAN-CLIP is that it showed you could take two pre-existing models and combine them to get text-to-image synthesis to work at all. By contrast, models like DALL-E and Stable Diffusion require extremely expensive pretraining.
There's a discussion of this in the VQGAN-CLIP paper, see in particular 6.1 "Efficiency as a Value" https://arxiv.org/abs/2204.08583
Disclaimer: I'm one of the authors of the VQGAN-CLIP paper and was tangentially involved with Stable Diffusion.
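For anyone curious what "combine two pre-existing models" looks like in practice, the core of the VQGAN-CLIP recipe is an optimization loop: freeze both networks and update only a latent so the decoded image scores well against the prompt under CLIP. The sketch below uses small random stand-in modules rather than real VQGAN/CLIP weights, so it only illustrates the control flow (the real recipe adds cutouts, augmentations, and regularizers on top):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Stand-ins for the two frozen, pretrained models (in practice you would
    # load actual VQGAN and CLIP checkpoints here).
    generator = nn.ConvTranspose2d(128, 3, 4, stride=4)                    # "VQGAN decoder": 16x16 latent -> 64x64 image
    clip_image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 512))  # "CLIP image tower"
    for p in list(generator.parameters()) + list(clip_image_encoder.parameters()):
        p.requires_grad_(False)                                            # both models stay frozen

    text_features = F.normalize(torch.randn(1, 512), dim=-1)               # "CLIP-encoded prompt"

    z = torch.randn(1, 128, 16, 16, requires_grad=True)                    # the only thing being optimized
    opt = torch.optim.Adam([z], lr=0.05)

    for step in range(100):
        image = generator(z).clamp(-1, 1)                                  # decode latent -> image
        image_features = F.normalize(clip_image_encoder(image), dim=-1)
        loss = -(image_features * text_features).sum()                     # maximize cosine similarity
        opt.zero_grad()
        loss.backward()
        opt.step()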
I must say that some of my favourite images are ones that I generated with VQGAN. Don't dismiss older/smaller models; some have a very specific quality that is perfect for certain things. (Heck, I'm still fond of Aphantasia and its weird tiling.)
Yes, for the end user that will be the main difference. From the curation of the training data to the model itself, a number of things have been put together that make the generations substantially more aesthetically pleasing, imho.