This is really exciting to see. I applaud Stability AI's commitment to open source and hope they can operate for as long as possible.
There was one thing I was curious about... I skimmed through the executive summary of the paper but couldn't find it. Does Stable Diffusion 3 still use CLIP from OpenAI for tokenization and text embeddings? I would naively assume that they would try to improve on this part of the model's architecture to improve adherence to text and image prompts.
They use three text encoders to encode the caption:
1. CLIP-G/14 (OpenCLIP)
2. CLIP-L/14 (OpenAI)
3. T5-v1.1-XXL (Google)
They randomly disable encoders during training, so that when generating images SD3 can use any subset of the 3 encoders. They find that using T5 XXL is important only when generating images from prompts with "either highly detailed descriptions of a scene or larger amounts of written text".
I have just been informed that my above comment is false, the CLIP-L is in fact referring to OpenAI's, despite that also being the name of an OpenCLIP model.
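To make the "randomly disable encoders during training" idea concrete, here is a rough sketch of how I picture it. The encoder names, the 0.5 drop probability, and the choice to swap in zeros for a disabled encoder are my assumptions for illustration, not details taken from the paper.

```python
import random

# Hypothetical labels for the three text encoders mentioned above.
ENCODERS = ("clip_g", "clip_l", "t5_xxl")

def sample_active_encoders(drop_prob, rng):
    """Independently keep or drop each encoder for one training example."""
    return {name: rng.random() >= drop_prob for name in ENCODERS}

def build_conditioning(embeddings, active):
    """Replace the output of every disabled encoder with zeros of the same length."""
    return {
        name: vec if active[name] else [0.0] * len(vec)
        for name, vec in embeddings.items()
    }

rng = random.Random(0)
# Toy stand-ins for encoder outputs; the real ones are much larger tensors.
embeddings = {"clip_g": [0.1, 0.2], "clip_l": [0.3, 0.4], "t5_xxl": [0.5, 0.6, 0.7]}

# Training: a random subset of encoders is active for each example.
active = sample_active_encoders(drop_prob=0.5, rng=rng)
train_cond = build_conditioning(embeddings, active)

# Inference without T5 (e.g. to save VRAM): same code path, T5 slot zeroed.
clip_only = build_conditioning(
    embeddings, {"clip_g": True, "clip_l": True, "t5_xxl": False}
)
```

Because the model has already seen zeroed-out slots during training, running with only the CLIP encoders at inference is not out of distribution for it, which would explain why any subset works.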
It's impressive that it spells words correctly and lays them out, but the issue I have is the text always has this distinctively overly fried look to it. The color of the text is always ramped up to a single value which, when placed into a high fidelity image, gives the impression of just slapping some text on top with Photoshop afterwards in quite an amateurish fashion rather than text properly integrated into an image.
It's very likely an artifact of CFG (classifier-free guidance); hopefully some day we will be able to ditch this kinda dubious trick.
This is also the reason why the generated images have this characteristic high contrast and saturation. Better models usually need to rely less on CFG to generate coherent images because they fit the training distribution better.
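For anyone who hasn't seen it written out, CFG is just a linear extrapolation between two noise predictions. A minimal sketch with plain lists standing in for tensors (the variable names and the toy numbers are mine):

```python
def cfg_combine(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: start at the unconditional prediction and
    move `scale` times the distance toward the conditional prediction.
    scale=0 is unconditional, scale=1 is plain conditional, and anything
    above 1 extrapolates past the conditional prediction."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0, -1.0]   # toy prediction with an empty prompt
eps_cond = [0.5, 1.0, -2.0]     # toy prediction with the real prompt

plain = cfg_combine(eps_uncond, eps_cond, 1.0)
guided = cfg_combine(eps_uncond, eps_cond, 7.0)
print(plain, guided)
```

With a scale of 7 the difference between the two predictions is amplified sevenfold, so the result lands well outside the range of either input. That overshoot is a plausible source of the blown-out contrast and saturation people are describing.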
But the sample images they show here showcase a good job at blending text with the rest of the image, using the correct art style, composition, shading and perspective. It seems like an improvement, no?
The blending looks better, but the LED sign on the bus for example looks almost like handwritten lettering... the letters are all different heights and widths. Not even close to realistic. There's a lot of nuance that goes into getting these things right. It seems like it'll be stuck in an uncanny valley for a long time.
I'm expecting at some point the Stable Diffusion community of developers to recognize the value of Layered Diffusion, the method of generating elements with transparent backgrounds, and transitioning to outputs that are layered images one may access and tweak independently. The addition of that would make the hands-on media producers of the world say "okay, now we're talking, finally directly ingestible into our existing production pipelines."
Of the people I know in the CG industries using SD in any sort of pipeline, they're all using Comfy because a node based workflow is what they're used to from things like Substance Designer, Houdini, Nuke, Blender, etc.
> In early, unoptimized inference tests on consumer hardware our largest SD3 model with 8B parameters fits into the 24GB VRAM of a RTX 4090 and takes 34 seconds to generate an image of resolution 1024x1024 when using 50 sampling steps. Additionally, there will be multiple variations of Stable Diffusion 3 during the initial release, ranging from 800m to 8B parameter models to further eliminate hardware barriers.
It will probably suck. These models aren’t quite good enough for most tasks (other than toy fun exploration). They’re close in the sense that you can get there with a lot of work and dice rolling. But I would be pessimistic about a smaller model actually getting you where you want.
It has about as many parameters as SD v1.5, but hopefully with a better architecture, so I think it could end up being better for VRAM-constrained users than SD v1.5.
SDXL turbo is great in my experience! Way better than any alternative at that speed (e.g. sd1.4 or sd2). For img2img it's fantastic. And since the 800m model is using new insights since then and generally a cleaner dataset from the looks of it, I could imagine it's decent for some tasks (or better than sd2 turbo at least, which is enough to be fun and useful in my eyes).
A lot of use cases are cost-limited. DALL-E 3 makes great images but costs $0.12 per image, so it would get extremely expensive at scale (1k generations is already $120). The cost is by GPU time: the faster you can generate it, the cheaper it is. We can get images under 250ms now, which is fast enough to fit in a web request.
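Back-of-the-envelope version of that comparison. The $0.12 per image and the 250ms figure are from the comment above; the $2.00/hour GPU rental price is a made-up placeholder, so plug in whatever your provider actually charges.

```python
def cost_per_image(gpu_dollars_per_hour, seconds_per_image):
    """Cost of one image when you pay for GPU time by the hour."""
    return gpu_dollars_per_hour * seconds_per_image / 3600.0

API_PRICE_PER_IMAGE = 0.12         # per-image price quoted above
HYPOTHETICAL_GPU_PRICE = 2.00      # dollars per hour, assumption for illustration
SECONDS_PER_IMAGE = 0.25           # "under 250ms" from the comment above

n_images = 1000
api_total = n_images * API_PRICE_PER_IMAGE
self_hosted_total = n_images * cost_per_image(HYPOTHETICAL_GPU_PRICE, SECONDS_PER_IMAGE)
print(f"API: ${api_total:.2f}, self-hosted: ${self_hosted_total:.2f}")
```

This ignores idle time, batching, and everything else that makes real hosting bills messier, but it shows why generation speed translates so directly into cost.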
Some use cases might be generating profile pictures or banners for users or unique profile pictures for bots in online games. Discord, steam, social media, whatnot, you could just type what you want your profile picture to be and make it on the fly. They're small, aren't expected to be extremely high quality, and cheap enough.
Testing on https://fastsdxl.ai/ - "high quality profile picture of a cartoon cat holding a Bouquet of flowers"
To be clear it's not perfect, but this is a fairly complex prompt and I find the majority of seeds would be "good enough" for thumbnail profile pictures. I think we're almost there for "cheap good enough" use cases.
GitHub's profile pictures are 40x40px in issues/PRs/commits/etc. They're very rarely seen above that, and I think SDXL Lightning creates acceptable 1024x1024 images in many cases; downscaling to 512x512 hides a lot of the "AI artifacts".
Places like steam, discord, etc you very rarely see profile pictures above that size.
The use case is you can print out the great images and put them on your wall for decoration.
It is just art. I think AI art shows what obsessed gadget makers for profit we have become culturally. We can't even figure out that the use case for art is hanging on the wall for decoration. For a conversation piece. A few will have their name become known and make it into galleries.
Infinite supply means the value tends towards zero. Good luck monetizing anything with those economic characteristics.
FWIW, whether you use Turbo or an accelerator (e.g. Nvidia's TensorRT), there is plenty of guesswork in the prompt you want. Iterating quickly with low steps in Euler A, finding a great prompt that works, then switching to higher fidelity (I like DPM++ 2M at 3x the steps) goes a long way to getting it all "just right".
If you are getting what you want most of the time, then you are a better 'prompt engineer' than I am.
Latency does matter. Realistic workflows are not one-shot outputs. They're slow iterative improvements and changes to an image generated from a fixed or at least manually adjusted seed. 5 min would be killer.
The more interesting use cases I've seen have been real-time video filters and uses with music-based generation for live performances, which is why the speed matters more than the accuracy.
For example, it could take an old video game, say Morrowind, and patch the graphics onto the video screen in real time. Or people could look at a video of themselves and it would update the style, similar to a Snapchat filter.
The linked paper did look at SDXL Turbo, and found that the images were about as good as SDXL and better than a lot of models that would have been popular a little while ago. The compromise from using it, if there is any, is hard to detect. But it is much faster.
But the difference is academic; progress is so fast that it is reasonable to expect all these models will be obsolete in a year or two.
It looks like it. I really hope they do. Running SDXL right now is proper fun. I don't even use it for anything specific, just to amuse myself at times :D
It's very exciting to see that image generators are finally figuring out spelling. When DALL-E 3 (?) came out they hyped up spelling capabilities but when I tried it with Bing it was incredibly inconsistent.
I'd love to read a less technical writeup explaining the challenges faced and why it took so long to figure out spelling. Scrolling through the paper is a bit overwhelming and it goes beyond my current understanding of the topic.
Does anyone know if it would be possible to eventually take older generated images with garbled up text + their prompt and have SD3 clean it up or fix the text issues?
> I'd love to read a less technical writeup explaining the challenges faced and why it took so long to figure out spelling. Scrolling through the paper is a bit overwhelming and it goes beyond my current understanding of the topic.
I'm not an ML researcher but I can answer this. Note that this is not information from the paper, just my own findings from following the "scene".
We actually figured out spelling not long after diffusion models came out. The Imagen paper, which came out several months before SD1, explained how they did it, and it's the same technique SD3 and DALL-E 3 use: instead of using CLIP's text encoder, they use T5.
The reason why image models can't spell is the same reason why language models also have difficulty spelling. Tokenization. Simply speaking instead of seeing each letter individually, we split a sentence into sub-words, most commonly called tokens and that's what the model sees. The model never gets to see each letter individually. But it turns out that if you make the model big enough and feed it enough data, it actually learns how each token is spelled out.
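Here's a toy illustration of what "the model never sees letters" means. The vocabulary is made up and the greedy longest-match rule is a simplification (real tokenizers such as BPE or SentencePiece learn their vocabulary and merge rules from data), but the effect is the same.

```python
# Hypothetical toy vocabulary: a few whole sub-words plus single letters as fallback.
VOCAB = {
    "stable": 0, "diff": 1, "usion": 2, " ": 3,
    "s": 4, "t": 5, "a": 6, "b": 7, "l": 8, "e": 9,
    "d": 10, "i": 11, "f": 12, "u": 13, "o": 14, "n": 15,
}

def tokenize(text):
    """Greedy longest-match sub-word tokenization into integer IDs."""
    ids = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                i = j
                break
        else:
            raise ValueError(f"no token for {text[i]!r}")
    return ids

print(tokenize("stable diffusion"))  # four IDs for sixteen characters
print(tokenize("dust"))              # falls back to one ID per letter
```

The model only ever receives those integer IDs. Nothing in the ID for "stable" says it contains an "s", a "t", an "a" and so on, so the spelling has to be learned indirectly from lots of data.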
CLIP is both small (200-500M params) and trained on limited text data (only image captions). T5 is trained on a large corpus of data and is also huge (~5.5B params). This makes T5 the obvious choice if you care about spelling.
So why use CLIP in the first place? Simple: it's much easier to train on CLIP embeddings than T5 embeddings. Not only are they smaller, but because CLIP is trained on text/image pairs, and due to backpropagation, the text embeddings also contain a lot of visual semantic information. This simplifies a lot of the work the text->diffusion attention modules need to do. Another reason is that T5 is absolutely massive: it's 5x larger than the image part of the model and 10-20x larger than CLIP.
If you take a close look at diagram (a) in the paper you'll actually see that SD3 uses both CLIP and T5. Not only that, but they trained it in a way that makes each encoder optional, so you can use the CLIP models only if you don't care about spelling and image composition (CLIP is also bad at understanding prompts), which is useful because most GPUs can't handle T5 on its own. Tho I suspect someone will distil T5 so it becomes 5-10 times smaller than it currently is at a minimal loss in how good it is for prompting.
I would imagine with an img2img workflow it would be. The same way you can reconstruct a badly rendered face by doing a second pass on the affected region.
Nice improvements in text rendering, but it seems generating hands and fingers is still difficult for SD3. None of the pictures in the example contain human hands, except for the pixelized wizard; and the monkey hands seem a bit odd.
The proper solution to fine details like hands will be conditioning the image on a 3D pose eg with controlnets. It's hard to get exactly what you want with only a single text prompt.
This looks great, very exciting. The paper is not a lot more detailed than the blog. The main thing about the paper is they have an architecture that can include more expressive text encoders (T5-XXL here), they show this helps with complex scenes, and it seems clear they haven’t maxed out this stack in terms of training. So, expect SD3.1 to be better than this, and expect 4 to be able to work with video through adding even more front-end encoding. Exciting!
This arch seems to be flexible enough to extend to video easily. Hopefully what we have here will be another "foundation" block like the transformer blocks in LLaMA.
Why:
It looks generic enough to incorporate text encoding / timestep conditioning into the block in all the imaginable ways (rather than in limited ways as in SDXL / SD v1, or Stable Cascade). I don't think there is much left to be done there other than to play with positional encoding (2D RoPE?).
Great job! Now let's just scale up the transformers and focus on quantization / optimizations to run this stack properly everywhere :)
More and more companies that were once devoted to being 'open', or were previously open, are now becoming increasingly closed. I appreciate that Stability AI releases these research papers.
It's hard to build a business on "open". I'm not sure what Stability AI's long term direction will be, but I hope they do figure out a way to become profitable while creating these free models.
Agreed but this isn't the same as an open source library; it costs A LOT of money to constantly train these models. That money has to come from somewhere, unfortunately.
Yeah. The amount of compute required is pretty high. I wonder, is there enough distributed compute available to bootstrap a truly open model through a system like seti@home or folding@home?
Distributing the training data also opens up vectors of attack. Poisoning or biasing the dataset distributed to the computer needs to be guarded against... but I don't think that's actually possible in a distributed model (in principle?). If the compute is happening off server, then trust is required (which is not {efficiently} enforceable?).
Trust is kinda a solved problem in distributed computing. The different "@Home" projects and Bitcoin handle this by requiring multiple validations of a block of work for just this reason.
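The "multiple validations" idea boils down to sending the same work unit to several independent workers and only accepting a result that enough of them agree on. A simplified sketch of that idea (the function name and quorum value are mine, this is not any project's actual API):

```python
from collections import Counter

def validate_work_unit(results, quorum):
    """Return the result reported by at least `quorum` workers, else None.

    `results` holds the answers that independent workers returned for the
    same work unit, e.g. hashes of their output.
    """
    if not results:
        return None
    value, count = Counter(results).most_common(1)[0]
    return value if count >= quorum else None

# Two honest workers agree, one bad actor returns garbage.
accepted = validate_work_unit(["abc123", "abc123", "ffff00"], quorum=2)
# Nobody agrees: the unit has to be handed out again.
rejected = validate_work_unit(["abc123", "ffff00", "0000aa"], quorum=2)
print(accepted, rejected)
```

Note that this only works when honest workers produce bit-identical results, and it costs at least `quorum` times the compute per work unit.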
How do you verify the work of training without redoing the exact same work for training? (That's the neat part: you don't)
Bitcoin is trust-solved because of how the new blocks depend on previous blocks. With training data, there is no such verification (prompt/answer pairs do not depend at all on other prompt/answer pairs) (if there was, we wouldn't need to do the work of training on the data in the first place).
You can rely on multiplying the work where gross variations are ignored (as you suggest): but that will take a lot more overhead in compute, and still is susceptible to bad actors (but much more resistant).
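One way to read "multiplying the work where gross variations are ignored" is to have several workers compute an update for the same batch and take a coordinate-wise median instead of a mean. This is just my sketch of that reading, with toy numbers:

```python
from statistics import median

def robust_aggregate(replicas):
    """Coordinate-wise median over updates computed by different workers."""
    return [median(column) for column in zip(*replicas)]

# Three honest workers differ slightly (e.g. floating point noise).
honest = [[0.10, -0.20], [0.11, -0.19], [0.09, -0.21]]
# One bad actor submits a wildly different update.
poisoned = [[50.0, 50.0]]

aggregated = robust_aggregate(honest + poisoned)
naive_mean = [sum(col) / len(col) for col in zip(*(honest + poisoned))]
print(aggregated, naive_mean)
```

The median stays close to the honest values while the plain mean gets dragged far off. It still breaks once bad actors are roughly half of the replicas, and every replica is compute that isn't going toward new training steps.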
There is no solid/good solution - afaik - for distributed training of an AI (Open assistant I think is working on open training data?), if there is: I'll sign up.
There has been some interesting work when it comes to distributed training. For example DiLoCo (https://arxiv.org/abs/2311.08105). I also know that Bittensor and nousresearch collaborated on some kind of competitive distributed model frankensteining-training thingy that seems to be going well. https://bittensor.org/bittensor-and-nous-research/
Of course it gets harder as models get larger but distributed training doesn't seem totally infeasible. For example if we were to talk about MoE transformer models, perhaps separate slices of the model can be trained in an asynchronous manner and then combined with some retraining. You can have minimal regular communication about say, mean and variance for each layer and a new loss term dependent on these statistics to keep the "expertise" for each contributor distinct.
Forward-Forward looked promising, but then Hinton got the AI-Doomer heebie-jeebies and bailed. Perhaps someone picks up the concept and runs with it - I'd love to myself but I don't have the skillz to build stuff at that depth, yet.
>> but Y-Combinator literally only exists to squeeze the most bizness out of young smart people.
YC started out with the intent to give young smart people a shot at starting a business. IMHO it has shifted significantly over the years to more what you say. We see ads now seeking a "founding engineer" for YC startups, but it used to be the founders were engineers.
>> Training these big models is very very expensive.
Which is why they are not the future. A big model that can generate a picture about anything in response to any input makes for a great website. It generates lots of press. But it is not a reasonable tool for content generation. If you want to produce content in a specific area or genre, the best results come from a model trained or modified in the area. So the big generalized AI, if you use it, would only be the framework on which you built your specialized tool. Building that specialized tool, such as something dedicated to images of a particular politician, does not require huge amounts of computation. That sort of thing can be and is being done by individuals.
I am waiting for a tool trained on publicly-accessible mugshots. It wouldn't be a very big project but could yield a tool to generate very believable mugshots of politicians.
Depending on your background and circumstances, there are ways to opt out of the race to a greater/lesser degree. Moving to a cheaper city in your country, or a cheaper country altogether, is one of them. Finding a less stressful way of making less money is another.
It's just hard being reminded that there's no escape hatch - we've welded them all shut for eternity. Being reduced to choices within a system but the choice horizon never extends to the system itself and won't within my lifetime makes me feel trapped.
Maybe, but in image generation it's also hard to be closed.
The big providers are all so terrified they'll produce a deepfake image of Obama getting arrested or something, the models are so locked down they only seem capable of producing stock photos.
But they used to let you download the model weights to run on your own machine... But Stable Diffusion 3 is just in 'limited preview' with no public download links.
Both SD1.4 and SDXL were in limited preview for a few months before a public release. This has been their normal course of business for about 2 years now (since founding). They just do this to improve the weights via a beta test with less judgemental users before official release.
How is a closed beta anything out of the ordinary? They know they would only get tons of shit flung at them if they publicly released something beta-quality, even if clearly labeled as such. SD users can be a VERY entitled bunch.
I've noticed a strange attitude of entitlement that seems to scale with how open a company is - Mistral and Stable Diffusion are on very sensitive ground with the open source community despite being the most open.
If you try to court a community then it will expect more of you. Same as if you were to claim to be an environmentalist company then you would receive more scrutiny from environmentalists confirming your claims.
That's… not really relevant to Stability AI at all. SAI isn't "claiming" anything. They are show, not tell (well, mostly). They give a technology away for free the likes of which everybody else keeps very tightly locked behind SaaS. Then people bitch about said free technology.
That's nothing new with Stability. Even 1.5 was "released early" by RunwayML because they felt Stability was taking too long to release the weights instead of just providing them in DreamStudio.