I'd bet this has a significant impact on the community. From what I've seen - experimenting with an AMD card that crashes and sends me to bed early after a bit of stable diffusing - there is a power-law distribution of graphics cards out there. Models that play nicely with less memory and slower cards get more attention from the hordes of anonymous copyright infringers who power the llama-wearing-tophat LoRAs and checkpoints.
SDXL isn't there yet but we're getting mighty close. SD1 felt a lot like shouting words at an idiot savant to try to inspire something good. SDXL gives me the impression I'm explaining what I want to a talented but verbally challenged painter who will attempt to draw what I want. The trajectory seems to be that we'll reach something close to natural English in maybe 2-6 years, and if AMD could please keep fixing crashes, that'd be lovely.
But seriously, halving the model size and doubling the speed will mean a lot more people can use it. That is a big deal because some of them will spend time fine-tuning it on more data for specific applications and setting up tooling to work around its limitations.
A problem with SDXL is that it's just different. The main prompting syntax is different, and many UIs don't even implement the positive/negative style prompting.
SD 1.5 has a ton of inertia, and ultimately it's still smaller than SDXL.
I, for one, hope the llama-wearing-tophat LoRAs meet the smarmy-looking turtleneck mfers who just HAD to rebind all the sane keybindings on Macs in order to 'think different' and build a usability moat around their OS. And I hope they kick butt when they do. But that's just on a personal level.
I love that LLMs are getting more accessible, and I want that trend to continue.
The Segmind Stable Diffusion Model (SSD-1B) is a distilled 50% smaller version of the Stable Diffusion XL (SDXL), offering a 60% speedup while maintaining high-quality text-to-image generation capabilities. It has been trained on diverse datasets, including Grit and Midjourney scrape data, to enhance its ability to create a wide range of visual content based on textual prompts.
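For anyone who wants to poke at it from Python, here's a minimal sketch with the diffusers library, assuming the segmind/SSD-1B checkpoint loads as a standard SDXL pipeline (which the model card suggests):

    # Sketch: text-to-image with SSD-1B via diffusers.
    # Assumes a CUDA GPU and that segmind/SSD-1B loads as a standard SDXL pipeline.
    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "segmind/SSD-1B",
        torch_dtype=torch.float16,
        use_safetensors=True,
    ).to("cuda")

    image = pipe(
        prompt="a llama wearing a top hat, studio lighting, 85mm photo",
        negative_prompt="blurry, low quality",
        num_inference_steps=30,
    ).images[0]
    image.save("llama.png")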
By the way: What is currently the most capable text to image model on Hugging Face? I tried stable-diffusion-xl-base-1.0 but it seems far behind what the Bing image creator is creating these days. Is open source falling behind again?
Bing is using OpenAI's DALL-E 3, which is somewhat ahead right now. However, SDXL (Stable Diffusion XL) can produce some pretty fantastic results, but it helps a lot to use a refining model (https://huggingface.co/stabilityai/stable-diffusion-xl-refin...) at the end of the generation. The automatic1111 webui has a built-in feature to do this automatically, but other services have their own approaches.
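For those running it outside a web UI, the base-then-refiner handoff in diffusers looks roughly like this (a sketch; the 0.8 denoising split and step counts are illustrative, not tuned values):

    # Sketch: SDXL base + refiner via diffusers (base denoises most of the way,
    # the refiner finishes from the base's latents).
    import torch
    from diffusers import DiffusionPipeline

    base = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
    ).to("cuda")
    refiner = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-1.0",
        text_encoder_2=base.text_encoder_2, vae=base.vae,
        torch_dtype=torch.float16, variant="fp16", use_safetensors=True,
    ).to("cuda")

    prompt = "an astronaut riding a horse on mars, cinematic lighting"
    # Base handles the first ~80% of denoising; the refiner takes over from there.
    latents = base(prompt=prompt, num_inference_steps=40,
                   denoising_end=0.8, output_type="latent").images
    image = refiner(prompt=prompt, num_inference_steps=40,
                    denoising_start=0.8, image=latents).images[0]
    image.save("astronaut.png")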
> Bing is using OpenAI's DALL-E 3, which is somewhat ahead right now.
A lot of that is using an LLM for prompt expansion and composition guidance, both of which can be done with SD just by looping in an LLM for those purposes, though there isn’t support in any of the popular UIs for LLM composition assistance (there are recent papers demonstrating it, though, so I wouldn't be surprised to see extensions for popular UIs within a month.)
EDIT: corrected to clarify that only composition assistance isn’t yet supported in popular UIs, because prompt expansion is built in to Fooocus.
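A rough sketch of what looping in an LLM for prompt expansion could look like outside any UI (the model name and instruction here are placeholders, not what Fooocus or DALL-E 3 actually use):

    # Sketch: LLM prompt expansion before handing the prompt to SD.
    # The model name and instruction are placeholders, not any UI's actual setup.
    from transformers import pipeline

    expander = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")

    user_prompt = "a cozy cabin in winter"
    instruction = (
        "Rewrite this image prompt with concrete details about subject, "
        f"lighting, lens and style, in one sentence: {user_prompt}"
    )
    expanded = expander(instruction, max_new_tokens=80)[0]["generated_text"]
    print(expanded)  # feed this into the SD pipeline instead of the raw prompt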
My impression is there's a large open development community that has come up with some really clever stuff - fine-tuning, LoRAs, ControlNets, regional prompting, and so on. You can teach the model completely new concepts and styles at home, for <$1 in electricity.
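For instance, stacking a community-trained LoRA onto a base checkpoint is just a couple of lines in diffusers (the LoRA repo name below is a hypothetical placeholder):

    # Sketch: applying a community LoRA to an SD 1.5 base model.
    # "some-user/my-style-lora" is a hypothetical placeholder repo.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    pipe.load_lora_weights("some-user/my-style-lora")

    image = pipe("a portrait in the trained style",
                 num_inference_steps=30).images[0]
    image.save("portrait.png")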
But it's all been built on top of a base model trained by Stability AI at a cost of $600k (or at least, it would have cost that at AWS GPU prices).
And some imperfections like mangled hands, prompt concept bleed, the inability to generate clock faces and mirrors, etc., seem to go deep into the base model, beyond the point that fine-tuning can fix. And there aren't many people with $600k of spare cash lying around to train a new base model.
The outputs of the open source models are great despite such weaknesses IMHO - but the costs involved mean the open source community is fighting with one hand tied behind its back.
Of course, open models will always win in some areas - like if you want to generate a picture of Xi Jinping.
> But it's all been built on top of a base model trained by Stability AI at a cost of $600k (or at least, it would have cost that at AWS GPU prices).
The LAION dataset they used was notoriously poor quality, and the training process wasn't optimal by any measure, though. The costs of training are rapidly falling due to various optimizations.
Take a look at Pixart-alpha [0]. They claim SDXL-comparable performance for just $26k in training from scratch, with just 600M parameters in the UNet and 25M pictures in the training set. Supposedly they achieved this thanks to a high-quality training set tagged by a third-party model. The weights got leaked recently and the claim looks believable.
This looks very promising and the sample images are impressive, especially for the small model size.
Unfortunately, according to their Readme, "Inference requires at least 23GB of GPU memory." So not something you can run on consumer hardware in its current state.
That's because they slapped an 11B transformer on top of it for better prompt understanding; you can run it quantized to 8-bit and it will take about 14GB total, which some people have already tested. This can be reduced further, or the encoder can be swapped out for another one; the initial code for these models is typically poorly optimized (SD 1.4 required 12GB at first, then it was reduced to 4 and even 2GB). The meaningful part of the model, the part that actually generates the picture, is even smaller than SD 1.x's.
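To give a rough idea, loading a big T5-style text encoder in 8-bit via bitsandbytes looks something like this (the checkpoint name is illustrative, not necessarily the exact encoder Pixart-alpha ships):

    # Sketch: loading a large T5 text encoder in 8-bit to cut VRAM use.
    # Assumes bitsandbytes is installed; the checkpoint name is illustrative.
    import torch
    from transformers import T5EncoderModel, T5Tokenizer

    tokenizer = T5Tokenizer.from_pretrained("google/flan-t5-xxl")
    encoder = T5EncoderModel.from_pretrained(
        "google/flan-t5-xxl",
        load_in_8bit=True,   # int8 weights via bitsandbytes
        device_map="auto",
    )

    tokens = tokenizer("a red fox in the snow", return_tensors="pt").to(encoder.device)
    with torch.no_grad():
        # These embeddings are what gets handed to the diffusion model as conditioning.
        prompt_embeds = encoder(**tokens).last_hidden_state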
One of Stable Diffusion's strengths is the ability to be guided by sketches and references, which is much faster and more expressive/precise than anything achievable with a text prompt if you're good at it. If you want that, you want SD; it has vast amounts of related tooling around it. If text-to-image is enough for you, you need one of the commercial image generators with better prompt understanding.
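As one concrete example of that kind of guidance, a Canny ControlNet lets a rough edge map steer the composition (a sketch; the reference image path is a placeholder):

    # Sketch: steering SD 1.5 with an edge map via ControlNet in diffusers.
    # Assumes opencv-python is installed; "reference.png" is a placeholder path.
    import cv2
    import numpy as np
    import torch
    from PIL import Image
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    # Turn a reference photo/sketch into an edge map that constrains composition.
    edges = cv2.Canny(cv2.imread("reference.png"), 100, 200)
    control_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    image = pipe("a cyberpunk alleyway at night",
                 image=control_image, num_inference_steps=30).images[0]
    image.save("guided.png")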
Bing's DALL-E 3 is bleeding edge and a huge leap forward in prompt comprehension. Explained by the fact that it uses GPT-4 as the text encoder, so… similar performance probably won't be available on your local SD instance in the near future.
Hmm, that paper is interesting, and descriptions are definitely a big bottleneck in training. But they only say they used T5-XXL in the experiments related to the paper. Who knows what Bing is actually running?
A text encoder is the part that takes the prompt, tokenizes it, and then turns the tokens into an embedding [1] (an n-dimensional vector in the image (or latent image) space). The embedding is then used to "push" the noise diffusion process towards the desired part of the image space. The encoder is a net distinct from the diffusion net (and if the diffusion is done in a latent space like SD does, there's yet another net, so-called VAE, responsible for the image<->latent mapping). SD uses CLIP [2] as its text encoder.
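To make that concrete, here's roughly what the text encoder step looks like in isolation, using the CLIP encoder that SD 1.x ships with:

    # Sketch: the text encoder step on its own, with SD 1.x's CLIP text encoder.
    import torch
    from transformers import CLIPTokenizer, CLIPTextModel

    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

    tokens = tokenizer(
        "a watercolor painting of a lighthouse",
        padding="max_length", max_length=77, truncation=True, return_tensors="pt",
    )
    with torch.no_grad():
        # Shape (1, 77, 768): one 768-d vector per token; this is what conditions the UNet.
        prompt_embeds = text_encoder(tokens.input_ids).last_hidden_state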
At a high level, the SD pipeline looks like this: text prompt -> text encoder -> UNet (diffusion in latent space) -> VAE decoder -> image. The text encoder is what "interprets" the prompt for the UNet (which does the actual diffusion).
SDXL's better text encoder is partially why it can understand sentences better than SD 1.5, and you can swap it out for, say, BLIP for combined image + text prompting.
Open source has been behind proprietary options since the Midjourney release; SD was never better in terms of fidelity. That's how things work in general: FOSS trails commercial offerings.
I have a stupid question. I saw someone below ask about compatibility with A1111. Why can't I just use this like I use any of the 10s of 1.5 models I have downloaded? Does SDXL use a different inference tech or something? I have the same question with SDXL overall but I understand the refiner is important there.
I have stuck with 1.5 because SDXL wasn't that exciting to me, and ultimately I haven't been clear on what software needs to change (and how/why) to be "compatible".
Hah, one of your comments weeks ago got me to try Fooocus-MRE after getting nowhere with SDXL on A1111. Absolutely blew away my previous results in so much less time. I was able to illustrate a whole TTRPG session in an hour or two.
I (for similar reasons to yours) didn't do the upgrade 1.5 -> SDXL until last week. I found the upgrade totally easy in the end.
I just had to update A1111, download the SDXL model, download the SDXL refiner model, and set the refiner in the refiner dropdown. That's it. So compared to other models it is just one additional dropdown click.
You won't be able to use the LoRAs, etc. you are used to, but you can download new ones that are compatible with SDXL.
For someone out of the loop, will this eventually make it into automatic1111 the same way that oobabooga rolls out support for new llama models a few days after they're published?
You should be using Fooocus (which IMO has the best out-of-the-box quality), InvokeAI or ComfyUI for it instead.
> oobabooga rolls out support for new llama models
ooba is kinda different from a1111 under the hood. It's a big bundle of different LLM backends with as much common code hacked in as possible, while a1111 is the single, original SAI backend with a ton of features globbed on.
Technically ooba doesn't have to "support" any new models at a low level; that's typically the job of the backends. What it adds are new features and integrations.
My impression has always been that comfy and invoke are better as strictly-SD UIs out of the box, but A1111 disproportionately gets support for new external-to-SD tools via extensions, etc., first.
Fooocus is new and seems focused more on casual use compared to the other three.
If the model is 50% smaller, then it has a different number of weights and is basically a different model, not compatible with anything made for the base model.