Segmind Stable Diffusion – A smaller version of Stable Diffusion XL (huggingface.co)
95 points by sayak_hf on Oct 25, 2023 | hide | past | favorite | 39 comments


I'd bet this has a significant impact on the community. From what I've seen - experimenting with an AMD card that crashes and sends me to bed early after a bit of stable diffusing - there is a power-law distribution of graphics cards out there. Models that play nicely with less memory and slower cards get more attention from the hordes of anonymous copyright infringers who power the llama-wearing-tophat LoRAs and checkpoints.

SDXL isn't there yet but we're getting mighty close. SD1 felt a lot like shouting words at an idiot savant to try and inspire something good. SDXL gives me the impression I'm explaining what I want to a talented but verbally challenged painter who will attempt to draw what I want. The trajectory seems to be that we'll reach something close to natural English in maybe 2-6 years and if AMD could please keep fixing crashes that'd be lovely.

But seriously, halving the model size and doubling the speed will mean a lot more people can use it. That is a big deal because some of them will spend time training in more data for specific applications and setting up tooling to work around limitations.


A problem with SDXL is that it's just different. The main prompting syntax is different, and many UIs don't even implement its positive/negative style prompting.

SD 1.5 has a ton of inertia, and ultimately it's still smaller than SDXL.


I, for one, hope the llama-wearing-tophat LoRAs meet the smarmy-looking turtleneck mfers who just HAD to rebind all the sane keybindings on Macs in order to 'think different' and build a usability moat around their OS. And I hope they kick butt when they do. But that's just on a personal level.

I love that LLMs are getting more accessible and I want that trend to continue.


The Segmind Stable Diffusion Model (SSD-1B) is a distilled 50% smaller version of the Stable Diffusion XL (SDXL), offering a 60% speedup while maintaining high-quality text-to-image generation capabilities. It has been trained on diverse datasets, including Grit and Midjourney scrape data, to enhance its ability to create a wide range of visual content based on textual prompts.


Gives me an error:

    'list' object cannot be interpreted as an integer
By the way: What is currently the most capable text to image model on Hugging Face? I tried stable-diffusion-xl-base-1.0 but it seems far behind what the Bing image creator is creating these days. Is open source falling behind again?


Bing is using OpenAI's DALL-E 3, which is somewhat ahead right now. However, SDXL (Stable Diffusion XL) can produce some pretty fantastic results, but it helps a lot to use a refining model (https://huggingface.co/stabilityai/stable-diffusion-xl-refin...) at the end of the generation. The automatic1111 webui has this built in as an automatic step, but other services have their own approaches.

(BTW, regarding the error you saw - use this instead https://huggingface.co/spaces/segmind/Segmind-Stable-Diffusi... )


> Bing is using OpenAI's DALLE-3 which is somewhat ahead right now.

A lot of that is using an LLM for prompt expansion and composition guidance, both of which can be done with SD just by looping in an LLM for those purposes, though there isn't support in any of the popular UIs for LLM composition assistance (there are recent papers demonstrating it, though, so I wouldn't be surprised to see extensions for popular UIs within a month.)

EDIT: corrected to clarify that only composition assistance isn’t yet supported in popular UIs, because prompt expansion is built in to Fooocus.


> Is open source falling behind again?

My impression is there's a large open development community that has come up with some really clever stuff - fine-tuning, LoRAs, ControlNets, regional prompting, and so on. You can teach the model completely new concepts and styles at home, for <$1 in electricity.
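To illustrate why that's so cheap: a LoRA trains only a low-rank update on top of frozen weights, so there are far fewer parameters to learn and store. A minimal sketch with made-up shapes (nothing like SDXL's real layer sizes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy LoRA sketch: instead of fine-tuning the full weight matrix W,
# train a low-rank update B @ A and add it on top of the frozen base.
d_out, d_in, rank = 64, 64, 4
W = rng.normal(size=(d_out, d_in))   # frozen base weight
A = rng.normal(size=(rank, d_in))    # trainable "down" projection
B = np.zeros((d_out, rank))          # trainable "up" projection, zero-init

W_adapted = W + B @ A                # at init this equals the base model exactly

full_params = d_out * d_in           # what a full fine-tune would store
lora_params = rank * (d_in + d_out)  # what the LoRA stores: 8x fewer here
```

At real SDXL layer sizes the savings are far larger, which is why a LoRA can be trained on a consumer card and shipped as a small file.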

But it's all been built on top of a base model trained by Stability AI at a cost of $600k (or at least, it would have cost that at AWS GPU prices).

And some imperfections like mangled hands, prompt concept bleed, the inability to generate clock faces and mirrors etc seem to go deep into the base model, and beyond the point that fine-tuning can fix. And there aren't many people with $600k of spare cash lying around to train a new base model.

The outputs of the open source models are great despite such weaknesses IMHO - but the costs involved mean the open source community is fighting with one hand tied behind its back.

Of course, open models will always win in some areas - like if you want to generate a picture of Xi Jinping.


> But it's all been built on top of a base model trained by Stability AI at a cost of $600k (or at least, it would have cost that at AWS GPU prices).

The LAION dataset they used was notoriously poor quality, and the training process wasn't optimal by any measure, though. The costs of training are rapidly falling due to various optimizations.

Take a look at Pixart-alpha [0]. They claim SDXL-comparable performance for just $26k in training from scratch, with just 600M parameters in the unet and 25M pictures in the training set. Supposedly they achieved this due to the high quality training set tagged by a third-party model. The weights got leaked recently and the claim looks believable.

[0] https://pixart-alpha.github.io/


This looks very promising and the sample images are impressive, especially for the small model size.

Unfortunately, according to their Readme, "Inference requires at least 23GB of GPU memory." So not something you can run on consumer hardware in its current state.


That's because they slapped an 11B transformer on top of it for better prompt understanding; you can run it quantized to 8-bit and it will take about 14GB total, which some people have already tested. This can be reduced further, or the encoder swapped for another one; the initial code for these models is typically poorly optimized (SD 1.4 required 12GB at first, later reduced to 4 and even 2GB). The part of the model that actually generates the picture is even smaller than SD 1.x's.
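The rough arithmetic behind those figures, counting only the text encoder's weights (my own back-of-the-envelope numbers; the unet, VAE and activations add more on top):

```python
# Back-of-the-envelope VRAM for an ~11B-parameter text encoder alone.
params = 11e9

fp16_gb = params * 2 / 1024**3   # 16-bit floats: 2 bytes per weight -> ~20 GB
int8_gb = params * 1 / 1024**3   # 8-bit quantization: 1 byte per weight -> ~10 GB

print(f"fp16: {fp16_gb:.1f} GB, int8: {int8_gb:.1f} GB")
```

That matches the reported drop from ~23GB to ~14GB total once the rest of the pipeline is added back in.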


Stable Diffusion's strength (one of) is in the ability to be guided by sketches and references, which is much faster and more expressive/precise than anything achievable with a text prompt if you're good at it. If you want that, you want SD, it has vast amounts of related tooling around it. If text-to-image is enough for you, you need one of the commercial image generators with better prompt understanding.


I can't get decent quality out of SDXL in base diffusers. Which is unfortunate, as HF diffusers is fast and very hackable.

But try it in Fooocus, with its sea of augmentations (including Dall E-like prompt expansion): https://github.com/lllyasviel/Fooocus


How do I use that on Hugging Face?


You cannot, but you can host instances on Replicate, Vast.ai and others.


Bing's DALL-E 3 is bleeding edge and a huge leap forward in prompt comprehension. Explained by the fact that it uses GPT-4 as the text encoder, so… similar performance probably not available on your local SD instance in the near future.


DALL-E 3 uses T5-XXL for the text encoder, not GPT-4. The huge leap comes mostly from using text descriptions that aren't trash.

https://cdn.openai.com/papers/dall-e-3.pdf


Hmm, that paper is interesting, and definitely descriptions are a big bottleneck in training. But they only say they used T5-XXL in the experiments related to the paper. Who knows what Bing is actually running?


What is a "text encoder" and why does a text-2-image net need it?


A text encoder is the part that takes the prompt, tokenizes it, and then turns the tokens into an embedding [1] (an n-dimensional vector in the image (or latent image) space). The embedding is then used to "push" the noise diffusion process towards the desired part of the image space. The encoder is a net distinct from the diffusion net (and if the diffusion is done in a latent space, as SD does, there's yet another net, the so-called VAE, responsible for the image<->latent mapping). SD uses CLIP [2] as its text encoder.

At a high level, the SD architecture looks like this:

                  noisy latent -.    +----------+
            +------+             `-> |   UNet   |                       +-----+
  prompt -> | CLIP | -> embedding -> | diffuser | -> denoised latent -> | VAE | -> output image
            +------+                 +----------+                       +-----+

[1] See, eg. this article currently on the HN front page: https://simonwillison.net/2023/Oct/23/embeddings/

[2] https://openai.com/research/clip
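The flow in that diagram can be sketched with stand-in components - a toy vocabulary and random matrices in place of the real CLIP, UNet and VAE - just to show how the pieces connect:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a 5-word vocabulary for the tokenizer, random matrices
# for the "encoder" and "VAE". Real SD uses CLIP, a UNet, and a learned VAE.
VOCAB = {"a": 0, "photo": 1, "of": 2, "cat": 3, "<unk>": 4}
EMBED_DIM = 8
embed_table = rng.normal(size=(len(VOCAB), EMBED_DIM))

def encode_prompt(prompt):
    """Tokenize, then look up one embedding vector per token (the CLIP step)."""
    ids = [VOCAB.get(w, VOCAB["<unk>"]) for w in prompt.lower().split()]
    return embed_table[ids]                    # (n_tokens, EMBED_DIM)

def denoise_step(latent, embedding):
    """One 'UNet' step: nudge the noisy latent toward the prompt's direction."""
    target = embedding.mean(axis=0)            # pool token vectors
    return latent + 0.1 * (target - latent)

vae_decode = rng.normal(size=(16, EMBED_DIM))  # 'VAE': latent -> pixel space

emb = encode_prompt("a photo of cat")
latent = start = rng.normal(size=EMBED_DIM)    # the noisy latent
for _ in range(20):                            # iterative denoising loop
    latent = denoise_step(latent, emb)
image = vae_decode @ latent                    # decode the latent to 'pixels'
```

The real versions of each box are learned networks, but the data flow - prompt in, embedding conditions a denoising loop, decoder out - is the same.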


It's what "interprets" the prompt for the unet (which does the actual diffusion).

SDXL's better text encoder is partially why it can understand sentences better than SD 1.5, and you can swap it out for, say, BLIP for combined image + text prompting.


They've explained it, but I'll point out that DALL-E 3 doesn't use GPT-4 for the text encoder but T5-XXL.

The huge leap mostly comes from text descriptions that aren't trash.


Open source has been behind proprietary options since the Midjourney release; SD was never better in terms of fidelity. That's how things work in general: FOSS trails commercial offerings.


I have a stupid question. I saw someone below ask about compatibility with A1111. Why can't I just use this like I use any of the 10s of 1.5 models I have downloaded? Does SDXL use a different inference tech or something? I have the same question with SDXL overall but I understand the refiner is important there.

I have stuck with 1.5 because SDXL wasn't that exciting to me and ultimately I haven't been clear what software needs to change (and how/why) to be "compatible"


> Does SDXL use a different inference tech or something?

It's extremely similar, but not drop-in compatible, no.

What's more, small changes make it compatible, but larger changes are needed for SDXL to work well.

Personally I find SDXL to be a huge upgrade, at the cost of performance, hence I am using Fooocus but keeping an eye on https://github.com/FizzleDorf/ComfyUI-AIT

As I have been preaching on HN, just don't use A1111 for SDXL.


Hah, one of your comments weeks ago got me to try Fooocus-MRE after getting nowhere with SDXL on A1111. Absolutely blew away my previous results in so much less time. I was able to illustrate a whole TTRPG session in an hour or two.


Well now Fooocus MRE is obsolete :P. The dev shuttered it, but most of the functionality is now upstreamed into the advanced/debug menus of Fooocus.

And yeah, the difference is dramatic!


>As I have been preaching on HN, just don't use A1111 for SDXL.

Why, what am I missing by doing so?


Better speed or quality in other UIs.

Less jankiness.

For me, personally, better hackability, especially in `diffusers` UIs (though Comfy-based UIs seem to be SOTA for SDXL).

EDIT: Also, other user interfaces are more "suited" to SDXL and its presets+style prompting. A1111 is very much built for SD 1.5


I (for similar reasons to yours) didn't do the upgrade 1.5 -> SDXL until last week. I found the upgrade totally easy in the end.

I just had to update A1111, download the SDXL model, download the SDXL refiner model and set the refiner in the refiner dropdown. That's it. So compared to other models it is just one additional dropdown click.

You won't be able to use the LoRAs, etc. you are used to, but you can download new ones that are compatible with SDXL.


For someone out of the loop, will this eventually make it into automatic1111 the same way that oobabooga rolls out support for new llama models a few days after they're published?


TBH, a1111's SDXL support is still not very good.

You should be using Fooocus (which IMO has the best out-of-the-box quality), InvokeAI or ComfyUI for it instead.

> oobabooga rolls out support for new llama models

ooba is kinda different than a1111 under the hood. It's a big bundle of different LLM backends with as much common code hacked in as possible, while a1111 is the single, original SAI backend with a ton of features globbed on.

Technically ooba doesn't have to "support" any new models at a low level; that's typically the job of the backends. What it adds are new features and integrations.


My impression has always been that comfy and invoke are better as strictly-SD UIs out-of-the-box, but A1111 disproportionately gets support for new external-to-SD tools via extensions, etc., first.

Fooocus is new and seems focussed more on casual use compared to the other three.


Fooocus actually has lots of knobs hidden behind the debug menu, and default augmentations are hidden because there is no reason to disable them.

It's, by a mile, the best SDXL output I've gotten from any UI, even with lots of tweaking.

A1111 indeed has tons of tools built into the UI through extensions, but they are mostly focused on SD 1.5.

ComfyUI and Invoke bundle lots of tools too, but they are "hidden" behind the nodes UIs and sometimes less functional.


> A1111 indeed has tons of tools built into the UI through extensions, but they are mostly focused on SD 1.5.

IME, most of them are agnostic, because most of them wrap around rather than hooking into the inference process.


What about the refiner? Does this just do away with SDXL's second step?


Does this work well with SDXL compatible LoRAs?


If the model is 50% smaller, then it has a different number of weights and is basically a different model, not compatible with anything made for the base model.
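The mismatch can be seen with made-up layer sizes (nothing like the models' real dimensions): a LoRA delta trained against one weight shape simply cannot be added to a layer of a different shape.

```python
import numpy as np

# Toy illustration of the incompatibility (invented shapes, not real layers).
W_base = np.zeros((64, 64))        # stand-in for a base-model layer
W_distilled = np.zeros((48, 48))   # stand-in for the smaller model's layer
lora_delta = np.ones((64, 64))     # an update trained against W_base

merged = W_base + lora_delta       # fine: shapes match
try:
    W_distilled + lora_delta       # fails: this LoRA doesn't fit this model
except ValueError as err:
    mismatch = str(err)            # "operands could not be broadcast ..."
```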


This makes sense. Thank you for clarifying.



