Stability AI Launches Stable Diffusion XL 0.9 (stability.ai)
171 points by seydor on June 22, 2023 | 106 comments



Am I the only one who doesn't see an obvious difference in quality between the left and right photos? (Maybe the wolf one.) And these are extremely curated examples!


Objective comparison is always so tricky with Stable Diffusion. They should show off large batches, at the very least.

I think Stability is ostensibly showing that the images are closer to the prompt (and the left wolf in particular has some distortion around the eyes).


It really is a much better base model aesthetically than 1.5, 2.1, etc.

Comps here - https://imgur.com/a/FfECIMP


Thanks!

Does anyone have comparisons of how the model does on specific artist styles?

Simple prompts like "By $ARTISTNAME" worked very well in SD v1.5, and less so in v2.x, depending on the artist in question.


No, I have generated a few thousand Midjourney images, and there is quite a difference in these images, actually.

It is hard to describe but there is a very unnatural "sheen" to the images on the left.

The SDXL 0.9 images look more photorealistic, but they still aren't quite at the level that Midjourney can reach.

The best example is the wolf's hair between the ears in the SDXL 0.9 image. It is just a little too noisy and wavy compared to how a real wolf photo would look. Midjourney 5.1 --style raw would still handily beat this image when making a photorealistic wolf.

The jacket on the alien in the SDXL 0.9 image also has too much of that AI sheen, but it kind of works in this image as an effect for the jacket material, so it's not really the best example.

The coffee cup isn't very good in either of them, IMO. The trees on the right are still not blurred quite right. They are hiding the hand in the image on the right, too. You can see how bad the little and ring fingers are in the left image.

Obviously, this is all very nitpicky.


For the aliens, the left image has much more realistic gradation. The one on the right looks like the grays have been crushed out of it. There's also a funky glow coming from the right edge of the alien.

I'd say the blur effects on the left images are much cleaner as well. There are some weird artifacts at the fringes of objects in the earlier version.


That's because they are actually comparing an old version of SDXL with the new one. The old version had already improved things...

The real comparison should be with SD 1.5/2.1, and against those it is WAY better.


At the resolution provided, they are indeed very close. In my eyes:

In the first example, the second image is more representative of Las Vegas to a foreigner like me, but neither of them has the scratchy found-film look the prompt asks for.

In the second example, both fit the prompt, but the first image looks more like it comes from a documentary than the second one.

In the third example, the hand in the second picture looks much better.


The wolf looks better, but also looks less like what you'd see in a "nature documentary" (part of the prompt).

I think the coffee cup looks better in the right photo; it seems a tad more real to me.

Like you, I much prefer the alien photo on the left, but the photos are so stylistically different I'm not sure that says anything about the releases' respective capabilities.


I prefer the composition of the beta model over the release. Quality wise I can’t say one is better than the other. Maybe the hand in the coffee picture is better for the 0.9 model.


The hands look better, but there are still hints of a sixth finger in each of them.


The last image has big toes for thumbs.



Nothing a hands LORA can't fix.


True, but you still won’t get a coherent picture of someone picking their nose I bet.


and 5 phalanges


For anyone dumb like me: this is NOT a routine announcement, definitely read it through. The improvements they show off are honestly stunning.

Also, they’ve (re-) established a universal law of AI: fuck it, just ensemble it


Apparently that's what GPT-4 is too: eight GPT-3.5s ensembled together.

Not sure if true, sounds plausible tho.


How does ensembling work?


Combining the results of multiple models and then adding another layer onto the combined output tends to increase accuracy / reduce error rates. (not new to AI: it's been done for over a decade)
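A toy sketch of the simplest version of that, averaging member predictions (the models and sizes here are made up, just to show the mechanics):

    import torch
    import torch.nn as nn

    # Three independently trained (here: stand-in, untrained) classifiers over 10 classes.
    models = [nn.Linear(32, 10) for _ in range(3)]

    x = torch.randn(4, 32)                                    # a batch of 4 inputs
    avg_logits = torch.stack([m(x) for m in models]).mean(0)  # average the members' outputs
    prediction = avg_logits.argmax(dim=-1)                    # (4,) class indices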


But how to combine the results?


Usually just by concatenating the final hidden states (before any classifier/regression/image output head)
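For example (a minimal PyTorch sketch; the backbones, sizes, and head are invented for illustration):

    import torch
    import torch.nn as nn

    # Each backbone produces a feature vector ("final hidden state").
    backbones = [nn.Sequential(nn.Linear(32, 64), nn.ReLU()) for _ in range(3)]
    head = nn.Linear(3 * 64, 10)  # the extra layer trained on the combined features

    x = torch.randn(4, 32)
    features = torch.cat([b(x) for b in backbones], dim=-1)  # (4, 192)
    logits = head(features)                                  # (4, 10)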


In this case, the first model generates a general image, and the second model handles finer details
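With the HF diffusers SDXL pipelines it would presumably look something like this (a sketch only; the model IDs and the exact denoising hand-off are assumptions until the weights and docs land):

    import torch
    from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

    base = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16
    ).to("cuda")
    refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-0.9", torch_dtype=torch.float16
    ).to("cuda")

    prompt = "a wolf in a nature documentary, golden hour"
    # The base model handles most of the denoising and returns a latent...
    latent = base(prompt=prompt, output_type="latent", denoising_end=0.8).images
    # ...then the refiner takes over the last steps and adds fine detail.
    image = refiner(prompt=prompt, image=latent, denoising_start=0.8).images[0]
    image.save("wolf.png")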


Many ways, for one: https://magicfusion.github.io/


So it's non-commercial but they are adding it to the API on Monday? Does that mean it will be commercial then?


“The model can be accessed via ClipDrop today with API coming shortly. Research weights are now available with an open release coming mid-July as we move to 1.0.”

I read this as: commercial use through our API now, self hosted commercial use in July.


"Research weights" seems to imply non-commercial use only though, right?


"with an open release coming mid-July"


Clipdrop is owned by Stability AI and you can access the model now, with its own API: https://clipdrop.co/stable-diffusion

The Stability AI API/DreamStudio API is slightly different. Yes, it's confusing.


That's how copyright works: the non-commercial restriction applies to everyone who copies the model from them, not to anyone producing their own model.


NGL, I can't wait to get hold of this model file and run it locally; I'll be sure to do a write-up on it on my AI blog https://soartificial.com. I just hope that my GPU can handle it. I don't think 8GB of VRAM is going to be enough; I might have to tinker with some settings.

I'm just looking forward to the custom LoRA files we can use with it :D


> Nvidia GeForce RTX 20 graphics card (equivalent or higher standard)

RIP my 1080 TI.

Does anyone know what specific feature they need which 20+ cards have and older ones don't?


Not enough RAM in your 1080 Ti.

Edit: it's not the RAM. The 1080 Ti has 11GB, and this press release says it requires 8. So I'm going to speculate that it's because the 1080 lacks the tensor cores of the 20-series' Turing architecture.


Since it is now split into two models to do the generation, you could load one and do the first stage of a bunch of images, then load the second and complete them, with half the VRAM usage.
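A sketch of that two-pass idea, assuming diffusers-style base/refiner pipelines (the model IDs and APIs are assumptions at this point):

    import torch
    from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

    prompts = ["a wolf in a nature documentary", "an alien wearing a leather jacket"]

    # Pass 1: only the base model is resident; collect latents for every prompt.
    base = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16
    ).to("cuda")
    latents = [base(prompt=p, output_type="latent").images for p in prompts]
    del base
    torch.cuda.empty_cache()  # give the VRAM back before loading the second model

    # Pass 2: only the refiner is resident; finish each image.
    refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-0.9", torch_dtype=torch.float16
    ).to("cuda")
    for i, (p, lat) in enumerate(zip(prompts, latents)):
        refiner(prompt=p, image=lat).images[0].save(f"out_{i}.png")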


I believe the HF pipeline can do this already, and I assume each stage uses more than 4 GB of VRAM. There are other tricks the open-source community will come up with, though.


8-bit (and 4-bit?) quantization is low-hanging fruit, assuming it's not already running in 8-bit.

An 8GB requirement kinda sounds like they have already quantized the model, though.


Funny, knowing the stupidity Nvidia is pulling with the 4xxx series regarding RAM amounts.


The post says it only needs 8GB, and my 1080 has 11GB.


My guess is low precision support, or some newer ops Pascal does not support in a custom CUDA kernel.


Low precision support, almost certainly. SD 1.5 needs almost twice the memory on a 10xx card as on 20xx, because you can't use FP16; a triple bummer, since that makes it even slower (memory bandwidth!) and you don't have as much to begin with.
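Roughly, the difference is whether you can load the weights in half precision at all. A sketch against SD 1.5 (on a 10xx card you are effectively stuck with the fp32 variant):

    import torch
    from diffusers import StableDiffusionPipeline

    # fp32: ~4 bytes per weight -- the only practical option without proper FP16 support.
    pipe_fp32 = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

    # fp16: roughly half the weight memory, and faster on cards with good FP16 paths.
    pipe_fp16 = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    )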


RTX 20-series cards have tensor cores. The 1080 does not.


Any speculation why the AMD cards require twice the VRAM that Nvidia cards do? I have an RX 6700 XT and I'm disappointed that my 12 GB won't be enough.


Probably no 4-bit quantization support? Or they are missing some ops that the tensor core cards have?

My guess is AMD users will eventually get low VRAM compatibility through Vulkan ports (like SHARK/Torch MLIR or Apache TVM).

Then again, the existing Vulkan ports were kinda obscure and unused with SD 1.5/2.1


You will want an optimized implementation from torch-mlir or apache tvm anyway.

We had this for SD 1.5, but it always stayed obscure and unpopular for some reason... I hope it's different this time around.


Your 6700 XT won't work anyway, since it's not supported by ROCm.


The 6000 series "unofficially" works with rocm, and that is hopefully getting more official.


That's not accurate; I was able to make my old RX 5500 XT work with ROCm some months ago, though only after a 6-hour compilation process...


It's not supported; only Navi 21 is supported, which means RX 6800 and up.

https://github.com/RadeonOpenCompute/ROCm/issues/1668


It works on my 6900XT with ROCm


Because anything higher than the 6800 is unofficially supported.


If Emad tweeted this image [1] made with SDXL, then text in images could possibly be better!

[1] https://twitter.com/emostaque/status/1671885525639380992


Text will be better due to simple scale, but it will still be limited by the use of CLIP for text encoding (BPEs + contrastive). So that may be SDXL 0.9, but it should still be worse than models that use T5, like https://github.com/deep-floyd/IF


That’s probably DeepFloyd.


It is not. SDXL can do text, at least some types of text.


Cool!


Is the Mac Studio with the Apple M2 Max (12-core CPU, 30-core GPU) enough for something like this?


System requirements

Despite its powerful output and advanced model architecture, SDXL 0.9 is able to be run on a modern consumer GPU, needing only a Windows 10 or 11, or Linux operating system, with 16GB RAM, an Nvidia GeForce RTX 20 graphics card (equivalent or higher standard) equipped with a minimum of 8GB of VRAM. Linux users are also able to use a compatible AMD card with 16GB VRAM.

I’m guessing that it will work eventually, though I’m not sure who will make that happen.


Apple ported Stable Diffusion 1.5/2.1 to MPS themselves.

If they don't do it for SDXL, the port will probably take a while (if it happens at all).


I've used Apple's port of Stable Diffusion on my Mac Studio with M1 Ultra and it worked flawlessly. I could even download models from Hugging Face and convert them to a CoreML model with little effort using Apple's conversion tool documented in their Stable Diffusion repo [1]. Some models on Hugging Face are already converted – I think anything tagged with CoreML.

[1] https://github.com/apple/ml-stable-diffusion


Likely not. Even with earlier models, performance may be finicky: https://huggingface.co/docs/diffusers/optimization/mps

Additionally, for Apple Silicon you likely need 64 GB RAM (since CPU/GPU memory is shared) which is expensive.
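For what it's worth, the linked guide's recipe for the earlier models looks roughly like this (a sketch; how well it holds up memory-wise on a 30-core M2 Max is exactly the open question):

    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    pipe = pipe.to("mps")               # Metal backend on Apple Silicon
    pipe.enable_attention_slicing()     # recommended on MPS to keep memory in check

    # The diffusers MPS guide suggests a one-step warmup pass on some PyTorch versions.
    _ = pipe("warmup", num_inference_steps=1)

    image = pipe("a wolf in a nature documentary").images[0]
    image.save("wolf_mps.png")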


I have an M2 MBP with 64 GB RAM. Performance with the older models is very good in my opinion. It feels like it runs faster locally than DreamStudio. I don't have benchmarks, but in any case the performance is not bad.


I’ve had good results with SD1.4/2 with MPS acceleration on similar hardware (M1 Max, though with 64gb). No stability issues with MPS, either. I’d say don’t rule it out just yet.


My guess is you will need CUDA support until someone ports it to MPS.


What's the dataset? How commercially viable/legally questionable is it?

This is critical, for legality of use, ethics concerns, and the quality of the output (as overly zealous filtering can degrade the model like it did for SD 2.0).


If I recall correctly, Stability AI’s process to skirt copyright is to have the training data compiled and model weights trained by a third-party university. Educational research institutions have more lax requirements around copyright. That may or may not be a legitimate way to work under existing laws, but doesn’t tell us much about what the moral/ethical/legal considerations should be, which seems like an open question.


That sounds more like their story for SD 1.5 last year. I think there was some kerfuffle between Stability.ai and Runway.ai/Heidelberg Uni (see the Forbes article; won't link as I'm unclear on the veracity), who they were working with, and they may have parted ways by their first indie work on SD 2.x around the holidays. Either way, the Uni connection story may be old.




So, there are at least a few dozen AI image generating sites, some specialized, others not. Are they all powered by SD? Maybe just with some better pre-prompting? Or are there other engines (e.g., DALL-E)?

AFAIK only SD can be run locally?


I've only run across 3 primary models: Midjourney (via their Discord), Dall-E and SD. And yes, there's a bunch of sites, but I've seen very similar quality to SD and no mention yet of a different base.

I do expect there are other bases out there, but haven't seen any of quality yet.

Before this release (XL 0.9) it's been unclear how much of the SD quality was in-house or came from their prior collab with Runway/Heidelberg.


Kandinsky, Deep Floyd... Also Midjourney is derived from SD I believe.


I don't think MJ is from SD.. found no mention of SD on their site or on Wiki, besides a comparison. Any citation?


None of those is derived from SD


I wasn't saying they were. Read back one more message.

(However, I thought Midjourney definitely was at some point)


Most sites/apps use Stable Diffusion, yes. Custom models of it, because the base model of SD 1.5/2.1 is really... heh, complicated, to say the least.


The progress in AI is great, but very hard to keep up with (or understand). It's real to me once it makes it into Automatic1111.


Small rant:

I don't like how SD consolidated around the A1111 repo. The features are great, and it was fantastic when SD was brand new... but the performance and compatibility are awful, the setup is tricky, and it sucked all the oxygen out of the room that other SD UIs needed to flourish.


I had the same issues with that repo in addition to some early release controversy around its provenance.

I ended up using https://github.com/easydiffusion/easydiffusion

which has served me well so far.


Easy Diffusion is, uh, easily my favorite UI.

While it has a fraction of the features found in stable-diffusion-webui, it has the best out-of-the-box UI I've tried so far. The way it enqueues tasks and renders the generated images beats anything I've seen in the various UIs I've played with.

I also like that you can easily write plugins in Javascript, both for the UI and for server-side tweaks.


The problem is those features/extensions in A1111 are absolutely killer once you use them. I assume they have ControlNet support now but I couldn’t do what I often do without regional prompter. Adetailer is also amazing.


Easy Diffusion does much less. No ControlNet support. It only just got LoRA support (at least in the beta channel). If A1111 is "professional", Easy Diffusion is maybe "hobbyist".

I use A1111 as a tool, but if I want to goof off, I queue up a bunch of prompts in Easy Diffusion and end up with a gallery built in real time. Its smaller feature set makes it great for that.


I like to run A1111 in --api mode and write my own script to drive it over HTTP.
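The script side is pretty small. A sketch against the webui's /sdapi/v1/txt2img route (host, port, and payload fields may need adjusting for your setup):

    import base64
    import requests

    payload = {
        "prompt": "a wolf in a nature documentary, golden hour",
        "negative_prompt": "blurry, lowres",
        "steps": 30,
        "width": 512,
        "height": 512,
    }
    resp = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload)
    resp.raise_for_status()

    # The webui returns the generated images as base64-encoded PNGs.
    for i, b64 in enumerate(resp.json()["images"]):
        with open(f"out_{i}.png", "wb") as f:
            f.write(base64.b64decode(b64))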


VoltaML is making good progress. InvokeAI is also pretty good (but not as optimized/bleeding edge).


It’s open source. The only way to compete is on your merits in open source.

If you want another UI to flourish, clone both it and A1111, copy and paste the bits from A1111 you'd like to have in yours (with attribution) and push it up along with any features you personally want.

That does require developer time, and developers may converge on a popular implementation with good tests and lots of features as it’s easier to contribute.

The bottleneck isn’t really the community though, it’s the developers.


It's not that simple, as A1111 uses the old Stability AI implementation while pretty much everything else uses the HF diffusers code.

I worked on trying to add torch.compile support to A1111 for a bit, fixing some graph breaks locally, but... it was too much. Some other things, like ML compilation backends, are also basically impossible.


What benefits does the Huggingface diffusers(?) implementation have over A1111?


- Compatibility with stuff from research papers and ML compilers since it is the "de facto" SD implementation.

- The codebase is cleaner, more hackable, and (compared to the base SAI code) more performant.

- HF continues to put lots of work into optimization and cleanup. For instance, they ensure there are no graph breaks for torch.compile, and they work with other hardware vendors on their own SD implementations.
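The torch.compile point looks something like this in practice (a sketch; the speedup depends on GPU and PyTorch version):

    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Because the diffusers UNet has no graph breaks, the denoising step can be
    # compiled once and reused across iterations.
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

    image = pipe("a wolf in a nature documentary").images[0]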


Do you know what it would take to support longer prompts with HF? The 77-token limit is what's so far stopped me from making the jump.


It's super easy; in fact, I think they specifically have a long-prompt pipeline. Look at the implementation in pretty much any diffusers UI (like VoltaML or Invoke).

Facebook's AITemplate backend even supports long prompts now.
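One concrete route is the community "lpw_stable_diffusion" (long prompt weighting) pipeline, which chunks the prompt past 77 tokens and stitches the CLIP embeddings back together. A sketch (parameter names may differ between diffusers versions):

    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        custom_pipeline="lpw_stable_diffusion",   # community long-prompt pipeline
        torch_dtype=torch.float16,
    ).to("cuda")

    long_prompt = "an extremely detailed, award-winning photo of ... " * 20  # well past 77 tokens
    image = pipe(prompt=long_prompt, max_embeddings_multiples=3).images[0]
    image.save("long_prompt.png")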



I'm already seeing the speedrun to implement this model architecture in the automatic1111 webui.


The A1111 backend is kinda not set up for this, as it is built around the old Stability AI 1.5/2.1 implementation (not HF diffusers, which most other backends use).

It would basically be a rewrite, if I were to guess... And at that point they might as well port everything to diffusers.


Implementing the model architecture and supporting it in a web UI are two different things; maybe a link to the PR?


Do you mean that supporting the model in automatic webui is much more difficult than just implementing the model in a repo? I guess.


Is there image + text to image prompting?

Can I do dreambooth here? If so, what commands do I use?


You can do this in SD 1.5/2.1 already. Encode the image like you do for pix2pix, and the text, then average those latents together.

Dreambooth is gonna require an A100 I think... I doubt it will work on the free (16GB VRAM) Colab instances.


Curious to know how this compares to Midjourney v5


It's not better out of the box for text-to-image; this has been known for quite some time. However, as soon as they release the weights (in a month, as they promise), it will benefit from the tooling available for SD, without being limited to text-to-image.

It's also a foundational model, not a finished product, and MJ will possibly use it, like they did in v4 with SD 1.5.


Midjourney only used a combination of SD and their own stuff with the --beta and --test/testp models, which came between v3 and v4; the other versions have no connection to SD.


MJv4 is not related to SD at all


I might be misremembering, but didn't they announce on their Twitter that they were using SD somehow for MJ v4? Later deleting this with a bunch of other tweets.


There were test/testp models that were based on SD, but v4 and v5 are created from scratch.


Emad said SD cost $600,000 to train. I wonder if Midjourney also had to pay that to train from scratch.


Got a link for that? I'm genuinely asking as I didn't know this.


Midjourney is IMO still better (especially because it can do hands), but this actually comes pretty close. I've created some amazing pictures with it already!


Until it gets a ControlNet release, MJ will be better.



