Am I the only one who doesn't see an obvious difference in quality between the left and right photos? (Maybe the wolf one.) And these are extremely curated examples!
Objective comparison is always so tricky with Stable Diffusion. They should show off large batches, at the very least.
I think Stability is ostensibly showing that the images are closer to the prompt (and the left wolf in particular has some distortion around the eyes).
No, I have generated a few thousand midjourney images and there is quite a difference in these images actually.
It is hard to describe but there is a very unnatural "sheen" to the images on the left.
The SDXL 0.9 images look more photorealistic, but they still aren't quite at the level Midjourney can reach.
The best example is the wolf's hair between the ears in the SDXL 0.9 image. It is just a little too noisy and wavy compared to how a real wolf photo would look. Midjourney 5.1 --style raw would still handily beat this image if making a photorealistic wolf.
The jacket on the Alien in the SDXL 0.9 image also has too much of that AI sheen, but it kind of works here as an effect for the jacket material, so it's not really the best example.
The coffee cup isn't very good in either of them IMO. The trees on the right are still not blurred quite right. They are also hiding the hand in the image on the right. You can see how bad the little and ring fingers are in the left image.
For the aliens, the left image has much more realistic gradation. The one on the right looks like the grays have been crushed out of it. There's also a funky glow coming from the right edge of the alien.
I'd say the blur effects on the left images are much cleaner as well. There are some weird artifacts at the fringes of objects in the earlier version.
At the resolution provided they are indeed very close. In my eyes:
In the first example, the second image is more representative of Las Vegas to a foreigner like me, but neither of them meets the scratchy found-film requirement.
In the second example, both fit the prompt, but the first image looks more like it came from a documentary than the second one does.
In the third example, the hand in the second picture looks much better.
The wolf looks better, but also looks less like what you'd see in a "nature documentary" (part of the prompt).
I think the coffee cup looks better in the right photo; it seems a tad more real to me.
Like you I much prefer the alien photo on the left, but the photos are so stylistically different I'm not sure that says anything about the releases' respective capabilities.
I prefer the composition of the beta model over the release. Quality-wise I can’t say one is better than the other. Maybe the hand in the coffee picture is better in the 0.9 model.
Combining the results of multiple models and then adding another layer onto the combined output tends to increase accuracy / reduce error rates. (not new to AI: it's been done for over a decade)
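For anyone unfamiliar, here is a toy sketch of why that helps (the generic ensembling idea only, not Stability's actual pipeline): several independently noisy predictors averaged together beat any single one.

    import numpy as np

    # Toy illustration of ensembling, not Stability's pipeline: three
    # hypothetical base "models" each predict the truth plus independent
    # noise; averaging their outputs cuts the error of any single model.
    rng = np.random.default_rng(0)
    truth = rng.normal(size=1000)
    base_preds = [truth + rng.normal(scale=1.0, size=truth.shape) for _ in range(3)]
    combined = np.mean(base_preds, axis=0)

    def rmse(pred):
        return float(np.sqrt(np.mean((pred - truth) ** 2)))

    print([round(rmse(p), 3) for p in base_preds])  # each around 1.0
    print(round(rmse(combined), 3))                 # ~0.58, roughly 1/sqrt(3)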
“The model can be accessed via ClipDrop today with API coming shortly. Research weights are now available with an open release coming mid-July as we move to 1.0.”
I read this as: commercial use through our API now, self hosted commercial use in July.
NGL, I can't wait to get hold of this model file and run it locally; I'll be sure to do a write-up on it on my AI blog https://soartificial.com. I just hope my GPU can handle it locally. I don't think 8 GB of VRAM is going to be enough; I might have to tinker with some settings.
I'm just looking forward to the custom LoRA files we can use with it :D
Edit - it’s not the RAM. The 1080 Ti has 11 GB, and this press release says it requires 8. So I’m going to speculate that it’s because the 1080 lacks the tensor cores of the 20xx’s Turing architecture.
Since it is now split into two models to do the generation, you could load one and do the first stage for a bunch of images, then load the second and complete them, with half the VRAM usage.
I believe the HF pipeline can do this already, and I assume each stage uses more than 4 GB of VRAM. There are other tricks the open-source community will come up with, though.
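If the diffusers SDXL pipelines end up looking like the current SD ones, that two-pass flow could be sketched roughly like this (the repo IDs and the latent handoff are assumptions until the weights and docs actually land):

    import gc
    import torch
    from diffusers import DiffusionPipeline

    prompts = ["a wolf in a nature documentary", "an alien in a leather jacket"]

    # Stage 1: load only the base model, produce latents for every prompt,
    # then free it before the refiner is loaded. (Repo names are guesses.)
    base = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16
    ).to("cuda")
    latents = [base(p, output_type="latent").images for p in prompts]
    del base
    gc.collect()
    torch.cuda.empty_cache()

    # Stage 2: load the refiner and finish each image from its latents.
    refiner = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-0.9", torch_dtype=torch.float16
    ).to("cuda")
    for i, (p, lat) in enumerate(zip(prompts, latents)):
        refiner(prompt=p, image=lat).images[0].save(f"out_{i}.png")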
Low precision support, almost certainly. SD 1.5 needs almost twice the memory on a 10xx card as on 20xx, because you can't use FP16; a triple bummer, since that makes it even slower (memory bandwidth!) and you don't have as much to begin with.
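For reference, the half-precision path in diffusers is just a dtype flag at load time, which is what roughly halves the footprint on 20xx-and-newer cards (the stock SD 1.5 checkpoint is used here purely as a stand-in):

    import torch
    from diffusers import StableDiffusionPipeline

    # fp16 load: roughly half the VRAM of fp32, but it needs a card that
    # handles half precision well (20xx and newer).
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # On a 10xx card you are effectively stuck with the default fp32 load:
    # pipe = StableDiffusionPipeline.from_pretrained(
    #     "runwayml/stable-diffusion-v1-5"
    # ).to("cuda")

    image = pipe("a wolf photographed for a nature documentary").images[0]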
Any speculation why the AMD cards require twice the VRAM that Nvidia cards do? I have an RX 6700 XT and I'm disappointed that my 12 GB won't be enough.
Text will be better due to simple scale, but it will still be limited by the use of CLIP for text encoding (BPEs + contrastive). That applies to SDXL 0.9 too: it should still be worse at text than models that use T5, like https://github.com/deep-floyd/IF
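A quick way to poke at the difference being described, using the standard tokenizers (the public CLIP-L and a small T5 are just example checkpoints):

    from transformers import CLIPTokenizer, T5Tokenizer

    # CLIP's BPE tokenizer (what SD/SDXL condition on) vs the T5 tokenizer
    # DeepFloyd IF uses. CLIP's context window is only 77 tokens and its
    # embedding is trained contrastively, which is part of why text
    # rendered inside images tends to come out mangled.
    clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    t5_tok = T5Tokenizer.from_pretrained("t5-small")

    prompt = 'a neon sign that says "OPEN 24 HOURS"'
    print(clip_tok.model_max_length)   # 77
    print(clip_tok.tokenize(prompt))   # BPE pieces, e.g. 'neon</w>'
    print(t5_tok.tokenize(prompt))     # SentencePiece pieces, e.g. '▁neon'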
Despite its powerful output and advanced model architecture, SDXL 0.9 is able to be run on a modern consumer GPU, needing only a Windows 10 or 11, or Linux operating system, with 16GB RAM, an Nvidia GeForce RTX 20 graphics card (equivalent or higher standard) equipped with a minimum of 8GB of VRAM. Linux users are also able to use a compatible AMD card with 16GB VRAM.
I’m guessing that it will work eventually, though I’m not sure who will make that happen.
I've used Apple's port of Stable Diffusion on my Mac Studio with M1 Ultra and it worked flawlessly. I could even download models from Hugging Face and convert them to a CoreML model with little effort using Apple's conversion tool documented in their Stable Diffusion repo [1]. Some models on Hugging Face are already converted – I think anything tagged with CoreML.
I have an M2 MBP with 64 GB RAM. Performance with the older models is very good in my opinion. It feels like it runs faster locally than DreamStudio does. I don't have benchmarks, but in any case the performance is not bad.
I’ve had good results with SD 1.4/2 with MPS acceleration on similar hardware (an M1 Max, though with 64 GB). No stability issues with MPS, either. I’d say don’t rule it out just yet.
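For anyone wanting to try, the MPS path in diffusers is roughly this (SD 1.5 checkpoint as a placeholder; the single-step warm-up follows the diffusers MPS guidance):

    from diffusers import StableDiffusionPipeline

    # Load a standard SD checkpoint and move it to the Apple-silicon GPU.
    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
    pipe = pipe.to("mps")

    # One short warm-up pass; diffusers recommends this on MPS.
    _ = pipe("warm-up", num_inference_steps=1)

    image = pipe("a wolf photographed for a nature documentary").images[0]
    image.save("wolf.png")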
What's the dataset? How commercially viable/legally questionable is it?
This is critical, for legality of use, ethics concerns, and the quality of the output (as overly zealous filtering can degrade the model like it did for SD 2.0).
If I recall correctly, Stability AI’s process to skirt copyright is to have the training data compiled and model weights trained by a third-party university. Educational research institutions have more lax requirements around copyright. That may or may not be a legitimate way to work under existing laws, but doesn’t tell us much about what the moral/ethical/legal considerations should be, which seems like an open question.
That sounds more like their story for SD 1.5 last year. I think there was some kerfuffle between Stability.ai and Runway.ai/Heidelberg Uni (see the Forbes article; won't link as I'm unclear on the veracity), who they were working with, and they may have parted ways by their first indie work on SD 2.x around the holidays. Either way, the Uni connection story may be old.
So, there are at least a few dozen AI image-generating sites, some specialized, others not. Are they all powered by SD, maybe just with some better pre-prompting? Or are there other engines (e.g. DALL-E)?
I've only run across 3 primary models: Midjourney (via their Discord), DALL-E, and SD. And yes, there's a bunch of sites, but I've seen very similar quality to SD and no mention yet of a different base.
I do expect there are other bases out there, but haven't seen any of quality yet.
Before this release (XL 0.9) it's been unclear how much of the SD quality was in-house or came from their prior collab with Runway/Heidelberg.
I don't like how SD consolidated around the A1111 repo. The features are great, and it was fantastic when SD was brand new... but the performance and compatibility is awful, the setup is tricky, and it sucked all the oxygen out of the room that other SD UIs needed to flourish.
While it has a fraction of the features found in stable-diffusion-webui, Easy Diffusion has the best out-of-the-box UI I've tried so far. The way it enqueues tasks and renders the generated images beats anything I've seen in the various UIs I've played with.
I also like that you can easily write plugins in JavaScript, both for the UI and for server-side tweaks.
The problem is those features/extensions in A1111 are absolutely killer once you use them. I assume they have ControlNet support now but I couldn’t do what I often do without regional prompter. Adetailer is also amazing.
Easy Diffusion does much less. No ControlNet support. It only just got LoRA support (at least in the beta channel). If A1111 is "professional", Easy Diffusion is maybe "hobbyist".
I use A1111 as a tool, but if I want to goof off, I queue up a bunch of prompts in Easy Diffusion and end up with a gallery built in real time. Its smaller feature set makes it great for that.
It’s open source. The only way to compete is on your merits in open source.
If you want another UI to flourish, clone both it and A1111, copy and paste the bits from A1111 you’d like to have in yours (with attribution), and push it up along with any features you personally want.
That does require developer time, and developers may converge on a popular implementation with good tests and lots of features as it’s easier to contribute.
The bottleneck isn’t really the community though, it’s the developers.
It's not that simple, as A1111 uses the old Stability AI implementation while pretty much everything else uses the HF diffusers code.
I spent a while trying to add torch.compile support to A1111, fixing some graph breaks locally, but... it was too much. Some other things, like ML compilation backends, are also basically impossible.
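For contrast, on the diffusers side it is roughly a one-liner per their optimization docs (sketch only; the checkpoint is a stand-in):

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    # Compile just the UNet, the hot loop of the sampler; this is the
    # pattern the diffusers docs show for torch.compile.
    pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

    image = pipe("a latte with elaborate foam art").images[0]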
- Compatibility with stuff from research papers and ML compilers since it is the "de facto" SD implementation.
- The codebase is cleaner, more hackable, and (compared to the base SAI code) more performant.
- HF continues to put lots of work into optimization and cleanup. For instance, they ensure there are no graph breaks for torch.compile, and they work with other hardware vendors on their own SD implementations.
It's super easy; in fact, I think they specifically have a long-prompt pipeline. Look at the implementation in pretty much any diffusers UI (like VoltaML or Invoke); there's a sketch of the diffusers approach below.
Facebook's AITemplate backend even supports long prompts now.
The A1111 backend is kinda not set up for this, as it is built around the old Stability AI 1.5/2.1 implementation (not HF diffusers, which most other backends use).
It would basically be a rewrite, if I were to guess... and at that point they might as well port everything to diffusers.
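As a concrete example of the long-prompt handling mentioned above, diffusers ships a community "lpw_stable_diffusion" pipeline that chunks prompts past CLIP's 77-token window and understands A1111-style weights (sketch only, with SD 1.5 as a stand-in):

    import torch
    from diffusers import DiffusionPipeline

    pipe = DiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        custom_pipeline="lpw_stable_diffusion",
        torch_dtype=torch.float16,
    ).to("cuda")

    # Deliberately longer than CLIP's 77-token window, with weight syntax.
    long_prompt = "a (photorealistic:1.2) wolf in a misty forest, " * 10
    image = pipe(long_prompt, max_embeddings_multiples=3).images[0]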
It's not better out of the box for text-to-image; this has been known for quite some time. However, as soon as they release the weights (in a month, as they promise), it will benefit from the tooling available for SD, without being limited to text-to-image.
It's also a foundational model, not a finished product, and MJ will possibly use it, like they did in v4 with SD 1.5.
Midjourney only used a combination of SD and their own stuff with the --beta and --test/--testp models, which came between V3 and V4; other versions have no connection to SD.
I might be misremembering, but didn't they announce on their Twitter that they were using SD somehow for MJ v4? They later deleted this along with a bunch of other tweets.
Midjourney is IMO still better (especially because it can do hands), but this actually comes pretty close. I've created some amazing pictures with it already!