Stable Diffusion XL 1.0 (techcrunch.com)
340 points by gslin on July 26, 2023 | 171 comments



It's often said porn drives technology.

I clicked through the links in the article, since they sounded technically interesting. They led to AI-generated porn. Those, in turn, led to pages about training SD to generate porn. Now, two disclaimers:

1) I am not interested in AI-generating porn

2) I haven't followed SD in maybe 6-9 months

With those out of the way, the out-of-the-box tools for fine-tuning SD are impressive, well beyond anything I've seen in the non-porn space, and the progress seems to be entirely driven by the anime porn community:

https://aituts.com/stable-diffusion-lora

10 images is enough to fine-tune. 30-150 is preferred. This takes 15-240 minutes, depending on GPU. I do occasionally use SD for work. If this works for images other than naked and cartoon women, and for normal business graphics, this may dramatically increase the utility of SD in my workflows (at least if I get around to setting it up).

I want my images to have a consistent style. If I'm making icons, I'd like to fine-tune on my baseline icon set. If I'm making slides for a deck, I'd like those to have a consistent color scheme and visual language. Now I can.
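Concretely, the "use it afterwards" part looks simple enough with diffusers; a minimal sketch, assuming a hypothetical my_icon_style.safetensors LoRA trained on that baseline icon set:

    # Minimal sketch: apply a custom style LoRA on top of SD 1.5 with diffusers.
    # "my_icon_style.safetensors" is a hypothetical LoRA trained on your own icons.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    pipe.load_lora_weights(".", weight_name="my_icon_style.safetensors")

    image = pipe("flat minimalist calendar app icon, brand colors").images[0]
    image.save("icon.png")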

Thanks creepy porn dudes!

The other piece: Anyone trying to keep the cat in the bag? It's too late.


> the progress seems to be entirely driven by the anime porn community:

It's not entirely driven by porn communities, and the porn communities driving it aren't entirely anime porn communities (and the anime communities driving it aren't entirely porn communities).

But, yeah, the anime + porn/fetish art + furry + RPG art + scifi/fantasy art communities, and particularly the niches in the overlap of two or more of those, are pretty significant.

> If this works for images other than naked and cartoon women

It does, and while it may not be large proportionally compared to the anime-porn stuff, there are a lot of publicly distributed fine-tuned checkpoints, LoRAs, etc., demonstrating that it does.


It absolutely works for things other than naked and cartoon women. Here are some generations of my daughter and dog (together!). I believe most of these are from a fine tuned model of them and not an extracted LoRA, though I use that sometimes too: https://imgur.com/a/naHgnel


The space one without headphones is particularly cool.

I use it for D&D art generation. I can have a piece of art that somewhat matches every location/scene I have planned. If things don't match my plans I can generate 8 images and pick the best in about 2 minutes. I talk to a lot of other DMs who use it in a similar way.

It's not great with specific details, so I plan to commission someone to draw the party when the campaign is over. But for things like a fantasy magic shop with potions, or a fantasy dungeon exterior, or a forest of mushroom trees, it's more than good enough for concept art to throw into Roll20. I couldn't afford 5-10 pieces of custom concept art per game, nor could I come up with the ideas for them 2 hours beforehand and have them ready for the session.


Have you tried XL as a test for handling specific details?


I have not and will try it soon now that it's out in an official release. But I doubt it can meet the level of detail my players have put into their own characters. I also fully expect to need to pay for revisions from whatever artist I commission. But I really want a piece of art that every player feels captures their character perfectly.


> Here are some generations of my daughter and dog (together!).

I will choose to intentionally misread that :)


The official blog post from Stability is finally up and would probably be a better URL to link to than the TechCrunch coverage: https://stability.ai/blog/stable-diffusion-sdxl-1-announceme...


It’ll be "released" once the model weights show up on the repo or in HuggingFace… for now it’s "announced"

It should appear here at some point; currently only the VAE has been added:

https://huggingface.co/stabilityai



Yes, it's now been released


It only replies, "module 'diffusers' has no attribute 'StableDiffusionXLPipeline'"
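For anyone else hitting that: it usually just means the installed diffusers predates SDXL support. Once the weights are up, something along these lines should work; a sketch, assuming the 1.0 repo follows the 0.9 naming:

    # pip install -U diffusers transformers accelerate safetensors
    # Sketch of loading SDXL 1.0 with diffusers; the AttributeError above usually
    # just means an outdated diffusers install.
    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16,
        variant="fp16",
        use_safetensors=True,
    ).to("cuda")
    image = pipe("a red panda wearing a space helmet", num_inference_steps=30).images[0]
    image.save("out.png")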


The release event is in ~30 minutes on their Discord; the announcement probably went out a bit early.


It does appear to be live on Clipdrop.

https://clipdrop.co/stable-diffusion


You get access to the weights instantly if you apply for them. It's basically not a hurdle.

(I've been having fun with this for a few days. https://huggingface.co/stabilityai/stable-diffusion-xl-base-... Not sure there's much of a difference with the 1.0 version.)


For 1.0? Where do you apply? Or are you talking about 0.9?


The ones you can apply for access to are the 0.9 weights, which have been available for a couple of weeks. Unless the SDXL 1.0 weights are also available by application somewhere I'm unaware of.



It sounds like after the previous 0.9 version there was some refining done:

> The refining process has produced a model that generates more vibrant and accurate colors, with better contrast, lighting, and shadows than its predecessor. The imaging process is also streamlined to deliver quicker results, yielding full 1-megapixel (1024x1024) resolution images in seconds in multiple aspect ratios.

Sounds pretty impressive, and the sample results at the bottom of the page are visually excellent.


They have bots in their Discord for generating images based on user prompts. Those randomize some settings, compare candidate models, and are used for RLHF fine-tuning; that's the main source of refining, which will continue even after release.


There were, IIRC, three different post-0.9 candidate models in parallel testing to become 1.0 recently.


I always wondered why the vision models don't seem to be following the whole "scale up as much as possible" mantra that has defined the language models of the past few years, at least not to the same extent. Even 3.5 billion parameters is absolutely nothing compared to the likes of GPT-3, 3.5, 4, or even the larger open-source language models (e.g. LLaMA-65B). Is it just an engineering challenge that no one has stepped up for yet? Is it a matter of finding enough training data for the scaling up to make sense?


Diffusion is more parameter-efficient and you quickly saturate the target fidelity, especially with some refiner cascade. It's a solved problem. You do not need more than maybe 4B total. Images are far more redundant than text.

In fact, most interesting papers since Imagen show that you get more mileage out of scaling the text encoder part, which is, of course, a Transformer. This is what drives accuracy, text rendering, compositionality, parsing edge cases. In SD 1.5 the text encoder part (CLIP ViT-L/14) takes a measly 123M parameters.[1] In Imagen, it was T5-XXL with 4.6B [2]. I am interested in someone trying to use a really strong encoder baseline – maybe from a UL2-20B – to push this tactic further.

Seeing as you can throw out diffusion altogether and synthesize images with transformers [3], there is no reason to prioritize the diffusion part as such.

1. https://forums.fast.ai/t/stable-diffusion-parameter-budget-a...

2. https://arxiv.org/abs/2205.11487

3. https://arxiv.org/abs/2301.00704
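(If anyone wants to sanity-check that budget split, counting parameters per component is easy with diffusers; a rough sketch that should land near the ~123M figure from [1] for SD 1.5:)

    # Rough sketch: compare per-component parameter budgets in SD 1.5.
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

    def count(module):
        return sum(p.numel() for p in module.parameters())

    print(f"text_encoder: {count(pipe.text_encoder) / 1e6:.0f}M")  # CLIP ViT-L/14, ~123M
    print(f"unet:         {count(pipe.unet) / 1e6:.0f}M")          # diffusion backbone, ~860M
    print(f"vae:          {count(pipe.vae) / 1e6:.0f}M")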


> Seeing as you can throw out diffusion altogether and synthesize images with transformers [3]

That's actually how this whole party got started. DALL-E (the first one) was a transformer model trained on image tokens from an early VAE (and text tokens ofc). Researchers from CompVis developed VQGAN in response. OpenAI showed improved fidelity with guided diffusion over ImageNet (classes) and subsequently DALL-E 2 using pixel-space diffusion and cascaded upsampling. CompVis responded with Latent Diffusion, which used diffusion in the latent space of some new VQGANs.

The paper you mention is interesting! They go back to the DALL-E 1 method but train two VQGANs for upsampling and increase the parameter count. This is faster, but only faster than the originally reported benchmarks, which used inferior sampling methods for their diffusion. I would be curious if they can beat some of the more recent ones which require as few as 10-20 steps.

They also improve on FID/CLIP scores likely by using more parameters. This might be a memory/time trade off though. I would be curious how much more VRAM their model requires compared to SD, MJ, Kandinsky.

The same goes for using T5-XXL. You’ll win FID score contests but no one will be able to run it without an A100 or TPU pod.


> The same goes for using T5-XXL

Is this still true in 2023? Sure, back in the dark ages it seemed like an 860M model was just about the limit for a regular consumer, but I don't see why we wouldn't be able to use quantized encoders; and even 30B LLMs run okay on Macbooks now.


That’s a fair point and I’m not sure actually. I bet you’re right though.


> Images are far more redundant than text.

"A picture is worth a thousand words" - I wonder how (in)accurate this popular saying turned out to be? :D


I'm gonna go ahead and say in 2023, one detailed picture (512x512) is worth about 30 words.


I guess that depends on the prompt.


Do negative prompt tokens count as words?


They often reference this paper as the motivation for that: https://arxiv.org/pdf/2203.15556.pdf I.e. training with 10x the data for 10x longer can yield models as good as a GPT-3-scale model but with fewer weights (according to the paper), and the same principle applies in vision.


Diffusion is relatively compute-intensive compared to transformer LLMs, and (in current implementations) doesn't quantize as well.

A 70B parameter model would be very slow and vram hungry, hence very expensive to run.

Also, image generation is more reliant on tooling surrounding the models than pure text prompting. I don't think even a 300B model would get things quite right through text prompting alone.


Hmm this is a good point, diffusion requires several (many?) inference passes as you refine the noise into an image, right? Makes sense that this is more expensive to scale up. Thanks for the explanation!


Technically LLMs require a pass for each token, but the passes are cheaper and benefit more from batching.


Do we know how many parameters DALL-E, Firefly, Midjourney, etc. have these days?

If we are talking about Stable Diffusion, the reality is that... more parameters mean it will be harder to run locally. And let me tell you something, the community around Stable Diffusion only cares about NSFW... and wants local for that...

Stable Diffusion 2 was totally boycotted by the community because they... banned NSFW from it. They now had to allow it again in SDXL.

Also, more parameters mean it will be more expensive for community fine-tuners to train as well.


I'm out of date on the image-generating side of AI, but I'd like to check things out. What's the best tool for image generation that's available on a website right now? I.e., not a model that I have to run locally.


If you want to play around with Stable Diffusion XL: https://clipdrop.co


Since Clipdrop has an API, is there any way to use it with ComfyUI or Automatic1111 (or whatever that's called)?


I just tried this and the UI is very nice (better than dreamstudio), with nice tool integration, and image quality is definitely going up with each new release. You can see a few results at fb.com/onlyrolydog (along with a lot of other canine nonsense).


https://playgroundai.com/create

Not affiliated in any way and not very involved in the space. I just wanted to generate some images a few weeks ago and was looking for somewhere I could do that for free. The link above lets you do that, but I suggest you look up prompts because it's a lot more involved than I expected.


Any particularly useful resources for looking into prompts?


This AI Horde UI has, IMO, some really good templates and suggestions:

https://tinybots.net/artbot


Hey! Creator of ArtBot here. Thanks for plugging the site!

For those not aware, here's an interesting fact about ArtBot (and the AI Horde in general) -- we've been running an A/B test with Stability.ai for the last 3 weeks or so related to SDXL [1].

Any time a user generates an image using SDXL_beta on the AI Horde, they get two images back. They pick which image they think is best for the given prompt. This data is sent back to Stability.ai in order to help improve their image models.

In a similar vein, LAION partnered with the AI Horde earlier this year in order to gather aesthetics ratings for improving various image datasets. [2]

It's a cool little open source community and there's just a ton of stuff going on.

[1] https://dbzer0.com/blog/stable-diffusion-xl-beta-on-the-ai-h...

[2] https://laion.ai/blog/laion-stable-horde/


Artbot is an amazing and criminally underappreciated project, I try to find an excuse to plug it wherever I can.


I used this: https://learnwithnaseem.com/best-playground-ai-prompts-for-a...

I just took the ones I liked and then deleted the words that were specific to that image and left the ones that were providing the style of the image. So for example, on the first one I would delete "an cute kitsune in florest" but would keep "colorfully fantast concept art". Then I just added a comma-separated list of the features I wanted in my picture. It took a lot more trial and error than I thought, and adding sentences seemed to be worse than just individual words. I am sure I barely scratched the surface of interfacing with the tool correctly, but the space is moving so fast it's not the kind of thing I want to spend my time learning right now just to have that knowledge become obsolete in 6 months.


Midjourney right? Although, discord isn't a website I guess.


I've found https://firefly.adobe.com/ pretty good at composing images with multiple subjects. [disclaimer - I work at Adobe, but not in the Creative Cloud]

But I wouldn't say it's the "best." Just trained on images that weren't taken from unconsenting artists.


I was quite disappointed that the Photoshop generative fill stuff insists on running on Adobe's servers rather than locally. So however good it is, there are many of us who will never use it.


Yeah-- I can only assume it's to ensure a consistent experience and to not disperse the model openly. If you have the model running locally on people's computers, it limits who can use the generative AI and opens up a ton of headache around customer support. Again, I don't work on this, but I'm familiar with generative AI and what it takes to run.


I'm actually a big fan of firefly. It has a different kind of style from the others, presumably due to its training dataset?



What models does dreamstudio use? I couldn't see how to view them without logging in.


Dreamstudio (and ClipDrop also) uses Stable Diffusion, getting new SD models generally before public release (both are owned by StabilityAI).


There are toy AI things, but there is nothing quite like Stable Diffusion running on Colab. Lots of people recommend Midjourney, but that is like playing with MS Paint. If you can get Stable Diffusion going with Automatic1111, it's AAA tier. Especially with ControlNet and Dreambooth, but that is part 2.

Google: The Last Ben Stable Diffusion Colab

for a way to not run it locally, but get all the features.


> The Last Ben Stable Diffusion Colab

https://github.com/TheLastBen/fast-stable-diffusion


Probably Midjourney, but I like Dreamstudio better.


Is there anything like this for the vector landscape?

This may just be due to the iterative denoising approach a lot of these models take, but they only seem to work well when creating raster-style images.

In my experience when you ask them to create logos, shirt designs, illustrations, they tend to not work as well and introduce a lot of artifacts, distortions, incorrect spellings etc.


If you mean raster images that look like vector art and contain arbitrary text and shapes, controlnets/T2I adapters do work for this. You could train a custom controlnet for this, too (it requires some understanding).

As for directly generating vector images, there's nothing yet. Your best bet is generating vector-looking raster and tracing it.
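The controlnet route looks roughly like this in diffusers; a sketch using the public canny controlnet as a stand-in (a custom controlnet trained on flat/vector-style images would fit this use case better), with "logo_sketch.png" as a hypothetical input:

    # Sketch: condition SD 1.5 on canny edges via ControlNet, then trace the raster
    # output to SVG externally (potrace, Inkscape). Model/file names are examples.
    import cv2
    import numpy as np
    import torch
    from PIL import Image
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
    ).to("cuda")

    edges = cv2.Canny(np.array(Image.open("logo_sketch.png").convert("RGB")), 100, 200)
    control = Image.fromarray(np.stack([edges] * 3, axis=-1))  # 1-channel edges -> RGB

    image = pipe("flat two-color vector logo, white background", image=control).images[0]
    image.save("logo_raster.png")  # still raster; trace to SVG afterwards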


There are SD models tuned for vector-like raster output. And XL has specifically focused on this use case as one of its improvements. Try SDXL 1.0 on Clipdrop or Dreamstudio.


A lot of people are having success by adding extra networks (lora is the most common) which are trained on the type of image you're looking for. It's still a raster image, of course, but you can produce images which look very much like rasterizations of vector images, which you can then translate back into SVGs in Inkscape or similar.


Midjourney is still going to be hard to beat imo. Comparing SD to MJ is a little unfair considering their applications and flexibility, but I do really enjoy the "out of the box" experience that comes with MJ.


Different use case.

I can run SDXL 1.0 offline from my home. I can’t do this with Midjourney.

A closed source model that doesn’t have the limitation of running on consumer level GPUs will have certain advantages.


What type of setup do you have at home? What type of GPU? MJ completes a pretty high quality photo in about a minute. Does SD compare?


I use both, but Stable Diffusion has better control over the workflow. With Automatic1111 I can generate a matrix of output based on prompt variations or parameter changes. I can also do bigger batches. And I can open multiple tabs and queue up several prompt variation matrices at once, then leave for an hour. I have a laptop RTX 2070 and a 512x768 takes about 20 seconds[0] or so. Automatic1111 also includes some upscaling AI once you've found the base image you want.

StableDiffusion needs you to be way more specific than Midjourney. MJ will fill in the gaps of your prompt to get a better image. SD usually won't.

MJ photos are higher quality with easier prompting IMO, but with a distinctive style. Even if you ask it to mimic some other style, it has that midjourney feel.

I mainly use it for generating setting or character images for a D&D game. I use Midjourney more for characters.

[0] This is at ~25 iterations.


With an RTX 4090, you can crank out several images per minute, even at high resolutions.


I haven't done it in a while but I was cranking images out at 11s/output on a 3080. But it depends on your workflow, too. I started low res/low samples (32-64) and scaled up or used recursion until I got a desirable result or found a nice seed. I think I was doing 512x916 or something close to that.


With SD you have a lot of control over not just basics like image size and prompt complexity, but also things like how many iterations of which different sampler(s) get used.

So speed can vary wildly depending on how you're choosing to use it. And that's without even getting into the wide variance of hardware.

But generally speaking, it will usually be significantly faster than one image per minute.


Midjourney is destroyed by the ecosystem around stable diffusion, especially all the features and extensions in automatic1111. It’s not even close


You still have to run midjourney through discord right? There isn't even an official API. Feels like a joke.


Been using https://omnibridge.io Pretty stable!


Got any way to get individual images in relax mode? As it gives 4 images combined, and upscale is available in fast mode only. So that kind of makes relax mode useless.


MJ quality is significantly worse. Everything has the Pixar look and barely follows the prompt. It's nice as a toy, but SD with Automatic1111 is miles ahead of MJ.


I tried it in dreamstudio. Like all the other image generators I've tried, it's rubbish at drawing a piano keyboard or an accordion. (Those are my tests to see if it understands the geometry of machines.)

A couple of accordion pictures do look passable at a distance.

Another test: how well does it do at drawing a woman waving a flag?

One thing that strikes me is that it generates four images at a time, but there is little variety. It's a similar looking woman wearing a similar color and style of clothing, a similar street, and a large American flag. (In one case drawn wrong.) I guess if you want variety you have to specify it yourself?

AI models seem to be getting ever better in resolution and at portraits.


My go-to test is "elephant riding unicycle". Neither Midjourney nor Stable Diffusion XL is capable of doing this.


I hope someday there’s a version of this or something comparable to it that can run on <8gb consumer hardware. The main selling point of Stable Diffusion was its ability to run in that environment.


You can do this if you select the `pipe.enable_model_cpu_offload()` option. See this https://huggingface.co/stabilityai/stable-diffusion-xl-base-...
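In full, the low-VRAM path from the model card looks roughly like this (a sketch; it needs accelerate installed and trades speed for memory):

    # Sketch: keep SDXL weights in fp16 and let accelerate shuttle submodules to the
    # GPU only while they're needed, instead of moving the whole pipeline at once.
    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        torch_dtype=torch.float16,
        variant="fp16",
        use_safetensors=True,
    )
    pipe.enable_model_cpu_offload()  # instead of pipe.to("cuda"); lowers peak VRAM
    image = pipe("watercolor fox in a snowy forest").images[0]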


> I hope someday there’s a version of this or something comparable to it that can run on <8gb consumer hardware.

Someday is today: from the official announcement: “SDXL 1.0 should work effectively on consumer GPUs with 8GB VRAM or readily available cloud instances.” https://stability.ai/blog/stable-diffusion-sdxl-1-announceme...


Easy Diffusion (previously cmdr2 UI) can run SDXL in 768x768 in about 7 GB of VRAM. And SDXL 512x512 in about 5 GB of VRAM.

Regular SD can run in less than 2 GB of VRAM with Easy Diffusion.

1. Installation (no dependencies, python etc): https://github.com/easydiffusion/easydiffusion#installation

2. Enable beta to get access to SDXL: https://github.com/easydiffusion/easydiffusion/wiki/The-beta...

3. Use the "Low" VRAM usage setting in the Settings tab.


SDXL 0.9 runs on iPad Pro 8GiB just fine.


Is this using Draw Things, or another app? Did you have to quantize the model first?


Yeah, Draw Things. It will be submitted as soon as SDXL v1.0 weights available. Quantized model should run on iPhones (4GiB / 6GiB models), but we haven't done that yet. So no, these are just typical FP16 weights on iPad.


Thanks! I guess I'll stick to running it on my Macbook for the time being until the quantized model gets uploaded. What kind of performance are you seeing with the FP16 weights on the iPad? I've run a few SD2.0-based (unquantized) models on my 2020 iPad Pro but it seems like it gets thermally throttled after a while.


There will be more info upon release. SDXL v0.9 performs generally the same as SD v1/v2 at the same resolution. But because you tend to run it at a larger resolution, it might feel slower.


I feel like this is the greatest demand for LLMs at the moment too.

It's hard to believe we're only 8 months into this industry, so I imagine we'll start seeing smaller footprints soon.


8 months from what point?

GPT-3 is 36 months old. DALL-E is 28 months old. Even Stable Diffusion is like 11 months old.


Fair, I should have said 8 months since the market exploded.


We already do. MLC-LLM and Llama.cpp have Vulkan/OpenCL/Metal 3 bit implementations. That can run llama 7B (or maybe even 13b?) in 8GB.

TBH devices just need more ram for coherent output though. Llama 13b and 33b are so much "smarter" and more coherent than 7B with 3 bit quant.


A 13B Q5 llama reports "total VRAM used: 8321 MB", so 3-bit will most likely fit into 8GB.


Give InvokeAI a try.

https://github.com/invoke-ai/InvokeAI

Edit: Specs required, from the documentation:

You will need one of the following:

    An NVIDIA-based graphics card with 4 GB or more VRAM memory. 6-8 GB of VRAM is highly recommended for rendering using the Stable Diffusion XL models
    An Apple computer with an M1 chip.
    An AMD-based graphics card with 4GB or more VRAM memory (Linux only), 6-8 GB for XL rendering.


Thanks for the recommendation.

As an aside, does this irritate anyone else?

"You must have Python 3.9 or 3.10 installed on your machine. Earlier or later versions are not supported. Node.js also needs to be installed along with yarn"

I don't like having to install npm when an existing dev stack (python) is already present.


I should clarify that by <8gb I meant “less than 8gb”, which is what SD 1.5 and 2 were able to do. I’m aware that it can run on ==8gb.


There are several papers on 4/8 bit quantization, and a few implementations for Vulkan/CUDA/ROCm compilation.

TBH the UIs people run for SD 1.5 are pretty unoptimized.


Let's see whether derived models will suffer less from the 'same face actor' model response to every portrait prompt. It's not trivial to get photoreal models to not produce lookalikes without resorting to specific, typically celeb-based, finetunes.


I am completely uninformed in this space.

Would someone be kind to explain what the current state of the art in image generation is (how does this compare to Midjourney and others)?

How do open source models stack up?

Also what are the most common use cases for image generation?


SDXL is in roughly the same ballpark as MJ 5 quality-wise, but the main value is in the array of tooling immediately available for it, and the license. You can fine-tune it on your own pictures, use higher order input (not just text), and daisy-chain various non-imagegen models and algorithms (object/feature segmentation, depth detection, processing, subject control etc) to produce complex images, either procedural or one-off. It's all experimental and very improvised, but is starting to look like a very technical CGI field separate from the classic 3D CGI.


For bland stock photos and other "general-purpose" image generation, DALLE-2/Bing/Adobe etc are... the okayest. SD (with just standard model weights) is particularly weak here because of the small model size.

If you want to get arty, then state of the art for out-of-the-box typing in a prompt and clicking "generate" is probably MidJourney.

But if you're willing to spend some more time playing around with the open-source tooling, community finetunes, model augmentations (LyCORIS, etc), SD is probably going to get you the farthest.

> Also what are the most common use cases for image generation?

By sheer number of image generations? Take a guess...


> By sheer number of image generations? Take a guess...

Cat images right?


Well, yes, kind of.

Catgirl images, to be precise.


SDXL 0.9 should be the state-of-the-art image generation model (in the open). It generates at a large 1024x1024 resolution, with high coherency and a good selection of styles out of the box. It also has reasonable text understanding compared to other models.

That being said, based on the configurations of these models, we are far from saturating what the best model can do. The problem is that FID is a terrible metric for evaluating these models, so, as with LLMs, we are a bit clueless about how to evaluate them now.


Why do you think FID is a terrible metric? What don't you like in particular about it?


I overspoke. FID is a fine metric for observing the training progress of your own model. And it correlates well with some coherency issues of generative models. But for cross-model comparisons, especially for models that generally do well under FID, it is not discriminative enough to separate better from good.


I don't know what the use case is for other people is, but I've been playing around with book covers. This one took about two weeks, but it was my first real try and I was still learning how. Composition is a little off. The one I'm working on now is going faster (and better).

https://imgur.com/a/CxX5eYj

I've found that I rarely get a usable image completely as-is. It might take 5 or 10 generations to find something sort of ok, and even then I end up erasing the bad parts and letting it in-paint (which again takes multiple attempts). The T-rex had like 7 legs and two jaws, but was otherwise close to what I wanted... just keep erasing extra body parts until the in-painter finally takes a hint.

I was also going to do a few book covers for some Babylon 5 books, but it does so badly on celebrity faces. Looked like Koenig's mutant love child with Ernest Borgnine. Dunno what to do about that. I keep wondering if I shouldn't spend the next 10 years putting together my own training set of fantasy and science fiction art.


Midjourney may be better for plain prompts, but Stable Diffusion is SOTA because of the tooling and finetuning surrounding it.


Idk, Midjourney ignores prompts.

For the longest time I thought it was Google-image-searching things and doing some Photoshop to make them look like Pixar, because it was so bad.


I will wait for the automatic1111 web ui version


It's already supported in automatic1111 (see recent updates), and someone in the community will convert it to the automatic1111 format within minutes/hours after it's released on huggingface.


Sort of. IIRC (which may be unlikely), Auto1111 has the base model in the text-to-image tab, but if you want to use the refiner, that is a separate img2img step/tab. Which would be a pain in the ass imo.

The "Comfy" tool is node based and you can string both together which is nice. Although if you aren't confident in your images you don't need the refiner for a bit.
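For reference, the chained base-plus-refiner workflow looks roughly like this in diffusers (a sketch; the 0.8 split is just a commonly suggested value, not gospel):

    # Rough sketch of the base -> refiner chain: the base model handles the first
    # ~80% of denoising and hands its latents to the refiner for the final steps.
    import torch
    from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

    base = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16"
    ).to("cuda")
    refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-1.0",
        text_encoder_2=base.text_encoder_2,  # share components to save VRAM
        vae=base.vae,
        torch_dtype=torch.float16,
        variant="fp16",
    ).to("cuda")

    prompt = "isometric fantasy potion shop, concept art"
    latents = base(prompt, denoising_end=0.8, output_type="latent").images
    image = refiner(prompt, image=latents, denoising_start=0.8).images[0]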


I think the diffusers UIs (like Invoke and VoltaML) are going to implement the refiner soon since HF already has a pipeline for it.

Comfy and A1111 are based around the original SD StabilityAI code, but the implementation must be pretty similar if they could add the base model so quickly.


Work started with the SDXL 0.9 release, and for A1111 it exited release candidate status in the last few days.


The huggingface release works in A1111 directly, there is nothing to convert


Yep. Just download the safetensor version.


What's the memory usage of SDXL?


It depends greatly on your UI and the size you are generating. There is no hard answer.


Runs great on my 10GB 3080 FTW3. ComfyUI moreso than auto1111.


Auto1111 (latest on git) OOMed my 3080 running the base XL model at 1024x1024, unfortunately. Granted, my xorg takes up almost 950 MB of the precious VRAM. Did you get it to run using A1111 on the FTW3 without OOMing it?


Been working fine on a 8 GB 3070 generating 1024x1024 images, using Comfy UI with the refiner


TBH I was hoping the community would take the opportunity to move to the diffusers format...

You get deduplication, easy swapping of stuff like VAEs, faster loading, and less ambiguity about what exactly is inside a monolithic .safetensors file. And this all seems more important since SDXL is so big, and split between two models anyway.
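For example, swapping in a different VAE is just a constructor argument in that layout; a rough sketch, assuming the standalone SDXL VAE mentioned upthread lives at stabilityai/sdxl-vae:

    # Sketch of what the diffusers layout buys you: components live in separate
    # folders, so replacing e.g. the VAE doesn't require re-baking a monolithic file.
    import torch
    from diffusers import AutoencoderKL, StableDiffusionXLPipeline

    vae = AutoencoderKL.from_pretrained("stabilityai/sdxl-vae", torch_dtype=torch.float16)
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        vae=vae,  # drop-in replacement
        torch_dtype=torch.float16,
        variant="fp16",
    ).to("cuda")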


SD.Next (vlad fork of auto1111) switched to diffusers.


Is this pre-censored like their other later models?


No. From what I've gathered, it was trained on human anatomy, but not straight-up porn. What they tried for 2.0/2.1 was way too overdone, to the point where if I prompted "princess Zelda," the generation would only look mildly like her. Presumably they just didn't have many images of people in the training. 1.5 and SDXL both work fine on that front.

Fine tuners will quickly take it further, if that’s what you’re after.


I don't think 1.x was trained on porn either?

I seem to remember the issue with 2.x is that they removed all the commercial art from top-notch illustrators from the training data due to the backlash, so it was just way worse at generating great-looking things, which is all the user cares about. So the community stayed on their custom-trained models derived from SD 1.5 (which, yes, often included porn).


2.0 included a filter on the training data that removed all nudity. It went way too far and removed a lot of humans. They tried to rectify it a bit with 2.1 but even that was still hampered.

What you said also happened, but the main thing was the base model didn't have a great concept of human anatomy. Apparently it was really hard to train for anything else as well.


I've been playing with 0.9 and it can generate nude people so it seems not.


Yes.


Not sure why this is downvoted. This model is heavily censored by default.


I thought this release had been announced already? Or was that not 1.0? Could have sworn they released an "XL" variant a little while ago?


It was the research weights of the v0.9 model


In the meantime I've been getting good mileage out of Kandinsky - anyone got a good sense of how they compare?


This is the first I have heard of Kandinsky. Thanks for the tip.

SDXL is a bigger model. There are some subjective comparison posts with SDXL 0.9, but I can't see them since they are on X :/


That sounds so weird, it took me a minute to understand. Go to nitter.net which has no login requirement and no ads, but all the same content that X (tmsfkat) has.


Two Nitter instances failed to load it, unfortunately.

And yeah, X is weird to type out too.


Amazing that their examples at the bottom of the page still show really messed up human hands.


Some of them look surprisingly correct, so it looks like there's been at least some progress on that front. I would assume these are among the best examples of many, many attempts so it still seems to be a ways off.


I for sure generated hands better than they showed :)


Hands being bad is a result of people one-shotting images; you need to go repaint them afterwards, I've found. But it'll do them great if you inpaint well.


I've personally observed that the drawing of hands in Midjourney and SD has been getting incrementally better release after release.


That's why I'm amazed they picked images with totally borked up hands to put on their press release. Truth in advertising!


If they cherry-picked the examples, people would get the wrong idea. What I like about imagegen is that your results are really only bounded by your patience.


Stability AI is awesome I love them


Can SD draw hands finally?


You can already easily generate images with good looking hands if you use a good custom model.


With a few tries, yes. Probably will be even better with negative embedding.


> Probably will be even better with negative embedding.

And/or a hand-specific LoRA and/or a workflow using something like the ADetailer extension in A1111 that applies a model to recognize hands [0] and then inpaints them.

[0] recognition models are also provided for people, faces, and eyes, too, and it can use additional custom models for other things.


Still can't draw hands correctly it looks like.


Eh it’s not a huge deal with stable diffusion because you can inpaint. So you mask out the bad hands and generate a few dozen iterations that merge perfectly with the rest of the image. You’re bound to get something that looks good.

With SD2.1 I'd generate 100 or so images using inpainting, and the "good fingers" hit rate was about 1%. If it's up to 10%, that'd be great, because generating 10 images takes just a few seconds on an A100.
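For reference, that loop amounts to roughly this in diffusers (a sketch; the checkpoint, prompts, and file names are just examples, with the mask white over the bad hands):

    # Sketch of the "reroll just the hands" loop: white pixels in the mask get
    # regenerated, everything else is kept. Keep the best-looking candidate.
    import torch
    from PIL import Image
    from diffusers import StableDiffusionInpaintPipeline

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")

    image = Image.open("portrait.png").convert("RGB")
    mask = Image.open("hands_mask.png").convert("L")  # white = area to regenerate

    for i in range(20):
        fixed = pipe(
            "detailed hand, five fingers",
            image=image,
            mask_image=mask,
            num_inference_steps=30,
        ).images[0]
        fixed.save(f"candidate_{i:02d}.png")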


Hands are generally a non-issue at this point. You can just inpaint them and use a negative prompt LoRA to get good hands in just a few attempts. That is, if you don't just get good enough hands right away. That happens surprisingly often.


I can't say for 1.0 but in 0.9 hands get fairly often rendered perfectly. It's not always right but it's way better than any other earlier release (where it's usually consistently wrong).


Both 0.9 and 1.0 seem to frequently merge fingers when hands are clasped; they don't make nearly as many other errors with them as earlier SD models did.


Not actually released in the API, unlike what they said.


It took several hours after it was announced before it was actually available in the API.


This explosion of AI-generated imagery will result in an explosion of millions of fake images, obviously. Perhaps in the short term this is fun, but in the long term we will lose a bit more scarcity, which is not that great in my opinion.

Isn't the best part of a meal eating after you've not had anything to eat for a while? The best part about a kiss that you've quenched the pain of missing your partner?

The best part of art is that you haven't seen anything good in a while?

Scarcity is an underappreciated gift to us, and the relative scarcity per capita is in a sense what drives us to connect with other people, so that we may be privileged to witness the occasional spark of creativity from a person, which in turn tells us about that person.

Although that sort of viewpoint has been declining for some time due to the intensely capitalistic squeezing of every sort of human endeavor, AI brings this to a whole new level.

I think if those making this software thought a bit about this, they might second-guess whether it is truly right to release it. Just a thought.


Enforcing artificial scarcity is idiotic and counter progressive. There will be other things that will continue to be uncommon that humans will continue to appreciate. This is what human progress looks like. Imagine someone said this when agriculture started up- “The great thing about fruits and vegetables is that they taste so sweet the few times we find them. We shouldn’t grow them in bulk”


AI Art models are completely dependent on human labor to function. "Out-competing" human generated images will damage the commons and make it harder to train these models over time as they push human creative labor out of the market, if we believe it's even competitive.

Personally I think this guy has a point and he's pointing to something that I don't believe a lot of ai art advocates have considered: the attention economy. There isn't actually any scarcity in the current market for artistic content. There are literally millions of people producing art every day, and the market is very winner take all. There are few artists who are able to support themselves on their work, and their skills are exceptional and specifically in demand. My theory is that the supply of AI generated content will be so vast, and the perception around it will be that it's low effort and low quality, that so called AI artists are going to have trouble distinguishing themselves in a market where they're saturating it and all using the same models. I think your perception of this market is flawed. Art is not a fungible good like food or clothing.


So what if art is devalued? We are hardwired to appreciate beauty so art of some form will always be sought. Obviously there is the matter of artists losing their livelihoods but that is also an inevitable outcome of progress and always has been.


Lol. I hate this industry sometimes. It's not obvious, to me at least, that our society should accept the automation of the production of culture. Maybe I'm a luddite or whatever lazy quip you'd like to use, but I prefer the story of a human mastering a skill and producing something beyond contextless aesthetic sludge.


Yeah, that's why I always laugh when I see people use cars instead of walking 100 miles. It's an inspiring story for people to do marathons, and cars devalue that.


Imagine talking about artistic expression like it's just a fucking car trip that should be automated.


I disagree. The problem is that AI will flood the market to an extent that humans will barely be able to keep up, moving from one AI-generated thing to the next, barely having time any more to spend any real effort on enjoying life.


There is a great chasm between "don't grow vegetables" and "grow them the way we do them today".

I wouldn't advocate not growing vegetables. But today we grow them in a monoculture for instant availability everywhere, and those monocultures are susceptible to disease and also are not terribly ecologically friendly. AI is like an ultimate monoculture of diseased fruits.

Also, I believe counter-progressive to be a good thing. Human beings should not progress in certain ways, as we don't have the wisdom to use the technology we have developed.

Humans in general cannot appreciate things very well, and computers and AI will only make it worse.


Seeing the "less art needs to exist" perspective is certainly a first time for me on this topic.


It would be a real shame if low effort AI spam buried people who have dedicated their lives to artistic expression.


The same could have been said when photoshop or CGI tools like blender replaced hand sculpting and hand painting but I think it hasn't been a net negative across the board (I think rather the opposite).


I believe it has. CGI at the beginning was okay, but like all technologies, humans could not resist bringing it to a high level of efficiency. Now all CGI movies are pretty bland and barely any effort is put into storytelling.


I want to appreciate your comment, but I can't.

Can you please chisel it on stone tablets for me?

That will really help me appreciate it.


A lot of downvotes. I can relate to it a little bit. During the beginning of covid I was in SE Asia at an Airbnb that didn't have a laundry machine - you generally don't need one in SE Asia because there are so many cheap per-kg laundry services around. When I had to hand-wash my clothes for the first month, I really appreciated having a laundry machine after moving to another Airbnb that had one - you take some things for granted.

But no, I wouldn't want to hand wash my laundry more often. For the same reason probably I still prefer using a lighter when having a BBQ than a flint.


Yes, humans take many things for granted. That's why it's better to experience a lack of things sometimes because we are just wired to move onto the next thing.

As for the downvotes, I don't mind. I try and present a critical view of technology, but I expect the downvotes because almost everyone here has the perspective that technology is a tool and progress is generally a good thing, which are two statements that I wholeheartedly reject.


I don't think that trying to convince people to starve themselves a little, as your opening analogy does, is good for your argument.


I didn't say starve. I just meant to take a break from eating (you know, like between meals?). AI and computer technology has removed the breaks between meals.


I think you have this backwards, Capitalism loves scarcity. Scarcity is what allows for supply and demand curves and profit-making opportunities, even better if you can control the scarcity. Capitalist entities are constantly attempting to use laws, technology, and market power to add scarcity to places where it didn't previously exist.


That is not the whole story. Capitalism likes the following process (if you can say capitalism "loves" anything, which isn't quite right; it's more like the abusers of capitalism who love this):

1. Create scarcity. 2. Flood the market to reap short-term gains with market-disrupting technology. 3. Create new scarcity by creating new products.


I'm pretty sure that lots of things end up being scarce in totalitarian forms of government... food, for one.



