High-performance image generation using Stable Diffusion in KerasCV (keras.io)
356 points by tosh on Sept 28, 2022 | 86 comments

Tried to get this running on my 2080 Ti (11GB VRAM) but I'm hitting OOM issues. So while performance seems better, I can't actually verify that myself, as it doesn't run. Some of the PyTorch forks work on as little as 6GB of VRAM (or maybe even 4GB?), but it's always good to have implementations that optimize for various factors; this one seems to trade memory usage for raw generation speed.

Edit: there seems to be a more "full" version of the same work available here, made by one of the authors of the submission article: https://github.com/divamgupta/stable-diffusion-tensorflow

Just breaking the attention matrix multiply into parts allows a significant reduction of memory consumption at minimal cost. There are variants out there that do that and more.

Short version: Attention works as a matrix multiply that looks like this: s(QK)V where QK is a large matrix but Q,K,V and the result are all small. You can break Q and V into horizontal strips. Then the result is the vertical concatenation of:

Since you're reusing the memory for the computation of each block you can get away with much less simultaneous RAM use.
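
The strip idea above can be sketched in a few lines of NumPy. This is only an illustration, not the code any of the forks actually use, and it rests on assumptions the comment leaves implicit: `s` is taken to be a row-wise softmax, only Q is cut into row strips (each strip still needs all of K and V, because softmax normalizes across a full row), and the usual 1/sqrt(d) scaling is left out to match the s(QK)V notation. The function names and the `chunk` size are made up for the example.

```python
import numpy as np

def softmax(x):
    # Row-wise softmax, shifted by the row max for numerical stability.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_full(q, k, v):
    # Materializes the whole (n_q, n_k) score matrix at once.
    return softmax(q @ k.T) @ v

def attention_chunked(q, k, v, chunk=64):
    # Processes `chunk` query rows at a time, so only a (chunk, n_k)
    # block of scores exists at any moment.
    out = np.empty((q.shape[0], v.shape[1]), dtype=q.dtype)
    for start in range(0, q.shape[0], chunk):
        rows = slice(start, start + chunk)
        out[rows] = softmax(q[rows] @ k.T) @ v
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((256, 32))
k = rng.standard_normal((256, 32))
v = rng.standard_normal((256, 32))

# Both paths should agree up to floating-point rounding.
same = np.allclose(attention_full(q, k, v), attention_chunked(q, k, v))
```

With 256 query rows and `chunk=64`, the largest score block held at once is 64 x 256 instead of 256 x 256; the saving grows with sequence length.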

Yeah, the problem is indeed in the attention computation.

You can do something like that but it's far from optimal.

From a memory consumption perspective, the right way to do it is to never materialize the intermediate matrices.

You can do it by using a custom op that computes att = scaledAttention(Q,K,V) and the gradient dQ,dK,dV = scaledAttentionBackward(Q,K,V,att,datt)

The memory needed for these ops is the memory to store Q,K,V,att,dQ,dK,dV,datt + extra temporary memory.

When you do the work to minimize memory consumption, this extra temporary memory is really small: 6 * attention_horizon^2 * number_of_core_running_in_parallel numbers.

But even though there is not much recomputation, this kernel won't run as fast due to its memory access pattern, unless you spend some time manually optimizing it.

The place to do it is at the level of the autodiff framework, aka TensorFlow or PyTorch, with low-level C++/CUDA code.

Anybody can write a custom kernel, but deploying, maintaining and distributing them is a nightmare. So the only people that could and should have done it are the TensorFlow or PyTorch guys.

In fact they probably have, but it's considered a strategic advantage and reserved for internal use only.

Mere mortals like us have to use workarounds (splitting matrices, KeOps, gradient checkpointing...) so as not to be penalized too much by the limited ops of out-of-the-box autodiff frameworks like TensorFlow or Torch.

PyTorch doesn't offer an in-place softmax, which costs about 1GiB of extra memory for inference (of Stable Diffusion). That said, none of these are significant improvements compared to just switching to FlashAttention inside the UNet model.
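
To make "in-place softmax" concrete, here is a minimal sketch in NumPy rather than PyTorch. It is illustrative only: `softmax_inplace` is a made-up name, and this says nothing about how a fused GPU kernel would actually be written. The point is just that the result can overwrite the input buffer, so no second array the size of the score matrix is needed.

```python
import numpy as np

def softmax_inplace(x):
    # Overwrites `x` with its row-wise softmax. The only extra
    # allocations are the per-row max and sum, each of shape (rows, 1).
    x -= x.max(axis=-1, keepdims=True)
    np.exp(x, out=x)
    x /= x.sum(axis=-1, keepdims=True)
    return x

scores = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]], dtype=np.float32)
result = softmax_inplace(scores)  # `result` is the same array object as `scores`
```

An out-of-place version would hold both the scores and the probabilities at once, which is roughly where the extra memory mentioned above would come from.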

There are forks that even work on 1.8GB of VRAM! They work great on my GTX 1050 2GB.

This is by far the most popular and active right now: https://github.com/AUTOMATIC1111/stable-diffusion-webui

> This is by far the most popular and active right now: https://github.com/AUTOMATIC1111/stable-diffusion-webui

While technically the most popular, I wouldn't call it "by far". This one is a very close second (500 vs 580 forks): https://github.com/sd-webui/stable-diffusion-webui/tree/dev

That's why I said "right now", since I feel that most people have moved from the one you linked to AUTOMATIC's fork by now. hlky's fork (the one you linked) was by far the most popular one until a couple of weeks ago, but some problems with the main developer's attitude and a never-ending migration from Gradio to Streamlit filled with issues made it lose its popularity.

AUTOMATIC has the attention of most devs nowadays. When you see any new ideas come up, they usually appear in AUTOMATIC's fork first.

Just as another point of reference: I followed the Windows install and I'm running this on my 1060 with 6GB memory. With no setting changes it takes about 10 seconds to generate an image. I often run with sampling steps up to 50, and that takes about 40 seconds per image.

While AUTOMATIC is certainly popular, calling it the most active/popular would be ignoring the community working on Invoke. Forks don’t lie.


> Forks don’t lie.

They sure do. InvokeAI is a fork of the original repo CompVis/stable-diffusion and thus shares its fork counter. Those 4.1k forks are coming from CompVis/stable-diffusion, not InvokeAI.

Meanwhile AUTOMATIC1111/stable-diffusion-webui is not a fork itself, and has 511 forks.

Welp - TIL.

Thanks for the correction.

Any idea on how to count forks of a downstream fork? If anyone would know... :)

Subjectively, AUTOMATIC has taken over -- I have not heard of invoke yet but will check it out.

The only reason to use it imo has been if you need mac/m1 support, but that's probably in other forks by now

What settings and repo are you using for GTX 1050 with 2GB?

I'm using the one I linked in my original post: https://github.com/AUTOMATIC1111/stable-diffusion-webui

The only command line argument I'm using is --lowvram, and usually generate pictures at the default settings at 512x512 image size.

You can see all the command line arguments and what they do here: https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki...

I guess it could even work on a Jetson Nano (4GB) then. I run models of ~1.6GB on it 24/7; I'll give this a try.

This needs Windows 10/11 though?

Nope. There are instructions for Windows, Linux and Apple Silicon in the readme: https://github.com/AUTOMATIC1111/stable-diffusion-webui

There's also this fork of AUTOMATIC1111's fork, which also has a Colab notebook ready to run, and it's way, way faster than the KerasCV version: https://github.com/TheLastBen/fast-stable-diffusion

(It also has many, many more options and some nice, user-friendly GUIs. It's the best version for Google Colab!)

Brilliant thanks.

Bonus points for this article being one of the clearest explanations for how Stable Diffusion works that I've seen to-date.

Not familiar with the other authors, but François Chollet (author of Deep Learning with Python) is truly gifted in the art of pedagogy.

This is _markedly_ faster than the PyTorch versions I've seen (nothing against the library, just categorizing the implementations). It would be nice to see this including the little quality of life additional models (eye fixes, upscaling, etc.), but I suspect the optimizations are transferrable.

Either way, getting 3 images for 25 iterations under 10 seconds (quick Colab test, which is where I've taken to comparing these things) is just ridiculously faster.

Which GPU did you test on Colab? Are you comparing with one of the fp16 PyTorch versions? Their test shows little improvement on V100.

PyTorch is now quite a bit more popular than Keras in research-type code (except when it comes from Google) so I don't know if these enhancements will get ported. This port was done by people working on Keras which is kind of telling - there isn't a lot of outside interest.

This is not true, the initial Keras port of the model was done by Divam Gupta who is not affiliated with Keras or Google. He works at Meta.

The benchmark in the article uses mixed precision (and equivalent generation settings) for both implementations, it's a fair benchmark.

In the latest StackOverflow global developer survey, TensorFlow had 50% more users than PyTorch.

Two Keras creators are listed as authors on this post. If they were not involved, this should be specified. I specifically talked about research and StackOverflow is not in any way representative of what's used. Do you disagree that the majority of neural net research papers now only have PyTorch implementations, not TensorFlow? Also, according to Google Trends, PyTorch is more popular: https://trends.google.com/trends/explore?geo=US&q=pytorch,te.... BTW, I would love it if TF made a strong comeback, it's always better to have two big competing frameworks and I have some issues with PyTorch, including with its performance.

> In the latest StackOverflow global developer survey, TensorFlow had 50% more users than PyTorch.

It also doesn't help that PyTorch has its own discussion forum [1] where most pytorch questions end up.

[1]: https://discuss.pytorch.org/

Should we expect people not working on keras to have the interest and ability to get it to work on keras?

If these people have existing Keras code they want to integrate or they are interested in developing it further in Keras, then it shouldn't require any insider knowledge to create a Keras version of a small but popular open-source project like this. I am very sure we'd get a PyTorch version made by outsiders quickly if Stable Diffusion was originally released in Keras/TF.

What is your definition of outsider?

We got a Keras version made by Divam Gupta very quickly after Stable Diffusion was released.

Is he not an outsider?

From what I can tell this Keras version was just released (the date on the post is Sep. 25) and the first author listed is the creator of Keras. Is this incorrect? I am not familiar with Divam Gupta and I would consider outsiders to be people not paid by Google.



Now they are working together. That may be “telling” to you but I’m not sure why that should cast a negative light on Keras, really.

I didn't say that it casts a negative light on Keras. Just on its popularity among outsiders. There are thousands of great libraries out there that are much less popular than Keras or PyTorch. And BTW, JAX is a useful Google-created framework that's growing in popularity among researchers and pushed PyTorch to improve (functorch), so I have nothing against Google projects.

The reason why we’re having this discussion is that what you call a Keras outsider ported Stable Diffusion to Keras last week.

It’s hard to understand how that can say anything negative about the popularity of Keras among outsiders.

So why are Keras creators listed as authors on this post and why is it on Keras' official site? Compare this to hundreds of PyTorch SD forks that have been thrown up on GitHub.

The OP was wondering whether additional enhancements will also be ported and that's what I was responding to. It's simply much less likely that a new paper will get a Keras implementation than a PyTorch implementation.

> So why are Keras creators listed as authors on this post and why is it on Keras' official site?

Because when Keras creators learned about the port done by what you define as an outsider they thought that it was cool - and that it would be nice to make it part of the KerasCV toolbox.

What answer did you expect?

You understand that it was "an outsider" who ported Stable Diffusion to Keras, right?


Aug 22, 2022 [day 1 of the SD era] Public release of stable diffusion

Sep 17, 2022 [day 27 of the SD era] Keras outsider Divam Gupta announces a Keras port


"Stable Diffusion implemented using @Tensorflow and #Keras. [...] Thanks @fchollet and team for building this amazing framework which makes it easy to implement a model like Stable Diffusion."

Sep 19, 2022 [day 29 of the SD era, 2 days after port announcement] François Chollet publishes a Twitter thread about the port (and his own improvements on a fork - two dozen commits dated Sep 18-21 will be pushed upstream)


"Huge thanks to @divamgupta for creating this port! This is top-quality work that will benefit everyone doing creative AI. I'm always amazed by the velocity of the open-source community"

Sep 25, 2022 [day 35 of the SD era, 8 days after port] Divam Gupta talks about the upcoming KerasCV integration


"Last week I implemented Stable Diffusion using Keras / Tensorflow. Now its almost integrated in KerasCV thanks to @fchollet and team. This is the power of open source collaboration."

Sep 27, 2022 [day 37 of the SD era, 10 days after port announcement] Official release of the Stable Diffusion implementation in KerasCV - and TFA (dated Sep 25)


"Stable Diffusion is now available directly in KerasCV! [...] Many thanks to all those who made this implementation possible, in particular @divamgupta @luke_wood_ml and of course the creators of the original Stable Diffusion models!"

Is this faster even after applying the optimizations that reduce VRAM usage? (some of which the Keras version seems to lack)

"Note that when running on a M1 MacBookPro, you should not enable mixed precision, as it is not yet well supported by Apple's Metal runtime"

It is a bit sad if this is just a closed software issue that cannot be fixed :(

Mixed precision won't do anything on Apple Silicon anyway since there is no performance advantage to using FP16 (aside from decreasing register pressure and RAM bandwidth which won't happen here as data is FP32 to start with).

Is it really that sad? Closed software/hardware won't get support (official or community) for things until the maintainer of the software adds it, and people who buy that kind of hardware are more than aware of this pitfall (and in fact, sometimes see it as a benefit too).

I'm a new MacOS user and, while I did anticipate some of these issues, I do often find myself surprised when running into them. This was one such surprise I hit recently

The otter examples highlight something you can't control using these things: the 'eats shoots and leaves' phenomenon.

The prompt was "A cute otter in a rainbow whirlpool holding shells, watercolor"

Seems like the otter should be holding shells, the way a normal human parses it.

The tool showed the otter 'in holding-shells', which are shells that hold otters apparently. Also some random shells strewn about, as the technique is sensitive to spurious detail sprouting up from single words.

Until the tool permits some kind of syntactic diagramming or so forth, we'll not be able to control for this.

Just the other day here, I saw a picture of a fork and some plastic mushrooms. The prompt was 'plastic eating mushrooms' which was ambiguous even to humans. The tool chose to illustrate the subclass of mushrooms 'eating-mushrooms' (as opposed to poison mushrooms or decorative mushrooms I suppose) made of plastic.

When we're playing around this can seem whimsical and artistic. But a graphic designer might want some semblance of control over the process.

Not sure how a solution would work.

This is the compositionality problem: the language model sometimes doesn't quite know how to put the words together. Better language models will help in the future; in the meantime you can give it a helping hand by prompt engineering or using img2img.

Graphic designers lean on img2img in their workflows more than txt2img, as that gives you the control you speak of.

Do you know of examples of graphic designers who have shared their img2img workflows online?

Hang out in reddit.com/r/stablediffusion, people have posted workflows quite a bit.

My favorite is when you do “<whatever> bla, bla, bla, wearing a t-shirt by <artist>“ and it gives you an image of <whatever> wearing a t-shirt with a print in the style of the artist. Which adds extra dimensions to play with so isn’t all that bad.



"Compositional generation. Our method can compose multiple diffusion models during inference and generate images containing all the concepts described in the inputs without further training"

There must be a way to disambiguate the prompt, maybe by adding some negative factor to the CLIP guidance that penalizes possible misunderstandings of the prompt.

Is the H5 file type that much different than whatever the Pytorch versions are using?

The model is loaded from Huggingface during the instantiation of the stable diffusion class. It is loaded as an H5 file which I believe is unique to Keras[0]. I don't have any experience with Keras so I can't say if that is good or bad. I wanted to see where they were getting the weights as the blog post didn't demonstrate an explicit loading function/call like Pytorch.

Gonna run it and see... although I have like 40GB of stable diffusion weights on my computer now.

[0] https://github.com/keras-team/keras-cv/blob/master/keras_cv/...

Has anyone tried running this with an AMD card on Mac? At first glance it's able to run on Metal (given the M1 compatibility)...

I have a mediocre GPU but a fast CPU (with a lot of RAM). Would I see improvements there?

I guess I should give it a try.

On an Intel MacBook Pro 2020, CPU-only, the original one[1] using PyTorch utilized only one core. A TensorFlow implementation[2] with oneDNN support, which utilized most of the cores, ran at ~11sec/iteration. Another, OpenVINO-based implementation[3] ran at ~6.0sec/iteration.

[1] https://github.com/CompVis/stable-diffusion/

[2] https://github.com/divamgupta/stable-diffusion-tensorflow/

[3] https://github.com/bes-dev/stable_diffusion.openvino/

Yes, I use [3] and I get 2.4s/iter on my 10 core machine. I was wondering if keras would give additional help here. I'll have to try I guess.

tried it yesterday, on intel i9 macbook pro it takes about 300 seconds per image.

You mean the Keras version? How does it compare to the original one? Currently on my 10850K I get 2.4s/iteration, which is borderline usable. I haven't managed (nor tried very hard) to get the CUDA version working on my 1070; I expect it to be a little better, but I don't want to fight with RAM issues.

How many steps did you perform?

I tried some and found no major differences after 16 steps or so with given random seed.

Not necessarily my expertise but if as stated by the article, 2 lines of code can already get a 2x performance gain, what more can be done to improve performance in the coming years?

It's not two lines of code... It's 2 lines that enable tens of thousands of lines of library code by invoking a new optimizer...
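
For reference, the "two lines" in question look roughly like the sketch below. The two calls (`set_global_policy("mixed_float16")` and `StableDiffusion(jit_compile=True)`) are the ones the article describes; the wrapper functions `build_fast_model` and `generate` are hypothetical helpers added here for illustration, and the defaults for `batch_size` and `num_steps` are just example values. Treat it as a sketch to check against whatever KerasCV version you have installed.

```python
def build_fast_model():
    # Imports are kept inside the function so that merely defining it
    # does not require TensorFlow or KerasCV to be installed.
    import keras_cv
    from tensorflow import keras

    # Line 1: run compute in float16 while keeping variables in float32.
    keras.mixed_precision.set_global_policy("mixed_float16")
    # Line 2: ask for the model to be compiled with XLA.
    return keras_cv.models.StableDiffusion(jit_compile=True)

def generate(model, prompt, batch_size=3, num_steps=25):
    # Hypothetical convenience wrapper around the model's text_to_image call.
    return model.text_to_image(prompt, batch_size=batch_size, num_steps=num_steps)
```

Usage would be something like `generate(build_fast_model(), "a cute otter")`. As the comment above says, the heavy lifting happens inside the library and the XLA compiler, not in these two lines.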

I'm curious whether this really is "the fastest model yet"; there are PyTorch optimizations as well.

Something like global optimization has been done in pytorch, here's a blog about it: https://www.photoroom.com/tech/stable-diffusion-25-percent-f...

Mixed precision seems pretty much default looking at a few Stable Diffusion notebooks.

More intriguing, there's also a more local optimization that makes pytorch faster: https://www.photoroom.com/tech/stable-diffusion-100-percent-...

Unless it's already there, that last one would be interesting to add to keras.

All in all this machine learning ecosystem is wild. As a software dev, things like cache locality and preferring computation over memory access are basic optimizations, yet in machine learning they seem widely disregarded; I've seen models happily swapping between GPU and system memory to do NumPy calculations.

Hopefully stable diffusion changes things, the work towards optimizations is there, it just seems often disregarded. As stable diffusion is one popular open model that, when optimized, can be run locally (and not as saas, where you just add extra compute power, which seems cheaper than engineers) and has a lot of enthusiasm behind it, it might just be the spark that makes optimization sexy again.

Does this run on AMD?

A problem I see is that a lot of times everything works fine on ROCm+HIP, but since Nvidia dominates the machine learning market (and thus most researchers run Nvidia), most forks don't bother checking and just advertise compatibility with Nvidia and sometimes Apple M1.

Problem is, AMD GPUs are much cheaper!

Well, high-end stuff is always on Nvidia, and Apple Silicon seems to get some love because of its unified memory, which makes it possible in the first place, plus its popularity among developers.

AMD seems to be popular among gamers on a budget, and the budget cards often don't have the VRAM required by default. So, AMD seems to be in this weird place where the people who can make it work don't care.

For what it's worth, at the consumer level AMD cards -- at least recently -- have tended to have more VRAM than Nvidia cards. My 3080 Ti, which I bought for $1400 (though it now goes for ~$1k), has less RAM (12GB) than a 6800 XT that you can get for $600 (16GB).

> Problem is, AMD GPUs are much cheaper!

Are they? I believe Nvidia (consumer) gpus have better price/performance than amd for AI.

I don't know about AI performance (does this happen only because of the overhead of providing CUDA through rocm+HIP?), but I was just checking and at least in my country (Brazil), for any given memory size (12GB, 8GB, 4GB) I can find cheaper AMD GPUs than NVidia GPUs

Here I'm considering that the main constraint is VRAM and while stable diffusion now runs even on GPUs with 2GB RAM, there's always new developments that require more VRAM (for example, Dreambooth requires 12GB as of today)

Maybe for AI? For other tasks, especially gaming, they punch well above their weight relative to Nvidia (though they lack features in comparison). It's also possible to get a 16GB card for much cheaper from AMD than Nvidia.

Nice! I'll take anything over the huggingface version - the API design by huggingface where CLIP is in transformers, everything else is in diffusers...not a great developer experience (unless you're the type of person that likes their Python to look like half-baked J2EE).

I've been seeing a lot of HN submissions and images generated by stable diffusion but I've yet to actually toy around with it due to a lack of time, what fork appears to produce the best results for people that don't have a 3090s worth of VRAM? (Currently only have a 3060 with 12GB which I thought would be enough for almost anything haha)

On a 16GB 8c8g MacBook Air M1, the PyTorch implementation takes about 3.6s/step, which is about 3 minutes per image with the default parameters. I wonder how much faster this would be. If there's anyone out there with a similar system who wants to compare, could you please write up your findings?

Not M1 comparable, but I'm working on testing various GPU vs M1 comparisons, with a few accessible cloud providers. My impression is times should be the same, but it's nice to hear other real-world stats for M1 with SD. Makes me really want to rent the Hetzner M1 now.

Which repo or build are you using BTW, is it the one related to this readme?


>Which repo or build are you using BTW, is it the one related to this readme? https://github.com/magnusviri/stable-diffusion/blob/main/REA...

Yes, this one. However it was like a month ago I think, so speeds might have improved. I'm getting ~2.2s/step with another implementation: https://news.ycombinator.com/item?id=33006447

Wow, that sounds like a good improvement.

I am also wondering, do you follow the general advice of 1 iteration and 1 sample, for example:

--n_samples 1 --n_iter 1 (when referencing commands using txt2img.py)

I figure you could wait a bit for things to process going further, but curious just if you're getting results like that with higher sample/iter settings.

I usually go with the default parameters.

I would love to see it, but this file is not accessible.

Sorry about that, web link rot sure is real eh.

This is an example of the original file: https://github.com/magnusviri/stable-diffusion/blob/79ac0f34...

Which seems to have been renamed, and cleaned up a bit here: https://github.com/magnusviri/stable-diffusion/blob/main/doc...

However, per the note on the magnusviri repo, the following repo should be used for a stable set of this SD Toolkit: https://github.com/invoke-ai/InvokeAI

with instructions here https://github.com/invoke-ai/InvokeAI/blob/main/docs/install...

I've not tried it, but this approach apparently takes 10-20s per image?


I just gave it a spin: it took 1 min 52 sec for a 50-step image, which is ~2.2s/step. It seems faster than my original installation (which might also have improved in speed, as it was at a very beta stage when I tried it) but definitely not 20 seconds for a 50-step image at 512x512 resolution.

Maybe they use lower parameters.


50 steps at 256x256 resolution took 55 seconds.

50 steps at 768x768 resolution took 8 min, exactly.

PS: my Macbook Air is modified with thermal pads, it takes a bit longer to start throttling than usual. Either way, it's very dependent on the ambient temperature.
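
A quick way to read the three timings above is to normalize them per step and compare how time grows against how the pixel count grows. The numbers below are just the ones reported in this comment (1 min 52 sec taken as 112 s); the variable names are made up for the sketch.

```python
# Timings reported above for 50 steps on the M1 MacBook Air.
steps = 50
seconds = {256: 55, 512: 112, 768: 480}  # image side in px -> total seconds

per_step = {side: total / steps for side, total in seconds.items()}

# Growth relative to the 512x512 run: pixels vs. wall-clock time.
pixel_ratio = {side: (side / 512) ** 2 for side in seconds}
time_ratio = {side: total / seconds[512] for side, total in seconds.items()}

for side in seconds:
    print(f"{side}x{side}: {per_step[side]:.2f} s/step, "
          f"{pixel_ratio[side]:.2f}x pixels, {time_ratio[side]:.2f}x time")
```

That works out to 1.10, 2.24 and 9.60 s/step. Going from 512 to 768 is 2.25x the pixels but about 4.3x the time, so cost grows faster than pixel count. One plausible reason (a guess, not something measured here) is that attention cost grows roughly quadratically with the number of latent positions, possibly combined with memory pressure or throttling.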

Very interesting performance. Also a very good write-up. Can't wait to try this.

I don't quite understand the benefit of mixed precision.

It seems like using high precision is useful for training, but if not training, why not just use float16 weights and save the memory?

Converting weights to float16 after training will reduce quality/accuracy whereas mixed precision has a negligible effect on quality/accuracy and dramatically improves performance.

If you really just want to save memory, there's plenty of other low hanging fruit. It's just not a priority for most devs since mid tier GPUs start at 10GB whereas a typical model only has 0.5GB weights. Activations and intermediate calculations use way more memory.
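
One way to see what plain float16 storage gives up is the small NumPy sketch below. It is not a measurement of Stable Diffusion quality, only an illustration of float16 rounding; the example weight value is hypothetical. For context, Keras' `mixed_float16` policy does its computations in float16 but keeps variables in float32, which is the distinction being drawn above.

```python
import numpy as np

# float16 has an 11-bit significand, so above 2048 it can no longer
# represent every integer: adding 1 to 2048 rounds back to 2048.
lost_fp16 = np.float16(2048) + np.float16(1)
kept_fp32 = np.float32(2048) + np.float32(1)

# Casting a (hypothetical) stored weight down to float16 discards the
# small part that float32 was able to keep.
w32 = np.float32(1.0001)
w16 = np.float16(w32)
rounding_error = float(w32) - float(w16)
```

Here `lost_fp16` stays at 2048 while `kept_fp32` is 2049, and `w16` collapses to exactly 1.0. Whether errors of this size matter for a given model is an empirical question; the sketch only shows where they come from.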

You usually can. But it can take some work if you're using any libraries that expect FP32 and it might be slower, depending on the GPU. The FP16 support isn't quite as good as FP32.

Can this be used to train your own model? I have a moderately large medical image dataset that I would like to try this with for data augmentation.

This is amazing! I am more used to TF so very happy to see this!

Has anyone got a suggestion on how to fine tune this model?

someone should compare results with just doing a keyword search on deviantart

How do I deploy this? Can someone offer some guidance please?


