Tried to get this running on my 2080 Ti (11GB VRAM) but I'm hitting OOM issues, so while the performance sounds better, I can't actually verify it myself since it doesn't run. Some of the PyTorch forks work on as little as 6GB of VRAM (or maybe even 4GB?), but it's always good to have implementations that optimize for different factors; this one seems to trade memory usage for raw generation speed.
Just breaking the attention matrix multiply into parts allows a significant reduction of memory consumption at minimal cost. There are variants out there that do that and more.
Short version: attention works as a matrix product that looks like s(QKᵀ)V, where QKᵀ is a large matrix but Q, K, V and the result are all small. You can break Q (and with it the result) into horizontal strips. The result is then the vertical concatenation of:
s(Q1*Kᵀ)*V
s(Q2*Kᵀ)*V
s(Q3*Kᵀ)*V
...
s(QN*Kᵀ)*V
Since you're reusing the memory for the computation of each block you can get away with much less simultaneous RAM use.
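For the curious, here's a rough sketch of that splitting in PyTorch (the function name and chunk count are mine; the actual forks also fold in masking, slicing over heads, etc.):

    import torch

    def chunked_attention(q, k, v, num_chunks=8):
        # Compute softmax(Q @ K^T) @ V one horizontal strip of Q at a time,
        # so only a (rows/num_chunks) x rows slice of the attention matrix
        # is alive at any moment.
        scale = q.shape[-1] ** -0.5
        outputs = []
        for q_chunk in q.chunk(num_chunks, dim=-2):  # strip of query rows
            att = torch.softmax(q_chunk @ k.transpose(-2, -1) * scale, dim=-1)
            outputs.append(att @ v)  # each strip uses the full K and V
        return torch.cat(outputs, dim=-2)  # vertical concatenation

    # Example: 4096 tokens, 64-dim heads. The full attention matrix would be
    # 4096 x 4096 per head; each strip here is only 512 x 4096.
    q = torch.randn(8, 4096, 64)
    k = torch.randn(8, 4096, 64)
    v = torch.randn(8, 4096, 64)
    out = chunked_attention(q, k, v)

The peak memory for the attention matrix drops roughly by the chunk factor while the arithmetic is unchanged; you only pay some extra kernel-launch overhead.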
Yeah, the problem is indeed in the attention computation.
You can do something like that but it's far from optimal.
From a memory-consumption perspective, the right way to do it is to never materialize the intermediate matrices.
You can do that with a custom op that computes
att = scaledAttention(Q, K, V) and the gradients dQ, dK, dV = scaledAttentionBackward(Q, K, V, att, datt).
The memory needed for these ops is the memory to store Q, K, V, att, dQ, dK, dV, datt, plus some extra temporary memory.
When you do the work to minimize memory consumption, this extra temporary memory is really small: about 6 * attention_horizon^2 * number_of_cores_running_in_parallel numbers.
But even though there is not much recomputation, this kernel won't run as fast because of its memory-access pattern, unless you spend some time manually optimizing it.
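Purely as an illustration, here's roughly what such a custom op looks like as a torch.autograd.Function (the names mirror the scaledAttention/scaledAttentionBackward ops above, but this sketch still materializes the attention matrix inside forward and backward; a real fused kernel would work blockwise and never store it):

    import torch

    class ScaledAttention(torch.autograd.Function):
        # Only Q, K, V are saved between forward and backward; the attention
        # weights are recomputed in backward instead of being kept alive.

        @staticmethod
        def forward(ctx, q, k, v):
            scale = q.shape[-1] ** -0.5
            att = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
            ctx.save_for_backward(q, k, v)
            return att @ v

        @staticmethod
        def backward(ctx, dout):
            q, k, v = ctx.saved_tensors
            scale = q.shape[-1] ** -0.5
            att = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
            dv = att.transpose(-2, -1) @ dout
            datt = dout @ v.transpose(-2, -1)
            # softmax backward: dS = att * (datt - rowsum(datt * att))
            ds = att * (datt - (datt * att).sum(dim=-1, keepdim=True))
            dq = ds @ k * scale
            dk = ds.transpose(-2, -1) @ q * scale
            return dq, dk, dv

    # usage: out = ScaledAttention.apply(q, k, v)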
The place to do it is at the level of the autodiff framework, i.e. TensorFlow or PyTorch, with low-level C++/CUDA code.
Anybody can write a custom kernel, but deploying, maintaining and distributing them is a nightmare. So the only people who could and should have done it are the TensorFlow and PyTorch guys.
In fact they probably have, but it's considered a strategic advantage and reserved for internal use only.
Mere mortals like us have to rely on workarounds (splitting matrices, KeOps, gradient checkpointing...) so as not to be penalized too much by the limited set of ops in out-of-the-box autodiff frameworks like TensorFlow or PyTorch.
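The gradient checkpointing workaround, for instance, is close to a one-liner in PyTorch (here wrapping a plain attention function; the sizes are just for illustration):

    import torch
    from torch.utils.checkpoint import checkpoint

    def attention(q, k, v):
        scale = q.shape[-1] ** -0.5
        att = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
        return att @ v

    q, k, v = (torch.randn(8, 4096, 64, requires_grad=True) for _ in range(3))
    # The 8 x 4096 x 4096 attention matrix is not kept for the backward pass;
    # it gets recomputed during backward, trading compute for memory.
    out = checkpoint(attention, q, k, v)
    out.sum().backward()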
PyTorch doesn't offer an in-place softmax, which contributes about 1 GiB of extra memory during inference (of Stable Diffusion). That said, none of these are significant improvements compared with just switching to FlashAttention inside the UNet model.
That's why I said "right now", since I feel that most people have moved from the one you linked to AUTOMATIC's fork by now. hlky's fork (the one you linked) was by far the most popular one until a couple of weeks ago, but problems with the main developer's attitude and a never-ending, issue-ridden migration from Gradio to Streamlit made it lose its popularity.
AUTOMATIC has the attention of most devs nowadays. When you see any new ideas come up, they usually appear in AUTOMATIC's fork first.
Just as another point of reference: I followed the Windows install and I'm running this on my 1060 with 6GB of memory. With no setting changes it takes about 10 seconds to generate an image. I often run with sampling steps up to 50, and that takes about 40 seconds per image.
They sure do. InvokeAI is a fork of the original repo CompVis/stable-diffusion and thus shares its fork counter. Those 4.1k forks are coming from CompVis/stable-diffusion, not InvokeAI.
Meanwhile AUTOMATIC1111/stable-diffusion-webui is not a fork itself, and has 511 forks.
Edit: there seems to be a more "full" version of the same work available here, made by one of the authors of the submission article: https://github.com/divamgupta/stable-diffusion-tensorflow