Tried to get this running on my 2080 Ti (11GB VRAM) but I'm hitting OOM issues, so while the performance sounds better, I can't actually verify it myself since it doesn't run. Some of the PyTorch forks work on as little as 6GB of VRAM (or maybe even 4GB?), but it's always good to have implementations that optimize for different factors; this one seems to trade memory usage for raw generation speed.
Just breaking the attention matrix multiply into parts allows a significant reduction of memory consumption at minimal cost. There are variants out there that do that and more.
Short version: attention works as a matrix product that looks like s(QKᵀ)V, where QKᵀ is a large matrix but Q, K, V and the result are all small. You can break Q (and with it the result) into horizontal strips. The result is then the vertical concatenation of:
s(Q1*Kᵀ)*V
s(Q2*Kᵀ)*V
s(Q3*Kᵀ)*V
...
s(QN*Kᵀ)*V
Since you're reusing the memory for the computation of each block you can get away with much less simultaneous RAM use.
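For the curious, here's a rough sketch of that splitting in PyTorch (the function name and chunk count are mine; the actual forks also fold in masking, slicing over heads, etc.):

    import torch

    def chunked_attention(q, k, v, num_chunks=8):
        # Compute softmax(Q @ K^T) @ V one horizontal strip of Q at a time,
        # so only a (rows/num_chunks) x rows slice of the attention matrix
        # is alive at any moment.
        scale = q.shape[-1] ** -0.5
        outputs = []
        for q_chunk in q.chunk(num_chunks, dim=-2):  # strip of query rows
            att = torch.softmax(q_chunk @ k.transpose(-2, -1) * scale, dim=-1)
            outputs.append(att @ v)  # each strip uses the full K and V
        return torch.cat(outputs, dim=-2)  # vertical concatenation

    # Example: 4096 tokens, 64-dim heads. The full attention matrix would be
    # 4096 x 4096 per head; each strip here is only 512 x 4096.
    q = torch.randn(8, 4096, 64)
    k = torch.randn(8, 4096, 64)
    v = torch.randn(8, 4096, 64)
    out = chunked_attention(q, k, v)

The peak memory for the attention matrix drops roughly by the chunk factor while the arithmetic is unchanged; you only pay some extra kernel-launch overhead.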
Yeah, the problem is indeed in the attention computation.
You can do something like that but it's far from optimal.
From a memory-consumption perspective, the right way to do it is to never materialize the intermediate matrices.
You can do that with a custom op that computes
att = scaledAttention(Q, K, V) and the gradients dQ, dK, dV = scaledAttentionBackward(Q, K, V, att, datt).
The memory needed for these ops is the memory to store Q, K, V, att, dQ, dK, dV, datt, plus some extra temporary memory.
When you do the work to minimize memory consumption, this extra temporary memory is really small: about 6 * attention_horizon^2 * number_of_cores_running_in_parallel numbers.
But even though there is not much recomputation, this kernel won't run as fast because of its memory-access pattern, unless you spend some time manually optimizing it.
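Purely as an illustration, here's roughly what such a custom op looks like as a torch.autograd.Function (the names mirror the scaledAttention/scaledAttentionBackward ops above, but this sketch still materializes the attention matrix inside forward and backward; a real fused kernel would work blockwise and never store it):

    import torch

    class ScaledAttention(torch.autograd.Function):
        # Only Q, K, V are saved between forward and backward; the attention
        # weights are recomputed in backward instead of being kept alive.

        @staticmethod
        def forward(ctx, q, k, v):
            scale = q.shape[-1] ** -0.5
            att = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
            ctx.save_for_backward(q, k, v)
            return att @ v

        @staticmethod
        def backward(ctx, dout):
            q, k, v = ctx.saved_tensors
            scale = q.shape[-1] ** -0.5
            att = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
            dv = att.transpose(-2, -1) @ dout
            datt = dout @ v.transpose(-2, -1)
            # softmax backward: dS = att * (datt - rowsum(datt * att))
            ds = att * (datt - (datt * att).sum(dim=-1, keepdim=True))
            dq = ds @ k * scale
            dk = ds.transpose(-2, -1) @ q * scale
            return dq, dk, dv

    # usage: out = ScaledAttention.apply(q, k, v)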
The place to do it is at the level of the autodiff framework, i.e. TensorFlow or PyTorch, with low-level C++/CUDA code.
Anybody can write a custom kernel, but deploying, maintaining and distributing them is a nightmare. So the only people who could and should have done it are the TensorFlow and PyTorch guys.
In fact they probably have, but it's considered a strategic advantage and reserved for internal use only.
Mere mortals like us have to rely on workarounds (splitting matrices, KeOps, gradient checkpointing...) so as not to be penalized too much by the limited set of ops in out-of-the-box autodiff frameworks like TensorFlow or PyTorch.
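The gradient checkpointing workaround, for instance, is close to a one-liner in PyTorch (here wrapping a plain attention function; the sizes are just for illustration):

    import torch
    from torch.utils.checkpoint import checkpoint

    def attention(q, k, v):
        scale = q.shape[-1] ** -0.5
        att = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
        return att @ v

    q, k, v = (torch.randn(8, 4096, 64, requires_grad=True) for _ in range(3))
    # The 8 x 4096 x 4096 attention matrix is not kept for the backward pass;
    # it gets recomputed during backward, trading compute for memory.
    out = checkpoint(attention, q, k, v)
    out.sum().backward()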
PyTorch doesn't offer an in-place softmax, which contributes about 1 GiB of extra memory during inference (of Stable Diffusion). That said, none of these are significant improvements compared with just switching to FlashAttention inside the UNet model.
That's why I said "right now", since I feel that most people have moved from the one you linked to AUTOMATIC's fork by now. hlky's fork (the one you linked) was by far the most popular one until a couple of weeks ago, but problems with the main developer's attitude and a never-ending, issue-ridden migration from Gradio to Streamlit made it lose its popularity.
AUTOMATIC has the attention of most devs nowadays. When you see any new ideas come up, they usually appear in AUTOMATIC's fork first.
Just as another point of reference: I followed the Windows install and I'm running this on my 1060 with 6GB of memory. With no setting changes it takes about 10 seconds to generate an image. I often run with sampling steps up to 50, and that takes about 40 seconds per image.
They sure do. InvokeAI is a fork of the original repo CompVis/stable-diffusion and thus shares its fork counter. Those 4.1k forks are coming from CompVis/stable-diffusion, not InvokeAI.
Meanwhile AUTOMATIC1111/stable-diffusion-webui is not a fork itself, and has 511 forks.
Edit: there seems to be a more "full" version of the same work available here, made by one of the authors of the submission article: https://github.com/divamgupta/stable-diffusion-tensorflow