
Tried to get this running on my 2080 Ti (11GB VRAM) but I'm hitting OOM issues. So while performance seems better, I can't actually verify that myself since it doesn't run. Some of the PyTorch forks work on as little as 6GB of VRAM (or maybe even 4GB?). Still, it's always good to have implementations that optimize for different factors; this one seems to trade memory usage for raw generation speed.

Edit: there seems to be a more "full" version of the same work available here, made by one of the authors of the submission article: https://github.com/divamgupta/stable-diffusion-tensorflow




Just breaking the attention matrix multiply into parts allows a significant reduction of memory consumption at minimal cost. There are variants out there that do that and more.

Short version: attention works as a matrix multiply that looks like this: s(Q*K)*V, where Q*K is a large matrix but Q, K, V and the result are all small (s is a row-wise softmax). You can break Q into horizontal strips; each strip still uses the full K and V. Then the result is the vertical concatenation of:

    s(Q1*K)*V
    s(Q2*K)*V
    s(Q3*K)*V
    ...
    s(QN*K)*V
Since you're reusing the memory for the computation of each block you can get away with much less simultaneous RAM use.
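
Something like this in PyTorch (a rough sketch of the strip-wise trick above; the function name and block size are mine, and it assumes unbatched 2-D Q, K, V):

    import torch

    def chunked_attention(Q, K, V, block=1024):
        # Compute softmax(Q @ K.T * scale) @ V one strip of Q at a time,
        # so the largest live intermediate is (block, n) rather than (n, n).
        scale = Q.shape[-1] ** -0.5
        outs = []
        for Qi in Q.split(block, dim=0):
            attn_i = torch.softmax(Qi @ K.T * scale, dim=-1)  # (block, n), freed each iteration
            outs.append(attn_i @ V)                           # (block, d)
        return torch.cat(outs, dim=0)                         # same result as the unsplit version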


Yeah, the problem is indeed in the attention computation.

You can do something like that but it's far from optimal.

From a memory consumption perspective, the right way to do it is to never materialize the intermediate matrices.

You can do that by using a custom op that computes att = scaledAttention(Q, K, V) and the gradient dQ, dK, dV = scaledAttentionBackward(Q, K, V, att, datt).

The memory needed for these ops is the memory to store Q, K, V, att, dQ, dK, dV, datt, plus some extra temporary memory.

When you do the work to minimize memory consumption, this extra temporary memory is really small: 6 * attention_horizon^2 * number_of_cores_running_in_parallel numbers.

But even though there is not much recomputation, this kernel won't run as fast due to the memory access pattern, unless you spend some time manually optimizing it.
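
A rough pure-PyTorch sketch of that kind of custom op (names and tile size are mine; a real implementation would be a fused CUDA kernel and would also tile over K/V with an online softmax, so the temporary really is per-core attention_horizon^2):

    import torch

    class ScaledAttention(torch.autograd.Function):
        # Q-row tile size: the largest temporary is (BLOCK, n) rather than (n, n),
        # and nothing of size (n, n) is ever kept between forward and backward.
        BLOCK = 1024

        @staticmethod
        def forward(ctx, Q, K, V):
            scale = Q.shape[-1] ** -0.5
            out = torch.cat([torch.softmax(Qi @ K.T * scale, dim=-1) @ V
                             for Qi in Q.split(ScaledAttention.BLOCK, dim=0)], dim=0)
            ctx.save_for_backward(Q, K, V)  # only the small matrices are saved
            return out

        @staticmethod
        def backward(ctx, dout):
            Q, K, V = ctx.saved_tensors
            scale = Q.shape[-1] ** -0.5
            dQ, dK, dV = torch.zeros_like(Q), torch.zeros_like(K), torch.zeros_like(V)
            for Qi, dQi, douti in zip(Q.split(ScaledAttention.BLOCK, dim=0),
                                      dQ.split(ScaledAttention.BLOCK, dim=0),
                                      dout.split(ScaledAttention.BLOCK, dim=0)):
                attn_i = torch.softmax(Qi @ K.T * scale, dim=-1)  # recomputed, never stored
                dV += attn_i.T @ douti
                dattn_i = douti @ V.T
                # row-wise softmax backward
                dscores_i = attn_i * (dattn_i - (dattn_i * attn_i).sum(dim=-1, keepdim=True))
                dQi.copy_(dscores_i @ K * scale)
                dK += dscores_i.T @ Qi * scale
            return dQ, dK, dV

    att = ScaledAttention.apply  # usage: att(Q, K, V)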

The place to do it is at the level of the autodiff framework, i.e. TensorFlow or PyTorch, with low-level C++/CUDA code.

Anybody can write a custom kernel, but deploying, maintaining and distributing them is a nightmare. So the only people who could and should have done it are the TensorFlow or PyTorch guys.

In fact they probably have, but it's considered a strategic advantage and reserved for internal use only.

Mere mortals like us have to use workarounds (splitting matrices, KeOps, gradient checkpointing...) so we're not penalized too much by the limited ops of out-of-the-box autodiff frameworks like TensorFlow or PyTorch.
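
Gradient checkpointing, at least, is a one-liner with the stock PyTorch API (a sketch; attention_block and the sizes are just stand-ins for whatever computes s(Q*K)*V):

    import torch
    from torch.utils.checkpoint import checkpoint

    def attention_block(Q, K, V):
        scale = Q.shape[-1] ** -0.5
        return torch.softmax(Q @ K.T * scale, dim=-1) @ V

    n, d = 4096, 64
    Q = torch.randn(n, d, requires_grad=True)
    K = torch.randn(n, d, requires_grad=True)
    V = torch.randn(n, d, requires_grad=True)

    # The (n, n) softmax output is dropped after forward and recomputed during
    # backward, trading a bit of compute for a lot of memory.
    out = checkpoint(attention_block, Q, K, V)
    out.sum().backward()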


PyTorch doesn't offer an in-place softmax, which contributes about 1 GiB of extra memory for inference (of Stable Diffusion). Although all of these are not significant improvements compared to just switching to FlashAttention inside the UNet model.
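
For inference you can claw most of that back by hand with in-place ops (a rough sketch, not from the linked code; only safe where you don't need autograd through it):

    import torch

    def softmax_(x, dim=-1):
        # In-place softmax: no second tensor the size of x is allocated,
        # which matters when x is the big attention-score matrix.
        x.sub_(x.amax(dim=dim, keepdim=True))  # subtract the row max for numerical stability
        x.exp_()
        x.div_(x.sum(dim=dim, keepdim=True))
        return x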


There are forks that even work on 1.8 GB of VRAM! They work great on my GTX 1050 2GB.

This is by far the most popular and active right now: https://github.com/AUTOMATIC1111/stable-diffusion-webui


> This is by far the most popular and active right now: https://github.com/AUTOMATIC1111/stable-diffusion-webui

While technically the most popular, I wouldn't call it "by far". This one is a very close second (500 vs 580 forks): https://github.com/sd-webui/stable-diffusion-webui/tree/dev


That's why I said "right now", since I feel that most people have moved from the one you linked to AUTOMATIC's fork by now. hlky's fork (the one you linked) was by far the most popular one until a couple of weeks ago, but some problems with the main developer's attitude and a never-ending migration from Gradio to Streamlit filled with issues made it lose its popularity.

AUTOMATIC has the attention of most devs nowadays. When you see any new ideas come up, they usually appear in AUTOMATIC's fork first.


Just as another point of reference: I followed the Windows install and I'm running this on my 1060 with 6GB of memory. With no setting changes it takes about 10 seconds to generate an image. I often run with sampling steps up to 50, and that takes about 40 seconds per image.


While AUTOMATIC is certainly popular, calling it the most active/popular would be ignoring the community working on Invoke. Forks don’t lie.

https://github.com/invoke-ai/InvokeAI


> Forks don’t lie.

They sure do. InvokeAI is a fork of the original repo CompVis/stable-diffusion and thus shares its fork counter. Those 4.1k forks are coming from CompVis/stable-diffusion, not InvokeAI.

Meanwhile AUTOMATIC1111/stable-diffusion-webui is not a fork itself, and has 511 forks.


Welp - TIL.

Thanks for the correction.

Any idea on how to count forks of a downstream fork? If anyone would know... :)


Subjectively, AUTOMATIC has taken over -- I have not heard of invoke yet but will check it out.


The only reason to use it, IMO, has been if you need Mac/M1 support, but that's probably in other forks by now.


What settings and repo are you using for GTX 1050 with 2GB?


I'm using the one I linked in my original post: https://github.com/AUTOMATIC1111/stable-diffusion-webui

The only command line argument I'm using is --lowvram, and I usually generate pictures at the default settings at 512x512 image size.

You can see all the command line arguments and what they do here: https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki...


I guess it could even work on a Jetson Nano (4GB) then. I run models of ~1.6 GB on it 24/7; I'll give this a try.


This needs Windows 10/11 though?


Nope. There are instructions for Windows, Linux and Apple Silicon in the readme: https://github.com/AUTOMATIC1111/stable-diffusion-webui

There's also this fork of AUTOMATIC1111's fork, which also has a Colab notebook ready to run, and it's way, way faster than the KerasCV version: https://github.com/TheLastBen/fast-stable-diffusion

(It also has many, many more options and some nice, user-friendly GUIs. It's the best version for Google Colab!)


Brilliant, thanks.



