Accelerating Stable Diffusion XL Inference with Jax on Cloud TPU v5e

MasterScrat · 2023-10-03T22:18:22

TPUs are amazing for Stable Diffusion.

We've been doing training (Dreambooth) and inference on TPUs since the beginning of the year at https://dreamlook.ai.

We basically get 2.5x the training speed for Stable Diffusion 1.5 compared to A100, a very nice "unfair advantage"!

brucethemoose2 · 2023-10-03T18:23:02

That's very fast!

I just tried it on my RTX 3090, with a riced linux environment + pytorch/xformers nightly, and 4 images take 36.7 seconds on the ComfyUI backend (used by Fooocus-MRE).

...But the issue is, right now, you can either pick high quality tooling (The ComfyUI/automatic1111 backend UIs) or speed (diffusers-based UIs), not both. InvokeAI and VoltaML do not support SDXL as well as Fooocus at the moment, and all the other UIs use the slow Stability backend with no compilation support.

FloatArtifact · 2023-10-03T22:30:21

Double check your facts there I think InvokeAI support SDXL

See https://github.com/invoke-ai/InvokeAI/releases

I believe invoke supports caching and a number of interesting features this release.

brucethemoose2 · 2023-10-04T00:10:09

Yeah, I just tried the release candidate two days ago, but my results with SDXL were very poor compared to Fooocus, and its missing other niceties like the known style presets, FreeU, the better Fooocus prompt expansion...

I really wanted to use Invoke for the better performance and torch.compile support, but the switch to the refiner seems slow, and torch.compile doesn't seem to work either. I need to investigate torch.compile more, but maybe they did something to the vanilla diffusers pipeline?

FloatArtifact · 2023-10-04T00:43:47

Thanks for your perspective here. I'll definitely look into some of these other projects.

iAkashPaul · 2023-10-05T03:33:45

With some CPU offloading I'm able to run SDXL at 1.5it/s on an RTX2070S w/ 8GB VRAM.

When used with ControlNet it still runs but more layers offloaded at 1.4s/it