Basically every user gets their own LoRA finetune without losing the accelerator efficiency of batching requests for a single base model.
This would be tremendously cool for the Kobold Horde and similar services, where users request LoRA recipes instead of being stuck with whatever finetune the host picks. The Stable Diffusion AI Horde already does this, just with no batching and tremendous inefficiency.
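For anyone who hasn't skimmed the paper, the core trick is roughly this (a toy PyTorch sketch, not Punica's actual CUDA kernel; all names here are made up for illustration): the dense base-model matmul stays fully batched, and each request in the batch just gathers its own low-rank A/B pair for the LoRA correction.

```python
import torch

def batched_lora_forward(x, base_w, lora_a, lora_b, adapter_idx):
    """
    x:           (batch, d_in)        one token per request
    base_w:      (d_in, d_out)        shared base-model weight
    lora_a:      (n_adapters, d_in, r)
    lora_b:      (n_adapters, r, d_out)
    adapter_idx: (batch,)             which LoRA each request uses
    """
    y = x @ base_w                        # dense part is batched as usual
    a = lora_a[adapter_idx]               # (batch, d_in, r), gathered per request
    b = lora_b[adapter_idx]               # (batch, r, d_out)
    delta = torch.bmm(torch.bmm(x.unsqueeze(1), a), b).squeeze(1)
    return y + delta                      # base output plus per-request LoRA delta

# toy usage: 4 requests, 8 adapters loaded, requests pick adapters 0, 3, 3, 7
x = torch.randn(4, 64)
w = torch.randn(64, 128)
A = torch.randn(8, 64, 16)
B = torch.randn(8, 16, 128)
out = batched_lora_forward(x, w, A, B, torch.tensor([0, 3, 3, 7]))  # (4, 128)
```

The paper's contribution is doing that gather-and-multiply step in a fused kernel so the per-request LoRA math doesn't wreck the batching efficiency; the sketch above is just the naive version of the same idea.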
> The Stable Diffusion AI Horde already does this, just with no batching and tremendous inefficiency.
Yes, the inefficiency here is the sheer number of models being served. Instead of every worker having SDXL and loading LoRAs, the worker is very rapidly switching among ~230 models... _and_ loading LoRAs, if the worker allows it! From discussions I've seen, there is a lot of encouragement to use LoRAs (and textual inversions) instead of a unique model, but there's still a lot of demand among clients for unique base models. Additionally, being a volunteer system, a lot of the workers are, well, strange. You have people contributing high-end GPUs, and then other people are trying their best to (abuse?) use Colab and Kaggle within whatever their TOS allows, so the capacity of the workers varies quite a bit.
Yeah. Stable Diffusion itself is a bit of an oddball because there are so many actual Dreambooth finetunes, not to speak of the merging culture.
Also, it all still runs in PyTorch eager mode last I checked. So it isn't even to the point where discussing optimized kernels like this is necessarily relevant.
Not directly related, but thank you for mentioning PyTorch eager mode. I had no idea this existed. There is just SO MUCH documentation to read, and it is bewildering how bad the defaults are in the whole deep learning ecosystem.
It's the most promising "fast" yet flexible stable diffusion implementation akin to this paper or vLLM that I know of. It doesn't have as many caveats as other implementations, like AITemplate (which is basically Turing+ and Linux only) or torch.compile (basically no support for changing inputs/LoRAs, and super long compilation on every startup/change).
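To make the torch.compile caveat concrete, here's a minimal sketch (hedged: exact recompilation behavior depends on PyTorch version and compile flags, and the module here is just a stand-in for a UNet block):

```python
import torch

# Toy module standing in for a diffusion model block; names are illustrative only.
model = torch.nn.Linear(64, 64)
compiled = torch.compile(model)

compiled(torch.randn(1, 64))   # first call: traces and compiles for this shape (slow)
compiled(torch.randn(1, 64))   # same shape: reuses the compiled artifact, fast
compiled(torch.randn(8, 64))   # new shape: guard failure, likely recompiles
                               # (dynamic shapes can help, but structural changes,
                               #  e.g. attaching a different LoRA module, change the
                               #  traced graph and force another long compile)
```

That recompile-on-change behavior is what makes it awkward for a serving setup where resolutions, batch sizes, and LoRAs vary per request.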
Punica: Serving multiple LoRA finetuned LLM as one - https://news.ycombinator.com/item?id=38196661 - Nov 2023 (17 comments)