Basically every user gets their own LoRA finetune without losing the accelerator efficiency of batching requests for a single base model.
This would be tremendously cool for the Kobold Horde and similar services, where users request LoRA recipes instead of being stuck with whatever finetune the host picks. The Stable Diffusion AI Horde already does this, just with no batching and tremendous inefficiency.
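For anyone who hasn't skimmed the paper, the core trick is roughly this (a toy PyTorch sketch, not Punica's actual CUDA kernel; all names here are made up for illustration): the dense base-model matmul stays fully batched, and each request in the batch just gathers its own low-rank A/B pair for the LoRA correction.

```python
import torch

def batched_lora_forward(x, base_w, lora_a, lora_b, adapter_idx):
    """
    x:           (batch, d_in)        one token per request
    base_w:      (d_in, d_out)        shared base-model weight
    lora_a:      (n_adapters, d_in, r)
    lora_b:      (n_adapters, r, d_out)
    adapter_idx: (batch,)             which LoRA each request uses
    """
    y = x @ base_w                        # dense part is batched as usual
    a = lora_a[adapter_idx]               # (batch, d_in, r), gathered per request
    b = lora_b[adapter_idx]               # (batch, r, d_out)
    delta = torch.bmm(torch.bmm(x.unsqueeze(1), a), b).squeeze(1)
    return y + delta                      # base output plus per-request LoRA delta

# toy usage: 4 requests, 8 adapters loaded, requests pick adapters 0, 3, 3, 7
x = torch.randn(4, 64)
w = torch.randn(64, 128)
A = torch.randn(8, 64, 16)
B = torch.randn(8, 16, 128)
out = batched_lora_forward(x, w, A, B, torch.tensor([0, 3, 3, 7]))  # (4, 128)
```

The paper's contribution is doing that gather-and-multiply step in a fused kernel so the per-request LoRA math doesn't wreck the batching efficiency; the sketch above is just the naive version of the same idea.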
> The Stable Diffusion AI Horde already does this, just with no batching and tremendous inefficiency.
Yes, the inefficiency here is the sheer number of models being served. Instead of every worker having SDXL and loading LoRAs, the worker is very rapidly switching among ~230 models... _and_ loading LoRAs, if the worker allows it! From discussions I've seen, there is a lot of encouragement to use LoRAs (and textual inversions) instead of a unique model, but there's still a lot of demand among clients for unique base models. Additionally, being a volunteer system, a lot of the workers are, well, strange. You have people contributing high-end GPUs, and then other people are trying their best to (abuse?) use Colab and Kaggle within whatever their TOS allows, so the capacity of the workers varies quite a bit.
Yeah. Stable Diffusion itself is a bit of an oddball because there are so many actual Dreambooth finetunes, not to speak of the merging culture.
Also, it all still runs in PyTorch eager mode last I checked. So it isn't even to the point where discussing optimized kernels like this is necessarily relevant.
Not directly related, but thank you for mentioning PyTorch eager mode. I had no idea this existed. There is just SO MUCH documentation to read, and it is bewildering how bad the defaults are in the whole deep learning ecosystem.
It's the most promising "fast" yet flexible stable diffusion implementation akin to this paper or vLLM that I know of. It doesn't have as many caveats as other implementations, like AITemplate (which is basically Turing+ and Linux only) or torch.compile (basically no support for changing inputs/LoRAs, and super long compilation on every startup/change).
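To make the torch.compile caveat concrete, here's a minimal sketch (hedged: exact recompilation behavior depends on PyTorch version and compile flags, and the module here is just a stand-in for a UNet block):

```python
import torch

# Toy module standing in for a diffusion model block; names are illustrative only.
model = torch.nn.Linear(64, 64)
compiled = torch.compile(model)

compiled(torch.randn(1, 64))   # first call: traces and compiles for this shape (slow)
compiled(torch.randn(1, 64))   # same shape: reuses the compiled artifact, fast
compiled(torch.randn(8, 64))   # new shape: guard failure, likely recompiles
                               # (dynamic shapes can help, but structural changes,
                               #  e.g. attaching a different LoRA module, change the
                               #  traced graph and force another long compile)
```

That recompile-on-change behavior is what makes it awkward for a serving setup where resolutions, batch sizes, and LoRAs vary per request.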
Punica: Serving multiple LoRA finetuned LLM as one - https://news.ycombinator.com/item?id=38196661 - Nov 2023 (17 comments)