
Yeah, it shouldn't be too difficult to build this in Python. I wonder why none of the popular routers like https://github.com/BerriAI/litellm have this feature.

> Problem is running so many LLMs in parallel means you need quite a bunch of resources.

Top-of-the-line MacBooks or Mac Minis should be able to run several 7B or even 13B models without major issues. Models are also getting smaller and better. That's why we're close =)
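
Fanning a prompt out to a few local models is only a few lines of Python with LiteLLM, roughly like this (the model names and the Ollama endpoint here are just placeholders for whatever backends you actually run):

    # Fan out one prompt to several locally served models in parallel via LiteLLM.
    from concurrent.futures import ThreadPoolExecutor

    import litellm

    MODELS = ["ollama/mistral", "ollama/llama2:13b", "ollama/phi"]  # placeholder local models

    def ask(model: str, prompt: str) -> str:
        resp = litellm.completion(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            api_base="http://localhost:11434",  # default Ollama endpoint
        )
        return resp.choices[0].message.content

    def fan_out(prompt: str) -> dict[str, str]:
        # One thread per model; each call just blocks on its own backend.
        with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
            futures = {m: pool.submit(ask, m, prompt) for m in MODELS}
            return {m: f.result() for m, f in futures.items()}

    if __name__ == "__main__":
        answers = fan_out("Summarize the trade-offs of running several 7B models locally.")
        for model, text in answers.items():
            print(f"--- {model} ---\n{text}\n")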




Could LoRA fine-tunes be used instead of completely different models? I wonder if that would save space.


Yeah, that would save disk space! In terms of inference you'd still need to hold multiple models in memory, though, and I don't think we're that close to that (yet) on personal devices. You could imagine a system that dynamically unloads and reloads the models as you need them in this process, but that unloading and reloading would probably be pretty slow.
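
The juggling could be as simple as a small LRU cache over the loaded models, sketched here with llama-cpp-python (the .gguf path and the two-model limit are just placeholders):

    # Keep at most MAX_LOADED models resident; evict the least recently used
    # one when a new model is requested, and reload on demand.
    from collections import OrderedDict

    from llama_cpp import Llama

    MAX_LOADED = 2  # how many full models fit comfortably in RAM (assumption)

    class ModelCache:
        def __init__(self) -> None:
            self._loaded: "OrderedDict[str, Llama]" = OrderedDict()

        def get(self, model_path: str) -> Llama:
            if model_path in self._loaded:
                self._loaded.move_to_end(model_path)  # mark as recently used
                return self._loaded[model_path]
            if len(self._loaded) >= MAX_LOADED:
                _, evicted = self._loaded.popitem(last=False)
                del evicted  # drop the weights; reloading later costs seconds
            self._loaded[model_path] = Llama(model_path=model_path, n_ctx=2048)
            return self._loaded[model_path]

    cache = ModelCache()
    llm = cache.get("models/mistral-7b-instruct.Q4_K_M.gguf")  # placeholder path
    out = llm("Q: What is a LoRA adapter? A:", max_tokens=64)
    print(out["choices"][0]["text"])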


https://github.com/predibase/lorax does this, and it's not that slow, since LoRA adapters usually aren't very big.
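
You pick the adapter per request, roughly like this (the endpoint shape and the adapter_id parameter are as I remember them from the LoRAX README, and the adapter name is made up, so check the docs):

    # Ask a running LoRAX server to apply a specific LoRA adapter for this request.
    import requests

    resp = requests.post(
        "http://127.0.0.1:8080/generate",
        json={
            "inputs": "Classify the sentiment: 'the battery life is great'",
            "parameters": {
                "max_new_tokens": 32,
                "adapter_id": "some-org/some-sentiment-lora",  # placeholder adapter
            },
        },
        timeout=60,
    )
    print(resp.json()["generated_text"])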


With a fast NVMe drive, loading a model only takes 2-3 s.
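
Back-of-the-envelope for a 4-bit 7B model (the drive speed and quantization level are assumptions):

    # Rough load-time estimate: weights on disk divided by sequential read speed.
    params = 7e9
    bytes_per_param = 0.5                             # ~4-bit quantization
    model_size_gb = params * bytes_per_param / 1e9    # ~3.5 GB on disk
    nvme_gb_per_s = 3.0                               # typical fast NVMe sequential read
    load_seconds = model_size_gb / nvme_gb_per_s      # ~1.2 s raw read
    print(f"{model_size_gb:.1f} GB -> ~{load_seconds:.1f} s read, plus init overhead")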


I'm the LiteLLM maintainer. Can you elaborate on what you're looking for us to do here?




