
Yeah, it shouldn't be too difficult to build this in Python. I wonder why none of the popular routers like https://github.com/BerriAI/litellm have this feature.

> Problem is running so many LLMs in parallel means you need quite a bunch of resources.

Top-of-the-line MacBooks or Mac Minis should be able to run several 7B or even 13B models without major issues. Models are also getting smaller and better. That's why we're close =)
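
Fanning a prompt out to a few local models is only a few lines of Python with LiteLLM, roughly like this (the model names and the Ollama endpoint here are just placeholders for whatever backends you actually run):

    # Fan out one prompt to several locally served models in parallel via LiteLLM.
    from concurrent.futures import ThreadPoolExecutor

    import litellm

    MODELS = ["ollama/mistral", "ollama/llama2:13b", "ollama/phi"]  # placeholder local models

    def ask(model: str, prompt: str) -> str:
        resp = litellm.completion(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            api_base="http://localhost:11434",  # default Ollama endpoint
        )
        return resp.choices[0].message.content

    def fan_out(prompt: str) -> dict[str, str]:
        # One thread per model; each call just blocks on its own backend.
        with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
            futures = {m: pool.submit(ask, m, prompt) for m in MODELS}
            return {m: f.result() for m, f in futures.items()}

    if __name__ == "__main__":
        answers = fan_out("Summarize the trade-offs of running several 7B models locally.")
        for model, text in answers.items():
            print(f"--- {model} ---\n{text}\n")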




Could LoRA fine-tunes be used instead of completely different models? I wonder if that would save space.


Yeah, that would save disk space! In terms of inference you'd still need to hold multiple models in memory, though, and I don't think we're that close to that (yet) on personal devices. You could imagine a system that dynamically unloads and reloads the models as you need them in this process, but that unloading and reloading would probably be pretty slow.
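
The juggling could be as simple as a small LRU cache over the loaded models, sketched here with llama-cpp-python (the .gguf path and the two-model limit are just placeholders):

    # Keep at most MAX_LOADED models resident; evict the least recently used
    # one when a new model is requested, and reload on demand.
    from collections import OrderedDict

    from llama_cpp import Llama

    MAX_LOADED = 2  # how many full models fit comfortably in RAM (assumption)

    class ModelCache:
        def __init__(self) -> None:
            self._loaded: "OrderedDict[str, Llama]" = OrderedDict()

        def get(self, model_path: str) -> Llama:
            if model_path in self._loaded:
                self._loaded.move_to_end(model_path)  # mark as recently used
                return self._loaded[model_path]
            if len(self._loaded) >= MAX_LOADED:
                _, evicted = self._loaded.popitem(last=False)
                del evicted  # drop the weights; reloading later costs seconds
            self._loaded[model_path] = Llama(model_path=model_path, n_ctx=2048)
            return self._loaded[model_path]

    cache = ModelCache()
    llm = cache.get("models/mistral-7b-instruct.Q4_K_M.gguf")  # placeholder path
    out = llm("Q: What is a LoRA adapter? A:", max_tokens=64)
    print(out["choices"][0]["text"])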


https://github.com/predibase/lorax does this, and it's not that slow, since LoRA adapters usually aren't very big.
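
You pick the adapter per request, roughly like this (the endpoint shape and the adapter_id parameter are as I remember them from the LoRAX README, and the adapter name is made up, so check the docs):

    # Ask a running LoRAX server to apply a specific LoRA adapter for this request.
    import requests

    resp = requests.post(
        "http://127.0.0.1:8080/generate",
        json={
            "inputs": "Classify the sentiment: 'the battery life is great'",
            "parameters": {
                "max_new_tokens": 32,
                "adapter_id": "some-org/some-sentiment-lora",  # placeholder adapter
            },
        },
        timeout=60,
    )
    print(resp.json()["generated_text"])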


With a fast NVMe drive, loading a model only takes 2-3 s.
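
Back-of-the-envelope for a 4-bit 7B model (the drive speed and quantization level are assumptions):

    # Rough load-time estimate: weights on disk divided by sequential read speed.
    params = 7e9
    bytes_per_param = 0.5                             # ~4-bit quantization
    model_size_gb = params * bytes_per_param / 1e9    # ~3.5 GB on disk
    nvme_gb_per_s = 3.0                               # typical fast NVMe sequential read
    load_seconds = model_size_gb / nvme_gb_per_s      # ~1.2 s raw read
    print(f"{model_size_gb:.1f} GB -> ~{load_seconds:.1f} s read, plus init overhead")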


I'm the LiteLLM maintainer. Can you elaborate on what you're looking for us to do here?




