Yeah, it shouldn't be too difficult to build this with Python. I wonder why none of the popular routers like https://github.com/BerriAI/litellm have this feature.
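A rough sketch of what the Python side could look like, assuming each local model sits behind its own OpenAI-compatible endpoint (the ports, routing keywords, and payload shape here are made up for illustration, not anything litellm actually ships):

```python
import json
import urllib.request

# Hypothetical local endpoints, one per model (ports chosen arbitrarily).
MODEL_ENDPOINTS = {
    "code":    "http://localhost:8001/v1/chat/completions",
    "general": "http://localhost:8002/v1/chat/completions",
}

def pick_model(prompt: str) -> str:
    """Toy routing rule: send code-looking prompts to the code model."""
    keywords = ("def ", "class ", "import ", "function", "stack trace", "compile")
    return "code" if any(k in prompt.lower() for k in keywords) else "general"

def route(prompt: str) -> str:
    url = MODEL_ENDPOINTS[pick_model(prompt)]
    payload = json.dumps({
        "model": "local",
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(route("Why does my Python function raise a TypeError?"))
```

In practice you'd want a smarter classifier than keyword matching, but the plumbing really is that small.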
> Problem is running so many LLMs in parallel means you need quite a bunch of resources.
Top-of-the-line MacBooks or Mac Minis should be able to run several 7B or even 13B models without major issues. Models are also getting smaller and better. That's why we're close =)
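Rough back-of-the-envelope numbers (weights only, assuming 4-bit quantization and ignoring KV cache and runtime overhead):

```python
# Approximate resident size of quantized weights, in GB.
def quantized_size_gb(params_billion: float, bits_per_weight: int = 4) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(quantized_size_gb(7))   # ~3.5 GB for a 7B model
print(quantized_size_gb(13))  # ~6.5 GB for a 13B model
# So a 64 GB machine could plausibly keep a handful of these loaded at once.
```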
Yeah, that would save disk space! In terms of inference, you'd still need to hold multiple models in memory, though, and I don't think we're that close to that (yet) on personal devices. You could imagine a system that dynamically unloads and reloads the models as you need them in this process, but that unloading and reloading would probably be pretty slow.
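For what it's worth, the dynamic load/unload idea could look something like this sketch: an LRU-style cache that caps how many models stay resident. The `loader` callable and the capacity of 2 are placeholders for whatever runtime and hardware you actually have:

```python
from collections import OrderedDict

MAX_RESIDENT = 2  # how many models to keep in memory at once (arbitrary)

class ModelCache:
    """Keeps the most recently used models loaded; evicts the rest."""

    def __init__(self, loader, max_resident=MAX_RESIDENT):
        self._loader = loader          # callable: name -> loaded model object
        self._max = max_resident
        self._models = OrderedDict()   # name -> model, ordered by recency

    def get(self, name):
        if name in self._models:
            self._models.move_to_end(name)       # mark as most recently used
            return self._models[name]
        if len(self._models) >= self._max:
            _, evicted = self._models.popitem(last=False)
            del evicted                          # drop the reference so memory can be freed
        model = self._loader(name)               # the slow part: reload weights from disk
        self._models[name] = model
        return model
```

The eviction itself is cheap; the pain is that every cache miss pays the full weight-loading cost, which is exactly the slowness you're describing.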