
Download the model in the background. Serve the client with an LLM vendor API just for the first requests, or even with that same local LLM installed on your own servers (likely cheaper). That way, in the long run the inference cost is near zero, which lets you use LLMs in business models (like freemium) that would otherwise be impossible.
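
A rough sketch of that flow in Python (the download, vendor call, and local inference are just stubs here, not any particular library's API):

    import threading
    import time

    MODEL_READY = threading.Event()

    def download_model():
        # Stand-in for the real weight download (e.g. pulling a GGUF file);
        # here it just sleeps to simulate a long transfer.
        time.sleep(5)
        MODEL_READY.set()

    def cloud_generate(prompt):
        # Stand-in for a vendor API call (OpenAI, OpenRouter, ...).
        return "[cloud] " + prompt

    def local_generate(prompt):
        # Stand-in for local inference (llama.cpp, MLC, ...) once weights are on disk.
        return "[local] " + prompt

    def complete(prompt):
        # The whole "fallback" is one conditional: local model once the weights
        # have landed, vendor API (or your own servers) until then.
        if MODEL_READY.is_set():
            return local_generate(prompt)
        return cloud_generate(prompt)

    threading.Thread(target=download_model, daemon=True).start()
    print(complete("hello"))  # served by the cloud path while the download runs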


Personally, I only use locally run models when I absolutely can't have the prompt/context uploaded to a cloud. For anything else, I just use one of the commercial cloud-hosted models. The ones I'm using are way faster and better in _every_ way except privacy. E.g., if you're willing to spend more, you can get blazing fast DeepSeek v3 or R1 via OpenRouter, or rather cheap Claude Sonnet via Copilot (the pre-release also has Gemini 2.5 Pro, btw).

I've gotten carried away - I meant to express that using the cloud as a fallback for local models is something I absolutely don't want or need, because privacy is the whole and only point of local models.


Exactly. Why does this not exist yet?


It's an if statement on whether the model has finished downloading or not.


A better solution would be to train/finetune the smaller model on the responses of the larger model, and only push inference to the edge if the smaller model is performant and the hardware specs can handle the workload?


Yeah, that'd be nice: some kind of self-bootstrapping system where you start with a strong cloud model, then fine-tune a smaller local one over time until it's good enough to take over. The tricky part is managing quality drift and deciding when it's 'good enough' without tanking UX. Edge hardware's catching up though, so it feels more feasible by the day.
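
A rough sketch of the bookkeeping that implies, in Python (the file format, threshold, and scoring below are made up for illustration):

    import json

    EVAL_THRESHOLD = 0.9  # arbitrary "good enough" cut-off, tune per product

    def log_training_pair(prompt, response, path="distill.jsonl"):
        # Collect (prompt, response) pairs from the big cloud model; these
        # become the fine-tuning set for the smaller local model.
        with open(path, "a") as f:
            f.write(json.dumps({"prompt": prompt, "completion": response}) + "\n")

    def should_switch_to_local(eval_scores):
        # Crude quality gate: only hand inference over to the edge model once
        # its average score on a held-out eval set clears the threshold, so
        # quality drift doesn't silently tank UX.
        return bool(eval_scores) and sum(eval_scores) / len(eval_scores) >= EVAL_THRESHOLD

    # e.g. after each cloud call: log_training_pair(prompt, reply)
    # and periodically: if should_switch_to_local(scores): flip routing to the local model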



