
Download the model in the background. Serve the client with an LLM vendor API just for the first requests, or even with that same local LLM installed on your own servers (likely cheaper). That way, in the long run the inference cost is near zero, which lets you use LLMs in business models (like freemium) that would otherwise be impossible.
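
A rough sketch of that flow in Python (the download, vendor call, and local inference are just stubs here, not any particular library's API):

    import threading
    import time

    MODEL_READY = threading.Event()

    def download_model():
        # Stand-in for the real weight download (e.g. pulling a GGUF file);
        # here it just sleeps to simulate a long transfer.
        time.sleep(5)
        MODEL_READY.set()

    def cloud_generate(prompt):
        # Stand-in for a vendor API call (OpenAI, OpenRouter, ...).
        return "[cloud] " + prompt

    def local_generate(prompt):
        # Stand-in for local inference (llama.cpp, MLC, ...) once weights are on disk.
        return "[local] " + prompt

    def complete(prompt):
        # The whole "fallback" is one conditional: local model once the weights
        # have landed, vendor API (or your own servers) until then.
        if MODEL_READY.is_set():
            return local_generate(prompt)
        return cloud_generate(prompt)

    threading.Thread(target=download_model, daemon=True).start()
    print(complete("hello"))  # served by the cloud path while the download runs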


Personally, I only use locally run models when I absolutely can't have the prompt/context uploaded to a cloud. For anything else, I just use one of the commercial cloud-hosted models. The ones I'm using are way faster and better in _every_ way except privacy. E.g., if you're willing to spend more, you can get blazing fast DeepSeek v3 or R1 via OpenRouter, or rather cheap Claude Sonnet via Copilot (the pre-release also has Gemini 2.5 Pro, btw).

I've gotten carried away - I meant to express that using the cloud as a fallback for local models is something I absolutely don't want or need, because privacy is the whole and only point of local models.


Exactly. Why does this not exist yet?


It's an if statement on whether the model has finished downloading or not.


A better solution would be to train/finetune the smaller model on the responses of the larger model, and only push inference to the edge if the smaller model is performant and the hardware specs can handle the workload?


Yeah, that'd be nice: some kind of self-bootstrapping system where you start with a strong cloud model, then fine-tune a smaller local one over time until it's good enough to take over. The tricky part is managing quality drift and deciding when it's 'good enough' without tanking UX. Edge hardware's catching up though, so it feels more feasible by the day.
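
A rough sketch of the bookkeeping that implies, in Python (the file format, threshold, and scoring below are made up for illustration):

    import json

    EVAL_THRESHOLD = 0.9  # arbitrary "good enough" cut-off, tune per product

    def log_training_pair(prompt, response, path="distill.jsonl"):
        # Collect (prompt, response) pairs from the big cloud model; these
        # become the fine-tuning set for the smaller local model.
        with open(path, "a") as f:
            f.write(json.dumps({"prompt": prompt, "completion": response}) + "\n")

    def should_switch_to_local(eval_scores):
        # Crude quality gate: only hand inference over to the edge model once
        # its average score on a held-out eval set clears the threshold, so
        # quality drift doesn't silently tank UX.
        return bool(eval_scores) and sum(eval_scores) / len(eval_scores) >= EVAL_THRESHOLD

    # e.g. after each cloud call: log_training_pair(prompt, reply)
    # and periodically: if should_switch_to_local(scores): flip routing to the local model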



