Show HN: Cloud-native Stack for Ollama - Build locally and push to deploy (github.com/ollama-cloud)
21 points by dsamy on March 19, 2024 | 4 comments



Question/feedback:

Who is this for?

I understand the ease of use, but using Ollama (an easy-to-use wrapper for llama.cpp) in production is, in my experience as someone who deploys this stuff, a very bad idea.

I understand it's "highly scalable" thanks to the tooling, but at the end of the day, on a resource-utilization basis, vLLM, HF TGI, etc. are going to walk all over llama.cpp, which IMO is completely the wrong tool for the job.

vLLM and HF TGI are containerized and run very well with nothing other than a HuggingFace model name as an argument/environment variable.
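
For context, the launch really is basically a one-liner per engine (image tags and model names below are just examples, not something from this project):

    # vLLM's OpenAI-compatible server; the model name is the only required argument
    docker run --gpus all -p 8000:8000 vllm/vllm-openai:latest \
        --model mistralai/Mistral-7B-Instruct-v0.2

    # HF Text Generation Inference; same idea, model id plus a cache volume
    docker run --gpus all -p 8080:80 -v $PWD/data:/data \
        ghcr.io/huggingface/text-generation-inference:latest \
        --model-id mistralai/Mistral-7B-Instruct-v0.2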

In these days of GPU shortages and very high costs, with CPU inference (llama.cpp's only real advantage) being unacceptably slow, using vLLM or similar cuts hosting costs in half if not more, while providing more management tools, higher TPS, lower time to first token, etc.

When your hardware or cloud hosting costs are multiples higher using this versus these real serving frameworks, a little extra ease of use on the frontend, combined with the impossibility of really competing on performance, makes this approach a tough proposition all around.


It's true that other engines like vLLM are way faster and more optimized. I started with Ollama because its codebase is Go. In reality, Ollama does not even take full advantage of llama.cpp itself: it does not implement concurrency, and it adds latency by passing JSON through a CGO call. I discovered that while building the wasm plugin, I was disappointed, and it's not on Ollama's priorities to solve: see https://github.com/ollama/ollama/issues/3170
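
To give a feel for the kind of per-call overhead I mean, here is a tiny benchmark sketch. It only measures the JSON round trip for a small request struct, not the CGO crossing and nothing Ollama-specific, so treat it as an illustration rather than a measurement of Ollama itself:

    // jsoncost_test.go -- illustrative only; the struct and values are made up.
    package jsoncost

    import (
        "encoding/json"
        "testing"
    )

    type completionRequest struct {
        Model       string  `json:"model"`
        Prompt      string  `json:"prompt"`
        Temperature float64 `json:"temperature"`
    }

    // BenchmarkJSONRoundTrip measures encoding and decoding one request,
    // i.e. the serialization tax paid on every call across the boundary.
    func BenchmarkJSONRoundTrip(b *testing.B) {
        req := completionRequest{Model: "llama2", Prompt: "why is the sky blue?", Temperature: 0.7}
        for i := 0; i < b.N; i++ {
            buf, err := json.Marshal(req)
            if err != nil {
                b.Fatal(err)
            }
            var out completionRequest
            if err := json.Unmarshal(buf, &out); err != nil {
                b.Fatal(err)
            }
        }
    }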

Another advantage of Ollama is that it easily runs locally, and so does the wasm plugin, which accomplishes the goal of a local development environment using dreamland.

That's great feedback. I was thinking about fixing the concurrency issue myself, but creating a vLLM wasm plugin is a better idea. The user code won't need to change as long as the plugin exports the same wasm host module.
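
To make that concrete, here's a minimal, hypothetical sketch of what I mean by "the same host module" from the guest's point of view. None of these names are the real ollama-cloud/taubyte API; it's just the shape of the contract:

    // Hypothetical sketch only: not the actual ollama-cloud/taubyte interface.
    // Guest code depends on this contract, while the host decides whether
    // llama.cpp (via Ollama) or vLLM actually serves the request.
    package plugin

    import "context"

    // Generator is the host-module surface a generation plugin exports.
    type Generator interface {
        // Generate streams completion tokens for a prompt on the named model.
        Generate(ctx context.Context, model, prompt string) (<-chan string, error)
    }

    // Swapping the backend (Ollama -> vLLM) means reimplementing Generator on
    // the host side; user code compiled against it stays unchanged.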


I'm with you. I think having an optimized production inference serving framework on the deployment side has the potential to make your project the best of all worlds, essentially.

Of course, there are other really advanced use cases and alternative ways to go about this, but that alone would go a very long way.

Also FWIW, the Nvidia Triton Inference Server is even more performant than vLLM and supports dynamic batching, quantization, paging, KV cache, and so on, in addition to being able to load multiple models today, whether they be LLMs, ONNX, whatever, across all of the available backends.

It's significantly more complex in terms of deployment, but I wanted to mention it for its ability to load multiple models concurrently in an efficient and performant manner.
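
If it helps, dynamic batching and instance counts in Triton are declared per model in its config.pbtxt, and every model placed in the model repository is served concurrently. A minimal sketch for an ONNX model (names and sizes are made up) looks roughly like:

    # model_repository/resnet50_onnx/config.pbtxt -- illustrative values only
    name: "resnet50_onnx"
    platform: "onnxruntime_onnx"
    max_batch_size: 32
    dynamic_batching {
      preferred_batch_size: [ 8, 16, 32 ]
      max_queue_delay_microseconds: 500
    }
    instance_group [
      { count: 1, kind: KIND_GPU }
    ]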


I took a look at the Nvidia Triton Inference Server, and it might be a good option for production, especially as it has a C++ API.

Amazing feedback, thanks!



