Show HN: Open-source load balancer for llama.cpp

sszuecs · 2024-06-03T11:27:20 1717414040

Hi!

I am one of the maintainers of https://opensource.zalando.com/skipper http proxy library, which can support similar cases. We use this at Zalando https://www.zalando.com/ in Kubernetes and allow developers to connect to different kind of data applications including chat based LLMs or notebooks. We have of course OTel/Opentracing support https://opensource.zalando.com/skipper/operation/operation/#....

Likely the comparison with lb algorithms round robin and least connections is not a fair choice. Better would be to compare with consistent hash, that naturally does stateful load balancing. In skipper you can tune the behavior by filters https://opensource.zalando.com/skipper/reference/filters/#co... and https://opensource.zalando.com/skipper/reference/filters/#co... per route.

You don't want auto scaling? You can also limit concurrent requests to a route with queue support and make sure backends are not overloaded using scheduler filters https://opensource.zalando.com/skipper/reference/filters/#sc....

If you need more you can also help yourself and use lua filters to influence these options https://opensource.zalando.com/skipper/reference/scripts/ .

We are happy to hear from you, Sandor

curious_cat_163 · 2024-06-02T04:59:52 1717304392

> a stateful load balancer that is aware of each server's available slots

Interesting. Curious to understand what a 'slot' is, in this context.

Is that a llama.cpp specific application-layer state that llama.cpp makes available? Or is this an application-layer state that is being inferred? If later, how?

kherud · 2024-06-02T11:46:49 1717328809

I think this comment explains it https://github.com/ggerganov/llama.cpp/discussions/4130#disc... As far as I understand (and mcharytoniuk should better confirm this), llama.cpp allows to chunk the context window of an LLM into independent blocks, such that multiple requests can be processed in a single inference. I think due to the auto-regressive nature of LLMs, you also don't have to wait for all sequences to finish to output them. As soon as one sequence finishes, you can use its "slot" in the context window for other requests.

mcharytoniuk · 2024-06-03T11:06:00 1717412760

Yes, exactly. You can split the available context into "slots" (chunks) so it can handle multipe requests concurrently. The number of them is configurable.

tjfontaine · 2024-06-02T05:12:41 1717305161

Looks like it’s to only have one completion running at a time currently, https://github.com/distantmagic/paddler/blob/main/loadbalanc... I was curious if there was any other cache tracking happening

mcharytoniuk · 2024-06-02T09:53:33 1717322013

Just open an issue if you need anything. I want to make it as good and helpful as possible. Every kind of feedback is appreciated.

riedel · 2024-06-02T08:10:31 1717315831

What does stateful mean: I always wonder how loading states of users is done, it seems that one can call `llama_state_set_data` , does this load balancer create a central store for such states? What is the overhead of transfering state?

mcharytoniuk · 2024-06-02T09:52:47 1717321967

Currently, it is a single instance in memory, so it doesn't transfer state. HA is on the roadmap; only then will it need some kind of distributed state store.

Local states are reported by the agents installed alongside llama.cpp to the load balancer. That means they can be dynamically added and removed; it doesn't need a central configuration.

asne11 · 2024-06-02T06:01:16 1717308076

"slot" is a processing unit. Either GPU or CPU. I believe `llama.c` is only CPU so I'm guessing 1 slot = 1 core (or thread)?

mcharytoniuk · 2024-06-02T09:54:21 1717322061

It divides the context into smaller "slots", so it can process requests concurrently with continuous batching. See also: https://github.com/ggerganov/llama.cpp/tree/master/examples/...

kgeist · 2024-06-02T09:14:49 1717319689

Llama.cpp can run on CPU, on GPU, or in mixed mode (some layers run on CPU and some on GPU if you don't have enough VRAM).

new299 · 2024-06-02T06:58:24 1717311504

llama.cpp is not CPU only…

42lux · 2024-06-02T17:49:53 1717350593

Probably just free accelerators.

anonzzzies · 2024-06-02T06:30:49 1717309849

Does it do queuing ? Didn’t see it in the readme. I haven’t seen (but that says nothing at all) an open source solution that queues when all are busy and allow me to show a countdown of people in the queue. Like the closed ones do.

ritonlajoie · 2024-06-02T07:12:15 1717312335

I stumbled on this yesterday, it seems to have a queue concept.. https://github.com/ParisNeo/ollama_proxy_server

mcharytoniuk · 2024-06-02T09:50:35 1717321835

In progress. I added that to the readme; I need the feature myself. :)

friendly_chap · 2024-06-02T20:10:40 1717359040

I have actually open sourced one recently which supports queues - its both a desktop app and a daemon: https://github.com/singulatron/singulatron

3abiton · 2024-06-02T21:08:49 1717362529

So many interesting projects, I just wish AI hardware was readily available.

amiantos · 2024-06-02T16:19:50 1717345190

https://aihorde.net is something like an open-source load balancer with queuing for LLMs, too

animegirl2024 · 2024-06-03T03:46:05 1717386365

Wow, the open-source project is userful. Thanks for sharing.