Show HN: liteLLM Proxy Server: 50+ LLM Models, Error Handling, Caching (github.com/berriai)
140 points by ij23 9 months ago | 34 comments
Hello Hacker News,

I’m the maintainer of liteLLM() - a package to simplify input/output to the OpenAI, Azure, Cohere, Anthropic, and Hugging Face API endpoints: https://github.com/BerriAI/litellm/

We’re open sourcing our implementation of liteLLM proxy: https://github.com/BerriAI/litellm/blob/main/cookbook/proxy-...

TL;DR: It exposes one API endpoint, /chat/completions, standardizes input/output for 50+ LLM models, and handles logging, error tracking, caching, and streaming.

What can the liteLLM proxy do? It’s a central place to manage all your LLM provider integrations:

- Consistent Input/Output Format - Call all models using the OpenAI format: completion(model, messages). Text responses are always available at ['choices'][0]['message']['content'] (see the sketch after this list)

- Error Handling Using Model Fallbacks (if GPT-4 fails, fall back to Llama 2)

- Logging - Log Requests, Responses and Errors to Supabase, Posthog, Mixpanel, Sentry, Helicone

- Token Usage & Spend - Track input + completion tokens used and spend per model

- Caching - Implementation of Semantic Caching

- Streaming & Async Support - Return generators to stream text responses
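
Here's a minimal Python sketch of the call pattern described above - the model names are just examples, and the fallback loop at the end only illustrates the idea rather than the proxy's exact mechanism:

    from litellm import completion

    messages = [{"role": "user", "content": "Hey, how's it going?"}]

    # same OpenAI-style call for every provider; only the model name changes
    response = completion(model="gpt-3.5-turbo", messages=messages)
    print(response['choices'][0]['message']['content'])

    # illustrative fallback: try a cheaper/open model first, then GPT-4
    for model in ["llama-2-70b-chat", "gpt-4"]:
        try:
            response = completion(model=model, messages=messages)
            break
        except Exception:
            continue  # on error, fall through to the next model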

You can deploy liteLLM to your own infrastructure using Railway, GCP, AWS, or Azure.

Happy completion()!




This would be super useful if it supported local/in-K8-cluster models.

Most production use cases probably need some sort of fallback chain of small -> medium -> large -> GPT-4.

Given the costs + low quota limits, I would be surprised if any significant portion of the market is falling back from one expensive proprietary API to another.

With the rapid improvements in model servers like llama.cpp and vllm.ai, providing an abstraction layer for “fastest model server of the month” would be useful.


What local/in-K8-cluster model servers would you recommend adding?

Should we add support for llama.cpp and vllm.ai in the proxy server? Or should we assume you host them on your own infra and the proxy server just sends requests to your hosted model?


IMO don’t try to be the one-stop shop to host models. There are too many players with all sorts of advancements (e.g. stopping grammars, continuous batching, novel quantization, etc.) and you won’t be able to keep up.

There is a ton of boilerplate around the actual model server that’s just busy work, but if done wrong it can be a huge performance suck. Solve that.

Build the proxy that works with the most model servers out there. Do it in a way that, once you have mindshare, the model server makers will find it easy to put up a PR so they can claim your proxy supports their server.

Don’t take a hard dependency on non-OSS stuff - being able to build an “on-prem” solution (read “deployed into customer’s VPC”) is table stakes for anyone to use your offering for a lot of enterprise use cases.

Edit: another unsolved problem - different models need slightly different prompts to solve the same problem well…


If it makes sense to expand scope to provide a particular model server and the group can easily be the best at it, I say go for it. But do it as a separate (but perhaps connected) project to this.

But in general I’m in agreement that this sounds like a separate concept than any given model server.

That said, where is a list of model servers for the most commonly wanted LLMs at this point?

Perhaps maintaining a list of those that do and don’t work with the proxy would be helpful.


Hey bredren - our supported list (if that's helpful) is here https://litellm.readthedocs.io/en/latest/supported/

We're adding new integrations every day, so if there's any specific one you'd like to add feel free to let us know (discord/ticket/email/etc.) - here's my email: krrish@berri.ai


*Update*: for those asking how to run self-hosted llama2, we just added the ollama integration.

Here's the tutorial - https://github.com/BerriAI/litellm/blob/main/cookbook/liteLL...


The idea of an LLM proxy is super compelling. There are a lot of powerful ideas baked into the proxy form factor - I think you've listed out quite a few of them. It reminds me a bit of what Cloudflare did for the web: both making it faster and safer/easier. Have you considered local LLMs at all, like Llama 2? A few people and I have been working on https://github.com/jmorganca/ollama/ and I was thinking it would be helpful to be able to augment it with a proxy layer like this. Not only that, but it might help folks dynamically choose to run locally (vs. against a cloud LLM) for certain prompts.


Hey @jmorgan - we love ollama!

Re: local LLMs like Llama 2 - yes, we support self-deployed models through the Hugging Face, Replicate, and TogetherAI integrations.

We're missing support for locally deployed models - and would love the help!

Re: ollama - we spent a couple of hours trying to integrate it. We hit a couple of issues though, and would love to try and support it. Got time to chat sometime this/next week? I think this would be an awesome addition.


Great. Let's chat!


Does it support the function-calling API? https://openai.com/blog/function-calling-and-other-api-updat...


Hey @breckenedge yes it does! Exactly the same way as the openai-python sdk. Here's our code tutorial for it https://github.com/BerriAI/litellm/blob/main/cookbook/liteLL...
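
For reference, here's a rough sketch of what that looks like, assuming the openai-python-style functions parameter passes through unchanged (the function schema is just an example):

    from litellm import completion

    functions = [{
        "name": "get_current_weather",  # example schema, openai function-calling style
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name, e.g. Boston"}
            },
            "required": ["location"],
        },
    }]

    response = completion(
        model="gpt-3.5-turbo-0613",
        messages=[{"role": "user", "content": "What's the weather in Boston?"}],
        functions=functions,
    )
    # any function_call comes back in the openai response format
    print(response['choices'][0]['message'])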


What about streaming output — is there a uniform interface for all models for streaming output?


Hi @d4rkp4ttern - yes it's standardized to the openai format - https://litellm.readthedocs.io/en/latest/stream/
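
For anyone curious, a minimal sketch of the streaming call, assuming the chunks mirror openai's delta format as the docs describe:

    from litellm import completion

    response = completion(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "Write a haiku about proxies"}],
        stream=True,  # returns a generator of openai-style chunks
    )

    for chunk in response:
        delta = chunk['choices'][0]['delta']
        if 'content' in delta:
            print(delta['content'], end='')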


Can you say more about the semantic caching? I can't seem to find much in the docs.

edit: Found the cache notebook[1] and the calls to the cache and distance eval in the code[2]

[1] https://github.com/BerriAI/litellm/blob/main/cookbook/liteLL...

[2] https://github.com/BerriAI/litellm/blob/80d77fed7123af222011...
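
For anyone else digging in: the general pattern seems to be "embed the prompt, and reuse a cached response if a previous prompt is close enough". A rough sketch of that idea (not litellm's actual code - embed_fn and the threshold are placeholders):

    import numpy as np

    class SemanticCache:
        # embed_fn: any function mapping a prompt string to a vector
        # (e.g. an OpenAI or sentence-transformers embedding)
        def __init__(self, embed_fn, threshold=0.9):
            self.embed_fn = embed_fn
            self.threshold = threshold
            self.entries = []  # list of (embedding, cached response)

        @staticmethod
        def _cosine(a, b):
            return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

        def get(self, prompt):
            query = self.embed_fn(prompt)
            for emb, response in self.entries:
                if self._cosine(query, emb) >= self.threshold:
                    return response  # semantically close enough: reuse it
            return None

        def add(self, prompt, response):
            self.entries.append((self.embed_fn(prompt), response))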


This makes sense to me. I’d previously mentioned here [1]:

> …can’t people just build their product with openAI or other and plan to move away based on the cost and fit for their circumstances?

> Couldn’t someone say prototype the entire product on some lower-quality LLM and occasionally pass requests to GPT4 to validate behavior?

This seems to reduce the switching costs to almost nothing.

[1] https://news.ycombinator.com/item?id=36626943


If you'd like to add any of these ideas as notebooks to our cookbook - https://github.com/BerriAI/litellm/tree/main/cookbook, we'd love the contribution!


Those are really cool use-cases. I wrote a bit about our initial motivation for LiteLLM here - https://hackernoon.com/litellm-call-every-llm-api-like-its-o....

tl;dr: reliable model switching involved multiple 100-line if/else statements, which made our code messy and debugging in prod pretty hard.


Could you integrate with something like TextGen and then automatically have support for every locally running model they support? Am I missing something?


So I use my own accounts to access this, correct? I didn't see where in the documentation I provide my credentials. I'll look again…


Yes, you use your own API keys. You can set them as environment variables - either via os.environ['OPENAI_API_KEY'] or in a .env file: https://litellm.readthedocs.io/en/latest/supported/
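
Concretely, something like this (placeholder keys; the COHERE_API_KEY name is assumed from the usual provider-key convention):

    import os
    from litellm import completion

    os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder
    os.environ["COHERE_API_KEY"] = "..."     # placeholder, only needed for Cohere models

    response = completion(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": "hi"}],
    )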


Is user auth (and tracking token spend) within scope of this or is that better handled at a layer in front of this?


Hi @deet - Yes it is! This automatically stores the cost per query to the Supabase table - here's how: https://github.com/BerriAI/litellm/blob/80d77fed7123af222011...

If you have ideas for improvement - we'd love a ticket/PR!


Very cool! Could you add to the README how to install Llama 2 on the same machine?


How do you plan on deploying llama 2? Is that via ollama/fastchat/etc.? We're actively building out integrations to different providers, so if you have any preferred tooling - let me know!


No idea yet. Which would you recommend starting with? It will be hosted on an Ubuntu server (DigitalOcean, Linode, etc.).


Any reason you're doing that vs. using Lambda Labs / Replicate / together.ai / Banana.dev, etc.?

There are a lot of good model deployment platforms that would make it easy to call your model behind a hosted endpoint.

If you do want to self-host, there are some great libraries like https://github.com/lm-sys/FastChat and https://github.com/ggerganov/llama.cpp that might be helpful.

If none of these really solve your issue - feel free to email me and I'm happy to help you figure something out - krrish@berri.ai


You can definitely start by checking out ollama; it was super helpful for me.


It’s only for macOS. I expect to load the model on an Ubuntu server, not on my local dev machine.


You have to build it yourself if you want it for Ubuntu, Windows, or anything else. Just install Go on your machine and have at it.


Reminds me of analytics.js, which later turned into Segment and a $3B acquisition.


I do think this is an apt analogy. I've heard a counterpoint that there won't be enough "destinations" for this to work, but then it's not hard to imagine a single order of magnitude more LLM "destinations" than today (all of which were launched in the last 12 months).

There's also the fact that the data being sent to these LLM "destinations" could be significantly more valuable (or contain significantly more sensitive information) than the average Segment identify or track object.


Thanks - that's very kind.


There are unaddressed GitHub issues, like #29 added by whoiskatrin, who is not listed as a contributor, yet telephone numbers and emails are listed in the README - the motive for that is ambiguous and not likely to result in an optimal outcome.


Hey @downvotetruth - the issues/PRs are actually ours (the creators'). We're actively working on + using this repo, and use these as a way to keep track of ongoing work. We're super open to contributions and help though!




