Hello Hacker News,
I’m the maintainer of liteLLM, a package that simplifies input/output for the OpenAI, Azure, Cohere, Anthropic, and Hugging Face API endpoints: https://github.com/BerriAI/litellm/
We’re open sourcing our implementation of liteLLM proxy: https://github.com/BerriAI/litellm/blob/main/cookbook/proxy-...
TLDR: It exposes one API endpoint, /chat/completions, standardizes input/output for 50+ LLM models, and handles logging, error tracking, caching, and streaming.
What can liteLLM proxy do?
- It’s a central place to manage all LLM provider integrations
- Consistent Input/Output Format
- Call all models using the OpenAI format: completion(model, messages)
- Text responses will always be available at ['choices'][0]['message']['content']
- Error Handling Using Model Fallbacks (if GPT-4 fails, try llama2)
- Logging - Log Requests, Responses and Errors to Supabase, Posthog, Mixpanel, Sentry, Helicone
- Token Usage & Spend - Track input + completion tokens used, and spend per model
- Caching - Implementation of Semantic Caching
- Streaming & Async Support - Return generators to stream text responses
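Concretely, every response follows the OpenAI format, so text is always in the same place. A minimal sketch of reading it (the dict below is a hand-built stand-in for a real completion() response, not an actual API call):

```python
# Sketch: pulling text out of a response in the OpenAI format that
# liteLLM normalizes every provider to. A real call would look like
# completion(model="gpt-3.5-turbo", messages=[{"role": "user", "content": "hi"}]).

def extract_text(response: dict) -> str:
    # Text responses are always at ['choices'][0]['message']['content']
    return response["choices"][0]["message"]["content"]

# Hand-built stand-in for a completion() response:
fake_response = {
    "choices": [
        {"message": {"role": "assistant", "content": "Hello!"}}
    ]
}

print(extract_text(fake_response))  # -> Hello!
```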
You can deploy liteLLM to your own infrastructure using Railway, GCP, AWS, or Azure.
Happy completion() !
Most production use cases probably need some sort of fallback chain of small -> medium -> large -> GPT-4.
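That kind of chain is simple to sketch generically: try each model in order and move on if the call raises. This is an illustrative sketch, not liteLLM's actual fallback API; `call_fn` stands in for a completion function, and the model names are made up.

```python
# Sketch of a small -> large fallback chain (model names illustrative).
# call_fn stands in for a completion-style function; any exception
# triggers fallback to the next model in the list.

def complete_with_fallbacks(call_fn, models, messages):
    last_err = None
    for model in models:
        try:
            return call_fn(model=model, messages=messages)
        except Exception as err:  # real code would catch provider-specific errors
            last_err = err
    raise last_err

# Fake caller that fails for the first model, to show the fallback firing:
def fake_call(model, messages):
    if model == "local-llama2":
        raise RuntimeError("server overloaded")
    return {"choices": [{"message": {"content": f"answer from {model}"}}]}

result = complete_with_fallbacks(fake_call, ["local-llama2", "gpt-4"], [])
print(result["choices"][0]["message"]["content"])  # -> answer from gpt-4
```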
Given the costs and low quota limits, I would be surprised if any significant portion of the market is falling back from one expensive proprietary API to another.
With the rapid improvements in model servers like llama.cpp and vllm.ai, an abstraction layer over the "fastest model server of the month" would be useful.