Show HN: UpTrain (YC W23) – open-source tool to evaluate LLM response quality (uptrain.ai)
12 points by sourabh03agr 8 months ago
Hello, we are Shikha and Sourabh, founders of UpTrain (YC W23), an open-source tool to evaluate the performance of your LLM applications on aspects such as correctness, tonality, hallucination, and fluency.

The Problem: Unlike traditional machine learning or deep learning models, where there is always a unique ground truth and metrics like precision, recall, and accuracy can quantify a model’s performance, LLMs are trickier: it is very difficult to estimate whether a response is correct. If you are using GPT-4 to write a recruitment email, there is no single correct email to compare against word for word.

As you build an LLM application, you want to compare it across different model providers, prompt configurations, etc., and figure out the best-working combination. Instead of manually skimming through a handful of model responses, you want to run them through hundreds of test cases, aggregate their scores, and make an informed decision. Additionally, as your application generates responses for real user queries, you don’t want to wait for users to complain about inaccuracies; instead, you want to monitor the model’s performance over time and get alerted to any drift.

Again, at the core of it, you want a tool to evaluate the quality of your LLM responses and assign quantitative scores.

The Solution: To solve this, we are building UpTrain, which provides a set of evaluation metrics so that you know when your application is going wrong. These metrics include traditional NLP metrics like ROUGE and BLEU, embedding-similarity metrics, as well as model-grading scores, i.e., where we use LLMs to evaluate different aspects of your response. A few of these evaluation metrics are listed below (a short usage sketch follows the list):

1. Response Relevancy: Measures whether the response contains any irrelevant information
2. Response Completeness: Measures whether the response answers all aspects of the given question
3. Factual Accuracy: Measures hallucination, i.e., whether the response contains any made-up information with respect to the provided context
4. Retrieved Context Quality: Measures whether the retrieved context has sufficient information to answer the given question
5. Response Tonality: Measures whether the response aligns with a specific persona or desired tone
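
To make this concrete, here is a minimal sketch of what running a few of these checks looks like from Python. Exact class and check names may differ slightly from the latest release, so treat it as illustrative and see the GitHub repo for the current API:

    # Illustrative sketch; class/check names may differ by version (see the repo).
    from uptrain import EvalLLM, Evals

    data = [{
        "question": "What license is UpTrain released under?",
        "context": "UpTrain is an open-source tool released under the Apache 2.0 license.",
        "response": "UpTrain is released under the Apache 2.0 license.",
    }]

    eval_llm = EvalLLM(openai_api_key="sk-...")  # credentials for the evaluator LLM

    results = eval_llm.evaluate(
        data=data,
        checks=[
            Evals.RESPONSE_RELEVANCE,     # irrelevant information in the response
            Evals.RESPONSE_COMPLETENESS,  # answers all aspects of the question?
            Evals.FACTUAL_ACCURACY,       # hallucination w.r.t. the provided context
        ],
    )
    print(results)  # per-row scores plus explanations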

We have designed workflows so that you can easily add your testing dataset, configure which checks you want to run (you can also define custom checks suited to your use case), and conveniently access the results via Streamlit dashboards.
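
For custom checks, the underlying model-grading idea can be expressed directly: ask an evaluator LLM to score the response against a rubric you define. The sketch below is a generic illustration of that idea, not the built-in custom-check interface; the grade_tone function, prompt wording, and model choice are placeholders:

    # Generic model-grading sketch; not the built-in custom-check API.
    # Function name, prompt wording, and model choice are placeholders.
    import json
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def grade_tone(response_text: str, persona: str = "formal recruiter") -> float:
        """Ask an evaluator LLM for a 0-1 score on how well the response matches a persona."""
        prompt = (
            f"You are grading tone. Persona: {persona}.\n"
            f"Response to grade:\n{response_text}\n\n"
            'Reply with JSON like {"score": 0.8, "reason": "..."}, score between 0 and 1.'
        )
        completion = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return json.loads(completion.choices[0].message.content)["score"]

    print(grade_tone("Hi! We'd love for you to interview with us next week."))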

UpTrain also has experimentation capabilities: you can specify different prompt variations and models to test, and use these quantitative checks to find the best configuration for your application.
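
The idea behind experimentation is straightforward: generate responses for each prompt/model combination over the same test set, run the same checks, and compare aggregate scores. A rough sketch, reusing data and eval_llm from the earlier snippet (the generate placeholder and the result key name are illustrative):

    # Rough sketch of comparing prompt variants by aggregating check scores.
    # `generate` stands in for your own LLM provider call; result key is illustrative.
    from statistics import mean

    prompt_variants = {
        "v1": "Answer using only the given context.\nQuestion: {question}\nContext: {context}",
        "v2": "You are a helpful assistant.\nContext: {context}\nQuestion: {question}",
    }

    def generate(prompt: str) -> str:
        return "..."  # replace with a call to your LLM provider

    scores = {}
    for name, template in prompt_variants.items():
        rows = []
        for case in data:
            response = generate(template.format(**case))
            rows.append({**case, "response": response})
        results = eval_llm.evaluate(data=rows, checks=[Evals.FACTUAL_ACCURACY])
        scores[name] = mean(r["score_factual_accuracy"] for r in results)

    print(scores)  # pick the variant with the highest aggregate score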

You can also use UpTrain to monitor your application’s performance and find avenues for improvement. We integrate directly with your databases (BigQuery, Postgres, MongoDB, etc.) and can run daily evaluations.
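
As a rough illustration of the monitoring loop, a scheduled job can pull the previous day’s production logs, score them with the same checks, and alert on regressions. Table and column names, the score key, and the threshold below are placeholders, not the built-in database integration:

    # Illustrative daily monitoring job; names and threshold are placeholders.
    from statistics import mean
    import psycopg2

    conn = psycopg2.connect("dbname=app user=readonly")
    with conn, conn.cursor() as cur:
        cur.execute(
            "SELECT question, context, response FROM llm_logs "
            "WHERE created_at >= now() - interval '1 day'"
        )
        rows = [dict(zip(("question", "context", "response"), r)) for r in cur.fetchall()]

    results = eval_llm.evaluate(data=rows, checks=[Evals.FACTUAL_ACCURACY])
    daily_score = mean(r["score_factual_accuracy"] for r in results)  # key name illustrative
    if daily_score < 0.8:  # arbitrary alerting threshold
        print(f"ALERT: factual accuracy dropped to {daily_score:.2f}")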

We’ve launched the tool under an Apache 2.0 license to make it easy for everyone to integrate it into their LLM workflows. Additionally, we provide a managed service (with a free trial) where you can run LLM evaluations via an API request or through the UpTrain testing console.

We would love for you to try it out and give your feedback.

Links:
Demo: https://demo.uptrain.ai/evals_demo/
GitHub repo: https://github.com/uptrain-ai/uptrain
Create an account (free): https://uptrain.ai/dashboard
UpTrain testing console (needs an account): https://demo.uptrain.ai/dashboard
Website: https://uptrain.ai/



