Show HN: Ragas – Open-source library for evaluating RAG pipelines (github.com/explodinggradients)
121 points by shahules 37 days ago | 26 comments
Ragas is an open-source library for evaluating and testing RAG and other LLM applications. GitHub: https://github.com/explodinggradients/ragas, docs: https://docs.ragas.io/.

Ragas provides different sets of metrics, plus methods like synthetic test data generation, to help you evaluate your RAG applications. Ragas started off last year as a way to scratch our own itch while evaluating our RAG chatbots.
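
To give a flavour of the workflow, here is a minimal sketch (the metric names and the expected dataset columns follow our docs; the data itself is illustrative):

    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import faithfulness, answer_relevancy

    # One evaluation row: the question, your pipeline's answer, and the
    # retrieved contexts the answer was grounded on
    data = {
        "question": ["When was the Eiffel Tower completed?"],
        "answer": ["It was completed in 1889."],
        "contexts": [["The Eiffel Tower, built for the 1889 World's Fair, was completed in 1889."]],
    }

    result = evaluate(Dataset.from_dict(data),
                      metrics=[faithfulness, answer_relevancy])
    print(result)  # per-metric scores between 0 and 1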

Problems Ragas can solve

- How do you choose the best components for your RAG, such as the retriever, reranker, and LLM?

- How do you formulate a test dataset without spending tons of money and time?

We believe there needs to be an open-source standard for evaluating and testing LLM applications, and our vision is to build it for the community. We are tackling this challenge by adapting ideas from the traditional ML lifecycle to LLM applications.

ML Testing Evolved for LLM Applications

We built Ragas on the principles of metrics-driven development and aim to develop and innovate techniques inspired by state-of-the-art research to solve the problems in evaluating and testing LLM applications.

We don't believe the problem of evaluating and testing applications can be solved by building a fancy tracing tool; rather, we want to solve it from a layer lower in the stack. For this, we are introducing methods like automated synthetic test data curation, metrics, and feedback utilisation, inspired by lessons learned from deploying stochastic models in our careers as ML engineers.

While currently focused on RAG pipelines, our goal is to extend Ragas for testing a wide array of compound systems, including those based on RAGs, agentic workflows, and various transformations.

Try out Ragas in Google Colab: https://colab.research.google.com/github/shahules786/openai-... Read our docs to learn more: https://docs.ragas.io/

We would love to hear feedback from the HN community :)




congrats on launching! i think my continuing struggle with looking at Ragas as a company/library rather than a very successful mental model is that the core of it is like 8 metrics (https://github.com/explodinggradients/ragas/tree/main/src/ra...) that are each 1-200 LOC. i can inline that easily in my app and retain full control, or model that in langchain or haystack or whatever.

why is Ragas a library and a company, rather than an overall "standard" or philosophy (eg like Heroku's 12 Factor Apps) that could maybe be more universally adopted without using the library?

(just giving an opp to pitch some underappreciated benefits of using this library)


Thank you for asking this question.

To answer it, let me explain the two directions in which Ragas is growing.

The first is horizontal expansion of the library, which involves features like:

- Giving you the ability to use any LLM instantly, without any hassle (see the sketch after this list)

- Asynchronous evaluations, integrations with tracing tools, etc

- Automatic support to adapt metrics to any language
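
On the first point, swapping the evaluator LLM looks roughly like this (a sketch reusing the dataset from the quick-start above; LangchainLLMWrapper is the wrapper described in our docs, and the model choice is illustrative):

    from langchain_openai import ChatOpenAI
    from ragas.llms import LangchainLLMWrapper
    from ragas import evaluate
    from ragas.metrics import faithfulness

    # Wrap any LangChain chat model and pass it as the judge LLM;
    # `dataset` is the Dataset built in the quick-start sketch above
    evaluator_llm = LangchainLLMWrapper(ChatOpenAI(model="gpt-4"))
    result = evaluate(dataset, metrics=[faithfulness], llm=evaluator_llm)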

The second is vertical expansion: adding more core features, like metrics, to Ragas. This includes:

- Synthetic test data generation: this is heavily loved by our community, so we are continuously improving its quality. https://docs.ragas.io/en/stable/concepts/testset_generation....
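
A rough sketch of that flow (class and method names follow the docs linked above; the models and the question-type distribution are illustrative):

    from ragas.testset.generator import TestsetGenerator
    from ragas.testset.evolutions import simple, reasoning, multi_context
    from langchain_openai import ChatOpenAI, OpenAIEmbeddings

    # Build a generator from LangChain components, then synthesize test
    # questions of varying difficulty from your own documents
    generator = TestsetGenerator.from_langchain(
        generator_llm=ChatOpenAI(model="gpt-3.5-turbo"),
        critic_llm=ChatOpenAI(model="gpt-4"),
        embeddings=OpenAIEmbeddings(),
    )
    testset = generator.generate_with_langchain_docs(
        documents,  # your LangChain Document objects
        test_size=10,
        distributions={simple: 0.5, reasoning: 0.25, multi_context: 0.25},
    )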

As we expand in both directions, we aim to solve the problem of how to evaluate and test compound systems. To get there, we will be innovating on features like feedback utilisation, automatically synthesizing assertions, etc.

I hope I was able to answer your question. Would love to discuss more.


cool cool. so 1) will be a direct langchain competitor, and 2) is net new territory?


1) Horizontal expansion and support are core to every framework/library. This won't make us a competitor to LC; we actually use langchain-core to power much of this, such as support for different LLMs. 2) We operate at a layer underneath the evals-and-testing stack because we want to solve the problem from the ground up rather than building a fancy tracing tool, which comes later in the stack.


> that are each 1-200 LOC. i can inline that easily in my app and retain full control

Isn't that true of most of langchain as well though?


I think that's true of any early-stage library/framework. The tradeoff is that you then have to keep maintaining it, add support for other LLMs if you switch models, etc. In the end the OSS version will be far ahead, because by then it will have smoothed out its rough edges.


Or the OSS project will be going in a different direction than what you need, so if you are using it you'll either be stuck on an old version or have to keep fighting around it. ML libraries in particular have this annoying habit of not being very backwards compatible over more than 2-3 years.


Based on our initial evaluation of RAGAS a few months ago, it didn't provide the results our team was expecting and required a lot of customisation on top of it. Nevertheless, a pretty solid library.


Hey, thanks for trying out Ragas. As an open-source library, we are continuously improving based on feedback from the community, which I see as our primary strength. Ragas is not perfect yet, but I can assure you it is 10x better than it was a few months ago.


Also check out DeepEval... our team has been using it for a while, and it's been working well for us because we can evaluate any LLM, something this library doesn't seem to support (https://github.com/confident-ai/deepeval).


Hey, DeepEval is interesting. What do you mean by "evaluating any LLMs"?


This is nice. Open-source LLM evaluation libraries are appearing more and more often.

We're using DeepEval (https://github.com/confident-ai/deepeval) currently. How is this different from that?


DeepEval also uses Ragas underneath. They initially took a different approach by allowing users to formulate test cases, while we focused only on RAGs, building metrics and features like synthetic test data generation for them. Now that we are doing well in the RAG category, we also want to expand to solve the greater challenge.


Great product and great progress.

The first step to building RAG is always to evaluate.

Besides all the current evaluations, cost and perf should also be part of the evaluations.


Could you elaborate on what you mean by perf? We'll add cost soon.


Congratulations on the launch! Personally, I would love to see rough estimates of the expected number of requests and tokens required to run tasks like synthetic data generation for different amounts of data. Though this is likely highly variable, I'd like a loose idea of the costs and execution time that could be incurred.


Hey, this is a highly requested feature, and we will be implementing it soon. A rough estimate like that is what we are planning.


Congratulations on the launch of Ragas! This looks like an incredibly valuable tool for the LLM community. As the library continues to evolve, it will be interesting to see how it adapts to handle the growing diversity of LLM architectures and use cases.


Yes, this is an interesting challenge we are also excited about.


Congratulations on the launch! I was unable to use this library: I was trying to evaluate different non-OpenAI models, and it consistently failed due to malformed JSON coming from the model.

Any thoughts about using different models? Is this just a langchain limitation?


Thanks for your feedback. We have tested Ragas on alternatives like Claude, Mixtral, Gemini, etc.

Although we support all LLMs supported by LangChain, sadly many OSS models out of the box aren't capable of generating JSON output, which is important for us to ensure reproducibility.
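
A common workaround is to validate and retry; a generic sketch (not the Ragas API; the helper and its names are hypothetical):

    import json

    def call_for_json(llm_call, prompt, max_retries=3):
        # llm_call: any callable that takes a prompt string and returns text
        for _ in range(max_retries):
            raw = llm_call(prompt + "\n\nRespond with valid JSON only.")
            try:
                return json.loads(raw)
            except json.JSONDecodeError:
                continue  # reprompt; weaker models often wrap JSON in prose
        raise ValueError("no valid JSON after retries")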


Any tips for Mixtral? That’s what we tried


Hey, I would recommend checking out our PRs. There are some that modify the prompts to better suit Mixtral.


Check out this instead: https://github.com/confident-ai/deepeval

It also has a native Ragas implementation but supports all models.


Phenomenal to see how Ragas has progressed. Congratulations on the launch!


Thank you.



