Ragas is an eval tool that needs ground truths and queries for evaluation. FiddleCube generates the queries and ground truth needed for eval in Ragas, LangSmith, or any eval tool of your choice.
We incorporate user prompts to generate the outputs and provide diagnostics and feedback for improvement, rather than eval metrics. So you can plug in your low-scoring queries from Ragas, along with your prompt and context, and FiddleCube can provide the root cause and the ideal response.
This is an alternative to manual auditing and testing, where an auditor manually curates the ideal dataset.
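To make that concrete, here is a minimal sketch (not FiddleCube's API) of pulling the low-scoring rows out of a Ragas run so they can be handed off for root-cause analysis. The metric choice, the 0.7 threshold, and the column names (which follow older Ragas versions) are assumptions:

    # Sketch: collect the low-scoring queries from a Ragas evaluation.
    from datasets import Dataset
    from ragas import evaluate
    from ragas.metrics import faithfulness, answer_relevancy

    # One example row; in practice this comes from your RAG app's logs.
    eval_data = Dataset.from_dict({
        "question": ["What is the refund window?"],
        "answer": ["Refunds are accepted within 30 days."],
        "contexts": [["Our policy allows refunds within 30 days of purchase."]],
        "ground_truth": ["Refunds are accepted within 30 days of purchase."],
    })

    result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
    df = result.to_pandas()

    # Keep only the queries that scored poorly; these rows, plus the prompt
    # and context, are what you would hand off for diagnosis.
    low_scoring = df[df["faithfulness"] < 0.7]
    print(low_scoring[["question", "faithfulness"]])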
Ragas also has a feature to generate ground truths and queries: https://docs.ragas.io/en/latest/getstarted/testset_generatio...
That said, simply prompting an LLM with chunks of the source documents might work better and cheaper; Ragas tends to explode with retries in my experience.
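For reference, the "just prompt an LLM with your chunks" approach can be as small as the sketch below; the model name, prompt wording, and JSON output format are all placeholders:

    # Rough sketch of generating Q&A pairs directly from document chunks.
    import json
    from openai import OpenAI

    client = OpenAI()

    def qa_pairs_from_chunk(chunk: str, n: int = 3) -> list[dict]:
        prompt = (
            f"Generate {n} question-answer pairs that are fully answerable "
            f"from the text below. Return a JSON list of objects with "
            f"'question' and 'answer' keys.\n\nText:\n{chunk}"
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        # Assumes the model returns valid JSON; add retries/validation in practice.
        return json.loads(resp.choices[0].message.content)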
Is this done by calling GPT-4o with the user query, prompt, and context to generate the result as ground truth, plus the analysis? If so, what is the value added, apart from automation?
While we do call LLMs (internal and external, depending on the instruction type), LLM output can't be taken as ground truth unless we run rigorous evaluations on it. We have our own metrics for what qualifies as ground truth, based on the user's seed information and business logic.
Accuracy & preciseness needs also differ from use-case to use case. Function calling adds in another layer.
Another value add is the types of instructions we can generate. We currently expose 7 and are working on exposing more instruction types. The challenge is creating ground truth for the wide variety of questions a user can ask of a business, including guardrailing.
We have built internal tools and agents to solve these, and we are internally discussing the best way to expose them to users and whether that would be beneficial for the community. Any thoughts on that would be appreciated.
Automation took a significant amount of time for us as well, so at scale even a reliable automated CI/CD pipeline is a value add in itself.
Lmk if I can add more details to answer the question.
We identified and solved two key problems with generating data using GPT:
1. Duplicate/similar data points - we solve this by adding deduplication to our pipeline.
2. Incorrect question-answer pairs - we check for correctness and context relevance and filter out incorrect rows (see the sketch after this list).
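A rough illustration of those two filters, under common assumptions (embedding-similarity dedup and an LLM-as-judge support check), not FiddleCube's actual implementation:

    # Sketch: (1) drop near-duplicate questions via embedding similarity,
    # (2) drop rows an LLM judge marks as unsupported by the context.
    import numpy as np
    from openai import OpenAI

    client = OpenAI()
    SIM_THRESHOLD = 0.92  # assumed cutoff for "near-duplicate"

    def embed(texts):
        resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
        return np.array([d.embedding for d in resp.data])

    def deduplicate(rows):
        vecs = embed([r["question"] for r in rows])
        vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
        kept, kept_vecs = [], []
        for row, v in zip(rows, vecs):
            # Keep a row only if it is not too similar to anything kept so far.
            if all(float(v @ kv) < SIM_THRESHOLD for kv in kept_vecs):
                kept.append(row)
                kept_vecs.append(v)
        return kept

    def is_supported(row):
        prompt = (
            "Answer strictly YES or NO: is the answer below fully supported "
            f"by the context?\n\nContext: {row['context']}\n"
            f"Question: {row['question']}\nAnswer: {row['answer']}"
        )
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return resp.choices[0].message.content.strip().upper().startswith("YES")

    def clean(rows):
        return [r for r in deduplicate(rows) if is_supported(r)]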
Apart from this, we generate a diverse set of questions, including complex reasoning and chain-of-thought questions.
We also generate domain-specific unsafe questions - questions that violate the terms and conditions of the particular LLM - to test the model's guardrails.