We did some research into the different ways of evaluating LLMs, but there is a lot of literature covering many different approaches, ranging from scores like BLEURT, to precision/recall if you have ground-truth data, all the way to asking GPT to act as a human rater.
Are there evaluation strategies that have worked best for you? We basically want to let users/developers evaluate which LLM (and specifically, which combination of LLM + prompt + parameters) performs best for their use case (which seems different from the OpenAI evals framework, or benchmarks like BigBench).
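For concreteness, this is roughly the shape of the harness we're imagining: rank candidate (model, prompt, parameters) configurations by their average score on a small, use-case-specific test set. This is just a rough Python sketch; `call_llm`, `score`, and the test cases are placeholders, not tied to any particular SDK or metric.

```python
# Minimal sketch of a per-use-case eval harness. `call_llm` and `score`
# are placeholders for whatever provider SDK / metric / LLM-as-judge setup
# you use -- they are not real library calls.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Config:
    model: str            # placeholder model identifier
    prompt_template: str  # template with an {input} slot
    temperature: float

# A handful of use-case-specific test cases with reference answers (placeholders).
TEST_CASES = [
    {"input": "Summarize: ...", "reference": "..."},
]

def evaluate(config: Config,
             call_llm: Callable[[Config, str], str],
             score: Callable[[str, str], float]) -> float:
    """Average score of one (model, prompt, parameters) combination on the test set."""
    total = 0.0
    for case in TEST_CASES:
        prompt = config.prompt_template.format(input=case["input"])
        output = call_llm(config, prompt)          # provider-specific generation call
        total += score(output, case["reference"])  # e.g. exact match, BLEURT, or an LLM judge
    return total / len(TEST_CASES)

# Usage: rank candidate configs by average score, e.g.
#   configs = [Config("model-a", "Answer concisely:\n{input}", 0.0), ...]
#   best = max(configs, key=lambda c: evaluate(c, call_llm, score))
```

The open question for us is mostly the `score` function: when is a learned metric like BLEURT good enough, when do you need ground-truth-based precision/recall, and when is an LLM judge the only practical option?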