What are the best ways of evaluating LLMs for specific use-cases? (lastmileai.dev)
7 points by saqadri on June 30, 2023 | 1 comment



We did some research into the different ways of evaluating LLMs, but the literature covers many different approaches: these range from learned metrics like BLEURT, to precision/recall when you have ground-truth data, all the way to asking GPT itself to act as a human rater.
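For the ground-truth route, here is a minimal sketch of what we mean: if the use case can be framed as classification (intent detection as a toy example), precision/recall against labeled examples is the most direct signal. The prediction list stands in for whatever model you're testing; nothing here is a real API.

    def precision_recall(predictions, ground_truth, positive_label):
        """Compute precision and recall for one label over paired lists."""
        pairs = list(zip(predictions, ground_truth))
        tp = sum(1 for p, g in pairs if p == positive_label and g == positive_label)
        fp = sum(1 for p, g in pairs if p == positive_label and g != positive_label)
        fn = sum(1 for p, g in pairs if p != positive_label and g == positive_label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall

    # Toy data standing in for (LLM output, human label) pairs.
    preds  = ["refund", "refund", "billing", "refund"]
    labels = ["refund", "billing", "billing", "refund"]
    print(precision_recall(preds, labels, "refund"))  # (0.666..., 1.0)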

Are there evaluation strategies that have worked best for you? We basically want to allow users/developers to evaluate which LLM (and specifically, which combination of LLM + prompt + parameters) performs best for their use case, which seems different from the OpenAI evals framework or benchmarks like BigBench. A rough sketch of what we have in mind is below.
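The kind of harness we're imagining looks roughly like this: sweep (model, prompt, params) combinations over a shared eval set and rank them by a pluggable metric. This is a sketch, not an implementation: call_model is a hypothetical placeholder for a real provider SDK call, and the model and prompt names are made up.

    from itertools import product

    def call_model(model, prompt, params, question):
        # Placeholder: swap in a real API call (OpenAI, Anthropic, local model, ...).
        return "42" if "answer" in question else ""

    def exact_match(output, expected):
        # Simplest possible metric; replace with BLEURT, an LLM judge, etc.
        return float(output.strip() == expected.strip())

    eval_set = [("What is the answer to everything?", "42")]
    models   = ["model-a", "model-b"]                        # hypothetical names
    prompts  = ["Answer tersely: {q}", "Think, then answer: {q}"]
    params   = [{"temperature": 0.0}, {"temperature": 0.7}]

    # Score every configuration on the same eval set.
    results = {}
    for m, p, kw in product(models, prompts, params):
        scores = [exact_match(call_model(m, p, kw, q), gold) for q, gold in eval_set]
        results[(m, p, str(kw))] = sum(scores) / len(scores)

    # Rank configurations by mean score.
    for config, score in sorted(results.items(), key=lambda kv: -kv[1]):
        print(f"{score:.2f}  {config}")

The point of the pluggable metric is that the same loop works whether the scorer is exact match, a learned metric, or a GPT-as-rater call.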



