> [...] but it is hard to optimize prompts when you cannot build a test-suite of questions and correct replies against which you can test and improve the instruction prompt.
I think this is because your approach isn't right. This tech isn't really unit-testable in the same sense. In fact, for many use cases, you may want non-deterministic results by design.
Instead, you probably need evaluations. The idea is that you're still building out "test" cases, but instead of expecting a specific result each time, you get a result that you can score through some means. Each test case produces a score, and you get a rollup score for the suite, and that's how you can track regressions over time.
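To make that concrete, here's a rough sketch of what a suite like that can look like. The `run_prompt` call and the per-case `score` functions are stand-ins for however you invoke your model and grade its output:

```python
# Minimal evaluation-suite sketch: each case yields a score in [0, 1],
# and the suite rolls them up into a single number you can track over time.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt_input: str                      # what you feed the model
    score: Callable[[str], float]          # grades the model's output, 0.0-1.0

def run_suite(cases: list[EvalCase], run_prompt: Callable[[str], str]) -> float:
    """Run every case, score its output, and return the rollup (mean) score."""
    scores = []
    for case in cases:
        output = run_prompt(case.prompt_input)   # non-deterministic by design
        scores.append(case.score(output))
    return sum(scores) / len(scores)

# Example case that only checks a cheap property of the output:
cases = [
    EvalCase("Summarize: ...", lambda out: 1.0 if len(out) < 500 else 0.0),
]
# rollup = run_suite(cases, run_prompt=my_llm_call)  # track this per prompt version
```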
For example, in our use case, we produce structured JSON that has to match a spec, but we also want the contents of that valid-to-spec JSON object to be "useful". So there's a function that defines "usefulness" based on criteria I've put together, since I'm a domain expert. It's something I can evolve over time, adding real-world inputs that produce bad or unsatisfying outputs as new cases in the evaluation suite.
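A sketch of that two-stage scoring, with a made-up schema and made-up usefulness criteria standing in for the real ones:

```python
# Sketch: score = 0 if the output isn't valid-to-spec JSON; otherwise a
# domain-expert "usefulness" function decides how good the contents are.
import json
import jsonschema  # third-party: pip install jsonschema

SPEC = {  # stand-in schema; the real spec is whatever your domain requires
    "type": "object",
    "required": ["title", "tags"],
    "properties": {
        "title": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}

def usefulness(obj: dict) -> float:
    """Hypothetical domain criteria, evolved over time as bad outputs show up."""
    score = 0.0
    if obj["title"].strip():
        score += 0.5
    if 1 <= len(obj["tags"]) <= 5:
        score += 0.5
    return score

def score_output(raw: str) -> float:
    try:
        obj = json.loads(raw)
        jsonschema.validate(obj, SPEC)   # hard gate: must match the spec
    except (json.JSONDecodeError, jsonschema.ValidationError):
        return 0.0
    return usefulness(obj)               # soft grade on the valid object
```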
Fair warning though, it's not very easy to get started with, and there's not a whole lot of information about doing it well online.
This is what I do. I calculate a score over a sample of questions and replies. I'm not doing unit tests.
Comparing the scores of two prompts will not give you a definitive answer as to which one is superior. But the prediction of which one is superior would be more reliable without the noise added by the randomness in the LLM's execution.
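One way to dampen that noise (a sketch, not a statistical guarantee) is to repeat each question several times per prompt and compare mean scores; `run` and `score` here are placeholders for your own execution and grading functions:

```python
# Sketch: compare two prompts by averaging scores over repeated runs per
# question, so a single lucky or unlucky sample doesn't decide the winner.
from statistics import mean
from typing import Callable

def mean_score(prompt: str,
               questions: list[str],
               run: Callable[[str, str], str],      # (prompt, question) -> reply
               score: Callable[[str, str], float],  # (question, reply) -> score
               repeats: int = 5) -> float:
    return mean(
        score(q, run(prompt, q))
        for q in questions
        for _ in range(repeats)
    )

# better = max((prompt_a, prompt_b),
#              key=lambda p: mean_score(p, questions, run, score))
# Still a noisy estimate of which prompt is superior, not a proof.
```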
> Comparing the scores of two prompts will not give you a definitive answer which one is superior.
Yes, but it can tell you which is likely to be superior, which is perhaps good enough?
Offline evals are only a part of the equation though, which is why online evaluations are perhaps even more important. Good observability and a way to systematically measure "what is a good response" on production data is what ultimately gets us closer to real truth.
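A minimal sketch of that online side, with a hypothetical `score_fn` standing in for however you judge a production response (heuristic, LLM judge, user feedback):

```python
# Sketch: score and log every production response so "what is a good
# response" can be measured on real traffic, not just offline test cases.
import logging
import time

logger = logging.getLogger("online_evals")

def observe(question: str, reply: str, score_fn) -> None:
    """Attach an online-eval score to each production interaction."""
    score = score_fn(question, reply)    # hypothetical scorer
    logger.info("online_eval ts=%.0f question_len=%d score=%.2f",
                time.time(), len(question), score)
```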