> [...] but it is hard to optimize prompts when you cannot build a test-suite of questions and correct replies against which you can test and improve the instruction prompt.
I think this is because your approach isn't right. This tech isn't really unit-testable in the same sense. In fact, for many use cases, you may want non-deterministic results by design.
Instead, you probably need evaluations. The idea is that you're still building out "test" cases, but instead of expecting a specific result each time, you get a result that you can score through some means. Each test case produces a score, and you get a rollup score for the suite, and that's how you can track regressions over time.
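To make that concrete, here's a rough sketch of what a suite like that can look like. The `run_prompt` call and the per-case `score` functions are stand-ins for however you invoke your model and grade its output:

```python
# Minimal evaluation-suite sketch: each case yields a score in [0, 1],
# and the suite rolls them up into a single number you can track over time.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    prompt_input: str                      # what you feed the model
    score: Callable[[str], float]          # grades the model's output, 0.0-1.0

def run_suite(cases: list[EvalCase], run_prompt: Callable[[str], str]) -> float:
    """Run every case, score its output, and return the rollup (mean) score."""
    scores = []
    for case in cases:
        output = run_prompt(case.prompt_input)   # non-deterministic by design
        scores.append(case.score(output))
    return sum(scores) / len(scores)

# Example case that only checks a cheap property of the output:
cases = [
    EvalCase("Summarize: ...", lambda out: 1.0 if len(out) < 500 else 0.0),
]
# rollup = run_suite(cases, run_prompt=my_llm_call)  # track this per prompt version
```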
For example, in our use case, we produce structured JSON that has to match a spec, but we also want the contents of that valid-to-spec JSON object to be "useful". So there's a function that defines "usefulness" based on criteria I've put together, since I'm a domain expert. It's something I can evolve over time, adding real-world inputs that produce bad or unsatisfying outputs as new cases in the evaluation suite.
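A sketch of that two-stage scoring, with a made-up schema and made-up usefulness criteria standing in for the real ones:

```python
# Sketch: score = 0 if the output isn't valid-to-spec JSON; otherwise a
# domain-expert "usefulness" function decides how good the contents are.
import json
import jsonschema  # third-party: pip install jsonschema

SPEC = {  # stand-in schema; the real spec is whatever your domain requires
    "type": "object",
    "required": ["title", "tags"],
    "properties": {
        "title": {"type": "string"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}

def usefulness(obj: dict) -> float:
    """Hypothetical domain criteria, evolved over time as bad outputs show up."""
    score = 0.0
    if obj["title"].strip():
        score += 0.5
    if 1 <= len(obj["tags"]) <= 5:
        score += 0.5
    return score

def score_output(raw: str) -> float:
    try:
        obj = json.loads(raw)
        jsonschema.validate(obj, SPEC)   # hard gate: must match the spec
    except (json.JSONDecodeError, jsonschema.ValidationError):
        return 0.0
    return usefulness(obj)               # soft grade on the valid object
```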
Fair warning though, it's not very easy to get started with, and there's not a whole lot of information about doing it well online.
This is what I do. I calculate a score over a sample of questions and replies. I'm not doing unit tests.
Comparing the scores of two prompts will not give you a definitive answer as to which one is superior. But the prediction of which one is superior would be more reliable without the noise added by the randomness in the LLM's execution.
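One way to dampen that noise (a sketch, not a statistical guarantee) is to repeat each question several times per prompt and compare mean scores; `run` and `score` here are placeholders for your own execution and grading functions:

```python
# Sketch: compare two prompts by averaging scores over repeated runs per
# question, so a single lucky or unlucky sample doesn't decide the winner.
from statistics import mean
from typing import Callable

def mean_score(prompt: str,
               questions: list[str],
               run: Callable[[str, str], str],      # (prompt, question) -> reply
               score: Callable[[str, str], float],  # (question, reply) -> score
               repeats: int = 5) -> float:
    return mean(
        score(q, run(prompt, q))
        for q in questions
        for _ in range(repeats)
    )

# better = max((prompt_a, prompt_b),
#              key=lambda p: mean_score(p, questions, run, score))
# Still a noisy estimate of which prompt is superior, not a proof.
```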
> Comparing the scores of two prompts will not give you a definitive answer which one is superior.
Yes, but it can tell you which is likely to be superior, which is perhaps good enough?
Offline evals are only a part of the equation though, which is why online evaluations are perhaps even more important. Good observability and a way to systematically measure "what is a good response" on production data is what ultimately gets us closer to real truth.
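A minimal sketch of that online side, with a hypothetical `score_fn` standing in for however you judge a production response (heuristic, LLM judge, user feedback):

```python
# Sketch: score and log every production response so "what is a good
# response" can be measured on real traffic, not just offline test cases.
import logging
import time

logger = logging.getLogger("online_evals")

def observe(question: str, reply: str, score_fn) -> None:
    """Attach an online-eval score to each production interaction."""
    score = score_fn(question, reply)    # hypothetical scorer
    logger.info("online_eval ts=%.0f question_len=%d score=%.2f",
                time.time(), len(question), score)
```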