We're approaching LLM prompt evaluation at QA.tech

exdsq · 2024-09-17T09:16:12

Nice work! Is there any risk that an evaluation and tweaking cycle accidentally changes the original test requirements over time?

drakonka · 2024-09-17T10:07:24

Hey! Author here. Do you mean whether the cycle changes the original test requirements for the _evaluation test_, or for the test our agent is running for the customer's application?

If you are asking about the requirements for the evaluation test, that's possible for sure. We'll need to refine and maintain our eval asserts whenever that happens. For example, I'm currently revising how our test steps are generated (moving from batched generation to a more iterative process), which changes the structure of the output value we expect from the LLM. This will impact some of our deterministic (i.e., non-model-graded) prompt eval asserts.
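
To illustrate what a deterministic, non-model-graded assert of this kind might look like, here is a minimal Python sketch. The output shape (a JSON object with a `steps` list whose items carry `action` and `expected` strings) and the function name `assert_steps_structure` are assumptions made for illustration, not QA.tech's actual schema or code.

```python
import json


def assert_steps_structure(raw_output: str) -> list:
    """Check the shape of the LLM output without calling another model.

    Hypothetical schema: {"steps": [{"action": str, "expected": str}, ...]}
    """
    data = json.loads(raw_output)  # raises ValueError on malformed JSON
    if not isinstance(data, dict):
        raise AssertionError("expected a JSON object at the top level")

    steps = data.get("steps")
    if not isinstance(steps, list) or not steps:
        raise AssertionError("expected a non-empty 'steps' list")

    for i, step in enumerate(steps):
        if not isinstance(step, dict):
            raise AssertionError(f"step {i} is not an object")
        for key in ("action", "expected"):
            value = step.get(key)
            if not isinstance(value, str) or not value.strip():
                raise AssertionError(f"step {i} is missing a non-empty '{key}'")
    return steps
```

An assert like this is tied to the expected output structure, so a change such as switching from batched to iterative step generation would likely require updating it alongside the prompt.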

If you are asking about requirements for the _customer test_ itself: the customer test goal is generated _once_, based on testable actions we detect in the application or, optionally, manual input. Tweaking the test generation prompt could change the goal that gets generated, but once a goal is created it is essentially 'set in stone' for that test (unless manually modified). So in our agent's test result evaluation stage, after a test has run, the result is always evaluated against a fixed goal.

_However_, tweaking the _test evaluation_ stage's prompt could definitely impact how the result is evaluated even if the goal stays the same! This is why we have prompt eval tests running against tagged data points for that stage as well - to make sure our prompt tweaks to this stage produce expected results. And if the requirements for that output change as we go, we just tag more up-to-date data points (or add evals on manually-defined inputs if desired) to make sure we're testing on the right thing.
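
As a rough sketch of the idea of checking a prompt change against tagged data points, the Python below reruns an evaluator over stored examples and reports any whose verdict no longer matches the tagged one. The `TaggedPoint` fields, the `"pass"`/`"fail"` verdict strings, and the `evaluate` callable (which stands in for the LLM-backed evaluation stage) are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class TaggedPoint:
    """A stored example with a known-good verdict (hypothetical shape)."""
    goal: str              # the fixed test goal
    result_summary: str    # what happened when the test ran
    expected_verdict: str  # "pass" or "fail", as tagged by a human


def find_regressions(
    evaluate: Callable[[str, str], str],
    points: List[TaggedPoint],
) -> List[TaggedPoint]:
    """Return the tagged points whose verdict differs from the tagged one."""
    return [
        p for p in points
        if evaluate(p.goal, p.result_summary) != p.expected_verdict
    ]


def stub_evaluate(goal: str, result_summary: str) -> str:
    # Stand-in for the real evaluation prompt; assumption for this sketch only.
    return "pass" if "succeeded" in result_summary else "fail"


points = [
    TaggedPoint("User can log in", "Login succeeded, dashboard shown", "pass"),
    TaggedPoint("User can log in", "Error banner after submit", "fail"),
]
regressions = find_regressions(stub_evaluate, points)
```

When the requirements for the stage's output change, the tagged set itself is what gets updated, and the comparison logic stays the same.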

I hope that answers your question - let me know if that doesn't quite cover it!