After the initial rush of getting features to prod, at this point I would say we're doing "Evaluation-Driven Development". That is, new features are built by first building out the evaluations of their results.
At least from the people I've talked with, how important evaluations/tests are to your team seems to be the major differentiator between the people rushing out hot garbage and those seriously building products in this space.
More specifically: we're running hundreds/thousands of evaluations across a wide range of scenarios for the prompts we're using. We can very quickly see if there are regressions, and we have a small team of people working on improving this functionality.
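Not our exact setup, but here's a minimal sketch of what a prompt-regression suite like that can look like. Everything in it is a made-up stand-in for illustration: the `run_prompt` stub, the scenario names, and the pass checks.

```python
# Minimal prompt-regression harness sketch (hypothetical names throughout).
# Each "scenario" pairs inputs with a check; run the whole suite on every
# prompt change and compare the pass rate against the last known-good run.
import json
from dataclasses import dataclass
from typing import Callable

@dataclass
class Scenario:
    name: str
    inputs: dict
    check: Callable[[str], bool]  # returns True if the model output is acceptable

def run_prompt(template: str, inputs: dict) -> str:
    # Stand-in for the real model call (in practice, an API request).
    return template.format(**inputs)

def run_suite(template: str, scenarios: list[Scenario]) -> dict:
    failures = [s.name for s in scenarios
                if not s.check(run_prompt(template, s.inputs))]
    return {"total": len(scenarios),
            "failed": failures,
            "pass_rate": 1 - len(failures) / len(scenarios)}

if __name__ == "__main__":
    scenarios = [
        Scenario("returns_json", {"query": "list three fruits"},
                 check=lambda out: out.strip().startswith("{")),
        Scenario("mentions_query", {"query": "weather in Oslo"},
                 check=lambda out: "Oslo" in out),
    ]
    report = run_suite("Answer as JSON: {query}", scenarios)
    print(json.dumps(report, indent=2))  # diff pass_rate against the previous baseline
```

The point isn't the harness itself; it's that every prompt change gets scored against the same scenarios, so a regression shows up as a drop in pass rate instead of an angry user report.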
Interesting! How would you say you bootstrapped your evaluation system(s)?
This is a great perspective. We managed to bring our error rate (i.e., how often we fail to return a valid result) down to about 4% without evaluation-driven prompt engineering, but it did involve looking at real-world usage every day, noticing patterns, and doing our best (ugh, this is where evals would have been nice) not to regress things. Combined with some other pieces - basically definitions that end users can customize, which we parameterize into our prompt - this seemed to get us very close.
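For what I mean by parameterizing user-customizable definitions, here's a rough sketch (the template and field names are made up, not our real code): the end user supplies their own definitions, and we interpolate them into the prompt before calling the model.

```python
# Rough sketch of parameterizing user-defined definitions into a prompt
# (illustrative only; the task, template, and field names are hypothetical).
BASE_PROMPT = """You are classifying support tickets.
Use the customer's own category definitions:
{definitions}

Ticket: {ticket}
Respond with exactly one category name."""

def build_prompt(user_definitions: dict[str, str], ticket: str) -> str:
    definitions = "\n".join(f"- {name}: {desc}"
                            for name, desc in user_definitions.items())
    return BASE_PROMPT.format(definitions=definitions, ticket=ticket)

print(build_prompt(
    {"billing": "anything about invoices or refunds",
     "bug": "the product behaves incorrectly"},
    "I was charged twice this month.",
))
```

Letting users define the categories themselves meant the prompt stayed aligned with how they actually talk about their data, which cut a lot of the errors we'd otherwise have had to chase by hand.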