I built a methodology at Fiverr Labs for generating agent prompts from product specs using automated tests instead of manual prompt engineering. You write a behavioral spec, one coding agent generates tests from it, and a second agent iterates on the prompt until the tests pass. Hidden test splits and mutation testing guard against specification gaming.
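Rough shape of the loop in Python, to make the flow concrete. All names here are mine (hypothetical, not from the repo), and the two LLM agents are stubbed as plain functions; mutation testing is omitted for brevity:

```python
# Sketch of the spec -> tests -> prompt compile loop.
# generate_tests / revise_prompt stand in for the two LLM agents.

def generate_tests(requirements):
    # Test-writer agent (stub): one behavioral check per requirement.
    return [lambda out, req=req: req in out for req in requirements]

def revise_prompt(prompt, failures):
    # Prompt-writer agent (stub): patch the prompt to cover failing checks.
    return prompt + " " + " ".join(failures)

def compile_prompt(spec, max_iters=10):
    # Hold out part of the spec as a hidden split the prompt-writer never sees,
    # so overfitting to the visible tests (specification gaming) gets caught.
    cut = len(spec) // 2 + 1
    visible, hidden = spec[:cut], spec[cut:]
    tests = generate_tests(visible)
    prompt = ""
    for _ in range(max_iters):
        out = prompt  # stand-in for running the agent under this prompt
        failures = [req for req, t in zip(visible, tests) if not t(out)]
        if not failures:
            break
        prompt = revise_prompt(prompt, failures)
    hidden_score = sum(req in prompt for req in hidden) / max(len(hidden), 1)
    return prompt, hidden_score

spec = ["greet the user", "ask clarifying questions", "cite sources"]
prompt, hidden_score = compile_prompt(spec)
```

In this toy run the prompt converges on the visible requirements but scores 0 on the hidden split, which is exactly the failure mode the held-out tests are there to surface.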
Evaluated on 4 agent specs across 24 trials — 92% compilation success, $2–3 per compilation. The benchmark and all code are open at https://github.com/f-labs-io/tdad-paper-code
Happy to discuss the methodology, limitations, and directions for follow-up work.