Hi everyone,
I’ve been working on an open-source tool called Flakestorm to test the reliability of AI agents before they hit production.
Most agent testing today focuses on eval scores or happy-path prompts. In practice, agents tend to fail in more mundane ways: typos, tone shifts, long context, malformed input, or simple prompt injections — especially when running on smaller or local models.
Flakestorm applies chaos-engineering ideas to agents. Instead of testing one prompt, it takes a “golden prompt”, generates adversarial mutations (semantic variations, noise, injections, encoding edge cases), runs them against your agent, and produces a robustness score plus a detailed HTML report showing what broke.
Key points:
Local-first (uses Ollama for mutation generation)
Tested with Qwen / Gemma / other small models
Works against HTTP agents, LangChain chains, or Python callables
No cloud or API keys required
This started as a way to debug my own agents after seeing them behave unpredictably under real user input. I’m still early and trying to understand how useful this is outside my own workflow.
I’d really appreciate feedback on:
Whether this overlaps with how you test agents today
Failure modes you’ve seen that aren’t covered
Whether “chaos testing for agents” is a useful framing, or if this should be thought of differently
Repo: https://github.com/flakestorm/flakestorm
Docs are admittedly long.
Thanks for taking a look.