The initial motivation is to run benchmarks, though the foundation is flexible and can support many other use cases over time.
It's already proving useful. For example, I can run a benchmark, view the results in a dashboard, and even feed the report into Claude Code to answer questions like:
"How did changing X affect the results?" or "What could be improved in the next run?"
reply