We open-sourced Scenario, a tiny framework to simulate and test AI agents by using another AI agent — much like how self-driving cars are tested in controlled environments before real-world deployment.

The idea: if you wouldn’t deploy an autonomous vehicle without simulation, why would you deploy an AI agent without pressure-testing it first?

Scenario lets you:

- Describe multi-turn user flows (e.g. "book a flight and cancel it")
- Set success criteria (e.g. "user got confirmation + refund info")
- Have one agent simulate the user, testing how another agent performs the task
- Debug regressions, track failures, and iterate on behavior
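
To make that concrete, here is a minimal from-scratch sketch of the idea. This is not the Scenario API; `llm`, `my_agent`, and the judging step are illustrative stand-ins, assuming the OpenAI Python client and an `OPENAI_API_KEY`. One model plays the user, your agent answers, and a judge checks the success criteria at the end:

    # Illustrative sketch only, not the Scenario API: one LLM simulates the
    # user, your agent answers, and a judge LLM checks the success criteria.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    def llm(system: str, messages: list[dict]) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "system", "content": system}] + messages,
        )
        return resp.choices[0].message.content

    def my_agent(history: list[dict]) -> str:
        # swap in your real agent here (LangGraph, CrewAI, a plain function, ...)
        return llm("You are a helpful airline support agent.", history)

    def test_book_and_cancel_flight():
        criteria = "User received a booking confirmation AND cancellation/refund info."
        history: list[dict] = []
        for _ in range(6):  # multi-turn conversation
            # flip roles so the simulator sees the agent's replies as its "user"
            flipped = [{"role": "user" if m["role"] == "assistant" else "assistant",
                        "content": m["content"]} for m in history]
            user_msg = llm("Simulate a traveler who books a flight and then cancels it. "
                           "Reply with the user's next message only.", flipped)
            history.append({"role": "user", "content": user_msg})
            history.append({"role": "assistant", "content": my_agent(history)})
        verdict = llm(f"Did this conversation satisfy: {criteria} Answer PASS or FAIL.",
                      [{"role": "user", "content": str(history)}])
        assert "PASS" in verdict

The real library handles the loop, criteria checking, and reporting for you; the point is just that the "simulated user" is itself an LLM you can script.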

It’s like writing unit tests, but for conversations.

Why? Traditional evals fall short when testing multi-step flows, memory handling, tool use, or goal alignment. Inspired by simulation in autonomous vehicle testing, where cars are exposed to rare, edge-case situations at scale, Scenario allows similar validation loops for AI agents.

Instead of hoping your support bot or agent workflow handles complexity, you simulate it:

It works with any agent framework; you just define how your agent is called.

Minimal setup required. There are a few starter examples in the repo: https://github.com/langwatch/scenario
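
Since the only integration surface is a plain "conversation in, reply out" callable, plugging in an existing framework is just a thin adapter. A hedged example (the `.invoke()` call and message shape are assumptions about your framework, not a documented interface):

    # Hypothetical adapter: expose an existing framework agent as a plain
    # `history -> reply` callable so the simulated user can drive it.
    def wrap_framework_agent(framework_agent):
        def agent(history: list[dict]) -> str:
            # adapt this call to whatever your framework actually exposes
            result = framework_agent.invoke({"messages": history})
            return result["messages"][-1]["content"]
        return agent

    # my_agent = wrap_framework_agent(my_compiled_graph)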

We’d love feedback from the HN community:

What test patterns do you (want to) use for agent workflows?

Thanks!


Wow, they were really one of the first companies I noticed in this space, and I always heard great feedback about the founder when speaking to prospects. All the best on the upcoming journey.

If there are any users/customers around here: we provide full migration support from any platform (including Humanloop :)). Contact @ langwatch.ai

cheers


LLM evaluations are tricky. You can measure accuracy, latency, cost, hallucinations, bias... but what really matters for your app? Instead of relying on generic benchmarks, build your own evals focused on your use case, and then bring those evals into real-time monitoring of your LLM app. We open-sourced LangWatch to help with this. How are you handling LLM evals in production?
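
As one hedged illustration of a use-case-specific eval (plain Python, nothing LangWatch-specific; the refund-policy check and the golden set are made up):

    # A use-case-specific eval: does a support answer mention the refund
    # policy without over-promising? Runs offline on logged (question, answer)
    # pairs, and the same check can later score live traffic.
    import re

    def eval_refund_answer(question: str, answer: str) -> dict:
        mentions_policy = "refund policy" in answer.lower()
        overpromises = re.search(r"guarantee[d]? (a )?refund", answer.lower()) is not None
        return {"passed": mentions_policy and not overpromises,
                "mentions_policy": mentions_policy,
                "overpromises": overpromises}

    golden_set = [
        ("Can I get my money back?",
         "Per our refund policy, you can request a refund within 30 days."),
        ("Can I get my money back?",
         "Yes, we guarantee a refund no matter what."),
    ]
    results = [eval_refund_answer(q, a) for q, a in golden_set]
    print(sum(r["passed"] for r in results), "of", len(results), "passed")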


Excited to introduce LangWatch, a tool designed for developers working with LLMs. It lets you experiment with DSPy optimizers in a simple way and monitor the performance of LLM features in your projects. Key features include:

- DSPy optimizer integration: test various optimization algorithms to enhance your LLM's efficiency.
- Monitoring: visualize the performance metrics of different LLM features over time.
- Open source: contributions are welcome to help improve and expand the tool's capabilities.

Feedback, suggestions, and contributions are highly appreciated.
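
For those curious what the DSPy side looks like, here is a hedged sketch of an optimization run (DSPy's API has shifted between versions, so treat the model string and exact calls as assumptions):

    # Hedged sketch of a DSPy optimization loop; model name and configuration
    # calls are assumptions that may differ across DSPy versions.
    import dspy
    from dspy.teleprompt import BootstrapFewShot

    dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

    class AnswerQuestion(dspy.Signature):
        """Answer a support question concisely."""
        question = dspy.InputField()
        answer = dspy.OutputField()

    program = dspy.Predict(AnswerQuestion)

    trainset = [
        dspy.Example(question="How do I reset my password?",
                     answer="Use the 'Forgot password' link on the login page.")
            .with_inputs("question"),
    ]

    def metric(example, pred, trace=None):
        # toy metric: the reference answer should appear in the prediction
        return example.answer.lower() in pred.answer.lower()

    optimized = BootstrapFewShot(metric=metric).compile(program, trainset=trainset)
    print(optimized(question="How do I reset my password?").answer)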


Awesome to see more open-source tools in this space. In transparency, we're building the OSS tool https://github.com/langwatch/langwatch, which is a tool for tracing and monitoring your LLM features; OpenTelemetry is supported as well. Monitoring is key for any team building LLM features, and much can still be done in this field. What I believe in is the power of optimizing once you understand your performance with these solutions; for example, we're using DSPy optimizers. Curious about your thoughts on this too! Congrats on the launch and all the best!
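
To make the tracing point concrete, here is a hedged sketch of emitting LLM spans with plain OpenTelemetry and exporting them over OTLP (the endpoint is a placeholder and the attribute names are my own convention, not a documented schema; check the LangWatch docs for the actual integration):

    # Hedged sketch: wrap an LLM call in an OpenTelemetry span and export it
    # over OTLP. Endpoint and attribute names below are placeholders.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor
    from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(
        OTLPSpanExporter(endpoint="https://<your-collector>/v1/traces")  # placeholder
    ))
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer("my-llm-app")

    def answer(question: str) -> str:
        with tracer.start_as_current_span("llm.completion") as span:
            span.set_attribute("llm.model", "gpt-4o-mini")
            span.set_attribute("llm.input", question)
            reply = "..."  # call your model here
            span.set_attribute("llm.output", reply)
            return reply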

