Hi HN! Sumanyu and Marius here from Hamming (https://www.hamming.ai). Hamming lets you automatically test your LLM voice agent. In our interactive demo, you play the role of the voice agent, and our agent will play the role of a difficult end user. We'll then score your performance on the call. Try it here: https://app.hamming.ai/voice-demo (no signup needed). In practice, our agents call your agent!
LLM voice agents currently require a lot of iteration and tuning. For example, one of our customers is building an LLM drive-through voice agent for fast food chains. Their KPI is order accuracy. It's crucial for their system to gracefully handle dietary restrictions like allergies and customers who get distracted or otherwise change their minds mid-order. Mistakes in this context could lead to unhappy customers, potential health risks, and financial losses.
How do you make sure that such a thing actually works? Most teams spend hours calling their voice agent to find bugs, change the prompt or function definitions, and then call their voice agent again to ensure they fixed the problem and didn't create regressions. This is slow, ad hoc, and feels like a waste of time. In other areas of software development, automated testing has already eliminated this kind of repetitive grunt work — so why not here, too?
We initially spent a few months helping users create evals for prompts and LLM pipelines, but noticed two things:
1) Many of our friends were building LLM voice agents.
2) They were spending too much time on manual testing.
This suggested to us that more voice companies are coming, and that they will need something to make the iteration process easier. We decided to build it!
Our solution involves four steps:
(1) Create diverse but realistic user personas and scenarios covering the expected conversation space. We create these ourselves for each of our customers. Getting LLMs to create diverse scenarios even with high temperatures is surprisingly tricky. We're learning a lot of tricks along the way to create more randomness and more faithful role-play from the folks at https://www.reddit.com/r/LocalLLaMA/.
(2) Have our agents call your agent to test its ability to handle things like background noise, long silences, or interruptions. Alternatively, we can test just the LLM / logic layer (function calls, etc.) via an API hook.
(3) We score the outputs for each conversation using deterministic checks and LLM judges tailored to the specific problem domain (e.g., order accuracy, tone, friendliness). An LLM judge reviews the entire conversation transcript (including function calls and traces) against predefined success criteria, using examples of both good and bad transcripts as references. It then provides a classification output and detailed reasoning to justify its decisions. Building LLM judges that consistently align with human preferences is challenging, but we're improving with each judge we manually develop.
(4) Re-use the checks and judges above to score production traffic and track quality metrics over time (i.e., online evals).
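To make the "deterministic checks" in step (3) concrete, here is a minimal sketch of an order-accuracy check for the drive-through example. Everything in it is an assumption for illustration, not Hamming's actual API: the Scenario class, the add_item / remove_item function-call names, and the tiny menu are all hypothetical. The idea is to replay the function calls the voice agent made, compare the final order to what the simulated caller wanted, and flag any item that contains an ingredient the caller said they are allergic to.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """Hypothetical test scenario: what the simulated caller should end up with."""
    name: str
    expected_items: dict                      # item name -> quantity
    forbidden_ingredients: set = field(default_factory=set)


def final_order(function_calls):
    """Replay add/remove calls to get the order as it stood when the call ended."""
    order = Counter()
    for call in function_calls:
        qty = call.get("quantity", 1)
        if call["name"] == "add_item":
            order[call["item"]] += qty
        elif call["name"] == "remove_item":
            order[call["item"]] -= qty
    return +order  # unary plus drops items whose count fell to zero or below


def check_order(scenario, function_calls, menu):
    """Return a list of failure messages; an empty list means the check passed."""
    order = final_order(function_calls)
    failures = []
    if order != Counter(scenario.expected_items):
        failures.append(
            f"order mismatch: expected {scenario.expected_items}, got {dict(order)}"
        )
    for item in sorted(order):
        bad = menu.get(item, set()) & scenario.forbidden_ingredients
        if bad:
            failures.append(f"{item} contains forbidden {sorted(bad)}")
    return failures


# Hypothetical menu: item name -> ingredients that matter for allergy checks.
MENU = {
    "burger": {"gluten"},
    "fries": set(),
    "peanut_sundae": {"peanuts", "dairy"},
}

scenario = Scenario(
    name="peanut allergy, changes mind mid-order",
    expected_items={"burger": 1, "fries": 1},
    forbidden_ingredients={"peanuts"},
)

# The agent handled the change of mind correctly.
good_calls = [
    {"name": "add_item", "item": "burger"},
    {"name": "add_item", "item": "peanut_sundae"},
    {"name": "remove_item", "item": "peanut_sundae"},
    {"name": "add_item", "item": "fries"},
]

# The agent never removed the sundae after the caller mentioned the allergy.
bad_calls = [call for call in good_calls if call["name"] != "remove_item"]

good_failures = check_order(scenario, good_calls, MENU)
bad_failures = check_order(scenario, bad_calls, MENU)
```

Because the check is a pure function of the scenario and the recorded function calls, the same code could in principle run against simulated calls and against logged production calls, which is the kind of re-use step (4) describes. Fuzzier criteria like tone or friendliness are where an LLM judge would take over.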
We created a Loom recording showing our customers' logged-in experience. We cover how you store and manage scenarios, how you can trigger an experiment run, and how we score each transcript. See the video here: https://www.loom.com/share/839fe585aa1740c0baa4faa33d772d3e
We're inspired by our experiences at Tesla, where Sumanyu led growth initiatives as a data scientist, and Anduril, where Marius headed a data infrastructure team. At both companies, simulations were key to testing autonomous systems before deployment. A common challenge, however, was that simulations often fell short of capturing real-world complexity, resulting in outcomes that didn't always translate to reality. In voice testing, we're optimistic about overcoming this issue. With tools like PlayHT and ElevenLabs, we can generate highly realistic voice interactions, and by integrating LLMs that exhibit human-like reasoning, we hope our simulations will closely replicate how real users interact with voice agents.
For now, we're manually onboarding and activating each user. We're working hard to make it self-serve in the next few weeks. The demo at https://app.hamming.ai/voice-demo doesn't require any signup, though!
Our current pricing is a mix of usage and the number of seats: https://hamming.ai/pricing. We don't use customer data for training purposes or to benefit other customers, and we don't sell any data. We use PostHog to track usage. We're in the process of getting HIPAA compliance, with SOC 2 being next on the list.
Looking ahead, we're focused on making scenario generation and LLM judge creation more automated and self-serve. We also want to create personas based on real production conversations to make it easier to ‘replay’ a user on demand.
A natural next step beyond testing is optimization. We're considering building a voice agent optimizer (like DSPy) that takes the scenarios that failed in testing and generates a new set of prompts or function call definitions that make those scenarios pass. We find the potential of self-play and self-improvement here super exciting.
We'd love to hear about your experiences with voice agents, whether as a user or someone building them. If you're building in the voice or agentic space, we're curious about what is working well for you and what challenges you are encountering. We're eager to learn from your insights about setting up evals and simulation pipelines or your thoughts on where this space is heading.
If you're going to develop AI voice agents to tackle pre-determined cases, why wouldn't you just develop a self-serve non-voice UI that's way more efficient? Why make your users navigate a nebulous conversation tree to fulfill a programmable task?
Personally, when I realize I can only talk to a bot, I lose interest and end the call. If I wanted to do something routine, I wouldn't have called.