Launch HN: Hamming (YC S24) – Automated Testing for Voice Agents
129 points by sumanyusharma 30 days ago | 66 comments
Hi HN! Sumanyu and Marius here from Hamming (https://www.hamming.ai). Hamming lets you automatically test your LLM voice agent. In our interactive demo, you play the role of the voice agent, and our agent will play the role of a difficult end user. We'll then score your performance on the call. Try it here: https://app.hamming.ai/voice-demo (no signup needed). In practice, our agents call your agent!

LLM voice agents currently require a lot of iteration and tuning. For example, one of our customers is building an LLM drive-through voice agent for fast food chains. Their KPI is order accuracy. It's crucial for their system to gracefully handle dietary restrictions like allergies and customers who get distracted or otherwise change their minds mid-order. Mistakes in this context could lead to unhappy customers, potential health risks, and financial losses.

How do you make sure that such a thing actually works? Most teams spend hours calling their voice agent to find bugs, change the prompt or function definitions, and then call their voice agent again to ensure they fixed the problem and didn't create regressions. This is slow, ad hoc, and feels like a waste of time. In other areas of software development, automated testing has already eliminated this kind of repetitive grunt work — so why not here, too?

We had initially spent a few months helping users create evals for prompts & LLM pipelines, but we noticed two things:

1) Many of our friends were building LLM voice agents.

2) They were spending too much time on manual testing.

This gave us evidence that there will be more voice companies in the future, and they will need something to make the iteration process easier. We decided to build it!

Our solution involves four steps:

(1) Create diverse but realistic user personas and scenarios covering the expected conversation space. We create these ourselves for each of our customers. Getting LLMs to create diverse scenarios, even with high temperatures, is surprisingly tricky. Along the way, we're picking up a lot of tricks from the folks at https://www.reddit.com/r/LocalLLaMA/ for creating more randomness and more faithful role-play.

(2) Have our agents call your agent to test its ability to handle things like background noise, long silences, or interruptions. Or have us test just the LLM / logic layer (function calls, etc.) via an API hook.

(3) We score the outputs for each conversation using deterministic checks and LLM judges tailored to the specific problem domain (e.g., order accuracy, tone, friendliness). An LLM judge reviews the entire conversation transcript (including function calls and traces) against predefined success criteria, using examples of both good and bad transcripts as references. It then provides a classification output and detailed reasoning to justify its decisions. Building LLM judges that consistently align with human preferences is challenging, but we're improving with each judge we manually develop. (A rough sketch of what one of these judges looks like follows this list.)

(4) Re-use the checks and judges above to score production traffic and use it to track quality metrics in production. (i.e., online evals)
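
To make step (3) concrete, here's a rough, illustrative sketch of what such a judge can look like. It's only a sketch: the model name, criteria, and helper are placeholders, not our production code.

    # Rough sketch of an LLM judge (illustrative; model, criteria, and
    # helper names are placeholders, not our production implementation).
    import json
    from openai import OpenAI

    client = OpenAI()

    JUDGE_PROMPT = """You are grading a drive-through voice-agent conversation.
    Success criteria:
    - Every item in the final order matches what the caller asked for.
    - Dietary restrictions (e.g., allergies) are respected.
    Return JSON: {"verdict": "pass" or "fail", "reasoning": "..."}"""

    def judge_transcript(transcript: str, function_calls: list[dict]) -> dict:
        # The judge sees the full transcript plus function calls / traces.
        resp = client.chat.completions.create(
            model="gpt-4o",  # placeholder: any capable judge model
            temperature=0,
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": JUDGE_PROMPT},
                {"role": "user", "content": json.dumps(
                    {"transcript": transcript, "function_calls": function_calls}
                )},
            ],
        )
        return json.loads(resp.choices[0].message.content)

The deterministic checks (e.g., does the submitted order JSON exactly match the expected order?) run alongside the judge, and it's this same pair of checks that gets pointed at production traffic in step (4).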

We created a Loom recording showing our customers' logged-in experience. We cover how you store and manage scenarios, how you can trigger an experiment run, and how we score each transcript. See the video here: https://www.loom.com/share/839fe585aa1740c0baa4faa33d772d3e

We're inspired by our experiences at Tesla, where Sumanyu led growth initiatives as a data scientist, and Anduril, where Marius headed a data infrastructure team. At both companies, simulations were key to testing autonomous systems before deployment. A common challenge, however, was that simulations often fell short of capturing real-world complexity, resulting in outcomes that didn't always translate to reality. In voice testing, we're optimistic about overcoming this issue. With tools like PlayHT and ElevenLabs, we can generate highly realistic voice interactions, and by integrating LLMs that exhibit human-like reasoning, we hope our simulations will closely replicate how real users interact with voice agents.

For now, we're manually onboarding and activating each user. We're working hard to make it self-serve in the next few weeks. The demo at https://app.hamming.ai/voice-demo doesn't require any signup, though!

Our current pricing is a mix of usage and the number of seats: https://hamming.ai/pricing. We don't use customer data for training purposes or to benefit other customers, and we don't sell any data. We use PostHog to track usage. We're in the process of getting HIPAA compliance, with SOC 2 being next on the list.

Looking ahead, we're focused on making scenario generation and LLM judge creation more automated and self-serve. We also want to create personas based on real production conversations to make it easier to ‘replay’ a user on demand.

A natural next step beyond testing is optimization. We're considering building a voice agent optimizer (like DSPy) that takes the scenarios that failed during testing and generates a new set of prompts or function call definitions to make them pass. We find the potential of self-play and self-improvement here super exciting.

We'd love to hear about your experiences with voice agents, whether as a user or someone building them. If you're building in the voice or agentic space, we're curious about what is working well for you and what challenges you are encountering. We're eager to learn from your insights about setting up evals and simulation pipelines or your thoughts on where this space is heading.




AI voice agents are weird to me because voice is already a very inefficient and ambiguous medium; the only reason I would make a voice call is to talk to a human who is equipped to tackle the ambiguous edge cases that the engineers didn't already anticipate.

If you're going to develop AI voice agents to tackle pre-determined cases, why wouldn't you just develop a self-serve non-voice UI that's way more efficient? Why make your users navigate a nebulous conversation tree to fulfill a programmable task?

Personally when I realize I can only talk to a bot, I lose interest and end the call. If I wanted to do something routine, I wouldn't have called.


Think 1-800-CONTACTS not Siri. Call centers are super expensive and the user experience is usually pretty bad. There's a huge incentive to move to voice agents, but one of the challenges is building a framework to adequately test it. That seems to be what this is focused on.


If I understand correctly, it's push-back on the call center in general once AI agents are involved: why go through the trouble at that point versus another manner of fixing my issue?

For example, when I need to activate a new SIM card, I need to call the company to get it activated. But if I’m talking to an AI agent at that point, why not have me go through another channel (website/app?) to activate it?


Some people (like me) are primarily verbal processors:

- I am dictating this message through macOS's voice to text right now

- I am a huge user of Google Assistant

- I prefer to call people versus texting them

- I tend to call restaurants instead of using something like Toast to order takeout (although this is partially because online services will add a surcharge onto the price sometimes, and sometimes I need to ask questions about dietary restrictions, etc.)

Generally, wherever possible, I will use a voice interface versus a text based one to get my point across. It's just faster and more convenient for me. I'm pretty neutral on the consumption side: I read and listen to audiobooks in roughly equal amounts.

All that to say that, just like there are people out there who prefer text UIs, there are also people who prefer voice interfaces.


I use Superwhisper (no affiliation, just a happy user), which runs a local Whisper model, to create most of my email drafts and post-meeting notes. I find Whisper more accurate than Mac’s built-in speech-to-text, plus I’m faster at speaking than typing.

Sometimes, I even ‘talk’ into Cursor’s chat window instead of typing. The only downside? It can get a bit annoying for others when you're talking to yourself all day.


I'm looking for something like this that runs on Linux. Best thing I've found is LiveCaptions, but its output is janky. I can't just use it to type in any old text field, and its output requires substantial editing after the fact.

I guess I understand that a lot of things are being developed for Apple silicon specifically. It's just frustrating that despite hours of searching, I'm not finding anything decent.


Talon Voice is good and runs on Linux.

https://talonvoice.com/


This looks really powerful for controlling the system with different scripts, but what if all I want it to do is let me narrate something and print out the sentences as close to real-time as possible? It's really just good STT that I'm looking for out of it.


The Talon voice dev created his own STT model that's very performant. The transcription quality is... good, but not world-class. It's better than anything that came out before Whisper IMO, but the newest generation of models can do things like inferring punctuation and words outside of their vocabulary (although the downside of the new generation of VTT is that they can sometimes hallucinate words that are very different from what you said).

It's a bit overkill to use Talon for just voice dictation, but that is 90% of what I use it for, and it's pretty good at it.


Interesting! I'll give Superwhisper a try.


The example of "fast food drive-thru" really cleared this up for me.

Frankly I'm surprised there isn't already some sort of NFC info transfer system in fast food restaurants' apps that lets you and everyone in your car enter your order while you're waiting in line, then knows when your car is up and brings you the food. Have the voice part be a fallback tier, not the primary one.

My grocery store can know when I'm arriving and bring out my food, based on location services on my phone. So can Walmart or Home Depot. Granted, they make me wait a couple hours until they notify me that my order is "ready" before I come get it.

I suppose it's possible this does exist and I just haven't seen it because I don't drive through fast food restaurants, but I don't get why a place that primarily takes orders in real time and hands them out the window can't broaden the way to submit them to include on-site online orders as well as "talk to our agent over a glorified walkie talkie" orders.


Sort of related, but I just came off a RyanAir flight of all things, and they have something similar. Instead of talking to a stewardess to get a sandwich, I order it on the app and they bring it to me.

It worked quite well, which is surprising coming from RyanAir.


It's been a dream of mine for some years now. They're pretty dumb still today.


I'm the same way, and I don't have any data on this, but it's possible that we're in the minority. This probably isn't the case, but hopefully anyone implementing such a system has thought through whether it will actually provide any value.

For example, if you had an existing IVR system and you tracked menu options and found that a significant portion of calls were able to be answered by non-smart pre-recorded messages, upgrading to an AI voice agent could be a reasonable improvement.


Our customers, who build voice agents, are often asked by their customers to make their voice agents more human-like and flexible. Their clients — businesses like pest control and automotive repairs — value providing a personalized experience but want the convenience and reliability of a 24/7 booking and answering service.


Can I upgrade to a web form instead?


Why “Hamming”? As in Richard Hamming, ex-Bell Labs, “You and Your Research”?


Yup, we named it after Richard Hamming. His essay 'You and Your Research' was deeply influential during my undergrad; I re-read it every quarter.

Our current product draws inspiration from Hamming distance because we're comparing the `distance` between the current LLM output and the desired LLM output.


My 2.5 year old yesterday started saying "Hey, this is a test, can you hear me?", parroting me spending hours testing my LLM. Hah.

This will work with a https://www.pipecat.ai type system? Would love to wrap a continuous testing system with my bot.


Pipecat looks awesome! I'll run the examples over the weekend and try to see what the integration hooks need to look like: https://github.com/pipecat-ai/pipecat/tree/main/examples

It should be pretty straightforward at first glance!


Yea, it's interesting: it's just a chatbot over a Zoom-style call (we use Daily) as opposed to a 1-on-1 websocket (or a phone call). The other advantage is using WebRTC!


As someone whose job has been negatively impacted by LLMs already, I'll echo the sentiment here that use cases like this one are sort of depressing, as they will primarily impact people who work long hours for small pay. It certainly seems like there's money to be made in this, so congratulations. The landing page is clear and inviting as well. I think I understand what my workflow inside it would be like based on your text and images.

I'm most excited to see well-done concepts in this space, though, as I hope it means we're fast-forwarding past this era to one in which we use AI to do new things for people and not just do old things more cheaply. There's undeniably value in the latter but I can't shake the feeling that the short-term effects are really going to sting for some low-income people who can only hope that the next wave of innovations will benefit them too.


What line of work was it?


I'm a Top Rated/Pro-verified ghostwriter on Fiverr. It's been my full-time job since 2015. Went from mid-six figures in 2022 to scraping by today.


Congratulations on the launch! We had a big QC need for https://kea.ai/ where we needed to stress test our CX agents in real time too. This would be a big life saver. Kudos on the product and the brilliant demo!


I am curious - how was the team solving this at Kea?


Nice! Great to see the UI looks clean enough that it's accessible to non-engineers. The prompt management and active monitoring combo looks especially useful. Been looking for something with this combo for an expense app we're building.


Yes! We're aiming to build a tool that both engineers and non-engineers love.

We've discovered that it's often faster for non-technical domain experts to iterate on prompts in a structured, eval-driven way, rather than relying on engineers to translate business requirements into prompts.

While storing prompts in code offers version control benefits, it can hinder collaboration. On the other hand, using a pure CMS for prompts enhances collaboration but sacrifices some modern software development practices.

We're working towards a solution that bridges this gap, combining the best of both approaches. We're not there yet, but we have a clear roadmap to achieve this vision!


I feel like the better positioning would be evals for voice agents. It seems just as challenging to figure out all the ways your system can go wrong as it is to build the system in the first place. Doing this in a way that actually adds value without any domain expertise seems impossible.

If it did, wouldn't all the companies with production AI text interfaces be using similar techniques? Now being able to easily replay a conversation that was recorded with a real user seems like a huge value add.


Absolutely agree that creating effective evals requires domain expertise. Right now, we're co-building evals with customers, but we're identifying which aspects can be productized.

Regarding text-based evals — part of testing voice agents involves assessing their core reasoning logic. To do that, we bypass the voice layer and simulate conversations via text. So yes, the core simulation engine is reusable for both conversational text and voice interactions.
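
Roughly, the text-only loop looks like the sketch below. This is a simplification: the agent endpoint, persona prompt, and model name are placeholders, not a real integration.

    # Simplified text-level simulation loop (placeholder endpoint and names).
    import requests
    from openai import OpenAI

    client = OpenAI()
    PERSONA = "You are a distracted drive-through customer with a peanut allergy."

    def flip(history):
        # The persona model sees the conversation from the caller's side,
        # so swap roles before prompting it.
        swap = {"user": "assistant", "assistant": "user"}
        return [{"role": swap[m["role"]], "content": m["content"]} for m in history]

    def simulate(agent_url: str, max_turns: int = 10) -> list[dict]:
        history = []  # "user" = simulated caller, "assistant" = agent under test
        for _ in range(max_turns):
            caller_turn = client.chat.completions.create(
                model="gpt-4o",  # placeholder: any strong role-play model
                messages=[{"role": "system", "content": PERSONA}, *flip(history)],
            ).choices[0].message.content
            history.append({"role": "user", "content": caller_turn})

            # The agent under test replies via its text / API hook.
            reply = requests.post(agent_url, json={"messages": history}, timeout=30)
            history.append({"role": "assistant", "content": reply.json()["reply"]})
        return history  # hand this off to the deterministic checks and LLM judge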

We're also excited about shipping the ability to replay a simulated conversation inspired by a real user!


The idea of testing an agent with annoying situations, like uncooperative people or vague responses, makes me wonder if, in the future, similar approaches might be tried on humans. People could be (unknowingly) subjected to automated "social benchmarks" with artificially designed situations, which I'm sure I don't have to explain how dystopian that is.

It would essentially be another form of a behavioral interview. I wonder if this exists already, in some form?


I wonder if a more optimistic version of this could be used to train humans and improve their skills. I'm thinking along the lines of LeetCode / Project Euler, but more dynamic and personalized!

Few examples:

1) Customer service: Simulating challenging customer interactions could help reps develop patience and problem-solving skills.

2) Emergency responders: Creating realistic crisis scenarios (like 911 calls) that could improve decision-making under pressure.

3) Healthcare: Virtual patients with complex symptoms could speed up the learning rate for med students.

4) Conflict resolution: Practicing with difficult personalities could aid mediators and negotiators.

5) Sales: AI-simulated tough customers could help salespeople refine their pitches and objection-handling skills in a low-stakes environment.

Thoughts?


That does sound like an interesting idea. Upon further thought, I think that it would heavily depend on implementation.

In a bad case, I envision a ton of companies or institutions employing very strict & narrow situations to the point where they only accept a very homogenized personality. It could end up creating a stiff or worse culture than if they had naturally accumulated a diverse population, if that makes sense. Discrimination already exists, but would be made a lot easier, automated, and commonplace.

In a good case, extremely antisocial behavior (situations that are "softballs" or "hard to screw up for reasonable people") could be easily caught at scale and addressed at an early age. Plus the cases you've listed, eliminating the need for special attention and mentorship from the limited people we meet irl.

I'm sure there are other horrible or amazing cases I'm missing.

So as all tools are, it would depend. Whether this will actually benefit more than harm will depend on the society you place it in, and I'm not sure I have that much faith in the corporate world.


There's already a startup for the last use case. I forgot the name though.


I work in the telecom space. I don't think this paradigm will get adopted in the near future. Customers are already building voice bots on top of Google Dialogflow e.g. Cognigy. Cognigy does have LLM capabilities, but it is not widely adopted. I think voice bots will still have to be manually configured for some time.


I'm curious to learn more about what's blocking the widespread adoption of the LLM capabilities. Lack of knowledge, reliability, or something else?


The reality is - businesses most of the time already know the questions their customers will ask a voice bot (based off of feedback from call center agents). E.g. - What is the status of my order? How much do I have left on my balance? Can I please pay my balance off?

99% of the time, we can just build a simple intent flow off of Dialogflow pointing to the customer's API endpoints that will return that data. Nowhere here do we need an LLM / RAG since their endpoint already points to that answer. Hope that makes sense!
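
To put it another way, the fulfillment behind each intent is basically a lookup plus an API call; conceptually something like this (illustrative only, not actual Dialogflow fulfillment code, and the endpoints are made up):

    # Each intent just routes to the customer's existing API endpoint.
    import requests

    INTENT_TO_ENDPOINT = {
        "order_status":  "https://api.example.com/orders/{customer_id}/status",
        "check_balance": "https://api.example.com/accounts/{customer_id}/balance",
        "pay_balance":   "https://api.example.com/accounts/{customer_id}/pay",
    }

    def fulfill(intent: str, customer_id: str) -> str:
        url = INTENT_TO_ENDPOINT[intent].format(customer_id=customer_id)
        return requests.get(url, timeout=5).json()["message"]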


You guys should try and get acquired by Cognigy.


No plans for acquisition :)

Building product, talking to customers and making something people want!


Is there an open source variant available? I am building https://github.com/bolna-ai/bolna, which is an open source voice orchestration framework.

would love to have something like this integrated as part of our open source stack.


Bolna looks awesome! We've considered going open-source, but we're not sure how to effectively manage a community.

I'll reach out async!


Sure! Would love to discuss synergies and if we can integrate it. Thanks & all the best!


That drive-through customer… oh my. I have newfound empathy for drive-through operators.


Yes! Drive-through customers can be very impatient. We tried to make the demo persona maximally annoying.

Testing for edge cases is especially important because getting an order wrong can cause health hazards, long line-ups, and churn!


As someone who has worked in TTS for over 4 years now, I can tell you that evaluation is the most difficult aspect of generative audio ML.

How will this really check that the models are performing well vs just listening?


We're focused on end-to-end evals covering function-call accuracy, style, tone & latency of the conversations between our sims and your voice agent. We're less focused on pure TTS evals at the moment!
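
For the deterministic side, think of checks along these lines (an illustrative sketch, not our exact implementation):

    # e.g. an order-sensitive exact-match check on the agent's function calls
    def check_function_calls(expected: list[dict], actual: list[dict]) -> bool:
        return [(c["name"], c["arguments"]) for c in expected] == \
               [(c["name"], c["arguments"]) for c in actual]

    # e.g. a simple per-turn latency budget (milliseconds)
    def check_latency(turn_latencies_ms: list[float], budget_ms: float = 1500) -> bool:
        return max(turn_latencies_ms) <= budget_ms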


This is great to see. Evals on voice are hard - we only have evals on text-based prompting, and they don't fully capture everything. Excited to give this a try.


This tracks. Text evals to test core logic and voice evals for overall end-to-end performance!


I'm working on AI voice agents here in the UK for real estate professionals, unfortunately I couldn't try your service.


We forgot to enable non-US numbers in our config for the demo. (oops)

We're working on a fix right now!


What’s the name of your product / business?


Should be fixed now; could you try again please?


Awesome work guys! Which industries / jobs do you suspect will be adopting voice agents the fastest?


Likely outsourced call centers since call complexity is low to medium. We also expect rapid adoption in industries like customer service, healthcare, and retail, where 24/7 availability could be high-impact for businesses and convenient for consumers!


There is not even one reliable and proven "voice agent" yet (correct me if I'm wrong, but the best available, ElevenLabs, isn't good enough yet to count as one), yet there are already companies selling testing for voice agents?

Selling shovels in a gold rush seems to have become the only mantra here.


It's a bit of a catch-22.

Making current voice agents reliable is incredibly time-consuming and complex. This challenge has kept many teams from pushing their agents into production. Those who do launch often release a very limited, basic version to minimize risk. We frequently talk to teams in both camps.

As a result, there aren't many 'killer' voice products on the market right now. But as models improve, we'll see more voice-centric companies emerge.

Teams are already calling their agents by hand and keeping track of experiment runs in a spreadsheet. We're just automating the workflow and making it easier to run experiments!


As a test, I asked GPT to call my phone company and get my account balance. It worked and even declined some program they tried to sign me up for. Blew my mind.


What were the steps to get it to make a call?


I just set my phone next to another phone and put them both on speaker. It didn't actually dial a number, but I'm sure it could if you used the API and gave it a "tool".
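
Something like this, roughly, if you went the API route. This is just the tool schema in OpenAI-style function-calling format; actually placing the call would still need a telephony provider wired up behind it.

    # Hypothetical "dial" tool definition (schema only, no real dialing).
    tools = [{
        "type": "function",
        "function": {
            "name": "dial_number",
            "description": "Place an outbound phone call and stream audio both ways.",
            "parameters": {
                "type": "object",
                "properties": {"phone_number": {"type": "string"}},
                "required": ["phone_number"],
            },
        },
    }]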


Wow, gonna test this with my Retell AI agent.


Nice! What's the use case your agent solves for?

I'm happy to spin up some scenarios that are more relevant for you instead of our stock demo personas :)

Feel free to email me at sumanyu@hamming.ai


[flagged]


If only handloom operators hadn’t been replaced by steam looms we’d all be in a better place.


I appreciate your candid feedback. Our aim isn't to push for replacing humans but to ensure that when companies do use LLMs, they work as intended and don't create more problems. We'd rather see well-functioning systems than glitchy ones that frustrate everyone involved!


Well, the idea behind this product is to make sure the LLMs they replace/augment workers with aren't total dogshit (at least relative to the workers they're replacing).

But also remember -- the point of the economy is not jobs. It's value creation. If we can create the same or greater value with fewer people working/people working less, that's a great result! And it's the result we've seen continue over the last century and a half, even while people became much wealthier (because they were choosing to exchange some of that wealth for less working time).


I'm speaking as a potentially ignorant layman here (in the US), but if you need a job for income, doesn't this take away a lot of avenues for entry-level career progression and create way more competition for the remaining roles?

I don't think reducing the need for human labor is inherently bad, but our current society seems to be heavily centered around finding work.


At least at the moment, the jobs we're talking about (e.g. drive-through order-taker) aren't meaningful entry into a career path — they're jobs people take before they have a career path, or on a largely short-term basis. There are exceptions, but there are still going to be plenty of restaurant jobs for a while. We're still in a shortage of labor on that end of the job market.

But also, the number of remaining roles isn't fixed. Jobs exist (at least in the private sector) because they create more value than they cost to fill, and we're always finding new and expanded ways for people to create more value. Saving resources through automation just means we can redirect that value creation somewhere else.

Ultimately this is how economies grow and the world becomes wealthier over time; we increase the value of people's time because there's so much competition for it, to the point where we can then more cheaply automate some or part of the job. If the supply of labor gets too large for the uses we can find for it, prices for labor fall, and the relative cost of automation is increased.



