Show HN: Voice bots with 500ms response times (cerebrium.ai)
309 points by kwindla 5 days ago | 96 comments
Last year when GPT-4 was released I started making lots of little voice + LLM experiments. Voice interfaces are fun; there are several interesting new problem spaces to explore.

I'm convinced that voice is going to be a bigger and bigger part of how we all interact with generative AI. But one thing that's hard, today, is building voice bots that respond as quickly as humans do in conversation. A 500ms voice-to-voice response time is just barely possible with today's AI models.

You can get down to 500ms if you: host transcription, LLM inference, and voice generation all together in one place; are careful about how you route and pipeline all the data; and the gods of both wifi and vram caching smile on you.

Here's a demo of a 500ms-capable voice bot, plus a container you can deploy to run it yourself on an A10/A100/H100 if you want to:

https://fastvoiceagent.cerebrium.ai/

We've been collecting lots of metrics. Here are typical numbers (in milliseconds) for all the easily measurable parts of the voice-to-voice response cycle.

  macOS mic input                 40
  opus encoding                   30
  network stack and transit       10
  packet handling                  2
  jitter buffer                   40
  opus decoding                   30
  transcription and endpointing  200
  llm ttfb                       100
  sentence aggregation          100
  tts ttfb                        80
  opus encoding                   30
  packet handling                  2
  network stack and transit       10
  jitter buffer                   40
  opus decoding                   30
  macOS speaker output           15
  ----------------------------------
  total ms                       759
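
One detail worth calling out from that table: "sentence aggregation" means buffering the LLM's streamed tokens until a sentence boundary and flushing each complete sentence to TTS right away, rather than waiting for the whole response. A toy sketch of that idea (not the demo's actual code; the LLM and TTS are stubbed out):

  import asyncio
  import re

  SENTENCE_END = re.compile(r"[.!?]\s*$")

  async def llm_tokens(prompt):
      # Stand-in for a streaming LLM client: yields tokens with a small delay.
      for token in ("Sure! ", "The ", "capital ", "of ", "France ",
                    "is ", "Paris. ", "Anything ", "else?"):
          await asyncio.sleep(0.02)
          yield token

  async def speak(sentence):
      # Stand-in for a streaming TTS client.
      print("TTS <-", repr(sentence))

  async def respond(prompt):
      buffer = ""
      async for token in llm_tokens(prompt):
          buffer += token
          # Flush as soon as a sentence is complete so TTS can start speaking
          # while the LLM is still generating the rest of the answer.
          if SENTENCE_END.search(buffer):
              await speak(buffer.strip())
              buffer = ""
      if buffer.strip():
          await speak(buffer.strip())

  asyncio.run(respond("What is the capital of France?"))
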
Everything in AI is changing all the time. LLMs with native audio input and output capabilities will likely make it easier to build fast-responding voice bots soon. But for the moment, I think this is the fastest possible approach/tech stack.





Well that was fast. Kudos, really neat. Speed trumps everything else. I only noticed the robotic voice after I read the comments.

I worked on an AI for customer service. Our agent took the average response time from 24-48 hours down to mere seconds.

One of the messages that went to a customer was "Hello Bitch, your package will be picked up by USPS today, here is the tracking number..."

The customer responded "thank you so much" and gave us a perfect score in CSAT rating. Speed trumps everything, even when you make such a horrible mistake.


"The customer responded "thank you so much" and gave us a perfect score in CSAT rating. Speed trumps everything, even when you make such a horrible mistake."

I think not everyone would react the same way. For some, calling each other bitch is normal talk (which is likely why it got into the training data in the first place). For others, not so much.


If I'm used to waiting 2 days, and you get it down to 30 seconds, you can call me whatever you want.

I'm more pissed if I'm waiting days for a response.


Me too. But I learned that not everyone is like me. And in general, I also would not trust an LLM that cannot tell formal talk from ghetto slang. It will likely get other things wrong as well; humans will too, so the error bar needs to be lower for me as a customer to be happy. I am not happy to get a fast but wrong response and then fight for days to get an actual human to clean up the mess.

I've grown up in various neighborhoods. In no context would calling someone a slur like that when you don't even know them be acceptable.

That said, it's obviously a technical glitch. Let's say it was something really important like medication: would you rather wait two or three days to find out when it gets here, or would you rather have a glitchy AI say some gibberish but then add that it's coming tomorrow?


It's also possible that it's such an unlikely thing to hear that she actually misheard it and thought it said something nicer.

Am I the only one who would be delighted to be called Bitch (or any of the worst male-specific terms) by random professionals?

"Hey fucker, your prescription has been ready for pickup for three days. Be sure to get your lazy ass over here or else you’ll need to reorder it. Love you bye"


This is something I've been wanting ever since maps/driving apps came. I'd love to have Waze/GoogleMaps be angry when you miss an exit or miss the initial ETA by too much.

However, I don't think it fits the culture too well in the companies that could do it as trying hard not to offend anybody is of utmost importance.


I would love this so much.

Fun fact: we fixed this issue by adding a #profanity tag and handing the message off to the next human agent.

Now our most prolific sales engineer could no longer run demos for potential clients. He had many embarrassing calls where the AI would just not respond. His last name was Dick.


I find it odd that your engineer would make the system rely on instructions (“Do this. Never do that.”). This exposes your system to inconsistencies from the instruct tuning and future changes thereof by OpenAI or whoever. System prompts and instructions are maybe great for demos. But for a prod system where you have to cover all the bases I would never rely on such a thin layer of control.

(You can imagine the instruct layer to be like the skin on a peach. It’s tiny in influence compared to what’s inside. Even more so than, in humans, the cortex vs. the mammalian brain. Whoever tried to tell their kids not to touch the cookies while putting them in front of them and then leaving the room knows that relying on high level instructions is a bad idea.)


I wonder if the solution is to run the message through another LLM to make it as polite as possible, removing any profanities. It would cost >2x as much to run, though.
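
Something along these lines, maybe (sketch only; this assumes the OpenAI Python client, and the model name and prompt are placeholders):

  from openai import OpenAI

  client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

  SANITIZE_PROMPT = (
      "Rewrite the following customer-service message so it is polite and "
      "professional. Remove any profanity or insults. Keep all facts "
      "(tracking numbers, dates, names) exactly as they are. Reply with "
      "only the rewritten message."
  )

  def sanitize(message: str) -> str:
      # Second, cheaper LLM pass purely for tone; model name is a placeholder.
      resp = client.chat.completions.create(
          model="gpt-4o-mini",
          temperature=0,
          messages=[
              {"role": "system", "content": SANITIZE_PROMPT},
              {"role": "user", "content": message},
          ],
      )
      return resp.choices[0].message.content.strip()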

Maybe that was their first name, at least the one they put in lol

This is so, so good. I like that it seems to be a teaser app for Cerebrium, if I understand it. It has good killer app potential. My tests from iPad ranged from 1400ms to 400ms reported latency; at the low end, it felt very fluid.

One thing this speed makes me think is that for some chat workflows you’ll need/get to have kind of a multi-step approach — essentially, a quick response, during which time a longer data / info / RAG query can be farmed out, then the informative result picks up.

Humans work like this; we use lots of filler words as we sort of get going responding to things.

Right now, most workflows seem to be just one-shot prompting, or in the background, parse -> query -> generate. The better workflow once you have low-latency responses is probably something like: [3s of Llama 8B in your ears] -> query -> [55s of Llama 70B/GPT-4/whatever you want, informed by query].
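
A rough sketch of what I mean (purely illustrative; small_llm, big_llm and run_query are hypothetical stand-ins for the fast model, the big model, and the slow RAG/database step):

  import asyncio

  async def small_llm(prompt):
      # Fast 8B-class model: produces the filler acknowledgement almost instantly.
      return "Good question - give me a second while I pull that up."

  async def big_llm(prompt, context):
      # Slow 70B/GPT-4-class model, grounded in the query result.
      return "Here's what I found: " + context

  async def run_query(prompt):
      # Stand-in for the RAG / database / API call that takes a while.
      await asyncio.sleep(2.0)
      return "three matching orders, the latest shipped yesterday"

  async def respond(prompt, speak):
      query_task = asyncio.create_task(run_query(prompt))  # kick off the slow work immediately
      await speak(await small_llm(prompt))                  # bridge the gap right away
      context = await query_task
      await speak(await big_llm(prompt, context))           # the real, informed answer

  async def main():
      async def speak(text):
          print("TTS <-", text)
      await respond("Where are my orders?", speak)

  asyncio.run(main())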

Very cool, thank you for sharing this.


Hi Vessenes

From Cerebrium here. Really appreciate the feedback - glad you had a good experience!

This application is easy to extend/implement, meaning you can edit it however you like:

- Swap in different LLMs, STT and TTS models
- Change prompts as well as implement RAG, etc.

In partnership with Daily, we really wanted to focus on the engineer here: make it extremely flexible for them to edit the application to suit their use case/preference, while at the same time taking away the mundane infrastructure setup.

You can read more about how to extend it here: https://docs.cerebrium.ai/v4/examples/realtime-voice-agents


Thanks for this reply. Yep, as an engineer, this is awesome, the docs look simple and I’ll give it a whirl. As a product guy, it seems like it would be dead simple to start a company on this tech by just putting up a web page that lets people pick a couple choices and gives them a custom domain. Very cool!

I've wondered about this as well. Is there a way to have a small, efficient LLM model that can estimate general task complexity without actually running the full task workload?

Scoring complexity on a gradient would let you know you need to send a "Sure, one second let me look that up for you." instead of waiting for a long round trip.
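
Something like this is what I have in mind (sketch; assumes the OpenAI Python client, and the prompt and model name are placeholders - ideally the classifier would be a much smaller/faster model than the one answering):

  from openai import OpenAI

  client = OpenAI()

  ROUTER_PROMPT = (
      "You route requests for a voice assistant. Answer with a single word: "
      "'simple' if the request can be answered from general knowledge in a "
      "sentence or two, or 'complex' if it needs a lookup, a calculation, or "
      "a long answer."
  )

  def needs_filler(user_text: str) -> bool:
      # One cheap classification call; if it says "complex", play the
      # "Sure, one second, let me look that up" line while the real work runs.
      resp = client.chat.completions.create(
          model="gpt-4o-mini",
          temperature=0,
          max_tokens=5,
          messages=[
              {"role": "system", "content": ROUTER_PROMPT},
              {"role": "user", "content": user_text},
          ],
      )
      return resp.choices[0].message.content.strip().lower().startswith("complex")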


For sure: in fact MoE models train such a router directly, and the routers are not super large. But it would also be easy to run phi-3 against a request.

I almost think you could do like a check my work style response: ‘I’m pretty sure xx, .. wait, actually y.’ Or if you were right, ‘yep that’s correct. I just checked.’

There’s time in there to do the check and to get the large model to bridge the first sentence with the final response.


A cross-platform browser VAD module is: https://github.com/ricky0123/vad. This is an ONNX port of Silero's VAD network. By cross-platform, I mean it works in Firefox too. It doesn't need a WebRTC session to work, just microphone access, so it's simpler. I'm curious about the browser providing this as a native option too.

There are browser text-to-speech engines too, starting to get faster and higher quality. It would be great if browsers shipped with great TTS.

GPT-4o has Automatic Speech Recognition, `understanding`, and speech response generation in a single model for low latency, which seems quite a good idea to me. As they've not shipped it yet, I assume they have scaling or quality issues of some kind.

I assume people are working on similar open integrated multimodal large language models that have audio input and output (visual input too)!

I do wonder how needed or optimal a single combined model is for latency and cost optimisation.

The breakdown provided is interesting.

I think having a lot more of the model on-device is a good idea if possible, like speech generation, and possibly speech transcription or speech understanding, at least right at the start. Who wants to wait for STUN?


>> I'm curious about the browser providing this as a native option too.

IMHO the desktop environment should provide voice to text as a service with a standard interface to applications - like stdin or similar but distinct for voice. Apps would ignore it by default since they aren't listening, but the transcriber could be swapped out and would be available to all apps.


Lol, this announcement blows what I've been working on out of the water, but I have a simple assistant implementation with ricky0123/vad + WebSockets.

https://github.com/charlesyu108/voiceai-js-starter


If you do stt and tts on the device but everything else remains the same, according to these numbers that saves you 120ms. The remaining 639ms is hardware and network latency, and shuffling data into and out of the LLM. That's still slower than you want.

Logically where you need to be is thinking in phonemes: you want the output of the LLM to have caught up with the last phoneme quickly enough that it can respond "instantly" when the endpoint is detected, and that means the whole chain needs to have 200ms latency end-to-end, or thereabouts. I suspect the only way to get anywhere close to that is with a different architecture, which would work somewhat more like human speech processing, in that it's front-running the audio stream by basing its output on phonemes predicted before they arrive, and only using the actual received audio as a lightweight confirmation signal to decide whether to flush the current output buffer or to reprocess. You can get part-way there with speculative decoding, but I don't think you can do it with a mixed audio/text pipeline. Much better never to have to convert from audio to text and back again.


This was fun to try out. Earlier this week I tried june-va and the long response time kind of killed the usefulness. It's a great feature to get fast responses, this feels much more like a conversation. Funny enough, I asked it to tell me a story and then it only answered with one sentence at a time, requiring me to say "yes", "aha", "please continue" to get the next line. Then we had the following funny conversation:

"Oh I think I figured out your secret!"

"Please tell me"

"You achieve the short response times by keeping a short context"

"You're absolutely right"


That works for me, to be honest. Not the short context, but definitely the short replies. Contrast that with the current implementation of ChatGPT's voice mode, where you ask something and then get a minute's worth of GPT bla bla.

Your marketing says 500 but your math says 759.

That's called marketing

My tests had one outlier at 1400ms, and ten or so between 400-500ms. I think the marketing numbers were fair.

500 is the transcription/LLM/TTS steps (i.e. the response time from data arriving on the server to sending it back); the rest seems to be various non-AI "overheads" such as encoding, network transit, etc.

The latencies in the table are based on heuristics or averages that we’ve observed. However, in reality, based on the conversation, some of the larger latency components can be much lower.

Very, very impressive! It's incredibly fast, maybe too fast, but I think that's the point. What's most impressive though is how the VAD and interruptions are tuned. That was, by far, the most natural sounding conversation I've had with an agent. Really excited to try this out once it's available.

I too am excited about voice inferencing. I wrote my own WebSocket faster-whisper implementation before OpenAI's GPT-4o release. They steamrolled my interview coach (https://intervu.trueforma.ai) and sales pitch coach (https://sales.trueforma.ai) implementations. I defaulted to a push-to-talk implementation as I couldn't get VAD to work reliably. I run it all on a LattePanda :) I was looking to implement Groq's hosted Whisper. I love the idea of having an uncensored Llama 3 on Groq as the LLM, as I'm tired of the boring corporate conversations. I hope to reduce my latency and learn from your examples - kudos for your efforts. I wish I could try the demo - it seems to be oversubscribed, as I can't get in to talk to the bot. I'm sure my LattePanda would melt if just 3 people tried to inference at the same time :)

Personally, I use https://github.com/foges/whisper-dictation with llama-70b on Groq. I start talking, navigate to the website, and by the time it's loaded and I've picked llama-70b, I've finished talking, so zero overhead. I read much faster than I listen, so it works perfectly for me.

I use Firefox... still.

Hi, I built the client UI for this and... yea, I really wanted to get Firefox working :(

We needed a way to measure voice-to-voice latency from the end-user's perspective, and found Silero voice activity detection (https://github.com/snakers4/silero-vad) to be the most reliable at detecting when the user has stopped speaking, so we can start the timer (and stop it again when audio is received from the bot.)

Silero runs via onnx-runtime (with wasm). Whilst it sort-of-kinda works in Firefox, the VAD seems to misfire more than it should, causing the latency numbers to be somewhat absurd. I really want to get it working though! I'm still trying.

The code for the UI VAD is here: https://github.com/pipecat-ai/web-client-ui/tree/main/src/va...
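
(For anyone who wants to reproduce the measurement outside the browser: Silero also ships a Python API, and a rough sketch of using its streaming VADIterator to start the timer when the user stops speaking could look like the following. The util names come from Silero's README and may differ between releases.)

  import time
  import torch

  # Load Silero VAD via torch.hub; utils is a tuple of helper functions.
  model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
  (get_speech_timestamps, _, read_audio, VADIterator, _) = utils

  vad = VADIterator(model, sampling_rate=16000)
  speech_ended_at = None

  def on_mic_chunk(chunk: torch.Tensor):
      # Feed fixed-size chunks (512 samples = 32ms at 16kHz for recent versions);
      # VADIterator returns {'start': ...} / {'end': ...} events, or None.
      global speech_ended_at
      event = vad(chunk, return_seconds=True)
      if event and "end" in event:
          speech_ended_at = time.monotonic()  # user stopped speaking: start timer

  def on_first_bot_audio():
      if speech_ended_at is not None:
          ms = (time.monotonic() - speech_ended_at) * 1000
          print(f"voice-to-voice latency: {ms:.0f} ms")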


Do you know why there's a difference in the performance of the algorithm in another browser? I would expect that all browsers run the code exactly the same way.

Do not go by the warning message. It does work just fine on the latest Firefox. Cool demo, btw!

I hate that everyone just develops for chromium only

This site works fine in safari/mobile safari, it is not ‘chromium only’

WebKit and its derivatives then.

I tried it with Firefox 127 (production) and it worked just fine for me even though there is a huge banner on the top.

Mozilla refuses to implement some really cool standards.

https://mozilla.github.io/standards-positions/

That, and their shitty management shakes my faith in Firefox


They're not necessarily standards. I clicked on the first negative one and it said draft.

One browser vendor proposing things and other vendors NAKing them makes it a vendor-specific feature. Like IE had its own features.


Yeah but we’re trying to fight against browser engine superiority, aren’t we?

I hate using Chrome, but I’m forced to with any application that uses WebUSB/Web Serial or Bluetooth.


But that's basically complaining that firefox doesn't just blindly adopt whatever google proposes. A lot of the concerns are about security and privacy, the thing that mozilla is praised for doing better than google.

And no, you're not forced to use google. You can make native applications when it's necessary to use privileged interfaces.


That's a fair argument, but IMO Mozilla is dogmatic to a point it'll be detrimental to them long-term.

You prefer the management of Chromium, which makes billions a year from invading your privacy and force feeding you advertising, while also ruining the internet ecosystem?

Yes, me lightly criticizing Mozilla means that I endorse Google. Fuck off

A shame. They used to be the free (as in freedom) option.

Likely a lot of people on HN use Firefox

It is working perfectly for me on Firefox (version 127).

Thanks for sharing. I did make some changes that seem to have improved things, although I do still see the occasional misfire. Perhaps good enough to remove that ugly red banner, though!

Damned impressive.

Apple's Siri still can't allow me to have a conversation in which we aren't tripping over each other and pausing and flunking and the whole thing degrades into me hoping to get the barest minimum from it.


This was scary fast. Neat interface and (almost) indistinguishable from a human over the phone / internet. Kudos @cerebrium.ai.

It's not exactly clear: is this a voice-to-voice model or a voice-to-text-to-voice model? When it is finally released, OpenAI claims their GPT-4o audio model will be a lot faster at conversations because there's no delay converting from audio to text and back to audio again. I'm also looking forward to using voice models for language learning.


It's a voice-to-text-to-voice approach, as implied by this description:

"host transcription, LLM inference, and voice generation all together in one place"

I think there are some benefits to going through text rather than using a voice-to-voice model. It creates a 100% reliable paper trail of what the model heard and said in the conversation. This can be extremely important in some applications where you need to review and validate what was said.


There is way more text training data than voice data. It also allows you to use all the benchmarks and tool integrations that have already been developed for LLMs.

I love it when engineers worth their salt actually do the back-of-the-envelope calculations for latency, etc.

Tangentially related, I remember years ago when Stadia and other cloud gaming products were being released doing such calculations and showing a buddy of mine that even in the best case scenario, you'd always have high enough input latency to make even casual multiplayer FPS games over cloud gaming services not feasible, or rather, comfortable, to play. Other slower-paced games might work, but nothing requiring serious twitch gameplay reaction times.

The same math holds up today because of a combination of fundamental limits and state of the art limits.


The calculations I was reading at the time suggested it would work for casual play, due to the gaming PC being very close to the game servers and running inside the best network available (Google's).

Google also said that the controller would send the input straight to the server.

And a fast Stadia server should have good fps, combined with a little bit of brain prediction.


I’m genuinely shocked by how conversational this is.

I think you hit a very important nail on the head here; I feel like that scene in I, Robot where the protagonist talks to the hologram, or in the movie “A.I.” where the protagonist talks to an encyclopaedia called “Dr. Know”.


A chatbot that interrupts me even faster. Sorry for the sarcasm. Maybe I'm just slow, but when I'm trying to formulate a question on the spot, I pause a lot. Having the chatbot jump in and interrupt is frustrating. Humans recognize the difference between someone still planning to say something and someone who has finished. I even tried to give it a rule where it shouldn't respond until I said "The End", and of course it couldn't follow that instruction.

Very true. I think we are a bit aggressive with the VAD timeout. The demo was intended to showcase speed, but the bot can be a bit eager! You can tinker with the VAD settings, it could definitely use a bit more air (but that will impact latency in the event the user has indeed finished talking.) As others say below, the magic will be figuring out the pace and style in which the user talks and adapting to that on the fly.
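
To make that last idea concrete, a very rough sketch (not how the demo works today) of scaling the endpointing silence threshold based on whether the partial transcript looks finished:

  def endpoint_silence_ms(partial_transcript: str,
                          base_ms: int = 300,
                          hesitant_ms: int = 900) -> int:
      # If the partial transcript looks like a finished thought, cut in quickly;
      # if it ends mid-clause or on a filler word, give the speaker more air.
      text = partial_transcript.strip().lower()
      if not text:
          return hesitant_ms
      if text.endswith((".", "?", "!")):
          return base_ms
      if text.split()[-1] in {"um", "uh", "so", "and", "but", "like", "the", "a"}:
          return hesitant_ms
      return (base_ms + hesitant_ms) // 2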

ps. The speed is impressive, but the key to a useful voice chatbot (which I've never seen) is one that adapts to your speaking style and identifies and employs turn-taking signals.

I acknowledge there are multiple viable patterns of social interaction, some talk over each other, and find that fun and engaging, while others think that's just the worst, and wait for a clear signal for their turn to speak and expect the same. I am of the latter.


I'm sure that, with an annotated dataset, a model could learn to pick up on the right cues.

This is pretty amazing; it’s very fast indeed. I don’t really care about the responding voice sounding robotic; low latency is more important for whatever I do. And you can interrupt it too. Lovely.

Wow, Kwin, you’ve outdone yourself! The speed makes an even bigger difference than I expected going in.

Feels pretty wild/cool to say it might almost be too fast (in terms of feeling natural).


Maybe silly question:

> jitter buffer [40ms]

Why do you need a jitter buffer on the listening side? The speech-to-text model has neither ears nor a sense of rhythm — couldn’t you feed in the audio frames as you receive them? I don’t see why you need to delay processing a frame by 40ms just because the next one might be 40ms late.


Almost any gap in audio is detectable and sounds really bad. 40ms is a lot, but sending 40ms of silence is probably worse

Sounds bad to whom? I’m talking about the direction from user to AI, not the direction from AI to user. If some of the audio gets delayed on the way to the AI, the AI can be paused. If some of the audio gets delayed on the way to a human, the human can’t be paused, so some buffering is needed to reduce the risk of gaps.

This is indeed fast! Also seems to be no issue interrupting it while speaking. Is this using WebRTC echo cancellation to avoid microphone and speaker audio mix ups?

Yes, echo cancellation via the browser (and maybe also at OS-level too, if you're on a Mac with Sonoma.) The accuracy of speech detection vs. noise is largely thanks to Silero, which runs on the client via WASM. I'm surprised at how well it works, even in noisy environments (and a reminder that I should experiment more with AudioWorklet stuff in the future!)

I've been developing with Deepgram for a while, and this is one of the coolest demos I've seen with it!

I am curious about total cost to run this thing, though. I assume that on top of whatever you're paying Cerebrium for GPU hosting you're also having to pay for Deepgram Enterprise in order to self-host it.

To get the latency reduction of several hundred milliseconds how much more would it be for "average" usage?


Hey! From the Cerebrium team here!

So our costs are based on the infra you use to run your application and we charge per millisecond of compute.

Some things to note that we might do differently from other providers:

1. You can specify your EXACT requirements and we charge you only for that. E.g. if you want 2 vCPUs, 12GB memory and 1 A10 GPU, we charge you for that, which is 35% less than if you rented a whole A10.

2. We have over 10 varieties of GPU chips, so you can choose the price/performance trade-off.

3. While you can extend this on the Cerebrium platform, it cannot be used commercially. We are speaking to Deepgram to see how we can offer it to customers. Hopefully I can provide more updates on this soon.


Excellent; thanks for the info.

Or we could say that latency is just good listening skills!! It was fast but occasionally interrupted me to answer.

Dumb question - I see 2 opus encodes and decodes for a total around 120ms; is opus the fastest option?

Yes, Opus is the fastest and best option for real-time audio. It was designed to be flexible and to encode/decode at fairly low latencies. It sounds good for narrow-band (speech) at low bitrates but also works well at higher bitrates for music. And forward error correction is part of the codec standard.

It's possible to tweak the Opus settings to reduce that encode/decode latency substantially. Which might actually be worth doing for this use case. But there isn't quite a free lunch, here. The default Opus frame size is 20ms. Smaller frames lower the encoding/decoding latency, but increase the bitrate. The implementation in libwebrtc is very well tested and optimized for the default 20ms frame sizes and maybe not so much at other frame sizes. Experience has made me leery of taking the less-trodden-paths without a lot of manual testing.
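
To put rough numbers on the trade-off: each Opus frame typically goes out in its own RTP packet, so smaller frames mean proportionally more header overhead (on top of the codec itself getting less efficient at small frame sizes). Back-of-the-envelope, assuming roughly 40 bytes of IPv4 + UDP + RTP headers per packet and ignoring SRTP/TURN:

  HEADER_BYTES = 40  # ~IPv4 (20) + UDP (8) + RTP (12) per packet

  for frame_ms in (60, 40, 20, 10, 5, 2.5):
      packets_per_second = 1000 / frame_ms
      overhead_kbps = packets_per_second * HEADER_BYTES * 8 / 1000
      print(f"{frame_ms:>5} ms frames: {packets_per_second:4.0f} pkt/s, "
            f"~{overhead_kbps:.0f} kbps of header overhead")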


You may be double-counting Opus encoding/decoding delay - usually you can run it with a 20ms frame, and both encoder and decoder take less than 1ms of real time for their operation - so it should be ~21ms instead of 30+30ms for one direction.

You are right! Thank you. I went back and looked at actual benchmark numbers from a couple of years ago and the numbers I got were ~26ms one-way. I rounded up to 30 to be conservative, but then double-counted in the table above. Will fix in the technical write-up. I don't think I can edit the Show HN.

This is super cool. Thanks for sharing. And I'm glad it encouraged others to share. I'm excited to spend some time this weekend looking at the different ways people in this thread implemented solutions.

That's awesome - can you say anything about what datasets this was trained on? I assume something specifically conversational?

This thing is incredible. It finished a sentence I was saying.

This is really good. I'm blown away by how important the speed is.

And this was from a mobile connection in Europe, with a shown latency of just over 1s.


Fast yes, but the voice sounds robotic.

I prefer a slightly robotic voice. This way I know I am talking to a bot, and it sets expectations.

Typical HN comment. Absolutely incredible tech is displayed that honestly, one year ago nobody could've imagined. Yet people still find something to moan about. I'm sure the authors of the project, who should be very proud, are fully aware the voice is robotic.

It's literally a robot

Voice models are getting both faster and more natural at a, well, a fast clip.

Amazing to see the metrics of each part that is involved! I've wondered why you couldn't introduce a small sound that covers the waiting time? Like an "hmm" to mask a few hundred ms of the response time? It could be pregenerated (like 500 different versions) and played 200ms after the user's last input.
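
Something like this, maybe (illustrative asyncio sketch; generate_reply and play are hypothetical stand-ins for the real pipeline):

  import asyncio
  import random

  FILLERS = ["hmm.wav", "uh-huh.wav", "let-me-see.wav"]  # pregenerated clips

  async def respond(user_turn, generate_reply, play):
      reply_task = asyncio.create_task(generate_reply(user_turn))
      try:
          # If the real reply arrives within 200ms, skip the filler entirely.
          reply = await asyncio.wait_for(asyncio.shield(reply_task), timeout=0.2)
      except asyncio.TimeoutError:
          await play(random.choice(FILLERS))  # cover the gap with an "hmm"
          reply = await reply_task
      await play(reply)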

This is very impressive; my kid and I had fun talking about space.

This is excellent!

Perfect comprehension and no problem even with bad accents.


Jesus fuck that's fast, and I had no idea speed mattered that much. Incredible. Feels like an entirely different experience than the 5+ seconds latency with openai.

I have tried it. It is really fast! I know making a real-time voice bot with this low latency is not easy. Which LLM did you use? How large an LLM does it take to make the conversation efficient?

This particular demo is using Llama 3 8B. We initially started with 70B, but it was a touch slower and needed much more VRAM. We found 8B good enough for general chit-chat like in this demo. Most real-world use cases will likely have their own fine-tuned models.

This is so cool!

It /was/ nice and quick. Thanks for putting the demo online. It was quick to tell me complete nonsense. Apparently 7122 is the atomic number of Barium.


