Show HN: Open-source turn detection model for voice AI
8 points by russ 56 days ago | 1 comment
Hey HN, it’s Russ, cofounder of LiveKit, an open-source stack for building realtime AI applications.

We’re sharing our first homegrown AI model for turn detection. Here’s a live demo: https://cerebras.vercel.app/

Voice AI has come a long way in the last year. We now have end-to-end systems that can generate a response to user input in 300-500ms, which is human-level speed!

As latency drops, a common problem surfaces: the LLM responds too quickly. Any time there’s a short pause in the user’s speech, the system ends up interrupting them. This is largely due to how voice AI applications perform “turn detection”, that is, figuring out when the user has finished speaking so the model can run inference and respond.

Pretty much everyone uses a signal processing technique called voice activity detection (VAD). In a nutshell, it figures out when the audio signal switches from speech to silence, then triggers an end of turn once a configurable amount of silence has elapsed.
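To make that concrete, here’s a toy sketch of the VAD approach. The energy-based speech check and the threshold values are placeholders for illustration; production systems use a trained VAD like Silero or WebRTC VAD instead:

    import numpy as np

    SILENCE_RMS = 0.01      # energy below this counts as "silence" (toy value)
    END_OF_TURN_MS = 700    # silence duration that triggers end of turn
    FRAME_MS = 20           # duration of each audio frame we inspect

    def is_speech(frame: np.ndarray) -> bool:
        # Crude energy-based speech/silence decision for one frame of audio.
        return float(np.sqrt(np.mean(frame ** 2))) > SILENCE_RMS

    def detect_end_of_turn(frames: list[np.ndarray]) -> bool:
        # Declare end of turn once trailing silence exceeds END_OF_TURN_MS.
        silence_ms = 0
        for frame in frames:
            silence_ms = 0 if is_speech(frame) else silence_ms + FRAME_MS
            if silence_ms >= END_OF_TURN_MS:
                return True
        return False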

One obvious delta between VAD and how humans do turn detection is that humans also consider the content of speech (i.e. what someone says). These past few months, we’ve been working on an open-weights, content-aware turn detection model for voice AI applications. It was fine-tuned from SmolLM v2 on text, runs on CPU (inference currently takes 50ms), and uses speech transcriptions as input to predict when a user has completed a thought (also called an “utterance”). Since it was trained on text, it’s notably well suited to pipeline-based architectures (i.e. STT ⇒ LLM ⇒ TTS).
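For intuition, here’s a rough sketch of what text-based end-of-utterance scoring can look like. The checkpoint ID and the end-of-turn token below are stand-ins (this isn’t our model’s actual prompt format or weights); it just shows the general pattern of asking a small LM whether the transcript reads as a complete thought:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_ID = "HuggingFaceTB/SmolLM2-135M"   # stand-in base checkpoint
    END_OF_UTTERANCE = "<|im_end|>"           # hypothetical end-of-turn marker

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    model.eval()

    def end_of_utterance_probability(transcript: str) -> float:
        # Probability that the very next token would end the user's turn.
        inputs = tokenizer(transcript, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits[0, -1]   # next-token logits
        probs = torch.softmax(logits, dim=-1)
        eou_id = tokenizer.convert_tokens_to_ids(END_OF_UTTERANCE)
        return probs[eou_id].item()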

We use this model together with VAD to make better predictions about whether a user is done speaking. Here are some demos:

- Podcast interview: https://youtu.be/EYDrSSEP0h0

- Ordering food: https://youtu.be/fcr8Y-3c4E0

- Providing shipping address: https://youtu.be/2pQWvd6xozw

- Customer support: https://youtu.be/YoSRg3ORKtQ

In our testing we’ve found:

- 85% reduction in unintentional interruptions

- 3% false positives (where the user is done speaking, but the model thinks they aren’t)

In practice, we still have work to do. We currently delay inference only when the model predicts less than a 15% chance that the user is done speaking, so this threshold misses a lot of middle-of-the-pack probabilities.
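Roughly, the combined decision looks like this. It’s an assumed sketch that reuses the end_of_utterance_probability helper from the earlier snippet, not the plugin’s exact implementation, and the silence durations are placeholder values:

    EOU_THRESHOLD = 0.15      # below this, assume the user is mid-thought
    SHORT_SILENCE_MS = 300    # quick pause before consulting the text model
    LONG_SILENCE_MS = 2000    # fallback: respond anyway after a long pause

    def should_respond(silence_ms: int, transcript: str) -> bool:
        if silence_ms < SHORT_SILENCE_MS:
            return False      # user is clearly still speaking
        if silence_ms >= LONG_SILENCE_MS:
            return True       # don't hang forever on a long pause
        # Middle ground: let the content-aware model break the tie.
        return end_of_utterance_probability(transcript) >= EOU_THRESHOLD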

Next steps are improving model accuracy, tuning performance, and expanding to support more languages (it only supports English right now). Separately, we’re starting to explore an audio-based model that considers not just what someone says but how they say it, which can be used with natively multimodal models like GPT-4o that directly process and generate audio.

Code here: https://github.com/livekit/agents/tree/main/livekit-plugins/...

Let us know what you think!




I've been actively exploring ways to achieve a seamless and natural conversation with an AI. I played with detecting silences and punctuation (STT can detect e.g. question marks), but this is clearly not enough for turn detection. I think you've made a huge step in that direction. Did you write an article or blog post about how you trained your model? I'd love to make this work with multiple languages, or with things like rhetorical question detection.
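The punctuation heuristic I tried was essentially this simple, which is why it falls short on its own (a toy sketch):

    def punctuation_suggests_end_of_turn(transcript: str) -> bool:
        # Treat terminal punctuation from the STT output as a weak end-of-turn
        # signal; a trailing clause with no punctuation tells you nothing.
        return transcript.rstrip().endswith((".", "!", "?"))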



