
Very true. I think we are a bit aggressive with the VAD timeout. The demo was intended to showcase speed, but the bot can be a bit eager! You can tinker with the VAD settings; it could definitely use a bit more air (though that will add latency in the event the user really has finished talking). As others say below, the magic will be figuring out the pace and style in which the user talks and adapting to that on the fly.
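
Roughly, the trade-off looks something like this. This is just a sketch to illustrate, not our actual code; names like stopSilenceMs and the frame size are made up rather than Pipecat's real settings, and the speech probability would come from a VAD such as Silero:

    // Illustrative only: end the user's turn after a configurable stretch of
    // trailing silence. speechProbability is assumed to come from a VAD model.
    interface EndpointConfig {
      speechThreshold: number; // probability above which a frame counts as speech
      stopSilenceMs: number;   // trailing silence ("air") required to end the turn
    }

    class TurnEndDetector {
      private silenceMs = 0;

      constructor(private config: EndpointConfig, private frameMs = 32) {}

      // Returns true on the frame where we judge the user to have finished speaking.
      push(speechProbability: number): boolean {
        if (speechProbability >= this.config.speechThreshold) {
          this.silenceMs = 0; // still talking: reset the silence counter
          return false;
        }
        const alreadyEnded = this.silenceMs >= this.config.stopSilenceMs;
        this.silenceMs += this.frameMs;
        // A larger stopSilenceMs gives the user more room to pause mid-sentence,
        // but adds that same amount to perceived latency once they really are done.
        return !alreadyEnded && this.silenceMs >= this.config.stopSilenceMs;
      }
    }

Something like new TurnEndDetector({ speechThreshold: 0.5, stopSilenceMs: 800 }) would give the user noticeably more breathing room than a 200-300 ms window, at the cost of the bot waiting that much longer before it replies.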


This particular demo is using Llama3 8B. We initially started with 70B, but it was a touch slower and needed much more VRAM. We found 8B good enough for general chit-chat like in this demo. Most real-world use cases will likely have their own fine-tuned models.


Thanks for sharing. I did make some changes that seem to have improved things, although I do still see the occasional misfire. Perhaps good enough to remove that ugly red banner, though!


Yes, echo cancellation via the browser (and maybe at the OS level too, if you're on a Mac with Sonoma). The accuracy of speech detection vs. noise is largely thanks to Silero, which runs on the client via WASM. I'm surprised at how well it works, even in noisy environments (and a reminder that I should experiment more with AudioWorklet stuff in the future!)
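
For what it's worth, the browser side of that is just the standard getUserMedia audio constraints. A minimal sketch (the demo's actual constraints may differ):

    // Standard MediaTrackConstraints; echoCancellation is what stops the mic from
    // picking the bot's own voice back up while it is talking. Run inside an async
    // function or an ES module (top-level await).
    const stream = await navigator.mediaDevices.getUserMedia({
      audio: {
        echoCancellation: true,
        noiseSuppression: true,
        autoGainControl: true,
      },
    });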


Hi, I built the client UI for this and... yea, I really wanted to get Firefox working :(

We needed a way to measure voice-to-voice latency from the end-user's perspective, and found Silero voice activity detection (https://github.com/snakers4/silero-vad) to be the most reliable at detecting when the user has stopped speaking, so we can start the timer (and stop it again when audio is received from the bot.)
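
The measurement itself is simple once the VAD gives a reliable "user stopped speaking" signal. Roughly this, with made-up names rather than the actual client code:

    // Start the clock when the VAD reports end of speech; stop it on the first
    // chunk of bot audio. The difference is the voice-to-voice latency we display.
    let turnEndedAt: number | null = null;

    function onUserStoppedSpeaking() {
      turnEndedAt = performance.now();
    }

    function onBotAudioChunk() {
      if (turnEndedAt !== null) {
        const latencyMs = performance.now() - turnEndedAt;
        console.log(`voice-to-voice latency: ${Math.round(latencyMs)} ms`);
        turnEndedAt = null; // only the first chunk of the reply counts
      }
    }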

Silero runs via onnx-runtime (with wasm). Whilst it sort-of-kinda works in Firefox, the VAD seems to misfire more than it should, causing the latency numbers to be somewhat absurd. I really want to get it working though! I'm still trying.
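
If you're curious, loading it looks roughly like this. The model path is an assumption, and the model's input/output tensor names (which differ between Silero releases) are omitted:

    import * as ort from "onnxruntime-web";

    // Create the Silero VAD inference session on the wasm execution provider.
    const session = await ort.InferenceSession.create("/silero_vad.onnx", {
      executionProviders: ["wasm"],
    });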

The code for the UI VAD is here: https://github.com/pipecat-ai/web-client-ui/tree/main/src/va...


Do you know why the algorithm performs differently in another browser? I would expect all browsers to run the code in exactly the same way.


Most of the Pipecat examples we've been working on are focused on speech-to-speech. The examples guide you through how to do that (or you can give the hosted storytelling example a try: https://storytelling-chatbot.fly.dev/)

We should probably update the example in the README to better represent that, thank you!


Your project is amazing and I'm not trying to take away from what you have accomplished.

But... I looked at the code and didn't see any audio-to-audio service or model. Can you link to an example of that?

I don't mean speech to text to LLM to text to speech. I mean speech-to-speech directly, as in the ML model takes audio as input and outputs audio, like OpenAI has now.

I am very familiar with the typical multi-model workflow and have implemented it several times.


That's absolutely amazing, both visually and technically! Did you share any insights into the development process, perhaps some code?


I just realized this is exactly the example provided in the repo, which I haven't run yet! Thanks for adding this!

