Hacker News new | past | comments | ask | show | jobs | submit login
Show HN: An open source framework for voice assistants (github.com/pipecat-ai)
346 points by kwindla 10 days ago | hide | past | favorite | 39 comments
I've been obsessed for the past ~year with the possibilities of talking to LLMs. I built a bunch of one-off prototypes, shared code on X, started a Meetup group in SF, and co-hosted a big hackathon. It turns out that there are a few low-level problems that everybody building conversational/real-time AI needs to solve on the way to building/shipping something that works well: low-latency media transport, echo cancellation, voice activity detection, phrase endpointing, pipelining data between models/services, handling voice interruptions, swapping out different models/services.

On the theory that something like a LlamaIndex or LangChain for real-time/conversational AI would be useful, a few of us started working on a Python library for voice (and multimodal) AI assistants/agents.

So ... Pipecat: a framework for building things like personal coaches, meeting assistants, story-telling toys for kids, customer support bots, virtual friends, and snarky social bots.

Most of the core contributors to Pipecat so far work together at our day jobs. This has been a kind of "20% time" thing at our company. But we're serious about welcoming all contributions. We want Pipecat to support any and all models, services, transport layers, and infrastructure tooling. If you're interested in this stuff, please check it out and let us know what you think. Submit PRs. Become a maintainer. Join the Discord. Post cool stuff. Post funny stuff when your voice agent goes completely off the rails (as mine sometimes do).

Nice to see an open source implementation, i have been seeing many startups get into this space like https://www.retellai.com/, https://fixie.ai/ etc. They always end up needing speech-to-speech models (current approach seems speech-text-text-speech with multiple agents handling 1 listening + 1 speaking), excited to see how this plays with recently announced gpt-4o

Adding to your list: https://vapi.ai -- really nice tools.

(I try to keep up with all the different layers/players in this space.)

We're (fixie.ai) working on on our SLM (speech language model). We'll release something soon to play with :)

How do speech to speech models work? Do they just that many more tokens to capture nuances of spoken language?

This is great but we really need an audio-to-audio model like they demoed in the open source world. Does anyone know of anything like that?

Edit: someone found one: https://news.ycombinator.com/item?id=40346992

Most of the Pipecat examples we've been working on are focused on speech-to-speech. The examples guide you through how to do that (or you can give the hosted storytelling example a try: https://storytelling-chatbot.fly.dev/)

We should probably update the example in the README to better represent that, thank you!

That's absolutely amazing, both visually and technically! Do you share any insights of the development process, perhaps some code?

I just realized this is exactly the example provided in the repo which I haven't run yet! Thanks for adding this!

Your project is amazing and I'm not trying to take away from what you have accomplished.

But..I looked at the code but didn't see any audio-to-audio service or model. Can you link to an example of that?

I don't mean speech to text to LLM to text to speech. I mean speech-to-speech directly, as in the ML model takes audio as input and outputs audio. As they have now in OpenAI.

I am very familiar with the typical multi-model workflow and have implemented it several times.

An audio-to-audio model is definitely a step forward. And I do think that's where things are going to go, generally speaking.

For context relating to real-time voice AI: once you're down below ~800ms things are fast enough to feel naturally responsive for most people and use cases.

The GPT-4o announcement page says they average ~320ms time to first token from an audio prompt. Which is definitely next level and is really, really exciting. You can't get to 800ms with any pipeline that includes GPT-4 Turbo today, so this is a big deal.

It's possible to do ~500ms time to first token by pipelining today's fastest transcription, inference, and tts models. (For example, Deepgram transcription, Groq Llama-3, Deepgram Aura voices.)

Every opening phrase is a platitude like ‘sure let’s do it’. So the OpenAI latency is probably higher, they are just using clever orchestration to generate some filler tokens to make latency lower. Unlikely the initial response at OpenAI is coming from the main model.

I'm familiar with Deepgram, groq, and Eleven Labs. I have recently built something on those and it's really not too bad as far as latency. But OpenAI has shown that audio-to-audio can't be beat.

Gazelle is (or will be) significantly faster than that.

Siri came out in October 2011. Amazon Alexa made its debut in November 2014. Google Assistant's voice-activated speakers were released in May 2016.

From what I can tell, Siri is still a dumpster fire that nobody is willing to use. And I have no personal experience with Alexa, so I can't speak to it. But I do have a few Google Home speakers and an Android phone, and I have seen no major improvements in years. In fact, it has gotten worse - for example, you can no longer add items directly to AnyList[0], only Google Keep.

Or, as an incredibly simple example of something I thought we'd get a long time ago, it's still unable to interpret two-part requests, e.g. "please repeat that but louder," or "please turn off the kitchen and dining room lights."

I find voice assistants very useful - especially when driving, lying in bed, cooking, or when I'm otherwise preoccupied. Yet they have stagnated almost since their debut. I can only imagine nobody has found a viable way to monetize them.

What will it take to get a better voice assistant for consumers? Willow[1] doesn't seem to have taken off.

[0] https://help.anylist.com/articles/google-assistant-overview/

[1] https://heywillow.io/

edit: I realize I hijacked your thread to dump something that's been on my mind lately. Pipecat looks really cool, and I hope it takes off! I hope to get some time to experiment this weekend.

I primarily use Google Home, but I do also have Echo Frames so I use Alexa semi-regularly. My use case is primarily home automation. In that scenario, I find Alexa to be much more responsive than Google Home. I do agree that it seems like Google Home has gotten worse in a number of ways. (As a happy AnyList user, that specific one was frustrating.)

For some activities, Siri is just fine. Thinks like “send a text to x” and “remind me to do x when I get home”.

And it does fine with no internet access.

Except dictation. Much better with internet access than without.

Those are about as basic of an action as you can get. Every assistant supports them. But as soon as you want to know something like "how many teaspoons in a cup," can Siri still handle it? What about "where is the aurora borealis visible tonight"?

Another issue Siri used to struggle was trying to play specific music on Spotify. Is that better these days?

I asked, it said 48 teaspoons.

I asked to play a song my an artist on Spotify and it did it. Popped up in the Dynamic Island.

Honestly, I didn’t know it could do that!

(It did ask permission to access Spotify data first, but only that first time).

It feels like there's a qualitative jump that voice assistants need to make which they wouldn't have been capable of before the last 18 months, and as a result, yes, the products themselves have stagnated. But if you were Amazon, at what point in the last year (say) would you have picked to draw a line in the sand and build a product iteration based on that level of tech?

> From what I can tell, Siri is still a dumpster fire that nobody is willing to use. I have no personal experience with Alexa, so I can't speak to it

I use both (albeit more Alexa than Siri, both just for a really limited functionality set), and FWIW, I believe Alexa is worse than Siri. It can do two things at the same time though (just as your example: "turn on X and turn off Y", "turn on X for Y seconds", and things like that).

I also feel that it has gotten worse over the years. I read about the possibility of microphones getting dust and therefore capturing worse audio, so I got a dust blower (for other reasons, too), but it didn't solve anything.

After listening in the app what Alexa picks up (from an Echo and Echo Dot, both 4th. Gen), I have to say that they use really shitty microphones. Furthermore, I have been testing Whisper extensively last month, with audio coming from low-quality sources, and I think a similar model would interpret a lot better my voice than whatever Amazon is using.

I have Alexa (Amazon Echo Show) and my use-case is asking for a news briefing, the weather, playing music, or setting timers.

Alexa is a dumpster fire and constantly getting dumber. She also completely disrespects your settings and will re-enable settings you have disabled. She constantly ignores my questions to ask me if I want to try some other new feature instead. She randomly decides to add news stations I have explicitly removed from my Flash Briefing list.

I am constantly baffled by how bad it is.

The hardware is nice though. Wish we could run open source assistants on it.

> What will it take to get a better voice assistant for consumers?

Give it eyes?


> it's still unable to interpret two-part requests

Our car has Google Assistant, and yeah that's annoying. Want to turn off steering wheel heater and seat heater? Gotta do two individual requests.

That said, it's actually quite nice to have voice control over these things. Especially when it's heavy traffic and snowing on top of the icy road, and you really want to have eyes on the traffic and both hands on the steering wheel.

> That said, it's actually quite nice to have voice control over these things.

Yes! I really think voice assistants are underrated. When I talk to iOS users, they have a much less favorable opinion of Siri (and I've watched my partner give up on using it over the past 10 years) and given that iOS has dominant market share in the US, I suspect this is a part of it.

But I also think there is just so much "low hanging fruit" that would drastically improve the experience. But I remember that even during "the race" for voice AI, everyone was wondering... how will they monetize this? And I'm not sure anyone was ever truly able to figure that out.

Just made https://feycher.com thats similar, but has realtime lip syncing as well. Let me know if you are interested and we can chat

We're also building bolna an open source voice orchestration: https://github.com/bolna-ai/bolna

LiveKit Agents, which OpenAI uses in voice mode is also open source:


The whole VAD thing is very interesting, keen to learn more about how it works and especially with multiple speakers!

Very cool, great work! I can def self using this when I start building in that direction.

How would I go about using this to live translate phone calls?

Daily now supports dial-in and dial-out, https://docs.daily.co/guides/products/dial-in-dial-out#main

Which means you can connect a bot to a call, and tell it dialout to a phone number and it will.

Why would you live translate phone calls? Also, whisper.

To talk to people in languages I am not fluent in.

Whisper is too slow and doesn’t allow interruption.

I wonder how the just announced "GPT-4o" with real-time voice impacts projects like this?

The demo on real-time multi language translation conversation blew me away!

Here's a translation demo in Pipecat using the now ancient and arthritic GPT-4 Turbo model. :-) https://github.com/pipecat-ai/pipecat/tree/main/examples/tra...

As soon as GPT-4o audio input is available through the APIs, we'll add 4o support to Pipecat. For bidirectional real-time audio, I think they'll need to make new WebSocket or WebRTC endpoints available.

Just letting you know it's available right now, just specify `gpt-4o` -- for text streaming anyway. I'd hazard a guess that the audio endpoints are open now, just not documented (like most of the last launches)...

Yeah, seems to be a drop-in replacement for the existing inference APIs. But I haven't found any docs yet for streaming audio/video input.

Yeah, same question here.

Building pipelines for bridging LLMs and TTS and STT models with lower latency is fine and all, but when you compare to a natively multimodal model like GPT-4o it seems strictly inferior. The future is clearly voice-native models that are able to understand nuances in voice and speech patterns, and it's not exactly a distant future.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact