Show HN: WhisperFusion – Low-latency conversations with an AI chatbot (github.com/collabora)
272 points by mfilion 3 months ago | 102 comments
WhisperFusion builds upon the capabilities of the open source tools WhisperLive and WhisperSpeech to provide seamless conversations with an AI chatbot.



There are two things that I think are needed and that I'm not sure if anyone provides yet to make this scenario work well:

1. Interruption - I need to be able to say "hang on" and have the LLM pause.

2. Wait for a specific cue before responding. I like "What do you think?"

That + low latency are crucial. It needs to feel like talking to another person.


> Interruption

Well, today is your lucky day: https://persona-webapp-beta.vercel.app/ and the demo at https://smarterchild.chat/


The latency on this (or lack thereof) is the best I've seen; I'd love to know more about how it's achieved. I asked the bot and it claimed you're using Google's speech recognition, which I know supports streaming, but this seems to have much lower lag than I remember Google's stuff being capable of.


> I asked the bot and it claimed you're using Google's speech recognition

That doesn't sound plausible. How can the LLM part know which speech recognition service is being used?


It's not entirely unlikely that the LLM is informed exactly what its source data is, in the hope that it can potentially correct transcription errors.


Or just because it's interesting and people might ask. I could imagine it being a hallucination, but it could also be an easter egg of sorts.


Apparently it uses the Web Speech API [1], not a specific service.

[1] https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecog...


I didn't think low-latency high-quality voice chat would make such a difference over our current ChatGPT chat, but oh my, I think that really takes it to the next level. It's entering creepy territory, at least for me.


The latency on smarterchild is very fast, but it doesn't seem to be interruptible. The UI seems to restrict me from even entering anything between my input and the AI response?


I had no problem with “hold on a sec” and then “sorry, please continue”


I clicked on "Talk", but the textbox just says "Preparing to speak..." without doing anything else


It doesn't work for me on Firefox, but works on Vivaldi.


this crops up in my feed every now and then and it has vastly superior perf vs. ØAI’s ChatGPT iOS app or anything else I’ve found. truly outstanding. are you planning on developing it further and/or monetizing it?


This isn't mine, it's from sindarin.tech, they already have paid versions, with one plan being $450/50 hours of speech (just checked and it's up from 30 hours).


In order to feel like a human, cues should not be a pre-programmed phrase; the system should continuously listen to the conversation and constantly evaluate whether speaking is pertinent at that particular moment. Humans will cut into a conversation if it's important, and such a system should be able to do the same.


Totally agree with your take. But a pre-programmed phrase would work today and hopefully wouldn't be too difficult to implement. I would imagine that higher latency would be more tolerable as well. But in the fullness of time, your approach is better.

When I'm listening to someone else talk, I'm already formulating responses or at least an outline of responses in my head. If the LLM could do a progressive summarization of the conversation in real-time as part of its context this would be super cool as well. It could also interrupt you if the LLM self-reflects on the summary and realizes that now would be a good time to interrupt.
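
A rolling summary like that can be kept pretty simply; a minimal sketch, assuming a placeholder generate() call standing in for whatever local LLM is in the loop:

    def generate(prompt):
        raise NotImplementedError("plug in your local LLM call here")

    class RollingSummary:
        """Fold recent turns into a compressed summary that rides along as context."""

        def __init__(self, max_turns_before_compress=6):
            self.summary = ""       # compressed history so far
            self.recent_turns = []  # verbatim turns not yet folded in
            self.max_turns = max_turns_before_compress

        def add_turn(self, speaker, text):
            self.recent_turns.append(f"{speaker}: {text}")
            if len(self.recent_turns) >= self.max_turns:
                self._compress()

        def _compress(self):
            prompt = ("Summarize the conversation so far in a few sentences, "
                      "keeping facts, names and open questions.\n\n"
                      f"Previous summary:\n{self.summary}\n\n"
                      "New turns:\n" + "\n".join(self.recent_turns))
            self.summary = generate(prompt)
            self.recent_turns.clear()

        def context(self):
            # what gets prepended to the next LLM request
            return self.summary + "\n" + "\n".join(self.recent_turns)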


Indeed a great point. Waiting for a specific cue before responding is an interesting idea. It would make the interaction more natural, especially in situations where the user is thinking aloud or formulating their thoughts before seeking the AI's input.

Interruption is something that is already in the pipeline and we are working on it. You should see an update soon.


Thanks! Really looking forward to interruptions.

I think about the cue as kind of being like "Hey Siri/Alexa/Cortana" but in reverse.


I agree, it is unnatural and a little stressful with current implementations. It feels like I first need to figure out what to say and then say it, so I don't pause and mess up my input.

I hope the new improved Siri and Google assistant will be able to chain actions as well. “Ok Google, turn off the lights. Ok Google, stop music.” Feels a bit cumbersome.


A fast turnaround time is also super important; if the transcription is not correct, waiting multiple seconds for each turn would kill the application. E.g., ordering food using voice is only convenient if it understands me correctly every time; if not, I will fall back to the app.


I wrote a sort of toy version of this a little while ago using Vosk and a variety of TTS engines, and the solution that worked mostly well was a buffer that filled with audio until a pause of so many seconds, then sent that to the LLM.

With the implementation of tools for GPT, I could see a way to have the model check whether it thinks it received a complete thought, and if it didn't, send back a signal to keep appending to the buffer until the next long pause. The addition of a longer "pregnant pause" timeout could have the model check in to see if you're done talking or whatever.
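
In rough Python (an energy-threshold stand-in for a real VAD, and placeholder transcribe/ask_llm callables rather than any particular library's API), that pause-gated buffering looks something like this:

    import numpy as np

    SAMPLE_RATE = 16_000
    FRAME_SECONDS = 0.25
    SILENCE_RMS = 0.01     # tune to your mic; a real VAD (e.g. webrtcvad) works better
    PAUSE_SECONDS = 1.5    # how long a pause ends the utterance

    def is_silence(frame):
        return np.sqrt(np.mean(frame ** 2)) < SILENCE_RMS

    def capture_utterance(frames):
        """Accumulate audio frames until PAUSE_SECONDS of trailing silence."""
        buffer, silent_for = [], 0.0
        for frame in frames:  # frames: iterator of float32 arrays from the mic
            buffer.append(frame)
            silent_for = silent_for + FRAME_SECONDS if is_silence(frame) else 0.0
            if silent_for >= PAUSE_SECONDS and len(buffer) * FRAME_SECONDS > PAUSE_SECONDS:
                break
        return np.concatenate(buffer)

    # usage (placeholders): reply = ask_llm(transcribe(capture_utterance(mic_frames)))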


To streamline the experience, we don't wait for the pause to send the transcription to the LLM; we use the time spent waiting for the end-of-sentence trigger (the pause) to generate the LLM and text-to-speech output. So ideally, once we've detected the pause, everything has already been processed.
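
A minimal sketch of that speculative pipeline, with placeholder generate_reply/synthesize callables (not the actual WhisperFusion internals): generate on every partial transcript, and keep the draft only if no newer speech arrived by the time the pause fires.

    import threading

    class SpeculativeResponder:
        """Start LLM + TTS on partial transcripts; keep the draft only if still current."""

        def __init__(self, generate_reply, synthesize):
            self.generate_reply = generate_reply
            self.synthesize = synthesize
            self._lock = threading.Lock()
            self._latest = None    # (transcript, audio) of the newest finished draft
            self._pending = None   # newest partial transcript seen so far

        def on_partial_transcript(self, transcript):
            self._pending = transcript
            threading.Thread(target=self._speculate, args=(transcript,), daemon=True).start()

        def _speculate(self, transcript):
            audio = self.synthesize(self.generate_reply(transcript))
            with self._lock:
                if transcript == self._pending:   # no newer speech arrived meanwhile
                    self._latest = (transcript, audio)

        def on_pause_detected(self, final_transcript):
            with self._lock:
                if self._latest and self._latest[0] == final_transcript:
                    return self._latest[1]        # draft still valid: play immediately
            # otherwise fall back to generating from scratch
            return self.synthesize(self.generate_reply(final_transcript))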


I agree. I've been working on a two-way interruption system plus streaming like this. It's not robust yet, but when it works it does feel magical.


> 2. Wait for a specific cue before responding. I like "What do you think?"

"Over."


"Over and out" closes the app. ;)

Saying "Go" to indicate it's the bot's turn would work for me. (Or maybe pressing a button.) The bot should always stop wherever I start speaking.


> I like "What do you think?"

I like it too!

And I can't help but think that getting into the habit of saying it would help us get along much better with the other people in our lives.


I did a video demo of this. Tell it to respond only with OK to every message, and to respond fully only when you say you're finished. Ok? Ok.


It would be cool if the AI could interrupt too.


"Imma let you finish, but..."


See also the blog post: https://www.collabora.com/news-and-blog/news-and-events/whis...

WhisperFusion, WhisperLive, WhisperSpeech, those are very interesting projects.

I'm curious about the latency (of all three systems individually, and also the LLM) and the WER numbers of WhisperLive. I did not really find any numbers on that? This is a bit strange, as those are the most crucial pieces of information about such models. Maybe I just looked in the wrong places (the GitHub repos).


WhisperLive builds upon the Whisper model; for the demo, we used small.en, but you can also use large without adding latency to the overall pipeline, since the transcription process is decoupled from the LLM and text-to-speech process.


Yes, but when you change Whisper to make it live, to get WhisperLive, surely this has an effect on the WER; it will get worse. The question is, how much worse? And what is the latency? Depending on the type of streaming model, you might be able to control the latency, so you get a graph, latency vs WER, and in the extreme (offline) case, you have the original WER.

How exactly does WhisperLive actually work? Did you reduce the chunk size from 30 sec to something lower? To what? Is this fixed, or can it be configured by the user? Where can I find information on those details, or even a broad overview of how WhisperLive works?



Yes, I have looked there. I did not find any WER or latency numbers (ideally both together in a graph). I also did not find the model described.

*Edit*

Ah, when you write faster_whisper, you actually mean https://github.com/SYSTRAN/faster-whisper?

And for streaming, you use https://github.com/ufal/whisper_streaming? So, the model as described in http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main...?

There, for example in Table 1, you have exactly that, latency vs WER. But the latency is huge (2.85 sec the lowest). Usually, streaming speech recognition systems have latency well beyond 1 sec.

But anyway, is this actually what you use in WhisperLive / WhisperFusion? I think it would be good to give a bit more detail on that.


WhisperLive supports both TensorRT and faster-whisper. We didn't reduce the chunk size; rather, we use padding based on the chunk size received from the client. Reducing the segment size should be a more optimised solution in the live scenario.

For streaming, we continuously send fixed-size chunks of audio bytes to the server and send the completed segments back to the client while incrementing the timestamp_offset.
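
The client-side loop for that can be sketched like this (hypothetical WebSocket endpoint and message format; see the WhisperLive repo for the actual protocol):

    import asyncio
    import numpy as np
    import websockets  # pip install websockets

    async def stream_microphone(frames, uri="ws://localhost:9090"):
        """Send fixed-size audio chunks; print completed segments as they come back."""
        async with websockets.connect(uri) as ws:
            async def sender():
                for frame in frames:  # frames yield fixed-size float32 numpy arrays
                    await ws.send(frame.astype(np.float32).tobytes())

            async def receiver():
                async for message in ws:  # server returns segments as they complete
                    print("segment:", message)

            await asyncio.gather(sender(), receiver())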


Ah, but that sounds like a very inefficient approach, which probably still has quite high latency, and probably also performs badly in terms of word error rate (WER).

But I'm happy to be proven wrong. That's why I would like to see some actual numbers. Maybe it's still okish enough, maybe it's actually really bad. I'm curious. But I don't just want to see a demo or a sloppy statement like "it's working ok".

Note that this is a highly non-trivial problem, to make a streamable speech recognition system with low latency and still good performance. There is a big research community working on just this problem.

I actually have worked on this problem myself. E.g., see our work "Chunked Attention-based Encoder-Decoder Model for Streaming Speech Recognition" (https://arxiv.org/abs/2309.08436), which will be presented at ICASSP 2024. E.g., for a median latency of 1.11 sec, we get a WER of 7.5% on TEDLIUM-v2 dev, which is almost as good as the offline model with 7.4% WER. This is a very good result (only very minor WER degradation). Or with a latency of 0.78 sec, we get 7.7% WER. Our model currently does not work too well when we go to even lower latencies (or the computational overhead becomes impractical).

Or see Emformer (https://arxiv.org/abs/2010.10759) as another popular model.


Whisper is simply not designed for this, in many ways, and it's impressive engineering to try to overcome its limitations, but I can't help but feel that it is easier to just use an architecture that is designed for the problem.

I was impressed by Kaldi's models for streaming ASR: https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index... ; I suspect that the Nvidia/Suno Parakeet models will also be pretty good for streaming https://huggingface.co/nvidia/parakeet-ctc-0.6b


Very interesting. Thanks for the references. Have you released the code or pre-trained models yet or do you plan to do so at some point?


The code is all released already. You find it here: https://github.com/rwth-i6/returnn-experiments/tree/master/2...

This is TensorFlow-based. But I also have another PyTorch-based implementation already, also public (inside our other repo, i6_experiments). It's not so easy currently to set this up, but I'm working on a simpler pipeline in PyTorch.

We don't have the models online yet, but we can upload them later. But I'm not sure how useful they are outside of research, as they are specifically for those research tasks (Librispeech, Tedlium), and probably don't perform too well on other data.


We will add the details, thanks for pointing it out.



Interesting project, thanks for sharing


This is an excellent project with excellent packaging. It is primarily a packaging problem.

Why does every Python application on GitHub have its own ad-hoc, informally specified, bug ridden, slow implementation of half of setuptools?

Why does TensorRT distribute the most essential part of what it does in an "examples" directory?

huggingface_cli... man, I already have a way to download something by a name, it's a zip file. In fact, why not make a PyPI index that facades these models? We have so many ways already to install and cache read-only binary blobs...


Well, the Hugging Face one is obvious enough: they want to encourage vendor lock-in and make themselves the default. Same reason Docker downloads from Docker Hub unless you explicitly request a full URL.


This post reminded me of Vocode: https://github.com/vocodedev/vocode-python

Discussion on them here from 10 months ago: https://news.ycombinator.com/item?id=35358873

I tried the demo back then and was very impressed. Anyone using it in dev or production?


I think they did a pivot to LLM phone calls? I tried their library the other day and it works quite well. It even has the "interrupt feature" that is being talked about a few threads up. It supports a ton of backends for transcription/voice/LLM.


Yeah, the interrupt worked well. I would guess (?) this could be deployed for local conversation without the need for a phone.


Imagine porting this to a dedicated app that can access the context of the open window and the text on the screen, providing an almost real-time assistant for everything you do on screen.


Automatically take a screenshot and feed it to https://github.com/vikhyat/moondream or similar? Doable. But while very impressive, the results are a bit of a mixed bag (some hallucinations).
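
For reference, a minimal version of that loop, assuming Pillow's ImageGrab for the screenshot and the interface shown on the moondream2 model card at the time (check the repo for the current API):

    from PIL import ImageGrab
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "vikhyatk/moondream2"  # assumes the model-card interface; may have changed
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    screenshot = ImageGrab.grab()  # capture the current screen with Pillow
    encoded = model.encode_image(screenshot)
    print(model.answer_question(encoded, "What is the user working on?", tokenizer))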


I'm sure something like the accessibility API would have lower latency.

https://developer.apple.com/library/archive/samplecode/UIEle...


rewind.ai seems to be moving in this direction


this looks equally scary and incredible, especially the "summarize what I worked on today" examples.


it works really well, and locally too!


Oh this is neat! I was wondering how to get Whisper to stream-transcribe well. I have a similar project using Whisper + StyleTTS with the similar goal of minimal delay: https://github.com/lxe/llm-companion


There must have been 100 folks with the same idea at the same time. I'm very excited to have something like this running on mics in my home, so long as it's running locally (and not costing $30/mo. in electricity to operate). There are lots of starter projects; it feels like a polished solution (e.g. easy maintainability, good Home Assistant integration, etc.) is right around the corner now.

I've been tempted to try to build something out myself; there are tons of IP cameras around with 2-way audio. If the mic were of reasonable quality, the potential for a multimodal LLM to comment contextually on the scene as well as respond through the speaker of a ceiling-mounted camera appeals to me a lot. "Computer, WTF is this old stray component I found lying under the sink?"


What is SOTA for model-available vision systems? If there's a camera, can it track objects so it can tell me where I put my keys in the room, without having to put a $30 AirTag on them?


I think good in-home vision models are probably still a little way off, but it already seems like you could start planning for their existence. It would also be possible to fine-tune a puny model to trigger a function that passes the image to a larger hosted model when explicitly requested. There are a variety of ways things could be tiered to keep processing that can practically be done at home at home, while still making it possible to defer the query, automatically or on the user's request, to a larger model operated by someone else.
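
One way to sketch that tiering (local_vlm and hosted_vlm are placeholders for a small on-device model and a larger remote one):

    # The local model answers by default and escalates only when it flags itself
    # as unsure or the user explicitly asks, and only if remote use is allowed.
    ESCALATION_TOKEN = "[ESCALATE]"  # hypothetical marker the small model is tuned to emit

    def answer(image, question, user_allows_remote, local_vlm, hosted_vlm):
        draft = local_vlm(image, question)
        wants_escalation = (ESCALATION_TOKEN in draft
                            or "ask the big model" in question.lower())
        if wants_escalation and user_allows_remote:
            return hosted_vlm(image, question)  # only leaves the house when permitted
        return draft.replace(ESCALATION_TOKEN, "").strip()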


Could someone please summarize the differences (or similarities) between the LLM part here and a TGWUI + llama.cpp setup with layers offloaded to tensor cores?

Asking because 8x7B Q4_K_M (25GB, GGUF) doesn't seem to be "ultra-low latency" on my 12GB VRAM + RAM. Like, at all. I can imagine running a 7-13GB model with that latency (because I did, but... it's a small model), or using 2x P40 or something. I'm not sure what assumptions they make in the README. Am I missing something? Can you try it without the TTS part?


The video example is using Phi-2 which is a 2.7bn param network. I think that's part of how they're achieving the low latency here!

Has anybody fine-tuned Phi-2? I haven't found any good resources for that yet.


We tested https://huggingface.co/cognitivecomputations/dolphin-2_6-phi... as well; in some tasks it performs better. That said, you can use Mistral as well; we support a few models through TensorRT-LLM.


It's what Siri and Alexa should have been. I think we will see much more of this in the next few years. If, and only if, it can run locally and not keep a permanent record, then the issue of listening in the background would go away, too. This is really the biggest obstacle to a natural interaction. I want to first talk, perhaps to a friend, and later ask the bot to chime in. And for that to work, it really needs to listen for an extended period. This could be especially useful for home automation.


This is using phi-2, so the first assumption would be that it's local. It's a tiny little model in the grand scheme of things.

I've been toying around with something similar myself, only I want push-to-talk from my phone. There's a route there with a WebRTC SPA, and it feels like it should be doable just by stringing together the right bits of various tech demos, but just understanding how to string everything together is more effort than it should be if you're not familiar with the tech.

What's really annoying is Whisper's latency. It's not really designed for this sort of streaming use-case, they're only masking its unsuitability here by throwing (comparatively) ludicrous compute at it.


There are people trying to Frankenstein-merge Mistral and Whisper into a single multimodal model [1]. I wonder if this could improve the latency.

[1] : https://paul.mou.dev/posts/2023-12-31-listening-with-llm/


Yes (you skip a decoding step), but also no (when do you start emitting?).


This project is using Mistral, not Phi-2. However, it is clear from reading the README.MD that this runs locally, so your point still stands. That being said, it looks like all models have been optimized for TensorRT, so the Whisper component may not be as high-latency as you suggest.


Ah, so it is. I got confused by the video, where the assistant responses are labeled as phi-2.


For the transcription part, we are looking into W2v-BERT 2.0 as well and will make it available in a live-streaming context. That said, Whisper, especially small (<50ms), is not as compute-heavy; right now, most of the compute is consumed by the LLM.


No, it's not that it's especially compute-heavy; it's that the model expects to work on 30-second samples. So if you want sub-second latency, you have to do 30 seconds' worth of processing more than once a second. It just multiplies the problem up. If you can't offload it to a GPU, it's painfully inefficient.
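
Back-of-the-envelope, assuming a full padded window per update:

    WINDOW_SECONDS = 30   # Whisper's fixed input window
    HOP_SECONDS = 0.5     # how often you want an updated transcript

    # Each update processes a full (padded) 30 s window, so per second of real
    # time the encoder chews through WINDOW_SECONDS / HOP_SECONDS seconds of audio.
    print(WINDOW_SECONDS / HOP_SECONDS, "x real time")  # 60.0 x real time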

As to why that might matter: my single 4090 is occupied with most of a Mixtral instance, and I don't especially want to take any compute away from that.


For minimum latency you want a recurrent model that works in the time domain. A Mamba-like model could do it.


Seeing that this uses TensorRT (i.e. seems well optimized), what GPUs are supported? Could I run this on a Jetson?


That is something we are looking forward to as well. Stay tuned for updates on Jetson support.

We tested it on a 3090 and a 4090; it works as expected.


I like how ChatGPT 4 will stammer, stutter and pause. This would be even better with a little "uhm" right when the speaker finishes talking, or even a chatbot that interrupts you a little bit, predicting when you're finishing - even incorrectly.

like an engaged but not-most-polite person does


Knowing when to speak is actually a prediction task in itself. See eg https://arxiv.org/abs/2010.10874

It would indeed be great to get something like this integrated with Whisper, an LLM, and TTS.


Hard for me to imagine that this could be solved in text space. I think the prediction task needs to be done on the audio.


We thought about doing this in Whisper itself, since it's already working in the audio space.


Yes, this is something we want to look into in more detail. We really appreciate you sharing the research.


Has anyone experimented with integrating real-time lipsync into a low-latency audio bot? I saw some demos with D-ID, but their pricing was closer to $1/minute, which makes it rather prohibitive.


I had a quick read through, and maybe I missed something, but is this all run locally, or does it need API access to OpenAI's remote system?

The reason I ask is that I'm building something that does both TTS and STT using OpenAI, but I do not want to be sending a never-ending stream of audio to OpenAI just for it to listen for a single command I will eventually give it.

If I can do all of this locally and use Mistral instead, then I'd give it a go too.


Everything runs locally, we use:

- WhisperLive for the transcription - https://github.com/collabora/WhisperLive

- WhisperSpeech for the text-to-speech - https://github.com/collabora/WhisperSpeech

and an LLM (phi-2, Mistral, etc.) in between


Thank you! When I read OpenAI, I was thinking it would be going through them. This revelation is perfect timing for me... keeping user data even more private. Excellent!


My dream is to do pair coding (and pair learning) with an AI.

It would be a live conversation and it can see whatever I’m doing on my screen.

We’re gradually getting closer.


I have a similar dream; with one major caveat - it must unequivocally be a local model. "See whatever I'm doing on my screen" comes with "leaks information to the model" and that could go real-bad-wrong real fast.


I have a feeling people will really demand local only for AI.

I’m not sure why the demand never materialized for other highly personal services like search, photos, medical, etc.

But I just have this hunch we all really want it for AI.


I have a feeling that a small subset of privacy-conscious "computer savvy" folks will care about local-only for AI, but that the vast majority of humanity simply won't know, care, or care to know why they should even care. For proof, just look at how nobody cared about search, photos, medical, or other data until theirs got leaked, and still nobody cares about them because "it's not my data that got leaked".

We (we in the larger sense of computer users as a whole, not just the small subset of "power-users") should care more about privacy and security and such, but most people think of computers and networks in the same way they think of a toaster or a hammer. To them it's a tool that does stuff when they push the right "magic button", and they couldn't care less what's inside, or how it could harm them if mis-used until it actually does harm them (or come close enough to it that they can no longer ignore it).


More than that, it can monitor your screen continuously and have perfect recall, so it will be able to correct you immediately, or remind you of relevant context.

I like to call it "Artificial Attention".


I wanted to say that's Copilot, but you meant speaking instead of typing?


I envision it feeling like you're pair programming with a person. (I have problems staying motivated.) But that might be a good place to start.


Interesting! I do find projects with friends very engaging. They always seem to lose interest long before I do, though.

That really bums me out, and used to make me lose steam too. My current approach is, "I'm going to do it no matter what, and if you want to join that's cool too."

--

That's the main advantage of GPT for me... not infinite wisdom, but infinite willingness to listen, and infinite enthusiasm!

That is in very short supply among humans. Which is probably why it costs $200/hr in human form, heh.

Unconditional positive regard.

Pi.ai is also surprisingly good for that, better than GPT in some aspects (talking out "soft problems" — not as good for technical stuff).


I'm aching for someone to come up with a low-latency round-trip of voice recognition, LLM, and speech generation tuned to waste the time of phone scammers. There is one famous YouTube guy who has tried this exact thing, but the one video I saw was very, very primitive and unconvincing.


OTOH, the technology which allows that would just as easily, and more likely, be used by the scammers themselves to fully automate robocalling, rather than having to outsource to call centres like they currently do. Your time-wasting robot would just be wasting the time of another robot that's simultaneously on the line with a thousand other people.


correction:

"... simultaneously on the line with a thousand other robots."

:)


If it were that easy to detect a scam call and redirect it to a robot then we could just block the scam calls in the first place.


We see tools such as this posted several times a week. Is there any expectation they will be installable by the common person? Where is the setup.exe, .deb, .rpm, .dmg?


We are going to put the sample interface into the Docker image, so it will simply be:

> docker run --gpus all --shm-size 64G -p 80:80 -it ghcr.io/collabora/whisperfusion:latest

instead of:

> docker run --gpus all --shm-size 64G -p 6006:6006 -p 8888:8888 -it ghcr.io/collabora/whisperfusion:latest
> cd examples/chatbot/html
> python -m http.server


Very neat capability. We need to see more hyper-optimizing of models for one specific use case; this is a great example of doing so.


Does anyone know if you can replace TensorRT with a similar call to Apple's CoreML for the same functionality?


Great to hear it's seamless, real-time, and ultra low-latency. Hopefully the next iteration is blazingly fast too!


Whenever I walk my dog I find myself wanting a conversationalist LLM layer to exist in the best form. LLM's now are great at conversation, but the connective tissue between the LLM and natural dialog needs a lot of work.

Some of the problems:

- Voice systems now (including the ChatGPT mobile app) stop you at times when a human would not, based on how long you pause. If you said, "I think I'm going to... [3-second pause]", then LLMs stop you, but a human would wait

- No ability to interrupt them with voice only

- Natural conversationalists tend to match one another's speed, but these systems' speed is fixed

- Lots of custom instructions needed to change from what works in written text to what works in speech (no bullet points, no long formulas)

On the other side of this problem is a super smart friend you can call on your phone. That would be world changing.


Yeah. While I like the idea of live voice chat with an LLM, it turns out I’m not so good at getting a thought across without pauses, and that gets interpreted as the LLM’s turn to respond. I’d need to be able to turn on a magic spoken word like “continue” for it to be useful.

I do like the interface though.


pyryt posted https://arxiv.org/abs/2010.10874, which might be helpful here, but we will probably end up with personalized models that learn from conversation styles. A magic stop/processing word would be the easiest to add, since you already have the transcript, but it takes away from the natural feel of a conversation.


Good point; another area we are currently looking into is predicting intention. Often, when talking to someone, we have a good idea of what that person might say next. That would not only help with latency but also allow us to give better answers and load the right context.


I think the Whisper models need to predict end-of-turn based on content. And if it still gets input after the EOT, it can just drop the LLM generation and start over at the next EOT.
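
A sketch of that drop-and-restart policy (predict_eot is a hypothetical end-of-turn classifier over the transcript; start_generation returns a handle whose cancel() aborts the in-flight LLM call):

    class TurnManager:
        def __init__(self, predict_eot, start_generation):
            self.predict_eot = predict_eot
            self.start_generation = start_generation
            self.transcript = ""
            self.inflight = None

        def on_new_words(self, words):
            self.transcript += words
            if self.inflight is not None:
                # More speech arrived after we thought the turn was over:
                # throw away the draft and wait for the next end-of-turn.
                self.inflight.cancel()
                self.inflight = None
            if self.predict_eot(self.transcript):
                self.inflight = self.start_generation(self.transcript)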



