The latency on this (or lack thereof) is the best I've seen; I'd love to know more about how it's achieved. I asked the bot and it claimed you're using Google's speech recognition, which I know supports streaming, but this seems to have much lower lag than I remember Google's service being capable of
It's not entirely unlikely that the LLM is informed exactly what its source data is, with the hope that it can potentially correct transcription errors
I didn't think low-latency high-quality voice chat would make such a difference over our current ChatGPT chat, but oh my, I think that really takes it to the next level. It's entering creepy territory, at least for me.
The latency on SmarterChild is very low, but it doesn't seem to be interruptible. The UI seems to prevent me from entering anything between my input and the AI response?
this crops up in my feed every now and then and it has vastly superior perf vs. OpenAI's ChatGPT iOS app or anything else I’ve found. truly outstanding. are you planning on developing it further and/or monetizing it?
This isn't mine, it's from sindarin.tech, they already have paid versions, with one plan being $450/50 hours of speech (just checked and it's up from 30 hours).
In order to feel like a human, the cue should not be a pre-programmed phrase; the system should continuously listen to the conversation and constantly evaluate whether speaking is pertinent at that particular moment. Humans will cut into a conversation if something is important, and such a system should be able to do the same.
Totally agree with your take. But a pre-programmed phrase would work today and hopefully wouldn't be too difficult to implement. I would imagine that higher latency would be more tolerable as well. But in the fullness of time, your approach is better.
When I'm listening to someone else talk, I'm already formulating responses or at least an outline of responses in my head. If the LLM could do a progressive summarization of the conversation in real-time as part of its context this would be super cool as well. It could also interrupt you if the LLM self-reflects on the summary and realizes that now would be a good time to interrupt.
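A minimal sketch of what that loop could look like, with llm() as a stand-in for whatever completion endpoint you use (nothing here is any particular project's API):

    # Sketch of the progressive-summarization idea above. llm() is a placeholder
    # for whatever completion endpoint you use; none of this is a real project API.
    summary = ""        # rolling summary that lives in the LLM context
    new_speech = []     # utterances transcribed since the last summary pass

    def llm(prompt: str) -> str:
        """Placeholder completion call."""
        raise NotImplementedError

    def on_new_utterance(text: str) -> None:
        new_speech.append(text)

    def summarization_tick() -> None:
        """Run every few seconds while the user keeps talking."""
        global summary
        if not new_speech:
            return
        chunk = " ".join(new_speech)
        new_speech.clear()
        summary = llm(
            f"Current summary:\n{summary}\n\nNew speech:\n{chunk}\n\n"
            "Update the summary in three sentences or fewer."
        )
        verdict = llm(
            f"Summary of an ongoing conversation:\n{summary}\n\n"
            "Would it be helpful and socially appropriate to interject right now? "
            "Answer YES or NO."
        )
        if verdict.strip().upper().startswith("YES"):
            print("-> good moment to interrupt")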
Indeed a great point. Waiting for a specific cue, before responding, is an interesting idea. It would make the interaction more natural, especially in situations where the user is thinking aloud or formulating their thoughts before seeking the AI's input.
Interruption is something that is already in the pipeline and we are working on it. You should see an update soon.
I agree, it is unnatural and a little stressful with current implementations. It feels like I first need to figure out what to say and then say it, so I don’t pause and mess up my input.
I hope the new improved Siri and Google assistant will be able to chain actions as well. “Ok Google, turn off the lights. Ok Google, stop music.” Feels a bit cumbersome.
A fast turnaround time is also super important; if the transcription is not correct, waiting multiple seconds for each turn would kill the application. E.g., ordering food by voice is only convenient if it gets my order right every time; if not, I will fall back to the app.
I wrote a sort of toy version of this a little while ago using Vosk and a variety of TTS engines, and the solution that worked mostly well was to have a buffer that accumulated audio until a pause of so many seconds, then sent that to the LLM.
With the introduction of tools for GPT, I could see a way to have the model check whether it thinks it received a complete thought, and if it didn't, send back a signal to keep appending to the buffer until the next long pause. The addition of a longer "pregnant pause" timeout could have the model check in to see if you're done talking or whatever.
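Roughly, that gate could look like the following, using a plain yes/no prompt in place of a real tool call; llm(), respond_to() and speak() are placeholders:

    # Buffer speech across pauses and only respond once the model judges the
    # utterance complete. llm(), respond_to() and speak() are placeholders, and
    # on_pause() is assumed to be called by your VAD whenever speech goes quiet.
    PREGNANT_PAUSE_SECONDS = 4.0   # a long pause -> check in with the user

    buffer = []

    def llm(prompt: str) -> str: ...
    def respond_to(utterance: str) -> None: ...
    def speak(text: str) -> None: ...

    def on_pause(new_text: str, pause_length: float) -> None:
        buffer.append(new_text)
        utterance = " ".join(buffer)
        verdict = llm(
            "Does the following look like a complete thought, or was the speaker "
            f'cut off mid-sentence?\n\n"{utterance}"\n\nAnswer COMPLETE or INCOMPLETE.'
        )
        if verdict and verdict.strip().upper().startswith("COMPLETE"):
            buffer.clear()
            respond_to(utterance)
        elif pause_length >= PREGNANT_PAUSE_SECONDS:
            speak("Are you still there, or should I go ahead?")
        # otherwise: keep appending to the buffer until the next pause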
To streamline the experience, we don't wait for the pause before sending the transcription to the LLM; we use the time spent waiting for the end-of-sentence trigger (the pause) to generate the LLM and text-to-speech output. So ideally, by the time we have detected the pause, everything has already been processed.
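For illustration, a minimal asyncio sketch of that kind of overlap, with placeholder generate_reply() and synthesize() calls; this is not WhisperFusion's actual code:

    import asyncio

    END_OF_SENTENCE_PAUSE = 1.0    # silence, in seconds, that counts as "done talking"

    async def generate_reply(text: str) -> str:    # placeholder LLM call
        await asyncio.sleep(0.3)
        return f"(reply to: {text})"

    async def synthesize(text: str) -> bytes:      # placeholder TTS call
        await asyncio.sleep(0.3)
        return b"wav-bytes"

    async def speculate(text: str) -> bytes:
        return await synthesize(await generate_reply(text))

    async def handle_pause(tentative_transcript: str, speech_resumed: asyncio.Event):
        # Start LLM + TTS on the tentative transcript immediately...
        work = asyncio.create_task(speculate(tentative_transcript))
        try:
            # ...while waiting out the end-of-sentence window in parallel.
            await asyncio.wait_for(speech_resumed.wait(), timeout=END_OF_SENTENCE_PAUSE)
        except asyncio.TimeoutError:
            return await work      # pause confirmed: audio is (ideally) already ready
        work.cancel()              # the user kept talking: discard the speculative work
        return None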
WhisperFusion, WhisperLive, WhisperSpeech, those are very interesting projects.
I'm curious about the latency (of all three systems individually, and also the LLM), and about WER numbers for WhisperLive. I did not really find any numbers on that? This is a bit strange, as those are the most crucial pieces of information about such models. Maybe I just looked in the wrong places (the GitHub repos).
WhisperLive builds upon the Whisper model; for the demo, we used small.en, but you can also use large without adding latency to the overall pipeline, since the transcription process is decoupled from the LLM and text-to-speech processes.
Yes, but when you change Whisper to make it live, to get WhisperLive, surely this has an effect on the WER, it will get worse. The question is, how much worse? And what is the latency? Depending on the type of streaming model, you might be able to control the latency, so you get a graph, latency vs WER, and in the extreme (offline) case, you have the original WER.
How exactly does WhisperLive work? Did you reduce the chunk size from 30 sec to something lower? To what? Is this fixed, or can it be configured by the user? Where can I find information on those details, or even a broad overview of how WhisperLive works?
Yes I have looked there. I did not find any WER numbers and latency numbers (ideally both together in a graph). I also did not find the model being described.
There, for example in Table 1, you have exactly that, latency vs WER. But the latency is huge (2.85 sec the lowest). Usually, streaming speech recognition systems have latency well beyond 1 sec.
But anyway, is this actually what you use in WhisperLive / WhisperFusion? I think it would be good to give a bit more details on that.
WhisperLive supports both TensorRT and faster-whisper.
We didn’t reduce the chunk size; rather, we use padding based on the chunk size received from the client. Reducing the segment size would be a more optimised solution for the live scenario.
For streaming, we continuously send fixed-size audio chunks to the server and send the completed segments back to the client while incrementing the timestamp_offset.
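To make the mechanism concrete, a simplified sketch of such a server loop; transcribe() stands in for whatever Whisper backend is used, this is not WhisperLive's actual code, and it assumes the un-finalized buffer stays under 30 seconds:

    import numpy as np

    SAMPLE_RATE = 16_000
    WINDOW_SECONDS = 30      # Whisper's fixed input window
    MARGIN_SECONDS = 1.0     # segments ending near the buffer edge may still change

    audio_buffer = np.zeros(0, dtype=np.float32)
    timestamp_offset = 0.0   # seconds of audio already finalized and sent back

    def transcribe(window: np.ndarray):
        """Placeholder: returns [(start_s, end_s, text), ...] relative to the window."""
        raise NotImplementedError

    def on_chunk(chunk: np.ndarray) -> None:
        """Called for every fixed-size chunk of float32 PCM received from the client."""
        global audio_buffer, timestamp_offset
        audio_buffer = np.concatenate([audio_buffer, chunk])

        # Pad up to the model's fixed window rather than shrinking the chunk size.
        window = np.zeros(WINDOW_SECONDS * SAMPLE_RATE, dtype=np.float32)
        n = min(len(audio_buffer), len(window))
        window[:n] = audio_buffer[:n]

        buffered_s = len(audio_buffer) / SAMPLE_RATE
        segments = transcribe(window)
        completed = [s for s in segments if s[1] < buffered_s - MARGIN_SECONDS]

        for start, end, text in completed:
            send_segment(timestamp_offset + start, timestamp_offset + end, text)

        if completed:
            # Advance the offset and drop the finalized audio from the buffer.
            last_end = completed[-1][1]
            timestamp_offset += last_end
            audio_buffer = audio_buffer[int(last_end * SAMPLE_RATE):]

    def send_segment(start: float, end: float, text: str) -> None:
        print(f"[{start:7.2f} - {end:7.2f}] {text}")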
Ah, but that sounds like a very inefficient approach, which probably still has quite high latency, and probably also performs badly in terms of word error rate (WER).
But I'm happy to be proven wrong. That's why I would like to see some actual numbers. Maybe it's still OK-ish, maybe it's actually really bad. I'm curious. But I don't just want to see a demo or a sloppy statement like "it's working ok".
Note that this is a highly non-trivial problem, to make a streamable speech recognition system with low latency and still good performance. There is a big research community working on just this problem.
I actually have worked on this problem myself. E.g. see our work "Chunked Attention-based Encoder-Decoder Model for Streaming Speech Recognition" (https://arxiv.org/abs/2309.08436), which will be presented at ICASSP 2024. E.g. for a median latency of 1.11 sec, we get a WER of 7.5% on TEDLIUM-v2 dev, which is almost as good as the offline model with 7.4% WER. This is a very good result (only very minor WER degradation). Or with a latency of 0.78 sec, we get 7.7% WER. Our model currently does not work too well when we go to even lower latencies (or the computational overhead becomes impractical).
Whisper is simply not designed for this, in many ways, and it's impressive engineering to try to overcome its limitations, but I can't help but feel that it would be easier to just use an architecture that is designed for the problem.
This is TensorFlow-based. But I also have another PyTorch-based implementation already, also public (inside our other repo, i6_experiments). It's not so easy currently to set this up, but I'm working on a simpler pipeline in PyTorch.
We don't have the models online yet, but we can upload them later. But I'm not sure how useful they are outside of research, as they are specifically for those research tasks (Librispeech, Tedlium), and probably don't perform too well on other data.
This is an excellent project with excellent packaging. It is primarily a packaging problem.
Why does every Python application on GitHub have its own ad-hoc, informally specified, bug ridden, slow implementation of half of setuptools?
Why does TensorRT distribute the most essential part of what it does in an "examples" directory?
huggingface_cli... man, I already have a way to download something by name; it's a zip file. In fact, why not make a PyPI index that acts as a facade for these models? We already have so many ways to install and cache read-only binary blobs...
Well, the Hugging Face one is obvious enough: they want to encourage vendor lock-in and make themselves the default. Same reason Docker downloads from Docker Hub unless you explicitly request a full URL.
I think they pivoted to LLM phone calls? I tried their library the other day and it works quite well. It even has the "interrupt feature" that is being talked about a few threads up. It supports a ton of backends for transcription/voice/LLM.
Imagine porting this to a dedicated app that can access the context of the open window and the text on the screen, providing an almost real-time assistant for everything you do on screen.
Automatically take a screenshot and feed it to https://github.com/vikhyat/moondream or similar? Doable. But while very impressive, the results are a bit of a mixed bag (some hallucinations).
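Capturing the screen is the easy half; a sketch using the mss library, with ask_vision_model() left as a placeholder for moondream or whatever multimodal model you wire in:

    # Grab the current screen and hand it to a vision model. ask_vision_model()
    # is a placeholder; wire it to moondream or any multimodal model you prefer.
    import mss
    from PIL import Image

    def grab_screen() -> Image.Image:
        with mss.mss() as sct:
            shot = sct.grab(sct.monitors[1])    # primary monitor
            return Image.frombytes("RGB", shot.size, shot.rgb)

    def ask_vision_model(image: Image.Image, question: str) -> str:
        """Placeholder for the actual vision-language model call."""
        raise NotImplementedError

    if __name__ == "__main__":
        answer = ask_vision_model(grab_screen(), "What is on my screen right now?")
        print(answer)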
Oh, this is neat! I was wondering how to get whisper to stream-transcribe well. I have a similar project using whisper + styletts with a similar goal of minimal delay: https://github.com/lxe/llm-companion
There must have been 100 folks with the same idea at the same time. I'm very excited to have something like this running on mics in my home, so long as it's running locally (and not costing $30/mo. in electricity to operate). Lots of starter projects; it feels like a polished solution (e.g. easy maintainability, good Home Assistant integration, etc.) is right around the corner now
I have been tempted to try to build something out myself; there are tons of IP cameras around with two-way audio. If the mic quality were reasonable, the potential for a multimodal LLM to comment contextually on the scene as well as respond through the speaker of a ceiling-mounted camera appeals to me a lot. "Computer, WTF is this old stray component I found lying under the sink?"
What is the SOTA for openly available vision systems? If there's a camera, can it track objects so it can tell me where I put my keys in the room without having to put a $30 AirTag on them?
I think good in-home vision models are probably still a little way off, but it already seems like you could start planning for their existence. It would also be possible to fine-tune a puny model to trigger a function that passes the image to a larger hosted model when explicitly requested. There are a variety of ways things could be tiered so that processing that can practically be done at home stays at home, while still making it possible to defer the query, automatically or at the user's request, to a larger model operated by someone else
Could someone please summarize the differences (or similarities) of the LLM part compared with a TGWUI + llama.cpp setup with layers offloaded to tensor cores?
I'm asking because 8x7B Q4_K_M (25GB, GGUF) doesn't seem to be "ultra-low latency" on my 12GB VRAM + RAM. Like, at all. I can imagine running a 7-13GB model with that latency (because I did, but... it's a small model), or using 2x P40 or something. I'm not sure what assumptions they make in the README. Am I missing something? Can you try it without the TTS part?
It's what Siri and Alexa should have been. I think we will see much more of this in the next few years. If, and only if, it can run locally and not keep a permanent record, then the issue of listening in the background would go away, too. This is really the biggest obstacle to a natural interaction. I want to first talk, perhaps to a friend, and later ask the bot to chime in. And for that to work, it really needs to listen for an extended period. This could be especially useful for home automation.
This is using phi-2, so the first assumption would be that it's local. It's a tiny little model in the grand scheme of things.
I've been toying around with something similar myself, only I want push-to-talk from my phone. There's a route there with a WebRTC SPA, and it feels like it should be doable just by stringing together the right bits of various tech demos, but figuring out how to wire everything together is more effort than it should be if you're not familiar with the tech.
What's really annoying is Whisper's latency. It's not really designed for this sort of streaming use-case, they're only masking its unsuitability here by throwing (comparatively) ludicrous compute at it.
This project is using Mistral, not Phi-2. However, it is clear from reading the README.md that this runs locally, so your point still stands. That said, it looks like all the models have been optimized for TensorRT, so the Whisper component may not be as high-latency as you suggest.
For the transcription part, we are looking into W2v-BERT 2.0 as well and will make it available in a live-streaming context. That said, Whisper, especially the small model (<50 ms), is not that compute-heavy; right now, most of the compute is consumed by the LLM.
No, it's not that it's especially compute-heavy; it's that the model expects to work on 30-second samples. So if you want sub-second latency, you have to do 30 seconds' worth of processing more than once a second. It just multiplies the problem up. If you can't offload it to a GPU, it's painfully inefficient.
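A back-of-envelope illustration of that multiplier (the stride is an assumption, not a measured number):

    # If Whisper always sees a fixed 30 s window, re-running it on a short stride
    # multiplies the work. Illustrative numbers only.
    WINDOW_S = 30.0    # Whisper's fixed input window
    STRIDE_S = 0.5     # re-transcribe every 0.5 s of new audio for ~sub-second latency

    multiplier = WINDOW_S / STRIDE_S
    print(f"~{multiplier:.0f}x the audio-seconds processed per second of speech, "
          "vs. a single offline pass")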
As to why that might matter: my single 4090 is occupied with most of a Mixtral instance, and I don't especially want to take any compute away from that.
I like how ChatGPT 4 will stammer, stutter and pause. This would be even better with a little "uhm" right when the speaker finishes talking, or even a chatbot that interrupts you a little bit, predicting when you're finishing - even incorrectly.
Has anyone experimented with integrating real-time lipsync into a low-latency audio bot? I saw some demos with d-id but their pricing was closer to $1/minute which makes it rather prohibitive
I had a quick read through; maybe I missed something, but is this all run locally, or does it need API access to OpenAI’s remote system?
The reason I ask is that I’m building something that does both TTS and STT using OpenAI, but I do not want to be sending a never-ending stream of audio to OpenAI just for it to listen for a single command I will eventually give it.
If I can do all of this local and use Mistral instead, then I’d give it a go too.
Thank you! When I read "OpenAI" I was thinking it would be going through them. This revelation is perfect timing for me… keeping user data even more private. Excellent!
I have a similar dream; with one major caveat - it must unequivocally be a local model. "See whatever I'm doing on my screen" comes with "leaks information to the model" and that could go real-bad-wrong real fast.
I have a feeling that a small subset of privacy-conscious "computer savvy" folks will care about local-only for AI, but that the vast majority of humanity simply won't know, care, or care to know why they should even care. For proof, just look at how nobody cared about search, photos, medical, or other data until theirs got leaked, and still nobody cares about them because "it's not my data that got leaked".
We (we in the larger sense of computer users as a whole, not just the small subset of "power-users") should care more about privacy and security and such, but most people think of computers and networks in the same way they think of a toaster or a hammer. To them it's a tool that does stuff when they push the right "magic button", and they couldn't care less what's inside, or how it could harm them if mis-used until it actually does harm them (or come close enough to it that they can no longer ignore it).
More than that, it can monitor your screen continuously and have perfect recall, so it will be able to correct you immediately, or remind you of relevant context.
Interesting! I do find projects with friends very engaging. They always seem to lose interest long before I do, though.
That really bums me out, and used to make me lose steam too. My current approach is, "I'm going to do it no matter what, and if you want to join that's cool too."
--
That's the main advantage of GPT for me... not infinite wisdom, but infinite willingness to listen, and infinite enthusiasm!
That is in very short supply among humans. Which is probably why it costs $200/hr in human form, heh.
Unconditional positive regard.
Pi.ai is also surprisingly good for that, better than GPT in some aspects (talking out "soft problems" — not as good for technical stuff).
I'm aching for someone to come up with a low-latency round-trip voice recognition, LLM, and speech generation pipeline tuned to waste the time of phone scammers. There is one famous YouTube guy who has tried this exact thing, but the one video I saw was very, very primitive and unconvincing.
OTOH, the technology which allows that would just as easily, and more likely, be used by the scammers themselves to fully automate robocalling, rather than having to outsource to call centres like they currently do. Your time-wasting robot would just be wasting the time of another robot that's simultaneously on the line with a thousand other people.
We see tools such as this posted several times a week. Is there any expectation that they will be installable by the common person? Where is the setup.exe, .deb, .rpm, or .dmg?
Whenever I walk my dog, I find myself wanting a conversational LLM layer in its best form. LLMs are now great at conversation, but the connective tissue between the LLM and natural dialogue needs a lot of work.
Some of the problems:
- Voice systems now (including the ChatGPT mobile app) stop you at times when a human would not, based on how long you pause. If you said, "I think I'm going to... [3-second pause]", then LLMs stop you, but a human would wait
- No ability to interrupt them with voice only
- Natural conversationalists tend to match one another's speed, but these systems' speed is fixed
- Lots of custom instructions needed to change from what works in written text to what works in speech (no bullet points, no long formulas)
On the other side of this problem is a super smart friend you can call on your phone. That would be world changing.
Yeah. While I like the idea of live voice chat with an LLM, it turns out I’m not so good at getting a thought across without pauses, and that gets interpreted as the LLM’s turn to respond. I’d need to be able to turn on a magic spoken word like “continue” for it to be useful.
pyryt posted https://arxiv.org/abs/2010.10874, which might be helpful here, but we'll probably end up with personalized models that learn from conversation styles. A magic stop/processing word would be the easiest to add, since you already have the transcript, but it takes away from the natural feel of a conversation.
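For what it's worth, the stop-word version really is only a few lines once you have a streaming transcript; a sketch, where the cue phrases are just examples and respond_to()/pause_playback() are placeholders:

    # Gate the LLM on an explicit spoken cue; also honour a "hold" phrase.
    # Cue phrases are examples; respond_to() and pause_playback() are placeholders.
    CUE_PHRASES = ("continue", "over to you", "what do you think")
    HOLD_PHRASES = ("hang on", "wait a second")

    pending: list[str] = []

    def respond_to(utterance: str) -> None: ...
    def pause_playback() -> None: ...

    def on_final_segment(text: str) -> None:
        """Called for each finalized transcript segment."""
        pending.append(text)
        tail = text.lower().rstrip(" .!?")
        if any(tail.endswith(cue) for cue in CUE_PHRASES):
            utterance = " ".join(pending)
            pending.clear()
            respond_to(utterance)
        elif any(hold in tail for hold in HOLD_PHRASES):
            pause_playback()    # stop any in-progress TTS and wait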
Good point; another area we are currently looking into is predicting intention; often, when talking to someone, we have a good idea of what that person might say next. That would not only help with latency but also allow us to give better answers and load the right context.
I think the Whisper models need to predict end-of-turn based on content. And if it still gets input after the EOT, it can just drop the LLM generation and start over at the next EOT.
1. Interruption - I need to be able to say "hang on" and have the LLM pause. 2. Wait for a specific cue before responding. I like "What do you think?"
That + low latency are crucial. It needs to feel like talking to another person.