
See also the blog post: https://www.collabora.com/news-and-blog/news-and-events/whis...

WhisperFusion, WhisperLive, WhisperSpeech: these are very interesting projects.

I'm curious about the latency (of each of those three systems individually, and also of the LLM) and about the WER of WhisperLive. I did not really find any numbers on that, which is a bit strange, as those are the most crucial pieces of information about such models. Maybe I just looked in the wrong places (the GitHub repos).




WhisperLive builds upon the Whisper model. For the demo, we used small.en, but you can also use large without increasing the overall pipeline latency, since the transcription process is decoupled from the LLM and text-to-speech processes.
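
Roughly, the decoupling looks like this (a very simplified sketch, not the actual WhisperFusion code; transcribe_chunk, run_llm, synthesize and play_audio are just placeholder names for the three components and the audio output):

    # Sketch: decoupling ASR from the LLM and TTS stages via queues,
    # so a larger Whisper model does not block the rest of the pipeline.
    # transcribe_chunk, run_llm, synthesize, play_audio are placeholders.
    import queue
    import threading

    audio_q = queue.Queue()   # raw audio chunks from the client
    text_q = queue.Queue()    # transcribed segments for the LLM
    speech_q = queue.Queue()  # LLM responses for text-to-speech

    def asr_worker():
        while True:
            segment = transcribe_chunk(audio_q.get())  # e.g. small.en or large
            if segment:
                text_q.put(segment)

    def llm_worker():
        while True:
            speech_q.put(run_llm(text_q.get()))

    def tts_worker():
        while True:
            play_audio(synthesize(speech_q.get()))

    # In the real server, audio_q would be fed from the client connection.
    for worker in (asr_worker, llm_worker, tts_worker):
        threading.Thread(target=worker, daemon=True).start()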


Yes, but when you change Whisper to make it live and get WhisperLive, surely this has an effect on the WER: it will get worse. The question is, how much worse? And what is the latency? Depending on the type of streaming model, you might be able to control the latency, so you get a curve of latency vs. WER, and in the extreme (offline) case you have the original WER.

How exactly does WhisperLive work? Did you reduce the chunk size from 30 sec to something lower? To what? Is this fixed, or can it be configured by the user? Where can I find information on those details, or even a broad overview of how WhisperLive works?



Yes, I have looked there. I did not find any WER or latency numbers (ideally both together in a graph). I also did not find a description of the model.

*Edit*

Ah, when you write faster_whisper, you actually mean https://github.com/SYSTRAN/faster-whisper?

And for streaming, you use https://github.com/ufal/whisper_streaming? So, the model as described in http://www.afnlp.org/conferences/ijcnlp2023/proceedings/main...?

There, for example in Table 1, you have exactly that: latency vs. WER. But the latency is huge (2.85 sec at the lowest). Usually, streaming speech recognition systems aim for latency well below 1 sec.

But anyway, is this actually what you use in WhisperLive / WhisperFusion? I think it would be good to give a bit more detail on that.


WhisperLive supports both TensorRT and faster-whisper backends. We didn't reduce the chunk size; rather, we pad based on the chunk size received from the client. Reducing the segment size should be a more optimised solution in the live scenario.

For streaming, the client continuously sends fixed-size chunks of audio bytes to the server, and the server sends the completed segments back to the client while incrementing the timestamp_offset.
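
Roughly, the idea on the server side is the following (simplified sketch, not the actual WhisperLive code; transcribe() and send_to_client() stand in for the faster-whisper/TensorRT backend and the websocket reply, and exactly which segments count as completed is simplified here):

    # Simplified sketch of the server-side loop; transcribe() and
    # send_to_client() are placeholders.
    import numpy as np

    SAMPLE_RATE = 16000
    WINDOW = 30 * SAMPLE_RATE        # Whisper's fixed 30 s input window
    buffer = np.zeros(0, dtype=np.float32)
    timestamp_offset = 0.0           # seconds of audio already finalized

    def on_audio_chunk(chunk_bytes):
        """Called for every fixed-size audio chunk streamed by the client."""
        global buffer, timestamp_offset
        buffer = np.concatenate([buffer, np.frombuffer(chunk_bytes, dtype=np.float32)])

        # Pad the accumulated audio up to the 30 s window instead of
        # reducing Whisper's chunk size.
        padded = np.pad(buffer, (0, max(0, WINDOW - len(buffer))))
        segments = transcribe(padded)

        # Segments that end well before the live edge of the buffer are
        # treated as completed and sent back to the client.
        finalized_until = 0.0
        for seg in segments:
            if seg.end < len(buffer) / SAMPLE_RATE - 1.0:
                send_to_client(seg.text,
                               start=timestamp_offset + seg.start,
                               end=timestamp_offset + seg.end)
                finalized_until = seg.end

        # Drop finalized audio and advance timestamp_offset accordingly.
        buffer = buffer[int(finalized_until * SAMPLE_RATE):]
        timestamp_offset += finalized_until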


Ah, but that sounds like a very inefficient approach, which probably still has quite high latency, and probably also performs badly in terms of word error rate (WER).

But I'm happy to be proven wrong. That's why I would like to see some actual numbers. Maybe it's still okish enough, maybe it's actually really bad. I'm curious. But I don't just want to see a demo or a sloppy statement like "it's working ok".

Note that this is a highly non-trivial problem, to make a streamable speech recognition system with low latency and still good performance. There is a big research community working on just this problem.

I have actually worked on this problem myself. E.g. see our work "Chunked Attention-based Encoder-Decoder Model for Streaming Speech Recognition" (https://arxiv.org/abs/2309.08436), which will be presented at ICASSP 2024. E.g. for a median latency of 1.11 sec, we get a WER of 7.5% on TEDLIUM-v2 dev, which is almost as good as the offline model with 7.4% WER. This is a very good result (only a very minor WER degradation). Or with a latency of 0.78 sec, we get 7.7% WER. Our model currently does not work too well when we go to even lower latencies (or the computational overhead becomes impractical).
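
To give an idea of the basic principle (very simplified, not the exact model from the paper): the encoder self-attention is restricted so that a frame only sees its own chunk and earlier chunks, which bounds the future context, and therefore the latency, to roughly one chunk:

    # Generic illustration of a chunked self-attention mask (simplified,
    # not the exact model from the paper).
    import torch

    def chunked_self_attention_mask(num_frames: int, chunk_size: int) -> torch.Tensor:
        """True where attention is allowed."""
        chunk_idx = torch.arange(num_frames) // chunk_size
        # Frame i may attend to frame j iff j's chunk is not in the future.
        return chunk_idx.unsqueeze(1) >= chunk_idx.unsqueeze(0)

    # E.g. with chunk_size=4, frame 5 sees frames 0..7 but nothing beyond,
    # so the model never waits for more than one chunk of future audio.
    mask = chunked_self_attention_mask(num_frames=12, chunk_size=4)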

Or see Emformer (https://arxiv.org/abs/2010.10759) as another popular model.


Whisper is simply not designed for this, in many ways, and it's impressive engineering to try to overcome its limitations, but I can't help but feel that it is easier to just use an architecture that is designed for the problem.

I was impressed by Kaldi's models for streaming ASR: https://k2-fsa.github.io/sherpa/onnx/pretrained_models/index... ; I suspect that the Nvidia/Suno Parakeet models will also be pretty good for streaming https://huggingface.co/nvidia/parakeet-ctc-0.6b
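
For reference, streaming decoding with sherpa-onnx looks roughly like this (following their Python streaming examples; the model file names here are placeholders and the exact API may differ slightly between versions):

    # Rough sketch following the sherpa-onnx streaming Python examples;
    # the model file paths are placeholders.
    import sherpa_onnx
    import soundfile as sf

    recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
        tokens="tokens.txt",
        encoder="encoder.onnx",
        decoder="decoder.onnx",
        joiner="joiner.onnx",
    )

    samples, sample_rate = sf.read("test.wav", dtype="float32")
    stream = recognizer.create_stream()

    # Feed the audio in small pieces, decode as soon as enough frames are
    # available, and read out the partial result after each piece.
    chunk = int(0.2 * sample_rate)
    for start in range(0, len(samples), chunk):
        stream.accept_waveform(sample_rate, samples[start:start + chunk])
        while recognizer.is_ready(stream):
            recognizer.decode_stream(stream)
        print(recognizer.get_result(stream))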


Very interesting. Thanks for the references. Have you released the code or pre-trained models yet or do you plan to do so at some point?


The code is all released already. You can find it here: https://github.com/rwth-i6/returnn-experiments/tree/master/2...

This is TensorFlow-based. But I also have another PyTorch-based implementation already, also public (inside our other repo, i6_experiments). It's currently not so easy to set up, but I'm working on a simpler pipeline in PyTorch.

We don't have the models online yet, but we can upload them later. But I'm not sure how useful they are outside of research, as they are specifically for those research tasks (Librispeech, Tedlium), and probably don't perform too well on other data.


We will add the details, thanks for pointing it out.



Interesting project, thanks for sharing



