Omni SenseVoice: High-Speed Speech Recognition with Word Timestamps (github.com/lifeiteng)
154 points by ringer007 22 hours ago | 27 comments

Looks cool! Combine this with the new TTS that was released today (which looks really good) and an LLM, and you'd have a pretty good all-local voice assistant! https://github.com/SWivid/F5-TTS
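
Something like this, roughly; the three model calls are just placeholders for whatever you wire in (OmniSenseVoice, a local LLM, F5-TTS or similar), not real APIs from those projects:

    # Sketch of a local voice-assistant loop: record -> ASR -> LLM -> TTS -> play.
    # The three model functions are placeholders you have to fill in yourself.
    import numpy as np
    import sounddevice as sd

    SAMPLE_RATE = 16_000

    def transcribe(audio: np.ndarray) -> str:
        raise NotImplementedError("call your ASR model here (e.g. OmniSenseVoice)")

    def generate_reply(prompt: str) -> str:
        raise NotImplementedError("call your local LLM here")

    def synthesize(text: str) -> np.ndarray:
        raise NotImplementedError("call your TTS model here (e.g. F5-TTS)")

    def voice_loop(seconds_per_turn: float = 5.0) -> None:
        while True:
            # Record a fixed-length turn; a real assistant would use VAD instead.
            audio = sd.rec(int(seconds_per_turn * SAMPLE_RATE),
                           samplerate=SAMPLE_RATE, channels=1, dtype="float32")
            sd.wait()
            text = transcribe(audio.squeeze())
            reply = generate_reply(text)
            sd.play(synthesize(reply), samplerate=SAMPLE_RATE)
            sd.wait()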

I’ve been building a production app on top of ASR and find the range of models kind of bewildering compared to LLMs and video. The commercial offerings seem to be custom or built on top of Whisper (or maybe Nvidia Canary/Parakeet), and then you have stuff like SpeechBrain that runs on top of lots of different open models for different tasks. Sometimes it’s genuinely hard to tell what’s a foundation model and what isn’t.

Separately, I wonder if this is the model Speechmatics uses.


There’s just no single one-size-fits-all model/pipeline. You choose the right one for the job: whether you need streaming (i.e., low latency, with words output right as they’re spoken), whether it runs on device (e.g., a phone) or on a server, which languages/dialects, whether the audio is conversational or more “produced” like a news broadcast or podcast, etc. The best way is to benchmark with data from your target domain.

Sure, you're just going to try lots of things and see what works best, but it's confusing to compare things at such different levels of abstraction: a lot of the time you don't even know what you're comparing, and it's impossible to do apples-to-apples even on your own test data. If your need is "speaker identification", you end up comparing commercial black boxes like Speechmatics (probably custom) vs. commercial translucent boxes like Gladia (some custom blend of Whisper + pyannote + etc.) vs. [asr_api]/[some_specific_sepformer_model]. I can observe that products I know to be built on top of Whisper don't seem to handle overlapping-speaker diarization that well, but I don't actually have any way of knowing whether that has anything to do with Whisper.

We released a new SOTA ASR as open source just a couple of weeks ago. https://www.rev.com/blog/speech-to-text-technology/introduci...

Take a look. We'll be open sourcing more models very soon!


> These models are accessible under a non-commercial license.

That is not open source.


Exactly. It is source available but not open source:

https://opensource.org/osd


That's great to hear! Amazing performance from the model!

For voice chat bots, however, shorter input utterances are the norm (anywhere from 1-10 seconds), with lots of silence in between, so this limitation is a bit sad:

> On the Gigaspeech test suite, Rev’s research model is worse than other open-source models. The average segment length of this corpus is 5.7 seconds; these short segments are not a good match for the design of Rev’s model. These results demonstrate that despite its strong performance on long-form tests, Rev is not the best candidate for short-form recognition applications like voice search.


I'll check it out.

FWIW, in terms of benchmarking, I'm more interested in benchmarks against Gladia, Deepgram, Pyannote, and Speechmatics than whatever is built into the hyperscaler platforms. But I end up doing my own anyway so whatevs.

Also, you guys need any training data? I have >10K hrs of conversational iso-audio :)


This looks really nice. What I find interesting is that it seems to advertise itself for the transcription use case, but if it's "lightning fast" I wonder whether there are better use cases for it.

I use AWS Transcribe[1] primarily. It costs me $0.024 per minute of video and also provides timestamps. It's unclear to me, without running the numbers, whether I could do any better with this model, seeing as it needs a GPU to run.

With that said, I always love to see these things in the Open Source domain. Competition drives innovation.

Edit: Doing some math, with spot instances on EC2 or serverless GPU on some other platforms, it could be relatively price-competitive with AWS Transcribe if the model is even moderately fast (you need to transcribe about 2 hours of audio per GPU-hour to break even). Of course, the devops work of running your own model is higher.
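
To make the break-even math concrete (the GPU rate below is just an assumed example, not a real quote):

    # Back-of-the-envelope break-even vs. AWS Transcribe ($0.024 per audio minute).
    # The GPU rate is an assumption for illustration; plug in your actual spot/serverless price.
    transcribe_cost_per_audio_hour = 0.024 * 60   # $1.44 per hour of audio
    gpu_cost_per_hour = 2.88                      # assumed $/hr for the GPU box

    # Hours of audio you must process per GPU-hour for self-hosting to win:
    break_even_speed = gpu_cost_per_hour / transcribe_cost_per_audio_hour
    print(break_even_speed)  # 2.0 -> the model needs to run at ~2x realtime or faster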

[1] https://aws.amazon.com/transcribe/


> better use cases for it.

I want my babelfish!


How does the accuracy compare to Whisper?

This uses SenseVoice [0] under the hood, which claims better accuracy than Whisper. Not sure how accurate that claim is, though, since I haven't seen a third-party comparison; in this space it's very easy to toot your own horn.

[0] https://github.com/FunAudioLLM/SenseVoice


This uses SenseVoice-Small under the hood. They claim their large model is better than Whisper large-v3, not the small version. The small version is definitely worse than Whisper large-v3 but still usable, and the extra annotation it does is interesting.

This claims to have speaker diarization, which is a potentially killer feature missing from most Whisper implementations.

I mean, they make a bold statement up top only to backpedal a bit further down with: "[…] In terms of Chinese and Cantonese recognition, the SenseVoice-Small model has advantages." [0]

It feels dishonest to me.

[0] https://github.com/FunAudioLLM/SenseVoice?tab=readme-ov-file...


I've been doing some things with Whisper and find the accuracy very good, BUT I've found the timestamps to be pretty bad. For example, using the timestamps directly to clip words or phrases often clips off the end of the word (even in simple cases where the word is followed by silence). Since this emphasizes word timestamps, I may give it a try.
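
One crude workaround sketch (not anything from this repo, and the pad value is just a guess): pad the end timestamp a little before cutting, e.g. driving ffmpeg from Python:

    # Pad the word's end timestamp before clipping, since ASR end times
    # often land slightly early. The 0.15 s pad is an arbitrary guess.
    import subprocess

    def clip_word(src: str, dst: str, start: float, end: float, pad: float = 0.15) -> None:
        subprocess.run(
            ["ffmpeg", "-y", "-i", src,
             "-ss", f"{max(0.0, start - 0.02):.3f}",
             "-to", f"{end + pad:.3f}",
             dst],
            check=True,
        )

    # e.g. clip_word("talk.wav", "word.wav", start=12.34, end=12.71)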

Which languages does it support?

OOMs even in quantized mode on a 3090. What's a better option for personal use?

> torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 43.71 GiB. GPU 0 has a total capacity of 24.00 GiB of which 20.74 GiB is free.
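
If the blow-up comes from feeding the whole file in at once, chunking the audio first might keep it under 24 GB. A rough sketch (the ASR call is passed in as a function, since I'm not assuming this project's actual API):

    # Split a long file into fixed-length chunks before running ASR,
    # which keeps peak memory bounded. transcribe_fn is whatever model
    # call you use; this is not OmniSenseVoice's API.
    import torchaudio

    def transcribe_in_chunks(path, transcribe_fn, chunk_seconds=30):
        waveform, sr = torchaudio.load(path)          # (channels, samples)
        chunk = chunk_seconds * sr
        texts = []
        for start in range(0, waveform.shape[1], chunk):
            piece = waveform[:, start:start + chunk]
            texts.append(transcribe_fn(piece, sr))    # ASR call on one chunk
        return " ".join(texts)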


Not sure if you mean in general, or options for this particular project, but Whisper should work for you.

Can't wait for a bundle of something like this with screen capture. I'd love to pipe my convos/habits/apps/etc. to a local index for search. Seems we're getting close.

Can it diarize?

Apparently not. See https://github.com/lifeiteng/OmniSenseVoice/blob/main/src/om.... See also FunASR, which runs SenseVoice but uses Kaldi for speaker identification: https://github.com/modelscope/FunASR/blob/cd684580991661b9a0...
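
If you need diarization anyway, the usual pattern is to run a separate diarization model and match words to speaker turns by timestamp. A rough sketch with pyannote (not something this repo provides):

    # Pair ASR word timestamps with pyannote speaker turns: diarize
    # separately, then assign each word to the turn covering its midpoint.
    from pyannote.audio import Pipeline

    def label_words(audio_path, words, hf_token):
        # `words` is assumed to be a list of (text, start, end) from your ASR.
        pipeline = Pipeline.from_pretrained(
            "pyannote/speaker-diarization-3.1", use_auth_token=hf_token)
        diarization = pipeline(audio_path)
        turns = [(turn.start, turn.end, speaker)
                 for turn, _, speaker in diarization.itertracks(yield_label=True)]
        labeled = []
        for text, start, end in words:
            mid = (start + end) / 2
            speaker = next((s for t0, t1, s in turns if t0 <= mid <= t1), "unknown")
            labeled.append((speaker, text, start, end))
        return labeled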

Does it do diarization?

Apparently not. See my reply to satvikpendem.

With timestamps?! I gotta try this.

Does it work with chorus?


