
What this means is that it does not support things like acting instructions or creating a voice from a text description. If you prompt it with a matching text+voice sample it will be able to generate more speech based on more text, just like a TTS. It can also generate its own text on the fly, but it won't be as good as your frontier model.


Hi, I used the [WhisperSpeech](https://github.com/collabora/WhisperSpeech) model for the TTS part after I did some serious torch.compile optimizations to bring the latency down. The Whisper speech recognition and the LLM were optimized through TensorRT-LLM by Marcus and Vineet.

It's not perfect but I am still extremely proud of how it came out. :)


Tried this on a 4090 and the responsiveness and real-time communication it offers are truly impressive. It has significantly improved my overall experience, especially in scenarios where minimal delay is crucial.

Compared to WhisperFusion, Rabbit R1 feels like it's stuck in the past; they could maybe use the open-source WhisperFusion.


WhisperFusion is fully open-source - https://github.com/collabora/WhisperFusion


Good point, thanks. And I was thinking it would show that the model can really synthesize very varied samples... ;)


It does, but maybe put it last!


Did exactly that, thanks for spotting that. :)

https://github.com/collabora/WhisperSpeech/commit/398b889060...


That's an interesting thought. The semantic tokens we get from Whisper serve a similar purpose – you can convert existing speech to different voices; I haven't tried accents yet.

There is still a lot to explore in this space – we certainly don't have all the answers yet!


Yeah, Whisper is not clear-cut, but since it is not a generative model I think their data usage is a lot more likely to be considered fair use.

And the part of that which we use for WhisperSpeech is just the phonetic representation, so our model is not able to recreate any of the Whisper training data in any way.


The readme says "We are working only with properly licensed speech recordings and all the code is Open Source so the model will be always safe to use for commercial applications."

Is that less certain than the quote implies?


We are working hard to uphold all the licensing rules but nobody can absolve you from all legal risks.

There may be a court ruling or new law saying that any training needs special permission from the original author, and then even a CC-BY license won't cover this.


Laypeople value the aesthetics of statements like these. It's very Discord energy.

Everyone using learned weights from other models, especially ones released by OpenAI, Stability and Google, such as text and audio encoders, is tainted by training materials that were not expressly licensed for the purpose of AI model training or licensed without restriction for any use.


That's true but you make it sound like it's totally obvious where the line of fair use should be drawn for AI training.

Until courts or lawmakers make it clearer, I personally believe non-generative models (Whisper, ResNet, DINOv2) should be legally trainable on publicly released data. Generative models (image or video generation, TTS, LLMs?) should be held to much higher scrutiny since their outputs can potentially compete with the creators who put a lot of creativity into their art. That's not true for an ImageNet-trained classification model or Whisper ASR.


I believe your use should be protected. This is not meant to be a takedown, better you hear it from me though, because you'll never hear it from Discord.

> "We are working only with properly licensed"

...versus:

> "fair use"

You're smart enough to know these are very different things: saying you believe you are protected by fair use, versus claiming that the data is "properly licensed." Legally there is a colossal difference; you went from saying you were authorized to use something to saying you believe you are unauthorized but still permitted under fair use.


Yeah, thanks. I'd love to try to clarify this (I'll put this into our documentation as well ASAP) for anyone who may be reading this in the future:

Our model is not a derivative of Whisper but we do use the transcripts (and encoder outputs) in our data preprocessing. I am convinced this data cannot be regarded as a derivative of the (proprietary) Whisper training set. Whisper itself is open-source (MIT) and has no special TOS (unlike, say, ChatGPT).

WhisperSpeech itself is trained on mostly Public Domain and some CC-BY-SA recordings. We are working on adding more data and we will provide clearer documentation on all the licenses.


Hi, WhisperSpeech dev here.

Thanks for all the nice comments. I've been working really hard on this model for quite a few months now, but there are still a lot of ways we can make it better.

Thanks to the generosity of Collabora this is a real open-source project (not just a one-time marketing ploy), so if you want to help improve it or integrate it into something you are building, I'd love to help.

You can also buy our undivided engineering attention if you have a business use-case. :)


> Thanks to the generosity of Collabora this is a real open-source project (not just a one-time marketing ploy), so if you want to help improve it or integrate it into something you are building, I'd love to help.

We're probably interested! We're Overte, an open source VR/Desktop social platform.

The system targets VR and voice chat primarily, but we want to be more accessible to people who can't use voice chat for any reason. We do have an integrated chat, but it's not an ideal experience in VR. So good TTS to make it integrate better would be great for us. And the possibility of doing this without some sort of commercial API that requires keeping a secret API key is huge.

So yeah, we're very much interested in giving this one a try. It will probably take some time as we're gearing up for FOSDEM now, though.


Yeah, we'd love to help you when you decide to give it a try so feel free to reach out. We also have quite a few people working on VR at Collabora.


Great job! Thanks for sharing this!

I'm developing a dynamic tutorial desktop app and I plan to use this model as text to speech synthesizer. Any chance it can be ported to ONNX format?


Hi, thanks a lot for the tip, I'll update the README samples ASAP. :)

I was busy working on inference performance in the last few weeks and totally did not expect to land on Hacker News today. I only noticed it an hour ago because my GitHub stars jumped quite a bit.


Both Polish and English samples are actually synthesized with a voice trained on the WolneLektury audiobooks. They are the highest quality open source (CC BY-SA) audiobooks I could find.

By using the Whisper-derived phonetic representation (so-called semantic tokens) we successfully trained a model with just a high-quality speech dataset in one language, and the voice quality transferred to English.
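The two-stage design described above, text to semantic tokens and then semantic tokens plus a voice to audio, can be sketched with placeholder functions (these stand-ins are purely illustrative, not the real WhisperSpeech models, which are transformers; only the token-passing interface is mimicked):

```python
def t2s(text: str) -> list[int]:
    """Text-to-semantic: map text to language-agnostic 'phonetic' tokens."""
    return [ord(c) % 512 for c in text.lower()]  # placeholder quantization

def s2a(semantic: list[int], speaker_id: int) -> list[int]:
    """Semantic-to-acoustic: the same semantic tokens rendered with a given
    voice yield different acoustic tokens. This is where a voice learned from
    Polish audiobooks can be applied to English text."""
    return [(tok * 7 + speaker_id) % 1024 for tok in semantic]

def synthesize(text: str, speaker_id: int) -> list[int]:
    return s2a(t2s(text), speaker_id)

# Identical text in two voices shares semantic tokens but differs acoustically.
assert t2s("hello") == t2s("HELLO")
assert synthesize("hello", speaker_id=1) != synthesize("hello", speaker_id=2)
```

The point of the split is that the semantic stage never sees speaker identity, which is what lets a single-language voice dataset generalize across languages.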


How much training compute does it require to train from scratch? I'm wondering because I have a lot of audiobooks. They're not necessarily CC-licensed, but for my private usage and training I think it'd be fine.


Training the T2S model from scratch takes around 8h on 96 A100 GPUs. Training the `tiny` S2A model is around 3x faster (training HQ `small` variant is comparable to T2S).

I think you would get good results with fine-tuning but unfortunately we don't have a user-friendly notebook or script to do that right now. The biggest model is 800MB (FP32) so you won't even need a very big GPU to be able to fine-tune.
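Back-of-the-envelope, using the numbers above (treating GPU-hours as simply hours times GPU count, and assuming 4 bytes per FP32 parameter):

```python
# T2S from scratch: 8 hours on 96 A100s.
t2s_gpu_hours = 8 * 96                   # 768 A100-hours

# The tiny S2A model trains ~3x faster.
s2a_tiny_gpu_hours = t2s_gpu_hours / 3   # 256 A100-hours

# 800 MB of FP32 weights at 4 bytes/param is roughly 200M parameters,
# which is why fine-tuning fits on a single consumer GPU.
params = 800e6 / 4                       # 200 million

print(t2s_gpu_hours, s2a_tiny_gpu_hours, params)
```

So full from-scratch training is cluster-scale work, but the resulting model is small enough that fine-tuning on one GPU is plausible.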


Link to these in English? I found some hits that may be correct for Polish - but I'm guessing they're hosted somewhere canonical?


https://wolnelektury.pl/katalog/audiobooki/ is the Polish audiobook collection.

The English audiobooks are public domain recordings from LibriVox (via the LibriLight dataset).


Thank you. Is the Polish collection also a volunteer effort?

Link to librivox for others: https://librivox.org/


Not really; the Polish effort is run by a non-profit that hired professional voice actors.


Yeah, Mimic is a lot less resource-intensive. We are working to improve WhisperSpeech in this regard but it's probably always going to require more compute (but in return you'll get higher quality).

That said, if you have a modern NVIDIA GPU you should be able to run a voice-bot in real time with WhisperSpeech.


Will something like whisper.cpp be possible for WhisperSpeech?


We looked at this at one point and it seems whisper.cpp/llama.cpp have all the bits needed to make it work. I'd love to help if someone wanted to give it a shot.


Yup, we are using Whisper to transcribe automatically so we can train the model on just speech recordings, without human transcripts.

This works for any language that is well supported by the OpenAI Whisper model.


Where can we find the latest OpenAI language model rankings?


There is a plot of language performance on their repo: https://github.com/openai/whisper

I am not aware of a multi-lingual leaderboard for speech recognition models.

