Hacker News
StyleTTS2 – open-source Eleven-Labs-quality Text To Speech (github.com/yl4579)
725 points by sandslides 18 days ago | 234 comments

I made a 100% local voice chatbot using StyleTTS2 and other open source pieces (Whisper and OpenHermes2-Mistral-7B). It responds so much faster than ChatGPT. You can have a real conversation with it instead of the stilted Siri-style interaction you have with other voice assistants. Fun to play with!

Anyone who has a Windows gaming PC with a 12 GB Nvidia GPU (tested on 3060 12GB) can install and converse with StyleTTS2 with one click, no fiddling with Python or CUDA needed: https://apps.microsoft.com/detail/9NC624PBFGB7

The demo is janky in various ways (requires headphones, runs as a console app, etc), but it's a sneak peek at what will soon be possible to run on a normal gaming PC just by putting together open source pieces. The models are improving rapidly, there are already several improved models I haven't yet incorporated.

How hard on your end does the task of making the chatbot converse naturally look? Specifically I'm thinking about interruptions, if it's talking too long I would like to be able to start talking and interrupt it like in a normal conversation, or if I'm saying something it could quickly interject something. Once you've got the extremely high speed, theoretically faster than real time, you can start doing that stuff right?

There is another thing remaining after that for fully natural conversation, which is making the AI context aware like a human would be. Basically giving it eyes so it can see your face and judge body language to know if it's talking too long and needs to be more brief, the same way a human talks.

Yes, I implemented the ability to interrupt the chatbot while it is talking. It wasn't too hard, although it does require you to wear headphones so the bot doesn't hear itself and get interrupted.

The other way around (bot interrupting the user) is hard. Currently the bot starts processing a response after every word that the voice recognition outputs, to reduce latency. When new words come in before the response is ready it starts over. If it finishes its response before any more words arrive (~1 second usually) it starts speaking. This is not ideal because the user might not be done speaking, of course. If the user continues speaking the bot will stop and listen. But deciding when the user is done speaking, or if the bot should interrupt before the user is done, is a hard problem. It could possibly be done zero-shot using prompting of a LLM but you'd want a GPT-4 level LLM to do a good job and GPT-4 is too slow for instant response right now. A better idea would be to train a dedicated turn-taking model that directly predicts who should speak next in conversations. I haven't thought much about how to source a dataset and train a model for that yet.
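
For illustration, the restart-on-every-word policy described above could be sketched roughly like this. This is only a toy state machine, not the actual chatbot code; the `TurnTaker` class and its method names are made up for the example, and real speech recognition, LLM calls, and audio playback are left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TurnTaker:
    """Toy model of the policy: restart on every new word, speak if nothing new arrives."""
    words: List[str] = field(default_factory=list)
    pending_for: Optional[int] = None  # word count the in-flight response was started for
    speaking: bool = False

    def on_word(self, word: str) -> None:
        # A newly recognized word stops any speech and invalidates the in-flight response.
        self.words.append(word)
        self.speaking = False
        self.pending_for = len(self.words)  # (re)start generation for the longer prefix

    def on_response_ready(self, started_for: int) -> bool:
        # Speak only if no further words arrived while the response was being generated.
        if self.pending_for == started_for and started_for == len(self.words):
            self.speaking = True
            self.pending_for = None
            return True
        return False  # stale response for an older prefix: drop it


taker = TurnTaker()
taker.on_word("hi")
taker.on_word("there")
print(taker.on_response_ready(1))  # response for "hi" alone is stale
print(taker.on_response_ready(2))  # response for "hi there" gets spoken
```

A dedicated turn-taking model would replace the "did more words arrive in time" check with a learned prediction of who should speak next.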

Ultimately the end state of this type of system is a complete end-to-end audio-to-audio language model. There should be only one model, it should take audio directly as input and produce audio directly as output. I believe that having TTS and voice recognition and language modeling all as separate systems will not get us to 100% natural human conversation. I think that such a system would be within reach of today's hardware too, all you need is the right training dataset/procedure and some architecture bits to make it efficient.

As for giving the model eyes, actually there are already open source vision-language models that could be used for this today! I'd love to implement one in my chatbot. It probably wouldn't have social intelligence to read body language yet, but it could definitely answer questions about things you present to the webcam, read text, maybe even look at your computer screen and have conversations about what's on your screen. The latter could potentially be very useful, the endgame there is like GitHub Copilot for everything you do on your computer, not just typing code.

Thanks, fascinating insights. I think an everything-to-everything multimodal model could work if it's big enough because of transfer learning (but then there are latency issues), and so could a refined system built on LLMs/LMMs with TTS (like what you are using), but I haven't seen any good research on audio-to-audio language models. My suspicion is that that would take a lot of compute, much more than text, and that the amount of semantically meaningful accessible data might be much lower as well. And if you do manage to get to the same level of quality as text, what is latency like then? Not 100% sure, just intuitions, but I doubt it's great.

I like the idea of an RL predictor for interruption timing, although I think it might struggle with factual-correction interruptions. It could be a good way to make a very fast system, and if latency on the rest of the system is low enough you could probably start slipping in your "Of course", "Yeah, I agree", and "It was in March, but yeah" for truly natural speech. If latency is low you could just use the RL system to find opportunities to interrupt, give them to the LLM/LMM, and it decides how to interrupt, all the way from "mhm", to "Yep, sounds good to me", to "Not quite, it was the 3rd entry, but yeah otherwise it makes sense", to "Actually can I quickly jump on that? I just wanted to quickly [make a point]/[ask a question] about [some thing that requires exploration before the conversation continues]".

Tuning a system like this would be the most annoying activity in human history, but something like this has to be achieved for truly natural conversation so we gotta do it lol.

> although it does require you to wear headphones so the bot doesn't hear itself and get interrupted.

Maybe you can use some sort of speaker identification to sort this out?


A simple correlation of audio chunks from microphone and from the TTS should be enough to tell which parts in the input stream are re-recorded TTS. Much simpler, no?
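
A minimal sketch of that correlation idea, assuming the mic signal and the TTS output are available as plain lists of samples at the same sample rate. The function name `normalized_xcorr_peak` is hypothetical, and the sketch deliberately ignores room acoustics, background noise, and codec effects, so treat it as the idealized case only.

```python
import math


def normalized_xcorr_peak(mic, tts, max_lag):
    """Slide the TTS chunk over the mic signal and return (best_lag, peak_score).

    The score is the normalized cross-correlation, so 1.0 means the mic window
    is an exact scaled copy of the TTS chunk at that lag.
    """
    n = len(tts)
    tts_energy = math.sqrt(sum(x * x for x in tts))
    best_lag, best_score = 0, 0.0
    for lag in range(max_lag + 1):
        window = mic[lag:lag + n]
        if len(window) < n:
            break  # ran out of mic samples
        mic_energy = math.sqrt(sum(x * x for x in window))
        if mic_energy == 0 or tts_energy == 0:
            continue  # silence cannot be matched
        score = sum(a * b for a, b in zip(window, tts)) / (mic_energy * tts_energy)
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score


# Synthetic check: the "mic" hears the TTS chunk at half volume, 37 samples late.
tts_chunk = [math.sin(0.3 * i) for i in range(200)]
mic_signal = [0.0] * 37 + [0.5 * x for x in tts_chunk] + [0.0] * 20
lag, peak = normalized_xcorr_peak(mic_signal, tts_chunk, max_lag=50)
print(lag, round(peak, 3))
```

A high peak would flag that stretch of mic input as re-recorded TTS; where to set the threshold is something you would have to tune.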

It's not so simple when the impulse responses of the room, mic, and speakers are all unknown and possibly changing, plus there are unknown background sounds, possibly at a very high level. There's also unknown latency, which can be quite large, especially in the networked case, and maybe some codecs, and maybe some audio "enhancement" software the OEM installed on the user's machine. Also, ideally the computer would be able to hear the user even while it is speaking.

Echo cancellation is non-trivial for sure.

Could it be done more reasonably with the transcription? With diarisation the logic of "someone is saying exactly / almost exactly what I'm saying, that's probably me" might be pretty reasonable.
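
Something like this could be a starting point for the transcript-based check, using only Python's standard `difflib`. The helper names and the 0.8 threshold are assumptions for the sketch, and it does not do real diarisation; it only compares what was heard against what the bot recently said.

```python
from difflib import SequenceMatcher


def _normalize(text):
    # Lowercase and drop punctuation so "Hi Jim!" and "hi jim" compare equal.
    kept = "".join(c for c in text.lower() if c.isalnum() or c.isspace())
    return " ".join(kept.split())


def is_probably_self(heard, recently_spoken, threshold=0.8):
    """Return True if the heard transcript closely matches something the bot just said."""
    heard_norm = _normalize(heard)
    if not heard_norm:
        return False
    return any(
        SequenceMatcher(None, heard_norm, _normalize(line)).ratio() >= threshold
        for line in recently_spoken
    )


bot_lines = ["So what have you been up to lately?"]
print(is_probably_self("So what have you been up to lately", bot_lines))
print(is_probably_self("No, I'm just watching TV.", bot_lines))
```

This would still drop user speech that overlaps the bot's own, so it is a filter rather than a replacement for echo cancellation.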

Yes, this is a good idea. Too many good ideas, not enough time!

I installed this on my system and hit the issue you mention below: the bot hears itself, so it got into a loop of talking to itself and replying to itself. It was pretty funny.

Could the issue be that I am using a pair of bluetooth headphones and the microphone is built into that - what is the optimum setup, should I be listening on the headphones and using a different mic input instead of the headphone mic?

This is pretty interesting; I would love to get it to work. Running a 3060.


That's weird, Bluetooth headphones should work I think. Maybe there's something wrong with them. You can check if the mic can hear the speakers using Windows sound recorder. Try a different mic if you have one, the only important thing is the mic can't hear the headphones.

> It could possibly be done zero-shot using prompting of a LLM

That's how I've been thinking of doing it - seemed like you could use a much smaller GPT-J-ish model for that, and measure the relative probability of 'yes' vs 'no' tokens in response to a question like 'is the user done talking'. Seemed like even that would be orders of magnitude better than just waiting for silence.
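
To make the 'yes' vs 'no' comparison concrete, here is a small sketch that assumes you have already pulled the two next-token logits out of whatever model you run. How you obtain those logits depends on the inference library and is not shown; the function names and the 0.7 threshold are made up for the example.

```python
import math


def p_done(logit_yes, logit_no):
    """Probability of 'yes' when renormalizing over just the 'yes' and 'no' tokens."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    e_yes = math.exp(logit_yes - m)
    e_no = math.exp(logit_no - m)
    return e_yes / (e_yes + e_no)


def user_is_done(logit_yes, logit_no, threshold=0.7):
    # Respond only when the model is fairly confident the user has finished.
    return p_done(logit_yes, logit_no) >= threshold


print(round(p_done(3.0, 0.0), 3))
print(user_is_done(0.0, 3.0))
```

Raising the threshold makes the bot more patient, at the cost of a slower response when the user really is done.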

Instead of sacrificing flexibility by building one monolith model that does Audio to audio in one go, wouldn't it be better to train a model that handles conversing with the user (knows when the user is done talking, when it's hearing itself, etc) and leave the thinking to other, more generic models?

You don't lose flexibility with an end to end model. You lose controllability. But there are ways to mitigate that.

It would be very interesting to have something like BakLLaVA's image description fed from a webcam used as a context for the LLM. "You can see: <description of scene, description of changes from last snapshot>" or something along those lines in the system prompt.
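
As a rough sketch of that prompt layout: the function below just assembles the text, and the scene description is assumed to come from some vision-language model whose call is not shown. `build_system_prompt` and its parameters are hypothetical names.

```python
def build_system_prompt(base, scene, changes=None):
    """Append a webcam scene description (and optional diff) to a base system prompt."""
    parts = [base.strip(), f"You can see: {scene.strip()}"]
    if changes:
        parts.append(f"Changes since the last snapshot: {changes.strip()}")
    return "\n".join(parts)


print(build_system_prompt(
    "You are a helpful voice assistant.",
    "a desk with a laptop and a coffee mug",
    changes="the mug has moved to the left",
))
```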

You'd have to do something along the lines of what voice comms do to combat the output feedback problem. I think it involves an FFT to analyse the two signals and cancel out the feedback; I'm not 100% sure on the details.

I plan to change the audio input to use WebRTC, then I get echo cancellation and network transparency for free. Although dealing with WebRTC is a headache harder than doing the AI parts.

What do you find hard about WebRTC?

I would love to help. Would even code up a prototype if you wanted :)

For starters, every WebRTC demo I've tried has at least 400ms of round trip latency even on a loopback connection. Shoot me an email if you know WebRTC, would be good to chat with someone who knows stuff!

Deciding when someone is done speaking is hard to do well and impossible to do perfectly. Some people finish speaking, then think of something else to say and pretend they were still talking.

True, perfection isn't achievable but human level performance is all you need and it may be possible to do better than that.

Short-term could it be configured as push to talk?

Certainly, but then it has little advantage over e.g. ChatGPT voice mode. I guess running locally is an advantage but the voice and answer quality is worse. The much better latency and more natural conversation is what I like about it.

Am I wrong to think it would have a couple major advantages? Like using speakers without having to worry about echo cancellation, having a distinct interrupt signal, and still getting all the latency benefits (possibly even more once you get used to it since the conversational style has to assume the end of the user’s sentence instead of knowing the second they let go of the button)

You wouldn't quite have all the latency benefits because you'd have the additional delay between when you stop speaking and when you release the button (or cut off speech if you release too early). It wouldn't respond any faster because it's already responding at the fastest possible speed right now, it doesn't wait at all. And it wouldn't be hands free, it wouldn't feel like a natural conversation which is what I'm going for.

I'd rather use speaker diarization and/or echo cancellation to solve the problem without needing the user to press any buttons.

What I would really like is a push-to-talk app on my phone, so I can talk to my house from anywhere without worrying about putting a microphone in every room, or monkeying about with wakeword detection. I'm sure it's doable, I'm just not conscious of having seen it done.

Tried it but it seems it only works with Cuda 11 and I have 12 installed. Not really willing to potentially screw up my Cuda environment to try it.

Thanks for trying, what error message did you get? It works without CUDA installed at all on my test machine.

  Process Process-2:
  Traceback (most recent call last):
    File "multiprocessing\process.py", line 314, in _bootstrap
    File "multiprocessing\process.py", line 108, in run
    File "chirp.py", line 126, in whisper_process
    File "chirp.py", line 126, in <listcomp>
    File "faster_whisper\transcribe.py", line 426, in generate_segments
    File "faster_whisper\transcribe.py", line 610, in encode
  RuntimeError: Library cublas64_11.dll is not found or cannot be loaded
  tts initialized

Hmm, the dll is included in the app package but maybe there is a conflict with other installed DLLs on some machines. When releasing PC software I always expect this type of issue unfortunately. I plan to move away from faster_whisper which may fix this.

I have to say that the Python ecosystem is just awful for distribution purposes and I spent a lot longer on packaging issues than I did on the actual AI parts. And clearly didn't find all of the issues :)

Agree completely. But in this case the fault is with CUDA which never ever works without a struggle. It’s insane how hard it is to get stuff that works cross-platform without a lot of work using CUDA. Even PyTorch has an awkward way of dealing with it and they have more resources to figure it out than just about anyone.

Using a conda environment should be able to get around that I believe

Cool work! I tested it and got some mixed results:

1) it throws an error if it's installed to any drive other than C:\ --I moved it to C: and it works fine.

2) I'm seeing huge latency on an EVGA 3080Ti with 12GB. Also seeing it repeat the parsed input, even though I only spoke once, it appears to process the same input many times with slightly different predictions sometimes. Here's some logs:

  Latency to LLM response: 4.59 latency to speaking: 5.31
  speaking 4: Hi Jim!
  user spoke: Hi Jim.
  user spoke recently, prompting LLM. last word time: 77.81 time: 78.11742429999867
  latency to prompting: 0.31

  Latency to LLM response: 2.09 latency to speaking: 3.83
  speaking 5: So what have you been up to lately?
  user spoke: So what have you been up to lately?
  user spoke recently, prompting LLM. last word time: 83.9 time: 84.09415280001122
  latency to prompting: 0.19
  user spoke: So what have you been up to lately? No, I'm watching.
  user spoke a while ago, ignoring. last word time: 86.9 time: 88.92142140000942
  user spoke: So what have you been up to lately? No, just watching TV.
  user spoke a while ago, ignoring. last word time: 87.9 time: 90.76665070001036
  user spoke: So what have you been up to lately? No, I'm just watching TV.
  user spoke a while ago, ignoring. last word time: 87.9 time: 94.16581820001011
  user spoke: So what have you been up to lately? No, I'm just watching TV.
  user spoke a while ago, ignoring. last word time: 88.9 time: 97.85854300000938
  user spoke: So what have you been up to lately? No, I'm just watching TV.
  user spoke a while ago, ignoring. last word time: 87.9 time: 101.54986060000374
  user spoke: No, I just bought you a TV.
  user spoke a while ago, ignoring. last word time: 87.8 time: 104.51332219998585
  user spoke: No, I'll just watch you TV.
  user spoke a while ago, ignoring. last word time: 87.41 time: 106.60086529998807
  Latency to LLM response: 46.09 latency to speaking: 50.49

Thanks for posting it!


3) It's hearing itself and responding to itself...

Thanks for trying it and thanks for the feedback! Yes, right now you need to use headphones so it doesn't hear itself. Sometimes Whisper inexplicably fails to recognize speech promptly. It seems to depend on what you say, so try saying something else. I have improvements that I haven't had time to release yet that should improve the situation, and a lot more work is definitely needed, this is definitely MVP level stuff right now. This stuff is fixable but it'll take time.

Is 12GB the minimum? got an out of memory error with 8GB

Yes, unfortunately these models take a lot of VRAM. It may be possible to do an 8GB version but it will have to compromise on quality of voice recognition and the language model so it might not be a good experience.

This might be silly because of how few people it benefits, but could it be broken up on to multiple 8GB cards on the same system?

Yes, it absolutely could. You're right that this configuration is rare. Although people have been putting together machines with multiple 24GB cards in order to split and run larger models like llama2-70B.

The latest large models are 120B and 100k context such as Goliath and Tess XL

But whisper does not support input streaming, so you have to wait for the whole llm response to trigger the transcription or not?

Apparently, by running it on windows of audio very often:


Wow! For those of us who don't have the necessary GPU hardware, can you post a video?

How do you get Whisper to be fast?

Isn't it quite non-realtime?

Great question! Whisper processes audio in 30 second chunks. But on a fast GPU it can finish in only 100 milliseconds or so. So you can run it 10+ times per second and get around 100ms latency. Even better actually because Whisper will predict past the end of the audio sometimes.
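
The "re-run on the latest 30 seconds, many times per second" approach could look roughly like the sketch below. It assumes 16 kHz mono samples and takes the transcription function as a parameter, so no particular Whisper API is implied; `RollingAudioWindow` and `transcribe_loop` are hypothetical names, and a real loop would be driven by an audio callback rather than a list of chunks.

```python
from collections import deque

SAMPLE_RATE = 16_000   # assumption: 16 kHz mono input
WINDOW_SECONDS = 30    # Whisper works on 30-second chunks


class RollingAudioWindow:
    """Keeps only the most recent `seconds` of audio samples."""

    def __init__(self, sample_rate=SAMPLE_RATE, seconds=WINDOW_SECONDS):
        self.samples = deque(maxlen=sample_rate * seconds)

    def push(self, chunk):
        self.samples.extend(chunk)  # oldest samples fall off the front automatically

    def snapshot(self):
        return list(self.samples)


def transcribe_loop(window, chunks, transcribe):
    """Re-run `transcribe` on the whole window after every incoming chunk."""
    results = []
    for chunk in chunks:
        window.push(chunk)
        results.append(transcribe(window.snapshot()))
    return results


# Tiny demo with a stand-in "transcriber" that just reports the window length.
demo = RollingAudioWindow(sample_rate=10, seconds=3)
print(transcribe_loop(demo, [[1] * 20, [2] * 20], transcribe=len))
```

With 100 ms chunks this re-transcribes the whole window ten times per second, which is the inefficiency that only makes sense when a GPU is dedicated to one user.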

This is an advantage of running locally. Running whisper this way is inefficient but I have a whole GPU sitting there dedicated to one user, so it's not a problem as long as it is fast enough. It wouldn't work well for a cloud service trying to optimize GPU use. But there are other ways of doing real time speech recognition that could be used there.

The community upgrades to whisper are far faster than real-time, especially if you have a powerful gpu

There's several faster ones out there. I've been using https://github.com/Softcatala/whisper-ctranslate2 which includes a nice --live_transcribe flag. It's not as good as running it on a complete file but it's been helpful to get the gist of foreign language live streams.

use whisper-distil, it's like 5-8x faster

Hey modeless. Love it. Is your project open source by any chance? Would love to see it.

I haven't decided yet what I'm going to do with it. I think ideally I would open source it for people who have GPUs but also run it as a paid service for people who don't have GPUs. Open source that also makes money is always the holy grail :) I'll post updates on my Twitter/X account.

It threw a python exception for me and didn't generate speech

Thanks for trying, what exception did you get?

I tested StyleTTS2 last month, my step-by-step notes that might be useful for people doing local setup (not too hard): https://llm-tracker.info/books/howto-guides/page/styletts-2

Also I did a little speed/quality shootoff with the LJSpeech model (vs VITS and XTTS). StyleTTS2 was pretty good and very fast: https://fediverse.randomfoo.net/notice/AaOgprU715gcT5GrZ2
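
For anyone unsure how a speed figure expressed as a multiple of real time is computed, it is just seconds of audio produced divided by seconds spent synthesizing. A tiny sketch, where the function name is made up and the numbers are illustrative rather than measured:

```python
def realtime_factor(audio_seconds, wall_seconds):
    """How many seconds of audio are generated per second of compute."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


# Illustrative only: 30 s of speech synthesized in 2 s of wall-clock time.
print(realtime_factor(30.0, 2.0))
```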

> inferences at up to 15-95X (!) RT on my 4090

That's incredible!

Are infill and outpainting equivalents possible? Super-RT TTS at this level of quality opens up a diverse array of uses esp for indie/experimental gamedev that I'm excited for.

It is theoretically possible to train a model that, given some speech, attempts to continue the speech, e.g. Spectron: https://michelleramanovich.github.io/spectron/spectron/. Similarly, it is possible to train a model to edit the content, a la Voicebox: https://voicebox.metademolab.com/edit.html.

Great. :P

Me: Won’t it be great when AI can-

Computer: Finish your sentences for you? OMG that’s exactly what I was thinking!

>Are infill and outpainting equivalents possible?

Do you mean outpainting as in you still specify what words to say, or the model just extends the audio unconditionally, the way some image models expand past an image's borders without a specific prompt (in audio, like https://twitter.com/jonathanfly/status/1650001584485552130)?

Not sure what you mean: if you mean could inpainting and outpainting with image models be faster, it's a "not even wrong" question, similar to asking if the United Airlines app could get faster because American Airlines did. (Yes, getting faster is an option available to ~all code.)

If you mean could you inpaint and outpaint text...yes, by inserting and deleting characters.

If you mean could you use an existing voice clip to generate speech by the same speaker in the clip, yes, part of the article is demonstrating generating speech by speakers not seen at training time

I'm not sure I understand what you mean to say. To me it's a reasonable question asking whether text to speech models can complete a missing part of some existing speech audio, or make it go on for longer, rather than only generating speech from scratch. I don't see a connection to your faster apps analogy.

Fwiw, I imagine this is possible, at least to some extent. I was recently playing with xtts and it can generate speaker embeddings from short periods of speech, so you could use those to provide a logical continuation to existing audio. However, I'm not sure it's possible or easy to manage the "seams" between what is generated and what is preexisting very easily yet.

It's certainly not a misguided question to me. Perhaps you could be less curt and offer your domain knowledge to contribute to the discussion?

Edit: I see you've edited your post to be more informative, thanks for sharing more of your thoughts.

It imposes a cost on others when you make false claims, like that I said or felt the question was unreasonable.

I didn't and don't.

It is a hard question to understand and an interesting mind-bender to answer.

Less policing of the metacontext and more focusing on the discussion at hand will help ensure there's interlocutors around to, at the very least, continue policing.

Sorry but it was pretty obvious what he meant.

It's not, at all.

He could have meant speed, text, audio, words, or phonemes, with least probably images.

He probably didn't mean phonemes or he wouldn't be asking.

He probably didn't mean arbitrarily slicing 'real' audio and stitching on fake audio - he made repeated references to a video game.

He probably didn't mean inpainting and outpainting imagery, even though he made reference to a video game, because it's an audio model.

Thank you for explaining I deserve to get downvoted through the floor multiple times for asking a question because it's "obvious". Maybe you can explain to the rest of the class what he meant then? If it was obviously phonemes, will you then advocate for them being downvoted through the floor since the answer was obvious? Or is it only people who assume good faith and ask what they meant who deserve downvotes?

Inpainting and outpainting of images is when the model generates bits inside or outside the image that don't exist. By analogy he was talking about generating sound inside (I.e. filling gaps) or outside (extrapolating beyond the end) the audio.

I don't know why you would think he was talking about inpainting images, words. This whole discussion is about speech synthesis.

Right, _until he brought up inpainting and outpainting_. And as I already laid out, the audio options made just about as much sense as the art.

I honestly can't believe how committed you are to explaining to me that as the only person who bothered answering, I'm the problem.

I've been in AI art when it was 10 people in an IRC room trying to figure out what to do with a bunch of GPUs an ex-hedge fund manager snapped up, and spent the last week working on porting eSpeak, the bedrock of ~all TTS models, from C++.

It wasn't "obvious" they didn't mean art, and it definitely was not obvious that they want to splice real voice clips at arbitrary points and insert new words without being a detectable fake for a video game. I needed more info to answer. I'm sorry.

I'll be the first to admit that it was an off the cuff, vague, and unclear question, and I'm lucky some people got it.

Wait 'till you learn I'm a woman though. :>

Ignore the speed comment; it is unrelated to my question.

What I mean is, can output be conditioned on antecedent audio as well as text, analogous to how image diffusion models can condition inpainting and outpainting on static parts of an image and CLIP embeddings?

Yes, the paper and Eleven Labs have a major feature of "given $AUDIO_SET, generate speech for $TEXT in the same style of $AUDIO_SET"

No, in that, you can't cut it at an arbitrary mid-word point, say at "what tim" in "what time is it beijing", and give it the string "what time is it in beijing", and have it recover seamlessly.

Yes, in that, you can cut it at an arbitrary phoneme boundary, say 'this, I.S. a; good: test! ok?' in IPA is 'ðˈɪs, ˌaɪˌɛsˈeɪ; ɡˈʊd: tˈɛst! ˌoʊkˈeɪ?', and I can cut it 'between' a phoneme, give it the string and have it complete.

Perfect! Thank you

Thanks. Following the instructions now. BTW mamba is no longer recommended (for those like me who aren't already using it), and the #mambaforge anchor in the link didn't work.

I switched from conda to mamba a while ago and never looked back (it's probably saved dozens of hours from waiting for conda's slow as molasses package resolution). I'm looking at the latest docs and it doesn't look like there's any deprecation messages or anything (it does warn against installing mamba inside of conda, but that's been the case for a long time): https://mamba.readthedocs.io/en/latest/installation/mamba-in...

It looks like miniforge is still the recommended install method, but also the anchor has changed in the repo docs, which I've updated, thx. FWIW, I haven't run into any problems using mamba. I'm not a power user, so there are edge cases I might have missed, but I have over 35 mamba envs on my dev machine atm, so it's definitely been doing the job for me and remains wicked fast (if not particularly disk efficient).

I had somehow missed the introduction of mamba, and have been using the default conda solver (which I think is the 'classic' one). Apparently conda now supports using the mamba solver: https://www.anaconda.com/blog/a-faster-conda-for-a-growing-c...

  conda update -n base conda
  conda install -n base conda-libmamba-solver
  conda config --set solver libmamba

Was somewhat annoying to get everything to work as the documentation is a bit spotty, but after ~20 minutes it's all working well for me on WSL Ubuntu 22.04. Sound quality is very good, much better than other open source TTS projects I've seen. It's also SUPER fast (at least using a 4090 GPU).

Not sure it's quite up to Eleven Labs quality. But to me, what makes Eleven so cool is that they have a large library of high quality voices that are easy to choose from. I don't yet see any way with this library to get a different voice from the default female voice.

Also, the real special sauce for Eleven is the near instant voice cloning with just a single 5 minute sample, which works shockingly (even spookily) well. Can't wait to have that all available in a fully open source project! The services that provide this as an API are just too expensive for many use cases. Even the OpenAI one which is on the cheaper side costs ~10 cents for a couple thousand word generation.

To save people some time, this is tested on Ubuntu 22.04 (google is being annoying about the download link, saying too many people have downloaded it in the past 24 hours, but if you wait a bit it should work again):

  git clone https://github.com/yl4579/StyleTTS2.git
  cd StyleTTS2
  python3 -m venv venv
  source venv/bin/activate
  python3 -m pip install --upgrade pip
  python3 -m pip install wheel
  pip install -r requirements.txt
  pip install phonemizer
  sudo apt-get install -y espeak-ng
  pip install gdown
  gdown https://drive.google.com/uc?id=1K3jt1JEbtohBLUA0X75KLw36TW7U1yxq
  7z x Models.zip
  rm Models.zip
  gdown https://drive.google.com/uc?id=1jK_VV3TnGM9dkrIMsdQ_upov8FrIymr7
  7z x Models.zip
  rm Models.zip
  pip install ipykernel pickleshare nltk SoundFile
  python -c "import nltk; nltk.download('punkt')"
  pip install --upgrade jupyter ipywidgets librosa
  python -m ipykernel install --user --name=venv --display-name="Python (venv)"
  jupyter notebook
Then navigate to /Demo and open either `Inference_LJSpeech.ipynb` or `Inference_LibriTTS.ipynb` and they should work.

Very helpful, thanks!

One thing I've seen done for style cloning is a high quality fine tuned TTS -> RVC pipeline to "enhance" the output. TTS for intonation + pronunciation, RVC for voice texture. With StyleTTS and this pipeline you should get close to ElevenLabs.

I suspect they are doing many more things to make it sound better. I certainly hope open source solutions can approach that level of quality, but so far I've been very disappointed.

RVC? R… Voice Model?

Retrieval-based voice conversion, apparently.

The LibriTTS demo clones unseen speakers from a five second or so clip

Ah ok, thanks. I tried the other demo.

I tried it. Sounds absolutely nothing like my voice or my wife's voice. I used the same sample files as I used 2 days ago on the Eleven Labs website, and they worked flawlessly there. So this is very, very far from being close to "Eleven Labs quality" when it comes to voice cloning.

Ah that's disappointing, have you tried https://git.ecker.tech/mrq/ai-voice-cloning ? I've had decent results with that, but inference is quite slow.

ElevenLabs are based on Tortoise-TTS which was already pre-trained on millions of hours of data, but this one was only trained on LibriTTS which was 500 hours at best. If you have seen millions of voices, there are definitely gonna be some of them that sound like you. It is just a matter of training data, but it is very difficult to have someone collect these large amounts of data and train on it.

The speech generated is the best I've heard from an open source model. The one test I made didn't make an exact clone either but this is still early days. There's likely something not quite right. The cloned voice does speak without any artifacts or other weirdness that most TTS systems suffer from.

Yep. Tried as well. Tried a little clip of Tony Soprano and it came out as a British guy.

xTTSv2 does it much better. But the quality on the trained voices is great though.

Yes, same for my voice. Made me sound British and didn't capture anything special about my voice that makes it recognizable.

have you tested longer utterances with both ElevenLabs and with StyleTTS? Short audio synthesis is a ~solved problem in the TTS world but things start falling apart once you want to do something like create an audiobook with text to speech.

I can say that the paid service from ElevenLabs can do long form TTS very well. I used it for a while to convert long articles to voice to listen to later instead of reading. It works very well. I only stopped because it gets a little pricey.

The OpenAI API is ten times cheaper and a fair bit faster.

Also, ElevenLabs keeps diverging for me, and starts mispronouncing words after two or three sentences.

Funnily enough, the TTS2 examples sound better than the ground truth [0]. For example, the "Then leaving the corpse within the house [...]" example has the ground truth pronounce "house" weirdly, with some change in the tonality that sounds higher, but the TTS2 version sounds more natural.

I'm excited to use this for all my ePub files, many of which don't have corresponding audiobooks, such as a lot of Japanese light novels. I am currently using Moon+ Reader on Android which has TTS but it is very robotic.

[0] https://styletts2.github.io/

First Wife is a professional voice-over actor. I saw someone left her a bad review saying "Clearly an AI."

2023. There is no way to win.

The pace is better, but imho there is still a very noticeable “metallic” tone which makes it inferior to the real thing.

Impressive results nonetheless, and superior to all other TTS.

how are you planning on using this with epubs? i'm in a similar boat. would really like to leverage something like this for ebooks.

I wonder if you can add a TTS engine to Android as an app or plugin, then make Moon+ Reader or another reader use that custom engine. That's probably how I'd do it for the easiest approach, but if that doesn't work, I might just have to make my own app.

I’m planning on making a self-host solution where you can upload files and the host sends back the audio to play, as a first pass on this tech. I’ll open source the repo after fiddling and prototyping. I’ve needed this kinda thing for a long time!

Please make sure to link it back to HN so that we can check it out!

You can! [rhvoice](https://rhvoice.org/) is an open source example.

HN title at present is "StyleTTS2 – open-source Eleven Labs quality Text To Speech". Actual title at the far end doesn't name any particular other product; arXiv paper linked from there doesn't mention Eleven Labs either. I thought this sort of editorializing was frowned on.

Eleven Labs is the gold standard for voice synthesis. There is nothing better out there.

So it is extremely notable for an open source system to be able to approach this level of quality, which is why I'd imagine most would appreciate the comparison. I know it caught my attention.

OpenAI's TTS is better than Eleven Labs, but they don't let you train it to have a particular voice out of fear of the consequences.

I concur that, for the use cases that OpenAI's voices cover, it is significantly better than Eleven.

But is this even approaching Eleven? Doesn't seem like it from the other comments here.

It is editorializing and it is an exaggeration. However I've been using StyleTTS2 myself and IMO it is the best open source TTS by far and definitely deserves a spot on the top of HN for a while.

Yes, it's against the guidelines. In fact, when I read the title, I didn't think it was a new research paper but a random GitHub project.

Out of curiosity - to folks that have had success with this...

This voice cloning is... nothing like XTTSv2, let alone ElevenLabs.

It doesn't seem to care about accents at all. It does pretty well with pitch and cadence, and that's about it.

I've tried all kinds of different values for alpha, beta, embedding scale, diffusion steps.

Anyone else have better luck?

Sure it's fast and the sound quality is pretty good, but I can't get the voice cloning to work at all.

See my previous comment about this point. ElevenLabs are based on Tortoise-TTS which was already pre-trained on millions of hours of data, but this one was only trained on LibriTTS which was 500 hours at best. XTTS was also trained with probably millions of speakers in more than 20 languages.

If you have seen millions of voices, there are definitely gonna be some of them that sound like you. It is just a matter of training data, but it is very difficult to have someone collect these large amounts of data and train on it.

> It is just a matter of training data, but it is very difficult to have someone collect these large amounts of data and train on it.

It's really not that difficult, they are trained mostly on audiobooks and high quality audio from yt videos. If we talk about EV model then we are talking about around 500k hours of audio, but Tortoise-TTS is only around 50k from what I remember.

What's your basis for the claim that they are based on TorToiSe? I have seen this claim made (and rebutted) many times.

Very similar features, quite slow inference speed, and various rumors.

See the conclusion remarks in the paper - they acknowledge that voice cloning is not that good (yet).

I had the same experience as what you described (with a lot of experimentation with alpha and beta, as well as uploading different audio clips).

The quality is really really INSANE and pretty much unimaginable in the early 2000s.

Could have interesting prospects for games where you have an LLM assuming a character and such TTS giving those NPCs a voice.

This is a big thing for one area I'm interested in - golf simulation.

Currently playing in a golf simulator has a bit of a post-apocalyptic vibe. The birds are cheeping, the grass is rustling, the game play is realistic, but there's not a human to be seen. Just so different from the smacktalking of a real round, or the crowd noise at a big game.

It's begging for some LLM-fuelled banter to be added.

In Super Video Golf, which is more of an old-school/retro-style game that a friend of mine made, there are some clapping sound effects when people are in view. However, I feel like the nature sound on its own is also kind of relaxing.

Or the occasional "Fore!!"s. :-)

Just tried the Colab notebooks. Seems to be very good quality. It also supports voice cloning.

Great stuff, took a look through the README but... what are the minimum hardware requirements to run this? Is this gonna blow up my CPU / harddrive?

Not sure. The only inference demos are colab notebooks. The models are approx 700mb each so I imagine it will run on modest gpu

Would it run in a cheap non-GPU server?

Seems to run about "2x realtime" on 2015 4 core i7-6700HQ laptop, that is, 5 seconds to generate 10 seconds of output. Can imagine that being 4x or greater on a real machine
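
For anyone who wants to reproduce a figure like "2x realtime" on their own hardware, here is a minimal sketch of how the measurement could be done. The `synthesize` callable and the 24 kHz sample rate are placeholders (assumptions), not StyleTTS2's actual API; swap in whatever inference function you are using.

```python
import time


def realtime_speedup(synthesize, text, sample_rate):
    """Return (speedup, audio_seconds, wall_seconds) for one synthesis call.

    `synthesize` is a hypothetical callable that takes text and returns a
    sequence of audio samples; it stands in for the real inference function.
    """
    start = time.perf_counter()
    samples = synthesize(text)
    wall_seconds = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    return audio_seconds / wall_seconds, audio_seconds, wall_seconds


def fake_synthesize(text):
    # Stand-in model: pretends to work for 50 ms, then returns 1 s of silence.
    time.sleep(0.05)
    return [0.0] * 24000


if __name__ == "__main__":
    speedup, audio_s, wall_s = realtime_speedup(fake_synthesize, "hello", 24000)
    print(f"{audio_s:.1f}s of audio in {wall_s:.2f}s -> {speedup:.1f}x realtime")
```

A speedup of 2.0 here corresponds to the "5 seconds to generate 10 seconds of output" case. Averaging over several runs, and discarding the first one, would give a steadier number.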

I skimmed the github but didn't see any info on this, how long does it take to finetune to a particular voice?

I really want to try this but making the venv to install all the torch dependencies is starting to get old lol.

How are other people dealing with this? Is there an easy way to get multiple venvs to share like a common torch venv? I can do this manually but I'm wondering if there's a tool out there that does this.
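
One stdlib option worth knowing about: a venv created with system site packages enabled can import whatever is already installed in the base Python installation, so a single heavy torch install can be reused while project-specific packages stay in the venv. A minimal sketch, with a hypothetical directory name; it only helps when the projects can agree on one torch/CUDA build.

```python
import venv
from pathlib import Path


def create_shared_venv(env_dir, with_pip=True):
    """Create a venv that can also see the base interpreter's site-packages.

    Packages installed into the venv take precedence over the shared ones,
    so a project can still pin its own versions of lighter dependencies.
    Returns the path to the generated pyvenv.cfg file.
    """
    env_dir = Path(env_dir)
    venv.create(env_dir, system_site_packages=True, with_pip=with_pip)
    return env_dir / "pyvenv.cfg"


if __name__ == "__main__":
    cfg = create_shared_venv("tts-env")  # hypothetical project directory
    print(cfg.read_text())
```

The command-line equivalent is `python -m venv --system-site-packages tts-env`.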

I use nix to setup the python env (python version + poetry + sometimes python packages that are difficult to install with poetry) and use poetry for the rest.

The workflow is:

  > nix flake init -t github:dialohq/flake-templates#python
  > nix develop -c $SHELL
  > # I'm in the shell with poetry env, I have a shell hook in the nix devenv that does poetry install and poetry activate.

I generally try to use Docker for this stuff, but yeah, it's the main reason why I pass on these, even though I've been looking for something like this. It's just too hard to figure out the dependencies.

Can relate to this problem a lot. I have considered starting using a Docker dev container and making a base image for shared dependencies which I then can customize in a dockerfile for each new project, not sure if there's a better alternative though.

Yeah there is the official Nvidia container with torch+cuda pre-installed that some projects use.

I feel more projects should start with that as the base instead of pinning on whatever variants. Most aren't using specialized CUDA kernels after all.

Suppose there's the answer, just pick the specific torch+CUDA base that matches the major version of the project you want to run. Then cross your fingers and hope the dependencies mesh :p.

Same here. I'm using conda and eyeing simply installing a pytorch into the base conda env

I don't think "base" works like that (while it can be a fallback for some dependencies, afaik, Python packages are isolated/not in path). But even if you could, don't do it. Different packages usually have different pytorch dependencies (often CUDA as well) and it will definitely bite you.

The biggest optimization I've found is to use mamba for everything. It's ridiculously faster than conda for package resolution. With everything cached, you're mostly just waiting for your SSD at that point.

(I suppose you could add the base env's lib path to the end of your PYTHONPATH, but that sounds like a sure way to get bitten by weird dependency/reproducibility issues down the line.)

Thank you! First time I've come across it. Looks very promising.

> is starting to get old lol.

If it's starting to get old, then this means that an LLM like Copilot should be able to do it for you, no?

I mean that I already have like 10 different torch venvs for different projects all with various pinned versions and CUDA variants.

Still worth the trade-off of not having to deal with dependency hell, but you start to wonder if there is a better way. All together this is many GBs of duplicated libs, wasted bandwidth and compute.

Curious if we'll see a Civitai-style LoRA[1] marketplace for text-to-speech models.

1 = https://github.com/microsoft/LoRA

Having now tried it (the linked repo links to pre-built colab notebooks):

1) It does a fantastic job of text-to-speech.

2) I have had no success in getting any meaningful zero-shot voice cloning working. It technically runs and produces a voice, but it sounds nothing like the target voice. (This includes trying their microphone-based self-voice-cloning option.)

Presumably fine-tuning is needed - but I am curious if anyone had better luck with the zero-shot approach.

What's a ballpark estimate for inference time on a modern CPU?

If AI will render some jobs obsolete, I suppose the first one will be audio book narrators and voice actors.

I can see a future where the label "100% narrated by a human" (and similar in other industries) will be a thing

A la, A Young Lady's Illustrated Primer.

"No humans were fired in the making of this film"

Hardly. Imagine licensing your voice to Amazon so that any customer could stream any book narrated in your likeness without you having to commit the time to record. You could still work as a custom voice artist, all with a "no clone" clause if you chose. You could profit from your performance and craft in a fraction of the time, focusing as your own agent on the management of your assets. Or, you could just keep and commit to your day job.

Just imagine hearing the final novel of ASoIaF narrated by Roy Dotrice and knowing that a royalty went to his family and estate, or if David Attenborough willed the digital likeness of his voice and its performance to the BBC for use in nature documentaries after his death.

The advent of recorded audio didn't put artists out of business, it expanded the industries that relied on them by allowing more of them to work. Film and tape didn't put artists out of business, it expanded the industries that relied on them by allowing more of them to work. Audio digitization and the internet didn't put artists out of business; it expanded the industries that relied on them by allowing more of them to work.

And TTS won't put artists out of business, but it will create yet another new market with another niche that people will have to figure out how to monetize, even though 98% of the revenues will still somehow end up with the distributors.

What you're not considering here is that a large majority of this industry is made up of no-name voice actors who have a pleasant (but perfectly substitutable) voice, which is now something that AI can do perfectly and at a fraction of the price.

Sure, celebrities and other well-known figures will have more to gain here as they can license out their voice; but the majority of voice actors won't be able to capitalize on this. So this is actually even more perverse because it again creates a system where all assets will accumulate at the top and there won't be any distributions for everyone else.

No, I am. I work with them, and I've been one (am one, rarely).

I listed just one possible use, but I also see voice cloning and advanced TTS expanding access for evocative instruction, as an aid to study style and expand range.

Don't be afraid on their behalf. The dooming you're talking about applied to every one of the technological changes I already listed, and we employ more performers and artists today than ever in history.

When animation went digital, we graduated more storyboard artists and digital animators. When music notation software and sampling could replace musicians and orchestras, we graduated more musicians and composers trained on those tools. Now it's the performing arts, and no one in industry is going to shrink their pool of available talent (or risk ire) by daring conflate authenticity and performance with virtual impersonation. Performance capture and vfx also didn't kill or consolidate the movie industry - it allowed it to expand.

Art evolves, and so does its business. People who love art want to see people who do art succeed. I'm optimistic.

> When animation went digital, we graduated more storyboard artists and digital animators. When music notation software and sampling could replace musicians and orchestras, we graduated more musicians and composers trained on those tools.

What you explained is that tech has changed the tools used by artists.

It's substantially different with AI-based TTS, though. It's not a tool for artists, but it's a tool for movie/game/book publishers to replace human voice actors. The AI will be much much more scalable and cheaper.

TTS actually allows scope for far more different artists' likenesses to be incorporated. A book can be read with all the characters having a different voice entirely. This is difficult currently and relies on the skill of the performer.

I don't know, I feel like the work produced through voice acting is more of a commodity than work in the other industries that you're describing. Sure, a voice actor can add a lot of emotion and verbal nuance in a way that is differentiating, but I'm not sure if the difference is enough to matter for most people for the vast majority of cases. (Or I may be too dense to realize it). This is in contrast to, say, the performing arts, where there are in my opinion many more dimensions to the creative output, which makes it less perfectly substitutable.

Where do you see AI being most used in a production pipeline?

Do you think it will replace actors or that it might just reduce the burden on existing talent, like canned audio has done for decades? Will it make ADR easier or cheaper? Will it actually save anyone any money who wants to ever be able to hire a living actor again?

There are a lot of clever sounding, low probability arguments here, and I think a lot of people don't understand the work well enough to identify what are and aren't the elephants.

You're not really thinking it through. I have friends involved in the VA business, and it's only gotten more competitive as time has progressed - this is partially because it's rare that we need a voice actor that needs to create a crazy Looney Tunes sounding voice, the majority of VA work is surprisingly just close to the natural sounding voice of the VA themselves.

It's rare that you need a talent like Dan Castellaneta, Mel Blanc, etc.

Secondly, yes, VA licensing will become a thing – but that means that jobs that would previously have gone to lesser-known voice actors, because the major players simply didn't have enough time to take those gigs, will no longer be available to them. A TTSVA can do unlimited recordings.

Thirdly, major studios that would require hundreds of voices for video games and other things don't have to license known voices at all, they can just generate brand new ones and pay zero licensing fees.

The point is no one will pay for any of that if you can just clone someone's voice locally. Or just tell the AI how you want it to sound. Your argument literally ignores the entire elephant in the room.

I've been playing with XTTSv2 on my 3080ti, and generation takes slightly less time than the length of the final audio. It's also good quality, but these samples sound better.

Excited to try it out!

The weights aren’t MIT-licensed, so this is not usable in commercial applications, right?

It is usable in commercial applications given you disclose the use of AI. This applies only to the pre-trained models. You can train your own from scratch without these restrictions.

You can fine tune it on your own voice and also not be required to disclose the use of AI.

What are the chances this gets packaged into something a little more streamlined to use? I have a lot of ebooks I'd love to generate audio versions of.

This only works for English voices right?

No? From the readme:

In Utils folder, there are three pre-trained models:

    ASR folder: It contains the pre-trained text aligner, which was pre-trained on English (LibriTTS), Japanese (JVS), and Chinese (AiShell) corpus. It works well for most other languages without fine-tuning, but you can always train your own text aligner with the code here: yl4579/AuxiliaryASR.

    JDC folder: It contains the pre-trained pitch extractor, which was pre-trained on English (LibriTTS) corpus only. However, it works well for other languages too because F0 is independent of language. If you want to train on singing corpus, it is recommended to train a new pitch extractor with the code here: yl4579/PitchExtractor.

    PLBERT folder: It contains the pre-trained PL-BERT model, which was pre-trained on English (Wikipedia) corpus only. It probably does not work very well on other languages, so you will need to train a different PL-BERT for different languages using the repo here: yl4579/PL-BERT. You can also replace this module with other phoneme BERT models like XPhoneBERT which is pre-trained on more than 100 languages.

Those are just parts of the system and don't make a complete TTS. In theory you could train a complete StyleTTS2 for other languages but currently the pretrained models are English only.

I am an introvert: I rarely socialize, listen to podcasts at 2x speed, and mostly use subtitles rather than listening to audio for movies, so I have a below-average ability to differentiate humans from robots.

I asked someone to play the recordings for me to differentiate. I could not tell which was human (only between StyleTTS2 and Ground truth. The others were obvious)

This is great! Nice work.

I made my own whisper & auto-typer which types what you say (forked whisper-typer).

I added OpenAI Q/A and RAG query feature so I could ask it questions (instead of auto keystroke typing) by voice command. For responses to questions, I used Eleven Labs - but even with latency optimized & streaming, it was slow, so disabled it.

I just swapped from OpenAI to Mistral 7b for Q/A querying. Much more responsive. Stoked to explore StyleTTS2 now!

Really glad that I came across your post. Thank you for sharing!

How fast is inference with this model?

For reference, I'm using 11Labs to synthesize short messages - maybe a sentence or something, using voice cloning, and I'm getting it at around 400 - 500ms response times.

Is there any OS solution that gets me to around the same inference time?

It depends on hardware but IIRC on V100s it took 0.01-0.03s for 1s of audio.
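
As a rough sanity check against the 400 to 500 ms figure above, the quoted V100 numbers can be turned into a per-message estimate. The 4-second message length is an assumption, and the estimate covers only model compute, not network round trips, text preprocessing, or audio playback startup.

```python
def compute_seconds(audio_seconds, cost_per_audio_second):
    """Estimated model compute time for a clip of the given length."""
    return audio_seconds * cost_per_audio_second


if __name__ == "__main__":
    message_seconds = 4.0  # assumed length of a one-sentence message
    for cost in (0.01, 0.03):  # seconds of compute per second of audio, as quoted
        ms = compute_seconds(message_seconds, cost) * 1000
        print(f"{cost:.2f} s per audio second -> about {ms:.0f} ms of compute")
```

Under those assumptions the compute portion comes to roughly 40 to 120 ms, so on that class of GPU the model itself would not be the bottleneck for short messages.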

It should be pretty easy to make training data for TTS. The Whisper STT models are open so just chop up a ton of audio and use Whisper to annotate it, then train the other direction to produce audio from text. So you’re basically inverting Whisper.

STT training data includes all kinds of "noisy" speech so that the model learns to recognise speech in any conditions. TTS training data needs to be as clean as possible so that you don't introduce artefacts in the output and this high-quality data is much harder to get. A simple inversion is not really feasible or at least requires filtering out much of the data.
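
To make the filtering step concrete, here is a rough sketch that keeps only Whisper segments that look confidently transcribed. It assumes the open-source `openai-whisper` package, whose `transcribe()` result includes per-segment `avg_logprob` and `no_speech_prob` fields; the thresholds are illustrative guesses, not tuned values. Transcription confidence is only a proxy: it will not catch background music or reverb, so a real pipeline would need an audio-quality check as well.

```python
def filter_segments(segments, min_logprob=-0.5, max_no_speech=0.1,
                    min_seconds=1.0, max_seconds=15.0):
    """Keep Whisper-style segments that are confidently transcribed speech.

    Each segment is a dict with "start", "end", "text", "avg_logprob" and
    "no_speech_prob" keys, as found in the "segments" list returned by
    openai-whisper. All thresholds are illustrative assumptions.
    """
    kept = []
    for seg in segments:
        duration = seg["end"] - seg["start"]
        if not (min_seconds <= duration <= max_seconds):
            continue  # too short or too long to be a useful training clip
        if seg["avg_logprob"] < min_logprob:
            continue  # low-confidence transcription
        if seg["no_speech_prob"] > max_no_speech:
            continue  # probably silence or non-speech audio
        kept.append((seg["start"], seg["end"], seg["text"].strip()))
    return kept


def annotate(path, model_name="base"):
    """Transcribe one audio file and return its filtered segments."""
    import whisper  # imported lazily; requires `pip install openai-whisper`
    model = whisper.load_model(model_name)
    result = model.transcribe(path)
    return filter_segments(result["segments"])
```

Each kept tuple gives the start time, end time, and text needed to cut a clip and pair it with its transcript.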

I think you’re talking about just using Whisper to annotate audio for a TTS pipeline but someone from Collabora actually created a TTS model directly from Whisper embeddings https://github.com/collabora/WhisperSpeech

As a tangent away from LLMs, is there an integration available to be used in Android as a TTS engine? The TTS voice that I have now (RHVoice) for OSMAnd is really driving me crazy and almost makes me want to go back to Google Maps.

Is it possible to somehow optimize the model to run on a Raspberry Pi with 4 GB of RAM?

You may want to try Piper for this case (RPi 4): https://github.com/rhasspy/piper

I was able to get it to work with libjemalloc.

How fast is it on your raspberry?

Super slow. On my Mac Mini the inference was running in seconds, on Raspberry, minutes.

They really should have uploaded the models to Hugging Face rather than Gdrive.

Is there a way to port this to iOS? Apple doesn't provide an API for their version of this.

Yes, please integrate it with Mistral and Whisper. This has got to get into the LLM frontends.

Done: https://apps.microsoft.com/detail/9NC624PBFGB7

It's mostly just a demo for now and a little bit janky but it's fun to chat with and you can see the promise for 100% local voice AI in the future.

So, we've got this open-source TTS wizardry going on, which is kinda like if Siri had a caffeine overdose - faster, snappier, and way more fun at parties. This thing is running on gaming rigs with beefy GPUs, and it's apparently so user-friendly, even your grandma could set it up without accidentally summoning a digital demon.

But here's the real kicker - it's got the manners of a Victorian gentleman. You can rudely interrupt it mid-sentence, and it'll just stop and listen. Politeness level 100. The reverse, though - getting Mr. Bot to interrupt you - is still in the 'that's too much brain for my silicon' phase. Like, how do you teach a bunch of 1s and 0s to know when you're just taking a dramatic pause or actually done with your TED talk?

And get this - they're talking about making this bot read body language. Imagine your laptop judging you for your slouchy posture or that 'I haven't slept properly in days' look. Creepy? Maybe a bit. Cool? Absolutely.

In conclusion, StyleTTS2 is shaping up to be the cool new kid on the block, but it's still learning the ropes of human conversation. It's like that super smart friend who knows everything about quantum physics but can't tell when you're sarcastically saying 'Yeah, sure, let's invade Mars tomorrow.'

Is this really opensource and/or free software? like code, data(set/s) and models?

I am quite tired of seeing "open-source" advertisements where half or more is not really free.

general psa: please be honest in your announcements :|

MIT licensed. Models, code, and everything is available right there when you click the link.

Maybe actually check it out before complaining.

But you are wrong: the trained models are hosted separately on Google Drive and come with the following text, which seems to be an additional license agreement that also covers using the software and any trained model.

License Part 2 Text: "Before using these pre-trained models, you agree to inform the listeners that the speech samples are synthesized by the pre-trained models, unless you have the permission to use the voice you synthesize. That is, you agree to only use voices whose speakers grant the permission to have their voice cloned, either directly or by license before making synthesized voices pubilc, or you have to publicly announce that these voices are synthesized if you do not have the permission to use these voices."

That’s for the pre-trained models. Train one up yourself.

silicon valley is very leaky, eleven labs is widely rumored to have raised a huge round recently. great timing because with OpenAI's TTS and now this thing the options in the market have just expanded greatly.

Once this is working, is there a simple way to switch voices with the default downloaded models? Or does this require downloading other models or generating them?

When trying to input a larger amount of text I get the error:

The expanded size of the tensor (4293) must match the existing size (512)

Any way to fix this from the IPython notebook examples?
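
The 512 in that message looks like a maximum input length (an assumption based on the error text alone, not something confirmed from the code). If so, one workaround is to split long text into sentence-sized chunks, synthesize each chunk separately, and concatenate the audio. A minimal sketch of the splitting step; `max_chars` is a conservative stand-in, since the real limit is presumably counted in tokens rather than characters:

```python
import re
import textwrap


def chunk_text(text, max_chars=400):
    """Split text into chunks of whole sentences, each at most max_chars long.

    A single sentence longer than max_chars is broken on whitespace.
    max_chars is an assumed, conservative proxy for the model's real limit.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        # Break up any single sentence that is too long on its own.
        if len(sentence) > max_chars:
            pieces = textwrap.wrap(sentence, max_chars)
        else:
            pieces = [sentence]
        for piece in pieces:
            if current and len(current) + 1 + len(piece) > max_chars:
                chunks.append(current)
                current = piece
            else:
                current = f"{current} {piece}" if current else piece
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("First sentence. Second one! A third?", max_chars=20):
        print(repr(chunk))
```

In a notebook, each chunk would then be passed to the existing inference call in a loop. Splitting at sentence boundaries keeps prosody from breaking mid-sentence.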

This is really harmful and unethical work. It will be used to hurt millions of elderly people with scams. That's the real application that will happen 100x more than anything else. It's unethical and harmful to release tools that will be overwhelmingly used to hurt elderly people. What they should do about it is: Stop releasing models. Only release a service so that scammers will not use it. Also, only release audio that is watermarked, so that apps can tell that a phone call might be a scam. When they share models with researchers, use previous best practices: post a Google Form to request access.

Just imagine if this line of thinking was used elsewhere.

This tech is already out of the bag and I thank the author(s) for the contribution to humanity. The correct solution here is not to shove your head in the sand and ignore reality, but to get your government to penalize any country or company that facilitates this crime. If they can force severe penalties for other financial crimes and funding terrorism, they can do the same here.

it's funny because just yesterday I posted:

> soon as it's out, a whole bunch of extremely privileged ML people will throw their hands up and say, "oh well, cats out of the bag."


Scammers scamming old people is already very wide spread, so should we maybe outlaw telephones as well? Or maybe mandate anti scamming filters that disconnect if something is discussed that could be a scam? If I think about it that actually would make more sense, but still be problematic.

Cars actually kill over a million people per year. Not saying this is good, just that all technology has its tradeoffs.

Millions of elderly people are already getting scammed by overseas call centers so unless we do something more significant this tech will not make one iota of a difference.

That's not really true, most scammers have a male voice with a heavy accent. When they have tools that easily disguise their voice, scammers can reach many more elderly people.

That might have been true about a year ago, but I've been getting calls from well-spoken native-level scammers for about two months now. They are so frequent that I can put them on speaker during family gatherings to raise awareness.

Sample sizes of 1 are never representative but they definitely have full access to native speakers or tech that can generate very passable speech.

It seems quite possible that the change you've seen in these last two months is because some have started using these models. More likely than a sudden huge shift in either the country of origin or English skills of the scammers.

My point is that these models were already out there before StyleTTS2 was released. Plugging your ears and demanding their regulation in your country will not make them disappear.

Been looking for a speech to text that can work in real time and run locally, anyone know which are the best options available?

Those sound incredibly good.

Though would def like to clone a pleasant voice on it before using. Those sound good but not my cup of tea

Very impressive. It would take me a long time to even guess that some of these are text to speech.

Someone please create a TTS with marked-down emotions/intonations.

Well done, been waiting for a moment like this. Will give it a try!

Wow this thing is wicked fast!

> MIT license

> Before using these models, you agree to [...]

No, this is not MIT. If you don't like MIT license then feel free to use something else, but you can't pretend this is open source and then attempt to slap on additional restrictions on how the code can be used.

As I understand it the source code is licensed MIT, the weights are licensed "weird proprietary license that doesn't explicitly grant you any rights and implicitly probably grants you some usage rights so long as you tell the listeners or have permission from the voice you cloned".

Which, if you think the weights are copyright-able in the first place, makes them practically unusable for anything commercial/that you might get sued over because relying on a vague implicit license is definitely not a good idea.

And if you don't think weights are copyrightable, it means nothing at all.

I think you mis-parsed the disclaimer. It's just warning people that cloned voices come with a different set of rights to the software (because the person the voice is a clone of has rights to their voice).

(Don’t let’s derail the conversation, please, but “disclaimer” is completely the wrong word here. This is a condition of use. A disclaimer is “this isn’t mine” or “I’m not responsible for this”. Disclaimers and disclosures are quite different things and commonly confused, but this isn’t even either of them.)

This always annoys me when people put "disclaimers" on their posts. IANAL, so tired of hearing that one. It's pointless because even if you were a lawyer, you cannot meaningfully comment on a case without the details, jurisdiction, circumstance, etc. Next, it's meaningless because is anyone going to blindly bow down and obey if you state the opposite? "Yes, I AM a lawyer, you do not need to pay taxes, they are unconstitutional." Thirdly, when they "disclaimer" themselves as working at google, that's not a dis-claimer, that's a "claimer", asserting the affirmative. I know their companies require them to not speak for the company without permission, but I hardly ever hear that one, usually it's just some useless self-disclosure that they might be biased because they work there. Ok, who isn't biased?

What bugs me overall is that it's usually vapid mimicry of a phrase they don't even understand.

Ianal, but giving legal advice without being a lawyer may be illegal in some jurisdictions. Not sure if the disclaimer is effective or was ever tested in court. The disclaimer/disclosure mix-up is super annoying, but disclosing obvious biases even if not legally required seems like good practice to me.

I think that's referring to the pre-trained models, not the source code.

This bothered me as well. I opened an issue on the repo asking them to consider updating the license file to reflect these additional requirements.

The wording they currently use suggests that this additional license requirement applies not only to their pre-trained models.

As if anyone outside of corporate legal actually cares

Yes, I noticed that. Doesn't seem right does it
