I made a 100% local voice chatbot using StyleTTS2 and other open source pieces (Whisper and OpenHermes2-Mistral-7B). It responds so much faster than ChatGPT. You can have a real conversation with it instead of the stilted Siri-style interaction you have with other voice assistants. Fun to play with!
Anyone who has a Windows gaming PC with a 12 GB Nvidia GPU (tested on 3060 12GB) can install and converse with StyleTTS2 with one click, no fiddling with Python or CUDA needed: https://apps.microsoft.com/detail/9NC624PBFGB7
The demo is janky in various ways (requires headphones, runs as a console app, etc), but it's a sneak peek at what will soon be possible to run on a normal gaming PC just by putting together open source pieces. The models are improving rapidly; there are already several improved models I haven't yet incorporated.
How hard does the task of making the chatbot converse naturally look from your end? Specifically I'm thinking about interruptions: if it's talking too long I would like to be able to start talking and interrupt it like in a normal conversation, or if I'm saying something it could quickly interject. Once you've got extremely high speed, theoretically faster than real time, you can start doing that stuff, right?
There is another thing remaining after that for fully natural conversation, which is making the AI context aware like a human would be. Basically giving it eyes so it can see your face and judge body language to know if it's talking too long and needs to be more brief, the same way a human talks.
Yes, I implemented the ability to interrupt the chatbot while it is talking. It wasn't too hard, although it does require you to wear headphones so the bot doesn't hear itself and get interrupted.
The other way around (bot interrupting the user) is hard. Currently the bot starts processing a response after every word that the voice recognition outputs, to reduce latency. When new words come in before the response is ready it starts over. If it finishes its response before any more words arrive (~1 second usually) it starts speaking. This is not ideal because the user might not be done speaking, of course. If the user continues speaking the bot will stop and listen. But deciding when the user is done speaking, or if the bot should interrupt before the user is done, is a hard problem. It could possibly be done zero-shot using prompting of a LLM but you'd want a GPT-4 level LLM to do a good job and GPT-4 is too slow for instant response right now. A better idea would be to train a dedicated turn-taking model that directly predicts who should speak next in conversations. I haven't thought much about how to source a dataset and train a model for that yet.
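For the curious, the control flow is roughly the loop below. This is a simplified sketch, not the actual app code: transcribe_new_words, generate_reply, and speak are hypothetical stand-ins for the Whisper, LLM, and TTS calls, and the real app pipelines these rather than blocking.

    import time

    # Sketch of the "restart on every new word, speak after ~1 s of silence" loop.
    # transcribe_new_words() returns any words recognized since the last call,
    # generate_reply() is the LLM call, speak() is the TTS call (all hypothetical).
    def conversation_loop(transcribe_new_words, generate_reply, speak, silence_gap=1.0):
        transcript = []          # words heard so far this turn
        pending_reply = None     # reply generated for the current transcript
        last_word_time = time.monotonic()
        while True:
            new_words = transcribe_new_words()
            if new_words:
                transcript.extend(new_words)
                last_word_time = time.monotonic()
                pending_reply = None             # user kept talking: discard and redo
            if transcript and pending_reply is None:
                pending_reply = generate_reply(" ".join(transcript))
            if pending_reply and time.monotonic() - last_word_time > silence_gap:
                speak(pending_reply)             # ~1 s of silence: take the turn
                transcript, pending_reply = [], None
            time.sleep(0.05)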
Ultimately the end state of this type of system is a complete end-to-end audio-to-audio language model. There should be only one model, it should take audio directly as input and produce audio directly as output. I believe that having TTS and voice recognition and language modeling all as separate systems will not get us to 100% natural human conversation. I think that such a system would be within reach of today's hardware too, all you need is the right training dataset/procedure and some architecture bits to make it efficient.
As for giving the model eyes, actually there are already open source vision-language models that could be used for this today! I'd love to implement one in my chatbot. It probably wouldn't have social intelligence to read body language yet, but it could definitely answer questions about things you present to the webcam, read text, maybe even look at your computer screen and have conversations about what's on your screen. The latter could potentially be very useful, the endgame there is like GitHub Copilot for everything you do on your computer, not just typing code.
Thanks, fascinating insights. I think an everything-to-everything multimodal model could work if it's big enough because of transfer learning (but then there are latency issues), and so could a refined system built on LLMs/LMMs with TTS (like what you are using), but I haven't seen any good research on audio-to-audio language models. My suspicion is that that would take a lot of compute, much more than text, and that the amount of semantically meaningful accessible data might be much lower as well. And if you do manage to get to the same level of quality as text, what is latency like then? Not 100% sure, just intuitions, but I doubt it's great.
I like the idea of an RL predictor for interruption timing, although I think it might struggle with factual-correction interruptions. It could be a good way to make a very fast system, and if latency on the rest of the system is low enough you could probably start slipping in your "Of course", "Yeah, I agree", and "It was in March, but yeah" for truly natural speech. If latency is low you could just use the RL system to find opportunities to interrupt, give them to the LLM/LMM, and it decides how to interrupt, all the way from "mhm", to "Yep, sounds good to me", to "Not quite, it was the 3rd entry, but yeah otherwise it makes sense", to "Actually can I quickly jump on that? I just wanted to quickly [make a point]/[ask a question] about [some thing that requires exploration before the conversation continues]".
Tuning a system like this would be the most annoying activity in human history, but something like this has to be achieved for truly natural conversation so we gotta do it lol.
A simple correlation of audio chunks from the microphone and from the TTS should be enough to tell which parts of the input stream are re-recorded TTS. Much simpler, no?
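Something like this naive version, say (a sketch: the 0.5 threshold is arbitrary and it assumes the chunks are already roughly time-aligned):

    import numpy as np

    # Flag a mic chunk as probable TTS echo if its normalized cross-correlation
    # against the recently played TTS audio has a strong peak at some lag.
    def looks_like_echo(mic_chunk: np.ndarray, tts_chunk: np.ndarray, threshold: float = 0.5) -> bool:
        mic = (mic_chunk - mic_chunk.mean()) / (mic_chunk.std() + 1e-8)
        tts = (tts_chunk - tts_chunk.mean()) / (tts_chunk.std() + 1e-8)
        corr = np.correlate(mic, tts, mode="full") / min(len(mic), len(tts))
        return float(np.abs(corr).max()) > threshold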
It's not so simple when the impulse responses of the room, mic, and speakers are all unknown and possibly changing, plus there are unknown background sounds, possibly at a very high level. And there's also unknown latency, which can be quite large, especially in the networked case, and maybe some codecs, and maybe some audio "enhancement" software the OEM installed on the user's machine. Also, ideally the computer would be able to hear the user even while it is speaking.
Could it be done more reasonably with the transcription? With diarisation the logic of "someone is saying exactly / almost exactly what I'm saying, that's probably me" might be pretty reasonable.
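A toy version of that check (the 0.85 cutoff is arbitrary):

    import difflib

    # If what the mic just "heard" closely matches something the bot recently said,
    # treat it as the bot hearing itself rather than the user speaking.
    def is_probably_self_echo(heard: str, recently_spoken: list[str], cutoff: float = 0.85) -> bool:
        heard = heard.lower().strip()
        return any(
            difflib.SequenceMatcher(None, heard, said.lower().strip()).ratio() >= cutoff
            for said in recently_spoken
        )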
I installed this on my system and I'm having a little issue like you mention below where the bot hears itself, so it got into a loop of talking to itself and replying to itself; it was pretty funny.
Could the issue be that I am using a pair of bluetooth headphones and the microphone is built into that - what is the optimum setup, should I be listening on the headphones and using a different mic input instead of the headphone mic?
This is pretty interesting, I would love to get it to work. Running a 3060.
That's weird, Bluetooth headphones should work I think. Maybe there's something wrong with them. You can check if the mic can hear the speakers using Windows sound recorder. Try a different mic if you have one, the only important thing is the mic can't hear the headphones.
> It could possibly be done zero-shot using prompting of a LLM
That's how I've been thinking of doing it - seemed like you could use a much smaller GPT-J-ish model for that, and measure the relative probability of 'yes' vs 'no' tokens in response to a question like 'is the user done talking'. Seemed like even that would be orders of magnitude better than just waiting for silence.
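A sketch of that with the transformers library (the GPT-J checkpoint and the exact prompt wording here are just placeholders, not a recommendation):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6b")
    model = AutoModelForCausalLM.from_pretrained(
        "EleutherAI/gpt-j-6b", torch_dtype=torch.float16, device_map="auto"
    )

    # Compare the next-token logits for " yes" vs " no" instead of sampling text.
    def user_seems_done(transcript: str) -> bool:
        prompt = (
            f'Partial transcript of the user speaking: "{transcript}"\n'
            "Is the user done talking? Answer yes or no:"
        )
        ids = tok(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            logits = model(**ids).logits[0, -1]
        yes_id = tok(" yes", add_special_tokens=False).input_ids[0]
        no_id = tok(" no", add_special_tokens=False).input_ids[0]
        return bool(logits[yes_id] > logits[no_id])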
Instead of sacrificing flexibility by building one monolith model that does Audio to audio in one go, wouldn't it be better to train a model that handles conversing with the user (knows when the user is done talking, when it's hearing itself, etc) and leave the thinking to other, more generic models?
It would be very interesting to have something like BakLLaVA's image description fed from a webcam used as a context for the LLM. "You can see: <description of scene, description of changes from last snapshot>" or something along those lines in the system prompt.
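Even a dumb version of the prompt plumbing might go a long way. A minimal sketch, assuming scene_now/scene_before come from whatever VLM you run on webcam frames (the wording is just an illustration):

    # Build the system prompt from the latest webcam description plus the previous
    # one, so the LLM can notice changes between snapshots.
    def build_system_prompt(scene_now: str, scene_before: str = "") -> str:
        prompt = (
            "You are a voice assistant with a webcam.\n"
            f"You can see: {scene_now}"
        )
        if scene_before:
            prompt += f"\nIn the previous snapshot you saw: {scene_before}"
        return prompt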
You'd have to do something along the lines of what voice comms software does to combat the output feedback problem. I think it involves an FFT to analyse the two signals and cancel out the feedback; I'm not 100% sure on the details.
I plan to change the audio input to use WebRTC, then I get echo cancellation and network transparency for free. Although dealing with WebRTC is a headache harder than doing the AI parts.
For starters, every WebRTC demo I've tried has at least 400ms of round trip latency even on a loopback connection. Shoot me an email if you know WebRTC, would be good to chat with someone who knows stuff!
Deciding when someone is done speaking is hard to do well and impossible to do perfectly. Some people finish speaking, then think of something else to say and pretend they were still talking.
Certainly, but then it has little advantage over e.g. ChatGPT voice mode. I guess running locally is an advantage but the voice and answer quality is worse. The much better latency and more natural conversation is what I like about it.
Am I wrong to think it would have a couple major advantages? Like using speakers without having to worry about echo cancellation, having a distinct interrupt signal, and still getting all the latency benefits (possibly even more once you get used to it since the conversational style has to assume the end of the user’s sentence instead of knowing the second they let go of the button)
You wouldn't quite have all the latency benefits because you'd have the additional delay between when you stop speaking and when you release the button (or cut off speech if you release too early). It wouldn't respond any faster because it's already responding at the fastest possible speed right now, it doesn't wait at all. And it wouldn't be hands free, it wouldn't feel like a natural conversation which is what I'm going for.
I'd rather use speaker diarization and/or echo cancellation to solve the problem without needing the user to press any buttons.
What I would really like is a push-to-talk app on my phone, so I can talk to my house from anywhere without worrying about putting a microphone in every room, or monkeying about with wakeword detection. I'm sure it's doable, I'm just not conscious of having seen it done.
Process Process-2:
Traceback (most recent call last):
File "multiprocessing\process.py", line 314, in _bootstrap
File "multiprocessing\process.py", line 108, in run
File "chirp.py", line 126, in whisper_process
File "chirp.py", line 126, in <listcomp>
File "faster_whisper\transcribe.py", line 426, in generate_segments
File "faster_whisper\transcribe.py", line 610, in encode
RuntimeError: Library cublas64_11.dll is not found or cannot be loaded
tts initialized
Hmm, the dll is included in the app package but maybe there is a conflict with other installed DLLs on some machines. When releasing PC software I always expect this type of issue unfortunately. I plan to move away from faster_whisper which may fix this.
I have to say that the Python ecosystem is just awful for distribution purposes and I spent a lot longer on packaging issues than I did on the actual AI parts. And clearly didn't find all of the issues :)
Agree completely. But in this case the fault is with CUDA, which never works without a struggle. It's insane how hard it is to get CUDA code working cross-platform without a lot of effort. Even PyTorch has an awkward way of dealing with it, and they have more resources to figure it out than just about anyone.
Cool work! I tested it and got some mixed results:
1) it throws an error if it's installed to any drive other than C:\ --I moved it to C: and it works fine.
2) I'm seeing huge latency on an EVGA 3080Ti with 12GB. I'm also seeing it repeat the parsed input: even though I only spoke once, it appears to process the same input many times, sometimes with slightly different predictions. Here are some logs:
Latency to LLM response: 4.59
latency to speaking: 5.31
speaking 4: Hi Jim!
user spoke: Hi Jim.
user spoke recently, prompting LLM. last word time: 77.81 time: 78.11742429999867
latency to prompting: 0.31
Latency to LLM response: 2.09
latency to speaking: 3.83
speaking 5: So what have you been up to lately?
user spoke: So what have you been up to lately?
user spoke recently, prompting LLM. last word time: 83.9 time: 84.09415280001122
latency to prompting: 0.19
user spoke: So what have you been up to lately? No, I'm watching.
user spoke a while ago, ignoring. last word time: 86.9 time: 88.92142140000942
user spoke: So what have you been up to lately? No, just watching TV.
user spoke a while ago, ignoring. last word time: 87.9 time: 90.76665070001036
user spoke: So what have you been up to lately? No, I'm just watching TV.
user spoke a while ago, ignoring. last word time: 87.9 time: 94.16581820001011
user spoke: So what have you been up to lately? No, I'm just watching TV.
user spoke a while ago, ignoring. last word time: 88.9 time: 97.85854300000938
user spoke: So what have you been up to lately? No, I'm just watching TV.
user spoke a while ago, ignoring. last word time: 87.9 time: 101.54986060000374
user spoke: No, I just bought you a TV.
user spoke a while ago, ignoring. last word time: 87.8 time: 104.51332219998585
user spoke: No, I'll just watch you TV.
user spoke a while ago, ignoring. last word time: 87.41 time: 106.60086529998807
Latency to LLM response: 46.09
latency to speaking: 50.49
Thanks for posting it!
Edit:
3) It's hearing itself and responding to itself...
Thanks for trying it and thanks for the feedback! Yes, right now you need to use headphones so it doesn't hear itself. Sometimes Whisper inexplicably fails to recognize speech promptly. It seems to depend on what you say, so try saying something else. I have improvements that I haven't had time to release yet that should improve the situation, and a lot more work is definitely needed; this is MVP-level stuff right now. It's fixable, but it'll take time.
Yes, unfortunately these models take a lot of VRAM. It may be possible to do an 8GB version but it will have to compromise on quality of voice recognition and the language model so it might not be a good experience.
Yes, it absolutely could. You're right that this configuration is rare. Although people have been putting together machines with multiple 24GB cards in order to split and run larger models like llama2-70B.
Great question! Whisper processes audio in 30 second chunks. But on a fast GPU it can finish in only 100 milliseconds or so. So you can run it 10+ times per second and get around 100ms latency. Even better actually because Whisper will predict past the end of the audio sometimes.
This is an advantage of running locally. Running whisper this way is inefficient but I have a whole GPU sitting there dedicated to one user, so it's not a problem as long as it is fast enough. It wouldn't work well for a cloud service trying to optimize GPU use. But there are other ways of doing real time speech recognition that could be used there.
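If anyone wants to play with the idea, this is roughly what that looks like with faster-whisper and sounddevice (a sketch: the model size, the 30-second window, and the polling are guesses, and a real version should lock the shared buffer):

    import time
    import numpy as np
    import sounddevice as sd
    from faster_whisper import WhisperModel

    model = WhisperModel("small.en", device="cuda", compute_type="float16")
    sample_rate = 16000
    buffer = np.zeros(0, dtype=np.float32)

    # Append incoming mic audio, keeping only the most recent 30 seconds.
    def on_audio(indata, frames, time_info, status):
        global buffer
        buffer = np.concatenate([buffer, indata[:, 0]])[-30 * sample_rate:]

    with sd.InputStream(samplerate=sample_rate, channels=1, dtype="float32", callback=on_audio):
        while True:
            if len(buffer) < sample_rate:   # wait until we have at least 1 s of audio
                time.sleep(0.1)
                continue
            # Re-run Whisper on the whole rolling buffer; on a fast GPU this takes
            # ~100 ms, so the transcript refreshes several times per second.
            segments, _ = model.transcribe(buffer, language="en", beam_size=1)
            print(" ".join(s.text.strip() for s in segments))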
There's several faster ones out there. I've been using https://github.com/Softcatala/whisper-ctranslate2 which includes a nice --live_transcribe flag. It's not as good as running it on a complete file but it's been helpful to get the gist of foreign language live streams.
I haven't decided yet what I'm going to do with it. I think ideally I would open source it for people who have GPUs but also run it as a paid service for people who don't have GPUs. Open source that also makes money is always the holy grail :) I'll post updates on my Twitter/X account.
Are infill and outpainting equivalents possible? Super-RT TTS at this level of quality opens up a diverse array of uses, especially for indie/experimental gamedev, that I'm excited for.
Do you mean outpainting as in you still tell it what words to say, or the model just extends the audio unconditionally, the way some image models expand past an image's borders without a specific prompt (in audio, something like https://twitter.com/jonathanfly/status/1650001584485552130)?
Not sure what you mean:
If you mean could inpainting and outpainting with image models be faster, it's a "not even wrong" question, similar to asking if the United Airlines app could get faster because American Airlines did. (Yes, getting faster is an option available to ~all code.)
If you mean could you inpaint and outpaint text... yes, by inserting and deleting characters.
If you mean could you use an existing voice clip to generate speech by the same speaker in the clip, yes, part of the article is demonstrating generating speech by speakers not seen at training time.
I'm not sure I understand what you mean to say. To me it's a reasonable question asking whether text to speech models can complete a missing part of some existing speech audio, or make it go on for longer, rather than only generating speech from scratch. I don't see a connection to your faster apps analogy.
Fwiw, I imagine this is possible, at least to some extent. I was recently playing with xtts and it can generate speaker embeddings from short periods of speech, so you could use those to provide a logical continuation to existing audio. However, I'm not sure it's possible or easy to manage the "seams" between what is generated and what is preexisting very easily yet.
It's certainly not a misguided question to me. Perhaps you could be less curt and offer your domain knowledge to contribute to the discussion?
Edit: I see you've edited your post to be more informative, thanks for sharing more of your thoughts.
It imposes a cost on others when you make false claims, like that I said or felt the question was unreasonable.
I didn't and don't.
It is a hard question to understand and an interesting mind-bender to answer.
Less policing of the metacontext and more focusing on the discussion at hand will help ensure there's interlocutors around to, at the very least, continue policing.
He could have meant speed, text, audio, words, or phonemes, with images being the least probable.
He probably didn't mean phonemes or he wouldn't be asking.
He probably didn't mean arbitrarily slicing 'real' audio and stitching on fake audio - he made repeated references to a video game.
He probably didn't mean inpainting and outpainting imagery, even though he made reference to a video game, because it's an audio model.
Thank you for explaining I deserve to get downvoted through the floor multiple times for asking a question because it's "obvious". Maybe you can explain to the rest of the class what he meant then? If it was obviously phonemes, will you then advocate for them being downvoted through the floor since the answer was obvious? Or is it only people who assume good faith and ask what they meant who deserve downvotes?
Inpainting and outpainting of images is when the model generates bits inside or outside the image that don't exist. By analogy he was talking about generating sound inside (I.e. filling gaps) or outside (extrapolating beyond the end) the audio.
I don't know why you would think he was talking about inpainting images or words. This whole discussion is about speech synthesis.
Right, _until he brought up inpainting and outpainting_. And as I already laid out, the audio options made just about as much sense as the art.
I honestly can't believe how committed you are to explaining to me that as the only person who bothered answering, I'm the problem.
I've been in AI art when it was 10 people in an IRC room trying to figure out what to do with a bunch of GPUs an ex-hedge fund manager snapped up, and spent the last week working on porting eSpeak, the bedrock of ~all TTS models, from C++.
It wasn't "obvious" they didn't mean art, and it definitely was not obvious that they want to splice real voice clips at arbitrary points and insert new words without being a detectable fake for a video game. I needed more info to answer. I'm sorry.
Ignore the speed comment; it is unrelated to my question.
What I mean is, can output be conditioned on antecedent audio as well as text, analogous to how image diffusion models can condition inpainting and outpainting on static parts of an image and CLIP embeddings?
Yes, the paper and Eleven Labs have a major feature of "given $AUDIO_SET, generate speech for $TEXT in the same style as $AUDIO_SET".
No, in that you can't cut it at an arbitrary midword point, say at "what tim" in "what time is it in beijing", give it the string "what time is it in beijing", and have it recover seamlessly.
Yes, in that you can cut it at an arbitrary phoneme boundary: say 'this, I.S. a; good: test! ok?' in IPA is 'ðˈɪs, ˌaɪˌɛsˈeɪ; ɡˈʊd: tˈɛst! ˌoʊkˈeɪ?', and I can cut it between phonemes, give it the IPA string, and have it complete the rest.
Thanks. Following the instructions now. BTW mamba is no longer recommended (for those like me who aren't already using it), and the #mambaforge anchor in the link didn't work.
I switched from conda to mamba a while ago and never looked back (it's probably saved dozens of hours from waiting for conda's slow as molasses package resolution). I'm looking at the latest docs and it doesn't look like there's any deprecation messages or anything (it does warn against installing mamba inside of conda, but that's been the case for a long time): https://mamba.readthedocs.io/en/latest/installation/mamba-in...
It looks like miniforge is still the recommended install method, but also the anchor has changed in the repo docs, which I've updated, thx. FWIW, I haven't run into any problems using mamba. I'm not a power user, so there may be edge cases I've missed, but I have over 35 mamba envs on my dev machine atm, so it's definitely been doing the job for me and remains wicked fast (if not particularly disk efficient).
Was somewhat annoying to get everything to work as the documentation is a bit spotty, but after ~20 minutes it's all working well for me on WSL Ubuntu 22.04. Sound quality is very good, much better than other open source TTS projects I've seen. It's also SUPER fast (at least using a 4090 GPU).
Not sure it's quite up to Eleven Labs quality. But to me, what makes Eleven so cool is that they have a large library of high quality voices that are easy to choose from. I don't yet see any way with this library to get a different voice from the default female voice.
Also, the real special sauce for Eleven is the near instant voice cloning with just a single 5 minute sample, which works shockingly (even spookily) well. Can't wait to have that all available in a fully open source project! The services that provide this as an API are just too expensive for many use cases. Even the OpenAI one which is on the cheaper side costs ~10 cents for a couple thousand word generation.
To save people some time, this is tested on Ubuntu 22.04 (google is being annoying about the download link, saying too many people have downloaded it in the past 24 hours, but if you wait a bit it should work again):
One thing I've seen done for style cloning is a high quality fine tuned TTS -> RVC pipeline to "enhance" the output. TTS for intonation + pronunciation, RVC for voice texture. With StyleTTS and this pipeline you should get close to ElevenLabs.
I suspect they are doing many more things to make it sound better. I certainly hope open source solutions can approach that level of quality, but so far I've been very disappointed.
I tried it. Sounds absolutely nothing like my voice or my wife's voice. I used the same sample files as I used 2 days ago on the Eleven Labs website, and they worked flawlessly there. So this is very, very far from being close to "Eleven Labs quality" when it comes to voice cloning.
ElevenLabs are based on Tortoise-TTS which was already pre-trained on millions of hours of data, but this one was only trained on LibriTTS which was 500 hours at best. If you have seen millions of voices, there are definitely gonna be some of them that sound like you. It is just a matter of training data, but it is very difficult to have someone collect these large amounts of data and train on it.
The speech generated is the best I've heard from an open source model. The one test I made didn't make an exact clone either but this is still early days. There's likely something not quite right. The cloned voice does speak without any artifacts or other weirdness that most TTS systems suffer from.
Have you tested longer utterances with both ElevenLabs and with StyleTTS? Short audio synthesis is a ~solved problem in the TTS world, but things start falling apart once you want to do something like create an audiobook with text to speech.
I can say that the paid service from ElevenLabs can do long form TTS very well. I used it for a while to convert long articles to voice to listen to later instead of reading. It works very well.
I only stopped because it gets a little pricey.
Funnily enough, the TTS2 examples sound better than the ground truth [0]. For example, the "Then leaving the corpse within the house [...]" example has the ground truth pronounce "house" weirdly, with some change in the tonality that sounds higher, but the TTS2 version sounds more natural.
I'm excited to use this for all my ePub files, many of which don't have corresponding audiobooks, such as a lot of Japanese light novels. I am currently using Moon+ Reader on Android which has TTS but it is very robotic.
I wonder if you can add a TTS engine to Android as an app or plugin, then make Moon+ Reader or another reader use that custom engine. That's probably the easiest approach, but if that doesn't work, I might just have to make my own app.
I’m planning on making a self-host solution where you can upload files and the host sends back the audio to play, as a first pass on this tech. I’ll open source the repo after fiddling and prototyping. I’ve needed this kinda thing for a long time!
HN title at present is "StyleTTS2 – open-source Eleven Labs quality Text To Speech". Actual title at the far end doesn't name any particular other product; arXiv paper linked from there doesn't mention Eleven Labs either. I thought this sort of editorializing was frowned on.
Eleven Labs is the gold standard for voice synthesis. There is nothing better out there.
So it is extremely notable for an open source system to be able to approach this level of quality, which is why I'd imagine most would appreciate the comparison. I know it caught my attention.
It is editorializing and it is an exaggeration. However I've been using StyleTTS2 myself and IMO it is the best open source TTS by far and definitely deserves a spot on the top of HN for a while.
See my previous comment about this point. ElevenLabs are based on Tortoise-TTS which was already pre-trained on millions of hours of data, but this one was only trained on LibriTTS which was 500 hours at best. XTTS was also trained with probably millions of speakers in more than 20 languages.
If you have seen millions of voices, there are definitely gonna be some of them that sound like you. It is just a matter of training data, but it is very difficult to have someone collect these large amounts of data and train on it.
> It is just a matter of training data, but it is very difficult to have someone collect these large amounts of data and train on it.
It's really not that difficult, they are trained mostly on audiobooks and high quality audio from yt videos. If we talk about EV model then we are talking about around 500k hours of audio, but Tortoise-TTS is only around 50k from what I remember.
This is a big thing for one area I'm interested in - golf simulation.
Currently, playing in a golf simulator has a bit of a post-apocalyptic vibe. The birds are cheeping, the grass is rustling, the game play is realistic, but there's not a human to be seen. Just so different from the smacktalking of a real round, or the crowd noise at a big game.
It's begging for some LLM-fuelled banter to be added.
In Super Video Golf, a more old-school/retro-style game that a friend of mine made, there are some clapping sound effects when people are in view. However, I feel like the nature sound on its own is also kind of relaxing.
Seems to run at about "2x realtime" on a 2015 4-core i7-6700HQ laptop, that is, 5 seconds to generate 10 seconds of output. I can imagine that being 4x or greater on a real machine.
I really want to try this but making the venv to install all the torch dependencies is starting to get old lol.
How are other people dealing with this? Is there an easy way to get multiple venvs to share like a common torch venv? I can do this manually but I'm wondering if there's a tool out there that does this.
I use nix to setup the python env (python version + poetry + sometimes python packages that are difficult to install with poetry) and use poetry for the rest.
The workflow is:
> nix flake init -t github:dialohq/flake-templates#python
> nix develop -c $SHELL
> # I'm in the shell with poetry env, I have a shell hook in the nix devenv that does poetry install and poetry activate.
I generally try to use Docker for this stuff, but yeah, it's the main reason why I pass on these, even though I've been looking for something like this. It's just too hard to figure out the dependencies.
Can relate to this problem a lot. I have considered starting using a Docker dev container and making a base image for shared dependencies which I then can customize in a dockerfile for each new project, not sure if there's a better alternative though.
Yeah there is the official Nvidia container with torch+cuda pre-installed that some projects use.
I feel more projects should start with that as the base instead of pinning on whatever variants. Most aren't using specialized CUDA kernels after all.
I suppose that's the answer: just pick the specific torch+CUDA base that matches the major version of the project you want to run. Then cross your fingers and hope the dependencies mesh :p
I don't think "base" works like that (while it can be a fallback for some dependencies, afaik, Python packages are isolated/not in path). But even if you could, don't do it. Different packages usually have different pytorch dependencies (often CUDA as well) and it will definitely bite you.
The biggest optimization I've found is to use mamba for everything. It's ridiculously faster than conda for package resolution. With everything cached, you're mostly just waiting for your SSD at that point.
(I suppose you could add the base env's lib path to the end of your PYTHONPATH, but that sounds like a sure way to get bitten by weird dependency/reproducibility issues down the line.)
I mean that I already have like 10 different torch venvs for different projects all with various pinned versions and CUDA variants.
Still worth the trade-off of not having to deal with dependency hell, but you start to wonder if there is a better way. All together this is many GBs of duplicated libs, wasted bandwidth and compute.
Having now tried it (the linked repo links to pre-built colab notebooks):
1) It does a fantastic job of text-to-speech.
2) I have had no success in getting any meaningful zero-shot voice cloning working. It technically runs and produces a voice, but it sounds nothing like the target voice. (This includes trying their microphone-based self-voice-cloning option.)
Presumably fine-tuning is needed - but I am curious if anyone had better luck with the zero-shot approach.
Hardly. Imagine licensing your voice to Amazon so that any customer could stream any book narrated in your likeness without you having to commit the time to record. You could still work as a custom voice artist, all with a "no clone" clause if you chose. You could profit from your performance and craft in a fraction of the time, focusing as your own agent on the management of your assets. Or, you could just keep and commit to your day job.
Just imagine hearing the final novel of ASoIaF narrated by Roy Dotrice and knowing that a royalty went to his family and estate, or if David Attenborough willed the digital likeness of his voice and its performance to the BBC for use in nature documentaries after his death.
The advent of recorded audio didn't put artists out of business, it expanded the industries that relied on them by allowing more of them to work. Film and tape didn't put artists out of business, it expanded the industries that relied on them by allowing more of them to work. Audio digitization and the internet didn't put artists out of business; it expanded the industries that relied on them by allowing more of them to work.
And TTS won't put artists out of business, but it will create yet another new market with another niche that people will have to figure out how to monetize, even though 98% of the revenues will still somehow end up with the distributors.
What you're not considering here is that a large majority of this industry is made up of no-name voice actors who have a pleasant (but perfectly substitutable) voice, which is now something that AI can do perfectly and at a fraction of the price.
Sure, celebrities and other well-known figures will have more to gain here as they can license out their voice; but the majority of voice actors won't be able to capitalize on this. So this is actually even more perverse because it again creates a system where all assets will accumulate at the top and there won't be any distributions for everyone else.
No, I am. I work with them, and I've been one (am one, rarely).
I listed just one possible use, but I also see voice cloning and advanced TTS expanding access for evocative instruction, as an aid to study style and expand range.
Don't be afraid on their behalf. The dooming you're talking about applied to every one of the technological changes I already listed, and we employ more performers and artists today than ever in history.
When animation went digital, we graduated more storyboard artists and digital animators. When music notation software and sampling could replace musicians and orchestras, we graduated more musicians and composers trained on those tools. Now it's the performing arts, and no one in industry is going to shrink their pool of available talent (or risk ire) by daring conflate authenticity and performance with virtual impersonation. Performance capture and vfx also didn't kill or consolidate the movie industry - it allowed it to expand.
Art evolves, and so does its business. People who love art want to see people who do art succeed. I'm optimistic.
> When animation went digital, we graduated more storyboard artists and digital animators. When music notation software and sampling could replace musicians and orchestras, we graduated more musicians and composers trained on those tools.
What you explained is that tech has changed the tools used by artists.
It's substantially different with AI-based TTS, though. It's not a tool for artists, but it's a tool for movie/game/book publishers to replace human voice actors. The AI will be much much more scalable and cheaper.
TTS actually allows scope for far more different artists' likenesses to be incorporated. A book can be read with all the characters having a different voice entirely. This is difficult currently and relies on the skill of the performer.
I don't know, I feel like the work produced through voice acting is more of a commodity than work in the other industries that you're describing. Sure, a voice actor can add a lot of emotion and verbal nuance in a way that is differentiating, but I'm not sure if the difference is enough to matter for most people for the vast majority of cases. (Or I may be too dense to realize it). This is in contradiction to say performing arts, where there are in my opinion many more dimensions to the creative output which makes it less perfectly substitutable.
Where do you see AI being most used in a production pipeline?
Do you think it will replace actors or that it might just reduce the burden on existing talent, like canned audio has done for decades? Will it make ADR easier or cheaper? Will it actually save anyone any money who wants to ever be able to hire a living actor again?
There are a lot of clever sounding, low probability arguments here, and I think a lot of people don't understand the work well enough to identify what are and aren't the elephants.
You're not really thinking it through. I have friends involved in the VA business, and it's only gotten more competitive as time has progressed - this is partially because it's rare that we need a voice actor that needs to create a crazy Looney Tunes sounding voice, the majority of VA work is surprisingly just close to the natural sounding voice of the VA themselves.
It's rare that you need a talent like Dan Castellaneta, Mel Blanc, etc.
Secondly, yes, VA licensing will become a thing, but that means the jobs that were previously available to lesser-known voice actors, because the major players simply didn't have enough time to take those gigs, will no longer be available to them. A TTS VA can do unlimited recordings.
Thirdly, major studios that would require hundreds of voices for video games and other things don't have to license known voices at all, they can just generate brand new ones and pay zero licensing fees.
The point is no one will pay for any of that if you can just clone someone's voice locally. Or just tell the AI how you want it to sound. Your argument literally ignores the entire elephant in the room.
I've been playing with XTTSv2 on my 3080 Ti, and it generates audio slightly faster than real time (generation takes a bit less than the length of the final audio). It's also good quality, but these samples sound better.
It is usable in commercial applications given you disclose the use of AI. This applies only to the pre-trained models. You can train your own from scratch without these restrictions.
You can fine tune it on your own voice and also not be required to disclose the use of AI.
What are the chances this gets packaged into something a little more streamlined to use? I have a lot of ebooks I'd love to generate audio versions of.
In Utils folder, there are three pre-trained models:
ASR folder: It contains the pre-trained text aligner, which was pre-trained on English (LibriTTS), Japanese (JVS), and Chinese (AiShell) corpus. It works well for most other languages without fine-tuning, but you can always train your own text aligner with the code here: yl4579/AuxiliaryASR.
JDC folder: It contains the pre-trained pitch extractor, which was pre-trained on English (LibriTTS) corpus only. However, it works well for other languages too because F0 is independent of language. If you want to train on singing corpus, it is recommended to train a new pitch extractor with the code here: yl4579/PitchExtractor.
PLBERT folder: It contains the pre-trained PL-BERT model, which was pre-trained on English (Wikipedia) corpus only. It probably does not work very well on other languages, so you will need to train a different PL-BERT for different languages using the repo here: yl4579/PL-BERT. You can also replace this module with other phoneme BERT models like XPhoneBERT which is pre-trained on more than 100 languages.
Those are just parts of the system and don't make a complete TTS. In theory you could train a complete StyleTTS2 for other languages but currently the pretrained models are English only.
I am an introvert: I rarely socialize, listen to podcasts at 2x speed, and mostly use subtitles rather than listening to audio for movies, so I probably have a below-average ability to differentiate humans from robots.
I asked someone to play the recordings for me to differentiate. I could not tell which was human (only between StyleTTS2 and the ground truth; the others were obvious).
I made my own whisper & auto-typer which types what you say (forked whisper-typer).
I added OpenAI Q/A and RAG query feature so I could ask it questions (instead of auto keystroke typing) by voice command. For responses to questions, I used Eleven Labs - but even with latency optimized & streaming, it was slow, so disabled it.
I just swapped from OpenAI to Mistral 7b for Q/A querying. Much more responsive. Stoked to explore StyleTTS2 now!
Really glad that I came across your post. Thank you for sharing!
For reference, I'm using 11Labs to synthesize short messages - maybe a sentence or something, using voice cloning, and I'm getting it at around 400 - 500ms response times.
Is there any OS solution that gets me to around the same inference time?
It should be pretty easy to make training data for TTS. The Whisper STT models are open so just chop up a ton of audio and use Whisper to annotate it, then train the other direction to produce audio from text. So you’re basically inverting Whisper.
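A sketch of that pipeline with faster-whisper, writing LJSpeech-style file|text pairs a TTS trainer could consume (the clips/ folder, model size, and output format are just placeholders):

    import csv
    from pathlib import Path
    from faster_whisper import WhisperModel

    model = WhisperModel("medium.en", device="cuda", compute_type="float16")

    # Transcribe every clip and write "file_id|text" rows, i.e. (audio, text) pairs
    # usable as TTS training data.
    with open("metadata.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="|")
        for wav in sorted(Path("clips").glob("*.wav")):
            segments, _ = model.transcribe(str(wav))
            text = " ".join(seg.text.strip() for seg in segments)
            writer.writerow([wav.stem, text])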
STT training data includes all kinds of "noisy" speech so that the model learns to recognise speech in any conditions. TTS training data needs to be as clean as possible so that you don't introduce artefacts in the output and this high-quality data is much harder to get. A simple inversion is not really feasible or at least requires filtering out much of the data.
I think you’re talking about just using Whisper to annotate audio for a TTS pipeline but someone from Collabora actually created a TTS model directly from Whisper embeddings https://github.com/collabora/WhisperSpeech
As a tangent away from LLMs, is there an integration available to use as a TTS engine on Android? The TTS voice that I have now (RHVoice) for OSMAnd is really driving me crazy and almost makes me want to go back to Google Maps.
So, we've got this open-source TTS wizardry going on, which is kinda like if Siri had a caffeine overdose - faster, snappier, and way more fun at parties. This thing is running on gaming rigs with beefy GPUs, and it's apparently so user-friendly, even your grandma could set it up without accidentally summoning a digital demon.
But here's the real kicker - it's got the manners of a Victorian gentleman. You can rudely interrupt it mid-sentence, and it'll just stop and listen. Politeness level 100. The reverse, though - getting Mr. Bot to interrupt you - is still in the 'that's too much brain for my silicon' phase. Like, how do you teach a bunch of 1s and 0s to know when you're just taking a dramatic pause or actually done with your TED talk?
And get this - they're talking about making this bot read body language. Imagine your laptop judging you for your slouchy posture or that 'I haven't slept properly in days' look. Creepy? Maybe a bit. Cool? Absolutely.
In conclusion, StyleTTS2 is shaping up to be the cool new kid on the block, but it's still learning the ropes of human conversation. It's like that super smart friend who knows everything about quantum physics but can't tell when you're sarcastically saying 'Yeah, sure, let's invade Mars tomorrow.'
But you are wrong: the trained models are distributed separately on Google Drive and come with the following text, which seems to be an additional license agreement that also covers using the software and any trained model.
License Part 2 Text:
"Before using these pre-trained models, you agree to inform the listeners that the speech samples are synthesized by the pre-trained models, unless you have the permission to use the voice you synthesize. That is, you agree to only use voices whose speakers grant the permission to have their voice cloned, either directly or by license before making synthesized voices pubilc, or you have to publicly announce that these voices are synthesized if you do not have the permission to use these voices."
Silicon Valley is very leaky; Eleven Labs is widely rumored to have raised a huge round recently. Great timing, because with OpenAI's TTS and now this thing, the options in the market have just expanded greatly.
Once this is working, is there a simple way to switch voices with the default downloaded models? Or does this require downloading other models or generating them?
This is really harmful and unethical work. It will be used to hurt millions of elderly people with scams. That's the real application, and it will happen 100x more than anything else. It's unethical and harmful to release tools that will be overwhelmingly used to hurt elderly people. What they should do about it: stop releasing models and only release a service, so that scammers will not use it. Also, only release audio that is watermarked, so that apps can tell that a phone call might be a scam. When they share models with researchers, use previous best practices: post a Google Form to request access.
Just imagine if this line of thinking was used elsewhere.
This tech is already out of the bag and I thank the author(s) for the contribution to humanity. The correct solution here is not to shove your head in the sand and ignore reality, but to get your government to penalize any country or company that facilitates this crime. If they can force severe penalties for other financial crimes and funding terrorism, they can do the same here.
Scammers scamming old people is already very widespread, so should we maybe outlaw telephones as well? Or maybe mandate anti-scam filters that disconnect if something is discussed that could be a scam? If I think about it, that would actually make more sense, but it would still be problematic.
Millions of elderly people are already getting scammed by overseas call centers so unless we do something more significant this tech will not make one iota of a difference.
That's not really true; most scammers have a male voice with a heavy accent. When they have tools that easily disguise their voice, scammers can reach many more elderly people.
That might have been true about a year ago, but I've been getting calls from well-spoken native-level scammers for about two months now. They are so frequent that I can put them on speaker during family gatherings to raise awareness.
Sample sizes of 1 are never representative but they definitely have full access to native speakers or tech that can generate very passable speech.
It seems quite possible that the change you've seen in these last two months is because some have started using these models. More likely than a sudden huge shift in either the country of origin or English skills of the scammers.
My point is that these models were already out there before StyleTTS2 was released. Plugging your ears and demanding their regulation in your country will not make them disappear.
No, this is not MIT. If you don't like MIT license then feel free to use something else, but you can't pretend this is open source and then attempt to slap on additional restrictions on how the code can be used.
As I understand it the source code is licensed MIT, the weights are licensed "weird proprietary license that doesn't explicitly grant you any rights and implicitly probably grants you some usage rights so long as you tell the listeners or have permission from the voice you cloned".
Which, if you think the weights are copyright-able in the first place, makes them practically unusable for anything commercial/that you might get sued over because relying on a vague implicit license is definitely not a good idea.
I think you mis-parsed the disclaimer. It's just warning people that cloned voices come with a different set of rights to the software (because the person the voice is a clone of has rights to their voice).
(Don’t let’s derail the conversation, please, but “disclaimer” is completely the wrong word here. This is a condition of use. A disclaimer is “this isn’t mine” or “I’m not responsible for this”. Disclaimers and disclosures are quite different things and commonly confused, but this isn’t even either of them.)
This always annoys me when people put "disclaimers" on their posts. "IANAL", I'm so tired of hearing that one. It's pointless because even if you were a lawyer, you cannot meaningfully comment on a case without the details, jurisdiction, circumstance, etc. Next, it's meaningless because is anyone going to blindly bow down and obey if you state the opposite? "Yes, I AM a lawyer, you do not need to pay taxes, they are unconstitutional." Thirdly, when they "disclaimer" themselves as working at Google, that's not a dis-claimer, that's a "claimer", asserting the affirmative. I know their companies require them not to speak for the company without permission, but I hardly ever hear that one; usually it's just some useless self-disclosure that they might be biased because they work there. Ok, who isn't biased?
What bugs me overall is that it's usually vapid mimicry of a phrase they don't even understand.
IANAL, but giving legal advice without being a lawyer may be illegal in some jurisdictions. Not sure if the disclaimer is effective or was ever tested in court.
The disclaimer/disclosure mix-up is super annoying, but disclosing obvious biases even if not legally required seems like good practice to me.