Anyone who has a Windows gaming PC with a 12 GB Nvidia GPU (tested on 3060 12GB) can install and converse with StyleTTS2 with one click, no fiddling with Python or CUDA needed: https://apps.microsoft.com/detail/9NC624PBFGB7
The demo is janky in various ways (requires headphones, runs as a console app, etc.), but it's a sneak peek at what will soon be possible to run on a normal gaming PC just by putting together open-source pieces. The models are improving rapidly; there are already several improved models I haven't yet incorporated.
There is one more thing needed after that for fully natural conversation: making the AI context-aware the way a human would be. Basically, giving it eyes so it can see your face and read your body language to know when it's talking too long and needs to be more brief, the same way a human does.
The other way around (the bot interrupting the user) is hard. Currently the bot starts processing a response after every word the voice recognition outputs, to reduce latency. When new words come in before the response is ready, it starts over. If it finishes its response before any more words arrive (usually ~1 second), it starts speaking. This isn't ideal because the user might not be done speaking, of course. If the user continues speaking, the bot stops and listens. But deciding when the user is done speaking, or whether the bot should interrupt before the user is done, is a hard problem. It could possibly be done zero-shot by prompting an LLM, but you'd want a GPT-4-level LLM to do a good job, and GPT-4 is too slow for instant responses right now. A better idea would be to train a dedicated turn-taking model that directly predicts who should speak next in a conversation. I haven't thought much yet about how to source a dataset and train a model for that.
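In pseudocode, the loop is roughly like this (placeholder function names, not the actual app code; in the real thing the ASR, LLM, and TTS each run in their own process):

import time

def get_new_words():       # placeholder: poll the streaming ASR for newly recognized words
    return ""

def generate_reply(text):  # placeholder: prompt the local LLM with the transcript so far
    return "..."

def speak(reply):          # placeholder: hand the reply to the TTS engine
    print("bot:", reply)

SILENCE_BEFORE_REPLY = 1.0   # ~1 s of no new words before the bot starts speaking
transcript, pending_reply = "", None
last_word_time = time.monotonic()

while True:
    new_words = get_new_words()
    if new_words:                        # new words invalidate the draft reply: start over
        transcript += " " + new_words
        last_word_time = time.monotonic()
        pending_reply = generate_reply(transcript)
    elif pending_reply and time.monotonic() - last_word_time > SILENCE_BEFORE_REPLY:
        speak(pending_reply)             # reply is ready and the user has paused
        transcript, pending_reply = "", None
    time.sleep(0.05)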
Ultimately the end state of this type of system is a complete end-to-end audio-to-audio language model. There should be only one model, it should take audio directly as input and produce audio directly as output. I believe that having TTS and voice recognition and language modeling all as separate systems will not get us to 100% natural human conversation. I think that such a system would be within reach of today's hardware too, all you need is the right training dataset/procedure and some architecture bits to make it efficient.
As for giving the model eyes, there are actually already open-source vision-language models that could be used for this today! I'd love to implement one in my chatbot. It probably wouldn't have the social intelligence to read body language yet, but it could definitely answer questions about things you present to the webcam, read text, maybe even look at your computer screen and have conversations about what's on it. The latter could potentially be very useful; the endgame there is like GitHub Copilot for everything you do on your computer, not just typing code.
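As a rough sketch of what that could look like (this isn't in the demo; BLIP here is just a small stand-in captioner, and a proper VLM like LLaVA would be needed for real Q&A about the image):

import cv2
from PIL import Image
from transformers import pipeline

captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

cap = cv2.VideoCapture(0)          # default webcam
ok, frame = cap.read()
cap.release()
if ok:
    img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    caption = captioner(img)[0]["generated_text"]
    # feed the description into the LLM's context so the bot can talk about what it "sees"
    print("webcam:", caption)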
I like the idea of an RL predictor for interruption timing, although I think it might struggle with factual-correction interruptions. It could be a good way to make a very fast system, and if latency on the rest of the system is low enough you could probably start slipping in your "Of course", "Yeah, I agree", and "It was in March, but yeah" for truly natural speech. If latency is low you could just use the RL system to find opportunities to interrupt, give them to the LLM/LMM, and it decides how to interrupt, all the way from "mhm", to "Yep, sounds good to me", to "Not quite, it was the 3rd entry, but yeah otherwise it makes sense", to "Actually can I quickly jump on that? I just wanted to quickly [make a point]/[ask a question] about [some thing that requires exploration before the conversation continues]".
Tuning a system like this would be the most annoying activity in human history, but something like this has to be achieved for truly natural conversation so we gotta do it lol.
Maybe you can use some sort of speaker identification to sort this out?
Echo cancellation is non-trivial for sure.
Could the issue be that I am using a pair of Bluetooth headphones with the microphone built into them? What is the optimal setup? Should I be listening on the headphones and using a different mic input instead of the headphone mic?
This is pretty interesting; I would love to get it to work. Running a 3060.
That's how I've been thinking of doing it - seemed like you could use a much smaller GPT-J-ish model for that, and measure the relative probability of 'yes' vs 'no' tokens in response to a question like 'is the user done talking'. Seemed like even that would be orders of magnitude better than just waiting for silence.
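Something like this, as a sketch (the model is just an arbitrary small one, and the prompt/threshold would need tuning):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m")
lm = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")

def p_user_done(transcript: str) -> float:
    prompt = f'Transcript so far: "{transcript}"\nIs the user done talking? Answer yes or no:'
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits[0, -1]                      # next-token logits
    yes_id = tok(" yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" no", add_special_tokens=False).input_ids[0]
    p_yes, p_no = torch.softmax(logits[[yes_id, no_id]], dim=0)
    return p_yes.item()                                     # relative probability of "yes"

print(p_user_done("So what have you been up to"))           # should lean towards "no"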
I would love to help. Would even code up a prototype if you wanted :)
I'd rather use speaker diarization and/or echo cancellation to solve the problem without needing the user to press any buttons.
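For the diarization half, something like pyannote.audio could work (just a sketch; it needs a Hugging Face token and accepting the model terms):

from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1",
                                    use_auth_token="hf_...")  # your HF access token

diarization = pipeline("mic_capture.wav")
for turn, _, speaker in diarization.itertracks(yield_label=True):
    # keep only the human's segments; drop the ones that line up with the bot's own playback
    print(f"{speaker}: {turn.start:.1f}s - {turn.end:.1f}s")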
Traceback (most recent call last):
File "multiprocessing\process.py", line 314, in _bootstrap
File "multiprocessing\process.py", line 108, in run
File "chirp.py", line 126, in whisper_process
File "chirp.py", line 126, in <listcomp>
File "faster_whisper\transcribe.py", line 426, in generate_segments
File "faster_whisper\transcribe.py", line 610, in encode
RuntimeError: Library cublas64_11.dll is not found or cannot be loaded
I have to say that the Python ecosystem is just awful for distribution purposes and I spent a lot longer on packaging issues than I did on the actual AI parts. And clearly didn't find all of the issues :)
1) It throws an error if it's installed to any drive other than C:\. I moved it to C: and it works fine.
2) I'm seeing huge latency on an EVGA 3080 Ti with 12GB. I'm also seeing it repeat the parsed input: even though I only spoke once, it appears to process the same input many times, sometimes with slightly different predictions. Here are some logs:
Latency to LLM response: 4.59
latency to speaking: 5.31
speaking 4: Hi Jim!
user spoke: Hi Jim.
user spoke recently, prompting LLM. last word time: 77.81 time: 78.11742429999867
latency to prompting: 0.31
Latency to LLM response: 2.09
latency to speaking: 3.83
speaking 5: So what have you been up to lately?
user spoke: So what have you been up to lately?
user spoke recently, prompting LLM. last word time: 83.9 time: 84.09415280001122
latency to prompting: 0.19
user spoke: So what have you been up to lately? No, I'm watching.
user spoke a while ago, ignoring. last word time: 86.9 time: 88.92142140000942
user spoke: So what have you been up to lately? No, just watching TV.
user spoke a while ago, ignoring. last word time: 87.9 time: 90.76665070001036
user spoke: So what have you been up to lately? No, I'm just watching TV.
user spoke a while ago, ignoring. last word time: 87.9 time: 94.16581820001011
user spoke: So what have you been up to lately? No, I'm just watching TV.
user spoke a while ago, ignoring. last word time: 88.9 time: 97.85854300000938
user spoke: So what have you been up to lately? No, I'm just watching TV.
user spoke a while ago, ignoring. last word time: 87.9 time: 101.54986060000374
user spoke: No, I just bought you a TV.
user spoke a while ago, ignoring. last word time: 87.8 time: 104.51332219998585
user spoke: No, I'll just watch you TV.
user spoke a while ago, ignoring. last word time: 87.41 time: 106.60086529998807
Latency to LLM response: 46.09
latency to speaking: 50.49
Thanks for posting it!
3) It's hearing itself and responding to itself...
Isn't it quite non-realtime?
This is an advantage of running locally. Running whisper this way is inefficient but I have a whole GPU sitting there dedicated to one user, so it's not a problem as long as it is fast enough. It wouldn't work well for a cloud service trying to optimize GPU use. But there are other ways of doing real time speech recognition that could be used there.
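The brute-force version of that looks roughly like this (a sketch with faster-whisper, not the exact code in the demo): keep appending mic audio to a buffer and re-transcribe the whole thing each time.

import numpy as np
from faster_whisper import WhisperModel

model = WhisperModel("small.en", device="cuda", compute_type="float16")
buffer = np.zeros(0, dtype=np.float32)        # 16 kHz mono float32 samples

def on_new_audio(chunk: np.ndarray) -> str:
    """Append the latest mic chunk and re-transcribe everything said so far."""
    global buffer
    buffer = np.concatenate([buffer, chunk])
    segments, _ = model.transcribe(buffer, beam_size=1)
    return " ".join(s.text.strip() for s in segments)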
Also I did a little speed/quality shootoff with the LJSpeech model (vs VITS and XTTS). StyleTTS2 was pretty good and very fast: https://fediverse.randomfoo.net/notice/AaOgprU715gcT5GrZ2
Are infill and outpainting equivalents possible? Super-RT TTS at this level of quality opens up a diverse array of uses esp for indie/experimental gamedev that I'm excited for.
Me: Won’t it be great when AI can-
Computer: Finish your sentences for you? OMG that’s exactly what I was thinking!
Do you mean outpainting as in you still specify what words to say, or the model just extends the audio unconditionally, the way some image models expand past an image's borders without a specific prompt (in audio, like https://twitter.com/jonathanfly/status/1650001584485552130)?
If you mean could you inpaint and outpaint text...yes, by inserting and deleting characters.
If you mean could you use an existing voice clip to generate speech by the same speaker in the clip, yes, part of the article is demonstrating generating speech by speakers not seen at training time
Fwiw, I imagine this is possible, at least to some extent. I was recently playing with xtts and it can generate speaker embeddings from short periods of speech, so you could use those to provide a logical continuation to existing audio. However, I'm not sure it's possible or easy to manage the "seams" between what is generated and what is preexisting very easily yet.
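For reference, the basic XTTS call (via the Coqui TTS package) is roughly this; the hard part described above, blending the generated audio into the original clip without an audible seam, isn't handled here:

from TTS.api import TTS

tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to("cuda")
tts.tts_to_file(
    text="and that's why the quest marker moved.",  # the words to "outpaint"
    speaker_wav="existing_line.wav",                # short clip of the original speaker
    language="en",
    file_path="continuation.wav",
)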
It's certainly not a misguided question to me. Perhaps you could be less curt and offer your domain knowledge to contribute to the discussion?
Edit: I see you've edited your post to be more informative, thanks for sharing more of your thoughts.
I didn't and don't.
It is a hard question to understand and an interesting mind-bender to answer.
Less policing of the metacontext and more focusing on the discussion at hand will help ensure there are interlocutors around to, at the very least, continue policing.
He could have meant speed, text, audio, words, or phonemes, with images being the least probable.
He probably didn't mean phonemes or he wouldn't be asking.
He probably didn't mean arbitrarily slicing 'real' audio and stitching on fake audio - he made repeated references to a video game.
He probably didn't mean inpainting and outpainting imagery, even though he made reference to a video game, because its an audio model.
Thank you for explaining I deserve to get downvoted through the floor multiple times for asking a question because it's "obvious". Maybe you can explain to the rest of the class what he meant then? If it was obviously phonemes, will you then advocate for them being downvoted through the floor since the answer was obvious? Or is it only people who assume good faith and ask what they meant who deserve downvotes?
I don't know why you would think he was talking about inpainting images or words. This whole discussion is about speech synthesis.
I honestly can't believe how committed you are to explaining to me that as the only person who bothered answering, I'm the problem.
I've been in AI art when it was 10 people in an IRC room trying to figure out what to do with a bunch of GPUs an ex-hedge fund manager snapped up, and spent the last week working on porting eSpeak, the bedrock of ~all TTS models, from C++.
It wasn't "obvious" they didn't mean art, and it definitely was not obvious that they want to splice real voice clips at arbitrary points and insert new words without being a detectable fake for a video game. I needed more info to answer. I'm sorry.
Wait 'till you learn I'm a woman though. :>
What I mean is, can output be conditioned on antecedent audio as well as text, analogous to how image diffusion models can condition inpainting and outpainting on static parts of an image and CLIP embeddings?
No, in that you can't cut it at an arbitrary mid-word point, say at "what tim" in "what time is it in beijing", give it the string "what time is it in beijing", and have it recover seamlessly.
Yes, in that you can cut it at an arbitrary phoneme boundary. Say 'this, I.S. a; good: test! ok?', which in IPA is 'ðˈɪs, ˌaɪˌɛsˈeɪ; ɡˈʊd: tˈɛst! ˌoʊkˈeɪ?'; I can cut it between phonemes, give it the rest, and have it complete it.
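(For reference, an IPA string like that comes straight out of phonemizer with the espeak-ng backend:)

from phonemizer import phonemize

ipa = phonemize(
    "this, I.S. a; good: test! ok?",
    language="en-us",
    backend="espeak",
    preserve_punctuation=True,
    with_stress=True,
)
print(ipa)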
It looks like miniforge is still the recommended install method, but also the anchor has changed in the repo docs, which I've updated, thx. FWIW, I haven't run into any problems using mamba. I'm not a power user, so there are edge cases I might have missed, but I have over 35 mamba envs on my dev machine atm, so it's definitely been doing the job for me and remains wicked fast (if not particularly disk efficient).
conda update -n base conda
conda install -n base conda-libmamba-solver
conda config --set solver libmamba
Not sure it's quite up to Eleven Labs quality. But to me, what makes Eleven so cool is that they have a large library of high quality voices that are easy to choose from. I don't yet see any way with this library to get a different voice from the default female voice.
Also, the real special sauce for Eleven is the near instant voice cloning with just a single 5 minute sample, which works shockingly (even spookily) well. Can't wait to have that all available in a fully open source project! The services that provide this as an API are just too expensive for many use cases. Even the OpenAI one which is on the cheaper side costs ~10 cents for a couple thousand word generation.
git clone https://github.com/yl4579/StyleTTS2.git
cd StyleTTS2
python3 -m venv venv
source venv/bin/activate
python3 -m pip install --upgrade pip
python3 -m pip install wheel
pip install -r requirements.txt
pip install phonemizer
sudo apt-get install -y espeak-ng
pip install gdown
7z x Models.zip
pip install ipykernel pickleshare nltk SoundFile
python -c "import nltk; nltk.download('punkt')"
pip install --upgrade jupyter ipywidgets librosa
python -m ipykernel install --user --name=venv --display-name="Python (venv)"
xTTSv2 does it much better. The quality of the trained voices is great, though.
Also, ElevenLabs keeps diverging for me, and starts mispronouncing words after two or three sentences.
I'm excited to use this for all my ePub files, many of which don't have corresponding audiobooks, such as a lot of Japanese light novels. I am currently using Moon+ Reader on Android which has TTS but it is very robotic.
2023. There is no way to win.
Impressive results nonetheless, and superior to all other TTS.
So it is extremely notable for an open source system to be able to approach this level of quality, which is why I'd imagine most would appreciate the comparison. I know it caught my attention.
This voice cloning is... nothing like XTTSv2, let alone ElevenLabs.
It doesn't seem to care about accents at all. It does pretty well with pitch and cadence, and that's about it.
I've tried all kinds of different values for alpha, beta, embedding scale, diffusion steps.
Anyone else have better luck?
Sure it's fast and the sound quality is pretty good, but I can't get the voice cloning to work at all.
If you have seen millions of voices, there are definitely going to be some of them that sound like you. It is just a matter of training data, but it is very difficult to have someone collect these large amounts of data and train on them.
It's really not that difficult; they are trained mostly on audiobooks and high-quality audio from YouTube videos. If we talk about the EV model, then we are talking about around 500k hours of audio, but Tortoise-TTS is only around 50k from what I remember.
This could have interesting prospects for games where you have an LLM playing a character and a TTS like this giving those NPCs a voice.
Currently, playing in a golf simulator has a bit of a post-apocalyptic vibe. The birds are cheeping, the grass is rustling, the gameplay is realistic, but there's not a human to be seen. Just so different from the smack-talking of a real round, or the crowd noise at a big game.
It's begging for some LLM-fuelled banter to be added.
How are other people dealing with this? Is there an easy way to get multiple venvs to share like a common torch venv? I can do this manually but I'm wondering if there's a tool out there that does this.
The workflow is:
> nix flake init -t github:dialohq/flake-templates#python
> nix develop -c $SHELL
> # I'm in the shell with poetry env, I have a shell hook in the nix devenv that does poetry install and poetry activate.
I feel more projects should start with that as the base instead of pinning on whatever variants. Most aren't using specialized CUDA kernels after all.
I suppose that's the answer: just pick the specific torch+CUDA base that matches the major version the project you want to run requires. Then cross your fingers and hope the dependencies mesh :p.
The biggest optimization I've found is to use mamba for everything. It's ridiculously faster than conda for package resolution. With everything cached, you're mostly just waiting for your SSD at that point.
(I suppose you could add the base env's lib path to the end of your PYTHONPATH, but that sounds like a sure way to get bitten by weird dependency/reproducibility issues down the line.)
If it's starting to get old, then this means that an LLM like Copilot should be able to do it for you, no?
Still worth the trade-off of not having to deal with dependency hell, but you start to wonder if there is a better way. Altogether this is many GBs of duplicated libs and wasted bandwidth and compute.
1 = https://github.com/microsoft/LoRA
1) It does a fantastic job of text-to-speech.
2) I have had no success in getting any meaningful zero-shot voice cloning working. It technically runs and produces a voice, but it sounds nothing like the target voice. (This includes trying their microphone-based self-voice-cloning option.)
Presumably fine-tuning is needed - but I am curious if anyone had better luck with the zero-shot approach.
Just imagine hearing the final novel of ASoIaF narrated by Roy Dotrice and knowing that a royalty went to his family and estate, or if David Attenborough willed the digital likeness of his voice and its performance to the BBC for use in nature documentaries after his death.
The advent of recorded audio didn't put artists out of business, it expanded the industries that relied on them by allowing more of them to work. Film and tape didn't put artists out of business, it expanded the industries that relied on them by allowing more of them to work. Audio digitization and the internet didn't put artists out of business; it expanded the industries that relied on them by allowing more of them to work.
And TTS won't put artists out of business, but it will create yet another new market with another niche that people will have to figure out how to monetize, even though 98% of the revenues will still somehow end up with the distributors.
Sure, celebrities and other well-known figures will have more to gain here as they can license out their voice; but the majority of voice actors won't be able to capitalize on this. So this is actually even more perverse because it again creates a system where all assets will accumulate at the top and there won't be any distributions for everyone else.
I listed just one possible use, but I also see voice cloning and advanced TTS expanding access for evocative instruction, as an aid to study style and expand range.
Don't be afraid on their behalf. The dooming you're talking about applied to every one of the technological changes I already listed, and we employ more performers and artists today than ever in history.
When animation went digital, we graduated more storyboard artists and digital animators. When music notation software and sampling could replace musicians and orchestras, we graduated more musicians and composers trained on those tools. Now it's the performing arts, and no one in the industry is going to shrink their pool of available talent (or risk ire) by daring to conflate authenticity and performance with virtual impersonation. Performance capture and VFX also didn't kill or consolidate the movie industry; they allowed it to expand.
Art evolves, and so does its business. People who love art want to see people who do art succeed. I'm optimistic.
What you explained is that tech has changed the tools used by artists.
It's substantially different with AI-based TTS, though. It's not a tool for artists, but it's a tool for movie/game/book publishers to replace human voice actors. The AI will be much much more scalable and cheaper.
Do you think it will replace actors or that it might just reduce the burden on existing talent, like canned audio has done for decades? Will it make ADR easier or cheaper? Will it actually save anyone any money who wants to ever be able to hire a living actor again?
There are a lot of clever sounding, low probability arguments here, and I think a lot of people don't understand the work well enough to identify what are and aren't the elephants.
It's rare that you need a talent like Dan Castellaneta, Mel Blanc, etc.
Secondly, yes, VA licensing will become a thing – but that means jobs that would previously have been available to lesser-known voice actors, because the major players simply didn't have enough time to take those gigs, will no longer go to them. A TTS VA can do unlimited recordings.
Thirdly, major studios that would require hundreds of voices for video games and other things don't have to license known voices at all; they can just generate brand new ones and pay zero licensing fees.
Excited to try it out!
You can fine tune it on your own voice and also not be required to disclose the use of AI.
In Utils folder, there are three pre-trained models:
ASR folder: It contains the pre-trained text aligner, which was pre-trained on English (LibriTTS), Japanese (JVS), and Chinese (AiShell) corpus. It works well for most other languages without fine-tuning, but you can always train your own text aligner with the code here: yl4579/AuxiliaryASR.
JDC folder: It contains the pre-trained pitch extractor, which was pre-trained on English (LibriTTS) corpus only. However, it works well for other languages too because F0 is independent of language. If you want to train on singing corpus, it is recommended to train a new pitch extractor with the code here: yl4579/PitchExtractor.
PLBERT folder: It contains the pre-trained PL-BERT model, which was pre-trained on English (Wikipedia) corpus only. It probably does not work very well on other languages, so you will need to train a different PL-BERT for different languages using the repo here: yl4579/PL-BERT. You can also replace this module with other phoneme BERT models like XPhoneBERT which is pre-trained on more than 100 languages.
I asked someone to play the recordings for me so I could try to tell them apart. I could not tell which was human (only between StyleTTS2 and the ground truth; the others were obvious).
I made my own whisper & auto-typer which types what you say (forked whisper-typer).
I added an OpenAI Q/A and RAG query feature so I could ask it questions by voice command (instead of auto keystroke typing). For responses to questions I used Eleven Labs, but even with latency optimization and streaming it was slow, so I disabled it.
I just swapped from OpenAI to Mistral 7b for Q/A querying. Much more responsive. Stoked to explore StyleTTS2 now!
Really glad that I came across your post. Thank you for sharing!
For reference, I'm using 11Labs to synthesize short messages - maybe a sentence or something, using voice cloning, and I'm getting it at around 400 - 500ms response times.
Is there any open-source solution that gets me to around the same inference time?
It's mostly just a demo for now and a little bit janky but it's fun to chat with and you can see the promise for 100% local voice AI in the future.
But here's the real kicker - it's got the manners of a Victorian gentleman. You can rudely interrupt it mid-sentence, and it'll just stop and listen. Politeness level 100. The reverse, though - getting Mr. Bot to interrupt you - is still in the 'that's too much brain for my silicon' phase. Like, how do you teach a bunch of 1s and 0s to know when you're just taking a dramatic pause or actually done with your TED talk?
And get this - they're talking about making this bot read body language. Imagine your laptop judging you for your slouchy posture or that 'I haven't slept properly in days' look. Creepy? Maybe a bit. Cool? Absolutely.
In conclusion, StyleTTS2 is shaping up to be the cool new kid on the block, but it's still learning the ropes of human conversation. It's like that super smart friend who knows everything about quantum physics but can't tell when you're sarcastically saying 'Yeah, sure, let's invade Mars tomorrow.
I am quite tired of seeing "open source" announcements where half or more of it is not really free.
general psa: please be honest in your announcements :|
Maybe actually check it out before complaining.
License Part 2 Text:
"Before using these pre-trained models, you agree to inform the listeners that the speech samples are synthesized by the pre-trained models, unless you have the permission to use the voice you synthesize. That is, you agree to only use voices whose speakers grant the permission to have their voice cloned, either directly or by license before making synthesized voices pubilc, or you have to publicly announce that these voices are synthesized if you do not have the permission to use these voices."
The expanded size of the tensor (4293) must match the existing size (512)
Any way to fix this from the IPython notebook examples?
This tech is already out of the bag and I thank the author(s) for the contribution to humanity. The correct solution here is not to shove your head in the sand and ignore reality, but to get your government to penalize any country or company that facilitates this crime. If they can force severe penalties for other financial crimes and funding terrorism, they can do the same here.
> soon as it's out, a whole bunch of extremely privileged ML people will throw their hands up and say, "oh well, cats out of the bag."
Sample sizes of 1 are never representative but they definitely have full access to native speakers or tech that can generate very passable speech.
Though would def like to clone a pleasant voice on it before using. Those sound good but not my cup of tea
> Before using these models, you agree to [...]
No, this is not MIT. If you don't like MIT license then feel free to use something else, but you can't pretend this is open source and then attempt to slap on additional restrictions on how the code can be used.
Which, if you think the weights are copyright-able in the first place, makes them practically unusable for anything commercial/that you might get sued over because relying on a vague implicit license is definitely not a good idea.
What bugs me overall is that it's usually vapid mimicry of a phrase they don't even understand.
The wording they currently use suggests that this additional license requirement applies not only to their pre-trained models but to the code as well.