>This can only be used if you happen to own a huge audio corpus and have a lot of money.
Could you elaborate on that? Do you mean that you need a large training set of your own voice, plus $$$ to train the models on an expensive GPU?
So, looking at it again, I see that the audio corpus is available for free (openslr.org, 60 GB for 1,000 hours of speech); however, I suspect that training the model on that much data would take an insane amount of time on a single GPU.
Instead, companies usually train such models in the cloud. GPT-3, for example, used about 800 GB of training data and cost roughly 5 million USD to train. Extrapolating from that, I'd guess the same setup on this 60 GB corpus would cost around 375,000 USD to train (although I assume this model has waaaaay fewer parameters, making it a lot cheaper -- but I can't seem to find how many parameters it has).
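Spelling out my back-of-envelope math (and this assumes cost scales linearly with raw dataset size, which is a big simplification -- real cost depends mostly on model size and training compute):

```python
# Crude extrapolation: scale GPT-3's rough training cost by dataset size.
gpt3_data_gb = 800          # GPT-3 training data, as cited above
gpt3_cost_usd = 5_000_000   # rough GPT-3 training cost estimate
speech_data_gb = 60         # openslr.org speech corpus

estimated_cost = gpt3_cost_usd * (speech_data_gb / gpt3_data_gb)
print(f"${estimated_cost:,.0f}")  # $375,000
```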
If someone else has already spent that money to train the model, then you could just take the trained weights, load them into the model, and it would be as if you had trained it yourself -- at which point you'd only need to provide your own voice and fine-tune the model on your own GPU for a short time.
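A minimal PyTorch sketch of that idea (this is NOT audiolm-pytorch's actual API -- the model, checkpoint name, and data below are hypothetical stand-ins):

```python
import torch
import torch.nn as nn

class TinyTTS(nn.Module):
    """Hypothetical stand-in for a real TTS model."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(80, 80)  # pretend mel-spectrogram frames in/out

    def forward(self, x, target):
        # Return a training loss directly, for brevity
        return nn.functional.mse_loss(self.net(x), target)

model = TinyTTS()
# Load weights from someone else's (expensive) training run,
# instead of training from scratch yourself:
# model.load_state_dict(torch.load("pretrained.pt"))

# Then fine-tune briefly on your own small voice dataset
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # small LR: nudge, don't retrain
model.train()
for _ in range(100):          # stand-in for iterating over your own recordings
    x = torch.randn(8, 80)    # fake "your voice" batch
    loss = model(x, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```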
I'm by no means an ML expert though, so I could be totally wrong on this.
James Betker, in the tortoise-tts repo (which is similar), says he spent $15k on his home rig. I can't find right now how long the Tortoise model took to train, but I feel like I read him say weeks/months somewhere. Obviously there's all kinds of variation depending on coding efficiency and dataset size, but it's another data point:
https://nonint.com/2022/05/30/my-deep-learning-rig/
https://github.com/neonbjb/tortoise-tts
https://github.com/lucidrains/audiolm-pytorch