James Betker, in the tortoise-tts repo (which is similar), says he spent $15k on his home rig. I can't find right now how long the Tortoise model took to train, but I feel like I read him saying weeks/months somewhere. Obviously there are all kinds of variations depending on coding efficiency and dataset size, but it's another data point.
https://nonint.com/2022/05/30/my-deep-learning-rig/
https://github.com/neonbjb/tortoise-tts
Open source Tortoise-TTS has been able to do this for 6+ months now, and it's based on the same theory as DALL-E. From playing with Tortoise a bit over the last couple of weeks, it seems the issue is not so much accuracy anymore, rather how GPU-intensive it is to generate a voice clip of any meaningful duration. Tortoise takes ~5 seconds on a $1000 GPU (P5000) to produce one second of spoken text. There are cloud options (Colab, Paperspace, RunPod), but still https://github.com/neonbjb/tortoise-tts
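To get a feel for what that ~5 s of compute per 1 s of audio ratio means in practice, here's a back-of-the-envelope sketch. The speaking rate (~150 words/min) and the function name are my own assumptions for illustration, not anything from Tortoise itself:

```python
WORDS_PER_MINUTE = 150          # rough average speaking rate (assumption)
COMPUTE_PER_AUDIO_SECOND = 5.0  # seconds of P5000 time per second of speech (figure from above)

def estimated_generation_seconds(word_count: int) -> float:
    """Estimate wall-clock GPU seconds to synthesize `word_count` words of speech."""
    audio_seconds = word_count / WORDS_PER_MINUTE * 60
    return audio_seconds * COMPUTE_PER_AUDIO_SECOND

# A 1,000-word article -> ~400 s of audio -> ~2,000 s (~33 min) of GPU time.
print(round(estimated_generation_seconds(1000)))  # 2000
```

So even a medium-length article is a half-hour GPU job on that card, which is why the cost discussion above matters.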
Heh you might want to use an equivalent gaming GPU for the price comparison. Surely a thousand dollars spent on an RTX 4000 series card (Hopper) would outperform a P5000?
I agree though. Tortoise TTS did a lot of similar work, IIRC by a single person on their multi-GPU setup. Really impressive effort. Did they get a citation? They deserve one.
edit: reading other comments, it seems there is a misconception that the model takes 3 seconds to run? That isn't the case: it requires "just" 3 seconds of example audio to successfully clone a voice (for some definition of success).
The RTX 4000 only has 8 GB of memory, which means reducing the batch size (much slowness) and/or limiting how much text you can give it at once (meaning you have to break the text up at points other than sentence breaks).
The RTX 5000 maybe, but I'm not sure how much of a value improvement there is.
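The chunking problem itself is easy to sketch: with a per-call text budget, you'd ideally pack whole sentences into each chunk and only split mid-sentence when a single sentence exceeds the budget. This is a generic illustration of that idea, not Tortoise's actual splitting code, and the regex sentence split is deliberately naive:

```python
import re

def chunk_text(text: str, max_chars: int) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars.
    Falls back to a hard split only when one sentence is itself too long."""
    # Naive sentence split on ., !, or ? followed by whitespace (assumption).
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for s in sentences:
        if len(s) > max_chars:
            # Oversized sentence: flush what we have, then hard-split it.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(s[i:i + max_chars] for i in range(0, len(s), max_chars))
        elif len(current) + len(s) + 1 <= max_chars:
            current = f"{current} {s}".strip()
        else:
            chunks.append(current)
            current = s
    if current:
        chunks.append(current)
    return chunks

text = "Short one. Another short sentence here. A third."
print(chunk_text(text, 40))
```

With a generous budget this keeps every split on a sentence boundary; it's only when the budget is tiny (as a small-VRAM card effectively forces) that you get the ugly mid-sentence cuts described above.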
The commenter you're responding to is talking about Lovelace-architecture GeForce RTX 40x0 products. The Quadro line hasn't even been released yet on this architecture. You are talking about the specific Quadro RTX 4000 product, which is a TU104 (Turing arch, 2 gens behind, with 2560 processors and 8GB memory). The commenter you're responding to is referring to something like a GeForce RTX 4090, which sports an AD102 (Lovelace arch, with 16384 processors and 24GB memory).
You were merely an unfortunate casualty of Nvidia's product marketing scheme (and a commenter's slightly imprecise reference to it) here.
I'm pretty sure we all lost heh. Thanks for clarifying. Indeed, there were slight errors in my description and the other commenter was reasonable in assuming those other cards were in discussion.