
Making a TTS model with 1 minute of speech samples within 10 minutes - kyubyong
https://github.com/Kyubyong/speaker_adapted_tts
======
crowbahr
The Kate voice sounds a lot better than the Nick one (which has that robotic
gravel to it), but I think there is some question as to how much better it
would actually be with more data:

If trained on longer samples, does it start to hallucinate? Learning programs
are notorious for rough approximations failing to scale into useful detail.

Very cool work though!

~~~
spullara
If you go to the original you can hear what it sounds like when it is trained
over a lot more data.

------
arbie
The Notes section neatly showcases the characteristics of most of the publish-
fast-without-reproducibility "research" hitting arxiv starting mid-'17:

The paper didn't mention normalization, but without normalization I couldn't
get it to work. So I added layer normalization.

The paper fixed the learning rate to 0.001, but it didn't work for me. So I
decayed it.

I tried to train Text2Mel and SSRN simultaneously, but it didn't work. I guess
separating those two networks mitigates the burden of training.

The authors claimed that the model can be trained within a day, but
unfortunately the luck was not mine. However obviously this is much faster
than Tacotron as it uses only convolution layers.

The paper didn't mention dropouts. I applied them as I believe it helps for
regularization.
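
As a rough illustration of the learning-rate note above: the notes don't say
which schedule was used, so the exponential decay below is purely an
assumption, as are the names `decayed_lr`, `decay_rate` and `decay_steps`.
Only the 0.001 starting value comes from the quoted text.

```python
def decayed_lr(step, init_lr=0.001, decay_rate=0.5, decay_steps=10000):
    """Exponentially decayed learning rate (hypothetical schedule).

    The rate starts at init_lr and is multiplied by decay_rate once
    every decay_steps training steps, interpolating smoothly in between.
    """
    return init_lr * decay_rate ** (step / decay_steps)


if __name__ == "__main__":
    # Print the rate at a few points in training.
    for step in (0, 10000, 20000, 40000):
        print(step, decayed_lr(step))
```

Any schedule that shrinks the step size over time fits the note equally well;
this one is just short to write down.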

------
Crazyontap
Speaking as a person who doesn't know much about programming, how much time
does it take to generate 1 min of audio (or say 50 words)? Is it possible to
integrate this in my browser?

I heard a Google TTS demo (I think it was called DeepMind?) that sounds
extremely human-like, and I was wondering if it can be used to turn webpages
into speech (I have a few extensions in Chrome, but those voices sound very
robotic and it's hard to listen to them after 2 mins).

Anyway, congrats on making this. I'm nowhere near smart enough to understand
how it works, just that it's getting better and more human-like every day!

~~~
yorwba
If it's possible to fine-tune on 1 minute of audio in 10 minutes (which very
likely requires more than 10 passes), it should be possible to run this model
with real-time throughput (at least using a decent GPU).
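
To spell out that arithmetic as a rough sketch: it assumes the training time
is split evenly across passes and ignores data loading and other overhead.
The function name `training_realtime_factor` is made up for illustration. A
factor of 1.0 or more means one pass over the audio takes no longer than the
audio itself lasts.

```python
def training_realtime_factor(audio_seconds, train_seconds, passes):
    """Seconds of audio processed per wall-clock second during training.

    Assumes train_seconds is divided evenly over `passes` full passes
    through `audio_seconds` of speech (a simplifying assumption).
    """
    seconds_per_pass = train_seconds / passes
    return audio_seconds / seconds_per_pass


if __name__ == "__main__":
    # 1 minute of audio, 10 minutes of fine-tuning, varying pass counts.
    for passes in (5, 10, 20):
        print(passes, training_realtime_factor(60, 600, passes))
```

With exactly 10 passes the factor is 1.0, and it grows with more passes.
Synthesis does not run the same way as a training pass, so this is only a
rough estimate of what inference could manage.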

The technology is definitely intended for reading websites to users. I'm
actually not sure why Google hasn't integrated it into Chrome yet. Maybe they
prefer leaving the task to the OS-level accessibility tools.

~~~
londons_explore
The text-to-speech demos Google was doing are relatively computationally
expensive.

They couldn't afford for everyone to be text-to-speaking every web page...

------
petee
All things considered, the samples sound great. Maybe not exactly like the
real people, but it definitely captures a human feeling, versus my Android's
robot-lady.

------
kyubyong
I've updated the repo with additional demo samples trained on speech samples
from Modern Family celebs.

