
An open source implementation of DeepVoice 3: 2000-Speaker Neural Text-to-Speech - allenleein
https://r9y9.github.io/deepvoice3_pytorch/
======
dspig
The "chirping" artifacts are quite distracting. I wonder if they can be
avoided by randomizing the alignment of the final piecing together of the
audio?

Post-processing by convolving with a short noise burst removes them pretty
well, as that randomizes phase vs. frequency.
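That trick can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, not dspig's actual script; the function name and parameters are mine:

```python
import numpy as np

def smear_with_noise(x, sr=22050, burst_ms=30, seed=0):
    """Convolve a waveform with a short white-noise burst.

    A noise burst has a roughly flat magnitude spectrum but random
    phase, so convolving with it randomizes the phase of each frequency
    component and smears out phase-coherent "chirping" artifacts.
    """
    rng = np.random.default_rng(seed)
    n = int(sr * burst_ms / 1000)           # 30 ms -> 661 samples at 22.05 kHz
    burst = rng.standard_normal(n)
    burst /= np.sqrt(n)                     # keep overall energy comparable
    y = np.convolve(x, burst, mode="full")  # acts like a dense synthetic reverb
    return y / max(np.abs(y).max(), 1e-9)   # normalize to avoid clipping

# Example: a 1-second synthetic tone standing in for TTS output.
sr = 22050
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 440 * t)
y = smear_with_noise(x, sr=sr)
```

Because the convolution is effectively a short dense reverb, it also adds the boxy "in a closet" coloration mentioned below.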

~~~
skykooler
Would it be possible to upload the outputs of this convolution? I'd like to
hear how it sounds.

~~~
dspig
Here's the second example convolved with 30ms white noise:

[https://drive.google.com/file/d/1WwmCNwWwukhtYXRT4mDHxCUa0sQ...](https://drive.google.com/file/d/1WwmCNwWwukhtYXRT4mDHxCUa0sQbdy1s/view?usp=sharing)

But it also makes it sound like she's in a closet, so it would be better to
fix it at source if there's a way to do that.

------
rglullis
If I understood it correctly, the datasets being used are all on the order of
~20 hours. Could the results be improved with a different dataset? Say,
Mozilla's Common Voice[1]? They are already at 350 hours of labeled speech.

[1]: [https://voice.mozilla.org](https://voice.mozilla.org)

~~~
IshKebab
Maybe, but Google's Tacotron 2 also used only about 24 hours of speech and
gives results that are indistinguishable from human speech.

------
tekkk
Recently I started digging into text-to-speech implementations, and Google
seems to have the best know-how.

[https://google.github.io/tacotron/publications/global_style_...](https://google.github.io/tacotron/publications/global_style_tokens/index.html)

Those samples already sound frighteningly human.

Not to disparage DeepVoice3 in any way! A lot of work must have gone into it.

~~~
romaniv
_> Google seems to have the best know-how._

Funny how they publish their "research papers", yet no one else is able to
implement their engine with even remotely comparable results.

~~~
mkagenius
Oh, is it not verifiable?

------
thom
I think when Tacotron-level speech synthesis is feasible to create on mobiles,
offline, some really interesting opportunities for new apps open up. Right now
you wouldn't want to listen to a long-read article on the web read by speech
synthesis, but the moment a system can create realistic, emotionally-accurate
speech (especially if you can match quotes/dialogue to correct, gendered
voices), you'd probably consider it when you were on the go.

------
pmuk
Does this mean all banks that are using their customers’ voices as passwords
are going to have a big problem?

[https://www.nuance.com/en-gb/omni-channel-customer-engagemen...](https://www.nuance.com/en-gb/omni-channel-customer-engagement/security/identification-and-verification.html)

~~~
imustbeevil
A bigger problem than banks using 8 character strings as passwords?

~~~
madmulita
8? When did they double it? And since when can I use anything but digits? /s

------
zakki
Hi HN readers, What do I need to learn to make text to speech in my own
language?

~~~
tekkk
A big dataset, probably audiobooks in your language with full transcripts.
Then fiddle around with your choice of model for training; this might be a
good place to start:
[https://github.com/Kyubyong/tacotron](https://github.com/Kyubyong/tacotron)

Just know that the voice will be similar to what Kyubyong or others managed to
train, meaning it will sound eerily synthetic. That might fit your purposes,
but it's probably not enough for consumer applications. Also, from what I
played around with, optimizing the synthesis speed is going to be a big hurdle
if you want it done quickly. Or maybe not; I didn't dig that deep into it.

~~~
zakki
Thanks

------
StudentStuff
Not bad. For IVRs and the like, a smoother, more neutral TTS voice (esp. one
that can be tuned by region) is extremely handy.

------
jacksmith21006
Saw this from DeepMind and thought I would share it here, as I found it
interesting.

[https://cloudplatform.googleblog.com/2018/03/introducing-Clo...](https://cloudplatform.googleblog.com/2018/03/introducing-Cloud-Text-to-Speech-powered-by-Deepmind-WaveNet-technology.html)

Supposedly it pushes 16k samples a second through a NN, which seems hard to
believe. But it gets you a pretty incredible result.
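For context on why 16k samples a second is surprising: WaveNet-style models generate audio autoregressively, one sample at a time, so a second of 16 kHz audio needs 16,000 sequential network evaluations. A toy sketch (the `model` stand-in and all names here are mine, not DeepMind's code):

```python
import numpy as np

def sample_autoregressive(model, n_samples, receptive_field=1024):
    """Naive autoregressive sampling loop: one model call per sample.

    Each output sample is conditioned on the previous `receptive_field`
    samples, so the calls cannot be parallelized across time. This is
    why real-time WaveNet synthesis required heavy engineering work.
    """
    out = np.zeros(n_samples + receptive_field)  # zero-padded history
    for i in range(n_samples):
        context = out[i : i + receptive_field]     # previous samples
        out[i + receptive_field] = model(context)  # one forward pass each
    return out[receptive_field:]

# Toy "model": a decayed copy of the last sample plus a little noise.
rng = np.random.default_rng(0)
toy = lambda ctx: 0.99 * ctx[-1] + 0.01 * rng.standard_normal()
audio = sample_autoregressive(toy, n_samples=16000)  # one "second" at 16 kHz
```

Even with a trivial stand-in model, the loop body runs 16,000 times per second of audio; with a real deep network in the loop, that is the bottleneck.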

------
nl
The URL says PyTorch but the installation instructions say TensorFlow.

It seems to be all PyTorch in the code, though.

------
just_a_fella
Is it just me, or does Apple's new Siri voice sound much more human than
Google's WaveNet? By a huge margin.

[https://machinelearning.apple.com/2017/08/06/siri-voices.htm...](https://machinelearning.apple.com/2017/08/06/siri-voices.html)

