
Deep Voice: Real-Time Neural Text-To-Speech - PieSquared
http://research.baidu.com/deep-voice-production-quality-text-speech-system-constructed-entirely-deep-neural-networks/
======
PieSquared
Hey there! I'm one of the authors of the paper and I'm happy to answer any
questions anyone may have!

Make sure to check out the paper on arXiv as well.

~~~
pain_perdu
How close (# of years?) are we to being able to replicate the voice of any
given individual with sufficient samples of their voiceprint?

~~~
qq66
"My voice is my passport."

~~~
M_Grey
Verify me.

~~~
wiz21c
Ahahahaha!

------
mrmaximus
Interesting. This is not TTS as we are accustomed to it: they are replicating
a specific person's voice with TTS. Listen to the ground-truth recordings at
the bottom and then the synthesized versions above. "Fake News" is about to
get a lot more compelling when you can make anyone say anything as long as you
have some previous recordings of their voice.

~~~
modeless
> you can make anyone say anything as long as you have some previous
> recordings of their voice.

That's not what this is doing. They're simply resynthesizing exactly what the
person said, in the same voice. It's essentially cheating because they can use
the real person's inflection. Generating correct inflection is the hardest
part of speech synthesis because doing it perfectly requires a complete
understanding of the meaning of the text.

The top two are representative of what it sounds like when doing true text to
speech. The middle five are just resynthesis of a clip saying the exact same
thing. And even in that case, it doesn't always sound good. The fourth one is
practically unintelligible. But it's interesting because it demonstrates an
upper bound on the quality of the voice synthesis possible with their system
given perfect inflection as input.

To clarify, this is cool work, the real-time aspect sounds great, and I'm sure
it will lead to even more impressive results in the future. But I don't want
people to think that all of the clips on this page represent their current
text-to-speech quality.

~~~
PieSquared
Thank you for clarifying this! We tried fairly hard to make this clear,
because, as you say, the hard part is generating inflection and duration that
sound natural. There's still a ton of work left to do in this direction;
we're clearly nowhere near being able to generate human-level speech.

Our work is meant to make TTS easier for deep learning researchers to work
with, by describing a complete system that can be trained entirely from data,
and to demonstrate that neural vocoder substitutes can actually be deployed to
streaming production servers. Future work (both by us and hopefully by other
groups) will make further progress on inflection synthesis!

~~~
mrmaximus
My "Fake News" comment aside, I think what y'all are doing could be
transformational for many reasons. Imagine a scenario where a person loses a
loved one, and similar technology is able to allow them to "have
conversations" with the deceased as a form of healing and closure. Not to
mention, this could add a personal touch to assistant bots that will make them
a pleasure to use.

------
slay2k
How soon before you make an API available? In other words, how do I make use
of Deep Voice for my own applications?

~~~
PieSquared
Right now, we do not have plans to make an API available. This paper and blog
post are mostly meant to describe our techniques to other deep learning
researchers and spur innovation in the field. However, we hope that these
techniques will be available eventually, and we'll provide more information
when that happens.

~~~
rocky1138
So as not to miss this announcement, do you have a mailing list we could sign
up for to be notified when this becomes available? You have a LOT of people
interested.

------
chikiuso
That's great! When will the code/service be available to the public?

------
Elv13
Semi-related to the Baidu speech research:
[http://chrislord.net/index.php/2017/02/23/machine-learning-speech-recognition/](http://chrislord.net/index.php/2017/02/23/machine-learning-speech-recognition/)

The work is done by Mozilla.

------
dresaj8
Does anyone know of good ways to do the opposite, speech to text?

~~~
forthefuture
Depends on how good you need it to be. Chrome supports the SpeechRecognition
API.

[https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecognition](https://developer.mozilla.org/en-US/docs/Web/API/SpeechRecognition)
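
For reference, a minimal sketch of what using it looks like (Chrome ships the
constructor behind a webkit prefix, so the untyped `window` lookup below is a
workaround, not part of the standard API):

```typescript
// Minimal sketch: live transcription in the browser via the Web Speech API.
// Chrome exposes the prefixed webkitSpeechRecognition constructor, so we
// take whichever one exists on `window`; support varies by browser.
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition || (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionImpl();
recognition.lang = "en-US";
recognition.continuous = true;      // keep listening across pauses
recognition.interimResults = true;  // emit partial hypotheses while speaking

recognition.onresult = (event: any) => {
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i];
    if (result.isFinal) {
      console.log("final:", result[0].transcript);
    }
  }
};

recognition.start(); // in practice, call this from a user gesture
```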

~~~
dresaj8
I'm thinking more of ways to programmatically turn long audio files into
indexable text.

~~~
skykooler
Julius[1] can do this. But the accuracy depends on the language model you are
using, and unfortunately the free English language model (VoxForge) is not the
best.

[1] [http://julius.osdn.jp/en_index.php](http://julius.osdn.jp/en_index.php)

------
100ideas
> "We conclude that the main barrier to progress towards natural TTS lies with
> duration and fundamental frequency prediction, and our systems have not
> meaningfully progressed past the state of the art in that regard."

Who is working on this problem, and how?

~~~
RodolpheO
We're working on this. Here is a very early demo of Julian. Don't be
surprised: he sounds like a teenager with a high-pitched voice, recorded in
his bedroom, because that's how the sample library was recorded.
[https://soundcloud.com/komponant/julian-speech-demo](https://soundcloud.com/komponant/julian-speech-demo)
NB: the expressions (durations, F0) are manually adjusted, not predicted by a
NN. We've built a fully flexible text-to-voice engine, not the brain that goes
with it. But we're looking for people with experience in ML to work on this,
so feel free to contact us.

------
computerwizard
I have A LOT of PDFs I'd much rather listen to than read. Can't wait for
this!

~~~
visarga
I hacked a script on top of PDF.js to make it read the text aloud via TTS
while highlighting the words on the page. I'm a big fan of having the computer
speak to me.
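
The standard browser speechSynthesis API makes the speak-and-highlight part
fairly direct; a rough sketch of the idea (the `highlightWordAt` helper is
hypothetical, not from the original script):

```typescript
// Rough sketch: speak a page's text with the browser's built-in TTS while
// highlighting the word being spoken. `highlightWordAt` is a hypothetical
// helper that maps a character offset back to a text span on the page.
declare function highlightWordAt(charIndex: number): void;

function speakAndHighlight(pageText: string): void {
  const utterance = new SpeechSynthesisUtterance(pageText);

  // The boundary event fires at word (and sentence) boundaries, carrying
  // the character offset into the utterance text.
  utterance.onboundary = (event: SpeechSynthesisEvent) => {
    if (event.name === "word") {
      highlightWordAt(event.charIndex);
    }
  };

  speechSynthesis.speak(utterance);
}
```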

------
monk_e_boy
OK, that went from uncanny valley to flipping amazing. I could picture the
person speaking. An old lady. A young woman. It was hard to picture an
algorithm in a machine.

It's amazing that it all boils down to 1s and 0s and some Boolean logic.

~~~
taejavu
You've misunderstood what you're listening to, I suggest reading the post
again.

The recordings at the bottom are just recordings of an old lady and a young
woman.

~~~
monk_e_boy
Yeah, I understood that. The ones in the middle are generated using their
voices. You don't find that amazing?

~~~
gcr
I mean, it's sort of amazing, but it wasn't completely generated by machine.
Those sound clips in the middle were generated by copying the inflections from
actual recordings, not by generating the inflections from scratch. It sounds
like their current true text-to-speech output is the robotic voices at the
very top.

------
Dowwie
Has anyone seen this yet?
[https://www.youtube.com/watch?v=XfcqBElF0ZI](https://www.youtube.com/watch?v=XfcqBElF0ZI)

So many innovations happening in voice-related technology.

------
whodunser
It says they trained on a 20-hour subset of a speech corpus. Will larger
datasets influence the future of TTS?

------
hprotagonist
How does this stack up against WaveNet?

~~~
methou
It's in the abstract. "... For the audio synthesis model, we implement a
variant of WaveNet that requires fewer parameters and trains faster than the
original ..."[1]

[1]: [https://arxiv.org/abs/1702.07825](https://arxiv.org/abs/1702.07825)

~~~
Smerity
Disclosure: I'm one of the co-authors of the QRNN paper (James Bradbury,
Stephen Merity, Caiming Xiong, Richard Socher) produced by Salesforce
Research.

There are many interesting advances that the Deep Voice paper and
implementation make, but the part I'm excited by (and which might be
transferable to other tasks that use RNNs) is showing that QRNNs are indeed
generalizable to speech too, in this case in place of WaveNet.

"WaveNet uses transposed convolutions for upsampling and conditioning. We find
that our models perform better, train faster, and require fewer parameters if
we instead first encode the inputs with a stack of bidirectional quasi-RNN
(QRNN) layers (Bradbury et al., 2016) and then perform upsampling by
repetition to the desired frequency."

QRNNs are a variant of recurrent neural networks. They're up to 16 times
faster than even Nvidia's highly optimized cuDNN LSTM implementation and give
comparable or better accuracy on many tasks. This is the first time they have
been tried in speech, and seeing the advantages hold across the board (better,
faster, smaller) is brilliant!
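
For a feel of why QRNNs parallelize so well: the convolutions that produce the
candidate vectors and gates run over all timesteps at once, and the only
sequential work is a cheap elementwise recurrence ("f-pooling" in Bradbury et
al., 2016). A toy sketch of just that recurrence over plain arrays
(illustrative only, not the paper's implementation):

```typescript
// Toy sketch of QRNN f-pooling: h_t = f_t * h_{t-1} + (1 - f_t) * z_t.
// z (candidate vectors, from a tanh convolution over the sequence) and
// f (forget gates, from a sigmoid convolution) are already computed in
// parallel for every timestep; only this elementwise loop is sequential.
function fPooling(z: number[][], f: number[][]): number[][] {
  const hiddenSize = z[0].length;
  const outputs: number[][] = [];
  let prev: number[] = new Array(hiddenSize).fill(0); // h_0 = 0

  for (let t = 0; t < z.length; t++) {
    const h: number[] = new Array(hiddenSize);
    for (let j = 0; j < hiddenSize; j++) {
      h[j] = f[t][j] * prev[j] + (1 - f[t][j]) * z[t][j];
    }
    outputs.push(h);
    prev = h;
  }
  return outputs;
}
```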

If you're interested in the technical details, our blog post[1] provides a
broader overview and our paper[2] goes deeper.

[1]: [https://metamind.io/research/new-neural-network-building-block-allows-faster-and-more-accurate-text-understanding](https://metamind.io/research/new-neural-network-building-block-allows-faster-and-more-accurate-text-understanding)

[2]: [https://arxiv.org/abs/1611.01576](https://arxiv.org/abs/1611.01576)

------
m210658
Very nice paper; one of my colleagues discovered it. I have been trying to
understand the details, but I do not see how your stacked dilated layers are
arranged. "d" is mentioned once, but no description is given.

------
ymow
It's awesome!

------
bayjingsf
Great work!

------
kayoone
If I understand this correctly, it's a pretty big achievement on the way to
being able to replicate any person's voice in the future, given enough audio
samples. Amazing. Similarly, I have seen lip movement (talking) replicated
using machine learning. Having completely artificial (or even real) identities
say whatever you want them to on video is not that far off, I guess (simpler
than general AI or even fully self-driving cars), which is both amazing and
terrifying.

