
Baidu Deep Voice Explained: Part 1 – the Inference Pipeline - sethbannon
https://medium.com/athelas/paper-1-baidus-deep-voice-675a323705df#.lbmwd3u9t
======
whodunser
This post on Deep Voice seems a little off the mark. In fact, I would say it
is completely misleading about the technical accomplishments here.

From my perspective, Baidu's approach is a little embarrassing, with so many
separate modeling stages in training and in production TTS. When the rest
of the community is moving towards end-to-end training, relying on this
many stages sounds excruciating. Merlin[0], which was a pretty good standard
for 2016, has this painful feeling as well, with two DL stages (duration,
acoustic) followed by some conditioning and then a synthesis step.
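
To make that staged structure concrete, here's a toy sketch (my own, not Merlin's or Baidu's actual code) of the data flow; every function is a hypothetical placeholder.

```python
# Toy sketch of the classic multi-stage TTS pipeline described above.
# None of these functions come from Merlin or Deep Voice; they are dummy
# placeholders that only illustrate the data flow between stages.
import numpy as np

def text_frontend(text):
    # Linguistic features, one row per phone (placeholder: random values).
    return np.random.rand(len(text.split()), 10)

def duration_model(labels):
    # DL stage 1: predict how many acoustic frames each phone lasts.
    return np.full(len(labels), 20, dtype=int)

def acoustic_model(frame_labels):
    # DL stage 2: predict per-frame acoustic features (f0, spectrum, ...).
    return np.random.rand(len(frame_labels), 63)

def vocoder(acoustic_features):
    # Final signal-processing synthesis step (e.g. a WORLD-style vocoder).
    return np.zeros(len(acoustic_features) * 80)

labels = text_frontend("a toy sentence")
durations = duration_model(labels)
frame_labels = np.repeat(labels, durations, axis=0)   # "conditioning" on timing
audio = vocoder(acoustic_model(frame_labels))
```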

The more important technical contribution seems to be the hand-tuned synthesis
code that makes their generation faster; cool but not particularly sexy (and
there are few details). The details on training hyperparams are nice to have
too, of course.

Contrary to the post, I would be very surprised if the voice sample included
in the post was actually generated by Deep Voice -- it has none of the robotic
qualities pointed out by the researchers themselves in their blog post[1].
More likely it is a demonstration of the loss in their last, WaveNet-like
stage. This was also pointed out in the previous HN discussion[2].

Lastly, Andrew Ng is neither thanked in the paper nor mentioned on any webpage
-- are we sure this was work he supervised?

[0] [https://github.com/CSTR-Edinburgh/merlin](https://github.com/CSTR-Edinburgh/merlin)

[1] [http://research.baidu.com/deep-voice-production-quality-text-speech-system-constructed-entirely-deep-neural-networks/](http://research.baidu.com/deep-voice-production-quality-text-speech-system-constructed-entirely-deep-neural-networks/)

[2]
[https://news.ycombinator.com/item?id=13756489](https://news.ycombinator.com/item?id=13756489)

~~~
dhruvp
Thanks for your feedback.

- I state the caveat that the voice sample published for Deep Voice was
generated using ground truth features. That said, I can make it clearer.

- Andrew Ng runs the Baidu AI team. He may not have supervised it, but he's
associated with this.

- I've gotten direct feedback from an original author of this paper to ensure
the post represents its accomplishments well. At this point I believe it does,
save for that caveat.

------
reedlaw
Are there any how-to guides to getting something like this running? Or are
there missing pieces of closed-source software needed? I'm interested in the
theory but it helps to be able to try it out.

~~~
kastnerkyle
If you want to see the most basic form of this pipeline, I have a blog post on
"bad speech synthesis" [0]. There are open source versions of WaveNet for TTS
[1], but I have not run the code myself or seen the quality of the output. Our
code for char2wav is theoretically available [2], but not yet ready for
"how-to-guide" level use.

[0] [http://kastnerkyle.github.io/posts/bad-speech-synthesis-made-simple/](http://kastnerkyle.github.io/posts/bad-speech-synthesis-made-simple/)

[1] [https://github.com/buriburisuri/speech-to-text-wavenet](https://github.com/buriburisuri/speech-to-text-wavenet)

[2] [https://github.com/sotelo/parrot](https://github.com/sotelo/parrot)

------
gjm11
> The results of the paper speak for themselves.

Why yes, they do.

(I assume this was deliberate.)

------
kastnerkyle
Showing one of the top two samples from their blog (full prediction) along
with the one you link (only acoustic model) would make what you are getting
at in the explanation much clearer, since the text inference is where most
of the complexity in this model lies (given baseline knowledge of WaveNet, at
least). The sample shown only does step 3 from your summary as far as I can
tell.

In particular the top 2 samples from the Baidu blog most clearly show the gaps
we need to cover from a research perspective to truly get "human level" TTS -
a lot of the complexity in TTS is in the text part, and getting the subtleties
of stress and f0 prediction correct is far from a solved problem. This is
partly why there are so many different submodules and parts in the text piece
of DeepVoice, and a whole appendix dedicated to that in WaveNet.

It is still surprising to me how good the audio models from WaveNet and
DeepVoice are (such as the sample you show) - it seems that given good enough
text features (e.g. ground truth), the synthesis is nearly perfect. So it seems
(IMO) that the next research papers will be focused on the text/f0 part.

I will also plug the work we have been doing at MILA, which is related but
tries to directly go from text -> speech with attention based RNNs [0]. A
longer version of our paper should be on arxiv soon, but for now we have a
short teaser which was submitted as an ICLR workshop [1]. One fun feature is
that we have the ability to handle multi-speaker synthesis, and multi-speaker
datasets as well.

The primary differences, at a high level, are: we don't need a pronunciation
dictionary for training, but they (DeepVoice or WaveNet) do. However, we need a
vocoder (currently WORLD) to get intermediate targets for training, while they
do not. For training we need a directory of text files, and a directory of
audio files, along with a script to run WORLD on the audio.
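
For a concrete idea of what those intermediate targets look like, here is a minimal sketch (not our actual preprocessing script) using the pyworld bindings to WORLD; the directory layout and file names are hypothetical.

```python
# Minimal sketch: extract WORLD vocoder features (f0, spectral envelope,
# aperiodicity) from each wav file to use as intermediate training targets.
# Uses the pyworld bindings; paths are hypothetical.
import glob
import numpy as np
import pyworld as pw
import soundfile as sf

for wav_path in glob.glob("audio/*.wav"):
    x, fs = sf.read(wav_path)                      # mono float64 audio
    x = np.ascontiguousarray(x, dtype=np.float64)
    f0, sp, ap = pw.wav2world(x, fs)               # analysis in one call
    np.savez(wav_path.replace(".wav", "_world.npz"), f0=f0, sp=sp, ap=ap)
```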

This means char2wav can go to new languages without extra lexical information
(e.g. pronunciation dictionaries), since the WORLD assumptions are broad, but
we find we have a difficult time on English without using phonemes (from a
pronunciation dict), since the character -> audio mapping is pretty difficult
for English.
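
As a tiny illustration of what that lexical information buys you (assuming NLTK's packaged copy of CMUdict), a pronunciation dictionary maps words straight to phonemes, sidestepping English's irregular spelling-to-sound rules:

```python
# Look up ARPAbet phoneme sequences in CMUdict via NLTK (assumed setup).
import nltk
nltk.download("cmudict", quiet=True)
from nltk.corpus import cmudict

pron = cmudict.dict()
print(pron["though"][0])   # ['DH', 'OW1']
print(pron["tough"][0])    # ['T', 'AH1', 'F']  -- same "-ough", different sounds
```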

We have better outputs (compared to char2wav English) on Spanish, German, and
Romanian [2] though we are still improving every day. Note that the Romanian
is still using fixed synthesis with WORLD - this is a demo only of the text ->
WORLD features part of the model, the "reader".
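
"Fixed synthesis with WORLD" here just means the predicted features are turned back into a waveform by the vocoder rather than by a learned model; a minimal sketch with pyworld (file names and sample rate are assumptions):

```python
# Minimal sketch: resynthesize audio from predicted WORLD features.
import numpy as np
import pyworld as pw
import soundfile as sf

feats = np.load("sample_world.npz")        # hypothetical f0 / sp / ap arrays
fs = 16000                                 # must match the analysis sample rate
y = pw.synthesize(feats["f0"], feats["sp"], feats["ap"], fs)
sf.write("resynthesized.wav", y, fs)
```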

The end result of both char2wav and DeepVoice (being able to go from text ->
speech directly in generation) is the same. WaveNet also has this - you can
see most of the TTS details in the appendix of the paper [3].

There is also a good discussion on reddit with one of the DeepVoice authors
that has quite a bit of detail about their approach; I found it quite helpful
[4].

Heiga Zen has a great overview talk on youtube for people who are interested
in TTS [5].

There are also some interesting extensions on WaveNet [6][7], though I have
not translated enough to determine what exact methods are being used.

[0]
[http://www.josesotelo.com/speechsynthesis/](http://www.josesotelo.com/speechsynthesis/)

[1]
[https://openreview.net/pdf?id=B1VWyySKx](https://openreview.net/pdf?id=B1VWyySKx)

[2]
[https://www.youtube.com/watch?v=cwnDjq33uMs](https://www.youtube.com/watch?v=cwnDjq33uMs)

[3] [https://arxiv.org/abs/1609.03499](https://arxiv.org/abs/1609.03499)

[4]
[https://www.reddit.com/r/MachineLearning/comments/5wosbm/r_deep_voice_realtime_neural_texttospeech/](https://www.reddit.com/r/MachineLearning/comments/5wosbm/r_deep_voice_realtime_neural_texttospeech/)

[5]
[https://www.youtube.com/watch?v=nsrSrYtKkT8](https://www.youtube.com/watch?v=nsrSrYtKkT8)

[6]
[https://twitter.com/ballforest/status/838759080621568002](https://twitter.com/ballforest/status/838759080621568002)

[7]
[https://twitter.com/ballforest/status/836803448435789828](https://twitter.com/ballforest/status/836803448435789828)

