
A 2019 Guide to Speech Synthesis with Deep Learning - mwitiderrick
https://heartbeat.fritz.ai/a-2019-guide-to-speech-synthesis-with-deep-learning-630afcafb9dd
======
PieSquared
I'm an author on a few of these papers referenced (the Deep Voice papers from
Baidu). I'm happy to answer any questions folks may have about neural speech
synthesis, as I've been working on this for several years now.

In general, it's a fascinating space. There are challenges in text processing
(not even mentioned in the blog), such as grapheme-to-phoneme conversion,
part-of-speech detection, word sense disambiguation, and text normalization;
challenges in utterance-level modeling (spectrograms); and challenges in
"spectrogram inversion" / waveform synthesis. The NLP components of the
pipeline are often overlooked but are no less important than they were a few
years ago -- part of speech / word sense is the difference between "Time is a
CONstruct" and "I'm going to conSTRUCT a tower", and the difference between
"Let's drop that bass" being about a DJ or about a fish. The acoustic modeling
phase (e.g. Tacotron, Deep Voice 3) works fairly well and can produce some
awesome demos with things like style tokens ("GST-Tacotron"), but it still has
a ways to go before it can encompass the full range of human inflection and
emotion. At the waveform synthesis level, models like WaveRNN (with subscale
modeling) and Parallel WaveNet make it possible to deploy modern waveform
synthesis models, but deploying them on low-power devices remains a major
issue due to compute restrictions. Overall, there are lots of interesting
challenges to work on, and we're making a lot of progress quite quickly -- and
I haven't even started talking about voice conversion or voice cloning!
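To make the word-sense point concrete, here's a minimal, purely illustrative sketch of POS-driven homograph disambiguation. Everything here is made up for the example (the tags, the lexicon, and the pronunciation strings are hypothetical); real TTS front ends use trained taggers and pronunciation dictionaries like CMUdict:

```python
# Hypothetical homograph lexicon: (word, tag) -> pronunciation.
# Stress shown in caps; pronunciations are invented for illustration only.
HOMOGRAPHS = {
    ("construct", "NOUN"): "KAHN-struhkt",   # "Time is a CONstruct"
    ("construct", "VERB"): "kuhn-STRUHKT",   # "I'm going to conSTRUCT a tower"
    ("bass", "NOUN_music"): "bayss",         # the DJ's bass
    ("bass", "NOUN_fish"): "bass",           # the fish
}

def pronounce(word, tag):
    """Pick a pronunciation using the POS / word-sense tag; fall back to
    the raw spelling when the word isn't a known homograph."""
    return HOMOGRAPHS.get((word.lower(), tag), word.lower())
```

A real system would get the tag from a trained part-of-speech tagger or word-sense disambiguator rather than having it handed in, but the lookup structure is the same idea.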

~~~
bravura
What do you think is the best neural network currently for processing and
possibly generating 44.1 kHz music audio data?

If we're stuck with downsampling to 16 kHz, my question still stands.

~~~
PieSquared
I don't think anything about the current set of tools is specific to sample
rate; WaveNet, Tacotron, WaveRNN, etc., should work fine to generate 44.1 kHz
audio. They might just need slightly different hyperparameters or sizes to
work well, or may take longer to train due to longer sequence lengths.
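A quick back-of-envelope (plain arithmetic, not tied to any particular model) on why the higher sample rate means longer sequences: an autoregressive vocoder like WaveNet or WaveRNN emits one sample per step, so sequence length scales linearly with sample rate.

```python
# Steps an autoregressive vocoder must take per second of audio,
# at the two sample rates in question.
samples_16k = 16_000   # 16 kHz: 16,000 steps per second of audio
samples_44k = 44_100   # 44.1 kHz: 44,100 steps per second of audio

ratio = samples_44k / samples_16k
print(f"44.1 kHz sequences are {ratio:.2f}x longer than 16 kHz")  # ~2.76x
```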

------
juris-ws
We're experimenting with this in combination with deepfake videos.
[https://wiserstate.com](https://wiserstate.com)

Spooky to think that one day we might be able to digitally "clone" ourselves
this way.

------
aswanson
Well written. Makes me want to open a Medium account and explain something to
make sure I'm not getting rusty.

~~~
ghaff
Or your own blog. Though, honestly, for very occasional stuff, Medium probably
makes more sense.

~~~
mgradowski
Pardon the interruption, but free static page hosting would respect the reader
a little more than Medium does.

~~~
ghaff
Personally I would (and do) go with a free hosted service like Blogger, which
will handle big traffic spikes. But if someone just wants to push out a blog
post or two a year, I'd be hard put to argue against Medium.

------
amelius
Repeated typo. "Casual convolutions" should probably be "causal convolutions".

~~~
mwitiderrick
Thanks for noticing that.

