
Varying Speaking Styles with Neural Text-to-Speech - georgecarlyle76
https://developer.amazon.com/blogs/alexa/post/7ab9665a-0536-4be2-aaad-18281ec59af8/varying-speaking-styles-with-neural-text-to-speech
======
mrleiter
Before Charlie Brooker started producing Black Mirror, he was already a highly
observant critic of society and media. At one point he produced a generic news
snippet about an unspecified event, pointing out all the standard video shots
and animations used these days to report on a topic [1]. He presented this
"report" with the standard news-speak intonation.

With this achievement from Amazon, and the recent development of AI news
anchors in China [2], it is not far-fetched to imagine news reporting being
done mostly by machines. Yes, some videography must still be done manually,
but the rest could be automated.

What that means for the TV news industry, I don't know, but it certainly
deserves discussion.

[1]
[https://www.youtube.com/watch?v=aHun58mz3vI](https://www.youtube.com/watch?v=aHun58mz3vI)

[2] [https://www.fastcompany.com/90264587/watch-chinas-new-ai-anchor-read-the-news](https://www.fastcompany.com/90264587/watch-chinas-new-ai-anchor-read-the-news)

~~~
richrichardsson
I wish they would commission Newswipe again; it was one of the best shows on
the BBC.

------
metildaa
For those looking for something not hosted by a megacorp, check out Mozilla's
Text to Speech:
[https://github.com/mozilla/TTS/blob/master/README.md](https://github.com/mozilla/TTS/blob/master/README.md)

Audio Samples:
[https://soundcloud.com/user-565970875](https://soundcloud.com/user-565970875)

If you're interested in helping improve this, spread the word about Mozilla's
Common Voice project, a public speech corpus that is easy to contribute to and
makes high-caliber TTS and transcription possible outside of walled gardens:
[https://voice.mozilla.org](https://voice.mozilla.org)

Mozilla DeepSpeech also does Speech to Text surprisingly well:
[https://github.com/mozilla/DeepSpeech](https://github.com/mozilla/DeepSpeech)
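For anyone who wants to kick the tires, a minimal sketch of the DeepSpeech
Python bindings (this assumes a recent release and its published pretrained
model files; older versions took extra constructor arguments):

```python
# pip install deepspeech, plus the pretrained model files
# (.pbmm / .scorer) from the DeepSpeech releases page.
import wave
import numpy as np
from deepspeech import Model

ds = Model("deepspeech-0.9.3-models.pbmm")
ds.enableExternalScorer("deepspeech-0.9.3-models.scorer")

# DeepSpeech expects 16 kHz, 16-bit mono PCM.
with wave.open("audio.wav", "rb") as w:
    assert w.getframerate() == ds.sampleRate()
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio))
```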

~~~
kastnerkyle
Worth noting that a big chunk of the core TTS code here is built on tools from
other researchers like Ryuichi Yamamoto and Keith Ito [0], and they have great
implementations to check out as well.

The best quality I have heard in OSS is probably [1], from Ryuichi using the
Tacotron 2 implementation of Rayhane Mamah [2], which is loosely what NVIDIA
based some of their recent baseline code on as well [3][4].

There's also a Colab notebook for this stuff, so you can try it directly
without any pain:
[https://colab.research.google.com/github/r9y9/Colaboratory/blob/master/Tacotron2_and_WaveNet_text_to_speech_demo.ipynb](https://colab.research.google.com/github/r9y9/Colaboratory/blob/master/Tacotron2_and_WaveNet_text_to_speech_demo.ipynb)
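If you'd rather run locally, NVIDIA also ships pretrained checkpoints for
[3][4] behind torch.hub entrypoints; a minimal sketch, assuming those
entrypoints (`nvidia_tacotron2` etc. are their published names, but treat the
exact calls as illustrative rather than pinned to the repos as linked above):

```python
import torch

# Pretrained Tacotron 2 (text -> mel) and WaveGlow (mel -> audio).
tacotron2 = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_tacotron2")
tacotron2 = tacotron2.to("cuda").eval()

waveglow = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_waveglow")
waveglow = waveglow.remove_weightnorm(waveglow).to("cuda").eval()

utils = torch.hub.load("NVIDIA/DeepLearningExamples:torchhub", "nvidia_tts_utils")
sequences, lengths = utils.prepare_input_sequence(["Hello world, this is a test."])

with torch.no_grad():
    mel, _, _ = tacotron2.infer(sequences, lengths)
    audio = waveglow.infer(mel)  # 22.05 kHz waveform tensor
```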

I also have my own pipeline for this (using some utilities from the above
authors plus a lot of my own hacks), for a forthcoming paper release here:
[https://github.com/kastnerkyle/representation_mixing/tree/master/pretrained](https://github.com/kastnerkyle/representation_mixing/tree/master/pretrained)
See the minimal demo. It has pretty fast sampling, but the audio quality is
not as high as WaveNet's. I'd really like to tie in WaveGlow, but that's still
a work in progress for me.

NOTE: None of these have voice adaptivity per se, but given a model that
already trains well plus a multispeaker dataset with speaker IDs, such as
VCTK, a lot of things become possible; most of the difficulty is in getting a
baseline model and data pipeline for TTS at all.
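To make the multispeaker point concrete, the usual trick is just a learned
embedding per speaker ID, broadcast over time and concatenated onto the
encoder states so the decoder can condition on it. A minimal PyTorch sketch
(shapes and names are mine, not taken from any of the repos above):

```python
import torch
import torch.nn as nn

class SpeakerConditioner(nn.Module):
    """Concatenate a learned per-speaker embedding onto encoder states."""
    def __init__(self, n_speakers, spk_dim):
        super().__init__()
        self.spk_table = nn.Embedding(n_speakers, spk_dim)

    def forward(self, encoder_states, speaker_ids):
        # encoder_states: (batch, time, enc_dim); speaker_ids: (batch,)
        spk = self.spk_table(speaker_ids)                    # (batch, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, encoder_states.size(1), -1)
        return torch.cat([encoder_states, spk], dim=-1)      # (batch, time, enc_dim + spk_dim)

# e.g. VCTK has ~109 speakers
cond = SpeakerConditioner(n_speakers=109, spk_dim=64)
states = torch.randn(2, 100, 512)
out = cond(states, torch.tensor([3, 57]))  # (2, 100, 576)
```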

[0]
[https://github.com/keithito/tacotron](https://github.com/keithito/tacotron)

[1]
[https://r9y9.github.io/blog/2018/05/20/tacotron2/](https://r9y9.github.io/blog/2018/05/20/tacotron2/)

[2] [https://github.com/Rayhane-mamah/Tacotron-2](https://github.com/Rayhane-mamah/Tacotron-2)

[3] [https://github.com/NVIDIA/waveglow](https://github.com/NVIDIA/waveglow)

[4] [https://github.com/NVIDIA/tacotron2](https://github.com/NVIDIA/tacotron2)

~~~
metildaa
Building that 5000+ hour dataset needed to train quality Speech to Text is a
serious challenge, and presumably TTS has a similar threshold of audio needed.

IMO that is why it is critical to spread the word about Common Voice (a CC0
licensed voice corpus) and get a large variety of people contributing to it:
[https://voice.mozilla.org](https://voice.mozilla.org)

~~~
kastnerkyle
Much less audio is potentially needed for TTS than for ASR; however, the
spread and quality of the TTS dataset is critical, which is one reason why
just training on ASR datasets "in reverse" hasn't worked well. For example,
commercial databases run ~25 to 50 hours, but their "coverage" of the language
is usually very different from, e.g., audiobooks, and focuses specifically on
covering edge cases of the language. You can think of it as a 25-hour "support
set" which covers as many cases as possible, and can also grow over time as
users run into cases where the system fails.

This all gets worse if you want multi-speaker output, of course; getting even
a few speakers who all read the same large corpus is difficult. The two
datasets I've gotten the most out of so far are "LJ Speech" (a subset of the
LibriVox corpus) and the "Nancy Corpus / Blizzard 2013" dataset [0][1].

There's a pretty interesting corpus here that I hope to start using soon [2].

To me, the biggest issue / gap between commercial interests and publicly
available data is curation: TTS really hinges on well-curated, clean data, at
least for now. And if that dataset has very balanced coverage of triphones,
that's even better.
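Coverage is also something you can measure directly: given phone-level
transcripts, count how many distinct triphones a candidate corpus hits. A toy
sketch in Python (`load_phone_transcripts` is a hypothetical helper; substitute
whatever your text frontend produces):

```python
from collections import Counter

def triphones(phones):
    """Yield (left, center, right) triphone contexts, with silence padding at the edges."""
    padded = ["sil"] + list(phones) + ["sil"]
    for i in range(1, len(padded) - 1):
        yield tuple(padded[i - 1:i + 2])

def coverage(corpus_utterances):
    """corpus_utterances: iterable of phone sequences, e.g. [['HH','AH','L','OW'], ...]"""
    return Counter(t for utt in corpus_utterances for t in triphones(utt))

# Compare a small "support set" against a big audiobook corpus:
# support = coverage(load_phone_transcripts("support_set"))   # hypothetical helper
# books = coverage(load_phone_transcripts("audiobooks"))
# print(len(support), "distinct triphones vs", len(books))
```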

I'd like to try the voice.mozilla.org data, but given current struggles on
even one speaker, a truly "in the wild" set of many speakers seems pretty
difficult if training from scratch. For voice cloning using pretrained weights
it may be a different story.

[0] [https://keithito.com/LJ-Speech-Dataset/](https://keithito.com/LJ-Speech-Dataset/)

[1]
[https://www.synsig.org/index.php/Blizzard_Challenge_2013](https://www.synsig.org/index.php/Blizzard_Challenge_2013)

[2] [http://www.m-ailabs.bayern/en/the-mailabs-speech-dataset/](http://www.m-ailabs.bayern/en/the-mailabs-speech-dataset/)

------
kastnerkyle
This type of concatenation I first saw in Alex Graves' work on "Generating
Sequences With Recurrent Neural Networks" [0], including his unpublished TTS
demo [1]. Biasing with part of another sentence (as in handwriting) can
possibly improve style in TTS as well; see the sketch below.

We followed this approach in char2wav [2], but "voice cloning" has come much
further in my opinion [3][4][5]. There's a lot of relevant research on
techniques for this beyond concatenating indicators or embeddings, if people
are interested in the research side of this technology.
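For the biasing/priming idea, the mechanic from the handwriting work is to run
the trained model over a reference sample first, so the hidden state is pushed
toward that style, then free-run generation from that state. A sketch with a
plain GRU (the model and 80-dim frame features are placeholders, not char2wav
itself):

```python
import torch
import torch.nn as nn

rnn = nn.GRU(input_size=80, hidden_size=256, batch_first=True)
readout = nn.Linear(256, 80)  # predicts the next frame

def generate(reference_frames, n_steps):
    """Prime on reference frames, then free-run generation from the biased state."""
    with torch.no_grad():
        # 1) Biasing pass: condition the hidden state on the reference style.
        _, h = rnn(reference_frames)            # reference: (1, T_ref, 80)
        # 2) Free-running pass: feed predictions back in, keeping the biased state.
        frame = reference_frames[:, -1:, :]
        outputs = []
        for _ in range(n_steps):
            out, h = rnn(frame, h)
            frame = readout(out)                # (1, 1, 80)
            outputs.append(frame)
        return torch.cat(outputs, dim=1)        # (1, n_steps, 80)

samples = generate(torch.randn(1, 50, 80), n_steps=200)
```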

[0] [https://arxiv.org/abs/1308.0850](https://arxiv.org/abs/1308.0850)

[1]
[https://www.youtube.com/watch?v=-yX1SYeDHbg&t=38m30s](https://www.youtube.com/watch?v=-yX1SYeDHbg&t=38m30s)

[2]
[http://josesotelo.com/speechsynthesis/](http://josesotelo.com/speechsynthesis/)

[3]
[https://twitter.com/Jeanne_Heo/status/972089715225542657](https://twitter.com/Jeanne_Heo/status/972089715225542657)
(lyrebird.ai)

[4]
[https://google.github.io/tacotron/publications/gmvae_controllable_tts/#multispk_en.sample](https://google.github.io/tacotron/publications/gmvae_controllable_tts/#multispk_en.sample)

[5]
[https://google.github.io/tacotron/publications/speaker_adaptation/index.html](https://google.github.io/tacotron/publications/speaker_adaptation/index.html)

------
crivlaldo
Is this an Alexa-only feature? When will it be available? Should we expect it
in Polly? Is Amazon going to provide other styles? I'm curious.

~~~
superasn
Yes, I too was hoping this would be available for Polly, like WaveNet is from
Google.

------
ai_ja_nai
Totally looks like Tacotron. I heard there was interest at Amazon in buying
the project; now it looks like they copied it.

------
Rebelgecko
Interesting stuff. I noticed a few weeks ago that Google Translate will
sometimes vary the pronunciation of words when you hit the audio playback
button repeatedly; I wonder if they're already doing something similar.

