Hacker News
Analyzing latent embedding capacity in Tacotron, Google's seq2seq TTS model
6 points by daisystanton on June 11, 2019 | 1 comment
Research paper: https://arxiv.org/abs/1906.03402

Audio examples: https://google.github.io/tacotron/publications/capacitron/

Capacitron is the Tacotron team's most recent contribution to the world of expressive end-to-end speech synthesis (e.g., transfer and control of prosody and speaking style). Our previous Style Tokens and prosody transfer work implicitly controlled reference embedding capacity by modifying the encoder architecture, thereby targeting a trade-off between text-specific transfer fidelity and text-agnostic style generality. Capacitron treats embedding capacity as a first-class citizen by targeting a specific value for the representational mutual information via a variational information bottleneck.
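For intuition, one common way to target a specific capacity with a variational bottleneck is to constrain the KL term of an ELBO to a chosen number of nats using a Lagrange multiplier updated by dual ascent. The sketch below is a toy illustration of that idea, not the paper's implementation; all names and the update rule are assumptions for illustration.

```python
import math

def kl_diag_gaussian(mu, logvar):
    # KL( N(mu, sigma^2) || N(0, I) ) summed over dimensions,
    # for a diagonal-Gaussian posterior.
    return sum(0.5 * (math.exp(lv) + m * m - 1.0 - lv)
               for m, lv in zip(mu, logvar))

def capacity_objective(recon_loss, kl, capacity, lam):
    # Lagrangian to minimize: reconstruction loss plus a penalty that
    # pushes the KL toward the target `capacity` (in nats).
    return recon_loss + lam * (kl - capacity)

def dual_ascent_step(lam, kl, capacity, lr=0.01):
    # Dual update (hypothetical schedule): raise lam when KL exceeds
    # the target, lower it otherwise; clamp lam to stay non-negative.
    return max(0.0, lam + lr * (kl - capacity))
```

In a training loop one would alternate gradient steps on the model parameters (minimizing `capacity_objective`) with `dual_ascent_step` updates on the multiplier, so the embedding's KL, an upper bound on the representational mutual information, settles near the chosen capacity.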

We also show that by modifying the stochastic reference encoder to match the form of the true latent posterior, we can achieve high-fidelity prosody transfer, text-agnostic style transfer, and natural-sounding prior samples in the same model. The modified encoder also addresses the pitch range preservation problems we observed during inter-speaker transfer in our past work.

Lastly, we show the capacity of the embedding can be decomposed hierarchically, allowing us to control the amount of sample-to-sample variation for transfer use cases.
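One way to picture a hierarchical decomposition of capacity is to give each level of a two-level latent its own KL budget, constrained by its own multiplier, so the coarse latent absorbs a fixed share of the total capacity and the fine latent carries the residual sample-to-sample variation. The sketch below is a hypothetical illustration of that idea, not the paper's formulation; all names and constants are assumptions.

```python
def hierarchical_capacity_loss(recon_loss, kl_high, kl_low,
                               c_high, c_low, lam_high, lam_low):
    # Constrain each level of the latent hierarchy to its own KL
    # target: the coarse ("high") latent gets c_high nats and the
    # fine ("low") latent gets c_low nats of the total budget.
    return (recon_loss
            + lam_high * (kl_high - c_high)
            + lam_low * (kl_low - c_low))
```

Shrinking `c_low` relative to `c_high` would then reduce sample-to-sample variation while preserving the coarse style information, which matches the kind of control described above.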

To appreciate the results fully, we recommend listening to the audio examples in conjunction with reading the paper.




When you post it this way we can't click on the link.

Just post a link to the paper and then we can discuss it.



