
Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model - backpropaganda
https://google.github.io/tacotron/
======
zachrose
The stress and intonation examples are what put this on the far side of the
uncanny valley. I've not heard computer speech like this, where the voice
seems to understand the idea underneath the words.

~~~
red75prime
It is without doubt very impressive, but on longer texts the lack of
understanding will be apparent. Wrong semantic stresses, incorrect
pronunciation of homographs, maybe inconsistent pronunciation of names, and so
on.

Perfect speech synthesis from unannotated text will require a much more
complex system. It will need a database of common knowledge in some form, the
ability to relate text to that database, and the ability to retain and use
this semantic information to guide synthesis. That is, it will have to be a
step closer to artificial general intelligence.

~~~
jerf
I'd say even in these samples there are still some errors. The intonation on
"brown" in "quick brown fox" is still wrong. "AllHipHop" is not pronounced as
a human would pronounce it, though I'll grant you that's a made-up word, so I
wouldn't even expect a non-native human speaker to necessarily get that one
right.

Still, the quality is such that I'm only observing this in the spirit of
scientific analysis. It's some good quality.

~~~
iamed2
I wonder whether it would have been more natural if it were written as
"quick, brown fox", as would be customary in English. Usually when adjectives
don't have commas between them, it's because the last adjective is coupled to
the noun.

------
cypher543
No source code, as usual.

How difficult would it be to duplicate these results with TensorFlow? Would
something like this require more than the building blocks that TF and other
toolkits provide? I have zero experience with machine learning, so I'm just
curious.
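For what it's worth, the blocks involved appear to be fairly standard ones
(embeddings, recurrent layers, content-based attention), all of which TF and
similar toolkits provide. As a minimal, framework-free sketch of the
attention step alone, here it is in plain NumPy; all sizes and weight values
are made up for illustration, not taken from the paper:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def content_based_attention(query, keys, Wq, Wk, v):
    """Additive (Bahdanau-style) attention: score each encoder
    timestep against the current decoder state, then take a
    weighted average of the encoder outputs."""
    # scores[t] = v . tanh(Wq·query + Wk·keys[t])
    scores = np.tanh(query @ Wq + keys @ Wk) @ v   # shape (T,)
    weights = softmax(scores)                      # attention distribution
    context = weights @ keys                       # weighted sum, shape (d,)
    return context, weights

rng = np.random.default_rng(0)
d, T = 8, 5                          # hidden size and encoder length (toy)
keys = rng.normal(size=(T, d))       # stand-in for encoder outputs
query = rng.normal(size=d)           # stand-in for the decoder state
Wq = rng.normal(size=(d, d))
Wk = rng.normal(size=(d, d))
v = rng.normal(size=d)

context, weights = content_based_attention(query, keys, Wq, Wk, v)
print(weights.sum())  # attention weights form a distribution, summing to 1
```

In a real model the weight matrices would be learned jointly with the rest of
the network; this only shows the shape of the computation.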

~~~
Animats
_No source code, as usual._

So why is it on Github?

~~~
striking
To make the voice samples easily available. Preprint archives and scientific
journals don't usually allow you to embed audio into your papers.

------
thom
Feels like we're getting breakthroughs every few months now. How close are we
to the point where a mobile phone can generate speech in close to real time
that is reasonably pleasant to listen to for long periods?

~~~
pharrlax
Audio book piracy is gonna skyrocket when you can generate a pleasant track
from text.

~~~
Someone
Huh? I would think, if anything, it would go down: as an alternative to
pirating an audiobook, one could download the text from Gutenberg.org or
pirate the _text_, and generate a decent audiobook from it.

~~~
pharrlax
True, technically it's not "audiobook piracy", but I used that as shorthand
for recording and distributing an audio version of a written book, which
courts will surely find to be a form of piracy.

------
lucidrains
can we eventually apply style transfer to speech? I'd like to hear all audio
narration in the voice of David Attenborough.

~~~
starmole
Adobe VoCo does something like that. Demo:
[https://www.youtube.com/watch?v=I3l4XLZ59iw](https://www.youtube.com/watch?v=I3l4XLZ59iw)

~~~
mbrookes
It was pointed out the last time I posted this that the demo is "fake", at
least inasmuch as the inserted words aren't being synthesized but rather
copy-pasted with existing intonation intact.

------
akhilcacharya
Being able to pronounce Taleb Kweli is blowing my mind. Absolutely amazing,
can't wait to see where this goes.

~~~
oliyoung
It doesn't. It's Tah-leb Kwah-lee; that sounds more like Ta-leb Kwell-ee.

But damn, the rest of it is astounding

~~~
literallycancer
It sounds good to me. Not sure why you'd want to pronounce it as Tah-leb
Kwah-lee?

If anything, it sounds like "quiley".

~~~
lenocinor
Not trying to sound snarky here, but that's how he actually pronounces his
name.

------
visarga
Paper link here:
[https://arxiv.org/abs/1703.10135](https://arxiv.org/abs/1703.10135)

------
wishinghand
I'd love to see it take on this poem:
[http://www.thepoke.co.uk/2011/12/23/english-pronunciation/](http://www.thepoke.co.uk/2011/12/23/english-pronunciation/)

~~~
NHern031
That was a great poem; storing that one away for later. It truly shows the
complexity of the English language.

------
yunolisten
I'm totally new to tts, so please forgive the questions:

* Is this open source?

* What hardware would one need to run this?

* If specialist hardware is required (e.g. reasonably high end GPU) would that simply be for building and training the system or its operation?

~~~
asadjb
From what I can tell from the paper, this is research only, with no source
code released. So right now it's not possible to say what the hardware
requirements might be.

~~~
yunolisten
Thank you. It would have made more sense to me if they had posted the source
code used to generate the output examples. Oh well.

------
mnemotronic
I would like an on-line version that can generate mp3 from supplied text. In
English using a British accent. Female.

I have my reasons.

~~~
baq
couple this with a system that transforms pictures of people into nude
pictures of the same people. win $$$. watch civilization crumble to dust.

~~~
knicholes
There's a subreddit that has on/off picture pairs. Maybe given enough of
those as training data, one could create such a thing.

------
m1el
Question: why do people generate waveforms with neural networks?

Would it be easier for a neural network to "control" an emulated throat/mouth
and train against the sound output?

Emitting and post-processing waveforms entirely within neural networks seems
like placing a lot of responsibility on them.

~~~
svantana
The reason is that the only speech data available in high quality and large
quantities is in audio form. There has been research for decades on measuring
or estimating the properties of the speech organs (oral cavity geometry,
etc.), but that is largely an unsolved model-inversion problem. Existing
physical speech-synthesis models are pretty crude and sound quite robotic.

~~~
jhegedus42
I like that you mentioned "inversion".

ML is strongly related to solving inverse problems, but in my opinion that
does not tend to be emphasised enough.

In reality, the speech organs (a relatively low-dimensional system) generate
high-dimensional data (speech). This is an example of the so-called curse of
dimensionality.

What ML does is to invert this process and create a model of the generation
process.

My favorite example is quantum mechanics. Physicists made a lot of
measurements and created a model (quantum mechanics) that reproduces those
measurements (and even works on test data, not only on training data).

This is an example of a solution of an inverse problem. ML does the same.

So in one way or another, every working TTS system has a model that
accurately describes the speech organs and how the brain uses them to create
speech from text.
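As a toy illustration of that inversion idea (all numbers made up, with a
linear forward model standing in for the speech organs):

```python
import numpy as np

rng = np.random.default_rng(1)

# Forward model: a low-dimensional "speech organ" state (3 parameters)
# generates a high-dimensional observation (a 100-dim "waveform").
A = rng.normal(size=(100, 3))           # fixed generation process
true_params = np.array([0.5, -1.2, 2.0])
observation = A @ true_params + 0.01 * rng.normal(size=100)

# Inverse problem: recover the low-dimensional cause from the
# high-dimensional effect (here, simply by linear least squares).
recovered, *_ = np.linalg.lstsq(A, observation, rcond=None)
print(np.round(recovered, 2))           # close to the true parameters
```

Least squares recovers the three hidden parameters from the
hundred-dimensional observation; ML methods perform the same kind of
inversion for far messier, nonlinear generation processes.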

------
jxramos
I'd really love to see this mature completely and be applied to Kindle
Whispersync books. I've found that some audiobook readers make the book hard
to comprehend or even distracting, especially if the reader's sex does not
match the author's, or their voice doesn't even remotely resemble the
author's, whom I may know well from video interviews and whatnot. Ideally
every author would be a great reader and record their own content, but if
some synthetic means could achieve an approximate result, that would be
fantastic. Even better, people could choose from a range of voices and settle
on the one that works for them.

------
superkuh
Unless all these fancy new text-to-speech engines are released as actual
software instead of cloud services, they're worthless. And we all know that
isn't going to happen.

~~~
railorsi
How is research worthless?

------
jnpatel
I've been a long-time user of OS X's built-in text-to-speech [0].

IMO, it actually performs pretty robustly on these examples. Is Apple still
using just diphones, or are they post-processing in some way?

[0]: [https://en.wikipedia.org/wiki/PlainTalk#Text-to-speech_in_Mac_OS_X](https://en.wikipedia.org/wiki/PlainTalk#Text-to-speech_in_Mac_OS_X)

~~~
visarga
I'm using Alex to read articles and comments. I love the Alex voice, but it
sometimes makes stupid mistakes.

I hope we will have decent TTS for Linux too.

~~~
krackers
FYI, they've added newer, (IMO) better-quality voices since Alex debuted in
10.5. My personal favorites are Kate (English, UK) and Tessa (English, South
Africa).

------
slinger
This is really awesome! Does anyone have any resources on learning TTS synthesis?

~~~
modeless
These new models are replacing almost everything that's been developed for TTS
in the last 30 years with neural nets. So you don't need to study TTS anymore.
Study neural nets and then read this paper and you will understand the state
of the art in TTS.

~~~
rws
This is complete nonsense. You need to understand the problem you are
addressing even if you plan to use deep-learning methods for it.

Saying you don't need to know anything about TTS is like developing a
self-driving car and saying you don't need to know anything about the rules
of the road. For one thing, how are you supposed to know when you've got it
wrong? And which things are important to get right?

------
alnitak
This model sounds so cool. Any idea of a high-level software package where
you can feed in pairs of <text, audio> and get a TTS model as a result?

~~~
backpropaganda
This research just came out, quite literally, today.

There will be a TensorFlow implementation tomorrow, though.

~~~
alnitak
Do you have an update on this topic?

------
crawfordcomeaux
What would happen if it were trained on 2 or 3 voices for the same amount of time?

~~~
backpropaganda
The model is smart enough not to mix them together in the same sample.
Different samples would have different voices, unless you put a voice
identifier in as input to the model. Then you can control which voice to
generate.
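A minimal sketch of what "voice identifier as input" could look like,
assuming a learned speaker-embedding table (all names and sizes here are
illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)

n_speakers, d_spk = 3, 4       # number of voices, speaker-vector size (toy)
T, d_text = 6, 8               # input length, text-embedding size (toy)

# Learned tables (random here; trained jointly in a real model).
speaker_table = rng.normal(size=(n_speakers, d_spk))
text_embeddings = rng.normal(size=(T, d_text))  # one vector per input symbol

def condition_on_speaker(text_emb, speaker_id):
    """Look up the chosen speaker's vector and concatenate it to every
    timestep of the text embedding, so downstream layers can condition
    the generated voice on that identity."""
    spk = speaker_table[speaker_id]                  # shape (d_spk,)
    tiled = np.tile(spk, (text_emb.shape[0], 1))     # repeat along time
    return np.concatenate([text_emb, tiled], axis=1)

conditioned = condition_on_speaker(text_embeddings, speaker_id=1)
print(conditioned.shape)   # (6, 12): text dims plus speaker dims
```

At training time the speaker vectors would be learned along with everything
else; at synthesis time you pick the ID of the voice you want.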

------
redsummer
Where is the code?

