
Char2Wav: End-To-End Speech Synthesis - serialx
https://mila.umontreal.ca/en/publication/char2wav-end-to-end-speech-synthesis/
======
rspeer
I am getting tired of people implementing "deep learning to convert foo into
bar" and staking a claim on the name "foo2bar".

It leads to "AI hallucination", where even if "foo2bar" doesn't work, people
assume that it's the one right AI for turning foo into bar. When someone gets
better at turning foo into bar, the typical response will be "is that just
foo2bar?"

This happened absurdly backwards with doc2vec, which after word2vec everyone
talked about as if it were a real thing, until Radim Řehůřek finally made a
reasonable implementation of it under that name.

~~~
phreeza
I'm not sure people will interpret it that way. For example, seq2seq is really
just a generic term for an entire class of networks that map sequences to
other sequences.

~~~
rspeer
If someone else made a speech synthesizer named "Char2wav", wouldn't the
author of this project feel like their branding was being stolen from them?

~~~
phreeza
"branding" a project that way might make the authors unhappy, but i can
totally imagine someone saying in a couple of months "I built a char2wav net
with GRUs and deconvolution" or something like that, i.e. using the word as a
generic term...

------
BugsJustFindMe
[http://www.josesotelo.com/speechsynthesis/files/wav/blizzard...](http://www.josesotelo.com/speechsynthesis/files/wav/blizzard/best_bidirectional_encoder_9.wav)

I have not laughed this hard in a long time.

~~~
kastnerkyle
You might also enjoy
[http://badsamples.tumblr.com/](http://badsamples.tumblr.com/)

------
billconan
The demo page isn't clearly presented.

For example, on this page, only Spanish has the char2wav label.

[http://www.josesotelo.com/speechsynthesis/](http://www.josesotelo.com/speechsynthesis/)

It's unclear which results are the output of the model.

~~~
microcolonel
I love how the "Reader over characters with vocoder output." samples sound
sometimes like they're giving up or falling asleep.

------
verytrivial
Many of the synth voices sound to my ear very similar to people who are either
drunk or have a brain injury. I'm not complaining, it's an interesting
parallel.

------
option_greek
So how does this work? It's not very clear from the article.

~~~
jfsantos
Hi, I'm one of the authors. Broadly speaking, we pretrained one model (the
"Reader") to learn to read text and output vocoder variables, and another
model (SampleRNN) to go from these vocoder variables to an audio waveform.
Then, we fine-tuned both models together to be able to go from text to speech,
end-to-end. The "end product" is a text-to-speech system that does not need
tons of hand-engineered features extracted from the text in order to generate
speech. We also expect that with more training this will be able to overcome
the usual vocoder speech "unnaturalness" issues.
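
(Illustrative note, not the authors' code: a minimal sketch of the two-stage
data flow described above, assuming toy stand-ins for both models. The names
`reader`, `neural_vocoder`, `char2wav_sketch` and all constants are
hypothetical; nothing here is trained, and the joint fine-tuning step is not
shown. It only shows characters going to vocoder frames, and frames going to
waveform samples.)

```python
import math
import random

# All sizes below are hypothetical, chosen only to keep the example small.
FRAME_DIM = 4            # features in one vocoder frame
FRAMES_PER_CHAR = 3      # frames emitted per input character
SAMPLES_PER_FRAME = 80   # audio samples generated per frame
SAMPLE_RATE = 16000.0    # assumed sample rate in Hz


def reader(text, rng):
    """Toy stand-in for the Reader: characters -> list of vocoder frames."""
    frames = []
    for ch in text:
        base = (ord(ch) % 32) / 32.0  # crude per-character "embedding"
        for _ in range(FRAMES_PER_CHAR):
            frames.append(
                [base + rng.uniform(-0.05, 0.05) for _ in range(FRAME_DIM)]
            )
    return frames


def neural_vocoder(frames):
    """Toy stand-in for SampleRNN: vocoder frames -> samples in [-1, 1]."""
    samples = []
    phase = 0.0
    for frame in frames:
        # Treat feature 0 as a pitch-like control and feature 1 as loudness.
        freq = 100.0 + 200.0 * frame[0]
        amp = min(1.0, abs(frame[1]))
        for _ in range(SAMPLES_PER_FRAME):
            phase += 2.0 * math.pi * freq / SAMPLE_RATE
            samples.append(amp * math.sin(phase))
    return samples


def char2wav_sketch(text, seed=0):
    """Chain the two stages: text -> vocoder frames -> waveform."""
    rng = random.Random(seed)
    return neural_vocoder(reader(text, rng))


waveform = char2wav_sketch("hello")
print(len(waveform))  # 5 chars * 3 frames * 80 samples per frame
```

In the real system both stages are learned networks, and the point of the
end-to-end fine-tuning is that errors in the waveform can adjust the Reader
as well, which this sketch does not attempt to model.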

~~~
BugsJustFindMe
Can you comment on what's happening in this sample (result???) clip?
[http://www.josesotelo.com/speechsynthesis/files/wav/blizzard...](http://www.josesotelo.com/speechsynthesis/files/wav/blizzard/best_bidirectional_encoder_9.wav)

Also, I notice that many of the result clips trail off in volume. Is that a
processing error or intentional in how the clips are edited?

~~~
jfsantos
I think the model just got tired of reading text and decided to mock us :)
Just kidding. The attention mechanism got stuck somehow for this sample. This
does not happen very often, though. It's important to note the samples we
posted were not cherry-picked: they are just the first 10 sentences from our
test set.

Regarding the truncation at the end, that was a bug in our sampling code that
we just fixed. We will update the samples soon!

~~~
atomicthumbs
Is there any way to artificially induce that failure? I'm an artist and I've
been trying to get a handle on ML stuff, and being able to feed speech through
this to give it the flat affect of the phoneme-mode samples, or insert
attention failures at specific points, would be extremely useful for a number
of projects I have in mind.

------
wernerb
Proceeding to feed this Paul Bettany's "Jarvis" from some movies...

~~~
kastnerkyle
No joke, this is something I have considered. Do you have a source for
paired "read speech" and "transcript" data for this? I could process the
movies myself but that seems... tedious...

------
teddyh
I’d like to feed _The Chaos_ to it and see how it fares.

