
Translatotron: An End-to-End Speech-to-Speech Translation Model - bkudria
https://ai.googleblog.com/2019/05/introducing-translatotron-end-to-end.html
======
Nr7
Cool stuff. I wonder how long it will take until speech synthesis is at the
point where it can be used to create dialogue for video games. Obviously it
would (at least initially) replace just the less important stuff spoken by
background characters or less important NPCs, but even that could be a big
improvement.

Just imagine how much more immersive a game like Skyrim would be if the
writers could just write the lines and then run it through a speech
synthesizer to get finished dialogue. No need to hire multiple actors and get
them to a studio to record their lines. It would be so much easier and faster
to create a massive amount of unique dialogue and you wouldn't have to listen
to the same "arrow to the knee" line spoken by the same couple of actors over
and over again everywhere you go.

It would improve user created mods as well since just about anyone could then
just create new characters with completely custom voices and dialogue.

~~~
indy
We could see a resurgence in text adventures, except this time the input would
be voice based and the output would be to your headphones.

Downside to this would be people on public transport saying stuff like: "Use
magic spell on trolls"

~~~
logicchains
Sounds like a pretty big upside to me.

------
bowmessage
Is it just me or are the final results (Translatotron translation) not
playable, while the initial audio samples work fine?

~~~
MrTrvp
Works in Chrome.

~~~
gpvos
:angry face:

------
kozak
A perfect illustration for how Chromium's monopoly is killing the open web.
The sample audio doesn't play in Firefox, because who cares about anything
other than Chrome, right?

~~~
londons_explore
To be fair, this looks like a Firefox bug rather than Google inventing their
own 'standards'.

It's a bog-standard HTML5 audio tag with a WAV file. One file is 8 kHz and the
other 24 kHz, both standard sample rates. One is Microsoft PCM format, while
the other is IEEE float. I suspect the latter is the issue: even though it's
been around for decades, I bet it isn't a well-tested code path in Firefox.

It makes sense that they use floats for machine-learning outputs. Unless
someone specifically thought to quantize the data to a specific bit depth,
whatever WAV-writing library Google used probably thought it was being helpful
by using the 'right' encoding.
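The two encodings differ only in the audio-format code stored in the WAV
file's 'fmt ' chunk: 1 means integer PCM, 3 means IEEE float. A minimal sketch
of reading that code (the function name is made up for illustration, and it
assumes a standard RIFF layout):

```python
import struct

def wav_format_tag(data: bytes) -> int:
    """Return the audio-format code from a WAV file's 'fmt ' chunk
    (1 = integer PCM, 3 = IEEE float)."""
    assert data[:4] == b"RIFF" and data[8:12] == b"WAVE", "not a WAV file"
    pos = 12
    while pos + 8 <= len(data):
        chunk_id = data[pos:pos + 4]
        size = struct.unpack("<I", data[pos + 4:pos + 8])[0]
        if chunk_id == b"fmt ":
            # First field of the fmt chunk is the format tag (little-endian u16)
            return struct.unpack("<H", data[pos + 8:pos + 10])[0]
        pos += 8 + size + (size & 1)  # chunks are word-aligned
    raise ValueError("no fmt chunk found")
```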

~~~
kozak
The problem is that they only tested it on Chrome.

~~~
gattilorenz
Maybe. On the other hand, the audio tag is pretty standard and the format is
standard too, and if you look at the W3C support page [1] it looks like
Firefox supports WAV files (and it does, just not all of them). And they might
just have tested that the first file plays (or even the second). I would say
it's a Firefox "bug" instead; there's no good reason not to support this WAV
format.

And I say that as a Firefox user who never betrayed the fox for that shiny
metallic look.

[1]
[https://www.w3schools.com/html/html5_audio.asp](https://www.w3schools.com/html/html5_audio.asp)

~~~
londons_explore
Especially since Firefox probably ought to be using the underlying OS
libraries for media decoding (and on Windows, Mac, and Linux, all the major
libraries support this format).

Firefox's "we need to invent it ourselves so we aren't at the whim of the
platform" approach has bitten them here.

Also, a patch to support this format is probably only 10 lines of code:
simply a for loop over every sample, converting to 16-bit PCM.
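The loop the comment describes can be sketched in Python (the function name is
hypothetical; an actual fix would live in Firefox's C++ decoder):

```python
def float_to_pcm16(samples):
    """Convert 32-bit-float WAV samples (nominally in [-1.0, 1.0])
    to signed 16-bit PCM values, clamping anything out of range."""
    out = []
    for s in samples:
        s = max(-1.0, min(1.0, s))   # clamp: float WAVs may exceed full scale
        out.append(int(s * 32767))   # scale to the signed 16-bit range
    return out
```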

~~~
05
Using underlying non-hardened OS libraries opens your browser to numerous
exploits; see e.g. IE's WMF vulns [0]

[0]
[https://en.wikipedia.org/wiki/Windows_Metafile_vulnerability](https://en.wikipedia.org/wiki/Windows_Metafile_vulnerability)

~~~
londons_explore
Presumably the solution is to harden the OS rather than reinvent the wheel and
leave all other applications vulnerable...

------
akie
I'm just going to say that this is flat out amazing. Really super impressive.

------
ttctciyf
Am I reading this correctly, that there's no explicit semantic representation
going on at any stage, it's purely audio input frequencies -> ML -> audio
output frequencies? If so, that's ... so much for Jerry Fodor and his Language
of Thought Hypothesis, eh? (Yeah, mostly joking but still..)

~~~
makomk
If I'm understanding this right, the actual translation goes from audio
frequencies -> ML -> audio frequencies, but the training process for the ML
algorithm relies on text transcripts of the speech.

------
lostmsu
Is the model available yet?

------
londons_explore
There seem to be lots of questionable translations in their source data -
missing emphasis, wrong words, and sometimes stuttering or mistaken
vocalisations.

If they could fix that, I think results could be much better.

