
Google claims near-human accuracy at imitating a person speaking from text - nopinsight
https://qz.com/1165775/googles-voice-generating-ai-is-now-indistinguishable-from-humans/
======
jscheel
Go to the actual audio samples page to hear several interesting examples:
[https://google.github.io/tacotron/publications/tacotron2/ind...](https://google.github.io/tacotron/publications/tacotron2/index.html)

~~~
ghaff
I _think_ I'd still pick out most of these as likely machine-generated. But
it's getting very close. Some of Amazon's Polly voices (especially a couple of
the English women) sound good to me as well. They're not quite at the point
where I'd use them for applications where people expect human speech but it's
not far away.

~~~
corysama
The last set of samples includes comparisons with the source human. The only
one that I can guess with confidence is the last pair. (Sample 1 of the pair
contains assumed-context emotional inflection.)

~~~
Asraelite
I find the machine-generated ones to be more rhythmic and to have intonations
with more consistent rises and falls in tone. I guessed the last three
correctly based on listening for this. The first one I guessed wrong but in
that case I was trying to differentiate them based on other tells.

------
AnimalMuppet
"However, the system is only trained to mimic the one female voice; to speak
like a male or different female, Google would need to train the system again."

If I understand this correctly, they trained it to mimic a _specific_ person,
not just a human voice. That's... kind of terrifying. Trained to my voice, and
given the right text, you could probably get me fired, possibly divorced, and
maybe jailed. Trained to mimic Trump's voice, and given the right text, you
might be able to start a war.

~~~
kevinh
In the next ten years, I think there's going to be plausible deniability that
any person's presence in a video, and especially in audio, is actually
representative of them. There are already imperfect transpositions of celebrity
faces onto the bodies of other people.

I imagine it's going to have a big impact when it first gets rolled out to
mimic someone on the political stage, and then people will use it to discount
any audio or video they don't like.

~~~
AnimalMuppet
> ... and then people will use it to discount any audio or video they don't
> like.

I consider that to _also_ be a big impact. If recorded events are now only
hearsay, it's going to be a lot harder to prove what actually happened - both
in politics and in court.

~~~
Cyphase
Although let's remember that it was that way for the vast majority of human
history. It's only in the last 150-200 years that we've had light and sound
recording technology.

------
allan_s
I know nothing about text-to-speech, so maybe my question is stupid, but I've
always wondered whether anybody has tried to produce "natural" sound by modeling
the air moving through a human mouth+throat+nose, so that you would get the
"naturalness" of the voice (especially if you add the dynamics, like the air
volume in the lungs that forces you to pause, or the time it takes to reposition
the tongue/mouth between two sounds), or whether it's actually more complicated
than that/too resource-heavy/too hard to model, etc.

~~~
ewrcoffee
I think the point of machine learning is that such details do not need to be
modeled explicitly; the network can learn them implicitly in its layers.

~~~
rspeer
It's better to have the right explicit features in your model than implicit
ones. It's just harder to know how to make those features, and to know that
they're right, which is part of the appeal of implicit features.

------
lathiat
What is it they do to voices in audio recordings that makes them sound like
this?

I can't put my finger on it, but all audiobooks have this distinctly
processed sound. Is it some kind of voice auto-tune, or something else?

------
stablemap
Some discussion on the Google blog post from a week ago:

[https://news.ycombinator.com/item?id=15962543](https://news.ycombinator.com/item?id=15962543)

------
aerialcombat
I'd say it's pretty damn close.

------
mathgenius
Are there any decent "voice synthesis as a service" offerings out there?

~~~
bmc7505
[https://lyrebird.ai/](https://lyrebird.ai/)

------
659087
Just wait until they start using the voice data they've been collecting from
millions of people via "OK <brandname>" enabled devices to imitate individual
consumers' voices and use them to endorse products/brands to their contacts.

