
RealTalk Speech Synthesis - Peroni
https://medium.com/@dessa_/real-talk-speech-synthesis-5dd0897eef7f
======
amirhirsch
"Because of this, at this time we will not be releasing our research, model or
datasets publicly."

Seems like people are designing research for this conclusion, begging their
own controversy. However, it rings hollow, especially presented this way on
Medium, in the first-person-plural voice of a corporation.

Ethical discussions in machine learning technology presentations are becoming
trite and self-congratulatory ("we've made an AI so good it merits discussion
of the ethical implications") especially when a discussion of actual
applications is missing.

------
Donald

      Because of this, at this time we will not be releasing our research, model or datasets publicly.
    

Has OpenAI's handling of GPT-2 inadvertently provided political cover for
commercial organizations who would love to claim they engage with the ML
research community but would actually prefer to contribute nothing other than
medium articles?

~~~
csande17
One could argue that it wasn't "inadvertent" at all, and OpenAI is such an
organization.

------
waiseristy
So I've seen multiple of these ML speech synthesis projects; when am I going
to be able to use one of them for a screen reader? I'd like to listen to wiki
articles with a synthesizer using modern methods, not Microsoft Sam.

~~~
gambler
When Google repackages it as a product that spies on you and you can't use in
ways not intended by Google.

------
deepblue129
> _have produced the most realistic AI simulation of a voice we’ve heard to
> date._

No you didn't. Please do not lie.

There are a number of projects replicating Google's Tacotron 2 research from
December 2017 that achieved human parity in text-to-speech as measured by MOS
score. Google's Tacotron 2 model was then successfully deployed by Google in a
service called Duplex.
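For context, MOS (mean opinion score) is just the average of 1-to-5 naturalness ratings from human listeners, usually reported with a confidence interval. A minimal sketch of the computation (the listener ratings here are made up for illustration):

```python
import math
import statistics

def mos(ratings):
    """Mean Opinion Score: mean of 1-5 listener ratings,
    with a normal-approximation 95% confidence interval."""
    mean = statistics.mean(ratings)
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, 1.96 * sem

# Hypothetical naturalness scores from eight listeners
score, ci = mos([5, 4, 5, 4, 5, 4, 4, 5])
print(f"MOS {score:.2f} ± {ci:.2f}")  # prints: MOS 4.50 ± 0.37
```

"Human parity" claims then come down to the synthesized audio's MOS confidence interval overlapping that of ground-truth recordings.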

Following up on this research, there are a number of open source and
commercial projects that have used Google's Tacotron 2 human-parity TTS
research:

Open source projects:

\- [https://github.com/mozilla/TTS](https://github.com/mozilla/TTS)

\- [https://github.com/Rayhane-mamah/Tacotron-2](https://github.com/Rayhane-mamah/Tacotron-2)

\- [https://github.com/NVIDIA/tacotron2](https://github.com/NVIDIA/tacotron2)

Commercial TTS projects:

\- [http://deepzen.io/](http://deepzen.io/)

\- [https://wellsaidlabs.com/](https://wellsaidlabs.com/)

\- [https://ai.googleblog.com/2018/05/duplex-ai-system-for-natur...](https://ai.googleblog.com/2018/05/duplex-ai-system-for-natural-conversation.html)

\--------------------------------------------------

> _he didn’t actually endorse our work like this, it’s a clip from the video
> the team created featuring their work. Video and more after the jump!_

It's absolutely irresponsible/illegal to clone a person's voice without
consent. Using Joe Rogan's likeness for your publicity stunt without his
consent is unethical in itself. It's Joe Rogan's legal right to control his
own likeness.

Furthermore, this presents a number of safety risks to Joe Rogan including the
possibility of identity fraud.

\--------------------------------------------------

Finally, at this time, TTS technology reaches human parity only when tested
on phrases and sentences similar to those in the training set. Google's
Tacotron 2 models showed a significant decrease in performance when reading 37
news headlines. They mentioned in their evaluation:

> _This result points to a challenge for end-to-end approaches – they require
> training on data that cover intended usage._

~~~
bufferoverflow
None of these come even close to the Joe Rogan example. To me it's
indistinguishable from his real voice.

------
mwcampbell
There's a telltale discontinuity in the rise of the voice on the word "chimps"
in the sentence "and these chimps have been working out hard". I wonder if
future generations of kids will have to be trained to spot such things.

------
floren
We've already got a world where anything slightly embarrassing or regrettable
you do is likely to be recorded and uploaded to YouTube. Maybe once there are
tools that can perfectly fake a video and corresponding audio, we'll be free again.

------
agentultra
Remember in Fahrenheit 451 where Guy Montag's partner is glued to the Wall? A
screen where she participates in her favorite shows and the audience and cast
members talk directly to her?

Read in a modern context I don't think Guy was merely burning books in service
to an authoritarian government. I think he kept the books because he was
becoming an outsider. He didn't want to participate in the world of the Wall
and consent to the expectations and norms of his society. And the firemen were
there to ensure everyone participated.

------
bitwize
Well, now we know what's going to happen to all that Alexa voice data.

------
masswerk
Impressive (really), but it raises a more philosophical question (as in
practical ethics): do we really want voice-bots to blend in perfectly, or
should they instead feature distinctive marks (like a rather monotonous
personality)?

[Edit: This is not so much about DeepFakes, as discussed in the article, but
more about a general level of implementation.]

~~~
mikeash
If perfect imitations are outlawed, only outlaws will have perfect imitations.

I’m not sure how useful that question is. If it can be done then it will be,
and we’ll have to deal with it whether we want it or not.

~~~
masswerk
I suppose we'd want bots to be distinguishable (by tone, etc.). Where's the
practical value in not being able to discern an algorithmic speaker, e.g., on
the phone? There's probably some value in being able to do so, regarding
liabilities and so on. (A contract arises from an agreement of intents. We may
not be sure whether such an agreement has actually been reached, or whether we
were just witnessing a behavioral pattern triggered by a Markov chain. We may
also question the nature of the intent, or whose intent this actually is.)

------
gambler
So, on one hand, I see people hyping half-baked AI through the roof, cherry-
picking good examples, refusing to study and discuss its limitations and even
outright dismissing the idea that AI failures are, in fact, failures, rather
than some kind of "different way of thinking".

On the other hand, I see _the same crowd_ engaging in ridiculous alarmism
that's not grounded in reality. They place technologies in far-fetched
scenarios, completely ignoring that the same scenarios can already be enacted
without AI. The usual conclusion is always that these technologies need to be
kept out of the hands of the public.

Someone is drinking too much of their own Kool-Aid. But regardless of how much
they believe in what they're posting, this behavior is disgusting and
unethical.

\---

 _> Here are some examples of what might happen if the technology got into the
wrong hands_

Since when do we _start_ a discussion with the assumption that a piece of
software will be restricted in distribution? Software tends to get in the
hands of everyone who wants it.

 _> Spam callers impersonating your mother or spouse to obtain personal
information_

News flash: this is already happening without AI. All you need is a bad phone
connection and someone who sounds vaguely like the person being impersonated.

Moreover, it's already trivial to change the pitch of your voice in real time.
With some simple audio engineering, you can alter timbre as well (e.g.
filtering, equalization). If that's such a big deal, why is no one using this
already? It's way, way, way easier than collecting lots of voice samples and
training a model.
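The claim above is easy to demonstrate: a crude pitch shift is just resampling (real-time voice changers add a phase vocoder so duration is preserved), and timbre can be altered with basic filtering. A minimal sketch using NumPy/SciPy on a synthetic tone standing in for a voiced sound:

```python
import numpy as np
from scipy.signal import butter, lfilter, resample

SR = 22_050  # sample rate (Hz)

def tone(freq, dur=1.0, sr=SR):
    """Generate a pure sine tone as a stand-in for a voiced sound."""
    t = np.arange(int(sr * dur)) / sr
    return np.sin(2 * np.pi * freq * t)

def naive_pitch_shift(y, semitones):
    """Shift pitch by resampling. Note this also shortens/lengthens the
    audio; real-time tools avoid that with a phase vocoder."""
    ratio = 2 ** (semitones / 12)
    return resample(y, int(len(y) / ratio))

def lowpass(y, cutoff, sr=SR):
    """Simple timbre change: 4th-order Butterworth low-pass filter."""
    b, a = butter(4, cutoff / (sr / 2))
    return lfilter(b, a, y)

def dominant_freq(y, sr=SR):
    """Strongest frequency component, via the FFT magnitude peak."""
    spectrum = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), 1 / sr)
    return freqs[np.argmax(spectrum)]

voice = tone(220.0)
up = naive_pitch_shift(voice, 12)   # one octave up
muffled = lowpass(voice, 1000)      # darker timbre, same pitch
print(round(dominant_freq(voice)), round(dominant_freq(up)))  # prints: 220 440
```

No model training, no voice samples: a few lines of off-the-shelf signal processing already move pitch and timbre around.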

 _> Impersonating someone for the purposes of bullying or harassment_

Why would someone need to impersonate someone else for bullying or harassment?
Bullying or harassment seems to work pretty well as is.

 _> Gaining entrance to high security clearance areas by impersonating a
government official_

If someone can get access to a place simply by using voice coming from
computer speakers, it's clearly _not_ a "high security clearance area".

 _> An ‘audio deepfake’ of a politician being used to manipulate election
results or cause a social uprising_

Media organizations already do this every day, in plain sight, via selective
editing.

------
em-bee
i look forward to seeing this applied to audio books. this will bring their
price down drastically, maybe even to zero. i may just buy the text version
and have a program generate the voice as it reads the book.

there are tools that do that now, but i find that i can't listen to the
current quality of computer generated voices for more than a few minutes.

with an almost human-like voice i don't think i'll care about the occasional
glitch that makes me realize it is a generated voice, as long as it sounds
fine otherwise.

------
RandomInteger4
Sounds good in terms of lack of pauses between words and his general voice,
but you can tell something is off due to the cadence and at times it seems
like words trail off into breathlessness.

------
DoofusOfDeath
It seems a little off to me, but perhaps that's because I already knew it was
fake. I'd be interested in how this holds up in a blinded study.

Still, it's impressive that judging the model's believability seems to require
such a sensitive test.

------
EamonnMR
At least to my ears, this sounds better than Lyrebird did when it was posted
here. Bravo!

------
grenoire
Hot damn, as someone who has listened to a few JRE episodes, it is... quite
good. What really struck me was its ability to recreate words that (I'm
guessing) weren't said by him in the past. Impressive!

