
“I Used AI to Clone My Voice and Trick My Mom into Thinking It Was Me” - isp
https://www.buzzfeednews.com/article/charliewarzel/i-used-ai-to-clone-my-voice-and-trick-my-mom-into-thinking
======
pnash
Nine years ago, my late wife had developed a tumor in her throat next to her
vocal cords. She was fighting cancer while trying to be a mom to our 3 young
boys. Directed radiation treatment was ruled out for this tumor, leaving
surgery as the only viable option. The downside was the very real risk of her
permanently losing her voice.

Hoping that she’d one day beat the cancer, but may not have a voice, I came up
with an idea of trying to “capture it” in 2009 - hoping that it could be
algorithmically rebuilt in the future. I reached out to a number of
individuals that ultimately put me in touch with a research group that had a
proprietary setup for capturing samples and rebuilding the voice. Over the
Thanksgiving break, I managed to get access to a soundproof recording room and
they worked with my wife to capture samples over a period of 4 hours.

Having worked in the infosec space since the 90s, my first reaction is often
about how new tech/innovation can be used to bypass a control and how one
could detect/prevent that. It’s easy to lose sight of how something like this
could fundamentally change a person’s life.

~~~
yomly
This is a great post, although I am sorry for the experiences you went through
to acquire this perspective.

Thinking more about the specific use-case you have in mind, I find myself
wondering how sentiment and inflection might be captured via a synthetic
voice. Would it be inferred by context? How would that inference deal with
things like sarcasm/irony. I wonder if there could be some input mechanism for
controlling the inflection - what would that input interface look like? Could
it go off facial expression?

I wonder where the existing tech sits in the uncanny valley for this space...

------
throwaway66666
I went camping in the alps once. On our last night, my friend took a bowl and
gathered ashes from the campfire. Half ritualistic, half jokingly, she said
that those ashes mark our trek and experiences, that she would carry the ashes
back home no matter what.

I was very confused about how she would get a bowl of unidentified ash through
airport security (we only had a backpack each). She drafted a poorly done and
obviously fake death certificate. It was not campfire ash anymore, it was the
remains of her father.

The people at the airport were visibly awkward, they tried to be as
accommodating as they could. She flew back home with a plastic bowl of ashes
from our campfire, it even had some parts of birch and branches.

Airport security was easily fooled. And the author's mom is easily fooled too,
motherly instincts be damned. Would a neural net be fooled by the author's
attempts? I know for sure that an automated security system would sound the
alarm on my friend. I'd like to see adversarial networks fighting each other
on such premises. A son network trying to fool the mother network and vice
versa ad infinitum, at least 1 billion simulation hours in. What kind of
wonders would come out?

~~~
babkayaga
Why would ashes be forbidden on an airplane?

~~~
informatimago
I can easily see half a dozen ways you could wreak havoc with ashes on
an airplane. Ashes or any dust, actually. Remember, it's essentially a closed,
hermetic, environment.

~~~
VBprogrammer
People have this idea that air on an airplane is constantly being recycled.
I'm guessing that is where your objection comes from.

In reality the engines are constantly producing a stream of hot compressed air
which is bled off for various subsystems. One of these is the air
conditioning system. It cools that air and filters it so that it's clean and
at a reasonable temperature for passenger comfort. The air is added to the
cabin at a pretty much fixed rate and the pressure is regulated by a dump
valve which dumps excess pressure overboard. There is no real recycling of
air.

~~~
MajorSauce
I'm genuinely curious,

How do they regulate oxygen levels?

Or are those levels more or less the same as on the ground but it's the
pressure that incommodes?

~~~
MisterTea
The pressure changes with altitude, not the gas makeup. So at 35kft/10.6km the
air is still 21% oxygen. The gp is right, the engines bleed off some of the
fresh air from the compressor stages in the engine, before the combustion
stage, and use that to supply the cabin. This is called Bleed Air -
[https://en.wikipedia.org/wiki/Bleed_air](https://en.wikipedia.org/wiki/Bleed_air)
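
As a rough check on the point above, here is a minimal sketch that estimates outside air pressure and oxygen partial pressure at cruise altitude. It assumes the International Standard Atmosphere model for the troposphere (none of these constants come from the thread except the 21% oxygen figure), and the function names are made up for illustration.

```python
# Assumed ISA (International Standard Atmosphere) constants, valid below ~11 km.
P0 = 101325.0       # sea-level pressure, Pa
T0 = 288.15         # sea-level temperature, K
LAPSE = 0.0065      # temperature lapse rate, K/m
G = 9.80665         # gravitational acceleration, m/s^2
M = 0.0289644       # molar mass of dry air, kg/mol
R = 8.3144598       # universal gas constant, J/(mol*K)
O2_FRACTION = 0.21  # oxygen share of air, as quoted in the comment


def pressure_at(altitude_m: float) -> float:
    """Outside air pressure in Pa from the barometric formula (troposphere only)."""
    return P0 * (1.0 - LAPSE * altitude_m / T0) ** (G * M / (R * LAPSE))


def oxygen_partial_pressure(altitude_m: float) -> float:
    """Partial pressure of oxygen in Pa: same fraction, lower total pressure."""
    return O2_FRACTION * pressure_at(altitude_m)


if __name__ == "__main__":
    for h in (0.0, 10600.0):
        print(f"{h:7.0f} m: {pressure_at(h):8.0f} Pa total, "
              f"{oxygen_partial_pressure(h):7.0f} Pa oxygen")
```

Under these assumptions the outside pressure at 10.6 km comes out at roughly a quarter of the sea-level value, so the oxygen fraction is unchanged but there is far less of it per breath, which is why the cabin is pressurized.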

------
TeMPOraL
Many years ago, I briefly used to do the reverse. I was tired of constant
calls by certain people, so I jokingly started to pick up and say "the
subscriber is currently unavailable; please leave a message after the tone
<BEEP>". Fooled two people with it before realizing that's a little too
disrespectful and stopping.

~~~
epaga
I did this one time as a joke, the other person hung up immediately. I only
figured out weeks later it was a good friend I hadn't seen in ages who had
happened to be in town that day and had wanted to get together.

They were ... not very happy when we realized what had happened.

------
minimaxir
(Disclosure: I work at BuzzFeed)

I do recommend watching the episode of Follow This as suggested in the article
(episode 7) if you’re interested in the latest deepfake tech, and its
implications for fooling people who can’t obviously tell it’s fake.

~~~
trqx
Hi,

> required

Is this even legal according to GDPR?

[https://screenshots.firefox.com/MnEgMtsGavMlxcts/www.buzzfee...](https://screenshots.firefox.com/MnEgMtsGavMlxcts/www.buzzfeednews.com)

~~~
jonathanstrange
It's legal if you can use the site just normally after you've pressed the
_Reject All_ button and no data is collected in that case. Otherwise, it
doesn't comply with GDPR. (Disclaimer: IANAL)

------
actionowl
I was tempted to sign up and try using it for our daily scrum standup conf
calls for fun. But after thinking about it I'm also terrified of the
possibility that my account or data could be compromised. Imagine the damage
someone could do by calling up a relative, posing as me, and saying that I'm
in trouble and need money or something?

~~~
DEADBEEFC0FFEE
Something you know, something you have. Time to share some authentication
secrets with family.

~~~
koolba
“Hi mom can you authorize my logon request?”

“Sure honey.”

“200 OK. Eh I mean, thanks.”

------
throwaway208113
I was hoping there was another method of doing this instead of playing the
audio file out the speaker and using the phone in speakerphone mode.

I have a stutter that is especially bad when I first talk on the phone. I used
to do something similar where I would record an introduction and then play it
when the phone connected.

The quality wasn't great, but it was better than me not being able to say
anything.

~~~
tantalor
You could probably configure the computer as a bluetooth microphone

------
kelvin0
The only reason this 'fooled' his mom is because cell phone sound is already
bad, so the obvious garbles, fluctuating intonation and weird pauses seem
normal.

So it's impressive, but let's not get ahead of ourselves here.

~~~
Splognosticus
That's not as big a drawback as you might think. There's a guy on YouTube[1]
with a channel fashioned after one of those Saturday morning edutainment
shows who debunks hoax videos, and one of the tricks he frequently points out
is when the video quality has been deliberately degraded to mask editing
flaws. It's good enough to fool anybody not looking for it.

[1]
[https://www.youtube.com/channel/UCEOXxzW2vU0P-0THehuIIeg](https://www.youtube.com/channel/UCEOXxzW2vU0P-0THehuIIeg)

------
richrichardsson
I wonder if this would fool the "My voice is my passport" identification
systems that I noticed cropped up on a few of the telephone services I was
using in the UK?

~~~
opless
Ahh, Sneakers reference
[https://www.youtube.com/watch?v=-zVgWpVXb64](https://www.youtube.com/watch?v=-zVgWpVXb64)

~~~
richrichardsson
Yeah, made me smile a little when I first encountered it.

------
ShakataGaNai
Kinda a strange way to go about this, but interesting nonetheless. I don't
know much about Lyrebird or how long they've been around, but as others have
noted... this sounds like a really terrible voice call at best (so far as the
samples have shown).

I want to use it, but I wouldn't use it for any real products today. Amazon
Polly isn't amazing either, but sounds more natural than these samples. Yes,
it's only a few stock voices, but it's a lot closer.

~~~
phkahler
>> this sounds like a really terrible voice call at best

Do you believe this isn't an advertisement for Lyrebird?

~~~
ShakataGaNai
To be a good ad for Lyrebird, I would have at least put a ton more into the
Obama/Trump voice samples. Enough that they didn't sound like they came out of
a VoIP connection from the 1980s. All the current demos make me think is
"Well, if I'm going to do anything voice, use Polly".

Like I said, it's a really cool idea. If I could realistically duplicate my
voice I'd certainly use that. But as of right now... meh.

------
bcaa7f3a8bbc
Public key verification is ultimately needed in all end-to-end encryption
systems to offer a strong guarantee that the conversation is not subverted by
a man-in-the-middle attacker. It can be done by using the Socialist
Millionaire Protocol
([https://en.wikipedia.org/wiki/Socialist_millionaires](https://en.wikipedia.org/wiki/Socialist_millionaires))
and a shared secret, but more often the verification is arranged in-person,
out-of-band, manually.

As realtime, realistic voice synthesis is thought to be difficult, a
voice/phone call encryption system usually circumvents this problem by using
the caller's voice as the proof, as both recognize each other's voice. In most
phone encryption systems, like many commercial systems, or the ZRTP protocol
by Phil Zimmermann, or the "safety number" in Signal, they allow both parties
to read out their pubkey's SHA-256 hash digest (usually encoded to words)
aloud, as a means of verification.

If this type of AI-based voice synthesizer becomes widely-available, it could
be disastrous to cryptography. It is not the end-of-the-world of course, as
targeted attacks with social engineering are not an issue for most people, and
those who need this level of security are going to perform out-of-band exchange
anyway, but still, the certainty of voice-based key verification would be
greatly weakened.
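
The "read the hash aloud as words" step described above can be sketched roughly as follows. This is an illustration only: the 16-word list and the function name are hypothetical, the fingerprint is truncated to 32 bits to keep the output short, and real systems use their own encodings and much longer comparisons.

```python
import hashlib

# Hypothetical 16-word list for illustration: one word per hex digit.
# Real systems use far larger, carefully chosen word lists.
WORDS = [
    "apple", "brick", "cedar", "delta", "ember", "flint", "grape", "harbor",
    "ivory", "jolly", "kite", "lemon", "maple", "north", "olive", "pearl",
]


def fingerprint_words(public_key: bytes, n_words: int = 8) -> list[str]:
    """Hash the key with SHA-256 and map the leading hex digits to words."""
    hex_digest = hashlib.sha256(public_key).hexdigest()
    return [WORDS[int(digit, 16)] for digit in hex_digest[:n_words]]


if __name__ == "__main__":
    # Placeholder bytes standing in for a real public key.
    my_key = b"example public key bytes"
    print(" ".join(fingerprint_words(my_key)))
```

Each party computes the words locally and reads them out; the other side checks them against what their own device shows. The scheme leans on the listener recognizing the reader's voice, which is exactly the assumption that convincing voice cloning would undermine.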

~~~
beguiledfoil
Cryptography doesn't solve social problems.

Cryptography doesn't solve political problems.

Cryptography solves communication problems. That's. It.

~~~
emiliobumachar
Being sure that it's really my son talking to me over the phone is primarily a
communication problem, no? The social and political implications are quite
large, but they can be addressed by solving the communication problem. This
can involve crypto.

Or am I missing something?

(I say nothing for or against the particular scheme proposed by gp, just
against parent's implied generalized dismissal of crypto to solve this
problem)

~~~
bcaa7f3a8bbc
Agree. I was just talking about a pure technical issue of voice synthesis,
and its implications for voice encryption. In other words, a communication-
over-the-phone problem. I neither mentioned nor intended to talk about it from
a political or social perspective.

While I understand the parent comment's stance that cryptography is not the
magic sauce, I don't think it's related to my comment.

------
jklein11
The cadence of his speech in the video was clearly off. It was pretty jarring
to me.

It didn't sound like the mom was buying it either. Her tone of voice was
somewhere between "I'm going to play along with this" to "Dear god I think my
son is on drugs again."

~~~
kelvin0
Definitely not convincing to me either. But things sound so bad over a cell
phone conversation, which kinda makes this work if you squint your ears and
drink a liter of spirits.

------
zahrc
The examples in this thread don't seem very convincing to me. Is that because
I know that it's an AI?

Not judging his work here, but it sounds like an unstable VoIP call.

~~~
eboyjr
I agree that it's not convincing. But I think over the phone it makes it close
to indiscernible.

------
defnotarobot
Is there anything at all like this open source?

~~~
michael_h
Sure is: Festival, HTS, Merlin, and a gaggle of wavenet implementations.
You'll have to put in some work, it's not turnkey, but you can get some really
good results.

------
jasonlfunk
In what sense can this actually be considered to be "AI"? It's software that
builds a voice profile from input sources and then uses that profile to
generate waveforms. Where is the intelligence?

~~~
tantalor
Artificial intelligence can be defined as training a model with real-world
input/output pairs & approximating a general solution to generating output for
arbitrary input.

In this case, your brain (the "non-artificial intelligence") can take some
text and control your vocal cords to emit sound waves to produce speech. You
can even learn different voices like a cartoon character voice artist. The
artificial intelligence can learn to do the same thing.

------
PeterStuer
A few years ago, well before Trump was a thing, a writer asked me what I
feared about the future. After thinking about it for a few minutes, I answered
'the post-truth society'. Even without AI algos, we were already well on our
way with 'traditional' evidence forging technology and public discourse
manipulations, to cast reasonable suspicion on everything we see or hear.

Simple AI accelerated and deskilled the former, and combined with ubiquitous
social networks exponentially empowered the latter.

The thing is: these truth undermining technologies need not be perfect to have
the effect, just 'good enough' to cast significant doubt and allow near
everyone to believe their own 'truths'.

The result will be a highly dis-empathic society, where trust beyond the
closest 'clan' is close to nil and even then some.

Confusion always empowered narcissists and sociopaths, the con-artists and the
cultists. It isn't so hard to see anymore how old civilizations could devolve
into the dark ages.

~~~
gm-conspiracy
Agreed.

We have already seen this work on "lo-fi" ads.

Ads with obvious spelling and grammar errors, meant to immediately engage
those who have no concerns of such things (filter out the critical thinkers).

------
ThinkingGuy
Clearly there's a lot of potential for abuse here. On the other hand, similar
technology has enabled radio reporter Jamie Dupree to get back on the air
after losing his voice to a rare neurological condition:

[http://jamiedupree.blog.wsbradio.com/2018/06/18/back-on-the-...](http://jamiedupree.blog.wsbradio.com/2018/06/18/back-on-the-air-with-jamie-dupree-2-0/)

------
butler14
What's wrong with wolfie?

[https://youtu.be/MT_u9Rurrqg?t=45s](https://youtu.be/MT_u9Rurrqg?t=45s)

------
bufferoverflow
It clearly sounds synthesized.

~~~
tyingq
Also sounds pretty close to a spotty VoIP connection.

[http://www.voiptroubleshooter.com/problems/robotic.html](http://www.voiptroubleshooter.com/problems/robotic.html)

~~~
femto
That's because state of the art vocoders (as used in VoIP) _are_ voice
synthesisers. The encoding process breaks your voice down into a series of
coefficients, which are sent to the decoder (voice synthesiser) at the
other end.

When you talk to your Mum on any modern phone system, there's an argument that
you're actually speaking to a voice synthesiser that sounds like her. Maybe we
need to view the thought process as the person rather than the voice?
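
To make the "coefficients in, synthesised voice out" idea concrete, here is a minimal sketch using linear predictive coding (LPC), one classic vocoder technique. It is not claimed to be what any particular VoIP codec does: real codecs quantize the coefficients and send a compact description of the excitation, whereas this toy version keeps the raw residual so the round trip is exact. All function names are made up for illustration.

```python
import math


def autocorrelation(frame, max_lag):
    """Autocorrelation of one frame for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def lpc(frame, order):
    """Predictor coefficients a[0..order] (a[0] == 1) via Levinson-Durbin."""
    r = autocorrelation(frame, order)
    a = [1.0] + [0.0] * order
    err = r[0]  # prediction error energy, shrinks at every step
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient for this order
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a


def encode(frame, a):
    """Analysis filter: residual e[n] = x[n] + sum_j a[j] * x[n - j]."""
    order = len(a) - 1
    return [frame[n] + sum(a[j] * frame[n - j]
                           for j in range(1, min(order, n) + 1))
            for n in range(len(frame))]


def decode(residual, a):
    """Synthesis filter: x[n] = e[n] - sum_j a[j] * x[n - j]."""
    order = len(a) - 1
    out = []
    for n, e in enumerate(residual):
        out.append(e - sum(a[j] * out[n - j]
                           for j in range(1, min(order, n) + 1)))
    return out


if __name__ == "__main__":
    # A 200 Hz tone sampled at 8 kHz stands in for one 30 ms frame of speech.
    frame = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(240)]
    coeffs = lpc(frame, order=2)
    rebuilt = decode(encode(frame, coeffs), coeffs)
    print("coefficients:", coeffs)
    print("max round-trip error:", max(abs(x - y) for x, y in zip(frame, rebuilt)))
```

The receiver never gets the waveform itself, only the filter coefficients plus an excitation signal, and it regenerates the sound by running the synthesis filter.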

------
skookumchuck
So much for audio recordings being evidence in court.

I always wondered why emails were evidence. They're just text, anyone could
fake an email.

~~~
Kostchei
Sure, you could say the same of any document, electronic or not. Most crime is
not that complicated.

    
    
It turns out that the metadata footprint left on a computer creating a
document, if you can seize it, is rich in detail, enabling creation date and
the like to be identified. Mail servers may hold logs. Often a fake email or
document is part of an offense and proving where a faked email was sent from
becomes quite relevant, and yes, ends up as evidence in court.

You may hear a prosecutor say "and on the 30th of June 2011 did you send the
following email..blah..blah"

There is a reason for that- they are establishing the possibility that it is
actually evidence. It's not clear cut, otherwise we wouldn't have courts. You
proffer evidence and convince people of its weight. And that will continue to
be the case.

------
paul7986
I rarely call or use the phone vs. texting. I would think that almost half the
population rarely use the phone too.

Also the perpetrator is going to have to spoof my exact number to trick
friends & relatives, who I would hope, after speaking to a fake me, would
text me soon afterwards with comments & questions.

------
abledon
Trolling your mom is a pretty weird target, one of the people who would have
the least suspicion when talking to you that you're trying to deceive them,
especially if planning dinner like in the article.

~~~
rangibaby
In Japan "It's me!" scams where a criminal pretends to be their mark's child
over the phone and asks them to wire money are / were quite common. I guess
they would find this useful if they knew who they were meant to be
impersonating.

------
isp
Lyrebird: [https://lyrebird.ai/](https://lyrebird.ai/)

Demonstration from the author of the Buzzfeed article:
[https://soundcloud.com/cwarzel/2018-02-06t19-53-39769z-1](https://soundcloud.com/cwarzel/2018-02-06t19-53-39769z-1)

Demonstration using Trump's voice:
[https://twitter.com/LyrebirdAi/status/904595052521025536](https://twitter.com/LyrebirdAi/status/904595052521025536)

------
peterwwillis
FWIW, secret agents have been doing this in movies for years. Took them long
enough...

------
mslate
This is very sad :'(

------
m1573rp34130dy
or on a different angle...[swaps grey hat for black hat]

...used AI to clone victim's voice and used deep fake to social engineer
customer support staff into compromising security swapping SIM card, draining
bank account, opening new credit accounts and mortgages, and to call random
people @55h@ts, then snicker and profit..

