
AI Clones Your Voice After Listening for 5 Seconds (2018) - lukeplato
https://google.github.io/tacotron/publications/speaker_adaptation/
======
arkades
I once got a call while I was lecturing some students. It was repeated three
times in three minutes - I assumed it was an emergency and stepped out.

I was greeted by someone explaining that my father had caused a car accident,
and they were calling on his behalf. That someone would need to send over some
money for repairs or they’d call the police.

Sure.

They added that their cousin, the driver, is a parolee now holding my father
at gunpoint. That if I don’t send them money to make them whole, they’ll kill
my father.

This was super fishy, you know? But still, with things like “life of a loved
one” at stake, it’s hard to call a bluff.

I can only imagine what I’d have done if I’d heard my father’s voice pleading
for help. They might have been able to get any amount of money out of me.

Well, if my father hadn’t passed away nine months prior. They were not
delighted to hear that.

~~~
toxicFork
How do you even prepare for something like that... Do we need to assign
identifying keywords to each other when we leave home so we know we are really
ourselves? Like a vocal pgp?

~~~
robbiemitchell
I told my wife that if I ever mention <redacted> while on a phone call, she
should know that I am in trouble and unable to speak freely.

Sounds like we'll all need more things like this eventually :(
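
One way to read the "vocal pgp" idea is as a challenge-response check against a secret agreed in person, so the spoken answer changes on every call and a recording of a fixed code word is not enough. A minimal sketch, assuming both people hold the same `SHARED_SECRET`; every name here is hypothetical, and the 6-character code is short so it can be read aloud, not because it is strong:

```python
import hashlib
import hmac
import secrets

# Hypothetical secret, agreed face to face and never spoken on a call.
SHARED_SECRET = b"agreed-in-person-not-on-the-phone"

def make_challenge() -> str:
    """The person receiving the call picks a fresh random challenge and reads it aloud."""
    return secrets.token_hex(4)  # 8 hex characters

def response_for(challenge: str, secret: bytes = SHARED_SECRET) -> str:
    """The caller derives a short code from the challenge and the shared secret."""
    digest = hmac.new(secret, challenge.encode(), hashlib.sha256).hexdigest()
    return digest[:6]  # truncated so it can be spoken; weaker than the full digest

def verify(challenge: str, spoken: str, secret: bytes = SHARED_SECRET) -> bool:
    """The receiver recomputes the code and compares in constant time."""
    return hmac.compare_digest(response_for(challenge, secret), spoken.lower())

challenge = make_challenge()
print(verify(challenge, response_for(challenge)))  # True
```

A cloned voice without the secret can repeat old answers but cannot answer a new challenge.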

~~~
jiveturkey
how do you pronounce the < > symbols? i mean, 'redacted' is already a pretty
strange thing to say by itself.

~~~
Shorel
You just make static noises, like a radio signal being lost.

Also, you are lacking in the abstract thought department. Get that fixed, for
your own benefit.

~~~
throw1234651234
It's either a knowledge gap...or he is hopeless if he knew the symbol but
didn't pick up on it. Nothing I know of can improve that.

------
AbbasHaiderAli
Wow, impressive results! Already a few examples in the comments of what bad
actors could do with this tech. I wanted to share an example of something good.

I lost my dad about 6 years ago after a Stage 4 cancer diagnosis and a 3 month
rapid diagnosis. I have some, but not a lot of video content of him from over
the years. My mom still misses him terribly so for her 60th birthday I tried
to splice together an audio message and greeting from her saying what I
thought he would have said.

The work was rough and nowhere near what this Google project could produce.
She listens to that poor facsimile every year for her birthday. It's
therapeutic for her. With some limits for her mental health of course, I'm
sure she would love to hear my dad again with this level of fidelity.

And so would I.

~~~
pmoriarty
Philip K Dick wrote about people going to commune with artificial personality
constructs of their deceased loved ones.

Unfortunately, it's been a long time since I read it, so I don't remember
which book it was in. Maybe someone who's read him more recently can remember.

Update: Apparently, lots of other people wrote about this too, but PKD wrote
about this before any of the ones mentioned so far, as he wrote about this in
the 1950's or 60's. I'm not sure if he was the absolute first, however. So if
anyone knows of any earlier references, it would be interesting to learn about
them.

~~~
ErikAugust
Ubik? Though that is not about artificial personality constructs, it's about
communicating with loved ones in half-dead states.

~~~
iainmerrick
This idea crops up in a few of his novels and stories, but I think it’s most
fleshed-out in Ubik, yeah.

~~~
xkcd-sucks
Under the hood they are all about religious Gnosticism and the physical
universe as a false facade to the "true" universe. VALIS is a pretty good
explication as well as a really good book; if you are into mental
illness+theology, only then is his Exegesis a good read

~~~
iainmerrick
Very belatedly, yes, VALIS is strange and wonderful.

This is making me want to re-read some PKD!

------
ihm
Reading this headline I begin to understand certain people’s worry about
having their soul stolen upon being photographed.

~~~
nwsm
Images of us can be sourced from any number of places: social media,
government surveillance, private surveillance. Video less so but from the same
sources. Audio from phone companies, VoIP services, surveillance, etc. Health
data easily from a number of private companies if you use new-age "health"
services, or less easily (illegally) from health records.

Maybe we can find solace in the fact that it is or will soon be infeasible to
avoid, so we needn't try to avoid it.

~~~
ben_w
“Don’t worry, just about anyone can steal your soul and there’s nothing you
can practically do to stop it”

That doesn’t seem like a message of solace to me.

~~~
baroffoos
The message is "Just about anyone could replicate your voice, its value in
authentication is about as trustworthy as writing your name at the bottom of a
letter"

------
Lucadg
At first glance there seem to be more malicious uses than good ones with
voice. Yes, hearing someone dear to me say things he/she never said may be
comforting. Anything else?

Maybe some movies with the deceased actor's voice?

But what if someone who wants to hurt me sends me files (or phone calls) from
the deceased person saying horrible things like:

\- "I am still alive but left as I was tired of you"

\- "oh Jan, I love you" [fake phone call from the past, where Jan is a lover
which never existed]

or even from alive people:

\- "I am leaving you"

\- or my live voice saying stuff which gets me fired or in prison.

We will never be able to believe voice again...how will we adapt?

~~~
Defenestresque
People had the exact same concerns when digital image manipulation software
became popular, including the "we will never be able to trust an image again"
question.

To answer your question, I think the biggest step we took in adapting to the
ever-present risk that an image may have been manipulated is acknowledging
that it's possible. As soon as people knew that something could be faked, they
realized that having a purported photograph wasn't irrefutable proof that it
happened and learned to ask for corroboration before making assumptions.

I think we'll learn to deal with this new development too.

~~~
Loughla
Luckily now people don't ever believe photographs as sole evidence in the
court of public opinion, and always corroborate that evidence.

Wait. No. I had that backward.

How long has photo manipulation been around? And people still fall for it
every minute of every day.

I have zero faith these tech developments will lead to anything good, or that
we'll even learn how to deal with them effectively.

~~~
baroffoos
I don't think it's so much that they fall for it but that people create fake
images that confirm what people already believe and the viewers deeply want
the image to be true and will not think critically.

Verifying an image is not impossible. You just have to consider:

* Who took the image

* Do they have any reasons for wanting to fake it?

* Was anyone else able to verify they saw or took a photo of the same thing

* Is anyone in disagreement with the content of the image

We don't need image manipulation to fool people on facebook. Recently a random
image of a park full of rubbish was used with the caption that this was the
result of a recent environmental protest but the image was actually months old
from a totally unrelated event. People believed and shared it because they
wanted to. You could just as easily write a text post saying you saw a bunch
of rubbish after the event and you would have almost the same effect.

------
Lordarminius
AI can make decisions, create deep fakes, and now, clone voices.

It may be that the next big business opportunity lies in creating 'anti-AI'
technology just as it did with antiviruses in the 90's and 2000's

~~~
pcmaffey
AI that detects AI seems entirely plausible. And like all “anti” measures, is
another arms race (and if I put my scifi hat on, is what may lead to AI self-
awareness).

~~~
BOBOTWINSTON
Sort of reminds me of a talk Valve gave about creating an anti-cheat for
Counter-Strike using AI. When asked if they were worried about people using AI
to create cheats to fool the AI, his answer was essentially that it was an
arms-race won by the person with more data/processing power. That person would
most likely always be Valve.

Link to talk:
[https://www.youtube.com/watch?v=ObhK8lUfIlc](https://www.youtube.com/watch?v=ObhK8lUfIlc)

~~~
hr5eaqhera
It's a nice sentiment, but there are popular and easy-to-find cheating
projects (not sure if I can name them here) that are still widely used, these
projects have been active for years, before that talk was made, and still
active today. Based on youtube videos and comments it seems many users are
still using these cheats with little issue. And afaik, the one I'm referring
to (initials P.I.) doesn't use any machine learning at all.

------
CraneWorm
Now we can have audiobooks read by anyone we like!

They can direct us to our destination!

They can speak at our funeral, being long dead themselves (as long as there is
sufficient training material recorded).

The future is awesome.

~~~
thomascgalvin
I legitimately think this could be huge for self-published authors. It takes a
skilled professional about forty hours of work to produce an audiobook from a
novel-length manuscript. Tacotron could do it in minutes.

~~~
jfengel
I don't see that coming soon. The voice is one thing, but the performance goes
far, far beyond that. Without understanding the text, you can't get good
prosody out of a single sentence, much less developing a character for a whole
performance.

You'd have to "direct" this on a word-by-word basis: "Put the emphasis here.
Speed up 10% here. Decrease vocal intensity 25%". You'd end up producing a
whole "score", and it would take at least as long as the human actor puts into
it.

Having done that, it would be amusing to switch it from voice to voice, as a
party trick. But the result would still be much poorer than you'd get out of
an actor. Really solving the work of an actor is strong-AI-complete.
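
The per-word "score" described above could be written down as data and rendered to SSML-style markup. This is only a sketch: `Cue` and `to_ssml` are hypothetical names, the `<speak>`, `<prosody>` and `<emphasis>` tags come from the W3C SSML vocabulary, and nothing in the linked paper says its system accepts such markup. A real engine may also require version and namespace attributes on `<speak>`.

```python
from dataclasses import dataclass
from typing import List
from xml.sax.saxutils import escape

@dataclass
class Cue:
    """One directed fragment of the performance (hypothetical structure)."""
    text: str
    rate: str = "100%"      # speaking rate relative to the default
    volume: str = "medium"  # SSML volume label, e.g. "soft", "medium", "loud"
    emphasis: bool = False

def to_ssml(cues: List[Cue]) -> str:
    """Render the score as a single SSML-style string."""
    parts = []
    for cue in cues:
        body = escape(cue.text)  # escape &, <, > in the spoken text
        if cue.emphasis:
            body = f"<emphasis>{body}</emphasis>"
        parts.append(
            f'<prosody rate="{cue.rate}" volume="{cue.volume}">{body}</prosody>'
        )
    return "<speak>" + " ".join(parts) + "</speak>"

score = [
    Cue("It was"),
    Cue("not", emphasis=True),             # "Put the emphasis here."
    Cue("what she expected,", rate="110%"),  # "Speed up 10% here."
    Cue("he whispered.", volume="soft"),     # lower vocal intensity
]
print(to_ssml(score))
```

Even in this toy form the point stands: someone still has to decide every one of those values.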

~~~
animal531
What about using a tablet to direct the piece by drawing? You can get values
for the intensity, speed and volume (up/down) pretty easily and intuitively.

Even better if it's linked to the voice generation system in real time, then
you can save/redo sentences etc. as you go along.

------
Riccardo_G
What it is doing is not really cloning, but because it was trained on 18k
different voices, it actually finds one that is closest to yours, and uses
that one. It can do a bit of interpolation to create an embedding which is
closer to your own, but only if it is well represented by a mix of other
voices. Real voice cloning like at
[https://replicastudios.com/](https://replicastudios.com/) can take just a
minute or two of audio, and it does a fairly good job, and it is always
improving. With more audio you start being able to also play with emotion and
styles, which is very cool!

~~~
JaRail
I'm not really sure where you're getting this. It doesn't pick a specific
voice from a database to use.

From their introduction: "Our approach is to decouple speaker modeling from
speech synthesis by independently training a speaker-discriminative embedding
network that captures the space of speaker characteristics and training a high
quality TTS model on a smaller dataset conditioned on the representation
learned by the first network."

Section 2 of the paper explains how it works. Two minute papers also goes
through it if you'd prefer a video. Link:
[https://youtu.be/0sR1rU3gLzQ](https://youtu.be/0sR1rU3gLzQ)
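
To make "speaker-discriminative embedding" a bit more concrete: the encoder maps an utterance to a vector, and utterances from the same speaker should land close together. Closeness between such vectors is commonly measured with cosine similarity. A toy sketch, with made-up 4-dimensional vectors standing in for real embeddings (which are much larger and come from a trained network):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented example vectors: two utterances from one speaker, one from another.
alice_1 = [0.9, 0.1, 0.3, 0.0]
alice_2 = [0.8, 0.2, 0.4, 0.1]
bob = [0.1, 0.9, 0.0, 0.4]

same = cosine_similarity(alice_1, alice_2)
different = cosine_similarity(alice_1, bob)
print(same > different)  # True
```

There is no lookup of a "closest stored voice" here; the synthesizer is conditioned on the vector itself.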

~~~
sillysaurusx
They’re saying that underrepresented voices will have trouble being modeled.
That matches my experience with this project: for example, I had a very tough
time cloning female voices compared to nerdy-sounding / deep male voices.

~~~
zamubafoo
It's more that the sounds produced during the recordings didn't cover the
entire spectrum of possible sounds, so the model had to estimate their sound.
All you really need is a paragraph which you can have the person read to get
sufficient coverage or just enough recordings that it's not an issue anymore.

------
lukeplato
Two minute papers video:
[https://www.youtube.com/watch?v=0sR1rU3gLzQ](https://www.youtube.com/watch?v=0sR1rU3gLzQ)

------
Angostura
My bank's (HSBC) telephone banking offers the option to do away with a PIN and
instead use a 'my voice is my password' phrase system.

I'm glad I never opted in.

~~~
ropiwqefjnpoa
Oh god, Sneakers

~~~
Angostura
Just to be clear - I'm Robert Redford

------
kleer001
Saw this on Two Minute Papers last night and had a discussion with my partner
about if we needed a secret password or not to tell if it were really one or
the other on the phone. I figured that we had enough shared history that that
wouldn't be a problem. Then we realized that there's no such thing as a
simulated sense of humor yet and that that would be the best natural
encryption to any communication.

~~~
toxicFork
2023: ai identifies your sense of humour after hearing you breathe for 0.7
milliseconds

~~~
kleer001
Ha. There'd need to be so many more Ai breakthroughs to make that happen it'd
be a little thing.

------
undershirt
“When you see something that is technically sweet, you go ahead and do it and
you argue about what to do about it only after you have had your technical
success.”

—Oppenheimer

~~~
robertjwebb
The thing is, these topics have already been discussed by philosophers!
Questions of authenticity, human subjectivity, reproducibility etc are not
new. But for the average joe and the non-philosophically-inclined techie, the
thing has to actually exist before they start really talking about it.

------
goodmachine
The malign applications of this technology greatly outweigh the benign.
Discuss.

~~~
carapace
Yes, I think there are almost no legitimate uses for the Farnsworth Device.
[https://theinfosphere.org/A_Device_That_Makes_Anyone_Sound_L...](https://theinfosphere.org/A_Device_That_Makes_Anyone_Sound_Like_Farnsworth)

My personal hell: My mother has dementia and a land line telephone.

Scammers call all the time. All day long. (Although the last few days have
been pretty good, I assume somebody somewhere is doing their jobs. The
scammers will adapt.) One thing they do is spoof their number to have the same
area code and prefix as the one they're calling, so it's like "Oh, is this a
neighbor?" or something, but of course it's not. It's an automated machine
abusing the telephone network to try to _steal money_ from a little old lady
with dementia.

Evil men with robots are attacking my mom. Another one called while I was
writing this post!

This is a goddamned sci-fi dystopia.

And now the robo-thieving bastards can _imitate my voice_!?

I'm going to have to get her one of those satellite-linked walkie-talkies or
something. Thank God she doesn't use the internet.

~~~
hurrdurr2
Back when I worked as a telemarketer in high school a long time ago, we sold
paper subscriptions, and usually the people that didn't cuss us out and hang
up right away are the lonely old people who just wanted to chat. I lasted two
months and had to quit; felt like we were just taking advantage of them.

------
rshi
I wonder what the legal implications of this alongside similar developments
like deepfakes are going to be in the next couple years. We're already having
fraudsters impersonate CEOs using Deep learning-aided Voice generation[1] due
to just how low the barrier of entry is now. There's already a public
implementation of the paper out [2]!

[1]: [https://www.wsj.com/articles/fraudsters-use-ai-to-mimic-
ceos...](https://www.wsj.com/articles/fraudsters-use-ai-to-mimic-ceos-voice-
in-unusual-cybercrime-case-11567157402) [2]:
[https://github.com/CorentinJ/Real-Time-Voice-
Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning)

~~~
jermaustin1
The latest episode of Blacklist had a dark plot based on deep-fakes.

~~~
fasturdotcom
didn't know new season was out! thx!

------
moyix
This is from 2018 – does anyone know if there are pretrained models and code
for this? I found [https://github.com/CorentinJ/Real-Time-Voice-
Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning) , but the
generated audio quality was much worse than the samples here.

~~~
JaRail
The biggest missing piece is WaveNet, which is Google's proprietary voice-
synthesizer. With only the models trained for this paper, the best you could
build is a voice-recognizer. As far as I know, Google only allows people to do
TTS with one of their provided voices.

I don't expect them to open it up until other companies/academics have
achieved similar results. It's too much of a competitive advantage right now.
Alexa, Siri, etc all sound like robots compared to WaveNet (google assistant).

------
grawprog
So....I'm going to paste the abstract here because the headline is incredibly
misleading and should be changed.

>Abstract: We describe a neural network-based system for text-to-speech (TTS)
synthesis that is able to generate speech audio in the voice of many different
speakers, including those unseen during training. Our system consists of three
independently trained components: (1) a speaker encoder network, trained on a
speaker verification task using an independent dataset of noisy speech from
thousands of speakers without transcripts, to generate a fixed-dimensional
embedding vector from seconds of reference speech from a target speaker; (2) a
sequence-to-sequence synthesis network based on Tacotron 2, which generates a
mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-
regressive WaveNet-based vocoder that converts the mel spectrogram into a
sequence of time domain waveform samples. We demonstrate that the proposed
model is able to transfer the knowledge of speaker variability learned by the
discriminatively-trained speaker encoder to the new task, and is able to
synthesize natural speech from speakers that were not seen during training. We
quantify the importance of training the speaker encoder on a large and diverse
speaker set in order to obtain the best generalization performance. Finally,
we show that randomly sampled speaker embeddings can be used to synthesize
speech in the voice of novel speakers dissimilar from those used in training,
indicating that the model has learned a high quality speaker representation.
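
The data flow between the three components in the abstract can be sketched with stand-in functions. None of this is the authors' code: the three functions are placeholders with no learning in them, and `EMBED_DIM`, `N_MELS`, `HOP` and the 16 kHz sample rate are assumed values chosen only to show the shapes passed from stage to stage.

```python
import math
from typing import List

EMBED_DIM = 8  # assumed embedding size for the sketch
N_MELS = 4     # assumed number of mel channels
HOP = 16       # assumed waveform samples per spectrogram frame

def speaker_encoder(reference_audio: List[float]) -> List[float]:
    """Stage 1 stand-in: fixed-size, unit-length vector from audio of any length."""
    chunk = max(1, len(reference_audio) // EMBED_DIM)
    emb = [sum(reference_audio[i * chunk:(i + 1) * chunk]) / chunk
           for i in range(EMBED_DIM)]
    norm = math.sqrt(sum(x * x for x in emb)) or 1.0
    return [x / norm for x in emb]

def synthesizer(text: str, embedding: List[float]) -> List[List[float]]:
    """Stage 2 stand-in: one mel frame per character, shifted by the embedding."""
    bias = sum(embedding) / len(embedding)
    return [[(ord(ch) % 32) / 32 + bias + 0.01 * m for m in range(N_MELS)]
            for ch in text]

def vocoder(mel: List[List[float]]) -> List[float]:
    """Stage 3 stand-in: expand each frame into HOP time-domain samples."""
    wave: List[float] = []
    for frame in mel:
        level = sum(frame) / len(frame)
        wave.extend(level * math.sin(2 * math.pi * n / HOP) for n in range(HOP))
    return wave

reference = [math.sin(0.01 * n) for n in range(16000 * 5)]  # 5 s of fake audio
embedding = speaker_encoder(reference)      # seconds of speech -> fixed vector
mel = synthesizer("hello world", embedding)  # text + vector -> spectrogram
audio = vocoder(mel)                         # spectrogram -> waveform samples
print(len(embedding), len(mel), len(audio))  # 8 11 176
```

The only thing the reference clip contributes is the embedding; the text and the waveform generation never see the original audio.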

~~~
aerovistae
Do you want to elaborate on how the title is misleading? From reading the
abstract it seems accurate to me.

~~~
corobo
"AI Clones Your Voice" implies there might be something on the linked page
that involves an AI cloning my voice. Maybe a way to record a few phrases,
maybe a text to speech that then uses my voice. Something like that.

This does not do that - only provides pre-rendered samples, kinda
disappointing. Impressive, but disappointing.

------
jchook
Lyrebird has had similar technology for a few years now.

[https://www.descript.com/lyrebird-ai](https://www.descript.com/lyrebird-ai)

~~~
0xcafecafe
An apt name for the technology considering the marvel of nature that the lyre
bird is.

[https://www.youtube.com/watch?v=VjE0Kdfos4Y](https://www.youtube.com/watch?v=VjE0Kdfos4Y)

------
hurrdurr2
Autocratic regimes can rejoice...they can extract public confessions so much
easier now...

~~~
blacksmith_tb
Only ones that care about the appearance of propriety though, presumably most
never had to try to produce convincing fakes when their word was already law?

~~~
x220
>Only ones that care about the appearance of propriety though

Such as the United States?

~~~
exo762
USA is a plutocracy. Source - this study:

[https://www.cambridge.org/core/journals/perspectives-on-
poli...](https://www.cambridge.org/core/journals/perspectives-on-
politics/article/testing-theories-of-american-politics-elites-interest-groups-
and-average-citizens/62327F513959D0A304D4893B382B992B)

~~~
x220
Thanks for the link, I was looking for something to read about this topic.

------
gist
Will point out that it is cloning after a short sample and with an unknown
speaker. So this is great for that type of comparison and in particular when
the person listening does not know much or have great experience with the
person speaking.

Now if you were to take something by a well known person (where there is a
great deal of audio) it would be much harder to clone anything other than a
really short passage.

This would be similar to faking handwriting. Easier to fake one word than to
fake three pages. Easier to fake something where you have little to compare a
pattern (less can go wrong).

Not saying this isn't impressive; it is. But it's also a bit of a trick based
on the very short clips (both samples and created).

I would say that a trained person could do a better fake because they could
take into account all the info and be less likely to make a mistake.

Now sure you could manually change the AI as well doing the same thing.

------
sillysaurusx
I used this repository to make Half Life’s Dr. Kleiner sing “I am the very
model of a modern major general”:

[https://twitter.com/theshawwn/status/1171806394783326208?s=2...](https://twitter.com/theshawwn/status/1171806394783326208?s=21)

[https://www.youtube.com/watch?v=koU3L7WBz_s](https://www.youtube.com/watch?v=koU3L7WBz_s)

Then @jonathanfly deepfaked Dr Kleiner’s face onto a live performance of the
song, which was hilariously unexpected. The AI twitter scene is awesome:

[https://twitter.com/jonathanfly/status/1171907301231513605?s...](https://twitter.com/jonathanfly/status/1171907301231513605?s=21)

There is some promising new work in the GitHub issues. For example, someone
has been training on ~10,000 additional speakers.

------
nmeofthestate
VCTK p240: duplicates the (a) north of England accent well.

VCTK p260: all over the place accent-wise.

LibriSpeech: can't really comment on the American examples, but they seem
decent.

~~~
jdbernard
Sample 9 is a good example. The pronunciation of "biographer" is consistent
even when it should be very different. All of the examples stress the first
syllable but an American would stress the second.

------
namaemuta
I wonder if this could be used in RPG games in which there are so many texts
and dialogs that having a recorded voice for all of them is impracticable.

------
dation
Now a politician can deny any sound bite. "They just deepfaked my voice!"

------
snissn
I heard a rumor that robot calls were harnessing your voice prints. Not
necessarily true currently but an interesting concern

------
joshmarlow
There's a potential _upside_ to malicious uses - synthesized voices (and
deepfakes) can give victims of revenge porn some plausible deniability. This
would hopefully take some of the sting out of that experience.

------
redsymbol
Not looking forward to the phishing that will exploit this.

Going to call my parents today and warn them. If they ever hear something from
me that's not adding up, be skeptical, and verify it some other way before
taking any action.

------
werds
Anybody else notice how that Scottish male reference voice sounds considerably
more English in the synthesized versions.

~~~
VBprogrammer
Yeah, maybe it's just having a more sensitive ear to the Scottish accent but
that to me was the furthest from the reference by far.

~~~
luxpir
I heard that too; being a Brit (and working in languages) probably helps. It
did pick it up occasionally though, which gives hope that increased sampling
and training could fix the slight miss there.

It was that and the Swedish-accented English ("Sentence in Different Voices"
section, middle recording) made it struggle. No traces of the Scandi-lilt were
left in the synth version.

Final note would be the French speaker at the bottom of the page _seems_ to be
English first language, despite having very good spoken French. Not quite as
pure a test of that last part as I'd have liked, despite the ability for the
speaker to perhaps read the synthesized version in English _back_ in English.
That could be fun.

~~~
seszett
I can't hear any hint of an English accent in the French-language recordings,
they just sound like regular Québec French to me.

However, I'm not convinced at all by these voice transfers across language. I
can imagine the second Chinese one being the same speaker in both languages,
but not the three others.

~~~
luxpir
That's no Quebecois I've ever heard. Sounds like a Brit who picked it up as a
second language in the home or soon after.

Even struggles to finish the sentence due to the effort of reading in the 2nd
to last one. Struggles with an extremely common word, 'grand', as well as
stumbling over a simple sentence. To be fair, he has heard enough French (i.e.
lived and studied there most likely) to get the intonations mostly right but
there are a few other giveaways too... it's just not natural or native from
where I'm sitting.

------
epx
One thing is, voice over phone is so compressed that it's surprising it took
this long for this kind of voice cloning (and the associated frauds) to be all
over the place.

We are going to need 2FA over voice communications :)
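
For what a second factor on a call could look like: both sides hold a shared secret and read out a time-based one-time code (TOTP, RFC 6238, built on HOTP from RFC 4226). A minimal standard-library sketch; the function name and defaults are my own, and using it over a phone call is an assumption, not something any bank mentioned here does:

```python
import hashlib
import hmac
import struct
import time
from typing import Optional

def totp(secret: bytes, at: Optional[float] = None,
         step: int = 30, digits: int = 6) -> str:
    """Time-based one-time code: HMAC-SHA1 over the current 30-second counter."""
    now = time.time() if at is None else at
    counter = int(now // step)
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation, as in RFC 4226
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# RFC 6238 test vector: ASCII secret, T = 59 s, 8 digits, SHA-1.
print(totp(b"12345678901234567890", at=59, digits=8))  # 94287082
```

The code expires after one time step, so a recorded or synthesized voice reading an old code is useless.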

------
SubiculumCode
For those who were unable to speak, e.g. S. Hawking, it would have been feasible
to have had the computer speaking system use a voice that had sounded like him
prior to his condition. Amazing.

------
nsxwolf
A staple of science fiction comes to life.

~~~
teraflop
And conversely, Star Trek's frequent use of voiceprint authentication looks
sillier by the day.

~~~
toomuchtodo
Fun fact: Several financial institutions (Vanguard, Schwab) allow your
voiceprint to be an authentication mechanism.

~~~
floatrock
There was a story a few months back about some British subdivision VP wired a
million dollars to eastern europe because the CEO called him up and told him
to do it or something like that.

It was the CEO's voice, but it wasn't the CEO.

------
johnsonjo
I was literally just researching tts (text to speech) programs yesterday and I
believe Mozilla’s open source (open source in this case meaning weak copyleft)
TTS [1] also uses Tacotron and is trying to implement multi speaker tts
currently [2]. I literally just posted Mozilla’s TTS to hacker news [3]
without even seeing this which made me experience a bit of the Baader-Meinhof
Phenomenon [4].

[1]: [https://github.com/mozilla/TTS](https://github.com/mozilla/TTS)

[2]:
[https://github.com/mozilla/TTS/blob/master/README.md#major-t...](https://github.com/mozilla/TTS/blob/master/README.md#major-
todos)

[3]:
[https://news.ycombinator.com/item?id=21532189](https://news.ycombinator.com/item?id=21532189)

[4]:
[https://en.m.wikipedia.org/wiki/List_of_cognitive_biases#Fre...](https://en.m.wikipedia.org/wiki/List_of_cognitive_biases#Frequency_illusion)

------
czbond
I have not been "wow-ed" by a technology in quite a while on HN. Wow.

------
jonplackett
Do papers like this have code to play with anywhere?

~~~
lukeplato
from two minute papers video description: > An unofficial implementation of
this paper is available here. Note that this was not made by the authors of
the original paper and may contain deviations from the described technique -
please judge its results accordingly! [https://github.com/CorentinJ/Real-Time-
Voice-Cloning](https://github.com/CorentinJ/Real-Time-Voice-Cloning)

~~~
jonplackett
Funny, it gives them all a slight American accent!

------
admn2
Yikes - don't a lot of financial institutions use your voice as a layer of
authentication?

------
hakanito
It will be interesting to see if these made-up AI voices can deliver jokes
with the same tonality and delivery as good comedians can. I'm just a layman
but it feels like a hard problem to solve.

~~~
NoodleIncident
The furthest right column in the first table shows that they might be a long
way off from getting timing right. The 5-second sample happens to have a
comma, at which the speaker pauses; this pause is in most of the generated
output, at seemingly random places in the sentence. The one sentence that does
have a comma doesn't use the pause, either.

------
seph-reed
I think it's time we officially declare we're going the dystopia route, and
really commit to it. The sooner we hit the great filter, the less suffering
there will be.

------
pier25
In a couple of years we won't be able to trust any media.

I wonder what the cultural implications will be, much like photoshopped models
and actors have changed the beauty ideals.

~~~
Riccardo_G
There are a lot of security features that are being put in place to help us
all understand what is real and fake. Of course there is still a lot of work
to be done and the technology is very new, but at
[https://replicastudios.com/](https://replicastudios.com/) work in
watermarking audio, as well as authentication of Replica voices and detection
of fake, non-authorised replicas is already in progress.

Facebook, YouTube, Twitter, and the likes, will then be able to let users know
what is using real (actual real voices, or the authenticated Replica voice) or
fake voices.

------
kingkawn
This is part of the process of truth being removed from all recording. Soon we
will be back to a state where the only certainty is the person we speak to in
person

~~~
Hoasi
> Soon we will be back to a state where the only certainty is the person we
> speak to in person

That is until we are able to tell whether that "reality" is synthesized or
not.

------
cs702
"Hi Jim, it's Jane. How are you? I'm calling you on the phone to confirm the
wire transfer instructions I just sent you via email..."

------
obaid
This is awesome (in a crazy way). I was playing with Resemble.ai [1] yesterday
and was surprised by how good it was in replicating my voice with just a few
minutes of dataset. This technology is going to keep on getting better by the
minute.

As with any piece of technology, there are always good and bad actors.

[1]: [http://resemble.ai](http://resemble.ai)

------
gregcrv
Maybe after all these years of people disconnecting physically, using more
phones and apps to communicate or meet we will go back to the fundamental real
world human connection because that's the only one that can be trusted to be
genuine. And the digital world will be left out because not trustworthy. Is
this where we are heading to?

------
nergik
Awesome! can’t wait to have some code to play with and start feeding ebooks to
produce audiobooks with voices i actually like

------
SubiculumCode
If acting was ever a good career choice, it isn't now. I am becoming convinced
that actors will be replaced wholesale.

~~~
visarga
No, actors are still going to act. But their voices will be just an input
feature to the speech engine.

~~~
echelon
Meaning that this democratizes acting. Celebrity actors won't be required.

I think this is a good thing for the field and gives a whole lot more people
access to the opportunity they strive so hard to achieve.

------
agentultra
Impressive but they still sound like robots imitating humans. I can only
imagine the chaos this could cause, if used by bad actors, as it continues to
improve. If someone took my voice I'm not sure that my partner would know it
wasn't me. That would enable all kinds of social engineering attacks.

~~~
Enginerrrd
Exactly what I was thinking. We already have a problem with scammers claiming
to be relatives who need some quick cash wired over. Imagine how much more
effective that could be if it actually sounded indistinguishable from a loved
one.

------
luxpir
I wonder if we'll need a PGP signature for every kind of recording we might
make in the future.

------
badrequest
Has anybody tried making an AI that generates 5 seconds of arbitrary speech to
feed into this AI?

~~~
kleer001
No need to create an AI for that. Just like with any neural network, random noise
can be fed into the detector networks, fed back into itself, then used to create
novel maps. Like DeepDream.

------
reifnir
Unfortunately, the source isn't available. I was hoping I could generate my
own narration of any book in the voice of anyone I could throw at the trainer.
(Heck, even if it was just my own, anything is better than Scott Brick!)

------
soulofmischief
My bank is going to need to do better than a 6-digit account number and verbal
password. Customer service is in for the ride of its life once this rapidly
maturing technology is commoditized for criminal enterprise.

------
i2shar
Wow. So is that why I am getting incessant spam calls and they only need to
hear me or my voicemail greeting for 5 seconds to be able to impersonate my
voice ever after?

------
kevin_thibedeau
There was an old piece on Headline News 20 years ago where someone had done
this with Whoopi Goldberg's voice. Never heard about it since. Presumably it
went black.

------
thimkerbell
Ok, what happens to society when this gets to be really good?

~~~
tiborsaas
Better memes, better apps, better scams

------
jacquesm
One more scene from Terminator that we can now do for real.

~~~
martin1b
What's wrong with Wolfy?

------
sebiw
Anybody else thinking about the scene from the Bourne trilogy where Jason
opens Noah Vosen's safe using a voice sample he took of him on the phone?

------
tenebrisalietum
Oh neat. I would totally use this to make ASMR audio in a voice different than
my own without having to ask someone else to read a script.

------
anoncow
Mission Impossible is now Mission Possible.

~~~
onion2k
It always was. IMF never failed in the TV shows or the more recent films. :)

~~~
gruez
>the more recent films

Was there ever a film where they didn't succeed at the end?

------
injb
They turned the Scottish accent into an English accent. The American ones are
very convincing though.

------
irrational
So, now we can't trust text, images, videos, or audio. Any of them could be
fake. What is left?

~~~
dole
AI-generated scents. After you die, your body odor, sprayed on an article of
clothing, as a monthly subscription to your loved ones.

------
costcopizza
I was born in the wrong era. Christ.

------
frandroid
The synthesized voices sound similar but would probably not fool a good voice-
print system.

~~~
macca321
No, but they would probably fool a friend or family member over a phone line.
Yikes.

~~~
chrisan
or some post on social media...

------
jasonbourne1901
In the future moms will no longer complain that their sons never call!

------
hindsightbias
How long until we have the napster equivalent of voices? Voicester?

------
solveq1
why do we need this kind of technology? cheating?

It seems anti-human

------
datlife
Welcome to 2020. Crazy time to be alive!

------
andrefmoniz
It means we can talk to anyone forever

~~~
seph-reed
It means you can make someone talk, non-stop, for an indefinite amount of
time.

Now I want to make an art piece that's just a valley girl droning on, and on,
and on, and on about believable and obnoxious life experiences. The stores
they go to, how they feel about certain colors, what "too spicy" is. It just
never stops.

~~~
judah
This sounds like a brilliant idea, and I'm almost wanting to build it myself.
If we can tack on a visual head and sync up the lips to the words, it'd be
amazing and would likely go viral.

~~~
seph-reed
ShadyWillowCreek@gmail.com if you're actually interested in collaborating on
this.

------
nkkollaw
This is pretty scary.

------
alpineidyll3
To not release the source is a pretty bad distortion of research norms.

------
x220
Jordan Peterson got it right months ago. "It’s hard to imagine a technology
with more power to disrupt. I’m already in the position (as many of you soon
will be as well) where anyone can produce a believable audio and perhaps video
of me saying absolutely anything they want me to say. How can that possibly be
fought? More to the point: how are we going to trust anything electronically
mediated in the very near future (say, during the next presidential
election)?"

[https://nationalpost.com/opinion/jordan-peterson-deep-fake](https://nationalpost.com/opinion/jordan-peterson-deep-fake)

------
kd3
This is cool. Hopefully soon I'll be able to use this to do automatic voice
overs from written texts using my voice for podcasts and videos. Read pages of
text without getting tired.

------
retrovm
2018, FWIW.

