AI Clones Your Voice After Listening for 5 Seconds (2018) (google.github.io)
911 points by lukeplato 24 days ago | 319 comments



I once got a call while I was lecturing some students. It was repeated three times in three minutes - I assumed it was an emergency and stepped out.

I was greeted by someone explaining that my father had caused a car accident, and they were calling on his behalf. That someone would need to send over some money for repairs or they’d call the police.

Sure.

They added that their cousin, the driver, is a parolee now holding my father at gunpoint. That if I don’t send them money to make them whole, they’ll kill my father.

This was super fishy, you know? But still, with things like “life of a loved one” at stake, it’s hard to call a bluff.

I can only imagine what I’d have done if I’d heard my father’s voice pleading for help. They might have been able to get any amount of money out of me.

Well, if my father hadn’t passed away nine months prior. They were not delighted to hear that.


I live in Houston. I recently read a Houston Chronicle article describing a very similar scenario. I don't have the exact link (it was from 2019), but here's one from 2013 [0].

That, combined with the inability to verify the actual phone number displayed on caller ID, has led me to tell all of my family to never accept a phone call from a number they don't recognize. There's literally zero trust in the phone system upon which we've built our modern economy.

Unfortunately that's not possible for everyone. Some people are legally required to answer the phone, always, even for numbers they don't recognize.

[0]: https://www.chron.com/news/houston-texas/houston/article/Hou...


Given caller ID spoofing, they really shouldn't even accept calls from numbers they do recognize... especially with tech like this. Let it go to voice mail then return the call afterwards.


I wholeheartedly agree. I don't answer the phone for unknown numbers unless I'm expecting a call from an unknown number; the expectation will have been set up via prior correspondence.

Unfortunately, not everyone can do that. Some people are legally required to answer the phone, even if they don't recognize the number. And unfortunately many businesses only communicate via the phone system.

So, unfortunately, our entire country is built upon a system that we're told to implicitly trust but that gives us no capability to verify.


Why are there people who are legally required to answer the phone? On what grounds? Why? (Assuming private persons here)


For example: people who are entangled in the court systems are required to answer their phone. Even if they're not convicted and are out on bail, they still must answer the phone -- it could be their bail bondsman. If someone's on parole, they must answer their parole officer. So as part of the bond contract and the parole contract, you must answer the phone.


Oh!


to someone that will likewise not answer any calls!


Chances are whoever just called you will answer the phone when you call back right away.

It reminds me of port knocking.


So what's the end game there - using voicemail as a poor man's texting?


The end game is everyone gets the shits, and there is a noticeable drop in the usage of the old POTS (plain old telephone system) network. People use Apple Facetime, Google Duo or whatever instead. Then the telcos start to notice they are losing customers.

At that point one of two things happens. One is that the telcos fix their networks. The second is they decide it isn't worth the effort, and let the traditional phone system die. Given phone calls are effectively free, so there is stuff all revenue in them, I bet it's the latter.

If that happens it will be painful. Like it is with messaging now, but even more so. Messaging now is either SMS with its limitations (like you can't use it from a computer), or a choice of a zillion walled gardens - Apple, Hangouts, Slack, Signal, Viber, Telegram, WhatsApp, Facebook, ... most of which I don't have installed so I can't communicate with someone using them. The voice equivalents are Facetime, Duo, Viber, Signal - many of the same things in fact. The result will be worse than messaging - the ability to communicate universally with anyone dies, but with no SMS fallback.

But that's not the end point. Universal communication is just too useful to be dispensed with - as the explosion of the internet and the postal system before that have shown. So something will replace it, and once again we will all be able to communicate with anyone we please.

However, the replacement has to solve the parasite problem. Once the cost of sending a message drops below a certain point, every universal system we've had so far has been overrun with parasites, aka spammers. The postal system has junk mail, email has its spam, now the phone system, and of course SMS.

A solution may be to allow the recipient to charge the sender any amount they like for successful delivery of a message. Most people would allow friends to send for free, messages from unknown senders to cost something, and messages from spammers to cost more.
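A minimal sketch of how such recipient-set pricing might look. Everything here is hypothetical: the `DeliveryPolicy` name, the fee amounts, and the idea that a sender attaches an offered payment to each message are illustrative assumptions, not an existing system.

```python
from dataclasses import dataclass, field


@dataclass
class DeliveryPolicy:
    """Hypothetical per-recipient pricing rules (amounts in cents)."""
    friends: set = field(default_factory=set)   # senders who deliver for free
    blocked: set = field(default_factory=set)   # known spammers, priced high
    unknown_fee: int = 5                         # assumed fee for strangers
    blocked_fee: int = 500                       # assumed fee for spammers

    def fee_for(self, sender: str) -> int:
        """Return what this recipient charges the given sender."""
        if sender in self.friends:
            return 0
        if sender in self.blocked:
            return self.blocked_fee
        return self.unknown_fee


def deliver(policy: DeliveryPolicy, sender: str, offered: int) -> bool:
    """Deliver only if the sender offers at least the recipient's fee."""
    return offered >= policy.fee_for(sender)
```

The recipient, not the network, owns the price list, so each person can tune how expensive it is to reach them.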

That could happen with the existing phone system of course, but I'd lay long odds the incumbents have too much in common with the dinosaurs for it to even cross their minds. Sadly that means we are in for a very painful transition period. In fact they are already losing customers as people stop using land lines in droves, so I'd say the writing is on the wall.


Loss of trust in PSTN, generally. I'd suggested this a few days ago in a similar discussion, bolstered by a recently-discovered quote from an industry engineer:

https://news.ycombinator.com/item?id=21494300

[S]ince mid-2015, a consortium of engineers from phone carriers and others in the telecom industry have worked on a way to [stop call-spoofing], worried that spam phone calls could eventually endanger the whole system. “We’re getting to the point where nobody trusts the phone network,” says Jim McEachern, principal technologist at the Alliance for Telecommunications Industry Solutions (ATIS.) “When they stop trusting the phone network, they stop using it.”

https://nymag.com/intelligencer/2018/05/how-to-stop-spam-rob...

At the point at which individuals and businesses in sufficient numbers find the downsides of participating in the PSTN exceed the benefits, they'll start defecting to other systems. Likely small and closed networks initially.

It took decades for the telephone to become established as the principal means of business communication, and as it was, numerous other alternatives existed in parallel: postal mail, telegraph, telex (for what we'd now call b2b communications), fax, and early email systems.

Email seems to be dying along with telephony, and for much the same reasons.

It's occurred to me that much of the value in social networks is in trying to corner a sufficiently large directory (that is, user base) to be able to credibly take on telephony. What seems to happen is that as these networks grow in size, they too fall prey to the hygiene factors already affecting telephone and email comms: spam and annoyance messages, with concomitant trust issues in the network as a whole.

Whether a technical solution to the trust and identity problem can emerge (and preserve privacy and protect against the surveillance state, surveillance capitalism, and surveillance by other actors such as organised crime, racist or fascist oppressors, stalkers, etc.) remains to be seen. I'm starting to think that's a hard, possibly an impossible, problem. An essay of Herbert Simon's I've recently turned up is exceptionally discouraging owing to a critical error Simon made in it (claiming Nazi Germany committed its atrocities without the benefit of mechanical data processing -- it in fact had ample assistance willingly provided by IBM).

More generally, I suspect that progress in information technology and communications capabilities reduces trust relationships, with some fairly strong historical evidence.

(Overall risks may be reduced, but the mechanisms by which this occurs replace actual trust with validation, verification, and surveillance mechanisms.)


I’ll bite.

Who is legally required to do that? Are they not allowed to sleep or be otherwise indisposed?


One I can think of-- people whose job requires it of them. Risk could still be to them personally if someone gets their number/extension/transferred to them. But risk could be on the business as well, which could just as easily be targeted by scams like this.


First off, Awesome story.

I have a friend who had something similar happen: he got a frantic call from his grandmother, who had learned via a scam call that he was in jail across the country and needed bail money. This was a few years ago, so they couldn't have used a duplicate of his voice, but possibly they were relying on imperfect memory.

Sweeping generalization, but the elderly are, and will likely remain, prime targets of this kind of scam, since they likely have funds and are less likely to be educated in the state of the art for this kind of tech, not to mention a protective instinct.


I received a call with a human on the other end. When I said hello, the person said in a friendly tone "Grandpa!" and tried to start talking to me.

That strategy probably works some percentage of the time.


How do you even prepare for something like that... Do we need to assign identifying keywords to each other when we leave home so we know we are really ourselves? Like a vocal pgp?


I told my wife that if I ever mention <redacted> while on a phone call, she should know that I am in trouble and unable to speak freely.

Sounds like we'll all need more things like this eventually :(


It would make sense to have another word that indicates it is genuinely you and you are genuinely speaking freely.


If that's the default situation (a likely scenario for most people), you'd need something other than a single code word.

In practice, most people can conduct a reasonable verification through a series of challenge/response interactions based on shared knowledge, should they need to do so. Mentioning something done, said, or shared in private recently would suffice in many instances.

For more robust tradecraft, should you need it, a set of one-time codes (passwords or passphrases) might substitute.
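A minimal sketch of the one-time-code idea, assuming both parties agreed on a short list of passphrases in person beforehand. The `OneTimeCodes` class and the example phrases are hypothetical; the only real library call is `hmac.compare_digest` from the Python standard library.

```python
import hmac


class OneTimeCodes:
    """Hypothetical pad of pre-shared passphrases, each valid exactly once."""

    def __init__(self, codes):
        self._unused = list(codes)

    def verify(self, spoken: str) -> bool:
        """Accept a passphrase only if it is on the pad and not yet used."""
        candidate = spoken.strip().lower().encode()
        for i, code in enumerate(self._unused):
            # Constant-time comparison, so timing does not leak partial matches.
            if hmac.compare_digest(candidate, code.strip().lower().encode()):
                del self._unused[i]  # burn the code so a replay fails
                return True
        return False

    def remaining(self) -> int:
        return len(self._unused)
```

Because each phrase is discarded after use, someone who overhears or records one call cannot replay the same phrase on the next.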

When the former head of Interpol was arrested in China, he managed to alert his wife through the use of a duress signal, an image of a knife:

https://www.nbcnews.com/news/world/wife-missing-interpol-pre...

Not subtle, but effective.

Spoiler:

In the film Capricorn One (1978), one of the astronauts alerted his wife by referring to a holiday they'd recently taken together, by mis-stating the destination as Disneyland, rather than Hollywood -- the land of make believe -- as it had actually been, which led to the revealing of the hoax mission.


That would solve the problem of having to find a word that you would never normally use but could slip in to a sentence normally.


If you can speak freely you don't need an extra codeword to explain that you are using the codeword in its real meaning. Unless maybe you suspect that somebody is listening to you and might learn your codeword from that.


The codeword is to make it clear your words should be taken 100% seriously without considering the risk you are being coerced / spoofed with AI. If I agree on a word in advance with someone that no one could possibly guess and insert into an attempt to coerce / spoof my voice, then if there is truly an emergency in which I need this person to wire money to a random account, they will actually do it because they will know my request is genuine.

If I'm being coerced, I could have a codeword to indicate that. If I'm being spoofed with AI, I'm not in control of "my" words, so I can't. I need instead to prove when I'm not being spoofed with AI. That's the purpose of this second codeword.
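The two-codeword scheme could be sketched roughly as follows. The function name, the status labels, and the choice to let the duress word take priority over the authenticity word are all illustrative assumptions.

```python
from enum import Enum


class CallStatus(Enum):
    DURESS = "duress"          # caller signalled coercion
    GENUINE = "genuine"        # caller proved it is really them, speaking freely
    UNVERIFIED = "unverified"  # neither word heard: could be spoofed


def classify_call(transcript: str, duress_word: str, genuine_word: str) -> CallStatus:
    """Classify a call by which pre-agreed codeword, if any, was spoken."""
    words = {w.strip(".,!?;:\"'").lower() for w in transcript.split()}
    # Assumption: duress wins if both words appear, as the safer reading.
    if duress_word.lower() in words:
        return CallStatus.DURESS
    if genuine_word.lower() in words:
        return CallStatus.GENUINE
    return CallStatus.UNVERIFIED
```

The point of the third state is that silence proves nothing: a spoofed voice never says either word, so the absence of the duress word alone cannot be treated as safe.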


I invented two code phrases for pretty much exactly the same reason, but in case I ever met myself from the future.


While it's a great idea, how do you test this? Is it worth the time?

All my loved ones are on my iCloud so I would just ping their phone/watch while confirming location and asking the assailant to let you hear the phone ping on the line.


Find My Friend is so unreliable for me it is nearly worthless. Apple isn’t really delivering what I need.


Really? How so? I use it probably almost daily and haven’t had an issue, at least I don’t think I have!


I've seen it fail to update the location pretty often.


how do you pronounce the < > symbols? i mean, 'redacted' is already a pretty strange thing to say by itself.


You just make static noises, like a radio signal being lost.

Also, you are lacking in the abstract thought department. Get that fixed, for your own benefit.


It's either a knowledge gap...or he is hopeless if he knew the symbol but didn't pick up on it. Nothing I know of can improve that.


I'd bet jiveturkey was joking.


They are pronounced "wakka" and "wakka" respectively.


less than redacted greater than, though it's clear he's keeping the real word a secret.


<...> is a common way of indicating a placeholder.


> How do you even prepare for something like that.

You don't because it statistically never happens. Just like you don't prepare for a plane crash or a lightning bolt striking you.


Yet, so many people have made plans for what they will do when they win the lottery.


I prepare for the astronomically improbable chance of a plane crash or lightning bolt striking me by having life insurance, so why such a reaction to someone asking this question?


Yes. My mom and I had phrases for duress. Since she has long since passed away, I can share ours. "I love you". Sad, right?


I'm broke so it makes things a lot easier.


It was a real rollercoaster reading this comment! Also, now you have to be worried about talking back and being a smartass, because they will record YOUR voice and use it to contact another loved one...


In my opinion, all phones should be set to not ring unless the number is in the address book in a specific category. Mine won't make a noise. If it's important, they will leave a message.


> I can only imagine what I’d have done if I’d heard my father’s voice pleading for help.

Hm... that certainly gives me pause, and my first reaction was to be very afraid.

On the other hand-- it still doesn't hold a candle to pyramid scheme sales techniques. I mean in a lot of cases those involve your actual loved ones betraying your trust and love in order to sell merchandise for a third party. Yet somehow in the face of a rising tide of those we still have functioning communities in the U.S.


At least in a MLM sales pitch, nobody thinks that one of you is about to die. For that reason, I think the scam call is far worse than any MLM sales pitch.

Side note, it's not accurate to refer to MLMs as pyramid schemes... Even though they're legal, MLMs are worse than actual pyramid schemes (and I don't think most people know what an actual pyramid scheme is, which is unfortunate because they're fascinating).

https://simple.wikipedia.org/wiki/Pyramid_scheme


It's harder for Eve to get your loved ones to betray you than to speak for 5 seconds.


I know someone that had a fling abroad and their fling began asking for money for treatment over facebook

The American assumed it was a scam and the person did die

I have often found that truth is stranger than fiction, and people are too conditioned for fiction that they can’t perceive truth


Wait, what? Did you miss a word or a sentence? What is "treatment" and how did this escalate to death? Was it a medical treatment and they were ill and succumbed to the illness?


"They were not delighted to hear that." - I used to do that, be a smartass. Then I realized at worst they get more info, at best, you are training them and wasting your time.

Then there is that whole thing where they are getting your voice.


POTS just needs to die unless they allow for authenticated CID.

One could wonder if this was some sort of conspiracy to break one of the most successful protocols in the world (or at least not update it so it dies by neglect) to increase profits by other means.


I’m convinced that POTS is already a dead man walking. The younger generation has no interest in talking on the phone in the first place, and with the amount of spam calls my phone number is absolutely the worst part of my phone.

Texting and apps are a much more pleasant way to interact with someone, with the bonus of no hold times, and much can be automated.

I think business texting is an upcoming startup unicorn which will be another “trivial idea packaged properly into a billion dollar product”.


>business texting

You mean Slack?


B2C - not team internal chats.

It will have “jumped the shark” once there’s an SMS button on the business listing that Google Maps displays.

Instead of the wonky/creepy Google demo which did speech to text and then analysis and then text to speech relayed over to the business, every business will just communicate directly with customers over text.

It’s not that this isn’t already done (to some extent). And more so in some countries outside of the US.

But I have no doubt it will become the primary/preferred way to connect with any business, to the point where you will text with an 800- number long before it would occur to you to dial an 800- number to get service.

Like for example, the warranty claim I just made on my Dyson handheld vacuum for a battery replacement. Search for “Dyson warranty claim” and they tell you to dial their 800- number. Now their phone helpline is absolutely the best of the best, but even still most people would [will eventually] prefer to interact via text.

Another example, making a reservation directly with a restaurant (which I prefer to do versus using OpenTable which will take a cut for doing nothing), is a perfect usecase for texting. Also ordering take-out if you already have a favorite order saved, obviously all the notification type things which make sense over SMS instead of email, making an appointment when a dedicated app is too much overhead, etc. etc.


You're way off on this one. A lot of those use cases don't work because texting is an asynchronous communication channel unless it's got some sort of automated system behind it. The reason you can't order takeout or make a reservation through text is because you come to an impasse if the person on the other end gets distracted and doesn't respond. The value in something like UberEats or OpenTable isn't in message passing, it's in state management. With UberEats if a restaurant closes suddenly or doesn't take your order within a timeout period, both you and the restaurant are notified and the state of your communication is updated, so there is no confusion. If you text or slack or whatever your order to a restaurant, and the person on the phone doesn't respond, you're fucked. How do you know whether it's safe to place an order somewhere else or not?

Sure, every restaurant could build their own automated system that texts you back and manages the communication, but that's never going to happen when there's already a managed, standardized service available.
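The state-management point can be made concrete with a small sketch. This is not how UberEats or OpenTable actually work; the `Order` class, its states, and the 300-second timeout are hypothetical, chosen only to show why an explicit expiry removes the ambiguity of an unanswered text.

```python
from dataclasses import dataclass
from enum import Enum


class OrderState(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    EXPIRED = "expired"


@dataclass
class Order:
    placed_at: float            # timestamp in seconds
    timeout_s: float = 300.0    # assumed window for the restaurant to respond
    state: OrderState = OrderState.PENDING

    def tick(self, now: float) -> OrderState:
        """Expire the order if it is still pending past the timeout."""
        if self.state is OrderState.PENDING and now - self.placed_at >= self.timeout_s:
            self.state = OrderState.EXPIRED
        return self.state

    def accept(self, now: float) -> bool:
        """Restaurant accepts; fails if the order already expired."""
        self.tick(now)
        if self.state is OrderState.PENDING:
            self.state = OrderState.ACCEPTED
            return True
        return False
```

Once the state reads `EXPIRED`, both sides know the customer is free to order elsewhere, which is exactly what a bare text thread cannot tell you.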


You might think that, but when I was in Brazil, ordering food by WhatsApp was commonplace. The restaurants would, generally speaking, answer very quickly. Some would send you the daily menu every morning.


That is so nice! And I thought I had it made a few years ago when I lived two buildings down and three floors up from the best Indian restaurant in town and would call them to place the order then go downstairs to pick it up twenty minutes later. Calling. On the phone. Pfft.

The daily menu thing is especially endearing.


What is stopping this text-only paradigm shift from happening? What developments are needed before this happens? Why hasn't it happened already?

Twilio has given the ability to programmatically text anyone for years. Why hasn't this hypothetical B2C text business developed yet?


The work involved in setting up a server/dashboard for Twilio to work is too high for it to be popular with most independent businesses.

This hints that a "Shopify for Twilio" would be popular.


Mobile slack is a terrible UX for short term engagement.


Or BlackBerry?


POTS is pretty much dead almost everywhere in the USA; most lines are VoIP these days. They are not replacing the copper wire.


Isn't that a bit pedantic? Perhaps I'm conflating PSTN and POTS. The fact that you have a publicly addressable phone number is the important part.



How were you to send money?


Wow, impressive results! Already a few examples in the comments of what bad actors could do with this tech. I wanted to share an example of something good.

I lost my dad about 6 years ago after a Stage 4 cancer diagnosis and a 3 month rapid diagnosis. I have some, but not a lot of, video content of him from over the years. My mom still misses him terribly, so for her 60th birthday I tried to splice together an audio message and greeting from him saying what I thought he would have said.

The work was rough and nowhere near what this Google project could produce. She listens to that poor facsimile every year for her birthday. It's therapeutic for her. With some limits for her mental health of course, I'm sure she would love to hear my dad again with this level of fidelity.

And so would I.


When I was doing computer repair, I remember a woman coming in with a digital answering machine, the kind that stored its recordings in volatile flash. During a thunderstorm the night prior the machine lost power, and subsequently lost all the stored recordings. As it happens, some of those lost recordings included messages from the woman's late mother.

That moment has stuck with me for many, many years. The heartbreak on her face, combined with my own frustration of knowing that no amount of luck (or skill) will ever be able to flip the bits of that flash chip back to a permutation which contains samples of her loved one's voice.

Fast forward to the present, my own grandmother passed away shortly after the start of 2019. I was able to salvage some of the many voicemails she had left me over the years, despite having had probably five or six cellphones during that period. Why? I used Google Voice, which is part of their Google Takeout data exfiltration program. I was able to download all those voicemails as MP3s, neatly categorized by caller. My grandma was very terse, so most of them are exactly the same: "Robert, can you please call me?", but in spite of that each one is unique and precious to me. A lot of developers think about getting data into their platform, but it seems to me that not as many think about users getting their data, sometimes precious & irreplaceable, back out of the platform.


Thanks so much for sharing this story--this never even occurred to us when we created Google Takeout back in the day!

-Fitz


My pet project I will likely never have the resources to work on would be AI-generated 3D virtual environments based on old photos / videos that you could navigate in VR and relive long lost memories

I'd pay a good amount of money to be able to relive certain experiences from my childhood with that level of immersion


Philip K Dick wrote about people going to commune with artificial personality constructs of their deceased loved ones.

Unfortunately, it's been a long time since I read it, so I don't remember which book it was in. Maybe someone who's read him more recently can remember.

Update: Apparently, lots of other people wrote about this too, but PKD wrote about this before any of the ones mentioned so far, as he wrote about this in the 1950's or 60's. I'm not sure if he was the absolute first, however. So if anyone knows of any earlier references, it would be interesting to learn about them.


Ubik? Though that is not about artificial personality constructs, it's about communicating with loved ones in half-dead states.


Yes, Ubik has half-life states.

But I'm thinking of a different PKD book where there were actual artificial personality constructs instead.


This idea crops up in a few of his novels and stories, but I think it’s most fleshed-out in Ubik, yeah.


Under the hood they are all about religious Gnosticism and the physical universe as a false facade to the "true" universe. VALIS is a pretty good explication as well as a really good book; if you are into mental illness+theology, only then is his Exegesis a good read


Very belatedly, yes, VALIS is strange and wonderful.

This is making me want to re-read some PKD!


Check out the movie with Jon Hamm, Marjorie Prime, screenplay by Jordan Harrison. Without spoiling too much, there's a company that can create holographic projections of loved ones, and a woman's family gives her one as a gift: a hologram of her deceased husband, but as he was when he was a young man. The interesting part, narratively, is that while the holograms are near perfect physical recreations, their personalities and memories must be trained by those who knew them (family and friends), which raises the question of how we're perceived in fragmentary and contradictory pieces depending on who's doing the training, and of the amalgamation of a person that's ultimately constructed from these parallax accounts. The writing is actually quite strong and the only sci-fi aspect is the holograms, so I wouldn't say there's much of a sci-fi crutch. I know it's not PKD and there are similar Black Mirror episodes, but I thought the drama itself was robust and displayed the range of Jon Hamm to be someone other than Don Draper.


This movie is available on Amazon Prime.

It's not bad, but my recommendation is to go into it with the expectation of a Black Mirror episode rather than something you might pay to see in the cinema.


There was also a Black Mirror episode.


It’s also present in the Revelation Space series, Neuromancer, Red Dwarf, and Star Trek.


Don’t know about the Philip K Dick work but William Gibson has this in Neuromancer.


When I interviewed Ray Kurzweil we talked about the obvious-in-hindsight insight that his life’s work was essentially trying to build an AI to bring his father back to life.


Except no matter how good it is, it will only reinforce that it’s not real and that he’s gone. Perhaps that’s the therapy he needs to move on?

Note: this is different from listening to recordings from the actual person.

Having loved ones die is one of life’s universal terrible qualities.


I think it would be cathartic to talk to someone you trusted but who is now gone. There's been decision points over the last few years where I would love to have just said my thoughts out loud to my dad and just have him nod and ask a couple of open ended questions so I could get it out. No specific guidance needed, just his particular style of listening.

Clearly losing someone and being able to deal with it is an important life skill but just as we build technology powered aids for other situations, I don't think this would be any different


"I think it would be cathartic to talk to someone you trusted but who is now gone."

It would be cathartic, but in this case you wouldn't be talking to them but to a computer, who (at best) is pretending to be them.

I think it's kind of creepy, when you really think about it, and it reminds me of the aversion the creator of Eliza had to his creation when he found out that people were spilling their guts out to it and treating it as a real person.

Which isn't to say that talking to something that's not a real person (and especially not a formerly living person you once knew) can't be helpful or healing. But if people get confused by these machines into thinking the machines are actually people close to them that died and are now living again, that will make them vulnerable to some really serious manipulation and delusion.


Fair point


I think a lot of people's passions are driven by a hole in their heart that they hope their work will help fill somehow. I suspect that no small amount of the enthusiasm for XR is due to a deep and abiding desire to be someone else, somewhere else, among the people developing or early-adopting for it. Of course, it doesn't have to be so high-tech; much non-profit or social work is prompted by personal experience with the presence, or lack thereof, of the service being rendered.

In the end, I don't know if any of that works. But what's being described doesn't seem too far outside the norm. Deprivation often leads to desperation for even a taste, however imperfect it may be.


I'm deeply sorry for your loss. Thanks for sharing your story.


Beautiful. Thanks for sharing. It's good to point out the positive potential uses as well as the negative.


This podcast explores how similar tech is used to give voice back to people who have lost it due to voice impairment. Basically allowing all the people using machines that sound like the classic Hawking computer voice to have their own voice instead.

https://www.npr.org/2019/07/15/741827437/finding-your-voice-...


It could also be used for recreating voices for people that have lost theirs, like Roger Ebert. I think he benefited by having so much of his voice already recorded, this would make it much easier for regular people.

https://www.youtube.com/watch?v=hMyxgSLESz8


Great use case!


I'm imagining Siri with the voice of your partner.


Either I would wind up being extra nice to a digital assistant or being curt with my partner :)


Yeah, you probably don't want to be mixing up the habits of how you talk to a digital assistant and how you talk to the person you love.


My partner and I frequently ask each other things, to which the response is "I bet you could google that." Seems fitting :)


Sorry for your loss, thank you for providing an optimistic example.


That was part of the original idea behind the Infibond start-up in Israel. Not clear how real it ever was.

https://sifted.eu/articles/infibond-investigation-israeli-st...


Would be interesting to have a community where people looking to be comforted by their loved one's voice would post whatever snippet of recording they have, then others would listen and see if they know someone who has a similar voice and have them record a message.


Reminds me of the LifeAfter podcast: https://www.stitcher.com/podcast/panoply/the-message


Reading this headline I begin to understand certain people’s worry about having their soul stolen upon being photographed.


Images of us can be sourced from any number of places: social media, government surveillance, private surveillance. Video less so but from the same sources. Audio from phone companies, VoIP services, surveillance, etc. Health data easily from a number of private companies if you use new-age "health" services, or less easily (illegally) from health records.

Maybe we can find solace in the fact that it is or will soon be infeasible to avoid, so we needn't try to avoid it.


“Don’t worry, just about anyone can steal your soul and there’s nothing you can practically do to stop it”

That doesn’t seem like a message of solace to me.


The message is "Just about anyone could replicate your voice, its value in authentication is about as trustworthy as writing your name at the bottom of a letter"


>> Maybe we can find solace in the fact that it is or will soon be infeasible to avoid, so we needn't try to avoid it.

It's the meager solace of the absolution of personal responsibility - there's no way to avoid it, so at least no one can say "why did you allow that to happen to you".


I am okay with the events that are unfolding currently.

http://gunshowcomic.com/648


“When remedies are over griefs are ended”


I had that exact same thought the other day in respect to biometric data from photographs.


Even decades ago I gave such reported concern with photographs more credence than the typical western account of it. I also wondered if the translation was precise enough -- could it (in some cases) have reflected a concern with "essence" more generally? Even without reference to a soul the concern can be a bit immaterial.


I'm looking forward to moving past just talking about the AI and concentrating on the new products the tech enables.

I guess it's similar to how most photos are a means to an end now, rather than the final product, e.g. satellite imaging or Instagram.


At first glance there seem to be more malicious uses than good ones with voice. Yes, hearing someone dear to me say things he/she never said may be comforting. Anything else?

Maybe some movies with the deceased actor's voice?

But what if someone who wants to hurt me sends me files (or phone calls) from the deceased person saying horrible things like:

- "I am still alive but left as I was tired of you"

- "oh Jan, I love you" [fake phone call from the past, where Jan is a lover which never existed]

or even from alive people:

- "I am leaving you"

- or my live voice saying stuff which gets me fired or in prison.

We will never be able to believe voice again...how will we adapt?


People had the exact same concerns when digital image manipulation software became popular, including the "we will never be able to trust an image again" question.

To answer your question, I think the biggest step we took in adapting to the ever-present risk that an image may have been manipulated is acknowledging that it's possible. As soon as people knew that something could be faked, they realized that having a purported photograph wasn't irrefutable proof that it happened and learned to ask for corroboration before making assumptions.

I think we'll learn to deal with this new development too.


Luckily now people don't ever believe photographs as sole evidence in the court of public opinion, and always corroborate that evidence.

Wait. No. I had that backward.

How long has photo manipulation been around? And people still fall for it every minute of every day.

I have zero faith these tech developments will lead to anything good, or that we'll even learn how to deal with them effectively.


I don't think it's so much that they fall for it as that people create fake images that confirm what people already believe, and the viewers deeply want the image to be true and will not think critically.

Verifying an image is not impossible. You just have to consider:

* Who took the image?

* Do they have any reason to want to fake it?

* Was anyone else able to verify that they saw or took a photo of the same thing?

* Does anyone dispute the content of the image?

We don't need image manipulation to fool people on Facebook. Recently a random image of a park full of rubbish was shared with the caption that this was the result of a recent environmental protest, but the image was actually months old and from a totally unrelated event. People believed and shared it because they wanted to. You could just as easily write a text post saying you saw a bunch of rubbish after the event and you would have almost the same effect.


> How long has photo manipulation been around? And people still fall for it every minute of every day.

People fall for headlines every minute of every day.


Yes. Tech is probably developing faster than our ability to adapt.


These tech developments will happen regardless of your stance on their morality. Either adapt like everyone else or stay behind. Your choice.


"acknowledging that it's possible" Even then, once the poison of suspicion has been delivered the harm is done.


Facebook users buy into doctored photos all the time. It's part of this little known phenomenon called "fake news". Video deepfakes are still time consuming and difficult to develop.


It doesn't have to be about creating audio of specific text with <specific person>'s voice; it can be much more about creating audio of specific text with a wide variety of believable people's voices.

I could see this, if it becomes commercially viable, potentially being a huge boon to indie game creation, for instance, since hiring a load of voice actors to record the dialogue for an entire game is vastly more expensive than, say, hiring a bunch of different people to record their voices for 5 seconds—or even, if this ever took off, buying a bunch of samples pre-recorded (or networks pre-trained) for the purpose.


This 100%. It's not about impersonating someone, it's about providing natural sounding text-to-speech for games, film, robots etc.

We're currently working on using voice AI to create real products over at https://replicastudios.com


Again, this is a good use scenario which seems to bring less benefit than the damage from a bad use case scenario.

Cheaper games vs distressing phone calls in this case.

I'm open to better use cases but for now I haven't heard any.


I mean...it's not like we're getting a choice here. The technology is possible, therefore it will happen (and is now starting to happen).

Like so much else in life, we have to take the bad with the good, but not looking for good in it doesn't mean we're less likely to get the bad.


Maybe a little less insidious than your examples, my first thought was being able to generate voice actor lines for videogames. Maybe not main characters, but background NPCs and such. Might make the VA union a little nervous going forward though!


>We will never be able to believe voice again...how will we adapt?

"Say, Grandma, before I wire the money, what's the name of your cat?"


We can barely trust pictures anymore either.

It seems we'll be going back to judging the likelihood of one's actions based on one's reputation, for better or for worse.

There soon won't be such a thing as unreasonable doubt.


Real-life interaction is going to be more and more valuable. So, "we won't need to travel again because of video calls" is not going to happen any time soon, it seems.

And public/private key validation may become invaluable.


I'm looking forward to an AI assistant that can use my voice to make phone calls

I'm sure some people with selective mutism would like to use text to speech with their own voice

> How will we adapt?

Digitally signing audio clips


Ok, mutism is a great use case, thanks. Still, that problem is also solved by normal text-to-speech software.


AI can make decisions, create deep fakes, and now, clone voices.

It may be that the next big business opportunity lies in creating 'anti-AI' technology just as it did with antiviruses in the 90's and 2000's


AI that detects AI seems entirely plausible. And like all “anti” measures, is another arms race (and if I put my scifi hat on, is what may lead to AI self-awareness).


Sort of reminds me of a talk Valve gave about creating an anti-cheat for Counter-Strike using AI. When asked if they were worried about people using AI to create cheats to fool the AI, his answer was essentially that it was an arms-race won by the person with more data/processing power. That person would most likely always be Valve.

Link to talk: https://www.youtube.com/watch?v=ObhK8lUfIlc


It's a nice sentiment, but there are popular and easy-to-find cheating projects (not sure if I can name them here) that are still widely used. These projects have been active for years, since before that talk was given, and are still active today. Based on YouTube videos and comments, it seems many users are still using these cheats with little issue. And afaik, the one I'm referring to (initials P.I.) doesn't use any machine learning at all.


In Greg Egan's _Permutation City_ this is mentioned in passing (an arms race between AI video call spam bots and AIs screening the calls for spam while impersonating the owner of the phone).

The anti-spam software loses because eventually having self-aware AI view spam calls 24/7 was considered torture and they weren't allowed to go that far.


Typically the AI to detect the AI is exactly what they use to train networks like Deep Fake in the first place. GANs are effectively just a local arms race between two machine learning networks.


Is it that difficult to create/retrofit an AV container format for cryptographically signed audio and video streams? Key management & revocation could be a pain, but it's something that consumer electronics companies like Apple could do: think of it as the MPEG LA, but with signature checking & non-repudiation.
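
For anyone wondering what the signing step itself might look like, here is a minimal sketch using only the Python standard library. To stay dependency-free it uses a Lamport one-time signature over the SHA-256 digest of a clip. That is a stand-in I picked for illustration, not something proposed above; a real system would more likely use an established scheme such as Ed25519, and a Lamport key must only ever sign one message. The function names (`keygen`, `sign`, `verify`) are hypothetical, and nothing here addresses the container format, key management, or revocation.

```python
import hashlib
import secrets

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def _bits(digest: bytes):
    # Expand a 32-byte digest into its 256 individual bits, MSB first.
    return [(byte >> (7 - i)) & 1 for byte in digest for i in range(8)]

def keygen():
    # One pair of random secrets per digest bit; the public key is their hashes.
    private = [(secrets.token_bytes(32), secrets.token_bytes(32)) for _ in range(256)]
    public = [(_h(a), _h(b)) for a, b in private]
    return private, public

def sign(private, clip: bytes):
    # Reveal one secret from each pair, chosen by the matching bit of the clip digest.
    return [private[i][bit] for i, bit in enumerate(_bits(_h(clip)))]

def verify(public, clip: bytes, signature) -> bool:
    # Each revealed secret must hash to the public value selected by the same bit.
    bits = _bits(_h(clip))
    return len(signature) == 256 and all(
        _h(sig) == public[i][bit]
        for i, (sig, bit) in enumerate(zip(signature, bits))
    )

private_key, public_key = keygen()
clip = b"raw audio bytes would go here"   # placeholder payload
signature = sign(private_key, clip)
print(verify(public_key, clip, signature))               # True
print(verify(public_key, clip + b" edited", signature))  # False
```

Anyone holding the public key can check that the clip is byte-for-byte what the key holder signed. It says nothing about whether the content was true when it was recorded.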


Congratulations, you have now created a class of people who can forge audio and video streams at will, relatively cheaply - and an underclass of people who cannot and may not even be able to record genuine footage they wish to record. This is not a road you want to go down.


No - I just created the equivalent of https for video; the "underclass" can still create, share and play unsigned videos - those would get low-trust warnings[1] (as they should, just like there is an "underclass" with no cert for their site). This wouldn't take away anything from today's tech; it only adds attestation of the person/org behind videos they would like to mark as "official".

1. (edit) It occurred to me that some people may wish to manage the public keys independently of (say) Apple, and they could distribute them via keybase or key-signing parties, so they actually don't have to suffer low-trust warnings. Now that I think of it, instead of merely being signed, streams could be signed and encrypted using the recipient's PK for 1:1 transmissions. Obviously law enforcement won't be a fan.


Law enforcement will just coerce the CA system you've suggested to secretly improperly issue certificates. Just as it does with the current CA system. Problem not solved - it's just hidden.

You conflate content trustworthiness with origin surety, but a CA system doesn't even provide that.


Like the booming anti Photoshop industry of the early 2000s?


10 years ago this stuff was often easy to notice. https://thelede.blogs.nytimes.com/2008/07/10/in-an-iranian-i...

Being good at Photoshop is really difficult and producing good fakes is extremely time consuming. Today, even with a Hollywood budget, most such effects still look off. However, the industry has gotten much better, with actor enhancement, for example, generally going unnoticed.

Which I think is the real issue: this stuff is becoming easier over time. AI could be the tipping point where eventually people just stop trusting images and video. But that transition is gonna be difficult.


Good luck with that. The best product you'll come up with is some sort of snake oil. The whole point of GANs is that you can't really "detect" the synthesized components anymore. Not that this would/will stop people from claiming otherwise in the spirit of profitability :-)


No, GANs train exactly one discriminator, jointly with the generator. There's no guarantee that you can't train another good discriminator out of band.

Furthermore, GAN discriminators are (as I understand it) often hobbled a bit to ensure that the generator can make progress on the loss function. An always-correct D doesn't provide a useful gradient.
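
To make the "always-correct D doesn't provide a useful gradient" point concrete, here is a small numeric sketch. It assumes the original minimax generator loss, log(1 - D(G(z))), with D = sigmoid(a), where a is the discriminator's logit on a generated sample. Differentiating with respect to a gives -sigmoid(a), which goes to zero as D becomes confident the sample is fake. The commonly used non-saturating loss, -log D(G(z)), has gradient -(1 - sigmoid(a)) instead, which stays large in that regime. The function names are mine and purely illustrative.

```python
import math

def sigmoid(a: float) -> float:
    return 1.0 / (1.0 + math.exp(-a))

def minimax_grad(a: float) -> float:
    """d/da of log(1 - sigmoid(a)): generator gradient under the minimax loss."""
    return -sigmoid(a)

def non_saturating_grad(a: float) -> float:
    """d/da of -log(sigmoid(a)): generator gradient under the non-saturating loss."""
    return -(1.0 - sigmoid(a))

# A very negative logit means the discriminator is sure the sample is fake.
for logit in (0.0, -2.0, -5.0, -10.0):
    print(
        f"logit={logit:6.1f}  D={sigmoid(logit):.5f}  "
        f"minimax={minimax_grad(logit):+.5f}  "
        f"non_saturating={non_saturating_grad(logit):+.5f}"
    )
```

As the logit drops, the minimax column shrinks toward zero while the non-saturating column approaches -1, which is one reason a too-strong discriminator can stall training under the original loss.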


GANs may produce imagery or audio which fools humans, but they are unlikely to consistently produce imagery or audio which fools humans over time.


GANs train by fooling AIs, not humans. Fooling humans is a side effect, not the primary thing trained for (mostly because that's cheaper of course). It's just that humans are in some ways different from AIs in terms of fooling.

Looking at the papers I must say I think the ability to fool humans is a scale problem, not a fundamental limitation. Already GAN produced images and sounds survive "normal" human scrutiny: if you have no reason to suspect foul play you won't see it. If you really go looking, you'll see it.


Snake oil-tier anti-ML might do the trick. One of the problems is the level of confidence people put into ML in the first place, when a lot of the time it's also snake oil-quality. Just being able to cast doubt on that again would help prevent a loss of healthy skepticism and critical thinking.


Better provenance is the way to guard against fakes.

For video and audio, I imagine a combination of hardware signing, perhaps with the camera itself living on an isolated, Secure Enclave-like chip, and sending hashes of (incoming images/video * deviceID * trustedTimestamp) to a blockchain or some other public distributed ledger. Getting the timestamp from a service that keeps its own record adds further security.

This obviously requires an internet connection, and would likely be useful mostly for news and government agencies, law enforcement. But if the culture is affected enough by deepfakes, I can imagine it becoming more ubiquitous. The parts are all there, it’s a question of utility.
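
A rough sketch of just the hashing part of that idea, under assumptions of my own: each record hashes the frame together with a device ID, a timestamp, and the previous record's hash, so altering or reordering frames breaks the chain. The device ID, timestamps, and record layout are made up for illustration. The device signature, the trusted timestamp service, and the public ledger are all left out.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first record

def provenance_record(frame: bytes, device_id: str, timestamp: str, prev_hash: str) -> dict:
    # Hash the frame, then hash the metadata that binds it to device, time and history.
    frame_hash = hashlib.sha256(frame).hexdigest()
    body = {"frame": frame_hash, "device": device_id, "time": timestamp, "prev": prev_hash}
    payload = json.dumps(body, sort_keys=True).encode()
    return {**body, "hash": hashlib.sha256(payload).hexdigest()}

def verify_chain(frames, records, genesis: str = GENESIS) -> bool:
    # Recompute every record from the frames and compare with what was published.
    if len(frames) != len(records):
        return False
    prev = genesis
    for frame, rec in zip(frames, records):
        if provenance_record(frame, rec["device"], rec["time"], prev) != rec:
            return False
        prev = rec["hash"]
    return True

frames = [b"frame-0", b"frame-1", b"frame-2"]
records = []
prev = GENESIS
for i, frame in enumerate(frames):
    rec = provenance_record(frame, "camera-1234", f"2019-11-20T12:00:0{i}Z", prev)
    records.append(rec)
    prev = rec["hash"]

print(verify_chain(frames, records))                              # True
print(verify_chain([frames[0], b"deepfake", frames[2]], records))  # False
```

Only the records would need to be published; the footage itself can stay private until someone needs to prove it matches.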


If you are trusting a camera to upload to a blockchain, the same can be simulated on a computer, given enough time.


Sure, that’s the reason for including the 3rd-party timestamp AND for keeping the device signing keys on a separate secure chip. The idea is to say “these images/sounds were recorded at this time by this device,” and to have that statement both registered publicly at the time of creation and backed by the reputation of the device maker.

It’s acknowledging that using AI to catch AI fakes is a fool’s errand, and relies instead on the premise that hashing a raw data stream is much faster than producing a good fake, and that a secure device key is secure. You’d need both for it to work, otherwise you can generate a deepfake beforehand and get the device to sign a fake stream. That may be easier to do than I think; this is not my area of expertise.


Wouldn't an AI-detector just end up being used like a benchmark for fine-tuning AIs though?


Yep! Adversarial learning. Whoever has the best math and compute wins.


Ah yes, if you listen carefully to the samples, you can always tell subtle things that make it seem a little off. Maybe if you look at the binary data very carefully, it would be easy to show HUMAN_AUTHENTIC or CREATED_BY_MACHINE and sell this service. Does someone have a recording of something you never said? For $99.99 get it checked at AreYouHumanDotCom!



And for $9,999.99 we'll certify whatever answer you prefer!


The only way to stop a bad guy with an AI is a good guy with an AI...


Now we can have audiobooks read by anyone we like!

They can direct us to our destination!

They can speak at our funeral, being long dead themselves (as long as there is sufficient training material recorded).

The future is awesome.


I legitimately think this could be huge for self-published authors. It takes a skilled professional about forty hours of work to produce an audiobook from a novel-length manuscript. Tacotron could do it in minutes.


I don't see that coming soon. The voice is one thing, but the performance goes far, far beyond that. Without understanding the text, you can't get good prosody out of a single sentence, much less developing a character for a whole performance.

You'd have to "direct" this on a word-by-word basis: "Put the emphasis here. Speed up 10% here. Decrease vocal intensity 25%". You'd end up producing a whole "score", and it would take at least as long as the human actor puts into it.

Having done that, it would be amusing to switch it from voice to voice, as a party trick. But the result would still be much poorer than you'd get out of an actor. Really solving the work of an actor is strong-AI-complete.


What about using a tablet to direct the piece by drawing? You can get values for the intensity, speed and volume (up/down) pretty easily and intuitively.

Even better if it's linked to the voice generation system in real time, then you can save/redo sentences etc. as you go along.


Audio books with genuinely good performances seem rare to me. There are a handful of voice actors that stand out, but many of the titles I've listened to have very flat delivery; the first sample in the original article has as much inflection as they do.


Feed in enough "good" audio books, and you would probably get something passable for smaller titles.


AI is separating the talent from the looks across the board. As it is now, one has to both be able to act and look good, but now the AI will enable those who can act, to be re-skinned, literally, into whatever the client needs.


I did something like this before my grandmother passed. She was a teacher and loved reading books to kids. I recorded her reading Dr Seuss and the Giving Tree to my cousins so I could give my future children a glimpse of that wonderful woman.

It seems that we aren’t far from being able to take those recordings and spin it into a reading of anything. Fascinating. It’s kind of scary though. Grandma’s voice can read anything. Anything.


The emphasis on a lot of these sentences is all wrong; I wouldn't want to listen to an audiobook by this engine. It's still super impressive/terrifying though.


I am currently working on an audiobook project, called Odiobooks.com. I hope to release something soon.

If anyone's interested in the project, feel free to contact me at iamjsonkim@gmail.com.


Imagine when they can also generate the visuals of the book to show you the book as an auto generated video


What it is doing is not really cloning, but because it was trained on 18k different voices, it actually finds one that is closest to yours, and uses that one. It can do a bit of interpolation to create an embedding which is closer to your own, but only if it is well represented by a mix of other voices. Real voice cloning like at https://replicastudios.com/ can take just a minute or two of audio, and it does a fairly good job, and it is always improving. With more audio you start being able to also play with emotion and styles, which is very cool!


I'm not really sure where you're getting this. It doesn't pick a specific voice from a database to use.

From their introduction: "Our approach is to decouple speaker modeling from speech synthesis by independently training a speaker-discriminative embedding network that captures the space of speaker characteristics and training a high quality TTS model on a smaller dataset conditioned on the representation learned by the first network."

Section 2 of the paper explains how it works. Two Minute Papers also goes through it if you'd prefer a video. Link: https://youtu.be/0sR1rU3gLzQ
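
A toy illustration of the difference between "pick the closest stored voice" and "condition on a point in a continuous embedding space". Everything here is a simplification I am assuming for the example: the vectors are made up, and averaging frame vectors stands in for the trained speaker encoder network the paper describes. It only shows that a clip of any length maps to one fixed-size vector, and that an unseen speaker gets their own point rather than being snapped to a training speaker.

```python
import math

def embed(frames):
    """Average per-frame vectors, then L2-normalise: variable length in, fixed size out."""
    dim = len(frames[0])
    mean = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]

def cosine(u, v):
    # Both inputs are unit length, so the dot product is the cosine similarity.
    return sum(a * b for a, b in zip(u, v))

# Made-up "training" speakers with clips of different lengths.
train_voices = {
    "speaker_a": embed([[1.0, 0.0, 0.2], [0.9, 0.1, 0.1]]),
    "speaker_b": embed([[0.0, 1.0, 0.3], [0.1, 0.8, 0.2], [0.0, 0.9, 0.4]]),
}

# A speaker never seen in training still gets an embedding of the same size.
unseen = embed([[0.6, 0.5, 0.1], [0.5, 0.6, 0.2]])

# A lookup-style system would substitute the nearest stored voice...
nearest = max(train_voices, key=lambda name: cosine(unseen, train_voices[name]))
print("nearest stored voice:", nearest, round(cosine(unseen, train_voices[nearest]), 3))

# ...whereas an embedding-conditioned synthesizer would be handed `unseen` itself.
print("embedding passed to the synthesizer:", [round(x, 3) for x in unseen])
```

How well the synthesizer handles an embedding that sits far from anything it was trained on is a separate question.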


They’re saying that underrepresented voices will have trouble being modeled. That matches my experience with this project: for example, I had a very tough time cloning female voices compared to nerdy-sounding / deep male voices.


It's more that the sounds produced during the recordings didn't cover the entire spectrum of possible sounds, so the model had to estimate their sound. All you really need is a paragraph which you can have the person read to get sufficient coverage or just enough recordings that it's not an issue anymore.


Was it 18k voices or samples? Also, is it finding the closest voice, or is it a continuous parameter space formed from the voices?



My bank's (HSBC) telephone banking offers the option to do away with a PIN and instead use a 'my voice is my password' phrase system.

I'm glad I never opted in.


I recently called Chase and got some message like "your voice may be recorded for verification purposes" or something to that extent. Creeped me out and I don't recall ever opting into that specifically, so I'm guessing it is an opt out buried in some agreement.


Oh god, Sneakers


Just to be clear - I'm Robert Redford


passPORT

Also, that whole scenario would have been far less cool if they just recorded the dude for 5 seconds doing anything and pulled a whole CSI-style "put his voice into the visual basic GUI neural network and it works, bro!"

I love Sneakers.


The Tax Office in Australia does the same thing and pushes it every time I call; I imagine it branches out to the other gov bodies too. It's fun listening to the whole spiel about 'Like a fingerprint your voice is unique to you'.


unchecked, the progress of this technology and the staleness of banking security might cause entire institutions to fail


Saw this on Two Minute Papers last night and had a discussion with my partner about whether we needed a secret password to tell if it were really one or the other on the phone. I figured that we had enough shared history that that wouldn't be a problem. Then we realized that there's no such thing as a simulated sense of humor yet and that that would be the best natural encryption for any communication.


2023: ai identifies your sense of humour after hearing you breathe for 0.7 milliseconds


Ha. There'd need to be so many more AI breakthroughs to make that happen it'd be a little thing.


Here's the link to the Two Minute Papers video if anyone else is curious: https://www.youtube.com/watch?v=0sR1rU3gLzQ


“When you see something that is technically sweet, you go ahead and do it and you argue about what to do about it only after you have had your technical success.”

—Oppenheimer


The thing is, these topics have already been discussed by philosophers! Questions of authenticity, human subjectivity, reproducibility etc are not new. But for the average joe and the non-philosophically-inclined techie, the thing has to actually exist before they start really talking about it.


The malign applications of this technology greatly outweigh the benign. Discuss.


Yes, I think there are almost no legitimate uses for the Farnsworth Device. https://theinfosphere.org/A_Device_That_Makes_Anyone_Sound_L...

My personal hell: My mother has dementia and a land line telephone.

Scammers call all the time. All day long. (Although the last few days have been pretty good, I assume somebody somewhere is doing their jobs. The scammers will adapt.) One thing they do is spoof their number to have the same area code and prefix as the one they're calling, so it's like "Oh, is this a neighbor?" or something, but of course it's not. It's an automated machine abusing the telephone network to try to steal money from a little old lady with dementia.

Evil men with robots are attacking my mom. Another one called while I was writing this post!

This is a goddamned sci-fi dystopia.

And now the robo-thieving bastards can imitate my voice!?

I'm going to have to get her one of those satellite-linked walkie-talkies or something. Thank God she doesn't use the internet.


Back when I worked as a telemarketer in high school a long time ago, we sold paper subscriptions, and usually the people that didn't cuss us out and hang up right away were the lonely old people who just wanted to chat. I lasted two months and had to quit; felt like we were just taking advantage of them.


I wonder if there is a market for proxying your mom's calls to you, allowing you to approve/deny each one before it gets to your mom?

One consistent trend in HN comments is young people complaining about their parents' naivety / incapability to understand the modern scamming world, and wishing they could install something or use some service to keep them from falling into these expensive traps. I know this is a big reason why people get their grandparents iPads instead of full blown laptops, because laptops are much easier to inadvertently install malware on.


Land-line telephones are awful because the majority of people who will pick up the phone during the day on Monday-Friday are old or disabled, i.e. easier to manipulate. I don't think I have received a single legitimate phone call during that time. In fact, legit callers know this and know that if they do call for a good reason, the person at the other end will distrust them.


Con: It's now easier than ever to fake someone saying something outrageous, and have that lie spread across the world long before the truth can get its boots on.

Pro: Humphrey Bogart can direct you to your destination!

I admit, it's a hard choice.


I'm currently working by myself on a game that will likely launch without voice acting (text only) because I don't have the money or skill to find and pay voice actors.

If I could act out the dialog myself and then purchase or generate voices other than my own to overlay on top of those performances, the quality and accessibility of my finished product would go up dramatically.

That would also open up the door for more people to be able to mod the game and add additional dialog options. A big complication with voice acting is that it's essentially static. Even though a big focus of my game is modability, if I do voice acting no-one else can add additional levels or areas or expand on the characters without breaking the recorded dialog.

It would be amazing if I could ship some kind of compiler so that modders could record themselves talking through new/changed dialog, and then insert it seamlessly into the game with the correct character's voices.


Exactly. I was working on an Air Traffic Control mod for Kerbal Space Program, but the work has gone on hold due to having to find enough people to record all the lines (to have a decent number of airport voices). Being able to record everything once, and only having to find people willing to let me record five seconds of speech rather than a lengthy recording session, would make this much more feasible.


Hypothetically, if we were interested in donating some voice samples, where would we look to see what lines were needed?


I haven't uploaded the list of lines; I should add that to the GitHub repo.


Have you tried out Replica? I can hook you up with a beta account to see if it'll help with your voice acting.

https://replicastudios.com/


This looks really interesting!

That being said, because mod support is such a huge part of the design of this specific game, I have a policy that I won't use any tools or libraries unless they are owned by me, are Open Source, or export to common, open formats that can be freely read, manipulated and written by Open Source programs.

If I used a licensed product to generate my voices, I would be in the same position as if I hired a voice actor -- I wouldn't have a tool that I was free to ship with the game that any modder could use to edit or add dialog, or to even create new characters with new voices.

The few exceptions for proprietary tools I'm willing to tolerate for this specific game are things that generate MIDI output, sounds, fonts, and PNG files. Everything else is either Open Source or completely owned by me. Even for the final assets like mp3 files and fonts, nothing can be licensed, because I want to have full control over when players have the ability to remix and distribute game assets in their mods. I need to know that 20 years from now players will still have access to everything in the game.

I don't want to derail, so to bring that back around to the current discussion on AI-generated fakes, I believe these kinds of AI techniques should be freely available. A world where AI-fakes are considered so dangerous that only a few select guardians can control them is a world where, to me, this technology stops being useful. I'm not saying Replica is in that position -- I'm just speaking to a broad trend in the conversation around AI.

I think we'll start to see more calls to have single companies controlling AI under the guise of being able to ban bad actors or prevent abuse. I think that would be a mistake -- if anything, ubiquitous technology makes it easier for society to adapt to that technology. A purely SaaS, licensed model for AI generated faces, voices, and text would be all of the negatives of this technology with none of the positives that come from Open access and creative usage.

Gatekeeping won't work, we just have to adapt.


I think there are many benign applications, and definitely a massive potential for abuse. In practice, it will be used mostly for benign applications, I think, but due to the outsized impact, you could still say the malign applications outweigh.

However, what I found reassuring is that the paper actually addresses these concerns:

"However, it is also important to note the potential for misuse of this technology, for example impersonating someone's voice without their consent. In order to address safety concerns consistent with principles such as [1], we verify that voices generated by the proposed model can easily be distinguished from real voices"

This doesn't mean it won't fool humans, especially when used in a carefully crafted setting (low-quality phone call with distressing content).


I'm guessing this will just be used as an excuse by Google to prevent this technology from being easily or fully accessible to others.


On a benign level, many VO artists will find themselves out of work now that we can have Don LaFontaine back.

On a more positive note, perhaps this, along with deepfakes, propels us faster towards an evidence-based society.


On a related note, I can definitely foresee a lot of voice actors having their voices cloned for uses they wouldn't really intend. Seems like a big legal grey area as many countries have personality rights.


Here's an outlandishly optimistic take on the possibilities... as the internet and media are flooded with increasingly convincing but false representations of reality, a widespread habit of greater skepticism of "the facts" starts to spread throughout society, leading people to alter the speed at which they form new opinions on various issues (possibly degrading confidence in preexisting opinions), and the manner in which they construct their personal mental model of reality. As the frequency with which an individual's brain is fed data inconsistent with directly observable reality increases, might a tipping point eventually be reached where it refuses to continue making snap decisions, and instead delays judgement until a later point in time when more information is available?

Perhaps loosely similar to being forced into a stature of "noting" in meditation: https://www.insightmeditationcenter.org/books-articles/menta...


I doubt that. I think that as we have to have more skepticism of 'facts' we'll see more and wider splintering of viewpoint based on preferred communication/news agency.

If you KNOW Fox News won't lie to you, just go there, and only there. Everything else is a lie. If you KNOW NPR won't lie to you, just go there, but only go there. Everything else is a lie.

I think it will only make things worse, because that's the simplest, least 'change' solution for the most people. Society is like water, it always seeks the lowest point.


I imagine, but I feel like there's a point of absurdity where people just aren't buying it anymore. The whole Epstein thing sure seemed to get a lot of coverage and outright bi-partisan mocking just as one example. /r/politics and the "organic" front page of reddit aside, it seems to me the number of people waking up to the possibility that the whole thing is an utter and complete farce is growing.

> Society is like water, it always seeks the lowest point.

Let's not forget these things are analogies, not laws. A trend remains in place until it doesn't.


Imagine we are actually a product of some advanced civilization bored with its capabilities where nothing seems real anymore, so they constructed us as a reality show to experience something that feels real.


There is the obvious: using this technology to put words in somebody's mouth. The more nefarious, though, is that certain people who lie all the time can easily claim that a lie you recorded them saying wasn't them, and now you can't prove otherwise. Certain people can say whatever they want and just deny it later. "Fake News" indeed.


Con/Pro, depending on perspective: people will have to give up the illusion that they ever really could definitively tell truth from fiction.


Are you asking about cordite or gunpowder in general?


I wonder what the legal implications of this, alongside similar developments like deepfakes, are going to be in the next couple of years. We're already having fraudsters impersonate CEOs using deep learning-aided voice generation [1] due to just how low the barrier to entry is now. There's already a public implementation of the paper out [2]!

[1]: https://www.wsj.com/articles/fraudsters-use-ai-to-mimic-ceos... [2]: https://github.com/CorentinJ/Real-Time-Voice-Cloning


CorentinJ's implementation isn't quite as good as Google's - I think with some of Google's samples I couldn't tell that they weren't real, especially over the phone. But I could easily tell with CorentinJ's.

That seems to be common with open implementations of Google's voice synthesis and speech recognition work. I guess they hold back some of the secret sauce, or can afford to train it more.


Currently watching UK: https://mobile.twitter.com/FutureAdvocacy/status/11942824810...

Sorry for the Twitter link but Future Advocacy's website seems to be down.


The latest episode of Blacklist had a dark plot based on deep-fakes.


didn't know new season was out! thx!


This is from 2018 – does anyone know if there are pretrained models and code for this? I found https://github.com/CorentinJ/Real-Time-Voice-Cloning , but the generated audio quality was much worse than the samples here.


The biggest missing piece is WaveNet, which is Google's proprietary voice-synthesizer. With only the models trained for this paper, the best you could build is a voice-recognizer. As far as I know, Google only allows people to do TTS with one of their provided voices.

I don't expect them to open it up until other companies/academics have achieved similar results. It's too much of a competitive advantage right now. Alexa, Siri, etc. all sound like robots compared to WaveNet (Google Assistant).


So....I'm going to paste the abstract here because the headline is incredibly misleading and should be changed.

>Abstract: We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
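
To make the three-stage structure in that abstract concrete, here is a toy sketch of the data flow only. Nothing in it is a trained model or the paper's code: the function names, `EMBED_DIM`, `N_MELS`, `HOP`, and the arithmetic inside each stage are all made-up stand-ins. The only things taken from the abstract are the shape of the pipeline (reference audio to fixed-size embedding, text plus embedding to mel spectrogram, mel spectrogram to waveform) and the fact that the embedding size does not depend on how long the reference clip is.

```python
import math
import random

# All three constants are hypothetical; the abstract does not give real sizes.
EMBED_DIM = 8   # length of the fixed-dimensional speaker embedding
N_MELS = 4      # mel channels per spectrogram frame
HOP = 3         # waveform samples produced per spectrogram frame


def speaker_encoder(reference_audio):
    """Stage 1 stand-in: variable-length audio -> fixed-size, unit-norm vector."""
    sums = [0.0] * EMBED_DIM
    for i, sample in enumerate(reference_audio):
        sums[i % EMBED_DIM] += sample
    norm = math.sqrt(sum(v * v for v in sums)) or 1.0
    return [v / norm for v in sums]


def synthesizer(text, speaker_embedding):
    """Stage 2 stand-in: one fake mel frame per character, shifted by the embedding."""
    frames = []
    for ch in text:
        base = (ord(ch) % 32) / 32.0
        frames.append([base + speaker_embedding[m % EMBED_DIM] for m in range(N_MELS)])
    return frames


def vocoder(mel_frames):
    """Stage 3 stand-in: expand every frame into HOP time-domain samples."""
    waveform = []
    for frame in mel_frames:
        level = sum(frame) / len(frame)
        waveform.extend([level] * HOP)
    return waveform


def clone_voice(text, reference_audio):
    """Chain the three independently built stages, as the abstract describes."""
    embedding = speaker_encoder(reference_audio)
    mel = synthesizer(text, embedding)
    return embedding, mel, vocoder(mel)


rng = random.Random(0)
reference = [rng.uniform(-1.0, 1.0) for _ in range(500)]  # pretend "5 seconds" at a toy 100 Hz
embedding, mel, audio = clone_voice("hello", reference)
print(len(embedding), len(mel), len(audio))  # prints: 8 5 15
```

The point of the split is that only stage 1 ever sees the target speaker, which is why a few seconds of audio can be enough: the synthesizer and vocoder are trained separately and just receive the embedding as an extra input.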


Do you want to elaborate on how the title is misleading? From reading the abstract it seems accurate to me.


"AI Clones Your Voice" implies there might be something on the linked page that involves an AI cloning my voice. Maybe a way to record a few phrases, maybe a text to speech that then uses my voice. Something like that.

This does not do that - only provides pre-rendered samples, kinda disappointing. Impressive, but disappointing.


Thanks... Too long of a scroll to find somebody posting the actual science behind the click-bait.


Nit: this is more design/engineering than science. There is no hypothesis being tested about how the world works.


Did you read section 3 of the paper where they evaluate their system?

> We primarily rely on crowdsourced Mean Opinion Score (MOS) evaluations based on subjective listening tests. All our MOS evaluations are aligned to the Absolute Category Rating scale [14], with rating scores from 1 to 5 in 0.5 point increments. We use this framework to evaluate synthesized speech along two dimensions: its naturalness and similarity to real speech from the target speaker.

They're testing if the generated speech sounds natural with a well-defined and reproducible experiment. That's science.
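
For anyone unfamiliar with MOS, the arithmetic is simple enough to sketch. The ratings below are invented for illustration and are not from the paper, the function name is mine, and the 95% interval uses a plain normal approximation, which is my assumption; the quoted passage doesn't say how the authors compute their intervals.

```python
import math
import statistics

# Absolute Category Rating scale as quoted: 1 to 5 in 0.5 point increments.
VALID_SCORES = {1 + 0.5 * i for i in range(9)}


def mean_opinion_score(ratings):
    """Return (mean, half-width of an approximate 95% confidence interval)."""
    if len(ratings) < 2:
        raise ValueError("need at least two ratings to estimate an interval")
    for r in ratings:
        if r not in VALID_SCORES:
            raise ValueError(f"{r} is not on the 1-5 scale in 0.5 steps")
    mean = statistics.mean(ratings)
    # Normal approximation (assumption): 1.96 standard errors either side.
    half_width = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, half_width


# Made-up listener ratings for one synthesized clip.
ratings = [4.0, 4.5, 3.5, 5.0, 4.0, 4.5, 3.0, 4.5]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

The same procedure is run twice in the setup they describe, once asking listeners about naturalness and once about similarity to the target speaker, so each system ends up with two scores.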


Evaluation doesn’t make it science.

There’s no investigation of the physical or natural world going on, unless they really think they’re modeling how humans are able to talk. But they’re not — they’re trying to create a system that works no matter how unnatural it is.


I'll take that as a no.

> There’s no investigation of the physical or natural world going on

I just quoted them describing their observational method! Do you just not believe psychology is a science?

> unless they really think they’re modeling how humans are able to talk

I've lost you. They're not generating birdsong. What do you think WaveNet does exactly?


Did you know that there's an entire field called Computer Science - https://en.wikipedia.org/wiki/Computer_science


Computer science should have been called "Computer math".


I took the headline from this Two Minute Papers video: https://www.youtube.com/watch?v=0sR1rU3gLzQ


Lyrebird has had similar technology for a few years now.

https://www.descript.com/lyrebird-ai


An apt name for the technology considering the marvel of nature that the lyre bird is.

https://www.youtube.com/watch?v=VjE0Kdfos4Y


Worth noting that Lyrebird is very rough -- at least last I tried -- and produced extremely robotic sounding (though recognizable) audio.

This method has much clearer audio, but seems to lack generality / TTS capability.


Yes, I remember seeing this a while ago. Nice work by the MILA group in Montreal.


Autocratic regimes can rejoice...they can extract public confessions so much easier now...


Only ones that care about the appearance of propriety though, presumably most never had to try to produce convincing fakes when their word was already law?


>Only ones that care about the appearance of propriety though

Such as the United States?


USA is a plutocracy. Source - this study:

https://www.cambridge.org/core/journals/perspectives-on-poli...


Thanks for the link, I was looking for something to read about this topic.


I used this repository to make Half Life’s Dr. Kleiner sing “I am the very model of a modern major general”:

https://twitter.com/theshawwn/status/1171806394783326208?s=2...

https://www.youtube.com/watch?v=koU3L7WBz_s

Then @jonathanfly deepfaked Dr Kleiner’s face onto a live performance of the song, which was hilariously unexpected. The AI twitter scene is awesome:

https://twitter.com/jonathanfly/status/1171907301231513605?s...

There is some promising new work in the GitHub issues. For example, someone has been training on ~10,000 additional speakers.


Will point out that it is cloning after a short sample and with an unknown speaker. So this is great for that type of comparison and in particular when the person listening does not know much or have great experience with the person speaking.

Now if you were to take something by a well known person (where there is a great deal of audio) it would be much harder to clone anything other than a really short passage.

This would be similar to faking handwriting. Easier to fake one word than to fake three pages. Easier to fake something where you have little to compare a pattern (less can go wrong).

Not saying this isn't impressive; it is. But it's also a bit of a trick based on the very short clips (both samples and created).

I would say that a trained person could do a better fake because they could take into account all the info and be less likely to make a mistake.

Now sure you could manually change the AI as well doing the same thing.


VCTK p240: duplicates the (a) north of England accent well.

VCTK p260: all over the place accent-wise.

LibriSpeech: can't really comment on the American examples, but they seem decent.


Sample 9 is a good example. The pronunciation of "biographer" is consistent even when it should be very different. All of the examples stress the first syllable but an American would stress the second.


I wonder if this could be used in RPG games in which there are so many texts and dialogs that having a recorded voice for all of them is impracticable.


Now a politician can deny any sound bite. "They just deepfaked my voice!"


There's a potential upside to malicious uses - synthesized voices (and deepfakes) can give victims of revenge porn some plausible deniability. This would hopefully take some of the sting out of that experience.


I heard a rumor that robot calls were harnessing your voice prints. Not necessarily true currently but an interesting concern


Not looking forward to the phishing that will exploit this.

Going to call my parents today and warn them. If they ever hear something from me that's not adding up, be skeptical, and verify it some other way before taking any action.


Anybody else notice how that Scottish male reference voice sounds considerably more English in the synthesized versions?


Yeah, maybe it's just having a more sensitive ear to the Scottish accent but that to me was the furthest from the reference by far.


I heard that too; being a Brit (and working in languages) probably helps. It did pick it up occasionally though, which gives hope that increased sampling and training could fix the slight miss there.

It was that and the Swedish-accented English ("Sentence in Different Voices" section, middle recording) that made it struggle. No traces of the Scandi-lilt were left in the synth version.

Final note would be the French speaker at the bottom of the page seems to be English first language, despite having very good spoken French. Not quite as pure a test of that last part as I'd have liked, despite the ability for the speaker to perhaps read the synthesized version in English back in English. That could be fun.


I can't hear any hint of an English accent in the French-language recordings, they just sound like regular Québec French to me.

However, I'm not convinced at all by these voice transfers across language. I can imagine the second Chinese one being the same speaker in both languages, but not the three others.


That's no Quebecois I've ever heard. Sounds like a Brit who picked it up as a second language in the home or soon after.

Even struggles to finish the sentence due to the effort of reading in the 2nd to last one. Struggles with an extremely common word, 'grand', as well as stumbling over a simple sentence. To be fair, he has heard enough French (i.e. lived and studied there most likely) to get the intonations mostly right but there are a few other giveaways too... it's just not natural or native from where I'm sitting.


One thing is, voice over phone is so compressed that it's surprising it took this long for this kind of voice cloning (and associated fraud) to be all over the place.

We are going to need 2FA over voice communications :)


For those who were unable to speak, e.g. S. Hawking, it would have been feasible to have the computer speaking system use a voice that sounded like him prior to his condition. Amazing.


A staple of science fiction comes to life.


Everyone's really pitching in doing their part to get the T-800 ready within 10 years.


And conversely, Star Trek's frequent use of voiceprint authentication looks sillier by the day.


It's even sillier when robots in almost all sci-fi movies have a 'robotic voice' and human-level intelligence. It's actually much easier to have a human voice and 'robot intelligence'. They got it backwards. At least the computer voice in Star Trek and Data's voice were human.


Fun fact: Several financial institutions (Vanguard, Schwab) allow your voiceprint to be an authentication mechanism.


There was a story a few months back about some British subdivision VP who wired a million dollars to Eastern Europe because the CEO called him up and told him to do it, or something like that.

It was the CEO's voice, but it wasn't the CEO.


Code 1,1A Code 1,1A,2B Code 1B,2B,3 Zero-Zero-Zero Destruct Zero

What that is secure.......


Now imagine the Borg with their advanced computers not being able to cause all Federation star ships to self destruct (BSG style).


It rarely even worked in Star Trek.

