I was greeted by someone explaining that my father had caused a car accident and that they were calling on his behalf: someone would need to send over money for repairs, or they'd call the police.
They added that their cousin, the driver, was a parolee now holding my father at gunpoint, and that if I didn't send them money to make them whole, they'd kill my father.
This was super fishy, you know? But still, with things like “life of a loved one” at stake, it’s hard to call a bluff.
I can only imagine what I'd have done if I'd heard my father's voice pleading for help. They might have been able to get any amount of money out of me.
Well, if my father hadn’t passed away nine months prior. They were not delighted to hear that.
That, combined with the inability to verify the actual phone number displayed on caller ID, has led me to tell all of my family to never accept a phone call from a number they don't recognize. There's literally zero trust in our phone system, upon which we've built our modern economy.
Unfortunately, not everyone can do that. Some people are legally required to answer the phone, even if they don't recognize the number. And unfortunately, many businesses only communicate via the phone system.
So, unfortunately, our entire country is built upon a system we're told to trust implicitly but have no capability to verify.
It reminds me of port knocking.
At that point one of two things happens. Either the telcos fix their networks, or they decide it isn't worth the effort and let the traditional phone system die. Given that phone calls are effectively free, so there's stuff-all revenue in them, I bet it's the latter.
If that happens it will be painful. Like it is with messaging now, but even more so. Messaging now is either SMS with its limitations (like you can't use it from a computer), or a choice of a zillion walled gardens - Apple, Hangouts, Slack, Signal, Viber, Telegram, WhatsApp, Facebook, ... most of which I don't have installed, so I can't communicate with people using them. The voice equivalents are FaceTime, Duo, Viber, Signal - many of the same things, in fact. The result will be worse than messaging - the ability to communicate universally with anyone dies, but with no SMS fallback.
But that's not the end point. Universal communication is just too useful to be dispensed with - as the explosion of the internet, and the postal system before that, have shown. So something will replace it, and once again we will all be able to communicate with anyone we please.
However, the replacement has to solve the parasite problem. Once the cost of sending a message drops below a certain point, every universal system we've had so far has been overrun with parasites, aka spammers. The postal system has junk mail, email has its spam, and now the phone system and SMS have theirs.
A solution may be to allow the recipient to charge the sender any amount they like for successful delivery of a message. Most people would allow friends to send for free, charge unknown senders something, and charge spammers more.
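As a toy sketch of how recipient-set pricing might work (all names and prices here are made up for illustration):

```python
# Hypothetical sketch of recipient-set delivery pricing: the recipient
# publishes a per-sender price list; delivery succeeds only if the
# sender's payment meets the recipient's price for that sender.
DEFAULT_PRICE = 0.25  # assumed price for unknown senders, in dollars

class Inbox:
    def __init__(self):
        self.prices = {}    # sender -> price set by the recipient
        self.messages = []

    def set_price(self, sender, price):
        self.prices[sender] = price

    def deliver(self, sender, payment, body):
        """Accept the message only if the payment covers this sender's price."""
        price = self.prices.get(sender, DEFAULT_PRICE)
        if payment >= price:
            self.messages.append((sender, body))
            return True
        return False

inbox = Inbox()
inbox.set_price("mum", 0.0)        # friends send for free
inbox.set_price("spammer", 100.0)  # known spammers pay dearly

assert inbox.deliver("mum", 0.0, "dinner?")         # free for friends
assert not inbox.deliver("spammer", 0.25, "deal!")  # priced out
assert inbox.deliver("stranger", 0.25, "hello")     # unknowns pay the default
```

The hard part, of course, isn't the lookup table but the payment and identity plumbing underneath it.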
That could happen with the existing phone system, of course, but I'd lay long odds the incumbents have too much in common with the dinosaurs for it to even cross their minds. Sadly, that means we are in for a very painful transition period. In fact they are already losing customers as people abandon land lines in droves, so I'd say the writing is on the wall.
[S]ince mid-2015, a consortium of engineers from phone carriers and others in the telecom industry have worked on a way to [stop call-spoofing], worried that spam phone calls could eventually endanger the whole system. “We’re getting to the point where nobody trusts the phone network,” says Jim McEachern, principal technologist at the Alliance for Telecommunications Industry Solutions (ATIS). “When they stop trusting the phone network, they stop using it.”
At the point at which individuals and businesses in sufficient numbers find the downsides of participating in the PSTN exceed the benefits, they'll start defecting to other systems. Likely small and closed networks initially.
It took decades for the telephone to become established as the principal means of business communication, and as it was, numerous other alternatives existed in parallel: postal mail, telegraph, telex (for what we'd now call B2B communications), fax, and early email systems.
Email seems to be dying along with telephony, and for much the same reasons.
It's occurred to me that much of the value in social networks is in trying to corner a sufficiently large directory (that is, user base) to be able to credibly take on telephony. What seems to happen is that as these networks grow in size, they too fall prey to the hygiene factors already affecting telephone and email comms: spam and annoyance messages, with concomitant trust issues in the network as a whole.
Whether a technical solution to the trust and identity problem can emerge (and preserve privacy and protect against the surveillance state, surveillance capitalism, and surveillance by other actors - organised crime, racist or fascist oppressors, stalkers, etc.) remains to be seen. I'm starting to think that's a hard, possibly an impossible, problem. An essay of Herbert Simon's I've recently turned up is exceptionally discouraging, owing to a critical error Simon made in it (claiming Nazi Germany committed its atrocities without the benefit of mechanical data processing - it in fact had ample assistance, willingly provided by IBM).
More generally, I'm suspecting that progress in information technology and communications capabilities reduces trust relationships, with some fairly strong historical evidence.
(Overall risks may be reduced, but the mechanisms by which this occurs replace actual trust with validation, verification, and surveillance mechanisms.)
Who is legally required to do that? Are they not allowed to sleep or be otherwise indisposed?
I have a friend who had something similar happen: he got a frantic call from his grandmother, who had learned via a scam call that he was in jail across the country and needed bail money. This was a few years ago, so they couldn't have used a duplicate of his voice, but possibly they were relying on imperfect memory.
Sweeping generalization, but the elderly are, and will likely remain, prime targets of this kind of scam, since they tend to have funds, are less likely to be educated in the state of the art for this kind of tech, and have a protective instinct.
That strategy probably works some percentage of the time.
Sounds like we'll all need more things like this eventually :(
In practice, most people can conduct a reasonable verification through a series of challenge/response interactions based on shared knowledge, should they need to do so. Mentioning something done, said, or shared in private recently would suffice in many instances.
For more robust tradecraft, should you need it, a set of one-time codes (passwords or passphrases) might substitute.
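A minimal sketch of what pre-shared one-time codes could look like in software; in practice the list would more likely live on paper or in both parties' heads:

```python
# Sketch of pre-shared one-time duress/authenticity codes: both parties
# hold the same list; each code is burned after use, so an eavesdropper
# who overhears one code can't replay it on a later call.
import secrets

def make_codes(n=10):
    """Generate n one-time codes to share out-of-band (e.g. on paper)."""
    return [secrets.token_hex(4) for _ in range(n)]

def verify_and_burn(codes, spoken):
    """True if the spoken code is in the unused list; burn it on success."""
    if spoken in codes:
        codes.remove(spoken)  # one-time: never accept it again
        return True
    return False

shared = make_codes()
first = shared[0]
assert verify_and_burn(shared, first) is True   # genuine caller
assert verify_and_burn(shared, first) is False  # a replay fails
```

The same structure works for duress signals: a second list whose codes mean "I'm being coerced" to the listener while sounding like an ordinary check-in to the coercer.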
When the former head of InterPol was arrested in China, he managed to alert his wife through the use of a duress signal, an image of a knife:
Not subtle, but effective.
In the film Capricorn One (1978), one of the astronauts alerted his wife by referring to a holiday they'd recently taken together, mis-stating the destination as Disneyland (the land of make-believe) rather than Hollywood, where they had actually gone, which led to the hoax mission being revealed.
If I'm being coerced, I could have a codeword to indicate that. If I'm being spoofed with AI, I'm not in control of "my" words, so I can't. I need instead to prove when I'm not being spoofed with AI. That's the purpose of this second codeword.
All my loved ones are on my iCloud, so I would just ping their phone/watch while confirming their location, and ask the assailant to let me hear the phone ping on the line.
Also, you are lacking in the abstract thought department. Get that fixed, for your own benefit.
You don't because it statistically never happens. Just like you don't prepare for a plane crash or a lightning bolt striking you.
Hm... that certainly gives me pause, and my first reaction was to be very afraid.
On the other hand-- it still doesn't hold a candle to pyramid scheme sales techniques. I mean in a lot of cases those involve your actual loved ones betraying your trust and love in order to sell merchandise for a third party. Yet somehow in the face of a rising tide of those we still have functioning communities in the U.S.
Side note, it's not accurate to refer to MLMs as pyramid schemes... Even though they're legal, MLMs are worse than actual pyramid schemes (and I don't think most people know what an actual pyramid scheme is, which is unfortunate because they're fascinating).
The American assumed it was a scam, and the person did die.
I have often found that truth is stranger than fiction, and that people are so conditioned by fiction that they can't perceive truth.
Then there is that whole thing where they are getting your voice.
One could wonder if this was some sort of conspiracy to break one of the most successful protocols in the world (or at least not update it so it dies by neglect) to increase profits by other means.
Texting and apps are a much more pleasant way to interact with someone, with the bonus of no hold times, and much of it can be automated.
I think business texting is an upcoming startup unicorn which will be another “trivial idea packaged properly into a billion dollar product”.
You mean Slack?
It will have “jumped the shark” once there’s an SMS button on the business listing that Google Maps displays.
Instead of the wonky/creepy Google demo which did speech to text and then analysis and then text to speech relayed over to the business, every business will just communicate directly with customers over text.
It’s not that this isn’t already done (to some extent). And more so in some countries outside of the US.
But I have no doubt it will become the primary/preferred way to connect with any business, to the point where you will text with an 800- number long before it would occur to you to dial an 800- number to get service.
Like for example, the warranty claim I just made on my Dyson handheld vacuum for a battery replacement. Search for “Dyson warranty claim” and they tell you to dial their 800- number. Now their phone helpline is absolutely the best of the best, but even still most people would [will eventually] prefer to interact via text.
Another example: making a reservation directly with a restaurant (which I prefer to do versus using OpenTable, which will take a cut for doing nothing) is a perfect use case for texting. Also ordering take-out if you already have a favorite order saved, obviously all the notification-type things which make sense over SMS instead of email, making an appointment when a dedicated app is too much overhead, etc. etc.
Sure, every restaurant could build their own automated system that texts you back and manages the communication, but that's never going to happen when there's already a managed, standardized service available.
The daily menu thing is especially endearing.
Twilio has given the ability to programmatically text anyone for years. Why hasn't this hypothetical B2C text business developed yet?
This hints that a "shopify for twilio" would be popular
I lost my dad about 6 years ago after a Stage 4 cancer diagnosis and a 3-month rapid decline. I have some, but not a lot of, video content of him from over the years. My mom still misses him terribly, so for her 60th birthday I tried to splice together an audio message and greeting from him saying what I thought he would have said.
The work was rough and nowhere near what this Google project could produce. She listens to that poor facsimile every year for her birthday. It's therapeutic for her. With some limits for her mental health of course, I'm sure she would love to hear my dad again with this level of fidelity.
And so would I.
That moment has stuck with me for many, many years. The heartbreak on her face, combined with my own frustration of knowing that no amount of luck (or skill) will ever be able to flip the bits of that flash chip back to a permutation which contains samples of her loved one's voice.
Fast forward to the present, my own grandmother passed away shortly after the start of 2019. I was able to salvage some of the many voicemails she had left me over the years, despite having had probably five or six cellphones during that period. Why? I used Google Voice, which is part of their Google Takeout data exfiltration program. I was able to download all those voicemails as MP3s, neatly categorized by caller. My grandma was very terse, so most of them are exactly the same: "Robert, can you please call me?", but in spite of that each one is unique and precious to me. A lot of developers think about getting data into their platform, but it seems to me that not as many think about users getting their data, sometimes precious & irreplaceable, back out of the platform.
I'd pay a good amount of money to be able to relive certain experiences from my childhood with that level of immersion
Unfortunately, it's been a long time since I read it, so I don't remember which book it was in. Maybe someone who's read him more recently can remember.
Update: Apparently, lots of other people wrote about this too, but PKD wrote about this before any of the ones mentioned so far, as he wrote about this in the 1950's or 60's. I'm not sure if he was the absolute first, however. So if anyone knows of any earlier references, it would be interesting to learn about them.
But I'm thinking of a different PKD book where there were actual artificial personality constructs instead.
This is making me want to re-read some PKD!
It's not bad, but my recommendation is to go into it with the expectation of a Black Mirror episode rather than something you might pay to see in the cinema.
Note: this is different from listening to recordings from the actual person.
Having loved ones die is one of life’s universal terrible qualities.
Clearly losing someone and being able to deal with it is an important life skill but just as we build technology powered aids for other situations, I don't think this would be any different
It would be cathartic, but in this case you wouldn't be talking to them but to a computer, which (at best) is pretending to be them.
I think it's kind of creepy, when you really think about it, and it reminds me of the aversion the creator of Eliza had to his creation when he found out that people were spilling their guts out to it and treating it as a real person.
Which isn't to say that talking to something that's not a real person (and especially not a formerly living person you once knew) can't be healing. But if people get confused by these machines into thinking the machines are actually people close to them who died and are now living again, that will make them vulnerable to some really serious manipulation and delusion.
In the end, I don't know if any of that works. But what's being described doesn't seem too far outside the norm. Deprivation often leads to desperation for even a taste, however imperfect it may be.
Maybe we can find solace in the fact that this is, or will soon be, infeasible to avoid, so we needn't try to avoid it.
That doesn’t seem like a message of solace to me.
It's the meager solace of the absolution of personal responsibility - there's no way to avoid it, so at least no one can say "why did you allow that to happen to you".
I guess it's similar to how most photos are a means to an end now, rather than the final product, e.g. satellite imaging or Instagram.
Maybe some movies with the deceased actor's voice?
But what if someone who wants to hurt me sends me files (or phone calls) from the deceased person saying horrible things like:
- "I am still alive but left as I was tired of you"
- "oh Jan, I love you" [fake phone call from the past, where Jan is a lover which never existed]
or even from alive people:
- "I am leaving you"
- or my live voice saying stuff which gets me fired or in prison.
We will never be able to believe voice again...how will we adapt?
To answer your question, I think the biggest step we took in adapting to the ever-present risk that an image may have been manipulated is acknowledging that it's possible. As soon as people knew that something could be faked, they realized that having a purported photograph wasn't irrefutable proof that it happened and learned to ask for corroboration before making assumptions.
I think we'll learn to deal with this new development too.
Wait. No. I had that backward.
How long has photo manipulation been around? And people still fall for it every minute of every day.
I have zero faith these tech developments will lead to anything good, or that we'll even learn how to deal with them effectively.
Verifying an image is not impossible. You just have to consider:
* Who took the image?
* Do they have any reasons for wanting to fake it?
* Was anyone else able to verify they saw, or took a photo of, the same thing?
* Is anyone in disagreement with the content of the image?
We don't need image manipulation to fool people on facebook. Recently a random image of a park full of rubbish was used with the caption that this was the result of a recent environmental protest but the image was actually months old from a totally unrelated event. People believed and shared it because they wanted to. You could just as easily write a text post saying you saw a bunch of rubbish after the event and you would have almost the same effect.
People fall for headlines every minute of every day.
I could see this, if it becomes commercially viable, potentially being a huge boon to indie game creation, for instance, since hiring a load of voice actors to record the dialogue for an entire game is vastly more expensive than, say, hiring a bunch of different people to record their voices for 5 seconds—or even, if this ever took off, buying a bunch of samples pre-recorded (or networks pre-trained) for the purpose.
We're currently working on using voice AI to create real products over at https://replicastudios.com
Cheaper games vs distressing phone calls in this case.
I'm open to better use cases but for now I haven't heard any.
Like so much else in life, we have to take the bad with the good, but not looking for the good in it doesn't make us any less likely to get the bad.
"Say, Grandma, before I wire the money, what's the name of your cat?"
It seems we'll be going back to judging the likelihood of one's actions based on one's reputation, for better or for worse.
There soon won't be such a thing as unreasonable doubt.
And public/private key validation may become invaluable.
I'm sure some people with selective mutism would like to use text-to-speech with their own voice.
> How will we adapt?
Digitally signing audio clips
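As a toy illustration of that idea, the sketch below uses a symmetric HMAC as a stand-in for a real public-key signature scheme (a production system would use something like Ed25519 so anyone can verify without holding the secret):

```python
# Toy sketch: bind an audio clip to a device-held secret so tampering
# is detectable. HMAC here stands in for an asymmetric signature.
import hashlib
import hmac

DEVICE_KEY = b"secret-key-baked-into-the-handset"  # hypothetical

def sign_clip(audio_bytes: bytes) -> str:
    """Return a tag binding the clip to the device key."""
    return hmac.new(DEVICE_KEY, audio_bytes, hashlib.sha256).hexdigest()

def verify_clip(audio_bytes: bytes, tag: str) -> bool:
    """Constant-time check that the clip still matches its tag."""
    return hmac.compare_digest(sign_clip(audio_bytes), tag)

clip = b"\x00\x01\x02\x03"  # stand-in for raw PCM samples
tag = sign_clip(clip)
assert verify_clip(clip, tag)
assert not verify_clip(clip + b"tamper", tag)
```

This only proves which device produced the clip, not that the audio entering the microphone was genuine; a cloned voice played at a signed microphone still signs cleanly.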
It may be that the next big business opportunity lies in creating 'anti-AI' technology, just as it did with antivirus software in the '90s and 2000s.
Link to talk:
The anti-spam software loses because eventually having self-aware AI view spam calls 24/7 was considered torture and they weren't allowed to go that far.
1. (edit) It occurred to me that some people may wish to manage their public keys independently of (say) Apple, distributing them via Keybase or key-signing parties, so they don't have to suffer low-trust warnings. Now that I think of it, instead of merely signing streams, they could be signed and encrypted using the recipient's public key for 1:1 transmissions. Obviously law enforcement won't be a fan.
You conflate content trustworthiness with origin surety, but a CA system doesn't even provide that.
Being good at Photoshop is really difficult, and producing good fakes is extremely time-consuming. Today, even with a Hollywood budget, most such effects still look off. However, the industry has gotten much better, with actor enhancement, for example, generally going unnoticed.
Which I think is the real issue: this stuff is becoming easier over time. AI could be the tipping point where eventually people just stop trusting images and video. But that transition is gonna be difficult.
Furthermore, GAN discriminators are (as I understand it) often hobbled a bit to ensure that the generator can make progress on the loss function. An always-correct discriminator doesn't provide a useful gradient.
Looking at the papers I must say I think the ability to fool humans is a scale problem, not a fundamental limitation. Already GAN produced images and sounds survive "normal" human scrutiny: if you have no reason to suspect foul play you won't see it. If you really go looking, you'll see it.
For video and audio, I imagine a combination of hardware signing, perhaps with the camera itself living on an isolated, Secure Enclave-like chip, and sending hashes of (incoming images/video * deviceID * trustedTimestamp) to a blockchain or some other public distributed ledger. Getting the timestamp from a service that keeps its own record adds further security.
This obviously requires an internet connection, and would likely be useful mostly for news and government agencies, law enforcement. But if the culture is affected enough by deepfakes, I can imagine it becoming more ubiquitous. The parts are all there, it’s a question of utility.
It’s acknowledging that using AI to catch AI fakes is a fool’s errand, and relies instead on the premise that hashing a raw data stream is much faster than producing a good fake, and that a secure device key is secure. You’d need both for it to work, otherwise you can generate a deepfake beforehand and get the device to sign a fake stream. That may be easier to do than I think; this is not my area of expertise.
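A rough sketch of the commitment described above, with hypothetical frame data: hash each raw frame together with a device ID and a trusted timestamp, chaining the hashes so frames can't be reordered or replaced after the value is published.

```python
# Sketch: commit a capture stream to a single digest suitable for
# publishing to a ledger. Device ID, timestamps, and frames are toy data.
import hashlib

def commit_stream(frames, device_id, timestamps):
    """Fold (prev digest, frame, device ID, timestamp) into a hash chain."""
    digest = b"\x00" * 32  # chain seed
    for frame, ts in zip(frames, timestamps):
        h = hashlib.sha256()
        h.update(digest)             # link to the previous entry
        h.update(frame)              # raw frame bytes
        h.update(device_id.encode())
        h.update(str(ts).encode())
        digest = h.digest()
    return digest.hex()  # the value you'd publish

frames = [b"frame1", b"frame2", b"frame3"]
root = commit_stream(frames, "cam-42", [100, 101, 102])

# Any change to a frame, the device, or a timestamp changes the root.
tampered = commit_stream([b"frameX", b"frame2", b"frame3"], "cam-42", [100, 101, 102])
assert tampered != root
assert commit_stream(frames, "cam-42", [100, 101, 102]) == root
```

As the comment notes, the scheme proves the stream existed at publication time on that device; it can't, by itself, prove the device wasn't fed a pre-rendered fake.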
They can direct us to our destination!
They can speak at our funeral, being long dead themselves (as long as there is sufficient training material recorded).
The future is awesome.
You'd have to "direct" this on a word-by-word basis: "Put the emphasis here. Speed up 10% here. Decrease vocal intensity 25%". You'd end up producing a whole "score", and it would take at least as long as the human actor puts into it.
Having done that, it would be amusing to switch it from voice to voice, as a party trick. But the result would still be much poorer than you'd get out of an actor. Really solving the work of an actor is strong-AI-complete.
Even better if its linked to the voice generation system in real time, then you can save/redo sentences etc. as you go along.
It seems that we aren’t far from being able to take those recordings and spin it into a reading of anything. Fascinating. It’s kind of scary though. Grandma’s voice can read anything. Anything.
If anyone's interested in the project, feel free to contact me at firstname.lastname@example.org.
From their introduction: "Our approach is to decouple speaker modeling from speech synthesis by independently training a speaker-discriminative embedding network that captures the space of speaker characteristics and training a high quality TTS model on a smaller dataset conditioned on the representation learned by the first network."
Section 2 of the paper explains how it works. Two minute papers also goes through it if you'd prefer a video. Link: https://youtu.be/0sR1rU3gLzQ
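Schematically, the decoupling that the quoted introduction describes can be sketched as three stand-in functions; this is toy code illustrating the data flow, not the real networks:

```python
# Toy illustration of the three-stage pipeline: a speaker encoder maps
# reference audio to a fixed-size embedding, a synthesizer is conditioned
# on that embedding (never on the raw reference audio), and a vocoder
# turns the result into a waveform. All three bodies are made up.
def speaker_encoder(reference_audio):
    """Map a few seconds of reference speech to a fixed-dim embedding."""
    return [sum(reference_audio) % 7, len(reference_audio) % 5]  # toy features

def synthesizer(text, speaker_embedding):
    """Produce a 'spectrogram' from text, conditioned on the embedding."""
    return [ord(c) + sum(speaker_embedding) for c in text]

def vocoder(spectrogram):
    """Turn the 'spectrogram' into a normalized 'waveform'."""
    return [v / 255.0 for v in spectrogram]

# An unseen speaker works because only the embedding crosses the boundary.
emb = speaker_encoder([3, 1, 4, 1, 5])
wave = vocoder(synthesizer("hello", emb))
assert len(wave) == len("hello")
```

The point of the shape is the narrow interface: the synthesizer never sees the reference audio, only the embedding, which is why a few seconds of a new voice suffice.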
I'm glad I never opted in.
Also, that whole scenario would have been far less cool if they just recorded the dude for 5 seconds doing anything and pulled a whole CSI-style "put his voice into the visual basic GUI neural network and it works, bro!"
I love Sneakers.
My personal hell: My mother has dementia and a land line telephone.
Scammers call all the time. All day long. (Although the last few days have been pretty good, I assume somebody somewhere is doing their jobs. The scammers will adapt.) One thing they do is spoof their number to have the same area code and prefix as the one they're calling, so it's like "Oh, is this a neighbor?" or something, but of course it's not. It's an automated machine abusing the telephone network to try to steal money from a little old lady with dementia.
Evil men with robots are attacking my mom. Another one called while I was writing this post!
This is a goddamned sci-fi dystopia.
And now the robo-thieving bastards can imitate my voice!?
I'm going to have to get her one of those satellite-linked walkie-talkies or something. Thank God she doesn't use the internet.
One consistent trend in HN comments is young people complaining about their parents' naivety / incapability to understand the modern scamming world, and wishing they could install something or use some service to keep them from falling into these expensive traps. I know this is a big reason why people get their grandparents iPads instead of full blown laptops, because laptops are much easier to inadvertently install malware on.
Pro: Humphrey Bogart can direct you to your destination!
I admit, it's a hard choice.
If I could act out the dialog myself and then purchase or generate voices other than my own to overlay on top of those performances, the quality and accessibility of my finished product would go up dramatically.
That would also open up the door for more people to be able to mod the game and add additional dialog options. A big complication with voice acting is that it's essentially static. Even though a big focus of my game is moddability, if I do voice acting no one else can add additional levels or areas or expand on the characters without breaking the recorded dialog.
It would be amazing if I could ship some kind of compiler so that modders could record themselves talking through new/changed dialog, and then insert it seamlessly into the game with the correct character's voices.
That being said, because mod support is such a huge part of the design of this specific game, I have a policy that I won't use any tools or libraries that aren't either owned by me, that are Open Source, or that are exporting to common, open formats that can be freely read, manipulated and written by Open Source programs.
If I used a licensed product to generate my voices, I would be in the same position as if I hired a voice actor -- I wouldn't have a tool that I was free to ship with the game that any modder could use to edit or add dialog, or to even create new characters with new voices.
The few exceptions for proprietary tools I'm willing to tolerate for this specific game are things that generate MIDI output, sounds, fonts, and PNG files. Everything else is either Open Source or completely owned by me. Even for the final assets like mp3 files and fonts, nothing can be licensed, because I want to have full control over when players have the ability to remix and distribute game assets in their mods. I need to know that 20 years from now players will still have access to everything in the game.
I don't want to derail, so to bring that back around to the current discussion on AI-generated fakes, I believe these kinds of AI techniques should be freely available. A world where AI-fakes are considered so dangerous that only a few select guardians can control them is a world where, to me, this technology stops being useful. I'm not saying Replica is in that position -- I'm just speaking to a broad trend in the conversation around AI.
I think we'll start to see more calls to have single companies controlling AI under the guise of being able to ban bad actors or prevent abuse. I think that would be a mistake -- if anything, ubiquitous technology makes it easier for society to adapt to that technology. A purely SaaS, licensed model for AI generated faces, voices, and text would be all of the negatives of this technology with none of the positives that come from Open access and creative usage.
Gatekeeping won't work, we just have to adapt.
However, what I found reassuring is that the paper actually addresses these concerns:
"However, it is also important to note the potential for misuse of this technology, for example impersonating someone's voice without their consent. In order to address safety concerns consistent with principles such as , we verify that voices generated by the proposed model can easily be distinguished from real voices"
This doesn't mean it won't fool humans, especially when used in a carefully crafted setting (low-quality phone call with distressing content).
On a more positive note, perhaps this, along with deepfakes, propels us faster towards an evidence-based society.
Perhaps loosely similar to being forced into a stature of "noting" in meditation: https://www.insightmeditationcenter.org/books-articles/menta...
If you KNOW Fox News won't lie to you, just go there, and only there. Everything else is a lie. If you KNOW NPR won't lie to you, just go there, but only go there. Everything else is a lie.
I think it will only make things worse, because that's the simplest, least 'change' solution for the most people. Society is like water, it always seeks the lowest point.
> Society is like water, it always seeks the lowest point.
Let's not forget these things are analogies, not laws. A trend remains in place until it doesn't.
That seems to be common with open implementations of Google's voice synthesis and speech recognition work. I guess they hold back some of the secret sauce, or can afford to train it more.
Sorry for the Twitter link, but Future Advocacy's website seems to be down.
I don't expect them to open it up until other companies/academics have achieved similar results. It's too much of a competitive advantage right now. Alexa, Siri, etc all sound like robots compared to WaveNet (google assistant).
>Abstract: We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference speech from a target speaker; (2) a sequence-to-sequence synthesis network based on Tacotron 2, which generates a mel spectrogram from text, conditioned on the speaker embedding; (3) an auto-regressive WaveNet-based vocoder that converts the mel spectrogram into a sequence of time domain waveform samples. We demonstrate that the proposed model is able to transfer the knowledge of speaker variability learned by the discriminatively-trained speaker encoder to the new task, and is able to synthesize natural speech from speakers that were not seen during training. We quantify the importance of training the speaker encoder on a large and diverse speaker set in order to obtain the best generalization performance. Finally, we show that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
This does not do that - only provides pre-rendered samples, kinda disappointing. Impressive, but disappointing.
> We primarily rely on crowdsourced Mean Opinion Score (MOS) evaluations based on subjective listening tests. All our MOS evaluations are aligned to the Absolute Category Rating scale, with rating scores from 1 to 5 in 0.5 point increments. We use this framework to evaluate synthesized speech along two dimensions: its naturalness and similarity to real speech from the target speaker.
They're testing if the generated speech sounds natural with a well-defined and reproducible experiment. That's science.
There’s no investigation of the physical or natural world going on, unless they really think they’re modeling how humans are able to talk. But they’re not — they’re trying to create a system that works no matter how unnatural it is.
> There’s no investigation of the physical or natural world going on
I just quoted them describing their observational method! Do you just not believe psychology is a science?
> unless they really think they’re modeling how humans are able to talk
I've lost you. They're not generating birdsong. What do you think WaveNet does exactly?
This method has much clearer audio, but seems to lack generality / TTS capability.
Such as the United States?
Then @jonathanfly deepfaked Dr Kleiner’s face onto a live performance of the song, which was hilariously unexpected. The AI twitter scene is awesome:
There is some promising new work in the GitHub issues. For example, someone has been training on ~10,000 additional speakers.
Now if you were to take something by a well known person (where there is a great deal of audio) it would be much harder to clone anything other than a really short passage.
This would be similar to faking handwriting. Easier to fake one word than to fake three pages. Easier to fake something where you have little to compare a pattern (less can go wrong).
Not saying this isn't impressive it is. But it's also a bit of a trick based on the very short clips (both samples and created).
I would say that a trained person could do a better fake because they could take into account all the info and be less likely to make a mistake.
Now sure you could manually change the AI as well doing the same thing.
VCTK p260: all over the place accent-wise.
LibriSpeech: can't really comment on the American examples, but they seem decent.
Going to call my parents today and warn them. If they ever hear something from me that's not adding up, be skeptical, and verify it some other way before taking any action.
It was that, plus the Swedish-accented English ("Sentence in Different Voices" section, middle recording), that made it struggle. No traces of the Scandi-lilt were left in the synth version.
Final note would be the French speaker at the bottom of the page seems to be English first language, despite having very good spoken French. Not quite as pure a test of that last part as I'd have liked, despite the ability for the speaker to perhaps read the synthesized version in English back in English. That could be fun.
However, I'm not convinced at all by these voice transfers across language. I can imagine the second Chinese one being the same speaker in both languages, but not the three others.
Even struggles to finish the sentence due to the effort of reading in the 2nd to last one. Struggles with an extremely common word, 'grand', as well as stumbling over a simple sentence. To be fair, he has heard enough French (i.e. lived and studied there most likely) to get the intonations mostly right but there are a few other giveaways too... it's just not natural or native from where I'm sitting.
We are going to need 2FA over voice communications :)
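Only half a joke: the HOTP/TOTP codes already used for website 2FA (RFC 4226/6238) could just as well be read out over a call. A minimal HOTP sketch, checked against the RFC 4226 test vectors:

```python
# RFC 4226 HMAC-based one-time password, using only the stdlib.
import hashlib
import hmac
import struct

def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """Derive a short numeric code from a shared secret and a counter."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F                                # dynamic truncation
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)

# Test vectors from RFC 4226 Appendix D (secret "12345678901234567890").
secret = b"12345678901234567890"
assert hotp(secret, 0) == "755224"
assert hotp(secret, 1) == "287082"
```

TOTP is the same function with `counter = int(time.time()) // 30`; the caller reads the code aloud and the callee checks it against the shared secret. It authenticates the person holding the secret, not the voice, which is exactly the property a voice-cloning world needs.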
It was the CEO's voice, but it wasn't the CEO.
What about that is secure.......?