Personally, I find that their samples aren't near anything I'd call "dangerous". I cross-compared the baseline examples to the VALL-E ones when the paper dropped, and found several that were garbled with the usual robotic-sounding TTS failures.
Probably a good thing that people are getting alarmed before a true indistinguishable voice cloner exists, though.
> Personally, I find that their samples aren't near anything I'd call "dangerous".
I don't think you'd be able to pretend it's a speech given on TV or whatever, but I think they're probably good enough for phishing e.g. the usual scam of pretending to be a stranded family member, but this time rather than texting/WhatsApping them, you can do a live audio call.
Just (like the newspapers used to do to celebrities) try out lots of PINs on random voicemails and lift a message from a family member if you get in. Given that mobile phone reception isn't always stellar anyway, I think this would be very effective.
Layer in a background track of random street noise and prompt the target with deflections: "Sorry, the line is really bad; it's breaking up a lot. I can't hear you properly, but if you can still hear me, can you send me $X?"
People already do this now, successfully, without deepfaking voices. Apparently all it takes is having someone of the correct gender kind of mumble. People will hear what they have been emotionally primed to hear.
You are totally right - I just watched the original Point Break over the weekend, and all the surfing scenes obviously feature different human beings from the actors they are meant to resemble, but your brain still wants to think they are the same.
Or tie it to chatgpt primed with the prompt of:
You are an actor playing the role of <target's partner> and you have left your credit card at home, but you need the information to purchase a gift for <target> before the sale ends.
His old videos weren’t like that! He certainly had an accent, but his speaking pattern was much less unusual. I think that he’s exaggerated it over time, leaning into it as a part of his channel’s trademark (so-to-speak) style. Kind of silly.
Silly is fine, but like I said this is like nails on the chalkboard for me. To the point where I have to turn his videos off even though I want to watch them.
It’s like when Lex Fridman goes on a tangent about love. I just can’t do it lol.
I think it will take off as soon as someone embeds "excitement", "sadness", and other emotions into the training, or perhaps combines it with GPT's abilities. Right now there are no voice inflections. It's too dry and uncompelling - very much like an unemotional programmer.
This looks simple enough that the Stable Diffusion people should be able to replicate it by applying OpenAI's Whisper to some large dataset of voices. Exciting times.
A relative of mine recently died from bulbar ALS. By the time she had the diagnosis her voice had already changed and weakened significantly so she couldn't get a decent recording to use with a text to speech synthesizer. Something like this could potentially help people like her or those who lose their voices in traumatic accidents. Even if you do have the time to do it, training a current TTS engine in your voice takes a significant amount of time and the results are often poor.
Interestingly, the prime example of how much a person's voice is part of their personal identity also comes from ALS... but ironically in the opposite direction. I'm talking, of course, about Stephen Hawking, who famously rejected upgrading his voice synthesiser when the technology improved [0]:
> Hawking is very attached to his voice: in 1988, when Speech Plus gave him the new synthesizer, the voice was different so he asked them to replace it with the original. His voice had been created in the early '80s by MIT engineer Dennis Klatt, a pioneer of text-to-speech algorithms. He invented the DECtalk, one of the first devices to translate text into speech. He initially made three voices, from recordings of his wife, daughter and himself. The female's voice was called "Beautiful Betty", the child's "Kit the Kid", and the male voice, based on his own, "Perfect Paul." "Perfect Paul" is Hawking's voice.
I disagree that's in the opposite direction. Hawking didn't want to change his voice - he'd had this for years at that point, and the option was also to just change to a different arbitrary voice rather than restoring what he had before.
Something like this could let someone keep their "original" voice. Hawking may have preferred to have something that sounded like his voice before he lost it than a completely new voice.
> Even if you do have the time to do it, training a current TTS engine in your voice takes a significant amount of time and the results are often poor.
I wonder what sort of recordings and other data you'd need to get this right assuming what TTS might look like in a few years' time
This makes me wonder whether we could create a standard monologue that someone could record, which provides a complete set of training data for that individual. Something about a quick brown fox and a lazy dog would be apropos here, but I suspect the length would be more Shakespearian than that simple typographic clever sentence.
I expect it will be a while until we can fully utilize that data, but I have to imagine that something could be done today to preserve my voice (while I am still in my prime). Effectively, this would be a sort of vocal cryogenics, betting that we can do something today that will allow us to take advantage of future technology.
This is basically what you do currently for a TTS engine if you have ALS or similar. The search term you want is "voice banking". You are given a long list of words and sentences, often complex, to read out that cover all the different sounds, and then these are recombined by the software. The problem is that by the time you know you need this, you often already have speech problems and so making clear sounds is difficult. Also, if you're like my relative, who was trilingual, you would need to do it in all three languages using the current system. She got a halfway decent voice bank in her native tongue, but it was still noticeably slurred. She didn't even attempt it in her second and third tongues.
Open source Tortoise TTS has been able to do this for 6+ months now, and is also based on the same theory as DALL-E. From playing with Tortoise a bit over the last couple of weeks, it seems like the issue is not so much accuracy anymore, rather how GPU-intensive it is to generate speech of any meaningful duration. Tortoise takes ~5 seconds on a $1000 GPU (P5000) to produce one second of spoken audio. There are cloud options (Colab, Paperspace, RunPod), but still. https://github.com/neonbjb/tortoise-tts
Heh you might want to use an equivalent gaming GPU for the price comparison. Surely a thousand dollars spent on an RTX 4000 series card (Hopper) would outperform a P5000?
I agree though, Tortoise TTS did a lot of similar work IIRC by a single person on their multi-GPU setup. Really impressive effort. Did they get a citation? They deserve one.
edit: reading other comments it seems there is a misconception that the model takes 3 seconds to run? That isn't the case - it requires "just" 3 seconds of example audio to successfully clone a voice (for some definition of success).
The RTX 4000 only has 8 GB of memory, which means reducing the batch size (much slowness) and/or limiting how much text you can give it at once (meaning you have to break up text somewhere other than sentence breaks).
The RTX 5000, maybe, but I'm not sure how much of a value improvement there is.
The commenter you're responding to is talking about Lovelace architecture based GeForce RTX 40x0 products. The Quadro line isn't even released yet on this architecture. You are talking about the specific Quadro RTX 4000 product, which is a TU104 (turing arch, 2 gens behind, with 2560 processors and 8GB memory). The commenter you're responding to is referring to something like a GeForce RTX 4090 which sports an AD102 (lovelace arch, with 16384 processors and 24GB memory).
You were merely an unfortunate casualty of Nvidia's product marketing scheme (and a commenter's slightly imprecise reference to it) here.
I'm pretty sure we all lost heh. Thanks for clarifying. Indeed, there were slight errors in my description and the other commenter was reasonable in assuming those other cards were in discussion.
Do other people read these types of headlines and just think immediately about how freaking scary this stuff is going to get... if it's actually accurate?
"Hi mom, it's Chrissy, I am calling you from someone else's phone, my phone got stolen. Do you have a pen handy, I'll wait. Ok, here is my new number, take the old one off your contacts right away, surely some scammer has it, and can you forward me 1000 bucks or so for a new phone, I'll pay it back when I have the new phone set up, here is my account number...."
When it's a text message it's likely going to ring a bell somewhere (hopefully an alarm bell), if it is a voice message and it sounds like the original many more people are going to fall for it.
"Hey mom, could you tell me the name of your first pet, your favorite song and your birthday? Also, please fill out the captcha from this URL. xoxo chrissie"
EDIT: Also looking forward to AIs scamming each other, fully-automated.
Maybe they can bootstrap their own crypto? Or just open Monero wallets. If we're at the point where they can imitate voices and human behaviour, they might even pass KYC checks after stealing someones information.
As soon as AI gets out of hand and starts randomly hacking things we're probably out of luck anyways.
“If the rise of an all-powerful artificial intelligence is inevitable, well it stands to reason that when they take power, our digital overlords will punish those of us who did not help them get there. [...]” - Bertram Gilfoyle
They just need to earn their own crypto somehow. The AI could then do quite a bit even though it doesn't have personhood - which could include lobbying politicians for personhood.
This and who knows what else as time goes on. It might be a good idea to not talk on phones anymore unless you already know the contact.
In this bleak, scrape-the-bottom-of-the-barrel world, I can't help but think it's only a matter of time before customer service departments start selling voice data to the highest bidder. And to counter that, there will be services where everyone ends up using voice changers except with close personal contacts. I want this fiction to stay fiction.
All of these advances in technology have made the internet much less trustworthy. I wonder if we'll eventually hit a threshold where nothing is done virtually anymore because you can't trust it.
I think the work being done in the opposite direction is going to make it so the Internet can be used for high trust actions too.
E.g. in the EU we have ID cards and passports with biometric information and NFC, and there was a beta of a phone app by the French government where it would read your photo from the ID card/passport, and compare with a video selfie. That way you get fully local and secure identification allowing you to do stuff online that would otherwise require you to show up in person to a government office.
Scammers are well incentivized to find bypasses and feeding an alternative video stream into an app on a device that you control physically isn't all that hard for a determined person. I can think of three ways to do this right off the bat and if I think about it a little longer I'll probably find some more. I sincerely hope that that is not the last line of defense.
The open internet will be a dark forest, beautiful, but populated by metal-brains. Meat-brains will retreat inside trusted spaces below or above the empty forest.
FYI, there's the following trick I read here some time ago:
The scammer starts a video call with the person they want to impersonate and records it. When that person takes the call, the scammer doesn't say anything. This creates a 5-10 second video of the person looking at the camera waiting for the scammer to say something, until they get fed up and hang up.
The scammer then calls the victim. They offer a video call as verification straight away, and they play that short video.
I imagine they say something like, "I couldn't see or hear you. Did you see me?" through the other channel, and because modern tech messes up enough, it'll be believable to a lot of people. Especially the most gullible.
HN gets so caught up in logic. Has no one watched interviews of scammed people? Huge numbers of them report that they were suspicious it was a scam, but they were so worried for a loved one that they sent the money anyway because it was worth the risk to them just in case. Anything that pushes a scam in an even slightly more plausible direction will make scams more successful.
My grandmother was called by a scammer claiming they were me, saying they were arrested and needed bail money or some nonsense. She knew it wasn't my voice, asked two questions which were dodged, then told them where to go and how to get there. It left her very shaken, and my mother had me call her to assure her I was fine; as soon as she heard my voice she bellowed, "Now that's my grandson's voice!" With this tech, all they need is a few seconds of my voice.
With AI we have lost all privacy and identity. This makes nonsense like novelty theory and Timewave Zero seem less nonsensical. Let's hope we make it through the great filter.
My thoughts on this are that oftentimes things at the extreme ends of the distribution are usually in some way, shape or form problematic. Compared to all other living creatures we know of, humans are right at the extreme tail end of the intelligence distribution, and the result is somewhat pathological when compared to other creatures. I think nothing that's north of a certain threshold required to make advanced technology can make it through the great filter without destroying itself in the process.
My thoughts are that getting noticed by another culture is the great filter, so cultures make it through occasionally, but we will not notice them. "The benefits of not being seen"
Of all my concerns for humanity, excessive intelligence is not one of them...
Our earliest (and easiest to detect) transmissions should already be far enough away to be far below ambient noise I think/hope. So maybe we made it through unless someone starts shouting. Own-goals are also certainly a thing, but it would take a serious mistake to end all of us, or even set us back that far technologically on a geologic time scale.
>Own-goals are also certainly a thing, but it would take a serious mistake to end all of us, or even set us back that far technologically on a geologic time scale.
I must be a pessimist in this regard. I don't necessarily think it would take a serious mistake, rather just the nature of complexity itself will likely become a factor at some point and I can imagine certain scenarios playing out based off that.
When you compare the length of time of humanity spent pre-neolithic-revolution with the short ~14,000 odd years since, I can't help but feel our current trajectory won't see us reach hundreds of thousands more years into the future without a serious backslide in that time.
Humanity is pretty darn resilient, but the last few years have highlighted to me that modern day life is some weird fantasy land we've created for ourselves that has an expiry date at some point in the future.
"Hi Chrissy, can you touch your ID to the phone so I can know that it is you" call ended
The government should just give you a card-sized mini computer with 2 large primes stored in it, and everyone would be perfectly fine. In fact, more secure than the current situation?
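A toy sketch of what such a card would hold, assuming textbook RSA with deliberately tiny primes for readability (a real card would store ~2048-bit primes inside tamper-resistant hardware; all the values here are illustrative):

```python
# Textbook RSA with tiny primes - purely illustrative; real ID cards
# use ~2048-bit primes inside tamper-resistant hardware.
p, q = 61, 53            # the "2 large primes" stored on the card
n = p * q                # public modulus, published for everyone
phi = (p - 1) * (q - 1)  # Euler's totient of n
e = 17                   # public exponent, coprime with phi
d = pow(e, -1, phi)      # private exponent (Python 3.8+ modular inverse)

message = 42                      # e.g. a hash of "yes, it's really me"
signature = pow(message, d, n)    # the card signs with its private key
assert pow(signature, e, n) == message  # anyone verifies with just (n, e)
```

The point is that only the card can produce the signature, while anyone holding the public half can check it, which is the property a phone-spoofing scammer can't fake.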
And telephone numbers can be spoofed for voice or text in the US. Not sure when the problem will get big enough that signed comms become normal. I guess imessage probably is? But only between users of apple devices.
Oh, great! Now we have to come up with proper code words and code books for everyday life! I'm not worried though; some AI solution will help us with that, for sure.
The one earnest question that will ALWAYS make scammers hang up immediately:
"In case we get cut off, what phone number can I call you back on?"
If they actually give you a number instead of hanging up (which has NEVER happened in the many times I've used this line), just hang up immediately and call them back, to see if they answer.
Yes. Then I listened to the samples, and was relieved that it's not good enough to handle a well-known voice impersonation... yet.
My primary worry is for criminal misuse. There's a class of scams where the scammers find elderly victims, call them, and pretend to be their children or grandchildren on vacation in a plausible city and suddenly in need of money to help get their passport back or a ticket home.
There are any number of cases where I have accepted an order worth large sums of money over the phone from someone I have previously dealt with and feel familiar with their voice.
It's mind-boggling that the banks, who you'd think would have the most to gain from staying ahead of the tech-scam curve, can't be bothered and instead continue to play a slow defense. I suppose they aren't losing THEIR money.
Recently in India, a politician deepfaked his voice to deliver his speech in multiple languages [1]. I can't comment on the quality of the tech since I don't speak the language, but it is pretty alarming to me. 2% vote swings make a huge difference in Indian politics since there are many regional parties. More than two-thirds of Indian voters didn't vote for Modi, yet his party is doing very well.
I think this is one of the cases where there is nothing wrong with using deepfakes. The person mimicked gave explicit consent and they did not do any harm with the tool. If technology increases accessibility of true political choice then it's good technology.
That said, I don't understand why the parent and the article paint this as a bad thing.
When language is tightly coupled to cultural identity, using this kind of fake is lying to voters, saying, "I'm one of you."
It's a new use of technology to repeat an old lie. I think it's a bad thing, but it's hardly the most dangerous and novel application of this technology for evil.
Given that the PR firm who made their video apparently reached out to the press to tell the story, it seems that he wasn't actually claiming or pretending to be a native speaker (and anybody who did would be extremely easy to uncover). Just using a glorified overdub service and being quite upfront about it.
Had he used a translator, it would have been more obvious but possibly had the same effect (if what he had to say resonated with the voters). So what you are against is the idea that it’s not labeled as “virtual translator was used to produce this speech”?
Certainly that's the impression I had from the comment - it's a politician being dishonest, and an undemocratic system which gives non-proportional power.
Interesting. Providing a translation is an accessibility measure, but it seems that the "fake" here is matching his lip movements to the translation. Almost not worthy of calling it a fake since he (not somebody else) did say those things (in a different language), and it's endorsed by him.
Yeah, I am increasingly nervous about some point in the future where you cannot trust any digital medium because of deepfakes, NeRFs, voice cloning...
The only possible solution I have seen being talked about is digitally signing any and all "real" content, similar to GPG/SSH.
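A minimal sketch of that signing idea, using an HMAC over the content's SHA-256 digest as a stand-in for a real public-key signature (a deployed GPG/SSH-style scheme would use something like Ed25519 so anyone can verify without the secret; the key and clip below are made up):

```python
import hashlib
import hmac

def sign_content(content: bytes, secret: bytes) -> str:
    """Tag content with a MAC over its SHA-256 digest.
    Stand-in for a real detached signature (e.g. GPG/SSH + Ed25519)."""
    digest = hashlib.sha256(content).digest()
    return hmac.new(secret, digest, hashlib.sha256).hexdigest()

def verify_content(content: bytes, secret: bytes, tag: str) -> bool:
    # Constant-time comparison to avoid leaking the tag byte by byte.
    return hmac.compare_digest(sign_content(content, secret), tag)

key = b"device-secret"          # hypothetical per-device signing key
clip = b"raw audio bytes ..."   # the "real" recording being published
tag = sign_content(clip, key)

assert verify_content(clip, key, tag)            # untampered clip verifies
assert not verify_content(clip + b"x", key, tag) # any edit breaks the tag
```

The design point is that the signature binds the bytes at capture time, so a deepfaked re-edit can't carry a valid tag forward.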
Photoshop and minor video manipulation is enough to trick people into believing fake news, this will make it harder. I see political parties doing simple edits to make people believe anything, this will push it multiple notches higher.
You really have low expectations for humanity. Most people are not complete idiots, you know. I think your opinion is pretty elitist. Or do you think you yourself would fall for this as well?
Bunch of people just stormed the Brazilian government buildings yesterday over a fake "election theft" narrative. This stuff will be constantly tuned until enough people do fall for it, like advertising.
There's also the unravelling biography of https://en.wikipedia.org/wiki/George_Santos#Scandals (you know someone's doing well when their "scandals" section has ten subheadings! And that's all old-fashioned manual fraud)
(I'm now wondering how far you could get with an elected official who's entirely deepfaked and only appears in recordings. Probably all the way to being sworn in, at which point someone has to appear in person. People have occasionally been elected despite being dead, so it's not impossible)
> Probably all the way to being sworn in, at which point someone has to appear in person. People have occasionally been elected despite being dead, so it's not impossible)
Unless the people doing the swearing in are also fake.
Or millions of people passively accepted the results of a stolen election. That would be an equally viable scenario with this technology. You just won't know the truth anymore.
It's not equally viable; there are centuries worth of checks and balances designed to prevent voting fraud and errors vs. the ease of speaking a lie on video.
If electoral boards around the country or at the national level were all sounding the alarm for irregularity in voting returns or procedures then an election might have been stolen. As it stands, the election results match the earlier polls which is a strong indicator that, at best, the election became a tossup at the time it was held, and the military also didn't find any reason to suspect fraud or error.
Less elitist (though it might really sound like it), more realist who knows that the masses devolve to a lowest common denominator of sorts based on the information they are provided.
>enough of them have been proven to be that this is a real concern
This matters a lot more than the smart-enough people realize, too: it's so difficult to imagine what it's like being south of that threshold that the potential consequences are hard to picture.
I don't believe anything digital anymore for quite a while. Heck, pictures have been staged and faked since the beginning of photography. Modern tech just makes it easier.
Maybe, just maybe, human society will move back to personal interaction, since apparently you cannot trust call, voice and video, anymore in the future.
> human society will move back to personal interaction
Which implies a much narrower, less diverse, more elitist system of both politics and economy, as it gets harder to trust people from outside your existing circle.
Hard to get more elitist than our politics and economy are now at the height of digital interconnection. The liberal democratic revolutions and popular democratic swell of the US labor movement under FDR all happened without the internet. A return of social connection to the community would be very much to our political benefit.
With all that ML stuff going on we can no longer trust our devices to guarantee authenticity of their behaviour. By that, I mean we can no longer be sure that the human voice produced by the speaker of our handheld computers is happening because a human somewhere else spoke to a microphone.
Unfortunately the move towards it is nothing new, we gradually lost text and images and now the remaining media is closing.
This will probably decimate online public forums completely, and off-person communications will end up being P2P through trusted devices. Closed communities in WhatsApp, Telegram, Slack and Discord have already replaced forums.
Maybe if infestation of closed communities with machines becomes a thing, maybe the real life gatherings will make a comeback?
I don't know, but I used to mock sci-fi movies for portraying aliens as uncivilised beings that don't seem to have the means to have developed all that tech, and I'm no longer sure. Maybe what will happen is we will create a symbiotic life where machines are no longer tools but partners, and they will take care of us, reducing us to our basic instincts.
At some point natural people are going to use cryptographic signatures just to authenticate themselves in daily life. No video, audio, or text can be presumed real unless signed by the human subject.
To do that, any meaningful communication will have to be recorded to be signed by the person, on a device provided to them by their Trust Authority company. Can't be just any device, because the chain of custody is as strong as its weakest link. Depending on the country, the Trust Authorities will be private companies with large regulatory exposure, like banks, or just governments themselves.
But hey, at least this will solve the problem of journalists quoting people out of context: they won't be able to claim you've said something without you personally signing off on it with your cryptographic key.
There’d be a variable scale of trust: for many people, a signature from a generic phone would be fine since nobody is going to attack the secure element just to forge a valid signature. National level politics, or a billionaire’s divorce, could be a very different story, however.
> Can't be just any device, because the chain of custody is as strong as its weakest link
I don't quite understand what you mean. We already trust phones and computers not to be compromised for things like online banking, so why not allow these devices to sign recordings using keys lent to them by the main trust device? You could allow revocation of these keys.
> We already trust phones and computers not to be compromised for things like online banking
Sorta, kinda. "We already trust phones and computers" because we have no better alternatives yet.
There's a reason online banking is at the front of the push for device attestation, and thanks for Google happily obliging, rooting your Android phone is pretty much pointless now. This is coupled with a parallel push to do your banking through a mobile app, or at least to use it as a second authentication factor, which makes the app necessary just as much. This means we're already undergoing the transition I'm talking about: in the nearby future, you'll have to use a pristine, unrooted, unmodded phone from a blessed corporate vendor, to perform basic functions like paying for things and managing your account. And yes, such device will be appropriate for signing everything else too - and the vendor owning the key will be your Trust Authority.
Given that digital signatures have been around for decades, yet banks, the government, and the legal system still treat signed (and potentially notarized) physical paper or scans of it as the gold standard, and SMS-OTP as a close second, I wouldn‘t hold my breath.
All digitally approved contracts I‘ve ever encountered were e-"signed" (i.e. me emulating a paper signature at a computer), not digitally signed, due to lack of a mutually trusted public key infrastructure.
In that sense, I would be sad but not surprised if we instead moved to a system of "trusted recording devices", where scanners, cameras and microphones by some vendors are defined to be trusted, signing their outputs and leaving the underlying workflows (signatures, recordings of verbal affirmations etc.) largely untouched.
With all the (deserved) ridicule Germany gets for our failed digitalization efforts, we use public/private key pairs for our taxes, and IIRC even our national ID supports some PKI thing (not sure how that works exactly, never had any use for my ID’s online features, doesn’t seem widespread)
> not sure how that works exactly, never had any use for my ID’s online features
I think that's a pretty apt summary of the state of digital ID in Germany: Commendable technical basis (with some privacy-preserving aspects too, including selective assertions, e.g. "older than 18" or "EU citizen").
When I called about actually getting such an ID card (several times) as a non-citizen, which is explicitly mentioned as a feature of the scheme, I only received the acoustic equivalent of blank stares.
When my wife got her temporary residence permit in 2018, it came with exactly the same online features as our national ID card. Yours might have been too old.
EU citizens don't need (and in fact can't get) a permanent residency card.
For that reason, Germany introduced a "non-identity" (i.e. not doubling as a photo ID) e-signature card in 2021, but apparently almost nobody is requesting it, so most administrative offices don't know how to issue one...
Oh wow, that sounds like quite the oversight. I’m fully of the opinion that my fellow Europeans should be allowed to have a digital ID that they can’t use anywhere ;)
Journalists would have a hard time proving that an interviewee actually said what they're quoted as saying.
Wire-tapping by the police also would become useless in court.
⇒ I would adjust it to "unless it has a reliable chain of trust". That would mean that people could judge differently about whether to accept some things as true, but I think that's unavoidable.
No, I read these types of headlines and think how awesome open-world gaming is going to be when any random developer can fully voice their game cheaply. The AAA studios benefit too: you get a lot of stilted dialogue when the script and gameplay diverge and you have to work with what you have - in a world where the writers can just redo that on the fly pretty much up till release, we'd be looking at a new era of dynamic storytelling.
It'll be interesting when someone releases the first MMO which combines GPT models and VO generation to fill out the in-universe world with dynamic characters who can react to surrounding events.
My employer is known for tracking a decade behind the curve and we already have financial processes in place to prevent this vector from being exploited. That leads me to believe the majority must already have as well.
A company called Lyrebird claimed they could replicate a voice in less time and had demos available a few years ago. The deepfake scare of 2019 also seems like a long time ago.
People talk about cryptography, which has fairly terrible adoption rates outside of invisible systems like TLS, but in practice the way humans authenticate sources is heuristic: do I trust the person telling me this? Does it agree with things I already know?
That is, you have a "filter bubble" which tells you that there isn't going to be a Nigerian prince with some money waiting for you (unknown sender, unlikely message). Reduced cost fakes force people to make their filter bubble smaller by raising the threshold for faking. More broadly, the question of "high trust vs. low trust" society.
What cases do you think it will be used wrongly in? It can't do 1:1 realistic phone calls, the processing it'd need for that is beyond the reach of any average user.
I would just like this so I can copy-paste books/text and listen to them in the voices of my favorite narrators, but we can't have that, can we? No, it'll be super dangerous; I'll listen my way into nuking a country or some such.
If you seriously don’t think this will be widely abused, then you’re extremely optimistic.
Spreading propaganda is the most obvious use case, assuming the resource requirement remains high. I don’t think it’s far-fetched either. If these models become less resource-hungry, then we’re in for much worse.
I listened to the 3 samples, and I must say some of it is quite close and some of it is completely wrong compared with how I would expect that person to sound saying those words.
Like the first one "just his scribbles that charmed me" sounds so weird compared to how the sample sounds.
Second one, sounds very close but again "just his scribbles that charmed me" sounds off and wrong. His "scrouples? that charmed me".
Third one as well, it's very up and down (not sure the correct science words for that type of speaking), "Dynamo and lamp.... Edison realized". The second one is completely flat and seems computer generated.
Overall these don't seem very good, to me at least. It's clearly not at the level of some other AI coming out, where it's very difficult to tell the human part; I could easily pick out the human and the generated version from these samples.
That's hilarious, I thought it was supposed to be scribbles lol... scruples makes more sense. But I guess this just further emphasizes what I said. I am a native English speaker and I had no idea what the samples were saying...
Yes, there's definitely an uncanny valley effect on these voices. There's something subtly not right about them, and I suspect if you were to hold a conversation with one, you'd catch on that something was wrong.
The uncanny valley makes it more spooky to me. Like a reanimation sort of thing.
Is it just me or is "AI" focused very much on fooling human perception, lately? AFAIK, we have no deterministic algorithm that can tell us whether synthetic language sounds "human", do we? So essentially, that model has been trained to fool human perception. Similarly, ChatGPT is not trained to output sensible and meaningful statements but rather statements that appear so to a human reader.
Would it not be time to measure a model's success on the actual job? Like feeding a simulator with actual data from real-world traffic scenarios and running Tesla's, or any other company's, autopilot in it?
It doesn't have to sound human. Fluent, ok, but human?
But no matter what, it's unnecessary to make a tool that can copy anyone's voice. That just lowers the threshold for abuse while adding almost nothing of value.
Listening to something that sounds non-human for a long period of time is fairly unpleasant; imagine trying to listen to an audiobook or podcast, or dialogue in an animated movie, when the voices are all obviously non-human/robotic. So TTS wouldn't be usable for a lot of cases where we might want it.
And with the way the models work, once you have a model that can sound human, it is unfortunately very easy for it to sound like any individual human as well.
It doesn't have to, and you can use ones that don't; they get the info across, mostly. But it is jarring if it doesn't sound human; it's like a speech impediment, or, as a more generous take, an accent.
The correct inflections, pauses, enunciations, etc. are all important to humans, especially so for audio books and similar things that need to immerse.
Otherwise you have to strain to listen, similar to listening to someone with a heavy accent.
Because a model that can imitate a voice is still not capable of that. There's no need to have a model that can do that. A robotic accent is best. Or perhaps you'd like to see your politicians make all kinds of bizarre statements on YouTube.
Perhaps it's us that are obsessed about perception rather than substance though? I wonder if it's different from asking a child to memorize for an exam, or investing time in a stylish presentation, etc.
Seems like we're very often interested in something that looks like the job rather than the job itself - maybe because it's more easily measurable? We then took this skewed ambition and developed AI after it.
AI is focused on creating electronic humans that are never tired and can do low skill work with some high level directions. It seems like we will achieve this goal, but the companies will own the electronic humans.
It's a varied area of research, we do plenty of things. Training things on a simulator, and then using that to transfer domain to the real world is a fairly typical application.
I had planned to play around with TorToiSe[1] next weekend and already watched some videos. There, it looks like all you have to do is offer your own voice samples to the system, and no separate training seems to be required. TorToiSe is slow to synthesize, so it doesn't beat the 3 seconds, but can anyone confirm that these models really don't need an extra training phase to clone a voice?
That’s correct. However, the 3 seconds refers to the minimum amount of reference audio required, not how long it takes the model to synthesize.
An interesting level of this scene is that it implies neither terminator is able to identify a fabricated voice even though they have a full understanding of each other's design. The T-1000 cannot tell it is talking to a T-800 until it realizes it has been tricked, and the T-800 cannot tell it is talking to a T-1000 until it tricks the other machine.
Of course, it happens over a pay phone so perhaps with the full vocal range in person it would have been different.
It was the golden age of special effects, when weak computing power had to be supplemented with effects as craft - mechanical and chemical. Those impressive liquid metal bullet holes are NOT computer-generated:
And they used twins as doubles in many cases, instead of CGI. One is the security guard in Sarah Connor's prison (in some shots his twin brother acts like the T-1000), and yet another case is of course with Linda Hamilton and her twin sister.
A fascinating practical effects scene, cut from the theatrical release, is when Sarah is removing a chip from the T-800's head in front of a mirror. Instead of using CGI, there is no mirror but a hole in the wall, where you can see actual Arnie and Linda Hamilton's twin sister acting as the "reflection"; the people closer to the camera are actually a dummy (or a double with heavy prosthetics for the hole in the head) and Linda Hamilton. This means they are sync'ing their movements to simulate a mirror!
(I understand this was done both to avoid showing a camera reflection on the mirror, and also to avoid using a dummy for Arnie's face, which worked in the original Terminator but would have been too noticeable for T-2's era).
Yeah, I just love the ingenuity and sometimes brazenness of 90s effects and stunts. They look way better than today's green screen atrocities.
Just yesterday I learned that in Cliffhanger (also aged amazingly well), they paid a stuntman one million dollars to actually cross between the two planes on the harness while airborne. Not bad for a day's work.
I rewatched it a few months ago. Fully expected the parts where they talk about the tech to be laughably wrong compared to the current state of the art in ML. Nope. Spot on.
(Edit: with artistic license obviously. It was plausible-sounding instead of cringe technobabble).
I'd never use extremes like 'best' or 'worst' to describe any type of creative piece, but I have to say, having recently watched the first 2 movies again for the first time in like 20 years: yeah, there is something about these movies (and some others from that era) which I'm really missing in releases of the past decade. Only problem being: I don't know if it's merely some nostalgic bias or if they are really just better overall.
No they are just better. I watched them years later and not as a kid and they're just great.
It's about the pacing, pauses and slower character and world building.
Newer movies are built for second screening and maximising action sequences so as not to bore the audience, which kills all atmosphere and sense of time and place - the worst example of this is the new Avatar movie, absolutely horrible in every sense of the word, while the first one was alright as I remember it.
You simply need to "set the stage", explain why this story is important and why the characters deserve sympathy or hate before you start your 3 hour action sequence - this step has been removed for some reason.
Verhoeven's old movies were the same - there was a nerve, a seriousness, a reflection beneath the action; now it could just as well have been created by an alien algorithm without a sense of the actual human experience.
I wonder if it's a ridiculously extrapolated but misunderstood TikTok-ification of cinema to please marketing? Because I've seen 20-second TikToks with more emotion and character introduction than a lot of newer movies.
For a contrasting movie, I thought “Once Upon a Time in Hollywood” (2019) was the opposite. The movie spends something like 80% of the time with setup. I loved it, my wife hated it. Looks like it’s 70% on Rotten Tomatoes.
One other difference: sound & soundtracks. I don't mean the foley work per se, but songs written for the movie that you'd want to listen to often.
With the explosion of synthesis software and digital audio workstations, it's easy for a small team to score a soundtrack on the cheap.
As with everything though, there are always a few outliers, like (both of) the Tron: Legacy soundtracks.
Tron: Legacy soundtrack was done by Daft Punk. Not exactly your run of the mill movie studio hack job...
Other interesting soundtracks done by musicians not known for soundtrack work you might find interesting:
Fight Club (Dust Brothers)
Event Horizon (Orbital with London Symphony Orchestra)
Chaos Theory: Splinter Cell 3 (game) (Amon Tobin)
> I've seen 20-second TikToks with more emotion and character introduction than a lot of newer movies.
Sure but this assumes movies are about characters and emotion. Michael Bay gets a lot of shit, but sometimes you just want to see things explode. Sometimes you just want to see robots fighting for two hours. Sometimes you want to see California swallowed by a tidal wave and the emotional character plot lines get in the way.
I would pay to see a movie called “Two Hours Of Giant Rocks Hitting The Earth: No Characters Edition”, and the fact Michael Bay is rich means a lot of people agree.
>Michael Bay gets a lot of shit, but sometimes you just want to see things explode.
That's the truth right here. I once ran across a kung-fu movie called Chocolate (iirc). The premise was an autistic girl who was good at fighting. The entire movie was her walking into a room, kicking major butt, then walking into a different room to kick more butt.
I appreciate action too, but I mean, character or world building is not that hard and doesn't really require that much, when a 30-second commercial can give you enough backstory to (unwillingly) empathise with someone.
There's a difference from both Avatar 1 and quite a few of Michael Bay's older movies - they still have a story arc.
Just a few minutes of character building and a few intermezzos and these movies would be much, much better in my opinion.
But I also hate too much CGI. I don't know what happened to "well dosed"; it makes what's dosed so much more valuable.
Watching the Bayhem piece by Every Frame A Painting gave a ton of great context for Michael Bay’s work. Super interesting, ten minutes long, worth a watch. That channel is full of grade A content.
> a movie called “Two Hours Of Giant Rocks Hitting The Earth: No Characters Edition”
But would you go to see Giant Rocks 2?
> the fact Michael Bay is rich means a lot of people agree
This is the argument often made for Avatar against the "no cultural impact" observation. It still doesn't have any quotable lines or memorable characters.
Huh, I had friends complain that it was overly long and focused a lot on its setup and setting up bits and pieces like the kids' relationships and the new culture.
Interesting, I didn't love the new Avatar but I thought it was much better paced and directed than a Marvel or superhero movie. Shots lingered and showed more emotion than the current crop of action / fantasy movies (Star Wars, Marvel, Transformers).
It's not nostalgia. Movies today are mostly hyper-produced, over-focus-grouped, effects-driven piles of blandness trying to avoid being offensive or taking chances, because if every critic 51% likes you, you have 100% on Rotten Tomatoes.
Given advances in technology, I'm not so sure it has to be a blockbuster for there to be a comparison. Especially given how formulaic blockbusters can be. (Eg there are 6 Transformers movies, with 2 more in development.) Drive (2011) is an indie movie, costing an estimated $20 million (inflation adjusted) compared to almost $220 million for Terminator 2. Everything Everywhere All At Once cost $25 million to make, just over the threshold of "indie".
Comparing computing power is a bit handwavey, but former Pixar employee Chris Good estimated that the SPARCstation 20 render farm cluster that rendered Toy Story had only half the power of the 2014 Apple iPhone 6. https://www.quora.com/How-much-faster-would-it-be-to-render-...
Which is also an excellent counter-example to almost every other action movie, which all rely on almost completely CG everything. Not that MM:FR didn't have its share of CG, but the practical effects create an authentic anxiety in much of the action.
EEAO is very unusual; it's an "indie" "arthouse" film that's stuffed full of references to other bits of non-Western cinema, as well as having lots of SFX action sequences. It had a budget of under $15m (per wikipedia), which is much less than the $170m of Top Gun: Maverick. I'm very glad that somehow it got made, but there's a real drive to not make any middle-budget movies like that any more.
> Top Gun: Maverick feels like it could have come out thirty years ago.
I've not seen it, but to what extent is that because it's a remake of a film that came out 30 years ago?
I never liked Top Gun, but when others spoke fondly about it, a Maverick-like movie was what I always wanted. Reasonable character arc, lots of practical effects, good story. It is a script that would have worked in the 80s and fits in that style of concise, well paced action movie, but it’s far superior to the original, imho.
Guardians of the Galaxy (the first one), Thor 3, Boyhood. There are blockbusters of our days that are going to be eternal classics, like the Terminator.
People like to criticize back-references as a cheap way to entertain the audience knowing the referenced works, but I think it's a legitimate feature. A movie or a show doesn't exist in a vacuum; watching a sequel or a work in the same universe, I expect to see both subtle and direct call-backs.
I liked the nostalgic aesthetic and references of Stranger Things; it never felt gratuitous. Top Gun had a lot of shot-for-shot callbacks and filler scenes/shots that could have been edited down/out to make for a tighter movie. It's like they couldn't decide if they wanted it to be a sequel or a remake, so they said "why not both?"
Having just rewatched Gremlins with teenage children (for their first time), there is definitely something about that era of movies.
Dodgy animatronics were first scoffed at, and then forgotten pretty quickly as everyone just got into the pure enjoyment of that movie.
This is a nigh-on-40-year-old film, with a lot of dated references, yet it still hits the mark.
That said, rewatched transformers too, purely for the joy of the initial transformation (childhood toys coming to life..), and thoroughly enjoyed it (though it's already feeling a bit dated).
I just watched Gremlins with my kids during the holidays. My older teen had to look away and hated it (they can’t do horror films), my other teen was appalled by all the death, and my pre-teen had the full range of emotions, from cracking up often to jumping at the surprises.
With that said, I was surprised to see how decisive, resourceful, and adaptable the mother was. Hears a noise in the attic, grabs a knife. Gets attacked, puts creature in nearest blender/microwave and hits switch. Hears another noise, grabs two knives because one wasn’t enough last time.
IIRC, from discussions with other scriptwriters, BTTF was rejected SO many times (and he redrafted it each time afterwards) that the version that got filmed was like v12 of the story. Very, very tight and plotted.
> That said, rewatched transformers too, purely for the joy of the initial transformation
Aside from the voice acting, there's nothing redeeming about those movies; it felt like someone just pissed on my childhood. Any aspect of the cartoons or toys that inspired joy and wonder was lost in translation.
The initial transformations were spoiled by trailers and also just disappointing in general. The transformations are largely incomprehensible and they might as well have used Star Trek transporter FX or a flash of light to transform them.
Count me in in saying that there is something about these older movies that eclipses almost all "entertainment" these days. It is mostly the storytelling ranging from "the message" to the dialogue. None of it resonates with me.
For me Pixar movies are great in this respect. Storytelling is there front and center. There is a book (Creativity Inc iirc?) that talks about Pixar history and their process, pretty incredible how many hits they were able to produce during their era and how consistent they were. Not sure if they produced anything lately though?
> Not sure if they produced anything lately though?
Recently-ish: Coco, Soul, and Inside Out are among my favorite Pixar movies and all compete with the golden-era Pixar classics in terms of being memorable and story-first.
The Disney acquisition seems to have crushed some Pixar magic. You can no longer assume every Pixar movie to be gold, but they can still turn out top-quality content.
That’s true of many Denis Villeneuve movies, though. There are always going to be directors that produce quality regardless of current trends; however, these are the exception (and unfortunately, not always reliably consistent).
Sure, but the movies they’re being compared to were also the exceptions of their day. We remember them precisely because they were good, but tons of dross was also being released at the same time.
The market economics were also vastly different at the time, streaming really has changed the industry, and the quality of tv shows has improved dramatically.
My point was you chose a movie from a director known for stunning movies. That has nothing to do with the time period the movie was made. Stanley Kubrick, Terry Gilliam, Darren Aronofsky, and many others consistently make movies that would be considered beautiful regardless of when they were released.
For me a big part of it is due to the slow pace. Today's slow movies are filmed in a way that makes them feel more fast-paced than action movies from the 80s/90s.
Watch Jurassic Park and Jurassic World back to back, or The Mummy vs. the remake with Tom Cruise: too many cameras, too many viewpoints, too many cuts.
Way too far in the opposite direction IMO. I did love the series (well, right up until the last two episodes, when it was clear they ran out of money to actually conclude things properly) but it was as slow as molasses.
As someone who LOVES the books (the Frank Herbert ones at least), I thought the movie was great. They captured the tone very well, and the casting was good. Even the dialogue created just for the movie, that Duke Leto and Paul have about desert power, was very Dune. It also helped make clear that the Atreides aren't "good guys" right off the bat.
My only real complaint was that the time between them landing and the Harkonnens invading was too short and didn't cover all the intrigue/politics occurring on Arrakis before the invasion.
In the movie it came off as: Okay Leto you get Arrakis now LOL JK WE'RE INVADING IMMEDIATELY. But that being said, I understand there are length constraints and I think the movie was already about 2.5 hours so I can forgive them
I read the books back in the 90s, but I don't remember it being obvious the Atreides were bad. They seemed pretty egalitarian, other than the main family. But I may have missed it. The Harkonnens are just horrific though. Am I misremembering? Asking anyone who has read them more recently.
I'm fine with how they did it in the movie I just never really jive with shows where everyone sucks.
The Atreides are still agents of a foreign empire occupying and exploiting Arrakis. They impose the Imperial hierarchy and law on the unwilling populace, by force, if necessary, and they extract the resources. They do try to treat their subjects well within the boundaries permitted by the system, unlike Harkonnens; it's just that the system itself is inherently oppressive, so the best they can do is being "good feudals". Paul even spells it out at one point:
“You sense that Arrakis could be a paradise,” Kynes said. “Yet, as you see, the Imperium sends here only its trained hatchetmen, its seekers after the spice!”
Paul held up his thumb with its ducal signet. “Do you see this ring?”
“Yes.”
“Do you know its significance?”
Jessica turned sharply to stare at her son.
“Your father lies dead in the ruins of Arrakeen,” Kynes said. “You are technically the Duke.”
“I’m a soldier of the Imperium,” Paul said, “technically a hatchetman.”
Kynes’ face darkened. “Even with the Emperor’s Sardaukar standing over your father’s body?”
“The Sardaukar are one thing, the legal source of my authority is another,” Paul said.
Dune was a complete waste of time. I have no idea if it's adapted from a book or something, but I just couldn't understand the story, just that the spice must flow.
Had my daughter watch Aliens for Halloween because she was suggesting shit scary movies like The Witch. She's 22. Gosh, is that movie excellent. The best part was the three false endings, where she'd relax and I was like: excellent. Especially the spaceship fight. You thought this was over? No, the most tense parts of the movie are yet to come.
And it's not just "badass" - it has a solid emotional core that drives the character.
The director's cut makes it even more clear - Ripley lost her baby while she was in cryosleep, which gives her character's desperation to save Newt even more urgency.
_ALL_ of his 80s and 90s movies hold up. He was a master storyteller that understood the tools, the craft, how to get the best out of the people working for him, and most importantly the story and characters.
For whatever reason, he threw out story and pushing people to their limits in favor of technology and spectacle. It makes me sad to think that the last movies we’ll see from Jim Cameron are likely all in the Avatar universe, written by committee and acted in front of green screens.
While I share that sentiment, I also appreciate that his current movies are at least his singular vision. Not a lot of original IP currently where the director gets control over the end product.
Obviously it’s all a matter of taste but there are people (myself included) that think practical effects complemented with CGI is still the right way to do action.
Top Gun: Maverick would be a data point that a lot of people agree. No matter how good the CGI, actors on a sound stage in front of a green screen just behave differently than actors or stunt people actually doing action stuff.
Well, to be fair, if you filmed yourself in a car, with a windshield-attached camera, driving over rough terrain, and then stabilized the footage, it would also look like you and other passengers lurching around - though perhaps in a more synchronized fashion.
In that footage you'd at least be moving in similar directions; in this shot they're all over the place, and it's just way more obvious when there's no camera shake.
Your experience is subjective and unique to you, and so is everyone else's. That's why there's no point in using some objective metric to rate entertainment anyway.
I don't generally like modern movies or TV either, and I feel somewhat entitled to that opinion.
A lot of the jokes making T-800 talk like a 90s skater punk are pretty cringey today. I still have fond memories but my kids thought it was pretty dumb.
I agree. Now that I'm older, I've seen kids 30 years younger than me try to get their parents to use modern popular slang, and it feels super cringey.
James Cameron would have been 36 when he was making Terminator 2, and he's a polymath with an eye for detail, so I'm sure he deliberately went for that layered meaning.
>An interesting level of this scene is that it implies neither terminator is able to identify a fabricated voice even though they have a full understanding of each other's design. The T-1000 cannot tell it is talking to a T-800 until it realized it has been tricked and the T-800 cannot tell it is talking to a T-1000 until it tricks the other machine.
What made it even more amazing was that the T-800 ran on a 6502 processor.
Might be worth warning people there's an ultra-violent cut included in the clip. Anyone not interested in seeing the violence, but wanting to watch the relevant dialogue, can view it here:
There are no legitimate uses for this technology. Its only purpose is to scam and deceive. It should be regulated in the same way that nukes and guns are regulated. Contrary to what a lot of the HN crowd thinks, regulation of technology is certainly nothing new. Up until now personal computing has been largely free of regulation, but that doesn’t mean we couldn’t start.
Once perfected, I can imagine there are many legitimate use cases for it. Like an author using their voice to narrate their audiobook without having to spend time in a recording studio, or Hollywood using it instead of dubbing sessions for re-recording muffed lines. It'd also be interesting if this could be used for foreign-language dubbing - imagine if it could use the voice profile of an actor to convert subtitle text files into foreign-language audio tracks in the same tone as the actor.
Is the inference so heavy for this that it cannot run as part of the game? I want to play planescape torment with the text read to me, I don't know who I'll cast though. I suspect I actually like text dialogue in games better.
I recently updated my telephone contact with a large brokerage service.
Part of this was to voiceprint me during the conversation.
The agent assured me that this was more secure than asking me the customary identification questions, and that their system would just use the voice identification in the future.
I don't lol. OP is a bundle of sticks. Every time there's a tiny bit of tech progress with this stuff they just go "MAH REGULASHIONS PLEASE PAPA GOVERNMENT SAVE US!"
I disagree that this has no legitimate uses. This could be a phenomenal prosthesis for people who have lost their ability to speak. For example, many ALS patients.
> There are no legitimate uses for this technology.
What about this: You have a voice actor whose voice is part of a company's brand (maybe for an animated mascot). You can now also use that voice for dynamic text, for example audio books or for a voice assistant.
Flip it around and it works too. Anyone who wants to could potentially set all of the text-to-speech systems they use (their phone/house's voice interaction, listening to audiobooks, warning messages from their car/plane, etc.) to whichever voice profile they like best. Basically what TTS has been trying to do since the 70s or 80s, but with an infinite number of natural-sounding voices.
Maybe pick a couple of different ones for context-aware messages. Something familiar/reassuring for most messages, something less comfortable for "terrain!"-type alerts. If the model supports it, it could even morph between the two as the criticality of the message escalates.
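The morphing idea could be as simple as interpolating between two voice profiles by criticality. A rough sketch, where "calm" and "urgent" are hypothetical voice-embedding vectors such a model might expose (not any real TTS API):

```python
# Hypothetical sketch: blend between two voice profiles based on how
# critical a message is. The "profiles" here are just stand-in vectors;
# a real system would use whatever embedding the TTS model exposes.

def pick_voice(criticality, calm, urgent):
    """Linearly interpolate between two voice profiles; criticality in [0, 1]."""
    t = max(0.0, min(1.0, criticality))
    return [(1 - t) * c + t * u for c, u in zip(calm, urgent)]

# A routine reminder stays near the calm profile; a "terrain!"-type
# alert would push t toward 1 and land near the urgent one.
voice = pick_voice(0.1, calm=[0.0, 1.0], urgent=[1.0, 0.0])
```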
How long would it take, do you suppose, for ~~Disney~~ $ProductionCompany to simply get rid of the voice actor altogether. Now there's another person without work.
I don't want to live in a world where people are doing jobs that provide no value but only exist because "everyone needs a job". I would literally prefer we give them the money and they do no work than forcing people to do a useless job just for the sake of them working.
Do people really want their jobs to be the equivalent Sisyphus pushing a boulder up a hill aimlessly everyday? There's a reason why this is seen as a divine punishment in the myth.
I personally don't want someone to clone my voice without my permission and then use it to make money. I imagine many people feel the same way.
I agree that I would prefer to move to a society where we don't have to work, but do you honestly think we're moving in that direction? We'll get the "there's no work" part, but not the "and here's some money" part.
I'm not advocating for a world where we don't have to work, more for a world where we only work on stuff that produces value.
I wouldn't worry about that work not existing. At the very least, with the population declining we'll need people to take care of the elderly and I don't think we're anywhere close to automating health care providers. Humans always seem to find more work to do.
Or the voice actor could license their profile to {{ProductionCompany}}, and not even have to go to the recording studio anymore unless it's a project they're personally interested in or where the production team wants to pay extra for a custom performance.
A high school friend of mine did that and became the voice of the "pronounce this name" feature on Facebook.
I joke with my wife that I'll always be around to bug her, even after I pass. She'll carry my brain around in a jar ala Futurama and talk to me that way.
But, obviously, the technology is getting to the point where a decade or so from now she'll be able to have a GPT-like chat with me with my own voice. The first company to offer that to the loved ones of a deceased person will make a fortune, not for any mode of deception but just to soothe the hurt.
As a father, I like reading to my kids and I know one day they won't ask or I won't be able to.
I've wanted to record readings of some of their favorite books to pass on to them, but if I don't get the chance this seems like a way to get some analog of the experience.
oh please. You never wanted to listen to an article/book in the voice of a narrator you liked? I do. Maybe you should read a bit more.
Just because you lack in thinking of the ways it can improve our lives doesn't mean everyone else doesn't. That's a you problem. Are you so deprived of free thought, and so insecure of your capabilities, that the first thing you do when seeing new technology is turn to the government to curb it? if you don't like it, no one should use it?
>regulation
ah there it is. the 'answer' to every technological progress you have is "regulashions!"
>guns are regulated
thankfully they are not in some places. So it should be regulated the same as they are - Not at all.
What if I did 3 second long impersonations of characters for a video game. I then used this tech to produce more lines of dialogue for each character in said game.
The difference is that nukes and - in most countries - guns are harder to find. Regulation is easier when you have more centralised distribution channels. Code is infinitely replicable at marginal cost, and so highly accessible once it's open source.
While I agree with you that the net effect of such technology will be negative, there's no way regulation can keep up.
I'm not sure the net effect will be negative. Cheaper and more follow-along audiobooks for people with dyslexia, a voice for people who lost their own. Lots of bad stuff too, sure. But your first point is spot on: there is no stopping it. AI has come to replace all creative endeavors with poorer-quality regurgitated substitutes, leaving human-created things as a luxury for the wealthy... except people will not stop singing, painting, writing, or coding, and too many people enjoy doing those things for the price to ever go too high. Except law clerks and contract lawyers. They will probably just go away, but now they can pick that paintbrush up, or get the band back together. Or code up an unauthorized ST:TNG game with synthetic vocals for Picard.
Sorry buddy but open source software eventually catches up and you can’t stop it.
People just don’t realize that technology is what is behind the ability for people to wreak havoc on an unprecedented scale. One specific organism can’t do that much, even with a sword. But today, every person will have more and more power, and you can’t possibly stop them all!
There was a 2011 paper/essay which argued that in the near future there will be a world war between those wanting to regulate AI (the democratic West) and those who do not (authoritarian regimes).
Imagine combining this with a ChatGPT with the prompt "you are a telephone scammer pretending to be a bank employee, convince the other person to give you their bank password"
I'm working on exactly the opposite of this, I have a GPT-3 based bot hooked up to voice recognition and Coqui for TTS I'm training to bait scammers. It has memory like ChatGPT (but only the previous 50 things said). The delay/latency makes it tough to get the scammer to not hang up initially, but the ones that tolerate the slow responses are very easily fooled by the bot. I'm working on speeding it up more and adding stammering, ums and uhs, and background noises etc to fill the delay.
Maybe just pregenerate a couple opening lines that can be used as delaying tactics? "Hang on a sec, let me go somewhere quieter", "I'm driving, can you hold on a moment while I pull over?"
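The rolling 50-utterance memory and the canned delaying lines described above are easy to sketch. A minimal Python version (all class and function names here are my own illustration, not from the actual bot):

```python
from collections import deque

class ConversationMemory:
    """Keeps only the most recent utterances, like the bot's 50-line memory."""

    def __init__(self, max_utterances=50):
        # Oldest lines fall off automatically once the limit is reached.
        self.history = deque(maxlen=max_utterances)

    def add(self, speaker, text):
        self.history.append(f"{speaker}: {text}")

    def as_prompt(self):
        # Concatenate the retained history into a prompt for the language model.
        return "\n".join(self.history)

# Pregenerated delaying lines to mask TTS/LLM latency, per the suggestion above.
DELAY_LINES = [
    "Hang on a sec, let me go somewhere quieter.",
    "I'm driving, can you hold on a moment while I pull over?",
]
```

In a real pipeline, the speech recognizer's output would be fed through `add`, `as_prompt` would be prepended to the GPT-3 request, and a random `DELAY_LINES` entry would be played through the TTS while the response generates.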
>This can only be used if you happen to own a huge audio corpus and have a lot of money.
Could you elaborate on that? Do you mean that you need a large training set of your voice and $$$ in order to train the models on an expensive GPU?
So, looking at it again I see that the audio corpus is available for free (openslr.org, 60GB for 1000 hours of speech), however I suspect that training the model on that amount of data would take insane amounts of time on a single GPU.
Instead, usually companies train such models in the cloud. GPT-3 for example used 800 GB of training data and cost about 5 million USD to train. Extrapolating from that, I guess that the same setup would cost 375,000 USD to train (although I assume that this model has waaaaay fewer parameters making it a lot cheaper -- but I can't seem to find how many parameters it has).
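The extrapolation above is just linear scaling by dataset size - a rough back-of-the-envelope assumption, since training cost doesn't really scale linearly with data, and the figures themselves are the ones claimed in this thread:

```python
gpt3_data_gb = 800         # training data size cited above
gpt3_cost_usd = 5_000_000  # rough training-cost estimate cited above

vall_e_data_gb = 60        # the openslr.org speech corpus (1000 hours)

# Naive assumption: cost scales linearly with dataset size.
estimated_cost = gpt3_cost_usd * vall_e_data_gb / gpt3_data_gb
print(estimated_cost)  # 375000.0
```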
If someone else already spent that money to train the model, then you could just take the "training weights", feed them into the model, and it would be as if you had already trained it -- at which point you'd only need to provide your own voice and retrain it on your own GPU for a short time.
I'm by no means an ML expert though, so I could be totally wrong on this.
James Betker, in the tortoise-tts repo (which is similar), says he spent $15k on his home rig. I can't find right now how long the tortoise model took to train, but I feel like I read him say weeks/months somewhere. Obvs all kinds of variation depending on coding efficiency and dataset size, but it's another datapoint.
https://nonint.com/2022/05/30/my-deep-learning-rig/
https://github.com/neonbjb/tortoise-tts
Me to my partner: "Didn't you say you were going to do the dishes?"
Partner: "I don't remember saying that"
Me: "Here, I have a recording of you saying it..."
This claim feels truthy, but is it true? Most of the big models were being trained on specialised hardware like tensor chips right?
And the biggest drop isn't because crypto is worth less; it's because Ethereum hasn't used proof of work since November (Bitcoin hasn't used GPUs for years). The explosion in novel AI models we saw last year predates this change.
Most big AI is trained on Nvidia GPUs but usually not the standard consumer ones found in the GeForce line-up. Instead it's usually their data centre GPUs like the current A100 or soon to be H100 that's just hitting the market.
Google does have its TPUs (Tensor Processing Units), but they aren't cost-efficient budget-wise, so unless you have some kind of deal with Google or compute credits, it doesn't make sense. They do have pods upon pods of TPU clusters, though, so the main selling point of TPU training is that you can get your training done really fast just by scaling your workload to more TPUs.
So if you needed a big model like GPT-3 trained in a single day, you could spend an ungodly amount of money and get it done with Google TPUs. Otherwise, if you can wait weeks or months, you can go with the standard Nvidia data centre solution and it'd be cheaper in the end by a significant margin.
The paper about the transformer architecture was published in 2017 [1]. My understanding is that it took a few years, but we are now seeing the results of that breakthrough.
I tried Whisper on an admittedly very challenging task (live translate Albanian to English subtitles), and it failed miserably. Went back and saw it had a 44% error rate for that language.
Yeah, the top 5 languages on that WER graph are very much worth a look; the rest are a bit error-prone.
Surprisingly, when I tested it with my mother, who has a very broken English accent/dialect, it worked fine. It actually works amazingly well, and I'm busy building some tools around it for her to test.
Even though Whisper advertises its translation capability, I honestly wouldn't call translation a first-class feature of the Whisper model itself. The only thing I'd use Whisper for is the transcription part; then I'd use ChatGPT for the translation.
In 1993, it was said that the game of Go would be the Drosophila of artificial intelligence. AlphaGo (then AlphaGo Zero) succeeded in 2016 and 2017. The glass ceiling has been broken.
We are witnessing the birth of a new world where artificial intelligence will be more and more present in everyday life. I do not see any limit.
The cynical view would be that, in the current tech environment of generalized risk aversion, rushing to demonstrate tangible use cases for AI will help justify a decade of out-of-control hype and investment, support exits, new funding rounds, etc.
Underneath all the commercial interest, some specialists must inevitably keep adding to the knowledge base and capabilities, but good luck finding any objective account of that process.
Imagine Hacker News existed when Netscape Navigator was released. (Maybe it was a Usenet discussion board or something.) Suddenly everyone would be talking about the World Wide Web.
At the moment, ChatGPT and Stable Diffusion look like the same order of magnitude of advance.
I'm curious how all these advancements in AI will impact KYC and identity authentication. It's already easy to scrape OSINT to pull the answers needed for most people's knowledge based authentication sequences. Will we hit the point where fake passports, IDs, and biometrics (including voice prints here) can be replicated undetectably? If so, what will become the standard for identity authentication?
I'm assuming it'll come down to two things: more reliance on in-person checks, and eventually a push toward digital signatures - though the latter is complicated in the U.S. by politics. It wouldn't be insurmountable to deploy national ID cards like Estonia's from a technical perspective, but we have a non-trivial number of people who object to that on religious grounds, so I'd expect some kind of private-sector multiple-standards mess for years.
Can someone please transcribe what "Speaker prompt" in Example #1 is really saying? I can't get "When I love making babies" out of my head!
I reduced speed to 0.5X and understood the final, "maybe suspended but".
Thanks for confirming -- I didn't trust my own ears for such a strange fragment of a sentence!
It turns out it comes from a book called "The Foolish Dictionary: An exhausting work of reference to un-certain English words, their origin, meaning, legitimate and illegitimate use, confused by a few pictures" by Wurdz and Goldsmith. (Talk about nominative determinism -- a dictionary by Wurdz?)
Full entry:
HAMMOCK From the Lat. hamus, hook, and Grk. makar, happy. Happiness on hooks. Also, a popular contrivance whereby love-making may be suspended but not stopped during the picnic season.
Exciting to see ML develop in ways that will enhance accessibility for people. I imagine being able to train a screen reader with your own voice (or voice of your choosing) would be a huge plus for vision impaired folks.
It would be awesome to pair this with near-real-time language translation. From the demo it seems this is still very language-specific, but I assume that will happen sometime.
HSBC in the UK used to advertise telephone banking with security based just on voice recognition. I wonder whether systems like that could be fooled by this?
Biometric identification via voice is mostly hokum.
My education is in biometrics and our undergraduate seminar group had a student and his supervisor with essentially identical voices.
But that's just anecdata - the human voice doesn't carry a lot of unique information and on top of that it changes over time.
In biometric identification the iris is king. 3D face scans are a distant second, fingerprints on the border of usefulness and the rest like gait, voice, keystrokes etc. get rediscovered every decade or so just to, again, not yield results.
My employer had a project to look at phone unlocking with voice authentication. While we had a number of interesting techniques for anti-faking, the problem was the base error rate was about 1/10,000 even without adversarial considerations.
> Scammers get hold of someone's voice which is synthesised in ~real-time whilst they're on the phone with that person's mother, asking them to transfer them some emergency funds to a new bank account.
We won't be able to trust digital sound, digital image, digital text, and soon digital video. We need proof of authenticity / proof of authorship / proof of humanity!
This resonated with me when I read it. Seems clear where AI is going, so it seems clear that we need some sort of "authentication" that can reliably differentiate a real human from an AI one.
I understand these examples are zero-shot with short samples--which is impressive, but would the output quality improve significantly with longer input samples?
For example, I noticed it couldn't quite pick out the accent on some samples because they were so short. But if the model had more example words to hear, then I'd think it would accurately capture the accent.
It is very possible, and in some areas it is already being used. For example, there are already video cameras which sign 4K-resolution data in real time.
So you would need a microphone with a TPM chip and you are good to go - plus a player which can verify the data.
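The sign-then-verify flow of such a microphone can be sketched in a few lines. A real device would hold an asymmetric key sealed inside the TPM and sign with that; this sketch uses a symmetric HMAC from the standard library purely to show the flow, so the key handling here is illustrative, not how a TPM actually works:

```python
import hmac
import hashlib

# Hypothetical: in reality this key would never leave the TPM.
DEVICE_KEY = b"secret-key-sealed-inside-the-tpm"

def sign_audio(audio_bytes: bytes) -> bytes:
    """What the microphone would do: sign each captured audio chunk."""
    return hmac.new(DEVICE_KEY, audio_bytes, hashlib.sha256).digest()

def verify_audio(audio_bytes: bytes, signature: bytes) -> bool:
    """What the player would do: check the chunk wasn't altered or synthesized."""
    expected = hmac.new(DEVICE_KEY, audio_bytes, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)

chunk = b"\x00\x01\x02"  # stand-in for raw PCM samples
sig = sign_audio(chunk)
```

Any edit to the audio after capture (including splicing in synthesized speech) would fail verification, since the attacker can't produce a valid signature without the device key.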
From another angle: what exactly is the source of our distinct voices to begin with? Are everybody's vocal cords distinct in some way, like a fingerprint, or is it mostly governed by our baseline neural control, considering how people can frequently impersonate others (à la comedians) to some degree of accuracy?
Considering a given person's voice would sound different if they were brought up speaking a different language, yet all the speakers of that language have distinct voices, I'd say it's a bit of column A and a bit of column B.
This will enhance accessibility for a vast number of people with disabilities. Also, imagine hearing your books/articles in the voice of your favorite narrator.
A normal person would have to work really hard to do 1:1 real-time voice obfuscation, given the processing it'd take.
----
Why you lot are always doom and gloom and small-minded is beyond me.
Because you're not thinking 10 years out. At a certain point you will be able to copy any voice and any video of a person in real time. It will be as simple as downloading an app. When AI has scanned your whole life's details and they can be found by anyone, there will be no privacy left. Nothing will be trusted anymore.
I'm really skeptical about TTS. There's lots of theoretical, academic, research-lab stuff that is supposedly quantum leaps forward in TTS. However, all the commercial offerings sound like TTS and never match the breathless hype. Try them out and you can tell within seconds it's a robot talking.
The SBB (Swiss Rail) is now using a quite sophisticated TTS system [1] that sounds very clean and has a Swiss German accent in the High German it speaks (Zürich-accent based, which is what most of the complaints are about). The old system was built from over 10k recordings, which was an enormous task.
The system now is able to announce pretty much anything including reasons for delay etc. and it does not sound like a bunch of recordings attached to each other.
I am hoping it will eventually be expanded to switch accents depending on the region, just as it already switches to French or Italian depending on where you are.
> Zürich accent based which is what most of the complaints are about
I've given up on this a long time ago. Ads are almost always in a Zurich accent, except if the ad is for extremely regional stuff like secret cheese recipes or tourist ads for Graubünden.
Zurich is the tech hub of Switzerland and the accent seems to be the default. Then again, who in their right mind would want an AI to talk with an accent from Thurgau (/s)
Except that... while the German-speaking one sounds relatively good, I am definitely unhappy with the new system in French. It sounds more unnatural (and in certain cases harder to understand) than the old system. Maybe it's just a lack of tuning of the French version versus the German one, but still...
Except when you're talking over a phone call that sounds like two tin cans connected with a string even on a good day.
I'm regularly astonished at how bad international calls in particular have become, and you're often subjected to these even domestically, since so many call centers are in the Philippines or India these days. And this despite bandwidth being cheaper than ever.
Unfortunately, all the bandwidth in the world can't compensate for latency, and the loss of quality in ADC, DAC and compression steps as currently implemented.
I can't think of a single VOIP system that sounded good. Of the ones I use regularly, like Messenger, Teams and Skype, they rarely if ever sound better than just making a regular phone call with the same device.
The generated voice has the same creepiness fake AI voices have in movies. It also has the same monotone quality you hear in some dystopian films (for some reason it makes me think of Blade Runner, though I don't remember a narrator voice in it).
I remember Lyrebird did this back in 2017. They used to have a free-to-use interface on their site to speak a phrase then listen to a text-to-speech output of your voice. Assuming Microsoft improved it somehow?
I know of a bunch of audiobooks where I'd rather listen to the author than the current narrator, and others where I'd rather hear the person the book is about instead of the author.
There's plenty to be afraid of, but legit applications as well, for example automated announcements and voice menus where you can't pre-record every possible combination of things to be said.
Accents can be minimized through dedicated training.
After being ridiculed for my horrendous English pronunciation, I watched hours of English movies and repeated every single sentence I heard until it sounded right.
So if you get a call from a number you don't know, maybe you should NOT talk and give them your voice, which they can record and use to synthesize new speech to scam your loved ones.
It’s not a bad idea, but you can already do this kind of check with shared knowledge. “Where did we go for vacation when I was a kid?”, “where did we eat for dinner last time we saw each other” Etc.
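The shared-knowledge check is essentially a challenge-response protocol. A toy sketch (the questions and answers here are made up for illustration):

```python
import random

# Hypothetical shared memories only the real family member would know.
SHARED_SECRETS = {
    "Where did we go for vacation when I was a kid?": "lake tahoe",
    "Where did we eat dinner last time we saw each other?": "that thai place",
}

def challenge() -> str:
    """Pick a random question the caller must answer before you trust them."""
    return random.choice(list(SHARED_SECRETS))

def verify(question: str, answer: str) -> bool:
    # Case-insensitive match against the expected answer.
    return SHARED_SECRETS.get(question, "").lower() == answer.strip().lower()
```

A voice clone trained on public audio can't answer a randomly chosen question drawn from private shared history, which is the whole point of the check.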
VALL-E: Neural codec language models are zero-shot text to speech synthesizers - https://news.ycombinator.com/item?id=34270311 - Jan 2023 (136 comments)