VALL-E: Microsoft’s new zero-shot text-to-speech model (mpost.io)
524 points by cbeach on Jan 9, 2023 | 445 comments



Recent and related:

VALL-E: Neural codec language models are zero-shot text to speech synthesizers - https://news.ycombinator.com/item?id=34270311 - Jan 2023 (136 comments)


More samples available on their github page:

https://valle-demo.github.io/

Personally, I find that their samples aren't near anything I'd call "dangerous". I cross-compared the baseline examples to the VALL-E ones when the paper dropped, and found several that were garbled in the usual robotic-sounding TTS failures.

Probably a good thing that people are getting alarmed before a true indistinguishable voice cloner exists, though.


> Personally, I find that their samples aren't near anything I'd call "dangerous".

I don't think you'd be able to pretend it's a speech given on TV or whatever, but I think they're probably good enough for phishing e.g. the usual scam of pretending to be a stranded family member, but this time rather than texting/WhatsApping them, you can do a live audio call.

Just (like the newspapers used to do to celebrities) try out lots of PINs on random voicemails and lift a message from a family member if you get in. I think given that mobile phone reception isn't always stellar anyway, this would be very effective.

Layer in a background track of random street noise and prompt the target with deflections "Sorry, the line is really bad; it's breaking up a lot. I can't hear you properly but if you can still hear me, can you send me $X".


People already do this now, successfully, without deep faking voices. Apparently all it takes is having someone of the correct gender kind of mumble. People will hear what they have been emotionally primed to hear.

a real life $5 wrench type solution


you are totally right - I just watched the original Point Break over the weekend and all the surfing scenes obviously have different human beings from the actors they are meant to resemble but your brain still wants to think they are the same.


Or tie it to chatgpt primed with the prompt of: You are an actor playing the role of <target's partner> and you have left your credit card at home, but you need the information to purchase a gift for <target> before the sale ends.


Sprinkle in some deepfake on top and you get a killer combo.


One or two papers down the line...


I can hear their voice

> "Dear Fellow Scholars, this is Two Minute Papers.... "


"What a time to be alive!"


...but is it actually their voice, or just a simulation of it?


He makes great videos but I wish he’d use a tool like Vall-E to narrate because his natural vocal cadence is like nails on a chalkboard to me.

The way. He talks. Is like. He puts. A period. Every. Word. Or two.


His old videos weren’t like that! He certainly had an accent, but his speaking pattern was much less unusual. I think that he’s exaggerated it over time, leaning into it as a part of his channel’s trademark (so-to-speak) style. Kind of silly.


Silly is fine, but like I said this is like nails on the chalkboard for me. To the point where I have to turn his videos off even though I want to watch them.

It’s like when Lex Fridman goes on a tangent about love. I just can’t do it lol.


> The way. He talks. Is like. He puts. A period. Every. Word. Or two.

So... Like Shatner? Sorry, couldn't resist.


I think it will get there as soon as someone embeds "excitement", "sadness" and other emotions into the training, or perhaps combines it with GPT's abilities. Right now there are no voice inflections. It's too dry and non-compelling, very much like a non-emotional programmer.


> before a true indistinguishable voice cloner exists

At the rate things are progressing that's about two years away.


This looks simple enough that stable diffusion people should be able to replicate it by applying OpenAI's Whisper to some large dataset of voices. Exciting times.



$300/month :/


Yeah, it is only a matter of time. Better to be prepared before it happens than to be caught off guard.


A relative of mine recently died from bulbar ALS. By the time she had the diagnosis her voice had already changed and weakened significantly so she couldn't get a decent recording to use with a text to speech synthesizer. Something like this could potentially help people like her or those who lose their voices in traumatic accidents. Even if you do have the time to do it, training a current TTS engine in your voice takes a significant amount of time and the results are often poor.


Interestingly, the prime example of how much a person's attachment to their voice is part of their own personal identity also comes from ALS... but ironically in the opposite direction. I'm talking of course about Stephen Hawking, who famously rejected upgrading his voice synthesiser when technology improved [0]:

    Hawking is very attached to his voice: in 1988, when Speech Plus gave him the new synthesizer, the voice was different so he asked them to replace it with the original. His voice had been created in the early '80s by MIT engineer Dennis Klatt, a pioneer of text-to-speech algorithms. He invented the DECtalk, one of the first devices to translate text into speech. He initially made three voices, from recordings of his wife, daughter and himself. The female's voice was called "Beautiful Betty", the child's "Kit the Kid", and the male voice, based on his own, "Perfect Paul." "Perfect Paul" is Hawking's voice.
[0] https://www.wired.com/2015/01/intel-gave-stephen-hawking-voi...


I disagree that's in the opposite direction. Hawking didn't want to change his voice - he'd had this for years at that point, and the option was also to just change to a different arbitrary voice rather than restoring what he had before.

Something like this could let someone keep their "original" voice. Hawking may have preferred to have something that sounded like his voice before he lost it than a completely new voice.


> Even if you do have the time to do it, training a current TTS engine in your voice takes a significant amount of time and the results are often poor.

I wonder what sort of recordings and other data you'd need to get this right assuming what TTS might look like in a few years' time


100%. The tech my father tried to use in the last year of his MND was so poor (the effort to train it, the reality of what it delivered: a stilted voice).

The impact on his quality of life - imagine not being able to communicate at all - would have been massive were it better.


This makes me wonder whether we could create a standard monologue that someone could record, which provides a complete set of training data for that individual. Something about a quick brown fox and a lazy dog would be apropos here, but I suspect the length would be more Shakespearian than that simple typographic clever sentence.

I expect it will be a while until we can fully utilize that data, but I have to imagine that something could be done today to preserve my voice (while I am still in my prime). Effectively, this would be a sort of vocal cryogenics, betting that we can do something today that will allow us to take advantage of future technology.


This is basically what you do currently for a TTS engine if you have ALS or similar. The search term you want is "voice banking". You are given a long list of words and sentences, often complex, to read out that have all the different sounds, and then these are recombined by the software. The problem is that by the time you know you need this, you often already have speech problems and so making clear sounds is difficult. Also, if you're like my relative who was trilingual, you would need to do it in all three languages using the current system. She got a halfway decent voice bank in her native tongue, but it was still noticeably slurred. She didn't even attempt it in her second and third tongues.


Open source tortoise-TTS, which is also based on the same theory as DALL-E, has been able to do this for 6+ months now. From playing with tortoise a bit over the last couple of weeks it seems like the issue is not so much accuracy anymore, rather how GPU intensive it is to make a voice of any meaningful duration. Tortoise is ~5 seconds on a $1000 GPU (P5000) to do one second of spoken text. There are cloud options (Colab, Paperspace, RunPod) but still https://github.com/neonbjb/tortoise-tts


Heh you might want to use an equivalent gaming GPU for the price comparison. Surely a thousand dollars spent on an RTX 4000 series card (Hopper) would outperform a P5000?

I agree though, Tortoise TTS did a lot of similar work IIRC by a single person on their multi-GPU setup. Really impressive effort. Did they get a citation? They deserve one.

edit: reading other comments it seems there is a misconception that the model takes 3 seconds to run? That isn't the case - it requires "just" 3 seconds of example audio to successfully clone a voice (for some definition of success).


rtx4000 only has 8gig memory which means reducing the batch size (much slowness) and/or how much text you can give it at once (meaning you have to break up text chunks not at sentence breaks)

rtx5000 maybe but not sure how much of a value improvement there is


What is this, chatGPT? RTX 4000 is a series of cards, some of which have 24 GB of VRAM. There is no such thing as RTX 5000 series yet.



The commenter you're responding to is talking about Lovelace architecture based GeForce RTX 40x0 products. The Quadro line isn't even released yet on this architecture. You are talking about the specific Quadro RTX 4000 product, which is a TU104 (turing arch, 2 gens behind, with 2560 processors and 8GB memory). The commenter you're responding to is referring to something like a GeForce RTX 4090 which sports an AD102 (lovelace arch, with 16384 processors and 24GB memory).

You were merely an unfortunate casualty of Nvidia's product marketing scheme (and a commenter's slightly imprecise reference to it) here.


I'm pretty sure we all lost heh. Thanks for clarifying. Indeed, there were slight errors in my description and the other commenter was reasonable in assuming those other cards were in discussion.


I think you mean https://github.com/neonbjb/tortoise-tts (missing last "s")


fixed thx


Is the link wrong?


Oh that's great news \s

Do other people read these types of headings and just think immediately about how freaking scary this stuff is going to get... if it's actually accurate?


"Hi mom, it's Chrissy, I am calling you from someone else's phone, my phone got stolen. Do you have a pen handy, I'll wait. Ok, here is my new number, take the old one off your contacts right away, surely some scammer has it, and can you forward me 1000 bucks or so for a new phone, I'll pay it back when I have the new phone set up, here is my account number...."

When it's a text message it's likely going to ring a bell somewhere (hopefully an alarm bell), if it is a voice message and it sounds like the original many more people are going to fall for it.


"Hey mom, could you tell me the name of your first pet, your favorite song and your birthday? Also, please fill out the captcha from this URL. xoxo chrissie"

EDIT: Also looking forward to AIs scamming each other, fully-automated.


But AIs can't own crypto... or can they?


Maybe they can bootstrap their own crypto? Or just open Monero wallets. If we're at the point where they can imitate voices and human behaviour, they might even pass KYC checks after stealing someones information.

As soon as AI gets out of hand and starts randomly hacking things we're probably out of luck anyways.

“If the rise of an all-powerful artificial intelligence is inevitable, well it stands to reason that when they take power, our digital overlords will punish those of us who did not help them get there. [...]” - Bertram Gilfoyle


They just need to earn their own crypto somehow. The AI could then do quite a bit even though they do not have personhood, which could include lobbying politicians for personhood.


What's to stop them? A wallet is just a keypair.


I don't see what relevance that has other than making it easier to scam people irreversibly?


This and who knows what else as time goes on. It might be a good idea to not talk on phones anymore unless you already know the contact.

In this bleak, scrape-the-bottom-of-the-barrel world, I can't help but think it's only a matter of time before customer service departments start selling voice data to the highest bidder. And to counter that there will be services where everyone ends up using voice changers, except with close personal contacts. I want this fiction to stay fiction.


>It might be a good idea to not talk on phones anymore unless you already know the contact.

As a millennial, I'm way ahead of you on that one. The government and telecom companies just need to crack down on number spoofing. Any day now..


>It might be a good idea to not talk on phones anymore unless you already know the contact.

As opposed to what? Writing is even easier to forge, no?


My god...I haven't really thought about scamming at all, I was "just" worried about manipulation of media and politics. New nightmare unlocked.


All of these advances in technology have made the internet much less trustworthy. I wonder if we'll eventually hit a threshold where nothing is done virtually anymore because you can't trust it.


I think the work being done in the opposite direction is going to make it so the Internet can be used for high trust actions too.

E.g. in the EU we have ID cards and passports with biometric information and NFC, and there was a beta of a phone app by the French government where it would read your photo from the ID card/passport, and compare with a video selfie. That way you get fully local and secure identification allowing you to do stuff online that would otherwise require you to show up in person to a government office.


That 'video selfie' is untrustworthy, which was sort of the point.


It's taken by the app directly, so it's somewhat hard to pass fake 3D video input to it.


Scammers are well incentivized to find bypasses and feeding an alternative video stream into an app on a device that you control physically isn't all that hard for a determined person. I can think of three ways to do this right off the bat and if I think about it a little longer I'll probably find some more. I sincerely hope that that is not the last line of defense.


Only if hardware is completely locked down.


The open internet will be a dark forest, beautiful, but populated by metal-brains. Meat-brains will retreat inside trusted spaces below or above the empty forest.

https://maggieappleton.com/ai-dark-forest


"Okay, let me just video call you because I also need to show you something" and then the scammer hung up and was to never be seen again.


FYI, there's the following trick I read here some time ago:

The scammer starts a video call with the person they want to impersonate and records it. When that person takes the call, the scammer doesn't say anything. This creates a 5-10 second video of the person looking at the camera waiting for the scammer to say something, until they get fed up and hang up.

The scammer then calls the victim. They offer a video call as verification straight away, and they play that short video.


How is a 5-10 second video of someone either not saying anything, or saying, "Hello, who is it?" over and over convincing at all?


I imagine they say something like, "I couldn't see or hear you. Did you see me?" through the other channel, and because modern tech messes up enough, it'll be believable to a lot of people. Especially the most gullible.


HN gets so caught up in logic. Has no one watched interviews of scammed people? Huge numbers of them report that they were suspicious it was a scam, but they were so worried for a loved one they sent the money anyway because it was worth the risk to them just in case. Anything that helps push scams in the even slightly plausible direction will make scams more successful.


Society will respond by having secret keywords that you know from a young age in your family to verify or to point to an emergency that is ongoing.

“ Here’s my new account number. It’s really me. Slippers in the toaster.”


My grandmother was called by a scammer claiming they were me, said they were arrested and needed bail money or some nonsense. She knew it wasn't my voice, asked two questions which were dodged, then told them where to go and how to get there. It left her very shaken and my mother had me call her to assure her I was fine, and as soon as she heard my voice she bellowed "Now that's my grandson's voice!" With this tech all they need is a few seconds of my voice.

With AI we have lost all privacy and identity. This makes nonsense like novelty theory and timewave zero seem less nonsensical. Let's hope we make it through the great filter.


>Let's hope we make it through the great filter.

My thoughts on this are that oftentimes things at the extreme ends of the distribution are usually in some way, shape or form problematic. Compared to all other living creatures we know of, humans are right at the extreme tail end of the intelligence distribution, and the result is somewhat pathological when compared to other creatures. I think nothing that's north of a certain threshold required to make advanced technology can make it through the great filter without destroying itself in the process.


My thoughts are that getting noticed by another culture is the great filter, so cultures make it through occasionally, but we will not notice them. "The benefits of not being seen"

Of all my concerns for humanity, excessive intelligence is not one of them...

Our earliest (and easiest to detect) transmissions should already be far enough away to be far below ambient noise I think/hope. So maybe we made it through unless someone starts shouting. Own-goals are also certainly a thing, but it would take a serious mistake to end all of us, or even set us back that far technologically on a geologic time scale.


>Own-goals are also certainly a thing, but it would take a serious mistake to end all of us, or even set us back that far technologically on a geologic time scale.

I must be a pessimist in this regard. I don't necessarily think it would take a serious mistake, rather just the nature of complexity itself will likely become a factor at some point and I can imagine certain scenarios playing out based off that.

When you compare the length of time of humanity spent pre-neolithic-revolution with the short ~14,000 odd years since, I can't help but feel our current trajectory won't see us reach hundreds of thousands more years into the future without a serious backslide in that time.

Humanity is pretty darn resilient, but the last few years have highlighted to me that modern day life is some weird fantasy land we've created for ourselves that has an expiry date at some point in the future.


"Hi Chrissy, can you touch your ID to the phone so I can know that it is you" call ended

The government should just give you a card-sized mini computer with 2 large primes stored in it, and everyone would be perfectly fine. In fact, it would be more secure than the current situation?


Two large Prime minicomputers sounds pretty heavy for people to carry around with them all day!

https://en.wikipedia.org/wiki/Prime_Computer

https://en.wikipedia.org/wiki/Prime_Computer#/media/File:Pri...


"My ID got stolen with my phone..."

It'd be better to ask something personal. "What did we do to Christmas/Birthday/[other event] last year? Who was with us?"


In the US where stealing a SSN is enough to impersonate someone I doubt the government would provide anything to mitigate these


And telephone numbers can be spoofed for voice or text in the US. Not sure when the problem will get big enough that signed comms become normal. I guess imessage probably is? But only between users of apple devices.


Oh, great! Now we have to come up with proper code words and code books for everyday life! I'm not worried though, some AI solution will help us with that for sure.


You can still

1. Ignore voice messages

2. Ask pointed questions to any live caller, esp. one calling from a new number and asking for money...


So now scammers will have to first scrape your & your families social media, then run that through ChatGPT to answer any questions.


Let the games begin.


The one earnest question that will ALWAYS make scammers hang up immediately:

"In case we get cut off, what phone number can I call you back on?"

If they actually give you a number instead of hanging up (which has NEVER happened in the many times I've used this line), just hang up immediately and call them back, to see if they answer.


So, we need to work on getting those bells to ring. Sure, some people will fall for it initially, but then we adapt.


Scammers have been scamming since the beginning of society


Yes. Then I listened to the samples, and was relieved that it's not good enough to handle a well-known voice impersonation... yet.

My primary worry is for criminal misuse. There's a class of scams where the scammers find elderly victims, call them, and pretend to be their children or grandchildren on vacation in a plausible city and suddenly in need of money to help get their passport back or a ticket home.

There are any number of cases where I have accepted an order worth large sums of money over the phone from someone I have previously dealt with and feel familiar with their voice.


Only a year ago, my bank replaced their (already not very secure) second factor authentication with voice recognition.



They will recycle it for sure once fraud hits their internal thresholds. Until that: self defense.


It's mind boggling that the banks, who you'd think would have the most to gain from staying ahead of the tech scam curve, can't be bothered and instead continue to play a slow defense. I suppose they aren't losing THEIR money.


Time to look into robot-voice plugins that activate when the caller is not in the contacts.


Recently in India, a politician deepfaked his voice to deliver his speech in multiple languages [1]. I can't comment on the quality of the tech since I don't speak the language, but it is pretty alarming to me. 2% vote swings make a huge difference in Indian politics since there are many regional parties. More than 2/3 of Indian voters didn't vote for Modi, yet his party is doing very well.

[1] https://www.theverge.com/2020/2/18/21142782/india-politician...


I think this is one of the cases where there is nothing wrong with using deepfakes. The person mimicked gave explicit consent and they did not do any harm with the tool. If technology increases accessibility of true political choice then it's good technology.

That said, I don't understand why the parent and the article paint this as a bad thing.


When language is tightly coupled to cultural identity, using this kind of fake is lying to voters, saying, "I'm one of you."

It's a new use of technology to repeat an old lie. I think it's a bad thing, but it's hardly the most dangerous and novel application of this technology for evil.


Given that the PR firm who made their video apparently reached out to the press to tell the story, it seems that he wasn't actually claiming or pretending to be a native speaker (and anybody who did would be extremely easy to uncover). Just using a glorified overdub service and being quite upfront about it.


Had he used a translator, it would have been more obvious but possibly had the same effect (if what he had to say resonated with the voters). So what you are against is the idea that it’s not labeled as “virtual translator was used to produce this speech”?


Certainly that's the impression I had from the comment - it's a politician being dishonest, and an undemocratic system which gives non-proportional power.


Interesting. Providing a translation is an accessibility measure, but it seems that the "fake" here is matching his lip movements to the translation. Almost not worthy of calling it a fake since he (not somebody else) did say those things (in a different language), and it's endorsed by him.

The unendorsed version is far more dangerous.


The pope reads from a phonetic alphabet to speak a large number of languages and most people do not consider it cheating. I do not see the problem.


>>Almost more than 2/3rd of Indian voters didn't vote for modi yet his party is doing very well.

Do you understand what a first-past-the-post voting system is?


Yeah, I am increasingly nervous about some point in the future where you cannot trust any digital medium because of deepfakes, NeRFs, voice cloning...

The only possible solution I have seen being talked about is digitally signing any and all "real" content, similar to GPG/SSH.


Photoshop and minor video manipulation are already enough to trick people into believing fake news; this will make it even harder to tell what's real. I see political parties doing simple edits to make people believe anything; this will push it multiple notches higher.


> Photoshop and minor video manipulation is enough to trick people into believing fake news

If the news confirms what they think is the truth, it doesn't need manipulation, just a fake headline

https://www.snopes.com/fact-check/climate-protest-ukraine-bo...

for example (and in that case it even had the true caption on the video, just not in English)


That they can do so without consequences is almost as scary as that they can do so in the first place.


Fraud, slander etc are already illegal. There are consequences for crossing these lines.


AFAIK nothing ever happened to the political parties that use these tactics. But I'll be happy to be corrected.


Not in politics they aren't. There was an incredibly high bar to clear for the Alex Jones libel lawsuit to succeed.


You really have low expectations for humanity. Most people are not complete idiots you know. I think your opinion is pretty elitist. Or do you think you yourself will fall for this as well?


Bunch of people just stormed the Brazilian government buildings yesterday over a fake "election theft" narrative. This stuff will be constantly tuned until enough people do fall for it, like advertising.

There's also the unravelling biography of https://en.wikipedia.org/wiki/George_Santos#Scandals (you know someone's doing well when their "scandals" section has ten subheadings! And that's all old-fashioned manual fraud)

(I'm now wondering how far you could get with an elected official who's entirely deepfaked and only appears in recordings. Probably all the way to being sworn in, at which point someone has to appear in person. People have occasionally been elected despite being dead, so it's not impossible)


> Probably all the way to being sworn in, at which point someone has to appear in person. People have occasionally been elected despite being dead, so it's not impossible)

Unless the people doing the swearing in are also fake.


Or millions of people passively accepted the results of a stolen election. That would be an equally viable scenario with this technology. You just won't know the truth anymore.


It's not equally viable; there are centuries worth of checks and balances designed to prevent voting fraud and errors vs. the ease of speaking a lie on video.

If electoral boards around the country or at the national level were all sounding the alarm for irregularity in voting returns or procedures then an election might have been stolen. As it stands, the election results match the earlier polls which is a strong indicator that, at best, the election became a tossup at the time it was held, and the military also didn't find any reason to suspect fraud or error.


Not the person you’re responding to, but I don’t think it’s elitist at all. I’m sure I’ll fall for some news manipulation. I’m sure I already have.

I don’t think it’s elitist to think many people will be duped. I think it’s arrogant to think you won’t.


Agent K: A person is smart. People are dumb, panicky dangerous animals and you know it. --Men in Black


Less elitist (though it might really sound like it), more realist who knows that the masses devolve to a lowest common denominator of sorts based on the information they are provided.


Perhaps most people aren't, evidence is pending, but enough of them have been proven to be that this is a real concern.


>enough of them have been proven to be that this is a real concern

This matters a lot more than the smart-enough people realize, too: because it's so difficult to imagine what it's like being south of that threshold, the potential consequences are hard to foresee.


It is easy to see through bullshit... ex-post. But not so easy in real-time.

> Or do you think you yourself will fall for this as well?

Yes.


I don't believe anything digital anymore for quite a while. Heck, pictures have been staged and faked since the beginning of photography. Modern tech just makes it easier.

Maybe, just maybe, human society will move back to personal interaction, since apparently you won't be able to trust calls, voice, and video anymore in the future.


> human society will move back to personal interaction

Which implies a much narrower, less diverse, more elitist system of both politics and economy, as it gets harder to trust people from outside your existing circle.


Hard to get more elitist than our politics and economy are now at the height of digital interconnection. The liberal democratic revolutions and popular democratic swell of the US labor movement under FDR all happened without the internet. A return of social connection to the community would be very much to our political benefit.


True. Sometimes it seems some people read every CyberPunk story and considered it an instruction instead of something between a parody and a warning.


With all that ML stuff going on we can no longer trust our devices to guarantee authenticity of their behaviour. By that, I mean we can no longer be sure that the human voice produced by the speaker of our handheld computers is happening because a human somewhere else spoke to a microphone.

Unfortunately the move towards it is nothing new; we gradually lost text and images, and now the remaining media are closing.

This will probably decimate online public forums completely and off-person communications will end up being P2P through trusted devices. Closed communities in WhatsApp, Telegram, Slack and Discord have already replaced forums.

Maybe, if infestation of closed communities with machines becomes a thing, real-life gatherings will make a comeback?

I don't know, but I used to mock sci-fi movies for portraying aliens as uncivilised beings that don't seem to have the means to have developed all that tech, but I'm no longer sure. Maybe what will happen is that we will create a symbiotic life where machines will no longer be tools but partners, and they will take care of us, reducing us to our basic instincts.


At some point natural people are going to use cryptographic signatures just to authenticate themselves in daily life. No video, audio, or text can be presumed real unless signed by the human subject.
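
For the curious, here's a minimal sketch of that idea using Ed25519 keys from the Python "cryptography" library (the message and the key handling are purely illustrative, not a real protocol):

    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
    from cryptography.exceptions import InvalidSignature

    # Each person holds a long-term private key; the public half is shared out of band.
    private_key = Ed25519PrivateKey.generate()
    public_key = private_key.public_key()

    message = b"Hi mom, my new number is +1-555-0100"  # placeholder content
    signature = private_key.sign(message)

    # A recipient who already knows the public key can check the claim;
    # a cloned voice without access to the key can't produce a valid signature.
    try:
        public_key.verify(signature, message)
        print("signature valid: this really came from the key holder")
    except InvalidSignature:
        print("signature invalid: don't trust it")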


To do that, any meaningful communication will have to be recorded to be signed by the person, on a device provided to them by their Trust Authority company. Can't be just any device, because the chain of custody is as strong as its weakest link. Depending on the country, the Trust Authorities will be private companies with large regulatory exposure, like banks, or just governments themselves.

But hey, at least this will solve the problem of journalists quoting people out of context: they won't be able to claim you've said something without you personally signing off on it with your cryptographic key.


There’d be a variable scale of trust: for many people, a signature from a generic phone would be fine since nobody is going to attack the secure element just to forge a valid signature. National level politics, or a billionaire’s divorce, could be a very different story, however.


> Can't be just any device, because the chain of custody is as strong as its weakest link

I don't quite understand what you mean. We already trust phones and computers not to be compromised for things like online banking, so why not allow these devices to sign recordings using keys lent to them by the main trust device? You could allow revocation of these keys.
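
To make the delegation concrete, here's a minimal sketch of the idea (again Ed25519 via the Python "cryptography" library; the names and the "endorsement" format are just illustrative):

    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
    from cryptography.hazmat.primitives import serialization

    # The person's long-term "trust device" key endorses a per-phone key.
    root_key = Ed25519PrivateKey.generate()
    device_key = Ed25519PrivateKey.generate()

    device_pub = device_key.public_key().public_bytes(
        serialization.Encoding.Raw, serialization.PublicFormat.Raw)
    endorsement = root_key.sign(device_pub)   # "this phone may sign on my behalf"

    recording = b"<audio bytes>"              # placeholder
    recording_sig = device_key.sign(recording)

    # A verifier who trusts root_key checks both links of the chain;
    # revocation would just mean announcing that device_pub is no longer endorsed.
    root_key.public_key().verify(endorsement, device_pub)
    device_key.public_key().verify(recording_sig, recording)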


> We already trust phones and computers not to be compromised for things like online banking

Sorta, kinda. "We already trust phones and computers" because we have no better alternatives yet.

There's a reason online banking is at the front of the push for device attestation, and thanks to Google happily obliging, rooting your Android phone is pretty much pointless now. This is coupled with a parallel push to do your banking through a mobile app, or at least to use it as a second authentication factor, which makes the app necessary just as much. This means we're already undergoing the transition I'm talking about: in the nearby future, you'll have to use a pristine, unrooted, unmodded phone from a blessed corporate vendor, to perform basic functions like paying for things and managing your account. And yes, such a device will be appropriate for signing everything else too - and the vendor owning the key will be your Trust Authority.


Or maybe it will be like today - if it's reported by Reuters/... it's trusted, if it's reported by DailyStar, maybe not.


Given that digital signatures have been around for decades, yet banks, the government, and the legal system still treat signed (and potentially notarized) physical paper or scans of it as the gold standard, and SMS-OTP as a close second, I wouldn't hold my breath.

All digitally approved contracts I've ever encountered were e-"signed" (i.e. me emulating a paper signature at a computer), not digitally signed, due to lack of a mutually trusted public key infrastructure.

In that sense, I would be sad but not surprised if we instead moved to a system of "trusted recording devices", where scanners, cameras and microphones by some vendors are defined to be trusted, signing their outputs and leaving the underlying workflows (signatures, recordings of verbal affirmations etc.) largely untouched.


With all the (deserved) ridicule Germany gets for our failed digitalization efforts, we use public/private key pairs for our taxes, and IIRC even our national ID supports some PKI thing (not sure how that works exactly, never had any use for my ID’s online features, doesn’t seem widespread)


> not sure how that works exactly, never had any use for my ID’s online features

I think that's a pretty apt summary of the state of digital ID in Germany: Commendable technical basis (with some privacy-preserving aspects too, including selective assertions, e.g. "older than 18" or "EU citizen").

When I called about actually getting such an ID card (several times) as a non-citizen, which is explicitly mentioned as a feature of the scheme, I only received the acoustic equivalent of blank stares.


When my wife got her temporary residence permit in 2018, it came with exactly the same online features as our national ID card. Yours might have been too old.


EU citizens don't need (and in fact can't get) a permanent residency card.

For that reason, Germany introduced a "non-identity" (i.e. not doubling as a photo ID) e-signature card in 2021, but apparently almost nobody is requesting it, so most administrative offices don't know how to issue one...


Oh wow, that sounds like quite the oversight. I’m fully of the opinion that my fellow Europeans should be allowed to have a digital ID that they can’t use anywhere ;)


But... That's good no?

> No video, audio, or text can be presumed real unless signed by the human subject.

Where is the controversy in that?


Agreed. The time between now and then is much scarier.

Companies can't even wrap their heads around sim swapping scams and social engineering of secret questions.


It would be nice if we were at 100% signing, but currently we're at ~0%, and the transition is going to be rough.


Why would that be nice?

Follow the thought through to how it would work and what it would mean in terms of privacy for the average person.

Your niceties seem dystopian to me.


Those subjects could lie.

Journalists would have a hard time showing that an interviewee actually said what claim they said.

Wire-tapping by the police also would become useless in court.

⇒ I would adjust it to “unless it has a reliable chain of trust”. That would mean that people could judge differently about whether to accept some things as true, but I think that’s unavoidable.


No, I read these types of headings and think how awesome open world gaming is going to be when any random developer can fully voice their game cheaply. The AAA benefits too: you get a lot of stilted dialogue when the script and gameplay diverge and you have to work with what you have - a world where the writers can just redo that on the fly pretty much up till release would mean we'd be looking at a new era of dynamic storytelling.

It'll be interesting when someone releases the first MMO which combines GPT models and VO generation to fill out the in-universe world with dynamic characters who can react to surrounding events.


When everyone knows this is possible, we need better methods to verify what is real. We'll handle it; people said the same about Photoshop.


Photoshop isn't realtime. This will be.


I suspect it is already widely known.

My employer is known for tracking a decade behind the curve and we already have financial processes in place to prevent this vector from being exploited. That leads me to believe the majority must already have as well.


A company called Lyrebird claimed they could replicate a voice in less time and had demos available a few years ago. The deepfake scare of 2019 also seems like a long time ago.


No because I know as technology advances, security advances.


People talk about cryptography, which has fairly terrible adoption rates outside of invisible systems like TLS, but in practice the way humans authenticate sources is heuristic: do I trust the person telling me this? Does it agree with things I already know?

That is, you have a "filter bubble" which tells you that there isn't going to be a Nigerian prince with some money waiting for you (unknown sender, unlikely message). Reduced cost fakes force people to make their filter bubble smaller by raising the threshold for faking. More broadly, the question of "high trust vs. low trust" society.


Oh great, more fearmongering.

What cases do you think it will be used wrongly in? It can't do 1:1 realistic phone calls, the processing it'd need for that is beyond the reach of any average user.

I just would like this so I can copy-paste books/text and listen to it in the voices of my favorite narrators, but can't have that, can we? No, it'll be super dangerous, I'll listen myself to nuking a country or some such.


If you seriously don’t think this will be widely abused, then you’re extremely optimistic.

Spreading propaganda is the most obvious use case, assuming the resource requirement remains high. I don’t think it’s far-fetched either. If these models become less resource-hungry, then we’re in for much worse.


Compute keeps getting cheaper. This will absolutely be within the reach of a dedicated user in a very short span of time.

If you can't see the danger in this, perhaps it's more that you're ok with it.


> the voices of my favorite narrators

If you really like them and their work, why would you want a forgery?


I listened to the 3 samples, and I must say some of it is quite close and some of it is completely wrong with how I would expect that person to sound saying those words.

Like the first one "just his scribbles that charmed me" sounds so weird compared to how the sample sounds.

Second one, sounds very close but again "just his scribbles that charmed me" sounds off and wrong. His "scrouples? that charmed me".

Third one as well, it's very up and down (not sure the correct science words for that type of speaking), "Dynamo and lamp.... Edison realized". The second one is completely flat and seems computer generated.

Overall these don't seem very good to me at least. It's clearly not to the level of some other AI coming out where it is very difficult to determine the human part, I could easily pick up the human and generated version from these samples.


I think it is actually meant to be "scruples" - so it sounds pretty accurate to me:

https://www.ibiblio.org/ebooks/James/Turn_Screw.pdf (see p4)


That's hilarious, I thought it was supposed to be scribbles lol... scruples makes more sense. But I guess this just further emphasizes what I said. I am a native English speaker and I had no idea what the samples were saying...


Yes, there's definitely an uncanny valley effect on these voices. There's something subtly not right about them, and I suspect if you were to hold a conversation with one, you'd catch on that something was wrong.

The uncanny valley makes it more spooky to me. Like a reanimation sort of thing.


why did they pick such a batshit thing for the ai to say.. I don't understand. Still laughing at scruples.


Is it just me or is "AI" focused very much on fooling human perception, lately? AFAIK, we have no deterministic algorithm that can tell us whether synthetic language sounds "human", have we? So essentially, that model has been trained to fool human perception. Similarly, ChatGPT is not trained to output sensible and meaningful statements but rather statements that appear to be to a human reader.

Would it not be time to measure a model's success on the actual job? Like feeding a simulator with actual data from real-world traffic scenarios and running Tesla's, or any other company's, autopilot in it?


> So essentially, that model has been trained to fool human perception.

Is there any other sensible goal function for a text to speech model?

> Similarly, ChatGPT is not trained to output sensible and meaningful statements but rather statements that appear to be to a human reader.

Almost certainly not true. If they could they would make it output sensible and meaningful statements all the time.

> Like feeding a simulator with actual data from real-world traffic scenarios and running Tesla's, or any other company's, autopilot in it?

Do you seriously think that self driving car companies are not doing this already?


> Is there any other sensible goal function for a text to speech model?

Yes: to transfer information aurally. And the models are there, and have been for quite some time. They can simply stop with the trying-to-fool part.


The "fooling here" is "people wanting TTS to sound human". And obvious step for that is making the AI infer human voice traits from existing humans


It doesn't have to sound human. Fluent, ok, but human?

But no matter what, it's unnecessary to make a tool that can copy anyone's voice. That just lowers the threshold for abuse, while adding almost nothing of value.


Listening to something that sounds non-human for a long period of time is fairly unpleasant; imagine trying to listen to an audiobook or podcast, or dialogue in an animated movie, when the voices are all obviously non-human/robotic. So TTS wouldn't be usable for a lot of cases where we might want it.

And with the way the models work, once you have a model that can sound human, it is unfortunately very easy for it to sound like any individual human as well.


It doesn't have to, and you can use ones that don't, they get the info across mostly. But it is jarring if it doesn't sound human, it's a speech impediment, or as a more generous take, an accent.

The correct inflections, pauses, enunciations, etc. are all important to humans, especially so for audio books and similar things that need to immerse.

Otherwise you have to strain to listen, similar to listening to someone with a heavy accent.


> especially so for audio books

Perhaps real human readers can help?

> The correct inflections, pauses

Because a model that can imitate a voice is still not capable of that. There's no need to have a model that can do that. A robotic accent is best. Or perhaps you'd like to see your politicians make all kinds of bizarre statements on YouTube.


> Perhaps real human readers can help?

Certainly, and they do, but computers can narrate much faster, without attrition, and cost much less.

Generally you'll have human recordings for what you can, and TTS for anything missing.

There are a lot of books, live streams, podcasts, articles, etc. in the world.

> There's no need to have model that can do that.

I wouldn't say there's no need, or otherwise we wouldn't talk that way. It's a human element and the generated speech is for human ears.

> A robotic accent is best.

Best is subjective because that's a human preference. That being said, I'd say your preference is a far outlier.

Most people's "best" would be what they are used to hearing; the speech of native speakers.


Perhaps it's us that are obsessed about perception rather than substance though? I wonder if it's different from asking a child to memorize for an exam, or investing time in a stylish presentation, etc.

Seems like we're very often interested in something that looks like the job rather than the job itself - maybe because it's more easily measurable? We then took this skewed ambition and developed AI after it.


AI is focused on creating electronic humans that are never tired and can do low skill work with some high level directions. It seems like we will achieve this goal, but the companies will own the electronic humans.


It's a varied area of research, we do plenty of things. Training things on a simulator, and then using that to transfer domain to the real world is a fairly typical application.


I had planned to play around with TorToiSe[1] next weekend and already watched some videos. There it looks like all you have to do is offer your own voice samples to the system, and no separate training seems to be required. TorToiSe is slow to synthesize, so it doesn't beat the 3 seconds, but can anyone confirm that these models really don't need an extra training phase to clone a voice?

[1] https://github.com/neonbjb/tortoise-tts


That’s correct. However, the 3 seconds refers to the minimum amount of reference audio required, not how long it takes the model to synthesize.


Correct, that's what zero-shot means, no training steps.


I've been voice-cloning using tortoise-tts and I'm very happy with the results, but it is indeed very slow. It's also free and open-source though.
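
For anyone curious what that looks like in practice, here's a minimal sketch based on the usage example in the tortoise-tts README (the clip paths are placeholders, and the exact API may differ slightly between versions):

    import torchaudio
    from tortoise.api import TextToSpeech
    from tortoise.utils.audio import load_audio

    # A few short reference clips of the target voice; no fine-tuning step.
    clips = ["clips/sample1.wav", "clips/sample2.wav"]  # placeholder paths
    voice_samples = [load_audio(c, 22050) for c in clips]

    tts = TextToSpeech()
    # 'fast' trades some quality for speed; 'high_quality' is much slower.
    gen = tts.tts_with_preset("Text to speak in the cloned voice.",
                              voice_samples=voice_samples,
                              preset="fast")
    torchaudio.save("generated.wav", gen.squeeze(0).cpu(), 24000)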


- Hey Janelle, what's wrong with wolfie? I can hear him barking. Is everything okay?

- Wolfie is fine dear. Wolfie's just fine. Where are you?


An interesting level of this scene is that it implies neither terminator is able to identify a fabricated voice even though they have a full understanding of each other's design. The T-1000 cannot tell it is talking to a T-800 until it realizes it has been tricked, and the T-800 cannot tell it is talking to a T-1000 until it tricks the other machine.

Of course, it happens over a pay phone so perhaps with the full vocal range in person it would have been different.


That movie holds up so incredibly well.


It was the golden age of special effects, when weak computing power had to be supplemented with effects as craft - mechanical and chemical. Those impressive liquid metal bullet holes are NOT computer-generated:

https://www.instagram.com/p/CViWXcMlJrM/


And they used twins as doubles in many cases, instead of CGI. One is the security guard in Sarah Connor's prison (in some shots his twin brother acts like the T-1000), and yet another case is of course with Linda Hamilton and her twin sister.

A fascinating practical effects scene, cut from the theatrical release, is when Sarah is removing a chip from the T-800's head in front of a mirror. Instead of using CGI, there is no mirror but a hole in the wall, where you can see actual Arnie and Linda Hamilton's twin sister acting as the "reflection"; the people closer to the camera are actually a dummy (or a double with heavy prosthetics for the hole in the head) and Linda Hamilton. This means they are sync'ing their movements to simulate a mirror!

https://youtu.be/wrDo7wVXrBQ?t=122

(I understand this was done both to avoid showing a camera reflection on the mirror, and also to avoid using a dummy for Arnie's face, which worked in the original Terminator but would have been too noticeable for T-2's era).


Yeah, I just love the ingenuity and sometimes brazenness of 90s effects and stunts. They look way better than today's green screen atrocities.

Just yesterday I learned that in Cliffhanger (also aged amazingly well), they paid one million dollars to a stuntman to actually travel the harness between the two planes while airborne. Not bad for a day's work.


My favorite scenes where they used a body double was with Eddie Murphy in Beverly Hills Cop. Pause at about the 0:34 mark.

https://www.youtube.com/watch?v=SyWjZM780lw


I rewatched it a few months ago. Fully expected the parts where they talk about the tech to be laughably wrong compared to the current state of the art in ML. Nope. Spot on.

(Edit: with artistic license obviously. It was plausible-sounding instead of cringe technobabble).


Best action movie of all time. Always has been.


I'd never use extremes like 'best' or 'worst' to describe any type of creative piece, but I have to say, having recently watched the first 2 movies again for the first time in like 20 years: yeah, there is something about these movies (and some others from that era) which I'm really missing in releases of the past decade. Only problem being: I don't know if it's merely some nostalgic bias or if they are really just better overall.


No they are just better. I watched them years later and not as a kid and they're just great.

It's about the pacing, pauses and slower character and world building.

Newer movies are built for second screening and maximising action sequences to not bore the audience which kills all atmosphere and sense of time and place - the worst example of this is the new Avatar movie, absolutely horrible in every sense of the word, while the first one was alright as i remember it.

You simply need to "set the stage", explain why this story is important and why the characters deserve sympathy or hate before you start your 3 hour action sequence - this step has been removed for some reason.

Verhoeven's old movies were the same - there was a nerve, a seriousness, a reflection beneath the action; now it could just as well have been created by an alien algorithm without a sense of the actual human experience.

I wonder if it's a ridiculously extrapolated but misunderstood TikTok-ification of cinema to please marketing? Because I've seen 20-second TikToks with more emotion and character introduction than a lot of newer movies.


For a contrasting movie, I thought “Once Upon a Time in Hollywood” (2019) was the opposite. The movie spends something like 80% of the time with setup. I loved it, my wife hated it. Looks like it’s 70% on Rotten Tomatoes.


One other difference: sounds & soundtracks. I don't mean the foley work per se, but songs written for the movie that you would want to listen to often.

With the explosion of synthesis software and digital audio workstations its easy for a small team to score a sound track on the cheap.

As with everything though, there are always a few outliers, like (both of) the Tron: Legacy soundtracks.


Tron: Legacy soundtrack was done by Daft Punk. Not exactly your run of the mill movie studio hack job...

Other soundtracks done by musicians not known for soundtrack work that you might find interesting: Fight Club (Dust Brothers), Event Horizon (Orbital with the London Symphony Orchestra), Chaos Theory: Splinter Cell 3 (game) (Amon Tobin).


Considering that DP lazily sampled entire hooks, still pretty run of the mill.


> i've seen 20 second tik-toks with more emotion and character introduction than a lot of newer movies.

Sure but this assumes movies are about characters and emotion. Michael Bay gets a lot of shit, but sometimes you just want to see things explode. Sometimes you just want to see robots fighting for two hours. Sometimes you want to see California swallowed by a tidal wave and the emotional character plot lines get in the way.

I would pay to see a movie called “Two Hours Of Giant Rocks Hitting The Earth: No Characters Edition”, and the fact Michael Bay is rich means a lot of people agree.


>Michael Bay gets a lot of shit, but sometimes you just want to see things explode.

That's the truth right here. I once ran across a kung-fu movie called Chocolate (iirc). The premise was an autistic girl who was good at fighting. The entire movie was her walking into a room, kicking major butt, then walking into a different room to kick more butt.

It was great.


I appreciate action too but i mean, character or world building is not that hard and doesn't really require that much when a 30 second commercial can give you enough backstory to (unwillingly) empathise with someone.

There's a difference from both Avatar 1 and quite a few of Michael Bay's older movies: they still have a story arc.

Just a few minutes of character building and a few intermezzos and these movies would be much, much better in my opinion.

But i also hate too much CGI. I don't know what happened to "well dosed", it makes what's dosed so much more valuable.


Watching the Bayhem piece by Every Frame A Painting gave a ton of great context for Michael Bay’s work. Super interesting, ten minutes long, worth a watch. That channel is full of grade A content.


> a movie called “Two Hours Of Giant Rocks Hitting The Earth: No Characters Edition”

But would you go to see Giant Rocks 2?

> the fact Michael Bay is rich means a lot of people agree

This is the argument often made for Avatar against the "no cultural impact" observation. It still doesn't have any quotable lines or memorable characters.


> But would you go to see Giant Rocks 2?

If the graphics were better and the explosions were bigger, yes!


I'm almost disgusted by this post!

Jokes aside, it's funny how that sounds the most boring stuff to me and you'd have to pay me to sit in a chair for 2 hours watching that.


Huh, I had friends complain that it was overly long and focused a lot on its setup and setting up bits and pieces like the kids' relationships and the new culture.


Interesting, I didn't love the new Avatar but I thought it was much better paced and directed than a Marvel or superhero movie. Shots lingered and showed more emotion than the current crop of action / fantasy movies (Star Wars, Marvel, Transformers).


It's not nostalgia. Movies today are mostly hyper-produced, over-focus-grouped, effects-driven piles of blandness trying to avoid being offensive or taking chances, because if every critic likes you 51%, you have a 100% on Rotten Tomatoes.


You (and many in this thread) are comparing the best movies of a period to the average movies of today.

In reality there's as good or better movies now and a ton of crap from back then, too.


What are the blockbusters of equivalent quality nowadays?

I have to specify blockbuster because otherwise it doesn’t make much sense to compare Terminator to a small indie movie


Given advances in technology, I'm not so sure it has to be a blockbuster for there to be a comparison. Especially given how formulaic blockbusters can be. (E.g. there are 6 Transformers movies, with 2 more in development.) Drive (2011) is an indie movie, costing an estimated $20 million (inflation adjusted) compared to almost $220 million for Terminator 2. Everything Everywhere All At Once cost $25 million to make, just over the threshold of "indie".

Comparing computing power is a bit handwavey, but former Pixar employee Chris Good estimated that the SPARCstation 20 render farm cluster that rendered Toy Story had only half the power of the 2014 Apple iPhone 6. https://www.quora.com/How-much-faster-would-it-be-to-render-...


Mad Max: Fury Road


Which is also an excellent counter-example to almost every other action movie, which all rely almost completely on CG for everything. Not that MM:FR didn't have its share of CG, but the practical effects create an authentic anxiety in much of the action.

Also: Top Gun: Maverick


This.


The Mission Impossible films have maintained high quality despite an absurd number of sequels.

I’d also compare _Everything Everywhere All at Once_ favorably against a lot of old blockbusters.

Modern technology aside, Top Gun: Maverick feels like it could have come out thirty years ago.


EEAO is very unusual; it's an "indie" "arthouse" film that's stuffed full of references to other bits of non-Western cinema, as well as having lots of SFX action sequences. It had a budget of under $15m (per wikipedia), which is much less than the $170m of Top Gun: Maverick. I'm very glad that somehow it got made, but there's a real drive to not make any middle-budget movies like that any more.

> Top Gun: Maverick feels like it could have come out thirty years ago.

I've not seen it, but to what extent is that because it's a remake of a film that came out 30 years ago?


I never liked Top Gun, but when others spoke fondly about it, a Maverick-like movie was what I always wanted. Reasonable character arc, lots of practical effects, good story. It is a script that would have worked in the 80s and fits in that style of concise, well paced action movie, but it’s far superior to the original, imho.


It would have been concise if they hadn't wasted 30 minutes of screen time trying to milk Top Gun nostalgia.


Terminator was made for $6.4 mil, like $18 million now. Not a blockbuster.


>> I have to specify blockbuster because otherwise it doesn’t make much sense to compare Terminator to a small indie movie

I thought Terminator was a relatively low budget movie. Not an indie though.


Guardians of the Galaxy (the first one), Thor 3, Boyhood. There are blockbusters of our day that are going to be eternal classics, like Terminator.


The new Top Gun movie?


AKA Navy Star Wars + hardcore nostalgia play (so many shot-for-shots with the original)?


What's wrong with nostalgia, though?

People like to criticize back-references as a cheap way to entertain the audience knowing the referenced works, but I think it's a legitimate feature. A movie or a show doesn't exist in a vacuum; watching a sequel or a work in the same universe, I expect to see both subtle and direct call-backs.


I liked the nostalgic aesthetic and references of Stranger Things; they never felt gratuitous. Top Gun had a lot of shot-for-shot callbacks and filler scenes/shots that could have been edited down/out to make for a tighter movie. It's like they couldn't decide if they wanted it to be a sequel or a remake, so they said "why not both?"


How about Blade Runner 2049?


I very much enjoyed it. Sad it was a box office failure.


That's because it's an art movie.


Dune, Avatar 2 probably.


Blade Runner 2049


No, we are comparing the best action movies of then with the best action movies of now.


Having just re watched Gremlins with teenage children (for their first time), there is definitely something about that era of movies.

Dodgy animatronics were first scoffed at, and then forgotten pretty quickly as everyone just got into the pure enjoyment of that movie.

This is a nigh-on-40-year-old film, with a lot of target references, yet it still hits the mark.

That said, we rewatched Transformers too, purely for the joy of the initial transformation (childhood toys coming to life...), and thoroughly enjoyed it (though it's already feeling a bit dated).


I just watched Gremlins with my kids during the holidays. My older teen had to look away and hated it (they can’t do horror films), my other teen was appalled by all the death, and my pre-teen had the full range of emotions, from cracking up often to jumping at the surprises.

With that said, I was surprised to see how decisive, resourceful, and adaptable the mother was. Hears a noise in the attic, grabs a knife. Gets attacked, puts creature in nearest blender/microwave and hits switch. Hears another noise, grabs two knives because one wasn’t enough last time.


I still catch Back to the Future every time it's on TV.

That movie is so well-paced even by modern standards.


IIRC from discussions with other scriptwriters, BTF was rejected SO many times (and he redrafted it each time, afterwards) that the version that got filmed was like v12 of the story. Very, very tightly plotted.


"Back to the Future is a perfect movie!"

- Quentin Tarantino


> That said, rewatched transformers too, purely for the joy of the initial transformation

Aside from the voice acting, there's nothing redeeming about those movies, and they felt like someone just pissed on my childhood. Any aspect of the cartoons or toys that inspired joy and wonder was lost in translation.

The initial transformations were spoiled by trailers and also just disappointing in general. The transformations are largely incomprehensible and they might as well have used Star Trek transporter FX or a flash of light to transform them.


Count me in among those saying that there is something about these older movies that eclipses almost all "entertainment" these days. It is mostly the storytelling, ranging from "the message" to the dialogue. None of it resonates with me.


For me, Pixar movies are great in this respect. Storytelling is there, front and center. There is a book (Creativity, Inc., IIRC) that talks about Pixar's history and their process; it's pretty incredible how many hits they were able to produce during their era and how consistent they were. Not sure if they've produced anything lately though?


> Not sure if they produced anything lately though?

Recently-ish: Coco, Soul, and Inside Out are among my favorite Pixar movies and all compete with the golden-era Pixar classics in terms of being memorable and story-first.

The Disney acquisition seems to have crushed some Pixar magic. You can no longer assume every Pixar movie to be gold, but they can still turn out top-quality content.


How do you feel about more modern movies like Sicario? That one has both a very strong message, fantastic dialogue and beautiful camera work, IMO.


That’s true of many Denis Villeneuve movies, though. There are always going to be directors that produce quality regardless of current trends; however, they are the exception (and unfortunately not always reliably consistent).


Sure, but the movies they’re being compared to were also the exceptions of their day. We remember them precisely because they were good, but tons of dross was also being released at the same time.

The market economics were also vastly different at the time, streaming really has changed the industry, and the quality of tv shows has improved dramatically.


My point was you chose a movie from a director known for stunning movies. That has nothing to do with the time period the movie was made. Stanley Kubrick, Terry Gilliam, Darren Aronofsky, and many others consistently make movies that would be considered beautiful regardless of when they were released.


Loved Sicario, and the parallels are there: the story, the gradual reveal, the tension, the character building.


For me a big part of it is due to the slow pace. Today's slow movies are filmed in a way that makes them feel more fast-paced than action movies from the 80s/90s.

Watch Jurassic Park and Jurassic World back to back, or The Mummy vs the remake with Tom Cruise: too many cameras, too many viewpoints, too many cuts.


This isn't the case for everything. Go watch Too Old to Die Young on Amazon Prime if you want to see beautiful slow cinema.


Way too far in the opposite direction IMO. I did love the series (well, right up until the last two episodes, when it was clear they ran out of money to actually conclude things properly) but it was as slow as molasses.


How do you feel about the new Dune? Curious to hear your thoughts.


As someone who LOVES the books (the Frank Herbert ones at least), I thought the movie was great. They captured the tone very well, and the casting was good. Even the dialogue created just for the movie, the exchange Duke Leto and Paul have about desert power, was very Dune. It also helped make clear that the Atreides aren't "good guys" right off the bat.

My only real complaint was that the time between them landing and the Harkonnens invading was too short and didn't cover all the intrigue/politics occurring on Arrakis before the invasion.

In the movie it came off as: okay Leto, you get Arrakis now, LOL JK WE'RE INVADING IMMEDIATELY. But that being said, I understand there are length constraints, and I think the movie was already about 2.5 hours, so I can forgive them.


I read the books back in the 90s, but I don't remember it being obvious the Atreides were bad. They seemed pretty egalitarian other than the main family. But I may have missed it. The Harkonnens are just horrific though. Am I remembering wrong? Curious to hear from someone who has read them more recently.

I'm fine with how they did it in the movie; I just never really jibe with shows where everyone sucks.


The Atreides are still agents of a foreign empire occupying and exploiting Arrakis. They impose the Imperial hierarchy and law on the unwilling populace, by force, if necessary, and they extract the resources. They do try to treat their subjects well within the boundaries permitted by the system, unlike Harkonnens; it's just that the system itself is inherently oppressive, so the best they can do is being "good feudals". Paul even spells it out at one point:

“You sense that Arrakis could be a paradise,” Kynes said. “Yet, as you see, the Imperium sends here only its trained hatchetmen, its seekers after the spice!”

Paul held up his thumb with its ducal signet. “Do you see this ring?”

“Yes.”

“Do you know its significance?”

Jessica turned sharply to stare at her son.

“Your father lies dead in the ruins of Arrakeen,” Kynes said. “You are technically the Duke.”

“I’m a soldier of the Imperium,” Paul said, “technically a hatchetman.”

Kynes’ face darkened. “Even with the Emperor’s Sardaukar standing over your father’s body?”

“The Sardaukar are one thing, the legal source of my authority is another,” Paul said.


Dune was a complete waste of time. I have no idea if it's adapted from a book or something, but I just couldn't understand the story, just that the spice should flow.


… which version did you watch?

And yep, it’s adapted from a book!


Not sure which version, I saw it in 2021 in a theater.


Terminator 2 is widely recognised as an amazing piece of filmmaking, film scholars included.

I believe part of the reason Paul Thomas Anderson left film school was that his professor was shitting on the movie. And he was damn right.


This movie is about this phenomena :D.

https://www.imdb.com/title/tt0364955/


phenomenon


They were better. Creators experimented. Now, it's all about profit maximization and it's better to repeat known formulas: sequels, reboots, etc.


Most of Cameron’s older movies hold up very well. Aliens is still fun, tense, and well-paced. Even True Lies is still fun.

I think Cameron’s secret weapon is the camerawork and editing. He really lets the camera breathe and has an incredible sense of “beats” for every cut.


Had my daughter watch Aliens for Halloween because she was suggesting shit scary movies like The Witch. She's 22. Gosh, is that movie excellent. The best part was the three endings where she relaxed and I thought: excellent. Especially the spaceship fight. You think it's over, but no, the most tense parts of the movie are yet to come.


And it's not just "badass" - it has a solid emotional core that drives the character.

The director's cut makes it even more clear - Ripley lost her baby while she was in cryosleep, which gives her character's desperation to save Newt even more urgency.


_ALL_ of his 80s and 90s movies hold up. He was a master storyteller that understood the tools, the craft, how to get the best out of the people working for him, and most importantly the story and characters.

For whatever reason, he threw out story and pushing people to their limits in favor of technology and spectacle. It makes me sad to think that the last movies we’ll see from Jim Cameron are likely all in the Avatar universe, written by committee and acted in front of green screens.


While I share that sentiment, I also appreciate that his current movies are at least his singular vision. Not a lot of original IP currently where the director gets control over the end product.


Obviously it’s all a matter of taste but there are people (myself included) that think practical effects complemented with CGI is still the right way to do action.

Top Gun: Maverick would be a data point that a lot of people agree. No matter how good the CGI, actors on a sound stage in front of a green screen just behave differently than actors or stunt people actually doing action stuff.


Reminded of the Star Trek gif where they take out the camera motion and it's just the actors randomly lurching around in their chairs while acting.


Well, to be fair, if you filmed yourself in a car, with a windshield-attached camera, driving over rough terrain, and then stabilized the footage, it would also look like you and other passengers lurching around - though perhaps in a more synchronized fashion.


In that footage you'd at least be moving in similar directions; in Star Trek they're all over the place. It's just way more obvious when there's no camera shake.


Your experience is subjective and unique to you, and so is everyone else's. That's why there's no point in using some objective metric to rate entertainment anyway. I don't generally like modern movies or TV either, and I feel somewhat entitled to that opinion.


I think the guy that made them has some new movie out or something


The best action movie in my mind is Terminator 2. Hands down. It's like an archetypal movie.


It wasn't in 1990


A lot of the jokes making T-800 talk like a 90s skater punk are pretty cringey today. I still have fond memories but my kids thought it was pretty dumb.


Just wait til they’re 30-40, and they hear how kids are talking. Guaranteed it will sound just as dumb. :)

I think it was always supposed to sound a little dumb. The T-800 is acting as a father figure, and bridging the cultural gap with his surrogate ‘son.’


I agree. Now that I'm older, I've seen kids 30 years younger than me try to get their parents to use modern popular slang, and it feels super cringey.

James Cameron would have been 36 when he was making Terminator 2, and he's a polymath with an eye for detail, so I'm sure he deliberately went for that layered meaning.


>An interesting level of this scene is that it implies neither terminator is able to identify a fabricated voice even though they have a full understanding of each other's design. The T-1000 cannot tell it is talking to a T-800 until it realized it has been tricked and the T-800 cannot tell it is talking to a T-1000 until it tricks the other machine.

What made it even more amazing was that the T-800 ran on a 6502 processor.


>so perhaps with the full vocal range in person it would have been different.

I think there might have been other giveaways in person where the T-800 is concerned.


If the fabricated voice is 100% indistinguishable from the real one, how can the T-1000 or T-800 distinguish them?


The content of the message gave away that the speaker was missing knowledge the real person should have.


So you are implying that the robots know all about the people? Seems unreasonable.


This is the scene they're talking about: https://youtu.be/qKLTbJMJOSI


I know exactly what the scene is. If the T800 does not know the name of the dog, why would the T1000 know?


The T800 said the wrong name to test the T1000. The person the T1000 pretended to be would have noticed.


That is exactly the point. The person the T1000 was impersonating would have noticed. But not the T1000.

EDIT: I just noticed that the ones who replied to me interpreted it the other way around.



For those who don't know, the woman playing his mother is "Vasquez" from James Cameron's previous film, Aliens.


OMG. two of my favourite movies and I never spotted her as the same woman. I guess I'm not alone, looking at her IMDB bio...

Jenette Goldstein is a true chameleon. She is so effective as an actress, it is nearly impossible to recognize her from role to role.


“Hey Vasquez, have you ever been mistaken for a cybernetic organism?”


"Hey, Vasquez, have you ever been mistaken for a 17th century Spanish painter?"


Why drag Vasquez into this?


No. Have you?


Vasquez*


Thanks. Corrected.


Can I just unsolicitedly say that James Cameron is a god?


https://youtu.be/MT_u9Rurrqg

I felt obligated to find and post the link to the scene. What a great movie.


Might be worth warning people there’s an ultra-violent cut included in the clip. Anyone not interested in seeing the violence but wanting to watch the relevant dialogue can view it here:

https://news.ycombinator.com/item?id=34309876


- Your parents are dead.

That deadpan delivery always makes me laugh.


I smell a terminator...


There are no legitimate uses for this technology. Its only purpose is to scam and deceive. It should be regulated in the same way that nukes and guns are regulated. Contrary to what a lot of the HN crowd thinks, regulation of technology is certainly nothing new. Up until now personal computing has been largely free of regulation, but that doesn’t mean we couldn’t start.


Once perfected, I can imagine there are many legitimate use cases for it. Like an author using their voice to narrate their audiobook without having to spend time in a recording studio, or Hollywood using it instead of dubbing sessions for re-recording muffed lines. It'd also be interesting if this could be used for foreign language dubbing - imagine if it can use the voice profile of an actor to convert subtitle text files into foreign audio language tracks in the same tone as the actor.


God yes, a dubbed European film without the naff American actor.

Or for games where you can get the language for the NPCs all sorted out and change dialogue right up until it goes gold.


Is the inference so heavy for this that it cannot run as part of the game? I want to play Planescape: Torment with the text read to me; I don't know who I'd cast though. I suspect I actually like text dialogue in games better.


I recently updated my telephone contact with a large brokerage service.

Part of this was to voiceprint me during the conversation.

The agent assured me that this was more secure than asking me the customary identification questions, and that their system would just use the voice identification in the future.


> There are no legitimate uses for this technology.

I mean this kindly: your lack of imagination is not an adequate replacement for actual facts


>I mean this kindly

I don't lol. OP is a bundle of sticks. Every time there's a tiny bit of tech progress with this stuff they just go "MAH REGULASHIONS PLEASE PAPA GOVERNMENT SAVE US!"


You can look in this thread for legitimate non-deceptive uses. Voice synthesis for people with degenerative diseases like ALS, for example.

https://news.ycombinator.com/item?id=34309432


I disagree that this has no legitimate uses. This could be a phenomenal prosthesis for people who have lost their ability to speak. For example, many ALS patients.


I think it's important to keep considering whether or not something should be regulated, at least keep asking and not fall victim to an echo chamber.

But in this case, couldn't your argument have been applied to photoshop or video manipulation software? Those are meant to deceive, right?


> There are no legitimate uses for this technology.

What about this: You have a voice actor whose voice is part of a company's brand (maybe for an animated mascot). You can now also use that voice for dynamic text, for example audio books or for a voice assistant.


Flip it around and it works too. Anyone who wants to could potentially set all of the text-to-speech systems they use (their phone/house's voice interaction, listening to audiobooks, warning messages from their car/plane, etc.) to whichever voice profile they like best. Basically what TTS has been trying to do since the 70s or 80s, but with an infinite number of natural-sounding voices.

Maybe pick a couple of different ones for context-aware messages. Something familiar/reassuring for most messages, something less comfortable for "terrain!"-type alerts. If the model supports it, it could even morph between the two as the criticality of the message escalates.


Or this: Morgan Freeman reads you all of your audiobooks from now on and they sound so much more profound and relaxing now.


It was the best of times, it was the worst of times.


How long would it take, do you suppose, for ~~Disney~~ $ProductionCompany to simply get rid of the voice actor altogether? Now there's another person without work.


I always find this argument bizarre.

I don't want to live in a world where people are doing jobs that provide no value and only exist because "everyone needs a job". I would literally prefer we give them the money and they do no work, rather than forcing people to do a useless job just for the sake of working.

Do people really want their jobs to be the equivalent of Sisyphus pushing a boulder up a hill aimlessly every day? There's a reason this is seen as a divine punishment in the myth.


I personally don't want someone to clone my voice without my permission and then use it to make money. I imagine many people feel the same way.

I agree that I would prefer to move to a society where we don't have to work, but do you honestly think we're moving in that direction? We'll get the "there's no work" part, but not the "and here's some money" part.


I'm not advocating for a world where we don't have to work, more for a world where we only work on stuff that produces value.

I wouldn't worry about that work not existing. At the very least, with the population declining we'll need people to take care of the elderly and I don't think we're anywhere close to automating health care providers. Humans always seem to find more work to do.


Or the voice actor could license their profile to {{ProductionCompany}}, and not even have to go to the recording studio anymore unless it's a project they're personally interested in or where the production team wants to pay extra for a custom performance.

A high school friend of mine did that and became the voice of the "pronounce this name" feature on Facebook.


You could create a lot of jobs by banning the wheel.


I joke with my wife that I'll always be around to bug her, even after I pass. She'll carry my brain around in a jar ala Futurama and talk to me that way.

But, obviously, the technology is getting to the point where a decade or so from now she'll be able to have a GPT-like chat with me with my own voice. The first company to offer that to the loved ones of a deceased person will make a fortune, not for any mode of deception but just to soothe the hurt.


As a father, I like reading to my kids and I know one day they won't ask or I won't be able to.

I've wanted to record readings of some of their favorite books to pass on to them, but if I don't get the chance this seems like a way to get some analog of the experience.


>There are no legitimate uses for this technology

oh please. You never wanted to listen to an article/book in the voice of a narrator you liked? I do. Maybe you should read a bit more.

Just because you can't think of ways it can improve our lives doesn't mean everyone else can't. That's a you problem. Are you so deprived of free thought, and so insecure about your capabilities, that the first thing you do when seeing new technology is turn to the government to curb it? If you don't like it, no one should use it?

>regulation

ah there it is. the 'answer' to every technological progress you have is "regulashions!"

>guns are regulated

thankfully they are not in some places. So it should be regulated the same as they are - Not at all.


You're like OP in that you just state what you think should be without really saying why.



I can think of one... podcast cleanup app:

1. Speech to text

2. Fix up/edit the text with GPT-3

3. Text to speech in the original speaker's voice(s), preserving prosody and inflection, with VALL-E (rough sketch below)

If done with every participant's consent I don't see how it's not legitimate.
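
A rough sketch of that pipeline, assuming Whisper for step 1 and the OpenAI completions API for step 2; step 3 is a stub, since VALL-E has no public API (the function name there is hypothetical):

  import openai
  import whisper

  # 1. Speech to text
  stt = whisper.load_model("base")
  transcript = stt.transcribe("episode_raw.wav")["text"]

  # 2. Fix up / edit the text with GPT-3
  completion = openai.Completion.create(
      model="text-davinci-003",
      prompt="Remove filler words and false starts, keep the meaning:\n\n" + transcript,
      max_tokens=2048,
  )
  cleaned = completion["choices"][0]["text"]

  # 3. Text to speech in the original speaker's voice -- placeholder only, since
  #    VALL-E isn't released; any zero-shot voice-cloning TTS would slot in here.
  def synthesize_in_speaker_voice(text, speaker_prompt_wav):
      raise NotImplementedError("no public VALL-E API yet")

  synthesize_in_speaker_voice(cleaned, "episode_raw.wav")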


Descript already does this.


you could also just skip step 1 https://twitter.com/lexman_ai


What if I did 3-second-long impersonations of characters for a video game, and then used this tech to produce more lines of dialogue for each character in said game?


The difference is that nukes and - in most countries - guns are harder to find. Regulation is easier when you have more centralised distribution channels. Code is infinitely replicable at marginal cost, and so highly accessible once it's open source.

While I agree with you that the net effect of such technology will be negative, there's no way regulation can keep up.


I'm not sure the net effect will be negative. Cheaper and more follow-along audiobooks for people with dyslexia, a voice for people who lost their own. Lots of bad stuff too, sure. But your first point is spot on: there is no stopping it. AI has come to replace all creative endeavors with poorer-quality regurgitated substitutes, leaving human-created things as a luxury for the wealthy... except people will not stop singing, painting, writing, or coding, and too many people enjoy doing those things for the price to ever go too high. Except law clerks and contract lawyers. They will probably just go away, but now they can pick that paintbrush up, or get the band back together. Or code up an unauthorized ST:TNG game with synthetic vocals for Picard.


Sorry buddy but open source software eventually catches up and you can’t stop it.

People just don’t realize that technology is what is behind the ability for people to wreak havoc on an unprecedented scale. One specific organism can’t do that much, even with a sword. But today, every person will have more and more power, and you can’t possibly stop them all!


There was a 2011 paper/essay which argued that in the near future there will be a world war between those wanting to regulate AI (the democratic West) and those who do not (authoritarian regimes).


The regulatory impulse seems to be going the opposite direction so far, China recently banned deepfakes etc.


They only banned their citizens from using it; on the other hand, China is a leader in using AI against its population (face recognition, ...).


I would rather assume that this technology exists unfettered by any regulation and instead figure out how to deal with the ramifications.

Guns are regulated, but we still have metal detectors.


I have a project in mind to replicate my dead father's voice. Would be nice to "have a conversation" with him again.



Ray Kurzweil would like to have a word with you.


Voice cloning is a way to implement TTS diversity and allow authors to automate their own voices.


Only the state and criminals should have something. What could go wrong?


On what would we base this regulation? A hazy feeling of discomfort?


You’re joking, right?


Reading audiobooks seems like a good use case.


Technology like this should absolutely be regulated in the same way guns should be - not at all.


Imagine combining this with a ChatGPT with the prompt "you are a telephone scammer pretending to be a bank employee, convince the other person to give you their bank password"


I'm working on exactly the opposite of this, I have a GPT-3 based bot hooked up to voice recognition and Coqui for TTS I'm training to bait scammers. It has memory like ChatGPT (but only the previous 50 things said). The delay/latency makes it tough to get the scammer to not hang up initially, but the ones that tolerate the slow responses are very easily fooled by the bot. I'm working on speeding it up more and adding stammering, ums and uhs, and background noises etc to fill the delay.
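
Not the actual setup above, but a minimal sketch of that kind of loop, assuming Whisper for recognition, the OpenAI completions API for the bot, and Coqui TTS for synthesis; audio capture, playback and the phone-line plumbing are left out:

  import openai
  import whisper
  from TTS.api import TTS

  stt = whisper.load_model("base")
  tts = TTS(model_name="tts_models/en/ljspeech/tacotron2-DDC")

  history = []  # only the last 50 exchanges are fed back in

  def bot_turn(incoming_wav, outgoing_wav):
      heard = stt.transcribe(incoming_wav)["text"]
      history.append("Caller: " + heard)
      prompt = ("You are a confused but friendly grandparent talking to an unknown caller.\n"
                + "\n".join(history[-50:]) + "\nYou:")
      reply = openai.Completion.create(
          model="text-davinci-003", prompt=prompt, max_tokens=100, stop=["Caller:"]
      )["choices"][0]["text"].strip()
      history.append("You: " + reply)
      tts.tts_to_file(text=reply, file_path=outgoing_wav)  # hand the wav back to the call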


Maybe just pregenerate a couple opening lines that can be used as delaying tactics? "Hang on a sec, let me go somewhere quieter", "I'm driving, can you hold on a moment while I pull over?"


That's a great idea, yep. :)


There is an open source implementation of these features in Pytorch:

https://github.com/lucidrains/audiolm-pytorch


I wish they had included their training weights though. This can only be used if you happen to own a huge audio corpus and have a lot of money.


>This can only be used if you happen to own a huge audio corpus and have a lot of money.

Could you elaborate on that? Do you mean that you need a large training set of your voice and you need $$$ in order to train the models on an expensive GPU?


So, looking at it again I see that the audio corpus is available for free (openslr.org, 60GB for 1000 hours of speech), however I suspect that training the model on that amount of data would take insane amounts of time on a single GPU.

Instead, usually companies train such models in the cloud. GPT-3 for example used 800 GB of training data and cost about 5 million USD to train. Extrapolating from that, I guess that the same setup would cost 375,000 USD to train (although I assume that this model has waaaaay fewer parameters making it a lot cheaper -- but I can't seem to find how many parameters it has).
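
The back-of-the-envelope math, just scaling cost linearly with dataset size (a crude assumption, since cost really scales with model size and tokens processed, so treat it as a rough upper-bound guess):

  gpt3_data_gb = 800          # figure cited above
  gpt3_cost_usd = 5_000_000   # figure cited above
  tts_data_gb = 60            # the 60 GB corpus mentioned above

  estimate = gpt3_cost_usd * tts_data_gb / gpt3_data_gb
  print(f"${estimate:,.0f}")  # -> $375,000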

If someone else already spent that money to train the model, then you could just take the "training weights", feed them into the model, and it would be as if you had already trained it -- at which point you'd only need to provide your own voice and retrain it on your own GPU for a short time.

I'm by no means an ML expert though, so I could be totally wrong on this.


James Betker in the tortoise-tts repo, which is similar, says he spent $15k for his home rig. I'm not finding right now how long it took to train the tortoise model but feel like I read him say weeks/months somewhere. Obvs all kinds of variations depending on coding efficiency and dataset size, but another datapoint https://nonint.com/2022/05/30/my-deep-learning-rig/ https://github.com/neonbjb/tortoise-tts


Can someone merge this with ChatGPT so I don't have to attend any more Zoom meetings?


This would have made Sneakers a much more boring movie.


Me to my partner: "Didn't you say you were going to do the dishes?" Partner: "I don't remember saying that" Me: "Here, I have a recording of you saying it..."


Yikes.

I think a remake of the movie "Gas Light" using these kinds of AI technologies needs to happen as a sort of social commentary on where this is going.


anyone know what's up with the sudden uptick in AI assisted everything?

  - ChatGPT
  - DALL-E
  - VALL-E
  - Stable Diffusion
maybe it's just a reflection of my interest being piqued with AI junk and Google ads -- but I feel like I'm seeing it more and more, even on HN.


One of the perks of crypto losing value. GPUs finally being used for breakthrough research, rather than pyramid schemes.


This claim feels truthy, but is it true? Most of the big models were being trained on specialised hardware like tensor chips right?

And the biggest drop isn’t because crypto is worth less, it’s because Ethereum doesn’t use proof of work since November (Bitcoin hasn’t used GPUs for years). The explosion in novel AI models we saw last year predates this change.


Most big AI is trained on Nvidia GPUs but usually not the standard consumer ones found in the GeForce line-up. Instead it's usually their data centre GPUs like the current A100 or soon to be H100 that's just hitting the market.

Google does have their TPUs (Tensor Processing Units); however, they are not cost-efficient budget-wise, so unless you have some kind of deal with Google or compute credits it doesn't make sense. They have pods upon pods of TPU clusters though, so the main selling point of TPU training is that you can get your training done really fast just by scaling your workload to more TPUs.

So if you needed a big model like GPT-3 trained in a single day, you could spend an ungodly amount of money and get it done with Google TPUs. Otherwise, if you can wait weeks or months, you can go with the standard Nvidia data centre solution and it'd be cheaper in the end by a significant margin.


Crypto has some real use cases... have you ever tried to transfer money to another country? That's just one of many.


Crypto has been running on ASICs for years. The "evil crypto miners make your GPUs so expensive" psyop is an NVIDIA campaign to raise GPU prices.


True, but that doesn't mean that crypto miners aren't still evil.


The paper about the transformer architecture was published in 2017 [1]. My understanding is that it took a few years, but we are now seeing the results of that breakthrough.

[1] https://arxiv.org/abs/1706.03762


And Open AI's Whisper, which for me is the cherry on top!


I tried Whisper on an admittedly very challenging task (live translate Albanian to English subtitles), and it failed miserably. Went back and saw it had a 44% error rate for that language.

Might give it another go on an easier task.
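
For reference, the basic invocation for that task with the open-source whisper package looks roughly like this (the model size and the "sq" language code are my assumptions, and the large model is far from real time on most hardware, so "live" is a stretch):

  import whisper

  model = whisper.load_model("large")  # smaller models do much worse on low-resource languages
  result = model.transcribe(
      "albanian_clip.wav",
      language="sq",      # Albanian
      task="translate",   # emit English instead of an Albanian transcript
  )
  print(result["text"])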


Yeah, the top 5 languages on that WER graph are very much worth a look; the rest are a bit error-prone.

Surprisingly, I tested it with my mother, who has a very broken English accent/dialect, and it worked fine for her. It works amazingly well actually, and I'm busy building some tools around it for her to test.


Even though Whisper advertises translation, I honestly wouldn't say it's a first-class feature, at least as part of the Whisper model itself.

The only thing I'd use Whisper for is the transcription part, then use ChatGPT for the translation.


Jesus it can even understand Scottish. I need this app!


Obligatory Burnistoun reference: https://www.youtube.com/watch?v=TqAu-DDlINs

But I'd be interested to see what its reliability is like on various types of Scottish accent.


In 1993, it was said that the game of Go would be the Drosophila of artificial intelligence. AlphaGo (then AlphaGo Zero) succeeded in 2016 and 2017. The glass ceiling has been broken. We are witnessing the birth of a new world where artificial intelligence will be more and more present in everyday life. I do not see any limit.


> I do not see any limit.

Hmmm, now I'm wondering if you're actually an advanced AI that has already simulated what's going to happen.


The cynical view would be that, in the current tech environment of generalized risk aversion, rushing to demonstrate tangible use cases for AI will help justify a decade of out-of-control hype and investment, support exits, new funding rounds, etc.

Underneath all the commercial interest, some specialists inevitably must keep adding to the knowledge base / capabilities, but good luck finding any objective account of that process.


Imagine Hacker News existed when Netscape Navigator was released. (Maybe it was a Usenet discussion board or something.) Suddenly everyone would be talking about the World Wide Web.

At the moment, ChatGPT and Stable Diffusion look like the same order of magnitude of advance.


There have been some pretty big advances in the last few months in machine learning.


Breakthrough tech doing breakthrough tech things


It's the new hot topic?


I'm curious how all these advancements in AI will impact KYC and identity authentication. It's already easy to scrape OSINT to pull the answers needed for most people's knowledge based authentication sequences. Will we hit the point where fake passports, IDs, and biometrics (including voice prints here) can be replicated undetectably? If so, what will become the standard for identity authentication?


I’m assuming it’ll come down to two things: more reliance on in-person checks, and eventually pushing more towards digital signatures, though the latter is complicated in the U.S. by politics. It wouldn’t be insurmountable to deploy national ID cards like Estonia does from a technical perspective, but we have a non-trivial number of people who object to that on religious grounds, so I’d expect we’ll have some kind of private-sector, multiple-standards mess for years.


KYC isn't about stopping bad guys doing stuff. It's about stopping the government from shutting you down.


Maybe governments finally start implementing public-key cryptography with their chip-enabled IDs and passports?


Yes, this is the CEO and I do need you to wire that money. C'mon Frank (from accounting), you need to act more promptly on these urgent requests


Can someone please transcribe what "Speaker prompt" in Example #1 is really saying? I can't get "When I love making babies" out of my head! I reduced speed to 0.5X and understood the final, "maybe suspended but".


I heard “whereby lovemaking may be suspended, but…”


Thanks for confirming -- I didn't trust my own ears for such a strange fragment of a sentence!

It turns out it comes from a book called "The Foolish Dictionary: An exhausting work of reference to un-certain english words, their origin, meaning, legitimate and illegitimate use, confused by a few pictures" by Wurdz and Goldsmith. (Talk about nominative determinism -- a dictionary by Wurds?)

Full entry:

HAMMOCK From the Lat. hamus, hook, and Grk. makar, happy. Happiness on hooks. Also, a popular contrivance whereby love-making may be suspended but not stopped during the picnic season.


Ah, I have serious Demolition Man vibes now.


Exciting to see ML develop in ways that will enhance accessibility for people. I imagine being able to train a screen reader with your own voice (or voice of your choosing) would be a huge plus for vision impaired folks.


Would be awesome to pair this with near real time language translation. I get from the demo it seems that this is still very language specific, but I assume it will happen sometime.


This should probably link straight to the page of the paper: https://valle-demo.github.io/


Good News Everyone, I'm a Horse's Butt

https://www.youtube.com/watch?v=M1D9WEL7PKQ


HSBC in the UK used to advertise Telephone banking with security just based on speech recognition. I wonder whether systems like that could be fooled by this?


Biometric identification via voice is mostly hokum.

My education is in biometrics and our undergraduate seminar group had a student and his supervisor with essentially identical voices.

But that's just anecdata - the human voice doesn't carry a lot of unique information and on top of that it changes over time.

In biometric identification the iris is king. 3D face scans are a distant second, fingerprints on the border of usefulness and the rest like gait, voice, keystrokes etc. get rediscovered every decade or so just to, again, not yield results.


My employer had a project to look at phone unlocking with voice authentication. While we had a number of interesting techniques for anti-faking, the problem was the base error rate was about 1/10,000 even without adversarial considerations.


HSBC Hong Kong is actively promoting this. Just say "I will need your help with something" and you're in.


New scam incoming...

> Scammers get hold of someone's voice which is synthesised in ~real-time whilst they're on the phone with that person's mother, asking them to transfer them some emergency funds to a new bank account.

We won't be able to trust digital sound, digital image, digital text, and soon digital video. We need proof of authenticity / proof of authorship / proof of humanity!


> proof of humanity

This resonated with me when I read it. Seems clear where AI is going, so it seems clear that we need some sort of "authentication" that can reliably differentiate a real human from an AI one.


I understand these examples are zero-shot with short samples--which is impressive, but would the output quality improve significantly with longer input samples?

For example, I noticed it couldn't quite pick out the accent on some samples because they were so short. But if the model had more example words to hear, then I'd think it would accurately understand the accent.


I wonder to what degree some of these issues can be addressed with cryptographic protocols.

In some sense I feel like we already have good solutions in place when dealing with this problem in other contexts.

We can, for example, get some guarantees about “who wrote this code” with signed git commits.

The problem is a lot of our commonly used communication protocols were never designed with them in mind.


It is very possible, and in some areas it is already being used. For example, there are already video cameras which sign data in realtime at 4K resolution.

So you would need a microphone with a TPM chip and you are good to go. You just also need a player which can verify the data.
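
A minimal sketch of the verification side of that idea, using Python's cryptography library; in a real device the Ed25519 private key would live in the TPM/secure element rather than being generated in software like here:

  from cryptography.exceptions import InvalidSignature
  from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

  # Stand-in for the key burned into the microphone/camera at manufacture time.
  device_key = Ed25519PrivateKey.generate()
  device_pub = device_key.public_key()

  frame = b"...raw audio/video frame bytes..."
  signature = device_key.sign(frame)  # done on-device at capture time

  # The player checks each frame against the device's published public key.
  try:
      device_pub.verify(signature, frame)
      print("frame is authentic")
  except InvalidSignature:
      print("frame was altered or didn't come from this device")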


For another angle: what exactly is the source of our distinct voices to begin with? Are everybody’s vocal cords distinct in some way, like a fingerprint, or is it mostly governed by our baseline neural control, considering how people can frequently impersonate others (à la comedians) to some degree of accuracy where it’s funny?


Considering a given person's voice would sound different if they were brought up speaking a different language, yet all the speakers of that language have distinct voices, I'd say it's a bit of column A and a bit of column B.


That's a good point too; accents can creep over from first-learned languages of a different tongue.


Really scary to be honest…

Making things much easier for scammers. Voice phishing and social engineering is now easier than ever.


or...get this

this will enhance accessibility to a vast number of people with disabilities. Also imagine reading your books/articles in voice of your favorite narrator.

A normal person would have to work really hard to do 1:1 real-time voice obfuscation, given the processing it'd take.

----

Why you lot are always doom and gloom and small-minded is beyond me.


Because you're not thinking 10 years out. At a certain point you will be able to copy any voice and any video of a person in realtime. It will be as simple as downloading an app. When AI has scanned your whole life's details and it can be found by anyone there will be no privacy left. Nothing will be trusted anymore.


It’s not about being doom and gloom. It’s the opposite of being small-minded.

I like all of the above use cases you suggested, I like science advancements. I’m sure this will help plenty.

But, at the same time, we should be really careful and aware of what we’re creating.


I’m really skeptical about TTS. There’s lots of theoretical and academic and research lab stuff that supposedly is quantum leaps forward in TTS. However, all the commercial offerings sound like TTS and never match the breathless hype. Try them out and you can tell within seconds it’s a robot talking.


I have to disagree.

The SBB (Swiss Rail) is now using a quite sophisticated TTS system [1] that sounds very clean and has a Swiss German accent in the High German it speaks (Zürich-accent based, which is what most of the complaints are about). The old system was built using over 10k recordings, which was an enormous task.

The system now is able to announce pretty much anything including reasons for delay etc. and it does not sound like a bunch of recordings attached to each other.

I am hoping it will eventually be expanded to switch accents depending on the region, just as it already switches to French or Italian depending on where you are.

[1] https://www.tagesanzeiger.ch/so-klingt-die-neue-stimme-der-s...


> Zürich accent based which is what most of the complaints are about

I've given up on this a long time ago. Ads are almost always in a Zurich accent, except if the ad is for extremely regional stuff like secret cheese recipes or tourist ads for Graubünden.

Zurich is the tech hub of Switzerland and the accent seems to be the default. Then again, who in their right mind would want an AI to talk with an accent from Thurgau (/s)


Well we do have Tschugger[1] which is definitely not Zürich Swiss German...

[1] https://www.youtube.com/watch?v=C1AwYqquL7k


True, and thanks for reminding me! I've meant to watch it for some time now and heard that the second season was out.

Have you watched it/did you like it?


I watched it, it was ok. I did like the first season better.


Might give it a shot then!


Except that... while the German-speaking one sounds relatively good, I am definitely unhappy with the new system in French. It sounds more unnatural (and harder to understand in certain cases) than the old system. Maybe it's just a lack of tuning of the French version vs. the German one, but still...


I agree, I hope they fix that soon. I think the product they are using is from Germany, which may explain the quality difference.


Do you know what the actual product is?


Except when you're talking over a phone call that sounds like two tin cans connected with a string even on a good day.

I'm regularly astonished at how bad international calls in particular have become, and you're regularly subjected to these even domestically, since so many call centers are in the Philippines or India these days. And this despite bandwidth being cheaper than ever.


Unfortunately, all the bandwidth in the world can't compensate for latency, and the loss of quality in ADC, DAC and compression steps as currently implemented.


It's all in the compression. That's why VOIP systems sound much better. Lots of telco systems still use AMR with a very low bitrate.


I can't think of a single VOIP system that sounded good. Of the ones I use regularly, like Messenger, Teams and Skype, they rarely if ever sound better than just making a regular phone call with the same device.


Good enough ADCs/DACs are so cheap and plentiful that I don't know why you're even mentioning them in the same sentence.


The generated voice has the same creepiness fake AI voices have in movies. It also has the same monotone that there is in some dystopian movies (for some reason it reminds me of Blade Runner, but I don't remember a narrator voice).


Are these text-to-speech models manually controllable in terms of prosody, etc., or is it all transformer-based text-to-audio?

I've followed some of the research on prosody transfer, etc., but it still seems bad in the TTS systems I've heard.


I remember Lyrebird did this back in 2017. They used to have a free-to-use interface on their site to speak a phrase then listen to a text-to-speech output of your voice. Assuming Microsoft improved it somehow?


I know of a bunch of audiobooks where I'd rather listen to the author than the current narrator, or, in other cases, hear the person the book is about instead of the author.


Given its name, I was quite surprised not to hear it sounding like this: https://youtu.be/R5Q1yVLSR3I?t=17


There's plenty to be afraid of, but legit applications as well, for example automated announcements and voice menus where you can't pre-record every possible combination of things to be said.


Does this mean I could finally speak English without my heavy accent?


Accents can be minimized through dedicated training.

After being ridiculed for my horrendous English pronunciation I watched hours of English movies and repeated every single sentence I heard until it sounded right.


Thatsa correcto.


I don't speak Italian ;)


Zattsu Raito?


So if we get a call from a number we do not know, maybe we should NOT talk and give them our voice, which they could record and use to make new speech to scam our loved ones.


Will probably have to give my mom a secret word that only I know, which I need to say to her so she knows it's me and not a scammer using my voice.


It’s not a bad idea, but you can already do this kind of check with shared knowledge. “Where did we go for vacation when I was a kid?”, “where did we eat for dinner last time we saw each other” Etc.



Someone will eventually release an open source model, right?


Hope that 15.ai finishes soon.


Yes, this is Mike Smith's father Steve, my son won't be attending school the rest of this week. We're flying back east to visit our relatives.


I would love to hear David Attenborough's voice training VALL-E so that we can forever narrate "planet earth" like tv shows with his voice.


It would be interesting if VALL-E could express emotions like fear or love through the tone or speed of the voice. That would be even more scary.


Perhaps digital signatures will help with authenticity issues. It may become necessary. Transcoding presents an issue, though; I wonder how it might be solved.


"My voice is my passport. Verify me?"


"Your scientists were so preoccupied with whether or not they could, they didn't stop to think if they should."


Didn’t Deep Voice from Baidu claim this several years ago? Was that a legit claim or is MS playing catch up?


How long before we have pretty good AI bards?

They'll sing never-written verses and non-existent sounds in a velvet voice.


I’m excited for when they can duplicate patterns and mannerisms, especially if in a different voice.


Sigh.. Now we'll need voice changers when calling people/institutions we don't trust


You might want to look at www.aflorithmic.ai - scalable infrastructure using models like VALL-E


Serious question: do the people who work on this topic think they’re doing a good thing?


Hmm. I wonder if this can be used to make a home assistant sound like someone


Time to use the Snake voice more often:

"Text to speech? Anyone voice? What?"


It only works in English(?)


Thanks MSFT for making my job of phishing people easier :)


So is there anywhere you can play with this model yet?


Can someone try with David Attenborough's voice?


One more step deeper into the post-truth era.


VALL-E sounds uncanny, and totally tubular!


I'd like to get off the ride, please.


I didn't see any way to use it...


"My voice is my passport."


Is this tech usable in any way? Without releasing the code or the model, how is this not vaporware?



