Highly realistic talking head video generation (github.com/fudan-generative-vision)
136 points by HuiLi1998 19 days ago | 60 comments



I have found EMO (not open though) [1] to be the best yet.

Look at the rapping example near the end. The lip sync is nearly flawless. The first black-and-white lady singing is also almost perfect. It even gives them the subtle jerk of pausing for breath. Unless you know to look and are really hunting for flaws, you won't find anything that stands out; they look real.

[1] https://humanaigc.github.io/emote-portrait-alive/


It's awesome and I hate it.

This singular piece of technology makes me pessimistic about the future. Until now, a video recording was considered very good evidence. Say you argue with someone about what person X said. You show them a video and they'll concede, "ok, he did say that, but...". You could at least set some facts straight, and then discuss the interpretations.

But that's gone now. You can generate mass amounts of real-looking fakes and at the same time label anything you don't like as fake. There's really no independent evidence anymore; you can only put trust in the medium of your choosing (a YouTube channel, a newspaper, a TV station) and hope that it reports honestly.

This seems to have only minimal benefits for society, but huge negatives. And there's no stopping it now...


This was an inevitable outcome of the advancement of technology. I would argue that we lost trust in all mediums a long time ago; it is just now being realized by the masses.

But as usual, we shall adapt and overcome.


> But as usual, we shall adapt and overcome.

I don't believe this is a problem we can "overcome". We will need to learn to live with "alternative facts" being even more prominent than they are now, but I'm not looking forward to it.


> I don't believe this is a problem we can "overcome".

Can digital signatures not remedy this problem? When you log in to your bank, how do you know you are logging into your bank? In the future, a recording without signatures will be like a bank login without HTTPS is today.


The only thing signatures/HTTPS provide is ascertaining the identity of the other side; they won't help you determine whether the recording itself is fake.

For this to work, you need an existing trust relationship with the media outlet. Like, OK, I can trust the NYT, so I will trust videos signed by them. But another person distrusts the NYT and trusts only Truth Social. In the past, we could at least agree on basic facts, like January 6th actually happening, but I think this generative AI will make laying out the facts much more difficult or even impossible.


You raise a valid point that digital signatures and HTTPS alone cannot guarantee the authenticity of a recording. However, modern smartphones and other mobile devices have the capability to provide stronger assurances about the originality of recordings through the use of tamper-proof secure hardware.

Many high-end smartphones, such as iPhones and some Android devices, incorporate secure enclaves or trusted execution environments (TEEs). These are isolated, tamper-resistant hardware components that can securely store and process sensitive data. When a recording is made on such a device, the secure hardware can associate the recording with additional metadata, including the specific date, time, GPS coordinates, and user account information. This metadata is cryptographically bound to the recording itself.

Furthermore, the device can digitally sign the recording and its associated metadata using a unique key stored within the secure hardware. This digital signature serves as a testament to the recording's originality. Companies like Apple or Google, who manage the secure hardware and signing keys, can then vouch for the authenticity of the recording.

While this approach doesn't completely eliminate the possibility of fake recordings, it significantly raises the bar for creating convincing forgeries. Modifying the recording or its metadata would invalidate the digital signature, making it evident that tampering has occurred.

Of course, as you mentioned, trust in the entity verifying the signatures (e.g., Apple or Google) is still required. However, this trust is based on their reputation and the security measures they employ, rather than on the content of the recording itself.
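To make that concrete, here is a minimal sketch of the signing step, assuming a hypothetical device key and metadata layout (on real hardware the key would live inside the secure enclave and never be exportable like this):

    import hashlib, json
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # Hypothetical device key; a real enclave would hold this internally.
    device_key = Ed25519PrivateKey.generate()

    def sign_recording(video_bytes, metadata):
        # Bind the metadata (time, GPS, account) to the video by hashing
        # a canonical encoding of both together, then sign the digest.
        digest = hashlib.sha256(
            video_bytes + json.dumps(metadata, sort_keys=True).encode()
        ).digest()
        return device_key.sign(digest)

    sig = sign_recording(open("clip.mp4", "rb").read(),
                         {"time": "2024-06-12T10:00:00Z", "gps": [40.7, -74.0]})

Anyone holding the device's certified public key can then verify the signature; altering either the video or the metadata invalidates it.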


There's no such thing as tamper-proof hardware, only tamper-resistant hardware. Also, the whole "sign what came from the sensor" idea is widely known not to work, because you can easily record a playback of doctored footage. Lots of LLM-isms in this comment too.


> you can easily record a playback of doctored footage

You believe this is easy when the device has multiple recording sensors and multidimensional information (such as spatial information, changes in the focus sensors during recording, etc.) is part of the recording that is digitally signed?


Who's proposing such a device for widespread adoption? I've heard of sensor data signing [1], but not what you're describing.

[1] https://pro.sony/ue_US/solutions/forgery-detection


The concept of sensor data signing to authenticate videos and images captured on mobile devices is still an emerging technology, not yet widely adopted. However, as AI-generated synthetic media becomes more prevalent and potentially problematic, solutions like this may gain traction.

The key idea is to leverage the array of sensors built into modern smartphones and tablets - accelerometer, gyroscope, GPS, WiFi/cellular signal data, etc. - to cryptographically sign the sensor readings along with the visual data itself at the time of capture. This extra layer of verifiable sensor data would help establish that a recording originated from a real physical device in a particular place and time, as opposed to a purely digital fabrication.

Historically, technologies like digital signatures and public key cryptography started out in niche military/government applications before becoming ubiquitous in the computer era. In a similar way, sensor-level authentication of audiovisual media could follow an adoption curve driven by the growing need to combat sophisticated AI forgeries.
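As a rough illustration of the idea, here is a hypothetical per-frame capture manifest (all field names and the structure are invented for illustration, not any shipping standard):

    import hashlib, json, time

    def capture_manifest(frame_sha256, sensors):
        # Bind sensor readings taken at capture time to the frame hash,
        # then hash the manifest itself so it can be signed along with
        # the footage.
        manifest = {
            "frame_sha256": frame_sha256,
            "captured_at": time.time(),
            "sensors": sensors,  # e.g. accelerometer, gyroscope, GPS fix
        }
        manifest["manifest_sha256"] = hashlib.sha256(
            json.dumps(manifest, sort_keys=True).encode()
        ).hexdigest()
        return manifest

A forger would then have to fabricate mutually consistent sensor streams, not just plausible pixels.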


I know I’m logging into my bank because I initiated the connection, and refuse to believe anyone in any other context who claims to be my bank. People are routinely defrauded by scammers who claim to be their bank, and banks are routinely scammed by people who claim to be an account holder.


> I know I’m logging into my bank because I initiated the connection, ...

Just because you initiated the connection, how do you know the other end is your bank? Do you trust every internet company that carries your packets to the bank? Trust their employees? Trust their security practices? Do you trust the firmware on all the devices involved?

> People are routinely defrauded by scammers who claim to be their bank,

I have read about this in the news, just like I read about snakes with two heads, etc., yet I have yet to meet someone it has happened to. What fraction of the people you know have had this happen?

Could it be that these people believe, like you do, that "I know I'm logging into my bank because I initiated the connection", as opposed to checking the digital signatures on the connection?
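For what it's worth, you can inspect that signature chain yourself; a quick sketch (the hostname is a placeholder):

    import socket, ssl

    # Open a TLS connection and look at the certificate the server
    # presents; create_default_context() verifies it against the
    # system's trusted root CAs and checks the hostname.
    host = "bank.example.com"  # placeholder
    ctx = ssl.create_default_context()
    with socket.create_connection((host, 443)) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
            print(cert["subject"], cert["issuer"], cert["notAfter"])

If any carrier along the path tampered with the traffic, the handshake would fail; that's the part "I initiated the connection" doesn't cover.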


Maybe relying on video "evidence" to prove something is actually the bug/vulnerability, and this technology will finally "fix" it by calling all video evidence into question. I'd rather the tech be widely publicized and out there, so people know it's a thing and can be convinced to disregard video "evidence", than have it be kept secret with the public unknowingly trusting video. Just like people know Photoshop is a thing and (hopefully) don't believe images they see on the Internet by default.


> by calling into question all video evidence

I think it's a dangerous mindset to doubt everything you can't see with your own eyes. There was this fringe group claiming that the war in Ukraine is fake, that there's no war. With this mindset, such claims will become more mainstream; you could even call them reasonable.


That is very unnerving to think about. If video evidence could currently be flawlessly faked it'd give some legitimacy to those claims. If we do reach this point the value of a source will shift from "I can prove it" to "you can trust me", and I do not think that will be for the better. I'm not yet ready to live in a completely subjective world, where truth and falsity have equal weight.


The Polk County Sheriff's Office recently announced a partnership with Florida Polytechnic University to start working on this, dubbed the Sheriff's Artificial Intelligence Laboratory (SAIL).

https://www.polksheriff.org/news-investigations/polk-county-...

The conference video at 1:00 starts off with a generated clip of Elon Musk saying he's going to move to Polk County. The Sheriff highlights your concerns as well as many others.

Conference video (29:56): https://www.youtube.com/watch?v=DHj18pOcXHc


Why do you care so much about what people said? Before video recording was a thing, people didn't have to constantly watch their backs and monitor what they were saying for fear of losing their jobs. What happened at a party stayed at a party.

You may say it's important only for public officials. But why is it important? Because you're giving a huge amount of power to single individuals, and somehow we're taught that's a good thing - or at least that it's inevitable for keeping the peace or keeping crime at bay. What a load of bs. I hope distrust in centralisation increases. It should have been there in the first place.


Speeches by politicians are just one aspect of it. This technology makes it easier both to deny reality and to construct new ones. Beyond the horizon of your own eyes, there will be only subjective facts, fed to you by the media of your choosing. Is the war in Ukraine raging, or is it all fake? It becomes a matter of opinion, not a fact you can establish (short of shipping you and your discussion partner there to see it with your own eyes).


You raise a good point, yet it is not what people say that matters but what it predicts about them. Modern society is built on trust, and the things you want to know but cannot observe can often be predicted from what you can observe, such as the things being said.


Wow! EMO is impressive. Do they plan on open sourcing it?

The page has a link to GitHub [1] right at the top, but the repo is basically empty.

https://github.com/HumanAIGC/EMO


Issue comments in the EMO repo point to the V-Express repo [1], which was released two weeks ago and appears to be a fully functioning open-source implementation?

[1] https://github.com/tencent-ailab/V-Express


The black and white lady is nightmare fuel for me personally.


What irks you about her? I haven't seen her before at all; maybe that's why I'm not seeing anything too strange.


What Ces11 said.


Audrey Hepburn? Was she in a scary movie, or did she play someone scary?


In the synthetic video she looks like some kind of Frankenstein's monster, brought to life with electrodes or hidden motors, similar to the other video.

Both 'move' in ways that are very unnatural.


Glad it isn't just me.


She was a very graceful lady in wholesome movies.

The juxtaposition of the modern facial expressions of an influencer-type singer covering Ed Sheeran on some X Factor-type television show is what makes it creepy. It is somehow doubly fake, and extremely out of character if you are familiar with her.


[flagged]


Ok. So I showed the first two videos to my wife. She noticed the teeth merging and looking different each time, and then the ear. But that was all.

For me, the lip sync and body movement are what excited me most. They are the closest to real compared with any similar tech.


Crikey. Well, I don't know what to think anymore. I guess it got "good enough" for some things. I can still tell. This is going to suck for some people (it feels uncomfortable).


IMO, it sucks for reasons beyond the level of quality.

For starters, consent is the first problem I have. Yes, lots of examples, but none of the individuals consented to having their likeness used to say things they didn't. Now, abstract this problem of a lack of consent beyond "examples"—the creators of this have no problem with the ethics of not asking for consent, thus the world at large will not either.

Then we have the problem of how it is going to be abused and what problems will exist because of it.


Perhaps realistic physically, but not emotionally.

It's truly bizarre to watch these talking heads because their lips are moving, but their eyes and cheeks aren't moving along with them, except for blinking.

Real people speak with their whole face, not just their lips.

Of course, to do that "right", you need to actually understand the emotional content of what is being spoken. And I'm not talking about highly "emotional" content like in TV drama -- even in a technical presentation, the speaker's face contains lots of emotional signals. Whether warmth, or a sense of humor, or being proud, or excitement of what they're about to reveal, or curiosity about whether the audience understands, etc.


> It's truly bizarre to watch these talking heads because their lips are moving, but their eyes and cheeks aren't moving along with them, except for blinking.

Not true?!

Please rewatch the video at the top: the eyebrows especially are quite animated, and the contours of the face change as well, as do the shadows that indicate muscle movement in the face.


It's just not looking real to me at all. Yeah there's a little bit of random movement, but none of the patterns of movement reflect the ways in which people's faces are actually expressive when speaking.

They look kind of lobotomized, sure, with maybe some random eyebrow raises thrown in. It's nothing like whole-face expression.


I think the point of comparison should be news anchors and other "talking head" media, not real life humans emoting. Maybe it's just me, but people looking directly into and speaking in front of a TV camera about boring things also look lobotomized and uncanny. Check out your local newscasters some time and tell me they're really more realistic than these results.


You raise good points about the current version of this tech, but trained on larger datasets it will likely become more emotionally realistic than the average person. Video filters for "tuning up" emotional expressions will likely be common in the future, giving users all the emotional range that talented actors have. If you want this ability it may be democratizing, but for talented actors it may be dystopian.


It's still at the wav2lip [1] level from 5 years ago (look at the teeth), just at slightly higher resolution. The only real player with a moat is probably flawlessai. [2]

[1] https://github.com/Rudrabha/Wav2Lip

[2] https://www.flawlessai.com/


Actually, our work not only targets lip movements; it can also produce more realistic head movements and facial muscle movements. PS: It also achieved a good SyncNet score on 200 randomly collected in-the-wild image-audio pairs. And it is open sourced. :)


I get exactly what you are doing, but I don't understand what would be novel about it? It really looks just like a simple SD/wav2vec/insightface/animatediff pipeline that anyone can plug together in ComfyUI or with diffusers. The muscle claim is also a bit dubious...

PS: With insightface models in your requirements, the OSS aspect is also pretty much void for any use other than research, and your readme should reflect that.


Flawlessai is interesting, but even in their formal marketing example there are weird tells. The actress's eyes are too jerky and weird during the changed sections. It's strange, but in a way that I'm not sure I could tag without the context of the rest of the video.

Is this what we're in for? Human movement that is just normal enough to pass, but not really natural enough to be comfortable with?


We're sure to be in an uncanny valley for a little while first, the real question is whether it is followed by a winter or some of that sweet verisimilitude.


What's SOTA open source?


Just for level setting.

A huge number of people who may be users of a platform such as Facebook are very likely able to watch a video generated with a technology like this and not consider that it might be fraudulent.

Even if the content is somewhat strange or far-fetched, these people, the majority, are likely not only to not notice that something is off about the video, but to believe what is being said.

They are unlikely to act on the information or regurgitate it unprovoked, but likely to just remember it as a small fact they have received in the past.


I had Logo on my Atari 800 back in '82. After having learned BASIC, it took a while for me to wrap my head around list-oriented programming. After a few weeks of beating my head against the manual, however, I got halfway good at it.

Then my little brother, 12 years younger than me and in 2nd grade, sits down, and 2 days later he's programming circles around me. He could make that Atari dance like Fred Astaire.

And I was the one who wanted to be a computer programmer. He had no inclination in that direction at all, and became a businessman.

I learned a very painful and embarrassing lesson about how learning the wrong programming language can give you brain damage.


I've always thought this is where blockchain technologies can finally be of some practical use - as the age of Deep Fake fully unfolds.

If an organization/entity releases media that is (by them) marked as "real" (or genuine, or trustworthy, original, not tampered with, etc.), the stream can be checksummed and committed to a ledger. This way, any content coming from sources participating in this "verity scheme" can be validated against a distributed record (which is presumably incredibly difficult to modify surreptitiously).


That doesn't need block chain though, just a public key.


How would a public key suffice? Once the media has been published, the publisher could alter it, since they have the private key.

I don't mean to encrypt the media, I mean to make a hash stream of the media stream, and store that in a way that cannot be adulterated.
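Something like this minimal sketch, where each chunk's hash is chained into the next (the chunk size and chaining scheme are arbitrary choices here):

    import hashlib

    def hash_stream(path, chunk_size=1 << 20):
        # Chain each chunk's hash into the next so the final digest
        # commits to the whole stream, in order. Publishing that digest
        # to a ledger fixes the content at publication time.
        prev = b"\x00" * 32
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                prev = hashlib.sha256(prev + chunk).digest()
        return prev.hex()

The publisher can later alter their copy of the media, but not the digest already committed to the ledger; an altered copy simply hashes to a value that isn't on record.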


The most comforting thought I've had in a while is that AI just somehow fades away. I'm all for AI and use it regularly, but I can't help but feel uneasy about where this is all headed.


It’s getting there, but you can still see that certain sounds are not realistically rendered. For example, for the V in “vengeance” the bottom lip has to touch the upper teeth, but in the renderings it’s only approximated.


Looks like it allows only 1 reference image, so I'm not sure how realistic this is going to be in practice.


I just have to think about the implications of going the other way round.


Automated lip-reading?

Ugh... paired with laws that restrict audio recording but not video recording.


I don't know what that means. What do you mean?


Lip reading. Video to speech.


It means that soon we'll all be wearing burqas to mask our lips.


Have you slept through 2020-2022?


I could never stand the ear-straps.


The AI slop shovelware spam posts will continue until morale improves.


Lmao, it is not realistic at all, every movement is grossly exaggerated.



