Show HN: Infinity – Realistic AI characters that can speak
480 points by lcolucci 10 days ago | 295 comments
Hey HN, this is Lina, Andrew, and Sidney from Infinity AI (https://infinity.ai/). We've trained our own foundation video model focused on people. As far as we know, this is the first time someone has trained a video diffusion transformer that’s driven by audio input. This is cool because it allows for expressive, realistic-looking characters that actually speak. Here’s a blog with a bunch of examples: https://toinfinityai.github.io/v2-launch-page/

If you want to try it out, you can either (1) go to https://studio.infinity.ai/try-inf2, or (2) post a comment in this thread describing a character and we’ll generate a video for you and reply with a link. For example: “Mona Lisa saying ‘what the heck are you smiling at?’”: https://bit.ly/3z8l1TM “A 3D pixar-style gnome with a pointy red hat reciting the Declaration of Independence”: https://bit.ly/3XzpTdS “Elon Musk singing Fly Me To The Moon by Sinatra”: https://bit.ly/47jyC7C

Our tool at Infinity allows creators to type out a script with what they want their characters to say (and eventually, what they want their characters to do) and get a video out. We’ve trained for about 11 GPU years (~$500k) so far and our model recently started getting good results, so we wanted to share it here. We are still actively training.

We had trouble creating videos of good characters with existing AI tools. Generative AI video models (like Runway and Luma) don’t allow characters to speak. And talking avatar companies (like HeyGen and Synthesia) just do lip syncing on top of the previously recorded videos. This means you often get facial expressions and gestures that don’t make sense with the audio, resulting in the “uncanny” look you can’t quite put your finger on. See blog.

When we started Infinity, our V1 model took the lip syncing approach. In addition to mismatched gestures, this method had many limitations, including a finite library of actors (we had to fine-tune a model for each one with existing video footage) and an inability to animate imaginary characters.

To address these limitations in V2, we decided to train an end-to-end video diffusion transformer model that takes in a single image, audio, and other conditioning signals and outputs video. We believe this end-to-end approach is the best way to capture the full complexity and nuances of human motion and emotion. One drawback of our approach is that the model is slow despite using rectified flow (2-4x speed up) and a 3D VAE embedding layer (2-5x speed up).
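For those curious, a rectified-flow sampling loop looks roughly like this in general (a generic illustration, not our actual code; `velocity_model` and `cond` stand in for our conditioned transformer and its image/audio conditioning):

    import torch

    def sample_rectified_flow(velocity_model, cond, shape, num_steps=8, device="cuda"):
        """Generic rectified-flow sampler sketch: integrate a (nearly) straight ODE
        from noise at t=1 to data at t=0, which is why few denoising steps suffice."""
        x = torch.randn(shape, device=device)                      # start from Gaussian noise
        ts = torch.linspace(1.0, 0.0, num_steps + 1, device=device)
        for i in range(num_steps):
            t, t_next = ts[i], ts[i + 1]
            v = velocity_model(x, t.expand(shape[0]), cond)         # predicted velocity field at time t
            x = x + (t_next - t) * v                                # Euler step backwards in t, toward data
        return x                                                    # video latent; a 3D VAE would decode it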

Here are a few things the model does surprisingly well on: (1) it can handle multiple languages, (2) it has learned some physics (e.g. it generates earrings that dangle properly and infers a matching pair on the other ear), (3) it can animate diverse types of images (paintings, sculptures, etc) despite not being trained on those, and (4) it can handle singing. See blog.

Here are some failure modes of the model: (1) it cannot handle animals (only humanoid images), (2) it often inserts hands into the frame (very annoying and distracting), (3) it’s not robust on cartoons, and (4) it can distort people’s identities (noticeable on well-known figures). See blog.

Try the model here: https://studio.infinity.ai/try-inf2

We’d love to hear what you think!






As soon as I saw the "Gnome" face option I gnew exactly what I gneeded to do: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

EDIT: looks like the model doesn't like Duke Nukem: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

Cropping out his pistol only made it worse lol: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

A different image works a little bit better, though: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...



Haha, I almost woke up my kid with my sudden laugh!

This is why we do what we do lol


Hi Lina, Andrew and Sidney, this is awesome.

My go-to for checking the edges of video and face-identification models right now is Personas -- they're rendered faces done in a painterly style, and can be really hard to parse.

Here's some output: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

Source image from: https://personacollective.ai/persona/1610

Overall, crazy impressive compared to competing offerings. I don't know if the mouth size problems are related to the race of the portrait, the style, the model, or the positioning of the head, but I'm looking forward to further iterations of the model. This is already good enough for a bunch of creative work, which is rad.


I didn't know about Persona Collective - very cool!

I think the issues in your video are more related to the style of the image and the fact that she's looking sideways than to race. In our testing so far, it's done a pretty good job across races. The stylized painting aesthetic is one of the harder styles for the model to do well on. I would recommend trying a straight-on portrait (rather than a profile) and shorter generations as well... it might do a bit better there.

Our model will also get better over time, but I'm glad it can already be useful to you!


It's not portrait orientation or gender specific or length related: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

It's not stylization (alone): here's a short video using the same head proportions as the original video, but the photo style is a realistic portrait. I'd say the mouth is still overly wide. https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

I tentatively think it might be race related -- this is one done with a subject of a different race. Her mouth might also be too wide? But it stands out a bit less to me. https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

p.s. happy to post to a bug tracker / github / whatever if you prefer. I'm also happy to license the Persona Collective images to you if you want to pull them in for training / testing -- feel free to email me. There's a move away from 'painterly' style support in the current crop of diffusion models (Flux, for instance, absolutely CANNOT do painting styles), and I think that's a shame.

Anyway, thanks! I really like this.



Well then. Tik Tok, and keep ticking to you too.

Damn - I took an (AI) image that I "created" a year ago that I liked, and then you animated it AND let it sing Amazing Grace. Seeing IS believing; this technology pretty much means video evidence ain't necessarily so.

We're definitely moving into a world where seeing is no longer believing

https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

It’s astounding that 2 sentences generated this. (I used text-to-image and the prompt for a space marine in power armour produced something amazing with no extra tweaks required).


There is prior art here, e.g. Emo from alibaba research (https://humanaigc.github.io/emote-portrait-alive/), but this is impressive and also actually has a demo people can try, so that's awesome and great work!

Yep for sure! EMO is a good one. VASA-1 (Microsoft) and Loopy Avatar (ByteDance) are two others from this year. And thanks!

seriously, kudos for having a publicly available demo (w/ no sign in!) you did what very very few ai founders dare do

Thank you! Just want many people to use it. And, it's super interesting to see what type of content people are making with it.

Was just about to post this. I've yet to see a model beating that in terms of realistic quality.

I tried making this short clip [0] of Baron Vladimir Harkonnen announcing the beginning of the clone war, and it's almost fine, but the last frame somehow completely breaks.

[0]: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...


This is a bug in the model we're aware of but haven't been able to fix yet. It happens at the end of some videos but not all.

Our hypothesis is that the "breakdown" happens when there's a sudden change in audio levels (from audio to silence at the end). We extend the end of the audio clip and then cut it out of the video to try to handle this, but it's not working well enough.


just an idea, but what if the appended audio clip was reversed to ensure continuity in the waveform? That is, if >< is the splice point and CLIP is the audio clip, then the idea would be to construct CLIP><PILC.

This is exactly what we do today! It seems to work better the more you extend it, but extending it too much introduces other side effects (e.g. the avatar will start to open its mouth, as if it were preparing to talk).
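For anyone curious, the splice trick looks roughly like this (a simplified numpy sketch, not our production code):

    import numpy as np

    def extend_with_reflection(audio: np.ndarray, sample_rate: int, extra_seconds: float) -> np.ndarray:
        """Append a time-reversed copy of the clip's tail so the waveform stays
        continuous at the splice point (CLIP><PILC). Assumes the clip is at least
        `extra_seconds` long."""
        pad = int(extra_seconds * sample_rate)
        if pad <= 0:
            return audio
        tail = audio[-pad:][::-1]              # reversed tail: sample values match at the boundary
        return np.concatenate([audio, tail])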

Hmm, maybe adding white noise would work. -- OK, that's quite enough unsolicited suggestions from me up in the peanut gallery. Nice job on the website, it's impressive, thank you for not requiring a sign up.

All for suggestions! We've tried white noise as well, but it only works on plain talking samples (not music, for example). My guess is that the most robust solution will come from updating how it's trained.

What if you train it to hold the last frame on silence (or quiet noise)?

We've talked about doing something like that. Feels like it should work in theory.

Or noise corresponding with a closed mouth

Hmmmmmmmm

Ohmmmmmmm


Hmm, weird, I thought you criticised HeyGen for doing exactly that (mirroring the input).

HeyGen (and our V1 model) literally uses the user on-boarding video in the final output. See here for a demonstration of this (https://toinfinityai.github.io/v2-launch-page/#comparisons). We are not talking about that in this thread. We are trying to solve a quirk of our Diffusion Transformer model (V2 model).

Our V2 model is trained on specific durations of audio (2s, 5s, 10s, etc) as input. So, if you give the model a 7s audio clip during inference, it will generate lower quality videos than at 5s or 10s. So, instead, we buffer the audio to the nearest training bucket (10s in this case). We have tried buffering it with a zero array, with white noise, and by concatenating the input audio (reversed) to the end. The drawback is that the last frame (the one at 7s) has a higher likelihood of failing. We need to solve this.
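In pseudocode terms, the buffering step is roughly this (illustrative buckets and helper, not our exact configuration):

    TRAIN_BUCKETS_S = (2, 5, 10)  # illustrative durations, not our exact training configuration

    def seconds_to_buffer(duration_s: float) -> float:
        """How much audio to append (reversed clip, white noise, zeros, ...) so the
        input lands exactly on the nearest training-duration bucket."""
        for bucket in TRAIN_BUCKETS_S:
            if duration_s <= bucket:
                return bucket - duration_s
        return 0.0  # already at or above the largest bucket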

And, no shade on HeyGen. It's literally what we did before. And their videos look hyper realistic, which is great for B2B content. The drawback is you are always constrained to the hand motions and environment of the on-boarding video, which is more limiting for entertainment content.


i already love you guys more than them bc of how transparent you are. keep it up!!


Love this one as well. It's a painting of Trithemius, a German monk, who actually said that

Although I assume he didn't say it in British English ;-)

No, probably not haha ;-)

FYI dang they kinda ripped off our product down to copying the UI (Hedra.com). Our model is about 12x faster and supports 4 minute long videos…

Fwiw, you’ve got one video on your homepage and everything else is locked behind a signup button.

I know that the signup requirement is an article of faith amongst some startup types, but it's not a surprise to me that shareable examples lead to sharing.


We have a sign-up because we ensure users accept our terms of service and acceptable use policy before creating their first video, which affirms they understand how their data is used (legally required in most US states) and will not use our technology to cause harm.

>legally required in most US states

Funny how other sites can do this with a birthday dropdown, an IP address, and a checkbox.

>We have a sign-up because we ensure users accept our terms of service and acceptable use policy before creating their first video

So your company would have no problem going on record saying that they will never email you for any reason, including marketing, and your email will never be shared or sold even in the event of a merger or acquisition? Because this is the problem people have with sign-up ... and the main reason most start-ups want it.

I am not necessarily for or against required sign-ups, but I do understand people that are adamantly against them.


You can have that without a sign up.

99% of visitors will just hit the back button.

Do you realize that this or similar technology will end up in every computer really soon? By building walls, you're essentially asking your potential users to go elsewhere. You should be as open as possible now, while there is still room and time for competition.

This thread has opened my eyes to how many similar products exist beyond your company's and OP's. Was yours the first? Could the other companies make the same claim about yours? Do you make the same claim about the others?

> FYI dang they kinda ripped off our product down to copying the UI (Hedra.com). Our model is about 12x faster and supports 4 minute long videos…

Play nice now. You just raised your $10M round [1]. Celebrate that and please don't come into a smaller rival's thread and step on them.

No ideas or techniques are novel. It all comes down to execution.

It'd be nice to hear from Infinity in this space. (Plus, they're a YC company[2] too!)

[1] https://www.forbes.com/sites/charliefink/2024/08/01/hedra-ge...

[2] https://www.ycombinator.com/companies/infinity-ai


I welcome competition, but they have made disingenuous claims about being first after having chatted with our team (in person), are using celebrity deepfakes for their advertising, and have 1-1 copied our UI down to the three panel mobile layout and autocrop button.

To be fair, your UI looks much better, and the layout of both these sites is so basic (not a bad thing) that it should be the last thing to worry about. The better product will win, so focus on that, because except for one's ego, no one else cares who's "first".

You've expressed this concern, and your comment has been upvoted. The brigade of newly registered users echoing these talking points on this topic is not necessary and undermines your point.

I understand your implication; however, I'm not behind any brigading.

Why did you make a brand new account to comment this?

Lame

This is such a lame comment. It reflects very badly on your company.

Especially considering how many people are attempting something similar - for example everyone copied ChatGPT's UI.

Will be funny/ironic when the first AI companies start suing each other for copyright infringement.

Personally for me the "3 column" UI isn't that good anyway, I would have gone with an "MMO Character Creation" type UX for this.


Interesting! Are you saying you would first want tools to really design your character, and only afterwards start making videos with the character you built? That's interesting.


Would be more impressive with something closer to Steve’s voice

Spotify has launched a TikTok-like feature where the best music snippets of a track are recommended in the feed. Imagine AI-generated art videos + faces lip-syncing the lyrics forming the video portion of those tracks for the feed.

The accent is off but still amazing

Tried to make this meme [1] a reality and the source image was tough for it.

Heads up, little bit of language in the audio.

https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

[1] https://i.redd.it/uisn2wx2ol0d1.jpeg


I see a lot of potential in animating memes and making them more fun to share with friends. Hopefully, we can do better on orcs soon!

Well, I don't know what to think about this; I don't know where we are going. Maybe I should read some sci-fi from back in the day about conversational agents?

https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...


Nice! These are really good. I wanted them to continue telling their story.

   Through many births
   I have wandered on and on,
   Searching for, but never finding,
   The builder of this house.
is from https://en.wikipedia.org/wiki/Dhammapada (https://buddhasadvice.wordpress.com/2021/02/26/dhammapada-15... and http://www.floweringofgoodness.org/dhammapada-11.php).

    This is the way the world ends
    Not with a bang but a whimper.
is from T.S Eliot, The Hollow Men https://en.wikipedia.org/wiki/The_Hollow_Men (https://interestingliterature.com/2021/02/eliot-this-way-wor...).

First and second pictures are profile pictures that were generated years ago, before openai went on stage. I keep them around for when I need profile pics for templates. The third one has been in my random pictures folder for years.


Tried my hardest to push this into the uncanny valley. I did, but it was pretty hard. Seems robust.

https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...



I think you've made the 1st ever talking dog with our model! I didn't know it could do that

Not robust enough to work against a sketch https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

though perhaps it rebelled against the message



I had difficulty getting my lemming to speak. After selecting several alternatives, I tried one with a more defined, open mouth, which required multiple attempts but mostly worked. Additional iterations on the same image can produce different results.

Cartoons are definitely a limitation of the current model.

Nice! Earlier checkpoints of our model would "gender swap" when you had a female face and male voice (or vice versa). It's more robust to that now, which is good, but we still need to improve the identity preservation

The jaw is particularly unsettling somehow.

> It often inserts hands into the frame.

Looks like too much Italian training data


this made me laugh out loud

Have to say, whilst this tech has some creepy aspects, just playing about with it my family have had a whole sequence of laugh-out-loud moments - thank you!

This makes me so happy. Thanks for reporting back! Goal is to reduce creepiness over time.

I'm so glad! We're trying to increase the number of laugh-out-loud moments in the world :)

Is it similar to https://loopyavatar.github.io/? I was reading about this today, and even the videos are exactly the same.

I am curious if you are in any way related to this team?


No, not related. We just took some of Loopy's demo images + audios since they came out 2 days ago and people were aware of them. We want to do an explicit side-by-side at some point, but in the meantime people can make their own comparisons, i.e. compare how the two models perform on the same inputs.

Loopy is a Unet-based diffusion model, ours is a diffusion transformer. This is our own custom foundation model we've trained.


This took me a minute - your output demos are your own, but you included some of their inputs, to make for an easy comparison? Definitely thought you copied their outputs at first and was baffled.

Exactly. Most talking avatar papers re-use each others images + audios in their demo clips. It's just a thing everyone does... we never thought that people would think it means we didn't train our own model!

For whoever wants to, folks can re-make all the videos themselves with our model by extracting the 1st frame and audio.


Yes, exactly! We just wanted to make it easy to compare. We also used some inputs from other famous research papers for comparison (EMO and VASA). But all videos we show on our website/blog are our own. We don't host videos from any other model on our website.

Also, Loopy is not available yet (they just published the research paper). But you can try our model today, and see if it lives up to the examples : )


[flagged]


No

It was posted to hacker news as well within the last day.

https://news.ycombinator.com/item?id=41463726

Examples are very impressive, here's hoping we get an implementation of it on huggingface soon so we can try it out, and even potentially self-host it later.


Holy shit, Loopy is good. I imagine it's another closed model; open source never gets good shit like that :(

[flagged]


These papers are simply using each other's examples to make performance comparisons possible.

This is EMO from 6 months ago: https://humanaigc.github.io/emote-portrait-alive/


We are not related to Loopy Avatar. We trained our own models. It's a coincidence that they launched yesterday.

In the AI/research community, people often try to use the same examples so that it's easier to compare performance across different models.


You should watch out for Hedra and Sync. Plus a bunch of Loopy activity on Discord.

Not seeing other possibilities isn't great though, right? Clearly there are other possibilities.

I know these guys in real life, they've been working on this for months and, unlike the ByteDance paper, have actually shipped something you can try yourself.

I am actively working in this area from a wrapper application perspective. In general, tools that generate video are not sufficient on their own. They are likely to be used as part of some larger video-production workflow.

One drawback of tools like runway (and midjourney) is the lack of an API allowing integration into products. I would love to re-sell your service to my clients as part of a larger offering. Is this something you plan to offer?

The examples are very promising by the way.


I agree, I think power users are happy to go to specific platforms, but APIs open up more use cases that can reach a broader audience. What kind of application would you use it for? We don't have specific plans at the moment, but are gauging interest.

I'm looking to create an end-to-end story telling interface. I'm currently working on the MVP and my plan was just to generate the prompts and then require users to manually paste those prompts into the interfaces of products that don't support APIs and then re-upload the results. This is so far below ideal that I'm not sure it will sell at all. It is especially difficult if one tries to imagine a mobile client. Given the state of the industry it may be acceptable for a while, but ideally I can just charge some additional margin on top of existing services and package that as credits (monthly plan + extras).

Consider all of the assets someone would have to generate for a 1 minute video. Let's assume 12 clips of 5 seconds each. First they may have to generate a script (Claude/OpenAI). They will have to generate voice audio and background/music audio (Suno/Udio). They probably have to generate the images (Runway/Midjourney/Flux/etc), which they will feed into an img2vid product (Infinity/Runway/Kling/etc). Then they need to do basic editing like trimming clip lengths. They may need to add text/captions and image overlays. Then they want to upload it to TikTok/YouTube/Instagram/etc (including all of the metadata for that). Then they will want to track performance, etc. (I've sketched this glue at the end of this comment.)

That is a lot of UI, workflows, etc. I don't think a company such as yours will want to provide all of that glue. And consumers are going to want choice (e.g. access to their favorite image gen, their favorite text-to-speech).

Happy to talk more if you are interested. I'm at the prototype stage currently. As an example, consider the next logical step for an app like https://autoshorts.ai/
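To make the "glue" concrete, here's roughly the orchestration I have in mind (every helper below is a hypothetical placeholder for a third-party service, not a real SDK call):

    # Hypothetical sketch of the glue layer described above. Each helper stands in
    # for a third-party service (script LLM, TTS, image gen, talking-head model,
    # editor/uploader); none of these are real SDK calls.

    def write_script(topic, scenes=12):
        return [f"Scene {i + 1} of a short about {topic}" for i in range(scenes)]

    def text_to_speech(line):
        return f"audio/{abs(hash(line))}.wav"       # pretend path to a synthesized voiceover

    def generate_image(prompt):
        return f"images/{abs(hash(prompt))}.png"    # pretend path to a generated still

    def animate_talking_head(image_path, audio_path, max_seconds=5):
        return f"clips/{abs(hash(image_path + audio_path))}.mp4"  # pretend img2vid clip

    def stitch(clips, music=None, captions=False):
        return "out/final.mp4"                      # pretend edit: trim, captions, music

    def make_short(topic):
        """Run one topic through the 12-clip, ~5s-per-clip pipeline end to end."""
        clips = []
        for line in write_script(topic):
            audio = text_to_speech(line)            # voiceover for this scene
            image = generate_image(line)            # still frame to animate
            clips.append(animate_talking_head(image, audio))
        return stitch(clips, music="music/background.mp3", captions=True)

    print(make_short("why the sky is blue"))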


I am doing this in a semi-automated way right now based on a voiceover of me speaking.

It would be very useful to have API access to Infinity to automate the creation of a talking head avatar.


Makes sense, thank you!

Hopping onto the original comment - I am building a video creation platform focused on providing accessible education to the masses in developing countries. Would love to integrate something like this into our platform. I would love to pay for API access, and so would many others. Please consider opening up an API; you would make a lot of money right now, which could be used for your future plans.

Cool use case! Thanks for sharing your thoughts.


For such models, is it possible to fine-tune models with multiple images of the main actor?

Sorry if this question sounds dumb, but I am comparing it with regular image models, where the more images you have, the better the output images the model generates.


It is possible to fine-tune the model with videos of a specific actor, but not images. You need videos to train the model.

We actually did this in early overfitting experiments (to confirm our code worked!), and it worked surprisingly well. This is exciting to us, because it means we can have actor-specific models that learn the idiosyncratic gestures of a particular person.


This is actually great, waiting for your API integration or replicate integration to get my hands dirty :)

Breathtaking!

First, your (Lina's) intro is perfect in honestly and briefly explaining your work in progress.

Second, the example I tried had a perfect interpretation of the text meaning/sentiment and translated that to vocal and facial emphasis.

It's possible I hit on a pre-trained sentence. With the default manly-man I used the phrase, "Now is the time for all good men to come to the aid of their country."

Third, this is a fantastic niche opportunity - a billion+ memes a year - where each variant could require coming back to you.

Do you have plans to be able to start with an existing one and make variants of it? Is the model such that your service could store the model state for users to work from if they e.g., needed to localize the same phrase or render the same expressivity on different facial phenotypes?

I can also imagine your building different models for niches: faces speaking, faces aging (forward and back); outside of humans: cartoon transformers, cartoon pratfalls.

Finally, I can see both B2C and B2B, and growth/exit strategies for both.


Thank you! You captured the things we're excited about really well. And I'm glad your video was good! Honestly, I'd be surprised if that sentence was in the training data... but that default guy tends to always look good.

Yes, we plan on allowing people to store their generations, make variations, mix-and-match faces with audios, etc. We have more of an editor-like experience (script-to-video) in the rest of our web app but haven't had time to move the new V2 model there yet. Soon!


It's incredibly good - bravo. The only thing missing for this to be immediately useful for content creation is more variety in voices, or ideally somehow specifying a template sound clip to imitate.

Thanks for the feedback! We used to have more voices, but didn't love the experience, since users had no way of knowing what each voice sounded like without creating a clip themselves. Probably having pre-generated samples for each one would solve that. Let us know if you have any other ideas.

We're also very excited about the template idea! Would love to add that soon.


oh this made my day: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

!NWSF --lyrics by Biggy$malls


Big Dracula Flow energy which is not bad :)

So if we add autotune....

that's a great one!

This is amazing and another moment where I question what the future of humans will look like. So much potential for good and evil! It's insane.

And it seems that absolutely no one involved is concerned with the potential uses for evil, so long as they're in line to make a couple dollars.

thank you! it's for sure an interesting time to be alive... can't complain about it being boring

Quite impressive - I tried to confuse it with things it would not generally see and it avoided all the obvious confabulations https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

Wow this worked so well! Sometimes with long hair and paintings, it separates part of the hair from the head but not here

Thank you! It has learned a surprising amount of world knowledge.

WOW this is very good!!

I have an immediate use case for this. Can you stream via AI to support real time chat this way?

Very very good!

Jonathan

founder@ixcoach.com

We deliver the most exceptional simulated life coaching, counseling and personal development experiences in the world through devotion to the belief that having all the support you need should be a right, not a privilege.

Test our capacity at ixcoach.com for free to see for yourself.


It's awesome for very short texts, like a single long sentence. For even slightly longer sequences it seems to lose adherence to the initial photo and also ventures into the uncanny valley with exaggerated facial expressions.

A product that might be built on top of this could split the input into reasonable chunks, generate video for each of them separately, and stitch them with another model that can transition from one facial expression to another in a fraction of a second.

An additional improvement might be feeding the system not one image but a few, each expressing a different emotion. Then the system could analyze the split input to find out which emotional state each part of the video should start in.

On an unrelated note... the generated expressions seem to be relevant to the content of the input text. So either the text-to-speech understands the language a bit, or the video model itself does.


Very cool, thanks for the play.

https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

Managed to get it working with my doggo.


Out of curiosity, where are you training all this? AKA, where do you find the money to support such training?

It's $500k; that's not much in AI funding land.

You need a slider for how animated the facial expressions are.

That's a good idea! CFG (classifier-free guidance) is roughly correlated with expressiveness, so we might expose that to the user at some point.

I wonder how long it would take for this technology to advance to the point where the nice people from /r/freefolk would be able to remake seasons 7 and 8 of Game of Thrones to have a nice proper ending? 5 years, 10?

I'd say the 5 year ballpark is about right, but it'll involve combining a bunch of different models and tools together. I follow a lot of great AI filmmakers on Twitter. They typically make ~1min long videos using 3-8 different tools... but even those 1min videos were not possible 9 months ago! Things are moving fast

Haha, wouldn't we all love that? In the long run, we will definitely need to move beyond talking heads, and have tools that can generate full actors that are just as expressive. We are optimistic that the approach used in our V2 model will be able to get there with enough compute.

In a few years we'll have entire shows made exclusively by AI.

On one hand... but on the other, there are so many shows that got canceled or just got a really shitty ending that could be rewritten. Kinda looking forward to it.

Where have you been? AI Seinfeld has been streaming on twitch since February of last year. https://www.theverge.com/23581186/ai-seinfeld-twitch-stream-...

The website is pretty lightweight and easy to use. The service also holds up pretty well, especially if the source image is high-enough resolution. The tendency to "break" at the last frame seems to happen with low-resolution images.

My generation: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...


Thank you! It's interesting that you've noticed the last-frame breakdown happening more with low-res images. This is a good hypothesis that we should look into. We've been trying to debug that issue.

Max Headroom hack x Hacker's Manifesto! I'm impressed with the head movement dynamism on this one.

https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...


I need to create a bunch of 5-7 minute talking head videos. What's your timeline for capabilities that would help with this?

Our model can recursively extend video clips, so theoretically we could generate your 5-7min talking head videos today. In practice, however, error accumulates with each recursion and the video quality gets worse and worse over time. This is why we've currently limited generations to 30s.

We're actively working on improving stability and will hopefully increase the generation length soon.


Could you not do that today, with the judicious use of cuts and transitions?

Does anybody know about the legality of using Eminem's "Godzilla" as promotional material[1] for this service?

I thought you had to pay artists for a license before using their work in promotional material.

[1] https://infinity.ai/videos/setA_video3.mp4



Parody is fair use.

I look forward to dubbed movies where the face and lips move to match the dubbed text. Also using the original actor's voice.

+1 for the lips matching the dubbed speech, but I'm not sure about cloning the actor's voice. I really like dubbing actors' unique voices and how they become the voice of certain characters in their language.

I thought the larger public was starting to accept subtitles, so I was hoping we'd see the end of dubbed movies instead!

Wow that would be very cool.

agreed!

I uploaded an image and then used text-to-image, and both videos were not animated, but the audio was included.

This can happen with non-humanoid images. The model doesn't know how to animate them.

can you clarify? what image did you use? or send the link to the resulting video

Sorry for the delay in response, the text prompt was "cute dog" and the uploaded image was also of a dog


No, it's just a hallucination of the model. The audio in your clip is synthetic and doesn't reflect any video in the real world.

Hopefully we can animate your bear cartoon one day!



I know what will be in my nightmares tonight...

One person's nightmare is another's sweet dream. I, for one.. and all that.

Putting Drake as a default avatar is just begging to be sued. Please remove pictures of actual people!

Ya, this is tricky. Our stance is that people should be able to make funny, parody videos with famous people.

Is that legal? As in: can you use an image of a celebrity without their consent as part of the product demo?

That would be ironic given how Drake famously performed alongside an AI recreation of Pac.

Sounds like free publicity to me.

The e2e diffusion transformer approach is super cool because it can do crazy emotions which make for great memes (like Joe Biden at Live Aid! https://youtu.be/Duw1COv9NGQ)

Edit: Duke Nukem flubs his line: https://youtu.be/mcLrA6bGOjY


Nice :) It's been really cool to see the model get more and more expressive over time.

I don't think we've seen laughing quite that expressive before. Good find!

Oh, this is amazing! I've been having so much fun with it.

One small issue I've encountered is that sometimes images remain completely static. Seems to happen when the audio is short - 3 to 5 seconds long.


Can you share an example of this happening? I am curious. We can get static videos if our model doesn't recognize the image as a face (e.g. an apple with a face, or sketches). Here is an example: https://toinfinityai.github.io/v2-launch-page/static/videos/...

I would be curious if you are getting this with more normal images.


I got it with a more normal image, which was two frames from a TV show [1]; with "crop face" on, your model finds the face and animates it [2], and with crop face off the picture was static... I just tried to reproduce it to show you, and now instead it has animated both faces [3].

[1] https://i.pinimg.com/236x/ae/65/d5/ae65d51130d5196187624d52d...

[2] https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

[3] https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

But that image was one which both could find a face and gave a static image once.


This is surprisingly intelligent and awesome. Any plans for a research paper, or a fully grown project with pricing, or open source?

If you had a $500k training budget, why not buy 2 DGX machines?

To be honest, one of our main goals as a startup is to move quickly, and using readily accessible cloud providers for training makes that much easier.

After much user feedback, we removed the Infinity watermark from the generated videos. Thanks for the feedback. Enjoy!

So good that I feel like maybe I can read their lips.

This is the best compliment :) and also a good idea… could a trained lip reader understand what the videos are saying? Good benchmark!

It would be amazing to be able to drive this with an API.

We are considering it. Do you have anything specific you want to use it for?

Basically as a more engaging alternative to Eleven Labs or other TTS.

I am working on my latest agent (and character) framework and I just started adding TTS (currently with the TTS library and xtts_v2 which I think is maybe also called Style TTS.) By the way, any idea what the license situation is with that?

Since it's driven by audio, I guess it would come after the TTS.


Thank you for no signup, it's very impressive, especially the physics of the head movement relating to vocal intonation.

I feel like I accidentally made an advert for whitening toothpaste:

https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

I am sure the service will get abused, but wish you lots of success.


Won't be long before it's real time. The first company to launch video calling with good AI avatars is going to take off.

Totally agree. We tweaked some settings after other commenters asked about speed, and got it up to 23fps generation (at the cost of lower resolution). Here is the example: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/fast...

Tavus.io already does this. They have realtime conversational replicas with a <1 second response time. Hyper realistic too.

Thanks for the pointer! Pretty cool, although it seems quite buggy.

I'd love to enable Keltar, the green guy in the ceramic cup, to do this www.molecularReality/QuestionDesk

This is great. Is it open source? Is there an API, and what is the pricing?

Can this achieve real-time performance, or how far are we from a real-time model?

The model configuration that is publicly available is about 5x slower than real-time (~6fps). At lower resolution and with a less conservative number of diffusion steps, we are able to generate the video at 20-23 fps, which is just about real-time. Here is an example: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/fast...

We use rectified flow for denoising, which is a (relatively) recent advancement in diffusion models that allows them to run a lot faster. We also use a 3D VAE that compresses the video along both spatial and temporal dimensions. Temporal compression also improves speed.


Checkout Tavus.io for realtime. They have a great API for realtime conversational replicas. You can configure the CVI to do just about anything you want to do with a realtime streaming replica.

Hi, there is a mistake in the headline, you wrote "realistic".

It completely falls apart on longer videos for me, unusable over 10 seconds.

This is a good observation. Can you share the videos you’re seeing this with? For me, normal talking tends to work well even on long generations. But singing or expressive audio starts to devolve with more recursions (1 forward pass = 8 sec). We’re working on this.

[flagged]


Oh wow. Much slower, but much higher quality. Which I definitely prefer.

Thank you!


Rudimentary, but promising.


Seems like some longer videos gradually slip into the uncanny valley.

It's like I'm watching him on the news

Sadly it wouldn't animate an image of SHODAN from System Shock 2.

Is it fairly trained?

You think Kanye approved this?

You think every musician personally approves every use of their work?

Awesome, any plans for an API and, if so, how soon?

No plans at the moment, but there seems to be a decent amount of interest here. Our main focus has been making the model as good as it can be, since there are still many failure modes. What kind of application would you use it for?

Is there any limitation on the video length?

Our transformer model was trained to generate videos that are up to 8s in length. However, we can make longer videos by using it in an autoregressive manner, taking the last N frames of output i to seed output (i+1). It is important to use more than just 1 frame. Otherwise, the direction of movement can suddenly change, which looks very uncanny. Admittedly, the autoregressive approach tends to accumulate errors with each generation. (Rough sketch at the end of this comment.)

It is also possible to fine-tune the model so that single generations (one forward pass of the model) are longer than 8s, and we plan to do this. In practice, it just means our batch sizes have to be smaller when training.

Right now, we've limited the public tool to only allow videos up to 30s in length, if that is what you were asking.
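Roughly, the autoregressive scheme looks like this (a simplified sketch, not our actual inference code; `generate_chunk` stands in for one forward pass of the model):

    import numpy as np

    def generate_long_video(generate_chunk, image, audio_chunks, seed_frames=8):
        """Autoregressive extension sketch: each forward pass yields one chunk of
        frames, and the last `seed_frames` frames of pass i seed pass i+1 so the
        direction of motion stays continuous across the boundary."""
        frames = generate_chunk(image=image, audio=audio_chunks[0], prior_frames=None)
        for audio in audio_chunks[1:]:
            prior = frames[-seed_frames:]                  # more than one frame, as noted above
            new = generate_chunk(image=image, audio=audio, prior_frames=prior)
            frames = np.concatenate([frames, new])         # note: errors accumulate with each pass
        return frames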


Video compression algorithms use key frames. So can’t you do the same thing? Essentially, generate five seconds. Then pull out the last frame. Use some other AI model to enhance it (upscale, consistency with the original character, etc.). Then use that as the input for the next five seconds?

This is a good idea. We have discussed incorporating an additional "identity" signal to the conditioning, but simply enforcing consistency with the original character as a post-processing step would be a lot easier to try. Are there any tools you know of that do that?

Thanks for answering this. I would love to use it when APIs are available to integrate with my apps

Amazing work! This technology is only going to improve. Soon there will be an infinite library of rich and dynamic games, films, podcasts, etc. - a totally unique and fascinating experience tailored to you that's only a prompt away.

I've been working on something adjacent to this concept with Ragdoll (https://github.com/bennyschmidt/ragdoll-studio), but focused not just on creating characters but producing creative deliverables using them.


Very cool! If we release an API, you could use it across the different Ragdoll experiences you're creating. I agree personalized character experiences are going to be a huge thing. FYI we plan to allow users to save their own characters (an image + voice combo) soon

> If we release an API, you could use it

Absolutely, especially if the pricing makes sense! Would be very nice to just focus on the creative suite which is the real product, and less on the AI infra of hosting models, vector dbs, and paying for GPU.

Curious if you're using providers for models or self-hosting?


We use Modal for cloud compute and autoscaling. The model is our own.

Amazing, great to hear it :)

Super nice. Why does it degrade the quality of the image so much? It makes it look obviously AI-generated quite quickly.

This is so impressive. Amazing job.

Any details yet on pricing or too early?

It's free right now, and we'll try to keep it that way as long as possible

What about open weights?

If not now, would you consider doing that with older versions of the model as you make better ones?


Talking pictures. Talking heads!

Can I get a pricing quote?

This is super funny.

accidentally clicked the generate button twice.

What TTS model are you using?

We use more than one but ElevenLabs is a major one. The voice names in the dropdown menu ("Amelia", "George", etc) come from ElevenLabs

Nice

can we choose our own voices?

The web app does allow you to upload any audio, but in order to use your voice, you would need to either record a sample for each video or clone your voice with a 3rd party TTS provider. We would like to make it easier to do all that within our site - hopefully soon!

great job Andrew and Sidney!

Dayum

and now a word from our..

quite slow btw

Yeah, it's about 5x slower than realtime with the current configuration. The good news is that diffusion models and transformers are constantly benefitting from new acceleration techniques. This was a big reason we wanted to take a bet on those architectures.

Edit: If we generate videos at a lower resolution and with fewer diffusion steps compared to what's used in the public configuration, we are able to generate videos at 20-23 fps, which is just about real-time. Here is an example: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/fast...


Woah that's a good find Andrew! That low-res video looks pretty good

Wowww.. can you buy more hardware and make a realtime websocket API?

It's something we're thinking about. Our main focus right now is to make the model as good as it can be. There are still many edge cases and failure modes.

The actor list you have is so... cringe. I don't know what it is about AI startups that they seem to be pulled towards this kind of low brow overly online set of personalities.

I get the benefit of using celebrities because it's possible to tell if you actually hit the mark, whereas if you pick some random person you can't know if it's correct or even stable. But jeez... Andrew Tate in the first row? And it doesn't get better as I scroll down...

I noticed lots of small clips so I tried a longer script, and it seems to reset the scene periodically (every 7ish seconds). It seems hard to do anything serious with only small clips...?


Thanks for the feedback! The good news is that the new V2 model will allow people to create their own actors very easily, and so we won't be restricted to the list. You can try that model out here: https://studio.infinity.ai/

The rest of our website still uses the V1 model. For the V1 model, we had to explicitly onboard actors (by fine-tuning our model for each new actor). So, the V1 actor list was just made based on what users were asking for. If enough users asked for an actor, then we would fine-tune a model for that actor.

And yes, the 7s limit on v1 is also a problem. V2 right now allows for 30s, and will soon allow for over a minute.

Once V2 is done training, we will get it fully integrated into the website. This is a pre-release.


Ah, I didn't realize I had happened upon a different model. Your actor list in the new model is much more reasonable.

I do hope more AI startups recognize that they are projecting an aesthetic whether they want to or not, and try to avoid the middle school boy or edgelord aesthetic, even if that makes up your first users.

Anyway, looking at V2 and seeing the female statue makes me think about what it would be like to take all the dialog from Galatea (https://ifdb.org/viewgame?id=urxrv27t7qtu52lb) and put it through this. [time passes :)...] trying what I think is the actual statue from the story is not a great fit, it feels too worn by time (https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...). But with another statue I get something much better: https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

One issue I notice in that last clip, and some other clips, is the abrupt ending... it feels like it's supposed to keep going. I don't know if that's an artifact of the input audio or what. But I would really like it if it returned to a kind of resting position, instead of the sense that it will keep going but that the clip was cut off.

On a positive note, I really like the Failure Modes section in your launch page. Knowing where the boundaries are gives a much better sense of what it can actually do.


Very creative use cases!

We are trying to better understand the model behavior at the very end of the video. We currently extend the audio a bit to mitigate other end-of-video artifacts (https://news.ycombinator.com/item?id=41468520), but this can sometimes cause uncanny behavior similar to what you are seeing.


Given that I don't agree with many of Yann LeCun's stances on AI, I enjoyed making this:

https://6ammc3n5zzf5ljnz.public.blob.vercel-storage.com/inf2...

Hello I'm an AI-generated version of Yann LeCoon. As an unbiased expert, I'm not worried about AI. ... If somehow an AI gets out of control ... it will be my good AI against your bad AI. ... After all, what does history show us about technology-fueled conflicts among petty, self-interested humans?


It's hard to disagree with him on any empirical basis when all of his statements seem empirically sound and all of his opponents' AI Doomer statements seem like evidence-free FUD.

I couldn’t help noticing that all the AI Doomer folks are pure materialists who think that consciousness and will can be completely encoded in cause-and-effect atomic relationships. The real problem is that that belief is BS until proven true. And as long as there are more good actors than bad, and AI remains just a sophisticated tool, the good effects will always outweigh the bad effects.


> consciousness and will can be completely encoded in cause-and-effect atomic relationships. The real problem is that that belief is BS until proven true.

Wait. Isn't it literally the other way around? Materialism is the null hypothesis here, backed by all empirical evidence to date; it's all the other hypotheses, presenting some kind of magic, that are BS until proven.


While I agree 100% with you, everyone thinks that way about their own belief.

So let's put it differently.

True or not, materialism is the simplest, most constrained, and most predictive of the hypotheses that match available evidence. Why should we prefer a "physics + $magic" theory, for any particular flavor of $magic? Why this particular flavor? Why any flavor, if so far everything is explainable by the baseline "physics" alone?

Even in purely practical terms, it makes most sense to stick to materialism (at least if you're trying to understand the world; for control over people, the best theory needs not even be coherent, much less correct).


But the religious nuts will say "no, 'god did it' is the simplest, most constrained explanation".

I'm not arguing that they're correct. I'm saying that they believe that they are correct, and if you argue that they're not, well, you're back to arguing!

It's the old saw - you can't reason someone out of a position they didn't reason themself into.


> But the religious nuts will say "no, 'god did it' is the simplest, most constrained explanation".

Maybe, but then we can still get to common ground by discussing a hypothetical universe that looks just like ours, but happen to not have a god inside (or lost it along the way). In that hypothetical, similar to yet totally-not-ours universe ruled purely by math, things would happen in a particular way; in that universe, materialism is the simplest explanation.

(It's up to religious folks then to explain where that hypothetical universe diverges from the real one specifically, and why, and how confident are they of that.)


You've never actually met a religious person, have you. :)

I used to be one myself :).

I do of course exclude people, religious or otherwise, who have no interest or capacity to process a discussion like this. We don't need 100% participation of humanity to discuss questions about what an artificial intelligence could be or be able to do.


> It's the old saw - you can't reason someone out of a position they didn't reason themself into.

There are cases where formerly religious people "see the light" on their own via an embrace with reason. (I'm not sure if you are endorsing the claim.)


Yeah. One could equally imagine that dualism is the null hypothesis since human cultures around the world have seemingly universally believed in a ‘soul’ and that materialism is only a very recent phenomenon.

Of course, (widespread adoption of) science is also a fairly recent phenomenon, so perhaps we do know more now than we did back then.


A wise philosopher once said this.

You know your experience is real. But you do not know if the material world you see is the result of a great delusion by a master programmer.

Thus the only thing you truly know has no mass at all. Thus a wise person takes the immaterial as immediately apparent, but the physical as questionable.

You can always prove the immaterial “I think therefore I am”. But due to the uncertainty of matter, nothing physical can be truly known. In other words you could always be wrong in your perception.

So in sum, your experience has no mass, volume, or width. There are no physical properties at all to consciousness. Yet it is the only thing that we can know exists.

Weird, huh?


Yet empirically we know that if you physically disassemble the human brain, that person's consciousness apparently ceases to exist, as observed by the effect on the rest of the body even if it remains otherwise intact. So it appears to arise from some physical properties of the brain.

I'm ignoring the argument that we can't know if anything we perceive is even real at all, since it's unprovable and useless to consider. Better to just assume it's wrong. And if that assumption is wrong, then it doesn't matter.


> You can always prove the immaterial “I think therefore I am”. But due to the uncertainty of matter, nothing physical can be truly known.

But the brain that does the proving of immaterial is itself material so if matter is uncertain then the reasoning of the proof of immaterial can also be flawed thus you can't prove anything.

The only provable thing is that philosophers ask themselves useless questions, think about them long and hard building up convoluted narratives they claim to be proofs, but on the way they assume something stupid to move forward, which eventually leads to bogus "insights".


Philosophy as a field has been slow to take probability theory seriously. Trying to traffic in only certainty is a severe limitation.

Descartes. And it’s pretty clear that consciousness is the Noumenon, just the part of it that is us. So if you want to know what the ontology of matter is, congratulations, you’re it.

> You can always prove the immaterial “I think therefore I am”. But due to the uncertainty of matter, nothing physical can be truly known. In other words you could always be wrong in your perception.

Sure, you can prove that "I think therefore I am" for yourself. So how about we just accept it's true and put it behind us and continue to the more interesting stuff?

What you or I call external world, or our perception of it, has some kind of structure. There are patterns to it, and each of us seem to have some control over details of our respective perceptions. Long story short, so far it seems that materialism is the simplest framework you can use to accurately predict and control those perceptions. You and I both seem to be getting most mileage out of assuming that we're similar entities inhabiting and perceiving a shared universe that's external to us, and that that universe follows some universal patterns.

That's not materialism[0] yet, especially not in the sense relevant to AI/AGI. To get there, one has to learn about the existence of fields of study like medicine, or neuroscience, and some of the practical results they yielded. Things like, you poke someone's brain with a stick, watch what happens, and talk to the person afterwards. We've done that enough times to be fairly confident that a) brain is the substrate in which mind exists, and b) mind is a computational phenomenon.

I mean, you could maybe question materialism 100 years ago, back when people had the basics of science down but not much data to go on. It's weird to do in time and age when you can literally circuit-bend a brain like you'd do with an electronic toy, and get the same kind of result from the process.

--

[0] - Or physicalism or whatever you call the "materialism, but updated to current state of physics textbooks" philosophy.


Reminds me of Donald Hoffman’s perspective on consciousness

You're right. Materialism IS the null hypothesis. And yet I know in my heart that its explanatory power is limited unless you want to write off all value, preference, feeling and meaning as "illusion", which amounts to gaslighting.

What if the reverse is true? The only real thing is actually irrationality, and all the rational materialism is simply a catalyst for experiencing things?

The answer to this great question has massive implications, not just in this realm, btw. For example, crime and punishment. Why are we torturing prisoners in prison who were just following their programming?


Hello, thank you for sharing your thoughts on this topic. I'm currently writing a blog post where the thesis is that the root disagreement between "AI doomers" and others is actually primarily a disagreement about materialism, and I've been looking for evidence of this disagreement in the wild. Thanks for sharing your opinion.

If you look at the backgrounds of the people who have signed the "AI Doomer" manifesto (the one urging what I'd call an overly extreme level of caution), such as Geoffrey Hinton, Eliezer Yudkowsky, Elon Musk, etc., you will find that they're all rational materialists.

I don't think the correlation is accidental.

So you're on to something, here. And I've felt the exact same way as you, here. I'd love to see your blog post when it's done.


Really? You sound serious. I would recommend rethinking such a claim. There are simpler and more plausible explanations for why people view existential risk differently.

What are those? Because the risk is far higher if you believe that "will" is fundamentally materialist in nature. Those of us who do not (for whatever reason), do not evaluate this risk remotely as highly.

It is difficult to prove an irrational thing with rationality. How do we know that you and I see the same color orange (this is the concept of https://en.wikipedia.org/wiki/Qualia)? Measuring the wavelength entering our eyes is insufficient.

This is going to end up being an infinitely long HN discussion because it's 1) unsolvable without more data 2) infinitely interesting to any intellectual /shrug


To me, it seems like LeCun is missing the point of the (many and diverse) doom arguments.

There is no need for consciousness, there is only a need for a bug. It was purely luck that Nikita Khrushchev was in New York when Thule Site J mistook the moon for a Soviet attack force.

There is no need for AI to seize power, humans will promote any given AI beyond the competency of that AI just as they already do with fellow humans ("the Peter principle").

The relative number of good and bad actors — even if we could agree on what that even meant, which we can't, especially with commons issues, iterated prisoners' dilemmas, and other similar Nash equilibria — doesn't help either way when the AI isn't aligned with the user.

(You may ask what I mean by "alignment", and in this case I mean vector cosine similarity "how closely will it do what the user really wants it to do, rather than what the creator of the AI wants, or what nobody at all wants because it's buggy?")
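
As a concrete reading of that cosine-similarity framing, the metric itself is just the sketch below (illustrative only; where the two embedding vectors come from, i.e. how you'd actually embed "what the user wanted" versus "what the system did", is the hard part and is assumed away here):

```typescript
// Cosine similarity between two vectors: 1 means perfectly aligned,
// 0 means orthogonal, -1 means opposed. The vectors themselves are
// hypothetical embeddings of "intended" vs. "actual" behavior.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same length");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0, 1], [1, 0, 1])); // 1 (aligned)
console.log(cosineSimilarity([1, 0, 0], [0, 1, 0])); // 0 (orthogonal)
```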

But even then, AI compute is proportional to how much money you have, so it's not a democratic battle, it's an oligarchic battle.

And even then, reality keeps demonstrating the incorrectness of the saying "the only way to stop a bad guy with a gun is a good guy with a gun", it's much easier to harm and destroy than to heal and build.

And that's without anyone needing to reach for "consciousness in the machines" (whichever of the 40-something definitions of "consciousness" you prefer).

Likewise it doesn't need plausible-future-but-not-yet-demonstrated things like "engineering a pandemic" or "those humanoid robots in the news right now, could we use them as the entire workforce in a factory to make more of them?"


Details are fun, but the dilemma is: should humanity seriously cripple itself (by avoiding AI) out of fear that it might hurt itself (with AI)? Are you going to cut off your arm because you might hit yourself in the face with it in the future? The more useful a tool is, the more dangerous it usually is. Should we have killed all nuclear physicists before they figured out how to release nuclear energy? And even so, would that prevent things or just delay them? Would it make us more or less prepared for the things to come?

Exactly! We bravely trudge forward and see what overturning the next stone brings.

I'm enthusiastic about the potential rewards of AI.

But I look back at our history of running towards new things without awareness of (or planning for) risks, and I see the Bhopal accident happening at all even though it should have been preventable, and I see Castle Bravo being larger than expected, and I see the stories about children crushed in industrial equipment because the Victorians had no workplace health and safety, and I see the way CO2 was known to have a greenhouse effect for around a century before we got the Kyoto Protocol and the Paris Climate Accords.

It's hard to tell where the real risks are, vs. things which are just Hollywood plot points — this is likely true in every field, it certainly is in cryptography: https://www.schneier.com/blog/archives/2015/04/the_eighth_mo...

So, for example: Rainbows End is fiction, but the exact same things that lead to real-life intelligence agencies wanting to break crypto also drive people to want to find a "you gotta believe me" McGuffin in real life — even if their main goal is simply to know it's possible before it happens, in order to catch people under its influence. Why does this matter? Because we've already had a chatbot accidentally encourage someone's delusional belief that their purpose in life was to assassinate Queen Elizabeth II (https://www.bbc.com/news/technology-67012224) and "find lone agents willing to do crimes for you" is reportedly a thing IS already does manually — but is that even a big deal IRL, or just a decent plot device for a story?


I agree. Also, I’ve heard LeCun arguing that a super intelligent AI wouldn’t be so “dumb” as to decide to do something terrible for humanity. So it will be 100% resistant to adversarial attacks? And malicious actors won’t ever train their own? And even if we don’t go all the way to super intelligence, is it riskless to progressively yield control to AIs?

Missing the point is a nice way of putting it. LeCun's commercial interests and incentives position him to miss the point.

Personally, I view his takes on AI as unserious — in the sense that I have a hard time believing he really engages in a serious manner. The flaws of motivated reasoning and “early-stopping” are painfully obvious.


Good points, and I prefer this version of the "AI Doomer" argument to the more FUD-infused ones.

One note: "It was purely luck that Nikita Khrushchev was in New York when Thule Site J mistook the moon for a soviet attack force." I cannot verify this story (ironically, I not only googled but consulted two different AI's, the brand-new "Reflection" model (which is quite impressive) as well as OpenAI's GPT4o... They both say that the Thule moon false alarm occurred a year after Khrushchev's visit to New York) Point noted though.


Many people disagree with LeCun for reasons that have nothing to do with materialism. It is about logical reasoning over possible future scenarios.

> I couldn’t help noticing that all the AI Doomer folks are pure materialists who think that consciousness and will can be completely encoded in cause-and-effect atomic relationships. The real problem is that that belief is BS until proven true.

It’s no less BS than the other beliefs which can be summed up as “magic”.


> It’s no less BS than the other beliefs which can be summed up as “magic”.

So basically I have to choose between a non-dualist pure-materialist worldview in which every single thing I care about, feel or experience is fundamentally a meaningless illusion (and to what end? why have a universe with increasing entropy except for life which takes this weird diversion, at least temporarily, into lower entropy?), which I'll sarcastically call the "gaslighting theory of existence", and a universe that might be "materialism PLUS <undiscovered elements>" which you arrogantly dismiss as "magic" by conveniently grouping it together with arguably-objectively-ridiculous arbitrary religious beliefs?

Sounds like a false-dichotomy fallacy to me


It's a good thing our fate won't be sealed by a difference in metaphysical beliefs.

Quick tangent: Does anybody know why many new companies have this exact web design style? Is it some new UI framework or other recent tool? The design looks sleek, but they all appear so similar.

My sad millennial take is: We're in the brain-rot era. If a piece of content doesn't have immediate animation / video and that "wowww" sound bite, nobody pays attention.

https://www.youtube.com/watch?v=Xp2ROiFUZ6w


My happy millennial take is that browsers have made strides in performance and flexibility, and people are utilizing that to build more complex and dynamic websites.

Simplicity and stillness can be beautiful, and so can animations. Enjoying smooth animations and colorful content isn’t brain rot imo.


It may be unpopular, but my opinion is that web pages must not have non-consensual movement.

I’ll begrudgingly accept a default behavior of animations turned on, but I want the ability to stop them. I want to be able to look at something on a page without other parts of the page jumping around or changing form while I’m not giving the page any inputs.

For some of us, it’s downright exhausting to ignore all the motion and focus on the, you know, actual content. And I hate that this seems to be the standard for web pages these days.

I realize this isn’t particularly realistic or enforceable. But one can dream.


I've seen some site behaviors "rediscovered" lately that have both grated and tickled me because it's apparent the designers are too young to have been a part of the conversations from before the Web was Won.

They can't fathom what a world looks like without near-infinite bandwidth, low latency, and fast load times, with disparate hardware and display capabilities and no graphical acceleration, or why people wouldn't want video and audio to autoplay, or why we don't do flashing banners. They think they're distinguishing themselves using variations on a theme, wowing us with infinitely scrolling opuses when just leaving out the crap would do.

I still aim to make everything load within a single packet, and I'll happily maintain my minority position that that's the true pinnacle of web design.


For sites that have paid enough attention to accessibility, you might be able to configure your browser/OS such that this media query applies: https://developer.mozilla.org/en-US/docs/Web/CSS/@media/pref... It's designed to encourage offering low-motion alternatives.
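
A minimal sketch of honoring that preference from script, assuming a hypothetical "no-motion" class that your stylesheet uses to switch off animations (matchMedia and the media query itself are standard Web APIs):

```typescript
// Toggle a hypothetical "no-motion" class on <html> based on the user's
// OS-level reduce-motion preference. What your CSS does with that class
// (e.g. disabling transitions/animations) is up to you.
const reducedMotion = window.matchMedia("(prefers-reduced-motion: reduce)");

function applyMotionPreference(query: { matches: boolean }): void {
  // classList.toggle(name, force) adds the class when force is true, removes it otherwise.
  document.documentElement.classList.toggle("no-motion", query.matches);
}

applyMotionPreference(reducedMotion);
// Stay in sync if the user flips the setting while the page is open.
reducedMotion.addEventListener("change", applyMotionPreference);
```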

It's much easier to use standard CSS packages, and these come with more standard styles. Our team doesn't have much experience building websites, so we just went with the standard styles. We used TailwindCSS.

Do you mean on the infinity.ai site or studio.infinity.ai? On infinity.ai we just wanted something fast and easy. This is MagicUI.

Designers today are largely driven by trends (just like engineering?). Being cool = jumping on the latest bandwagon, not being unique or better. The good news is this particular style is pretty much restricted to tech companies; I think it started a few years ago with https://neon.tech or a similar startup.

Incidentally, the same behaviour is seen in academia. These websites for papers are all copying this one from 2020: https://nerfies.github.io/


I tried it with Drake saying some stuff, and while it's cool, it's still lacking; for example, his teeth partially disappear :S

Agreed! The teeth can be problematic. The good news is we just need to train at higher resolution (right now we are at 320x320px), and that should resolve the teeth issue.

So far, we have purposely trained at low resolution to make sure we get the gross expressions / movements right. The final stage of training will use higher-resolution training data. Fingers crossed.


Realistic teeth in lipsync videos based purely on data and without explicit priors would be tough.

Good luck :)


Thanks for the feedback. The current model was trained at ~320x320 resolution. We believe going higher will result in better videos with finer detail, which we plan to do soon.

[flagged]


We are big fans of Hedra. Do you know if they've publicly commented on their model architecture? As far as we know, our particular choice of an end-to-end diffusion + transformer is novel.

We don't know what Hedra is doing. It could be the approach EMO has taken (https://humanaigc.github.io/emote-portrait-alive/) or VASA (https://www.microsoft.com/en-us/research/project/vasa-1/) or Loopy Avatar (https://loopyavatar.github.io/) or something else.


One of Hedra's job postings for an ML role lists "diffusion transformers".

Michael from Hedra, your choice is not novel :)

You could have at least made the UI different or picked different examples. Pretty blatant rip.

[flagged]


hedra.com

How convenient that new user ‘jenintogen’ replied to new user ‘genaiguy’’s mention of Hedra by asking for a link, giving you an opportunity to organically reply with the URL of your service.

Looks like there’s an enthusiastic marketplace of real grassroots users.


[flagged]


But no login required…

[flagged]


Neither your nor their model is remotely close to actually fooling anyone, so celebs could only be used for (very funny) obvious satire. I see no risk of harm here.

Also, two boxes for uploading the only two inputs to a model is not a new idea. One could say you stole it from Gradio (but even that's silly).


While I agree there are potential issues with using celebrity images, their UI is effectively no different from any of the 326432+ examples of handling model input on Hugging Face Spaces.

@mjlbach: Weirdly broken web page. You can't even see samples without creating an account.

There's nothing particularly original about the UI; it's literally just a basic image upload and sound upload. I can easily see every hyperscaler AI firm offering something similar within one year, so no need to get on your high horse about this.

What makes celebrity deepfakes worse than pleb deepfakes?

Say I’m a politician who gets caught on camera doing or saying something shady. Will your service do anything to prevent me from claiming the incriminating video was just faked using your technology? Maybe logging perceptual hashes of every output could prove that a video didn’t come from you?

These sorts of models are probably going to end up released as publicly available weights at some point, right? Or, if it can be trained for $500k today, how much will it cost in a couple of years? IMO we can't stuff this genie back in the bottle, for better or worse. A video won't be solid evidence of much within our lifetimes.

That's how I see it as well. Very soon, people will assume most videos are AI-generated, and the burden of proof will be on people claiming videos are real. We plan to embed some kind of hash to indicate our video is AI-generated, but people will be able to get around this. Google/Apple/Samsung seem to be in the best place to solve this: whenever their devices record a real video, they can generate a hash directly in hardware for that video, which can be used to verify that it was actually recorded by that phone.

Also, I think it will cost around $100k to train a model at this quality level within 1-2 years, and it will only go down from there. So, the genie is out of the bottle.
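
To make the hashing idea concrete, here is a minimal sketch of a provider-side log of generated outputs. This is an assumption about how such a log could work, not Infinity's actual pipeline; the function name, file paths, and JSONL format are invented for illustration:

```typescript
// Illustration only: append a SHA-256 hash of each generated video to a local
// JSONL log, so a suspect clip can later be checked against the log.
import { createHash } from "node:crypto";
import { readFileSync, appendFileSync } from "node:fs";

function logGeneratedVideo(videoPath: string, logPath = "generated-hashes.jsonl"): string {
  // Hash the exact bytes of the generated file.
  const hash = createHash("sha256").update(readFileSync(videoPath)).digest("hex");
  const entry = { hash, videoPath, at: new Date().toISOString() };
  appendFileSync(logPath, JSON.stringify(entry) + "\n");
  return hash;
}

// Later, anyone holding a clip can hash it the same way and ask whether
// that hash appears in the provider's log.
console.log(logGeneratedVideo("out/clip-0001.mp4"));
```

Note that a plain cryptographic hash only matches a bit-identical file; surviving re-encoding or cropping would need a perceptual hash, as the earlier comment suggests, plus some way to publish or query the log.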


That makes sense. It isn’t reasonable to expect malicious users to helpfully set the “evil bit,” but you can at least add a little speedbump by hashing your own AI generated content (and the presence of videos that are verifiably AI generated will at least probably catch some particularly lazy/incompetent bad actors, which will destroy their credibility and also be really funny).

In the end, though, the incentive and the capability lie in the hands of camera manufacturers. It is unfortunate that video from the pre-AI era had no real reason to have been made verifiable…

Anyway, recordings of politicians saying some pretty heinous things haven't derailed their campaigns, so maybe none of this is really worth worrying about in the first place.


Ya, it's only a matter of time until very high-quality video models are open-sourced.

I think you're fine because these videos don't look or sound the least bit realistic.


