Emu Video and Emu Edit, our latest generative AI research milestones (meta.com)
201 points by ot 5 months ago | 47 comments



Somewhat tangential, but I hadn't heard about the Emu model, which was apparently released (the paper [1] at least) in September. I was curious about the details and read the Emu paper and ... I feel like I'm taking crazy pills reading it.

> To the best of our knowledge, this is the first work highlighting fine-tuning for generically promoting aesthetic alignment for a wide range of visual domains.

... unlike Stable Diffusion, which did aesthetic fine-tuning when it was released? Or the thousands of aesthetic finetunes released since?

> We show that the original 4-channel autoencoder design [27] is unable to reconstruct fine details. Increasing channel size leads to much better reconstructions.

Is it not expected that decreasing the compression ratio would lead to better reconstructions? The whole point of the latent diffusion architecture is to make a trade-off here. They're more than welcome to do pixel diffusion, or an upscaling architecture, if they want better quality.
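
For a rough sense of the trade-off, here's the back-of-the-envelope arithmetic, assuming a standard SD-style 512x512 setup with 8x spatial downsampling; the 16-channel latent is just an illustrative wider configuration, not necessarily Emu's exact one:

    # Back-of-the-envelope compression arithmetic (illustrative numbers, not Emu's exact config)
    image_elems = 512 * 512 * 3                   # 512x512 RGB input
    latent_4ch  = (512 // 8) * (512 // 8) * 4     # 64x64x4 latent (SD-style)
    latent_16ch = (512 // 8) * (512 // 8) * 16    # a wider 64x64x16 latent

    print(image_elems / latent_4ch)    # 48.0 -> ~48x compression, fine detail gets lost
    print(image_elems / latent_16ch)   # 12.0 -> ~12x compression, reconstruction is easier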

And then the rest of the paper is this long documentation that can be summed up as "we used industry standard filtering and then human filtering to build an aesthetic dataset which we finetuned a model with". Which, again, has been done a thousand times already.

I really, really don't mean to knock the researchers' work here. I'm just very confused as to why the work is being represented as new or groundbreaking. Contrast with OAI, which documents using a diffusion-based latent decoder. That's interesting, different, and worth publishing. Scaling up your latent space to get better results is just ... obvious? (As obvious as anything in ML is, anyway.) Facebook's research isn't usually this off the mark. E.g., the Emu Edit paper is very interesting and contributes many new methods to the field.

[1] https://scontent-lax3-1.xx.fbcdn.net/v/t39.2365-6/10000000_1...


Yeah. It is still useful for them to share these. My takeaway:

1. Data is all you need to generate these amazing videos with the right gait (gait is something I focused on).

2. Nobody is doing a new network structure; it's AnimateDiff beefed up a little with temporal masking applied (a neat trick, though not a big leap from the inpainting tasks we already see; a minimal sketch follows this list).

3. An additional conditioning vector helps, and can be trained; look at these editing tasks!
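
To make point 2 concrete, here's a rough sketch of what frame-level temporal masking looks like; the shapes, names, and keep-every-4th-frame pattern are my own illustrative assumptions, not Emu's actual code:

    import torch

    B, C, T, H, W = 1, 4, 16, 64, 64              # batch, latent channels, frames, height, width
    video_latents = torch.randn(B, C, T, H, W)

    # Keep a few "anchor" frames (e.g. ones from the image model) and mask the rest,
    # so the video model has to fill in the gaps -- inpainting along T instead of H/W.
    keep = torch.zeros(T, dtype=torch.bool)
    keep[::4] = True                              # every 4th frame is an anchor

    mask = keep.view(1, 1, T, 1, 1).float()       # broadcast over batch/channels/space
    model_input = video_latents * mask            # masked frames are zeroed out

    # A diffusion model conditioned on (model_input, mask) is then trained to
    # denoise/inpaint the missing frames, same idea as spatial inpainting.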

These are pretty valuable for an onlooker like me trying to decipher what they did for Gen-2 or Pika Labs, etc.


Emu Edit is awesome. I think we have officially brought this scene from Star Trek to life.

https://m.youtube.com/watch?v=NXX0dKw4SjI&pp=ygUII3Npbm50ZWs...


With the advent of these models, my head canon now insists that when Star Trek characters say they "programmed" something, they really mean that they have a log of all of their iterative prompts, and that there's some optimization the computer can use to aggregate all of those into the final resulting warp model/holodeck simulation/transporter filter/biobed pathogen detector/etc. without having to reiterate through all of those prompts again...kind of like a NixOS declarative build.

And when somebody comes along and fixes their program or reprograms what they did, they simply insert or change some of the prompts along the way and get a different effect.

When the characters add new data to the computer (like the episode where Geordi added the psych profile of the Enterprise's engine designer), they're just tuning the foundation model with some new input data.

Yeah....that feels right for now to me.


Yes absolutely. I’ve started thinking of some interfaces for this type of “programming”. I think we’ll have some pretty cool stuff to play around with in the not too distant future.


> Computer, show me a table.

> There are 5047 classifications of tables on file. Specify design parameters.

Interestingly enough, it seems existing AI models are already better than the Star Trek computer at dealing with ambiguity. Stable Diffusion would just generate a "normal" table and let you go from there.


Yes, they seem to handle emotion, humor and ambiguity better than Data or any computer ever on the shows. 24th century technology, today.


OTOH Data is sentient.


So he says, but LLMs will tell you that too.


"The Commander is a physical representation of a dream; an idea, conceived of by the mind of a man. Its purpose: to serve human needs and interests. It's a collection of neural nets and heuristic algorithms; its responses dictated by an elaborate software written by a man, its hardware built by a man. And now -- and now a man will shut it off."


He was in love once too


"Computer, load up CELERY MAN, please"

https://www.youtube.com/watch?v=a8K6QUPmv8Q


Tim and Eric are going to go crazy with Gen AI. They won't need Adult Swim to toss them shoestring budgets.


LLMs and image models are already better than that scene.

I can think of solutions for the physical component, or for simulating the perception of a physical component.


I thought of this Running Man scene https://www.youtube.com/watch?v=BVdOr0z6X7Y


I wonder how far away we are from "make a movie from a sentence".

2030?

Also, why do these AI people always end with "this does not replace anyone"? Surely they don't believe this?


they don't believe it; they're placating an unease in society among people who know they're already being replaced. there is a group that plays devil's advocate with those people, and it is convenient to agree with the devil's advocates.

but there are lots of specialists I used to contract with in the ideation phase that I no longer do:

professional logo designers

testing out names of potential services

designers for landing pages for websites

additional coders for landing pages of websites

templates for powerpoint presentations

graphics for them

many, many billable hours for lawyers for things I would have otherwise asked them about, and that's totally a risk I'm willing to shoulder. now I simply have them implement unless they can't corroborate the legal view. In the past, I would have had to explore several paths and then consider switching lawyers after I had all the information I wanted, having the subsequent lawyer implement without any knowledge of why.

some of these ideas generate revenue and I can get to that point far faster and cheaper

I can already code in the latest frameworks and have high proficiency in most media suites, but media creation is not where I specialize or want to spend my time

so there is a general denial that's kind of useful: if a big company wants tax breaks from a municipality, they can say "look, jobs, we're big on that"

but everyone knows what's happening


You'll be replaced soon too: right now you're click-buttoning things to save money, but what happens once everyone can click-button their app idea into prod?


Yes it is a good motivator to entrench revenue streams right now


What does this mean, and how do you plan to entrench revenue streams when the barrier to cloning your products is next to zero?


By making enough right now


Reminds me of people uploading music videos to YouTube with "copyright not intended" in the comments


Is anyone able to determine how long it takes to generate a video with one of these methods? Can't find it in the paper.


The Emu image model is not significantly slower than SDXL or similar, so you would expect performance similar to Hotshot. The upscaler version (8 frames to 37 frames) would probably take significantly longer.


Definitely looks like progress, but they're still firmly in the center of the uncanny valley.


Does anyone know where the source code is? I can't seem to find it anywhere.


There's some source code in the paper for Emu Edit, at least. If you look at the supplementary material, you'll see they spell out the techniques used there too.

I didn't see a repository, but I think in this case the paper is actually a perfect balance of detail? I think Meta benefits from startups building using their tooling (startups usually buy ads), and so the lack of a full implementation leaves a bit of room for startups to turn the work into something a bit more production-ready.

The cool techniques from the paper are:

Generating a bunch of example images in one go, and using CLIP to score your generated images (a minimal sketch of the scoring step follows below)

And mixing pre-built pipelines and grammars to execute common tasks.

These two ideas alone (with examples) give people in the space plenty to run with.
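
As a rough illustration of the first idea, here's a minimal sketch of scoring a batch of generated candidates against the prompt with CLIP; the model name and candidate file paths are placeholders, and this is my reading of the technique rather than the paper's actual code:

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

    prompt = "a watercolor painting of a lighthouse at dusk"
    candidates = [Image.open(f"candidate_{i}.png") for i in range(8)]   # placeholder generated outputs

    inputs = processor(text=[prompt], images=candidates, return_tensors="pt", padding=True)
    with torch.no_grad():
        scores = model(**inputs).logits_per_image.squeeze(-1)          # (num_images,) image-text similarity

    best = scores.argmax().item()                                       # keep the best-matching candidate
    print(f"best candidate: candidate_{best}.png (score {scores[best].item():.2f})")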

Great paper!


I don’t think either of these (or the base Emu model) are open source.


That’s a bit disappointing. Meta had been on an “open source” roll lately


First dose of gen AI is free


I'd be happy if they just sold it.


Technically none of their models are actually open source.


Hence the quotes.


I never would have guessed that artists would be the ones AI took out first.


How can AI "take out" artists? It's an absurd statement.

Maybe career-wise? Should art ever really have been considered a career? It was a nice side effect that people might pay for it, but outside of that?


a huge pile of money on fire forever


That's living in a nutshell.

These are pretty great results though, don't you think?


nope, they look as terrible as everything else in the generative space, and consumers will reject them outright


Do you think they look as terrible as what was possible two years ago?

If at some point they get past looking terrible, will you think these in-between "terrible" models were a waste of time?


An impressive technical achievement, yes - but the presentation/marketing of this is absurd.

The generated videos are aesthetically horrendous. I don't know what kind of mental gymnastics are going on that they can confidently describe something where the body shapes are nonsensically in flux with every change of frame (look at the eagle's talons, or the dog's leg movements as it runs) as "high-quality video".

Is generative AI hype blinding them to how hideous these videos are, or do they know and they just pretend like it's something it isn't?


I don't like them; aesthetically they don't appeal, and technically they fall short as you describe. But just about a year ago this was the state of the art ('Age of Illusion' by Die Antwoord), with visual coherence maintainable for fewer than 10 frames.

https://www.youtube.com/watch?v=Cq56o0YH3mE


That wasn’t quite the epitome of generated video a year ago; it was barely trained for temporal coherence.

But even the best video generators at the time were much worse than Emu Video; there was Make-A-Video[0] from Meta, and Phenaki[1] and Imagen Video[2] from Google.

[0]: https://ai.meta.com/blog/generative-ai-text-to-video/

[1]: https://sites.research.google/phenaki/

[2]: https://imagen.research.google/video/


Check out what AI-generated images looked like 24 months ago and this comment may feel a lot less pithy.


A year ago this technology simply didn't exist at all. What are you expecting?


Compared to prior work, it looks unbelievable. Is this just an armchair criticism or have you been paying any attention?


Compared to prior work, it's great. On its own, I don't agree with describing it as high-quality.


Okay.



