What's the commonality in formatting you're paying attention to? I think the conference asks everyone to use their template/style.
But the architecture figures look like they have different styles. E.g. the Re-Imagen paper uses rows/stacks of small colored circles to represent output tensors, and colored rectangles of different aspect ratios to indicate shape differences, whereas the Phenaki paper uses stacks of squares for output tensors, and differently shaped elements to distinguish different kinds of components.
I can’t wait to see Michael Scott’s idea that he pitched for a Dunder Mifflin ad come to life.
> Little girl, in a field, holding a flower. We zoom back, to find, she's in the desert, and the field's an oasis. Zoom back further, the desert is a sandbox in the world's largest resort hotel. Zoom back further, the hotel is actually a playground, of the world's largest prison. But we zoom back further--
It’s incredible that the joke was that his idea was simply impossible, and yet technology has advanced to the point where you can basically do it instantly.
This addresses several of the shortcomings in the AI video technology that's the current top story on HN. It's entertaining to consider the possibility that the explosion of innovation is partly due to artificially generated papers and business entities that are busily iterating upon each other's capabilities while we write micro-editorials about what that means.
However, there are some flourishes and timing choices that are not indicated by the prompt text, and I think there is some manual tweaking at play (which is okay, it's still impressive).
Wow--this qualitatively feels a lot more impressive than the Meta model. The two-minute video is better than anything I've seen in video generation on that scale.
I'm happy about it because all the celebs who paid $ to churn out an 'AI music video!!?!' with Stable Diffusion and whatever shitty demo they had lying around are suddenly revealed as tryhards chasing the hype cycle rather than innovators.
It feels like it is breaking up the text into still scenes, making squiggle-type animations of several variations of each still, and then morphing between them. I wonder what an NN-designed progression would actually be like.
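Roughly, the pipeline I'm imagining looks like the sketch below. All of the function names are stand-ins I made up to make the idea concrete; this is not how the paper says the model works.

```python
from typing import List

def split_into_scenes(prompt: str) -> List[str]:
    # Naive stand-in: treat each sentence of the prompt as one scene description.
    return [s.strip() for s in prompt.split(".") if s.strip()]

def render_still(scene: str) -> str:
    # Stand-in for a text-to-image call; returns a placeholder "frame".
    return f"<still for: {scene}>"

def morph(frame_a: str, frame_b: str, steps: int = 8) -> List[str]:
    # Stand-in for interpolating/morphing between two consecutive stills.
    return [f"<blend {i}/{steps}: {frame_a} -> {frame_b}>" for i in range(steps)]

def storyboard_to_video(prompt: str) -> List[str]:
    # Split text into scenes, render a still per scene, morph between neighbors.
    stills = [render_still(s) for s in split_into_scenes(prompt)]
    frames: List[str] = []
    for a, b in zip(stills, stills[1:]):
        frames.extend(morph(a, b))
    return frames

if __name__ == "__main__":
    clip = storyboard_to_video(
        "A girl holds a flower. The camera zooms out to a desert. The desert is a sandbox."
    )
    print(len(clip), "frames")
```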
I didn't read either paper in enough depth, but I think that Make-A-Scene, by being conditioned only on image embeddings, is incapable of generating videos that require a broader understanding that cannot be encoded in an image embedding, like "Camera zooms quickly into the eye of the cat."
Make-A-Scene is more like text-to-animated-image; this model seems more powerful.
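The distinction I mean is just about the shape of the conditioning signal. A minimal illustration, with made-up dimensions that are not from either paper: a single image embedding has no time axis, so "zoom in" and "zoom out" collapse to the same conditioning, while a per-frame sequence can vary over time.

```python
import numpy as np

embed_dim, num_frames = 512, 16  # illustrative sizes, not from the papers

# Image-embedding conditioning: one fixed vector for the whole clip.
image_cond = np.random.randn(embed_dim)            # shape: (512,)

# Sequence conditioning: one vector per frame, so camera motion like
# "zoom quickly into the eye" can be expressed as a trajectory over time.
seq_cond = np.random.randn(num_frames, embed_dim)  # shape: (16, 512)

print(image_cond.shape, seq_cond.shape)
```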
Would it kill these guys to use real HTML video? I can't tell if the dithering and overall low resolution are from the model or the GIFs they inexplicably decided to use to showcase their result.
This is really cool. The next awesome breakthrough would be for the computer to describe the contents of a video. That would be a big step toward AGI, IMHO.
The other two, also by anonymous authors using the same formatting, are:
AudioGen: Textually Guided Audio Generation https://openreview.net/forum?id=CYK7RfcOzQ4
and
Re-Imagen: Retrieval-Augmented Text-to-Image Generator https://openreview.net/forum?id=XSEBx0iSjFQ
There is a samples site for AudioGen, but it is currently flooded and inaccessible:
https://anonymous.4open.science/w/iclr2023_samples-CB68/repo...