What's the commonality in formatting you're paying attention to? I think the conference asks everyone to use their template/style.
But the architecture figures look like they have different styles. E.g. the Re-Imagen paper uses rows/stacks of small colored circles to represent output tensors, and colored rectangles of different aspect ratios to indicate shape differences, whereas the Phenaki paper uses stacks of squares for output tensors, and differently shaped elements to distinguish different kinds of components.
I can’t wait to see Michael Scott’s idea that he pitched for a Dunder Mifflin ad come to life.
> Little girl, in a field, holding a flower. We zoom back, to find, she's in the desert, and the field's an oasis. Zoom back further, the desert is a sandbox in the world's largest resort hotel. Zoom back further, the hotel is actually a playground, of the world's largest prison. But we zoom back further--
It’s incredible that the joke was that his idea was simply impossible, and yet technology has advanced to the point where you can basically do it instantly.
This addresses several of the shortcomings in the AI video technology that's the current top story on HN. It's entertaining to consider the possibility that the explosion of innovation is partly due to artificially generated papers and business entities that are busily iterating upon each other's capabilities while we write micro-editorials about what that means.
However, there are some flourishes and timing choices that are not indicated by the prompt text, and I think there is some manual tweaking at play (which is okay, it's still impressive).
Wow--this qualitatively feels a lot more impressive than the Meta model. The two-minute video is better than anything I've seen in video generation on that scale.
I'm happy about it because all the celebs who paid $ to churn out an 'AI music video!!?!' with Stable Diffusion and whatever shitty demo they had lying around are suddenly revealed as tryhards chasing the hype cycle rather than innovators.
It feels like it is breaking up the text into still scenes, making squiggle-type animations of several variations of each still, and then morphing between them. I wonder what an NN-designed progression would actually be like.
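Roughly, the pipeline I'm imagining looks like the sketch below. All of the function names are stand-ins I made up to make the idea concrete; this is not how the paper says the model works.

```python
from typing import List

def split_into_scenes(prompt: str) -> List[str]:
    # Naive stand-in: treat each sentence of the prompt as one scene description.
    return [s.strip() for s in prompt.split(".") if s.strip()]

def render_still(scene: str) -> str:
    # Stand-in for a text-to-image call; returns a placeholder "frame".
    return f"<still for: {scene}>"

def morph(frame_a: str, frame_b: str, steps: int = 8) -> List[str]:
    # Stand-in for interpolating/morphing between two consecutive stills.
    return [f"<blend {i}/{steps}: {frame_a} -> {frame_b}>" for i in range(steps)]

def storyboard_to_video(prompt: str) -> List[str]:
    # Split text into scenes, render a still per scene, morph between neighbors.
    stills = [render_still(s) for s in split_into_scenes(prompt)]
    frames: List[str] = []
    for a, b in zip(stills, stills[1:]):
        frames.extend(morph(a, b))
    return frames

if __name__ == "__main__":
    clip = storyboard_to_video(
        "A girl holds a flower. The camera zooms out to a desert. The desert is a sandbox."
    )
    print(len(clip), "frames")
```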
I didn't read either paper in enough depth, but I think that Make-A-Scene, by being conditioned only on image embeddings, is incapable of generating videos that require a broader understanding that cannot be encoded in an image embedding, like "Camera zooms quickly into the eye of the cat."
Make-A-Scene is more like text-to-animated-image; this model seems more powerful.
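The distinction I mean is just about the shape of the conditioning signal. A minimal illustration, with made-up dimensions that are not from either paper: a single image embedding has no time axis, so "zoom in" and "zoom out" collapse to the same conditioning, while a per-frame sequence can vary over time.

```python
import numpy as np

embed_dim, num_frames = 512, 16  # illustrative sizes, not from the papers

# Image-embedding conditioning: one fixed vector for the whole clip.
image_cond = np.random.randn(embed_dim)            # shape: (512,)

# Sequence conditioning: one vector per frame, so camera motion like
# "zoom quickly into the eye" can be expressed as a trajectory over time.
seq_cond = np.random.randn(num_frames, embed_dim)  # shape: (16, 512)

print(image_cond.shape, seq_cond.shape)
```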
Would it kill these guys to use real HTML video? I can't tell if the dithering and overall low resolution are from the model or the GIFs they inexplicably decided to use to showcase their result.
This is really cool. The next awesome breakthrough would be for the computer to describe the contents of a video. That would be a big step toward AGI, IMHO.
The other two, also by anonymous authors using the same formatting, are:
AudioGen: Textually Guided Audio Generation https://openreview.net/forum?id=CYK7RfcOzQ4
and
Re-Imagen: Retrieval-Augmented Text-to-Image Generator https://openreview.net/forum?id=XSEBx0iSjFQ
There is a samples site for AudioGen, but it is currently flooded and inaccessible:
https://anonymous.4open.science/w/iclr2023_samples-CB68/repo...