
Progress in this field has not been linear, though. So it's quite possible that two papers from now we'll still be in the same place.



On the other hand, this is the first convincing use of a “diffusion transformer” [1]. My understanding is that videos and images are tokenized into patches through a process that compresses them into abstract representations in latent space. Those patches (image/video representations in latent space) can then be fed to a transformer, because the patches serve as the tokens. The point is that there is plenty of room for optimization following the first demonstration of a new architecture.

Edit: sorry, it’s not the first diffusion transformer. That would be [2].

[1] https://openai.com/research/video-generation-models-as-world...

[2] https://arxiv.org/abs/2212.09748
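The patchification step described above can be sketched roughly like this. This is a hypothetical illustration, not Sora's actual configuration: the latent shape, patch sizes, and the `patchify` helper are all assumptions for the sake of the example.

```python
# Illustrative sketch of "spacetime patch" tokenization (assumed shapes/sizes,
# not the real model's config).
import numpy as np

def patchify(latent, pt=2, ph=4, pw=4):
    """Split a latent video (T, H, W, C) into flattened spacetime patches.

    Each (pt x ph x pw x C) block becomes one token vector, so a
    transformer can attend over patches exactly as it would over
    word tokens.
    """
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    # Carve the latent into non-overlapping spacetime blocks.
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Group the three patch-internal axes together, then flatten each block.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    tokens = x.reshape(-1, pt * ph * pw * C)  # (num_tokens, token_dim)
    return tokens

# e.g. a 16-frame, 32x32 latent with 4 channels -> 512 tokens of dim 128
latent = np.random.randn(16, 32, 32, 4)
tokens = patchify(latent)
print(tokens.shape)  # (512, 128)
```

The transformer then operates on this token sequence; in a diffusion transformer the denoising objective is applied on top of it, which this sketch does not cover.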



I think this explanation is misleading. The role of the diffusion network is completely absent from it.



