On the other hand, this is the first convincing use of a “diffusion transformer” [1]. My understanding is that the videos and images are first compressed into a latent space of abstracted concepts, and that latent representation is then split into patches. Those patches (image/video concepts in latent space) can then be fed to a transformer, because the patches serve as its tokens. The point is that there is plenty of room for optimization following the first demonstration of a new architecture.
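A rough sketch of what that patch-tokenization might look like (shapes and patch sizes are my own illustrative assumptions, not anything Sora has published):

```python
# Minimal sketch: a compressed latent video of shape
# (frames, height, width, channels) is cut into spacetime patches,
# and each patch is flattened into one token vector for a transformer.
import numpy as np

def patchify(latent, pt=2, ph=4, pw=4):
    """Split a latent video into (pt x ph x pw) spacetime patches (tokens)."""
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # group the patch-grid axes together, then flatten each patch to a vector
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, pt * ph * pw * C)  # (num_tokens, token_dim)

latent = np.random.randn(8, 32, 32, 4)  # e.g. a VAE-compressed video latent
tokens = patchify(latent)
print(tokens.shape)  # (256, 128): 4*8*8 tokens, each of dim 2*4*4*4
```

Each row of `tokens` then plays the role a word embedding plays in a language model, which is what lets the standard transformer machinery apply unchanged.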
Edit: sorry, it’s not the first diffusion transformer. That would be [2].