
Progress in this field has not been linear, though. So it's quite possible that two papers from now we'll still be in the same place.



On the other hand, this is the first convincing use of a “diffusion transformer” [1]. My understanding is that videos and images are tokenized into patches through a process that compresses them into abstract representations in latent space. Those patches (image/video representations in latent space) can then be fed to a transformer, because the patches serve as the tokens. The point is that there is plenty of room for optimization following the first demonstration of a new architecture.

Edit: sorry, it’s not the first diffusion transformer. That would be [2].

[1] https://openai.com/research/video-generation-models-as-world...

[2] https://arxiv.org/abs/2212.09748
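The patchification step described above can be sketched roughly like this. This is a hypothetical illustration, not Sora's actual configuration: the latent shape, patch sizes, and the `patchify` helper are all assumptions for the sake of the example.

```python
# Illustrative sketch of "spacetime patch" tokenization (assumed shapes/sizes,
# not the real model's config).
import numpy as np

def patchify(latent, pt=2, ph=4, pw=4):
    """Split a latent video (T, H, W, C) into flattened spacetime patches.

    Each (pt x ph x pw x C) block becomes one token vector, so a
    transformer can attend over patches exactly as it would over
    word tokens.
    """
    T, H, W, C = latent.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    # Carve the latent into non-overlapping spacetime blocks.
    x = latent.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Group the three patch-internal axes together, then flatten each block.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    tokens = x.reshape(-1, pt * ph * pw * C)  # (num_tokens, token_dim)
    return tokens

# e.g. a 16-frame, 32x32 latent with 4 channels -> 512 tokens of dim 128
latent = np.random.randn(16, 32, 32, 4)
tokens = patchify(latent)
print(tokens.shape)  # (512, 128)
```

The transformer then operates on this token sequence; in a diffusion transformer the denoising objective is applied on top of it, which this sketch does not cover.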



I think this explanation is misleading. The role of the diffusion network is completely absent from it.



