So I can't find that paper that was posted on HN which asserted that, viewed under the right theoretical framework, diffusion and transformers are doing the same thing under a different basis... am I misremembering something?
The contrast here is real: there are pixel-space diffusion models and latent-space diffusion models. Pixel-space diffusion is slower because it has to model far more redundant information.
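To make the redundancy point concrete, here's a back-of-the-envelope comparison (the dimensions are illustrative; the 8x downsampling factor and 4-channel latent match the commonly cited Stable Diffusion VAE setup, but the step count is just a typical sampling budget):

```python
# Rough sketch: latent-space diffusion runs the denoiser on a compressed
# latent rather than raw pixels, so each sampling step touches fewer values.

def denoiser_workload(height, width, channels, steps):
    """Total values the denoiser processes across all sampling steps."""
    return height * width * channels * steps

# Pixel space: 512x512 RGB image, 50 sampling steps.
pixel = denoiser_workload(512, 512, 3, 50)

# Latent space: same image encoded to a 64x64x4 latent (8x downsampling).
latent = denoiser_workload(64, 64, 4, 50)

print(pixel // latent)  # pixel-space denoising touches ~48x more values
```

That factor of ~48 per step is most of why pixel-space diffusion is so much slower at the same resolution.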
The most popular way to use autoregression for image generation is to predict image patches/tokens rather than raw pixels, though that still scales worse than diffusion.
A fairly new but promising autoregressive approach that seems to scale as well as diffusion is predicting the next image scale/resolution rather than the next patch.
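A quick sketch of why next-scale prediction needs so many fewer autoregressive steps than next-patch prediction (the patch size and scale schedule below are illustrative, loosely in the style of the VAR setup, not exact figures from any paper):

```python
# Next-patch vs next-scale: count the autoregressive steps each needs.

PATCH = 16
IMAGE = 256

# Next-patch: one autoregressive step per patch, generated in raster order.
patch_steps = (IMAGE // PATCH) ** 2  # 16x16 grid -> 256 sequential steps

# Next-scale: one autoregressive step per resolution; all tokens at a given
# scale are predicted in parallel, conditioned on the coarser scales.
scales = [1, 2, 4, 8, 16]            # token-map side lengths, coarse to fine
scale_steps = len(scales)            # 5 sequential steps

print(patch_steps, scale_steps)      # 256 vs 5
```

Since each sequential step is a full forward pass, collapsing hundreds of patch steps into a handful of scale steps is where most of the speedup comes from.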
However, diffusion models suck at details, like how many fingers are on a hand, and with language the words and characters matter: both which ones appear and where they are.
So while I'm sure diffusion could produce walls of text that look convincingly like, say, a blog post at a glance, I'm not sure they would hold up to anyone actually reading them.