
And they seem to be about 10x as fast as similarly sized transformers.


No, 10x fewer sampling steps. Whether that means 10x faster remains to be seen, as a diffusion step tends to be more expensive than an autoregressive step.
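
To make that concrete, a rough back-of-envelope in Python. Every number below is a made-up placeholder, not a measurement:

    # Fewer steps != proportionally faster if each step costs more.
    n_tokens = 256                 # tokens to generate
    ar_steps = n_tokens            # autoregressive: one step per token
    diff_steps = n_tokens // 10    # hypothetical diffusion sampler: ~10x fewer steps

    ar_step_ms = 5.0               # assumed cost of one autoregressive decode step
    diff_step_ms = 20.0            # assumed cost of one (heavier) diffusion step

    ar_total = ar_steps * ar_step_ms
    diff_total = diff_steps * diff_step_ms
    print(f"AR: {ar_total:.0f} ms, diffusion: {diff_total:.0f} ms, "
          f"speedup: {ar_total / diff_total:.1f}x")
    # With these placeholder costs the speedup is ~2.6x, not 10x.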


If I understood correctly, in practice they show an actual speed improvement on high-end cards, because autoregressive LLMs are memory-bandwidth limited rather than compute bound, so switching to an approach that costs more compute but less memory bandwidth works well on current hardware.
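
A quick roofline-style sketch of that intuition. The GPU figures below are ballpark assumptions for a high-end card, not real specs:

    # Batch-of-1 autoregressive decoding reads all weights to score one token;
    # a step that scores a whole block amortizes those reads across the block.
    params = 7e9                  # model parameters
    bytes_per_param = 2           # fp16/bf16 weights
    peak_flops = 300e12           # assumed peak throughput, FLOP/s
    peak_bw = 2e12                # assumed memory bandwidth, bytes/s

    def step_time(tokens_per_pass):
        """Lower bound on one forward pass scoring `tokens_per_pass` tokens."""
        flops = 2 * params * tokens_per_pass     # ~2 FLOPs per param per token
        weight_bytes = params * bytes_per_param  # weights read once per pass
        compute_s = flops / peak_flops
        memory_s = weight_bytes / peak_bw
        return max(compute_s, memory_s), compute_s, memory_s

    for tokens in (1, 256):
        total, c, m = step_time(tokens)
        bound = "compute" if c > m else "bandwidth"
        print(f"{tokens:4d} tokens/pass: {total*1e3:.2f} ms ({bound}-bound)")
    # 1 token/pass (autoregressive decode) is dominated by weight reads;
    # 256 tokens/pass (a diffusion step over a block) is compute-bound instead.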


The SEDD architecture [1] probably allows parallel sampling of all tokens in a block at once (toy sketch below), which may be faster in wall-clock time but not necessarily cheaper overall, i.e. runtime times computational resources used.

[1] Which Inception Labs's new models may be based on; one of the cofounders is a co-author. See equations 18-20 in https://arxiv.org/abs/2310.16834
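
For intuition only, a toy sketch of parallel-over-tokens sampling in Python. This is the generic masked-diffusion idea, not SEDD's actual transition rule from eqs. 18-20, and the model is a random stand-in:

    import numpy as np

    rng = np.random.default_rng(0)
    vocab, block = 50, 16
    MASK = vocab  # extra "masked" token id

    def model_logits(tokens):
        # Stand-in for a real denoiser: random logits per position.
        return rng.normal(size=(len(tokens), vocab))

    tokens = np.full(block, MASK)
    for step in range(4):  # a few denoising steps, not one step per token
        probs = np.exp(model_logits(tokens))
        probs /= probs.sum(axis=1, keepdims=True)
        # One categorical draw per position, all positions at once:
        draws = np.array([rng.choice(vocab, p=p) for p in probs])
        still_masked = tokens == MASK
        # Unmask a random subset each step (a common masked-diffusion schedule).
        unmask = still_masked & (rng.random(block) < 0.5)
        tokens = np.where(unmask, draws, tokens)
    tokens = np.where(tokens == MASK, draws, tokens)  # finalize leftovers
    print(tokens)

The point is just that each pass scores the whole block, so the step count is set by the denoising schedule rather than the block length.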



