
I'm not an insider and I'm not sure whether this is directly related to "energy minimization", but "diffusion language models" have apparently gained some popularity in recent weeks.

https://arxiv.org/abs/2502.09992

https://www.inceptionlabs.ai/news

(these are results from two different teams/orgs)

It sounds kind of like what you're describing, and nobody else has mentioned it yet, so take a look and see whether it's relevant.



And they seem to be about 10x as fast as similarly sized autoregressive transformers.


No, 10x fewer sampling steps. Whether that translates into 10x faster remains to be seen, since a single diffusion step tends to be more expensive than a single autoregressive step.


If I understood correctly, in practice they show real speed improvements on high-end cards, because autoregressive LLMs are memory-bandwidth bound rather than compute bound, so switching to an approach that spends more compute but uses less memory bandwidth works well on current hardware.
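
To put rough numbers on that, here is a back-of-envelope roofline estimate. The model size, bandwidth, FLOP figures, and block size below are assumptions for illustration, not benchmarks of any particular model or GPU:

    # Back-of-envelope roofline estimate (illustrative, assumed numbers).
    params = 7e9             # hypothetical 7B-parameter model
    bytes_per_param = 2      # bf16 weights
    hbm_bw = 3.0e12          # assumed ~3 TB/s HBM bandwidth (H100-class)
    peak_flops = 1.0e15      # assumed ~1 PFLOP/s bf16 peak

    # Autoregressive decoding at batch size 1: every generated token needs a
    # full forward pass, which streams all the weights from HBM.
    t_mem = params * bytes_per_param / hbm_bw          # ~4.7 ms of weight traffic
    t_compute_token = 2 * params / peak_flops          # ~14 us of math
    t_ar_per_token = max(t_mem, t_compute_token)       # bandwidth bound

    # A block-parallel (diffusion-style) step runs one forward pass over B
    # tokens, so the same weight traffic is amortized across the block.
    B = 64                                             # hypothetical block size
    t_compute_block = 2 * params * B / peak_flops
    t_block_step = max(t_mem, t_compute_block)

    print(f"autoregressive: {t_ar_per_token * 1e3:.2f} ms per token")
    print(f"block step:     {t_block_step / B * 1e3:.3f} ms per token, per denoising step")

On these assumed numbers, the block-parallel decoder comes out ahead as long as the number of denoising steps per block stays well below the block size.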


The SEDD architecture [1] probably allows for parallel sampling of all tokens in a block at once, which may be faster in wall-clock time but not necessarily cheaper overall (runtime multiplied by the compute resources used). A rough sketch of what block-parallel sampling looks like follows the footnote.

[1] Which Inception Labs' new models may be based on; one of the co-founders is a co-author. See equations 18-20 in https://arxiv.org/abs/2310.16834
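
The structural difference is easiest to see in code. Below is a generic masked-diffusion-style decoding loop: a hedged sketch of the general idea, not SEDD's actual update rule from eqs. 18-20. The model stub, block size, and confidence-based unmasking schedule are all made up for illustration:

    import torch

    VOCAB = 1000
    BLOCK = 32
    MASK_ID = VOCAB          # out-of-vocabulary marker for "still masked"

    def model(tokens):
        # Stand-in for a trained denoiser: a real model returns logits for
        # every position conditioned on the partially unmasked block; this
        # stub just returns random logits so the loop is runnable.
        return torch.randn(tokens.shape[0], VOCAB)

    def sample_block(num_steps=8):
        tokens = torch.full((BLOCK,), MASK_ID)       # start fully masked
        for step in range(num_steps):
            logits = model(tokens)                   # ONE parallel forward pass
            probs = torch.softmax(logits, dim=-1)
            conf, pred = probs.max(dim=-1)           # per-position best guess
            masked = torch.where(tokens == MASK_ID)[0]
            if len(masked) == 0:
                break
            # Commit the most confident remaining positions this step
            # (an assumed schedule; SEDD's reverse process differs).
            k = max(1, len(masked) // (num_steps - step))
            chosen = masked[conf[masked].topk(k).indices]
            tokens[chosen] = pred[chosen]
        return tokens

    print(sample_block())

The point is structural: every denoising step scores all positions in the block with a single forward pass and commits several tokens at once, instead of one forward pass per token as in autoregressive decoding.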



