So I can't find that paper that was posted on HN which asserted that, viewed under the right theoretical framework, diffusion and transformers are doing the same thing under a different basis... am I misremembering something?
The contrast here is real: there are pixel-space diffusion models and latent-space diffusion models. Pixel-space diffusion is slower because it has to model far more redundant information.
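To make the redundancy point concrete, here's a back-of-the-envelope comparison (the dimensions are illustrative; the 8x downsampling factor and 4-channel latent match the commonly cited Stable Diffusion VAE setup, but the step count is just a typical sampling budget):

```python
# Rough sketch: latent-space diffusion runs the denoiser on a compressed
# latent rather than raw pixels, so each sampling step touches fewer values.

def denoiser_workload(height, width, channels, steps):
    """Total values the denoiser processes across all sampling steps."""
    return height * width * channels * steps

# Pixel space: 512x512 RGB image, 50 sampling steps.
pixel = denoiser_workload(512, 512, 3, 50)

# Latent space: same image encoded to a 64x64x4 latent (8x downsampling).
latent = denoiser_workload(64, 64, 4, 50)

print(pixel // latent)  # pixel-space denoising touches ~48x more values
```

That factor of ~48 per step is most of why pixel-space diffusion is so much slower at the same resolution.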
The most popular way to use autoregression for image generation is to predict image patches/tokens rather than raw pixels, though that still scales worse than diffusion.
A fairly new but promising autoregressive approach that seems to scale as well as diffusion is predicting the next image scale/resolution rather than the next patch.
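A quick sketch of why next-scale prediction needs so many fewer autoregressive steps than next-patch prediction (the patch size and scale schedule below are illustrative, loosely in the style of the VAR setup, not exact figures from any paper):

```python
# Next-patch vs next-scale: count the autoregressive steps each needs.

PATCH = 16
IMAGE = 256

# Next-patch: one autoregressive step per patch, generated in raster order.
patch_steps = (IMAGE // PATCH) ** 2  # 16x16 grid -> 256 sequential steps

# Next-scale: one autoregressive step per resolution; all tokens at a given
# scale are predicted in parallel, conditioned on the coarser scales.
scales = [1, 2, 4, 8, 16]            # token-map side lengths, coarse to fine
scale_steps = len(scales)            # 5 sequential steps

print(patch_steps, scale_steps)      # 256 vs 5
```

Since each sequential step is a full forward pass, collapsing hundreds of patch steps into a handful of scale steps is where most of the speedup comes from.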
However, diffusion models suck at details, like how many fingers are on a hand, and with language the words and characters matter: both which ones appear and where they are.
So while I'm sure diffusion could produce walls of text that look convincingly like, say, a blog post at a glance, I'm not sure they would hold up to anyone actually reading them.