It kind of does! In the modern era of generative modelling, it seems like we rely on pre-training to capture the data distribution, and then on post-training (and various other tricks) to carve out a sliver of that distribution that we actually care about (i.e. what we want our model to generate).
To be able to specify that subset with relatively few examples, a good high-level understanding of the data distribution is necessary. The way I see it, training a diffusion model gets you to that point, and then, once you've selected the part of the distribution you actually care about, you can distill it down quite aggressively, because you no longer need all of that computation to model a much simpler distribution (sometimes all the way to one step, but usually it's a few steps in practice).
I briefly covered that connection in an earlier blog post: https://sander.ai/2023/07/20/perspectives.html#flow
... but it's definitely something that might deserve a longer-form treatment at some point :)
I have an example audio clip in there where the phase information has been replaced with random noise, so you can perceive the effect. It certainly does matter perceptually, but it is tricky to model, and small "vocoder" models do a decent job of filling it in post-hoc.
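For anyone who wants to reproduce the effect: here's a toy sketch of the phase randomization described above (an assumed pipeline, not the exact code used for the blog's audio clip). It keeps a signal's Fourier magnitudes and resynthesizes with uniform random phase instead.

```python
import numpy as np

def randomize_phase(signal, seed=0):
    """Keep the magnitude spectrum, replace the phase with uniform noise."""
    rng = np.random.default_rng(seed)
    spectrum = np.fft.rfft(signal)
    magnitude = np.abs(spectrum)
    random_phase = rng.uniform(0.0, 2.0 * np.pi, size=magnitude.shape)
    return np.fft.irfft(magnitude * np.exp(1j * random_phase), n=len(signal))

# A 440 Hz tone keeps (almost) its magnitude spectrum but loses its waveform shape.
t = np.linspace(0, 1, 16000, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t)
scrambled = randomize_phase(tone)
```

Applied to speech or music instead of a sine tone, this is the kind of degradation a vocoder has to undo.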
I'm not sure if frequency decomposition makes sense for anything that's not grid-structured, but there is certainly evidence that there is positive "transfer" between generative modelling tasks in vastly different domains, implying that there are some underlying universal statistics which occur in almost all data modalities that we care about.
That said, the gap between perceptual modalities (image, video, sound) and language is quite large in this regard, and probably also partially explains why we currently use different modelling paradigms for them.
Sorry to hear that. My blog posts are intended to build intuition. I also write academic papers, which of course involves a different standard of rigour. Perhaps you'd prefer those, though only one of them is about diffusion models.
Had just finished watching the Physics of Language Models[1] talk, where they show how GPT2 models could learn non-trivial context-free grammars, as well as effectively do dynamic programming to an extent, so I thought it would be interesting to see how they performed on the spectral fine-graining task.
> I included a few references that explore that approach at the bottom of section 4
Man, reading on a mobile phone just ain't the same. Somehow managed to not catch the end of that section. The first reference, "Generating Images with Sparse Representations", is very close to what I had in mind.
Thanks for reading! The paper that directly inspired this blog post actually investigates the latter (blurring as the corruption process): https://arxiv.org/abs/2206.13397
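A toy sketch of blurring as the corruption process, in the spirit of the linked paper (which frames it as heat dissipation): attenuate Fourier coefficients with a Gaussian decay in frequency, which is equivalent to Gaussian blurring in pixel space. The paper's exact schedule and boundary conditions differ; this is just the basic idea.

```python
import numpy as np

def heat_blur(image, t):
    """Dissipate high frequencies: larger t means more blur, t=0 is the identity."""
    fy = np.fft.fftfreq(image.shape[0])[:, None]
    fx = np.fft.fftfreq(image.shape[1])[None, :]
    decay = np.exp(-(fy**2 + fx**2) * (2 * np.pi) ** 2 * t)
    return np.fft.ifft2(np.fft.fft2(image) * decay).real

img = np.eye(8)  # toy "image"
blurred = heat_blur(img, 10.0)  # heavily blurred: only low frequencies survive
```

Note that unlike additive noise, this corruption preserves the image mean (the DC coefficient is never attenuated), which is part of what makes inverting it a different kind of generative task.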
In a nutshell, diffusion models break up the difficult task of generating natural signals (such as images or sound) into many smaller partial denoising tasks. This is done by defining a corruption process that gradually adds noise to an input until all of the signal is drowned out (this is the "diffusion"), and then learning how to invert that process step-by-step.
This is not dissimilar to how modern language models work: they break up the task of generating text into a series of easier next-word-prediction tasks. In both cases, the model only solves a small part of the problem at a time, and you apply it repeatedly to generate a signal.
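The corruption-then-iterative-inversion loop above can be sketched in a few lines. Here `denoise` is a hypothetical placeholder for the trained network, and the schedule is a simple linear one, so this is an illustration of the structure, not a working sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, t):
    """Forward process: mix signal with noise; at t=1 only noise remains."""
    noise = rng.standard_normal(x.shape)
    return np.sqrt(1.0 - t) * x + np.sqrt(t) * noise

def denoise(x_t, t):
    """Placeholder for the learned model predicting the clean signal."""
    return x_t  # a real model would strip away some of the noise here

def sample(shape, num_steps=50):
    """Generation: start from pure noise, invert the corruption step by step."""
    x = rng.standard_normal(shape)
    for t in np.linspace(1.0, 0.0, num_steps):
        x_hat = denoise(x, t)  # solve one small partial denoising task
        x = corrupt(x_hat, max(t - 1.0 / num_steps, 0.0))  # renoise to the next level
    return x
```

The analogy with language models is visible in the loop: one model call per step, applied repeatedly, with each call only responsible for a small part of the overall generation problem.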