It kind of does! In the modern era of generative modelling, it seems like we rely on pre-training to capture the data distribution, and then on post-training (and various other tricks) to carve out a sliver of that distribution that we actually care about (i.e. what we want our model to generate).
To be able to specify that subset with relatively few examples, a good high-level understanding of the data distribution is necessary. The way I see it, training a diffusion model gets you to that point, and then, once you've selected the part of the distribution you actually care about, you can distill it down quite aggressively, because you no longer need all of that computation to model a much simpler distribution (sometimes all the way to one step, but usually it's a few steps in practice).
I briefly covered that connection in an earlier blog post: https://sander.ai/2023/07/20/perspectives.html#flow
... but it's definitely something that might deserve a longer-form treatment at some point :)
I have an example audio clip in there where the phase information has been replaced with random noise, so you can perceive the effect. It certainly does matter perceptually, but it is tricky to model, and small "vocoder" models do a decent job of filling it in post-hoc.
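For anyone who wants to reproduce the effect: here's a toy sketch of the phase randomization described above (an assumed pipeline, not the exact code used for the blog's audio clip). It keeps a signal's Fourier magnitudes and resynthesizes with uniform random phase instead.

```python
import numpy as np

def randomize_phase(signal, seed=0):
    """Keep the magnitude spectrum, replace the phase with uniform noise."""
    rng = np.random.default_rng(seed)
    spectrum = np.fft.rfft(signal)
    magnitude = np.abs(spectrum)
    random_phase = rng.uniform(0.0, 2.0 * np.pi, size=magnitude.shape)
    return np.fft.irfft(magnitude * np.exp(1j * random_phase), n=len(signal))

# A 440 Hz tone keeps (almost) its magnitude spectrum but loses its waveform shape.
t = np.linspace(0, 1, 16000, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t)
scrambled = randomize_phase(tone)
```

Applied to speech or music instead of a sine tone, this is the kind of degradation a vocoder has to undo.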
I'm not sure if frequency decomposition makes sense for anything that's not grid-structured, but there is certainly evidence that there is positive "transfer" between generative modelling tasks in vastly different domains, implying that there are some underlying universal statistics which occur in almost all data modalities that we care about.
That said, the gap between perceptual modalities (image, video, sound) and language is quite large in this regard, and probably also partially explains why we currently use different modelling paradigms for them.
Sorry to hear that. My blog posts are intended to build intuition. I also write academic papers, which of course involves a different standard of rigour. Perhaps you'd prefer those, though only one of them is about diffusion models.
Had just finished watching the Physics of Language Models[1] talk, where they show how GPT2 models could learn non-trivial context-free grammars, as well as effectively do dynamic programming to an extent, so I thought it would be interesting to see how they performed on the spectral fine-graining task.
> I included a few references that explore that approach at the bottom of section 4
Man, reading on a mobile phone just ain't the same. Somehow managed to not catch the end of that section. The first reference, "Generating Images with Sparse Representations", is very close to what I had in mind.
Thanks for reading! The paper that directly inspired this blog post actually investigates the latter (blurring as the corruption process): https://arxiv.org/abs/2206.13397
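A toy sketch of blurring as the corruption process, in the spirit of the linked paper (which frames it as heat dissipation): attenuate Fourier coefficients with a Gaussian decay in frequency, which is equivalent to Gaussian blurring in pixel space. The paper's exact schedule and boundary conditions differ; this is just the basic idea.

```python
import numpy as np

def heat_blur(image, t):
    """Dissipate high frequencies: larger t means more blur, t=0 is the identity."""
    fy = np.fft.fftfreq(image.shape[0])[:, None]
    fx = np.fft.fftfreq(image.shape[1])[None, :]
    decay = np.exp(-(fy**2 + fx**2) * (2 * np.pi) ** 2 * t)
    return np.fft.ifft2(np.fft.fft2(image) * decay).real

img = np.eye(8)  # toy "image"
blurred = heat_blur(img, 10.0)  # heavily blurred: only low frequencies survive
```

Note that unlike additive noise, this corruption preserves the image mean (the DC coefficient is never attenuated), which is part of what makes inverting it a different kind of generative task.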
In a nutshell, diffusion models break up the difficult task of generating natural signals (such as images or sound) into many smaller partial denoising tasks. This is done by defining a corruption process that gradually adds noise to an input until all of the signal is drowned out (this is the "diffusion"), and then learning how to invert that process step-by-step.
This is not dissimilar to how modern language models work: they break up the task of generating text into a series of easier next-word-prediction tasks. In both cases, the model only solves a small part of the problem at a time, and you apply it repeatedly to generate a signal.
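The corruption-then-iterative-inversion loop above can be sketched in a few lines. Here `denoise` is a hypothetical placeholder for the trained network, and the schedule is a simple linear one, so this is an illustration of the structure, not a working sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(x, t):
    """Forward process: mix signal with noise; at t=1 only noise remains."""
    noise = rng.standard_normal(x.shape)
    return np.sqrt(1.0 - t) * x + np.sqrt(t) * noise

def denoise(x_t, t):
    """Placeholder for the learned model predicting the clean signal."""
    return x_t  # a real model would strip away some of the noise here

def sample(shape, num_steps=50):
    """Generation: start from pure noise, invert the corruption step by step."""
    x = rng.standard_normal(shape)
    for t in np.linspace(1.0, 0.0, num_steps):
        x_hat = denoise(x, t)  # solve one small partial denoising task
        x = corrupt(x_hat, max(t - 1.0 / num_steps, 0.0))  # renoise to the next level
    return x
```

The analogy with language models is visible in the loop: one model call per step, applied repeatedly, with each call only responsible for a small part of the overall generation problem.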