I have an undergraduate degree in math and generally find math quite enjoyable; even so, I had a pretty hard time grasping most of what this blog was talking about.
I think it pretty quickly went from extremely high level (add noise to image, then remove noise) to extremely low-level specifics.
Also, I had a hard time figuring out which part of the equations differentiates them for different data points — is that what "theta" means in all the equations?
Is theta what guides the initial noise towards one type of image instead of another? Is the innovation in GenAI images to use text embeddings to create the theta?
Theta represents all the model params — all the weights in the neural network. The convention is to write theta for the “learned” score function and omit theta for the “true” score function.
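A toy sketch of that distinction (my own example, not from the post): for 1-D Gaussian data the true score is known in closed form, and "theta" is just whatever parameters you fit — here a single weight, trained by denoising score matching.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: data x0 ~ N(0, 1), noised to x_t = x0 + sigma * eps.
# The marginal of x_t is N(0, 1 + sigma^2), so the TRUE score (no theta):
def true_score(x, sigma):
    return -x / (1.0 + sigma**2)

# A tiny "model": s_theta(x) = theta * x, where theta is the one learnable weight.
sigma = 0.5
x0 = rng.normal(size=100_000)
eps = rng.normal(size=100_000)
xt = x0 + sigma * eps
target = -eps / sigma            # standard denoising-score-matching target

# Least-squares fit of theta in s_theta(x) = theta * x
theta = np.sum(xt * target) / np.sum(xt * xt)

print(theta)                      # ≈ -1/(1 + sigma**2) = -0.8
print(true_score(1.0, sigma))     # -0.8
```

With enough samples the learned theta recovers the true score's slope, which is the whole game: s_theta is the trainable stand-in for the unknown true score.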
Yeah, the article could definitely use some clarification in many parts (the author may be suffering from the curse of knowledge a bit). Plus there's the fact that even if you know your ODEs, SDEs are a different beast bringing in probability (certainly one may not be accustomed to seeing `p(x|y)` in the middle of a differential equation…)
I've noticed that when finance guys learn that there's a really useful AI thing called diffusion, they get all excited and start writing stochastic differential equations and drawing 1-D Brownian motion plots all over the place. It's not yet clear to me whether this helps anyone understand AI diffusion models, but they seem to enjoy it a lot.
I'm a controls guy, not a quant guy, but I've found the SDE perspective and this blog post incredibly helpful for understanding how a diffusion model works.
I personally find the SDEs the most intuitive, and the deterministic ODE / consistency models / rectified flow stuff as ideas that are easier to understand after the SDEs. But not everyone agrees!
Thanks for sharing this! I tend to agree that it’s easiest to understand this way.
I just find it a frustrating fact about modern machine learning that, in fact, the nice SDE interpretation is somehow not the "truth"; it's just a tool for understanding.
A much simpler exploration of this topic that I've always liked is the "Linear Diffusion" [0] example, which implements a basic diffusion model using only linear components. Given its simplicity, it gets surprisingly good results at generating digits.
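The core idea can be sketched in a few lines (my own sketch, not the linked [0] code): at each noise level the denoiser is just a linear map W, fit by least squares to predict the clean sample from the noisy one.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy correlated 2-D data: x0 ~ N(0, A A^T)
d, n = 2, 50_000
A = np.array([[2.0, 0.5], [0.5, 1.0]])
x0 = rng.normal(size=(n, d)) @ A.T

# One noise level of the forward process
sigma = 1.0
xt = x0 + sigma * rng.normal(size=(n, d))

# Closed-form least-squares "linear denoiser": min_W ||X_t W - X_0||^2
W = np.linalg.lstsq(xt, x0, rcond=None)[0]

# For Gaussian data this linear map is the optimal denoiser E[x0 | x_t];
# sampling would alternate adding noise and applying W at decreasing sigma.
denoised = xt @ W
print(np.mean((denoised - x0) ** 2) < np.mean((xt - x0) ** 2))  # True
```

A full linear-diffusion model just stacks one such map per noise level, which is why it stays so easy to inspect.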
At least the article warns you upfront of the sort of mathematical sophistication required to get some of the explanations. The author is a financial engineering sort, so their big thing is SDEs, and they assume (for some of the explanations) that you bring that sort of intuition with you. If the author were a signal processing type, they might use Kalman filter analogies, and a pure statistician would cite autocorrelation.
Don’t try to catch all the mathematical Pikachus in the paper, just choose the insights that resonate with you. Thankfully, there isn’t a pop quiz lurking at the end.
In honor of the Bay Area roots of HN, “believe it if you need it, if you don't, just pass it on”. I liked the paper even when skipping the SDE material.
I left the TENET references on the cutting room floor.
I too found it really surprising that the reverse-time equation has a simple closed form. Like, surely breaking a glass is easier than unbreaking it? That’s part of what got me interested in this stuff in the first place!
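For reference, the closed form in question is Anderson's 1982 reverse-time SDE (standard notation, not necessarily the article's): given the forward process, the only new ingredient needed to run it backwards is the score.

```latex
\text{forward: } \; dx = f(x,t)\,dt + g(t)\,dw
\qquad
\text{reverse: } \; dx = \left[ f(x,t) - g(t)^2 \,\nabla_x \log p_t(x) \right] dt + g(t)\,d\bar{w}
```

The unbreaking-the-glass trick is hidden in that score term: it encodes everything the model learns about where the data lives.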
If you haven’t seen it yet, highly recommend the blogs of Sander Dieleman & Yang Song (who co-invented the SDE interpretation).
I definitely feel some tears coming on.