Diffusion models learn a transformation operator. The parameters are adjusted so that the operator maximises the evidence lower bound (ELBO), or in other words, increases the likelihood of observing a slightly less noisy version of the input.
The guidance component is a vector representation of the text that changes where we are in the sample space. A change of position in the sample space changes the likelihood, so for different prompts, the likelihood of the same output image given the same input image will be different.
Since the model is trained to maximise the ELBO, it will produce a change that moves the sample closer to the prompt.
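To make the training side concrete, here is a minimal sketch of the simplified DDPM-style denoising objective. Everything here is illustrative: `model` stands in for any network taking (noisy image, timestep, text embedding), and the noise schedule values are the usual linear defaults, not any particular implementation.

```python
# Minimal sketch of the simplified DDPM training objective.
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)              # linear noise schedule (assumed)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def training_step(model, x0, text_emb):
    t = torch.randint(0, T, (x0.shape[0],))        # random timestep per sample
    a = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise   # forward (noising) process
    pred = model(x_t, t, text_emb)                 # predict the noise that was added
    return F.mse_loss(pred, noise)                 # simplified (reweighted) ELBO loss
```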
A good way to think about it is this: given a classifier, I can select a target class, compute the derivative of the target class's score with respect to the input, and apply that gradient to the input. This nudges the input closer to my target class.
From the perspective of some models (score models), the network produces the gradient of the (log-)density of the samples with respect to the input, so it's quite similar to computing a gradient through a classifier.
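A minimal sketch of that classifier-gradient intuition, assuming an arbitrary differentiable `classifier`; the function name and step size are illustrative:

```python
# Take the gradient of a target class score with respect to the *input*
# and step the input toward it.
import torch

def step_toward_class(classifier, x, target_class, lr=0.01):
    x = x.clone().requires_grad_(True)
    logits = classifier(x)
    score = logits[:, target_class].sum()   # how "target-like" x looks
    score.backward()                        # d(score)/d(input)
    with torch.no_grad():
        x += lr * x.grad                    # move x toward the target class
    return x.detach()
```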
The above was concerned with what the NN was doing.
The algorithm applies the operator for a number of steps, progressively improving the image. In some probabilistic models, you can think of this as stochastic gradient ascent (the inverse of a gradient descent procedure): a series of steps that, with some injected stochasticity, reach a high-density region.
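As a sketch, that noisy-gradient-ascent view is what unadjusted Langevin dynamics looks like. Here `score_model(x)` is assumed to return an estimate of the gradient of the log-density; the step size and step count are arbitrary placeholders:

```python
# Langevin-style sampling: repeated noisy steps up the log-density gradient.
import torch

def langevin_sample(score_model, shape, n_steps=500, step_size=1e-4):
    x = torch.randn(shape)                          # start from pure noise
    for _ in range(n_steps):
        noise = torch.randn_like(x)
        x = (x + step_size * score_model(x)
               + (2 * step_size) ** 0.5 * noise)    # gradient step + noise
    return x                                        # ends near high-density regions
```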
However, it turns out that learning this operation doesn’t have to be grounded in probability theory and graphical models.
As long as the NN learns a sufficiently good recovery operator, diffusion will construct something based on the properties of the dataset that has been used.
At no point, however, are there condensed representations of images, since the NN is not learning to produce an image from zero in one step. It merely learns to undo some operation applied to the input.
For the probabilistic view, read Denoising Diffusion Probabilistic Models and its references, in particular on Langevin dynamics. It includes citations to score models as well.
For the non-probabilistic component, read Cold Diffusion.
For using the classifier gradient to update an image towards another class, read about adversarial example generation via input gradients.
> A good way to think about it is this: given a classifier, I can select a target class, compute the derivative of the target class's score with respect to the input, and apply that gradient to the input. This nudges the input closer to my target class.
I wrote an OCR program in college. We split the dataset in half: you train it on one half, then test it against the other half.
You can train Stable Diffusion on half the images, but then what? You use the image descriptions of the other half and measure how similar the outputs are? In essence, attempting to reproduce exact replicas. But I guess even then it wouldn't be copyright infringement if those images weren't used in the model. It's more like me describing something vividly to you, asking you to paint it, and then getting angry at you because it's too accurate.
You would not need half of the images to perform that test. No more than a handful of images would be needed to prove that the text representation will not produce an identical image to a given image whose description has been fed in.
They don't even produce the same image twice from the same description and a different random seed.
Instead of aiming to reproduce exact replicas, you use a classifier and take the activations feeding into its last layer. Do that for both generated and original images, and then measure the differences in the statistics of those features.
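That's essentially the idea behind metrics like the Fréchet Inception Distance. A rough sketch of the statistics comparison, assuming the penultimate-layer features have already been extracted (the function name and array shapes are illustrative):

```python
# Compare the feature statistics of real vs. generated images
# (mean + covariance of penultimate-layer activations).
import numpy as np
from scipy import linalg

def frechet_distance(real_feats, gen_feats):
    # real_feats, gen_feats: (N, D) arrays of penultimate-layer activations
    mu_r, mu_g = real_feats.mean(0), gen_feats.mean(0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_g)           # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real                      # drop numerical imaginary parts
    diff = mu_r - mu_g
    return diff @ diff + np.trace(cov_r + cov_g - 2 * covmean)
```

A small distance means the generated set matches the statistics of the originals without any single output having to be a replica of any single training image.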