
How does it work then? :)



Diffusion models learn a transformation operator. The parameters are adjusted so that the operator maximises the evidence lower bound, or in other words, increases the likelihood of observing a slightly less noisy version of the input.
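
To make that concrete, here is a minimal sketch of one training step under the common noise-prediction parameterisation (PyTorch assumed; eps_model, alpha_bar and the optimizer are illustrative placeholders, not any particular implementation):

    import torch
    import torch.nn.functional as F

    def training_step(eps_model, x0, alpha_bar, optimizer):
        # x0: batch of clean images; alpha_bar: (T,) cumulative noise schedule
        t = torch.randint(0, alpha_bar.shape[0], (x0.shape[0],))   # random timestep per sample
        a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))         # broadcast over image dims
        noise = torch.randn_like(x0)
        x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise               # corrupt the input
        pred = eps_model(x_t, t)                                   # predict the added noise
        loss = F.mse_loss(pred, noise)                             # simple surrogate for the ELBO term
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()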

The guidance component is a vector representation of the text that changes where we are in the sample space. A change in the sample space changes the likelihood, so for different prompts the likelihood of the same output image, given the same input image, will be different.

Since the model is trained to maximise the ELBO, it will produce a change that moves the image closer to the prompt.
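
One common way this plays out in practice is classifier-free guidance: the model is asked for a prediction with and without the text embedding, and the difference is amplified so the update leans towards the prompt. A rough sketch (the names and the call signature of eps_model are hypothetical):

    def guided_eps(eps_model, x_t, t, text_emb, guidance_scale=7.5):
        eps_uncond = eps_model(x_t, t, cond=None)       # prediction ignoring the prompt
        eps_cond = eps_model(x_t, t, cond=text_emb)     # prediction given the prompt embedding
        # amplify the direction the prompt adds on top of the unconditional prediction
        return eps_uncond + guidance_scale * (eps_cond - eps_uncond)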

A good way to think about it is this: given a classifier, I can pick a target class, compute the derivative of that class's score with respect to the input, and apply that gradient to the input. This moves the input closer to my target class.
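
A toy version of that classifier analogy (PyTorch assumed; classifier, image and target_class are placeholders):

    import torch

    def nudge_towards_class(classifier, image, target_class, step_size=0.01):
        # image: a single input with a batch dimension, e.g. shape (1, C, H, W)
        x = image.clone().detach().requires_grad_(True)
        score = classifier(x)[0, target_class]    # scalar score of the chosen class
        score.backward()                          # gradient of that score w.r.t. the input
        with torch.no_grad():
            x = x + step_size * x.grad            # step the input towards the target class
        return x.detach()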

From the perspective of some models (score models), the network produces the gradient of the log density of the samples, so it's quite similar to computing a gradient via a classifier.

The above was concerned with what the NN was doing.

The algorithm applies the operator for a number of steps and progressively improves the image. In some probabilistic models, you can think of this as something like an inverse of a stochastic gradient descent procedure: a series of steps that, with some stochasticity, reaches a high-density region of the sample space.
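
A sketch of that iterative procedure in its score-model / Langevin form (score_model, the number of steps and the step size are illustrative):

    import torch

    def sample(score_model, shape, n_steps=1000, step_size=1e-4):
        x = torch.randn(shape)                     # start from pure noise
        for _ in range(n_steps):
            score = score_model(x)                 # estimate of grad log p(x)
            noise = torch.randn_like(x)
            # gradient-ascent-like step on the log density, plus injected stochasticity
            x = x + step_size * score + (2 * step_size) ** 0.5 * noise
        return x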

However, it turns out that learning this operation doesn’t have to be grounded in probability theory and graphical models.

As long as the NN learns a sufficiently good recovery operator, diffusion will construct something based on the properties of the dataset that has been used.

At no point, however, are there condensed representations of the training images, since the NN is not learning to produce an image from scratch in one step. It merely learns to undo an operation that was applied to its input.

For the probabilistic view, read Denoising Diffusion Probabilistic Models and its references, in particular on Langevin dynamics. It includes citations to score models as well.

For the non-probabilistic view, read Cold Diffusion.

For using the classifier gradient to update an image towards another class, read about adversarial generation via input gradients.


> A good way to think about it is this: given a classifier, I can pick a target class, compute the derivative of that class's score with respect to the input, and apply that gradient to the input. This moves the input closer to my target class.

excellent description, thanks


It's the other way around: the training images are there to help the model generalize features.

Reproducing parts of existing images in the dataset is called overfitting and is considered a failure of the model.


how do you measure success?

i wrote an OCR program in college. we split the data set in half. you train it on one half then test it against the other half.

you can train stable diffusion on half the images, but then what? you use the image descriptions of the other half and measure how similar they are? in essence, attempting to reproduce exact replicas. but i guess even then it wouldn't be copyright infringement if those images weren't used in the model. more like me describing something vividly to you, asking you to paint it, and then getting angry at you because it's too accurate


You would not need half of the images to perform that test. A handful of images is enough to show that the text representation of a given image's description will not reproduce that image identically.

They don't even produce the same image twice from the same description when the random seed differs.


The FID (Fréchet Inception Distance) score is one measure of success.

Instead of aiming to reproduce exact replicas, you run both the generated and the original images through a pretrained classifier and take the activations feeding its last layer, then measure the difference in the statistics of those activations.
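
A compact sketch of the computation (feats_real and feats_fake are assumed to be (n_samples, dim) arrays of activations from a pretrained network, e.g. an Inception pooling layer):

    import numpy as np
    from scipy.linalg import sqrtm

    def fid(feats_real, feats_fake):
        mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
        cov1 = np.cov(feats_real, rowvar=False)
        cov2 = np.cov(feats_fake, rowvar=False)
        covmean = sqrtm(cov1 @ cov2)
        if np.iscomplexobj(covmean):               # numerical error can leave tiny imaginary parts
            covmean = covmean.real
        return float(np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2 * covmean))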

Wikipedia has a good article on this.


Computerphile has friendly introductions to just about everything: https://youtu.be/1CIpzeNxIhU



