Thought I'd share this for others interested. I've modified txt2img to save an image after each step (it's actually quite easy, as you can pass an img_callback to the sampler).
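For anyone who wants to try it, here's a minimal sketch of the idea, assuming the stock CompVis txt2img.py setup where the DDIMSampler's sample() accepts an img_callback. Exact argument names may differ between versions, and `model`, `c`, `uc`, `scale`, and `ddim_steps` are the variables from that script:

    import torch
    from einops import rearrange
    from PIL import Image

    # `model` is the loaded LatentDiffusion model and `sampler` the DDIMSampler,
    # as set up in the stock txt2img script. The callback receives the current
    # latent prediction plus the step index.
    def save_intermediate(pred_x0, step):
        with torch.no_grad():
            # Decode the latent back to pixel space (this is the slow part).
            img = model.decode_first_stage(pred_x0)
            img = torch.clamp((img + 1.0) / 2.0, min=0.0, max=1.0)
            arr = 255.0 * rearrange(img[0].cpu().numpy(), "c h w -> h w c")
            Image.fromarray(arr.astype("uint8")).save(f"step_{step:03d}.png")

    samples, _ = sampler.sample(
        S=ddim_steps,
        conditioning=c,
        batch_size=1,
        shape=[4, 64, 64],                   # latent shape for a 512x512 image
        unconditional_guidance_scale=scale,
        unconditional_conditioning=uc,
        eta=0.0,
        img_callback=save_intermediate,      # called once per sampling step
    )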
Interestingly, both of these runs use the same seed and prompt, yet they yield different final images; the only difference is the number of DDIM sampling steps. I'd love to understand why, if anyone has any idea.
Interesting. I suppose GPUs could calculate things differently. I just checked: I can rerun both the 40- and 80-step runs and the final images are bit-identical to the first runs. So at least in my setup the same parameters are deterministic, but changing the number of DDIM sampling steps changes the result.
Maybe it's doing something fancy with the total number of steps, beyond just stopping after the count is reached.
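That's more or less it, I think: DDIM doesn't just truncate the 1000 training timesteps, it picks an evenly spaced subset of them, so changing the step count changes which noise levels are visited and therefore the whole trajectory. A rough sketch of the selection (the real code is in ldm/modules/diffusionmodules/util.py, and details may differ):

    import numpy as np

    def ddim_timesteps(num_ddim_steps, num_ddpm_steps=1000):
        # Evenly strided subset of the training timesteps (plus one,
        # mirroring what the "uniform" schedule in the repo does).
        stride = num_ddpm_steps // num_ddim_steps
        return np.arange(0, num_ddpm_steps, stride) + 1

    print(ddim_timesteps(40)[:5])   # [  1  26  51  76 101]
    print(ddim_timesteps(80)[:5])   # [ 1 13 25 37 49] -> different noise levels visited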
Most GPU computations are non-deterministic by default; I learned this the hard way doing deep learning on pathology data.
This is done for performance reasons. You can set a flag in PyTorch/CUDA to disable it, which comes at the cost of speed.
Can you explain? How much does it actually affect results in extreme cases? The source of non-determinism doesn't seem to be the GPU itself so much as parallelism and dynamic allocation in the frameworks. (It also seems that some parts of PyTorch still raise a runtime error if you request a deterministic version.) Are there other, more performant deterministic DL frameworks?
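For reference, this is roughly what those switches look like in recent PyTorch versions (seed handling included for completeness); use_deterministic_algorithms(True) is also the call that raises a RuntimeError when an op simply has no deterministic implementation:

    import os
    import random
    import numpy as np
    import torch

    def make_deterministic(seed=42):
        # Seed every RNG the pipeline might touch.
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)              # seeds CPU and all CUDA devices

        # Force cuDNN to pick deterministic kernels (slower, no autotuning).
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

        # Needed for deterministic cuBLAS matmuls on recent CUDA versions.
        os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

        # Raises RuntimeError at runtime if an op has no deterministic kernel.
        torch.use_deterministic_algorithms(True)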
You can set both the seed and the number of inference steps when running Stable Diffusion locally (or in Google Colab). I assumed that they just set a seed and then generated the image at each inference step. With a decent GPU, it's only going to take a few minutes.
You could definitely modify it to output at each step, but the output step takes a relatively long time, so it would slow down the process.
I have a couple of examples of this (though I only go through ~20 steps) in a talk I gave a couple of days ago: one in the middle with a horse, and one at the very end generating a person.
The take-away here is that steps beyond ~60 do nothing to the image; you're just burning GPU cycles.
It's basically just fallen into a local minimum in the latent space, and nothing will ever change, no matter how many steps you add.
A technical benefit of this kind of approach is that you can compute a frame-to-frame diff as you generate and stop early once you've hit a steady state, instead of having to pick an arbitrary number of steps.
You don't even need to do a diff; the model itself actually predicts the diff (which is how it samples the image), so you could just stop once the model is predicting close to zero.
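Roughly like this, I imagine; a rough sketch (not actual repo code) that watches how much the predicted clean latent moves between steps, where `threshold` is an arbitrary value you'd have to tune:

    import torch

    class EarlyStop:
        """Watch the img_callback stream and flag when pred_x0 stops moving."""

        def __init__(self, threshold=1e-3):
            self.threshold = threshold
            self.prev = None
            self.converged_at = None

        def __call__(self, pred_x0, step):
            if self.prev is not None and self.converged_at is None:
                # Mean absolute frame-to-frame change in the predicted latent.
                delta = (pred_x0 - self.prev).abs().mean().item()
                if delta < self.threshold:
                    self.converged_at = step
                    print(f"Converged at step {step} (delta={delta:.2e})")
            self.prev = pred_x0.detach().clone()

    # Pass an instance as img_callback; you'd still have to modify the sampler
    # loop to actually break out early (the stock DDIMSampler has no exit hook).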
> The take-away here is that steps beyond ~60 do nothing to the image; you're just burning GPU cycles.
For these pictures, at least. Depending on your prompt, more steps can be beneficial (say you're trying to make infinitely detailed fractal monster landscapes), or they might hurt, especially with DDIM, which seems to overfit a lot.
I don't really understand: what is the process or mechanism for when it is "happy" with something? At some point it "liked" the way the armor was coming together and then refined it only in small amounts.
The neural network always tries to predict the final image, but the diffusion process takes the vector and shrinks it, then turns it into a distribution by adding Gaussian noise, so if the model makes a decision, it may not be the final one.
Edit: I thought you were concerned about the model changing its decision; the model has a fixed number of steps it can take, and this affects how much the diffusion process can shrink the vector from the UNet (the neural network).
What is happening is that the noisier/earlier timesteps are responsible for low-frequency features, while the final timesteps are responsible for high-frequency features.
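To make that concrete: in the forward process the latent x_0 is scaled down and mixed with Gaussian noise, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, and alpha_bar_t shrinks as t grows. A tiny illustration with the standard DDPM linear beta schedule (values are illustrative, not read from the Stable Diffusion config):

    import torch

    # Standard DDPM linear beta schedule (T = 1000 steps).
    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # alpha-bar_t

    for t in [50, 500, 950]:
        signal = alphas_cumprod[t].sqrt().item()       # weight on x_0
        noise = (1 - alphas_cumprod[t]).sqrt().item()  # weight on the Gaussian noise
        print(f"t={t:4d}  signal={signal:.3f}  noise={noise:.3f}")

    # High t (early in sampling): mostly noise, so those steps fix the coarse layout.
    # Low t (late in sampling):   mostly signal, so those steps can only nudge detail.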
Caveat: while I believe there's nothing nefarious about this notebook... I don't know whether there are security risks with random Colab notebooks.
I would be interested to see what varying levels of noise added to these intermediate steps produce. In this example it looks like it kind of decided what it was drawing in the transition from steps 11 to 14 and then refined it.
Would a little noise added around that point produce a subtly different result, diverge widely, or simply slow the refining process?
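That would be a fun experiment. A rough sketch of how one might try it, using a small helper to nudge the running latent inside the sampler's denoising loop (the step index and sigma here are arbitrary guesses, and `img`/`i` are the loop variables in the CompVis DDIM sampler, if I'm reading it right):

    import torch

    def perturb(latent, sigma=0.05, seed=None):
        """Add a small amount of Gaussian noise to the running latent."""
        gen = torch.Generator(device=latent.device)
        if seed is not None:
            gen.manual_seed(seed)
        noise = torch.randn(latent.shape, generator=gen,
                            device=latent.device, dtype=latent.dtype)
        return latent + sigma * noise

    # Inside the sampler's denoising loop you could do something like:
    #
    #     if i == 12:                      # around the step where it "decided"
    #         img = perturb(img, sigma=0.05, seed=0)
    #
    # then rerun with a few sigma values and diff the final outputs.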
I'm astonished at the evolution of the image and reminded of a documentary I saw about Picasso (I think). He would paint the same painting again and again, tweaking it slightly each time until he was satisfied.
I went to a Picasso exhibition in Madrid once where they had an entire, huge, long room filled with every sketch and painting he'd done in preparation for Guernica, plus Guernica itself of course. It was eye-opening to say the least, especially as so many of the prep-work pieces were not in his typical style. Some were in an absolutely beautifully detailed realist style, and at that point I'd never seen a Picasso that wasn't cubist so it really stuck with me.
It is fed in using a fascinating mechanism called "cross-attention", which originated in the Transformer NN architecture that was used to achieve state-of-the-art translation. It works something like associative memory: the NN inside Stable Diffusion that generates the image (a UNet working in latent space) "asks" the whole encoded prompt, at almost every step, to provide data at various positions, using query vectors Q that are matched against key vectors K and value vectors V [0].
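In case a concrete picture helps, here's a stripped-down sketch of a single cross-attention step, with shapes and weights made up for illustration (I believe the real implementation lives in ldm/modules/attention.py and adds multiple heads, learned projections per layer, and so on):

    import torch
    import torch.nn.functional as F

    def cross_attention(image_tokens, text_tokens, w_q, w_k, w_v):
        # image_tokens: (n_pixels, d_img)  -- flattened UNet feature map
        # text_tokens:  (n_words,  d_txt)  -- encoded prompt (CLIP output)
        q = image_tokens @ w_q             # queries come from the image side
        k = text_tokens @ w_k              # keys come from the prompt
        v = text_tokens @ w_v              # values come from the prompt

        # Every image position scores every prompt token, then takes a
        # weighted mixture of the prompt's value vectors.
        attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)   # (n_pixels, n_words)
        return attn @ v                                          # (n_pixels, d_head)

    # Toy shapes: 4096 latent positions (64x64), 77 prompt tokens.
    img_feats = torch.randn(4096, 320)
    txt_feats = torch.randn(77, 768)
    out = cross_attention(img_feats, txt_feats,
                          torch.randn(320, 64), torch.randn(768, 64), torch.randn(768, 64))
    print(out.shape)    # torch.Size([4096, 64])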
How Stable Diffusion works as a whole [1] is not really hard to comprehend at a high level, but you'll need some prerequisites. The underlying probability theory is explained in Variational Autoencoders [2]. Diffusion Models [3] then built a really cool "deep variational" autoencoder out of many small noise/denoise steps, using largely the same math (variational inference), but they were unwieldy because they operated in pixel space. After that, Latent Diffusion Models [4] democratized the approach by vastly reducing the amount of computation needed, operating in latent space instead (by the way, that's why the images in this HN post look so cool: the denoising is not in pixel space!).
I understand neural networks, embeddings, convolutions, etc. The part that's unclear to me is specifically how the textual embeddings are linked into the img-to-img network that is trying to reduce the noise. In other words, I'm missing how the process is 'conditioned on' the text. (I lack the same understanding for conditional GANs as well.)
If the answer is just that the textual embeddings are also fed as simple inputs to the network, then I already understand.