Thought I'd share this for others interested. I've modified txt2img to save an image after each step (it's actually quite easy, as you can pass an img_callback to the sampler).
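For anyone who wants to try it, here's a minimal sketch of the idea, assuming the stock CompVis txt2img.py setup where the DDIMSampler's sample() accepts an img_callback. Exact argument names may differ between versions, and `model`, `c`, `uc`, `scale`, and `ddim_steps` are the variables from that script:

    import torch
    from einops import rearrange
    from PIL import Image

    # `model` is the loaded LatentDiffusion model and `sampler` the DDIMSampler,
    # as set up in the stock txt2img script. The callback receives the current
    # latent prediction plus the step index.
    def save_intermediate(pred_x0, step):
        with torch.no_grad():
            # Decode the latent back to pixel space (this is the slow part).
            img = model.decode_first_stage(pred_x0)
            img = torch.clamp((img + 1.0) / 2.0, min=0.0, max=1.0)
            arr = 255.0 * rearrange(img[0].cpu().numpy(), "c h w -> h w c")
            Image.fromarray(arr.astype("uint8")).save(f"step_{step:03d}.png")

    samples, _ = sampler.sample(
        S=ddim_steps,
        conditioning=c,
        batch_size=1,
        shape=[4, 64, 64],                   # latent shape for a 512x512 image
        unconditional_guidance_scale=scale,
        unconditional_conditioning=uc,
        eta=0.0,
        img_callback=save_intermediate,      # called once per sampling step
    )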
Interestingly, both of these runs use the same seed and prompt, yet they yield different final images; the only difference is the number of DDIM sampling steps. I'd love to understand why, if anyone has any idea.
Interesting. I suppose GPUs could calculate things differently. I just checked: I can rerun both the 40- and 80-step runs and the final images are bit-identical to the first runs. So at least in my setup the same parameters are deterministic, but changing the number of DDIM sampling steps changes the result.
Maybe it's doing something fancy with the total number of steps, beyond just stopping after the count is reached.
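That's more or less it, I think: DDIM doesn't just truncate the 1000 training timesteps, it picks an evenly spaced subset of them, so changing the step count changes which noise levels are visited and therefore the whole trajectory. A rough sketch of the selection (the real code is in ldm/modules/diffusionmodules/util.py, and details may differ):

    import numpy as np

    def ddim_timesteps(num_ddim_steps, num_ddpm_steps=1000):
        # Evenly strided subset of the training timesteps (plus one,
        # mirroring what the "uniform" schedule in the repo does).
        stride = num_ddpm_steps // num_ddim_steps
        return np.arange(0, num_ddpm_steps, stride) + 1

    print(ddim_timesteps(40)[:5])   # [  1  26  51  76 101]
    print(ddim_timesteps(80)[:5])   # [ 1 13 25 37 49] -> different noise levels visited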
Most GPU computations are non-deterministic by default; I learned this the hard way doing deep learning on pathology data.
This is done for performance reasons. You can set a flag in PyTorch/CUDA to disable it, which comes at the cost of speed.
Can you explain? How much does it actually affect results in extreme cases? The source of non-determinism doesn't seem to be the GPU itself so much as parallelism and dynamic allocation in the frameworks. (It also seems that some parts of PyTorch still raise a runtime error if you request a deterministic version.) Are there other, more performant deterministic DL frameworks?
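For reference, this is roughly what those switches look like in recent PyTorch versions (seed handling included for completeness); use_deterministic_algorithms(True) is also the call that raises a RuntimeError when an op simply has no deterministic implementation:

    import os
    import random
    import numpy as np
    import torch

    def make_deterministic(seed=42):
        # Seed every RNG the pipeline might touch.
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)              # seeds CPU and all CUDA devices

        # Force cuDNN to pick deterministic kernels (slower, no autotuning).
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

        # Needed for deterministic cuBLAS matmuls on recent CUDA versions.
        os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

        # Raises RuntimeError at runtime if an op has no deterministic kernel.
        torch.use_deterministic_algorithms(True)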
You can set both the seed and the number of inference steps when running Stable Diffusion locally (or in Google Colab). I assumed that they just set a seed and then generated the image at each inference step. With a decent GPU, it's only going to take a few minutes.
You could definitely modify it to output at each step, but the output step takes a relatively long time, so it would slow down the process.
I have a couple of examples of this (though I only go through ~20 steps) in a talk I gave a couple of days ago: one in the middle with a horse, and one at the very end generating a person.
The take-away here is that steps beyond ~60 do nothing to the image; you're just burning GPU cycles.
It's basically just fallen into a local minimum in the latent space, and nothing will ever change, no matter how many steps you add.
A technical benefit of this kind of approach is that you can compute a frame-to-frame diff as you generate and stop early once you've hit a steady state, instead of having to pick an arbitrary number of steps.
You don't even need to do a diff; the model itself actually predicts the diff (which is how it samples the image), so you could just stop once the model is predicting close to zero.
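Roughly like this, I imagine; a rough sketch (not actual repo code) that watches how much the predicted clean latent moves between steps, where `threshold` is an arbitrary value you'd have to tune:

    import torch

    class EarlyStop:
        """Watch the img_callback stream and flag when pred_x0 stops moving."""

        def __init__(self, threshold=1e-3):
            self.threshold = threshold
            self.prev = None
            self.converged_at = None

        def __call__(self, pred_x0, step):
            if self.prev is not None and self.converged_at is None:
                # Mean absolute frame-to-frame change in the predicted latent.
                delta = (pred_x0 - self.prev).abs().mean().item()
                if delta < self.threshold:
                    self.converged_at = step
                    print(f"Converged at step {step} (delta={delta:.2e})")
            self.prev = pred_x0.detach().clone()

    # Pass an instance as img_callback; you'd still have to modify the sampler
    # loop to actually break out early (the stock DDIMSampler has no exit hook).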
> The take-away here is that steps beyond ~60 do nothing to the image; you're just burning GPU cycles.
For these pictures, at least. Depending on your prompt, more steps can be beneficial (say you're trying to make infinitely detailed fractal monster landscapes), or they might hurt, especially with DDIM, which seems to overfit a lot.
I don't really understand: what is the process or mechanism for when it is "happy" with something? At some point it "liked" the way the armor was coming together and then refined it only in small amounts.
The neural network always tries to predict the final image, but the diffusion process takes the vector and shrinks it, then turns it into a distribution by adding Gaussian noise, so if the model makes a decision, it may not be the final one.
Edit: I thought you were concerned about the model changing its decision; the model has a fixed number of steps it can take, and this affects how much the diffusion process can shrink the vector from the UNet (the neural network).
What is happening is that the noisier/earlier timesteps are responsible for low-frequency features, while the final timesteps are responsible for high-frequency features.
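To make that concrete: in the forward process the latent x_0 is scaled down and mixed with Gaussian noise, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, and alpha_bar_t shrinks as t grows. A tiny illustration with the standard DDPM linear beta schedule (values are illustrative, not read from the Stable Diffusion config):

    import torch

    # Standard DDPM linear beta schedule (T = 1000 steps).
    T = 1000
    betas = torch.linspace(1e-4, 0.02, T)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # alpha-bar_t

    for t in [50, 500, 950]:
        signal = alphas_cumprod[t].sqrt().item()       # weight on x_0
        noise = (1 - alphas_cumprod[t]).sqrt().item()  # weight on the Gaussian noise
        print(f"t={t:4d}  signal={signal:.3f}  noise={noise:.3f}")

    # High t (early in sampling): mostly noise, so those steps fix the coarse layout.
    # Low t (late in sampling):   mostly signal, so those steps can only nudge detail.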
Caveat: while I believe there's nothing nefarious about this notebook... I don't know whether there are security risks with random Colab notebooks.
I would be interested to see what varying levels of noise added to these intermediate steps produce. In this example it looks like it kind of decided what it was drawing in the transition from steps 11 to 14 and then refined it.
Would a little noise added around that point produce a subtly different result, diverge widely, or simply slow the refining process?
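That would be a fun experiment. A rough sketch of how one might try it, using a small helper to nudge the running latent inside the sampler's denoising loop (the step index and sigma here are arbitrary guesses, and `img`/`i` are the loop variables in the CompVis DDIM sampler, if I'm reading it right):

    import torch

    def perturb(latent, sigma=0.05, seed=None):
        """Add a small amount of Gaussian noise to the running latent."""
        gen = torch.Generator(device=latent.device)
        if seed is not None:
            gen.manual_seed(seed)
        noise = torch.randn(latent.shape, generator=gen,
                            device=latent.device, dtype=latent.dtype)
        return latent + sigma * noise

    # Inside the sampler's denoising loop you could do something like:
    #
    #     if i == 12:                      # around the step where it "decided"
    #         img = perturb(img, sigma=0.05, seed=0)
    #
    # then rerun with a few sigma values and diff the final outputs.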
I'm astonished at the evolution of the image and reminded of a documentary I saw about Picasso (I think). He would paint the same painting again and again, tweaking it slightly each time until he was satisfied.
I went to a Picasso exhibition in Madrid once where they had an entire, huge, long room filled with every sketch and painting he'd done in preparation for Guernica, plus Guernica itself of course. It was eye-opening to say the least, especially as so many of the prep-work pieces were not in his typical style. Some were in an absolutely beautifully detailed realist style, and at that point I'd never seen a Picasso that wasn't cubist so it really stuck with me.
It is fed in using a fascinating mechanism called "cross-attention", which originated in the Transformer NN architecture that was used to achieve state-of-the-art translation. It works something like associative memory: the NN inside Stable Diffusion that generates the image (a UNet working in latent space) "asks" the whole encoded prompt, at almost every step, to provide data at various positions, using query vectors Q that are matched against key vectors K and value vectors V [0].
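In case a concrete picture helps, here's a stripped-down sketch of a single cross-attention step, with shapes and weights made up for illustration (I believe the real implementation lives in ldm/modules/attention.py and adds multiple heads, learned projections per layer, and so on):

    import torch
    import torch.nn.functional as F

    def cross_attention(image_tokens, text_tokens, w_q, w_k, w_v):
        # image_tokens: (n_pixels, d_img)  -- flattened UNet feature map
        # text_tokens:  (n_words,  d_txt)  -- encoded prompt (CLIP output)
        q = image_tokens @ w_q             # queries come from the image side
        k = text_tokens @ w_k              # keys come from the prompt
        v = text_tokens @ w_v              # values come from the prompt

        # Every image position scores every prompt token, then takes a
        # weighted mixture of the prompt's value vectors.
        attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)   # (n_pixels, n_words)
        return attn @ v                                          # (n_pixels, d_head)

    # Toy shapes: 4096 latent positions (64x64), 77 prompt tokens.
    img_feats = torch.randn(4096, 320)
    txt_feats = torch.randn(77, 768)
    out = cross_attention(img_feats, txt_feats,
                          torch.randn(320, 64), torch.randn(768, 64), torch.randn(768, 64))
    print(out.shape)    # torch.Size([4096, 64])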
How Stable Diffusion works as a whole [1] is not really hard to comprehend at a high level, but you'll need some prerequisites. The underlying probability theory is explained in Variational Autoencoders [2]. Diffusion Models [3] then built a really cool "deep variational" autoencoder out of many small noise/denoise steps, using largely the same math (variational inference), but they were unwieldy because they operated in pixel space. After that, Latent Diffusion Models [4] democratized the approach by vastly reducing the amount of computation needed, operating in latent space instead (by the way, that's why the images in this HN post look so cool: the denoising is not in pixel space!).
I understand neural networks, embeddings, convolutions, etc. The part that's unclear to me is specifically how the textual embeddings are linked into the img-to-img network that is trying to reduce the noise. In other words, I'm missing how the process is 'conditioned on' the text. (I lack the same understanding for conditional GANs as well.)
If the answer is just that the textual embeddings are also fed as simple inputs to the network, then I already understand.