Stable Diffusion forming images from text: image snapshots at each step (postimg.cc)
146 points by TheMiddleMan on Sept 2, 2022 | 44 comments



Another gallery with 80 ddim steps: https://postimg.cc/gallery/b1kn7yd

Thought I'd share this for others interested. I've modified txt2img to save an image after each step. (Actually quite easy, as you can pass an img_callback to the sampler.)
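
Here's roughly what the change looks like, as a sketch against the CompVis scripts/txt2img.py and its DDIMSampler (which calls img_callback with the current latent and step index); exact signatures may differ by version:

    import torch
    import numpy as np
    from PIL import Image

    def make_step_saver(model, outdir):
        # Returns a callback suitable for DDIMSampler.sample(img_callback=...)
        def save_step(pred_x0, step):
            with torch.no_grad():
                img = model.decode_first_stage(pred_x0)   # latent -> pixel space
            img = torch.clamp((img + 1.0) / 2.0, min=0.0, max=1.0)
            arr = (255.0 * img[0].cpu().permute(1, 2, 0).numpy()).astype(np.uint8)
            Image.fromarray(arr).save(f"{outdir}/step_{step:03d}.png")
        return save_step

    # inside txt2img.py, roughly:
    # samples, _ = sampler.sample(S=opt.ddim_steps, conditioning=c, batch_size=1,
    #                             shape=shape, verbose=False,
    #                             img_callback=make_step_saver(model, outpath))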

Interestingly, both of these runs use the same seed and prompt, yet they yield different final images; the only difference is the number of ddim sampling steps. I'd love to understand why if anyone has any idea.


A couple of replies to https://news.ycombinator.com/item?id=32634807 suggest some sources of non-determinism.


Interesting. I suppose GPUs could calculate things differently. I just checked: I can rerun both the 40- and 80-step runs and the final images are bit-identical to the first runs. So at least in my scenario the same parameters are deterministic, but changing the number of ddim sampling steps changes the result.

Maybe it's doing something fancy with the total number of steps, beyond just stopping after the count is reached.


It will most likely render differently on different hardware, since GPU float math is nondeterministic across different environments.


How are cloud GPU providers handling this then? Do the fancy A100 chips solve this?


aaaand there goes people trying to turn this into a 99% compression system


Each sampling step runs at a specific noise scale; with fewer steps, some of the intermediate scales are skipped.
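
To illustrate, here is a rough sketch of how a uniformly spaced DDIM schedule picks timesteps (real implementations differ in stride/offset details):

    import numpy as np

    def ddim_timesteps(num_steps, num_train_steps=1000):
        # uniformly spaced subset of the training timesteps (sketch)
        return np.linspace(0, num_train_steps - 1, num_steps).round().astype(int)

    steps_40 = ddim_timesteps(40)
    steps_80 = ddim_timesteps(80)
    # The two runs visit different noise levels (the 40-step schedule is not a
    # subset of the 80-step one), so even with the same seed the denoising
    # trajectories, and therefore the final images, diverge.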


Ah I see, it makes bigger leaps each step to try to get to the same end result in fewer total steps. That makes sense, assuming I have it right.


Most GPUs are non-deterministic - learned this the hard way doing deep learning on pathology data. This is for optimization purposes. In fact, you can set a flag in PyTorch / CUDA to disable this, which comes at the cost of performance.
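
For reference, these are the sort of switches involved (PyTorch; note this still doesn't guarantee identical results across different GPUs or library versions):

    import os
    import torch

    # must be set before CUDA kernels launch; required by some cuBLAS ops
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

    torch.use_deterministic_algorithms(True)   # errors on ops with no deterministic kernel
    torch.backends.cudnn.deterministic = True  # pick deterministic cuDNN algorithms
    torch.backends.cudnn.benchmark = False     # disable autotuning, another non-determinism source
    torch.manual_seed(948574399)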


Can you explain? How much does it actually affect results in extreme cases? The source of non-determinism doesn't seem to be the GPU itself so much as parallelism and dynamic allocation in the frameworks. (Also, it seems that some parts of PyTorch still raise a runtime error if you request a deterministic version.) Are there other, more performant deterministic DL frameworks?


Do you have a diff/patch of the change to do this?

I may try understanding both StableDiffusion and Python enough to do it, but if you already solved it - that'll be appreciated :)


You can set both the seed and the number of inference steps when running Stable Diffusion locally (or in Google Colab). I assumed that they just set a seed and then generated the image at each inference step. With a decent GPU, it's only going to take a few minutes.

You could definitely modify it to output at each step, but the output step takes a relatively long time, so it would slow down the process.
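
For example, with the Hugging Face diffusers pipeline it looks roughly like this (model ID and arguments are illustrative and may vary by version):

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16
    ).to("cuda")

    generator = torch.Generator("cuda").manual_seed(948574399)   # seed from the OP
    image = pipe(
        "monkey astronaut, bright, bright, bright, bright",      # prompt from the OP
        num_inference_steps=40,
        generator=generator,
    ).images[0]
    image.save("monkey_astronaut_40_steps.png")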


I have a couple examples of such--though I only go through ~20 steps--in a talk I gave a couple days ago (one in the middle with a horse, and one at the very end generating a person).

https://twitter.com/saurik/status/1565728123705966592?s=21


Saurik, spotted in the wild. Off-topic, but thanks for all the work you've done over the years for the iOS jailbreak community. Hope all is well.


Saurik!


Wonder if he enjoys this.


The take-away here is that steps above ~60 do nothing to the image; you're just burning GPU cycles.

It's basically just fallen into a local minimum in the latent space and nothing will ever change, no matter how many steps you add.

A technical benefit of this kind of approach is that you can compute a frame-to-frame diff as you're generating and stop early once you've hit a steady state, instead of having to pick an arbitrary number of steps.


You don't even need to do a diff: the model itself actually predicts the diff (which is how it samples the image), so you could just stop once the model is predicting close to 0.
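
Something like this, as a sketch of a generic sampling loop with a convergence check (denoise_step and the tolerance are hypothetical, not an option existing samplers expose):

    import torch

    def sample_with_early_stop(model, x, timesteps, tol=1e-3):
        # Stop once consecutive latents barely change, instead of always
        # running a fixed number of steps.
        prev = x
        for t in timesteps:
            x = model.denoise_step(x, t)                # hypothetical one-step update
            if (x - prev).abs().mean().item() < tol:    # steady state reached
                break
            prev = x
        return x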


Is there a parameter for this, to stop when subsequent iterations do nothing?


> The take-away here is that steps above ~60 do nothing to the image; you're just burning GPU cycles.

I would expect both the scheduler and prompt in-use to have a significant effect on this.


> The take-away here is that steps above ~60 do nothing to the image; you're just burning GPU cycles.

On these pictures. Depending on your prompt, more steps can be beneficial—let's say you're trying to make infinitely detailed fractal monster landscapes—or it might hurt, especially with DDIM, which seems to overfit a lot.


An even better example using Midjourney beta (so Stable Diffusion): https://www.reddit.com/r/deepdream/comments/ww7ubl/the_genes...


I don't really understand: what is the process or mechanism for when it is happy with something? At some point it "liked" the way the armor was coming together and refined it in small amounts only.


The neural network always tries to predict the final image, but the diffusion process takes that vector, shrinks it, and then turns it into a distribution by adding Gaussian noise, so even if the model makes a decision, it may not be the final one.

Edit: I thought you were concerned about the model changing its decision; the model has a defined number of steps it can take, and this affects the amount by which the diffusion process can shrink the vector from the UNet (the neural network).
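
Concretely, one deterministic DDIM update looks roughly like this (a sketch of the standard equations, not the exact repo code): the UNet predicts the noise, an estimate of the final image is formed from it, and that estimate is pushed back to the next, lower noise level, so later steps can still revise it.

    import torch

    def ddim_step(eps, x_t, alpha_t, alpha_prev):
        # eps: UNet noise prediction at this timestep
        # alpha_t, alpha_prev: cumulative alpha products for current / next timestep
        x0_pred = (x_t - (1 - alpha_t).sqrt() * eps) / alpha_t.sqrt()          # current guess of the final image
        x_prev = alpha_prev.sqrt() * x0_pred + (1 - alpha_prev).sqrt() * eps   # re-noise toward the next level
        return x_prev, x0_pred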


What is happening is that the noisier/earlier timesteps are responsible for low-frequency features, while the final timesteps are responsible for high-frequency features.

https://dsp.stackexchange.com/questions/1637/what-does-frequ...


Here's another one in gif form (right image): https://twitter.com/johnowhitaker/status/1565710033463156739 These seem really useful for getting intuition about img2img.


Can you make a video with it that shows how it improves?


https://imgur.com/a/b7Bw7HB

(padded with 1 s of the final frame)

No way to prevent Imgur from reencoding... whatever


Nice!

Why does the border change in each frame?


Because it has no understanding of a border. It is effectively throwing pixels at the image to see if it’s getting better or worse


A GIF would be great


A colab notebook shared on discord has some interesting exploration of this https://colab.research.google.com/drive/1dlgggNa5Mz8sEAGU0wF...

Caveat: While I believe there's nothing nefarious about this notebook... I don't know whether there are security risks with random Colab notebooks.


I would be interested to see what varying levels of noise added to these intermediate steps produces. In this example it looks like it kind-of decided what it was drawing in the transition from steps 11 to 14 and then refined.

Would a little noise added around here make a subtly different result, widely diverge, or simply slow the refining process?


I'm astonished at the evolution of the image and reminded of a documentary I saw about Picasso (I think). He would paint the same painting again and again, tweaking it slightly each time until he was satisfied.


I went to a Picasso exhibition in Madrid once where they had an entire, huge, long room filled with every sketch and painting he'd done in preparation for Guernica, plus Guernica itself of course. It was eye-opening to say the least, especially as so many of the prep-work pieces were not in his typical style. Some were in an absolutely beautifully detailed realist style, and at that point I'd never seen a Picasso that wasn't cubist so it really stuck with me.


Does someone have an easy explanation of how the text prompt is fed into the image?


It is fed in using a fascinating mechanism called "cross-attention", which originated in the NN architecture called the Transformer, used to achieve state-of-the-art translation. It works something like associative memory: the NN inside Stable Diffusion that generates the image (a UNet working in latent space) "asks" the whole encoded prompt, at almost every step, to provide data at various positions, using query vectors Q that are matched against key vectors K and value vectors V [0].
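
A stripped-down sketch of one such cross-attention step (single head, no biases or normalization; real layers add multi-head projections and more): the queries come from the UNet's image tokens, the keys and values come from the encoded prompt.

    import torch
    import torch.nn.functional as F

    def cross_attention(image_feats, text_emb, w_q, w_k, w_v):
        # image_feats: (N_pixels, d_img) latent image tokens from the UNet
        # text_emb:    (N_tokens, d_txt) encoded prompt (e.g. CLIP text embeddings)
        q = image_feats @ w_q                 # queries from the image side
        k = text_emb @ w_k                    # keys from the prompt
        v = text_emb @ w_v                    # values from the prompt
        attn = F.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v                       # each image token pulls in relevant text info

    img = torch.randn(64 * 64, 320)           # toy 64x64 latent with 320-dim features
    txt = torch.randn(77, 768)                # 77 CLIP tokens
    wq, wk, wv = torch.randn(320, 64), torch.randn(768, 64), torch.randn(768, 64)
    out = cross_attention(img, txt, wq, wk, wv)   # (4096, 64)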

How Stable Diffusion works [1] as a whole is not really hard to comprehend at a high level, but you'll need some prereqs. The probability theory underlying it is explained in Variational Autoencoders [2]. Diffusion Models [3] then sort of made a really cool "deep variational" autoencoder out of many small noise/denoise steps, using largely the same math (variational inference), but they were unwieldy because they operated in pixel space. After that, Latent Diffusion Models [4] democratized the thing by vastly reducing the amount of computation needed, operating in latent space (btw, that's why the images in this HN post look so cool: the denoising is not happening in pixel space!).

[0] https://jalammar.github.io/illustrated-transformer/

[1] https://huggingface.co/blog/stable_diffusion

[2] https://arxiv.org/abs/1906.02691

[3] https://arxiv.org/abs/2006.11239

[4] https://arxiv.org/abs/2112.10752


Thanks!


Uhm. You’re basically asking how the entire NN works. There is no easy explanation for that.


I understand neural networks, embeddings, convolutions, etc. The part that's unclear to me is specifically how textual embeddings are linked into the img-to-img network trying to reduce the noise. In other words, I am missing how the process is 'conditioned upon' the text. (I lack the same understanding for conditional GANs as well.)

If the answer is just that the textual embeddings are also fed as simple inputs to the network, then I already understand.


Might be worth looking through the dataset it was trained on, here's one example: https://laion-aesthetic.datasette.io/laion-aesthetic-6pls/im...

So the model understands (kinda) who Bob Moog is, so when you include "Bob Moog" in the prompt, the model knows what you are looking for.


Why did they unnecessarily re-index a smaller subset of LAION Aesthetic? You can search _all_ of LAION using the pre-built faiss indices from LAION.

https://rom1504.github.io/clip-retrieval/?back=https%3A%2F%2...

is a hosted version, but you can download and host it yourself as well.


the prompt would be informative


Prompt is: monkey astronaut, bright, bright, bright, bright

(I was experimenting with repeating words, which does seem to amplify the effect with each repeat for some keywords)

Seed is: 948574399



