Diffusion Models (andrewkchan.dev)
315 points by reasonableklout 31 days ago | 31 comments



Good post. I always thought diffusion originated from score matching, but today I realized diffusion came before score-matching theory. So when OpenAI trained on 250 million images, they didn't even have a great theory explaining why they were modeling the underlying distribution. Gutsy move


The original Sohl-Dickstein 2015 paper [1] formulated diffusion as maximizing (a lower bound of) the log-likelihood of generating the distribution, so there was some theory. But my understanding is that the breakthrough was empirical results from Ho [2] and Nichol [3] showing that diffusion could produce not only high-quality samples but, in some cases, better ones than GANs.

[1] https://arxiv.org/abs/1503.03585 [2] https://arxiv.org/abs/2006.11239 [3] https://arxiv.org/abs/2105.05233


What's the best Apache or MIT-licensed python library for Diffusion Transformers?


HuggingFace Diffusers is Apache and supports Diffusion Transformers: https://huggingface.co/docs/diffusers/en/api/pipelines/dit
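
For the curious, loading the DiT pipeline looks roughly like this (a sketch based on the Diffusers docs; the model id "facebook/DiT-XL-2-256" and exact arguments may differ across versions):

  import torch
  from diffusers import DiTPipeline

  # Class-conditional ImageNet DiT; weights released by Facebook/Meta.
  pipe = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256", torch_dtype=torch.float16)
  pipe = pipe.to("cuda")

  # Map human-readable ImageNet class names to label ids, then sample.
  class_ids = pipe.get_label_ids(["white shark"])
  image = pipe(class_labels=class_ids, num_inference_steps=25).images[0]
  image.save("shark.png")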


They are actually based on the Attribution-NonCommercial licensed Facebook code and have the same license.


Are you sure about that? https://github.com/huggingface/diffusers lists the Apache 2 license.



Pretty sure that license header ended up in the codebase from a clever guy making a PR, or it was just a mistake.


Unfortunately this statement would not offer sufficient legal protection, so the original authors would have to be convinced to give up their previous rights and change the upstream copyright (and huggingface should update their repo license statement). Of course, these days it is typically easy enough to reimplement the code from a paper in plain pytorch, so I'm not sure one needs the whole huggingface repo with the extra framework and risk, but to me it doesn't fit the requirement of the OP question.


Ok, I would like to believe that. That's great then thanks.


If it worries you, maybe open an issue? No sane person would allow a weird license that's one API call away from screwing up their own products.



That's sort of confusing (to me at least) because that particular header also lists MIT and Apache licenses.


Besides huggingface there's also the denoising-diffusion-pytorch repo: https://github.com/lucidrains/denoising-diffusion-pytorch/


Nice. But is that a diffusion transformer?




How is doing classifier-free guidance where you

"Train a single diffusion model on every training sample x0x0 twice: once paired with its class label yy, and once paired with a null class label."

not doing exactly the same thing, and having the same problem that was deemed bad in the first paragraph of the same section:

"However, the label can sometimes lead to samples that are not realistic or lack diversity if the model has not seen enough samples from p(x∣y)p(x∣y) for a particular yy. So we often want to tune how much the model “follows” the label during generation."

Awesome post btw


It's not quite the same thing because you are able to control the tradeoff of realism and class strength, since sampling is formulated as pushing an unconditional sample in the direction of the conditional one for a tunable amount, rather than simply sampling from the conditional distribution. However, you are right that we get similar problems with realism if we apply guidance with too much strength.
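
For concreteness, a minimal sketch of that guided sampling step (the names eps_model, null_label, and the guidance weight w are illustrative, not taken from the article):

  def guided_noise_estimate(eps_model, x_t, t, y, null_label, w):
      # Unconditional prediction: the model sees the null class label.
      eps_uncond = eps_model(x_t, t, null_label)
      # Conditional prediction: the model sees the real class label y.
      eps_cond = eps_model(x_t, t, y)
      # Classifier-free guidance: push the unconditional estimate toward the
      # conditional one by a tunable amount w. w = 1 recovers plain
      # conditional sampling; larger w trades diversity for class adherence.
      return eps_uncond + w * (eps_cond - eps_uncond)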


Mm. I understand the argument (though this subtlety is not present in the article; maybe you should add it if that's actually what people do).

That said, I am not 100% convinced: first, the underlying model is still the same, because you are effectively learning an eps(x, t, y). Secondly, the equivalence suggests (if you actually use the difference as a gradient) that you should end up in the same global min/max as if you were doing this directly, no?


Thanks for the feedback!

Yes, when conditioned on a non-null label, the underlying model for classifier-free guidance can be seen as the same as a class-conditional model (in fact, the original CFG paper[0] discusses training a separate model - they didn't do that because it's extra work for not much benefit). And for a certain weighting of the guidance vector for each step, you can obtain the same sampling step you would get with a class-conditional model. But my understanding is that the point of guidance is that you can control this weighting, because you are able to estimate the unconditional score too.

[0] https://arxiv.org/pdf/2207.12598


> I spent 2022 learning to draw and was blindsided by the rise of AI art models like Stable Diffusion. Suddenly, the computer was a better artist than I could ever hope to be.

I hope the author stuck with it anyway. The more AI encroaches on creative work, the more I want to tear it all down.


Conversely, I've become more motivated to draw things and try my hand at digital art since being exposed to Stable Diffusion, Midjourney et al. I take the output from these tools and then attempt to recreate or trace over them.


People who do art for art's sake will do it regardless of AI.

After all, photography didn't stop people from drawing or painting.


Thanks for sharing. This has given me much more insight into how and why diffusion models work. Randomness is oddly powerful. Time to try and code one up in some suitably unsuitable language.

Not much to TL;DR for the comment lurkers. This post is the TL;DR of stable diffusion.


The train loop is wrong, no? Neither x0s nor eps is used in the expression for xts, so it looks like you're training to predict random noise


Not sure which eq you refer to, but from what I understand, the network never "sees" the correct images. Rather, the network must learn to infer the information indirectly through the loss function.

The loss function encodes information about the noise and, because the network sees the noised-up image exactly, this is equivalent to learning about the actual sample images. It's worth noting that you could design a loss function measuring the difference between the output and the real images. This contains equivalent information, but it turns out that the properties of Gaussian noise make it much more conducive to estimating the gradient.

But the point being, the information on the true images is in the loop, albeit only through the lens of some noise.


Yes, should be the same as the equation before. Like this:

  xts = alpha_bar[t].sqrt() * x0s + (1.-alpha_bar[t]).sqrt() * eps
Additionally, the code isn't consistent. In the sampling code a time embedding is used, while in training it isn't.
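
Roughly, a consistent training step might look something like this (a sketch assuming a noise-prediction model that takes the timestep as input and NCHW image batches; variable names are mine, not the article's):

  import torch

  def train_step(model, optimizer, x0s, alpha_bar, T):
      # Sample a random timestep for each image in the batch.
      t = torch.randint(0, T, (x0s.shape[0],))
      eps = torch.randn_like(x0s)
      ab = alpha_bar[t].view(-1, 1, 1, 1)
      # Forward (noising) process in closed form.
      xts = ab.sqrt() * x0s + (1. - ab).sqrt() * eps
      # The model is conditioned on t here, matching the sampling code.
      loss = ((model(xts, t) - eps) ** 2).mean()
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()
      return loss.item()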


Oops, you're right. Fixed, thanks.


This is a great post


I really like how readable the layout is. (Why do I waste so much time making hard-to-read layouts?)

The only disappointment came when I hit Reader View—almost to prove “this page is semantically perfect!!”—and alas, the nav list has a line height less than one in that medium, and scrunches up all crazy. I’ll let it slide ;)



