Diffusion Models (andrewkchan.dev)
315 points by reasonableklout 31 days ago | 31 comments



Good post. I always thought diffusion originated from score matching, but today I realized diffusion came before score-matching theory. So when OpenAI trained on 250 million images, they didn't even have a great theory explaining why they were modeling the underlying distribution. Gutsy move


The original Sohl-Dickstein 2015 paper [1] formulated diffusion as maximizing (a lower bound of) the log-likelihood of generating the distribution, so there was some theory. But my understanding is that the breakthrough was empirical results from Ho [2] and Nichol [3] showing that diffusion could produce not only high-quality samples but, in some cases, better ones than GANs.

[1] https://arxiv.org/abs/1503.03585 [2] https://arxiv.org/abs/2006.11239 [3] https://arxiv.org/abs/2105.05233


What's the best Apache or MIT-licensed python library for Diffusion Transformers?


HuggingFace Diffusers is Apache and supports Diffusion Transformers: https://huggingface.co/docs/diffusers/en/api/pipelines/dit
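
For the curious, loading the DiT pipeline looks roughly like this (a sketch based on the Diffusers docs; the model id "facebook/DiT-XL-2-256" and exact arguments may differ across versions):

  import torch
  from diffusers import DiTPipeline

  # Class-conditional ImageNet DiT; weights released by Facebook/Meta.
  pipe = DiTPipeline.from_pretrained("facebook/DiT-XL-2-256", torch_dtype=torch.float16)
  pipe = pipe.to("cuda")

  # Map human-readable ImageNet class names to label ids, then sample.
  class_ids = pipe.get_label_ids(["white shark"])
  image = pipe(class_labels=class_ids, num_inference_steps=25).images[0]
  image.save("shark.png")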


They are actually based on the Attribution-NonCommercial licensed Facebook code and have the same license.


Are you sure about that? https://github.com/huggingface/diffusers lists the Apache 2 license.



Pretty sure that license header ended up in the codebase from a clever guy making a PR, or it was just a mistake.


Unfortunately this statement would not offer sufficient legal protection, so the original authors would have to be convinced to give up their previous rights and change the upstream copyright (and huggingface should update their repo license statement). Of course, these days it is typically easy enough to reimplement the code from a paper in plain pytorch, so I'm not sure one needs the whole huggingface repo with the extra framework and risk, but to me it doesn't fit the requirement of the OP question.


Ok, I would like to believe that. That's great then thanks.


If it worries you, maybe open an issue? No sane person would allow a weird license that's one API call away from screwing up their own products.



That's sort of confusing (to me at least) because that particular header also lists MIT and Apache licenses.


Besides huggingface there's also the denoising-diffusion-pytorch repo: https://github.com/lucidrains/denoising-diffusion-pytorch/


Nice. But is that a diffusion transformer?




How is doing classifier-free guidance where you

"Train a single diffusion model on every training sample x0x0 twice: once paired with its class label yy, and once paired with a null class label."

not doing exactly the same thing, and having the same problem that was deemed bad in the first paragraph of the same section:

"However, the label can sometimes lead to samples that are not realistic or lack diversity if the model has not seen enough samples from p(x∣y)p(x∣y) for a particular yy. So we often want to tune how much the model “follows” the label during generation."

Awesome post btw


It's not quite the same thing because you are able to control the tradeoff of realism and class strength, since sampling is formulated as pushing an unconditional sample in the direction of the conditional one for a tunable amount, rather than simply sampling from the conditional distribution. However, you are right that we get similar problems with realism if we apply guidance with too much strength.
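
For concreteness, a minimal sketch of that guided sampling step (the names eps_model, null_label, and the guidance weight w are illustrative, not taken from the article):

  def guided_noise_estimate(eps_model, x_t, t, y, null_label, w):
      # Unconditional prediction: the model sees the null class label.
      eps_uncond = eps_model(x_t, t, null_label)
      # Conditional prediction: the model sees the real class label y.
      eps_cond = eps_model(x_t, t, y)
      # Classifier-free guidance: push the unconditional estimate toward the
      # conditional one by a tunable amount w. w = 1 recovers plain
      # conditional sampling; larger w trades diversity for class adherence.
      return eps_uncond + w * (eps_cond - eps_uncond)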


Mm. I understand the argument (though this subtlety is not present in the article; maybe you should add it if that's actually what people do).

That said, I am not 100% convinced: first, the underlying model is still the same, because you are effectively learning an eps(x, t, y). Secondly, the equivalence suggests (if you actually use the difference as a gradient) that you should end up in the same global min/max as if you were doing this directly, no?


Thanks for the feedback!

Yes, when conditioned on a non-null label, the underlying model for classifier-free guidance can be seen as the same as a class-conditional model (in fact, the original CFG paper[0] discusses training a separate model - they didn't do that because it's extra work for not much benefit). And for a certain weighting of the guidance vector for each step, you can obtain the same sampling step you would get with a class-conditional model. But my understanding is that the point of guidance is that you can control this weighting, because you are able to estimate the unconditional score too.

[0] https://arxiv.org/pdf/2207.12598


> I spent 2022 learning to draw and was blindsided by the rise of AI art models like Stable Diffusion. Suddenly, the computer was a better artist than I could ever hope to be.

I hope the author stuck with it anyway. The more AI encroaches on creative work, the more I want to tear it all down.


Conversely, I've become more motivated to draw things and try my hand at digital art since being exposed to Stable Diffusion, Midjourney et al. I take the output from these tools and then attempt to recreate or trace over them.


People who do art for art's sake will do it regardless of AI.

After all, photography didn't stop people from drawing or painting.


Thanks for sharing. This has given me much more insight into how and why diffusion models work. Randomness is oddly powerful. Time to try and code one up in some suitably unsuitable language.

Not much to TL;DR for the comment lurkers. This post is the TL;DR of stable diffusion.


The train loop is wrong, no? Neither x0s nor eps is used in the expression for xts, so it looks like you're training to predict random noise


Not sure which eq you refer to, but from what I understand, the network never "sees" the correct images. Rather, the network must learn to infer the information indirectly through the loss function.

The loss function encodes information about the noise and, because the network sees the noised-up image exactly, this is equivalent to learning about the actual sample images. It's worth noting that you could design a loss function measuring the difference between the output and the real images. This contains equivalent information, but it turns out that the properties of Gaussian noise make it much more conducive to estimating the gradient.

But the point being, the information on the true images is in the loop, albeit only through the lens of some noise.


Yes, should be the same as the equation before. Like this:

  xts = alpha_bar[t].sqrt() * x0s + (1.-alpha_bar[t]).sqrt() * eps
Additionally, the code isn't consistent. In the sampling code a time embedding is used, while in training it isn't.
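
Roughly, a consistent training step might look something like this (a sketch assuming a noise-prediction model that takes the timestep as input and NCHW image batches; variable names are mine, not the article's):

  import torch

  def train_step(model, optimizer, x0s, alpha_bar, T):
      # Sample a random timestep for each image in the batch.
      t = torch.randint(0, T, (x0s.shape[0],))
      eps = torch.randn_like(x0s)
      ab = alpha_bar[t].view(-1, 1, 1, 1)
      # Forward (noising) process in closed form.
      xts = ab.sqrt() * x0s + (1. - ab).sqrt() * eps
      # The model is conditioned on t here, matching the sampling code.
      loss = ((model(xts, t) - eps) ** 2).mean()
      optimizer.zero_grad()
      loss.backward()
      optimizer.step()
      return loss.item()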


Oops, you're right. Fixed, thanks.


This is a great post


I really like how readable the layout is. (Why do I waste so much time making hard-to-read layouts?)

The only disappointment came when I hit Reader View—almost to prove “this page is semantically perfect!!”—and alas, the nav list has a line height less than one in that medium, and scrunches up all crazy. I’ll let it slide ;)



