Diffusion Models Beat GANs on Image Synthesis (arxiv.org)
209 points by lnyan 42 days ago | 50 comments

The biggest downside of diffusion models is that a GAN's output can be rendered in much less than half a second (sometimes at 10fps or higher) on one core of a standard device you probably have. (Many of you have an nVidia GPU, and any ol’ nVidia GPU will render stylegan quickly.)

With diffusion models, you need to do >25 forward passes to achieve a result. It’s kind of like an O(1) algorithm vs O(N): stylegan has one pass, diffusion models have N. And N is currently 25 or more, which means it tends to be 25x slower than stylegan at a minimum. (In our experience it was often many seconds to a full minute before we saw results, but we didn’t try very hard to make it fast, and this paper shows advances in speed since then.)

The flipside: this paper has the most beautiful photorealistic complex samples I’ve ever seen. I don’t really care if they’re cherry picked; it’s hard to pick cherries on a rotten tree. I know.

The last thing I want to say is that I am disappointed the model is, yet again, not released. The hoarding of ML models needs to stop. These research models aren’t commercially useful, but they have incredible benefits for people outside of ML. I myself got into ML thanks to being able to play with GPT-2 and StyleGAN. I would leap at the chance to play with AlphaZero or OpenAI’s old Dota 2 models; no reason to keep those locked up. I also don’t care about the excuses for not releasing: none of them hold water. In my experience, no model has ever had a society-threatening impact, and it’s usually a euphemism for “we’re worried there’s a small chance we might look bad.” Releasing your model is as simple as scp’ing to your server; just do it and quit worrying so much.

Love the work. Diffusion models are a really interesting way of thinking about generative approaches in ML.

Although the weights aren't available, I wanted to note that the model source itself is actually hosted at https://github.com/openai/guided-diffusion.

Indeed, the weights aren't available, and researchers seem to console themselves by saying "at least we're publishing code." Well, thank you. Code is nice. Y'know what else is nice? Being able to reproduce your claimed results without investing six months of mental effort and 500 GPU hours on a V100.

(Sorry, I'll step off the rant now. I just miss the old days, the ye olde long-long ago of "one and a half years." It seemed like every other day there was another neat model release that happened to be world-changing as a matter of course: StyleGAN, BigGAN, GPT-2. Before that there was YOLO and a ton of other neat stuff. Nowadays it's "Oh golly gee, I'm not so sure we should release this panda generator, it might have ethical implications down the line if people start generating pandas in sexy poses." But hopefully it's just a phase or something.)

Very generous of you to assume their model is reproducible.

No need to stop the rant :) I'm curious about the reasons models +/- weights are published less frequently.

For starters, is that actually the case? I don't know, genuine question.

Second, and I recognise the humorous intent regarding sexy pandas, are ethical concerns actually the reasons some models/weights are not released? Or is it about hoarding a potentially valuable commodity? Or some other reason, or a combination of several?

is that actually the case? [are model weights being published less frequently?]

It sure feels like it. But it could be survivorship bias: https://www.youtube.com/watch?v=_Qd3erAPI9w&t=52s&ab_channel... or regression to the mean: https://www.youtube.com/watch?v=1tSqSMOyNFE&ab_channel=Verit.... It's possible that StyleGAN and GPT-2 just happened to coincide, and both companies just happened to decide to release their models rather than hoard them, and that the usual case is for companies to rarely (if ever) release weights with their research.

But the more interesting question is, why wouldn't they? So onward to your second one:

are ethical concerns actually the reasons some models/weights are not released? Or is it about hoarding a potentially valuable commodity? Or some other reason, or a combination of several?

As with most things, the answer is likely the mundane, dull truth that big companies tend to have lots of inertia and friction preventing such things from happening. Sources of friction seem to include (but are not limited to):

- are we legally responsible for what other people do with the model? (no.)

- should we restrict people from using the model commercially? (why bother? but sure, nVidia did that for stylegan. It didn't stop Artbreeder from blatantly ignoring and monetizing it anyway. And while that sounds negative, Artbreeder is one of the coolest ML projects ever, as far as I'm concerned, so the world is better off for ignoring your stupid policy.)

- What if people start misusing the model? (So what? I feel like I should just get in the face of whoever is asking this question and repeat "So what?" until they have the police escort me off the premises. You can keep asking it to whatever they reply with, and eventually their logic never seems to go anywhere but one big ass-covering circle.)

- What if we might look bad because of it?

Now that last bullet point deserves some real treatment. In my opinion, the head of Facebook AI has done one of the most damaging things possible to the ML scene by getting everyone riled up about GPT generating "immoral" outputs:

Good lord, AIs are generating racist outputs! They're saying that cis white men are intellectually superior to trans people! People are generating child porn in AI dungeon! Someone, do something!

... please. Every researcher I've talked to has rolled their eyes hard at the criticisms raised by Jerome. But since he's the head of Facebook AI, no one dares say so publicly. So it may as well come from an outsider like me: Jerome, I know your heart is in the right place, and that you believe very strongly that this is an important moral issue. But your moral concerns need to be weighed against the massive chilling effect you've had by publicly shaming OpenAI so hard that they ended up completely losing their confidence and adding a ridiculous profanity filter to GPT-3 that flags pretty much anything mildly naughty, and against the fact that AI Dungeon is now in hot water because you've given the world an excuse to be pissed off about a mindless, memory-less AI (https://news.ycombinator.com/item?id=23346972) generating text that offends someone, somewhere, for some reason.

Well, people are going to be offended. Let them be. I am half Jewish. There is a very realistic chance that GPT-3 was trained on the full text of "Mein Kampf," and I couldn't care less. Even if someone went out of their way to train the most offensive, most threatening language model the world has ever seen, what could they really do with that power? Are they going to hurt you right in the feelings? Is the language model going to argue vigorously for the extermination of all Jews? If it did, who would listen? No no, don't try to say that yes, there's a very real danger that people might generate propaganda and influence others. You know as well as I do that this is extraordinarily difficult in practice for multiple reasons, and that no one has ever seen an interactive, adaptive AI that can dynamically deliver the most persuasive propaganda to Reddit and fool everybody into thinking it's a genuine grassroots movement. And you know as well as I do that that would be roughly equivalent to inventing AGI, and that AGI is still nowhere in sight. It's not even living in the same country. Hell, it's not in our sector of the galaxy. See that star in the sky? It's possible that we're more likely to reach that star before we invent AGI, because nobody knows how to invent AGI yet.

Before I get too worked up, I should keep some focus and say constructive things. One. If you, as an ML researcher, find yourself wanting to release your work, but your corporation is putting the red tape around you, push back. They need you more than you need them, even if it doesn't feel like it.

Two. Releasing work is generally beneficial for the world. I can't think of a single model release that has ever harmed the world. Let's wait until one actually does before we freak out about whether it might.

Three (https://www.youtube.com/watch?v=jpw2ebhTSKs&ab_channel=TheCh...). By releasing your work, you give countless people the opportunity to better themselves. Thanks to GPT-2 1.5B, we were able to vastly extend the capabilities of https://reddit.com/r/subsimulatorgpt2 to include over a hundred subreddits in a single model. This has roughly zero commercial value, yet has improved the lives of countless people who show up to giggle at all the (sometimes horribly offensive) things that robots say to each other. I was proud to be a part of that, and I want to be a part of more.

Please release your models. The most recent model release was CLIP, and it's already had a profound impact. Just look at how freaking awesome this is! https://twitter.com/l4rz/status/1367853921427984390 They're using CLIP to turn someone into Dracula! That's badass, and it inspires people (like me, at one time) to get into ML and become the researchers of tomorrow (as I try to be now).

I remember rooting for OpenAI's Dota 2 bot while they were facing off vs .. OG, I think? It's been several years. I spent five years playing dota-type games. It was one of the most exciting things I'd ever seen. And being able to beat rtz 1v1 mid SF? Not merely beat him, but annihilate him? Holy crap, give me that model. Why are you not releasing that model?! It's so cool!

And now nobody gets to play with it ever, and it's locked up in the real AI dungeon: OpenAI's. I hope they reconsider.

> No no, don't try to say that yes, there's a very real danger that people might generate propaganda and influence others. You know as well as I do that this is extraordinarily difficult in practice for multiple reasons, and that no one has ever seen an interactive, adaptive AI that can dynamically deliver the most persuasive propaganda to Reddit and fool everybody into thinking it's a genuine grassroots movement. And you know as well as I do that that would be roughly equivalent to inventing AGI, and that AGI is still nowhere in sight.

I generally agree with your point, but you're setting the bar way too high. Repeating an idea over and over is a depressingly effective way to bring something into the public discourse; if it happens to align with the aims of some opportunists (slandering a political opponent, casting doubt on an inconvenient fact, etc.) they will happily spend the manual effort to make less robotic-seeming versions. At that point it gets into the news, and well-intentioned individuals are burdened with proving why the batshit-nonsense-du-jour is batshit nonsense.

Whilst this sort of thing is easily dismissed individually, it can serve as ammunition for Gish Gallops, JAQing off, and generally nudge the Overton Window (even if subconsciously).

You give Reddit as an example; there are probably many trying to spam it with astroturf at the moment, it's debatable how effective that is.

If we lower our standards a bit and apply the same reasoning to 4chan, it seems even more plausible (due to the ephemeral, anonymous, disjointed nature of its threads).

> Two. Releasing work is generally beneficial for the world. I can't think of a single model release that has ever harmed the world. Let's wait until one actually does before we freak out about whether it might.

I like a lot of the idealism in this comment, and learned a few things too, so thanks for that. I'm not an ML researcher, which may cloud my views a bit, but I think you would be hard-pressed to argue that the release of ML models hasn't harmed the world. Perhaps we haven't reached AGI, but we don't need to for the release of ML models to harm the world.

I'll give two examples: 1) image detection - Now that image detection is good enough, surveillance in dictatorships can be run more efficiently and scalably in a way the KGB couldn't have dreamt of. Even if the release of a model like YOLO saves an hour's worth of time for an ML researcher working on the dictatorship's surveillance project, this can cause a lot of harm to the oppressed people living in those countries.

2) troll bots based on GPT-2. You gave an example of high-quality, human-quality propaganda, but what troll bots lack in quality can be made up for in quantity. If you run a lot of troll bots and can sway the dominant viewpoint on all the forums you want to target (which you can, since these bots are infinitely scalable), you've achieved your purpose. I also think you are overestimating the quality needed to influence the average person's worldview. For example, we saw from the Cambridge Analytica news that all they had to do to affect a few swing voters' behavior was target them with a few ads. I personally read through some of the example output from GPT-3, and if I came across it while browsing a forum, I wouldn't be at all confident about whether it was written by a bot or not.

Why do research on something at all, if it could be misused? What if we invent a new sabre blade that's made out of light and can cut through almost anything, including people? Should we not invent it? Would we only give it to a Jedi, who could be trained to never do bad things with it?

Why are these researchers doing research on this? Seriously, why? Why do research on something and then not release the model over gasp ethical concerns -- if they had ethical concerns, why did they pursue this avenue of research in the first place? Why advance the state of the art if it's wrong to do so?

Not releasing the model for "ethical" concerns is a cop-out. There's probably another reason; what is that reason?

The thing is that training the models themselves is well within the resources of even the smaller actors that would want to use them in these ways. It's the interested-but-not-quite-interested-enough or time-poor enthusiasts that are hit the worst.

Well, better safe than sorry with releasing a model with potentially harmful effects, right?

I think the fact that they're slow right now is all the more reason to study them, because in a few years chips will be faster and speed won't be a concern anymore. Small nitpick on the big-O notation, though: constant factors are exactly what big-O hides. I don't know the algorithms well, but from what you describe they should be in the same big-O class.

Sorry if I described it confusingly. What I meant to say was, with StyleGAN, you have one forward pass (~150ms). With diffusion models, you must have at least 25 forward passes (25 x ~150ms). You're right that chips will get faster, but those speedups won't trickle down to the market segment near and dear to my heart: tinkerers like you and me who just want to play around with a model at home without needing to spend thousands of dollars, or to rent a supercomputer.
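The gap is just multiplication, but a tiny sketch makes the orders of magnitude concrete (the ~150ms figure is the one quoted above; everything here is illustrative, not a benchmark):

```python
# Back-of-envelope latency, assuming ~150 ms per forward pass as quoted above.
def total_latency_ms(passes_per_sample, ms_per_pass=150.0):
    """Wall-clock time to produce one sample, in milliseconds."""
    return passes_per_sample * ms_per_pass

print(total_latency_ms(1))   # 150.0  -> stylegan: one forward pass
print(total_latency_ms(25))  # 3750.0 -> diffusion: >= 25 passes, ~25x slower
```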

In reality though, diffusion models are probably fast and lightweight enough that (with patience) you'll be able to generate some neat stuff yourself. At least, if you have an nVidia GPU, or are lucky with a Colab instance. Me, though, I never had one, and a lot of times I was constrained to play with the models that I could run on CPU inference only. I was often delighted to discover that CPU inference gets you quite far! But with 25 forward passes instead of 1, it would be 25x more painful to play around with them -- on the order of waiting 15+ minutes per attempt, rather than seeing things happen in ~25s. The activation energy adds up, and I'm keen to keep ML as accessible as possible for people who just want to play, since playing is a key step toward "Ok, I guess this ML stuff is worth taking more seriously. Let me dive in..."

That's not to dismiss diffusion models whatsoever. I just had a slight twinge of sadness that being able to generate interpolation videos (one of the coolest things you can possibly do with generative image models) might be out of reach for people without GPUs (I was one of them).

> Releasing your model is as simple as scp’ing to your server; just do it and quit worrying so much.

I guess it is about that simple if you'd be satisfied with the release of a documentation-free binary blob of weights. Somehow I suspect others might complain about usability.

I think that might be one of the biggest sources of inertia preventing model releases. The inference code used to generate the samples in the paper often depends on private libraries that the company hasn't released.

All I can say is, please release it anyway as a documentation-free binary blob of weights. People like me will figure out how to make it work, and to make it accessible. Gamedev has been showing for decades that the community will do an amazing amount of work just for the fun of it, and I assure you that we'd all be delighted to figure out some clever way to inference from a model trained on a massive cluster of GPUs.
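To make "figure out how to make it work" concrete, here's a hedged sketch of the usual first step with an undocumented weight blob: dump the parameter names and shapes, which reveal much of the architecture (layer types, widths, depth). Real releases are usually framework checkpoints (PyTorch, TF); the parameter names below are purely illustrative stand-ins.

```python
import io

import numpy as np

# Build an in-memory stand-in for a released weight blob; with a real
# release you'd load the file the researchers scp'd up instead.
buf = io.BytesIO()
np.savez(buf, **{
    "conv1.weight": np.zeros((64, 3, 3, 3), dtype=np.float32),
    "conv1.bias": np.zeros((64,), dtype=np.float32),
})
buf.seek(0)

# Step one with any undocumented checkpoint: list names and shapes.
blob = np.load(buf)
shapes = {name: blob[name].shape for name in blob.files}
print(shapes)  # {'conv1.weight': (64, 3, 3, 3), 'conv1.bias': (64,)}
```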

Why broadly claim an entire class of generative model outperforms another? They simply show that their particular implementation of a diffusion model attains better performance (for some metric) than the current best implementation of a GAN. Almost inevitably (and likely in the very near future) someone will invest the time and tuning to squeeze out some more performance from a GAN/VAE/flow, and this title will be outdated.

Red Sox beat Yankees in the world series ≠ Red Sox will always beat the Yankees in every world series game. And I’m guessing most people reading this title know that. I’m sure there are people new to the field or new to CS research in general who might not know how to interpret this, but the only thing most people take out of this title is “hmm, maybe we should consider and study diffusion models more seriously”.

The Red Sox and Yankees can't play each other in the World Series fyi.

The preeminence of either GANs or diffusion models is an ongoing debate, with evidence trickling in paper by paper. To claim diffusion models beat GANs is like claiming the Red Sox beat the Yankees because they happen to be ahead in the third inning.

It’s like saying the Sox are ahead of the Yankees and it’s the third inning with no predetermined number of innings. That nuance is what makes the title fair.

Eh, I feel like it may be worth cutting some slack here...

It sounds like you’d like the title to be prefaced with “one time that”.

I don’t think anyone reading this says, hey gans are done for. I doubt the authors think that and I (clearly) don’t get the impression that’s what they mean.

Even the first sentence of the abstract:

> We show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models

Has the word current in front of state of the art generative models.

While I agree with your sentiment, GANs always seemed finicky and unreliable. I am willing to indulge new approaches to find something more robust.

While I tend to agree with you, aren't titles like these understood to be shorthand for what you described?

Diffusion Models Can Sometimes Beat GANs on Image Synthesis

For an actual example of this more cautious framing, see another paper (also by OpenAI, also comparing two classes of generative models), https://arxiv.org/abs/2011.10650.

It's fine to claim it. It's shorthand for "We achieve smaller FID values (most of the time) than any other generative model."

You're correct about what you're saying, though. It's also worse: FID is a measurement biased towards imagenet-style images, and this model was also trained on imagenet, so it's quite well-suited for achieving small FID values. Whereas for generative anime, it's still unclear what the best model architecture is.
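For reference, FID is the Fréchet distance between Gaussians fit to Inception-v3 features of real and generated images. A minimal sketch of the formula, specialized (for illustration only) to diagonal covariances so plain NumPy suffices:

```python
import numpy as np

def fid_diagonal(mu1, var1, mu2, var2):
    # Fréchet distance between two Gaussians with diagonal covariances:
    #   ||mu1 - mu2||^2 + sum(var1 + var2 - 2 * sqrt(var1 * var2))
    # Real FID uses full covariance matrices of Inception-v3 features
    # (with a matrix square root); this is the same formula restricted
    # to the diagonal case.
    mu1, var1, mu2, var2 = map(np.asarray, (mu1, var1, mu2, var2))
    mean_term = np.sum((mu1 - mu2) ** 2)
    cov_term = np.sum(var1 + var2 - 2.0 * np.sqrt(var1 * var2))
    return float(mean_term + cov_term)

print(fid_diagonal([0, 0], [1, 1], [0, 0], [1, 1]))  # 0.0: identical stats
print(fid_diagonal([0, 0], [1, 1], [3, 4], [1, 1]))  # 25.0: means diverge
```

Smaller is better, and since the score is computed from feature statistics rather than per-image comparisons, it inherits the biases of the feature extractor, which is the ImageNet-bias point above.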

I read your comment, thought “to get clicks, duh”, and then I clicked through the link.

This is a research paper.

I would like to echo your question: why? Does academia incentivize clickbait?

With the rise of Altmetrics? Arguably yes.

IIRC, the last research I saw on this, which was a while back, argued that clever titles get read more and cited less. https://www.researchtrends.com/issue24-september-2011/headin...

Well, not sure it's any worse than cite-bait?

Are you going on the title? If so, the verb "beat" here reads both ways: as a one-time past event and as a general present-tense claim.

Maybe that's a lot to ask, but can someone explain to me or guide me to material to gain a better understanding of principles behind diffusion probabilistic models? My knowledge of statistics is just too basic to even read these papers.

Not a lot to ask at all. The go-to person (and one of the pioneers of this type of model) is Alexia: https://twitter.com/jm_alexia

The article you want is probably this one: https://ajolicoeur.wordpress.com/the-new-contender-to-gans-s...

I would launch into a thorough explanation, but it would likely not be correct in every detail, because it's been around nine months since I was immersing myself in DDPM type models. But, broadly speaking, with normal training, your goal is to train a model (show it a bunch of examples) until the model can guess the right answer most of the time in one try. Except, "guessing the right answer" is actually an easier problem than generating an image, because the model usually gives you its top-N guesses, so it says "I think it's a snake or a dog or an apple."

Whereas with generative images, it's much harder to come up with a technique that can be "sort of correct": if you generate a stylegan image, it either looks cool or looks like crap, and it's rather difficult to automatically take a crappy output and turn it into something that looks cool. (The "automatic" is key; there are manual human-guided techniques that I'm quite fond of, and amazed no one's turned it into a photoshop-type plugin yet, but the field of ML seems to compete/care about fully automatic solutions right now. For some reason.)

DDPM is the inverse: you have a trained model, and you start with noise, but then it gets progressively closer to a cool looking result by searching multiple times (i.e. taking multiple forward passes). That's as much as I remember, I'm afraid.
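That "start from noise and refine N times" loop can be sketched in a few lines. This follows the ancestral sampling procedure from the DDPM paper (Ho et al., 2020), with `predict_noise` as a stand-in for the trained denoising U-Net; returning zeros keeps the sketch runnable, and the schedule values are illustrative.

```python
import numpy as np

def predict_noise(x, t):
    """Stand-in for the trained denoising network eps_theta(x, t);
    a real DDPM predicts the noise that was added at step t."""
    return np.zeros_like(x)

T = 25                               # number of denoising steps
betas = np.linspace(1e-4, 0.02, T)   # toy linear noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))      # start from pure noise (a tiny "image")
for t in reversed(range(T)):
    eps = predict_noise(x, t)
    # Denoising step: subtract the predicted noise, rescale.
    x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    if t > 0:
        # Re-inject a little noise at every step except the last.
        x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)

print(x.shape)  # (8, 8): one forward pass per step, 25 passes in total
```

Each loop iteration is one forward pass, which is exactly the 25x cost discussed at the top of the thread.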

What are the manual / human-guided techniques you speak of?

I wanted to do a thorough writeup, but I never got around to it. Here are a bunch of examples of me using those techniques though: https://twitter.com/theshawwn/status/1182208124117307392

The dota community seemed to like it. :)

Basically, it was an interactive editor where you could slightly move along stylegan directions, combined with Peter Baylies' reverse encoder to slightly morph the image to a specific face on demand.

It was instantly so much better than any automated solution. It felt like being a pilot in front of the controls of a cockpit.
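At its core, that kind of editor boils down to nudging a latent code along learned attribute directions and re-rendering. A sketch of the nudge step, where `nudge`, `smile`, and the 512-dim latent size are illustrative stand-ins (the real workflow feeds the result back through the StyleGAN generator):

```python
import numpy as np

def nudge(w, direction, alpha):
    """Take a small step from latent code `w` along a unit-normalized
    attribute direction; the edited image would be generator(nudge(...))."""
    d = direction / np.linalg.norm(direction)
    return w + alpha * d

rng = np.random.default_rng(1)
w = rng.standard_normal(512)      # StyleGAN-sized latent, illustrative
smile = rng.standard_normal(512)  # stand-in for a discovered direction

w_edited = nudge(w, smile, alpha=0.5)
print(np.linalg.norm(w_edited - w))  # 0.5: the step size along a unit vector
```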

I can try to provide a high level intuition. For actual experts, please forgive my errors of simplification.

Drawing samples from "simple" distributions like normal distributions is computationally easy. You can do it in microseconds. Sampling an arbitrary 1-D distribution is a bit harder - you have to invert its cumulative distribution function to map samples from an "equivalent" uniform random variable onto it, or potentially use a rejection sampling approach to sample 1-D values under this distribution.
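The 1-D inverse-CDF trick in a few lines, using the exponential distribution because its inverse CDF has a closed form (F(x) = 1 - exp(-rate*x), so F^{-1}(u) = -ln(1-u)/rate); the function name is mine:

```python
import numpy as np

def sample_exponential(rate, n, rng):
    """Inverse-CDF sampling: draw easy uniforms, push them through
    the target distribution's inverse CDF."""
    u = rng.uniform(0.0, 1.0, size=n)
    return -np.log1p(-u) / rate

rng = np.random.default_rng(0)
samples = sample_exponential(rate=2.0, n=100_000, rng=rng)
print(samples.mean())  # close to 0.5, the mean of Exponential(rate=2)
```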

Sampling a high-D distribution (such as an image) is even harder - you need to learn a mapping from this high-D image back to "tractable" densities. This imposes some pretty harsh "optimization bottlenecks" when trying to contort the manifold of images to normal distributions. The whole point of this exercise is that the transformation respects a valid probability distribution, so you can start from the normal distribution and apply this mapping to get a valid sample from the image distribution. This in practice is pretty hard, and the quality of samples seems to be lower than other forms of deep generative models which use fewer parameters.

Now instead of learning such a complex, hard-to-optimize transformation to valid densities, what if we instead learn a function E(x) that outputs a scalar "energy". The energy is low for "realistic" images, and high for unrealistic images. Kind of like inverse probability, except it's not normalized - the energy value tells you nothing about likelihood unless you know the energy for all other images possible. This tends to actually be easier than learning densities, because the functional form of this energy function is unconstrained.

Furthermore, not knowing likelihoods doesn't stop you from getting "realistic" image, as all you need to do is descend the gradient x -= d/dx E(x), which takes you to an image with "lower" energy (i.e. more realistic). Under certain procedures (e.g. adding some noise to the gradient), this can be thought of as actually equivalent to sampling from a valid probability distribution, even though you can't compute its likelihood analytically.
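That "gradient descent plus noise" procedure is Langevin dynamics. A toy sketch with a quadratic energy E(x) = 0.5*||x - mu||^2, whose Gibbs distribution exp(-E) is a unit Gaussian centered at mu; a real energy-based model would replace `energy_grad` with a learned network:

```python
import numpy as np

def energy_grad(x):
    """Gradient of the toy energy E(x) = 0.5 * ||x - mu||^2."""
    mu = np.array([2.0, -1.0])
    return x - mu

# Langevin dynamics: x <- x - (step/2) * dE/dx + sqrt(step) * gaussian_noise.
rng = np.random.default_rng(0)
step = 0.1
x = rng.standard_normal(2)  # start from noise

traj = []
for i in range(5000):
    x = x - 0.5 * step * energy_grad(x) + np.sqrt(step) * rng.standard_normal(2)
    if i >= 1000:           # discard burn-in
        traj.append(x)

mean = np.mean(traj, axis=0)
print(mean)  # hovers around [2, -1]: samples from exp(-E) without a likelihood
```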

The diffusion probabilistic model you refer to can be thought of as such a model - the more steps you take (i.e. the more compute you spend), the better the quality of the sample.

GANs can be thought of as a one-pass neural network amortization of the end result of the diffusion process, but unlike MCMC methods, they cannot be "iteratively refined" with additional compute. Their sample quality is limited to whatever the generator spits out, even if you had additional compute.

This sounds like you're describing energy-based models, not diffusion models.

Ah, thanks. I had mistakenly assumed the diffusion process here was comparable / drop-in replacement to Langevin dynamics used with energy models.

I went and read the paper in more detail. Yeah, my original comment was way off-base, except if you draw a parallel with normalizing flows as iterated refinement (similar to iterated de-noising), and see that the DDPM is a more unconstrained form.

But at a surface level, there isn't a clear connection between DDPM and energy-based models.

See my comment above. I highly recommend this talk by Stefano Ermon from Stanford: https://www.youtube.com/watch?v=8TcNXi3A5DI

I don't know enough about this to understand why their model is better, but the imperfections in their images are much more unsettling than those from other algorithms.

Yeah, some of those images are nightmare material.

I wonder if it would be possible to train them to specifically avoid generating images that look scary to people. Eg prefer blur or other artifacts over distorted bodies and faces.

The last set of images has some really scary generated humans from the GAN they were comparing against, in the "male with fish" examples.

This new approach to generative modelling looks very intriguing.

In a similar ilk, there's this ICLR paper from this year using stochastic differential equations for generative modelling: https://arxiv.org/abs/2011.13456

If you are interested in learning the intuition & theory behind diffusion and score-based models, I highly recommend this talk by Stefano Ermon from Stanford: https://www.youtube.com/watch?v=8TcNXi3A5DI

I've disliked GANs since the Nvidia man with his head melting back into the bedroom made me feel like I was having a stroke.

GANs' blind spots have the potential to generate some confusing optical illusions.

What Nvidia man are you referring to?

Sorry, I missed this - unfortunately I did not save the picture.

Repeatedly clicking refresh on https://thispersondoesnotexist.com/ I eventually reached an image where the top of a generated man's head melted into the room behind it.

Melted, not faded.

The room and the man's head were not in any way separate and superimposed; it was a perfect fluid transition.

Looking at it, it genuinely felt like my brain was malfunctioning.

I don't know the extent to which it was due to under-training or some more fundamental limitation of GANs; the feeling I had was that the generator had found an exploitable hole in the discriminator's comprehension.

There are samples in the paper itself (the "Download PDF" button). There are no samples if you click the https link to GitHub...
