With diffusion models, you need to do >25 forward passes to achieve a result. It’s kind of like an O(1) algorithm vs O(N): stylegan has one pass, diffusion models have N. And N is currently 25 or more, which means it tends to be 25x slower than stylegan at a minimum. (In our experience it was often many seconds to a full minute before we saw results, but we didn’t try very hard to make it fast, and this paper shows advances in speed since then.)
The flipside: this paper has the most beautiful photorealistic complex samples I’ve ever seen. I don’t really care if they’re cherry picked; it’s hard to pick cherries on a rotten tree. I know.
The last thing I want to say is that I am disappointed the model is, yet again, not released. The hoarding of ML models needs to stop. These research models aren’t commercially useful, but they have incredible benefits for people outside of ML. I myself got into ML thanks to being able to play with GPT-2 and StyleGAN. I would leap at the chance to play with AlphaZero or OpenAI’s old Dota 2 models; there's no reason to keep those locked up. I also don’t care about the excuses for not releasing: none of them hold water. In my experience, no model has ever had a society-threatening impact, and the worry is usually a euphemism for “we’re worried there’s a small chance we might look bad.” Releasing your model is as simple as scp’ing it to your server; just do it and quit worrying so much.
Love the work. Diffusion models are a really interesting way of thinking about generative approaches in ML.
(Sorry, I'll step off the rant now. I just miss the old days, the ye olde long-long ago of "one and a half years." It seemed like every other day there was another neat model release that happened to be world-changing as a matter of course: StyleGAN, BigGAN, GPT-2. Before that there was YOLO and a ton of other neat stuff. Nowadays it's "Oh golly gee, I'm not so sure we should release this panda generator, it might have ethical implications down the line if people start generating pandas in sexy poses." But hopefully it's just a phase or something.)
For starters, is that actually the case? I don't know, genuine question.
Second, and I recognise the humorous intent regarding sexy pandas, are ethical concerns actually the reasons some models/weights are not released? Or is it about hoarding a potentially valuable commodity? Or some other reason, or a combination of several?
It sure feels like it. But it could be survivorship bias: https://www.youtube.com/watch?v=_Qd3erAPI9w&t=52s&ab_channel... or regression to the mean: https://www.youtube.com/watch?v=1tSqSMOyNFE&ab_channel=Verit.... It's possible that StyleGAN and GPT-2 just happened to coincide, and both companies just happened to decide to release their models rather than hoard them, and that the usual case is for companies to rarely (if ever) release weights with their research.
But the more interesting question is, why wouldn't they? So onward to your second one:
> are ethical concerns actually the reasons some models/weights are not released? Or is it about hoarding a potentially valuable commodity? Or some other reason, or a combination of several?
As with most things, the answer is likely the mundane, dull truth that big companies tend to have lots of inertia and friction preventing such things from happening. Sources of friction seem to include (but are not limited to):
- are we legally responsible for what other people do with the model? (no.)
- should we restrict people from using the model commercially? (why bother? but sure, nVidia did that for stylegan. It didn't stop Artbreeder from blatantly ignoring the restriction and monetizing it anyway. And while that sounds negative, Artbreeder is one of the coolest ML projects ever, as far as I'm concerned, so the world is better off for ignoring your stupid policy.)
- What if people start misusing the model? (So what? I feel like I should just get in the face of whoever is asking this question and repeat "So what?" until they have the police escort me off the premises. You can keep asking it to whatever they reply with, and eventually their logic never seems to go anywhere but one big ass-covering circle.)
- What if we might look bad because of it?
Now that last bullet point deserves some real treatment. In my opinion, the head of Facebook AI has done one of the most damaging things possible to the ML scene by getting everyone riled up about GPT generating "immoral" outputs:
Good lord, AIs are generating racist outputs! They're saying that cis white men are intellectually superior to trans people! People are generating child porn in AI dungeon! Someone, do something!
... please. Every researcher I've talked to has rolled their eyes hard at the criticisms raised by Jerome. But since he's the head of Facebook AI, no one dares say so publicly. So it may as well be said by an outsider like me: Jerome, I know your heart is in the right place, and that you believe very strongly that this is an important moral issue. But your moral concerns need to be weighed against the massive chilling effect you created by publicly shaming OpenAI so hard that they completely lost their confidence: they bolted a ridiculous profanity filter onto GPT-3 that flags pretty much anything mildly naughty, and now AI Dungeon is in hot water because you've given the world an excuse to be pissed off about a mindless, memory-less AI (https://news.ycombinator.com/item?id=23346972) generating text that offends someone, somewhere, for some reason.
Well, people are going to be offended. Let them be. I am half Jewish. There is a very realistic chance that GPT-3 was trained on the full text of "Mein Kampf," and I couldn't care less. Even if someone went out of their way to train the most offensive, most threatening language model the world has ever seen, what could they really do with that power? Are they going to hurt you right in the feelings? Is the language model going to argue vigorously for the extermination of all Jews? If it did, who would listen? No no, don't try to say that yes, there's a very real danger that people might generate propaganda and influence others. You know as well as I do that this is extraordinarily difficult in practice for multiple reasons, and that no one has ever seen an interactive, adaptive AI that can dynamically deliver the most persuasive propaganda to Reddit and fool everybody into thinking it's a genuine grassroots movement. And you know as well as I do that that would be roughly equivalent to inventing AGI, and that AGI is still nowhere in sight. It's not even living in the same country. Hell, it's not in our sector of the galaxy. See that star in the sky? It's possible that we're more likely to reach that star before we invent AGI, because nobody knows how to invent AGI yet.
Before I get too worked up, I should keep some focus and say constructive things. One. If you, as an ML researcher, find yourself wanting to release your work, but your corporation is putting the red tape around you, push back. They need you more than you need them, even if it doesn't feel like it.
Two. Releasing work is generally beneficial for the world. I can't think of a single model release that has ever harmed the world. Let's wait until one actually does before we freak out about whether it might.
Three (https://www.youtube.com/watch?v=jpw2ebhTSKs&ab_channel=TheCh...). By releasing your work, you give countless people the opportunity to better themselves. Thanks to GPT-2 1.5B, we were able to vastly extend the capabilities of https://reddit.com/r/subsimulatorgpt2 to include over a hundred subreddits in a single model. This has roughly zero commercial value, yet has improved the lives of countless people who show up to giggle at all the (sometimes horribly offensive) things that robots say to each other. I was proud to be a part of that, and I want to be a part of more.
Please release your models. The most recent model release was CLIP, and it's already had a profound impact. Just look at how freaking awesome this is! https://twitter.com/l4rz/status/1367853921427984390 They're using CLIP to turn someone into Dracula! That's badass, and it inspires people (like me, at one time) to get into ML and become the researchers of tomorrow (as I try to be now).
I remember rooting for OpenAI's Dota 2 bot while they were facing off vs .. OG, I think? It's been several years. I spent five years playing dota-type games. It was one of the most exciting things I'd ever seen. And being able to beat rtz 1v1 mid SF? Not merely beat him, but annihilate him? Holy crap, give me that model. Why are you not releasing that model?! It's so cool!
And now nobody gets to play with it ever, and it's locked up in the real AI dungeon: OpenAI's. I hope they reconsider.
I generally agree with your point, but you're setting the bar way too high. Repeating an idea over and over is a depressingly effective way to bring something into the public discourse; if it happens to align with the aims of some opportunists (slandering a political opponent, casting doubt on an inconvenient fact, etc.) they will happily spend the manual effort to make less robotic-seeming versions. At that point it gets into the news, and well-intentioned individuals are burdened with proving why the batshit-nonsense-du-jour is batshit nonsense.
Whilst this sort of thing is easily dismissed individually, it can serve as ammunition for Gish Gallops, JAQing off, and generally nudge the Overton Window (even if subconsciously).
You give Reddit as an example; there are probably many trying to spam it with astroturf at the moment, it's debatable how effective that is.
If we lower our standards a bit and apply the same reasoning to 4chan, it seems even more plausible (due to the ephemeral, anonymous, disjointed nature of its threads).
I like a lot of the idealism in this comment, and learned a few things too, so thanks for that. I'm not an ML researcher, which may cloud my views a bit, but I think you would be hard-pressed to argue that the release of ML models hasn't harmed the world. Perhaps we haven't reached AGI, but we don't need to for the release of ML models to harm the world.
I'll give two examples:
1) image detection - Now that image detection is good enough, surveillance in dictatorships can be run more efficiently and scalably, in a way the KGB couldn't have dreamt of. Even if the release of a model like YOLO saves only an hour of an ML researcher's time on a dictatorship's surveillance project, that can cause a lot of harm to the oppressed people living in those countries.
2) troll bots based on GPT-2. You gave an example of high-quality, human-quality propaganda, but what troll bots lack in quality can be made up for in quantity. If you run a lot of troll bots and can sway the dominant viewpoint on all the forums you want to target (which you can, since these bots are infinitely scalable), you've achieved your purpose. I also think you are overestimating the quality needed to influence the average person's worldview. For example, we saw from the Cambridge Analytica news that all they had to do to affect a few swing voters' behavior was target them with a few ads. I personally read through some of the example output from GPT-3, and if I were browsing a forum, I wouldn't be all that confident about whether the author was a bot or not.
Why are these researchers doing research on this? Seriously, why? Why do research on something and then not release the model over gasp ethical concerns -- if they had ethical concerns, why did they pursue this avenue of research in the first place? Why advance the state of the art if it's wrong to do so?
Not releasing the model for "ethical" concerns is a cop-out. There's probably another reason; what is that reason?
In reality though, diffusion models are probably fast and lightweight enough that (with patience) you'll be able to generate some neat stuff yourself. At least, if you have an nVidia GPU, or are lucky with a Colab instance. Me, though, I never had one, and a lot of times I was constrained to play with the models that I could run on CPU inference only. I was often delighted to discover that CPU inference gets you quite far! But with 25 forward passes instead of 1, it would be 25x more painful to play around with them -- on the order of waiting 15+ minutes per attempt, rather than seeing things happen in ~25s. The activation energy adds up, and I'm keen to keep ML as accessible as possible for people who just want to play, since playing is a key step toward "Ok, I guess this ML stuff is worth taking more seriously. Let me dive in..."
That's not to dismiss diffusion models whatsoever. I just had a slight twinge of sadness that being able to generate interpolation videos (one of the coolest things you can possibly do with generative image models) might be out of reach of people without GPUs (I was one of them).
I guess it is about that simple if you'd be satisfied with the release of a documentation-free binary blob of weights. Somehow I suspect others might complain about usability.
All I can say is, please release it anyway as a documentation-free binary blob of weights. People like me will figure out how to make it work, and to make it accessible. Gamedev has been showing for decades that the community will do an amazing amount of work just for the fun of it, and I assure you that we'd all be delighted to figure out some clever way to inference from a model trained on a massive cluster of GPUs.
It sounds like you’d like the title to be prefaced with “one time that”.
I don’t think anyone reading this says, “hey, GANs are done for.” I doubt the authors think that, and I (clearly) don’t get the impression that’s what they mean.
Even the first sentence of the abstract:
> We show that diffusion models can achieve image sample quality superior to the current state-of-the-art generative models
has the word “current” in front of “state-of-the-art generative models.”
You're correct about what you're saying, though. It's also worse: FID is a measurement biased towards imagenet-style images, and this model was also trained on imagenet, so it's quite well-suited for achieving small FID values. Whereas for generative anime, it's still unclear what the best model architecture is.
This is a research paper.
I would like to echo your question: why? Does academia incentivize clickbait?
IIRC, the last research I saw on this, which was a while back, argued that clever titles get read more and cited less. https://www.researchtrends.com/issue24-september-2011/headin...
The article you want is probably this one: https://ajolicoeur.wordpress.com/the-new-contender-to-gans-s...
I would launch into a thorough explanation, but it would likely not be correct in every detail, because it's been around nine months since I was immersing myself in DDPM type models. But, broadly speaking, with normal training, your goal is to train a model (show it a bunch of examples) until the model can guess the right answer most of the time in one try. Except, "guessing the right answer" is actually an easier problem than generating an image, because the model usually gives you its top-N guesses, so it says "I think it's a snake or a dog or an apple."
Whereas with generative images, it's much harder to come up with a technique that can be "sort of correct": if you generate a stylegan image, it either looks cool or looks like crap, and it's rather difficult to automatically take a crappy output and turn it into something that looks cool. (The "automatic" is key; there are manual human-guided techniques that I'm quite fond of, and amazed no one's turned it into a photoshop-type plugin yet, but the field of ML seems to compete/care about fully automatic solutions right now. For some reason.)
DDPM is the inverse: you have a trained model, and you start with noise, but then it gets progressively closer to a cool looking result by searching multiple times (i.e. taking multiple forward passes). That's as much as I remember, I'm afraid.
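For a rough picture of what that iterative loop looks like, here's a toy sketch of DDPM-style ancestral sampling. This is my own illustration, not the paper's code: `eps_model(x, t)` is a hypothetical stand-in for the trained network that predicts the noise present in x at step t, and the linear beta schedule is an assumption.

```python
import numpy as np

def ddpm_sample(eps_model, shape, T=25, rng=None):
    """Sketch of DDPM ancestral sampling: start from pure noise and take
    T denoising steps, each of which costs one forward pass of the model."""
    if rng is None:
        rng = np.random.default_rng()
    betas = np.linspace(1e-4, 0.02, T)      # noise schedule (an assumption)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    x = rng.standard_normal(shape)          # begin as pure Gaussian noise
    for t in reversed(range(T)):            # T forward passes -- the "O(N)" cost
        eps = eps_model(x, t)               # model's guess of the noise in x
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                           # inject fresh noise on all but the last step
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x
```

The T=25 default is where the "25 or more forward passes" from upthread comes from: each refinement step is a full pass through the network, versus stylegan's single pass.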
The dota community seemed to like it. :)
Basically, it was an interactive editor where you could slightly move along stylegan directions, combined with Peter Baylies' reverse encoder to slightly morph the image to a specific face on demand.
It was instantly so much better than any automated solution. It felt like being a pilot in front of the controls of a cockpit.
Drawing samples from "simple" distributions like normal distributions is computationally easy. You can do it in microseconds. Sampling an arbitrary 1-D distribution is a bit harder - you have to invert its cumulative distribution function (CDF) to map a uniform random variable onto it, and potentially use a rejection sampling approach to sample 1-D values under this distribution.
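To make the 1-D case concrete, here's a toy sketch of inverse-transform sampling (my own illustration, not from the paper): discretize a bimodal density on a grid, build its CDF, and push uniform draws through the inverted CDF.

```python
import numpy as np

# Toy 1-D target: a bimodal mixture of two Gaussians, discretized on a grid.
xs = np.linspace(-5, 5, 10_000)
pdf = np.exp(-0.5 * (xs + 2) ** 2) + np.exp(-0.5 * (xs - 2) ** 2)
pdf /= pdf.sum()                      # normalize on the grid

cdf = np.cumsum(pdf)                  # monotone map from x onto [0, 1]

# Inverse-transform sampling: uniform draws, looked up through the CDF.
u = np.random.rand(100_000)
idx = np.clip(np.searchsorted(cdf, u), 0, len(xs) - 1)
samples = xs[idx]
# A histogram of `samples` reproduces the bimodal target density.
```

No rejection step is needed here because the grid lookup inverts the CDF directly; rejection sampling becomes useful when you can evaluate the density but not invert its CDF.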
Sampling a high-D distribution (such as the distribution of images) is even harder - you need to learn a mapping from this high-D image back to "tractable" densities. This imposes some pretty harsh "optimization bottlenecks" when trying to contort the manifold of images to normal distributions. The whole point of this exercise is that the transformation respects a valid probability distribution, so you can start from the normal distribution and apply this mapping to get a valid sample from the image distribution. This in practice is pretty hard, and the quality of samples seems to be lower than other forms of deep generative models which use fewer parameters.
Now instead of learning such a complex, hard-to-optimize transformation to valid densities, what if we instead learn a function E(x) that outputs a scalar "energy". The energy is low for "realistic" images, and high for unrealistic images. Kind of like inverse probability, except it's not normalized - the energy value tells you nothing about likelihood unless you know the energy for all other possible images. This tends to actually be easier than learning densities, because the functional form of this energy function is unconstrained.
Furthermore, not knowing likelihoods doesn't stop you from getting "realistic" images, as all you need to do is descend the gradient x -= d/dx E(x), which takes you to an image with "lower" energy (i.e. more realistic). Under certain procedures (e.g. adding some noise to the gradient), this can be thought of as actually equivalent to sampling from a valid probability distribution, even though you can't compute its likelihood analytically.
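The "gradient descent plus noise" procedure is (unadjusted) Langevin dynamics. Here's a toy sketch of it (my own illustration, assuming a quadratic energy, not anything from the paper):

```python
import numpy as np

def langevin_sample(grad_E, x0, steps=500, step_size=1e-2, rng=None):
    """Unadjusted Langevin dynamics: noisy gradient descent on an energy E.

    Each step moves downhill on E and injects Gaussian noise scaled by
    sqrt(2 * step_size); in the long run this draws samples from
    p(x) proportional to exp(-E(x)), with no normalizing constant needed.
    """
    if rng is None:
        rng = np.random.default_rng()
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x = x - step_size * grad_E(x) + np.sqrt(2 * step_size) * rng.standard_normal(x.shape)
    return x

# Toy energy E(x) = x^2 / 2, so exp(-E) is a standard normal distribution.
rng = np.random.default_rng(0)
xs = np.stack([langevin_sample(lambda x: x, np.zeros(1), rng=rng) for _ in range(200)])
# xs now behaves like 200 draws from N(0, 1): mean near 0, std near 1.
```

Without the noise term this is plain gradient descent and collapses to the energy minimum; the noise is what turns "find the most realistic image" into "sample from the distribution of realistic images."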
The diffusion probabilistic model you refer to can be thought of as such a model - the more steps you take (i.e. the more compute you spend), the better the quality of the samples.
GANs can be thought of as a one-pass neural network amortization of the end result of the diffusion process, but unlike MCMC methods, they cannot be "iteratively refined" with additional compute. Their sample quality is limited to whatever the generator spits out, even if you had additional compute.
But at a surface level, there isn't a clear connection between DDPM and energy-based models.
In a similar vein, there's this ICLR paper from this year using stochastic differential equations for generative modelling: https://arxiv.org/abs/2011.13456
GANs' blind spots have the potential to generate some confusing optical illusions.
Repeatedly clicking refresh on https://thispersondoesnotexist.com/ I eventually reached an image where the top of a generated man's head melted into the room behind it.
Melted, not faded.
The room and the man's head were not in any way separate and superimposed; it was a perfect fluid transition.
Looking at it, it genuinely felt like my brain was malfunctioning.
I don't know the extent to which it was due to under-training or some more fundamental limitation of GANs; the feeling that I had was that the generator had found an exploitable hole in the discriminator's comprehension.