Direct pixel-space megapixel image generation with diffusion models (crowsonkb.github.io)
280 points by stefanbaumann 11 months ago | 47 comments

I'm one of the authors; happy to answer questions. this arch is of course nice for high-resolution synthesis, but there's some other cool stuff worth mentioning..

activations are small! so you can enjoy bigger batch sizes. this is due to the 4x patching we do on the ingress to the model, and the effectiveness of neighbourhood attention in joining patches at the seams.
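A rough sketch of why patching shrinks the sequence, reading "4x patching" as non-overlapping 4x4 pixel patches (that reading, and the helper name `token_count`, are assumptions for illustration rather than anything taken from the codebase):

```python
def token_count(height: int, width: int, patch: int = 4) -> int:
    """Number of tokens after splitting an image into patch x patch tiles."""
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    return (height // patch) * (width // patch)


pixels = 1024 * 1024
tokens = token_count(1024, 1024)
# Each token covers patch * patch pixels, so the sequence is 16x shorter
# than one-token-per-pixel at patch size 4.
print(tokens, pixels // tokens)
```

Every layer's activations scale with the token count, so a shorter sequence at the outermost level is where the batch-size headroom would come from.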

the model's inductive biases are pretty different than (for example) a convolutional UNet's. the innermost levels seem to train easily, so images can have good global coherence early in training.

there's no convolutions! so you don't need to worry about artifacts stemming from convolution padding, or having canvas edge padding artifacts leak an implicit position bias.

we can finally see what high-resolution diffusion outputs look like _without_ latents! personally I think current latent VAEs don't _really_ achieve the high resolutions they claim (otherwise fine details like text would survive a VAE roundtrip faithfully); it's common to see latent diffusion outputs with smudgy skin or blurry fur. what I'd like to see in the future of latent diffusion is to listen to the Emu paper and use more channels, or a less ambitious upsample.

it's a transformer! so we can try applying to it everything we know about transformers, like sigma reparameterisation or multimodality. some tricks like masked training will require extra support in [NATTEN](https://github.com/SHI-Labs/NATTEN), but we're very happy with its featureset and performance so far.

but honestly I'm most excited about the efficiency. there's too little work on making pretraining possible at GPU-poor scale. so I was very happy to see HDiT could succeed at small-scale tasks within the resources I had at home (you can get nice oxford flowers samples at 256x256px with half an hour on a 4090). I think with models that are better fits for the problem, perhaps we can get good results with smaller models. and I'd like to see big tech go that direction too!

-Alex Birch

Hi Alex. Amazing work. I scanned the paper and dusted off my aging memories of Jeremy Howard’s course. Will your model live happily alongside the existing SD infrastructure such as ControlNet, IPAdapter, and the like? Obviously we will have to retrain these to fit onto your model, but conceptually, does your model have natural places where adapters of various kinds can be attached?

regarding ControlNet: we have a UNet backbone, so the idea of "make trainable copies of the encoder blocks" sounds possible. the other part, "use a zero-inited dense layer to project the peer-encoder output and add it to the frozen-decoder output" also sounds fine. not quite sure what they do with the mid-block but I doubt there'd be any problem there.

regarding IPAdapter: I'm not familiar with it, but from the code it looks like they just run cross-attention again and sum the two attention outputs. feels a bit weird to me, because the attention probabilities add up to 2 instead of 1. and they scale the bonus attention output only instead of lerping. it'd make more sense to me to formulate it as a cross-cross attention (Q against cat([key0, key1]) and cat([val0, val1])), but maybe they wanted it to begin as a no-op at the start of training or something. anyway.. yes, all of that should work fine with HDiT. the paper doesn't implement cross-attention, but it can be added in the standard way (e.g. like stable-diffusion) or as self-cross attention (e.g. DeepFloyd IF or Imagen).
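To make the "probabilities add up to 2" point concrete, here is a minimal pure-Python sketch with a single query and toy 2-d vectors. The helper names, the toy values and `ip_scale` are all made up for illustration; this is not IPAdapter's actual code.

```python
import math


def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def attend(q, keys, vals):
    """Single-query scaled dot-product attention; returns (output, probs)."""
    scale = 1.0 / math.sqrt(len(q))
    logits = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    probs = softmax(logits)
    dim = len(vals[0])
    out = [sum(p * v[i] for p, v in zip(probs, vals)) for i in range(dim)]
    return out, probs


q = [1.0, 0.0]
text_k, text_v = [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]
img_k, img_v = [[0.5, 0.5]], [[2.0, 2.0]]

# Summed variant: two separate softmaxes, so the total probability mass is 2.
out_text, p_text = attend(q, text_k, text_v)
out_img, p_img = attend(q, img_k, img_v)
ip_scale = 1.0  # hypothetical weight on the extra (image) branch
summed = [a + ip_scale * b for a, b in zip(out_text, out_img)]
summed_mass = sum(p_text) + sum(p_img)

# Concatenated variant: one softmax over all keys, so the mass stays at 1.
concat, p_cat = attend(q, text_k + img_k, text_v + img_v)
concat_mass = sum(p_cat)
print(summed_mass, concat_mass)
```

In the concatenated form the text and image tokens compete inside one softmax, whereas in the summed form the image branch is always added at full strength unless `ip_scale` turns it down.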

I'd recommend though to make use of HDiT's mapping network. in our attention blocks, the input gets AdaNormed against the condition from the mapping network. this is currently used to convey stuff like class conditions, Karras augmentation conditions and timestep embeddings. but it supports conditioning on custom (single-token) conditions of your choosing. so you could use this to condition on an image embed (this would give you the same image-conditioning control as IPAdapter but via a simpler mechanism).
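A minimal sketch of what conditioning through an adaptive norm can look like. The thread doesn't spell out the exact form of the AdaNorm, so this assumes an RMS norm whose per-feature scale is `1 + weight @ cond`; the function name, the weights and the toy vectors are hypothetical.

```python
import math


def ada_rms_norm(x, cond, weight, eps=1e-6):
    """RMS-normalise x, then scale each feature by (1 + weight @ cond)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    scales = [1.0 + sum(w * c for w, c in zip(row, cond)) for row in weight]
    return [v / rms * s for v, s in zip(x, scales)]


x = [3.0, 4.0]       # one token's features
cond = [1.0, -1.0]   # single-token condition, e.g. a class or image embedding

zero_w = [[0.0, 0.0], [0.0, 0.0]]  # with zero weights the condition is a no-op
w = [[0.5, 0.0], [0.0, 0.5]]       # hypothetical learned projection

plain = ada_rms_norm(x, cond, zero_w)
modulated = ada_rms_norm(x, cond, w)
print(plain, modulated)
```

The condition only rescales features, which is why a single pooled embedding fits this path naturally while a sequence of text tokens would not.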

IPAdapter, I am curious if there are useful GUIs for this? Creating image masks through uploading to colab is not so cute.

Here's one example: https://github.com/Acly/krita-ai-diffusion/

But generally, most other UIs support it. It has serious limitations though, for example it center-crops the input to 224x224px. (which is enough for a surprisingly large amount of uses, but not enough for many others)

Yes. I discussed this issue with the author of the ComfyUI IP-Adapter nodes. It would doubtless be handy if someone could end-to-end train a higher resolution IP-Adapter model that integrated its own variant of CLIPVision that is not subject to the 224px constraint. I have no idea what kind of horsepower would be required for that.

A latent space CLIPVision model would be cool too. Presumably you could leverage the semantic richness of the latent space to efficiently train a more powerful CLIPVision. I don’t know whether anyone has tried this. Maybe there is a good reason for that.

I appreciate the restraint of showing the speedup on a log-scale chart rather than trying to show a 99% speed up any other way.

I see your headline speed comparison is to "Pixel-space DiT-B/4" - but how does your model compare to the likes of SDXL? I gather they spent $$$$$$ on training etc, so I'd understand if direct comparisons don't make sense.

And do you have any results on things that are traditionally challenging for generative AI, like clocks and mirrors?

Alex - I run Invoke (one of the popular OSS SD UIs for pros)

Thanks for your work - it’s been impactful since the early days of the project.

Excited to see where we get to this year.

ah, originally lstein/stable-diffusion? yeah that was an important fork for us Mac users in the early days. I have to confess I've still never used a UI. :)

this year I'm hoping for efficiency and small models! even if it's proprietary. if our work can reduce some energy usage behind closed doors that'd still be a good outcome.

Yes, indeed. Lincoln's still an active maintainer.

Energy efficiency is key - Especially with some of these extremely inefficient (wasteful, even) features like real-time canvas.

Good luck - Let us know if/how we can help.

Did you do any inpainting experiments? I can imagine a pixel-space diffusion model to be better at it than one with a latent auto-encoder.

Not yet, we focused on the architecture for this paper. I totally agree with you though - pixel space is generally less limiting than a latent space for diffusion, so we would expect good performance on inpainting and other editing tasks.

Seems like a solid paper from a skim through it. My rough summary:

The popular large scale diffusion models like StableDiffusion are CNN based at their heart, with attention layers sprinkled throughout. This paper builds on recent research exploring whether competitive image diffusion models can be built out of purely transformers, no CNN layers.

In this paper they build a similar U-Net like structure, but out of transformer layers, to improve efficiency compared to a straight Transformer. They also use local attention when the resolution is high to save on computational cost, but regular global attention in the middle to maintain global coherence.
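A back-of-the-envelope sketch of why local attention matters at high resolution, counting query-key pairs per attention layer. The 7x7 neighbourhood is an assumed kernel size for illustration, and the function names are made up.

```python
def global_pairs(side: int) -> int:
    """Query-key pairs when every token attends to every token."""
    n = side * side
    return n * n


def local_pairs(side: int, kernel: int = 7) -> int:
    """Query-key pairs when each token attends to a kernel x kernel neighbourhood."""
    return side * side * kernel * kernel


for side in (16, 64, 256):  # token-grid side lengths, coarse to fine
    ratio = global_pairs(side) / local_pairs(side)
    print(side, global_pairs(side), local_pairs(side), round(ratio, 1))
```

Global cost grows with the square of the token count while local cost grows linearly, so the gap widens quickly at the fine levels and is small at the coarse levels where global attention is kept.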

Based on ablation studies this allows them to maintain or slightly improve FID score compared to Transformer-only diffusion models that don't do U-net like structures, but at 1/10th the computation cost. An incredible feat for sure.

There are a variety of details: RoPE positional encoding, GEGLU activations, RMSNorm, learnable skip connections, learnable cosine-sim attention, neighborhood attention for the local attention, etc.

The biggest gains in FID occur when the authors use "soft-min-snr" as the loss function; FID drops from 41 to 28!
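For intuition only: the thread doesn't define "soft-min-snr". Assuming it is a smooth stand-in for the Min-SNR clamp `min(SNR, gamma)`, one such smooth function is `SNR * gamma / (SNR + gamma)`. That form, `gamma = 5`, and `SNR = 1 / sigma**2` are all assumptions here, not values confirmed by the thread.

```python
def snr(sigma: float) -> float:
    """Signal-to-noise ratio for unit-variance data at noise level sigma (assumed)."""
    return 1.0 / (sigma * sigma)


def min_snr(s: float, gamma: float = 5.0) -> float:
    """Hard clamp: the weight stops growing once the SNR exceeds gamma."""
    return min(s, gamma)


def soft_min_snr(s: float, gamma: float = 5.0) -> float:
    """Smooth alternative: approaches s for small s and gamma for large s."""
    return s * gamma / (s + gamma)


for sigma in (0.05, 0.2, 1.0, 5.0, 50.0):
    s = snr(sigma)
    print(sigma, round(min_snr(s), 4), round(soft_min_snr(s), 4))
```

Both weightings keep low-noise timesteps from dominating the loss; the smooth version just avoids the kink at `SNR = gamma`.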

Lots of ablation study was done across all their changes (see Table 1).

Training is otherwise completely standard: AdamW with a constant learning rate of 5e-4 and weight decay of 0.01, batch size 256, 400k steps for most experiments at 128x128 resolution.

So yeah, overall seems like solid work that combines a great mixture of techniques and pushes Transformer based diffusion forward.

If scaled up I'm not sure it would be "revolutionary" in terms of FID compared to SDXL or DALLE3, mostly because SD and DALLE already use attention obviating the scaling issue, and lots of tricks like diffusion based VAEs. But it's likely to provide a nice incremental improvement in FID, since in general Transformers perform better than CNNs unless the CNNs are _heavily_ tuned.

And being pixel based rather than latent based has many advantages.

FID doesn't reward high-resolution detail. the inception feature size is 299x299! so we are forced to downsample our FFHQ-1024 samples to compute FID.

it also doesn't punish poor detail either! this advantages latent diffusion, which can claim to achieve a high resolution but without actually needing to have correct textures to get good metrics.

I enjoyed this paper (I share a discord with the author so I read it a bit earlier).

It's not entirely clear from the comparison numbers at the end, but I think the big argument here is efficiency for the amount of performance achieved. One can get lower FID numbers, but also with a ton of compute.

I can't really speak technically to it as I've not given it a super in depth look, but this seems like a nice set of motifs for going halfway between a standard attention network and a convnet in terms of compute cost (and maybe performance)?

The large-resolution scaling seems to be a strong suit as a result. :)

Thanks a lot!

Yeah, the main motivation was trying to find a way to enable transformers to do high-resolution image synthesis: transformers are known to scale well to extreme, multi-billion parameter scales and typically offer superior coherency & composition in image generation, but current architectures are too expensive to train at scale for high-resolution inputs.

By using a hierarchical architecture and local attention at high-resolution scales (but retaining global attention at low-resolution scales), it becomes viable to apply transformers at these scales. Additionally, this architecture can now directly be trained on megapixel-scale inputs and generate high-quality results without having to progressively grow the resolution over the training or applying other "tricks" typically needed to make models at these resolutions work well.

Which discord, if it's open to the public? I was on one with kath in 2021 and loved her insights, would love to again

You and the guy below you in this thread should probably tag me on twitter, same tag as here, I can point you. I do not especially want to leave the discord link in a frontpage hn thread.

Same; a good ML focused discord would be great. Training ViTs all day is lonely work. I'm mostly locked into skimming the "Research" channels of image generation discords. LAION used to be decent with a good amount of interesting discussion, but it seems to have devolved into toxicity in the last year.

See my other comment replying to that.

LAION is good

I hope that all these insights about diffusion model training that have been explored in the last few years will be used by Stability AI to train their large text-to-image models, because when it comes to that they just use the most basic pipeline you can imagine, with plenty of problems that get "solved" by workarounds. For example, to train SDXL they used the scheduler from the DDPM paper (2020), the epsilon-objective and noise-offset. The last of those is an ugly workaround that was created when people realized that SD v1.5 wasn't able to generate images that were too dark or bright, a problem related to the epsilon-objective that causes the model to always generate images with a mean close to 0 (the same as the gaussian noise).

A few people have finetuned Stable Diffusion models on v-objective and solved the problem from the root.
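For reference, a small sketch of what the v-objective's target looks like next to the epsilon target, assuming a variance-preserving schedule with `alpha**2 + sigma**2 == 1` (the cosine schedule and the scalar "image" are just for illustration):

```python
import math


def v_target(x0, eps, alpha, sigma):
    """v-prediction target: a mix of the noise and the clean image."""
    return alpha * eps - sigma * x0


def x0_from_v(x_t, v, alpha, sigma):
    return alpha * x_t - sigma * v


def eps_from_v(x_t, v, alpha, sigma):
    return sigma * x_t + alpha * v


t = 0.9  # late (noisy) timestep in [0, 1]
alpha, sigma = math.cos(t * math.pi / 2), math.sin(t * math.pi / 2)

x0, eps = 0.8, -0.3            # scalar stand-ins for image and noise
x_t = alpha * x0 + sigma * eps  # noised sample
v = v_target(x0, eps, alpha, sigma)
print(v, x0_from_v(x_t, v, alpha, sigma), eps_from_v(x_t, v, alpha, sigma))
```

At high noise levels `sigma` is close to 1, so the v target is dominated by `-x0`: the model is asked about the clean image (including its overall brightness) exactly where an epsilon target says the least about it.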

I have good news about who wrote this paper

Two authors are from Stability AI, that's the reason why I wrote the comment.

I think NATTEN does not support cross-attention. I wonder if the authors have tried any text-conditioned cases? Can cross-attention only be added alongside the regular attention, or could the conditioning be added through AdaNorm?

cross-attention doesn't need to involve NATTEN. there's no neighbourhood involved because it's not self-attention. so you can do it the stable-diffusion way: after self-attention, run torch sdp with Q=image and K=V=text.

I tried adding "stable-diffusion-style" cross-attn to HDiT, text-conditioning on small class-conditional datasets (oxford flowers), embedding the class labels as text prompts with Phi-1.5. trained it for a few minutes, and the images were relevant to the prompts, so it seemed to be working fine.

but if instead of a text condition you have a single-token condition (class label) then yeah the adanorm would be a simpler way.

Are there any public pretrained checkpoints available or planned?

This is probably a stupid question, but what kind of image generation does this do? The architecture overview shows "input image", and I don't see anything about text to image. Is it super resolution? Does class-conditional mean that it takes a class like "car" or "face" and generate a new random image of that class?

It's ImageNet class-conditioned, FFHQ unconditioned.

>Does class-conditional mean that it takes a class like "car" or "face" and generate a new random image of that class?


> Is it super resolution?

nope, we don't do Imagen-style super-resolution. we go direct to high resolution with a single-stage model.

I was referring to the input image in the diagram, what is that and how is the output image generated from it? Is it 256x256 noise that gets denoised into an image? I guess what I'm really asking is what guides the process into the final image if it's not text to image?

The "input image" is just the noisy sample from the previous timestep, yes.

The overall architecture diagram does not explicitly show the conditioning mechanism, which is a small separate network. For this paper, we only trained on class-conditional ImageNet and completely unconditional megapixel-scale FFHQ.

Training large-scale text-to-image models with this architecture is something we have not yet attempted, although there's no indication that this shouldn't work with a few tweaks.

Thank you, I'm not used to reading this kind of research paper but I think I got the gist of it now.

Can this architecture be used to distill models that need fewer timesteps like LCMs or SDXL turbo?

Both Latent Consistency Models and Adversarial Diffusion Distillation (the method behind SDXL Turbo) are methods that do not depend on any specific properties of the backbone. So, as Hourglass Diffusion Transformers are just a new kind of backbone that can be used just like the Diffusion U-Nets in Stable Diffusion (XL), these methods should also be applicable to it.

Looking at the output image examples, very nice, although they seem a little blurry. But I guess that's a dataset issue? Have you tried training anything above 1024x1024? Hope someone releases a model based on this since open source pixel space models are a rarity afaik

the FFHQ-1024 examples shouldn't be blurry. you can download the originals from the project page[0] — click any image in the teaser, or download our 50k samples.

the ImageNet-256 examples also aren't typically blurry (but they are 256x256 so your viewer may be bicubic scaling them or something). the ImageNet dataset _can_ have blurry, compressed or low resolution training samples, which can afflict some classes more than others, and we learn to produce samples like the training set.

[0] https://crowsonkb.github.io/hourglass-diffusion-transformers...

Can I ask one basic thing --> From what are the images generated?

The models presented in the paper are trained on class-conditional ImageNet (where the input is Gaussian noise and one of 1000 classes, e.g., "car") and unconditional FFHQ (where the input is only Gaussian noise).
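As a toy illustration of "Gaussian noise plus a class label in, image out", here is a sketch of an Euler sampling loop. Everything in it is a stand-in: the "images" are single numbers, `CLASS_MEANS` is a made-up dataset with one value per class, and `toy_denoiser` replaces the trained network with the ideal denoiser for that dataset.

```python
import random

# Hypothetical stand-in "dataset": every image of a class is the same number.
CLASS_MEANS = {"car": 0.7, "face": -0.4}


def toy_denoiser(x, sigma, label):
    """Ideal denoiser for this dataset: always predict the class's image."""
    return CLASS_MEANS[label]


def sample(label, sigmas, seed=0):
    rng = random.Random(seed)
    x = rng.gauss(0.0, sigmas[0])  # start from Gaussian noise
    for sigma, sigma_next in zip(sigmas, sigmas[1:]):
        denoised = toy_denoiser(x, sigma, label)
        d = (x - denoised) / sigma        # direction pointing towards the noise
        x = x + d * (sigma_next - sigma)  # Euler step down to the next noise level
    return x


sigmas = [80.0, 20.0, 5.0, 1.0, 0.2, 0.0]  # decreasing noise levels, ending at 0
print(sample("car", sigmas), sample("face", sigmas))
```

Each iteration feeds the previous step's noisy sample back in, and the label is the only thing steering which image the noise is denoised towards. An unconditional model would run the same loop without a label.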

Not text but an id representing a class/category of images from the dataset. Or they are “unconditional” and the model tries to output something similar to a random image from the dataset each time.


Making a bank account work for you is a hard discipline and requires budgeting and the like. True, not all of us can "hack" it, but that doesn't mean that with some community classes and help you'll be able to use your bank account well!

Why do I feel like a chatbot with this message.

You're actually talking to a bot, in this particular case. 12 minutes old with -2 karma. :berk:

Did you seriously just berk me on HN.

What next, is a Walmart cashier going to do that RobloxNite "dab dance" or whatever on me?

Discusting. Dusgraseful.

You can take the Katt Williams route in response, or the way of Zen. The choice is yours.

Accusations of being a bot are explicitly against the rules.

HN to SEO spam pipeline

Ah yes, the _checks notes_ ever popular bank account hacking SEO spam line.

(Legitimately I am confused but maybe it is a one-two scam or the like, lolz. <3 :')))) ;'PPPP ;'PPPP)

Sure. "pay me xx amount of money to hack your enemy's bank account", or possibly "pay me xx amount of money to get your money back after you've been hacked"

