
Some cool stuff from the paper:

• Earlier SD versions would often generate images where the head or feet of the subject were cropped out of frame. This was because random cropping was applied to its training data as data augmentation, so it learned to make images that looked randomly cropped — not ideal! To fix this issue, they still used random cropping during training, but also gave the crop coordinates to the model so that it would know it was training on a cropped image. Then they set those crop coordinates to 0 at test time, and the model keeps the subject centered! They also did a similar thing with the pixel dimensions of the image, so that the model can learn to operate at different "DPI" ranges. (Rough sketch of what that conditioning looks like after this list.)

• They're using a two-stage model instead of a single monolithic model. They have one model trained to get the image "most of the way there" and a second model to take the output of the first and refine it, fixing textures and small details. Sort of mixture-of-experts-y. It makes sense that different "skillsets" would be required for the different stages of the denoising process, so it's reasonable to train separate models for each of the stages. Raises the question of whether unrolling the process further might yield more improvements — maybe a 3- or 4-stage model next?

• Maybe I missed it, but I don't see in the paper whether the images they're showing come from the raw model or the RLHF-tuned variant. SDXL has been available for DreamStudio users to play with since April, and Emad indicated that the reason for this was to collect tons of human preference data. He also said that when the full SDXL 1.0 release happens later this month, both RLHF'd and non-RLHF'd variants of the weights will be available for download. I look forward to seeing detailed comparisons between the two.

• They removed the lowest-resolution level of the U-Net — the 8x downsample block — which makes sense to me. I don't think there's really that much benefit from wasting flops on a tiny 4x4 or 8x8 latent tbh. Also thought it was interesting that they got rid of the cross-attention on the highest-resolution level of the U-Net.
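
For anyone wondering what that crop/size conditioning looks like mechanically, here's a rough Python/PyTorch sketch. The function names and embedding details are my own guesses at the idea the paper describes (sinusoidal/Fourier embeddings of the crop coordinates and original image size, injected alongside the timestep embedding), not SDXL's actual code:

    import math
    import torch

    def sincos_embed(x: torch.Tensor, dim: int = 256) -> torch.Tensor:
        # Standard sinusoidal embedding, same flavor as diffusion timestep embeddings.
        half = dim // 2
        freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
        args = x.float()[:, None] * freqs[None, :]
        return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

    def embed_microconditioning(crop_top: int, crop_left: int, orig_h: int, orig_w: int) -> torch.Tensor:
        # Hypothetical helper: embed each conditioning scalar and concatenate,
        # to be fed into the U-Net next to the timestep embedding.
        scalars = torch.tensor([crop_top, crop_left, orig_h, orig_w])
        return torch.cat([sincos_embed(s[None]) for s in scalars], dim=-1)  # shape (1, 4 * 256)

    # Training: tell the model the truth about the random crop it is seeing.
    train_cond = embed_microconditioning(crop_top=57, crop_left=12, orig_h=768, orig_w=512)

    # Inference: claim "no crop, full-size image", so the model keeps subjects framed.
    test_cond = embed_microconditioning(crop_top=0, crop_left=0, orig_h=1024, orig_w=1024)

The nice part is that the inference-time "lie" (crop = (0, 0)) is free: the model never had to unlearn cropping, it just learned to condition on it.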




The two-stage process using different skill sets reminds me of the old painting masters. Often the master would come up with the overall composition, which requires a lot of creativity and judgement of forms; apprentice painters would then come in and paint clothes, trees or whatever, and then the master would finish off the critical details such as eyes, mouth and hands.

It makes sense to have different criteria for what "good" means at each stage.


Not questioning the veracity here (I'm under the same impression), but curious if you have any sources on this; it's so hard to find information on that kind of thing...


Yes, sources on this are notoriously hard to come by, especially since the use of apprentices was an open secret. I did find this source, which talks about Rubens and Rembrandt:

"The structure of the work was usually as following: Rubens sketched and corrected the painting at the very end, and his "staff" was given the whole main stage. To increase the speed and efficiency of work, Rubens shared the duties: some of the pupils painted the background, others were focused on details — they worked on foliage or clothing, the master himself corrected the whole work and execute the most "important" parts — hands and faces."

Link: https://arthive.com/publications/2854~Rembrandt_the_teacher_...


Not the same thing, but this is exactly how manga is made, and you can probably find a lot more about that.


I'd recommend talking to any museum's curator of the old masters during a guided tour; they will be able to tell you a lot about it. Sometimes which parts were "workshop" and which were the master's are also visually identifiable by a schooled eye.


Hardly something anyone will do out of curiosity from reading a forum thread.


The two-stage model sounds a lot like the "Hires fix" checkbox in the Automatic1111 interface. If you enable that, it will generate an image, and then generate a scaled up image based on the first image. You can do the same thing yourself by sending an image to the "Image to Image" tab and then upscaling. If you do it that way, you also have the option of swapping the image model or adding LoRAs.

Presumably the two parts of the SDXL model are complementary: a first pass that's an expert on overall composition, and a second pass that's an expert on details.
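
If you want to play with that split explicitly, the 0.9 research release ships the two stages as separate checkpoints. Here's roughly how they chain together with diffusers; treat the model ids and arguments as a best-guess sketch from the docs, not the definitive recipe:

    import torch
    from diffusers import DiffusionPipeline

    # Base model: does most of the denoising; we keep its output in latent space.
    base = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16
    ).to("cuda")

    # Refiner: an img2img-style pipeline specialized for the final low-noise steps.
    refiner = DiffusionPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-refiner-0.9", torch_dtype=torch.float16
    ).to("cuda")

    prompt = "a portrait photo of an astronaut, 85mm, shallow depth of field"

    # Stage 1: overall composition, handed off as latents instead of pixels.
    latents = base(prompt=prompt, output_type="latent").images

    # Stage 2: the refiner polishes textures and small details on those latents.
    image = refiner(prompt=prompt, image=latents).images[0]
    image.save("astronaut.png")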


It's not quite the same: the Hires fix in Auto1111 uses the same model twice, while SDXL uses a different model for each stage.


cool



