• Earlier SD versions would often generate images where the head or feet of the subject was cropped out of frame. This was because random cropping was applied to its training data as data augmentation, so it learned to make images that looked randomly cropped — not ideal! To fix this issue, they still used random cropping during training, but also gave the crop coordinates to the model so that it would know it was training on a cropped image. Then, they set those crop coordinates to 0 at test-time, and the model keeps the subject centered! They also did a similar thing with the pixel dimensions of the image, so that the model can learn to operate at different "DPI" ranges.
• They're using a two-stage model instead of a single monolithic model. They have one model trained to get the image "most of the way there" and a second model to take the output of the first and refine it, fixing textures and small details. Sort of mixture-of-experts-y. It makes sense that different "skillsets" would be required for the different stages of the denoising process, so it's reasonable to train separate models for each of the stages. Raises the question of whether unrolling the process further might yield more improvements — maybe a 3- or 4-stage model next?
• Maybe I missed it, but I don't see in the paper whether the images they're showing come from the raw model or the RLHF-tuned variant. SDXL has been available for DreamStudio users to play with since April, and Emad indicated that the reason for this was to collect tons of human preference data. He also said that when the full SDXL 1.0 release happens later this month, both RLHF'd and non-RLHF'd variants of the weights will be available for download. I look forward to seeing detailed comparisons between the two.
• They removed the lowest-resolution level of the U-Net — the 8x downsample block — which makes sense to me. I don't think there's really that much benefit from wasting flops on a tiny 4x4 or 8x8 latent tbh. Also thought it was interesting that they got rid of the cross-attention on the highest-resolution level of the U-Net.
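To make the crop/size conditioning in the first bullet concrete, here is a minimal sketch of how it could be wired up. The module name, dimensions, and embedding details are my own assumptions; what the paper describes is embedding the conditioning values with Fourier features and adding them to the timestep embedding.

```python
# Hedged sketch of SDXL-style micro-conditioning: original size and crop offsets
# are embedded like timesteps and folded into the time embedding, so setting the
# crop to (0, 0) at inference asks for an "uncropped" framing.
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(x: torch.Tensor, dim: int = 256) -> torch.Tensor:
    """Transformer-style sinusoidal embedding of a batch of scalar values."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=torch.float32) / half)
    args = x.float()[:, None] * freqs[None, :]
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)

class MicroConditioner(nn.Module):
    """Embeds (orig_h, orig_w, crop_top, crop_left) and projects to the time-embedding size."""
    def __init__(self, time_embed_dim: int = 1280, per_value_dim: int = 256):
        super().__init__()
        self.per_value_dim = per_value_dim
        self.proj = nn.Sequential(
            nn.Linear(4 * per_value_dim, time_embed_dim),
            nn.SiLU(),
            nn.Linear(time_embed_dim, time_embed_dim),
        )

    def forward(self, orig_size: torch.Tensor, crop_coords: torch.Tensor) -> torch.Tensor:
        # orig_size, crop_coords: (B, 2) tensors of pixel values.
        values = torch.cat([orig_size, crop_coords], dim=-1)  # (B, 4)
        embs = [sinusoidal_embedding(values[:, i], self.per_value_dim) for i in range(4)]
        return self.proj(torch.cat(embs, dim=-1))  # added to the timestep embedding

# Training: pass the true original size and the sampled crop offsets.
# Inference: pass the target size and crop_coords = (0, 0) for a centered subject.
```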
The two-stage process using different skill sets reminds me of the old painting masters. Often the master would come up with the overall composition, which requires a lot of creativity and judgement of forms; apprentice painters would then come in and paint clothes, trees, or whatever; and then the master would finish off the critical details such as eyes, mouths, and hands.
It makes sense to have different criteria for what "good" is at each stage.
Not questioning the veracity here (I'm under the same impression), but curious if you have any sources on this; it's so hard to find information on that kind of thing...
Yes, sources on this are notoriously difficult to find, especially since the use of apprentices was an open secret. I did find this source, which talks about Rubens and Rembrandt:
"The structure of the work was usually as following: Rubens sketched and corrected the painting at the very end, and his "staff" was given the whole main stage. To increase the speed and efficiency of work, Rubens shared the duties: some of the pupils painted the background, others were focused on details — they worked on foliage or clothing, the master himself corrected the whole work and execute the most "important" parts — hands and faces."
I'd recommend talking to any museum's curator of the old masters during a guided tour and they will be able to tell you a lot about it.
Sometimes which parts were "workshop" and which were the master's own are also visually identifiable to a schooled eye.
The two-stage model sounds a lot like the "Hires fix" checkbox in the Automatic1111 interface. If you enable that, it will generate an image, and then generate a scaled up image based on the first image. You can do the same thing yourself by sending an image to the "Image to Image" tab and then upscaling. If you do it that way, you also have the option of swapping the image model or adding LoRAs.
Presumably the two parts of the SDXL model are complementary: a first pass that's an expert on overall composition, and a second pass that's an expert on details.
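If it helps, this is roughly how the two passes are exposed in the diffusers library. A hedged sketch: the model IDs assume the 1.0 release, and handing the base model's latents straight to the refiner's img2img pipeline is the pattern the diffusers docs describe, much like the "Hires fix"/img2img workflow mentioned above.

```python
import torch
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline

# Base pass: overall composition; keep the output as latents to hand to the refiner.
base = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-refiner-1.0", torch_dtype=torch.float16
).to("cuda")

prompt = "a watercolor painting of a lighthouse at dusk"
latents = base(prompt=prompt, output_type="latent").images

# Refiner pass: img2img on the base latents to fix textures and small details.
image = refiner(prompt=prompt, image=latents).images[0]
image.save("lighthouse.png")
```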
“Images generated with our code use the invisible-watermark library to embed an invisible watermark into the model output. We also provide a script to easily detect that watermark.”
Interesting - if it can be detected then it can be removed, right?
There has been some talk of various countries requiring AI images to be watermarked. They may be trying to signal their willingness to comply, and hopefully ensure they are not outright banned. The fact that it can be bypassed isn't necessarily their problem to fix.
Marked images can be excluded from training data in the future, so that the model doesn't eat its own output, which is known to cause model quality to degenerate (an issue likely to plague the training of LLMs on post-2022 datasets).
As for training the model to add the watermark itself, that's an interesting idea, but I'm not sure it's feasible, or that it wouldn't be trivial to circumvent.
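For context on what that filtering could look like: the invisible-watermark package named in the quote up-thread ships both an encoder and a decoder, so a data pipeline could drop anything that decodes to a known payload. A rough sketch; the "dwtDct" method and the 4-byte payload here are assumptions, not necessarily what SDXL actually embeds.

```python
# Hedged sketch: drop candidate training images that carry an invisible-watermark
# payload. The "dwtDct" method and the 4-byte payload are assumptions; the actual
# bits SDXL embeds are whatever Stability's generation scripts set.
import cv2
from imwatermark import WatermarkDecoder

def looks_watermarked(path: str, expected: bytes = b"SDV2") -> bool:
    bgr = cv2.imread(path)
    if bgr is None:
        return False
    decoder = WatermarkDecoder("bytes", 8 * len(expected))
    try:
        return decoder.decode(bgr, "dwtDct") == expected
    except Exception:
        return False

# Usage: keep = [p for p in candidate_paths if not looks_watermarked(p)]
```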
What exactly is in the watermark? Other image metadata is commonly used to track people, either on its own or in combination with a few other hints.
If it's strictly a primitive watermark with absolutely nothing unique per mark, then it might be no big deal.
I see no reason to disable it. It's fast, and it's no more personal than basic metadata like image format and resolution... I can't think of any use case for disabling it other than passing AI-generated images off as authentic.
Stability, Midjourney, Dall-e, etc don’t want their products to be used for misinformation. They really do want AI generated images to be trivially identifiable.
There are papers around describing how to bake a watermark into a model. Don’t know if anyone is doing that yet.
Given the models' propensity to fairly regularly reproduce gettyimages watermarks, I'd imagine putting a mark in the same place and at the same size on all the training images would teach the model that the mark must be there.
Sure, but that can be said about any security measure. If we agree that passing AI generated art as human-made or as a photograph is bad, then making that harder to do is worthwhile.
We’ll almost certainly see models trained to produce watermarks directly in the next few months. A highly diffuse statistical abnormality would be nearly impossible to detect or remove.
And as soon as such models appear, methods will appear to process images so as to destroy such "statistical anomalies" without changing the visible appearance.
Depends on the watermark. The current watermark uses a DCT to add a bit of undetectable noise around the edges. It survives resizing but tends to fail after rotation or cropping.
A robust rotationally invariant repeating-but-not-obviously micro watermark would be quite a neat trick.
An orthogonally rotational invariant watermark would cover 90%+ of real world image transforms - think flips, mirrors, and 90 degree rotations - and would probably be much easier to create.
Watermarks can be made to be pretty robust, though having an open source detection routine will make it much easier to remove (if nothing else, if you can differentiate the detection algorithm then you can basically just optimise it away). The kind of watermarking that is e.g. used for 'traitor tracking' tends to rely on at least a symmetric key being secret, if not the techniques used as a whole (which can include more specific changes to the content that are carefully varied between different receivers of the watermarked information, like slightly different cuts in a film, for example).
Microstates vs macrostates. Doesn't it depend on the basis for the encoding and how many bits it is? We usually think of a watermark as something in the pixel plane, but a watermark can be in the frequency spectrum, or in other resilient steganographic (clever) bases.
Simply running a cycle of "ok" quality jpeg compression can completely devastate information encoded at higher frequencies.
Quantization & subsampling do a hell of a job at getting "unwanted" information gone. If a human cannot perceive the watermark, then processes that aggressively approach this threshold of perception will potentially succeed at removing it.
At this point the watermark can be meaningful content like "every time there's a bird there is also a cirrus cloud", or "blades of grass lean slightly further to the left than a natural distribution".
Because our main interest is this meaningful content, it will be harder to scrub from the image.
That would be indistinguishable from a model that was also trained on that output, wouldn't it?
It seems much more likely that it's their solution to detect and filter AI images from being used in their training corpus - kind of a latent "robots.txt".
Not seeing anything about the dataset; are they still using LAION? There's no mention of LAION in the paper and the results look quite different from 1.5, so I'm guessing no.
> the model may encounter challenges when synthesizing intricate structures, such as human hands
I think there are two main reasons for poor hands/text:
- Humans care about certain areas of the image more than others, giving high saliency to faces, hands, body shape, etc., and lower saliency to backgrounds and textures. Due to the way the U-Net is trained, it cares about all areas of the image equally. This means model capacity per area is uniform, leading to capacity problems for objects with a large number of configurations that humans care more about.
- The sampling procedure implicitly assumes a uniform amount of variance over the entire image. Text glyphs basically never change, which means we should basically have infinite CFG in the parts of the image that contain text.
I'm not sure if there's any point in working on this though, since both can be fixed by simply making a bigger model.
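For readers not familiar with the knob referenced in the second point: classifier-free guidance combines the conditional and unconditional noise predictions with a single scalar scale over the whole image. The spatially varying variant below is purely an illustration of that idea, not anything SDXL actually does.

```python
# Standard classifier-free guidance: one scalar scale for the entire image.
# eps_cond / eps_uncond are the UNet's noise predictions with and without the prompt.
def cfg(eps_uncond, eps_cond, scale=7.5):
    return eps_uncond + scale * (eps_cond - eps_uncond)

# Purely illustrative: a per-pixel scale map, e.g. cranked up inside regions
# believed to contain text glyphs (hypothetical; not part of SDXL's sampler).
def spatial_cfg(eps_uncond, eps_cond, scale_map):
    # scale_map broadcasts over channels, e.g. shape (B, 1, H, W).
    return eps_uncond + scale_map * (eps_cond - eps_uncond)
```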
From hanging out around the LAION Discord server a bunch over the past few months, I've gathered that they're still using LAION-5B in some capacity, but they've done a bunch of filtering on it to remove low-quality samples. I believe Emad tweeted something to this effect at some point, too, but I can't find the tweet right now.
Are any of these txt2img models being partially trained on synthetic datasets? Automatically rendering tens of thousands of images with different textures, backgrounds, camera poses, etc should be trivial with a handful of human models, or text using different fonts.
> At inference time, a user can then set the desired apparent resolution of the image via this size-conditioning.
This is really cool.
So is the difference between the first and second stage. The output of the first stage looks good enough for generating "drafts" which can then be picked by the user before being finished by the second stage.
The improved cropping data augmentation is also a big deal. Bad framing is a constant issue with SD 1.5.
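For anyone who wants to try that knob: the diffusers SDXL pipeline exposes the size/crop conditioning as call arguments. Parameter names are taken from the diffusers documentation and the model ID assumes the 1.0 release; the values below are arbitrary examples.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# original_size tells the model what apparent resolution ("DPI") to emulate;
# crops_coords_top_left=(0, 0) asks for an uncropped-looking framing.
image = pipe(
    prompt="portrait photo of a hiker, full body visible",
    original_size=(4096, 4096),
    target_size=(1024, 1024),
    crops_coords_top_left=(0, 0),
).images[0]
```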
The report does not detail hardware -- though it states that SDXL has 2.6B parameters in its UNet component, compared to SD 1.4/1.5 with 860M and SD 2.0/2.1 with 865M. So SDXL has roughly 3x more UNet parameters.
In January, MosaicML claimed a model comparable to Stable Diffusion V2 could be trained with 79,000 A100-hours in 13 days.
Some sort of inference can be made from this information; I'd be interested to hear someone with more insight here provide perspective.
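One very naive back-of-the-envelope from the numbers above: if training cost scaled linearly with UNet parameters, MosaicML's 79,000 A100-hours times the roughly 3x parameter ratio (2.6B / 0.865B) would land around 240,000 A100-hours for the base model. Treat that as a loose guess only; the real cost also depends heavily on training resolution, batch size, total steps, and the separate refiner model.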
bitsandbytes is only used during training with these models, though (the 8-bit AdamW). Quantizing the weights and the activations to a range of 256 values, when the model needs to output a range of 256 values, creates noticeable artifacts, as they are not going to map 1-to-1.
Draw Things recently released an 8-bit quantized SD model that has output comparable to the FP16 version. It uses a k-means-based LUT and separates the weights into blocks to minimize quantization errors.
I was going to search the internet about it, but then I realized you are the author (and I don't think there is anything online). I imagine that the activations are left in FP16 and the weights are converted to FP16 during inference, right?
Yes, computation is carried out in FP16 (so there are no compute efficiency gains, though there might be latency reductions from the memory-bandwidth savings). These savings are not realized yet because no custom kernels have been introduced.
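To make the k-means LUT scheme concrete, here is a minimal numpy/scikit-learn sketch of block-wise palette quantization. The block size, k=256, and the use of sklearn's KMeans are my assumptions rather than Draw Things' actual implementation.

```python
# Hedged sketch of block-wise k-means "LUT" quantization: each block of weights
# gets its own 256-entry palette, and every weight is stored as an 8-bit index.
import numpy as np
from sklearn.cluster import KMeans

def quantize_block(block: np.ndarray, k: int = 256):
    """Cluster a flat block of weights into k centroids; store 8-bit indices + LUT."""
    km = KMeans(n_clusters=k, n_init=1, random_state=0).fit(block.reshape(-1, 1))
    lut = km.cluster_centers_.astype(np.float16).ravel()  # 256-entry lookup table
    idx = km.labels_.astype(np.uint8)                     # one byte per weight
    return idx, lut

def dequantize_block(idx: np.ndarray, lut: np.ndarray) -> np.ndarray:
    return lut[idx]  # back to FP16 for compute

weights = np.random.randn(4096 * 4).astype(np.float32)    # stand-in weight tensor
blocks = weights.reshape(-1, 4096)                        # per-block LUTs limit error
quantized = [quantize_block(b) for b in blocks]
restored = np.concatenate([dequantize_block(i, l) for i, l in quantized])
print("max abs error:", np.abs(restored - weights.reshape(-1)).max())
```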