Raising the cfg ("classifier-free guidance") scale is essential for following the prompt, but if you raise it too high the image gets weird and saturated.
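According to Google's Imagen paper, this happens because a high cfg scale pushes the predicted pixel values outside the [-1, 1] range the model was trained on, so they start clipping. Their fix is a technique called dynamic thresholding, which replaces the usual hard clip to [-1, 1]: at each sampling step they pick a high percentile of the absolute pixel values, clip to that, and rescale back into range. Not sure if SD uses this, but I saw Emad hinting they were training an Imagen model…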
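
For anyone curious, here is a rough NumPy sketch of the idea as I understand it from the Imagen paper. It is not taken from any actual codebase: the function names are made up for illustration, and the 99.5 percentile default is just an assumed hyperparameter. It shows the guidance step, the plain hard clip, and dynamic thresholding applied to a predicted clean image `x0`.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, scale):
    # Classifier-free guidance: push the prediction away from the
    # unconditional output, toward the conditional one, by `scale`.
    return eps_uncond + scale * (eps_cond - eps_uncond)

def static_threshold(x0):
    # Plain hard clip to the training range [-1, 1].
    # With a high guidance scale many pixels end up pinned at +/-1.
    return np.clip(x0, -1.0, 1.0)

def dynamic_threshold(x0, percentile=99.5):
    # `percentile` is an assumed hyperparameter, not a value from this thread.
    # s is a high percentile of the absolute pixel values in x0.
    s = float(np.percentile(np.abs(x0), percentile))
    # Only act when values actually exceed the [-1, 1] range.
    s = max(s, 1.0)
    # Clip to [-s, s], then rescale back into [-1, 1].
    return np.clip(x0, -s, s) / s

x0 = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(static_threshold(x0))
print(dynamic_threshold(x0, percentile=100.0))
```

The difference is that the hard clip flattens everything past the limit, while the dynamic version shrinks the whole image so that fewer pixels sit exactly at the extremes. In a real sampler this would run on the predicted `x0` at every step, per image.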