
The reason you can't get the images you want from it is not the noise diffusion process (after all, that's probably the closest analogue to how a human gets a flash of creativity) but the lack of a large language model in SD - the text encoder was deliberately scaled down so the result could fit on consumer GPUs.

DALLE-2 uses a much larger language model and you can explain more complicated concepts to it. Google's Imagen likewise (not released though).

It's mostly a matter of scaling to get this better.



It's not just size but also model architecture. DALLE mini (craiyon.com) has the opposite priority because of its different architecture; you can enter a complex prompt and it will follow it, but it's much slower and the image quality is a lot worse. SD prefers to make aesthetic pictures over listening to everything you tell it.

You can improve this in SD by raising cfg_scale, at the cost of some weird "oversharpening" artifacts. Or, you can make a crappy image in DALLE mini and use that as the img2img input for SD to make it prettier.
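Roughly, with the Hugging Face diffusers library it's something like the sketch below - the checkpoint id, file names, and parameter values are just illustrative, not anyone's exact setup:

```python
# Minimal sketch of the "rough DALLE-mini image -> SD img2img" trick,
# using diffusers. Model id and file names are placeholders.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # assumed SD checkpoint
    torch_dtype=torch.float16,
).to("cuda")

# Rough composition saved from DALLE mini / craiyon (hypothetical file name).
init_image = Image.open("craiyon_output.png").convert("RGB").resize((512, 512))

result = pipe(
    prompt="a red cube balanced on top of a blue sphere, studio lighting",
    image=init_image,
    strength=0.6,         # how much SD is allowed to repaint the init image
    guidance_scale=12.0,  # cfg_scale: higher follows the prompt more, but can oversharpen
).images[0]
result.save("refined.png")
```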

The real sign it's lacking intelligence: if you ask it a question, it won't draw the answer, it'll just draw the question. Of course, they could fix that too; it's got a GPT in it, they just don't let it recurse…


Yeah true, I like dalle-mini :) It did seem to understand the prompts better.

The training set also affects it: the guidance signal (controlled by cfg_scale) competes with the priors the diffusion model learned from the training set, and I've found situations where those priors seem to be encoded too strongly - for example, with very well-known celebs or objects it's difficult to make variations.
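For what it's worth, that competition is visible in how classifier-free guidance is usually implemented in diffusion samplers (a generic sketch, not SD's exact code): the denoiser runs once without the text conditioning and once with it, and cfg_scale scales how far the prediction is pushed away from the unconditional prior.

```python
# Generic classifier-free guidance combination, as commonly implemented.
def guided_noise_pred(eps_uncond, eps_cond, cfg_scale):
    # cfg_scale = 1.0 -> pure conditional prediction.
    # Larger values push further from the unconditional prior, which is why
    # strongly-encoded priors (famous faces, iconic objects) resist variation.
    return eps_uncond + cfg_scale * (eps_cond - eps_uncond)
```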

I guess it's interesting that these issues are kind of reflected in humans as well.



