How to generate realistic people in Stable Diffusion (stable-diffusion-art.com)
120 points by m0wer 9 months ago | 78 comments



I might be going around in the wrong social circles, but none of the people I know look anything like the realistic people in these images. Are these models even able to generate pictures of actual normal everyday people instead of glossy photo models and celebrity lookalikes?


> Are these models even able to generate pictures of actual normal everyday people instead of glossy photo models and celebrity lookalikes?

Yes.

For example, take a look at this LoRA which is one of my favorites: https://civitai.com/models/259627/bad-quality-lora-or-sdxl

This, combined with a suitable model and proper prompting, will give you photos of people who actually look like real people.
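
For example, with diffusers you could load it roughly like this (a minimal sketch; the checkpoint and the LoRA filename are placeholders, use whatever photoreal SDXL model and file you actually downloaded):

    import torch
    from diffusers import StableDiffusionXLPipeline

    # Any photoreal SDXL checkpoint works; the base model is just a placeholder here.
    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")

    # Path to the downloaded "bad quality" LoRA file (hypothetical filename).
    pipe.load_lora_weights("./loras/bad_quality_sdxl.safetensors")

    image = pipe(
        prompt="candid amateur photo of an ordinary middle-aged man waiting for a bus",
        negative_prompt="glossy, studio lighting, airbrushed, professional model",
        num_inference_steps=30,
        guidance_scale=6.0,
    ).images[0]
    image.save("ordinary_person.png")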


Could it be because conventionally beautiful people are photographed more often and so there’s just more training data?


Also because models in photographs are symmetrical, emotionless, softly-lit, and have perfect skin. Things like age lines, wrinkles, scars, emotional expression, deep shadows and asymmetry require actual understanding of human anatomy to draw convincingly.


Yet gen-AI image generators have no understanding of anything and still draw human anatomy convincingly most of the time. In other words, you're wrong: AI does not need anatomy knowledge, that's not how any of this works. It just needs enough training data.


> AI does not need anatomy knowledge, that's not how any of this works.

Surely someone has done a paired kinematics model to filter results by this point?

Not my field, but I figured 11-fingered people were just because it was computationally cheaper to have the ape on the other side of the keyboard hit refresh until happy.


There are "normal diffusion" models that create average people with flaws in the style of a low grade consumer camera. They're kind of unsettling because they don't have the same uncanny valley as the typical supermodel photographed with a $10k camera look, but there is still weirdness around the fringes.


Try making a picture of people working in an office with any diffusion model.

It looks like the stock photo cover for that mandatory course you hated.

Even adding keywords like “everyday” doesn’t help. And I fear it's going to be worse in a few years, when this stuff constitutes the majority of the input.


If you prompt the model in a way that pushes it towards low-quality amateur photography, the results tend to be more realistic and less glossy:

"90s, single use camera, documentary, of anoffice worker in an open plan office, realistic, amateur photo, blurry"

Results: https://imgur.com/a/GJLqYft
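
Roughly, in a plain diffusers pipeline, that's (a sketch; vanilla SD 1.5 and an arbitrary seed here, your checkpoint and sampler settings will differ):

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    prompt = ("90s, single use camera, documentary, of an office worker "
              "in an open plan office, realistic, amateur photo, blurry")
    generator = torch.Generator("cuda").manual_seed(42)  # arbitrary seed for reproducibility
    image = pipe(prompt, num_inference_steps=30, generator=generator).images[0]
    image.save("office_worker.png")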


> Results: https://imgur.com/a/GJLqYft

Corporate accounts payable, Nina speaking. Just a moooment. https://m.youtube.com/watch?v=4s5yHUpumkY


Is he squinting or am I?


only a prompter would think that looks realistic


Oh a "prompter" is it? No, I'm not a prompter, but if you want to interface with a GenAI model, a prompt is kind of the way to do it. No need to be a salty about it.


Yes, however, for the purposes of a demo article like this, it's significantly easier to use famous people who are essentially baked into the model. You encapsulate all of their specific details in a keyword that's easily reusable across a variety of prompts.

By contrast, a "normal" random person is very easy to generate, but very difficult to keep consistent across scenes.


I know. And these model outputs are so messed up, it's too much: penis fingers, vacant cross-eyed stares. This has got to be fine. "Trained on explicit images". It kills me every time. TFA is so emphatic, I think the author is hallucinating as badly as the models are.


Also clothing zippers and seams that end nonsensically.


Not really. I'm in my sixties, and if you want images of older people, it's surprisingly difficult to get round the biases these models have towards young, perfect people.

Try the single word prompt 'woman' and see what you get...


> biases these models have

Have larger diffusion models gotten to synthetic training dogfooding yet?

The irony is that once we get there, we can address biases in historical data. I.e. having a training set that matches reality vs images that were captured and available ~2020.


The larger base models do an excellent job of aging. Try asking for ages in 5-year increments and you'll see a clear progression (some of it caricatured, of course), e.g. “55 year old woman” vs “woman”.
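
Something like this makes the comparison easy: fix the seed and sweep only the age (a sketch against the SDXL base model, not any particular fine-tune):

    import torch
    from diffusers import StableDiffusionXLPipeline

    pipe = StableDiffusionXLPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
    ).to("cuda")

    for age in range(25, 76, 5):
        # Re-seed each time so the only variable is the age in the prompt.
        generator = torch.Generator("cuda").manual_seed(0)
        image = pipe(f"photo portrait of a {age} year old woman",
                     generator=generator).images[0]
        image.save(f"woman_{age}.png")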


so put "old woman" then.


Yeah, I mean it's not great that the models are biased towards a certain subset of "woman" (usually young, pretty, white, etc.), but you can just describe what you want to see and push the model to give it to you. Yes, sometimes it's a bit of a fight, but it's doable.


Generating fake portrait photos seems kind of boring.

Wouldn't these kinds of negative prompts and tweaking break down if I wanted to plug in more varied descriptions of people?

I find it interesting to plug in colorful descriptions of a person's traits from a novel, for example, or of people actually doing something.

Using "ugly", "disfigured" as negative prompt probably wouldn't work then...

For the pictures in the article, my first association is someone generating romance scam profile pictures, not art.


It’ll be really cool to use entire books as prompts to generate visual representations of the characters in them.


Try using the "I Can't Believe It's Not Photography" model. Instead of trying to micro-manage the details, use strong emotional terms. I've had good results with prompts along the lines of "Aroused angry feral Asian woman wearing a crop top riding a motorcycle fast in monsoon rain."[1]

[1] https://i.ibb.co/3zHGyrR/feral34.png


RealvisXL is much better, probably the best model for photorealism. Example with same prompt: https://gencdn.aieasypic.com/original/ed7c2742-4a76-4e22-9bb...

It’s currently the best open weights model for prompt adherence too: https://imgsys.org/


> It’s currently the best open weights model for prompt adherence too: https://imgsys.org/

This doesn't specifically measure prompt adherence, just how good the images are overall.


Is it also based on SD?


Pro tip: if you see a model with XL on the end, it's based on SDXL, which is a Stable Diffusion model.


That looks incredibly fake, to be fair.


The rain is heavy enough to be coming off her body in sheets, but not heavy enough to have plastered her hair onto her head yet.


The bike appears to have 2 front brake levers, only one fork, and she is holding the left handlebar the wrong side of the switches.


And you would say this is realistic?


If a person had drawn or painted it by hand, or put it together in Blender, I would say they were aiming for a realistic style, and they'd done an impressive job.

Sure the motorbike handlebar is messed up, as is one of the hands. And the torso doesn't quite match up with where the thigh is. And the reflections on the arms, the face and the cleavage all look differently lit. And the hair isn't behaving like wet hair. And the face has an uncanny-valley airbrushed look about it.

But those are all easily overlooked. Chuck this into a Facebook news feed and I think 70% of the general public would believe it was a photograph.


While I think the number is much smaller than 70%, you make a good point: Relative to a human artist it is pretty realistic, especially relative to a painter who doesn't use Photoshop or Blender.


It has life and energy. If you put static descriptive phrases into a Stable Diffusion system, you get a static, boring scene. Stable Diffusion can do better than that.


Some people do have six fingers.


Why does everything generated by SD seem to have this weird plasticky sheen to it? Is that a preference of people generating these or innate to the model?


Poor defaults, preferences and the prevalence of bad taste.

There are examples that don't do this, but they are harder to find (and prompt for).


Lossy compression tends to remove high frequency information first. That most obviously applies to audio, but you also see it in image codecs, generative AI, and LLMs, all of which are different forms of compression.


It seems to be a requirement to have a model trained on a large number of explicit images to generate correct anatomy.

While people have tried going from a base model to a fine-tuned model based on explicit images, I wonder if people are attempting to go the other way round (train a base model on explicit photographs and other images not involving humans, then fine-tune away the explicit parts), which might lead to better results?


You can usually remove the NSFW stuff with some negative prompting.

IMO finetuning is a waste of time unless you do cutting edge stuff.

Loras are just as powerful as a finetuned model and you can train one in minutes even on consumer hardware.
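
As a sketch of the negative-prompt part (the checkpoint and prompts are just placeholders, not a specific recommendation):

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    image = pipe(
        prompt="full body photo of a woman walking in a park, summer dress",
        # The negative prompt steers generation away from NSFW output and bad anatomy.
        negative_prompt="nude, nsfw, deformed hands, extra fingers",
        num_inference_steps=30,
    ).images[0]
    image.save("sfw_portrait.png")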


> Loras are just as powerful as a finetuned model and you can train one in minutes even on consumer hardware.

Do you have some more details on training a LoRA in minutes? Last I tried, it took several hours on an RTX 3090, but I am sure there have been improvements since then.


Not really, but there are lots of guides around, for example https://civitai.com/models/351583/sdxl-pony-fast-training-gu...

Tbh I mostly use loras others trained; there are hundreds around for all kinds of things.


Where does that idea come from? Models can generalise well - if you request photorealistic animals dressed in something specific, you'll get it even though there's likely no training image with that example. People wear enough close fitting clothes that the general form is easy to find.

I first heard the "we need naked images to generate good clothed images" when SD3 came out and suddenly it's everywhere. But it just doesn't make any sense to me and as far as I know it wasn't explicitly practiced in previous popular models.


Why would you do this?


Because when I’m generating photographs I don’t want grotesque hands melting into the other hand, or into the face, while still using a model that isn't meant for porn generation?


You seem to be equating nudity with pornography, though. I tend to generate images that are nudes first in order to get aspects such as lighting and environment correct. Then I may or may not add clothing.

Is this process objectionable to you? If so, why?

It’s also possible I’m simply misunderstanding what you’re objecting to.


I’m not “objecting” to anything; it’s just that companies putting out a model capable of generating nudity would invite controversy they’d rather stay out of.


> Caution - Nearly all of [the special-purpose models] are prone to generating explicit images. Use clothing terms like dress in the prompt and nude in the negative prompt to suppress them.

I like how even with all the "please don't make it porn" terms in the prompt, you can easily see (by choice of dresses, cleavage, pose, facial expressions etc) which models "want" to generate porn and are barely held back by the prompt.


I find with stable diffusion, images for a type of prompt seem to all show the same person. Add something like “mature” to the prompt and you get a different (but same) person for all those images, regardless of seed.

When you ask for prompts it hasn't seen in training data, the results start to look less realistic.

Have even seen adult video logos in generated images.

I strongly suspect AI is not what we think.


You thought AI companies weren't training on one of the biggest datasets of pictures of people out there? Especially when the owners of those photos are too small to really sue you over it.


Use SDXL instead of SD 1.5. Also, don't do negative prompting on "distorted, ugly", because everyone follows these guides blindly and the result is the same type of mode-collapsed face that people have learned to recognize as AI. Use loras instead (to make it look slightly amateurish, or just to move away from the cookie-cutter AI style everyone notices).


This is stable diffusion 1.5. https://civitai.com/ and https://huggingface.co/ suggest the popular options are SDXL based - it is a much better model (effectively SD 2.5). Still imperfect, but much better.


I find all those photos generated by Stable Diffusion to be kind of repetitious and boring.

Eking out something "interesting" is difficult, especially with limited time and low-end hardware. Interesting is highly subjective of course. I tend towards the more artistic / surrealist style, usually NSFW. Only nudes, no pornography.

I've been experimenting these last few months with generating interesting images, trying to make them "artistic" rather than photo-realistic, or the usual bland anime tributes.

I usually pick a "classical" artist who already has nudes in their repertoire, and try to blend their style with some photos I take myself, and with the style of other artists.

Most fall flat, some come close to what I consider acceptable, but still have major flaws. However, due to my time and hardware constraints they're good enough to post. I use fooocus which is kind of limiting, but after trying and failing to produce satisfactory results with Automatic, fooocus is just what I needed.

I can't really understand why more people don't do the same. Stable Diffusion was trained on a long and diverse list of artists, but most people seem to disregard that and focus only on anime or realistic photographs. The internet is inundated with those. I'm following some people on Mastodon who post more interesting stuff, but they usually tend to be all same-ish. I try to produce more diverse stuff, but most of the time it feels like going against the grain.

The women still tend to look like unrealistic supermodels. Sometimes this is what I want. Sometimes not, and it takes many tweaks to make them normal women, and usually I can't spare the time. Which is unfortunate.

If anyone's interested, I post the somewhat better experiments in:

https://mastodon.social/@TheNudeSurrealist

Warning: Most are NSFW. But are NSFW in the way Titian's Venus, say, is NSFW.


I like how the body shapes of most of the nudes are more 'real' looking as opposed to what you typically find. Well done.

For others like me who aren't familiar with fooocus, there are a lot of AI-related sites with that name. I believe this is the one the parent is referring to:

https://github.com/lllyasviel/Fooocus


Yes, that's the one.


Impressive breakdown, but this is six months old.


It still holds up very well, almost all the techniques are still very much in use.

Edit: except hypernetworks, I don’t think they are still relevant for most users.


Yeah Loras killed that


If your goal is to generate content for a fully fictional celebrity magazine, this article will help you.

How come this technology appears to be exclusively used to generate fake pictures of unrealistically good-looking women? And to what end..?


I use it to generate art when building new characters in the Pathfinder group I'm in. In the past, I'd build a backstory for a character, but it would still take a while for them to become more well-defined and for me to have a clear image of them in my mind.

With AI image generation, I can start with the broad brush strokes for the character and then use AI to generate an image based on those prompts which can then help me further define the character to the point that it already feels fleshed out and real before I have even started the game.

I'm pretty pleased with what Copilot (using DALLE-3) spit out for my newest character, a Gothic-themed forensic medical investigator: https://imgur.com/a/bXHeqAX


Popular use cases include advertisements for restaurants, hotels, and everyday products.


Step 1: Don’t use Stability’s latest model


It’s unintuitive but I still like the 1.6 models - they just have more variety and seem more “creative”

Of course I have a battery of dozens of techniques and addons to improve the 1.6 models


What are some of the most helpful techniques and add-ons that you use?


Loras in general have been a game changer. I especially like the VantaBlack Loras.

Also controlnet is so useful.

I also generate 512x512 with 1.6 and then upscale to 1024 with iterative upscaling - adds a ton of detail. Then I can easily upscale to 4K.

If I do iterative upscale to 4K right away it takes like 20h on my m1. But that adds even more details.

And there are negative embeddings which are great, e.g. badhandsv2.

So those 4 have been the most impactful for me
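
The 512 -> 1024 step boils down to something like this in diffusers (a rough sketch, not my exact iterative-upscale setup; the checkpoint here is just a placeholder):

    import torch
    from diffusers import StableDiffusionPipeline, StableDiffusionImg2ImgPipeline

    base = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    prompt = "photo portrait of an elderly fisherman, natural light"
    small = base(prompt, height=512, width=512).images[0]

    # Reuse the same weights for img2img, upscale the image, then let the model
    # re-add detail at the higher resolution with a low denoising strength.
    img2img = StableDiffusionImg2ImgPipeline(**base.components)
    upscaled = small.resize((1024, 1024))
    detailed = img2img(prompt=prompt, image=upscaled, strength=0.35).images[0]
    detailed.save("portrait_1024.png")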


As someone out of the loop, why?


Whatever they did to censor it broke its ability to generate SFW pictures of humans.

More likely to generate 7 armed nightmare fuel monsters than a bag of plutonium.


Slow and heavily censored.


Here it may be more reasonable to actually pay some money for commercial models that are far ahead of Stable Diffusion in terms of image quality and prompt understanding. Like Dall-E 3, Imagen 2 (Imagen 3 comes out soon), or Midjourney. The gap between free and commercial diffusion models seems to be larger than the gap between free and commercial LLMs.


Wanted to give it a try just for fun, using the same prompts, base model and parameters (as far as I can tell), and the first 5 images that were created... will probably haunt me in my dreams tonight.

I don't know if it was me misconfiguring it, or if the images in the post were really cherry-picked.


Scrolling through the article, the pictures look no more realistic as it goes on.

You need to simulate poor lighting, dirt, soul, realistic beauty etc. Perhaps even situations that give a reason for a photo to be taken other than I’m a basic heteronormative woman who is attractive.


The images generated by this guy are nowhere close to realistic. The resolution he's using is terrible for getting realistic faces. Most people with better GPUs get way better results whilst using 10% of the tricks from the article.


It's kinda telling when the author says (e.g. about the "Realistic Vision v2" model) that "the anatomy is excellent [...]" when this is obviously not the case.

In fact, it isn't the case for a single image in that blog post.

If you have a trained eye, that is.


I have a hard time believing that the huge prompt they used at the end (before img2img) will fit in a diffusers prompt. I noticed that after 75 tokens or so, it just chops off the prompt and runs with whatever didn't get cut.
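
You can check it by running the prompt through the same CLIP tokenizer SD 1.5 uses (the 77-token window includes the start/end tokens, hence roughly 75 usable tokens):

    from transformers import CLIPTokenizer

    tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
    prompt = "paste the long prompt from the article here"
    ids = tokenizer(prompt).input_ids  # includes BOS/EOS tokens
    limit = tokenizer.model_max_length  # 77 for this model
    print(len(ids), "tokens;", "truncated" if len(ids) > limit else "fits")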


I've spent a ton of time playing with Stable Diffusion, just for amusement. I've rarely found it interesting to generate realistic people.


Missing reference to dreambooth


Scary, that's all I can say.



