I might be going around in the wrong social circles, but none of the people I know look anything like the realistic people in these images. Are these models even able to generate pictures of actual normal everyday people instead of glossy photo models and celebrity lookalikes?
Also because models in photographs are symmetrical, emotionless, softly lit, and have skin free of imperfections. Things like age lines, wrinkles, scars, emotional expression, deep shadows, and asymmetry require an actual understanding of human anatomy to draw convincingly.
Yet generative AI image models have no understanding of anything, and they draw human anatomy convincingly very often. In other words, you're wrong. AI does not need anatomy knowledge; that's not how any of this works. It just needs enough training data.
> AI does not need anatomy knowledge, that's not how any of this works.
Surely someone has done a paired kinematics model to filter results by this point?
Not my field, but I figured 11-fingered people were just because it was computationally cheaper to have the ape on the other side of the keyboard hit refresh until happy.
There are "normal diffusion" models that create average people with flaws in the style of a low grade consumer camera. They're kind of unsettling because they don't have the same uncanny valley as the typical supermodel photographed with a $10k camera look, but there is still weirdness around the fringes.
Try making a picture of people working in an office with any diffusion model.
It looks like the stock photo cover for that mandatory course you hated.
Even adding keywords like “everyday” doesn’t help. And I fear it’s going to be worse in a few years, when this stuff constitutes the majority of the input.
Oh, a "prompter", is it? No, I'm not a prompter, but if you want to interface with a GenAI model, a prompt is kind of the way to do it. No need to be salty about it.
Yes, however, for the purposes of a demo article like this, it's significantly easier to use famous people who are essentially baked into the model. You encapsulate all of their specific details in a keyword that's easily reusable across a variety of prompts.
By contrast, a "normal" random person is very easy to generate, but very difficult to keep consistent across scenes.
I know. And these model outputs are so messed up, it's too much. Penis fingers, vacant cross-eyed stares. This has got to be fine. "Trained on explicit images" kills me every time. TFA is so emphatic, I think the author is hallucinating as badly as the models are.
Not really.
I'm in my sixties, and if you want images of older people, it's surprisingly difficult to get around the bias these models have toward young, perfect people.
Try the single word prompt 'woman' and see what you get...
Have larger diffusion models gotten to the point of dogfooding synthetic training data yet?
The irony is that once we get there, we can address the biases in historical data, i.e. have a training set that matches reality rather than whatever images happened to be captured and available circa 2020.
The larger base models do an excellent job of aging, try out asking for ages increased by 5 year increments, and you’ll see clear progression (some of it caricatured of course), e.g. “55 year old woman” vs “woman”.
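For anyone who wants to reproduce that comparison, here is a sketch with the diffusers library (the checkpoint, ages, and seed are illustrative, not anything the commenter specified; re-seeding each run keeps the composition comparable so only the age phrase drives the change):

```python
import torch
from diffusers import StableDiffusionPipeline

# Illustrative base checkpoint; any SD 1.5-class model works the same way
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

for age in (35, 45, 55, 65, 75):
    prompt = f"photograph of a {age} year old woman"
    # Same seed every iteration so only the age phrase differs between images
    generator = torch.Generator("cuda").manual_seed(1234)
    image = pipe(prompt, generator=generator).images[0]
    image.save(f"woman_{age}.png")
```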
Yeah, I mean it's not great that the models are biased around a certain subset of "woman" (usually young, pretty, white, etc.), but you can just describe what you want to see and push the model to give it to you. Yes, sometimes it's a bit of a fight, but it's doable.
Try using the "I Can't Believe It's Not Photography" model. Instead of trying to micro-manage the details, use strong emotional terms. I've had good results with prompts along the lines of "Aroused angry feral Asian woman wearing a crop top riding a motorcycle fast in monsoon rain."[1]
If a person had drawn or painted it by hand, or put it together in Blender, I would say they were aiming for a realistic style, and they'd done an impressive job.
Sure the motorbike handlebar is messed up, as is one of the hands. And the torso doesn't quite match up with where the thigh is. And the reflections on the arms, the face and the cleavage all look differently lit. And the hair isn't behaving like wet hair. And the face has an uncanny-valley airbrushed look about it.
But those are all easily overlooked. Chuck this into a Facebook news feed and I think 70% of the general public would believe it was a photograph.
While I think the number is much smaller than 70%, you make a good point: Relative to a human artist it is pretty realistic, especially relative to a painter who doesn't use Photoshop or Blender.
It has life and energy. If you put static descriptive phrases into a Stable Diffusion system, you get a static, boring scene. Stable Diffusion can do better than that.
Why does everything generated by SD seem to have this weird plasticky sheen to it?
Is that a preference of people generating these or innate to the model?
Lossy compression tends to remove high frequency information first. That most obviously applies to audio, but you also see it in image codecs, generative AI, and LLMs, all of which are different forms of compression.
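That intuition is easy to demonstrate in one dimension: a crude "codec" that keeps only the lowest frequency bins preserves a signal's smooth structure almost exactly while the fine detail vanishes. A minimal NumPy sketch (the signal and the cutoff are arbitrary choices for illustration):

```python
import numpy as np

# 1-D "image": a low-frequency trend plus high-frequency detail
t = np.linspace(0, 1, 256, endpoint=False)
signal = np.sin(2 * np.pi * 3 * t) + 0.3 * np.sin(2 * np.pi * 60 * t)

# Crude lossy compression: keep only the lowest 16 frequency bins
spectrum = np.fft.rfft(signal)
spectrum[16:] = 0
reconstructed = np.fft.irfft(spectrum, n=256)

# The slow 3 Hz component survives; the 60 Hz "texture" is gone entirely
low = np.sin(2 * np.pi * 3 * t)
print(np.max(np.abs(reconstructed - low)))  # ~0, up to float error
```

The analogy is loose, but it is why aggressively compressed outputs lose skin texture and stray hairs before they lose the overall shape of a face.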
It seems to be a requirement to have a model trained on a large number of explicit images to generate correct anatomy.
While people have tried going from a base model to a fine-tuned model based on explicit images, I wonder whether anyone is attempting the other direction (train a base model on explicit photographs and other images not involving humans, then fine-tune away the explicit parts), which might lead to better results?
> Loras are just as powerful as a finetuned model and you can train one in minutes even on consumer hardware.
Do you have some more details on training a LoRA in minutes? Last I tried, it took several hours on an RTX 3090, but I am sure there have been improvements since then.
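Part of the speed gap is just parameter count: a LoRA only learns two low-rank factors per targeted layer instead of a full weight update. A back-of-the-envelope sketch (the 768×768 projection size is an assumption typical of an SD 1.5 attention layer, and rank 8 is a common default):

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    # LoRA replaces a full (d_out x d_in) weight update with two
    # low-rank factors: B (d_out x rank) and A (rank x d_in)
    return rank * (d_in + d_out)

full_update = 768 * 768                      # one full projection update
lora_update = lora_param_count(768, 768, 8)  # rank-8 adapter on the same layer
print(full_update // lora_update)  # → 48, i.e. ~48x fewer trainable parameters
```

Actual wall-clock time still depends on resolution, step count, and the trainer, but the difference between hours and minutes mostly comes down to how few parameters (and which layers) you train.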
Where does that idea come from? Models can generalise well: if you request photorealistic animals dressed in something specific, you'll get it, even though there's likely no training image of that exact example. People wear enough close-fitting clothing that the general form is easy to learn.
I first heard the claim that "we need naked images to generate good clothed images" when SD3 came out, and suddenly it's everywhere. But it just doesn't make sense to me, and as far as I know it wasn't explicitly practiced in previous popular models.
Because when I’m generating photographs I don’t want to create something with grotesque hands that melt into the other hand, or into the face, while still using a model that is not meant for porn generation?
You seem to be equating nudity with pornography, though. I tend to generate images that are nudes first in order to get aspects such as lighting and environment correct. Then I may or may not add clothing.
Is this process objectionable to you? If so, why?
It’s also possible I’m simply misunderstanding what you’re objecting to.
I’m not “objecting” to anything, it’s just that companies putting out a model capable of generating nudity would invoke some controversy they’d rather be out of.
> Caution - Nearly all of [the special-purpose models] are prone to generating explicit images. Use clothing terms like dress in the prompt and nude in the negative prompt to suppress them.
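In diffusers terms, that advice maps onto the `prompt` and `negative_prompt` arguments; a minimal sketch (the checkpoint name and the exact terms are illustrative, not the article's setup):

```python
import torch
from diffusers import StableDiffusionPipeline

# Illustrative checkpoint; swap in the special-purpose model you are testing
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="photo of a woman in a summer dress, city street",
    negative_prompt="nude, nsfw",  # suppress explicit outputs, per the caution above
    generator=torch.Generator("cuda").manual_seed(42),  # fixed seed, reproducible
).images[0]
image.save("dressed.png")
```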
I like how even with all the "please don't make it porn" terms in the prompt, you can easily see (by choice of dresses, cleavage, pose, facial expressions etc) which models "want" to generate porn and are barely held back by the prompt.
I find with stable diffusion, images for a type of prompt seem to all show the same person. Add something like “mature” to the prompt and you get a different (but same) person for all those images, regardless of seed.
When one uses prompts unlike anything it has seen in training data, the results start to look less realistic.
Have even seen adult video logos in generated images.
I strongly suspect AI is not what we think it is.
You thought AI companies weren't training on one of the biggest datasets of pictures of people out there? Especially when the owners of those photos are too small to really sue you over it.
Use SDXL instead of sd1.5. Also don't do negative prompting on "distorted, ugly" because everyone follows these guides blindly and the result is the same types of mode collapsed faces which people have learned to recognize as AI. Use loras instead (to make it look slightly amateurish, or just away from the cookie cutter AI style which everyone notices).
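A sketch of that workflow with diffusers (the checkpoint is SDXL base; the LoRA file path is a hypothetical stand-in for whatever style adapter you download):

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Hypothetical amateur-photo style LoRA, e.g. downloaded from civitai
pipe.load_lora_weights("path/to/amateur_photo_lora.safetensors")

image = pipe(
    "candid photo of a woman laughing, overcast daylight",
    # note: no "distorted, ugly" negative-prompt boilerplate
    generator=torch.Generator("cuda").manual_seed(0),
).images[0]
image.save("candid.png")
```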
This is stable diffusion 1.5. https://civitai.com/ and https://huggingface.co/ suggest the popular options are SDXL based - it is a much better model (effectively SD 2.5). Still imperfect, but much better.
I find all those photos generated by Stable Diffusion to be kind of repetitious and boring.
Eking out something "interesting" is difficult, especially with limited time and low-end hardware. Interesting is highly subjective of course. I tend towards the more artistic / surrealist style, usually NSFW. Only nudes, no pornography.
I've been experimenting these last few months with generating interesting images, trying to make them "artistic" rather than photo-realistic, or the usual bland anime tributes.
I usually pick a "classical" artist who already has nudes in their repertoire, and try to blend their style with some photos I take myself and with the styles of other artists.
Most fall flat, some come close to what I consider acceptable, but still have major flaws. However, due to my time and hardware constraints they're good enough to post. I use fooocus which is kind of limiting, but after trying and failing to produce satisfactory results with Automatic, fooocus is just what I needed.
I can't really understand why more people don't do the same. Stable Diffusion was trained on a long and diverse list of artists, but most people seem to disregard that and focus only on anime or realistic photographs. The internet is inundated with those. I'm following some people on Mastodon who post more interesting stuff, but they usually tend to be all same-ish. I try to produce more diverse stuff, but most of the time it feels like going against the grain.
The women still tend to look like unrealistic supermodels. Sometimes this is what I want. Sometimes not, and it takes many tweaks to make them normal women, and usually I can't spare the time. Which is unfortunate.
If anyone's interested, I post the somewhat better experiments in:
I like how the body shapes of most of the nudes are more 'real' looking as opposed to what you typically find. Well done.
For others like me who aren't familiar with fooocus: there are a lot of AI-related sites with that name. I believe this is the one the parent is referring to:
I use it to generate art when building new characters in the Pathfinder group I'm in. In the past, I'd build a backstory for a character, but it would still take a while for them to become more well-defined and for me to have a clear image of them in my mind.
With AI image generation, I can start with the broad brush strokes for the character and then use AI to generate an image based on those prompts which can then help me further define the character to the point that it already feels fleshed out and real before I have even started the game.
I'm pretty pleased with what Copilot (using DALLE-3) spit out for my newest character, a Gothic-themed forensic medical investigator: https://imgur.com/a/bXHeqAX
Here it may be more reasonable to actually pay some money for commercial models that are far ahead of Stable Diffusion in terms of image quality and prompt understanding. Like Dall-E 3, Imagen 2 (Imagen 3 comes out soon), or Midjourney. The gap between free and commercial diffusion models seems to be larger than the gap between free and commercial LLMs.
Wanted to give it a try just for fun, using the same prompts, base model and parameters (as far as I can tell), and the first 5 images that were created... will probably haunt me in my dreams tonight.
I don't know if it was me misconfiguring it, or if the images in post were really cherry-picked.
Scrolling through the article the pictures look no more realistic as it goes on.
You need to simulate poor lighting, dirt, soul, realistic beauty etc. Perhaps even situations that give a reason for a photo to be taken other than I’m a basic heteronormative woman who is attractive.
The images generated by this guy are nowhere close to realistic. The resolution he's using is terrible for getting realistic faces. Most people with better GPUs get way better results whilst using 10% of the tricks from the article.
It's kinda telling when the author says (e.g. about the "Realistic Vision v2" model) that "the anatomy is excellent [...]" when this is obviously not the case.
Actually, it isn't the case in a single image in that blog post.
I have a hard time believing that the huge prompt they used at the end (before img2img) will fit in a diffusers prompt. I noticed that after 75 tokens or so, it just chops off the prompt and runs with whatever didn't get cut.
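That matches how SD 1.x works: the CLIP text encoder has a 77-token context (75 usable after the start/end tokens), and the stock diffusers pipeline truncates anything beyond it. Front-ends that accept long prompts (Automatic1111, Compel) work around this by embedding the prompt in 75-token chunks and concatenating the results. A pure-Python sketch of just the chunking step (integer IDs stand in for real tokenizer output):

```python
def chunk_prompt_tokens(token_ids, window=75):
    # Split a long token stream into windows the text encoder can accept;
    # each chunk is embedded separately and the embeddings concatenated
    return [token_ids[i:i + window] for i in range(0, len(token_ids), window)]

tokens = list(range(180))  # stand-in for a 180-token prompt
chunks = chunk_prompt_tokens(tokens)
print([len(c) for c in chunks])  # → [75, 75, 30]
```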