The uncanny failures of A.I.-generated hands (newyorker.com)
75 points by andsoitis on March 13, 2023 | 120 comments



They fall apart with just about anything I'd label "specifically" repetitive (as opposed to infinitely, generically, or chaotically repetitive, like fences, tree branches, or a pebbled riverbed). Stuff where there is a narrow range of correct repetitions, and going outside that range is immediately obvious.

Hands. Chopsticks. Teeth. Sport balls. Etc.

Kinda feels like it gets into loops, but it can't count so it can't tell when to stop.


I think the problem is that it primarily learns bottom-up features from the high-frequency domain.

The top-down invariants that come from the low-frequency domain aren't as readily recognised or understood.


That sounds intuitively similar to language models generating syntactically correct sentences without ever fact-checking the result.

Classic ChatGPT: "The word contains R because it's spelled L-E-M-O-N"


Tried to get an explanation for a crossword clue today:

The answer "gnocchi" is derived from the clue "spouted pan I used for dumplings" using a type of wordplay known as anagram. An anagram is a type of wordplay where the letters of one word or phrase are rearranged to form another word or phrase.

In this case, the clue contains an anagram indicator, "spouted," which suggests that the letters of some words need to be rearranged. The words "pan I used" can be rearranged to form the word "nog," and "for dumplings" suggests that we need to add another letter to complete the solution. The letter "c" is a suitable choice here, as it is a common letter in the word "dumplings," and it fits the remaining letters to form the word "gnocchi," which is a type of Italian dumpling made from potatoes.

Therefore, "spouted pan I used for dumplings" is a cryptic crossword clue that requires you to rearrange the letters of "pan I used" and add a letter "c" to get the answer "gnocchi."


ChatGPT told me that the answer to a riddle was “man” because a crawling baby has three legs and a person with a cane has four legs.


Someone said on Reddit that SD can't yet generate a simple, sensible 'hammer' either.


The article only makes a brief mention of it, but I wonder if the models struggle with foreshortening in general, which would matter given that hands are a combination of repeated patterns and heavy foreshortening.


> The strange contortions of A.I. hands make me feel a sense of anticipatory nostalgia, for a future when the technology inevitably improves and we will look back on such flaws as a kitschy relic of the “early A.I.” era, the way grainy digital-camera photos are redolent of the two-thousands.

This will be true for most AI "failures". People love to harp on some minor issue some new generative AI has in edge cases and act like it's some insurmountable obstacle that will keep humans superior forever: "haha look at the AI try to make a hand what an idiot"

As humans become more and more irrelevant, we will cling to these minor failures more and more to preserve our egos


There are cheap optimists and cheap pessimists. Both amount to trolls and don’t particularly need to be fed or heeded.

The truth is we don’t know how far today’s breakthroughs in AI will take us. There’s quite a lot of headroom to explore, but history has shown that there are always some brick walls as well.

The rich discussions can be good, but the cheap flamewars are exhausting.


> The truth is we don’t know how far today’s breakthroughs in AI will take us

That's the central question. So far, AI has proven great at ingesting huge amounts of "prior art" and producing variations on the themes it has learnt (more or less) according to the prompts you give it, but it consistently fails to grasp the underlying rules (gravity, perspective, anatomy etc.) that are supposed to govern the images it generates. So it's better for "surrealist" art than for images that are supposed to be realistic...


The surrealistic manifesto by Breton excludes AI, since it has no thought:

"Psychic automatism in its pure state, by which one proposes to express—verbally, by means of the written word, or in any other manner—the actual functioning of thought. Dictated by thought, in the absence of any control exercised by reason, exempt from any aesthetic or moral concern."


We do know how far AI can go: Full parity with human cognition.

Everything we do, ultimately, comes down to recognition.

The inflection point will be the first model that is trained to recognize trained models.


I don't even see the position you're arguing against around here lately. But I see lots of comments similar to yours, so what do you have to add?

Saying that intelligent machines are impossible because of hand-wavy metaphysical reasons (or vanity as you claim) is completely different from saying that this technique can or can't be improved to do things that it can't now.

Jet turbines definitely can and have been improved to exceed the capabilities of birds, and it surely doesn't violate the laws of physics to build a machine that flies by flapping wings, but one technology doesn't inherently lead to the other.

It's not a choice of either unlimited improvement or an insurmountable obstacle if improvements are going in a different direction than what's being considered.


People love the idea of Superman because he can do everything. I think there is a psychological need to have a machine that can do everything and these AI tools feel like we’re not far off at times. Close to the supernatural.

I think there is some deep psychology at play that makes us want to believe we’re about to solve all the things we’re insecure about.

Being good at art, being good at speaking or writing, handling the idea of mortality, being great at maths.

We’ve decided that we want machines that can outdraw everyone, in all cases, with zero limitations, and we might get there, but we’re also happy to overlook some of the flaws and setbacks of such systems in order to believe we’ve solved art.

As someone who’s played with these tools and who has done a lot of illustrating in my life, as a hobby, I can still tell you that there are setbacks, namely: these tools don’t read minds… I still don’t get to produce what’s in my mind’s eye. I can come close, but it is a different product that’s produced. On the plus side, they do make some great art.

TL;DR nothing is perfect…


Not around here; the "AI will never replace artists" sentiment due to the failures with hands is much more frequent on Twitter among artists. Not that I disagree with their frustration, but the optimism that it will always be second best to pure human creativity seems like egoistic coping, and recent generations from Stable Diffusion bear that out: the hand issue is nearly solved.

I'm just extrapolating that kind of wishful thinking to anyone who thinks that AI will just be a tool for limited uses, always inferior to the less naïve human whose multi-modal efficiency could never be matched.

There's nothing in the laws of physics that prevents AI from becoming superhuman in all domains. There are people who nevertheless claim AGI is impossible for hand-wavy reasons.


I just asked ChatGPT for something today; I framed it in terms of a question in an attempt not to influence it - is there a...?

And it told me firmly "yes" followed by fiction, with a plausible link that went nowhere. Honestly, I'm not entirely sure it was complete fiction, it was so plausible and what I wanted to hear. But I couldn't find it with Google.

What's bothering me is not just people saying this is like a human being, but people saying the next generation, the progression will bring us closer to a human mind. "Better", if it is more deceptive, will be worse.

Improving what it is already good at will produce something but not a human-like mind and not a superhuman oracle.

Humans have eyeballs, but an artificial eye is never going to be a human-like mind no matter how much you develop it and whether it is better or worse than a human eye.


AI pics used to look like things you'd see in a dream. Hands are no different; in fact one of the best ways to find if you're dreaming or not is to try and count your fingers, because in a dream your hand would look exactly like a hand generated by AI. I remember reading this on some lucid-dreaming subreddit and tried it. You need to get into the habit of doing it while awake so you also get the reflex to do it in a dream.


Another trick is to pinch your nose and try to breathe through it. As you aren't really pinching your nose, your breathing is unobstructed. I got into the habit of checking when I would see certain signifiers (like machines behaving in surprising ways), and maybe 1/3 of the time I would find I was dreaming.


Habitually counting your own fingers seems like a much worse problem to have than not being able to tell you're dreaming.



Two reasonably valid contradictions: 1) You are having serious problems with nightmares (esp. kids) 2) You're a massively obsessed sex pervert


Counting fingers is one of the tricks. Another is to try to go back to the room/street/place you just came from; when you do, it will have changed into something else. This is also what AI video generators fail at.

Another, which I discovered myself as a kid, was that if I went back and reread the same text, it never stayed the same. I discovered other lucid-dreaming signs later. The brain continuously generates/hallucinates stuff while dreaming and keeps too little context.


> in a dream your hand would look exactly like a hand generated by AI

I'm going to say... no, this isn't true.


> in fact one of the best ways to find if you're dreaming or not is to try and count your fingers

Or looking twice at a clock and seeing whether your mind is making up the positions of the hands on the spot.

Or looking at text in a book and seeing whether it's gibberish.


Is that why potheads look at their hands all the time?


More like acid heads. I've never seen anybody high on just pot be so far off from reality as to marvel at their hands; that's usually reserved for the next level of dissociation.


you need better weed then


1) Do they? Having spent lots of time around stoned people, I don't remember this.

2) Why the need for a relatively pejorative term, "potheads"?


Maybe you were also stoned and thus don’t remember doing it :)


Replying to 1): it's a well-known comedic trope that when someone is stoned they look at their hands and say "have you ever really looked at your hands, man?"


Have you though?


All humans look at their hands all the time. It's called hand-eye coordination. Potheads merely consciously realize they're doing it, and are probably more mesmerized by the constant, subconscious looking at hands that we all do than by the hands themselves. Having said that, our hands are perhaps the most crucial instrument for manipulating reality, so they are quite amazing.


Hand-eye coordination doesn't involve looking at the hands. Hand-eye coordination involves looking at the object you want to interact with, e.g. a button you're going to click, or a ball you're going to catch. You don't look at the mouse or your hands.


Yes, Carlos Castaneda's books said it's sufficient just to look at your hands.

I wish I could remember to do this :-O


Two hands have ~60 degrees of freedom (roughly four per finger for extension/flexion/abduction, plus a few for the pose between the hands). You can speak with your hands! Or perform (shadowgraphy, puppeteering, magic).

In computer graphics, rendering hands and hand gestures was considered very challenging for a long time (it was, for example, the PhD thesis topic of one of the Pixar co-founders, back when computer graphics was just getting started). And it is still very challenging to render a close-up of a hand performing a natural-looking gesture.


As a human, drawing hands is a difficult task. Look how many beginners try to avoid exactly the problem you see in SD. You need to break it down in a way that makes it easier to render. Took me ages.


It's not just humans; I've had trouble generating appendages for other creatures, and also machines like robots. Dragons with missing wings and seven legs, robots with feet that have feet that have feet. Stuff like that. So far the best technique I've found is to fix them up by retouching the photos, or to hide the hands by e.g. prompting characters to have their hands behind their backs.


Agreed. I have found that generated horses never look right. The legs are routinely wrong. And the horses look more like a cartoon representation than an actual horse.


It is possible to generate good-quality hands with Stable Diffusion using ControlNet. Here's a video about generating different hand positions from models created in Blender. There are also discussions about this on r/StableDiffusion.

https://www.youtube.com/watch?v=ptEZQrKgHAg

Now from personal experience, it's not as easy as it seems in the video. It still takes a lot of tweaking and many iterations don't come out well, but hands are definitely not an insurmountable problem.
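For anyone who wants to try this route in code rather than through a UI, here's a minimal sketch of my own (not from the video) using Hugging Face diffusers with the public lllyasviel depth ControlNet. The depth map would be a render of the posed Blender hand; file names and the prompt are placeholders:

    # pip install diffusers transformers accelerate
    import torch
    from PIL import Image
    from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

    # Load the depth ControlNet and attach it to a base SD 1.5 pipeline.
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    # Depth render of the posed 3D hand, exported from Blender (placeholder path).
    depth_map = Image.open("blender_hand_depth.png")

    image = pipe(
        "a photo of an open hand, detailed fingers, studio lighting",
        image=depth_map,  # conditioning image steering the generation
        num_inference_steps=30,
    ).images[0]
    image.save("hand.png")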


Depth ControlNets have been the most helpful so far; they convey spatial direction the best and untangle the good from the bad, IMO.


I mentioned it in another response as well, but here's the original implementation repo for anyone interested:

https://github.com/lllyasviel/ControlNet


> machines can grasp small patterns but not the unifying whole

that's actually true for all the art generated by (the current generation of) AI. With some things (like hands) it's more obvious than with others, but the closer you look, the more "wrong" things you notice...


When you see the picture as a thumbnail it looks okay. When you look at the details of a full sized picture, it’s a fractal of wrong.


Idea: commit crimes wearing some prop that makes you look like you have extra fingers. In court, point out that the evidence is AI-generated.


I was thinking of going and editing extra fingers and fucked up teeth into someone's childhood photos.


Sticky Six Fingered Discount has more of a ring to it, I have to admit.


Bad timing, since there's a rumor that Midjourney v5 has mostly fixed the hand and teeth problems.


All the examples of supposedly 'fixed' hands I've seen still had problems, or relied on poses which hid most of the fingers. Even if the obvious 'can't count to five' issue gets fixed, it still has problems correctly rendering knuckles and fingernails and getting good proportions. And people don't even attempt difficult scenes with hands like holding small objects or performing specific actions, presumably because even the most wide-eyed AI-optimist can see how much they don't work.


> And people don't even attempt difficult scenes with hands like holding small objects or performing specific actions

They definitely do, and with pretty decent (albeit certainly far from perfect) results. I feel you have a bit of cognitive bias based on your sources. In my experience, fine-tuning an existing SD model on a few hundred well-tagged images of hands produces a dramatic improvement, including for grasping objects. Are they photograph-level reproductions? Certainly not, but they easily qualify as skilled artistic renderings. The part I've been experimenting with lately is getting fingernails and skin texture to both be realistic at the same time (I can get one or the other, but not both so far).


People keep saying things that these models can do, but they're pretty reluctant to actually show any results. Perhaps they're not as good as you claim.


I could show hundreds, but would you trust me or just think they were cherry picked good cases? Skeptical people will be skeptical, news at 11! ;)

If you really want to know, you could just try it yourself. Set up WebUI, download some CAI or HF SD models that have been fine-tuned on hands, learn about prompting, and then see that the field has dramatically improved by just not using stock settings.

Zero to hand model hero in around an hour!
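If you'd rather script it than set up WebUI, a rough equivalent with diffusers looks like this (the checkpoint name is a placeholder for whatever hand-focused fine-tune you download):

    import torch
    from diffusers import StableDiffusionPipeline

    # Load a community fine-tune downloaded from CivitAI or Hugging Face
    # (placeholder file name).
    pipe = StableDiffusionPipeline.from_single_file(
        "hand-finetune.safetensors", torch_dtype=torch.float16
    ).to("cuda")

    image = pipe(
        "portrait of a woman waving at the camera, detailed hands",
        negative_prompt="extra fingers, fused fingers, mutated hands",
        num_inference_steps=30,
    ).images[0]
    image.save("wave.png")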


Stable diffusion solved being cross eyed by having everyone wear a space helmet. Done, ship it.


Good timing: they get the clicks for this article, and then all the clicks for another article about how it's fixed in a few weeks.


Is it extra layers and models? They already have some kind of "face fixup" don't they?


Didn’t want to look at the article because I get real intense discomfort when viewing AI generated faces. The hands aren’t as bad but still pretty uncomfortable. It’s almost cubism in a way but still too uncanny valley for me


Thanks, I get extremely uncomfortable when looking at the hands. Faces for some reason aren't as bad. But the discomfort is similar to that generated by those older "deep dream" images where everything is a psychedelic octopus dog nightmare


this worked for me:

    links2 https://www.newyorker.com/culture/rabbit-holes/the-uncanny-failures-of-ai-generated-hands
(in this case I was more interested in not getting paywalled)

I was still able to load the images by typing 'i' to launch an external viewer


Zero mentions of ControlNet. Also there is now https://www.reddit.com/r/StableDiffusion/comments/11pyiro/ne...


GitHub link to the actual source project for anyone curious:

https://github.com/lllyasviel/ControlNet


I wonder if satisfactory AI hands will be like self-driving cars: always five years away.


They are already here, several custom SD models are really good at it.


Yet look up the comments list and see people saying it still requires tweaking…


If people don't use a good custom model, or can't make the slightest commitment to inpaint errors, then sure. But it's trivial nowadays.


There's a model called Ziepher's F111 that greatly improves human anatomy. You can merge F111 with other models in automatic1111; something like 80% SD 1.5 and 20% F111 greatly improves human anatomy, including hands.
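For the curious, the "Checkpoint Merger" tab in automatic1111 is doing essentially a weighted sum of the two checkpoints' tensors. A bare-bones sketch of the same idea (file names are placeholders):

    import torch
    from safetensors.torch import load_file, save_file

    sd15 = load_file("v1-5-pruned-emaonly.safetensors")  # base model
    f111 = load_file("f111.safetensors")                 # anatomy fine-tune

    alpha = 0.2  # 80% SD 1.5, 20% F111
    merged = {}
    for key, w in sd15.items():
        if key in f111 and f111[key].shape == w.shape:
            merged[key] = (1 - alpha) * w + alpha * f111[key]
        else:
            merged[key] = w  # keep base weights where the models don't line up

    save_file(merged, "sd15_f111_merge.safetensors")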


I've been curious about the hand problem.

It seems like it’s caused by the massive number of permutations possible with hands: all the joints, combined with all the angles at which they can be seen.

I’ve wondered if this could be solved by creating a massive set of hand images using rigged 3D hand models (using, for example, Blender, or coupling with Unreal), programmatically putting them in all the combinations and angles the rigging allows, and then rendering millions of images of those combinations.

Then the image models could learn from that artificially created dataset.
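To make the idea concrete, here's a rough sketch of what that generation loop might look like as a Blender script. The rig, bone names, and rotation ranges are all hypothetical; a real version would need per-joint limits to keep poses anatomically plausible:

    import math
    import random
    import bpy

    arm = bpy.data.objects["HandRig"]  # hypothetical rigged hand armature
    finger_bones = [b for b in arm.pose.bones if b.name.startswith("finger")]

    for i in range(10000):  # render many random poses
        for bone in finger_bones:
            bone.rotation_mode = 'XYZ'
            # Random curl within a plausible flexion range.
            bone.rotation_euler[0] = random.uniform(0.0, math.radians(80))
        # Randomize the hand's orientation relative to the fixed camera.
        arm.rotation_euler = [random.uniform(0, 2 * math.pi) for _ in range(3)]
        bpy.context.scene.render.filepath = f"//renders/hand_{i:05d}.png"
        bpy.ops.render.render(write_still=True)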

Anyone with actual knowledge know how people are trying to tackle this?


The entire "more data and pray" approach is how you all end up with these problems. If you want to see another Stable Diffusion generation failure, try asking DALL-E or Craiyon for playing cards, in particular the back sides. Very similar to what you pointed out: lots of repetition in the data sets, but non-exact repetitions. Different companies put different patterns and different stylistic choices on the same recognizable standard deck. The output the machine produces is frankly psychedelic.

Here's what you missed. The generation problems all happen under a certain length scale. It's in a certain band of fine detail that the distortion happens. Normally you won't notice it, because normally there isn't anything but noise in that band anyway. Shove an SD-generated image through a Fourier transform to see what I mean.

And here's my very informed conjecture as to why it happens. The hint is right there in the name: Stable Diffusion. The generator network trains to de-noise images. The adversary adds Gaussian noise. That's a perfectly reasonable noise method, but it comes with its own spread and distribution in frequency space.

A very similar problem exists with dithering processes. One way to use a b/w screen for grayscale is to treat the grayscale values as coin-flip probabilities. This is known as random dithering. It's simple, it's obvious, it should work, and it does work surprisingly well most of the time. But it runs into the exact same distribution problem. When there are closely spaced stripes in the image you want to show, that stripe pattern gets completely washed out by the noise being added to fake brightness levels. Put another way, overlaying TV static on any image makes it impossible to see blobs of the same size as might appear by random chance. The generator network can't see certain patterns, because the length scale of those patterns is coincident with the length scale of the noise being added by the adversary.
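You can see the effect in a few lines of numpy: add per-pixel Gaussian noise to a thin-striped and a thick-striped pattern, then "denoise" with a simple local average. The coarse stripes largely survive the round trip; the stripes at the noise's own length scale mostly don't. (A toy model, to be clear, not the actual SD noising schedule.)

    import numpy as np

    def stripes(width, size=512):
        x = np.arange(size)
        return ((x // width) % 2).astype(float)  # 1D square-wave stripe pattern

    def box_blur(a, k=5):
        return np.convolve(a, np.ones(k) / k, mode="same")

    rng = np.random.default_rng(0)
    noise = rng.normal(0.0, 1.0, 512)

    for w in (2, 20):
        clean = stripes(w)
        recovered = box_blur(clean + noise)  # crude denoise by local averaging
        r = np.corrcoef(clean, recovered)[0, 1]
        print(f"stripe width {w:2d}: correlation after noise + blur = {r:.2f}")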

This problem will never be solved by substituting data for comprehension. That's a one shot lesson worth generalizing.


I don't think there is a problem with the diffusion process as you describe it. The model has been trained on very different noise/signal ratios and can more or less guide the generation depending on the step count. Adding noise to the process makes it stochastic, but if a pattern is critical to fitting the distribution correctly and the model has learned it, the model will add it. There are also custom SD models that can easily generate good hands.


To make it really clear, it's not the raw amount of noise or even the noise/signal ratio that I'm pointing to, but rather the spectral properties of the randomness. It matters whether you use white noise, pink noise, brown noise, Gaussian, uniform, etc. These variations in the coarseness of noise pattern will each obscure patterns of the same coarseness. Put another way, if your noise pattern is trees, you won't be able to discern whether a forest was there prior to the noise. You'll miss the forest for the trees. You can sort of get out of it by switching away from single-pixel uncorrelated noise, but that's hard to think about and hard to code, and it all looks random when you test it anyway.

I don't know how well, if at all, the adversary side of the process adjusts the probabilities.
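For reference, "white vs. pink vs. brown" is just a statement about how the noise power is spread over frequencies, which you can reproduce by shaping white noise in frequency space (a sketch; the normalization is arbitrary). Each variant concentrates its energy at a different length scale, so each would obscure different image features:

    import numpy as np

    def colored_noise(n, exponent, seed=0):
        # 1D noise with power spectrum ~ 1/f**exponent (0=white, 1=pink, 2=brown).
        rng = np.random.default_rng(seed)
        spectrum = np.fft.rfft(rng.normal(size=n))
        freqs = np.fft.rfftfreq(n)
        freqs[0] = freqs[1]  # avoid division by zero at DC
        spectrum *= freqs ** (-exponent / 2)  # amplitude f^(-e/2) => power 1/f^e
        out = np.fft.irfft(spectrum, n)
        return out / out.std()

    white = colored_noise(4096, 0)  # flat spectrum: fine-grained static
    pink = colored_noise(4096, 1)   # 1/f: structure at every scale
    brown = colored_noise(4096, 2)  # 1/f^2: slow, coarse drift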


The spectral properties of the noise do matter a bit (see the Stable Diffusion offset-noise issue, for example), but

> These variations in the coarseness of noise pattern will each obscure patterns of the same coarseness.

is not accurate. During the forward noise process the high frequency information is obscured first, and the coarse information last. During the reverse process, the neural net learns to denoise the coarse visual elements first, and adds high frequency details at the end. With the original clip guided diffusion notebook you can skip the last 10% of the denoising steps to get a smoother image.

This is also the root cause of the noise offset problem (the coarse information is not completely obscured by the forward process)
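A back-of-the-envelope way to see this: the forward process gives x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps, the added noise is white (flat spectrum), and natural images have power falling off roughly as 1/f^2, so the per-frequency SNR drops below the noise floor at high frequencies first. A tiny sketch (the 1/f^2 spectrum is an assumption and the schedule values are arbitrary):

    import numpy as np

    freqs = np.array([0.01, 0.05, 0.25, 0.5])  # cycles/pixel, low to high
    power = 1.0 / freqs**2                     # assumed natural-image spectrum
    power /= power.max()                       # normalize

    # Forward diffusion: x_t = sqrt(abar)*x0 + sqrt(1-abar)*eps, eps white.
    for abar in (0.99, 0.9, 0.5, 0.1):         # abar shrinks as t grows
        snr = abar / (1 - abar) * power
        drowned = freqs[snr < 1.0]
        print(f"abar={abar:4.2f}: frequencies below the noise floor: {drowned}")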


You're making exactly the same point I am making. I was just substituting the word "coarseness" for "frequency" in an attempt to make it more accessible for people who have never studied this topic.

I'm glad someone else sees the frequency space folly of the diffusion process. If I had time, I would test the hypothesis of doing all the learning in frequency space rather than trying to shape the profile of the noise. But I don't have time so feel free to steal my idea.


I wrote a response yesterday but did not post or send it, oops.

I still don't understand the problem. If you ask a model trained on a noise pattern of "trees" for a forest, it will still give you a random forest; that's what it was trained on. Also see https://arxiv.org/abs/2208.09392 for the diffusion process applied to degradations other than Gaussian noise.


That was an illustrative example; don't take the trees literally. Here is an actual illustration of dithering which I believe is relevant.

https://surma.dev/things/ditherpunk/

Imagine asking DALL-E 2 for a picture of a bridge through the fog. The problem is that an image of fog is graphically indistinguishable from random noise. What's the difference between a fog patch and something the adversary drew in? Good luck training your way out of that one.

Less tortured understanding: think about my playing-cards example. Consider a face-down card with the ordinary geometric lacing patterns drawn on the back. So we are all looking at the same thing: https://www.wopc.co.uk/images/countries/usa/standard/standar...

Now think about what happens to that red card as I randomly add white noise static on top. The original fine patterning on the red side is soon distorted beyond recognition because those thin white lines are obscured by equiprobable white dots and coincidental random patterns.


>Now think about what happens to that red card as I randomly add white noise static on top. The original fine patterning on the red side is soon distorted beyond recognition because those thin white lines are obscured by equiprobable white dots and coincidental random patterns.

I don't see the problem; the model will remove the static white noise on top and generate a new fine patterning on the red side (in case it did not go through the forward process to the end). Of course it will not recover the lost information.

>Imagine asking DALL-E 2 for a picture of a bridge through the fog. The problem is that an image of fog is graphically indistinguishable from random noise. What's the difference between a fog patch and something the adversary drew in? Good luck training your way out of that one.

Ignoring the fact that Gaussian noise is really different from a patch of fog: it is obvious that if you have a high noise-to-signal ratio you will not be able to fully recover the signal. During the forward diffusion process all information is lost; however, the model will learn the distribution of samples.


>I don't see the problem; the model will remove the static white noise on top and generate a new fine patterning on the red side (in case it did not go through the forward process to the end). Of course it will not recover the lost information.

You said it yourself: it erases information. Pivotally, it isn't just a blunt eraser. It erases a very particular grain of detail. The problem is it looks like you're on hallucinogenics. https://i.imgur.com/YlPHzgY.png

My hypothesis here is that the distortion manifests when you have fine-grained patterns like you would find on a playing card, because that is the length scale most affected by the noise process. The presence of noise doesn't affect long-range objects. The same process would never cause the red card to completely go away. It destroys information about that fine, thin, white line and patterns of a similar characteristic. That's really the key to what I'm saying.


>The same process would never cause the red card to completely go away

The forward diffusion process completely destroys all information. In fact, when the model is trained we simply sample directly from the Gaussian distribution instead of actually applying the forward diffusion process to a sample. So the fact that you lose information in the diffusion process is not really a problem: with the diffusion process you lose all information, but the model clearly can still generate images.


I meant over a single step of the diffusion. Obviously if you keep adding noise it eventually all becomes noise. My point is that the noise process is biased towards erasing some types of details faster than others. This works well when the length scale of features is proportional to their relevance and noticeability, as is often the case (smaller details usually matter less). But when the fine-grained level actually has some sort of geometric patterning (such as the white lacing on the playing cards), Stable Diffusion can't even see it after the first steps of diffusion. It takes a lot more steps for the blob of red card overall to be eliminated than for the fine white lines.

It's exactly the same sort of problem as in the dithering example. An original image of 2-pixel-thin stripes is very recognizable to us as a "thin striped object". Trying to color it in with random dithering will render it unrecognizable. However, a much wider 10-pixel-stripe pattern can be safely colored with random dithering.

The commonality is long-range ordering of short-range features. That's what the SD process struggles with.



Thanks for that. I'm reading it now. Just off the abstract:

"when increasing the image size, the optimal noise scheduling shifts towards a noisier one (due to increased redundancy in pixels)"

That's highly related to what I'm saying. It's a statement that may be true in the vast majority of cases, and thus would certainly be validated as true-on-average by whatever benchmark, but there are certain types of patterns that aren't redundant among nearby pixels despite being macroscopically visible.

A similar thing happens with compression artifacts. For pictures of normal environments and objects, JPEG works great. Save a 1-pixel-wide line as a JPEG and it comes back surrounded by artifacts. JPEG compression is also largely built on the assumption that high-frequency components in large images won't contain anything important to the image content.

In both cases, the result is a visible digital artifact with a "trippy" patterning effect.
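The JPEG claim is easy to check with Pillow: round-trip a 1-pixel-wide line and a flat gray image through JPEG and compare the maximum pixel error (a quick sanity check, not a rigorous benchmark):

    import io
    import numpy as np
    from PIL import Image

    def jpeg_roundtrip_error(arr, quality=75):
        buf = io.BytesIO()
        Image.fromarray(arr).save(buf, format="JPEG", quality=quality)
        out = np.asarray(Image.open(io.BytesIO(buf.getvalue())), dtype=int)
        return np.abs(out - arr.astype(int)).max()

    line = np.zeros((64, 64), dtype=np.uint8)
    line[:, 32] = 255  # 1-pixel-wide vertical white line on black
    flat = np.full((64, 64), 128, dtype=np.uint8)

    print("max error, thin line:", jpeg_roundtrip_error(line))  # large: ringing
    print("max error, flat gray:", jpeg_roundtrip_error(flat))  # near zero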


Here's an article that happens to echo a lot of what I've said, albeit with a different manifested example.

https://www.crosslabs.org/blog/diffusion-with-offset-noise
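For anyone who doesn't click through: the fix that article proposes ("offset noise") is a one-line change to the training noise, adding a per-image, per-channel constant so the model also learns to shift overall brightness. Roughly, in PyTorch, with `latents` standing in for a training batch:

    import torch

    latents = torch.randn(4, 4, 64, 64)  # placeholder batch (B, C, H, W)
    offset_strength = 0.1                # value suggested in the article

    noise = torch.randn_like(latents)
    # Add a constant per image and channel, broadcast over H and W.
    noise = noise + offset_strength * torch.randn(
        latents.shape[0], latents.shape[1], 1, 1
    )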


Other commenters here seem to think MJ5 has solved the problem; are they wrong? I don't know, just curious.


I just dug around to find those comments. The answer is I haven't seen MJ5 so I don't know, but when I see it I'll believe it. From other comments it seems MJ5 does better, but still has problems. I fully believe that all of these AI models can be ad hoc patched around the specific cases users complain about most, and that profitable companies desire a working product today more than a conceptual understanding that will pay off later. So I fully believe that there will be versions which "fix the hands problem", but the problem will still be there in at least some cases.


I "rated" 125 images today and many had hands.

90% solved it appears. Where the baseline I'd say has been "almost impossible to get right."


> how people are trying to tackle this

I don't think "people" actually tackle this. Ask a kid or non-professional to draw animal hands/feet without learning them first, e.g. crocodile hand, or feet of a seal. "People" ends up with uncanny drawings as well.

There are tons of face photo on the Internet, but few of them are with hands clear visible.


It's really very simple:

1. Humans are terrible at drawing hands and feet (ask any illustrator, amateur or professional).

2. Advanced interpolation software reads and generates images based on both real-life photos and illustrations of hands drawn by humans, depending on what materials are fed to it.

3. The images generated carry forth the failings of humanity.


AI-drawn hands are consistently terrible in ways different from how bad human artists draw hands badly.


As a man of culture who has witnessed his fair share of illustrated hands and feet: "AI" reflects human failings at drawing appendages very, very well.


Wrong number of fingers makes my skin crawl. I don't really have that problem with regular drawings.


An unconditional diffusion model is trying to solve a huge problem - storing the set of all images that are meaningful to humans. I think incorrect details in hands/faces are mostly due to limited model capacity. From the Imagen paper we see that details like fingers and text spontaneously get better at around 20B parameters.


Could we test that by training a small model on nothing but hands?


Yes I think so, but it would depend a lot on the data. If properly normalized like FFHQ you don’t even need a diffusion model.


Interesting because hands often look weird in dreams. A trick to know if you're lucid dreaming is to look at your hands.


Light switches are a reliable indicator that I am in a lucid dream. Switching them never turns on or off a light. They often cause a small explosion, weirdly.

Also, if there is text, it will never read the same twice if you look away from it.


Text is another terrible problem for these diffusion models.


Maybe only in your dreams. I've never noticed that myself. My dream generator is able to render hands just fine.


Heard this in a few places. Can't say I've noticed myself (though I never lucid dream). I've even seen it put forward as evidence these models are like the brain as they have similar 'failure modes' - anyone know if there is any truth to that?


I had more issue with gravity than with hands.



Using Dall-e 2 for all your example images is a great self report that you don't keep up much with AI.


> Using Dall-e 2 for all your example images is a great self report that you don't keep up much with AI.

I think the overarching story (subtitle of the article) is what is the message: "machines can grasp small patterns but not the unifying whole."

Amateur human illustrators might also struggle to draw accurate representations of hands, but, unlike machines, can critically evaluate and KNOW whether they got it right or not...


Luckily, there is sometimes a human sitting in front of the screen to decide whether the output is good.


That’d be me. I don’t really care to “keep up” with things. What should I be using instead?


Stable diffusion finetunes and model merges. You can find most of the newer ones at CivitAI.

https://civitai.com/

SD still has the hand problem, but a lot of the newer checkpoints are getting pretty good at hands.

As far as keeping up with things, the best way to do this is actually to be in AI related discords. I'm not sure why, but people involved in AI are terrible about having a single source of info...or even just info in general. I would start by joining the LAION discord then finding others from there.

Another great way to learn about what's going on in AI is actually through threads on 4chan's /g/ board. If you're willing to look past the fact that it's 4chan you will be the first to know about lots of neat stuff.

Btw, I'm not a researcher. I'm just talking about ways to get a quick scoop on what the current AI meta is and how far it can be pushed.

edit: I just realized you don't care to keep up...lol. Oops. I'm still leaving this up because I think some people will find it useful.


You can position the body and fingers any way you want with ControlNet.


Stable Diffusion, but it also has artifacts, as discussed in the post.


Where does the training data come from? Humans have traditionally sucked at drawing hands.


That's not the source of the problem. In layman's terms, most fingers are flanked by two fingers, and the AI does what is "statistically more accurate". It doesn't have a solid big-picture context and gets lost in local detail, similar to kids' drawings. The role of training data is that hands are extremely expressive and commonly shown in very complex positions in the training images. AI is equally inept with complex body postures and contortions. But as has been said, there are already models and workflows that improve the results significantly.


I think the human element shouldn’t be lost here. Remember, the goal of AI is to have a human brain on demand, but just a bit better than human. It sounds like every complexity there could apply to why humans suck at drawing hands.


You mean humans traditionally drew hands with 7 fingers and other such? I have a hard time remembering a single human drawing "sucking" in the way AI-generated hands do.


> I have a hard time remembering a single human drawing "sucking" in the way AI-generated hands do.

That's because you, as a human being, already own a pair of hands; you can look at them every day.

Now try to draw the hands/feet of some animal you've only heard of, without a photo, e.g. the hands of a sloth.


3 fingers is pretty common because it's easier to get a 3-fingered hand to look okay

but yeah humans rarely draw 7-fingered hands or smiles with four rows of 40 teeth


> but yeah humans rarely draw 7-fingered hands

Guess you haven’t seen hand drawings done by young children?


Surely you aren't suggesting that many drawings by young children are in the AI training models and that's why it sucks at drawing hands?


Nope, not that at all. I was replying directly to the quoted sentence of GP’s post with a clear and direct rebuttal against it.


I was trying to remember if young children often did this, but came to the conclusion that they only do it rarely.


Based on experience with the kids' drawings I've seen, I'd say getting the right number of fingers is definitely the rare thing up to a certain age, but specifically drawing 7, probably not that common I guess.


I guess I was wrong then.

Thanks.


This article seems a few months late; it is trivial now to get correct hands using a custom Stable Diffusion model, or by making the slightest commitment to inpaint any errors.


For the old-schoolers: as a kid I watched the hands to see if a TV character was an alien (The Invaders).


Anyone who knows an artist or has attempted this themselves will immediately confirm that hands and feet are often the most challenging anatomy to get right.

I view this result as further evidence that these algorithms are approaching the target.


The way for AI to make it realistic is to build a human from the bones to the muscles to blood to skin.



