Hacker News
Is It a Duck or a Rabbit? For Google Cloud Vision, Depends on Image Rotation (reddit.com)
260 points by jpatokal 47 days ago | 88 comments

I'm quite surprised at the comments on HN so far, as nobody seems to see the significance of this. Yes, the image is ambiguous. The point is that Google Cloud Vision gives an unambiguous answer for that image based on the rotation. Transformations of an image are regularly used to improve the results of image recognition. That process fails quite dramatically if, in the course of a transformation, the answer is presented with higher confidence than it should be.

I'm glad that at least someone here sees the problem, but I am not surprised by the typical reaction of AI apologists in this thread. You always get at least one of the two responses:

"OMG, this is amazing, it's just like humans. We're probably close to AGI."

"Ha-ha, humans are stupid, so the algorithm giving an unexpected result is just proof that it's better and less biased."

Here, we have both in response to the same demo.

Still, I honestly don't know why some people are so biased in favor of neural nets and have zero interest in edge cases and flaws (the most interesting parts if you want to gain deeper understanding of how the algorithm actually operates). Wishful thinking, I guess.

Can someone explain why this is a problem? I'm not an "AI apologist", but I would consider it a good thing that the model pegs it as a rabbit when it is in more of a "rabbit orientation" and a duck when it is in more of a "duck orientation".

In the real world, you don't want AI to instantly flip from 90% confidence in one direction to 90% confidence in the other direction, because it would cause erratic behavior. What would be preferable is a large zone where it gives both labels .45 score. Then you can apply higher-level reasoning based on the possibility that the object could be either of those two labels (i.e. act on the possibility of the most dangerous or most beneficial scenario of the two).

But it doesn't instantly flip the decision in a way that can oscillate, it recognizes the similarity in specific positions where humans would also recognize it, and doesn't recognize anything in between (again the same as humans would). Isn't that the goal, to mimic how us humans do it? Imagine recognizing numbers 6 and 9, you want AI to recognize it with high certainty depending on what part of the digit is up, you don't want 45% certainty that it might be 9 or 6. Or am I missing something?

That doesn't make intuitive sense to me - I don't think humans have such a zone for (all) binary choices. An example that comes to my mind is the rotating Spinning Dancer Illusion[1] - I am 100% confident that she's spinning to the left, or 100% confident she's spinning to the right - there is no middle-ground where my brain tells me "it could be either direction".

Multistable/Bistable perception is not unique to Google Cloud vision - it afflicts humans too.

1. https://en.wikipedia.org/wiki/Spinning_Dancer

This. Maybe the AI needs to know somehow that it's the same image it saw a few seconds ago, but now presented at a different angle, or the AI itself should look at an image from different angles, just to be sure; that is how humans sometimes look at pictures if they are confused. Something to contextualize every image with respect to what it recently saw and make the current decision a little less overconfident.

CV usually does look at images with different rotations, scaling, and stretching. The point is that GCV doesn't get stuck on "duck" when the duck is upside down; it behaves the same way as a human classifier.

I'm looking forward to ReCAPTCHA asking "Click the images of things that are upside down."

You'd think, but the entire reason that softmax is so common in ML is because that artificial certainty is the preferred behavior.
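To illustrate the point about artificial certainty: softmax exponentiates its inputs, so even a modest gap between raw scores comes out looking like near-certainty. A minimal sketch (the scores here are made up for illustration):

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# A 3-point gap in raw scores becomes ~95% "confidence".
print(softmax([4.0, 1.0]))  # ~[0.95, 0.05]
```

This is one reason a classifier can report ~80% for one label and ~0% for the other on an image that humans find genuinely ambiguous.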

In the "real world", you expect AI to behave like a human. Isn't that the definition?

In this case, a rabbit and a duck are approximately the same size and danger level, so there are few cases where harm is possible.

What if it was an AI looking at bacteria? Or scanning a roadside for IEDs? Or when a guy on a bike, turned and rotated the right way, appears to be a crosswalk paint marking of a guy on a bike?

If our current AI is making different “DEFINITE” determinations based only on image rotation - there is a problem.

Not sure why you're being down-voted, this is exactly the issue.

The image is both a rabbit and a duck regardless of orientation, capturing the object it depicts as a single class with a confidence measure is a mistake.

The magnitude of this mistake becomes apparent when you connect it to real-world decision making, and it becomes highly unsafe.

As is most of ML because it only uses statistical (rather than causal) modelling of the world -- so really, it is only offering us generalised statistical associations. It cannot cope with statistical discontinuities.

AI has become exactly like most complex issues with multiple distinct 'sides'. For whatever reason everybody is expected to have an opinion even though, of all people, it's generally safe to say < 1% have the knowledge and experiential basis to form an educated opinion. So the other 99% are mostly just picking sides semi-arbitrarily. And it's this arbitrariness that leads to overly, and inappropriately, simplistic responses to issues. This simplicity in turn also tends to drive responses that are either radically 'for' or radically 'against' something. Understanding drives nuance, and nuance drives uncertainty.

The trouble with the world is that the stupid are cocksure and the intelligent are full of doubt. - Bertrand Russell

Though in this case, stupid/intelligent are probably overly harsh. Intelligent people are certainly not immune to this 'trap'. In some ways they can be even more susceptible since they may themselves know very little, but that very little is still enough to put them ahead of 80% of the rest which can yield unjustified confidence. So let's just say uninformed/informed.

I'm just saying it's valid as a rabbit in one orientation, and valid as a duck in another orientation.

Many things belong in distinct groups depending on their orientation or other physical attributes.

A bucket is, amongst other things, a bucket when standing flat on the ground with the hole facing upwards.

Turn it around and it becomes the cover for a mole trap; place it on someone's head, it becomes a rain cover. Mount a light bulb inside it and it becomes a lampshade.

It's still a bucket, but it's not _primarily_ a bucket in all cases, and shouldn't necessarily be classified as a bucket in all cases. Rotate the plus symbol 45 degrees and you get the letter x. And it is definitely, 100% certainly an x in one orientation and a + in another orientation.

Humans reading a text in Latin letters will find that d, b, p, and q are different letters.

In most fonts, the only difference is rotation and/or mirroring.

Yet they represent different sounds, they have different meanings.

Rotation is another data point, not something entirely independent from the data.

Probably, and I do hate being the cynic in the room, it's due to the huge salaries and interesting, specific use-case work that NN-based AI is generating these days.

But the human brain fails at this if it's rotated also...

If you showed me the rabbit rotation of the picture, i'd tell you with pretty high confidence that it's a rabbit.

If you showed me the duck rotation, i'd tell you with pretty high confidence that it's a duck.

That's the point of this, it's an illusion.

And it did give a bit of an "I don't know" answer for many of the rotations in the middle of the gif/video, which is exactly as I would expect it to, and when I pause the video at those points and glance at it, it doesn't look like much of anything to me either.

But a human brain should only fall for it once. You have your initial reaction, realise it might be the other animal, and from there you know it's an illusion and the rotation of the picture no longer matters.

What's the correct answer for this? Is it definitely a rabbit regardless of the orientation? Or is it definitely a duck regardless of the orientation?

I still fail to see how the human brain only falls for this illusion once.

If they chose to allow a classification option of (c) optical illusion, then the system could easily learn that. In fact, you could probably make a GAN ambiguous image generator.

An ambiguous image is not always ambiguous.

Eg clever animal camouflage. A caterpillar that looks like a snake should not be classified as ambiguous. It should always be caterpillar.

Ah, but then you open the door to more complex and meaningful classification. For example "insect camouflaged as a stick" or the classic "couch with leopard print pattern".

Eh, you use rotations if you're specifically looking for rotation invariance. Also, it's hard to tell how confused the network actually is, since it's not really predicting the 'probability' of a class (even though the term is commonly used). Quite often neural nets just output the most 'probable' class with an oversized probability estimate, due to how the most common classification layer works.

Problem? This is a feature, IMO. It uses the rotation I give it to help deduce ambiguous content, which is helpful.

It is only a feature because you have the information about which rotation you provided and can relate it to the result. If I just try to classify an image, the information about which rotation the image is under might not be there at all. Just an example: say there is a drone, the Elmer-2, that takes an image of a sign. Depending on which angle this picture is taken from, it might think it is duck season instead of rabbit season and never look twice.

They have scale-invariant feature transforms (SIFTs[1]). I wonder if they could do rotation-invariant ones that wouldn't have a different answer depending on rotation?

[1] https://en.wikipedia.org/wiki/Scale-invariant_feature_transf...

SIFTs are rotation invariant. However, they are not good for classification.

Anyway, the early layers of an NN should be performing an encoding that creates scale and rotation invariance, so that later layers can classify. That's what makes this result interesting. Well, that and the fact that the ambiguity matches the human ambiguity.

Look up spherical CNNs and tensor field networks; both are rotation invariant using spherical harmonics. They are not really used in practice, however.

Creator of the animation here. Most of the relevant information/context behind the animation (including a link to the repo) is in this Reddit comment: https://reddit.com/r/dataisbeautiful/comments/aydqig/_/ehzyo...

To answer the question why I made the animation: there isn't an ulterior "I found an AI gotcha!" motive, I saw a tweet where the API returned different things depending on orientation and expanded on it. It was also an opportunity to test a few animation hypotheses via gganimate.

With 3D rotation it's even worse.

When the output switches to rabbit the picture actually resembles a rabbit. I am unsure if this experiment was supposed to be a “haha look how stupid AI is” type thing or not, but it seems like the cloud vision api is performing as intended.

I think this shows how poorly many neural networks handle ambiguity.

The picture is constructed to be ambiguous, and this property is preserved by the rotation: you can still easily see the duck by slightly shifting where you focus.

One mode might be more prominent at some orientations, but the ambiguity is always there, so to confidently assign labels and then switch what you assign is an error. You should be constantly switching, as the two classes end up with very similar scores and the noise decides.

Either that, or you decide one label is the most appropriate and then correctly handle the trivial rotation.

The neural network is likely handling it just fine:

A classifier generally outputs a vector of weights, so it’s likely classifying, say [0.8, 0.75] and then the output is selecting the highest and saying “bunny”. Then you rotate it, and the classifier says [0.75, 0.8] and the output says “duck”.

This is completely reasonable on the part of the network: all things being equal, animals generally appear in certain orientations, and we should slightly prefer the interpretation of the ambiguity which respects this alignment. Example: "bill" down, it looks more like a duck because rabbits rarely have their head in that alignment, while "ears up" it looks more like a rabbit since ducks rarely hold their bills that way.

The problem is actually in how we represent probabilistic information to humans, aka "why the weather man is always wrong": it seems like the classifier is randomly flip-flopping when it's actually, perfectly correctly, adjusting its distribution of answers based on the information we give it.

> A classifier generally outputs a vector of weights, so it’s likely classifying, say [0.8, 0.75] and then the output is selecting the highest and saying “bunny”. Then you rotate it, and the classifier says [0.75, 0.8] and the output says “duck”.

The original post actually included the predicted probabilities, which are around 80% for duck or rabbit and 0% for the other class. So the neural network really is overconfident.

It works that way for me when I play it. To see this, it seems to help to look at the picture but have half an eye on the changing NN evaluation. I don't think my perception is simply responding to the changing NN output, as there are times when I disagree with it. My guess is that, by dividing my attention, I reevaluate the picture more frequently.

The interesting part (IMO) isn't that the AI classifies it as both a rabbit and a duck, but that the classification is dependent on the rotation of the picture.

I can somewhat understand how that happens, but I find it an interesting observation (rather than a criticism of the system, though the title is somewhat unclear and had me expecting something else).

But rotation does contain information. 6 and 9 can be considered as a case where rotation SHOULD change the classifier's output.

AI does make dumb predictions from time to time, but in my opinion, this isn't that strong a case. When it rotates upside down, it does look like a rabbit even to me.

The more interesting 'failure' here, to me, is that while the rotation is smooth, the prediction is not; instead it flickers, which does raise some interesting questions about what the model's internal distribution surface looks like.

But only with context right? It is completely ambiguous if the character is 6 or 9 without some other clue, like which way up the paper is, or from the 4 next to it (if you assume that an arbitrary rotation may have occurred). It is just a sign to me that doing some rotations as data augmentation is not good enough. Rotation invariance needs to be built into the architecture of the network (like translation invariance sort of gets in via max pooling). I think it should be giving a 50/50 classification of duck and rabbit at all rotations if it was working as expected.
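One common mitigation for the orientation sensitivity discussed above is test-time augmentation: run the classifier on several rotated copies of the input and average the scores. This is a generic sketch, not how Cloud Vision works; `predict` is a stand-in for running a real model on the image rotated by a given angle, and the toy model below is contrived to be overconfident in opposite directions:

```python
def tta_average(predict, angles):
    """Average class probabilities over rotated copies of an input.

    `predict(angle)` stands in for classifying the image rotated by
    `angle` degrees and returning per-class probabilities.
    """
    n_classes = len(predict(angles[0]))
    sums = [0.0] * n_classes
    for a in angles:
        for i, p in enumerate(predict(a)):
            sums[i] += p
    return [s / len(angles) for s in sums]

# Toy model: confidently "duck" below 180 degrees, "rabbit" above.
def toy_predict(angle):
    return [0.9, 0.1] if angle % 360 < 180 else [0.1, 0.9]

print(tta_average(toy_predict, [0, 90, 180, 270]))  # ~[0.5, 0.5]
```

Averaging recovers the 50/50 ambiguity the parent comment argues for, at the cost of discarding orientation as a cue, which is exactly the trade-off debated in this thread.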

> I think it should be giving a 50/50 classification of duck and rabbit at all rotations if it was working as expected.

This would likely give incorrect or worse results when, eg, classifying ducks and rabbits in wildlife photos — animals come in orientations, and you’ll do a better job classifying them in practice if you respect that.

It’s also not the case that a human would classify it 50/50 in all rotations; it certainly looks more or less like one or the other as you rotate it. Humans are even programmed for that compromise: we see faces in one orientation much better than rotated 180 degrees.

Ideally it would classify it as an ambiguous ink drawing meant to test human psychology

How often do you read upside down?

All the time, especially when I read to my kids.

Your kids might enjoy this cartoon.


every meeting

In the Reddit comment the author noted that the training set is frequently rotated to help the system generalize better (maybe that's why people are fascinated by this animation in the first place). Possible explanations thus include: the Vision API is not well trained in this regard, or it has (correctly) learned that some images are sensitive to rotation.

> AI does make dumb predictions from time to time, but in my opinion, this isn't that strong a case.

I don't think it's meant as an example of a 'dumb prediction'. I think it's an interesting example that inspires (me at least) to think about how this recognition and classification works, and what could cause this effect.

Thank you for this comment. To me, duck v. rabbit on rotation just proves the system works as intended.

Exactly, I also see a duck and a rabbit depending on rotation.

The distinction is present in the training data. So why wouldn't the classifier use the information? A completely rotation-invariant algorithm would most likely be more difficult to create, and would generate worse classifications because orientation is meaningful information.

Since a human would make the same error, I think it's kind of a compliment to how close to human the AI is.

There's no error. The image is simply ambiguous.

Depending on orientation our first interpretation is also either a duck or a rabbit, because our vision is obviously biased to interpret things in the orientation they are most likely to occur based on our priors that have evolved in the presence of "up" and "down" directions. The AI correctly takes the orientation into account because it also matters in its priors, having been fed training data that captures those human priors.

Now, most computer vision algorithms do strive for some degree of rotation (and translation etc) invariance, because a classifier that gets confused when you rotate the input by 15 degrees or whatever isn't very useful in the real world. But complete rotation invariance would just be a case of Artificial Stupidity in an application that attempts to classify like a human would. Input orientation is meaningful information.

It's a seagull, not a duck. Don't confuse the dumb AI even more when it doesn't even know what a duck looks like in the first place. Jeez.

No such thing as a seagull, as my ornithologist friend likes to remind me.

That image is a visual illusion. I find it hard myself to detect that it's a rabbit when its ears are horizontal like a mouth.

Not sure what the purpose of it is; is it to show that even computer vision algorithms can get confused by visual illusions?

I find your response more interesting than the experiment.

You don’t just rotate the image in your mind or focus on specific features to bring out the duckiness or rabbit-ness? I can make it more duck or more rabbit at will.

Is it concerning that there are short, sudden drops in prediction in the middle of a block otherwise solidly classified as rabbit/duck? I don't know much ML, does anyone know why it'd be so discontinuous?

Specifically, those drops are where the top/bottoms of the image are very slightly cropped out.

When making the animation I didn't intend for the occlusion, but the fact that the occlusion causes the prediction to drop to zero is itself an interesting data point. Many objects in real life are occluded.

To me it just reinforces that those algorithms are only comparable to spouting out the intuitive, unconscious first impression of what they see in an image.

Like if something passes through your field of vision and you recognize it as, let's say, a dog. But a second later you feel something was off, turn around, look at it in detail, and determine it to be a weird-looking cat? Yeah, those "AI"s so far seem incapable of such deep thinking.

While the title is clickbaity (as in adversarial examples for fooling neural networks, e.g. by adding a baseball to a whale to make it a shark), I think it shows a nice phenomenon, i.e. a given illusion works similarly for humans and AI alike.

Vide "dirty mind" pictures like posting https://images.baklol.com/13_jpegbd9cb76b39e925881bdb2956fd3... to Clarifai https://clarifai.com/models/nsfw-image-recognition-model-e95... gives 88% for NSFW.

Clickbaity? The title is downright misleading.

I'm a little confused, it seems quite accurate to me. It's the famous duck/rabbit rotation illusion and the google cloud vision API returns different results depending on rotation.

What do you find misleading about it?

It is an example of an adversarial title - one that is sometimes mistaken for clickbait even though it is an accurate summary of the content. This response can be primed by repeated prior training on real clickbait.

I wasn't the one who downvoted you but I agree with IanCal and this blog title didn't trigger my "clickbait" sensor.

I disagree that this is an example of "adversarial attack". The famous duck/rabbit illusion has been around since ~1892[1] and therefore was not deliberately constructed to be an "adversary" to image classification neural networks.

To me, it's an interesting example of feeding a well-known optical illusion to an AI algorithm and observing its behavior.

[1] https://en.wikipedia.org/wiki/Rabbit%E2%80%93duck_illusion

Yes, I agree with you. (No idea who downvoted me, and why, either.)

This (well-known) illusion is NOT an adversarial example. Though, I can explain why, for people working with AI (e.g. me), the title seemed like it was mentioning an adversarial example. There are plenty of examples of "just rotate and a vulture becomes an orangutan" where it does not look like an orangutan to humans.

Vide: "A Rotation and a Translation Suffice: Fooling CNNs with Simple Transformations" https://arxiv.org/pdf/1712.02779.pdf

It would be cool to visualize this as a kind of pie chart, based on where the ears/beak is pointing. Blue for directions where it sees duck, red for rabbit, and empty for neither.

Looks like proof to me, that the classification works correctly.

I wonder whether it would stay consistent if you gave it a solid background line

This seems like a serious concern. What's a possible solution to this problem? Should all orientations be considered valid types? So in this case, should the response be both a duck and a rabbit?

On still (not animated, not rotated) preview I saw rabbit first, then in a second I found it can be a duck also, and now it takes efforts to see rabbit again (but I can do it).

I was ONLY seeing clockwise in all images until the counter-clockwise one went about 8 rotations and all of a sudden I saw it counter-clockwise. Now I can’t unsee it.

When I look at the anticlockwise one I can see it as going either direction. When I look at the clockwise one I can only see it going clockwise

Does Google Cloud like Duck or Rabbit? That’s where the answer lies.

In addition, if Cloud could taste one, it would really help itself with the answer.

I wonder if this was hardcoded/specifically trained to do this for this image?

The image was created specifically to fool humans, it's one of the classics in the optical illusion genre. It would be a shame for ML if it required specific trickery on both ends to reach that outcome.

There is a children's book about this pairing: https://www.amazon.com/Duck-Rabbit-Amy-Krouse-Rosenthal/dp/0...

But does it see a blue or a white dress?

It cannot be a rabbit because there is no nose or mouth.

But how does it smell?


It’s a drawing of a creature that looks a bit like a rabbit or a duck from different angles but is very clearly neither, at best a bad drawing. That’s the failure here - it’s classifying into one of its categories when it shouldn’t be classifying at all.

It's one of those optical illusions, where we (humans) do recognize animals. I don't think it's reasonable to limit classifications to the specific case where a drawing perfectly resembles a certain animal.

The interesting part (IMO) isn't that the AI classifies it as both a rabbit and a duck, but that the classification is dependent on the rotation of the picture.

I don't find it super interesting. The orientation distinction is clearly in the training data, and making the algorithm completely rotation invariant would likely a) be more difficult and b) result in worse classification compared to us humans, who very much use orientation as a cue, having evolved in an environment with distinct "up" and "down" directions.

As the AI has likely not seen anything remotely similar during training, it is quite interesting that it detects the animals. As the picture was set up to confuse humans, does it somewhat show that the representation the AI learned is similar to the one humans have?

The AI has seen ducks and rabbits. It has built a good enough internal model of "duckness" and "rabbitness" that even more abstract renderings of ducks and rabbits activate the relevant parts of the net. I mean, that's exactly what we want them to do! Figure out the aspects of the input data salient to the classification task and ignore the irrelevant parts.

That said, I suspect the response here is filtered and normalized to only include "duck" and "rabbit" classifications; after all, the bird looks much more like a seagull than a duck.

This is the infamous Duck-Rabbit illusion, right? The classifier seems to be doing a good job.

