Object-recognition dataset stumped the world’s best computer vision models (news.mit.edu)
104 points by dmayo on Dec 10, 2019 | 39 comments



Being able to measure our successes and failures is the first step towards better algorithms :) This is always exciting. I don't think anyone thought that vanilla CNN architectures were all that would ever be needed.

This paper, which received an honorable mention at NeurIPS this year, first attempts to convert the image to a 3D scene before detecting objects: https://arxiv.org/pdf/1906.01618.pdf


The thing about AI/CV and other interpretation simulators is that there is always a quantization of nature in the end result. This is why the Uncanny Valley exists, "Uncanny Valley" being a term that refers to the difference between nature and the model itself, and I'm pretty sure it can't be engineered out of the technology. That is, the simulator can not picture an object from any angle, only an angle greater than its mathematical-precision limits. Analog vs. digital doesn't always have to be about audio. ;)

So, "construct an internal 3D model of [something in the natural world]" will always be deficient, and any conclusions derived from these models will always have inherent errors (even before you get to algo bias). Self-driving cars, airport face-reading gates, Pixar blockbusters...their models can never represent reality in anything but a temporarily-convincing way. Those that affect policy and peoples' lives (aka not-entertainment) will always come up with the wrong conclusion sometimes, sometimes fatally.

https://en.wikipedia.org/wiki/Uncanny_valley


> This is why the Uncanny Valley exists, "Uncanny Valley" being a term that refers to the difference between nature and the model itself

That is incorrect. From the Wikipedia page you referenced:

as the appearance of a robot is made more human, some observers' emotional response to the robot becomes increasingly positive and empathetic, until it reaches a point beyond which the response quickly becomes strong revulsion. However, as the robot's appearance continues to become less distinguishable from a human being, the emotional response becomes positive once again and approaches human-to-human empathy levels

> The thing about AI/CV and other interpretation simulators is that there is always a quantization of nature in the end result.

That's just not true in any meaningful sense. Computer vision can process images with a higher resolution than the human eye can distinguish.

> So, "construct an internal 3D model of [something in the natural world]" will always be deficient, and any conclusions derived from these models will always have inherent errors

Humans do this too (hence optical illusions). There's no reason to think that machine models can't surpass human models (and in some domains they already do).

All models are wrong, but some are useful.


>as the robot's appearance continues to become less distinguishable from a human being, the emotional response becomes positive once again and approaches human-to-human empathy levels

The gap implied by "approaches" is where the problems occur, and how that gap is dealt with is where inherent bias in the process(es) fits.

>Computer vision can process images with a higher resolution than the human eye can distinguish.

The problem is not in the resolution, but the "process" part. Also, "resolution" is not the right word to pivot the situation on, since we'd have to be comparing a digital system with an analog one.

>All models are wrong, but some are useful.

Useful I'll give you, but for life and death decisions? It's not going to help me evade downvotes, but...not so much.


Humans do not have only one state, one performance criterion, or one mood. On the other hand, of course sensors can read data that humans cannot, for some set of hardware and inputs. I like the comment above. It is too early for large generalizations.


I have been working on this problem for the last year. As it turns out, this problem is especially prominent when doing object detection in the real world on a drone platform. Besides a large number of angles/contexts, being able to move in 3D also adds a large number of scales, and there is hardly any open training data besides the VisDrone dataset [0], which doesn't even address the scale issue.

It is certainly an interesting problem though. I can't talk much about my work but if anyone wants to collaborate on something open source addressing the core problem check my profile.

[0]: http://www.aiskyeye.com/upfile/Vision_Meets_Drones_A_Challen...


I recently saw a paper (which I reposted here on HN a while back) that perhaps could help with your problem. The paper proposes an approach that sort of induces models to learn representations of objects that are good at predicting "the most agreed-upon" rotations of the inputs. I'm not explaining it well. Anyway, a model in the paper achieved SOTA on a change-of-viewpoint dataset with a really tiny number of parameters. Might be worth a look: https://arxiv.org/abs/1911.00792


I have been looking for an excuse to use capsule networks. I can't investigate it as a possible solution at work because even if good results are achieved the performance problems[0] would prevent real world use. Definitely might look into that as a side project though. I have also been looking for an excuse to do something beyond like SENet in Julia.

Maybe a new benchmark like this is what we need to get out of the rut.

[0]: delivery.acm.org/10.1145/3330000/3321441/p177-Barham.pdf


I'm the author of that paper. Happy to answer questions about it here.


It seems like, to be really good the AI needs to construct an internal 3D model of objects so it won't matter which way it's rotated.

That seems to be how the human mind works. I can rotate an object in my mind and picture it from any angle.


Decades ago, computer vision was primarily thought of as a kind of inverse graphics, the opposite of rendering: image goes in, 3D shapes with material properties and lighting properties go out. Of course 3D reconstruction is still a huge thing, but object recognition split off onto a different path with the realization that "superficial", 2D-based features (like SIFT and HOG) work very well for recognizing image content, when combined with powerful classifiers and regressors of the time (like the SVM). It was common for lecturers to say that "You may think we need complicated internal 3D representations of everything, but another approach seems more fruitful: ..." Nowadays there's a lot of buzz around merging the two branches back together to unify explicit 3D geometric reasoning and 3D modeling with deep learning.
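
For anyone who hasn't seen that era's pipeline, here is a minimal sketch of the "superficial 2D features plus classifier" recipe (HOG descriptors fed to a linear SVM, using scikit-image and scikit-learn). The image and label arrays are placeholders standing in for whatever labeled dataset you have:

    # Sketch: HOG features + linear SVM, the pre-deep-learning recipe.
    # train_images / train_labels / test_images are assumed to exist.
    import numpy as np
    from skimage.feature import hog
    from skimage.transform import resize
    from sklearn.svm import LinearSVC

    def hog_features(images, size=(128, 128)):
        feats = []
        for img in images:  # each img: HxW grayscale array
            img = resize(img, size)
            feats.append(hog(img, orientations=9,
                             pixels_per_cell=(8, 8),
                             cells_per_block=(2, 2)))
        return np.array(feats)

    clf = LinearSVC(C=1.0).fit(hog_features(train_images), train_labels)
    predictions = clf.predict(hog_features(test_images))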

I think humans use both strategies. Sometimes we really rely on superficial visual information, like yellow/black stripes -> time to get away! No need to first perfectly match the visual input to a mental tiger model rotated at the correct orientation. I think split-second recognition is usually like this. Or perhaps we use different strategies for different objects, I could imagine for example that facial recognition in the brain is more 2D-feature based pattern matching, rather than 3D reconstruction.


For contrast, horse brains seem to not do this. They'll often see the same object from a different angle as a new stimulus.


> It seems to be how the human works. I can rotate an object in my mind and picture it from any angle.

You can, but that doesn't seem to me to be how the mind works. If my mind looked at the object from every angle simultaneously, there would be no speed difference in recognising it regardless of its orientation.

But in some cases there is a huge speed difference. I can be staring at something for many seconds and then snap! - oh, it's upside down. As soon as I realise this, my mind immediately adapts and what was unrecognisable is suddenly as plain as day.

The only way I can account for this is that, when I finally twig that the image is upside down, I re-route it through a different path in my brain that does a rotation before feeding it to the recogniser. But normally that path is shut off - it's not constantly scanning the input.

I suppose what happens is that, in most cases, some pre-processor uses other clues in the picture to tell me it's rotated from its normal position and engages the correct path without conscious intervention. That would explain why, most of the time, you don't even notice it happening.

Nonetheless, the two mechanisms are very different. A sequential path that does rotation -> recognition will be slightly deeper and slightly slower than one that does both in a single step, but far smaller. Still, it looks to me like modern designs do attempt to do it in one step, which is to say they attempt to recognise the object in all possible orientations simultaneously.


You can, but I don't think that's what happens when looking at stuff. When I look at the hammer, I don't rotate a hammer to see what orientation matches what I'm seeing, I see the handle and... metal bit (don't know the term, sorry), realize those look like they belong to a hammer and go "oh yeah, it's an upside down hammer".


It took me a while to recognize that wooden chair. At first I thought it was a wooden hammer on top of some squared background, but then I realized it was a chair seen from above.

For me, I actually imagine these objects moving/rotating to make sense of them when seen from unusual angles. That hammer you described, I look at it and imagine myself flexing it.


Another thing that seems overlooked is that we don't just randomly happen to look at objects from weird angles. If you're looking at a chair from that weird top-down angle, probably you walked to it previously and expect to see it like that. And you feel the direction of gravity in your ear, so you feel your viewpoint. The object itself may still be in a random orientation, but then just moving your head around a bit or turning the object clears up any confusion.

We do heavily rely on context.


Same here. I needed almost 10 seconds to recognize the chair.


I think that's the idea behind capsule networks. But I haven't heard anything more about them since they were announced so maybe they don't work too well.


"I can rotate an object in my mind and picture it from any angle"

Except there was a wonderful thread about Aphantasia a while ago https://news.ycombinator.com/item?id=20267445

Several HN readers chimed in to say they have this feature. It would be interesting to know if this dataset would stump them.


The model hasn't ever picked up or held a hammer. How is it ever supposed to recognize a hammer in an unfamiliar context (e.g., when given a new image that is not representative of what it was trained against)? You can't train intuition.


I wonder if the exact same models that failed this test would succeed if their training data included images of weird angles/unusual contexts?

My guess is every "hammer" image in the training data set was "conventional" -- a convenient angle and orientation. If half the images of "hammers" were instead "unconventional", would the model adapt to realize "my existing model of a hammer is incomplete; there must be a way to consolidate these two different images"?

Or does this require internal 3D modeling, so that better inputs wouldn't help and the model itself would need to be more advanced?


They show in section 4.3 that fine-tuning the last layer of ResNet-152 on half of ObjectNet (25k images) and testing on the other half increases the top-1 accuracy from 29% to 50%, while the corresponding accuracy on ImageNet is ~67%.
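
For readers unfamiliar with the setup, a rough sketch of what last-layer fine-tuning looks like in PyTorch (this is not the authors' code; the data loader and class count are placeholders):

    # Sketch: freeze a pretrained ResNet-152 backbone and retrain only the
    # final classification layer, roughly the section 4.3 setup.
    import torch
    import torch.nn as nn
    from torchvision import models

    num_classes = 313  # placeholder: number of target classes
    model = models.resnet152(pretrained=True)
    for param in model.parameters():        # freeze the backbone
        param.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, num_classes)  # new head

    optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3, momentum=0.9)
    criterion = nn.CrossEntropyLoss()

    model.train()
    for images, labels in train_half_loader:  # assumed DataLoader over half the dataset
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()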

Nevertheless, I agree with you. A huge dataset with millions of unconventional images may be enough. Who knows.

Things kind of go in cycles in machine learning (similar to other fields). There is nowadays growing dissatisfaction with having to use so much (labeled) data, and people want the models to be better "primed" to capture the variations and structures existing in the real world. Partially this is because labeling a lot of data is just very expensive, but partially it's also seen as inelegant and "black-boxy", or just not in their scientific taste.

Other people argue that learning it all from data is fine and this kind of robustness shouldn't have to be baked into models; rather, it should/could be learned from vast amounts of unlabeled data with unsupervised/self-supervised methods (Yann LeCun seems to be in this group).


> But designing bigger versions of ObjectNet, with its added viewing angles and orientations, won’t necessarily lead to better results, the researchers warn.

Curious if there's supporting reasoning for this type of statement? Imho most of these "objects" should still be learnable with vanilla CNNs if you had sufficient data, especially more angles. Starving a vision network of data is an interesting problem, but I don't think it can be used as a blanket statement for all state of the art techniques. And if I'm allowed to make a naive comparison to human intelligence, I don't think lack of viewing angles is a factor.


I bet if you gave a human a short amount of time to identify these images, they'd make some mistakes too. Particularly the middle top one in the article.


But a human can take more time and get it correct. Computers cannot simply take more time and be more accurate (I guess AlphaZero can, but that is a different problem entirely).


I agree; it seems possible that the tools currently used for image recognition are analogous to how humans quickly recognize objects, so we might not expect these tools to ever do a good job on images that humans also struggle with.

It would be nice at least if the computer tools could detect that they are confused, and I know there is some research in that direction.
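
One crude baseline for that kind of self-awareness is to abstain when the softmax output has high entropy; a sketch (the threshold is an arbitrary placeholder, and real uncertainty-estimation research goes well beyond this):

    # Sketch: flag "confused" predictions by thresholding softmax entropy.
    import torch
    import torch.nn.functional as F

    def predict_or_abstain(model, images, entropy_threshold=1.0):
        with torch.no_grad():
            probs = F.softmax(model(images), dim=1)
        entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1)
        preds = probs.argmax(dim=1)
        confused = entropy > entropy_threshold  # flag high-entropy inputs for review
        return preds, confused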


... it's a chair. Wow that took a minute. Some of these are devilishly tricky; clearly they're designed to hit all the difficult edge cases within the domain. What a fun dataset!


Spoiler alert. Thanks for ruining the article.


This looks like a nice dataset, but as someone who works in an adjacent field (ML on text), it doesn't seem as revolutionary as it is being presented.

When a model is trained on ImageNet the training dataset is (usually) enlarged by doing artificial image augmentation. This does things like rotate, skew, crop and recolor the images so the model understands what the object can look like.

This dataset appears to find angles of objects that are difficult to reproduce using this process.
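
For context, a typical ImageNet-style augmentation pipeline looks something like this (a sketch using torchvision; exact transforms and parameters vary between training recipes, and none of these produce the extreme viewpoints ObjectNet tests):

    # Sketch of a standard ImageNet-style augmentation pipeline:
    # crop/scale, flip, mild rotation, skew, and color jitter.
    from torchvision import transforms

    train_transform = transforms.Compose([
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.RandomRotation(degrees=15),
        transforms.RandomAffine(degrees=0, shear=10),
        transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])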

That is useful, but I can think of two ways to address this that would be achievable and would make a good project for an undergrad or Master's student.

1) Acquire 3D models of each of the object classes, render them at multiple angles in Unreal (or similar) and augment the ImageNet dataset with these images

2) Assuming you want to use the whole ObjectNet dataset as a test set, follow their dataset construction process using Mechanical Turk, and train on that data.

I bet either of these processes would take back 20-30% of the 45% performance drop very easily, and I bet the ones left would be the ones that humans have a lot of trouble identifying.


1) Training with synthetic data is definitely a thing in computer vision, exactly the way you describe. You can even throw a GAN on top of the results to make the renderings look less artificial.

2) They do something like this in section 4.3 by splitting ObjectNet in half. They fine-tune (the last layer of) ResNet-152 (I wonder what happens if you fine-tune more layers) on half and test on the other half. This pushes results up by about 15 percentage points. There's still a gap, but it can be plausibly argued that the gap would close up if we scaled things up by one or two orders of magnitude. The question is whether there's a better way.


> I wonder what happens if you fine-tune more layers

Generally it improves some, but most of the gains are in the final layer retraining.

But it's a lot more data hungry.

> They do something like this in section 4.3 by splitting ObjectNet in half... There's still a gap, but it can be plausibly argued that the gap would close up if we scaled things up by one or two orders of magnitude

This is interesting. It's worth noting that this training is on only 64 images per class, and it is unclear if they augment this in any way.

Before retraining, the paper itself notes:

Classes such as plunger, safety pin and drill have 60-80% accuracy, while French press, pitcher, and plate have accuracies under 5%

It is worth noting that the plunger, safety pin and drill classes are ones that have multiple orientations already in ImageNet, while French press, pitcher, and plate are almost all the "right" way up.

To me this indicates this is simply a data problem - the model has never seen what an upside-down French press looks like so it gets it wrong.


What are the implications for ML used in real-world situations with no training set (or a very limited one) that could have life or death consequences - passenger vehicles, industrial use, military, etc.?

Or is the consensus that it is a matter of time before compute and algorithms make these situations "safe enough," even for edge cases?


It's clearly delusional to think that computer vision will go from soft computing (erroneous) to hard computing in less than a decade, at the rate of current incremental improvements. We will soon hit an accuracy wall that only breakthrough research will allow us to beat. The problem is that there's too much research exploring the search space in the same direction and not enough foundational research.


I disagree. There is a huge amount of research on adversarially robust classifiers and detectors going on. One can also programmatically test a neural network on real data, synthetically damaged data, fully synthetic data, adversarial data, and everything in between. You can statistically ensure you get any desired accuracy level on those tests. While that's not a hard proof of anything, it can allow you to be very confident in the net's abilities.
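
As a toy illustration of that kind of programmatic testing, one could compare accuracy on clean inputs against accuracy on synthetically corrupted copies (the model, data loader, and noise level here are placeholders; a real evaluation would sweep many corruption types and severities):

    # Sketch: measure accuracy on clean inputs vs. Gaussian-noise-corrupted copies.
    import torch

    def accuracy(model, loader, noise_std=0.0):
        model.eval()
        correct, total = 0, 0
        with torch.no_grad():
            for images, labels in loader:
                if noise_std > 0:
                    images = images + noise_std * torch.randn_like(images)
                preds = model(images).argmax(dim=1)
                correct += (preds == labels).sum().item()
                total += labels.numel()
        return correct / total

    clean_acc = accuracy(model, test_loader)
    noisy_acc = accuracy(model, test_loader, noise_std=0.1)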


Well, anyway, it doesn't seem worse than how we test our biological neural networks before letting them drive huge chunks of metal around.


Is there an end here? Just google "optical illusion photos" or something similar.


Again, another source looking at the angle of objects rather than the refraction patterns. I ran the first IR testing at Google, and the one thing that tripped up the engine was lighting variations, which the source I see here didn't consider.


Perhaps the idea of not having a training set means that one should use other datasets such as ImageNet and rely on transfer learning.


Nonsense... in most cases a bigger object is prevalent. Chair? No, it's a fricking bed.



