
Object-recognition dataset stumped the world’s best computer vision models - dmayo
https://news.mit.edu/2019/object-recognition-dataset-stumped-worlds-best-computer-vision-models-1210
======
etaioinshrdlu
Being able to measure our successes and failures is the first step towards
better algorithms :) This is always exciting. I don't think anyone thought
that vanilla CNN architectures were all that would ever be needed.

This paper, which received an honorable mention at NeurIPS this year, first
attempts to convert the image to a 3D scene before detecting objects.
[https://arxiv.org/pdf/1906.01618.pdf](https://arxiv.org/pdf/1906.01618.pdf)

~~~
rhizome
The thing about AI/CV and other interpretation simulators is that there is
always a quantization of nature in the end result. This is why the Uncanny
Valley exists, "Uncanny Valley" being a term that refers to the difference
between nature and the model itself, and I'm pretty sure it can't be
engineered out of the technology. That is, the simulator cannot picture an
object from _any_ angle, only an angle greater than its mathematical-precision
limits. Analog vs. digital doesn't always have to be about audio. ;)

So, "construct an internal 3D model of [something in the natural world]" will
always be deficient, and any conclusions derived from these models will always
have inherent errors (even before you get to algo bias). Self-driving cars,
airport face-reading gates, Pixar blockbusters...their models can never
represent reality in anything but a temporarily-convincing way. Those that
affect policy and peoples' lives (aka not-entertainment) will always come up
with the wrong conclusion sometimes, sometimes fatally.

[https://en.wikipedia.org/wiki/Uncanny_valley](https://en.wikipedia.org/wiki/Uncanny_valley)

~~~
nl
> _This is why the Uncanny Valley exists, "Uncanny Valley" being a term that
> refers to the difference between nature and the model itself_

That is incorrect. From the Wikipedia page you referenced:

 _as the appearance of a robot is made more human, some observers' emotional
response to the robot becomes increasingly positive and empathetic, until it
reaches a point beyond which the response quickly becomes strong revulsion.
However, as the robot's appearance continues to become less distinguishable
from a human being, the emotional response becomes positive once again and
approaches human-to-human empathy levels_

> _The thing about AI/CV and other interpretation simulators is that there is
> always a quantization of nature in the end result._

That's just not true in any meaningful sense. Computer vision can process
images with a higher resolution than the human eye can distinguish.

> _So, "construct an internal 3D model of [something in the natural world]"
> will always be deficient, and any conclusions derived from these models will
> always have inherent errors_

Humans do this too (hence optical illusions). There's no reason to think that
machine models can't surpass human models (and in some domains they already
do).

All models are wrong, but some are useful.

~~~
rhizome
> _as the robot's appearance continues to become less distinguishable from a
> human being, the emotional response becomes positive once again and
> approaches human-to-human empathy levels_

The gap implied by "approaches" is where the problems occur, and how that gap
is dealt with is where inherent bias in the process(es) fits.

> _Computer vision can process images with a higher resolution than the human
> eye can distinguish._

The problem is not in the resolution, but the "process" part. Also,
"resolution" is not the right word to pivot the situation on, since we'd have
to be comparing a digital system with an analog one.

> _All models are wrong, but some are useful._

Useful I'll give you, but for life and death decisions? It's not going to help
me evade downvotes, but...not so much.

------
Datenstrom
I have been working on this problem for the last year. As it turns out, this
problem is especially prominent when doing object detection in the real world
on a drone platform. Besides a large number of angles/contexts, being able to
move in 3D also adds a large number of scales, and there is hardly any open
training data besides the VisDrone dataset[0], which doesn't even address the
scale issue.

It is certainly an interesting problem though. I can't talk much about my work
but if anyone wants to collaborate on something open source addressing the
core problem check my profile.

[0]:
[http://www.aiskyeye.com/upfile/Vision_Meets_Drones_A_Challen...](http://www.aiskyeye.com/upfile/Vision_Meets_Drones_A_Challenge.pdf)

~~~
cs702
I recently saw a paper (which I reposted here on HN a while back) that perhaps
_could_ help with your problem. The paper proposes an approach that sort of
induces models to learn representations of objects that are good at predicting
"the most agreed-upon" rotations of the inputs. I'm not explaining it well.
Anyway, a model in the paper achieved SOTA on a change-of-viewpoint dataset
with a really tiny number of parameters. Might be worth a look:
[https://arxiv.org/abs/1911.00792](https://arxiv.org/abs/1911.00792)

~~~
Datenstrom
I have been looking for an excuse to use capsule networks. I can't investigate
them as a possible solution at work because even if good results are achieved,
the performance problems[0] would prevent real-world use. I might well look
into that as a side project though. I have also been looking for an excuse to
do something beyond SENet in Julia.

Maybe a new benchmark like this is what we need to get out of the rut.

[0]: delivery.acm.org/10.1145/3330000/3321441/p177-Barham.pdf

------
jobseeker990
It seems like, to be really good the AI needs to construct an internal 3D
model of objects so it won't matter which way it's rotated.

That seems to be how humans work. I can rotate an object in my mind and
picture it from any angle.

~~~
StavrosK
You can, but I don't think that's what happens when looking at stuff. When I
look at the hammer, I don't rotate a hammer to see what orientation matches
what I'm seeing, I see the handle and... metal bit (don't know the term,
sorry), realize those look like they belong to a hammer and go "oh yeah, it's
an upside down hammer".

~~~
tcgv
It took me a while to recognize that wooden chair. At first I thought it was a
wooden hammer on top of some squared background, but then I realized it was a
chair seen from above.

For me, I actually imagine these objects moving/rotating to make sense of them
when seen from unusual angles. That hammer you described, I look at it and
imagine myself flexing it.

~~~
bonoboTP
Another thing that seems overlooked is that we don't just randomly happen to
look at objects from weird angles. If you're looking at a chair from that
weird top-down angle, you probably walked up to it first and expect to see it
like that. And you feel the direction of gravity in your inner ear, so you
feel your viewpoint. The object itself may still be in a random orientation,
but then just moving your head around a bit or turning the object clears up
any confusion.

We do heavily rely on context.

------
2bitencryption
I wonder if the exact same models that failed this test would succeed if their
training data included images of weird angles/unusual contexts?

My guess is every "hammer" image in the training data set was "conventional"
-- a convenient angle and orientation. If half the images of "hammers" were
instead "unconventional", would the model adapt to realize "my existing model
of a hammer is incomplete; there must be a way to consolidate these two
different images"?

Or does this require an internal 3d modeling, and better inputs wouldn't help;
instead the model itself would need to be more advanced?

~~~
bonoboTP
They show in section 4.3 that fine-tuning the last layer of ResNet-152 on half
of ObjectNet (25k images) and testing on the other half increases the top-1
accuracy from 29% to 50%, while the corresponding accuracy on ImageNet is
~67%.

Nevertheless, I agree with you. A huge dataset with millions of unconventional
images may be enough. Who knows.

Things kind of go in cycles in machine learning (similar to other fields).
There is growing dissatisfaction nowadays with having to use so much (labeled)
data, and people want the models to be better "primed" to capture the
variations and structures existing in the real world. Partially this is
because labeling a lot of data is just very expensive, but partially it's also
seen as inelegant and "black-boxy", or it's just not to their scientific
taste.

Other people argue that learning it all from data is fine and that this kind
of robustness shouldn't have to be baked into models. Rather, it should/could
be learned from vast amounts of unlabeled data with unsupervised or
self-supervised methods (Yann LeCun seems to be in this group).

------
h54eaqh4e
> But designing bigger versions of ObjectNet, with its added viewing angles
> and orientations, won’t necessarily lead to better results, the researchers
> warn.

Curious if there's supporting reasoning for this type of statement? Imho most
of these "objects" should still be learnable with vanilla CNNs if you had
sufficient data, especially more angles. Starving a vision network of data is
an interesting problem, but I don't think it can be used as a blanket
statement for all state of the art techniques. And if I'm allowed to make a
naive comparison to human intelligence, I don't think lack of viewing angles
is a factor.

------
aidenn0
I bet if you gave a human a short amount of time to identify these images,
they'd make some mistakes too. Particularly the middle top one in the article.

~~~
zeta0134
... it's a _chair._ Wow that took a minute. Some of these are devilishly
tricky; clearly they're designed to hit all the difficult edge cases within
the domain. What a fun dataset!

~~~
megablast
Spoiler alert. Thanks for ruining the article.

------
nl
This looks like a nice dataset, but as someone who works in an adjacent field
(ML on text) it doesn't seem as revolutionary as it is being presented.

When a model is trained on ImageNet the training dataset is (usually) enlarged
by doing artificial image augmentation. This does things like rotate, skew,
crop and recolor the images so the model understands what the object can look
like.

This dataset appears to find angles of objects that are difficult to reproduce
using this process.

That is useful, but I can think of two ways to solve this pretty easily that
would be achievable and would make a good project for an undergrad or Masters
student.

1) Acquire 3D models of each of the object classes, render them at multiple
angles in Unreal (or similar) and augment the ImageNet dataset with these
images

2) Assuming you want to use the whole ObjectNet dataset as a test set, follow
their dataset construction process using Mechanical Turk, and train on that
data.

I bet either of these processes would take back 20-30% of the 45% performance
drop very easily, and I bet the ones left would be the ones that humans have a
_lot_ of trouble identifying.

~~~
bonoboTP
1) Training with synthetic data is definitely a thing in computer vision,
exactly the way you describe. You can even throw a GAN on top of the results
to make the renderings look less artificial.

2) They do something like this, in section 4.3 by splitting ObjectNet in half.
They fine-tune (the last layer of) ResNet-152 (I wonder what happens if you
fine-tune more layers) on one half and test on the other half. This pushes
results up by about 15 percentage points. There's still a gap, but it can be
plausibly argued that the gap would close if we scaled things up by one or two
orders of magnitude. The question is whether there's a better way.

~~~
nl
> _I wonder what happens if you fine-tune more layers_

Generally it improves some, but most of the gains are in the final layer
retraining.

But it's a lot more data hungry.

> _They do something like this, in section 4.3 by splitting ObjectNet in
> half... There's still a gap, but it can be plausibly argued that the gap
> would close up if we scaled things up by one or two orders of magnitude_

This is interesting. It's worth noting that this training is on only 64 images
per class, and it is unclear if they augment this in any way.

Before retraining, the paper itself notes:

 _Classes such as plunger, safety pin and drill have 60-80% accuracy, while
French press, pitcher, and plate have accuracies under 5%_

It is worth noting that the _plunger, safety pin and drill_ classes are ones
that have multiple orientations already in ImageNet, while _French press,
pitcher, and plate_ are almost all the "right" way up.

To me this indicates this is simply a data problem - the model has never seen
what an upside-down French press looks like so it gets it wrong.

------
ilamont
What are the implications for ML used in real-world situations with no
training set (or a very limited one) that could have life or death
consequences - passenger vehicles, industrial use, military, etc.?

Or is the consensus that it is a matter of time before compute and algorithms
make these situations "safe enough," even for edge cases?

~~~
The_rationalist
It's clearly delusional to think that computer vision will go from soft
computing (erroneous) to hard computing in less than a decade, at the rate of
current incremental improvements. We will soon hit an accuracy wall that only
breakthrough research will allow us to beat. The problem being: there's too much
research exploring the search space in the same direction and not enough
foundational research.

~~~
etaioinshrdlu
I disagree. There is a huge amount of research on adversarially robust
classifiers and detectors going on. One can also programmatically test a
neural network on real data, synthetically damaged data, fully synthetic data,
and adversarial data, and everything in between. You can statistically ensure
you get any desired accuracy level on those tests. While that's not a hard
proof of anything, it can allow you to be very confident in the net's
abilities.
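As a sketch of the "statistically ensure" part: given a held-out test run, you
can put a conservative lower bound on accuracy with a standard Wilson score
interval, using nothing but the stdlib (the numbers below are purely
illustrative):

```python
import math

def accuracy_lower_bound(correct: int, total: int, z: float = 1.96) -> float:
    """Wilson score lower bound on accuracy (z=1.96 ~ 95% confidence)."""
    p = correct / total
    denom = 1 + z * z / total
    center = p + z * z / (2 * total)
    margin = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total))
    return (center - margin) / denom

# e.g. a classifier that got 9,900 of 10,000 held-out images right:
lb = accuracy_lower_bound(9900, 10000)  # a bit under the raw 99%
```

The same bound applies whether the test images are real, synthetically
damaged, or adversarial; the catch is that the guarantee is only relative to
the distribution you sampled the test set from.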

~~~
ElFitz
Well, anyway, it doesn't seem worse than how we test our biological neural
networks before letting them drive huge chunks of metal around.

------
egfx
Again, another source looking at the angle of objects rather than the
refraction patterns. I ran the first IR testing at Google, and the one thing
that tripped up the engine was lighting variations, which this source doesn't
appear to consider.

------
msapaydin
Perhaps the idea of not having a training set means that one should use other
datasets such as ImageNet and use transfer learning.

------
mam2
Nonsense... in most cases a bigger object is prevalent. Chair? No, it's a
fricking bed.

