
Where we see shapes, AI sees textures - gdubs
https://www.quantamagazine.org/where-we-see-shapes-ai-sees-textures-20190701/
======
modeless
There is a misunderstanding here. It's not the algorithm that emphasizes
textures over shapes, it's the dataset. Texture is a more informative feature
than shape in the datasets used.

"AI" can detect shapes just fine; simply use a different dataset and you'll
get a different result with the same algorithm. I mean, just look at the
classic MNIST: it's all shape, no texture, and neural nets work great.

~~~
cameldrv
There's some truth to this, but it's not the whole story. CNNs have
limitations in terms of what they can represent and what they can learn.
Counting is an example of this. If I create a class of monsters with 10 eyes,
and another that is otherwise identical, but with 11 eyes, these will be
difficult classes to tell apart.

When it comes to "shape", part of the problem is definitional. In the MNIST
example, the network will fairly easily learn features that are nodes, like
where two lines intersect, but it has trouble with paths, like "an unbroken
line that smoothly intersects itself." CNNs have trouble distinguishing things
like a spiral from a set of concentric circles because locally they look the
same.

~~~
petters
It seems to me that the number of eyes problem should be solvable. Just
convolve with an eye detector and threshold appropriately. Then feed
everything to a single neuron in the next layer and threshold at 10.5. No?

The spiral vs. circles I agree with.
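
To make the suggestion concrete, here's a hand-wired NumPy toy (the eye
template and monster image are made up; the point is that an engineer can
wire this up by hand, not that a network would learn it on its own):

```python
import numpy as np

def count_eyes(image, eye_template):
    """Slide a fixed 'eye detector' over the image and count strong matches."""
    th, tw = eye_template.shape
    H, W = image.shape
    response = np.zeros((H - th + 1, W - tw + 1))
    for i in range(response.shape[0]):
        for j in range(response.shape[1]):
            response[i, j] = (image[i:i+th, j:j+tw] * eye_template).sum()
    # Threshold the response map: only a perfectly aligned eye scores high
    # enough, so each detection corresponds to exactly one eye.
    detections = response >= eye_template.sum() - 0.5
    return int(detections.sum())

# Toy "monster": eyes are isolated 2x2 blocks on a zero background.
monster = np.zeros((10, 20))
for r in (1, 6):
    for c in (1, 5, 9, 13, 17):
        monster[r:r+2, c:c+2] = 1.0   # 10 eyes in total

eye = np.ones((2, 2))                 # the hand-made eye detector
n_eyes = count_eyes(monster, eye)

# Feed the count into a single threshold at 10.5, as suggested.
label = "10-eyed" if n_eyes <= 10.5 else "11-eyed"
```

Since detection counts land on integers, thresholding at 10.5 cleanly
separates the 10-eyed class from a hypothetical 11-eyed one.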

~~~
mkl
Solvable with human guidance, yes, but would any existing machine learning
system come up with a successful strategy like this by itself? Most deep dream
style images suggest the networks don't even realise that dogs have two eyes.

------
jcranendonk
My take on this: if we want vision ML to succeed at recognition in the same
way as humans, perhaps we need to pre-process and present visual information
in the same way as the human vision system? As far as I'm aware, we get a lot
of info from our eyes about lines and orientation that assists in recognizing
shapes.

I'm not well-informed about the current state of visual recognition DL,
perhaps someone who is can tell us more about whether that approach makes
sense.

~~~
tveita
When you train a deep convolutional neural network, the first couple of layers
appear to take on this role, detecting simple features like edges and
textures, which the higher layers build upon to see more complex objects.

For example [https://www.researchgate.net/figure/Visualization-of-example...](https://www.researchgate.net/figure/Visualization-of-example-features-of-eight-layers-of-a-deep-convolutional-neural_fig2_279068412),
where you can see (somewhat, if you zoom in) that layer 1 neurons are
interested in very simple features, like strong horizontal edges or
particular gradients.
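
The sort of feature a layer-1 neuron ends up computing can be mimicked with a
fixed kernel. A toy NumPy sketch (a hand-made filter, not a trained network)
of a horizontal-edge detector firing only at a dark-to-bright boundary:

```python
import numpy as np

# A fixed horizontal-edge kernel, similar in spirit to what first-layer
# CNN filters often converge to (this one is hand-made, not learned).
kernel = np.array([[-1.0, -1.0, -1.0],
                   [ 0.0,  0.0,  0.0],
                   [ 1.0,  1.0,  1.0]])

# Toy image: dark upper half, bright lower half -> one horizontal edge.
image = np.zeros((8, 8))
image[4:, :] = 1.0

def conv2d_valid(img, k):
    """Plain 'valid' 2D cross-correlation, written out for clarity."""
    kh, kw = k.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (img[i:i+kh, j:j+kw] * k).sum()
    return out

response = conv2d_valid(image, kernel)
# The response peaks on the rows straddling the dark->bright transition
# and is exactly zero in the uniform regions.
```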

------
CharlesW
I'm a little confused by the assertion of "artificial intelligence's
preference for texture over shape". The article makes it sound like an
intrinsic issue with AI, when to my understanding it's just an outcome of
popular image classification algorithms. Couldn't one create classifiers that
depend more heavily on object shapes?

~~~
swagasaurus-rex
I'd guess animals devote a lot of compute to working out shape from various
visual cues:

Lighting changes indicating edges

Gradients indicating a smooth curve

Texture cues that distinguish matte, smooth, and translucent surfaces, often
in enough detail to identify the material itself.

Reflectivity

Changes in reflectivity when reorienting oneself or the object

Binocular Vision

Parallax effect between frames

A large database matching objects to memories and experiences

Many other techniques for tracking and seeing movement.

Of the above, parallax, binocular vision, and consulting one's memory are
the only ways to determine the _sizes_ of the objects being looked at.

It's why those tiny/big room optical illusions work so easily: neither we
nor CV algorithms can discern the sizes of objects without prior memory of
them or without walking around them.
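
As a sketch of why the parallax/binocular cues matter, the standard
pinhole-stereo relations turn a disparity into a depth, and a depth turns an
object's size in pixels into a physical size (all numbers here are made up
for illustration):

```python
def depth_from_disparity(baseline_m, focal_px, disparity_px):
    """Standard pinhole stereo relation: Z = f * B / d."""
    return focal_px * baseline_m / disparity_px

def physical_size(size_px, focal_px, depth_m):
    """Similar triangles: real size = pixel size * depth / focal length."""
    return size_px * depth_m / focal_px

# Hypothetical camera: 6.5 cm baseline (human-ish eye spacing), f = 1000 px.
depth = depth_from_disparity(0.065, 1000.0, 13.0)  # 5.0 m away
size = physical_size(100.0, 1000.0, depth)         # a 100 px object is 0.5 m
```

A single static 2D image gives you `size_px` but no disparity, which is
exactly the ambiguity the trick rooms exploit.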

------
jpalomaki
Babies use their hands to feel shapes. Maybe the brain works hard to make
the connection between the inputs provided through touch and vision.

Understanding shapes/outlines is useful for manipulating objects and avoiding
obstacles.

Maybe object recognition develops a bit later and builds on top of how the
brain already understands the world.

~~~
scotty79
I think babies also have pretty poor visual acuity. Maybe that's by design,
to help with the initial learning of how to see...

~~~
mkl
It seems they reach healthy adult sharpness by 6 months [1]. By the time they
are sitting and starting to move they are certainly extremely good at finding
tiny bits of fluff etc. to put in their mouths!

[1]
[https://en.wikipedia.org/wiki/Infant_visual_development](https://en.wikipedia.org/wiki/Infant_visual_development)

------
w_t_payne
Depends on the training data.

These networks are correlation-finding machines, and will simply latch onto
the simplest correlation possible that produces the correct results. Only by
explicitly controlling the correlations in the training data can you force the
network to focus on the attributes that you wish.

This is why domain-randomization (such as the approach that NVIDIA is taking)
helps support generalization -- it's a very direct and efficient form of
regularization, removing correlations that do not generalize, forcing the
network to work harder to find correlations that _do_ generalize.

This also has the added benefit of being something that we can tie back to
requirements and physical properties, an important ingredient in making these
systems understood and safe.
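
A toy NumPy version of the idea (a stand-in for the rendered-scene
randomization NVIDIA actually does): re-draw the foreground and background
textures on every sample, so that only the silhouette remains correlated
with the label.

```python
import numpy as np

rng = np.random.default_rng(0)

def randomize_textures(shape_mask):
    """Fill foreground and background with fresh random 'textures'.

    shape_mask: 2D boolean array marking the object's silhouette. Because
    the textures are re-drawn on every call, texture carries no information
    about the label, and shape is the only stable cue left.
    """
    fg = rng.uniform(0.0, 1.0, size=shape_mask.shape)
    bg = rng.uniform(0.0, 1.0, size=shape_mask.shape)
    return np.where(shape_mask, fg, bg)

# A square silhouette; two augmented samples share shape, not texture.
mask = np.zeros((16, 16), dtype=bool)
mask[4:12, 4:12] = True
a, b = randomize_textures(mask), randomize_textures(mask)
```

A network trained on such samples has nothing to latch onto except the
outline, which is the point.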

------
udayrddy
So, a person wearing a shirt with a gray elephant-skin texture gets
classified as an elephant?!

~~~
Sharlin
One of the most archetypal failure modes of current NN-based classifiers is,
for example, happily reporting seeing a spotted feline given a picture of a
couch with a leopard pattern.

~~~
pmontra
Well, if given the task of enumerating what I see in that image, maybe I
would also add the feline label. But I would add a picture_of attribute, if
that's a thing.

~~~
Sharlin
I don’t think today’s classifiers are anywhere near sophisticated enough to
have a concept of something vs. a picture of something, a type of
use–mention distinction, or _ceci n’est pas une pipe_ if you will. The point
was that people think they trained their classifier to recognize leopards,
when in actuality it just learned to recognize the leopard _coat pattern_.

------
jonplackett
Does this have anything to do with AIs always learning from 2D photos, while
we learn from binocular vision, so every image we see contains a lot more
‘shape’ information?

~~~
anewhnaccount2
Not only this, but when we see things, _even when we are still_, our
eyes/head move a small amount. Shape information is preserved across these
multiple perspectives relatively more than texture information.

------
quico
Yann LeCun (Chief AI @ FB) puts this article in perspective:
[https://www.facebook.com/yann.lecun/posts/10156068737942143](https://www.facebook.com/yann.lecun/posts/10156068737942143)

TLDR: The use of texture is inherent to the ImageNet dataset and not to deep
learning / ConvNet. Training on less-textured versions of ImageNet drives the
ConvNet to focus more on shape.
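
As a crude illustration of what "less-textured" means (a toy stand-in, not
the method the actual work uses): block-averaging destroys fine texture
while leaving coarse shape intact.

```python
import numpy as np

def wash_out_texture(img, block=4):
    """Average each block x block patch, suppressing high-frequency texture.

    Coarse shape (anything larger than a block) survives, so a model
    trained on such images has to lean on shape cues.
    """
    h, w = img.shape
    h2, w2 = h - h % block, w - w % block
    img = img[:h2, :w2]
    pooled = img.reshape(h2 // block, block, w2 // block, block).mean(axis=(1, 3))
    return np.repeat(np.repeat(pooled, block, axis=0), block, axis=1)

checker = (np.indices((8, 8)).sum(axis=0) % 2).astype(float)  # fine texture
silhouette = np.zeros((8, 8))
silhouette[:, 4:] = 1.0                                       # coarse shape
# The checkerboard washes out to uniform gray; the silhouette is untouched.
```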

~~~
hooloovoo_zoo
Sure, but just because you can coerce an algorithm into doing something
different does not mean the algorithm itself does not have fundamental
tendencies.

------
stcredzero
I can't wait for the Hollywood version of this idea, _à la_ the infrared
viewer from _Predator_ or the machine's-eye view from _Terminator_.

------
personjerry
This shouldn't be news. Most computer vision NNs are CNNs, whose
convolutions basically translate the image into small "textures". "AI sees
textures", yeah, because that's how we present the data to it!

~~~
hammock
I wonder what would happen if we fed it SVGs instead of JPGs. (Shape files)

~~~
tastroder
At that point it's less a task in the context of image processing and more one
of graph learning / language processing, depending on how you want to
formulate it.

~~~
wongarsu
Human vision certainly includes a stage where we do 3D reconstruction and
adjust our perception based on it. Many optical illusions depend on this
step.

If we want more human-like machine vision, then having image-processing
passes that deal with more abstract data sounds like a great idea.

------
Havoc
I thought that's why edge-detection preprocessing etc. is in use?

Doesn't quite seem like a breakthrough insight to me?

------
ryu2k2
Just a temporary restriction until we start training AI in robots by
interaction with the real world.

