
People Recognize Objects by Visualizing Their “Skeletons” - headalgorithm
https://www.scientificamerican.com/article/no-bones-about-it-people-recognize-objects-by-visualizing-their-skeletons/
======
xaedes
Imho the most important part of the article:

 _One concern with the study is that the authors generated the objects
specifically from skeletons rather than deriving them from shapes, either
natural or human-made, covered by skin, metal, or other materials that people
encounter in their day-to-day life. “The shapes that they generated are
directly related to the hypothesis they’re testing and the conclusions they’re
drawing,” says James Elder, a professor of human and computer vision at York
University in Toronto. “If we’re interested in how important skeletons are to
shape and object perception, we can’t really answer that question by only
looking at the perception of skeleton-generated shapes. Because obviously in a
world of skeleton-generated shapes, skeletons are probably fairly important
because that’s the way those shapes were made.”_

I looked into the paper first and thought: yeah, well, it's really not
surprising that the skeleton models are most predictive for the kind of
objects they tested. Their skeleton really is all that defines them.

The only thing they tested and proved is: Skeleton models are predictive for
human decision when recognizing objects made just from skeletons with little
flesh and hardly any texture whatsoever.

Nevertheless, I think skeleton models are a good thing for object recognition.

~~~
Swizec
> The only thing they tested and proved is: Skeleton models are predictive for
> human decision when recognizing objects made just from skeletons with little
> flesh and hardly any texture whatsoever.

Isn’t it an important result that humans are able to recognize when an object
is made just from skeletons and optimize recognition to focus solely on the
skeleton? That sounds pretty neat to me

~~~
mkl
Yes, that's neat, but it's very different to and far more limited than the
generality the title and the rest of the article claim.

------
rpmisms
Humans are much better at noise removal than computers. Many people can look
at an object and see what's extraneous to the basic form--what's left is the
skeleton. Computers, so far, don't have the context to do this, and instead
try to recognize objects based on visual patterns, etc.

Perhaps "weighting" models, allowing algorithms to look for centers of gravity
and mechanical behavior would help. Humans exist in a 3d world, but we also
_interact_ with a simplified 3d world.

We don't worry about the plastic bag in the street because we can feel how our
car will respond. It's trivial. There's no "weight" attached to the object.

Weight and balance are incredibly important psychologically (see the
burgeoning popularity of weighted blankets), and that's a thing that's missing
for computers. Having a tangible sense of the world in our minds gives us a
huge leg up when relating to it.

~~~
derf_
_> Computers, so far, don't have the context to do this_

As someone who did their Ph.D. thesis on the statistics of shape using models
based on the medial axis (i.e., a skeleton), I would beg to differ.

Whether these models are as easy to apply (computationally and conceptually)
as the currently in-vogue techniques is another question, but there is nothing
magical here that computers are incapable of.

~~~
rpmisms
Sure, just the skeleton part. How about density, deformation, and
reflectiveness? All at once? We can simulate these, so detecting them is
obviously possible in principle, but we're not there yet.

------
Isamu
Direct link to the paper:
[https://www.nature.com/articles/s41598-019-45268-y](https://www.nature.com/articles/s41598-019-45268-y)

>Here we tested whether skeletal structures provide an important source of
information for object recognition when compared with other models of vision.
Our results showed that a model of skeletal similarity was most predictive of
human object judgments when contrasted with models based on image-statistics
or neural networks, as well as another model of structure based on coarse
spatial relations. Moreover, we found that skeletal structures were a
privileged source of information when compared to other properties thought to
be important for shape perception, such as object contours and component
parts. Thus, our results suggest that not only does the visual system show
sensitivity to the skeletal structure of objects [32, 36, 37], but also that
perception and comparison of object skeletons may be crucial for successful
object recognition.

------
AbrahamParangi
I think it's telling that even young children are exceptionally good at object
recognition, and if you ask them to draw an object, they'll typically give you
a "skeleton" with basically no ability to reconstruct the textural components.

I think the real interesting question is: what is the internal representation
of this skeleton? A graph? A forest of graphs? Some kind of field that's
graph-like?

~~~
gpm
I think young children's drawing ability is more indicative of the type of
tool we are giving them. They only have the ability to draw a fixed-width
line, so how else would you represent a limb?

~~~
dzmien
A skilled artist would have no trouble using the same instrument to produce a
convincing likeness of whatever they were drawing.

~~~
gpm
Absolutely, but that requires advanced fine motor control, understanding of
how the instrument lays down color and what multiple layers of color look like
when on top of each other, and so on.

The naive way to use the instrument is to run it over the area one or a few
times. The simplest way to do that in terms of motor control (e.g. fewest
turns) is to run it up and down the longest axis one or more times. That's
exactly what a child does.

------
jonplackett
This seems very obvious.

Machines are taught from flat images. How can they be expected to create 3D
from this?

Humans learn from binocular vision, and from multiple angles as we move around
an object, making it a lot easier to get an idea of its shape.

My daughter aged 18 months could already recognise abstract signs like the
mother and baby or disabled sign just from knowing the real object. Which must
say something about the way she stored the representations of them.

~~~
The_rationalist
Why not use two cameras for training AI then?

~~~
jonplackett
Because they’re using existing data. You need thousands, maybe millions of
images to train an AI to recognise something well, and only recognise the
right characteristics. No-one has the resources to go take all those photos
themselves.

Anyone know of a visual recognition AI being trained also with depth data?
Would be interested to see what difference it makes.

This relates to something else I noticed that is different about how my
daughter learns. You can show her one photo of a lion, from one angle, and she
will recognise other lions later on, at different angles. I think she must have
seen enough animals already from many angles to have generalised their shape
and then be able to presume the new animal is similar and just see the new
characteristics like a mane. Something very different is happening in human
brains!

~~~
The_rationalist
You are right, and it would be interesting to quantify how much it could
improve AI if datasets were binocular.

------
axilmar
I'd say (from experience) that people do not recognize objects by
visualizing their skeletons, but they recognize objects by a generalization of
their shape.

In case of recognizing other animals, the generalization takes the form of a
'tree' of objects connected via nodes, which is actually what a skeleton does
to a body.

But that does not happen with other objects, i.e. cars. For cars, the
generalization is that of a box with circles at the bottom (for the wheels).

It should also be noted that the details of objects are not really lost, but
are remembered up to a certain degree, which allows us to tell a person with
fat body parts from a person with thin body parts of the same height and
otherwise the same general appearance.

The degree of generalization is also responsible for not being able to
remember a new face that strongly resembles a face we already know, until we
recognize in the new face some special attributes the old face does not have.
In this case, the degree of generalization is such that it does not allow us
to immediately tell the old face from the new one.

I'd say that recognition works in a step-like fashion:

-we first recognize a generic abstraction of the object at hand: if the object is inanimate or not.

-then we recognize in which category of the inanimate or living objects the object under recognition is (for example, is it a human? an animal? etc).

-then we recognize more details; is the person tall, fat or blond? for example.

-then we recall our connections to that person, resulting in choosing a response.

I don't have data to back the above up, it's all from intuition and personal
experience, but that's how I think objects are recognized by brains.
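
Purely as an illustration of the ordering of those four steps (not a claim about how brains implement them), here is a toy coarse-to-fine cascade in Python. Every field name (`moves_on_its_own`, `legs`, `upright`, `height`, `build`, `hair`) and the `known_people` table are hypothetical and invented for this sketch.

```python
def recognize(obj, known_people):
    """Toy coarse-to-fine cascade; all feature names are made up."""
    # Step 1: coarsest split, living or inanimate.
    kingdom = "living" if obj.get("moves_on_its_own") else "inanimate"

    # Step 2: category within that branch.
    if kingdom == "living":
        is_human = obj.get("legs") == 2 and obj.get("upright")
        category = "human" if is_human else "animal"
    else:
        category = "object"

    # Step 3: finer details (tall? fat? blond?).
    details = {k: obj[k] for k in ("height", "build", "hair") if k in obj}

    # Step 4: recall our connection to this individual and pick a response.
    response = "ignore"
    if category == "human":
        response = "introduce yourself"
        for name, traits in known_people.items():
            if traits == details:
                response = f"greet {name}"
                break
    return kingdom, category, details, response


known = {"Alice": {"height": "tall", "build": "thin", "hair": "blond"}}
person = {"moves_on_its_own": True, "legs": 2, "upright": True,
          "height": "tall", "build": "thin", "hair": "blond"}
print(recognize(person, known))
```

The point of the cascade shape is that the expensive, detailed comparison in step 4 only runs once the cheap early steps have narrowed the candidates.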

------
skybrian
Note that the leading image classification algorithms are apparently trained
to recognize texture more than shapes, because that's the easiest way to win
at current benchmarks. But that can be fixed:

[https://arxiv.org/abs/1811.12231](https://arxiv.org/abs/1811.12231)

------
thanatropism
So what are the implications of this for topological data analysis as an
alternative (or complementary) framework to the convolutional approaches in
image analysis?

(I'm being loose with language, but a CNN is not an optimal "hole finder",
while persistent homology is not optimal for telling different kinds of fish
apart.)

~~~
ilaksh
I was looking at something along those lines recently: "TopoResNet: A hybrid
deep learning architecture and its application to skin lesion classification"
on arxiv.

------
jpfed
>people do not evaluate an object like a computer processing pixels, but based
on an imagined internal skeleton

Well, maybe not how computers typically process pixels _nowadays_, but back
in the old days of computer vision one technique for simplifying an image was
skeletonization:
[https://en.wikipedia.org/wiki/Topological_skeleton](https://en.wikipedia.org/wiki/Topological_skeleton)
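
For anyone who hasn't seen skeletonization in practice, here is a minimal Python sketch of one simple variant: compute each shape pixel's city-block distance to the background, then keep the pixels where that distance is a local maximum (a "ridge"). This is only a rough approximation of the medial axis, not the method used in the paper, and the function names are my own.

```python
from collections import deque


def distance_to_background(grid):
    """City-block distance from each cell to the nearest 0 cell (multi-source BFS).

    Assumes the grid contains at least one background (0) cell.
    """
    rows, cols = len(grid), len(grid[0])
    dist = [[None] * cols for _ in range(rows)]
    queue = deque()
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 0:
                dist[r][c] = 0
                queue.append((r, c))
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols and dist[nr][nc] is None:
                dist[nr][nc] = dist[r][c] + 1
                queue.append((nr, nc))
    return dist


def ridge_skeleton(grid):
    """Shape pixels whose distance value is >= that of all 8 neighbours."""
    dist = distance_to_background(grid)
    rows, cols = len(grid), len(grid[0])
    skeleton = set()
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] == 0:
                continue
            neighbours = [
                dist[r + dr][c + dc]
                for dr in (-1, 0, 1)
                for dc in (-1, 0, 1)
                if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols
            ]
            if all(dist[r][c] >= n for n in neighbours):
                skeleton.add((r, c))
    return skeleton


# A 3-pixel-thick horizontal bar surrounded by background.
bar = [
    [0] * 9,
    [0] + [1] * 7 + [0],
    [0] + [1] * 7 + [0],
    [0] + [1] * 7 + [0],
    [0] * 9,
]
print(sorted(ridge_skeleton(bar)))  # the middle row, minus the two end pixels
```

The 21-pixel bar collapses to a 5-pixel centre line, which is the sense in which the skeleton throws away "flesh" and keeps structure.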

~~~
eli_gottlieb
Yeah, now I'm wondering what topological skeletons might have in common with
the abstract simplicial complexes generated by running a persistent homology
algorithm on a point cloud.
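
For context, the simplest case of persistent homology (dimension 0, i.e. connected components) can be sketched in a few lines of Python with union-find: grow a distance threshold, and record the scale at which each component merges into another. The function name is my own and this covers only the 0-dimensional part; real libraries also track loops and voids.

```python
import math
from itertools import combinations


def zero_dim_persistence(points):
    """Scales at which connected components die as the threshold grows.

    Every component is born at scale 0; one component never dies and is
    therefore not listed.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # All pairwise edges, shortest first.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # two components merge: one of them dies
            deaths.append(length)
    return deaths


cloud = [(0, 0), (1, 0), (10, 0)]
print(zero_dim_persistence(cloud))  # [1.0, 9.0]
```

A long-lived component (here the one that survives until scale 9) is the kind of feature that is treated as signal rather than noise.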

------
chacham15
I don't think the study can be used to draw the conclusions that the article
is trying to draw. The study presents new objects which are derived from
skeletons for people to learn and identify. IMO people learn differently in
the short term vs. the long term. Short term, we try to reduce the
dimensionality of the input to things we can hold in working memory. In this
study that would be the skeleton of the object. That doesn't mean that the
pattern holds up for long-term learning (which is mostly how we visually
identify things, because we've seen them many times already). The main reason
I bring this up is because it seems to be in direct contrast with studies
which show the opposite (i.e. that humans do operate like machines in
identifying objects). One such study compared the brain regions which
activated when a person was exposed to visual input and found a consistent
location which was activated by seeing a horizontal or vertical line.

------
ropiwqefjnpoa
"Do humans learn the same way as computers?" Computers learn whichever way
humans program them to...

~~~
electricviolet
OK, rephrase the question to "Does the way we've programmed computers to learn
happen to resemble the way that humans learn?"

~~~
patagurbon
Our models of human learning are crude at best. Some programs attempt to
approximate those models. But it resembles how humans learn the same way a
stuffed animal chicken resembles a T-Rex.

------
senthil_rajasek
This reminded me of a scene from the movie The Omen where the skeleton of
Damien's mother, Maria Scianna, turns out to be that of a jackal.

[https://images.app.goo.gl/bKoVyzrgkty1J4DeA](https://images.app.goo.gl/bKoVyzrgkty1J4DeA)

------
arafa
Reminds me a lot of the ancient idea of Platonic forms:
[https://en.wikipedia.org/wiki/Theory_of_Forms](https://en.wikipedia.org/wiki/Theory_of_Forms)

Obviously not the same thing, but I think it's an interesting association.

------
mbeex
Related problem (ImageNet-trained CNNs are biased towards texture)

[https://openreview.net/forum?id=Bygh9j09KX](https://openreview.net/forum?id=Bygh9j09KX)

------
ilaksh
This reminds me loosely of a paper I found on arxiv recently: "Scene
Representation Networks: Continuous 3D-Structure-Aware Neural Scene
Representations"

------
cjjuice
Reminds me of Plato's theory of forms

------
tomekowal
I think that object recognition is hard because humans have much more data
than computers. People see with two eyes which can focus at different
distances, so our brain has 3D data to learn from. We later learn to recognise
the same objects in pictures.

Computers usually start from flat pictures, and that trips the learning
process.

I have zero data to back it up. Just my hunch :D

~~~
sorenjan
So do children who are blind in one eye take longer to learn to recognize objects?

~~~
onemoresoop
No, but they cannot see depth very well because they can't use stereoscopic
vision (triangulation). However, there are other cues that are used to infer
depth, such as covered edges (if one object partially covers another, then it
is closer to you), perspective (if two objects that you know are similar in
size appear different in size, the smaller-looking one is farther away), etc.

A friend of mine cannot see with one eye, and yet he is a painter. One
thing I know he cannot do is drive a car.

~~~
rootusrootus
> One thing I know he cannot do is drive a car

That's specific to your friend, not true in general. Lots of people drive with
only one functional eye. At the visual distances involved, the depth
perception provided by stereoscopic vision doesn't matter much. Especially
with all the relative motion. My dad has been driving successfully for 65
years with only one working eye.
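
A rough back-of-the-envelope check of that geometry, as a Python sketch: the angle between the two eyes' lines of sight to a target shrinks with distance, and so does the change in that angle when the target moves one metre farther away. The 65 mm eye separation is an assumed typical value, and the function names are my own.

```python
import math

EYE_SEPARATION_M = 0.065  # assumption: roughly typical adult eye spacing


def vergence_angle(distance_m, baseline_m=EYE_SEPARATION_M):
    """Angle (radians) between the two eyes' lines of sight to a target."""
    return 2 * math.atan(baseline_m / (2 * distance_m))


def disparity_change(distance_m, step_m=1.0):
    """Change in vergence angle when the target moves step_m farther away."""
    return vergence_angle(distance_m) - vergence_angle(distance_m + step_m)


for d in (1, 5, 20, 50):
    print(f"{d:>3} m: {disparity_change(d):.6f} rad per extra metre")
```

Under these assumptions the stereo signal for one metre of extra depth is more than a thousand times weaker at 50 m than at 1 m, which is consistent with relative motion and size cues doing most of the work at road distances.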

~~~
onemoresoop
To be honest, I don't know whether he is allowed to drive or not; he thinks he
isn't allowed and never pursued it.

~~~
WrtCdEvrydy
There's a cutoff of 20/40 in at least one eye (corrected with class A
restriction)

------
christophclarke
This seems to play very well with some of MIT CSAIL's research in training
robots to be able to manipulate objects they haven't seen before.

[1]
[https://www.youtube.com/watch?v=l9U8X6I1vow](https://www.youtube.com/watch?v=l9U8X6I1vow)
[2] [https://arxiv.org/abs/1903.06684](https://arxiv.org/abs/1903.06684)

TL;DR the objects are grouped into categories which determine the "Key points"
on the objects (similar to this 'skeleton') which the robot knows how to
interact with in order to bring about the intended manipulation.

------
dredmorbius
It also seems that people have a tendency to represent things in drawing as
either bubbles or stick figures. This goes back even to ancient times, such as
the humans in this cave painting (Lascaux, I believe):

[https://anthonyalvaradoanthonyalvarado.files.wordpress.com/2...](https://anthonyalvaradoanthonyalvarado.files.wordpress.com/2013/12/06_hunting-scene-on-the-cave-paintings1.jpg)

Or more recently:

[https://xkcd.com/](https://xkcd.com/)

------
sdegutis
This is because there is a metaphysical reality behind everything and humans
instinctively recognize that even from a young age.

