One concern with the study is that the authors generated the objects specifically from skeletons rather than deriving them from shapes, either natural or human-made, covered by skin, metal, or other materials that people encounter in their day-to-day life. “The shapes that they generated are directly related to the hypothesis they’re testing and the conclusions they’re drawing,” says James Elder, a professor of human and computer vision at York University in Toronto. “If we’re interested in how important skeletons are to shape and object perception, we can’t really answer that question by only looking at the perception of skeleton-generated shapes. Because obviously in a world of skeleton-generated shapes, skeletons are probably fairly important because that’s the way those shapes were made.”
I looked into the paper first and thought: yeah, well, it's really not surprising that the skeleton models are most predictive for the kind of objects they tested. Their skeleton really is all that defines them.
The only thing they tested and proved is: Skeleton models are predictive for human decision when recognizing objects made just from skeletons with little flesh and hardly any texture whatsoever.
Nevertheless, I think skeleton models are a good thing for object recognition.
Isn’t it an important result that humans are able to recognize when an object is made just from skeletons and optimize recognition to focus solely on the skeleton? That sounds pretty neat to me
Exactly. One minute reading. Conclusion: junk science.
Perhaps "weighting" models, allowing algorithms to look for centers of gravity and mechanical behavior would help. Humans exist in a 3d world, but we also interact with a simplified 3d world.
We don't worry about the plastic bag in the street because we can feel how our car will respond. It's trivial. There's no "weight" attached to the object.
Weight and balance are incredibly important psychologically (see the burgeoning popularity of weighted blankets), and that's a thing that's missing for computers. Having a tangible sense of the world in our minds gives us a huge leg up when relating to it.
As you say, when a human observes a plastic bag, a vast number of different models and transformations aggregate their predictions in a highly nonlinear fashion:
The bag has a plasticky look, it seems to be flopping around, it is slightly see-through, it produces a certain sound that implies a hollow cavity etc... these primary observations are processed by the sensory neurons, which do a first pass filter to remove noise completely subconsciously. If they don't get enough feedback, perhaps it was just a mirage of a bag - not real - and you do a double take and realize it was just a play of shadows.
But let's assume first pass feedback confirms that it is likely a real sensation. The primary inputs are then confirmed by secondary predictions:
The bag is carried by the wind, implying low density, the sound it makes is common for empty thin plastic materials, it has a matte surface that lets through some amount of light, etc... the subconscious thus makes the conclusion that it is indeed probably made of thin plastic and therefore of low density and low hardness and therefore not a threat in terms of high velocity impact. Your swerve reflex thus doesn't kick in and you drive straight.
But this reasoning requires an in-depth model of the world. It isn't enough to just recognize the shape of a bag, because that could be a myriad of other things. Only by having a model and thus understanding of all these different aspects of reality can one make a prediction as robustly as a human. And that is not a high bar, because humans are not good at predictions, let alone on short timescales. We are prone to biases, sensory errors, local minima from past bad experiences, basically the lot.
This is a great summary of why I think current deep-learning based methods will never lead to 'intelligence' that is good enough to e.g. navigate the real world like humans do. They are all based on learning to recognize patterns to infer which things look the same as whatever was in their training set, but they have no semantic capabilities beyond simple classification.
>> And that is not a high bar, because humans are not good at predictions, let alone on short timescales. We are prone to biases, sensory errors, local minima from past bad experiences, basically the lot.
I don't really follow this observation. I would say the bar to match human reasoning abilities is extremely high, for exactly the reasons you described yourself.
Sorry, I should've phrased it better. I was trying to imply that just matching human reasoning abilities is indeed an undertaking of incomprehensible complexity, _and_yet_ human reasoning is still highly error prone. I believe a system that replaces humans will be under close scrutiny, and just being on par won't be enough.
I'm not a neurologist or cutting edge ML researcher by any measure, but this is my viewpoint as well. The astounding amount of information and internal models, and the astounding complexity of these models in terms of connections and feedback loops (and their plasticity) implies to me that our current pedestrian attempts at AI are nowhere near what is required for GAI, let alone human level GAI.
It seems to me like a lot of hubris to suggest (as I've seen people do) that in just a couple of years we could get there. Currently we have not even a clue how consciousness arises. We have evidence that it is physically possible, but that's it.
The leading enterprise in the area, Google/YouTube, routinely fails to identify objects and sounds in videos.
My prediction is that what we have currently is a local optimum that expands our capabilities a lot, compared to what we had before, but in terms of genuine insight into human level AI, it will prove to be a dead end.
I'd love to be proven wrong though.
A car could have a model of reality whose scope is only encompassed by the context of roads and driving. It is conceivable to me that a car could have an in-depth model of the "driving-world" that would allow it to make multi-sensory, tiered observations and predictions akin to human cognition.
Deep learning is more than just imagenet classification or object detection.
There are many approaches that require more understanding, such as future video prediction, captioning, question answering, reinforcement learning requiring an implicitly learned model of how the environment works beyond mere appearances, image generation, structure extraction, anomaly detection, 3d reasoning, external memory, few/one/zero shot learning, meta-learning, etc etc.
The field is huge and whatever "obvious shortcomings of deep learning" non-specialists come up with after reading popular articles are probably being tackled already in many groups and have several lines of approaches and papers already.
As someone who did their Ph.D thesis on the statistics of shape using models based on the medial axis (i.e., a skeleton), I would beg to differ.
Whether these models are as easy to apply (computationally and conceptually) as the currently in-vogue techniques is another question, but there is nothing magical here that computers are incapable of.
Our eyes are active in that they move freely and can focus at different distances. We also happen to have two of them and our brains have a model for how far apart they are. These two features (active focusing and binocular vision) give us incredible depth perception.
Our brains use this depth information to separate objects from the background, something a machine learning algorithm cannot do if you're just feeding it a training set of a billion labeled photos.
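To make the binocular part concrete, here is a minimal sketch of the standard pinhole stereo relation, depth = focal length × baseline / disparity. It is only an illustration under stated assumptions, not a model of human vision: the function names and the numbers (an 800 px focal length, a 0.065 m baseline roughly the distance between human eyes) are hypothetical.

```python
def depth_from_disparity(focal_length_px, baseline_m, disparity_px):
    """Depth in metres of a point seen by two parallel pinhole cameras.

    focal_length_px: focal length in pixels (assumed equal for both cameras)
    baseline_m:      distance between the two camera centres, in metres
    disparity_px:    horizontal shift of the point between the two images
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_length_px * baseline_m / disparity_px


def disparity_from_depth(focal_length_px, baseline_m, depth_m):
    """Inverse relation: the pixel shift produced by a point at depth_m."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    return focal_length_px * baseline_m / depth_m


# Hypothetical numbers, chosen only for illustration.
FOCAL_PX = 800.0
BASELINE_M = 0.065

for depth in (0.5, 2.0, 50.0):
    shift = disparity_from_depth(FOCAL_PX, BASELINE_M, depth)
    print(f"depth {depth:5.1f} m -> disparity {shift:7.2f} px")
```

Because disparity falls off as 1/depth, nearby objects shift a lot between the two views while distant ones barely move, which is why this cue is strongest at close range.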
At least from my own personal experience, it's very biased too. It seems the more tired we are, the more likely we are to incorrectly recognise immobile objects as people or animals at a glance.
>Here we tested whether skeletal structures provide an important source of information for object recognition when compared with other models of vision. Our results showed that a model of skeletal similarity was most predictive of human object judgments when contrasted with models based on image-statistics or neural networks, as well as another model of structure based on coarse spatial relations. Moreover, we found that skeletal structures were a privileged source of information when compared to other properties thought to be important for shape perception, such as object contours and component parts. Thus, our results suggest that not only does the visual system show sensitivity to the skeletal structure of objects32,36,37, but also that perception and comparison of object skeletons may be crucial for successful object recognition.
I think the real interesting question is: what is the internal representation of this skeleton? A graph? A forest of graphs? Some kind of field that's graph-like?
For example: CNNs, though pretty good at detecting limbs (and miscellaneous other things) have only a very limited ability to encode structural information in this way. An interesting open question in the field is what is the "right way" to encode this sparse, graph-like structural data (hence capsule networks).
The naive way to use the instrument is to run it over the area one or a few times. The simplest way to do that in terms of motor control (e.g. fewest turns) is to run it up and down the longest axis one or more times. That's exactly what a child does.
Machines are taught from flat images. How can they be expected to create 3D from this?
Humans learn from binocular vision, and from multiple angles as we move around an object, making it a lot easier to get an idea of its shape.
My daughter aged 18 months could already recognise abstract signs like the mother and baby or disabled sign just from knowing the real object. Which must say something about the way she stored the representations of them.
Anyone know of a visual recognition AI being trained also with depth data? Would be interested to see what difference it makes.
This relates to something else I noticed about how my daughter learns. You can show her one photo of a lion, from one angle, and she will recognise other lions later on, at different angles. I think she must have seen enough animals already from many angles to have generalised their shape, so she can presume the new animal is similar and just notice the new characteristics, like a mane. Something very different is happening in human brains!
In case of recognizing other animals, the generalization takes the form of a 'tree' of objects connected via nodes, which is actually what a skeleton does to a body.
But that does not happen with other objects, i.e. cars. For cars, the generalization is that of a box with circles at the bottom (for the wheels).
It should also be noted that the details of objects are not really lost. They are remembered up to a certain degree, which allows us to tell a person with fat body parts from a person with thin body parts of the same height and otherwise the same general appearance.
The degree of generalization is also responsible for our not being able to remember a new face that strongly resembles a face we already know, until we recognize some special attributes in the new face that the old face does not have. In this case, the degree of generalization is such that it does not allow us to immediately tell the old face from the new one.
I'd say that recognition works in a step like fashion:
-we first recognize a generic abstraction of the object at hand: if the object is inanimate or not.
-then we recognize in which category of the inanimate or living objects the object under recognition is (for example, is it a human? an animal? etc).
-then we recognize more details; is the person tall, fat or blond? for example.
-then we recall our connections to that person, resulting in choosing a response.
I don't have data to back the above up, it's all from intuition and personal experience, but that's how I think objects are recognized by brains.
(I'm being loose with language, but a CNN is not an optimal "hole finder", while persistent homology is not optimal for telling different kinds of fish apart.)
Well, maybe not how computers typically process pixels nowadays, but back in the old days of computer vision one technique for simplifying an image was skeletonization: https://en.wikipedia.org/wiki/Topological_skeleton
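For anyone who hasn't seen skeletonization in practice, here is a minimal pure-Python sketch using Zhang-Suen thinning, one classic way to reduce a binary shape to a one-pixel-wide skeleton. This is my choice of algorithm for illustration, not necessarily what the paper used. It assumes the image is a list of lists of 0/1 values with a border of zeros; the function names are made up for this example.

```python
def neighbours(img, r, c):
    """The 8 neighbours of (r, c), clockwise starting from north (P2..P9)."""
    return [img[r - 1][c], img[r - 1][c + 1], img[r][c + 1], img[r + 1][c + 1],
            img[r + 1][c], img[r + 1][c - 1], img[r][c - 1], img[r - 1][c - 1]]


def transitions(n):
    """Number of 0 -> 1 steps in the circular sequence P2, P3, ..., P9, P2."""
    return sum(a == 0 and b == 1 for a, b in zip(n, n[1:] + n[:1]))


def zhang_suen(image):
    """Return a thinned copy of a binary image (assumes a zero border)."""
    img = [row[:] for row in image]
    rows, cols = len(img), len(img[0])
    changed = True
    while changed:
        changed = False
        for step in (0, 1):  # the two sub-iterations of the algorithm
            to_clear = []
            for r in range(1, rows - 1):
                for c in range(1, cols - 1):
                    if img[r][c] != 1:
                        continue
                    n = neighbours(img, r, c)
                    p2, _, p4, _, p6, _, p8, _ = n
                    if not 2 <= sum(n) <= 6:
                        continue
                    if transitions(n) != 1:
                        continue
                    if step == 0:
                        removable = p2 * p4 * p6 == 0 and p4 * p6 * p8 == 0
                    else:
                        removable = p2 * p4 * p8 == 0 and p2 * p6 * p8 == 0
                    if removable:
                        to_clear.append((r, c))
            # Clear pixels only after the full pass, so every decision in a
            # sub-iteration is made against the same image.
            for r, c in to_clear:
                img[r][c] = 0
            changed = changed or bool(to_clear)
    return img


# A 3x7 solid bar padded with a one-pixel border of zeros.
bar = [[0] * 9] + [[0] + [1] * 7 + [0] for _ in range(3)] + [[0] * 9]
for row in zhang_suen(bar):
    print("".join("#" if v else "." for v in row))
```

The thick bar collapses to a thin line along its long axis, which is the kind of "stick figure" summary of a shape the thread is talking about.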
If we get good at allowing programs to generate programs that find new ways of learning, are they still behaving in a way that humans programmed, or has it shifted to a law of nature that is fundamentally out of our control?
When it's all said and done, we decide if it was us, or a force beyond humans.
Obviously not the same thing, but I think it's an interesting association.
Computers usually start from flat pictures, and that trips the learning process.
I have zero data to back it up. Just my hunch :D
A friend of mine cannot see with one eye, and yet he is a painter. One thing I know he cannot do is drive a car.
That's specific to your friend, not true in general. Lots of people drive with only one functional eye. At the visual distances involved, the depth perception provided by stereoscopic vision doesn't matter much. Especially with all the relative motion. My dad has been driving successfully for 65 years with only one working eye.
You are allowed to drive a car if you only have one eye, though.
Besides - I don't really think I myself use the perception of depth from my stereo vision as much as I know that the cars on the road are all standard dimensions.
There aren't many objects on the road that have similar proportions but differ in size, and so could easily be mistaken for one another.
You have enough distance information by simply observing the visible width of the car in front of you.
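As a rough sketch of that idea: under a simple pinhole camera model, an object of known real width that appears a given number of pixels wide is at distance = focal length × real width / apparent width. The function name and all numbers below (a 1.8 m wide car, a 1000 px focal length) are assumptions for illustration only.

```python
def distance_from_width(real_width_m, apparent_width_px, focal_length_px):
    """Estimate distance to an object of known width from its size in the image.

    real_width_m:      assumed physical width of the object, in metres
    apparent_width_px: how wide the object appears in the image, in pixels
    focal_length_px:   camera focal length expressed in pixels
    """
    if apparent_width_px <= 0:
        raise ValueError("apparent width must be positive")
    return focal_length_px * real_width_m / apparent_width_px


# Hypothetical values: a typical car is assumed to be about 1.8 m wide.
CAR_WIDTH_M = 1.8
FOCAL_PX = 1000.0

for width_px in (180.0, 90.0, 45.0):
    d = distance_from_width(CAR_WIDTH_M, width_px, FOCAL_PX)
    print(f"car appears {width_px:5.1f} px wide -> roughly {d:5.1f} m away")
```

The estimate is only as good as the assumed real width, which is why it works for cars of fairly standard dimensions and would fail for objects with similar proportions but very different sizes.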
TL;DR the objects are grouped into categories which determine the "Key points" on the objects (similar to this 'skeleton') which the robot knows how to interact with in order to bring about the intended manipulation.
Or more recently: