A human baby learns in a very interesting way: by using an overcomplete basis to sparsely code data. This method has only recently started getting attention in ML.
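To make "overcomplete basis + sparse coding" concrete, here is a minimal numpy sketch (a toy, not a model of anything biological): a dictionary with 4x more atoms than signal dimensions, and greedy matching pursuit to explain a signal with only a few of them. All sizes and the signal itself are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy overcomplete dictionary: 8-dimensional signals, 32 atoms (4x overcomplete).
n_dim, n_atoms = 8, 32
D = rng.standard_normal((n_dim, n_atoms))
D /= np.linalg.norm(D, axis=0)          # unit-norm atoms

# A signal built from just three atoms: the sparse "explanation" we hope to recover.
x = 1.5 * D[:, 3] - 2.0 * D[:, 17] + 0.7 * D[:, 25]

def matching_pursuit(x, D, n_steps):
    """Greedy sparse coding: repeatedly pick the atom that best matches
    the residual and subtract its contribution."""
    residual = x.copy()
    code = np.zeros(D.shape[1])
    for _ in range(n_steps):
        correlations = D.T @ residual
        k = np.argmax(np.abs(correlations))
        code[k] += correlations[k]
        residual = residual - correlations[k] * D[:, k]
    return code

code = matching_pursuit(x, D, n_steps=10)
print("atoms used:", np.nonzero(code)[0])
print("relative error:", np.linalg.norm(x - D @ code) / np.linalg.norm(x))
```

The point of the overcompleteness is that with more atoms than dimensions, most signals admit a representation using only a handful of them, which is the "sparse code".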
A human baby learns from uncleaned raw data, using far less energy and generalizing better than a computer, and fuses large amounts of data without suffering from the curse of dimensionality.
I think it is safe to say that human babies are still ahead. For now.
I agree that CV is currently behind human sight. I just argue that this is due more to deficient input processing than to the classification engine itself. The reason they used 200x200 px images, I think, was that larger images couldn't be analysed in sufficient quantity.
In a way, their first processing step on the real images was a pixelisation filter. If you show a pixelised image to a person, you see how much information is lost. If you make single pixels occupy a significant portion of the view, a person might lose the ability to recognize the image at all. Feeding so little information to a CV system is like trying to teach a nearly blind man to see.
To improve CV, we should focus on finding the best ways of converting full-resolution visual data into something of smaller volume, in such a way that the important features are preserved.
This input data, IMO, should also include time. Thanks to my crappy eyesight, I often recognize people, actions, and objects by relying more on how they move than on how they look to me. Even with sharp eyesight, your vision sometimes just gets stuck and can't recognize what is in the scene you are looking at. You can't understand what you see until something in the scene moves, or you move a bit.
>>To improve CV, we should focus on finding the best ways of converting full-resolution visual data into something of smaller volume, in such a way that the important features are preserved.
You are exactly right! See dictionary learning, random projection, and compressive sensing. As for time, perhaps you are right; I don't know. The question is: would a suitably designed video-trained classifier that preserved temporal features do better at image classification?
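Random projection is the simplest of those three to demonstrate: multiplying by a random Gaussian matrix shrinks the representation dramatically while (by the Johnson-Lindenstrauss lemma) approximately preserving pairwise distances, one reasonable stand-in for "important features". A toy numpy sketch with made-up sizes (20 fake "images" of 10,000 pixels each, reduced 20x):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend each row is a flattened 100x100 grayscale image (10,000 pixels).
n_images, n_pixels, n_reduced = 20, 10_000, 500
images = rng.standard_normal((n_images, n_pixels))

# Random Gaussian projection: a 20x smaller representation.
P = rng.standard_normal((n_pixels, n_reduced)) / np.sqrt(n_reduced)
reduced = images @ P

# Check how well pairwise distances survive the compression.
def pairwise_distances(X):
    diffs = X[:, None, :] - X[None, :, :]
    return np.linalg.norm(diffs, axis=-1)

d_full = pairwise_distances(images)
d_small = pairwise_distances(reduced)
mask = ~np.eye(n_images, dtype=bool)
ratios = d_small[mask] / d_full[mask]
print("distance distortion range:", ratios.min(), ratios.max())
```

The distortions typically stay within a few percent of 1.0 here, even though each vector shrank from 10,000 numbers to 500, and the projection needed no training at all.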
I think it might be the case, as I believe there is a lot more information in movement (or in 3D structure that is made observable thanks to movement) that makes dividing objects into categories a much easier task. There must be many shortcuts that can be exploited thanks to the fact that objects are not completely arbitrary: they are physical, usually solid, and have to obey the relatively simple rules of reality. Once such a classifier is trained on rich data, you can match a newly observed object to one of the classes relying on only a very small amount of information, for example its small-resolution 2D image.
Seeing is, in my opinion, very similar to understanding language, in the sense that the information transferred (the observed image, or the words heard) consists of just small fuzzy fragments. The sender (the speaker, or in the case of vision, the physical world) and the recipient each have a rich model of all the things they can communicate about, and the actual information passed only indicates which parts of the underlying model the recipient should select, and how he should modify them, to get the message.
Building a usable model from only the small fuzzy fragments of information that are passed when recognizing an image or hearing spoken words would be an incredibly hard task, and I don't think any biological brain could do that. I think that absorbing as much real information as possible at the time of training the classifier is absolutely crucial for achieving anything close to what humans or animals can do.
Of the techniques you mentioned, dictionary learning looks the most awesome to me, and the most applicable to CV.
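For a feel of what dictionary learning does, here is a bare-bones numpy sketch of the alternating-minimization idea on toy data. To keep it tiny, it restricts the code to one atom per signal, which makes it essentially spherical k-means rather than a full method like K-SVD; the data sizes and generation are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: each signal is one hidden atom times a random scale (1-sparse).
n_dim, n_atoms, n_samples = 16, 8, 400
true_D = rng.standard_normal((n_dim, n_atoms))
true_D /= np.linalg.norm(true_D, axis=0)
which = rng.integers(0, n_atoms, n_samples)
X = true_D[:, which] * rng.uniform(1.0, 3.0, n_samples)   # shape (16, 400)

# Alternate two steps: (1) code each signal with its best single atom,
# (2) refit each atom to the signals assigned to it (their principal direction).
D = rng.standard_normal((n_dim, n_atoms))
D /= np.linalg.norm(D, axis=0)
for _ in range(25):
    assign = np.argmax(np.abs(D.T @ X), axis=0)           # sparse coding step
    for k in range(n_atoms):                              # dictionary update step
        members = X[:, assign == k]
        if members.size:
            u, _, _ = np.linalg.svd(members, full_matrices=False)
            D[:, k] = u[:, 0]

# Reconstruct each signal from its single best learned atom.
proj = D.T @ X
best = np.argmax(np.abs(proj), axis=0)
recon = D[:, best] * proj[best, np.arange(n_samples)]
rel_err = np.linalg.norm(X - recon) / np.linalg.norm(X)
print("relative reconstruction error:", rel_err)
```

The learned dictionary is fit to the data rather than fixed in advance, which is exactly why it tends to beat generic bases on natural images: the atoms end up looking like the structures the data actually contains.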