There are two irresponsible sentences in the article: "In other words, it is not going to be long before machines significantly outperform humans in image recognition tasks." and "Or put another way, It is only a matter of time before your smartphone is better at recognizing the content of your pictures than you are."
The irresponsibility is in seeing the existence of a technique to solve a more complex version of a toy problem (e.g. find the location in this photo that exactly matches pattern x), and inferring that the same technique, given more power, will exhibit super-human behaviour.
It's a reasonable claim that, in the task as described, the technique outperforms humans, but that's about as exciting as pointing out how much quicker the latest supercomputer is at arithmetic than a human. The point is that human object recognition isn't about labelling a scene with nouns, but about somehow instinctively knowing the relevant objects for a situation and, if required, the appropriate situation-specific noun.
I therefore request the final sentence of the article be rewritten as "Or put another way, it may only be a matter of time, funding and motivation before your smartphone (with computing resources equivalent to those of the likes of Google Inc. and University of Tokyo) can ascribe one or more nouns from a set of roughly 1000 to regions of a photo better than you can, given that huge amounts of pre-processing and man hours have been dedicated to turning that exact set of nouns into a training set, that you take the photo in lighting conditions similar to those of the training set, that you don't apply any filters, and that the objects referred to by the appropriate nouns are neither small nor thin.".
The article is just extrapolating the error rate reduction of the last few years. Spam filters have become better than moderators at labeling spam only in the last decade or so.
When computers first started to become faster than mathematicians this was really a breakthrough. The same is happening now with object and speech recognition.
The computer successfully completes a task. That this is not how humans intuitively approach the same tasks is irrelevant to the accomplishment. If the results were only half as good, but the system behaved more like humans, whom would that satisfy?
The state of the art is capable of detecting far more than 1000 objects, does not need labeled data, is robust to changes in light and does not care about the camera used. No preprocessing of the data is needed; features are generated automatically (preprocessing the target labels is a bit silly BTW).
So yes, in the very near future, algorithms will be better security guards than well... security guards.
My point is that extrapolating error rate reduction only applies to this tightly defined task.
You can only make claims about machines being better at "general" pattern recognition once there is progress on the issue that has stopped all Cognitivist General AI projects dead: situational awareness.
Arithmetic operations, spam detection and the task described in the article have a much smaller, and static, problem space than most human activities. You can demonstrably already knock up an automated-barrier style security guard. However, I'd argue that there does not exist an algorithm or appropriately weighted n-layer network that can handle all the ambiguity, countermeasures and ill-defined or contradictory situations that human security guards, or even just their object recognition capabilities, handle largely instinctively.
Do you think that computers are better at chess than humans? If yes, how does this relate to pattern recognition? If not, what makes someone or something better at chess, while still losing against a computer? Is it a beautiful move? Tactics? Irrational sacrifices to cause confusion?
Do you think that a machine's situational awareness cannot reach or surpass the level of a human? If you think it cannot, what is holding the machines back?
Why do you think that instinct works better to create more rational, consistent and correct predictions? Are 100 security guards better than a single security guard at dealing with ambiguities? Do you think an algorithm to detect fights, drug dealers, and pickpockets from street cams cannot exist? What if a NN could detect these cases faster and flag them to a human security guard for action/no-action?
It needed training on 1TB of labeled images in the first place. Arguably it can be used to transfer that knowledge to other tasks with a much smaller number of labeled samples, but that still requires supervision.
Google trained a NN on unlabeled Youtube stills. It was able to detect/group/cluster pics of cats without ever seeing a label. This still needs supervision to teach the NN that whatever name it created for this cluster, we humans call it "cats".
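To make the "cluster first, name the cluster afterwards" idea concrete, here is a toy sketch. It is not Google's method: it swaps the NN for plain k-means on synthetic 2-D points standing in for image features, and every name in it (`kmeans`, `named_examples`, `label`) is made up for illustration. It only shows where the small amount of supervision enters.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(p, centroids):
    """Index of the centroid closest to point p."""
    return min(range(len(centroids)), key=lambda j: dist2(p, centroids[j]))

def kmeans(points, k, iters=20):
    """Plain k-means with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next start: the point farthest from every centroid chosen so far.
        centroids.append(max(points, key=lambda p: min(dist2(p, c) for c in centroids)))
    for _ in range(iters):
        assign = [nearest(p, centroids) for p in points]
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                centroids[j] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids

# Synthetic stand-ins for image features (an assumption): two blobs, no labels.
rng = random.Random(0)
unlabeled = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(50)]
unlabeled += [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(50)]
rng.shuffle(unlabeled)

# Unsupervised step: the grouping is found without any labels.
centroids = kmeans(unlabeled, k=2)

# Supervised step: one human-named example per concept names each cluster.
named_examples = {"cat": (5.0, 5.0), "not cat": (0.0, 0.0)}
cluster_names = {nearest(p, centroids): name for name, p in named_examples.items()}

def label(p):
    """Name a new point after the cluster it falls into."""
    return cluster_names[nearest(p, centroids)]

print(label((4.8, 5.3)), "/", label((0.2, -0.1)))
```

The grouping itself needs no labels; two named examples are enough to attach human words to it, which is the sense in which the supervision requirement shrinks but does not disappear.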
If the error rate gets low enough, a NN could start labeling pics.
Finally, recent work has shown that running a dictionary through an image search engine can yield high quality labeled images automatically.
Aside: Thank you for contributing to sklearn. Really feel like I am standing on the shoulders of giants when I use that library.