This is not a rhetorical question by the way, I genuinely don't know the state of the art in this field. If it's indeed possible to do that today I'll be extremely impressed.
That's impressive, but I'll point out that the bird photos in this video are all clean, well-focused close-ups, which are probably easier to process than random pictures.
If you wanted a general algorithm working on non-curated data (like tagging Facebook photos, for instance), I'm sure it would be significantly harder.
Check out the (deliberately blurry) examples in https://arxiv.org/pdf/1703.05393.pdf, where the model can distinguish between blurred, low-resolution pictures of different types of crows.
It's only ~50% accuracy, but the photos are terrible. Much worse than Facebook pics.
OTOH, this is classification into hundreds of classes, not millions as in the case of FB face recognition (although FB can of course use the connectivity graph as a filter too). That said, ~50% across hundreds of classes is far above the sub-1% chance baseline.
This is all doable today. As an example, check out some bird photos[1] from the Visual Genome[2] project that are similar to your examples. I selected the photos and hosted them on Imgur in hopes we don't kill Visual Genome with traffic ;) The systems to do this today are not highly efficient or flawless, but it can certainly be done.
The research group I am part of, Salesforce Research (formerly MetaMind), has a model that does this "accidentally" - and there's even an example image of a bird[3]! The model is only meant to provide a caption for an image, not to segment the image into its various objects, but it learns to "focus" on the bird as part of describing the image. For those particularly interested, check out the paper "Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning"[5].
Systems made specifically to segment an image into objects would obviously do far better. For an example of that, check out the "CRF as RNN - Semantic Image Segmentation Live Demo"[4]. There are many more systems of this style floating about.
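To be concrete about what "segmenting an image into objects" means at the output end: after all the CRF/CNN machinery, the final step is just a per-pixel class decision. A minimal numpy sketch of that last step, using made-up score maps rather than a real model (the class list and the biased region are both my own invention):

```python
import numpy as np

# Hypothetical per-pixel class scores, e.g. the output of a segmentation
# network: shape (num_classes, height, width).
num_classes, h, w = 3, 4, 4  # pretend classes: 0=background, 1=bird, 2=cat
rng = np.random.default_rng(0)
scores = rng.normal(size=(num_classes, h, w))

# Bias a 2x2 region toward the "bird" class to simulate a detected bird.
scores[1, 1:3, 1:3] += 10.0

# The segmentation mask is simply the argmax over the class axis.
mask = scores.argmax(axis=0)
print(mask)

# Answering the xkcd question then reduces to checking the mask.
print("contains a bird:", bool((mask == 1).any()))
```

Everything hard is hidden inside producing good `scores`, of course; the point is only that "is there a bird in this photo" falls out of the mask for free.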
I think you underestimate the problem, which is not to get an output that says "Bird", but one that says "Specific breed of bird."
Human experts can get enough clues from the bird's shape and the context to do that in the sample photos. I doubt your captioning system can.
This is a good example of a standard problem in ML - underestimating the complexity of the problem domain.
You could argue that your system only needs to do the simpler task to be useful, and that's likely true. But if the goal is to approach human expert levels of classification, it needs to improve by at least a few levels.
I suspect getting it there would run into some interesting performance constraints, and possibly some theoretical issues too.
No, ML is very, very good at identifying breeds. See, for example, https://arxiv.org/pdf/1603.06765.pdf, which gets 88.9% accuracy on the Stanford Dogs dataset and 84.3% on the Caltech Birds dataset.
These results are way better than anything a non-expert human can manage. For example, the model can distinguish between the Rhinoceros Auklet and the Parakeet Auklet.
I'm not sure what expert performance is, but around 94% is where humans top out on most tasks.
A single NN can predict more than one class of object. The full ImageNet dataset spans over 20,000 categories (the ILSVRC competition itself uses a 1,000-class subset).
There's also image segmentation as another poster has pointed to.
In the case of FB face tagging, they'd have to learn an embedding space for faces; when a new image comes in, they'd place it in the embedding space along with all of the person's connections and find the nearest neighbors.
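That nearest-neighbor lookup is simple once you have embeddings. A toy sketch, with random vectors standing in for real face embeddings and cosine similarity as the metric (both assumptions on my part; the names are obviously invented):

```python
import numpy as np

rng = np.random.default_rng(42)
dim = 128  # a common face-embedding dimensionality

# Pretend these are precomputed embeddings of the user's connections.
friends = ["alice", "bob", "carol"]
gallery = {name: rng.normal(size=dim) for name in friends}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# A new photo arrives; its embedding lands near alice's (alice + small noise).
query = gallery["alice"] + 0.1 * rng.normal(size=dim)

# Tag the face with whichever connection is the nearest neighbor.
best = max(friends, key=lambda name: cosine(query, gallery[name]))
print(best)  # alice
```

Restricting the gallery to the person's connections is exactly the "connectivity graph as a filter" idea mentioned upthread: you never search over a million faces, only over a few hundred candidates.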
The problem posed in the xkcd is "check if the photo is of a bird", not "identify the bird in question". Identifying the bird species would probably be harder, because I'd guess very few human experts could reliably do that across a wide spectrum of species without knowing the context of the photo.
Others have pointed out that this problem has not been solved in the general case.
More importantly, the progress made in recent years builds very heavily on work dating back to the early 1990s, so not only is the problem not solved, what has been achieved took a great deal longer than 5 years.
Well, that's because we had to invent the massively parallel GPU in between. Work that would have taken an entire supercomputer cluster in the 90s can now be done on my desktop with 4 high-end GPUs stuck in it.
Now that we have the right hardware, the whole "it's taking decades" issue will go away.
GPGPU was definitely part of the success of recent years, but there was also a lot of experimentation and hard work, e.g. on CNN designs. Lots of trial and error, and that took a lot of time. Fundamental changes in the structure and training of NNs have also helped bring about the step change in success.
> to look like geniuses when they solve ahead of schedule
No, it is because estimating software tasks is difficult, the penalty for underestimating is that people think you are dishonest/flaky, and there isn't anywhere to get an education in how to do it well. The default advice given to junior engineers is therefore: "take your intuition and triple it." I hate that this is the state of the industry. My interactions around estimation over the past 5 years since uni have literally made me feel nauseated and near fainting on multiple occasions. I would love for Joel or Kalzumeus or Uncle Bob or someone else to fix it and produce a good course on how to create estimates.
Agreed, agile seems like the only way, but it does indeed require experienced managers. A lecturer once pointed out that business people always expect some kind of point estimate; they are never satisfied with a distribution or an interval. Personally, I'd say it's even sadder than that: the point estimates are always taken at the extreme values, whichever suits the person wanting the estimate more, never the average value.
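For what it's worth, giving a distribution instead of a point estimate doesn't have to be complicated. The classic three-point (PERT-style) estimate collapses optimistic/likely/pessimistic guesses into a mean and a spread; a minimal sketch (the day counts are made up for illustration):

```python
# Three-point (PERT) estimate: optimistic, most likely, pessimistic, in days.
optimistic, likely, pessimistic = 10.0, 15.0, 40.0

# PERT weights the most-likely value 4x; the spread is (p - o) / 6.
mean = (optimistic + 4 * likely + pessimistic) / 6
std = (pessimistic - optimistic) / 6

print(f"estimate: {mean:.1f} days, +/- {std:.1f} days")
```

Even this crude version makes the asymmetry visible: the mean here lands well above the "most likely" 15 days, which is exactly the information a bare point estimate throws away.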
Of course, all this leads to bad blood between techies and the business side: how long will it take? -> probably about 3 weeks, but it requires a library we haven't used before, so in the worst case even 2 months -> what? so long? get it done in 4 days, this is required next week -> no, that's not really possible -> make it happen -> it happens, and either it sucks when (if) it's delivered, so the deadline gets extended anyway to iron out all the bugs, or it causes lots of problems down the line.
Or when you allow for hilarious false positives/negatives. Sometimes birds are birds, sometimes they are cats and cats are birds, and sometimes they are dogs. Everything is possible with the right training set and machine learning.
Check whether a photo is of a bird.