As for "outperforming humans", a more accurate statement might be, "this algorithm outperforms (for this simplistic task) one experiment done with a limited set of humans on this one particular dataset which has been in the community for 10 years now and is thus highly gameable."
But I realize that's a lot less pithy.
In particular, this dataset is nearing saturation, and whenever that happens, differences in the accuracy numbers often don't mean much. So for example with DeepFace's number at 97.35% and this paper's at 98.52%, you're talking about the difference between getting 159 pairs of faces wrong vs 89 pairs wrong (out of LFW's 6,000 test pairs). In practical terms, as a researcher working with a dataset like this, you very quickly learn to focus on just the ones your algorithm gets wrong, and it's impossible not to subconsciously try to optimize for getting those few cases correct, even if those techniques wouldn't actually help in the general case.
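The arithmetic behind these error counts is worth making explicit; a quick sanity check, assuming LFW's standard benchmark of 6,000 verification pairs:

```python
# LFW's standard verification benchmark has 6,000 face pairs.
PAIRS = 6000

def pairs_wrong(accuracy, n_pairs=PAIRS):
    """Number of pairs misclassified at a given accuracy."""
    return round((1 - accuracy) * n_pairs)

print(pairs_wrong(0.9735))  # DeepFace -> 159 pairs wrong
print(pairs_wrong(0.9753))  # human benchmark -> 148 pairs wrong
print(pairs_wrong(0.9852))  # this paper -> 89 pairs wrong
```

At this scale, a ~1% accuracy difference is a few dozen face pairs, which is why saturation makes the comparisons shaky.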
Face detection/recognition is a broad subject, and the applications are broad.
There are things like HAAR cascades, which are able to pick out a face-like object from everything else (or anything else they've been "trained" to find), but they can't tell faces apart. They can be tuned so that they work in realtime apps (like the autofocus on cameras).
HAAR cascades are limited in a number of ways: they can't tell faces apart, you need a different cascade for different views (profile/portrait/other), and they can also have trouble with skin colour (depending on training).
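The features these cascades are built on are just rectangle-sum differences evaluated in constant time from an integral image. Here's a minimal sketch of that core idea (the feature layout is illustrative, not the exact Viola-Jones feature set, and real detectors cascade thousands of these):

```python
def integral_image(img):
    """Summed-area table: ii[y][x] = sum of img[0..y][0..x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y][x] = row_sum + (ii[y - 1][x] if y > 0 else 0)
    return ii

def rect_sum(ii, x0, y0, x1, y1):
    """Sum of pixels in the inclusive rectangle (x0,y0)-(x1,y1), in O(1)."""
    total = ii[y1][x1]
    if x0 > 0:
        total -= ii[y1][x0 - 1]
    if y0 > 0:
        total -= ii[y0 - 1][x1]
    if x0 > 0 and y0 > 0:
        total += ii[y0 - 1][x0 - 1]
    return total

def haar_two_rect(ii, x, y, w, h):
    """A two-rectangle Haar-like feature: left half minus right half."""
    half = w // 2
    left = rect_sum(ii, x, y, x + half - 1, y + h - 1)
    right = rect_sum(ii, x + half, y, x + w - 1, y + h - 1)
    return left - right

# Toy 4x4 "image": bright left side, dark right side.
img = [[9, 9, 1, 1]] * 4
ii = integral_image(img)
print(haar_two_rect(ii, 0, 0, 4, 4))  # strong edge response: 64
```

The O(1) rectangle sums are what make these cascades fast enough for realtime use; the tradeoff is that such coarse contrast features can detect a face but carry nowhere near enough information to distinguish one face from another.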
More advanced algorithms are able to work out face orientation in realtime (Google Hangouts moustaches and the like), but once again they aren't able to tell two faces apart.
However, there are no accurate realtime (or anywhere near realtime) algorithms that are able to tell faces apart (i.e., put a name to a face in a crowd). In fact, I would go so far as to say that there are no accurate non-realtime ones either.
* Also, Haar is a surname, not an abbreviation; you don't write it in all caps.
Their result is impressive, and it improves a bit over Facebook's recent result on the same dataset with their DeepFace system (97.35% for DeepFace vs 97.53% for people vs 98.52% for the system discussed in the article).
Also, it is interesting that they are not using deep learning for this. They are using a Discriminative Gaussian Process Latent Variable Model.
On that note, Facebook could do even better by restricting the candidate set based on time, precise location, and compass orientation, given that most mobile users have Facebook installed and are running it in their pockets when they get their picture taken by others. (They could do rough recognition purely based on position and orientation without even looking at the camera image, if they really wanted to, so with the camera image it could really be near 100% accurate, and even work if you take a picture of a friend's back.)
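A candidate-restriction step like the one described could look something like the following sketch. Everything here is illustrative: the thresholds, the data layout, and the function names are all made up, and a real system would use far richer signals than a plain haversine distance and a timestamp window.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Approximate great-circle distance between two points, in meters."""
    r = 6371000.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def restrict_candidates(photo, users, max_m=50.0, max_s=120.0):
    """Keep only users whose phone was near the camera around shutter time."""
    return [
        u for u in users
        if abs(u["t"] - photo["t"]) <= max_s
        and haversine_m(u["lat"], u["lon"], photo["lat"], photo["lon"]) <= max_m
    ]

photo = {"t": 1000.0, "lat": 47.6062, "lon": -122.3321}
users = [
    {"name": "near", "t": 1010.0, "lat": 47.6063, "lon": -122.3322},
    {"name": "far", "t": 1010.0, "lat": 47.6200, "lon": -122.3321},
    {"name": "late", "t": 5000.0, "lat": 47.6062, "lon": -122.3321},
]
print([u["name"] for u in restrict_candidates(photo, users)])  # ['near']
```

Shrinking the candidate set this way matters because, as discussed elsewhere in the thread, recognition accuracy degrades as the number of candidate identities grows.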
Computers are given 2 pictures and asked "Are these the same person?"
However, if people are given the same 2 pictures, and one of them is of a well-known actor, then the question becomes "Is the second picture the same person as the many pictures of this actor that I have seen during my life?", which is a rather different task.
If you want a fair comparison, then either you have to use portraits of people that nobody knows, or the computer systems need to use databases of celebrity pictures for that.
At their talk at CVPR, they made clear that the biggest factor in getting their numbers was having tons of training data and a high-capacity learning model. The paper even shows that if you replace their 3d alignment with a standard 2d affine alignment (something people were doing 15 years ago), you lose only 1.5% on LFW (table 2).
>97.53% for people
I'm surprised that people are that good - I think I'd be horrible at it. From googling, I can see that number came from Facebook - would you happen to know where to find the paper?
>Also, it is interesting that they are not using deep learning for this. They are using a Discriminative Gaussian Process Latent Variable Model.
Does using this different model gain some sort of space or processing efficiency over DeepFace?
Also: does anyone know the point of fighting to close this 3% gap? That seems good enough for any purpose. (Is there any other purpose to this technology other than mass surveillance?)
The methods are quite different, but I think at root, it's just more being able to take advantage of much more training data.
The 97.53% "human performance" number comes from my paper from a few years ago. We initially thought that there's no way we would be able to do that well on this task (manually) either, but then we tried it and our scores were much higher (very close to 100%). So the verification task on LFW is actually not that hard, although in part it's because the dataset consists of public figures and therefore you can recognize many of the people outright if you've seen them before. (We tried to measure this in our human studies but found it to be inconclusive.)
3% is not the right way to look at error rates, btw. See  for why. But also, face verification is more of a low-level building block for recognition, because most real-world applications don't directly care about saying whether two faces are the same person or not ("verification"), but "who is this person?" ("recognition"). A simple way to build recognition from verification is to compare a test face against all faces in the database, and then take the one that gets ranked highest. In that (simplistic) scenario, your recognition rate scales (very roughly) as acc^lg(N), where N is the number of people in your database and acc is the verification accuracy. So if, e.g., you had 100,000 people to recognize from and your verification accuracy was 97.53%, you might have a recognition accuracy of ~66% (very roughly!), which is not great.
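That rough scaling rule is easy to play with numerically. This is just the rule-of-thumb stated above, not an actual recognition model:

```python
import math

def rough_recognition_acc(verif_acc, n_people):
    """Rule-of-thumb scaling: recognition accuracy ~ acc ** log2(N)."""
    return verif_acc ** math.log2(n_people)

# Verification accuracies from the thread, with N = 100,000 people.
print(round(rough_recognition_acc(0.9753, 100_000), 2))  # ~0.66
print(round(rough_recognition_acc(0.9852, 100_000), 2))  # ~0.78
```

Note how even the ~1% verification gap between the two systems translates into a double-digit gap in (rule-of-thumb) recognition accuracy at that scale, which is one answer to "why fight over 3%".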
Finally, there are already lots of applications to face recognition other than surveillance, and many more waiting to be discovered.
>So if, e.g., you had 100,000 people to recognize from and your verification accuracy was 97.53%, you'd have a recognition accuracy of ~66%, which is not great.
Out of 100,000 people, I'd assume that there'd be many pairs that wouldn't be distinguishable by their parents, or even by each other. Wouldn't you eventually run into a quantum effect where the differences between people's faces from image-to-image would be larger than the difference between someone's face and everyone in a corpus of X million people? I'd think that a 66% would crush if you had 5 (for example) images of each person you were trying to identify, especially if they were intentionally taken from very different angles or in very different lighting, and you were trying to identify a person that you had 3 images of.
>Finally, there are already lots of applications to face recognition other than surveillance
I'm curious about those.
A computer algorithm operating on single 2d images (like LFW) has none of that. If you were to ask people to do the same task with the same data, they'd probably still do pretty well (much better than computers), but perhaps not perfectly.
The differences between image-to-image (same person) and different people is exactly what makes this problem so tough in the general case. It was shown about 20 years ago now that faces span a fairly low-dimensional manifold, and across different parts of this manifold, faces of different people do look much more similar than faces of the same person.
I probably shouldn't have explicitly written that formula for recognition, since it's not actually the formula, but more a general scaling rule-of-thumb. But even if we take it as given (hypothetically), there would be several issues. First, remember that 97.53% is on this particular dataset, which is very special in many ways (e.g., it was all collected over the course of a year from photos on Yahoo News, of public figures, with relatively lower-resolution images, and often very distinctive backgrounds). On a more realistic dataset, these numbers would be much lower.
Second, having more images of a person does help, but not nearly as much as you'd hope, because now there's an even bigger chance that you might accidentally match against someone else who happened to have their photo taken in the same pose, lighting conditions, and facial expression as your test photo, and it's very tough for algorithms to discount those confounding factors.
As for other applications of face recognition, let me describe one broad area that I think is pretty exciting, rather than a bunch of specific instances. One of the big shifts in user interface design is going to be "personalization" (to varying degrees). If a program or device or robot can recognize who you are, it can pro-actively change its settings/behavior/performance/etc. to better suit you. A very simple example is the new Kinect, which does some sort of recognition to load your saved profile/controller preferences. But that's just the tip of the iceberg.
And then of course there's personal photo collection management. If your photo organizer knew who everyone was in all your photos, it would be immensely useful in a number of ways. For starters, you could easily search for "photos of me and my sister" or "my family", etc. You could also do more advanced analysis to start to get at more subtle things. For example, "my high school debate trip" might not immediately seem like it's related to face recognition, but in fact recognition might get you 90% of the way there.
Finally, if you've made it this far, I might as well plug my recent paper, "Photo Recall", which looks at how to do advanced searches on your photo collection. Our system doesn't currently handle faces, but if you look at the kinds of queries we can do, it should become clear how they might extend with faces.
>Second, having more images of a person does help, but not nearly as much as you'd hope, because now there's an even bigger chance that you might accidentally match against someone else who happened to have their photo taken in the same pose, lighting conditions, and facial expression as your test photo, and it's very tough for algorithms to discount those confounding factors.
I was thinking that having two sets (for example a set of five and a set of three) of known matching images would mitigate that - in that you'd have 15 pairings to check rather than just one.
>If a program or device or robot can recognize who you are, it can pro-actively change its settings/behavior/performance/etc. to better suit you. A very simple example is the new Kinect, which does some sort of recognition to load your saved profile/controller preferences. But that's just the tip of the iceberg.
Personal devices and systems would rarely if ever be expected to distinguish between 1000s of people, though. Are you thinking more in terms of public devices, such as locks on house or car doors? That would seem as if it would be impossible just because out of thousands, someone could look more like you did when you set it up than you currently do.
For example, you might get lucky and 2 of the 3 test photos you have match well with 2 of the 5 database photos of the right person. But if you have enough people in the database, you're almost guaranteed to find some random person where all 3 of the test photos match quite well. So it often comes down to how you combine the results from the different test images. Do you take the max? Do you average them? etc. Each has different tradeoffs, and it's a topic under research at the moment ("image set recognition"), but I'd say there are no compelling results yet.
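The max-vs-average tradeoff shows up even on toy numbers. The similarity scores below are entirely made up to illustrate the failure mode described: one fluky high score lets an impostor win under max pooling, while averaging discounts it.

```python
def pool_scores(pair_scores, how="max"):
    """Combine the scores of all test-vs-database image pairings."""
    return max(pair_scores) if how == "max" else sum(pair_scores) / len(pair_scores)

# 3 test photos x 5 database photos = 15 similarity scores per identity.
right_person = [0.9, 0.85, 0.2, 0.3, 0.8,
                0.88, 0.25, 0.3, 0.82, 0.9,
                0.2, 0.3, 0.85, 0.8, 0.3]
lucky_impostor = [0.95] + [0.1] * 14  # one lucky pose/lighting match

# Max pooling: the single lucky match lets the impostor outrank the right person.
print(pool_scores(lucky_impostor, "max") > pool_scores(right_person, "max"))    # True
# Mean pooling: averaging over all 15 pairings discounts the fluke.
print(pool_scores(right_person, "mean") > pool_scores(lucky_impostor, "mean"))  # True
```

Neither pooling rule is a free win (mean pooling, for instance, gets dragged down by the right person's own bad pairings), which is why combining set-to-set scores is still an open research question.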
Some personal devices might indeed be tuned for relatively fewer people, but these are precisely the cases that are the toughest: distinguishing between people in the same family, who're all going to look much more similar to each other than random people in a large database.
But there are many scenarios where it might actually be distinguishing between 1000s of people. To take but one example, advertising. On the web, advertising is popular not just because it's "easy" and one of the few ways of successfully monetizing a site, but also because knowing more about a person allows for more targeted advertising, which has a much bigger ROI. How would this translate to the "real world"? Personalization is clearly one of the ways to do this, and face recognition is ideal in many respects, given that it works from a distance, is non-invasive, and is cheap to implement. (Obviously, there's a separate debate to be had about privacy and other issues, but I think it's hard to deny that many companies are interested in this.)
I'm not sure that face recognition will be used for security purposes like house/car locks, just because it seems like other devices might get you better security sooner and cheaper (like RFID tags used in some car keys now, or some sort of ID from your phone).
It's my understanding that deep learning is more interesting when you have a large dataset and need to generalize (e.g., recognize faces sideways, upside-down, etc.); otherwise, when the requirements are more predictable, hand-tuned algorithms tend to perform better. The task of finding whether two pictures are of the same person doesn't seem appropriate for DL (compared to, e.g., finding whether a picture contains a person).
Would that be correct?
If you have a ton of labeled data (either for the exact task you're doing, or a related one), DL often wins because it's sufficiently high-capacity to learn sophisticated models from all that data.
On the other hand, if you don't have enough data, then the deep network isn't going to be learnable, in which case hand-tuned methods might do better, depending on how much "domain expertise" you can bake into the system. Faces are a very constrained domain, and since there has been enormous work on them for decades now, it's one area that non-DL algorithms probably still have somewhat of a fighting chance (until people gather & label large enough databases, of course).
The human performance it reports is measured on cropped images shown to humans, which is not a fair comparison.
Over training it is.