A Face Recognition Algorithm That Finally Outperforms Humans (medium.com)



sigh no, it doesn't "outperform humans" on "face recognition". In particular, see my previous comments [1] and [2] for discussion on why this method might be doing well.

As for "outperforming humans", a more accurate statement might be, "this algorithm outperforms (for this simplistic task) one experiment done with a limited set of humans on this one particular dataset which has been in the community for 10 years now and is thus highly gameable."

But I realize that's a lot less pithy.

In particular, this dataset is nearing saturation, and whenever that happens, differences in the accuracy numbers often don't mean much. So for example with Facebook's number at 97.53% and this paper's at 98.52%, you're talking about the difference between getting 148 pairs of faces wrong vs 89 pairs wrong. In practical terms, as a researcher working with a dataset like this, you very quickly learn to focus on just the ones your algorithm gets wrong, and it's impossible to not subconsciously try to optimize for getting those few cases correct, even if those techniques wouldn't actually help in the general case.

[1] https://news.ycombinator.com/item?id=7637866

[2] https://news.ycombinator.com/item?id=7638269




You seem to know something about this - how accurate are the best face detection algorithms on real-world datasets?


Depends on the lighting, camera setup, CPU, available processing time, and finally your intended purpose.

Face detection/recognition is a broad subject, and the applications are broad.

There are things like HAAR cascades, which are able to pick out a face-like object from others (or anything else they've been "trained" to find); however, they can't tell faces apart. They can be tuned so that they can be used in realtime apps (like the autofocus on cameras).

HAAR cascades are limited in a number of ways: they can't tell faces apart, you need a different cascade for different views (profile/portrait/other), and they can also have trouble with skin colour (depending on training).
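For anyone curious, running one of these stock cascades is only a few lines in OpenCV. This is a minimal sketch, not production code: "photo.jpg" is a hypothetical input, and cv2.data.haarcascades is just where the opencv-python package keeps its bundled cascade files.

    import cv2

    # Load the stock frontal-face cascade bundled with OpenCV.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    img = cv2.imread("photo.jpg")  # hypothetical input image
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # detectMultiScale returns face-like regions; it says nothing
    # about WHOSE face each region is.
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:
        cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

It draws a box around every face-like region but has no notion of identity, which is exactly the limitation described above.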

More advanced algorithms are able to work out face orientation in realtime (Google Hangout moustaches and the like), but once again they aren't able to tell between two faces.

However, there are no accurate real-time (or anywhere near real-time) algorithms that are able to tell faces apart (i.e. put a name to a face in a crowd). In fact, I would go so far as to say that there are no non-realtime ones either.


Haar* cascades (specifically Viola-Jones, which is what I assume you are talking about) are hardly state-of-the-art; they're over 13 years old now.

* Also, Haar is a surname, not an abbreviation; you don't write it in all-caps.


If you want to see results of state-of-the-art commercial systems, see NIST's recent FRVT 2013 test results:

http://www.nist.gov/itl/iad/ig/frvt-2013.cfm


I just love all the analysis you've done about these face recognition systems. They're so insightful!


This isn't really face recognition, this is face verification. In computer vision, face recognition usually means tell me who this person is (in psychology it means "have you seen this person before?"). Face verification gives an algorithm (or a person) two images of faces and asks, "Are these the same person?"
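To make the distinction concrete, here's a toy sketch in Python. It assumes faces have already been reduced to unit-length feature vectors ("embeddings") and uses a simple dot-product similarity with a made-up threshold; none of this is from the paper, it's just to illustrate the two tasks:

    import numpy as np

    def verify(emb_a, emb_b, threshold=0.6):
        """Verification: given two face embeddings, same person or not?
        (The 0.6 threshold is purely illustrative.)"""
        return float(np.dot(emb_a, emb_b)) > threshold

    def recognize(emb, gallery):
        """Recognition: given one embedding and a gallery of known
        identities (one row per person), return the best-matching index."""
        return int(np.argmax(gallery @ emb))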

Their result is impressive, and it improves a bit over Facebook's recent result on the same dataset with their DeepFace system (97.35% for DeepFace vs 97.53% for people vs 98.52% for the system discussed in the article).

Also, it is interesting that they are not using deep learning for this. They are using a Discriminative Gaussian Process Latent Variable Model.


I'm pretty sure in reality Facebook also uses your social network graph to restrict the candidate set and get higher recognition accuracy. This makes it hard to compare Facebook's results to a pure recognition algorithm.

On that note, Facebook could do even better by restricting the candidate set based on time, precise location, and compass orientation, given that most mobile users have Facebook installed and are running it in their pockets when they get their picture taken by others. (They could do rough recognition purely based on position and orientation without even looking at the camera image, if they really wanted to, so with the camera image it could really be near 100% accurate, and even work if you take a picture of a friend's back.)


DeepFace's figures[1] also come from running it on the LFW dataset[2]. While Facebook's production tech will be more accurate because of what you say, the Deepface algorithm has a lot of raw power, too!

[1] http://www.cs.toronto.edu/~ranzato/publications/taigman_cvpr...

[2] http://vis-www.cs.umass.edu/lfw/index.html


Sure, but the numbers in the research paper are "pure" (i.e., not using all this additional information). In production, I'm sure they must be using all these additional cues as well.


Ah, I see, you mean the 97.35% is a "pure" algorithmic result. That's pretty impressive.


Actually, for humans this is a face recognition task - from the given examples, it seems that the dataset involves pictures of publicly known people, so it's comparing apples and oranges: humans are given an entirely different task.

Computers are given 2 pictures and asked "Are these the same person?"

However, if people are given the same 2 pictures and one of them is a well-known actor, then the question is "Is the second picture the same person as the many pictures of this actor that I have seen during my life?", which is a rather different task.

If you want a fair comparison, then either you have to use portraits of people that nobody knows, or the computer systems need to use databases of celebrity pictures for that.


I think the breakthrough with Facebook's system was the idea to transform the face before running the training or verification on it. And the new system seems to use the same idea: "The new algorithm works by normalising each face into a 150 x 120 pixel image, by transforming it based on five image landmarks: the position of both eyes, the nose and the two corners of the mouth."
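For illustration, landmark-based alignment amounts to estimating a similarity transform that maps the five detected points onto fixed template positions and warping the image with it. Here's a minimal sketch with OpenCV; the template coordinates are invented for illustration (the article doesn't give the actual target positions), and I'm reading "150 x 120" as height x width:

    import cv2
    import numpy as np

    # Hypothetical canonical positions for (left eye, right eye, nose,
    # left mouth corner, right mouth corner) in a 150x120 crop.
    TEMPLATE = np.float32([[35, 45], [85, 45], [60, 70], [40, 95], [80, 95]])

    def align(img, landmarks):
        # Estimate a similarity (rotation/scale/translation) transform
        # from the 5 detected landmarks to the template positions.
        M, _ = cv2.estimateAffinePartial2D(np.float32(landmarks), TEMPLATE)
        return cv2.warpAffine(img, M, (120, 150))  # dsize is (width, height)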


No. Alignment is done by every face verification method, and even the "new" 3d alignment done here is on par with other alignment methods in practice today.

At their talk at CVPR, they made clear that the biggest factor in getting their numbers was having tons of training data and a high-capacity learning model. The paper even shows that if you replace their 3d alignment with a standard 2d affine alignment (something people were doing 15 years ago), you lose only 1.5% on LFW (table 2).


>Face verification gives an algorithm (or a person) two images of faces and asks, "Are these the same person?"

>97.53% for people

I'm surprised that people are that good - I think I'd be horrible at it. From googling, I can see that number came from Facebook - would you happen to know where to find the paper?

>Also, it is interesting that they are not using deep learning for this. They are using a Discriminative Gaussian Process Latent Variable Model.

Does using this different model gain some sort of space or processing efficiency over DeepFace?

Also: does anyone know the point of fighting to close this 3% gap? That seems good enough for any purpose. (Is there any other purpose to this technology other than mass surveillance?)


Here's the paper: https://www.facebook.com/publications/546316888800776/

The methods are quite different, but I think at root, it's mostly about being able to take advantage of much more training data.

The 97.53% "human performance" number comes from my paper from a few years ago [0]. We initially thought that there's no way we would be able to do that well on this task (manually) either, but then we tried it and our scores were much higher (very close to 100%). So the verification task on LFW is actually not that hard, although in part it's because the dataset consists of public figures and therefore you can recognize many of the people outright if you've seen them before. (We tried to measure this in our human studies but found it to be inconclusive.)

3% is not the right way to look at error rates, btw. See [1] for why. But also, face verification is more of a low-level building block for recognition, because most real-world applications don't directly care about saying whether two faces are the same person or not ("verification"), but "who is this person?" ("recognition"). A simple way to build recognition from verification is to compare a test face against all faces in the database, and then take the one that gets ranked highest. In that (simplistic) scenario, your recognition rate scales (very roughly) as acc^lg(N), where N is the number of people in your database and acc is the verification accuracy. So if, e.g., you had 100,000 people to recognize from and your verification accuracy was 97.53%, you might have a recognition accuracy of ~66% (very roughly!), which is not great.
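(For the curious, that back-of-envelope number is two lines of Python; again, this is a rough scaling heuristic, not an exact formula:)

    import math

    acc = 0.9753                 # LFW verification accuracy (human number)
    N = 100000                   # hypothetical database size
    print(acc ** math.log2(N))   # ~0.66, i.e. roughly 66% recognition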

Finally, there are already lots of applications to face recognition other than surveillance, and many more waiting to be discovered.

[0] http://homes.cs.washington.edu/~neeraj/projects/faceverifica...

[1] https://news.ycombinator.com/item?id=7393438


Found it about 10 secs ahead of your post - was coming back to edit it into my comment :) Thanks.

>So if, e.g., you had 100,000 people to recognize from and your verification accuracy was 97.53%, you'd have a recognition accuracy of ~66%, which is not great.

Out of 100,000 people, I'd assume that there'd be many pairs that wouldn't be distinguishable by their parents, or even by each other. Wouldn't you eventually hit a floor where the differences between images of the same person's face are larger than the differences between someone's face and the closest match in a corpus of X million people? I'd think that 66% would improve dramatically if you had 5 (for example) images of each person you were trying to identify, especially if they were intentionally taken from very different angles or in very different lighting, and you were trying to identify a person that you had 3 images of.

>Finally, there are already lots of applications to face recognition other than surveillance

I'm curious about those.


People are really good at face recognition in real life, because you have not just a static 2-d view of a person completely out of context, but a fully dynamic 3-d view of someone you've probably seen before, and you know the lighting environment you're in, meaning you can easily factor out those effects.

A computer algorithm operating on single 2d images (like LFW) has none of that. If you were to ask people to do the same task with the same data, they'd probably still do pretty well (much better than computers), but perhaps not perfectly.

The differences between image-to-image (same person) and different people is exactly what makes this problem so tough in the general case. It was shown about 20 years ago now that faces span a fairly low-dimensional manifold, and across different parts of this manifold, faces of different people do look much more similar than faces of the same person.

I probably shouldn't have explicitly written that formula for recognition, since it's not actually the formula, but more a general scaling rule-of-thumb. But even if we take it as given (hypothetically), there would be several issues. First, remember that 97.53% is on this particular dataset, which is very special in many ways (e.g., it was all collected over the course of a year from photos on Yahoo News, of public figures, with relatively lower-resolution images, and often very distinctive backgrounds). On a more realistic dataset, these numbers would be much lower.

Second, having more images of a person does help, but not nearly as much as you'd hope, because now there's an even bigger chance that you might accidentally match against someone else who happened to have their photo taken in the same pose, lighting conditions, and facial expression as your test photo, and it's very tough for algorithms to discount those confounding factors.

As for other applications of face recognition, let me describe one broad area that I think is pretty exciting, rather than a bunch of specific instances. One of the big shifts in user interface design is going to be "personalization" (to varying degrees). If a program or device or robot can recognize who you are, it can pro-actively change its settings/behavior/performance/etc. to better suit you. A very simple example is the new Kinect, which does some sort of recognition to load your saved profile/controller preferences. But that's just the tip of the iceberg.

And then of course there's personal photo collection management. If your photo organizer knew who everyone was in all your photos, it would be immensely useful in a number of ways. For starters, you could easily search for "photos of me and my sister" or "my family", etc. You could also do more advanced analysis to start to get at more subtle things. For example, "my high school debate trip" might not immediately seem like it's related to face recognition, but in fact recognition might get you 90% of the way there.

Finally, if you've made it this far, I might as well plug my recent paper, "Photo Recall" [1], which looks at how to do advanced searches on your photo collection. Our system doesn't currently handle faces, but if you look at the kinds of queries we can do, it should become clear how they might extend with faces.

[1] http://homes.cs.washington.edu/~neeraj/projects/photo-recall...


Thanks, that's a lot to think about.

>Second, having more images of a person does help, but not nearly as much as you'd hope, because now there's an even bigger chance that you might accidentally match against someone else who happened to have their photo taken in the same pose, lighting conditions, and facial expression as your test photo, and it's very tough for algorithms to discount those confounding factors.

I was thinking that having two sets (for example a set of five and a set of three) of known matching images would mitigate that - in that you'd have 15 pairings to check rather than just one.

>If a program or device or robot can recognize who you are, it can pro-actively change its settings/behavior/performance/etc. to better suit you. A very simple example is the new Kinect, which does some sort of recognition to load your saved profile/controller preferences. But that's just the tip of the iceberg.

Personal devices and systems would rarely if ever be expected to distinguish between 1000s of people, though. Are you thinking more in terms of public devices, such as locks on house or car doors? That seems as if it would be impossible, just because out of thousands of people, someone could look more like you did when you set it up than you currently do.


The problem with more pairings is that you have a greater chance of identifying the right person because of their additional photos, but that gets offset by the huge number of photos of everyone else in the database that you will now also get lots of "hits" on.

For example, you might get lucky and 2 of the 3 test photos you have match well with 2 of the 5 database photos of the right person. But if you have enough people in the database, you're almost guaranteed to find some random person where all 3 of the test photos match quite well. So it often comes down to how you combine the results from the different test images. Do you take the max? Do you average them? etc. Each has different tradeoffs, and it's a topic under research at the moment ("image set recognition"), but I'd say there are no compelling results yet.
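Here's a tiny numpy sketch of that tradeoff, assuming again that faces have been turned into unit-normalized embeddings (an assumption for illustration; nothing in this thread specifies a feature representation):

    import numpy as np

    rng = np.random.default_rng(0)
    test = rng.normal(size=(3, 128))     # 3 test embeddings (128-d, made up)
    gallery = rng.normal(size=(5, 128))  # 5 gallery shots of one candidate

    # Normalize rows so dot products act like cosine similarities.
    test /= np.linalg.norm(test, axis=1, keepdims=True)
    gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)

    scores = test @ gallery.T            # 3x5 pairwise similarity matrix

    max_score = scores.max()    # optimistic: one lucky pair can dominate
    mean_score = scores.mean()  # conservative: diluted by bad-pose pairs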

Some personal devices might indeed be tuned for relatively fewer people, but these are precisely the cases that are the toughest: distinguishing between people in the same family, who're all going to look much more similar to each other than random people in a large database.

But there are many scenarios where it might actually be distinguishing between 1000s of people. To take but one example, advertising. On the web, advertising is popular not just because it's "easy" and one of the few ways of successfully monetizing a site, but also because knowing more about a person allows for more targeted advertising which has a much bigger ROI. How would this translate to the "real world?" Personalization is clearly one of the ways to do this, and face recognition is ideal in many respects, given that it works from a distance, is non-invasive, and cheap to implement. (Obviously, there's a separate debate to be had about privacy and other issues, but I think it's hard to deny that many companies are interested in this.)

I'm not sure that face recognition will be used for security purposes like house/car locks, just because it seems like other devices might get you better security sooner and cheaper (like RFID tags used in some car keys now, or some sort of ID from your phone).


> Also, it is interesting that they are not using deep learning for this. They are using a Discriminative Gaussian Process Latent Variable Model.

It's my understanding that deep learning is more interesting when you have a large dataset and need to generalize (e.g., recognize faces sideways, upside-down, etc); otherwise, when the requirements are more predictable, the hand-tuned algorithms tend to perform better. The task of finding whether two pictures are of the same person doesn't seem appropriate for DL (compared to, e.g., finding whether a picture contains a person).

Would that be correct?


I'm not sure that's the distinction I'd use. To me, it seems more to be a matter of the kind and amount of training data you have.

If you have a ton of labeled data (either for the exact task you're doing, or a related one), DL often wins because it's sufficiently high-capacity to learn sophisticated models from all that data.

On the other hand, if you don't have enough data, then the deep network isn't going to be learnable, in which case hand-tuned methods might do better, depending on how much "domain expertise" you can bake into the system. Faces are a very constrained domain, and since there has been enormous work on them for decades now, it's one area where non-DL algorithms probably still have somewhat of a fighting chance (until people gather & label large enough databases, of course).


That makes more sense, thank you.


even with Facebook's 97.35% they still confuse me with my toddlers :P


No, it is wrong; the actual human performance on full images is close to 99% [1] (Ctrl+F "human").

The human performance it reports is on cropped images shown to humans, which is not a fair comparison.

[1] http://vis-www.cs.umass.edu/lfw/results.html


No, the 97.53 is more fair. See here: https://news.ycombinator.com/item?id=7638269


NIST's latest FRVT results for commercial face recognition (FRVT 2013) are available here:

http://www.nist.gov/itl/iad/ig/frvt-2013.cfm


> But when the algorithm is faced with images that are entirely different from the training set, it often fails.

Overtraining it is.


I have to say this: articles from medium.com focus more on PR than on the quality of content. Just today I saw two such articles, one on Schrödinger's cat and this one.




