
A Face Recognition Algorithm That Outperforms Humans? - gagzilla
https://medium.com/the-physics-arxiv-blog/2c567adbf7fc
======
apu
First, my earlier comments on a "competing" approach from Facebook may help
give relevant context for how to think about these numbers:
[https://news.ycombinator.com/item?id=7393378](https://news.ycombinator.com/item?id=7393378)

Briefly skimming through this paper, it appears that these numbers are not a
fair comparison, as this paper uses the _unrestricted_ protocol of LFW [1],
whereas the other methods in the ROC curve shown in the paper use the
restricted protocol. As you might imagine, the latter is more restrictive --
specifically in terms of the amount of training data allowed. And as I
mentioned in my previous comment, training data is king in these kinds of
systems -- more is always better.

To go slightly out on a limb, I think the more significant contribution here
is probably not the new theoretical model but the use of lots of different
types of datasets for training. (Significantly more data >> more complicated
models, most of the time.) But I'd have to read the paper much more carefully
to be sure about this.

[1] [http://vis-www.cs.umass.edu/lfw/results.html](http://vis-www.cs.umass.edu/lfw/results.html)

~~~
tormeh
What's the point in limiting yourself to small datasets? It forces you to be
"clever" about preprocessing, because the degrees of freedom in your learning
algorithm must be limited to match the size of the dataset. Being clever like
this is precisely what we're trying to avoid with machine learning algorithms.
It's better to just shove the raw data into a very general algorithm like a
neural network and let the data do the configuration. And to do that you need
_lots_ of data.

There is a trade-off between an algorithm's maximum performance and how much
data it needs to learn from. Restricting ourselves to small amounts of data
means we get algorithms with lower maximum performance than we otherwise
would, and these are then claimed to be better than algorithms which need more
data to generalize well.

~~~
gcr
Also, one _huge_ reason why we might "limit ourselves to small datasets" is
because that's the only way we can compare many algorithms.

Outside training data has a _huge_ influence on algorithm accuracy. For
example, since Facebook has access to bajillions of face images that they
could use to train their classifier, and since they can't share those images
with the rest of the community, it's unclear whether their excellent
performance is because they have gobs of data or whether it's really a better
algorithm. I bet that a simple algorithm like a naive SVM might do leagues
better if we could train it on Facebook's (hidden!) dataset and test on LFW.
It just isn't reproducible---a measuring stick is only meaningful if everyone
uses the same measuring stick.

~~~
tormeh
Aren't small datasets the reason for using SVMs? You are, after all, operating
in a linear space preprocessed with a kernel trick. That's far less expressive
than a neural network.
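
A minimal sketch of the point, using scikit-learn's bundled digits set as a
stand-in for a small face dataset (the hyperparameters are illustrative, not
tuned):

```python
# An SVM with an RBF kernel is still a linear separator, just in the
# kernel's implicit feature space -- non-linear in pixel space.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)  # 1797 8x8 images, 10 classes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

linear = SVC(kernel="linear").fit(X_tr, y_tr)
rbf = SVC(kernel="rbf", gamma=0.001, C=10.0).fit(X_tr, y_tr)

print("linear kernel: %.3f" % linear.score(X_te, y_te))
print("RBF kernel:    %.3f" % rbf.score(X_te, y_te))
```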

~~~
gcr
Well, isn't it a bit awkward when _dataset size_ is what forces you to select
a certain algorithm that you otherwise wouldn't use? "Gee, I would love to
train a neural network, but it's just _so much data_ and I'm on a deadline;
maybe I should just use an SVM and hope for the best..."

That's one of the big surprises behind deep learning these days: it's now
feasible to do things like "train a big neural network on bunches of images"
in a sensible time. It's an optimization thing as much as it is a machine
learning thing, in my opinion.

~~~
tormeh
I didn't mean that training time is the constraint. It's that when training a
more general algorithm (an ANN) your hypothesis space has more dimensions than
when training a more specialized one (an SVM). You therefore need more data to
train an ANN than an SVM. The reason for choosing an SVM is not that your
dataset is too big for an ANN; it's that it's too small for one.
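
A rough sketch of that trade-off, again using scikit-learn's digits set as a
stand-in and an untuned MLP (the exact numbers will vary; only the trend is
the point):

```python
# A kernel SVM tends to hold up on little data; the neural net's larger
# hypothesis space usually needs more samples before it pays off.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for n in (50, 200, 800):
    svm = SVC(kernel="rbf", gamma=0.001).fit(X_tr[:n], y_tr[:n])
    mlp = MLPClassifier(hidden_layer_sizes=(256,), max_iter=2000,
                        random_state=0).fit(X_tr[:n], y_tr[:n])
    print("n=%4d  svm=%.3f  mlp=%.3f"
          % (n, svm.score(X_te, y_te), mlp.score(X_te, y_te)))
```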

------
dang
The original url [1] was blogspam—that is, it was a knock-off (or excerpt) of
some other, more original source. In such cases HN strongly prefers the
original source.

Submitters: blogspam is usually easy to recognize. Please check for that and
post the original instead.

1. [http://news.sciencemag.org/signal-noise/2014/04/face-recogni...](http://news.sciencemag.org/signal-noise/2014/04/face-recognition-algorithm-finally-beats-humans)

~~~
slashcom
If we're going to go that route, linking to the actual paper might be even
more appropriate. Even the blog article is only a second-hand source.

[http://arxiv.org/pdf/1404.3840v1.pdf](http://arxiv.org/pdf/1404.3840v1.pdf)

If you read it carefully, there's a caveat about how this particular dataset
(recognition of /cropped/ pictures of /unfamiliar/ persons) has relatively low
human accuracy (97.5% as opposed to 99.2%), because humans also use other
features. That is, there's more work to be done, and facial recognition isn't
a "solved" problem yet.

All the same, very impressive work. Congratulations to the authors on
achieving such an important milestone.

~~~
dang
It's true that the paper is _the_ original source, and it's certainly ok to
post those. But a user pointed out recently [1] that many HN users might have
time to read a good general-interest article but not the paper itself—and
usually the paper is referenced in the thread for those who want it. So either
is ok, but if you're going to post a general-interest article, try to make it
the most substantive one out there.

[1]
[https://news.ycombinator.com/item?id=7625534](https://news.ycombinator.com/item?id=7625534)

------
ihodes
The actual paper (parts are accessible & interesting):
[http://arxiv.org/pdf/1404.3840v1.pdf](http://arxiv.org/pdf/1404.3840v1.pdf)

~~~
xpda
The paper is really interesting, but a lot of the references have been
omitted.

~~~
vlasev
Really? It seems that the pdf is somewhat broken (24 pages is 12 pages of the
proper paper and the next 12 are from a slightly older version(it seems))

------
netcan
Face recognition is one of those technologies that seems neat at a glance and
mind-bogglingly terrifying on closer inspection. It has the potential to
sci-fi the world overnight, and it could do it tomorrow night. The algorithmic
accuracy and the enormous comparison DBs are already here.

The effects this could have on commerce, advertising, policing, crime,
culture, and a bunch of other things are wide-reaching enough for a sci-fi
thriller.

A camera in cahoots with a till in a supermarket could put a face and a name
on every purchase. If the camera and the till are in cahoots with an
advertising billboard in a shopping mall, you have created an offline version
of conversion tracking.

If the supermarket and billboard company are in cahoots, they can compare
notes and find a billboard location that gets the supermarket's best
customers. If you are seen checking out climbing gear by a camera in cahoots
with Facebook, that store can keep pitching outdoor activity products to you
on Facebook. Hello, offline retargeting.

That's just advertising. Imagine policing. Imagine high school.

~~~
Zigurd
> _It has the potential to sci-fi the world overnight and it could do it
> tomorrow night._

Commerce and government already use face recognition and other machine vision
capabilities like license plate recognition to track you.

Simple, cheap face recognition would just level the playing field: Was the cop
approaching your car written up in /r/bad_cop_no_donut? Is the bureaucrat you
are dealing with known to be obstinate?

~~~
netcan
Sure. I said sci fi, not evil sci fi.

Commerce and government use face recognition a little. At the moment it's at
the "30% of faxes are now sent using the internet" stage.

------
gcr
I'm a computer vision grad student. A few things concern me about this work.
Maybe they're incidental, but I'm not ready to throw my hands up in the air
quite yet.

- Why wasn't this accepted to CVPR/ECCV/one of the well-established computer
vision conferences? I would love to read some of the reviewers' comments about
this work before passing further judgment. (If this really is a CVPR preprint,
or if it actually is peer-reviewed, I'd feel much better about this.)

- Why isn't this work listed on the official curated "LFW Results" page that
Erik Learned-Miller maintains?
[http://vis-www.cs.umass.edu/lfw/results.html](http://vis-www.cs.umass.edu/lfw/results.html)
Is this work so new that Erik hasn't had time to review it yet?

- Human performance on LFW is 99.2%, which is higher than what the authors
think it is. The performance drops to the (claimed) 97% when we only show
humans a tight crop of the face:
[http://www1.cs.columbia.edu/CAVE/publications/pdfs/Kumar_ICC...](http://www1.cs.columbia.edu/CAVE/publications/pdfs/Kumar_ICCV09.pdf)
They discuss this difference in a paragraph in their conclusion, but I
consider it dishonest to use the lower number in the abstract and imply it in
the title. In fact, I consider it misleading to put "Surpassing human
performance" in the title to begin with, but that's another matter :)

- Showing good performance on one dataset (LFW) is certainly not enough to
show that this "outperforms humans" in the general case. Getting a state-of-
the-art result on LFW these days is like squeezing a drop of water out of a
rock; in my opinion, we should turn our attention to harder datasets like GBU
now that these "easier" ones are solved. (For what the LFW verification
numbers actually measure, see the sketch below.)
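
To make the protocol concrete, here is a toy version of the LFW verification
task: given a pair of face images, predict same/different. This raw-pixel-
distance baseline is deliberately weak; it only shows what the 97%/99.2%
numbers are measuring. (fetch_lfw_pairs downloads the dataset on first use.)

```python
import numpy as np
from sklearn.datasets import fetch_lfw_pairs

train = fetch_lfw_pairs(subset="train")  # 2200 pairs; target 1 = same person
test = fetch_lfw_pairs(subset="test")    # 1000 pairs

def distances(pairs):
    # Euclidean distance between the two images of each pair, in pixel space.
    a = pairs[:, 0].reshape(len(pairs), -1)
    b = pairs[:, 1].reshape(len(pairs), -1)
    return np.linalg.norm(a - b, axis=1)

# Pick the threshold that best separates same/different on the train pairs...
d_train = distances(train.pairs)
candidates = np.sort(d_train)
accs = [np.mean((d_train < t) == train.target) for t in candidates]
best_t = candidates[int(np.argmax(accs))]

# ...and report verification accuracy on the held-out test pairs.
d_test = distances(test.pairs)
print("pixel-distance baseline: %.3f"
      % np.mean((d_test < best_t) == test.target))
```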

I'm not terribly familiar with Gaussian processes so I'm not sure whether the
math works out, but it is a pretty uncommon thing to try in this domain.
(Perhaps that's what makes this work interesting, especially since this year
seems to be the "Deep Learning is Eating Everyone's Lunch" year)
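
For anyone else unfamiliar with them, a bare-bones Gaussian process classifier
looks like this in scikit-learn. This is the generic textbook model, not the
paper's specific GP formulation; the digits set and kernel settings are just
illustrative:

```python
from sklearn.datasets import load_digits
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
# GPs scale cubically with sample count, so keep the training set small.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=300, random_state=0)

# A GP puts a distribution over functions; the kernel encodes similarity.
gpc = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=10.0))
gpc.fit(X_tr, y_tr)
print("GP classifier accuracy: %.3f" % gpc.score(X_te, y_te))
```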

I also wish they described what final-stage classifier they use for the
"GaussianFace as Feature Extractor" model. Often, that's the most important
step. It's strange that they didn't compare with POOF, high-dimensional LBP,
Face++'s deep-learned features, or any of the other state-of-the-art feature
extractors, especially considering how much worse "GaussianFace as a binary
classifier" does (93% vs. 97% is a huge difference on this dataset).

Just my two cents. It definitely demands further exploration. I don't see any
obvious mistakes, but I'm not sure why their approach works as well as they
claim it does either.

 _Edit_: I don't mean to start a witch hunt or anything, but if the authors
have the guts to put "Human-level performance" in their title, they're just
_begging_ for the community to inspect every detail and point out the flaws in
every minutia of their work. It's our community's hot button. It's similar to
the old adage about how, if you want a Linux user to help you, you have to
tell them how much Linux sucks. That's where much of my skepticism comes from.
The most astounding papers are often the most humble, and "humble" certainly
doesn't describe this work.

~~~
apu
In some sense, the 97% figure is indeed the fairer number to compare against,
assuming that this paper also restricts the algorithms to seeing only tight
crops of the face.

Obviously using more than just tight crops will give you more information, but
our point in differentiating those cases when measuring human performance was
that the way LFW was constructed gives you MUCH more information when using
loose crops than "normal" images would. For example, many images of actors are
at award shows (the Oscars in particular), and so if you see that kind of
background in a pair of images, you can just say "same" and have a very good
chance of getting it right. That's what the "inverse crop" experiment shows
[1] -- when you block out the face in LFW images, you can still get 94%
accuracy!

In typical photos, however, the background won't give you nearly so much
information.
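
For anyone curious, the masking itself is trivial; something like the sketch
below (the box coordinates are made up for illustration, not the region we
used in the paper):

```python
import numpy as np

def inverse_crop(image, box):
    # Zero out the face region, keeping only the background context.
    top, bottom, left, right = box
    out = image.copy()
    out[top:bottom, left:right] = 0.0
    return out

# e.g. for a 250x250 LFW image, blank a central face-sized box
masked = inverse_crop(np.random.rand(250, 250), (60, 190, 75, 175))
```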

I do feel somewhat bad that our human verification performance experiment
numbers are now being used to create linkbait titles like, "computer
algorithms can beat humans," because that's obviously not true (nor have I
ever believed it), but in my defense, in 2009 we didn't really think about
what the press would do in 2014 when algorithms started saturating on LFW =)

[1]
[http://homes.cs.washington.edu/~neeraj/projects/faceverifica...](http://homes.cs.washington.edu/~neeraj/projects/faceverification/)

------
schrodingersCat
I'll have to run this by my friend who writes morphometrics algorithms, as I
can't actually tell what is new about this paper. This might actually allow
for a proper photo-matching search engine. All the ones that I have tried to
this point have been lacking or broken...

------
dschiptsov
No matter the angle and illumination? Come on.)

~~~
gcr
Believe it or not, LFW does not include a whole lot of rotation variance in
its images. One reason is that LFW's images were originally selected by an
automatic face detector, which doesn't detect rotated faces very well. Here
are some sample images from LFW:
[http://vis-www.cs.umass.edu/lfw/number_11.html](http://vis-www.cs.umass.edu/lfw/number_11.html)
In almost all of them, the person is facing the camera (no profile shots!) and
the in-plane rotation is not very large (their eyes are parallel to the
horizon; they're not upside-down).
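
To see that selection effect for yourself, try the stock frontal-face cascade
that ships with OpenCV (a Viola-Jones-style detector, similar in spirit to
what was used to build LFW; the image path is just an example):

```python
import cv2

# OpenCV's stock cascade is trained on upright frontal faces; profile or
# strongly rotated faces are usually missed, so images like those never
# made it into a detector-selected dataset.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

img = cv2.imread("some_photo.jpg")  # example path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
print("frontal faces found:", len(faces))
```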

The fact that we aren't shooting for rotation-invariance is a problem with
LFW, not necessarily this algorithm, but you're right to say that the authors'
paper would be much more convincing if they tried their approach on several
different datasets.

(In particular, the "Multi-PIE" dataset aims to specifically stress rotation-
invariance, even though it's not quite as popular as LFW. These guys do use
Multi-PIE as part of their training and they do in fact measure their
algorithm's performance on Multi-PIE; see page 24. They don't seem to talk
about it much since everyone is crazy for LFW and since Multi-PIE is "easy"
for other reasons...)

~~~
dschiptsov
So what, then, is "human" about it?

Everyone took the same courses and is more or less familiar with the basic
algorithms. The task of recognizing passport-like pictures is nearly trivial,
while "random, on the street" face recognition is a different problem.

People use environmental and contextual cues (as does Facebook's system), and
they are mostly guessing rather than "exact matching" people.

Anyway, as long as we're not talking about passport photos, the task of face
recognition has _almost_ nothing to do with basic NN algorithms, and a lot to
do with contexts and cues.

~~~
gcr
Exactly. That's why we should be (IMO) focusing on hard face recognition
problems, full of rotation and occlusion and blur and all the "icky" parts of
the real world that we don't like to deal with.

There is some of that work going on out there, but I would be very surprised
to find that spirit in a paper with the words "LFW" in the title. It's just a
different focus.

In its time, LFW was "hard" since most of the face datasets (FERET, AT&T) were
completely controlled: the subject visited a laboratory, sat down in the best
pose, made the best facial expression, and had their picture taken by the best
camera that grant funding could buy. The "in the wild" part of "Labeled Faces
in the Wild" alludes to the fact that these are more real-world than what the
community was used to.

It's time to continue that train of thought and move on to something harder,
though.

------
michaelochurch
Eigen tell you, it's not easy.

------
weishigoname
Thanks for sharing.

