
DeepFace: Closing the Gap to Human-Level Performance in Face Verification - mlla
https://www.facebook.com/publications/546316888800776/
======
scotth
It's surprising to me that there isn't a single positive comment in this
thread considering how amazing this is. Sure, it has other implications, but
were we really hoping to prevent computers from recognizing faces permanently?

~~~
normloman
No, we were just hoping to prevent creeps like facebook from using it to
tailor advertising.

~~~
dave_sullivan
Ah, there's how you get to Minority Report-style stores/advertising in
practice...

1\. You upload photos to Facebook. Facebook detects various commercial
products or even--assuming deep learning can take things to "higher level"
representations--"style preferences" of people recognized in the photo. It
will also develop style preferences based on context, location, etc. (dive bar
or classy lounge?) Because it knows who is in the picture, etc. it can
correlate that with an identity.

2\. Facebook then sells this "identity" to the Gap -- no actual information
about the user, just these raw vectored "style preferences" which contain all
knowable brand information about each user. Facebook can provide massive
coverage here.

3\. You walk into a Gap store. Gap has installed software provided by Facebook
to detect your face/person/style preferences (but no personally identifiable
information, just that "the person with this face has these preferences and
probably makes this much money so you might offer X, Y, and Z at these prices")
and you then get an offer via facebook message (or "facebook offer"?) on your
phone to buy what you think is actually a really cool jacket at an admittedly
reasonable price (based on what you're used to paying for jackets).

This probably has massive ramifications for outfits like Costco or Target or
Walmart where individual consumer preference/taste/whatever can really make a
difference in choosing effective lineups of products... Maybe they manage to
offer deals that sort of price themselves based on what they know the user
will pay?

Almost not a bad idea...

------
cs702
The key innovation is an _accurate, reliable method for rotating faces_ so
they're 'looking straight at the camera' before feeding them to a deep neural
network. They call this 3D photo rotation process "frontalization." Figure 1
on page 2 of the paper shows at a very high level how this is being done. Very
nice!
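The paper's frontalization fits a full 3D face model, which is well beyond a comment sketch, but the underlying idea -- warp detected landmarks onto a fixed canonical template before classification -- can be illustrated with a much-simplified 2D analogue. Everything below (the canonical eye positions, the detected landmark values) is made up for illustration:

```python
import numpy as np

# Template positions for the two eyes in a canonical, frontal crop (assumed).
CANONICAL_EYES = np.array([[0.3, 0.35], [0.7, 0.35]])

def similarity_transform(detected_eyes):
    """Solve for the scale/rotation/translation mapping detected eye
    landmarks onto the canonical template (least squares in the plane)."""
    src, dst = detected_eyes, CANONICAL_EYES
    src_c, dst_c = src.mean(axis=0), dst.mean(axis=0)
    src0, dst0 = src - src_c, dst - dst_c
    # Represent 2D points as complex numbers; a similarity transform is
    # then multiplication by a single complex number m = scale * rotation.
    a = src0[:, 0] + 1j * src0[:, 1]
    b = dst0[:, 0] + 1j * dst0[:, 1]
    m = (np.conj(a) @ b) / (np.conj(a) @ a)  # least-squares solution
    R = np.array([[m.real, -m.imag], [m.imag, m.real]])
    t = dst_c - src_c @ R.T
    return R, t

def align(points, R, t):
    """Apply the recovered transform to landmark coordinates."""
    return points @ R.T + t

# A face detected at an arbitrary position/scale/tilt in the image:
detected = np.array([[100.0, 120.0], [160.0, 118.0]])
R, t = similarity_transform(detected)
print(align(detected, R, t))  # lands on CANONICAL_EYES
```

A 2D similarity transform can only normalize in-plane position, scale, and tilt; the point of the paper's 3D model is to also undo out-of-plane rotation, which no 2D warp of this kind can.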

~~~
apu
Actually, that's only one of the contributions, and I'm not so sure it's the
"key innovation". Every other recent face recognition method also tries to do
some kind of alignment to make faces more similar in pose/expression/lighting
prior to classifying them; and of these, several also fit faces to a 3-d model
to rotate to frontal (with varying quality).

See my other comment for my guess on what's actually providing the boost:
[https://news.ycombinator.com/item?id=7393378](https://news.ycombinator.com/item?id=7393378)

~~~
cs702
Yes, other attempts _"also fit faces to a 3-d model to rotate to frontal
(with varying quality),"_ as you put it, but this method for rotating faces
appears to be superior -- more accurate and reliable.

It looks like the main contribution to me :-)

------
fchollet
Important to note:

\- they still need 1000 labeled samples per identity

\- their network can only handle 4000 distinct identities (at 97.25% accuracy)
at a time

It's still a very worrying development for online and offline privacy.

~~~
bayesianhorse
Actually no. For one thing, this isn't exactly a stealthy or cheap thing to
do. It involves datacenters full of computing resources even for 4,000
identities.

I also don't believe it's so much a privacy issue. If I upload pictures to
facebook, I actually want them to be seen by human beings. The face
recognition only helps with that.

If facebook recognizes me in a picture someplace else, I'd actually rather
know about it. I'm not super famous, so unexpected pictures of myself are more
likely to be a bad thing...

~~~
ta_fbp
The privacy issue is not about what you want; it's about what can be done (and
often is, without you knowing).

Automated facial recognition is a serious privacy concern, and it's not just
about the slimy, despicable thing facebook is. For example, in the UK, where
there are more CCTV video feeds than people to watch them, automated facial
recognition can track you around constantly. Their current automatic number
plate recognition is already a serious privacy concern.

Remember 7th Cube's Voyeur's Dream[1], released in 2005? The same, but with
more cameras, and now able to identify you based on your facial features.

[1]: [http://www.pouet.net/prod.php?which=16410](http://www.pouet.net/prod.php?which=16410)

~~~
bayesianhorse
Wouldn't it actually be better to stop a government from doing that? If you
can't, there's something wrong that no restrictions or bans on technological
progress will fix...

Usually it's easier to get the government in line than to prevent some
technology from being developed...

~~~
chroem
So how is that NSA reform coming along? Good I hope?

------
phpnode
wow, does anyone apart from facebook and the government actually want facebook
to do this? it's pretty terrifying

~~~
userbinator
Not surprisingly, they are called _face_ book after all...

~~~
visarga
deepfacebook

Btw, does this software match faces to people, or just draw a rectangle around
faces?

~~~
fchollet
It matches your face to your name with 97.25% accuracy, assuming that they
have at least 1000 labeled photos of you to start with.

~~~
beagle3
One or two photos are enough for that accuracy. The 1,000 photos per identity
were only needed to learn the right face representation.

------
apu
Having worked on this problem before (the comparison to human performance they
cite is from my work) and seeing all the recent successes of deep learning,
I'd bet that a lot of the gain here comes from what deep learning generally
provides: being able to leverage huge amounts of outside data in a much
higher-capacity learning model.

Let me try to break this down:

In machine learning, when you have input data that is labeled with the kinds
of things you are directly trying to classify, that is called "supervised". In
this case it's not quite supervised, because their main evaluations are on the
LFW dataset, which is a _verification_ dataset, whereas their training on SFC
is a _recognition_ task. The difference is that in verification, you are given
photos of two people you've never seen before and have to identify if they're
the same or not. In recognition, you are given one or more photos of several
people as training data, and asked to identify a new face as one of them. In
theory, you could build recognition out of verification (verify all pairs
between training images and test input images and assign the top-scored name
as the person) but in practice it's much better to build dedicated recognition
classifiers for each person.
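As an illustration of that last point -- building recognition out of verification by scoring all pairs and taking the top name -- here's a toy Python sketch. The embeddings and the cosine-similarity "verifier" are made up; a real system would use learned face representations:

```python
import numpy as np

def verify(a, b):
    """Toy verification: cosine similarity between two face embeddings.
    Higher score = more likely the same person."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def recognize(test_face, gallery):
    """Recognition built from verification: score the test face against
    every labeled gallery image and return the best-scoring identity."""
    best_name, best_score = None, -np.inf
    for name, faces in gallery.items():
        for face in faces:
            score = verify(test_face, face)
            if score > best_score:
                best_name, best_score = name, score
    return best_name

# Toy gallery: a couple of embeddings per known identity.
gallery = {
    "alice": [np.array([1.0, 0.1, 0.0]), np.array([0.9, 0.2, 0.1])],
    "bob":   [np.array([0.0, 1.0, 0.1]), np.array([0.1, 0.9, 0.0])],
}
print(recognize(np.array([0.95, 0.15, 0.05]), gallery))  # → alice
```

This is exactly the "verify all pairs" construction; the reason dedicated per-person classifiers beat it in practice is that they can learn from all of a person's training images jointly rather than one pair at a time.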

Their main network is trained on a recognition task, using their SFC dataset.
They show these recognition results in Table 1 and the middle column of Table
2. An error number of 8.74% (DF-4.4M), for example, means that they were able
to successfully name the person in 91.26% of input images. However, this error
rate crucially depends upon two key factors: (1) the number of people they're
trying to distinguish between, and (2) the number of images they have per
person. For this test, it was ~4,000 people, and ~1,000 images/persons,
respectively.

If you were to add more people to the database, or have fewer images per
person, this accuracy would drop. You can see this clearly in Table 1, where
subsets DF-3.3M and DF-1.5M have correspondingly lower error rates because
they have fewer people (3,000 and 1,500, resp). Similarly, the middle column
of that table shows how error rates rise when you reduce the number of images
per person.

In contrast, all subsequent results are shown on verification benchmarks (LFW
and Youtube Faces). In large part, I suspect this is because of the realities
of publishing in the academic face recognition literature: you have to
evaluate on some dataset that the community is familiar with to get your paper
accepted; LFW is the de facto standard these days, and it only does
verification, not recognition.

Here, their performance is certainly very good, and an improvement over
previous work, but not an unexpectedly huge leap. If you look at the LFW
results page, you can see that recent papers have been edging up to this
number quite steadily: 95.17% (high-dim LBP), 96.33% (TL Joint Bayesian),
97.25% (this paper).
[http://vis-www.cs.umass.edu/lfw/results.html](http://vis-www.cs.umass.edu/lfw/results.html)

Nevertheless, how are they able to get this boost in performance? What recent
papers in this field have increasingly been discovering is that having higher-
dimensional features can really give you a big boost, or to put it another
way: having a higher-capacity model is what buys you the additional
performance.

In machine learning, the "capacity" of a model refers (in a loose sense) to
how powerful it is. The basic tradeoff is that a higher-capacity learner can
more accurately classify testing data BUT it requires much more training data
to learn. The problem is that for the LFW benchmark, the amount of direct
training data you have is strictly limited: there are 6,000 pairs of faces,
and you train on 90% of them and test on the remaining 10%. This is not nearly
enough data to train a high-capacity model.

So what people have been doing is training the bulk of their models on some
other data, for some other task, and then adapting that model to the LFW problem,
using the LFW training data essentially to "tweak" the classification model
for this particular task. That's why the LFW results tables are now broken up
into different sections according to how much outside data was used and in
what form.

In the case of DeepFace, this takes the form of the SFC dataset and learning a
network for recognition, not verification. Since they have access to lots of
data of this form, they can successfully train a high-capacity model for it.
Then they simply "chop off" the last layer of the network -- the one that does
the final recognition task -- and replace it with a component for
verification using only LFW training data. Or for their "unsupervised"
results, using no LFW training data ("unsupervised" in quotes because it's not
really unsupervised).

BTW, this approach of training a deep network for some task, and then cutting
off the last layer to apply it to a different task (in effect making it simply
a feature-extraction method) is quite common, and has been applied
successfully to many problems that might not have enough data to train a high-
capacity model directly.
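A minimal numpy sketch of the chop-off idea, with made-up weights and layer sizes (the real DeepFace network is much larger, convolutional, and actually trained): the hidden representation learned for the big recognition task is reused as a feature extractor for verification.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these weights were learned on a large recognition task
# (e.g. classifying among 4,000 identities on SFC-like data).
W1, b1 = rng.normal(size=(128, 64)), np.zeros(64)      # hidden layer
W2, b2 = rng.normal(size=(64, 4000)), np.zeros(4000)   # 4,000-way recognition head

def features(x):
    """The network with its last layer chopped off: a feature extractor."""
    return np.maximum(x @ W1 + b1, 0.0)  # ReLU hidden representation

def recognition_logits(x):
    """The original network: features followed by the recognition head."""
    return features(x) @ W2 + b2

def verification_score(x1, x2):
    """Verification reuses only the features: compare two faces directly
    in the learned representation, ignoring the recognition head."""
    f1, f2 = features(x1), features(x2)
    return float(f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2) + 1e-9))
```

The point is that `features` was shaped by the data-rich recognition task; the data-poor verification task only has to learn (or, in the "unsupervised" setting, simply threshold) a similarity on top of it.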

Anyway, if people have more questions, I can try and answer them. (I'm not one
of the authors, but I am in the field.)

~~~
boomzilla
Thanks for the write-up. This is very informational.

Could you elaborate a bit more on the "capacity" of learning models? Can it be
quantified, and is it somehow related to the VC dimension of a particular
learning problem? It would be great if you could give some example of
"capacity" for the more well known models: trees, naive bayes, SVM, one hidden
layer neural nets, etc.

~~~
apu
Yes, capacity is intimately tied to VC dimension; in particular, VC dimension
is one way to measure capacity. See the Wikipedia article for more
information:
[http://en.wikipedia.org/wiki/Vc_dimension](http://en.wikipedia.org/wiki/Vc_dimension)

I'm not an expert on deep learning (although I generally understand how they
work on vision problems), so I'm not sure if you can precisely measure the
capacity of deep networks. Informally, the primary number that seems to matter
is the number of parameters in the network that have to be learned. This paper
quotes that at "more than 120 million".

SVMs, in contrast, typically work with feature dimensionalities (i.e., # of
parameters) that are on the order of 1,000 - 100,000. You can't directly
compare these numbers because there are various non-linearities involved, but
this deep learning network is definitely much higher capacity than an SVM
would be with normal feature dimensionalities.

------
chriskanan
A paper posted on arxiv a few days ago by Fan et al. claims to have a similar
level of accuracy (97.3% on LFW):
[http://arxiv.org/pdf/1403.2802.pdf](http://arxiv.org/pdf/1403.2802.pdf)

Both methods use deep neural networks, but have a lot of differences, e.g.,
the Fan et al. paper doesn't use a 3D face model.

------
somberi
Maybe a stupid question: the Social Face Classification (SFC) dataset that
they refer to -- is it published to the world? I wonder if they can deduce
"emotions" from SFC dataset and use it as a training set for images in the
wild.

------
beagle3
Aren't you really glad now you uploaded all these photos to facebook?

~~~
DennisP
I didn't. I joined to keep up with a few friends and family, and they uploaded
photos with me in them, nicely tagged.

~~~
hnha
you can disable their ability to do so.

~~~
beagle3
No, you can't stop them from uploading photos and mentioning your name, even
if they didn't specifically put a rectangle on your face.

------
stdbrouw
Reducing error by 25% from a ~96.3% baseline (3.67% error down to 2.75%) gives
you their stated 97.25% accuracy. Less than one percentage point fewer errors
in absolute terms. Still amazing, but less impressive than the abstract makes
it sound.

~~~
apu
Actually, 25% is the right way to judge this improvement. For example, let's
say performance was currently at 99.9% and you improve it to 99.99%. That's
not a 0.09% improvement (99.99 - 99.9), but rather a ten-fold improvement
(.01% errors vs 0.1% errors).

This is because accuracy/errors are not linear: the meaningful scale is the
fraction of remaining errors eliminated, not the absolute accuracy gain.
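The arithmetic, for concreteness (using the LFW numbers quoted upthread):

```python
def relative_error_reduction(old_acc, new_acc):
    """Fraction of the remaining errors eliminated, not the absolute
    accuracy gain."""
    old_err, new_err = 1 - old_acc, 1 - new_acc
    return (old_err - new_err) / old_err

# 99.9% -> 99.99%: only +0.09 points of accuracy, but a ten-fold error cut.
print(round(relative_error_reduction(0.999, 0.9999), 2))    # → 0.9

# Prior LFW state of the art 96.33% -> DeepFace's 97.25%:
print(round(relative_error_reduction(0.9633, 0.9725), 2))   # → 0.25
```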

------
joshgel
Great name.

~~~
visarga
It comes from deep learning, which has been all the rage in machine learning
for the past few years.

