
Facial recognition's 'dirty little secret': Millions of online photos scraped - vinnyglennon
https://www.nbcnews.com/tech/internet/facial-recognition-s-dirty-little-secret-millions-online-photos-scraped-n981921
======
yorwba
They make a big deal about IBM using images published under a Creative Commons
license without asking for permission, when the whole point of CC is to give
permission to anyone, without having to ask. The people who are now surprised
that their images are being used for purposes they disagree with should
probably have used a different license, but I guess the messaging around CC
(which tends to emphasize reuse by artists) makes it difficult.

Using images with a restriction to non-commercial purposes is a bit more of a
gray area, depending on how you separate commercial from non-commercial
activity. Since they share the data set with researchers at other
organizations (presumably including competitors), I'd consider it non-
commercial enough, because they don't gain a competitive advantage, but the
details might have to be fought out in court.

~~~
jasonhansel
But the CC license requires attribution (and, in some variants, prohibits
commercial use). It's not clear to me that IBM followed these requirements--
presumably a trained neural network counts as a derived work, so far as the CC
license is concerned.

~~~
EpicEng
>presumably a trained neural network counts as a derived work

I just spent about 15 minutes trying to confirm that and... I have no idea. I
suppose it's not surprising that a software engineer would not be able to suss
that out in 15 minutes. Every definition I find tends to focus on art:
visual and auditory creations.

Disregarding a legal interpretation (you know, the thing that actually
matters), I can see it either way. Certainly the model is based off of data
derived from the characteristics of these images. On the other hand, if I saw
e.g. a shade of blue in one of these images that I liked, would I need to
provide attribution if I measured it and used it in my own work? I have no
idea; I suppose I'm just thinking out loud here. I do understand that taking
something to a logical extreme (the color example) is not the end-all be-all
of legal arguments.

~~~
nitrogen
Isn't an ML model a description of actual facts, derived mechanically rather
than creatively, and thus possibly not subject to US copyright in the first
place?

~~~
penagwin
But you could argue that the weights of the model wouldn't be what they are
without the copyrighted work. Since their model does use the work, a part of
it was "derived" from that piece.

~~~
sls
This isn't about the various senses the English word "derived" can have, it's
about the specific meaning of the legal term "derivative work." That would
start with including major elements of another work and those elements being
subject to copyright (e.g. in the US, book titles are not). There's no point
in talking about how one could argue that something wouldn't be what it is
without something else, that's not the legal basis at all. In no legal sense
are the Indiana Jones movies derivative works of the old 1930s serials just
because they were sources of inspiration.

------
FesterCluck
I'm not usually one to rant, but I'm getting a bit sick of media outlets
behaving like they don't have a duty to understand the subjects they are
writing about. While in the past it may have only been "techies" who
understood these things, at least the general understanding of how privacy
rules have developed exists in the populace. All of us read these articles and
give the author the "layman's" excuse like a southerner excuses their
grandparents' racist vocabulary. I'd like to make the situation very clear:

This article is clickbait: it's attempting to inflame readers by misinforming
them and/or feeding common misconceptions. We should expect more of
journalists. I do. We shouldn't allow ourselves the luxury of accepting
"that's just the way journalism is now." It wasn't always this way, and it
doesn't need to be.

I beg of readers: know when you are being played to raise agendas based on
false premises. The author just wants to stir up the public so they have more
to write about later. If by some chance the author believes anything they
wrote, then perhaps NBC should consider moving them to the obits.

------
helsinki
A friend and I built a web crawler that published images to a pool of workers
that generated feature embeddings for all faces it found. After indexing all
the feature vectors, we had effectively built a search engine for faces
powered by images found on the public web.

The results were terrifying and they really affected me in a very negative
way. Needless to say, we moved on to other projects. The world is not ready
for the harm such technologies can cause.

Our index contained 100+ million faces and the compute costs were obscene.
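For the curious, the core of such a pipeline is small. A rough sketch of how I
imagine it fits together (a random projection stands in for a real CNN
embedding model, and a brute-force dot product stands in for the
approximate-nearest-neighbor index a real system would need):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a trained face-embedding CNN (e.g. a FaceNet-style model
# mapping a face crop to a ~128-d vector); here just a random projection.
PROJ = rng.normal(size=(1024, 128))

def embed(face: np.ndarray) -> np.ndarray:
    """Map a flattened face crop to a unit-length feature vector."""
    v = face.reshape(-1).astype(float) @ PROJ
    return v / np.linalg.norm(v)

# "Crawl": pretend each random array is a face found on the public web.
faces = [rng.random(1024) for _ in range(1000)]
index = np.stack([embed(f) for f in faces])  # the search index

def search(query_face: np.ndarray, k: int = 4) -> np.ndarray:
    """Return indices of the k most similar indexed faces (cosine sim)."""
    scores = index @ embed(query_face)
    return np.argsort(scores)[::-1][:k]
```

At 100M+ faces the brute-force `index @ query` gets replaced by an
approximate-nearest-neighbor library, and the embedding step runs on GPUs,
which is where the obscene compute costs come in.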

~~~
nostrademons
Pretty sure Google, Facebook, and Apple already have this. They limit it to
little convenience features like "Auto-tag your friends" or "Here are some
friends you may know" because they don't want to scare people, but they have
both the data and the models to do large-scale person search over the entire
world. If they wanted to they could launch a search engine where you snap a
picture with your cell phone and they tell you who it is, what their home &
work addresses are, who their friends are, where they've been over the last
week, which businesses they like to frequent, and probably roughly how much
money they spend.

~~~
imglorp
Right, I don't want it matching the general public, but I'm curious why a
subset of this feature isn't commonplace: celebrities. As a user, I want this
all the time to identify an actor, a politician, a musician, etc.

It would take a random _new_ photo and give you 3 or 4 likely matches.

Tineye only matches already-known photos. Google/Bing image search will
abstract a given photo and show you more of the same type: e.g. more white men
wearing red shirts, rather than identifying the person and showing only photos
of that person.

------
nmstoker
The article takes a very long time to raise that the photographs in the
collection had Creative Commons licenses.

Perhaps the specific use they're being put to isn't covered by the particular
CC licence of each one, but until someone actually makes that claim, I don't
see that this is quite the issue it's portrayed as.

~~~
stefan_
If it's a full-on frontal photo, then in a number of countries it is not for
the photographer alone to decide what the license of that image is, and any
CC license attached to it might very well be invalid.

~~~
lotu
Yes, and when uploading you are required to confirm you have the rights to do
so with that photo. If we aren't going to be able to trust these types of
self-declarations, we pretty much can't have websites where users upload
photos anymore.

------
benatkin
The owner of Kabosu, the Shiba Inu in the famous Doge picture, puts hearts on
people's faces.
[https://kabosu112.exblog.jp/22453542/](https://kabosu112.exblog.jp/22453542/)

These include strangers on the street: anyone who didn't consent to public
photos. There are thousands of lovingly censored faces in her photos.

If everyone had done this I guess the situation would be different...

~~~
codedokode
This is common in Japan. If TV reporters are conducting an interview on the
street, for example, they hide the faces of random people passing by, because
those people did not consent to being on TV. They really care about privacy,
unlike people in other countries.

~~~
exolymph
See also: Germany and Google Street View.

------
glitchc
At a recent event, all of my friends and relatives took lots of photos, then
cautioned each other not to use Facebook to post or share any of the photos
because of privacy and facial recognition concerns.

All of them did, however, use WhatsApp to share the photos with each other.
<facepalm>

~~~
caprese
At the moment, WhatsApp is supposed to be end-to-end encrypted. I get the
don't-trust-Facebook mantra though; it is valid.

~~~
severine
Are group chats E2E?

~~~
oarsinsync
According to the current version of the iPhone app, yes, my group chats are
E2E encrypted.

I believe the mechanism is to use a fresh per-message key to encrypt each
individual message, and then use each recipient's public key to encrypt that
message key before sending it (along with a link to the ciphertext) to each
end user.

This is also why sending a video the first time takes forever (encrypt,
upload, encrypt key, send key, send link), while forwarding it to another
person (encrypt key, send key, send link) does not.
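A toy sketch of that flow as I understand it (an insecure hash-based stream
cipher stands in for both the real symmetric cipher and the Signal-protocol
key wrapping, which differ in detail):

```python
import hashlib
import os

def keystream(key: bytes, n: int) -> bytes:
    """Toy stream cipher keystream: NOT secure, stands in for AES."""
    out = b""
    counter = 0
    while len(out) < n:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n]

def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """XOR with the keystream; applying it twice decrypts."""
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

# Sender: encrypt the media ONCE with a fresh per-message key
# (the slow part: encrypt + upload the big ciphertext)...
media = b"large video file"
msg_key = os.urandom(32)
ciphertext = toy_encrypt(media, msg_key)  # uploaded to the server once

# ...then wrap that small key separately for each group member
# (stand-in for public-key wrapping).
recipients = {"alice": os.urandom(32), "bob": os.urandom(32)}
wrapped = {name: toy_encrypt(msg_key, rk) for name, rk in recipients.items()}

# Recipient: unwrap the key, fetch the ciphertext, decrypt locally.
alice_key = toy_encrypt(wrapped["alice"], recipients["alice"])
assert toy_encrypt(ciphertext, alice_key) == media
```

Forwarding reuses the already-uploaded ciphertext; only the cheap key-wrapping
step repeats per new recipient, which matches the speed difference described
above.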

------
alexcnwy
This brings up a very interesting and important question around differential
privacy.

If I scrape millions of photos from Facebook (including yours) then train a
differentially private model that can extract features from a new face, is
that a privacy violation?

A differentially private model is one in which you cannot identify the
inclusion of any single datapoint, which means you cannot tell the difference
between a model trained on the dataset and the same model trained on the same
dataset with the addition of your one datapoint.

You might argue it's a privacy violation because the scraping process might
involve people looking at your images, but what if it was fully automated and
nobody ever looked at your images? The model could be trained and then the
data immediately deleted...
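The guarantee is easiest to see in the simplest differentially private
mechanism, a noisy count (a textbook Laplace-mechanism sketch, not how DP
model training actually works; that uses noisy gradients instead):

```python
import random

def dp_count(dataset: list, epsilon: float) -> float:
    """Counting query released with epsilon-differential privacy.

    Adding or removing one person changes the true count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon makes the
    output distributions on neighboring datasets differ by a factor of
    at most e^epsilon: you can't tell whether any one person is in.
    """
    # The difference of two Exp(epsilon) draws is Laplace(0, 1/epsilon).
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return len(dataset) + noise
```

A single release is accurate to roughly ±1/epsilon while hiding any
individual's presence; DP training applies the same idea to each gradient
update.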

------
telesilla
The irony of all this being: when I use Facebook Marketplace with a specially
created public profile that has no discernible photos, I get accused of being
a spammer. Maybe Facebook could implement some kind of trustworthy algorithm
that says "we believe this account is not a scammer," so I wouldn't have to
deal with being unable to fully use online systems that assume
photograph == trust.

~~~
Mediterraneo10
Even if you upload a photo, FB will block the account if the photo doesn’t
show your _entire_ face. I set up a Facebook account where, after a holiday in
Morocco, my profile photo was me in a turban where my mouth was obscured by
part of the turban. I got a message from FB saying that this was not
sufficient and I would have to upload another photo where no part of my face
was covered.

~~~
doovd
This sounds like bs to me. There is a significant number of users with profile
photos that don't contain a face, and this is the first time I'm hearing
this...

~~~
Mediterraneo10
Those users probably have performed various other actions that reassured FB
they had not signed up to spam (joining groups, adding many friends, etc.). It
was the combination of being a new signup with an emptyish profile and having
no clearly identifiable face that led to my account being blocked, with a
message that specifically said I would not be able to regain access to my
account until I uploaded a photo where my face was fully visible.

~~~
doovd
Fair enough

------
munificent
We are truly in the "revolution" part of the Information Age now. Our social
and legal structures are as ill-equipped to handle problems like this as
Georgian-era England's were to handle millions of its people becoming factory
workers.

I don't know what the future is going to look like, but, man, we're going to
be going through some shit to get there.

------
JohnFen
This sort of thing is the main reason why I do not put photos (or any other
identifying information) of myself or others on the open web.

~~~
canada_dry
Not that this really helps.

The moment someone else puts up a picture that you happen to be in (esp.
family, friends) and then tags it with your name, your image and name will be
scooped up and cataloged.

~~~
renholder
>...and then tags it with your name...

Even that's not wholly necessary. If they can obtain data from your own
Facebook app (or an app using the Facebook SDK), it can place you in that area
around the time the photo was taken, and given that it's friends you have
connections with on Facebook, it's easy enough to surmise it's you without
explicit confirmation.

Seems very Orwellian, to be sure, but not outside the grasp of the end goals
of data harvesting/profiling.

~~~
canada_dry
It's the intelligence/LE community's wet dream come true, really... and scary
on just about every level you ponder (* tin-foil hat sold separately).

------
davesque
You mean that they cut ethical corners to get ahead in our competitive free
market economy? Someone stop these mad lads!

------
codedokode
In Russia, developers of face search engines like FindFace or SearchFace
usually scrape images from the largest Russian social network, VK. They
usually do it following that motto about asking for forgiveness rather than
permission, and there are no effective legal countermeasures anyway.

I think there would be no problem scraping other social networks like
Facebook, Instagram, or Twitter from countries where there are no legal
restrictions and photos are considered "public data." You can outsource face
recognition tasks to such countries.

------
rapjr9
There are lots of other ML models built using training data derived from the
public. Voice models, gait recognition models, WiFi mobility models, Bluetooth
location models, activity recognition models, emotion models, etc. It's
essentially impossible for people to tell if their data was used to train
these models. And it's very difficult to know what companies and individuals
might do with access to both live data streams and these models. Worth
thinking about though.

------
ykevinator
This is stupid. People put their stuff on the Internet and someone used it.
It's good, not bad. We all participated in training data at this point. Google
crawls us all.

~~~
ysopex
And sells us our data back via secret government contracts. Yay!!

------
dheera
Someone once mentioned to me that scraping online dating profiles was an
excellent source of thousands of faces nicely and cleanly labelled by gender,
age, and ethnicity, if those are the class labels you are looking for.

I'd say there are lots of unethical use cases but also a few ethical use cases
of such a trained model.

------
laythea
This is no surprise to me and I wager that far worse is happening with your
data.

------
philipodonnell
> The dataset does not link the photos of people’s faces to their names, which
> means any system trained to use the photos would not be able to identify
> named individuals.

But isn't that like the point of a facial recognition algorithm? Recognizing
individuals by their faces? Presumably from a reference image that has a name?

Also it seems pretty trivial to reverse lookup the images if they were from a
public source and some of those will have names, unless they are significantly
downsampled.

~~~
fixermark
Not necessarily. One thing that a wide-net dataset can be extremely useful for
is improving the neural net models to better recognize people outside the
traditional academic dataset a lot of older-generation neural nets were
trained on, which is to say: the sorts of people who have the time and need-
awareness to put their faces in some university's dataset for training neural
nets (which is to say: white guys ;) ).

You can use faces without names attached to improve the engine's modeling for
recognizing human faces in general (and more importantly: improve the system's
ability to distinguish human and animal faces).

(Your first comment is pretty interesting by itself, incidentally: both NBC
News and your comment make an assumption that is not universally true about
the technology. Face recognition is a much wider space than "recognize an
individual by their face." Clustering of similar faces, emotion analysis,
camera targeting, human presence / absence can all be done without name
labels).

------
NoblePublius
Not sure what the secret is. Who didn’t know this?

------
jakelazaroff
_> As the algorithms get more advanced — meaning they are better able to
identify women and people of color, a task they have historically struggled
with — legal experts and civil rights advocates are sounding the alarm on
researchers’ use of photos of ordinary people._

This is written as if algorithms were sentient beings overcoming the next
level of obstacles, rather than just being written by mostly white men who
train them on photos of people who mostly look like themselves.

