

An open letter to Netflix from the authors of the de-anonymization paper - randomwalker
http://33bits.org/2010/03/15/open-letter-to-netflix/

======
randomwalker
I'm a little taken aback by the tone of some of the comments here, so I
thought I'd offer a few points of clarification.

* For the longest time we mostly stuck to doing the math. We certainly didn't call for the contest to be cancelled, and we had nothing to do with the lawsuit. But when some people implied that we were responsible for the mess that ensued, we were kind of pulled into it. We posted this as a way of explaining our point of view and reaching out to see if there's a possibility of collaboration.

* The sadness we expressed is genuine. The reason we brought up Netflix's response to our paper wasn't "snark" or "gloating." Rather, we were pointing out that the cancellation of this contest was rather needless, because if they had acknowledged the privacy risks back when we published the paper, they would have had more than enough time to deploy an opt-in system for this contest. I think it is really unfortunate that that didn't happen.

* Someone wanted to know exactly what I thought of the "greater good" argument. Well, I'll tell ya. I'm vehemently opposed to it and I think it's a dangerous slippery slope. I think this point of view is enshrined in the ethos of this country -- "better let ten guilty men walk free than to convict one innocent man," etc. I don't think anyone has the moral authority to decide that the privacy concerns of a few can be sacrificed.

* There is a specific reason we chose the open letter format rather than communicating with Netflix directly. Actually, two reasons. First, there are many data privacy researchers who are at least as qualified as we are for this role. We wanted to make sure the community had the opportunity to participate in whatever ensues, rather than just us.

Second, I'm sure there are many companies other than Netflix who have a
similar need for privacy preserving data mining. If Netflix doesn't take us
up, perhaps one of the others will. Bottom line, since there are multiple
parties on both sides, and we don't really know who they are, we felt it is
better to have this dialog in public.

* Finally, it is understandable when something like this happens to want to find someone to blame. But think twice before shooting the messenger.

~~~
earl
Please. You directly enabled the lawsuit. At least have the integrity to
acknowledge your part -- your pretense "Oh wow, a lawsuit just happened! But I
had nothing to do with it!" -- is pretty stupid.

I'll bet money the most likely result of this lawsuit -- and your actions are
a big piece of it -- is that Netflix, et al, will release no more datasets to
the public. Instead, only researchers under NDA will be allowed to work with
the data, as was basically the tradition before this.

Good job.

~~~
jvdh
Please. You're acting like Netflix did nothing wrong in releasing the dataset
when they knew that it could be pretty easily de-anonymized, thus creating a
privacy risk.

------
nkurz
I'm saddened to see someone gloating at having helped to prevent the release
of a dataset that I see as beneficial. Netflix offered an unprecedented corpus
for research, and now someone is proud about helping the lawyers to lock it
up. I think I just have a fundamentally different sense of privacy than the
author. I think this comes out most clearly in their FAQ:

    
    
      Furthermore, even if the algorithm finds the "wrong" 
      record, with high probability this record is very similar 
      to the right record, so it still tells us a lot about the 
      person we are looking for. 
    

So the "violation of privacy" occurs even we don't actually reveal information
about the individual, even if we only provide a framework for making
predictions? So if I publish a study (with backing data) that says that 38
year old males are likely to commit adultery, I've "violated the privacy" of
all 38 year old males?

Could someone who shares the author's worldview try to explain it? I've tried,
but I just don't see it.

~~~
randomwalker
'Gloating' is so diametrically opposed to the view expressed in the article
that I have no idea how to respond.

As for that part of the FAQ, it is intended as an explanation of some of the
theorems proved in the paper and is a response to some of the theoretical
objections we face from the data privacy community. It is not an issue that
arises in practice.

~~~
earl
Well, I can say this: congrats on getting one of the best industrial data sets
locked up. I hope you're proud of the work you did.

As for the possibility of netflix running a contest like this in an online
fashion, well, maybe, but the benefits of having access to the data are
enormous, plus you've now moved to a model where only the privileged few are
allowed access via NDA, or Netflix has to provide computing resources to all
researchers, etc. I don't see it happening.

~~~
pyre
If Netflix had attached credit card info and social security numbers to the
info would you be singing the same tune? You're basically saying that you
don't like the outcome due to your perceived utility of the data. Things I
don't see you talking about:

    
    
      - Do you view this as a breach of privacy?
      - What do you consider private?
      - Do you view this as a breach of privacy, but just
        don't care?
      - Do you feel that the utility of the data out-weighs
        the privacy concerns?
      - What about the people that view this as an invasion
        of privacy and have their Netflix user data in that
        set? Should they be thrown under the bus in the pursuit
        of progress because *you* feel that the data has more
        utility than the privacy concerns do?
    

I see a lot of people arguing that this is 'stifling innovation,' but
innovation is not an end unto itself. Banning the use of human test subjects
against their will in the pursuit of scientific knowledge 'stifles innovation'
too, but I think you would be hard-pressed to find many people who see that as
a bad thing. "Stifling innovation" in the pursuit of privacy should be seen as
a noble cause. It benefits the public. This is hardly the same as the argument
against intellectual property rights, and I really find it annoying that
people seem to be lumping it into the same ballpark with these boilerplate
"stifling innovation" comments.

~~~
Jun8
I see it as a breach of privacy that people _might have prevented_ had they
known about the dangers of their reviews getting linked to their accounts.
Many companies have their large credit card databases stolen or hacked into
through sheer incompetence. Netflix is not in the same boat as these.

~~~
inerte
So... having private info stolen == bad company, releasing private info ==
good company.

------
hooande
I looked at the methods used in the paper and it's clear that my definition of
"privacy" varies greatly from the author's definition.

Essentially they are saying that if you know what rating someone gave 8 movies
and the date that they gave those ratings, you can find a sample of their
rating list (or something very similar to it) with 99% accuracy. So freaking
what?
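
[To make the matching concrete, here is a toy sketch of this kind of record linkage. It is not the paper's actual scoring function -- Narayanan and Shmatikov use a more careful similarity weighting that, among other things, gives rare movies more weight -- and all the record and movie identifiers below are made up for illustration.]

```python
from datetime import date

def match_score(aux, record, date_slack_days=3):
    """Count how many auxiliary (movie -> (rating, date)) pairs a
    candidate record matches, allowing dates to be off by a few days."""
    score = 0
    for movie, (rating, when) in aux.items():
        if movie in record:
            r, d = record[movie]
            if r == rating and abs((d - when).days) <= date_slack_days:
                score += 1
    return score

def best_match(aux, dataset):
    """Return the id of the record that best fits the auxiliary info."""
    return max(dataset, key=lambda rid: match_score(aux, dataset[rid]))

# Toy "anonymized" dataset: record id -> {movie_id: (rating, date)}.
dataset = {
    "user_a": {1: (5, date(2005, 3, 1)), 2: (3, date(2005, 4, 2))},
    "user_b": {1: (4, date(2005, 3, 5)), 3: (5, date(2005, 6, 1))},
    "user_c": {2: (3, date(2005, 4, 3)), 3: (5, date(2005, 6, 2))},
}
# The adversary knows (e.g., from public IMDb reviews) that the target
# rated movie 2 a 3 in early April and movie 3 a 5 in early June.
aux = {2: (3, date(2005, 4, 4)), 3: (5, date(2005, 6, 1))}

print(best_match(aux, dataset))  # → user_c
```

The point of the paper is that with only a handful of such (rating, approximate date) pairs, one record typically scores far above all the others, which is what makes the "anonymous" dataset linkable.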

"Evidence" that flimsy wouldn't stand up in a local bar argument, much less a
court of law. The records are still completely and totally anonymous. No
names, no addresses, no way to identify anyone...nothing but a strong
statistical correlation to a set of ratings in a database.

It sounds like they have a problem with the power of predictive modeling and
not with the handling of anonymous data. Essentially what they are saying is
"if we know a little about you, we can find out things that we didn't know,
with a very high degree of accuracy but no certainty." Duh. That's what the
whole netflix prize was about: using known data to make strong predictions
about unknowns.

They had some interesting methods (especially in their similarity
calculations) but this has nothing to do with privacy.

------
pbh
It seems to me that any data at all will necessarily reduce the entropy of the
probability distribution of members' preferences, likes, dislikes, habits, and
so on (i.e., their privacy). The authors seem to brush off the "greater good"
argument here, but I don't understand how any large scale data release can
happen without at least some reference to such an argument given that context.
Given that, the authors seem to be making a fairly strong claim here: that no
large scale "anonymised" data release should ever happen. Is that helping
anyone in the context of movie viewing? And is it hurting anyone other than
academic researchers, given that companies share more sensitive data anyway?

~~~
raganwald
The authors mentioned two alternatives to the current form of large-scale
release. First, opt-in. Second, contestants submit programs that run on the
anonymized data but the contestants do not have access to the data itself.
Could either of these approaches contribute to the greater good without
compromising privacy?

~~~
pbh
I do not have any data regarding opt-in, but my impression was that, as a
rule, no one opts in and no one opts out of basically anything (excepting the
notorious cases, e.g., Real Player). If it worked, and somehow gave an at
least somewhat unbiased sampling of the data, opt-in would obviously be best.

I am completely unsatisfied with the submit-and-run model. Feature engineering
seems to rely on knowing your data really intimately, and that does not seem
possible in a submit-and-run model.

~~~
smokey_the_bear
What happened with Real Player?

~~~
pbh
The Real Player installer, over the course of maybe five to ten years, was so
pushy about installing extra, unwanted software and sending private data that
it garnered a reputation that caused people to be really careful when
installing it, if they installed it at all. It is not really a perfect example
of people actually opting out, because I think one of the many criticisms was
that it was often either not possible or extremely difficult to figure out how
to opt out of its features (sending titles of files being played, annoying
message center and ad popups, bundled additional software). That said, check
out the Wikipedia page for further details.

------
stevoski
IIRC, in some countries census data is sometimes released with small,
intentional errors to prevent the ability to locate specific individuals. Make
a 36 year old sometimes a 37 year old or a 35 year old. Make a 180 cm person
sometimes 182 cm or 178 cm. Small enough errors not to make the aggregate data
invalid, but enough to make it hard to identify individuals from the data.

Perhaps this is a partial solution for the Netflix dataset.
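
[A minimal sketch of that kind of perturbation, using simple uniform integer noise; real statistical agencies use more principled schemes, such as the calibrated noise of differential privacy, and the values below are made up for illustration.]

```python
import random

def perturb(value, max_error=1):
    """Add a small uniform integer error in [-max_error, max_error]."""
    return value + random.randint(-max_error, max_error)

random.seed(0)  # fixed seed so the example is repeatable
ages = [36, 41, 29, 55, 36]
noisy = [perturb(a) for a in ages]

# Each individual value shifts by at most 1 year, so no single record
# is exact, but the aggregate statistics stay close to the truth.
print(noisy)
print(sum(ages) / len(ages), sum(noisy) / len(noisy))
```

As the reply below notes, the catch is the utility/privacy trade-off: noise small enough to preserve aggregates may still leave individual records linkable, and noise large enough to prevent linkage tends to ruin the dataset.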

~~~
nkurz
This is the approach that Netflix took with the initial data. The paper
referred to shows that this is insufficient, and does little to ease privacy
concerns. The general problem is that if you 'fuzz' up the data enough to make
identification impossible, it's no longer useful as a dataset.

------
prakash
fyi: randomwalker is Arvind Narayanan.

------
bmickler
P.S. - BTW, when do you expect to allow your linux-using, paying customers to
watch your instant streaming movies online?!

(off topic rant, I know. goodbye karma!)

