
Netflix Spilled Your Brokeback Mountain Secret, Lawsuit Claims - phsr
http://www.wired.com/threatlevel/2009/12/netflix-privacy-lawsuit/
======
jpwagner
Summary: a woman wrote something like this...<http://bit.ly/8BLUAw>
(describing her sexual orientation in an "anonymous" (imdb) forum), and then,
based on her other reviews and data in the Netflix contest, a couple of
hackers (security/privacy researchers) put 2 + 2 together to find out her
identity (not their intended purpose, but as a side effect.) She blames
Netflix.

I'm actually unclear as to who's right and wrong here. Clearly, it seems
unjust that she unknowingly outed herself, but how responsible is she for her
online personas? I also wouldn't be surprised if Netflix has something in
their ToS relating to this kind of "anonymous" release of information.

This is a complicated scenario.

~~~
notauser
I think the responsibility does rest with Netflix, because it's almost outside
the control of one person to keep identities separate these days. Every bit of
information released vastly increases the chance of identification.

To give an example - let's say Google as part of a new anti-fraud service
released an MD5 hash of every e-mail address with a Google account, plus an MD5
hash of every IP that account had been successfully logged in from.

Sounds fairly anonymous - except that every website owner would now be able to
match up every duplicate-but-separate account in their database to find out
who had two or more identities, even if the user had been careful to use two
separate machines for that specific site.
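
A minimal sketch of that linkage, assuming a hypothetical release format (e-mail hash mapped to a set of login-IP hashes). The function names and the sample addresses are made up for illustration; nothing here reflects a real Google service. The site owner never sees a plaintext IP, only matching hashes.

```python
import hashlib
from collections import defaultdict


def md5_hex(value: str) -> str:
    """MD5 of a normalized string, as the hypothetical release would publish it."""
    return hashlib.md5(value.strip().lower().encode("utf-8")).hexdigest()


def build_release(accounts):
    """Hypothetical published dataset: md5(email) -> set of md5(login IP)."""
    return {md5_hex(email): {md5_hex(ip) for ip in ips}
            for email, ips in accounts.items()}


def link_accounts(site_users, release):
    """Group a site's usernames whose e-mails share a published IP hash.

    site_users maps username -> e-mail address (which the site already knows).
    """
    by_ip_hash = defaultdict(set)
    for username, email in site_users.items():
        for ip_hash in release.get(md5_hex(email), ()):
            by_ip_hash[ip_hash].add(username)
    groups = {frozenset(names) for names in by_ip_hash.values() if len(names) > 1}
    return sorted(sorted(group) for group in groups)


# Illustrative data only (documentation IP ranges, example.com addresses).
accounts = {
    "alice@example.com": ["203.0.113.5", "198.51.100.7"],
    "alice.alt@example.com": ["203.0.113.5"],
    "bob@example.com": ["192.0.2.9"],
}
site_users = {
    "alice_main": "alice@example.com",
    "throwaway42": "alice.alt@example.com",
    "bob": "bob@example.com",
}
release = build_release(accounts)
linked = link_accounts(site_users, release)
print(linked)
```

The two accounts are linked through a login IP that was only ever seen by the publisher of the hashes, which is why using separate machines on the site itself would not help.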

------
wheels
Notably, _Arvind Narayanan_, one of the deanonymizers, is our very own
randomwalker:

<http://news.ycombinator.com/user?id=randomwalker>

~~~
randomwalker
Thanks Scott. I just want to say that I had nothing to do with this lawsuit.
Also, this comment I posted earlier might be relevant:
<http://news.ycombinator.com/item?id=838226>

~~~
dschobel
So what's your take on the liability issue?

It doesn't sound like this was a technical failing of Netflix's anonymization
process but rather a matter of deduction of various independently anonymous
pieces of data.

~~~
stavrianos
if ya put it like that, it sounds more like a _fundamental_ failing

~~~
yrb
The thing is that perfect anonymization implies that the dataset would be
useless since by definition it would contain no information. If you can begin
correlating data points with enough outside information you will be able to
extract at least a shadow of the original information.

------
ekiru
I wonder how many mothers of multiple children with Netflix subscriptions and
IMDB accounts from Franklin County, Ohio, will be traveling to the location of
the court in which the suit is being made on the date of the trial and have
also visited the offices of the lawyer representing "Jane Doe". It seems as
though perhaps the lawsuit may be insufficiently anonymizing her personal
data.

I don't know that it was necessarily a good or a legal decision for Netflix to
release the contest data, but it doesn't help the plaintiffs' case when they
quote where the privacy policy of Netflix specifically states that they may
disclose the information disclosed in the contest and immediately claim that
the policy states no such thing.

------
dschobel
_[The researchers] identified several NetFlix users by comparing their
“anonymous” reviews in the Netflix data to ones posted on the Internet Movie
Database website_

So why sue netflix instead of IMDB? Additionally, is there an expectation of
privacy when posting movie reviews to public websites?

~~~
electromagnetic
I believe the suit is targeted directly at Netflix because it allegedly
violated stringent privacy laws related to video rental data.

Whether Netflix violated the person's privacy or not is debatable (hence it
hasn't been settled yet), however they certainly don't appear to have the
intent to keep people's privacy:

> The suit is also asking the court to stop Netflix from launching its
> promised second contest to improve the recommendations — this time giving
> out user data that includes ZIP codes, ages and gender, along with movie
> ratings and ID numbers substituted for user names.

I'm not certain how ZIP codes work in the US, I know however that my postal
code for my childhood home in the UK could place me within a 30 house range on
my street. Given age you could extract this down to ~7 people, given sex it
was down to 3. Being 1 of these 3 people means I have a 50/50 chance of
identifying two 'anonymous' people based _solely_ on postal code, age and
gender.

Lawsuits usually come down to intent, and Netflix arguably doesn't have the
intent to keep its users' privacy if it's intending on releasing ZIP, age and
gender information.

~~~
Retric
It depends on the type of Zip codes. Basic Zip codes are only 5 digits so
~30,000+ people per Zip code. However, Zip + 4 codes dramatically reduce that.

~~~
electromagnetic
Considering ~3% of Americans are Netflix subscribers, that means only ~900
people per basic ZIP might be subscribers. Gender will halve this to ~450
people. (IIRC) each 5-year population group has on average ~5.8% of the
population in it, with a median age of ~38 (where the percentages are hitting
7.8%), so let's say an average of 1.2% of the population is of any individual
year of age.

This means, on average, you should still be able to place someone down to ~5
people. God forbid you're a 110 year old using Netflix. If Netflix is
releasing it in 5-year groupings that still puts you in a group of ~30 people
for a grouping of ~30,000.
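
The back-of-envelope arithmetic above can be written out as a small sketch. All the shares are the rough figures quoted in this thread (the ~30,000 per ZIP, ~3% subscribers, ~1.2% per year of age, ~5.8% per 5-year group), not measured data, and the helper name is made up.

```python
# Rough figures quoted in the thread; treat all of them as assumptions.
ZIP_POPULATION = 30_000      # people per 5-digit ZIP
SUBSCRIBER_SHARE = 0.03      # share of people who are Netflix subscribers
GENDER_SHARE = 0.5           # share matching a given gender
SINGLE_YEAR_SHARE = 0.012    # share of people of one specific year of age
FIVE_YEAR_SHARE = 0.058      # share of people in one 5-year age group


def crowd_size(population, *shares):
    """Expected number of people left after applying each filter in turn."""
    size = float(population)
    for share in shares:
        size *= share
    return size


subscribers = crowd_size(ZIP_POPULATION, SUBSCRIBER_SHARE)
same_gender = crowd_size(ZIP_POPULATION, SUBSCRIBER_SHARE, GENDER_SHARE)
same_year = crowd_size(ZIP_POPULATION, SUBSCRIBER_SHARE, GENDER_SHARE,
                       SINGLE_YEAR_SHARE)
same_five_years = crowd_size(ZIP_POPULATION, SUBSCRIBER_SHARE, GENDER_SHARE,
                             FIVE_YEAR_SHARE)

print(f"subscribers per ZIP:        ~{subscribers:.0f}")
print(f"... of one gender:          ~{same_gender:.0f}")
print(f"... of one year of age:     ~{same_year:.1f}")
print(f"... in one 5-year grouping: ~{same_five_years:.1f}")
```

This treats the filters as independent and the ages as evenly spread, which real populations are not, so the numbers are only an order-of-magnitude guide.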

I'm unsure if any data release like this counts as anonymous.

~~~
Retric
I don't think Netflix is going to release their customer list, so the first
reduction is not directly possible. Also splitting the population into
equivalent size groupings is a normal approach, so you might start with
18-21,21-26... and end with 85+.

Also, it's normal to add / remove ~1% of your sample to remove some edge cases
and muddy the waters.

------
andrew1
It's a little off topic, but I'm intrigued by the '87% of Americans can be
uniquely identified by DOB, gender and zip code'. Given the size of the US's
population and the relative scarcity of zip codes, this seems an incredible
claim. The link in the article is broken so I can't read the paper. Even just
thinking about the big cities, where I imagine virtually no one would be
identifiable, that figure sounds impossibly high. Maybe the figure actually
refers to just working adults or something like that. Does anyone have any
more information about this, or access to the original paper?

~~~
andrew1
Actually, I was thinking about DOB as being just the day and month, not day,
month and year. If the year's included then this seems less far-fetched,
although I'd still be surprised if it's correct.
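
A rough way to see how much the year matters, as a sketch under loud assumptions: people are spread uniformly over (birthday, gender) combinations, a ZIP holds the ~30,000 people mentioned elsewhere in this thread, and birth years span about 80 values. None of that is real census data, so this is not a check of the 87% figure, only of the orders of magnitude.

```python
def unique_fraction(n_people: int, n_bins: int) -> float:
    """Chance that a given person shares their bin with nobody else,
    if n_people are dropped uniformly at random into n_bins."""
    return (1.0 - 1.0 / n_bins) ** (n_people - 1)


ZIP_POPULATION = 30_000   # assumed people per ZIP (figure quoted in the thread)
DAYS = 365                # ignoring leap days
GENDERS = 2
BIRTH_YEARS = 80          # assumed spread of birth years

# DOB as day and month only.
day_month_only = unique_fraction(ZIP_POPULATION, DAYS * GENDERS)

# DOB as day, month and year.
with_year = unique_fraction(ZIP_POPULATION, DAYS * GENDERS * BIRTH_YEARS)

print(f"unique with day+month only: {day_month_only:.3g}")
print(f"unique with full DOB:       {with_year:.3g}")
```

Under these assumptions almost nobody is unique on day and month alone, while well over half are unique once the year is included. A smaller ZIP population pushes the second number higher still.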

------
dkarl
_“a privacy blunder that could cost millions of dollars in fines and civil
damages.”_

Since they considered the knowledge gained from the original contest cheap at
$1 million, I'm sure the bigwigs at Netflix are wondering, "How _many_
millions?"

~~~
bmalicoat
They only considered the contest cheap because doing the equivalent within the
company would be expensive, take forever and possibly not produce anything
usable. The contest, by paying out only under certain conditions, guaranteed a
usable result and thousands of people working on it. So it was much cheaper
than the alternative in more ways than just financially. It was a brilliant
idea IMO. Releasing the data without passing on liability of some kind, maybe
not so brilliant.

------
breck
I actually saw this lady at a bar once kissing another woman. But I had no
idea she was a lesbian until I wrote a deanonymizer on a dataset of millions
of rows and then combed through her IMDB posts to find one very suggestive
comment.

Give me a break.

> The lead attorney on the new suit recently reached a multimillion-dollar
> settlement with Facebook over its failed Beacon program

It really annoys me when attorneys try to make millions when honest people
trying to improve the world make a mistake.

So Beacon was a bad idea. Netflix should have asked for permission before
releasing a user's anonymized data. But I think they learned their lessons.

Why should some random attorney who builds nothing get paid millions and
obstruct these companies from continually trying to innovate?

Sigh.

~~~
pyre
> _I actually saw this lady at a bar once kissing another woman. But I had no
> idea she was a lesbian until I wrote a deanonymizer on a dataset of millions
> of rows and then combed through her IMDB posts to find one very suggestive
> comment._

The issue is really whether or not 'everyone' will know that she is a lesbian
(really bisexual, if you read her IMDB comment, which someone posted above). I
highly doubt that someone wanting to conceal her identity as a
lesbian/bisexual woman would go to a 'straight bar' and start making out with
a woman. So the presumption here would be that you saw her at a gay bar
kissing a woman. If so, not too many 'anti-gay' people go to gay bars, so I
would think that she is relatively safe from discrimination in such a
scenario.

> _Netflix should have asked for permission before releasing a user's
> anonymized data. But I think they learned their lessons._

Obviously not, because Netflix's second contest will release _even more_
information on users (like zip/postal code, age, etc). Does that sound like
someone that has 'learned their lesson?'

While the first contest could be put down to good faith, the second one
definitely shows them at least attempting to push the boundaries.

~~~
breck
On your first point, I was making a (bad) joke.

On your second point:

Did they ask permission or set their TOS appropriately in the second contest?

I don't think the problem is releasing the data, the problem is not asking for
users' permission first.

------
bcl
I think this demonstrates why people should be more careful about what they
post in public forums, under their own names. The ability to make associations
like this is only going to become easier.

------
bugs
So did the lady get outed?

This article is very mindboggling to me and hard to follow.

~~~
btilly
_So did the lady get outed?_

That is unclear since she has not been identified in the lawsuit. But several
people were successfully identified, and she might have been one of them.

Whether or not she was outed then, I'm willing to bet money that at some point
during the lawsuit she will be tracked down and outed in a rather public
fashion. Filing lawsuits that upset tech people is not a good way to protect
your privacy...

~~~
bugs
This is my assumption as well. I mean, this lady really does not want her
privacy protected if she is taking Netflix to court; she really wants money to
compensate for what little damage was done. To be honest, people are much more
accepting of gay people, and she would have been able to stay relatively
closeted (depending on the size of her home town) compared to what will happen
when this becomes a public spectacle.

------
fuzzythinker
Guess that explains why we haven't heard of prize part 2... and may never,
sad.

------
joubert
Does she still use Netflix?

