

Why you can't really anonymize your data
http://radar.oreilly.com/2011/05/anonymize-data-limits.html

======
nkurz
This is a good practical article. Maybe it's obvious, but I feel like it's
missing one recommendation: weigh the risks and benefits. Are people going to
die if the anonymization is broken, or are the consequences relatively minor?
Are there ways to reduce the severity of these consequences? To maximize the
benefits?

I take a bit of issue with Pete's statement that Narayanan's break "gave access
to their complete rental histories". Rather, it gave access to slightly munged
ratings for a possibly partial subset of the movies included in the dataset.
While this is real, it's not quite the same as knowing their entire rental
history.

To my knowledge, there was no damage done by this exploit: no one was fired
because it was likely they loved Brokeback Mountain. By contrast, the Netflix
Prize brought about a lot of great research that otherwise might never have
been done, much less made public.

I'm glad the data was released, and the contest held, and I think in
retrospect the benefits outweighed the potential risks. I would hate for the
conclusion to be that privacy must be upheld above all other factors. I would
hate for the outcome to be that no such dataset will ever be released again.

~~~
skybrian
Unfortunately this can too easily devolve into proof by lack of imagination.

Properly weighing the risks requires a certain amount of paranoia. When the
person doing the security review has a stake in the outcome, they won't be
imaginative enough in thinking about how the data might be misused. You almost
need an adversarial system.

~~~
nkurz
Maybe, but overimagination might be worse. Let's keep it concrete. How should
Netflix have done things differently? Is there an adequate level of paranoia
that would cover all the possible outcomes while still generating as much
published original research as it did? My fear is that although (in my
opinion) the contest was a stunning success, paranoia will prevent Netflix
from ever running such a contest again. Do you feel this is a good thing,
necessary to protect the public?

------
roel_v
A first step is removing all records below a certain k-anonymity threshold
from released data sets. Of course this introduces sample bias and reduces the
usability of the set. Better educating researchers in the theory and practice
of privacy is another urgent need.
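
To make that first step concrete, here's a minimal sketch in Python (the
field names and the k=3 threshold are invented for illustration):

    from collections import Counter

    def enforce_k_anonymity(records, quasi_identifiers, k):
        # Drop every record whose quasi-identifier combination
        # appears fewer than k times in the released set.
        key = lambda r: tuple(r[q] for q in quasi_identifiers)
        counts = Counter(key(r) for r in records)
        return [r for r in records if counts[key(r)] >= k]

    rows = [
        {"zip": "94107", "age": "30-39", "rating": 4},
        {"zip": "94107", "age": "30-39", "rating": 2},
        {"zip": "94107", "age": "30-39", "rating": 5},
        {"zip": "10001", "age": "20-29", "rating": 3},  # unique: dropped
    ]
    print(enforce_k_anonymity(rows, ["zip", "age"], k=3))

Note that the records you drop are exactly the unusual ones, which is where
the sample bias comes from.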

~~~
SpikeGronim
The k-anonymity threshold changes as other datasets are released. If you are
looking at a dataset with my zip code in it, you can calculate a k-anonymous
version of that dataset. As soon as somebody else releases another dataset
with my zip code in it, you must consider both together, and anything you
released earlier is likely to be compromised.
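
A toy illustration of why the guarantee doesn't compose (all data invented):

    from collections import Counter

    # A release that is 2-anonymous on zip alone.
    released = [
        {"zip": "94107", "age": 34, "rating": 5},
        {"zip": "94107", "age": 58, "rating": 1},
    ]

    k_zip = min(Counter(r["zip"] for r in released).values())
    print(k_zip)    # 2 -- looks safe in isolation

    # Once someone else publishes a zip -> age mapping, age becomes
    # linkable, and the effective quasi-identifier is (zip, age).
    k_joint = min(Counter((r["zip"], r["age"]) for r in released).values())
    print(k_joint)  # 1 -- every record is now unique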

------
clarkevans
An approach that I think is worth exploring is ad-hoc aggregation: variable
selection with summary bucketing (aggregate results carry an error bound) so
that individual samples cannot be isolated.
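
Roughly like this sketch (hypothetical field names; min_group=2 here is an
arbitrary suppression cutoff):

    from statistics import mean

    def bucketed_summary(records, group_field, value_field, min_group):
        # Publish only per-bucket averages, and suppress any bucket
        # too small for its members to hide in.
        buckets = {}
        for r in records:
            buckets.setdefault(r[group_field], []).append(r[value_field])
        return {
            g: round(mean(vals), 1)
            for g, vals in buckets.items()
            if len(vals) >= min_group
        }

    data = [
        {"region": "west", "revenue": 120},
        {"region": "west", "revenue": 90},
        {"region": "east", "revenue": 400},  # bucket of one: suppressed
    ]
    print(bucketed_summary(data, "region", "revenue", min_group=2))
    # {'west': 105.0}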

~~~
3pt14159
That's what FreshBooks does with their report cards. They also throw in a very
small amount of jitter to make sure math-savvy people can't back-calculate the
number of people in a geographic location.
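
One standard way to implement that kind of jitter is Laplace noise, the
differential-privacy staple. A sketch, not necessarily what FreshBooks
actually does:

    import random

    def jittered_count(true_count, scale=1.0):
        # The difference of two exponentials is Laplace-distributed.
        # Small scale = small jitter, but exact counts can no longer
        # be recovered by differencing successive reports.
        noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
        return max(0, round(true_count + noise))

    print(jittered_count(42))

(The rounding and clamping are for readability; a real deployment would have
to be more careful about what guarantees they preserve.)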

------
Joakal
Even AOL attempted to anonymize the search data it released, simply replacing
personal information with IDs. But given enough search terms, people were able
to find out quite easily where, and even who, a person was [0].

[0]
[https://secure.wikimedia.org/wikipedia/en/wiki/AOL_search_da...](https://secure.wikimedia.org/wikipedia/en/wiki/AOL_search_data_scandal)
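
The failure mode is easy to see in miniature. Using a few of the widely
reported queries of AOL user No. 4417749, whom the New York Times identified
from her search history:

    log = [
        (4417749, "landscapers in lilburn ga"),
        (4417749, "homes sold in shadow lake subdivision gwinnett county"),
        (4417749, "60 single men"),
    ]

    # Replacing the name with an ID doesn't help: grouping by ID
    # reassembles the full history, and a few location and name
    # searches narrow it down to one person.
    by_user = {}
    for user_id, query in log:
        by_user.setdefault(user_id, []).append(query)
    print(by_user[4417749])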

