Why you can't really anonymize your data (oreilly.com)
63 points by Garbage on May 17, 2011 | 9 comments


This is a good practical article. Maybe it's obvious, but I feel like it's missing one recommendation: weigh the risks and benefits. Are people going to die if the anonymization is broken, or are the consequences relatively minor? Are there ways to reduce the severity of these consequences? To maximize the benefits?

I take a bit of issue with Pete's statement that Narayanan's break "gave access to their complete rental histories". Rather, it gave access to slightly munged ratings for a possibly partial subset of the movies included in the dataset. While this is real, it's not quite the same as knowing their entire rental history.
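
For the curious, the linkage works roughly like this (a simplified sketch, not Narayanan and Shmatikov's exact scoring; all names and data below are made up): take the handful of ratings you already know about a target, say from their public IMDb reviews, and find the released record that agrees with them most closely, allowing for fuzz in the ratings.

    # Simplified sketch of rating-based linkage (not the paper's
    # actual scoring function; data is fabricated).
    def match_score(known: dict, candidate: dict, tolerance: int = 1) -> int:
        """Count movies where the candidate's rating is within
        `tolerance` of a rating we already know about the target."""
        return sum(
            1
            for movie, rating in known.items()
            if movie in candidate and abs(candidate[movie] - rating) <= tolerance
        )

    def best_match(known: dict, released: dict) -> str:
        return max(released, key=lambda rid: match_score(known, released[rid]))

    # A few publicly stated ratings are often enough to single out
    # one record, even with noise and missing movies.
    known = {"Brokeback Mountain": 5, "Fahrenheit 9/11": 4, "Bend It Like Beckham": 3}
    released = {
        "user_0042": {"Brokeback Mountain": 4, "Fahrenheit 9/11": 4, "Shrek": 2},
        "user_0091": {"Shrek": 5, "Cars": 3},
    }
    print(best_match(known, released))  # -> "user_0042"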

To my knowledge, no damage was done by this exploit: no one was fired because the data suggested they loved Brokeback Mountain. By contrast, the Netflix Prize brought about a lot of great research that otherwise might never have been done, much less made public.

I'm glad the data was released, and the contest held, and I think in retrospect the benefits outweighed the potential risks. I would hate for the conclusion to be that privacy must be upheld above all other factors. I would hate for the outcome to be that no such dataset will ever be released again.


Unfortunately this can too easily devolve into proof by lack of imagination.

Properly weighing the risks requires a certain amount of paranoia. When the person doing the security review has a stake in the outcome, they won't be imaginative enough about how the data might be misused. You almost need an adversarial system.


Maybe, but overimagination might be worse. Let's keep it concrete. How should Netflix have done things differently? Is there an adequate level of paranoia that would cover all the possible outcomes while still generating as much published original research as it did? My fear is that although (in my opinion) the contest was a stunning success, paranoia will prevent Netflix from ever running such a contest again. Do you feel this is a good thing, necessary to protect the public?


A first step is removing all records below a certain k-anonymity threshold from released data sets. Of course this brings in sample bias and reduces the usability of the set. Researchers also urgently need better education in the theory and practice of privacy.
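
Roughly, in pandas (a sketch with made-up column names, not a production implementation):

    import pandas as pd

    def enforce_k_anonymity(df: pd.DataFrame, quasi_identifiers: list, k: int) -> pd.DataFrame:
        """Keep only rows whose quasi-identifier combination is
        shared by at least k records."""
        sizes = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")
        return df[sizes >= k]

    # Example: zip code and birth year as quasi-identifiers.
    records = pd.DataFrame({
        "zip":        ["02139", "02139", "02139", "90210"],
        "birth_year": [1980,    1980,    1980,    1975],
        "rating":     [5,       3,       4,       1],
    })
    released = enforce_k_anonymity(records, ["zip", "birth_year"], k=3)
    print(released)  # the lone 90210 record is suppressed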


The k-anonymity threshold changes as other datasets are released. If you are looking at a dataset with my zip code in it, you can calculate a k-anonymous version of that dataset. As soon as somebody else releases another dataset with my zip code in it, you must consider both, and anything you released earlier is likely to be compromised.
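
A toy illustration with fabricated data: a table that is 3-anonymous on zip code alone collapses to k = 1 the moment a later release lets an attacker join on (zip, birth year) — the classic Sweeney-style linkage.

    import pandas as pd

    medical = pd.DataFrame({            # "anonymized": names removed
        "zip":        ["02139", "02139", "02139"],
        "birth_year": [1980, 1991, 1975],
        "diagnosis":  ["flu", "asthma", "diabetes"],
    })

    voter_roll = pd.DataFrame({         # released later, publicly
        "name":       ["Alice", "Bob", "Carol"],
        "zip":        ["02139", "02139", "02139"],
        "birth_year": [1980, 1991, 1975],
    })

    # Grouped by zip alone, every record hides among 3 people (k = 3).
    # Joined on (zip, birth_year), each group shrinks to k = 1.
    linked = medical.merge(voter_roll, on=["zip", "birth_year"])
    print(linked)  # every diagnosis is now attached to a name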


Why would k-anonymity stay fixed? Maybe you could deanonymize the data later when better sets become available to correlate against.


An approach that I think is worth exploring is ad-hoc aggregation: variable selection with summary bucketing (aggregate results carry an error bound) so that individual samples cannot be isolated.
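
Here's one possible reading of that, sketched in Python (the bucket boundaries and the suppression threshold are my assumptions, not a spec):

    import pandas as pd

    MIN_GROUP = 10  # suppress aggregates over fewer samples
    BUCKETS   = [0, 25_000, 50_000, 100_000, float("inf")]
    LABELS    = ["<25k", "25-50k", "50-100k", "100k+"]

    def bucketed_summary(df: pd.DataFrame, group_col: str, value_col: str) -> pd.DataFrame:
        """Publish only coarse bucket counts per group, and drop any
        group too small to aggregate safely."""
        df = df.copy()
        df["bucket"] = pd.cut(df[value_col], bins=BUCKETS, labels=LABELS)
        counts = df.groupby([group_col, "bucket"], observed=True).size().reset_index(name="n")
        # Only bucket counts leave the system, and only for groups
        # above the minimum size, so no single sample can be isolated.
        totals = counts.groupby(group_col)["n"].transform("sum")
        return counts[totals >= MIN_GROUP]

    # Usage (hypothetical columns): bucketed_summary(invoices, "region", "amount")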


That's what FreshBooks does with their report cards. They also throw in a very small amount of jitter to make sure math-savvy people can't back-calculate the number of people in a geographic location.
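
I don't know FreshBooks' actual mechanism, but the jitter idea might look something like this — Laplace noise, as used in differential privacy, small enough that the report stays useful:

    import random

    def jittered_count(true_count: int, scale: float = 2.0) -> int:
        """Add small Laplace noise (difference of two exponentials)
        so exact counts can't be recovered by differencing reports."""
        noise = random.expovariate(1 / scale) - random.expovariate(1 / scale)
        return max(0, round(true_count + noise))

    print(jittered_count(137))  # e.g. 135 or 139; never exactly reliable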


Even AOL attempted anonymity in its released search data by simply replacing personal information with numeric IDs. But given enough search terms, people were able to work out quite easily where, and even who, a searcher was [0].

[0] https://secure.wikimedia.org/wikipedia/en/wiki/AOL_search_da...
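
The failure mode is easy to demonstrate (the first three queries below are among those the New York Times reported for AOL user 4417749; the grouping logic is essentially the whole attack):

    from collections import defaultdict

    log = [
        ("4417749", "landscapers in lilburn ga"),
        ("4417749", "dog that urinates on everything"),
        ("4417749", "60 single men"),
        ("8675309", "cheap flights to boston"),   # fabricated second user
    ]

    # Pseudonymous IDs keep every query a person made linked together,
    # and the queries themselves leak location, age, household, etc.
    by_user = defaultdict(list)
    for user_id, query in log:
        by_user[user_id].append(query)

    for user_id, queries in by_user.items():
        print(user_id, queries)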



