
These are old lessons: in 2006 AOL [1][2] and Netflix [1][3] both released datasets that were supposed to be anonymized but were easily de-anonymized. There are older examples based on Census data [4]. It's difficult if not impossible to release a dataset that is both useful and truly anonymized; in Schneier's words:

The obvious countermeasures for this are, sadly, inadequate. Netflix could have randomized its dataset by removing a subset of the data, changing the timestamps or adding deliberate errors into the unique ID numbers it used to replace the names. It turns out, though, that this only makes the problem slightly harder. Narayanan's and Shmatikov's de-anonymization algorithm is surprisingly robust, and works with partial data, data that has been perturbed, even data with errors in it.
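To make the robustness point concrete, here is a toy sketch of a fuzzy linkage attack. It is not Narayanan and Shmatikov's actual algorithm, just a simplified illustration of the same idea: score every "anonymized" record against a few things the attacker already knows, tolerate small errors in ratings and dates, and weight rare items more heavily. All names, thresholds, and data below (`score`, `best_match`, `max_days`, `min_gap`, the sample records) are made up for illustration.

```python
import math
from datetime import date

def score(aux, record, popularity, max_days=14, max_rating_gap=1):
    """Similarity between attacker knowledge `aux` and one anonymized record.

    Both map movie -> (rating, date). Near-matches count, so perturbed
    ratings or timestamps still score. Rare movies weigh more.
    """
    total = 0.0
    for movie, (rating, when) in aux.items():
        if movie not in record:
            continue  # missing data is simply skipped, not penalized
        r, d = record[movie]
        if abs(r - rating) <= max_rating_gap and abs((d - when).days) <= max_days:
            total += 1.0 / math.log(1 + popularity[movie])
    return total

def best_match(aux, dataset, popularity, min_gap=0.5):
    """Return the best-scoring pseudonym, or None if it does not stand out."""
    scored = sorted(
        ((score(aux, rec, popularity), uid) for uid, rec in dataset.items()),
        reverse=True,
    )
    if not scored:
        return None
    if len(scored) > 1 and scored[0][0] - scored[1][0] < min_gap:
        return None  # ambiguous: top two candidates are too close
    return scored[0][1]

# Hypothetical "anonymized" release: pseudonym -> {movie: (rating, date)}
dataset = {
    "u1": {"Obscure Film": (5, date(2005, 3, 1)), "Blockbuster": (4, date(2005, 3, 2))},
    "u2": {"Blockbuster": (4, date(2005, 3, 5)), "Other": (2, date(2005, 4, 1))},
    "u3": {"Blockbuster": (3, date(2005, 6, 1))},
}
# Number of users who rated each movie (assumed to be computable from the release)
popularity = {"Obscure Film": 1, "Blockbuster": 3, "Other": 1}

# What the attacker knows about the target: rating off by one, dates off by days
aux = {"Obscure Film": (4, date(2005, 3, 8)), "Blockbuster": (4, date(2005, 3, 4))}

match = best_match(aux, dataset, popularity)
```

Even with a wrong rating and shifted dates in `aux`, the single rare title is enough to separate one record from the rest, which is roughly why adding noise only makes the problem slightly harder.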

[1] https://www.schneier.com/blog/archives/2007/12/anonymity_and...

[2] http://www.securityfocus.com/brief/286

[3] http://www.securityfocus.com/news/11497

[4] http://crypto.stanford.edu/~pgolle/papers/census.pdf



