

Ask HN: What are some clever ways data in public domain has been de-anonymized? - danielhughes


======
johnloeber
Stylometric analysis of the Federalist Papers[0] comes to mind. Originally it
was not known which individual author wrote which paper, but counting words
and matching the word-frequency distributions of the unlabelled papers against
those of the labelled papers (whose authors were known) made it reasonably
straightforward to identify the original authors.

[0] A set of historical papers of great political importance.
[http://en.wikipedia.org/wiki/The_Federalist_Papers](http://en.wikipedia.org/wiki/The_Federalist_Papers)
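
To make the word-frequency matching concrete, here is a minimal sketch, not the method used in the actual studies. The word list, the sample sentences, and the names `profile`, `distance` and `attribute` are all made up for illustration; real analyses use far more text and proper statistical models.

```python
import re
from collections import Counter

# Hypothetical list of common "function words"; chosen only for this toy example.
FUNCTION_WORDS = ["upon", "while", "whilst", "on", "by", "to", "of", "the"]


def profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def distance(p, q):
    """Manhattan distance between two frequency profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))


def attribute(unknown_text, labelled):
    """Return the author whose labelled text has the closest profile."""
    target = profile(unknown_text)
    return min(labelled, key=lambda a: distance(target, profile(labelled[a])))


# Made-up sentences, not real Federalist text.
labelled = {
    "author_a": "upon the whole it depends upon the people and upon the laws",
    "author_b": "whilst the states act on the people the union acts on the states",
}
unknown = "the decision rests upon the courts and upon the states"
print(attribute(unknown, labelled))
```

The idea is that small, habitual words are hard for a writer to disguise, so their frequencies act as a fingerprint even when the topic changes.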

------
alex_sf
Taxi data from NYC was deanonymized:

[https://medium.com/@vijayp/of-taxis-and-rainbows-f6bc289679a...](https://medium.com/@vijayp/of-taxis-and-rainbows-f6bc289679a1)

And then used to identify Muslim drivers:

[http://www.reddit.com/r/dataisbeautiful/comments/2t201h/iden...](http://www.reddit.com/r/dataisbeautiful/comments/2t201h/identifying_muslim_cabbies_from_trip_data_and/)

And then used to track celebrities:

[http://theiii.org/index.php/316/which-celebrity-is-taking-a-...](http://theiii.org/index.php/316/which-celebrity-is-taking-a-taxi-where/)

------
NeutronBoy
I can't recall the article, but there was a case where public data was
de-anonymized based on DOB and ZIP codes, and it was incredibly successful in a
given state.
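
A rough sketch of how that kind of linkage could work, assuming an "anonymized" dataset and a public, named dataset that both carry DOB and ZIP code. All records, field names and the `link` helper below are invented for illustration and are not taken from the case in question.

```python
def link(anonymized, public, keys=("dob", "zip")):
    """Match anonymized rows to names when the key combination is unique."""
    # Index the public, named records by their quasi-identifier tuple.
    index = {}
    for person in public:
        key = tuple(person[k] for k in keys)
        index.setdefault(key, []).append(person["name"])

    matches = {}
    for row in anonymized:
        names = index.get(tuple(row[k] for k in keys), [])
        if len(names) == 1:  # only an unambiguous match re-identifies someone
            matches[row["record_id"]] = names[0]
    return matches


# Invented toy data.
anonymized = [
    {"record_id": 1, "dob": "1970-01-02", "zip": "00001", "diagnosis": "x"},
    {"record_id": 2, "dob": "1980-05-06", "zip": "00002", "diagnosis": "y"},
]
public = [
    {"name": "Alice Example", "dob": "1970-01-02", "zip": "00001"},
    {"name": "Bob Example", "dob": "1980-05-06", "zip": "00002"},
    {"name": "Carol Example", "dob": "1980-05-06", "zip": "00002"},
]
print(link(anonymized, public))
```

Record 2 stays unmatched here because two people share its DOB and ZIP code; the attack works to the extent that such combinations are unique in the population.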

