
Measuring Re-Identification Risk in De-Identified Content - dsr12
https://cloud.google.com/dlp/docs/compute-risk-analysis
======
Cynddl
If you were expecting anything more than Google trying to k-anonymize your
datasets, you will be disappointed.

 _“K-anonymity was born out of a desire both to quantify the re-
identifiability of a dataset, and to balance the usefulness of de-identified
people data and the privacy of the people whose data is being used.”_
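The k in k-anonymity is straightforward to compute: group records by their quasi-identifier values and take the size of the smallest group. A minimal sketch (the record layout, column names, and toy values are made up for illustration):

```python
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the k of a dataset: the size of the smallest group of
    records that share the same quasi-identifier values."""
    groups = Counter(
        tuple(row[q] for q in quasi_identifiers) for row in rows
    )
    return min(groups.values())

# Toy dataset: age bracket and ZIP prefix are the quasi-identifiers,
# diagnosis is the sensitive attribute.
records = [
    {"age": "20-30", "zip": "100**", "diagnosis": "flu"},
    {"age": "20-30", "zip": "100**", "diagnosis": "cold"},
    {"age": "30-40", "zip": "101**", "diagnosis": "flu"},
]
print(k_anonymity(records, ["age", "zip"]))  # 1: the "30-40" record stands alone
```

With k = 1 at least one record is uniquely identifiable by its quasi-identifiers alone, which is exactly the risk this kind of analysis is meant to surface.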

Researchers have also shown recently that many high-dimensional datasets
(mobile phone metadata, online ratings, etc.) cannot be de-identified without
losing all their utility [1, 2, 3]. In such datasets there is no distinction
between quasi-identifiers and sensitive data: all of the information in the
dataset can be used to identify an individual. De-identification algorithms
like the one used here make the strong assumption that the data collector
knows which information can and cannot be used to re-identify individuals.

[1]
[https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf](https://www.cs.utexas.edu/~shmat/shmat_oak08netflix.pdf)

[2]
[https://www.nature.com/articles/srep01376](https://www.nature.com/articles/srep01376)

[3]
[http://science.sciencemag.org/content/347/6221/536.full](http://science.sciencemag.org/content/347/6221/536.full)

------
CSDude
I took a Data Privacy course, and as a project we scraped LinkedIn's
"people also viewed" section. Some of the LinkedIn profiles had Twitter
accounts linked, and by traversing those accounts and their connections we
were able to identify the Twitter handles of LinkedIn users who did not list
one, using nothing more than simple string comparisons and text similarities.
So cross-referencing works even with something that simple; imagine what
people can do with advanced methods.
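The linking step described above can be this simple. A hedged sketch (the names, candidate handles, and the 0.8 threshold are all invented for illustration, not the actual project code), using only the standard library's `difflib`:

```python
import difflib

def best_match(display_name, candidate_handles, threshold=0.8):
    """Link a profile name to the most textually similar handle,
    if the similarity clears a threshold; otherwise return None."""
    name = display_name.lower()
    scored = [
        (difflib.SequenceMatcher(None, name, handle.lower()).ratio(), handle)
        for handle in candidate_handles
    ]
    score, handle = max(scored)
    return handle if score >= threshold else None

# A LinkedIn display name matched against handles pulled from a
# connection graph: plain character-level similarity is often enough.
print(best_match("Jane Q. Doe", ["jane doe", "john smith"]))
```

Real attacks layer network structure and richer features on top, but even this naive string-ratio approach illustrates why an unlisted account offers little protection against cross-referencing.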

