
How to De-Identify Your Data: Balancing Accuracy and Privacy - alanfranzoni
http://queue.acm.org/detail.cfm?ref=rss&id=2838930
======
macobo
De-identification has its limits and information can still be learned even
from anonymized datasets. An alternative to this is something like Sharemind
[1][2] where sound cryptography is used to make secure multi-party computation
possible.

[1]: [http://sharemind.cyber.ee/](http://sharemind.cyber.ee/)

[2]: [https://www.youtube.com/watch?v=bAp_aZgX3B0](https://www.youtube.com/watch?v=bAp_aZgX3B0)
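
To make the multi-party computation idea concrete, here is a toy sketch of additive secret sharing, one common building block of such systems. It is not Sharemind's API or its actual protocol, which is considerably more involved; the three-party setup, the modulus, and the hospital counts are assumptions chosen for illustration.

```python
import secrets

MODULUS = 2**32  # assumption: shares live in the integers modulo 2**32


def share(value, parties=3):
    """Split value into random shares that sum to value modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(parties - 1)]
    # The last share is chosen so that all shares add up to the secret.
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def reconstruct(shares):
    """Recombine all shares; fewer than all of them reveal nothing useful."""
    return sum(shares) % MODULUS


# Two hypothetical hospitals each secret-share a patient count.
a_shares = share(120)
b_shares = share(85)

# Each party adds the two shares it holds; no single party sees 120 or 85.
sum_shares = [(a + b) % MODULUS for a, b in zip(a_shares, b_shares)]
total = reconstruct(sum_shares)
```

Only the aggregate (`total`) is revealed once the summed shares are recombined; the individual inputs stay hidden from each party.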

~~~
jrowley
De-identification is perfect for a lot of medical research, where you're
looking for general trends, and don't need or want specific patient
identifying information sitting around on your hard drive. For example,
Mimic-III is a great database for ICU data. You still have patient ID numbers
for tracking events involving a specific patient, but all personal info is
de-identified.

[http://mimic.mit.edu/about/mimic/](http://mimic.mit.edu/about/mimic/)

~~~
jtheory
It's worth pointing out that anything with patient IDs (even though all
personal info is removed) is still not really anonymous "safe" data, and
should be treated carefully.

I'm not sure what measures Mimic takes (beyond just stripping demographic
info), but a given patient's pattern of healthcare interactions makes quite a
unique fingerprint -- and some parts of that fingerprint are likely public
information for some patients.

E.g., imagine a celebrity who (you can find from the tabloids) was treated at
Hospital X for a sprained ankle on 2012-07-14, and gave birth to a daughter at
Hospital Y on 2014-04-01. If her record -- completely "anonymized" -- is in a
data set that lets you search for patients matching these two events... it
seems fairly likely you'd be able to narrow it down to only a few candidates,
or quite likely an exact match. And then once you have her pseudonym/ID, does
the rest of the record reveal anything interesting? An abortion no one knew
about (possibly not even her partner)? A venereal disease treatment?
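
A minimal sketch of that linkage, assuming a pseudonymized event log with one row per visit. The patient IDs, the hospitals, the event labels, and the `candidates` helper are all invented for illustration and are not taken from Mimic or any real data set.

```python
from datetime import date

# Hypothetical pseudonymized log: (patient_id, hospital, event, date).
EVENTS = [
    ("p001", "Hospital X", "sprained ankle", date(2012, 7, 14)),
    ("p001", "Hospital Y", "delivery", date(2014, 4, 1)),
    ("p001", "Clinic Z", "oncology consult", date(2013, 2, 9)),
    ("p002", "Hospital X", "sprained ankle", date(2012, 7, 14)),
    ("p002", "Hospital X", "flu", date(2014, 4, 1)),
    ("p003", "Hospital Y", "delivery", date(2014, 4, 1)),
]


def candidates(events, known):
    """Return the patient IDs whose records contain every known event."""
    matches = None
    for hospital, event, day in known:
        ids = {pid for pid, h, e, d in events if (h, e, d) == (hospital, event, day)}
        # Intersect: a candidate must match all publicly known events.
        matches = ids if matches is None else matches & ids
    return matches or set()


# The two events an attacker could learn from public sources.
known_from_tabloids = [
    ("Hospital X", "sprained ankle", date(2012, 7, 14)),
    ("Hospital Y", "delivery", date(2014, 4, 1)),
]

ids = candidates(EVENTS, known_from_tabloids)

# Once the pseudonym is known, every other row for that ID is exposed.
exposed = [
    (h, e, d)
    for pid, h, e, d in EVENTS
    if pid in ids and (h, e, d) not in known_from_tabloids
]
```

In this toy log each event alone matches two patients, but the pair matches only one, and that patient's remaining visit is then readable.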

Even the fact that a patient had an appointment at a given clinic is sensitive
data -- e.g., seeing an IVF specialist, or oncologist, etc.

It's a tricky field to navigate.

------
m0nster
While data de-identification surely has its limits, it is useful in many
contexts.

If someone is interested in tools for data de-identification, ARX [1, 2] is
open source software that (among other features) supports exactly the set of
methods used in this study.

Full disclosure: I'm one of the developers of ARX.

[1] Website: [http://arx.deidentifier.org](http://arx.deidentifier.org)

[2] Source: [https://github.com/arx-deidentifier/arx](https://github.com/arx-deidentifier/arx)

------
djyaz1200
Best research I've seen on the topic...
[http://latanyasweeney.org/work/identifiability.html](http://latanyasweeney.org/work/identifiability.html)

------
grflynn
"Your data" assumes there is some sort of doppelganger attached to a data
bundle, which is mostly hot air and is used to persuade those who buy from data
brokers that the data is in fact correct. I know some FOIA pests who are
purposefully polluting such datasets and then asking for the information and
seeing some very skewed results. What if I sell back my data, since that's
what they're after anyway? I keep more logs than brokerages and would be happy
to hand them over for a fee. One item of browsing history alone is probably
worth upwards of $10,0,00

