
The simple process of re-identifying patients in public health records - peter_tonoli
https://pursuit.unimelb.edu.au/articles/the-simple-process-of-re-identifying-patients-in-public-health-records
======
TeMPOraL
A friendly reminder that there's no such thing as "anonymized data", there's
only "anonymized until combined with other data sets".

~~~
ThePhysicist
Personally I wouldn't be that pessimistic about data anonymization. It's
entirely possible to robustly anonymize low-dimensional data sets and restrict
the information gain of an attacker to a given value, even when he/she has
information about all non-sensitive attributes in the data set. When using
e.g. k-anonymity (with additional l-diversity or, better, t-closeness
criteria), the resulting data is very robust against attacks, given that you
correctly specify your sensitive attributes. Of course there are more things
to keep in mind, e.g. when repeatedly anonymizing different versions of the
same data set (as this can cause data leakage).
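
As a minimal sketch of the basic k-anonymity criterion (illustrative Python;
the records and column names are made up, not from the article): every
combination of quasi-identifier values must occur in at least k records.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """Return True if each quasi-identifier group contains >= k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

# Made-up, already-generalized records ("zip" and "age" are quasi-identifiers).
records = [
    {"zip": "301*", "age": "20-29", "disease": "flu"},
    {"zip": "301*", "age": "20-29", "disease": "cold"},
    {"zip": "302*", "age": "30-39", "disease": "flu"},
    {"zip": "302*", "age": "30-39", "disease": "flu"},
]

print(is_k_anonymous(records, ["zip", "age"], 2))  # True: both groups have 2 records
print(is_k_anonymous(records, ["zip", "age"], 3))  # False: no group has 3 records
```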

~~~
a785236
K-anonymity provides very little protection, if any. A few brief points:

1\. I've never seen a formal definition of security that k-anon supposedly
satisfies. While I personally really like formal guarantees, one might argue
their absence wouldn't be so bad if there were no concrete problems with the
definition. Which leads us to...

2\. K-anon doesn't compose. The JOIN of 2 databases, each k-anonymized, can be
1-anonymous (i.e., no anonymity at all), no matter what k is.

3\. The distinction between quasi-identifiers and sensitive attributes
(central to the whole framework) is more than meaningless: it's misleading.
Every sensitive attribute is a quasi-identifier given the right auxiliary
datasets. Using k-anon essentially requires one to determine a priori which
additional datasets will be used when attacking the k-anonymized dataset.

4\. My understanding of the modified versions (l-diversity, t-closeness, etc.)
is less developed, but I believe they suffer from similar weaknesses,
obscured by the additional definitional complexity.
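
Point 2 can be made concrete with a toy example (made-up data): the same four
people published in two releases, each 2-anonymous on its own quasi-identifier,
yet combining the quasi-identifiers isolates every individual.

```python
from collections import Counter

# Four hypothetical people, published twice: release A keeps "zip",
# release B keeps "age". Each release is 2-anonymous on its own.
people = [
    {"zip": "3000", "age": 25, "disease": "flu"},
    {"zip": "3000", "age": 40, "disease": "cold"},
    {"zip": "3001", "age": 25, "disease": "flu"},
    {"zip": "3001", "age": 40, "disease": "hiv"},
]

def min_group_size(records, quasi_identifiers):
    """Smallest quasi-identifier group, i.e. the effective k."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values())

print(min_group_size(people, ["zip"]))         # 2: release A is 2-anonymous
print(min_group_size(people, ["age"]))         # 2: release B is 2-anonymous
print(min_group_size(people, ["zip", "age"]))  # 1: the join is 1-anonymous
```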

(Edit: typos and autocorrect)

~~~
ThePhysicist
1\. As I said, most people don't use plain k-anonymity, as it can leak
information about the sensitive attribute when the values of this attribute in
a group are (almost) all the same. This is why extensions like l-diversity and
t-closeness exist: l-diversity ensures that each group will contain at least
l different values of the sensitive attribute; t-closeness ensures that the
distribution of sensitive attribute values within a group is close (as
measured e.g. by the "earth mover's distance") to the distribution of the
sensitive attribute in the entire dataset. Given the original and anonymized
data sets, it's pretty easy to measure the information gain (e.g. using a
Bayesian approach) of an attacker who knows which group a given person is in.
In that sense k-anonymity (with l-diversity/t-closeness) can be analyzed in a
formal framework, just like e.g. differential privacy.
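
The l-diversity criterion described above can be sketched as follows
(illustrative Python, made-up data): within each quasi-identifier group, count
the distinct values of the sensitive attribute.

```python
from collections import defaultdict

def min_diversity(records, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values in any group, i.e. the effective l."""
    groups = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups[key].add(r[sensitive])
    return min(len(values) for values in groups.values())

records = [
    {"zip": "301*", "disease": "flu"},
    {"zip": "301*", "disease": "cold"},
    {"zip": "302*", "disease": "flu"},
    {"zip": "302*", "disease": "flu"},
]

# Group "302*" has only one distinct disease value, so the release is only
# 1-diverse: knowing someone is in that group reveals their diagnosis.
print(min_diversity(records, ["zip"], "disease"))  # 1
```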

2\. Yes, that's what I mentioned at the end; k-anonymity is no different from
most other techniques here: if you use differential privacy with the Laplacian
mechanism and repeatedly publish independently anonymized versions of the same
underlying data, you will leak information (as an attacker will be able to
average the released values in order to get an estimate of the true value).
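
A quick sketch of that averaging attack (illustrative Python; the count,
epsilon, and number of releases are made-up parameters): each Laplace-noised
release hides the true count, but the mean of many independent releases
converges to it.

```python
import random

random.seed(0)

true_count = 42                  # the real value we want to protect
epsilon, sensitivity = 0.1, 1.0
scale = sensitivity / epsilon    # Laplace noise scale b = sensitivity / epsilon

def laplace_noise(b):
    # A Laplace(0, b) sample as the difference of two exponentials.
    return random.expovariate(1 / b) - random.expovariate(1 / b)

# Each release on its own is heavily noised...
releases = [true_count + laplace_noise(scale) for _ in range(10000)]

# ...but averaging many independent releases recovers the true value.
estimate = sum(releases) / len(releases)
print(abs(estimate - true_count) < 1.0)  # True: the noise averages out
```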

3\. Yes, sensitive attributes are often quasi-identifiers as well (at least in
combination with other quasi-identifiers); they are treated differently
because the underlying risk model does not regard a (non-sensitive) quasi-
identifier as something that needs to be protected. Inferring e.g. your gender
from your zip code, age and body weight using an anonymized data set is
(usually) not considered problematic, whereas learning that you are HIV-
positive would (almost always) be problematic, hence the distinction. Also,
sensitive attributes are treated as a group when applying k-anonymity, i.e. if
we have two binary attributes (HIV, Syphilis), one applies the anonymization
criteria to the combinations of the attributes ((true, true), (false, true),
(true, false), (false, false)), not individually to each attribute (as
applying them individually can cause information leakage).
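
Treating the sensitive attributes jointly, as described, can be sketched like
this (illustrative Python, made-up records): diversity is computed over
(HIV, Syphilis) tuples rather than over each column separately.

```python
from collections import defaultdict

# Made-up records with two binary sensitive attributes.
records = [
    {"zip": "301*", "hiv": True,  "syphilis": False},
    {"zip": "301*", "hiv": True,  "syphilis": True},
    {"zip": "302*", "hiv": False, "syphilis": False},
    {"zip": "302*", "hiv": False, "syphilis": False},
]

# Collect the distinct (hiv, syphilis) combinations per quasi-identifier group.
groups = defaultdict(set)
for r in records:
    groups[r["zip"]].add((r["hiv"], r["syphilis"]))

combos_per_group = {zip_code: len(combos) for zip_code, combos in groups.items()}
print(combos_per_group)  # {'301*': 2, '302*': 1}
```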

4\. I honestly don't know what to reply to this, as l-diversity/t-closeness
are well-specified methods that were designed to overcome the (known)
limitations of k-anonymity. Yes, these methods are not completely trivial to
use, but if used correctly they can provide good and quantifiable protection.
Not using them because they are hard to implement correctly is like saying we
shouldn't use cryptographic algorithms like RSA because it's hard to get all
the implementation details right.

------
ilaksh
A worse problem, I believe, is on the other side: people actually being able
to access their own medical records easily. It seems that providers have used
the onerous privacy laws as an excuse to obstruct the release of medical
records, in order to prevent customers from moving to another provider.

For example, the Doctor On Demand app has records of all of my visits
available on screen when I am logged in, but in order to actually export or
download them, I was required to call them, then fill out an online form
where none of that data was prefilled, and then wait up to 10 days for the
request to be fulfilled. It's ridiculous.

It doesn't make any sense, since I could just take screenshots in the app as I
scroll through the data. If they can identify me in the app, they should be
able to release the records -- which they do -- they just deliberately make it
difficult to export them.

Someone is going to come on here and lecture me about privacy laws and how
they have to do that or something, but I think it's BS. The laws need to be
updated to ensure that people can easily access, export, and transfer their
own medical records. I need to own that data.

I think there are quite a few groups working on technology to solve the
problem of owning your data and also being able to share it in a non-
identifiable way. Some of them use things like bitcoin or blockchain to do so.
We definitely need high tech solutions so I hope some of these types of
endeavors will become popular and more effective than some of the inept
government efforts so far.

------
amichal
"anonymized" is a statistical measure. What you are doing is making it less
_likely_ that someone can be identified, not necessarily impossible. I think
it would be best if folks were more honest about that. The article mentions
finding 7 people in a dataset of 2.9 million. It's obvious that they felt
7 prominent people was enough to tell the story, and that they could likely
find many more. My question is: could they find 0.001%, 1%, 10%, or more? And
if so, with what resources...

Edit: an old and interesting discussion on this:
[https://news.ycombinator.com/item?id=2942967](https://news.ycombinator.com/item?id=2942967)

------
cle
> but we now face the challenge of how to deliver that access, while
> protecting the privacy of the people in those datasets.

This is a losing battle. The information is already being leaked--we have been
protected by the high cost and inaccessibility of analyzing it. These factors
are quickly changing, and it's time to ask ourselves: how do we intend to live
in a post-privacy world?

------
crb002
Consider a Black guy in a remote Alaska village, with a left leg and right
hand amputated. For anonymity you have to ditch geocoding, or aggregate
symptoms across a population. IMHO anonymization should be good enough that
the aggregated records can become public domain.

