

No silver bullet: De-identification still doesn’t work - maxerickson
https://freedom-to-tinker.com/blog/randomwalker/no-silver-bullet-de-identification-still-doesnt-work/

======
randomwalker
I'm the first author of this piece, happy to answer any questions. Much of my
previous research on re-identification
([http://33bits.org/about/](http://33bits.org/about/)) has been discussed on
Hacker News.

~~~
siculars
What are your recommendations for anonymizing PHI data, for example? Saying
there is no way to anonymize data isn't really a solution, and beyond that, I
don't think it's true.

What practitioners need is a canonical reference or toolset that you could
feed a CSV file to, tell it which columns should be scrambled, and out comes
an anonymized data set. Yes, I realize the "tell it" part is of concern, but
that could be mitigated by smarter tools and more knowledgeable practitioners,
with the knowledge gained from a canonical reference.
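
As a rough sketch of the kind of interface being asked for here (the salted-
hash scrambling, the CLI, and the column handling are all illustrative
assumptions, and nothing about this makes the output actually safe, which is
rather the point of the article):

```python
# Hypothetical CSV anonymizer along the lines described above. The column
# handling and salted-hash scrambling are illustrative assumptions; as the
# article argues, this does NOT protect the output against linkage attacks.
import csv
import hashlib
import secrets
import sys

def scramble(value: str, salt: bytes) -> str:
    """Replace a value with a salted hash so equal inputs stay linkable."""
    return hashlib.sha256(salt + value.encode()).hexdigest()[:12]

def anonymize_csv(in_path: str, out_path: str, columns: set) -> None:
    salt = secrets.token_bytes(16)  # fresh per run; discard after use
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            for col in columns & set(row):  # only scramble requested columns
                row[col] = scramble(row[col], salt)
            writer.writerow(row)

if __name__ == "__main__":
    # usage (hypothetical): python anonymize.py in.csv out.csv name ssn zip
    anonymize_csv(sys.argv[1], sys.argv[2], set(sys.argv[3:]))
```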

~~~
jandrewrogers
In order to have minimally robust anonymization you must strip out all
implicit or explicit references to locations and times. Unfortunately, space
and time values are material to the analysis of most non-trivial data models
(they certainly are for health data), so stripping them out is not really an
option.

Few people appreciate the robustness and generalizability of spatiotemporal
coincidence analysis for reconstructing relationships in anonymized data,
both within and across many unrelated sources, even sources that identify
entities that are not people (like anonymous vehicle tracking). There are
enough anonymous entity tracking data sources available to algorithmically
reconstruct relationships to most other "anonymous" data sources. I've seen
it done many times, and the capabilities are jaw-dropping, in part because
they violate human intuition about what is possible with such data sets.
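
The core mechanism is simple to sketch. Assuming two toy datasets of
(anonymous_id, time, x, y) observations, with the bucket sizes and match
threshold chosen purely for illustration:

```python
# Sketch of spatiotemporal coincidence analysis: link anonymous IDs across
# two datasets by counting how often they occupy the same space-time bucket.
# Field layout, bucket sizes, and the threshold are assumptions.
from collections import Counter, defaultdict

def bucket(obs, cell_m=100, slot_s=3600):
    """Coarsen a (t_seconds, x_meters, y_meters) observation into a bucket."""
    t, x, y = obs
    return (int(t // slot_s), int(x // cell_m), int(y // cell_m))

def link(dataset_a, dataset_b, min_hits=4):
    """dataset_*: {anon_id: [(t, x, y), ...]}. Returns candidate ID pairs."""
    index = defaultdict(set)  # bucket -> IDs from dataset_b seen there
    for b_id, observations in dataset_b.items():
        for obs in observations:
            index[bucket(obs)].add(b_id)
    hits = Counter()
    for a_id, observations in dataset_a.items():
        for obs in observations:
            for b_id in index[bucket(obs)]:
                hits[(a_id, b_id)] += 1
    # Pairs that coincide in enough buckets are likely the same entity.
    return {pair: n for pair, n in hits.items() if n >= min_hits}
```

Entities that repeatedly show up in the same space-time bucket across two
sources are very likely the same entity, which is why coarse fuzzing buys so
little.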

~~~
aetherson
Do you have opinions on/is there research regarding how much "fuzzing" you
have to do to get substantial rewards in making space/time data anonymous?
That is, if my legitimate use for a space/time dataset can stand that data
being handled at the resolution of "hours" and "within 100 meters," does it
win me significant privacy benefits to scrub out more significant figures?

~~~
ProblemFactory
There is plenty of good work on anonymising and fuzzing location data. But
unfortunately, the results show that the amount of fuzzing required for
anonymity is huge. "Hours and 100m" is far from enough; only something like
"year and state" might work.

Some of the notable results are:

* Golle and Partridge 2009 ([http://xenon.stanford.edu/~pgolle/papers/commute.pdf](http://xenon.stanford.edu/~pgolle/papers/commute.pdf)) - Just the home and work location at city block level is enough to uniquely identify 50% of the US population. Home and work at zip code level is enough to uniquely identify 5% of the US population. Home and work county still narrows 1% of people down to a set of 6 candidates.

* de Montjoye et al. 2013 ([http://www.nature.com/srep/2013/130325/srep01376/full/srep01...](http://www.nature.com/srep/2013/130325/srep01376/full/srep01376.html)) - If you have a time-location dataset of people with hourly accuracy of time and cell tower accuracy for location (100m in cities, a few km in rural areas), then _four_ randomly picked points for each person uniquely identify 95% of them. Just two randomly picked points per person uniquely identify 50% of people.
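
As a sketch of how one might measure this "unicity" on one's own traces, in
the spirit of the de Montjoye et al. experiment (the data layout, coarsening
step, and parameters are assumptions for illustration):

```python
# Sketch of a unicity measurement in the spirit of the result above: given
# traces of (hour, x, y) points per person, what fraction of people are
# pinned down uniquely by k randomly chosen points? Layout and parameters
# are assumptions for illustration.
import random

def coarsen(point, hours=1, cell_m=100):
    t, x, y = point
    return (int(t // hours), int(x // cell_m), int(y // cell_m))

def unicity(traces, k=4, trials=1000, **grid):
    """traces: {person_id: [(hour, x_m, y_m), ...]}."""
    coarse = {pid: {coarsen(p, **grid) for p in pts}
              for pid, pts in traces.items()}
    people = [pid for pid, pts in coarse.items() if len(pts) >= k]
    unique = 0
    for _ in range(trials):
        pid = random.choice(people)
        sample = set(random.sample(sorted(coarse[pid]), k))
        # Does anyone else's trace also contain all k sampled points?
        matches = [q for q, pts in coarse.items() if sample <= pts]
        unique += (matches == [pid])
    return unique / trials
```

Re-running this with ever coarser buckets (e.g. `unicity(traces,
hours=24*365, cell_m=500_000)` for roughly "year and state") is one way to
run the fuzzing experiment asked about above.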

The main conclusion is that it is not possible to release an "anonymised"
location dataset of people that is still useful for mobility research. The
only realistic approach seems to be to have strict privacy regulations and
NDAs with people who are given access to this data.

But more generally, all research into anonymisation and preventing de-
anonymisation is difficult because it's not known what _other data sources_
the attacker has access to. If I have a time-location dataset of anonymous
people, and if I see from Facebook when a friend visited Paris and Barcelona,
then it becomes trivial to match these dates and cities against the location
traces, and find their full movement trace. Similarly, if I can get an
anonymous dataset of phone calls, then I can make 20 missed calls at 4am to a
friend, and later look for that pattern in the data to find their other calls.
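
The missed-call trick amounts to planting a fingerprint and then searching
for it. A toy version, with the record format assumed:

```python
# Toy version of the planted-pattern attack: place calls with a distinctive
# timing signature, then search the "anonymous" log for the one callee whose
# inbound calls cover that signature. The record format is an assumption.
def find_planted(call_log, planted_times, tolerance_s=60):
    """call_log: [(caller_id, callee_id, unix_time), ...]."""
    inbound = {}
    for caller, callee, t in call_log:
        inbound.setdefault(callee, []).append(t)
    return [callee for callee, times in inbound.items()
            if all(any(abs(t - p) <= tolerance_s for t in times)
                   for p in planted_times)]  # usually exactly one ID
```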

The best example of this was from the Netflix dataset of anonymous movie
ratings (
[http://en.wikipedia.org/wiki/Differential_privacy#Netflix_Pr...](http://en.wikipedia.org/wiki/Differential_privacy#Netflix_Prize)
) - people were anonymous in _their_ dataset, but some people also rated
movies with visible identities on IMDB. Correlating the two datasets allowed
researchers to discover the identities of people in the Netflix dataset.
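
Schematically, that attack scores each anonymous record against a public
profile, weighting rare movies heavily, and only claims a match when the best
score clearly beats the runner-up. The record format, weights, and thresholds
below are illustrative, not the paper's exact algorithm:

```python
# Schematic of the Netflix/IMDB correlation: score anonymous rating records
# against a public profile, weighting rare movies heavily, and only claim a
# match when the top score clearly beats the runner-up.
def score(public, anon, popularity, window_days=14):
    """public/anon: {movie_id: (rating, day)}. Higher score = better match."""
    s = 0.0
    for movie, (rating, day) in public.items():
        if movie in anon:
            a_rating, a_day = anon[movie]
            if rating == a_rating and abs(day - a_day) <= window_days:
                s += 1.0 / popularity.get(movie, 1)  # rare movies identify
    return s

def best_match(public, anon_db, popularity, margin=1.5):
    ranked = sorted(((score(public, rec, popularity), anon_id)
                     for anon_id, rec in anon_db.items()), reverse=True)
    (top, top_id), (runner_up, _) = ranked[0], ranked[1]
    if top == 0:
        return None  # no overlap at all
    return top_id if top >= margin * max(runner_up, 1e-9) else None
```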

------
nkurz
I'm often troubled by the apparent lack of attention to the "benefit" side of
the equation in both the report and the response. Yes, there is a risk that
anonymized data can be de-anonymized, but there is also potential benefit from
the release of large data sets. Is this discussed elsewhere?

I'm also generally skeptical about the level of damage caused by re-
identification of intentionally released data versus the potential for damage
from data released by hackers, subpoenas, and other means. Are there attempts
to put these in perspective? Is a true adversary likely to be thwarted by
better anonymization?

~~~
forgottenpass
_I'm often troubled by the apparent lack of attention to the "benefit" side
of the equation in both the report and the response._

As a precursor to any honest conversation about an "equation", you would have
to model the chance and impacts of re-identification. Anything else is
sophistry to promote the idea of releasing data in general. Talking about how
to anonymize data already carries the implicit assumption that there is value
to be had in releasing it. Otherwise why would it be a topic of conversation
at all?

I wouldn't have a problem with talking about the lack of quantification of
this risk; it's a really hard problem, if not an impossible one. What bothers
me is that you're playing to a whole slew of fallacious reasoning; if I were
being less charitable, I'd say you're trying to go congress on "the debate."

~~~
nkurz
_Anything else is sophistry to promote the idea of releasing data in general_

My prejudice is certainly toward releasing data when a dataset has clear
research value and no obvious harm to the participants. I feel that because
of randomwalker's work, datasets such as those used for the Netflix Prize may
never be released again, and that this is a net loss to society:
[https://news.ycombinator.com/item?id=1193417](https://news.ycombinator.com/item?id=1193417)

 _What bothers me is that you're playing to a whole slew of fallacious
reasoning; if I were being less charitable, I'd say you're trying to go
congress on "the debate."_

I'm not familiar with the phrase "go congress". Could you explain? I'm not
trying to play to fallacious reasoning, although I certainly might be
susceptible to it. I genuinely would like more discussion of what is lost when
data is not released due to fears of violation of privacy. Could you be more
specific about the fallacies involved?

~~~
NotAtWork
> I feel that this is a net loss to society

Is this just a random feeling or is there something to back this up?

How are you even weighing things like the loss in precision from releasing
(mostly) distribution-equivalent but fake data against the cost of leaking
personal details of hundreds of thousands or millions of people?

~~~
nkurz
I may be wrong, but it's more than just a random feeling. I should also point
out that I'm not suggesting that all data should always be released, and I'll
reiterate that it's not just 'randomwalker' I'm referring to here. I'm
disappointed that the paper he's rebutting seems to be making the silly claim
that 'it's safe' instead of 'it's a worthwhile risk'.

The case I'm most familiar with is the Netflix Prize. I think I can safely say
(in my semi-professional opinion) that a lot of good research was published as
a result of Netflix's decision to release the data that they did:
[http://scholar.google.com/scholar?q=netflix](http://scholar.google.com/scholar?q=netflix)

I view these publications as a public good. It's possible the techniques
described were previously known privately, but before the contest there was no
description of them in the available literature. At the conclusion of the
contest, Netflix briefly announced that they would have a second contest,
which would involve the release of another data set. In large part as a result
of the press coverage of randomwalker's work, this potential release was
cancelled: [https://freedom-to-tinker.com/blog/paul/netflix-cancels-netf...](https://freedom-to-tinker.com/blog/paul/netflix-cancels-netflix-prize-2/)

It's disputable, but I feel this additional data set would have generated a
similar number of good publications, inspired new research, spread knowledge,
and offered a similar benefit to the public.

It can be argued that preventing further data releases has prevented further
harm to the public. But I think it's important to look at the actual harm done
by the release of the first dataset. While there was some degree of potential
for harm, to my knowledge no individual was actually discriminated against as
a result of being identified in the data set. By contrast, I think it is
accepted that numerous individuals have been harmed by other online data
breaches. If more attention is paid to anonymous datasets than to other
matters that cause actual harm, the attention is likely misguided.

I'd draw a parallel with the increase in airline security post-9/11. It's
possible that confiscating oversize toiletry items and removing shoes has
prevented further harm. It's indisputable (I think) that the additional
hassle has imposed a real cost on the public. And if more people choose to
drive than to fly, the net effect is likely negative. But at least in the
case of airline security, there is a clear case to be made from actual harm.
Whether the current policies are a good compromise between inconvenience and
safety depends on both the risk and the reward.

I know less about the other cases, both for benefit and harm. Have New York
taxi drivers suffered harm as a result of the poor attempts at anonymizing the
data? Is there public benefit to the data that was released? Have individuals
been harmed by the semi-anonymized release of health records? Was this offset
by any positive effects? This is the discussion I think we should be having,
rather than simply pointing at the abstract potential for harm.

------
miguelrochefort
Why do we pursue anonymization?

------
exo762
This link does not work.

[http://www2.itif.org/2014-big-data-deidentification.pdf](http://www2.itif.org/2014-big-data-deidentification.pdf)

~~~
sp332
It worked fine for me. Here's an in-browser copy:
[https://onedrive.live.com/redir?resid=F6347E00AD1FCD9%211256...](https://onedrive.live.com/redir?resid=F6347E00AD1FCD9%2112569)

