
'Anonymised' data can never be totally anonymous, says study - stmw
https://www.theguardian.com/technology/2019/jul/23/anonymised-data-never-be-anonymous-enough-study-finds
======
amrrs
Previous Discussion:
[https://news.ycombinator.com/item?id=20513521](https://news.ycombinator.com/item?id=20513521)

~~~
dang
Thanks! That one has comments by one of the authors.

Also related:
[https://news.ycombinator.com/item?id=20513453](https://news.ycombinator.com/item?id=20513453).

------
ThePhysicist
The authors of this research should really explain to journalists the
difference between anonymized and de-identified data. Their paper does a good
job of this and does not claim that anonymization is impossible; most of the
newspaper articles published about it seem to miss this point entirely,
though.

To clarify, when de-identifying a dataset you simply remove direct identifiers
(like a name) from it. This protects the data from direct re-identification,
i.e. from someone learning the identity of a person in the data by looking at
an individual row. Anonymization is supposed to protect individuals from
re-identification even when external context information is available, and can
usually only be achieved by further transforming the data, for example by
grouping it (as techniques like k-anonymity do), adding noise (as randomized
response techniques do), or by synthesizing new data.
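The distinction can be sketched in a few lines of Python (the records, field
names, and parameters here are invented for illustration): de-identification
only drops the direct identifier, while k-anonymity-style generalization and
randomized response transform the data itself.

```python
import random

# Toy records; names and fields are illustrative.
records = [
    {"name": "Alice", "zip": "02139", "age": 34},
    {"name": "Bob",   "zip": "02134", "age": 37},
    {"name": "Carol", "zip": "02138", "age": 31},
]

def de_identify(rec):
    # De-identification: drop the direct identifier only.
    # Each row is still unique and re-identifiable from context.
    return {k: v for k, v in rec.items() if k != "name"}

def generalize(rec):
    # k-anonymity-style grouping: coarsen quasi-identifiers so that
    # several rows become indistinguishable.
    return {"zip": rec["zip"][:3] + "**", "age_band": (rec["age"] // 10) * 10}

def randomized_response(answer, p_truth=0.75):
    # Randomized response: report the true answer with probability
    # p_truth, otherwise a coin flip, making any single answer deniable.
    if random.random() < p_truth:
        return answer
    return random.random() < 0.5

deid = [de_identify(r) for r in records]   # still unique rows
anon = [generalize(r) for r in records]    # all rows collapse to one group
```

Here all three generalized rows land in the same (zip prefix, age band)
group, i.e. k = 3, while the de-identified rows remain unique.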

High-dimensional, de-identified data will always be easy to re-identify given
enough context; I did this myself in 2016 with a clickstream dataset (the
authors reference our work in their paper).

------
lettergram
I work on this problem full-time, that’s why we create synthetic data:

[https://medium.com/capital-one-tech/why-you-dont-necessarily...](https://medium.com/capital-one-tech/why-you-dont-necessarily-need-data-for-data-science-48d7bf503074)

In many cases it’s even good enough to build models on.

However, it’s important to note that numbers such as SSNs are always “real”,
in the sense that they link to someone. A random SSN has, with near certainty,
been used before and belonged to someone (perhaps multiple people). The trick
is ensuring the rest of the attributes don’t match the individual.
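That idea can be illustrated with a small sketch (the field names and value
pools are made up, and this is not any company's actual generator): the SSN is
random but syntactically plausible, while the remaining attributes are sampled
independently, so the combination does not describe the SSN's real holder.

```python
import random

# Illustrative value pools -- not real data.
zip_codes = ["02139", "10001", "94103"]
age_bands = [20, 30, 40, 50]

def synthetic_record():
    # A random, syntactically plausible SSN (excluding the invalid
    # area numbers 000, 666, and 900+). It almost certainly collides
    # with a real person's SSN -- that is unavoidable.
    area = random.choice([a for a in range(1, 900) if a != 666])
    ssn = "{:03d}-{:02d}-{:04d}".format(
        area, random.randint(1, 99), random.randint(1, 9999))
    # The other attributes are sampled independently of any real row,
    # so they don't jointly match the SSN's actual owner.
    return {
        "ssn": ssn,
        "zip": random.choice(zip_codes),
        "age_band": random.choice(age_bands),
    }
```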

~~~
ThePhysicist
While I think data synthesis is a useful technique, I've never understood how
most synthesis approaches can so confidently claim that there is no leakage of
sensitive information into the synthesis model. The only approach I've seen
for this is using differential privacy techniques to limit the amount of
information the synthesis model can learn from a given datapoint, but doing so
also drastically limits the model's ability to learn from the input data.

Another issue I have with data synthesis is that by trying to reproduce the
original data as faithfully as possible, you waste a lot of your privacy
budget on meaningless features.
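The budget trade-off can be sketched with a Laplace-mechanism release of
bounded means (all names and numbers below are invented for illustration):
every extra feature reported under a shared total budget gets a smaller
epsilon share, and therefore a noisier answer.

```python
import math
import random

def laplace_noise(scale):
    # Sample Laplace(0, scale) by inverse-CDF transform.
    u = random.random() - 0.5
    return -scale * math.copysign(math.log(1.0 - 2.0 * abs(u)), u)

def dp_mean(values, lo, hi, epsilon):
    # Differentially private mean of values clipped to [lo, hi]:
    # each record moves the mean by at most (hi - lo) / n, so Laplace
    # noise with scale (hi - lo) / (n * epsilon) gives epsilon-DP.
    n = len(values)
    clipped = [min(max(v, lo), hi) for v in values]
    return sum(clipped) / n + laplace_noise((hi - lo) / (n * epsilon))

# Splitting one total budget across features: every extra feature the
# synthesizer tries to reproduce faithfully shrinks the per-feature
# epsilon, so each released statistic gets noisier.
total_epsilon = 1.0
features = {"age": [34, 37, 31, 45], "income_k": [50, 70, 60, 80]}
eps_per_feature = total_epsilon / len(features)   # 0.5 each
```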

That said I'm happy to be convinced otherwise. Do you have a publicly
available demo dataset that I could look at?

------
rectang
It's great to see awareness of the futility of pseudonymization bubble up to
the non-technical press.

Unmasking of bulk-sold pseudonymized user data is an externality, like
pollution — those who bear the cost when the data gets reidentified are the
users, not the buyers or sellers. Therefore the data belongs to the users and
propagation should be severely constrained.

------
jandrewrogers
Statistically useful anonymized/de-identified data sets will leak information
when analytically combined with sufficiently rich and diverse exogenous data
sets. This is more of a "yet another example" than a new result. Techniques
such as k-anonymity, chaffing, adding noise/randomness, differential privacy,
etc significantly increase the computational cost and data requirements but
not intractably so. The proliferation of vast sensing/event data sources
provides a nearly bottomless supply of exogenous data suitable for the purpose.

Anonymity is more polite fiction than rigorous fact these days.

------
nixpulvis
It might really depend on what we consider "deanonymization". This article
takes a hopelessly naive approach, simply measuring how unique existing data
sets are (am I misunderstanding this?), which is not really the underlying
issue here.

Let me give a concrete example. Let's say you want to collect application
usage information from users of your app. You could (a) collect all the
information attached to an IP address, along with any other nominal
information, and store it, or (b) calculate the valuable usage metrics for the
entry, store them in some demographic bucket, and throw away any other
knowledge.
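Option (b) can be sketched as follows (the bucket keys and metric names are
hypothetical): the metric is computed at ingestion time and only a coarse
demographic bucket is retained, never the IP or the raw event.

```python
from collections import defaultdict

# Aggregates keyed by a coarse demographic bucket; the raw request
# (IP address, exact age, timestamps) is never stored.
buckets = defaultdict(lambda: {"sessions": 0, "total_seconds": 0})

def record_usage(age, country, session_seconds):
    bucket = ("{}0s".format(age // 10), country)   # e.g. ("30s", "US")
    buckets[bucket]["sessions"] += 1
    buckets[bucket]["total_seconds"] += session_seconds
    # No per-user identifier survives this function.

record_usage(34, "US", 120)
record_usage(37, "US", 300)
```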

The real problem is that companies want the full data so they can be free to
change their models at will. We as users should not expect data to be handled
properly, and cryptographic tools are our main means of addressing this
problem.

------
webmobdev
Apple, the ball is now in your court. Speak up, or stop spying on us through
your "anonymised" data collection to build, of all things, a better "ad
network" to exploit us with our own data!

------
boron1006
Absolutely idiotic article and title. Obviously it's _possible_ to reidentify
a person from 15 demographic attributes if you don't specify which attributes
are used. I can do even better: I can reidentify 100% of people with only
their name, DOB, fingerprints, and SSN. The fact that DOB and zip code are in
the dataset makes this result completely trivial.
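The uniqueness argument the article rests on is easy to demonstrate on toy
data (all values below are invented): count how many rows share each
quasi-identifier combination; a count of 1 pins down a single person.

```python
from collections import Counter

# Toy quasi-identifier tuples: (zip, DOB, sex).
records = [
    ("02139", "1985-03-12", "F"),
    ("02139", "1985-03-12", "F"),
    ("10001", "1990-07-04", "F"),
    ("94103", "1972-11-30", "M"),
]

counts = Counter(records)
# A record whose combination occurs exactly once is uniquely
# identified by those attributes alone.
unique = [r for r in records if counts[r] == 1]
fraction_unique = len(unique) / len(records)
```

With zip code and DOB present, most real-world rows fall into the
count-of-one case, which is exactly why the headline result is unsurprising.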

A couple years ago, I got into an argument on reddit where someone claimed
that any mapping could be recovered "using deep learning techniques" (e.g. if
you take 3*0 = 0, you can get back that the original value was 3 with no other
information except for the value "0"), and that obviously I was just too
stupid to understand deep learning if I couldn't see that.

~~~
AnthonyMouse
I mean, yes, some people are factually incorrect. But I think the general idea
is more like, if you have a massively over-determined system of linear
equations, you can omit many of the values and still be able to recover them
all from the remaining values and knowledge of the equations.
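A toy version of that intuition, using two published aggregates over hidden
values x and y (the numbers are invented): the "omitted" values fall straight
out of the released equations.

```python
# Suppose a release withholds x and y but publishes
#   x + y = 10   (a total)   and   x - y = 2   (a difference).
def solve_2x2(a11, a12, b1, a21, a22, b2):
    # Cramer's rule for the system a11*x + a12*y = b1, a21*x + a22*y = b2.
    det = a11 * a22 - a12 * a21
    x = (b1 * a22 - b2 * a12) / det
    y = (a11 * b2 - a21 * b1) / det
    return x, y

x, y = solve_2x2(1, 1, 10, 1, -1, 2)  # recovers x = 6.0, y = 4.0
```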

And it's not intuitively obvious which combinations of values allow you to
recover which other ones.

~~~
boron1006
For context, this was when ISPs were planning on selling data, and someone was
collecting donations saying they'd reidentify senators' internet histories. I
said that people shouldn't donate to them, because it wasn't even clear what
the ISPs would release. Their point was that it doesn't matter what the ISPs
release; they could reidentify anyone with deep learning.

> And it's not intuitively obvious which combinations of values allow you to
> recover which other ones.

I think it's pretty intuitive that Zip Code and DOB are identifiers. That's
why they count as such in HIPAA, and are used to demonstrate identity by
governments, credit cards, etc.

Personally I think this stuff just poisons the well when it comes to
discussions of privacy. I think the goal is to remove the expectation of
anonymity by claiming that it's never possible.

~~~
singron
> I think it's pretty intuitive that Zip Code and DOB are identifiers.

It's great that you think that, but basically no company uses that definition.
Most company privacy policies don't consider combinations of information when
making this determination. E.g. your billing address might be personal
information, but your zip code by itself might not. Similarly, IP address
(with or without last octet), wifi SSID, location data, browsing history (or
attributes derived from browsing history), and so on. Each individual piece of
data isn't enough to personally identify you, so the privacy policy often
doesn't have to be applied to it.

E.g. after reading the Google privacy policy[0], can you tell what protections
your zip code and DOB have? Will Google treat them as personal information or
personal identifiers or not?

0: [https://policies.google.com/privacy?hl=en-US](https://policies.google.com/privacy?hl=en-US)

