
Algorithm can pick out almost any American in supposedly anonymized databases - zoobab
https://www.nytimes.com/2019/07/23/health/data-privacy-protection.html
======
rzwitserloot
I'm a programmer in the GP data analysis world. We use the term
'pseudonymization' for this kind of data. 'Anonymization' is used solely to
refer to, say, 'the sum total of diabetes patients this practice has' (that
would be anonymous patient data; it would not be anonymous relative to the GP
office this refers to): Aggregated data that can no longer be reduced to a
single individual at all.

The term raises questions: Okay, so, what does it mean? How 'pseudo' is
pseudo? And that's the point: When you pseudonimize data, you must ask those
questions and there is no black and white anymore.

My go-to example to explain this is very simple: Let's say we reduce birthdate
info to just your birthyear, and geoloc info to just a wide area. And then I
have a pseudonimized individual who is marked down as being 105 years old.

Usually there's only one such person.
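
To make that concrete, here is a tiny sketch of the kind of check you'd want to
run on pseudonymized output (purely illustrative; the field names birth_year
and region are placeholders, not anything from a real schema):

    from collections import Counter

    def singleton_combos(records, keys=("birth_year", "region")):
        # Count how many records share each combination of the retained fields;
        # any combination that occurs exactly once still points at one person.
        combos = Counter(tuple(r[k] for k in keys) for r in records)
        return [combo for combo, n in combos.items() if n == 1]

If that list is non-empty, the 'pseudo' is doing very little work for those
people.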

I invite everybody who works in this field to start using the term
'pseudonimization'.

~~~
cheez
It doesn't roll off the tongue; perhaps 'pseudo-anonymization' is enough.

~~~
gervu
Pseudonymization already refers to reference by pseudonym.

Pseudonimization is bad terminology in that it's indistinct from the above, to
the point that parent has already mixed the two up while in the process of
recommending it. And it'd be worse verbally.

"Pseudo-anonymization" could work, but something like "breakable
anonymization" or "partial anonymization" might be better in that it's more
obvious to a reader and doesn't rely on familiarity with technical terminology
to convey the idea.

I'd go with breakable, myself, since it's most to the point about why it's a
problem.

Pseudo is etymologically correct, but that doesn't necessarily help us much
when the goal is speed and ease of understanding by a wide population of
readers.

Partial could work in the sense that you did part of the job, which people
would hopefully understand is a bit like having locked the back door for the
night while leaving the front propped wide open.

And there are probably other good options. If I were writing about this topic
often, I'd strongly consider brainstorming a few more and running a user test
where I ask random people to explain each term, then go with what consistently
gets results closest to what I'm trying to discuss.

~~~
hiccuphippo
ESL here, what's wrong with pseudonymization as a derivative of pseudonym? I
think it gets the point across. But if we want to stop beating around the
bush, what about "nameless identification"?

~~~
WhitneyLand
> _if we want to stop beating around the bush, what about "nameless
> identification"?_

The difference is the other proposed alternatives more directly suggest risk
is involved.

It's a nice ESL example because, technically, I don't think your suggestion is
wrong. In practice, I think few would infer its implications.

------
rectang
The great contribution of Differential Privacy theory is to quantify just how
little use you can get out of aggregate data before individuals become
identifiable.

Unfortunately, Differential Privacy proofs can be used to justify applications
which turn out to leak privacy when the proofs are shown to be incorrect after
the fact, when the data is already out there and the damage already done.

Nevertheless, it is instructive just to see how perilously few queries can be
answered before compromise occurs — putting the lie to the irresponsible idea
of "anonymization".

~~~
specialist
The part I can't wrap my head around is how to mitigate (future proof)
unforeseen leakage and correlations. The example I keep going back to is
deanonymizing movie reviews (chronologically correlating movie rentals and
reviews). And, frankly, I'm just not clever enough to imagine most attacks.

If nothing else, I appreciate the Differential Privacy effort, if only to show
the problem space is wicked hard.

I worked in medical records and protecting voter privacy. There's a lot of
wishful thinking leading to unsafe practices. Having better models to describe
what's what would be nice.

~~~
bo1024
> The part I can't wrap my head around [...] The example I keep going back to
> is deanonymizing movie reviews

The reason is that you are thinking of an example that's not nicely compatible
with differential privacy. The basic examples of DP would be something like a
statistical query: approximately how many people gave Movie X three stars? You
can ask a bunch of those queries, adding some noise, and be protected against
re-identification.

You can still try to release a noisy version of the whole database using DP,
but it will be very noisy. A basic algorithm (not good) would be something
like

    
    
    import random

    # One possible representation: ratings = {(person, movie): rating}.
    def noisy_release(ratings, keep_prob=0.02, stars=(1, 2, 3, 4, 5)):
        # With small probability keep the original rating; otherwise
        # replace it with a rating chosen uniformly at random.
        return {entry: (rating if random.random() < keep_prob
                        else random.choice(stars))
                for entry, rating in ratings.items()}

(A better one would probably compute a low-rank approximation, then add small
noise to that.)

------
shusson
The article title misses a bit of nuance from the paper, which is specifically
about re-identification.

E.g., from the paper:

"We show that, as a male born on July 31, 1945 and living in Cambridge
(02138), the information used by Latanya Sweeney at the time, William Weld was
unique with a 58% likelihood (ξx = 0.58 and κx = 0.77), meaning that Latanya
Sweeney’s re-identification had 77% chances of being correct. We show that, if
his medical records had included number of children—5 for William Weld—, her
re-identification would have had 99.8% chances of being correct!"

~~~
Freak_NL
What accounts for that remaining 2‰ of uncertainty?

~~~
Cynddl
Co-author here. We designed a statistical model, which is never 100% sure a
re-identification is correct. There is, e.g., a non-zero probability that two
individuals in the US share 5, 10, or even 15 demographic attributes.

~~~
mnky9800n
Can you provide a link to your paper?

~~~
Cynddl
The article is available here, in open access:
[https://www.nature.com/articles/s41467-019-10933-3](https://www.nature.com/articles/s41467-019-10933-3)

------
shusson
The paper referenced in the article:
[https://www.nature.com/articles/s41467-019-10933-3](https://www.nature.com/articles/s41467-019-10933-3)

------
shalmanese
While things like this sound scary, deanonymization turns out not to be
exceedingly impactful in practice. Most entities have no desire to deanonymize
you; to them, the most useful format is to treat you as a GUID with a bag of
attributes attached to it.

The entities that remain fall into two buckets: ones powerful enough that they
already have personally identifiable data without needing to deanonymize
anonymous data sets, and ones too small to have the capability to deanonymize.

If you're a government, you don't need to rely on anonymized data sets, you
have the sets with the labels already. If you're a stalker or internet troll
or whatever, it's far easier to just pay one of the PI websites $29 to get far
more data on a person than any deanonymized dataset will give you.

~~~
samfriedman
I think there are still threats that come from improperly anonymized data,
beyond my own government or a single stalker/harasser.

For one, the fact that this data is improperly anonymized makes it an easy
avenue for malicious nation-state actors to track, analyze, or destabilize the
population. If I were a government with an interest in freaking out the US
public, I could quite easily de-anonymize sensitive datasets and begin using
them for wide-scale harassment, identity theft, etc. on an automated basis.

The lowering of the bar makes it easier for Johnny Troublemaker to start
harassing people based on their PII as well. Instead of paying for the data,
just download some datasets and run a Julia notebook against them. Maybe not
much changes for the targeted stalking case, but now you can cast a wide net
when looking for someone to mess with.

~~~
shalmanese
Malicious nation state actors can easily access PII without the need for
deanonymization, simply by buying it on the open market.

The number of Johnny Troublemakers who are randomly spraying hate based on PII
is about the same as the rate of people throwing rocks off highway overpasses
onto cars below. It's simply not a significant enough problem to be worth
worrying about.

------
Zenst
If they can pick out individuals from the data, then the data is not
anonymized. Sure, they may have unassociated data spread across unassociated
records, but if an algorithm can pick it out, then so could a human (though
with way more effort). That, for me, is not anonymized data.

~~~
AlexandrB
> That for me is not anonymized data.

What matters is that this is what most online companies (and their terms of
service) would call anonymized data.

~~~
q-base
If they operate in Europe then I am pretty sure the GDPR legislation is pretty
straightforward here. If you can de-anonymize the data then it is by
definition not anonymized.

~~~
Shaanie
That seems correct.

From an article on the subject:

>Recital 26 of the GDPR defines anonymized data as “data rendered anonymous in
such a way that the data subject is not or no longer identifiable.” Although
circular, this definition emphasizes that anonymized data must be stripped of
any identifiable information, making it impossible to derive insights on a
discrete individual, even by the party that is responsible for the
anonymization.

~~~
shusson
It is not clear to me whether this covers re-identification.

~~~
Zenst
I'd say that is pretty clear-cut with "making it impossible to derive insights
on a discrete individual"

If it is possible, it's not anonymous per GDPR's definition and that is what
counts.

~~~
shusson
> "making it impossible to derive insights on a discreet individual"

doesn't clarify how much information you already have about the individual.
There is a distinction between being able to identify someone without any
prior knowledge about them vs re-identifying them. I don't think the GDPR is
clear about that.

~~~
Zenst
Interesting - would an example of what you outline be, say, a digital voice
recording? Unless you know who is in the recording, you have no way to
associate that digital data with an individual.

Would that example fall within the remit you outline and, as such, skirt the
whole GDPR aspect?

------
pfortuny
In any database whose fields have at least 4 meaningful ranges, 16 fields give
4^16 ≈ 4.3 billion possible combinations (I'm on my iPhone, so forgive the
rough numbers). As long as the ranges are meaningful (i.e. they divide the
group into somewhat even parts), the individuating possibilities are HUGE. And
fields usually have many more than 4 ranges.

'Anonymizing' a dataset is a weasel term.

The database is secure or it is not. As any database is quite likely insecure,
we are doomed.

~~~
jobigoud
An anonymized version would be one where you have a tally for each
value/category in each field, without any correlating table between the
fields. Only store the histograms.
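
As a rough sketch of that (purely illustrative; it assumes each record is a
dict of already-bucketed field values, e.g. {"age": "30-39", "zip": "021**"}):

    from collections import Counter

    def histograms_only(records):
        # Keep only a per-field tally of values; the joint table that links
        # fields together within a row is discarded entirely.
        tallies = {}
        for row in records:
            for field, value in row.items():
                tallies.setdefault(field, Counter())[value] += 1
        return tallies

Cross-field correlations are gone, which is exactly what makes it anonymous
and, as the reply below notes, exactly what makes it less useful to
advertisers.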

~~~
pfortuny
Well, yes, but then that is not what advertisers want... That is the thing.

Whenever a large enough database exists with individual data, we are doomed.

------
richmarr
There's a startup in London called Synthesized working on part of this problem
space.

Given a source dataset they create a synthetic dataset that has the same
statistical properties (as defined at the point the synthetic dataset is
created).

I've seen a demo; it's pretty slick:
[https://synthesized.io/](https://synthesized.io/)

------
polskibus
Would differential privacy fix this problem? I heard that the new US census
will use it.

~~~
majos
Yes, in the sense that the output of a differentially private protocol has
mathematical guarantees against re-identification, regardless of the
computational power or side information an adversary has.

There are caveats. The exact strength of the privacy guarantee depends on the
parameters you use and the number of computations you do, so simply saying "we
use a differentially private algorithm" doesn't guarantee privacy in
isolation.
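
As a rough worked example (assuming basic sequential composition, where the
epsilons simply add up): answering 1,000 counting queries at epsilon = 0.01
each can consume a total budget of epsilon = 10, a far weaker guarantee than
any single query suggests.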

~~~
shusson
do you have some examples?

~~~
majos
Of a differentially private algorithm? Frank McSherry (one of the authors of
the original differential privacy paper) has a nice blog post introducing the
idea and giving many examples with code [1].

Or even more briefly, if you want to know how many people in your database
have characteristic X, you can compute that number and add Laplace(1/epsilon)
noise [2] and output the result. That's epsilon-differentially private. In
general, if you're computing a statistic that has sensitivity s (one person
can change the statistic by at most s), then adding Laplace(s/epsilon) noise
to the statistic makes it epsilon-differentially private (see e.g. Theorem 3.6
here [3]). The intuition is that, by scaling the added noise to the
sensitivity, you cover up the presence or absence of any one individual.
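
For concreteness, a minimal sketch of that counting-query case (illustrative
only; the record format and the has_characteristic predicate are placeholders,
and numpy's Laplace sampler stands in for a hardened DP implementation):

    import numpy as np

    def private_count(records, has_characteristic, epsilon=0.1):
        # A counting query has sensitivity 1: adding or removing one person
        # changes the count by at most 1, so Laplace(1/epsilon) noise is
        # enough for epsilon-differential privacy.
        true_count = sum(1 for r in records if has_characteristic(r))
        return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

Smaller epsilon means more noise and stronger privacy; repeated queries eat
into the budget, which is the caveat mentioned upthread.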

[1]
[https://github.com/frankmcsherry/blog/blob/master/posts/2016...](https://github.com/frankmcsherry/blog/blob/master/posts/2016-02-06.md)

[2]
[https://en.wikipedia.org/wiki/Laplace_distribution](https://en.wikipedia.org/wiki/Laplace_distribution)

[3]
[http://cis.upenn.edu/~aaroth/privacybook.html](http://cis.upenn.edu/~aaroth/privacybook.html)

~~~
shusson
Thanks for the links. I'm still a little confused by how differential privacy
can be applied to non-aggregated fields. Can differentially private algorithms
also be applied to mask/anonymise non-aggregated fields?

~~~
majos
You could, but if your statistic is a function of one person's data,
differential privacy will force you to add enough noise to mask that one
person's data, i.e. destroy almost all of the utility of the statistic.

It's possible to learn something by aggregating a bunch of those individually-
privatized statistics. Randomized response [1] is a canonical example. More
generally, local differential privacy is a stronger privacy model where users
privatize their own data before releasing it for (arbitrary) analysis. As you
might expect, the stronger privacy guarantee means worse utility, sometimes
much worse [2].
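
A toy version of randomized response (a sketch of the classic coin-flip scheme
only, not tuned to any particular epsilon):

    import random

    def randomized_response(true_answer: bool) -> bool:
        # Answer truthfully with probability 1/2; otherwise answer uniformly
        # at random. Each reported bit is ln(3)-differentially private, yet
        # the true proportion p of "yes" answers is still estimable from many
        # reports: p ≈ 2 * (fraction reporting "yes") - 1/2.
        if random.random() < 0.5:
            return true_answer
        return random.random() < 0.5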

[1]
[https://en.wikipedia.org/wiki/Randomized_response](https://en.wikipedia.org/wiki/Randomized_response)

------
the_seraphim
We are going to have to get used to the fact that it will soon be impossible
to stay in the shadows, and start to develop a legal and moral framework
around that.

------
TheBobinator
When information you are receiving has been "anonymized", how do you tell
whether it is accurate? If you are the person collecting and storing said
sensitive information, you will know what is and is not accurate, but nobody
else does.

For some kinds of information, like medical records, it is deadly for the
information to be inaccurate, but also deadly for it to be accurate and
public. Once the information leaks, employers might decide not to hire
high-risk people or insurers might decide to pass over certain people as too
costly.

I'm of the opinion that "anonymizing" data is something that enables grifting:
it enables the collectors to placate the people they are pulling data from,
and it allows the grifters to argue that the information they have means
nothing.

Ultimately, I think these organizations should be making sure their
information is absolutely accurate, and we should have laws in place, with
severe criminal penalties, against the unauthorized transfer or use of said
information. I would even go so far as to say things like cell phone location
records should be fully public as a matter of the law.

Now when you want to get those records, you go to a government website for the
"hunt and poke" stuff (e.g. where are my kids going, is my wife spending time
with another lover, how long is my commute on average, or where was I three
years ago on a given day - all sorts of useful questions); the access records
are public too.

If you want to study them, you sign an NDA saying you won't, under penalty of
severe criminal prosecution, leak the information or use it for criminal
purposes. Anyone found having the data without a signed government NDA = an
instant 20-year prison sentence plus felony conviction.

This way, if, for example, someone signs the NDA and goes on to offer services
to executives to help them cherry-pick staff, not only does the person
offering the under-the-table services go to jail, but the executive does as
well.

When you criminalize certain things, then give the public all the information
and tools to do as they see fit, the law works. It's a lot easier to prosecute
a company executive for cherry-picking staff with insurance data when the data
is well-labeled. It is also a lot easier to sue them when you have an access
record that says someone under their employ checked how often you go to a
clinic or night club via your cellphone records.

The problem is not going away anyway, and "anonymizing" data to placate our
sense of morality isn't going to help. There is no easy technical solution,
but if the thinking is not to anonymize but instead track and enforce who has
access, things change drastically.

~~~
raxxorrax
You should be able to request that any personal data be deleted. If the
company in question leaks anything, they are responsible. Here the bonkers
American-style punishments might actually be the way to go.

That aside, I would like the option that says "do not collect the data". It
wouldn't even be hard.

Sure, there is knowledge and there are advantages in that data, but that
doesn't even come close to the benefits of privacy. Think the general public's
opinion about X is pretty stupid? If so, you'll need it too.

~~~
TheBobinator
The idea that you don't need to trust society and the government is very
attractive when ridiculous abuse has been done to you and you've had to manage
the fallout.

------
point78
Paywall...

------
tialaramex
If people tell you they're collecting data for statistical purposes, then one
of three things should apply:

1. They should deliberately introduce noise into the raw data. Nazis with the
raw census data can spend all month trying to find the two 40-something Jews
the data says live on this island of 8400 people, but they were just noise.
Or were they? No way to know.

2. They should bucket everything and discard all raw data immediately. This
hampers future analysis, so the buckets must be chosen carefully, but it is
often enough for real statistical work, and often you could just collect data
again later if you realise you needed different buckets.

3. They shouldn't collect _anything_ personally identifiable. This is hard
because it could be almost anything at all. If you're 180cm tall your height
doesn't seem personally identifiable, but ask Sun Mingming. If you own a Honda
Civic then your model of car doesn't seem personally identifiable, but ask
somebody in a Rolls-Royce Wraith Luminary...

~~~
GhostVII
> They shouldn't collect _anything_ personally identifiable

Why not just ensure that any personally identifiable data is properly
bucketed, and discarded if it is too strongly identifying? If you are storing
someone's height, age, and gender, you can just increase the bucket size for
those fields until every combination of identifiable fields occurs several
times in the dataset. If there are always a few different records with
well-distributed values for every combination of identifiable fields, you
can't infer anything about an individual based on which buckets they fall
into.
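
A rough sketch of that widening idea (purely illustrative: it coarsens only an
assumed integer "age" field, uses made-up field names, and stops once every
(age bucket, gender) combination occurs at least k times):

    from collections import Counter

    def widen_until_k_common(records, k=5, start_width=5, max_width=80):
        # Coarsen the age field until every (age_bucket, gender) combination
        # occurs at least k times; otherwise drop the rare combinations.
        width = start_width
        while width <= max_width:
            bucketed = [((r["age"] // width) * width, r["gender"]) for r in records]
            counts = Counter(bucketed)
            if all(c >= k for c in counts.values()):
                return bucketed
            width *= 2
        return [b for b in bucketed if counts[b] >= k]

As the reply below points out, this is close in spirit to k-anonymity, with
its known caveats.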

~~~
majos
Not a bad idea! It sounds pretty similar to k-anonymity [1], which is not a
terrible privacy heuristic. But it does have some specific weaknesses.
Wikipedia has a good description.

> Homogeneity Attack: This attack leverages the case where all the values for
> a sensitive value within a set of k records are identical. In such cases,
> even though the data has been k-anonymized, the sensitive value for the set
> of k records may be exactly predicted.

> Background Knowledge Attack: This attack leverages an association between
> one or more quasi-identifier attributes with the sensitive attribute to
> reduce the set of possible values for the sensitive attribute.

Optimal k-anonymization is also computationally hard [2].

[1]
[https://en.wikipedia.org/wiki/K-anonymity](https://en.wikipedia.org/wiki/K-anonymity)

[2]
[https://dl.acm.org/citation.cfm?id=1055591](https://dl.acm.org/citation.cfm?id=1055591)

------
mv4
Effective de-identification is much harder than people think.

------
zyztem
Article picture is an IBM iDataPlex system.

------
ptah
Where's the link to the code?

~~~
shakna
The paper suggests it can be accessed on the site [0]; however, certain parts
of the site only appear if you run through their questionnaires.

> The source code to reproduce the experiments is available at
> [https://cpg.doc.ic.ac.uk/individual-
> risk](https://cpg.doc.ic.ac.uk/individual-risk), along with documentation,
> tests, and examples.

As far as I can tell, the source code is not available, at least not from
where the authors suggest.

[0] [https://cpg.doc.ic.ac.uk/individual-
risk/](https://cpg.doc.ic.ac.uk/individual-risk/)

~~~
Cynddl
Co-author here, we will add the source code very soon. A little swamped with
the press coverage, but the source code in Julia+Python is coming.

------
akamor
There are also way too many people with access to non-anonymized data, e.g.
the development team that has read privileges on the production database, or
employees at Uber spying on customers
([https://www.theguardian.com/technology/2016/dec/13/uber-
empl...](https://www.theguardian.com/technology/2016/dec/13/uber-employees-
spying-ex-partners-politicians-beyonce)).

Edit: shameless plug - check out tonic.ai for a solution to the above problem.

