
Simulating identification by zip code, sex, and birthdate - throwawaymath
https://www.johndcook.com/blog/2018/12/07/simulating-zipcode-sex-birthdate/
======
nkurz
While this is intended as just a proof of concept, I think this technique is
mathematically flawed. You can't just use the average number of people in each
zipcode, because if there is great variation in the number of people per
zipcode, a randomly sampled person is more likely to be in a "large" zipcode
than a small one.

Consider the case where we have 100000 people and 10 zipcodes, for an average
of 10000 people each. Now suppose zipcodes 0-8 each have only one person, and
zipcode 9 has the remaining 99991 people. What percentage of Americans can be
reliably identified by sex, zipcode, and full birthdate?

I don't know the exact answer, but I'm pretty sure it's closer to .01% than
87%. The 9 people in the 9 small zipcodes are reliably identifiable, and the
99.99% of the population in the large 10th zipcode are essentially not
identifiable. While the actual population distribution by zipcode is surely
not this extreme, I think this example shows that the question cannot be
answered unless a specific population distribution is known (or assumed).
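
A rough Python sketch of this effect, under assumptions that are not in the comment: every (birthdate, sex) combination is equally likely, and there are 365 x 78 x 2 of them (the 78-year span is a guess). It computes the expected share of people who share their combination with nobody else in their zipcode, for equal-sized zipcodes and for the extreme split above.

```python
# Assumption (not from the comment): 365 birthdays x 78 birth years x 2 sexes,
# all equally likely, gives the number of (birthdate, sex) combinations.
N_BINS = 365 * 78 * 2


def expected_unique_fraction(zip_populations, n_bins=N_BINS):
    """Population-weighted chance that a random person is unique in their zip."""
    total = sum(zip_populations)
    p_miss = 1.0 - 1.0 / n_bins  # chance one neighbour lands in a different bin
    # A person in a zip of n people is unique if all n - 1 neighbours miss.
    unique = sum(n * p_miss ** (n - 1) for n in zip_populations)
    return unique / total


equal = [10_000] * 10        # ten zipcodes of equal size
skewed = [1] * 9 + [99_991]  # the extreme example above

print(f"equal:  {expected_unique_fraction(equal):.3f}")
print(f"skewed: {expected_unique_fraction(skewed):.3f}")
```

Under these assumptions the skewed split comes out far below the equal split, though not at zero: even a zipcode of roughly 100,000 people leaves some combinations with a single occupant.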

Actual distribution of population can be seen here:
[http://proximityone.com/zip-place.htm](http://proximityone.com/zip-place.htm)
(click on Population twice to sort by most populous). At a glance, there are
quite a few zipcodes with far more than 10,000 people, making me think this is
going to skew the results considerably.

~~~
pinneycolton
I work with this type of data and I assure you that the results are quite
plausible. The original hypothesis was tested against US census data. See
"Experiment B" here:

[https://dataprivacylab.org/projects/identifiability/paper1.p...](https://dataprivacylab.org/projects/identifiability/paper1.pdf)

I'll add that there are far fewer live births, per day, in the US than there
are zip codes. I agree that some highly populated areas are problematic,
but this may be the only reason that 87.1% number isn't 100%!

~~~
nkurz
It may have gotten lost in the revisions of my comment, but I was consciously
_not_ trying to argue that Sweeney's numbers were wrong, only that Cook's
explanation was lacking since it doesn't discuss distribution. I hadn't yet
looked at the paper.

That said, looking at the paper you linked now, I don't see how Cook's
simulation (simulation, not explanation) and Sweeney's paper can both be
correct. Cook got 84-85% identifiable assuming uniform age distribution and
identical population per zipcode. At the bottom of Figure 14, Sweeney says 87%
for the US as a whole.

Shouldn't any non-uniformity (of zip code population or age clustering) act
only to reduce the percent of the population that is identifiable? That is,
shouldn't Cook's simulation with flat age distribution and equal zipcode
populations be an upper bound on identifiability? Since Cook's simulation code
looks fine, this makes me suspect that there's something off about Sweeney's
analysis.
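
One way to check the upper-bound intuition is a short derivation. The model here is an assumption, not something taken from either Cook's post or Sweeney's paper: zip code i has n_i residents, each falling independently and uniformly into one of k (birthdate, sex) combinations, and F is the expected identifiable fraction.

```latex
% Assumed model: Z zip codes, n_i residents in zip i, k equally likely
% (birthdate, sex) combinations, N = sum of n_i, \bar n = N / Z.
\begin{align*}
P(\text{resident of zip } i \text{ is unique}) &= r^{\,n_i - 1}, \qquad r = 1 - \tfrac{1}{k} \\
F &= \frac{1}{N} \sum_{i=1}^{Z} g(n_i), \qquad g(n) = n\, r^{\,n-1} \\
g''(n) &= r^{\,n-1} \ln r \,\bigl(2 + n \ln r\bigr) < 0
  \quad \text{when } n < \frac{2}{\lvert \ln r \rvert} \approx 2k \\
F &= \frac{1}{Z \bar n} \sum_{i=1}^{Z} g(n_i) \;\le\; \frac{g(\bar n)}{\bar n} = r^{\,\bar n - 1}
  \quad \text{(Jensen, if every } n_i < 2/\lvert \ln r \rvert \text{)}
\end{align*}
```

So under this model, equal-sized zip codes do give an upper bound, as long as no zip code holds more than roughly twice as many people as there are combinations. Beyond that size g is no longer concave and the bound is not guaranteed.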

Is the 87% perhaps an average of the state percentages, and not properly
weighted by state population? Or maybe an average across age classes not
weighted by population of that class? Oh, I don't know about those, but maybe
I see a bigger issue now...

In Section 4.3.1, Sweeney defines the "Number of subjects uniquely identified
in a subdivision of a geographical area". But this isn't a simulation like
Cook did, she's just using a binary yes/no depending on whether the
subpopulation in each age class exceeds a numerical threshold:

    
    
      if population(zi, a) ≥ |Qa|, then ID_aZi = population(zi, a)
      else ID_aZi = 0.
    

While it's nice that it's clearly defined, I don't think this yields a
"percent identifiable" that matches up with Cook's simulation, nor with any
common usage of the term. Also (while I'm being picky) isn't the definition
backward? If we were to go with this arbitrary definition, wouldn't we want
Id_zi to equal zero when the population is _less_ than the threshold? I
presume the direction of equality is just a typo, but if the paper is using a
hard threshold rather than some more rigorous approach like Cook's simulation,
this seems like a major flaw in interpretability of the results.
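
To make the difference concrete, here is a small Python sketch comparing a hard threshold with an expected count of uniques. It is not the paper's method: treating the threshold as the number of (birthday, sex) values for a single year of age is an assumption, the threshold is applied in the "less than or equal" direction discussed above, and both function names are made up for illustration.

```python
N_BINS = 365 * 2  # assumed: birthdays x sexes within a single year of age


def threshold_count(population, n_bins=N_BINS):
    """Hard threshold: count the whole subpopulation as identified when it is
    no larger than the number of possible values, otherwise count nobody."""
    return population if population <= n_bins else 0


def expected_unique_count(population, n_bins=N_BINS):
    """Expected number of people who share their value with nobody else,
    assuming values are assigned independently and uniformly at random."""
    return population * (1.0 - 1.0 / n_bins) ** (population - 1)


for pop in (100, 700, 800, 3000):
    print(pop, threshold_count(pop), round(expected_unique_count(pop), 1))
```

The threshold jumps from everyone to no one as the subpopulation crosses the cutoff, while the expected count changes smoothly, which is why the two need not produce the same "percent identifiable".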

~~~
nkurz
Now that the edit window is over, I finally noticed the massive error in my
wording in the second to last sentence. Instead, let's pretend I wrote "why
would we want Id_zi to equal zero when the population is _less_ than the
threshold?" It would also be good to note again that I haven't read the paper
closely, and very well might be misinterpreting what it is doing.

---

But since I'm still in this edit window, I'll add an update here. I downloaded
the per zip population data from here:
[https://blog.splitwise.com/2013/09/18/the-2010-us-census-pop...](https://blog.splitwise.com/2013/09/18/the-2010-us-census-population-by-zip-code-totally-free/).
Then I wrote a quick Perl program (parallel to
Cook's Python simulation) but using the actual per zipcode populations rather
than a fixed average. After confirming Cook's 84% number with the fixed
population, I ran it on the actual populations (but still with a flat
distribution for age and sex) and got 63% uniques.
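
For anyone who wants to try something similar, a minimal Python sketch of this kind of simulation follows. It is not the Perl program, and the populations listed are hypothetical placeholders; a real run would read one population per zipcode from the census file. The 365 x 78 x 2 flat space of (birthdate, sex) values is also an assumption.

```python
import random
from collections import Counter

N_BINS = 365 * 78 * 2  # assumed flat (birthdate, sex) space; 78 years is a guess


def simulate_unique_fraction(zip_populations, n_bins=N_BINS, seed=0):
    """Fraction of people whose (birthdate, sex) is unique within their zip."""
    rng = random.Random(seed)
    unique = 0
    total = 0
    for n in zip_populations:
        # Draw one (birthdate, sex) value per resident of this zipcode.
        counts = Counter(rng.randrange(n_bins) for _ in range(n))
        unique += sum(1 for c in counts.values() if c == 1)
        total += n
    return unique / total


# Hypothetical populations, for illustration only.
populations = [500, 2_000, 10_000, 40_000, 90_000]
print(f"{simulate_unique_fraction(populations):.3f}")
```

Passing a list of identical populations reproduces the fixed-average setup, so the same function can be used for both runs.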

Presumably this number would drop somewhat further with actual age
distributions, but I don't know how far exactly. My current belief is that
Sweeney's paper does a good job of calling attention to the fact that risk of
identification is high, but the methodology and exact numbers should not be
trusted. The actual percentage of Americans identifiable by (zip, dob, sex) is
large, but something less than 63%. It might be interesting to run the
simulation with actual age bracket data, but I didn't find this in any easy to
download format, so I think I'll stop here.

------
throwawaymath
I editorialized the title a bit - the author's title is "Simulating
identification by zip code, sex and birthdate." Instead I've used the salient
conclusion of the article, which is pretty interesting.

This article is a footnote to another front page submission by the same
author, "No funding for uncomfortable results." While that article is
sobering, I find this one to be technically cooler. The author uses a simple
probabilistic calculation - alongside a Python simulation - to demonstrate how
87% of Americans are uniquely identifiable, given these three points of data.

Not a novel finding, but I like the way the author demonstrates the conclusion
analytically and programmatically.

~~~
jMyles
I think that your alteration of the title is good, but why did you change
"sex" to "gender"?! The study appears to be about sex as a matter of public
record, so you've added inaccuracy, or at least ambiguity, to the title.

~~~
dlhavema
Is sex male/female and gender "what I identify as"? I've always thought of
them in this context as meaning the same thing...

~~~
seandougall
Biological sex is most often male or female, but not always. People can be
born intersex (with genitalia that are neither clearly male nor female, which
is often “corrected” surgically at birth, and/or with genitalia that don’t
match what’s expected from their chromosomes, and/or with chromosome
combinations other than XX or XY). Chimerism can result in different parts of
the body having different sex chromosomes. It’s not strictly monolithic or
binary.

That aside, yes, “sex” broadly refers to biology in some way, though often
oversimplified, while “gender” refers to identity and expression (which also
don’t necessarily go hand in hand).

------
kolinko
Fun fact, in Poland (population ~38M), every person has a public identifier in
a format of:

YYMMDDXXXXC

Where YYMMDD is date of birth, XXXX a number, and C a checksum.

[https://en.wikipedia.org/wiki/PESEL#Format](https://en.wikipedia.org/wiki/PESEL#Format)

It works very well, and - since it's for identification, not authentication -
you don't have to worry about it leaking out too much, like you would with
SSN.
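
The format above can be sketched in Python as follows. The checksum weights are not in the comment; they are the ones given in the public PESEL description, and the field names are made up for illustration. The sketch only splits the fields and verifies the check digit, and does not decode the century or validate the date.

```python
WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)  # from the public PESEL description


def pesel_checksum(first_ten: str) -> int:
    """Check digit for the first ten digits (YYMMDD + XXXX)."""
    total = sum(int(d) * w for d, w in zip(first_ten, WEIGHTS))
    return (10 - total % 10) % 10


def parse_pesel(pesel: str) -> dict:
    """Split a PESEL into YYMMDD, XXXX and C, and verify the checksum."""
    if len(pesel) != 11 or not (pesel.isascii() and pesel.isdigit()):
        raise ValueError("PESEL must be exactly 11 digits")
    return {
        "birth_yymmdd": pesel[:6],      # YYMMDD
        "serial": pesel[6:10],          # XXXX
        "check_digit": int(pesel[10]),  # C
        "checksum_ok": pesel_checksum(pesel[:10]) == int(pesel[10]),
    }


# A commonly used sample number, not a real person's identifier.
print(parse_pesel("44051401359"))
```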

~~~
garaetjjte
Sadly, it is used for authentication sometimes. Name, PESEL and mother's
maiden name are sometimes enough to authenticate on a bank helpline.

------
ACow_Adonis
I've worked in probabilistic data-linking quite substantially over the past ten
years. See
[https://github.com/DJMelksham/IcarusDataLinkingSystem](https://github.com/DJMelksham/IcarusDataLinkingSystem)
for example.

While the principle is broadly correct, in real life you've got to worry about
a host of other things.

Firstly, while you can come up with a theoretical number of persons who are
uniquely identified, it's harder to establish who IS uniquely identified. This
might seem like splitting hairs but it's quite fundamental: imagine you knew
you could identify 50% of the population uniquely. Unless you know which 50%
of the population, have you really identified anyone? Clearly knowing
someone's sex, age and location gives me analytical information about them I
can use to make predictions (a 5-year-old female in California is going to be
fundamentally different from an 85-year-old male in Alabama), but is it really
identification yet, and do we even need identification to make useful
predictions?

Secondly, what you presumably care about in 'identification' is not 'single
source' probability issues per se. In the real world what most people care
about is multiple-source identifiability: not the odds of one piece of data
uniquely identifying someone (just by collecting gender I might eventually
uniquely identify someone in some remote geography for example), but the odds
of uniquely identifying someone in TWO (or more) data sources that were
previously 'unlinked', because that's what's required to expand your current
information set (you didn't know this person was identifiable on two data
sources, and now you do, so you can bring together more information than what
you already held originally).

This second point is extremely important, because in the real world, you have
to worry about transcription errors, recording formats, corruption, changes
over time, and the scope of the two data sources. Most people do not get to
work with complete, accurate censuses of a population at two points in time,
where that population doesn't change.
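
As a toy illustration of the two-source problem (not how any real linking system works), the Python sketch below links records only when their (zip, sex, dob) key occurs exactly once in each source. All records, field names, and both helper functions are hypothetical. One record in the second source has a transposed digit in the birthdate and another has a new zip code, so neither links.

```python
from collections import Counter


def key(record):
    """Linking key used in this sketch: exact match on zip, sex and dob."""
    return (record["zip"], record["sex"], record["dob"])


def link_unique(source_a, source_b):
    """Pair records whose key occurs exactly once in each source."""
    counts_a = Counter(key(r) for r in source_a)
    counts_b = Counter(key(r) for r in source_b)
    by_key_b = {key(r): r for r in source_b}
    return [
        (r, by_key_b[key(r)])
        for r in source_a
        if counts_a[key(r)] == 1 and counts_b.get(key(r)) == 1
    ]


# Hypothetical data: the same three people recorded in two systems.
source_a = [
    {"zip": "90210", "sex": "F", "dob": "1980-02-03", "visit": "a1"},
    {"zip": "90210", "sex": "M", "dob": "1975-07-19", "visit": "a2"},
    {"zip": "10001", "sex": "F", "dob": "1990-11-30", "visit": "a3"},
]
source_b = [
    {"zip": "90210", "sex": "F", "dob": "1980-02-03", "name": "A"},
    {"zip": "90210", "sex": "M", "dob": "1975-07-91", "name": "B"},  # typo in dob
    {"zip": "10002", "sex": "F", "dob": "1990-11-30", "name": "C"},  # moved
]

links = link_unique(source_a, source_b)
print(len(links), "of", len(source_a), "records linked")
```

All three people are unique on (zip, sex, dob) within each source, yet exact matching recovers only some of them, which is the gap between theoretical and practical identification rates.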

From my own practical experience, something like the zip code, sex and
birthdate combination is powerful, and yes, you'll be able to uniquely
identify some people with such information (especially in smaller geographical
areas), but the practical rate will be far, far lower than 87%. But for many
modelling purposes, it doesn't need to be spectacularly accurate to be useful
anyway.

------
sctb
Discussion of the previous post here:
[https://news.ycombinator.com/item?id=18637703](https://news.ycombinator.com/item?id=18637703).

------
incompatible
What is the identification for? To keep track of individuals over time? If so,
it will fail for everybody who moves and changes zip code.

------
lxe
I like when they go into the simulation/computational solutions before showing
the analytical solution. Even much harder problems often have very simple
computational or simulation "solutions".

~~~
wiz21c
Especially when the analytical solution is so hard to write that it requires
approximations. I always wondered why we spend so much time learning
analytical methods when, once faced with real problems, we have to shortcut
them with various empirical approximations...

------
kayaknutrient
The birthday adds simply too much information - which is why I mostly
obfuscate my birthdate.

~~~
cpburns2009
Do you happen to have been born on Jan. 1, a random number of years ago, too?

~~~
nerflad
1970 -- The Unix Epoch, of course! I imagine a lot of us here use that one.

~~~
packet_nerd
Not me, because that would be identifying myself as a relatively privacy
conscious nerdy type. Combine that info with one or two other datapoints, and
I'm sure it could be used to identify someone.

------
gitgud
This could probably be extrapolated to the entire western world...

------
alexnewman
surprised that gender is that useful for this

~~~
javitury
I think it's just about anonymity in research data. These are the most common
variables across datasets.

More and more data is being released, especially government-funded data in the
US. Anonymity of respondents is a big deal. Would you like everyone to know
your debt? Or your salary?

If people knew that they would be identified and their answers made public,
they might lie or refuse to participate.

As a researcher you want truthful data about a representative sample of
everyone. If you exclude people sensitive to privacy, the sample is not
representative.

There is a trade-off between respondent anonymity, usefulness of data and
complexity. Take for instance the "Survey of Consumer Finances" released by
the Federal Reserve. It provides quite a lot of variables, which is useful, but
it uses multiply imputed observations to provide anonymity. However, it's a pain
in the ass to deal with multiply imputed data :(
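
Part of the pain is that every estimate has to be computed once per imputed copy of the data and then combined. A minimal Python sketch of the usual combining step (Rubin's rules) is below. The function name and the numbers are made up for illustration, and this is not any survey's official tooling.

```python
from statistics import mean, variance


def pool_estimates(estimates, variances):
    """Combine per-copy estimates and their sampling variances (Rubin's rules)."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)       # average sampling variance within copies
    between = variance(estimates)  # sample variance of estimates across copies
    total_var = within + (1 + 1 / m) * between
    return pooled, total_var


# Hypothetical results from five imputed copies of the same dataset.
estimates = [10, 12, 11, 13, 9]
variances = [4, 4, 4, 4, 4]
pooled, total_var = pool_estimates(estimates, variances)
print(pooled, total_var)
```

The between-copy term is what a naive analysis of a single copy would miss, so standard errors from one copy alone tend to be too small.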

