
87% of the U.S. Population are uniquely identified by {DOB, gender, zip} - jessekeys
http://latanyasweeney.org/work/identifiability.html
======
jdp23
Here's some backstory, from the 1990s when the Massachusetts Group Insurance
Commission released "anonymized" health data:

''At the time GIC released the data, William Weld, then Governor of
Massachusetts, assured the public that GIC had protected patient privacy by
deleting identifiers. In response, then-graduate student Sweeney started
hunting for the Governor’s hospital records in the GIC data. She knew that
Governor Weld resided in Cambridge, Massachusetts, a city of 54,000 residents
and seven ZIP codes. For twenty dollars, she purchased the complete voter
rolls from the city of Cambridge, a database containing, among other things,
the name, address, ZIP code, birth date, and sex of every voter. By combining
this data with the GIC records, Sweeney found Governor Weld with ease. Only
six people in Cambridge shared his birth date, only three of them men, and of
them, only he lived in his ZIP code. In a theatrical flourish, Dr. Sweeney
sent the Governor’s health records (which included diagnoses and
prescriptions) to his office.''

Source: [http://arstechnica.com/tech-policy/news/2009/09/your-
secrets...](http://arstechnica.com/tech-policy/news/2009/09/your-secrets-live-
online-in-databases-of-ruin.ars)
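
A minimal sketch of the kind of linkage the quoted passage describes: join a named dataset and an "anonymized" one on ZIP code, birth date, and sex. All records, field names, and the `link` helper below are invented for illustration; this is not Sweeney's actual procedure or data.

```python
from collections import defaultdict

# Hypothetical toy data; every name and value here is made up.
voter_roll = [
    {"name": "Alice Adams", "zip": "02138", "dob": "1950-01-02", "sex": "F"},
    {"name": "Bob Brown", "zip": "02138", "dob": "1948-03-04", "sex": "M"},
    {"name": "Carl Clark", "zip": "02139", "dob": "1948-03-04", "sex": "M"},
]
health_records = [
    {"zip": "02138", "dob": "1948-03-04", "sex": "M", "diagnosis": "example-dx-1"},
    {"zip": "02139", "dob": "1950-01-02", "sex": "F", "diagnosis": "example-dx-2"},
]

QUASI_IDENTIFIERS = ("zip", "dob", "sex")

def link(named, anonymized, keys=QUASI_IDENTIFIERS):
    """Return (name, record) pairs where exactly one named person matches."""
    index = defaultdict(list)
    for person in named:
        index[tuple(person[k] for k in keys)].append(person["name"])
    matches = []
    for record in anonymized:
        names = index.get(tuple(record[k] for k in keys), [])
        if len(names) == 1:  # unambiguous re-identification
            matches.append((names[0], record))
    return matches

reidentified = link(voter_roll, health_records)
for name, record in reidentified:
    print(name, "->", record["diagnosis"])
```

Only the first health record is linked here, because the second has no voter with the same three values; real data would also need cleaning and fuzzy matching.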

~~~
russell
I had to deal with the converse of the problem when I was working for a
company that ran the vehicle smog checking for the state of Illinois. A
driver's license id was composed of birthdate, sex, and some location
information (I don't remember exactly what). For the reasons mentioned by jdp23
it worked quite well, except for identical twins, who had the same driver's
license id. (Registrations often were processed by hand at drug stores. The
driver's license was used to disambiguate ambiguous renewals.)

The state of Texas complicated things by allowing a family to have the same
license plate for all their cars. One family had seven vehicles, all with the
same license plate number. Florida would issue the same plate to complete
strangers. One man was arrested with great drama, because a bank robber (or some
such undesirable) had the same (legal) plates.

My GF has such a common name that her doctor has to check which one of three
she is. The last time she tried to get her driver's license renewed online,
she was told she had to go to the DMV in person. She asked why and was told
she was a "Code 5". She reported to the DMV and told the woman that she was a
Code 5. There were dropped jaws and a hurried conference with a supervisor who
called Sacramento and had a lengthy conversation with comments like, "Yes I'm
sure. She's right here." Finally she got impatient and asked, "What's Code 5?"

The clerk said, "You're dead and it was verified by medical personnel." She
did eventually get her license renewed. Officially I live with a zombie. It
isn't as bad as the movies portray.

~~~
blahedo
> _...for the state of Illinois. A driver's license id was composed of
> birthdate, sex, and some location information...._

It's worse than that: in Illinois the first four characters of your DL ID are
your last name's soundex code.
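
For readers who haven't met Soundex, here is a minimal sketch of the standard American Soundex rules (first letter plus three digits). Whether Illinois uses exactly this variant is an assumption; the function name `soundex` is just for illustration.

```python
# Letter groups of standard American Soundex; vowels, H, W and Y get no digit.
CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    """Return the 4-character Soundex code of a name, e.g. a letter and 3 digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it counts
    return (result + "000")[:4]  # pad or truncate to one letter plus three digits

print(soundex("Robert"), soundex("Rupert"), soundex("Howard"))
```

Names that sound alike collapse to the same code ("Robert" and "Rupert" both give R163), which is why the prefix narrows down a surname without spelling it out.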

~~~
steve-howard
Do you have a source for this? I think it's pretty cool and I'm wondering if
there's a reasonably centralized place for other states' formats as well.

~~~
ben1040
It's even worse than just a Soundex of your last name: the entire number is
derived from your name and DOB.

<http://www.highprogrammer.com/alan/numbers/dl_us_shared.html>

------
_delirium
This is sort of the reverse of the "curse of dimensionality", for those
interested in machine learning
(<http://en.wikipedia.org/wiki/Curse_of_dimensionality>). From an ML
perspective, as you add dimensions to a dataset, the amount of data you'd need
to accurately model the data without overfitting (i.e. without memorizing
specific details of the sample rather than underlying trends) grows _very_
fast, because due to the combinatorics you end up with extremely sparse
coverage of the overall possibility space even with huge data sets.

The reverse of that phenomenon is that, given a data set in a high-dimensional
space (even 3 dimensions, if each dimension has more than a few bits of
entropy), it will cover the dimensions very sparsely (even if it's large!),
and therefore it's relatively easy to recover specific details of the sample
from the aggregate statistics.
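
A small synthetic experiment can illustrate the sparsity argument above. It assumes uniformly random attributes with 100 values each (about 6.6 bits), which real data will not satisfy; `fraction_unique` is a hypothetical helper.

```python
import random
from collections import Counter

def fraction_unique(records, num_attrs):
    """Fraction of records whose first num_attrs attributes match no other record."""
    keys = [r[:num_attrs] for r in records]
    counts = Counter(keys)
    return sum(1 for k in keys if counts[k] == 1) / len(keys)

rng = random.Random(0)
# 100,000 synthetic "people", each with 4 attributes drawn uniformly from 100 values.
records = [tuple(rng.randrange(100) for _ in range(4)) for _ in range(100_000)]

fractions = [fraction_unique(records, d) for d in range(1, 5)]
for d, f in enumerate(fractions, start=1):
    print(f"{d} attribute(s): {f:.3f} of records are unique")
```

With one or two attributes almost nobody is unique; by three (a million possible cells for 100,000 records) most records are, even though the data set is large.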

edit: Well, I was hoping this might be a new insight, but in fact there's a
good 2005 paper exploring that connection in much more detail:
<http://www.vldb2005.org/program/paper/fri/p901-aggarwal.pdf>

~~~
benmathes
The Curse of Dimensionality is a key piece of knowledge to keep in mind
anytime you're building some software that searches/sorts/filters/etc. a large
dataset and your users keep asking for things like _"but I want to search
based on <some new thing>!"_

Someone has to be the gatekeeper on your dimensions.

------
javert
This is an excellent example of "how to write a title for an HN submission."

For the busy HNer, it's not even necessary to click on the link to get the key
idea from the article.

------
yuvadam
I should run that on the Israeli population census - which has been illegally
leaked several times [1].

The ZIP codes in Israel are per-street, not city. Given Israel's population of
just under 8M, I believe a very high percentage (> 95%) of unique people can
be found.

[1] - [http://blog.y3xz.com/post/7846661044/data-mining-the-
israeli...](http://blog.y3xz.com/post/7846661044/data-mining-the-israeli-
population-census)

~~~
edanm
You should definitely update us on that - it will be very interesting to see
the results.

Great article, by the way. I especially like the idea of diff'ing the
databases.

------
EwanG
Then there are those of us who have unique enough names that you can pretty
much look us up and find there's only two in the whole US (and we're related).
Having been a DBA at one point, it's always fun to tell folks that "really, if
you just search for my name I guarantee you'll find me faster". Heck in most
cases a search in a system that separates first and last names will let me be
found with just my first name. If it weren't for that lousy Welsh actor, that
would work on Google also (although full name turns me up as most of the front
page).

My point is that this shouldn't be THAT surprising. I suspect full name and
gender would uniquely identify a fair portion of the US as well. We're not as
homogeneous as some societies and I think this proves it...

------
Zakuzaa
I don't find that number surprising. DOB along with zip is pretty good data.

What I'd find amusing would be how much of the population is uniquely
identifiable by browser+plugins+os+resolution.

~~~
mbrubeck
The EFF has done some research on that question:
<https://panopticlick.eff.org/>

~~~
delluminatus
Interesting -- using NoScript reduced my "bits of information" from 20 to 14.
I would have expected it to increase the uniqueness of my browser.

~~~
bobds
They can get considerably less data if you don't have Javascript enabled.
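
To put the "bits of information" figures above in perspective: n bits means roughly one browser in 2^n shares the same fingerprint. A minimal sketch of that conversion (the function names are illustrative, not Panopticlick's API):

```python
import math

def one_in(bits):
    """How many browsers, on average, you'd need to see one matching fingerprint."""
    return 2 ** bits

def surprisal_bits(fraction):
    """Bits of identifying information in a trait shared by this fraction of browsers."""
    return -math.log2(fraction)

print(one_in(20))                  # 1048576
print(one_in(14))                  # 16384
print(surprisal_bits(1 / 16384))   # 14.0
```

So dropping from 20 to 14 bits means going from about one in a million to about one in sixteen thousand, consistent with the site simply measuring fewer attributes when scripts are blocked.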

------
kingkilr
There were no other males, born on October 8th 1990, in Lakeview, Chicago. I'm
a little flabbergasted, to be frank.

~~~
Zimahl
October 8th is an odd date. Since it's my birthday as well (long before 1990
though) I have taken note that I run across very few with that birthday, let
alone early October birthdays. Is it statistically a low-birth month? I'd love
to know.

On a side note, Happy 21st Birthday in just over a month!

~~~
tokenadult
October 8th is actually in the top 100 of birthdates by frequency of births,
according to the best data table I have seen on the subject.

[http://www.nytimes.com/2006/12/19/business/20leonhardt-
table...](http://www.nytimes.com/2006/12/19/business/20leonhardt-table.html)

Of course February 29th is dead last in frequency. Most early October
birthdays are above the median in frequency.

~~~
caf
Note the group of high rankings from 8-Sep, tailing off in early October - all
those NYE babies ;)

~~~
throw_away
also interesting is looking at the next lowest frequency dates (rank, month,
day)-- all of them are clustered around major holidays:

365 12 25

364 1 1

363 12 24

362 1 2

361 7 4

360 12 26

359 11 28

358 11 27

357 11 26

356 1 3

355 10 31

354 12 23

353 11 25

352 11 24

~~~
ganley
Not sure if this is the (whole) reason, but obviously no one is going to have
a scheduled induction or C-section on a holiday.

------
Zimahl
Putting it to a non-scientific test, where I grew up is within the zip code of
98632. From the 2000 census there are 47,202 within that zip code. This is
probably a good average city for America - not too big, not too small. Based
on the fact that there are 366 possible (year excluded) birthdays (I'm
including Feb. 29th) and only two genders, just about the best I can do for
uniquely identifying people in that zip code is 1 in 64. Add in the year and
what I would assume is an even population between the ages of 1 to 50 and that
gets you closer to 1:1. Not bad.

Zip codes with large populations are probably closer to 50% than 87%, and
obviously the reverse is true as well. I wonder what the population size for a
zip code would have to be to be really close to 100%. Just throwing some
numbers in a calculator I'd guess at 15-20k people would be damn close. So 10k
is probably just about a unique identifier.
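
The back-of-the-envelope numbers above can be reproduced with a few lines. This sketch keeps the comment's own assumptions (366 birthdays, two genders, 50 equally likely birth years); the helper names are made up.

```python
DAYS = 366     # possible birthdays, including Feb. 29
GENDERS = 2

def people_per_cell(population, birth_years=1):
    """Average number of people sharing one (birthday, gender[, year]) combination."""
    return population / (DAYS * GENDERS * birth_years)

def population_for_one_per_cell(birth_years):
    """Population at which there is, on average, one person per combination."""
    return DAYS * GENDERS * birth_years

print(round(people_per_cell(47_202), 1))      # 64.5
print(round(people_per_cell(47_202, 50), 2))  # 1.29
print(population_for_one_per_cell(50))        # 36600
```

An average of one person per combination does not mean everyone is unique, since some combinations will hold two or more people while others sit empty.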

~~~
simonw
Your calculations would be very different if you included the year in the DOB.

~~~
Zimahl
I didn't at first to show the 1:64, and then I added the year (well 50 of
them) to show how I got closer to the 87% with my extremely un-scientific and
fuzzy math.

------
mikey_p
While I find this fascinating, I'm sure this is old news to people in the
business, like the people that run the grocery/chain store savings programs.

What I'd be really interested to see is why this works, and what it tells us
about distribution of population by zip code. I'd imagine the places where
this doesn't work as well are the most densely populated zip codes, where the
likelihood of duplicates on the given key increases, but I would never have
guessed that the accuracy would be anywhere near 87%. (Maybe there are a lot more
zip codes than I thought? Maybe they used zip+4?)

~~~
jerf
Simple information theory; in a world of 6 billion people, 33 bits serves to
uniquely identify anyone. 33 bits is a really low bar. It works because it
can't hardly help but to work, sort of like the birthday paradox. In the US
with approx 312 million (wikipedia) that's approximately 28.2 bits.

It doesn't tell us much about zip code distribution because zip codes are
chosen to have approx the same number of people in each. As it turns out,
that's exactly how you'd go about maximizing the amount of information the zip
code carries... which is unsurprising since that's the entire purpose of a zip
code. Gender is almost exactly one bit, and date of birth is 15ish bits with
some bad uniformity assumptions, zip is another 15ish with bad uniformity
assumptions[1], that's 31-ish total, subtract off 3-ish for the bad
assumptions and you get 28, which covers 2^28 = 268,435,456 people, which is
pretty close to the number cited (.87 times 312,000,000 = 271,440,000). I'll
cop to tuning the fudge factor of three bits to nicely match the number given,
but the bit count itself just comes from the space of possibilities.

Lots more info on the topic can be found here: <http://33bits.org/about/>

[1]: <http://www.carrierroutes.com/ZIPCodes.html> 43,000 ZIP codes => log2
43000 is approx 15.4 bits assuming perfect uniformity.
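
The bit counts in this comment are easy to check. A minimal sketch, assuming 100 equally likely birth years for the date-of-birth term (the comment only says "15ish"):

```python
import math

def bits_to_identify(population):
    """Bits needed to single out one member of a population."""
    return math.log2(population)

world_bits = bits_to_identify(6e9)    # about 32.5, hence "33 bits"
us_bits = bits_to_identify(312e6)     # about 28.2

gender_bits = math.log2(2)            # exactly 1
dob_bits = math.log2(365 * 100)       # assumption: 100 uniform birth years
zip_bits = math.log2(43_000)          # about 15.4, per footnote [1]
raw_bits = gender_bits + dob_bits + zip_bits
fudge_bits = 3                        # the comment's admitted tuning factor

print(f"world: {world_bits:.1f} bits, US: {us_bits:.1f} bits")
print(f"DOB + gender + zip: {raw_bits:.1f} raw, {raw_bits - fudge_bits:.1f} after the fudge")
```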

~~~
robrenaud
1-.5^3 = .875, eerily close to .87.

In general, if you model this as an N balls in M bins problem, then even when
N == M, you'd expect a fair amount of anonymity-preserving collisions.
Maybe 1/2 of people would collide. As we then double the number of bins, we'd
roughly expect the number of collisions to be cut in half.

If you imagine putting 100 balls (people) into 800 = 100 * 2^3 bins (number of
different birthday-zip-gender encodings) at random, about 1/8 of the bins will
have more than one ball (person) [okay, this estimate is somewhat off by a
smallish constant factor, it's only true that the 100th ball tossed will
collide with probability 99/800 ~= 1/8 if there were no existing collisions,
and the earlier balls thrown have less to collide with], and not be uniquely
identifiable.
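
The balls-in-bins estimate can be checked directly. Under the same uniform-random assumption, a given ball is alone in its bin with probability (1 - 1/M)^(N-1); the sketch below compares that with a simulation (function names are illustrative).

```python
import random
from collections import Counter

def expected_unique_fraction(balls, bins):
    """Probability that a given ball shares its bin with no other ball."""
    return (1 - 1 / bins) ** (balls - 1)

def simulated_unique_fraction(balls, bins, trials=2000, seed=0):
    """Average fraction of balls that end up alone, over repeated random throws."""
    rng = random.Random(seed)
    unique = 0
    for _ in range(trials):
        counts = Counter(rng.randrange(bins) for _ in range(balls))
        unique += sum(1 for c in counts.values() if c == 1)
    return unique / (balls * trials)

print(expected_unique_fraction(100, 800))
print(simulated_unique_fraction(100, 800))
```

For 100 balls in 800 bins the formula gives about 0.88, close to the 7/8 guessed above. For N == M it tends to 1/e, so under this model nearer two thirds of people collide than one half.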

------
giberson
This is an interesting statistic for another reason aside from de-
anonymization, particularly directory listing applications. A lot of web sites
have url's to access the user's profile page by their username. For popular
sites, it's always a rush to register "yourname" as a username so that you can
get website.com/<yourname>. Inevitably, your name gets registered by some one
else and you're stuck picking a nick name or appending a randomized letter set
to it.

I've been pondering a useful way to have /<yourname> in a URL, so that everyone
with that name can use the URL containing it without collisions. Of course, I
always end up with something like website.com/a3fx/<yourname>, which is
arbitrary and ugly. However, with this stat it seems we have something close
to a non-colliding, pretty, meaningful addressing scheme, i.e.
website.com/<dob>/<gender>/<zip>/<yourname>. Sure, it's a bit long, but it
provides assurance you're getting who you think you're getting.

~~~
natrius
A problem with that is that people move, so zip codes change. You could fix
that by substituting artificial, user-chosen groupings for zip codes if you
used enough of them to have a similar distribution. Users could pick a
grouping based on their interests, and you could even keep the geographical
aspect of it by treating them as "neighborhoods". People who play the stock
market could choose WallStreet, science fiction fans could choose Area51, wine
aficionados could choose NapaValley, and so on.

------
thisuser
Medical and voting records generally ask for sex, not gender. It would be nice
if the two concepts would stop being confused.

~~~
masterzora
I agree. That said, using actual gender rather than assigned sex only serves
to identify people more uniquely. I wonder if that would have an appreciable
effect on the stats or if it would fall within error bars.

~~~
billybob
You'd have to have a pretty long dropdown menu in the data-mining centers for
all the genders people self-identify these days. Some people might be uniquely
identifiable on that value alone.

~~~
DrJ
or when someone comes up with a newer more politically correct term!

I prefer writing Sex: Yes, please

------
zeratul
De-identified patient data does NOT have DOB or ZIP codes:

[http://www.hipaa.com/2009/09/hipaa-protected-health-
informat...](http://www.hipaa.com/2009/09/hipaa-protected-health-information-
what-does-phi-include/)

Sex, gender or age is allowed.

~~~
tlb
But if you poll the database every day to see when age increases, you can get
DOB.
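
A minimal sketch of that polling idea, assuming a live dataset that reports age in whole years as of the query date. `age_on` stands in for the hypothetical database and `infer_birthday` for the attacker; neither comes from a real system.

```python
from datetime import date, timedelta

def age_on(dob, today):
    """Age in whole years on a given day, as the hypothetical database reports it."""
    return today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))

def infer_birthday(query_age, start, days=366):
    """Poll once per day; the day the reported age goes up is the birthday."""
    previous = query_age(start)
    for offset in range(1, days + 1):
        day = start + timedelta(days=offset)
        current = query_age(day)
        if current > previous:
            # The age just became `current`, so the birth year is this year minus it.
            return date(day.year - current, day.month, day.day)
        previous = current
    return None

secret_dob = date(1970, 3, 15)  # hidden from the attacker, used only to simulate queries
guess = infer_birthday(lambda d: age_on(secret_dob, d), start=date(2011, 1, 1))
print(guess)  # 1970-03-15
```

A Feb. 29 birthday polled in a non-leap year would come out as March 1 in this sketch.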

~~~
nkassis
Well... if the records are released I don't think they should change. Someone's
record at 50 isn't the same as at 60. Unless these are live datasets.

~~~
zeratul
Correct, you would not be able to guess the date of birth - the date of visit
is removed.

------
pyre
If we want more anonymized datasets, maybe we should be asking for approx. age
or year of birth instead. I doubt that most things that ask for DOB need
anything more granular than that.

~~~
ComputerGuru
I wish. Unfortunately, most sites need this for legal reasons, to determine
whether or not the user is above or below a certain age.

One could theoretically ask "how old are you" but then you don't know when to
lift limitations. Say a user joins six months before their 18th birthday. Are
you going to block 18+ content for a whole year from then?

~~~
Semiapies
How many sites filter content based on age _and_ really do so selectively
based on a birthdate? Most age verifications, even ones that request a date,
seem to be just asking the question, "Are you of legal age?"

(I prefer websites just ask that binary question, like this brewery -
<http://www.newbelgium.com> - instead of making me waste time with a date
input.)

------
joshuaheard
This might work well in a fixed database, but it would not work in real life
since people change zip codes when they move every couple of years, or have
more than one zip code (home, office).

The only way to identify people in real life is to issue them a public ID
number that can be used in conjunction with a private ID number (SS#).

------
StrawberryFrog
DOB + name is unique in most cases.

Which is why I deliberately give an incorrect DOB on most websites that want
one.

------
est
China's Citizen ID:

Area_ID + DOB + Order Number + Checksum

For order number: Men are assigned to odd numbers, women assigned to even
numbers.

[http://en.wikipedia.org/wiki/Resident_Identity_Card#Identity...](http://en.wikipedia.org/wiki/Resident_Identity_Card#Identity_card_number)
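
A sketch of how such an 18-character number could be split and checked. The field widths (6-digit area, 8-digit DOB, 3-digit order) and the weighted mod-11 checksum are my understanding of the published format rather than something stated in this thread, so treat them as assumptions; the sample number is a made-up illustrative one.

```python
# Assumed checksum parameters (weighted sum mod 11, mapped to a check character).
WEIGHTS = (7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2)
CHECK_CHARS = "10X98765432"

def check_char(first17):
    """Compute the check character for the first 17 digits."""
    total = sum(int(d) * w for d, w in zip(first17, WEIGHTS))
    return CHECK_CHARS[total % 11]

def parse_id(id_number):
    """Split an ID into area, DOB, order number and sex, validating the checksum."""
    id_number = id_number.upper()
    if len(id_number) != 18 or not id_number[:17].isdigit():
        raise ValueError("expected 17 digits plus a check character")
    if check_char(id_number[:17]) != id_number[17]:
        raise ValueError("checksum mismatch")
    order = int(id_number[14:17])
    return {
        "area": id_number[:6],
        "dob": id_number[6:14],                    # YYYYMMDD
        "order": order,
        "sex": "male" if order % 2 else "female",  # odd = men, even = women
    }

print(parse_id("11010519491231002X"))
```

Note that area, DOB, and sex are readable straight off the number, which is the same triple the headline statistic is about.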

------
jeffreymcmanus
True unique identifiers are immutable. At least two of those three things can
be changed.

~~~
steve-howard
Gender is a bit of a stretch, if technically changeable. I don't know much on
the subject but I would wager that the proportion of people who have changed
gender is not statistically significant enough to change the 87% figure.

~~~
jeffreymcmanus
Identifiers are either immutable or they're not. They can't be sorta-kinda
immutable.

------
ditojim
now im glad i randomize my birthdate on just about every form on the internet.

------
merraksh
With the density of large cities (though matched by zip codes), I wonder if
the 13% live in urban areas and the remaining 87% outside.

------
pavel_lishin
I'd love for someone with a large dataset containing those variables to check
what % of their set overlaps on those.

~~~
ascott
About a year ago, I did a project for de-duping customer entered contact info.
We had less than 100k records for any particular region (Bay Area, New York,
etc.) From what I recall, we were able to confirm identity in about 80% of
cases using Name, DOB, and zip code.

By far the best metric is phone number, that bumped us up over 97%. If the
comparison was limited to only three metrics, then we would use Name, DOB, and
phone number. In reality, we also compare zip code, street address, and
contacts (like friends on facebook), in that order.

It would be interesting to see if on average, people change their phone number
less often than their name (marriage) or address (renters).

~~~
pnathan
That's interesting, considering Google is trying to get me to give them my
phone number. _puts on a tin foil hat_

------
dotcoma
But most define themselves by asl ;-)

------
thelovelyfish
The Mark of The Beast is at hand.

------
samstave
JOKE'S ON YOU! I can change 2 of the 3!!

