87% of the U.S. Population are uniquely identified by {DOB, gender, zip} (latanyasweeney.org)
281 points by jessekeys on Aug 30, 2011 | 95 comments



Here's some backstory from the 1990s, when the Massachusetts Group Insurance Commission released "anonymized" health data:

''At the time GIC released the data, William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers. In response, then-graduate student Sweeney started hunting for the Governor’s hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, a city of 54,000 residents and seven ZIP codes. For twenty dollars, she purchased the complete voter rolls from the city of Cambridge, a database containing, among other things, the name, address, ZIP code, birth date, and sex of every voter. By combining this data with the GIC records, Sweeney found Governor Weld with ease. Only six people in Cambridge shared his birth date, only three of them men, and of them, only he lived in his ZIP code. In a theatrical flourish, Dr. Sweeney sent the Governor’s health records (which included diagnoses and prescriptions) to his office.''

Source: http://arstechnica.com/tech-policy/news/2009/09/your-secrets...
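
The linkage described in the quote amounts to a join on the shared quasi-identifiers. Here is a minimal sketch of that idea, assuming both datasets carry birth date, sex, and ZIP. Every record, name, and field name below is invented for illustration; none of it comes from the GIC or Cambridge data.

```python
# Toy re-identification by joining two datasets on (birth date, sex, ZIP).
# All records and field names are made up for illustration.
from collections import defaultdict

voter_roll = [
    {"name": "Alice Example", "dob": "1950-01-02", "sex": "F", "zip": "02138"},
    {"name": "Bob Example", "dob": "1948-05-17", "sex": "M", "zip": "02138"},
    {"name": "Carl Example", "dob": "1948-05-17", "sex": "M", "zip": "02139"},
]

health_records = [
    {"dob": "1948-05-17", "sex": "M", "zip": "02138", "diagnosis": "condition A"},
    {"dob": "1950-01-02", "sex": "F", "zip": "02138", "diagnosis": "condition B"},
    {"dob": "1960-03-03", "sex": "F", "zip": "02140", "diagnosis": "condition C"},
]


def key(rec):
    """Quasi-identifier tuple shared by both datasets."""
    return (rec["dob"], rec["sex"], rec["zip"])


def link(voters, records):
    """Return {name: diagnosis} for records that match exactly one voter."""
    by_key = defaultdict(list)
    for v in voters:
        by_key[key(v)].append(v["name"])
    matches = {}
    for r in records:
        names = by_key.get(key(r), [])
        if len(names) == 1:  # only an unambiguous match re-identifies someone
            matches[names[0]] = r["diagnosis"]
    return matches


reidentified = link(voter_roll, health_records)
print(reidentified)
```

Note that Bob and Carl share a birth date and sex, and only the ZIP separates them, which mirrors the last step in the Weld story.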


I had to deal with the converse of the problem when I was working for a company that ran the vehicle smog checking for the state of Illinois. A driver's license ID was composed of birthdate, sex, and some location information (I don't remember exactly what). For the reasons mentioned by jdp23, it worked quite well, except for identical twins, who had the same driver's license ID. (Registrations often were processed by hand at drug stores. The driver's license was used to disambiguate ambiguous renewals.)

The state of Texas complicated things by allowing a family to have the same license plate for all their cars. One family had seven vehicles, all with the same license plate number. Florida would issue the same plate to complete strangers. One man was arrested with great drama because a bank robber (or some such undesirable) had the same (legal) plates.

My GF has such a common name that her doctor has to check which one of three she is. The last time she tried to get her driver's license renewed online, she was told she had to go to the DMV in person. She asked why and was told she was a "Code 5". She reported to the DMV and told the woman that she was a Code 5. There were dropped jaws and a hurried conference with a supervisor, who called Sacramento and had a lengthy conversation with comments like, "Yes, I'm sure. She's right here." Finally she got impatient and asked, "What's Code 5?"

The clerk said, "You're dead and it was verified by medical personnel." She did eventually get her license renewed. Officially I live with a zombie. It isn't as bad as the movies portray.


My wife had her wallet stolen on the T (in Boston) once, which would have been manageable but for the fact that Massachusetts driver's license ID's were your Social Security number. The thief (or someone) then used her SSN to open a series of credit cards and rack up a bunch of charges—computers, car repairs, etc.—the aftermath of which took at least a few hundred hours of hassle spread over three or four years to clear up. (Getting our first mortgage somewhere in there made it even more painful.) Mercifully, it eventually cleared up.

MA driver's licenses no longer use SSN as far as we can tell.


> ...for the state of Illinois. A driver's license id was composed of birthdate, sex, and some location information....

It's worse than that: in Illinois the first four characters of your DL ID are your last name's soundex code.
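
For anyone who hasn't met Soundex: a minimal sketch of the standard American Soundex rules follows. Whether Illinois uses exactly this variant for the license prefix is an assumption on my part; the function name `soundex` is just illustrative.

```python
# Standard American Soundex: first letter plus three digits.
# Assumption: the DL prefix follows these standard rules.
CODES = {}
for digit, letters in (
    ("1", "BFPV"),
    ("2", "CGJKQSXZ"),
    ("3", "DT"),
    ("4", "L"),
    ("5", "MN"),
    ("6", "R"),
):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0]
    prev = CODES.get(letters[0], "")  # the first letter's code also suppresses repeats
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W are skipped and do not break a run of equal codes
        code = CODES.get(ch, "")  # vowels and Y have no code
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code after it counts again
    return (result + "000")[:4]  # pad or truncate to four characters


print(soundex("Robert"), soundex("Rupert"), soundex("Ashcraft"))
```

Different surnames collapse to the same code (Robert and Rupert both give R163), so the prefix narrows the name down without pinning it exactly.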


That used to be true for MN as well, but in the time that I moved from MN to IL back to MN (about 13 years), MN dropped that.


The first four characters of my FL driver's license are a soundex code of my last name. So, this appears to be a common practice.

I always thought that part looked suspicious. Thanks for the tip.


Do you have a source for this? I think it's pretty cool and I'm wondering if there's a reasonably centralized place for other states' formats as well.


It's even worse than just a Soundex of your last name: the entire number is derived from your name and DOB.

http://www.highprogrammer.com/alan/numbers/dl_us_shared.html


What's the harm of just a soundex? If your name is on the card anyway, it's not that secretive.


If I know your name, this makes it much easier to deduce your driver's license number.


> Officially I live with a zombie

Not as uncommon as you might think. Plight of the Living Dead (http://www.time.com/time/world/article/0,8599,2054133,00.htm...)


Never underestimate the power of bored graduate students.


If only 6 people out of 54000 shared Weld's birthday, she was just lucky (not that this invalidates the main point).


Including birth year helps significantly. Assuming an even distribution of births, you'd expect 54000/365 ~ 148/day. Then knowing gender lets you cut that in half. Dividing again by the age distribution of voters gets that down to a reasonable number.


Birthdate includes year


Edit: Ok ok got it, birthday includes the year, my mistake; no need to keep downvoting


This is sort of the reverse of the "curse of dimensionality", for those interested in machine learning (http://en.wikipedia.org/wiki/Curse_of_dimensionality). From an ML perspective, as you add dimensions to a dataset, the amount of data you'd need to accurately model the data without overfitting (i.e. without memorizing specific details of the sample rather than underlying trends) grows very fast, because due to the combinatorics you end up with extremely sparse coverage of the overall possibility space even with huge data sets.

The reverse of that phenomenon is that, given a data set in a high-dimensional space (even 3 dimensions, if each dimension has more than a few bits of entropy), it will cover the dimensions very sparsely (even if it's large!), and therefore it's relatively easy to recover specific details of the sample from the aggregate statistics.

edit: Well, I was hoping this might be a new insight, but in fact there's a good 2005 paper exploring that connection in much more detail: http://www.vldb2005.org/program/paper/fri/p901-aggarwal.pdf


The Curse of Dimensionality is a key piece of knowledge to keep in mind anytime you're building some software that searches/sorts/filters/etc. a large dataset and your users keep asking for things like "but I want to search based on <some new thing>!"

Someone has to be the gatekeeper on your dimensions.


This is an excellent example of "how to write a title for an HN submission."

For the busy HNer, it's not even necessary to click on the link to get the key idea from the article.


I should run that on the Israeli population census - which has been illegally leaked several times [1].

The ZIP codes in Israel are per-street, not per-city. Given Israel's population of just under 8M, I believe a very high percentage (> 95%) of people can be uniquely identified.

[1] - http://blog.y3xz.com/post/7846661044/data-mining-the-israeli...


You should definitely update us on that - it will be very interesting to see the results.

Great article, by the way. I especially like the idea of diff'ing the databases.


Then there are those of us who have unique enough names that you can pretty much look us up and find there's only two in the whole US (and we're related). Having been a DBA at one point, it's always fun to tell folks that "really, if you just search for my name I guarantee you'll find me faster". Heck in most cases a search in a system that separates first and last names will let me be found with just my first name. If it weren't for that lousy Welsh actor, that would work on Google also (although full name turns me up as most of the front page).

My point is that this shouldn't be THAT surprising. I suspect full name and gender would uniquely identify a fair portion of the US as well. We're not as homogenous as some societies and I think this proves it...


I don't find that number surprising. DOB along with zip is pretty good data.

What I'd find amusing would be how much of the population is uniquely identifiable by browser+plugins+os+resolution.


The EFF has done some research on that question: https://panopticlick.eff.org/


Interesting -- using NoScript reduced my "bits of information" from 20 to 14. I would have expected it to increase the uniqueness of my browser.


They can get considerably less data if you don't have Javascript enabled.


hmmmm..... Looks like I should use IE instead of chrome.


Panopticlick (EFF) tests browser uniqueness. https://panopticlick.eff.org/

Edit: Beaten like a dead horse.


There were no other males born on October 8th, 1990, in Lakeview, Chicago. I'm a little flabbergasted, to be frank.


Birth location =/= mailing address zipcode. Babies are usually born in hospitals, and thus parents from many areas concentrate in one specific building. When you re-distribute these births out to their respective homes around the county, the zip code becomes much more important.


I used my mailing address zipcode, without even realizing it wasn't my birth zipcode :)


Where are you searching for your zip code and birth date? I couldn't find anything on the Data Privacy Lab site. :\


In towns like mine, however, 40,000 people share the same zip code with the hospital. I'm sure there are quite a few more towns like mine throughout the US.


Okay, well let's do the math:

US pop in 2000: 280,000,000
US births in 2000: 4,000,000
US births per day: 10,958
Fraction of US pop that 40,000 represents: 0.0001428
Number of births per day for the 40,000: 1.56

That seems in line with the conclusion of the paper. See any mistakes?
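
For anyone who wants to re-run the arithmetic, here is a quick sketch using the same round figures. The 50/50 split by sex at the end is an extra assumption that is not part of the numbers above.

```python
# Re-running the back-of-envelope numbers from the comment.
us_population = 280_000_000   # US pop in 2000, as quoted
us_births_per_year = 4_000_000
town_population = 40_000

births_per_day_us = us_births_per_year / 365
town_fraction = town_population / us_population
births_per_day_town = births_per_day_us * town_fraction

# Assumption: births split evenly between the sexes.
births_per_day_per_sex = births_per_day_town / 2

print(round(births_per_day_us), round(town_fraction, 7),
      round(births_per_day_town, 2), round(births_per_day_per_sex, 2))
```

So a 40,000-person zip code sees well under one birth per day per sex, which is why a full birth date plus sex usually singles out one resident.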


October 8th is an odd date. Since it's my birthday as well (long before 1990 though) I have taken note that I run across very few with that birthday, let alone early October birthdays. Is it statistically a low-birth month? I'd love to know.

On a side note, Happy 21st Birthday in just over a month!


October 8th is actually in the top 100 of birthdates by frequency of births, according to the best data table I have seen on the subject.

http://www.nytimes.com/2006/12/19/business/20leonhardt-table...

Of course February 29th is dead last in frequency. Most early October birthdays are above the median in frequency.


Note the group of high rankings from 8-Sep, tailing off in early October - all those NYE babies ;)


Also interesting is looking at the next lowest frequency dates -- all of them are clustered around major holidays:

Rank  Month  Day
365   12     25
364   1      1
363   12     24
362   1      2
361   7      4
360   12     26
359   11     28
358   11     27
357   11     26
356   1      3
355   10     31
354   12     23
353   11     25
352   11     24


Not sure if this is the (whole) reason, but obviously no one is going to have a scheduled induction or C-section on a holiday.


These days it doesn't surprise me; it seems like doctors are willing to induce at the drop of a hat. So why not keep their holidays clear?


Interesting probabilities on x random people sharing the same birthday:

http://en.wikipedia.org/wiki/Birthday_problem
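
A short sketch of the classic calculation, assuming 365 equally likely birthdays (which the frequency table elsewhere in this thread shows is only approximately true). The function name is illustrative.

```python
def p_shared_birthday(n, days=365):
    """Probability that at least two of n people share a birthday,
    assuming every day is equally likely."""
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


for n in (10, 23, 50):
    print(n, round(p_shared_birthday(n), 3))
```

The probability crosses one half at 23 people, which is the usual "paradox" figure.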


As another October 8th birthdate with the opposite impression of how many October kids there are, I had to look this up.

8.62% of all live births in the US during 2006 occurred during October [1]. No single month deviated significantly from its expected percentage.

[1] http://data.un.org/Data.aspx?d=POP&f=tableCode:55


I guess people be fuckin' all of the damn time.


October 8th? It seems like a decent % of people share your birthday (me among them). Certainly feels weird to me. How many thousand people read a page (given that it has a moderate up-vote)?


That was my dad's bd. His best friend was also born on that day.


I too am having some problems digesting this. I think we are missing something.


Putting it to a non-scientific test, where I grew up is within the zip code of 98632. From the 2000 census there are 47,202 people within that zip code. This is probably a good average city for America - not too big, not too small. Based on the fact that there are 366 possible (year excluded) birthdays (I'm including Feb. 29th) and only two genders, just about the best I can do for uniquely identifying people in that zip code is 1 in 64 (47,202 / (366 x 2)). Add in the year and what I would assume is an even population between the ages of 1 and 50, and that gets you closer to 1:1. Not bad.

Zip codes with large populations are probably closer to 50% than 87%, and obviously the reverse is true as well. I wonder what the population size for a zip code would have to be to be really close to 100%. Just throwing some numbers in a calculator, I'd guess 15-20k people would be damn close. So 10k is probably just about a unique identifier.


Your calculations would be very different if you included the year in the DOB.


I didn't at first to show the 1:64, and then I added the year (well 50 of them) to show how I got closer to the 87% with my extremely un-scientific and fuzzy math.


If they have zip+4 info, this gets even easier.


Using the tools at USPS.com, I could find examples where ZIP+4 was unique down to just 6 apartments in a dense neighborhood. Just type in a street name with lots of apartment buildings on it.


Longview probably just has the one zip code. Seattle has 59, for just over 10000 average per zip code, so we up here are probably a bit worse in a big city.


Worse or better? With only 10k per zip you'd get very close to a 100% unique identifier.

I realized after looking at both Portland and Seattle zip codes that they seem to be distributed better than the less dense areas so you do have very few zip codes over 20k.


While I find this fascinating, I'm sure this is old news to people in the business, like the people that run the grocery/chain store savings programs.

What I'd be really interested to see is why this works, and what it tells us about distribution of population by zip code. I'd imagine the places where this doesn't work as well are the most densely populated zip codes, where the likelihood of duplicates on the given key increases, but I would never have guessed that the accuracy would be anywhere near 87%. (Maybe there are a lot more zip codes than I thought? Maybe they used zip+4?)


Simple information theory; in a world of 6 billion people, 33 bits serves to uniquely identify anyone. 33 bits is a really low bar. It works because it can't hardly help but to work, sort of like the birthday paradox. In the US with approx 312 million (wikipedia) that's approximately 28.2 bits.

It doesn't tell us much about zip code distribution because zip codes are chosen to have approx the same number of people in each. As it turns out, that's exactly how you'd go about maximizing the amount of information the zip code carries... which is unsurprising since that's the entire purpose of a zip code. Gender is almost exactly one bit, and date of birth is 15ish bits with some bad uniformity assumptions, zip is another 15ish with bad uniformity assumptions[1], that's 31-ish total, subtract off 3-ish for the bad assumptions and you get 28, which covers 2^28 = 268,435,456 people, which is pretty close to the number cited (.87 times 312,000,000 = 271,440,000). I'll cop to tuning the fudge factor of three bits to nicely match the number given, but the bit count itself just comes from the space of possibilities.

Lots more info on the topic can be found here: http://33bits.org/about/

[1]: http://www.carrierroutes.com/ZIPCodes.html 43,000 ZIP codes => log2 43000 is approx 15.4 bits assuming perfect uniformity.
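
The bit counts are easy to reproduce. A rough sketch follows; the 100 equally likely birth years used for the date-of-birth figure is my assumption to get "15ish" bits, and all of these assume perfectly uniform distributions.

```python
import math


def bits(n_possibilities):
    """Bits needed to single out one of n equally likely possibilities."""
    return math.log2(n_possibilities)


world_bits = bits(6_000_000_000)  # about 32.5, so 33 bits suffice
us_bits = bits(312_000_000)       # about 28.2

sex_bits = bits(2)                # exactly 1
dob_bits = bits(365 * 100)        # assumption: 100 equally likely birth years
zip_bits = bits(43_000)           # 43,000 ZIP codes, as in footnote [1]
total_bits = sex_bits + dob_bits + zip_bits

print(round(world_bits, 1), round(us_bits, 1), round(total_bits, 1))
```

The uniform total comes out a bit above 31 bits, a few more than the 28.2 needed for the US, which is the slack that the "subtract 3-ish for bad assumptions" fudge eats up.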


1-.5^3 = .875, eerily close to .87.

In general, if you model this as an N balls in M bins problem, then even when N == M, you'd expect a fair amount of anonymity-preserving collisions. Maybe 1/2 of people would collide. As we then double the number of bins, we'd roughly expect the number of collisions to be cut in half.

If you imagine putting 100 balls (people) into 800 = 100 * 2^3 bins (number of different birthday-zip-gender encodings) at random, about 1/8 of the bins will have more than one ball (person) [okay, this estimate is somewhat off by a smallish constant factor, it's only true that the 100th ball tossed will collide with probability 99/800 ~= 1/8 if there were no existing collisions, and the earlier balls thrown have less to collide with], and not be uniquely identifiable.
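
The balls-in-bins picture is easy to check numerically. A small sketch, assuming people land in bins uniformly at random: a given person is unique when none of the other N-1 people land in the same bin, which happens with probability (1 - 1/M)^(N-1). The function names and the trial count are illustrative choices.

```python
import random
from collections import Counter


def expected_unique_fraction(n_people, n_bins):
    """Chance that a given person shares a bin with nobody else."""
    return (1 - 1 / n_bins) ** (n_people - 1)


def simulate_unique_fraction(n_people, n_bins, trials=2000, seed=0):
    """Monte Carlo estimate of the fraction of people alone in their bin."""
    rng = random.Random(seed)
    total_unique = 0
    for _ in range(trials):
        counts = Counter(rng.randrange(n_bins) for _ in range(n_people))
        total_unique += sum(1 for c in counts.values() if c == 1)
    return total_unique / (trials * n_people)


print(round(expected_unique_fraction(100, 800), 3))
print(round(simulate_unique_fraction(100, 800), 3))
```

With 100 people and 800 bins, roughly 88% end up unique, so the 1/8 estimate is in the right neighborhood under this uniform model.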


> Simple information theory; in a world of 6 billion people, 33 bits serves to uniquely identify anyone. 33 bits is a really low bar. It works because it can't hardly help but to work, sort of like the birthday paradox. In the US with approx 312 million (wikipedia) that's approximately 28.2 bits.

It's a powerful idea. I wrote a whole essay analyzing the anime _Death Note_ using the 33 bits idea (http://www.gwern.net/Death%20Note%20Anonymity) and I'm sure that's not even the tip of the iceberg.


Here's a quick back of the envelope calculation:

The US population is 300 million and there are about 43,000 zip codes in the US. Assuming an even distribution, that's about 7,000 people per zip code. Cut it down to just one gender and we're at 3,500. The probability of a chosen person in that group having a unique birthday is (364/365)^3500 = 0.0068%. Now, if we say that the person could have been born in one of the past sixty years, we get (21914/21915)^3500 = 85% as the probability that this person has been uniquely identified.
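
The two figures can be reproduced directly. A sketch using the same rounded inputs; the helper name `p_unique` is illustrative, and it keeps the same simplification of raising to the group size rather than group size minus one.

```python
def p_unique(group_size, n_birthdates):
    """Chance that nobody in the group lands on one particular birthdate,
    assuming birthdates are uniform and independent."""
    return ((n_birthdates - 1) / n_birthdates) ** group_size


people_per_zip = 300_000_000 / 43_000  # about 6,977, rounded to 7,000 above
same_sex = 3500                        # half of the rounded 7,000

p_day_only = p_unique(same_sex, 365)     # birthday without the year
p_with_year = p_unique(same_sex, 21915)  # sixty years of possible dates

print(f"{p_day_only:.4%}", f"{p_with_year:.1%}")
```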


Birth date is pretty granular as well. A quick trip to the interwebs gives about 11k-12k people born each day, or 6k of each gender. There are 100k possible ZIP codes. The birth dates aren't evenly distributed, but they're reasonably close. The ZIP codes probably have a lot of clumping, but in dense urban areas they usually start cutting the areas into smaller pieces. The long tail of ZIP codes is probably well-populated. Out of 6k boys and 6k girls that share a birth date, expecting 5k of each to happen to land a unique ZIP out of 100k sounds reasonable to me.


Point of interest: there are 100K possible ZIP codes but only 42K-ish are actually in use in the US. Not that I think this makes it unreasonable for 5K of each to land unique ZIPs.


> ... the most densely populated zip codes ...

It turns out that zip codes aren't as populous as you might imagine.

As of the Y2K census, only a dozen zip codes have over 100,000 residents -- 80% of all zip codes have less than 15,000 residents, with the median zip code at 2,500.

(Source: http://www.census.gov/geo/www/gazetteer/places2k.html )


Some zip codes have zero residents -- a few large office buildings have their own zip codes, but nobody actually has a home address there. Examples: the Empire State Building, the Pentagon, the former World Trade Center, IRS tax processing centers.


This is an interesting statistic for another reason aside from de-anonymization, particularly for directory listing applications. A lot of web sites have URLs to access the user's profile page by their username. For popular sites, it's always a rush to register "yourname" as a username so that you can get website.com/<yourname>. Inevitably, your name gets registered by someone else and you're stuck picking a nickname or appending a randomized letter set to it.

I've been pondering a useful way to have /<yourname> in a URL, so that everyone with that name can use the URL containing it without collisions. Of course, I always end up with something like website.com/a3fx/<yourname>, which is arbitrary and ugly. However, with this stat it seems we have something close to a non-colliding, pretty, meaningful addressing scheme, i.e. website.com/<dob>/<gender>/<zip>/<yourname>. Sure, it's a bit long, but it provides assurance you're getting who you think you're getting.


A problem with that is that people move, so zip codes change. You could fix that by substituting artificial, user-chosen groupings for zip codes if you used enough of them to have a similar distribution. Users could pick a grouping based on their interests, and you could even keep the geographical aspect of it by treating them as "neighborhoods". People who play the stock market could choose WallStreet, science fiction fans could choose Area51, wine aficionados could choose NapaValley, and so on.


Medical and voting records generally ask for sex, not gender. It would be nice if the two concepts would stop being confused.


I agree. That said, using actual gender rather than assigned sex only serves to identify people more uniquely. I wonder if that would have an appreciable effect on the stats or if it would fall within error bars.


You'd have to have a pretty long dropdown menu in the data-mining centers for all the genders people self-identify these days. Some people might be uniquely identifiable on that value alone.


or when someone comes up with a newer more politically correct term!

I prefer writing Sex: Yes, please


De-identified patient data does NOT have DOB or ZIP codes:

http://www.hipaa.com/2009/09/hipaa-protected-health-informat...

Sex, gender or age is allowed.


But if you poll the database every day to see when age increases, you can get DOB.


Well... if the records are released I don't think they should change. Someone's record at 50 isn't the same as at 60. Unless these are live datasets.


Correct, you would not be able to guess the date of birth - the date of visit is removed.


If we want more anonymized datasets, maybe we should be asking for approx. age or year of birth instead. I doubt that most things that ask for DOB need anything more granular than that.


No, we should be asking for exact age and year, and then jittering each part of it by plus/minus 2 at random. Won't affect the data analysis, but will increase the anonymity considerably.
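
A minimal sketch of that kind of jitter, assuming uniform integer noise in [-2, +2] applied independently to each field. The function names are hypothetical. Zero-mean noise like this leaves averages roughly intact over many records, though it does add some variance, so "won't affect the analysis" depends on what is being measured.

```python
import random


def jitter(value, spread=2, rng=random):
    """Shift an integer by a uniform random offset in [-spread, +spread]."""
    return value + rng.randint(-spread, spread)


def jitter_record(age, birth_year, spread=2, rng=random):
    """Jitter age and birth year independently (illustrative only)."""
    return jitter(age, spread, rng), jitter(birth_year, spread, rng)


rng = random.Random(42)  # seeded so the example is repeatable
ages = [jitter(40, 2, rng) for _ in range(10_000)]
print(min(ages), max(ages), round(sum(ages) / len(ages), 2))
```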


I wish. Unfortunately, most sites need this for legal reasons, to determine whether or not the user is above or below a certain age.

One could theoretically ask "how old are you" but then you don't know when to lift limitations. Say a user joins six months before their 18th birthday. Are you going to block 18+ content for a whole year from then?


How many sites filter content based on age and really do so selectively based on a birthdate? Most age verifications, even ones that request a date, seem to be just asking the question, "Are you of legal age?"

(I prefer websites just ask that binary question, like this brewery - http://www.newbelgium.com - instead of making me waste time with a date input.)


Your point really does not apply to anonymized datasets which are released and used to do statistical analysis on a population.

It may be relevant if you are trying to examine differences immediately before and after your 18th or 21st birthdays, but that would be pretty rare.


This might work well in a fixed database, but it would not work in real life since people change zip codes when they move every couple of years, or have more than one zip code (home, office).

The only way to identify people in real life is to issue them a public ID number that can be used in conjunction with a private ID number (SS#).


DOB + name is unique in most cases.

Which is why I deliberately give an incorrect DOB on most websites that want one.


China's Citizen ID:

Area_ID + DOB + Order Number + Checksum

For order number: Men are assigned to odd numbers, women assigned to even numbers.

http://en.wikipedia.org/wiki/Resident_Identity_Card#Identity...
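
To make the layout concrete, here is a sketch that splits an 18-character number into those fields and verifies the check character. It assumes the usual layout of a 6-digit area code, an 8-digit YYYYMMDD date, a 3-digit order number, and an ISO 7064 MOD 11-2 check character; the function name is hypothetical.

```python
# Position weights and check characters for the MOD 11-2 scheme.
WEIGHTS = [7, 9, 10, 5, 8, 4, 2, 1, 6, 3, 7, 9, 10, 5, 8, 4, 2]
CHECK_CHARS = "10X98765432"


def parse_resident_id(id_number):
    """Split an 18-character ID into its fields and validate the checksum."""
    if len(id_number) != 18 or not id_number[:17].isdigit():
        raise ValueError("expected 17 digits plus a check character")
    body = id_number[:17]
    total = sum(int(d) * w for d, w in zip(body, WEIGHTS))
    expected = CHECK_CHARS[total % 11]
    order = int(body[14:17])
    return {
        "area": body[:6],
        "dob": body[6:14],                       # YYYYMMDD
        "order": body[14:17],
        "sex": "M" if order % 2 == 1 else "F",   # odd = male, even = female
        "checksum_ok": id_number[17].upper() == expected,
    }
```

So the number itself carries area, full DOB, and sex in the clear, which is exactly the triple the article is about.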


True unique identifiers are immutable. At least two of those three things can be changed.


Gender is a bit of a stretch, if technically changeable. I don't know much on the subject but I would wager that the proportion of people who have changed gender is not statistically significant enough to change the 87% figure.


Identifiers are either immutable or they're not. They can't be sorta-kinda immutable.


Do we know if she took into account population migration / movement? Looking at the description it doesn't look like it, but I'd be really curious to see how that affects the stat. Or how the stat changes if the zip code is tied to the person's DOB. In that scenario it probably drops a lot, because (I'm assuming) most people are born in a hospital, and there isn't a hospital in every zip code. So to my first comment on population movement, maybe that's actually a representation of the effect of population migration / movement. Cool.


"True unique identifiers"? Is it anything like a true Scotsman? :-)

Something can be a unique identifier of someone in a given time point, and later become non-unique, or stop identifying that person.


It is hard to find good natural keys (that are immutable and unique). That is why many people recommend using artificial keys.


Now I'm glad I randomize my birthdate on just about every form on the internet.


With the density of large cities (though matched by zip codes), I wonder if the 13% live in urban areas and the remaining 87% outside.


I'd love to see someone with a large dataset containing those variables to see what % of their set overlaps on those.


About a year ago, I did a project for de-duping customer entered contact info. We had less than 100k records for any particular region (Bay Area, New York, etc.) From what I recall, we were able to confirm identity in about 80% of cases using Name, DOB, and zip code.

By far the best metric is phone number, that bumped us up over 97%. If the comparison was limited to only three metrics, then we would use Name, DOB, and phone number. In reality, we also compare zip code, street address, and contacts (like friends on facebook), in that order.
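
A rough sketch of what comparing on a small set of fields can look like. This is not the actual system described above; the field names, the normalization rules, and the US-centric phone handling are all assumptions made for illustration.

```python
import re


def normalize_name(name):
    """Lowercase and collapse whitespace."""
    return " ".join(name.lower().split())


def normalize_phone(phone):
    """Keep digits only and compare on the last 10 (assumes US numbers)."""
    digits = re.sub(r"\D", "", phone)
    return digits[-10:]


NORMALIZERS = {
    "name": normalize_name,
    "phone": normalize_phone,
    "dob": str.strip,
    "zip": lambda z: z.strip()[:5],  # ignore any ZIP+4 suffix
}


def match_key(record, fields):
    """Tuple of normalized values for the chosen fields."""
    return tuple(NORMALIZERS[f](record[f]) for f in fields)


def same_person(a, b, fields=("name", "dob", "phone")):
    """Treat two records as the same person if all chosen fields agree."""
    return match_key(a, fields) == match_key(b, fields)
```

Exact matching after normalization is the simplest option; a real de-duper would likely need fuzzy name matching and some tolerance for typos.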

It would be interesting to see if on average, people change their phone number less often than their name (marriage) or address (renters).


That's interesting, considering Google is trying to get me to give them my phone number. puts on a tin foil hat


But most define themselves by asl ;-)


[deleted]


That's what makes it so paradoxical!


The Mark of The Beast is at hand.


JOKES ON YOU! I can change 2 of the 3!!



