''At the time GIC released the data, William Weld, then Governor of Massachusetts, assured the public that GIC had protected patient privacy by deleting identifiers. In response, then-graduate student Sweeney started hunting for the Governor’s hospital records in the GIC data. She knew that Governor Weld resided in Cambridge, Massachusetts, a city of 54,000 residents and seven ZIP codes. For twenty dollars, she purchased the complete voter rolls from the city of Cambridge, a database containing, among other things, the name, address, ZIP code, birth date, and sex of every voter. By combining this data with the GIC records, Sweeney found Governor Weld with ease. Only six people in Cambridge shared his birth date, only three of them men, and of them, only he lived in his ZIP code. In a theatrical flourish, Dr. Sweeney sent the Governor’s health records (which included diagnoses and prescriptions) to his office.''
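For anyone who wants to see the mechanics, here is a minimal sketch of that kind of linkage in Python. All of the records, names, and field names below are made up for illustration; this is not the real GIC or voter-roll schema. The idea is just to join the two data sets on (ZIP, birth date, sex) and keep the keys that match exactly one voter.

```python
from collections import defaultdict

# Hypothetical toy data: every name, date, and diagnosis here is invented.
voter_roll = [
    {"name": "A. Example", "zip": "02138", "dob": "1950-01-01", "sex": "M"},
    {"name": "B. Sample", "zip": "02138", "dob": "1950-01-01", "sex": "F"},
    {"name": "C. Person", "zip": "02139", "dob": "1960-01-15", "sex": "M"},
]
medical_records = [
    {"zip": "02138", "dob": "1950-01-01", "sex": "M", "diagnosis": "dx-1"},
    {"zip": "02139", "dob": "1971-03-02", "sex": "F", "diagnosis": "dx-2"},
]


def quasi_id(row):
    """The three fields that survive 'de-identification' in both data sets."""
    return (row["zip"], row["dob"], row["sex"])


def link(voters, records):
    """Return (name, diagnosis) pairs where the key matches exactly one voter."""
    names_by_key = defaultdict(list)
    for voter in voters:
        names_by_key[quasi_id(voter)].append(voter["name"])

    matches = []
    for record in records:
        names = names_by_key.get(quasi_id(record), [])
        if len(names) == 1:  # ambiguous or missing keys are skipped
            matches.append((names[0], record["diagnosis"]))
    return matches


reidentified = link(voter_roll, medical_records)
print(reidentified)
```

The first medical record links to a single voter even though two voters share its ZIP and birth date, because sex breaks the tie. The second record matches nobody and stays anonymous.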
The state of Texas complicated things by allowing a family to have the same license plate for all their cars. One family had seven vehicles, all with the same license plate number. Florida would issue the same plate to complete strangers. One man was arrested with great drama because a bank robber (or some such undesirable) had the same (legal) plates.
My GF has such a common name that her doctor has to check which one of three she is. The last time she tried to get her driver's license renewed online, she was told she had to go to the DMV in person. She asked why and was told she was a "Code 5". She reported to the DMV and told the woman that she was a Code 5. There were dropped jaws and a hurried conference with a supervisor, who called Sacramento and had a lengthy conversation with comments like, "Yes I'm sure. She's right here." Finally she got impatient and asked, "What's Code 5?"
The clerk said, "You're dead and it was verified by medical personnel." She did eventually get her license renewed. Officially I live with a zombie. It isn't as bad as the movies portray.
MA driver's licenses no longer use SSN as far as we can tell.
It's worse than that: in Illinois the first four characters of your DL ID are your last name's soundex code.
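For reference, a rough sketch of standard American Soundex in Python. This only shows the name-to-code step (one letter plus three digits). How a state combines it with the rest of a license number is not covered here, and the function name is my own.

```python
# Digit assigned to each consonant group in American Soundex.
CODES = {}
for digit, letters in {
    "1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R",
}.items():
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the 4-character Soundex code, or '' if name has no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0]]
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W are skipped and do not separate equal codes
        code = CODES.get(ch, "")  # vowels (and unmapped letters) give ""
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so a repeated code counts again
    return "".join(out)[:4].ljust(4, "0")


print(soundex("Robert"), soundex("Rupert"), soundex("Ashcraft"))
```

Note that "Robert" and "Rupert" come out identical, which is the whole point of Soundex: it throws information away, so those four characters narrow down a surname without pinning it exactly.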
I always thought that part looked suspicious. Thanks for the tip.
Not as uncommon as you might think.
Plight of the Living Dead (http://www.time.com/time/world/article/0,8599,2054133,00.htm...)
The reverse of that phenomenon is that, given a data set in a high-dimensional space (even 3 dimensions, if each dimension has more than a few bits of entropy), it will cover the dimensions very sparsely (even if it's large!), and therefore it's relatively easy to recover specific details of the sample from the aggregate statistics.
edit: Well, I was hoping this might be a new insight, but in fact there's a good 2005 paper exploring that connection in much more detail: http://www.vldb2005.org/program/paper/fri/p901-aggarwal.pdf
Someone has to be the gatekeeper on your dimensions.
For the busy HNer, it's not even necessary to click on the link to get the key idea from the article.
The ZIP codes in Israel are per-street, not per-city. Given Israel's population of just under 8M, I believe a very high percentage (> 95%) of people can be uniquely identified.
 - http://blog.y3xz.com/post/7846661044/data-mining-the-israeli...
Great article, by the way. I especially like the idea of diff'ing the databases.
My point is that this shouldn't be THAT surprising. I suspect full name and gender would uniquely identify a fair portion of the US as well. We're not as homogenous as some societies and I think this proves it...
What I'd find amusing would be how much of the population is uniquely identifiable by browser+plugins+os+resolution.
Edit: Beaten like a dead horse.
US pop in 2000: 280,000,000
US births in 2000: 4,000,000
US births per day: 10,958
Fraction of US pop that 40,000 represents: 0.0001428
Number of births per day for the 40,000: 1.56
That seems in line with the conclusion of the paper. See any mistakes?
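A quick way to double-check that arithmetic, using only the figures quoted above (the variable names are mine):

```python
us_pop = 280_000_000        # US population in 2000, as quoted above
births_per_year = 4_000_000 # US births in 2000, as quoted above
group = 40_000              # size of the group being considered

births_per_day = births_per_year / 365
fraction_of_pop = group / us_pop
births_per_day_in_group = births_per_day * fraction_of_pop

print(f"births per day (US):    {births_per_day:,.0f}")
print(f"fraction of population: {fraction_of_pop:.7f}")
print(f"births per day (group): {births_per_day_in_group:.2f}")
```

Without truncating the intermediate values the last figure comes out closer to 1.57 than 1.56, which doesn't change the conclusion.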
On a side note, Happy 21st Birthday in just over a month!
Of course February 29th is dead last in frequency. Most early October birthdays are above the median in frequency.
365 12 25
364 1 1
363 12 24
362 1 2
361 7 4
360 12 26
359 11 28
358 11 27
357 11 26
356 1 3
355 10 31
354 12 23
353 11 25
352 11 24
8.62% of all live births in the US during 2006 occurred during October. No single month deviated significantly from its expected percentage.
Zip codes with large populations are probably closer to 50% than 87%, and obviously the reverse is true as well. I wonder what the population of a zip code would have to be to get really close to 100%. Just throwing some numbers in a calculator, I'd guess 15-20k people would be damn close. So 10k is probably just about a unique identifier.
I realized after looking at both Portland and Seattle zip codes that they seem to be distributed better than the less dense areas so you do have very few zip codes over 20k.
What I'd be really interested to see is why this works, and what it tells us about distribution of population by zip code. I'd imagine the places where this doesn't work as well are the most densely populated zip codes, where the likelihood of duplicates on the given key increases, but I would never have guessed that the accuracy would be anywhere near 87%. (Maybe there are a lot more zip codes than I thought? Maybe they used zip+4?)
It doesn't tell us much about zip code distribution because zip codes are chosen to have approx the same number of people in each. As it turns out, that's exactly how you'd go about maximizing the amount of information the zip code carries... which is unsurprising since that's the entire purpose of a zip code. Gender is almost exactly one bit, and date of birth is 15ish bits with some bad uniformity assumptions, zip is another 15ish with bad uniformity assumptions, that's 31-ish total, subtract off 3-ish for the bad assumptions and you get 28, which covers 2^28 = 268,435,456 people, which is pretty close to the number cited (.87 times 312,000,000 = 271,440,000). I'll cop to tuning the fudge factor of three bits to nicely match the number given, but the bit count itself just comes from the space of possibilities.
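Here is that bit count as a small Python sketch. The 80-year span for birth dates is my own assumption (the comment above only says "15ish bits"); the 43,000 zip codes and the 312,000,000 population are the figures used in this thread. Everything assumes perfect uniformity, which is exactly the "bad assumption" being fudged away.

```python
import math

sex_bits = math.log2(2)            # two roughly equal categories
dob_bits = math.log2(365.25 * 80)  # assumed: birth dates spread over 80 years
zip_bits = math.log2(43_000)       # zip code count cited in this thread

total_bits = sex_bits + dob_bits + zip_bits
population_bits = math.log2(312_000_000)  # bits needed to single out one person

print(f"sex: {sex_bits:.1f}  dob: {dob_bits:.1f}  zip: {zip_bits:.1f}")
print(f"total available: {total_bits:.1f} bits")
print(f"needed for 312M people: {population_bits:.1f} bits")
```

Under these assumptions the three fields carry about 31 bits against the roughly 28 needed, so there are around 3 bits of slack to absorb the non-uniformity.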
Lots more info on the topic can be found here: http://33bits.org/about/
http://www.carrierroutes.com/ZIPCodes.html: 43,000 ZIP codes => log2(43000) is approx. 15.4 bits, assuming perfect uniformity.
In general, if you model this as an N balls in M bins problem, then even when N == M, you'd expect a fair amount of anonymity-preserving collisions. Maybe 1/2 of people would collide. As we then double the number of bins, we'd roughly expect the number of collisions to be cut in half.
If you imagine putting 100 balls (people) into 800 = 100 * 2^3 bins (number of different birthday-zip-gender encodings) at random, about 1/8 of the balls will land in a bin with another ball [okay, this estimate is somewhat off by a smallish constant factor: it's only true that the 100th ball tossed will collide with probability 99/800 ~= 1/8 if there were no existing collisions, and the earlier balls thrown have less to collide with], and so not be uniquely identifiable.
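The balls-in-bins estimate is easy to check. A rough Python sketch, assuming uniformly random bins (real zip/birthday data is not uniform): a given ball is alone in its bin when all N-1 other balls miss it, which happens with probability (1 - 1/M)^(N-1). The function names are mine.

```python
import random
from collections import Counter


def expected_unique_fraction(n_people, n_bins):
    """Probability that a given person shares a bin with nobody else."""
    return (1 - 1 / n_bins) ** (n_people - 1)


def simulated_unique_fraction(n_people, n_bins, trials=2000, seed=0):
    """Monte Carlo estimate of the same quantity, for a sanity check."""
    rng = random.Random(seed)
    unique = 0
    for _ in range(trials):
        counts = Counter(rng.randrange(n_bins) for _ in range(n_people))
        unique += sum(1 for c in counts.values() if c == 1)
    return unique / (trials * n_people)


for n, m in [(100, 100), (100, 800)]:
    exact = expected_unique_fraction(n, m)
    approx = simulated_unique_fraction(n, m)
    print(f"N={n} M={m}: unique {exact:.3f} (formula), {approx:.3f} (simulated)")
```

With N == M roughly 37% of people end up unique (so more like 63% collide than 1/2), and with 8x as many bins about 88% are unique, i.e. a bit under 1/8 collide.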
It's a powerful idea. I wrote a whole essay analyzing the anime _Death Note_ using the 33 bits idea (http://www.gwern.net/Death%20Note%20Anonymity) and I'm sure that's not even the tip of the iceberg.
The US population is 300 million and there are about 43,000 zip codes in the US. Assuming an even distribution, that's about 7,000 people per zip code. Cut it down to just one gender and we're at 3,500. The probability of a chosen person in that group having a unique birthday is (364/365)^3500 = 0.0068%. Now, if we say that the person could have been born in one of the past sixty years, we get (21914/21915)^3500 = 85% as the probability that this person has been uniquely identified.
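Those two numbers can be reproduced in a few lines of Python. This is just the calculation above, with the same uniformity assumptions (even zip populations, uniformly distributed birth dates); the helper name is mine.

```python
def p_unique(others, possible_birthdates):
    """Chance that none of `others` people share one given birth date."""
    return ((possible_birthdates - 1) / possible_birthdates) ** others


people_per_zip_and_sex = 3500  # 300M / 43,000 zips / 2, rounded as above

day_only = p_unique(people_per_zip_and_sex, 365)
full_dob = p_unique(people_per_zip_and_sex, 21915)  # 60 years * 365.25 days

print(f"birthday only:   {day_only:.4%}")
print(f"full birth date: {full_dob:.1%}")
```

Adding the birth year is what does the work: it multiplies the number of possible values by 60, which takes the chance of being unique from essentially zero to about 85%.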
It turns out that zip codes aren't as populous as you might imagine.
As of the Y2K census, only a dozen zip codes had over 100,000 residents; 80% of all zip codes had fewer than 15,000 residents, with the median zip code at 2,500.
(Source: http://www.census.gov/geo/www/gazetteer/places2k.html )
I've been pondering a useful way to have /<yourname> in a URL, so that everyone with that name can use the URL containing it without collisions. Of course, I always end up with something like website.com/a3fx/<yourname>, which is arbitrary and ugly. However, with this stat it seems we have something close to a non-colliding, pretty, meaningful addressing scheme, i.e. website.com/<dob>/<gender>/<zip>/<yourname>. Sure, it's a bit long, but it provides assurance you're getting who you think you're getting.
I prefer writing
Sex: Yes, please
Sex, gender or age is allowed.
One could theoretically ask "how old are you" but then you don't know when to lift limitations. Say a user joins six months before their 18th birthday. Are you going to block 18+ content for a whole year from then?
(I prefer websites just ask that binary question, like this brewery - http://www.newbelgium.com - instead of making me waste time with a date input.)
It may be relevant if you are trying to examine differences immediately before and after your 18th or 21st birthdays, but that would be pretty rare.
The only way to identify people in real life is to issue them a public ID number that can be used in conjunction with a private ID number (SS#).
Which is why I deliberately give an incorrect DOB on most websites that want one.
Area_ID + DOB + Order Number + Checksum
For order number: Men are assigned to odd numbers, women assigned to even numbers.
Something can be a unique identifier of someone in a given time point, and later become non-unique, or stop identifying that person.
By far the best metric is phone number, which bumped us up over 97%. If the comparison was limited to only three metrics, then we would use name, DOB, and phone number. In reality, we also compare zip code, street address, and contacts (like friends on Facebook), in that order.
It would be interesting to see if on average, people change their phone number less often than their name (marriage) or address (renters).