When Anonymous Isn’t Really Anonymous (brooksreview.net)
63 points by chmars 1259 days ago | 25 comments

Over 60%, and potentially up to 87% [0], of the US population can be uniquely identified just from gender, zip code, and birth date. This isn't all that surprising [1]. The EFF's browser fingerprinting (through Panopticlick) shows that a similar percentage of browsers are uniquely identifiable through fonts, plugins, etc. [2] If you're trying to identify someone, every "common" trait they have will still eliminate a large number of people from consideration -- gender eliminates roughly 50% of the population, birth state eliminates 85-99.8% of the population (depending on whether you were born in California or Wyoming), birth date eliminates over 99% of the population and much more if you have a year, and pretty soon you're not one face in 300 million, you're one face in a thousand, and then one face in ten, and then you're just you. And whatever it is that made someone want to try to figure out who you were is now out in the open.

[0] Depending on whether http://www.citeulike.org/user/burd/article/5822736 or http://www.truststc.org/wise/articles2009/articleM3.pdf is more accurate

[1] http://godplaysdice.blogspot.com/2009/12/uniquely-identifyin...

[2] https://panopticlick.eff.org/ and the explanation at https://panopticlick.eff.org/browser-uniqueness.pdf -- particularly sections 4 and 5.2.
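The narrowing-down described above can be put in rough numbers. This is only a back-of-the-envelope sketch: the fractions are rounded assumptions (not taken from the cited papers), the traits are treated as independent and uniformly distributed, and the 80-year spread of birth years is made up for illustration.

```python
import math

US_POPULATION = 300_000_000  # round figure from the comment above


def bits(fraction_remaining: float) -> float:
    """Bits of identifying information gained when a trait keeps only this fraction of people."""
    return -math.log2(fraction_remaining)


# Assumed fraction of the population that REMAINS after learning each trait.
traits = {
    "gender": 1 / 2,
    "birth state (a large state, eliminating ~85%)": 0.15,
    "birthday (month and day)": 1 / 365,
    "birth year (assuming ~80 equally likely years)": 1 / 80,
}


def remaining_after(fractions, population):
    """Expected number of people still matching once every trait is known."""
    remaining = float(population)
    for fraction in fractions:
        remaining *= fraction
    return remaining


bits_needed = math.log2(US_POPULATION)
bits_gained = sum(bits(f) for f in traits.values())
candidates = remaining_after(traits.values(), US_POPULATION)

print(f"bits needed to single out one person: {bits_needed:.1f}")
print(f"bits gained from these four traits:   {bits_gained:.1f}")
print(f"people still matching:                {candidates:.0f}")
```

Under these assumptions a few hundred people are left, which is why adding one more attribute such as a zip code is usually enough to finish the job.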

I had a doctor that used first name, last name, birth month, and birth day as their primary identifiers. I was checking in one year for my physical, and was asked for these four data points. The next question was about current medication. The receptionist nurse asked me the question, but before I could answer, she got a really, really odd look on her face, then started glancing between me and the computer. Finally, she asked, with a bit of a smirk on her face, what year I was born. When I responded, she chuckled. Turned out the only record that came up was for someone 30 years my senior, and on a medication that only someone that age would be on. The receptionist nurse dug around in the system and was finally able to find my records, but if she hadn't noticed that the listed prescription didn't match my apparent age, my checkup info would have ended up in someone else's record and it could have gotten messy.
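To illustrate the kind of collision described here, a small sketch follows. The `Patient` record, the names, and the dates are all hypothetical; the only thing it is meant to show is that a key without the birth year can match two people, while adding the year separates them.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    # Hypothetical record layout, not any real system's schema.
    first: str
    last: str
    birth_year: int
    birth_month: int
    birth_day: int


def weak_key(p: Patient):
    """The four data points from the anecdote: name plus birth month and day."""
    return (p.first, p.last, p.birth_month, p.birth_day)


def strong_key(p: Patient):
    """The same key with the birth year added."""
    return weak_key(p) + (p.birth_year,)


def find_collisions(patients, key_fn):
    """Return every key that maps to more than one patient."""
    buckets = defaultdict(list)
    for p in patients:
        buckets[key_fn(p)].append(p)
    return {key: group for key, group in buckets.items() if len(group) > 1}


patients = [
    Patient("Alex", "Smith", 1950, 3, 14),
    Patient("Alex", "Smith", 1980, 3, 14),  # same name and birthday, 30 years younger
    Patient("Jo", "Brown", 1975, 7, 1),
]

print(len(find_collisions(patients, weak_key)))    # number of ambiguous weak keys
print(len(find_collisions(patients, strong_key)))  # number of ambiguous strong keys
```

A lookup that returns the first match for a weak key silently picks one of the two records, which is the failure the receptionist caught by eye.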

False positives are an interesting case, where things can definitely get messy. That doctor may have used the same system for decades without ever seeing a name collision, but if the receptionist hadn't noticed in that instance, both of you might have had significant problems in the future. Likewise, if someone decides to track another person with malicious intent (say, "that guy I saw my ex with in the restaurant who was wearing the [university] shirt and talked about being in [field] and growing up in [place]") and they end up locating the wrong target, things could get ugly. Or if a company starts sending potentially offensive material to someone they've misidentified, it can trigger family drama or even retaliation.

So it's a two-edged sword. People aren't very anonymous, but mistakes in the de-anonymization process can wreak havoc.

When I worked in Medical Records in a capital city in the UK, your date of birth was the primary key for filing, so not asking for the year is a basic procedural error. Then again, doctors' surgeries have far lower standards than outpatient departments. Incidentally, I have found my experience as a filing clerk to be far more useful to my coding life than any of the other crap jobs I did.

Here in the UK we have an NHS number, which curiously is not the same as your NI number (equiv to an SSN). Unlike the SSN in the US, the NHS number isn't used for anything else, and the NI number is only used for tax.

That seems like it should violate medical records confidentiality rules (I don't know if it does, but it should).

Based on the interaction, I could obviously deduce that there was someone else with the same four data points that saw the same doctor and was 30 years older. I don't know if that, in and of itself, is breaking confidentiality.

If they had started entering my data into that person's record and subsequently realized the error, then they might have had to break some confidentiality in order to clean up our conjoined record.

At my doctor and pharmacy they look me up by birthdate only in the computer. I assume they only need more data if there is a collision.

All this only goes to show that, if we really want to keep a sane degree of democracy, we can never rely on tech to be safe.

Instead, it's a thousand times more important to get real transparency and accountability in our governmental structures and money completely out of politics.

At least it is straightforward to be aware of conversations you are trying to anonymize and to omit or obscure birth dates and locations (by obscure, I mean lie).

It's also somewhat interesting that text wouldn't necessarily be as revealing as the language quiz from the article; when I type "aunt", readers have no idea how I pronounce it (but that is just picking at the specific to ignore the general; identity characteristics leak out all over the place).

> "... when I type "aunt", readers have no idea how I pronounce it ...".

From the language survey, there were quite a few questions dealing with word choice (e.g., coke, soda, pop). There are also algorithms that can predict, with varying levels of accuracy, the writer's gender [1]. Combine these with browser font identification [2], geolocation [3], and other comparable data, and lying or omitting data isn't going to provide much in the way of anonymizing you or your data.

[1] http://www.hackerfactor.com/GenderGuesser.php

[2] https://panopticlick.eff.org/

[3] http://html5demos.com/geo

Why would you trim the majority of my comment which is essentially in agreement with what you have written?

I get that you are just elaborating on data leakage, I just don't understand the style you chose to do it in. "identity characteristics leak out all over the place" isn't exactly a brazen declaration of how easy it is to stay anonymous.

I read your comment as saying that you could lie in text about identifying information and things that might betray you in a f2f interaction, such as regional accents, weren't as important in textual interactions.

My intent was more that it is straightforward to be aware of several of the most blatant leakages. Combating dialect analysis is a lot tougher, but the quiz likely used 25 of the highest value differences (many of the choices are laser specific: http://www4.uwm.edu/FLL/linguistics/dialect/maps.html ).

Context also matters a lot. If there is no obvious path to an IP or browser interaction, all that stuff goes away.

What would be interesting is a look into passive elicitation of information. It's easy to falsify something when you know someone is looking to use a single data point against you. But truly powerful social engineering would manipulate your confidence and extract high-information-content data passively (when you are not expecting it), and to do this they need to instill confidence that you <are not being observed or monitored>. Yet they may still need to prompt information flow to feed the analysis. Which is sort of the art of interviewing, but now in a bit more of a 21st-century context.

I agree that the things you can do with these large datasets are pretty cool. However, people ignore the exceedingly high error rates when they try to apply these techniques to "important" problems (e.g. identifying potential terrorists).

Take his example of the quiz that identifies your home region by dialect quirks. It's really an exercise in confirmation bias. For those who get an accurate result, the quiz is amazing, and they tell all their friends. For people like me, who were told they grew up across the country from where they actually did, it's just another silly, easy-to-forget quiz.

It's the same for the 20-questions device. We're amazed when it guesses the right answer, but so quickly forget the ones it screws up on.

That's why, when I hear about how we just need a big enough dataset to identify threats to our nation or accurately predict the stock market, I worry about that 10-30%.

Something that annoys me about how the NSA scandal is treated, even here on HN, is the extremely self-centric approach. Not all 300 million US citizens are identifiable by looking at word choice, since not all of them are native speakers. The 6.7 billion other people on the planet aren't native speakers either, and probably can't be traced as easily, or may even be located in the wrong place (my non-native English was located in New Jersey according to the test page they mentioned).

Now the EFF, Doctorow, etc. have started their thedaywefightback.org, where they recommend putting up quotes from Benjamin Franklin and call spying "unamerican". When Obama told us that the US "are only spying on foreigners" there was almost no reaction to be found in the US media (while the rest of the world was rightfully pissed).

Yes, it's an American agency that is spying, but it's spying on everybody, and it's not only US citizens who want to do something about that. It's sad that the people raving about how great the internet is are missing this chance to use _global_ outrage to keep it what it is supposed to be: a free communication channel for people all over the world. Not just Americans.


It's worth noting that Doctorow is neither a US citizen nor resident.

You only need 32.6 bits of information to uniquely identify everyone:
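One way to arrive at a figure like that is to take the base-2 logarithm of the number of people you need to tell apart. A quick sketch, assuming a world population of roughly 6.7 billion (the population figure is an assumption made here, not something stated in this comment):

```python
import math

WORLD_POPULATION = 6_700_000_000  # assumed round figure


def bits_to_identify(population: int) -> float:
    """Bits needed to give every member of the population a distinct label."""
    return math.log2(population)


print(f"{bits_to_identify(WORLD_POPULATION):.1f} bits")

# 33 whole bits are enough to give every person a distinct label; 32 are not.
print(2 ** 33 >= WORLD_POPULATION, 2 ** 32 >= WORLD_POPULATION)
```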


Does it really matter if I'm not anonymous online?

"Does it really matter if I'm not anonymous online?"

Even if you're not anonymous online, consider the following scenario:

You visit website A; you're not anonymous and that's fine for you.

You then visit website B. Again you're not anonymous and that's fine too.

Then you visit website C, followed by website D.

Again, you're not anonymous and that's fine because you know that website A doesn't know you visited website B and website B doesn't know that you visited website C.

Website C does know that you visited website D, but it doesn't know what you did once you got there. So although you're not anonymous online, you don't feel like someone is watching over your shoulder looking at everything you do.

But unknown to you (or maybe even known to you) company X has a little bit of analytics code on each of these websites and does know that you visited website A, and website B, and also website C, D, E, F, G, H, I, J, K, L and more.

So maybe it's a question of the degree to which you can be anonymous online. On individual websites, you might be fine not being anonymous, as long as those websites don't know what you're doing elsewhere on the web. But is anonymity important if you know everything you do online is being tracked and joined together from all your disparate journeys?

For some people it doesn't matter. For some people or in some instances, it does matter.
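The difference between what each site sees and what company X sees can be sketched in a few lines. Everything here is hypothetical: the event tuples, the cookie ID, and the site names are invented for illustration, and real analytics services identify browsers in more ways than a single cookie.

```python
from collections import defaultdict

# Hypothetical log of (tracker_cookie_id, site, page) events, one per page view
# on any site that embeds company X's analytics snippet.
events = [
    ("cookie-123", "site-a.example", "/pricing"),
    ("cookie-123", "site-b.example", "/forum/health"),
    ("cookie-123", "site-c.example", "/jobs"),
    ("cookie-999", "site-a.example", "/"),
]


def site_view(events, site):
    """What a single website can see: only the visits made to that website."""
    return [(cookie, page) for cookie, s, page in events if s == site]


def tracker_view(events):
    """What company X can see: every visit, joined per browser across all sites."""
    history = defaultdict(list)
    for cookie, site, page in events:
        history[cookie].append((site, page))
    return dict(history)


print(site_view(events, "site-a.example"))
print(tracker_view(events)["cookie-123"])
```

Site A only ever learns about its own two visits, while the joined view lists all three sites for the same browser.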


Another good example is from http://phinished.org/faq.php?faq=vb_read_and_post#faq_zoints...

"The anonymous posting option should be used sparingly. Please use it only when you have a matter of a highly personal or sensitive nature to discuss, or when you need to protect the privacy of colleagues, students, friends, or family members. It should never be used as cover to criticize or attack other users, as a means to evade social responsibility for one's comments, or as a way to avoid minor embarrassment on the boards. Anonymous posts may not contain attachments, and they cannot be edited after posting without revealing the username of the author. If you want to maintain your anonymity, do not attempt to edit an anonymous post.

The webmaster reserves the right to suspend or revoke the ability to post anonymously of any member who abuses the anonymous posting privilege. For more information, please see our acceptable use policy and privacy statement."

For some people it really does matter.

They are releasing information to the outside world about an oppressive regime.

They are whistle-blowing in an industry that has weak protections for whistleblowers.

They found some horrible criminal material online and want to report it without being arrested by law enforcement.

Etc etc.

Sadly, there are things that most people could do to help the few who need anonymity, but because these things are a bit fiddly and provide no immediate benefit, most people do not bother.


People can take all sorts of things as justification for targeting you. Maybe you made a seemingly innocuous comment on HN or another website that someone took as a personal slight (say, your comment about tech companies "getting ahead of themselves" with drones and self-driving cars). Maybe you made a comment that sounded vaguely like a religious or political position that someone finds very offensive. Maybe you have a particular habit, desire, or interest that is illegal or that some people consider immoral or offensive, which you don't particularly advertise but you happen to mention somewhere. Maybe someone thinks you downvoted them even though it wasn't you. All of a sudden, you're a target -- and someone is able to go from a handful of comments you made on some website to knowing your home address, your place of employment, your family/religious/social circles, and maybe something about you that could get you into trouble with one of those groups.

Note that it's not just "on the internet". Somebody might see you talking to their ex in a public place (you were just asking for directions) and decide to pick up whatever snippets of personal information they can about you by listening in at the checkout. The internet only comes in once they've decided to come after you -- and they're able to tie in to the giant database their company maintains that sends you coupons or whatever in the mail, and now they have all of your personal information including some you'd rather not have every stranger on the planet knowing.

Who's asking?
