Ask HN: Can Google aggregate everything you've ever posted anonymously online based on writing style?
36 points by staunch on Dec 30, 2008 | 33 comments
I'm just asking about Google because they're in such a good position to do it.

Is any private company doing it? It'd be neat if there was a web site that you could submit some writing samples (emails or whatever) and then see everything else that person has posted online (regardless of whether it was anonymous or not).

I'm sure there's no way to do it with total accuracy, but with enough input shouldn't it be possible to be highly accurate?

Anyone know of any software that can take a large number of writing samples and determine who wrote which ones?

If not, how would you go about creating it?

I did really simple analysis in an AI class with several thousand posts of a forum I post frequently at, tagged "me" and "not me" (I figured that way it would be minimally invasive to people not participating -- plus, hey, easy naive Bayesian). Got fairly decent results with pretty trivial sample inputs. Numbers elude me, it has been years.

Everyone thinks "Aha, you have some catchphrases" (I do) or "Aha, you were one of the only Republicans and thus someone saying 'death tax' was more likely you" (true) or "You cited nationalreview.com more than the rest of the forum together" (true), but it turns out the distribution of really stupid stuff (stopwords, essentially) works better.
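A minimal sketch of that kind of "me" / "not me" classifier, assuming a naive Bayes model that looks only at stopword counts. The stopword list, function names, and smoothing choice here are illustrative assumptions, not the setup used in the class project described above.

```python
import math
import re
from collections import Counter

# Hypothetical short stopword list; a real experiment would use a longer one.
STOPWORDS = ["the", "a", "of", "and", "to", "in", "that", "it", "is", "but", "so", "which"]

def stopword_counts(text):
    """Count only the stopwords in a text, ignoring every content word."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w in STOPWORDS)

def train(labeled_posts):
    """labeled_posts: iterable of (label, text). Returns per-label log-probabilities."""
    counts = {}
    docs = Counter()
    for label, text in labeled_posts:
        counts.setdefault(label, Counter()).update(stopword_counts(text))
        docs[label] += 1
    total_docs = sum(docs.values())
    model = {}
    for label, c in counts.items():
        total = sum(c.values()) + len(STOPWORDS)  # add-one (Laplace) smoothing
        model[label] = {
            "prior": math.log(docs[label] / total_docs),
            "word": {w: math.log((c[w] + 1) / total) for w in STOPWORDS},
        }
    return model

def classify(model, text):
    """Return the label with the highest posterior log-score for this text."""
    c = stopword_counts(text)
    def score(label):
        m = model[label]
        return m["prior"] + sum(n * m["word"][w] for w, n in c.items())
    return max(model, key=score)
```

Usage would look like `model = train([("me", post1), ("not me", post2), ...])` followed by `classify(model, new_post)`. Catchphrases and topic words never enter the model, which is the point of the stopword observation.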

This is, ironically, the same thing they've discovered for making female/male authorship decisions, although I never went the next step and said "So what relationship does my distribution have with the average guy distribution?"

Incidentally, here's the reason you'll never have to worry about this in the context of "Google the Internet for everything Patrick McKenzie has ever written": imagine I have a 99.9% effective filter for you, and I dragnet an Internet filled with 5 billion documents of which you've written 1,000. I then identify 5 million documents as written by you... but you only wrote 1,000 of them.

This sort of "don't search the haystack unless you're bloody sure it is packed full of needles" thing is why you never want to test a population not known to be at risk for the disease, etc. (Or why you retest in the event of a positive using a different test.)
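The dragnet arithmetic can be checked in a few lines. This sketch assumes "99.9% effective" means the filter is right 99.9% of the time on both your documents and everyone else's; the function name is made up for illustration.

```python
def dragnet(total_docs, true_docs, accuracy):
    """Expected hits when a filter with the given accuracy scans every document.

    Assumes, for simplicity, that `accuracy` is both the true-positive rate
    and the true-negative rate of the filter.
    """
    true_pos = true_docs * accuracy
    false_pos = (total_docs - true_docs) * (1 - accuracy)
    precision = true_pos / (true_pos + false_pos)
    return true_pos, false_pos, precision

tp, fp, precision = dragnet(5_000_000_000, 1_000, 0.999)
print(f"true hits: {tp:,.0f}, false hits: {fp:,.0f}, precision: {precision:.4%}")
```

With the numbers from the post, the false hits come to roughly 5 million against about 1,000 true ones, so only around 0.02% of flagged documents are actually yours.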

I down-voted you for the use of the term "death tax". The correct term for it is the "trust fund baby tax". This is not merely a matter of rhetoric, but of accuracy. The dead person isn't being taxed. Being deceased and presumably having no use for wealth past that point, he or she doesn't care. The living heirs are being taxed.

To quote Salute Your Shorts, get it right or pay the price.

It cost you 2 karma points, because I'd otherwise have voted you up for an otherwise high-quality post.

That was an odd thing to do... If he had used your "correct" term in his post, it would not have been a good indicator of Republican behavior and would therefore ruin the example. His post uses the term in a description of other posts -- just as your post does.

You make a good point. I just have a knee-jerk negative response to phrases like "death tax" when not used with bitter irony. (I have Republican relatives.) I probably deserve my -5 for being obnoxious.

Rash, knee-jerk reactions are often a social liability, which reminds me of a funny bar anecdote:

Guy: So, you're waiting for someone?

Girl: Yeah, some college friends. They're an hour late. Oh my... Boy, friends can be really rude sometimes.

Guy: Tell me about your sister. How old is she?

Girl: What?

Guy: Sorry, I heard the words "my boyfriend" and--

(Girl walks away.)

Yaaaouch! Seafood soup is NOT on the menu!

The dead person isn't being taxed. Being deceased and presumably having no use for wealth past that point, he or she doesn't care.

That's not true. Estate tax rules affect gifts during one's own lifetime as well.

I think such nitty-gritty comments about karma aren't very insightful. But while we're in those nitty-gritty details, I'll say there could very well be some people who upvote him just to counterbalance your downvote; he might even make a profit.

I down-voted your description to register my disapproval of marking down someone's post for a single, non-offensive word. That's obnoxious and pedantic.

You can imitate someone's writing style with a travesty generator but that doesn't mean everyone has a "writing fingerprint". Very few people have a distinct enough writing style to weed out false positives, and since it's so easy to imitate you'll never really know your precision.

I have a few writing tics (parens, the '--', certain words like 'certain') but it's much easier to just search on "aristus" to start exposing my shame.

I played with this a few years ago with a project called unmaskr. Heuristics can help precision a lot but do not help with recall. People generally:

  * use similar usernames, or a "constellation" of usernames

  * have semi-regular posting times

  * post in one place at one time

  * use similar place names, nicknames for things

  * write fluently in one language

  * link to a "constellation" of domains

It is no doubt possible. I've done something similar to this on a single forum 3 or so years ago and had reasonably interesting results.

Basically, I scraped the site, removed formatting/spacing/dead-words, stemmed (using a modified Porter stemmer), and constructed a matrix of word frequencies per post. I then applied several analytical techniques (statistical, geometric, etc.). Using a blended method, I was able to identify most aliases of board posters. However, the issue Google would no doubt have is sample size. In a small enough sample, quirks work as identifying characteristics; in a larger dataset you would no doubt see clusters of people who have similar backgrounds (e.g. education).
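A rough sketch of that pipeline, assuming plain-text posts are already scraped. The `crude_stem` helper is a deliberately simple stand-in for a real Porter stemmer, the stopword set is illustrative, and cosine similarity is just one example of a "geometric" comparison; none of this is the poster's actual code.

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it"}  # illustrative only

def crude_stem(word):
    """Very rough suffix stripper, standing in for a real Porter stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def term_counts(post):
    """Lowercase, drop stopwords, stem, and count the remaining terms."""
    words = re.findall(r"[a-z]+", post.lower())
    return Counter(crude_stem(w) for w in words if w not in STOPWORDS)

def frequency_matrix(posts):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in posts[i]."""
    counts = [term_counts(p) for p in posts]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[t] for t in vocab] for c in counts]

def cosine(u, v):
    """Cosine similarity between two count vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```

Pairs of accounts whose rows score unusually high against each other would be the alias candidates to inspect by hand.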

When I was packing up my college dorm room at the end of senior year, I found a typed short story and started reading. It was the most peculiar sensation, because it read like I had written it, but I didn't remember writing it. Then I realized (i.e., noticed the heading and saw) that it was my brother's story. Three years apart, we'd picked the same kind of IB senior English project and attempted to write in the style of the same author, and our writing (though in very different stories) was very, very similar.

Donald Foster (http://en.wikipedia.org/wiki/Donald_Foster_(professor) ) is sometimes known as a "forensic linguist" for his use of computers to analyse texts. Famously, he figured out the (formerly anonymous) author of the bestselling novel "Primary Colors" using computer analysis. If you're interested in this kind of thing, I highly recommend starting with the Wikipedia article, which describes a lot of Foster's work.

I believe I can offer some help because this is in the rough area of my Ph.D. thesis (see http://33bits.org/ for more on that).

Everyone does have a writing fingerprint, contrary to what another person claimed. However, it is an open question whether it can be efficiently extracted.

The basic idea for constructing a fingerprint is this. Consider two words that are nearly interchangeable, say 'since' and 'because'. Different people use the two words in a differing proportion. By comparing the relative frequency of the two words, you get a little bit of information about a person, typically under 1 bit. But by putting together enough of these 'markers', you can construct a profile.
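The marker idea can be sketched as follows. The word pairs beyond 'since'/'because' are assumptions chosen for illustration, and the distance measure is an arbitrary simple choice rather than anything from the literature.

```python
import re
from collections import Counter

# Hypothetical marker pairs: near-interchangeable words whose ratio varies by author.
MARKER_PAIRS = [("since", "because"), ("while", "whilst"), ("on", "upon")]

def marker_profile(text):
    """For each pair (a, b), return the share of a among all uses of a or b.

    A pair the author never uses yields None rather than a made-up ratio.
    """
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    profile = {}
    for a, b in MARKER_PAIRS:
        total = counts[a] + counts[b]
        profile[(a, b)] = counts[a] / total if total else None
    return profile

def profile_distance(p, q):
    """Mean absolute difference over the pairs that both profiles observed."""
    shared = [k for k in p if p[k] is not None and q.get(k) is not None]
    if not shared:
        return None
    return sum(abs(p[k] - q[k]) for k in shared) / len(shared)
```

Each pair on its own says very little, which is why short texts leave most entries as `None` and why many markers have to be combined before a profile becomes useful.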

The beginning of modern, rigorous research in this field was by Mosteller and Wallace in 1964: they identified the author of the disputed Federalist papers, almost 200 years after they were written (note that there were only three possible candidates!). They got on the cover of TIME, apparently. Other "coups" for writing-style de-anonymization are the identification of the author of Primary Colors, as well as the Unabomber (his brother recognized his style; it wasn't done by statistical/computational means).

The current state of the art is summarized here: http://www.stat.rutgers.edu/~madigan/AUTHORID/bibliography.h... If you're going to do any work on this, you should read as many of those papers as you can. Or else you'll invent something fiendishly clever only to find that some academic already wrote about it 20 years ago and showed why it doesn't work.

Now, that list stops at 2005, but I'm assuming there haven't been earth-shattering changes since then. I'm familiar with the results from those papers; the curious thing is that they stop at corpuses of a couple hundred authors or so -- i.e., identifying one anonymous poster out of, say, 200, rather than a million. This is probably because they had different applications in mind, such as identification within a company, instead of Internet-scale de-anonymization. Note that the amount of information you need is always logarithmic in the potential number of authors, and so if you can do 200 authors you can almost definitely push it to a few tens of thousands of authors.
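To put numbers on the logarithmic claim, here is a small sketch. The figure of 0.5 bits per marker is a made-up example value, and treating markers as independent is an idealization.

```python
import math

def bits_to_identify(num_authors):
    """Bits of information needed to single out one author among num_authors."""
    return math.log2(num_authors)

def markers_needed(num_authors, bits_per_marker):
    """Rough count of independent markers needed, each yielding bits_per_marker bits."""
    return math.ceil(bits_to_identify(num_authors) / bits_per_marker)

for n in (200, 20_000, 1_000_000):
    print(n, round(bits_to_identify(n), 1), markers_needed(n, 0.5))
```

Going from 200 to 20,000 candidates raises the requirement from about 7.6 bits to about 14.3 bits, i.e. less than double, even though the candidate pool is 100 times larger.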

The other interesting thing is that the papers are fixated with 'topic-free' identification, where the texts aren't about a particular topic, making the problem harder. The good news is that when you're doing this Internet-scale, nobody is stopping you from using topic information, making it a lot easier.

So my educated guess is that Internet-scale writing style de-anonymization is possible. However, you'd need fairly long texts, perhaps a page or two. It's doubtful that anything can be done with a single average-length email.

Another potential de-anonymization strategy is to use typing pattern fingerprinting. The timing between our keystrokes fingerprints each of us (yes, this works even for non-touch typists). This is already used in commercial products as an additional factor in password authentication. However, the implications for de-anonymization have not been explored, and I think it's very, very feasible. For example, if Google were to insert JavaScript into Gmail to fingerprint you when you were logged in, they could use the same JavaScript to identify you on any web page where you type in text even if you don't identify yourself. Now think about the de-anonymization possibilities you can get by combining analysis of writing style and keystroke dynamics...
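As a rough illustration of what a keystroke-timing fingerprint might look like, the sketch below averages the delay between consecutive key pairs and compares two such profiles. The event format, function names, and distance measure are all assumptions made for this example; commercial products presumably use richer features.

```python
from collections import defaultdict
from statistics import mean

def digraph_profile(events):
    """events: list of (key, press_time_ms) in typing order.

    Returns the mean delay between each consecutive pair of keys (a "digraph").
    """
    delays = defaultdict(list)
    for (k1, t1), (k2, t2) in zip(events, events[1:]):
        delays[(k1, k2)].append(t2 - t1)
    return {pair: mean(ds) for pair, ds in delays.items()}

def profile_distance(p, q):
    """Mean absolute timing difference (ms) over digraphs seen in both profiles."""
    shared = p.keys() & q.keys()
    if not shared:
        return None
    return sum(abs(p[d] - q[d]) for d in shared) / len(shared)
```

A profile collected while someone is logged in could then be compared against timings from an unidentified session, with a small distance suggesting the same typist.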

By the way, make no mistake: the malicious uses of this far overwhelm the benevolent uses. Once this technology becomes available, it will be very hard to post anonymously at all. Think of the consequences for political dissent or whistleblowers. The Great Firewall of China could simply insert a piece of JavaScript into every web page, and poof, there goes the anonymity of everyone in China.

As for who's doing this? Google would be the least likely candidate, IMO. The PR consequences of such experiments, were it ever to come out, blow up in proportion to the size of the company. Not a good idea. On the other hand, I do know a guy who was trying to start a 1-person company based on similar ideas when I last heard from him. Well, de-anonymization of web sessions, although writing style was not involved. That's the closest I can think of.

I am myself very, very interested in looking into this. My main interest is to write a paper and possibly build tools to take a chunk of writing and try to remove your fingerprint from it, i.e, protect anonymity, but if in the process of collaboration someone else were to build a de-anonymization tool, I have no problem with that. I've built (if I may say so) some of the current best de-anonymization tools/techniques (check my website), so if you're interested feel free to drop me a line.

"possibly build tools to take a chunk of writing and try to remove your fingerprint from it, i.e, protect anonymity,"

The standard procedure to do this is to chain translations to other languages and back. The message remains, but the wording will pick up some noise.

Maybe someday it will be illegal to copy and paste to avoid being authenticated.

Hmm... maybe this is why iPhones don't allow copy-paste! ;)

I guess it would be possible to fool any such system if you know your message might be subjected to this type of analysis?

I think it's likely, but it will need a lot of work to build software to hide your traces. Note the caveats, however: most ordinary people will not have the foreknowledge to find and use such a tool. Second, think of all the compromising posts -- rants about employers, accounts from cheating spouses, political dissent, etc. -- that have already been written. The day will come when some kid will download a script, let a crawler loose on the web, and post the de-anonymized results for all to see. There will be interesting consequences.

Unless you're comparing novels' worth of text, the accuracy is going to be poor for most situations.

PS: Changed "extremely low" to poor just in case.

I'm suddenly worried...

Here's a question for you, and I don't know the answer:

It seems true that everyone has a "writing fingerprint" but is it not the case that it would change over time? For example, is it not likely that my writing style at present is more similar to that of another person on the Internet than it is to my own style 7 years ago? Moving targets would complicate Internet-scale de-anonymization considerably, because you'd have to take "ancient" Usenet posts with a grain of salt. This would be even harder if people learned to mask their writing style, perhaps using "translators" designed to thwart the de-anon technologies.

Note that the amount of information you need is always logarithmic in the potential number of authors, and so if you can do 200 authors you can almost definitely push it to a few tens of thousands of authors.

This makes sense from an information-theoretic, big-picture standpoint, but I'm guessing that the amount of actual data (measured in text, not information content) per candidate required to identify an author among N candidates is somewhere between O(log N) and O(N)-- my bet would be O(N^k) where 0 < k < 1. The reason is that, while your "since/because" example might be a great fingerprinting measure, there are only a small number of great (high information content) measures like this, then there are some few higher-hanging fruit, then a lot of scraps. It seems that any attempt to identify an unknown person among 1M+ candidates is going to require using the scraps, and it's not clear how much information content there is in them.

I know for myself that my writing style changes in response to what (and how much) I am reading for the given period (which lasts on the order of many months to a few years).

All of the methods require a decent amount of text. So unless someone is writing hundreds of words in each sample, they're unlikely to be distinguishable from the crowd. That, to me, rules out typical comments on threads. Lots of little samples seem to be just too noisy with too much overlap among authors to allow for unique information-based fingerprints.

However, if the data is sufficiently large - in number and length (say, lots of essay-type blog posts) I'd expect some classifier-based machine learning techniques could match authors. That is, take a sample of 100 bloggers, split their data in half, train the classifiers on one half then test to match up the other half of the data. Under those conditions you could probably get 90-95% accuracy.
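The split-half experiment could be set up along these lines. The classifier here is a toy stand-in (character-trigram overlap against a per-author profile), chosen only so the sketch is self-contained; it is an assumption, not a recommendation, and says nothing about whether 90-95% is achievable.

```python
import re
from collections import Counter

def char_trigrams(text):
    """Count overlapping three-character sequences after normalizing whitespace."""
    text = re.sub(r"\s+", " ", text.lower())
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def overlap(a, b):
    """Shared trigram mass between two count vectors, normalized by the smaller one."""
    shared = sum((a & b).values())
    return shared / max(1, min(sum(a.values()), sum(b.values())))

def split_half_accuracy(posts_by_author):
    """posts_by_author: {author: [post, ...]}. Train on the first half of each
    author's posts, then try to attribute every post in the second half."""
    profiles, held_out = {}, []
    for author, posts in posts_by_author.items():
        half = len(posts) // 2
        profiles[author] = sum((char_trigrams(p) for p in posts[:half]), Counter())
        held_out += [(author, p) for p in posts[half:]]
    correct = 0
    for author, post in held_out:
        grams = char_trigrams(post)
        guess = max(profiles, key=lambda a: overlap(profiles[a], grams))
        correct += guess == author
    return correct / len(held_out) if held_out else 0.0
```

Shrinking the training half, or truncating each post to its first N words, and re-running `split_half_accuracy` would be one way to probe how small the training set can get.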

The question, I think, is how small you could push the training set in terms of the fewest words and the fewest posts.

I doubt this is possible with current technology - so many people share the same or similar writing style that, even with a skilled human analyzing, you probably can't do much better than attribute it to a large bucket of perhaps thousands or tens of thousands of individuals.

Even if you couldn't narrow it down to an exact person, it seems like you could get good accuracy on socioeconomic status, cultural background, political slant, etc. Once you had that data it might be valuable from a marketing standpoint. It might not be useful for sites where there is a very specific user group, but for sites with incidental user generated content--even if it isn't anonymous per se, you could get a more accurate picture of your audience.

Doubtful. It's not something you can automate to be reliable, and w/o reliability what's the point?

I always wanted to run that experiment on the Hacker News data...

I think that we all sound alike.... at least compared to the rest of the web.

Actually, I think if someone really tried to do what the original question asked, then it would end up with under 100 different "people" who all seemed to cluster together. All of HN would be something like 2 or 3 different clusters.

Search for "text fingerprinting" - there's a lot of info.


I'm assuming the question was

    It'd be neat if there was a web site that you could submit some writing 
    samples (emails or whatever) and then see everything else that person has 
    posted online (regardless of whether it was anonymous or not).
to which I agree with tptacek's answer.

In conjunction with ISPs, the (US) government or a large private company like Google or Microsoft could aggregate posts based on IP address. In fact, according to a report by Frontline ("Spying On The Homefront"), Homeland Security started implementing such a system for monitoring all phone and Internet communications after the passage of the Patriot Act. I was going to say that it would probably be political suicide for a private company to participate but I just remembered Homeland Security actually contracted a private company to do this work for them. I can't remember the name of the company but you can watch the whole show online:


Using writing style seems a little far fetched given the limited scope of variations in language relative to the large number of people using language. Otherwise it would not have taken so long for effective speech recognition software to appear - even the current incarnations are more fragile than one would expect relative to how long the best minds have been looking at the problem and relative to the high economic value of an effective solution.

I want my privacy :(
