
Ask HN: Can Google aggregate everything you've ever posted anonymously online based on writing style? - staunch
I'm just asking about Google because they're in such a good position to do it.

Is any private company doing it? It'd be neat if there was a web site that you could submit some writing samples (emails or whatever) and then see everything else that person has posted online (regardless of whether it was anonymous or not).

I'm sure there's no way to do it with total accuracy, but with enough input shouldn't it be possible to be _highly_ accurate?

Anyone know of any software that can take a large number of writing samples and determine who wrote which ones?

If not, how would you go about creating it?
======
patio11
I did really simple analysis in an AI class with several thousand posts of a
forum I post frequently at, tagged "me" and "not me" (I figured that way it
would be minimally invasive to people not participating -- plus, hey, easy
naive Bayesian). Got fairly decent results with pretty trivial sample inputs.
Numbers elude me, it has been years.

Everyone thinks "Aha, you have some catchphrases" (I do) or "Aha, you were one
of the only Republicans and thus someone saying 'death tax' was more likely
you" (true) or "You cited nationalreview.com more than the rest of the forum
together" (true), but it turns out the distribution of really stupid stuff
(stopwords, essentially) works better.
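
A minimal sketch of that kind of "me" vs. "not me" experiment, assuming a naive Bayes classifier that looks only at stopword counts. The stopword list, function names, and add-one smoothing are illustrative assumptions, not the code from the class project:

```python
import math
import re
from collections import Counter

# Illustrative stopword list (an assumption); a real run would use a longer one.
STOPWORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "but", "so"]

def stopword_counts(text):
    """Count only the stopwords in a post; every other word is ignored."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w in STOPWORDS)

def train(posts):
    """posts: list of (text, label). Returns {label: (log_prior, log_likelihoods)}."""
    label_totals = Counter()
    word_totals = {}
    for text, label in posts:
        label_totals[label] += 1
        word_totals.setdefault(label, Counter()).update(stopword_counts(text))
    n_posts = sum(label_totals.values())
    model = {}
    for label, counts in word_totals.items():
        total = sum(counts.values())
        # Add-one smoothing so an unseen stopword never gets zero probability.
        loglik = {
            w: math.log((counts[w] + 1) / (total + len(STOPWORDS)))
            for w in STOPWORDS
        }
        model[label] = (math.log(label_totals[label] / n_posts), loglik)
    return model

def classify(model, text):
    """Return the label with the highest posterior log-score for this text."""
    counts = stopword_counts(text)
    scores = {
        label: prior + sum(n * loglik[w] for w, n in counts.items())
        for label, (prior, loglik) in model.items()
    }
    return max(scores, key=scores.get)
```

Topic words, catchphrases, and cited domains are deliberately thrown away here, so whatever signal the classifier finds comes from the stopword distribution alone.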

This is ironically the same thing they've discovered for making female/male
authorship decisions, although I never went the next step and said "So what
relationship does my distribution have with the average guy distribution?"

Incidentally, here's the reason you'll never have to worry about this in the
context of "Google the Internet for everything Patrick McKenzie has ever
written": imagine I have a 99.9% effective filter for you, and I dragnet an
Internet filled with 5 billion documents of which you've written 1,000. I then
identify 5 million documents as written by you... but you only wrote 1,000 of
them.

This sort of "don't search the haystack unless you're bloody sure it is packed
full of needles" thing is why you never want to test a population not known to
be at risk for the disease, etc. (Or why you retest in the event of a positive
using a different test.)
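
The arithmetic can be checked with a few lines. This sketch assumes "99.9% effective" means a 99.9% true-positive rate and a 0.1% false-positive rate, which is one reading of the example; the function name is made up for illustration:

```python
def dragnet(total_docs, true_docs, true_positive_rate, false_positive_rate):
    """Return (flagged, precision) for a filter run over an entire corpus."""
    other_docs = total_docs - true_docs
    true_hits = true_docs * true_positive_rate      # your documents, correctly flagged
    false_hits = other_docs * false_positive_rate   # everyone else's, wrongly flagged
    flagged = true_hits + false_hits
    return flagged, true_hits / flagged

flagged, precision = dragnet(5_000_000_000, 1_000, 0.999, 0.001)
print(f"{flagged:,.0f} documents flagged, {precision:.4%} of them really yours")
```

Under those assumptions roughly 5 million documents get flagged and only about 0.02% of them are true matches, which is the base-rate problem described above.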

~~~
time_management
I down-voted you for the use of the term "death tax". The correct term for it
is the "trust fund baby tax". This is not merely a matter of rhetoric, but of
accuracy. The dead person isn't being taxed. Being deceased and presumably
having no use for wealth past that point, he or she doesn't care. The living
heirs are being taxed.

To quote _Salute Your Shorts_, get it right or pay the price.

It cost you 2 karma points, because I'd otherwise have voted you up for an
otherwise high-quality post.

~~~
jcl
That was an odd thing to do... If he had used your "correct" term in his post,
it would not have been a good indicator of Republican behavior and would
therefore ruin the example. His post uses the term in a description of other
posts -- just as your post does.

~~~
time_management
You make a good point. I just have a knee-jerk negative response to phrases
like "death tax" when not used with bitter irony. (I have Republican
relatives.) I probably deserve my -5 for being obnoxious.

Rash, knee-jerk reactions are often a social liability, which reminds me of a
funny bar anecdote:

Guy: So, you're waiting for someone?

Girl: Yeah, some college friends. They're an hour late. Oh _my... Boy, friend_s
can be really rude sometimes.

Guy: Tell me about your sister. How old is she?

Girl: What?

Guy: Sorry, I heard the words "my boyfriend" and--

(Girl walks away.)

~~~
time_management
Yaaaouch! Seafood soup is NOT on the menu!

------
aristus
You can imitate someone's writing style with a travesty generator but that
doesn't mean everyone has a "writing fingerprint". Very few people have a
distinct enough writing style to weed out false positives, and since it's so
easy to imitate you'll never really know your precision.

I have a few writing tics (parens, the '--', certain words like 'certain')
but it's much easier to just search on "aristus" to start exposing my shame.

I played with this a few years ago with a project called unmaskr. Heuristics
can help precision a lot but do not help with recall. People generally:

      * use similar usernames, or a "constellation" of usernames
      * have semi-regular posting times
      * post in one place at one time
      * use similar place names, nicknames for things
      * write fluently in one language
      * link to a "constellation" of domains

------
jbester
It is no doubt possible. I've done something similar to this on a single forum
3 or so years ago and had reasonably interesting results.

Basically, I scraped the site, removed formatting/spacing/dead-words, stemmed
(using a modified Porter stemmer), and constructed a matrix of word
frequencies per post. After that I applied several analytical techniques
(statistical, geometric, etc.). Using a blended method, I was able to identify
most aliases of board posters. However, the issue Google would no doubt have
is sample size. In a small enough sample, quirks work as identifying
characteristics; in a larger dataset you would no doubt see clusters of people
who have similar backgrounds (e.g. education).
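
A rough sketch of that pipeline, under several assumptions: stemming is skipped (the standard library has no Porter stemmer), the stopword list is a placeholder, and cosine similarity stands in for the unspecified "geometric" technique. The function names and the 0.8 threshold are hypothetical:

```python
import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "is", "in", "it"}  # assumed list

def word_vector(post):
    """Lowercase, strip punctuation, drop stopwords, count what remains."""
    words = re.findall(r"[a-z']+", post.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0

def likely_aliases(posts_by_user, threshold=0.8):
    """Return pairs of usernames whose pooled word vectors are very close."""
    vectors = {
        user: sum((word_vector(p) for p in posts), Counter())
        for user, posts in posts_by_user.items()
    }
    users = sorted(vectors)
    return [
        (a, b)
        for i, a in enumerate(users)
        for b in users[i + 1:]
        if cosine(vectors[a], vectors[b]) >= threshold
    ]
```

On a single small forum a fixed threshold like this may be workable; on a much larger corpus it would likely start pairing up people who merely share a background, which is the sample-size concern raised above.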

------
midnightmonster
When I was packing up my college dorm room at the end of senior year, I found
a typed short story and started reading. It was the most peculiar sensation,
because it read like I had written it, but I didn't remember writing it. Then
I realized (i.e., noticed the heading and saw) that it was my brother's story.
Three years apart, we'd picked the same kind of IB senior English project and
attempted to write in the style of the same author, and our writing (though in
very different stories) was very, very similar.

------
michael_nielsen
Donald Foster (<http://en.wikipedia.org/wiki/Donald_Foster_(professor)>) is
sometimes known as a "forensic linguist" for his use of computers to analyse
texts. Famously, he figured out the (formerly anonymous) author of the
bestselling novel "Primary Colors" using computer analysis. If you're
interested in this kind of thing, I highly recommend starting with the
Wikipedia article, which describes a lot of Foster's work.

------
randomwalker
I believe I can offer some help because this is in the rough area of my Ph.D.
thesis (see <http://33bits.org/> for more on that).

Everyone does have a writing fingerprint, contrary to what another person
claimed. However, it is an open question whether it can be efficiently
extracted.

The basic idea for constructing a fingerprint is this. Consider two words that
are nearly interchangeable, say 'since' and 'because'. Different people use
the two words in a differing proportion. By comparing the relative frequency
of the two words, you get a little bit of information about a person,
typically under 1 bit. But by putting together enough of these 'markers', you
can construct a profile.
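
As a toy illustration of the marker idea, assuming a small hand-picked list of word pairs (only 'since'/'because' comes from the text above; the other pairs, the smoothing, and the function names are made up for the example):

```python
import math
import re
from collections import Counter

# Hypothetical marker pairs; only 'since'/'because' is taken from the discussion.
MARKER_PAIRS = [("since", "because"), ("while", "whilst"), ("though", "although")]

def word_counts(text):
    return Counter(re.findall(r"[a-z]+", text.lower()))

def profile(text):
    """For each pair, the smoothed probability that the author picks the first word."""
    counts = word_counts(text)
    return [(counts[a] + 1) / (counts[a] + counts[b] + 2) for a, b in MARKER_PAIRS]

def log_likelihood(author_profile, text):
    """Log-probability of the marker choices in `text` under an author's profile."""
    counts = word_counts(text)
    total = 0.0
    for (a, b), p in zip(MARKER_PAIRS, author_profile):
        total += counts[a] * math.log(p) + counts[b] * math.log(1 - p)
    return total

def best_match(known_texts, anonymous_text):
    """known_texts: {author: text}. Return the author whose profile fits best."""
    profiles = {author: profile(text) for author, text in known_texts.items()}
    return max(profiles, key=lambda a: log_likelihood(profiles[a], anonymous_text))
```

Each pair contributes only a small nudge to the score, so a usable profile would need many more markers and far more text than this toy version.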

The beginning of modern, rigorous research in this field was by Mosteller and
Wallace in 1964: they identified the author of the disputed Federalist papers,
almost 200 years after they were written (note that there were only three
possible candidates!). They got on the cover of TIME, apparently. Other
"coups" for writing-style de-anonymization are the identification of the
author of Primary Colors, as well as the Unabomber (his brother recognized his
style; it wasn't done by statistical/computational means).

The current state of the art is summarized here:
<http://www.stat.rutgers.edu/~madigan/AUTHORID/bibliography.html>
If you're going to do any work on this, you should read as many of those
papers as you can. Or else you'll invent something fiendishly clever only to
find that some academic already wrote about it 20 years ago and showed why it
doesn't work.

Now, that list stops at 2005, but I'm assuming there haven't been
earth-shattering changes since then. I'm familiar with the results from those
papers; the curious thing is that they stop at corpora of a couple hundred
authors or so -- i.e., identifying one anonymous poster out of, say, 200
rather than a million. This is probably because they had different
applications in mind, such as identification within a company, instead of
Internet-scale de-anonymization. Note that the amount of information you need
is always logarithmic in the potential number of authors, and so if you can do
200 authors you can almost definitely push it to a few tens of thousands of
authors.
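
To make the logarithmic claim concrete, here is a back-of-the-envelope calculation. It assumes markers are independent and each yields a fixed fraction of a bit; the 0.5 figure is an arbitrary illustration, and real markers are neither independent nor uniform:

```python
import math

def bits_needed(candidate_authors):
    """Bits of information required to single out one author among N."""
    return math.log2(candidate_authors)

def markers_needed(candidate_authors, bits_per_marker=0.5):
    """Markers required if each independent marker yields `bits_per_marker` bits."""
    return math.ceil(bits_needed(candidate_authors) / bits_per_marker)

for n in (200, 20_000, 1_000_000):
    print(n, round(bits_needed(n), 1), markers_needed(n))
```

Going from 200 to 20,000 candidates raises the requirement from about 7.6 bits to about 14.3 bits, i.e. less than double, and a million candidates needs about 19.9 bits.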

The other interesting thing is that the papers are fixated on 'topic-free'
identification, where the texts aren't about a particular topic, making the
problem harder. The good news is that when you're doing this Internet-scale,
nobody is stopping you from using topic information, making it a lot easier.

So my educated guess is that Internet-scale writing style de-anonymization is
possible. However, you'd need fairly long texts, perhaps a page or two. It's
doubtful that anything can be done with a single average-length email.

Another potential de-anonymization strategy is to use typing pattern
fingerprinting. The timing between our keystrokes fingerprints each of us
(yes, this works even for non-touch typists). This is already used in
commercial products as an additional factor in password authentication.
However, the implications for de-anonymization have not been explored, and I
think it's very, very feasible. For example, if Google were to insert
JavaScript into Gmail to fingerprint you when you were logged in, they could
use the same JavaScript to identify you on any web page where you type in text
even if you don't identify yourself. Now think about the de-anonymization
possibilities you can get by combining analysis of writing style and keystroke
dynamics...
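
A very simplified sketch of what a keystroke-timing profile could look like, assuming the input is a list of (key, timestamp in milliseconds) events. The mean digraph latency feature and the distance measure are illustrative choices, not a description of any commercial product:

```python
from statistics import mean

def digraph_latencies(events):
    """events: list of (key, timestamp_ms) in typing order.
    Returns {(key1, key2): mean delay in ms between the two keystrokes}."""
    gaps = {}
    for (k1, t1), (k2, t2) in zip(events, events[1:]):
        gaps.setdefault((k1, k2), []).append(t2 - t1)
    return {pair: mean(delays) for pair, delays in gaps.items()}

def distance(profile_a, profile_b):
    """Mean absolute latency difference over the digraphs both profiles contain."""
    shared = profile_a.keys() & profile_b.keys()
    if not shared:
        return float("inf")  # nothing to compare
    return mean(abs(profile_a[p] - profile_b[p]) for p in shared)
```

A smaller distance between a stored profile and a fresh typing sample would suggest the same typist; a real system would need far more samples and some handling of pauses and corrections.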

By the way, make no mistake: the malicious uses of this far overwhelm the
benevolent uses. Once this technology becomes available, it will be very hard
to post anonymously at all. Think of the consequences for political dissent or
whistleblowers. The Great Firewall of China could simply insert a piece of
JavaScript into every web page, and poof, there goes the anonymity of everyone
in China.

As for who's doing this? Google would be the least likely candidate, IMO. The
PR consequences of such experiments, were they ever to come out, blow up in
proportion to the size of the company. Not a good idea. On the other hand, I
do know a guy who was trying to start a one-person company based on similar
ideas when I last heard from him. Well, de-anonymization of web sessions,
although writing style was not involved. That's the closest I can think of.

I am myself very, very interested in looking into this. My main interest is to
write a paper and possibly build tools to take a chunk of writing and try to
remove your fingerprint from it, i.e., protect anonymity, but if in the process
of collaboration someone else were to build a de-anonymization tool, I have no
problem with that. I've built (if I can say so) some of the current best
de-anonymization tools/techniques (check my website), so if you're interested
feel free to drop me a line.

~~~
eyeraw
Maybe someday it will be illegal to copy and paste to avoid being
authenticated.

~~~
Anon84
Hmm... maybe this is why iPhones don't allow copy and paste! ;)

------
robg
All of the methods require a decent amount of text. So unless someone is
writing hundreds of words in each sample, they're unlikely to be
distinguishable from the crowd. That, to me, rules out typical comments on
threads. Lots of little samples seem to be just too noisy with too much
overlap among authors to allow for unique information-based fingerprints.

However, if the data is sufficiently large - in number and length (say, lots
of essay-type blog posts) - I'd expect some classifier-based machine learning
techniques could match authors. That is, take a sample of 100 bloggers, split
their data in half, train the classifiers on one half then test to match up
the other half of the data. Under those conditions you could probably get
90-95% accuracy.
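
The split-half protocol itself is easy to sketch. The classifier below (nearest pooled word-count vector by cosine similarity) is just a stand-in assumption to make the example runnable, and the even/odd split is one arbitrary way to halve the data; it says nothing about what accuracy real blog data would give:

```python
import math
import re
from collections import Counter

def features(text):
    """Bag-of-words counts for one post."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0

def split_half_accuracy(posts_by_author):
    """Train on even-indexed posts, test on odd-indexed posts, return accuracy."""
    trained = {
        author: sum((features(p) for p in posts[0::2]), Counter())
        for author, posts in posts_by_author.items()
    }
    correct = total = 0
    for author, posts in posts_by_author.items():
        for post in posts[1::2]:
            held_out = features(post)
            guess = max(trained, key=lambda a: cosine(trained[a], held_out))
            correct += guess == author
            total += 1
    return correct / total if total else 0.0
```

Shrinking the training half (fewer posts, or truncated posts) and re-running this would be one way to probe how little text is enough.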

The question, I think, is how small you could push the training set in terms
of the fewest words and the fewest posts.

------
potatolicious
I doubt this is possible with current technology - so many people share the
same or similar writing style that, even with a skilled human analyzing, you
probably can't do much better than attribute it to a large bucket of perhaps
thousands or tens of thousands of individuals.

~~~
jakecarpenter
Even if you couldn't narrow it down to an exact person, it seems like you
could get good accuracy on socioeconomic status, cultural background,
political slant, etc. Once you had that data it might be valuable from a
marketing standpoint. It might not be useful for sites where there is a very
specific user group, but for sites with incidental user generated content--
even if it isn't anonymous per se, you could get a more accurate picture of
your audience.

------
vaksel
Doubtful, it's not something you can automate to be reliable, and w/o
reliability what's the point?

------
Tichy
I always wanted to run that experiment on the Hacker News data...

~~~
brfox
I think that we all sound alike.... at least compared to the rest of the web.

Actually, I think if someone really tried to do what the original question
asked, then it would end up with under 100 different "people" who all seemed
to cluster together. All of HN would be something like 2 or 3 different
clusters.

------
eyeraw
Search for "text fingerprinting" - there's a lot of info.

------
tptacek
No.

~~~
icey
I'm assuming the question was

    
    
        It'd be neat if there was a web site that you could submit some writing 
        samples (emails or whatever) and then see everything else that person has 
        posted online (regardless of whether it was anonymous or not).
    

to which I agree with tptacek's answer.

------
villageidiot
In conjunction with ISPs, the (US) government or a large private company like
Google or Microsoft could aggregate posts based on IP address. In fact,
according to a report by Frontline ("_Spying On The Homefront_"), Homeland
Security started implementing such a system for monitoring all phone and
Internet communications after the passage of the Patriot Act. I was going to
say that it would probably be political suicide for a private company to
participate but I just remembered Homeland Security actually contracted a
private company to do this work for them. I can't remember the name of the
company but you can watch the whole show online:

<http://www.pbs.org/wgbh/pages/frontline/homefront/>

Using writing style seems a little far-fetched given the limited scope of
variations in language relative to the large number of people using language.
Otherwise it would not have taken so long for effective speech recognition
software to appear - even the current incarnations are more fragile than one
would expect relative to how long the best minds have been looking at the
problem and relative to the high economic value of an effective solution.

------
qqq
I want my privacy :(

