

Is Writing Style Sufficient to Deanonymize Material Posted Online? - chl
http://33bits.org/2012/02/20/is-writing-style-sufficient-to-deanonymize-material-posted-online/
I wish many more papers came with expository blog posts like that ...
======
randomwalker
Lead author here. Since my serious thinking on this topic started when I
responded to this Ask HN post[1] π years ago[2], it's nice to see this posted
here, to come full circle in a sense. Happy to answer any questions.

[1] <http://news.ycombinator.com/item?id=413730>

[2] No, really, it's been exactly π years to the day :-)

~~~
dstorrs
I'm impressed by the skill that goes into this but it doesn't seem like an
even-handed technology -- it empowers governments, major corporations, and
other large organizations more than it does private individuals.

As a specific example, people writing political blogs in China could be
seriously harmed by this technique even at the levels that it's at now.

I applaud you for including the link to "manually changing your writing style
will defeat these attacks" but that's a link to an academic paper. Could you
please also write some good, layperson-oriented docs on "how to beat this"?
For that matter, I'll do the writing grunt work if you'll provide the
expertise. If you're interested, use the GMail address in my profile.

~~~
randomwalker
That's a good question. First, I believe that intelligence agencies are
already well aware of the potential of technology like this, and at least
some, like the NSA, could very well be ahead of public research. Second,
research such as ours is intended to demonstrate a proof of concept, and it
takes a lot of work to turn it into a reliable tool — for example, we restrict
ourselves to English text. For those two reasons, I think our work does little
to directly help governments and other oppressive entities. On the other hand,
publicly available research is effective (we hope) in raising awareness of the
threat, so on balance it does more good than harm to people writing political
blogs.

As for practical tips to defeat stylometry and such, organizations like the
EFF specialize in doing that, so I will leave that to them. Comparative
advantage, etc. If you would like to help, you are more than welcome.

~~~
bediger
 _could very well be ahead of public research_ - does that mean you've had
meetings with groups of 3 federal employees, one of whom does nothing but
ensure the other 2 don't say too much? You know, like the feds that visited
IBM to make DES more resistant to differential cryptanalysis
([http://en.wikipedia.org/wiki/Data_Encryption_Standard#NSA.27...](http://en.wikipedia.org/wiki/Data_Encryption_Standard#NSA.27s_involvement_in_the_design))?

------
simonsarris
Reading this, it came to mind (and is perhaps worth mentioning) that this is
how the Unabomber (Ted Kaczynski), the Luddite who engaged in a mail bombing
campaign spanning nearly 20 years, was caught.

 _Before the publication of the manifesto, Theodore Kaczynski's brother, David
Kaczynski, was encouraged by his wife Linda to follow up on suspicions that
Ted was the Unabomber. David Kaczynski was at first dismissive, but
progressively began to take the likelihood more seriously after reading the
manifesto a week after it was published in September 1995. David Kaczynski
browsed through old family papers and found letters dating back to the 1970s
written by Ted and sent to newspapers protesting the abuses of technology and
which contained phrasing similar to what was found in the Unabomber Manifesto._

<http://en.wikipedia.org/wiki/Ted_Kaczynski#Search>

------
DarkShikari
While impressive, I don't think these results are actually that bad for
privacy. 80% precision, for example, is useless when you're matching against
tens of millions. It's much the same fallacy as the medical test for a disease
that occurs in 1 out of 1000 people and is 99% accurate -- that still means
roughly 90% of positive results are false positives.
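The arithmetic behind that base-rate point is easy to check. A minimal sketch, taking the 99% figure as both sensitivity and specificity:

```python
# Base-rate sketch: a 99%-accurate test applied to a condition
# with 1-in-1000 prevalence.
prevalence = 0.001
sensitivity = 0.99   # P(positive | condition)
specificity = 0.99   # P(negative | no condition)

true_pos = sensitivity * prevalence
false_pos = (1 - specificity) * (1 - prevalence)

# Bayes: probability a positive result is a true positive
p_real = true_pos / (true_pos + false_pos)
print(f"{p_real:.1%} of positives are real")
print(f"{1 - p_real:.1%} of positives are false alarms")
```

With these numbers, only about 9% of positives are genuine, so roughly 91% are false alarms.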

It reminds me of claims of being able to identify, for example, the gender of
an author with ~65% accuracy -- which is actually quite unimpressive, as it's
hardly better than guessing, and certainly not something you could rely on for
any serious purpose.

The author mentions that topic is one way to help correlate beyond the results
of the algorithm. But if I wrote "anonymous" posts in my area of expertise,
you certainly would not need stylistic analysis to guess what my identity
might be! There has never been privacy in this regard, I don't think.

Where privacy is needed most, I think, is exactly where this deanonymizing
tool still isn't sufficient: talking about _unrelated topics_. A person should
be free to express themselves under multiple names for different purposes, and
there is no reason why an employer needs to know about a programmer's side
hobby as a fiction writer if s/he doesn't want them to.

Finally, I do wonder how well these results correlate to the case where
someone is _intentionally_ operating under a different name. Matching one post
by tech blogger A against blogger A is easy, because tech blogger A is making
no attempt to write any differently or in any different context. However, what
if tech-writer A ghost-wrote YA fiction on the side? Could you use these
techniques to detect that the fiction was written by that blogger? It can't be
ruled out without trying, but generalizing these results to that seems
questionable.

------
_delirium
The _difficulty_ of doing it cross-context is actually slightly more
surprising to me than the possibility. I would've guessed that, once a
suitable data set was found (a main impediment to previous studies), accuracy
would be quite good, along the lines of how easily browsers can be
fingerprinted from a few dozen telltale markers. But it appears that only
about 10% of authors can be identified at a precision of 80%, which still
leaves pretty decent odds of not being identified automatically, at least for
now, even without actively trying to cover up (though the linked post is
right that against a specific target, intelligently adding some ad-hoc
additional features can probably help).

One thing that'd be interesting to me is whether certain characteristics make
it particularly easy to identify people cross-context -- a
top-10-telltale-markers sort of thing. Are a disproportionate number of the
10% who can be identified with high precision picked out by a handful of
unusual grammatical or lexical features, or is it more of a diffuse sort of
thing?

------
ludflu
Funny, I just started reading about adversarial stylometry the other day.
<https://www.cs.drexel.edu/~mb553/stuff/Indiana_20110407.pdf>

------
gtani
That's a very interesting paper (and very accessible to anyone with a
stats/data mining background). I went back and read Jason Baldridge's intro,
which is excellent:

[http://ata-s12.utcompling.com/schedule/ATA-
Authorship%20Attr...](http://ata-s12.utcompling.com/schedule/ATA-
Authorship%20Attribution.pdf?attredirects=0)

It seems you didn't attempt to fingerprint misspellings, among the variables
on pdf p 5. Also, I'm curious why you needed to bring the dataset up to
exactly 100k with the extra 5.7k.

------
lignuist
Location can (sometimes) also be detected from writing style:

[http://www.cmu.edu/news/archive/2011/January/jan7_twitterdia...](http://www.cmu.edu/news/archive/2011/January/jan7_twitterdialects.shtml)

------
tensafefrogs
The privacy implications are a bit worrisome. Perhaps it's time to write
utilities to anonymize your writing style.

Maybe running your text through a round-trip translator could help? (Although
you'd then need to fix any errors it introduces.)

~~~
gojomo
Much better than that: if the research/software that identifies authors is
published, and some reasonable approximation of the public training set that
deanonymizers would use is available, then anyone can check their writing
against the tool before publishing it.

If your writing is too identifying, just perturb the text until the tool fails
to identify the author. Or even better: perturb the writing until the
deanonymizer fingers someone else, in a usefully confounding way.

The deanonymizer's feature-extraction/analysis could itself help drive the
perturbation routines. "Make my word choice more like Paul Graham", you could
say. And even if there are limits to its automatic substitutions, it could
offer coaching: "To make your writing more Graham-like, decrease your average
sentence length and use fewer interjections."

Edit, resubmit, repeat until the right author is fingered.
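That edit-resubmit loop could be sketched like this. A toy illustration only: `identify` and `perturb` are made-up stand-ins for a real published deanonymizer and a real substitution engine, and the "style fingerprints" are fake:

```python
# Toy edit-resubmit loop: keep perturbing the draft until the
# (mock) deanonymizer attributes it to someone else.
AUTHORS = {"alice": 4.2, "bob": 5.8}          # fake style fingerprints
SWAPS = {"utilize": "use", "commence": "start", "terminate": "end"}

def avg_word_len(text):
    words = text.split()
    return sum(map(len, words)) / len(words)

def identify(text):
    # stand-in deanonymizer: nearest author by average word length
    style = avg_word_len(text)
    return min(AUTHORS, key=lambda a: abs(AUTHORS[a] - style))

def perturb(text):
    # stand-in substitution engine: one pass of word swaps
    return " ".join(SWAPS.get(w, w) for w in text.split())

draft = "utilize this approach to commence and terminate the experiment"
while identify(draft) != "alice":   # loop until the tool fingers the decoy
    new = perturb(draft)
    if new == draft:
        break                       # out of automatic substitutions
    draft = new
print(identify(draft))              # prints "alice"
```

A real version would of course use the published feature extraction itself to decide which perturbations move the text toward the target author.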

Business idea: website that offers this tuning to help un-deanonymize or faux-
deanonymize writing.

Evil business idea: this website remembers everything submitted, to allow the
super-deep-pocketed to peek in and de-un-deanonymize (or re-deanonymize?)
blocks of text.

~~~
abecedarius
Typically someone wanting anonymity needs to defend against future attacks on
their published works, not just current ones.

------
pasbesoin
I know that I semi-consciously engage in a few spelling anachronisms that
probably serve to isolate me. Actually, since I recognized both them and their
likely effect, I've become somewhat more conscious in applying them -- or in
checking for them while proofreading and deciding whether to leave them in.

~~~
kaarlo_n
This exact issue is addressed in _The Secret Life of Pronouns_ (Pennebaker).

<http://secretlifeofpronouns.com/>

------
Dn_Ab
" _Developing fully automated methods to hide traces of one’s writing style
remains a challenge_ ". How would the following 3 methods fare?

Method1: Run the text through a markov chain constructed maybe from a mixture
of 0.5 your text, 0.25 Shakespeare and 0.25 Alice in wonderland. Do something
like sample every third word with the other two coming as a chain. Then run
that text through wordnet to do synonym based replacement.

Method 2: Do a translation to a nearby language and back again using some
language translating api.

Method 3: Replace less common words with hypernyms and more common words with
synonyms or possibly not + antonyms.

Might want a few heuristics to replace stuff like (, ..., ) , - , : ,[,] with
each other. Also randomize space between punctuation.

Optionally Run the outputs through mechanical turk to iron out the result,
leave as is or clean by self.
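A toy sketch of Method 1's Markov-chain stage (assuming an order-1 word-level chain, with the 0.5/0.25/0.25 mixture approximated by repeating each corpus in proportion to its weight; the WordNet synonym pass is omitted):

```python
import random

def build_chain(text):
    # order-1 word-level Markov chain: word -> list of observed successors
    words = text.split()
    chain = {}
    for a, b in zip(words, words[1:]):
        chain.setdefault(a, []).append(b)
    return chain

def generate(chain, start, length, rng):
    out = [start]
    for _ in range(length - 1):
        nexts = chain.get(out[-1])
        if not nexts:
            break
        out.append(rng.choice(nexts))
    return " ".join(out)

my_text = "the cat sat on the mat and the cat slept"
shakespeare = "to be or not to be that is the question"
alice = "alice was beginning to get very tired of sitting"

# approximate the 0.5 / 0.25 / 0.25 mixture by repetition
corpus = " ".join([my_text, my_text, shakespeare, alice])
chain = build_chain(corpus)
print(generate(chain, "the", 8, random.Random(0)))
```

With real corpora you'd likely want a higher-order chain and smoothing, but the mixture idea carries over directly.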

~~~
warfangle
The first I thought of was your second point.

Of course, this is dependant on translation algorithms that are at least
somewhat inaccurate.

You may want to choose one closeby language, and one further removed and find
the equilibrium phrase. For example, translate
english->german->italian->german->english, repeat until you get the same
english phrase each time.
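The equilibrium search might look like this. A hypothetical sketch: `round_trip` is mocked with a lookup table here, standing in for a real English->German->Italian->German->English translation pipeline:

```python
# Mocked round-trip translations; a real version would call a
# translation API for each leg of the chain.
ROUND_TRIP = {
    "the quick brown fox": "the fast brown fox",
    "the fast brown fox": "the fast brown fox",  # already stable
}

def round_trip(phrase):
    return ROUND_TRIP.get(phrase, phrase)

def equilibrium(phrase, max_iters=10):
    # iterate until the round trip returns the phrase unchanged
    for _ in range(max_iters):
        nxt = round_trip(phrase)
        if nxt == phrase:
            return phrase
        phrase = nxt
    return phrase

print(equilibrium("the quick brown fox"))  # prints "the fast brown fox"
```

There's no guarantee a real translation chain converges rather than cycling, hence the `max_iters` cap.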

------
cbo
Possible privacy implications aside, this is awesome.

I am consistently astounded by how advanced AI techniques are becoming.

------
stretchwithme
Not if you can detect such style and point it out to the writer before he
posts it.

------
toonse
I, can see that. People, always say, that they can determine my writing style,
without, much problem.

