Later, someone will figure out they can profit by gaming this system, writing posts that appear to come from someone else. All they have to do is clone an author's style and feed their attempts to the analysis program to see how close they came. Then they can fine-tune things until it's a strong enough match and push it out there as legitimate.
Yep, we might well end up in a "linguistic similarity analysis" arms race at some point. I came to the same conclusion in a post some months back. http://rachelbythebay.com/w/2012/08/29/info/
> Time-stamping adds a cryptographically-verifiable timestamp to your
> signature, proving when the code was signed. [...] not all publisher
> certificates are enabled to permit timestamping to provide indefinite
> lifetime [...] to free a Certificate Authority from the burden of
> maintaining Revocation lists (CRL, OCSP) in perpetuity.
Detweiler, natively paranoid and geographically isolated in Colorado, had come to the conclusion that most of the prominent cypherpunks posters were sock-puppets of a single person he called "Medusa" — he invented the word "pseudospoofing" to describe the phenomenon, although "sock-puppet" is the more common current term — and resolved to use the same technique in response.
I never read the actual word-frequency analyses, just Detweiler's complaints about having it used against him, which I assume he didn't imagine.
There was a recent paper (two or three years ago?) that claimed that people could usually defeat this kind of linguistic analysis simply by trying to change their writing style. I'm skeptical that this is applicable in general.
http://33bits.org/ is a pretty good blog on the unreasonable difficulty of anonymity in a computerized world.
I find that more than a little surprising. I would think machine translation would detract from the signal and/or normalize the text, rather than boost the differences.
I tried a similar analysis for a class project using Twitter data, and it worked surprisingly well considering the small amount of data in a tweet. Would-be-anonymous posters beware!
Don Foster "did a textual analysis of Primary Colors to identify words, phrases, and expressions that were repeatedly used and to look for 'quirky expressions' and peculiarities of punctuation. Afterwards, he simply analyzed writing samples until he located a consistent usage of the same 'telltale' signs of authorship. Sample after sample was rejected until finally he happened onto the writings of Joe Klein and found what he had been looking for. The literary quirks and features that Foster had isolated in Primary Colors occurred with such frequency in Klein's articles that despite Klein's initial denial, Foster knew he was Anonymous."
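The core of Foster's approach can be sketched in a few lines: build a relative-frequency profile of "tell" words for the anonymous text, then see which candidate author's profile sits closest. A minimal sketch, with made-up marker words and toy samples; real stylometry uses far richer features (function words, punctuation, n-grams):

```python
from collections import Counter

def marker_profile(text, markers):
    """Relative frequency of each marker word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1
    return [counts[m] / total for m in markers]

def distance(p, q):
    """Sum of absolute differences between two frequency profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical data: the "anonymous" text reuses author A's quirks.
markers = ["tarmac", "talky", "wonky"]  # invented tell-words, for illustration
anonymous = "a wonky talky speech on the tarmac , wonky as ever"
author_a = "his wonky prose was talky ; the tarmac scene felt wonky too"
author_b = "a measured address delivered beside the runway , precise and calm"

p = marker_profile(anonymous, markers)
closest = min([("A", author_a), ("B", author_b)],
              key=lambda s: distance(p, marker_profile(s[1], markers)))
print(closest[0])  # -> A
```

Sample after sample gets "rejected" simply because its marker profile is far from the target's; the match pops out when the distances collapse.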
Similar methods have been used, for example, to try to figure out who wrote which portions of the pseudonymously authored "Federalist Papers."
Alternatively, we can analyze our own writing for the elements that differ enough from everyone else's to allow identification, and alter just those.
Of course, modifying word choice seems trivial in comparison to analyzing how a person lays out an entire paper or communicates the steps of their thought process.
Peter Wayner came up with a technique called "Mimic Functions" that can (at least in principle) transform a file so that it takes on the statistical properties of a different file.
Unfortunately, it's easier said than done. One problem is that the sender doesn't know exactly which statistical tests would be used by the attacker, so while mimic functions could be devised to emulate certain statistical properties of a given file, it may be impossible to mimic all the statistical properties of any but the most trivial file -- especially if you want the result to make sense to a human reader. It might fool a machine, though.
 - https://en.wikipedia.org/wiki/Mimic_function
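To make the limitation concrete, here is the crudest possible "mimic": emit words sampled with the target text's unigram frequencies. This is not Wayner's construction (his mimic functions use Huffman-coded grammars and can carry a hidden payload); it only shows that matching one chosen statistic is easy while the output remains obvious nonsense to a human reader:

```python
import random
from collections import Counter

def mimic_unigrams(target_text, n_words, seed=0):
    """Emit n_words drawn with the target text's word frequencies:
    a first-order mimic that matches the target's unigram statistics
    but carries no grammar or meaning."""
    counts = Counter(target_text.lower().split())
    words, weights = zip(*counts.items())
    rng = random.Random(seed)
    return " ".join(rng.choices(words, weights=weights, k=n_words))

target = "the quick brown fox jumps over the lazy dog the fox sleeps"
print(mimic_unigrams(target, 8))  # frequency-plausible, meaningless word soup
```

A unigram test is fooled; a bigram test, or any human, is not, which is exactly the "you don't know which tests the attacker will run" problem.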
There are still trolls out there, alive and linguisizing.