

Linguistics identifies anonymous users - maskofsanity
http://www.scmagazine.com.au/News/328135,linguistics-identifies-anonymous-users.aspx

======
rachelbythebay
First come the programs which generate fingerprints from writing samples. Then
comes the corpus of known author fingerprints which can take a submission of
arbitrary content and turn it into a series of authors and confidence values.
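
A minimal sketch of that pipeline, not any real system's method. It assumes a tiny hand-picked function-word list as the feature set and cosine similarity as the score; the names `FUNCTION_WORDS`, `fingerprint`, and `rank_authors` are made up for illustration, and the scores are similarities, not calibrated confidence values.

```python
import math
import re
from collections import Counter

# Assumption: a short list of function words stands in for the feature set.
# Real stylometry tools use far richer features than this.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "but"]

def fingerprint(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_authors(sample, corpus):
    """Return (author, similarity) pairs for the sample, best match first."""
    target = fingerprint(sample)
    scores = [(author, cosine(target, fingerprint(text)))
              for author, text in corpus.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)
```

Here `corpus` is a plain dict mapping an author name to a known writing sample, and `rank_authors(sample, corpus)` returns the candidate list described above.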

Later, someone will figure they can profit by gaming this system and writing
posts which appear to be someone else. All they have to do is try to clone an
author's style and then feed their attempts to the analysis program to see how
close they came. Then they can just fine-tune things until it's a strong
enough match and push it out there as legitimate.

Yep, we might just have a "linguistic similarity analysis arms race" at some
point. I concluded this in a post some months back.
<http://rachelbythebay.com/w/2012/08/29/info/>

~~~
andrewcooke
nice post (didn't know about dns serial nos). but why does it have to be a
humorous phrase (as timestamp)? couldn't it be a site that generates a random
nonce every day (and provides a service that compares two nonces)?

~~~
rachelbythebay
I guess it doesn't have to be humorous to be useful. I'd just like to have
them be quirky enough to be appreciated by users. Just think of the first time
your favorite site changed its appearance for a special holiday or similar. It
gives you some idea there are people involved and it's not just a bunch of
robots flinging bits around.

~~~
andrewcooke
<http://colorlessgreen.net/>

------
kragen
IIRC, Larry Detweiler's sock-puppeting attempts on the cypherpunks mailing
list were unmasked by other cypherpunks with this technique in the early
1990s.

Detweiler, natively paranoid and geographically isolated in Colorado, had come
to the conclusion that most of the prominent cypherpunks posters were sock-
puppets of a single person he called "Medusa" — he invented the word
"pseudospoofing" to describe the phenomenon, although "sock-puppet" is the
more common current term — and resolved to use the same technique in response.

I never read the actual word-frequency analyses, just Detweiler's complaints
that the technique had been used against him, which I assume he didn't imagine.

There was a recent paper (two or three years ago?) that claimed that people
could usually defeat this kind of linguistic analysis simply by trying to
change their writing style. I'm skeptical that this is applicable in general.

<http://33bits.org/> is a pretty good blog on the unreasonable difficulty of
anonymity in a computerized world.

------
e12e
"And posts must be translated to English, a process which boosted author
identification from 66 to around 80 per cent but was imperfect using freely
available tools like Google and Bing."

I find that more than a little surprising. I would think machine translation
would detract from the signal and/or normalize the text, rather than boost the
differences.

~~~
robk
This is very surprising given the statistical machine learning approach of
Google Translate. Surely this adds in too much noise to be trustworthy.

~~~
taneliv
Could it be that some words do not get translated and appear in the original
language? These would be very distinctive features in the later steps.

------
lindauer
There was a really nice paper at the last IEEE Security & Privacy conference
where they did an analysis like this on blog postings.

<http://randomwalker.info/publications/author-identification-draft.pdf>

I tried a similar analysis for a class project using Twitter data, and it
worked surprisingly well considering the small amount of data in a tweet.
Would-be-anonymous posters beware!

------
dalke
For an historical example, the anonymous author of Primary Colors was
identified through a literary analysis
(<http://en.wikipedia.org/wiki/Primary_Colors_%28novel%29#Unmasking_of_.22Anonymous.22>).

Don Foster "did a textual analysis of Primary Colors to identify words,
phrases, and expressions that were repeatedly used and to look for "quirky
expressions" and peculiarities of punctuation. Afterwards, he simply analyzed
writing samples until he located a consistent usage of the same "telltale"
signs of authorship. Sample after sample was rejected until finally he
happened onto the writings of Joe Klein and found what he had been looking
for. The literary quirks and features that Foster had isolated in Primary
Colors occurred with such frequency in Klein's articles that despite Klein's
initial denial, Foster knew he was Anonymous."

Similar methods have been used to, for example, try to figure out who wrote
which portions of the anonymously authored "Federalist Papers."

~~~
sp332
Amazon used to pull out "statistically improbable phrases", 2- or 3-word
phrases that occurred in the book you were looking at that were rare among all
the books in their "look inside" database. Some examples:
<http://www.mentalfloss.com/blogs/archives/25839>
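
A rough sketch of the idea, not Amazon's actual method (which isn't described here). It assumes 2-word phrases only and a made-up score: occurrences in the book divided by one plus occurrences in the rest of the collection. The function names are hypothetical.

```python
import re
from collections import Counter

def bigrams(text):
    """All adjacent word pairs in the text, lowercased."""
    words = re.findall(r"[a-z']+", text.lower())
    return list(zip(words, words[1:]))

def improbable_phrases(book, background_texts, top_n=3):
    """Phrases frequent in `book` but rare across `background_texts`."""
    book_counts = Counter(bigrams(book))
    background = Counter()
    for text in background_texts:
        background.update(bigrams(text))
    # Assumed score: count in this book / (1 + count everywhere else).
    scored = {bg: n / (1 + background[bg]) for bg, n in book_counts.items()}
    # Highest score first; ties broken alphabetically for stable output.
    ranked = sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))
    return [" ".join(bg) for bg, _ in ranked[:top_n]]
```

The same scoring works for longer phrases by swapping `bigrams` for a function that yields 3-word windows.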

------
stephengillie
The 1998 rom-com _Cupid_ [1] had an episode with an "elite linguistics ninja"
who could purportedly tell where a person was born and grew up based on word
usage.

[1] <https://en.wikipedia.org/wiki/List_of_Cupid_episodes>

~~~
cdman
The 1956 musical My Fair Lady [1] features a professor with the same claim.
Nothing new under the sun I guess :-)

[1] <http://en.wikipedia.org/wiki/My_Fair_Lady>

------
honestcoyote
Would this work as a way of hiding? Write plainly. Run the text through Google
Translate into the language of your choice, then translate the result back to
English. Correct the individual mistranslated words but leave the awkward
phrasing intact.

------
lsiebert
We shouldn't have to force ourselves to adopt different ways of talking when
we are anonymous. We should analyze the places where this variance exists,
regularly compile information on the most probable ways of writing, and then
programmatically alter our words, using the previously established
probabilities, keeping them updated as time goes on.

Alternatively, we can analyze our speech for those elements that are
significantly different from others to allow identification, and alter just
those.
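
A toy sketch of that second approach, under loud assumptions: single words are the only "elements" considered, a word counts as distinctive when its rate is at least `ratio` times the reference rate, and the substitution table is supplied by hand. All names and thresholds here are hypothetical.

```python
import re
from collections import Counter

def word_rates(text):
    """Map each word to its share of all words in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1
    return {w: n / total for w, n in Counter(words).items()}

def distinctive_words(own_text, reference_text, ratio=3.0, smoothing=1e-3):
    """Words the author uses at least `ratio` times as often as the reference."""
    own = word_rates(own_text)
    ref = word_rates(reference_text)
    # Smoothing keeps words absent from the reference from dividing by zero.
    return sorted(w for w, rate in own.items()
                  if rate / (ref.get(w, 0.0) + smoothing) >= ratio)

def neutralize(text, substitutions, flagged):
    """Replace flagged words that have an entry in the substitution table."""
    def swap(match):
        word = match.group(0)
        key = word.lower()
        # Note: the replacement is inserted as given; original casing is lost.
        if key in flagged and key in substitutions:
            return substitutions[key]
        return word
    return re.sub(r"[A-Za-z']+", swap, text)
```

For example, if `distinctive_words` flags "whilst", then `neutralize(text, {"whilst": "while"}, flagged)` rewrites only that word and leaves everything else alone. This touches word choice only; sentence structure and punctuation habits are untouched.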

------
NegativeK
I've forced myself to use different writing styles when trying to be
anonymous. I've no clue if I succeeded, but I'm relatively sure that I wasn't
as clearly myself. Every time, I've wondered what it would take to write an
anonymizing tool that would strip identifying patterns from the sentences.

Of course, modifying word choice seems trivial in comparison to analyzing how
a person lays out an entire paper or communicates the steps of their thought
process.

~~~
gnosis
_"I've wondered what it would take to write an anonymizing tool that would
strip identifying patterns from the sentences."_

Peter Wayner came up with a technique called "Mimic Functions" [1] that could
(at least in principle) change a file so that it assumes some statistical
properties of a different file.

Unfortunately, it's easier said than done. One problem is that the sender
doesn't know exactly which statistical tests would be used by the attacker, so
while mimic functions could be devised to emulate certain statistical
properties of a given file, it may be impossible to mimic all the statistical
properties of any but the most trivial file -- especially if you want the
result to make sense to a human reader. It might fool a machine, though.

[1] - <https://en.wikipedia.org/wiki/Mimic_function>

------
lykedis
who iz stoopid enuf 2 right in der normal stylz wen dey do bad tings?!

~~~
saraid216
In _Designing Virtual Worlds_, Richard Bartle explicitly advises that
developers who intend to masquerade as players be very careful not to give
themselves away, and offers a nice list of things to pay attention to.

------
nerdfiles
I genuinely believe their findings do not apply to me; or, to say the least,
that I am a statistical outlier.

There are still trolls out there, alive and linguisizing.

