Unveiling the anonymous author: Stylometry techniques

lwansbrough · on March 6, 2022

This still seems like a very open-ended problem. Surely there is much more hidden information in the way people write that has not been widely used yet. I’m reminded of when 80% accuracy in computer vision classification nets was state of the art. You need more data. And, I think, more transient data. Cliques develop lingo that falls in and out of fashion quickly.

For instance, I know why I started one of the above sentences with “Surely” (not something I did often until recently), and it could be traced to at least one specific internet community if you had all the conversation data. A few more layers like that in a neural net and you’ve got a pretty good fingerprint.
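To make the "lingo as a fingerprint" idea concrete, here is a minimal sketch that counts how often a handful of marker words show up per 1,000 tokens. The `MARKERS` list and the `marker_profile` helper are made up for illustration; a real system would learn its markers from a large, time-stamped corpus rather than hard-code them.

```python
import re
from collections import Counter

# Hypothetical marker words; a real study would select these from data.
MARKERS = ["surely", "indeed", "honestly", "basically"]

def marker_profile(text, markers=MARKERS):
    """Return occurrences of each marker word per 1,000 tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty text
    return {m: 1000 * counts[m] / total for m in markers}

sample = "Surely there is more hidden information. Surely not, honestly."
profile = marker_profile(sample)
print(profile)
```

Comparing such profiles across communities and over time is one plausible way to capture the "transient" signal; on its own, a four-word profile is far too weak to identify anyone.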

tehjoker · on March 5, 2022

I wonder if there's an RCT on whether these techniques actually work. Without validation, it's hard to take the claims too seriously.

woliveirajr · on March 6, 2022

There are some experiments that use a relatively big dataset (100 of authors), each one with 30 documents (to be used as training, validation and test).

Then you test authorship attribution (who is the author of _this_ document?) and get it right in 80% of the tries, or find the true author among the top 5 candidates in 100% of the tries.
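As a rough illustration of how those two numbers are computed, here is a small sketch of top-k accuracy. The rankings and author names below are invented toy data, not results from any real experiment, and `top_k_accuracy` is just a helper name chosen for this example.

```python
def top_k_accuracy(rankings, true_authors, k):
    """Fraction of documents whose true author is among the top k candidates."""
    hits = sum(
        1
        for ranked, truth in zip(rankings, true_authors)
        if truth in ranked[:k]
    )
    return hits / len(true_authors)

# Made-up candidate rankings for five test documents (best guess first).
rankings = [
    ["alice", "bob", "carol"],
    ["bob", "alice", "carol"],
    ["carol", "bob", "alice"],
    ["alice", "carol", "bob"],
    ["bob", "carol", "alice"],
]
true_authors = ["alice", "bob", "carol", "alice", "carol"]

print(top_k_accuracy(rankings, true_authors, k=1))  # 0.8
print(top_k_accuracy(rankings, true_authors, k=2))  # 1.0
```

The gap between the top-1 and top-k figures is what makes the technique useful for narrowing a list of suspects even when it cannot name a single author.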

Perhaps those techniques aren't enough to pick out the author from the whole world population, but they are useful enough to narrow the scope of an investigation inside some campus, for example.

(Disclaimer: I did a PhD on this.)

mwattsun · on March 5, 2022

I'm sure they can achieve a degree of certainty, but not enough to hold up in a court of law, because writing styles can be faked as well.

tehjoker · on March 6, 2022

I think one can pose the question of effectiveness in adversarial vs non-adversarial situations. I do think there are occasional "tells" someone has, but they're probably obvious things if you know the person. However, I'd still be skeptical, because so many factors affect whether it works, such as corpus selection.

I'm not a statistician, but my spider sense prickled when I saw mention of "thousands" of features in the vector due to the curse of dimensionality.
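Here is a minimal sketch of the effect I have in mind, using uniformly random points rather than real stylometric features (a loud assumption: real features are correlated and often pruned or regularized, so this overstates the problem). `relative_contrast` is a hypothetical helper that measures how much farther the farthest point is than the nearest one from a query point.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """Return (farthest - nearest) / nearest distance from a random query point."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

low = relative_contrast(2)      # 2 features
high = relative_contrast(1000)  # 1000 features
print(low, high)
```

In the low-dimensional case the nearest neighbour is much closer than the farthest one; with a thousand random dimensions the distances bunch together, so "nearest author" becomes a much less meaningful notion unless the features carry real signal.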

mwattsun · on March 6, 2022

> but they're probably obvious things if you know the person

An example of this is how Ted Kaczynski got caught: his brother recognized his writing and turned him in.

flobosg · on March 6, 2022

Related: “Who Wrote The ‘Death Note’ Script?” – https://www.gwern.net/Death-Note-script

serhack_ · on March 6, 2022

Thank you! Amazing example!

no-body · on March 5, 2022

I am curious whether using an extreme "accent" would hamper any meaningful analysis, like the person behind the Shadow Brokers leaks did.

dvh · on March 5, 2022

Translate to another language, then back, and fix the errors.

no-body · on March 6, 2022

But then any analysis would lose meaning, or is that exactly the point?

Terry_Roll · on March 5, 2022

> Conclusion: Stylometry as a Probabilistic Science

Perfect for AI and web spiders. However, you can also use this to spot bots; most bots can't do essays. Something not mentioned is someone's knowledge, which can also give away their identity. Autistic people are easy to spot due to their splinter-like knowledge and few other interests, but for non-Aspies the range of knowledge and interests can likewise be used to "fingerprint" someone. It's why the security services want good sock puppets for their T.Roll farms.

As for the identity of Bitcoin’s Satoshi, I think you need Five Eyes to see it.