This still seems like a very open-ended problem. Surely there is much more hidden information in the way people write that has not been widely utilized yet. I’m reminded of when 80% accuracy for computer vision classification nets was state of the art. You need more data. And, I think, more transient data. Cliques develop lingo that falls in and out of fashion quickly.
For instance I know why I started one of the above sentences with “Surely” (not something I did often until recently) and it could be traceable to at least one specific internet community if you had all the conversation data. A few more layers like that in a neural net and you’ve got a pretty good fingerprint.
There are some experiments that use a relatively big dataset (on the order of 100 authors), each author with 30 documents (split into training, validation and test sets).
Then you test authorship attribution (who is the author of _this_ document?) and get it right in 80% of the tries. Or you find that the true author is in the top 5 in 100% of the tries.
Perhaps those techniques aren't enough to pick out the author from the whole world population, but they are useful enough to narrow the scope of an investigation within, say, a campus.
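To make the "top-5" figure concrete, here is a minimal sketch of how such an evaluation could be scored. It is not the method used in those experiments: the function-word list, the cosine-similarity ranking, and every name below are assumptions chosen only for illustration.

```python
from collections import Counter
import math

# Hypothetical, tiny feature set; real studies use far more features.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "surely"]

def profile(text):
    """Relative frequency of each function word in the text."""
    words = text.lower().split()
    counts = Counter(words)
    total = max(len(words), 1)
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_authors(doc, author_profiles):
    """Author names sorted from most to least similar to the document."""
    p = profile(doc)
    return sorted(author_profiles,
                  key=lambda name: cosine(p, author_profiles[name]),
                  reverse=True)

def top_k_accuracy(rankings, true_authors, k):
    """Fraction of documents whose true author appears in the first k ranked names."""
    hits = sum(truth in ranked[:k] for ranked, truth in zip(rankings, true_authors))
    return hits / len(true_authors)
```

With `k=1` this is plain accuracy (the "80%" kind of number); with `k=5` it is the shortlist metric, which is the one that matters if the goal is only to narrow down a set of candidates.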
I think one can pose the question of effectiveness in adversarial vs non-adversarial situations. I do think there are occasional "tells" someone has, but they're probably obvious things if you know the person. However, I'd still be skeptical, because so many factors affect whether it works, such as corpus selection.
I'm not a statistician, but my spider sense prickled when I saw mention of "thousands" of features in the vector due to the curse of dimensionality.
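One common way to illustrate the curse of dimensionality is distance concentration: with a fixed number of samples, the nearest and farthest points from a query become almost equally far away as the number of features grows. The sketch below uses uniformly random points, which is an assumption made purely for illustration; real stylometric features are correlated and may behave less badly.

```python
import math
import random

def relative_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from a random query point
    to n_points uniformly random points in the unit cube of dimension dim."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)

if __name__ == "__main__":
    # Compare a low-dimensional space with ones that have thousands of features.
    for dim in (2, 20, 200, 2000):
        print(dim, round(relative_contrast(dim), 3))
```

The contrast shrinks sharply as `dim` grows, which is why nearest-neighbour style comparisons over thousands of raw features usually need feature selection, regularization, or a lot more documents per author.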
> Conclusion: Stylometry as a Probabilistic Science
Perfect for AI and web spiders. However, you can also use this to spot bots, since most bots can't write essays. Something not mentioned is a person's knowledge, which can also give away their identity. Autistic people are easy to spot due to their splinter-like knowledge and few other interests, but for non-Aspies the range of knowledge and interests can likewise be used to "fingerprint" someone. It's why the security services want good sock puppets for their troll farms.
As for the identity of Bitcoin's Satoshi, I think you need Five Eyes to see it.