The best you could say is that of all the Federalist paper authors, AH writes most similarly to John Titor. But really, we all know Titor is Larry Haber.
This is a fun exercise. Back in 1963, Fred Mosteller and David L. Wallace wrote a piece in the Journal of the American Statistical Association titled "Inference in an Authorship Problem: A comparative study of discrimination methods applied to the authorship of the disputed Federalist papers" [0]. It describes another technique for analyzing the authorship using a Bayesian model of word distributions.
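Their actual model is more elaborate (Bayesian inference over per-author word-rate distributions for function words like "upon" vs. "on"), but a minimal modern sketch of the same general idea, using multinomial naive Bayes over a handful of function-word counts, could look something like this (the word list is just illustrative):

```python
# Rough modern analogue of the Mosteller-Wallace idea (not their exact model):
# count a few function words per paper, fit a simple Bayesian classifier on the
# known-author papers, then score a disputed paper.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Illustrative marker words; Mosteller & Wallace famously leaned on "upon" vs "on".
FUNCTION_WORDS = ["upon", "on", "while", "whilst", "by", "to", "also", "of"]

def train_author_model(texts, labels):
    """texts: known-author papers as strings; labels: e.g. 'hamilton'/'madison'."""
    vectorizer = CountVectorizer(vocabulary=FUNCTION_WORDS)
    X = vectorizer.fit_transform(texts)
    clf = MultinomialNB()
    clf.fit(X, labels)
    return vectorizer, clf

def predict_author(vectorizer, clf, disputed_text):
    X = vectorizer.transform([disputed_text])
    # posterior probabilities over the candidate authors
    return dict(zip(clf.classes_, clf.predict_proba(X)[0]))
```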
One interesting thing about this is the claim that there is a ground truth for all but 12 of the papers, meaning that supervised learning could also be used.
For discussion: I often think unsupervised methods are preferable to supervised ones, provided the unsupervised method achieves a reasonably low error rate, since they tend to generalize more readily.
It would definitely be hard to do. And my personal suspicion is that Satoshi is a group of individuals, making such an analysis almost impossible. However, there is the possibility that some sort of "linguistic fingerprint" could uniquely identify Satoshi, despite what I believe were deliberate efforts to create a highly specific and unique style of writing for the purpose of evading identification. The Unabomber got caught partly because he used the same distinctive phrase in the manifesto and in a letter to a family member.
Another interesting analysis would be to compare the structure of the early Bitcoin code with code written by likely suspects. Perhaps variable naming conventions would identify Satoshi.
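As a toy illustration of that idea (the early Bitcoin code is C++, so a real comparison would need a proper C++ parser; this just shows the shape of the analysis on Python source using the standard ast module):

```python
# Toy sketch: pull identifier names out of Python source and summarize the
# naming style (camelCase vs snake_case vs ALLCAPS) as a per-author profile.
import ast
import re
from collections import Counter

def naming_style(name):
    if re.fullmatch(r"[A-Z0-9_]+", name):
        return "ALLCAPS"
    if "_" in name:
        return "snake_case"
    if re.search(r"[a-z][A-Z]", name):
        return "camelCase"
    return "lowercase"

def identifier_profile(source_code):
    tree = ast.parse(source_code)
    names = [node.id for node in ast.walk(tree) if isinstance(node, ast.Name)]
    styles = Counter(naming_style(n) for n in names)
    total = sum(styles.values()) or 1
    # relative frequency of each naming style in this piece of code
    return {style: count / total for style, count in styles.items()}
```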
>my personal suspicion is that Satoshi is a group of individuals, making such an analysis almost impossible
I agree. My bet is at least two: one theory-heavy "ideas guy" and one more software-engineering-oriented programmer who did the heavy lifting on implementation.
And I have also noticed that these types of analysis assume single-authored documents.
I suspect it will stay a secret for a long time... maybe we'll get some deathbed confessions in a few decades.
Hmm, the issue is that there might be some correlation between syntactic/lexical similarity and the actual subject matter the authors are writing about.
I've been working on a similar project.
Three things I would suggest:
- Add documents definitively written by the authors (Madison, John Jay, etc.) from outside the Federalist Papers to your training set.
- Add another feature that looks at the frequency of function words (closed-class words) such as articles, prepositions, etc. These are very stylistic and hard for an author to control, and they are also independent of the content; this is a classic feature in forensic linguistics (see the sketch after this list).
- Add a distractor case to your training and validation sets, i.e. documents written by a non-Federalist author such as Thomas Jefferson, and confirm that they don't get clustered with one of the other Federalist authors.
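A minimal sketch of that function-word feature, assuming each paper is already a plain string (the word list is just a starting point):

```python
# Relative frequencies of function (closed-class) words: articles, prepositions,
# conjunctions, pronouns. Hard for an author to consciously control and largely
# independent of subject matter.
from collections import Counter
import re

FUNCTION_WORDS = [
    "the", "a", "an", "of", "to", "in", "on", "upon", "by", "for", "with",
    "and", "but", "or", "that", "which", "this", "it", "as", "at", "from",
]

def function_word_features(text):
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1
    # one relative-frequency feature per function word, in a fixed order
    return [counts[w] / total for w in FUNCTION_WORDS]
```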
If you have questions, feel free to tweet me (@pythiccoder); it seems like a cool project.
Nice article! There is a Kaggle competition (Spooky Author) where you had to identify which of three authors wrote a sentence. The problem is very much the same, so not only can you try your technique on that data, you can also read the kernels people posted in the competition.
Contrary to what the commenter above said (that it's not modern and you should have used word2vec), bag-of-words models are very robust and work well in these situations. word2vec was trained on a completely different corpus, and this dataset is quite small.
Some things you might try:
- cosine distance between words
- add LDA (latent Dirichlet allocation) topic probabilities (a sketch after this list)
- add verb speed (how early the first verb appears in the sentence)
- run an LSTM neural network and add its predicted probabilities as features (careful about overfitting)
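A minimal sketch of the LDA idea with scikit-learn (the number of topics and the vectorizer settings are just guesses for a corpus this small):

```python
# Use per-document LDA topic probabilities as extra features alongside the
# bag-of-words vectors.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def lda_topic_features(texts, n_topics=10):
    """texts: list of papers as strings; returns an (n_docs, n_topics) array."""
    vectorizer = CountVectorizer(stop_words="english", max_df=0.95, min_df=2)
    X = vectorizer.fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    # each row is a per-document topic probability distribution (sums to ~1)
    return lda.fit_transform(X)
```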
Maybe consider including the results of others' analyses alongside yours? Even more interesting would be a deep-dive into why they disagree. (Adventures in interpretation)
Also maybe do a PCA and show scatter plots of the first two PCs for each doc?
I'm no expert, but these could be fun avenues to explore.
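For what it's worth, a rough sketch of that PCA suggestion, assuming you already have a feature matrix X with one row per paper and integer cluster/author labels:

```python
# Project each paper's feature vector onto the first two principal components
# and scatter-plot them, colored by cluster or (presumed) author.
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

def plot_first_two_pcs(X, labels):
    """X: (n_docs, n_features) array; labels: integer cluster/author id per doc."""
    pcs = PCA(n_components=2).fit_transform(X)
    plt.scatter(pcs[:, 0], pcs[:, 1], c=labels, cmap="tab10")
    plt.xlabel("PC 1")
    plt.ylabel("PC 2")
    plt.title("Federalist papers, first two principal components")
    plt.show()
```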
That thought crossed my mind. I was thinking of trying to get a training set and teaching these models on previous works by Jay, Madison, and Hamilton. However, a lot of what I found for each of them was behind a paywall, or too hard to grep through to actually get the data. For instance, all I could find in my (admittedly superficial) digging on John Jay was a book by UVA called "The Selected Papers of John Jay". It costs $90 and contains correspondence both to and from John Jay. It seemed like too much overhead for a weekend project, so I just settled on unsupervised.
Increasingly it feels like we can only "think" anonymously, not speak. Technology like Tor allows me to surf the web untracked, which is good. There are chilling effects from mass surveillance that cause people not to read about sensitive topics[1]. It's good that people can expose themselves to primary sources unimpeded.
But if a user tries to say anything of substance, or simply becomes part of a community rather than rotating nyms every year or so, they're opening themselves up to fingerprinting.
Nice article! Bag-of-words models, however, have their drawbacks, and I wouldn't consider them a modern technique. It would be awesome if you also tried some state-of-the-art techniques like neural word/document embeddings, e.g. with doc2vec.
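In case it helps, a minimal doc2vec sketch with gensim might look something like this (hyperparameters are guesses, and ~85 papers is a small corpus for this kind of model):

```python
# doc2vec: learn a dense embedding per paper, then cluster/compare the embeddings.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

def doc2vec_embeddings(texts):
    """texts: list of papers as strings; returns one vector per paper."""
    tagged = [TaggedDocument(words=simple_preprocess(t), tags=[i])
              for i, t in enumerate(texts)]
    model = Doc2Vec(tagged, vector_size=50, window=5, min_count=2, epochs=40)
    return [model.infer_vector(simple_preprocess(t)) for t in texts]
```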
I've tried to transition fully to Python 3, but it should mostly work in Python 2.7 as well!
I did not read the book on nltk. I've been into information/coding theory recently (just finished Information by James Gleick) and thought I'd try my hand at something closer to NLP/ML. Very little background in the topic - I've taken a few courses in college and chatted with a few friends that know ML fairly well, but besides that I relied on blog posts and papers found on arxiv!