One interesting thing about this is the claim that there is a ground truth for all but 12 of the papers, meaning that supervised learning could also be used.
For discussion: I often think that unsupervised methods are preferable to supervised ones, given a reasonably low error rate for the unsupervised method, since they generalize more readily.
I have a suite of nyms I created over Tor, and exclusively use via Tor, which I use to discuss certain topics.
It's hard to do a stylometric analysis when you don't have anything to compare to. Professional writing is very different from informal conversation.
If someone doesn't have a Facebook, Gmail, Twitter, etc., it would be very difficult to find that person through stylometric analysis IMHO.
Another interesting analysis would be to compare the structure of code between the early bitcoin code and code written by likely suspects. Perhaps variable naming convention will identify Satoshi.
I agree. My bet is at least two: one theory heavy "ideas guy" and one more software engineering oriented programmer who did the heavy lifting on implementation.
And I have also noticed that these types of analysis are for single authored documents.
I suspect it will stay a secret for a long time... maybe we'll get some deathbed confessions in a few decades.
I've been working on a similar project.
Three things that I would suggest would be
- Add documents definitively written by the authors (Madison, John Jay, etc.) from outside the Federalist Papers to your training set.
- Add a feature for the frequencies of function words (closed-class words such as articles and prepositions). These are highly stylistic, hard for an author to consciously control, and independent of content; they are a classical feature in forensics.
- Add a distractor case to your training and validation sets, i.e. documents written by a non-Federalist author such as Thomas Jefferson, and confirm that they don't get clustered with one of the Federalist authors.
If you have questions, feel free to tweet me at @pythiccoder; it seems like a cool project.
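The function-word frequency feature suggested above could be sketched like this. A minimal example, assuming a tiny illustrative word list (hypothetical, not a standard forensic inventory) and naive whitespace tokenization:

```python
from collections import Counter

# A small illustrative set of English function words (closed-class words).
# A real feature set would use a much larger standard list.
FUNCTION_WORDS = {"the", "a", "an", "of", "to", "in", "and", "but", "upon", "while"}

def function_word_freqs(text):
    """Return the relative frequency of each function word in `text`."""
    tokens = text.lower().split()
    total = len(tokens) or 1
    counts = Counter(t for t in tokens if t in FUNCTION_WORDS)
    return {w: counts[w] / total for w in FUNCTION_WORDS}

freqs = function_word_freqs("The powers of the union are delegated to the states")
# freqs["the"] == 0.3, freqs["of"] == 0.1
```

Each document then becomes a fixed-length vector of these frequencies, which plugs straight into whatever clustering or classification step comes next.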
Contrary to what the commenter above said, that it's not modern and you should have used word2vec: bag-of-words features are very robust and work well in these situations. word2vec was trained on a completely different corpus, and this dataset is quite small.
some things you might try are:
- cosine distance between words
- add LDA (latent Dirichlet allocation) topic probabilities
- add verb position (how early the first verb appears in a sentence)
- run an LSTM neural network and add the predicted probabilities as features (careful of overfitting)
Also maybe do a PCA and show scatter plots of the first two PCs for each doc?
I'm no expert, but these could be fun avenues to explore.
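The PCA scatter-plot idea above could be sketched as follows. A minimal example using only NumPy (PCA via SVD rather than a library class); the feature matrix here is made up purely for illustration:

```python
import numpy as np

def pca_2d(X):
    """Project rows of X onto the first two principal components via SVD."""
    Xc = X - X.mean(axis=0)                       # center each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T                          # scores on PC1 and PC2

# Hypothetical document-feature matrix: 4 docs x 3 stylistic features.
X = np.array([[3.0, 1.0, 0.5],
              [2.9, 1.1, 0.4],
              [0.5, 4.0, 2.0],
              [0.6, 3.9, 2.1]])

scores = pca_2d(X)   # shape (4, 2); feed to a matplotlib scatter plot
```

If authorship signal is present in the features, documents by the same author should land near each other in the PC1/PC2 plane, which makes for a quick visual sanity check before any modeling.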
But if a user tries to say anything of substance, or simply to become part of a community rather than rotating nyms every year or so, they're opening themselves up to fingerprinting.
An interesting dynamic, in my opinion.
1977. I've always wondered how it stacks up against modern techniques.
I believe you intended this to run in Python 3, based on your last cell. Changing maketrans and translate as follows will make it run in Python 2.7:
table = string.maketrans('', '') # remove punctuation from each word
stripped = [w.translate(table, string.punctuation) for w in tokens]
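For completeness, a sketch of the equivalent in Python 3, where str.maketrans takes the characters to delete as a third argument and str.translate takes only the table:

```python
import string

# Python 3: build a translation table that deletes all punctuation.
table = str.maketrans('', '', string.punctuation)

tokens = ["Hello,", "world!"]
stripped = [w.translate(table) for w in tokens]
# stripped == ["Hello", "world"]
```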
I did not read the book on nltk. I've been into information/coding theory recently (just finished Information by James Gleick) and thought I'd try my hand at something closer to NLP/ML. Very little background in the topic - I've taken a few courses in college and chatted with a few friends that know ML fairly well, but besides that I relied on blog posts and papers found on arxiv!