The Federalist Papers: Author Identification Through K-Means Clustering (jonlu.ca)
178 points by jonluca on July 8, 2018 | 32 comments



I did a similar exercise, and then applied it to the Titor posts. Turns out Alexander Hamilton is John Titor.


The best you could say is that of all the Federalist paper authors, AH writes most similarly to John Titor. But really, we all know Titor is Larry Haber.


straight facts, my friend


This is a fun exercise. Back in 1963, Fred Mosteller and David L. Wallace wrote a piece in the Journal of the American Statistical Association titled "Inference in an Authorship Problem: A comparative study of discrimination methods applied to the authorship of the disputed Federalist papers" [0]. It describes another technique for analyzing the authorship using a Bayesian model of word distributions.

One interesting thing about this is the claim that there is a ground truth for all but 12 of the papers, meaning that supervised learning could also be used.

For discussion, I often think that unsupervised methods are preferred to supervised methods, given a reasonably low error rate by the unsupervised method, as it will be able to generalize more readily.

[0] https://www.jstor.org/stable/2283270


Now try Satoshi Nakamoto with bitcointalk posts and papers in cryptography.


This assumes Satoshi discusses such things under a nym tied to his real name or IP.

I have a suite of nyms I created over Tor, and exclusively use via Tor, which I use to discuss certain topics.

It's hard to do a stylometric analysis when you don't have anything to compare to. Professional writing is very different from informal conversation.

If someone doesn't have a Facebook, gmail, twitter, etc it would be very difficult to find that person through stylometric analysis IMHO.


It would definitely be hard to do. And my personal suspicion is that Satoshi is a group of individuals, making such an analysis almost impossible. However there is the possibility that there is some sort of "linguistic fingerprint" capable of uniquely identifying Satoshi, despite what I believe to be their efforts to create a highly specific and unique style of writing for the purposes of evading identification. The way the Unabomber used a certain phrase in the manifesto and a letter to a family member was what got him caught.

https://en.wikipedia.org/wiki/You_can%27t_have_your_cake_and...

Another interesting analysis would be to compare the structure of code between the early bitcoin code and code written by likely suspects. Perhaps variable naming convention will identify Satoshi.

https://github.com/trottier/original-bitcoin


>my personal suspicion is that Satoshi is a group of individuals, making such an analysis almost impossible

I agree. My bet is at least two: one theory heavy "ideas guy" and one more software engineering oriented programmer who did the heavy lifting on implementation.

And I have also noticed that these types of analyses are designed for single-author documents.

I suspect it will stay a secret for a long time... maybe we'll get some deathbed confessions in a few decades.



This is my first project in unsupervised NLP, so let me know if there's anything obviously wrong with the article or methodology.


Hmm the issue is that there might be some correlation between the syntactic and lexical similarity and the actual subject matter the authors are talking about.

I've been working on a similar project.

Three things I would suggest:

- Add documents definitively written by the authors (Madison, John Jay, etc.) from outside the Federalist Papers to your training set.

http://oll.libertyfund.org/titles/jay-the-correspondence-and...

http://www.gutenberg.org/ebooks/author/14, etc

- Add another feature that looks at the frequency of function words (closed-class words) such as articles, prepositions, etc. These are very stylistic and hard for an author to control, and they are also independent of the content. This is a classic feature in forensics.

- Add a distractor case to your training and validation sets, i.e. documents written by a non-Federalist author such as Thomas Jefferson, and confirm that they don't get clustered in with one of the Federalist authors.
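
A minimal sketch of the function-word feature suggested above, assuming a short, purely illustrative word list (`FUNCTION_WORDS`) and a hypothetical helper name (`function_word_rates`); neither comes from the article, and a real study would use a much longer list:

```python
import re
from collections import Counter

# Illustrative, deliberately short list of English function words (an assumption;
# stylometric studies typically use longer, carefully chosen lists).
FUNCTION_WORDS = ["the", "of", "to", "and", "in", "a", "by", "on", "upon", "while", "whilst"]


def function_word_rates(text, vocabulary=FUNCTION_WORDS):
    """Return occurrences per 1,000 tokens for each word in vocabulary."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [1000.0 * counts[word] / total for word in vocabulary]


sample = "The powers of the government depend upon the consent of the people."
rates = function_word_rates(sample)
```

Each document then contributes one fixed-length vector of rates, which can be appended to whatever other features are fed to the clustering step.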

If you have questions, feel free to tweet me at @pythiccoder. It seems like a cool project.


Nice article. There is a Kaggle competition (Spooky Author) in which you had to identify which of three authors wrote a sentence. The problem is very much the same, so not only can you try your technique on that data, you can also read the kernels people posted in the competition.

Unlike what the commenter above said (that it's not modern and you should have used word2vec), bags of words are very robust and work well in these situations. word2vec was trained on a completely different corpus, and this data set is quite small.

Some things you might try are:
- cosine distance between words
- add LDA (latent Dirichlet allocation) topic probabilities
- add verb speed (how quickly the author uses the first verb in a sentence)
- run an LSTM NN and add the predicted probabilities as features (careful about overfitting)
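
As a rough illustration of the first suggestion, here is a sketch of cosine distance computed between the bag-of-words count vectors of two texts. Reading "between words" as "between documents' word-count vectors" is an assumption, and `cosine_distance` is a hypothetical helper name:

```python
import math
from collections import Counter


def cosine_distance(text_a, text_b):
    """Return 1 - cosine similarity of the two texts' word-count vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0  # treat an empty text as maximally distant
    return 1.0 - dot / (norm_a * norm_b)
```

The distance from a disputed paper to each candidate author's known papers could then be added as extra features.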


I know nothing about NLP but would a run where it predicts the authorship of the known texts be useful to get some idea of the level of accuracy?


Not the author, but I do know a bit about NLP, so to answer your question: Yes it would :)
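
One way such a check could look, as a sketch only: map each cluster to the author who wrote most of its known papers, then count how many known papers land in a cluster assigned to their own author. The function name `cluster_accuracy` and the labels below are hypothetical and are not results from the article.

```python
from collections import Counter, defaultdict


def cluster_accuracy(cluster_ids, known_authors):
    """Map each cluster to its majority author, then score the known documents.

    Documents whose author is None (e.g. disputed papers) are ignored.
    """
    by_cluster = defaultdict(Counter)
    for cluster, author in zip(cluster_ids, known_authors):
        if author is not None:
            by_cluster[cluster][author] += 1
    # A document counts as correct when it shares its cluster's majority author.
    correct = sum(votes.most_common(1)[0][1] for votes in by_cluster.values())
    total = sum(sum(votes.values()) for votes in by_cluster.values())
    return correct / total if total else 0.0


# Hypothetical labels, not results from the article.
clusters = [0, 0, 0, 1, 1, 2, 2, 1]
authors = ["Hamilton", "Hamilton", "Madison", "Madison", "Madison", "Jay", "Jay", None]
score = cluster_accuracy(clusters, authors)
```

This score (often called purity) tends to look better as the number of clusters grows, so it is most informative when the cluster count matches the number of candidate authors.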


Maybe consider including the results of others' analyses alongside yours? Even more interesting would be a deep-dive into why they disagree. (Adventures in interpretation)

Also maybe do a PCA and show scatter plots of the first two PCs for each doc?

I'm no expert, but these could be fun avenues to explore.


You could also go super crazy and try to do an LDA (may have to go beyond syntax and lex). Assume each document is a mixture of author influence. :)


Why did you choose an unsupervised method to solve a classification problem?


There's no ground truth. :) They were written under a common pseudonym.


Couldn't works known to be written by the hypothesized authors be used to train a supervised classifier?


That thought crossed my mind. I was thinking of trying to get a training set and teaching these models based on previous works by Jay, Madison, and Hamilton. However, a lot of what I found for each of them was behind a paywall, or too hard to grep through to actually get the data. For instance, all I could find in my (admittedly superficial) digging on John Jay was a book by UVA called "The Selected Papers of John Jay". It costs $90 and contains correspondence both to and from John Jay. It seemed like too much overhead for a weekend project, so I just settled on unsupervised.


May be of interest (see "See Also" for James Madison stuff too): https://en.m.wikipedia.org/wiki/The_Selected_Papers_of_John_...


How were the properties chosen? Did you do any information gain analysis on them before doing the clustering?


Increasingly it feels like we can only "think" anonymously, not speak. Technology like Tor allows me to surf the web untracked, which is good. There are chilling effects from mass surveillance that cause people not to read about sensitive topics[1]. It's good people can expose themselves to primary sources unimpeded.

But if a user tries to say anything of substance or simply become part of a community rather than rotate nyms every year or so, they're opening themselves up to fingerprinting.

An interesting dynamic, in my opinion.

[1] https://motherboard.vice.com/en_us/article/aekedb/chilling-e...


https://books.google.com/books/about/To_Couple_is_the_Custom...

1977. Always wondered how it stacked up against modern techniques


Nice article! Bag-of-words models, however, have their drawbacks, and I wouldn't consider them a modern technique. It would be awesome if you also tried some state-of-the-art techniques like NN word/document embeddings, e.g. with doc2vec.


Fun fact: this problem is why grep was created by Ken Thompson [1].

[1] https://youtu.be/NTfOnGZUZDk



Neat! Kernighan really is pleasant listening as noted on that post.


Why not use something like LDA and features taken from word embeddings? A more probabilistic analysis would give more meaningful results.


Fun post!


btw, mostly seems to run on python2!

I believe you intended to run in python3 based on your last cell.

Changing maketrans and translate as follows will have it run in Python 2.7:

    import string

    table = string.maketrans('', '')  # identity table; string.maketrans exists only in Python 2
    stripped = [w.translate(table, string.punctuation) for w in tokens]  # remove punctuation from each word

Did you read the book on NLTK before writing this project? I'm curious about the background skill development needed to undertake this project.


I've tried to transition fully to python3, but it should mostly work in python2.7 as well!
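
For comparison with the Python 2.7 snippet quoted above, a sketch of the same punctuation-stripping step in Python 3, where `str.maketrans` takes the characters to delete as its third argument. The `tokens` list here is made up for illustration:

```python
import string

# Hypothetical tokens; in the notebook these would come from the tokenizer.
tokens = ["Publius,", "wrote:", "liberty!", "well-known", "--"]

# Python 3: the third argument to str.maketrans lists characters to delete.
table = str.maketrans("", "", string.punctuation)
stripped = [w.translate(table) for w in tokens]
```

Tokens made up entirely of punctuation become empty strings, so a follow-up filter such as `[w for w in stripped if w]` may be needed.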

I did not read the book on nltk. I've been into information/coding theory recently (just finished Information by James Gleick) and thought I'd try my hand at something closer to NLP/ML. Very little background in the topic - I've taken a few courses in college and chatted with a few friends that know ML fairly well, but besides that I relied on blog posts and papers found on arxiv!



