
The Federalist Papers: Author Identification Through K-Means Clustering - jonluca
https://blog.jonlu.ca/posts/the-federalist-papers-author-identification-through-k-means-clustering
======
swalsh
I did a similar exercise, and then applied it to the Titor posts. Turns out
Alexander Hamilton is John Titor.

~~~
oh_sigh
The best you could say is that of all the Federalist paper authors, AH writes
most similarly to John Titor. But really, we all know Titor is Larry Haber.

------
jaimie
This is a fun exercise. Back in 1963, Fred Mosteller and David L. Wallace
wrote a piece in the Journal of the American Statistical Association titled
"Inference in an Authorship Problem: A comparative study of discrimination
methods applied to the authorship of the disputed Federalist papers" [0]. It
describes another technique for analyzing the authorship using a Bayesian
model of word distributions.

One interesting thing about this is the claim that there is a ground truth for
all but 12 of the papers, meaning that supervised learning could also be used.

As a discussion point, I often think unsupervised methods are preferable to
supervised ones, given a reasonably low error rate from the unsupervised
method, since they tend to generalize more readily.

[0]
[https://www.jstor.org/stable/2283270](https://www.jstor.org/stable/2283270)
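For anyone curious what that approach looks like in code, here is a minimal stdlib sketch of the idea: score a disputed text against each candidate author's function-word rates with a smoothed multinomial model. The marker words are loosely inspired by the Mosteller-Wallace discriminators, but the training texts below are invented toy strings, not actual Federalist prose:

```python
import math
from collections import Counter

# Marker words loosely inspired by Mosteller-Wallace's discriminators.
MARKERS = ["upon", "while", "whilst", "enough", "there"]

def rates(text):
    """Marker-word counts and total word count for one text."""
    words = text.lower().split()
    counts = Counter(words)
    return {m: counts[m] for m in MARKERS}, len(words)

def log_likelihood(text, training_texts):
    """Multinomial naive-Bayes log-likelihood of `text` under an author's
    marker-word distribution, with add-one smoothing."""
    total_counts, total_words = Counter(), 0
    for t in training_texts:
        c, n = rates(t)
        total_counts.update(c)
        total_words += n
    test_counts, _ = rates(text)
    ll = 0.0
    for m in MARKERS:
        p = (total_counts[m] + 1) / (total_words + len(MARKERS))
        ll += test_counts[m] * math.log(p)
    return ll

# Invented toy corpora: Hamilton famously favored "upon", Madison "whilst".
hamiltonish = ["upon the whole there is enough reason upon reflection",
               "there is upon every question enough said upon it"]
madisonish  = ["whilst the states deliberate whilst the people decide",
               "the people whilst free decide whilst they may"]

disputed = "upon this point there is enough evidence upon the record"
guess = max([("Hamilton", hamiltonish), ("Madison", madisonish)],
            key=lambda a: log_likelihood(disputed, a[1]))[0]
print(guess)
```

The real study used far more words and a careful prior over rates; this only shows the shape of the computation.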

------
shiado
Now try Satoshi Nakamoto with bitcointalk posts and papers in cryptography.

~~~
clubm8
This assumes Satoshi discusses such things under a nym tied to his real name
or IP.

I have a suite of nyms I created over Tor, and exclusively use via Tor, which
I use to discuss certain topics.

It's hard to do a stylometric analysis when you don't have anything to compare
to. Professional writing is very different from informal conversation.

If someone doesn't have a Facebook, Gmail, Twitter, etc., it would be very
difficult to find that person through stylometric analysis, IMHO.

~~~
shiado
It would definitely be hard to do. My personal suspicion is that Satoshi is a
group of individuals, which would make such an analysis almost impossible.
However, there is the possibility of some sort of "linguistic fingerprint"
capable of uniquely identifying Satoshi, despite what I believe were their
efforts to create a highly specific and unique style of writing for the
purpose of evading identification. The Unabomber's reuse of a particular
phrase in both the manifesto and a letter to a family member was what got him
caught.

[https://en.wikipedia.org/wiki/You_can%27t_have_your_cake_and...](https://en.wikipedia.org/wiki/You_can%27t_have_your_cake_and_eat_it)

Another interesting analysis would be to compare the structure of the early
bitcoin code against code written by likely suspects. Perhaps variable naming
conventions would identify Satoshi.

[https://github.com/trottier/original-bitcoin](https://github.com/trottier/original-bitcoin)
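As a toy sketch of what that code stylometry could look like: count identifier-style features with the stdlib and compare profiles. The snippets below are invented stand-ins (the first mimics Hungarian-notation C++, the second idiomatic Python), not actual bitcoin source:

```python
import re
from collections import Counter

def naming_profile(source):
    """Count identifier-style features in a code snippet -- one crude
    stylometric fingerprint. Categories and patterns are illustrative."""
    idents = re.findall(r"\b[A-Za-z_][A-Za-z0-9_]*\b", source)
    profile = Counter()
    for name in set(idents):
        if "_" in name.strip("_"):
            profile["snake_case"] += 1
        elif re.match(r"^[a-z]+[A-Z]", name):
            profile["camelCase"] += 1
        elif re.match(r"^[A-Z][a-z]", name):
            profile["PascalCase"] += 1
        else:
            profile["other"] += 1
    return profile

# Invented snippets standing in for two authors' code.
sample_a = "int nBestHeight; CBlockIndex* pindexBest; void AddToWallet();"
sample_b = "best_height = 0\ndef add_to_wallet(tx_hash):\n    pass"

print(naming_profile(sample_a))
print(naming_profile(sample_b))
```

A real comparison would add features like brace placement, comment density, and line length, then measure distances between profiles.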

~~~
clubm8
> _my personal suspicion is that Satoshi is a group of individuals, making
> such an analysis almost impossible_

I agree. My bet is at least two: one theory heavy "ideas guy" and one more
software engineering oriented programmer who did the heavy lifting on
implementation.

I've also noticed that these types of analyses assume single-authored
documents.

I suspect it will stay a secret for a long time... maybe we'll get some
deathbed confessions in a few decades.

------
jonluca
This is my first project in unsupervised NLP, so let me know if there's
anything obviously wrong with the article or methodology.

~~~
tssva
I know nothing about NLP, but would a run where it predicts the authorship of
the known texts be useful for getting some idea of the level of accuracy?

~~~
wodenokoto
Not the author, but I do know a bit about NLP, so to answer your question: Yes
it would :)
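As a concrete version of that sanity check: run the clustering on only the papers of known authorship, then measure how often each cluster's members share its majority author (cluster "purity"). A stdlib sketch with invented toy labels:

```python
from collections import Counter

def cluster_purity(cluster_ids, true_authors):
    """Fraction of documents matching the majority author of their cluster.
    A cheap accuracy proxy for unsupervised author clustering."""
    by_cluster = {}
    for cid, author in zip(cluster_ids, true_authors):
        by_cluster.setdefault(cid, []).append(author)
    correct = sum(Counter(members).most_common(1)[0][1]
                  for members in by_cluster.values())
    return correct / len(true_authors)

# Toy example: 6 papers of known authorship, k-means assigned 2 clusters.
clusters = [0, 0, 0, 1, 1, 1]
authors  = ["Hamilton", "Hamilton", "Madison", "Madison", "Madison", "Jay"]
print(cluster_purity(clusters, authors))
```

High purity on the known papers would lend some credibility to the labels the clustering assigns to the disputed ones.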

------
clubm8
Increasingly it feels like we can only "think" anonymously, not speak.
Technology like Tor allows me to surf the web untracked, which is good. There
are chilling effects from mass surveillance that cause people not to read
about sensitive topics[1]. It's good that people can expose themselves to
primary sources unimpeded.

But if users try to say anything of substance, or simply become part of a
community rather than rotating nyms every year or so, they're opening
themselves up to fingerprinting.

An interesting dynamic, in my opinion.

[1]
[https://motherboard.vice.com/en_us/article/aekedb/chilling-e...](https://motherboard.vice.com/en_us/article/aekedb/chilling-effect-of-mass-surveillance-is-silencing-dissent-online-study-says)

------
ggm
[https://books.google.com/books/about/To_Couple_is_the_Custom...](https://books.google.com/books/about/To_Couple_is_the_Custom.html?id=z8GzNQAACAAJ)

1977. Always wondered how it stacked up against modern techniques.

------
davnn
Nice article! Bag-of-words models have their drawbacks, however, and I
wouldn't consider them a modern technique. It would be awesome if you also
tried some state-of-the-art techniques like neural word/document embeddings,
e.g. with doc2vec.

------
berti
Fun fact: this problem is why grep was created by Ken Thompson [1].

[1] [https://youtu.be/NTfOnGZUZDk](https://youtu.be/NTfOnGZUZDk)

~~~
neonate
That was on HN yesterday.
[https://news.ycombinator.com/item?id=17478260](https://news.ycombinator.com/item?id=17478260)

~~~
berti
Neat! Kernighan really is pleasant to listen to, as noted in that post.

------
jul8234
Why not use something like LDA and features taken from word embeddings? A more
probabilistic analysis would give more meaningful results.
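A rough sketch of the LDA direction with scikit-learn, where each paper's topic mixture becomes a dense probabilistic feature vector (the corpus below is an invented toy, not the actual papers):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-in corpus; real input would be the individual papers.
papers = ["taxes revenue commerce trade taxes",
          "militia army defense war militia",
          "commerce trade revenue imports trade",
          "war defense army navy militia"]

counts = CountVectorizer().fit_transform(papers)
lda = LatentDirichletAllocation(n_components=2, random_state=0)

# Each row is a paper's topic mixture -- a probability distribution over
# topics that could replace raw word counts in the clustering step.
topic_mix = lda.fit_transform(counts)
print(topic_mix.shape)
```

For authorship specifically, one would probably restrict the vocabulary to function words, since content topics vary by paper regardless of author.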

------
blobbers
Fun post!

~~~
blobbers
btw, it mostly seems to run on Python 2! I believe you intended it to run in
Python 3, based on your last cell.

Changing maketrans and translate will make it run on Python 2.7:

    table = string.maketrans('', '')  # Python 2: identity translation table
    stripped = [w.translate(table, string.punctuation) for w in tokens]  # 2nd arg deletes punctuation

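For reference, the Python 3 equivalent collapses this into str.maketrans, whose third argument lists the characters to delete (the `tokens` list here is just a stand-in):

```python
import string

tokens = ["Hello,", "world!", "it's", "1788."]

# Python 3: str.maketrans's third argument is the set of characters to
# delete, so translate() no longer takes a separate deletechars argument.
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]
print(stripped)
```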
Did you read the NLTK book before writing this project? Curious what
background skill development it takes to undertake a project like this.

~~~
jonluca
I've tried to transition fully to Python 3, but it should mostly work in
Python 2.7 as well!

I did not read the NLTK book. I've been into information/coding theory
recently (just finished The Information by James Gleick) and thought I'd try
my hand at something closer to NLP/ML. I have very little background in the
topic - I've taken a few courses in college and chatted with a few friends
who know ML fairly well, but beyond that I relied on blog posts and papers
found on arXiv!

