
How Forensic Linguistics Identified J.K. Rowling - ahmadss
http://phenomena.nationalgeographic.com/2013/07/19/how-forensic-linguistics-outed-j-k-rowling-not-to-mention-james-madison-barack-obama-and-the-rest-of-us/
======
junto
Correction: Rowling was 'outed' by her lawyer's wife's friend.

[http://www.independent.co.uk/arts-entertainment/books/news/j...](http://www.independent.co.uk/arts-entertainment/books/news/jk-rowling-angry-and-disappointed-after-law-firm-leaked-robert-galbraith-pseudonym-8718087.html)

If I was the law firm, I'd fire the lawyer.

~~~
joezydeco
Funny coincidence that this 'outing' happened just as the book was headed for
the clearance bin.

~~~
UnoriginalGuy
Given how much damage this whole incident has done to that law firm I
seriously doubt any sane law firm would let themselves be put in that position
just for a publicity stunt.

Normally intentional "leaks" are attributed to "anonymous sources" etc. The
leakers are very rarely, if ever, named.

Plus she doesn't need the money, she is one of the richest women in the world.
I suspect she wrote under a different name to see if the book would stand on
its own without her big name boosting sales.

~~~
gnosis
_" she doesn't need the money, she is one of the richest women in the world."_

Never underestimate what the rich and super rich will do to get more money.

Does Warren Buffett need more money? Yet he's still trying to make himself
richer.

Do all the top CEOs, VCs, and board members in the world "need" more millions
and billions? Maybe not. But it hasn't stopped them from trying to further
enrich themselves.

Besides, there could be other motivations for this stunt, such as fame or
wanting to be talked about or thought of as a certain type of author.

~~~
nicholassmith
Rowling actively rid herself of cash and fell out of the top 100 richest last
year, I think money isn't that important to her.

------
GigabyteCoin
"I called both of them _yesterday_ and learned not only how the Rowling
investigation worked, but about the fascinating world of forensic linguistics."

 _Cringe._

From my experience (gleaned from dutifully reading every Bitcoin-related
article I can get my hands on) I am very wary of reading about any topic which
the author admits to just having learnt about _yesterday_.

The majority of the time, unfortunately, English majors aren't the best at
understanding technology.

~~~
cschmidt
That's a bit harsh. The article was by Virginia Hughes, who seems a reasonable
science writer, who has written for Nature, etc. Fundamentally, that's what
science writers do. They learn about new discoveries, figure out what they
mean, and explain them to us. There is a lot of bad science writing out there,
but I didn't find it to be an example of that.

She has a degree in neuroscience from Brown, so she's not an "English major".

------
elchief
PCA is a pretty neat technique. It's quite old too, invented by Pearson in the
early 1900s.

Basically, you find a "vector" that travels along the part of the data with
the highest variance. Then you find an orthogonal vector that travels along
the part with the next highest variance.

You then have a set of vectors that explain all of the variance, that aren't
correlated (because they're orthogonal), and are ranked by how much they
explain.

This can be useful in regression to get rid of correlated variables, or you
can drop some of the low-variance components when there are more columns
than rows, a situation that breaks OLS regression.

Consider a new town that you want to get to know as quickly as possible. What
is the best method? You start with the longest street, then take a left and
travel the next longest street, and so on. You can get a pretty good idea
about the town without seeing it all.
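
To make the "find the highest-variance direction, then the next orthogonal one" description concrete, here is a minimal sketch in plain Python. It is not how a statistics package would do it (those typically use an SVD); it uses power iteration with deflation because that mirrors the description step by step. The function names and the toy data are assumptions made up for illustration.

```python
import random

def covariance(rows):
    """Sample covariance matrix of a list of equal-length rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def dominant_direction(cov, rng, iterations=200):
    """Power iteration: unit vector of highest variance, and that variance."""
    d = len(cov)
    v = [rng.gauss(0, 1) for _ in range(d)]
    start_norm = sum(x * x for x in v) ** 0.5
    v = [x / start_norm for x in v]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0:  # no variance left to explain
            break
        v = [x / norm for x in w]
    variance = sum(v[i] * cov[i][j] * v[j] for i in range(d) for j in range(d))
    return v, variance

def pca(rows, n_components, seed=0):
    """Return [(direction, variance), ...] ranked by variance explained."""
    rng = random.Random(seed)
    cov = covariance(rows)
    d = len(cov)
    components = []
    for _ in range(n_components):
        v, variance = dominant_direction(cov, rng)
        components.append((v, variance))
        # Deflate: remove the variance already explained, so the next pass
        # finds the best direction orthogonal to the ones found so far.
        cov = [[cov[i][j] - variance * v[i] * v[j] for j in range(d)]
               for i in range(d)]
    return components

# Toy data (made up): points stretched along a diagonal line plus a little noise.
rng = random.Random(1)
xs = [rng.gauss(0, 1) for _ in range(200)]
data = [[x, 0.5 * x + rng.gauss(0, 0.1)] for x in xs]
(first, var1), (second, var2) = pca(data, 2)
print("share of variance on first component:", var1 / (var1 + var2))
```

On data like this, almost all of the variance lands on the first component, which is the "longest street" in the town analogy.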

------
3minus1
The analysis of word length is interesting. English has a lot of long,
multisyllabic Latin-based words, and also a lot of short Germanic-based words. I
wonder to what extent a higher percentage of long words indicates a
preference for the Latin and vice versa.
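
For anyone who wants to play with this, a word-length feature can be sketched in a few lines of Python. This is only an illustration of the general idea, not the method used by the tools in the article; the function names and the 7-letter threshold for a "long" word are arbitrary assumptions.

```python
import re
from collections import Counter

def word_length_profile(text):
    """Map each word length to the share of words in `text` with that length."""
    words = re.findall(r"[A-Za-z]+", text)  # crude: splits "don't" into two words
    if not words:
        return {}
    counts = Counter(len(w) for w in words)
    return {length: counts[length] / len(words) for length in sorted(counts)}

def long_word_share(text, threshold=7):
    """Share of words with at least `threshold` letters (threshold is arbitrary)."""
    profile = word_length_profile(text)
    return sum(share for length, share in profile.items() if length >= threshold)

print(word_length_profile("the cat sat"))  # {3: 1.0}
print(long_word_share("a considerable undertaking"))
```

Comparing the profiles of two texts would not by itself say anything about Latin versus Germanic vocabulary; that would need an etymological word list, which this sketch does not include.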

~~~
cpeterso
In the 19th century, Lucy Aikin, under the pen name Mary Godolphin, wrote
_Robinson Crusoe In Words Of One Syllable_ and a number of other classics for
children using only monosyllabic words. Apparently there are over 9000
monosyllabic English words, but writing this way is surprisingly hard. It's an
interesting exercise, but reading the books feels like reading a telegram.

[http://collectingchildrensbooks.blogspot.com/2008/04/monosyl...](http://collectingchildrensbooks.blogspot.com/2008/04/monosyllabic-monographs-of-antediluvian.html)

~~~
pbhjpbhj
I wonder how well a machine-automated one-syllable rendering would fare
(compared to the manual ones). You'd need to use a thesaurus limited to single
syllables and ensure the correct meaning was being chosen. Doesn't _sound_ too
hard.

~~~
cpeterso
Interesting idea! Just filter WordNet's synonym sets of 155,287 words with a
list of one-syllable words. (I read there are 9,000+ words, but I can't find a
list online at the moment.)

[https://en.wikipedia.org/wiki/WordNet](https://en.wikipedia.org/wiki/WordNet)
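
A rough sketch of the filtering idea, under heavy assumptions: the tiny `SYNONYMS` table below is a made-up stand-in for WordNet's synonym sets, and `estimate_syllables` is a crude vowel-group heuristic rather than a real one-syllable word list. It also ignores the hard part mentioned above, picking the synonym with the correct meaning.

```python
import re

# Hypothetical stand-in for a real thesaurus such as WordNet.
SYNONYMS = {
    "begin": ["commence", "start", "initiate"],
    "enormous": ["huge", "big", "gigantic", "vast"],
}

def estimate_syllables(word):
    """Crude heuristic: count vowel groups, discounting a trailing silent 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def simplify(sentence):
    """Replace multi-syllable words with a one-syllable synonym when one is known."""
    result = []
    for word in sentence.split():
        if estimate_syllables(word) == 1:
            result.append(word)
            continue
        options = [s for s in SYNONYMS.get(word.lower(), [])
                   if estimate_syllables(s) == 1]
        # Keep the original word when no one-syllable synonym is available.
        result.append(options[0] if options else word)
    return " ".join(result)

print(simplify("we begin the enormous task"))  # we start the huge task
```

Words with no one-syllable synonym pass through unchanged, so a real attempt would still need a human (or something smarter) to rephrase around them.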

------
praptak
Automatic transformation of text to evade these methods seems feasible (Google
Translate back and forth might be the crude first attempt.) Obviously there
might exist more refined methods of identification. In case of a book it is
probably hard not to ruin it this way but reviews, posts and such do not
require such high standards.

~~~
gwern
[http://www.cs.drexel.edu/~sa499/papers/adversarial_stylometr...](http://www.cs.drexel.edu/~sa499/papers/adversarial_stylometry.pdf)
"Adversarial Stylometry: Circumventing Authorship Recognition to Preserve
Privacy and Anonymity" Brennan et al 2013

Machine translation doesn't work at all. Imitation or deliberate changes to
style do - but if Rowling were to do that, she would probably sacrifice a
lot of quality or effort and it would defeat the apparent point of the
experiment (to demonstrate that she wrote good books, as assessed by 'blinded'
reviewers).

------
gtani
[https://news.ycombinator.com/item?id=3613734](https://news.ycombinator.com/item?id=3613734)

this is a tough thing to google for. Terms I used a few weeks ago

\- stylometry

\- authorship attribution/verification

\- grammatical analysis, plagiarism detection

------
hnha
Way too many terms like "proof", "fact", "confirmation", "definitely" later
on. Doesn't something like this _always_ rest on a lot of assumptions and _always_
carry a bias from the samples? Anyone could happen to write like someone
else. Nothing definitively distinguishes one person's writing from another's
the way a fingerprint does (which, as I understand it, is biologically highly
random).

Analysing sites like HN to see indicators(!) for sockpuppets or generally
correlation of likelihood between accounts' writing styles would rock!
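
As a sketch of what such an indicator could look like: compare how often two accounts use a handful of common function words, and score the similarity of the two frequency vectors. The word list and function names here are assumptions chosen for illustration, and a high score would be a hint at most, never proof.

```python
import math
import re
from collections import Counter

# A small, arbitrary set of function words; real studies use far longer lists.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it", "but", "which"]

def profile(text):
    """Relative frequency of each function word in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

account_a = "It is the sort of thing that a lot of people want to believe."
account_b = "But which of the claims in the article is it that you doubt?"
print(cosine_similarity(profile(account_a), profile(account_b)))
```

With only a few comments per account the frequencies are noisy, which is exactly the sample-bias concern raised above.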

------
cliveowen
Is that really a website that uses a normally sized font and doesn't drown me
with ads?

Nah, I must be dreaming.

------
fortepianissimo
All of the statistical analyses sound fairly easy to beat.

Say you want to pretend to be another author: first build a language model of
the target author, then use the model to single out sentences of high
perplexity from your writing. Then, have the model "rewrite" your sentences by
replacing your words with synonyms of higher n-gram probabilities according to
the model. Similar things can be done to fool the character n-gram analyses,
or analyses above words (e.g., parses).
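
The first step of that recipe (flagging the sentences the target author's model finds most surprising) can be sketched with a toy bigram model. Everything here is an assumption for illustration: the class and function names, the add-one smoothing, and the two-sentence "corpus". The synonym-rewriting step is not implemented.

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())

class BigramModel:
    """Toy bigram language model with add-one smoothing."""

    def __init__(self, text):
        tokens = ["<s>"] + tokenize(text)
        self.unigrams = Counter(tokens)
        self.bigrams = Counter(zip(tokens, tokens[1:]))
        self.vocab_size = len(self.unigrams) + 1  # +1 slot for unseen words

    def probability(self, prev, word):
        return (self.bigrams[(prev, word)] + 1) / (self.unigrams[prev] + self.vocab_size)

    def perplexity(self, sentence):
        tokens = ["<s>"] + tokenize(sentence)
        if len(tokens) < 2:
            return float("inf")
        log_prob = sum(math.log(self.probability(p, w))
                       for p, w in zip(tokens, tokens[1:]))
        return math.exp(-log_prob / (len(tokens) - 1))

def rank_by_perplexity(model, sentences):
    """Sentences sorted from most to least surprising under the model."""
    return sorted(sentences, key=model.perplexity, reverse=True)

# Made-up stand-in for the target author's writing.
model = BigramModel("the owl flew over the castle. the owl slept in the castle.")
mine = ["the owl flew", "quantum derivatives hedge"]
print(rank_by_perplexity(model, mine)[0])  # quantum derivatives hedge
```

The sentences at the top of the ranking are the ones that sound least like the target author, so they are the candidates for rewriting.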

~~~
queensnake
All of the /mentioned/ analyses - those of the first program sound like
something you could do in an afternoon, yet it's been worked on for 10 years.
I think there are more, and more sophisticated methods than were described.

------
waterlesscloud
Pretty cool. Interesting too, since Rowling is probably the most imitated
author in the world at the moment. I guess not by published authors, though.

------
georgemcbay
Have there been any instances where "Forensic Linguistics" actually predicted
an outcome that wasn't previously suspected and it turned out to be true? All
of the examples I've heard of are it "confirming" things already suspected by
other means.

Either way it is still an interesting tool and a cool use of technology, but
I'd be a lot more impressed if the software were fed the text to a large
number of random books and it detected an instance (with very high likelihood)
of some famous author writing under a pen name, and then had that confirmed.

------
Nycto
Something similar could probably be done with code (if it hasn't been done
already). I suppose auto-formatting and checkstyles might mute some things,
but I imagine you could still get a read from things like variable names,
class names, function length, etc.
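
A minimal sketch of pulling a few such features out of Python source with the standard `ast` module. The particular features and the name `style_features` are assumptions picked for illustration; whether they carry enough signal to identify an author is an open question.

```python
import ast

def style_features(source):
    """Extract a few crude style measurements from Python source code."""
    tree = ast.parse(source)
    functions = [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]
    # Identifiers used in expressions, plus the function names themselves.
    names = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    names |= {f.name for f in functions}
    lengths = [f.end_lineno - f.lineno + 1 for f in functions]
    return {
        "mean_function_length": sum(lengths) / len(lengths) if lengths else 0.0,
        "mean_name_length": sum(map(len, names)) / len(names) if names else 0.0,
        "snake_case_share": sum("_" in n for n in names) / len(names) if names else 0.0,
    }

SAMPLE = '''
def add_numbers(first_value, second_value):
    total_sum = first_value + second_value
    return total_sum
'''

print(style_features(SAMPLE))
```

Because it works on the parsed tree, whitespace and brace style are ignored, which sidesteps the auto-formatting problem but also throws away some signal.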

~~~
rca
That would be an interesting thing to do. But since code usually has a formal
grammar, and a relatively simple one at that, plus a far more constrained
vocabulary and, like you pointed out, stricter style frameworks, the liberty
of the author is way more limited. It might be enough to say that a piece of
code hasn't been written by someone in particular, or maybe discern an author
among a few coders. But I'd bet it can't be done on the same scale as with
literature.

~~~
troymc
One place where big variations can exist between two programs is in the
comments, where vocabulary and grammar aren't as restricted.

------
MarkMc
I'm curious about the ethics of this. Why is it OK to 'out' someone as the
author of a book, but it's not OK to 'out' someone as gay?

~~~
alan_cx
Authors are not generally persecuted for being authors.

~~~
MarkMc
So it's OK to expose a person's secrets as long as that person isn't harmed?

~~~
danso
Not sure where you're going with this so I'll avoid the obvious snarky
counter-examples...but are you arguing that if someone designates something as
a secret, it is absolutely wrong to uncover it?

~~~
MarkMc
Not sure where I'm going either! I'm just trying to figure out whether I would
publish the story if I was editor of The Sunday Times.

Clearly there are cases where exposing someone's secret is the wrong course of
action (eg. someone being gay).

Clearly there are cases where exposing someone's secret is the right course of
action (eg. a corrupt politician)

What I'm interested in is the middle ground - where exposing a secret does not
harm the 'owner' of the secret, but also does not really benefit the general
public. In such cases shouldn't we respect the wishes of the person who is
keeping the secret?

------
brownbat
s/b "How Forensic Linguistics Confirmed a Leak about Rowling"

------
mnglkhn2
At the same time we can think of the whole thing as a smart marketing plot.

------
alxbrun
I don't buy this 'outed' story for one second.

This is either marketing or fear of public reception of her non-Potter book
(imagine the pressure she must have). Either way, this is crap.

~~~
ralfd
Do you really think J.K.Rowling is under pressure by anyone?

Besides, authors like to write under pen names. Stephen King used Richard
Bachman for that, but was also outed:

[http://en.wikipedia.org/wiki/Richard_Bachman](http://en.wikipedia.org/wiki/Richard_Bachman)

> He says he deliberately released the Bachman novels with as little marketing
> presence as possible and did his best to "load the dice against" Bachman.
> King concludes that he has yet to find an answer to the "talent versus luck"
> question, as he felt that he was outed as Bachman too early to know.

~~~
alxbrun
Yes, I think that after a huge success (such as Harry Potter), artists are
under tremendous pressure to release something greater or at least equivalent.
The same is true for startups btw.

Anyway, if you want to believe what marketers want you to believe, that's fine
with me, I won't downvote you for that.

