

Visualizing Tolkien - josem
http://5013.es/p/1/

======
jmilloy
>One of the reviewers argued that it was the hardest book to read because
'and' was the most used word in the book.

I think the author took this way too literally. Granted, it inspired some fun
(albeit odd and basic) data analysis. But the point of the reviewer isn't
about word counts but about style and pace. That, narratively, the
Silmarillion just felt like it continued on and on without clear sections of
rising and falling action or significant breaks. Sentences were long and
atmospheric, rather than short, quick and active.

~~~
danielbarla
I haven't gotten through the whole book, but it's also written in a style
which makes it hard to immerse yourself in it. Characters aren't described in
detail, major events are described in a few words, and entire lifetimes are
glossed over... Certainly epic and at times quite beautiful, but completely
different to LoTR.

~~~
ajuc
Silmarillion is just meant to be read in pieces, I think. It's very similar
to the Old Testament, and I don't know a single person that read the Old
Testament in one sitting.

------
scrumper
Where are the sentence length statistics? Average word length? Commas per
sentence? All are surely much more indicative of reading difficulty than that
"originality index" he came up with.

Why didn't the author try the most obvious test of reading difficulty of all:
Flesch-Kincaid? [1]

[1]
<http://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_test>

~~~
sole
Hi, author here.

First, it's she, not he.

Second, because I hadn't heard about that obvious test at all back then. I
never meant this to be a super-serious scientific analysis; I just wanted to
answer the questions that came to my mind, using a computer to test my
hypotheses.

Many thanks for the pointers & suggestions, though! I was thinking about
rebuilding this to make it realtime+interactive so that more than one text
could be analysed/visualised, so I'll make sure to introduce the new tests.

~~~
scrumper
Hi author, I really liked the visualisations, especially the black hole thing.

I think I may have misunderstood the purpose of your article: I first read it
as a serious attempt to use textual analysis to do a comparison of the
comprehensibility of Tolkien's best-known works, in which case you fell way
short of the mark by going no further than word counting. I think it started
off like that, but on re-reading it I see that you say you hit a brick wall
(paraphrasing) and decided to have some fun visualising your results so far;
something you did very nicely. So perhaps I got the wrong end of the stick.

I thought about what you've taken on here. Doing things at the word level is
pretty easy. Taking into account grammatical structure to get things like
sentence lengths and clauses per sentence, or breaking words up into syllables
(needed for F-K, for example) is considerably harder. I'm interested to hear
what you come up with. Flesch & Flesch-Kincaid are US inventions so perhaps
not obvious to everyone.

On he/she: I agonised over that for a good ten minutes over my breakfast. I
read around your blog and your twitter page this morning for clues as to the
appropriate pronoun. In the end I couldn't tell, so I went with 'he' because I
stereotyped you. I nearly changed my wording to use constructions like "the
author," "they," and various other mealy-mouthed alternatives but they were
too ugly. So I did try, but I got it wrong. I apologise.

~~~
Evbn
Isn't FK determined almost completely by sentence length? That is what I
recall from messing with MS Word docs in high school.

~~~
scrumper
FK combines average sentence length (total words/total sentences), average
syllables per word and some fixed coefficients to come up with an equivalent
school grade.

Flesch Reading Ease score, which is what I actually meant, does the same but
with different coefficients to come up with a more granular difficulty score,
usually in the range of 30-100.

They're both pretty arbitrary. The more I read up on this subject the more
respect I have for the author's own attempts at an originality score. It's all
subjective ultimately.
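
To make the two formulas concrete, here is a minimal Python sketch using the standard published coefficients for Flesch Reading Ease and Flesch-Kincaid Grade Level. The syllable counter is a crude vowel-group heuristic and the sentence splitter just breaks on `.`, `!` and `?`; both are assumptions for illustration, and the names `count_syllables` and `readability` are made up here.

```python
import re

def count_syllables(word: str) -> int:
    """Rough heuristic (an assumption): one syllable per run of vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Treat a trailing 'e' as silent ("made"), except in "-le" endings ("table").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)

def readability(text: str) -> dict:
    """Return both Flesch scores for a piece of English text."""
    # Naive sentence split on terminal punctuation; abbreviations will fool it.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # ASCII letters only, so accented names are split into pieces.
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = sum(count_syllables(w) for w in words) / len(words)
    return {
        "flesch_reading_ease":
            206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word,
        "flesch_kincaid_grade":
            0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59,
    }
```

Both scores use the same two inputs and differ only in the coefficients, so the quality of the syllable counter probably matters more than the choice of formula.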

------
wizard_2
The painstaking lengths people go to in honor of Tolkien are amazing. Along
those lines I think <http://3rin.gs/> deserves a mention, especially as maps
are a "visualization" of sorts. I met the author this summer. The attention to
detail is staggering.

------
bane
Great post, I love the thinking out loud thought process portions.

Some suggestions: try removing stopwords (the, and, etc.), it'll bring out
more variation on the analysis. Particularly in the circular graphs.

Try unique n-gram analysis, I suspect that will show something interesting. At
the very least 2-gram (bigram/digram) analysis might show something cool.
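
A rough Python sketch of those two suggestions, assuming the text is already loaded as a string. The stopword list here is a tiny hand-picked placeholder (real lists are much longer), and the function names are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); swap in a fuller one.
STOPWORDS = {"the", "and", "of", "a", "to", "in", "that", "was", "he", "it"}

def tokenize(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def content_words(tokens):
    """Drop stopwords so the remaining counts show more variation."""
    return [t for t in tokens if t not in STOPWORDS]

def bigram_counts(tokens):
    """Count each pair of adjacent tokens."""
    return Counter(zip(tokens, tokens[1:]))
```

For example, `bigram_counts(tokenize(text)).most_common(20)` would list the most frequent word pairs. Filtering stopwords before building bigrams makes words adjacent that were not adjacent in the original text, so it may be worth trying both orders.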

Some other interesting comparative measures, since the author already has all
of the tokens, try the Jaccard index between book pairs to look for
similarity. (may even want to break the books down into major sections: e.g.
LOTR can be viewed as both 3 and 6 books.)
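
For reference, the Jaccard index is the size of the intersection of two vocabularies divided by the size of their union. A minimal Python sketch, assuming each book is already available as a list of tokens (the function name is made up here):

```python
def jaccard_index(tokens_a, tokens_b):
    """Vocabulary overlap between two token lists, from 0.0 to 1.0."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        # Two empty vocabularies: return 0.0 by convention (an assumption).
        return 0.0
    return len(set_a & set_b) / len(union)
```

This ignores word frequency entirely, so two books of very different lengths can look less similar than they are, simply because the longer one has had more room to use rare words.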

Some others to try: sentence length, word length, distribution of tf-idf
scores, etc.

fun fun!

~~~
sole
Yes--maybe I am thinking out 'too loudly' sometimes, but I think it's
interesting since that explains how one can come to a conclusion, and maybe if
it is wrong, it can be corrected as you know how the conclusion was reached.
It's like you can "debug" a thought process, in a way, as you have a "trace"
:-)

And thanks for all the suggestions! I've made a note of them all. Good to hear
from people who know more about the topic than me.

------
jre
I think the author should have filtered out stopwords. The big "The" in the
middle of the visualization seems kind of obvious. It would be much more
interesting to see the words that make this text different from other English
texts.

~~~
vital101
I did some work while still in college doing analysis of texts from Project
Gutenberg. Removing the stop words made the analysis far more interesting.

------
qznc
Why does everyone find the Silmarillion hard to read? Am I the only one who
had no problems with that style?

~~~
DeepDuh
I was thinking the same. I never had problems with Silmarillion. Couldn't read
Messages from Middle Earth however, that was just not enough connection. But
Silmarillion, man, that's an _epic_ collection of stories. I think it might be
a cool material for a TV series. Say, one season for every major story with
the same characters, plus some creative freedom for the screenwriters of
course.

Just think about it - battles with hordes of orcs, hero elves AND 30 or so
Balrogs. And losing / winning a battle doesn't just blow up a tower, it
creates _spasms in damn middle earth_. The ring wars got nothing on that..

~~~
pooriaazimi
Later parts of the book might be good (or even great) in movie format, but
there's _no way_ one can turn Ainulindalë into a movie... Or can they? For
years I was passionately against the idea, but now that I think about it
again, maybe I'm just being over-pessimistic? I mean, LOTR turned out to be
quite good...

~~~
DeepDuh
IMO quite good is an understatement, it has become the best any LotR fan could
have ever hoped for. And about Silmarillion: Think about how good some TV
screenwriters have become in recent years - IMO they turn out more high
quality TV scripts than movies nowadays, mostly because the big budget movie
industry needs to always take the safe bets (and so you get stupid flicks like
Battleship et al.)

------
andreasvc
I do research in this topic. While word level features are readily available
and provide some interesting insights, I'm more interested in syntactic
phenomena (e.g., recurring phrases). But in the case of The Silmarillion the
"reading difficulty" could be beyond even that; I suspect it might be because
it doesn't build up suspense the way the traditional narrative of a novel
does. Unfortunately, that might be hard to capture in a programmable test.

------
5xz41s0P8T5N
"Visualizing English Text" seems more accurate a title. Cute process, but the
result is entirely generic.

~~~
ctdonath
Agreed. It's just applying some standard (albeit amusing) tools to a
particular text (amounting to 1 book in 5 parts) and getting results no
different from any other text. Word clouds etc have been done, but all still
result in pretty much the same sort of textual haze, with nothing inspiring a
unique view of a unique tome.

Mapping semantic associations would be more interesting. Something akin to
<http://xkcd.com/657/> generated by the in-text juxtaposition of names &
related verbs.

------
mcguire
" _I wrote a simple program who counted how many times did each word appear in
the [_ The Silmarillion _]._ "

Out of curiosity, where did the author get a copy of the text to analyze?

"Turin" is the only name in _The Silmarillion_ graph? Interesting.

~~~
sole
Author here.

I got them in txt files, online. I own the original books too. I would have
typed them in if I had all the time of the world, of course.

------
msurel
While I find this somewhat fascinating, I think this kind of analysis would be
similar to trying to figure out how to make great food by examining the
molecular contents of the final product.

------
aw3c2
I almost closed the tab right away because the text only started below my
fold. The 4 graphics fit the screen perfectly so I assumed that was all. Make
sure you scroll!

------
njx
You need to remove the stop words before visualizing them. Words such as 'the'
'and' etc don't need to be in the analysis

~~~
sole
They are included because they were the reason people gave when asked why "The
Silmarillion" was so unreadable.

(I'm the author)

~~~
crpatino
The problem with stop words is that they tend to be the most common words in
_every_ piece of text[1], regardless of genre.

So, in order to test the hypothesis "Silmarillion is harder to read because it
has lots of stop words", you need to calculate the relative frequencies of
lots of other texts and see if there's something special about Silmarillion's
top 10 versus all the others'.

Surely, you have already done that using LOTR and The_Hobbit, but a much
bigger sample is needed. At the very least, you may want to use 10-15 other
works of fantasy from different authors, and that will be just like a
back-of-the-envelope test to see if it is worth pursuing this experiment with
a statistically significant sample.

[edit] 1. Provided it is sufficiently large.
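
The comparison described above boils down to relative frequencies: a word's count divided by the total number of tokens, computed the same way for each text. A minimal Python sketch, where the function names and the simple regex tokenizer are assumptions for illustration:

```python
import re
from collections import Counter

def _tokens(text):
    """Lowercase the text and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def relative_frequencies(text, top_n=10):
    """Map each of the top_n most common words to its share of all tokens."""
    tokens = _tokens(text)
    total = len(tokens)
    return {word: count / total
            for word, count in Counter(tokens).most_common(top_n)}

def word_share(text, word):
    """Share of all tokens in the text that are exactly `word`."""
    tokens = _tokens(text)
    return tokens.count(word.lower()) / len(tokens) if tokens else 0.0
```

Running `word_share(book, "and")` over each book in the sample gives one number per text, which is enough to see whether the Silmarillion stands out before investing in a larger study.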

------
caycep
ok great. but will you read the damn book already?!?!

;)

------
hastur
rather: Wasting time and attention on useless statistics and pointless
visualizations.

~~~
sdoering
Says the one with a nickname from mythology/fantasy (Darkover-Cycle)... ;-)

~~~
hastur
I'm not sure what you mean. I was criticizing the viz, not the literature.

Incidentally, I never read anything by Lovecraft, but am a big fan of Tolkien.
I came up with this nickname on the spot.

~~~
hsmyers
Err---that would be Marion Zimmer Bradley (see:
<http://en.wikipedia.org/wiki/Marion_Zimmer_Bradley>), not Lovecraft. It's
possible she got it from "The King in Yellow" usually referred to as 'The
Yellow Book', but I don't know for sure. See:
<http://en.wikipedia.org/wiki/The_King_in_Yellow>

