
Do 20 pages of a book gives you 90% of its words? - kiechu
https://blog.vocapouch.com/do-20-pages-of-a-book-gives-you-90-of-its-words-795a405afe70
======
imron
Although it sounds high, recognising 90% of words makes for a pretty horrible
reading experience.

That's 1 word in 10 that you don't know (1-2 words per sentence), or assuming
as you did in that post a page length of 300 words, then it's 30 new words a
page.

I recently wrote an article discussing the same phenomenon in Chinese [0],
where to get a reasonable level of new characters (e.g. no more than 1 a page)
you'd need to know 99.8% of the text on any page.

And the level of recognition required to be able to recognise and learn new
words completely from context is about 98%. [1]

0: [https://www.chinesethehardway.com/article/hsk-6-gets-you-
hal...](https://www.chinesethehardway.com/article/hsk-6-gets-you-halfway/)

1:
[https://www.youtube.com/watch?v=JbYMZZISPrU](https://www.youtube.com/watch?v=JbYMZZISPrU)
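
The arithmetic in this comment can be sketched in a few lines. This is only an illustration: the function name is made up here, and the 300-word page is the assumption taken from the post.

```python
def unknown_words_per_page(coverage_pct, words_per_page=300):
    """Expected number of unknown words on a page, given the percentage
    of running words the reader recognises.

    words_per_page=300 is the page length assumed in the post.
    """
    return words_per_page * (100 - coverage_pct) / 100


# 90% recognition: 1 word in 10 is unknown.
print(unknown_words_per_page(90))  # 30.0
# 98% recognition, the level cited for learning from context.
print(unknown_words_per_page(98))  # 6.0
```

So moving from 90% to 98% recognition cuts the unknown words on such a page from 30 to 6.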

~~~
jaclaz
I know nothing of Chinese, but in western languages, like English, a number of
words may be "unique" when counted by a simple algorithm, (even if - as the
author did - words were reduced to their "basic form" thus deduplicating a lot
of slightly different forms) but often you can get the meaning of the word by
the context and by similarities with other known words, so the 90% percentage
while actually meaning that you don't know 1 word every 10, does not directly
mean that you cannot understand 1 word every 10 or that your reading
experience is so horrible.

Every reader attempting to learn a new language goes through that odd phase
where he/she can manage to understand the overall meaning of a sentence even
if there is one or two "holes" in it, and actually it is part of the learning
process.

~~~
imron
Although you might be able to pick up the meaning or get the gist of the
occasional word at 1 in 10, the video I linked to in [1] above makes a
convincing case that the rate at which unknown words stop being a hindrance to
understanding and can be picked up from context is around 98%

~~~
jaclaz
That 98% is most probably an accurate number by the professor's metrics, and I
was not commenting on the video (which I didn't watch). I was only relating my
own experience: it isn't so horrible, and I am still in that phase with more
than one language where I scarcely reach 70 or (maybe) 80%.

------
pealco
This doesn't really address your teacher's claim about having to look words
up, though. What you want to look at is the distribution of low frequency
words across the book. What do the plots look like when you remove proper
nouns, functional words (e.g., "the", "and", prepositions) and, say, the top
1000 most frequent words in English?

~~~
anon1094
Would be very interesting to see this applied to blogs in different categories
to rapidly learn languages through reading based on the words that you
currently know and the most frequent words in that language. So it would
always present you with the article that suits your level and you would have
the benefit of learning the most new words.
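
A rough sketch of how such a recommender could choose an article, assuming each article is already tokenized and normalized. The function name, the `min_coverage` threshold and the selection rule (most distinct new words among the readable articles) are all assumptions made for illustration, not something described in the post.

```python
def pick_article(articles, known_words, min_coverage=0.9):
    """Pick the article with the most distinct unknown words among those
    where the reader already knows at least `min_coverage` of the tokens.

    articles: dict mapping title -> list of word tokens (hypothetical input).
    known_words: set of words the reader already knows.
    Returns the chosen title, or None if no article is readable enough.
    """
    best_title, best_new = None, -1
    for title, tokens in articles.items():
        if not tokens:
            continue
        known = sum(1 for t in tokens if t in known_words)
        if known / len(tokens) < min_coverage:
            continue  # too many unknown words for this reader
        new_words = {t for t in tokens if t not in known_words}
        if len(new_words) > best_new:
            best_title, best_new = title, len(new_words)
    return best_title
```

A real system would probably also weight the new words by how frequent they are in the language, as the comment suggests.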

~~~
_asummers
Also would be interesting to see it applied to newspapers, with obvious slices
like particular author, section (sports v world news etc) distribution year to
year, and which paper. TV news broadcasting could also be interesting to
compare by the same dimensions, though the conversational style in some
interview shows would possibly make this less telling.

------
twoodfin
FWIW, _Ulysses_ isn't particularly incomprehensible. To the extent that it's
difficult to read, it's much more the shifting narrative perspective, widely
ranging references, and stream-of-consciousness rather than the vocabulary.

Take this typical section from the "Lotus Eaters" chapter, wherein Mr. Bloom
is contemplating the origins of the wares in a tea shop:

 _So warm. His right hand once more more slowly went over again: choice blend,
made of the finest Ceylon brands. The far east. Lovely spot it must be: the
garden of the world, big lazy leaves to float about on, cactuses, flowery
meads, snaky lianas they call them. Wonder is it like that. Those Cinghalese
lobbing around in the sun, in dolce far niente. Not doing a hand's turn all
day. Sleep six months out of twelve. Too hot to quarrel. Influence of the
climate. Lethargy. Flowers of idleness. The air feeds most. Azotes. Hothouse
in Botanic gardens. Sensitive plants. Waterlilies. Petals too tired to.
Sleeping sickness in the air._

Hard to be too confused by the imagery and mood in this passage.

Now, _Finnegans Wake_...

~~~
kiechu
I will run Finnegans Wake in a moment and I will get back with a response. I
must find it in a text format.

~~~
kawera
Have you tried non-fiction books?

~~~
kiechu
What do you have in mind?

~~~
samstave
The Bible.

(Just kidding)

What about having this read a tweet history, say that of a POTUS?

~~~
kiechu
From what I see, POTUS stays within the 1000 most basic words.

~~~
kiechu
According to bill you posted:

Number of Pages: 217
Number of Total Words: 65396
Number of Unique Words: 3106

You will know 90% of words after 28 pages which are 12.90% of the book. At
that page, you will know 36.77% of unique words.

The graph is less regular but it has more or less same shape. I will not
publish this part because it is not a book.
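
For readers who want to reproduce numbers of this kind, here is a minimal sketch of one way to compute them. It is not necessarily the method used in the post: it assumes that "knowing 90% of words" means the distinct words met so far account for 90% of all word tokens in the whole book, and that the text is already split into pages of normalized tokens. The function name is made up.

```python
from collections import Counter


def pages_to_reach_coverage(pages, target=0.90):
    """Find the first page after which the words seen so far cover at
    least `target` of all tokens in the book.

    pages: list of pages, each a list of normalized word tokens.
    Returns (page_number, token_coverage, share_of_unique_words_seen),
    or None if the book is empty or the target is never reached.
    """
    totals = Counter(word for page in pages for word in page)
    total_tokens = sum(totals.values())
    if total_tokens == 0:
        return None

    seen = set()
    covered_tokens = 0
    for number, page in enumerate(pages, start=1):
        for word in page:
            if word not in seen:
                seen.add(word)
                # Every occurrence of this word in the book is now "known".
                covered_tokens += totals[word]
        coverage = covered_tokens / total_tokens
        if coverage >= target:
            return number, coverage, len(seen) / len(totals)
    return None
```

The page number divided by the page count then gives the "percent of the book" figure, for example 28 / 217 for the 12.90% quoted above.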

------
kabdib
My mom, an English teacher, once went through my library of science fiction
and analyzed it for reading level. I had the usual collection: Lots of
Heinlein, Asimov, Niven, Andre Norton, etc.

Her assessment: Most of the material was about 8th grade level, based on word
count.

From time to time I re-read one of those books, and run across pages where she
had penciled-in notations and underlined words.

------
loeg
> we turned words to their basic forms (went to go, cars to car, jumps to jump
> etc.)

FYI, this is called stemming.
[https://en.wikipedia.org/wiki/Stemming](https://en.wikipedia.org/wiki/Stemming)

~~~
elchief
or contextually, lemmatization

~~~
gattilorenz
That's correct, it's lemmatization. Stemming does not reduce "went" to "go"
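
A toy illustration of the difference, using the examples quoted from the post. Both functions are deliberately simplistic stand-ins written for this example: real stemmers apply many more suffix rules, and real lemmatizers rely on a full dictionary and usually on part-of-speech information.

```python
def toy_stem(word):
    """Strip a few common suffixes. Purely rule-based, no dictionary."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three letters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical miniature lemma dictionary, for illustration only.
TOY_LEMMAS = {"went": "go", "cars": "car", "jumps": "jump"}


def toy_lemmatize(word):
    """Look the word up in a dictionary of base forms."""
    return TOY_LEMMAS.get(word, word)


for w in ("went", "cars", "jumps"):
    print(w, toy_stem(w), toy_lemmatize(w))
```

Suffix stripping handles "cars" and "jumps", but an irregular form like "went" only maps to "go" through a dictionary lookup.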

------
dri_ft
For the record, Ulysses is at least a full order of magnitude more
comprehensible than Joyce's next book, Finnegans Wake.

I'd also expect it to give a skewed response on a test of this kind because it
is composed of a number of different sections, which vary considerably in
their style. But maybe that's the point of including it.

~~~
kiechu
Here are Finnegans Wake graphs. It is indeed even more complicated.
[https://github.com/vocapouch/vocapouch-
research/blob/master/...](https://github.com/vocapouch/vocapouch-
research/blob/master/..).

Number of Pages: 729
Number of Total Words: 218793
Number of Unique Words: 50872

You will know 90% of words after 387 pages which are 53.09% of the book. At
that page, you will know 60.64% of unique words.

------
prashnts
I think their teacher was referring to the Zipfian distribution[0]. I've seen
this distribution hold on the Wikipedia corpus as well. Of course, it's
empirical.

[0]:
[https://en.wikipedia.org/wiki/Zipf%27s_law](https://en.wikipedia.org/wiki/Zipf%27s_law)

------
jaclaz
A nice, interesting idea, and experiment, thanks.

Not so casually the blue lines remind me of the one in the graph for the
birthday problem:

[https://en.wikipedia.org/wiki/Birthday_problem](https://en.wikipedia.org/wiki/Birthday_problem)

------
bryanrasmussen
The use of Eve's Diary doesn't make any sense here; of course the distribution
of words in a short story is going to be longer than in a book.

Ulysses is fair, but I would expect it and works of a similar caliber to be
outliers.

~~~
kiechu
It is MythBusters-style science. The goal was to see how it works with short
and long books, and with one reputed to be an easy read and one reputed to be
a hard read. It would be interesting to see it on a larger population, with
more statistics involved.

------
kazinator
As little as one character of almost any document will usually give you 100%
of the binary symbols 0 and 1. Usually, the first character will do this,
after which the rest of it is just mindless repetition.

------
nl
This is good, interesting work. I wonder what the difference between stemming
and lemmatization shows?

Edit: I see you are doing lemmatization now. Did you try just stemming?

------
Finch2192
This doesn't seem all that groundbreaking, it's just an instance of Zipf's law
in action, is it not?

~~~
kiechu
Yes, that's Zipf's law applied. I doubt that many language learners knew about
this law. I think it is still worth pointing out, that when you go through the
beginning of the book, reading will become rapidly easier.
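
A small sketch of why this happens under an idealized Zipf distribution, where the word of rank r occurs with frequency proportional to 1 / r^s. The function name and the default exponent s = 1 are assumptions for illustration; real books only follow the law approximately.

```python
def zipf_coverage(top_k, vocabulary_size, exponent=1.0):
    """Share of all word tokens accounted for by the `top_k` most
    frequent words, if frequencies follow an ideal Zipf distribution
    over `vocabulary_size` distinct words.
    """
    weights = [1 / rank ** exponent for rank in range(1, vocabulary_size + 1)]
    return sum(weights[:top_k]) / sum(weights)


# With only two distinct words, the top one covers 1 / (1 + 1/2) of tokens.
print(zipf_coverage(1, 2))
```

Because the weights fall off so quickly with rank, a small share of the vocabulary accounts for a large share of the running text, which is the effect the graphs in the post show.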

~~~
twoodfin
It'd be an interesting exercise in Modernist writing to try producing a book
that violates Zipf's law, say by hashing all but the most common few hundred
words into chapter buckets.
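
One way the bucketing could work, as a hedged sketch: each word outside the common list is assigned to exactly one chapter by a stable hash, so it may only be used there. The function name, the choice of SHA-256 and the `common_words` parameter are assumptions for illustration.

```python
import hashlib


def chapter_for_word(word, num_chapters, common_words):
    """Return the only chapter (0-based) in which `word` may appear,
    or None if the word is common enough to be allowed everywhere.
    """
    if word in common_words:
        return None
    # hashlib gives the same result on every run, unlike built-in hash().
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_chapters
```

Under this scheme every rare word is confined to a single chapter, so the vocabulary a reader builds up in early chapters helps only with the common words later on.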

------
ihaveajob
I bet this is not true for the Encyclopedia Britannica, by design.

------
js8
I think this is a very useful idea - it could be used to "rate" the books for
English learners to see how difficult they are.

------
al452
"incomprehensibility"

~~~
kiechu
Fixed. Thank you!

------
zeep
90% of the words is not 90% of the meaning... but I get your point.

------
flavio81
Yes, if the book is 22 pages long!

------
oconnor0
Not if it's a dictionary!

~~~
kiechu
Or a phone book.

------
rfrank
I wonder how Pale Fire by Nabokov would look after this sort of analysis. For
the unfamiliar, per wikipedia, "Starting with the table of contents, Pale Fire
looks like the publication of a 999-line poem in four cantos ("Pale Fire") by
the fictional John Shade with a Foreword, extensive Commentary, and Index by
his self-appointed editor, Charles Kinbote. Kinbote's Commentary takes the
form of notes to various numbered lines of the poem. Here and in the rest of
his critical apparatus, Kinbote explicates the poem surprisingly little.
Focusing instead on his own concerns, he divulges what proves to be the plot
piece by piece, some of which can be connected by following the many cross-
references. Espen Aarseth noted that Pale Fire "can be read either
unicursally, straight through, or multicursally, jumping between the comments
and the poem."[4] Thus although the narration is non-linear and
multidimensional, the reader can still choose to read the novel in a linear
manner without risking misinterpretation."

~~~
s_kilk
Huh, sounds a little like House Of Leaves, which has a similarly weird
structure.

I'll have to check out Pale Fire.

~~~
rfrank
Ah nice, I need to do the same with House of Leaves, I'm a big fan of stories
with unconventional structuring. Sometimes a Great Notion by Kesey is my
favorite; it's told from multiple first-person perspectives that shift pretty
rapidly, where the shifts are indicated by having a particular speaker's text
italicized, in parentheses, with no formatting applied, etc. It's pretty neat.

