

The Long Tail of the English Language - impostervt
http://blog.wordsapi.com/2015/01/the-long-tail-of-english-language.html

======
madaxe_again
Now I understand somewhat better why I often end up ceasing midway through
discourse, as my colloquists often end up interjecting requesting definition
of whatever neologism has just emanated from lexicon into conversation.... a
random sampling of words that I have just today been asked to define either do
not feature in this list at all, or are 90,000+ in terms of use.
Brobdingnagian, deleterious, autodidact, loquaciousness... are these really
such strange vernacular?

I suppose this is what happens when you spend your formative years buried in
literature - it probably doesn't help that I mispronounce all sorts, as books
make poor elocutionists.

I do worry that as we further and further consolidate our vocabulary that we
lose the breadth and depth of thought that nuanced words provide... so did
Orwell...

~~~
delluminatus
Ah, another person who talks like a book. I also get that quite regularly. I
have a bit of a personal rule, though, where I never use a more complicated
word when a simpler one would work. After all, conversation is at heart a
means of communication.

Although we might get a bit of a thrill from talking over people's heads, it
doesn't really mean much of anything. And all it takes is one misuse -- for
example, calling Brobdingnagian a neologism -- and we tear ourselves down more
than we could hope to build ourselves up.

~~~
madaxe_again
Brobdingnagian is a neologism though. Swift coined it scarcely two centuries
ago, which is a ... Lilliputian (sorry)... amount of time in linguistic terms.
I suppose it also depends on what you consider to be contemporaneous. In my
case, this era, which opened with the Enlightenment...

Agreed - never use a complicated word when a simple one will do - but you
either have to turn verbose or lose nuance, as we're not talking synonyms, but
words which have meanings distinct from some of their more commonplace
counterparts.

No thrill, I just hate taking the colour out of language, and I like to share
words far and wide, so they don't drift into oblivion, along with the thought
and meaning behind them.

~~~
delluminatus
Oh well. I guess if you want "neologism" to mean "any word invented in the
past 300 years", that's your prerogative, but I take a very different meaning.
One interesting side-effect of that interpretation is that "neologism" becomes
autological, since it was itself coined in that period.

It's true that there are sometimes nuances lost in the conversion from large
to small, but there is far more nuance lost once your partner doesn't know
what the hell you are saying. For you, Brobdingnagian might have some special
nuance (I'm picking on that word because it is ridiculous to me), but to your
partner it has no nuance or flavor -- they have no associations with that
word, nor any grasp of any subtleties of meaning -- in fact they don't grasp
the meaning at all. So if you are trying to get something across, it's simply
a bad way to do it.

I mean, don't get me wrong -- I don't want to deride big-word-users. If you
want to do so because you like it -- that's obviously fine. Also, if someone
does happen to have a thing with Brobdingnagian, I guess you might have just
made their day a lot better. But at some point, you have to acknowledge that
obscure vocabulary is more of a hindrance than a boon to communication.

------
CapitalistCartr
I wonder how other languages compare to English. I know English is far from
pure.

"The problem with defending the purity of the English language is that English
is about as pure as a cribhouse whore. We don't just borrow words; on
occasion, English has pursued other languages down alleyways to beat them
unconscious and rifle their pockets for new vocabulary." \-- James Nicoll

~~~
hellrich
What do you mean by "pure"? Lack of loanwords? Lack of any linguistic changes
(e.g., sound, meaning) for existing words? Lack of any innovation, i.e. no new
words for new concepts/objects? You could try to find some of these things in
languages spoken by rather isolated groups. Yet I don't think one should call
such a language "pure", implying some kind of (moral) superiority.

~~~
laichzeit0
Compared to Classical Latin for instance, I can sort of intuitively see how a
"purity" comparison might make some sense. The grammar is more rich and
precise, suffixes and prefixes attached to primitive roots are more regular in
terms of meaning. Meaning is very decoupled from word order, which allows much
richer possibilities for rhyme and prose. If the restored pronunciation is
anything to go by, then the pronunciation is very regular. Then again I'm not
a linguist.

I would say it's like arguing whether Python is more "pure" than Perl. They're
both Turing complete, they express very similar concepts but I'd wager most
people would concede that Python feels "purer".

~~~
deciplex
Isn't the 'purity' of Latin due more to the fact that we just don't have much
data on the languages that informed it? I'm sure Latin borrowed heaps of words
and grammar from other languages, but those languages are mostly lost to
history now, since the people who spoke them didn't conquer and hold the
Mediterranean for one thousand years.

------
japaget
The source material for this frequency count comes from Open Subtitles
[http://www.opensubtitles.org/en/search](http://www.opensubtitles.org/en/search).
Hence the frequencies here apply to spoken English, not written English. In
written English the three most common words are "the", "of" and "and", whereas
here they are "you", "I", and "the".

------
te_platt
If you have the right kind of friends you can play "Who knows the most obscure
word?". Everyone picks a word from memory, then you check each word's position
(we used to use Google Ngrams, but this would work well); whoever gets the
least common word wins that round. Repeat until it's not fun anymore. That's
how I learned defenestrate, obsequious, and a few other words.

~~~
Terr_
I love "sesquipedalian", because it's so self-referential.

~~~
vorg
I love "non-self-referential" because if I use it enough times with all the
correct stresses and pauses, I can build a computer.

------
dzdt
For a corpus much more complete on the long tail, try the Google Books Ngram
Viewer: [https://books.google.com/ngrams](https://books.google.com/ngrams).
That uses full data from Google's book-scanning endeavors, millions of books
compared to millions of words in the originally linked article.

~~~
madaxe_again
I love to play with this, and see how thought waxes and wanes - a good one is
to stick in every Soviet premier since Lenin, or 20th c. US presidents -
really tells you rather a bit about the mindshare that these individuals had.

------
jasode
[http://en.wikipedia.org/wiki/Zipf%27s_law](http://en.wikipedia.org/wiki/Zipf%27s_law)

EDIT ADD: Zipf's Law was not mentioned directly in the blog post but the reply
clarified that the API returns a Zipf score. However, a word's Zipf ranking is
_dependent on the corpus used_. The Words API "About" page[1] says most data
came from Princeton WordNet but a sibling comment says it came from a
subtitles compilation. If the project could clarify the data sources, it would
be helpful.

[1][https://www.wordsapi.com/about](https://www.wordsapi.com/about)

~~~
impostervt
The "frequency" score returned by Words API is a Zipf score for the word. It
ranges from ~1.6 to ~7.6.

Regarding your update - I'll update the About page.
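
For anyone unfamiliar with the scale: as it is commonly defined, a Zipf score is the base-10 logarithm of a word's occurrences per billion words of corpus. The sketch below assumes that definition; the thread does not confirm exactly how Words API computes its value, and the function name and counts are made up for illustration.

```python
import math

def zipf_score(count: int, corpus_size: int) -> float:
    """Zipf scale: log10 of occurrences per billion words.

    Assumes the common definition of the scale; whether Words API
    computes its "frequency" field exactly this way is not confirmed.
    """
    if count <= 0 or corpus_size <= 0:
        raise ValueError("count and corpus_size must be positive")
    per_billion = count / corpus_size * 1e9
    return math.log10(per_billion)

# Hypothetical counts in a hypothetical 100-million-word corpus.
corpus_size = 100_000_000
for word, count in [("the", 4_000_000), ("tremulous", 4)]:
    print(word, round(zipf_score(count, corpus_size), 2))
```

Under these invented counts, a word seen 4 times in 100 million lands near the bottom of the quoted range and a word making up 4% of the corpus lands near the top. The same word will score differently on a different corpus.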

------
uberalex
'I' doesn't seem to work. It is very common.
[https://books.google.com/ngrams/graph?content=I%2C+you&year_...](https://books.google.com/ngrams/graph?content=I%2C+you&year_start=1800&year_end=2000&corpus=15&smoothing=3&share=&direct_url=t1%3B%2CI%3B%2Cc0%3B.t1%3B%2Cyou%3B%2Cc0)

~~~
copsarebastards
Try lower-casing it.

EDIT: This looks like a case of a common programming antipattern: you don't
care about the casing for comparison purposes, so instead of implementing a
case-insensitive compare, you downcase the strings and call it a day. But
that's inherently a loss of data, and not having that data will eventually
come back to bite you.
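
A minimal sketch of the alternative, in Python: group counts under a case-insensitive key but keep every original spelling, so "I" and "i" match on lookup without the casing being thrown away. The function names and sample tokens are hypothetical, and this is not a claim about how the site or Google's viewer is implemented.

```python
from collections import Counter, defaultdict

def count_case_insensitive(tokens):
    """Count tokens by casefolded key, remembering each original spelling."""
    variants = defaultdict(Counter)
    for token in tokens:
        variants[token.casefold()][token] += 1
    return variants

tokens = ["I", "you", "i", "You", "I", "polish", "Polish"]
counts = count_case_insensitive(tokens)

def total(word):
    """Case-insensitive total, however the query itself is cased."""
    return sum(counts[word.casefold()].values())

print(total("I"))              # all spellings of "i" combined
print(dict(counts["polish"]))  # the original casings are still available
```

Keeping the per-spelling counts costs a little memory, but it means "polish" and "Polish" can still be told apart later if that turns out to matter.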

------
mrfusion
So I'm wondering, if you just learned the 200 most popular words, you might
get pretty far in learning a new language, no?

~~~
maaku
200? No. Try 2,000. That should be representative of a barely usable level of
the language. If you are what might be described as fluent, you're probably at
>20,000. Take this headline:

JACUZZIS FOUND EFFECTIVE IN TREATING PHLEBITIS

"EFFECTIVE" and "IN" are the only two words found in the first 2,000 words
sorted by frequency, although "FOUND" is close. So with only 200 words you'd
understand:

WNPHMMVF SBHAQ RSSRPGVIR IN GERNGVAT CUYROVGVF

2,000 words would give you:

WNPHMMVF FOUND EFFECTIVE IN GERNGVAT CUYROVGVF

And some grammar knowledge would tell you

WNPHMMVF(N) FOUND EFFECTIVE IN GERNGVAT(V) CUYROVGVF(N)

Which is enough to know that you only really need to look up "TREATING" in the
dictionary to understand the gist of the sentence. But it'd hardly pass as a
fluent understanding...
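
The scrambled words above are just ROT13, so the effect is easy to reproduce. A rough sketch in Python, where `mask_unknown` and the two "known word" sets are stand-ins for a real top-200 or top-2,000 frequency list:

```python
import codecs

def mask_unknown(sentence, known_words):
    """ROT13-scramble every word that is not in known_words."""
    known = {w.upper() for w in known_words}
    out = []
    for word in sentence.split():
        if word.upper() in known:
            out.append(word)
        else:
            # "rot_13" is a text-to-text codec in the standard library.
            out.append(codecs.encode(word, "rot_13"))
    return " ".join(out)

headline = "JACUZZIS FOUND EFFECTIVE IN TREATING PHLEBITIS"

# Stand-in vocabularies, not real frequency lists.
top_200 = {"IN"}
top_2000 = {"FOUND", "EFFECTIVE", "IN"}

print(mask_unknown(headline, top_200))
print(mask_unknown(headline, top_2000))
```

Swap in a real ranked word list, sliced at whatever cutoff you like, to see how much of any given text survives.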

~~~
Rangi42
"Some grammar knowledge" probably also tells you the structure of the verb,
namely TREAT+ING(V,present progressive). "TREAT" is the 1,118th most common
word, so that gives you:

WNPHMMVF FOUND EFFECTIVE IN TREATING CUYROVGVF

And then some contextual knowledge of what sort of things get treated tells
you that CUYROVGVF is a disease or injury, and WNPHMMVF is a medicine or
therapy. So knowing 2,000 words and some grammar isn't quite fluent, but it's
around the level of a middle-school child who has to ask what lots of nouns
are.

------
davedriesmans
Very nice product! As a side note: does anyone know of a service that provides
human associations? E.g., what do you associate with "Hacker" as a person:
"Computer", "Night", "Internet"?

~~~
rspeer
I work on ConceptNet, which does this.

[http://conceptnet5.media.mit.edu](http://conceptnet5.media.mit.edu)

------
jloughry
The most striking thing about Randall Munroe's "Up Goer Five" comic [1] was
that the word "computer" is on the list, but "thousand" isn't.

    
    
        "(Explained using only the ten hundred words people use the most often)"
    

[1] [http://www.xkcd.com/1133/](http://www.xkcd.com/1133/)

~~~
dullcrisp
Well, to be fair, thousand is a number and is often written in digits, so
analysis of written text will find it a lot rarer than it actually is in
speech.

------
yellowstuff
I poked around in the neighborhood of "Tremulous" and saw a lot of obvious
data errors: "ladiesand","manklnd", "confldentlal", "howdare", "monment".
Other words don't seem particularly rare: "productively", "areolas",
"combusts", "lazier".

------
jboggan
It seems like their long-tail data is full of misspellings. Try typing in
"vacillate" and looking at the other words on the graph in relative frequency,
for example.

~~~
eginhard
No corpus is ever "clean". Depending on the type of corpus there might be many
misspellings, so obviously they will occur in the graph alongside other low-
frequency words.

------
justaman
TIL "fucking" is the 214th most popular word in the English language.

------
drpgq
I'm guessing "icing" is more popular in Canada.

~~~
cpwright
Only if you're thinking of icing on roads (or in hockey), but not if you're
thinking of icing on top of a cake.

