The Long Tail of the English Language (wordsapi.com)
49 points by impostervt on Jan 29, 2015 | 54 comments

Now I understand somewhat better why I often end up ceasing midway through discourse, as my colloquists often end up interjecting requesting definition of whatever neologism has just emanated from lexicon into conversation.... a random sampling of words that I have just today been asked to define either do not feature in this list at all, or are 90,000+ in terms of use. Brobdingnagian, deleterious, autodidact, loquaciousness... are these really such strange vernacular?

I suppose this is what happens when you spend your formative years buried in literature - it probably doesn't help that I mispronounce all sorts, as books make poor elocutionists.

I do worry that as we further and further consolidate our vocabulary that we lose the breadth and depth of thought that nuanced words provide... so did Orwell...

1. Never use a metaphor, simile, or other figure of speech which you are used to seeing in print.

2. Never use a long word where a short one will do.

3. If it is possible to cut a word out, always cut it out.

4. Never use the passive where you can use the active.

5. Never use a foreign phrase, a scientific word, or a jargon word if you can think of an everyday English equivalent.

6. Break any of these rules sooner than say anything outright barbarous.

-- Orwell, "Politics and the English Language"

I was referring more to newspeak, as Orwell's thinking progressed over the intervening five years, but the above are perfectly valid editorial rules.

The thing is, note "will do" in (2). There are many cases where a short word really won't do, lest you lose nuance and meaning.

Ah, another person who talks like a book. I also get that quite regularly. I have a bit of a personal rule, though, where I never use a more complicated word when a simpler one would work. After all, conversation is at heart a means of communication.

Although we might get a bit of a thrill from talking over people's heads, it doesn't really mean much of anything. And all it takes is one misuse -- for example, calling Brobdingnagian a neologism -- and we tear ourselves down more than we could hope to build ourselves up.

Brobdingnagian is a neologism though. Swift coined it scarcely three centuries ago, which is a ... Lilliputian (sorry)... amount of time in linguistic terms. I suppose it also depends on what you consider to be contemporaneous. In my case, this era, which opened with the Enlightenment...

Agreed - never use a complicated word when a simple one will do - but you either have to turn verbose or lose nuance, as we're not talking synonyms, but words which have meanings distinct from some of their more commonplace counterparts.

No thrill, I just hate taking the colour out of language, and I like to share words far and wide, so they don't drift into oblivion, along with the thought and meaning behind them.

Oh well. I guess if you want "neologism" to mean "any word invented in the past 300 years", that's your prerogative, but I take a very different meaning. One interesting side-effect of that interpretation is that "neologism" becomes autological, since it was itself coined in that period.

It's true that there are sometimes nuances lost in the conversion from large to small, but there is far more nuance lost once your partner doesn't know what the hell you are saying. For you, Brobdingnagian might have some special nuance (I'm picking on that word because it is ridiculous to me), but to your partner it has no nuance or flavor -- they have no associations with that word, nor any grasp of any subtleties of meaning -- in fact they don't grasp the meaning at all. So if you are trying to get something across, it's simply a bad way to do it.

I mean, don't get me wrong -- I don't want to deride big-word-users. If you want to do so because you like it -- that's obviously fine. Also, if someone does happen to have a thing with Brobdingnagian, I guess you might have just made their day a lot better. But at some point, you have to acknowledge that obscure vocabulary is more of a hindrance than a boon to communication.

You are obviously aware of my favourite, Hemingway on Faulkner: "Poor Faulkner. Does he really think big emotions come from big words? He thinks I don’t know the ten-dollar words. I know them all right. But there are older and simpler and better words, and those are the ones I use."

Children may be precocious and tolerable. I recall the joy of finding new words - grownup words. Most people find it pretentious and condescending in adults. It is poor communication. Want a game of who can think of the most obscure word? Firstly I tried 'widdershins'. Let's exclude Brobdingnagian as an obvious neologism. (Not trolling - really can't see why you think it ain't?)

Although a brief examination of your comment history suggests a satirical intent in what you have written, it has (at face value) some errors. If you truly intend to build your defense for this posturing polysyllabic puffery on the foundation of subtle distinctions in meaning, you would be well served by properly understanding the customary meaning of the words in question, subtle or otherwise.

Taking, as an example, the final sentence of your first paragraph: "Brobdingnagian, deleterious, autodidact, loquaciousness... are these really such strange vernacular?" we find a sentence that is grammatically incorrect, with the singular form of vernacular. Assuming a typographically damaged plural does no good, as these words are not vernaculars. We do no better with the assumption of an elided indefinite article ("are these really such [a] strange vernacular?") as the answer is trivially yes, attempting conversation using only these words (Brobdingnagian, deleterious, autodidact, and loquaciousness) would be an exercise of Cnutian futility. As the recurring problem in finding meaning in this sentence is this definition may I suggest that perhaps the primary issue is in the choice of words? I would suggest the substitution of "words" in place of "vernacular", as this recovers a perfectly sensible rhetorical question. Perhaps, in keeping with the theme of subtle distinction in word choice, "obscure" could be substituted for "strange", signifying the strangeness is rooted in rarity of use, rather than e.g. etymology.

Poe's law is making it difficult to come up with an appropriate response to this.

> are these really such strange vernacular?

Nah. I'm proud to say I got everything except Brobdingnagian, which on further investigation is more of a literary reference than a "real" word.

How very antidisestablishmentarianist of you.

(Sorry, I tried to resist, but I could not.)

I wonder how other languages compare to English. I know English is far from pure.

"The problem with defending the purity of the English language is that English is about as pure as a cribhouse whore. We don't just borrow words; on occasion, English has pursued other languages down alleyways to beat them unconscious and rifle their pockets for new vocabulary." -- James Nicoll

Subtlex has done word frequency counts in a number of languages:

Dutch - http://crr.ugent.be/programs-data/subtitle-frequencies/subtl...

Chinese - http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2880003/

Greek - http://www.bcbl.eu/subtlex-gr/

There's a few others (Polish, French, etc) but I can't find the links for some reason.

Zipf's law [1,2] generally holds for a corpus in any natural language and can be applied to a lot of other things outside linguistics as well.

[1] Zipf’s word frequency law in natural language: http://colala.bcs.rochester.edu/papers/piantadosi2014zipfs.p...
[2] https://en.wikipedia.org/wiki/Zipf%27s_law
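
For anyone who wants to eyeball the Zipf claim on their own text, here is a minimal sketch. The function name `rank_frequency` and the toy sentence are made up for illustration; under an idealised Zipf distribution, frequency is roughly proportional to 1/rank, so the last printed column (rank times count) should stay roughly flat on a large corpus. A sentence this small will not show that; it only demonstrates the bookkeeping.

```python
from collections import Counter
import re


def rank_frequency(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    # Lower-casing is fine for a quick tally, though it discards casing.
    words = re.findall(r"[a-z']+", text.lower())
    ranked = Counter(words).most_common()
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ranked, start=1)]


# Toy corpus, purely illustrative; a real check needs millions of words.
sample = "the cat and the dog and the bird saw the cat"
table = rank_frequency(sample)
for rank, word, count in table:
    print(rank, word, count, rank * count)
```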

What do you mean by "pure"? Lack of loanwords? Lack of any linguistic changes (e.g., sound, meaning) for existing words? Lack of any innovation, i.e. no new words for new concepts/objects? You could try to find some of these things in languages spoken by rather isolated groups. Yet I don't think one should call such a language "pure", implying some kind of (moral) superiority.

Compared to Classical Latin for instance, I can sort of intuitively see how a "purity" comparison might make some sense. The grammar is more rich and precise, suffixes and prefixes attached to primitive roots are more regular in terms of meaning. Meaning is very decoupled from word order, which allows much richer possibilities for rhyme and prose. If the restored pronunciation is anything to go by, then the pronunciation is very regular. Then again I'm not a linguist.

I would say it's like arguing whether Python is more "pure" than Perl. They're both Turing complete, they express very similar concepts but I'd wager most people would concede that Python feels "purer".

Isn't the 'purity' of Latin due more to the fact that we just don't have much data on the languages that informed it? I'm sure Latin borrowed heaps of words and grammar from other languages, but those languages are mostly lost to history now, since the people who spoke them didn't conquer and hold the Mediterranean for one thousand years.

Language purity isn't related to moral superiority; it's a matter of the degree to which a language follows its own rules and patterns, mostly how much of the vocabulary is native, i.e. how much of the vocabulary is "borrowed" from other languages. Since English is the result of a "forced merger" between a Germanic language (Old English) and a Latin one (Middle French) we are already screwed in that respect.

An interesting and large list of "English" words from other languages: https://en.wikipedia.org/wiki/Lists_of_English_words_by_coun...

I like to shampoo my hair on my veranda in the jungle, then put on my cushy khaki pyjamas, while smoking a cheroot.

It's nirvana.

The source material for this frequency count comes from Open Subtitles [http://www.opensubtitles.org/en/search]. Hence the frequencies here apply to spoken English, not written English. In written English the three most common words are "the", "of" and "and", whereas here they are "you", "I", and "the".

If you have the right kind of friends you can play "Who knows the most obscure word?". Everyone picks a word from memory, then you check each word's position (we used to use Google Ngrams, but this would work well). Whoever gets the least common word wins that round; repeat until it's not fun anymore. That's how I learned defenestrate, obsequious, and a few other words.

I love "sesquipedalian", because it's so self-referential.

I love "non-self-referential" because if I use it enough times with all the correct stresses and pauses, I can build a computer.

First word I tried wasn't even present in the list (equivocate). How do you score that?

Equivocate came up in https://books.google.com/ngrams. Looks like the linked tool isn't quite up for it after all.

"Kajagoogoo" is the 96,714th most common word in the English language.

That's cuz it's so shy...

For a corpus much more complete on the long tail, try the Google Books Ngram Viewer: https://books.google.com/ngrams. That uses full data from Google's book-scanning endeavors: millions of books, compared to millions of words in the originally linked article.

I love to play with this, and see how thought waxes and wanes - a good one is to stick in every soviet premier since Lenin, or 20th c. US presidents - really tells you rather a bit about the mindshare that these individuals had.


EDIT ADD: Zipf's law was not mentioned directly in the blog post, but the reply clarified that the API returns a Zipf score. However, a word's Zipf ranking depends on the corpus used. The Words API "About" page[1] says most data came from Princeton WordNet, but a sibling comment says it came from a subtitles compilation. If the project could clarify the data sources, it would be helpful.


The "frequency" score returned by Words API is a Zipf score for the word. It ranges from ~1.6 to ~7.6.
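
For context, the Zipf scale as it is usually defined in the SUBTLEX word-frequency work is the base-10 logarithm of a word's frequency per billion words (equivalently, log10 of frequency per million, plus 3). Whether Words API computes its score exactly this way is an assumption here, and `zipf_score` is a hypothetical helper, not part of the API. A rough sketch:

```python
import math


def zipf_score(count, corpus_size):
    """Zipf scale value: log10 of occurrences per billion words.

    Assumes the common definition; the Words API internals are not
    documented in this thread.
    """
    if count <= 0 or corpus_size <= 0:
        raise ValueError("count and corpus_size must be positive")
    # log10(count / corpus_size * 1e9), written as a sum to limit rounding.
    return math.log10(count) - math.log10(corpus_size) + 9


# A word seen once per million words lands near 3 on this scale;
# one seen 50,000 times per million lands near the top of the quoted range.
print(round(zipf_score(1, 1_000_000), 2))
print(round(zipf_score(50_000, 1_000_000), 2))
```

On this scale a score of 7.6 would correspond to a word making up a few percent of all tokens, and 1.6 to a word seen a few dozen times per billion.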

Regarding your update - I'll update the About page.

'I' doesn't seem to work. It is very common. https://books.google.com/ngrams/graph?content=I%2C+you&year_...

Try lower-casing it.

EDIT: This looks like a case of a common programming antipattern: you don't care about the casing for comparison purposes, so instead of implementing a case-insensitive compare, you downcase the strings and call it a day. But that's inherently a loss of data, and not having that data will eventually come back to bite you.
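
A minimal sketch of the alternative, assuming nothing about how the linked tool is actually built: compare on a case-folded key, but keep every original spelling so the casing information is never thrown away. The helper names `group_by_casefold` and `lookup` are invented for this example.

```python
from collections import defaultdict


def group_by_casefold(tokens):
    """Group tokens that match case-insensitively, keeping each spelling."""
    groups = defaultdict(lambda: defaultdict(int))
    for token in tokens:
        # The folded form is only the lookup key; the original is the value.
        groups[token.casefold()][token] += 1
    return {key: dict(spellings) for key, spellings in groups.items()}


def lookup(groups, query):
    """Case-insensitive lookup that still reports the original casings."""
    return groups.get(query.casefold(), {})


tokens = ["I", "i", "Polish", "polish", "I"]
groups = group_by_casefold(tokens)
print(lookup(groups, "I"))
print(lookup(groups, "POLISH"))
```

Keeping "Polish" and "polish" apart under one key means a later feature that does care about case can still tell the nationality from the verb.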

Should be fixed now.

So I'm wondering, if you just learned the 200 most popular words, you might get pretty far in learning a new language, no?

200? No. Try 2,000. That should be representative of a barely usable level of the language. If you are what might be described as fluent, you're probably at >20,000. Take this headline:


"EFFECTIVE" and "IN" are the only two words found in the first 2,000 words sorted by frequency, although "FOUND" is close. So with only 200 words you'd understand:


2,000 words would give you:


And some grammar knowledge would tell you


Which is enough to know that you only really need to look up "TREATING" in the dictionary to understand the gist of the sentence. But it'd hardly pass as a fluent understanding...

"Some grammar knowledge" probably also tells you the structure of the verb, namely TREAT+ING(V,present progressive). "TREAT" is the 1,118th most common word, so that gives you:


And then some contextual knowledge of what sort of things get treated tells you that CUYROVGVF is a disease or injury, and WNPHMMVF is a medicine or therapy. So knowing 2,000 words and some grammar isn't quite fluent, but it's around the level of a middle-school child who has to ask what lots of nouns are.
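
The masked tokens quoted here (CUYROVGVF, WNPHMMVF) decode under ROT13 to PHLEBITIS and JACUZZIS, which matches the headline as it is quoted elsewhere in this thread. A rough sketch of that masking exercise follows; the `known` set only contains the two words the comment above says fall inside the first 2,000, and the function name `mask_unknown` is hypothetical.

```python
import codecs


def mask_unknown(sentence, known_words):
    """Upper-case the sentence and ROT13 every word outside known_words."""
    masked = []
    for word in sentence.split():
        upper = word.upper()
        if upper in known_words:
            masked.append(upper)
        else:
            # ROT13 keeps word length and shape but hides the meaning.
            masked.append(codecs.encode(upper, "rot_13"))
    return " ".join(masked)


headline = "Jacuzzis found effective in treating phlebitis"
known = {"EFFECTIVE", "IN"}  # assumed vocabulary, taken from the comment
result = mask_unknown(headline, known)
print(result)
```

Growing `known` (for example, adding "FOUND" or "TREATING") shows how quickly the gist becomes guessable once the verbs are readable.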

This is actually a very nice example of the importance of grammar. A typical novice English learner, even if they can look up all the words from the dictionary, would go on like this:

"Hmm, Jacuzzis is clearly a noun, so it must be the subject. So Jacuzzis found "effective"... wait, that doesn't make sense. "Effective" is not a company, how does one found "effective"? Oh, it must be the past tense of "find". OK, so these Jacuzzis found... effective? How does one find "effective"? And why was effective hiding in "treating phlebitis"? OK, whatever, maybe I could decipher this if I understand "treating phlebitis"... So it is a kind of disease that is... treating someone (or something)? How could a disease treat anyone? This sentence makes no sense!!"

If you think "Duh, that's ridiculous, nobody thinks like that", then you obviously never met a struggling English learner.

I took exactly this approach to learning passable German when I lived there a couple months. Unfortunately, most Germans would much rather practice their English on you than let you practice German on them. So I gave up and let them.

It's funny, I had exactly the same experience when I lived in Germany for six months and was studying German. I would just keep speaking German until they gave up :)

Probably depends on the language. A three-year old supposedly knows about 1,000 English words.

You could probably at least get around.

While the frequency of words drops off precipitously, there's also the fact that the set of all thoughts one might want to convey is incredibly vast, so within any given conversation there will probably be at least a few words which otherwise rarely appear.

Very nice product! As a side note: does anyone know a service that provides human associations? E.g., what do you associate with "Hacker" as a person: "Computer", "Night", "Internet"?

I work on ConceptNet, which does this.


It seems like a good idea, but to automate it you'd need to maybe scrape websites and create a count of words that appear in the same sentence. Or you could get really crazy and start comparing subject/object relationships, etc.
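
The same-sentence counting idea can be sketched in a few lines. This is only an illustration of the approach described above, not how ConceptNet or any real service works; the sample text, the tiny stop-word list, and the function names are all invented for the example.

```python
from collections import Counter
from itertools import combinations
import re

STOP_WORDS = {"the", "a", "at"}  # tiny illustrative list, not a real one


def cooccurrence_counts(text):
    """Count how often each pair of words shares a sentence."""
    counts = Counter()
    for sentence in re.split(r"[.!?]+", text):
        words = set(re.findall(r"[a-z']+", sentence.lower())) - STOP_WORDS
        # Sorting makes each pair appear in one canonical order.
        counts.update(combinations(sorted(words), 2))
    return counts


def associations(counts, word, top=3):
    """Return the words that most often share a sentence with `word`."""
    related = Counter()
    for (first, second), n in counts.items():
        if first == word:
            related[second] += n
        elif second == word:
            related[first] += n
    return related.most_common(top)


text = ("The hacker used a computer at night. "
        "The hacker fixed the computer. "
        "A computer needs the internet.")
counts = cooccurrence_counts(text)
print(associations(counts, "hacker"))
```

Raw counts like these mostly surface frequent words, so a real attempt would need a much larger corpus and some normalisation for how common each word is overall.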

The most striking thing about Randall Munroe's "Up Goer Five" comic [1] was that the word "computer" is on the list, but "thousand" isn't.

    "(Explained using only the ten hundred words people use the most often)"
[1] http://www.xkcd.com/1133/

Well to be fair, thousand is a number, so analysis of written text will find it a lot rarer than it is actually used/spoken.

I poked around in the neighborhood of "Tremulous" and saw a lot of obvious data errors: "ladiesand","manklnd", "confldentlal", "howdare", "monment". Other words don't seem particularly rare: "productively", "areolas", "combusts", "lazier".

It seems like their long-tail data is full of misspellings. Try typing in "vacillate" and looking at the other words on the graph in relative frequency, for example.

No corpus is ever "clean". Depending on the type of corpus there might be many misspellings, so obviously they will occur in the graph alongside other low-frequency words.

TIL "fucking" is the 214th most popular word in the English language.

I'm guessing "icing" is more popular in Canada.

Only if you're thinking of icing on roads (or in hockey), but not if you're thinking of icing on top of a cake.
