

What determines word length? Not frequency after all. - ColinWright
http://web.mit.edu/newsoffice/2011/words-count-0210.html

======
henrikschroder
Uhm, basic linguistics teaches you about phonemes and morphemes. A phoneme is
the smallest meaning-differentiating unit of a language, and a morpheme is the
smallest meaning-carrying unit.

Now, the length of a word is pretty arbitrary: it's a function of how the
phonemes in the word are spelled out, but it doesn't say much about the
word's morphemes. Yes, generally a longer word can contain more morphemes,
and thus more meaning, but, well, duh.

Something that would be genuinely interesting though is to compare the length
of _morphemes_ with their frequency or their complexity of meaning.

~~~
zwieback
Exactly. That was my thought #1, phonetically rough=ruf. Thought #2: I'd love
to see the study repeated in my native German, which is nearly phonetic and
has a different way of constructing words. My wife's favorite (she's
American): Glove = Handschuh, a shoe for your hands.

Flipping through an English-German dictionary you'll find a lot of specific
English nouns that have longer compound nouns in German.

~~~
grandalf
The rough <-> ruf example is a great one. To support the researchers'
hypothesis, one would have to argue either that the "ough" spelling conveys
additional meaning in both written and spoken forms, or that such examples
are so small a subset of the words studied that they are insignificant to the
larger hypothesis.

I am somewhat aware of thinking about the etymology of words when using them.
For example, whenever I use the word conspire I picture two people breathing
together. Perhaps if I didn't picture this I'd just use the word "plot"
instead in such cases.

------
jwr
"But now a team of MIT cognitive scientists has developed an alternative
notion, on the basis of new research: A word’s length reflects the amount of
information it contains."

I'm trying very, very hard not to say "DOH!". I mean, isn't this the expected
result?

~~~
derrida
If you assume a view in which syntax produces a predictable semantics, then
yes. The number 123321 contains more information than 3. But natural languages
do not have a semantics produced by a syntax in an easily predictable way. I
think it isn't a novel result, but for different reasons.

~~~
CWuestefeld
_The number 123321 contains more information than 3._

This isn't true, or at least, it isn't true in most relevant ways.

Most obviously, if I need to specify which record in my database (of, say, a
million records) is to be read, then "3" and "123321" contain precisely the
same amount of information.

But I think the same holds even in more common cases: "how many birds are at
the bird feeder?"; "how many gallons did it take to fill your gas tank?";
"what's your GPA?". The answer, regardless of its magnitude, conveys the same
amount of information.

Now, _precision_ is a different story. "I'm going to take a couple of weeks of
vacation in July" contains less information than "I'm going to take 11 days of
vacation in July". "My gas tank has a capacity of 16 gallons" contains less
information than "my gas tank has a capacity of 15.9 gallons".

~~~
sesqu
There is interplay. "There are on the order of fifty birds at the bird feeder"
is more informative than "there are on the order of zero birds at the bird
feeder".

------
nadam
For me it is clearer to say that not the absolute probability, but the
conditional probability of a word (in its typical contexts) determines word
length. Now that I think about it: it is trivial that this is optimal for
efficiency. Not a very deep result.
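
In code, the idea looks something like this (a toy sketch in Python; the
bigram counts are invented purely for illustration):

    import math

    # Surprisal of a word given the preceding word: -log2 P(word | prev).
    bigram_counts = {
        ("traffic", "jam"): 80,
        ("traffic", "light"): 120,
        ("traffic", "marmalade"): 1,
    }

    def surprisal(prev, word):
        total = sum(c for (p, _), c in bigram_counts.items() if p == prev)
        return -math.log2(bigram_counts[(prev, word)] / total)

    print(surprisal("traffic", "jam"))        # ~1.33 bits: predictable
    print(surprisal("traffic", "marmalade"))  # ~7.65 bits: surprising

The claim is that words which routinely show up in high-surprisal positions
end up longer.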

~~~
SiVal
One thing that isn't trivial is the nature of the constraints that produce the
various designs of human language. Are they constrained by some sort of innate
"Universal Grammar" coded for in DNA? Or by ease of learning, given general
learning mechanisms (not specific to language)? Or by efficiency of use? Or by
a balance of information theoretic efficiency (high bandwidth) and redundancy
(high accuracy despite noise)? Or something else more significant? Maybe some
form of flexibility (greater efficiency in low-noise and redundancy in high)?
Or all of the above, in differing proportions for different aspects of
language?

We need to measure a lot of things in a lot of different ways in a lot of
different languages before the contributions of the various constraints
reflected in "the design" become clear. It's not trivial that we will discover
(nor has this study come close to proving) that "optimal efficiency" is the
bottom line.

------
beza1e1
I wonder about the causation here. Maybe short words get used more often, such
that their meaning is diluted and the information contained decreases.

An interesting (counter?) example is "use" vs "usage" vs "utilization".

------
michaeldhopkins
The example compared a declarative sentence and an idiom.

This study sounds like the old Norman nonsense that turned perfectly good
Anglo-Saxon words into the vulgar, lower-class vocabulary, at least as far as
nouns and verbs are concerned. Regarding articles, the result is obvious:
articles are combinatory by definition, and the "finding" does not exclude
"the," "la," etc. also being short for convenience's sake.

------
sirclueless
Seems to me that this could just be a reversal of cause and effect. Basically,
short words more easily engender idioms containing them. For example, "It's
getting late, wanna grab lunch?" combines a couple short words into two common
phrases, whereas "It's mid-afternoon, I'm hankering for sustenance" does not.

------
Yahivin
This seems like it would also apply quite well to variable naming in
programming. Naming a variable 'a' just because it's frequently used would be
misguided.

~~~
Egregore
Yes, but there is a tradition (at least in some languages) of naming the loop
variable i, for example: for (int i = 0; i < n; i++).

~~~
Yahivin
This seems to bring additional evidence to the information-content hypothesis:
because it is common and expected, it does not have a high information density.

------
hugh3
No surprises there for anyone who has ever played enough Scrabble to sit down
and memorise the list of two-letter words. Aa? Qi? Ne? Ae, anyone?

------
noiv
First step to relate linguistic to entropy. Read physics meets language.
Expect another decade to answer why words carry meanings.

------
ajhit406
What type of information density does "God" have?

None or infinite?

~~~
owenmarshall
It'd be pretty low, considering the cardinality of the set of attested "gods"
;)

~~~
CWuestefeld
Richard Dawkins likes to say that atheists and Christians are _almost_
identical. They have both rejected the existence of huge numbers of alleged
gods; the atheist just adds one more to that list.

~~~
lotharbot
That's kind of like saying Jeffrey Dahmer and I are also almost identical,
since there are billions of people neither of us have killed and eaten. Bill
Clinton and I are almost identical, since there are so many nations neither of
us have been the president of. And so on.

In many contexts, the difference between zero and one is extremely
significant.

------
tybris
or it's a combination of both...

~~~
wisty
The explanation seems a little sloppy. If they used Markov chains, then a
chain of length 1 is the same as frequency. A chain of length > 1 will be a
slightly better fit.

They've generalized the model, and their generalization seems to work a bit
better.

~~~
_delirium
That seems right, and makes the write-up a bit strange. They seem to be
positioning it as if they're comparing two completely unrelated hypotheses:
the frequency hypothesis versus the information-content hypothesis. But as you
point out, their method of measuring information content (following Shannon)
is simply n-gram frequency. The difference is that they set n=2,3,4 rather
than n=1.

It's also not clear that it differs from the original motivation for the
frequency hypothesis: the case (some) people make for the "more common words
are shorter" hypothesis comes in part from something like a compression
argument, i.e. that common words are short for efficiency reasons. Arguing
that _predictable_ (low-information-content) words are short instead is an
interesting refinement, but not a completely different claim. Just choosing
string length by individual word frequency is actually a sort of crappy
compression algorithm, and this paper seems to show that English's built-in
compression is better than that, taking sequence frequencies into account.
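
A toy version of that comparison, in Python (my own sketch, not the paper's
method or data; the corpus here is far too small for the numbers to mean
anything, so it only shows the shape of the test):

    import math
    from collections import Counter

    # Tiny stand-in corpus; the paper used large n-gram corpora.
    corpus = "the cat sat on the mat and the cat saw the rat".split()

    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))

    def unigram_info(w):
        # Frequency hypothesis (n=1): information = -log2 P(w).
        return -math.log2(unigrams[w] / len(corpus))

    def bigram_info(w):
        # Information-content hypothesis at n=2: surprisal of w given the
        # preceding word, averaged over w's non-initial occurrences.
        s = [-math.log2(bigrams[(prev, cur)] / unigrams[prev])
             for prev, cur in zip(corpus, corpus[1:]) if cur == w]
        return sum(s) / len(s)

    def pearson(xs, ys):
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
        sy = math.sqrt(sum((y - my) ** 2 for y in ys))
        return cov / (sx * sy)

    words = sorted(set(corpus[1:]))  # every word here occurs non-initially
    lengths = [len(w) for w in words]
    print("length vs n=1 info:",
          pearson(lengths, [unigram_info(w) for w in words]))
    print("length vs n=2 info:",
          pearson(lengths, [bigram_info(w) for w in words]))

On a real corpus, the paper's claim amounts to the second correlation being
stronger than the first.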

