
English Letter Frequency Counts: Mayzner Revisited (2013) - sindoc
http://norvig.com/mayzner.html
======
yoyo1999
Can anybody help me understand how this data can be useful to anybody?

I was playing with n-grams for a while and even produced similar results. But I
don't see how that data can be useful to anybody.

~~~
ScottWhigham
I upvoted you to offset the downvote(s) you've received. There should be
nothing wrong with a newbie (to HN at least) asking a question like this.
Would anyone who downvoted OP care to explain the downvote? The only logical
reason I can think of that someone here would've downvoted is something along
the lines of - "He should've known that crypto and ML are the obvious
answers."

------
ableal
Google cache clicky:
[http://webcache.googleusercontent.com/search?q=cache:u4CVeIw...](http://webcache.googleusercontent.com/search?q=cache:u4CVeIwk2LMJ:norvig.com/mayzner.html)

------
Rezo
I used Norvig's frequency counts as input for the board generation algorithms
(in Scala) for my Android word game "5 Star Words" [1]. With this as the start
plus a few other tricks, I'm typically able to reach an average of ~300 common
English words (or easily 400+ when including less common and swear words) on a
4x4 letter board.

[1]
[https://play.google.com/store/apps/details?id=com.starwords](https://play.google.com/store/apps/details?id=com.starwords)

------
bane
I think natural language designers might also look at the letter frequencies
and question why 'E' shows up so much. Is the canonical sound it makes just
common in English or is there some problem with its "design"? It turns out E
is _way_ overloaded in English:

\- it's silent in the case of modifying preceding vowels separated by a medial
consonant e.g. hat vs. hate, bat vs. bate

\- and in older English (or English that wants to feel old) was a superfluous
final letter e.g. olde, pubbe

\- as a silent letter entirely e.g. eagle

\- as itself e.g. egg, education

\- as a silent or nearly silent suffix separator for -ed e.g. dropped, judged

\- as a non-silent suffix for -ed e.g. educated

\- silent as an immediate vowel modifier in vowel digraphs (in some spellings)
e.g. archaeology, encyclopaedia, caesar; it was so incidental that it used to
be written as a ligature.

\- silent as a modifier on itself e.g. teen, feel

\- one of several representations for schwa, ə, e.g. taken (takən), enemy
(enəmy)

etc.

'e' is a mess. It's mostly silent, either ignored completely or modifying
something else (an issue even Benjamin Franklin tried to solve through a
proposed spelling reform). It's conflated with schwa (the most common vowel
sound in English, which nonetheless has no single representation).

A language reformer would probably tackle this letter first and fix a great
deal of the spelling problems in English.

~~~
unwind
"Natural language designer" is a contradiction; one of the core defining
properties of natural languages (like English) is that they are _not_
designed.

You switched to "reformer" in your closing sentence; perhaps that was what you
originally meant, too?

Of course, such a reform is not exactly easy to implement.

~~~
bane
I mean natural language as "language for humans to use to communicate with
each other" as opposed to programming language as "language for humans to use
to communicate with computers". It's the same meaning as is used in NLP.

e.g.
[https://en.wikipedia.org/wiki/Hangul](https://en.wikipedia.org/wiki/Hangul)
[https://en.wikipedia.org/wiki/Cyrillic](https://en.wikipedia.org/wiki/Cyrillic)
and I guess even
[https://en.wikipedia.org/wiki/Klingon_language](https://en.wikipedia.org/wiki/Klingon_language)

This is different than the meaning of
[https://en.wikipedia.org/wiki/Natural_language](https://en.wikipedia.org/wiki/Natural_language)
and
[https://en.wikipedia.org/wiki/Constructed_language](https://en.wikipedia.org/wiki/Constructed_language)

I guess if you want to get pedantic a better term might be "Orthographic
design".

------
JeffJenkins
Fun fact: If you're taking one vowel and five consonants, the Wheel of Fortune
letters RSTLNE (not in that order) are the letters that are most likely to occur.
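
A claim like this can be checked against any corpus with a short Python sketch. The function name `best_picks` and the choice to treat Y as a consonant are assumptions made here for illustration; the sketch counts letters in whatever text it is given rather than using the counts from the linked page.

```python
from collections import Counter

VOWELS = set("AEIOU")  # assumption: Y is counted as a consonant

def best_picks(text, n_vowels=1, n_consonants=5):
    """Return the most frequent vowels and consonants in the text."""
    counts = Counter(ch for ch in text.upper() if "A" <= ch <= "Z")
    ranked = [letter for letter, _ in counts.most_common()]
    vowels = [l for l in ranked if l in VOWELS][:n_vowels]
    consonants = [l for l in ranked if l not in VOWELS][:n_consonants]
    return vowels, consonants

if __name__ == "__main__":
    sample = "Replace this string with a large body of English text."
    print(best_picks(sample))
```

Which letters come out on top depends on the corpus, so a small sample will not necessarily reproduce the six letters named above.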

------
triplesec
I love this. One minor representation issue: for the "Letter Counts by
Position Within Word" that charting approach is less than helpful.
Improvements within the structure he uses might be coding each letter with its
own colour, and having each letter in its own column, reordered by length.
However, charting experts could easily come up with a more useful charting
approach than I can off the top of my head.

