
English Letter Frequency Counts: Mayzner Revisited - phenylene
http://norvig.com/mayzner.html
======
chime
I did something similar with a 10m Tweet dataset couple of years ago:

* [http://ktype.net/wiki/research:articles:progress_20110209#tw...](http://ktype.net/wiki/research:articles:progress_20110209#twitter_n-gram_results)

* [http://ktype.net/wiki/research:articles:progress_20110228?s#...](http://ktype.net/wiki/research:articles:progress_20110228?s#letter-pair_frequency_table)

I would love to redo this analysis with newer Tweets but alas, don't know
where to get a usable corpus. Any suggestions? My goal was to explore many of
the concepts from norvig's <http://norvig.com/spell-correct.html> using the
Tweet dataset to build a better one-finger-keyboard and word-prediction engine
for iOS.

~~~
ComputerGuru
I would think tweets would be skewed, no? Slang, memes, shrtnd txt 2 avoid
char lmts, <http://urls/>, amongst others?

~~~
talaketu
No corpus is immune from comparison, and each will have statistical parameters
that reflect its original selection criteria. Perhaps Mayzner's corpus,
apparently based on a sample from literature, exhibits a bias away from the
abbreviated forms widely used in written communication today.

So, if you wanted to tune your text prediction software for your phone...

~~~
chime
Precisely. I was looking for predicting informal communication patterns, not
formal book/newspaper style.

------
martinpw
Since the original data includes the year of publication, it would be
interesting to see trends in these datasets over time, eg which words are
becoming more/less popular, is average word length reducing over time, is the
variety of words in common use increasing or decreasing, etc.

~~~
hellrich
You might be into corpus linguistics...

------
lispython
I thought this might help in designing a keyboard layout that is better (from
a scientific/statistical perspective) than Dvorak
(<https://en.wikipedia.org/wiki/Dvorak_Simplified_Keyboard>), which is based
on research from more than 80 years ago.

~~~
gokfar
The results seem consistent with previous data. There are many recent layouts
(e.g. Colemak [1]) which attempt further optimization, but simulations show
little actual difference in strain (~5%) between them. They are all better
than qwerty in this regard, but optimizing any further is a clear case of
diminishing returns.

It is actually much more difficult to model finger strain than the English
language (in terms of n-grams). Subjective assessments vary a lot, and the
quest for the optimal layout is bathed in controversy. Beyond switching away
from qwerty, the most significant gains will be made by hardware solutions,
like using an ergonomic keyboard (such as the very promising ErgoDox [2]).
Other optimizations may come in the form of chorded keyers and better
predictive technology. In this last case, the Google data may prove useful.

[1] <http://colemak.com/>

[2] [http://deskthority.net/workshop-f7/split-ergonomic-
keyboard-...](http://deskthority.net/workshop-f7/split-ergonomic-keyboard-
project-t1753.html)

~~~
lam
@gokfar: Seems like you have some background in keyboard layout optimization.
Are you working on anything related now? I am, and it would be great to chat
with you about it if you're in the MV area.

------
sriramk
On a tangential note, does anyone know what he's using to generate those bar
graphs automatically from those tiny images? Nifty trick.

------
new-world-order
Mr Norvig, please share the code. I'm sure it's some interesting Lisp or
Python.

~~~
robrenaud
Norvig's spell corrector is the nicest code I've ever read. But every analysis
he makes here is just counting; there is nothing computationally interesting
happening. I doubt the code is really that much to look at.

------
jqueryin
It's not apparent to me what was used when calculating the frequency of
bigrams and other n-grams. Was this based on the overall dataset or on the
dictionary alone?

I believe it would be beneficial to see a version based on the dictionary
words alone, as that would ensure no duplicate words affect the n-gram
counts, acting as a control group.
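The distinction matters because token-weighted and dictionary-based counts can diverge a lot. A hypothetical toy example of the two schemes:

```python
from collections import Counter

def bigrams(word):
    """All adjacent letter pairs in a word."""
    return [word[i:i + 2] for i in range(len(word) - 1)]

# Toy corpus: "the" occurs three times.
corpus = ["the", "the", "the", "then", "other"]

# Token-weighted: every occurrence of a word contributes,
# so frequent words dominate the n-gram counts.
token_counts = Counter(b for w in corpus for b in bigrams(w))

# Dictionary-based: each distinct word counts once, as a control.
type_counts = Counter(b for w in set(corpus) for b in bigrams(w))

print(token_counts["th"], type_counts["th"])  # 5 3
```

Norvig's tables appear to be token-weighted (counts over the whole corpus), which is what you want for prediction, but the dictionary-based version answers a different and also interesting question.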

------
wangweij
Many years ago I read an article claiming the order is etoanirsh... I still
use it in hangman games.

~~~
Uncompetative
The real order is this one...

<http://en.wikipedia.org/wiki/ETAION_SHRDLU>

~~~
jerf
You should read the linked article.

------
jyhipu
Service Temporarily Unavailable

The server is temporarily unable to service your request due to maintenance
downtime or capacity problems. Please try again later.

Apache/1.3.42 Server at norvig.com Port 80

------
nodata
But "forschungsgemeinschaft" is German, and it's mentioned frequently: _at
least 100,000 times each in the book corpus_.

I don't trust that his corpus is comparable to the original English corpus.

~~~
pretoriusB
A lot of foreign words appear frequently in English texts, especially in
narrow domains like philosophy, biology, and medicine.
And seeing that:

<http://en.wikipedia.org/wiki/Deutsche_Forschungsgemeinschaft>

is a research institute, one would expect tons of mentions of this word in
_English_ scientific papers.

But that is an outlier, and given the vastness of the corpus such cases would
have been sorted out.

That said, there would have been an easy way to filter non-English books out
automatically: do a statistical analysis on each book (e.g. on letter
frequency) and reject the ones that stray too far from the norm -- or send
them to a secondary filtering stage, e.g. by word presence or verification by
a human. Done carefully, that filtering would not harm the actual results at
all (e.g. by presupposing a specific letter frequency), because it would only
reject extreme outliers that would indeed be non-English works.
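That rejection step can be sketched in a few lines. The frequency table and the 0.2 threshold below are illustrative assumptions, not calibrated values:

```python
from collections import Counter
from math import sqrt

# Approximate English letter frequencies (illustrative values).
ENGLISH = {
    'e': .127, 't': .091, 'a': .082, 'o': .075, 'i': .070, 'n': .067,
    's': .063, 'h': .061, 'r': .060, 'd': .043, 'l': .040, 'c': .028,
    'u': .028, 'm': .024, 'w': .024, 'f': .022, 'g': .020, 'y': .020,
    'p': .019, 'b': .015, 'v': .010, 'k': .008, 'j': .0015, 'x': .0015,
    'q': .001, 'z': .0007,
}

def letter_freqs(text):
    """Relative letter frequencies of a text, lowercased."""
    letters = [c for c in text.lower() if c.isalpha()]
    total = len(letters) or 1
    return {c: n / total for c, n in Counter(letters).items()}

def distance(freqs, reference=ENGLISH):
    """Euclidean distance between two letter distributions."""
    keys = set(freqs) | set(reference)
    return sqrt(sum((freqs.get(k, 0) - reference.get(k, 0)) ** 2
                    for k in keys))

def looks_english(text, threshold=0.2):
    """Accept a text whose letter distribution stays near the English norm."""
    return distance(letter_freqs(text)) < threshold
```

Books that fail the check would go to the secondary stage (word lists, human review), so a loose threshold that only catches extreme outliers is fine here.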

~~~
arrrg
It’s not a research institute; it’s a foundation. It gives grants to
researchers (it’s _the_ biggest organization in Germany giving research
grants). As such it is also often mentioned in scientific papers. (“This study
was funded in part by a grant from the …”)

------
ghubbard
How do the new results compare to Mayzner's?

