Hacker News
English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU (2013) (norvig.com)
34 points by signa11 83 days ago | 10 comments



In the not too distant past you could earn a PhD by writing a concordance (essentially a massive index) of every word in the works of Shakespeare, all mentions of the differences between men and women, etc. The same went for generating pre-computed tables (books of logarithms, roots, etc).

These books were quite useful so that work really did have valuable impact, but what a waste of resources!


Oddly enough, when you play hangman, you should use this sequence instead:

ESIARN TOLCDU

In hangman, you are guessing words, not text. The high frequency of the trigram THE makes T and H more probable in a text corpus. But if every word is equiprobable as in a dictionary list of words, you get the above sequence.

http://datagenetics.com/blog/april12012/index.html
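To make the text-versus-word-list distinction concrete, here is a rough sketch. It is not the linked article's method: the function names and the tiny sample sentence are made up for illustration, and counting each letter at most once per distinct word is just one reasonable reading of "every word is equiprobable".

```python
from collections import Counter

def letter_order_by_tokens(text: str) -> str:
    """Rank letters by raw occurrences in running text (ties broken alphabetically)."""
    counts = Counter(ch for ch in text.lower() if ch.isalpha())
    return "".join(sorted(counts, key=lambda ch: (-counts[ch], ch)))

def letter_order_by_words(words) -> str:
    """Rank letters by how many distinct words contain them.

    Assumption: each distinct word counts once, and a letter counts
    once per word no matter how often it repeats inside that word.
    """
    counts = Counter()
    for word in {w.lower() for w in words}:
        counts.update({ch for ch in word if ch.isalpha()})
    return "".join(sorted(counts, key=lambda ch: (-counts[ch], ch)))

# Hypothetical toy corpus: "the" repeats, as it does in real text.
sample_text = "the cat and the dog saw the other bird"
text_order = letter_order_by_tokens(sample_text)
word_order = letter_order_by_words(sample_text.split())

print(text_order)  # t, e, h lead because "the" is counted three times
print(word_order)  # a, d, t lead once every word counts only once
```

Even on this toy input the repeated "the" pushes T, E and H to the front of the text ranking, while the word-list ranking does not reward them for it. A real comparison would need a large corpus and a full dictionary.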


Someone should do a proper game-theoretical analysis of hangman: you're playing hangman with DEATH, who has chosen an n-letter word according to an optimised strategy to minimise your chances of survival, assuming that you too will be following an optimal strategy. So a word containing very few common letters will probably be chosen with a higher probability than one that contains lots of common letters, so a frequency analysis of a dictionary won't directly help you.
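This is nowhere near a proper game-theoretic solution, but a small sketch shows the effect. It assumes the guesser blindly follows one fixed letter order (the ESIARN TOLCDU sequence quoted elsewhere in this thread, padded with the remaining letters in alphabetical order, which is my own assumption) and that DEATH simply picks the word that costs the most misses. The word list and function names are hypothetical.

```python
def wrong_guesses(word: str, guess_order: str) -> int:
    """Count the misses a fixed guess order makes before the word is fully revealed."""
    remaining = set(word)
    misses = 0
    for letter in guess_order:
        if not remaining:
            break  # every letter of the word has been found
        if letter in remaining:
            remaining.discard(letter)
        else:
            misses += 1
    return misses

def hardest_word(words, guess_order: str) -> str:
    """DEATH's best response to a guesser who never adapts: maximise the misses."""
    return max(words, key=lambda w: wrong_guesses(w, guess_order))

# Quoted sequence first, then the leftover letters alphabetically (assumption).
guess_order = "esiarntolcdu" + "bfghjkmpqvwxyz"

# Hypothetical 4-letter candidates.
words = ["rate", "tone", "lens", "jazz"]

for w in words:
    print(w, wrong_guesses(w, guess_order))

worst = hardest_word(words, guess_order)
print("DEATH picks:", worst)
```

A real analysis would let the guesser adapt after each reveal and let both sides randomise, so this only illustrates why a pure frequency order is exploitable.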


I googled that on a whim and got an article. https://en.wikipedia.org/wiki/Etaoin_shrdlu


And now the long dead typesetters' mistakes are indexed by google for us all to read…

From the January 1917 issue of the Southern and Southwestern Railway Club bimonthly proceedings…

https://books.google.com/books?id=Kx0wAQAAMAAJ&pg=RA4-PA24&d...

That's a fair bit of technological evolution in 102 years.

The publication seems like a forerunner of the podcast. It is a verbatim transcript of an enthusiasts' meeting with advertisements injected.


NB, those aren't "enthusiasts", the S&SWRWC is an industry association.


It should probably be "ETAOINSRHLDCU", or " ETAOINSRHLDCU", right?


I'm sure there's some joke I'm not getting, but just in case, it's in reference to this:

https://en.m.wikipedia.org/wiki/Etaoin_shrdlu


The point is that "space" is probably not the 7th most frequent character in English text.


Space isn't a letter either, it was just added to the phrase because the typesetters' lines had a length of 6. It's a relic.



