

Word Frequencies in Front Page HN Titles - chegra
http://chegra.posterous.com/word-frequencies-in-front-page-hn-titles

======
sbierwagen
A bunch of meaningless words ("You" and "for" rank highly? No, really?), but a
few that stick out are "how", "google" and "new".

It also drops off pretty fast. You'd think you'd see a lot more reused words.

The graph is also pretty poorly designed. The most important elements (the
words) are aligned vertically, which means you have to turn your head to read
them. Why not put the words on the Y axis?

~~~
chegra
Well, I was checking for these so-called meaningless words. I've noticed that
when I write "You" in a blog post it gets more views. Some persuasion book I
read a while back says copywriters use that all the time. But yeah, I edited
the list to words that are interesting to me based on my past experience.

------
myffical
You need to massage your data to get more meaningful results.

It might be interesting to compare your word counts with the word counts from
a general-purpose word corpus, then pick out words that appear more frequently
by a statistically-significant amount. Something like Amazon's statistically
improbable phrases algorithm.

~~~
waldrews
I'd suggest, as a simple heuristic for ranking words for
improbability/relevance, contribution to K-L divergence from the frequencies
in the general-purpose word corpus:

P * ln(P/Q)

where P is the frequency of the word in the narrow corpus (HN titles)

and Q is the frequency of the word in the general-purpose corpus

(The formula doesn't work if Q is ever zero. This won't happen if the broader
corpus includes the narrower one, as it should, but as a practicality, just
set Q := (1-a)*Q + a*P for a small positive a to simulate merging the smaller
corpus into the larger.)

http://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence

Anybody with more time than I have at the moment want to code this up?
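
A minimal sketch of the heuristic above, in Python. The function name, the
default a=0.01, and the toy counts are all made up for illustration; real
input would be word counts from HN titles and from whatever general-purpose
corpus you pick.

```python
import math

def kl_contributions(narrow_counts, general_counts, a=0.01):
    """Rank words by P * ln(P/Q), their contribution to K-L divergence.

    narrow_counts / general_counts map word -> raw count.
    a is the small positive mixing weight that keeps Q above zero.
    """
    narrow_total = sum(narrow_counts.values())
    general_total = sum(general_counts.values())
    scores = {}
    for word, count in narrow_counts.items():
        p = count / narrow_total
        q = general_counts.get(word, 0) / general_total
        # Simulate merging the narrow corpus into the general one,
        # so Q is never zero for a word seen in the narrow corpus.
        q = (1 - a) * q + a * p
        scores[word] = p * math.log(p / q)
    # Highest score first: over-represented words come out on top.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy counts, invented for this example only.
hn_titles = {"the": 50, "you": 20, "google": 15, "how": 10, "new": 5}
general = {"the": 600, "you": 200, "new": 100, "how": 99, "google": 1}

for word, score in kl_contributions(hn_titles, general):
    print(f"{word:10s} {score:+.4f}")
```

Words that are common everywhere ("the", "you") score near or below zero,
while words that are unusually frequent in the narrow corpus float to the top.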

------
kfarzaneh
Pretty cool, but you might want to do some more filtering to allow for more
meaningful words.

~~~
tectonic
Yea, I suggest you remove stop words.
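
A rough sketch of what that could look like in Python. The stop-word set here
is a tiny hand-picked one and the titles are invented; a real run would use a
fuller stop-word list and the actual front-page titles.

```python
import re
from collections import Counter

# Tiny illustrative stop-word set, not a complete list.
STOP_WORDS = {"a", "an", "the", "for", "you", "to", "of", "in", "and", "is", "on"}

def word_frequencies(titles, stop_words=STOP_WORDS):
    """Count lowercase words across titles, skipping stop words."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z0-9']+", title.lower())
        counts.update(w for w in words if w not in stop_words)
    return counts

# Invented sample titles.
titles = [
    "How Google Builds New Things for You",
    "Show HN: A New Tool for the Web",
    "How to Learn Haskell",
]

counts = word_frequencies(titles)
print(counts.most_common(5))
```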

------
d0m
I'm surprised Apple isn't more frequent. I feel like that's all there is on HN.

~~~
chegra
HN has mixed feelings about Apple, compared to Google, which the majority
would say they love.

