

Hacker News' Reading Level - aeurielesn
http://www.google.com/search?q=site:news.ycombinator.com&hl=en&tbs=rl:1

======
achille
Can someone briefly explain how Google determines reading level? I'm assuming
it's using something like the Flesch–Kincaid test:

[http://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readabil...](http://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_test)

A brief search on Wikipedia reveals a few readability tests, but they all seem
to be based on sentence/syllable ratios, not content complexity.

<http://en.wikipedia.org/wiki/Category:Readability_tests>

And in general they all rank multi-syllable (longer) words higher, which would
mean a conversation between two Java API writers would be ranked higher than a
Ruby conversation :)

Java vs Ruby vs Lisp: <http://i.imgur.com/tq3pA.png>
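
For concreteness, here's a rough sketch of the Flesch–Kincaid grade-level computation. The syllable counter is a crude vowel-group heuristic, not the official syllabification rules, so treat the numbers as illustrative only:

```python
import re

def count_syllables(word):
    # Crude heuristic: count runs of consecutive vowels (minimum one).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    # Grade = 0.39 * (words/sentence) + 11.8 * (syllables/word) - 15.59
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

print(flesch_kincaid_grade("The cat sat on the mat."))   # low grade
print(flesch_kincaid_grade(
    "Multisyllabic terminology systematically inflates "
    "readability estimates."))                            # much higher grade
```

Note how the second sentence scores far higher purely because of word length, even though it says almost nothing, which is exactly the Java-vs-Ruby effect above.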

~~~
derefr
I figured that, since Google must read every word of every page to spider it,
they must thereby have, as a byproduct, the world's most accurate database of
word usage frequencies. "Reading level" would then just be a measure of the
average frequency of all the words on a page (thus making words learned in a
first year ESL class simple, and technical jargon advanced—quite the same as
the measure of difficulty used by language proficiency exams.) The fact that
the average of Simple English Wikipedia articles seems to be more intermediate
than basic, though (29/52/17), would argue against that—unless the
calculations are being biased by all the very infrequent proper nouns.
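
That frequency idea is easy to sketch. The frequency table below is entirely made up for illustration (Google's actual table and method are unknown); the point is just that rarer words pull the average score up:

```python
import math

# Hypothetical corpus frequencies (occurrences per million words).
# Made up for illustration -- a real table would come from a crawl.
FREQ = {"the": 50000, "of": 30000, "a": 40000,
        "cat": 120, "sat": 60,
        "matrix": 15, "eigenvalue": 0.5}

def reading_level(words, default_freq=0.1):
    # Score each word by negative log-frequency: rarer => higher score.
    # Unknown words get a small default frequency (i.e. treated as rare).
    scores = [-math.log10(FREQ.get(w.lower(), default_freq) / 1e6)
              for w in words]
    return sum(scores) / len(scores)

print(reading_level(["the", "of", "a"]))          # common words: low score
print(reading_level(["eigenvalue", "matrix"]))    # jargon: higher score
```

This also shows the proper-noun problem mentioned above: any word absent from the table (like a name) falls through to the "rare" default and inflates the page's score.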

~~~
nl
I'd be lying if I said I don't doubt you are not incorrect ;)

What you are proposing is a statistically generated version of the Gunning Fog
Index (<http://en.wikipedia.org/wiki/Gunning_fog_index>) or the Flesch–Kincaid
test
([http://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readabil...](http://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_test)).

If I were Google, I'd try that, but I'd also try something like working out
percentage deviation from a Markov chain generated from their crawl. A method
like that would show that my first sentence is pretty unreadable, while an
algorithm based on word complexity would see it as pretty simple.
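
A toy version of that deviation measure: train a bigram model on a tiny "crawl" and score new text by perplexity (add-one smoothing; assumptions mine, a real system would train on the full crawl). Scrambled or unusual word order scores high even when every individual word is simple:

```python
import math
from collections import Counter

def train_bigrams(tokens):
    # Count adjacent word pairs and single words from the training text.
    return Counter(zip(tokens, tokens[1:])), Counter(tokens)

def perplexity(tokens, bigrams, unigrams, vocab_size):
    # Add-one smoothed bigram perplexity: higher = further from training text.
    log_prob = 0.0
    for prev, word in zip(tokens, tokens[1:]):
        p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))

corpus = "the cat sat on the mat and the dog sat on the rug".split()
bigrams, unigrams = train_bigrams(corpus)
V = len(unigrams)

print(perplexity("the cat sat on the rug".split(), bigrams, unigrams, V))
print(perplexity("rug the on sat cat the".split(), bigrams, unigrams, V))  # higher
```

The scrambled sentence uses only common words, so a word-complexity metric would call it simple, yet its perplexity is much higher, which is the distinction being made above.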

~~~
biotech
> I'd be lying if I said I don't doubt you are not incorrect

Holy Shnikes, that's a tough one to parse! I wasn't sure what you were saying
here, so I'm gonna break it down, working from the end of the sentence:

1\. I'd be lying if I said I don't doubt you are not incorrect

2\. I'd be lying if I said I don't doubt you are _CORRECT_

3\. I'd be lying if I said I don't _think you are INCORRECT_

4\. I'd be lying if I said I _think you are CORRECT_

5\. _I think you are INCORRECT_

The idea is that each of the previous statements says basically the same
thing; I'm just cancelling negatives each time. Anyway, am I correct to assume
that you think the GP is incorrect?

~~~
billswift
Your transition 2 -> 3 is not justifiable: "doubt" is not the opposite of
"think". A better parsing leaves "doubt" alone and would end with: "I doubt
you are correct."

~~~
gimpf
My dear, you just lost the "don't".

------
cruise02
I know it's comparing apples to oranges, but I contribute a lot on Stack
Overflow so I took a look at the results.

[https://encrypted.google.com/search?hl=en&tbs=rl%3A1&...](https://encrypted.google.com/search?hl=en&tbs=rl%3A1&q=site%3Astackoverflow.com&btnG=Search&aq=f&aqi=&aql=&oq=&gs_rfai=)

I was expecting a higher proportion in the "advanced" category.

For comparison:

Math Overflow:
[https://encrypted.google.com/search?hl=en&tbs=rl%3A1&...](https://encrypted.google.com/search?hl=en&tbs=rl%3A1&q=site%3Amathoverflow.net&btnG=Search&aq=f&aqi=&aql=&oq=&gs_rfai=)

OnStartups:
[https://encrypted.google.com/search?hl=en&tbs=rl%3A1&...](https://encrypted.google.com/search?hl=en&tbs=rl%3A1&q=site%3Aanswers.onstartups.com&btnG=Search&aq=f&aqi=&aql=&oq=&gs_rfai=)

English Language & Usage:
[https://encrypted.google.com/search?hl=en&tbs=rl%3A1&...](https://encrypted.google.com/search?hl=en&tbs=rl%3A1&q=site%3Aenglish.stackexchange.com&btnG=Search&aq=f&aqi=&aql=&oq=&gs_rfai=)

CS Theory:
[https://encrypted.google.com/search?hl=en&tbs=rl%3A1&...](https://encrypted.google.com/search?hl=en&tbs=rl%3A1&q=site%3Acstheory.stackexchange.com&btnG=Search&aq=f&aqi=&aql=&oq=&gs_rfai=)

Seasoned Advice (Cooking):
[https://encrypted.google.com/search?hl=en&tbs=rl%3A1&...](https://encrypted.google.com/search?hl=en&tbs=rl%3A1&q=site%3Acooking.stackexchange.com&btnG=Search&aq=f&aqi=&aql=&oq=&gs_rfai=)

Physics:
[https://encrypted.google.com/search?hl=en&tbs=rl%3A1&...](https://encrypted.google.com/search?hl=en&tbs=rl%3A1&q=site%3Aphysics.stackexchange.com&btnG=Search&aq=f&aqi=&aql=&oq=&gs_rfai=)

It's no surprise really, but it seems like the more technical jargon used on a
site, the higher the reading level Google assigns.

------
Mithrandir
[https://encrypted.google.com/search?hl=en&tbs=rl%3A1&...](https://encrypted.google.com/search?hl=en&tbs=rl%3A1&q=site%3Areddit.com)

and

[https://encrypted.google.com/search?hl=en&tbs=rl%3A1&...](https://encrypted.google.com/search?hl=en&tbs=rl%3A1&q=site%3Aen.wikipedia.org%2Fwiki%2F)

~~~
dantle
[https://encrypted.google.com/search?hl=en&safe=off&t...](https://encrypted.google.com/search?hl=en&safe=off&tbs=rl%3A1&q=site%3Asimple.wikipedia.org%2Fwiki%2F&)

So much for Simple English wikipedia!

~~~
G_Wen
Interesting. Perhaps in their effort to avoid short scientific terms they have
substituted longer, more common words — for example, describing a tumor as "an
abnormal new mass of tissue that serves no purpose".
<http://wordnetweb.princeton.edu/perl/webwn?s=tumor> The seemingly paradoxical
result is an increase in understandability but a decrease in usability. Think
of it as reading a passage where almost every noun has been replaced by a
dictionary definition.

------
redthrowaway
Well, it's official: 4chan is smarter than us.

[http://www.google.com/search?q=site:news.ycombinator.com&...](http://www.google.com/search?q=site:news.ycombinator.com&hl=en&tbs=rl:1#sclient=psy&hl=en&tbs=rl:1&q=site:4chan.org&aq=f&aqi=&aql=&oq=&gs_rfai=&pbx=1&fp=1bde53b2ade8e603)

~~~
citricsquid
Well, 4chan.org is just the homepage; if you link to the actual boards
(specifically /b/):
[https://encrypted.google.com/search?hl=en&tbs=rl:1&q...](https://encrypted.google.com/search?hl=en&tbs=rl:1&q=site:boards.4chan.org/b/)

~~~
redthrowaway
You, sir, have made my day.

~~~
citricsquid
There's hope for us yet!

~~~
olalonde
You guys are not really helping.

~~~
redthrowaway
LOL XD

------
Alex3917
At the risk of stating the obvious, you can get the results for the reading
level of your own HN comments by adding your user name. It's kind of
interesting going through the leader board and looking up different people's
scores.

~~~
olalonde
More accurately, it gives the reading level of all discussions in which you
commented.

------
fluidcruft
Redditors will flip their buckets when they see:

[http://www.google.com/search?q=site:reddit.com&hl=en&...](http://www.google.com/search?q=site:reddit.com&hl=en&num=10&lr=&ft=i&cr=&safe=images&tbs=rl:1)

[http://www.google.com/search?q=site:digg.com&hl=en&n...](http://www.google.com/search?q=site:digg.com&hl=en&num=10&lr=&ft=i&cr=&safe=images&tbs=rl:1)

------
rchowe
I suspect this is because the algorithm measures word use and not grammar. HN
tends to have people who follow the "simple is better" mantra and who write in
a tone similar to either an executive summary or an online dialog, thus an
accessible reading level.

------
duck
According to Google a HN thread about vomiting is "advanced".

<http://news.ycombinator.com/item?id=315490>

------
harshpotatoes
I think it's worth noting that most of the articles submitted to HN have no
comments on them. If those pages without comments were indexed by Google and
included in its measure of our reading level, there would be many pages
consisting of only the words "flag, 1 point by xxxx, no comments, Hacker News,
new, threads..." etc. So maybe that has some influence on Google's
calculation.

------
lpolovets
It would be cool to be able to compare two sites on one page, as with Google
Trends.

A few examples that I thought were neat:

\- msnbc (44/55/1) vs bbc.co.uk (15/82/2)

\- facebook (40/37/22) vs linkedin (2/90/6) (I wonder why Facebook has so much
"advanced" content according to Google)

\- wordpress (35/47/16) vs xanga (76/23/1)

\- boston college (5/41/53) vs harvard (2/6/91)

------
carucez
I got to thinking, how would America's universities compare? It's an
interesting thought: The best universities should have the most advanced
reading level.

I've published my findings, a complete ranking, and source code:
[http://log.largevoid.com/2010/12/ranking-colleges-by-reading...](http://log.largevoid.com/2010/12/ranking-colleges-by-reading-level/)

------
cmelbye
I like how the result for "noobcomments" is marked as a Basic Reading Level.

~~~
jpwagner
yeah, but "best" is 100% intermediate

~~~
sp332
As simple as possible, but not simpler :-)

------
steveklabnik
It would be equally interesting to only see the reading level of the articles,
rather than the comments.

~~~
iwwr
An elevated reading level is not necessarily a sign of more thoughtful or
insightful comments. It could just be contrived banality disguised as
conceptual depth.

~~~
j_baker
"contrived banality disguised as conceptual depth"

Is it possible this phrase is an example of itself? :-)

~~~
astrofinch
Words that describe themselves: short, convoluted, awkward, multilingual.

~~~
gjm11
Is the word (assuming it to be one) "non-self-describing" self-describing, or
non-self-describing?

<http://en.wikipedia.org/wiki/Grelling-Nelson%20paradox>

------
aaronblohowiak
go to ltu for advanced reading, here for startups.

