
A Statistical Analysis of All Hacker News Submissions - minimaxir
http://minimaxir.com/2014/02/hacking-hacker-news/
======
pg
At one point we figured out how to stop spam submissions almost completely.
That was probably what happened at the end of 2011; the timing would be about
right.

~~~
d23
I don't suppose there's an explanation we could have? :)

~~~
pg
Unfortunately like most of our anti-abuse measures it's surprisingly simple
and would be easy to circumvent.

~~~
wslh
Sorry to hijack the thread, but now that you'll have more time on your
hands... can we have an option to download our data from HN? I mean my
submissions, saved articles, and comments. Thanks!

~~~
minimaxir
You can use the API to download your submissions and comments extremely
easily, too.

~~~
wslh
Which official API? And... saved articles are not public. If you're referring
to the new hn.algolia.com, it's far (rate limits?) from being a data
liberation initiative. Even Google and Facebook do much better.

~~~
minimaxir
With the Algolia API, you can request 1000 stories or 1000 comments per
request. I don't think you'll hit the rate limit. :P

Here are your 1000 out of your 1248 comments:
https://hn.algolia.com/api/v1/search_by_date?tags=comment,author_wslh&hitsPerPage=1000

And here are 1000 out of your 1548 submitted stories:
https://hn.algolia.com/api/v1/search_by_date?tags=story,author_wslh&hitsPerPage=1000

You can paginate each endpoint on the created_at_i parameter to get the rest.
I can write up a data liberation script if you want.
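A minimal sketch of that pagination, using the public hn.algolia.com Search API's `numericFilters` parameter on `created_at_i` (the field names here come from the API; the helper names are my own):

```python
import json
import urllib.parse
import urllib.request

API = "https://hn.algolia.com/api/v1/search_by_date"

def page_url(author, tag, before=None, per_page=1000):
    """Build a search_by_date URL for one page of a user's items."""
    params = {"tags": f"{tag},author_{author}", "hitsPerPage": per_page}
    if before is not None:
        # Only items strictly older than the last one we've already seen.
        params["numericFilters"] = f"created_at_i<{before}"
    return API + "?" + urllib.parse.urlencode(params)

def fetch_all(author, tag="comment"):
    """Walk backwards through time until a page comes back empty."""
    items, before = [], None
    while True:
        with urllib.request.urlopen(page_url(author, tag, before)) as resp:
            hits = json.load(resp)["hits"]
        if not hits:
            return items
        items.extend(hits)
        before = hits[-1]["created_at_i"]
```

`fetch_all("wslh", "story")` would then pull every submission, 1000 at a time.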

~~~
wslh
Thanks! I'd also need to web-scrape HN to retrieve my saved articles. It would
be useful if the API accepted user credentials.

------
mturmon
You have a green heat map of

    #submissions(time)

where time is 1-hour slots across 7 days. You also have a red heat map of

    #successful_submissions(time)

where successful is > 100 points. I think what you want is a third map which
is the ratio,

    #successful_submissions / #submissions

which would be the empirical probability of a submission being successful,
given the submission time. The raw counts don't tell you this.

(If you have a zero in the #submissions bin at some time, this will give 0/0,
so you might want to put in a "Laplace correction" which is to add 1 count to
each #submissions bin. There are other adjustments you can use, but this would
be good enough for the purpose of the plot.)
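The ratio map with the Laplace correction can be sketched in a few lines of numpy (the counts here are simulated toy data, not real HN numbers):

```python
import numpy as np

# Toy counts: rows = day of week (7), cols = hour of day (24).
rng = np.random.default_rng(0)
submissions = rng.poisson(20, size=(7, 24))
successes = rng.binomial(submissions, 0.05)  # "successful" = >100 points

# Laplace correction: add 1 to each #submissions bin, so an empty bin
# yields a probability of 0 instead of the undefined 0/0.
p_success = successes / (submissions + 1)
```

Plotting `p_success` as a heat map gives the empirical probability of success conditional on submission time.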

~~~
karpathy
I did a similar analysis to the one posted here and computed a heat map much
like the one you describe, but I marked a submission as successful when it
went from new -> front page, not when it hit 100 points. The result is near
the middle of the post, and it seems that weekends offer the best odds of a
story making it to the front.

http://karpathy.ca/myblog/2013/11/27/quantifying-hacker-news-with-50-days-of-data/

and the raw ipython notebook with too many details:
[http://cs.stanford.edu/people/karpathy/hn_analysis.html](http://cs.stanford.edu/people/karpathy/hn_analysis.html)

~~~
mturmon
Thanks very much. And on a log scale too!

As you noticed, weekends not only offer a significantly better chance of
making the front page; the mid-morning weekday peak also seems to generate
enough competition that submissions have a hard time making it.

This contradicts an assertion in the OP: "Your odds are slightly better when
submitting at peak activity (weekdays at 12 PM EST / 9 AM PST)." The problem
is that the OP never actually calculated those odds.

------
gmisra
Obligatory repost - "Hacking Hacker News Headlines" from May 2011, examining
the significance of language in story headlines:

http://metamarkets.com/2011/hacking-hacker-news-headlines/

------
davidw
Interesting - I'd love to see the number of stories on the front page about
politics over time. Is it really growing, or does it just seem that way?

~~~
x0054
Keep in mind that the biggest tech story of last year was also its biggest
political story. The numbers would therefore be skewed toward a rise in
political stories, but that could simply be due to the overlap.

~~~
davidw
Presumably you could do stats with and without that one.

------
Fomite
Almost lost me with the word clouds, but I'm glad I soldiered on. An
interesting look at the patterns behind HN.

------
karangoeluw
> …so Lisp and Erlang are well-liked on HN.

Umm... Maybe not? What if a post is titled "I don't like Lisp. Go Python!",
and it hit the front page? How exactly do you infer the language being talked
about?

~~~
minimaxir
Here's the data set of all submissions containing Lisp or Erlang in their
title:
https://docs.google.com/spreadsheets/d/1tnYpawKHOg7K1eKMaERwHrT7yXbHNP0GMNM6RS7Cu80/edit?usp=sharing

There are a few negative mentions, but they're in the minority.
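One crude way to screen for such negative mentions is a keyword pass over the titles; the cue list below is a made-up illustration, not a real sentiment classifier:

```python
# Flag titles whose wording suggests a negative take on the language.
# NEGATIVE_CUES is a hypothetical, hand-picked list for illustration.
NEGATIVE_CUES = ("don't like", "hate", "considered harmful", "sucks")

def looks_negative(title):
    t = title.lower()
    return any(cue in t for cue in NEGATIVE_CUES)

titles = [
    "Why I love Lisp",
    "I don't like Lisp. Go Python!",
    "Erlang considered harmful",
]
negatives = [t for t in titles if looks_negative(t)]
```

Anything more nuanced (sarcasm, mixed reviews) would need an actual sentiment model.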

------
Houshalter
http://minimaxir.com/img/hn-points-hist.png

The wealth distribution of HN is awful. The rich get richer: every point a
story earns moves it closer to the front page, where it picks up
exponentially more points.

~~~
DanBC
It does make me wonder what great links I'm missing because they only got a
few upvotes.

Upvoting articles on New only goes so far. Other people have to stop upvoting
fluff.

Not sure what a solution would be.

~~~
Houshalter
A solution has been proposed here:
http://www.bayesianwitch.com/blog/2013/why_hn_shouldnt_use_randomized_algorithms.html

Basically: move some new articles closer to the front page to give them more
exposure, in order to find the ones that are actually best. More exploration
and less exploitation, with the aim of finding the optimal tradeoff between
the two.
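One standard way to frame that tradeoff is an epsilon-greedy policy (my choice of formulation here, not necessarily the one the linked post advocates):

```python
import random

def pick_slot_story(ranked, new_pool, epsilon=0.1, rng=random):
    """Fill a front-page slot: with probability epsilon, 'explore' by
    showing a random story from the new pool; otherwise 'exploit' by
    showing the current top-ranked story."""
    if new_pool and rng.random() < epsilon:
        return rng.choice(new_pool)
    return ranked[0]
```

Tuning `epsilon` then controls how much front-page real estate is spent discovering undervalued submissions versus rewarding proven ones.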

------
thanatropism
This is not statistical analysis; it is "descriptive statistics" at best.
This:

> One of the infamous memes about Hacker News is programming language elitism,
> with favoritism for languages such as Lisp and Erlang.

> Lisp and Erlang are indeed obscure, which might discredit the meme.

is the exact opposite of analysis. If it were found that 40% of HNers were
left-handed, HN would be noted as a particularly left-handed website, since
the base rate in the general population is a fraction of that.

------
plg
should include "posts about analyzing HN posts" as a category

how meta

------
pdevr
Nicely done.

1. What did you use to generate the graphs?

2. While analyzing JavaScript, were submissions related to Angular,
Bootstrap, Require, etc. classified as JavaScript?

~~~
minimaxir
1. Plots were made using R and ggplot2. (Additionally, the charts were
rendered on a Mac; rendering line charts on Windows doesn't work very well.)

2. To maintain an apples-to-apples comparison, I only checked for the
presence of a language, not of any frameworks.

~~~
pdevr
1. Thanks for the answer and the tip about rendering the charts.

2. I guess that is a practical approach - otherwise, it would have gotten too
complex with all the frameworks, tools, and technologies.

------
gtirloni
So now our bosses have a pretty chart showing that we don't work.

Weekends being dead doesn't help, folks!

------
aaronsnoswell
With the NSA graph, it's worth noting that HN posts with 'NSA' or 'Snowden'
in the title are known to be downgraded by the site's ranking algorithm.
Can't remember where the source for this is right now.

~~~
csandreasen
'NSA' is, but 'Snowden' is not (or at least wasn't noted as such in the
writeup).

See here: http://www.righto.com/2013/11/how-hacker-news-ranking-really-works.html
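For reference, the ranking described in that post is roughly gravity-decayed points times multiplicative penalties; a sketch (the exponents follow the post's description, the example penalty value is hypothetical):

```python
def rank_score(points, age_hours, penalty=1.0):
    """Approximate HN ranking as described in the righto.com post:
    points decay with age ('gravity'), and topic penalties multiply
    the score by a factor below 1."""
    return penalty * (points - 1) ** 0.8 / (age_hours + 2) ** 1.8

# A hypothetical 0.2 penalty factor sinks an otherwise identical story.
plain = rank_score(points=100, age_hours=3)
penalized = rank_score(points=100, age_hours=3, penalty=0.2)
```

Because the penalty is multiplicative, a penalized story needs far more points to hold the same rank.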

------
JacobAldridge
It would be interesting to see the distribution of Erlang posts over time -
specifically, what portion of the 1,189 submissions came on Erlang Day (and
its 1-2 sequels)?

------
AznHisoka
What software did you use for those pretty charts?

