
Data Mining Hacker News: Front vs. Back - equilibrium
https://lettier.github.io/posts/2016-10-10-data-mining-hacker-news-front-vs-back.html
======
WhitneyLand
Really nice work David, good job.

One suggestion: For each important conclusion try to have at least one
sentence that is understandable by a business exec.

For example at first glance it looks like time of day may be significant, then
you conclude:

"Because the p-value is greater than the alpha value, we fail to reject the
null hypothesis that the two nominal categories are independent."

By adding after this something like "Therefore submitting articles at a
certain time of day is not an effective strategy to achieve front page
visibility.", your post gains accessibility.

This is not a nitpick. The idea is to make sure the full power of your
analysis is felt across a broad section of readers. Even if you send to tech
people, these things often find their way to a wider audience.

~~~
statsknowitall
Please don't use this language. Your suggested quote implies the null
hypothesis has been confirmed. In fact, the null hypothesis has simply not
been rejected. A better summary would be: "We did not find any proof that
submitting articles at a certain time of day is an effective strategy to
achieve front page visibility."
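
To make the wording concrete, here is a minimal sketch of a chi-square test of independence on a 2x2 table. The counts and the morning/evening split are made up for illustration and are not the article's data. The p-value uses the closed form for one degree of freedom, with no continuity correction.

```python
import math

def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the two categories were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: rows = front/back, columns = morning/evening.
observed = [[210, 215], [290, 280]]
alpha = 0.05
stat, p_value = chi_square_2x2(observed)
if p_value > alpha:
    verdict = "fail to reject independence (no evidence of an effect was found)"
else:
    verdict = "reject independence"
print(f"chi2={stat:.3f} p={p_value:.3f}: {verdict}")
```

A large p-value here only says the data did not show an effect; it does not show that no effect exists.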

~~~
WhitneyLand
Yep, my mistake. Main point stands: it helps to have some plain English.

~~~
statsknowitall
It's so easy to make mistakes when talking about this, with double and triple
negatives that don't necessarily cancel out. I made one myself in the above
comment before editing it.

------
willvarfar
Yeah these days it feels really random who ends up on the front page; there
are just too many stories being submitted, and too few people filtering them
:(

Back when I was blogging I crunched the HN stats and tried to draw
conclusions:
[http://williamedwardscoder.tumblr.com/post/18839832580/reddi...](http://williamedwardscoder.tumblr.com/post/18839832580/reddit-vs-hacker-news-vs-twitter)

~~~
3pt14159
Yeah, HN needs subreddits. I really do not care about the medium tier SV
people that get bought for $50m. I know they are friends with a lot of other
people here, but the only time I care about an acquisition is if it is
either over $500m or from somebody from Toronto that I might know.

Also, I'm much more interested in data science nuts and bolts or technology's
impact on foreign policy than I am about CSS or Go frameworks or libraries.

~~~
willvarfar
What HN really needs is self-organisation :) By remembering your votes and
views and doing a bit of collaborative filtering the site can give you a
'filter bubble' where things you are quite likely to like float towards the
top...
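
For readers unfamiliar with collaborative filtering, a minimal user-based sketch might look like the following. This is not the prototype mentioned in this thread; the users, story ids, and the choice of Jaccard similarity are all assumptions for illustration.

```python
def jaccard(a, b):
    """Overlap between two sets of upvoted story ids (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def personalised_scores(user, votes):
    """Score stories the user has not voted on by how similar their voters are."""
    mine = votes[user]
    scores = {}
    for other, theirs in votes.items():
        if other == user:
            continue
        weight = jaccard(mine, theirs)
        for story in theirs - mine:
            scores[story] = scores.get(story, 0.0) + weight
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical users and story ids, purely for illustration.
votes = {
    "alice": {"stats-post", "hn-analysis"},
    "bob": {"stats-post", "hn-analysis", "chi-square-guide"},
    "carol": {"css-framework", "go-library"},
}
ranking = personalised_scores("alice", votes)
print(ranking)
```

Stories liked by users with similar voting histories float to the top, while stories from dissimilar users still appear with a score of zero.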

I prototyped and wrote a blog post about that too:
[http://williamedwardscoder.tumblr.com/post/15581427232/self-...](http://williamedwardscoder.tumblr.com/post/15581427232/self-organizing-reddit) ;)

~~~
sdrothrock
> By remembering your votes and views and doing a bit of collaborative
> filtering the site can give you a 'filter bubble' where things you are quite
> likely to like float towards the top...

The lack of such a "filter bubble" is why I actually treasure HN more than
other similar news aggregation sites; it exposes me to ideas, sites, and
viewpoints outside of those I'm normally interested in, whether I agree with
them or not. :)

~~~
willvarfar
HN is a filter bubble, it's just a collective filter bubble and not
personalised.

(My prototype split above the fold into "unfiltered unrated" stories and the
filtered stories, so everyone was always exposed to some irrelevancies.)

------
minimaxir
A couple years ago, I did my own analysis of all Hacker News submissions
([http://minimaxir.com/2014/02/hacking-hacker-news/](http://minimaxir.com/2014/02/hacking-hacker-news/))
and also wrote a script around that time to get all data
([https://github.com/minimaxir/get-all-hacker-news-submissions...](https://github.com/minimaxir/get-all-hacker-news-submissions-comments),
see also a modern dataset on Kaggle derived from it:
[https://www.kaggle.com/hacker-news/hacker-news-posts](https://www.kaggle.com/hacker-news/hacker-news-posts)).
I only looked at the # of points as a metric for quality, so front v. back
with this approach is interesting. Given the good work in this post, I may
take another look at the data myself.

This is a case where the sample size used may be problematic. "425 fronts
against 570 corresponding backs" (n = 995), in the grand scheme of Hacker
News, is not a lot, even if statistical analysis permits it (example: the by-
hour Chi-Sq test, which barely hits the 5-per-cell assumption). Given the
method of collection by scraping the front page directly, this is
understandable, though.
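
As a rough check of the 5-per-cell rule of thumb, the sketch below uses only the 425/570 totals quoted above; the per-hour counts themselves are not known here, so it works out the smallest hourly total that still keeps the expected front count at 5 or more.

```python
import math

fronts, backs = 425, 570   # totals quoted from the article
n = fronts + backs         # 995 submissions overall
min_expected = 5           # common rule of thumb for chi-square cells

def expected_count(row_total, col_total, total):
    """Expected cell count under independence."""
    return row_total * col_total / total

# The front row is the smaller one, so it is the binding constraint:
# fronts * hour_total / n >= 5  =>  hour_total >= 5 * n / fronts.
min_hour_total = math.ceil(min_expected * n / fronts)
print(min_hour_total)  # 12
```

So any hour with fewer than 12 submissions in the sample would push its expected front count below 5.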

However, that presents a problem. _The front-page algorithm has changed_ in
recent months and I myself have had difficulty predicting what makes the front
page and what doesn't (and what ends up making the front page _hours after
being submitted for no reason_). With relatively new features like the
second-chance pool and explicit dupe marking, there is new quality control of
the front page thanks to dang/sctb. That is another issue of looking at a
small subset of HN data; it does not reflect the site as a whole, although
looking at more-recent data might be more beneficial for optimizing one's own
posts.

~~~
gus_massa
> _(and what ends up making the front page hours after being submitted for no
> reason)._

Some posts are "resubmitted" by the mods using some kind of manual curation.
They appear a few hours later. I sometimes notice this with a comment in an
obscure submission with 2 or 3 points that falls from the newest page, but a
few hours later the submission gets to the front page and then the comment can
get a few upvotes.

I think the clearest description of this by dang is in:
[https://news.ycombinator.com/item?id=10705926](https://news.ycombinator.com/item?id=10705926)

------
lewisjoe
Great job! I've been working on a related tool that could be useful as well.

[http://hnlive.tk/static/index.html](http://hnlive.tk/static/index.html) is a
"live" HN activity meter.

I wrote it for myself. Anytime before posting to HN, I use it to decide if the
activity on the site is high enough. Right now the graph says, current time
has the highest activity spike in the past 24 hours.

It's far from done. I have yet to plot answers to a few more common questions,
backed by realtime data. Like say,

+ Which weekday had the highest activity, last week?

+ Which weekday usually has high activity?

+ What time slot last week had the highest activity spike?
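
A minimal sketch of how the first question could be answered from raw event timestamps is below. It is not the tool's actual code; the timestamps are invented, and a real version would read them from collected site activity.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

def busiest_weekday(events):
    """Return (weekday name, event count) for the most active weekday."""
    counts = Counter(WEEKDAYS[event.weekday()] for event in events)
    return counts.most_common(1)[0]

# Hypothetical activity: three events on a Monday, one on the Tuesday after.
monday = datetime(2016, 10, 10, 9, 0, tzinfo=timezone.utc)
events = [
    monday,
    monday + timedelta(hours=1),
    monday + timedelta(hours=5),
    monday + timedelta(days=1),
]
day, count = busiest_weekday(events)
print(day, count)  # Monday 3
```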

------
tonylemesmer
I generally only visit the "new" stories page if I don't find many interesting
front page items. So I wonder if there is a correlation there. Amount of
browsing time available vs. promotion of new items to front page.

------
gus_massa
I like the analysis, but I wonder if the criteria are weak enough to detect the
dupes from medium.com , because medium adds some tracking crap to the URL that
confuses the dupe detector of HN. For example see:
[https://hn.algolia.com/?query=I%20Peeked%20into%20My%20Node_...](https://hn.algolia.com/?query=I%20Peeked%20into%20My%20Node_Modules%20Directory%20and%20You%20Wont%20Believe%20What%20Happened%20Next&sort=byDate&dateRange=all&type=story&storyText=false&prefix=false&page=0)
(this list doesn't include many dupes that were detected and marked).
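
One way a looser dupe check could work is to normalise URLs before comparing them. The sketch below assumes the varying part lives in the URL fragment, which is a guess; the example URLs and fragment values are invented.

```python
from urllib.parse import urlsplit, urlunsplit

def canonical_url(url):
    """Drop the fragment and lowercase the host before comparing URLs."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, parts.query, "")
    )

# Hypothetical URLs that differ only in an invented fragment value.
first = "https://medium.com/@someone/some-post-1a2b3c#.abc123"
second = "https://medium.com/@someone/some-post-1a2b3c#.xyz789"

print(first == second)                                  # False
print(canonical_url(first) == canonical_url(second))    # True
```

If the varying part were a query parameter instead, the same idea would apply with a filter over `parts.query`.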

A problem with this analysis is that it doesn't count the dupes that never had
a sibling that got to the front page. Counting them would modify the
distribution of some domains and submitters.

~~~
encoderer
I think that is actually a document version, nothing to do with tracking.

~~~
mutagen
The first number might be a document version, the second is most certainly a
tracking number. Load that first link in a few successive tabs and see how you
get a different one each time.

------
iraldir
While it's a very interesting analysis, it kinda reinforces the idea that it's
down to luck. Sure, you can make your post on the weekend to increase your
chances (even though I don't understand that: given there is only so much room
on the hot page, if everyone is more likely to go on the top page then no one
is). I think it's just a matter of people going to the "new" section tending
to upvote the links that are already upvoted. The only way to increase that is
to artificially bump up the upvotes by asking friends from different parts of
the world to upvote your article while it's in the "new" section (note that
you cannot give them a link to your post directly, or their vote won't be
taken into account).

~~~
zamalek
> Sure you can make your post on the weekend to increase your chances

If everyone did this, it would increase the impact of luck: there would be an
increased rate at which posts would fall off the front page of /newest -
diminishing the chances that someone would see the post and upvote it.
Conversely, with the consequential lower post-rate during the week your posts
would earn more eyeballs.

------
keyle
Impressive research. I don't really mind the repost if the article did in
fact sink into oblivion when we should have paid attention. A gentle reminder
for everyone to sometimes visit the 'new' section and upvote the interesting
parts.

Maybe an AI data mining process could know what's interesting based on....
wait, no, that's a bad idea :)

------
jstanley
Good investigation but would be nice to see more in the way of conclusions
that can be drawn.

------
brador
I wonder if you could analyze this data to extract moderation information (for
example when mods changed, or when mod activity level changed). It would be
interesting to identify data spikes, and try to understand why.

------
MeteorMarc
I think it would be nice to add the time delay upon the first upvote as a
feature in the analysis. Whenever checking the "new" page, I tend to look at
the items which already had an upvote.

------
chirau
Monitored tag filters, like on [https://lobste.rs/](https://lobste.rs/), would
be great for HN. There is too much randomness on HN now.

------
gcr
To what extent will the publication of this article change HN trends to make
its conclusions invalid?

If everyone reads this article and then follows its recommendations, wouldn't
HN posting strategy change?

------
overcast
Side discussion, does anyone know what type of code highlighting library they
are using for this? Looks like server side processing, and then outputs the
html/css.

~~~
obituary_latte
>pandoc.css

Looks like it might be called pandoc

[http://pandoc.org/demo/example18g.html](http://pandoc.org/demo/example18g.html)

~~~
overcast
Duh, not sure how I missed that last line of <head>. Thanks!

------
dredmorbius
Interesting study, it suggests a few dynamics.

The weekday data show a high back-page rate for Tuesday, and a high frontpage
rate for weekend posts (Saturday/Sunday). This suggests to me that the total
_volume_ of posts, a statistic not presented (that I noticed) might have some
bearing. Specifically, many PR firms and other seekers of publicity tend to
target Tuesday morning for positive items, as these beat the Monday rush (and
blues), but allow for time to process during the rest of the week. And
professional submitters are going to be quiet on weekends. If I had to hazard
a guess, I'd suggest that HN attracts a significant amount of direct or
indirect PR blitzing. My thought is that PR pieces are, in general, less
likely to be voted to front page than organic content -- where PR includes
low-quality blog, YouTube, marketing, and similar type content.

The time-of-day analysis suggests something similar. Traffic begins to pick up
at about 0400 system time, which is US/Pacific. That would be 7am East Coast
(morning breakfast/commute) and around midday in Europe, suggesting there's
traffic arriving from those locations. There's also a pretty noticeable _dip_
in backs ratio around the noon hour, plus or minus, and a slight increase in
the early afternoon. Again, PR / SEO content might take a mid-day break within
the US.

As for "new" page reconfigurations, a concern I've had is that as submissions
increase, the latency of any given item on the page decreases -- well under an
hour at peak times. Odds of even a good item collecting upvotes are small.
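
The relationship is simple arithmetic: time on the first "new" page is roughly page size divided by submission rate. The sketch below assumes 30 items per page and uses invented submission rates, since no volume statistics are given here.

```python
def minutes_on_first_page(submissions_per_hour, page_size=30):
    """Approximate minutes before an item is pushed off the first page."""
    return 60 * page_size / submissions_per_hour

# Hypothetical submission rates, for illustration only.
for rate in (20, 60, 120):
    print(rate, minutes_on_first_page(rate))
```

Under these assumptions, any rate above 30 submissions per hour leaves an item visible for less than an hour.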

An alternative presentation might be to randomly shard submissions such that
each is present on the page for at least some period of time, _for some
fraction of HN users_. A hash of UID (or some other arbitrary value) and shard
assignment, weighted by the predicted voting on the item, would present _each_
unvoted and low-voted submission to a small set of users, but over a longer
period of time, while increased positive votes would expand the exposure
category. The idea being that each piece has a more realistic opportunity for
exposure. Flags would remove from scores.
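
A minimal sketch of the sharding idea follows. The exposure schedule, the bucket count, and the use of current points rather than predicted votes are all assumptions made for illustration; flags are not modelled.

```python
import hashlib

def bucket(user_id, item_id, buckets=100):
    """Deterministically map a (user, item) pair to a bucket in [0, buckets)."""
    digest = hashlib.sha256(f"{user_id}:{item_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % buckets

def exposure_percent(points, base=5, step=10, cap=100):
    """Hypothetical schedule: 5% of users at 1 point, +10% per extra point."""
    return min(cap, base + step * max(0, points - 1))

def is_visible(user_id, item_id, points):
    """Show the item only to users whose bucket falls inside the exposure."""
    return bucket(user_id, item_id) < exposure_percent(points)

# Count how many of 10,000 hypothetical users would see item 12345.
shown_new = sum(is_visible(uid, 12345, points=1) for uid in range(10_000))
shown_voted = sum(is_visible(uid, 12345, points=6) for uid in range(10_000))
print(shown_new, shown_voted)
```

Because a user's bucket for a given item never changes, anyone who saw the item at a low score keeps seeing it as its exposure widens.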

HN does a good job of (usually) promoting quality and interesting content. It
does have a high false-negative rate, in not promoting good content, which is
a problem. On the other hand, _there are very real limits to how much content
a person can handle in a day_, and simply opening the firehose wider isn't a
viable solution.

Based on counts of daily emails from Stephen Wolfram and Walt Mossberg, and
_The New York Times_ moderation desk volumes, I'm seeing ~150 - 300 emails, or
<800 comment moderations, per day, as something of a pretty consistent upper
bound to meaningful content interaction, and that 800 is a pretty low value of
"meaningful" at about 36 seconds per item. HN's front page with 30 solid
articles is a pretty reasonable target for deeper material.
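
The 36-second figure can be checked with quick arithmetic. The eight-hour day is inferred from the numbers rather than stated, and the per-article figure for a 30-item front page is an extension under that same assumption.

```python
SECONDS_PER_HOUR = 3600

items_per_day = 800
seconds_per_item = 36

# Total time needed to handle 800 items at 36 seconds each.
hours_needed = items_per_day * seconds_per_item / SECONDS_PER_HOUR
print(hours_needed)  # 8.0

# Same time budget spread over a 30-item front page instead.
front_page_items = 30
minutes_per_article = hours_needed * 60 / front_page_items
print(minutes_per_article)  # 16.0
```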

------
appleflaxen
the plot of differences would benefit from showing negative values - so that
fronts > backs is positive (like sat, sun) and backs > fronts is negative
(like tuesday).

The way it's currently shown (magnitude or abs value) requires a lot of
cognitive load to parse that could be intuitive.
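
A small sketch of the signed version is below, rendered as text bars so it needs no plotting library. The weekday counts are invented for illustration and are not the article's figures.

```python
# Hypothetical per-weekday counts, invented to illustrate the idea.
fronts = {"Mon": 60, "Tue": 55, "Sat": 70, "Sun": 65}
backs = {"Mon": 80, "Tue": 110, "Sat": 50, "Sun": 45}

def signed_differences(fronts, backs):
    """Fronts minus backs per day: positive means more fronts than backs."""
    return {day: fronts[day] - backs[day] for day in fronts}

def text_bar(value, scale=5):
    """Render a signed value as '+###' or '-###', one '#' per `scale` units."""
    sign = "-" if value < 0 else "+"
    return sign + "#" * (abs(value) // scale)

diffs = signed_differences(fronts, backs)
for day, diff in diffs.items():
    print(f"{day} {diff:+4d} {text_bar(diff)}")
```

With the sign kept, days where backs outnumber fronts point the other way at a glance.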

------
immixG
This was very interesting - a bit confusing but helpful!

------
michaelknight
would like to see the next version in a week or two to see how your article
affected the numbers.

(I'm sure from now on we will see a spike in posts on Tue at 6 and 11)

------
Exuma
Very well put together post. Great work.

------
Raphmedia
Why is that "posts" button moving so much? I can't focus on the text at all. I
had to inspect the page and remove the animation in order to be able to focus.

~~~
Raphmedia
If anyone else is bothered by that, paste this in your Chrome console:

jQuery('.shaker').removeClass('shaker');

~~~
thenewwazoo
Thanks for this; it was the only way I could read the (very interesting!)
article.

