I just looked, and HN had well over 60M page views that month. The Reddit number is likely way too small as well.
Pretty sure that in Dec 2014, monthly page views for reddit were near a billion and HN's were in the tens of millions.
I was actually impressed by the methods they used. I found myself thinking "this is what I'd really like to see," and then they'd report it. Validating their method on the MusicLab data seemed critical to me, as did examining reddit resubmissions versus YouTube views.
Although I thought that, methodologically, it was almost as well done as it could have been outside of an experiment, I disagreed with the authors' conclusions. They acknowledge some of the problems, such as the huge number of forgotten posts they didn't model at all, but not others.
For example, it seems the question of most interest is: given an observed post score, what's the actual "quality"? If you look at, say, Figure 3, it's apparent that there's huge variability in quality conditional on score, and it grows as observed score increases.
I think the correlational-style relationship they focus on obscures things like this that are critical to interpreting the findings. Yes, there's a strong estimated relationship between quality and score, if you ignore the missing data that constitutes the bulk of submissions, the fact that the relationship is driven very strongly by a large mass of very low-"quality" posts versus everything else, and the variability everywhere else. It's an odd, heteroscedastic, nonlinear relationship that isn't well captured by a correlation, even a nonparametric one.
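To make that concrete, here's a toy simulation of my own (my assumptions, not the paper's data or model): a big mass of low-quality posts plus multiplicative, quality-dependent noise produces a high rank correlation overall, even though quality varies wildly among high-scoring posts.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model (my assumption, not the paper's): 80% "forgotten"
# low-quality posts, 20% with quality spread across a wide range.
n = 10_000
low = rng.uniform(0, 1, int(0.8 * n))   # the low-quality mass
high = rng.uniform(1, 10, n - len(low))  # everything else
quality = np.concatenate([low, high])

# Score as a noisy, heteroscedastic function of quality:
# multiplicative lognormal noise, so spread grows with quality.
score = quality ** 2 * rng.lognormal(0.0, 1.0, n)

def spearman(x, y):
    # Rank correlation via Pearson on ranks (no ties here,
    # since both variables are continuous).
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

rho = spearman(quality, score)
print(f"overall Spearman rho: {rho:.2f}")

# But conditional on a high score, quality still varies a lot:
top = quality[score > np.quantile(score, 0.9)]
print(f"quality range in top decile of score: "
      f"{top.min():.1f} to {top.max():.1f}")
```

The overall rank correlation comes out strong, mostly because the low-quality mass separates cleanly from everything else, while the quality range within the top decile of scores stays wide. That's the sense in which a single correlation can look impressive and still tell you little about quality given a score.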
I also would have liked to see examination of variability in links across sites. How much variability is there in rank of an initial link, to the same material, across reddit, HN, Twitter, etc.? Maybe tellingly, the authors report the relationship between YouTube views and number of reddit submissions, but not the relationship (if I'm reading correctly) between YouTube views and rank of initial reddit submissions, which is kind of the key relationship.
So, I liked the paper, but if anything it just reconfirms for me the conclusions of earlier studies: that social network dynamics have a big influence on apparent popularity.
"We define quality as the number of votes an article would have received if each article was shown, in a bias-free way, to an equal number of users."
I haven't read the whole paper yet - but isn't that ignoring other major factors, like how "newsworthy" a particular link is? A low-quality link might get a lot of upvotes simply because it was the first link submitted that describes an inherently interesting event.
It seems unnecessary. They should've just used "estimated votes", since that is what they are, or something derived from votes.
"Quality" is almost content-free, and in the worst case is chosen in bad faith or out of hubris to make the result seem more important.