This was an extremely interesting paper to me, about a topic that I see as economically and sociologically fundamental.
I was actually impressed by the methods they used. I found myself thinking "this is what I'd really like to see," and then they'd report it. Validating their method on the MusicLab data seemed critical to me, as did examining reddit resubmissions versus YouTube views.
Although I thought that, methodologically, it was almost as well done as it could have been outside of an experiment, I disagreed with the authors' conclusions. They acknowledge some of the problems, such as the huge number of forgotten posts they didn't model at all, but other issues they don't address.
For example, the question of most interest seems to be: given an observed post score, what's the actual "quality"? If you look at, say, Figure 3, it's apparent that there's huge variability in quality conditional on score, especially as observed score increases.
I think the correlational-style relationship they focus on obscures things like this that are critical to interpreting the findings. Yes, there's a strong estimated relationship between quality and score, if you ignore all the missing data that constitutes the bulk of submissions, the fact that the relationship is driven very strongly by a large mass of very low-"quality" posts versus everything else, and the variability everywhere else. It's an odd, heteroscedastic, nonlinear relationship that isn't well captured by a correlation, even a nonparametric one.
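To illustrate that last point with a toy simulation (made-up numbers and distributions, nothing to do with the paper's actual data): a big mass of near-zero-quality posts plus multiplicative noise can give a fairly strong Spearman correlation even though quality varies enormously at any given score.

    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(0)
    n = 100_000

    # Hypothetical "quality": a large forgotten mass near zero plus a smaller,
    # widely spread remainder (illustrative shape only, not the paper's data).
    quality = np.where(rng.random(n) < 0.8,
                       rng.exponential(0.5, n),
                       rng.exponential(20.0, n))

    # Observed score: grows with quality on average, but with multiplicative
    # noise, so the spread grows too (heteroscedastic and nonlinear).
    score = rng.poisson(quality * rng.lognormal(0.0, 1.0, n))

    rho, _ = spearmanr(quality, score)
    print(f"Spearman rho: {rho:.2f}")

    # Spread of quality among posts with similar observed scores.
    for lo, hi in [(10, 20), (100, 200), (500, 1000)]:
        q = quality[(score >= lo) & (score < hi)]
        if len(q) > 10:
            print(f"score {lo}-{hi}: quality 10th/90th percentile = "
                  f"{np.percentile(q, 10):.1f} / {np.percentile(q, 90):.1f}")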
I also would have liked to see an examination of variability across sites. How much variability is there in the rank of an initial link to the same material across reddit, HN, Twitter, etc.? Maybe tellingly, the authors report the relationship between YouTube views and the number of reddit submissions, but not (if I'm reading correctly) the relationship between YouTube views and the rank of initial reddit submissions, which is kind of the key relationship.
So, I liked the paper, but if anything it just reconfirms to me the conclusions of earlier studies: that social network dynamics have a big influence on apparent popularity.
"We define quality as the
number of votes an article would have received if each article
was shown, in a bias-free way, to an equal number of users."
I haven't read the whole paper yet, but isn't that ignoring other major factors, like how "newsworthy" a particular link is? A low-quality link might get a lot of upvotes simply because it was the first link submitted describing an inherently interesting event.
“In a bias-free way” needs careful definition. That ought to include the order in which stories are shown (so it would eliminate any advantage of being the first link posted relating to a specific event).
"Intrinsic quality" is a terrible name - it should be something like "decontextualised quality" or "neutrally presented quality", because it's still an aggregate subjective view on the quality of the article.
I would call this more of a stag hunt: https://en.wikipedia.org/wiki/Stag_hunt There's a tension between spending your time helping vote on /newest to get stuff onto the main page, where it is then accurately ranked, and just slightly tweaking the ranking on the main page while enjoying the overall fruits.
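Roughly, with purely illustrative payoffs (my numbers, chosen only to show the stag-hunt structure):

    # Toy stag-hunt payoffs for the /newest-vs-frontpage tension.
    # Strategies: "newest" = spend time voting on /newest,
    # "frontpage" = only tweak the frontpage ranking.
    payoff = {  # my payoff given (my move, the crowd's move)
        ("newest",    "newest"):    4,  # good stories surface; everyone wins
        ("newest",    "frontpage"): 1,  # my /newest effort is wasted alone
        ("frontpage", "newest"):    3,  # I free-ride on others' curation
        ("frontpage", "frontpage"): 2,  # nobody curates, frontpage still usable
    }

    for crowd in ("newest", "frontpage"):
        best = max(("newest", "frontpage"), key=lambda me: payoff[(me, crowd)])
        print(f"if the crowd plays {crowd!r}, my best response is {best!r}")
    # Both all-"newest" and all-"frontpage" are equilibria, which is the
    # stag-hunt tension (as opposed to a pure free-rider problem).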
I just looked, and HN had well over 60M page views that month. The Reddit number is likely way too small as well.