
Websites That Feed Hacker News: Top Sources of Submissions by Median Score - anton_tarasenko
https://github.com/antontarasenko/smq/blob/master/reports/hackernews-top-domains-by-median.md
======
danso
This is a pretty good example of how certain metrics aren't always relevant to
reality or, at least, don't match the headline. The word "feed" implies that HN
depends on the contributions from/links to these sites, but most users of HN
would argue that domains such as github.com, github.io, nytimes.com, etc. are
far more prevalent and important to HN than virtually any of the domains
listed here. HN depends on daily traffic...It's not that the sites with high
medians aren't _good_ , but they don't "feed" HN...Median score in this
context is a trivial metric. Number of top stories, daily, by domain would be
far more relevant in showing what "feeds the beast", as they say in the media
business.

~~~
blfr
_domains such as github.com, github.io, nytimes.com, etc. are far more
prevalent and important to HN_

NYTimes.com is certainly more prevalent but, just like Grauniad lately, it
seems to be diluting quality on HN rather than being important. Same goes for
Medium mentioned above.

Whereas sites like patio11's or cperciva's blogs, YC startups (bu.mp), and
tutorials are what make HN unique and interesting.

~~~
danso
> _it seems to be diluting quality on HN rather than being important_

I think this underscores the difficulty in quantifying the nature of
"quality", especially for a broad audience. I generally check the NYT homepage
every day, so seeing its URLs on HN isn't particularly helpful to me (ignoring
the value of the HN discussions)...however, there is so much interesting
information on a daily basis, period, that I bet if the HN front page
consisted solely of the most-upvoted of high-traffic mainstream sites, e.g.
github, nytimes, medium...it'd still be interesting to me because there'd be a
lot that I would've missed otherwise.

That said, it'd be cool to have an option/Chrome plugin to filter the
frontpage links to domains with relatively rare submissions, just to be able
to quickly see the unique upvoted submissions for the day.

------
anton_tarasenko
_The link is updated to reflect the following_

After SeanDav's question and minimaxir's comment, I summed up reposts' scores
before computing the mean and the median:

HN news sources by _mean score_ :
[https://docs.google.com/spreadsheets/d/1tTDDG2xg7OVKdUy4WCZ_...](https://docs.google.com/spreadsheets/d/1tTDDG2xg7OVKdUy4WCZ_5__17u_IOM4WPpbmnZDbxUI/edit?usp=sharing)

HN news sources by _median score_ :
[https://docs.google.com/spreadsheets/d/1P20sKg-fI6msZVZtJFe0...](https://docs.google.com/spreadsheets/d/1P20sKg-fI6msZVZtJFe0UHozX94AUINsiG_-1mbpaoo/edit?usp=sharing)

HN news sources by _number of submissions_ :
[https://docs.google.com/spreadsheets/d/1mmfbNWaX0Nr1P65VmwZp...](https://docs.google.com/spreadsheets/d/1mmfbNWaX0Nr1P65VmwZpm4WiceK7pepknSob4ti0M7s/edit?usp=sharing)

SQL code:
[https://github.com/antontarasenko/smq/blob/master/hackernews...](https://github.com/antontarasenko/smq/blob/master/hackernews/top-domains-median.sql)

How-to:
[https://github.com/antontarasenko/smq](https://github.com/antontarasenko/smq)
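
The repost adjustment described above (summing reposts' scores before taking the mean and the median) can be sketched roughly as follows. This is an illustration in Python, not the linked SQL; the `(url, score)` input shape, the `merge_reposts` name, and the example.com data are all assumptions made up for the example.

```python
from collections import defaultdict
from statistics import median

def merge_reposts(submissions):
    """Sum the scores of submissions that share a URL.

    `submissions` is an iterable of (url, score) pairs (hypothetical shape).
    """
    totals = defaultdict(int)
    for url, score in submissions:
        totals[url] += score
    return dict(totals)

# Made-up data: one essay submitted three times, only the last one took off.
raw = [
    ("https://example.com/essay", 1),
    ("https://example.com/essay", 2),
    ("https://example.com/essay", 300),
    ("https://example.com/other", 4),
]

merged = merge_reposts(raw)
raw_scores = [score for _, score in raw]
merged_scores = list(merged.values())

print("median before merging:", median(raw_scores))
print("median after merging: ", median(merged_scores))
```

Without merging, the two failed submissions of the essay count as separate low-scoring items and pull the domain's median down; after merging, each URL contributes one combined score.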

~~~
pfarnsworth
You should do number of submissions where min_score > X (maybe 5 or so). This
will help filter out the spam submissions that no one ever sees.

~~~
anton_tarasenko
This is how it looks then:
[https://docs.google.com/spreadsheets/d/1hRpEmkV26VQSN2q_X9_B...](https://docs.google.com/spreadsheets/d/1hRpEmkV26VQSN2q_X9_BXoNP7MHvuyjCrJJLWt6NExA/edit?usp=sharing)

------
cowpig
I really find it discouraging that Sam Altman is at the top of that list. Most
of his articles fall into two categories: promoting things that will make him
money directly[1], or myopic musings/self-serving advice to people that will
make him money indirectly[2].

Is the HN algorithm rigged in favour of things he writes, or does this
community really get a lot out of the things he says?

[1] [http://blog.samaltman.com/asana](http://blog.samaltman.com/asana)

[2] [http://blog.samaltman.com/the-tech-bust-of-2015](http://blog.samaltman.com/the-tech-bust-of-2015)
made me laugh, for example

~~~
possibility
Sam owns YC, YC owns HN, what does it matter? The whole purpose of HN is to
make Sam (and the other partners, and investors, and YC startups) money.
Mindshare is incredibly valuable. It's advertising that doesn't totally suck.

~~~
pyrophane
Hacker News is a bit of a misnomer. It doesn't, nor has it ever, served
hackers. This is a site for the startup kids, and you either love it or hate
it, but you gotta accept it for what it is.

~~~
dang
Actually the better part of HN's audience isn't involved in startups and a
sizeable portion (dismayingly sizeable in my view) is cynical about them.

"Startup kids" is too dismissive. Some of the very best comments about
startups come from grizzled veterans. Will ChuckMcM or Animats mind if I call
them grizzled? Let's just pause to appreciate what incredible value they and
others add to this community from the wealth of their experience.

Then again, depending on your definition of "kid" there are "kids" on HN whose
experience with startups is already impressive. Experience should perhaps be
measured in iterations, not years.

HN has many subgroups, including plenty of hackers. Plenty of purely technical
stories make the front page. And the startup and hacker groups overlap.

We get complaints about the balance whichever way HN trends.

------
bbarn
Surprised not to see [http://nautil.us/](http://nautil.us/) on here. I see
their articles so often on here that I have to avoid clicking them so as not
to spoil my print version.

FWIW, if anyone from that site/mag is a frequent HN reader, HN is the reason I
subscribed, and gifted subscriptions to several of my family for xmas this
year.

~~~
jboynyc
They didn't do very well on HN until about a year and a half ago, as I noted
in this essay:
[https://www.jboy.space/log/ssrc-digital-media-reflection.htm...](https://www.jboy.space/log/ssrc-digital-media-reflection.html)

------
jedberg
Besides cutting off at 10 submissions, you should probably also throw away
anything that got, say, 2 points or less. Something like medium is brought way
down by all the submissions that got 1 point, which means they probably never
got seen. HN lets you resubmit low scoring items exactly for this reason.
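
A minimal sketch of that filter in Python, assuming each domain's scores are available as a plain list. The `median_above_floor` name, the default floor of 2, and the sample scores are made up for illustration.

```python
from statistics import median

def median_above_floor(scores, floor=2):
    """Median of the scores strictly above `floor`; None if nothing survives."""
    kept = [s for s in scores if s > floor]
    return median(kept) if kept else None

# Made-up scores for one domain: mostly unseen submissions plus a few hits.
scores = [1, 1, 1, 1, 2, 30, 80, 120]

print("median of all submissions:", median(scores))
print("median above the floor:   ", median_above_floor(scores))
```

The floor is a judgment call: set it too high and the metric only describes the hits, set it too low and the unseen submissions dominate again.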

------
koolba
I'd be more interested in seeing the distribution across submissions that
actually made the front page.

There's a daily deluge of articles from ars, techcrunch, nytimes, etc, so the
(tons of) articles that do get to the top get penalized by the ones that
don't.

I don't think there's a flag for "hit front page" so might have to estimate
that with a min point filter instead.

------
anton_tarasenko
A brief motivation for the parameters:

1\. Sorting by the median. The mean is not very informative about the quality
of the source. Most sources provide low-scored content with occasional hits
that drive the mean up. The median fixes this problem.

2\. Cutting off at 10 submissions. An arbitrary minimum to exclude pure luck
from the results.

In the end, this ranking excludes websites like github.com and youtube.com,
but it features some lesser-known sources.
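
To illustrate the two parameters, here is a rough Python sketch (not the SQL used for the report). It assumes the data is a list of `(domain, score)` pairs; the `rank_domains` name and the sample domains are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median

def rank_domains(submissions, min_submissions=10):
    """Rank domains by median score, skipping domains with too few submissions.

    Returns a list of (domain, median, mean, count) sorted by median, descending.
    """
    scores = defaultdict(list)
    for domain, score in submissions:
        scores[domain].append(score)
    rows = [
        (domain, median(s), mean(s), len(s))
        for domain, s in scores.items()
        if len(s) >= min_submissions
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)

# Made-up data: a small blog with steady scores, a big site with one hit,
# and a domain that falls below the 10-submission cutoff.
sample = (
    [("niche.example", 40)] * 10
    + [("big.example", 1)] * 19
    + [("big.example", 1000)]
    + [("lucky.example", 500)] * 3
)

ranking = rank_domains(sample)
for domain, med, avg, count in ranking:
    print(f"{domain}: median={med} mean={avg:.2f} n={count}")
```

In this toy data the big site has the higher mean because of its single hit, while the median ranks the steady blog first; the three-submission domain is dropped by the cutoff.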

~~~
danso
What problem does the median fix? Many of the top sites in this list are
fairly niche; some don't even really exist any more (e.g., adgrok.com being a
business that sold to Twitter in 2011)...Undoubtedly, median is a better
metric than mean when the desire is to remove outliers...but in the way that
HN works, I'm not sure that need is relevant here. github.com and nytimes.com
are absent from this list because a lot of their links get submitted...but I
bet a lot more Github users can recall 5 great submissions in the past week
from either domain than they can from chris-granger.com, even among fans of
Light Table and Eve.

That said, I would be interested in the mean, just to see how different the
two lists might be.

~~~
anton_tarasenko
Have a look:
[https://news.ycombinator.com/item?id=11499402](https://news.ycombinator.com/item?id=11499402)

------
qntty
It would be interesting to compute the h-index for all HN submissions, with
score instead of citations, then sort them from highest to lowest.
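
For reference, a small Python sketch of an h-index over scores: the largest h such that at least h submissions scored h or more. Computing it per domain is one possible reading of the idea; the `h_index` name and the sample numbers are made up.

```python
def h_index(scores):
    """Largest h such that at least h of the scores are >= h."""
    ranked = sorted(scores, reverse=True)
    h = 0
    for rank, score in enumerate(ranked, start=1):
        if score >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical per-domain scores, sorted by h-index from highest to lowest.
domains = {
    "steady.example": [10, 8, 5, 4, 3],
    "onehit.example": [1000, 1, 1, 1, 1],
}
by_h = sorted(domains, key=lambda d: h_index(domains[d]), reverse=True)
print(by_h)
```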

------
SeanDav
Not sure how accurate this is; alternatively, it might need different
assumptions. What about NYTimes, WSJ, GitHub, BBC, ArsTechnica, Medium, etc.?

~~~
anton_tarasenko
These websites have a low median score. That is, many submissions, many of
them not relevant, so the median is low.

~~~
minimaxir
"Not relevant" is not the same as "not upvoted." There are a number of reasons
why a submission does not receive many upvotes which are unrelated to the
quality of the content itself, which is why HN has repost rules.

The 10 story minimum is to ensure a reasonable threshold for error and so
single submissions with 1000+ points (e.g. Show HNs) don't skew the results.

~~~
anton_tarasenko
Do you mean that duplicates from popular sources (NY Times, WSJ, etc) spoil
stats for these sources?

~~~
minimaxir
Not duplicates, but more noise than signal.

------
jhchen
How is paulgraham.com not on this list?

~~~
adrusi
Because this is looking at median score. Lots of people submit PG links as
soon as they show up, but only one or two of those submissions will make it to
the front page. If more than half of PG links have a score of 0–5, then the
median will be in that range as well.

~~~
jhchen
So HN itself does the merging and the raw dataset still includes the numerous
duplicate submissions then? If this is the case it's not just sources with a
lot of content like medium.com, github.com, nytimes.com being dragged down,
it's any popular source.

------
adam-ff
This morning an article I visited from the front page had only been around 21
seconds and already had 60 comments.

[http://imgur.com/1oyIv2d](http://imgur.com/1oyIv2d)

------
kevindeasis
I'm surprised I don't see medium in here.

Even more, I'm starting to see more posts from medium nowadays that have
declined in quality relative to 2015.

~~~
minimaxir
Medium is more noise than signal. There are an absurd amount of Medium
submissions submitted to HN (in fact, my curiosity into why everyone liked
Medium all of a sudden on HN is the _primary cause_ why I started doing data
visualization on public data.)

If Reddit and YouTube submissions can have ranking penalties due to highly
variable quality, so should Medium.

------
hashatlas
In the future, post this as CSV and GitHub will turn it into an even-nicer
tabular format. Not to mention retaining the machine readability.

------
happyslobro
Ha! Look at Chris Granger go! Don't get me wrong, his work is awesome, but
it's pretty funny to see an individual in the top 10.

------
morisy
Interesting but weird. Some of these sites don't seem to exist (anymore?),
like muckandbrass.com

~~~
anton_tarasenko
This website looks like spam. Wayback Machine doesn't have a good history of it:
[http://web.archive.org/web/20030407151435/http://www.muckand...](http://web.archive.org/web/20030407151435/http://www.muckandbrass.com/p1temp.asp?pid=1&page=1)

~~~
dangrossman
It was a blog about Clojure 6 years ago; 2003 is too far back.

[https://web.archive.org/web/20100415161333/http://muckandbra...](https://web.archive.org/web/20100415161333/http://muckandbrass.com/)

In 2011, it was redirected to this blog, which is still live:
[https://cemerick.com/](https://cemerick.com/)

------
spoiledtechie
I'm gonna put money on it that the big hitters in the list probably game
Hacker News a bit by asking their friends to vote them up.

~~~
steveklabnik
The voting ring detector _should_ take care of that.

