So that others can play with the data, here's a reverse-engineered version of the BigQuery query OP used to create the leaderboard:
#standardSQL
SELECT
  domain,
  COUNT(*) AS num_posts,
  perc_75,
  AVG(score) AS avg_score,
  -- LOG() with one argument is the natural logarithm in BigQuery
  (AVG(score) + 2*perc_75) * LOG(COUNT(*)) AS calc_score
FROM (
  SELECT
    -- strip only a leading "www." (anchored, with the dot escaped)
    REGEXP_REPLACE(NET.HOST(url), r'^www\.', '') AS domain,
    score,
    PERCENTILE_CONT(score, 0.75) OVER (PARTITION BY REGEXP_REPLACE(NET.HOST(url), r'^www\.', '')) AS perc_75
  FROM
    `bigquery-public-data.hacker_news.full`
  WHERE
    type = 'story'
    AND url IS NOT NULL )
GROUP BY
  domain,
  perc_75
ORDER BY
  calc_score DESC
(it's apparently not a perfect match since there appears to be a minimum # of posts requirement for domains [e.g. without that requirement, https://news.ycombinator.com/from?site=pardonsnowden.org is #3], which should be added to the description of the leaderboard)
I think that's pretty much it, though my domain regex is much uglier! Yeah, you're right about a minimum # of posts cutoff: I set it to 25. I forgot to add this to the description but have added it now. Thanks!
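For anyone who wants to poke at the scoring outside BigQuery, here's a rough Python sketch of the formula from the query above with the 25-post cutoff applied. The function names and the input shape (a dict of domain to list of scores) are made up for illustration, the hand-rolled percentile is only meant to approximate PERCENTILE_CONT's linear interpolation, and whether the real cutoff is "at least 25" or "more than 25" is a guess on my part.

```python
import math


def percentile_cont(values, q):
    """Linearly interpolated percentile, intended to mimic PERCENTILE_CONT."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)          # fractional index into the sorted list
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def calc_score(scores):
    """(AVG(score) + 2 * perc_75) * LOG(COUNT(*)), with a natural log."""
    avg = sum(scores) / len(scores)
    return (avg + 2 * percentile_cont(scores, 0.75)) * math.log(len(scores))


def leaderboard(scores_by_domain, min_posts=25):
    """Rank domains by calc_score, dropping those below the post cutoff.

    min_posts=25 comes from the thread; treating it as ">=" is an assumption.
    """
    ranked = {
        domain: calc_score(scores)
        for domain, scores in scores_by_domain.items()
        if len(scores) >= min_posts
    }
    return sorted(ranked.items(), key=lambda item: item[1], reverse=True)
```

Note that a domain with a single post always scores 0 because log(1) is 0, but a domain with a handful of very high-scoring posts can still rank near the top, which is presumably why the cutoff is needed.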
Very cool, thanks for sharing! I did a somewhat similar analysis a while back [1], and I found that many of the top domains either had a YC affiliation or corresponded to extremely well-known companies or organizations. This made me interested in finding lesser known blogs that also produce high quality content. I tried to identify these by putting a limit on the number of unique users who had submitted content from each domain. My thinking here was that something like the GitHub blog would have submissions from many users, while smaller personal blogs would probably be mostly self-promoted. Using this approach, I was able to turn up some pretty interesting blogs that I had never heard of before.
I think it could really increase the usefulness of HN Domain Leaderboard if you added some additional filtering capabilities. Filtering based on the category would probably be pretty easy because you have that information there already, but perhaps also consider some measure of how broadly promoted each domain is. The time range option is already pretty cool, and I'll bet that a few more options would make it even more fun to play around with.
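The unique-submitter limit described above could be sketched roughly like this in Python. The input shape (pairs of domain and submitting username), the function name, and the default threshold are all hypothetical; this is not the code from my earlier analysis, just the general idea.

```python
from collections import defaultdict


def narrowly_submitted_domains(posts, max_submitters=3):
    """Return domains whose stories came from at most max_submitters users.

    posts is an iterable of (domain, username) pairs; both the shape and
    the default threshold are assumptions made for this sketch.
    """
    submitters = defaultdict(set)
    for domain, username in posts:
        submitters[domain].add(username)
    return sorted(
        domain
        for domain, users in submitters.items()
        if len(users) <= max_submitters
    )
```

A widely shared domain collects many distinct usernames and gets filtered out, while a small personal blog submitted mostly by its author stays in the list.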
Thanks! Really interesting blog post and cool idea for surfacing small high-quality blogs.
I'd been planning to add filtering on categories from the start, but this was meant to be a weekend project and my motivation started to drop after that, so I just wanted to put it out there. Will add extra filters in the next few days!
I'd really like to see the opposite of this: domains that have been flagged multiple times and have a high submissions-to-upvotes ratio so that I can filter them out.
While we're asking for features, I would like the opposite of that: a way to see only flagged domains, but ideally filtering out the spam/junk in some way.
It would be great if you could add top posts from each of these domains. I am really interested to see the top content I may have missed from a few of these domains.
Ah! I was searching around for exactly this just a week ago and gave up. Could you add more granular date filters (past month, past week, etc.)?
Thanks for doing it!
Interesting that there are no news-related domains in the list. I wonder if that is due to the number of posts those domains have that never gain any traction.
Switch to the view that shows just the last year and you'll see lots of news sites.
This matches what I've been sensing (rightly or wrongly): since I first started reading HN regularly, the mix of stories has shifted from computer science and engineering toward more business and general-interest content. This goes for the sources as well as the stories themselves.
I actually hand-labeled the categories, but only for roughly the top 100 domains in each time period. The categories are a little bit arbitrary (where is the line between individual and blog, or blog and publication?) but I was mainly interested in seeing the distinction between individuals and companies.
I'm interested in this as well. Since only the top 106 are tagged, is it possible it was done by hand?
edit: never mind, it's only showing categories for the top 106 domains, but if you set the period to 1 year, you can tell that there are more tagged domains. And it's not always 106 domains.
I suspect that the score a link gets is highly variable and doesn't follow a known distribution, so taking a straight mean may not be valid, or at the very least the mean will be very skewed.
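To illustrate the concern about the mean, here's a tiny Python sketch with made-up scores for one hypothetical domain (nine ordinary posts and one runaway hit). The numbers are invented and say nothing about the real distribution of HN scores.

```python
from statistics import mean, median

# Invented scores for a single hypothetical domain: mostly low, one outlier.
scores = [1, 1, 2, 2, 3, 3, 4, 5, 6, 1200]

avg = mean(scores)     # dominated by the single 1200-point post
mid = median(scores)   # reflects what a typical post from the domain gets

print(f"mean={avg}, median={mid}")
```

With data like this the mean lands far above what almost any individual post scored, which is the kind of skew a percentile-based term in the formula helps offset.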
Interesting that so many of the top sites are "individual". I always thought that self promotion was shunned on places like HN, but I guess if you do it in the "right" way, it can be a successful tactic.
Self-promotion is specifically encouraged on HN, as long as you're promoting a high-quality topic that is of interest to hackers, and as long as it's not done in a spammy manner.
There's even the "Show HN" section specifically for self-promotion.
The articles for those sites aren’t necessarily being submitted by the people who run them. In most cases it’s likely others who found the content useful. I suppose if you squint anything you post on your own site is self-promotion, but that’s not the impression I get from your comment. Am I mistaken?
They all include somewhat useful articles, but some of them (once you arrive) are also quite self promoting. I think the odds are quite slim that this many others randomly stumbled across articles on obscure domains and submitted them repeatedly, so yes I think the owners of them are in many cases responsible for them having been submitted to HN.
But I wasn’t actually disparaging any of them, or suggesting anything inappropriate has happened. I guess I’m just surprised that HN as a community, with such a high degree of anti-commercialism on the site, has given these sites this much exposure.
Yup I classified that as "individual" because it is just 1 guy running it. My criteria for "blog" was that either there are multiple contributors, or the content is very far removed from the individual writing it, which I didn't feel was the case here.