Show HN: HN Domain Leaderboard (hnleaderboard.com)
266 points by refrigerator on March 27, 2018 | 38 comments



So that others can play with the data, here's a reverse-engineered version of the BigQuery query OP used to create the leaderboard:

   #standardSQL
   SELECT
     domain,
     COUNT(*) AS num_posts,
     perc_75,
     AVG(score) AS avg_score,
     (AVG(score) + 2*perc_75) * LOG(COUNT(*)) AS calc_score
   FROM (
     SELECT
       REGEXP_REPLACE(NET.HOST(url), r'^www\.', '') AS domain,
       score,
       PERCENTILE_CONT(score,
         0.75) OVER (PARTITION BY REGEXP_REPLACE(NET.HOST(url), r'^www\.', '')) AS perc_75
     FROM
       `bigquery-public-data.hacker_news.full`
     WHERE
       type = 'story'
       AND url IS NOT NULL )
   GROUP BY
     domain,
     perc_75
   ORDER BY
     calc_score DESC
Top 10000 results: https://docs.google.com/spreadsheets/d/1Z9atmizTAPkgFiBte2eQ...

(it's apparently not a perfect match since there appears to be a minimum # of posts requirement for domains [e.g. without that requirement, https://news.ycombinator.com/from?site=pardonsnowden.org is #3], which should be added to the description of the leaderboard)
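For anyone who wants to sanity-check the scoring expression outside BigQuery, here is a minimal Python sketch of the same arithmetic. It assumes `PERCENTILE_CONT` interpolates linearly between neighbouring values and that single-argument `LOG` is the natural logarithm. The function names and the example scores are made up for illustration, not taken from the dataset.

```python
import math


def percentile_cont(values, fraction):
    """Linearly interpolated percentile (assumed to mirror PERCENTILE_CONT)."""
    ordered = sorted(values)
    position = fraction * (len(ordered) - 1)
    lower = math.floor(position)
    upper = math.ceil(position)
    weight = position - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * weight


def calc_score(scores):
    """(AVG(score) + 2 * perc_75) * LOG(COUNT(*)), with LOG as natural log."""
    avg_score = sum(scores) / len(scores)
    perc_75 = percentile_cont(scores, 0.75)
    return (avg_score + 2 * perc_75) * math.log(len(scores))


# Hypothetical scores for one domain, not real HN data.
example_scores = [1, 2, 3, 10, 100]
print(calc_score(example_scores))
```

Because of the `LOG(COUNT(*))` factor, a domain with exactly one post scores zero under this formula, whatever that post's score.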


I think that's pretty much it, though my domain regex is much uglier! Yeah you're right about a minimum # of posts cutoff — I set it to 25. Forgot to add this to the description but have added it now. Thanks!
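A rough Python sketch of the two steps discussed here, domain normalization and the 25-post cutoff, might look like the following. The helper names are hypothetical, the actual regex used by the site is not shown in the thread, and whether the cutoff is "at least 25" or "more than 25" is an assumption.

```python
from collections import Counter
from urllib.parse import urlparse

MIN_POSTS = 25  # cutoff mentioned in the thread; ">=" is an assumption


def normalize_domain(url):
    """Return the URL's host with a single leading 'www.' removed."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def domains_meeting_cutoff(urls, min_posts=MIN_POSTS):
    """Count posts per normalized domain and keep those at or above the cutoff."""
    counts = Counter(normalize_domain(url) for url in urls)
    return {domain: n for domain, n in counts.items() if n >= min_posts}
```

Stripping only a leading `www.` leaves other subdomains such as `blog.` intact, so `blog.example.com` and `example.com` are counted separately.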


Not going to lie. The first thing I checked in this site was for yours. You made the cut for the 1 year timespan. Congrats!


Very cool, thanks for sharing! I did a somewhat similar analysis a while back [1], and I found that many of the top domains either had a YC affiliation or corresponded to extremely well-known companies or organizations. This made me interested in finding lesser known blogs that also produce high quality content. I tried to identify these by putting a limit on the number of unique users who had submitted content from each domain. My thinking here was that something like the GitHub blog would have submissions from many users, while smaller personal blogs would probably be mostly self-promoted. Using this approach, I was able to turn up some pretty interesting blogs that I had never heard of before.
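The unique-submitter filter described above can be sketched roughly as follows. This is not the code from the linked post; the function name, the `(domain, username)` input shape, and the default threshold of 3 are all assumptions for illustration.

```python
from collections import defaultdict


def narrowly_submitted_domains(posts, max_submitters=3):
    """Return domains submitted by at most max_submitters distinct users.

    posts is an iterable of (domain, username) pairs.
    """
    submitters = defaultdict(set)
    for domain, username in posts:
        submitters[domain].add(username)
    return sorted(
        domain
        for domain, users in submitters.items()
        if len(users) <= max_submitters
    )
```

A domain posted many times by the same one or two accounts passes the filter, while something like a large company blog submitted by dozens of users does not.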

I think it could really increase the usefulness of HN Domain Leaderboard if you added some additional filtering capabilities. Filtering based on the category would probably be pretty easy because you have that information there already, but perhaps also consider some measure of how broadly promoted each domain is. The time range option is already pretty cool, and I'll bet that a few more options would make it even more fun to play around with.

[1] - https://intoli.com/blog/pareto-optimal-blogs/


Thanks! Really interesting blog post and cool idea for surfacing small high-quality blogs.

I'd been planning to add filtering on categories from the start, but it was meant to be a weekend project and my motivation started to drop after that, so I just wanted to put it out there. Will add extra filters in the next few days!


I'd really like to see the opposite of this: domains that have been flagged multiple times and have a high submissions-to-upvotes ratio so that I can filter them out.


While we're asking for features, I would like the opposite of that: a way to see only flagged domains, but ideally filtering out the spam/junk in some way.


That's something HN would filter out.


It would be great if you could add top posts from each of these domains. I am really interested to see the top content I may have missed from a few of these domains.


Until he adds that, we could rely on the official HN search

https://hn.algolia.com
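The same search is also reachable through the HN Search API. As a small sketch, this builds a query URL for stories matching a domain using the `query` and `tags` parameters; the helper name is made up, and matching on the domain text is only an approximation of "posts from this domain".

```python
from urllib.parse import urlencode

API_BASE = "https://hn.algolia.com/api/v1/search"


def domain_search_url(domain):
    """Build a search URL for stories matching the given domain text."""
    return API_BASE + "?" + urlencode({"query": domain, "tags": "story"})


print(domain_search_url("bravenewgeek.com"))
```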


Thanks for the suggestion, this is something I was planning to do too! Soon...


This is a little out of date but may be of interest here. This is a visualization of the top 10,000 HN posts https://www.sizzleanalytics.com/Boards/sizzle/Hacker-News-To...


I would have thought bravenewgeek.com would make it onto the leaderboard since his posts [1] are typically high quality.

[1] https://news.ycombinator.com/from?site=bravenewgeek.com


Kind of amazing that the Rust blog, something relatively new, is the top domain of all time.


Ah! I was searching around for exactly this just a week ago and gave up. Could you add more granular date filters (past month, past week, etc.)? Thanks for doing it!


Interesting that there are no news-related domains in the list. I wonder if that is due to the number of posts those domains have that never gain any traction.


Seems off to me. Anecdotally, I am constantly reading Bloomberg news articles linked from the front page of HN.


Switch to the view that shows just the last year and you'll see lots of news sites.

This matches what I've been sensing (rightly or wrongly) in that the mix of stories has shifted from computer science and engineering to have a more business and general interest mix since I first started reading HN regularly. This goes for the sources as well as the stories themselves.


I agree with your observations. It is getting further and further away from its roots.


How did you determine the domain categories?


I actually hand-labeled the categories, but only for roughly the top 100 domains in each time period. The categories are a little bit arbitrary (where is the line between individual and blog, or blog and publication?) but I was mainly interested in seeing the distinction between individuals and companies.


I'm interested in this as well. Since only the top 106 are tagged, it's possible it was done by hand?

edit: never mind, it's only showing categories for the top 106 domains, but if you set the period to 1 year, you can tell that there are more tagged domains. And it's not always 106 domains.


Is mean a valid statistic for this dataset?

I suspect that the score a link gets is highly variable and doesn't follow a known distribution, so taking a straight mean may not be valid, or at the very least will give a heavily skewed result.
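To illustrate the concern, here is a tiny Python comparison of mean and median on a made-up, heavily skewed set of scores. The numbers are hypothetical and only meant to show how one outlier moves the mean.

```python
import statistics

# Hypothetical scores for one domain: mostly low, with one outlier.
scores = [1, 1, 2, 2, 3, 3, 4, 5, 6, 900]

mean_score = statistics.mean(scores)      # pulled up by the single 900
median_score = statistics.median(scores)  # barely affected by it

print(f"mean={mean_score}, median={median_score}")
```

This is presumably part of why the query mixes in the 75th percentile rather than relying on the average alone.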

That being said, cool idea, well executed.



Love the site. It's clear and to the point.

One thing, if I may add, is that you need a link from the "about" page back to the main page.

Thanks again for sharing!


Glad you like the site :) Good point — I'll add that in.


Interesting that so many of the top sites are "individual". I always thought that self promotion was shunned on places like HN, but I guess if you do it in the "right" way, it can be a successful tactic.


Self-promotion is specifically encouraged on HN, as long as you're promoting a high-quality topic that is of interest to hackers, and as long as it's not done in a spammy manner.

There's even the "Show HN" section specifically for self-promotion.


The articles for those sites aren’t necessarily being submitted by the people who run them. In most cases it’s likely others who found the content useful. I suppose if you squint anything you post on your own site is self-promotion, but that’s not the impression I get from your comment. Am I mistaken?


They all include somewhat useful articles, but some of them (once you arrive) are also quite self promoting. I think the odds are quite slim that this many others randomly stumbled across articles on obscure domains and submitted them repeatedly, so yes I think the owners of them are in many cases responsible for them having been submitted to HN.

But I wasn’t actually disparaging any of them, or suggesting anything inappropriate has happened. I guess I’m just surprised that HN as a community, with such a high degree of anti-commercialism on the site, has given these sites this much exposure.


Aphyr needs more upvotes :)

EDIT: never mind - on the three year view he's in the top 10


I thought this would be a leaderboard of which users get the most votes for comments on different topics.


Karpathy got into the top 20 most upvoted domain submissions? I don't even remember that many.


nit: blog.pinboard.in is classified as individual


Considering it's one person, it might as well be individual.


Yup I classified that as "individual" because it is just 1 guy running it. My criteria for "blog" was that either there are multiple contributors, or the content is very far removed from the individual writing it, which I didn't feel was the case here.


hnleaderboard insecure connection rip


Also blocked by Cisco Umbrella. "This site is blocked due to a security threat that was discovered by the Cisco Umbrella security researchers."



