
Show HN: Concept search engine based on most popular HN submissions - freediver
Thesis: Sites that do well on Hacker News tend to be sites with high-quality content.

Tools: Hacker News BigQuery dataset, Python, Google CSE

Steps:

1. Using the HN BigQuery dataset, get all unique domains with more than 3 stories with more than 50 points (query link [1]). Sort by the percentage of such stories out of the domain's total number of stories.

By doing that, at the top you get sites like blog.geoffralston.com, where 3 out of 3 submitted stories got more than 50 points (100%!). Or lucumr.pocoo.org, which had 46 out of 124 total stories reach 50+ points. Talk about good writing.

We cut the list at 2,500 sites, where the popular-to-submitted ratio is still an enviable 12%.

2. Add to this list all sites that had exactly one submission, where that single submission from the domain got 300+ points on HN. I call them unexplored one-hit wonders; the thesis is that there are probably other gems on the domain that just have not been submitted yet. [2]

3. Now we have about 3,000 sites total. We will use Google CSE, which allows up to 2,000 sites through annotations [3], so we have to clean the data now.

- Check if the domain still resolves. Sadly, about 400 of these high-quality sites no longer do.

- Check for redirects.

For example, this site is no longer at its old address:

https://david.weebly.com ... 302 Found (0.153)
   http://www.david.blog/ ... 200 OK (0.0676)

- Check for all other sorts of weird errors.

This took most of the day :) I used a modified version of [4].

4. Manually clean the list of news sites that made it on (nytimes, usatoday...).

5. If you want to check whether your site is on the list, see [5]. If it is, congrats!

6. Finally, here is the end result:

https://cse.google.com/cse?cx=014479775183020491825:c2lrlzrogb5

Search the cream of the crop of HN-submitted sites!

Let me know if you find this useful!
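
For anyone who wants to reproduce steps 1 and 2 outside of BigQuery, here is a rough Python sketch of the selection logic. It is not the query behind [1] or [2]. It assumes the stories have already been exported as (domain, points) pairs, and the function names and sample domains are made up for illustration. The cutoffs are also a guess: the post says "more than 3 stories", but the 3-out-of-3 example suggests "at least 3", so the sketch uses at least 3 and keeps the thresholds as parameters.

```python
from collections import defaultdict


def rank_domains(stories, min_points=50, min_popular=3):
    """Rank domains by the share of their stories scoring above min_points.

    stories: iterable of (domain, points) pairs (assumed export format).
    Returns (domain, popular, total, ratio) tuples, best ratio first.
    """
    total = defaultdict(int)
    popular = defaultdict(int)
    for domain, points in stories:
        total[domain] += 1
        if points > min_points:
            popular[domain] += 1

    ranked = [
        (domain, popular[domain], total[domain], popular[domain] / total[domain])
        for domain in total
        # Assumption: "at least min_popular" popular stories qualifies a domain.
        if popular[domain] >= min_popular
    ]
    # Highest ratio first; ties broken alphabetically for stable output.
    ranked.sort(key=lambda row: (-row[3], row[0]))
    return ranked


def one_hit_wonders(stories, min_points=300):
    """Domains with exactly one submission, and that one scored min_points or more."""
    total = defaultdict(int)
    best = defaultdict(int)
    for domain, points in stories:
        total[domain] += 1
        best[domain] = max(best[domain], points)
    return sorted(d for d in total if total[d] == 1 and best[d] >= min_points)


# Hypothetical sample data, not taken from the real dataset.
stories = [
    ("a.example", 120), ("a.example", 80), ("a.example", 51),
    ("b.example", 60), ("b.example", 70), ("b.example", 90),
    ("b.example", 10), ("b.example", 5), ("b.example", 2),
    ("c.example", 450),
    ("d.example", 10),
]

ranked = rank_domains(stories)
wonders = one_hit_wonders(stories)
```

Cutting the ranked list at a fixed size (2,500 in the post) is then just `ranked[:2500]`.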
======
freediver
Have to post links here due to 2000 char limit.

[0]
[https://cse.google.com/cse?cx=014479775183020491825:c2lrlzro...](https://cse.google.com/cse?cx=014479775183020491825:c2lrlzrogb5)

[1]
[https://console.cloud.google.com/bigquery?sq=217608811855:37...](https://console.cloud.google.com/bigquery?sq=217608811855:3757189d77aa488ca48bddc1ec921afd)

[2]
[https://console.cloud.google.com/bigquery?sq=217608811855:67...](https://console.cloud.google.com/bigquery?sq=217608811855:672d9baea3854a6e8ce8b678c288cbfe)

[3] [https://developers.google.com/custom-search/docs/annotations](https://developers.google.com/custom-search/docs/annotations)

[4] [https://github.com/amgedr/webchk](https://github.com/amgedr/webchk)

[5]
[https://docs.google.com/spreadsheets/d/1ON26TVBUHH4FZuvH8YFQ...](https://docs.google.com/spreadsheets/d/1ON26TVBUHH4FZuvH8YFQOgB0IQ3XeZEzE6KE_dm4A14/edit?usp=sharing)
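
For the annotations upload mentioned under [3], a minimal sketch of how the cleaned domain list could be turned into an annotations XML file. This is not the script that was actually used. The `Annotations` / `Annotation` / `Label` layout follows my reading of the docs in [3], so verify it there before uploading. The function name and the label value `_cse_exampleid` are placeholders; the real label comes from your own search engine.

```python
from xml.sax.saxutils import quoteattr


def build_annotations(domains, label):
    """Return annotations XML with one URL pattern per domain.

    domains: iterable of bare host names, e.g. "lucumr.pocoo.org".
    label:   the search engine's label (placeholder value in this sketch).
    """
    lines = ["<Annotations>"]
    for domain in domains:
        # "host/*" is meant to match every page under that host.
        pattern = f"{domain}/*"
        # quoteattr escapes the value and adds the surrounding quotes.
        lines.append(f"  <Annotation about={quoteattr(pattern)}>")
        lines.append(f"    <Label name={quoteattr(label)}/>")
        lines.append("  </Annotation>")
    lines.append("</Annotations>")
    return "\n".join(lines) + "\n"


# Two domains named in the post; the label is a hypothetical placeholder.
xml_text = build_annotations(
    ["blog.geoffralston.com", "lucumr.pocoo.org"],
    label="_cse_exampleid",
)
```

Writing `xml_text` to a file gives something that can be uploaded through the CSE control panel, assuming the format above matches the current docs.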

