
A search engine that removes the 1 million most popular web sites from its index - nichochar
https://millionshort.com/
======
marknutter
I'm still amazed at how bad Google search is for certain purposes (not that
there's anything better), and it's largely due to content farms. I've
installed Google's Personal Blocklist Chrome extension, which allows me to
filter out specific domains (ask.com, ehow.com, answers.yahoo.com,
wikihow.com, etc.), and that does help some. It's interesting to me that Google
search used to have preferences allowing you to block domains but they removed
them.

I actually find myself narrowing my searches by sites I know will have
reliable information, like this one, reddit, certain forums based on the
search topic, etc. I think there would be some real value in creating a search
engine that was very selective about the sites it crawls. Honestly, crawling
the comments from the best user-participation sites on the web (reddit, HN,
SO, quora, etc.) would probably make for a very useful search engine.
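
For what it's worth, the "narrow by site" habit can be scripted with the `site:` operator. A minimal sketch, assuming the usual `q` parameter on Google's search URL; the function names and the site list are just illustrative, not anything the sites themselves publish:

```python
from urllib.parse import urlencode


def site_restricted_query(terms, sites):
    """Append an OR-ed list of site: filters to the search terms."""
    site_filter = " OR ".join(f"site:{s}" for s in sites)
    return f"{terms} ({site_filter})"


def google_search_url(terms, sites):
    """Build a search URL for the restricted query (assumes the q= parameter)."""
    return "https://www.google.com/search?" + urlencode(
        {"q": site_restricted_query(terms, sites)}
    )


# Illustrative list of discussion sites; swap in whatever you trust.
TRUSTED_SITES = ["reddit.com", "news.ycombinator.com", "stackoverflow.com", "quora.com"]

if __name__ == "__main__":
    print(google_search_url("python packaging", TRUSTED_SITES))
```

This only narrows an existing engine's results; it is not the selective crawler described above.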

~~~
yaph
Build a custom Google search engine and try it
[https://www.google.com/cse/](https://www.google.com/cse/)

~~~
Keyframe
"More control, if you need it" - the "Learn more" link leads to a 404. Not really
confident about this.

------
cousin_it
Here's another similar idea, someone please do this: a search engine that
excludes all "commercial" sites. "Commercial" means the site contains ads or
accepts payments.

~~~
Benferhat
> "Commercial" means the site contains ads

Your idea started out good. Sometimes I want to research a product before I go
looking for a merchant. Blocking all sites that contain ads? Just use adblock
if you're that fanatical.

~~~
anjc
Presumably the idea is to define 'commercial' as a website that has ads or accepts
payments, rather than to remove sites with ads to increase viewing pleasure.

I think it's a good idea. It'd certainly help with the problem of blogspam
etc.

------
bmac27
_Imagine a search engine that simply removed the top 1 million most popular
web sites from its index. What would you discover?_

A mix of affiliate-driven content sites, abandoned blogs & spam sites, based
on some of the searches I did. The broader the category, the more likely I
think you are to stumble on something relevant and helpful.

(Also, .gov sites don't seem to have been removed from the index. A search for
"type 2 diabetes" still brings a number of results from NIH.gov, mimicking
what Google serves up.)

~~~
courtewing
I believe "removes the top 1 million most popular websites" means that they
nix search results from the million most trafficked domains on the web as a
whole rather than simply the top million results for a specific search. I
doubt there are very many .gov domains that fall into that group.

~~~
bmac27
NIH.gov is ranked #335 globally (185 in U.S.) according to Alexa.

------
marbu
If you find this interesting, you can check the discussion when it was
introduced here for the first time:
[https://news.ycombinator.com/item?id=3910304](https://news.ycombinator.com/item?id=3910304)

------
Aardwolf
Cool!

Tiny complaint: Its default option is "Don't remove any sites". Kind of misses
the point imho...

Other complaint: If country is set to e.g. Switzerland, you see only .de and
.ch domains in the results! Is there no "worldwide" setting?

~~~
us0r
That, and my PC/browser is set to en-US, but I am in Eastern Europe and my
results are only from .ru (even though I'm not in Russia).

------
aray
It's cute as a novelty, but I tried actually using it just now, and giving me
a CAPTCHA every few searches makes it _unusable_ as a daily driver.

Which is a shame because I really enjoyed searching with it.

------
drakaal
Rather than ranking on traffic, and getting rid of some really great results
just because they are too popular, we
([http://www.plexisearch.com](http://www.plexisearch.com)) use indicators from
Natural Language Processing to rank pages.

A How To should rank better for "Make a cake" than a sales page or a review
page.

A Review should rank better for "Best SUV" than a Table Of Contents page.

Just because you are the Underdog doesn't mean you should win. You should have
a fair fight, but a good result is a good result.

I hate much of the stuff in Wikipedia. But there are some pages that were
amazingly well written.

I hate eHow, but there was a brief time when they had experts in the fields
they were writing about writing really great content. Those posts should do
well.

Later we may let you turn off Right or Left Leaning articles. We may expose
the feature we have that returns only easy to read results, a feature designed
for ESL, and Youth searches.

But we will never release a blanket "no more top million sites" filter.

~~~
derleth
OK, you can stop spamming now.

~~~
drakaal
Yes, my site with no ads which demonstrates how rankings work and how page
construction influences search is spam. Because losing money on every search
you do for the benefit of informing people how search both works, and could
work is such an awful endeavor. What was I thinking?

~~~
derleth
If that's the best way you have to promote your site, it must not be very
useful.

------
philbo
As much as I agree with the sentiment behind this, I think the implementation
is slightly misdirected.

Neither page popularity nor query popularity is necessarily proportional to
domain popularity (e.g., *.github.com). Ruling out the most popular domains is
therefore, I suspect, neither good nor bad in terms of the quality of results
it produces on the whole. Sometimes it will produce better results, sometimes
it will produce worse, sometimes the same.

If a search engine/tool is going to add value, imho, the very difficult
problem that it must solve is to improve the quality of results. Unless I'm
missing something, I don't see that here (yet).

------
PhasmaFelis
Kind of surreal: doing a search with the first million results removed,
finding a purple link at #9. =o

(I like big robots. I searched for "mech." #9 was a mech-based browser MMO I'd
looked at briefly a couple weeks ago.)

~~~
bnegreve
It doesn't remove the top 1M results, it removes the top 1M most popular
websites from the index.

------
taxonomyman
We've been quietly working away on MillionShort - please stay tuned.

------
cm2012
This isn't a big deal to me, but considering how most scrappy search engine
competitors are taking the maximum privacy angle against Goog, how are you
approaching search privacy?

------
Drexl
For a website whose shtick is to remove the top 1M results from its index,
the designer apparently thinks having the default option remove nothing is a
good idea. Also, I don't think it actually works unless www.adamwest.com is
not in the top one million results for "Adam West".

[https://millionshort.com/search.php?q=Adam+West&remove=1000k](https://millionshort.com/search.php?q=Adam+West&remove=1000k)

~~~
ronaldx
It removes the top 1M globally-ranked websites. So here it's removing
wikipedia, imdb, etc, and leaving only more specialist sites.

www.adamwest.com isn't removed because it's ranked 3,400,000th or so, using
Alexa as a reference
([http://www.alexa.com/siteinfo/adamwest.com](http://www.alexa.com/siteinfo/adamwest.com)).
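
To make the distinction concrete, here is a rough sketch of filtering by a domain's global rank rather than by its position in one query's results. It is not MillionShort's actual code. The rank table is made up for illustration, except that the adamwest.com figure follows the "3,400,000th or so" mentioned above, and the helper names are hypothetical:

```python
from urllib.parse import urlparse

# Hypothetical global ranks; only adamwest.com's value reflects the comment above.
GLOBAL_RANK = {
    "wikipedia.org": 10,
    "imdb.com": 50,
    "adamwest.com": 3_400_000,
}


def registered_domain(url):
    """Naive domain extraction: keeps the last two labels of the hostname.

    This ignores multi-part suffixes such as .co.uk; a real system would
    need a public-suffix list.
    """
    host = urlparse(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def remove_top_sites(urls, cutoff=1_000_000):
    """Drop every result whose domain ranks within the global top `cutoff`."""
    kept = []
    for url in urls:
        rank = GLOBAL_RANK.get(registered_domain(url))
        # Unranked domains are assumed to be outside the top list.
        if rank is None or rank > cutoff:
            kept.append(url)
    return kept


if __name__ == "__main__":
    results = [
        "https://en.wikipedia.org/wiki/Adam_West",
        "http://www.imdb.com/",
        "http://www.adamwest.com/",
    ]
    print(remove_top_sites(results))
```

Under these assumed ranks, only the adamwest.com result survives the 1M cutoff, which matches what the search page shows.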

------
bhartzer
How are they determining which sites are the 1 million most popular sites? I
hope it's not by some useless metric like Alexa Rank or something like that.

------
themstheones
How delightful. I just found the most scrumptious recipe for turkey cranberry
flambé that I simply never would have found on google.

------
hyperpape
Ok, I'm sold. My first search was useless, but I searched for "ES6 ready date"
(scrapping 100k top sites), and got
[https://mail.mozilla.org/pipermail/es-discuss/2013-November/...](https://mail.mozilla.org/pipermail/es-discuss/2013-November/034557.html).

------
notacoward
It's not only missing sites, but it ranks them very differently too. When I
search for my own name on any other search engine, my own website comes up
first or second. When I search here, it's buried way down at twenty-something.
It would be nice to know how this data is being collected, collated, and
sorted.

~~~
jrochkind1
Every web search site ranks results differently from every other one. There is
no one 'standard' ranking or anything.

Some of them may rank them in ways more useful to more people of course.
Getting ranking working well is pretty much the challenge in making a good web
search site, and is how Google originally vaulted over its competitors.

Asking to know "how the data is collected, collated, and sorted" is… asking
for a lot. For one thing, it's pretty much 'trade secrets'. Lots of people
would really like to know the answer to those questions about Google, but it
would probably take a multi-volume book to answer it completely, and Google
generally prefers not to share the details, as they are what makes Google what
it is (and would also make it easier for people trying to spam Google results).

~~~
notacoward
Yes, every site is different, but this one's _way_ different. When Google,
Yahoo, DuckDuckGo and others all agree about the relative ranking of two
sites, a radically different result is likely to be evidence of a problem.

Also, I'm not asking for every last detail. Even Google has shared the broad
outline of their PageRank method. I don't think it's at all unreasonable to
expect some explanation of the basic approaches involved.

------
forktheif
Used it for five minutes, and already found two great websites.

I don't know who's behind that search engine, but thank you!

~~~
ttty
can you say which? thanks

------
ucha
It doesn't work if all the results of the first pages belong to websites on
the top x sites specified. Try to search for 'google' and remove the top 1M
sites.
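
One possible explanation, and this is purely a guess about the cause, is that the filter is applied to a fixed first page of results rather than to the full ranked list. A toy sketch of the difference, with hypothetical function names and placeholder domains:

```python
def filter_first_page(ranked_results, blocked, page_size=10):
    """Take the first page, then filter it: can come back empty."""
    first_page = ranked_results[:page_size]
    return [r for r in first_page if r not in blocked]


def filter_then_page(ranked_results, blocked, page_size=10):
    """Filter the whole ranked list first, then take a page of survivors."""
    survivors = [r for r in ranked_results if r not in blocked]
    return survivors[:page_size]


if __name__ == "__main__":
    # Ten placeholder "top" domains followed by one small site.
    ranked = [f"top{i}.example" for i in range(10)] + ["small.example"]
    blocked = set(ranked[:10])
    print(filter_first_page(ranked, blocked))  # nothing left on the page
    print(filter_then_page(ranked, blocked))   # the small site surfaces
```

If the first approach is what's happening, a query like 'google', where popular sites dominate the early pages, would show few or no results.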

------
tokenizer
This is a great idea. With the saturation the top 1 million websites have on
regular search engines, I could see myself using this for more deepnet
searches. Thanks!

------
embro
I really like the idea... makes me think of the good old times!

------
hughes
Really? After removing the top million hits, the top results for 'bitcoin' are
bitcoin.org and wikipedia?

~~~
icebraining
Are you sure you've chosen the right option (besides the text field)? By
default it's chosen "Don't remove any sites".

I ask because my two top results for 'bitcoin' are bitcoinbrasil.com.br and
bitcoin-today.yoyafi.com

------
dasmithii
Users of this search engine can actually stumble upon my personal website
without knowing its exact URL...

------
hnriot
python.org isn't in the top million most popular! What's wrong with people :)

------
adam-f
Searching for millionshort on millionshort gives you... millionshort.com

------
alextingle
Apparently my personal web-site is in the web's top one million.

~~~
Shish2k
So is mine... but still stuck behind the kebab companies :(

------
aldanor
The word "google" is apparently not searchable at all.

