A search engine that removes the 1 million most popular web sites from its index

marknutter · on Nov 27, 2013

I'm still amazed at how bad Google search is for certain purposes (not that there's anything better), and it's largely due to content farms. I've installed Google's Personal Blocklist chrome extension which allows me to filter out specific domains (ask.com, ehow.com, answers.yahoo.com, wikihow.com, etc) and that does help some. It's interesting to me that google search used to have preferences allowing you to block domains but they removed it.

I actually find myself narrowing my searches by sites I know will have reliable information, like this one, reddit, certain forums based on the search topic, etc. I think there would be some real value in creating a search engine that was very selective about the sites it crawls. Honestly, crawling the comments from the best user-participation sites on the web (reddit, HN, SO, quora, etc.) would probably make for a very useful search engine.

xpda · on Nov 27, 2013

This was easy to fix before Google broke their personal exclusion lists. You used to be able to block a set of sites from your Google searches. I guess Google figured out they were losing potential advertisers.

jggonz · on Nov 27, 2013

Try blekko.com (we get rid of all those spammy sites). If you prefer the traditional look (without all the categorization displayed separately) you can use our legacy interface at http://edit.blekko.com

jusob · on Nov 27, 2013

Unfortunately blekko is full of spam as well. Try https://blekko.com/#?q=%22buy%20windows%207%22%20oem

#1, #2 and #3 results are spam pages on a legitmate site.

contingencies · on Nov 27, 2013

I looked at this minus-million thing again yesterday when my brother brought it up. I'd seen it before, but we were talking about search engine crappiness and alternatives.

Since I remembered Blekko (met Greg, the founder guy, in SF once) I pulled it up and we searched for yunnan, a Provice in China, as a test. Most of the results were for commercial tour operators or thinly veiled redirects for such.

The million-missing one on the other hand turned up more interesting or 'bespoke' content.

I am ignorant of such matters but would have thought a spamassassin-style Bayesian model based on sentiment analysis, advertising frequency, update frequency, hosting location, content originality or any similar clump of readily obtainable metrics would be enough to usefully cull the vast majority of the useless modern stuff.

I mean, if I want commercial tour operators, I'll tell the search engine by typing something like "prices" or "companies" or "costs" or whatnot.

The other realization we had, doing this on an iPad, was that search engine interfaces positively suck. They're still stuck in the 90s. With a touch-based interface, there should be a more interactive model for query refinement than text editing. Like, uncheck [x] commercial sites.

ttty · on Nov 27, 2013

wtf: http://www.t4f.com.br/natal/index.php?oem=1568&buycheap=buy-...

Houshalter · on Nov 28, 2013

I just noticed that I often go to reddit to search for things instead of Google, even though their search sucks. But it's better than 50 pages of yahoo answers and similar sites, plus some random forums.

sireat · on Nov 28, 2013

Can't you just google mysearchterm site:reddit.com to get the best of both worlds?

Houshalter · on Nov 28, 2013

Reddit lets you search by votes or by date whereas Google just turns up arbitrary pages.

atombender · on Nov 27, 2013

Indeed, I had to switch back from DuckDuckGo to Google as I just couldn't deal with all the nonsense sites that polluted my search results, and DDG doesn't offer a personal blocklist feature to eliminate them.

For example, sites like yellowpages.com, whitepages.com, superpages.com, zillow.com, citysquares.com will all pollute basic searches that look like job descriptions ("custom cabinetry new york", for example).

(I complained to DDG about this a while back, and it looks like they have added some negative boosts to some spam sites.)

plaguuuuuu · on Nov 27, 2013

I never realised it until now

I searched for "180sx" on this site and the first ENTIRE PAGE of links were amazingly relevant and from sites I'd never seen before. Seriously useful information -local Australian body part suppliers, build logs.

I google car-related stuff all day and this is the most useful stuff I've seen in ages. From the first page of results.

dublinben · on Nov 27, 2013

It's pretty easy to directly search those sites with DuckDuckGo. They all should have !bang shortcuts.

DanBC · on Nov 27, 2013

But it'd be great if they could all be searched with one query, rather than a bunch of different !bang shortcuts.

yaph · on Nov 27, 2013

Build a custom Google search engine and try it https://www.google.com/cse/

Keyframe · on Nov 27, 2013

"More control, if you need it" - Learn more link leads to a 404. Not really confident about this.

marknutter · on Nov 27, 2013

Awesome, I hadn't thought to use that. Thanks.

cousin_it · on Nov 27, 2013

Here's another similar idea, someone please do this: a search engine that excludes all "commercial" sites. "Commercial" means the site contains ads or accepts payments.

oneeyedpigeon · on Nov 27, 2013

It's almost as if there's a need for a top-level domain to distinguish between commercial organisations and non-profits. This [1] was working brilliantly until everyone decided to abuse it left, right, and center.

[1] http://en.wikipedia.org/wiki/.com

Laremere · on Nov 27, 2013

The worst common day offender being .io.

nwh · on Nov 28, 2013

.io is a generic TLD in most books now anyway. There's lots of other ccTLD that are fairly useless, though it's obvious why ones like .co.ck (the uninhabited Cook Islands) were never picked up as generics.

lobotryas · on Nov 27, 2013

This would exclude SO (ads) and Wikipedia (accepts payments as donations).

Even of you refined your criteria, what do you have against sites that try to recoup costs of hosting or content? Or, are you referring to completely different sites, lumping them all under "commercial sites" label?

cousin_it · on Nov 27, 2013

Okay, it seems I was trigger-happy in excluding nonprofits that accept donations. Maybe "accept payments for products or services" would work better? Not sure how to formalize that though.

More generally, I don't have anything against any sites, I'm trying to come up with a way of viewing the internet that would have a high signal to noise ratio. It seems plausible to me that excluding all sites with ads would improve the signal to noise ratio over what we have today. If you think the signal to noise ratio would be improved further by carefully including some sites with ads (SO is a good example), can you take a stab at defining the criteria?

Maybe include only sites whose ads are deemed acceptable by adblock? But then we might include many content farms...

mlinksva · on Nov 27, 2013

Indeed, the first part (search that filters out site with ads, including affiliate links and the like) is something I wish for frequently. It isn't that I mind seeing ads (I can use AdBlock for that, and sometimes do), it's that I don't want to wade through spam sites when looking for info, and sites with ads seems like an obvious hard filter. Maybe DDG or the like could implement filter.

ry0ohki · on Nov 27, 2013

I remember Yahoo actually had this as a Yahoo Labs thing way back in the day (when they actually did search). You had a slider that would make your results all research (if you are researching a bike) to all commercial (if you want to buy one).

barista · on Nov 27, 2013

What about sites like amazon that serve both type of content? Reviews for research and an option to buy?

_qhtn · on Nov 27, 2013

You'd find them half-way on the slider.

PhasmaFelis · on Nov 27, 2013

"Contains ads" by itself is a pretty bad definition of "commercial site," and will exclude a whole lot of useful, high-quality sites that run ads to pay the hosting bills.

Benferhat · on Nov 27, 2013

> "Commercial" means the site contains ads

Your idea started out good. Sometimes I want to research a product before I go looking for a merchant. Blocking all sites that contain ads? Just use adblock if you're that fanatical.

anjc · on Nov 27, 2013

Presumably the idea is to define 'commercial' as a website with ads or accept payments, rather than removing sites with ads to increase viewing pleasure.

I think it's a good idea. It'd certainly help with the problem of blogspam etc.

JeremyMorgan · on Nov 27, 2013

Also, it's not the ads themselves, but sites that produce content for the purpose of serving up more ads. May not be as good theoretically.

derleth · on Nov 27, 2013

I think the idea was that the presence of ads indicates the advertisers had influence on content, which isn't true for a lot of websites which run ads.

drakaal · on Nov 27, 2013

I don't think that just because a site makes money it is a bad result. If you searched for "Used Jeep Grand Cherokee" Craigslist would be a good result. So would AutoTrader. Both break your rule.

robertjwebb · on Nov 28, 2013

I agree - such a search engine would not be as general as Google or Yahoo. But in sacrificing generality it might become better at handling certain types of requests. I am sure that given such a service the community of HN will make light work of finding out what they are :)

drakaal · on Nov 28, 2013

Also it would by hypocritical that it would block sites with ads but likely have to use ads to generate revenue.

As the owner of a search engine I can tell you search ain't cheap.

robertjwebb · on Nov 28, 2013

Yeah. Also, how do you computationally distinguish between a commercial and non-commercial site? Loads of grey areas.

drakaal · on Nov 28, 2013

Our Natural Language / Language Heuristics Engine can do this. We do to a small extent now, we haven't gone whole hog because I'm not sure I agree that commercial sites shouldn't rank. For things we determine to be reference searches we downrank sales pages, for things we determine to be shopping searches we uprank them.

robertjwebb · on Nov 29, 2013

What product are you developing the engine for?

brownbat · on Nov 30, 2013

Now everyone is fighting to perfect the concept and the definition.

Bikeshedding I say!

If we're just making something interesting and different, not shooting to replace all search, imperfect definitions are a wonderful place to start.

snoonan · on Nov 28, 2013

Would HN be excluded as a commercial site? :)

bmac27 · on Nov 27, 2013

Imagine a search engine that simply removed the top 1 million most popular web sites from its index. What would you discover?

A mix of affiliate-driven content sites, abandoned blogs & spam sites, based on some of the searches I did. The broader the category, the more likely I think you are to stumble on something relevant and helpful.

(Also, .gov sites don't seem to have been removed from the index. A search for "type 2 diabetes" still brings a number of results from NIH.gov, mimicking what Google serves up.)

courtewing · on Nov 27, 2013

I believe "removes the top 1 million most popular websites" means that they nix search results from the million most trafficked domains on the web as a whole rather than simply the top million results for a specific search. I doubt there are very many .gov domains that fall into that group.

bmac27 · on Nov 27, 2013

NIH.gov is ranked #335 globally (185 in U.S.) according to Alexa.

dave5104 · on Nov 27, 2013

That was my initial thought on what you'd end up with. Aren't the top 1 million most popular sites popular for a reason? Of course there's some garbage in there... but what's the value in removing Wikipedia?

marbu · on Nov 27, 2013

If you find this interesting, you can check the discussion when it was introduced here for the first time: https://news.ycombinator.com/item?id=3910304

Aardwolf · on Nov 27, 2013

Cool!

Tiny complaint: Its default option is "Don't remove any sites". Kind of misses the point imho...

Other complaint: If country is set to e.g. Switzerland, you see only .de and .ch domains in the results! Is there no "worldwide" setting?

us0r · on Nov 27, 2013

that and my PC/browser is set to en-US but I am in Eastern Europe and my results are only for .ru (even though I'm not in Russia).

aray · on Nov 27, 2013

It's cute as a novelty, but I tried actually using it just now, and giving me a CAPTCHA every few searches makes it unusable as a daily driver.

Which is a shame because I really enjoyed searching with it.

drakaal · on Nov 27, 2013

Rather than ranking on traffic, and getting rid of some really great results just because they are too popular, we ( http://www.plexisearch.com ) use indicators from Natural Language Processing to rank pages.

A How To should rank better for "Make a cake" than a sales page or a review page.

A Review should rank better for "Best SUV" than a Table Of Contents page.

Just because you are the Underdog doesn't mean you should win. You should have a fair fight, but a good result is a good result.

I hate much of the stuff in Wikipedia. But there are some pages that were amazingly well written.

I hate eHow, but there was a brief time when they had experts in the fields they were writing about writing really great content. Those posts should do well.

Later we may let you turn off Right or Left Leaning articles. We may expose the feature we have that returns only easy to read results, a feature designed for ESL, and Youth searches.

But we will never release a blanket no more top million sites.

derleth · on Nov 27, 2013

OK, you can stop spamming now.

drakaal · on Nov 27, 2013

Yes, my site with no ads which demonstrates how rankings work and how page construction influences search is spam. Because losing money on every search you do for the benefit of informing people how search both works, and could work is such an awful endeavor. What was I thinking?

derleth · on Nov 28, 2013

If that's the best way you have to promote your site, it must not be very useful.

philbo · on Nov 27, 2013

As much as I agree with the sentiment behind this, I think the implementation is slightly misdirected.

Neither page popularity or query popularity are necessarily proportional to domain popularity (eg, *.github.com). Ruling out the most popular domains is therefore, I suspect, neither good or bad in terms of the quality of results it produces on the whole. Sometimes it will produce better results, sometimes it will produce worse, sometimes the same.

If a search engine/tool is going to add value, imho, the very difficult problem that it must solve is to improve the quality of results. Unless I'm missing something, I don't see that here (yet).

PhasmaFelis · on Nov 27, 2013

Kind of surreal: doing a search with the first million results removed, finding a purple link at #9. =o

(I like big robots. I searched for "mech." #9 was a mech-based browser MMO I'd looked at briefly a couple weeks ago.)

bnegreve · on Nov 27, 2013

It doesn't remove the top 1M results, it removes the top 1M most popular websites from the index.

taxonomyman · on Nov 27, 2013

We've been quietly working away on MillionShort - please stay tuned.

cm2012 · on Nov 27, 2013

This isn't a big deal to me, but considering how most scrappy search engine competitors are taking the maximum privacy angle against Goog, how are you approaching search privacy?

Drexl · on Nov 27, 2013

For a website who's shtick it is to remove the top 1M results from it's index the designer thinks having the default option remove nothing is a good idea apparently. Also, I don't think it actually works unless www.adamwest.com is not in the top one million results for "Adam West"

https://millionshort.com/search.php?q=Adam+West&remove=1000k

ronaldx · on Nov 27, 2013

It removes the top 1M globally-ranked websites. So here it's removing wikipedia, imdb, etc, and leaving only more specialist sites.

www.adamwest.com isn't removed because it's ranked 3,400,000th or so, using alexa as a reference (http://www.alexa.com/siteinfo/adamwest.com)

bhartzer · on Nov 27, 2013

How are they determining which sites are the 1 million most popular sites? I hope it's not by some useless metric like Alexa Rank or something like that.

themstheones · on Nov 27, 2013

How delightful. I just found the most scrumptious recipe for turkey cranberry flambé that I simply never would have found on google.

hyperpape · on Nov 27, 2013

Ok, I'm sold. My first search was useless, but I searched for "ES6 ready date" (scrapping 100k top sites), and got https://mail.mozilla.org/pipermail/es-discuss/2013-November/....

notacoward · on Nov 27, 2013

It's not only missing sites, but it ranks them very differently too. When I search for my own name on any other search engine, my own website comes up first or second. When I search here, it's buried way down at twenty-something. It would be nice to know how this data is being collected, collated, and sorted.

jrochkind1 · on Nov 27, 2013

every web search site ranks results differently than every other one. There is no one 'standard' ranking or anything.

Some of them may rank them in ways more useful to more people of course. Getting ranking working well is pretty much the challenge in making a good web search site, and is how Google originally vaulted over it's competitors.

Asking to know "how the data is collected, collated, and sorted" is… asking for a lot. For one thing, it's pretty much 'trade secrets' -- lots of people would really like to know the answer to those questions about Google, but it would probably take a multi-volume book to answer it completely, and Google generally prefers not to share the details, as they are what makes Google what it is (and would also make it easier for people trying to spam google results)

notacoward · on Nov 27, 2013

Yes, every site is different, but this one's way different. When Google, Yahoo, DuckDuckGo and others all agree about the relative ranking of two sites, a radically different result is likely to be evidence of a problem

Also, I'm not asking for every last detail. Even Google has shared the broad outline of their PageRank method. I don't think it's at all unreasonable to expect some explanation of the basic approaches involved.

drakaal · on Nov 27, 2013

We ( http://www.plexisearch.com ) expose our secrets. We'll show you what page flags are changing the rankings. We don't share quite everything, mostly because some features are beta and we don't want to look too stupid when we get something wrong... But about 90% of the indicators we used to determine the rankings are exposed.

The formula changes based on the type of search, and the type of results we found, but the indicators are all exposed.

mrcactu5 · on Nov 27, 2013

I am guessing, this site calculates page-rank for the network of articles it has and removes the top 10^? articles and re-calculates the page-rank. That's how I would do it.

And I would guess this would be closer to vanilla page-rank than Google's secret sauce...

forktheif · on Nov 27, 2013

Used it for five minutes, and already found two great websites.

I don't know who's behind that search engine, but thank you!

ttty · on Nov 27, 2013

can you say which? thanks

ucha · on Nov 27, 2013

It doesn't work if all the results of the first pages belong to websites on the top x sites specified. Try to search for 'google' and remove the top 1M sites.

tokenizer · on Nov 27, 2013

This is a great idea. With the saturation the top 1 million websites have on regular search engines, I could see myself using this for more deepnet searches. Thanks!

embro · on Nov 27, 2013

I really like the idea... makes me feel of the good old time!

hughes · on Nov 27, 2013

Really? After removing the top million hits, the top results for 'bitcoin' are bitcoin.org and wikipedia?

icebraining · on Nov 27, 2013

Are you sure you've chosen the right option (besides the text field)? By default it's chosen "Don't remove any sites".

I ask because my two top results for 'bitcoin' are bitcoinbrasil.com.br and bitcoin-today.yoyafi.com

dasmithii · on Nov 27, 2013

Users of this search engine can actually stumble upon my personal website without knowing it's exact url...

hnriot · on Nov 27, 2013

python.org isn't in the top million most popular! What's wrong with people :)

adam-f · on Nov 27, 2013

Searching for millionshort on millionshort gives you... millionshort.com

alextingle · on Nov 27, 2013

Apparently my personal web-site is in the web's top one million.

Shish2k · on Nov 27, 2013

So is mine... but still stuck behind the kebab companies :(

aldanor · on Nov 28, 2013

Word "google" is apparently not searchable at all.