A search engine that removes the 1 million most popular web sites from its index (millionshort.com)
176 points by nichochar on Nov 27, 2013 | 79 comments



I'm still amazed at how bad Google search is for certain purposes (not that there's anything better), and it's largely due to content farms. I've installed Google's Personal Blocklist Chrome extension, which allows me to filter out specific domains (ask.com, ehow.com, answers.yahoo.com, wikihow.com, etc.), and that does help some. It's interesting to me that Google search used to have a preference that let you block domains, but they removed it.

I actually find myself narrowing my searches by sites I know will have reliable information, like this one, reddit, certain forums based on the search topic, etc. I think there would be some real value in creating a search engine that was very selective about the sites it crawls. Honestly, crawling the comments from the best user-participation sites on the web (reddit, HN, SO, quora, etc.) would probably make for a very useful search engine.


This was easy to fix before Google broke their personal exclusion lists. You used to be able to block a set of sites from your Google searches. I guess Google figured out they were losing potential advertisers.


Try blekko.com (we get rid of all those spammy sites). If you prefer the traditional look (without all the categorization displayed separately) you can use our legacy interface at http://edit.blekko.com


Unfortunately blekko is full of spam as well. Try https://blekko.com/#?q=%22buy%20windows%207%22%20oem

The #1, #2, and #3 results are spam pages on a legitimate site.


I looked at this minus-million thing again yesterday when my brother brought it up. I'd seen it before, but we were talking about search engine crappiness and alternatives.

Since I remembered Blekko (met Greg, the founder guy, in SF once), I pulled it up and we searched for Yunnan, a province in China, as a test. Most of the results were for commercial tour operators or thinly veiled redirects to such.

The million-missing one on the other hand turned up more interesting or 'bespoke' content.

I am ignorant of such matters, but I would have thought a SpamAssassin-style Bayesian model based on sentiment analysis, advertising frequency, update frequency, hosting location, content originality, or any similar clump of readily obtainable metrics would be enough to usefully cull the vast majority of the useless modern stuff.
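Something like this toy sketch is what I have in mind (a naive Bayes filter over per-site metrics, in the SpamAssassin spirit; the feature names and training numbers are entirely made up for illustration):

    # Toy Gaussian naive Bayes filter over per-site metrics, in the spirit
    # of SpamAssassin's Bayesian scoring. Features and data are illustrative.
    from sklearn.naive_bayes import GaussianNB

    # Hypothetical per-site features: [ads per page, fraction of original
    # content, updates per month].
    training_sites = [
        [12.0, 0.10, 0.5],   # content farm
        [15.0, 0.20, 1.0],   # content farm
        [1.0, 0.90, 4.0],    # genuine hobbyist site
        [0.0, 0.95, 2.0],    # genuine hobbyist site
    ]
    labels = ["useless", "useless", "useful", "useful"]

    model = GaussianNB()
    model.fit(training_sites, labels)

    # Score an unseen site; cull it from the index if it classes as useless.
    print(model.predict([[10.0, 0.15, 0.8]]))  # -> ['useless']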

I mean, if I want commercial tour operators, I'll tell the search engine by typing something like "prices" or "companies" or "costs" or whatnot.

The other realization we had, doing this on an iPad, was that search engine interfaces positively suck. They're still stuck in the 90s. With a touch-based interface, there should be a more interactive model for query refinement than text editing. Like, uncheck [x] commercial sites.



I just noticed that I often go to reddit to search for things instead of Google, even though their search sucks. But it's better than 50 pages of Yahoo Answers and similar sites, plus some random forums.


Can't you just google mysearchterm site:reddit.com to get the best of both worlds?


Reddit lets you search by votes or by date whereas Google just turns up arbitrary pages.


Indeed, I had to switch back from DuckDuckGo to Google as I just couldn't deal with all the nonsense sites that polluted my search results, and DDG doesn't offer a personal blocklist feature to eliminate them.

For example, sites like yellowpages.com, whitepages.com, superpages.com, zillow.com, citysquares.com will all pollute basic searches that look like job descriptions ("custom cabinetry new york", for example).

(I complained to DDG about this a while back, and it looks like they have added some negative boosts to some spam sites.)


I never realised it until now.

I searched for "180sx" on this site and the first ENTIRE PAGE of links was amazingly relevant and from sites I'd never seen before. Seriously useful information: local Australian body part suppliers, build logs.

I google car-related stuff all day and this is the most useful stuff I've seen in ages. From the first page of results.


It's pretty easy to directly search those sites with DuckDuckGo. They all should have !bang shortcuts.


But it'd be great if they could all be searched with one query, rather than a bunch of different !bang shortcuts.
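As a stopgap, you can chain them into one Google query with the OR operator (a sketch; the site list is just an example):

    # Build a single Google query restricted to a handful of trusted sites.
    sites = ["reddit.com", "news.ycombinator.com", "stackoverflow.com"]
    query = "mysearchterm " + " OR ".join("site:" + s for s in sites)
    print(query)
    # mysearchterm site:reddit.com OR site:news.ycombinator.com OR site:stackoverflow.com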


Build a custom Google search engine and try it: https://www.google.com/cse/


"More control, if you need it" - Learn more link leads to a 404. Not really confident about this.


Awesome, I hadn't thought to use that. Thanks.


Here's another similar idea (someone please do this): a search engine that excludes all "commercial" sites. "Commercial" means the site contains ads or accepts payments.


It's almost as if there's a need for a top-level domain to distinguish between commercial organisations and non-profits. This [1] was working brilliantly until everyone decided to abuse it left, right, and center.

[1] http://en.wikipedia.org/wiki/.com


The worst modern-day offender being .io.


.io is a generic TLD in most books now anyway. There are lots of other ccTLDs that are fairly useless, though it's obvious why ones like .co.ck (the uninhabited Cook Islands) were never picked up as generics.


This would exclude SO (ads) and Wikipedia (accepts payments as donations).

Even if you refined your criteria, what do you have against sites that try to recoup the costs of hosting or content? Or are you referring to completely different sites, lumping them all under the "commercial sites" label?


Okay, it seems I was trigger-happy in excluding nonprofits that accept donations. Maybe "accept payments for products or services" would work better? Not sure how to formalize that though.

More generally, I don't have anything against any sites; I'm trying to come up with a way of viewing the internet that would have a high signal-to-noise ratio. It seems plausible to me that excluding all sites with ads would improve the signal-to-noise ratio over what we have today. If you think the ratio would be improved further by carefully including some sites with ads (SO is a good example), can you take a stab at defining the criteria?

Maybe include only sites whose ads are deemed acceptable by Adblock? But then we might include many content farms...
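For the "contains ads" test itself, a crude first cut could just look for well-known ad-network hosts in a page's markup (a sketch; the host list is a small illustrative sample, not a real blocklist):

    # Crude "contains ads" detector: flag pages that reference well-known
    # ad-network hosts. The list is a small illustrative sample.
    import urllib.request

    AD_HOSTS = [
        "doubleclick.net",
        "googlesyndication.com",
        "adnxs.com",
    ]

    def looks_commercial(url):
        html = urllib.request.urlopen(url, timeout=10).read()
        html = html.decode("utf-8", "replace")
        return any(host in html for host in AD_HOSTS)

    # A search engine could apply this as a hard filter at index time,
    # skipping any page for which looks_commercial(url) is True.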


Indeed, the first part (search that filters out sites with ads, including affiliate links and the like) is something I wish for frequently. It isn't that I mind seeing ads (I can use AdBlock for that, and sometimes do); it's that I don't want to wade through spam sites when looking for info, and sites with ads seem like an obvious hard filter. Maybe DDG or the like could implement such a filter.


I remember Yahoo actually had this as a Yahoo Labs thing way back in the day (when they actually did search). You had a slider that would shift your results from all research (if you are researching a bike) to all commercial (if you want to buy one).


What about sites like Amazon that serve both types of content? Reviews for research and an option to buy?


You'd find them halfway along the slider.


"Contains ads" by itself is a pretty bad definition of "commercial site," and will exclude a whole lot of useful, high-quality sites that run ads to pay the hosting bills.


> "Commercial" means the site contains ads

Your idea started out well. Sometimes I want to research a product before I go looking for a merchant. But blocking all sites that contain ads? Just use AdBlock if you're that fanatical.


Presumably the idea is to define 'commercial' as a website that has ads or accepts payments, rather than removing sites with ads to increase viewing pleasure.

I think it's a good idea. It'd certainly help with the problem of blogspam etc.


Also, it's not the ads themselves that are the problem, but sites that produce content for the purpose of serving up more ads. A blanket ad filter may not capture that distinction.


I think the idea was that the presence of ads indicates the advertisers had influence over the content, which isn't true for a lot of websites that run ads.


I don't think that just because a site makes money it is a bad result. If you searched for "Used Jeep Grand Cherokee" Craigslist would be a good result. So would AutoTrader. Both break your rule.


I agree - such a search engine would not be as general as Google or Yahoo. But in sacrificing generality it might become better at handling certain types of requests. I am sure that, given such a service, the HN community will make light work of finding out what they are :)


Also, it would be hypocritical for it to block sites with ads while likely having to run ads itself to generate revenue.

As the owner of a search engine I can tell you search ain't cheap.


Yeah. Also, how do you computationally distinguish between a commercial and non-commercial site? Loads of grey areas.


Our natural language / language heuristics engine can do this. We do it to a small extent now; we haven't gone whole hog because I'm not sure I agree that commercial sites shouldn't rank. For searches we determine to be reference searches, we downrank sales pages; for searches we determine to be shopping searches, we uprank them.
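(To illustrate the general idea rather than our actual implementation, a toy intent-aware reranker might look like the sketch below; the keyword list and field names are made up.)

    # Toy intent-aware reranker: guess whether a query is a shopping search,
    # then nudge sales pages up or down accordingly. Purely illustrative.
    SHOPPING_WORDS = {"buy", "price", "cheap", "deal", "shop"}

    def rerank(query, results):
        # results: list of dicts with hypothetical 'base_score' and
        # 'is_sales_page' fields.
        shopping = bool(SHOPPING_WORDS & set(query.lower().split()))

        def score(r):
            # Boost sales pages for shopping queries, demote them otherwise.
            boost = 1.0 if r["is_sales_page"] == shopping else -1.0
            return r["base_score"] + boost

        return sorted(results, key=score, reverse=True)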


What product are you developing the engine for?


Now everyone is fighting to perfect the concept and the definition.

Bikeshedding I say!

If we're just making something interesting and different, not shooting to replace all search, imperfect definitions are a wonderful place to start.


Would HN be excluded as a commercial site? :)


Imagine a search engine that simply removed the top 1 million most popular web sites from its index. What would you discover?

A mix of affiliate-driven content sites, abandoned blogs & spam sites, based on some of the searches I did. The broader the category, the more likely I think you are to stumble on something relevant and helpful.

(Also, .gov sites don't seem to have been removed from the index. A search for "type 2 diabetes" still brings up a number of results from NIH.gov, mimicking what Google serves up.)


I believe "removes the top 1 million most popular websites" means that they nix search results from the million most trafficked domains on the web as a whole rather than simply the top million results for a specific search. I doubt there are very many .gov domains that fall into that group.


NIH.gov is ranked #335 globally (185 in U.S.) according to Alexa.


That was my initial thought on what you'd end up with. Aren't the top 1 million most popular sites popular for a reason? Of course there's some garbage in there... but what's the value in removing Wikipedia?


If you find this interesting, you can check the discussion from when it was first introduced here: https://news.ycombinator.com/item?id=3910304


Cool!

Tiny complaint: Its default option is "Don't remove any sites". Kind of misses the point imho...

Other complaint: If country is set to e.g. Switzerland, you see only .de and .ch domains in the results! Is there no "worldwide" setting?


That, and my PC/browser is set to en-US but I am in Eastern Europe, and my results are only for .ru (even though I'm not in Russia).


It's cute as a novelty, but I tried actually using it just now, and giving me a CAPTCHA every few searches makes it unusable as a daily driver.

Which is a shame because I really enjoyed searching with it.


Rather than ranking on traffic, and getting rid of some really great results just because they are too popular, we ( http://www.plexisearch.com ) use indicators from Natural Language Processing to rank pages.

A How To should rank better for "Make a cake" than a sales page or a review page.

A Review should rank better for "Best SUV" than a Table Of Contents page.

Just because you are the Underdog doesn't mean you should win. You should have a fair fight, but a good result is a good result.

I hate much of the stuff on Wikipedia. But there are some pages that are amazingly well written.

I hate eHow, but there was a brief time when they had experts in the relevant fields writing really great content. Those posts should do well.

Later we may let you turn off right- or left-leaning articles. We may expose a feature we have that returns only easy-to-read results, designed for ESL and youth searches.

But we will never release a blanket "remove the top million sites" option.


OK, you can stop spamming now.


Yes, my site with no ads, which demonstrates how rankings work and how page construction influences search, is spam. Because losing money on every search you do, for the benefit of informing people how search both works and could work, is such an awful endeavor. What was I thinking?


If that's the best way you have to promote your site, it must not be very useful.


As much as I agree with the sentiment behind this, I think the implementation is slightly misdirected.

Neither page popularity nor query popularity is necessarily proportional to domain popularity (e.g., *.github.com). Ruling out the most popular domains is therefore, I suspect, neither good nor bad in terms of the quality of results it produces on the whole. Sometimes it will produce better results, sometimes worse, sometimes the same.

If a search engine/tool is going to add value, imho, the very difficult problem that it must solve is to improve the quality of results. Unless I'm missing something, I don't see that here (yet).


Kind of surreal: doing a search with the first million results removed, finding a purple link at #9. =o

(I like big robots. I searched for "mech." #9 was a mech-based browser MMO I'd looked at briefly a couple weeks ago.)


It doesn't remove the top 1M results, it removes the top 1M most popular websites from the index.


We've been quietly working away on MillionShort - please stay tuned.


This isn't a big deal to me, but considering how most scrappy search engine competitors are taking the maximum privacy angle against Goog, how are you approaching search privacy?


For a website whose shtick is removing the top 1M results from its index, the designer apparently thinks having the default option remove nothing is a good idea. Also, I don't think it actually works, unless www.adamwest.com is not in the top one million results for "Adam West":

https://millionshort.com/search.php?q=Adam+West&remove=1000k


It removes the top 1M globally-ranked websites. So here it's removing Wikipedia, IMDb, etc., and leaving only more specialist sites.

www.adamwest.com isn't removed because it's ranked 3,400,000th or so, using Alexa as a reference (http://www.alexa.com/siteinfo/adamwest.com).
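Mechanically, the filtering step could be as simple as this sketch (assuming a "rank,domain" CSV like the top-1m.csv file Alexa distributes; whether MillionShort does exactly this, I don't know):

    # Drop any result whose host, or any parent domain of it, appears in a
    # top-1M popularity list given as "rank,domain" lines.
    import csv
    from urllib.parse import urlparse

    with open("top-1m.csv", newline="") as f:
        top_million = {domain for rank, domain in csv.reader(f)}

    def keep(result_url):
        host = urlparse(result_url).hostname or ""
        parts = host.split(".")
        # Check www.adamwest.com, adamwest.com, com in turn.
        return not any(".".join(parts[i:]) in top_million
                       for i in range(len(parts)))

    # adamwest.com is ranked ~3.4M, so its pages survive the filter.
    print(keep("http://www.adamwest.com/"))  # -> True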


How are they determining which sites are the 1 million most popular? I hope it's not by some useless metric like Alexa Rank.


How delightful. I just found the most scrumptious recipe for turkey cranberry flambé that I simply never would have found on google.


OK, I'm sold. My first search was useless, but I searched for "ES6 ready date" (scrapping the top 100k sites) and got https://mail.mozilla.org/pipermail/es-discuss/2013-November/....


It's not only missing sites, but it ranks them very differently too. When I search for my own name on any other search engine, my own website comes up first or second. When I search here, it's buried way down at twenty-something. It would be nice to know how this data is being collected, collated, and sorted.


Every web search site ranks results differently than every other one. There is no one 'standard' ranking or anything.

Some of them may rank results in ways more useful to more people, of course. Getting ranking working well is pretty much the challenge in making a good web search site, and it's how Google originally vaulted over its competitors.

Asking to know "how the data is collected, collated, and sorted" is… asking for a lot. For one thing, it's pretty much 'trade secrets' -- lots of people would really like to know the answer to those questions about Google, but it would probably take a multi-volume book to answer them completely, and Google generally prefers not to share the details, as they are what makes Google what it is (and sharing would also make it easier for people trying to spam Google results).


Yes, every site is different, but this one's way different. When Google, Yahoo, DuckDuckGo, and others all agree about the relative ranking of two sites, a radically different result is likely to be evidence of a problem.

Also, I'm not asking for every last detail. Even Google has shared the broad outline of their PageRank method. I don't think it's at all unreasonable to expect some explanation of the basic approaches involved.


We ( http://www.plexisearch.com ) expose our secrets. We'll show you what page flags are changing the rankings. We don't share quite everything, mostly because some features are beta and we don't want to look too stupid when we get something wrong... But about 90% of the indicators we used to determine the rankings are exposed.

The formula changes based on the type of search, and the type of results we found, but the indicators are all exposed.


I am guessing this site calculates PageRank for the network of articles it has, removes the top 10^? articles, and recalculates the PageRank. That's how I would do it.

And I would guess this would be closer to vanilla PageRank than Google's secret sauce...
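For reference, vanilla PageRank is just a power iteration over the link graph. A minimal sketch (toy graph, nobody's actual implementation):

    # Minimal power-iteration PageRank over {page: [pages it links to]}.
    # Damping factor 0.85 as in the original Brin/Page paper.
    def pagerank(links, damping=0.85, iters=50):
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iters):
            new = {p: (1.0 - damping) / n for p in pages}
            for p, outs in links.items():
                if not outs:  # dangling page: spread its rank evenly
                    for q in pages:
                        new[q] += damping * rank[p] / n
                else:
                    for q in outs:
                        new[q] += damping * rank[p] / len(outs)
            rank = new
        return rank

    graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
    print(pagerank(graph))

Removing the top sites and re-running the iteration would then redistribute their rank to whatever remains.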


Used it for five minutes, and already found two great websites.

I don't know who's behind that search engine, but thank you!


Can you say which? Thanks.


It doesn't work if all the results on the first pages belong to websites in the top X sites specified. Try searching for 'google' and removing the top 1M sites.


This is a great idea. Given how saturated regular search engines are with the top 1 million websites, I could see myself using this for more deepnet searches. Thanks!


I really like the idea... it reminds me of the good old times!


Really? After removing the top million hits, the top results for 'bitcoin' are bitcoin.org and wikipedia?


Are you sure you've chosen the right option (beside the text field)? By default, "Don't remove any sites" is selected.

I ask because my top two results for 'bitcoin' are bitcoinbrasil.com.br and bitcoin-today.yoyafi.com.


Users of this search engine can actually stumble upon my personal website without knowing its exact URL...


python.org isn't in the top million most popular! What's wrong with people :)


Searching for millionshort on millionshort gives you... millionshort.com


Apparently my personal website is in the web's top one million.


So is mine... but still stuck behind the kebab companies :(


Word "google" is apparently not searchable at all.



