Show HN: Imagine a search engine that removed top million sites from its index (millionshort.com)
552 points by taxonomyman on Apr 30, 2012 | 195 comments



The results are surprisingly good. I did some searches for recipes, and frankly without the top 1000 you really start getting some fresh hits. Entries by real people, rather than sites raking up recipes for a hit.


Wow, I didn't know until today that Google doesn't serve more than the top 1000 hits. (A result of me thinking, "can't I just ask Google to show me the million-first through the million-tenth hits?")


First thing I used it for was recipes too! Started to find little blogs, or sites outside of North America, with some great vegetarian recipes.


We're surprising ourselves too.


It's interesting to see popularity used as an inverse indicator of quality. Imagine a TV that skipped the most popular programming (goodbye American Idol), or a radio station that only plays non-hits.

Of course, there are great websites out there that are very popular (Wikipedia, NYTimes/WSJ, StackOverflow). I'd love to see a search engine with a better signal for quality than non-popularity (this search engine), or SEO (Google), but it's a fun start. :)


Maybe think of Million Short as more of a discovery engine. We're not saying a quality site can't be popular.


I think Fravia+ would have actually liked what you have here - it's a really nice search engine.


He called it the "Yoyo" technique[1]. It doesn't work very well anymore, because Google's results are all quirky these days (you don't actually get what you literally search for, but what Google guesses you intend to find).

[1] http://www.searchlores.org/yoyo1.htm (try to read past the preachy anti-commercialism, he could get kind of hot about that--whether you agree or not, there's troves of knowledge to be gained from that site)


YouTube, Grooveshark and others kind of fill that gap. They still promote popular content though, but that's unavoidable unless you take money out of the equation.

http://channel101.com/ offers a nice selection of "unpopular" shows. You should check Danny Jelinek's "Everything" (https://vimeo.com/channels/231109) if you're into video-art.


I've had great success playing the non-hits format on college radio for years now; there's always great interest in what hasn't bubbled to the surface by force of popularity.


That's an awesome idea. You guys should add a button to Google search results that would completely invert the result list, and show the lowest stuff first, highest stuff last. Almost like sorting a list of products by price.

Anything you can identify as pure SEO spam, exclude of course. But if it's some original content that just isn't well connected or whatever, include it.

Then just observe user behavior for a while until you can discern patterns in how people who play around with that use it to discover new and interesting stuff. Maybe a new algorithm might come of it.


Reversing Google's results is an awful idea as you'd get the worst possible results for a search i.e. those that are completely unrelated to the search terms.

You need to either remove the top 100 (or 1000 etc) and look at what remains, or reverse only the top 10000 results (or 1000 etc).
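
To make the two alternatives concrete, here is a minimal sketch. It assumes the results arrive as a plain list ordered best-first; the function names are made up for illustration and are not anything Google or Million Short exposes.

```python
def drop_top(results, n):
    """Return the ranked list with the n highest-ranked entries removed."""
    return results[n:]


def reverse_top(results, n):
    """Reverse only the n highest-ranked entries and leave the tail untouched."""
    return results[:n][::-1] + results[n:]


ranked = ["r1", "r2", "r3", "r4", "r5"]  # hypothetical results, best first
print(drop_top(ranked, 2))      # what remains after cutting the top 2
print(reverse_top(ranked, 3))   # top 3 flipped, rest in original order
```

Either way the unrelated junk at the very bottom of the full list never reaches the first page, which is the objection to a full reversal.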


I think that quality and page rank are not correlated. You may well have some quality sitting anywhere between page 1 and a million. Page rank is driven by various things like freshness and keyword matching. How do you determine quality?

This website is an interesting experiment; however, if you're after discovering random semi-related pages, why not just use Google or StumbleUpon?

Is there a way to include domains automatically excluded? For example WordPress.com is excluded.


You have to be careful with that idea since it can still lead to poor quality. My campus's radio station (WREK Atlanta) seems to play just weird music for the sake of being weird. They never play popular music, but they rarely play any good music either.


WREK is just about the best radio there is. It's not weird for the sake of being weird. It's intentionally eclectic and far-reaching, sure, but the point is to broaden horizons. As much for the DJs themselves to broaden their horizons as the audience.

Disclaimer- former WREK DJ. :-)


I listen to WREK from Japan via iTunes Radio– in fact WREK, WUSC, and KCRW are the only reason I have to still listen to radio. I can understand not liking it though, I was pretty surprised the first time I was on campus and heard music I liked.


The University of Georgia's student station has a pretty strict music philosophy that excludes popular artists. I would disagree that it results in no good music being played.

"Artists who meet any of the following criteria are automatically disqualified from airplay on WUOG 90.5FM:

The artist is or has been in regular rotation on major commercial radio stations. This does not apply to material used in specialty shows on major commercial stations.

The artist has a music video that has aired on major music video stations. This does not apply to material used in specialty shows on major commercial networks.

Another criterion that DJs must use when selecting music for airplay is its position on the Billboard Album Charts. If one of the artist’s albums has entered the top 50, then a DJ cannot play that artist."

http://wuog.org/music/philosophy/


Try GSU's 88.5 WRAS. They tend to stay away from both popular artists and the really strange stuff.


Or you may have different tastes from the DJs.


Should be popular with hipsters.


I think you're probably right, but do you mean that as a "dig" against millionshort? You may be entering meta-meta-contrarianism: the hipster of hipsters. :)

http://lesswrong.com/lw/2pv/intellectual_hipsters_and_metaco...


Not any more - I've heard of it.


It means that our ranking algorithms have good recall but very poor precision. We value web page connectivity more than its content. We don't know how to teach machines to evaluate a web page on its merits, so we hope that a large number of Tweeting, Liking and Plusing non-experts will approximate a single expert. Millionshort shows that this model isn't good enough.


It shows the problem with using tags <or metadata> for ranking rather than considering the source of the information.

Yahoo became the goto portal because the quality of their links was high. Google overtook them because metadata was the only practical way to keep up with the pace at which the web was expanding.

However, after more than a decade of SEO, its limitations are evident as the long tail gets increasingly harder to reach with search engines based on the Google model, e.g. "Purchase empiricist philosophers on eBay."

My suspicion is that Google is abandoning neutral search for personalized search in part because of the problems SEO represents when it comes to neutral searching and personal tracking provides an algorithmic way of establishing the quality of results for ranking purposes. One which is easier than attempting to develop a neutral curation algorithm.


I was able to rank for some moderately competitive terms just by paying someone minimum wage to go leave nice comments on dofollow blogs. The whole "Content is king" meme is a joke, really.


Interesting. How do you find blogs that don't implement nofollow?


Plenty of blogs promote themselves as DoFollow Blogs. It's a bit sad, really. They basically trade their "link juice" for your insincere, contrived comments.


Which in turn can help them rank for more terms.


We value web page connectivity more than its content.

That's pretty much precisely how PageRank (in its original form) changed web search. At the time, at least, it worked because we were really bad at automatically analyzing content, and the connectivity signal provided better results.


This is the first off-brand search engine I've seen that's, in some sense, cooler than Google.

For one thing, the huge miasma of spam websites that dominates the SERPs just isn't there -- I hope this lights a fire under Google's butt and people see another world is possible.


It reminds me of why I first moved to Google from Yahoo/Webcrawler/Altavista/etc in the first place.


Yup. Since this is trivial for Google to copy, it's unlikely that it'll actually disrupt Google in any way, but still...very cool.


The technology is trivial, but the reasons behind it and philosophy that drives it aren't.

Google isn't going to come along and say, "oh, guys, you're right...Wave, Buzz, Google+ all a mistake. Social Graph..Mistake!. We're sorry. We are rolling everything back to 2004"

Google has been increasingly more frustrating for power users over the last couple of years. People are looking for alternatives. DDG is one, this is another. Power users matter, and not just because they spend more time inside an application, but because they recommend it to others.

A perfect example of this is Firefox. I used Firefox a lot, and then switched to Chrome. I've since switched my mother, grandpa, brother, girlfriend, sister and too many friends to count to Chrome. I now use DDG for all my searching needs via the bang syntax (!, !g, !w, etc). It's only a matter of time before the rest of my family follows along.


Ah, I was unaware of the !g bang syntax, that's a great one. I'm also fond of !archwiki, and all the language bangs. DuckDuckGo is the only search engine that I know of that could take over a significant portion of search traffic. By significant I don't mean tens of millions, but I mean a dedicated userbase who use it in lieu of Google.


Just out of curiosity, what does DuckDuckGo do that conventional keyword searches in Firefox or Chrome don't do?


Of course you can put all those keyword searches into your browser. The only difference is that DDG has already made a huge selection, so you don't have to, anymore.

So basically, DDG probably has a whole bunch of !bang searches that you simply had not thought of to create your own keywords for, yet. And when you need it, it's already there.

There are also a few !bang queries that are not external searches, such as one for rolling dice (it can do !roll 3d6+3).

Another minor difference is that you can add the !bang keyword anywhere inside the query, also at the end. Being able to add it at the end makes it easier to "ok let's try this on another search engine".
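
As a rough illustration of what "anywhere inside the query" means, here is a minimal sketch of pulling a bang keyword out of a query string. This is not DuckDuckGo's implementation; the function name and the regular expression are assumptions made for the example.

```python
import re

# A bang is "!" followed by word characters, delimited by whitespace or the
# start/end of the query. (Assumed syntax, for illustration only.)
BANG = re.compile(r"(?:^|\s)!(\w+)(?=\s|$)")


def split_bang(query):
    """Return (bang, remaining_query); bang is None if the query has none."""
    match = BANG.search(query)
    if match is None:
        return None, query.strip()
    rest = query[:match.start()] + " " + query[match.end():]
    return match.group(1), " ".join(rest.split())


print(split_bang("!w sushi"))
print(split_bang("sushi !w"))
```

Both calls give the same bang and the same remaining query, which is why appending the bang at the end works for retrying a search elsewhere.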


I have DDG set as my default search engine, but can you do the following:

! fuji Ames => takes me to Fuji Steak House in Ames Iowa directly

!w Sushi => Takes me Directly to the article on Sushi on Wikipedia

!m 801 grand ave Des Moines => takes me directly to google maps at the address.

I still probably do about 30% of my searches using !g kicking me straight to google for search results.


But how is that different from regular keyword searches? If I want to pull up the Wikipedia article on sushi in Firefox, I enter "w sushi" in the URL bar. If I want to read the Arch Linux wiki article on pacman, I type "arch pacman". If I want to watch Friday by Rebecca Black on Youtube, I type "y Friday", etc. I feel like there's more than a 50% chance that I'm missing something. What does DDG do that browsers don't?


I typed "w sushi" into Chrome and got https://www.google.com/search?aq=f&ix=ucb&sourceid=c... since Google is set as my default. I expected to get http://en.wikipedia.org/wiki/Sushi

Do I have to set up all these different keywords to work? If so, that's the difference. It's essentially a pre-configured command line with literally hundreds (thousands?) of predefined syntaxes for searches.


Yes, I set those keyword searches myself. So the difference seems to be that with DDG, you just have to learn which shortcuts exist rather than customizing them yourself. That doesn't really sound like it saves much time, but it sounds like the real advantage of DDG is that the act of reading through the DDG list of keywords would probably give me ideas of useful time-saving shortcuts that I'd never think of making on my own.


I may well have this very wrong, wouldn't be the first time, but won't Google lose a hell of a lot of money if they don't return the top results? It might be "easy" to implement technically, but I suspect it will completely muck up their revenue stream.

Perhaps there might be a great irony of this site becoming popular and using google ads to finance it!!!!!!


It isn't always simple for large organisations to copy/implement trivial/common sense things.

While Google has its head up its ass^H^H^H social stuff, there doesn't seem to be any effort to actually improve search.


Thanks for the great comparison.


I'd be fascinated to see the kind of SEO that would go on if this took off.

"Bad news-- we're a top 100 hit for several of our main keywords. We'll have to change our URL scheme again."


We can call it Search Engine Pessimization!


Bravo!


clap........................clap


High page rank pages / sites could start holding competitors' pages hostage by linking to them.

Maybe we'd see a resurgence in Flash.


So what we're saying is: "This would be a terrible, terrible thing for the web."


What's Flash got to do with anything?

Search engines can (sometimes) follow links through Flash, and they know what the most popular sites are even without counting the number of links to sites, so hiding links through Flash (or JS, whatever) wouldn't help.


This reminds me of searching the internet in the '90s; I'm finding results from pages I haven't visited or heard of before now.

This is really refreshing.


This is a breath of fresh air - I'm loving the unpredictability of the top results! It's like flicking through a new set of 1000 tv channels in a different country.


The positive feedback is awesome.. Thx


Would you share some implementation details?

What's your source for the top million sites; where do you get your site list from for the other results?


Who needs to imagine it? It's here.

Cat is out of the bag.

What is the Alexa list good for? Answer: Filtering out the boring, money-grubbing commercial sites. A truly GREAT idea.

A return to the good ol' days. The non-commercial web.

Many young people who love today's www never got to experience it as it was before it became overrun with Google-ization and auto-generated garbage.

Take the ball and run with it. We can reclaim the web. This is only the beginning.


Wow... It felt like using Google 10 years ago. I think you are onto something.


Wow.. awesome compliment..


Usually don't search unless I'm looking for something in particular, but just played with this for a good 15 minutes running random queries. The results are really good and at the same time I'm discovering sites I'd never otherwise see with Google.


Turns out removing the top million results from a search for Google... still returns google. Or google.com.au to be precise.

It's a cool idea, but I'm not sure it's working. I tried "american history" but it wouldn't return anything at all if I changed the "Remove the Top" dropdown.


Good feedback. Right now the filtering is not 100% working for non .com, .net, .org, .co.uk domains. Still working on it.


There's one pretty big flaw with this approach... For certain searches that do not have "millions of results", you get completely unrelated results.

If I search my name then the results are for names similar to mine, but not actually my name. This makes it completely useless for searching my name. I would think that there are many searches with this problem.

I think there needs to be some kind of weighting system used that dynamically decides the cutoff point. One million is a huge over-generalization for all search terms.


The number of results isn't a factor. We are removing results from sites that are in the list of top 1 million most popular web sites. Hope that helps.
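
A minimal sketch of that kind of per-site filtering, for illustration only. It is not Million Short's actual code: the function names, the sample `top_sites` set, and the result URLs are all assumptions.

```python
from urllib.parse import urlsplit


def site_of(url):
    """Lower-cased hostname of a URL with a leading "www." stripped."""
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def remove_popular(result_urls, top_sites):
    """Keep only results whose site is NOT in the popularity list."""
    return [url for url in result_urls if site_of(url) not in top_sites]


# Hypothetical data: a tiny stand-in for a top-1M list and some results.
top_sites = {"facebook.com", "wikipedia.org"}
results = [
    "http://www.facebook.com/somepage",
    "http://tiny-recipes.example/lentil-soup",
    "http://wikipedia.org/wiki/Sushi",
]
print(remove_popular(results, top_sites))
```

The cut depends only on which site a result lives on, so a query with ten results and a query with ten million are filtered the same way.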


Ah, this seems to be the case. It looks like you are dropping my name down to "thousand from results" despite it still saying "million from results".

Very nice!


I like this idea a lot. I came across a nice, concise explanation of a Buffer overflow

http://www.apolis.org/index.php?option=com_content&view=...


This is a copy of the Wikipedia article on buffer overflows.


One addition I would make to this search engine is to delete all the Wikipedia clones from the results.


Looks like this idea isn't totally in the free-and-clear from content farms.


Yeah except now you get the content farms that don't know how to rank. A competitive spammy term like "pay day loans" still shows plenty of low quality sites because of the crazy number of sites looking to cash in on the term.


Damnit, you're right.


That's awesome. Maybe we're on to something..


It's a similar measure that's often used in NLP. Sentences, documents etc. are usually stripped of common or popular terms first and the remaining ones tend to have higher information value.

It's not entirely a surprise that it works for meta-language constructs like the web and site popularity.
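
For readers who haven't seen the NLP version, here is a minimal sketch of stop-word removal. The word list is a tiny illustrative sample, not a standard one, and the function name is made up.

```python
# A small, illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}


def strip_stop_words(text):
    """Lower-case the text, split on whitespace, and drop the common words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


print(strip_stop_words("The history of the web"))
```

The analogy is that a list of the most popular sites plays the role of `STOP_WORDS`, and search results play the role of the words.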


Uhh,

1) I am basically certain that it doesn't work. Imagine something this simple actually did work in general. Do you think Google, Bing, etc, wouldn't have implemented it?

2) I think the analogy is extremely flawed. nytimes.com is by no means a web page equivalent to "the" or "a" in language. Articles, pronouns, etc, don't really carry meaning. Despite being popular, nytimes certainly does.


I don't know. I haven't really spent a lot of time looking at this to get a good feel of the quality. But consider this, NYT tends to be a secondary source: e.g. book reviews (books are the primary source), science news (scientific papers are primary), business news (market movements and press releases are primary), etc.

Now consider, if I'm interested in some particular thing in science, is it better to get the NYT science reporting or just get the professor's publication and research page on their university website? Filtering out the top-n sites is more likely to turn up the professor's site near the top of the pile rather than after all of the popular sites' second and third level reporting.

Is this better? Depends on the audience. A "popular science" goal would argue the former is the better as science news simplifies, abstracts and popularizes complex science (with varying degrees of quality) while a scientist would prefer the latter.


The results feel actually fresh. It's removing the consumerism layer of bullshit that google serves us everyday. Also wondering if that has something to do with the "bubble" that google creates around us based on our search history and social network information.

Thanks for that.


Searching for 'python global interpreter lock' yields some interesting blog articles describing the problems, also some related articles about approaches to the C10k problem with python (preforking, worker processes, etc.)

A++ would search again.


I'm really liking this. Instead of being bombarded with content that's just blasted with keywords, I get relevant well-written articles. Not only that, but no more W3Schools in my SERPs. The chance to read an article that's written with humans in mind, instead of Google, is more than enough reason to spend some more time using this.


You think that top results in Google and other commercial search engines are always ranked based on "popularity"?

It would be harsh to call this naive, but it shows a serious lack of SEM and SEO knowledge. Ever heard of "paid placement"?

Many years ago when Digital's AltaVista was our main search engine, it was becoming loaded down with paid placement.

The results were polluted.

Google eventually became the "clean" solution.

But now it's Google that is loaded down with all sorts of commercial crud, much of it pointing to Google acquisitions.

And paid placement, among numerous other strategies, new and old, still exists.

The simplicity of millionshort is brilliant.

Filter out the crap.


Add a way for me to put this as my search engine in my firefox search bar.

Please.

EDIT: In trying to accomplish this task I found an add-on that lets you do this for anything.

(https://addons.mozilla.org/en-US/firefox/addon/add-to-search...)


Awesome.


Here's what Dropbox thinks about power users:

WSJ: What's next?

Mr. Ferdowsi: We continue to focus on actually solving problems that real people have and not being distracted by what power users want.

Google has made clear what they think about power users:

No + operators in search.

No web-based code search.

No Google Labs for the public.

etc.

Plenty of wood behind the Google arrows, but all the cool ones have been cast out of the quiver.

Just what kind of targets is Google aiming at nowadays?

Millionshort I give you +999,999.

I would give you +1M if you took out the AdSense and PlusOne javascript.

This has been a long time coming.

Alas, DDG and other alternatives are all about _money_.

Search is about _discovery_.


Well that's one way to break out of the filter-bubble/echo-chamber I suppose. If only our best search technology was based on something better than a popularity contest :(


Popularity I think is important. But not at the expense of relevance. It's not an easy nut to crack.


Why? If I search, say, for a game review, I don't care whether it comes from a popular website or a blog no one reads. In fact, the topmost websites are more likely to be biased, since they try to appease everyone and they also have strong relationships with publishers. The blog no one reads is nearly guaranteed to be honest (if not well-written).

This holds true for most topics I can think of. Moreover, if I ever need to read Wikipedia and such, I already know about those websites, and I can go there directly - no need to search. Shouldn't web search engines act like discovery tools?


> Shouldn't web search engines act like discovery tools?

If your business model is based on advertising that depends on masses of page views to generate value, then no. You want to be as generally useful as possible so that e.g. people use you as an (extremely inefficient) DNS service.

(Google Labs has a single optional feature available for search. Perhaps their arch doesn't make 20% or Labs projects a good fit for plugging in extra fancy search features?)


And of course, W3Schools still manages to show up, thanks to their multiple crazy subdomains: http://cl.ly/GFup


You're right. We strip www from the domains. We need a better function to rip out the domain from a URL. Just haven't had time to cook one up yet.


Try out http://publicsuffix.org/ - that plus a custom suffix list of overrides works wonders!
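
A minimal sketch of the idea, assuming a tiny hand-written suffix set in place of the real list. The real Public Suffix List is far longer and also has wildcard and exception rules, which this ignores; the function name and the sample hostnames are made up for illustration.

```python
# Tiny stand-in for the Public Suffix List (assumption: the real list is
# much larger and includes wildcard and exception rules not handled here).
PUBLIC_SUFFIXES = {"com", "net", "org", "uk", "co.uk", "au", "gov.au", "com.au"}


def registrable_domain(host):
    """Return the public suffix plus one label, or None if nothing matches."""
    labels = host.lower().strip(".").split(".")
    # Walk from the longest candidate suffix to the shortest.
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in PUBLIC_SUFFIXES:
            if i == 0:
                return None  # the host itself is a bare public suffix
            return ".".join(labels[i - 1:])
    return None


print(registrable_domain("validator.w3schools.com"))
print(registrable_domain("www.australia.gov.au"))
```

Comparing this value against the top-sites list, rather than the hostname minus "www", would catch arbitrary subdomains and multi-part suffixes such as .gov.au.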


"httpwww."? Seriously?

What the hell is their problem?


After a few test searches, this is surprisingly effective for things which I had resigned "un-findable" because of poor Google results. This is most apparent on non-technical things, in this case specific Jazz chord fingerings for a guitar class I am taking.

I am very interested as to what comes of this, or rather what is influenced by its implications.


Guitar chords / tabs / lessons are a terrible SEO spam offender... Look at that! I finally found an accurate transcription of "Bohemian Rhapsody"


That was exactly the goal. In our opinion the "un-findable" as you put it represents a gold mine of information yet to be absorbed and enjoyed.


Man, I love this thing. I've already found a bunch of interesting links on path tracing. Bookmark'd.


Sweet. Glad to have helped.


What is the ranking used for the top million sites? A search result for "Australia" returns as the top result http://australia.gov.au, which Alexa ranks as 20,615 globally. Actually, a lot of the queries I tried returned Australian sites.

http://millionshort.com/search.php?q=australia&remove=10...

http://millionshort.com/search.php?q=somalia&remove=1000... -- another Australian site.


Right now the filtering is not 100% working for non .com, .net, .org, .co.uk domains. So, .gov.au isn't being properly filtered yet. Soon.


The Public Suffix list http://publicsuffix.org/ maintains a list of domains you need to filter on.


I'd really like to see randomization instead. Return results picked randomly from within the top 10 million or something.


I'm not sure this is a great idea. Predictability is a staple of a good user experience. Getting different results for the same query between users or sessions is bound to lead to broken expectations and frustration.


I like the idea of offering it after the first search. I see the value of this kind of search engine as introducing some novelty/entropy into the system. I could imagine using it as a backup to primary search engines, in which case I'd definitely want to get some randomness going.

I wouldn't just want the "million-and-eleventh" site (so to speak) when clicking next.


Think of this as more of a discovery engine. And predictability takes the fun out of discovery.


Great point. You might have a hard time competing with google on relevant results but you might be able to beat them on discovery, like a stumbleupon search engine.


Randomization is a good idea. Give everyone a fair chance kind of thing maybe?


I need predictability... but make it optional, maybe a check-box with a key shortcut (the `R` I suggest).


Would be cool in mobile devices to see a shake gesture to "shake up" the results.

You could use this on modern smartphone browsers: https://github.com/alexgibson/shake.js


It's amusing to see all the SEO "experts" that don't make it into the top million:

http://millionshort.com/search.php?q=seo&remove=1000k


It removes the top million in general, not per query.


A real serendipity engine. Absolutely great, thank you. I'm finding tons of interesting products and ideas by searching the most banal things :)


I would prefer if my previous search was populated in the search box after completing a search (since I might want to try the search with a different filter).

It appears you have an "off by one issue" in the sidebar. There's always a blank entry in the list of ignored sites.

Filtering does not seem to be working (or I don't understand it). Searching on "chicken" produced the same results with 1million or 100k removed.


Will add that feature. Thx


I neglected to mention... Awesome idea!


Thankyou! I can see this being something I regularly use.

It may be a simple idea, but it's something nobody else has done before, and I think the creators deserve a lot of credit for coming up with and implementing it. I hope they manage to get something from it. I can see that if the site becomes popular it will just get copied by other search sites.


This is pretty amazing. I didn't know that the old internet was still there! This may become my new favorite search engine.


"Quality" is subjective.

More relevant is _accuracy_, i.e., you get what you specify via search operators, and results are not influenced by all of Google's silly "factors". You know what you're looking for and how to frame the query. But Google assumes you're dumb and thinks it should decide for you.

Alexa Top 1M is a nice filter because the data comes from the Alexa Toolbar which only the most braindead web users would have installed. So you are in effect avoiding sites that the web's most braindead users would often visit.

Ranking sites based on "popularity" is great until you reach the point where the majority of users are not very intelligent. (cf. search engine users in 2004 with search engine users today.) When you reach that point, you get results where "quality" is determined by idiots (and SEO hats), not a group of intelligent peers.


It's like a hipster search engine. It's only interested in things before those things are cool.


Is it safe to assume that this is how Google's search results would look if nobody did SEO?


No, not at all. Google is driven by many signals that SEOs try to optimize on, but the methodology is still the same. Top-ranking sites have high-quality sites linking to them, good content, and are supposed to not look spammy (though that's debatable).

I don't know how this engine ranks but I assume it's a similar system, they're just chopping a good portion of the results out. For what you said to be true, the top 10K/100K/1M rankings would have to be there entirely due to intervention from an SEO, and that's just not the case. The Wikipedias of the world have enough going for them that they don't need SEO, so they'll always show up in Google, and never in this engine.


I don't think so. Consider Facebook. They should show up when doing a search for social media or related term but with this SE, they don't show up at all for any search.


Funny, on my first query I found an obscure HN scraper:

  http://tazod.com/


Wow I think this is the exact idea I had for "HN Time Machine" - basically a thing to extend the life of the newest page and speed up the movement of stuff through the home page (ie. I think the HN new page moves too quickly and the home page not quickly enough)


Exactly what I was thinking in 2001 => http://www.halfbakery.com/idea/The_20Other_20search_20engine

Glad to see someone did it now...


This is a great idea and I see myself coming back to this. It's a shame that a little blog on tumblr or blogspot gets taken out because it's under a big name domain - but this has spam related benefits too.

Great work!


Maybe we can set up a Webmaster tools sort of submission process for inclusion.


Wow some of the content there is great. Forget about searching the deep web. For me deep is the real gems buried under the first 100 or so results where stuff actually gets interesting!


Only a little thing, could do with maintaining query strings between pages. It lost my query string and returned no results when I changed the drop down without me noticing.


Good catch. Will fix. Thx


I have been wanting something like this for a while. It's even on my todo list. Thanks for saving me the work. I'll be using it all the time!


How did you build this? Are you indexing the entire web yourself? Or are you using Google's index/removing the top 1 million based on domain?


I think search APIs like Yahoo BOSS allow you pass arguments that contain a black list of domains. I think it's the 'sites' argument that may be used like this: &sites=-google.com


You are right, but they won't allow the list to be 1 million sites long. You are talking 15 megs of data in plain text per request.
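
A back-of-the-envelope check of that figure, as a sketch. The average domain length of 14 characters is an assumption chosen for illustration, not a measured value.

```python
SITES = 1_000_000
AVG_DOMAIN_CHARS = 14   # assumed average length of a domain name
SEPARATOR_BYTES = 1     # one comma or newline between entries

total_bytes = SITES * (AVG_DOMAIN_CHARS + SEPARATOR_BYTES)
total_megabytes = total_bytes / 1_000_000
print(f"{total_megabytes:.0f} MB per request")  # prints "15 MB per request"
```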

But I like the idea of letting users, via a setting perhaps, add their own list of deny/include sites.

Thx for the comment.


Isn't this under Blekko's domain of ideas with the slashtags letting you include particular sites?


I just tried it with a search for some competitive intelligence. I used the 100K removal option. I found a competitor in another country that had not made the top 2 pages on Google. It confirms that others are launching something similar to what I am building... but also the fact that it doesn't bubble to the top on Google means that the market space is not dominated yet.


I love this! And am totally going to use it. Removing the top "thousand sites" removes pretty much all the sites I WISH I could have filtered from my Google results anyhow (ehow, w3schools, etc).

One request: please keep the search text in the form field after clicking "search". Just so users can search the same thing multiple times with different values in the "Remove the top" drop-down.


Thank you - very refreshing! DuckDuckGo should implement something like this just for the spirit of it.

The web just got more interesting. :)


Doesn't make a dent in the travel site spam, unfortunately. Though I might use this just to permanently remove About.com...


A few of the suggestions included the ability to set include/exclude sites which I think we'll add.


Just re-found a site I was looking for but couldn't find with google the other day. This could definitely be helpful.


Not sure if I found an anomaly or what, but a simple search of "Privacy" returns results from thesaurus.com, merriam-webster.com, truste.com, kelloggcompany.com, and many more that are all in the top few thousand according to QuantCast and Compete.

Great idea though, will definitely try this out some more.


merriam-webster.com redirects to m-w.com - looking into the others.. Sometimes if a site has header redirects it gets lost in the filters. Thx for pointing this out.
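
One way a filter could avoid losing redirected sites is to resolve known aliases before checking the blocklist. This is only a sketch: `REDIRECT_ALIASES` and `BLOCKED_DOMAINS` are hypothetical hand-filled tables, and it assumes the popularity list contains the redirect target rather than the original domain.

```python
# Sketch: resolve known redirect aliases before checking the blocklist.
# ASSUMPTION: both tables are hypothetical and filled in by hand here;
# a real system would have to discover redirects some other way.
REDIRECT_ALIASES = {"merriam-webster.com": "m-w.com"}
BLOCKED_DOMAINS = {"m-w.com"}


def canonical_domain(domain: str) -> str:
    """Follow alias entries until no further redirect is known."""
    seen = set()
    domain = domain.lower()
    while domain in REDIRECT_ALIASES and domain not in seen:
        seen.add(domain)  # guard against redirect loops
        domain = REDIRECT_ALIASES[domain]
    return domain


def is_blocked(domain: str) -> bool:
    """Check the blocklist using the canonical form of the domain."""
    return canonical_domain(domain) in BLOCKED_DOMAINS


print(is_blocked("merriam-webster.com"))  # True
```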


ahh that explains it. I'm actually working on a personal project right now and this could help out quite a bit, so I am excited to see where this goes. Best of luck!


Is it just me, or are the results fairly congruent with standard results from a search engine?


They are to a degree. We just remove the top million most popular sites. So, searching for "social network" would normally yield Facebook.com, but with MillionShort you discover a social network that didn't make it past the noise.


Great idea, although I think if you explained it a bit better you could avoid the confusion that several of these comments show. I like how my Hacker Newsletter project shows up #2 when searching for Hacker News. :)


And how do you determine the "top" million sites?



I think if you removed results with my search term in the domain name, this would be perfect!

For example, I searched for "how to start a garden" and I can guarantee that startagarden.com is junk. But I did see some useful advice from small blogs etc.


How would you like to see this work?


Well for example I searched for "grow a vegetable garden" and I still see a lot of results from URL's like:

http://howtogrowavegetablegarden.net/ http://www.grow-your-own-vegetable-garden.com/

And I just know any domain name optimized for a certain search term is going to be garbage.

I guess a simple version of this feature is you'd have a setting: "Exclude domains that contain my search term".

When the user clicks that you'd compare the domain name (removing all special characters) with my search term (also removing all special characters and white space). Maybe compare via edit distance and exclude if it passes a threshold?

Although edit distance might not work too well, perhaps looking at the longest common substring and if it's > say 90% of the length of my query exclude it?

I guess it would take some playing around. But there should be a good algorithm to exclude domain names very similar to my query.
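
A minimal sketch of the longest-common-substring idea, using Python's standard `difflib`. The function names, the choice to drop only the last label of the hostname as the TLD, and the 0.9 default threshold are illustrative assumptions, not anything the site is known to do.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlparse


def normalize(text: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def domain_label(url: str) -> str:
    """Hostname without a leading 'www.' and without its last label (the TLD)."""
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host.rsplit(".", 1)[0]


def looks_keyword_stuffed(url: str, query: str, threshold: float = 0.9) -> bool:
    """True if the longest common substring covers more than `threshold` of the query."""
    q = normalize(query)
    d = normalize(domain_label(url))
    if not q or not d:
        return False
    matcher = SequenceMatcher(None, q, d, autojunk=False)
    match = matcher.find_longest_match(0, len(q), 0, len(d))
    return match.size / len(q) > threshold


query = "grow a vegetable garden"
print(looks_keyword_stuffed("http://howtogrowavegetablegarden.net/", query))
print(looks_keyword_stuffed("http://www.grow-your-own-vegetable-garden.com/", query))
```

With the 0.9 threshold the first URL is flagged but the second is not, since its longest shared run ("vegetablegarden") covers only 75% of the query. Short queries are also a problem: a search for "Ruby" would flag ruby-lang.org. So the threshold, and probably a minimum query length, would need tuning.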


I'm getting odd results with the following query :

Search String : Ruby

Remove From Top : 1000 & 10000

In both instances, the top hit is http://www.ruby-lang.org, which is also the top hit from both Google and DDG.

Am I missing something?

edit: formatting


I think that removes the top 1,000 or 10,000 across all searches, not just "ruby".


This is a cool search engine for discovery, but it defeats the purpose when you are looking for a location. Do a search for "facebook" and you will not get any result that links you to facebook.com.


I've seen people do this a few times, and don't understand why they do it (Google 'facebook' or 'twitter' and then click the link to get to the site instead of typing 'facebook.com' or 'twitter.com' into the address bar). Can you explain why? I'm truly curious.


Couple reasons I've encountered..

1. User doesn't understand the WWW. They believe Google is THE gateway to everything internet.

2. User doesn't know (or has trouble typing) the exact address. Google tends to have the authoritative result first for popular sites like facebook, amazon, twitter..


I also often don't know the exact address, but browsers autocomplete URL for me. I think it's a fair bet those who search for Facebook already visited Facebook. Why search then?


I prefer to go straight to a site, rather than detour through a search engine. That said, I use chromium, and use the address bar like a search engine. Sometimes the auto-complete half-way through typing is the search rather than the site. If I don't notice..


I would like to see the search engine adhere more strictly to quoted search terms. It seems that they are partially ignored, which gives it some of the same problems that the major engines have.


Please add a favicon so I can see in my (icon only) Bookmarks Toolbar.


favicon coming right up. Minutes away.


Good idea. It's about time that search engines route around the power-law distributions of popular sites, popular bloggers and personalities to find the gems otherwise buried in the noise.


> Imagine a search engine that simply removed the top 1 million most popular web sites from its index. What would you discover?

A lot of my competitors who are still on the first page of Google results.


Wow that is pretty awesome. I reached some results I want that I could not find via popular search engines with hours of searching. Believe it or not, this engine is changing my life.


Glad that code is being put to good use.


Very interesting. I was pleased with the results and have already added this site to my Chrome bookmark bar, right between my Google search and Hacker News icons.


Well, in a rather meta turn of events, searching for my username on this returned a link to hackerbra.in, which appears to be some kind of HN mirror.


If you search for "google" and remove the top million results, you still get Google's main page (in this case, the ones for Australia and India...)


The site's way too wide on my netbook, 1024x600. Also, the list of domains removed from the results covers up part of the results themselves.


Another Cool feature could be to exclude sites that use Adwords or Paid Search from the list too. Then it would really just be legit sites.


It doesn't appear to work. I did a search for aspirin and the top match returned by this is #5 doing the same search with Google.


We're simply removing the top million (or what you select) sites. The results can be the same - it just means that that site isn't in the list of top one million sites on the web.
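
To illustrate the difference, here is a minimal sketch of site-level filtering. The ranked list, the `.example` domains, and the function names are all hypothetical; how the real list is obtained is not stated here.

```python
from urllib.parse import urlparse

# ASSUMPTION: a hypothetical popularity ranking, most popular site first.
RANKED_SITES = ["big-portal.example", "big-news.example", "mid-site.example"]


def host_of(url: str) -> str:
    """Lowercased hostname with any leading 'www.' removed."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def remove_top_sites(result_urls, ranked_sites, top_n):
    """Drop results hosted on the top_n ranked sites; keep the rest in order."""
    blocked = set(ranked_sites[:top_n])
    return [url for url in result_urls if host_of(url) not in blocked]


results = [
    "http://www.big-portal.example/aspirin",
    "http://small-blog.example/aspirin",
]
print(remove_top_sites(results, RANKED_SITES, top_n=2))
# ['http://small-blog.example/aspirin']
```

A result from a site outside the ranked list keeps its place no matter how high the underlying engine ranks it, which is why a top hit can survive the filter.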


Top million sites, not top million results.


Correct. Top million sites.


If this becomes popular, at some point results would disappear since unpopular sites will be pushed into the first million.


I guess the thought would be that those sites have graduated to some level of critical mass. Sort of like a kickstart program for sites.


At some point, you wouldn't even be able to search for millionshort.


8 hours later, we just launched our first re-design. Thanks for all the great feedback and support. More to come.


I really like this. Right away I found some new sites that I hadn't seen before with interesting content.


As someone learning web development, I'd love to get some insights into how one could build this.


Not knowing anything about what they do, the hack-ish way you could do it is to use Google CSE (custom search engines) to add a list of negative domains. Where to get the list of top 1 million domains? Probably from Quantcast here: http://www.quantcast.com/top-sites-1


Thanks.

Google has really killed the discoverability of the internet for me. I will be experimenting with this.

Best of luck.


This is an incredible breath of fresh air, what an odd thing to say.


Removing Wikipedia might be a mistake. Otherwise, it's great.


Why anyone would want Wikipedia in search results is beyond me. If I want to read a Wikipedia article, I can just search Wikipedia itself. I know the kind of information it has, so there is no point in "ranking" it against other websites.


I'm not sure if this is still the case, but in the past Wikipedia's search engine was terrible and it was actually easier to google "X wikipedia" or "X wiki".


Links to Wikipedia could appear in a separate box near the top...it should be this way on Google and Bing as well...because Wikipedia is either exactly what someone wants or exactly what they don't want.


i searched for my site and in teh goog i get first page....here i found nothing....so for me this = no good. I understand the base but i dont understand the result


Should I be depressed that I'm the top hit for my own name?


I wonder if there's an analogous hack for social news?


Now if I could add it to firefox's search bar...


I came here to make the same request.

And I am loving this.


very interesting hack. thanks for doing it


non-popular websites will start seeing some good traffic suddenly. It would be confusing for them :)


my website ranked 1st? guess i need to work harder T_T


Searched my name... Got my website

:(


you should somehow incorporate hipster into the search site's name.


I searched for "Hero Academy" and the first result was Google's 5th result, a site called Hero Academy with the url "hero-academy.com". That's not very "million short", IMHO.


The site removes the top million sites by popularity, not the first million results returned by a Google search. "hero-academy.com" is likely not in the top million most popular sites.



