Of course, there are great websites out there that are very popular (Wikipedia, NYTimes/WSJ, StackOverflow). I'd love to see a search engine with a better signal for quality than non-popularity (this search engine), or SEO (Google), but it's a fun start. :)
 http://www.searchlores.org/yoyo1.htm (try to read past the preachy anti-commercialism, he could get kind of hot about that--whether you agree or not, there's troves of knowledge to be gained from that site)
http://channel101.com/ offers a nice selection of "unpopular" shows. You should check Danny Jelinek's "Everything" (https://vimeo.com/channels/231109) if you're into video-art.
Anything you can identify as pure SEO spam, exclude of course. But if it's some original content that just isn't well connected or whatever, include it.
Then just observe user behavior for a while until you can discern patterns in how the people who play around with it use it to discover new and interesting stuff. Maybe a new algorithm will come of it.
You need to either remove the top 100 (or 1000 etc) and look at what remains, or reverse only the top 10000 results (or 1000 etc).
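A rough sketch of the two variants in Python, assuming you already have a ranked list of result URLs and a set of top-N domains (all names here are made up for illustration):

    from urllib.parse import urlparse

    def remove_top(results, top_domains):
        # Variant 1: drop any result whose domain is in the top-N set.
        return [url for url in results if urlparse(url).netloc not in top_domains]

    def reverse_top(results, n):
        # Variant 2: keep everything, but show the first n results last-to-first.
        return list(reversed(results[:n])) + results[n:]

    # remove_top(["http://example.com/a", "http://tiny-blog.net/b"], {"example.com"})
    # -> ["http://tiny-blog.net/b"]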
This website is an interesting experiment; however, if you're after discovering random, semi-related pages, why not just use Google or StumbleUpon?
Is there a way to include domains that are automatically excluded? For example, WordPress.com is excluded.
Disclaimer- former WREK DJ. :-)
"Artists who meet any of the following criteria are automatically disqualified from airplay on WUOG 90.5FM:
The artist is or has been in regular rotation on major commercial radio stations. This does not apply to material used in specialty shows on major commercial stations.
The artist has a music video that has aired on major music video stations. This does not apply to material used in specialty shows on major commercial networks.
Another criterion that DJs must use when selecting music for airplay is its position on the Billboard Album Charts. If one of the artist’s albums has entered the top 50, then a DJ cannot play that artist."
Yahoo became the goto portal because the quality of their links was high. Google overtook them because metadata was the only practical way to keep up with the pace at which the web was expanding.
However, after more than a decade of SEO, its limitations are evident as the long tail gets increasingly harder to reach with search engines based on the Google model, e.g. "Purchase empiricist philosophers on eBay."
My suspicion is that Google is abandoning neutral search for personalized search in part because of the problems SEO poses for neutral searching: personal tracking provides an algorithmic way of establishing the quality of results for ranking purposes, one which is easier than attempting to develop a neutral curation algorithm.
That's pretty much precisely how PageRank (in its original form) changed web search. At the time, at least, it worked because we were really bad at automatically analyzing content, and the connectivity signal provided better results.
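For anyone who hasn't seen it, here's a toy power-iteration sketch of PageRank in its original form, just to show how far you can get with the connectivity signal alone (nothing like production scale):

    def pagerank(links, damping=0.85, iterations=50):
        # links: dict mapping each page to a list of pages it links to;
        # assumes every linked-to page is also a key in links.
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new = {p: (1.0 - damping) / n for p in pages}
            for page, outlinks in links.items():
                share = damping * rank[page] / (len(outlinks) or n)
                for target in (outlinks or pages):  # dangling pages spread rank evenly
                    new[target] += share
            rank = new
        return rank

    # pagerank({"a": ["b"], "b": ["a", "c"], "c": ["a"]})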
For one thing, the huge miasma of spam websites that dominates the SERPs just isn't there -- I hope this lights a fire under Google's butt and people see another world is possible.
Google isn't going to come along and say, "oh, guys, you're right... Wave, Buzz, Google+, all a mistake. Social Graph... mistake! We're sorry. We are rolling everything back to 2004."
Google has been getting increasingly frustrating for power users over the last couple of years. People are looking for alternatives. DDG is one; this is another. Power users matter, and not just because they spend more time inside an application, but because they recommend it to others.
A perfect example of this is Firefox. I used Firefox a lot, and then switched to Chrome. I've since switched my mother, grandpa, brother, girlfriend, sister, and too many friends to count to Chrome. I now use DDG for all my searching needs via the bang syntax (!, !g, !w, etc.). It's only a matter of time before the rest of my family follows along.
So basically, DDG probably has a whole bunch of !bang searches that you simply hadn't thought to create your own keywords for yet. And when you need one, it's already there.
There are also a few !bang queries that are not external searches, such as one for rolling dice (it can do !roll 3d6+3).
Another minor difference is that you can add the !bang keyword anywhere in the query, including at the end. Being able to add it at the end makes it easier to say "ok, let's try this on another search engine".
! fuji Ames => takes me to Fuji Steak House in Ames Iowa directly
!w Sushi => takes me directly to the article on Sushi on Wikipedia
!m 801 grand ave Des Moines => takes me directly to Google Maps at that address.
I still probably do about 30% of my searches using !g, kicking me straight to Google for search results.
Do I have to set up all these different keywords to get that to work? If so, that's the difference. It's essentially a pre-configured command line with literally hundreds (thousands?) of predefined search shortcuts.
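Under the hood a !bang is more or less a keyword-to-URL-template lookup; a toy sketch of the idea (the table entries here are my own guesses, not DDG's actual list):

    from urllib.parse import quote_plus

    # Hypothetical bang table; DDG's real one has thousands of entries.
    BANGS = {
        "g": "https://www.google.com/search?q={}",
        "w": "https://en.wikipedia.org/w/index.php?search={}",
        "m": "https://maps.google.com/?q={}",
    }

    def resolve(query):
        # Find a !bang anywhere in the query and build the redirect URL.
        words = query.split()
        for i, word in enumerate(words):
            key = word[1:].lower()
            if word.startswith("!") and key in BANGS:
                rest = " ".join(words[:i] + words[i + 1:])
                return BANGS[key].format(quote_plus(rest))
        return None  # no bang: fall through to a normal search

    # resolve("!w Sushi")                    -> Wikipedia search for "Sushi"
    # resolve("801 grand ave Des Moines !m") -> maps lookup (bang at the end works too)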
Perhaps there would be a great irony in this site becoming popular and using Google ads to finance it!
While Google has its head up its ass^H^H^H social stuff, there doesn't seem to be any effort to actually improve search.
"Bad news-- we're a top 100 hit for several of our main keywords. We'll have to change our URL scheme again."
Maybe we'd see a resurgence in Flash.
Search engines can follow links (sometimes) through Flash, and search engines know what the most popular sites are even without counting the number of links to them, so hiding links behind Flash (or JS, whatever) wouldn't help.
This is really refreshing.
What's your source for the top million sites, and where do you get your site list for the other results?
Cat is out of the bag.
What is the Alexa list good for? Answer: Filtering out the boring, money-grubbing commercial sites. A truly GREAT idea.
A return to the good ol' days. The non-commercial web.
Many young people who love today's www never got to experience it as it was before it became overrun with Google-ization and auto-generated garbage.
Take the ball and run with it. We can reclaim the web. This is only the beginning.
It's a cool idea, but I'm not sure it's working. I tried "american history" but it wouldn't return anything at all if I changed the "Remove the Top" dropdown.
If I search my name then the results are for names similar to mine, but not actually my name. This makes it completely useless for searching my name. I would think that there are many searches with this problem.
I think there needs to be some kind of weighting system used that dynamically decides the cutoff point. One million is a huge over-generalization for all search terms.
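One naive way to pick the cutoff dynamically (pure speculation on my part, and the result format is assumed): start with the most aggressive removal list and back off until enough results survive.

    def pick_cutoff(results, rank_of, min_survivors=20,
                    cutoffs=(1000000, 100000, 10000, 1000, 100)):
        # rank_of: dict mapping domain -> popularity rank (smaller = more popular).
        # Try the most aggressive cutoff first, back off until enough results remain.
        for cutoff in cutoffs:
            survivors = [r for r in results
                         if rank_of.get(r["domain"], cutoff + 1) > cutoff]
            if len(survivors) >= min_survivors:
                return cutoff, survivors
        return 0, results  # nothing worked; don't filter at all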
It's not entirely a surprise that it works for meta-language constructs like the web and site popularity.
1) I am basically certain that it doesn't work. Imagine something this simple actually did work in general. Do you think Google, Bing, etc, wouldn't have implemented it?
2) I think the analogy is extremely flawed. nytimes.com is by no means a web page equivalent to "the" or "a" in language. Articles, pronouns, etc, don't really carry meaning. Despite being popular, nytimes certainly does.
Now consider, if I'm interested in some particular thing in science, is it better to get the NYT science reporting or just get the professor's publication and research page on their university website? Filtering out the top-n sites is more likely to turn up the professor's site near the top of the pile rather than after all of the popular sites' second and third level reporting.
Is this better? Depends on the audience. A "popular science" goal would argue the former is the better as science news simplifies, abstracts and popularizes complex science (with varying degrees of quality) while a scientist would prefer the latter.
Thanks for that.
A++ would search again.
It would be harsh to call this naive, but it shows a serious lack of SEM and SEO knowledge. Ever heard of "paid placement"?
Many years ago when Digital's AltaVista was our main search engine, it was becoming loaded down with paid placement.
The results were polluted.
Google eventually became the "clean" solution.
But now it's Google that is loaded down with all sorts of commercial crud, much of it pointing to Google acquisitions.
And paid placement, among numerous other strategies, new and old, still exists.
The simplicity of millionshort is brilliant.
Filter out the crap.
EDIT: In trying to accomplish this task I found an add-on that lets you do this for anything.
WSJ: What's next?
Mr. Ferdowsi: We continue to focus on actually solving problems that real people have and not being distracted by what power users want.
Google has made clear what they think about power users:
No + operators in search.
No web-based code search.
No Google Labs for the public.
Plenty of wood behind the Google arrows, but all the cool ones have been cast out of the quiver.
Just what kind of targets is Google aiming at nowadays?
Millionshort I give you +999,999.
This has been a long time coming.
Alas, DDG and other alternatives are all about _money_.
Search is about _discovery_.
This holds true for most topics I can think of. Moreover, if I ever need to read Wikipedia and such, I already know about those websites, and I can go there directly - no need to search. Shouldn't web search engines act like discovery tools?
If your business model is based on advertising that depends on masses of page views to generate value, then no. You want to be as generally useful as possible so that e.g. people use you as an (extremely inefficient) DNS service.
(Google Labs has a single optional feature available for search. Perhaps their arch doesn't make 20% or Labs projects a good fit for plugging in extra fancy search features?)
What the hell is their problem?
I am very interested as to what comes of this, or rather what is influenced by its implications.
http://millionshort.com/search.php?q=somalia&remove=1000... -- another Australian site.
I wouldn't just want the "million-and-eleventh" site (so to speak) when clicking next.
You could use this on modern smartphone browsers:
It appears you have an "off by one issue" in the sidebar. There's always a blank entry in the list of ignored sites.
Filtering does not seem to be working (or I don't understand it). Searching on "chicken" produced the same results with 1million or 100k removed.
It may be a simple idea, but it's something nobody else has done before, and I think the creators deserve a lot of credit for coming up with it and implementing it. I hope they manage to get something from it. I can see that if the site becomes popular it will just get copied by other search sites.
More relevant is _accuracy_, i.e., you get what you specify via search operators, and results are not influenced by all of Google's silly "factors". You know what you're looking for and how to frame the query. But Google assumes you're dumb and thinks it should decide for you.
Alexa Top 1M is a nice filter because the data comes from the Alexa Toolbar, which only the most braindead web users would have installed. So you are in effect avoiding sites that the web's most braindead users would often visit.
Ranking sites based on "popularity" is great until you reach the point where the majority of users are not very intelligent. (cf. search engine users in 2004 with search engine users today.) When you reach that point, you get results where "quality" is determined by idiots (and SEO hats), not a group of intelligent peers.
I don't know how this engine ranks but I assume it's a similar system, they're just chopping a good portion of the results out. For what you said to be true, the top 10K/100K/1M rankings would have to be there entirely due to intervention from an SEO, and that's just not the case. The Wikipedias of the world have enough going for them that they don't need SEO, so they'll always show up in Google, and never in this engine.
Glad to see someone did it now...
But I like the idea of letting users, via a setting perhaps, add their own lists of sites to exclude or include.
Thx for the comment.
One request: please keep the search text in the form field after clicking "search". Just so users can search the same thing multiple times with different values in the "Remove the top" drop-down.
The web just got more interesting. :)
Great idea though, will definitely try this out some more.
For example, I searched for how to start a garden and I can guarantee that startagarden.com is junk. But instead I see some useful advice from small blogs, etc.
And I just know any domain name optimized for a certain search term is going to be garbage.
I guess a simple version of this feature is you'd have a setting: "Exclude domains that contain my search term".
When the user clicks that you'd compare the domain name (removing all special characters) with my search term (also removing all special characters and white space). Maybe compare via edit distance and exclude if it passes a threshold?
Edit distance might not work too well, though; perhaps look at the longest common substring, and if it's > say 90% of the length of my query, exclude the domain?
I guess it would take some playing around. But there should be a good algorithm to exclude domain names very similar to my query.
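A quick sketch of that idea using the longest-common-substring variant (the normalization and the 90% threshold are just guesses to play with):

    import re
    from difflib import SequenceMatcher

    def normalize(s):
        # Lowercase and strip everything that isn't a letter or digit.
        return re.sub(r"[^a-z0-9]", "", s.lower())

    def looks_like_keyword_domain(domain, query, threshold=0.9):
        # True if the longest common substring between the (normalized) domain
        # and query covers at least `threshold` of the query length.
        name = normalize(domain.rsplit(".", 1)[0])  # drop the TLD
        q = normalize(query)
        if not q:
            return False
        m = SequenceMatcher(None, name, q).find_longest_match(0, len(name), 0, len(q))
        return m.size >= threshold * len(q)

    # looks_like_keyword_domain("startagarden.com", "start a garden")  -> True
    # looks_like_keyword_domain("nytimes.com", "start a garden")       -> False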
Search String : Ruby
Remove From Top : 1000 & 10000
In both instances, the top hit is http://www.ruby-lang.org, which is also the top hit from both Google and DDG.
Am I missing something?
1. User doesn't understand the WWW. They believe Google is THE gateway to everything internet.
2. User doesn't know (or has trouble typing) the exact address. Google tends to have the authoritative result first for popular sites like facebook, amazon, twitter..
A lot of my competitors who are still on the first page of Google results.
Google has really killed the discoverability of the internet for me. I will be experimenting with this.
Best of luck.
And I am loving this.