Million Short allows users to remove up to the top 1M sites from a search set (forbes.com/sites/julianmitchell)
100 points by cpeterso on Jan 7, 2018 | 57 comments

I use Million Short to search for howtos written by real craftsmen in their field.

One example is an end-grain cutting board that I made recently. For most things like that, the top 1000 are dominated by made-for-Pinterest blogs or major sites that aggregate low-quality content that's good enough to get hits but not good for much more than that.

I just tested this for 'kumiko patterns' and 'islamic geometry' and was pleasantly surprised at the great results hidden 1000 links away that I would never have come across in google. I'm going to be using this search engine.

What would we do without the garbage content that is eHow and wikiHow?

I had a friend who had a whole list of dozens of banned sites configured for his google searches, which made Google way more useful. And then one day, Google decided to cut that feature out, as part of their drive to remove power user features across their products.

I have always wanted that feature. I never knew it existed previously.

There is still https://chrome.google.com/webstore/detail/personal-blocklist...

Similar extensions exist for firefox.

I’ve always wanted to see a search engine that takes the top SEO tips and penalizes sites that use them.

For example, searching for how to grow a garden: I never want to see howtogrowagarden.com. I’d prefer to find more genuine, non-SEO-juiced advice.

Problem is that SEO is tied to appearing relevant, so to penalize SEO is to penalize relevancy. Splitting SEO from relevancy is often a matter of making better search

Searching for least-relevant can be pretty random: it's easy to point at the center of a circle, but the edge of the circle is not a point

A perfect system would penalize SEO (appearing relevant) and reward actual relevance. Of course, this is easier said than done.

This is exactly what the search engines have been attempting for the last 10 years.

By your own words, since SEO is tied to appearing relevant, penalizing SEO penalizes apparent relevancy.

and since SEO is abused for pageviews i.e. apparent relevancy, this penalizes SEO abuse.

The whole reason google spends billions on their search engine is that humanity does not yet have a program that can differentiate fake relevancy from relevancy, with perfect accuracy.

The venn diagram between SEO and user satisfaction is gradually being compressed into a circle by Google as they improve their algorithm.

SEO is already basically human-oriented now; anyone selling mumbo-jumbo SEO magic these days is a crank. It used to actually work quite well.

Many SEO tips, like using H1 tags, make the site more usable IMO. I imagine removing the top million leaves you with some pretty ugly sites.

I’d like to remove just the top 10K or 100K.

It looks like they already have an option for that.
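A minimal sketch of the idea behind such a cutoff option, assuming a lookup table of domain popularity ranks. The thread does not say how Million Short implements this, so `drop_top_sites`, the rank table, and the example URLs are all hypothetical.

```python
from urllib.parse import urlparse


def drop_top_sites(urls, domain_rank, cutoff):
    """Return the URLs whose domain is not among the `cutoff` most popular sites.

    `domain_rank` maps a bare domain to its popularity rank (1 = most popular).
    Domains missing from the table are treated as unranked and kept.
    """
    kept = []
    for url in urls:
        host = urlparse(url).hostname or ""
        if host.startswith("www."):
            host = host[len("www."):]
        rank = domain_rank.get(host)
        if rank is None or rank > cutoff:
            kept.append(url)
    return kept


# Hypothetical ranks and results, for illustration only.
DOMAIN_RANK = {"pinterest.com": 30, "wikihow.com": 150, "hobbyshop.example": 40_000}
RESULTS = [
    "https://www.pinterest.com/pin/123",
    "https://www.wikihow.com/Make-a-Cutting-Board",
    "https://hobbyshop.example/end-grain-board",
    "https://craftsman.example/kumiko",
]

top_10k_removed = drop_top_sites(RESULTS, DOMAIN_RANK, cutoff=10_000)
top_1m_removed = drop_top_sites(RESULTS, DOMAIN_RANK, cutoff=1_000_000)
```

With a 10K cutoff the mid-ranked hobby site survives; with a 1M cutoff only the unranked site is left, which is the trade-off being discussed here.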

An approach I've thought of in the past would be to prioritize sites with fewer scripts and smaller stylesheets. Lots of .edu pages fit the bill

One simple tweak I wish I could do in Google would be to eliminate search results based on broad criteria. For example, no sites that have a product for sale, no sites that repackage content from other sites, no sites that include a given word, no sites that serve ads.

You can instruct Google not to include sites with a given word by prefixing the excluded words with a dash. For ex:

> buy hoes -sex -porn
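If you build such queries often, the dash prefix is easy to script. A small sketch, where `exclude_words` is a hypothetical helper and not part of any Google API:

```python
def exclude_words(query, excluded):
    """Append each excluded word to the query with a leading dash."""
    for word in excluded:
        if not word or any(ch.isspace() for ch in word):
            # This sketch only handles single words, not phrases.
            raise ValueError(f"expected a single word, got {word!r}")
    return " ".join([query] + ["-" + word for word in excluded])


query = exclude_words("buy hoes", ["sex", "porn"])
print(query)
```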

Yes, but the rest is not solved by that. Google has really let the search game slide for a while.

They've even removed a lot of the useful search operators. Someone needs to bring advanced search back in a competing search engine.

But the goal isn't good search. It's good enough search with Google's narrative (usually ads) filling in the space between.

Do people really think search is a solved problem?

I still have difficulty finding information I need for work from company Intranets

I still have difficulty finding really local information

I still have trouble finding news that is objective and not slanted or click bait

I still have difficulty finding recommendations for good books to read

It seems to me like Google has really dropped the ball on search since they acquired "lock-in" through Gmail, Chrome sync, Android etc.

For corporate intranets Google is not a possibility. There, xapian, elasticsearch or lucene are the best, with xapian dominating the backend and the Java stuff the frontends.

For the others SEO optimizations are a real problem, yes. You can only try alternatives, like searx.to, bing or asking around.

> xapian, elasticsearch or lucene are the best, with xapian dominating the backend and the Java stuff the frontends

Elasticsearch uses Lucene under the hood. In my experience Lucene dominates the actual indexing and searching, although I'm not familiar with xapian.

By users, I reckon SharePoint search (FQL) is probably the biggest although it is way behind Lucene in features.

Google actually used to sell bright yellow branded racks that they would come and install on corporate networks to provide a "private Google" but I'm not sure if they still do.

With "xapian dominating" I meant the technical side. Of course lucene has more market share, because of the better elasticsearch frontend and browser support, i.e. highlighting and jumping to the results in Word or PDF docs.

I wouldn't trust Google locally either, and it's expensive.

SharePoint search is unfortunately used too often, yes.

In what respects does Xapian beat Lucene?

Everything backend related. Much faster, much less memory, more backend features, huge indices - Google scale. (Gmane was its most prominent public user.) Lots of language bindings, such as PHP, Perl, and Python.

Can you link to evidence of it being faster, and expand on what you mean by "more backend features"?

I believe you, I just can't find evidence online.

My application is not that big, around 10 million text files, but I would be interested in anything faster (or allowing more complex queries) than Lucene, which is what I use at the moment.

Most corporate information isn't accessible via http. Even SMB is being deprecated in favour of proprietary document databases, where companies are locked into extremely expensive but feebly engineered "solutions".

This is a cute idea, but there are some genuinely useful sites this winds up skipping over. Wikipedia is a great resource for quickly understanding a topic enough to drill down into more precise research.

Where I see this being useful is searching for current events. For example a search for a local double-shooting I've been following returned some information I hadn't seen before. It would probably be good for them to focus on more news-oriented searching as that's where there's a serious echo chamber among the top websites.

I use https://addons.mozilla.org/en-US/firefox/addon/g-search-filt... for filtering Google. Can't live without it. It would be interesting to have a Github repo with some precompiled filters based on certain business domains.

But for me at least, a couple of rules are enough to solve 99% of the problems.

I'm using DuckDuckGo and I just realised I should install something similar, thanks!


EDIT: You can actually achieve something similar with bookmarks, using keywords and '%s'

The idea is: take a link to a search query string of a search engine, and replace the query part with '%s'. For example, take the following search query on DuckDuckGo:

    cute hedgehog -site:www.pinterest.com -site:boredpanda.com -site:amazon.com -site:etsy.com

This results in:

Bookmark that, and replace the "cute+hedgehog" part with "%s", then edit the bookmark and add (for example) "ddg" to the "keyword" section, then typing in:

    ddg cute hedgehog

... will send you to:
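A rough sketch of what the browser does with such a keyword bookmark, assuming DuckDuckGo's `https://duckduckgo.com/?q=` query URL and standard form-style encoding of the search terms. The helper names `make_bookmark_template` and `expand` are hypothetical, and real browsers may encode some characters slightly differently.

```python
from urllib.parse import quote_plus

# Sites to exclude, taken from the example query above.
BLOCKED_SITES = ["www.pinterest.com", "boredpanda.com", "amazon.com", "etsy.com"]


def make_bookmark_template(blocked_sites):
    """Build the bookmark URL, leaving a literal %s where the search terms go."""
    filters = " ".join(f"-site:{site}" for site in blocked_sites)
    return "https://duckduckgo.com/?q=%s+" + quote_plus(filters)


def expand(template, terms):
    """Mimic the browser: substitute the URL-encoded terms for %s."""
    return template.replace("%s", quote_plus(terms))


template = make_bookmark_template(BLOCKED_SITES)
url = expand(template, "cute hedgehog")
print(url)
```

The template is what you would save as the bookmark location; `expand` stands in for typing `ddg cute hedgehog` in the address bar.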


Nifty trick! Is there a limit to the number of site: filters?

Look for an extension that highlights the good results. I find that more valuable than filtering the bad results.

I just tried, DDG seems to be a bit unreliable, especially when youtube is involved:


As for Google, the limit to a query is 32 words, apparently: https://imgur.com/a/XW1Qa

... however, it also supports inurl:<query>, so you can easily filter out sites with many subdomains (say, pinterest.com, pinterest.co.uk, etcetera), just by using -inurl:pinterest
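A small sketch of building such a query while staying under the word limit. The 32-word figure is just what the screenshot above suggests, not something I have verified, and `build_query` is a hypothetical helper:

```python
MAX_QUERY_WORDS = 32  # limit as reported above; treat it as an assumption


def build_query(terms, blocked_names, max_words=MAX_QUERY_WORDS):
    """Append one -inurl: filter per blocked name and enforce the word limit."""
    words = terms.split() + [f"-inurl:{name}" for name in blocked_names]
    if len(words) > max_words:
        raise ValueError(f"query has {len(words)} words, limit is {max_words}")
    return " ".join(words)


query = build_query("cute hedgehog", ["pinterest", "etsy"])
print(query)
```

One `-inurl:pinterest` covers every regional Pinterest domain, so it spends a single word of the budget where several `-site:` filters would otherwise be needed.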

Does it fetch next result page to fill removed items?

I'm not sure. Results I want are usually on the first page though, almost always highlighted in green. The "*.edu" highlight is particularly useful.

I love the idea but if this gets popular I guarantee the content-farm/SEO assholes will figure out how to be on the first results page -- registering a boatload of domains is the obvious countermeasure, there are probably many others.

Having worked for Demand Media, I can say SEO people will always find a way. It’s always going to be a cat and mouse game, sadly.

So is there a way around on this app?

Be the 1,000,001st result?

What’s sad is that, considering how Panda hit them, they may benefit.

to be the 1st result.

We're actively working to add lots of features on Million Short in 2018. I'd be happy to add any feature requests to our roadmap planning.

If I block ads on principle, but still want to support your site because I like the service, what are my options?

Also, adding a Dark Theme to the settings would be nice! (I'm trying to minimise the amount of light I'm exposed to at night to reduce eye-strain)

> I'd be happy to add any feature requests to our roadmap planning.

Love the concept, seems to work well. I personally use the 'media' specification of Google often (mostly: images, videos, pdfs, scholar articles, and Google patents). I didn't see a way to filter on Million Short.

...and if you could find a way to filter content that is not behind paywalls that would be amazing. As an engineer I'm constantly searching standards/specs (e.g. IEEE, ASTM, ISO standards etc.) some of which you can find free copies of for the previous revision (which are still pretty good), but you have to dig deep through top-ranking paywall sites.

Archived copy, which can be viewed with JS disabled:


I tried out a few popular queries where it's hard to find great content that isn't extensively SEOed, and this works great! Sure, the results are a bit sparse, but this is a great tool for finding good content that would otherwise be buried beyond page 10. Now that I think about it, a lot of URLs are shared by reputable authors on Twitter, HN, Reddit etc., many of which would be examples of "dark content", i.e. not easily found unless the right keywords are entered. For example, search for "neural network from scratch" and you are unlikely to find a great-quality implementation like Layered [1] in any of the search engines. Instead you will only find what was extensively linked by others.

[1] https://github.com/danijar/layered

The actual site (millionshort.com) seems hugged to death. I was excited to try it.

If it takes off, I wonder if the phenomenon will give a boost to affiliate sites. If you can't be there in the top results, align with those who can.

I would have liked to read about how their technology works. I imagine they're not building their own index using their own crawlers.

"Nice startup you have there. We have found that users like what you do, so we've added your algorithm as an option on our search engine. Your business case just vapourized." --Google

This would skip a lot of the sites that buy ads from Google in the first place. They'd probably rather just purchase it and shut it down.

How large is their search index and how much funding does it take to build an index that can compete with Google?

Oh the irony, Forbes itself is a low quality content aggregator that needs to be blocked.

The contrarian search engine, I like the way they think. I'll take it for a spin later.

Please use the original title.

"... unless it is misleading or linkbait": https://news.ycombinator.com/newsguidelines.html

The article title is both, so we replaced it with representative language from the text.

Great idea.
