
Ask HN: Is there a search engine which excludes the world's biggest websites? - cJ0th
Discovering unknown paths of the web seems almost impossible with Google et al.<p>Are there any search engines which exclude or at least penalize results from, say, the top 500 websites?
======
noad
This is a great question, I also want a way to search the internet but exclude
all major media domains as well as any company over a certain size. So I just
want to search through old blogs, SO, non-corporate social media, weird
forums, etc.

There are so many cool things I remember reading on the web like 10-20 years
ago that still exist that are so buried now on Google they might as well not
exist. Nowadays searching any topic seems to always lead you to CNN and
Microsoft and Facebook and other huge corporations. Search results are just
becoming more sanitized and beige and meaningless every day.

~~~
Scoundreller
Heh, I was trying to do research on coronaviruses (of which COVID-19 is one of
many), but Google sanitized the results and only showed me "official" COVID-19
resources and buried the broader coronavirus resources.

[https://www.google.com/search?q=coronavirus](https://www.google.com/search?q=coronavirus)

~~~
stolenmerch
COVID-19 isn't a coronavirus, it's the disease caused by the SARS-CoV-2 virus.

~~~
ganstyles
Giving you the benefit of the doubt, and assuming this isn't just pedantry,
especially since you're getting downvotes (because I assume everyone thinks
this is just pedantic correction) I looked it up.

In the context of "trying to do research on coronaviruses" your comment
appears to be not only correct but an important distinction, rather than the
pedantry it appears to be.

From Wikipedia: "...more lethal varieties [of coronaviruses] can cause SARS,
MERS, and COVID-19."

And...

"Severe acute respiratory syndrome coronavirus 2 [SARS-CoV-2] is the strain of
coronavirus..."

I learned something today!

~~~
teamspirit
Further, CoViD-19 literally stands for: Co[rona] Vi[rus] D[isease] -
[discovered in 20]19 - "D" standing for disease caused by this particular
strain.

To be honest, I was a bit disappointed when I found out, though I admit now
it's a little refreshing to have it be so simply named.

~~~
Theodores
Which can be further abbreviated as C19. I have seen this in personal chats
and wonder how long it will be before it gets into newspaper headlines where
space is at a premium in print editions.

~~~
Wistar
I already see it in publications but more frequently see the variant C-19.

------
sanqui
There is a search engine with this exact goal:
[https://millionshort.com/](https://millionshort.com/).

I haven't had that great results with it myself though.

~~~
smackay
I tried it with "waders", which are either the things that you put on your
feet to go fishing or a category of birds (shorebirds or herons). The results
after going for all the options were still exclusively stores wanting to sell
me the former.

Garbage in, garbage out, I guess. Still, I like the idea of something to
sidestep the SEO. Perhaps with more effort they can make it work, but relying
on Google or any major search engine for the base results is the wrong way to
go.

~~~
glenstein
I tested your search term and had similar experience. I have, however, had
positive experience with other categories, such as philosophy. Searching
Wittgenstein with the top million sites removed, I found some gems: a play, a
Disney character that was a supercomputer named after Wittgenstein in a
direct-to-video movie (which I learned was later part of the inspiration for
Wall-E), Wittgenstein-oriented societies, awards, and general philosophy
references I had never heard of.

I suppose it depends on the category.

------
erikbye
For google you can use this [https://addons.mozilla.org/en-
US/firefox/addon/g-search-filt...](https://addons.mozilla.org/en-
US/firefox/addon/g-search-filter/), just drop in your list of those 500 URLs,
once you've decided on what the top 500 is.

For other engines you can use [https://addons.mozilla.org/en-
US/firefox/addon/greasemonkey/](https://addons.mozilla.org/en-
US/firefox/addon/greasemonkey/) with this script
[https://greasyfork.org/en/scripts/1682-google-hit-hider-
by-d...](https://greasyfork.org/en/scripts/1682-google-hit-hider-by-domain-
search-filter-block-sites)

~~~
atrudeau
For Chrome: [https://chrome.google.com/webstore/detail/google-search-
filt...](https://chrome.google.com/webstore/detail/google-search-
filter/eidhkmnbiahhgbgpjpiimdogfidfikgf)

------
thekyle
There is Million Short which allows you to search without the top 100, 1k,
10k, 100k, or 1m sites. Personally what I'd like to see is a search engine
that only indexes webpages without ads since that should eliminate lots of the
SEOd garbage. It would also be nice to use the text to code ratio to derank JS
heavy sites.

[https://millionshort.com/](https://millionshort.com/)
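
A rough sketch of the text-to-code ratio idea, in Python with only the standard library. The definition used here (visible text characters divided by total HTML characters) and the names `TextRatioParser` and `text_to_code_ratio` are my assumptions for illustration, not an established metric; a real ranking signal would need tuning.

```python
from html.parser import HTMLParser


class TextRatioParser(HTMLParser):
    """Counts visible text characters, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def text_to_code_ratio(html: str) -> float:
    """Visible text characters divided by total document characters."""
    if not html:
        return 0.0
    parser = TextRatioParser()
    parser.feed(html)
    parser.close()
    return parser.text_chars / len(html)


if __name__ == "__main__":
    page = "<html><body><script>track();</script><p>Hello web</p></body></html>"
    print(round(text_to_code_ratio(page), 2))
```

A crawler could derank pages whose ratio falls below some threshold; what that threshold should be is an open question.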

~~~
joshspankit
And blacklist anything with those outbrain blocks. “One weird trick to _____.”

Very surprised where I see those these days, and they always make me run away.

~~~
dorkwood
The fact that news websites happily adopted those is disgusting. They're
literally tricking people into thinking it's news content.

~~~
icedistilled
And they also, for the longest time, strongly turned me off paying for any
news sites. The more they use scummy ads that turn people off, the more they
need ad revenue.

That's a nice negative feedback loop or catch-22

------
tlarkworthy
I made a script on ObservableHQ to surf YouTube pseudo-randomly
[https://observablehq.com/@tomlarkworthy/random-place-on-
yout...](https://observablehq.com/@tomlarkworthy/random-place-on-youtube)

I do a random city + documentary as the search term. It's taken me all over
the world and I've seen some very strange things.

One of my favourites was Aarhus, which had a Danish language rapper
proclaiming he was putting Aarhus on the global map (I have never heard of the
city of Aarhus). [https://youtu.be/WSZxuzgImLo](https://youtu.be/WSZxuzgImLo)
They dis Copenhagen a lot too, lol. You get a more intimate YouTube experience
with the low view videos

But I've also seen amazing religious rituals, and an excellent documentary on
Karachi.

Because it's observable hq you can fork it and figure out your own algorithm
for biasing the random.

------
totemandtoken
Reminds me of this classic pg essay:
[http://www.paulgraham.com/ambitious.html](http://www.paulgraham.com/ambitious.html)

Specifically this quote: "The way to win here is to build the search engine
all the hackers use. A search engine whose users consisted of the top 10,000
hackers and no one else would be in a very powerful position despite its small
size, just as Google was when it was that search engine."

There have been a lot of grumblings about the state of search these days.
Maybe the time is nigh for a new search engine?

~~~
tetris11
I feel that we should go down the adblock hosts list approach, where people
download website lists from individuals they trust who have curated or scraped
links of websites complete with keywords. It's up to the user to refresh
their lists and then perform a search on their website.txt file.

It will be limited, but still quite powerful, similar to the way that we can
pick and choose different host file sources from the web.
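
A minimal sketch of what searching such a list could look like, in Python. The file format (one `url<TAB>keyword,keyword,...` entry per line, `#` for comments) and the function names are assumptions made up for illustration; the idea above doesn't specify a format.

```python
def parse_site_list(text: str) -> dict[str, set[str]]:
    """Parse a curated list: one 'url<TAB>kw1,kw2,...' entry per line."""
    sites: dict[str, set[str]] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        url, _, keywords = line.partition("\t")
        cleaned = {k.strip().lower() for k in keywords.split(",") if k.strip()}
        # Merging lets several downloaded lists mention the same site.
        sites.setdefault(url, set()).update(cleaned)
    return sites


def search(sites: dict[str, set[str]], query: str) -> list[str]:
    """Rank sites by how many query terms appear in their keyword set."""
    terms = set(query.lower().split())
    scored = [(len(terms & keywords), url) for url, keywords in sites.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [url for score, url in ranked if score > 0]


if __name__ == "__main__":
    website_txt = (
        "# curated by someone I trust\n"
        "https://a.example\tbirds,waders,herons\n"
        "https://b.example\tfishing,waders\n"
    )
    print(search(parse_site_list(website_txt), "waders birds"))
```

Lists from several curators can simply be concatenated before parsing, much like combining hosts file sources.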

~~~
rthomas6
Does anyone else remember StumbleUpon? It's not exactly the same as a search
engine, but that worked really well for finding interesting content back in
the day.

------
igammarays
DEVONagent is a highly configurable search utility which can be used to
combine and de-duplicate results from multiple search engines at once, exclude
sites or keywords from a blacklist, follow deep links within search pages, and
perform some filtering logic on the text of results.

Before I knew about DEVONagent I would often just search multiple engines and
sources trying to find something particular (e.g. a particular PDF) or unique
results.

[https://www.devontechnologies.com/apps/devonagent](https://www.devontechnologies.com/apps/devonagent)

~~~
lemonberry
Thank you for the link. This looks really cool. I used DEVONthink years ago.
It seemed like a great piece of software but I didn't have a great use case
for it. Looking forward to checking out DEVONagent.

------
pavelmark
Simply removing Pinterest would be a huge step in the right direction.

~~~
pier25
And quora

~~~
stock_toaster
ye gods, yes.

------
chaos_a
[https://wiby.me/](https://wiby.me/) exists to solve this exact problem. I've
found some pretty neat/odd websites on it in the past.

~~~
generalpass
It is a carefully curated directory, which is problematic.

For example, I submitted Pizza Hut's archived original web page [1], but it
wasn't added.

Even for a search engine exposing niches, updating a directory manually will
likely be too slow, unless the directory is maintaining a single niche (e.g.,
unladen airspeed of every species of swallow), but then we end up with some
insane number of search engines and the problem of how to select one.

[1]
[http://www.pizzahut.com/assets/pizzanet/home.html](http://www.pizzahut.com/assets/pizzanet/home.html)

~~~
kd5bjo
Especially if you’re focussing on evergreen information, there’s no reason why
people can’t have their own personalized crawler and index— I’ve occasionally
thought about rolling my own with a browser extension that lets me add seeds
at the click of a button.

~~~
zxexz
I've been working on something like this for my own use - I'm not a fan of
browser-based history. My home-rolled solution is starting to be good enough
where I can use it to easily find exactly what I'm looking for, assuming I've
previously read it, by both searching the title and URL, as well as the
content on that page (my major gripe with "History" in Chrome and Firefox is
that it doesn't search the page content, and if it did, syncing it would have
major privacy concerns).

The problem I'm running into is that I still have to use major search engines
to find new content, way more than I'd like. I hope to make my local service
available open source once I have 'federated' history search working, so that
we can have a primitive search engine and share with people we trust. Also
need to work out some security issues - it's scary having all the content you
read and see on your home network, protected only by your hackily-patched-
together security.

EDIT: Actually I'd like to elaborate a bit more in case anybody actually reads
this and has any ideas. On the desktop side, it's pretty easy. Initially
started out MITMing my own traffic with a self-signed cert added as a root
cert to all my machines. This only works on my home network, so I did a VPN
thing. This was way too clunky and the security concerns are innumerable. I
ended up biting the bullet and writing a chrome extension which works
wonderfully, except for some slight performance issues.

However, I wish to also archive my phone content - I read just as much on my
phone as my computer. I can do it on Android with the MITM process, but the
same issues as above still apply, and it doesn't work with iOS (at least I
can't find a way).

I'm thinking of taking an open source project, like Firefox/Fennec and
building it in to the app itself. In that case it may make sense to forgo the
browser extension and just roll my own forked browser on every platform, even
iOS. I don't know much about iOS dev though.

------
mikekchar
Here is an idea that I've always wanted to do, but will never have time for: A
curated search engine.

Basically the idea is to have people band together and "recommend" links. You
then do your normal spidering of the websites to create a search engine (or
even just call through to a number of existing search engines). However, the
ranking of the results is based on the weighting of the recommendations.

It's essentially a white list based on your own personal bubble. Of course
this won't work in general because you will always get SEO creeps spamming
recommendations. However, it gives you tools for working around those creeps.
The average person probably won't be able to manage it, but power users
probably will.

By not trying to solve the problem for everybody, it makes it easier to solve
the problem for _some_ people. Or at least that's my thesis :-) I might be
wrong.
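
To make the ranking step concrete, here is a small Python sketch under my own assumptions: recommendations are per-domain, each recommender has a trust weight chosen by the user (an unknown recommender, e.g. an SEO spammer, defaults to zero), and `rerank` is a hypothetical name. It only reorders results that some underlying engine already returned.

```python
from collections import defaultdict
from urllib.parse import urlparse


def rerank(results, recommendations, trust):
    """Reorder result URLs by the trust-weighted recommendations of their domain.

    results:         list of URLs in the underlying engine's order
    recommendations: dict mapping recommender -> set of recommended domains
    trust:           dict mapping recommender -> weight (missing means 0.0)
    """
    domain_weight = defaultdict(float)
    for user, domains in recommendations.items():
        for domain in domains:
            domain_weight[domain] += trust.get(user, 0.0)

    def sort_key(item):
        index, url = item
        host = urlparse(url).hostname or ""
        # Higher weight first; ties keep the engine's original order.
        return (-domain_weight.get(host, 0.0), index)

    return [url for _, url in sorted(enumerate(results), key=sort_key)]


if __name__ == "__main__":
    results = [
        "https://bigstore.example/shoes",
        "https://smallblog.example/shoe-history",
    ]
    recs = {"alice": {"smallblog.example"}, "spammer": {"bigstore.example"}}
    print(rerank(results, recs, trust={"alice": 1.0}))
```

Because the trust table is personal, spammed recommendations only matter if you chose to trust the spammer.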

------
netsectoday
You can boot up your own custom search engine in a few minutes with YaCy (Ya
See!) an open-source, P2P, Dockerized crawler and search engine built on top
of Solr.

[https://yacy.net/](https://yacy.net/)

If you're generous, you can make your index available to other P2P instances.

I wanted to run an API search the other week and was blown away with how
quickly I could prop up my own custom search portal (I didn't want to pay for
API access to other search engines, and YaCy comes with JSON and Solr
endpoints).

I ran it locally to test my crawl filters, then pushed a private instance out
to Digital Ocean to turn up the heat with the crawling. The only issue I had
was the crawler would hit the max memory threshold on long crawls and the
container would restart, but that was fixed by scaling up the box.

~~~
l72
I have my own yacy search engines running internally (non-peered) for similar
reasons. One crawls some key code documentation sites that I need for work,
and another crawls a whole bunch of music blogs.

While I typically still use RSS for reading music blogs, I find having the
search engine is a great way to go back and find something or discover
something new! Every time I find a new blog, I just add it as an index to yacy
to crawl.

I think it'd be great to see people spinning up larger instances that are
highly specialized. For example, maybe a search engine that is dedicated
solely to sci-fi and only crawls high quality boards, personal sites and
blogs, and skips all the spammy, seo-optimized sites.

------
crawlcrawler
I built a search engine for this and other, similar purposes. With Crawl
Crawler you start out by searching the metadata of a Common Crawl ("CC")
crawl. Then you define a subsection of that data collection by designing a
query whose search results include your favorite sites. Then you enrich that
subsection by linking those metadata documents (which come from CC's WAT
repo) to full-text extracts or HTML from CC's WET repo or the WWW. Then you
set it to refresh that section on a recurring basis. Voila! You have created a
search index that includes your preferred sites.
[https://crawlcrawler.com](https://crawlcrawler.com)

~~~
chris_f
This is pretty cool. I always wondered why there wasn't a user interface
search somewhere for the CommonCrawl data.

------
allwynpfr
You should try Million Short. As the name suggests, it takes out the top 100
/ 1k / or a million sites so you're left with those that aren't all that
popular. That seems to be what you're looking for.
[https://millionshort.com/](https://millionshort.com/)

------
nic-waller
My hobby project is [https://random.surf](https://random.surf) (works better
on desktop than mobile).

I share that same desire to visit the web less travelled. I want to discover
interesting sites that deserve to be bookmarked because they will never show
up in a search engine.

~~~
77ko
Love it! Discovered something interesting very quickly. Bookmarked for future
use.

------
dangoljames
There used to be a Java applet embedded in altavista.com's website that could be
run against search results. It would do semantic processing on the results and
present a list of generated terms, each with a checkbox. Checking a box would
pull any returns which contained the topic from the remaining search results.

This was fire. If a topic were being discussed on the web, you could find it
with this tool. Unfortunately, it did not fit the vision of the parasitic
overlords who bred us to produce and consume for their benefit.

~~~
visarga
Altavista itself was a junk search engine though, especially after they sold
out and the new owner stuffed it with ads.

------
dennisy
I think you could get good results if you just penalise sites for the number
of third-party JS scripts they load, which is a proxy for a more established
site/corp.

You could add a bunch of heuristics such as size, number of links etc.

Maybe even train a classifier to select the “smaller” part of the web.
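
A quick Python sketch of the third-party JS heuristic, standard library only. Treating any `<script src>` whose host differs from the page's host as "third party" is a simplification I'm assuming here (it also counts the site's own CDN subdomains), and the names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ScriptCollector(HTMLParser):
    """Collects the src attribute of every external <script> tag."""

    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            src = dict(attrs).get("src")
            if src:
                self.srcs.append(src)


def third_party_script_count(page_url: str, html: str) -> int:
    """Count scripts loaded from a host other than the page's own host."""
    page_host = urlparse(page_url).hostname
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    count = 0
    for src in collector.srcs:
        # Resolve relative and protocol-relative URLs against the page URL.
        host = urlparse(urljoin(page_url, src)).hostname
        if host and host != page_host:
            count += 1
    return count


if __name__ == "__main__":
    html = (
        '<script src="/app.js"></script>'
        '<script src="https://ads.tracker.example/t.js"></script>'
    )
    print(third_party_script_count("https://blog.example/post", html))
```

The count could be one feature among the others mentioned (size, number of links) if you go the classifier route.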

------
inopinatus
I would pay real subscription money for a search engine that focused on
knowledge-oriented results rather than retail and commercial results.

When I type “shoes”, it would give me: links for the functional and creative
history of footwear, the taxonomy of shoes, methods of construction, current
and historical footwear industry data, synonyms and antonyms, related terms
and professions, the dictionary definition, and similar links related to
secondary meanings (such as any protective covering at the base of an object,
horseshoes etc). I’d also hope for a comedy link to a biography of Cordwainer
Smith.

What I actually get, which I don’t want _at all_ : pages and pages of shoe
shopping.

The various means to exclude “top X sites” are the roughest possible heuristic
in that direction, and throw out the baby with the bathwater (for example, a
long-established manufacturer may well have an informational online exhibit)

Google has essentially failed me in its primary mission. Bing at least has the
grace to admit they are here to “connect you to brands”. And sadly, right now,
every other option is an also-ran.

In practice I use DDG, directed by !bangs towards known encyclopaedic or
domain-specific sources. I am certain that I’m missing out.

~~~
atlantique
That sounds like the job of an encyclopedia. Maybe some sort of collaborative
encyclopedia where people can edit pages and add references.

~~~
inopinatus
I know you’re only half joking. In practice many, perhaps most of my DDG
queries do end in !w - but there’s a wealth of information that is relevant,
interesting, and useful, but wouldn’t be considered encyclopaedic, or that is
merely summarised in Wikipedia; in addition, their references are included as
supporting citations, very far from a comprehensive index of currently
accessible information.

------
text_exch
I've long wanted to build a search engine of only personal blogs. I am less
familiar with the field of information retrieval so I haven't gotten started
yet, but it's always been a dream of mine and if anyone is interested please
contact me at threemillionthflower [at] the world's largest email provider.

Discovering unknown parts and blogs on the internet is one of the enduring
goals of a newsletter that I run [1], which provides a single link to an
interesting article every day, usually by lesser-known authors and blogs
across the internet.

[1] www.thinking-about-things.com

~~~
hopesthoughts
I'm now a subscriber.

------
011-video
You are your best search engine !

On a daily basis your brain uses shortcuts to get to the point. Open Firefox
(of course) and press ALT+B. Then add a new bookmark, for instance:

Name : Stack Overflow

Location :
[https://stackoverflow.com/search?q=%s](https://stackoverflow.com/search?q=%s)

Tags :

Keyword : st

Now if you want to search "javascript timer", just type : st javascript timer

Add "%s" to the search URL of each of your favorite websites.

Example : [https://en.wikipedia.org/wiki/%s](https://en.wikipedia.org/wiki/%s)

To discover some new website content, apply the same trick to Hacker news,
Reddit or any RSS River.

Voila, bye bye GG.
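
For anyone curious what the browser is doing with the keyword, here is an approximation in Python. It is a sketch of the substitution only, not Firefox's actual code; the `w` keyword and the `expand` function are made up for the example, and the exact escaping Firefox applies may differ.

```python
from urllib.parse import quote

# keyword -> URL template; "%s" marks where the search terms go
BOOKMARKS = {
    "st": "https://stackoverflow.com/search?q=%s",
    "w": "https://en.wikipedia.org/wiki/%s",  # hypothetical keyword
}


def expand(typed, bookmarks=BOOKMARKS):
    """Turn 'st javascript timer' into the bookmarked search URL."""
    keyword, _, query = typed.partition(" ")
    template = bookmarks.get(keyword)
    if template is None or not query:
        return None  # not a keyword search
    return template.replace("%s", quote(query, safe=""))


if __name__ == "__main__":
    print(expand("st javascript timer"))
```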

------
NateEag
For Google, you can ignore specific sites by adding "-example.com".

See this example of filtering Stack Overflow out of search results:

[https://www.google.com/search?q=loop+over+array+items+in+jav...](https://www.google.com/search?q=loop+over+array+items+in+javascript+-stackoverflow.com)

~~~
1f60c
More specifically, -site:example.com, although there have been reports of
Google breaking this time-honored functionality.

------
brentis
Imagine if search results had table filters and sorting.

Popularity, Relevance, Age, Type, etc. Type could be blog, forum, site, or
video. Or like it used to be.

~~~
busymom0
I have been finding recently that Google has been breaking their existing
"sort" and "time" filters. Try searching for something with a site:reddit.com
prefix, for example, and set the time filter to, let's say, "Last year". Google
still shows you results from 4-5 years ago.

~~~
BostonFern
I also discovered that recently. It's gone the way of the verbatim constraint.

Control is being forfeited to steer users back to more profitable content in
order to capitalize on a captive market.

I wonder if being open about it would be so bad for business, instead of the
attempt to manipulate users into enjoying the ratcheting-up of their
impotence.

Now, Youtube truncates search results and loads the recommendation stream
instead, long before hits are exhausted.

At least it's been a while since Silicon Valley was keeping the mythical
personalized advertising spiel in active circulation.

------
sneeuwpopsneeuw
I personally use Google Chrome with the DuckDuckGo search engine. DuckDuckGo
is not perfect: very in-depth searches (such as gameboy advance memory layout)
only return junk, while Google knows you are searching for a niche. But on your
average search it is as good as Google, sometimes better because it is more
factual and promotes fewer webstores. When it does not give me what I'm
looking for I can add !g anywhere in the question and the same search is done
using Google.

Then I use Violentmonkey, an open source JS/CSS injector, to inject this user
script: [https://greasyfork.org/nl/scripts/1682-google-hit-hider-
by-d...](https://greasyfork.org/nl/scripts/1682-google-hit-hider-by-domain-
search-filter-block-sites) This will block specific domains for you in google,
yahoo, duckduckgo etc. I use this to block domains like Quora, sourceforge,
cnet and softonic.

The nice thing about this script is that you can permaban domains you know are
junk, and they will be removed completely, or you can apply a regular ban to
domains such as commercial websites. With a regular ban the result is not
removed from Google or DuckDuckGo; only its title is shown, in light gray. I'm
currently experimenting with this on some major webstores so I cannot really
say if this may help you, but it can be a good start.

(edit) I saw some people ask why this was not possible before. Google allowed
you to block domains and websites a few years ago, but they removed this
feature. DuckDuckGo never allowed you to do that because that would mean that
you would have a cookie that remembers your preferences, and that is against
their principles.

~~~
1f60c
> I can add !g anywhere in the question

I knew about !bangs, but I didn't know you could put them anywhere in the
query (e.g. "hello !g world" searches Google for "hello world"). This is going
to save me a lot of time on mobile. Thanks!

------
greglindahl
If the question is "Is there a commercially-viable search engine that supports
this feature", then the answer is "probably not".

Implementing this properly involves having your own search index. And that's
pretty expensive.

------
bamboozled
I think on DDG you can do !mil which excludes the first million top ranking
sites.

Edit: Maybe it’s the first million results? I use it to find obscure things
sometimes.

------
mmsimanga
When researching a topic I have had great success searching HN and reading
through the comments. If I want to find alternative software tools for a tool
I am using the comments on HN are best. Searching through subreddits also
yields better results than Google.

------
DavidPiper
It feels like there could be a (partial) meta solution here:

A search engine that returns results whose pages weigh in under a certain
size.

From the comments it seems most of the "cruft" filling up Google results are
newer web apps, generally JS-heavy and advertising-heavy, etc.

If you had a filter for pages with (e.g.) < ABC kb of JS, < XYZ external links
(excluding img tags), I feel like there'd be a good chance that the "old" web
and the "unknown" web would bubble to the top.

There are plenty of false positives (particularly for "small" forums built
with modern JS apps, etc), but it could be one of many filtering tools to
achieve better search results.
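
A sketch of such a filter in Python, standard library only. The thresholds stand in for the ABC and XYZ placeholders above and are arbitrary; only inline JS is measured (externally loaded scripts would have to be fetched to be weighed), and `page_stats` / `is_lightweight` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class PageStats(HTMLParser):
    """Measures inline JS size and counts <a> links to other hosts."""

    def __init__(self, page_host):
        super().__init__()
        self.page_host = page_host
        self.inline_js_bytes = 0
        self.external_links = 0
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
        elif tag == "a":  # <img> tags are deliberately not counted
            host = urlparse(dict(attrs).get("href") or "").hostname
            if host and host != self.page_host:
                self.external_links += 1

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        if self._in_script:
            self.inline_js_bytes += len(data.encode("utf-8"))


def page_stats(page_url, html):
    """Return (inline_js_bytes, external_links) for a page."""
    stats = PageStats(urlparse(page_url).hostname)
    stats.feed(html)
    stats.close()
    return stats.inline_js_bytes, stats.external_links


def is_lightweight(page_url, html, max_js_bytes=20_000, max_external_links=50):
    """True if the page stays under both (arbitrary) thresholds."""
    js_bytes, links = page_stats(page_url, html)
    return js_bytes < max_js_bytes and links < max_external_links
```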

~~~
ngold
Great idea. Google seems to do nothing but remove search options. At least
they still have a time filter. DDG's only goes back a year, I believe.

------
turnipla
Google used to let you blacklist websites many moons ago, that would go a long
way already.

Now there are a few extensions that do that, but obviously they only hide the
results from each page, so sometimes you will see pages with 2 results, if any
at all.

~~~
rozab
Would be easy to just inject a negative site clause into the query, e.g.
`-site:fandom.com`

~~~
dublin
It would be nice if there were a way to make the exclusion list the default
for all your queries. For instance, I never want to see results from WikiHow
again. Ever. Or the New York Times or any of the other paywalled sites...

~~~
kortex
unpinterested is an extension which simply adds -site:pinterest to image
searches. I don't think it'd be hard to do something similar with a custom
list.
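
Something like this, as a rough Python sketch of the query rewriting an extension would do. The function names are made up, and as noted elsewhere in the thread there is presumably a query length limit, so a long custom list may not fit.

```python
from urllib.parse import urlencode


def with_exclusions(query, blocked_domains):
    """Append a -site: clause for every blocked domain not already present."""
    existing = set(query.split())
    clauses = []
    for domain in blocked_domains:
        clause = f"-site:{domain}"
        if clause not in existing and clause not in clauses:
            clauses.append(clause)
    return " ".join([query.strip(), *clauses]).strip()


def search_url(query, blocked_domains):
    """Build a Google search URL with the exclusions applied."""
    params = urlencode({"q": with_exclusions(query, blocked_domains)})
    return "https://www.google.com/search?" + params


if __name__ == "__main__":
    print(search_url("facebook censorship", ["facebook.com"]))
```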

------
methou
I used a Google Custom Search Engine (CSE) to remove results from Softonic and
the like. It works well, but it's still very Google.

~~~
petra
Google CSE is a great idea. Tried it in the past.

But I find the search is at a much lower quality than Google.

------
dexen
There is a similar problem where Youtube's recommendations and auto-play are
mostly big name brands, to the exclusion of individual reporters,
commentators, and other content producers. Since recently, a "De-Mainstream
Youtube" plugin[1] is available for Firefox and Chrome, fixing that to some
extent.

\--

[1] [https://demainstream.com/](https://demainstream.com/)

------
bmd3991
What I’d like to see is a search that excludes any page with ads, and any page
with affiliate links. That alone would get rid of 90% of the garbage

~~~
dddddaviddddd
Sort of a server-side, page-level adblocker.

~~~
bmd3991
It could make for an interesting Firefox extension I think

------
peel40
I think there's a simple google way. Just add `-bigwebsite.com` to your query.

[search term] -google -youtube -facebook ... -top100website and it should
work.

I found a list of the top 1m alexa websites here:

[http://s3.amazonaws.com/alexa-
static/top-1m.csv.zip](http://s3.amazonaws.com/alexa-static/top-1m.csv.zip)

An add-on with that list should do the work.

~~~
abraae
That's pretty clunky.

- there's probably a pretty low limit for size of Google queries, you'll
likely hit it quickly

- you won't be able to search for e.g. a story about YouTube censoring some
content

~~~
peel40
I don't know about the current query size limit but I think it's pretty likely
to get hit quickly, as you correctly pointed out. But it's useful to use the
"-site" operator, e.g. "-site:bigwebsite.com", which excludes just the site and
not every page that mentions the word.

ex:

facebook censorship -site:facebook.com

[https://www.google.com/search?q=facebook+censorship+-site%3A...](https://www.google.com/search?q=facebook+censorship+-site%3Afacebook.com&oq=facebook+censorship+-site%3Afacebook.com&aqs=chrome.0.69i59l2&client=ubuntu&sourceid=chrome&ie=UTF-8)

------
bhartzer
There is a custom search engine called Newgle.xyz that only shows results from
the 1000 or so new gTLDs (new top level domains).

It’s custom google search results, but since it’s excluding .com, .net, .org
etc then you probably won’t see any of the large sites there.

It’s also interesting to see which sites have been built in the last few
years, as the new gTLDs haven’t been around that long.

------
rkagerer
I would like one that punishes sites with too much ad to content ratio.

------
loosetypes
What are folks’ non-commoditized heuristics for finding new things online?

I was intrigued by how dorkweed’s approach has changed over time, as described
in a reply to a sibling comment.

As general search results get watered down and rotten tomato inflation maybe
trends towards reflecting company interests rather than my interest-level,
maybe it’s worth re-evaluating the vetting avenues we take as users.

Here’s mine: for games and shows I’ve recently found myself using quantity of
fan-videos on YouTube as a proxy for quality. So far it’s been a decent means
to find cult followings for something I otherwise wouldn’t necessarily hear
about.

Obviously this approach has its flaws - and is subject to financial
perversions to an extent - but I figure if enough people genuinely want to pay
tribute to a work, it might be worth checking out.

~~~
bluishgreen
How'd you find the quantity of fan videos?

Personal trick: I follow reaction video blogs, and if they are reacting to
something then it is usually worth watching. But reaction blogs are only for
short videos and other short form content.

------
ChrisMarshallNY
Remember sites like stumbleupon?

I find that the YouTube sidebar is useful for me to find interesting music. I
have eclectic tastes, and Google seems to have figured that out. I don't mind.

I suspect that it would be possible to create a custom API query to Google
that would have a "blacklist."

------
smsm42
There's Million Short: [https://millionshort.com/](https://millionshort.com/)

I think they try to do exactly what you ask, but I haven't used them
extensively so I don't know how good they are.

------
abarrettwilsdon
For more queries, you can add modifiers to a Google Search to get the results
you want

Seeing folks mention the NOT operator (-). It's quite powerful! For example,
you can do:

intext:"Powered by intercom" -site:intercom.com will find all the sites that
use the Intercom widget

or ~blog bread baking -inurl:checkout -intext:checkout will find bread blogs
(or similar) without commercial intent

I put together a list of the two dozen or so most useful templates of this,
for folks who are interested: [https://www.alec.fyi/dorking-how-to-find-
anything-on-the-int...](https://www.alec.fyi/dorking-how-to-find-anything-on-
the-internet.html)

------
dhbradshaw
I've wondered too about something similar to that. Basically, I'd like
sessions for searching.

Each session would have an updatable list of sites that are favored,
whitelisted or blacklisted for a particular class of search.

------
maayank
I'm intrigued by actual use-cases for it other than exploring, i.e. where it
would give better results for a query than the common search engines.

Anyone reading this, please post if you find any

~~~
crocodiletears
Big ones:

1. Looking for niche domain or institutional/social knowledge produced by
experts or insiders for an informed audience that isn't necessarily available
in a scientific journal.

Especially with respect to the social sciences and literary analysis, there's
a wealth of intelligent commentators that don't surface well on Google without
very specific search terms, and the willful subtraction of domains like quora,
medium, and tumblr.

They're usually contained on poorly maintained WordPress sites that the author
has long-since forgotten about, or as invalid, handcoded html docs hidden in
the personal subdomains of university professors and students.

2. Finding online communities that aren't a part of Reddit or a similarly
prominent platform

------
chasd00
In the same vein, it would be awesome to search for a product to buy with the
results being ecomm websites owned by people in my area. A way to "shop local"
online.

~~~
technotarek
ATTIC: A visual search and discovery engine to help you find the latest
products from small, unique businesses near you.

[https://attic.city/](https://attic.city/)

Currently for three product tiers (furniture, home decor, and
fashion/clothing) in 14 major US markets, where stores within ~100 miles or a
~2 hour drive are considered as part of the market.

Disclaimer: I'm one of the founders.

~~~
derision
How do you curate the stores?

~~~
technotarek
Aside from the constraints we apply to market/geography and product type? If
that's what you mean, then technically it's a matter of whether the store's
ecom platform is compatible with our indexer, which supports ~20 different
platforms (and hundreds of variations). Otherwise, we do some light curating
for product quality to include, but not limited to, the accuracy of meta data
(titling, description) and image quality.

------
amelius
If only Google allowed us to omit websites from search results.

Google says they need our information to "improve our experience", but we
can't tell them what to omit ...

~~~
rsoto
They used to allow that, it was very useful. But as with almost everything
Google does, they killed it.

------
fedede
Hey! I actually liked this idea and I'm considering starting a learning
project on it. I've seen a lot of interest and ideas in the comments, and
decided to create a very short Google form to start gathering all the
interested people so we can organise something interesting. Is anybody in? :)

[https://forms.gle/5KuTYVdYaMzRD2n78](https://forms.gle/5KuTYVdYaMzRD2n78)

------
jsgo
I don't know that I'd want a search engine to specifically exclude or limit
the results of specific sites of their choosing (even if top 500 as the
example is fairly unbiased), but I think I'd really like the ability to say
"move these specific domains a few pages back. Don't eliminate them outright,
but I have felt dumber having read them previously."
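
The "move them back, don't eliminate them" idea can be sketched as a stable re-rank over a list of result URLs. This is only an illustration: the `demote` function and the sample URLs are invented here, and it pushes demoted domains to the end of the list rather than exactly "a few pages back".

```python
from urllib.parse import urlparse


def demote(results, demoted_domains):
    """Keep every result, but list demoted domains after all the others."""
    def is_demoted(url):
        host = urlparse(url).hostname or ""
        # Match the domain itself and any of its subdomains.
        return any(host == d or host.endswith("." + d) for d in demoted_domains)

    # False sorts before True, and sorted() is stable, so the engine's
    # original ordering is preserved within each group.
    return sorted(results, key=is_demoted)


results = [
    "https://www.w3schools.com/python/pandas",
    "https://pandas.pydata.org/docs/",
    "https://blog.example.org/pandas-notes",
]
for url in demote(results, {"w3schools.com"}):
    print(url)
```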

------
pengstrom
What I want is to filter out commercial results. When I'm searching for a
product I don't want shills, I want real opinions.

~~~
third_I
And then came undisclosed sponsorship and that difference blurred more than
ever...

------
21xhipster
[https://cyber.page](https://cyber.page)

It's kinda new so it excludes kinda everything :-) But you can make it work
better :-)

[https://ipfs.io/ipfs/QmQ1Vong13MDNxixDyUdjniqqEj8sjuNEBYMyhQ...](https://ipfs.io/ipfs/QmQ1Vong13MDNxixDyUdjniqqEj8sjuNEBYMyhQU4gQgq3)

~~~
Aeolun
Can you explain to me what this is (or is meant to be)? It doesn’t appear
to have a search field, though the lightning effects are impressive.

------
freefriedrice
Why exclude the biggest websites?

The problem I see on DDG & Google is having to scroll 5-10 pages of utter SEO
nonsense.

"Do you have a question about ____? Many ask about _______. ____ is a common
question, here the are we some answer. [sic]".

Just utter garbage pages.

It used to be just with recipes or medical questions, but now it feels like
most everything that is a general query.

~~~
dredmorbius
[https://news.ycombinator.com/item?id=22792243](https://news.ycombinator.com/item?id=22792243)

------
piusp
I have used copernik in the past. It was a collection of search engines,
listing more than 140 of them. It combined the search results and sorted them
by the percentage of keywords matched. It also had a lot of built-in tools for
validating the links, copying the selection, sorting and sharing the results.
Simply amazing results.

------
wyck
Google search is so sad these days, all results are media conglomerates, it's
completely counter to the core reason why the internet existed. I really hope
by catering to these mega corps that they are completely undermining their
brand and someone else comes along and pulls the rug out from underneath them.

If anyone noticed during the first couple days of covid, google search was
free from large media results, the algorithm reverted back to how it was years
ago and it was such a breath of fresh air. Of course they fixed the algo
immediately, and it went back to only showing curated media results. There was
an anon google employee who posted why this occurred.

~~~
fermienrico
I think we are gonna see aversion to going public. Companies like Stripe and
SpaceX are gonna stay private for a long time.

When SEC laws, shareholder interest, quarterly performance and stock
volatility come into play, corporations become this mindless soulless monster
that will devour everything in its way and fuck consumers in every which way.

Democratization of funds from central authority to public creates
disincentives and the shareholders don’t give a shit about many auxiliary
things such as environmental concerns. Bottom line always matters.

It’s not just google but any public corporation. Can you imagine SpaceX being
able to operate with the same passion with shareholder interests?

~~~
texasbigdata
It's not, I would argue, those things.

It is, potentially, the compensation plans. If you go to the proxy document
and look at how comp plans are set, they usually hire a consultant, and "best
practices" drivers are cash + big bonus based on typically some TSR (total
shareholder return metric).

So for google, "don't be evil" is what's written down, but for the top execs
"sell ads" is what gets them paid out before they retire. And those senior
level "lifers" are what, 40 now?

Don't really have proof to support these claims though.

~~~
Shared404
Not even sure "don't be evil" is still written down. [1]

[1] [https://gizmodo.com/google-removes-nearly-all-mentions-of-
do...](https://gizmodo.com/google-removes-nearly-all-mentions-of-dont-be-evil-
from-1826153393)

------
pkamb
I'd love a search engine that mainly searched Stack Exchange, (old.)Reddit,
and some subset of blogs or single-author websites.

Especially removing Quora, Pinterest, and aggregation/reposting/SEO/affiliate
blogs.

And all "product" images with a white background. Only show real photographs.

~~~
dredmorbius
Setting specific site restrictions (only one applies at a time, e.g.,
"site:example.com"), or utilising DDG bang search, gets close.

------
Cyclone_
Seems like a browser plugin might be a quick and dirty way of just filtering
results to achieve the goal.

------
social_quotient
Are mainstream search engines kinda like a big market equity ETF or index? There
are a ton of benefits but as a negative they make price discovery difficult
and give monetary allocation to companies that probably shouldn’t have it.

Just a thought experiment, curious what others think.

------
wmnwmn
Maybe what we need is a return to the very beginning, namely human curated web
catalogs, aka Yahoo

------
dluan
I mainly use google as a reddit search engine these days. "tiki cocktail
pineapple juice reddit" gives me way more than google algorithm, and plus it's
kind of like human powered SEO where genuinely useful links will likely have
some discussion.

------
rdtwo
So I figured I’d try a few of these with “Seattle vegetable garden blog” as
the keyword. Either there aren’t a lot of blogs on the topic or most search
engines miss them because results are sparse and they really shouldn’t be.

------
ErikAugust
A curated, searchable web directory might be a concept that could come to be
these days. It would share some of its DNA with the old school web directory
but also share some with a search engine.

------
tokyokawasemi
I sometimes use "inurl:wordpress" when searching for travel info. This ensures
more first-person blog accounts, rather than all the tripadvisor junk that's
at the top.

------
known
[https://twitter.com/search?q=twitter&src=typed_query](https://twitter.com/search?q=twitter&src=typed_query)

------
moreWeed
Man you read my mind, just started thinking about this. From a search
censorship perspective, the BBS's we were building in 93 would be better than
what we have now.

------
Nevada-Smith
Depending on what you're looking for, try Google Scholar [1]

[1] [https://scholar.google.com/](https://scholar.google.com/)

------
blondin
omg yes please.

can google allow us to exclude certain sites? i was surprised to see w3schools
showing up above the official documentation for pandas and numpy. this is simply
ridiculous!!

~~~
jotm
the "-" operator still works. E.g. "weird stuff I found interesting
-reddit.com -youtube.com -wired.com -w3schools.com"

------
badrabbit
It wouldn't be hard to remove such results using a browser extension, but you
will be scrolling a lot. Maybe duckduckgo should support it, feature request?

------
saadalem
Ok here is an additional idea for fun :

A search engine that shows only urls that are not indexed by google / another
one that gives you the websites with lower pagerank

~~~
bmwracer
Not sure it would yield any useful results, but have been thinking about a no-
index search engine for a while to help find obscure or otherwise hidden
information. If one exists or you build one let me know.

------
jungletime
Is there an option to filter out news articles?

"If you don’t read the newspaper you are uninformed; if you do read the
newspaper you are misinformed." Mark Twain

~~~
lihaciudaniel
That's why I watch TV instead.

------
dangoljames
Is there a search engine that actually combines selective search logic with
reductive logic, so that it can be used to actually search topically?

------
corndoge
Similarly, I have always wanted a YouTube search filter for "least views",
since that content is invariably way higher quality

------
thoughtstheseus
I think one of the underlying problems in search is that most search engines
are more like recommendation systems.

------
runawaybottle
Filter google against Alexa rankings?

------
egberts1
What we all really need is a long-tail Bloom filter search engine on the
search engines themselves.

------
coronadisaster
if google would switch back to the engine code that was used right before
they modified the "+" operator for google-plus, it would be a lot better....
ie: bring back the + operator please (the quotes don't work the same way)

------
citizenpaul
RSS used to be this. Google has done its best to kill it.

------
aiisjustanif
RIP StumbleUpon. The randomized search engine I want back.

------
starfallg
The elephant in the room is Baidu.

~~~
dpau
Ask HN: Is there a search engine which includes only what the Chinese
government wants me to see?

~~~
jotm
What are they blocking anyway? Anything important?

There's so many Chinese forums for hardware/firmware hacking/mods, a shame
translators are still very bad...

~~~
cameronbrown
Definitely. What if I want to research information on a certain 1989 student
protest to write a paper about censorship?

~~~
jotm
Look, I know it's important history and all, but looking at it realistically,
it's a tiny part of the information the average person needs. Them removing
Reddit, EdX, Sci-hub or the Hong Kong Companies Registry from search results
would have a much bigger impact.

~~~
starfallg
You very conveniently left out Wikipedia and all of the other sites that are
censored.

------
wojtczyk
That’s what hacker news is for ;)

------
Upvoter33
google: search terms -site:cnn.com -site:wikipedia.org -site:...
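
For anyone who doesn't want to type the exclusions by hand every time, here is a small sketch that builds that kind of query string. The helper names `build_query` and `search_url` are made up for this example; the only things assumed about Google are the `-site:` operator and the `search?q=` URL form already used elsewhere in this thread.

```python
from urllib.parse import urlencode


def build_query(terms, excluded_domains):
    """Append one -site: exclusion per domain to the search terms."""
    exclusions = " ".join(f"-site:{d}" for d in excluded_domains)
    return f"{terms} {exclusions}".strip()


def search_url(query):
    """URL-encode the query into a Google search URL."""
    return "https://www.google.com/search?" + urlencode({"q": query})


query = build_query("seattle vegetable garden blog", ["cnn.com", "wikipedia.org"])
print(query)
print(search_url(query))
```

The exclusion list still has to be maintained by hand, and very long queries may get truncated by the engine, so this only goes so far toward "exclude the top 500".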

------
aiisjustanif
RIP StumbleUpon.

------
martin-adams
I like this question. I’ve often wanted a search engine which gives you the
choice to find sites that don’t contain a paywall, tracking or advertising.

------
graycat
For

> Ask HN: Is there a search engine which excludes the world's biggest
> websites?

> Discovering unknown paths of the web seems almost impossible with google et
> al..

> Are there any earch engines which exclude or at least penalize results from,
> say, top 500 websites?

Let's back up a little and then try for an answer:

Some points:

(1) For some _qualitative exclamation_ , there is a LOT of _content_ on the
Internet.

(2) There are in principle and no doubt so far significantly in practice a LOT
of searches people want to do. The search in the OP is an example.

(3) Much like in an old library card catalog subject index, the most popular
search engines are based heavily on key words and then whatever else, e.g.,
_page rank_ , date, etc.

So: (1) -- (3) represent some challenges so far not very well met: In
particular, we can't expect that the key words, etc. of (3) will do very well
on all or nearly all the searches in (2) for much of the content in (1).

And the search in the OP is an example of a challenge so far not well met.

Moreover, the search in the OP is no doubt just one of many searches with
challenges so far not well met.

Long ago, Dad had a friend who worked at Battelle, and IIRC they did a review
of _information retrieval_ that concluded that keyword search covers only a
fraction, maybe ballpark only 1/3rd, of the need for effective searching. And
the search in the OP is an example of what is not covered because the _library
card catalog_ did not index size of the book or Web site! :-)!

Seeing this situation, my rough, ballpark estimate has been that the currently
popular Internet search engines do well on only about 1/3rd of the content on
the Internet, searches people want to do, and results they want to find.

So, I decided to see what could be done for the other 2/3rds.

I started with some not very well known or appreciated advanced pure math; it
looks like useless, generalized abstract nonsense, but if you calm down, stare
at it, think about it, ..., you can see a path to a solution. Although I never
thought about the search in the OP until now, in principle the solution should
work also for that search. Or, the math is a bit _abstract_ and _general_
which can translate in practice to doing well on something as varied as the
2/3rds.

Then for the computing, I did some original applied math research.

Using TeX, I wrote it all up with theorems and proofs.

So, the project is to be a Web site. While in my career I've been programming
for decades, this was my first Web site. I selected Windows and .NET, and
typed in 100,000 lines of text with 24,000 statements in Visual Basic .NET
(apparently equivalent in semantics to C# but with _syntactic sugar_ I
prefer).

The software appears to run as intended and well enough for significant
production.

I was slowed down by one interruption after another, none related to the work.

But, roughly, ballpark, the Web site should be good, or by a lot the best so
far, for the 2/3rds and in particular for the search in the OP.

So, for

> Ask HN: Is there a search engine which excludes the world's biggest
> websites?

there's one coded and running and on the way to going live!

I intend to announce an alpha test here at HN.

~~~
defen
Can you talk at a high level about the problems of keyword search, or is that
part of your secret sauce? Off the top of my head I can think of two, which
are _intent_ and _encoding_.

Before you can even do a keyword search, you obviously need an intent to do
so. But that means keyword search is pretty useless when you don't know what
you don't know.

Encoding that intent...maybe doesn't matter for common searches, but everyone
has heard of the concept of "Google-Fu". English text is a pretty lossy medium
compared to the thoughts in people's heads...Shannon calculated 2.62 bits per
English letter, so the space of possibly-relevant sites for almost any keyword
is absolutely enormous (e.g. there are about 330,000 7-letter English keyword
searches...distributed across how many trillions of pages, not even counting
"deep web" dynamically generated ones?). So we punt on that and use the
concept of relevance for sorting results, and in practice no one looks beyond
the first 10. I don't know what an alternate encoding might look like though
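
The 330,000 figure appears to come from treating a 7-letter query as 7 × 2.62 bits of information and counting 2^bits equally likely queries. A quick check of that arithmetic, taking the 2.62 bits/letter value from the comment above as given:

```python
BITS_PER_LETTER = 2.62  # figure quoted in the comment above
LETTERS = 7

bits = BITS_PER_LETTER * LETTERS   # total information in a 7-letter query
distinct_queries = 2 ** bits       # number of equally likely queries

print(f"{bits:.2f} bits -> about {distinct_queries:,.0f} distinct queries")
```

That lands in the low 330,000s, consistent with the estimate.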

~~~
graycat
Good questions.

For

> Before you can even do a keyword search, you obviously need an intent to do
> so. But that means keyword search is pretty useless when you don't know what
> you don't know.

Right: The way at times in the past I have put something like that is to say
that, ballpark, to oversimplify some, keyword search requires the user to know
what content they want, know that it exists, and have keywords/phrases that
accurately characterize that content. For some searches, e.g., the famous
movie line

"I don't have to show you no stinking badges",

[https://www.youtube.com/watch?v=VqomZQMZQCQ](https://www.youtube.com/watch?v=VqomZQMZQCQ)

that is fine; otherwise it asks too much of the user.

For "encoding", my work does not use keywords or any natural language for
anything.

My work does get some new data for each search for each user. But privacy is
relatively good because for the results I use only what the user gives for
that search; in particular, two users giving the same inputs at essentially
the same time will get the same results. Thus, search results are independent
of the user's IP address or browser _agent string_. Moreover, the site makes
no use of cookies.

The role of the advanced pure math is to say that the data I get and the
processing I do with that data and what is in the database should yield good
results for the 2/3rds. The role of my original applied math is to make the
computations many times faster -- they would be too slow otherwise.

When keywords work well, and they work well enough to be revolutionary for the
world, my work is, except for some small fraction of cases, not better. So,
there is ballpark the 1/3rd where keywords work well. Then there is the
ballpark, guesstimate, 2/3rds I'm going for.

My work is not as easy for the users as picking a great, very accurate, result
from the top dozen presented by a keyword search engine, e.g., the movie line
example, but is much easier to use than flipping through 50 pages of search
results and is intended usually to give good results unreasonable to get from
a keyword search, without "characterizing" keywords, that yields, say,
millions of search results and would require a user to flip through dozens of
pages of search results.

Ads are off on the right side and not _embedded_ in the search results. The
SEO (search engine optimization) people will have a tough time influencing the
search results!

We will see how well users like it. If people like it, then it will be good to
make progress on the huge, usually neglected, content of the 2/3rds.

------
notaphilosopher
I'd like:

\- health search that excludes sellers, wellness and snake-oil websites

\- news search that excludes conspiracy theories, magical thinking, political
operatives, and paid bloggers

\- image search by similarity, similarity to an uploaded picture/s, words, or
description

\- media and warez search engine that excludes link-spam and malware sites

\- complex queries search because none of them do it well

\- anonymity

\- shopping search that kicks out disreputable sellers and phony store-fronts

\- mapping like OSM but fast, practical with an app, and detail-accessible

\- monetize using affiliate links that don't affect ranking

\- semi-curated results (domain reputation-ranked voting)

\- related pages

\- inbound/outbound links search

\- archive.org integration &| history page caching

\- documented query syntax

\- query within results

\- quick query history results navigation

\- keyword alerts

\- keyboard shortcuts that always work

------
burmer
DuckDuckGo /s

