Hacker News
A look at search engines with their own indexes (2021) (seirdy.one)
271 points by tintedfireglass on June 21, 2022 | 114 comments



I think a big part of why Google has such staying power is that, on the one hand, they actually are pretty good at some things (even though they visibly flounder at others), while on the other, web browsers are heavily designed around leveraging the particular tasks Google is really good at.

This manifests in turning the URL bar into a search bar by default, making URLs difficult to edit manually (especially on mobile), making bookmarks inaccessible and hard to manage by requiring multiple clicks to reach, and having an interface that makes it easy to accidentally bookmark the wrong websites. I don't believe there's some big conspiracy where Google has orchestrated this; it's more that every web browser tries to mimic Chrome without realizing the problematic knock-on effects that has.

I really sort of wish there were good alternative web browsers that weren't ostensibly waging war on user agency, or at least gave more than token customization abilities.


Yes. These days most people think that GOOGLE=INTERNET: they look on Google for the nearest hotel or ask Google for today's weather. Google has basically become so big that, for most people, it is the manifestation of the internet.

This reliance on web search as a gateway to the internet also started with Google. In the days of IE we had separate URL bars and search bars, toolbars and bookmark bars were the norm, and there were many different ways people obtained their news, talked to friends, and generally consumed content on the internet.

But Google basically turned the browser into a funnel that sends all internet users to https://google.com, and since Chrome has the largest market share on both desktop and mobile (thanks to Android), most browsers adopted the search-engine-as-front-page route, leading to the position we are in.

Today the average person visits a website in only two ways:

1. Auto-fill on their URL... sorry, search bar

2. Typing the name of the website into the Google search engine.

I have no clue why the feds don't care about big tech monopolies the way Microsoft was attacked back then for IE, which is basically child's play compared to what companies (even Microsoft with Edge) are doing today.


> I have no clue why the feds don't care about big tech monopolies the way Microsoft was attacked back then

You have to give some credit to the feds: tapping into half a dozen backends is obviously easier than having to tap into a few thousand.


The catchphrase of Francis Urquhart springs to mind.


> basically turned the browser into a funnel that sends all internet users to https://google.com

Exactly right and completely intentional. I saw they also pay Apple $15bn to be the default search provider on iOS [0]. Annually.

[0] https://www.macrumors.com/2021/08/27/google-could-pay-apple-...


> I have no clue why the feds don't care about big tech monopolies the way Microsoft was attacked back then for IE, which is basically child's play compared to what companies (even Microsoft with Edge) are doing today.

People forget just how big Microsoft was at the time (big fish, small pond): it held some obscenely high percentage of "personal computer desktops", and even then it took something like four or five years to come to trial. This was before Android, before smartphones at all; the Mac was on life support and nearly dead, and Linux was a joke as a desktop, which is where the "Year of Linux" meme came from.

Google is big but nowhere near as "dominant" and they've been careful to mitigate so as to fly under the radar. It also helps that basically nobody is trying to sell software in the areas they "compete" in so the "harm" is harder to argue.

I wonder whether Apple has a skunkworks project working on search, just like they did for the x86 and M1 transitions; it's one of the large areas where they still directly depend on another company.


> Google basically turned the browser into a funnel that sends all internet users to https://google.com

It's two multi-layer funnels [0]:

- Android device > Android OS > Chrome > Google search/ads

- Apple device > macOS/iOS > Safari/Chrome > Google search/ads

> These days most people think that GOOGLE=INTERNET

Definitely not the case in some countries. I am confident that there are a good number of failing democracies where you could easily conclude that FACEBOOK=INTERNET. I have deep knowledge of one where this is very apparent, to those in that country who stop to think and care.

[0] below https://news.ycombinator.com/item?id=31822014


Has anyone ever done experiments on alternative URL input fields? We continue to treat URLs as arbitrary strings, although they are composed of very well defined components, say for `https://foo:bar@news.ycombinator.com/item?id=31820149#31821636`:

  - Transfer protocol (https://)
  - Optional credentials (foo@ or foo:bar@)
  - Mandatory hostname (news.ycombinator.com)
  - Optional path name (/item)
  - Optional parameters as key-value pairs (id=31820149)
  - Optional document anchor (#31821636)
This is easily imaginable as a form, even if that would obviously be inconvenient. It feels to me like there is a middle ground between "enter every URL part into a separate input box" and "let users handle an arbitrary serialisation format completely on their own".
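
As a rough illustration of how well defined these components already are, Python's standard `urllib.parse` can pull the example URL apart today (a sketch of the decomposition only, not a proposal for the input field itself):

```python
from urllib.parse import urlsplit, parse_qs

url = "https://foo:bar@news.ycombinator.com/item?id=31820149#31821636"
parts = urlsplit(url)

print(parts.scheme)                    # https
print(parts.username, parts.password)  # foo bar
print(parts.hostname)                  # news.ycombinator.com
print(parts.path)                      # /item
print(parse_qs(parts.query))           # {'id': ['31820149']}
print(parts.fragment)                  # 31821636
```

A structured input field would essentially be a form bound to these six accessors.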


The closest I had was Vimperator (RIP), which had `gu` for "Go Up" (go to the page referenced by a <link rel=up>, or failing that chop the last path component off the URL), and <C-a>/<C-x>, which would follow rel=prev/rel=next or increment/decrement the last number in the URL.


> This is manifested by turning the URL bar into a search bar by default

I'm so far gone at this point that I can't even imagine an alternative.

I use search for basically everything - even just going to websites that I go to frequently.

I have maybe 10 - 20 sites I visit everyday, followed by technical questions that I Google, followed by general questions that I Google (with "reddit" appended).

I sometimes browse the web (clicking on link after link), but it takes some effort to find a piece of yarn to follow that doesn't end up in one of the walled gardens.


I mean, they could probably start by having actually useful keyboard shortcuts, instead of two pairs of shortcuts that do exactly the same thing (`Ctrl+L` and `F6`; `Ctrl+K` and `Ctrl+E`), with all four doing almost exactly the same thing: the first two focus the omnibar, the latter two focus the omnibar and set it to search mode, which is equivalent to focusing the omnibar and typing "?" then pressing `Tab`, or typing "<default search engine alias>" and pressing `Tab`.

At least from my own usage, I can identify many common usage patterns that would benefit from shortcuts that focus the omnibar and set it to some special mode, just like the VSCode command palette does (e.g. `Ctrl+Shift+P`, `Ctrl+P`, `Ctrl+Shift+O`, etc.). In fact, I don't know how a WebExtension-enabled VSCode-based browser hasn't popped up yet.


I think this paragraph on the difficulty of building good independent indexes should not be overlooked. What's going on with Cloudflare?

> When talking to search engine founders, I found that the biggest obstacle to growing an index is getting blocked by sites. Cloudflare is one of the worst offenders. Too many sites block perfectly well-behaved crawlers, only allowing major players like Googlebot, BingBot, and TwitterBot; this cements the current duopoly over English search and is harmful to the health of the Web as a whole.


CloudFlare isn't that bad in my experience. They were really aggressively blocking me when I started out, but there are some hoops[1] you can jump through to make them recognize your bot. Goes a long way.

It does depend on the sites' settings though. Some are set to block all bots, and then you're kinda out of luck.

In general, I've found that like 99% of the problems you might encounter running a bot can be solved by just finding the right person and sending them an email explaining your situation. In almost all cases, they'll let you through.

[1] https://blog.cloudflare.com/friendly-bots/


That's good to know -- thanks!


> it's probably more like an effort for every web browser to mimic Chrome without realizing why that has problematic knock-on effects.

I agree with your points in the first sentence here (URL/search bar, etc.). However, this is by design. It protects Google search with a moat, which in turn protects the lucrative Apple-Google search deal. Microsoft (with Edge and Bing) completes the three-party monopoly in OS-browsers-search [0].

Yes, we need good alternative browsers and browser innovation. There are plenty around, but none of them apart from Firefox (and forks) support search diversity, as they do with the search box [1].

Our own search engine app has a way of supporting a similar one-click search choice: Search Choices [2]. Users love it. Why be bound into the one-search-to-rule-them-all paradigm?

But watch out, browsers are being circumvented. The moat now being built is around how search links with the operating system and voice assistants. How do these three companies play? And do they play fair? Who cares you may say. As Tim Wu has said [3] "it is not a crime to be a monopolist; it is a crime to abuse your monopoly power." How will he view how they play together?

Microsoft ignores your choice of search engine in Edge for searches from Windows [4]. If you have chosen Ecosia, or set up Ecosia as your search preference in Edge, that choice will be ignored and Bing will be used. Your choice is overridden.

Apple does something similar with Spotlight [5]. Suppose you chose Mojeek or Ecosia as your search preference in Chrome, and Chrome is your default browser on your Mac or iPhone. If you have Google as your search preference in Safari (and remember that is the default, which Google pays $billions for), then the Spotlight-directed search will be done with Google search in Chrome, not Mojeek or Ecosia. Your choice is overridden.

Google have Android. What can you do with the Search widget?

[0] https://blog.mojeek.com/2022/05/gatekeepers-of-the-western-w...

[1] https://blog.mojeek.com/2020/12/popping-filter-bubbles-in-fi...

[2] https://blog.mojeek.com/2022/02/search-choices-enable-freedo...

[3] https://newrepublic.com/article/111650/why-does-everyone-thi...

[4] https://blog.mojeek.com/2021/11/how-microsoft-sucks-people-i...

[5] https://twitter.com/ColinHayhurst/status/1533854896633544704


That. And it's not just the search: HTTPS-everywhere, HTTP/2, QUIC, HTTP/3... all that is by design, and it molds the internet to favor big cloud providers. The all too familiar embrace-and-extend, basically.


Google and Bing have taken to using state-style psychological warfare on their users, through the results they deliver and the adverts that get delivered on various websites, including YouTube.

It's a legal form of harassment and intimidation, imo, but I think some people already know this, given the big emphasis on mental health being pushed by the media and various govts.


I think 'general purpose' search engines are often doomed to fail, and this extends to 'web search engines' (which, in reality, do not search the full web and are NOT general purpose.)

This article makes a great list of attempts at general-purpose search engines (I would include DDG and Bing), which ultimately fail and are not what the masses want.

There's also precedent for how only domain-specific search engines are valuable / make sense these days: Google itself is shifting to ML-based answers (attempting to build the holy grail of general-purpose search engines), and whatever you do isn't going to beat Google at that - even Google hasn't been successful at it yet.

Examples of real-world domain-specific search engines today:

* Image, YouTube, Maps search

* Amazon search

* Code search (Sourcegraph, cs.github.com, cs.opensource.google, etc.)

* Facebook, Twitter, Reddit, TikTok, LinkedIn search

* Documents search (iCloud, Drive, etc.)

* Messenger search (Discord, Slack, Messenger, etc.)

It's really telling how Google has largely only achieved general-purpose search in its own domains of data (a Google search turns up YouTube videos, Maps locations), but isn't even complete in that area (Drive, images, etc. don't show up in Google results).

General-purpose search is simply non-viable these days due to (a) data silos and (b) you need to have domain specific search to provide an edge (e.g. regex for code search, or drag-n-drop an image for image similarity search)


So, I'm an electrical engineer (these days). I work with parts. Parts have part numbers. I need information about parts. Some of my parts are weird and rareish.

It is shocking how many searches I do which are well described by "show me every single thing on the entire Internet which matches this string". (I'd love to match a simple regex, since part numbers have stupid trivial variations, but I'll take what's actually feasible.) Or, I often want to match only PDFs, but that hurdle seems well solved these days. There aren't that many results for a few of these things (and those are often the ones I need information about the most), so I sometimes really do want to see every result.
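
For what it's worth, the matching I actually want is trivial to express locally; only doing it at web scale is hard. A toy Python sketch (the part numbers and documents here are made up, like elsewhere in this thread):

```python
import re

# Hypothetical corpus: filename -> text snippet.
docs = {
    "vendor_a.pdf": "Datasheet for the ADS124S06IPBSR (reel packaging option).",
    "vendor_b.pdf": "Errata apply to the ADS124S08IPBSR as well.",
    "forum.html":   "Anyone used the ADS124S06? The suffixes vary.",
}

# One regex covers the stupid trivial variations: base part number
# plus optional packaging/reel suffix letters.
pattern = re.compile(r"ADS124S0[68](?:IPBS?R?)?")

for name, text in docs.items():
    for m in pattern.finditer(text):
        print(name, m.group())
```

"Show me every document matching this pattern" is exactly the kind of query only a full-text, full-web index can answer.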

I don't see any other way to do that than a full-text, full-web index. I just don't. And my job gets a lot harder if that class of search isn't possible anymore. So I'm desperately hoping these indices remain at least somewhat viable.

And yes, I am noticing that Google is lovingly correcting my long part numbers that do actually match a fair bit of results to crap that is useless. Fuck that bullshit, Google.


Google is chasing NLP (natural language processing), which causes problems like this. I feel we should learn to use search engines, rather than have search engines try to guess what we want to find using some weird black-box AI.


This is exactly the problem of "general purpose" search engines as mentioned by GP. In classic information retrieval theory, a search query is a textual representation of a user's mental search intent. A search engine's task is to interpret that textual representation in order to serve the user's need.

Basic NLP starts with simple techniques like stemming, synonyms, typo correction etc. But many users expect more, e.g. contextual disambiguation or broader synonym/similarity resolution. These users would see a search engine as dysfunctional if it could not even find pages in which a plural form or a spelling variation of a query term occurs.
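
To make "basic NLP" concrete, here is a deliberately naive suffix-stripping stemmer in Python (a toy, nothing like the real Porter/Snowball algorithms production engines use), showing how plural and inflected forms collapse to one index term:

```python
def naive_stem(word: str) -> str:
    """Toy stemmer: strip a few common English suffixes.
    Real engines use Porter/Snowball stemming or lemmatization."""
    word = word.lower()
    for suffix, repl in (("ing", ""), ("ies", "y"), ("s", "")):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            return stem + repl
    return word

# Query and document variants map to the same term:
print(naive_stem("Engines"), naive_stem("engine"))    # engine engine
print(naive_stem("queries"), naive_stem("query"))     # query query
print(naive_stem("searching"), naive_stem("search"))  # search search
```

The contradiction described below shows up as soon as a query term is itself a specific string (like a part number) that this kind of normalization mangles.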

However, there is an inherent contradiction to those users who want to find a specific string. Google, Bing and others apply complex algorithms ("weird Black Box AI") in order to interpret the user intent and the meaning of web pages in order to optimize towards the majority of internet users. To some degree, they do try to also serve users of (large enough) minority use cases in which the user feels like they knew exactly what words (strings) they were looking for. But again, the different use cases pose contradictory interests that are probably not resolvable in a "general purpose" approach.


I follow that, but I still have to ask: why does it often ignore the quote marks it suggests to me? It's supposed to search for that exact string, but it never does, and shows the same rubbish regardless. The quote marks should resolve the contradiction between synonym searching and specific strings, but for some bizarre reason they don't.


Because their products are a complete fucking MESS.

Quotes used to be an exact search, but it now overrides that basically everywhere and continues to search for what it thinks you mean (and has for like a decade now). Even though the "Missing: ___ ‎| Must include: ___" popup just wraps that word in quotes in the search. Which is infuriating, since they know damn well adding quotes won't actually search for that word.

In reality, you need to click "tools" -> "all results" and change it to "verbatim" which does at least still appear to work... for now.


Today I learned... Thank you!


> pose contradictory interests that are probably not resolvable in a "general purpose" approach.

How are they not resolvable? If Google respected quotation marks (as it used to do, in the old times) that would be enough to accommodate both kinds of users (or rather both kind of search needs - I'm the same person and do both).


Came here to say this. Quotation marks were hugely useful.


>Basic NLP starts with simple techniques like stemming, synonyms, typo correction etc. But many users expect more, e.g. contextual disambiguation or broader synonym/similarity resolution. These users would see a search engine as dysfunctional if it could not even find pages in which a plural form or a spelling variation of a query term occurs.

Any morphological variation would already be covered with stemming alone, so this particular point doesn’t speak in favor of more advanced interpretations, does it?

Having a path which uses advanced AI-powered tools doesn't necessarily imply one should drop the ability to search with a more technical query interface for advanced users. And it's doubtful that big companies wouldn't have the budget to maintain even two completely distinct tools, were it required.


I have had this problem for years in modern user interfaces. Most of the times, when a system is trying to guess what I want to do or find, it guesses wrong.


Indeed. Search got along just fine before NLP. You should at least have the option of a "raw" text search.


I'm guessing the following: indexing a long tail of infrequently used keywords (such as part numbers) incurs a lot of storage cost, with little if any financial return.
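
A toy inverted index makes that trade-off visible: a rare token (like a part number) needs its own posting list, built with the same machinery as a popular one but with almost no queries to pay for it. A sketch in Python, obviously nothing like production index structures:

```python
from collections import defaultdict

docs = {
    1: "how to cook pasta",
    2: "pasta recipes for beginners",
    3: "ADS124S06IPBSR datasheet errata",  # long-tail part number
}

# Inverted index: token -> set of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():
        index[token].add(doc_id)

print(sorted(index["pasta"]))           # [1, 2] - popular token
print(sorted(index["ads124s06ipbsr"]))  # [3]    - rare token, same per-entry cost
```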


Full text search on the entire web index is not feasible in terms of latency and compute cost


Is it feasible to roughly categorize websites (or parts of websites, e.g. subreddits) and then perform full text search on them only, or at least establish priorities? For instance, it's extremely unlikely OP would find anything useful about part numbers on *.reddit.com/r/funny.

Does anything like that exist today in usable form?


Sure -- you can restrict the search to websites that seem to be relevant to the query. Maybe you do it by hand at first, then you switch over to an ML model. You optimize the ML model for a few years, adding in more signals, and allowing it to guide other parts of the query to keep costs in check and raise quality (as measured by your preferred metrics)... and now you have reinvented fuzzy search that behaves in unintuitive ways, and people on HN complain that your service was better when it did raw string matching.


It's interesting you ask that, because the now-paid Kagi Search has a feature like this called Lenses: https://blog.kagi.com/kagi-features#:~:text=and%20Google.-,L...

I've found them somewhat helpful in getting rid of some of the garbage that pops up when I'm searching for a specific topic. But often, I find myself leaving them off since I sometimes want all the results on anything even slightly related because it might be useful.


This looks awesome. Unfortunately the pricing is like an order of magnitude higher than what my broke ass can pay for, but I'm truly rooting for them.


> actually match a fair bit of results to crap

Do you use advanced search features? Like quote things that MUST match. ext:pdf etc.

You can have this interface to help you: https://www.google.com/advanced_search
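
If you drive these from the address bar rather than the form, the operators just become part of the `q` parameter. A small sketch (`filetype:` is the commonly documented operator; treat the exact operator set as an assumption, since Google changes it over time):

```python
from urllib.parse import urlencode

# Exact-phrase quoting plus a filetype restriction, e.g. for datasheets.
query = '"ADS124S06IPBSR" filetype:pdf'
url = "https://www.google.com/search?" + urlencode({"q": query})
print(url)  # https://www.google.com/search?q=%22ADS124S06IPBSR%22+filetype%3Apdf
```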


I mean, I'm able to get to where I'm going, between Google and other engines.

It's just annoying as fuck to see "Did you mean 'apple orchards in Siberia'?" for querying "ADS124S06IPBSR". Because that shit IS NOT the same.

(Note for the pedantic: I made up that example because I can't remember a real one. But it gives you an idea.)


This is what I like so much about you.com. They show results from specific, popular sites/apps (I listed some examples below), plus allow you to set preferred sources so that sites/apps you find useful (for me it's Reddit) appear higher or on top of your search results. I find it especially nice for coding because sites like GitHub and StackOverflow are supported.

https://you.com/search?q=python+pandas+concatenate+two+dataf...

https://you.com/search?q=How+to+care+for+orchids

some supported sites:

- GitHub
- Reddit
- StackOverflow
- Arxiv.org
- TikTok
- LinkedIn
- W3 Schools
- Twitter


Is this the one where you had to install an extension first to search? And the CTO/owner had a fight with everyone on HN?


link or didn't happen


I do a lot of those specific searches with bangs through DuckDuckGo.

I would love a general purpose search engine. Most websites / web applications have bad search, and behave inconsistently. So I use site specific searches through DuckDuckGo.


The issue is that some of these domain-specific search engines kinda suck on their own, for example reddit search is notoriously bad. This is often overcome by searching what you want in a general-purpose search engine and appending "reddit" to your search query.


> whatever you do isn't going to beat Google at that - even Google hasn't been successful at it yet

Why not? This seems exactly like the sort of situation where an upstart could be very successful: an industry transitioning to a new approach.


Yay, another person who keeps an eye on the landscape of search engines and tries different ones. I thought I was the only one with this hobby.

I made this page to be able to use a different search engine for every search I do:

https://www.gnod.com/search

(Click on "more engines" to see the full list and choose the ones for which buttons are displayed. But you can also search right from the list by first entering a search query and then clicking on an entry in the list.)


How is this different from SearX (https://en.wikipedia.org/wiki/Searx)? Wouldn't hosting your own SearX instance be an easier thing to do? Curious. :)


With SearX I would have to maintain a Python parser for each engine:

https://github.com/searx/searx/tree/master/searx/engines

Most of which are like 100 lines long:

https://github.com/searx/searx/blob/master/searx/engines/wik...

And probably often break:

https://github.com/searx/searx/commits/master/searx/engines/...

While the way I do it, all I have to do is maintain a single link per engine:

https://en.wikipedia.org/w/index.php?fulltext=1&search=hello...

So far, none of the links ever broke.
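
In other words, each engine is just a URL template with the query substituted in. A minimal sketch of the idea in Python (the Wikipedia template is reconstructed from the truncated link above; the Marginalia one is illustrative):

```python
from urllib.parse import quote_plus

# Engine name -> URL template; {q} is replaced by the encoded query.
ENGINES = {
    "wikipedia": "https://en.wikipedia.org/w/index.php?fulltext=1&search={q}",
    "marginalia": "https://search.marginalia.nu/search?query={q}",
}

def search_url(engine: str, query: str) -> str:
    return ENGINES[engine].format(q=quote_plus(query))

print(search_url("wikipedia", "hello world"))
# https://en.wikipedia.org/w/index.php?fulltext=1&search=hello+world
```

Nothing to parse per engine, so the only way a link "breaks" is if the site changes its query URL.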


Interesting. Thanks for the insights :)


Suggestions for additions:

Kagi.com, Neeva.com


Do you use these?

I tried Kagi and liked the design. Having to log into a search engine to use it is a no-go for me, though. But if there is "demand" for it, I can surely add it. So far, I think I am the only user of Gnod Search.

Neeva does not let me search, no matter what: "We'll get in touch as soon as Neeva is available in your region". Maybe they don't want to deal with users in Europe?

I think for a list of search engines to try, it is best to stick with the open ones. Otherwise, the experience will become cumbersome if every other engine you want to try tells you that you cannot use it for some reason.


I’m a former DDG user (maybe for 2 years?) who switched to Kagi in December (and have paid for my first month already). I preferred DDG over Google (for most searches, Google was better at rare searches; in addition, DDG got worse over time), but prefer Kagi over both. I almost always get what I need, and the personalization features (specifically ranking sites higher/lower linked to your account) are great. Really need to find the time to play around with lenses, but I have heard good things ;)


Same here! Love Kagi!


I switched to Kagi recently (and became a paid user!). Somehow DDG never really clicked for me, I find results on Kagi better.


Many people switched to paid Kagi and swear by it. I wasn't impressed at all so didn't pay, then I ran out of free Kagi searches in a few days and switched back to DDG+Google.


What left you unimpressed?


You can actually use Neeva without an account by setting your browser's search query to 'neeva.com/search?q=%s'. I use Neeva without an account and not in the US, and it works just fine.


They are actually listed on the page, ctrl-F for them. They're listed under "mixed index".


I just want to thank you and the author for these works.

I'm glad you're watching search and surfacing the different efforts in this space!


This is very handy, thanks


The realization I came to recently is that the financial incentive of the page view is the corrupting force that ruins search.

The content producer is incentivized to make content as quickly as possible, with as little effort as possible, and to cover it with as many ads as possible.

On the other end, it's in GBY's interest to be slightly confused: by not nailing it, the user engages with the search using different words, potentially hitting an adword and clicking. GBY could be convincing the user that they're just using it wrong and need to execute more searches.

So the financial cost/reward model is configured in a way that gives high margin to spammy trash covered in irritating obstructive ads and search engines that are mysteriously confused by what we ask for. And that's the modern web.

Finally we are taught to tremble in fear if we dare consider touching the sacredness of The Market. So here we are stuck with trash technology to search avalanches of garbage.*

Luckily there appears to be a few search engines that focus on sites without advertising and trackers and they seem to produce nice results

---

* occasionally people like to claim this is just freedom or something so let me clear that field now: being forced to focus on maximizing profit isn't a free society. A free society would permit various missions to be pursued without fear of being squeezed out of existence by those focused exclusively on profit maximization. If priorities are imposed by a p&l tyranny, that's not freedom


>The content producer is incentivized to make content as quickly as possible in as little effort as possible and cover it with as many ads as possible.

It all depends on the author/producer's resources and goals. If resource acquisition doesn't enter into the equation for a specific work, there is no reason to expect the advertisement industry to influence it directly in such a way, is there?


I don't understand. Can you give some examples?

Let me make the theoretical claim clear first: Regardless of the sentiments of a creator, so long as there's SEO pumpers creating similar content, the creator will be crowded out unless they also play that game.

This theory explains why given say a new scientific finding, often a content factory's misreading of the results as opposed to the official statement from the research institute is what gets passed around.

Essentially there's a requisite to engage with "the game" to capture exposure and connect with those who you wish to communicate with which is reliant on increasing metrics instead of merit.

I've called it the "fast-fooding of content" in other writings on the topic.

Please engage with me if you have counterexamples, I'm not married to any particular idea.


I'm seeing ads now for "create content faster with GPT-3". So expect quality to go down even further, to meet the demands of capturing eyeballs as cheaply as possible.

I've also started seeing review websites that are clearly ai generated these days.


What are these search engines that focus on sites without advertising and trackers?


Try https://teclis.com and https://search.marginalia.nu

My test queries are things like "souffle recipe" or "worst celebrity hairstyles" - if it can survive that it can survive anything


There is also Kagi[1]

[1] https://kagi.com/faq#censoring


https://wiby.me is a good one. I actually found this blog on wiby


Alas my poor entry into the ring https://bonzamate.com.au is missing. It is Australia specific though which might explain that. It is running its own index though, so might be interesting to some.



Sort of have unfortunate similarities to BonziBuddy with that name.


Never even considered that. Although as an Australian it’s not even close.


I thought it was intentional, as a half joke / nod to the BonziBuddy of old.


Sweet as.


There's one important aspect of search engine evaluations that many people forget.

A key part of user satisfaction is re-searches (up to 50%, if I recall correctly, from a paper by Teevan et al.).

This means that people expect a search engine to find the same results they found in the past for a given query. That's a huge part of lock-in in the ecosystem.


For most software, there is an open source attempt aimed at producing something more or less equivalent. Is there any such thing trying to attempt the same level of scalability as Google or Bing?

I'm aware of Apache Lucene, Elasticsearch, etc. The thing is, those all seem aimed at indexing orders of magnitude less stuff than Google does.

I'm guessing that perhaps anyone capable & attempting to implement such a thing just automatically gets fast-tracked into a job at Google (or similar company). Is that why it doesn't exist?


How would you even test that? Open source projects typically have a budget of near zero. There are lots of things that the open source approach can't handle but commercial development can, large scale search engines are one.


I guess hypothetically you could go for a wikimedia-style model and set up a foundation that operates the thing, but that would require a lot of funding, which would require decent utility up front. Vicious circle there.

I'm struggling with how to run even a small scale search engine as an open source project. I haven't really found any good projects to model for how to go about an open development process for a system that actually requires decent hardware to run, so right now it's just me developing in public, which is fine I guess, but not much different from how it was before I open sourced it.



From what I understand, they basically built a search function for wikipedia (and related projects). A noble goal since their own search function is kinda shit, but it was never really an internet search engine.


Well, what you could try is an engine that's designed to crawl specific niches and then allowing people to run their own instances. So you don't run any hardware yourself. You just let other people provide it - they might find a way to make money from it, or their community might find ways to pay for it anyway. It's not necessarily true that every search engine has to search the entire web.


The problem with this is that you get significant benefits from a larger crawling corpus, even if you index just a small portion of it, the rest will inform that portion.

That's a real problem that is ultimately hard to get around. Like my index is fairly small as it is, but even so it requires more hardware than say a student could afford. Like it's not enough that I'm dependent on external funding, but it's still not something you slap together for fun and then get bored of.


If you're going that far, I guess you might as well set up a commercial search engine or SEO tools business off the back of it. At which point, the thinking may shift away from open source toward something proprietary, unless there were some other substantial differentiators.


> Let’s create a better way to search the internet

> We want to make searching the internet open and transparent. Our goal is to provide a wide range of independent and free options for navigating the web. This is relevant for all organizations and people who want to boost Europe’s digital sovereignty, a greater variety of search engines, and independent search results. We can make this possible by cooperating with existing data centres and by using open source principles and public moderation. We may not be able to do it as individuals, but we can do it together.

https://opensearchfoundation.org/


Will it censor "vaccine misinformation", "election misinformation", "russian disinformation" and all the other mainstream topics that Google actively suppresses in favor of the narrative, or does your moniker of "open" just mean "open to the ideas we support"?


As an English language user Yandex is awful for typical use in my experience. I'd say at least a third of the results it serves on its English site are Russian. That said, I find myself using Yandex quite a bit simply because I can assume it has a Russian bias, or at the very least it doesn't have the same biases you might expect from a search engine operated in the West.

I think a lot of people would be surprised at the extent to which Google and other Western search engines filter their search results today. A lot of content which might be labelled "extreme" or politically problematic I find is now either missing entirely from Google search or pushed so far down in the results that it's practically inaccessible. On Yandex I don't have this problem.

As an example during the pandemic Yandex was pretty useful for finding conspiratorial content about Covid-19 which was censored by Google. Not that this is necessarily a good thing for the average user, but it has its uses if you're trying to come to your own conclusions about things rather than relying on the narrative provided by Western media companies. It's obviously also been useful when trying to better understand the Russian side of the ongoing RU/UA conflict which has either been completely absent or very unfairly portrayed by most Western media sources (again, perhaps for good reason, but still).


Surprised to see my site removed from the list:

> Ask.moe uses Google Custom Search now, so it's not a search engine anymore; it's a search client.

Not quite sure what a "search client" is, but I don't see why a Bing proxy (using the Bing Search API) is any more of a search engine than a Google proxy (using the Programmable Search Engine).


Does anyone know what the search engines that were excluded for "reasons" are?

I like having thorough lists and I am curious what as many as possible are regardless of political silliness or differences in using financial systems.

My guess is the crypto one is presearch, but the others I have no idea.


They list out their reasons here https://seirdy.one/posts/2021/03/10/search-engines-with-own-...

Sounds like two were excluded for having a far-right focus, one for crypto payments, and the rest for being small or simple proofs of concept.

That said, I would also be interested in what engines in particular were excluded, with a small blurb as to why it wasn't included.


Precisely. I don't care about the reasons; I would just want to know what they are.

I absolutely hate when people claim to be providing a "complete" list of something and then skip things because 'reasons'.


I'm glad others are interested in this topic and there is ongoing research. The current state of search engines is abysmal. My "plan" was to support DuckDuckGo as the alternative to Google, since Google has declined in quality so rapidly in the last few years, with most results just pushing their own services or funneling you into product ads. But DDG has been making a lot of missteps recently as well. Despite all the chest pounding they do about privacy, they still load tracking scripts on their domain, and I feel their communication about it has been really poor and gives off scummy vibes.


DDG is completely dependent on Microsoft for both their search results and their ad revenue. This causes a lot of problems: if Microsoft ever stops supporting DDG, the company is basically dead. In my opinion this is why DDG is taking a lot of "missteps" - they don't want to sour their ties with Microsoft. This is just my theory.


I was about to point out note 2 of the article:

> DuckDuckGo’s help pages claim that the engine uses over 400 sources; my interpretation is that at least 398 sources don’t impact organic results.

Ouch, and I thought DDG was clean... Still, that doesn't mean DDG tracks us, does it?


Your reply made me read DDG privacy policy.

> Similarly, we may add an affiliate code to some eCommerce sites (e.g. Amazon & eBay) that results in small commissions being paid back to DuckDuckGo when you make purchases at those sites.

Why was there a big backlash against brave for doing this but no one bats an eye when DDG does it?


There's a massive DDG bias on HN, see this thread for example: https://news.ycombinator.com/item?id=31492505


I bounce from default engine to engine (most recently Startpage as a 'better' Google) but so far always find myself back with Google search. Last time I went elsewhere, I recorded the occasions when I appended the URL bar search with @google. Quick info on a business (address, telephone # etc.) and maps dominated. I haven't used another engine that can deliver similar. I also realized how much time I spend looking at maps and Street View...

The knowledge graph and maps are a massive search moat.


Previous discussion post by original author (March 2021): https://news.ycombinator.com/item?id=26429942


Google won because they started early and had the right algorithm. Back in 1998, Google's PageRank was an innovative algorithm that calculated relevance based on counting backlinks instead of parsing the word counts in embedded HTML text like other search engines. This made Google way better than any other available search engine back then (Lycos, Yahoo, AltaVista, etc.), and within weeks, everyone was switching to Google. Google then had the scale to grow with the internet.


PageRank was and still is mostly marketing, there's no way Google was so much better due to PageRank. It's more likely due to their method of indexing inlinks as pseudo terms within documents that gave them the initial edge.


It's a one-two punch: PageRank does undeniably help the underspecified case, while indexing anchor texts adds more keywords and makes more searches underspecified.
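
For anyone unfamiliar with the algorithm being debated, the core idea is small enough to sketch. This is a toy power-iteration implementation (a simplified illustration, not Google's production system; the graph, damping factor, and iteration count are made-up example values) showing how rank flows along backlinks:

```python
# Toy PageRank via power iteration. links maps page -> list of outlinks.
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start with uniform rank
    for _ in range(iterations):
        # Every page keeps a baseline (1 - d) / n of "random jump" rank.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if not outlinks:
                # Dangling page: spread its rank evenly over all pages.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                # A page passes its rank in equal shares to its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical mini-web: "c" receives the most backlinks (from a, b, d).
graph = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(graph)
top = max(ranks, key=ranks.get)
```

The point of the parent comments in miniature: "c" comes out on top purely because other pages link to it, with no parsing of page text involved.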


Definitely the right algorithm if you want search results that are driven by marketing teams.


Do Chinese search engines have their own indexes? They're censored, but if you could remove that, would they be as good as Google, Bing, or Yandex?


Yes, I listed several. Scroll down a bit.


I know a common complaint is search engine x fails at finding info for knowledge domain y, usually something technical that in the overall lexicon is ambiguous with something more generic/popular. I wonder if a good solution to this is to use one of the smaller open engines in a private instance and just submit the handful of sites you know have the relevant info?
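
A minimal sketch of that idea: seed a private instance with a hand-curated allowlist of the sites you know cover your domain, and drop everything else from the crawl frontier. (The domains below are placeholders, and real self-hosted engines would have their own config format for this; this just illustrates the scoping logic.)

```python
# Restrict a crawl frontier to a curated allowlist of niche sites.
from urllib.parse import urlparse

# Hypothetical example domains standing in for "sites you know
# have the relevant info" for your knowledge domain.
ALLOWED_HOSTS = {"docs.example.org", "wiki.example.net"}

def in_scope(url):
    host = urlparse(url).hostname or ""
    # Accept an allowed host itself or any subdomain of it.
    return any(host == h or host.endswith("." + h) for h in ALLOWED_HOSTS)

frontier = [u for u in [
    "https://docs.example.org/api",
    "https://evil.example.io/spam",
    "https://sub.wiki.example.net/page",
] if in_scope(u)]
```

Matching on the full hostname (rather than a substring) matters here, so that a host like notdocs.example.org can't sneak into scope.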


I'm using presearch - https://presearch.io/about

and I like it.


Presearch is nice but the crypto token is stopping me from using it.


I don't use the token stuff (I don't even log in).

It has decent results, and all the main search engines are available, if needed, on the side.


Inspired by this, I have started making a list of alternative search engines and plan to review them once the count reaches 100. https://github.com/Tintedfireglass/search-engines


By the looks of it, we are witnessing the "Cambrian explosion" of web search.

Search engines will evolve, and those that do not adapt might be in for a nasty surprise by the end of the Mesozoic Era, no matter how big they are.


>Two engines were excluded from this list for having a far-right focus.

Does this mean like extremist terrorism or regular right wing stuff? Would be interesting to see.


Why aren't more of them using Common Crawl? Is it flawed?


I can only speak for myself, but I use my own dinky crawler because CC doesn't really solve any problems I have operating my search engine, and its unwieldy size creates new problems I didn't have doing my own crawling.

Crawling just isn't the hard part of building a search engine. Sure there are pitfalls and obstacles, but they're all fairly solvable.


I found it weird that the author references the Common Crawl a couple of times but it does not have its own entry on the page.

I’m not familiar with what it is… guess I’ll just Google it.


No DDG?


They don't have their own index so they are listed under Bing.


It’s there under Bing



