I have often wondered if perhaps a distributed search engine is the real answer. YaCy was interesting but the results were terrible. But perhaps it’s possible to use the ActivityPub protocol to build a distributed search that has different implementations on the backend. It’s something I keep toying with in my spare time, and with what I did with bonzamate, where the search is done in Lambda functions on AWS, it should be possible to roll out in a way that’s cheap for everyone to build out. The ability to federate between those providing good results, even at query time, could be very powerful if implemented correctly.
I'd also like to mention that I'll be using  as a template for how to describe to the public my own research process and my thought process for picking and choosing a particular piece of tech/data structure/solution.
Just a splendid read.
You can follow me on Twitter, use my blog's RSS, or email me if you like. You can get all the details on my profile :)
Distribution seems like the exact opposite of being reliable? How would the network defend against bad actors?
However, you would probably want them to return ranking information, allowing the caller to re-rank across its own system and others. Perhaps returning this could be optional?
Or even have the federated search engines share portions of the index when peering with others. This would allow one system to check whether the ranks are in line with expectations: trust, but verify.
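To make the re-ranking idea a bit more concrete, here is a minimal sketch of how a caller might merge scored results from several peers. Everything in it is an assumption for illustration: the function name `merge_federated`, the per-peer `trust` weights, and the choice to normalize each peer's non-negative scores by its own top score are not part of any existing protocol.

```python
from collections import defaultdict


def merge_federated(results_by_peer, trust):
    """Merge scored results from several peers into one ranking.

    results_by_peer: {peer_name: [(url, raw_score), ...]} with raw_score >= 0
    trust:           {peer_name: weight}; unknown peers get weight 0
    Both structures are hypothetical, not taken from a real federation spec.
    """
    combined = defaultdict(float)
    for peer, results in results_by_peer.items():
        if not results:
            continue
        top = max(score for _, score in results)
        for url, score in results:
            # Scale each peer's scores to [0, 1] so peers are comparable.
            normalized = score / top if top > 0 else 0.0
            combined[url] += trust.get(peer, 0.0) * normalized
    # Highest combined score first; ties are broken by URL for stable output.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    peers = {
        "peer-a": [("https://x.example", 10.0), ("https://y.example", 5.0)],
        "peer-b": [("https://y.example", 2.0), ("https://z.example", 1.0)],
    }
    for url, score in merge_federated(peers, {"peer-a": 1.0, "peer-b": 0.4}):
        print(f"{score:.2f} {url}")
```

A caller that has verified a peer's ranks against a shared slice of the index could raise or lower that peer's weight over time.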
Ah, “it would not”, sounds absolutely great.
The problem OP is mentioning is exactly why I read HN and why I have a newsletter about the blogs of HN. If you want to have opinions on a topic, the only thing you can do is read personal blog posts about the topic. I'm not sure you can make a search engine out of that, because at some point being #1 will make money for someone and they will be able to game the ranking. Look at Wikipedia on a business topic and you will see only the most popular tools are mentioned, and on Pinterest the results are now full of Pinterest SEO experts. The person who is able to prevent manipulation on the internet is going to be a billionaire.
I'm very bullish on curation, especially at work, because employees should be less driven by self-interest and could curate the knowledge of the company. For the web, though, I don't really know how to do it.
Pinterest is useful if you are looking for ideas of things that look similar. If you want to choose a pillow, just type pillow and you will find a list of collections of pillows. It doesn't make sense to show Pinterest results in Google because that's not what Google users are searching for. The fact that you want to block Pinterest is just because Pinterest is successful and has flooded Google Images as an SEO strategy.
Pinterest is a scourge for a text search engine. It's also a scourge for an IMAGE search engine, because it's just hiding the real source.
Pinterest is in the top 10 websites I would blacklist from search results.
By no means an expert on Wikipedia searches, but here are a few examples. There are many more searches and options.
This is a standard search for Wikipedia entries exactly matching a query. It will redirect to an entry if one exists. Sometimes it may be necessary to navigate to the disambiguation.
alias links="links -no-connect"
We started Thangs.com to allow for actual 3D search, e.g. search by uploading a 3D cylinder and find other cylinders as well as engines it fits within.
The growth has been mostly quite strong since we launched a year ago, but it’s a very capital intensive product to build - massive volumes of data.
Our monetization strategy is to skip the ads and instead use our 3D indexing to allow us to build a 3D, visual, revision control system - also free. However, we plan to charge enterprises for the standard advanced SaaS capabilities - which we then integrate with our current enterprise product.
We have a single tenant 3D search product that is doing well in enterprise, without the revision control capabilities yet integrated. Beyond search and reuse, there are massive opportunities in supply chain and many other valuable problem categories where true geometric search is a uniquely valuable solution.
It’s not always the case that ads and web3 principles are required to build commercially viable, yet somewhat narrow search products. The world of search is large and there is ample room for commercially viable specialization.
I usually don't even bother and just debug the source of the thing giving me problems now. Thank God for open source. It's kind of ridiculous though
I know the scalability of PCRE sounds crazy and I'm not asking for that, but maybe something is possible that would be useful for errors that mention things specific to you, where you just want to say "\w+" in that spot.
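As a rough sketch of what that could look like, the snippet below turns an error message with a marked variable part into a regular expression where only that part becomes `\w+` and everything else is matched literally. The `{}` placeholder syntax and the function names are assumptions made up for this example; no search engine is claimed to support them.

```python
import re


def error_pattern(template, placeholder="{}"):
    """Build a regex from an error message, with `placeholder` standing for \\w+.

    All other characters are escaped so they match literally.
    The placeholder syntax is hypothetical.
    """
    literal_parts = template.split(placeholder)
    return re.compile(r"\w+".join(re.escape(part) for part in literal_parts))


def search(pattern, documents):
    """Return the documents that contain a match for the pattern."""
    return [doc for doc in documents if pattern.search(doc)]


if __name__ == "__main__":
    docs = [
        "error: cannot open file settings.cfg (code 2)",
        "cannot open file .cfg (code 2)",
        "cannot open socket",
    ]
    pattern = error_pattern("cannot open file {}.cfg (code 2)")
    print(search(pattern, docs))
```

Restricting the wildcard to `\w+` keeps the query far cheaper than full PCRE, since the literal parts can still be looked up in an ordinary index before the regex is applied.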
GitHub has something that's pretty flexible, but for some unknown reason it lacks an extremely obvious "hide exact near duplicates" setting, so you get 25 pages of the same thing over and over again.
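For illustration, one common way to hide near duplicates is to compare results by the overlap of their word shingles. This is a minimal sketch under assumptions: the three-word shingle size, the 0.8 threshold, and the function names are arbitrary choices, and nothing here reflects how GitHub search works internally.

```python
def shingles(text, size=3):
    """Return the set of overlapping word n-grams in the text."""
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)}
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def hide_near_duplicates(results, threshold=0.8):
    """Keep a result only if it is not too similar to one already kept."""
    kept, kept_shingles = [], []
    for text in results:
        current = shingles(text)
        if all(jaccard(current, seen) < threshold for seen in kept_shingles):
            kept.append(text)
            kept_shingles.append(current)
    return kept


if __name__ == "__main__":
    hits = [
        "fix null pointer in parser when input is empty",
        "fix null pointer in parser when input is empty",
        "Fix null pointer in parser when input is empty now",
        "add logging to the crawler module",
    ]
    print(hide_near_duplicates(hits))
```

The pairwise comparison is quadratic in the number of kept results, which is fine for a page of hits; a large index would more likely use something like MinHash signatures.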
It could store the actual body of the page I browse and search in those directly, extract figures, summarize. It would be implemented as a kind of man in the middle getting the actual content. Next time you could search the local content, cutting the link with google for a lot of the searches.
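A minimal sketch of the local part of that idea, assuming the page bodies have already been captured somehow: store each body and keep a small inverted index so later searches never leave the machine. The class name `LocalPageIndex` and its methods are hypothetical, and the capture step (the "man in the middle") is deliberately left out.

```python
import re
from collections import defaultdict


class LocalPageIndex:
    """Toy in-memory store of visited pages with an inverted index."""

    def __init__(self):
        self.pages = {}                   # url -> stored page body
        self.postings = defaultdict(set)  # term -> set of urls

    @staticmethod
    def tokenize(text):
        return re.findall(r"[a-z0-9]+", text.lower())

    def add(self, url, body):
        # Note: re-adding a URL with a new body leaves old terms indexed.
        self.pages[url] = body
        for term in set(self.tokenize(body)):
            self.postings[term].add(url)

    def search(self, query):
        """Return URLs whose stored body contains every query term."""
        terms = self.tokenize(query)
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)


if __name__ == "__main__":
    index = LocalPageIndex()
    index.add("https://a.example/1", "Memory mapped hash tables explained")
    index.add("https://b.example/2", "Hash browns recipe")
    print(index.search("hash tables"))
```

A real version would persist the index to disk and probably rank hits, but even this strict every-term-must-match lookup covers the "I know I read this somewhere" case.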
Specific interests are better targeted by websites - like Hacker News in fact - which work really well for turning up interesting content to consider.
If we see market disruption, it is likely to come from the web browser space cutting off the air supply somehow (looking at Brave; the ideas there are interesting). It is notable that Google has been treating the web browser space as strategic for at least a decade; their money is everywhere and they've bought up a lot of influence. Indeed, the entire mobile play could be seen as securing a platform for their web browser, which in turn secures a platform for their search engine.
That's probably still true, but the equipment is probably 100x (maybe more) cheaper, reasonable ranking algorithms are widely known, and I'm sure there are a dozen other reasons why starting a search engine today is easier than it was in the past.
> And for general instance, the best search engine is the most popular one because it has a bigger index.
Unless the big search engine has systematic flaws in certain domains that give new entrants a niche in which they can be competitive. Classic disruption opportunities.
The problem is there is no easy, cheap and frictionless way to do payments online.
BTW I think the author's goal is more achievable by improving the search on vertical directories and catalogs that already exist than by creating new niche search engines.
1. The ad-supported search engine business models have led to the creation of massive quantities of low-quality, irrelevant content that those search engines are drowning now in themselves.
2. Ad-supported search exposes fundamental conflict of interest in such search engine UX - serve the user or serve the advertiser?
'Boutique' search engines are not a new idea. I have used same.energy to search for images and that was probably the best image search I've seen. I've used marginalia to find an interesting article about a broad topic I was interested in.
The problem is not in the lack of boutique search engines, it is that using them in this way (one search engine for each type of query) is not convenient. If we extrapolate the future to having 100 finest boutique search engines, how are we going to remember to use them, especially if the use is seldom?
Browsers will let you set only one main search engine. While you can have more with shortcuts, you still need to remember them.
The nature of searching is such that we want to spend 99% of our time in our primary search engine. It should then be the job of that primary search engine to be aware of the 100 boutique search engines and pull the data from them, all happening transparently for the user. The user is then presented with the best possible results for their query, assembled by carefully crafted algorithms that take results from both horizontal and vertical indexes. This has been called the 'meta-search engine' concept before, although I prefer the term 'search engine client', similar to how we have 'email clients'. Having access to high quality search engine clients will allow the existence of high quality boutique search engines, as the two ecosystems need each other.
Furthermore, the search engine client should also have tools to allow users to curate on their own and share this with other users. Want to blacklist domains? Want to create 'programming' results that suit your interests? This should be possible. These concepts are being explored in Kagi Search (disclaimer: founder), and what makes all this possible today is the transition to paid search as a product, breaking the spell of ad-supported search and allowing us to finally focus only on the user and their needs.
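One simple way a search engine client could assemble results from several engines that expose only ordered lists (no comparable scores) is reciprocal rank fusion. The sketch below is an illustration only: it is not a description of how any particular product merges results, and the constant `k=60` is just the value commonly used with this technique.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked URL lists into one.

    Each URL earns 1 / (k + rank) from every list it appears in,
    where rank starts at 1 for the top result of that list.
    """
    scores = {}
    for ranking in rankings:
        for rank, url in enumerate(ranking, start=1):
            scores[url] = scores.get(url, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties are broken by URL for stable output.
    return sorted(scores, key=lambda url: (-scores[url], url))


if __name__ == "__main__":
    general_engine = ["a.example", "b.example", "c.example"]
    image_engine = ["b.example", "d.example"]
    blog_engine = ["b.example", "a.example"]
    print(reciprocal_rank_fusion([general_engine, image_engine, blog_engine]))
```

Results that several engines agree on float to the top, while a result found by only one boutique engine still makes the list.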
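As a small illustration of the domain blacklist idea, a client-side filter could drop any result whose host is a blocked domain or one of its subdomains. The function name and the shape of the inputs are assumptions for this sketch and do not describe any product's actual implementation.

```python
from urllib.parse import urlsplit


def filter_blocked(urls, blocked_domains):
    """Remove URLs whose host is a blocked domain or a subdomain of one."""
    blocked = {domain.lower() for domain in blocked_domains}
    kept = []
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        labels = host.split(".")
        # "www.example.com" yields {"www.example.com", "example.com", "com"},
        # so matching is on whole labels, never on partial names.
        suffixes = {".".join(labels[i:]) for i in range(len(labels))}
        if not suffixes & blocked:
            kept.append(url)
    return kept


if __name__ == "__main__":
    results = [
        "https://www.pinterest.com/pin/1",
        "https://example.org/a",
        "https://notpinterest.com/x",
    ]
    print(filter_blocked(results, ["pinterest.com"]))
```

Because the blocklist is just a set of domain names, it is also easy to export and share with other users.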
Fundamentally, there is a conflict of interest between the interest of the search engine, and the interest of the user. The search engine wants to show the results that generate money, which may or may not be the same results that are the most beneficial for the visitor.
The AND operator doesn't work anymore. The result may ignore one or more of your keywords.
A Google search is not the best search of the internet that an ad-supported engine could give you. It is asking Google what it has for sale containing most of the keywords.
Here's a trick though: if you notice anything getting significantly worse, just try to report it.
Reporting is a seriously annoying process as they have applied a number of tricks to discourage feedback and I don't know if they ever read those reports, but it has one nice side effect:
When you report anything, you get removed from the experiment you are currently assigned to (e.g. "break site:" or "slide in ads on top of search results just as user is about to click first result") and are assigned to a new one that is hopefully less annoying.
: a colleague of mine demonstrated this to me three years ago; unfortunately he followed my advice and also disabled at least one extension before I could take steps to verify that it was indeed a Google experiment.
The basic thing that people will eventually start to understand is that in fact we already have a good start on creating technologies that allow for cooperative public data systems that can replace the private technology monopoly platforms.
A decentralized search engine controlled by the users' interests must be hard for content providers to "infiltrate" and control the results of, or it's useless: as soon as it becomes popular enough to matter, intentional manipulation and spam will become popular too.
Yes, I've built a curated boutique search engine, such as that described in the article. Short summary is that yes, it is possible for one person to build a useful search engine nowadays, if they don't attempt to index the whole of the internet (the niche mine covers is personal and independent websites). I've plenty of details about building it in the blog at https://blog.searchmysite.net/ , but key points you might find useful:
- It is now indexing around 6.5M pages, which is around a quarter of what Google indexed when it was launched in 1998.
- Estimated running costs are now looking to exceed US$1000 a year. (I could change hosting provider to reduce this.)
- In January 2021 I estimated I'd spent around 350 hours building it (evenings and weekends over the preceding year). I haven't estimated how long I have spent this year, but it won't be quite as much as last year.
Do you make money with this project?
Hardware investment is about $3-4k as a one-time cost, and then I estimate I'll need a 1 TB SSD every couple of years, as the server does kind of chew through them with great appetite.
My monthly operational costs are $15 in power, and $20 for Cloudflare because I kept getting DDoS:ed by botnets.
As for development time, dunno, I've been working on it in my spare time since this spring some time, generously estimated 30h/week x 30 weeks, so the upper bound may be 900 hours, but probably closer to something like 600 hours as I have other projects as well, and I'm not always feeling it.
I don't think off the shelf search solutions or databases are viable, they are too flexible which means they can't be fast and space-efficient enough to keep cost down. They're meant to run in a data center, not on a single computer. That means your operational costs will be prohibitive.
It's required a lot of old-fashioned wizardry to build though, bit-twiddling and demoscene-esque hacks to coax a lot of data into a minimal amount of space, the type of microoptimization stuff that usually is a waste of time except the data set is so large that saving single bytes in object encoding often translates to saving multiple gigabytes. If you aren't at least fairly comfortable with building custom compression algorithms, memory mapped hash tables, things like that, it's gonna be a rough project. If I didn't have a background in low level programming, this would have been nearly impossible.
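To give a flavor of the byte-saving tricks meant here, the sketch below shows one classic technique: storing a sorted list of document IDs as gaps, with each gap written as a variable-length integer. It is a generic textbook example and an assumption on my part, not the encoding this particular search engine actually uses.

```python
def encode_postings(doc_ids):
    """Delta + varint encode an ascending list of non-negative doc IDs."""
    out = bytearray()
    previous = 0
    for doc_id in doc_ids:
        gap = doc_id - previous
        previous = doc_id
        # Emit 7 bits per byte, low bits first; the high bit means "more follows".
        while gap >= 0x80:
            out.append((gap & 0x7F) | 0x80)
            gap >>= 7
        out.append(gap)
    return bytes(out)


def decode_postings(data):
    """Inverse of encode_postings."""
    doc_ids = []
    current = gap = shift = 0
    for byte in data:
        gap |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            current += gap
            doc_ids.append(current)
            gap = shift = 0
    return doc_ids


if __name__ == "__main__":
    ids = [3, 7, 300, 100000]
    packed = encode_postings(ids)
    print(len(packed), "bytes instead of", 4 * len(ids))
    assert decode_postings(packed) == ids
```

Small gaps between neighboring IDs fit in a single byte, which is where the savings come from when a term appears in many documents.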
Beyond that, most of this stuff you can pick up along the way. I didn't really know shit about building search engines before I started. I just threw together a design that made sense and built... something, and iterated upon that. With every iteration it's gotten faster, smaller, better, smarter. I think the upcoming release is gonna be yet another huge improvement.
Honestly though, the big issue is crawling. Not only is it a bandwidth monster, but many sites are hostile to non-Google bots, as are CDNs and Cloudflare.
I do my own crawling and don't agree with any of these statements. Bandwidth is not a bottleneck, and blocking is mostly only a problem if your bot is too aggressive.
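For readers wondering what "not too aggressive" might mean in practice, here is a minimal sketch of a scheduler that honors robots.txt rules and spaces out requests per host. It is an illustration under assumptions, not a description of this commenter's crawler: the class name, the default delay, and the idea of passing robots.txt text in by hand (instead of fetching it) are all made up for the example. `RobotFileParser` is from Python's standard library.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class PoliteScheduler:
    """Decides when (or whether) a URL may be fetched; does no network I/O."""

    def __init__(self, user_agent, default_delay=1.0):
        self.user_agent = user_agent
        self.default_delay = default_delay
        self.robots = {}        # host -> parsed robots.txt
        self.next_allowed = {}  # host -> earliest time of the next fetch

    def set_robots(self, host, robots_txt):
        parser = RobotFileParser()
        parser.parse(robots_txt.splitlines())
        self.robots[host] = parser

    def schedule(self, url, now):
        """Return the time at which to fetch url, or None if disallowed."""
        host = urlsplit(url).hostname
        parser = self.robots.get(host)
        if parser is not None and not parser.can_fetch(self.user_agent, url):
            return None
        delay = self.default_delay
        if parser is not None:
            # Use the site's Crawl-delay when it declares one.
            delay = parser.crawl_delay(self.user_agent) or self.default_delay
        start = max(now, self.next_allowed.get(host, now))
        self.next_allowed[host] = start + delay
        return start


if __name__ == "__main__":
    scheduler = PoliteScheduler("examplebot", default_delay=2.0)
    scheduler.set_robots(
        "example.org", "User-agent: *\nDisallow: /private/\nCrawl-delay: 5\n"
    )
    print(scheduler.schedule("https://example.org/a", now=0.0))
    print(scheduler.schedule("https://example.org/b", now=1.0))
    print(scheduler.schedule("https://example.org/private/x", now=1.0))
```

Spreading fetches across many hosts while keeping each individual host's rate low is what keeps total throughput high without looking like an attack to any one site.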
Indexing is the bigger challenge. An efficient fast service for an index of billions of pages is a different level to that for millions of pages.
It runs off one somewhat well-specced consumer grade PC running in the living room of some person living in Sweden as far as I understand.
Already I find it totally delightful to use compared to both Google and DDG for anything that fits its index.
> "Google is great at answering questions with an objective answer, like “# of billionaires in the world” or “What is the population of Iceland”. It’s pretty bad at answering questions that require judgment and context like “What do NFT collectors think about NFTs?”."
I expect search engines to search and retrieve articles about a topic, e.g. billionaires, Iceland, or NFT collectors, and the question answering is an add-on (sometimes nice, sometimes annoying) but not the main point of the tool. Arguably, if you want something to actually answer questions, this "something" should be fundamentally different from a search engine; that's a knowledge base + reasoner, not search.
I think folks here would like our "boutique search engine"
Besides the American competitive debate community who uses this today, I've always wondered if and when the broader community of people who argue (e.g. lawyers, armchair politicians) or even the public would start to notice our resource or find it useful
That kind of question seems really difficult for a search engine to find when there likely isn’t such a summary posted in exactly the format you are looking for. But thankfully there are other ways to do your own research on the internet. I’m sure you could easily find an NFT related subreddit, post exactly that question, and have multiple answers within the hour.
Another poster here built his own crawler and kept getting DDoS'd by botnets.
Search-crawling seems to elicit some hostility.
We know Google is working on this.
It’s also something we’re trying to solve with sites like Knifist.
Answering questions in such a way is obviously very hard to do at scale. At the very least, boutique search engines should try to focus on this more.
To VCs, the 1990s internet was a "grab all you can before it's too late" opportunity, in the same way the App Store was when it launched. Using lots of VC money, the current players have secured their position as "caretakers" of the internet, a medium that's no longer novel enough for Klondike-like expeditions. All the gold has been mined (or is already claimed).
That said: Reverse phone number lookups are 99% SEO spam. The hugely helpful forum search options were axed. Searches for obscure or arcane information are routinely rewarded with barely-related marketing and bot-built keyword pages.
On any given day, I'll run into dozens of instances of how gleaning the info I want out of Google is hugely more difficult (or outright impossible) than it was a decade ago.
The only thing Google still has in its favor is that it still obeys search operators - well, usually.
Just a few. Keep in mind most are indexing engines and not full featured general purpose search engines. You have to build a lot on top to get something remotely close to google.