Agreed. So much so that I started two search engines: https://bonzamate.com.au/ which is explicitly for Australian sites, and https://searchcode.com/ which is about to get a large index update that should make it more useful.
I have often wondered if perhaps a distributed search engine is the real answer. Yacy was interesting, but the results were terrible. Perhaps it's possible to use the ActivityPub protocol to build a distributed search that has different implementations on the backend. It's something I keep toying with in my spare time, and with what I did with bonzamate, where the search is done in Lambda functions on AWS, it should be possible to roll this out in a way that's cheap for everyone to build on. The ability to federate between those providing good results, even at query time, could be very powerful if implemented correctly.
I just finished reading this [0]. Your story-telling abilities are great. Your technical abilities as well. As a hobbyist search engine "engineer" doing independent research and slowly building on my search engine code base, I would like to subscribe to your newsletter.
I'd also like to mention that I'll be using [0] as a template for how to describe to the public my own research process and my thought process for picking and choosing a particular piece of tech/data structure/solution.
The same way any other federated system does. Allow, ignore, block and such... The whole point is you get to choose. If someone happens to get a lot of requests from other peers, they might be considered a reliable source.

However, you would probably want to have them return ranking information, allowing the caller to re-rank results from its own system and others. Perhaps returning this could be optional?

Or even have the federated searches share portions of the index when peering with others. This would let one system determine whether the rankings appear in line with expectations, allowing you to trust but verify.
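A minimal sketch of what that query-time federation could look like, assuming each peer exposes a hypothetical /search endpoint that returns JSON results with an optional raw score (everything here is illustrative, not a real protocol):

    import requests  # third-party HTTP library

    PEERS = ["https://peer-a.example", "https://peer-b.example"]  # hypothetical peers

    def federated_search(query):
        results = []
        for peer in PEERS:
            try:
                # hypothetical endpoint; a real system might discover peers
                # and negotiate capabilities over ActivityPub
                r = requests.get(f"{peer}/search", params={"q": query}, timeout=2)
                for hit in r.json().get("results", []):
                    hit["peer"] = peer
                    results.append(hit)
            except requests.RequestException:
                continue  # a slow or dead peer should not break the query
        # re-rank on the caller: normalize each peer's scores so that no
        # single peer's scoring scale dominates the merged list
        for peer in PEERS:
            hits = [h for h in results if h["peer"] == peer]
            top = max((h.get("score", 0) for h in hits), default=0) or 1
            for h in hits:
                h["norm_score"] = h.get("score", 0) / top
        return sorted(results, key=lambda h: h["norm_score"], reverse=True)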
Honestly I have no idea. There is no tracking of any kind on it. I should probably hook one of those newer awstats implementations into the caddy logs for it to get some idea though.
True. I just never bothered with anything. I get some basic details out of cloudwatch, but that's about it. As mentioned I think I should do something with what comes out of caddy one of these days.
Interesting that the author mentions neither Wikipedia nor Pinterest. They are the two biggest curator-led websites.
The problem OP is mentioning is exactly why I read HN and why I have a newsletter about the blogs of HN [1]. If you want to form opinions on a topic, the only thing you can do is read personal blog posts about it. I'm not sure you can make a search engine out of that, because at some point being #1 will provide money to someone, and they will be able to game the ranking. Look at Wikipedia on a business topic and you will see only the most popular tools mentioned, and on Pinterest the results are now full of Pinterest SEO experts. The person who manages to prevent manipulation on the internet is going to be a billionaire.
I'm very bullish on curation, especially at work, because employees should be less driven by self-interest and could curate the company's knowledge. For the web, though, I don't really know how to do it.
I find it surprising that you mention Wikipedia and Pinterest in the same breath. As someone who doesn't really use the latter, I always knew it as the site to exclude from search results. Whatever the query, a Pinterest result will look like utter spam that only used my query to get into the SERP and then tries to send me to something barely related. On the other hand, Wikipedia seems to be always on topic and usually high-quality content. I understand that one medium is more vulnerable than the other. But still, with somewhat similar content-creation structures, I'd expect more similar outcomes. So, what is the main reason for the perceived quality difference?
They have different goals: Pinterest's goal is to let users create collections of things, Wikipedia's goal is to be right.

Pinterest is useful if you are looking for ideas of things that look similar. If you want to choose a pillow, just type pillow and you will find a list of collections of pillows. It doesn't make sense to show Pinterest results in Google because that's not what Google users are searching for. The fact that you want to block Pinterest is just because Pinterest is successful and has flooded Google Images as an SEO strategy.
Ideas as in "search some term, get a not-even-vaguely-related picture collection you cannot see unless you log in"? Or the "ideas" that also happen to hide where that very idea came from?
Pinterest is a scourge for a text search engine. It's also a scourge for an IMAGE search engine, because it just hides the real source.
Pinterest is in the top 10 websites I would blacklist from search results.
Compared to the garbage results that the popular and wannabe-popular web search engines serve up on a regular basis, Wikipedia search is vastly underrated, IMO. The results are qualitatively different when the search engine project is not being sponsored by advertising. And Wikipedia is happy to give you as many results as you want, with no easily triggered rate limits for consuming too many results too fast.
By no means an expert on Wikipedia searches, but here are a few examples. There are many more searches and options.
This is a standard search for Wikipedia entries exactly matching a query. It will redirect to an entry if one exists. Sometimes it may be necessary to navigate to the disambiguation page.
    # links is a text-mode browser; -no-connect starts a separate instance
    alias links="links -no-connect"
    q=podunk
    # limit=500 returns up to 500 results; ns0=1 restricts to the article namespace
    links "https://en.wikipedia.org/w/index.php?limit=500&profile=all&search=$q&ns0=1"
This is another, similar search. It will not redirect to an entry if one exists; it will return all entries whose text matches the query.
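If I recall correctly, it is the fulltext parameter that suppresses that redirect, so something along these lines:

    links "https://en.wikipedia.org/w/index.php?limit=500&search=$q&fulltext=1"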
This is a search I use to explore websites. It searches for domains found in external links. For example, one could search with the query "podunk.edu" and find all the links in Wikipedia articles to pages at Podunk University websites.
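That search maps onto MediaWiki's Special:LinkSearch page; in the same style as above, something like:

    links "https://en.wikipedia.org/wiki/Special:LinkSearch?target=*.podunk.edu&limit=500"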
Agree with the fundamental premise of the article.
We started Thangs.com to allow for actual 3D search, e.g. search by uploading a 3D cylinder and find other cylinders, as well as engines it fits within.
Growth has mostly been quite strong since we launched a year ago, but it's a very capital-intensive product to build, with massive volumes of data.
Our monetization strategy is to skip the ads and instead use our 3D indexing to build a 3D, visual revision-control system, also free. However, we plan to charge enterprises for the standard advanced SaaS capabilities, which we then integrate with our current enterprise product.
We have a single tenant 3D search product that is doing well in enterprise, without the revision control capabilities yet integrated. Beyond search and reuse, there are massive opportunities in supply chain and many other valuable problem categories where true geometric search is a uniquely valuable solution.
It’s not always the case that ads and web3 principles are required to build commercially viable, yet somewhat narrow search products. The world of search is large and there is ample room for commercially viable specialization.
The only noise I have noticed is SEO spam sites copying posts from stack overflow or GitHub. But usually these only come up when almost no content on the topic exists so it didn’t really deprive me of anything.
Speaking of sed, I kinda wish there was some kind of regex-like query you could search with, beyond booleans and ranges.

I know the scalability of PCRE sounds crazy and I'm not asking for that, but maybe something is possible that'd be useful for errors that mention things specific to you, where you just want to say "\w+" for those parts.
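You can fake a crude client-side version of this today by generalizing the query before you search; a toy sketch (the substitution rules are just illustrative):

    import re

    # blank out the machine-specific bits of an error message before
    # searching, approximating the "\w+" placeholders described above
    def generalize_error(msg):
        msg = re.sub(r"0x[0-9A-Fa-f]+", "0xADDR", msg)  # pointer values
        msg = re.sub(r"/[\w./-]+", "/PATH", msg)        # local file paths
        msg = re.sub(r"\b\d+\b", "N", msg)              # pids, line numbers
        return msg

    print(generalize_error("segfault at 0x7f3a21 in /home/alice/app.py line 42"))
    # -> segfault at 0xADDR in /PATH line N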
GitHub has something that's pretty flexible, but for some unknown reason it lacks a very extremely obvious "hide [] exact [] near duplicates" setting, so you don't get 25 pages of the same thing over and over again.
This could be trivially done in a Greasemonkey plug-in with JavaScript and fetch(), and actually, if I wasn't already in bed, I'd probably write that right now.
I would like to see a local search engine. I have been thinking about it for a long time: a local program that intercepts/caches what I am searching. It could then search this cache before requesting anything online (maybe even summarize and graph the content). It could work offline too. Privacy could be one of its best features.

It could store the actual body of the pages I browse and search those directly, extract figures, summarize. It would be implemented as a kind of man-in-the-middle getting the actual content. Next time you could search the local content, cutting the link with Google for a lot of searches.
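The search-the-cache half of that is nearly free nowadays; a minimal sketch using SQLite's built-in FTS5 full-text index (the interception/proxy plumbing, which is the hard part, is left out):

    import sqlite3

    # a toy local page cache with full-text search
    db = sqlite3.connect("browse_cache.db")
    db.execute("CREATE VIRTUAL TABLE IF NOT EXISTS pages USING fts5(url, title, body)")

    def cache_page(url, title, body):
        # called by the intercepting proxy for every page you visit
        db.execute("INSERT INTO pages VALUES (?, ?, ?)", (url, title, body))
        db.commit()

    def local_search(query):
        # consult the local cache before ever going online
        return db.execute(
            "SELECT url, title FROM pages WHERE pages MATCH ? ORDER BY rank",
            (query,),
        ).fetchall()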
My memory of the Good Old Days is that there are substantial economies of scale and economies of data in the search engine business. Frequently a site couldn't even provide search results for its own pages as effectively as Google did. So while a curator approach sounds interesting, the situation that is likely to emerge is that for specific niches the best search engine is still the most popular one. And for the general case, the best search engine is the most popular one because it has a bigger index.
Specific interests are better targeted by websites - like Hacker News in fact - which work really well for turning up interesting content to consider.
If we see market disruption, it is likely to come from the web browser space cutting off the air supply somehow (looking at Brave; the ideas there are interesting). It is notable that Google has been treating the web browser space as strategic for at least a decade; their money is everywhere and they've bought up a lot of influence. Indeed, the entire mobile play could be seen as securing a platform for their web browser, which in turn secures a platform for their search engine.
> My memory of the Good Old Days is that there are substantial economies of scale and economies of data in the search engine business.
That's probably still true, but the equipment is probably 100x (maybe more) cheaper, reasonable ranking algorithms are widely known, and I'm sure there are a dozen other reasons why starting a search engine today is easier than it was in the past.
> And for the general case, the best search engine is the most popular one because it has a bigger index.
Unless the big search engine has systematic flaws in certain domains that give new entrants a niche in which they can be competitive. Classic disruption opportunities.
Big search engines have huge flaws for users looking for specific content. Everything is crammed into the same funnel. Google tries to address some of this with niche indexes like Scholar, but there is huge pent-up demand from people who want a higher signal-to-noise ratio. Google profits from low signal by selling priority. Imagine a future where people can filter by blogs or forums or companies or government sites or any number of subsets of the internet.
It strikes me as odd that tech businesses expect their users to work for them for free. There used to be a time when users were dazzled enough that they'd do that; now it's adversarial. I don't get my food for free from the bakery, so I'd better pay my users for their curation.
The problem is there is no easy, cheap and frictionless way to do payments online.
BTW I think the author's goal is better achievable by improving the search on vertical directories and catalogs that already exist, rather than by creating new niche search engines.
1. The ad-supported search engine business model has led to the creation of massive quantities of low-quality, irrelevant content that those same search engines are now drowning in.

2. Ad-supported search exposes a fundamental conflict of interest in the search engine UX: serve the user or serve the advertiser?
'Boutique' search engines are not a new idea. I have used same.energy [1] to search for images and that was probably the best image search I've seen. I've used marginalia [2] to find an interesting article about a broad topic I was interested in.
The problem is not a lack of boutique search engines; it is that using them this way (one search engine for each type of query) is not convenient. If we extrapolate to a future with the 100 finest boutique search engines, how are we going to remember to use them, especially if we use each one seldom?
Browsers will let you set only one main search engine. While you can have more with shortcuts, you still need to remember them.
The nature of searching is such that we want to spend 99% of our time in our primary search engine. It should then be the job of that primary search engine to be aware of the 100 boutique search engines and pull the data from them, all happening transparently for the user. The user is then presented with the best possible results for their query, assembled by carefully crafted algorithms that take results from both horizontal and vertical indexes. This has been called the 'meta-search engine' concept before, although I prefer the term 'search engine client', similar to how we have 'email clients'. Having access to high-quality search engine clients will allow the existence of high-quality boutique search engines, as the two ecosystems need each other.
Furthermore, the search engine client should also have tools that allow users to curate on their own and share this with other users. Want to blacklist domains? Want to create 'programming' results that suit your interests? This should be possible. These concepts are being explored in Kagi Search [3] (disclaimer: founder), and what makes all this possible today is the transition to paid search as a product, breaking the spell of ad-supported search and allowing us to finally focus only on the user and their needs.
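To make the 'search engine client' idea concrete, here is a toy sketch of routing a query to boutique engines and then applying the user's own curation; every endpoint here is a made-up placeholder, not a real API:

    # toy "search engine client": route by topic, then curate
    BOUTIQUE_ENGINES = {
        "images": "https://images.example/search?q={}",
        "code": "https://code.example/search?q={}",
        "blogs": "https://blogs.example/search?q={}",
    }
    USER_BLACKLIST = {"pinterest.com"}  # the user's own curation rules

    def classify(query):
        # a real client would use something much smarter than keywords
        if any(w in query for w in ("photo", "image", "wallpaper")):
            return "images"
        if any(w in query for w in ("error", "function", "library")):
            return "code"
        return "blogs"

    def search(query, fetch):
        # fetch() is assumed to return a list of dicts with a "domain" key
        url = BOUTIQUE_ENGINES[classify(query)].format(query)
        return [r for r in fetch(url) if r["domain"] not in USER_BLACKLIST]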
What is special about ad-supported search engines? If the most popular search engine were not ad-supported, all the same incentives would apply: get your site to the top of the results = profit.
It's about the incentive for the search engine to manipulate the ranking. If the search engine does not make money off ads, then there is no incentive to promote profitable results.
Fundamentally, there is a conflict of interest between the interest of the search engine, and the interest of the user. The search engine wants to show the results that generate money, which may or may not be the same results that are the most beneficial for the visitor.
But there's a counter incentive in that if the search engine does not produce results that are good for the user the user will leave for one that does.
There are operators that don't work in Google anymore:
Link:
Site:
The AND operator doesn't work anymore. The result may ignore one or more of your keywords.
A Google search is no longer the best ad-supported search of the internet you can get. It is asking Google what it has for sale that contains most of your keywords.
Well, both of you might be right; Google is always running a crazy number of experiments (mostly on how sloppy they can be and still get away with it, it seems ;-)
Here's a trick though: if you notice anything getting significantly worse, just try to report it.
Reporting is a seriously annoying process as they have applied a number of tricks to discourage feedback and I don't know if they ever read those reports, but it has one nice side effect:
When you report anything, you get removed from the current experiment you are assigned to (e.g. "break site:" or "slide in ads on top of search results just as the user is about to click the first result"[1]) and assigned to a new one that is hopefully less annoying.
[1]: a colleague of mine demonstrated this to me three years ago; unfortunately he followed my advice and also disabled at least one extension before I could take steps to verify that it was indeed a Google experiment.
No, we need decentralized search and content distribution. An individual group or company can't replace Google, and if it did, it would be just as bad. And the reason you have these dominant companies is because they provide all-encompassing services, not niche topics. Take a look at https://yacy.net/ as one potential starting point to think about.
The basic thing that people will eventually start to understand is that we already have a good start on the technologies that allow cooperative public data systems to replace the private technology monopoly platforms.
A challenge for decentralized search and content distribution is the antagonistic relationship between search and content distribution, e.g. SEO manipulation. Cooperation is difficult when interests diverge a lot, as they do between content providers and users with respect to search (users want to find the most relevant provider for their search, providers want their resource to be found for all searches, no matter how irrelevant).
A decentralized search engine controlled by the users' interests must be hard for content providers to infiltrate and control the results, or it's useless: as soon as it becomes sufficiently popular to matter, intentional manipulation and spam will follow.
Does anyone here have experience trying to build a search engine, even a "boutique" one, for the web? Is it something an individual could conceivably operate on their own?
> Does anyone here have experience trying to build a search engine, even a "boutique" one, for the web? Is it something an individual could conceivably operate on their own?
Yes, I've built a curated boutique search engine, such as that described in the article. Short summary is that yes, it is possible for one person to build a useful search engine nowadays, if they don't attempt to index the whole of the internet (the niche mine covers is personal and independent websites). I've plenty of details about building it in the blog at https://blog.searchmysite.net/ , but key points you might find useful:
- It is now indexing around 6.5M pages, which is around a quarter of what Google indexed when it launched in 1998.
- Estimated running costs are now looking to exceed US$1000 a year. (I could change hosting provider to reduce this.)
- In January 2021 I estimated I'd spent around 350 hours building it (evenings and weekends over the preceding year). I haven't estimated how long I have spent this year, but it won't be quite as much as last year.
Such an interesting idea! I tried it just once, with 'startup ideas', and I can see how it gives more useful results than Google for this simple phrase.
Thanks for your feedback. I was hoping that the paid listings for the search as a service would cover the running costs so it could be self-sustaining, although so far to be honest it is a long way off doing that.
I've built https://search.marginalia.nu/ from scratch, as a solo hobby project. It's literally just a computer in my living room.
Hardware investment is about $3-4k as a one-time cost, and then I estimate I'll need a new 1 TB SSD every couple of years, as the server does kind of chew through them with great appetite.
My monthly operational costs are $15 in power, and $20 for Cloudflare because I kept getting DDoSed by botnets.
As for development time, dunno, I've been working on it in my spare time since some time this spring. Generously estimated at 30h/week x 30 weeks, the upper bound may be 900 hours, but it's probably closer to something like 600 hours, as I have other projects as well and I'm not always feeling it.
I don't think off-the-shelf search solutions or databases are viable; they are too flexible, which means they can't be fast and space-efficient enough to keep costs down. They're meant to run in a data center, not on a single computer. That means your operational costs will be prohibitive.
It's required a lot of old-fashioned wizardry to build though: bit-twiddling and demoscene-esque hacks to coax a lot of data into a minimal amount of space, the type of micro-optimization that is usually a waste of time, except here the data set is so large that saving single bytes in object encoding often translates to saving multiple gigabytes. If you aren't at least fairly comfortable building custom compression algorithms, memory-mapped hash tables, things like that, it's gonna be a rough project. If I didn't have a background in low-level programming, this would have been nearly impossible.
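To give a flavor of the byte-pinching involved (a generic illustration of the kind of trick, not necessarily what my index does): delta + varint encoding of a sorted posting list, so that doc ids that would cost eight bytes each mostly cost one:

    # encode sorted doc ids as gaps, each gap as a variable-length integer
    def encode_postings(doc_ids):
        out = bytearray()
        prev = 0
        for doc_id in doc_ids:
            gap = doc_id - prev
            prev = doc_id
            while gap >= 0x80:
                out.append((gap & 0x7F) | 0x80)  # 7 payload bits + continuation bit
                gap >>= 7
            out.append(gap)
        return bytes(out)

    # three 8-byte ids shrink to five bytes: 3 for the first gap, 1 each after
    encode_postings([1_000_000, 1_000_003, 1_000_011])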
Beyond that, most of this stuff you can pick up along the way. I didn't really know shit about building search engines before I started. I just threw together a design that made sense and built... something, and iterated upon that. With every iteration it's gotten faster, smaller, better, smarter. I think the upcoming release is gonna be yet another huge improvement.
Yes. Gigablast is run by one person. Assuming you roll your own hardware and co-lo it or such, you can get by with minimal hardware. The catch is that once you hit scale, your servers get hammered. It's one reason why I have been investigating using AWS Lambda to hold and search the index: it solves the initial scale problem.
Honestly, though, the big issue is crawling. Not only is it a bandwidth monster, but many sites are hostile to non-Google bots, along with CDNs and Cloudflare.
> Honestly, though, the big issue is crawling. Not only is it a bandwidth monster, but many sites are hostile to non-Google bots, along with CDNs and Cloudflare.
I do my own crawling and don't agree with any of these statements. Bandwidth is not a bottleneck, and blocking is mostly only a problem if your bot is too aggressive.
Could be due to the long-form content you index. I found those sorts of sites tend to be less reluctant about third-party crawlers. It's also possible you are more talented than I am at writing crawlers.
Crawling is a challenge but not as much as mentioned. We have had a respectful bot since 2004, so the problems for us are more about awareness than hostility.
Indexing is the bigger challenge. An efficient, fast service for an index of billions of pages is on a different level from one for millions of pages.
I have always found crawling harder personally. At least with indexing you can switch to batch processing, or split the index into real-time and stale portions to give that fresh feel. Just my personal opinion though. I have never had the chance to index billions of pages. Several hundred million is about as far as I have gone.
A major cost for a web search engine is that you have to index the web, and it's pretty big. For even a modest commercial enterprise this isn't insurmountable, but it's a lot to ask for a hobbyist.
Attic.city was developed and is run with less than one FTE (two part time founders). Definitely boutique — home and fashion products from indie stores in the US — and thereby manageable both in terms of labor and the stack. We’re growing incrementally. Lately the focus has been on internal tooling and progressively automated health/status metrics.
I built a niche search site [0] mainly to "scratch my own itch". It now works well enough for my own needs, but gets almost no traffic. I'm not really going to put any effort into SEO, and it is never going to get noticed/ranked without it. There is a whole list of improvements I'd like to make, but what is my motivation if no one will ever notice?
It's interesting that the author starts with a notion of equating search engines with answering questions.
> "Google is great at answering questions with an objective answer, like “# of billionaires in the world” or “What is the population of Iceland”. It’s pretty bad at answering questions that require judgment and context like “What do NFT collectors think about NFTs?”."
I expect search engines to search and retrieve articles about a topic, e.g. billionaires, Iceland, or NFT collectors; question answering is an add-on (sometimes nice, sometimes annoying) but not the main point of the tool. Arguably, if you want something to actually answer questions, that "something" should be fundamentally different from a search engine; that's a knowledge base + reasoner, not search.
My coauthor for our paper at COLING 2020, titled "DebateSum: A large-scale argument mining and summarization dataset", wrote a search engine, www.debate.cards, over our huge debate evidence dataset.
I think folks here would like our "boutique search engine"
Besides the American competitive debate community who use it today, I've always wondered if and when the broader community of people who argue (e.g. lawyers, armchair politicians), or even the general public, would start to notice our resource or find it useful.
> It’s pretty bad at answering questions that require judgment and context like “What do NFT collectors think about NFTs?”.
That kind of question seems really difficult for a search engine to find when there likely isn’t such a summary posted in exactly the format you are looking for. But thankfully there are other ways to do your own research on the internet. I’m sure you could easily find an NFT related subreddit, post exactly that question, and have multiple answers within the hour.
I found some pretty good articles on it with "nft collector thoughts", although I'm alright reading between the lines and going through four articles to see if they answer my question before deciding whether I need to drill down with more keywords. However, as you said, I don't think the proposed question has ever been asked on a public internet website, at least in the context the author is looking for, since the answer can usually be inferred: "well, collectors obviously collect things they like or think are valuable. Why would I specifically ask them what they think?"
I think what you're looking for is curated search engines. The issue is that curated content doesn't fit everyone's taste. What may seem high quality to you may not be high-quality content for someone else. del.icio.us clones and whatnot, and Pinboard, already have search functionality.
In my opinion, Google's way of organizing information worked really well in the past; now it's time to move on. The same system that separated signal from noise is now helping to generate too much noise while trying to fight it at the same time.
A few years back I set up a Yacy crawler. My IP soon appeared on blacklists (SMTP, etc). I'd force-swap IPs, and a few days later I'd be blacklisted again.
Another poster here built his own crawler and kept getting DDoS'd by botnets.
I think this is true for Twitter as well, different communities have different needs that a catch all service won't necessarily implement (for ex, stocks with stocktwits). Similar to the decomposition of Craigslist that's always referenced
The search experience itself needs to be better. Why stop at listing search results? Why not find common patterns across top results to provide a holistic perspective?
We know Google is working on this [1].
It’s also something we’re trying to solve with sites like Knifist [2].
Answering questions in such a way is obviously very hard to do at scale. At the very least, boutique search engines should try to focus on this more.
To VCs, the 1990s internet was a "grab all you can before it's too late", in the same way the App Store was when it launched. Using lots of VC money, the current players have secured their position as "caretakers" of the internet, a medium that's no longer novel enough for Klondike-like expeditions. All the gold has been mined (or is already claimed).
I like that trailing ? that indicates doubt about how well G2020gle works.
That said: Reverse phone number lookups are 99% SEO spam. The hugely helpful forum search options were axed. Searches for obscure or arcane information are routinely rewarded with barely-related marketing and bot-built keyword pages.
On any given day, I'll run into dozens of instances of how gleaning the info I want out of Google is hugely more difficult (or outright impossible) than it was a decade ago.
The only thing Google still has in its favor is that it still obeys search operators - well, usually.
I’m genuinely curious for a real world example of how it lets you down ? How do you know it’s not giving you what you need for example ? How do you quantity ?
Just a few. Keep in mind most are indexing engines and not full-featured general-purpose search engines. You have to build a lot on top to get something remotely close to Google.
Don't forget Apache Solr. In common with a number of other search engines including Elasticsearch, it is based on Apache Lucene. There are ports of Apache Lucene to a number of other languages (e.g. I noticed recently that the Rust port was particularly active).
The stated mission of a company worth almost two trillion dollars is to “organize the world’s information” and yet the Internet remains poorly organized. Or, stated differently, in a world of infinite information, it’s no longer enough to organize the world’s information. It becomes important to organize the world’s trustworthy information.