Separate crawlers, indexers and search websites. One search website could go to several indexers. One indexer could get info from several crawlers. In today's internet this makes so much more sense than everything being run by the same company.
For example, you could have a crawler that specializes in crawling Medium.com, and another that specializes in crawling just computer game websites. The data from both would be aggregated by indexer A and indexer B, with each indexer ranking content using its own algorithms.
When a user opens a search website and types a query, the website pre-processes the query and sends it to both indexers. Both return the best results they can. The website then aggregates them into a single page and shows it to the user.
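To make that concrete, here is a rough sketch of the frontend side. The indexer URLs and the JSON response shape are made up for illustration; the point is just that the frontend fans the query out to whichever indexers it buys from, then merges and re-ranks locally.

    // Hypothetical indexer endpoints; the frontend chooses which indexers to buy results from.
    const INDEXERS = [
      "https://indexer-a.example/search",
      "https://indexer-b.example/search",
    ];

    // Assumed response shape from each indexer.
    interface Result {
      url: string;
      title: string;
      score: number; // the indexer's own ranking score
    }

    async function search(query: string): Promise<Result[]> {
      // Fan the query out to every indexer in parallel.
      const perIndexer = await Promise.all(
        INDEXERS.map(async (base) => {
          const resp = await fetch(`${base}?q=${encodeURIComponent(query)}`);
          return (await resp.json()) as Result[];
        }),
      );
      // Merge, de-duplicate by URL, and re-rank with the frontend's own policy
      // (here: just keep the best score any indexer gave a URL).
      const merged = new Map<string, Result>();
      for (const results of perIndexer) {
        for (const r of results) {
          const existing = merged.get(r.url);
          if (!existing || r.score > existing.score) merged.set(r.url, r);
        }
      }
      return [...merged.values()].sort((a, b) => b.score - a.score);
    }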
With this system there will be competition at every level. And a startup could conceivably build a node in this infrastructure without tackling the entire thing at the same time (like you have to do right now).
You think website X has a bad interface? You can build your own interface and buy information from existing indexers (whichever ones you choose).
You think you can do a better job at ranking results? You can build your own indexer and sell your results to pre-existing websites.
You think some niche area of the web is ignored? You can build your own crawler and sell your results to one or several indexers.
If someone does a bad job, the next level of the system has incentives (and freedom) to drop them. This includes the consumer of search results.
I have expressed similar ideas in other discussions here on HN, so I agree with you that it would be really neat to have an ecosystem like this. But the question is how you make money. What is your incentive to run a crawler? What is your incentive to run an indexer? People are used to the idea that web search is free, so your search sites cannot charge people money. That also means they cannot buy indexer services.
Most consumers probably won't want to pay for search. At least initially until the culture changes. But that's fine, they don't have to access the crawler or indexers directly. Instead they can point their browser at the frontend URL of their choice. And that frontend can monetize by injecting ads or whatever other business model they want.
As long as they earn more in ads than they pay on the open market for backend crawlers, indexers, etc., then they're a viable business. Given how ridiculously profitable Google is, this shouldn't be a difficult hurdle to clear.
Also, it is unclear how to start something like this. It's a sort of chicken-and-egg problem: you cannot start running any one of these components because nobody is running the other parts.
On a serious note, YaCy is good enough to crawl a bunch of sites you like and get much superior search results, albeit limited to your own interests. Google is complete garbage, especially nowadays: it just returns unrelated CNN pages. Try searching for "corona virus map": I get tons of news articles talking about the same page, but the page itself is not in the results. Whatever you add to the query, it keeps returning the same news articles from the same sources. If you populate YaCy with pages about the topic, you will get real results.
Edit: checking the results again, the first page also has a europe.eu page with a map, and another from arcgis.com with a map. So most of the first-page results seem highly relevant and good quality.
Sites that wished to appear in searches would have an incentive to get added to an index. That could create a market for compute to calculate and update indexes.
But I was thinking of something more radical ... let the indexes be peer-to-peer with an IPFS storage back-end, but build at least the generic crawler as a browser plugin that also performs searches against the index. For "normal" web pages, each visit results in a reindexing. Of course you have to solve the chicken and egg problem of pages that aren't visited because nobody has indexed them so you don't find them in search.
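As a sketch of what the plugin's "re-index on visit" step might look like: this assumes a content script running on each page load, and putToSharedIndex() is a hypothetical stand-in for whatever publishes posting entries into the IPFS-backed index.

    // Content-script sketch: re-index whatever page the user just visited.
    // putToSharedIndex() is a placeholder for publishing a (term -> url)
    // posting entry into the shared IPFS-backed index.
    function tokenize(text: string): string[] {
      return text.toLowerCase().split(/[^a-z0-9]+/).filter((t) => t.length > 2);
    }

    async function indexCurrentPage(
      putToSharedIndex: (term: string, url: string) => Promise<void>,
    ): Promise<void> {
      const url = window.location.href;
      const terms = new Set(tokenize(document.body.innerText));
      // One posting entry per distinct term; a real plugin would batch these
      // and include term frequencies, timestamps, signatures, etc.
      for (const term of terms) {
        await putToSharedIndex(term, url);
      }
    }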
A small number of indexes (like HotBot, before Google ruled the internet)? I'm not sure an oligopoly is that much better than a monopoly.
Or you have a lot of indexes. But then the system scales really poorly: every index has to handle a large number of queries and be hugely powerful, and every client has to spend a ton of time querying and aggregating them.
How would all of those be funded? Also, wouldn't separate indexes be zero-sum, just like search engines are?
And of course a big 'enough' publisher like Medium would be incentivised to provide its own crawler implementing the standard interface, for SEO.
(Reposts are fine after a year or so; links like this are just to supply more reading for the curious.)
Also, if we got this decentralized web of trust with weights, it would be like the Internet reinvented. Thanks to the small-world effect you would have access to reviews you can actually trust: your personal trust in an eBay seller, a potential business partner, or any politician...
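For what it's worth, the weighted part could be as simple as multiplying trust along a path and taking the best path within a few hops (the small-world effect is what keeps the hop count small). A toy sketch, assuming each user publishes direct trust scores in [0, 1] for people they actually know:

    // Toy sketch of weighted transitive trust. graph maps a user to the
    // direct trust scores they have published for people they know.
    type TrustGraph = Map<string, Map<string, number>>;

    function trust(graph: TrustGraph, from: string, to: string, maxDepth = 4): number {
      if (from === to) return 1;
      if (maxDepth === 0) return 0;
      let best = 0;
      // Trust along a path is the product of edge weights; keep the best path.
      for (const [friend, weight] of graph.get(from) ?? []) {
        best = Math.max(best, weight * trust(graph, friend, to, maxDepth - 1));
      }
      return best;
    }

    // Example: I trust alice 0.9, alice trusts some eBay seller 0.7,
    // so my derived trust in that seller is about 0.63.
    const g: TrustGraph = new Map([
      ["me", new Map([["alice", 0.9]])],
      ["alice", new Map([["seller-42", 0.7]])],
    ]);
    console.log(trust(g, "me", "seller-42")); // ≈ 0.63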
When I say that is the only way, it is like a fact to me. I've spent a long time thinking about it, and I'd be delighted to be proven wrong.
I had to uninstall it as it was killing my 8-core 5960x server.
The results can be OK if you push it towards the sites you enjoy reading, but apart from that...
Sitemaps help a bit, but the site still needs to be crawled completely at first, and changed pages need to be re-fetched.
It'd be more efficient if there were a standard by which websites could create and publish their own index. Web search engines could collate the pages and add page ranks.
The correctness of the index can be tested by randomly sampling a few pages, with a bias toward the highest-rated pages.
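A sketch of what that spot-check could look like, assuming (purely for illustration, this format is not an existing standard) that each site serves its self-published index as JSON entries of { url, terms, rank }. Sampling is biased toward high-rank pages since those are the ones worth gaming:

    // Assumed self-published index entry; the format is an illustration only.
    interface IndexEntry {
      url: string;
      terms: string[]; // terms the site claims this page is about
      rank: number;    // the site's claimed importance for this page
    }

    // Pick n entries with probability proportional to their claimed rank.
    function weightedSample(entries: IndexEntry[], n: number): IndexEntry[] {
      const total = entries.reduce((sum, e) => sum + e.rank, 0);
      const picked: IndexEntry[] = [];
      for (let i = 0; i < n; i++) {
        let r = Math.random() * total;
        for (const e of entries) {
          r -= e.rank;
          if (r <= 0) { picked.push(e); break; }
        }
      }
      return picked;
    }

    // Fetch each sampled page and check it really contains the terms it claims.
    async function spotCheck(entries: IndexEntry[], sampleSize = 5): Promise<boolean> {
      for (const entry of weightedSample(entries, sampleSize)) {
        const body = (await (await fetch(entry.url)).text()).toLowerCase();
        const missing = entry.terms.filter((t) => !body.includes(t.toLowerCase()));
        // Too many claimed terms missing from the page -> likely index stuffing.
        if (missing.length > entry.terms.length * 0.2) return false;
      }
      return true;
    }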
What are other projects in this space?
I know Martti Malmi (Satoshi's 1st Bitcoin contributor) is doing some stuff (no tokens) in this space.
Anybody else know of, or have links/talks/articles on, other (non-blockchain) decentralized search projects?
But full-text search does not map naturally onto a DHT approach. So what exactly do they do? How does it scale? How is work broken up between DHT nodes (what exactly is used as the hash key in the distributed hash table)? How does it deal with malicious nodes (SEO is a big thing in search)? Is the division of resources between DHT nodes "fair"? How are results ranked?
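For reference, the naive way to put full-text search on a DHT (and I'm not claiming this is what YaCy actually does) is to use the term as the hash key and store a posting list of URLs as the value. The query side then looks roughly like this, which is also where the scaling and fairness problems show up: posting lists for common terms are huge and land on single nodes.

    // Minimal interface for a DHT keyed by search term; the stored value is
    // the posting list of URLs containing that term. This is the textbook
    // mapping, not necessarily YaCy's actual design.
    interface TermDHT {
      get(term: string): Promise<string[]>;
      put(term: string, postingList: string[]): Promise<void>;
    }

    async function naiveDhtSearch(dht: TermDHT, query: string): Promise<string[]> {
      const terms = query.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
      if (terms.length === 0) return [];
      // One network round-trip per term, each hitting whichever node owns that key.
      const postingLists = await Promise.all(terms.map((t) => dht.get(t)));
      // Intersect: a URL must appear in every term's posting list.
      return postingLists.reduce((acc, list) => acc.filter((url) => list.includes(url)));
    }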
EDIT: They now have an online demo that doesn't require any installation. After waiting about a minute for peers to connect, queries ran relatively "fast" each time. That's an improvement. But the results are still junk, IMO.
Result "Hardcore teen sex"
Why is this trending?