YaCy: Decentralized Web Search

gambler · on Feb 5, 2020

P2P is probably the best solution for search in the long run, but here is how it can be decentralized in the interim.

Separate crawlers, indexers and search websites. One search website could go to several indexers. One indexer could get info from several crawlers. In today's internet this makes so much more sense than everything being run by the same company.

For example, you could have a crawler that specializes in crawling Medium.com. You could have another one that specializes in just computer game websites. The data from both is aggregated by indexer A and indexer B. Indexers would rank content using different algorithms.

When a user opens some search website and types a query, the website pre-processes the query and sends it to both indexers. Both return the best results they can. The website then aggregates them into a single page and shows it to the user.
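A minimal sketch of the fan-out and aggregation step described above. Everything here is hypothetical: the `Indexer` interface, the two stand-in indexers and the example URLs are made up for illustration, and reciprocal rank fusion is just one plausible way to merge lists whose scores are not comparable across indexers.

```python
from collections import defaultdict
from typing import Callable

# Hypothetical indexer interface: takes a query, returns (url, score) pairs,
# best first. Real indexers would be remote services; these are stand-ins.
Indexer = Callable[[str], list[tuple[str, float]]]


def preprocess(query: str) -> str:
    """Normalize case and whitespace before fanning the query out."""
    return " ".join(query.lower().split())


def federated_search(query: str, indexers: list[Indexer], k: int = 60) -> list[str]:
    """Merge ranked lists from several indexers with reciprocal rank fusion."""
    q = preprocess(query)
    fused: dict[str, float] = defaultdict(float)
    for indexer in indexers:
        for rank, (url, _score) in enumerate(indexer(q), start=1):
            # Raw scores from different indexers are on different scales,
            # so only the rank position contributes to the fused score.
            fused[url] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by URL for determinism.
    return sorted(fused, key=lambda u: (-fused[u], u))


def indexer_a(q: str) -> list[tuple[str, float]]:
    return [("https://example.com/a", 0.9), ("https://example.com/b", 0.5)]


def indexer_b(q: str) -> list[tuple[str, float]]:
    return [("https://example.com/b", 12.0), ("https://example.com/c", 7.0)]


results = federated_search("  Best   Pizza ", [indexer_a, indexer_b])
print(results)
```

A page returned by both indexers rises to the top even though neither ranked it first, which is the kind of cross-indexer signal a frontend could use to compete on result quality.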

With this system there will be competition at every level. And a startup could conceivably build a node in this infrastructure without tackling the entire thing at the same time (like you have to do right now).

You think website X has a bad interface? You can build your own interface and buy information from existing indexers (which you choose).

You think you can do a better job at ranking results? You can build your own indexer and sell your results to pre-existing websites.

You think some niche area of the web is ignored? You can build your own crawler and sell your results to one or several indexers.

If someone does a bad job, the next level of the system has incentives (and freedom) to drop them. This includes the consumer of search results.

MadWombat · on Feb 5, 2020

> In today's internet this makes so much more sense than everything being run by the same company

I have expressed similar ideas in other discussions here on HN, so I agree with you that it would be really neat to have an ecosystem like this. But the question is how do you make money? What is your incentive to run a crawler? What is your incentive to run an indexer? People are used to the idea that web search is free, so search sites cannot charge people money. It also means that they cannot afford to buy indexer services.

dcolkitt · on Feb 5, 2020

I think it could work with microtransactions on the backend. On the B2C internet, consumers won't pay for content, but in B2B it's the norm. Paying for an indexer or crawler is really no different from the umpteen million AWS transactions that occur every day.

Most consumers probably won't want to pay for search. At least initially until the culture changes. But that's fine, they don't have to access the crawler or indexers directly. Instead they can point their browser at the frontend URL of their choice. And that frontend can monetize by injecting ads or whatever other business model they want.

As long as they earn more in ads than they pay on the open market for backend crawlers, indexers, etc., then they're a viable business. Given how ridiculously profitable Google is, this shouldn't be a difficult hurdle to clear.

MadWombat · on Feb 5, 2020

The requirements in infrastructure are not symmetrical here. An indexer requires a significantly bigger investment in both storage and computational power than either a crawler or an end-user site. Seems like everyone would want to run a crawler, but nobody would want to run an indexer.

Also, it is unclear how to start something like this. A sort of a chicken and egg problem where you cannot start running any of these components because there is nobody running the other parts.

drmabus · on Feb 6, 2020

the egg came first.

on a serious note, yacy is good enough to crawl a bunch of sites you like and get much superior search results, albeit limited to your own interests. google is complete garbage. especially nowadays google just returns unrelated cnn pages. try searching for corona virus map: I get tons of news articles talking about the same page, but the page itself is not in the results. whatever you add to the query, it keeps returning the same news articles from the same sources. if you populate yacy with pages about the topic you will get real results.

drivebycomment · on Feb 6, 2020

I just tried "coronavirus map" and it showed maps or an article with map from WP, CNN, NYT. And one from CDC. That looks reasonable to me.

Edit: checking the result again, the first page also has a europe.eu page with a map, and another from arcgis.com with a map. So most of the first page results seem highly relevant and good quality.

MadWombat · on Feb 6, 2020

So far, Google search results are far better than everything else I tried. Granted, I haven't tried yacy, but I doubt it would have the type of data aggregation that makes Google results what they are.

EGreg · on Feb 5, 2020

Something like this maybe?

https://qbix.com/token

iovrthoughtthis · on Feb 5, 2020

What about open, distributed indexes instead of companies?

Sites that wished to appear in searches would have an incentive to get added to an index. That could create a market for compute to calculate and update indexes.

notduncansmith · on Feb 5, 2020

It’s also difficult to trust decentralized crawling infrastructure. How do you know that crawlers are reporting honestly?

MadWombat · on Feb 5, 2020

Hopefully this could be market regulated, if a particular crawler gets caught making things up or doing otherwise a bad job crawling, indexers could simply stop using it in favor of a different crawler.

smoyer · on Feb 5, 2020

yacy has support for additional crawlers.

But I was thinking of something more radical ... let the indexes be peer-to-peer with an IPFS storage back-end, but build at least the generic crawler as a browser plugin that also performs searches against the index. For "normal" web pages, each visit results in a reindexing. Of course you have to solve the chicken and egg problem of pages that aren't visited because nobody has indexed them so you don't find them in search.

iovrthoughtthis · on Feb 5, 2020

Allow people to submit their sites to the index. People who want to be found have an incentive to do the work.

Polylactic_acid · on Feb 6, 2020

People already do this. You can submit your sitemap to Bing and Google. But I doubt you will get anyone to bother with a small player or P2P service.

bawolff · on Feb 5, 2020

I feel like there are two possibilities:

A small number of indexes (like HotBot before Google ruled the internet). I'm not sure an oligopoly is that much better than a monopoly.

Or you have a lot of indexes. But then the system scales really poorly. All the indexes have to handle a large number of queries and be hugely powerful, and all the clients have to spend a ton of time querying and aggregating them.

bobajeff · on Feb 5, 2020

> Separate crawlers, indexers and search websites

How would all of those be funded? Also, wouldn't separate indexes be zero-sum just like search engines are?

tobylane · on Feb 6, 2020

The winner will be the vertical integrator, not the best quality. People will pick the quantitative measure (speed) over anything qualitative.

6510 · on Feb 5, 2020

YaCy does seem to already have topical crawlers: they are just users who crawl topics they find interesting. You get not just "the" website but [ideally] every small site they could find.

OJFord · on Feb 6, 2020

Yes!

And of course a big 'enough' publisher like Medium would be incentivised to provide its own crawler implementing the standard interface, for SEO.

kleer001 · on Feb 5, 2020

Now THAT sounds like a fertile ecosystem.

dang · on Feb 5, 2020

A thread from 2016: https://news.ycombinator.com/item?id=12433010

2014: https://news.ycombinator.com/item?id=8746883

2011: https://news.ycombinator.com/item?id=3288586

(Reposts are fine after a year or so; links like this are just to supply more reading for the curious.)

ivarv · on Feb 5, 2020

I think the value of this kind of project isn't necessarily to use a general public instance, but instead to train your own instance. e.g. Imagine collecting your search history into a search corpus. Once you're set up, then running your search queries across not only your node but across a trusted network of peers as well. Yes, you'd be in a filter bubble, but the results would be highly curated.

comboy · on Feb 5, 2020

That is the only way to decentralize given the reality of spam and incentives to manipulate results. "trusted network of peers" is the key here. In my opinion, in order for it to work, trust should be weighted. And weights should be update-able based on results you want or don't want to see.

Also, if we got this decentralized web of trust with weights, it would be like the Internet reinvented. Thanks to the small-world effect you would have access to reviews you can actually trust, your personal trust in an eBay seller, a potential business partner or any politician...

When I say that is the only way, it is like a fact to me. I've spent a long time thinking about it and I'd be delighted to be proven wrong.
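The weighted, update-able trust idea above could be sketched roughly as follows. This is only an illustration under made-up assumptions: the peer names, the 0-to-1 trust scale, the summed-trust ranking and the learning rate are all hypothetical choices, not a description of any existing system.

```python
def rank_with_trust(peer_results, trust):
    """Score each URL by the summed trust of the peers that returned it."""
    scores = {}
    for peer, urls in peer_results.items():
        for url in urls:
            scores[url] = scores.get(url, 0.0) + trust.get(peer, 0.0)
    # Highest score first; ties broken by URL for determinism.
    return sorted(scores, key=lambda u: (-scores[u], u))


def update_trust(trust, peer_results, url, liked, rate=0.2):
    """Move the weight of every peer that returned `url` toward 1 or 0."""
    target = 1.0 if liked else 0.0
    updated = dict(trust)
    for peer, urls in peer_results.items():
        if url in urls:
            current = trust.get(peer, 0.0)
            updated[peer] = current + rate * (target - current)
    return updated


# Hypothetical peers and results.
trust = {"alice": 0.5, "bob": 0.5}
peer_results = {
    "alice": ["https://example.org/guide"],
    "bob": ["https://example.org/spam"],
}

# The user marks bob's result as unwanted, so bob's weight drops.
trust = update_trust(trust, peer_results, "https://example.org/spam", liked=False)
ranking = rank_with_trust(peer_results, trust)
print(trust, ranking)
```

Each user keeps their own `trust` table, so a spammer has to win over individual users rather than game one global ranking.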

dana321 · on Feb 5, 2020

I tried this extensively around a year ago. Unfortunately it's written in Java and suffers from GC performance issues when you start pushing it hard: the whole system slows to a crawl and the performance is erratic.

I had to uninstall it as it was killing my 8-core 5960x server.

The results can be ok if you push it towards the sites you enjoy reading, but apart from that..

Polylactic_acid · on Feb 6, 2020

I found the results to be utterly useless. I wasn't able to perform a single search that came back with sane results. I searched "youtube" and the first few pages were random blogs. I couldn't find "youtube.com" anywhere in the results.

numpad0 · on Feb 6, 2020

UX wise it’s also n-gram with n>2 and only works well with Indo-European languages.

oever · on Feb 5, 2020

Distributed search is great to make searching less dependent on a few big players. But if many more bots start crawling your site, the load and waste of energy can get very high.

Sitemaps help a bit, but a site still needs to be crawled completely initially, and changed pages need to be retrieved again.

It'd be more efficient if there were a standard by which websites could create and publish their own index. Web search engines could collate the pages and add page ranks.

The correctness of the index can be tested by randomly sampling a few pages with a bias to the highest rated pages.
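A rough sketch of that spot check, assuming a site publishes a mapping from URL to index terms plus a rank per page. The function names, the data shapes and the `fetch_terms` callback (standing in for a fresh crawl of one page) are hypothetical. The sampling uses the Efraimidis-Spirakis method for weighted sampling without replacement, which is one way to get the bias toward highly rated pages.

```python
import random


def sample_pages_to_verify(page_ranks, n, seed=None):
    """Pick up to n distinct pages, favoring higher-ranked ones.

    Each page gets the key u ** (1 / rank) with u uniform in [0, 1);
    taking the n largest keys is a weighted sample without replacement.
    """
    rng = random.Random(seed)
    keyed = []
    for url, rank in page_ranks.items():
        if rank <= 0:
            continue  # pages with no weight are never sampled
        keyed.append((rng.random() ** (1.0 / rank), url))
    keyed.sort(reverse=True)
    return [url for _, url in keyed[:n]]


def index_error_rate(published, fetch_terms, sample):
    """Fraction of sampled pages whose published terms differ from a re-crawl."""
    if not sample:
        return 0.0
    mismatches = sum(1 for url in sample if published.get(url) != fetch_terms(url))
    return mismatches / len(sample)


# Hypothetical self-published index and ranks.
page_ranks = {"/home": 10.0, "/blog": 1.0, "/archive": 0.1}
published = {"/home": {"search"}, "/blog": {"p2p"}, "/archive": {"old"}}

sample = sample_pages_to_verify(page_ranks, 2, seed=1)
rate = index_error_rate(published, lambda url: published[url], sample)
print(sample, rate)
```

A search engine could then discount or drop a site's self-published index once the error rate passes some threshold, instead of crawling the whole site.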

haolez · on Feb 5, 2020

What's the opt-in censorship story with YaCy? I don't want to accidentally find gore or dark stuff in general.

bawolff · on Feb 5, 2020

I wish there were more technical details or a whitepaper. All I can see is some handwaving about DHT.

But full text search is not really naturally applicable to a DHT approach. So what exactly do they do? How does what they do scale? How is work broken up between dht nodes (what exactly is being used as the hash key in the distributed hash table)? How does it deal with malicious nodes (SEO is a big thing in search)? Is the division of resources between dht nodes "fair"? How are results ranked?
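For context on the hash-key question, here is a generic sketch of one common way to put an inverted index on a DHT: hash each term and store its posting list on the node responsible for that hash. This is an assumption-laden illustration, not a description of what YaCy actually does; the `TermDHT` class, node names and URLs are all hypothetical, and everything runs in one process.

```python
import hashlib
from bisect import bisect_left


def h(value: str) -> int:
    """Map a string to a 64-bit position on the hash ring."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:8], "big")


class TermDHT:
    """Toy in-process DHT: each term's posting list lives on one node."""

    def __init__(self, node_ids):
        self.ring = sorted((h(n), n) for n in node_ids)
        self.store = {n: {} for n in node_ids}

    def node_for(self, term):
        # The first node clockwise from the term's hash owns the term.
        keys = [k for k, _ in self.ring]
        i = bisect_left(keys, h(term)) % len(self.ring)
        return self.ring[i][1]

    def publish(self, url, text):
        for term in set(text.lower().split()):
            postings = self.store[self.node_for(term)].setdefault(term, set())
            postings.add(url)

    def search(self, query):
        # One lookup per query term, then intersect the posting lists.
        postings = [
            self.store[self.node_for(t)].get(t, set())
            for t in query.lower().split()
        ]
        return set.intersection(*postings) if postings else set()


dht = TermDHT(["node-1", "node-2", "node-3"])
dht.publish("https://example.com/1", "decentralized web search")
dht.publish("https://example.com/2", "web crawler")
print(dht.search("web search"))
```

Even this toy version shows why the fit is awkward: a multi-term query needs one network lookup per term and ships whole posting lists around, popular terms concentrate load on a single node, and nothing stops a malicious node from returning bogus postings.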

slenk · on Feb 5, 2020

I wonder how it differs from Searx. I have been running a Searx instance for myself for a few months now and I love it.

nindwen · on Feb 5, 2020

Searx aggregates different search engines. YaCy is a peer-to-peer engine that can run its own crawlers to index content, and query other peers for their content. Looks like Searx has YaCy integration available, so you could run a local YaCy node and get its results added to your Searx results.

slenk · on Feb 6, 2020

Good to know. I might have to do that. Thank you.

tsukurimashou · on Feb 5, 2020

interesting, has anybody here tried to use it? I'd like to hear some user experience!

ravenstine · on Feb 5, 2020

I tried using it seriously 5 years ago. Things might have changed since then, though. All I remember is that it was tremendously slow and the results were pretty much garbage. It's an interesting project, nonetheless.

EDIT: They now have an online demo that doesn't require any installation. After waiting about a minute for peers to connect, queries ran relatively "fast" each time. That's an improvement. But the results are still junk, IMO.

https://yacy.searchlab.eu/Status.html

Kaiyou · on Feb 5, 2020

I gave it a try in 2017. I could live with it being really slow, but it didn't have any useful results for anything, so I moved on. Really like the idea, though.

tombert · on Feb 5, 2020

Had a similar experience back in 2015 or so; I think that the idea of a decentralized web-scraper actually makes a lot of sense, but it just didn't seem to give me a result that was useful, and actually seemed to give a lot of porn results for perfectly innocuous searches. I ended up going back to DDG.

mippenhappen · on Feb 5, 2020

I played around with that years ago. Was absolutely useless at the time. Honestly doubt it has gotten much better.

DethNinja · on Feb 6, 2020

What if some bad actor decides to harm the network by listing illegal links? How does the network secure itself against that?

Polylactic_acid · on Feb 6, 2020

Other networks have lists of banned hashes. You can just hash the banned domains/links and users can subscribe to ban lists. You could even distribute them in plain text tbh. The issue is not about having the links, but accidentally clicking on them.
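A minimal sketch of such a subscribed ban list, assuming the list is a set of SHA-256 hashes of lowercased hostnames. The function names, the hash choice and the example domains are hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def domain_hash(url_or_domain: str) -> str:
    """SHA-256 hex digest of the lowercased hostname."""
    if "://" in url_or_domain:
        host = urlsplit(url_or_domain).hostname or ""
    else:
        host = url_or_domain
    host = host.lower().rstrip(".")
    return hashlib.sha256(host.encode("utf-8")).hexdigest()


def filter_results(urls, banned_hashes):
    """Drop every result whose domain hash appears on the ban list."""
    return [u for u in urls if domain_hash(u) not in banned_hashes]


# A ban list a user might subscribe to (hypothetical entry).
banned = {domain_hash("bad.example")}

results = filter_results(
    ["https://bad.example/page", "https://good.example/"],
    banned,
)
print(results)
```

Hashing mostly keeps the list from being readable at a glance; since domain names are easy to guess and hash, it should not be treated as real secrecy, which fits the point that plain text would work too.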

flokie · on Feb 6, 2020

Query "best pizza place"

Result "Hardcore teen sex"

Why is this trending?

marknadal · on Feb 6, 2020

YaCy has been around for a long time, no?

What are other projects in this space?

I know Martti Malmi (Satoshi's 1st Bitcoin contributor) is doing some stuff (no tokens) in this space.

Does anybody else know of, or have links/talks/articles on, other (no blockchain) decentralized search?