
YaCy: Decentralized Web Search - okasaki
https://yacy.net
======
gambler
P2P is probably the best solution for search in the long run, but here is how
it can be decentralized in the interim.

Separate crawlers, indexers and search websites. One search website could go
to several indexers. One indexer could get info from several crawlers. In
today's internet this makes so much more sense than everything being run by
the same company.

For example, you could have a crawler that specializes in crawling
Medium.com. You could have another one that specializes in searching just
computer game websites. The data from both are aggregated by indexer A and
indexer B. Indexers would rank content using different algorithms.

When a user opens some search website and types a query, the website pre-
processes the query and sends it to both indexers. Both return the best
results they can. The website then aggregates them into a single page and
shows it to the user.
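
The fan-out step described above is simple to sketch. Here the indexers are plain in-memory functions standing in for remote services (the names, URLs, and scores are all made up for illustration); a real frontend would query them over HTTP:

```python
# Toy frontend: fan a query out to several independent indexers and
# merge their ranked results into a single list.

def indexer_a(query):
    # Hypothetical indexer with one ranking algorithm.
    return [("https://medium.example/post", 0.9),
            ("https://games.example/review", 0.4)]

def indexer_b(query):
    # Hypothetical indexer with a different ranking algorithm.
    return [("https://games.example/review", 0.8),
            ("https://blog.example/entry", 0.5)]

def federated_search(query, indexers):
    # Sum the scores each indexer assigns to a URL, then rank the union.
    scores = {}
    for indexer in indexers:
        for url, score in indexer(query):
            scores[url] = scores.get(url, 0.0) + score
    return sorted(scores, key=scores.get, reverse=True)

results = federated_search("indie games", [indexer_a, indexer_b])
# games.example ranks first because both indexers returned it.
```

Score summing is just one possible merge policy; a frontend could equally interleave ranks or apply its own re-ranking model.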

With this system there will be competition at every level. And a startup could
conceivably build a node in this infrastructure without tackling the entire
thing at the same time (like you have to do right now).

You think website X has bad interface? You can build your own interface and
buy information from existing indexers (which _you_ chose).

You think you can do a better job at ranking results? You can build your own
indexer and sell your results to pre-existing websites.

You think some niche area of the web is ignored? You can build your own
crawler and sell your results to one or several indexers.

If someone does a bad job, the next level of the system has incentives (and
freedom) to drop them. This includes the consumer of search results.

~~~
MadWombat
> In today's internet this makes so much more sense than everything being run
> by the same company

I have expressed similar ideas in other discussions here on HN, so I agree
with you that it would be really neat to have an ecosystem like this. But
the question is how do you make money? What is your incentive to run a
crawler? What is your incentive to run an indexer? People are used to the idea
that web search is free, so your search sites cannot charge people money. It
also means that they cannot buy indexer services.

~~~
dcolkitt
I think it could work with micro-transactions on the backend. In the B2C
Internet, consumers won't pay for content. But in B2B it's the norm. Paying
for an indexer or crawler is really no different than the umpteen million AWS
transactions that occur every day.

Most consumers probably won't want to pay for search, at least until the
culture changes. But that's fine: they don't have to access the crawlers or
indexers directly. Instead they can point their browser at the frontend URL of
their choice, and that frontend can monetize by injecting ads or whatever
other business model they want.

As long as they earn more in ads than they pay on the open market for backend
crawlers, indexers, etc., then they're a viable business. Given how
ridiculously profitable Google is, this shouldn't be a difficult hurdle to
clear.

~~~
MadWombat
The infrastructure requirements are not symmetrical here. An indexer
requires a significantly bigger investment in both storage and computational
power than either a crawler or an end-user site. Seems like everyone would
want to run a crawler, but nobody would want to run an indexer.

Also, it is unclear how to start something like this. A sort of
chicken-and-egg problem where you cannot start running any of these components
there is nobody running the other parts.

~~~
drmabus
the egg came first.

On a serious note, YaCy is good enough to crawl a bunch of sites you like and
get much superior search results, albeit limited to your own interests. Google
is complete garbage. Especially nowadays, Google just returns unrelated CNN
pages. Try searching for a coronavirus map: I get tons of news articles
talking about the same page, but the page itself is not in the results.
Whatever you add to the query, it keeps returning the same news articles from
the same sources. If you populate YaCy with pages about the topic, you will
get real results.

~~~
drivebycomment
I just tried "coronavirus map" and it showed maps or articles with maps from
WP, CNN, and the NYT. And one from the CDC. That looks reasonable to me.

Edit: checking the result again, the first page also has a europe.eu page with
a map, and another from arcgis.com with a map. So most of the first page
results seem highly relevant and good quality.

------
dang
A thread from 2016:
[https://news.ycombinator.com/item?id=12433010](https://news.ycombinator.com/item?id=12433010)

2014:
[https://news.ycombinator.com/item?id=8746883](https://news.ycombinator.com/item?id=8746883)

2011:
[https://news.ycombinator.com/item?id=3288586](https://news.ycombinator.com/item?id=3288586)

(Reposts are fine after a year or so; links like this are just to supply more
reading for the curious.)

------
ivarv
I think the value of this kind of project isn't necessarily in using a general
public instance, but in training your own instance. E.g. imagine collecting
your search history into a search corpus. Once you're set up, you run your
search queries not only against your own node but across a trusted network of
peers as well. Yes, you'd be in a filter bubble, but the results would be
highly curated.

~~~
comboy
That is the only way to decentralize, given the reality of spam and the
incentives to manipulate results. "Trusted network of peers" is the key here.
In my opinion, for it to work, trust should be weighted, and the weights
should be updatable based on the results you want or don't want to see.
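
A minimal sketch of what weighted, updatable trust could look like (the peer names and the update rule are invented for illustration): each peer's rating of a result is scaled by your trust in that peer, and the weight moves when a peer's recommendations turn out good or bad.

```python
# Per-peer trust weights, adjusted from user feedback.
trust = {"alice": 1.0, "bob": 1.0}

def score(result, votes):
    # votes: peer -> rating in [-1, 1] for this result.
    return sum(trust[peer] * rating for peer, rating in votes.items())

def feedback(peer, liked, step=0.2):
    # Nudge a peer's weight after the user judges a result it endorsed.
    trust[peer] = max(0.0, trust[peer] + (step if liked else -step))

votes = {"alice": 1.0, "bob": -0.5}
before = score("some-url", votes)   # 1.0*1.0 + 1.0*(-0.5) = 0.5
feedback("bob", liked=False)        # bob's bad call lowers his weight
after = score("some-url", votes)    # 1.0*1.0 + 0.8*(-0.5) ≈ 0.6
```

In a real network the weights would also propagate transitively (trusting the peers your trusted peers trust, with decay), which is where the small-world effect comes in.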

Also, if we got this decentralized web of trust with weights, it would be like
the Internet reinvented. Thanks to the small-world effect, you'd have access
to reviews you can actually trust: your personal trust in an eBay seller, a
potential business partner, or any politician...

When I say that is the only way, it is like a fact to me. I've spent a long
time thinking about it and I'd be delighted to be proven wrong.

------
dana321
I tried this extensively about a year ago. Unfortunately it's written in Java
and suffers GC performance issues when you start pushing it hard: the whole
system crawls and the performance becomes erratic.

I had to uninstall it as it was killing my 8-core 5960X server.

The results can be ok if you push it towards the sites you enjoy reading, but
apart from that..

~~~
Polylactic_acid
I found the results to be utterly useless. I wasn't able to perform a single
search that came back with sane results. I searched "youtube" and the first
few pages were random blogs; I couldn't find "youtube.com" anywhere in the
results.

------
oever
Distributed search is great to make searching less dependent on a few big
players. But if many more bots start crawling your site, the load and waste of
energy can get very high.

Sitemaps help a bit, but a site still needs to be crawled completely at first,
and changed pages need to be retrieved.

It'd be more efficient if there were a standard by which websites could create
and publish their own index. Web search engines could collate the pages and
add page ranks.

The correctness of the index could be tested by randomly sampling a few pages,
with a bias toward the highest-ranked pages.
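
That spot check is easy to sketch. Here a site publishes a term -> pages index, and a verifier samples a few (term, page) claims and confirms the term really occurs on the fetched page (the data and sampling policy are made up; as suggested above, a real engine would bias the sample toward the highest-ranked pages):

```python
import random

# An index as a site might publish it: term -> pages containing it.
published_index = {
    "yacy": ["/about", "/download"],
    "p2p":  ["/about"],
}

# Page contents as the verifier would fetch them.
pages = {
    "/about":    "yacy is a p2p search engine",
    "/download": "get yacy here",
}

def spot_check(index, fetch, samples=3, rng=random):
    # Pick a few random claims and confirm each term appears on its page.
    # Returns the fraction of sampled claims that held up.
    claims = [(t, p) for t, plist in index.items() for p in plist]
    picked = [rng.choice(claims) for _ in range(samples)]
    ok = sum(1 for term, page in picked if term in fetch(page))
    return ok / samples

rate = spot_check(published_index, pages.get)  # 1.0: every claim holds
```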

------
haolez
What's the opt-in censorship story with YaCy? I don't want to accidentally
find gore or dark stuff in general.

------
bawolff
I wish there were more technical details or a whitepaper. All I can see is
some handwaving about a DHT.

But full-text search does not map naturally onto a DHT. So what exactly do
they do? How does what they do scale? How is work divided between DHT nodes
(what exactly is being used as the hash key in the distributed _hash_ table)?
How does it deal with malicious nodes (SEO is a big thing in search)? Is the
division of resources between DHT nodes "fair"? How are results ranked?
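
For what it's worth, the textbook way to put full-text search on a DHT, and, as far as I can tell, roughly what YaCy does, is to use the hash of each *word* as the DHT key, so each node stores the posting lists for a slice of the vocabulary; a multi-word query fans out to the owning nodes and intersects the postings. A toy in-memory sketch (the node layout and data are invented):

```python
import hashlib

NODES = ["node0", "node1", "node2"]

def owner(word):
    # Hash the word and map it onto the node ring.
    h = int(hashlib.sha256(word.encode()).hexdigest(), 16)
    return NODES[h % len(NODES)]

# Each node holds word -> set of URLs, for the words it owns.
store = {n: {} for n in NODES}

def index(word, url):
    store[owner(word)].setdefault(word, set()).add(url)

def search(words):
    # Fetch each word's postings from its owning node and intersect.
    postings = [store[owner(w)].get(w, set()) for w in words]
    return set.intersection(*postings) if postings else set()

index("decentralized", "https://yacy.net")
index("search", "https://yacy.net")
index("search", "https://example.org")
```

This still leaves the questions above open: a malicious node can fabricate postings for the words it owns, and hot words make the load division unfair, which is exactly why a proper whitepaper would help.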

------
slenk
I wonder how it differs from Searx. I have been running a Searx instance for
myself for a few months now and I love it.

~~~
nindwen
Searx aggregates different search engines. YaCy is a peer-to-peer engine that
can run its own crawlers to index content and query other peers for their
content. It looks like Searx has YaCy integration available, so you could run
a local YaCy node and get its results added to your Searx results.

~~~
slenk
Good to know. I might have to do that. Thank you.

------
tsukurimashou
Interesting, has anybody here tried it? I'd like to hear about the user
experience!

~~~
Kaiyou
I gave it a try in 2017. I could live with it being really slow, but it didn't
have any useful results for anything, so I moved on. Really like the idea,
though.

~~~
tombert
Had a similar experience back in 2015 or so; I think that the idea of a
decentralized web-scraper actually makes a lot of sense, but it just didn't
seem to give me a result that was useful, and actually seemed to give a lot of
porn results for perfectly innocuous searches. I ended up going back to DDG.

------
DethNinja
What if some bad actor decides to harm the network by listing illegal links?
How does the network secure itself against that?

~~~
Polylactic_acid
Other networks have lists of banned hashes. You can just hash the banned
domains/links and users can subscribe to ban lists. You could even distribute
them in plain text tbh. The issue is not about having the links, but
accidentally clicking on them.
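
A sketch of such a ban list in the hashed form (the domain names and list contents are made up): publish SHA-256 hashes of banned hosts, and have clients filter out results whose hostname hashes into any subscribed list.

```python
import hashlib
from urllib.parse import urlparse

def h(domain):
    return hashlib.sha256(domain.encode()).hexdigest()

# A subscribed ban list: hashes only, so the list itself doesn't
# spell the banned domains out (plain text would work too).
banned = {h("bad.example"), h("worse.example")}

def filter_results(urls, banlist):
    # Drop any result whose host hashes into the ban list.
    return [u for u in urls if h(urlparse(u).hostname) not in banlist]

clean = filter_results(
    ["https://ok.example/page", "https://bad.example/thing"], banned)
# clean keeps only the ok.example result
```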

------
flokie
Query "best pizza place"

Result "Hardcore teen sex"

Why is this trending?

------
marknadal
YaCy has been around for a long time, no?

What are other projects in this space?

I know Martti Malmi (Satoshi's first Bitcoin contributor) is doing some stuff
(no tokens) in this space.

Anybody else know of / have links / talks / articles on other (non-blockchain)
decentralized search?

