
Ask HN: Can we create a new internet where search engines are irrelevant? - subhrm
If we were to design a brand new internet for today's world, can we develop it in such a way that:

1- Finding information is trivial

2- You don't need services indexing billions of pages to find any relevant document

In our current internet, we need a big brother like Google or Bing to effectively find any relevant information, in exchange for sharing with them our search history, browsing habits etc. Can we design a hypothetical alternate internet where search engines are not required?
======
adrianmonk
I think it would be helpful here to distinguish between two separate search
engine concepts: indexing and ranking.

Indexing isn't the source of problems. You can index in an objective manner. A
new architecture for the web doesn't need to eliminate indexing.

Ranking is where it gets controversial. When you rank, you pick winners and
losers. Hopefully based on some useful metric, but the devil is in the details
on that.

 _The thing is, I don't think you can eliminate ranking._ Whatever kind of
site(s) you're seeking, you are starting with some information that identifies
the set of sites that might be what you're looking for. That set might contain
10,000 sites, so you need a way to push the "best" ones to the top of the
list.

Even if you go with a different model than keywords, you still need ranking.
Suppose you create a browsable hierarchy of categories instead. Within each
category, there are still going to be multiple sites.

So it seems to me the key issue isn't ranking and indexing, it's who controls
the ranking and how it's defined. Any improved system is going to need an
answer for how to do it.

~~~
mavsman
How about open sourcing the ranking and then allowing people to customize it.
I should be able to rank my own search results how I want to without much
technical knowledge.

I want to rank my results by what is most popular to my friends (Facebook or
otherwise) so I just look for a search engine extension that allows me to do
that. This could get complex but can also be simple if novices just use the
most popular ranking algorithms.
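
To make that concrete, here is a minimal sketch of what a user-swappable ranker could look like. Everything here is hypothetical (the result type and the "friend upvotes" source are invented for illustration), just to show that the ranking step can be a pluggable function on top of a shared index:

    # Sketch of a user-chosen ranking step layered on a fixed, neutral index.
    # SearchResult and friend_upvotes are made-up names, not a real API.
    from dataclasses import dataclass

    @dataclass
    class SearchResult:
        url: str
        base_score: float  # neutral relevance score coming from the index

    def rank_by_friends(results, friend_upvotes):
        """Boost results my friends have upvoted; fall back to the index score."""
        return sorted(
            results,
            key=lambda r: (friend_upvotes.get(r.url, 0), r.base_score),
            reverse=True,
        )

    # A novice keeps the default ranking; I plug in my own.
    results = [SearchResult("a.example", 0.7), SearchResult("b.example", 0.9)]
    print(rank_by_friends(results, friend_upvotes={"a.example": 3}))

The point is only that the extension swaps the ranking function; the index underneath stays shared.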

~~~
aleppe7766
That would lead to an even bigger filter-bubble problem: a techno élite
capable, willing and knowledgeable enough to go through the hassle, and
everyone else navigating an indexed mess that paves the way for all sorts of
new gatekeepers drawn from that same élite. It's not a simple issue to tackle;
perhaps public scrutiny of the ranking algorithms would be a good first step.

~~~
samirm
I disagree. The people who don't know anything and are unwilling to learn
wouldn't be any worse off than they are today and everyone else would benefit
from an open source "marketplace" of possible ranking algorithms that the so
called "techno elite" have developed.

~~~
aleppe7766
I think the proposed improvement to the web should, by intention, mostly
benefit the "ignorant", not those who can already navigate the biases of
today's technological gatekeepers. Please note, the ignorant are not at fault
for being so, especially when governments cut funds for public education and
media leverage (and multiply) ignorance to produce needs and sales, fears and
votes. Any solution must work first to make the weak stronger and more
conscious. A better and less biased web can help people grow their unbiased
knowledge, and therefore exercise their right to vote with a deeper
understanding of the complexity. Ignorant voters are an opportunity for
ill-intentioned politicians, as much as they are a problem for me, you and the
whole country.

------
iblaine
Yes, it was called Yahoo and it did a good job of cataloging the internet when
hundreds of sites were added per week:
[https://web.archive.org/web/19961227005023/http://www2.yahoo...](https://web.archive.org/web/19961227005023/http://www2.yahoo.com/)

I'm old enough to remember sorting sites by new to see what new URLs were
being created, and getting to the bottom of that list within a few minutes.
Google and search engines were a natural response to that problem as the
number of sites added to the internet grew exponentially...meaning we need
search.

~~~
kickscondor
Directories are still useful - Archive of Our Own
([https://archiveofourown.org/](https://archiveofourown.org/)) is a large
example for fan fiction, Wikipedia has a full directory
([https://en.wikipedia.org/wiki/Category:Main_topic_classifica...](https://en.wikipedia.org/wiki/Category:Main_topic_classifications)),
Reddit wikis perform this function, Awesome directories
([https://github.com/sindresorhus/awesome](https://github.com/sindresorhus/awesome))
or personal directories like mine at href.cool.

The Web is too big for a single large directory - but a network of small
directories seems promising. (Supported by link-sharing sites like Pinboard
and HN.)

~~~
ninju
How about this

[https://en.wikipedia.org/wiki/List_of_lists_of_lists](https://en.wikipedia.org/wiki/List_of_lists_of_lists)

~~~
kickscondor
Yes! But, of course, for directories outside of Wikipedia. This is very
interesting for its classification structure. It's so typical of Wikipedia
that a 'master list of lists' (by my count, there are 589 list links on this
page) contains lists such as "Lists of Melrose Place episodes" and "Lists of
Middle-earth articles" alongside lists such as "lists of wars" or "lists of
banks".

------
ovi256
Everyone has missed the most important aspect of search engines, from the
point of view of their core function of information retrieval: they're the
internet equivalent of a library index.

Either you find a way to make information findable in a library without an
index (how?!?) or you find a novel way to make a neutral search engine - one
that provides as much value as Google but whose costs are paid in a different
way, so that it does not have Google's incentives.

~~~
davemp
The problem is that current search engines are indexing what is essentially a
stack of random books thrown together by anonymous library-goers. Before being
able to guide readers to books, librarians have to do the following non-trivial
tasks over the entire collection:

\- identify the book's theme

\- measure the quality of the information

\- determine authenticity / malicious content

\- remember the position of the book in the colossal stacks

Then the librarian can start to refer people to books. This problem was
actually present in libraries before the revolutionary Dewey Decimal System
[1]. Libraries found that the disorganization caused too much reliance on
librarians and made it hard to train replacements if anything happened.

The Internet just solved the problem by building a better librarian rather
than building a better library. Personally I welcome any attempts to build a
more organized internet. I don't think the communal book pile approach is
scaling very well.

[1]:
[https://en.wikipedia.org/wiki/Dewey_Decimal_Classification](https://en.wikipedia.org/wiki/Dewey_Decimal_Classification)

~~~
jasode
_> I welcome any attempts to build a more organized internet. I don't think
the communal book pile approach is scaling very well._

Let me know if I misunderstand your comment but to me, this has already been
tried.

Yahoo's founders originally tried to "organize" the internet like a good
librarian. Yahoo in 1994 was originally called _"Jerry and David's Guide to
the World Wide Web"_ [0], with hierarchical directories of curated links.

However, Jerry & David noticed that Google's search results were more useful
to web surfers and Yahoo was losing traffic. Therefore, in 2000 they _licensed
Google's search engine_. Google's approach was _more scalable_ than Yahoo's.

I often see suggestions that the alternative to Google is curated directories,
but I can't tell whether people are unaware of the early internet's history
and don't know that such an idea was already tried and ultimately failed.

[0]
[http://static3.businessinsider.com/image/57977a3188e4a714088...](http://static3.businessinsider.com/image/57977a3188e4a714088bac98/jerry-
yang-and-david-filo-created-yahoo-back-in-january-1994-while-studying-at-
stanford-university-they-renamed-it-yahoo-in-march.jpg)

~~~
organsnyder
I remember trying to get one of my company's sites listed on Yahoo! back in
the late 1990s. Despite us being an established company (founded in 1985) with
a good domain name (cardgames.com) and a bunch of good, free content (rules
for various card games, links to various places to play those games online,
etc.), it took _months_.

~~~
dsparkman
That was not a bad thing. It was curated. Most of the crap never made it in
the directory precisely because humans made decisions about what got in. If
you wanted in the directory faster, you could pay a fee to get to the front of
the queue. The result is that Yahoo could hire people to process the queue and
make money without ads.

~~~
organsnyder
Isn't paying money to jump to the front of the queue just another form of
advertising?

------
neoteo
I think Apple's current approach, where all the smarts (Machine Learning,
Differential Privacy, Secure Enclave, etc.) reside on your device, not in the
cloud, is the most promising. As imagined in so much sci-fi (e.g. the Hosaka in
Neuromancer) you build a relationship with your device which gets to know you,
your habits and, most importantly in regard to search, what you mean when you
search for something and what results are most likely to be relevant to you.
An on-device search agent could potentially be the best solution because this
very personal and, crucially, private device will know much more about you
than you are (or should be) willing to forfeit to the cloud providers whose
business is, ultimately, to make money off your data.

~~~
jasode
_> , where all the smarts [...] reside on your device, not in the cloud, is
the most promising. [...] An on-device search agent could potentially be the
best solution [...]_

Maybe I misunderstand your proposal but to me, this is not technically
possible. We can think of a modern search engine as a process that reduces a
raw dataset of _exabytes_ [0] into a comprehensible result of ~5000 bytes
(i.e. ~5k being the 1st page of search result rendered as HTML.)

Yes, one can take a version of the movies & tv data on IMDB.com and put it on
the phone (e.g. like copying the old Microsoft Cinemania CDs to the smartphone
storage and having a locally installed app search it) but that's not possible
for a generalized dataset representing the gigantic internet.

If you don't intend for the exabytes of the search index to be stored on your
smartphone, what exactly is the "on-device search agent" doing? How is it
iterating through the vast dataset over a slow cellular connection?

[0]
[https://www.google.com/search?q="trillion"+web+pages+exabyte...](https://www.google.com/search?q="trillion"+web+pages+exabytes)

~~~
ken
The smarts _living_ on-device is not necessarily the same as the smarts
_executing_ on-device.

We already have the means to execute arbitrary code (JS) or specific database
queries (SQL) on remote hosts. It's not inconceivable, to me, that my device
"knowing me" could consist of building up a local database of the types of
things that I want to see, and when I ask it to do a new search, it can
assemble a small program which it sends to a distributed system (which hosts
the actual index), runs a sophisticated and customized query program there,
securely and anonymously (I hope), and then sends back the results.

Google's index isn't architected to be used that way, but I would love it if
someone did build such a system.
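
Roughly the shape I have in mind, as a toy sketch. The remote endpoint and the query-program format are invented here purely for illustration; the only point is that the preference data stays local and only a compact, declarative query leaves the device:

    # Sketch: preferences live on-device; only a small query spec is shipped
    # to a (hypothetical) distributed index. Endpoint and schema are made up.
    import json
    import urllib.request

    local_profile = {
        "boost_topics": ["rust", "embedded"],
        "block_domains": ["content-farm.example"],
    }

    def build_query_program(terms, profile):
        # Compile local preferences into a small declarative spec;
        # the raw history/profile never leaves the device wholesale.
        return {
            "terms": terms,
            "boost": profile["boost_topics"],
            "exclude": profile["block_domains"],
            "max_results": 20,
        }

    def remote_search(program, endpoint="https://index.example/query"):  # invented endpoint
        req = urllib.request.Request(
            endpoint,
            data=json.dumps(program).encode(),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req) as resp:
            return json.load(resp)

    # remote_search(build_query_program("segfault in tokio", local_profile))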

~~~
ativzzz
To some extent, doesn't Google already do this? Meaning that based on your
location/Google account/other factors such as cookies or search history, it
will tailor your results. For instance, searching the same query on different
computers will give different results.

Though to your point, Google probably ends up storing this information in the
cloud.

~~~
bduerst
Also instant search results, which were common search terms that were cached
at lower levels of the internet.

------
alfanick
I see a lot of good comments here, I got inspired to write this:

What if this new Internet, instead of using URIs based on ownership (domains
that belong to someone), relied on topic?

For example:

netv2://speakers/reviews/BW
netv2://news/anti-trump
netv2://news/pro-trump
netv2://computer/engineering/react/i-like-it
netv2://computer/engineering/electron/i-dont-like-it

A publisher of a webpage (same HTML/HTTP) would push their content to these
new domains (?) and people could easily access a list of resources (pub/sub
like). Advertisements drive the Internet nowadays, so to keep everyone happy,
what if netv2 is neutral, but web browsers are not (which is the case now
anyway)? You can imagine that some browsers would prioritise some entries in a
given topic, while others would be neutral but make it harder to retrieve the
data you want.

Second thought: Guess what, I'm reinventing NNTP :)

~~~
decasteve
Inventing/extending a new NNTP is a nice idea too.

The Internet has become synonymous with the web/HTTP protocol. The web
alternatives to NNTP won out, instead of newer versions of Usenet. New
versions of IRC, UUCP, S/FTP, SMTP, etc., instead of webifying everything,
would be nice. But those services are still there and fill an important niche
for those not interested in seeing everything eternal-Septembered.

~~~
bogomipz
I believe there is/was an extension to NNTP for full-text search, or at least
a draft proposal, no?

------
codeulike
That was what the early internet was like (I was there). People built indexes
by hand, lists of pages on certain topics. There was the Gopher protocol that
was supposed to help with finding things. But this was all top-down stuff, the
first indexing/crawling search engines were bottom-up and it worked so much
better. And for a while we had an ecosystem of different search engines until
Google came along, was genuinely miles better than everything else, and wiped
everything else out. Really, search isn't the problem; it's the way that search
has become tied to advertising and tracking that's the problem. But then
DuckDuckGo is there if you want to avoid all that.

~~~
supernovae
Search is the problem. If you don't rank in Google you don't exist on the
internet. There is an entire economy built on manipulating search that is pay
to play, in addition to Google continually favouring paid search over natural
SERPs. Controlling search right now is controlling the internet.

~~~
codeulike
Whatever you replace Search with would be gamed in the same way.

~~~
supernovae
True, but when it was Lycos, HotBot, AltaVista, Google, WebCrawler, AOL,
Gopher, Archie, Usenet and so many other sources, it was much easier to exist
in many ways (harder to dominate) - people used to 'surf the web', join
"webrings" and share stuff... now they consume and post memes. So I blame
behavior as much as monopoly.

~~~
codeulike
A lot of other things have changed since then, so the difference in tone you
are noticing might not have much to do with search engines. In 1996 there were
only about 16 million people on the internet, and usage obviously skewed
towards the more technical nerdy crowd. Now there are 4,383 million people on
the internet. Which is about 57% of everyone.

~~~
Sohcahtoa82
I see this a lot on HN. People forget that a lot of things in the early days
of the Internet only worked because there were so few people on the Internet.

If you were rich and had a T1 in your home in the days when everyone was on
dialup, sure, you could host a website yourself. But these days, even if
you're one of the lucky residents on a gigabit symmetrical connection, there's
a limit to how much you can serve. Self-hosting isn't an option unless your
website is niche.

------
davidy123
I think in one sense the answer is that it always depends on who or what you
are asking for your answers.

The early Web wrestled with this; early on it was going to be directories and
meta keywords. But that quickly broke down (information isn't hierarchical,
meta keywords can be gamed). Google rose up because they use a sort of
reputation-based index. In between, there was a company called RealNames that
tried to replace domains and search with their authoritative naming of things,
but that was obviously too centralized.

But back to Google, they now promote using schema.org descriptions of pages,
over page text, as do other major search engines. This has tremendous
implications for precise content definition (a page that is "not about fish"
won't show up in a search result for fish). Google layers it with their
reputation system, but these schemas are an important, open feature available
to anyone to more accurately map the web. Schema.org is based on Linked Data,
its principle being each piece of data can be precisely "followed." Each
schema definition is crafted by participation from industry and interest
groups to generally reflect its domain. This open-world model is much more
suitable to the Web than the closed world of a particular database (but some
companies, like Amazon and Facebook, don't adhere to it, apparently because
they would rather keep control of their own worlds; witness Facebook's Open
Graph degenerating into something purely self-serving).

------
_nalply
The deeper problem is advertising. It is sort of a prisoner's dilemma: all
commercial entities have a shouting contest to attract customer attention.
It's expensive for everybody.

If we could kill advertising permanently, we could have an internet as
described in the question. It would almost be an emergent feature of the
internet.

~~~
worldsayshi
We could supercharge word of mouth. I've been thinking about an alternative
upvote model where content is ranked not primarily based on aggregate voting
but by:

\- ranking content from users you have upvoted higher

\- ranking content from users with similar upvote behaviour higher

While there is a risk of upvote bubbles, it should potentially make it easier
for niche content to spread to interested people and make it possible for
products and services to spread using peer trust rather than cold shouting.
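
A rough sketch of what the second bullet could look like, assuming each user is just a vector of upvotes and similarity is plain cosine similarity (toy data; downvotes could be modelled as negative entries):

    # Toy sketch: rank items by how much users similar to me have upvoted them.
    import math

    # user -> {item: vote}, where vote is +1 (an upvote); -1 could model downvotes
    votes = {
        "me":    {"a": 1, "b": 1},
        "alice": {"a": 1, "b": 1, "c": 1},
        "bob":   {"d": 1},
    }

    def cosine(u, v):
        common = set(u) & set(v)
        dot = sum(u[k] * v[k] for k in common)
        norm = (math.sqrt(sum(x * x for x in u.values()))
                * math.sqrt(sum(x * x for x in v.values())))
        return dot / norm if norm else 0.0

    def recommend(me, votes):
        scores = {}
        for user, their_votes in votes.items():
            if user == me:
                continue
            sim = cosine(votes[me], their_votes)
            for item, vote in their_votes.items():
                if item not in votes[me]:
                    scores[item] = scores.get(item, 0.0) + sim * vote
        return sorted(scores, key=scores.get, reverse=True)

    print(recommend("me", votes))  # 'c' ranks above 'd' because alice votes like me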

~~~
thekyle
> ranking content from users with similar upvote behaviour higher

This is what Reddit originally tried to do before they pivoted.

[https://www.reddit.com/r/self/comments/11fiab/are_memes_maki...](https://www.reddit.com/r/self/comments/11fiab/are_memes_making_the_internet_boring/c6m9sy8/)

~~~
worldsayshi
Oh, interesting!

Makes me think that their original plan could still work if they just put a
bit more effort into crafting that algorithm.

For example, the main criticism brought up is that things you dislike but your
peers like keep getting recommended. Why not add a de-ranking aspect and try
adding downvote-peers in addition to upvote-peers?

I imagine you could create an interesting query language that could answer
questions like: what things do you like if you like X and Y but not Z? (I seem
to remember that something akin to this has been hacked together using
subreddit overlap.)

------
quelsolaar
Yes, we need search engines, but they don't need to be monolithic. Imagine
that indexing the text of your average web page takes up 10 kB. Then you get
100,000 pages per gig. It means that if you spend ~270 USD on a consumer
10-terabyte drive, you can index a billion webpages. Google no longer says how
many pages they index, but it's estimated to be within one order of magnitude
of that.

This means that in terms of hardware, you can build your own Google; then you
get to decide how it rates things, you don't have to worry about ads, and SEO
becomes much harder because there is no longer one target to SEO. Google
obviously doesn't want you to do this (and in fairness Google indexes a lot of
stuff beyond keywords from web pages), but it would be very possible to build
an open source, configurable search engine that anyone could install, run, and
get good results out of.

(Example: The Pirate Bay database, which arguably indexes the vast majority of
available music / tv / film / software, was / is small enough to be downloaded
and cloned by users.)
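
The back-of-the-envelope math, as a quick sketch (the 10 kB per page and the drive size/price are the assumptions stated above, not measurements):

    # Check the parent comment's numbers under its own assumptions.
    bytes_per_page = 10 * 1024          # ~10 kB of indexed text per page (assumed)
    drive_bytes = 10 * 10**12           # consumer 10 TB drive, ~270 USD (assumed)

    pages_per_gb = 10**9 // bytes_per_page
    pages_on_drive = drive_bytes // bytes_per_page

    print(f"{pages_per_gb:,} pages per GB")       # ~97,000, roughly the 100k claimed
    print(f"{pages_on_drive:,} pages on 10 TB")   # ~1 billion pages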

~~~
rhmw2b
Google's paper on Percolator from 2010 says there are more than 1T web pages.
9 years later there is surely way more than that.

[https://ai.google/research/pubs/pub36726](https://ai.google/research/pubs/pub36726)

The real issue would be crawling and indexing all those pages. How long would
it take for an average user's computer with a 10Mb internet connection to
crawl the entire web? It's not as easy a problem as you make it seem.

~~~
quelsolaar
I'm not saying it's easy, it's not, but people tend to think that because
Google is so huge, you have to be that huge to do what Google does. My
argument is that in terms of hardware Google needs expensive hardware because
they have so many users, not because what they do requires that hardware to
deliver the service for one or a few users.

I have a gigabit link to my apartment (go Swedish infrastructure!). At that
theoretical speed I get 450 gigs an hour, so I could download ten terabytes in
a day. We can easily slow that down by an order of magnitude and it's still a
very viable thing to do. If someone wrote the software to do this, one could
imagine some kind of federated solution for downloading the data, so that
every user doesn't have to hit every web server.
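
Same kind of sanity check on the download time, using the theoretical line rate (no protocol overhead, server limits or crawl-politeness delays):

    # Rough check of the gigabit-link math from above.
    link_bits_per_s = 10**9
    bytes_per_hour = link_bits_per_s / 8 * 3600      # ~450 GB/hour at line rate
    index_bytes = 10 * 10**12                        # the 10 TB index from before

    hours = index_bytes / bytes_per_hour
    print(f"{bytes_per_hour / 1e9:.0f} GB/hour, {hours:.0f} hours for 10 TB")
    # -> 450 GB/hour, ~22 hours; an order of magnitude slower is still ~9 days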

------
theon144
Almost definitely not.

Search engines are there to find and extract information from an unstructured
trove of webpages - there is no other way to process this than with something
akin to a search engine.

So either you've got an unstructured web (the hint is in the name) and
GoogleBingYandex, or a somehow structured web.

The latter has been found to be not scalable or flexible enough to accommodate
unanticipated needs - and not for a lack of trying! This was the default mode
of the web until Google came about. Turns out it's damn near impossible to
construct a structure for information that won't become instantly obsolete.

~~~
0815test
> A structured web ... has been found to be not scalable or flexible enough to
> accommodate unanticipated needs - and not for a lack of trying!

Linked Open Data (the latest evolution of Semantic Web technologies) is
actually working quite well at present - Wikidata now gets more edits per unit
of time than Wikipedia does, and its data are commonly used by "personal
assistant" AIs such as Amazon's Alexa. Of course, these can only cover parts
of the web where commercial incentives, and the bad actors that sometimes
pursue them, are not relevant.

------
swalsh
I've had this idea floating in my head for a while: one thing that might make
the world better is some kind of distributed database, and a gravitation back
to open protocols (though instead of RFCs... maybe we could maintain an open
source library for the important bits). I was thinking the architecture of DNS
is a good starting point. From there we can create public indexes of data.
This includes searchable data, but also private data you want to share, which
could be encrypted and controlled by you (think PGP); I'd modify browsers so
that I don't have to trust a 3rd-party service.

Centralization happens because the company owns the data, which becomes
aggregated under one roof. If you distribute the data it will remove the
walled gardens, multiple competitors should be able to pop up. Whole
ecosystems could be built to give us 100 googles.... or 100 facebooks, where
YOU control your data, and they may never even see your data. And because
we're moving back to a world of open protocols, they all work with each other.

These companies aren't going to be worth billions of dollars any more.... but
the world would be better.

~~~
davidy123
You've just more or less described Solid.
[https://solid.mit.edu/](https://solid.mit.edu/)

I think a lot of people dismiss Solid based on its deep origins in Semantic
Web, or because it's a slow project, based on Web standards, intended to solve
long term problems.

But being part of the Web is a huge process, and with DIDs it maps just fine
into decentralized worlds.

------
alangibson
The 2 core flaws of the Internet (more precisely the World Wide Web) are lack
of native search and native payments. Cryptocurrencies have started to address
the second issue, but no one that I know of is seriously working on the first.

Fast information retrieval requires an index. A better formulation of the
question might be: how do we maintain a shared, distributed index that won't
be destroyed by bad actors?

I wonder if the two might have parts of the solution in common. Maybe using
proof of work to impose a cost on adding something to the index. Or maybe a
proof of work problem that is actually maintaining the index or executing
searches on it.

~~~
asdff
Why does there need to be one central source of truth on the internet? It
seems like it would be impossible to implement. Even if google worked like it
did 15 years ago and you got decently relevant results to your search terms,
that's still not even scraping the surface of the whole internet that is
relevant to your search terms.

It's an impossible problem to solve because we don't have good consistent
metadata to draw on. Libraries work because they have good metadata to catalog
their collections. Good metadata needs to be generated by hand; doing it
automatically is bound to lead to errors and special cases that will pollute
your search results.

I say we abandon the idea of the ideal search engine, accept the fact that we
will never be able to find every needle in every haystack, and defer to a
decentralized assortment of thousands of topic-specific indexes of relevant
information. Some of them will be shit, but that's fine; the internet has
always been a refuge for conspiracy theorists and other zany interests. The
good stuff will shine through the mud, as it's always done.

------
lefstathiou
My approach to answering this would entail determining what percentage of
search engine use is driven by:

1) The need for a shortcut to information you know exists but don't feel like
accessing the hard way

2) Information you are actually seeking.

My initial reaction is that making search engines irrelevant is a stretch.
Here is why:

Regarding #1, the vast majority of my search activity involves information I
know how and where to find but seek the path of least resistance to access. I
can type in "the smith, flat iron nyc" and know I will get the hours, cross
street and phone number for the Smith restaurant. Why would I not do this
instead of visiting the Yelp website, searching for the Smith, setting my
location to NYC, filtering results, etc.? Maybe I am not being open-minded
enough, but I don't see how this can be replaced short of reading my mind and
injecting that information into it. There needs to be a system to type a
request and retrieve the result you're looking for. Another example: when I am
looking for someone on LinkedIn, I always google the person instead of using
LinkedIn's god-awful search. Never fails me.

2\. In the minority of cases where I am actually seeking something, I have
found that Google's results have gotten worse and worse over the years. It
will still be my primary port of call, and I think this is the workflow with
potential for disruption. Other than an index, I don't know what better
alternatives you could offer.

------
peteyPete
You'd still want to be able to retrieve "useful" information which can't be
tampered with easily which I think is the biggest issue.

You can't curate manually... that just doesn't scale. You also can't let just
anyone add to the index as they wish, or any/every business will just flood
the index with their products... There wouldn't be any difference between
whitehat/blackhat marketing.

You also need to be able to discover new content when you seek it, based on
relevancy and quality of content.

At the end of the day, people won't be storing the index of the net locally,
and you also can't realistically query the entire net on demand. That would be
an absolutely insane amount of wasted resources.

It all comes back to some middleman taking on the responsibility (Google,
DuckDuckGo, etc.).

Maybe the solution is an organization funded by all governments, completely
transparent, where people who wish to can vote on decisions/direction. So non
profit? Not driven by marketing?

But since when has government led with innovation and done so at a good pace?
Money drives everything... And without a "useful" amount of marketing/ads
etc., the whole web wouldn't be as it is.

So yes, you can... but you won't have access to the same amount of data as
easily, and will likely have a harder time finding relevant information
(especially if it's quite new) without having to parse through a lot of crap.

------
kyberias
If we were to design a brand new DATABASE ENGINE for today's world, can we
develop it such a way that:

1\. Finding information is trivial

2\. You don't need services indexing billions of rows to find any relevant
document

~~~
RhysU
How far can one get with content-addressable storage? It's not obvious to me
how to emulate search results ranking (well, anyhow), but it could give you a
list of documents satisfying some criteria according to the authors who stored
them.

------
fghtr
>In our current internet, we need a big brother like Google or Bing to
effectively find any relevant information in exchange for sharing with them
our search history, browsing habits etc.

The evil big brothers may not be necessary. We just need to expand alternative
search engines like YaCy.

------
azangru
I can't imagine how this is possible. Imagine I have a string of words (a
quote from a book or an article, a fragment of an error message, etc), and I
want to find the full text where it appears (or pages discussing it). How
would you do that without a search engine?

~~~
rmsaksida
I think OP's idea is that search services would be built into the Internet,
and not provided by a third party. That is, when a website is published or
updated, it is somehow instantly indexed and made available for search as a
feature of the platform on which it was published.

~~~
jedberg
But you still need a third party to rank the results. I don't just want any
page about my error message, I want the _best_ page.

~~~
jonathanstrange
The page rank could be a transparent algorithm, which is regularly updated by
a consortium like W3C.

The question is whether this would work in an adversarial setting where every
party tries to inflate their page rankings by any trick they can find.

~~~
dageshi
Not a chance it would survive. Google has enough problems fighting SEO right
now and they don't publish their algorithm and have incredibly deep pockets.

------
lxn
Most of the search engines nowadays have the advantage of being closed source
(you don't know how their algorithms actually work). This makes the fight
against unethical SEO practices easier.

With a distributed, open search alternative, the algorithm is more susceptible
to exploits by malicious actors.

Having it manually curated is too much of a task for any organization. If you
let users vote on the results... well, that can be exploited as well.

The information available on the internet is too big to make directories
effective (like they were 20 years ago).

I still have hope this will get solved one day, but directories and open
source distributed search engines are not the solution in my opinion unless
there is a way to make them resistant to exploitation.

~~~
pavas
I've been thinking that the only way to get around the bad-actor (or paid
agent) problem when dealing with online networks is to have some sort of
distributed trust mechanism.

I feel like manually curated information is the way to go, you just have to
find some way to filter out all the useless info and marketing/propaganda. You
can't crowd source it because it opens up avenues for gaming the system.

The only solution I can think of is some sort of transitive trust metric
that's used to filter what's presented to you. If something gets by that
shouldn't have (bad info/poor quality), you update the weights in the trust
network that led to that action so they are less likely to give you that in
the future. I never got around to working through the math on this, however.

~~~
dorusr
That's very workable. Any agent should have a private key with which it signs
its pushes. The age of an agent and the feedback score for that agent
determine its ranking. That still leaves gaming possible via the feedback, but
heavy feedback like "this is malicious content" could be moderated (so that
people can't just report stuff they don't like).

~~~
pavas
The reason I mentioned that the trust metric should be transitive and
distributed is so that it prevents gaming as much as possible. You wouldn't
want to have a trusted central authority (for everyone) because that could
always be corrupted or gamed if it's profitable enough. Rather every
individual would have a set of trusted peers with different "trust" weights
for each based on the individual's perception of their trustworthiness, that
could be changed over time.

This trust (weighting) should be able to propagate as a (semi-)transitive
property throughout the network to take advantage of your trusted peers'
trusted peers. This trust weight propagation would need to converge, and when
you are served content that has been labeled incorrectly ("high-value" or
"trustworthy" or whatever metric, when you don't see it that way), then your
trust weights (and perhaps your peers') would need to re-update in some sort
of backpropagation.

The hard part is keeping track of the trust-network in a way that is O(n^c)
and having the transitive calculations also be O(n^c) at most. I'm quite sure
there are ways of doing this (at least with reasonably good results) but I
haven't been able to think through them.
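
A toy sketch of one round of that idea, just to make the shape concrete. The weights, damping factor and the "backpropagation" rule are all invented here; this is nowhere near a worked-out convergence argument:

    # Toy sketch: propagate trust one hop through peers, then penalize a path
    # that served bad content. All numbers are arbitrary.
    direct_trust = {               # my hand-set trust in immediate peers
        "alice": 0.9,
        "bob": 0.4,
    }
    peer_trust = {                 # who my peers trust
        "alice": {"carol": 0.8},
        "bob": {"dave": 0.9},
    }

    def effective_trust(damping=0.5):
        trust = dict(direct_trust)
        for peer, weight in direct_trust.items():
            for friend, w in peer_trust.get(peer, {}).items():
                # transitive trust decays with each hop
                trust[friend] = max(trust.get(friend, 0.0), weight * w * damping)
        return trust

    def penalize(source, factor=0.5):
        # crude "backpropagation" of a bad recommendation:
        # downgrade whichever peer vouched for the offending source
        for peer, friends in peer_trust.items():
            if source in friends:
                direct_trust[peer] *= factor

    print(effective_trust())   # carol ~0.36 via alice, dave ~0.18 via bob
    penalize("dave")           # dave served junk -> bob's vouching counts for less
    print(effective_trust())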

------
VvR-Ox
This would be the internet used in Star Trek, I think. The computers they use
can just be asked about something and the whole system is searched for it -
the service to find things is inherent to the system itself. In our world
things like that are done by entities who try to maximize their profits (like
the Ferengi) without thinking too much about effectiveness or ethics.

This phenomenon can be seen throughout many systems we have built - e.g. use
of the internet, communication, access to electricity or water. We have to pay
the profit-maximizing entities for all of this, though it could be covered by
global cooperatives who manage this stuff in a good way.

------
blue_devil
I think "search engines" is misleading. These are relevance engines. And
relevance sells - the higher the relevance, the better.

[https://www.nytimes.com/2019/06/19/opinion/facebook-
google-p...](https://www.nytimes.com/2019/06/19/opinion/facebook-google-
privacy.html)

------
Ultramanoid
This is what we had in the early internet days: directories of links. Early
Yahoo was the perfect example of this. You jumped from one site to another,
you asked other people, you discovered things by chance. You went straight to
a source, instead of reading a post summarizing a site that, after 20
redirections loaded with advertising and tracking, finally gets you to the
intended and actually useful destination.

Most web sites then also had a healthy, sometimes surprising links section,
which has all but disappeared these days.

~~~
HNLurker2
>directories of links.

This is what I did back in 2015 as a project to increase the SEO rank of my
business: basically spamming directories (and creating my own) just to
increase my PageRank.

------
d-sc
Indexing information is a political problem as much as a technical one.
Ultimately there will always be people who will put more effort into getting
their information known than others. These people would game whatever
technical solution exists.

~~~
albertgoeswoof
That's true. What if you could artificially limit the amount of effort someone
can put into getting content out there? Or even make it known to the consumer
how/why the content is ranked highly?

~~~
d-sc
Most closed platforms do a subset of what you mention: I can only put so many
posts on my Facebook before they stop making it to all my friends. If I pay
more for higher ranking it’s labeled an advertisement.

However, creating rules transitions the contention point to who makes the
rules. If you think that my algorithm will rank my sources better than your
sources, you may be less interested in my algorithm regardless of its
technical merits.

------
vbsteven
I was recently thinking about an open search protocol with some federation
elements, in two parts: a frontend and an indexer. The idea is that anyone can
run their own search frontend or use a community-hosted one (like Matrix).
Each frontend then has some number of indexers configured.

Each indexer is responsible for a small part of the web, and by adding
indexers you can increase your personal search area. And there is some web of
trust going on.

Entities like Stack Overflow, Wikipedia and Reddit could host their own
domain-specific indexers. Others could be crowdsourced with browser extensions
or custom crawlers, and maybe some people want to have their own indexer that
they curate and want to share with the world.

It will never cover the utility and breadth of Google Search but with enough
adoption this could be a nice first search engine. With DDG inspired bang
commands in the frontend you could easily retry a search on Google.

With another set of colon commands you can limit a search to one specific
indexer.

The big part I am unsure about in this setup is how a frontend would choose
which indexers to use for a specific query. Obviously sending each query to
each indexer will not scale very well.
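
A very rough sketch of the frontend side of this, with the indexer endpoints and the response shape invented purely for illustration (the `only` argument stands in for the colon command mentioned above):

    # Sketch of a frontend fanning a query out to configured indexers and
    # merging the results. Endpoints and JSON shape are made up.
    import json
    import urllib.parse
    import urllib.request

    indexers = {
        "wikipedia": "https://index.wikipedia.example/search",
        "stackoverflow": "https://index.stackoverflow.example/search",
    }

    def query_indexer(endpoint, q):
        url = endpoint + "?" + urllib.parse.urlencode({"q": q})
        with urllib.request.urlopen(url, timeout=5) as resp:
            return json.load(resp)  # assumed shape: [{"url": ..., "score": ...}, ...]

    def search(q, only=None):
        # `only` mimics a colon command limiting the query to one indexer
        targets = {only: indexers[only]} if only else indexers
        results = []
        for name, endpoint in targets.items():
            try:
                results += [dict(r, indexer=name) for r in query_indexer(endpoint, q)]
            except OSError:
                continue  # an unreachable indexer just shrinks the search area
        return sorted(results, key=lambda r: r["score"], reverse=True)

    # search("monad tutorial", only="stackoverflow")

The open question from above (which indexers to send a query to) is exactly the part this sketch dodges by querying all of them.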

------
dalbasal
Just as a suggestion, this question might be rephrased as: can we have an
Internet that doesn't require search companies, or at least not massive search
monopolies?

I'm not sure what the answer is re: search. But an easier example to chew on
might be social media. It doesn't take a Facebook to make one. There are lots
of different social networking sites (including this one) that are orders of
magnitude smaller in terms of resources/people involved, even adjusting for
size of the userbase.

It doesn't take a Facebook (company) to make Facebook (site); Facebook just
turned out to be the prize they got for it. These things are just decided as
races: FB got enough users early enough. But if they went away tomorrow, users
would not lack for social network experiences. Where they get those
experiences is basically determined by network effects, not the product
itself.

For search, it doesn't take a Google either. DDG makes a search engine, and
they're way smaller. With search, though, it does seem that being a Google
helps. They _have_ been "winning" convincingly even without the network
effects and moat that make FB win.

------
zzbzq
[https://medium.com/@Gramshackle/the-web-of-native-apps-ii-
go...](https://medium.com/@Gramshackle/the-web-of-native-apps-ii-google-and-
facebook-ed2ee497302d)

Cliff's notes:

\- Apps should run not in a browser, but in sandboxed app containers loaded
from the network, somewhere between mobile apps and Flash/Silverlight: mobile
apps that you don't 'install' from a store, but navigate to freely like the
web. Apps have full access to OS-level APIs (for which there is a new
cross-platform standard), but are containerized in a chroot jail.

\- App privileges ("this wants to access your files") should be a prominent
feature of the system, and ad networks would be required to be built on top of
this system to make the trade-offs clear to the consumer.

\- Search should be a functionality owned and operated by the ISPs for profit
and should be a low-level internet feature seen as an extension of DNS.

\- Google basically IS the web and would never allow such a system to grow.
Some of their competitors have already tried to subvert the web by the way
they approached mobile.

------
btbuildem
You don't remember how it was before search engines, do you?

It was like a dark maze, and sometimes you'd find a piece of the map.

Search coming online was a watershed moment -- like, "before search" and
"after search"

~~~
bduerst
Yep.

\- You had your web rings, which would cycle from site to site based on a
category, some pages having multiple rings.

\- You had your "communities", organizing sites by URL structure, where
similar pages were grouped together like a strip mall or something (i.e.
_neighborhoods_ for geocities).

\- You had scammy services that would submit your pages to multiple search
engines, at a cost, but would guarantee you would show up in results.

\- You had your aggregators, like dogpile, where you would sift through pages
of results from different search engines, hoping to find something different.

It wasn't a good time. If you think about the problem that search engines
solve today - connecting people with information that they want - we're
currently at a peak.

------
chriswwweb
Sorry, but this was too tempting:
[https://imgur.com/a/6UcAOnF](https://imgur.com/a/6UcAOnF)

But seriously, I'm not sure it is feasible. I wish the internet could auto-
index itself and still be decentralized, where any type of content can be
"discovered" as soon as it is connected to the "grid".

The advantage would be that users could search any content without filters,
without AI tampering with the order based on some rules... BUT on the other
hand, people use search engines because their results are relevant (whatever
that means these days), so an internet that is searchable by default would
probably never be a good UX and hence would not replace existing search
engines. It's not just about the internet being searchable; it would have to
solve all the problems search engines have solved in the last ten years too.

------
mhandley
We could always ask a different question: what would it take for everyone to
have a copy of the index? Humans can only produce new text-based content at a
linear rate. If storage continues to grow at an exponential rate, eventually
it becomes relatively cheap to hold a copy of the index.

Of course those assumptions may not be valid. Content may grow faster than
linear. Content may not all be produced by humans. Storage won't grow
exponentially forever. But good content probably grows linearly at most, and
maybe even slower if old good content is more accessible. Already it's
feasible to hold all of the English wikipedia on a phone. Doing the same for
Internet content is certainly going to remain non-trivial for a while yet. But
sometimes you have to ask the dumb questions...

~~~
tjansen
You may have the storage to store it, but do you have the bandwidth to receive
everything that's being produced?

~~~
mhandley
There are 8 billion people. If half of them were awake, and 10 percent of that
half are actually typing at 40 words/minute, that would be about 13Gbit/s. I
couldn't receive that feed today at home, but my work could. A satellite feed
could work today too. And I wasn't really talking about today, but 10-20 years
from now. Storage will be a problem for a lot longer than network capacity.
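
The arithmetic, for anyone checking (all inputs are the rough assumptions above; bytes per word is my own added assumption of roughly five letters plus a space):

    # Rough check of the "everyone typing" feed rate.
    people = 8e9
    typing_fraction = 0.5 * 0.1          # half awake, 10% of those typing (assumed)
    words_per_minute = 40
    bytes_per_word = 6                   # ~5 letters + a space (assumed)

    bytes_per_second = people * typing_fraction * words_per_minute * bytes_per_word / 60
    print(f"{bytes_per_second * 8 / 1e9:.1f} Gbit/s")   # ~12.8 Gbit/s, i.e. about 13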

~~~
JoeAltmaier
Storage grows geometrically, while network capacity has grown less so. It's
possible we will soon have more storage than humans can create content to
fill. No problem; automated systems also fill storage!

------
tooop
The question should be how we can create a new internet where we don't need a
centralized 3rd-party search engine, not a new internet where there is no
search engine at all. You can't find anything if there is no search (engine),
and you can't change that.

------
GistNoesis
Yes, download data, create indices on your data yourself as you see fit,
execute SQL queries.

If you don't have the resources to do so yourself, then you'll have to trust
something, in order to share the burden.

If you trust money, then gather enough interested people to share the cost of
construction of the index, at the end everyone who trust you can enjoy the
benefits of the whole for himself, and you now are a search engine service
provider :)

Alternatively, if you can't get people to part with their money, you can get
by needing only their computation, by building the index in a decentralized
fashion. The distributed index can then be trusted, at a small computation
cost, by anyone who believes that at least k% of the actors constructing it
are honest.

For example if you trust your computation and if you trust that x% of actors
are honest :

You gather 1000 actors and have each one compute the index of 1000th of the
data, and publish their results.

Then you have each actor redo the computation on the data of another actor
picked at random, as many times as necessary.

An honest actor will report any disagreement between computations, and then
you can tell who the bad actor is (one you won't ever trust again) by checking
the computation yourself.

The probability that there is still a bad actor lying is (1-x)^(x*n) with n
the number of times you have repeated the verification process. So it can be
made as small as possible, even if x is small by increasing n. (There is no
need to have a majority or super-majority here like in byzantine algorithms,
because you are doing the verification yourself which is doable because 1000th
of the data is small enough).

Actors don't have an incentive to lie because if they do, they will be
provably exposed as liars forever.

Economically, with the decreasing cost of computation (and therefore of index
construction), public collections of indices are inevitable. They will be
quite hard to game, because as soon as enough interest is gathered a new index
can be created to fix what was gamed.
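
A quick sketch of how small that escape probability gets, just evaluating the (1-x)^(x*n) expression from above for a few values:

    # Evaluate the bound above: chance an undetected liar remains after n
    # rounds of random re-checking, with fraction x of actors honest.
    def escape_probability(x, n):
        return (1 - x) ** (x * n)

    for x in (0.1, 0.5, 0.9):
        for n in (10, 100, 1000):
            print(f"x={x} n={n:>4}: {escape_probability(x, n):.2e}")
    # With 50% honest actors, 100 rounds already gives ~9e-16; with only 10%
    # honest you need ~1000 rounds to push the bound down to ~3e-5.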

------
cf141q5325
There is an even deeper problem than surveillance: the results of search
engines get more and more censored, with more and more governments putting
pressure on them to censor results according to their individual wishes.

------
wlesieutre
Taking a step back to before search engines were the main driver for finding
content online, who remembers webrings?

Is there a way to update that idea of websites deliberately recommending each
other, but without having it be an upvote/like based popularity contest driven
by an enormous anonymous mob? It needs to avoid both easy to manipulate crowd
voting like reddit and the SEO spam attacks that PageRank has been targeted
by.

Some way to say "I value recommendations by X person," or even give individual
people weight in particular types of content and not others?

~~~
jefftk
_> who remembers webrings?_

I recently configured openring [1] and am liking it a lot. Example of one of
my pages with it [2]

[1]
[https://git.sr.ht/~sircmpwn/openring](https://git.sr.ht/~sircmpwn/openring)

[2] [https://www.jefftk.com/p/adventures-in-
upstreaming](https://www.jefftk.com/p/adventures-in-upstreaming) scroll down
to "Recent posts on blogs I like"

------
topmonk
What we should have is an open, freely accessible meta-information database of
things like whether user X liked/disliked a site, what other pages/sites
linked to this site, what their admins ranked this site as (if they did), etc.

Then we have individual engines that take this data and choose for the user
what to display for that user only. So if the user is unhappy with what they
are seeing, they simply plug in another engine.

Probably a blockchain would be a good way to store such a thing.

------
jonathanstrange
There is still YaCy [1]. I'm not sure whether it's this one or another
distributed search engine I tried 10 years ago, but the results were not very
convincing. I believe that's to some extent because of a lack of critical
mass; if more people used these engines, they could improve their rankings and
indexing based on usage.

[1] [https://yacy.net/en/index.html](https://yacy.net/en/index.html)

------
_Nat_
> In our current internet, we need a big brother like Google or Bing to
> effectively find any relevant information in exchange for sharing with them
> our search history, browsing habits etc.

Seems like you could access Google/Bing/etc. (or DuckDuckGo, which'd probably
be a better start here) through an anonymizing service.

But, no, going without search engines entirely doesn't make much sense.

I suspect that what you'd really want is more control over what your computer
shares about you and how you interact with services that attempt to track you.
For example, you'd probably like DuckDuckGo more than Google. And you'd
probably like Firefox more than Chrome.

\---

With respect to the future internet...

I suspect that our connection protocols will get more dynamic and
sophisticated. Then you might have an AI-agent try to perform a low-profile
search for you.

For example, say that you want to know something about a sensitive matter in
real life. You can start asking around without telling everyone precisely what
you're looking for, right?

Likewise, once we have some smarter autonomous assistants, we can ask them to
perform a similar sort of search, where they might try to look around for
something online on your behalf without directly telling online services
precisely what you're after.

------
gesman
I think there is a grain of a good idea here.

As I see it, a new "free search" internet would use specially formatted
content for each published page to make its content easily searchable - likely
some tags within existing HTML content to comply with a new "free search"
standard.

Open source, distributed agents would receive notifications about new,
properly formatted "free search" pages and then index each such page into a
public indexed DB.

Any publisher could release content and notify closest "free search" agent.

Then - just like a blockchain - anyone could download such an indexed DB to do
instant local searches.

There would be multiple variations of such a DB - from small ones (<1 TB) that
satisfy small users with just "titles" and "extracts", to large ones for those
who need detailed search abilities (multi-TB capacity).

"Free search", distributed agents will provide clutter-free interface to do
detailed search for anyone.

I think this idea could easily be pickup up pretty much by everyone - everyone
would be interested to submit their content to be easily searchable and escape
any middlemen monopoly that is trying to control aspects of searching and
indexing.

------
hokus
[https://tools.ietf.org/html/rfc1436](https://tools.ietf.org/html/rfc1436)

------
salawat
The problem isn't search engines per se.

The problem is closed algorithms, SEO, and advertising/marketing.

Think about it for a minute. Imagine a search engine that generates the same
results for everyone. Since it gives the same results for everyone, the burden
of looking for exactly what you're looking for is put back exactly where it
needs to be, on the user.

The problem though, is you'll still get networks of "sink pages" that are
optimized to show up in every conceivable search, that don't have anything to
do with what you're searching for, but are just landing pages for links/ads.

Personally, I liked a more Yellow Pages-ish net. After you got a knack for
picking out the SEO link sinks and filtering them out yourself, you were fine.
I prefer this to a search provider doing it for you because it teaches you,
the user, how to retrieve information better. It meant you were no longer
dependent on someone else slurping up info on your browsing habits to try to
make a guess at what you were looking for.

------
tablethnuser
One way to replace search is to return to curation by trusted parties. Rather
than anyone putting a web page up and then a passive crawler finding it and
telling everyone about it (why should I trust any search engine crawler?), we
could "load" our search engine with lists of websites. These lists are
published and maintained by curators that we have explicitly chosen to trust.
When we type into the search box, it can only return results from sites
present on our personal lists.

e.g. someone's list of installed lists might look like:

\- New York Public Library reference list

\- Good Housekeeping list of consumer goods

\- YCombinator list of tech news

\- California education system approved sources

\- Joe Internet's surprisingly popular list of JavaScript news and resources

How do you find out about these lists and add them? Word of mouth and
advertising the old fashioned way. Marketplaces created specifically to be
"curators of curators". Premium payments for things like Amazing Black Friday
Deals 2019 which, if you liked, you'll buy again in 2020 and tell your
friends.

There are two points to this. First, new websites only enter your search graph
when you make a trust decision about a curator - trust you can revoke or
redistribute whenever you want. Second, your list-of-lists serves as an
overview of your own biases. You can't read conspiracy theory websites without
first trusting "Insane Jake's Real Truth the Govt Won't Tell You". Which is
your call to make! But at least you made a call rather than some outrage
optimizing algorithm making it for you.
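
A tiny sketch of the core mechanic, with a made-up list format and a stand-in index function; the only real point is that results outside the union of installed lists never appear:

    # Toy sketch: search never leaves the union of curator lists I've installed.
    installed_lists = {
        "nypl-reference": {"britannica.example", "gutenberg.example"},
        "joe-js-resources": {"mdn.example", "javascript-weekly.example"},
    }

    def allowed_sites():
        sites = set()
        for curator, curated_sites in installed_lists.items():
            sites |= curated_sites
        return sites

    def search(query, index):
        # `index` is any engine you like; results outside my lists are dropped
        return [r for r in index(query) if r["site"] in allowed_sites()]

    def revoke(curator):
        # distrust a curator and their sites fall out of my search graph
        installed_lists.pop(curator, None)

    fake_index = lambda q: [{"site": "mdn.example", "title": q},
                            {"site": "spam.example", "title": q}]
    print(search("array methods", fake_index))   # the spam site never shows up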

I guess this would start as a browser plugin. If there's interest let's build
it FOSS.

Edit: Or maybe it starts as a layer on top of an existing search engine. Are
you hiring, DDG? :P

~~~
joaobeno
Kind of like the Good Old Days™? Jokes aside, search engines are for stuff we
don't know... I get my news from some sites, this one for example, and I don't
depend on Google for that... But when I want to know about the viability of
zipping a UTF-8 encoded text to save space in my DB, there is no way to get my
answer without a search engine...

~~~
tablethnuser
If there's anything the modern internet has taught me it's that the Good Old
Days were doing some things right! We don't know how to scale trust to
internet-sized communities yet so a little tribalism may be warranted.

The way this solution solves the I Don't Know What I Don't Know problem is by
making you curate your own list of experts. For your example query, a
colleague may have told you about a popular list that thousands of DBAs
subscribe and contribute to. So when you search that query, it has the sites
to crawl to find the material.

------
dpacmittal
Why don't we get rid of tracking instead of getting rid of search engines? Why
can't I just set my ad preferences myself? I should be able to say: I'm
interested in tech, fashion, watches, online backup solutions, etc. - show me
only these ads. It would get rid of all kinds of tracking.

Can anyone tell me why such an approach wouldn't work?

~~~
nexuist
I could put "aviation" as one of my interests but I'm nowhere close to being
able to afford a plane or any aviation related accessories unless they're R/C
models (even then that's pushing it).

Just because I see ads I'm interested doesn't mean I'll want to buy what
they're selling. Whereas if a system that tracked me can deduce that I'm a
private pilot, it can make an educated guess towards my income and adjust the
type of items it shows me correspondingly. I doubt many people would be
willing to provide this information (demographics, location, income) that
advertisers care about most.

------
8bitsrule
IME, searching by collections of keywords has become a good strategy. Avoiding
using vague/topical keywords ('music', 'chemical'), instead asking for
specific words that should/must be found in the search results. If the results
start to exclude an important keyword (e.g. '1872' or 'giant' or 'legend'),
put a plus sign in front of it and resubmit.

I regularly use DDG (which claims privacy) for this, and requests can be quite
specific. E.g. a quotation "these words in this order" may result in -no
result at all-, which is preferable to being second-guessed by the engine.

I wonder how 'search engines are not required' would work without expecting
the searcher to acquire expertise in drilling down through topical categories,
as attempts like [http://www.odp.org/](http://www.odp.org/) did.

------
gexla
Good question. I'm going to run an experiment.

First "go-to" for search will be my browser history.

As long as the site I know I'm looking for is in my browser history, then I'll
go there and use the search feature to find other items from that site.

Bookmark all the advanced search pages I can find for sites I find myself
searching regularly.

Resist mindless searching for crap content which usually just takes up time as
my brain is decompressing from other tasks.

For search which is more valuable to me, try starting my search from
communities such as Reddit, Twitter or following links from other points in my
history.

Maybe if it's not worth going through the above steps, then it's not valuable
enough to look up?

NOTE: Sites such as Twitter may not be much better than Google, but I can at
least see who is pushing the link. I can determine if this person is someone I
would trust for recommendations.

I bet if I did all of the above, I could put a massive dent in the number of
search engine queries I do.

Any other suggestions?

~~~
KirinDave
> but I can at least see who is pushing the link. I can determine if this
> person is someone I would trust for recommendations.

This doesn't seem true at all to me. Twitter dramatically shapes and modifies
timelines to promote whatever they want. They're even more aggressive on
modifying the search experience.

All of those constraints are invisible. It's dangerous to think you have more
control or insight there.

~~~
gexla
Twitter has a social graph and communities much like Reddit. This adds more
information. There are people posting information whom I trust and even know
IRL.

> All of those constraints are invisible. It's dangerous to think you have
> more control or insight there.

And yet you are commenting as if these results aren't invisible to you? The
machinery behind Google search results isn't invisible? Are you trying to say
that one invisible thing is more "X" than another invisible thing?

~~~
KirinDave
> And yet you are commenting as if these results aren't invisible to you? The
> machinery behind Google search results isn't invisible?

Because of my unique and fortunate work history, I understand the internals of
these systems better than many people do. I'm objecting to the distinction
you're drawing, not suggesting an alternative order of transparency. There
really isn't much difference between the two companies' output in the regard
we're discussing.

~~~
gexla
I agree that Twitter may not be much different from Google if you are relying
on the algo. Notice that I mentioned Twitter as a community next to Reddit
though. Everyone has different usage patterns for these services. I follow
(and have been an active participant) in a number of niche communities on
Twitter. In some cases, I could do a search for a term and most results would
be from people I have interacted with through Twitter and other channels. Each
of those people carried a reputation within that niche and some I knew better
than others. I wouldn't use Twitter as a general search returning a large
number of untrusted results. Sure, even search results on a specific user
could be biased, but at least it would be from familiar territory.

------
ex3xu
Like others here I don't have too much of a problem with indexing.

What I would like to see is a human layer of infrastructure on top of
algorithmic search, one that leverages the fact that there are billions of
people who could be helping others find what they need. That critical mass
wasn't available at the beginning of the internet, but it certainly is now.

You kind of have attempts at this function in efforts like the Stack Exchange
network, Yahoo Answers, Ask Reddit, tech forums, etc., but I'd like to see
more active empowerment and incentivization of giving humans the capacity to
help other humans find what they need, in a way that would be free from
commercial incentives. I envision things like maintaining absolutely impartial
focus groups, and for commercial search it would be nice to see companies
incentivized to provide better-quality goods to game search rather than better
SEO.

------
ntnlabs
How about this: the Internet as a service. Instead of looking for answers, you
would "broadcast" your needs, like "I need a study about cancer", and you
would receive a list of sources that answered your question, maybe with some
sort of decentralised rating, and maybe country and author. How about that?

~~~
pjc50
Broadcasting your searches seems, if anything, even worse for privacy, and an
invitation for just-in-time spam.

~~~
swalsh
That's a workable issue. If a random and unique GUID is asking for results, it
would be hard to correlate users.

Of course there would definitely be an issue with how you generate the GUID
(for example, if it were derived from the user's MAC plus some predictable
random number generator, it might be reversible). So you would keep that in
mind. But these seem like workable issues.
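
A minimal illustration of the safe version: the per-query identifier comes
straight from a CSPRNG, never from stable hardware identifiers like a MAC
address.

    import secrets

    def fresh_query_id():
        # 128 bits of randomness, regenerated for every query
        return secrets.token_hex(16)

    print(fresh_query_id())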

------
desc
As others have commented, the problem here is the ranking algorithm and how it
can be gamed. Essentially, trust.

'Web of trust' has its flaws too: a sufficiently large number of malicious
nodes cooperating can subvert the network.

However, maybe we can exploit locality in the graph? If the user has an easy
way to indicate the quality of results, and we cluster the graph of relevance
sources, the barrier to subverting the network can be raised significantly.

Let's say that each ranking server indicates 'neighbours' which it considers
relatively trustworthy. When a user first performs a search their client will
pick a small number of servers at random, and generate results based on them.

* If the results are good, those servers get a bit more weight in future. We can assume that the results are good if the user finds what they're looking for in the top 5 or so hits (varying depending on how specific their query is; this would need some extra smarts).

* If the results are poor (the user indicates such, or tries many pages with no luck) those servers get downweighted.

* If the results are actively malicious (indicated by the user) then this gets recorded too...

There would need to be some way of distributing the weightings based on what
the servers supplied, too. If someone's shovelling high weightings at us for
utter crap, they need to get the brunt of the downweighting/malice markers.

Servers would gain or lose weighting and malice based on their advertised
neighbours too. Something like PageRank? The idea is to hammer the _trusting_
server more than the _trusted_, to encourage some degree of self-policing.

Users could also choose to trust others' clients, and import their weighting
graph (but with a multiplier).

Every search still includes random servers, to try to avoid getting stuck in
an echo chamber. The overall server graph could be examined for clustering and
a special effort made to avoid selecting more than X servers in a given
cluster. This might help deal with malicious groups of servers, which would
eventually get isolated. It would be necessary to compromise a lot of
established servers in order to get enough connections.

Of course, then we have the question of who is going to run all these servers,
how the search algorithm is going to shard efficiently and securely, etc etc.

Anyone up for a weekend project? >_>
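
For the curious, here is a very rough sketch of the client-side bookkeeping
described above. It assumes nothing about the wire protocol, and all the
constants (boost/penalty factors, number of servers picked) are arbitrary
placeholders.

    import random

    weights = {}  # server -> trust weight, defaulting to 1.0

    def pick_servers(known_servers, k=5):
        # Mostly weight-biased picks, plus one random server to avoid echo chambers.
        biased = random.choices(
            known_servers,
            weights=[weights.get(s, 1.0) for s in known_servers],
            k=k - 1)
        return set(biased) | {random.choice(known_servers)}

    def record_feedback(server, outcome):
        w = weights.get(server, 1.0)
        if outcome == "good":          # user found a hit near the top
            weights[server] = w * 1.1
        elif outcome == "poor":        # user gave up or paged through many results
            weights[server] = w * 0.8
        elif outcome == "malicious":   # explicitly flagged by the user
            weights[server] = w * 0.1

    servers = ["a.example", "b.example", "c.example", "d.example", "e.example"]
    record_feedback("a.example", "good")
    print(pick_servers(servers))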

------
gist
This is too broad a question to answer. There are really too many different
uses of the Internet to try and fashion a solution that works in all areas.
Not to mention the fact that it's too academic to begin with. How do you get
such a large group of people to change a behavior that works for them already?
And very generally, most people are not bothered by the privacy aspect as much
as tech people (always whining about things) are, or even the media. People
very generally like that they can get things at no cost and don't (en masse)
care anywhere near as much about being tracked as you have been led to
believe. And that's when tracking is not even benefiting them, which it often
is. This is not 'how can we eliminate robocalls'. It's not even 'how can we
eliminate spam'.

------
Havoc
Seems unlikely. Search engines solve a key problem.

To me they are conceptually not the problem. Nor is advertising.

This new wave of track-you-everywhere-with-AI search engines is an issue,
though. They've taken it too far, essentially.

Instead of respectable fishing they've gone for kilometer-long trawling nets
that leave nothing in their wake.

------
hideo
This isn't an entire solution, but Van Jacobson's Content-Centric Networking
concept is fascinating, especially when you consider its potential social
impact compared to the way the internet exists today.

[https://www.cs.tufts.edu/comp/150IDS/final_papers/ccasey01.2...](https://www.cs.tufts.edu/comp/150IDS/final_papers/ccasey01.2/FinalReport/FinalReport.html)
[http://conferences.sigcomm.org/co-next/2009/papers/Jacobson.pdf](http://conferences.sigcomm.org/co-next/2009/papers/Jacobson.pdf)

------
munchausen42
To get rid of search engines like Google and Bing we don't need to build a new
internet - we just need to build new search engines.

E.g., how about an open source spider/crawler that anyone can run on their own
machine, continuously contributing towards a distributed index that can be
queried in a p2p fashion (kind of like SETI@home, but for stealing back the
internet).
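
A toy version of what each volunteer node might do; publish_postings() is a
stand-in for the actual p2p index layer, which is of course the hard part.

    import re
    import urllib.request
    from collections import defaultdict

    def crawl_once(url):
        html = urllib.request.urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        text = re.sub(r"<[^>]+>", " ", html)        # crude tag stripping
        postings = defaultdict(set)                 # term -> set of urls
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            postings[term].add(url)
        return postings

    def publish_postings(postings):
        pass  # hypothetical: hand the postings to the shared distributed index

    publish_postings(crawl_once("https://example.com/"))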

Just think about all the great things that researchers and data scientists
could do if they had access to every single public Facebook/Twitter/Instagram
post.

Okayokay ... also think about what Google and FB could do if they could access
any data visible to anyone (but let's just ignore that for a moment ;)

~~~
asdff
You know Google has been crawling for years and has probably already accessed
any public data.

------
nonwifehaver3
Yes, out of sheer necessity. Search results have become either a crapshoot
when looking for commercially adjacent content due to SEO, or “gentrified”
when looking for anything even remotely political, obscure, or controversial.
Google used to feel like doing a text search of the internet, but it sometimes
acts like an apathetic airport newsstand shopkeeper now (& with access to only
the same books and magazines).

Due to this I think people will have to use site-specific searches,
directories, friend recommendations, and personal knowledge-bases to discover
and connect things instead of search engines.

------
cy6erlion
I think there are only two options:

1) Have an index created by a centralized entity like Google

2) Have the nodes in the network create the index

The first option is the easiest, but it can be biased in terms of who gets to
be on the index and their position on it.

Option two is hard because we need some sort of mechanism to generate the
index from the subjective view of the nodes in the network and sync it to
everyone in the network.

The core problem here is not really the indexing but the structure of the
internet: domains/websites are relatively dumb, they cannot see the network
topology, and indexing is basically trying to create this topology.

------
JD557
You could use something like Gnutella[1], where you flood the network with
your query request and that request is then passed along nodes.

Unfortunately (IIRC and IIUC how Gnutella works), malicious actors can easily
break that query scheme: just reply to all query requests with your malicious
link. I believe this is how pretty much every query in old Gnutella clients
returned a bunch of fake results that were simply `search_query + ".mp3"`.

1:
[https://en.wikipedia.org/wiki/Gnutella](https://en.wikipedia.org/wiki/Gnutella)
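
Concretely, the failure mode looks something like this (toy code, message
format invented): an honest node answers only when it has matching content,
while a spam node can claim a hit for every query it sees.

    def honest_node(query, local_files):
        return [f for f in local_files if query.lower() in f.lower()]

    def spam_node(query):
        return [query + ".mp3"]   # "matches" everything, the old Gnutella trick

    print(honest_node("free bsd iso", ["FreeBSD-12.iso", "holiday.jpg"]))
    print(spam_node("free bsd iso"))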

------
quickthrower2
Search engines are not required: there are directories out there with tonnes
of links. It is just that search engines are damn convenient. And Google's
search is light years ahead of any website's own search.

~~~
barrystaes
I turn to search engines mostly when entering a new knowledge domain! E.g. to
learn about a specific product. Sometimes when I'm just lazy.

For routine stuff I tend to have established resource starting points, like
documentation, official/community sites, blogs/news feeds, and yes: link
directories (like awesome lists).

------
oever
The EU has a funding call open for Search and Discovery on the Next Generation
Internet.

[https://nlnet.nl/discovery/](https://nlnet.nl/discovery/)

------
inputcoffee
It was thought that one way of finding information is to ask your network
(Facebook and Twitter would be examples), and then they would pass on the
message and a chain of trusted sources would get the information back to you.

I am being purposefully vague because I don't think people know what an
effective version of that would look like, but it's worth exploring.

If you have some data you might ask questions like:

1. Can this network reveal obscure information?

2. When -- if ever -- is it more effective than indexing by words?

~~~
dbspin
This seems significantly laborious. Not sure that the utility of this kind of
network recommendation scales to incentivise participation beyond a few
people. i.e.: We already have user groups on sites like reddit, FB etc, where
experts or enthusiasts answer questions when they feel like it. But this is a
slow process that relies on a group that contains enough distributed
knowledge, but isn't overwhelmed with inquiries. As a counter example, the
/r/BuildaPC subreddit long ago exceeded the size where it could answer a
significant proportion of build questions, and most remain unanswered despite
significant community engagement.

Not convinced any kind of formalised 'question answering network' could
replace search. It would be both slow, and require an enormous asymmetric
investment of time, for a diffuse and unspecified reward.

~~~
inputcoffee
I don't think it would be questions.

Suppose you like fountain pens, and you recommend certain ones. One of your
friends looks for fountain pens that their friends recommend and finds the
ones you like.

That is just one example of things that don't require explicit questions.

Another one might be that you have searched for books or other things and then
they follow the same "path". So long as you have similar interests it might
work.

People haven't solved this issue, but there is a lot of research out there on
networks of connections potentially replacing certain kinds of search.

------
ninju
I find myself not needing to do a 'generic' Internet search that much anymore.

For long-term facts and knowledge lookup: Wikipedia pages (with proper
annotation)

For real-time world happenings: a mix of direct news websites

For random 'social' news: <-- the only time I do a direct Google/Bing/DDG
search

The results from the search engines nowadays are so filled with (labeled)
promoted results and (un-labeled) SEO results that I have become cynical and
jaded about the value of the results.

------
jka
There'd be a feedback loop problem, but are DNS query logs a potential source
of ranking/priority?

Over time the domains that users genuinely organically visit (potentially geo-
localized based on client location) should rise in query volume.

Caveats would include DNS record cache times, lookups from robots/automated
services, and no doubt a multitude of inconsistent client behavior oddities.

A similar approach could arguably be applied even at a network connection log
level.
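
A back-of-the-envelope sketch of the idea. The log format (one "timestamp
client domain" line per query) and the bot filter are assumptions made purely
for illustration.

    from collections import Counter

    BOT_CLIENTS = {"10.0.0.99"}  # e.g. known crawler/monitoring hosts

    def domain_popularity(log_lines):
        counts = Counter()
        for line in log_lines:
            _, client, domain = line.split()
            if client not in BOT_CLIENTS:
                counts[domain] += 1
        return counts.most_common()

    sample = ["1581000000 10.0.0.5 example.org",
              "1581000001 10.0.0.6 example.org",
              "1581000002 10.0.0.99 spammy.example"]
    print(domain_popularity(sample))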

------
mahnouel
Maybe I'm missing the point. But Instagram, Facebook, Twitter - all of them
are not mainly experienced through search but through a feed of endless
content, curated by an algorithm. Most regular users don't even search that
often; they consume. Maybe there could be a decentralized Internet where you
follow specific handles and they bring their content into your main
"Internet", aka a feed (= user-friendlier RSS).

------
z3t4
An idea I've had for a long time is a .well-known/search standard (REST)
endpoint, where your browser, or a search aggregator, combines results from
many sites like Stack Overflow, MDN, news sites, individual blogs, etc. That
way search engines don't have to create an index; it would be up to the sites
to create the search results. This means searching would be parallel and
distributed.
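
The aggregator side might look roughly like this. The endpoint path, query
parameter and JSON response shape are all assumptions; no such standard exists
today.

    import json
    import urllib.parse
    import urllib.request

    SITES = ["https://stackoverflow.com", "https://developer.mozilla.org"]

    def federated_search(query):
        results = []
        for site in SITES:
            url = site + "/.well-known/search?q=" + urllib.parse.quote(query)
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:
                    results.extend(json.load(resp))  # assume a JSON list of hits
            except Exception:
                continue                             # a dead site just drops out
        return results

    print(federated_search("zip utf-8 text"))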

------
epynonymous
my ideal internet would be more like a set of concentric rings per user, a
ring would represent different preferences, filters, and data, i could choose
to include certain users access to parts of my rings, and i could access other
parts of other user's rings. obviously there should be an open ring that every
user can access which would need a search engine run by a company or set of
companies, this would be like today's internet, but that would not be the same
ring, i could switch between rings with ease. i think this may be somewhat
like what tim berners lee is doing with the decentralized web, or perhaps bits
of dark net interwoven with the internet.

an example use case would be like a set of apps that my family could use for
photo sharing, messaging, sending data, links to websites, etc. perhaps
another set of apps for my friends, another for my company, or school. the
protocols would not require public infrastructure, dns, etc. perhaps tethering
of devices would be enough. there would be a need for indexing and search,
email, etc.

------
sktrdie
I feel like Linked Data Fragments provides a solution to this:
[http://linkeddatafragments.org/](http://linkeddatafragments.org/)

You're effectively crawling portions of the web based on your query, at
runtime! It's a pretty neat technique. But you obviously have to trust the
sources and the links to provide you with relevant data.

------
Johny4414
What about Xanadu? The Internet is very broken but almost no one seems to care
(for a reason). The idea of a more p2p web has been around for a while, but at
the end of the day users don't care too much about anything, so it will
probably never happen.

[https://en.wikipedia.org/wiki/Project_Xanadu](https://en.wikipedia.org/wiki/Project_Xanadu)

~~~
coldtea
Xanadu assumes good players. It will be decimated by the very first spammer /
advertiser that appears...

It's a vision for an academic, small scale, network, not for a viable global
web.

------
CapitalistCartr
I've said this before: I dearly miss Alta Vista. It indexed, but the user had
to provide the ranking, which required actually thinking about what was
wanted. I would construct searches of the pattern (word OR word) And (word
NEAR word) with great success. Naturally Google, requiring far less thinking
to use, won.

------
politician
Lately, I've been turning over an idea that in order to advance, the next
generation of the Internet should be designed so that third-party advertising
is impossible to implement. I believe that as a consequence, this requirement
will prevent crawler-based search engines from operating which presents a
source discovery problem.

Discovering new sources of information in this kind of environment is
difficult, and basically boils down to another instance of the classic key
distribution problem - out-of-band, word-of-mouth, and QR codes.

Search engines like Google and Bing solve the source discovery problem by
presenting themselves as a single source; aggregating every other source
through a combination of widespread copyright infringement and an opaque
ranking algorithm.

Google and Bing used to do a great job of source discovery, but the quality of
their results has deteriorated under relentless assaults from SEO and Wall
Street.

I think it's time for another version of the Internet where Google is not the
way that you reach the Internet (Chrome) or find what you're looking for on
the Internet (Search) or how you pay for your web presence (Adsense).

------
BerislavLopac
We already have it, and it's called BitTorrent. DNS as well.

What you call the Internet is actually the World Wide Web, just another
protocol (HTTP) on top of the Internet (TCP/IP), which was designed to be
decentralised but lacked any worthwhile discovery mechanism before two
students designed the BackRub protocol.

------
wsy
To everybody who wants to tackle this challenge: start by considering how you
would protect your 'new internet' against SPAM and SEO attacks.

For example, if you build on a decentralized network, ask yourself how you can
prevent SEO companies from adding a huge amount of nodes to promote certain
sites.

------
rayrrr
There have been a few mentions of the PageRank algorithm already... FWIW, Google's
patent just expired.
[https://patents.google.com/patent/US6285999B1/en](https://patents.google.com/patent/US6285999B1/en)

------
qazpot
See Ted Nelson's Xanadu Project -
[https://en.wikipedia.org/wiki/Project_Xanadu#Original_17_rul...](https://en.wikipedia.org/wiki/Project_Xanadu#Original_17_rules)

Point 4 allows a user to search and retrieve documents on the network.

------
hayksaakian
If you look at usage patterns, social media has replaced search engines for
many use cases.

For example, if you want to know where to eat tonight, instead of searching
"restaurants near me" you might ask your friends "where should I eat tonight"
and get personalized suggestions.

------
weliketocode
Your two points really don’t fit with your follow-up explanation.

If you don’t believe finding information is currently trivial using Google,
that’s going to be a tough nut to crack.

What would you use for information retrieval that doesn’t involve indexing or
a search engine?

------
garypoc
We would still need search engines, but we could change the business model.
For example, we could make a protocol to associate URLs with content and
search keywords: something similar to DNS combined with distributed
Elasticsearch servers.

------
lowcosthostings
The good one post which you have to share.
[https://www.lowcostwebhostings.com/dealstore/webhostingpad](https://www.lowcostwebhostings.com/dealstore/webhostingpad)

------
fooker
I'll be pessimistic here and say no, that is an impossible pipe dream. For any
such system design you can come up with, a centralized big-brother-controlled
system will be more efficient and have a better user experience.

------
siliconc0w
You could make a browser plugin that effectively turned everyone into a spider
that sent new chunks of the index to some decentralized blockchain-esque
storage system for all to query, with its own blockchain-esque micropayments.

------
tmaly
I think once really good AI becomes a commodity and can fit in your phone, and
once we have really fast 5G (or whatever comes next) networks, there is a good
possibility that some type of distributed mesh search solution could replace
the big players.

------
Advaith
I think this is the long game with respect to blockchains and establishing
trust in general.

You will be able to trust data and sources instantly. There will be no
intermediaries and trust will be bootstrapped into each system.

------
blackflame7000
What if we just make a program that Googles a bunch of random stuff constantly
so that there is so much garbage in their algorithms that they can't
effectively figure out real vs synthetic searches.
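
A toy version of that idea, in the spirit of the TrackMeNot browser extension:
fire off random decoy queries on a random schedule so real ones are harder to
profile. Purely illustrative; the decoy list and pacing are made up.

    import random
    import time
    import urllib.parse
    import urllib.request

    DECOYS = ["cast iron skillet care", "1972 olympics results",
              "how do glaciers form"]

    def send_decoy():
        q = urllib.parse.quote(random.choice(DECOYS))
        urllib.request.urlopen("https://duckduckgo.com/html/?q=" + q, timeout=10)

    while True:
        send_decoy()
        time.sleep(random.uniform(60, 600))  # random pacing, or it's trivially filtered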

------
nobodyandproud
We need an alternative internet where anonymity between two parties is
impossible.

Not a place for entertainment, but where government or business transactions
can be safely conducted.

A search engine would be of secondary importance.

------
reshie
i guess if we had a highly regulated web with one site for one type of service
it would be possible, but i would not really want that. you could have an
algorithm that would parse your query and send you directly to a site; of
course it could get it wrong, and you may need to refine your query, just like
now sometimes. of course that's still a search engine, just more direct.
bookmarks are already a form of web without re-searching.

it sounds like what you really want is a decentralized search engine that is
anonymous by default, as opposed to no search engine.

~~~
orky56
AOL [1], among others, did exactly that in the early days. Without using a
search engine, you could access whatever type of content you wanted. Similar
to a communism vs capitalism argument, you don't get quite the same amount of
variety, but you trade that for instant access to what you need.

[1] [https://www.trbimg.com/img-5320a78f/turbine/orl-0312aol-19960829](https://www.trbimg.com/img-5320a78f/turbine/orl-0312aol-19960829)

------
paparush
We could go back to Gopher.

------
Papirola
I still remember gopher
[https://en.wikipedia.org/wiki/Gopher_(protocol)](https://en.wikipedia.org/wiki/Gopher_\(protocol\))

~~~
Jaruzel
Some of us still use it...

Shameless plug: [http://www.jaruzel.com/gopher/gopher-client-browser-for-windows](http://www.jaruzel.com/gopher/gopher-client-browser-for-windows)

------
Isamu
That was the original Internet. Search engines evolved to make finding things
possible.

Another original intent: that URLs would not need to be user-visible, and you
wouldn't need to type them in.

------
truckerbill
We could try and revive and improve the web-ring concept. Or, more simply,
convince the community to dedicate a page of their site to linking to other
related/relevant sites.

~~~
pbhjpbhj
Webrings are still there, they're just implicit. People link within their
content to the same resources over and over, or have more explicit footer
blocks or aside link stacks.

Search engines use this structure for domain authority.

A search for "link:example.com -site:example.com" would have found that
webring in the past.

------
thedevindevops
You want to create another
[https://en.wikipedia.org/wiki/Deep_web](https://en.wikipedia.org/wiki/Deep_web)
?

------
ken
Is this the same as asking if we can create a telephone system with no phone
books, or a city with no maps? Where is our shared understanding of the
system's state?

------
kazinator
Can you walk through a complete use case?

A user wants to find a "relevant document".

What is that? What information does the user provide to specify the document?

Why does the user trust the result?

------
bitL
How can I help? I dumped most centralized solutions in favor of self-hosted
(mostly ActivityPub-based) services and still can't get rid of search.

------
comboy
I'm too late, but yes, it is not easy but it definitely seems doable:
[http://comboy.pl/wot.html](http://comboy.pl/wot.html)

I'm sorry it's a bit long. TL;DR: you need to be explicit about people you
trust. Those people do the same, and then thanks to the small-world effect you
can establish your trust in any entity that is already trusted by some people.

No global ranking is the key. How good some information is, is relative and
depends on whom you trust (which is basically a form of encoding your
beliefs). And yes, you can avoid the information bubble much better than now,
but writing more when I'm so late to the thread seems a bit pointless.
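
A minimal sketch of the relative-trust idea, with trust decaying at every hop
so there is no global score, only a score relative to your own trusted set
(all names and numbers invented):

    TRUST = {                      # who directly trusts whom, and how much (0..1)
        "me":    {"alice": 0.9, "bob": 0.6},
        "alice": {"carol": 0.8},
        "bob":   {"carol": 0.3, "dave": 0.7},
    }

    def trust_in(target, source="me", depth=3, seen=frozenset()):
        if target == source:
            return 1.0
        if depth == 0 or source in seen:
            return 0.0
        best = 0.0
        for friend, weight in TRUST.get(source, {}).items():
            best = max(best, weight * trust_in(target, friend, depth - 1, seen | {source}))
        return best

    print(trust_in("carol"))   # 0.72, via alice rather than bob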

------
FPurchess
I wonder if we could rearrange the internet as decentralised nodes exchanging
topic maps which could then be queried in a p2p fashion.

~~~
dredds
I've always imagined a kind of cross between Solid (ontology mapping) and
ZeroNet (seed hosted), with perhaps Dat for social (mutability), where crowd
navigation determines the relations as feedback (the original PageRank was a
simplified version of such).

------
otabdeveloper4
Yes, it's called "Facebook", and it already exists.

Probably not what you had in mind, though. Be careful what you wish for.

------
xorand
Two-way links would help. I can't locate the information now but it seems that
it was proposed initially.

------
robot
It is a huge problem. It's not possible to fix it some other way without
putting in the same effort.

------
buboard
Didn't we? It's called "ask your friends". It's a great way to turn your
friends into enemies.

------
ISNIT
Maybe we should all just learn a graph query language and live on WikiData ;)

------
amelius
Are any academic groups still researching search engines?

------
ptah
I guess nowadays the web IS the internet

------
sys_64738
Yes because nobody will be using it.

------
peterwwillis
tl;dr the problems are 1) relevancy, 2) integrity, 3) content
management/curation.

If you've ever tried to maintain a large corpus of documentation, you realize
how incredibly difficult it is to find "information". Even if I know exactly
what I want.... where is it? With a directory, if I've "been to" the content
before, I can usually remember the path back there... assuming nothing has
changed. (The Web changes all the time) Then if you have new content... where
does it go in the index? What if it relates to multiple categories of content?
An appendix by keyword would get big, fast. And with regular change, indexes
become stale quickly.

OTOH, a search engine is often used for documentation. You index it regularly
so it's up to date, and to search you put in your terms and it brings up
pages. Problem is, it usually works poorly because it's a simple search engine
without advanced heuristics or PageRank-like algorithms. So it's often a
difficult slog to find documentation (in a large corpus), because managing
information is hard.

But if what you actually want is just a way to look up domains, you still need
to either curate an index, or provide an "app store" of domains (basically a
search engine for domain names and network services). You'd still need some
curation to weed out spammers/phishers/porn, and it would be difficult to find
the "most relevant" result without a PageRank-style ordering based on most
linked-to hosts.

What we have today is probably the best technical solution. I think the
problem is how it's funded, and who controls it.

------
fergie
Author of the npm module search-index here.

"1- Finding information is trivial"

The web already consists, for the most part, of marked-up text. If speed is
not a constraint, then we can already search through the entire web on demand;
however, given that we don't want to spend 5 years on every search we carry
out, what we really need is a SEARCH INDEX.

Given that we want to avoid Big Brother like entities such as Google,
Microsoft and Amazon, and also given, although this is certainly debatable,
that government should stay out of the business of search, what we need is a
DECENTRALISED SEARCH INDEX.

To do this you are going to need AT THE VERY LEAST a gigantic reverse index
that contains every searchable token (word) on the web. That index should
ideally include some kind of scoring so that the very best documents for, say,
"banana" come at the top of the list for searches for "banana" (You also need
a query pipeline and an indexing pipeline, but for the sake of simplicity,
let's leave that out for now).

In theory a search index is very shardable. You can easily host an index that
is in fact made up of lots of little indexes, so a READABLE DECENTRALISED
SEARCH INDEX is feasible, with the caveat that relevancy would suffer, since
relevancy algorithms such as TF-IDF and PageRank generally rely on an
awareness of the whole index, and not just an individual shard, in order to
calculate scores.

Therefore a READABLE DECENTRALISED SEARCH INDEX WITH BAD RELEVANCY is
certainly doable although it would have Lycos-grade performance circa 1999.
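
A tiny model of the "lots of little indexes" idea, just to make the shape
concrete: each shard scores locally (plain term frequency here), and the
client merges. As noted above, per-shard scoring with no global IDF is exactly
what hurts relevancy.

    from collections import defaultdict

    class Shard:
        def __init__(self, docs):                  # docs: {url: text}
            self.index = defaultdict(lambda: defaultdict(int))
            for url, text in docs.items():
                for term in text.lower().split():
                    self.index[term][url] += 1     # plain term frequency

        def query(self, term):
            return dict(self.index.get(term.lower(), {}))

    def search(term, shards):
        merged = defaultdict(int)
        for shard in shards:                       # the "map" step
            for url, score in shard.query(term).items():
                merged[url] += score               # the "reduce" step
        return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)

    shards = [Shard({"a.example/banana": "banana banana bread"}),
              Shard({"b.example/fruit": "banana nutrition facts"})]
    print(search("banana", shards))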

CHALLENGES:

1) Populating the search index will be problematic. Who does it, how they get
incentivized/paid, and how they are kept honest is a pretty tricky question.

2) Indexing pipelines are very tricky and require a lot of work to do well.
There is a whole industry built around feeding data into search indexes. That
said, this is certainly an area that is improving all the time.

3) How the whole business of querying a distributed search index would
actually work is an open question. You would need to query many shards, and
then do a Map-Reduce operation that glues together the responses. It may be
possible to do this on users devices somehow, but that would create a lot of
network traffic.

4) All of the nice, fancy schmancy latest Google functionality unrelated to
pure text lookup would not be available.

"2- You don't need services indexing billions of pages to find any relevant
document"

You need to create some kind of index, but there is a tiny sliver of hope that
this could be done in a decentralized way without the need for half a handful
of giant corporations. Therefore many entities could be responsible for their
own little piece of the index.

------
sonescarol
l

------
sonnyblarney
G is apparently losing a lot of product-related search to Amazon. I suggest
that the 'siloing' of the web, for better or worse, might yield some progress
here.

i.e. when you search, you start in a relevant domain instead of Google:
Amazon for products, Stack Exchange for CS questions.

Obviously not ideal either.

------
diminoten
No. Search is a consequence of data volume.

------
wfbarks
a New New Internet

------
codegladiator
No

------
nojobs
Also we should keep hiding from big brothers to save our data from companies
and government and pay for it. VPN I mean. But first you need to find a proper
one, I waste enough time on it. [https://vpn-review.com/found](https://vpn-
review.com/found) one here

------
drenvuk
Finding information has never been trivial and until you can read people's
minds to see what they really mean when they search for 'cookies' when they
really mean "how to clear my internet browsing history for the past hour" it
will continue to be non-trivial. The work Google has done in the search space
is damn near magical. Your question belittles the literal billions of dollars
and millions of man hours that have gone into making the current and previous
implementations of Google's search engine _almost good enough_.

This is not simple, and your Ask HN reeks of ideology and contempt without so
much as an inkling of the technical realities that would have to be overcome
for such a thing to happen. That goes for both old and new internet.

/rant

~~~
boblebricoleur
> Your question belittles the literal billions of dollars and millions of man
> hours that have gone into making the current and previous implementations of
> Google's search engine

I don't think this question belittles Google's work.

I feel saying that would be like saying that animals that chose to live on the
land were belittling millions of years of evolution in the water.

People working at Google chose to spend their time building a search engine
for the world wide web, fine. That does not mean that sharing information
across a network has to be done via { world wide web, google }.

All of this is purely theoretical of course, but I'm sure someone more
creative than me would find another solution. Maybe not a solution that would
exactly fit OP's description, maybe not a solution that would be practical
with the current infrastructure.

But a solution that would render Google as-is obsolete? Yes, I think that
would be possible.

