
The Web is missing an essential part of infrastructure: an open web index - Rumperuu
https://arxiv.org/abs/1903.03846
======
weinzierl
Isn't this what Common Crawl[1] is? From their FAQ:

> What is Common Crawl?

> Common Crawl is a 501(c)(3) non-profit organization dedicated to providing a
> copy of the internet to internet researchers, companies and individuals at
> no cost for the purpose of research and analysis.

> What can you do with a copy of the web?

> The possibilities are endless, but people have used the data to improve
> language translation software, predict trends, track the disease propagation
> and much more.

> Can’t Google or Microsoft just do that?

> Our goal is to democratize the data so everyone, not just big companies, can
> do high quality research and analysis.

Also DuckDuckGo founder Gabriel Weinberg expressed the sentiment that the
index should be separate from the search engine many years ago:

> Our approach was to treat the “copy the Internet” part as the commodity. You
> could get it from multiple places. When I started, Google, Yahoo, Yandex and
> Microsoft were all building indexes. We focused on doing things the other
> guys couldn’t do. [2]

From what I remember reading, though, DuckDuckGo doesn't use Common Crawl.

[1] [https://commoncrawl.org/](https://commoncrawl.org/)

[2]
[https://www.japantimes.co.jp/news/2013/07/28/business/duckdu...](https://www.japantimes.co.jp/news/2013/07/28/business/duckduckgo-chief-spills-on-search-engine-wars/)

~~~
AznHisoka
I don't believe Common Crawl offers a _real time_ search index, as it's delayed
by more than a month (although that could have changed recently). Still useful
for research purposes, but not that desirable for a search engine that competes
with Google, etc.

~~~
burtonator
This is literally why I created my company:

[http://www.datastreamer.io/](http://www.datastreamer.io/)

We've been around for about a decade. IBM Watson used us as their social data
provider during Jeopardy. We provide data to tons of companies, and you're
probably using our services - it's just not obvious where we're used, since
it's SaaS B2B and not B2C.

We're not free but the primary reason we exist is that other vendors charge
borderline extortionate pricing and I fundamentally believe that the web MUST
remain open.

We've also been providing data for very affordable pricing to researchers for
more than a decade.

Search for us under Google Scholar as Spinn3r (our previous name); hundreds
and hundreds of PhDs have access to our data.

We do charge for research usage now but it's very very very affordable.

The entire point is that we're trying to enable innovation.

~~~
Sir_Cmpwn
This doesn't make any sense. You talk about open data but yours is the
opposite. You're just another commercial data hoarder, please don't act like
you're not.

~~~
sytelus
You are confusing free with open. You can be open without being free.
Maintaining a web index is extremely expensive. Imagine storing most of the web
on your own servers and serving it. Someone has to pay the bills for all that
disk space and bandwidth. I don't think a web index will ever be free (unless
storage, compute and bandwidth were free), but having it reasonably priced is a
very good thing. I would hope these indices become available on AWS, Azure,
etc., where people can just use them with cloud compute and pay per use.

~~~
profalseidol
> I don’t think web index would ever be free

Yet the company first mentioned does it for free, lol:

[https://commoncrawl.org/](https://commoncrawl.org/)

I checked Datastreamer.io for 5 seconds and I don't see any link to their
repo. If it's not "open source", then what does "open" mean?

~~~
ksahin
Common Crawl is not a company, it's a non-profit. Open means you can access
the data; there is no assumption about whether the data is free.

~~~
warent
What? It's a nonprofit organization engaging in nonprofit business. Any
organization that engages in business is a "company." Common Crawl is a
company. Your comment isn't accurate and it doesn't address the parent's
comment.

------
netborn
There are two entities trying to pull this off:

Common Crawl (non-profit): Stores regular, broad, monthly crawls as WARC
files. Provides a separate index that can be used to look data up (not a
full-text index though). Used mostly in academia.

Mixnode (for-profit): Regularly crawls the web and lets users write SQL
queries against the data. Not sure who the primary users are since it's in
private beta.

There are some search engine APIs, but I don't think the conflict of interest
would allow for cost-effective large-scale access and pricing...
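
As an aside on the Common Crawl side: the lookup index mentioned above is
already queryable over HTTP. A rough sketch (this assumes the
index.commoncrawl.org CDX endpoint and a 2019 crawl ID; substitute whatever
crawl is current):

    # Query the Common Crawl URL index for captures of a URL pattern.
    import json
    import requests

    def cc_lookup(url_pattern, crawl="CC-MAIN-2019-09"):
        resp = requests.get(
            f"https://index.commoncrawl.org/{crawl}-index",
            params={"url": url_pattern, "output": "json"},
        )
        resp.raise_for_status()
        # One JSON object per line; each record points into a WARC file.
        return [json.loads(line) for line in resp.text.splitlines()]

    for record in cc_lookup("commoncrawl.org/*")[:5]:
        print(record["timestamp"], record["url"], record["filename"])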

~~~
wongarsu
> but I don't think the conflict of interest would allow for cost-effective
> large-scale access and pricing

Not for existing search engine providers, but I think there is room for new
players to do this at large scale. Imagine an AWS service that offers
high-performance access to crawled data, as well as a number of indexes and a
fairly simple search engine using this data. That would commoditize one of
Google's biggest advantages, and anyone could, at least in principle, run
their own search engine from the data. Because the market for this is much
wider than traditional search engines, just providing the data and indices for
a pay-as-you-go fee could still be very profitable.

------
mlinksva
Could the Internet Archive, specifically
[https://web.archive.org/](https://web.archive.org/) be the basis of an Open
Web Index as proposed by the author?

I'm sure there are tons of obstacles to that path, but it also would be far
ahead of any new initiative in at least two ways: it already has a huge index
and ingestion pipeline, and it is a trusted organization.
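
It also already exposes a queryable index of its captures. A minimal sketch
against the Wayback Machine's public CDX API (field names come from the
header row it returns, so treat that row as authoritative):

    # List recent Wayback Machine captures of a URL via the CDX API.
    import requests

    def wayback_captures(url, limit=10):
        resp = requests.get(
            "https://web.archive.org/cdx/search/cdx",
            params={"url": url, "output": "json", "limit": limit},
        )
        resp.raise_for_status()
        rows = resp.json()
        if not rows:
            return []
        header, captures = rows[0], rows[1:]
        return [dict(zip(header, row)) for row in captures]

    for cap in wayback_captures("archive.org"):
        print(cap["timestamp"], cap["original"], cap["statuscode"])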

~~~
mbay
yes - I worked on this a bit with Mark Graham, the director of the Wayback
Machine

~~~
mlinksva
thx - can you say more?

~~~
mbay
Not too much really. It's a big interest of Mark's, but it's still early in
the planning stages. I helped him with some preliminary research and gave this
brief talk about our work:
[https://www.ischool.berkeley.edu/events/2018/facilitating-di...](https://www.ischool.berkeley.edu/events/2018/facilitating-diverse-collection-and-curation-web-crawling-and-indexing-and-blockchain)

------
ilaksh
It seems like the paper is recommending the Open Web Index (which has its own
website).

I like a modified version of this. I think that it should be a p2p technology
and not try to create one meta-index but rather be many domain-specific ones,
with one or more tools or DBs to select which indices to search given a
query/context.

Are there any decentralized alternatives to Google out there already?

I also think this overlaps with the idea of moving from a server-centric
internet to a content-centric internet.

~~~
vpzom
Maybe [https://www.yacy.net](https://www.yacy.net)?

~~~
ilaksh
Thanks! I installed it. It seems like exactly the right concept, but the
results for the terms that I tested with were horrible.

EDIT: I waited a few minutes and now the results are MUCH better! I think I
just needed to let it connect to more peers or something.

------
mschuster91
While I like the idea, I fear the potential for abuse, conflict and community
splits. It will need some sort of moderation, at least to prevent:

1. spam

2. child pornography

3. illegal content

The only thing that is easy to define as policy is #2. No one likes child
porn. But even then, there are grey areas with differing legal status -
lolicon on the anime side and "barely legal" on the realistic side, plus CGI.

Spam: I'd flag all commercial advertising as spam; others would hesitate to
block even Viagra spammers.

Then the final category: illegal content. The US doesn't like nipples. Germany
has no problem with nipples. Swastikas and other Nazi insignia? The other way
around. Some post-Soviet states have banned the hammer and sickle or the red
star. Some countries have extremely strict libel laws; others have
non-existent libel laws. In some countries (hello Germany) even _linking_ to
illegal content can get you thrown into jail; in others it cannot.

And finally: who should pay for the operational costs of such an index?
Wikipedia only works because contributors worldwide donate _enormous_ amounts
of time to it, and Wikipedia has only a fraction of the content that YouTube
and Twitter create, and Facebook is orders of magnitude bigger still.

~~~
WaltPurvis
The proposal is for a publicly funded index as base-level infrastructure.

Filtering out spam, pornography, and other undesirable or illegal content
would be done at the service level, i.e., by companies/organizations building
user-facing search applications on top of the index.

~~~
skybrian
And suppose someone builds a service specifically to find illegal content?
There will be pressure to block them and also remove stuff from the index. So
you need a policy on who gets blocked and that's just as political.

------
kickscondor
There are lots of niche directories out there - if you consider Reddit wikis,
"awesome" lists and so on.

A few of us out there are also working on small directories:

* [https://href.cool](https://href.cool) (mine)

* [https://indieseek.xyz](https://indieseek.xyz)

* [https://iwebthings.com](https://iwebthings.com)

The thought is that you can actually navigate a small directory - they don't
need to be five levels deep - and a network of these could rival a huge
directory while avoiding centralization, editor wars, and a single point of
failure.

~~~
Biganon
Crimes... Lies... Disneyland... Weird cryptic list of the attractions...
Excuse me but what am I reading?

------
russellbeattie
The web needs to be forked into two distinct standards: One for dynamic
content, and one for documents. The first would use basically everything in
the HTML5/CSS/JS toolbox, and the second would be more akin to AMP, but for
all docs.

The benefits of this would be a standard for Wysiwyg editors (goodbye million
rich text editor projects, Markdown and even Microsoft Word), and more
semantic markup for both search engines and accessibility.

Right now it takes millions of man-hours to create a performant browser, which
limits those engines to only the largest organizations. Even Microsoft gave up
making their own. And even with all that effort, I still can't create a clean
HTML document with an interface as rich as MS Word, or even add bold or color
formatting to a Twitter post, or update a Wikipedia page without knowing wiki
markup.

We need to pull the dynamic, JS powered side of the web out from the core,
limit CSS to non-dynamic properties, and standardize on an efficient in-
document binary storage akin to MIME email attachments so HTML docs can be
self-contained like a Word or PDF doc.

This document-centric web could be marked off within a standard web page, so
you could combine it with regular interfaces for things like social network
posts. Or it could stand on its own, allowing relatively large sites to be
created with indexes, footnotes, etc., but served from a basic static server.

This isn't a technical challenge, it's an organizational one. I've thought for
years that Mozilla should be doing this, instead of messing with IoT and
phones, etc. It's such an obvious problem that needs addressing, and would
have a huge payback in terms of advancing the web as we know it.

~~~
icebraining
> This isn't a technical challenge, it's an organizational one

No, it's an economic one. Who will use that web? You mention Twitter, yet
are they not dependent on JS for analytics and ad-tracking? The few sites not
dependent on such features are already usable on Lynx and Elinks, and the
others simply won't use them.

For the advantages, you mention having a good WYSIWYG editor, but the reason
you can't add bold or color to a Twitter post is obviously not because they
are unable to add those functions, but because _they don't want you to do
that_. Which raises the question: what happens when that editor lets you
create something the site doesn't allow you to use?

(By the way, Wikipedia has had a visual editor since 2012, you just have to
switch using the "pencil" button:
[https://en.wikipedia.org/wiki/Wikipedia:VisualEditor](https://en.wikipedia.org/wiki/Wikipedia:VisualEditor))

------
michaelangerman
I have been wanting this for years...

If you look at the original Yahoo page when Yahoo first started out, it
attempted to solve this problem.

I believe this index could be regional or language-based...

In the United States one could use

Dewey Decimal

[https://en.wikipedia.org/wiki/Dewey_Decimal_Classification](https://en.wikipedia.org/wiki/Dewey_Decimal_Classification)

Library of Congress

[https://en.wikipedia.org/wiki/Library_of_Congress_Classifica...](https://en.wikipedia.org/wiki/Library_of_Congress_Classification)

~~~
tokyodude
It won't work without a central authority. See SoundCloud as an example.
People tag their music with whatever they think will get them traffic. So, in
order to do this you'll need a mass of volunteers, which will lead to politics
("XYZ should be classified as G! No, it should be F!", "classifying ABC as DEF
is racist/sexist/...") and other arguments. You'll also get people lobbying to
have things removed (right to be forgotten, pornography, drug ads, prostitution
ads, disparaging the government - China, Thailand, etc.), etc.

I'm not saying it shouldn't be done but I think it will be way more work than
expected and there will be all kinds of issues.

~~~
zdragnar
If anything, that sounds like a solid argument to decentralize it. I don't
want China's government, white supremacists, churches, soccer moms, Jihadis or
grievance-of-the-month activists controlling how information is indexed; I
would rather use multiple indexes that balance out controlling interests and
biases.

~~~
rabidrat
Unfortunately if it's decentralized, then it becomes controlled by spamlords,
SEO artists, advertisers, and anyone else who stands to gain from manipulating
the index to their advantage. At least if it's centralized, the fights are out
in the open and have a chance of converging on something reasonable (like e.g.
wikipedia).

~~~
snaky
Decentralized doesn't mean flat. You can trust only some actors (and those
_they_ trust).

------
aheilbut
I've always thought it would make more sense if each web server could be
responsible for indexing the material that it serves (and offer notifications
of updates), so instead of having to crawl everything yourself, you could just
request the index from each domain, and then merge them.
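
As a toy sketch of that idea (the /.well-known/site-index.json path and the
JSON shape are invented here purely for illustration; no such standard
exists):

    # Merge self-published per-domain inverted indexes into one local index.
    from collections import defaultdict
    import requests

    DOMAINS = ["example.org", "example.net"]       # domains to federate
    INDEX_PATH = "/.well-known/site-index.json"    # hypothetical convention

    merged = defaultdict(set)                      # term -> set of URLs
    for domain in DOMAINS:
        try:
            index = requests.get(f"https://{domain}{INDEX_PATH}", timeout=10).json()
        except Exception:
            continue                               # skip domains that don't publish one
        for term, urls in index.items():           # assumed shape: {"term": ["url", ...]}
            merged[term].update(urls)

    print(sorted(merged.get("open", set())))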

~~~
lenzm
A significant problem with this is trust. You can't trust websites to reliably
or accurately index their sites due to both incompetence and malice. I don't
think there's any way around the malicious component. Formal or informal
standards may take care of the competence factor with the feature being built
into common publishing platforms.

XML sitemaps are a microcosm of putting the indexing onus on websites instead
of the search engines - they are basically ignored by search engines because
they have been abused and are not a useful signal. If pages aren't important
enough to be linked to throughout your website then they aren't interpreted as
being important enough to return to users. The optimistic case is that
sitemaps/indices will send parallel signals to the search engines in which
case they are redundant. The pessimistic case is that the sitemaps/indices
will send signals orthogonal to the content provided to users in which case
the website is either being deceitful or incompetent. In any case, the search
engine will not want to use the sitemap/index as a signal as it either doesn't
provide value or provides negative value.

~~~
oldjokes
Honestly I'd much rather have a bunch of dice rolls on incompetence than the
current centralized, single point of control over the entire index.

Google has been purging large swaths of data from the indexes and they won't
say how or why or exactly what criteria they are using. It's difficult to
imagine a worse solution for the web than this current model.

~~~
bogomipz
"Google has been purging large swaths of data from the indexes and they won't
say how or why or exactly what criteria they are using."

Wow, interesting, this is the first I've heard of this. Might you have some
link or citations about this? Thanks.

~~~
aytekin
4% of the Google index hit by de-indexing

[https://searchengineland.com/4-of-the-google-index-hit-by-de...](https://searchengineland.com/4-of-the-google-index-hit-by-de-indexing-bug-moz-data-shows-315248)

------
3xblah
The PDF is a little short on details. It sounds like webmasters would all have
to cooperate by allowing crawls from an "OWI" bot.

One of the challenges of creating a "web index" is first creating an index of
each website. "Crawling" to discover every page of a website, as well as all
links to external sites, is labour-intensive and relatively inefficient. Part
of that is because there is no 100% reliable way to know, before we begin
accessing a website, each and every URL for each and every page of the site.
There are inconsistent efforts such as "site index" pages or the "sitemap"
protocol (introduced by Google), but we cannot rely on all websites to create
a comprehensive list of pages and to share it.

However, I believe there is a way to generate such a list from something that
almost all websites do create: logs.

When Google crawls a website, it is often or maybe even always the case that
the site generates logs of every HTTP request that googlebot makes.

If a website were to share publicly, in some standardised format, the portion
of their log where googlebot has most recently crawled the site, we might see
a URL for each and every page of the site that Google has requested.

By automating this procedure of sharing listings of those googlebot HTTP
requests, the public could generate a "site index" directly from the source,
via the information on googlebot requests in the logs.
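
Extracting those URLs from a standard combined-format access log is already
straightforward; a rough sketch (it trusts the User-Agent string, which a
real pipeline should verify with a reverse-DNS lookup):

    # Collect the URLs Googlebot successfully fetched, per the access log.
    import re

    LOG_LINE = re.compile(
        r'"(?:GET|HEAD) (?P<path>\S+) HTTP/[^"]*" (?P<status>\d{3}) \S+ '
        r'"[^"]*" "(?P<ua>[^"]*)"'
    )

    def googlebot_urls(log_path):
        urls = set()
        with open(log_path) as fh:
            for line in fh:
                m = LOG_LINE.search(line)
                if m and "Googlebot" in m.group("ua") and m.group("status") == "200":
                    urls.add(m.group("path"))
        return sorted(urls)

    print("\n".join(googlebot_urls("access.log")))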

Allowing crawls from a "new" bot would not be necessary.

Webmasters know what URLs they offer to Google. Google knows as well. The
public, however, does not.

It is a public web. Absent mistakes by webmasters, any pages that Google is
allowed to crawl are intended to be public.

Why should the public not have access to a list of all the pages of websites
that Google crawls?

I don't know, but there must be reasons I have failed to consider.

What are the reasons the public should _not know_ what pages are publicly
available via the web, except as made visible (or invisible) through a
middleman like Google?

There are none.

Being able to see logs of all the googlebot requests would be one way to see
what Google has in their index without actually accessing Google.

~~~
yomly
Isn't the act of sharing these logs vulnerable to a similar problem to site
maps?

Not everyone will do it, and those that do may not do it to 100% completeness:
people may not keep their HTTP logs in good order, for example.

~~~
3xblah
"Not everyone will do it..."

Not everyone will provide CCBot with the same access that they provide to
Googlebot. The question is how many will?

It is sort of a catch-all issue with anything on the web: "Not everyone will
do it." I am not sure that anyone aims for _100%_ participation where the web
is concerned.

There is always an uncertain amount of variation in participation in anything
across the entire www.

------
sixtypoundhound
How far is this from the (now defunct) DMOZ?

A publicly maintained directory that I believe was at least theoretically
independent of the larger web companies. It certainly had its share of drama,
but it was a decent, human-vetted index of what was out there....

------
diNgUrAndI
As a user, if some other search engine could serve results that are better
than Google's, I'd be happy to use it. I've tried DuckDuckGo; the results were
disappointing and often misinterpreted what I intended to search for. So I
kept coming back to Google.

Will Google be willing to open its indexes? Probably not - it isn't in its
best interest, because it would help its competitors.

~~~
peterburkimsher
I had an idea about a new indexing algorithm that would only need static file
hosting (e.g. Github) for searching.

[https://news.ycombinator.com/item?id=17548623](https://news.ycombinator.com/item?id=17548623)

If you like, I can try implementing that with my next data analysis project.
Right now I'm studying the MySpace Dragon Hoard, and I'll soon write a blog
post with maps of music genres around the world.
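
To make the static-hosting idea concrete, here is one possible shape (a
sketch only, not necessarily the algorithm in the comment linked above):
shard an inverted index into one small JSON file per term, so a client only
fetches the shards it needs.

    # Build a term-sharded static index: one JSON file of URLs per term.
    import json, os, re
    from collections import defaultdict

    def build_static_index(pages, out_dir="index"):
        """pages: dict of url -> text."""
        os.makedirs(out_dir, exist_ok=True)
        postings = defaultdict(set)
        for url, text in pages.items():
            for term in set(re.findall(r"[a-z0-9]+", text.lower())):
                postings[term].add(url)
        for term, urls in postings.items():
            with open(os.path.join(out_dir, f"{term}.json"), "w") as fh:
                json.dump(sorted(urls), fh)

    # Client-side search is then just: fetch index/<term>.json from the
    # static host for each query term and intersect the URL lists.
    build_static_index({"https://example.org/": "an open web index"})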

------
Tomte
I assume that in a world of competing index users, there is no one size fits
all. Presumably application design (and feature) choices will heavily
influence how the index should work.

For the simple "I know TF-IDF, let's build a toy search engine" case it will
suffice, but apart from that?
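
For reference, the toy case really is tiny; a minimal TF-IDF ranker over a
handful of documents fits in a few lines:

    # Score documents against a query with plain TF-IDF.
    import math, re
    from collections import Counter

    docs = {
        "a": "an open web index for everyone",
        "b": "the index is separate from the search engine",
        "c": "search engines rank documents by relevance",
    }

    def tokens(text):
        return re.findall(r"[a-z]+", text.lower())

    tf = {doc_id: Counter(tokens(text)) for doc_id, text in docs.items()}
    df = Counter(term for counts in tf.values() for term in counts)
    N = len(docs)

    def score(doc_id, query):
        return sum(tf[doc_id][t] * math.log(N / df[t])
                   for t in tokens(query) if df[t])

    print(sorted(docs, key=lambda d: score(d, "open index"), reverse=True))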

~~~
shereadsthenews
This is the problem with this idea. The format of the index will be intricately
tied to the algorithms that are meant to traverse it. The production of a
search result by Google or Bing in a fraction of a second is an outright
miracle of software engineering. If this open index service provides something
developers can easily understand and consume, such as a term-doc hitlist with
a simple encoding, it will be enormous, expensive, and impractical to
traverse.
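
To make "a term-doc hitlist with a simple encoding" concrete: a posting list
is just a sorted list of doc IDs per term, classically stored as gaps with a
variable-length integer code. Even with that compression, at web scale
(billions of documents, many terms per document) the raw structure is
enormous, which is the point above. A sketch of the encoding:

    # Delta + varint encoding of a posting list (sorted doc IDs for one term).
    def varint(n):
        out = bytearray()
        while True:
            byte = n & 0x7F
            n >>= 7
            out.append(byte | (0x80 if n else 0))
            if not n:
                return bytes(out)

    def encode_postings(doc_ids):
        prev, out = 0, bytearray()
        for d in doc_ids:                 # store gaps, not absolute IDs
            out += varint(d - prev)
            prev = d
        return bytes(out)

    print(len(encode_postings([3, 7, 1_000_000, 1_000_002])))  # a few bytes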

~~~
walterbell
Google needs sub-second response to show ads.

Some users may be happy to wait for hours or days to get high-quality answers
not available from commercial companies. That can still be faster than
emailing a human friend or consultant, or tasking an employee or department.

~~~
AndyPatterson
No idea where you got this statisti, but I guarantee no user would be happy to
wait hours or (god forbid) days on a search result.

~~~
walterbell
It's still sooner than "never", which is the current response time for answers
that Google cannot provide.

Central search indexes like Google are not going away. There are client-side
metasearch interfaces that combine Google results with other sources. Those
other sources can be much slower, including human responses. You would still
have your synchronous sub-second response from centralized search, but there
would be asynchronous results from decentralized search.

This exists today, e.g. when you post a question on HN or a messaging app,
asking other humans for answers not available in public indexes. Most of the
world's knowledge is _not public_, it's obscure and may only be of interest
to specific niche audiences.

~~~
petra
>> Most of the world's knowledge is not public

Where is it found?

~~~
walterbell
There's a long list, including private correspondence, commercial journals,
proprietary databases, trade secrets, internal corporate data sets, private
archives, financial trade data, classified national databases. That was
_before_ the rise of FAANG, big data, proprietary analysis / inferences /
knowledge graphs derived from public data sources, metadata traffic analysis,
and advertising surveillance business models.

I can't find a reference at the moment, but this topic was covered in a
professional journal for historians.

------
z3t4
With the advance of fiber networks, I think each browser/device will have its
own web index. One problem is web sites that can only handle 1-3 simultaneous
users. It would be an eternal hug of death from all the crawling.

------
netwanderer3
The index itself is already separate, in the sense that nobody is stopped from
doing the indexing themselves.

Google is a private for-profit company, so we cannot realistically expect them
to provide something to the public for free without generating profits in
return.

The web index is not a proprietary resource locked up by anyone, so people can
do the indexing themselves. But the real question is: how do you fund a
service whose workload will keep increasing exponentially and indefinitely?
What institution will have the resources to bear such costs?

------
rolph
Whole document here, from the arxiv.org page:

[https://arxiv.org/pdf/1903.03846](https://arxiv.org/pdf/1903.03846) [PDF]

------
NikkiA
Hmmm, there's Curlie - the reboot of DMOZ:

[https://curlie.org/](https://curlie.org/)

------
EGreg
OpenStreetMap is doing it with maps.

I welcome the idea of data being totally free, where you make apps that use
mirrors instead of APIs.

------
galaxyLogic
How about indexing just the <h1> </h1>? Is that the intention? We don't want
too much information.

~~~
known
I prefer <title> </title>

~~~
mdaniel
Arguably no one invests in the title tag anymore because it's not user-visible
in the way a heading tag is. Or go further in the other direction and use the
`<meta>` tags honored by Facebook and Twitter, since the page author has an
incentive to keep that content up to date.
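
Extracting just those fields is cheap; a small sketch with the standard
library's HTML parser, pulling the title, the h1 text, and any og:/twitter:
meta tags:

    # Collect title, h1 text, and og:/twitter: meta tags from an HTML page.
    from html.parser import HTMLParser

    class HeadFields(HTMLParser):
        def __init__(self):
            super().__init__()
            self.fields = {}
            self._capture = None          # tag whose text we're collecting

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag in ("title", "h1"):
                self._capture = tag
            elif tag == "meta":
                name = attrs.get("property") or attrs.get("name") or ""
                if name.startswith(("og:", "twitter:")):
                    self.fields[name] = attrs.get("content", "")

        def handle_endtag(self, tag):
            if tag == self._capture:
                self._capture = None

        def handle_data(self, data):
            if self._capture:
                self.fields[self._capture] = self.fields.get(self._capture, "") + data

    parser = HeadFields()
    parser.feed('<title>Open Web Index</title>'
                '<meta property="og:title" content="OWI"><h1>An open index</h1>')
    print(parser.fields)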

~~~
galaxyLogic
But arguably 'title' is still important, maybe even more than before, because
it shows on the tab, and everybody uses lots of tabs now.

When there were no tabs, users could always see the content of the page and
knew what they were looking at. But tabs hide other pages, so it is important
that we know what is in all those other tabs.

------
iamwil
I didn't see a mention of who would pay for this infrastructure. Is it meant
to be government-funded or a volunteer/donation thing?

There doesn't seem to be a mention of how to alleviate the tragedy of the
commons problem (unless I missed it). If Common Crawl is doing a fine job, who
funds them?

~~~
ericd
Maybe the crawl could be distributed somehow, and you could pull versions of
the web from those distributed nodes via BitTorrent.

------
Rumperuu
Abstract:

 _A proposal for building an index of the Web that separates the
infrastructure part of the search engine - the index - from the services part
that will form the basis for myriad search engines and other services
utilizing Web data on top of a public infrastructure open to everyone._

------
fifoan
I asked a Google engineer in a Google interview (at the end of it, when you
get the chance to ask them questions) if Google would ever make its
infrastructure available to the public so they could leverage it in whatever
way they wanted.

He had no idea what I was talking about.

------
sneg55
Somebody could try to build their own crawler and feed it with the 260MM
domain names dataset from [https://domains-index.com](https://domains-index.com)
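
The crawling part itself is easy to start (the hard parts are robots.txt,
politeness, scale and freshness); a bare-bones sketch:

    # Fetch a domain's front page and extract outgoing links.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    import requests

    class LinkExtractor(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                href = dict(attrs).get("href")
                if href:
                    self.links.append(href)

    def crawl_domain(domain):
        base = f"http://{domain}/"
        extractor = LinkExtractor()
        extractor.feed(requests.get(base, timeout=10).text)
        return [urljoin(base, href) for href in extractor.links]

    for url in crawl_domain("example.com"):
        print(url)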

~~~
throwamay1241
Is there more like this? AFAIK SSL certificates are required to be committed
to an open ledger, but I can't find anywhere to obtain the ledger.

~~~
papower
Maybe you're referring to Certificate Transparency logs? There's background
info at [http://www.certificate-transparency.org/what-is-ct](http://www.certificate-transparency.org/what-is-ct)

You could implement your own log monitor or use services like crt.sh or
certstream to build a candidate list of domains that have registered SSL
certs.
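
For example, crt.sh exposes a JSON view of the CT log entries it aggregates;
a sketch (the name_value layout is an assumption worth re-checking, and
certstream provides the same data as a live feed):

    # Collect domain names seen in CT logs for a pattern, via crt.sh.
    import requests

    def ct_domains(pattern="%.example.com"):
        resp = requests.get("https://crt.sh/",
                            params={"q": pattern, "output": "json"})
        resp.raise_for_status()
        names = set()
        for entry in resp.json():
            # name_value may hold several names separated by newlines
            for name in entry.get("name_value", "").splitlines():
                names.add(name.lstrip("*."))
        return sorted(names)

    print(ct_domains()[:10])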

------
OneWordSoln
Ultimately, Wikipedia provides an effective keyword lookup that maps to
curated links.

Regardless, the notion of a general web index is well-nigh moot at this point,
since one was not built into the system from the get-go. Any such attempt now
will be, by definition, ad hoc and built by some group of individuals, and the
vastness of the content, the cost of the project and the intrinsic conflicts
that will no doubt arise make independence from financial and legal issues
non-trivial, to say the least.

Really, Wikipedia is the most sensible foundation I can imagine, given that
Google has become a self-serving for-profit corporate advertising machine.

~~~
marktangotango
Good observation. This makes me think of all the useful content Wikipedia
_doesn’t_ link to though.

~~~
OneWordSoln
Thanks. And yeah, WP is really only a broad, top-tier foundation that can (and
has) grow(n) organically in a rather appropriately demand-driven way. [Of
course, such human systems will always have problems on touchy subjects as
digital information is always most useful for factual subjects where opinions
are less relevant.]

Really, I am of the perspective that a machine-grokked indexing system will
always be less useful in a significant set of edge cases than a human-curated
index, due to such factors as language ambiguity and gaming of the algorithms.
As well, the sheer size of the internet requires ranking the pages to ensure
the most useful links are properly denoted as such.

WP being likely the most important and useful crowd-sourced and crowd-built
human information system, it is up to us both to keep it funded and to add the
information we deem important.

------
lifeisstillgood
I have argued that one regulatory outcome for Google could be the open
release of their index - and even their database of "if you searched for X and
clicked the top link, then came back five seconds later, we can infer the top
link is not good for X".

And yes I know that's pretty much all of Google. It's just that it's hard to
get away from the idea that an index of web pages is anything other than the
property of the people who created each web page and the links on it.

And it's not such a big leap to argue that data that is generated by my
behaviour is actually my data (it is likely to be personally identifying data
- or perhaps a different term like personally de-anonymisable data).

I do agree with the general direction of GDPR - but I honestly think the
digital trail we leave is a different class of problem that needs different
classes of legal concepts to work with.

I think digital data is a form of intellectual property that I create just by
moving in the digital realm.

And if you have to pay me to use my data to sell me ads, you will likely stop.

~~~
icebraining
You can't argue that such data is individually owned and also that it must be
released publicly, because that would require consent from everyone whose data
was used.

------
Animats
Like Open Directory?

~~~
dredmorbius
You're referring to (the now-defunct) DMOZ?

[http://dmoz-odp.org](http://dmoz-odp.org)

As opposed to Apple Open Directory?

[https://en.m.wikipedia.org/wiki/Apple_Open_Directory](https://en.m.wikipedia.org/wiki/Apple_Open_Directory)

~~~
Animats
Yes.

------
return1
More centralization, great. Why not make search itself distributed, by
broadcasting the queries recursively and gathering results?

~~~
dredmorbius
Suggestions as to how?

~~~
return1
I don't know of any relevant project, sorry.

~~~
dredmorbius
Thanks, there's SubHub, and a few possible options.

An indexing standard seems a critical element.

------
skwog
It is ... missing more than just that.

------
shereadsthenews
Ironically, it is EU regulations that make this idea totally impossible. One
does not simply index documents, at least not for Europeans. You have to
expurgate your index for the "right to be forgotten" people. You have to
remove all the Nazi stuff because of Germany. This idea by a German is not
possible because of Europe.

------
dexteve
One Ring to rule them all

------
AndyPatterson
This simply isn't needed, and if it is, it can be done by a charity or any
group of people; it's not something that should be built into the
infrastructure of the web itself.

You have to remember that the little the web provides is also its strongest
attraction; it allows the web to be accessed and modified by anyone, and their
own little bit of the web can be very different from someone else's.

So mandating a way in which the web must be indexed is kind of like moving
closer to communism than liberalism. I guess if we start dictating to Google
where to get their data, then we've moved to the full-blown hammer and sickle
stage :).

~~~
Retra
Putting together an accurate picture of the state of the world does not
contradict liberalism.

