
Google Exec Says It's A Good Idea: Open The Index And Speed Up The Internet - helwr
http://www.siliconvalleywatcher.com/mt/archives/2010/07/google_exec_say.php
======
geekfactor
It strikes me that the entire article/proposal is based on a faulty premise:

 _"After all, the value is not in the index it is in the analysis of that
index."_

The ability for a given search engine to innovate is based on having control
of the index. The line between indexing and analysis isn't quite as clean as
the article implies, if only for the simple fact that you can only analyze
what is in the index.

For example, at its simplest, an index is a list of which words appear in which
documents on the web. But what if I want to give greater weight to words that
are in document titles or headings? Then I need to somehow put that into the
index.

What if I want to use the proximity between words to determine relevance of a
result for a particular phrase? Need to get that info into the index, too.

In the end, what the author really wants is for someone to maintain a separate
copy of the internet for bots. In order for someone to do that, they'd need to
charge the bot owners, but the bot owners could just index your content for
free, so why would they pay?

~~~
mapgrep
Three easy reasons search engine owners /might/ pay for a full copy of the web
crawl:

-Faster. You don't have the latency of millions of HTTP connections, but instead a single download. (Or a few dozen. Or a van full of hard drives.)

-Easier. The problem of crawling quickly but politely has been handled for you. The reading of sitemaps has been handled for you. The problem of deciding how deep to crawl, and when to write off a subsite as an endless black hole, has been handled for you. Etc.

-Predictable. Figuring out, in advance, how much it is going to cost you to crawl some/all of the web is, to say the least, tricky. Buying a copy with a known price tag provides a measure of certainty.

Of course, I am leaving out the potential pitfalls, but the point is there
/are/ arguments in favor of buying a copy of the web (and then building your
own index).

~~~
geekfactor
All good points. Maintaining a current and high quality index is clearly not
free in terms of development time, bandwidth, storage or a host of other
factors.

------
jedberg
No one seems to remember that Amazon did this 5 years ago:
<http://arstechnica.com/old/content/2005/12/5756.ars>

~~~
helwr
"The Alexa Web Search web service has been deprecated and is no longer
available for new subscriptions." <http://aws.amazon.com/alexawebsearch/>

~~~
nl
Which shows what the demand for it was.

------
panic
What "index" is this article asking Google to open? The index against which
they run actual queries has to be tied to Google's core search algorithms,
which I doubt they'd want to make public.

So would they open an "index" of web page contents? In this case, why would
another search engine access Google's "index" rather than the original server?
The original server is guaranteed to be up to date, and there's no single
point of failure.

~~~
lpolovets
There are two good reasons to access an index instead of the original server:

1) You won't DOS the host site.

2) You don't have to respect robots.txt. If you need to crawl a 1 million page
site, and its robots.txt restricts you to 1 page/second, then you'll have to
wait for a long time. Downloading a crawl dump from a central repository would
be much easier.
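
For a sense of scale, here's a quick back-of-the-envelope in Python (the
one-million-page site and the 1 page/second limit are taken from the
hypothetical above):

    # How long does a "polite" crawl of a large site take?
    pages = 1_000_000      # hypothetical 1M-page site from the comment above
    delay_seconds = 1      # robots.txt Crawl-delay of 1 page per second
    total_seconds = pages * delay_seconds
    print(f"{total_seconds / 86_400:.1f} days")  # ~11.6 days for one full pass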

~~~
sigil
> 2) You don't have to respect robots.txt. If you need to crawl a 1 million
> page site, and its robots.txt restricts you to 1 page/second, then you'll
> have to wait for a long time.

Yeah, consider a site like Hacker News, where the crawl delay is not 1 second
but _30 seconds_ [1].

If you're trying to grab historical data like iHackerNews did [2], you might
be better off hitting Google's webcache instead... except, to enumerate URLs
you have to scrape the search results pages, AFAIK.

This is why Google opening its webcache via an API is a _GREAT_ idea.

[1] <http://news.ycombinator.com/robots.txt>

[2] <http://api.ihackernews.com/>

~~~
kelnos
Tsk, tsk. On a side note, HN's robots.txt is returned with a content-type of
text/html.

------
trotsky
I guess I don't understand: if someone provides me with a storage cluster of
_the whole internet_ for free, won't my _proprietary search algorithm_
significantly degrade the IOPS and network bandwidth of the storage? Where
would it all be? In some Google data center that now anyone can demand
colocation in? What happens when I accidentally or maliciously slow down
Bing's updates and degrade their quality? And, as others mentioned, what
happens when people push data into the index that doesn't represent what
they're hosting?

It seems like this would be quite a complex project with a for-the-public-good
approach. Maybe it could work as an AWS project to sell Amazon compute cycles.

~~~
jerf
I suspect the only way to make this work would be for the "index" to actually
be some sort of stream. It wouldn't be a "file" or "database". My guess is it
would require on the order of hundreds of thousands of dollars' worth of
hardware just to receive the stream and run a hello-world map-reduce(-esque)
calculation on it. It's your job to turn that stream into a queryable
database.

As for your last two questions, there's nothing new whatsoever about them.
Search engine pollution is ancient news.

------
sebastianavina
yeah, sure... let's make a system and store all kinds of information there, so
people can browse it... it would be great to distribute it around the world,
maybe across different companies, and sync the data every day so it keeps fresh...
I don't know, maybe we can even have every person store their own data on
their own private server... but of course, in an open index... </sarcasm>

~~~
uxp
I'm blinded by what you are trying to imply sarcastically.

A rough distributed model could be implemented similar to the way we
(hackers/coders) use github as a central repository for a distributed system.
People contributing to the index on a private server could do whatever they
want but since that instance of the index is not public, no one else will care
about what the owner has done to it. Forks can be pushed to a public staging
area where others can view them and verify their accuracy, and then the major
players can merge those changes into their forks.

The complaint (with github) that it is hard to figure out the canonical repo
is also invalid in this model, as one can start with a fork of Google or
Yahoo's public repo, and then build their own through merging or hacking
directly on it, just like one can fork Linus' Linux kernel and then merge in
others' forks to incorporate other changes.

Remember, the index itself, as in the raw data taken in by GoogleBot or Yahoo!
Slurp bot, would be the shared information. The analysis of the data, as in
pagerank and other factors that Google decides makes one page more relevant to
a keyword than the other, would not be shared as that is the bread and butter
of each engine.

~~~
DrCatbox
The sarcasm was because the idea precisely describes what we have today: it's
called the web.

~~~
recoiledsnake
So the website wikipedia.org is the same thing as making the dump files
available at <http://dumps.wikimedia.org/enwiki/latest/>?

------
chrislomax
I think this is a good idea. The whole notion of people syncing their own data
doesn't work, though; it gives too much room for people to fudge their data
into the system so it favours them more.

That said, I think there would be a fight over who gets to be the aggregator
of the information. It would also mean whoever distributes it has a
stranglehold on the industry in terms of how and when it supplies this
information.

I can see its uses, but I can equally see a lot of cons: the system not
working, or some serious antitrust issues.

If you could get an unbiased 3rd party involved to build the database, though,
then I think that would work.

------
Emore
For the record, the Google exec (Berthier Ribeiro-Neto) is the co-author of
"Modern Information Retrieval" [1], an excellent book and close to a standard
text on IR.

[1] [http://www.amazon.com/Modern-Information-Retrieval-
Ricardo-B...](http://www.amazon.com/Modern-Information-Retrieval-Ricardo-
Baeza-Yates/dp/020139829X)

~~~
a_m_kelly
I can second the recommendation of that book: I've heard a lot of good things,
though I haven't read it myself. It's recently been updated in a 2nd edition
[1]. I have no idea if there are substantive changes, though presumably there
are, given that more than a decade has elapsed. If anyone's read the updated
version, I'd appreciate knowing if and how the book has changed; I've been
thinking about picking it up.

I have read pretty big sections of Manning's Introduction to IR, and it served
me fairly well as an introduction to the field. It's available online [2].

[1] [http://www.amazon.com/Modern-Information-Retrieval-
Concepts-...](http://www.amazon.com/Modern-Information-Retrieval-Concepts-
Technology/dp/0321416910/ref=sr_1_1?s=books&ie=UTF8&qid=1306358312&sr=1-1)

[2] [http://nlp.stanford.edu/IR-book/information-retrieval-
book.h...](http://nlp.stanford.edu/IR-book/information-retrieval-book.html)

------
extension
We're talking about the _cache_, right? The index, or more likely indices,
are optimized data structures used to search the cache. I doubt Google could
share those without revealing too much about their ranking algorithm.

Letting sites inject into the cache is an interesting idea, but Google will
still have to spider periodically to ensure accuracy. Inevitably, a _large_
number of sites will just screw it up, because the internet is mostly made of
fail. This would leave Google with only bad options: If they delist all the
sites to punish them, they leave a significant hole in their dataset. But if
they don't punish them and just silently fix it by spidering, there is no
longer any threat to keep the black hat SEOs in check. Either way, it would
cause an explosion in support requirements and Google is apparently already
terrible at that.

~~~
amikazmi
I think the idea was that only Google will crawl your site and update the
index, then the rest of the search engines will use the index instead of
hitting your site.

------
ChuckMcM
"Each of these robots takes up a considerable amount of my resources. For
June, the Googlebot ate up 4.9 gigabytes of bandwidth, Yahoo used 4.8
gigabytes, while an unknown robot used 11.27 gigabytes of bandwidth. Together,
they used up 45% of my bandwidth just to create an index of my site."

I don't suppose anyone has considered making an entry in robots.txt that says
either:

last change was : <parsable date>

Or a URL list of the form

<relative_url> : <last change date>

There are a relatively small number of robots (a few tens, perhaps) which crawl
your web site; all of the legit ones provide contact information either in the
user-agent header or on their web site. If you let them know you had adopted
this approach then they could very efficiently not crawl your site.

That solves two problems:

@ Web sites that sit on the back end of ADSL lines but don't change often
wouldn't have their bandwidth chewed by robots,

@ The search index would be up to date so if someone who needed to find you
hit that search engine they would still find you.
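
A rough Python sketch of how a crawler could use such entries. To be clear, the
"Last-Change" field below is hypothetical (it isn't part of any robots.txt
standard); it just illustrates the proposal above:

    # Hypothetical sketch: parse per-URL "last change" entries out of robots.txt.
    # The "Last-Change" field is invented for illustration; no such standard exists.
    import urllib.request
    from datetime import datetime

    def fetch_last_changes(site):
        """Return {relative_url: last_change} parsed from the site's robots.txt."""
        text = urllib.request.urlopen(f"{site}/robots.txt").read().decode("utf-8")
        changes = {}
        for line in text.splitlines():
            # e.g. "Last-Change: /forums/index.html : 2010-07-15T12:00:00+00:00"
            if line.lower().startswith("last-change:"):
                _, path, stamp = (part.strip() for part in line.split(":", 2))
                changes[path] = datetime.fromisoformat(stamp)
        return changes

    def should_crawl(path, last_crawled, changes):
        # Skip the fetch entirely if the site says nothing changed since our last visit.
        return path not in changes or changes[path] > last_crawled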

~~~
troels
There's already a place for doing this in the HTTP protocol. I would assume
that crawlers respect it, if provided, although I haven't tested to verify my
expectation.
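
For reference, the HTTP mechanism being referred to is the conditional GET
(`If-Modified-Since` / `304 Not Modified`). A minimal Python sketch of what a
well-behaved crawler would do; the URL and date are purely illustrative:

    # Conditional GET: only re-download the page if it changed since the last crawl.
    import urllib.request
    from urllib.error import HTTPError

    def fetch_if_changed(url, last_modified):
        req = urllib.request.Request(url, headers={"If-Modified-Since": last_modified})
        try:
            with urllib.request.urlopen(req) as resp:
                return resp.read()   # 200: content changed, re-index it
        except HTTPError as err:
            if err.code == 304:
                return None          # 304 Not Modified: the cached copy is still fresh
            raise

    # fetch_if_changed("http://example.com/", "Thu, 15 Jul 2010 12:00:00 GMT")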

~~~
unfasten
I have a newly registered domain with only a sparse page up as the index so
far. It's been getting crawled fairly regularly by Google, Baidu and Yahoo.
Google and Baidu are sending If-Modified-Since (Baidu is also sending If-None-
Match) and are receiving 304 Not Modified responses each time they crawl.
Yahoo sends neither header and is requesting the full page every single time.
This is without any explicit cache headers set on my end.

~~~
troels
That is to be expected. `If-None-Match` and `ETag` are a relatively recent
caching strategy that is handled on the server (or edge) side.

Have you tried serving your pages with `Expires` and `Cache-Control` headers?
If you give them, say, a timeout of a week, then a well-behaving client
shouldn't retry before that time has gone by.
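
A minimal sketch of what that would look like on the serving side, using a
plain Python WSGI app; the one-week lifetime is just an example value:

    # Serve pages with Expires / Cache-Control so well-behaved clients (and,
    # ideally, crawlers) don't re-fetch them for a week.
    import time
    from email.utils import formatdate
    from wsgiref.simple_server import make_server

    ONE_WEEK = 7 * 24 * 3600

    def app(environ, start_response):
        now = time.time()
        headers = [
            ("Content-Type", "text/html; charset=utf-8"),
            ("Cache-Control", f"public, max-age={ONE_WEEK}"),
            ("Expires", formatdate(now + ONE_WEEK, usegmt=True)),
            ("Last-Modified", formatdate(now, usegmt=True)),
        ]
        start_response("200 OK", headers)
        return [b"<html><body>Hello, crawlers</body></html>"]

    # make_server("", 8000, app).serve_forever()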

------
SoftwareMaven
A couple of thoughts come to mind:

1\. If I were Microsoft, I wouldn't trust Google's index. How do I know they
aren't doing subtle things to the index to give them an advantage?

2\. Having the resources to keep a live snapshot of the web is one of the big
players' advantages. Opening the index, while good for the web, would not
necessarily be good for the company. Google could mitigate that by licensing
the data: for data more than X hours old, you get free access; for data newer
than that, you pay a license fee to Google. Furthermore, integrate the data
with Google's cloud hosting to provide a way to trivially create map/reduce
implementations that use the data.

3\. On the other side, what a great opportunity the index could provide for
startups. Maintaining a live index of the web is costly and getting more and
more difficult as people lock down their robots.txt. Being able to immediately
test your algorithms against the whole web would be a godsend for ensuring
your algorithms work with the huge dataset and that your performance is
sufficient.

Here's to hoping Google goes forward with it!

------
thevivekpandey
The first step would be for some top companies (Google, Yahoo...) to share the
index. That way, there would be some speed up of the internet, and the index
would not be open to abuse by arbitrary people/companies.

------
mmaunder
The author should use something like "crawl data" instead of "index". An index
is the end result of analyzing crawled web pages.

It's a cool idea though because Yahoo sucks up a ton of my bandwidth and
delivers very little in SEO traffic. On most of my sites now I have a Yahoo
bot specific Crawl-Delay in robots.txt of 60 seconds, which pretty much bans
them.

------
stretchwithme
Maybe each site should be able to designate who indexes it, and robots can get
that index from that indexer. Let the indexers compete. Let each site decide
how frequently it can be indexed. Allow the indexer that gets the business to
use the index immediately, with others getting access just once a day. Perhaps
a standardized raw sharable index format could be created, with each search
company processing it further for their own needs after pulling it.

And let the site notify the indexer when things change, so all the bandwidth
isn't used looking for what's changed. Actual changes could make it into the
index more quickly if the site could draw attention to them immediately,
rather than an army of robots having to invade as frequently as inhumanly
possible. The selected indexer could still visit once a day or week to make
sure nothing gets missed.

------
ck2
Google would never do this.

Their attitude is to take everything in but not to let you automate searches
to get data out.

This is the biggest problem I have with search engines - you want to deep
index all my sites? Fine, but you better let me search in return - deeper than
1000 results (and ten pages). Give us RSS, etc.

~~~
chrislomax
The whole article is about information in its rawest form and nothing to do
with searchable content.

You would write something that takes the information they are referring to in
this article; it's how you digest and index that information yourself that
makes the difference.

------
sigil
"Index" is the wrong word. He's not calling for Google to open up their index,
but rather open their webcache.

------
random42
This article is about a year old. [July 2010]

------
jwr
It strikes me that both in the article and in most comments people have no
idea of what they are talking about, and yet they boldly carry on.

"The index"? Feature extraction is the most complex part of almost any machine
learning algorithm, and search is no different. Indexing full text documents
is a really difficult task, especially if you take inflected languages into
account (English is particularly easy).

I don't see a way to "open the index" without disclosing and publishing a huge
amount of highly complex code that also makes use of either large dictionaries
or huge amounts of statistical information. It's not like you can just write a
quick spec of "the index" and put it up on github.

FWIW, I run a startup that wrote a search engine for e-commerce (search as a
service).

------
198d
I don't think it's quite that simple. The index that Google serves search
query results from is a direct result of the algorithms they've applied to the
data the googlebot has gathered. If by 'index' the author means the data the
googlebot (for example) has downloaded from the internet, that's quite a bit
different, but still probably serves the purpose the author is looking for.
The index is a highly specialized representation of all the data they've
collected.

~~~
jessriedel
> If by 'index' the author means the data the googlebot (for example) has
> downloaded from the internet.

That's what the author means by 'index'.

------
mindstab
Does it seem naive to anyone else to allow site owners to update the index and
stop spidering? First, lots of people, for various reasons (ignorance, security
through obscurity), would just not update it, and stuff would fall out of
search. Second, this seems incredibly ripe for abuse. As if we don't have
enough search result spam problems already, letting spammers have more direct
access to the content going into their rankings seems like a truly bad idea.

------
tlb
When spiders use more bandwidth than customers, your website must not be very
popular. It implies that each page is viewed only a handful of times per month
on average.

~~~
tlrobinson
It's also possible they have a huge amount of content that is sparsely
accessed by a large number of users. But in general I agree.

Edit: SmugMug seems to fall into this category:
[http://don.blogs.smugmug.com/2010/07/15/great-idea-google-
sh...](http://don.blogs.smugmug.com/2010/07/15/great-idea-google-should-open-
their-index/)

Also interesting:

 _And if you think about it, the robots are much harder to optimize for –
they’re crawling the long tail, which totally annihilates your caching layers.
Humans are much easier to predict and optimize for._

------
eykanal
Good article. The fact is, the index itself isn't worth nearly as much as the
algorithms. Heck, open the index, and let anyone add to it. MSN, Yahoo, Bing,
anyone... let them add to that single index and make the index awesome, and
then anyone can try their hand at making a great search algorithm. If each
company really thinks their search algorithm is better than everyone else's,
this is competition at its best.

~~~
jsnell
It might not be _as_ valuable as ranking algorithms, but that doesn't mean it
has no value at all. A deeper and more timely index is a clear competitive
advantage, the former for long tail queries and the latter for topical ones.
What would be the benefit for Google to let Microsoft leech off that effort?
And if it for some reason was done, what would be the point of continuing
further development of indexing quality?

I think a more correct title for this post would have been "Google exec tries
to politely dismiss a silly idea", but of course that's not as punchy.

------
SkimThat
TL;DR - A lot of traffic on the Internet comes from search engine bots like
Google's and Yahoo's indexing pages. If Google's index were open, search
engines could share each other's resources and not have to repeatedly spider
pages. This would significantly boost traffic speed, and the idea was even
supported by Larry Page, one of Google's co-founders, who initially resisted
Google going commercial.

------
braindead_in
The title is a bit misleading. The author suggested it and the Google Brazil
head supported it and said 'You should write a position paper on it'.

------
tlrobinson
What format would the indexes be made available in? Raw lists of URLs and
caches of the HTML pages, or pre-built inverted indexes, PageRank data, etc?

If it's the former all this really does is move the burden from sites to
Google, and introduces a single point of failure.

If it's the latter, which seems unlikely, what incentive does Google have to
share that data? It's part of their competitive advantage.

------
bkudria
Google has a ton of private data that should not have been indexed, in their
index. It's just that no one has thought to search for it yet. (See:
<http://en.wikipedia.org/wiki/Johnny_Long>)

A single public index would expose this data to stronger analysis (or even
plain reading), not just Google search queries.

------
redditmigrant
I don't know if this is naive, but wouldn't the data model/storage strategy of
the index be influenced by the ranking algorithms that use it? If that's the
case, then I would presume Google's index tries to store the data in a form
that's efficient for their ranking algorithm to work off of, and it might not
be in the best format for, say, Bing/Yahoo to use.

~~~
chrislomax
I think they are referring to the data in its rawest format before they have
indexed and ranked the information themselves. They will all crawl the
information in exactly the same way. They will just take the plain text and
store it. I don't think any bot would actually do anything else with the data
on the fly.

If you think about it, it does make sense in a lot of respects. I have dealt
with a lot of companies that sell data, the only difference is this data is
freely available to everyone so everyone thinks they should crawl the
information themselves.

The only people who lose out are the people paying the bandwidth bills. The
internet would actually be slower due to the amount of information passing
around when it is not needed.

This idea makes more sense the more we discuss it.

------
dennisgorelik
Centralization [of search index] has significant overhead.

Bandwidth is not nearly as expensive as the overhead of such search index
centralization.

------
ecaradec
Even if it would be beneficial to the whole internet, but if google did that
that would be like giving an advantage to all the google competitors : they
wouldn't need to solve the crawling problem. It may not be algorithmically
gorgeous but it's still one problem less. Would be fun though, we could buy a
tarball of the whole internet ;)

------
robot
Also, why not use a single base station at each location for all mobile
service providers, rather than having multiple 3G base stations for each
provider polluting our radio space? I think when there is competition, there
is always duplication of something; it's just a fact of the open market, and
we may have to live with it.

------
bluelu
In the end, one single company would control the internet. I hope this is not
something you want. It would be just as Twitter controls Twitter's data and
only opens it up to Gnip, etc.

This won't be accepted. And even legally, it's not possible due to copyright
laws in different countries.

------
endlessvoid94
I have a potentially stupid question. When the author says "45% of my
bandwidth", does he mean 45% of a QUOTA? Or actually 45% of the pipe is being
used?

If it's the former, this seems like it wouldn't help speed at all.

~~~
ddemchuk
45% of his total monthly bandwidth used. So if he's getting 50 gigs of
traffic a month, the bots make up 22.5 gigs of that.

~~~
endlessvoid94
OK. So how does this speed anything up?

EDIT: To clarify, how does reducing the amount of bandwidth used speed up
anything? Why am I being downvoted for this?

~~~
Locke1689
Google doesn't know when you will update content on your site so it has to
almost constantly hit it with crawler bots. The idea would be to move towards
a push-based model wherein sites could push updates to an open index instead
of waiting for bots to crawl the site and use up extra bandwidth.

Of course, if this is the primary issue, I don't see why Google couldn't just
implement a closed push-based index. When your site updates, you push the
changes to Google. The index is still closed but it solves the bandwidth
problem without opening Google resources.
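
A hypothetical Python sketch of that push model. The endpoint URL and payload
format are invented for illustration; no such Google API exists:

    # When a page changes, the site notifies the indexer instead of waiting
    # to be crawled. Everything about the endpoint here is made up.
    import json
    import urllib.request

    def push_update(index_endpoint, page_url, html):
        payload = json.dumps({"url": page_url, "content": html}).encode("utf-8")
        req = urllib.request.Request(
            index_endpoint,
            data=payload,
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with urllib.request.urlopen(req) as resp:
            return resp.status == 202  # indexer accepted the update for processing

    # push_update("https://index.example.com/push", "http://mysite.com/page", "<html>...</html>")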

~~~
lukes
Google wouldn't trust the sites to do this correctly. So even if they did
provide some push mechanism I think they'd still send bots to your server to
be sure.

~~~
Locke1689
That's quite possible, but in that case an open index would be even more
useless since Google would just have to duplicate it internally anyway.

------
Apple-Guy
The Google guy does not work in the head office and isn't in charge of policy.
He doesn't understand that search -> ads is what earns Google its riches.

------
stcredzero
If Google and a few other companies can charge some multiple of what it costs
to index a site, then it could even be a money making prospect for them.

------
benwerd
Well, on one level, it's a great idea. On another, it gives Google the keys to
the entire freaking web.

------
brianobush
Part of the secret that makes any search engine unique is the knowledge that a
site at x.com exists, and that there is a forum at x.com/forums which is not
visible by simply crawling from the root of x.com. On the other hand, I would
love an open web cache for my work.

------
joshaidan
While I think this is a really cool idea, for some reason the word hiybbprqag
comes to mind. :)

------
ddemchuk
The reason Google (and Bing and Yahoo and Yandex and and and) is in the
position they are in is because they have the bandwidth and computational
power to crawl and index the web with the speed and reach necessary for it to
be useful. They aren't going to just start giving that away any time soon...

------
agentultra
There are protocols for bots. Not all of them follow it... so block requests
from them.

Problem solved... like a million internet years ago.

~~~
turbohz
How do you get indexed, then? Because I can't see how this solves anything.

Isn't the proposal clear enough?

1\. Optimize the indexing process so that each search engine doesn't have to
crawl every site independently.

2\. Devise a method to refresh the index when the content changes (hash,
date...) -- see the sketch below.

Seems reasonable enough to me.
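
A tiny Python sketch of point 2, using a content hash as the change signal
(the choice of SHA-256 is arbitrary):

    # Refresh a page in the shared index only when its fingerprint changes.
    import hashlib

    def fingerprint(html: bytes) -> str:
        return hashlib.sha256(html).hexdigest()

    def needs_refresh(html: bytes, stored_fingerprint: str) -> bool:
        return fingerprint(html) != stored_fingerprint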

~~~
Meai
I assume he means to block all search engines except the big guys: Google,
Bing, Yahoo. Anyone else has too little impact. Not sure how one could do
that, it's not like a request tells me "hey there, I'm a robot! Let me in?"

~~~
gnaritas
Most actually do via the user agent string in the request. You can't stop a
malicious bot this way, but you could kill most bot traffic with a rewrite
rule, presuming robots.txt isn't good enough.
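
A rough Python sketch of the whitelist idea; the user-agent substrings are
illustrative, and a real deployment would more likely do this with a webserver
rewrite rule:

    # Let the big engines' crawlers through, reject everything else that looks like a bot.
    ALLOWED_BOTS = ("googlebot", "bingbot", "slurp")  # "Slurp" is Yahoo's crawler
    BOT_HINTS = ("bot", "crawler", "spider", "slurp")

    def is_blocked(user_agent: str) -> bool:
        ua = user_agent.lower()
        if any(name in ua for name in ALLOWED_BOTS):
            return False  # whitelisted engine: let it through
        return any(hint in ua for hint in BOT_HINTS)

    # is_blocked("Mozilla/5.0 (compatible; Googlebot/2.1)")  -> False
    # is_blocked("SomeRandomCrawler/0.1")                    -> True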

