
Our new search index: Caffeine - Anon84
http://googleblog.blogspot.com/2010/06/our-new-search-index-caffeine.html
======
iworkforthem
Old index. Problem: some parts of the index were refreshed faster than
others; the main layer updated every couple of weeks, and to refresh a
layer of the old index, Google had to analyze the entire web first. Impact:
there was a significant delay between when Google found a page and when it
made that page available to its users.

New index. Solution: Caffeine, Google's new index, analyzes the web in
small portions and updates the search index on a continuous basis, globally.
As Google finds new pages, or new information on existing pages, it can add
these straight to the index. Improvement: users can find fresher information
than ever before, no matter when or where it was published.

Recent changes to the Google search results page suggest that results are
being categorized further by type of structured data. This could be in line
with the rollout of Caffeine.

~~~
Corrado
Hmmm... come to think about it I have been getting quite a lot of results from
CodeWeblog.com in my search results. Unfortunately, that site seems to be a
link farm and it's so useless that I reported it to the Google spam page. Maybe
its appearance in the top sites was a result of the Caffeine changes.

~~~
biafra
Ah! It's not only me who is annoyed by these codeweblog.com results.

Is there a way to remove domains from Google results as a preference? I know I
can always add "-site:codeweblog.com", but can I save this preference for
future search queries?
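
Failing a saved preference, the exclusion can at least be scripted. Here is a minimal sketch, assuming a personal blocklist kept in code; `BLOCKED_DOMAINS` and `build_search_url` are hypothetical names for illustration, not anything Google provides:

```python
from urllib.parse import urlencode

# Hypothetical personal blocklist; the domain is the one named in this thread.
BLOCKED_DOMAINS = ["codeweblog.com"]

def build_search_url(terms, blocked=BLOCKED_DOMAINS):
    """Return a Google search URL with a -site: filter per blocked domain."""
    exclusions = " ".join(f"-site:{domain}" for domain in blocked)
    query = f"{terms} {exclusions}".strip()
    # urlencode escapes spaces as '+' and ':' as '%3A'.
    return "http://www.google.com/search?" + urlencode({"q": query})

print(build_search_url("javadoc simpledateformat"))
```

This only appends the same "-site:" operator to every query; it does not change anything on Google's side.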

------
tptacek
Wow is that ever a terrible infographic.

~~~
defen
I couldn't tell if it was a joke... "magic internet stuff here"

~~~
MikeCapone
Considering they had that page with pigeons to explain how pagerank works, I
would assume it's not supposed to be very serious.

~~~
inerte
The pigeon ranking page was an April 1st joke. It's funny on exactly one day
of the year. Saying how you'll rank 85% of web searches is a little bit more
serious business.

~~~
pjscott
This shouldn't affect the ranking; updated data will just get to the ranking
algorithm faster, is all.

------
gojomo
100 petabytes of information (100 million GB) feeding their index is more
than I would have expected.

~~~
moe
On the other hand it doesn't seem that much when you consider today's storage
density.

You can fit around 0.5 PB into one rack nowadays. 200 racks then sounds a bit
less impressive than 100 Petabytes.

However, that of course doesn't account for redundancy, nor for doing anything
useful with such a pile of data. Both of which impose some interesting
challenges at that scale.
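
As a rough sketch of that arithmetic: the index size and rack density are the figures quoted in this thread, while the replication factor of 3 is purely an assumption to show how redundancy scales the count.

```python
import math

INDEX_PB = 100       # index size quoted in the thread, in petabytes
PB_PER_RACK = 0.5    # rack density estimate from the comment above
REPLICATION = 3      # assumption for illustration only, not a known Google figure

def racks_needed(size_pb, pb_per_rack, copies=1):
    """Racks required to hold `copies` full copies of the data."""
    return math.ceil(size_pb * copies / pb_per_rack)

print(racks_needed(INDEX_PB, PB_PER_RACK))               # single copy
print(racks_needed(INDEX_PB, PB_PER_RACK, REPLICATION))  # with assumed replication
```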

~~~
blasdel
You think Google's mainline storage of their index is on _disk_?

 _It is to laugh._ I don't think that's been true since around the time they
stopped building server shelves out of lego or cardboard.

They've been keeping the full text of the web in RAM (for the snippets), with
indexes, several times over. With independent live siblings in multiple
datacenters.

~~~
houseabsolute
Well, the serving part of their index is in memory (or largely so) as has been
publicly announced. But are you really suggesting they might do the same for
indexing?

Let's suppose for a second that they can get 8GB sticks of RAM for 1/3 the
retail price of around $600, and for the sake of the back of the envelope
calculation, let's round it up to ten GB. So let's call it $20/GB of RAM. Four
gigabyte sticks might seem cheaper until you consider they'd have to double
the number of machines to hold them. Now they mentioned that the size of the
index is 100,000,000 GB, which means that they would have spent $2 billion on
this RAM alone, not to mention all the other components that are required to
house it.

Especially considering how much their capital expenditure has fallen lately
([http://www.datacenterknowledge.com/archives/2010/01/22/googl...](http://www.datacenterknowledge.com/archives/2010/01/22/google-capex-edges-higher-in-q4/)),
it doesn't seem very likely that they could
afford to spend even $2 billion on this. And the assumptions I made about the
price of RAM are pretty generous, considering they'd have had to have been
acquiring this for some time, and therefore that the RAM would have been more
expensive previously.
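
The back-of-the-envelope numbers above, spelled out as a small sketch. All inputs are the assumptions stated in this comment (the retail price, the one-third bulk discount, the round-up to 10 GB per stick); `ram_cost_usd` is just a name picked for illustration.

```python
RETAIL_PRICE_PER_STICK = 600   # USD, retail price of an 8 GB stick as cited above
BULK_DIVISOR = 3               # assumed: bulk price is 1/3 of retail
INDEX_GB = 100_000_000         # index size quoted in the thread

def ram_cost_usd(index_gb, stick_gb, retail_price=RETAIL_PRICE_PER_STICK,
                 bulk_divisor=BULK_DIVISOR):
    """Estimated cost of holding index_gb entirely in RAM."""
    price_per_stick = retail_price / bulk_divisor   # $200 per stick
    price_per_gb = price_per_stick / stick_gb
    return price_per_gb * index_gb

print(ram_cost_usd(INDEX_GB, stick_gb=10))  # rounded-up stick size, $20/GB
print(ram_cost_usd(INDEX_GB, stick_gb=8))   # actual 8 GB sticks, $25/GB
```

With the stick size rounded up to 10 GB this comes to $2 billion; without the rounding it is $2.5 billion, so the rounding favors the in-RAM case.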

~~~
blasdel
Your retail RAM prices are inflated about 3x, but I think your point is
correct.

You know how Google has always had that little blurb about how long it took to
process your query on the results page? Back when they were for sure running
everything out of RAM, it was usually something comically small like 0.00025
seconds. I just checked and it's now more like 0.25 seconds.

Perhaps they've done testing and found that absurdly fast results don't matter
(or no longer matter) as much as they thought?

~~~
houseabsolute
Sorry, that's just what I saw on Newegg for 8GB sticks. I did not intend to
mislead.

I am under the impression that most of Google's index is still served out of
RAM. Certainly I never saw any announcements to the contrary. The other poster
also pointed out that sub-millisecond latencies are pretty unlikely.
Considering all the diverse sets of data they need to pull from, it would be
similarly difficult to believe that tens or hundreds of disk seeks could be
carried out in a timely manner for each query to yield a 200 ms total
calculation time. If I had to guess they have probably just started taking
factors into account that they did not before. For example, perhaps they are
measuring total internal latency rather than just the latency of a particular
sub-system. Or maybe there are components that go to disk while the posting
list lookups are done in RAM.

------
spuz
What I dislike about Google's current algorithm is that when I search for a
common term such as "javadoc simpledateformat", the API documentation for Java
1.4.2 (as opposed to 1.6) comes up, apparently because that is what most
documents on the web link to. I hope this new index will allow more recent and
more up-to-date documents to get to the top of the results list.

~~~
Jermey128
In Firefox I have the following quick-search to fix that (and make searching
the API a little faster):

[http://www.google.com/search?q=java+6+%s&ie=utf-8&oe...](http://www.google.com/search?q=java+6+%s&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a)

This should also work in Chrome.

~~~
spuz
Yep, I've done exactly that for the Java 6 docs in Chrome and FF. The general
question of how to keep the latest version of a set of documents the most
relevant still applies, however. I hope Google can fix this :)

------
dskhatri
Nice to see they reference an iPod instead of a Nexus One. Somehow I feel that
if it was a Microsoft announcement, the company would reference Zunes.

~~~
batiudrami
Nexus One total internal storage: 512MB; 'the largest iPod': 160GB

The Nexus One is a phone, not a device with a large storage capacity. If
Google made one of those, I'm sure it would be referenced.

edit: the Nexus One box also includes a 4GB micro SD card, but that'd be
comparing flash memory with hard drives, and is hardly an intuitive
explanation.

------
jodrellblank
That is an _enormous_ amount of data.

I wonder how much is junk?

~~~
mkramlich
You're new to the Internetz, right? :)

~~~
jodrellblank
I meant junk like forum and Usenet posts meant to cheat search engines, sites
registered to repeat stock phrases, and the content of spammy aggregator sites,
rather than passing judgement on lolcats and breakfast-time tweets.

------
chintan
huh? 40 miles of iPods.

+5 for an intuitive analogy. Priceless!

~~~
mkomo
hopefully iPod-miles (iPm?) will become a new ISO standard for measuring
storage capacity.

~~~
rimantas
Would nicely complement a station wagon full of tapes for bandwidth.

------
jsiarto
I'm waiting for Google to open up an API or dashboard on top of Caffeine to
compete with some of the monitoring tools like Radian6 and Jive. They could
own that space and it's growing like crazy right now. Plus, we don't need any
more companies coming out with "social web" search engines.

------
samaparicio
Why is every measure of storage always "demystified" with a paper stack
analogy? I find it funny that when the paper stack would not provide the right
scale, they used a stack of storage devices...

------
leej
they could be using in-memory compression. they could be storing hot parts of
index in memory. they could be using both.

------
keegangrayson
code or it didn't happen...

------
c00p3r
Seems like the target group of this blog post is teenagers, not even
students. Some street slang would go well next to the size measurement in
iPods. =)

------
dalore
Aren't we all using Duck Duck Go anyway?

------
swah
Java or C++

~~~
shadowsun7
Google uses only four languages in their codebase: C++, Java, Python and
JavaScript. Yegge has covered this pretty extensively in the past. Go Google
him.

~~~
jlouis
And Go, presumably.

~~~
swah
I asked what this project was probably written in...

