
Googlebot is Chrome - blauwbilgorgel
http://ipullrank.com/googlebot-is-chrome/
======
gojomo
Clearly, a bunch of Google's services have been able to behave like a
rendering/JS-executing browser for a while. However, that may not mean the
initial URL fetching is usually (or even often) by a Chrome-like process.

My guess would be there are a mixture of processes which all collect page
captures into a working area. (Which process is used for a particular site/URL
might vary over time based on feature-detection – for example an initial old-
style survey collection might be followed by a more browser-like collection
later. The 'working area' is whatever serves the BigTable role in 2011.) All
captures can then be further analyzed (virtually re-crawled from the working
area) as necessary, by an even larger collection of analysis processes. That
many of these analysis processes share code with other Google projects is to
be expected, but to say that they are actually 'GoogleBot' (once they're not
doing the initial fetching) or 'Chrome' (when they're processing bulk working-
area data after-the-fact) is likely an oversimplification that obscures more
than it illuminates.

The strongest external evidence that Google's main crawling was using a
Chrome-like engine would be if the pattern of URL-fetching, as observed in
logs, became more like a browser: everything needed to render one page fetched
in rapid succession (unless it's already previously cached). Are people seeing
that in their logs, for example on Googlebot's first visit to all-new content?
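For anyone who wants to check their own logs for that pattern, here's a rough sketch. The combined log format, the `Googlebot` UA substring, and the 5-second window are all assumptions on my part, not anything Google has documented:

```python
import re
from datetime import datetime

# Assumes Apache/nginx combined log format; adjust the regex for your server.
LOG_RE = re.compile(
    r'\S+ \S+ \S+ \[(?P<ts>[^\]]+)\] "GET (?P<path>\S+) [^"]*" \d+ \d+ '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)

ASSET_EXTS = ('.css', '.js', '.png', '.jpg', '.gif')

def browser_like_fetch(lines, ua_substr='Googlebot', window=5.0):
    """Return True if, within `window` seconds of fetching an HTML page,
    the same user agent also fetched a rendering asset -- the
    'everything needed to render one page in rapid succession' pattern."""
    page_time = None
    for line in lines:
        m = LOG_RE.match(line)
        if not m or ua_substr not in m.group('ua'):
            continue
        ts = datetime.strptime(m.group('ts'), '%d/%b/%Y:%H:%M:%S %z')
        if m.group('path').endswith(ASSET_EXTS):
            # Asset fetched shortly after a page fetch: browser-like.
            if page_time and (ts - page_time).total_seconds() <= window:
                return True
        else:
            page_time = ts  # remember the most recent page fetch
    return False
```

A classic single-URL crawler would leave the assets untouched (or fetch them hours later), so this should stay False for old-style Googlebot visits.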

------
azakai
The article seems to have various inaccuracies, like the ones quoted here, so
I am not sure what to make of it overall.

> And if that weren’t enough to make you curious, Google didn’t just take
> WebKit’s Rending Engine and call it Chrome…

They didn't rebrand WebKit, they wrote a ton of code on top. This sounds like
an unfair dismissal of Google's efforts.

> they created a new JavaScript Engine known as V8. This new JavaScript engine
> is perhaps the fastest engine available

Mostly true, but it depends on the benchmark of course.

> and Google chose to add engineering complexity by making it
> standalone/embeddable; incidentally making projects like NodeJS possible.

Not at all true. The JS engine in WebKit, JavaScriptCore, is also embeddable
and used in various places (for example, Seed in GNOME). Ditto SpiderMonkey.

> The engineering energy that went into creating the V8 Engine Is no small
> matter, as it was written entirely in C++ and designed to convert JavaScript
> to machine code to increase speed.

True, but all JS engines I am aware of are written in C++ and compile to
machine code (and they did so from around when V8 launched, in the case of
JavaScriptCore and SpiderMonkey).

~~~
techarity
Hello azakai, thanks for the feedback.

The article was written as a simplification, since the target audience is SEO
professionals who may or may not have a development background. I'm really
surprised by the interest so far; I didn't expect this to get outside of its
intended audience.

The "inaccuracies" are largely simplifications, rather than deliberate
inaccuracies, but please do fact check it; I appreciate the feedback. It's
especially valuable to hear from developers.

I'm not at all dismissing Google's efforts if you read the article more
deeply. I just place more emphasis on their effort to make Chrome threaded, as
I believe that functionality is absolutely necessary to deploy a browser as a
spider.

It's also an amazing piece of engineering, and has loads of benefits. As for
V8's speed, I couldn't find any recent benchmarks, so I leaned on anecdotal
evidence. As you said, it really depends on the benchmark.

Mentioning the programming language was more about hinting at Google's
proficiency in the space; C++ is one of their core development languages.

Anywho, thanks again for the feedback. If nothing else, I hope you found it
interesting!

------
blauwbilgorgel
Mirror Google Cache:
[https://webcache.googleusercontent.com/search?q=cache:http:/...](https://webcache.googleusercontent.com/search?q=cache:http://ipullrank.com/googlebot-is-chrome/)

I started believing that Googlebot would render entire pages, when Google
Pagespeed was introduced. When they calculate page loading times, they can
also produce a waterfall view of the rendering progress.

Before, I suspected some rendering/CSS/DOM parsing was used to detect hidden
text and links, but not on such a massive scale. Now you can index a new page
and it almost looks like a preview image gets generated on-the-fly when there
is no cache available.

I wonder how much of this render data is used for Google Panda quality control
updates. Position of advertisements, call-to-actions, author information,
address information and disclaimer/privacy policies could now all play a role
in making your site seem more credible in the eyes of Google and your users.

~~~
daxelrod
Be careful clicking that Google Cache link. The page appears to make a bunch
of requests to shady sites.

~~~
techarity
Haha, those calls are AJAX calls for the social sharing widgets on iPullRank.
It's Twitter, Digg, ShareThis, etc.

Thank you for looking out for the fine folks at HN though!

~~~
daxelrod
I was actually talking about requests to the website roots of various porn and
p2p domains.

EDIT: Please contact me at the address in my profile if you'd like me to send
you a list of URLs that my browser accessed when I loaded that cached page.

~~~
techarity
A quick run through the site with Firebug's XHR and Dependency Logger didn't
show any porn or P2P links or strange script calls.

I'll see if the site owner wants to reach out to you via email; thanks so
much for speaking up.

------
techarity
Google's ability to crawl JavaScript in IFRAMEs was confirmed by a Google
Search Quality Engineer just today.

<https://twitter.com/#!/mattcutts/status/131425949597179904>

There might be something to this theory...

------
aritraghosh007
Great! I liked the title, then clicked the link, and what I got was this:
HTTP 503: Service Temporarily Unavailable. HUH!

~~~
ipullrank
Site is back up.

------
extension
Surely there is some way to sniff for Chrome client-side on Googlebot
requests and report back to the server?

~~~
JonnieCache
Modernizr would be my choice.
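A minimal server-side sketch of that idea, in Python. The `/js-beacon` path and the log format here are hypothetical, and Modernizr's real feature tests are far more thorough than the single canvas check shown; the point is just that the beacon fires only if the visitor actually executes JavaScript, so grepping the access log tells you which user agents did:

```python
# Inject a beacon that only a JS-executing client fires, then see which
# user agents requested it. Endpoint name and log layout are made up.

BEACON = """<script>
// Fires only if the visitor actually executes JavaScript. Modernizr-style
// feature tests could be appended to the query string for more detail.
(new Image()).src = '/js-beacon?canvas=' +
    (!!document.createElement('canvas').getContext ? 1 : 0);
</script>"""

def with_beacon(html):
    """Insert the beacon snippet just before the closing </body> tag."""
    return html.replace('</body>', BEACON + '</body>')

def js_capable_agents(log_lines):
    """User agents that requested the beacon, i.e. executed our script."""
    agents = set()
    for line in log_lines:
        if '/js-beacon' in line:
            # Assume the UA is the final quoted field (combined log format).
            agents.add(line.rsplit('"', 2)[1])
    return agents
```

If `Googlebot` user agents start showing up in `js_capable_agents`, that's direct evidence the crawler is running a real JS engine against your pages.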

------
mrspandex
The title made me think Chrome was acting as a sort of distributed Googlebot.
Although the technology would be awesome, the privacy implications would be
huge.

------
johnmurch
Part of a presentation at #searchlove - epic stuff!

