
How Googlebot crawls JavaScript - mweibel
http://searchengineland.com/tested-googlebot-crawls-javascript-heres-learned-220157
======
KMag
This was actually my primary role at Google from 2006 to 2010.

One of my first test cases was a certain date range of the Wall Street
Journal's archives of their Chinese language pages, where all of the actual
text was in a JavaScript string literal, and before my changes, Google thought
all of these pages had identical content... just the navigation boilerplate.
Since the WSJ didn't do this for its English language pages, my best guess is
that they weren't trying to hide content from search engines, but rather
trying to work around some old browser bug that incorrectly rendered (or made
ugly) Chinese text, but somehow rendering text via JavaScript avoided the bug.

The really interesting parts were (1) trying to make sure that rendering was
deterministic (so that identical pages always looked identical to Google for
duplicate elimination purposes), (2) detecting when we deviated significantly
from real browser behavior (so we didn't generate too many nonsense URLs for
the crawler or too many bogus redirects), and (3) making the emulated browser
look a bit like IE and Firefox (and later Chrome) at the same time, so we
didn't get tons of pages that said "come back using IE" or "please download
Firefox".

I ended up modifying SpiderMonkey's bytecode dispatch to help detect when the
simulated browser had gone off into the weeds and was likely generating
nonsense.

I went through a lot of trouble figuring out the order that different
JavaScript events were fired off in IE, Firefox, and Chrome. It turns out that
some pages actually fire off events in different orders between a freshly
loaded page and a page reloaded with the refresh button. (This is when I learned
about holding down shift while hitting the browser's reload button to make it
act like it was a fresh page fetch.)

At some point, some SEO figured out that random() was always returning 0.5.
I'm not sure if anyone figured out that JavaScript always saw the date as
sometime in the Summer of 2006, but I presume that has changed. I hope they
now set the random seed and the date using a keyed cryptographic hash of all
of the loaded JavaScript and page text, so it's deterministic but very
difficult to game. (You can make the date deterministic for a month and dates of
different pages jump forward at different times by adding an HMAC of page
content (mod number of seconds in a month) to the current time, rounding down
that time to a month boundary, and then subtracting back the value you added
earlier. This prevents excessive index churn from switching all dates at once,
and yet gives each page a unique date.)
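
A minimal sketch of that month-offset trick in Node.js, purely as an
illustration of the description above (the HMAC key and the 30-day "month"
are my own assumptions, not Google's code):

    const crypto = require('crypto');
    
    const SECONDS_PER_MONTH = 30 * 24 * 60 * 60;  // approximate a month as 30 days
    
    function deterministicDate(pageContent, secretKey, nowSeconds) {
      // Per-page offset: HMAC of the page content, mod seconds in a month.
      const mac = crypto.createHmac('sha256', secretKey).update(pageContent).digest();
      const offset = mac.readUInt32BE(0) % SECONDS_PER_MONTH;
    
      // Add the offset, round the shifted time down to a month boundary, then
      // subtract the offset back out. The result is stable for about a month,
      // and each page's date jumps forward at a different moment, so the index
      // doesn't churn all at once.
      const shifted = nowSeconds + offset;
      const floored = shifted - (shifted % SECONDS_PER_MONTH);
      return new Date((floored - offset) * 1000);
    }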

~~~
copsarebastards
> (This is when I learned about holding down shift while hitting the browser's
> reload button to make it act like it was a fresh page fetch.)

Most useful aside of all time.

~~~
snowwrestler
I used to use this a lot. My experience is that for some reason, a couple
years ago it stopped working reliably as a fresh page fetch. Some items were
still coming up cached. Now I use incognito or private browsing windows
instead.

~~~
laumars
If you're running Chrom(e|ium) with developer tools open, you can right-click
the refresh button and it gives you a few refresh options (e.g. clear cache
and reload).

That tends to be my fallback whenever I'm specifically fussed about the
"freshness" of a page. That or _curl_.

~~~
snowwrestler
Thanks, never tried right-clicking that before. There's also a checkbox in dev
tools settings to "Disable cache while dev tools is open."

------
compbio
I feel that dynamic websites are not websites, but applications. Even after
this thorough research, I'd still be very wary of turning a primarily content-
based site into a dynamic app.

A plain HTML site is accessible, and will be accessible in a 1000 years. A
site depending on (external) JavaScript sources will force John Titor to
travel back in time to find version 1.x of jQuery. Starting with JavaScript
abandons the principles of progressive enhancement. Sometimes there is not even
a fallback or graceful degradation, reminding me of those 2001-era "Best viewed
at 800x600 resolution in Netscape" sites.

Google holds enormous clout among SEOs. Google says they will factor in site
speed and a large fraction of the web will become faster. Google can say more
sternly that using JavaScript can have ugly consequences for user experience
and accessibility, but they are swimming upstream: The web seems to be moving
on to fancy new technologies regardless of what their SEO says.

Not much good comes from HTML5 JavaScript fans forcing your hand. Tor enabled
JavaScript because too much of the web would break without it, leading to a
poor user experience. This led to a huge security gaffe, which I fully blame
on web developers eschewing basic principles just to get that slideshow
running.

~~~
userbinator
I think the trend of "turning a primarily content-based site into a dynamic
app", and indeed most of what has been referred to as "Web progress", "moving
the Web forward", etc. comes from the desire of content producers to obtain
and maintain more control over their content. Look at how browsers have
evolved to de-emphasise features which give the user control while adding
those that are author-targeted.

We're moving from browsers being viewers for simple HTML documents (which can
be copied, shared, and linked via simple means), to a platform for running
complex applications written in JavaScript which often render data retrieved
in proprietary formats from proprietary APIs. The "open by default" nature of
plain HTML has become the "closed by default" of the data processed by web
apps, much like with many native apps. Native app platforms (e.g. mobile) are
also gradually becoming more "closed by default"; I'm not sure if that's a
related trend.

 _Google can say more sternly that using JavaScript can have ugly consequences
for user experience and accessibility, but they are swimming upstream: The web
seems to be moving on to fancy new technologies regardless of what their SEO
says._

Part of the reason is because Google themselves are doing this in many of
their products... some of their employees probably disagree with "JS
everything", but they're in the minority.

~~~
matthewmacleod
_I think the trend of "turning a primarily content-based site into a dynamic
app", and indeed most of what has been referred to as "Web progress", "moving
the Web forward", etc. comes from the desire of content producers to obtain
and maintain more control over their content. Look at how browsers have
evolved to de-emphasise features which give the user control while adding
those that are author-targeted._

I don't agree with this. Browsers are more user-targeted than ever.

 _We're moving from browsers being viewers for simple HTML documents (which
can be copied, shared, and linked via simple means)_

Browsers still allow this.

 _to a platform for running complex applications written in JavaScript which
often render data retrieved in proprietary formats from proprietary APIs._

I have rarely seen a web API that uses anything other than straightforward
JSON.

 _The "open by default" nature of plain HTML has become the "closed by
default" of the data processed by web apps_

Almost always just as open as any HTML you would previously have received.

 _Native app platforms (e.g. mobile) are also gradually becoming more "closed
by default"; I'm not sure if that's a related trend._

What do you mean by this?

~~~
scintill76
> I have rarely seen a web API that uses anything other than straightforward
> JSON.

There are differences between HTML and AJAX+JSON+JavaScript+DOM. JSON has a
lot less of a schema than HTML. You don't have to execute custom code from a
remote server to render plain HTML. Client-side rendering is more complex for
the client. A JavaScript-based page is typically going to require more
requests for remote resources than inline HTML, potentially meaning more
bandwidth and caching/archiving costs. I can't quite put my finger on the
implications right now, but I wanted to note that "JSON is a standard."
doesn't mean much to me, since JSON is not comparable to HTML.

------
onion2k
With the way websites work today, surely the only possible way to build a
search engine is to make something like a headless browser (similar to
PhantomJS) that crawls the web like a user, seeing what the user sees,
ignoring everything that's hidden from the user, and interpreting the
importance of pages like a user would. Just parsing the HTML source of the
page won't even get close to seeing the key features of a page any more.

Impressive work by Google to do that at scale, of course, but they'd be dead
in the water if they didn't.

------
erikb
But how? I don't know about other people here, but in our company we haven't
figured out how to parse (for testing, of course) dynamic websites. All tools,
including free tools like Selenium and paid tools like QF-Test, seem unable to
handle them, or else our web developers are not building dynamic pages the way
they should be built.

~~~
onion2k
I use nightwatch.js ( [http://nightwatchjs.org/](http://nightwatchjs.org/) ).
It's a layer on top of Selenium that makes browser testing _a lot_ more
straightforward. If you start with small, straightforward tests and build
testable things from there, your code _will_ improve.
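
For anyone who hasn't seen it, a Nightwatch test reads roughly like this (the
URL and selectors are made up, so treat it as a sketch rather than a drop-in
test):

    // tests/search.js -- run with: nightwatch tests/search.js
    module.exports = {
      'dynamically rendered results appear': function (browser) {
        browser
          .url('https://example.com/search')        // hypothetical page
          .waitForElementVisible('#query', 5000)    // wait for JS to render the input
          .setValue('#query', 'butane')
          .click('#submit')
          .waitForElementVisible('.results', 5000)  // wait for the dynamic content
          .assert.elementPresent('.results .item')
          .end();
      }
    };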

~~~
erikb
The problem is that tester and developer aren't the same person. This way the
tests are much better at finding expectations the developer didn't have, but
it's harder to convince the developers to write more testable code, because they
don't know the pains of testing.

~~~
onion2k
The point of testing is not to prove the code doesn't work. It's to prove the
code does work. That subtle but important difference is the key to good
testing.

Finding a problem with code is useful, but it's extremely limited. You might
find 100 bugs, but if there are 101 bugs your product has the potential to fail
completely. It's so much more useful to define a framework of things that the
code _has to do properly_ and make sure it _does do them all properly_. To
that end, testing should come first - define what the code needs to do, write
tests to make sure it does those things (automated unit tests where possible,
but at the very least well defined processes for how you make sure it works),
and _then_ write the code to actually do it. Any developer who isn't
interested in proving their code works, and will continue to work as it
becomes more complex, is a _terrible_ developer.

tl;dr If you want to fix testing, don't write any code until you know how
you're going to test that it works.

------
markbnj
We wrote our scraper to use phantomjs via the selenium.webdriver interface in
python 2, simply because for something like 80% of the sites we extract
information from, the data was not fully available unless we could render the
dynamic parts of the page. I am not at all surprised that Google's bot is
executing js. I have assumed they could do this for years now.
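
The core of that approach fits in a few lines; here's the same idea sketched
with the JavaScript selenium-webdriver bindings instead of the Python ones we
used (hypothetical code, and it assumes a WebDriver release that still ships
PhantomJS support):

    const { Builder } = require('selenium-webdriver');
    
    async function renderedSource(url) {
      const driver = await new Builder().forBrowser('phantomjs').build();
      try {
        await driver.get(url);
        await driver.sleep(2000);              // crude wait for dynamic content to render
        return await driver.getPageSource();   // the DOM after JavaScript has run
      } finally {
        await driver.quit();
      }
    }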

As for pure HTML front-ends, I understand the attraction, but when a single
js-based implementation gets you consistent behavior and presentation across
all browsers and mobile devices the advantages are pretty huge.

------
mcculley
I've long thought that the need for a high performance sandboxed JavaScript VM
was the real impetus for Google's investment in v8, and that Chrome was just a
useful opportunity to leverage it and to get external contribution. Is there
any evidence that this is the case?

~~~
KMag
Unlikely. I was using SpiderMonkey to execute JavaScript in Google's indexing
pipeline long before I had heard about v8, and I doubt Lars had me in mind
when he started on v8. Of course, I tele-conferenced with Lars before Chrome
was released, but SpiderMonkey was still the indexing system's JavaScript
interpreter on Chrome's go-live date.

~~~
mcculley
Interesting. Thanks for clarifying that!

------
anaolykarpov
I wonder how Google indexes a page that inserts an element into the DOM 120
seconds after the page has loaded using a setTimeout().

~~~
netnichols
They probably don't care about that content.

My first guess would be that they snapshot the DOM in the JS tick immediately
after window.onload completes. Maybe they have a short pause to let any fast
timeouts or callbacks complete, but there's got to be a cutoff at some point
(e.g. to stop an infinite wait for pages that continuously update a relative
date). Of course, with their own JS engine, I bet they can get really fancy
with the heuristics to determine when to take that snapshot.
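
A crude version of that heuristic is easy to sketch with any headless browser;
for example with Puppeteer (purely illustrative, certainly not what Google
actually runs):

    const puppeteer = require('puppeteer');
    
    async function snapshotAfterOnload(url, settleMs = 500) {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      await page.goto(url, { waitUntil: 'load' });      // wait for window.onload
      await new Promise(r => setTimeout(r, settleMs));  // grace period for fast timeouts
      const html = await page.content();                // serialize the post-JS DOM
      await browser.close();
      return html;
    }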

~~~
KMag
Actually, we did care about this content. I'm not at liberty to explain the
details, but we did execute setTimeouts up to some time limit.

If they're smart, they actually make the exact timeout a function of an HMAC of
the loaded source, to make it very difficult to experiment around, find the
exact limits, and fool the indexing system. Back in 2010, it was still a fixed
time limit.

Source: executing JavaScript in Google's indexing pipeline was my job from
2006 to 2010.
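
The idea is the same HMAC trick as for dates and random numbers; something
along these lines (a hypothetical sketch, not the real implementation):

    const crypto = require('crypto');
    
    // Per-page setTimeout budget: deterministic, but hard to probe for the
    // exact cutoff because it varies with the loaded source.
    function timeoutBudgetMs(loadedSource, secretKey, baseMs = 5000, jitterMs = 5000) {
      const mac = crypto.createHmac('sha256', secretKey).update(loadedSource).digest();
      return baseMs + (mac.readUInt32BE(0) % jitterMs);
    }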

~~~
blumkvist
What about AJAX? Does it load/read/index data after the fact?

------
bceagle
There is no doubt Google continues to get better at indexing client side
rendered HTML but it is not perfect and indexing is not the same as ranking
high in organic search. For ranking, there are distinct advantages to server
rendering. The biggest one is consistent initial page load performance. Long
story short, if you care about ranking and not just indexing, you still need
server rendering.

------
Loic
For people wondering about Ajax requests: Googlebot performs them very well,
together with SVG rendering.

For example this URL:

[https://www.chemeo.com/predict?smiles=CCCC](https://www.chemeo.com/predict?smiles=CCCC)

draws the molecule using RaphaelJS, then pulls the corresponding molecule
from the database using Ajax and updates the page. Googlebot performs all of
these steps perfectly well and, at the end, indexes the page.

It is very annoying because this is not important in our case; what we want is
good indexing of the main data pages, not these pages... I do not want to
block the bot yet, but I need to figure out a way to have the main page better
ranked.

~~~
gildas
However, if you search on Google:

    
    
        "Property Prediction for Butane" site:https://www.chemeo.com
    

You'll see this page is not indexed.

~~~
Loic
Interesting, it looks like they massively dropped these pages from the index
(or at least from what they return as results). This is great (as long as they
are not dropping other important pages)!

------
hughw
Now, what kinds of V8 vulnerabilities can we exploit to get inside Google?
Said every intelligence agency everywhere.

~~~
KMag
From 2006 to 2010, my primary role at Google was JavaScript execution in the
indexing pipeline. I knew I was likely executing every known JavaScript engine
exploit out there plus a good number of 0-days, and ran the JavaScript engine
in a single-threaded subprocess with a greatly restricted set of allowed
system calls.

Certainly the right combination of kernel zero-days and JS interpreter
exploits could be used to take over the machine, but it would be non-trivial.

~~~
gwern
> ran the JavaScript engine in a single-threaded subprocess with a greatly
> restricted set of allowed system calls.

You were trying to sandbox the JS engine rather than using disposable VMs?

------
INTPenis
I'm just waiting for the first security researcher to exploit the googlebot.

~~~
KMag
My primary role at Google from 2006 to 2010 was executing JavaScript in the
indexing pipeline (not exactly in Googlebot, but close enough). I knew I was
executing probably every known exploit out there, plus a lot of 0-days, and
took lots of precautions (single-threaded subprocess with very restricted
list of allowed syscalls, etc.). It's not perfect, but breaking out of the
sandbox would require a kernel 0-day in the subsystem used by our sandbox,
plus a JS engine exploit.

~~~
INTPenis
I think it goes without saying that no system is 100% safe. ;)

------
mandeepj
Any idea how Google digests/consumes web pages while crawling? Does it strip
out all the HTML and store just the plain text? If this is the case, can you
share some more info on how they are doing it?

I think there is no way they are going to scrape the websites, as there are
millions of them, each with their own structure.

------
Drezzor
I get rendering HTML from JS, handling timeouts, even infinite scroll. But how
in the world are they handling onMouseOver and other mouse events? My best
guess so far is reversing the code from the document.location events.

~~~
aboodman
I think you are misunderstanding how this works. Google isn't "handling" any
events at all, your webpage is. Google is instead the source of those events -
it is simulating the role of a user.

So the bot loads your webpage into a headless browser and sends it a series of
events to simulate a user interacting with it, and waits for navigation
requests.

There is probably a whitelist of simulation behaviors:

    
    
      * mouseover, then click each <a> node
      * mouseover every pixel
      * mouseover, then change every <select> node
      * mouseover, then click every <button>
      etc...
    

Caveat: though I worked at Google when this work was being done, I was on a
different team and don't have any inside knowledge - just speculating on an
approach that would make sense.
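
As a rough illustration of that approach (again, pure speculation, not
Google's code): the crawler side might dispatch synthetic events at the page
and intercept whatever navigation the page's handlers try to trigger.

    // Fire synthetic mouse events at each link and record any navigation the
    // page would have performed, without actually leaving the page.
    function probeLinks(document, recordUrl) {
      document.addEventListener('click', function (e) {
        e.preventDefault();                        // never actually navigate
        const a = e.target.closest('a[href]');
        if (a) recordUrl(a.href);                  // remember the candidate URL
      }, true);
    
      for (const a of document.querySelectorAll('a[href]')) {
        a.dispatchEvent(new MouseEvent('mouseover', { bubbles: true }));
        a.dispatchEvent(new MouseEvent('click', { bubbles: true, cancelable: true }));
      }
    }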

------
aparadja
Any idea whether this affects (randomized) A/B testing? I think that in the
past, Google has simply ignored the dynamic test changes to the site's
content. Now I'm not quite sure anymore.

~~~
KMag
It's important to have deterministic execution of JavaScript for duplicate
elimination. Getting several identical pages (with different URLs) in your
search results is a really bad user experience.

As of when I left Google in 2010, the JavaScript random number generator
always returned 0.5 (and some SEO figured it out and blogged about it, no
secrets here). However, I was trying to convince my manager to let me instead
seed a random number generator with an HMAC of all of the currently loaded
HTML and JavaScript (to make it deterministic, but hard for a page to show
something good only 1 time in a million to users while showing it 100% of the
time to Google's indexing system).
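
Roughly the scheme I had in mind, sketched from memory in Node.js (hypothetical
code, not anything that actually shipped):

    const crypto = require('crypto');
    
    // Seed a small deterministic PRNG from an HMAC of the loaded HTML and
    // scripts, so "Math.random()" is stable for a given page but can't be
    // predicted or gamed by the page's author.
    function seededRandom(html, scripts, secretKey) {
      const mac = crypto.createHmac('sha256', secretKey)
                        .update(html)
                        .update(scripts.join('\n'))
                        .digest();
      let state = mac.readUInt32BE(0) || 1;  // non-zero seed for xorshift32
      return function random() {
        state ^= state << 13;
        state ^= state >>> 17;
        state ^= state << 5;
        return (state >>> 0) / 0x100000000;  // map to [0, 1)
      };
    }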

------
Grue3
If we can execute Javascript on GoogleBot, could it possibly be hacked/broken
somehow? Surely there must be security vulnerabilities in its Javascript
engine.

------
jmngomes
It would be interesting to see how client-side templating affects SEO,
especially now that we understand that JS is indeed executed by the crawler.

------
alexlarsson
I wonder if it does Ajax. Then you could do some kind of weird recursion where
it does a request to Google for itself.

------
z3t4
Google can even read frames! (sarcasm) It makes all other search engines look
bad in comparison though ...

------
romaniv
Just remember that there is life outside of Google and by brushing it off
you're stifling further web innovations.

Also, I wonder how Google handles security while executing random JS code.
It's one thing to hack into a single browser. It's another thing to hack into
a crawler. Think of all the possibilities.

------
frik
searchengineland hasn't tested AJAX, as the author wrote in the comments:
"That's a great question! Our test was to programmatically insert text where
we wanted into the DOM, but not as a server side transaction, like AJAX."

~~~
billybolero
I read another blog about someone who tested that (don't have the URL, but it
was easy to find), and their conclusion was that the crawler won't wait for
any Ajax request to finish to let you render that content. If you want to
render with JavaScript, you need to make that data a part of the initial
payload and render it during onload.
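
In practice that just means embedding the data in the page and rendering it on
load instead of fetching it afterwards; a minimal sketch (all names made up):

    <script id="initial-data" type="application/json">
      {"items": ["first", "second"]}
    </script>
    <ul id="list"></ul>
    <script>
      // Render the embedded payload during onload; there's no Ajax round-trip,
      // so the content exists by the time the crawler snapshots the page.
      window.onload = function () {
        var data = JSON.parse(document.getElementById('initial-data').textContent);
        document.getElementById('list').innerHTML = data.items
          .map(function (item) { return '<li>' + item + '</li>'; })
          .join('');
      };
    </script>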

