
Deprecating our AJAX crawling scheme - antichaos
http://googlewebmastercentral.blogspot.com/2015/10/deprecating-our-ajax-crawling-scheme.html
======
m0th87
Don't believe the hype. Google has been saying that they can execute
javascript for years. Meanwhile, as far as I can see, most non-trivial
applications still aren't being crawled successfully, including my company's.

We recently got rid of prerender because of Google's last article promising the
same thing [1]. It didn't work.

1: [http://googlewebmastercentral.blogspot.com/2014/05/understan...](http://googlewebmastercentral.blogspot.com/2014/05/understanding-web-pages-better.html)

~~~
thoop
Todd from Prerender.io here. We've seen the same thing with people switching
to AngularJS assuming it will work and then coming to us after they had the
same issue.

[1] This image is from 2014, when Google previously announced they were
crawling JavaScript websites, showing our customer's switch to an AngularJS
app in September. Google basically stopped crawling their website when Google
was required to execute the JavaScript. Once that customer implemented
Prerender.io in October, everything went back to normal.

Another customer recently (June 2015) did a test for their housing website.
They tested the use of Prerender.io on a portion of their site against Google
rendering the JS of another portion of their site. Here are the results they
sent to me:

Suburb A was prerendered, and Google asked for 4,827 page impressions over 9 days.

Suburb B was not prerendered, and Google asked for 188 page impressions over 9 days.

We've actually talked to Google about this issue to see if they could improve
their crawl speed for JavaScript websites, since we believe it's a good thing
for Google to be able to crawl JavaScript websites correctly. But it looks like
any website with a large number of pages still needs to be sceptical about
getting all of its pages into Google's index correctly.

1: [https://s3.amazonaws.com/prerender-static/gwt_crawl_stats.pn...](https://s3.amazonaws.com/prerender-static/gwt_crawl_stats.png)

~~~
grey-area
Perhaps this could be down to response times too; might they crawl much
quicker if given static HTML very quickly?

What were the page render times for the two types of page?

------
cotillion
So they're actually evaluating all the JS and CSS Googlebot is consuming. That's
insane.

Can we forget about any new competitors in search engine land now? Not only do
you have to match Google in relevance, you'll actually have to implement your
own BrowserBot just to download the pages.

~~~
thephyber
The hints that they were doing this were littered everywhere.

Google does malware detection. Not on every crawl, but on a certain percentage of
crawls. At my old social network site, they detected malware that must have
come from ad/tracking networks, because those pages had no UGC. This suggests
they were using Windows virtual machines (among others) and very likely running
real browsers, not just a heavily modified curl/wget or a headless Chrome.

They started crawling the JavaScript-rendered version of the web and AJAX
schemes that use hashbang (#!) URLs. This was explicit acknowledgement that they
were running JavaScript and doing advanced DOM parsing.
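
(For reference, the scheme being deprecated here worked roughly like this: the
crawler rewrites a hashbang URL into an _escaped_fragment_ request, which the
server is expected to answer with a pre-rendered HTML snapshot.)

    http://example.com/page#!section=about
      -> crawler requests: http://example.com/page?_escaped_fragment_=section=about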

They have always told people that cloaking content (whether keyed on Google
crawler IP blocks, user-agent, or other means) is a violation, and they actively
punish it. This suggests they do content detection and likely execute JavaScript
to detect whether extra scripts change the content of the page for clients that
don't appear to be Googlebot.

They have long had measures in place to detect invisible text (eg. white text
on white background) or hidden text (where HTML elements are styled over other
HTML elements). This suggests both CSS rendering and JS rendering.

~~~
mynameisvlad
> They have long had measures in place to detect invisible text (eg. white
> text on white background) or hidden text (where HTML elements are styled
> over other HTML elements). This suggests both CSS rendering and JS
> rendering.

No, this actually suggests it's not doing either. Both invisible and hidden
text, the way you've described them, would be implemented with a CSS style. Not
applying that style would mean the text appears as normal. I understand you
probably meant that the JS was injecting the text, which is fully possible,
but that's neither hidden nor invisible text.

~~~
leviathan
The parent is talking about them penalizing sites that use such hidden text
that would normally show to the crawler but be invisible to an actual human
looking at the page.
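
The classic penalized pattern is something like this - the text is in the markup
for the crawler, but a visitor never sees it:

    <!-- keyword stuffing that the crawler sees but a visitor never does -->
    <p style="color:#fff; background:#fff;">cheap widgets best widgets buy widgets</p>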

------
rdoherty
Wow, I built a project that rendered JS-built webpages for search engines via
NodeJS and PhantomJS. Rendering webpages is _extremely_ CPU intensive; I'm
amazed at the amount of processing power Google must have to do this at
Internet scale.

I really hope this works. Lots of JS libraries expect things like viewport and
window size information, and I wonder how Google is achieving that.
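
For the curious, the core of that project was roughly the following kind of
PhantomJS script (the URL handling and fixed timeout are simplified
placeholders; a real version needs smarter load detection):

    // prerender.js - run as: phantomjs prerender.js http://example.com/page
    var system = require('system');
    var page = require('webpage').create();
    var url = system.args[1];
    
    // Many JS libraries read window/viewport dimensions, so give them real values.
    page.viewportSize = { width: 1280, height: 800 };
    
    page.open(url, function (status) {
        if (status !== 'success') {
            console.log('Failed to load ' + url);
            phantom.exit(1);
        }
        // Crude: give client-side rendering a fixed window to finish before snapshotting.
        setTimeout(function () {
            console.log(page.content); // the fully rendered HTML
            phantom.exit(0);
        }, 2000);
    });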

~~~
MichaelApproved
I wonder if they're cutting out a lot of the rendering that PhantomJS is
doing. Not to say that any type of rendering is cheap, but I'm guessing they
have a limited version of a JS rendering engine that does just enough to index
the page.

I bet they'd also skip all the FB like buttons and other common social
media elements that don't impact the content.
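
That kind of skipping is cheap to do in a PhantomJS-style renderer, for what
it's worth. A rough sketch - the blocklist here is purely illustrative, not
anything Google has published:

    var page = require('webpage').create();
    
    // Hypothetical blocklist: hosts whose widgets don't affect indexable content.
    var skip = ['facebook.com', 'platform.twitter.com', 'doubleclick.net'];
    
    page.onResourceRequested = function (requestData, networkRequest) {
        var blocked = skip.some(function (host) {
            return requestData.url.indexOf(host) !== -1;
        });
        if (blocked) {
            networkRequest.abort(); // never download the like buttons / trackers
        }
    };
    
    page.open('http://example.com/', function () {
        console.log(page.content);
        phantom.exit();
    });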

~~~
cmg
They're starting to consider page load speed as a factor in rankings, which
would lead me to believe that they're letting all the social buttons /
trackers / media load.

~~~
anon1mous
How do you know a page has loaded? A complex page with ads, AJAX, and WebSockets
may be constantly busy. Most social buttons, ads, etc. are now loaded by
callbacks that usually finish after the page has rendered.

~~~
tracker1
Most of that data flow, barring user interaction, is much more limited compared
to the initial load of controls, iframes, images, etc. You can visibly see
the drop-off.

If you look at the network tab in Chrome dev tools, you can see when the DOM
ready event fires, when the window load event fires, and when it really feels
like the content is done loading. That final load time is when the data flow
lulls for a bit.
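
A prerenderer can approximate that lull programmatically: count requests in
flight and snapshot the page once the network has been quiet for a while. Rough
PhantomJS sketch, with an arbitrary 500ms quiet window:

    var page = require('webpage').create();
    var pending = 0;     // requests currently in flight
    var idleTimer = null;
    
    function armIdleTimer() {
        clearTimeout(idleTimer);
        idleTimer = setTimeout(function () {
            if (pending === 0) {
                console.log(page.content); // network has gone quiet - snapshot the DOM
                phantom.exit(0);
            }
        }, 500); // arbitrary "quiet" window
    }
    
    page.onResourceRequested = function () { pending++; };
    page.onResourceReceived = function (response) {
        if (response.stage === 'end') {
            pending--;
            armIdleTimer();
        }
    };
    
    page.open('http://example.com/', function () { armIdleTimer(); });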

------
a2tech
This is good - one of my current projects for a customer is entirely AJAX/JS
rendered, and we were worried that Googlebot would have a fit with it.

~~~
iwilliams
We recently built a site for a customer in Ember and their SEO guys were
concerned about indexing. I wasn't sure how it was going to work out, but in
the end Google has been able to index every page no problem.

~~~
a2tech
Do you know if they sent Google a sitemap? Our client is insisting on a
sitemap that has pointers to every-single-product. Something on the order of
2MM+ product pages. It seems like a bit much to me

~~~
MichaelApproved
Keep this in mind
[https://support.google.com/webmasters/answer/183668?hl=en&to...](https://support.google.com/webmasters/answer/183668?hl=en&topic=8476&ctx=topic)

> _Break up large sitemaps into smaller sitemaps to prevent your server from
> being overloaded if Google requests your sitemap frequently. A sitemap file
> can't contain more than 50,000 URLs and must be no larger than 50 MB
> uncompressed._
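
So for 2MM+ products you'd end up with 40+ chunk files plus a sitemap index,
and the index is what you actually submit. A rough Node sketch of generating
them (the URL source file and domain are placeholders):

    var fs = require('fs');
    
    // Placeholder: your 2MM+ product URLs, however you export them.
    var urls = require('./product-urls.json');
    
    var CHUNK = 50000; // Google's per-file URL limit
    var indexEntries = [];
    
    for (var i = 0; i < urls.length; i += CHUNK) {
        var name = 'sitemap-' + (i / CHUNK) + '.xml';
        var body = urls.slice(i, i + CHUNK).map(function (u) {
            return '  <url><loc>' + u + '</loc></url>';
        }).join('\n');
        fs.writeFileSync(name,
            '<?xml version="1.0" encoding="UTF-8"?>\n' +
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
            body + '\n</urlset>\n');
        indexEntries.push('  <sitemap><loc>http://example.com/' + name + '</loc></sitemap>');
    }
    
    // The index file is the one you submit in Webmaster Tools.
    fs.writeFileSync('sitemap-index.xml',
        '<?xml version="1.0" encoding="UTF-8"?>\n' +
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n' +
        indexEntries.join('\n') + '\n</sitemapindex>\n');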

------
espeed
This was the missing piece for Polymer elements / custom web components. Now
that Google has confirmed it's indexing JavaScript, web-component adoption
should take off.

~~~
tracker1
I want to like polymer/web-components... I just find that it kind of flips
around the application controls that redux+react offers. I'm not sure that I
like it better in practice.

------
eurokc98
Gary Illyes @goog said this was happening Q1 this year, and like others
mentioned, lots of other direct/indirect signals have pointed this way.

[http://searchengineland.com/google-may-discontinue-ajax-craw...](http://searchengineland.com/google-may-discontinue-ajax-crawlable-guidelines-216119) March 5th: Gary said you may see a blog post at the Google
Webmaster Blog as soon as next week announcing the decommissioning of these
guidelines.

Pure speculation but interesting... The timing may have something to do with
Wix, a Google Domains partner, who is having difficulty with their customer
sites being indexed. The support thread shows a lot of talk around "we are
following Google's Ajax guidelines so this must be a problem with Google".
John Mueller is active in that thread so it's not out of the realm of
possibility someone was asked to make a stronger public statement.
[http://searchengineland.com/google-working-on-fixing-problem...](http://searchengineland.com/google-working-on-fixing-problem-with-wix-web-sites-not-showing-up-in-search-results-233310)

~~~
nostrademons
I'm betting that they finally solved the scalability problems with headless
WebKit. Google's been able to index JS since about 2010, but when I left in
2014, you couldn't rely on this for anything but the extreme head of the site
distribution because they could only run WebKit/V8 on a limited subset of
sites with the resources they had available. Either they got a whole bunch
more machines devoted to indexing or they figured out how to speed it up
significantly.

~~~
tracker1
I'd say both are pretty likely.. another round of lower-power servers with
potentially more cores... more infrastructure... Combined with improvements in
headless rendering pipelines. I haven't looked into it in well over a year
now, but last I checked dynamic updates took about 2-3 days to get discovered
vs. server-delivered being hours for a relatively popular site.

I'm guessing they've likely cut this time in half through a combination of
additional resources and performance improvements. Wondering if they'd be
willing to push this out as something better than PhantomJS... probably not, as
it's a pretty big competitive advantage.

I know MS has been doing JS rendering for a few years; they show up in
analytics traffic (big time if you change your routing scheme on a site with
lots of routes - it will throw off your numbers).

------
nailer
Currently I use prerender.io and this meta tag:

    <meta name="fragment" content="!">

I don't actually use #! URLs (or pushState, though I might use pushState in the
future), but without both of these Google can't see anything JS-generated
(verified using Google Webmaster Tools).
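
As I understand it, that meta tag (on a site without #! URLs) just tells the
crawler to re-request each page with an empty _escaped_fragment_ parameter,
which is the request prerender.io answers with a snapshot:

    http://example.com/products/42
      -> crawler requests: http://example.com/products/42?_escaped_fragment_=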

Does this announcement mean I can remove the <meta> tag and stop using
prerender.io now?

~~~
rgbrgb
We have a similar setup and were wondering the same thing (though we use
pushState). Today we were actually trying to figure out a workaround for 502s
and 504s that the Google crawler was seeing from prerender. We just took the
plunge and removed the meta tag, because over 99% of our organic search traffic
is from Google. Fingers crossed!

~~~
thoop
I'd love to help here if I can. I'd also love to hear the results of you
removing the meta tag! todd@prerender.io

------
shostack
Any idea how related this might be to Wix sites getting de-indexed?[1]

[http://searchengineland.com/google-working-on-fixing-problem...](http://searchengineland.com/google-working-on-fixing-problem-with-wix-web-sites-not-showing-up-in-search-results-233310)

------
rcconf
This might be obvious to anyone who has done SEO, but can Googlebot index
React/Angular websites accurately? I was always under the impression that the
isomorphic aspect of React helped with SEO (not just load times).
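
By isomorphic I mean rendering the same components to HTML on the server, so
the crawler gets real markup without executing any JS - roughly this kind of
minimal sketch (made-up App component, assuming React and Express):

    // server.js - assumes: npm install express react react-dom
    var express = require('express');
    var React = require('react');
    var ReactDOMServer = require('react-dom/server');
    
    // Hypothetical placeholder component; a real app renders the matched route here.
    var App = React.createClass({
        render: function () {
            return React.createElement('h1', null, 'Hello, crawler');
        }
    });
    
    var app = express();
    app.get('*', function (req, res) {
        // Same components the client uses, rendered to plain HTML on the server,
        // so the crawler sees real markup even without running any JS.
        var html = ReactDOMServer.renderToString(React.createElement(App));
        res.send('<!doctype html><html><body><div id="root">' + html + '</div></body></html>');
    });
    app.listen(3000);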

~~~
vbezhenar
If a modern browser can render your site accurately, then Google can index it.

~~~
tracker1
It's always lagged in my experience... I'm hoping this announcement means that
lag is under a day, instead of the 2-3 days it was a bit over a year ago.

------
jwr
Finally. It was obvious we would have to get to that point eventually; it just
wasn't clear when.

