
Rendering AJAX-crawling pages - gildas
https://webmasters.googleblog.com/2017/12/rendering-ajax-crawling-pages.html
======
thoop
Todd from Prerender.io here. We always knew this day would come eventually :)
We are currently serving around 60 million prerendered pages to crawlers every
day, with Google being about half of those requests. We are recaching around 1
billion pages every month in PhantomJS/Headless Chrome. Google is the only
crawler executing a meaningful amount of JavaScript so Bing, Baidu, Yandex,
Facebook, Twitter, and other SEO crawlers still need prerendering.

For anyone who needs to update their own crawlers to match Google's new
JavaScript crawling, we've opened up our prerendering engine that uses
Headless Chrome at [https://prerender.com](https://prerender.com). You can
capture HTML, screenshots, PDFs, or even HAR files from any web page with just
an HTTP request to our service. So it's super easy to add JavaScript crawling
to any crawler with Prerender.com (and it's open source
[https://github.com/prerender/prerender](https://github.com/prerender/prerender)).
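
As a sketch of what that looks like from a crawler's side, the snippet below
builds the HTTP request for a prerender-style service. The endpoint and the
`X-Prerender-Token` header name are illustrative assumptions, not the
documented API; check the service docs for the real interface.

```python
from urllib.request import Request

# Hypothetical endpoint and auth header, for illustration only.
PRERENDER_ENDPOINT = "https://service.prerender.io"

def prerender_request(page_url: str, token: str) -> Request:
    """Build the request a crawler sends to fetch a fully rendered
    copy of page_url instead of its raw, JavaScript-dependent HTML."""
    return Request(
        f"{PRERENDER_ENDPOINT}/{page_url}",
        headers={"X-Prerender-Token": token},
    )

req = prerender_request("https://example.com/app", "MY_TOKEN")
# req.full_url == "https://service.prerender.io/https://example.com/app"
```

Opening that request with `urllib.request.urlopen(req)` would then return the
rendered HTML rather than an empty application shell.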

For our Prerender.io customers, this announcement just means that Google will
stop crawling ?_escaped_fragment_= URLs so they won’t request prerendered
pages anymore. Instead, Google will just execute the javascript directly and
index the result.
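
For context, the scheme being retired worked by URL rewriting: a crawler that
saw a `#!` URL would request an `?_escaped_fragment_=` variant of it instead.
A rough sketch of that mapping (the example URLs are made up):

```python
from urllib.parse import quote

def escaped_fragment_url(pretty_url: str) -> str:
    """Map a hashbang (#!) URL to the ?_escaped_fragment_= URL that
    crawlers fetched under the old AJAX crawling scheme."""
    base, _, fragment = pretty_url.partition("#!")
    separator = "&" if "?" in base else "?"
    return f"{base}{separator}_escaped_fragment_={quote(fragment, safe='')}"

print(escaped_fragment_url("https://example.com/app#!/users/42"))
# https://example.com/app?_escaped_fragment_=%2Fusers%2F42
```

Going forward, Google simply fetches the `#!` URL itself and runs the page's
JavaScript instead of requesting the rewritten form.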

We’ve always recommended that our customers use the escaped fragment protocol,
so it will be a smooth transition as Google slowly stops crawling the
?_escaped_fragment_= URLs. No changes need to be made if you are currently
using Prerender.io. Keep an eye on our Twitter (@prerender) and we'll give
updates on Google’s transition.

One thing to watch for when Google starts executing your JavaScript is the
number of pages crawled per day in your Google Webmaster Tools. In the past,
we've seen Google crawl much more slowly when executing the JavaScript itself.
Hopefully JavaScript websites don't take a hit in the number of pages crawled
daily, since that can affect whether large sites keep all of their pages up to
date in Google's index.

~~~
exikyut
This isn't so much a service question as a technical bogglement, but I'm very
curious what sort of heavy lifting it takes to do 60 million renders a day.

Assuming renders take 7-10 seconds at worst, that means (if I've got my math
right!) that you need between (60m/(86400/7)=4861) and (60m/(86400/10)=6944)
renders in flight at any given moment in order to keep up. (86400 = seconds in
a day)
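
Those figures check out as a back-of-the-envelope application of Little's Law
(average requests in flight = arrival rate × service time):

```python
RENDERS_PER_DAY = 60_000_000
SECONDS_PER_DAY = 86_400

arrival_rate = RENDERS_PER_DAY / SECONDS_PER_DAY  # ~694 renders completed/second

# Little's Law: average renders in flight = arrival rate * render duration
for duration_s in (7, 10):
    in_flight = arrival_rate * duration_s
    print(f"{duration_s}s per render -> {in_flight:,.0f} renders in flight")
# 7s per render -> 4,861 renders in flight
# 10s per render -> 6,944 renders in flight
```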

...Ahahahaha :)

Given that a single Chrome instance on my new-but-not-particularly-amazing i3
box can be sluggish at the best of times... I have no idea how well
Xeon(?)-class hardware (possibly running Xen? :P) tolerates running multiple
entire copies of Chromium... I initially wondered if you needed 1000 compute
instances, then I realized maybe you only needed 400, and now I honestly don't
know at all.

\--

I'm also curious how using Headless Chrome and PhantomJS is working out. As
in, genuinely interested. My understanding is that PhantomJS has pretty much
wound down, while Headless Chrome is fractionally different enough from
regular Chrome that it's possible to tell which one you're running on
([https://news.ycombinator.com/item?id=14936025](https://news.ycombinator.com/item?id=14936025)).
I've been idly curious about "perfectly sandboxing" webpages so they honestly
can't tell they're not in a "normal" PC/laptop/mobile environment, and my
impression is that I'd have to start with a _very_ carefully configured copy
of normal Chromium in order to do it.

\--

I must admit that I got curious what 60m monthly renders looked like against
the pricing structure... but couldn't really figure it out; it's not a simple
enough exponential curve (and I can't math for nuts). Single-stepping through
the pricing algorithm was very interesting though ($1522 for enterprise, huh,
cool).

\--

PS. The view-source link at the bottom is unfortunately broken; Chrome blocked
opening such URLs recentlyish. Fixing it will likely require, ironically, a
little server-side renderer :)

\--

EDIT: One last thing, note
[https://news.ycombinator.com/item?id=15882066](https://news.ycombinator.com/item?id=15882066)
from this thread

~~~
thoop
Yep, we have LOTS of servers :) We pretty heavily cache pages too.

Headless Chrome is great and we're super thankful that the Chromium team put
the work in! PhantomJS is good... it just doesn't have all of the latest
features, like ES6. So it was really helpful that headless Chrome came along
right as people started using more ES6.

Yeah, Chrome did break the opening of view-source URLs a while back for our
[https://prerender.io/](https://prerender.io/) buttons on the bottom of the
homepage.

------
pixelmonkey
My read on their support for rendering JavaScript is that they have, in
effect, turned Chrome's engine into their web crawler.

We see some signs of this with the chrome headless project and with the fact
that Chrome, masquerading as a mobile viewport, can be used as an effective
mobile crawler, even when the mobile UX is provided primarily by JavaScript.

I think they still use a plain HTTP request based crawler for "most" sites
(mainly for speed), but then flip on Chrome-based crawling for popular sites
and for sites that seem to be JS-heavy. I see no reason why, long term, Chrome
wouldn't become the primary crawl/render engine for Google.

~~~
thoop
It looks like they are currently using Chrome 41
([https://developers.google.com/search/docs/guides/rendering](https://developers.google.com/search/docs/guides/rendering))
when rendering pages. I agree that with all of the work on headless Chrome,
they should move towards using headless Chrome in the future.

~~~
exikyut
Iiiiinteresting.

I wonder if all. the. security. patches. from every subsequent Chrome
milestone regularly get backported to M41?

Obviously it's sandboxed to the hilt. Poking the sandbox and seeing what it's
made of would be VERY interesting though.

------
gildas
update: John Mueller from Google said [1]:

"If you can only provide your content through a 'DOM-level equivalent'
prerendered version served through dynamic serving to the appropriate clients
(ed. note: e.g. Google bot), then that works for us too."

[1] [https://groups.google.com/d/msg/js-sites-wg/70GqODR-iN4/foUz...](https://groups.google.com/d/msg/js-sites-wg/70GqODR-iN4/foUzscb3AQAJ)

------
oelmekki
I love how the title of this article is an understatement. Google is
abandoning the AJAX crawling scheme because... it will render the JavaScript
in all pages itself. This is awesome news :)

------
notatoad
so, they're actually just removing the option for websites that still use a
/#! url structure to send a pre-rendered copy of the page.

not quite the same thing as abandoning ajax crawling.

~~~
lmkg
They are abandoning the AJAX crawling scheme. "Scheme" here is in the sense of
"documented interface," not in the sense of "we plan to do things." Easy
confusion though, since the scheme didn't see much uptake that I'm aware of.

~~~
_sdegutis
Good thing they aren't abandoning AJAX Crawling Scheme support though.
Specialized lisp dialects are the lifeblood of our industry.

~~~
hamandcheese
Your joke either wasn't well received or went over people's heads, but I
chuckled.

~~~
exikyut
Thanks; _now_ I went back and reread and noticed the capitalization :)

~~~
_sdegutis
Never change, Hacker News.

------
mrskitch
If you're looking to roll your own HTML rendering
[https://browserless.io/](https://browserless.io/) is great for doing just
that and allows you to use puppeteer to do the HTML generation.

You could even go so far as to remove all unnecessary JS and CSS, but that'd
require a bit more elbow grease.

------
zeveb
It's mildly amusing that this page itself doesn't render without JavaScript
enabled.

JavaScript is killing the static Web, and IMHO that's a terrible thing. In
part, I imagine Google likes JavaScript pages because they're a barrier to
entry for any would-be search competitor.

~~~
pmontra
The content is hidden inside a noscript tag inside a div.loading with
visibility: hidden. However, setting visibility: visible doesn't help. The
quick workaround is editing the noscript section from the developer tools:
delete the noscript tag and keep its content.
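
That workaround can also be scripted. A minimal stdlib sketch, run against a
made-up page of the same shape, that pulls the content back out of the
noscript tag:

```python
from html.parser import HTMLParser

class NoscriptExtractor(HTMLParser):
    """Collect the text inside <noscript> tags -- i.e. the content the
    page hides behind its JavaScript-only loading screen."""
    def __init__(self):
        super().__init__()
        self.in_noscript = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            self.in_noscript = True

    def handle_endtag(self, tag):
        if tag == "noscript":
            self.in_noscript = False

    def handle_data(self, data):
        if self.in_noscript:
            self.chunks.append(data)

# Hypothetical page matching the structure described in the comment.
page = '<div class="loading"><noscript><p>The actual article.</p></noscript></div>'
parser = NoscriptExtractor()
parser.feed(page)
print("".join(parser.chunks))  # The actual article.
```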

By the way, if Google executes the JavaScript in the pages, does that mean
sites running Coinhive and the like could mine some coins off the crawler?
They probably check what they run and how long they run it, but people can be
very clever when the goal is making money.

~~~
zeveb
> The content is hidden inside a noscript tag inside a div.loading with
> visibility: hidden.

Why would someone _do_ that? I'm reminded of this exchange from the
Hitchhiker's Guide to the Galaxy:

> “But the plans were on display…”

> “On display? I eventually had to go down to the cellar to find them.”

> “That’s the display department.”

> “With a flashlight.”

> “Ah, well, the lights had probably gone.”

> “So had the stairs.”

> “But look, you found the notice, didn’t you?”

> “Yes,” said Arthur, “yes I did. It was on display in the bottom of a locked
> filing cabinet stuck in a disused lavatory with a sign on the door saying
> ‘Beware of the Leopard.’”

------
jwr
It's so great to read an article like this one and realize halfway through
that none of it applies to you. [Smug mode on] Having your site/app in
ClojureScript+React+Rum, with flawless server-side rendering, is quite nice.

------
feelin_googley
Google itself uses the fragment meta tag scheme in Google Groups (DejaNews) so
users can access certain newsgroups without JavaScript. Will this be
deprecated too?

~~~
gdulli
Haha I long for the days of DejaNews! Imagine StackOverflow but without the
bullshit. And Google Groups is obviously terrible.

I'd say that Google totally ruined Deja but if I remember correctly it had
already declined before the acquisition.

~~~
kevin_thibedeau
> without the bullshit

Save for the endless cross-posting, out-of-control trolls, and the FTDSOJ
thread.

~~~
exikyut
Google's giving me OCRed PDFs, Chinese text, JSON dumps and license plate
numbers when I google that acronym. Think I fell off the end of the index
there.

What's it mean?

