
Javascript apps can be fully crawlable - beernutz
http://prerender.io/
======
dwwoelfel
This is a great approach, but detecting the user-agent is the wrong way to
decide if you should pre-render the page. If you include the following meta
tag in the header:

    <meta name="fragment" content="!">

then Google will request the page with the "_escaped_fragment_" query param.
That's when you should serve the pre-rendered version of the page.
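A minimal sketch of that check, assuming Express (`getSnapshot` is a made-up helper for however you store the pre-rendered HTML):

    // Hypothetical Express middleware: only serve a snapshot when the
    // crawler asks for one via the _escaped_fragment_ query param.
    var express = require('express');
    var app = express();

    app.use(function (req, res, next) {
      var fragment = req.query._escaped_fragment_;
      if (fragment === undefined) return next(); // real user: serve the JS app
      // getSnapshot() is made up -- fetch however you cache your snapshots
      getSnapshot(req.path, fragment, function (err, html) {
        if (err) return next(err);
        res.send(html); // crawler: pre-rendered HTML
      });
    });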

Google has documentation on this here:
[https://developers.google.com/webmasters/ajax-crawling/docs/html-snapshot](https://developers.google.com/webmasters/ajax-crawling/docs/html-snapshot)
and we've been using this method at
[https://circleci.com](https://circleci.com) for the past year.

Waiting for Google to request the page with _escaped_fragment_ should also
prevent you from getting penalized for slow load times or for showing
Googlebot different content.

~~~
Isofarro
That Google Ajax crawler spec is no magic bullet.

Nick Denton: "Dip in uniques largely because of drop in Google refers.
Pageviews (which are driven more by core audience) less affected." \--
[http://twitter.com/nicknotned/status/61152134929981440](http://twitter.com/nicknotned/status/61152134929981440)

Nick Denton: "Google does not fully support "hashbang" URLs. So we're
eliminating them rather than waiting for Mountain View." \--
[http://twitter.com/nicknotned/status/61465859079671808](http://twitter.com/nicknotned/status/61465859079671808)

Nick Denton: "Yeah, I'd advise against hashbang urls. Will kill search traffic
-- even if you abide by Google protocol." \--
[http://twitter.com/nicknotned/status/62595141927583745](http://twitter.com/nicknotned/status/62595141927583745)

~~~
alanlewis
These tweets are from 2.5 years ago. Has Googlebot improved since then?
(Honest question)

~~~
Isofarro
Considering the intention behind the Google document is to enable support for
existing Ajax applications, not to serve as the cornerstone of crawlability
for newly built apps, probably not.

Also, the same document that's quoted in defence of these Web (unfriendly)
Apps is
[https://developers.google.com/webmasters/ajax-crawling/](https://developers.google.com/webmasters/ajax-crawling/)

In the first section of that document,
[https://developers.google.com/webmasters/ajax-crawling/docs/learn-more](https://developers.google.com/webmasters/ajax-crawling/docs/learn-more),
there is this:

" _If you 're starting from scratch, one good approach is to build your site's
structure and navigation using only HTML. Then, once you have the site's
pages, links, and content in place, you can spice up the appearance and
interface with AJAX. Googlebot will be happy looking at the HTML, while users
with modern browsers can enjoy your AJAX bonuses._"

------
timr
Don't do this.

Rendering different content based on user agent is tempting the webspam gods.
Rendering _nothing but a big gob of javascript_ to non-googlebot user agents
is a recipe to get the banhammer dropped on your head.

You're either gambling that Google is smart enough to know that _your
particular_ big gob of javascript isn't cloaking keyword spam (in which case
you should just depend on their JS evaluation, since you already are,
implicitly), or you're gambling that they won't bust you even though your site
looks like a classic keyword stuffer.

~~~
stephenheron
Google does have a section within their guidelines on creating "HTML
Snapshots". "If a lot of your content is created in JavaScript, you may want
to consider using a technology such as a headless browser to create an HTML
snapshot." [https://developers.google.com/webmasters/ajax-
crawling/docs/...](https://developers.google.com/webmasters/ajax-
crawling/docs/html-snapshot)
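For reference, a bare-bones PhantomJS snapshot script looks roughly like this
(the 2-second wait is an assumption; tune it to how long your app takes to
render):

    // Minimal PhantomJS sketch: load the page, let the JS run, print the DOM.
    var page = require('webpage').create();
    page.open('http://example.com/#!/some-page', function (status) {
      if (status !== 'success') phantom.exit(1);
      window.setTimeout(function () {
        console.log(page.content); // the rendered HTML snapshot
        phantom.exit();
      }, 2000); // assumed: 2s is enough for the app to finish rendering
    });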

~~~
timr
Did you read that page, or did you just skim it? They're telling you to use a
"headless browser" as one possible (clunky) way of responding to
_escaped_fragment_ requests, which is a workaround wherein you put a special
tag in your _original_ page to tell the googlebot to make _another_ request to
get a static version of the page.

Using _escaped_fragment_ is _not the same thing_ as rendering different
content based on user agent.

------
_lex
This will get you penalized for having a website that takes forever to load.
This is what happens:

Googlebot requests page -> your webapp detects googlebot -> you call remote
service and request that they crawl your website -> they request the page from
you -> you return the regular page, with js that modifies its look and feel
-> the remote service returns the final html and css to your webapp -> your
webapp returns the final html and css to Googlebot. That's gonna be just
murder on your load times.

If this must be done, for static pages it should be done by Grunt at build
time, not by a remote service. For dynamic content, it's best to do the
phantomjs rendering locally, on an hourly (or so) schedule, since it doesn't
really matter if googlebot has the latest version of your content.

Or perhaps I'm mistaken and the node module actually calls the service hourly
or so and caches results in the app, so it doesn't actually call the service
during googlebot crawls. If that's the case, I take back my objections, but
I'd recommend updating the website to say as much.
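Roughly what I mean, as a sketch (`renderWithPhantom` is a made-up wrapper
around a local phantomjs run):

    // Sketch: re-render snapshots locally on a schedule, keep them in
    // memory, and serve crawlers from the cache -- no remote call on the
    // hot path.
    var cache = {};
    var PAGES = ['/', '/about', '/products']; // whatever your routes are

    function refresh() {
      PAGES.forEach(function (path) {
        renderWithPhantom(path, function (err, html) { // hypothetical helper
          if (!err) cache[path] = html;
        });
      });
    }

    refresh();
    setInterval(refresh, 60 * 60 * 1000); // hourly, as suggested above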

~~~
10098
Pretty sure the load time problem can be mitigated by caching.

~~~
_lex
Best-case scenario, you still have network trips going out to the service, so
it's still not a great solution UNLESS the caching is done by your webapp -
which is what I spoke about at the end of my comment above.

Unless this works without adding network round trips on each request, it's not
a great idea.

~~~
reissbaker
I think the Unix philosophy of "do one thing and do it well" applies here.
There are already off-the-shelf caching solutions that do what you describe:
for example, with Varnish you can serve cached pages immediately and update
the cache contents in the background.

It would probably be better to use those than reimplement them in an
uber-webapp.
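Varnish gives you this out of the box; as a rough Node sketch of the same
serve-stale, refresh-in-the-background idea (`render` is a made-up snapshot
function):

    // Sketch: answer from cache immediately, refresh the entry afterwards.
    var cache = {}; // path -> html

    function serveSnapshot(req, res) {
      var html = cache[req.path];
      if (html) {
        res.send(html); // serve the possibly-stale cached page right away
        render(req.path, function (err, fresh) { // refresh in the background
          if (!err) cache[req.path] = fresh;
        });
      } else {
        render(req.path, function (err, fresh) { // first hit: render inline
          if (err) return res.status(500).end();
          cache[req.path] = fresh;
          res.send(fresh);
        });
      }
    }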

------
Isofarro
An entire project written to simulate progressive enhancement (badly). One
that only works for specified whitelisted User-Agents, instead of being based
on capability.

I'm also not understanding the use-case for this project. Every time the topic
of "Web Apps", "JavaScript Apps", "Single page web apps" comes up, evangelists
point out that they are applications (or skyscrapers), not just fancy
decorators for website content.

So exactly what is this project delivering as fallback content? A server-
generated website?

This project just seems pointlessly backwards. It simulates a feature that the
JavaScript framework has already deliberately broken, and it introduces a
server-side dependency into a project that deliberately chose not to have a
server-side framework.

This just looks like a waste of effort, when building the JavaScript
application properly the first time, with progressive enhancement, covers this
exact use-case, and far, far more use-cases.

The time would have been better spent fixing these evidently broken JavaScript
frameworks - Angular, Ember, Backbone - or at least fixing the tutorial
documentation to explain how to build Web things properly. (This stuff isn't
difficult, it just requires discipline.)

I call hokum on people saying there's a difference between Websites and Web
apps (or the plethora of terms used to obfuscate that: Single-page apps,
JavaScript apps). This project proves that these are just Websites, built
improperly, and this is the fudge that tries to repair that for Googlebot.

~~~
philbo
+100

Why some developers are so against progressive enhancement mystifies me. It is
an elegant solution that actually works in all cases rather than an ugly hack
that should probably work in the majority of cases. How can there even be a
dispute about it? It's insane!

~~~
nailer
> Why some developers are so against progressive enhancement mystifies me.

There's another common adage: HTML is content. CSS is presentation. JS is
behaviour.

Some public web apps simply don't work without behaviour.

~~~
Isofarro
> Some public web apps simply don't work without behaviour.

Every app that uses a solution like this to generate static views of a website
is an app that simply works without behaviour.

------
wldlyinaccurate
If you are able to "pre-render" a JavaScript app like this, then you should be
serving users the pre-rendered version and then enhancing it with JavaScript
after onload.

JavaScript-only apps are a blight on the web. All it takes is a bad SSL cert,
or your CDN going down, and your pages become useless to the end-user.

~~~
dchest
_All it takes is a bad SSL cert, or your CDN going down, and your pages become
useless to the end-user._

How are non-JavaScript pages protected from this?

~~~
wldlyinaccurate
Apologies for being vague. Regarding the SSL certificate, I was referring to
modern browsers refusing to load "unsafe" assets.

When the JS can't load, JS-heavy apps tend to either be raw templates (i.e.
full of {{ statements }}) or completely blank (if the templates were going to
be loaded in a separate request). As Isofarro said, non-JS pages don't suffer
from this because the content is there in plain HTML.

------
ewillbefull
Wouldn't the pre-rendering based on user agent be penalized, because Google
doesn't like being shown different pages than non-Googlebot user agents?

~~~
michaelbuckbee
Google doesn't like it when they are shown different content than a browsing
user. This is roughly the equivalent of pointing Googlebot to a copy of the
requested page that happens to be in Memcached, instead of spinning up the
full app stack to do the render.

~~~
dsl
> Google doesn't like it when they are shown different content than a browsing
> user.

This is exactly correct. Regardless of your motivations.

~~~
benaiah
Not a technically different page, specifically different _content_. Serving
different pages to Google is fine, as long as they contain the same primary
content that the real pages do. That's the whole point - so you can serve
prerendered pages to Google but still have a JS-based frontend for the actual
users.

~~~
dsl
AJAX sites often lazy-load content later. My point is that the page delivered
initially is not the same as the static version, content-wise or technically.

------
eonil
Static rendering of dynamic content? I don't think this makes sense.

If it's pre-rendered, it's missing something. If it has all the data at first,
then it's not dynamic.

A pre-rendered (static) javascript app (dynamic)...? Hmm... I don't see
anything more than something like JWT in JS instead of Java?

~~~
FedRegister
>Static rendering of dynamic content? I don't think this makes sense.

Bro do you even Web 1.0? That's what CGI scripts in Perl did! Pull the data
from the database, generate HTML (no JavaScript back then!) on the fly, and
send to the browser.

~~~
eonil
JS is definitely a client-side dynamic technology, at least since the AJAX
era.

Well... I don't understand how you and many people (including the author) can
read _JS_ as server-side dynamic in this HTML5 era...!!!

------
anonymous
I was under the impression that Googlebot already executes javascript on
pages.

A more interesting idea would be to do this for every user - prerender the
page and send them the result, so they don't have to do the first, heavy JS
execution themselves. I know it sounds a bit backwards at first - you're
basically using javascript as a server-side page renderer - but think about
this: you can choose whether to prerender based on the user agent string -- do
it for people on mobile phones, but not for desktop users. You can write your
entire site with just client-side page generation in javascript and let it run
client-side at first, then switch to server-side prerendering once you have
better hardware.

~~~
benaiah
Something similar to that, albeit slightly more elegant, is the work Airbnb
has done with their rendr [0] project, which serves prerendered content
that's then rerendered with JS _if it needs to be changed_. You can do similar
things with non-Backbone stacks, of course.

[0]: [https://github.com/airbnb/rendr](https://github.com/airbnb/rendr)

------
pzxc
A better way is to do a hybrid single/multipage app as described here:

[https://news.ycombinator.com/item?id=6507135](https://news.ycombinator.com/item?id=6507135)

It's a multipage app that uses ajax to function as a singlepage app. From the
user's point of view it's a singlepage app, but it's accessible from any of
the URLs that it pushStates to, so it's the best of both worlds. It's fully
crawlable because it functions as a multipage app, but it has the speed of a
singlepage app (if your browser supports pushState).
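A stripped-down sketch of the client side of that pattern (the `#content`
element and the `X-Partial` header convention are assumptions; browsers
without pushState just fall through to normal navigation):

    // Sketch: intercept link clicks, fetch the next page over ajax, swap
    // the content area, and pushState the real URL -- the same URLs work
    // with and without JS.
    document.addEventListener('click', function (e) {
      var link = e.target.closest('a');
      if (!link || !window.history.pushState) return; // normal navigation
      e.preventDefault();
      var xhr = new XMLHttpRequest();
      xhr.open('GET', link.href);
      xhr.setRequestHeader('X-Partial', '1'); // assumed server convention
      xhr.onload = function () {
        document.getElementById('content').innerHTML = xhr.responseText;
        history.pushState(null, '', link.href);
      };
      xhr.send();
    });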

------
bfirsh
This is a similar thing, but is far faster because it uses Zombie instead of
Phantom: [https://github.com/bfirsh/otter](https://github.com/bfirsh/otter)

------
tjmehta
I tried using phantomjs in the past to server-side render a complex backbone
application for SEO, and it was taking over 15 seconds to return a response
(which is bad for SEO).

Looking at prerender's source, I didn't see any caching mechanism.

What kind of load times have you seen rendering your apps?

Have there been recent significant improvements in phantomjs's performance?

~~~
chaddeshon
I run [http://www.brombone.com](http://www.brombone.com). We provide
prerendered snapshots as a service.

You can get it faster than 15 seconds, but you can't really get it fast
enough. We precache everything. I would strongly recommend against trying to
process the pages in realtime.

------
ivanhoe
Still, the main problem is not solved: you risk getting penalized for serving
different content to the googlebot.

------
beernutz
I have been looking for something like this for a long time. Seems very
straightforward.

I have not tested it yet, but I wonder if the speed of render will penalize
you in the google results. Seems like a separate machine with a good CPU might
be worthwhile if you are going to run this.

------
gkoberger
I can see a lot of issues with this (slow, displaying different content to
Google can get you penalized, etc)... but this is a really clever hack.

Google is less important (they already execute JS), but it's good for sites
like Facebook (which doesn't execute JS when you share a link).

~~~
mk3
They execute Javascript in a limited fashion, so you should consider using
what is suggested by Google itself:
[https://developers.google.com/webmasters/ajax-crawling/docs/specification](https://developers.google.com/webmasters/ajax-crawling/docs/specification).
If you are using Angular, you will get your template displayed instead of the
fully rendered page, with all the {{sitename}} placeholders showing.

------
gildas
Shameless plug: [http://seo4ajax.com](http://seo4ajax.com)

It's a SaaS which is much more elaborate than this project (there is a year
of development in it). We serve and crawl thousands of pages every day
without any issues.

------
se_
If you're using Rails, have a look at
[https://github.com/seojs/seojs-ruby](https://github.com/seojs/seojs-ruby),
a gem similar to prerender, but it uses our managed service at
[http://getseojs.com/](http://getseojs.com/) to get the snapshots. There are
also ready-to-use integrations for Apache and Nginx.

Some benefits of SEO.js over other approaches:

\- it's effortless, you don't need to set up and operate your own phantomjs
server

\- snapshots are created and cached in advance so the search engine crawler
won't be put off by slow page loads

\- snapshots are updated regularly

------
chadscira
I recently needed to do this for Google, but I wanted the rendering and
delivery of the page to be under 500ms, so I hacked up something that works
with expressjs:

[https://github.com/icodeforlove/node-express-renderer](https://github.com/icodeforlove/node-express-renderer)

It uses phantomjs but removes all the styles initially, so the rendering time
is much faster. (My ember app was averaging 70ms to render, but I prefetch
the page data.)

~~~
paulocal
Came across this recently and it's super easy to implement.

------
RoboTeddy
This looks similar to Meteor's "spiderable" package

[http://docs.meteor.com/#spiderable](http://docs.meteor.com/#spiderable)

~~~
imslavko
Looks like that's exactly what Meteor's spiderable package has done since
08/2012 [0]: look at the user-agent, and once a Google/Facebook crawler is
detected, run phantomjs for up to 10s and return the rendered page.

[0]: [http://www.meteor.com/blog/2012/08/08/search-engine-optimization](http://www.meteor.com/blog/2012/08/08/search-engine-optimization)

------
commanderj
Making JS-heavy sites crawlable is also possible with libraries like
[https://github.com/minddust/jquery-pjaxr](https://github.com/minddust/jquery-pjaxr)
and
[https://github.com/defunkt/jquery-pjax](https://github.com/defunkt/jquery-pjax).
Plus, pushState has the advantage of "real" urls.

~~~
dchest
How?

~~~
uptown
With each user interaction that updates a page fragment, it modifies the
address in the browser's address bar to correspond to the current state. If
somebody were to copy and paste that URL into a new tab, your site would load
the complete interface, provided you've structured your back-end code
correctly.

You do this by building logic into the part of the code that outputs your
view to check whether the request is coming in as a PJAX request or not. If it
is, you output the page fragment, which is then added to your existing DOM. If
it's not a PJAX request, your back-end outputs the entire page.

There's a limitation to PJAX where you can only update one fragment at a time,
though PJAXR seems to address that limitation by providing support for
updating multiple-fragments simultaneously. Either way, you get the huge
advantage of having a fully-crawlable site without needing to integrate pre-
rendering work-arounds for search-engine compatibility.
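The server-side check itself is tiny. An Express-flavoured sketch (jquery-pjax
sends an X-PJAX request header; the template names here are made up):

    // Sketch: same URL, two renderings -- a fragment for PJAX requests,
    // the full page for everything else (including crawlers).
    app.get('/articles/:id', function (req, res) {
      if (req.get('X-PJAX')) {
        res.render('article-fragment', { id: req.params.id }); // fragment only
      } else {
        res.render('article-full', { id: req.params.id }); // whole document
      }
    });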

------
gorm
Very cool, but something I don't get:

\- Go to prerender.io and press "Install It -> Ruby on Rails". It loads the
Ruby on Rails example.

\- Then go all the way down and switch to "Prerendered content". Pressing
"Install It -> Ruby on Rails" doesn't do anything now.

Shouldn't it render the same content? "Add the middleware gem to your
Gemfile..." and so on.

~~~
thoop
prerender.io uses JS (Bootstrap) for the tab switching, so the prerendered
page doesn't do anything, because it doesn't load that javascript.

------
fuddle
PhantomJS can be a pain to set up. I think the approach taken by Discourse.org
is the best option:
[http://eviltrout.com/2013/06/19/adding-support-for-search-engines-to-your-javascript-applications.html](http://eviltrout.com/2013/06/19/adding-support-for-search-engines-to-your-javascript-applications.html)

~~~
Isofarro
Progressive enhancement is still better than this. The noscript element is a
fallback when JavaScript is not available or turned off.

It doesn't handle situations where JavaScript is enabled, but your application
failed to get the JavaScript completely to the browser.

With modern JavaScript and feature detection, the use of noscript elements is
a code smell.

------
t0
Why hasn't Google implemented this yet? Their current solution isn't good
enough ([https://developers.google.com/webmasters/ajax-crawling/](https://developers.google.com/webmasters/ajax-crawling/)).

~~~
est
Web apps today are so much more than ajax. You actually have to build a
full-blown DOM tree to get what real user-agents render.

~~~
dsl
If only Google had access to a full blown browser they could use in the crawl
engine...

~~~
rurounijones
* At scale, without massive performance drops

~~~
Volpe
I'm confused, search indexing isn't a realtime exercise... Why would
performance be an issue? Running a headless browser vs running "whatever it is
they run that can execute JS" doesn't seem like a huge leap...

~~~
est
Have you ever experienced web apps that lag like crap? Yeah, think about that
times 10,000 million web pages.

~~~
Volpe
... right but a bot doesn't get impatient. So I don't see your point.

~~~
cygx
They should just shut down all their data centers and crawl the whole web from
a single box located in someone's basement.

After all, the bot doesn't get impatient.

~~~
Volpe
... Comments have really gone to shit here, haven't they.

Somehow we all end up antagonistic over bullshit like whether Google has a
big enough computer.

But alas, you're right, Google could never crawl with an actual browser - what
a ridiculous suggestion. I apologise for such a dim-witted comment.

As an aside: for my part in contributing such bad-quality comments, I
apologise.

~~~
cygx
The point is that Google probably doesn't have a lot of cycles to spare -
anything else wouldn't be good business sense.

Anything that significantly adds to the load will lose them money - whether or
not the operation needs to be realtime is secondary to that.

I apologise for giving offense: I wrote the comment the same way I would have
made it face-to-face, which is always a bit risky in a purely textual medium.

~~~
dsl
I don't know if you are trying to be serious at this point or not. Google has
millions (literally) of machines with dozens of cores each. Search is their
business that makes all the money.

Google executes JavaScript and renders the full DOM for every page internally.
They generate full length screenshots of every page and have pointers to where
text appears on the page so they can do highlighting of phrases within the
screenshot.

It isn't even a debatable question whether Google reuses the Chrome engine to
do this.

------
selvakn
Shameless plug:
[https://github.com/selvakn/rack_phantom](https://github.com/selvakn/rack_phantom)

Similar idea, but implemented without a separate rendering server - just a
phantomjs process - and only for rails/rack apps.

------
radq
I believe Bustle.com does something similar to this. There was a talk about it
at the Ember NYC August meetup.

[http://www.youtube.com/watch?v=8MYcjaar7Vw](http://www.youtube.com/watch?v=8MYcjaar7Vw)

------
steeve
I've made a plugin to automate this for AngularJS:
[https://github.com/steeve/angular-seo](https://github.com/steeve/angular-seo)

Works with PhantomJS (of course).

------
ateevchopra
This is a really great idea! I mean, now data in apps made with js can be
searched. My question is: can we add "Search with Google" to our javascript
app then?

------
acqq
I surf with JavaScript turned off and I see just a blank page. If it's
"crawlable" I certainly expect it to be visible to me without turning
JavaScript on.

~~~
welly
> I surf with JavaScript turned off

Why would you do this? Genuinely interested. Do you browse the web with
JavaScript turned off the majority of the time or just in this particular
example?

~~~
acqq
I keep JavaScript turned off by default. Then I turn it on for only a few
sites of critical importance for me which would not function otherwise. And I
don't feel I miss anything, most of the content I care about is still HTML and
it should remain so. JavaScript is not needed to show me the text.

That way the chances of cross-site scripting attacks are greatly reduced, and
the content appears much faster.

------
tomekmarchi
Pushstate 4 the win. I'm done with hash fragments and not going back to that
mess. Pushstate is quick and easy to implement; I don't see a reason to
overcomplicate.

------
franze
hi, my 2 cents

> Javascript apps can be fully crawlable

Yes, and I think it's cool that you try to provide a solution as a service for
this.

but as with every technology, there are some tradeoffs

a) serving Google a different response based on the user-agent is the
definition of cloaking (it's not misleading or malicious cloaking, but it's
cloaking nonetheless)

b) you hardcode a dependency on a third-party server - one you have no control
over - into your app (and from the sample code on the page, there is no
fallback available if this server is down)

c) there are latency/web-performance issues, i.e. for a first-time request by
a search engine the roundtrip would look like this:

[googlebot GET for page -> googlebot detected -> app GET to prerender.io ->
prerender.io GET to page -> app delivers page -> prerender.io returns page to
app -> app returns page to googlebot]

this will always be slower than

[googlebot GET for page -> app returns page to googlebot]

so basically the prerender.io approach creates some issues. that said, we
don't have - yet - another "no trade-off" solution

the "make ajax crawlable" approach basically allows - non malicious, non
misleading - cloaking [https://developers.google.com/webmasters/ajax-
crawling/docs/...](https://developers.google.com/webmasters/ajax-
crawling/docs/specification)

(sorry Google, but ?_escaped_fragment_= was really one of your most stupid
specs ever, even worse than "nofollow")

so if you target "?_escaped_fragment_=" in the GET request, and not the
user-agent, cloaking a.k.a. sending different responses is ok

but: it creates a double googlebot-crawl issue, i.e.:

[googlebot GETs [http://www.example.com/test](http://www.example.com/test) ->
googlebot parses the HTML and finds <meta name="fragment" content="!"> ->
googlebot pushes
[http://www.example.com/test?_escaped_fragment_=](http://www.example.com/test?_escaped_fragment_=)
into its "stuff to crawl" queue (a.k.a. discovery queue) -> googlebot crawls
[http://www.example.com/test?_escaped_fragment_=](http://www.example.com/test?_escaped_fragment_=)
-> your server gets the request (or, if you use a prerender.io service, the
whole roundtrip to the prerender.io site starts)]

this is a no-go if you have a big site with hundreds of thousands to millions
of pages.
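for reference, the URL rewriting the spec defines boils down to this
(ignoring pre-existing query strings for brevity):

    // spec mapping: http://www.example.com/test#!/page
    //          ->   http://www.example.com/test?_escaped_fragment_=%2Fpage
    // and for pages opted in via <meta name="fragment" content="!">:
    //               http://www.example.com/test
    //          ->   http://www.example.com/test?_escaped_fragment_=
    function escapedFragmentUrl(url) {
      var i = url.indexOf('#!');
      if (i === -1) return url + '?_escaped_fragment_=';
      return url.slice(0, i) + '?_escaped_fragment_=' +
        encodeURIComponent(url.slice(i + 2));
    }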

and there is another much, much bigger issue:

showing JS clients and "other only-partially-JS clients" (Google parses some
JS) different responses just does not work in the long run.

why? if there is no direct feedback, then there is no direct feedback!

non-responsive mobile sites currently offer an overall poor user experience.
why? because all the guys working on the site sit in front of their fat office
desktops. no feedback equals crap in the long run.

and it's worse for "robots only" views, because people just don't have to
live with the crap their server spits out, as they always see the fancy JS
versions. since the hashbang ajax-crawlable spec came out, I've consulted some
clients on this question; everyone who chose the _escaped_fragment_ road
regretted it later on. even if the first iteration works, 1000 rollouts later
it doesn't - if there is no direct feedback, then there is no direct feedback.

conclusion: if you have a big site and want to do big-scale (lots of pages)
SEO, you are stuck with landing pages and delivering HTML + content via the
server + progressive enhancement for functionality, until the day Google gets
its act together.

and for first-view web performance I recommend the progressive enhancement
approach anyway, too.

