

Google proposes standard for making Ajax crawlable - coderdude
http://googlewebmastercentral.blogspot.com/2009/10/proposal-for-making-ajax-crawlable.html

======
seldo
Indexing AJAX websites is important, but I don't think much of this proposal.

As far as I can tell, their plan is to create a standard way for AJAX hash
links to be translated into real links, which Google will then crawl,
expecting that you are running a headless web browser on your server(!) that
will render the full page.

Expecting everyone to make a big modification like adding a Java-based
headless browser to their infrastructure is crazy.

A much more sensible solution is the one webmasters should be doing anyway:
progressive enhancement. Any AJAX link, when clicked with Javascript disabled,
should render the page as it was intended, just more slowly and without
animations etc. Then your site is completely indexable, and as a bonus it's
more robust and accessible too.

With a modern MVC framework this is really not that hard to achieve: your AJAX
controller should just be spitting out a chunk of HTML, and whether it gets
integrated into your template at the server-side or on the client by
Javascript should be irrelevant.
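
To make that concrete, here is a minimal sketch of one controller serving
both cases. It assumes a Node-style server and the X-Requested-With header
that libraries like jQuery add to AJAX requests; every name here is invented:

    
    
        // One controller, two renderings: a bare HTML chunk for AJAX
        // requests, the full template for everyone else (crawlers too).
        var http = require('http');
        http.createServer(function (req, res) {
          // jQuery sets X-Requested-With on its XHR calls.
          var isAjax = req.headers['x-requested-with'] === 'XMLHttpRequest';
          var chunk = '<div id="content">' + req.url + '</div>';
          res.writeHead(200, { 'Content-Type': 'text/html' });
          // Same chunk either way; only the wrapping differs.
          res.end(isAjax ? chunk : '<html><body>' + chunk + '</body></html>');
        }).listen(8000);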

~~~
unlinkedlist
To be fair, the "standard" part of this proposal doesn't really have anything
to do with headless browsers. You could just as easily have a typical server-
side app that generates these URLs as appropriate.

For example, a common approach to presenting content in an AJAXy kind of way
is to just put a path after the hash. E.g.

    
    
        http://www.example.com#/products/bungee-cords
    

It's probably really easy for the web app to figure out the right content to
serve up for #/products/bungee-cords, headless browser or not.
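
And the client-side half of that pattern can be a few lines; a rough sketch
assuming jQuery (the #content id is invented):

    
    
        // Read the path after the hash and fetch the matching content,
        // e.g. "#/products/bungee-cords" -> GET /products/bungee-cords
        function routeFromHash() {
          var path = window.location.hash.replace(/^#/, '');
          if (path) $('#content').load(path);
        }
        $(window).bind('hashchange', routeFromHash); // where supported
        routeFromHash(); // handle a pasted-in URL on first load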

I know you'll be saying, "Of course, but apps should be doing that anyway with
progressive enhancement." That's all well and good, but that leads to two
separate sets of URLs: one that most users see, and one that search engines
see. This fixes that.

~~~
seldo
I don't understand what you're saying. Firstly, web apps can never see
anything after a # in a URL: the browser does not send it to them. Javascript
can see it, but I don't think that's what you meant.

With progressive enhancement, your URLs should look like
<http://www.example.com/products/bungee-cords>. If clicked they should render
the page as appropriate. JavaScript, if enabled, will attach a handler to the
links and when it sees you trying to click on /products/bungee-cords will halt
that click and instead make an AJAX request for the equivalent content. There
should be no need to have # links whatsoever.
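
A sketch of that handler, assuming jQuery and an invented #content container:

    
    
        // Links are real URLs; JavaScript merely intercepts the click.
        $('a').click(function (e) {
          e.preventDefault(); // halt the normal navigation
          // Fetch the same URL and swap in just the content region.
          $('#content').load(this.href + ' #content');
        });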

~~~
boucher
That ignores the fact that people want the URL field to reflect the current
state of things, so that URLs can be copied out and pasted elsewhere.

It also ignores the fact that there is no way to set the URL field (through
JavaScript or any other mechanism) without actually triggering a page change;
the hash fragment is the only part of the URL a script can change in place.
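
That fragment loophole is exactly what the #-style URLs in this proposal
exploit; one line for illustration (path invented):

    
    
        // Setting the hash updates the URL bar without a page load,
        // so the current state becomes copy-pasteable.
        window.location.hash = '/products/bungee-cords';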

------
simonw
That's horrible. Why is everyone so desperate to make everything depend on
JavaScript? Especially Google, with GWT and Google Moderator (a great example
of a site that could work just fine without any JavaScript at all). I'm all
for progressive enhancement and making the user's experience better if they
have JavaScript enabled, but hiding everything behind a single GWT script tag
and then using grotesque hacks to make the content crawlable is just crazy.

~~~
boucher
There are large classes of web applications which can not work without
javascript. Period.

Nobody is saying _everything_ needs to be that way. But some things do. This
is an attempt to make those things work better with search engines.

~~~
qeorge
Selling web applications that don't work without Javascript is legally
dangerous (discriminates against disabled users). There's a lot more to
accessibility than making sure your site works without Javascript, but if
you're missing this piece it's somewhat of a moot point.

There's never been a case that's gone to verdict on the matter, so it's still
somewhat of a gray area. But the reason these cases don't go to verdict is
that the offending company has always agreed to a large cash settlement.
Target's $6MM settlement with the National Federation of the Blind is the
most recent case.[1]

There are plenty of large websites that drop the ball here (e.g., Reddit),
but that doesn't mean it's safe to do so. Ignore accessibility at your own
risk.

[1] <http://www.webstandards.org/2008/08/28/what-the-target-settlement-should-mean-to-you/>

~~~
boucher
If you honestly believe that people shouldn't make applications that don't
work without JavaScript, then you are without question saying that there are
applications which _should not be built_ on the web. I respect that opinion,
but I don't agree with it.

------
jimmybot
Why is Google telling people to run a headless browser just for their crawler?
Sounds way too complicated for any small-scale website. Why doesn't Google run
the headless browser themselves? It sounds like they are offloading parsing
costs onto websites, but that creates a cheating problem they then have to
deal with: a site could serve the crawler something different from what users
see.

Somewhat related: It's not AJAX, but I remember hearing Microsoft Research a
couple of years ago talk about parsing CSS. They wanted the page as an image
so that they could analyze pictorially what was a menu, what was a header,
what was main content, etc. It seemed fairly neat for the time, although I
think there are much simpler heuristics that will work for most blogs/CMSs
when you are just after the main article content of a page.

(No, they didn't use IE. I asked. Yes, they tried. They said it was slow and
crashed too much.)

------
endergen
That's great. Exactly what I'd been thinking about lately.

Avi Bryant: I was just thinking of pinging you about your approach to
generating html in Clamato. Are you going to be trying to unify client-
side/server-side html generation to solve exactly this problem?

Some thoughts: Google suggests using a headless browser to generate the
static version of the dynamic content. This may be the only option for
existing code bases. But with server-side JavaScript, and by extension
Clamato, there must be a more elegant solution that generates clean static
HTML as well as a dynamic UI client-side. This seems like something you would
have thought a lot about while building the Seaside framework
(<http://seaside.st>) and now Clamato (<http://clamato.net/>).

~~~
avibryant
Server-side javascript is only relevant here if it has access to a working
DOM, since that's what any HTML generation or templating is going to be based
on. At that point you basically do have a headless browser, whether you call
it that or not.

I find Google's proposal goofy; I think I agree with the poster who suggested
feeding google static HTML (probably via a sitemap) which references canonical
URLs that might be ajaxy, rather than having it crawl an ajax site and do this
magic token manipulation to get the static HTML equivalent.

------
keltex
This is really an important idea. For example, part of a website I'm working
on displays "customer testimonials".

The right way to do this for user experience is to allow the user to cycle
through the testimonials using AJAX. This allows for some nice-looking UI and
is faster because less data is exchanged with the server.

The right way to do this for SEO is to have each testimonial on its own page
with its own custom title, meta description, and SEO-friendly URL.

Which way we go is still undecided.

~~~
csytan
Have you considered parsing the page using javascript, or selectively
returning a partial on an ajax request? I use the first method quite a bit.
It's a bit slower to send the whole page, but it does not require any
server-side changes.

Once you get the html, libraries like jQuery make the parsing and replacing
very easy.

    
    
        // `html` is the whole page returned by the AJAX request
        $(html).find('#testimonial').replaceAll('#testimonial');

~~~
unlinkedlist
Simpler still:

    
    
        $('#testimonial').load('/testimonial-3 #testimonial p');

------
tjpick
I think they've found a very complicated solution. Wouldn't it be much much
easier to:

1. give the content behind the ajax call a static url, which can be accessed
by non-js clients too

2. add something like <link rel="alternate" media="ajax" type="text/html"
href="http://example.com/doc.html#state" /> to the head of that static
version, which the search engines can then use as a pointer back to the js
version

------
ramanujan
Perhaps this is infeasible for some reason, but why can't they just use a
(higher performance) version of Selenium or something similar?

1. Use machine learning to identify page elements likely to produce AJAX
responses. Not too hard to do this, especially if you actually render the page
in (say) Chrome and use the 2D layout in addition to the 1D HTML/CSS as part
of your feature set.

2. Use your (souped up, ultra fast) Selenium replacement to play with all
those AJAX features.

<http://www.reasonablyopinionated.com/2009/01/how-to-test-ajax-apps-with-selenium.html>

...on the third hand, perhaps they're thinking that anyone technically savvy
enough to set up an AJAX site (or set up sitemaps or the like) can run this
headless browser.

------
coliveira
I think they are trying to make people do work that Google itself should be
doing. The headless browser is something that _Google_ needs to apply to
their crawling process.

The work of webmasters should be making the URLs available -- Google has to
come up with ways to interpret the content. This is similar to asking web
servers to provide an HTML version of PDF documents that they are serving.

------
rimantas
Hm. If it is content, then it should work in such a way that it can be
accessed no matter what tries to get it: a webcrawler, a JavaScript-incapable
browser, or a user with a "normal" browser but JS switched off. If the user
agent is ajax-capable, then use ajax. Google for "hijax".

If it is a webapp, then making it crawlable does not make sense anyway.

------
qeorge
Fundamental problem: the URL fragment after the hash isn't sent to the web
server as part of the request. Am I wrong?

~~~
coderdude
Google will scan your documents for URLs like
"/something.php#!some_ajax_action", and will rewrite those URLs so the
fragment becomes something the server can actually see (a method they
describe in the article, if you read it).
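
If I'm reading the proposal right, the fragment comes back to the server as
an _escaped_fragment_ query parameter, so you can handle it without a
headless browser; a minimal sketch on a Node-style server (all markup
invented):

    
    
        // user-facing URL:   /something.php#!some_ajax_action
        // crawler requests:  /something.php?_escaped_fragment_=some_ajax_action
        var http = require('http');
        var url = require('url');
        http.createServer(function (req, res) {
          var state = url.parse(req.url, true).query._escaped_fragment_;
          res.writeHead(200, { 'Content-Type': 'text/html' });
          if (state !== undefined) {
            // Crawler: serve a static HTML snapshot of that AJAX state.
            res.end('<html><body>Snapshot of ' + state + '</body></html>');
          } else {
            // Normal browser: serve the usual AJAX shell.
            res.end('<html><body><script src="/app.js"></script></body></html>');
          }
        }).listen(8000);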

~~~
qeorge
Thanks, I missed that on the first time through. Quite an awkward
implementation, IMHO.

------
monos
Maybe I'm thinking of the wrong kind of content they want to crawl, but isn't
it easier to provide special crawlable, plain HTML pages (possibly hidden
from users, or provided as an "accessible" version)?

