

Making AJAX Applications Crawlable - romland
http://code.google.com/intl/sv-SE/web/ajaxcrawling/docs/getting-started.html

======
patio11
This kind of rubs me the wrong way, in that it nakedly turns Google's
engineering problems into the Internet's engineering problems. There isn't
even a scintilla of the usual "You're just making it better for your end
users and our crawler just happens to benefit" fig leaf, and compliance will
be about as optional as complying with HTTP, because Google _is_ navigation
on the Internet.

~~~
noibl
"making Google's engineering problems into the Internets' engineering
problems" -- indeed, and more specifically, saying that because Google
can't/won't embed a JS engine in their crawler, publishers should embed one in
their webserver. For which they suggest you use Java or, alternatively, Java.
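
(The snapshot step they have in mind looks roughly like the sketch below,
HtmlUnit being the suggested tool. This is only a sketch; exact API calls
vary by HtmlUnit version, and SnapshotMaker is just an illustrative name.)

    import com.gargoylesoftware.htmlunit.WebClient;
    import com.gargoylesoftware.htmlunit.html.HtmlPage;

    // Render an AJAX page server-side and return the post-JavaScript DOM,
    // which is what gets handed back to the crawler as the "HTML snapshot".
    public class SnapshotMaker {
        public static String snapshot(String url) throws Exception {
            WebClient client = new WebClient();
            client.setJavaScriptEnabled(true);        // run the page's JS in-process
            HtmlPage page = client.getPage(url);      // load the AJAX page
            client.waitForBackgroundJavaScript(3000); // let async requests settle
            String html = page.asXml();               // serialize the rendered DOM
            client.closeAllWindows();
            return html;
        }
    }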

~~~
retube
Is that right? I have been wondering to what extent the Google crawler
renders JS. Given that it's Google, I was imagining that they probably do
pretty much full DOM rendering of pages, since so many pages are now rendered
dynamically. Of course, it's difficult to simulate human interaction with the
page, so I guess this is their solution to that: the web designer leaving a
signpost.

------
prodigal_erik
I don't get it. They're requiring authors to provide HTML content that lives
behind a URL. How is this different from just requiring that the application
gracefully degrade to a plain HTML mode that's usable without JavaScript, and
crawling that?

~~~
gojomo
It's similar in effort required, but this more clearly tells Google to
display a site's #!fragment-ed pages as the target URLs (and send searchers
directly to them), rather than their degraded/simple pages.

(Of course, the site could accept visits to degraded pages, detect AJAX-
capable browsers, then redirect the users to the preferred AJAXy URLs... but
perhaps they don't even want the non-AJAX URLs to appear in normal use.)
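
(A minimal sketch of that redirect idea, as a servlet serving a degraded
/features page; the path, markup, and class name are illustrative, not any
real site's code. Script-capable browsers bounce to the #! URL, while
crawlers and script-less browsers just read the plain HTML.)

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class DegradedFeaturesServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            resp.setContentType("text/html");
            resp.getWriter().write(
                "<html><head>"
                // Browsers running JS hop to the AJAXy URL; everyone else
                // stays on this plain page.
              + "<script>window.location.replace('/#!/features/');</script>"
              + "</head><body><h1>Features</h1>"
              + "<p>Plain, crawlable HTML content here.</p></body></html>");
        }
    }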

~~~
stan_rogers
It shouldn't require a redirect to "grow" the context around the fragment,
though -- which is kind of the point of progressive enhancement.

~~~
gojomo
You can see a problem that the Google convention handles better than 'degrade
to simple pages with distinct URLs' with the 'Noloh' site touted elsewhere in
this thread.

Consider one of their AJAXy-#fragment pages:

<http://www.noloh.com/#/features/>

Search for a phrase on that page: ["NOLOH generates only the absolutely
necessary concise"]

Google finds and sends you to their 'simplified' version of the same page:

<http://www.noloh.com/?features/>

...which upon visit "grows" itself with a fragment to...

<http://www.noloh.com/?features/#/features/>

Ugh! Does the site really want people on that URL, possibly bookmarking and
sharing it? Probably not; they could be using a redirect on first Google-
visit.

Try clicking to another page from the double-feature page, like FAQs. You wind
up at:

<http://www.noloh.com/?features/#/features/&faqs%2F=>

Ugh, it just keeps getting worse.

Using the Google #!fragment convention, the initial URL appearing in the index
could be the simple/direct:

<http://www.noloh.com/#/features/>

Some sites will want that. One canonical #fragment-filled URL collects all the
inbound traffic/linkjuice.
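
(Under that convention, Googlebot rewrites the #! URL into a query-string
request, e.g. /#!/features/ becomes /?_escaped_fragment_=/features/, and the
server answers that request with plain HTML while normal visitors keep the
#! URL. A minimal sketch of the server side; the two render helpers are
placeholders, not any real site's code.)

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class EscapedFragmentServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            String fragment = req.getParameter("_escaped_fragment_");
            resp.setContentType("text/html");
            if (fragment != null) {
                // Crawler request: hand back a static snapshot of the state
                // named by the fragment (e.g. "/features/").
                resp.getWriter().write(renderSnapshot(fragment));
            } else {
                // Normal visitor: serve the AJAX shell as usual.
                resp.getWriter().write(renderAjaxShell());
            }
        }

        // Placeholders for however the site produces its HTML.
        private String renderSnapshot(String fragment) { return "<html>(static)</html>"; }
        private String renderAjaxShell() { return "<html>(ajax)</html>"; }
    }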

~~~
asnyder
Fair enough, but you have to remember that we needed this to work on all
servers and browsers since 2007; this Google implementation was just
released. Certainly we could've done a URL rewrite for our specific server,
but we try our best to showcase how NOLOH operates without the need for any
tweaks, as many of our users are on shared servers without any access to
rewrites.

Oddly, you make no mention of the fact that we effectively solved this issue
automatically for our users, and that they've been able to have their full
websites searchable by Google. You were able to do a search for our content,
and guess what: we didn't need to do ANYTHING from a site-development
standpoint for that to work. Sure, without a rewrite the URLs can be ugly,
but the content was fully searchable.

Frankly, have you seen some of the URLs that major websites such as
amazon.com and others generate? Criticizing us for showing how the URL would
look without rewrites is really nitpicking our site; however, we do want to
thank you for pointing out a minor issue: the &faqs%2F you saw above should
not appear when arriving from a search engine and then navigating. It should
be <http://www.noloh.com/?features/#/faqs>.

Sure enough, we'll be implementing the Google-style approach in NOLOH, and
best of all, NOLOH developers need not change anything. Their apps have been
searchable since 2007 and will continue to be searchable with newer and
better methods for the foreseeable future.

------
asnyder
Interestingly enough, NOLOH (<http://www.noloh.com>) made AJAX applications
crawlable years ago. If you're interested in the specifics, see
<http://dev.noloh.com/#/articles/Search-Engine-Friendly/>

Disclaimer: I'm a co-founder of NOLOH.

~~~
gruseom
I took a look. Your approach is to detect requests from spiders and respond
with plain HTML content rather than the content wrapped in JavaScript, etc.,
that a normal user would get. You address the obvious question, "But isn't
that cloaking?" by saying no, the _content_ itself is the same, so nobody
should object. Fair enough, I happen to agree with you, but _our_ opinion is
irrelevant; what matters is whether _Google_ considers this practice
legitimate. Can you (or any HNer) tell me definitively whether they do or
not? And has their policy changed recently with the introduction of this new
crawlability spec?
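
(For reference, the detection step described above typically boils down to
User-Agent sniffing, roughly as sketched below; the bot list is an assumption
on my part, not NOLOH's actual code. The crawler branch then gets the same
content as static HTML instead of the scripted version.)

    public class CrawlerCheck {
        // Very rough UA-based spider detection; real lists are much longer
        // and are usually kept configurable.
        public static boolean isCrawler(String userAgent) {
            if (userAgent == null) return false;
            String ua = userAgent.toLowerCase();
            return ua.contains("googlebot")
                || ua.contains("slurp")     // Yahoo
                || ua.contains("msnbot");
        }
    }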

------
endergen
Has Google actually started doing this? I thought this was still in an
exploratory stage.

------
mmastrac
Facebook started using this style of URL a little while ago, but I can't get
it to serve up the static equivalent yet. My bet is that they'll be the first
big name to ship this in production.

------
robryan
Wouldn't it be a lot easier for Google to build a JS parser and just examine
the requests, following them through as a normal browser would?

------
korch
Google has to do this, it's an existential threat to their money maker
PageRank. Imagine a few years out, after most of the web is no longer
organized as static documents linked together(nice for crawling!), but
transforms into a real-time evolving mish-mash of web API's, re-re-
aggregators, and interconnected web services. i.e. the top 10 social
networking sites will account for 90%+ of user activity, and it's their APIs &
data we'll all be using.

You can't crawl that.

Anyone else worry that Google's inevitable grand "evil" act will ironically be
them holding back the web from transforming into this? Microsoft could have
totally killed Google by hitting the fast-forward button on Ajax in 2004, by
leveraging IE to make the whole web into the "Deep Web." You can't sell ads on
what you can't crawl—just see what Apple is doing with iAd to carve out
mobile.

------
gcb
That's a cloaker's wet dream.

