

Google's Indexing Javascript more than we thought - dohertyjf
http://www.distilled.net/blog/seo/google-stop-playing-the-jig-is-still-up-guest-post/

======
bowyakka
I have been saying this for years, but most people have refused to believe me.

Like most people, I also used to hold the belief that robots were just dumb
scripts. I learnt that this is not the case when I had to trap said robots
for a previous employer.

At the time I was working for one of the many online travel sites. Most
people are probably not aware that there is quite a bit of money to be made
in knowing airline costs. The thing is, getting this information is not
actually cheap: most of the GDS (Global Distribution System) providers are
big mainframe shops, and all sorts of cunning is required to emulate a
green-screen session for the purposes of booking a flight.

The availability search (I forget the exact codename for it) is done first.
This search gives you the potential flights (after working through the
byzantine rules of travel) and a costing or fare quote for your trip. This
information is reliable about 95% of the time. Each search costs a small
amount against a pre-determined budget, with a slightly higher rate once you
go over the limit (kinda like how commercial bandwidth is sold); if my
memory serves it was 0.001 euro cents per search.

During the booking phase (the GDS code FXP) the price is actually settled.
The booking is a weird form of two-phase commit where first you get a
concrete fare quote. This quote "ringfences" the fare, essentially ensuring
that the seat cannot be booked out from under you for roughly 15 minutes. In
practice there are a load more technicalities around this part of the
system, and as such it is possible for double bookings and overbookings to
happen, but let's keep it simple for the sake of this story. These
prebookings are roughly 99.5% accurate on price but cost something like 0.75
cents each (there is a _lot_ that happens when you start booking a flight).
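
Roughly, the flow has the shape sketched below (the GDS client API here is
entirely made up; the real calls are far messier):

    // Hypothetical GDS client API, for illustration only.
    interface FareQuote { quoteId: string; price: number; expiresAt: Date; }

    interface GdsClient {
      fareQuote(itinerary: string): Promise<FareQuote>;   // phase 1: ringfence the fare (~15 min hold)
      confirmBooking(quoteId: string): Promise<string>;   // phase 2: settle the price, issue the booking
    }

    async function bookFlight(gds: GdsClient, itinerary: string): Promise<string> {
      const quote = await gds.fareQuote(itinerary);        // the expensive ~0.75 cent call
      if (quote.expiresAt.getTime() < Date.now()) {
        throw new Error("hold expired before we could confirm");
      }
      return gds.confirmBooking(quote.quoteId);            // returns a booking reference
    }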

So with that in mind, if you are in the business of reselling flights it
can be to your advantage to avoid the GDS costs and scrape one of the online
travel companies instead. You also want the prebook version of the fare, as
it's more likely to be accurate; the travel sites mind less about people
scraping the lookup search.

Thus began the saga of our bot elimination projects. First we banned all
IPs that smashed the site thousands of times; this is easy and kills 45% of
the bots dead. Next we set up a proper robots.txt and other ways to
discourage Googlebot and the more "honest" robots, which got us up to
dealing with 80% of the bots. Then we blocked IP ranges for China, Russia,
etc.; we found these often had the most fraudulent bookings anyhow, so no
big loss. That took us up to 90% of the bots.
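
That first step really is nothing fancier than counting requests per IP; a
toy sketch of that kind of filter (thresholds invented, in-memory only):

    // Toy per-IP rate filter. Real systems persist counts and decay them.
    const hits = new Map<string, number[]>();      // ip -> request timestamps (ms)
    const WINDOW_MS = 60_000;
    const MAX_HITS_PER_WINDOW = 300;               // made-up threshold

    function shouldBlock(ip: string, now = Date.now()): boolean {
      const recent = (hits.get(ip) ?? []).filter(t => now - t < WINDOW_MS);
      recent.push(now);
      hits.set(ip, recent);
      return recent.length > MAX_HITS_PER_WINDOW;  // smash the site -> banned
    }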

Killing the last 10% was never done. Every time we tried something new
(CAPTCHAs, JS nonce values, weird redirect patterns, bot traps and pixels,
user agent sniffs, etc.) the bots seemed to work around it almost
immediately. I remember watching the access logs where we had one IP that
never, ever bought products; it just looked for really expensive flights. I
distinctly remember seeing it hit a bot trap, notice the page was bad, and
then out of nowhere the same user session appeared on a brand new IP
address with a new user agent, one that essentially said "Netscape
Navigator 4.0 on X11" (this was in the Firefox 1-2 days, so seeing Unix
Netscape Navigator was a rare sight). It was clear the bot had gone and
executed the JavaScript nonce with a full browser, and then gone back to
fast scraping.
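
For anyone who hasn't seen the JS nonce trick, the idea is roughly the
sketch below (not what we actually shipped; the challenge and endpoint are
invented). The server embeds a small challenge in the page, and only a
client that actually runs the script can answer it on the next request:

    // Server renders something like this into the page:
    //   <script>var challenge = { a: 17, b: 23, salt: "x9f2" };</script>
    declare const challenge: { a: number; b: number; salt: string };

    function solveNonce(): string {
      // Anything cheap for a real browser but annoying to re-implement blindly.
      return `${challenge.salt}-${(challenge.a * challenge.b) ^ 0xbeef}`;
    }

    // The answer is attached to the real search request; curl-style scrapers
    // that never ran the script fail this check.
    fetch(`/search?from=LHR&to=JFK&nonce=${encodeURIComponent(solveNonce())}`);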

A few years later, at the same company but for very different reasons, I
wrote a tool to replace a product known as Gomez with an in-house system.
The idea of Gomez and similar products like Site Confidence is to exercise
your website as the user sees it, from random IPs across the world, and then
report on it. I wrote this tool with XULRunner, which is a stripped-down
version of Firefox. Admittedly I had insider knowledge of where the bot
traps were, but I was amazed at how easy it was to side-step all of our bot
detection in only a few days. I also had unit tests for the system that ran
it against sites like Amazon and Google, and even there it was shocking how
easily I was able to side-step their bot traps (I am sure they have got
better since, but it surprised me how easy it was).

I am not saying all the bots are smart, but my mantra since then has been
that "if there is value for the bots to be smart, they can get very smart".
I guess it's all about the cost payoff for those writing the bots: is it a
good idea to run JS all the time as a spider? Probably not. Does it make
sense if it saves you 0.75 cents of cost per search? Very much so!

~~~
paganel
> I am not saying all the bots are smart, but my mantra since then has been
> that "if there is value for the bots to be smart, they can get very smart".

I was actually once on the other side of the fence from you, around 5-6
years ago (in a different industry, though). You're right: if there's value
to be gained by scraping other people's pages, there's almost always a way
round the obstacles.

I remember the day when my boss presented me with a link to a
strangely-named FF extension called Chickenfoot
(<http://groups.csail.mit.edu/uid/chickenfoot/faq.html>). It allowed one to
very easily write FF extensions that would programmatically click on
whatever links you wanted scraped, all from inside the browser, just as a
normal user would have done. I used to run FF with this extension installed
on a dedicated cheap PC, saving the data to our servers and automatically
restarting FF from time to time because the machine was running out of
memory. Fun times :)

~~~
eli
Sure, automated browser testing is a whole industry, and I think we all know
those tools aren't always used for testing sites you control.

Take a look at Selenium and Watir.
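
A few lines of the Node selenium-webdriver bindings are enough to drive a
real browser, which is why "does it execute JS?" is such a weak bot test
(sketch only; assumes a local Chrome, and example.com is a stand-in):

    import { Builder, By, until } from "selenium-webdriver";

    async function main() {
      // A real browser does the navigating, so the traffic looks like a
      // human with a full JS engine.
      const driver = await new Builder().forBrowser("chrome").build();
      try {
        await driver.get("https://example.com/");
        const link = await driver.wait(until.elementLocated(By.css("a")), 5000);
        await link.click();
        console.log(await driver.getTitle());
      } finally {
        await driver.quit();
      }
    }

    main();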

~~~
korny
And phantom.js

------
giberson
It occurs to me that if GoogleBot is executing client-side JavaScript, you
could take advantage of Google's resources for computational tasks.

For instance, let me introduce you to SETI@GoogleBot. SETI@GoogleBot is much
like SETI@home, except it takes advantage of GoogleBot's recently discovered
capabilities. Including the SETI@GoogleBot script in your web pages will
cause the page (after the load event) to fetch a chunk of data from the SETI
servers via an AJAX request and proceed to process that data in JavaScript.
Once the data has been processed it is posted back to the SETI servers (via
another AJAX request), and the cycle repeats. Thus, for the small cost of a
page load, you have GoogleBot process your SETI data and enhance your
SETI@home score.
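
A sketch of what such a page script might look like (the SETI endpoints and
the "work" here are, of course, invented):

    // Hypothetical SETI@GoogleBot page script.
    async function setiCycle(): Promise<void> {
      const chunk: number[] =
        await (await fetch("https://seti.example.org/chunk")).json();
      // Stand-in for the real signal processing.
      const result = chunk.reduce((acc, x) => acc + Math.sin(x), 0);
      await fetch("https://seti.example.org/result", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ result }),
      });
      setiCycle();   // repeat for as long as the bot keeps the page alive
    }

    window.addEventListener("load", () => { setiCycle(); });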

Obviously, this isn't a new idea (using page loads to process data via
JavaScript) but it is an interesting application to exploit GoogleBot's likely
vast resources.

~~~
corin_
One would assume that they are clever enough to have built-in safeguards to
prevent anything from running too long or using too much processing power.

~~~
xtacy
Not just that; I would also assume that your site's ranking will be
penalised if its JS takes that long to execute.

~~~
cr4zy
Looking at Google Webmaster Tools, I see a significant decline in my
reported site performance starting in September, even though my site's speed
has improved significantly since then by my own measures. Assuming this is
due to our 'next' feature that AJAXes content in, I'm going to disallow the
'next' URLs in robots.txt and cross my fingers.
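
(Assuming the 'next' pages live under a /next/ path, which is purely a guess
on my part, the rule itself is just:)

    User-agent: Googlebot
    Disallow: /next/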

~~~
thenextcorner
The Google Webmaster Tools site performance figure is measured through
multiple data points, which can include:

- people on dial-up (yes, these still exist)
- people in other countries; if you have not taken care of caching or a CDN,
  your website might load slower for far-away visitors

It represents an average across all the data points being measured.

The data is captured through:

- Google Toolbar
- Google Chrome
- Google DNS services

Your own observations of the speed and the improvements will not always
match the aggregated data Google has access to.

I'm not sure what you are trying to accomplish by disallowing the 'next'
URLs in robots.txt. Can you explain what your hypothesis is in this test,
and how you would measure success?

~~~
cr4zy
I guess disallowing won't work if what you say is correct, i.e. Google
doesn't measure site performance with Googlebot. However, that wouldn't
explain the slowdown since September unless changes were also made to
include JS execution time in the site performance measured by the services
you mentioned.

Perhaps a solution then would be to trigger the AJAX on mouseover, but that
seems kludgy. In my case, I need to make the AJAXed content part of the
initial page load anyway, for the sake of user experience. But I can see
cases where Google should not be counting AJAX as part of the page load
time; God forbid somebody uses long polling, for example. Maybe Google is
doing this in a smart way, looking at the changes after the AJAX and
determining whether they should count as part of the page load.

~~~
thenextcorner
Whether you need to worry about what Google is reporting for page load time
really depends on what you are trying to accomplish. In general, it's always
good to pay attention to page load times, regardless of what Google makes of
them!

You can experiment with asynchronous calls, or lazily loaded jQuery scripts
which kick off after the headers and HTML framework have already been
loaded.
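
For example, a plain-JS sketch that defers a heavy script until after the
load event (the script URL is a placeholder):

    // Fetch the heavy script only once the page itself has finished loading.
    window.addEventListener("load", () => {
      const s = document.createElement("script");
      s.src = "/js/heavy-widget.js";   // placeholder path
      s.async = true;
      document.body.appendChild(s);
    });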

Overall, I would not worry about the reports in Google Webmaster Tools that
much; just try to get faster in general.

If you are serious about delivering an ultra-speedy service online, there
are services which can test your site or application from multiple
locations, on different OSes and connection speeds, or with different
browsers. But these services are pricey, trust me on that one!

------
heyitsnick
Is there any example of a site having its dynamically generated* Disqus
comments indexed by Google? Disqus is probably one of the most common forms
of AJAX-generated content on the web, so if Googlebot really were actively
indexing dynamic content like this, I would expect to see Disqus supported.

* Disqus has an API that allows you to render comments server-side, so some Disqus implementations - I think Mashable is one - will have their comments indexed without the aid of JavaScript.

------
eli
I don't buy this argument. Wanting to have a more complete rendering engine
for their crawler might have been a factor in designing Chrome, but I can't
imagine it was in any way the driving force. The costs of developing a browser
that runs well on millions of different computers and configurations are far
beyond what it would take to make a really great headless version of WebKit
for your crawler.

~~~
wahnfrieden
Yeah, I took that part of the original article (the one this one is citing) to
be a dumbed-down explanation of googlebot's new behaviors. That article's
audience were SEO people, not "engineers". It's unfortunately misleading
enough that we get articles like these once you try to extrapolate from that.

------
wiradikusuma
I wonder if it also means we don't need to implement _escaped_fragment_
anymore:
<http://code.google.com/web/ajaxcrawling/docs/getting-started.html>
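
(For context, the scheme maps hash-bang URLs onto an ugly query parameter
that your server is supposed to answer with an HTML snapshot, e.g.:)

    Pretty URL the user sees:   http://example.com/page#!state=2
    URL Googlebot requests:     http://example.com/page?_escaped_fragment_=state=2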

~~~
chrisguitarguy
Just because GoogleBot _can_ crawl and execute/index JavaScript doesn't mean
that it will on your site. The best bet would be to keep them. Or take them
off and see what happens; if you don't see negative effects, then you will
have discovered something interesting.

~~~
ipullrank
Yeah, I'd definitely say we should continue to follow our established best
practices until G gets better at this, but Josh's evidence and our continued
testing on this subject are very compelling.

~~~
alexmuller
I'd argue that best practice in web development is not requiring JavaScript to
load a page, but I'm sure that issue has been done to death in the past.

------
almost
Why the assumption that "GoogleBot" is a single thing? Of course we know
that Google has a headless browser running, we see its output in the Instant
Previews, but I'm sure they still do plenty of standard crawling (and
probably some halfway-partial JS execution and/or heuristics too).

------
tripzilch
> My personal favorite example of this is Google Translate, which is one of
> the most accurate machine translating tools on the planet. Google almost
> sacked it because it was not profitable, and _had it not been for public
> outcry_ we may have lost access to this technology altogether.

I kind of missed this "public outcry", when did it happen? And if Google
listens to public outcry, why did we lose Google Code Search?

~~~
jsnell
When the shutdown of the Google Translate API was announced, a few months
ago. (Just the API, note, not the tool itself.)

It was saved because people care enough about the translation API that
they're willing to pay to use it.

------
nyellin
You might be able to check what Googlebot executes by adding JavaScript to
your site and checking the thumbnail.

EDIT: Removed comment about the bot's user-agent. The article links to a
Google FAQ which answers the question.
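
For example, a small script (sketch; the marker text and styling are
arbitrary) that injects something only a JS-running client would render, so
you can then look for it in the Instant Preview thumbnail:

    // If this banner shows up in the preview thumbnail, the bot ran our JS.
    window.addEventListener("load", () => {
      const marker = document.createElement("div");
      marker.textContent = "JS-EXECUTED " + new Date().toISOString().slice(0, 10);
      marker.style.cssText = "font-size:48px; background:#ff0; padding:20px;";
      document.body.insertBefore(marker, document.body.firstChild);
    });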

~~~
jQueryIsAwesome
They execute absolutely everything you put in JavaScript; the page looks
exactly as it does in Chrome. And it looks like the snapshot is taken after
all the initial processing is done.

JavaScript-heavy site with perfect snapshots: <http://goo.gl/xNUIM>

But they also index and take a snapshot of the no-JavaScript version:
<http://goo.gl/eP84M>

------
jwatte
I think they have it backwards. What if Chrome is GoogleBot? You get quality
measurements of pages based on how users behave on them. Crowdsourcing beats
crawling!

