
Google bot delays executing JavaScript for days - dbeardsl
http://itbrokeand.ifixit.com/2012/10/21/google-bot-delays-javascript.html
======
h2s

        > If you're removing code or changing an endpoint,
        > be careful you don't screw the Google bot, which
        > might be "viewing" 3-day-old pages on your
        > altered backend.
    

An interesting proposition. Personally, unless I were operating in some sector
where keeping Googlebot happy was key to staying competitive, and there was
solid evidence this could hurt my page rank, I don't think I'd be prepared to
go to such lengths. Google is doing something quite atypical here compared to
regular browsers, and I'd like to think Google's engineers are smart enough to
account for this sort of thing in the early stages of planning.

They have a difficult cache invalidation problem here. The only practical way
to detect that the JavaScript in use on a site has changed is to check whether
the page HTML has changed. And even that heuristic is unreliable: the
JavaScript can change without any noticeable change to the HTML.
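
To make the problem concrete, here's a minimal Node-style sketch of that
heuristic (hashing the fetched HTML and comparing it against the previous
crawl); the function and the storage are hypothetical, not anything Google
has described:

    // Hypothetical sketch: the only change signal available is the HTML itself.
    var https = require('https');
    var crypto = require('crypto');

    var lastHash = null;  // in reality this would live in the crawl index

    function htmlChanged(url, callback) {
      https.get(url, function (res) {
        var html = '';
        res.on('data', function (chunk) { html += chunk; });
        res.on('end', function () {
          var hash = crypto.createHash('md5').update(html).digest('hex');
          var changed = (hash !== lastHash);
          lastHash = hash;
          callback(changed);  // stays false even if only the JS behind the page changed
        });
      });
    }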

~~~
zaptheimpaler
obligatory: "There are only two hard problems in Computer Science: cache
invalidation, naming things, and off-by-one errors."

------
ashray
Googlebot also does some other crazy stuff, like looking at URL patterns and
then trying out variations... they're almost trying to sniff URLs!

For example, if I have a page: www.domain.com/xyz/123

Googlebot, without any links to those other pages, will actually try URLs like
www.domain.com/xyz/1234, www.domain.com/xyz/122, www.domain.com/xyz/121, and
so on...

It's crazy how much 'looking around' they do these days!
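
A purely illustrative guess at what that probing could look like (this
function is hypothetical; Google hasn't documented the actual heuristics):

    // Given a URL ending in a number, generate the kind of neighboring
    // variations described above: /xyz/123 -> /xyz/1234, /xyz/122, /xyz/121
    function guessVariations(url) {
      var m = url.match(/^(.*\/)(\d+)$/);
      if (!m) return [];
      var base = m[1];
      var id = parseInt(m[2], 10);
      return [
        base + (id * 10 + 4),  // append a digit
        base + (id - 1),
        base + (id - 2)
      ];
    }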

~~~
ceejayoz
I believe that one's mostly a search for duplicate content - looking for URL
parameters that don't make a difference.

------
eli
I'm not too surprised. I've got Googlebot still requesting old URLs even
though there are no incoming links to them (that I know of) and they've been
either 404s or 301 redirects for six months. I even tried using 410 Gone
instead of 404, but it made no difference.

~~~
chrislomax
To reiterate this further: I am still 301'ing URLs that have been dead for
nearly 5 years, and I still get requests for them. I don't want to 404 them
for fear of losing that slight bit of traffic, so I just 301 them. I am really
surprised they don't remove these URLs from their cache, and I can't for the
life of me think why they don't.

~~~
gizmo686
They might have some obscure incoming URL from somewhere else on the net.

~~~
SquareWheel
Webmaster Tools should show that. If you 404 the page, it'll appear in the
errors pane after some time and show incoming sources.

~~~
gizmo686
It sounds like the only thing requesting the 404'ed page was Googlebot, which
I do not believe tells you the referrer. If that's true, then it would mean
either that Google does not clear their cache (which I doubt), or that the
link exists somewhere on the net, but in a place where no human would find it.
I've done some work with web crawlers, and you fall into that type of hole a
lot more often than I would expect.

~~~
SquareWheel
I'm not sure I understand. Why wouldn't Webmaster Tools show that one
hard-to-find link if Googlebot found it?

------
jes5199
Your users may be viewing days-old pages, too. It's not unusual for me to open
my sleeping laptop several days later and expect the open web pages to work
without refreshing them.

~~~
oakwhiz
What might be a good idea for Javascript-heavy web apps is to make an Ajax
call to the server to see if a refresh of the page is required.

~~~
moreati
Please don't do that. I left that tab open on purpose, and I'm halfway through
reading the page; if the page refreshes, I'll likely lose my position.

~~~
iamjustlooking
You don't have to refresh the page; you could make it so that your next page
click loads the full page instead of using ajax/pjax.

A quick pjax example:

    <!-- stamp the page with the time it was generated -->
    <html data-lastupdated="1234567890">

    // Poll the server; if the page is stale, strip the pjax attributes so
    // the next click does a full page load instead of a pjax one.
    $.getJSON('/lastupdated.json', function (lastupdated) {
      if (lastupdated > $('html').data('lastupdated')) {
        $('a[data-pjax]').removeAttr('data-pjax');
      }
    });
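
A minimal server-side counterpart (assuming a Node/Express app and a deploy
timestamp; both are illustrative, not part of the original suggestion):

    // Hypothetical endpoint backing the snippet above: returns the
    // timestamp of the last deploy as bare JSON.
    var express = require('express');
    var app = express();

    var LAST_DEPLOY = 1234567890;  // e.g. stamped by your build/deploy step

    app.get('/lastupdated.json', function (req, res) {
      res.json(LAST_DEPLOY);
    });

    app.listen(3000);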

~~~
oakwhiz
This is exactly what I meant by my initial comment - I should have been more
clear.

------
TazeTSchnitzel
I wonder if it's Google's visual site previews/thumbnails (the ones you get
when you click the arrow at the side of a search result) that are doing this.

Perhaps Google fetches the crawled page from the cache and then renders it for
the previews?

~~~
sj26
This was my first thought, and it seems likely. They run several forms of
analysis on their cache. It could even be some engineers running tests or
queries that require rendering the page, or at least bootstrapping the DOM.

------
georgemcbay
Is this surprising? I'd expect the possibility of this sort of behavior from
any system that was vaguely Map-Reduce-y and operated on the scale of data
that Google's indexing does.

------
ericcholis
I'm wondering if some of the simpler cache-busting tricks would force Google
to update their cache. For example: somescript.js?v=201210221559.

~~~
dbeardsl
That's not the issue here; we include the MD5 hash of the content in the URL
of every JavaScript/CSS asset, so new pages had all the correct (brand-new)
URLs. The issue is that Google is executing JavaScript on HTML pages it
downloaded days ago. The only solution I can see is to fire off CloudFront
cache-expiration requests for all old assets, but that negates the simplicity
of including the hash of the content in the URL.
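
For anyone unfamiliar with the technique, a minimal Node sketch of
content-hashed asset URLs (the file and path names are illustrative):

    // Name each asset after the MD5 of its content, so the URL
    // changes automatically whenever the JS/CSS changes.
    var fs = require('fs');
    var crypto = require('crypto');

    function hashedAssetUrl(file) {
      var content = fs.readFileSync(file);
      var md5 = crypto.createHash('md5').update(content).digest('hex');
      // e.g. app.js -> /assets/app-d41d8cd98f00b204e9800998ecf8427e.js
      return '/assets/' + file.replace(/\.js$/, '') + '-' + md5 + '.js';
    }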

~~~
ChuckMcM
Is it possible that people are looking at the page from Google's cache? I'm
thinking of the 3taps kind of 'web site scraping that doesn't look like web
site scraping'.

~~~
xiongchiamiov
Hmm, that's interesting. I don't think so, though, because the user-agent on
the requests is Googlebot's:

        From: googlebot(at)googlebot.com
        User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
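
(As an aside, user-agent strings can be spoofed; a stricter check is a reverse
DNS lookup on the requesting IP, which is how Google suggests verifying
Googlebot. A minimal sketch, with hypothetical names:)

    // A full verification also resolves the hostname back to the IP;
    // this sketch only does the reverse lookup.
    var dns = require('dns');

    function isGooglebot(ip, callback) {
      dns.reverse(ip, function (err, hostnames) {
        if (err) return callback(false);
        callback(hostnames.some(function (h) {
          return /\.googlebot\.com$/.test(h) || /\.google\.com$/.test(h);
        }));
      });
    }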

~~~
ChuckMcM
Well, an interesting check would be to look at one of your pages in the cache
that fires an AJAX call and see where that call comes from. I agree it would
be 'weird' if it came from Googlebot instead of from the browser looking at
the cache.

At Blekko we post-process extracted pages from the crawl, which, for sites
that put content behind JS, could result in JS calls offset from the initial
access; but 3 days seems like a long time. Mostly, though, the JS is just page
animation.

------
lists
Did anyone else get really bad font rendering running Chrome on Windows 7?

~~~
teamonkey
Yes, it's a known issue and has been a problem for a long time. Quite ironic
that it's particularly bad with Google Web Fonts.

<http://code.google.com/p/chromium/issues/detail?id=137692>

