
I'm wondering if some of the simpler cache-busting tricks would force Google to update its cache. For example, somescript.js?v=201210221559.
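A minimal sketch of that query-string trick in Python (the helper name and the 'v' parameter are just conventions, not anything from this thread):

```python
from datetime import datetime, timezone

def busted_url(asset_path: str) -> str:
    """Append a timestamp query parameter so each deploy yields a new URL.
    Caches key on the full URL, so the stale copy is simply never requested
    again. The parameter name 'v' is an arbitrary convention."""
    version = datetime.now(timezone.utc).strftime("%Y%m%d%H%M")  # e.g. 201210221559
    return f"{asset_path}?v={version}"
```

Calling busted_url("/js/somescript.js") produces something like /js/somescript.js?v=201210221559, with the value taken from the current UTC time.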

That's not the issue here: we include the md5 hash of the content in the URL of every JavaScript/CSS asset, so new pages had all the correct (brand-new) URLs. The issue is that Google is executing JavaScript on HTML pages it downloaded days ago. The only solution I can see is to fire off CloudFront cache-invalidation requests for all old assets, but that negates the simplicity of including the hash of the content in the URL.
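The content-hash scheme described above can be sketched like this (the function name and filename layout are my assumptions; the thread only says the md5 of the content goes into the URL):

```python
import hashlib

def hashed_asset_url(path: str, content: bytes) -> str:
    """Embed the md5 of the asset's content in its filename. Any change to
    the content yields a new URL, so freshly rendered pages always point at
    fresh assets -- but old HTML keeps referencing the old URLs, which is
    exactly why a bot replaying days-old pages still fetches stale assets.
    Assumes 'path' has an extension, e.g. 'app.js'."""
    digest = hashlib.md5(content).hexdigest()
    stem, _, ext = path.rpartition(".")
    return f"{stem}-{digest}.{ext}"
```

For example, hashed_asset_url("app.js", source_bytes) returns app-&lt;md5&gt;.js; the only way to purge the old copy from an edge cache like CloudFront is an explicit invalidation, since the old URL never changes.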

Is it possible that people are looking at the page from Google's cache? I'm thinking of the 3taps kind of "web-site scraping that doesn't look like web-site scraping."

Hmm, that's interesting. I don't think so, though, because the User-Agent on the requests is Googlebot's:

    From: googlebot(at)googlebot.com
    User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
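A User-Agent header like the one above is trivially spoofable, though. Google's documented way to confirm a request really came from Googlebot is a reverse-DNS lookup on the requesting IP (the name must be under googlebot.com or google.com) followed by a forward lookup confirming that name resolves back to the same IP. A sketch with the DNS answers passed in as parameters so it stays offline (the function name and the sample crawl hostname are illustrative):

```python
def is_verified_googlebot(user_agent: str, rdns_host: str,
                          forward_ips: list[str], remote_ip: str) -> bool:
    """True only if the UA claims Googlebot, the reverse-DNS name of the
    requesting IP sits in a Google-owned zone, and the forward lookup of
    that name returns the requesting IP (Google's documented procedure).
    rdns_host / forward_ips would come from socket.gethostbyaddr() and
    socket.gethostbyname_ex() in a live check."""
    if "Googlebot" not in user_agent:
        return False
    if not rdns_host.endswith((".googlebot.com", ".google.com")):
        return False
    return remote_ip in forward_ips
```

A spoofer can copy the User-Agent string but cannot make reverse DNS for their IP resolve into googlebot.com, which is why the two-step lookup is the reliable test.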

Well, an interesting check would be to look at one of your pages in the cache that fires an AJAX call and see where that call comes from. I agree it would be "weird" if it came from Googlebot rather than from the browser viewing the cache.

At Blekko we post-process extracted pages of the crawl, which, if a site puts content behind JS, could result in JS calls offset from the initial access, but 3 days seems like a long time. Mostly, though, the JS is just page animation.

Would it make sense that loading the page from the cache triggers a call to the origin server?

I just checked one of my sites, which loads available delivery dates via AJAX, through the Google cache, and yep, it caches that too: the dates shown are from when the cache snapshot was taken.
