
I'm convinced that Google has several Googlebots that are run depending on how popular a site is.

That is, new and low traffic sites are crawled by less intelligent bots, and as the site gets more visitors or better rankings, more complicated and resource intensive bots are deployed.

How this might work with the most popular sites out there, the Amazons and Wikipedias of this world - I'm not so sure about that. If I were in charge, I'd be tempted to have customised bots and ranking weights for each of these exceptional sites.

Sadly the chances of getting a real answer on this in my lifetime are close to zero.




Or perhaps there are also heuristics in place to determine which strategy to follow, i.e. heuristics to see whether executing JS would be worth it, would yield additional content. So, say, when crawling documentation, where JS doesn't give any of that (e.g. Sphinx's JS search), it could decide: nah, not doing JS, not worth it.

I'd expect that there are also other heuristics and different strategies for crawling to better handle e.g. content presented by one of the popular CMSes.
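
To make the idea concrete, here is a rough sketch of what such a "is JS worth it?" check could look like. This is purely a guess at the shape of a heuristic, not anything Google is known to do: the names `StaticTextProbe` and `js_probably_worth_it` and the 200-character threshold are all made up for illustration. It looks only at the un-rendered HTML and flags pages that ship scripts but almost no visible text.

```python
from html.parser import HTMLParser


class StaticTextProbe(HTMLParser):
    """Counts visible text characters and <script> tags in un-rendered HTML."""

    def __init__(self):
        super().__init__()
        self.text_chars = 0
        self.script_tags = 0
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.script_tags += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Script and style bodies are not visible content, so skip them.
        if self._skip_depth == 0:
            self.text_chars += len(data.strip())


def js_probably_worth_it(html, min_text_chars=200):
    """Hypothetical rule: render JS only if the page has scripts
    but very little static text. The threshold is an arbitrary assumption."""
    probe = StaticTextProbe()
    probe.feed(html)
    probe.close()
    return probe.script_tags > 0 and probe.text_chars < min_text_chars
```

A documentation page with plenty of static text plus a search script would come back `False` here, while a page whose body is nothing but a script tag would come back `True`.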


    heuristics to see whether 
    executing JS would be worth 
    it, would yield additional 
    content.
You are literally describing vanilla PageRank. If a large number of links are found to a page, but that page doesn't contain the contents the link rate suggests it should contain... either link rate has failed, or JavaScript should be executed.
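
A minimal sketch of that mismatch check, under loud assumptions: the crawler already has the anchor texts of inbound links and the page's static (un-rendered) text. The function name `should_render_js` and both thresholds are hypothetical, picked only to illustrate the comparison, and plain word overlap stands in for whatever signal a real ranking system would use.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def should_render_js(inbound_anchor_texts, static_page_text,
                     min_links=10, min_overlap=0.3):
    """Flag a page for JS rendering when many links point at it but the
    static HTML contains few of the words those links promise.
    min_links and min_overlap are arbitrary illustrative values."""
    if len(inbound_anchor_texts) < min_links:
        return False  # not enough link evidence to justify the cost

    expected = set()
    for anchor in inbound_anchor_texts:
        expected |= tokens(anchor)
    if not expected:
        return False

    found = expected & tokens(static_page_text)
    return len(found) / len(expected) < min_overlap
```

With a dozen links saying "hyper terminal themes" and an empty static body, this returns `True`; if the static text already mentions those words, it returns `False`.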


I recently put a little browser for themes for the hyper.app terminal online (https://hyperthemes.matthi.coffee). It was just something I used to try out Elm, and there's no reason to believe Google would regard it as anything different than a generic new site.

If you look at Google's cached version, you can see that the JS is executed (although it fails trying to download the actual data): https://webcache.googleusercontent.com/search?q=cache:hN5yCk...

Edit: as has been pointed out below, the cached version is just the same as the original and the JS gets executed on your end. This doesn't show whether Google also executes it during its crawl.


> If you look at Google's cached version, you can see that the JS is executed

Correct me if I am wrong, but when I look at the cached version of the homepage (http://webcache.googleusercontent.com/search?q=cache:hN5yCky...), I don't see that the JS has been interpreted.


The "Network error" in the top left is a JS result. Also, since you linked to the source view, you'll see that there actually isn't anything in the body but the script being loaded, whereas "Full version" shows the user interface was correctly initialised.


Disable JavaScript in your browser when viewing the cached page. All you will see is an empty page.


Now I'm feeling a bit stupid, will slowly walk away and hope I can get by with "it's Jan 1st and I didn't get much sleep" as an excuse...


First, thank you for taking the time to stick your neck out a bit and share an example.

Second, thank you for responding to someone pointing out that you were wrong without putting yourself on the defensive.

I see this entire sub-thread as a positive; glad we all could learn along with you!


You're at least partly right about sites being crawled differently depending on popularity. I think the factors may not be limited to popularity alone, but we see this behavior documented in the crawl rate documentation Google provides its users/clients, so there is no reason it couldn't apply to other "expensive" actions their crawlers do.

> How this might work with the most popular sites out there

We see it in the on-page answers: extracts from pages that answer the question asked in the search phrase, shown with a reference to the document they were sourced from.

Matt Cutts used to qualify sites like Wikipedia as "reputable" in the eyes of the search engine.


You could join Google :)


The chances of becoming a part of the search team are still very low. Even most Googlers won't know the exact details.


But within Google, you could ask somebody and get the right answer.


And outside Google, you could look for Google employees on LinkedIn who may be in a position to have the answer and ask nicely ;)


No it is not :) You can move around in the company.



