Hacker News
How Googlebot crawls JavaScript (searchengineland.com)
237 points by mweibel on May 12, 2015 | 109 comments



This was actually my primary role at Google from 2006 to 2010.

One of my first test cases was a certain date range of the Wall Street Journal's archives of their Chinese language pages, where all of the actual text was in a JavaScript string literal, and before my changes, Google thought all of these pages had identical content... just the navigation boilerplate. Since the WSJ didn't do this for its English language pages, my best guess is that they weren't trying to hide content from search engines, but rather trying to work around some old browser bug that incorrectly rendered (or made ugly) Chinese text, but somehow rendering text via JavaScript avoided the bug.

The really interesting parts were (1) trying to make sure that rendering was deterministic (so that identical pages always looked identical to Google for duplicate elimination purposes), (2) detecting when we deviated significantly from real browser behavior (so we didn't generate too many nonsense URLs for the crawler or too many bogus redirects), and (3) making the emulated browser look a bit like IE and Firefox (and later Chrome) at the same time, so we didn't get tons of pages that said "come back using IE" or "please download Firefox".

I ended up modifying SpiderMonkey's bytecode dispatch to help detect when the simulated browser had gone off into the weeds and was likely generating nonsense.

I went through a lot of trouble figuring out the order that different JavaScript events were fired off in IE, Firefox, and Chrome. It turns out that some pages actually fire off events in different orders between a freshly loaded page and a page reloaded via the refresh button. (This is when I learned about holding down shift while hitting the browser's reload button to make it act like it was a fresh page fetch.)

At some point, some SEO figured out that random() was always returning 0.5. I'm not sure if anyone figured out that JavaScript always saw the date as sometime in the Summer of 2006, but I presume that has changed. I hope they now set the random seed and the date using a keyed cryptographic hash of all of the loaded JavaScript and page text, so it's deterministic but very difficult to game. (You can make the date deterministic for a month, with the dates of different pages jumping forward at different times, by adding an HMAC of the page content (mod the number of seconds in a month) to the current time, rounding that time down to a month boundary, and then subtracting back the value you added earlier. This prevents excessive index churn from switching all dates at once, and yet gives each page a unique date.)
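
A minimal sketch of that date scheme in Python (my illustration, not Google's code; the key, the hash, and the 30-day "month" are arbitrary choices, and "page content" is assumed to mean the concatenated HTML and script bytes):

    import hmac, hashlib, time

    SECONDS_PER_MONTH = 30 * 24 * 3600            # illustrative "month" length
    SECRET_KEY = b"indexing-secret"               # hypothetical key

    def deterministic_date(page_bytes, now=None):
        # Stable for roughly a month, rolling forward at a page-specific moment.
        if now is None:
            now = int(time.time())
        digest = hmac.new(SECRET_KEY, page_bytes, hashlib.sha256).digest()
        offset = int.from_bytes(digest[:8], "big") % SECONDS_PER_MONTH
        shifted = now + offset                                       # add the per-page offset
        bucket = (shifted // SECONDS_PER_MONTH) * SECONDS_PER_MONTH  # round down to a boundary
        return bucket - offset                                       # subtract the offset back out

The returned timestamp stays within a month of the real time, but each page's date rolls forward at its own moment because each page gets its own offset.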


> (This is when I learned about holding down shift while hitting the browser's reload button to make it act like it was a fresh page fetch.)

Most useful aside of all time.


I used to use this a lot. My experience is that for some reason, a couple years ago it stopped working reliably as a fresh page fetch. Some items were still coming up cached. Now I use incognito or private browsing windows instead.


If you're running Chrom(e|ium) with developer tools open, you can right-click the refresh button and it gives you a few refresh options (e.g. clear cache and reload).

That tends to be my fallback whenever I'm specifically fussed about the "freshness" of a page. That, or curl.


Thanks, never tried right-clicking that before. There's also a checkbox in dev tools settings to "Disable cache while dev tools is open."


I used to go through a lot of head scratching when doing manual testing, before discovering the joys of Cmd-Shift-R.


> At some point, some SEO figured out that random() was always returning 0.5. I'm not sure if anyone figured out that JavaScript always saw the date as sometime in the Summer of 2006, but I presume that has changed. I hope they now set the random seed and the date using a keyed cryptographic hash of all of the loaded javascript and page text, so it's deterministic but very difficult to game.

I don't get why the rendering had to be deterministic. Server-side rendered HTML documents can also contain random data and it doesn't seem to prevent Google from doing "duplicate elimination".


Byte-for-byte de-duping of search results is perfect and fairly cheap. Fuzzy de-duping is more expensive and imperfect. Users get really annoyed when a single query gives them several results that seem like near copies of the same page.

Tons of pages have minor modifications made by JavaScript, and only a very small percentage have JavaScript modifications whose analysis actually improves search results.

So, if JavaScript analysis isn't deterministic, it has a small negative effect on the search results of many pages that offsets the positive effect it has on a small number of pages.


Great thread. I'm one of those people that was poking around and trying to figure this out a few years back.

Obviously, there's a lot I know you can't say, but I'd love to know your general thoughts on how far off we were: http://ipullrank.com/googlebot-is-chrome https://moz.com/blog/just-how-smart-are-search-robots


" Since the WSJ didn't do this for its English language pages, my best guess is that they weren't trying to hide content from search engines, but rather trying to work around some old browser bug that incorrectly rendered (or made ugly) Chinese text, but somehow rendering text via JavaScript avoided the bug."

Or maybe they were trying to get past the great firewall of China?


> Or maybe they were trying to get past the great firewall of China?

Possible, but at that time the only affected pages were for a certain date range in their archives, not the most recent pages. I also think the Great Firewall of China did simple context-free regex searches that would have caught the text in the JavaScript literals.


Did you load in Ajax? I've got a client that runs a site that loads HTML in separately. They've been paying for a third party service to run PhantomJS and save HTML snapshots to serve to Googlebot - is that no longer needed?

(I'm not thrilled about rendering this way, but it makes development a lot easier.)


In practice, and from experience, content changes driven by JS tend to lag a few days behind content delivered as direct output. If you're doing client-side rendering, couldn't you refactor to use Node or something similar for your output rendering?

If you aren't heavily reliant on conversions from search traffic, you can probably get away with being JS-driven; I'd suggest sticking with anchor tags for direct navigation with JS overrides. That assumes you are supporting full URL changes... otherwise you need to support the hashbang alternate paths, which is/was a pain when I did it 3-4 years ago.


As an aside, did you work on the indexing team at Google? I was on the indexing team from 2005-2007, and I remember that Javascript execution was being worked on then, but I don't remember who was doing it (was a long time ago ;) ). My name is my username.


I was always in the New York office (before and after the move from Times Square to Chelsea), on the Rich Content Team sub-team of Indexing. My username is the same as my old Google username.

I was working on the lightweight high-performance JavaScript interpretation system that sandboxed pretty much just a JS engine and a DOM implementation that we could run on every web page on the index. Most of my work was trying to improve the fidelity of the system. My code analyzed every web page in the index.

Towards the end of my time there, there was someone in Mountain View working on a heavier, higher-fidelity system that sandboxed much more of a browser, and they were trying to improve performance so they could use it on a higher percentage of the index.


Ah, okay, cool. Never visited the NY office. That's probably why I just remember the general idea that "JS execution was being worked on."


I feel that dynamic websites are not websites, but applications. Even after this thorough research, I'd still be very wary of turning a primarily content-based site into a dynamic app.

A plain HTML site is accessible, and will be accessible in a 1000 years. A site depending on (external) JavaScript sources will force John Titor to travel back in time to find version 1.x of jQuery. Starting with JavaScript abandons the principles of progressive enhancement. Sometimes there is not even a fallback/graceful degradation, reminding me of those 2001-era "Best viewed at 800x600 resolution in Netscape" sites.

Google holds enormous clout among SEOs. Google says they will factor in site speed, and a large fraction of the web becomes faster. Google can say more sternly that using JavaScript can have ugly consequences for user experience and accessibility, but they are swimming upstream: the web seems to be moving on to fancy new technologies regardless of what their SEO says.

Not much good comes from HTML5 JavaScript fans forcing your hand. Tor enabled JavaScript because too much of the web would break without it, leading to a poor user experience. This led to a huge security gaffe, which I fully blame on web developers eschewing basic principles just to get that slideshow running.


I think the trend of "turning a primarily content-based site into a dynamic app", and indeed most of what has been referred to as "Web progress", "moving the Web forward", etc. comes from the desire of content producers to obtain and maintain more control over their content. Look at how browsers have evolved to de-emphasise features which give the user control while adding those that are author-targeted.

We're moving from browsers being viewers for simple HTML documents (which can be copied, shared, and linked via simple means), to a platform for running complex applications written in JavaScript which often render data retrieved in proprietary formats from proprietary APIs. The "open by default" nature of plain HTML has become the "closed by default" of the data processed by web apps, much like with many native apps. Native app platforms (e.g. mobile) are also gradually becoming more "closed by default"; I'm not sure if that's a related trend.

> Google can say more sternly that using JavaScript can have ugly consequences for user experience and accessibility, but they are swimming upstream: The web seems to be moving on to fancy new technologies regardless of what their SEO says.

Part of the reason is because Google themselves are doing this in many of their products... some of their employees probably disagree with "JS everything", but they're in the minority.


> I think the trend of "turning a primarily content-based site into a dynamic app", and indeed most of what has been referred to as "Web progress", "moving the Web forward", etc. comes from the desire of content producers to obtain and maintain more control over their content. Look at how browsers have evolved to de-emphasise features which give the user control while adding those that are author-targeted.

I don't agree with this. Browsers are more user-targeted than ever.

> We're moving from browsers being viewers for simple HTML documents (which can be copied, shared, and linked via simple means)

Browsers still allow this.

> to a platform for running complex applications written in JavaScript which often render data retrieved in proprietary formats from proprietary APIs.

I have rarely seen a web API that uses anything other than straightforward JSON.

The "open by default" nature of plain HTML has become the "closed by default" of the data processed by web apps

Almost always equally as open as any HTML you would previously have received.

> Native app platforms (e.g. mobile) are also gradually becoming more "closed by default"; I'm not sure if that's a related trend.

What do you mean by this?


> I have rarely seen a web API that uses anything other than straightforward JSON.

There are differences between HTML and AJAX+JSON+JavaScript+DOM. JSON has a lot less of a schema than HTML. You don't have to execute custom code from a remote server to render plain HTML. Client-side rendering is more complex for the client. A JavaScript-based page is typically going to require more requests for remote resources than inline HTML, potentially meaning more bandwidth and caching/archiving costs. I can't quite put my finger on the implications right now, but I wanted to note that "JSON is a standard" doesn't mean much to me, since JSON is not comparable to HTML.


>> We're moving from browsers being viewers for simple HTML documents (which can be copied, shared, and linked via simple means)

> Browsers still allow this.

Yes, they allow it. But consider that user stylesheets have been dropped, and hardly any improvements have been made in presenting standard HTML (unless you count giving in to IE and moving from less-stark black-on-grey to too-stark black-on-white as an improvement).

Still no browser does a half-decent job of avoiding ragged-right text[1], giving you decent margins on un-styled content, etc. There's no real reason for this. You could claim "backwards compatibility" -- but if there were genuine interest, there'd be nothing stopping the introduction of a <sane-default-render-html6-whatever> content type.

It's ironic that browsing a plain HTML site in w3m in the console is a better reading experience than opening the same page in a desktop browser. So of course people need to supply a crap-load of stuff just to get decent, basic text layout that flows well across various screen sizes. There's no reason a basic, unstyled HTML document couldn't look much better than a TeX/LaTeX document published in the 80s, with the added bonus of re-flowing in a sane way for various window/screen sizes -- but they all look awful, to the point that plain HTML is actually not usable.

You need to wrap a document in JS to get sensible layout, and in CSS to get sensible presentation. Even if the document doesn't contain any media other than text. Add an image or two, and things keep going downhill. It's absurd.

[1] http://alistapart.com/article/the-look-that-says-book


At the end of the day, whether or not the source data is in a proprietary format, it's being rendered into HTML in the case of web apps. The only difference is whether that happens on the client or the server side. It's trivial to parse in either case. In fact, I'd argue it's usually easier to dig up a JSON endpoint in the case of a web app, which is far more parseable than HTML.

I don't think we can point to this reason to explain the rise of web apps.


Surely you're not suggesting that parsing HTML is a good way to retrieve data for display?


IMHO, websites that don't have "realtime" content should always stick with traditional HTML. I'm a web developer myself and I don't like the JavaScript frontend trend.

Many devs use frontend JS in places where it's absolutely not needed. If you're building an app that updates in real time and shows information as it's created, I'm fine with frontend JS, but it's overkill for most content pages.

Sure, it depends on your implementation details, but as I said, it's just my opinion.


I mostly agree, but at the same time, the rise of native apps has raised the bar of what people expect in terms of UX. Take Hacker News and Reddit, primarily content-based sites and a good fit for the classic server-rendered HTML approach. Still, a lot of people prefer using native apps to access that content. You can only get so far by adding some CSS to make the site responsive, but you won't be anywhere near the UX that native gives you without JS.

If we want the nice UX without relying too heavily on Javascript, there's a lot that has to change on the HTML/CSS side of things. And I don't see that happening at all.

And let's not forget that you a) don't get accessibility for free just by rendering on the server and b) almost all screen readers today support Javascript.


I agree! But you have to distinguish between "enhancing the UX with JS" and "the whole app is written in JS".

I think it's perfectly fine to enhance stuff with JS, as long as important content is visible without it.

Some people go head over heels down the JS route (since this seems to be the way people do it today) and build things that could be built far more cheaply (measured in hours) with traditional HTML. Since the outcome is the same (static content), it's just not necessary.

Note: I'm focusing on static content here, i.e. pages that mainly show text and images (blogs, news sites, et al.). (Web) apps are another topic and present good reasons to use frontend JS.


It depends... on whether your content is mostly static or mostly dynamic.

Sometimes doing it all in React/Angular is easier than bolting on jQuery extension after extension and bloating everything up. Also, if you're using more modern techniques, you're going through a build/minify step anyway, which makes it even easier to be more JS-based than static.


But you can still progressively enhance with JS to achieve that nice UI, and often it will be more usable because it's built on a solid RESTful foundation that is close to browser behaviour and therefore user expectation.

My experience with JS-only apps is that they're often less usable, more brittle, and often don't work at all in IE.


Progressive enhancement works well for simple stuff, like progressively enhancing a form post or a "like" button that just sends an Ajax request. But as the complexity grows, progressive enhancement doesn't really scale, and you end up with two separate versions of your site/app.

I agree that Javascript only apps are often less usable, because the devs making them aren't testing enough on different browsers and devices. But the trend of the "Javascript only" approach is certainly driven by more than just frontend devs that want to use shiny new things (even if that is a factor as well).


It often is. Most web apps are business applications doing CRUD and form filling, and if you were around before AJAX, you'll know that we were building these back then just fine. I'm not saying we can't do better now, but PE will get you to a better experience than graceful degradation.

Sometimes you need additional functionality, maybe a realtime graph of share prices that has to be JS. So progressively enhance just that component, or gracefully degrade if you have to (e.g. put a sign up saying "switch on JS to get this specific functionality") but don't use it as an excuse to turn everyone away. It might be that you will reach people who can live without the stock ticker.

Sometimes you just can't do without JS. I wrote a desktop publishing app on the browser once. Obviously I wrote it in JS - users were forced to use a modern standards compliant browser (this was an internal app) - but if I'm doing an ecommerce site, or really any public site, I'm always challenging the devs who want to "build it in angular" to reconsider that option before ploughing ahead.


I still prefer just-HTML sites to the typical JS-based sites (e.g. the new Google Groups) I see.


> two separate versions of your site/app.

It isn't 2010 anymore.

React (just to name an example, there are many others) completely avoids this issue - you get serverside and clientside rendering out of the box.


You dropped the context of that quote. React isn't exactly a poster child of progressive enhancement.


Actually it's very well suited to it (well, in theory at least - haven't tried it yet). You can render the same html on the server as on the client, using node.js, so you can build a page on the server and then let the JS enabled client take over after the initial page load, or let the server do the work for the non-JS enabled browser.


Hacker News and Reddit are very close to real-time content.

Reddit is almost a chat service.

Your example is also very poignant, as neither of those would come close to existing without proper URLs.


I like Tantek's definition the best: "if it’s not curlable, it’s not on the web". http://tantek.com/2015/069/t1/js-dr-javascript-required-dead


That just sounds like a passive-aggressive, arbitrary rule change. For about 99% of the world the curlability, or not, of the web doesn't matter at all.

We already have a perfectly good web, and it includes things that are not curlable, even if we exclude JavaScript (trivial example: you can't, meaningfully, curl a live sports event).


What do you mean, can't curl a live sports event? Do you mean you can't save it to disk (PVR), or that if it's not audio/video, it's not useful to curl it and parse it for stats? If it is video, and you don't consider curl-as-PVR a valid use case of curl'ing -- how about presenting the text overlay as a text/html/RSS feed? Mix it at the client for those that want video, or show it as an RSS stream for text w/images?

Not to mention that when we demand a functional JS parser and DOM tree just to get at the content, text-to-speech and many other things (including building a search engine!) become much harder. For very little (I'd say no) gain.


> A plain HTML site is accessible, and will be accessible in a 1000 years. A site depending on (external) JavaScript sources will force John Titor to travel back in time to find version 1.x of jQuery.

Not to sound completely apathetic, but so what? Most of us aren't building sites that we expect to be around in 10 years, much less 1000 years. The ephemeral nature of what we're building isn't lost on us - we're trading that guaranteed longevity for an improved development process (though some obviously disagree).

Frequently, writing a traditional website with any sort of meaningful UI interactions was/is kind of a mess. Most of us don't write these applications (and you're right, they are applications) because we have any particular affinity for JavaScript, but because it makes the whole process much nicer. It still sucks, it's just nicer.

Sure, progressive enhancement is a thing. And it's a great idea. In practice, top-down directives will probably be something akin to "Sure, do that, but do it on your own time and not at the expense of anything else." The realized benefits are very low (the % of users with JavaScript disabled is incredibly small), and saying something like "our site won't be accessible in 1000 years otherwise" is likely to get you mostly blank stares. It's a pretty big investment with very little benefit to most companies.

Sure, 50 years down the road if these sites still exist they'll probably be nigh-unusable without some sort of "ES6 emulator mode", but so what? I don't think we'll go wanting for any historical artifacts from this time period. If we do, it'll be because future generations have no interest in our generation - not because we didn't produce enough relics.


Arguably, the reason the web exploded in the first place is that the architectural principles behind it were intentionally constrained to enable 50+ years of sustainability and recombination for apps built within its architecture.

This isn't so much about plain-jane HTML pages (useful as they are, since they have a simple interaction model that many understand and enjoy). It's more about using and exposing data in a visible manner (known formats and semantics) and hyperlinks, rather than a single-page app with opaque data. This gives you network effects.

Think about the minor uproar over hash-bang URLs around 5 years ago, Twitter being the primary offender. That was single-page-application orientation rather than hyperlink orientation. There is a reason they've moved away from that.

In the 90s, Google or Yahoo was just something students did with the links that were out there - and that eventually generated hundreds of billions in value because of network effects and the visibility of the information in HTML (i.e., they could apply algorithms like PageRank to it).

The point of the web architecture is that it enables serendipity. Most anyone who has had massive success in business will explain the role of luck, serendipity, and network effects in their rise.

Designing a web app for today's paycheck by closing it off behind a WebSocket + JavaScript mess eliminates a proven avenue for network effects. Sometimes that might be OK, but it's unnecessarily limiting for many kinds of ventures.


I think this viewpoint is too limited.

20 years ago a webpage was just text, but it has evolved into so much more.

I'd be OK with a data site rendering everything from a set of JSON files. There is more legitimacy in having the presentation done in static HTML.

The same would go for sites mixing different information sources (Twitter, RSS, etc.). You can do the data fetching server-side, but the user might prefer having it done client-side for one reason or another (transparency, for instance).

These kinds of sites would still be purely informative, and yet having them heavily use JS makes sense.

I could think of many more situations where generating HTML from a different format on the client side is the right way to go. Horses for courses.


90% of everything served as JSON can be served as semantic HTML and then manipulated with roughly the same amount of code required to manipulate JSON. Yes, JSON navigation is "built in". However, HTML has incredibly powerful CSS queries which allow you to manipulate hierarchical data with minimal fuss.
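
As a rough illustration of that point in Python (BeautifulSoup is my arbitrary choice of a CSS-selector-capable parser, and the markup is made up):

    # pip install beautifulsoup4
    from bs4 import BeautifulSoup

    html = """
    <ul id="articles">
      <li class="article"><h1>First post</h1><span class="date">2015-05-12</span></li>
      <li class="article"><h1>Second post</h1><span class="date">2015-05-13</span></li>
    </ul>
    """

    soup = BeautifulSoup(html, "html.parser")
    # One CSS query walks the same hierarchy a JSON consumer would navigate by hand.
    for item in soup.select("#articles li.article"):
        print(item.select_one("h1").get_text(), item.select_one("span.date").get_text())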


I understand that everything in JSON could also be represented in another format. But if your master data is in JSON, does it always make sense to convert the data to static HTML just for the sake of it? Would you build a server component only for that conversion?

The answer to these questions will depend on your priorities and use case, and the choice can easily be between no site and a JS-rendered site.


You're assuming the same thing you're arguing for. Namely, that JSON should be your base format. I can equally say:

"I understand that everything in HTML could also be represented in another format. But if your master data is in HTML, does it always make sense to convert the data to JSON just for the sake of it? Would you build a separate client-side rendering component only for that conversion?"

There are two non-circular considerations that favor HTML:

1. Web pages still render via the DOM. JSON data has to be transformed; HTML does not.

2. HTML has semantic capabilities. JSON does not.


I am arguing that no one format SHOULD be your base format for data. You'll happen to have one format or another for whatever reason; it could be JSON, XML, CSV, or anything else.

Contrary to that, you are positing that everything should always be based on or converted to plain HTML before being served to a browser. That's the point where we disagree.


I interpret 'master data' as meaning something different: a uniform set of schemes used throughout an enterprise that supports an organization's data stores. Master data is a canonical representation.


CSS doesn't even have a way to select all h1 elements that contain a div with a date class, so I strongly question the assertion that HTML has "incredibly powerful CSS queries" (if you think you are about to prove me wrong with a one-liner, please re-read the phrasing).


You did not provide an example of a production-ready JSON query library for feature comparison.


For those that ignore it, John Titor[0] was a time traveler sent back in time to acquire some obsolete IBM machine which is needed in the future to debug some legacy code.

[0] http://en.wikipedia.org/wiki/John_Titor


You're probably not a native speaker of English, but of Latin. In Latin, `ignorare` can mean `not to know` in addition to its meaning of `not to pay attention to`, but in English, it only has the meaning of `not to pay attention to`.

Vale.


There are most probably no native speakers of Latin. Maybe you meant Spanish?


How about time travellers from the Imperium Romanum?


I'm sure he meant speakers of Latin-derived languages. I see native French speakers make this mistake often in English.


Thanks for this information.


fair enough, assuming you mean "latin derived language" thanks!


Actually I assumed you were a time traveller.


While I cannot speak for anyone else, I rarely ever use JavaScript and my "user experience" is not at all "degraded". The browser I use, along with netcat, tcpclient, etc., does not even support JavaScript.

The only exceptions are when sites force the use of JavaScript. Curiously, these are often sites where money is involved, e.g., banks, merchants, etc. I guess JavaScript makes things safer in those cases? Then one of the popular high-complexity (and high-security, of course) browsers becomes necessary.

Perhaps it is because the type of content I consume is just reading material, listening material or viewing material. Each of which I can usually download and view with a dedicated application, if I so choose.

I used to think JavaScript might become unavoidable for the user, and I spent time thinking about how to accommodate it. I often pondered how search engines would cope with it as well. But over the years I have changed my mind; I do not spend any time worrying about JavaScript as a barrier to content.

JavaScript can be a nuisance for non-interactive www usage but, at least in my experience, with some effort this impediment can be overcome. Whether the "user" is a nerd using nc or Googlebot.

Do you think 1000 years from now there will still be one group of people working to make the www more "interactive" and another group of people working to make the www more "machine readable", the latter undoing the work of the former?


With the way websites work today surely the only possible way to build a search engine is to make something like a headless browser (similar to PhantomJS) that crawls the web like a user, seeing what the user sees, ignoring everything that's hidden from the user, and interpreting the importance of pages like a user would. Just parsing the HTML source of the page won't even get close to seeing the key features of a page any more.

Impressive work by Google to do that at scale, of course, but they'd be dead in the water if they didn't.


But how? I don't know about other people here, but in our company we haven't figured out how to parse (for testing, of course) dynamic websites. All tools, including free tools like Selenium and paid tools like QF-Test, seem not to be able to understand how it works, or our web developers are not able to code dynamic web pages the way they should be coded.


    pip install robotframework-selenium2library
test.txt:

    *** Settings ***
    Library           Selenium2Library

    *** Test Cases ***
    My First Test
        [Setup]    Open Browser    http://google.com    firefox
        Input Text    name=q    Robot Framework
        Click Button    name=btnG
        Wait Until Page Contains Element    ires
        Click Link    Robot Framework
        [Teardown]    Close All Browsers
and then execute

    pybot test.txt
Selenium2 and Robot Framework are pretty neat when it comes to web testing.


We're using CasperJS (basically a simple interface to PhantomJS), works great for us. Have a look here: http://casperjs.org/


I use nightwatch.js ( http://nightwatchjs.org/ ). It's a layer on top of Selenium that makes browser testing a lot more straightforward. If you start with small, straightforward tests and build testable things from there, your code will improve.


The problem is that the tester and the developer aren't the same person. This way the tests are much better at finding expectations the developer didn't have, but it's harder to convince the developers to write more testable code, because they don't know the pains of testing.


The point of testing is not to prove the code doesn't work. It's to prove the code does work. That subtle but important difference is the key to good testing.

Finding a problem with code is useful, but it's extremely limited. You might find 100 bugs, but if there's 101 bugs your product has the potential to fail completely. It's so much more useful to define a framework of things that the code has to do properly and make sure it does do them all properly. To that end, testing should come first - define what the code needs to do, write tests to make sure it does those things (automated unit tests where possible, but at the very least well defined processes for how you make sure it works), and then write the code to actually do it. Any developer who isn't interested in proving their code works, and will continue to work as it becomes more complex, is a terrible developer.

tl;dr If you want to fix testing don't write any code until you know how you're going to test it works.


WebDriverWait and expected_conditions.presence_of_element_located are your friends. Here's some sample code from our automation project: https://gist.github.com/danielsamuels/a39e0fef4e15d2ab04b5
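
For anyone who hasn't used those, a minimal sketch of the pattern (the URL and element ID are placeholders, not taken from the linked gist):

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    driver = webdriver.Firefox()
    try:
        driver.get("http://example.com/")           # placeholder URL
        # Block until the JS-rendered element appears, or give up after 10 seconds.
        element = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.ID, "dynamic-content"))
        )
        print(element.text)
    finally:
        driver.quit()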


You should look into Phantom.js or Zombie.js.


Google made chrome, so I'm sure they know how to render webpages, and execute JavaScript. It's all mostly open source so you could use it too.


Yes, Google made Chrome. Keep in mind that they created Googlebot, and Chrome is really only a "slimmed down" version of Googlebot. Chrome came about because of Googlebot.


Or maybe they did it to get full access to the users (having access to your customers only through other providers sucks), get a more stable web experience (which increases the number of web users overall and thereby the number of Google ad viewers), and have more power in the definition of web standards?


I don't think this is true at all.


We wrote our scraper to use PhantomJS via the selenium.webdriver interface in Python 2, simply because for something like 80% of the sites we extract information from, the data was not fully available unless we could render the dynamic parts of the page. I am not at all surprised that Google's bot is executing JS. I have assumed they could do this for years now.

As for pure HTML front-ends, I understand the attraction, but when a single JS-based implementation gets you consistent behavior and presentation across all browsers and mobile devices, the advantages are pretty huge.


I've long thought that the need for a high performance sandboxed JavaScript VM was the real impetus for Google's investment in v8, and that Chrome was just a useful opportunity to leverage it and to get external contribution. Is there any evidence that this is the case?


Unlikely. I was using SpiderMonkey to execute JavaScript in Google's indexing pipeline long before I had heard about v8, and I doubt Lars had me in mind when he started on v8. Of course, I teleconferenced with Lars before Chrome was released, but SpiderMonkey was still the indexing system's JavaScript interpreter on Chrome's go-live date.


Interesting. Thanks for clarifying that!


I wonder how Google indexes a page that inserts an element into the DOM 120 seconds after the page has loaded, using setTimeout().


They probably don't care about that content.

My first guess would be that they snapshot the DOM in the JS tick immediately after window.onload completes. Maybe they have a short pause to let any fast timeouts or callbacks complete, but there's got to be a cutoff at some point (e.g. to stop an infinite wait for pages that continuously update a relative date). Of course, with their own JS engine, I bet they can get really fancy with the heuristics to determine when to take that snapshot.


Actually, we did care about this content. I'm not at liberty to explain the details, but we did execute setTimeouts up to some time limit.

If they're smart, they actually make the exact timeout a function of an HMAC of the loaded source, to make it very difficult to experiment around, find the exact limits, and fool the indexing system. Back in 2010, it was still a fixed time limit.

Source: executing JavaScript in Google's indexing pipeline was my job from 2006 to 2010.


What about AJAX? Does it load/read/index data after the fact?


A better example would be "how does Google index a page where an element changes sufficiently slowly that a user would see it post-change but a faster-than-real-time script would have a different experience"

Think of a long page where you need to read a few things as you scroll. You could game Google by timing it so that most humans would see content x but any script that ran at an unnatural speed would see y.


Maybe they trigger timers immediately


And how about server-side state?

Imagine you use setInterval to load a new paragraph from a server, and the server only provides a new paragraph 1200ms after the first poll?


Each time we re-analyzed the page, we got more data and more URLs for the crawler to grab and have waiting for us the next time we analyzed that particular page. Of course, session-dependent content would get badly messed up, but that generally doesn't make useful (or at least repeatable) search results anyway.

Source: I primarily did JavaScript execution for Google's indexing pipeline 2006 to 2010.


Waiting for all one-time async operations to finish is not infeasible. Async loops (like pinging the server for possible updates) are a tad harder.


Seriously, they don't have to. That kind of timeout makes the DOM element irrelevant.


There is no doubt Google continues to get better at indexing client side rendered HTML but it is not perfect and indexing is not the same as ranking high in organic search. For ranking, there are distinct advantages to server rendering. The biggest one is consistent initial page load performance. Long story short, if you care about ranking and not just indexing, you still need server rendering.


For people wondering about Ajax requests, Googlebot is performing them very well together with SVG rendering.

For example this URL:

https://www.chemeo.com/predict?smiles=CCCC

performs the drawing of the molecule using RaphaelJS, then pulls the corresponding molecule from the database using Ajax and updates the page. Googlebot performs all of these steps perfectly well and, in the end, indexes the page.

It is very annoying because this is not important in our case; what we want is good indexing of the main data pages, not these pages... I do not want to block the bot yet, but I need to figure out a way to have the main pages ranked better.


However, if you search on Google:

    "Property Prediction for Butane" site:https://www.chemeo.com
You'll see this page is not indexed.


Interesting; it looks like they massively dropped these pages from the index (or at least from what they return as results). This is great (as long as they are not dropping other important pages)!


You could add an additional URL parameter via pushState and ensure that you're pointing the canonical tag only at the main data pages. You could also define the new parameter in Webmaster Tools and tell Googlebot to ignore it.


Now, what kinds of V8 vulnerabilities can we exploit to get inside Google? Said every intelligence agency everywhere.


From 2006 to 2010, my primary role at Google was JavaScript execution in the indexing pipeline. I knew I was likely executing every known JavaScript engine exploit out there plus a good number of 0-days, and ran the javascript engine in a single-threaded subprocess with a greatly restricted set of allowed system calls.

Certainly the right combination of kernel zero-days and JS interpreter exploits could be used to take over the machine, but it would be non-trivial.


> ran the javascript engine in a single-threaded subprocess with a greatly restricted set of allowed system calls.

You were trying to sandbox the JS engine rather than using disposable VMs?


> Said every intelligence agency everywhere.

Not everywhere. The ones in the USA already have free access to anything in Google. In fact, Google is a part of their network.


I'm just waiting for the first security researcher to exploit the googlebot.


My primary role at Google from 2006 to 2010 was executing JavaScript in the indexing pipeline (not exactly in Googlebot, but close enough). I knew I was executing probably every known exploit out there, plus a lot of 0-days, and took lots of precautions (a single-threaded subprocess with a very restricted list of allowed syscalls, etc.). It's not perfect, but breaking out of the sandbox would require a kernel 0-day in the subsystem used by our sandbox, plus a JS engine exploit.


I think it goes without saying that no system is 100% safe. ;)


How do you think the decision is made to insert a safebrowsing interstitial?


Any idea how Google digests/consumes web pages while crawling? Does it take out all the HTML and store just the plain text? If this is the case, can you share some more info on how they are doing it?

I think there is no way they are going to scrape the websites, as there are millions of them, each with its own structure.


I get rendering HTML from JS, handling timeouts, even infinite scroll. But how in the world are they handling onMouseOver, and other mouse events? My best guess so far is reversing the code from the document.location events.


I think you are misunderstanding how this works. Google isn't "handling" any events at all; your webpage is. Google is instead the source of those events - it is simulating the role of a user.

So the bot loads your webpage into a headless browser and sends it a series of events to simulate a user interacting with it, and waits for navigation requests.

There is probably a whitelist of simulation behaviors:

  * mouseover, then click each <a> node
  * mouseover every pixel
  * mouseover, then change every <select> node
  * mouseover, then click every <button>
  etc...
Caveat: though I worked at Google when this work was being done, I was on a different team and don't have any inside knowledge - just speculating on an approach that would make sense.
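
To make that concrete, here's a rough sketch of the "simulate a user" idea using off-the-shelf tooling (Selenium, purely for illustration; whatever Google runs internally is certainly different, and the URL is a placeholder):

    from selenium import webdriver
    from selenium.webdriver import ActionChains
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    driver.get("http://example.com/")               # placeholder URL

    # Fire a mouseover on every link, the way a crawler probing for JS-generated
    # navigation might, then collect whatever hrefs exist afterwards.
    for link in driver.find_elements(By.TAG_NAME, "a"):
        ActionChains(driver).move_to_element(link).perform()

    discovered = {a.get_attribute("href") for a in driver.find_elements(By.TAG_NAME, "a")}
    print(discovered)
    driver.quit()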


Any idea whether this affects (randomized) A/B testing? I think that in the past, Google has simply ignored the dynamic test changes to the site's content. Now I'm not quite sure anymore.


It's important to have deterministic execution of JavaScript for duplicate elimination. Getting several identical pages (with different URLs) in your search results is a really bad user experience.

As of when I left Google in 2010, the JavaScript random number generator always returned 0.5 (and some SEO figured it out and blogged about it, so no secrets here). However, I was trying to convince my manager to let me instead seed a random number generator with an HMAC of all of the currently loaded HTML and JavaScript, to make it deterministic but hard to game by displaying something that users would see 1 time in a million but Google's indexing system would see 100% of the time.
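
A tiny sketch of that seeding idea in Python (illustrative only; the key and hash choice are mine, not anything Google used):

    import hmac, hashlib, random

    SECRET_KEY = b"indexing-secret"                 # hypothetical key

    def page_rng(loaded_html_and_js):
        # Same page content -> same seed -> the same "random" values on every render,
        # but without the key an SEO can't predict or steer the sequence.
        digest = hmac.new(SECRET_KEY, loaded_html_and_js, hashlib.sha256).digest()
        return random.Random(int.from_bytes(digest, "big"))

    rng = page_rng(b"<html>...</html><script>...</script>")
    print(rng.random())                             # deterministic for this page's content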


If we can execute Javascript on GoogleBot, could it possibly be hacked/broken somehow? Surely there must be security vulnerabilities in its Javascript engine.


It would be interesting to see how client-side templating affects SEO, especially now that we understand that JS is indeed executed by the crawler.


I wonder if it does Ajax. Then you could do some kind of weird recursion where it does a request to google for itself.


Google can even read frames! (sarcasm) It makes all other search engines look bad in comparison though ...


Just remember that there is life outside of Google and by brushing it off you're stifling further web innovations.

Also, I wonder how Google handles security while executing random JS code. It's one thing to hack into a single browser. It's another thing to hack into a crawler. Think of all the possibilities.


Searchengineland hasn't tested AJAX, as the author wrote in the comments: "That's a great question! Our test was to programmatically insert text where we wanted into the DOM, but not as a server side transaction, like AJAX."


I read another blog by someone who tested that (I don't have the URL, but it was easy to find), and their conclusion was that the crawler won't wait for any Ajax request to finish to let you render that content. If you want to render with JavaScript, you need to make that data part of the initial payload and render it during onload.



