Googlebot has done various amounts of JS parsing/execution for a while now. They've also issued similar webmaster guidelines in the past (e.g., don't use robots.txt to block crawling of scripts and styles).
From the 2012 article: "Google is actually interpreting the Javascript it spiders. It is not merely trying to extract strings of text and it does appear to be nuanced enough to know what text is and is not added to the Document Object Model."
I was surprised; it's kind of old news. There is useful JS to look at when you're crawling a web page. It's also a useful way to detect malware on a page.
Google has understood pages that have had only client-side rendered content for years. I worked mostly on JavaScript execution in the indexing system from 2006 until 2010, as well as other "rich content" indexing.
Certainly somewhere in the 2008 to 2009 timeframe we saw that the Chinese-language version of the Wall Street Journal had a lot of pages in their archive where all of the non-boilerplate content was rendered via JavaScript. Since they didn't do this with their English content, it didn't seem to be an attempt to hide content from search engines; it was much more likely a workaround for an older browser that wouldn't render Unicode properly in static HTML but whose JavaScript engine would. Sometime in that same 2008 to 2009 timeframe, Google's indexing system started understanding text that was written into documents from JavaScript body onload handlers, and the Chinese Wall Street Journal archive content was exhibit A in my argument that my changes should be turned on in production.
I'm sure they've increased the accuracy of the analysis since, but they've certainly been able to index content written by JavaScript for something like 5 years now.
Edit 2: Since the 2008-2009 timeframe, Google also notices when you use JavaScript to change a page's title. I caused a crash in Google's indexing system when I made a bad assumption about Google's HTML parser's handling of XHTML-style empty title tags <title/> and tried to construct negative-length std::strings from them. When your code runs on every single webpage that Google can find, you're certain to hit corner cases you didn't anticipate. I did test for empty <title></title>, but not <title/>, and made incorrect assumptions about the two pointers I'd get to the beginning and end of the title.
Google seems to be ignoring the title changes our JS makes. Should we not have the title tag there in the first place and then add it in when the title is known?
There were a lot of caveats that both reduced fidelity and would have made announcements confusing back when I was at Google. Also, if one mentions a limitation in an announcement, much of the Search Engine Optimization community will still be citing that announcement more than six months later. So it's difficult and potentially counter-productive to make an announcement with lots of caveats.
V8 and Chrome weren't even a glimmer in Google's eye back in 2006, so I hope they've largely replaced the code I was working on with something based on Chrome. As late as 2010, the DOM was a completely custom implementation that looked somewhat like Firefox, but with enough IE features to fool lots of other pages that would otherwise change their content to "You must run IE to view this page". (On a side note, as much as many people would like to see such pages heavily penalized and indexed as if the IE-only message were their only information, some of those pages are unique sources of invaluable information and users wouldn't be well served by such harsh treatment.)
Everyone seems to be missing the point of this announcement. They are just saying that they are now adding a feature in Webmaster Tools to show you how they see your site so you can diagnose problems.
Perhaps you haven't been privy to any of the discussions about "what are the SEO implications of SPAs?" Up until now, people were pursuing isomorphic JavaScript so they could render both server-side and client-side and the search engines would see their content.
This "just a tool" lets webmasters see how google sees their pages, which means SPAs are becoming safe for SEO. That tool plus the song and dance about understanding the modern web is a pretty strong signal in an industry that operates heavily on rumors.
Yeah, I remember about 2 years ago reading a blog post that analysed the origin of requests to an outdated API. Turns out, Google had crawled the site and was executing JS 17 (or so) days later.
Hopefully this news is about them improving that two-week lag so that more modern websites have a better chance of being indexed.
Within hours of being indexed, the JavaScript on the page would have been executed and the links extracted. Presumably these particular extracted links didn't look particularly interesting to the crawl scheduling algorithm, and sat in the crawl scheduling queue for 16 or 17 days.
Source: I worked mostly on JavaScript execution in Google's indexing system from 2006 to 2010. I don't know the intricacies of the crawl scheduling, but I do know it's highly non-trivial.
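Purely as an illustrative toy (not Google's actual scheduler), you can picture extracted links sitting in a priority queue like this, where low-priority URLs may wait a long time before being fetched:

```typescript
// Toy illustration of a crawl-scheduling queue like the one described above:
// extracted links wait with a priority score, and uninteresting URLs can sit
// for many days. The scoring and fields are invented for the example.
interface CrawlTask {
  url: string;
  priority: number;   // higher = crawl sooner
  discoveredAt: Date;
}

class CrawlQueue {
  private tasks: CrawlTask[] = [];

  add(task: CrawlTask) {
    this.tasks.push(task);
    // Highest priority first (fine for a toy; a real system would use a heap).
    this.tasks.sort((a, b) => b.priority - a.priority);
  }

  next(): CrawlTask | undefined {
    return this.tasks.shift();
  }
}

const queue = new CrawlQueue();
queue.add({ url: "https://example.com/popular", priority: 0.9, discoveredAt: new Date() });
queue.add({ url: "https://example.com/obscure-js-link", priority: 0.1, discoveredAt: new Date() });
// The obscure link may wait a long time before next() ever returns it.
```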
I think the difference is that they are officially announcing that they will do this at scale. Previously it was only us geeks that knew this was happening, and only for some sites.
> It's always a good idea to have your site degrade gracefully. This will help users enjoy your content even if their browser doesn't have compatible JavaScript implementations. It will also help visitors with JavaScript disabled or off, as well as search engines that can't execute JavaScript yet.
I'm glad that they included this.
I get that Javascript is required to make certain sites work the way they do, but I'm appalled by the number of sites that require Javascript just to display static text.
Google themselves are guilty of this. Google Groups is (for the most part) just an archive of email mailing lists, but try reading a thread on Google Groups with Javascript disabled![0]
There are very few sites that cannot gracefully downgrade to at least some degree, and there are very good reasons for doing so. A major one is that AJAX-heavy sites tend not to perform well on slow connections[1] (again, assuming essentially static content here). If you want your users to be able to access your site on-the-go, graceful degradation is your friend.
[0] It's especially ironic now that Google Groups is the only place to read many old Usenet archives going back as far as the early 1980s.
[1] Try browsing Twitter on a slow connection (i.e., tethered, or an "Amtrak wifi" level connection). For a website that originated as a way to send messages over SMS, and is still used that way in other parts of the world, it degrades amazingly poorly over slow connections.
This is a worrying trend I've been noticing as well. The last couple of years especially, I've noticed a very large increase in people just not caring about graceful degradation, as well as people not even testing in other browsers. I've had many conversations with huge fans of Angular, Backbone, and similar frameworks, and when I mention people without JavaScript I just get the canned "but everyone has JavaScript turned on anyway, and if they don't, too bad" response. Interestingly enough, every one of them was also someone who developed and tested only against Chrome and never bothered to acknowledge other browsers. I know a developer who actually likes IE and uses it as their main browser, and for years they've dealt with breaking bugs on sites like GitHub and other popular tech sites because people just don't test in other browsers any more.
The cynic in me wants to say that this mentality is pushed by companies like Google because no JavaScript means no spying. But honestly I think it really just comes down to laziness. So few people truly care about their craft.
Out of curiosity, what form of "spying" do you think is enabled by Javascript that would not be possible without Javascript?
Edit: And since I replied to a small part of your comment, I should say that I disagree completely with your "few people truly care about their craft" statement. At least, I think that writing code that handles a lack of Javascript is only valuable if you have enough users to justify it. i.e. if you spend 20% of your time working on features for 0.1% of users, then you are doing a disservice to the rest of your users. Even more so if you have to compromise the experience for everyone else such that degrading is an option.
In some cases, you go out of your way to accommodate small fractions of your audience. ARIA and catering to those with disabilities is a good example. But turning off JS is a choice; one I respect, but feel no obligation to cater to. I think pages should show a noscript warning, but other than that, it's a matter of engineering tradeoffs.
Some analytics companies track mouse movements to watch how people interact with web pages. They can also use JavaScript to fingerprint browsers beyond what is available with cookies.
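As a rough illustration of the kind of tracking JavaScript enables (the endpoint URL and payload shape here are made up):

```typescript
// Sketch of client-side mouse-movement tracking: sample cursor positions
// and periodically post them to a hypothetical analytics endpoint.
const samples: Array<{ x: number; y: number; t: number }> = [];

document.addEventListener("mousemove", (e) => {
  samples.push({ x: e.clientX, y: e.clientY, t: Date.now() });
});

// Flush a batch every few seconds; sendBeacon survives page unloads.
setInterval(() => {
  if (samples.length === 0) return;
  navigator.sendBeacon("/analytics/mouse", JSON.stringify(samples.splice(0)));
}, 5000);
```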
> but I'm appalled by the number of sites that require Javascript just to display static text.
I encounter these (a lot of them blogs: a perfect example of static text, maybe images, that should be readable with just about any browser) when searching via Google, and the text-only cache option tends to be quite useful for getting the text that I want to read. If that doesn't show the content, then I go back; there are plenty of other sites out there, and if you don't make it easy to read your content I'll just go somewhere else where I can find the same thing.
The problem is, it is not easy to degrade gracefully. Users have become so accustomed to pages enhanced with JS that it is not possible to avoid it even for the simplest sites. If you want to degrade gracefully, it is going to take a lot of effort nowadays. Of course, some newer tech like isomorphic rendering (Airbnb's Rendr, React's renderToString) helps with this, but again, it is not as easy as it sounds.
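For example, here's a minimal sketch of the renderToString approach, so the initial HTML response already contains the content; the Express route and component are hypothetical:

```typescript
// Sketch of server-side rendering with React's renderToString, so crawlers
// (and non-JS clients) receive real HTML in the initial response.
import express from "express";
import React from "react";
import { renderToString } from "react-dom/server";

// A plain component that needs no browser APIs (hypothetical example).
const ProductPage = (props: { name: string }) =>
  React.createElement("h1", null, props.name);

const app = express();

app.get("/products/:name", (req, res) => {
  // Render the same component on the server that the client would render.
  const html = renderToString(
    React.createElement(ProductPage, { name: req.params.name })
  );
  res.send(`<!doctype html><html><body><div id="root">${html}</div></body></html>`);
});

app.listen(3000);
```

The client can then mount the same component over the server-rendered markup, which is the "isomorphic" part of the approach.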
I was under the assumption that Googlebot already used a headless Chrome to index websites for some time.
Google used Chrome to generate the page preview pictures (at index time) that highlighted your search terms on mouse-over (that feature seems to be gone now). Back then (~2 years ago), websites that display your user agent showed the Chrome user agent in the preview picture.
I've heard speculation that this is a large reason why Google invests so heavily in Chrome. When developers make their sites work with Chrome, they end up making their sites work nicely for the Googlebot without even realizing it.
If the Googlebot used a web engine that wasn't found in a common browser, then they would have to pay the costs to make it compatible with all the web pages out there. Instead, they push that cost onto developers targeting Chrome. All the work to make websites work right in Chrome also makes them work for the Googlebot, effectively standardizing the web onto something that the Googlebot knows how to process.
I always thought this was the secret reason Chrome was built. Build a better Googlebot and then, wait a minute, why not just release an awesome browser to get more people using our product at the same time? Forked.
Technically speaking, Google built the new Googlebot and then realized that they could release parts of Googlebot to the public in the form of a web browser.
I was glad when instant previews were retired. I would always somehow accidentally mouse over the preview area and get stuck with a random site showing up on half my screen. Terrible UX.
"It's always a good idea to have your site degrade gracefully. This will help users enjoy your content even if their browser doesn't have compatible JavaScript implementations. It will also help visitors with JavaScript disabled or off, as well as search engines that can't execute JavaScript yet."
Are Google going to follow their own advice here? Try visiting the official Android blog with JavaScript disabled.
In fact, try visiting a whole bunch of *.blogspot.co.uk sites with JavaScript disabled and see how "gracefully" they degrade. Remember, these are blog sites with mostly text content. And yet Google won't serve them up without JavaScript enabled.
SEO has been a major factor in my reluctance to implement client-side JS frameworks in many projects (we work with wedding photographers and other small businesses who live and breathe being found in Google). It actually seems harder to optimize a JS-heavy site than a Flash site (which we used to sell a lot of).
If Google can actually index and rank a business's website that is, say, pure BackboneJS, that would be awesome. But I'd like to see it in the wild before trying to sell something like that.
AirBnB went to a lot of trouble to make their Backbone app render on both the client and server-side, without need for building different views for each: http://nerds.airbnb.com/weve-open-sourced-rendr-run-your-bac... . This is a pretty awesome solution for those using Node.
"Adobe is working with Google and Yahoo! to enable one of the largest fundamental improvements in web search results by making the Flash file format (SWF) a first-class citizen in searchable web content. Google uses the Adobe Flash Player technology to run SWF content for their search engines to crawl and provide the logic that chooses how to walk through a SWF."
Edit: the parent commenter edited/changed his text quite a bit; originally it was about Flash content.
Google engineer Ran Adler (sometimes he spells his first name Ron to avoid confusing English speakers) deserves most of the credit here. He went to Adobe with a proposal for the hooks into the Flash player that he needed and worked with their engineers to get those hooks working. I don't doubt there was a fair amount of work on Adobe's side, too, but it wasn't like Adobe had the technology in place before Ran started working with them.
Ran came up with an API for the hooks he needed that didn't give away too much of the most clever parts of what he was doing. The belief was that Google would get the hooks it wanted and in return, Adobe could share the special Flash player with other major search engines and everyone would be indexing Flash content. The hope was that Google would just be doing it a bit more cleverly than the competition. I'm not sure if any of the other major search engines ever used the hooks Ran designed.
Source: I worked on Google's rich content indexing team from 2006 to 2010. Ran worked mostly on Flash indexing and I worked mostly on JavaScript indexing.
I think the important thing to keep in mind is HOW Google ensures that their understanding of Javascript helps them improve the Search Experience.
If they crawl your page with JavaScript enabled and find that after a hover event a button appears, and after a click on that button a modal appears, and that modal has content about BLUE WIDGETS, they are still NEVER going to rank that URL for "BLUE WIDGETS".
Google wants to send users searching for "BLUE WIDGETS" to a page where content about "BLUE WIDGETS" is instantly visible and apparent.
I was at ng-conf in January and asked the Angular team a question about improving SEO. Without going into any detail, they hinted at the idea that very shortly it would no longer matter. Honestly I'm kind of surprised it took this long.
Just out of curiosity, how is this possible? If the web server can't handle being crawled...how can it handle serving web pages?
"If your web server is unable to handle the volume of crawl requests for resources, it may have a negative impact on our capability to render your pages. If you’d like to ensure that your pages can be rendered by Google, make sure your servers are able to handle crawl requests for resources."
Try to equate being crawled with the /. effect. The site can serve web pages, but not at scale. So if the Googlebot comes along to index your site, the web server may fail under load. Perhaps you're on a co-hosting plan and your provider suspends your site because the crawling has put you over one of the limits on your plan.
I wonder how Google's crawlers handle pages like this. A webpage I viewed recently loaded an entire database query result into my web browser, and then I made queries locally to sort it, which kind of sucked since 20MB of information is a lot.
I figure Google has to have some form of safeguard against this: either a CPU or network bandwidth limit, and likely a time limit too.
In my head I'm picturing a crawler locked in a loop of forever querying random google search results and adding them to a page.
Across many requests, Google has a well-established "bandwidth/time" calculation for each domain/subdomain that factors in domain importance, number of pages, frequency of page updates, and even input from webmasters via Webmaster Tools. The bot will just stop requesting pages after that ratio is reached. (This is why new domains that publish LOTS of pages at once may take a while to get crawled.)
Across individual requests, they have distanced themselves from setting a concrete limit to request sizes[1] (they used to only cache the first 100Kb, then people saw them caching up to 400Kb, now they definitely index things like PDFs that are much larger.)
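Purely as an illustration of that kind of "bandwidth/time" budget (the formula, weights, and field names here are invented, not Google's):

```typescript
// Hypothetical sketch of a per-domain crawl-budget heuristic along the lines
// described above. All weights and fields are made up for illustration.
interface DomainStats {
  importance: number;           // e.g. normalized link-based score, 0..1
  pageCount: number;            // known URLs on the domain
  updateFrequency: number;      // average updates per page per day
  webmasterCrawlLimit?: number; // optional cap set in Webmaster Tools (requests/day)
}

function dailyCrawlBudget(stats: DomainStats): number {
  // More important, larger, faster-changing domains get more requests per day.
  const base = 100 * stats.importance;
  const size = Math.log10(1 + stats.pageCount) * 50;
  const freshness = Math.min(stats.updateFrequency, 10) * 20;
  const budget = base + size + freshness;
  // Respect an explicit limit from the webmaster, if any.
  return stats.webmasterCrawlLimit !== undefined
    ? Math.min(budget, stats.webmasterCrawlLimit)
    : budget;
}
```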
Obviously the claim probably isn't literally true, but they could certainly share a JS engine (V8) at the very least, and the idea that the motivation behind developing that JS engine may have been spidering JS-dependent pages doesn't seem too far-fetched.
Actually, they built Googlebot first, and then took parts of it to build Chrome. Googlebot is Chrome and Chrome is Googlebot (minus a few features like crawling).
Makes total sense. When I was researching Angular, something that didn't make any sense was Google not being able to crawl Angular websites. Google makes Angular you see, hence my confusion.
This instantly puts me back on the Angular hunt, as now I don't have to pay for a service as ridiculous as 'static page SEO'.
> Google makes Angular you see, hence my confusion.
The thing is, Google is not a single entity, just like any other big corp. So there will be teams doing things differently, even in conflicting ways. Google is not using Angular for most of its sites; I think Closure is more popular there.
You can do static page SEO with JS using libraries that allow isomorphic processing; see Rendr, React, etc.
Google makes many things with different purposes, and it doesn't usually hold back on shipping them just to wait until every possible integration with other Google efforts is done. If it did, it would never ship anything.
If a "static page SEO" service emits something remotely useful to Googlebot, that's an indication you should have used progressive enhancement from the start.
Interesting. I wonder if some changes will need to take place to make sure the client-side tracking services (e.g., Mixpanel, Kissmetrics, Google's own GA) ignore Googlebot.
This is great news. However, will you be able to feed the Googlebot a script to simulate user interaction and download all of the ajax content that normally needs clicks or mouseovers?
But how does it know when to take the snapshot? A clicked link can do all sorts of asynchronous things and Google is unaware when the new rendering is "finished".
When there are no outstanding HTTP requests, DOM events, CSS transitions, or setTimeouts, the page can be assumed to be rendered. Not all pages will enter this state, so some heuristics are likely used.
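A rough in-page sketch of that kind of quiescence heuristic, tracking only XHRs and short timers (a real renderer also has to consider DOM events, CSS transitions, fetch, and much more):

```typescript
// Count outstanding async work and declare the page "rendered" once nothing
// has been pending for a short grace period. Purely illustrative.
let pending = 0;

// Count in-flight XHRs.
const originalSend = XMLHttpRequest.prototype.send;
XMLHttpRequest.prototype.send = function (body?: Document | XMLHttpRequestBodyInit | null) {
  pending++;
  this.addEventListener("loadend", () => pending--);
  return originalSend.call(this, body);
};

// Count work scheduled via short timers (long timers and extra args ignored).
const originalSetTimeout = window.setTimeout.bind(window);
(window as any).setTimeout = (fn: () => void, ms = 0) => {
  if (ms > 5000) return originalSetTimeout(fn, ms);
  pending++;
  return originalSetTimeout(() => {
    pending--;
    fn();
  }, ms);
};

// Resolve once nothing has been pending for 500 ms after the load event.
function whenQuiescent(): Promise<void> {
  return new Promise((resolve) => {
    const poll = () => {
      if (pending === 0) {
        originalSetTimeout(() => (pending === 0 ? resolve() : poll()), 500);
      } else {
        originalSetTimeout(poll, 100);
      }
    };
    window.addEventListener("load", poll);
  });
}

whenQuiescent().then(() => {
  // A crawler could snapshot document.documentElement.outerHTML here.
  console.log("page looks rendered");
});
```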
What you are describing is not possible with, for example, Phantom. HTTP request observation is, but AFAIK the others are not. I know it's likely that Google has something far more advanced / customized, but just wanted to point this out for anyone thinking about doing this themselves; it's a really hard problem to solve.
"Snapshot" also implies that Google search results are likely to get a lot more visual. The tone of the article made me feel that. Does a 'snapshot' imply imagery in U.S. English?
> Does a 'snapshot' imply imagery in U.S. English?
Not necessarily. For instance, I would talk about a "database snapshot", meaning a "backup of a database at a particular moment in time".
Obviously the word "snapshot" originates from photography, but to me, the connotation is much more "capture an exact copy at a moment" than "make an image".
(Unnecessary pedantry ahead.) Actually, it originates in hunting (fowling) and meant an opportunity shot taken without time to set up or consciously aim, relying on reaction time and trained instinct. In photography, it was a pejorative term applied to amateur photographs taken with essentially the same amount of thought and planning by people who owned those new Kodak gizmos that Mr. Eastman made, and it's pretty much retained that meaning.
My understanding of pjax is that this should be irrelevant. I thought the point of pjax is that you're not generating content with javascript but rather pulling it in from an existing page and inserting it into the current DOM (but the other page that the new content came from also exists at a specific URL, so the site still "works" even without javascript -- it's just a speed boost if you do have js enabled).
I thought the whole point of pjax was that it would also return the content for any URL via a normal GET. At least - that's how I've always implemented it (using django-pjax).
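A minimal sketch of that pjax pattern, where the target URL still serves a full HTML page for non-JS clients and crawlers (the #content container and data-pjax attribute are just example names):

```typescript
// Intercept clicks on pjax-enabled links, fetch the target page, swap in the
// content fragment, and update the URL so it stays shareable and crawlable.
document.addEventListener("click", async (event) => {
  const target = event.target as Element | null;
  const link = target?.closest("a[data-pjax]");
  if (!(link instanceof HTMLAnchorElement)) return;

  event.preventDefault();
  const response = await fetch(link.href); // same URL works without JS too
  const html = await response.text();

  // Parse the full HTML response and pull out just the content container.
  const doc = new DOMParser().parseFromString(html, "text/html");
  const next = doc.querySelector("#content");
  const current = document.querySelector("#content");
  if (next && current) {
    current.replaceWith(next);
    history.pushState({}, "", link.href);
  }
});
```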
I believe you're thinking of FTP, not HTTP. Content is rather more than documents, and spans from animations to interactive presentations, visualisations and demonstrations, games and ... you get the idea ;)
An old trick is to abuse the crawler to generate a huge index for reverse lookups. For example, reverse "72b302bf297a228a75730123efef7c41" back to "banana", since that's md5("banana").
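A tiny sketch of the idea, using Node's built-in crypto module (the word list is just an example):

```typescript
// Publish pages that pair common strings with their MD5 hashes, so a search
// engine that indexes the page becomes a reverse-lookup table: searching for
// a hash turns up its plaintext right next to it.
import { createHash } from "node:crypto";

const words = ["banana", "password", "letmein"]; // example wordlist

const rows = words.map((word) => {
  const hash = createHash("md5").update(word).digest("hex");
  return `<tr><td>${hash}</td><td>${word}</td></tr>`;
});

const page = `<html><body><table>${rows.join("")}</table></body></html>`;
console.log(page); // serve this and let the crawler index it
```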
They're indexing blog posts and content within seconds, not minutes. I've seen my blog posts indexed within seconds and they've been doing that for a few years now.
It's pretty obvious that they need to do this, whether reported or not, in order to handle some very easy spam attacks. E.g., replacing keyword-baiting content with, say, an advert for something totally irrelevant.
2008: http://moz.com/ugc/new-reality-google-follows-links-in-javas...
2009: http://www.labnol.org/internet/search/googlebot-executes-jav...
2011: https://twitter.com/mattcutts/status/131425949597179904
2012: http://www.thegooglecache.com/white-hat-seo/googlebots-javas...