Googlebot has done various amounts of JS parsing/execution for a while now. They've also issued similar webmaster guidelines in the past (e.g., don't use robots.txt to block crawling of scripts and styles).
From the 2012 article: "Google is actually interpreting the Javascript it spiders. It is not merely trying to extract strings of text and it does appear to be nuanced enough to know what text is and is not added to the Document Object Model."
I was surprised; it's kind of old news. There is useful JS to look at when you're crawling a web page. It's also a useful way to detect malware on a page.
Google has understood pages that have had only client-side rendered content for years. I worked mostly on JavaScript execution in the indexing system from 2006 until 2010, as well as other "rich content" indexing.
Certainly somewhere in the 2008 to 2009 timeframe we saw that the Chinese-language version of the Wall Street Journal had a lot of pages in their archive where all of the non-boilerplate content was rendered via JavaScript. Since they didn't do this with their English content, it didn't seem to be an attempt to hide content from search engines; it was much more likely a workaround for an older browser that wouldn't render Unicode properly in static HTML but whose JavaScript engine would. Sometime in that same 2008 to 2009 timeframe, Google's indexing system started understanding text that was written into documents from JavaScript body onload handlers, and the Chinese Wall Street Journal archive content was exhibit A in my argument that my changes should be turned on in production.
I'm sure they've increased the accuracy of the analysis since, but they've certainly been able to index content written by JavaScript for something like 5 years now.
Edit 2: Since the 2008-2009 timeframe, Google also notices when you use JavaScript to change a page's title. I caused a crash in Google's indexing system when I made a bad assumption about Google's HTML parser's handling of XHTML-style empty title tags <title/> and tried to construct negative-length std::strings from them. When your code runs on every single webpage that Google can find, you're certain to hit corner cases you didn't anticipate. I did test for empty <title></title>, but not <title/>, and made incorrect assumptions about the two pointers I'd get to the beginning and end of the title.
Google seems to be ignoring the title changes our JS makes. Should we not have the title tag there in the first place and then add it in when the title is known?
There were a lot of caveats that both reduced fidelity and would have made announcements confusing back when I was at Google. Also, if one mentions a limitation in an announcement, much of the Search Engine Optimization community will still be citing that announcement more than six months later. So it's difficult and potentially counter-productive to make an announcement with lots of caveats.
V8 and Chrome weren't even a glimmer in Google's eye back in 2006, so I hope they've largely replaced the code I was working on with something based on Chrome. As late as 2010, the DOM was a completely custom implementation that looked somewhat like Firefox, but with enough IE features to fool lots of other pages that would otherwise change their content to "You must run IE to view this page". (On a side note, as much as many people would like to see such pages heavily penalized and indexed as if the IE-only message were their only information, some of those pages are unique sources of invaluable information and users wouldn't be well served by such harsh treatment.)
Everyone seems to be missing the point of this announcement. They are just saying that they are now adding a feature in Webmaster Tools to show you how they see your site so you can diagnose problems.
Perhaps you haven't been privy to any of the discussions about "what are the SEO implications of SPAs?" Up until now, people were pursuing isomorphic JavaScript so they could render both server-side and client-side and the search engines would see their content.
This "just a tool" lets webmasters see how google sees their pages, which means SPAs are becoming safe for SEO. That tool plus the song and dance about understanding the modern web is a pretty strong signal in an industry that operates heavily on rumors.
Yeah, I remember about 2 years ago reading a blog post that analysed the origin of requests to an outdated API. Turns out, Google had crawled the site and was executing JS 17 (or so) days later.
Hopefully this news is about them improving that two-week lag so that more modern websites have a better chance of being indexed.
Within hours of being indexed, the JavaScript on the page would have been executed and the links extracted. Presumably these particular extracted links didn't look particularly interesting to the crawl scheduling algorithm, and sat in the crawl scheduling queue for 16 or 17 days.
Source: I worked mostly on JavaScript execution in Google's indexing system from 2006 to 2010. I don't know the intricacies of the crawl scheduling, but I do know it's highly non-trivial.
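Purely as an illustrative toy (not Google's actual scheduler), you can picture extracted links sitting in a priority queue like this, where low-priority URLs may wait a long time before being fetched:

```typescript
// Toy illustration of a crawl-scheduling queue like the one described above:
// extracted links wait with a priority score, and uninteresting URLs can sit
// for many days. The scoring and fields are invented for the example.
interface CrawlTask {
  url: string;
  priority: number;   // higher = crawl sooner
  discoveredAt: Date;
}

class CrawlQueue {
  private tasks: CrawlTask[] = [];

  add(task: CrawlTask) {
    this.tasks.push(task);
    // Highest priority first (fine for a toy; a real system would use a heap).
    this.tasks.sort((a, b) => b.priority - a.priority);
  }

  next(): CrawlTask | undefined {
    return this.tasks.shift();
  }
}

const queue = new CrawlQueue();
queue.add({ url: "https://example.com/popular", priority: 0.9, discoveredAt: new Date() });
queue.add({ url: "https://example.com/obscure-js-link", priority: 0.1, discoveredAt: new Date() });
// The obscure link may wait a long time before next() ever returns it.
```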
I think the difference is that they are officially announcing that they will do this at scale. Previously it was only us geeks that knew this was happening, and only for some sites.
> It's always a good idea to have your site degrade gracefully. This will help users enjoy your content even if their browser doesn't have compatible JavaScript implementations. It will also help visitors with JavaScript disabled or off, as well as search engines that can't execute JavaScript yet.
I'm glad that they included this.
I get that Javascript is required to make certain sites work the way they do, but I'm appalled by the number of sites that require Javascript just to display static text.
Google themselves are guilty of this. Google Groups is (for the most part) just an archive of email mailing lists, but try reading a thread on Google Groups with Javascript disabled![0]
There are very few sites that cannot gracefully downgrade to at least some degree, and there are very good reasons for doing so. A major one is that AJAX-heavy sites tend not to perform well on slow connections[1] (again, assuming essentially static content here). If you want your users to be able to access your site on-the-go, graceful degradation is your friend.
[0] It's especially ironic now that Google Groups is the only place to read many old Usenet archives going back as far as the early 1980s.
[1] Try browsing Twitter on a slow connection (i.e., tethered, or an "Amtrak wifi" level connection). For a website that originated as a way to send messages over SMS, and is still used that way in other parts of the world, it degrades amazingly poorly over slow connections.
This is a worrying trend I've been noticing as well. The last couple of years especially, I've noticed a very large increase in people just not caring about graceful degradation, as well as people not even testing in other browsers. I've had many conversations with huge fans of Angular, Backbone, and similar frameworks, and when I mention people without JavaScript I just get the canned "but everyone has JavaScript turned on anyway, and if they don't, too bad" response. Interestingly enough, every one of them was also someone who developed and tested only against Chrome and never bothered to acknowledge other browsers. I know a developer who actually likes IE and uses it as their main browser, and for years they've dealt with breaking bugs on sites like GitHub and other popular tech sites because people just don't test in other browsers any more.
The cynic in me wants to say that this mentality is pushed by companies like Google because no JavaScript means no spying. But honestly I think it really just comes down to laziness. So few people truly care about their craft.
Out of curiosity, what form of "spying" do you think is enabled by Javascript that would not be possible without Javascript?
Edit: And since I replied to a small part of your comment, I should say that I disagree completely with your "few people truly care about their craft" statement. At least, I think that writing code that handles a lack of Javascript is only valuable if you have enough users to justify it. i.e. if you spend 20% of your time working on features for 0.1% of users, then you are doing a disservice to the rest of your users. Even more so if you have to compromise the experience for everyone else such that degrading is an option.
In some cases, you go out of your way to accommodate small fractions of your audience. ARIA and catering to those with disabilities is a good example. But turning off JS is a choice; one I respect, but feel no obligation to cater to. I think pages should show a noscript warning, but other than that, it's a matter of engineering tradeoffs.
Some analytics companies track mouse movements to watch how people interact with web pages. They can also use JavaScript to fingerprint browsers beyond what is available with cookies.
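As a rough illustration of the kind of tracking JavaScript enables (the endpoint URL and payload shape here are made up):

```typescript
// Sketch of client-side mouse-movement tracking: sample cursor positions
// and periodically post them to a hypothetical analytics endpoint.
const samples: Array<{ x: number; y: number; t: number }> = [];

document.addEventListener("mousemove", (e) => {
  samples.push({ x: e.clientX, y: e.clientY, t: Date.now() });
});

// Flush a batch every few seconds; sendBeacon survives page unloads.
setInterval(() => {
  if (samples.length === 0) return;
  navigator.sendBeacon("/analytics/mouse", JSON.stringify(samples.splice(0)));
}, 5000);
```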
> but I'm appalled by the number of sites that require Javascript just to display static text.
I encounter these (a lot of them blogs: a perfect example of static text, maybe images, that should be readable with just about any browser) when searching via Google, and the text-only cache option tends to be quite useful for getting the text that I want to read. If that doesn't show the content, then I go back; there are plenty of other sites out there, and if you don't make it easy to read your content I'll just go somewhere else where I can find the same thing.
The problem is, it is not easy to degrade gracefully. Users have become so accustomed to pages enhanced with JS that it is not possible to avoid it even for the simplest sites. If you want to degrade gracefully, it is going to take a lot of effort nowadays. Of course, some newer tech like isomorphic rendering (Airbnb's Rendr, React's renderToString) helps with this, but again, it is not as easy as it sounds.
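For example, here's a minimal sketch of the renderToString approach, so the initial HTML response already contains the content; the Express route and component are hypothetical:

```typescript
// Sketch of server-side rendering with React's renderToString, so crawlers
// (and non-JS clients) receive real HTML in the initial response.
import express from "express";
import React from "react";
import { renderToString } from "react-dom/server";

// A plain component that needs no browser APIs (hypothetical example).
const ProductPage = (props: { name: string }) =>
  React.createElement("h1", null, props.name);

const app = express();

app.get("/products/:name", (req, res) => {
  // Render the same component on the server that the client would render.
  const html = renderToString(
    React.createElement(ProductPage, { name: req.params.name })
  );
  res.send(`<!doctype html><html><body><div id="root">${html}</div></body></html>`);
});

app.listen(3000);
```

The client can then mount the same component over the server-rendered markup, which is the "isomorphic" part of the approach.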
I was under the assumption that Googlebot already used a headless Chrome to index websites for some time.
Google used Chrome to generate the page preview pictures (at index time) that highlighted your search terms on mouse-over (that feature seems to be gone now). Back then (~2 years ago), websites that display your user agent showed the Chrome user agent in the preview picture.
I've heard speculation that this is a large reason why Google invests so heavily in Chrome. When developers make their sites work with Chrome, they end up making their sites work nicely for the Googlebot without even realizing it.
If the Googlebot used a web engine that wasn't found in a common browser, then they would have to pay the costs to make it compatible with all the web pages out there. Instead, they push that cost onto developers targeting Chrome. All the work to make websites work right in Chrome also makes them work for the Googlebot, effectively standardizing the web onto something that the Googlebot knows how to process.
I always thought this was the secret reason Chrome was built. Build a better Googlebot and then, wait a minute, why not just release an awesome browser to get more people using our product at the same time? Forked.
Technically speaking, Google built the new Googlebot and then realized that they could release parts of Googlebot to the public in the form of a web browser.
I was glad when instant previews were retired. I would always somehow accidentally mouse over the preview area and get stuck with a random site showing up on half my screen. Terrible UX.
"It's always a good idea to have your site degrade gracefully. This will help users enjoy your content even if their browser doesn't have compatible JavaScript implementations. It will also help visitors with JavaScript disabled or off, as well as search engines that can't execute JavaScript yet."
Are Google going to follow their own advice here? Try visiting the official Android blog with JavaScript disabled.
In fact, try visiting a whole bunch of *.blogspot.co.uk sites with JavaScript disabled and see how "gracefully" they degrade. Remember, these are blog sites with mostly text content. And yet Google won't serve them up without JavaScript enabled.
SEO has been a major factor in my reluctance to implement client-side JS frameworks in many projects (we work with wedding photographers and other small businesses who live and breathe being found in Google). It actually seems harder to optimize a JS-heavy site than a Flash site (which we used to sell a lot of).
If Google can actually index and rank a business's website that is, say, pure BackboneJS, that would be awesome. But I'd like to see it in the wild before trying to sell something like that.
AirBnB went to a lot of trouble to make their Backbone app render on both the client and server-side, without need for building different views for each: http://nerds.airbnb.com/weve-open-sourced-rendr-run-your-bac... . This is a pretty awesome solution for those using Node.
"Adobe is working with Google and Yahoo! to enable one of the largest fundamental improvements in web search results by making the Flash file format (SWF) a first-class citizen in searchable web content. Google uses the Adobe Flash Player technology to run SWF content for their search engines to crawl and provide the logic that chooses how to walk through a SWF."
Edit: the parent commenter edited/changed his text quite a bit; originally it was about Flash content.
Google engineer Ran Adler (sometimes he spells his first name Ron to avoid confusing English speakers) deserves most of the credit here. He went to Adobe with a proposal for the hooks into the Flash player that he needed and worked with their engineers to get those hooks working. I don't doubt there was a fair amount of work on Adobe's side, too, but it wasn't like Adobe had the technology in place before Ran started working with them.
Ran came up with an API for the hooks he needed that didn't give away too much of the most clever parts of what he was doing. The belief was that Google would get the hooks it wanted and in return, Adobe could share the special Flash player with other major search engines and everyone would be indexing Flash content. The hope was that Google would just be doing it a bit more cleverly than the competition. I'm not sure if any of the other major search engines ever used the hooks Ran designed.
Source: I worked on Google's rich content indexing team from 2006 to 2010. Ran worked mostly on Flash indexing and I worked mostly on JavaScript indexing.
I think the important thing to keep in mind is HOW Google ensures that their understanding of Javascript helps them improve the Search Experience.
If they crawl your page with JavaScript enabled and find that after a hover event a button appears, and after a click on that button a modal appears, and that modal has content about BLUE WIDGETS, they are still NEVER going to rank that URL for "BLUE WIDGETS".
Google wants to send users searching for "BLUE WIDGETS" to a page where content about "BLUE WIDGETS" is instantly visible and apparent.
I was at ng-conf in January and asked the Angular team a question about improving SEO. Without going into any detail, they hinted at the idea that very shortly it would no longer matter. Honestly I'm kind of surprised it took this long.
Just out of curiosity, how is this possible? If the web server can't handle being crawled...how can it handle serving web pages?
"If your web server is unable to handle the volume of crawl requests for resources, it may have a negative impact on our capability to render your pages. If you’d like to ensure that your pages can be rendered by Google, make sure your servers are able to handle crawl requests for resources."
Try to equate being crawled with the /. effect. The site can serve web pages, but not at scale. So if the Googlebot comes along to index your site, the web server may fail under load. Perhaps you're on a co-hosting plan and your provider suspends your site because the crawling has put you over one of the limits on your plan.
I wonder how Google's crawlers handle pages like this. A webpage I viewed recently loaded an entire database query result into my web browser, and then I made queries locally to sort it, which kind of sucked since 20MB of information is a lot.
I figure Google has to have some form of safeguard against this: either a CPU or network bandwidth limit, and likely a time limit too.
In my head I'm picturing a crawler locked in a loop of forever querying random google search results and adding them to a page.
Across many requests, Google has a well-established "bandwidth/time" calculation for each domain/subdomain that factors in domain importance, number of pages, frequency of page updates, and even input from webmasters via Webmaster Tools. The bot will just stop requesting pages after that ratio is reached. (This is why new domains that publish LOTS of pages at once may take a while to get crawled.)
Across individual requests, they have distanced themselves from setting a concrete limit to request sizes[1] (they used to only cache the first 100Kb, then people saw them caching up to 400Kb, now they definitely index things like PDFs that are much larger.)
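Purely as an illustration of that kind of "bandwidth/time" budget (the formula, weights, and field names here are invented, not Google's):

```typescript
// Hypothetical sketch of a per-domain crawl-budget heuristic along the lines
// described above. All weights and fields are made up for illustration.
interface DomainStats {
  importance: number;           // e.g. normalized link-based score, 0..1
  pageCount: number;            // known URLs on the domain
  updateFrequency: number;      // average updates per page per day
  webmasterCrawlLimit?: number; // optional cap set in Webmaster Tools (requests/day)
}

function dailyCrawlBudget(stats: DomainStats): number {
  // More important, larger, faster-changing domains get more requests per day.
  const base = 100 * stats.importance;
  const size = Math.log10(1 + stats.pageCount) * 50;
  const freshness = Math.min(stats.updateFrequency, 10) * 20;
  const budget = base + size + freshness;
  // Respect an explicit limit from the webmaster, if any.
  return stats.webmasterCrawlLimit !== undefined
    ? Math.min(budget, stats.webmasterCrawlLimit)
    : budget;
}
```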
Obviously the claim probably isn't literally true, but they could certainly share a JS engine (V8) at the very least, and the idea that the motivation behind developing that JS engine may have been spidering JS-dependent pages doesn't seem too far-fetched.
Actually, they built Googlebot first, and then took parts of it to build Chrome. Googlebot is Chrome and Chrome is Googlebot (minus a few features like crawling).
Makes total sense. When I was researching Angular, something that didn't make any sense was Google not being able to crawl Angular websites. Google makes Angular you see, hence my confusion.
This instantly puts me back on the Angular hunt, as now I don't have to pay for a service as ridiculous as 'static page SEO'.
> Google makes Angular you see, hence my confusion.
The thing is, Google is not a single entity, just like any other big corp. So there will be teams doing things differently, even in conflicting ways. Google is not using Angular for most of its sites; I think Closure is more popular there.
You can do static page SEO with JS using libraries that allow isomorphic processing; see Rendr, React, etc.
Google makes many things with different purposes, and it doesn't usually hold back on shipping them just to wait until every possible integration with other Google efforts is done. If it did, it would never ship anything.
If a "static page SEO" service emits something remotely useful to Googlebot, that's an indication you should have used progressive enhancement from the start.
Interesting. I wonder if some changes will need to take place to make sure the client-side tracking services (e.g., Mixpanel, Kissmetrics, Google's own GA) ignore Googlebot.
This is great news. However, will you be able to feed the Googlebot a script to simulate user interaction and download all of the ajax content that normally needs clicks or mouseovers?
But how does it know when to take the snapshot? A clicked link can do all sorts of asynchronous things and Google is unaware when the new rendering is "finished".
When there are no outstanding HTTP requests, DOM events, CSS transitions, or setTimeouts, the page can be assumed to be rendered. Not all pages will enter this state, so some heuristics are likely used.
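A rough in-page sketch of that kind of quiescence heuristic, tracking only XHRs and short timers (a real renderer also has to consider DOM events, CSS transitions, fetch, and much more):

```typescript
// Count outstanding async work and declare the page "rendered" once nothing
// has been pending for a short grace period. Purely illustrative.
let pending = 0;

// Count in-flight XHRs.
const originalSend = XMLHttpRequest.prototype.send;
XMLHttpRequest.prototype.send = function (body?: Document | XMLHttpRequestBodyInit | null) {
  pending++;
  this.addEventListener("loadend", () => pending--);
  return originalSend.call(this, body);
};

// Count work scheduled via short timers (long timers and extra args ignored).
const originalSetTimeout = window.setTimeout.bind(window);
(window as any).setTimeout = (fn: () => void, ms = 0) => {
  if (ms > 5000) return originalSetTimeout(fn, ms);
  pending++;
  return originalSetTimeout(() => {
    pending--;
    fn();
  }, ms);
};

// Resolve once nothing has been pending for 500 ms after the load event.
function whenQuiescent(): Promise<void> {
  return new Promise((resolve) => {
    const poll = () => {
      if (pending === 0) {
        originalSetTimeout(() => (pending === 0 ? resolve() : poll()), 500);
      } else {
        originalSetTimeout(poll, 100);
      }
    };
    window.addEventListener("load", poll);
  });
}

whenQuiescent().then(() => {
  // A crawler could snapshot document.documentElement.outerHTML here.
  console.log("page looks rendered");
});
```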
What you are describing is not possible with, for example, Phantom. HTTP request observation is, but AFAIK the others are not. I know it's likely that Google has something far more advanced / customized, but just wanted to point this out for anyone thinking about doing this themselves; it's a really hard problem to solve.
"Snapshot" also implies that Google search results are likely to get a lot more visual. The tone of the article made me feel that. Does a 'snapshot' imply imagery in U.S. English?
> Does a 'snapshot' imply imagery in U.S. English?
Not necessarily. For instance, I would talk about a "database snapshot", meaning a "backup of a database at a particular moment in time".
Obviously the word "snapshot" originates from photography, but to me, the connotation is much more "capture an exact copy at a moment" than "make an image".
(Unnecessary pedantry ahead.) Actually, it originates in hunting (fowling) and meant an opportunity shot taken without time to set up or consciously aim, relying on reaction time and trained instinct. In photography, it was a pejorative term applied to amateur photographs taken with essentially the same amount of thought and planning by people who owned those new Kodak gizmos that Mr. Eastman made, and it's pretty much retained that meaning.
My understanding of pjax is that this should be irrelevant. I thought the point of pjax is that you're not generating content with javascript but rather pulling it in from an existing page and inserting it into the current DOM (but the other page that the new content came from also exists at a specific URL, so the site still "works" even without javascript -- it's just a speed boost if you do have js enabled).
I thought the whole point of pjax was that it would also return the content for any URL via a normal GET. At least - that's how I've always implemented it (using django-pjax).
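A minimal sketch of that pjax pattern, where the target URL still serves a full HTML page for non-JS clients and crawlers (the #content container and data-pjax attribute are just example names):

```typescript
// Intercept clicks on pjax-enabled links, fetch the target page, swap in the
// content fragment, and update the URL so it stays shareable and crawlable.
document.addEventListener("click", async (event) => {
  const target = event.target as Element | null;
  const link = target?.closest("a[data-pjax]");
  if (!(link instanceof HTMLAnchorElement)) return;

  event.preventDefault();
  const response = await fetch(link.href); // same URL works without JS too
  const html = await response.text();

  // Parse the full HTML response and pull out just the content container.
  const doc = new DOMParser().parseFromString(html, "text/html");
  const next = doc.querySelector("#content");
  const current = document.querySelector("#content");
  if (next && current) {
    current.replaceWith(next);
    history.pushState({}, "", link.href);
  }
});
```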
I believe you're thinking of FTP, not HTTP. Content is rather more than documents, and spans from animations to interactive presentations, visualisations and demonstrations, games and ... you get the idea ;)
An old trick is to abuse the crawler to generate a huge index for reverse lookups. For example, reverse "72b302bf297a228a75730123efef7c41" back to "banana", since that's md5("banana").
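A tiny sketch of the idea, using Node's built-in crypto module (the word list is just an example):

```typescript
// Publish pages that pair common strings with their MD5 hashes, so a search
// engine that indexes the page becomes a reverse-lookup table: searching for
// a hash turns up its plaintext right next to it.
import { createHash } from "node:crypto";

const words = ["banana", "password", "letmein"]; // example wordlist

const rows = words.map((word) => {
  const hash = createHash("md5").update(word).digest("hex");
  return `<tr><td>${hash}</td><td>${word}</td></tr>`;
});

const page = `<html><body><table>${rows.join("")}</table></body></html>`;
console.log(page); // serve this and let the crawler index it
```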
They're indexing blog posts and content within seconds, not minutes. I've seen my blog posts indexed within seconds and they've been doing that for a few years now.
It's pretty obvious that they need to do this, whether reported or not, in order to handle some very easy spam attacks. E.g., replacing keyword-baiting content with, say, an advert for something totally irrelevant.
2008: http://moz.com/ugc/new-reality-google-follows-links-in-javas...
2009: http://www.labnol.org/internet/search/googlebot-executes-jav...
2011: https://twitter.com/mattcutts/status/131425949597179904
2012: http://www.thegooglecache.com/white-hat-seo/googlebots-javas...