
Understanding web pages better - dmnd
http://googlewebmastercentral.blogspot.com/2014/05/understanding-web-pages-better.html
======
mbrubeck
Googlebot has done various amounts of JS parsing/execution for a while now.
They've also issued similar webmaster guidelines in the past (e.g., don't use
robots.txt to block crawling of scripts and styles).

2008: [http://moz.com/ugc/new-reality-google-follows-links-in-javas...](http://moz.com/ugc/new-reality-google-follows-links-in-javascript-4930)

2009: [http://www.labnol.org/internet/search/googlebot-executes-jav...](http://www.labnol.org/internet/search/googlebot-executes-javascript-on-web-pages/8040/)

2011: [https://twitter.com/mattcutts/status/131425949597179904](https://twitter.com/mattcutts/status/131425949597179904)

2012: [http://www.thegooglecache.com/white-hat-seo/googlebots-javas...](http://www.thegooglecache.com/white-hat-seo/googlebots-javascript-interpreter-a-diagnostic/)

From the 2012 article: _"Google is actually interpreting the Javascript it
spiders. It is not merely trying to extract strings of text and it does appear
to be nuanced enough to know what text is and is not added to the Document
Object Model."_
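
To make that nuance concrete, consider this tiny hypothetical page script: a
naive string scraper would extract both strings below, while a DOM-aware
indexer should only see the one that actually gets attached.

    // Hypothetical page script. Only the first paragraph is attached to
    // the DOM, so only its text should be treated as page content.
    const visible = document.createElement("p");
    visible.textContent = "blue widgets for sale"; // reaches the DOM
    document.body.appendChild(visible);

    const orphan = document.createElement("p");
    orphan.textContent = "red widgets for sale"; // created but never
    // attached: a naive string extractor would still pick this up.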

~~~
ChuckMcM
I was surprised; it's kind of old news. There is useful JS to look at when
you're crawling a web page. It's also a useful way to detect malware on a page.

~~~
thatthatis
My understanding is that the state before was:

Content augmented with JS -- OK to index. Blank page with all content rendered
by JS -- not indexed

And what they're saying today is: "we are now going to be able to index single
page apps that don't have any server rendered content"

Perhaps I've missed something, but this is my interpretation: SPAs are now
full citizens in the SEO world.

~~~
KMag
Google has understood pages that have had only client-side rendered content
for years. I worked mostly on JavaScript execution in the indexing system from
2006 until 2010, as well as other "rich content" indexing.

Certainly somewhere in the 2008 to 2009 timeframe we saw that the Chinese
language version of the Wall Street Journal had a lot of pages in their
archive where all of the non-boilerplate content was rendered via JavaScript.
Since they didn't do this with their English content, it didn't seem to be an
attempt to hide content from search engines, but much more likely a workaround
for an older browser that wouldn't properly render Unicode, but with a
JavaScript engine that would properly render Unicode. Sometime in the 2008 to
2009 timeframe, Google's indexing system started understanding text that was
written into documents from JavaScript body onload handlers, and the Chinese
Wall Street Journal archive content was exhibit A in my argument that my
changes should be turned on in production.

I'm sure they've increased the accuracy of the analysis since, but they've
certainly been able to index content written by JavaScript for something like
5 years now.

Edit: without giving away any Google secrets, here's a pretty good analysis of
my work from 2008: [http://moz.com/ugc/new-reality-google-follows-links-in-javas...](http://moz.com/ugc/new-reality-google-follows-links-in-javascript-4930)

Edit 2: Since the 2008-2009 timeframe, Google also notices when you use
JavaScript to change a page's title. I caused a crash in Google's indexing
system when I made a bad assumption about Google's HTML parser's handling of
XHTML-style empty title tags <title/> and tried to construct negative-length
std::strings from them. When your code runs on every single webpage that
Google can find, you're certain to hit corner cases you didn't anticipate. I
did test for empty <title></title>, but not <title/>, and made incorrect
assumptions about the two pointers I'd get to the beginning and end of the
title.
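
The real bug was in C++, but here's a TypeScript sketch of the same class of
corner case for illustration: a naive title extractor that assumes every
`<title>` has a matching close tag, so `<title/>` yields an end position
before the start position (the analogue of the negative-length std::string).

    // Naive title extraction, sketched in TypeScript for illustration.
    function extractTitle(html: string): string | null {
      const open = html.indexOf("<title");
      if (open === -1) return null;
      const start = html.indexOf(">", open) + 1; // just past the open tag
      const end = html.indexOf("</title>", open); // -1 for "<title/>"!
      // Without this guard, "<title/>" gives end < start -- the analogue
      // of constructing a negative-length std::string from two pointers.
      if (end < start) return null;
      return html.slice(start, end);
    }

    extractTitle("<title>hello</title>"); // "hello"
    extractTitle("<title></title>");      // "" (the case that was tested)
    extractTitle("<title/>");             // null (the case that crashed)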

~~~
thatthatis
So would the more accurate interpretation be: Google is starting to promote
that it can and will index JavaScript single-page apps?

Thanks for sharing, and nice work btw.

~~~
KMag
There were a lot of caveats that both reduced fidelity and would have made
announcements confusing back when I was at Google. Also, if one mentions a
limitation in an announcement, a lot of the Search Engine Optimization
community will still be citing that announcement more than 6 months later.
So, it's difficult and potentially counter-productive to have an announcement
with lots of caveats.

V8 and Chrome weren't even a glimmer in Google's eye back in 2006, so I hope
they've largely replaced the code I was working on with something based on
Chrome. As late as 2010, the DOM was a completely custom implementation that
looked somewhat like Firefox, but with enough IE features to fool lots of
other pages that would otherwise change their content to "You must run IE to
view this page". (On a side note, as much as many people would like to see
such pages heavily penalized and indexed as if the IE-only message were their
only information, some of those pages are unique sources of invaluable
information and users wouldn't be well served by such harsh treatment.)

------
chimeracoder
> It's always a good idea to have your site degrade gracefully. This will help
> users enjoy your content even if their browser doesn't have compatible
> JavaScript implementations. It will also help visitors with JavaScript
> disabled or off, as well as search engines that can't execute JavaScript
> yet.

I'm glad that they included this.

I get that Javascript is required to make certain sites work the way they do,
but I'm appalled by the number of sites that require Javascript _just to
display static text_.

Google themselves are guilty of this. Google Groups is (for the most part)
just an archive of email mailing lists, but try reading a thread on Google
Groups with Javascript disabled![0]

There are very few sites that cannot gracefully downgrade to at least some
degree, and there are very good reasons for doing so. A major one is that
AJAX-heavy sites tend not to perform well on slow connections[1] (again,
assuming essentially static content here). If you want your users to be able
to access your site on-the-go, graceful degradation is your friend.

[0] It's especially ironic now that Google Groups is the only place to read
many old Usenet archives going back as far as the early 1980s.

[1] Try browsing Twitter on a slow connection (i.e., tethered, or "Amtrak
wifi" level). For a website that originated as a way to send messages over
SMS, and is still used that way in other parts of the world, it degrades
amazingly poorly over slow connections.
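
To make the graceful-degradation point concrete, here's a minimal sketch
(the ids and URL are made up): the server renders real content and real
links, and the script only layers a faster path on top, so the page still
works with JavaScript off.

    // Assumes server-rendered markup like:
    //   <div id="thread"> ...messages...
    //     <a id="older" href="/thread?page=2">Older messages</a>
    //   </div>
    // Without JS the link is an ordinary page load; with JS it becomes an
    // in-place fetch. Either way, the content is reachable.
    const older = document.getElementById("older") as HTMLAnchorElement | null;
    older?.addEventListener("click", async (e) => {
      e.preventDefault();
      const res = await fetch(older.href);
      document.getElementById("thread")!
        .insertAdjacentHTML("beforeend", await res.text());
    });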

~~~
icehawk219
This is a worrying trend I've been noticing as well. Over the last couple of
years especially, I've seen a large increase in people not caring about
graceful degradation, and in people not even testing in other browsers. I've
had many conversations with people who are huge fans of Angular, Backbone,
and similar frameworks, and when I mention users without Javascript I just
get the canned "but everyone has Javascript turned on anyway, and if they
don't, too bad" response. Interestingly enough, every one of them also
developed and tested only against Chrome and never even bothered to
acknowledge other browsers. I know a developer who actually likes IE and uses
it as their main browser, and for years they've dealt with breaking bugs on
sites like Github and other popular tech sites because people just don't test
in other browsers any more.

The cynic in me wants to say that this mentality is pushed by companies like
Google, because no Javascript means no spying. But honestly I think it really
just comes down to laziness. So few people truly care about their craft.

~~~
agentS
Out of curiosity, what form of "spying" do you think is enabled by Javascript
that would not be possible without Javascript?

Edit: And since I replied to a small part of your comment, I should say that I
disagree completely with your "few people truly care about their craft"
statement. At least, I think that writing code that handles a lack of
Javascript is only valuable if you have enough users to justify it; if you
spend 20% of your time working on features for 0.1% of users, then you are
doing a disservice to the rest of your users. Even more so if you have to
compromise the experience for everyone else such that degrading is an option.

In some cases, you go out of your way to accommodate small fractions of your
audience. ARIA and catering to those with disabilities is a good example. But
turning off JS is a choice; one I respect, but feel no obligation to cater to.
I think pages should show a noscript warning, but other than that, it's a
matter of engineering tradeoffs.

~~~
cpeterso
Some analytics companies track mouse movements to watch how people interact
with web pages. They can also use JavaScript to fingerprint browsers beyond
what is available with cookies.
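
For the curious, the mouse-movement tracking described above needs only a few
lines. A bare-bones sketch (the "/collect" endpoint is made up):

    // Sample mouse positions and flush them in batches, the way
    // session-replay/analytics scripts do. "/collect" is hypothetical.
    const samples: Array<{ x: number; y: number; t: number }> = [];

    document.addEventListener("mousemove", (e) => {
      samples.push({ x: e.clientX, y: e.clientY, t: Date.now() });
    });

    setInterval(() => {
      if (samples.length) {
        navigator.sendBeacon("/collect", JSON.stringify(samples.splice(0)));
      }
    }, 5000);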

------
frik
I was under the assumption that Googlebot had already been using a headless
Chrome to index websites for some time.

Google used Chrome to generate page preview pictures (at index time) that
highlighted the search terms on mouse-over (though that feature appears to be
gone now). Back then (~2 years ago), websites that display your user agent
showed the Chrome user-agent in the preview picture.

It was called _Instant Previews_:
[http://googlesystem.blogspot.co.at/2010/11/google-instant-pr...](http://googlesystem.blogspot.co.at/2010/11/google-instant-previews.html)

Details: [https://sites.google.com/site/webmasterhelpforum/en/faq-inst...](https://sites.google.com/site/webmasterhelpforum/en/faq-instant-previews)

Google removed this useful feature in 04/2013 :(

    As we’ve streamlined the results page, we’ve had to
    remove certain features, such as Instant Previews.

-- [https://productforums.google.com/forum/#!topic/websearch/Aom...](https://productforums.google.com/forum/#!topic/websearch/AomrTzXMWDI%5B1-25-false%5D)

~~~
27182818284
> Chrome to index websites for some time

I always thought this was the secret reason Chrome was built. Build a better
Googlebot and then, wait a minute, why not just release an awesome browser to
get more people using our product at the same time? Forked.

~~~
bhartzer
Technically speaking, Google built the new Googlebot and then realized that
they could release parts of Googlebot to the public in the form of a web
browser.

------
chestnut-tree
_" It's always a good idea to have your site degrade gracefully. This will
help users enjoy your content even if their browser doesn't have compatible
JavaScript implementations. It will also help visitors with JavaScript
disabled or off, as well as search engines that can't execute JavaScript
yet."_

Are Google going to follow their own advice here? Try visiting the official
Android blog with Javascript disabled:

[http://officialandroid.blogspot.co.uk/](http://officialandroid.blogspot.co.uk/)

In fact, try visiting a whole bunch of *blogspot.co.uk sites with Javascript
disabled and see how "gracefully" they degrade. Remember, these are blog sites
with mostly text content. And yet Google won't serve them up without
Javascript enabled.

~~~
thinxer
Yet they provide a rendered version for crawlers. Try appending
"?_escaped_fragment_=" to the pages, like this one:

[http://officialandroid.blogspot.com/2014/04/new-mobile-apps-...](http://officialandroid.blogspot.com/2014/04/new-mobile-apps-for-docs-sheets-and.html?_escaped_fragment_=)

And you will see the text content. It sucks, anyway.
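
For background, this is Google's AJAX crawling scheme: `#!` URLs map to an
`_escaped_fragment_` query parameter, and a page without a hashbang can opt in
via `<meta name="fragment" content="!">`, in which case the crawler appends an
empty parameter (which is what's happening with the Blogspot URL above). A
rough sketch of the crawler-side rewrite:

    // Crawler-side rewrite from the AJAX crawling scheme: the server is
    // expected to answer the rewritten URL with pre-rendered HTML.
    function toEscapedFragmentUrl(url: string): string {
      const [base, fragment = ""] = url.split("#!");
      const sep = base.includes("?") ? "&" : "?";
      return `${base}${sep}_escaped_fragment_=${encodeURIComponent(fragment)}`;
    }

    toEscapedFragmentUrl("http://example.com/page#!key=value");
    // -> "http://example.com/page?_escaped_fragment_=key%3Dvalue"
    toEscapedFragmentUrl("http://example.com/page"); // opted in via <meta>
    // -> "http://example.com/page?_escaped_fragment_="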

------
callmeed
SEO has been a major factor in my reluctance to implement client-side JS
frameworks in many projects (we work with wedding photographers and other
small businesses who live and breathe on being found in Google). It actually
seems harder to optimize a JS-heavy site than a Flash site (which we used to
sell a lot of).

If Google can actually index and rank a business's website that is, say, pure
BackboneJS, that would be awesome. But I'd like to see it in the wild before
trying to sell something like that.

For example, AirBnB appears to be using Backbone here:
[https://www.airbnb.com/s/San-Francisco--CA--United-States](https://www.airbnb.com/s/San-Francisco--CA--United-States)

Is Google able to crawl their listings by executing all the JS? Or is AirBnB
implementing other tricks to get indexed?

~~~
frik
Google licensed the Adobe Flash text extraction library. Using the library it
was easy to extract text and links.

* Google Now Crawling And Indexing Flash Content (2008): [http://searchengineland.com/google-now-crawling-and-indexing...](http://searchengineland.com/google-now-crawling-and-indexing-flash-content-14299)

* Adobe page: [https://web.archive.org/web/20080702135702/http://www.adobe....](https://web.archive.org/web/20080702135702/http://www.adobe.com/devnet/flashplayer/articles/swf_searchability.html)

    Adobe is working with Google and Yahoo! to enable one of
    the largest fundamental improvements in web search
    results by making the Flash file format (SWF) a first-
    class citizen in searchable web content.

    Google uses the Adobe Flash Player technology to run SWF
    content for their search engines to crawl and provide the
    logic that chooses how to walk through a SWF.

Edit: the parent commenter edited his text quite a bit; originally it was
about Flash content.

~~~
KMag
Google engineer Ran Adler (sometimes he spells his first name Ron to avoid
confusing English speakers) deserves most of the credit here. He went to Adobe
with a proposal for the hooks into the Flash player that he needed and worked
with their engineers to get those hooks working. I don't doubt there was a
fair amount of work on Adobe's side, too, but it wasn't like Adobe had the
technology in place before Ran started working with them.

Ran came up with an API for the hooks he needed that didn't give away too much
of the most clever parts of what he was doing. The belief was that Google
would get the hooks it wanted and in return, Adobe could share the special
Flash player with other major search engines and everyone would be indexing
Flash content. The hope was that Google would just be doing it a bit more
cleverly than the competition. I'm not sure if any of the other major search
engines ever used the hooks Ran designed.

Source: I worked on Google's rich content indexing team from 2006 to 2010. Ran
worked mostly on Flash indexing and I worked mostly on JavaScript indexing.

~~~
cpeterso
Within Adobe, this project was codenamed "Ichabod" because it was a "headless"
(no rendering) Flash Player. :)

------
andrenotgiant
I think the important thing to keep in mind is HOW Google ensures that their
understanding of Javascript helps them improve the Search Experience.

If they crawl your page with Javascript enabled, and find that after a hover
event a button appears, and after a click on that button a modal appears, and
that modal has content about BLUE WIDGETS, they are still NEVER going to rank
that URL for "BLUE WIDGETS".

Google wants to send users searching for "BLUE WIDGETS" to a page where
content about "BLUE WIDGETS" is instantly visible and apparent.

------
nateabele
I was at ng-conf in January and asked the Angular team a question about
improving SEO. Without going into any detail, they hinted at the idea that
very shortly it would no longer matter. Honestly I'm kind of surprised it took
this long.

~~~
saddestcatever
I couldn't be more excited for this change. The current process of making
Angular sites compatible for Google SEO is atrocious

------
snake_plissken
Just out of curiosity, how is this possible? If the web server can't handle
being crawled...how can it handle serving web pages?

"If your web server is unable to handle the volume of crawl requests for
resources, it may have a negative impact on our capability to render your
pages. If you’d like to ensure that your pages can be rendered by Google, make
sure your servers are able to handle crawl requests for resources."

~~~
devNoise
Try to equate being crawled with the /. effect. The site can serve web pages,
but not at scale. So if the Googlebot comes along to index your site, the
webserver may fail under load. Perhaps you're on a shared hosting plan and
your provider suspends your site because the crawling has put you over one of
the limits on your plan.

------
valarauca1
I wonder how Google limits the resources its crawlers spend on a page. A
webpage I viewed recently loaded an entire database query into my web
browser, then I made queries locally to sort it, which kinda sucked since
20MB of information is a lot.

I figure Google _has to_ have some form of safeguard against this: a CPU or
network bandwidth limit, and likely a time limit too.

In my head I'm picturing a crawler locked in a loop of forever querying
random Google search results and adding them to a page.

~~~
andrenotgiant
Across many requests, Google maintains a "bandwidth/time" budget for each
domain/subdomain that factors in domain importance, # of pages, frequency of
page updates, and even input from webmasters via Webmaster Tools. The bot
will just stop requesting pages after that budget is reached. (This is why
new domains that publish LOTS of pages at once may take a while to get
crawled.)

Across individual requests, they have distanced themselves from setting a
concrete limit on request sizes[1] (they used to cache only the first 100 KB,
then people saw them caching up to 400 KB, and now they definitely index
things like PDFs that are much larger).

[1] [http://www.mattcutts.com/blog/how-many-links-per-page/](http://www.mattcutts.com/blog/how-many-links-per-page/)

------
rcthompson
Reminds me of "Googlebot Is Chrome": [http://ipullrank.com/googlebot-is-chrome/](http://ipullrank.com/googlebot-is-chrome/)

Obviously the claim probably isn't literally true, but they could certainly
share a JS engine (V8) at the very least, and the idea that the motivation
behind developing that JS engine may have been spidering JS-dependent pages
doesn't seem too far-fetched.

~~~
bhartzer
Actually, they built Googlebot first--and then took parts of it to build
Chrome. Googlebot is Chrome and Chrome is Googlebot (minus a few features like
crawling).

------
sergiotapia
Makes total sense. When I was researching Angular, something that didn't make
any sense was Google not being able to crawl Angular websites. Google makes
Angular, you see; hence my confusion.

This instantly puts me back on the Angular hunt, as now I don't have to pay
for a service as ridiculous as 'static page SEO'.

~~~
ludwigvan
> Google makes Angular you see, hence my confusion.

The thing is, Google is not a single entity; like any other big corp, there
will be teams doing things differently, even in conflicting ways. Google is
not using Angular for most of its sites; I think Closure is more popular
there.

You can do static page SEO with JS using libraries that allow isomorphic
rendering; see Rendr, React, etc.

------
sashagim
Interesting. I wonder if some changes will need to take place to make sure
that client-side tracking services (e.g. Mixpanel, Kissmetrics, Google's own
GA) ignore Googlebot.

------
Theodores
This is great news. However, will you be able to feed the Googlebot a script
to simulate user interaction and download all of the ajax content that
normally needs clicks or mouseovers?

~~~
cleverjake
No. However, it already clicks things.

~~~
Touche
But how does it know when to take the snapshot? A clicked link can do all
sorts of asynchronous things and Google is unaware when the new rendering is
"finished".

~~~
ryanpetrich
When there are no outstanding HTTP requests, DOM events, CSS transitions or
setTimeouts the page can be assumed to be rendered. Not all pages will enter
this state, so some heuristics are likely used.
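
A rough sketch of one such heuristic, the way prerendering services
approximate it inside the page (this is illustrative, not Google's actual
mechanism): wrap the async primitives you can observe, and declare the page
rendered once the counters stay at zero for a while, with a hard deadline for
pages that never settle.

    // Count outstanding async work by wrapping observable primitives.
    // (This simplified setTimeout wrapper drops extra callback args.)
    let pending = 0;

    const origSetTimeout = window.setTimeout.bind(window);
    (window as any).setTimeout = (fn: () => void, ms?: number) => {
      pending++;
      return origSetTimeout(() => { pending--; fn(); }, ms);
    };

    const origSend = XMLHttpRequest.prototype.send;
    XMLHttpRequest.prototype.send = function (this: XMLHttpRequest, body?: any) {
      pending++;
      this.addEventListener("loadend", () => pending--);
      return origSend.call(this, body);
    };

    // "Rendered" = no pending work for quietMs, or deadline exceeded.
    function whenQuiescent(cb: () => void, quietMs = 500, deadlineMs = 10000) {
      const start = Date.now();
      let quietSince = Date.now();
      (function check() {
        if (pending > 0) quietSince = Date.now();
        if (Date.now() - quietSince >= quietMs ||
            Date.now() - start >= deadlineMs) cb();
        else origSetTimeout(check, 100);
      })();
    }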

~~~
Touche
What you are describing is not possible with, for example, Phantom. HTTP
request observation is, but AFAIK the others are not. I know it's likely that
Google has something far more advanced / customized, but just wanted to point
this out for anyone thinking about doing this themselves; it's a really hard
problem to solve.
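
For reference, the part that _is_ possible in PhantomJS looks roughly like
this (the resource callbacks are real Phantom APIs; the quiet-period
threshold is arbitrary, and the error/timeout callbacks are omitted):

    // PhantomJS sketch: count in-flight resources, then snapshot the DOM
    // once the network has been quiet for a while.
    declare function require(module: string): any; // Phantom's loader
    declare const phantom: any;

    const page = require("webpage").create();
    let inflight = 0;
    let lastChange = Date.now();

    page.onResourceRequested = function () {
      inflight++; lastChange = Date.now();
    };
    page.onResourceReceived = function (res: any) {
      if (res.stage === "end") { inflight--; lastChange = Date.now(); }
    };

    page.open("http://example.com/", function () {
      (function check() {
        // As noted above, DOM events, CSS transitions and setTimeouts are
        // invisible to this approach; only the network is observable.
        if (inflight === 0 && Date.now() - lastChange > 500) {
          console.log(page.content); // the rendered HTML
          phantom.exit();
        } else {
          setTimeout(check, 100);
        }
      })();
    });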

------
dallen33
Does this mean Googlebot can crawl pjax pages?

[https://github.com/defunkt/jquery-pjax](https://github.com/defunkt/jquery-pjax)

~~~
jordanlev
My understanding of pjax is that this should be irrelevant. I thought the
point of pjax is that you're not generating content with javascript, but
rather pulling it in from an existing page and inserting it into the current
DOM. The other page that the new content came from also exists at a specific
URL, so the site still "works" even without javascript -- it's just a speed
boost if you do have js enabled.
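
Concretely, with jquery-pjax the enhancement is roughly one call on top of
ordinary server-rendered links (the selectors here are illustrative):

    // Every <a href> is a real link to a real server-rendered page, so
    // crawlers and no-JS users get full pages; with JS, pjax swaps only
    // the #main container. Selectors/ids are illustrative.
    declare const $: any; // jQuery with the pjax plugin loaded

    $(document).pjax("a[data-pjax]", "#main");

    // Server side, jquery-pjax sends an X-PJAX request header, so the
    // app can return just the fragment instead of the full layout:
    //   if (req.headers["x-pjax"]) { render the fragment }
    //   else                       { render the full page }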

------
known
Googlebot = Google's "Customized" Chrome

------
gwbas1c
Honestly, I find sites that completely rely on Javascript to be somewhat
unreliable.

The web is fundamentally a document retrieval system. Content needs to work
without Javascript.

~~~
grkvlt
I believe you're thinking of FTP, not HTTP. Content is rather more than
documents, and spans from animations to interactive presentations,
visualisations and demonstrations, games and ... you get the idea ;)

------
thrillscience
So I wonder if we can trick Googlebot into doing computations for us? Maybe
mine bitcoins! :-)

~~~
frik
This could work, especially on high-ranked news sites that are crawled every
few minutes for new content.

I just noticed that a comment I wrote on HN a few minutes ago was already in
Google's search results - I was a bit shocked.

~~~
bhartzer
They're indexing blog posts and content within seconds, not minutes. I've seen
my blog posts indexed within seconds and they've been doing that for a few
years now.

~~~
0x0
That's probably because a default WordPress install has pingomatic.com listed
in its on-post callbacks ("Update Services").

------
znowi
Guidelines on how to conform to the rules of the Matrix, otherwise you may be
expelled from the system into obscurity of Zion.

------
hosay123
It's pretty obvious that they need to do this, whether reported or not, in
order to handle some very easy spam attacks - e.g., replacing keyword-baiting
content with an advert for something totally irrelevant.

So this isn't really new or surprising.

