
Does Google execute JavaScript? - rrradical
https://www.stephanboyer.com/post/122/does-google-execute-javascript
======
mootothemax
I'm convinced that Google has several Googlebots that are run depending on how
popular a site is.

That is, new and low-traffic sites are crawled by less intelligent bots, and
as a site gets more visitors or better rankings, more complicated and
resource-intensive bots are deployed.

How this might work with the most popular sites out there, the Amazons and
Wikipedias of this world - I'm not so sure about that. If I were in charge,
I'd be tempted to have customised bots and ranking weights for each of these
exceptional sites.

Sadly the chances of getting a real answer on this in my lifetime are close to
zero.

~~~
dom0
Or perhaps there are also heuristics in place to determine which strategy to
follow, i.e. heuristics to see whether executing JS would be worth it, would
yield additional content. So, say, when crawling documentation, where JS
doesn't give any of that (e.g. Sphinx's JS search), it could decide - nah, not
doing JS, not worth it.

I'd expect that there are also other heuristics and different strategies for
crawling to better handle e.g. content presented by one of the popular CMSes.

~~~
valarauca1

        heuristics to see whether executing JS would be worth it, would yield additional content.

You are literally describing vanilla PageRank. If a large number of links are
found to a page, but that page doesn't contain the content the link rate
suggests it _should_ contain... either the link rate has failed, or JavaScript
should be executed.
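
To make the comparison concrete, here is a minimal sketch of the vanilla PageRank iteration being referenced; the link graph, damping factor, and iteration count are all illustrative:

    # Vanilla PageRank by power iteration; the graph is a toy example.
    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}  # page -> outgoing links
    damping = 0.85
    rank = {page: 1.0 / len(links) for page in links}

    for _ in range(50):  # iterate until the ranks settle
        new_rank = {page: (1 - damping) / len(links) for page in links}
        for page, outgoing in links.items():
            share = damping * rank[page] / len(outgoing)
            for target in outgoing:
                new_rank[target] += share
        rank = new_rank

    print(rank)  # heavily linked-to pages accumulate the most rank

Under this reading, a page whose rank (link-implied importance) is high but whose fetched HTML is nearly empty would be a natural candidate for a JS-executing re-crawl.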

------
lprubin
Another, faster way to see what JavaScript Google can crawl on your website is
Google Search Console (previously known as Google Webmaster Tools). It has a
Fetch as Google button that lets you enter a URL on a site you own and see a
visual rendering of how Google's crawlers see your page. It even gives you a
side-by-side comparison of what the crawler sees vs. what a user sees.

~~~
m0th87
My company has seen major discrepancies between what this tool gives back vs
how the page ultimately ends up getting indexed.

------
nrjdhsbsid
My pet theory is that Google actually developed Chrome as a web crawler and
that the consumer release was to ensure that Google would always be able to
crawl pages (since sites would always want to work properly with Chrome).

It also explains why they effectively killed Flash and Java applets. Those
were competing technologies that weren't owned by Google and weren't
crawlable. If they had taken off, Google's position as the top search engine
could have been in danger.

~~~
teddyh
If that were true, you would think they would have responded better to the
threat of Facebook than with Google+ and their distasteful pushing of it on
all their platforms.

What do I mean with “the threat of Facebook”? In the old days, before today’s
large “social media” sites, people made their own web pages on places like
GeoCities or on simpler social-media-like sites like LiveJournal, etc. Those
sites all had content and linked to each other. _This_ is the web in which the
Google search engine and its algorithm were meant to find things, and it
worked very nicely, as it took advantage of the links other people had made to
your site as a proxy for your site’s relevance in search results. People
making small web pages about their favorite topics (with lots of links to
other people’s pages, since information was hard to find) could slowly and
easily transition into making larger and larger reference web sites with lots
of information, thereby attracting lots of incoming links from others, which
in turn enabled people to find the information using Google’s search engine.

Compare this to now. Firstly, people with a Facebook account have no place
to simply put information, no _incentive_ to simply make a web page about,
say, tacos or model trains, because that’s not what Facebook is about.
Facebook is about the here-and-now, and whatever happened yesterday is
forgotten. As I understand it, there is no real way, on Facebook, to make a
continuously updated page with a fixed address for people to go to as a
reference point about some subject, or at least people are not directed
towards doing this as part of their online activity (as opposed to in the
past, when it was basically the _only_ thing people could do). Secondly, this
means that people have no natural path going from using Facebook to creating a
larger web site with information, and there are no smaller model-train or taco
Facebook “pages” which could link to your larger site and thereby validate its
relevance. Thirdly, even if this second point were false, Google could not use
these Facebook pages, because Google cannot crawl them. These pages are all
internal to Facebook, and Facebook has every incentive not to allow Google to
crawl and search this information. Facebook would much rather people used its
own site to search, thereby gaining all of Google’s sources of income: user
monitoring and advertising.

~~~
jupiter90000
Very interesting points. On a bit of a tangent: I had felt nostalgic about the
days of the personal/hobby websites you allude to, which seemed so prolific
before Facebook etc., and wondered why things had changed so much.

It makes sense to me that someone who used to be motivated to build a site
about their life or a topic of interest may now often just sign up for a
service like FB and occasionally do a post with an article, picture, etc.
about themselves/their interests. It seems to require much less effort for
folks, which is perhaps why they do this. I lament the change somewhat and
wonder what the future holds for this type of thing.

~~~
dispose13432
> It seems to require much less effort for folks, which is perhaps why they do
> this

It's also _much_ easier to find readers.

Compare:

Facebook: become friends with your co-workers, now they see your picture

Blog: _Please_ go to jupiter90000 (that's how many zeros?).blogspot.com to see
my once a week pic updates!!

~~~
teddyh
I agree, but this is _not_ an inherent difference of the open web vs. a walled
garden, only a difference in the implementation of the software involved.
Indeed, the developers of the Web and its browsers were _aware_ of this
problem, and they thought that they _had_ solved it, using something they
called “Bookmarks”. Now, as implemented, bookmarks may not be easy enough, and
there have been other ideas, like RSS feeds, which tried to improve upon the
idea. Just don’t think that this difference is inherent and set in stone. New
features could be developed.

------
codedokode
Google executes JS, but maybe not on every website. If you have a JS error
reporting tool on such a site, then you can get reports from Google IP
addresses. I first saw them maybe 4 or 5 years ago.

Executing JS everywhere would require a lot of CPU time, and I think Google
prefers not to do that when possible. And indexing a JS app is a very
complicated task anyway (it is difficult for a robot to even find navigation
elements if they are implemented as divs with onclick handlers instead of
links), so you'd better use sitemaps to make sure the bot can find content.

And I don't think it is necessary to index rich apps. It makes no sense to
index a ticket search app (the data become outdated too fast) or an online
spreadsheet editor. Just make indexable pages as server-rendered HTML pages
and put their URLs into a sitemap.
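
For that last suggestion, a minimal sketch of generating a sitemap in the standard sitemaps.org format; the URLs are placeholders:

    # Write a minimal sitemap.xml listing the server-rendered pages.
    urls = ["https://example.com/", "https://example.com/about"]  # placeholders
    entries = "\n".join(f"  <url><loc>{u}</loc></url>" for u in urls)
    sitemap = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        f"{entries}\n"
        "</urlset>\n"
    )
    with open("sitemap.xml", "w") as f:
        f.write(sitemap)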

Also, Google looks for strings in JS code that look like URLs (e.g. var url =
'/some/page') and crawls them later.
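
A crude sketch of what such URL scraping could look like; the regex is illustrative, and whether this resembles Google's actual heuristic is speculation:

    # Pull URL-ish string literals out of JavaScript source.
    import re

    js_source = "var url = '/some/page'; fetch('https://example.com/api/items');"
    pattern = re.compile(r"""["'](https?://[^"']+|/[A-Za-z0-9_\-./]+)["']""")
    print(pattern.findall(js_source))
    # ['/some/page', 'https://example.com/api/items']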

------
tyingq
Google's announcement when they started parsing javascript:
[https://webmasters.googleblog.com/2014/05/understanding-
web-...](https://webmasters.googleblog.com/2014/05/understanding-web-pages-
better.html)

~~~
cbr
They had already been doing it for a while by 2014. For example, I saw them
doing it on my site in January 2012: [https://www.jefftk.com/p/googlebot-
running-javascript](https://www.jefftk.com/p/googlebot-running-javascript)

(Disclosure: I work for Google, on unrelated stuff, though I didn't at the
time I wrote that blog post.)

------
andrewstuart2
This is a subject that really irks the engineering side of me. It's utterly
ridiculous that engineering and efficiency decisions are so deeply affected by
whether or not the largest search engine will properly index your content.

Why is it that Google doesn't get flak for not discovering content that's
engineered to send the absolute minimum over the wire, cache intelligently in
localStorage and IndexedDB, and scale well by distributing the appropriate
amount of rendering work to the client agent? Why can't I expose a
(JSON/)REST-API-to-deep-link mapping and have Google just crawl my JSON data
and understand (perhaps verifying programmatically some percent of the time)
that the links they show in search will deep link appropriately to the
structured JSON content they crawled?

It's such a waste of talent and resources to force server-side rendering.
There's obviously the resource cost of transmitting more repetitive content
over the wire, and of requiring servers to do work that the client could do.
(Yes, even with compression this will still be a higher cost, because more
repeated sequences reduce the value of variable-length encoding.) But more
than that, what bothers me is this false premise that server-side rendering is
a requirement for modern architectures, which must result in hundreds of
thousands of wasted engineering hours trying to enable the idea of server-side
_and_ client-side rendering with the same code.

This is not about time-to-first-byte either. Yes, user-perceived latency
matters, but the idea that server rendering even solves this problem is again
utterly false. Sure, the time to the very first byte ever may be faster, but
that's not a winning long-term strategy unless you never expect your client to
request the same content twice (or come back to your site at all). When
properly cached and synchronized, the client-side-only app has a TTFB that is
many orders of magnitude faster, because the content is coming from disk or
even memory and can be shown immediately. The only thing left to do is ask the
server "what's new since my last timestamp?"
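
A sketch of that "what's new since my last timestamp?" exchange; the /api/changes endpoint and its payload shape are hypothetical:

    # Delta sync: render from the local cache immediately, then fetch
    # only the objects that changed since the last successful sync.
    import time
    import requests

    cache = {}       # object id -> business object (persisted client-side)
    last_sync = 0.0  # timestamp of the most recent successful sync

    def sync(base_url="https://example.com"):
        global last_sync
        resp = requests.get(f"{base_url}/api/changes", params={"since": last_sync})
        resp.raise_for_status()
        for obj in resp.json()["changed"]:
            cache[obj["id"]] = obj  # per-object, not per-page, invalidation
        last_sync = time.time()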

All of these benefits seem to be completely disregarded 99% of the time
because the golden "SEO" handcuffs are already on. I really hope we as a
community can get away from this mindset and instead let the better-engineered
sites with the best and fastest UX _over time_ start driving search engine
technology, instead of the other way around.

~~~
paulddraper
> result in hundreds of thousands of wasted engineering hours trying to enable
> the idea of server-side and client-side rendering with the same code

Is this problem _really_ that difficult? Why?

Why should your code care if it is running on my computer or yours?

Isomorphic JS has been around for years. Build your product on a bloated tech
stack relying on an increasingly poorly planned web of dependencies, and I'll
agree it could be challenging.

> Why can't I expose a REST API to deep link mapping and have Google just
> crawl my REST API

They do crawl REST APIs. Specifically, ones using Hypertext Markup Language.

> cache intelligently in localStorage and IndexedDB

Speaking of hundreds of thousands of wasted engineering hours... HTTP caches are
simple and straightforward. IndexedDB leaks memory in Chrome so badly that
Emscripten had to disable it
([https://bugs.chromium.org/p/chromium/issues/detail?id=533648](https://bugs.chromium.org/p/chromium/issues/detail?id=533648)
[https://github.com/kripken/emscripten/pull/3867/files](https://github.com/kripken/emscripten/pull/3867/files)).
Mozilla advised developers not to adopt Local Storage due to the inherent
performance issues. ([https://blog.mozilla.org/tglek/2012/02/22/psa-dom-local-
stor...](https://blog.mozilla.org/tglek/2012/02/22/psa-dom-local-storage-
considered-harmful/)) And how many wasted hours went into WebSQL?

> utterly ridiculous that engineering and efficiency decisions are so deeply
> affected by whether or not the largest search engine will properly index
> your content

Actually, it makes a lot of sense. Content needs to be discoverable. Hosting a
complex language in a VM where the slightest deviation from the 600-page
specification (and that's just for the core language...not the browser APIs)
causes failure -- that's not "discoverable". It's like putting up a billboard
with one giant QR code, just because that makes it easier to develop the
content.

~~~
andrewstuart2
_Isomorphic._ Thank you, I was searching my brain for that word for like half
an hour. :-)

> Why should your code care if it is running on my computer or yours?

It shouldn't. But my users already care about perceived latency, and that is
directly limited by the speed of light. My users want feedback as quickly as
possible that their input has been received, and that something is happening
in response. Thanks to the speed of light, this would ideally take place
instantly right in front of their eyeballs. That can't happen yet, so as much
as can realistically happen on my user's CPU, memory, and storage is the next
best thing.

> They do crawl REST APIs. Specifically, ones using Hypertext Markup Language.

What I meant to say was JSON, so I'm contributing to my own pet peeve of
saying "REST" and meaning "JSON." :-p

HTML is awesome and does a wonderful job of letting me mark up content in a
way that can be rendered efficiently and semantically, and that is both
human-readable and marginally machine-readable. There are two problems,
though. The first is that _full documents_ (since the article points out AJAX
is not performed by Google) are incredibly repetitive and wasteful, especially
when retrieving the same content fragments multiple times.

The second is that it strongly couples content and presentation, two
orthogonal concepts, much earlier than is optimal. Sure, you can cache full
documents and display them when requested again, but the more common case is
that a large subset of what I just displayed to my user will be displayed
again, with one new item, but has still invalidated my cache because the
granularity is at the full-page presentation level, and not the business
domain object level. If, instead, I cache and render business objects on the
client side, I can be more intelligent and granular with my caching strategy,
react much more quickly to my users' feedback, and have a much smaller impact
on their constrained devices. Not only that, but transmitting structured
business objects instead of presentation-structured content lets me more
efficiently reuse that data across devices for which HTML may not be the most
effective way to present the data to them.

My personal architectural bents aside, the truth remains that _content
discovery agents_ (e.g. indexers) should not be treated as _content delivery
agents_ with such a huge influence on content format. This ends up creating
(IMO) too much influence over external engineering decisions, rather than
allowing engineers to think critically about the right architecture that gives
users the best possible experience.

Most importantly, I'm not saying that all the engineering effort should be
placed upon the discovery agents. Of course there are limits on how much they
can discover on their own, and (as always in matters involving many parties)
there need to be good conversations about the state of things, and what we
think is the right direction to go to support each other and our users. It's
just been my opinion lately that this is not so much a conversation anymore as
a unidirectional stream of "best practices" coming from a single group.

~~~
paulddraper
Yeah, I understand the server-side rendering vs. client-side updating, and the
design benefits of API-driven development. And unfortunately, a lot of popular
JS frameworks haven't done a great job of helping with these.

Closure Library/Templates was meant to render server-side and bind JS
functions after render, or create client-side dynamically. (Interestingly, the
historic reasons were performance, not SEO.)

React and Meteor have good server-side stories. Angular 2 is getting one.

I would say there is a lot of low-hanging fruit in just avoiding most client-
side JS. Take [http://wiki.c2.com/](http://wiki.c2.com/) -- the "original"
wiki. That should all be static. Same with blogs, documentation, and lots of
other public, indexable content.

~~~
inlined
[Disclosure: I work at Google but don't work on anything related to the
crawler]

All this anger aside, I'm actually pretty impressed with the world we live in
and proud of my company. Think about how far we've come that merely crawling
and indexing the vastness of the internet is so mundane now. Now we should
expect the whole internet to be downloaded and executed. That's got to be a
great security and integrity problem. Surely someone has tried to break out of
the sandbox. Can that be abused to affect the SEO of other sites? The easy
answer is "spin up a new VM for each page", but that would slow the indexing
process down by orders of magnitude.

~~~
andrewstuart2
I'm not sure where you're sensing anger. The thread so far is a pretty great
example of the discourse I've come to really appreciate on HN. Sure,
disagreement may be uncomfortable or feel awkward to read at times, but I
think it's easily for the best. I'd much rather have somebody disagree with me
and give good reasons than just blindly agree.

------
pul
It has now also executed AJAX:

[https://www.google.nl/search?q=site%3Adoesgoogleexecutejavas...](https://www.google.nl/search?q=site%3Adoesgoogleexecutejavascript.com+yes)

------
linkregister
Good, uncomplicated article.

If you can get Google's servers to execute JavaScript, that sounds like a
possible attack vector. It's likely that Google runs these scripts in a
proprietary, feature-sparse interpreter.

The lack of AJAX would make it difficult to leak information about the black-
box interpreter.

------
franze
For a more in-depth look at how Google treats JS, watch this talk
[https://youtu.be/JlP5rBynK3E](https://youtu.be/JlP5rBynK3E) by Google's John
Müller at an Angular conference.

------
binaryanomaly
While Google is certainly the main search engine most people use, isn't it to
some extent also very important what other engines such as Bing, Yandex,
Baidu, etc. do?

If you have a professional website, you want to be found by these other
engines too. Until they also support JavaScript, you may end up with a hybrid
SEO architecture anyway, which means nothing was gained?

~~~
coldtea
> _If you have a professional website, you want to be found by these other
> engines too._

Do you, really? Unless you are interested in the Chinese market.

In fact the inverse is probably more true: if you run one of those other
search engines, then you want to be as good at indexing any particular site as
Google is.

~~~
binaryanomaly
Well, first, Google is the major player but not at 100% market share; some
figures say 80%, and it differs by country and continent.

Second, supporting other platforms is also important; think of FB, Twitter,
etc.

So either this really becomes the new standard that everyone supports, or you
may end up with a hybrid or traditional approach if you care about other
platforms as well - IMHO you should.

------
chinathrow
I've run a site I know, which uses nothing but Angular/JS on the frontend,
through PageSpeed Insights [1], and it fully failed that test - no results
visible. Also, nothing but the root URL itself is indexed. No page snippet
preview, nothing.

[1]
[https://developers.google.com/speed/pagespeed/insights/](https://developers.google.com/speed/pagespeed/insights/)

------
latenightcoding
Nicely done! I have been writing crawlers for a while now, and executing
JavaScript is very expensive and slow, even for Google. When I crawl the web,
I usually run JavaScript from a headless browser only on top-priority sites.

~~~
elorant
What I find very useful is to run Firefox through Selenium with a plugin
installed to disable images. Then it's blazing fast.
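
As a sketch of that setup: newer Firefox builds can block images with a preference rather than a plugin, so a Selenium script along these lines (the target URL is a placeholder) gets the same effect:

    # Headless Firefox via Selenium with image loading blocked by preference.
    from selenium import webdriver
    from selenium.webdriver.firefox.options import Options

    opts = Options()
    opts.add_argument("-headless")
    opts.set_preference("permissions.default.image", 2)  # 2 = block all images

    driver = webdriver.Firefox(options=opts)
    try:
        driver.get("https://example.com/")
        html = driver.page_source  # the DOM after scripts have run
    finally:
        driver.quit()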

------
crispytx
Best post I've seen on Hacker News. I've always played it safe and never
assumed Google would index content displayed dynamically with JavaScript, but
now I know!

------
xg15
If I click the link from the article that leads to the webcache version, I get
"yes, but embedded only".

If I click the link within _that_ page that leads to the exact same webcache
url, I get "yes, embedded and external but no ajax".

If I google the site, the preview text is the non-changing portion of the text
only ("This is an experiment to...") - not even a "No".

I think Google is just trolling us.

------
avitzurel
Beyond the theory, the talks, and the articles: I have multiple 100%
JS-rendered pages (blank page with no JavaScript).

Google is crawling and indexing them with zero issues.

------
tyingq
I wonder what Google does to avoid indexing too many pages. There are a fair
number of SPAs, and software like shopping carts, with a large number of
checkboxes, pulldowns, knobs, dials, etc. that change both the content and
the current URL query params.

~~~
appleiigs
I think it avoids indexing too many pages by not triggering the checkboxes,
pulldowns, etc.

Through Google Search Console/Webmaster Tools, you can tell Google which query
params your website uses. But for my modest website, Google only uses the page
query (?page=3), and I don't notice it using the other queries.

------
zitterbewegung
I tried this on Bing and DuckDuckGo, and they only have the body text in the
description.

------
nkkollaw
I'm loading i18n before showing any content, and was afraid Google wouldn't
index the content, but it didn't have any problem doing that.

------
tgtweak
This is a great way to test a hypothesis, and a good experiment.

I'll mention that there is a rule that was added a few years back, in the
Backbone.js days, that URLs with /#!route anchors will enable (read: force)
AJAX requests and JavaScript from the spider. It still remains a helpful way
to force caching/indexing of JavaScript-only pages in Google.
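
For context, this sounds like Google's old AJAX crawling scheme (proposed in 2009, deprecated in 2015): the crawler rewrote #! URLs into an _escaped_fragment_ query that the server was expected to answer with pre-rendered HTML. A sketch of the URL mapping:

    # Map a hashbang URL to its crawlable _escaped_fragment_ form.
    from urllib.parse import quote

    def escaped_fragment_url(url: str) -> str:
        base, _, fragment = url.partition("#!")
        sep = "&" if "?" in base else "?"
        return f"{base}{sep}_escaped_fragment_={quote(fragment)}"

    print(escaped_fragment_url("https://example.com/#!route"))
    # https://example.com/?_escaped_fragment_=route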

------
amorphid
One random thought... Google goes to SomeWebsite.com. The site has only enough
HTML to load a big ol' JavaScript app, which Google slowly crawls. Well, that
JS app makes a bunch of AJAX calls. There's no reason I can think of that
would prevent Google from remembering which AJAX calls were made, and then
just crawling the URLs for those calls on subsequent visits. Why load
SomeWebsite.com's JavaScript app every time you want to index the site, when
you can just remember that the JS calls SomeWebsite.com/some-endpoint.json?
Sucking the JSON out of an endpoint might even be faster than indexing regular
HTML. I haven't written a lot of crawlers, so I'm mostly guessing here.
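
A sketch of that "remember the AJAX calls" idea; Playwright is just one way to observe the requests, and the target URL is a placeholder:

    # Record the JSON endpoints a JS app calls during a render, so a later
    # crawl could hit them directly instead of re-rendering the page.
    from playwright.sync_api import sync_playwright

    json_endpoints = set()

    def note_json(response):
        if "application/json" in response.headers.get("content-type", ""):
            json_endpoints.add(response.url)

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.on("response", note_json)
        page.goto("https://example.com/")
        page.wait_for_load_state("networkidle")
        browser.close()

    print(json_endpoints)  # candidate URLs for direct JSON crawling next time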

~~~
codedokode
Crawling AJAX data alone makes no sense, because it could be just a piece of
JSON, and Google needs a rendered HTML page with a URL it can show in the
results. If you have some data that is not available at a separate URL (e.g.
it is loaded when the user presses a button), it will not be indexed.

------
tigras2
We have a website (React + Babel + AJAX), and we monitor those AJAX requests
because of bad scrapers. :) We constantly see Googlebot: at least 1k requests
per day, with the Googlebot agent from the Google IP range. So yes, Google
does AJAX and also understands packed React.
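
One way to check that such hits are genuine Googlebot rather than a scraper spoofing the user agent is the double DNS lookup Google documents: reverse-resolve the IP, check the domain, then forward-resolve it back. A sketch with stdlib calls:

    # Verify a claimed Googlebot IP via reverse + forward DNS.
    import socket

    def is_googlebot(ip: str) -> bool:
        try:
            host = socket.gethostbyaddr(ip)[0]
            if not host.endswith((".googlebot.com", ".google.com")):
                return False
            return ip in socket.gethostbyname_ex(host)[2]
        except socket.error:
            return False

    print(is_googlebot("66.249.66.1"))  # an address in a published Googlebot range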

------
GoToRO
My experience is that content hidden behind JS will get indexed later, if at
all, and it will be updated less often. Also, they will run JS for bigger
sites first, and not so much for smaller sites.

------
jorblumesea
I think one of the most frustrating things about Google indexing is the
complete lack of transparency. I understand that it helps Google slow down the
arms race of search engines, but it also means that devs doing 100% banal work
need to sift through mountains of rumors and spin up sites to test
assumptions.

I have literally heard every combination of practices with regard to SEO and
have no idea what is truly correct. Every source contradicts the others,
Google employee statements contradict those, etc.

------
jotto
If you don't want to deal with the ambiguity of whether your AJAX will run or
not, I'll shamelessly suggest
[https://www.prerender.cloud/](https://www.prerender.cloud/), which is helping
a few sites that couldn't get Google to execute their AJAX.

------
obvio
Here's an interesting experiment from a while ago:
[http://searchengineland.com/tested-googlebot-crawls-
javascri...](http://searchengineland.com/tested-googlebot-crawls-javascript-
heres-learned-220157)

tl;dr: Google indexes JS-generated content.

------
faragon
Is that safe? E.g. exploits, privilege escalation, etc.

------
jwatte
My pet theory is that part of the anonymous usage data Chrome sends back is
digested page contents that go into PageRank. And such browser-level digesting
would be done on rendered pages (after JavaScript execution).

I have no reason to believe this is true other than it's what I would do to
distribute the job of crawling the web to my users if I were Google :-)

~~~
Fogest
Yeah, it would make sense. There has to be more reason for giving people a
free web browser than just that it uses Google search.

~~~
neurostimulant
Chrome does send domains/URLs entered in the omnibar to Google (I think
someone did an experiment to test it several years ago), but sending out page
content to Google? If that were true, it would cause a huge legal and privacy
problem, especially in places with tight privacy laws like Europe.

~~~
samsonradu
Exactly, imagine if Google sent the page content after you log into your bank
account.

~~~
4ad
And yet this is exactly what happens with automatic translation.

------
user5994461
> Does Google execute JavaScript?

Yes.

There are sites that can't be loaded without JavaScript that are indexed fine
by Google. The only explanation is that Google runs some JavaScript.

------
korzun
They have been for almost three years...

