
Does Google crawl dynamic content? - pstadler
http://www.centrical.com/test/google-json-ld-and-javascript-crawling-and-indexing-test.html
======
dansingerman
We built our site, [https://appapp.io](https://appapp.io) (a search engine for
the App Store) as a one-page app. The server returns no content in the initial
HTML, so we were unsure to what extent Google would spider/index it.

As far as we can tell, it makes no difference compared to content generated
server-side:
[https://www.google.com/search?q=site%3Aappapp.io](https://www.google.com/search?q=site%3Aappapp.io)

So yes, Google definitely does index dynamic content. I would love to know if
it ranks it equivalently.

Also, Bing does not:
[http://www.bing.com/search?q=site%3aappapp.io](http://www.bing.com/search?q=site%3aappapp.io)

(apologies for the minor self-promotion)

~~~
gildas
If you do a search for 'site:[https://appapp.io](https://appapp.io)' and go to
the last page of results, you'll see Google has actually indexed only about
120 pages. For example, this query returns no results:
'site:[https://appapp.io](https://appapp.io) "Release notes for version
6.6.0"'. It should return the page /<lang>/app/we-heart-it/539124565.

~~~
dansingerman
Yes, we have more work to do to get Google to index all our content. It's
still a bit of a mystery to us (Google Webmaster Tools tells us about 2,500 of
our pages are in their index).

Our goal is not to have every app page indexed (as that content is by
definition non-original), but to have our app category pages indexed, e.g.
[https://appapp.io/gb/genre=Games;has_iap=false;price=Paid/se...](https://appapp.io/gb/genre=Games;has_iap=false;price=Paid/search)

~~~
gildas
Moreover, note that in the SERPs you can find some pages with the correct
title but the wrong description; see the first two results here, for example:
[http://imgur.com/cOV85UJ](http://imgur.com/cOV85UJ).

BTW, if you don't want every app page to be indexed, I would recommend adding
the tag <meta name="robots" content="noindex"> to the app pages. Alternatively,
you could define the canonical URL as the URL of the original content.

~~~
neogenix
We have a ReactJs application served from S3 with dynamic content on
[https://teletext.io](https://teletext.io). All content (eg texts) is loaded
asynchronously from a CloudFront distribution and I can confirm this is
indexed properly by Google (see
[https://www.google.com/search?q=site%3Ateletext.io](https://www.google.com/search?q=site%3Ateletext.io)).

The only thing we can't seem to get right is the meta title and meta
description. If you set them asynchronously based on the React page you are
rendering, Google only seems to pick them up on about 10% of the pages, so the
SERPs don't look as pretty as you would like. I haven't found a solution for
that yet. :-(

------
userbinator
I suppose this explains all the times I've seen a promising search result with
the words I was searching for prominently highlighted, then visited the page
to find what I was looking for is no longer there. Sometimes the cached, text-
only version has it, and sometimes not. Alternatively, I'll see search results
with _none_ of the words I was searching for, yet perhaps they did sometime in
the past. Rather annoying.

~~~
Rangi42
This is a problem with a lot of paginated sites, such as Tumblr, various
forums, and comment pages. Anything that's ordered from newest to oldest won't
have a constant correspondence between URL and content.

~~~
HappyTypist
That's why pagination (when newest to oldest) should be designed as something
like "after_id=x". Sure, there is some implementation complexity, but your
users will love you if they can actually find the content they searched for.
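The "after_id" idea above is keyset (cursor) pagination: a page is addressed
by the item it starts after rather than by a page number, so a URL keeps
pointing at the same items even as new content is prepended. A minimal sketch
in JavaScript (the names `pageAfter` and `posts` are illustrative, not from
any real API):

```javascript
// Keyset ("after_id") pagination over an in-memory list sorted newest-first.
// A page is identified by the id it starts after, not by its position.
function pageAfter(items, afterId, limit) {
  // No cursor means start from the top; an unknown id also falls back to 0.
  const start = afterId == null
    ? 0
    : items.findIndex(item => item.id === afterId) + 1;
  return items.slice(start, start + limit);
}

const posts = [
  { id: 42, title: "newest" },
  { id: 41, title: "older" },
  { id: 40, title: "oldest" },
];

pageAfter(posts, null, 2); // first page: ids 42, 41
pageAfter(posts, 41, 2);   // "?after_id=41": id 40, stable even as new posts arrive
```

A database-backed version would express the same thing as `WHERE id < :after_id
ORDER BY id DESC LIMIT :limit`, which also avoids the OFFSET scans that
page-number pagination needs.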

~~~
frik
What about the frontpage which is usually page 1 of the pagination?

Google should give the actual article URL a higher score and pagination pages
a lower score, so that in the search results I see the article itself first
and the duplicate content on pagination pages not at all (or way down), at
least for common blog software.

~~~
Sacho
What about a redirect from your home page to whatever page 1 is at the moment?

~~~
frik
Imagine a blog. On the frontpage example.com/ (= example.com/?page=1) it shows
the newest 10 articles, on example.com/?page=2 the next 10 articles, and so
on. Every article's headline links to its actual URL (e.g.
example.com/?article=123).

Now imagine that Google links to example.com/?page=2 because it found the
search phrase there at some point (only Google knows when). When the user
clicks the search result leading to example.com/?page=2, how is the blog
software supposed to know what Google or the user wants?

One thing that comes to mind is to use the referrer: if it's a common search
engine, parse the search term from the query string and use an internal
article search to find the best-matching article.
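That referrer idea could be sketched like this. A hedged illustration: the
engine list and parameter names are assumptions, and note that most search
engines have since stopped passing the query in the referrer at all, so this
works far less often today than it once did.

```javascript
// Recover the search phrase from a search-engine referrer, if present.
function searchTermFromReferrer(referrer) {
  try {
    const url = new URL(referrer);
    // Illustrative list of common engines; real code would need more care.
    const engines = ["google.", "bing.", "yahoo.", "duckduckgo."];
    if (!engines.some(e => url.hostname.includes(e))) return null;
    // Historically ?q= on most engines, ?p= on Yahoo.
    return url.searchParams.get("q") || url.searchParams.get("p");
  } catch (e) {
    return null; // empty or malformed referrer
  }
}

searchTermFromReferrer("https://www.bing.com/search?q=release+notes");
// → "release notes"
searchTermFromReferrer("https://example.com/?page=2"); // → null
```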

~~~
BHSPitMonkey
I think you missed the part earlier in this thread about using ?after=id
rather than ?page=n. See Reddit for an example.

~~~
jlgaddis
You can see it here on HN as well. Click "new" at the top of this page, then
"More", and see the "next=" parameter in the URL.

------
gPphX
I have modified Wikipedia pages, then googled them, and seen the search
results "instantly" updated.

Also, sneaky web sites often serve different content to the Googlebot user
agent than to a non-Google Firefox user agent:

[https://en.wikipedia.org/wiki/User_agent](https://en.wikipedia.org/wiki/User_agent)

[https://addons.mozilla.org/en-GB/firefox/search/?q=user+agen...](https://addons.mozilla.org/en-GB/firefox/search/?q=user+agent&cat=all)

~~~
anewhnaccount
Google deliberately crawls in a non-Google-looking way to try to detect
"masking".

~~~
tracker1
Yeah, that expert sex change site (Experts Exchange) used to bug the hell out
of me for that reason... scroll forever and a day after loading.

------
m0dest
I'd _really_ love it if you repeated the same tests for Bing, just to get
coverage. (Yahoo/Baidu would be the other big two.) Historically, Bing hasn't
used fully functional headless browsers to crawl, which has limited its
ability to index dynamic content like this.

Google has "only" 70% market share, so it seems irresponsible to make
engineering decisions without testing the others. Google+Bing+Yahoo+Baidu get
you to 98%.

~~~
mkaufmann
I just tried this on Bing for this page. The page is indexed: when I search
for the phrase "Update this was posted to Google on Friday the 17th of July,
2015. Monday, the 20th", the page is shown.

Searching for any of the other test strings in the article, for the different
loading variants, returns no results. So no variant of JavaScript-injected
content currently works on Bing.

------
peterhartree
The post author writes:

> So, very soon, the days of pre-rendering PhantomJs snapshots and serving
> shadow content to spiders will be over.

To be clear: webmasters of sites with dynamic content should not celebrate
yet. There are still influential spiders other than Google's that do not parse
JavaScript (for example, Facebook's[1] and Twitter's[2]).

[1]
[https://developers.facebook.com/docs/sharing/webmasters/craw...](https://developers.facebook.com/docs/sharing/webmasters/crawler)

[2] Can't find an official statement on this, but
[https://twittercommunity.com/search?q=javascript%20crawl](https://twittercommunity.com/search?q=javascript%20crawl)

~~~
oneeyedpigeon
And, of course, there are still all the other reasons why you shouldn't be
serving static text via JavaScript; I wish the article had included such a
caveat.

------
bigethan
I'm curious how strongly Google penalizes SPAs for being slow to load.

The content may be indexed, but if your visitors are on a mobile network, that
initial visit (or a visit with a stale cache) is going to be crappy. It's
great that Google can read the content (though Bing cannot), but if it's
buried on page two, does it even matter?

As someone who is a proponent of web perf, I worry that these kinds of
articles will lead to server-side rendering being ignored because "SEO works
now for JavaScript", even though it's slow and Google has only 70% of desktop
and 80% of mobile search.

~~~
sandercentrical
SPAs should not be slow. If they are, they haven't been designed properly, I
think.

------
jimrandomh
My theory is that the Google crawler is a modified, headless version of
Chrome. These results seem consistent with that hypothesis.

~~~
porker
Are they also using people's browsing history to 'find' content? E.g. from
their safety filter?

Though I don't think it's happening, I've thought it'd be very clever if users
became the search spider for Google, telling them when content had gone stale
and/or doing the spidering on Google's behalf, just by using Google's browser.

~~~
jacquesm
I can confirm that is not the case. I've had a canary page set up for that
purpose for years and it has never fired. If it ever does, you can expect a
blog post. I have another one set up to fire if Google ever uses Gmailed links
to crawl; that one has never fired either.

Now, that's only one data point, but if you want to be sure you can set up a
trigger page of your own.

~~~
kawsper
Have you tried accessing the site through Google DNS? 8.8.8.8 and 8.8.4.4.

~~~
uhoreg
DNS only sees the host name, so it can't be used to see what URLs are being
accessed.

~~~
jacquesm
I think that was the test case he intended: set up a domain that is otherwise
unknown, use Google's DNS, and see if the domain is hit by the search engine.

------
dyoo1979
Probably relevant:
[http://googlewebmastercentral.blogspot.com/2014/05/understan...](http://googlewebmastercentral.blogspot.com/2014/05/understanding-web-pages-better.html)

------
anarchitect
Crawling JS content, yes. But does it _rank_ for that content in the same way
it would if the whole document was generated on the server?

------
captainmuon
I wonder if you could use this to find out more about the Google crawler.
Inject system and browser info into the page; then you can find out what kind
of browser engine it runs, with which settings, etc. If you wanted, you could
use this information to do undetectable masking (I don't think it would work
in the long run, though).
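A sketch of that injection idea, assuming the crawler executes scripts and
indexes the resulting text; the marker token and field list here are made up
for illustration:

```javascript
// Serialize environment details into indexable text; searching later for the
// unique marker reveals what the crawler's rendering engine reported.
function crawlerFingerprint(nav, screenInfo) {
  const fields = {
    marker: "crawler-probe-7f3a9", // unique token to search for afterwards
    userAgent: nav.userAgent,
    languages: (nav.languages || []).join(","),
    screen: screenInfo.width + "x" + screenInfo.height,
  };
  return Object.entries(fields)
    .map(([key, value]) => key + "=" + value)
    .join(" | ");
}

// In a browser you would inject the string into the page so it gets indexed:
//   document.body.appendChild(document.createTextNode(
//     crawlerFingerprint(navigator, window.screen)));
```

Once indexed, a search for the marker token would surface whatever environment
the crawler rendered the page with.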

It would also be interesting to see what timeouts it allows. I wouldn't be
surprised if the modified browser "virtualizes" time and runs
window.setTimeout callbacks immediately. Maybe you could make a busy loop and
find out what the real timeouts are. There have got to be some, otherwise this
would open a way to DoS the crawler (not that I'd do that).

------
roboshake
Google may be indexing dynamic content now, but the question I'm curious about
is how it affects crawl efficiency. I can't imagine indexing JS content is as
efficient as indexing content returned from the original HTTP request.

~~~
_ao789
They just wait for the page to stop changing and then, after some timeout,
take whatever is currently in the DOM as the actual page. That is then sent
back, parsed, and analysed as the content, similar to how PhantomJS does page
rendering with a timeout.
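That "wait until the DOM settles, then snapshot" loop might look roughly like
this. This is a sketch of the general technique, not Google's actual
implementation; in a real renderer `getSnapshot` would be something like
`() => document.documentElement.outerHTML`.

```javascript
// Poll a snapshot function until two consecutive reads are identical or a
// hard timeout expires, then resolve with the last value seen.
function waitUntilSettled(getSnapshot, { intervalMs = 100, timeoutMs = 2000 } = {}) {
  return new Promise(resolve => {
    const deadline = Date.now() + timeoutMs;
    let last = getSnapshot();
    const timer = setInterval(() => {
      const current = getSnapshot();
      if (current === last || Date.now() >= deadline) {
        clearInterval(timer);
        resolve(current); // a stable (or timed-out) DOM is treated as final
      }
      last = current;
    }, intervalMs);
  });
}

// Example: a "page" that mutates three times before settling.
let ticks = 0;
waitUntilSettled(() => (ticks < 3 ? "loading-" + ticks++ : "<p>done</p>"),
                 { intervalMs: 10, timeoutMs: 1000 })
  .then(finalHtml => console.log(finalHtml)); // prints "<p>done</p>"
```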

------
olalonde
Related comment from an HNer who worked on this at Google (from 2006 to 2010):
[https://news.ycombinator.com/item?id=9531344](https://news.ycombinator.com/item?id=9531344)

------
gildas
Regarding SPA-based websites: these results are relevant as long as your site
has only a few pages. I would like to see the same kind of test on a site with
1000+ pages, for example. I did this kind of test in the past and it failed
miserably (only a dozen pages were correctly indexed).

------
spyder
The next test could be: does Google crawl hidden text (display:none, very
small, or very transparent colored text)? My guess is they do crawl it,
because it can have legitimate uses, but if there is too much of it on a page
they give the page a lower ranking.

~~~
spacecowboy_lon
Hidden text can be problematic as it's often a form of cloaking, as is hiding
text off the page using CSS positioning.

------
Sarkie
This article is from May.

[http://searchengineland.com/tested-googlebot-crawls-javascri...](http://searchengineland.com/tested-googlebot-crawls-javascript-heres-learned-220157)

------
andreasklinger
I am sure that Google "discovers" JavaScript/Ajax content; they mention this
in their guides several times.

But are there any experiments/results related to SEO impact/crawl frequency
etc?

------
frik
Offtopic: "Google search results on tablet"

Recently Google changed their search results page for tablets. At first it
looked fine, and useful.

But many times the first results page is now completely full of
advertisements; only the second page shows the usual links to websites like
GitHub, Wikipedia, YouTube, etc. for a common search term. Very annoying! And
the YouTube link is broken on iPad (it tries to link to a non-HTTP address).
Am I just unlucky enough to be part of an A/B test?

A news article about the changes: [http://searchengineland.com/google-launches-new-search-resul...](http://searchengineland.com/google-launches-new-search-results-interface-for-tablets-235340)

------
mywacaday
centrical.com is blocked where I work by McAfee Web Gateway, due to its GTI
reputation identifying it as malicious and high risk.

~~~
sandercentrical
Ha! Interesting. It has been my personal website since 2001 or so, and has
never been compromised. It's a flat-file website on S3 with Cloudflare on top;
not much to hack. I think McAfee is a tiny bit too strict ;-)

------
mcot2
IIRC this is why they originally started development on what is now Chrome.

------
largote
TL;DR: Yes, most of the time.

