
Ask HN: How to stop Google indexing dynamic search pages? - scalesolved
Hey HN folks,

A few months ago I received a manual action penalty from Google as they detected spam pages on our domain. The problem was that when people searched on our site they were directed to a page like:

https://$domain/search?query=$QUERY

Some users (most likely bots) are generating huge numbers of spam searches on our search page, and somehow Google is indexing these, even though there are no inbound links to these pages (at least I cannot find any).

To resolve this I did the following:

* On our search page I set the following header: X-Robots-Tag: noindex (based on the documentation here: https://developers.google.com/search/reference/robots_meta_tag).

* Submitted URLs to be dropped from Google's index via the Webmaster console.

* Submitted 3 reconsideration requests to Google to avoid the penalties.

In theory this should stop all search pages from being indexed (as they all return the noindex header), and it has helped drop the number of indexed pages marked as spam by 99%. However, we still have a significant number of URLs marked as spam, and so our site has a penalty from Google.

Has anyone had this issue before? How can I stop these pages from being indexed when I have the noindex header set _and_ there are no inbound links to the spam URLs?

Any help appreciated folks!
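The header approach described above can be sketched framework-agnostically. This is a minimal illustration, assuming search URLs live under a `/search` path prefix as in the pattern shown; the function name and structure are hypothetical, not the poster's actual code:

```python
# Framework-agnostic sketch: decide extra response headers per request path.
# The "/search" prefix is an assumption based on the URL pattern above.
def robots_headers(path: str) -> dict:
    """Return extra response headers for the given request path."""
    if path.startswith("/search"):
        # Tell crawlers not to index any search-results URL.
        return {"X-Robots-Tag": "noindex"}
    return {}
```

In a real app you would merge the returned dict into every response from the search endpoint, so each dynamically generated results page carries the noindex directive.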
======
jacquesm
Hilarious how Google thinks they are now in editorial control of your content,
to the point where you are on the hook for fixing _their_ bugs. You're being
treated as a wayward content provider, when instead they should be happy to
get the benefit of indexing your content.

~~~
Fnoord
Problem, as usual, lies with the default being opt-out instead of opt-in. They
did the same with WAPs. Now mine needs "_nomap" appended to its SSID.

~~~
kiallmacinnes
While I agree opt-in would certainly be better, I really see no way to
transition from the current state to 100% opt-in. And I think opt-out was
the right choice at the beginning - nobody would have updated their sites to
suit a startup search engine.

That said, Google & co could set a date after which new content / pages will
not be indexed unless marked as indexable. This would allow historic content
to remain indexed, with new content being opt-in, at the expense of any new
search engines being unable to index historic content: how could they tell
whether content has not opted in, or is simply no longer actively maintained,
without a pre-existing index?

------
helij
You need to add <meta name="robots" content="noindex, follow"> to the <head>
section of all your search results pages.

You want robots NOT to index pages but to still follow links on your search
pages.

Create a clean sitemap.xml file and submit it to Search Console.
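Generating that sitemap can be sketched in a few lines. This is an illustrative example only (the function name and URLs are placeholders); the point is to list only the canonical, indexable pages and leave all `/search?query=` URLs out:

```python
# Hedged sketch: build a minimal sitemap.xml listing only the pages you
# actually want indexed (never the dynamic search-results URLs).
import xml.etree.ElementTree as ET

def build_sitemap(urls):
    """Return a sitemap.xml document for the given list of canonical URLs."""
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for url in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")
```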

Another way is to just canonicalize all search results pages to your search
page.

With Google and these things time is involved. Once it's in the index it will
take time to properly clean everything up. How was the traffic before this
happened? Did the website rank for any decent keyword? Sometimes when this
happens the smart thing to do is to just start from scratch with a new domain.

If you want more extensive help email me.

------
dgranda
Based on my experience:

A.- I would also add "nofollow, noarchive" tags [1] to your X-Robots-Tag
header:

- "nofollow" -> do not follow (i.e., crawl) any outgoing links on the
page.

- "noarchive" -> prevents Google from showing the Cached link for a page.

B.- I would specify in Search Console (formerly Webmaster Console) how Google
should handle the "query" parameter [2]

C.- Prevent those spam searches by blocking source IP address, User-Agents,
combinations of both, etc.

Good luck!

[1]
[https://support.google.com/webmasters/answer/79812?hl=en](https://support.google.com/webmasters/answer/79812?hl=en)

[2] [https://www.google.com/webmasters/tools/crawl-url-parameters...](https://www.google.com/webmasters/tools/crawl-url-parameters?hl=en&siteUrl=https://<domain>/)

~~~
scalesolved
Thanks for the suggestions!

A) I'm going to add the nofollow and noarchive to see if that helps the issue.

B) I've already set the search console to ignore the query parameter but I'm
still getting new spam results coming in.

C) I've been looking into this but so far the meta information for the spam
requests is not consistent and it's tricky to identify so far.

Thanks for the help and the luck, I think I'll need it!

------
tangue
You should use the canonical tag. Moz has a good page on how it works.

[https://moz.com/blog/canonical-url-tag-the-most-important-ad...](https://moz.com/blog/canonical-url-tag-the-most-important-advancement-in-seo-practices-since-sitemaps)

------
sebst
You could also annotate your page.
[https://schema.org/SearchResultsPage](https://schema.org/SearchResultsPage)

Edit: Maybe it is also worth annotating the search field
([https://developers.google.com/search/docs/data-types/sitelinks-searchbox](https://developers.google.com/search/docs/data-types/sitelinks-searchbox))
so that Google can match it against your search results page.

------
Jaruzel
Register for Google Webmaster tools. There's an option in there to exclude
links that have dynamic parameters. You can define the parameters you want it
to ignore.

------
itamarst
Maybe also add a robots.txt?
[http://www.robotstxt.org/](http://www.robotstxt.org/)

~~~
scalesolved
Would adding robots.txt help? I cannot blanket-ban the page, as it needs to
display to the user with 0 or more results, and my understanding is that,
whilst robots.txt would take precedence for indexing settings, having noindex
defined in the header for that page ought to achieve the same thing.

~~~
wejick
I think adding a rule for /search* to your robots.txt is an easier way to
block Googlebot.

~~~
dougunplugged
Don't do this until Google has seen your new robots meta directives (or
canonical tag) on these pages, so that they drop from the index. Then you can
add this to your robots.txt to prevent the pages being crawled again.

------
eddflrs
Adding <meta name="robots" content="noindex" /> to each page should work. Also,
as a heads up, a disallow entry in robots.txt is not enough on its own, since
pages can still be indexed if they are linked from anywhere else on the web.

~~~
aidos
I thought robots.txt was meant to be pulled from the domain and honoured
anyway. At least that’s what used to happen. Just because someone links to you
doesn’t mean the spiders should crawl all the content

------
computator
Can anyone answer a related question: Are you penalized for _not_ running
Google Analytics and/or Google Webmaster tools? In other words, if you have a
clean website with no analytics whatsoever, is your ranking likely to be
worse?

~~~
Topgamer7
Anyone who doesn't work at Google wouldn't be able to answer this question
with absolute certainty, and anyone who does work at Google wouldn't be
allowed to answer it. However, ranking is widely attributed as being based
mainly on backlinks.

~~~
Jedi72
If not using GA doesn't affect your page rank, I don't see why Google would
need (or even want) to be secretive about it. Only if it does hurt your
ranking (how is that huge antitrust lawsuit feeling today, Big G?) would they
need to be all secretive about it.

~~~
Topgamer7
It is not in their best interest to be open about everything. From their
perspective it is better to leave it a mystery and let people assume that not
using their services negatively affects your rank, whether it actually does or
not. From your perspective: you need a good ranking? You get GA.

~~~
taneq
It's not a threat. It's the _implication_.

------
emilfihlman
Heh, I ran into a similar issue previously:
[https://news.ycombinator.com/item?id=16302821](https://news.ycombinator.com/item?id=16302821)

GoogleBot is broken.

------
detaro
Are they still being added newly, or have just not been purged from Google
index yet?

~~~
scalesolved
There are still new spam search queries being indexed.

~~~
detaro
Do you have a robots.txt entry that's stopping Google from fetching them? That
can counter-intuitively cause Google to index pages.

~~~
stevenicr
In my limited experience, the robots.txt is helpful, but not a stop-all. The
big G still indexed a bunch of my pages because a certain group was creating
links (off of my site) to the spammy pages - which makes G index them.
However, even if you have something like:

Disallow: *spamresult*

Disallow: *search*

they can still end up in the index, just with a note that says "no description
is available for this page".

I remember years ago the debate where Matt Cutts asked if G should index
these, and pointed out that other engines were indexing pages that were
robots.txt-blocked.. meh.

I had to set up a 301-to-homepage redirect system to zap all the pages I took
out... although some other engines still spider looking for those pages even
though I removed them with 301s over a year ago - perhaps the spammers still
have links going to them?

I started just blocking all indexing from Sogou (or whatever it's called) and
similar bots in the robots.txt, and then started to look at IPs / CIDRs to
block further after thinking they would get the hint after several months.

Hope your situation is different.

~~~
stevenicr
Just realized that I had put the asterisk * in front of and after the two
words I had with Disallow up above, but that kicked in HN formatting instead
of showing Disallow: *search with another asterisk after it is what I mean to
show.

------
lgats
Blocking the search function in robots.txt may help as well.

User-agent: *

Disallow: /search

Disallow: /search*

------
known
You can restrict in .htaccess
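A sketch of the .htaccess approach, assuming Apache 2.4 with mod_headers enabled (the `/search` prefix matches the URL pattern from the question; this is illustrative, not a tested production config):

```apache
# Hedged sketch: attach the noindex header to every URL under /search.
<If "%{REQUEST_URI} =~ m#^/search#">
    Header set X-Robots-Tag "noindex"
</If>
```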

~~~
officialchicken
That might work, but only if you use Apache. If there is a CDN or a different
server or proxy in front (nginx, Varnish), it's probably better to rely on
meta tags, robots.txt, and/or a canonical tag.

