
Problems caused by Google crawlers - MBCook
http://www.hackerfactor.com/blog/index.php?/archives/823-More-Google-Abuse.html
======
ryanlol
This guy has had so many issues with various crawlers "attacking" his
(boring!) sites that I'm starting to think he might just be a bit paranoid.

see:

[https://www.hackerfactor.com/blog/index.php?/archives/678-Bo...](https://www.hackerfactor.com/blog/index.php?/archives/678-Bot-
Spotting.html)

[https://www.hackerfactor.com/blog/index.php?/archives/762-At...](https://www.hackerfactor.com/blog/index.php?/archives/762-Attacked-
Over-Tor.html)

[https://www.hackerfactor.com/blog/index.php?/archives/775-Sc...](https://www.hackerfactor.com/blog/index.php?/archives/775-Scans-
and-Attacks.html)

[https://www.hackerfactor.com/blog/index.php?/archives/777-St...](https://www.hackerfactor.com/blog/index.php?/archives/777-Stopping-
Tor-Attacks.html)

[https://www.hackerfactor.com/blog/index.php?/archives/779-Be...](https://www.hackerfactor.com/blog/index.php?/archives/779-Behind-
the-Tor-Attacks.html)

I've done lots of portscanning of the entire internet, and it always generates
vast amounts of abuse emails. Many of these emails are xenophobic rants from
obviously distressed people staring at their access.logs, thinking they're
being targeted by the Russian government or something.

I can't help but wonder if this might be a similar case.

~~~
bitL
On one of my obscure but public domains I always have fun watching Chinese
traffic guess at the VPS admin password at least 3x a day. I guess somebody is
continuously poking at vulnerabilities to get root access to as many domains
as they can, then maybe later selling them to their "customers"? One doesn't
have to be paranoid these days...

~~~
ryanlol
Anybody with a gigabit line can easily bruteforce the entire internet; there
are lots of people doing that.

This is just normal background noise on the internet; one has to be pretty
paranoid to assign any significance to it.

------
crazygringo
What a weird... rant. There are so many questions here:

\- Is he sure it's Googlebot and not somebody spoofing headers?

\- Is he sure Google is generating the queries, as opposed to Google just
following URLs coming from elsewhere?

\- Is he sure it's the search crawler, and not some browser service trying to
cache autocomplete combinations or results?

\- Has he tried simple potential fixes like changing GET to POST for searches,
or using robots.txt to set a crawl delay of something like 30s? (A sketch
follows at the end of this comment.)

It just seems so weird. Google's crawlers have finite resources, so this seems
like a misunderstanding, a really bizarre edge case, or both.
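
For example, a minimal robots.txt along these lines (the /search path is a
placeholder for whatever the site's actual search endpoint is, and note that
Googlebot is documented to ignore Crawl-delay, so for Google specifically the
crawl rate has to be set in Search Console instead):

    
      User-agent: *
      Crawl-delay: 30
      Disallow: /search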

~~~
mintplant
> Is he sure it's Googlebot and not somebody spoofing headers?

In the author's previous post on the subject [0], he verified that the
requests were coming from Google IPs.

> Is he sure Google is generating the queries, as opposed to Google just
> following URL's coming from elsewhere?

From the URLs quoted in the article, it looks more like some sort of (machine
learning-y?) system trying to infer his site's canonical URL scheme.

[0]
[https://www.hackerfactor.com/blog/index.php?/archives/484-Go...](https://www.hackerfactor.com/blog/index.php?/archives/484-Google-
Abuse.html)

~~~
emayljames
Is it possible that it is a Google Cloud (console) IP?

------
burtonator
I've run a large-scale web crawler for a decade:

[http://www.datastreamer.io/](http://www.datastreamer.io/)

The main issues and complaints we run into are mostly unusual requests about
the content we index.

People see the total number of requests and think it's a significant burden
and just have an irrational need to protect their content.

It's almost a form of paranoia at some point.

I mean, we have NO problem removing someone who doesn't want to be indexed,
and 99 times out of 100 we do so without any fuss.

The only times we push back are when government organizations contact us, as
we feel we have a legal right to index the content. We haven't really had to
fight over this, because they usually agree and that's the end of the
discussion.

~~~
zzzcpan
I see hundreds of thousands of requests from search engines daily without any
burden, but I do block all other bots. It has nothing to do with paranoia: a
lot of bots are just evil, which makes it impossible to act toward bots in
good faith.

For example, there are forensic bots, copyright protection bots, brand
protection bots and similar bots whose whole purpose is to use crawled data to
send you threats and "abuse" reports or drag you to court. There are bots that
scan for vulnerabilities to exploit and bots that scan pages to find something
bad to blacklist somewhere. There are bots that generate spammy links or even
grab content for black hat SEO, something search engines punish you for. There
are bots that belong to companies that sell your data or try to gain a
competitive advantage through it without giving you anything in return,
something you may or may not be comfortable with. There are even bots that
gather data for political and intelligence purposes, literally causing death
and destruction.

~~~
brennebeck
Given the wide range of bots you mentioned, I’m very curious how you’re just
blanket blocking all other bots? Could you elaborate a bit?

Edit: add other; ambiguous.

------
kordlessagain
Here's the robots.txt file for hackerfactor.com:

    
    
      User-agent: ia_archiver
      Allow: /
      User-agent: ScoutJet
      Disallow: /blog
      User-Agent: Googlebot
      Allow: /
      Disallow: /blog/index.php?/archives/2018/
      Disallow: /blog/index.php?/archives/2017/
      Disallow: /blog/index.php?/archives/2016/
      Disallow: /blog/index.php?/archives/2015/
      Disallow: /blog/index.php?/archives/2014/
      Disallow: /blog/index.php?/archives/2013/
      Disallow: /blog/index.php?/archives/2012/
      Disallow: /blog/index.php?/archives/2011/
      Disallow: /blog/index.php?/archives/2010/
      Disallow: /blog/index.php?/archives/2009/
      Disallow: /blog/index.php?/archives/2008/
      Disallow: /blog/index.php?/archives/2007/
      Disallow: /blog/index.php?/archives/2006/
      Disallow: /blog/index.php?/archives/2005/
      Disallow: /blog/index.php?/archives/2004/
      Disallow: /blog/index.php?/archives/2003/
      Disallow: /blog/index.php?/archives/2002/
      Disallow: /blog/index.php?/archives/2001/
      Disallow: /blog/index.php?/archives/2000/
      #Disallow: /blog/index.php?/categories/
      User-agent: *
      Disallow: /blog
      Disallow: /badbot-a2d0ac98abcaf3cafd9eff83b3cffa98fec7a390a6c5b9
    

I'm not 100% certain about that ? in there, but I think you can do this
instead:

    
    
      Disallow: /blog/index.php?/archives/*/
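
Google's spec treats `*` in a rule as matching any run of characters and `$`
as an end anchor, while everything else (including the `?`) is taken
literally, so the wildcard form should cover the same URLs as the per-year
rules. A rough sketch of that matching rule in Python, just as a sanity check
(simplified: it ignores percent-encoding and longest-match precedence between
Allow/Disallow rules):

    
      import re
      
      def google_robots_match(pattern: str, path: str) -> bool:
          # '*' matches any run of characters, '$' anchors the end of the path,
          # and everything else (including '?') is treated as a literal.
          regex = "".join(".*" if c == "*" else "$" if c == "$" else re.escape(c)
                          for c in pattern)
          return re.match(regex, path) is not None
      
      # The proposed single rule vs. a sample per-year archive URL:
      print(google_robots_match("/blog/index.php?/archives/*/",
                                "/blog/index.php?/archives/2017/"))  # True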

------
userbinator
_The first wave of the attack began dropping off letters._

I wouldn't call it an attack; it seems more like some sort of autocompletion
checking, given how similar it is to the queries sent to Google when you use
their search autocomplete feature --- one request for each character entered.

That said, Google's search quality has already taken a steep decline and I've
been getting blocked very often for searching more obscure things now, so
whatever "bot mitigations" they put in, I hope they don't make that even
worse.

~~~
_asummers
The claim was that it started with the full URL and progressively removed
characters from it. If it were reversed, I would buy the autocomplete bug. I
fixed something like that a few weeks ago at work.

~~~
mynameisvlad
If Chrome address bar autocomplete generates queries (which I'm not sure it
does), then the following scenario could easily explain what's going on:

\- As a user, I type "foo.com" into my address bar. Having already gone to a
specific page of that site before, autocomplete "helpfully" expands it to that
page's URL "https://foo.com/this/is/a/page"

\- It turns out I don't actually want that; I want the main page, so I start
deleting characters until I'm left with "https://foo.com"

If Google took each of those deletions as a separate URL (either by bug or on
purpose) and tried to verify said URL is valid, I could see a GoogleBot
request being generated for each iteration between the initial URL and the
full URL, starting with that initial one.

~~~
estebank
Given what I've read in the article, I'm fairly confident that what's
happening here is a conjunction of Chrome autocomplete suggestions causing the
crawl of every URL permutation and the website returning the same _content_
for every permutation, instead of using 301 redirects or including a
`rel=canonical` tag.
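
For reference, the canonical hint is just a one-line tag in the page's head
(the URL below is a made-up example, not the author's actual URL scheme):

    
      <link rel="canonical" href="https://example.com/blog/some-post">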

------
dymk
Some of these mitigation tactics seem a little much, given the situation; did
the author consider a robots.txt first?

~~~
greglindahl
robots.txt can certainly keep Google away from your search box, but it won't
defend you against the lower-case probes, the extra-letter probes, the
delete-a-character probes, etc.

~~~
dwild
Yet he didn't try it. It's seriously weird that he doesn't mention trying it
(he updated his post to say robots.txt doesn't support regex, yet it does
support wildcards, which is all that's needed for a search).

He removed the search feature before even trying a simple robots.txt...

------
AndrewStephens
I had no idea that Googlebot (and others) will sometimes POST to URLs [0] -
this seems like a terrible idea.

I imagine what happened to this guy is that somebody coded up a site that used
his site as a service, and Google spidered it and went crazy trying to follow
the links. It sucks, but that is what robots.txt is for. Anything that
requires more than just serving a page should be blocked.

You can drive yourself insane reading your logs trying to guess what bots are
trying to achieve. They seem to excel at spidering exactly what you don't want
while ignoring the content that would actually be useful.

The one time I did have a problem with a misbehaving crawler was with
something called 80legs [1].

[0] [https://webmasters.googleblog.com/2011/11/get-post-and-
safel...](https://webmasters.googleblog.com/2011/11/get-post-and-safely-
surfacing-more-of.html)

[1]
[https://sheep.horse/2012/8/80legs_is_a_pain_in_the_neck.html](https://sheep.horse/2012/8/80legs_is_a_pain_in_the_neck.html)

~~~
crazygringo
That is pretty crazy; I'm kind of shocked.

I'm sure their ML is pretty decent... but I do have to wonder if it's ever
wound up leaving hundreds of comments on an anonymous forum or something,
thinking it was a search box.

------
WolfRazu
This sounds to me like someone is simply putting Googlebot in their User-Agent
header whilst probing.

------
partiallypro
You can adjust the crawl rate in Google Webmaster Tools.

------
xg15
I'm surprised by almost all comments in this thread dismissing the article
completely.

Google does field trials all the time in Chrome. Why would it be so
unrealistic that they'd do similar trials with Googlebot?

------
emilfihlman
Related thread:
[https://news.ycombinator.com/item?id=16302821](https://news.ycombinator.com/item?id=16302821)

Googlebot is stupid af and their support is even less helpful.

TL;DR: Google just scans your site even if it always gets 404s for millions of
requests.

------
Kiro
> Unfortunately, I had to remove that feature because Google began submitting
> random dictionary words. Hundreds of thousands of them.

That sounds insane. Is that actually true, or is there an explanation of some
sort?

~~~
dlubarov
Googlebot does sometimes submit search forms in an attempt to discover new
pages, though I'm surprised to hear "hundreds of thousands". Either that's an
exaggeration or it was over a very long time period.

As other comments mentioned, he could have easily prevented this with
robots.txt:

    
    
        User-agent: *
        Disallow: /search

~~~
mynameisvlad
Would that work if the search box was on the main page, though? Presumably,
Googlebot wouldn't index the results, but is it smart enough to know that the
form submits to the disallowed page?

~~~
greglindahl
Yes, Googlebot (and everyone else, really) is smart enough to figure out if a
form action is disallowed in robots.txt.

------
dawnerd
A good start:

[https://support.google.com/webmasters/answer/80553?hl=en](https://support.google.com/webmasters/answer/80553?hl=en)

Then use robots.txt to block any dynamic route like search.

A solid sitemap could help too.
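
Something like the following would cover both (the search path and the domain
are placeholders, not the actual site's setup):

    
      User-agent: *
      Disallow: /search
      Sitemap: https://example.com/sitemap.xml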

------
jrnichols
What's surprising to me is how many hoops people have to jump through just to
tell Google to go away.

