
The day I was DoSed by Google - thekeywordgeek
http://thekeywordgeek.blogspot.com/2015/03/the-day-i-was-dosed-by-google.html
======
johnmu
Hi, I work with the Google crawling & indexing teams. Let me check with them
to see what their recommendation would be. At first glance (based on the pages
cached), it seems like we're just following the links within your site, like
we would with other websites. In general, there are a few things one could do
in a case like this (I might have more from the team later; these are in no
particular order):

- Use rel=nofollow on links you don't need to have followed (this prevents
passing of PageRank, which generally means we're less likely to crawl them)

- Use 503 for rate-limiting crawlers. 503 means we'll just retry later.

- Use the crawl rate limit in Webmaster Tools (I see you submitted the report
there, so that should be active soon)

- If the content is fully auto-generated, you might choose to use a
"noindex,nofollow" robots meta tag on these pages to prevent them from being
indexed separately. It's hard for me to judge how useful your content would be
in search directly.
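
A minimal sketch of that last option, assuming a Flask-style handler (the site
itself runs on GAE, so the exact plumbing will differ; render_trend and the
route are hypothetical). Google also honours the X-Robots-Tag HTTP header as
an equivalent of the robots meta tag:

    from flask import Flask, make_response

    app = Flask(__name__)

    @app.route("/politics/<path:trend>")
    def trend_page(trend):
        resp = make_response(render_trend(trend))  # hypothetical renderer
        # Equivalent to <meta name="robots" content="noindex,nofollow">
        resp.headers["X-Robots-Tag"] = "noindex, nofollow"
        return resp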

~~~
thekeywordgeek
A big thank you for enquiring on my behalf!

A 503 would still require a GAE instance to be running, so it wouldn't
necessarily deal with my problem.

I have seen "noindex,nofollow" kill a site stone dead in the past, so I am
very wary indeed of using it. In my experience, once you've noindexed a page
it is nigh-on impossible to get the engine to index it again.

My content is autogenerated, though I hope it has enough value to be
considered useful. It's time-series data of word frequencies in politics, so
for example you might use it to see how one candidate is doing relative to
another in an election campaign.

~~~
johnmu
FWIW I think the main problem is that you're essentially creating an "infinite
space," meaning there's an extremely high number of URLs that are findable
through crawling your pages, and the more pages we crawl, the more new ones we
find. There's no general & trivial solution to crawling and indexing sites
like that, so ideally you'd want to find a strategy that allows indexing of
great content from your site, without overly taxing your resources on things
that are irrelevant. Making those distinctions isn't always easy... but I'd
really recommend taking a bit of time to work out which kinds of URLs you want
crawled & indexed, and how they could be made discoverable through crawling
without crawlers getting stuck elsewhere. It might even be worth blocking
those pages from crawling completely (via robots.txt) until you come up with a
strategy for that.

~~~
johnmu
And one more thing ... you have some paths that are generating more URLs on
their own without showing different content, for example:

[http://www.languagespy.com/politics/uk/trends/70th/70th-
anni...](http://www.languagespy.com/politics/uk/trends/70th/70th-
anniversary/70th-anniversary?startDate=2015-02-11&endDate=2015-03-10)
[http://www.languagespy.com/politics/uk/trends/70th/70th-
anni...](http://www.languagespy.com/politics/uk/trends/70th/70th-
anniversary?startDate=2015-02-11&endDate=2015-03-10)
[http://www.languagespy.com/politics/uk/trends/70th-
anniversa...](http://www.languagespy.com/politics/uk/trends/70th-
anniversary?startDate=2015-02-11&endDate=2015-03-10)

I can't check at the moment, but my guess is that all of these generate the
same content (and that you could add even more versions of those keywords in
the path too). These were found through crawling, so somewhere within your
site you're linking to them, and they're returning valid content, so we keep
crawling deeper. That's essentially a normal bug worth fixing regardless of
how you handle the rest.
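
One hedged way to close that hole, assuming a Flask-style app (the route and
helpers here are illustrative, not the site's actual code), is to compute one
canonical path per keyword and 301-redirect every other variant to it, so
crawling stops discovering "new" URLs:

    from flask import Flask, redirect, request

    app = Flask(__name__)

    def canonical_path(keyword):
        # Hypothetical: one true URL per keyword, however it was reached.
        return "/politics/uk/trends/%s" % keyword

    @app.route("/politics/uk/trends/<path:rest>")
    def trend(rest):
        keyword = rest.split("/")[-1]  # last path segment is the real keyword
        canonical = canonical_path(keyword)
        if request.path != canonical:
            # Permanent redirect collapses all path variants onto one URL,
            # preserving the query string (the date range).
            qs = request.query_string.decode()
            return redirect(canonical + ("?" + qs if qs else ""), code=301)
        return render_trend(keyword)  # hypothetical renderer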

------
arcatek
It's weird that Google is charging for the GoogleBot bandwidth on its own
services. Of course they absolutely don't have to do it, but stories like this
make me worry about using the Google Cloud Storage.

[edit] That being said, the issue would be the same if it was another hoster
or another search engine. I guess the real solution would be to be able to
limit the crawl rate, as the OP said.

~~~
skj
I thought about that too (I'm a Googler working on cloud), but then a
colleague mentioned that this would become a way to get free computation from
Google.

So, while I agree with the sentiment that it sucks that this crawling eats the
quota, the solution is not to simply bypass the quota.

~~~
DanBC
> but then a colleague mentioned that this would become a way to get free
> computation from Google

I'm a bit confused. What computation does the GoogleBot cause to be performed
that benefits the Google service user? (Not Googlebot related stuff like
indexing).

EDIT: Thanks kyrra!

~~~
kyrra
Have millions of pages with no real content. Every time someone tries to load
a page, do some intensive task (e.g. mining bitcoins). If you just make the
pages appealing to GoogleBot and no one else, you get free computational
resources.

~~~
belorn
If the mining is done on Google Cloud Storage, initiated by a Google search
bot, can't Google then identify and handle such abuse? I assume Google already
scans for multiple types of abuse, such as sites that spread malware.

~~~
nemothekid
Google shouldn't really have to do this. Replace GoogleBot with BingBot or GCE
with AWS and you still have the same problem. A website operator should be
working to make sure search crawlers don't consume too many resources, given
that the bots follow the rules.

Otherwise you'd have a team at every cloud provider trying to figure out how
to manage bots.

------
AznHisoka
I once had a site with over a million auto-generated pages. I thought if even
one user visited one page a day, I'd be rich.

I no-indexed all of them because they had thin content. Guess what happened to
my traffic? Almost no effect.

Stop assuming Google will send you traffic for auto-generated pages - do you
really think Google will even display them in the first few pages over quality
content actually written by human beings?

Allowing those auto-generated pages to be indexed will do you more harm than
good. Noindex them.

------
nitrix
Googlebot respects cache control. Have you tried more aggressive caching?
There are free solutions out there, like CloudFlare. You can even throttle
your site, block some bots, etc., so at least your backend doesn't get shut
down.
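
For example, a sketch assuming a Flask-style app (the max-age value is
arbitrary): a long public max-age lets a CDN like CloudFlare answer repeat
crawler hits from its edge cache instead of your backend.

    from flask import Flask

    app = Flask(__name__)

    @app.after_request
    def add_cache_headers(resp):
        # Public + long max-age: intermediaries such as CloudFlare can
        # cache the page and serve Googlebot without hitting the backend.
        resp.headers.setdefault("Cache-Control", "public, max-age=86400")
        return resp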

------
andrewstuart2
What's the point of letting Google index a site that nobody can get to? I'm
pretty sure you're going to tank in their rankings anyway, given how
sporadically available your site is.

I'd probably not allow spiders to crawl more than a few chosen pages (home,
about, etc.) until you have enough revenue to support them going to other
pages.

------
corobo
Rather than letting it hit quota error pages, would it be feasible to give
Googlebot a 503 back after a certain number of pages? Setting a Retry-After
header for the next day should let it know when to come back for more.

From
[https://plus.google.com/+PierreFar/posts/Gas8vjZ5fmB](https://plus.google.com/+PierreFar/posts/Gas8vjZ5fmB)
(Not sure how official this is, but Pierre appears to work for Google.)

Primarily the section

"2\. Googlebot's crawling rate will drop when it sees a spike in 503 headers.
This is unavoidable but as long as the blackout is only a transient event, it
shouldn't cause any long-term problems and the crawl rate will recover fairly
quickly to the pre-blackout rate. How fast depends on the site and it should
be on the order of a few days."

Edit: Looks like the over-quota page is a 503. Couldn't hurt to do it early
yourself; Googlebot will see it the same way regardless of what serves the
503.
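
A sketch of that idea, assuming Flask-style hooks (the in-memory counter here
stands in for real quota tracking - on GAE it would need memcache or the
datastore - and the budget is arbitrary):

    import time
    from flask import Flask, request

    app = Flask(__name__)

    DAILY_BOT_BUDGET = 20000          # arbitrary illustrative threshold
    bot_hits = {"day": None, "count": 0}

    @app.before_request
    def throttle_googlebot():
        if "Googlebot" not in request.headers.get("User-Agent", ""):
            return
        today = time.strftime("%Y-%m-%d")
        if bot_hits["day"] != today:
            bot_hits["day"], bot_hits["count"] = today, 0
        bot_hits["count"] += 1
        if bot_hits["count"] > DAILY_BOT_BUDGET:
            # 503 + Retry-After asks Googlebot to back off and retry
            # later without treating the pages as gone.
            secs_left = 86400 - (int(time.time()) % 86400)
            return "Over crawl budget", 503, {"Retry-After": str(secs_left)}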

------
lancer383
Silly question, but wouldn't putting something like Crawl-delay in the
robots.txt file (somewhat) help alleviate the situation (if it is respected)?

Or maybe even just block crawling of the entire site except for the homepage?
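
For the second idea, a robots.txt along these lines would do it (Allow and the
"$" end-anchor are Google-supported extensions, not part of the original
robots.txt standard):

    User-agent: *
    Allow: /$
    Disallow: /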

~~~
thekeywordgeek
Google doesn't respect crawl-delay, sadly. They rely on the Webmaster Tools
setting, which is unavailable to me as I've described.

------
chillydawg
Restrict it 100% in robots.txt for now so it at least works? And then once
you've managed to get through to a human (lol, good luck :( ) at Google, you
can go from there.

~~~
thekeywordgeek
See my reply to jacquesm above. Very wary of blocking, as sometimes persuading
the engine you've unblocked it afterwards is nigh-on impossible.

~~~
hluska
Do you have any idea if this service will attract any actual users? Based on
what I've read, I can't even figure out what it does, so I am certainly not a
potential user. But do you have any actual demand?

What I'm hearing is that you built a massive application, you've run into a
technical problem, and now you would rather wait on Google to fix it than act
on any suggestions for getting it up for actual users. Seriously, don't do
this - at your stage, it would be better to have 10 real users than a site
that has been fully indexed by Google.

On your note about persuading Google to index your site after being excluded:
do you have any actual experience with this happening? I've been doing this
kind of stuff for years and have never had a problem. It can take five or six
weeks at the outside, but that is still less of a problem than a product that
can't be accessed...

------
jrochkind1
You can easily tell Google not to index your site with a robots.txt --
although you probably don't want to.

You can also tell Google to index your site more slowly, in Google WebMaster
Tools, although if I remember right the setting expires every few months, and
needs to be reset.

The odd thing here IS that webmaster tools won't let him restrict the crawl
rate, that's very odd.

(Also, it would be nice if you could restrict crawl rate in robots.txt, not
just webmaster tools).

In the end though, if your business is going to depend on Google indexing it,
then you don't really want to tell Google not to -- or, really even to tell it
to index more slowly. But a robots.txt can be a temporary measure while you
figure out what to do -- if you want Google to index your site, you've got to
make your site able to stand up to googlebot traffic. Caching is often pretty
helpful, and can help with your site's reliability and performance beyond
googlebot issues.

That's kind of just the way it is, right? If you want google index, you've got
to be able to handle googlebot. Nothing too shocking here?

Caching is definitely something to look into, that can improve the reliability
and performance of your site beyond just dealing with googlebot.

I guess the odd thing is that Webmaster Tools is not letting the author rate-
limit. And it would be really nice if google defined and respected some
extension to robots.txt to do it there. I guess you could always rate-limit
google bot with your own firewall-ish tools, but it might make googlebot mad
and you might get even less indexing than you wanted.

Really, if your product's success depends on google indexing it, you don't
want to slow it down anyway, except maybe as a temporary measure -- you're
going to have to figure out how to handle it. People are usually complaining
about how to make sure googlebot comes to _more_ of their site _more often_,
not the reverse!

------
bryanrasmussen
Given the restrictions you've laid out (which I wouldn't necessarily accept
myself), I would add a check at the beginning of the request: if it's the
Google bot, give it back a 503 or something else. You could then condition
that check on other parameters, for example the time of day, or open up just
the parts you want indexed at certain times, as others have suggested.

~~~
thekeywordgeek
I'd be concerned about adverse effects on my indexing. But it wouldn't fix my
problem, as it would still be a request requiring a GAE instance to handle
it.

(edit) Yes, the bot is still hitting the GAE site atm even though it's
returning a quota error.

~~~
hudell
But would the bot do the same number of requests if it got an error?

------
jacquesm
You could simply limit the pages through your robots.txt and _slowly_ expose
them in your sitemap. That would give you control over how many pages are
spidered. If you want _all_ your pages spidered and you dump millions of pages
into the system the bot will hit you hard, but that's only because you offer a
lot of pages to begin with, so that's where you can throttle the accesses.
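
A sketch of the "slowly expose" part (all helper names hypothetical, schedule
arbitrary): regenerate the sitemap periodically with a growing slice of the
full URL list, keeping everything else disallowed in robots.txt until it
shows up here.

    # Hypothetical: all_page_urls() returns every page URL, ordered by
    # how much you want each page indexed.
    BATCH = 1000

    def build_sitemap(week_number):
        urls = all_page_urls()[: BATCH * week_number]  # grow the set weekly
        entries = "\n".join(
            "  <url><loc>%s</loc></url>" % u for u in urls
        )
        return (
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
            "%s\n</urlset>" % entries
        )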

~~~
thekeywordgeek
I am a little concerned about doing that, though. Having seen sites killed
stone dead by people blocking stuff by mistake in robots.txt, and the engines
then never looking at them again, I'm very wary indeed of blocking stuff I
intend to unblock later.

~~~
jacquesm
It sounds like you care more about google traffic than you care about real
users, if you want to do this without having more than a few entries in your
robots.txt and your sitemap then you could simply remove the other pages until
you're ready to have them spidered, alternatively have them behind a login and
hand out invitations.

In a nutshell, if you put up millions of pages and tell google about it it
will index you, if you don't want that you'll have to make choices about the
quantity and/or switch to a different kind of host.

Also, this kind of 'bot trap' tends to attract penalties so if this is not
some ploy to get traffic out of google you may want to re-consider how you've
laid things out, the difference between a legitimate site with a lot of
generated pages and a page-spammer is hard to determine and google tends to
err on the side of caution.

------
simonz05
We have a medium-sized website with lots of user-generated content. I think
Google has indexed ~2M unique pages, and they crawl ~420K every day. This
costs us about 1.5GB/day in traffic. A dedicated server with a volume traffic
option usually goes a long way in terms of cost if you have this type of
traffic pattern. In your case, a cheap E3-1220 at €49 from LeaseWeb would
take you far.

------
tempestn
As a short term solution, have you considered Cloudflare? If Google is
repeatedly crawling the same static pages, you can have them served from the
Cloudflare cache instead, and the free version should still have the features
you need.

Another option would be to move to a dedicated server. You can get quite a
powerful server from a company like LiquidWeb for a few hundred dollars a
month. (A "managed" server, so although a bit of know-how is needed to get it
performing optimally, they can help you with the basics at least.) I expect
with a bit of tuning of your web server (nginx or even apache with mpm_event
or worker) you could handle that level of traffic even without caching, but
you could also use something like Varnish to do even better.

------
jjoe
I'm not familiar with GAE and I suspect this won't work there, but we've
recently helped a client deal with enormous bot traffic. I dubbed the whole
thing the "bot-split approach": one LB routes based on user agent. One
traffic line goes to a heavily cached server running Varnish, hooked to a
"stale" DB (updated hourly); the other line is dedicated to realtime user
traffic with almost zero caching. Heavy caching keeps the bot box alive at an
OK load, while the real user node has plenty of wiggle room.
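
The routing decision itself is simple. A sketch of the user-agent check such
an LB (or a WSGI middleware in front of the app) might apply - the backend
names are made up for illustration:

    BOT_MARKERS = ("Googlebot", "bingbot", "YandexBot", "Slurp")

    def pick_backend(user_agent):
        # Bots go to the heavily cached Varnish box backed by the hourly
        # "stale" DB; everyone else gets the realtime backend.
        if any(marker in (user_agent or "") for marker in BOT_MARKERS):
            return "cached-backend:8080"   # hypothetical host names
        return "realtime-backend:8080"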

------
NKCSS
How big is the site? I have rented a dedicated server with 100Mbit unmetered,
1TB storage, 16GB RAM, and two Xeon CPUs for under €100 a month; maybe it's
better to look at such an option instead.

~~~
thekeywordgeek
Really big :)

Its source is the English language, so if there's a word or phrase that gets
used, it has a result. Corpus linguistics is fun like that.

~~~
samsk
If you have such a big site, you should avoid a resource-priced cloud and go
with your own VPS. It might take you more time to set up, but it will
definitely be much cheaper ($50/month or less), and it can surely handle all
your traffic until it grows really big...

~~~
edwintorok
Yeah, see here for example for a 1TB/month traffic VPS:
[http://iwstack.com/](http://iwstack.com/), or unmetered traffic with a
dedicated box: [http://www.online.net/en/dedicated-server/dedicated-
server-o...](http://www.online.net/en/dedicated-server/dedicated-server-
overview-start)

The cost-effectiveness really depends on whether your data would fit into
that 1TB or would require much more.

~~~
Aldo_MX
Even if his data doesn't fit into that 1TB, he can always spread storage
across different VPSes and use a load balancer.

Saying "I have infinite data" is not an excuse for not looking at
alternatives.

------
ikeboy
[http://www.behind-the-enemy-lines.com/2012/04/google-
attack-...](http://www.behind-the-enemy-lines.com/2012/04/google-attack-how-i-
self-attacked.html)

Discussion of the above:
[https://news.ycombinator.com/item?id=3890328](https://news.ycombinator.com/item?id=3890328)

------
blueskin_
>Your site has been assigned special crawl rate settings. You will not be able
to change the crawl rate.

That's the messed up part. I guess the question is, does robots.txt override
that or not? If it does, fine. If not, all you need to do is make a few
"google ignores robots.txt" posts and the problem solves itself.

~~~
corobo
Googlebot doesn't use the Crawl-delay line in robots.txt, if that's what
you're asking. I guess theoretically that's because you can normally change
the rate in the Webmaster Tools interface.

It's not really ignoring robots.txt either, as Crawl-delay is not an
"official" setting.

------
andmarios
We had a similar experience at marine.travel a couple of weeks ago.

Google would crawl our GAE site in bursts of about 30,000 requests per
4-minute period. We had some quota-exceeded moments.

On the other hand, we got to load-test our MongoDB backend on GCE without
writing Gatling tests. The results weren't promising for our ~$180/month VM.

------
rasz_pl
'I made a site generating an infinite number of pages filled with
auto-generated /dev/urandom. It's so precious I want GoogleBot to index ALL
THE THINGS!

..and so googlebot indexes ALL THE THINGS, eating my quota; if only it
indexed my garbage slower.'

Cool story brah.

------
vampirechicken
How about GAE learning to recognize Google's own web crawlers and not
penalizing GAE users for that traffic?

------
tux
"The fact that it's Google who are causing me to use up my budget with Google
is annoying but not sinister..."

Actually, it is: Google should detect its own GoogleBot and not charge people
for the traffic it uses, because this could be exploited on purpose to make
people pay more. Very interesting article. Thank you, Jenny.

------
vkjv
Could you maybe start serving 503s to googlebot after hitting a threshold?

------
tzakrajs
When working in web hosting, I noticed Googlebot would take down many poorly
optimized websites. Your site needs to be able to absorb a simple search
indexer crawling it. Next, the OP will be writing about Yandex, Yahoo, or
Bing DDoSing his site.

------
bhartzer
I would definitely restrict Googlebot from accessing the site, as it's acting
like a "bad bot". Besides, does your business model rely on search engine
traffic? I hope not.

~~~
falcolas
Most websites rely on search engine traffic. It's how most people access
websites these days - type "hacker news" into Google/Bing/Yahoo and click on
the top link.

Especially if they haven't been there before.

------
carsonreinke
Google Webmaster Tools allows you to control crawling, or you can just use a
simple robots.txt.

~~~
bonestamp2
He mentioned in the article why neither of those suffices.

