
Issue 8788 - Every day around 9 AM Brussels time, huge drop in GAE performance - thijser
https://code.google.com/p/googleappengine/issues/detail?id=8788
======
hosay123
> _I'm going to assume everyone experiencing this issue is using M/S.
> Upgrading to HRD will solve your issue._

This is _the_ reason I abandoned AE and part of why adopting a platform that
isn't standardized is incredibly dangerous. The problem is technical debt
constantly accrues _even when you aren't making changes_.

Even though the API was unchanged, HRD differs subtly enough that breakage can
occur on any non-trivial project. Edge cases (how indices behave within
transactions comes to mind, but there are plenty more examples) will see new
semantics compared to M/S, and so this "upgrade" involves not only thorough
testing and auditing, but likely also code changes and potentially significant
engineering hours.

<http://goo.gl/HVuaC>: _These techniques are not needed with the (now
deprecated) Master/Slave Datastore, which always returns strongly consistent
results for all queries._

This means a project written and signed off circa 2011 requires mandatory
engineering costs just to continue running in a functioning and supported
fashion. An AE app will never quite resemble that ancient perl5 behemoth
running uninterrupted since 1997, as the underlying implementation and
recommended APIs are constantly modified and replaced (Datastore, NDB, Python
major version).

 _"A strong test suite will save your soul!"_ I hear you say, tests that a
small project might have survived without if targeting any other platform, and
testing on AppEngine is also yet another moving target (for example, testing
nested subrequests was all but impossible using the SDK until relatively
recently).

The promise was a carefree life for a project willing to code against their
proprietary APIs; the reality is a constantly moving target, "not quite free"
autoscaling and the threat that while you're asleep an unannounced change will
take down your app (I could name a few, but as many will attest this has
happened regularly since launch).

~~~
Ensorceled
> The promise was a carefree life for a project willing to code against their
> proprietary APIs; the reality is a constantly moving target, "not quite
> free" autoscaling and the threat that while you're asleep an unannounced
> change will take down your app (I could name a few, but as many will attest
> this has happened regularly since launch).

Yeah, I got sucked in with the same promise and I had the exact same sour
experience. Including panicky calls from the client when the app suddenly
stopped working. The maintenance windows used to plop right in the middle of
my client's busy time, once a month at least and often more.

The worst part is the apologists, like lysprr@gmail.com in the original bug
report:

> I got here from HackerNews, but after seeing the original poster spam the
> forums in multiple places and have a bad attitude, I can't blame Google for
> not fixing what looks to me like a non-issue.
>
> Fuck 'em.

They always reply to your request for help while you're attempting to fix a
suddenly dead application and a totally screwed client.

------
rachelbythebay
Remember the old "thundering herd" problem with Apache children and things of
that nature? You'd basically have a whole bunch of processes which had a
listening fd from an earlier call to listen(). When a new connection would
come in, the kernel would wake all of them, even though only one of them would
actually have something to get. The others would go through the process for
nothing. It caused a big performance hit back in the day.

Well, imagine now that you have a directory or lock service where you can
store things and perform atomic updates. When you do a write to something in
it, it fans out to all of its clients, and they all wake up (nearly)
simultaneously and receive the update. They then have to do whatever
processing you do with new data of that type.

If they all do this at the same time, then you have no processes left to
service incoming requests. They're all identically busy, holding whatever
mutexes they need in order to apply those config changes safely, so no other work happens
on those clients while they load in the new data.

It's not so much that it's taking a mutex and is getting stuck for a little
bit, since that's going to happen no matter what. It's that _all_ of the
children do it at the same time, so there's nobody to service your hit, and
you're guaranteed to get stuck. If it was spread out, then only some
percentage of incoming requests would get stuck behind this. The others would
get lucky and would hit another instance which either had already run it or
hadn't yet run it.

I'm not saying this is what's going on here, but it sure sounds familiar.
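
A rough sketch of the effect described above, in Python. It is a toy model, not how App Engine or any lock service actually behaves: the worker count, the 2-second reload time and the 60-second jitter window are made-up numbers, and `reload_start_times` and `max_busy_at_once` are hypothetical helper names. It compares every worker applying an update immediately with each worker waiting a random delay first.

```python
import random


def reload_start_times(num_workers, spread_seconds, rng):
    """Delay (in seconds) each worker waits before applying an update."""
    if spread_seconds <= 0:
        # No jitter: the whole herd wakes up and reloads together.
        return [0.0] * num_workers
    return [rng.uniform(0, spread_seconds) for _ in range(num_workers)]


def max_busy_at_once(start_times, reload_seconds):
    """Largest number of workers reloading at the same instant."""
    events = []
    for start in start_times:
        events.append((start, 1))                    # worker begins reloading
        events.append((start + reload_seconds, -1))  # worker is free again
    # At equal timestamps, count a finished reload before a starting one.
    events.sort(key=lambda e: (e[0], e[1]))
    busy = peak = 0
    for _, delta in events:
        busy += delta
        peak = max(peak, busy)
    return peak


rng = random.Random(0)  # fixed seed so the run is repeatable
herd = reload_start_times(100, 0, rng)     # assumed: 100 workers, no jitter
spread = reload_start_times(100, 60, rng)  # assumed: 60 s jitter window

print(max_busy_at_once(herd, 2.0))    # 100: every worker is busy together
print(max_busy_at_once(spread, 2.0))  # only a fraction of the workers overlap
```

With jitter, some requests still land on a busy worker, but there is always another instance free to take the hit.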

~~~
stingraycharles
On what basis do you think these issues are related? The bug report provides
very little insight into what's going on, only that there's a severe performance
degradation at 9AM.

The thundering herd problem applied to waking up child processes is one
possible explanation, but there are dozens of other explanations that are just
as likely, based on the information we're provided with.

~~~
brown9-2
The commenter you are replying to is a former Google employee; the description
of the lock service sounds like Chubby
(<http://research.google.com/archive/chubby.html>). App Engine likely uses
some sort of distributed directory service for keeping track of things like
quotas.

------
Confusion
Well, the bug report doesn't really invite quick attention. Simply reporting
your observations is not enough: you should position yourself as a competent
customer, by explaining what you have done to ensure the problem isn't on your
side. Mention the code hasn't changed, that you have no database cleanup
cronjobs or similar running that could be interfering, etc.

My first instinct when I see a report like this is: he probably has some
cronjob running he forgot about; perhaps one whose running time grows as
O(n^2).

By which I'm not saying that Google is right in not replying for days, but by
which I am saying that as a customer, there are easy ways to get attention
beyond shouting and threatening. Show it's an interesting problem and you're
bound to get some techie's attention.

~~~
bodegajed
I consider him to be panicking more than shouting and threatening. I couldn't
imagine getting that kind of treatment as a VPS customer; I'd be moving out
ASAP.

~~~
meaty
If you were a VPS customer, you'd be less locked in and could just walk to
another vendor, so they usually are actually shit hot with support.

It was obvious when I first tried it that GAE has crappy support.

~~~
jfoster
What are we actually talking about when discussing "good support" and "bad
support"? Is it just someone nice to talk to whilst someone else fixes a
problem for you? There was an interesting article along these lines by the
former President of Enterprise at Google written recently:
[http://gigaom.com/2013/01/26/the-delusions-that-companies-ha...](http://gigaom.com/2013/01/26/the-delusions-that-companies-have-about-the-cloud/)

In this case, the GAE feature that underlies this issue is the Master/Slave
(MS) datastore. It's been deprecated for ages in favour of the
High-Replication Datastore (HRD).

~~~
pixl97
He falls into the trap of knowing machine behavior, but not dealing with people
behavior.

 _Insanity #2: I need somebody to talk to when a service interruption occurs_

You hear about an earthquake in California, you call your aunt to make sure
she is ok.

You are getting bad weather in the area you live, your mom calls and checks on
you.

The server you use disappears off the internet and your provider's status page
hasn't been updated for a week, you '...'?

When something goes wrong, it's not an event that affects everybody (even if
it is), it's an event that affects you. As long as humans are still involved
in the purchasing and managing of servers you'll always need someone to call
and yell at/be soothed by.

~~~
jfoster
That's true. I think that his broader point still stands, though. Once you get
beyond variants of "are you working on it or do I need to convince you to?"
the role of support is basically catering to irrational desires.

------
benjaminwootton
Google support is an absolute disgrace.

I had a Nexus 7 go AWOL at Christmas and I've never had such a shambolic
customer service experience.

They have absolutely no respect or customer service ethos when it comes to
people who are actually paying them real cash money.

Not in a million years would I sign off on hosting a production project on App
Engine.

~~~
Ironlink
This customer is complaining about a service component which has been
deprecated for almost 11 months. There is a tool which migrates
application data from the old datastore to the new one. When you don't move
off of deprecated infrastructure, I'd say you've set yourself up for problems.

~~~
scottbartell
A comment from Google explaining the issue wouldn't be much to ask. "Move off
of deprecated infrastructure" is much better than radio silence.

------
davedx
Why would anyone put anything production critical on Google these days,
knowing that they provide 0 support across most of their business?

~~~
wiradikusuma
because we're paying customers? my bill is peanuts, but i know there are many
big customers, e.g. Khan Academy. and also they have Premier Support which is
$500/mo.

~~~
raverbashing
One thing I've learned with Google is that they don't give a rat's ass if
you're a paying customer or not.

Khan Academy may have it easier, I'm sure Google won't let _them_ down

Really, go somewhere else, spend less money and have better support.

(At the expense of, if you're lucky, Google will give you almost zero
headaches)

~~~
dylanvee
There's nothing about Khan Academy's application (or any other customer's)
that would somehow immunize it from platform-wide serving issues.

~~~
raverbashing
Of course not, just as when AWS fails, Netflix stops working like every other customer.

But you can bet that if Khan Academy has a problem it will be looked into with
extra attention.

~~~
dylanvee
You're correct--because we have a Premier account, which anyone else can
obtain too: <https://developers.google.com/appengine/docs/premier/>

~~~
raverbashing
Well, Caveat emptor

Also, Khan Academy receives funding from Google
(<http://en.wikipedia.org/wiki/Khan_academy>), so I don't think it boils down
to only having a Premier account.

~~~
codeka
"You just have to pay for support and you get support? I don't believe it,
there must be more to it than that!"

Whether you believe it or not, a Premier account is what you need if you want
support. You could argue that $500/mo is too expensive, but it is what it is.

~~~
raverbashing
Oh, I believe you can pay and get support

What I don't believe is that the support provided by Google is good or
sufficient. Based on experiences with _paying_ Google Apps, I'd say it's not.

~~~
kordless
My question is whether or not you personally have paid google for said support
and then not received what was promised?

------
afhof
A GAE user sees a problem of his service being slow, writes a frantic bug
report with caps and exclamation marks and threatens to leave GAE. As a GAE
user myself, two questions come to mind:

1\. Is GAE outside of their .9995 SLA* uptime? If they aren't, then it
probably isn't important enough to spend time looking into it. Customers cannot
expect better than the agreed upon uptime percent, and hosting companies are
obligated to reimburse customers if they go below SLA. Both of these are
covered in the SLA doc.

2\. Is it reproducible? So far, the bug report mentions 2 people out of GAE
users. Is 2 people enough to say it's a problem with GAE? One person is
panicked, and the other provides few details for the bug report.

*<https://developers.google.com/appengine/sla>

~~~
efdee
1\. 0.9995 SLA means about 6 minutes of downtime a month. Since it's a daily
event, I'm guessing that yes, the SLA is violated.

2\. It's a problem that is occurring daily, with a test case that has pretty
much no code at all. That in itself does not prove anything, but it really
makes me wonder how it could be a problem on the user's side.

~~~
mark-r
My math for downtime per month works out a little differently:
(1-0.9995)x30x24x60 = 21.6 minutes. Still I mostly agree with you.
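
Spelled out as a small Python sketch, the arithmetic above looks like this. `allowed_downtime_minutes` is a hypothetical helper name, the 30-day month is the same simplification used in the formula above, and whether a daily slowdown counts as "downtime" depends on how the SLA document defines it.

```python
def allowed_downtime_minutes(sla, days=30):
    """Minutes of downtime permitted by an uptime fraction over `days` days."""
    return (1 - sla) * days * 24 * 60


# Budget for a 30-day month at 99.95% uptime.
print(round(allowed_downtime_minutes(0.9995), 1))          # 21.6

# The same budget expressed per day, for comparison with a daily event.
print(round(allowed_downtime_minutes(0.9995, days=1), 2))  # 0.72
```

So a problem that recurs every day uses up the monthly budget once it averages more than about 43 seconds per day.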

------
nivla
Having never used GAE, it would be nice if someone could expand M/S and HRD for
me.

It looks like OP of the bug-report is using a deprecated feature/program
which according to the Project Member is causing latency issues at a specific
time daily. But that could not be the real issue since another commenter who
is using the new HRD is also having the same problem. It is even frustrating
for people who are reading this. All it implies is the lack of communication
from Google when something goes awry. Come on Google, stop reinforcing my
stereotypes about your customer support!

Selling to a consumer is different from selling to a business: you may have a
great product at a great price but if you offer terrible CS, in the B2B world
everyone is going to avoid you. It is a place where support is valued more
than the product itself.

Therefore, unless you start offering a decent CS, you can lower your price all
you want, I will be sticking with AWS.

~~~
ronyeh
M/S is Master/Slave:
[https://developers.google.com/appengine/docs/python/datastor...](https://developers.google.com/appengine/docs/python/datastore/usingmasterslave)

HRD is High Replication Datastore:
[https://developers.google.com/appengine/docs/adminconsole/mi...](https://developers.google.com/appengine/docs/adminconsole/migration)

M/S is deprecated, and HRD is the new hotness (and it conveniently costs
more).

~~~
moobirubi
I think it costs the same. When it first came out it cost more, but now it
costs the same.

~~~
jfoster
The rates are the same ($1 per million writes and $0.70 per million reads
beyond the daily free threshold), but the daily free threshold is 0.05 million
of each for master/slave, and 0.01 million of each for high-replication.
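
To see what the difference in free threshold means in practice, here is a small Python sketch. It uses the rates and thresholds exactly as quoted in this comment (they are not checked against Google's price list), `daily_cost` is a hypothetical helper name, and the 200,000 writes per day is an invented workload.

```python
def daily_cost(ops, free_ops, rate_per_million):
    """Dollar cost for one day of operations beyond the free threshold."""
    billable = max(0, ops - free_ops)
    return billable * rate_per_million / 1_000_000


WRITE_RATE = 1.00  # dollars per million writes, as quoted above
FREE_MS = 50_000   # 0.05 million free operations per day (master/slave)
FREE_HRD = 10_000  # 0.01 million free operations per day (high-replication)

writes = 200_000   # assumed example workload

print(daily_cost(writes, FREE_MS, WRITE_RATE))   # cost on master/slave
print(daily_cost(writes, FREE_HRD, WRITE_RATE))  # cost on high-replication
```

On those figures the per-operation price is identical, and the gap between the two datastores is capped at the 40,000 extra free operations per day.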

------
zwischenzug
I manage the 3rd line support of some of the busiest websites in the world (we
provide back-end e-commerce software).

I can't say I think much of google's response here. Nearly two weeks before
the first comment, and then shut down after 2 days and a question directed at
who knows who, and no explanation?

The analysis elsewhere on here suggests they're violating SLA, so this should
get more attention. I'm guessing support is under-resourced @ google, and the
culture of support is a bit shabby (no acknowledgement of inconvenience or
indication or evidence of work undertaken in the background) - hardly
surprising for a large-scale software business based on free services.

------
neya
I'm sorry, but this is the price you pay for running your business that is
dependent TOTALLY on a 3rd party service. Forget Google, everyone out there is
most likely the same, that's why it's important for you to run your 'apps' on
something you have control over - like Linode, AWS, Rackspace, OpenShift, etc.
and also have back-up nodes from other providers for redundancy, for emergency
situations, in case of storms, etc.

I would recommend trying your apps on OpenStack or OpenShift (the latter in
particular), which don't have the vendor lock-in you face right now.

~~~
pestaa
Interesting comment. When AWS went down in the US, devops rightly said "we
told you to run your apps on something you have control over."

~~~
StavrosK
What's that? A server I own on a rack I own in a datacenter I own on a power
factory I own and a telco I own?

"Having control over" something is a scale, it's not binary.

~~~
pestaa
Indeed, I was trying to say it is all relative.

~~~
StavrosK
Sure, but it's a tradeoff. No need for devops vs no control over outages.

------
lucb1e
To their credit, people are apparently using something that's been deprecated
and should be changed regardless. At least, that was their conclusion when it
was changed to wontfix. The replies are very rare and curt though, I can't
really say it's quality service when you're paying for a product.

Customer support from Google has always been like this as far as I've
experienced and heard. There is no way to actually reach and converse with
anyone, regardless of whether you are paying them for the service or what kind of
request it is.

Once a Google employee randomly replied to a complaint of mine about Google+
(I didn't even +mention them). After a few comments and him confirming that it
was added to the bugs list, I asked if it was okay to +mention him in the
future with similar issues. It was okay. I did. He never showed his face
again. (His profile still says "Works at Google+".)

Another Google employee I know online also never replies to anything
concerning Google. I know he works on the Google+ project, but I can only hope
he passes on any bugs I +mentioned him in.

For YouTube, you can post in their forums but can merely hope for a reply.
Copyright complaint disputes are no priority, either.

I haven't used many paid products, but I have read about their customer
support being one of the very worst and also have never been able to find a
single e-mail address or phone number to get support at for any service.

Edit: By the way, I would have moved away from Google App Engine a long
time ago if my app went down every morning during rush hour for 10 days
straight.

------
sgift
It is interesting that basically no one (including the news poster) noticed
that there is a comment (#12) which states that this problem happens on HRD
too. This statement may be false and/or a completely different issue, but at
least it should be considered here for HN comments which state "M/S is
deprecated, Google is right, just use HRD."

------
brown9-2
A bug tracker seems like a horrible way to report production (or non-
production) support issues. This is the same bug tracker OSS projects on
Google Code use.

Is it really helpful for the public to comment on my support request? Seems
like the signal to noise ratio would be quite low, and then you get inane
comments like:

 _I got here from HackerNews, but after seeing the original poster spam the
forums in multiple places and have a bad attitude, I can't blame Google for
not fixing what looks to me like a non-issue.

Fuck 'em._

You have to believe that the choice of tools has some bearing on the quality
of the response from Google. Seems like there is very little incentive for any
"Project members" to trawl through open bug reports when no one is ever
responsible.

------
Al-Khwarizmi
Not surprising... the second most-voted bug in Google Code, reported exactly a
year ago ( <http://code.google.com/p/support/issues/detail?id=24324> )
deplores the removal of a feature that was already there (the Updates page)
and was the single most useful feature in Google Code for many of us. After
one year and more than 800 people registering their interest on the issue,
they haven't even explained why they removed it or whether there are any plans
of bringing it back.

------
afhof
Comment from the WontFix mark:

"M/S is deprecated and there is a clear and straightforward path to migrating
to HRD."

M/S was deprecated April 4, 2012, so it has been some time since the notice
has been out there. High replication data store has been available for over 2
years now. Whether or not less than a year is too short a deprecation period
is another issue.

~~~
nailer
Read the following message - other customers using Python on HRD are reporting
the same issue as the parent, who is Java on M/S.

------
pyalot2
Ok, so here's the deal. If your app runs exclusively on GAE you've essentially
tied yourself to one cloud vendor. Now disregarding the respective benefits
and drawbacks of google as a hosting company for your app (I would never do
that), being dependent on one cloud provider is a very bad idea. No matter if
you run on EC2, Azure or GAE, if you can't seamlessly switch to another
provider, you're screwed. These all go down regularly and have issues. They're
big companies, you're a small company, you have no such thing as "recourse".
The court of public opinion will not save your company.

~~~
kaolinite
Agree with this to an extent; however, the company I work for deploys on AWS and
is far too cautious about vendor lock-in, to the point where we use AWS
basically as a VPS, not a cloud service, and get none of the advantages (and
all of the disadvantages, e.g. worse performance, higher price).

------
killermonkeys
Many on the thread say the reporters are over-reacting. They are not. What
would Amazon do? They would consider this an issue, would respond in less
than 24 hours, and would take complete responsibility. GAE is a pay service. I
think this level of service is pathetic.

As noted the only attempt at diagnosis is completely wrong (even the reporter
is not on MS) and very late.

------
timme
The headline blows this out of proportion.

A few people (who act obnoxious as hell) report a problem that can be solved by
moving away from a deprecated system, yet they fail to even read the note
because they're busy smashing exclamation marks into the issue tracker.

~~~
Ironlink
I couldn't agree more. In fact, it was deprecated just about eleven months
ago:
[http://googleappengine.blogspot.com/2012/04/masterslave-data...](http://googleappengine.blogspot.com/2012/04/masterslave-datastore-thanks-for-all.html)

When your datastore gets deprecated, you act sooner rather than later.

~~~
jfoster
The two datastores even have the same API. As long as your app doesn't depend
on the exact performance characteristics of the old one, the migration is very
straightforward. I did it for one of my apps in a morning and was done well
before lunch.

------
edent
Why would anyone expect customer support from Google? They have made it clear
time and time again that they don't provide it.
[http://shkspr.mobi/blog/2013/02/googles-customer-contempt-co...](http://shkspr.mobi/blog/2013/02/googles-customer-contempt-conundrum/)

~~~
Ironlink
Support packages are available.
[http://googleenterprise.blogspot.com/2013/02/google-cloud-pl...](http://googleenterprise.blogspot.com/2013/02/google-cloud-platform-introduces-new.html)

~~~
EwanToo
The problem with saying things like "Support packages are available", is that
time and again we see paying Google customers with support packages being
treated awfully.

For example, this guy was a paying Google customer but couldn't get help
<http://www.sultansolutions.com/google-voice-lost-number/>

These are paying customers who are paying a non-trivial amount of money for
support (though not the "Premium" support in this case, which is an extra $500
per month for GAE).

~~~
Ironlink
We're a paying customer of GAE. I think it's quite clear that paying for the
basic service doesn't include support beyond the public issue tracker and the
forums. Support packages start at $150 per month, and at that point you get a
4 hour response time. I think that's entirely reasonable. We have yet to sign
up for a support level, but then again we're not really seeing any troubles
with the service.

~~~
EwanToo
I guess the question is, if you're paying and you find what you believe is a
system wide issue, do you expect a resolution?

In this case, Google's response is seemingly "It might (or might not) be a
system-wide issue, but we don't care - we won't fix it".

There's no indication in this case that someone paying $500 a month for the
premium support would get a better answer.

------
mos
Customer support of Google really sucks! Currently the GAE cloud has a
reliability problem (also for new customers). Instances are restarted like
crazy. This leads to downtimes. But that's not all. Customers even have to
pay for more(!) instance hours because of this. There is a running gag on the
mailing-list: "Whenever GAE is unreliable for weeks Google needed to make
revenue targets ;-)"

References

Current Issue:
[http://code.google.com/p/googleappengine/issues/detail?id=88...](http://code.google.com/p/googleappengine/issues/detail?id=8844)

Same issue from last year that took weeks to be resolved (check last
comments!):
[http://code.google.com/p/googleappengine/issues/detail?id=80...](http://code.google.com/p/googleappengine/issues/detail?id=8004)

Some Pros and Cons of Google App Engine in this blog-post:
<http://www.mosbase.com/>

------
kelvin0
BTW, this issue is not simply due to M/S; it also happens on HRD. So any
google support apologists here, please read the BUG thread submitted by this
poor customer before dismissing it simply as a 'migration issue'.

I have had some issues with Google Docs (paid for premier commercial account).
Some documents we had stored simply vanished from our account. After getting
the runaround for 3-4 days, finally a Google engineer told us they can't
help us recover the documents THEY 'lost' unless we have the URL to the
document ... Thankfully someone on our team had kept the URL when I first
shared that document with them (1+ year after the document had been created).

Nightmare ...

------
bromley
Quick tip for anyone making a system with high load and daily or hourly
quotas: When an account is created, assign a random start time (e.g. 05:43 for
daily quotas or minute 12 for hourly) to measure that account's quotas
against. Then you can avoid this issue of the system getting a huge spike in
load when everyone's quota refreshes at the same time.
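
A minimal Python sketch of that tip, under stated assumptions: `assign_reset_offset` and `current_quota_window_start` are hypothetical names, quotas are daily, and the offset is stored with the account as a minute of the day. It is an illustration of the idea, not how GAE implements quotas.

```python
import random
from datetime import datetime, timedelta


def assign_reset_offset(rng):
    """Pick a random minute of the day (0..1439) when an account is created."""
    return rng.randrange(24 * 60)


def current_quota_window_start(now, offset_minutes):
    """Start of the daily quota window that contains `now` for this account."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    start = midnight + timedelta(minutes=offset_minutes)
    if start > now:
        # Today's reset has not happened yet, so the window began yesterday.
        start -= timedelta(days=1)
    return start


offset = 5 * 60 + 43  # the 05:43 example from the tip above

# Before 05:43 the account is still in the window that opened yesterday.
print(current_quota_window_start(datetime(2013, 2, 20, 4, 0), offset))
# prints 2013-02-19 05:43:00

# After 05:43 the account is in the window that opened today.
print(current_quota_window_start(datetime(2013, 2, 20, 9, 0), offset))
# prints 2013-02-20 05:43:00

# New accounts get their own random offset at creation time.
print(assign_reset_offset(random.Random()))
```

Usage counters are then keyed on the window start rather than on the calendar date, so resets are spread across the whole day.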

------
mrerrormessage
It happens that 9 AM Brussels time is midnight Pacific time. I'm sure Google
is running some maintenance cron at midnight thinking "This is a low demand
time," and it is, across the US, but not in Brussels. These are old instances,
and Google probably doesn't want to re-time or rewrite the cron job to be more
efficient.
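
The time zone arithmetic can be checked with a few lines of Python. This sketch assumes winter time on both sides (CET is UTC+1, PST is UTC-8) and uses an arbitrary example date. Europe and the US change their clocks on different dates, so for a few weeks each year the gap is 8 hours rather than 9.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, assuming neither side is on daylight saving time.
CET = timezone(timedelta(hours=1), "CET")
PST = timezone(timedelta(hours=-8), "PST")

# Arbitrary example date; 9 AM in Brussels.
brussels_9am = datetime(2013, 2, 20, 9, 0, tzinfo=CET)
pacific = brussels_9am.astimezone(PST)

print(pacific.strftime("%Y-%m-%d %H:%M %Z"))  # 2013-02-20 00:00 PST
```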

~~~
guard-of-terra
Do people go to bed that early in the US? Looked at my project's charts, the
low demand time is definitely 4AM and midnight is 40% of peak.

------
logn
Ironically, the guy who closed this issue owns this project:

<https://code.google.com/p/sentimentally/>

"sentimentally is a tool that determines sentiment of your emails. Once
determined, it helps you gauge your relationships with co-workers, customers,
friends, or other individuals based on the tone of your conversations with
these people."

------
kushti
Never pay Google. It has terrible support for all products

~~~
rbanffy
Not my experience at all. While it's true I never had much trouble with App
Engine, when I had problems, they were solved rather quickly.

------
raverbashing
" Upgrading to HRD will solve your issue. M/S is deprecated and there is a
clear and straightforward path to migrating to HRD."

Can anyone explain why this is not possible for them?

Wonderful Google support apart, there are a lot of alternatives out there.

~~~
log0
Multiple comments mentioned that this occurs in HRD as well.

~~~
afhof
Only one other comment mentioned the problem with HRD as the datastore.

~~~
justin66
And the guy having the problem with HRD doesn't count because ... ?

~~~
Ironlink
Because they did not post any kind of evidence (request logs, Pingdom report,
etc.), not to mention the App ID in question (so that Google would know where
to look). All too often, bug reports end up being some kind of
misunderstanding.

------
okku
I am also a GAE-user, I have had no problems like the OP. But I start to miss
a fundamental feature, sockets. I have worked around it by using other
services and polling.

Maybe wrong forum, but are there any infrastructure templates for setting up a
scalable web/db/loadbalancer/memcached for a simple traditional webservice, in
my case a game?

I want to be able to sleep at night, and easily scale up by adding some more
machines in case of higher load.

I could use denormalized MySQL/PostgreSQL or MongoDB for speed. Preferred
language is Python (or maybe c# or java).

Any ideas?

~~~
ThingTwo
Channels.

~~~
okku
Channels are for communicating with javascript clients. I want to communicate
with other servers and applications.

~~~
petersmagnusson
stay tuned

------
petersmagnusson
Hi folks. We are fully aware of this issue. We've added it to the external issue
tracker
([https://code.google.com/p/googleappengine/issues/detail?id=8...](https://code.google.com/p/googleappengine/issues/detail?id=8901)),
please follow up there.

Response from us was initially muted because it looked like it only affected
M/S apps, but it turns out (a) it can impact HRD as well, and (b) we're pretty
unhappy about the level of impact for many M/S apps so we're looking at ways
to resolve. It's a high priority and we're looking at a number of ways to
address it. It's also a pretty interesting issue, because indirectly it's
caused by (a) the large scale at which App Engine is running, and (b) the large
extent to which GAE is running free applications.

Regardless, apologies to those who felt support was unresponsive. We are
working very hard to improve support. For the sophisticated audience that
comes to these pages, please link to me on Google+ to get my attention if we
are failing you (<https://plus.sandbox.google.com/110401818717224273095>).

------
mnml_
GAE is cool but it's not worth the money. It shouldn't be out of beta.

------
lnanek2
I've heard a lot of people saying this is why Google can't get a lot of
businesses to sign on. There's no one for the CEO to call and complain to
directly when their stuff is down.

------
chris_wot
Well, there's a good reason not to use this service.

------
xsace
Jesus, so glad I switched to node when the hosting rates increased back in
Sept 2011 with the "GAE out of preview" move

------
saosebastiao
Ahh the perils of being a customer of Google.

