

About today's App Engine outage - azylman
http://googleappengine.blogspot.com/2012/10/about-todays-app-engine-outage.html

======
bithive123
This sounds very similar to something that can happen at small scales too. It
happened to me when I naively used a preforking apache and mod_proxy_balancer
on a machine with 2GB RAM. We had a surge in traffic and the load balancer
passed the "paging threshold" (it started swapping) and at that point the
increased latency caused requests to pile up as users would get impatient and
hit reload, leaving processes tied up waiting to time out.
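
Roughly, that feedback loop looks like the following toy model (a sketch with
made-up numbers, not measurements from either incident):

    # A short traffic surge, retries from impatient users, and a service
    # rate that halves once the backlog pushes the box into swap.
    def simulate(ticks=40, base=90, surge=160, surge_until=10,
                 capacity=100, swap_threshold=150):
        queue = 0
        for t in range(ticks):
            arrivals = surge if t < surge_until else base
            retries = queue // 2  # users hitting reload on stuck requests
            rate = capacity if queue < swap_threshold else capacity // 2
            queue = max(0, queue + arrivals + retries - rate)
            print(f"t={t:2d} queue={queue}")

    simulate()  # the queue keeps growing even after the surge ends at t=10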

In my case I was able to get things working again by disabling all non-
essential apache modules and lowering the timeout. I even had to take the load
balancer completely down for a few minutes to get the users to back off long
enough to ramp up smoothly again. Then I switched to nginx and haven't looked
back.
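
For the prefork part specifically, the other knob is keeping the child count
small enough to fit in RAM; here is a back-of-envelope sketch, where the
per-child and reserved figures are guesses rather than anything measured on
that box:

    # Rough MaxClients sizing for a preforking server on a 2GB machine
    # (per-child RSS and OS reserve are illustrative guesses).
    ram_mb = 2048
    reserved_mb = 512    # OS, proxy, and everything else on the box
    per_child_mb = 50    # a prefork child with lots of modules loaded
    print((ram_mb - reserved_mb) // per_child_mb)  # ~30 children before swap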

Obviously I am not comparing my apache instance to App Engine, just the broad
strokes of the self-perpetuating failure mode. But, reading between the lines
this post basically admits to oversubscription on App Engine. Talking about
load as having an unexpected impact on reliability (especially during a
"global restart") is a nice way of saying that they got more traffic than they
could handle.

~~~
teraflop
I doubt the "paging threshold" referred to in this post has anything to do
with paging to disk. It probably just means the point at which the operations
folks' pagers start going off.

------
halayli
"We begin a global restart of the traffic routers to address the load in the
affected datacenter."

This sounds like "Did you turn it off and on?"

~~~
bsaul
this one made my day :)

------
untog
Amazon: this is the kind of explanation blog post we want from you. Please be
inspired by it.

~~~
saurik
Here is a link to what Amazon posted about the outage a few days ago. I
originally wrote this post with only the later examples, not realizing that
Amazon had already posted another great one about the latest outage; when I
went looking to find out how long that outage was, I was greatly pleased to
find another AWS message to read.

<https://aws.amazon.com/message/680342/>

(Ah, apparently it was only just submitted to HN about an hour ago by someone,
and is about to fall off the front page.
<https://news.ycombinator.com/item?id=4705067>)

This is what Amazon posted during the really large outage last year (the one
that, even so, only affected multiple availability zones for at most an hour
or so):

<http://aws.amazon.com/message/65648/>

I can't find a link on Amazon's website, but this is a copy/paste from the
explanation of a smaller outage that occurred earlier this year:

<https://news.ycombinator.com/item?id=4124488>

Amazon's explanations are, I find, much more detailed (although this App
Engine one was pretty good): when something serious goes wrong at AWS, we
often not only get an apology (and a service credit), but we learn something
about how distributed systems work in the process.

The times we don't see explanations from Amazon are when a subset of the
servers within a single availability zone (not even an entire zone) is
inaccessible for less than an hour (which occasionally happens); otherwise,
they honestly "kick ass" at post-mortems, as the above examples show.

However, it is my understanding that Google has also had all kinds of random
issues that only affected some customers and were dealt with in private, so
that isn't different with them. The outage this morning, though, was "all of
App Engine doesn't work anymore", something that has never even happened to
AWS.

(Now, _during_ the issue, Amazon really really sucks, to the point where I'd
often rather they say nothing than have their front line keep reassuring
people; that said, in the middle of a crisis, most systems/people suck.)

------
loceng
Cascading failures seem to be a recurring theme amongst hosting providers.

~~~
rhizome
I know neteng isn't the simplest thing in the world, but I was struck that,
for all the "Google Interview...dummies don't even think about it" stories
(not to mention the Microsoft-mockery of the last couple decades), the fix
was first to reboot everything, and then, when that lumped too much traffic
in the wrong places, to reboot everything more slowly four hours later, which
fixed everything in 35 minutes.

~~~
packetslave
Think "load balancing service" when you see "traffic router" in this case.
This was not necessarily a case of physically rebooting a Juniper or
something.

~~~
rhizome
That's one interpretation, but from what little information they offer it
seems their initial "reboots" (of whatever form) introduced asymmetric traffic
loads that blew out some segments, after which they chased fixes for a couple
hours.

------
philip1209
A 10% credit for SLA violations seems quite generous: credit for 3 days after
about 6 hours of downtime.

~~~
wmf
10x strikes me as an appropriate factor since it gives the provider a strong
disincentive for outages.

~~~
philip1209
They did refund approximately 10x:

6 hours / 24 hours = 0.25 days

10 × 0.25 days = 2.5 days ≈ 3 days

------
mwsherman
A big, hairy problem that a YC company should take on: modeling complexity and
predicting emergent phenomena like this. (Ditto Amazon’s outage.)

It wouldn’t be just for data centers, but that’s a good place to start.

~~~
digeridoo
That's actually a problem academia should take on, but unfortunately that's
not the direction computer science has taken.

------
velar
I remember a similar incident at Microsoft that brought Bing down; the
solution was to add 2 more Cisco core routers (which work in pairs) at a cost
of 100k (200k?) each.

It's funny watching software people getting owned by the hardware people ;)

------
andrewcooke
i get the impression that routers are hard to configure well, particularly
used in "complex" ways. isn't there some hardware startup looking at fixing
this? i can't remember the name, but thought i had read about it here before.

naively, it seems like the configuration is at too low a level - individual
routers - and that there should be higher-level coordination with support for
simulating different conditions. or does that already exist? what's the state
of the art for places like google to manage routers?

~~~
packetslave
Where the blog post says "traffic router" you should read "big pools of load
balancing servers", not Junipers and Ciscos and whatnot.

See also the posts in the google-appengine-downtime-notify group during the
incident.
<https://groups.google.com/forum/?fromgroups=#!topic/google-appengine-downtime-notify/SMd2pDJsCPo>

------
michaelkscott
For anyone interested, here are the comments on the AWS outage:
<http://news.ycombinator.com/item?id=4705067>

------
cloudwizard
It makes more sense for GAE to potentially have cascading failures, since
they fail over for you. AWS does not, so it is less vulnerable.

------
Evbn
Why doesn't App Engine degrade by browning out low-priority services (free
tier, batch jobs, low-paying customers) instead of overloading itself?
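
Something like a priority-ordered admission check, as a rough sketch (the
tier names and headroom thresholds here are invented for illustration, not
anything App Engine documents):

    # Tiered load shedding: lower tiers need more spare headroom to be
    # admitted, so batch/free traffic gets dropped first as load climbs.
    SHED_ORDER = ["batch", "free_tier", "low_paying", "paying"]

    def admit(tier: str, current_load: float, capacity: float) -> bool:
        headroom = max(0.0, 1.0 - current_load / capacity)
        required = {t: (len(SHED_ORDER) - 1 - i) * 0.1
                    for i, t in enumerate(SHED_ORDER)}
        return headroom >= required.get(tier, 0.0)

    # At 85% load there is 0.15 headroom: paying and low_paying get in,
    # free_tier (needs 0.2) and batch (needs 0.3) are shed.
    print(admit("batch", 85, 100), admit("paying", 85, 100))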

