In my case I was able to get things working again by disabling all non-essential apache modules and lowering the timeout. I even had to take the load balancer completely down for a few minutes to get the users to back off long enough to ramp up smoothly again. Then I switched to nginx and haven't looked back.
Obviously I am not comparing my apache instance to App Engine, just the broad strokes of the self-perpetuating failure mode. But, reading between the lines, this post basically admits to oversubscription on App Engine. Talking about load as having an unexpected impact on reliability (especially during a "global restart") is a nice way of saying that they got more traffic than they could handle.
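(The self-perpetuating part, for anyone who hasn't hit it: clients that time out retry immediately, which piles more load onto an already struggling server, which causes more timeouts. A minimal sketch of the usual client-side mitigation, exponential backoff with jitter, is below; this is a generic Python illustration, not how my setup or App Engine actually handles it.)

    import random
    import time
    import urllib.request
    import urllib.error

    def fetch_with_backoff(url, max_attempts=5, base_delay=0.5, cap=30.0):
        """Retry a request with exponential backoff plus jitter.

        Hypothetical illustration: each failed attempt waits roughly twice
        as long as the last (capped), so a struggling server sees retries
        spread out instead of an immediate, synchronized stampede.
        """
        for attempt in range(max_attempts):
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:
                    return resp.read()
            except (urllib.error.URLError, TimeoutError):
                if attempt == max_attempts - 1:
                    raise
                # Full jitter: sleep a random amount up to the exponential bound.
                delay = random.uniform(0, min(cap, base_delay * 2 ** attempt))
                time.sleep(delay)

The jitter matters: without it, every client wakes up and retries at the same instant, which just recreates the stampede on a schedule.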
This sounds like "Did you turn it off and on?"
(Ah, apparently it was only just submitted to HN about an hour ago by someone, and is about to fall off the front page. https://news.ycombinator.com/item?id=4705067)
This is what Amazon posted during the really large outage last year (the one that, even so, only affected multiple availability zones for an hour or so at most):
I can't find a link on Amazon's website, but this is a copy/paste from the explanation of a smaller outage that occurred earlier this year:
Amazon's explanations are, I find, much more detailed (although this App Engine one was pretty good): when something serious goes wrong at AWS, we not only get an apology (and a service credit), but also learn something about how distributed systems work in the process.
The times we don't see explanations from Amazon are when a subset of the servers within a single availability zone (not even an entire zone) is inaccessible for less than an hour (which occasionally happens); otherwise, they honestly "kick ass" at post-mortems, as the above examples show.
However, it is my understanding that Google has had all kinds of random issues that only affected some customers and were dealt with in private, so they're no different in that respect. The outage this morning, though, was "all of App Engine doesn't work anymore", something that has never even happened to AWS.
(Now, during the issue itself, Amazon really, really sucks, to the point where I'd often rather they say nothing than have their front line keep reassuring people; that said, in the middle of a crisis, most systems/people suck.)
At least they're ... consistent? =S
6 hours / 24 hours = 0.25 day.
10 × 0.25 day = 2.5 days ≈ 3 days.
Does the 10% cover that in all cases? (Did they maybe roll back any charges from those 6 hours, as well?)
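One way to sanity-check that arithmetic (assuming a roughly 30-day billing period and flat usage, neither of which is stated here): the 10% credit works out to about 3 days' worth of service against the 0.25 day actually lost. A throwaway Python sketch:

    # Back-of-the-envelope: does a 10% credit cover a 6-hour outage?
    # Assumptions (mine, not from the SLA): flat usage and a 30-day billing period.
    OUTAGE_HOURS = 6
    CREDIT_FRACTION = 0.10
    BILLING_DAYS = 30

    outage_days = OUTAGE_HOURS / 24               # 0.25 day of service lost
    credit_days = CREDIT_FRACTION * BILLING_DAYS  # 10% of 30 days = 3 days credited

    print(f"lost:     {outage_days:.2f} days")
    print(f"credited: {credit_days:.2f} days")
    print("credit covers the outage" if credit_days >= outage_days
          else "credit falls short")

So on those assumptions the credit more than covers the lost hours, though it says nothing about charges racked up during the outage itself, hence the roll-back question.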
It wouldn’t be just for data centers, but that’s a good place to start.
It's funny watching software people getting owned by the hardware people ;)
Naively, it seems like the configuration is at too low a level - individual routers - and that there should be higher-level coordination with support for simulating different conditions. Or does that already exist? What's the state of the art for places like Google when it comes to managing routers?
See also the posts in the google-appengine-downtime-notify group during the incident. https://groups.google.com/forum/?fromgroups=#!topic/google-a...