
Summary of the AWS Service Event in the Sydney Region - mcbain
http://aws.amazon.com/message/4372T8/
======
jaketay
Our instances in ap-southeast-2 were out for around 12 hours. We used
multiple availability zones and it didn't prevent downtime at all. The
difference between the AWS and Google outage responses is very interesting.
AWS is down for 12+ hours for some customers, forces each customer to chase
service level credits, and signs off the postmortem with a nameless, faceless
"-The AWS Team". Not one person at AWS was willing to take responsibility for
this failure.

Whereas Google was recently down for less than 18 minutes. A VP at Google sent
an email advising all affected customers, posted continuous updates to their
status page, sent a further apology email at the conclusion, posted a service
credit exceeding the SLA to all customers in the zone (without forcing
customers to chase it themselves through billing), and lastly wrote one of the
best-written postmortems I've ever seen. AWS has much to learn from Google
about how to handle outages properly.

~~~
beachstartup
People vote with their dollars, and their votes are overwhelmingly telling
Amazon they're doing a great job.

~~~
brazzledazzle
For the record, I agree with you, but existing customers who are heavily
invested in AWS would find it difficult to vote with their dollars.

~~~
atonse
Agreed. I have a couple of clients that have a pretty substantial AWS spend,
but the cost of switching to Azure is too high compared to the difference in
offerings. You don't want to spend tens of thousands of dollars of developer
time and risk switching datacenters for a small improvement.

------
chrismorgan
Why, oh why do they report times in PDT rather than AEST (the zone of the
affected area) or UTC (the standard everything else is based on)?

(Mutter, mutter, … something about Americans and their timezones … and
northern hemispherians and their seasons …)

~~~
bmon
This also baffles me. Here are the times converted to AEST:

      Sun July 5th
      3:25 PM -- Initial power outage
      4:42 PM -- Instance launching in unaffected AZs restored
      4:46 PM -- Power restored
      6:00 PM -- 80% of instances recovered
      7:49 PM -- DNS recovered

      Mon July 6th
      1:00 AM -- Almost all instances recovered

~~~
mentalpiracy
Small correction: dates are for June 5th and 6th, not July!
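
For anyone checking the math, here's a minimal Python sketch of the
conversion using the corrected June dates (the 10:25 PM PDT start time is an
assumption read back from the 3:25 PM AEST figure above, not quoted from the
report):

    from datetime import datetime
    from zoneinfo import ZoneInfo  # Python 3.9+

    pdt = ZoneInfo("America/Los_Angeles")
    aest = ZoneInfo("Australia/Sydney")

    # Assumed outage start in the report's timezone (June, per the correction)
    start = datetime(2016, 6, 4, 22, 25, tzinfo=pdt)
    print(start.astimezone(aest))  # 2016-06-05 15:25:00+10:00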

------
thomasfoster96
For those wondering what the "severe weather" was:

* [http://www.smh.com.au/national/australias-wild-weather-sydne...](http://www.smh.com.au/national/australias-wild-weather-sydneys-massive-storm-in-pictures-20160606-gpcyu7.html)

* [http://www.abc.net.au/news/2016-06-07/sydney-weather-storm-d...](http://www.abc.net.au/news/2016-06-07/sydney-weather-storm-damaged-beachfront-homes-likely-dismantled/7487056)

* [http://www.sbs.com.au/news/gallery/pictures-wild-weather-sav...](http://www.sbs.com.au/news/gallery/pictures-wild-weather-savages-nsw-and-tasmania)

------
bigiain
Heh - I love the image in my head of the flywheel providing a few extra
seconds of power to the coffee urn in the Blackwoods warehouse out the back
and to all the fan heaters and big-screen TVs in Toongabbie - just as Foxtel,
Domino's, and Channel 9's Nagios dashboards all start turning red and their
ops staff's phones start beeping.

------
daniel-levin
>> The specific signature of this weekend’s utility power failure resulted in
an unusually long voltage sag (rather than a complete outage)

It is false to assume that the state of the electrical supply is either on or
off. This may come as a surprise to some, but it didn't to me. In 2008, Eskom
(South Africa's electricity supplier) experienced similar faults. The mains
supply voltage here is 220 V. At one point, some devices in my house started
to fail while others, such as lights, continued to work, but significantly
dimmer. We measured 180 V at the plugs. There were similar outages in my area
last year, where an outright cut-off was preceded by voltage drops. This
outage is interesting because it is an example of a failure owing to a false
assumption!

There have also been incidents where certain cables were stolen [1] and that
caused the opposite: voltage spikes.

[1] I couldn't tell you which, or what kind, but I remember it has something
to do with "the neutral"
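
A toy Python sketch of the point - supply state is a spectrum, not a boolean
(the thresholds are illustrative, not Eskom's or AWS's):

    NOMINAL_V = 220.0  # South African mains

    def classify(voltage: float) -> str:
        """Classify a mains voltage reading; cutoffs are invented."""
        if voltage >= 0.9 * NOMINAL_V:   # ~198 V and above
            return "nominal"
        if voltage >= 0.5 * NOMINAL_V:   # e.g. the 180 V we measured
            return "sag (brownout)"
        if voltage > 0:
            return "severe sag"
        return "outage"

    for v in (220, 180, 90, 0):
        print(f"{v:>3} V -> {classify(v)}")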

~~~
aroch
It doesn't sound like they were treating it as binary; they have breakers in
place for brownout detection too -- those breakers just weren't triggered
fast/early enough in the brownout.
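
Roughly the shape of that logic, as a sketch (the trip threshold and
detection window are invented numbers, not AWS's settings):

    SAG_THRESHOLD_V = 0.88 * 480   # illustrative trip threshold
    MAX_SAG_MS = 100               # detection window; too slow = this event

    def should_open_breaker(samples_v, sample_period_ms=10):
        """Open the utility breaker once voltage has stayed below the
        threshold for the full detection window. samples_v holds recent
        readings, newest last."""
        sag_ms = 0
        for v in reversed(samples_v):
            if v >= SAG_THRESHOLD_V:
                break
            sag_ms += sample_period_ms
        return sag_ms >= MAX_SAG_MS

    # A shallow but persistent sag eventually trips the breaker
    readings = [480] * 5 + [400] * 12
    print(should_open_breaker(readings))  # True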

------
PhantomGremlin
I love reading about problems like these; it's great that Amazon is
forthcoming about them. There's always some new wrinkle.

E.g. in this case, in normal operation, power from the utility power grid
spins a flywheel. When the grid fails, the flywheel provides a holdover until
Amazon's diesel generators can start.

But in this failure the voltage from the grid sagged, rather than going away
completely. The breaker isolating the flywheel from the grid didn't open
quickly enough. So power from the flywheel was sent out to the grid. It didn't
succeed in powering the grid for very long. Oops.
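
Back-of-envelope on why back-feeding kills the holdover: a flywheel stores a
fixed energy budget, and holdover time is just energy divided by the power
being drawn (all numbers below are invented for illustration):

    stored_kj = 30_000      # energy banked in the flywheel, kJ
    facility_kw = 1_000     # local load the UPS is sized for, kW
    backfeed_kw = 20_000    # sagging grid sinking power as well, kW

    print(stored_kj / facility_kw, "s holdover on local load alone")        # 30.0 s
    print(stored_kj / (facility_kw + backfeed_kw), "s while back-feeding")  # ~1.4 s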

~~~
JorgeGT
Something else must have failed or was not properly configured, because backup
diesel generators should kick in after 2-15 seconds of voltage drop,
regardless of the flywheel. The flywheel is used in critical systems only to
cover that <1 min gap.

~~~
jbg_
Amazon addresses that in their report. Each UPS is fed by generator power and
grid power. Because the UPSes had been forced to try to supply the grid, and
because they are giant spinning weights that you really don't want to go wrong
and kill someone or destroy property, they did a safety inspection before
powering them back up, which meant a delay before the facility could be
supplied by generator power.

------
shermozle
I'm a bit dubious about their "if you used multi-AZ you'll be fine" when I had
multiple outages of over an hour in a multi-AZ Elastic Beanstalk application.
Methinks the load balancers aren't as magical as they'd like to make out.

~~~
jdc0589
> Methinks the load balancers aren't as magical as they'd like to make out.

Agreed. I'm still put off by the fact that ELBs specifically cannot handle a
sudden spike in traffic orders of magnitude higher than the previous rate.
They fall on their face, badly. If you expect a spike like that to happen, you
literally have to submit a ticket and ask AWS to pre-warm your ELB...

~~~
thesandlord
Yeah, this is one big advantage GCP has over AWS: no need to pre-warm load
balancers. Not sure how Azure works (I work for Google).

------
vacri
I knew that this was a big event when it happened last Sunday, because the AWS
service status page had a yellow triangle rather than a green tick. Usually
when they have an outage, they just put a tiny blue 'i' on the green tick...

~~~
benjaminRRR
That's right, and you're lucky to even get the (i); usually it's buried in the
RSS feed, which is why [https://twitter.com/aws_shd](https://twitter.com/aws_shd)
is useful.

------
clentaminator
Or, in summary, "Uninterruptible power supply is actually interrupted."

------
mryan
There is something Orwellian about referring to this as a 'service event'.

I am reminded of 'The Event' from That Mitchell and Webb Look [0]. We don't
talk about The Event.

[https://www.youtube.com/watch?v=wnd1jKcfBRE](https://www.youtube.com/watch?v=wnd1jKcfBRE)

------
voltagex_
I'd love to see a write up from the power company's point of view.

