

Summary of June 29 AWS Service Event in US-East - rdl
http://aws.amazon.com/message/67457/

======
kiwidrew
It seems to me that almost all of the issues revolved around the complicated
EC2/EBS _control_ systems. Time and time again, we hear about an AWS failure
that starts as a brief instance outage. If it were just a dumb datacentre,
the affected servers would simply boot up, run an fsck on the disks, and
return to service. But because of the huge complexity added by the AWS stack,
the control systems inevitably start failing, preventing everything from
starting up normally.

I can't help but feel that, if it weren't for the fancy "elastic" API stuff,
these outages would remain nothing more than minor glitches.
At this point, I don't see how you could possibly justify running on AWS. Far
better to just fire up a few dedicated servers at a dumb host and let them do
their job.

~~~
rdl
It's undeniable that the added complexity of AWS makes it harder to predict
how it will behave in any specific failure condition, which is one of the
basic things you want to work out when designing a system. "Traditional"
infrastructure is hard enough (especially networking), and still fails in
unique ways, and we've had 10-100 years to characterize it.

Having a common provisioning API across a bunch of physically diverse
datacenters IS a huge advantage for availability, cost, etc., though. If you
have 100 hours to
set up a system, and $10k/mo, you have two real choices:

1) Dedicated/conventional servers: Burn 10 hours on sales and contract
negotiation, vendor selection, etc., plus ~10-20h on anything specific to the
vendor, and then set up systems. Get a bunch of dedicated boxes (colocating
your own gear may be better at scale, but it's 10-20x the time plus upfront
cost...), and set up single-site HA (hardware or software load balancer, A+B
power, dual line cord servers for at least the databases, etc.).

2) Set up AWS. Since the lowest tiers are free, it's possible you have a lot
of experience with it already. An hour to buy the product, plus extra time to
script against the APIs so you can be dynamic (see the sketch below). You
could probably be resilient against single-AZ failure in 100h with $10k/mo,
although doing multi-region (due to DB issues and minimum spend per region)
might be borderline.
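
As a rough illustration of what scripting against the APIs buys you, here's a
minimal sketch using boto3 that launches one instance in each of two
Availability Zones. The AMI ID and instance type are placeholders, and it
assumes AWS credentials are already configured:

    import boto3

    ec2 = boto3.resource("ec2", region_name="us-east-1")

    # Launch one instance per AZ so that a single-AZ failure doesn't
    # take out the whole fleet.
    for az in ("us-east-1a", "us-east-1b"):
        ec2.create_instances(
            ImageId="ami-12345678",      # placeholder AMI ID
            InstanceType="m3.large",     # placeholder instance type
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": az},
        )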

In case #1, you're protected against a bunch of problems, but not against a
plane hitting the building, someone hitting the EPO (emergency power off) by
accident, etc. In case #2,
you should be resilient against any single physical facility problem, but are
exposed to novel software risk.

The best solution would be #3 -- some consistent API to order more
conventional infrastructure in AWS-like timeframes. Arguably OpenStack or
other systems could offer this (using real SANs instead of EBS, real hardware
load balancers in place of ELB, ...), and you could presumably do some kind of
dedicated host provisioning using the same kind of APIs you use for VM
provisioning (big hosting companies have done this with PXE for years; someone
like Softlayer can provision a system in ~30 minutes on bare metal). Use
virtualization when it makes sense, and bare metal at other times (the big
Amazon compute instances are pretty close) -- although the virtualization
layer doesn't seem to be the real weakness, but rather all the other services
like ELB, EBS, RDS, etc.

Basically, what I want is IaaS from a provider who recognizes that software,
and especially big complex interconnected systems, is really hard, and who is
willing to sacrifice some technical impressiveness and peak efficiency for
reliability and easily characterized failure modes.

~~~
true_religion
> Dedicated/conventional servers: Burn 10 hours on sales and contract
> negotiation, vendor selection, etc., plus ~10-20h on anything specific to
> the vendor, and then set up systems. Get a bunch of dedicated boxes
> (colocating your own gear may be better at scale, but it's 10-20x the time
> plus upfront cost...),

Or just use a provider that you're already familiar with, at whatever list
price they give you. That's fair, since in option #2 you're assuming we
already know about Amazon, and we won't stress about the 3x dedicated server
cost.

> and set up single-site HA (hardware or software load balancer, A+B power,
> dual line cord servers for at least the databases, etc.).

Many high-end providers will simply do this for you, and have setups racked
and ready for you to use.

It's not really that hard so long as you're not going past, say, 15 boxes.
For anything up to that, I can pretty much guarantee you'll find a high-end
provider that will have you set up in 4-6 hours maximum.

------
droithomme
So generators at multiple sites all failed in the exact same way, being unable
to produce a stable voltage, even though they are all nearly new, have low
hours, and are regularly inspected and tested.

It can't just be an amazing coincidence that they all failed on the same day.
And the fact that they were all recently certified and tested means that that
process doesn't ensure they will come online, any more than the inspection
process worked at the Fukushima nuclear plant.

They don't give the manufacturer or model, and they say that they are going
to have the generators recertified and continue to use them. That means they
are not going to fix the problem, because they don't know why they failed.

You cannot fix a problem if you do not know what caused it.

~~~
jaylevitt
To my ears - and maybe this is just wishful hearing - it sounded like they
were very, VERY strongly pointing the finger at a certain unnamed generator
manufacturer, but doing so in a way that incurred no legal liability.

That manufacturer is probably flying every single C-level exec out to the US-
East data center, over the July 4th holiday, to personally disassemble the
generator, polish each screw, and carefully put it all back together while
singing an a cappella version of "Bohemian Rhapsody", including vocal
percussion.

And if they do it to Amazon's satisfaction, Amazon has hinted that they
_might_ decide not to out them to the rest of the world. That's called
leverage.

------
latch
I don't know how big these data centers are, but why don't they just build a
power plant right next door, dedicated and with underground lines?

I'm a fan of decentralized power generation, and it would seem like large
consumers have the most to gain.

Is this a regulation issue? I imagine Amazon becoming a provider of
electricity (even if just to itself) could become a political mess.

~~~
rdl
People normally try to site big datacenters with dual high-voltage feeds from
separate substations.

Losing the entire grid causes many problems, but that basically doesn't
happen (the last big one was the Northeast blackout of 2003):
<http://en.wikipedia.org/wiki/Northeast_blackout_of_2003>

The problem is that diesel generators basically suck, especially when left
powered off. In the long run, I predict fuel cells will take over the standby
power market.

~~~
gee_totes
In the short run, what's your alternative to diesel generators? I don't think
they suck.

~~~
Hoff
Alternatives to diesel?

Some sites use microturbines or full-size turbines, and some of the local
backup generators run on LP or LNG rather than diesel.

Neither diesel nor the gasoline blends are particularly stable during storage,
unfortunately.

Capstone claims a 10-second stabilization and transfer time for their
microturbines, and offers packages from tens of kilowatts up to a megawatt.

Con-Ed was (is?) running 155 MW and 174 MW turbines mounted on power barges,
and utilities have had turbines around at least as far back as the 1920s.

For some historical reading on turbine power generation:

<http://www.pondlucier.com/peakpower/blackstart/>

And yes, some of the fuel cell co-generation deployments look quite promising.

~~~
droithomme
Gas turbines are a good choice. Elsewhere in this thread, concerns were
raised about storing gas on site and about the tremendous cost. Gas is not
stored on site; it comes from the gas line, which doesn't stop working when
the power goes out. Regarding cost, McDonald's has experimented with running
its own gas turbines _per restaurant_, because it can be cheaper than paying
for an electric line in some areas. Here is a 13-year-old article about their
initial experiments with gas turbines:
[http://findarticles.com/p/articles/mi_m0FZX/is_6_65/ai_55084...](http://findarticles.com/p/articles/mi_m0FZX/is_6_65/ai_55084209/)

~~~
ibejoeb
Wow, neat concept for power-hungry businesses. I've been told that it's
essentially impossible to go totally off-grid (in the US), i.e., you can do
your own generation, but you'd sell it back into the grid and effectively pay
net for consumption, which might turn out to be income.

That could be dependent on zoning, though; I've heard this from a few
engineers designing solar systems for residential use.

------
drags
_For multi-Availability Zone ELBs, the ELB service maintains ELBs redundantly
in the Availability Zones a customer requests them to be in so that failure of
a single machine or datacenter won’t take down the end-point._

Based on how they behave in outages, I've always been curious (read:
suspicious) about whether ELBs were redundant across AZs or hosted in a single
AZ regardless of the AZs your instances are in.

It's good to hear that they are actually redundant and to understand how
they're added/removed from circulation in the event of problems.

~~~
WALoeIII
In my experience you get an IP returned as an A record for each AZ you have
instances in. Inside each AZ traffic is balanced equally across all instances
attached to the ELB. The ELB service itself is implemented as a Java server
running on EC2 instances, and it is scaled both vertically and horizontally to
maximize throughput.
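
As a quick illustration of the per-AZ A records (the hostname here is
hypothetical; substitute your own ELB's DNS name), a multi-AZ ELB should
resolve to one IP per AZ:

    import socket

    # Hypothetical ELB DNS name -- substitute your own.
    ELB_HOSTNAME = "my-elb-1234567890.us-east-1.elb.amazonaws.com"

    # Collect the distinct IPs in the DNS answer; with instances in
    # two AZs you'd expect to see two addresses.
    addrs = {info[4][0] for info in
             socket.getaddrinfo(ELB_HOSTNAME, 80, proto=socket.IPPROTO_TCP)}
    for ip in sorted(addrs):
        print(ip)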

------
rdl
I'm kind of confused why a UPS failure doesn't trigger an emergency EBS
shutdown procedure that's more graceful than just powering it all off.
Blocking new writes, letting in-flight operations complete, and unmounting in
the last 30 seconds would save a LOT of hassle later.

~~~
EwanToo
Blocking new writes on its own would cause instant filesystem corruption for
all the hosts using EBS, unless they had already completed their writes and
had time to flush their disk caches.

You'd need to integrate it with each running VM, possibly just by sending it
the equivalent of a shutdown command from the console, so that it understood
the disk was going away in X seconds and would shut down any databases
immediately, flush all caches, unmount filesystems, and shut itself down.

It wouldn't be massively difficult, but it's not as simple as just shutting
down EBS.
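
Nothing like this exists today as far as I know, but a hypothetical
guest-side handler might look something like this sketch (the service name
and mount points are made up):

    import subprocess

    # Hypothetical EBS-backed mount points on this instance.
    EBS_MOUNTS = ["/var/lib/mysql", "/data"]

    def graceful_ebs_shutdown(seconds_left=30):
        """React to a 'power going away in seconds_left seconds'
        signal from the console by quiescing writers and unmounting
        cleanly (the deadline isn't enforced in this sketch)."""
        # 1. Stop the heaviest writers first (databases).
        subprocess.run(["service", "mysql", "stop"], check=False)
        # 2. Flush dirty pages from the page cache to disk.
        subprocess.run(["sync"], check=True)
        # 3. Unmount the EBS filesystems so they come back clean,
        #    with no fsck needed.
        for mount in EBS_MOUNTS:
            subprocess.run(["umount", mount], check=False)
        # 4. Halt the instance before the power actually drops.
        subprocess.run(["shutdown", "-h", "now"], check=False)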

~~~
joahua
Wouldn't there be contention issues around writes to EBS in this case? I know
very little about this, but I keep hearing about EC2's relatively poor IO
performance, and I imagine there would be a big fat traffic jam if every
running instance received a signal to get its house in order before
filesystems are forcibly unmounted.

~~~
EwanToo
Absolutely, and that seems to be a big part of why EBS takes so long to come
back - simply bringing all those disks back up over the network generates so
much traffic that it takes forever!

On the other hand, an fsck to recover the filesystem probably causes even more
traffic.

------
Maxious
"many clients, especially game consoles and other consumer electronics, only
use one IP address returned from the DNS query."

Is this referring to Netflix?

~~~
zhoutong
AWS officially recommends CNAME records for ELBs, but the IP addresses don't
change regularly and also CNAME for root host name won't work if other records
are present, so many sysadmins straightaway use A records with the ELB IPs.

~~~
rdl
I have never understood the design decision to require the zone apex to be
an A record. (I mean, I know that's what the RFC says, but I don't know what
the RFC authors were thinking back in ancient times.)

~~~
zhoutong
Well, disregarding the design decision, or the RFC, it's basically wrong to
use a CNAME alongside other records, especially MX records.

Considering that most active domains have MX records at the root host name
(like example.com.), CNAMEs won't work for those hostnames.

If there were a native implementation of an "ALIAS" record that simply
pointed to the corresponding A records of the target, it would work anywhere.
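
The flattening step is simple: at query time, the server chases the target's
A records itself and returns them under the apex name, so the apex stays a
plain A record and MX records can coexist. A tiny sketch of that lookup
(hostnames are hypothetical):

    import socket

    def flatten_alias(alias_target):
        """Resolve the target's current A records, the way an
        ALIAS-aware server would on every query."""
        _canonical, _aliases, addresses = socket.gethostbyname_ex(alias_target)
        return addresses

    # "example.com. IN ALIAS my-elb-123.us-east-1.elb.amazonaws.com."
    # would then be answered as if the zone contained:
    for ip in flatten_alias("my-elb-123.us-east-1.elb.amazonaws.com"):
        print("example.com. 60 IN A %s" % ip)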

------
pauly007
So far the comments have focused on the technical aspects outlined as points
of failure in Amazon's summary: grid failure, diesel generator failure, and
the complexities of the Amazon stack. What are your thoughts on Amazon's
professionalism in their response and action plan going forward? If you're an
AWS customer, does this style of response keep you on board?

~~~
rdl
The level of clarity in Amazon's post-incident reporting is excellent. Their
during-incident reporting is sub-par: Amazon seems to try to minimize any
acknowledgment that more than a single AZ is affected in their realtime
reporting during outages, and there's a disconnect between the graphics and
the text.

What I don't like is that they make repeated promises about AZs and features
which are repeatedly shown to be untrue. They also have never disclosed their
testing methodology, which leads me to assume there isn't much of one. That
makes me unlikely to rely on any AWS service I can't fully exercise myself,
or which hasn't shown itself to be robust. S3, sure. EC2, sure (except don't
depend on being able to modify instances during an outage). EBS, oh hell no.
ELB, probably not my first choice, and certainly not my sole
fault-tolerance/HA layer. Route 53, which I haven't messed with, actually
does seem really good, but since it's complex, I'm scared given the track
record of the other complex components.

~~~
bifrost
"What I don't like is that they make repeated promises about AZes and features
which are repeatedly shown to be untrue."

That's pretty much SOP in the hosting business; nobody really knows what's
going to happen, because nobody really knows how to test. Most developers
can't or don't understand what ops failures are like, and therefore most
testing is only superficial.

~~~
rdl
Right, but I think we've established that Amazon is actually a software
company which happens to run a retail business and contract for some
datacenters -- the same way Google is either a supercomputer company or an
advertising company that uses search to acquire users cheaply.

------
rdl
Impressive that the failure of grid power (a frequent event) and a SINGLE
GENERATOR BANK causes this much chaos on the Internet.

~~~
cperciva
If I"m reading it right, it's a failure of grid power and two or more
generator banks: "each generator independently failed to provide stable
voltage".

~~~
rdl
Ah, yeah, you're right (I was assuming they had 2 generators and needed both
for the load, but it looks like they have 1+1 or N+1, probably split by room,
like a real colo facility usually does).

------
jpetazzo
It's interesting to see that they had similar issues 2 weeks ago (a power
outage), and it looks like nothing was done in those 2 weeks to address the
issue, since it happened again this weekend.

------
nmcfarl
I like that all times are reported in PDT for an event that happened on the
east coast - it says something about priorities.

~~~
smackfu
If the outage affected people in multiple time zones, which time zone should
they use to report it? The priority is to make the write-up clear, which
means picking a single time zone and being consistent. I guess you could
argue it should be UTC, but practically it makes no difference.

~~~
cperciva
_I guess you could argue it should be UTC, but practically it makes no
difference._

Everybody knows, or should know, their time zone's delta to UTC. I'm sure
there are lots of people who don't know their delta to PDT.

