
In case your browser doesn't speak RSS:

Service is operating normally: Root cause for June 14 Service Event (June 16, 2012 3:15 AM)

We would like to share some detail about the Amazon Elastic Compute Cloud (EC2) service event last night when power was lost to some EC2 instances and Amazon Elastic Block Store (EBS) volumes in a single Availability Zone in the US East Region.

At approximately 8:44PM PDT, there was a cable fault in the high voltage Utility power distribution system. Two Utility substations that feed the impacted Availability Zone went offline, causing the entire Availability Zone to fail over to generator power. All EC2 instances and EBS volumes successfully transferred to back-up generator power. At 8:53PM PDT, one of the generators overheated and powered off because of a defective cooling fan. At this point, the EC2 instances and EBS volumes supported by this generator failed over to their secondary back-up power (which is provided by a completely separate power distribution circuit complete with additional generator capacity). Unfortunately, one of the breakers on this particular back-up power distribution circuit was incorrectly configured to open at too low a power threshold and opened when the load transferred to this circuit. After this circuit breaker opened at 8:57PM PDT, the affected instances and volumes were left without primary, back-up, or secondary back-up power. Those customers with affected instances or volumes that were running in multi-Availability Zone configurations avoided meaningful disruption to their applications; however, those affected who were only running in this Availability Zone had to wait until the power was restored to be fully functional.

The generator fan was fixed and the generator was restarted at 10:19PM PDT. Once power was restored, affected instances and volumes began to recover, with the majority of instances recovering by 10:50PM PDT. For EBS volumes (including boot volumes) that had inflight writes at the time of the power loss, those volumes had the potential to be in an inconsistent state. Rather than return those volumes in a potentially inconsistent state, EBS brings them back online in an impaired state where all I/O on the volume is paused. Customers can then verify the volume is consistent and resume using it. By 1:05AM PDT, over 99% of affected volumes had been returned to customers with a state 'impaired' and paused I/O to the instance.
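
That customer-side verify-and-resume step maps onto the EC2 volume-status APIs; below is a minimal sketch, assuming the boto3 SDK (which post-dates this event), an assumed region, and a placeholder where the actual consistency check would go:

    # A rough sketch, assuming boto3 and default credentials: find volumes
    # flagged impaired, verify them, then resume I/O with EnableVolumeIO.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")  # region assumed

    impaired = ec2.describe_volume_status(
        Filters=[{"Name": "volume-status.status", "Values": ["impaired"]}]
    )["VolumeStatuses"]

    for vol in impaired:
        vol_id = vol["VolumeId"]
        print(vol_id, vol["VolumeStatus"]["Details"])
        # ... attach the volume and run fsck (or your own consistency check) ...
        # Once satisfied the data is consistent, resume I/O on the volume:
        ec2.enable_volume_io(VolumeId=vol_id)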

Separate from the impact to the instances and volumes, the EBS-related EC2 API calls were impaired from 8:57PM PDT until 10:40PM PDT. Specifically, during this time period, mutable EBS calls (e.g. create, delete) were failing. This also affected the ability for customers to launch new EBS-backed EC2 instances. The EC2 and EBS APIs are implemented on multi-Availability Zone replicated datastores. The EBS datastore is used to store metadata for resources such as volumes and snapshots. One of the primary EBS datastores lost power because of the event. The datastore that lost power did not fail cleanly, leaving the system unable to flip the datastore to its replicas in another Availability Zone. To protect against datastore corruption, the system automatically flipped to read-only mode until power was restored to the affected Availability Zone. Once power was restored, we were able to get back into a consistent state and returned the datastore to read-write mode, which enabled the mutable EBS calls to succeed. We will be implementing changes to our replication to ensure that our datastores are not able to get into the state that prevented rapid failover.
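
During a window like this, mutable calls simply fail until the datastore is writable again. A common client-side stance (nothing Amazon prescribes here, just a general pattern) is to retry with capped exponential backoff; a rough boto3 sketch with illustrative retry limits:

    # Sketch only: retry a mutating EBS call while the control plane rejects
    # writes. Attempt counts and sleep caps are illustrative. Note that blind
    # retries of non-idempotent calls like CreateVolume can leave duplicate
    # volumes if a request succeeded but its response was lost.
    import time
    import boto3
    import botocore.exceptions

    ec2 = boto3.client("ec2", region_name="us-east-1")

    def create_volume_with_backoff(az, size_gib, max_attempts=6):
        delay = 1
        for attempt in range(1, max_attempts + 1):
            try:
                return ec2.create_volume(AvailabilityZone=az, Size=size_gib)
            except botocore.exceptions.ClientError as err:
                print("attempt %d failed: %s" % (attempt, err))
                time.sleep(delay)
                delay = min(delay * 2, 60)  # cap the backoff at 60 seconds
        raise RuntimeError("EBS control plane still unavailable after retries")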

Utility power has since been restored and all instances and volumes are now running with full power redundancy. We have also completed an audit of all our back-up power distribution circuits. We found one additional breaker that needed corrective action. We've now validated that all breakers worldwide are properly configured, and are incorporating these configuration checks into our regular testing and audit processes.

We sincerely apologize for the inconvenience to those who were impacted by the event.




"service event"

Added to my list of favourite euphemisms.


Sounds like Amazon is doing something wrong; shouldn't it fail over to battery, then generator?


The batteries at colocation facilities are only designed to hold power long enough to transfer to the generator. They're also a huge single point of failure. A better design is a flywheel that stores enough energy to carry the load until the generators come up. But datacenters are often hit with these generator failures (in my experience, once every year or so).

Amazon had a correct setup--but not great testing.

By the way, these are great questions to ask of your datacenter provider: Are there two completely redundant power systems up to and including the PDUs and generators? How often are those tested? How do I set up my servers properly so that if one circuit/PDU/generator fails, I don't lose power?

There is a "right way" to do this--multiple power supplies in every server connected to 2 PDUs connected to 2 different generators--but it's expensive, and many/most low-end hosting providers won't set this up due to the cost.

(I ran a colocation/dedicated server company from 2001-2007.)


"Are there two completely redundant power systems up to and including the PDUs and generators? How often are those tested?"

And "are they tested for long enough to detect a faulty cooling fan that'll let the primary generator run at normal full working load for ~10mins and are the secondary gensets run and loaded up long enough to ensure something that'll trip ~5mins after startup isn't configured wrong?

While they clearly failed, I do have some sympathy for the architects and ops staff at Amazon here. I could very easily imagine a testing regime which regularly kicked both generator sets in, but without running them at working load for long enough to notice either of those failures. My guess is someone was feeling quite smug and complacent 'cause they've got automated testing in place showing months and years' worth of test switching from grid to primary to secondary and back, without ever having thought to burn enough fuel to keep the tests running the generators long enough to expose these two problems.

"There is a "right way" to do this …"

There's a _very_ well known "right way" to do this in AWS - have all your stuff in at least two availability zones. Anybody running mission-critical stuff in a single AZ has either chosen to accept the risk, or doesn't know enough about what they're doing… (Hell, I've designed - but never got to implement - projects that spread over multiple cloud providers, to avoid the possible failure mode of "What happens if Amazon goes bust / gets bought / all goes dark at once?")
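
To make that concrete, here is a minimal sketch of spreading instances across every zone a region offers, assuming boto3; the AMI ID and instance type are placeholders, and a real deployment would more likely use an Auto Scaling group spanning subnets in several AZs:

    # Sketch: one instance per Availability Zone, so a single-AZ power event
    # doesn't take the whole fleet down. ImageId/InstanceType are placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    zones = [z["ZoneName"]
             for z in ec2.describe_availability_zones()["AvailabilityZones"]
             if z["State"] == "available"]

    for az in zones:
        ec2.run_instances(
            ImageId="ami-xxxxxxxx",   # placeholder AMI
            InstanceType="t3.micro",
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": az},
        )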


Or maybe they should just do it backwards.

Run the generators and have the grid as backup.

And just stop the generators to validate fallback to the grid once in a while.


Large cogeneration sites (where they use the waste heat from electrical generation for process steam, building/district heating, etc.) actually do run in grid-backup mode. An example is MIT's cogeneration plant (a couple of big natural gas turbines on Vassar street) -- a lot of universities do this since they can use the steam for heating, and a lot of industrial sites do it for process.

It comes down to cost and zoning/permitting. It's much easier to get a permit to run a generator for backup use than to run one 24x7. It's also hard to get a 1-10MW plant that is as efficient/inexpensive per kWh as the grid (although now that natural gas is about 20% of what it was when I last bought it, gas turbines actually are cheaper than industrial tariff grid power, if you have good gas access...). Being able to actually use the waste heat is what makes the combined cycle efficiency worth it.
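
Back-of-the-envelope, the comparison looks something like the sketch below; the heat rate, gas price, and tariff are illustrative assumptions, not figures from this comment, and fuel is only part of the total cost:

    # Rough arithmetic sketch; every number here is an assumption.
    gas_price_per_mmbtu = 3.00      # $/MMBtu, assumed gas price
    heat_rate_btu_per_kwh = 10_000  # assumed simple-cycle turbine heat rate
    grid_tariff_per_kwh = 0.10      # assumed industrial tariff, $/kWh

    fuel_cost_per_kwh = gas_price_per_mmbtu * heat_rate_btu_per_kwh / 1_000_000
    print("fuel cost: $%.3f/kWh vs grid $%.2f/kWh"
          % (fuel_cost_per_kwh, grid_tariff_per_kwh))
    # Recovering the waste heat (cogeneration) improves the effective economics
    # further, which is the comment's point about combined efficiency.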

There was a crazy plan to run a datacenter on a barge tethered to the SF waterfront, for a variety of reasons, a primary one being power -- the SF city government wouldn't be able to regulate the engines/generators on a ship running 24x7.


My university had a big cogen plant, but it was never designed to power the entire campus (it was only able to do so at around 3 AM). Aside from providing heating and power, because it ran off of natural gas it qualified for clean energy credits, which the university made money from by selling them on the market.


Hmm, wouldn't it be less practical to do that with large CHP plants (vs small ones)? Here in Europe district heating CHP plants are generally run by utilities.


That sounds like the Crash-Only Software paper, but with respect to power sources.


The rotational UPSes are the cause of the majority of 365 Main's downtime, and in general, horrible and must be destroyed with prejudice.

They're a nice idea in principle (and were the best option back in the mainframe era), but power electronics have gotten better faster than rotational maintenance at a datacenter company. They also weren't widely deployed enough to have a great support system, and it was firmware/software which caused most of their outages.

Dual line cord for network devices, and then STSes (static transfer switches) per floor area, probably make the most sense. Basically no commodity hosting provider uses dual line cord servers on A and B buses. I love having dual line cord for anything "real" (including virtualization servers for core infrastructure), but when you're selling servers for $50/mo, you can't.

(there's the Google approach to put a small UPS in each server, too...)


I had to upvote you just because I'm still pissed at 365 Main dropping power to our entire cage five years ago.


(Person you replied to here.) "The rotational UPSes are the cause of the majority of 365 Main's downtime, and in general, horrible and must be destroyed with prejudice."

No. Incorrect. There is a reason I 100% refused to move my hosting company there. I'm not going to say anything else publicly, but it wasn't the hardware that caused repeated outages there. (I moved my hosting company from San Francisco to San Jose, and lived in the Bay Area for 10 years. Everyone in the hosting industry in the Bay Area knew each other. I also hosted for years in AboveNet SJC3, which had the same flywheel setup.)

Note: I hope at this point they've fixed the issue. I've been out of the industry for a few years. I wish them the best.


Yes, I almost took half a floor of 365 Main back in 2003-2004, and didn't due to their (at the time) tenuous financial situation and thus being underresourced on everything. That and there being ~no carriers in the building at the time. For SF colo, 200 Paul remains a superior choice, although some floors have had problems, and it's a totally conventional design.

But the hitech UPS was a weak link. When they sold all their facilities to someone else (DRT), that fixed most of the other issues.


Totally correct -- our data center, and many of the universities that we work with, have a right-hand (RH) and left-hand (LH) power feed for a dual-PDU server, switch, router, console server, etc. You will typically run a power bar on the right and left hand side of the rack and wire them accordingly. You will have dedicated breakers, panels, UPS, generator, etc. for the RH/LH side. If you ever need to service a panel, then you can safely cut power knowing that everything will still be powered by its partner. This happens once in a while and gives you breathing room if you need to replace a breaker or a power feed fails. We also test our generators on a monthly cycle to test for failures.
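
As an illustration of the kind of audit this wiring scheme makes possible, here is a small sketch; the inventory layout and feed names are hypothetical:

    # Sketch: flag dual-PSU devices whose two cords don't actually land on
    # different (RH vs LH) feeds. The inventory dict is a hypothetical
    # stand-in for whatever your DCIM tool exports.
    inventory = {
        "router-1":  ["RH-panel-2", "LH-panel-2"],
        "switch-3":  ["RH-panel-1", "RH-panel-1"],  # miswired: both cords on RH
        "server-42": ["LH-panel-4", "RH-panel-4"],
    }

    for device, feeds in inventory.items():
        sides = {feed.split("-")[0] for feed in feeds}
        if sides != {"RH", "LH"}:
            print("WARNING: %s is not split across RH/LH feeds: %s"
                  % (device, feeds))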

I also wanted to address your point about batteries. We have a device on each battery that monitors its state, so we can find faults before they cause the entire UPS to fail.


"We also test our generators on a monthly cycle to test for failures."

Curious - when "testing" them, how long do you run them for and at what load?

I could see the beancounters being _very_ unhappy with the ops people saying "we want to run both gen sets at full datacenter load for more than 10 minutes at a time, every month", which is what Amazon would have to have done to detect the faulty cooling fan problem. I'm guessing there are _some_ organisations who do that, but I suspect most datacenters don't.


I work for the Fed. We are a remote site and have several power outages each year due to trees on the power lines or snow related issues. It's pretty much required and we've had zero issues justifying it.


The baseline generator maintenance cost exceeds the fuel cost every month. A bigger issue is getting permits from your local/moronic government to run your generators for testing, but for a big datacenter, you're probably in an industrial neighborhood (to get correct power from multiple substations or higher) and this isn't an issue -- it's more an issue with office-datacenters or other backup systems in normal residential/commercial neighborhoods.


It takes a lot of discipline to run a shared power setup like you describe. Most of those servers that have two power supplies operate in a shared power mode, rather than active/failover. This means that if one of your sides (LH/RH) is over 50% and you fail over, you are going to have a cascading failure as the other side goes >100%. It used to surprise me how often I saw things like this, although I'm talking about server rooms in the low 100s of servers, not huge data centers.
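
The arithmetic behind that cascade is simple; a tiny sketch with made-up load numbers:

    # Sketch: with load-sharing dual supplies, each side must stay under 50% of
    # its bus capacity, or a failover pushes the survivor past 100%. Numbers
    # are hypothetical.
    bus_capacity_kw = 100.0
    load_a_kw = 62.0  # side A is already past the 50% line
    load_b_kw = 45.0

    combined_kw = load_a_kw + load_b_kw
    if combined_kw > bus_capacity_kw:
        print("Losing either side overloads the survivor: %.0f kW > %.0f kW"
              % (combined_kw, bus_capacity_kw))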


Ideally, you want at least three generators, any two of which can power the facility.
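
Put another way, each unit then has to be rated for at least half the facility load; a tiny worked example with assumed numbers:

    # Sketch: with N generators where any K must carry the facility, each unit
    # needs a rating of at least load / K. All numbers are illustrative.
    facility_load_kw = 4000
    n_generators = 3
    k_required = 2  # "any two of which can power the facility"

    min_rating_kw = facility_load_kw / k_required
    print("Each of the %d generators must be rated >= %.0f kW"
          % (n_generators, min_rating_kw))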


Telcordia recommends start-testing generators monthly and doing a full load test once a year. Not every generator that starts will carry a load, as many folks have discovered.


No, doing it right = everything runs off battery, and you should be testing your gensets every day or so - we certainly did back in the '80s at Telecom Gold.


Battery backup isn't usually considered a "failover" step, just like your desktop battery backup doesn't actually do much other than stop charging when the power goes out.

Datacenters only really have battery backup to let the emergency generators come up.


You're correct. It goes battery, then generator. If they didn't use battery first, then when the power initially failed all systems would be offline, as the generator takes about 15-30 seconds to kick in.


Whether he's correct depends entirely on the facility. Many newer datacenters don't use battery banks--they are expensive to maintain and often cause more failures than they prevent.


From what I can remember of a datacenter tour, most generators supply power to the batteries, which then supply power to servers. The batteries can only supply a few minutes (I think) of power, so the generators need to turn on immediately.



