So generators at multiple sites all failed in the exact same way, being unable to produce a stable voltage, even though they are all nearly new, have low hours, and are regularly inspected and tested.
It can't just be an amazing coincidence that they all failed on the same day. The fact that they were all recently certified and tested means that process doesn't ensure they will come online, any more than the process worked at the Fukushima nuclear plant.
They don't give the manufacturer or model, and they say that they are going to have them recertified and continue to use them. So that means they are not going to fix the problem, because they don't know why they failed.
You cannot fix the problem if you do not know what caused it.
To my ears - and maybe this is just wishful hearing - it sounded like they were very, VERY strongly pointing the finger at a certain unnamed generator manufacturer, but doing so in a way that incurred no legal liability.
That manufacturer is probably flying every single C-level exec out to the US-East data center, over the July 4th holiday, to personally disassemble the generator, polish each screw, and carefully put it all back together while singing an a cappella version of "Bohemian Rhapsody", including vocal percussion.
And if they do it to Amazon's satisfaction, Amazon has hinted that they might decide not to out them to the rest of the world. That's called leverage.
It was only at one site that the generators failed, but it was two generators in that site:
In the single datacenter that did not successfully transfer to the generator backup, all servers continued to operate normally on Uninterruptable Power Supply (“UPS”) power. As onsite personnel worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT. Ten minutes later, the backup generator power was stabilized, the UPSs were restarted, and power started to be restored by 8:14pm PDT. At 8:24pm PDT, the full facility had power to all racks.
It sounds like they were too optimistic about their generator startup times.
It seems to me that almost all of the issues revolved around the complicated EC2/EBS control systems. Time and time again, we hear about an AWS failure which resulted in brief instance outages. If it was just a dumb datacentre, the affected servers would just boot up, run an fsck on the disks, and return to service. But because of the huge complexity added by the AWS stack, the control systems inevitably start failing, preventing everything from starting up normally.
I can't help but get the feeling that, if it weren't for the fancy "elastic" API stuff, these outages would remain nothing more than just minor glitches. At this point, I don't see how you could possibly justify running on AWS. Far better to just fire up a few dedicated servers at a dumb host and let them do their job.
It's undeniable that the added complexity of AWS makes it harder to predict how it will behave in any specific failure condition, which is one of the basic things you want to work out when designing it. "Traditional" infrastructure is hard enough (especially networking), and still fails in unique ways, and we've had 10-100 years to characterize it.
Having a common provisioning API in a bunch of physically diverse centers IS a huge advantage for availability, cost, etc, though. If you have 100 hours to set up a system, and $10k/mo, you have two real choices:
1) Dedicated/conventional servers: Burn 10 hours of sales and contract negotiation, vendor selection, etc., plus ~10-20h in anything specific to the vendor, and then set up systems. Get a bunch of dedicated boxes (coloing your own gear may be better at scale but 10-20x the time and then upfront cost...), set up single-site HA (hardware or software load balancer, A+B power, dual line cord servers for at least databases, etc.).
2) Set up AWS. Since the lowest tiers are free, it's possible you have a lot of experience with it already. 1h to buy the product, extra time to script to the APIs, be able to be dynamic. You could probably be resilient against single-AZ failure in 100h with $10k/mo, although doing multi-region (due to DB issues and minimum spend per region) might be borderline.
In case #1, you're protected against a bunch of problems, but not against a plane hitting the building, someone hitting EPO by accident, etc. In case #2, you should be resilient against any single physical facility problem, but are exposed to novel software risk.
The best solution would be #3 -- some consistent API to order more conventional infrastructure in AWS-like timeframes. Arguably OpenStack or other systems could offer this (using real SANs instead of EBS, real hw load balancers in place of ELB, ...), and you could presumably do some kind of dedicated host provisioning using the same kind of APIs you use for VM provisioning (big hosting companies have done this with PXE for years; someone like Softlayer can provision a system in ~30 minutes on bare metal). Use virtualization when it makes sense, and bare metal at other times (the big Amazon Compute instances are pretty close) -- although the virtualization layer doesn't seem to be the real weakness, but rather all the other services like ELB/EBS, RDS, etc.
Basically, IaaS by a provider who recognizes that software, and especially big complex interconnected systems, is really hard, and who is willing to sacrifice some technical impressiveness and peak efficiency for reliability and easy characterization of failure modes.
> Dedicated/conventional servers: Burn 10 hours of sales and contract negotiation, vendor selection, etc., plus ~10-20h in anything specific to the vendor, and then set up systems. Get a bunch of dedicated boxes (coloing your own gear may be better at scale but 10-20x the time and then upfront cost...),
Or just use a provider that you're already familiar with, at whatever list price they give you. This is fair since in option #2 you're assuming that we already know about Amazon, and won't stress about the 3x dedicated server cost.
> set up single-site HA (hardware or software load balancer, A+B power, dual line cord servers for at least databases, etc.
Many high-end providers will simply do this for you, and have setups ready on rack for you to use.
It's not really that hard so long as you're not going past say... 15 boxes. For anything less than that, I can pretty much guarantee you'll find a high-end provider that will have you set up in 4-6 hours maximum.
I have only read briefly about it but there's also the new Metal as a Service (MaaS) with the latest Ubuntu. From what I understand, it gives you those APIs to manage bare metal for those cases where virtualization doesn't make sense.
It takes time to boot up VMs, basically. System booting is one of the most expensive operations in virtualization, especially when the disk image is somewhere in the network. There will be a lot of random I/O, memory allocation and high CPU usage. It takes a few seconds of CPU time for a decent server to initialize a KVM VM. Try multiplying that by hundreds of thousands of instances.
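A rough back-of-the-envelope version of that multiplication. Both inputs are made-up placeholders (the comment only says "a few seconds" and "hundreds of thousands"), so treat the result as an order of magnitude, not a measurement:

```python
# Back-of-the-envelope cost of a mass reboot.
# Both inputs are illustrative assumptions, not measured AWS figures.
CPU_SECONDS_PER_BOOT = 3   # stand-in for "a few seconds of CPU time" per KVM VM
INSTANCES = 200_000        # stand-in for "hundreds of thousands of instances"


def total_boot_cpu_hours(cpu_seconds_per_boot: float, instances: int) -> float:
    """Total CPU time spent purely on booting, in CPU-hours."""
    return cpu_seconds_per_boot * instances / 3600


if __name__ == "__main__":
    hours = total_boot_cpu_hours(CPU_SECONDS_PER_BOOT, INSTANCES)
    print(f"{hours:.0f} CPU-hours just to boot everything")
```

That ignores the random I/O against network-hosted disk images, which is probably the worse bottleneck when everything boots at once.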
For multi-Availability Zone ELBs, the ELB service maintains ELBs redundantly in the Availability Zones a customer requests them to be in so that failure of a single machine or datacenter won’t take down the end-point.
Based on how they behave in outages, I've always been curious (read: suspicious) about whether ELBs were redundant across AZs or hosted in a single AZ regardless of the AZs your instances are in.
It's good to hear that they are actually redundant and to understand how they're added/removed from circulation in the event of problems.
In my experience you get an IP returned as an A record for each AZ you have instances in. Inside each AZ traffic is balanced equally across all instances attached to the ELB. The ELB service itself is implemented as a Java server running on EC2 instances, and it is scaled both vertically and horizontally to maximize throughput.
build a power plant right next to it, dedicated and with underwire lines
They did, and it's called "backup generators". :-)
More seriously, power plants go offline on a regular basis -- typical availability factors are around 90% due to the need for regular maintenance. You need to have a power grid in order to have any reasonable availability.
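To put a number on why one dedicated plant isn't enough, here is a minimal sketch of how availability combines across several sources. It assumes the sources fail independently, which is a big assumption (a storm can take out several at once), and it reuses the ~90% figure above:

```python
def combined_availability(availabilities):
    """Probability that at least one source is up.

    Assumes independent failures, which real power sources only
    approximate (weather and shared fuel supply correlate them).
    """
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= (1.0 - a)
    return 1.0 - p_all_down


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(n, "source(s):", f"{combined_availability([0.90] * n):.3f}")
```

One plant at 90% is down more than a month per year in total; it takes several independent sources, which is what a grid gives you, to get to anything a datacenter would accept.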
Gas turbines are a good choice. Mentioned elsewhere in this thread are concerns about storing gas on site and tremendous cost. Gas is not stored on site; it comes from the gas line, which doesn't stop working when the power goes out. Regarding cost, McDonald's has experimented with running their own gas turbines per restaurant because it can be cheaper than paying for an electric line in some areas. Here is a 13 yr old article about their initial experiments with gas turbines:
Wow, neat concept for power-hungry businesses. I've been told that it's essentially impossible to go totally off-grid (in the US), i.e., you can do your own generation, but you'd sell it back into the grid and effectively pay net for consumption, which might turn out to be income.
That could depend on zoning, though; I've heard this from a few engineers designing solar systems for residential.
Probably good in some areas (Canada near a pipeline but with bad roads), and bad in seismic areas like California (gas is likely to get shut off) or California (PG&E malinvestment likely to cause all gas lines to explode independent of earthquake).
Yeah, lots of little BTSes, relay stations, etc. use Lister LPG reciprocating engines rather than diesel because the long-term fuel stability is better, and because they're a lot more reliable at unattended automatic start than diesels, especially in the 10-20 kW range (where the available diesels are particularly bad).
I don't personally know of any datacenters using turbines for backup power. I don't know enough about turbines to know why that is -- maybe they're not available in smaller sizes, or have higher maintenance requirements, or higher capex per MW, or something else. The only turbines I've seen are cogen/prime power with grid backup, or marine.
> I don't personally know of any datacenters using turbines for backup power. I don't know enough about turbines to know why that is -- maybe they're not available in smaller sizes, or have higher maintenance requirements, or higher capex per MW, or something else. The only turbines I've seen are cogen/prime power with grid backup, or marine.
They are just expensive; and then you also need a (safe) way to store gas on site. They also have small efficiency bands and therefore aren't always optimal for certain applications. Larger turbines also need a diesel generator to start them ("black start").
I don't think you'd use natural gas for backup power; you'd probably use Jet-A or actual diesel. (Most turbines in the world probably run on Jet-A, being in actual jets; natural gas gets used for stationary power because it's cheaper and cleaner burning, sure, but that shouldn't be a major factor in a backup generator.)
Diesel stores pretty well, if you can keep water out of it, otherwise it's prone to algae formation. Big standby generation sites will filter their diesel to remove water, either continuously or periodically.
Dual HV grid feeds (from different grids, or really distinct parts of the same grid). I'll pay extra for datacenters with those. That will dramatically reduce the number of times you lose all utility power. Also, pick a utility which doesn't suck (which also saves you money); i.e. Silicon Valley Power instead of PG&E (which is why you see all the best datacenters in the Bay Area in the South...)
There's no real current alternative to diesel generators for true standby power, but they still suck.
I visited a large data center about five years ago, and they had a power plant on site. However, California has state regulations that they have to use a percentage of power from the utility company (i.e., they can't go completely off grid).
There is a Rather Large hydro dam in Russia (Sayano-Shushenskaya) which was generating too much energy in times of heavy rain (too limited capacity to bring the excess to other markets), so they basically had to build an aluminum smelter there to add load.
I'm kind of confused why UPS failing doesn't lead to an emergency EBS shutdown procedure which is more graceful than just powering it all off. Blocking new writes, letting stuff complete, and unmount in the last 30 seconds would save a LOT of hassle later.
I was wondering about that too -- I'd love to see the same thing happen with EC2 instances, sending an instance stop signal five minutes before power goes out so that daemons can be stopped and the OS can sync its filesystems to disk before shutting down.
My guess is that this functionality is missing for historical reasons -- the original S3-backed instances didn't have any concept of powering off an instance.
If they did that, they could almost cut down to 6min battery per site and skip the generators. In the event of any problem, sound the 5 minute warning, and relocate all jobs to another datacenter.
(generators and UPS might be cheaper, especially since all facilities have them, than getting all users to be good at handling failure gracefully. It's ok for an in-house datacenter doing batch processing, or with great developers, like Google, but probably not for a generalized IaaS)
Blocking new writes on its own would cause instant filesystem corruption for all the hosts using EBS, unless they had already completed their writes and had time to flush their disk caches.
You'd need to integrate it with each running VM, possibly just by sending it the equivalent of a shutdown command from the console, so that it understood that the disk was going away in X seconds and that it should shut down any databases immediately, flush all caches, unmount filesystems, and shut itself down.
It wouldn't be massively difficult, but not as simple as simply shutting down EBS.
Wouldn't there be contention issues around writes to EBS in this case? I know very little about this, but I keep hearing about EC2's relatively poor IO performance and imagine that there would be a big fat traffic jam if every running instance received a signal to get their house in order before filesystems are forcibly unmounted.
The problem is that a lot of inconsistencies are caused by the software, not I/O or cache. For example, if your database is in the middle of a transaction and everything is in memory, you can't "let the stuff complete" without installing software on the database server to monitor for power failures.
Yes, you'd have to implement this in the application layer. There would be an API which sends out the 5 minute warning to all systems, which would then need to clean up and die before hard shutdown. This has existed in the physical server and UNIX world forever.
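For what the application side of that could look like, here is a minimal sketch. It assumes the "5 minute warning" arrives as a plain SIGTERM, which is purely a guess (no such AWS API is described in the post-mortem), and `ShutdownWarning`, `run`, `do_work` and `flush` are made-up names:

```python
import signal
import threading


class ShutdownWarning:
    """Turns a warning signal into a flag the main loop can poll."""

    def __init__(self):
        self.requested = threading.Event()

    def handle(self, signum, frame):
        # Keep the handler tiny: just record that the warning arrived.
        self.requested.set()


def run(warning, do_work, flush):
    """Do units of work until warned, then flush once and return cleanly."""
    while not warning.requested.is_set():
        do_work()
    flush()  # e.g. finish or roll back transactions, fsync, unmount


if __name__ == "__main__":
    warning = ShutdownWarning()
    # Assumption: the provider's warning reaches the process as SIGTERM.
    signal.signal(signal.SIGTERM, warning.handle)
    units = []

    def do_work():
        units.append(1)
        if len(units) == 3:
            # Simulate the warning; in real use it would come from outside.
            signal.raise_signal(signal.SIGTERM)

    run(warning, do_work, flush=lambda: print("flushed after", len(units), "units"))
```

The point about software state above still applies: the handler only buys time, and it's up to each daemon to know how to get to a consistent state within it.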
It's especially interesting to look at Amazon.com vs. AWS.
IIRC, Amazon.com doesn't use ELB, they use some proper load balancers (with DNS above) in front of several Regions of ELBs.
I wonder if there's a market for global load balancing + advanced DNS tricks as a service, even if you're using Amazon entirely. Generally extending Amazon's IaaS with features Amazon should implement hasn't been a great strategy, as Amazon eventually implements an inferior but cheaper or easier (or at least default) solution for non-core parts.
Kind of surprised to see www.amazon.com goes to DC for me (from SF), though. It's possible it's all in US-East, across all the AZ's, and just has localized sites like co.uk in other places.
The load balancer itself can always be a single point of failure, no matter how many layers of packaging you have. By default, DNS doesn't support failover, because multiple A records just result in round robin. So there are two possible solutions: 1. Use DNS with health checking, so that records can be changed dynamically for failover. 2. Use Anycast for the load balancers, so that only healthy nodes announce the shared IP prefix. Both ways are imperfect and not widely used in production systems.
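Option 1 boils down to something like the sketch below. It is a toy model, not any real DNS server's API: `Endpoint`, `healthy_a_records` and the site names are invented, and the IPs are from the RFC 5737 documentation ranges.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Endpoint:
    ip: str
    site: str


def healthy_a_records(endpoints: List[Endpoint],
                      is_healthy: Callable[[Endpoint], bool]) -> List[str]:
    """Return the IPs a health-checking DNS server would publish.

    If every check fails, publish everything: an empty answer would turn
    a partial outage (or a broken health checker) into a total one.
    """
    alive = [e.ip for e in endpoints if is_healthy(e)]
    return alive if alive else [e.ip for e in endpoints]


if __name__ == "__main__":
    pool = [Endpoint("192.0.2.10", "site-a"),
            Endpoint("198.51.100.10", "site-b")]
    down = {"site-a"}  # pretend site-a failed its health check
    print(healthy_a_records(pool, lambda e: e.site not in down))
```

Part of why this is imperfect: resolvers and clients cache the old answer, so failover is at best as fast as the record's TTL.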
Normally you Anycast DNS only, and then use DNS load balancing (of IPs) to load balance physical load balancers (which are themselves sharing IPs on subnets and doing high availability stuff) in front of banks of servers.
You probably implement Geo, health, and load based optimization in the DNS layer and also in the physical load balancers. A single host out of ~30 dying at Site A doesn't affect how the DNS load balancing responds, but a critical number does.
Due to broken DNS implementations in the wild, you ideally keep your per-site load balancers up in as many situations as possible (even if load must be shed, you keep core network and load balancers up) and redirect traffic from there to other sites.
For larger sites, the big choice is running a single AS global network where IPs can move between datacenters vs. running each datacenter as a separate network. Advantages and disadvantages to each.
The big disadvantage of all of this is you're now spending a few million a month on salaries, infrastructure (various PoPs, network gear, etc.) before you've bought a single server or served a single page. Not so lean.
> The big disadvantage of all of this is you're now spending a few million a month on salaries, infrastructure (various PoPs, network gear, etc.) before you've bought a single server or served a single page. Not so lean.
So a business opportunity kicks in. That's how the cloud revolutionizes computing.
The problem I think is that "really hard parts of the front end of your app, as a service" is only really valuable once someone has also provided "really hard parts of the back end of your app (database distribution/replication across multiple sites, ideally with resilience against various failures, and your choice on CAP), as a service", too.
(on the front end, you should probably also be doing CDN+ (caching, WAN TCP acceleration, DoS prevention) too.)
[also, wow, I didn't realize you were the "famous" Zhou Tong.]
Database distribution is definitely a pain for RDBMS. A completely re-invented distributed database system like Cassandra is what's needed. But this simply means that a random blogger can't deploy WordPress on this kind of highly available hosting either.
In any case, AWS is probably closest to this kind of revolution, as they already have different types of managed database services (relational, object, in-memory and key-value). They also have Route 53, which is Anycast-based and CloudFront.
AWS does have a shot at building a system like that, but I'd rather it be made of open components from multiple vendors.
[I wasn't following the bitcoin stuff as it happened, and then later read about bitcoinica (we actually use the difficulties with hosting providers you faced as an example with our unlaunched product...). I didn't know you were on HN -- awesome. (you are probably one of the world's experts on problems with service provider internal security, now, although it was expensive education). And handling things by paying everyone out was a much better decision than most startups after a breach.]
[The first line of the comment was intended to keep this on-topic. I didn't really have any financial interest in Bitcoinica at the time of the hack. A major management handover took place three weeks before that, and 100% of the company was sold in 2011. The community assumed that I was still the owner, but that's simply not true. The hacked mail server (the "root cause") didn't belong to me either. But yes, I learned a lot by being both an insider and an outsider, and these are fortunately free lessons. I still apply some of the valuable experience from dealing with Bitcoinica's infrastructure in my new project, which doesn't deal with money. Startup infrastructure has a big market. Now I use KVM to build a small private cloud for my new project on top of dedicated server(s) simply because I love the flexibility of cloud deployments. If someone brings that to developers who are not sysadmins, it'll be cool.]
I think this area will actually prove to be one of the key differences between AWS and Google's new Compute Engine. I'm not an expert in this area, but the technical details in the Compute Engine talk from Google I/O (http://www.youtube.com/watch?v=ws2VRHq5ars) make it sound like they are doing something special so that global traffic travels on their backbone as much as possible and you can move an IP from server to server.
AWS officially recommends CNAME records for ELBs, but the IP addresses don't change regularly, and a CNAME at the root host name won't work if other records are present, so many sysadmins go straight to A records with the ELB IPs.
I have never understood the design decision to require zone apex to be an A Record. (I mean, I know that is what the RFC says, but I don't know what the RFC authors were thinking back in ancient times.)
The generator thing confused me a bit. It seems like the main issue wasn't that the generators failed, but that they took far longer to spin up than expected, so the automatic switchover failed to take, and they had to do a manual switchover at a later point (at which point the UPSes had already started failing). Or am I reading it wrong?
I wonder what the additional cost would be to leave a generator running 24/7 (at minimal load), so you never have an issue with spinning them up. Are they designed to run 24/7/365, or will they wear out too quickly?
Very high cost in both fuel (a big diesel probably uses 25% of peak fuel at idle) and maintenance (idling a lot destroys the engine), and pissing off the neighbors. You're generally not allowed to run backup generators 24x7 in most areas.
It's different if you're in the middle of nowhere, but even then, maintenance issues would be prohibitive.
(It would be so much fun to build a big datacenter next to a hydro dam; you'd get both pretty high availability hydro power and the HV feed going to civilization could be used in reverse to bring power if the dam goes down; pick a site where the power goes off in two directions to two grids and you've got 2 grid feeds plus local hydro plus anything like diesel onsite backup. Then you just have issues with getting diverse fiber in, but maybe along the HV routes. Northern Oregon/Southern WA is essentially this -- which is why Amazon, Google, Facebook, Microsoft, and more are putting huge facilities there.)
As an aside, you need to test your generators with actual load. AC is a little more complex than just "on or off", too. The "real" way to do this is to do periodic tests with a load bank (basically a huge resistive load which converts the electricity from the generators into heat, although if you were smart, you could get big pumps and pump a pond instead, which could be used for something useful later). Most people don't, they just spin it up, or sometimes they use facility loads for test, which is good in some ways, but if the test fails, the facility load drops, which makes people sad. Backup power can fail in all kinds of exciting ways other than just not delivering power, so it's possible you'll switch to generator, generator will fail, and you won't be able to switch back, or in some cases, generator will kill your UPS and drop your load. The bigger the system the easier it seems to be for really weird things to happen, but the more individual sets of equipment you have, the higher the chances that at least one will fail in some crazy way. Which is one of the reasons the Google UPS-in-Server thing is so exciting.
"a big diesel probably uses 25% of peak fuel at idle"
I have a very hard time believing this, especially for diesels, which idle much more efficiently than spark-ignition engines. Where did you get this figure?
What also surprises me is that they hadn't already allowed unstable voltages from the generators to be used. A modern switching power supply can operate down to, what, 80 volts?
EDIT: rdl, for some reason I can't reply to your reply. Replying here.
Thanks for the info. Now that I think back, I recall reading that diesels operate with an air/fuel ratio as high as 80:1 at idle. 1/4 would put them at 20:1 at full load, which I can believe, because diesels don't typically operate at stoichiometric even under full load, which would be ~15:1.
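Spelling that arithmetic out as a quick sanity check. It assumes the engine moves the same mass of air at idle and at full load, so fuel flow scales as 1 / (air-fuel ratio); that's a simplification (idle RPM is lower and there's no boost), and `fuel_fraction` is just a name made up here:

```python
def fuel_fraction(afr_idle: float, afr_full_load: float) -> float:
    """Idle fuel flow as a fraction of full-load fuel flow.

    Assumes equal air mass flow in both cases, so fuel flow is
    proportional to 1 / (air-fuel ratio).
    """
    return afr_full_load / afr_idle


if __name__ == "__main__":
    print(fuel_fraction(afr_idle=80, afr_full_load=20))  # 0.25
```

So 80:1 at idle against 20:1 at full load is consistent with the 25% figure, at least under that assumption.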
2MW Cummins use about 130 gallons/hour at max load.
At 1/4, about 34 gallons/hour.
You can't idle them at low loads for long periods without damaging them. That's basically 1/4. You usually want it to be more like 50-75%. So "no real load" is artificially set higher than engine idle to keep the system functional, using a load bank. (what you actually do is have N generators and just use some of them, or have different sizes and use the right ones to sort of match the load)
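For a feel of what running one of these around the clock would burn, here's the same pair of figures turned into daily and yearly totals. The 130 and 34 gallons/hour numbers are the ones quoted above; everything else is plain multiplication, and no fuel price is assumed:

```python
MAX_LOAD_GPH = 130      # gallons/hour at max load (figure quoted above)
QUARTER_LOAD_GPH = 34   # gallons/hour at 1/4 load (figure quoted above)


def gallons_per_day(gallons_per_hour: float) -> float:
    """Fuel burned in 24 hours of continuous running."""
    return gallons_per_hour * 24


if __name__ == "__main__":
    print(f"1/4-load burn as share of max: {QUARTER_LOAD_GPH / MAX_LOAD_GPH:.0%}")
    print(f"per day at 1/4 load:  {gallons_per_day(QUARTER_LOAD_GPH):,.0f} gal")
    print(f"per year at 1/4 load: {gallons_per_day(QUARTER_LOAD_GPH) * 365:,.0f} gal")
```

That's per generator, before maintenance, and a site will have several.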
The output from the generators is going into the UPS. The UPS input is probably pickier than any switchmode power supply. Especially because most big UPSes let you set in software how picky they are about input power before they switch to battery, and they apparently had it set to conservative mode.
Here's something I haven't seen mentioned in the discussion yet: while it's impractical to idle the generators constantly, why not turn them on ahead of time when it appears there's a good chance of troubles with the electrical supply?
The Friday storm had at least a decent hour of warning. I got an alert on my phone from the Weather Underground app, and the radar made it apparent that this was going to be a monster well before it actually hit. Their summary of the events shows that Amazon had decent warning and called in extra staff and took other precautions.
So, given that they had advance warning of a potentially serious storm, why not start up the generators and have them idling beforehand? Is the cost of that still prohibitive even when it looks like there's a good chance they'll be needed?
At least they specified the timezone, and at least they got the timezone correct (PDT, and not PST).
Not only do a lot of communications not specify a timezone at all, but some communications specify a timezone incorrectly. I recently submitted a ticket with Apple dev support, and the automated reply specifies their hours of operation as 7am to 5pm PST. They're probably clueless, and they mean PDT, but if someone takes them at their word, those hours of operation are off by an hour.
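The off-by-an-hour effect is easy to show with fixed UTC offsets. This is only a sketch: the date is an arbitrary summer day, and `as_pdt` is a helper name invented here.

```python
from datetime import datetime, timedelta, timezone

PST = timezone(timedelta(hours=-8), "PST")  # Pacific Standard Time, UTC-8
PDT = timezone(timedelta(hours=-7), "PDT")  # Pacific Daylight Time, UTC-7


def as_pdt(hour: int, stated_tz: timezone) -> str:
    """Wall-clock time in PDT for an hour stated in `stated_tz`."""
    # The date is arbitrary; only the offset between the zones matters.
    stated = datetime(2012, 7, 1, hour, 0, tzinfo=stated_tz)
    return stated.astimezone(PDT).strftime("%H:%M %Z")


if __name__ == "__main__":
    print(as_pdt(7, PST))  # opening time if "7am PST" is taken literally
    print(as_pdt(7, PDT))  # opening time if they really meant local time
```

Taken literally, "7am PST" in summer means support opens at 8am on the clock everyone on the West Coast is actually reading.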
If the outage happened in two time zones, which time zone should they use to report? The priority is to make the write-up clear, which means picking a single time zone and being consistent. I guess you could argue it should be UTC, but practically it makes no difference.
So far the comments have focused on the technical aspects outlined as points of failure in Amazon's summary: grid failure, diesel generator failure, and the complexities of the amazon stack.
What are your thoughts on Amazon's professionalism in their response and action plan going forward? If you're an AWS customer does this style of response keep you on board?
The level of clarity in post incident reporting by Amazon is excellent. During-incident, sub-par. Amazon seems to try to minimize any "more than just a single AZ is affected" in their realtime reporting during outages. There's also a disconnect between the graphics and the text.
What I don't like is that they make repeated promises about AZes and features which are repeatedly shown to be untrue. They also have never disclosed their testing methodology, which leads me to assume there isn't much of one. That makes me unlikely to rely on any AWS services I can't myself fully exercise, or which haven't shown themselves to be robust. S3, sure. EC2, sure (except don't depend on modifying instances on any given day during an outage). EBS, oh hell no. ELB, probably not my first choice, and certainly not my sole ft/ha layer. Route 53, which I haven't messed with, actually does seem really good, but since it's complex, I'm scared given the track record of the other complex components.
"What I don't like is that they make repeated promises about AZes and features which are repeatedly shown to be untrue."
That's pretty much SOP in the hosting business; nobody really knows what's going to happen because nobody really knows how to test. Most developers can't or don't understand what ops failures are like, and therefore most testing is only superficial.
Right, but I think we've established Amazon is actually a software company which happens to run a retail business and contract for some datacenters. Same way Google is either a supercomputer company or an advertising company which uses search to get users cheaply.
Well, if you look at their status page ( http://status.aws.amazon.com/ ), none of the statuses shows red. Yellow clearly says it's for performance issues, and red is for service disruptions. If this isn't a service disruption then I don't know what is.
Reading between the lines from both the posted note and their persistent failure to provide correct statuses: the ops guys over there are in full CYA desperation mode somewhere around 100% of the time, and a culture of 'it wasn't me', fuelled either by job fear, or promotion fear, or fear of being noticed by Bezos, is in full bloom.