I can't help but get the feeling that, if it weren't for the fancy "elastic" API stuff, these outages would remain nothing more than just minor glitches. At this point, I don't see how you could possibly justify running on AWS. Far better to just fire up a few dedicated servers at a dumb host and let them do their job.
Having a common provisioning API in a bunch of physically diverse centers IS a huge advantage for availability, cost, etc, though. If you have 100 hours to set up a system, and $10k/mo, you have two real choices:
1) Dedicated/conventional servers: Burn 10 hours on sales and contract negotiation, vendor selection, etc., plus ~10-20h on anything specific to the vendor, and then set up systems. Get a bunch of dedicated boxes (colo'ing your own gear may be better at scale, but it's 10-20x the time, plus upfront cost...), and set up single-site HA (hardware or software load balancer, A+B power, dual line cord servers for at least databases, etc.).
2) Set up AWS. Since the lowest tiers are free, you may well have a lot of experience with it already. An hour to buy the product, plus extra time to script against the APIs so you can be dynamic (e.g. the sketch below). You could probably be resilient against single-AZ failure in 100h with $10k/mo, although doing multi-region (due to DB issues and minimum spend per region) might be borderline.
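For concreteness, the sort of scripting I mean -- a minimal sketch using the boto3 Python SDK, with placeholder AMI and instance values, not a recipe:

    import boto3

    AMI_ID = "ami-12345678"  # placeholder AMI
    AZS = ["us-east-1a", "us-east-1b", "us-east-1c"]

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Spread identical app servers across AZs so a single-AZ
    # failure leaves the others still serving.
    for az in AZS:
        ec2.run_instances(
            ImageId=AMI_ID,
            InstanceType="m1.small",
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": az},
        )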
In case #1, you're protected against a bunch of problems, but not against a plane hitting the building, someone hitting EPO by accident, etc. In case #2, you should be resilient against any single physical facility problem, but are exposed to novel software risk.
The best solution would be #3 -- some consistent API to order more conventional infrastructure in AWS-like timeframes. Arguably OpenStack or other systems could offer this (using real SANs instead of EBS, real hardware load balancers in place of ELB, ...), and you could presumably do some kind of dedicated-host provisioning using the same kind of APIs you use for VM provisioning (big hosting companies have done this with PXE for years; someone like Softlayer can provision a bare-metal system in ~30 minutes). Use virtualization when it makes sense, and bare metal at other times (the big Amazon Compute instances are pretty close) -- although the virtualization layer doesn't seem to be the real weakness, but rather all the other services: ELB, EBS, RDS, etc.
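To make #3 concrete, here's a hedged sketch of what a provider-neutral provisioning interface might look like; every name in it is hypothetical:

    from abc import ABC, abstractmethod

    class Provisioner(ABC):
        """Hypothetical provider-neutral provisioning interface:
        the same call covers VMs and bare metal."""

        @abstractmethod
        def provision(self, bare_metal: bool, cpus: int, ram_gb: int) -> str:
            """Bring up a machine; return its host ID when reachable."""

    class PXEProvisioner(Provisioner):
        """Softlayer-style flow: pick a matching chassis from
        inventory, PXE-boot an installer image, hand the box
        back in ~30 minutes."""

        def provision(self, bare_metal, cpus, ram_gb):
            raise NotImplementedError  # sketch only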
Basically, IaaS from a provider who recognizes that software -- especially big, complex, interconnected systems -- is really hard, and who is willing to sacrifice some technical impressiveness and peak efficiency for reliability and easily characterized failure modes.
Or just use a provider that you're already familiar with, at whatever list price they give you. That's fair, since option #2 assumes we already know Amazon and won't stress about paying ~3x what dedicated servers cost.
> set up single-site HA (hardware or software load balancer, A+B power, dual line cord servers for at least databases, etc.)
Many high-end providers will simply do this for you, and have setups racked and ready for you to use.
It's not really that hard, so long as you're not going past, say, 15 boxes. For anything under that, I can pretty much guarantee you'll find a high-end provider that will have you set up in 4-6 hours maximum.
It can't just be an amazing coincidence that they all failed on the same day. The fact that they were all recently certified and tested means that that process does no more to ensure they'll actually come online than the equivalent process did at the Fukushima nuclear plant.
They don't give the manufacturer or model, and they say that they are going to have them recertified and continue to use them. So that means they are not going to fix the problem, because they don't know why they failed.
You cannot fix the problem if you do not know what caused it.
That manufacturer is probably flying every single C-level exec out to the US-East data center, over the July 4th holiday, to personally disassemble the generator, polish each screw, and carefully put it all back together while singing an a cappella version of "Bohemian Rhapsody", including vocal percussion.
And if they do it to Amazon's satisfaction, Amazon has hinted that they might decide not to out them to the rest of the world. That's called leverage.
> In the single datacenter that did not successfully transfer to the generator backup, all servers continued to operate normally on Uninterruptable Power Supply (“UPS”) power. As onsite personnel worked to stabilize the primary and backup power generators, the UPS systems were depleting and servers began losing power at 8:04pm PDT. Ten minutes later, the backup generator power was stabilized, the UPSs were restarted, and power started to be restored by 8:14pm PDT. At 8:24pm PDT, the full facility had power to all racks.
It sounds like they were too optimistic about their generator startup times.
I'm a fan of decentralized power generation, and it would seem like large consumers would have the most to gain.
Is this a regulation issue? I imagine Amazon becoming a provider of electricity (even if just to itself) could become a political mess.
They did, and it's called "backup generators". :-)
More seriously, power plants go offline on a regular basis -- typical availability factors are around 90% due to the need for regular maintenance. You need to have a power grid in order to have any reasonable availability.
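The arithmetic behind needing a grid (assuming independent failures, which is generous, and 90% per-plant availability):

    # Chance that at least one of n independent 90%-available
    # plants is up at any moment: 1 - 0.1**n.
    for n in range(1, 5):
        print(n, 1 - 0.1 ** n)
    # prints roughly: 1 0.9 / 2 0.99 / 3 0.999 / 4 0.9999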
Taking down the entire grid causes many problems, and it basically doesn't happen (though it did in the Northeast in 2003: http://en.wikipedia.org/wiki/Northeast_blackout_of_2003).
The problem is diesel generators basically suck, especially when left powered off. In the long run, I predict fuel cells will take over in the standby power market.
Some sites use microturbines or full turbines, and some of the local backup generators are LP or LNG rather than diesel.
Neither diesel nor the gasoline blends are particularly stable during storage, unfortunately.
Capstone claims a 10-second stabilization and transfer time for their microturbines, and has microturbines from tens of kilowatts up to a megawatt package.
Con-Ed was (is?) running 155 MW and 174 MW turbines mounted on power barges, and they've had turbines around at least as far back as the 1920s.
For some historical reading on turbine power generation:
And yes, some of the fuel cell co-generation deployments look quite promising.
That could be dependent on zoning, though; I've heard this from a few engineers designing solar systems for residential.
I don't personally know of any datacenters using turbines for backup power. I don't know enough about turbines to know why that is -- maybe they're not available in smaller sizes, or have higher maintenance requirements, or higher capex per MW, or something else. The only turbines I've seen are cogen/prime power with grid backup, or marine.
They're just expensive, and then you also need a (safe) way to store gas on site. They also have narrow efficiency bands and therefore aren't always optimal for certain applications. Larger turbines also need a diesel generator to start them ("black start").
There's no real current alternative to diesel generators for true standby power, but they still suck.
I suspect that Area 51 and the like do something similar. Google has sited one DC near a source of hydro power, though I'm not sure whether they were running their own link to it or not.
Based on how they behave in outages, I've always been curious (read: suspicious) about whether ELBs were redundant across AZs or hosted in a single AZ regardless of the AZs your instances are in.
It's good to hear that they are actually redundant and to understand how they're added/removed from circulation in the event of problems.
You'd need to integrate it with each running VM, possibly just by sending it the equivalent of a shutdown command from the console, so that it understood the disk was going away in X seconds and could shut down any databases immediately, flush all caches, unmount filesystems, and shut itself down.
It wouldn't be massively difficult, but it's not as simple as just shutting down EBS.
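Something like this hypothetical in-guest handler -- the notification mechanism and service names are made up; the ordering is the point:

    import subprocess

    def on_power_notice(seconds_left: int) -> None:
        """Hypothetical hook, called when the console warns that
        the EBS volume goes away in `seconds_left` seconds."""
        # 1. Stop anything holding dirty state of its own.
        subprocess.run(["service", "postgresql", "stop"], check=True)
        # 2. Flush filesystem buffers to disk.
        subprocess.run(["sync"], check=True)
        # 3. Unmount the EBS-backed filesystem cleanly.
        subprocess.run(["umount", "/mnt/data"], check=True)
        # 4. Power off so nothing is mid-write when the disk vanishes.
        subprocess.run(["poweroff"], check=True)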
On the other hand, an fsck to recover the filesystem probably causes even more traffic.
My guess is that this functionality is missing for historical reasons -- the original S3-backed instances didn't have any concept of powering off an instance.
(Generators and UPS -- especially since all facilities have them anyway -- might be cheaper than getting every user to be good at handling failure gracefully. That's fine for an in-house datacenter doing batch processing, or one with great developers, like Google, but probably not for a generalized IaaS.)
Is this referring to Netflix?
That certainly sounds like it, doesn't it?
IIRC, Amazon.com doesn't use ELB, they use some proper load balancers (with DNS above) in front of several Regions of ELBs.
I wonder if there's a market for global load balancing + advanced DNS tricks as a service, even if you're using Amazon entirely. Generally extending Amazon's IaaS with features Amazon should implement hasn't been a great strategy, as Amazon eventually implements an inferior but cheaper or easier (or at least default) solution for non-core parts.
Kind of surprised to see www.amazon.com go to DC for me (from SF), though. It's possible it's all in US-East, across all the AZs, and just has localized sites like co.uk in other places.
You probably implement geo, health, and load-based optimization in the DNS layer and also in the physical load balancers. A single host out of ~30 dying at Site A doesn't affect how the DNS load balancing responds, but a critical number does.
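A toy version of that "critical number" logic -- the 50% threshold and site sizes are invented:

    SITES = {"site-a": 30, "site-b": 30}  # hosts per site (made up)

    def advertised_sites(down: dict) -> list:
        """Advertise a site in DNS until a critical fraction dies."""
        CRITICAL = 0.5
        return [s for s, total in SITES.items()
                if down.get(s, 0) / total < CRITICAL]

    # One dead host out of 30 changes nothing:
    print(advertised_sites({"site-a": 1}))   # ['site-a', 'site-b']
    # Sixteen dead hosts pulls the site from DNS:
    print(advertised_sites({"site-a": 16}))  # ['site-b']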
Due to broken DNS implementations in the wild, you ideally keep your per-site load balancers up in as many situations as possible (even if load must be shed, you keep core network and load balancers up) and redirect traffic from there to other sites.
For larger sites, the big choice is running a single AS global network where IPs can move between datacenters vs. running each datacenter as a separate network. Advantages and disadvantages to each.
The big disadvantage of all of this is you're now spending a few million a month on salaries, infrastructure (various PoPs, network gear, etc.) before you've bought a single server or served a single page. Not so lean.
So a business opportunity kicks in. That's how the cloud revolutionizes computing.
(on the front end, you should probably also be doing CDN+ (caching, WAN TCP acceleration, DoS prevention) too.)
[also, wow, I didn't realize you were the "famous" Zhou Tong.]
In any case, AWS is probably closest to this kind of revolution, as they already have different types of managed database services (relational, object, in-memory and key-value). They also have Route 53, which is Anycast-based and CloudFront.
[What's right/wrong/surprising with my identity?]
[I wasn't following the bitcoin stuff as it happened, and later read about Bitcoinica (we actually use the difficulties you faced with hosting providers as an example for our unlaunched product...). I didn't know you were on HN -- awesome. (You are probably one of the world's experts on problems with service-provider internal security now, although it was an expensive education.) And handling things by paying everyone out was a much better decision than most startups make after a breach.]
[The first line of the comment was intended to keep this on-topic. I didn't really have any financial interest in Bitcoinica at the time of the hack. A major management handover took place three weeks before that, and 100% of the company was sold in 2011. The community assumed that I was still the owner, but that's simply not true. The hacked mail server (the "root cause") didn't belong to me either. But yes, I learned a lot by being both an insider and an outsider, and those were fortunately free lessons. I still apply some of the valuable experience from dealing with Bitcoinica's infrastructure in my new project, which doesn't deal with money. Startup infrastructure is a big market. Now I use KVM to build a small private cloud for my new project on top of dedicated server(s), simply because I love the flexibility of cloud deployments. If someone brings that to developers who are not sysadmins, it'll be cool.]
Check where amazon.co.jp goes ;)
Considering most active domains have MX and other records at the zone apex (like example.com.), CNAMEs won't work for those hostnames -- a CNAME can't coexist with the other records that have to live at the apex.
If there's a native implementation of an "ALIAS" record, which simply answers with the corresponding A records of its target, then it will work anywhere.
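What such ALIAS flattening does can be sketched with dnspython (the hostnames are placeholders):

    import dns.resolver  # dnspython

    def alias_answer(target: str) -> list:
        """Return the target's current A records, to be served as
        the apex's own A records -- what "ALIAS" flattening does,
        leaving the apex's SOA/NS/MX records intact."""
        return [r.address for r in dns.resolver.resolve(target, "A")]

    # e.g. serve these IPs for example.com.'s A record
    # (placeholder name; substitute your load balancer's hostname):
    print(alias_answer("lb.example.net."))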
This is only true if you have a stable level of traffic, IIRC. If they have to scale the ELB up or down the IPs will change.
What I don't like is that they make repeated promises about AZs and features which are repeatedly shown to be untrue. They also have never disclosed their testing methodology, which leads me to assume there isn't much of one. That makes me unlikely to rely on any AWS services I can't fully exercise myself, or which haven't shown themselves to be robust. S3, sure. EC2, sure (except don't depend on modifying instances during an outage). EBS, oh hell no. ELB, probably not my first choice, and certainly not my sole fault-tolerance/HA layer. Route 53, which I haven't messed with, actually does seem really good, but since it's complex, I'm scared given the track record of the other complex components.
That's pretty much SOP in the hosting business; nobody really knows what's going to happen, because nobody really knows how to test. Most developers can't or don't understand what ops failures are like, and therefore most testing is only superficial.
Edit: Not that I agree or disagree with that SLA. Just noting it is documented somewhere besides a glyph in a legend.
I wonder what the additional cost would be to leave a generator running 24/7 (at minimal load), so you never have an issue with spinning them up. Are they designed to run 24/7/365, or will they wear out too quickly?
It's different if you're in the middle of nowhere, but even then, maintenance issues would be prohibitive.
(It would be so much fun to build a big datacenter next to a hydro dam; you'd get both pretty high availability hydro power and the HV feed going to civilization could be used in reverse to bring power if the dam goes down; pick a site where the power goes off in two directions to two grids and you've got 2 grid feeds plus local hydro plus anything like diesel onsite backup. Then you just have issues with getting diverse fiber in, but maybe along the HV routes. Northern Oregon/Southern WA is essentially this -- which is why Amazon, Google, Facebook, Microsoft, and more are putting huge facilities there.)
As an aside, you need to test your generators with actual load; AC is a little more complex than just "on or off", too. The "real" way to do this is periodic tests with a load bank (basically a huge resistive load which converts the electricity from the generators into heat -- although if you were smart, you could instead get big pumps and pump water into a pond, which could be used for something useful later). Most people don't; they just spin the generators up, or sometimes they test against facility loads, which is good in some ways -- but if the test fails, the facility load drops, which makes people sad.

Backup power can fail in all kinds of exciting ways other than just not delivering power, so it's possible you'll switch to generator, the generator will fail, and you won't be able to switch back -- or in some cases the generator will kill your UPS and drop your load. The bigger the system, the easier it seems to be for really weird things to happen, but the more individual sets of equipment you have, the higher the chances that at least one will fail in some crazy way. Which is one of the reasons the Google UPS-in-server thing is so exciting.
The Friday storm came with at least an hour of decent warning. I got an alert on my phone from the Weather Underground app, and the radar made it apparent that this was going to be a monster well before it actually hit. Their summary of the events shows that Amazon had decent warning, called in extra staff, and took other precautions.
So, given that they had advance warning of a potentially serious storm, why not start up the generators and have them idling beforehand? Is the cost of that still prohibitive even when it looks like there's a good chance they'll be needed?
I have a very hard time believing this, especially for diesels, which idle much more efficiently than spark-ignition engines. Where did you get this figure?
What I'm also surprised by is that they hadn't already allowed unstable voltages from the generators to be used. A modern switching power supply can operate down to, what, 80 volts?
EDIT: rdl, for some reason I can't reply to your reply. Replying here.
Thanks for the info. Now that I think back, I recall reading that diesels operate with an air/fuel ratio as high as 80:1 at idle. If idle burns about 1/4 of full-load fuel, that would put them around 20:1 at full load, which I can believe, because diesels don't typically operate at stoichiometric (~15:1) even under full load.
At 1/4 load, about 34 gallons/hour.
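Back-of-the-envelope for the "just leave it running" question upthread, assuming ~$4/gallon diesel and counting fuel only (no wear or maintenance):

    GAL_PER_HOUR = 34       # at ~1/4 load, from above
    USD_PER_GAL = 4.00      # assumed diesel price

    hourly = GAL_PER_HOUR * USD_PER_GAL
    print(hourly, hourly * 24, hourly * 24 * 365)
    # $136/hour, ~$3.3k/day, ~$1.19M/year -- per generator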
You can't run them at low loads for long periods without damaging them; the practical floor is basically 1/4 load. You usually want it to be more like 50-75%. So "no real load" is artificially set higher than engine idle to keep the system functional, using a load bank. (What you actually do is have N generators and run just some of them, or have different sizes and use the right ones to roughly match the load.)
The output from the generators is going into the UPS. The UPS input is probably pickier than any switchmode power supply. Especially because most big UPSes let you set in software how picky they are about input power before they switch to battery, and they apparently had it set to conservative mode.
Not only do a lot of communications not specify a timezone at all, but some communications specify a timezone incorrectly. I recently submitted a ticket with Apple dev support, and the automated reply specifies their hours of operation as 7am to 5pm PST. They're probably clueless, and they mean PDT, but if someone takes them at their word, those hours of operation are off by an hour.
Everybody knows, or should know, their time zone's delta to UTC. I'm sure there are lots of people who don't know their delta to PDT.
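The PST-vs-PDT slip is easy to demonstrate (Python's zoneinfo; the July date is arbitrary):

    from datetime import datetime
    from zoneinfo import ZoneInfo

    july = datetime(2012, 7, 4, 12, 0, tzinfo=ZoneInfo("America/Los_Angeles"))
    print(july.utcoffset())  # -1 day, 17:00:00, i.e. UTC-7 (PDT)
    # "PST" is UTC-8 year-round, so "7am-5pm PST" taken literally
    # in July means 8am-6pm local time -- off by an hour.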