> Even then, this is only good until AWS's first multi-region failure; this ...

justinsb · on Oct 24, 2012

AZs were supposed to be that unit of isolation; then, when multiple AZs failed, the unit of isolation shifted to Regions. It seemed like a "blame the victim" mentality to me.

Given that AWS are running the same software across regions and have the same people & processes in place, and further that there's software that runs across regions (e.g. S3), I'd wager it's not long before we have a multi-region outage.

Finally, some of the multi-AZ problems in the past were compounded because, as one AZ went down, everyone hammered the other AZs, taking out the APIs at least. That was back when everyone believed that AZs were isolated. Now that people know that's not the case, those same systems are going to be hammering across multiple regions.

joeyi · on Oct 24, 2012

Perhaps you misread/misinterpreted the level of isolation that AZs provided.

AZs are physically separate data centers. They are protected from fires, flooding, physical disasters. BUT they do share some common components which allow you to do things like shift EIPs between AZs, snapshots, security groups, etc. (Source: http://aws.amazon.com/ec2/faqs/#How_isolated_are_Availabilit...)

Regions, on the other hand, are completely separate installations of every component of the AWS stack. You can verify this because no resources can be shared between Regions (snapshots, groups, EIPs, etc.). You can also verify this during an outage. E.g., when the US-EAST-1 API becomes unresponsive (due to throttling), the US-WEST-1/2 APIs are still available.

on Oct 24, 2012

[deleted]

justinsb · on Oct 24, 2012

That FAQ you yourself pointed to doesn't mention that there are common components.

We now know that when Amazon said - in that very FAQ - "even extremely uncommon disasters such as fires, tornados or flooding would only affect a single Availability Zone" they were very carefully not lying, but implying something that simply isn't true. We didn't know 18 months ago that multiple AZs would fail simultaneously (unless there was e.g. a huge earthquake). I agree that we know that now.

You believe we won't wake up at 3AM one morning to learn of an unanticipated way that multiple regions will fail at the same time. I don't share your faith.

Edit: This is in reply to joeyi's comment above. It got double-posted, and I replied to one of the copies at the same time as joeyi deleted it!

joeyi · on Oct 24, 2012

I share your paranoia in general (as ops), but can assure you that regions are very isolated from one another. I know that releases are rolled out on a very long schedule (think quarter-long release), and that is to prevent what you describe.

I would argue that the application (i.e., the application being hosted on AWS) is probably going to fail before multiple regions do simultaneously, and that should be addressed before thinking about going multi-provider.

justinsb · on Oct 24, 2012

Do you work for AWS as well? If so, I'd ask that team AWS spend less time astroturfing on HN, and more time documenting your systems, so we can assess these risks for ourselves.

For example, I haven't heard of any precautions taken against a thundering herd of clients retrying requests in other regions if us-east goes down. What does AWS have there? How much spare capacity do you run in each region?
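To make the thundering-herd concern concrete, here is a minimal sketch of the client-side half of the problem: capped exponential backoff with random jitter, failing over to another region only once the retry budget for the current one is spent. The names `backoff_delays` and `call_with_failover`, the default values, and the region list are all hypothetical, not anything AWS ships or documents, and this says nothing about how much spare capacity each region actually has.

```python
import random
import time


def backoff_delays(max_attempts, base=0.5, cap=30.0):
    """Yield one randomized delay (in seconds) per retry attempt."""
    for attempt in range(max_attempts):
        # Exponential ceiling, capped so a long outage doesn't produce huge waits.
        ceiling = min(cap, base * (2 ** attempt))
        # Pick uniformly in [0, ceiling] so clients don't all retry in lockstep.
        yield random.uniform(0, ceiling)


def call_with_failover(call, regions, max_attempts=5, sleep=time.sleep):
    """Try call(region) with backoff; move to the next region only after
    max_attempts failures in the current one."""
    last_error = None
    for region in regions:
        for delay in backoff_delays(max_attempts):
            try:
                return call(region)
            except Exception as exc:  # a real client would catch only retryable errors
                last_error = exc
                sleep(delay)
    if last_error is None:
        raise ValueError("no regions given")
    raise last_error
```

Something like `call_with_failover(describe_instances, ["us-east-1", "us-west-2"])` would then spread retries out over time instead of immediately stampeding the second region, where `describe_instances` stands in for whatever API call the client makes. It only helps if most clients behave this way, which is exactly what can't be assumed.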

res0nat0r · on Oct 24, 2012

Regions are 100% independent of one another, both physically and at the control-plane level. Also, code pushes to regions for new features don't ever happen on the same day.

justinsb · on Oct 24, 2012

Source?

AZs were supposed to be independent; they aren't. Fool me once...

res0nat0r · on Oct 24, 2012

I used to work on the EC2 team. The regions are wholly independent of one another.

justinsb · on Oct 24, 2012

I hope you blog more of these practices then. AWS doesn't put this stuff in writing, which is very convenient for them when something goes wrong, but makes it nigh on impossible to build a reliable system on EC2.

I don't think it's an easy problem to solve, but to suggest that the regions won't go down together strikes me as "the Titanic is unsinkable" hubris. I hope the AWS team doesn't share your attitude :-)