
> Even then, this is only good until AWS's first multi-region failure; this doesn't seem to be an impossible event given EC2's recent track record.

Doesn't everything in their track record indicate that regions are nicely partitioned from each other? Even the biggest region failures they've had have stayed completely isolated to that region.




AZs were supposed to be that unit of isolation; then, when multiple AZs failed, the story shifted to regions being the unit. It seemed like a "blame the victim" mentality to me.

Given that AWS are running the same software across regions and have the same people & processes in place, and further that there's software that runs across regions (e.g. S3), I'd wager it's not long before we have a multi-region outage.

Finally, some of the multi-AZ problems in the past were compounded because as one AZ went down everyone hammered the other AZs, taking out the APIs at least. That's when everyone believed that AZs were isolated. Now that people know that's not the case, those same systems are going to be hammering across multiple regions.


Perhaps you misread/misinterpreted the level of isolation that AZs provided.

AZs are physically separate data centers. They are protected from fires, flooding, physical disasters. BUT they do share some common components which allow you to do things like shift EIPs between AZs, snapshots, security groups, etc. (Source: http://aws.amazon.com/ec2/faqs/#How_isolated_are_Availabilit...)

Regions, on the other hand, are completely separate installations of every component of the AWS stack. You can verify this because no resources can be shared between regions (snapshots, groups, EIPs, etc.). You can also verify this during an outage. For example, when the US-EAST-1 API becomes unresponsive (due to throttling), the US-WEST-1 and US-WEST-2 APIs are still available.


[deleted]


That FAQ you yourself pointed to doesn't mention that there are common components.

We now know that when Amazon said - in that very FAQ - "even extremely uncommon disasters such as fires, tornados or flooding would only affect a single Availability Zone" they were very carefully not lying, but implying something that simply isn't true. We didn't know 18 months ago that multiple AZs would fail simultaneously (unless there was e.g. a huge earthquake). I agree that we know that now.

You believe we won't wake up at 3AM one morning to learn of an unanticipated way that multiple regions will fail at the same time. I don't share your faith.

Edit: This is in reply to joeyi's comment above. It got double-posted, and I replied to one of the copies at the same time as joeyi deleted it!


I share your paranoia in general (as ops), but can assure you that regions are very isolated from one another. I know that releases are rolled out on a very long schedule (think quarter-long release), and that is to prevent what you describe.

I would argue that the application (i.e., the application being hosted on AWS) is probably going to fail before multiple regions do simultaneously, and that should be addressed before thinking about going multi-provider.


Do you work for AWS as well? If so, I'd ask that team AWS spend less time astro-turfing on HN, and more time documenting your systems, so we can assess these risks for ourselves.

For example, I haven't heard of any precautions taken against a thundering herd of clients retrying requests in other regions if us-east goes down. What does AWS have there? How much spare capacity do you run in each region?
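For what it's worth, the client-side half of the thundering-herd problem is commonly handled with capped exponential backoff plus random jitter, so that clients which failed at the same moment don't all retry at the same moment. A minimal sketch of that idea, not anything AWS has documented: the names `backoff_delays` and `call_with_failover`, the default timings, and the `request` callable are all hypothetical, chosen only for illustration.

```python
import random
import time


def backoff_delays(base=0.5, cap=30.0, attempts=6, rng=random):
    """Yield one randomized delay (in seconds) per attempt.

    base, cap and attempts are illustrative defaults, not recommended values.
    """
    for attempt in range(attempts):
        # The ceiling doubles on each attempt but never exceeds the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Jitter: pick uniformly in [0, ceiling] so that simultaneous
        # failures do not turn into simultaneous retries.
        yield rng.uniform(0.0, ceiling)


def call_with_failover(request, regions, attempts=6, sleep=time.sleep, rng=random):
    """Call request(region), rotating through regions between attempts.

    `request` is a hypothetical callable that takes a region name and
    raises an exception on failure.
    """
    last_error = None
    for attempt, delay in enumerate(backoff_delays(attempts=attempts, rng=rng)):
        region = regions[attempt % len(regions)]
        try:
            return request(region)
        except Exception as err:
            last_error = err
            # Wait before the next attempt; no point sleeping after the last one.
            if attempt < attempts - 1:
                sleep(delay)
    raise RuntimeError("all attempts failed") from last_error
```

Something like `call_with_failover(my_request, ["us-east-1", "us-west-2"])` would spread retries out in time. It does nothing about how much spare capacity the other region has, which is the part only AWS can answer.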


Regions are 100% independent of one another, both physically and at the control-plane level. Also, code pushes to regions for new features never happen on the same day.


Source?

AZs were supposed to be independent; they aren't. Fool me once...


I used to work on the EC2 team. The regions are wholly independent of one another.


I hope you blog more of these practices then. AWS doesn't put this stuff in writing, which is very convenient for them when something goes wrong, but makes it nigh on impossible to build a reliable system on EC2.

I don't think it's an easy problem to solve, but to suggest that the regions won't go down together strikes me as "the Titanic is unsinkable" hubris. I hope the AWS team doesn't share your attitude :-)



