
Indeed. We're paying (and designing our systems to work across multiple AZs) to reduce the risk of outages, but then their back-end services rely on services in a single region?



(Disclaimer: I work for AWS but opinions are my own. I also do not work with the Kinesis team.)

Nearly all AWS services are regional in scope, and many (if not most) are scaled at a cellular level within a region. Accounts are assigned to specific cells within that region.

There are very, very few services that are global in scope, and it is strongly discouraged to create cross-regional dependencies -- not just as applied to our customers, but to ourselves as well. IAM and Route 53 are notable exceptions, but they offer read replicas in every region and are eventually consistent: if the primary region has a failure, you might not be able to make changes to your configuration, but the other regions will operate on read-only replicas.

This incident was regional in scope: us-east-1 was the only impacted region. As far as I know, no other region was impacted by this event. So customers operating in other regions were largely unaffected. (If you know otherwise, please correct me.)

As a Solutions Architect, I regularly warn customers that running in multiple Availability Zones is not enough. Availability Zones protect you from many kinds of physical infrastructure failures, but not necessarily from regional service failures. So it is super important to run in multiple regions as well: not necessarily active-active, but at least in a standby mode (i.e. "pilot light") so that customers can shed traffic from the failing region and continue to run their workloads.
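As a rough illustration of the "pilot light" idea, the failover step often amounts to scaling up pre-created but dormant capacity in the standby region. A minimal boto3 sketch, where the region, sizes, and Auto Scaling group name are hypothetical placeholders:

    import boto3

    # Hypothetical pilot-light failover step: the standby region already has
    # AMIs, data replicas, and a zero-sized Auto Scaling group. On failover,
    # scale the group up so it can take production traffic.
    autoscaling = boto3.client("autoscaling", region_name="us-west-2")

    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName="app-asg-standby",  # placeholder name
        MinSize=4,
        DesiredCapacity=4,
        MaxSize=12,
    )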


This outage highlighted our dependency on Cognito. Everything else we are doing can (and probably should) be replicated to another region, which would resolve these types of issues.

However, Cognito is very region-specific and there is currently no way to run it active-active or even in standby mode. The problem is user accounts: you can't sync them to another region and you can't back up/restore them (with passwords). Until AWS comes up with some way to run Cognito in a cross-region fashion, we are pretty much stuck in a single region and vulnerable to this type of outage in the future.


Please bring this to the attention of your account team! They will bring your feedback to the service team. While I can’t speak for the Cognito team, I can assure you they care deeply about customer satisfaction.


That's a great idea. I'm writing an email right now! :)


What do you mean by cross-regional dependencies? Isn't running in a multi-region setup itself adding a dependency?

Speaking of multi-region services, what do you think about Google now offering all three major building blocks as multi-regional?

They have multi-regional buckets, load balancing with a single anycast IP, and a multi-regional document DB (Firebase). Pub/Sub can route automatically to the nearest region. Nothing like this is available in Amazon, only DIY building blocks.


If your workload can run in region B even if there is a serious failure of a service in region A, in which your workload normally runs, then no, you have not created a cross-regional dependency.

When I talk about a cross-regional dependency, I mean an architectural decision that can lead to a cascading failure in region B, which is healthy by all accounts, when there is a failure in region A.

AWS has services that allow for regional replication and failover. DynamoDB, RDS, and S3 all offer cross-region replication. And Global Accelerator provides an anycast IP that can front regional services and fail over in the event of an incident.
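For example, DynamoDB global tables can be added to an existing table with a single UpdateTable call. A minimal boto3 sketch; the table name and regions are placeholders, and the table would need streams enabled:

    import boto3

    # Hypothetical sketch: add a replica region to an existing DynamoDB table
    # (global tables, version 2019.11.21), making it multi-region, multi-active.
    dynamodb = boto3.client("dynamodb", region_name="us-east-1")

    dynamodb.update_table(
        TableName="orders",  # placeholder table name
        ReplicaUpdates=[
            {"Create": {"RegionName": "us-west-2"}},
        ],
    )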


I haven't used Global Accelerator, but it doesn't look like the same thing. The landing page says: "Your traffic routing is managed manually, or in console with endpoint traffic dials and weights".


“Global Accelerator continuously monitors the health of all endpoints. When it determines that an active endpoint is unhealthy, Global Accelerator instantly begins directing traffic to another available endpoint. This allows you to create a high-availability architecture for your applications on AWS.”

https://docs.aws.amazon.com/global-accelerator/latest/dg/dis...

Alternatively, global load balancing with Route 53 remains a viable, mature option as well. Health checks and failover are fully supported.
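A minimal sketch of that pattern with boto3, assuming a hosted zone and a health check on the primary endpoint already exist (every ID and name here is a placeholder):

    import boto3

    route53 = boto3.client("route53")

    # Hypothetical PRIMARY/SECONDARY failover records: Route 53 answers with
    # the primary endpoint while its health check passes, and fails over to
    # the secondary region's endpoint when it does not.
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000000",  # placeholder zone ID
        ChangeBatch={"Changes": [
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "app.example.com",
                "Type": "CNAME",
                "SetIdentifier": "primary-us-east-1",
                "Failover": "PRIMARY",
                "TTL": 60,
                "ResourceRecords": [{"Value": "lb-use1.example.com"}],
                "HealthCheckId": "11111111-1111-1111-1111-111111111111",
            }},
            {"Action": "UPSERT", "ResourceRecordSet": {
                "Name": "app.example.com",
                "Type": "CNAME",
                "SetIdentifier": "secondary-us-west-2",
                "Failover": "SECONDARY",
                "TTL": 60,
                "ResourceRecords": [{"Value": "lb-usw2.example.com"}],
            }},
        ]},
    )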


Correct.

Like many people, I discovered this when something broke in one of the golden regions. In my case it was CloudFront and ACM.

Realistically you can’t trust one provider at all if you have high availability requirements.

The justification is apparently that the cloud takes all this responsibility away from people, but from personal experience running two cages of kit at two datacenters, the TCO was lower and the reliability and availability were higher. Possibly the largest cost is navigating the Harry-Potter-esque pricing and automation laws. The only gain is scaling past those two cages.

Edit: I should point out, however, that an advantage of the cloud is being able to click a couple of buttons and get rid of two cages' worth of DC equipment instantly if your product or idea doesn't work out!


>you can’t trust one provider at all

The hard part with multi-cloud is that you're just increasing your risk of being impacted by someone's failure. Sure, if you're all-in on AWS and AWS goes down, you're all-out. But if you're on [AWS, GCP] and GCP goes down, you're down anyway. Even though AWS is up, you're down because Google went down. And if you're on [AWS, GCP, Azure] and Azure goes down, it doesn't matter that AWS and GCP are up... you're down because Azure is down. The only way around that is architecting your business so it can run on any one of those vendors alone, which means you're paying 3x more than you need to 99.99999% of the time.

The probability that at least one of [AWS, Azure, GCP] is down is way higher than the probability that any single one of them is down. And the probability that your two cages in your datacenter are down is way higher than the probability that any one of the hyperscalers is down.
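To put rough numbers on it (assuming independent failures and, say, 99.99% availability per provider):

    # Hypothetical back-of-the-envelope: if each provider is independently
    # up 99.99% of the time, the chance that at least one of three is down
    # at any moment is roughly three times the single-provider risk.
    p_down = 1 - 0.9999                      # single provider: 0.01%

    p_any_of_3_down = 1 - (1 - p_down) ** 3  # ~0.0003, i.e. ~3x the risk

    print(p_down, p_any_of_3_down)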


> which means you're paying 3x more than you need to 99.99999% of the time.

This would be a poor decision. If you assume AWS, GCP, and Azure fail independently, you can pay 1.5x: each of the three providers would be provisioned to handle 50% of your traffic, so if any one fails, the remaining two can still handle 100%. This is a common way to structure applications. Assuming independence, more replicas mean less over-provisioning: with only two replicas you need to provision 2x in total, while with five independent replicas you only need 1.25x to be resilient against one failure, since each replica is sized at 25%.

In general, N replicas need N/(N-1) total over-provisioning to be resilient against one replica failing.
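A quick sketch of that formula:

    # Over-provisioning factor needed so that the surviving replicas can
    # absorb the traffic of one failed replica, assuming equal sizing.
    def overprovision_factor(n: int) -> float:
        return n / (n - 1)

    for n in (2, 3, 5):
        print(n, overprovision_factor(n))  # 2 -> 2.0, 3 -> 1.5, 5 -> 1.25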


I disagree. It’s about mitigating the risk of a single provider’s failure. Single providers go down all the time. We’ve seen it from all three major cloud vendors.


You disagree with what? That relying on three vendors increases your risk of being impacted by one? That's just statistics. You can disagree with it, but that doesn't make it incorrect.

Or do you disagree that planning for a total failure of one vendor and running redundant workloads on the others increases your costs 99.99999% of the time? Because that's a fairly standard SLA from each of the major vendors. Let's even reduce it to EC2's SLA, 99.99%. So 99.99% of the time you're paying 3x as much as you need to be paying, just to keep your services up for roughly an extra hour per year. Again, you can disagree with that, but that doesn't make it incorrect.
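The arithmetic behind that, taking the 99.99% figure at face value:

    # Downtime allowed per year by a 99.99% SLA: roughly 53 minutes.
    minutes_per_year = 365.25 * 24 * 60

    allowed_downtime_min = (1 - 0.9999) * minutes_per_year  # ~52.6 minutes

    print(allowed_downtime_min)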

Some businesses might need that extra hour; the cost of the extra services might be cheaper than the cost of an hour of downtime per year. But you're not going to find many businesses like that. Either you're running completely redundant workloads, paying 3x as much for roughly an extra hour per year, or you're going to be taken offline when any one of the three goes down independently of the others.

Single providers go down, yes. And three providers go down three times as often as one. Either you're massively overspending or you're tripling your risk of downtime. If multi-cloud worked, you'd be hearing people talking about it and their success stories would fill the front page of Hacker News. They don't, because it doesn't.


How do you test failover from provider-wide outages?

I’ve never heard of an untested failover mechanism that worked. Most places are afraid to invoke such a thing, even during a major outage.


That’s fairly simple. Regular scenario planning, drills and suitable chaos engineering tooling.

Being afraid of failures is a massive sign of problems. I’ve worked in those sorts of places before.


ACM and CloudFront being sticky to us-east-1 is particularly annoying. I’m happy not being multi-regional (I don’t have that level of DR requirements), but these types of services require me to incorporate all the multi-region complexity anyway.


Harry-Potter-esque pricing?

Is that a reference to the difficulty of calculating the cost of visiting all the rides at Universal? That's my best guess...


It's more a stab at the inconsistency of rules around magic.

"Well this pricing rule only works on a Tuesday lunch time if you're standing on one leg with a sausage under each arm and a traffic cone on your head"

And there are a million of those to navigate.


AWS should make this more transparent so that we can make better design choices.



