(Disclaimer: I work for AWS but opinions are my own. I also do not work with the Kinesis team.)
Nearly all AWS services are regional in scope, and for many (if not most) services, they are scaled at a cellular level within a region. Accounts are assigned to specific cells within that region.
There are very, very few services that are global in scope, and it is strongly discouraged to create cross-regional dependencies -- not just as applied to our customers, but to ourselves as well. IAM and Route 53 are notable exceptions, but they offer read replicas in every region and are eventually consistent: if the primary region has a failure, you might not be able to make changes to your configuration, but the other regions will operate on read-only replicas.
This incident was regional in scope: us-east-1 was the only impacted region. As far as I know, no other region was impacted by this event. So customers operating in other regions were largely unaffected. (If you know otherwise, please correct me.)
As a Solutions Architect, I regularly warn customers that running in multiple Availability Zones is not enough. Availability Zones protect you from many kinds of physical infrastructure failures, but not necessarily from regional service failures. So it is super important to run in multiple regions as well: not necessarily active-active, but at least in a standby mode (i.e. "pilot light") so that customers can shed traffic from the failing region and continue to run their workloads.
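To make the standby idea concrete, here is a minimal sketch of the routing decision: prefer a healthy active region, otherwise shed traffic to a healthy standby. The `Region` class, the `pick_serving_region` helper, and the choice of us-west-2 as the standby are all hypothetical, made up for illustration; in practice the health signal would come from whatever health checks you already run.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    healthy: bool
    role: str  # "active" or "standby" (hypothetical labels)


def pick_serving_region(regions):
    """Return the first healthy active region, else the first healthy standby."""
    for role in ("active", "standby"):
        for region in regions:
            if region.role == role and region.healthy:
                return region.name
    return None  # nothing healthy: no region can take traffic


# us-east-1 is failing, so traffic is shed to the standby region.
regions = [
    Region("us-east-1", healthy=False, role="active"),
    Region("us-west-2", healthy=True, role="standby"),  # assumed standby
]
serving = pick_serving_region(regions)
print(serving)
```

The standby only helps if everything it needs at runtime already lives in that region, which is the point about cross-regional dependencies.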
This outage highlighted our dependency on Cognito. Everything else we are doing can (and probably should) be replicated to another region, which would resolve these types of issues.
However, Cognito is very region specific and there is currently no way to run in active-active or even in standby mode. The problem is user accounts; you can't sync them to another region and you can't back-up/restore them (with passwords). Until AWS comes up with some way to run Cognito in a cross-region fashion, we are pretty much stuck in a single region and vulnerable to this type of outage in the future.
Please bring this to the attention of your account team! They will bring your feedback to the service team. While I can’t speak for the Cognito team, I can assure you they care deeply about customer satisfaction.
What do you mean by cross-regional dependencies? Isn't running in a multi-region setup by itself adding a dependency?
Speaking of multi-region services: what do you think about Google now offering all three major building blocks as multi-regional?
They have multi-regional buckets, a load balancer with a single anycast IP, and a document DB (Firebase). Pub/Sub can route automatically to the nearest region. Nothing like this is available on Amazon, only DIY building blocks.
If your workload can run in region B even if there is a serious failure of a service in region A, in which your workload normally runs, then no, you have not created a cross-regional dependency.
When I talk about a cross-regional dependency, I mean an architectural decision that can lead to a cascading failure in region B, which is healthy by all accounts, when there is a failure in region A.
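A toy model of that distinction, as a sketch only: each region lists the regions it needs at runtime, and a failure spreads to anything that depends on an impacted region. The `impacted_regions` function and the `region-a` / `region-b` names are hypothetical and exist only for this example.

```python
def impacted_regions(failed_region, depends_on):
    """Return every region that stops working when failed_region fails.

    depends_on maps a region name to the set of regions it needs at runtime.
    """
    impacted = {failed_region}
    changed = True
    while changed:  # keep sweeping until no new region is pulled in
        changed = False
        for region, deps in depends_on.items():
            if region not in impacted and deps & impacted:
                impacted.add(region)
                changed = True
    return impacted


# Region B runs entirely on its own resources: no cross-regional dependency.
independent = {"region-a": set(), "region-b": set()}
# Region B calls into region A at runtime: a cross-regional dependency.
coupled = {"region-a": set(), "region-b": {"region-a"}}

print(sorted(impacted_regions("region-a", independent)))  # ['region-a']
print(sorted(impacted_regions("region-a", coupled)))      # ['region-a', 'region-b']
```

In the second case region B is healthy by every local measure and still goes down, which is the cascading failure described above.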
AWS has services that allow for regional replication and failover. DynamoDB, RDS, and S3 all offer cross-region replication, and Global Accelerator provides an anycast IP that can front regional services and fail over in the event of an incident.
I haven't used Global Accelerator, but it doesn't look like the same thing. The landing page says: "Your traffic routing is managed manually, or in console with endpoint traffic dials and weights".
“Global Accelerator continuously monitors the health of all endpoints. When it determines that an active endpoint is unhealthy, Global Accelerator instantly begins directing traffic to another available endpoint. This allows you to create a high-availability architecture for your applications on AWS.”
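The two quotes are not in conflict: weights can set the normal split while health checks take an endpoint out of rotation automatically. A minimal sketch of that combination follows. It is a simplified model, not Global Accelerator's actual algorithm or API, and the `traffic_shares` function and endpoint names are hypothetical.

```python
def traffic_shares(endpoints):
    """Split traffic across healthy endpoints in proportion to their weights.

    endpoints maps an endpoint name to a (weight, healthy) tuple.
    Unhealthy or zero-weight endpoints receive no traffic.
    """
    eligible = {
        name: weight
        for name, (weight, healthy) in endpoints.items()
        if healthy and weight > 0
    }
    total = sum(eligible.values())
    if total == 0:
        return {}  # nothing can take traffic
    return {name: weight / total for name, weight in eligible.items()}


# Manually configured weights, with the us-east-1 endpoint failing health checks.
endpoints = {
    "us-east-1": (100, False),
    "us-west-2": (50, True),     # assumed second region
    "eu-west-1": (50, True),     # assumed third region
}
shares = traffic_shares(endpoints)
print(shares)
```

With the failing endpoint excluded, the remaining two split the traffic evenly without anyone touching the configured weights.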