Sorry all I jinxed it. Yesterday I was in a meeting and said "The only regional outages AWS has ever had were in us-east-1, so we should just move to us-east-2."
Now I guess we have to move to us-west-2. :)
Update: looks like it's only one zone anyway, so my statement still stands!
I always say stay in use1 because almost everybody is there, and when it's suffering any kind of outage, so much of the internet is affected that it's no big deal that you're part of it. People just go outside and get some air knowing it will be back up in a few hours, usually right around the time the AWS status page acknowledges that there is an issue.
The AWS status board, posted elsewhere in the comments, seems to think this is an AZ outage, not a regional one.
Edit: although, one of our vendors that uses AWS has said that they think ELB registration is impacted (but I don't recall if that's regional?) and R53 is impacted (which is supposed to be global, IIRC). Dunno how much truth there is to it as we don't use AWS directly.
In all seriousness, we've been deploying everything on us-west-2, and it seems to have dodged most of the outages recently. Is there something special about that data center?
Classically, us-east-1 received most of the hate given its immense size (it used to be several times larger than any other region) and its status as the first large AWS region. It also seemed to launch new AWS features first, but that may have been my imagination. If true, I'm sure always running the latest builds was not great for stability.
us-west-2 has had outages as well but it is less common, even rare. I've been pushing companies to make their initial deployments onto us-west-2 for over ten years now. I occasionally get kudos messages in my inbox :)
I believe us-east-1 runs some of the control plane, and a us-east-1 outage can effectively take a service in a different region offline because it can break IAM authentication.
It's never been a default datacenter. For a long time the default when you first logged into the console was us-east-1 so a lot of companies set up there (that's where all of reddit was run for a long time and Netflix too). At some point they switched the default to us-east-2.
So anyone who is in us-west-2 is there intentionally, which makes me assume there is a smaller footprint there (but I have no idea).
Rather the opposite: us-west-2 is big but not the biggest region, nor the smallest, nor the oldest or newest, and it's not partitioned off like the China or GovCloud regions. Because us-west-2 is fairly typical, it tends to be one of the last regions to get software updates, after they've been tested in prod elsewhere.
Looks like this particular issue was due to power loss, and for power us-west-2 has one clear advantage: its power comes directly from the Columbia River, so it's highly unlikely to have demand-based outages.
Maybe not the entire region. Amazon was reportedly building a data center complex next to the natural gas Hermiston Generating Plant some distance from the river.
Naive question: don't people who care about resiliency have their services in more than one datacenter? Or is datacenter failure considered such a rare event that it's not worth the cost/trouble of using more?
AWS makes it pretty easy to operate in multiple AZs within a region (each AZ is considered a separate datacenter but in real life each AZ is multiple datacenters that are really close to each other).
That being said, there is still an added cost and complexity to operate in multiple AZs, because you have to synchronize data across the AZs. Also you have to have enough reserved instances to move into when you lose an AZ, because if you're running lean and each zone is serving 33% of your traffic, suddenly the two that are left need to serve 50% each.
The bigger companies with overhead reservations will get all the instances before you can launch any on demand during an AZ failure.
> each AZ is considered a separate datacenter but in real life each AZ is multiple datacenters that are really close to each other
For AWS specifically, I’m fairly certain they maintain a minimum distance and are much more strict on requirements to be on different grids etc than other Cloud providers. A few years ago they were calling out Azure and Google Cloud on exactly what you describe (having data centers essentially on the same street almost).
A single AZ may have neighboring datacenters, but they are very strict on having datacenters for different AZs be at least 100km apart and on different flood plains and power grids.
Each Availability Zone can be multiple data centers. At full scale, it can contain hundreds of thousands of servers. They are fully isolated partitions of the AWS global infrastructure. With its own power infrastructure, an Availability Zone is physically separated from any other zones; there is a distance of several kilometers, although all are within 100 km (60 miles) of each other.
I think you may have slightly misread. I think what’s being said is that a single logical AZ may actually be multiple physical datacenters in close proximity.
Some people care about it but not enough to justify the added downsides - multi-data center is expensive (you pay per data center) and it’s complex (data sharding/duplication/sync).
If you're Amazon, where every second is millions of dollars in transactions, you care more than a startup that gets 1 request per minute. Even if you accept the risk, you still care when your DC goes down.
Also, a large chunk of AWS is managed from a single data center so if that one goes down you may still have issues with your service in another data center.
I'd consider using it, but the biggest roadblock for me is that I work in a regulated industry in Australia, and until AWS finishes their Melbourne region (next year maybe?) I'm stuck in one region because all private data needs to stay in Australia.
Also, I think a lot, but not all of the services I use work okay with multiple regions.
On top of that, I was looking at the documentation for KMS keys yesterday, and a KMS key can be multiregion, but if you don't create it as multiregion from the start, you can't update the multiregion attribute. So you need to create a new KMS key and update everything to use the new multiregion key.
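For anyone hitting the same wall, a minimal sketch of creating the key as multi-Region from the start and then replicating it (the description and target region below are placeholders, not anything from AWS's docs):

    import boto3

    kms = boto3.client("kms", region_name="us-east-2")

    # MultiRegion must be set at creation time; it cannot be toggled afterwards.
    key = kms.create_key(
        Description="example multi-Region key",  # placeholder
        MultiRegion=True,
    )
    key_id = key["KeyMetadata"]["KeyId"]

    # Create a replica of the primary key in another region.
    kms.replicate_key(
        KeyId=key_id,
        ReplicaRegion="us-west-2",  # placeholder target region
    )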
AWS works with multiple availability zones (AZ) per region, some products by default deploy in several ones at the same time, while others leave it up to you.
AWS makes it trivially easy to distribute across more than one datacenter... The only time that outages make the news is when they all fail in a region.
The fact that so many popular sites/services are experiencing issues due to a single AZ failure makes me think that there is a serious shortage of good cloud architects/engineers in the industry. It would be one thing if this was a Regional failure, but a single AZ failure should not have any noticeable effect.
For most businesses a little down time here and there is a calculated risk versus more complex infrastructure. You can’t assume all the cloud architects are idiots — they have to report their task list and cost of infrastructure to someone who can give feedback on various options based on comparative resource requirements and risks.
Zone downtime still falls under an AWS SLA so you know about how much downtime to accept and for a lot of businesses that downtime is acceptable.
This. People working in IT naturally think keeping IT systems up 100% of the time is the most important thing. And depending on the business it often is, but it all costs money. Running a business is about managing costs and risks.
- Is it worth spending 20% more on IT to keep our site up 99.99% vs 99%?
- Is it worth having 3 suppliers for every part our business depends on, each contracted to be able to supply 2x more in case another supplier has issues? And paying a big premium for that?
- Is it worth having offices across the globe, fully staffed and trained to take on any problem, in case there's a big electrical outage/pandemic/etc. in another part of the world?
I'm not saying that some of those outages aren't the result of clowny/incompetent design. But "site sometimes goes down" can often be a very valid option.
I've had some interesting discussions about this with a bunch of representatives of our larger B2B customers. Interestingly enough, to them, a controlled downtime of 2-4 hours with almost guaranteed success is preferable to a more complex, probably-working zero-downtime effort that might leave the system in a messed up - or not messed up - state.
To them it's much easier to communicate "Hey, our customer service is going to be degraded on the second Saturday in October, call on Monday" to their customers 1-2 months in advance, prepare to have the critical information available without our system, and have agents tell people just that.
This has really started to change my thoughts on how to approach, e.g., a major Postgres upgrade. In our case, it's probably better to just take a backup, shut everything down, do an offline upgrade, and rebuild & restore if things unexpectedly go wrong. We can totally test the happy case to death, and if the happy case works, we're done in 2 hours for the largest systems with minimal risk. 4 hours if we have to recover from nothing, also tested.
And you know, at that point, is it really economical to spend weeks to plan and weeks to test a zero downtime upgrade that's hard to test, because of load on the cluster?
At least in my experience, AWS downtime also only accounts for a minor share of the total downtime; the major source is crashes and bugs in the application you're actually trying to host. Being completely HA and independent of AZ crashes/bugs is extremely hard and time-intensive, and usually not worth it compared to investing that time to get your app to run smoothly.
I think a good trade off, if your infra is in TF, is to be able to run your scripts with a parameterized AZ/region. That way you can reduce the downtime even more at a fraction of the cost. (assuming the services that are down are not the base layers of AWS, like the 2020 outage)
If you can get the data out of the downed AZ, don't have state you need to transfer and are not shot in the foot once the primary replica comes online again. I've rarely deployed an app where it was as easy as just to change a region variable.
Depending on the size of the company it can be simple or hard. Most companies that need this are not huge. Things like RDS, Elasticache, ECR and Secrets have multi AZ integrated so not hard to do it. If you operate on ECS or EKS it's pretty straightforward to boot up nodes and load balancers in another AZ.
Maybe you have a system that requires more hands on work and want to explain your point of view? I don't appreciate the snarky responses tho.
Yeah, makes sense if explicitly stated. Not everything is worth the money.
However, in my experience, the people doing the calculations on that risk have no incentive to cover it. Their bonus has no link to the uptime and they can blame $INFRA for the lost millions and still meet their targets and get promoted / crosshired.
The people who warned them and asked for funding are the ones working late and having conf calls with the true stakeholders.
If your application and infra can magically utilize multiple zones with “a couple lines”… then I would say you are miles ahead of just about every other web company.
Today, a SaaS I’m familiar with that runs ~10 Aurora clusters in us-east-2 with 2-3 nodes each (1 writer, 1-2 readers) in different AZs had prolonged issues.
At least 1 cluster had a node on “affected” hardware (per AWS). Aurora failed to failover properly and the cluster ended up in a weird error state, requiring intervention from AWS. Could not write to the db at all. This took several hours to resolve.
All that to say that it’s never straightforward. In today’s event, it was pure luck of the draw as to whether a multi-AZ Aurora cluster was going to have >60 seconds of pain.
That SaaS has been running Aurora for years and has never experienced anything similar. I was very surprised when I heard the cluster was in a non-customer-fixable state and required manual intervention. I’ve shilled Aurora hard. Now I’m unsure.
Thank goodness they had an enterprise support deal or who knows if they’d still have issues now.
Or how about "I'm fully aware, I've done the math taking into account both cost and complexity of implementation and cost of downtime, and I'm probably making fantastic calls based on my actual needs."
This has quickly grown to more than adding in a couple of lines! Now I need to architect my legacy app so that I can deploy into lambdas, then I can get resiliency I don't really need!
Not all systems require high availability. Some systems are A-OK with downtime. Sometimes, I'm perfectly fine with eventual consistency. You really do have to look at the use-cases and requirements before making sweeping statements.
No, we were talking about architects making decisions that you characterised as poor. I was pointing out that your statement was over-general and that there are many instances where making the informed decision to ignore HA is a completely reasonable thing to do.
By your last sentence, it appears you agree with me.
If you meant to say that your statement only applies to cloud architects who are attempting to maintain an uptime SLA with multi-az/region redundancy, then sure, AWS has lots of levers you can pull and those complaining really should spend some time studying them.
As for legacy applications, I would not have brought them up at all if you hadn't suggested pushing things into lambdas as a solution to multi-AZ. Once again, there are many, many situations where this is not appropriate. Not everything is greenfield, and re-architecting existing applications to shoehorn them into a different deployment model seems a bit much. Unless I'm misunderstanding what you meant.
It gives me a bad gut feeling when you imply that multiple instances of a service is more complex than a single instance which cannot be duplicated easily.
I also disagree that it is inherently more costly to run a service in multiple locations.
Of course it's more costly, you need to ensure state between locations so by virtue there's more infra to pay for.
It's not just a single instance too, there's generally a lot more infrastructure (db servers, app servers, logging and monitoring backends, message queues, auth servers... etc)
You do not need to pay double for everything, that might have been true with traditional VPS providers but it is not the way it works with cloud services.
You decide on what kind of failure you're willing to tolerate and then architect based on those requirements (loss of multiple AZ's, loss of a region, etc..).
Let's say your website requires 4 application servers, you can then tolerate a single AZ failure by using 5 application servers and spreading them among 5 AZs.
If you already have 4 application servers you are probably already AZ tolerant; most people concerned about "doubling everything" are only running 1 instance.
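As a back-of-the-envelope check of that sizing (a quick sketch in Python, not anything AWS-specific):

    import math

    def servers_for_az_failure(required: int, azs: int, az_failures: int = 1) -> int:
        """Servers needed so the surviving AZs still cover `required` capacity."""
        surviving = azs - az_failures
        per_az = math.ceil(required / surviving)  # each surviving AZ must carry this much
        return per_az * azs                       # provision evenly across all AZs

    # 4 servers' worth of required capacity spread over 5 AZs -> 5 servers total
    print(servers_for_az_failure(required=4, azs=5))  # 5
    # 1 required server -> you really do have to double to tolerate an AZ loss
    print(servers_for_az_failure(required=1, azs=2))  # 2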
Going by your example, if your website requires 1 application server, then tolerating a single AZ failure requires you to double the number of application servers.
Example - we have a service that used Kafka in the affected region that went down. Our primary kafka instance (R=3) survived but this auxiliary one failed and caused downtime. There's no way around this other than doubling the cost.
In most cases the elephant* in the room is your DB - it doesn't matter where your stateless application servers are, if your stateful DB goes down you're in trouble. It's also often 1) the hardest to replicate, as replication involves tradeoffs - see CAP theorem & co and 2) the most expensive, since it needs to be pretty beefy in terms of CPU, RAM and IO - all very expensive on AWS.
That's true, when only dealing with 1 server, you technically double the cost by adding a second server. My original comment was about "popular sites/services", that should be able to tolerate the costs and are most likely dealing with multiple servers.
For a single server deployment you can still reduce your downtime (with minimal costs) by having the ASG redeploy into another AZ on a failed health check.
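A minimal sketch of that single-instance setup with boto3 (the group name, launch template, subnet IDs, and target group ARN are all placeholders): an ASG with min/max of 1 whose subnets span two AZs, using ELB health checks so a failed instance gets replaced in whichever AZ has capacity.

    import boto3

    autoscaling = boto3.client("autoscaling", region_name="us-east-2")

    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="single-instance-asg",          # placeholder name
        LaunchTemplate={"LaunchTemplateName": "web-lt"},      # placeholder launch template
        MinSize=1,
        MaxSize=1,
        DesiredCapacity=1,
        # Subnets in two different AZs; the ASG can relaunch in either one.
        VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",  # placeholder subnet IDs
        # Use the load balancer's health check, not just EC2 status checks,
        # so an unhealthy-but-running instance still gets replaced.
        HealthCheckType="ELB",
        HealthCheckGracePeriod=300,
        TargetGroupARNs=["arn:aws:elasticloadbalancing:...:targetgroup/web/abc123"],  # placeholder
    )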
It's more expensive to have more things, and it's more expensive to have more complicated things. And things that can fail over are inherently more complicated.
A multi-az deployment is a checkbox in most AWS services, e.g. ASGs, RDS, load balancers, etc. Someone didn't check that box because they didn't know about it, there isn't much complexity in it.
Aren't multi-az deployments more expensive? That would be a valid reason not to check this checkbox, if your business can survive a bit of downtime here and there.
Most of that expense is just the cost of a hot failover, but there is some additional cost around inter-AZ data transfer. If someone is not checking the boxes for cost reasons, I would be surprised if they had failovers in the same AZ. It seems more likely they just don't have failovers.
Multi-AZ brings extra complexity in terms of data duplication and consistency; if your app wasn't designed to handle those kinds of scenarios and experiences high user loads, then you are in for a lot of problems.
Designing for those scenarios increases complexity and cost and changes your architecture style, and most of the time it brings you into microservices territory, where most companies lack experience and just follow best practices, in a field where engineers are expensive and few.
> The fact that so many popular sites/services are experiencing issues due to a single AZ failure makes me think that there is a serious shortage of good cloud architects/engineers in the industry.
Not really.
What's more likely is that their companies have other priorities. Multi-AZ architectures are more expensive to run, but that's normally not the issue. What's really costly is testing their assumptions.
Sure, by deploying your system in a Kubernetes cluster spread across 3 AZs, with an HA database, you are supposedly covered against failures. Except that when it actually happened, it turned out your system couldn't really survive a sudden 30% capacity loss like you expected, and the ASG churning is now causing havoc with the pods that did survive.
Complex systems often fail in non-trivial ways. If you are not chaos-monkeying regularly, you won't know about those cases until they happen. At which time it's too late.
Nail on the head. The number of times I've seen way-overcomplicated redundancy setups fail in weird and wonderful ways, causing way more downtime than a simpler setup would, is pretty silly.
Don’t make the mistake of overromanticizing the simple solutions. They have nice, well understood failure conditions, and they come up relatively frequently.
When you start playing the HA game, the easy failures go off the table, and things break less often because “failures happen constantly and are auto-healed”. But when your virtual IP failover goes sideways or your cluster scheduler starts reaping systems because the metadata service is giving it useless data, you’re well into an infrequent, complex failure, and I hope you have a good ops team.
It's not so cut-and-dried. The AZ isolation guarantees are not quite at the maturity they need to be.
If you're using any managed services by AWS, you need to rely on their own services to be AZ fault-tolerant. In AWS speak, they may well be (just with elevated error rates for a few minutes while load balancing shifts traffic away from a bad AZ). But as an AWS customer, you still feel the impact. As an example, one of our CodePipelines failed the deployment step with an InternalError from CloudFormation. However, the actual underlying stack deployment succeeded. When we went to retry that stage, it wouldn't succeed because the changeset to apply is no more. It required pushing a dummy change to unblock that pipeline.
Similarly, many customers run Lambdas outside of VPCs that theoretically shouldn't be tied to an AZ. You're still reliant on the AWS Lambda team to shift traffic away from a failing AZ, and until they do that, you'll see "elevated error rates" as well.
1) AWS is already really expensive on just a single AZ, and replicating to a second AZ would almost double your costs. I can't help but point out that an old-school bare-metal setup on something like Hetzner/OVH/etc. becomes significantly more cost-effective, since you're not using AWS's advantages in this area anyway. And as we've seen in practice, AWS is nowhere near more reliable: how many times have AWS AZs gone down versus the bare-metal HN server, which only had its single significant outage very recently? That makes sense considering the AWS control plane is orders of magnitude more complex than an old-school bare-metal server, which just needs power and a network port.
2) It is extremely hard to build reliable systems over time, since during non-outage periods everything appears to work fine even if you've accidentally introduced a hard dependency on a single AZ, and it's even harder to account for second-order effects such as an inter-AZ link suddenly becoming saturated during an outage. I'm personally not confident at all in Amazon's (or frankly, any public cloud provider's) ability to actually guarantee seamless failover, since the only way to prove it works is to have a real outage that induces those second-order effects, and AWS or any other cloud provider isn't going to schedule one: an intentional, regularly scheduled outage for testing would hurt anyone who deliberately doesn't use multiple AZs, essentially pricing them out of the market by forcing them to either commit to the cost increase of multi-AZ or move to a provider that doesn't do scheduled test outages.
As some others have alluded to, it seems common AWS services (the ones you rely on to manage multi-AZ traffic like ALBs and Route53) spike in error rate and nose dive in response time so it becomes difficult to fail things over. On top of that, services like RDS that run active hot standby then rely on those to fail over so it's difficult to get the DB to actually fail over.
I suspect, behind the scenes, AWS fails to absorb the massive influx in requests and network traffic as AZs shift around.
I would think regions with more AZs (like us-east-1) would handle an AZ failure better, since there are more AZs to spread the load across.
What's more surprising, imo, is the large apps like New Relic and Zoom that you'd expect to be resilient (multi region/cloud) taking a hit
Architect here. We had an outage and we have a very complete architecture. The issue is, the services were still reachable via internal health checks, so instead of taking the affected servers out of service, they stayed in.
We had to resolve it by manually shutting down all the servers in the affected AZ. Which is normally not needed.
There are of course a lot of companies that aren't architected with multi-AZ at all (or choose not to be). Those companies are having an even worse time right now. But because the servers generally still appeared healthy, this can affect some well-architected apps also.
The only reason we knew to shut them down at all was because AWS told us the exact AZ in their status update. We were beginning the process of pinging each one individually to try to find them (because, again, all the health checks were fine).
Yup, exact same here. All of the multi-AZ failover depends on AWS recognizing that their AZ is having an issue, and they never reported an issue on any health check, so no failover ever happened. We started being able to make progress when AWS told us which AZ was having issues. It still took some time for us to manually shift away from that AZ (manually promoting ElastiCache replicas to primary, switching RDS clusters around, etc.) because the AWS failover functionality did not function as it should have, and we were relying on that. Multi-region failover would have made us more fault tolerant, but our infrastructure wasn't set up for that yet (besides an RDS failover in a separate region). Here's to hoping we never have a Route53 or global AWS API Gateway failure! Then even multi-region will not do us much good. Perhaps we should have some backup servers on the moon; then in case of nuclear warfare we can still be online via satellite.
P.S. AWS has been saying the issue is resolved for almost 2 hours now, and we are still having issues with us-east-2a.
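For anyone curious, the kind of manual shifting described above looks roughly like this sketch (the replication group, node, and cluster identifiers are hypothetical; it assumes a non-cluster-mode Redis replication group and an Aurora cluster):

    import boto3

    region = "us-east-2"

    # Promote an ElastiCache (Redis) read replica outside the bad AZ to primary.
    elasticache = boto3.client("elasticache", region_name=region)
    elasticache.modify_replication_group(
        ReplicationGroupId="my-redis-rg",      # hypothetical replication group
        PrimaryClusterId="my-redis-rg-002",    # hypothetical replica node in a healthy AZ
        ApplyImmediately=True,
    )

    # Fail an Aurora cluster over to a reader in a healthy AZ.
    rds = boto3.client("rds", region_name=region)
    rds.failover_db_cluster(
        DBClusterIdentifier="my-aurora-cluster",          # hypothetical cluster
        TargetDBInstanceIdentifier="my-aurora-reader-2",  # hypothetical reader instance
    )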
Both the EC2 instance health and our HTTP health checks. If either of those failed the server would have been removed from the load balancer, but they didn't fail.
Only the external health checks that hit the system from an outside service were failing. And because those spread the load across the AZs, only a fraction of them were failing, with no good way to tell the pattern of failure.
I did have some Kubernetes pods become unhealthy but only because they relied on making calls to servers that were in a different AZ.
It's always more complicated than just deploying EC2 instances into multiple-az's. Here are some things I noticed from today's events.
First: RDS. I saw one of our RDS instances do a failover to the secondary zone because the primary was in the zone that had the power outage. RDS failovers are not free and have a small window of downtime (60-120s as claimed by AWS[1]); there's a quick way to exercise that window yourself, sketched after this comment.
Second: EKS (Kubernetes). One of our Kubernetes EC2 worker nodes (in EKS) went down because it was in the zone with the power outage. Kubernetes did a decent job at re-scheduling pods, but there were edge cases for sure. Mainly with Consul and Traefik running inside of the Kubernetes cluster. Finally, when the Kubernetes EC2 worker node came back up, nearly nothing got scheduled back to it. I had to manually re-deploy to get pod distribution even again. Though the last issue might be something I can improve on by using the new Kubernetes attribute topologySpreadConstraints[2].
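The RDS failover window mentioned above can be measured before an outage forces it on you: a Multi-AZ instance can be failed over on demand. A minimal sketch, with a hypothetical instance identifier (don't run this casually against prod):

    import boto3

    rds = boto3.client("rds", region_name="us-east-2")

    # Reboot with failover: for a Multi-AZ instance this promotes the standby,
    # so you can measure the real write-unavailability window (AWS quotes 60-120s).
    rds.reboot_db_instance(
        DBInstanceIdentifier="my-multi-az-db",  # hypothetical instance
        ForceFailover=True,
    )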
It's a game theory thing. If everyone stays single AZ, everyone goes down at the same time so nobody gets blamed. Somehow the blame falls on AWS instead!
That's a feature, not a bug. If we all had the same zone "a", then things would not be loaded anything close to evenly. There is a command to find out what the unique zone ID is for your particular zones under your account's naming.
Clarification: 1/3 of sites will go down (those using the AZ that went offline), but my point is the same. Most companies aren't using multiple AZs, let alone multiple regions.
I don't think there's a shortage of people who can architect reliable services. I think companies simply put reliability on the back burner because it rarely bites them. It's the same reason technical debt is so rarely paid off.
There is a shortage of good cloud engineers, but even if there were more of them, the business doesn't give a crap about brief outages like this. Blame it on AWS and move on, business as usual. Even if they did care, the business is often too incompetent to understand that they could easily prevent these things. And even if they did realize it, they don't want to prioritize it over pushing out another half-baked feature, making sales, getting their bonus.
Multi-AZ architecture at least doubles the cost, and it tends to cost relatively even more if the business is small. Good engineers find the balance between cost and availability.
To an investor, a salary is a temporary cost, i.e., you pay the salary, get the TF scripts made, and fire the employee, while checkbox-driven, managed resiliency is going to cost you forever with no hope of ever eliminating that cost.
At least that's what was recently told to me by my manager to explain why my employer prefers to hire people to self-manage the AWS infra.
Did you, by chance, reply to the wrong comment? Don’t think I said anything about failovers etc.
The point made to me was that a devops role can be made to eventually automate their own job away to an extent. To an investor, having a devops role on staff is acceptable.
If you never had a devops role and used AWS managed services, you can’t automate that and trim costs.
I.e., devops roles look like surplus in the system if they’re doing a worse job than managed services but to certain audiences, that surplus is necessary. So, if you’re looking to fundraise and your business has tight margins, don’t be too hasty to move to managed services.
Assuming you're using RDS then multi-AZ deployment is just a simple configuration option. If you're using Aurora then it is handled automatically and is even less expensive.
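On classic RDS that configuration option is literally one flag; a minimal sketch with boto3 (identifier, sizing, and credentials are placeholders):

    import boto3

    rds = boto3.client("rds", region_name="us-east-2")

    rds.create_db_instance(
        DBInstanceIdentifier="app-db",        # placeholder identifier
        Engine="postgres",
        DBInstanceClass="db.r6g.large",       # placeholder sizing
        AllocatedStorage=100,
        MasterUsername="app",
        MasterUserPassword="change-me",       # placeholder; use Secrets Manager in practice
        MultiAZ=True,                         # the "checkbox": synchronous standby in another AZ
    )

    # An existing instance can also be converted in place.
    rds.modify_db_instance(
        DBInstanceIdentifier="app-db",
        MultiAZ=True,
        ApplyImmediately=True,
    )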
Aurora can replicate the data but doesn't have to keep a hot standby, AFAIUI. You can then start a new instance in a different AZ, but the process is semi-manual.
I can tell you from experience that the cloud architects are world class and it's actually the data center techs that are the problem. Amazon doesn't value data center techs; they don't pay competitively and they hire techs that barely have enough skill so they can pay them nothing. Then they metric the fuck out of the teams so that everyone focuses on quick fixes instead of taking the time to troubleshoot long-term persistent issues. Couple this with the fact that management is only concerned with creating new capacity instead of fixing existing capacity.
Will need to see the post-mortem; when us-east-1 had its last big outage, multiple AZs were working, but cross-AZ functionality (Lambda, EventBridge) was impacted... which made recovery problematic.
We definitely had issues with all of the AZs in east-2, and far more services impacted than just EC2 (e.g. RDS and ElastiCache were intermittently down for us).
My take is that if this many sites are broken, maybe I shouldn't care either. The extra complexity of dealing with high availability probably isn't worth it for my project. Spend more time on features instead.
Update from AWS: they lost power to (part of?) a single DC in the use2-az1 availability zone.
10:25 AM PDT We can confirm that some instances within a single Availability Zone (USE2-AZ1) in the US-EAST-2 Region have experienced a loss of power. The loss of power is affecting part of a single data center within the affected Availability Zone. Power has been restored to the affected facility and at this stage the majority of the affected EC2 instances have recovered. We expect to recover the vast majority of EC2 instances within the next hour. For customers that need immediate recovery, we recommend failing away from the affected Availability Zone as other Availability Zones are not affected by this issue.
Interesting to see it's been a loss of power that caused this. Usually the better datacenters have multiple levels of power redundancy including emergency backup generators.
It depends entirely on how AWS architected their power redundancy. Given that the outage affected a portion of one DC in one AZ, we can make some assumptions, but the truth is we just don't know.
It could be that their shared-fate scope is an entire data hall, or a set of rows, or even an entire building given that an AZ is made up of multiple datacenters. I don't know that AWS has ever published any kind of sub-AZ guarantees around reliability.
Datacenter power has all kinds of interesting failure modes. I've seen outages caused by a cat climbing into a substation, rats building a nest in a generator, fire-fighting in another part of the building causing flooding in the high-voltage switching room, etc.
Shrug... the datacenter is land locked (different animal species) and the problem hasn't happened again in multiple years.
I think you're taking the Eagle a bit too seriously though... if we didn't do anything how would we know? It isn't like this was an expensive thing to try out.
I thought that AWS availability zones were intentionally not canonically named to prevent everyone from adding stuff to AZ “A”. So my us-east-1 zone “A” might be your “B”.
But that system breaks down here when you need to know whether you are in an affected zone. Is there a way to map an account’s AZ name to the canonical one which apparently exists?
They gave up on that, now there's an extra "zone ID" you can read that maps to an absolute address. They used to be extremely cagey about giving out those mappings for your account.
I'm also pretty sure that GCP's identifiers are absolute (and this time, throughout) as well, since their documentation (which renders the same in incognito mode or whatever) makes reference to what zones have what microarchitectures and instance types.
This is true, at the AWS account level. us-east-2a for my account may map to the internal use2-az1, but in your account us-east-2a may map internally to use2-az2.
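A minimal sketch of looking up that mapping for your own account (region and example output are just placeholders; the mapping really does vary per account):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-2")

    # ZoneName (us-east-2a/b/c) is per-account; ZoneId (use2-az1/2/3) is absolute.
    for az in ec2.describe_availability_zones()["AvailabilityZones"]:
        print(az["ZoneName"], "->", az["ZoneId"])

    # Example output (yours will differ):
    # us-east-2a -> use2-az2
    # us-east-2b -> use2-az1   <- the zone named in AWS's status update
    # us-east-2c -> use2-az3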
Haha literally had this same thought. us-east-2 is our default region for most stuff and so far that's been good. I think this is the first AWS downtime in last couple years that hit our systems directly
I find this spreadsheet handy for thinking about AWS region-wide outages and frequency. This seems to be the first major us-east-2 outage, indeed, vs us-east-1 and other regions.
Living a bit more dangerously at the moment as HN is still running temporarily on AWS. (I'd link to the threads about this from a few weeks ago but am on my phone ATM.)
And the reason that works is because HN is mostly hosted on its own stuff, without weird dependencies on anything beyond "the servers being up" and "TCP mostly working."
Just lost my email provider (https://status.postmarkapp.com/incidents/240161) to this and I'd bet my services are degraded/down. I know it's never a "good" time for an outage but this sure does suck for me right now. We've got an event this weekend and people can't sign up right now to buy tickets/etc.
Our own applications are hosted on Azure, but we had an outage today anyway. It was because apparently Netlify and Auth0 use AWS and went down, which took down our static sites and our authentication.
The nature of our business means it wasn't a big deal, but I could imagine lots of people were in the same boat.
But the colos aren't usually managed by a single control plane controlled by a single company, so while they can all fail, they will generally do so independently.
Suppose it's time to set up multi-AZ and pay to insure against AWS's own failures. I don't know why I previously thought their EC2 uptime claims were sufficient. Lesson learned.
Are you sure you understand their uptime claims? They offer a 99.99% SLA for regional availability, but only 99.5% for individual instances (and even then, they only owe you a 10% service credit for affected instances)
99.5% availability allows up to about 3 and a half hours of downtime a month. 99.99% means around 4 minutes a month. So if you can't handle hours of downtime, you should definitely be multi-AZ.
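The arithmetic, for anyone who wants to check (assuming a 30-day month):

    def allowed_downtime_minutes(sla: float, days: float = 30) -> float:
        """Minutes of downtime per month permitted by an availability SLA."""
        return (1 - sla) * days * 24 * 60

    print(allowed_downtime_minutes(0.995))   # ~216 min, i.e. about 3.6 hours
    print(allowed_downtime_minutes(0.9999))  # ~4.3 min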
What are you hosting? Multi-AZ seems like a bare minimum for basic reliability. That said it's not a panacea. There's all sorts of cascading/downstream "weirdness" that can result on AWS's own services through the loss of an AZ.
[10:25 AM PDT] We can confirm that some instances within a single Availability Zone (USE2-AZ1) in the US-EAST-2 Region have experienced a loss of power. The loss of power is affecting part of a single data center within the affected Availability Zone. Power has been restored to the affected facility and at this stage the majority of the affected EC2 instances have recovered. We expect to recover the vast majority of EC2 instances within the next hour. For customers that need immediate recovery, we recommend failing away from the affected Availability Zone as other Availability Zones are not affected by this issue.
"[10:11 AM PDT] We are investigating network connectivity issues for some instances and increased error rates and latencies for the EC2 APIs within the US-EAST-2 Region."
Depends what it’s stuck doing, but you might ctrl-c it and later manually unlock the state file (by carefully coordinating with colleagues and deleting the dynamo DB lock object if you’re using the s3 backend) when the outage is over.
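A hedged sketch of that manual unlock, assuming the S3 backend with a DynamoDB lock table (`terraform force-unlock <LOCK_ID>` is the usual first resort if you can still reach the backend; the table name and state key below are made up, so check your backend's "dynamodb_table" and "key" settings):

    import boto3

    dynamodb = boto3.client("dynamodb", region_name="us-east-2")

    LOCK_TABLE = "terraform-locks"                        # assumption: your backend's lock table
    STATE_KEY = "my-bucket/envs/prod/terraform.tfstate"   # assumption: <bucket>/<key> of the state

    # Inspect the lock first -- the item records who took it and when.
    item = dynamodb.get_item(
        TableName=LOCK_TABLE,
        Key={"LockID": {"S": STATE_KEY}},
    )
    print(item.get("Item"))

    # Only after confirming nobody is actually running terraform:
    dynamodb.delete_item(
        TableName=LOCK_TABLE,
        Key={"LockID": {"S": STATE_KEY}},
    )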
TF makes API calls to the underlying cloud. If those hang, you'll have to wait for them to time out.
Whether TF can update the state & release its locks would depend on where those were hosted. If they're in the downed AZ, then ofc. it can't do that, and manual intervention will be required afterwards. I forget if you can make those objects regional when stored in AWS or not. (You can in some other storages.)
Fun fact, for a lot of providers, it'll hang on any error, not just cloud ones. I presume it's due to the gRPC communication mechanism and the terraform binary blocking until the provider answers "yes or no" to the request
Nothing is perfect, and there's probably a good reason for this behaviour… but it is rarely something that happens anyway. And you know, deleting a key for the state lock (one that explicitly tells you when and who created it) ain't that hard or that big of a deal.
I think any system is susceptible to problems like this if the underlying hardware becomes unavailable. Using DynamoDB to obtain locks on S3-stored state is a pretty common pattern in AWS development. This has more to do with AWS than Terraform.
Control+C (once!) is usually enough to cause it to abort without any ill effects to the state file. If it really got stuck and you have to kill it, then sure, you might have to mess with it a bit.
us-east-2 customer here, having a variety of "strange" issues including inability to reach an RDS database, other users in my firm having VPN reconnect trouble to that region.
I interviewed there a few months ago for DevOps, and one of the people I interviewed with said that most of Zoom was in AWS (they liked that I had AWS stuff on my resume).
I just set up a few small sites (not live yet) on us-east-2, because us-east-1 has a poor reputation. I wanted to avoid multi-region to keep things simple, but now I'm thinking I might have to spend the additional time on it. Not ideal when there's no dedicated ops.
It's a joke but I only knew that because Snap is/was (as of S1) hosted on GCP and not AWS. Crackle happens to be the name of a video on demand company.
It's a reference to the breakfast cereal of the same name.
Pedantic clarification for the unfamiliar: the breakfast cereal is named Rice Krispies while Snap, Crackle, and Pop are the names of the cartoon mascots on the box.
I wonder if this is done because people have a tendency or something to always create resources in 'A' (or some other AZ) and this helps spread things around.
And if I had read the page the link points to more carefully, I'd have seen that's exactly the reason.
Availability zones are not guaranteed to have the same name across accounts (ie. us-east-2a in one account might be us-east-2d in another). You would need to use the AZ-ID to determine if they are the same.
AWS availability zones (so like us-west-2b rather than us-west-2) are not the same between accounts. us-west-2b for you is something different than us-west-2b for everyone else.
I understand that us-east is AWS's oldest and biggest facility, but Amazon seems to have more money than Croesus, why aren't they fixing/rebuilding/replacing us-east with something more modern?
us-east is a geographic distinction within which there are multiple regions; us-east-1 and us-east-2 are not the same. This outage occurred in us-east-2. Within an AWS region there are multiple availability zones, each made up of one or more data centers. The availability zone use2-az1 was the one impacted, and within that availability zone, most likely only a subset of servers.
us-east-1 is the region you're thinking of that has issues. Mostly due to being the largest region (I think?) and like you mentioned, the oldest.
My first instinct would be to guess that something like this happened because of some intentional and well-meaning effort to upgrade some critical part of their infrastructure. Just my hunch given that it happened during the middle of the week in the middle of the day, and came back relatively quickly. The quick but not instantaneous bounce back has the hallmark of someone following a carefully laid out worst case contingency plan. I look forward to the postmortem.