AWS us-east-2 outage
336 points by jchen42 on July 28, 2022 | 247 comments
our alerts just went crazy, and we're having issues even logging in to the AWS dashboard

possibly related: https://news.ycombinator.com/item?id=32267154




Sorry all I jinxed it. Yesterday I was in a meeting and said "The only regional outages AWS has ever had were in us-east-1, so we should just move to us-east-2."

Now I guess we have to move to us-west-2. :)

Update: looks like it's only one zone anyway, so my statement still stands!


Stay in us-east-1, they provide Chaos Monkey for free. It's a feature.


I always say stay in use1 because almost everybody is there and when it's suffering any kind of outage so much of the internet is affected that it's no big deal that you're a part of the outage. People just go outside and get some air knowing it will be back up in a few hours, usually right around the time the AWS status page acknowledges that there is an issue.


After my own heart


Chaos Kong is the one that takes out whole regions. ;)


I think it's Chaos Cthulhu the way it's going recently...


us-rlyeh-1


Chaos Trump


I'm moving to us-weast-1


what kind of compass are ya reading lad?


I presume they are trying to express an extra cardinal dimension perpendicular to the plane. Deep underground in Bezos new evil lair perhaps?



Better to use us-nouth-3


The AWS status board, posted elsewhere in the comments, seems to think this is an AZ outage, not a regional one.

Edit: although, one of our vendors that uses AWS has said that they think ELB registration is impacted (but I don't recall if that's regional?) and R53 is impacted (which is supposed to be global, IIRC). Dunno how much truth there is to it as we don't use AWS directly.


AWS is notorious for underreporting and failing to report. They do not have a safe culture, and it's bad for your career if there is a major outage.


Please don't move to us-west. We are probably going to have an 11-point earthquake the next day.

Thanks!


In all seriousness, we've been deploying everything on us-west-2, and it seems to have dodged most of the outages recently. Is there something special about that data center?


Classically, us-east-1 received most of the hate given its immense size (it used to be several times larger than any other) and status as the first large aws data center. It also seemed to launch new aws features first but that may have been my imagination. If true, I'm sure always running the latest builds was not great for stability.

us-west-2 has had outages as well but it is less common, even rare. I've been pushing companies to make their initial deployments onto us-west-2 for over ten years now. I occasionally get kudos messages in my inbox :)


I believe us-east-1 runs some of the control plane, and a us-east-1 outage can effectively take a service in a different region offline because it can break IAM authentication.


93.99999


There are six 9s in there. Pretty solid!


Maybe Amazon should make us-east-1's actual datacenter change depend on the customer, as they do with the AZs :P


Doesn't AWS IoT run only in us-east-1?

And I think Alexa skills, if anybody cares about those.


It's never been a default datacenter. For a long time the default when you first logged into the console was us-east-1 so a lot of companies set up there (that's where all of reddit was run for a long time and Netflix too). At some point they switched the default to us-east-2.

So anyone who is in us-west-2 is there intentionally, which makes me assume there is a smaller footprint there (but I have no idea).


Rather the opposite - us-west-2 is big but not the biggest region, nor the smallest, oldest, or newest, and it's not partitioned off like the China or GovCloud regions. Because us-west-2 is fairly typical, it tends to be one of the last regions to get software updates, after they've been tested in prod elsewhere.


Looks like this particular issue was due to power loss, and for power us-west-2 has one clear advantage: its power comes directly from the Columbia river and is highly unlikely to have demand-based outages.


Maybe not the entire region. Amazon was reportedly building a data center complex next to the natural gas Hermiston Generating Plant some distance from the river.


If us-west-2 goes down in the next few days we’ll expect an explanation.


Naive question: don't people who care about resiliency have their services in more than one datacenter? or datacenter failure is considered such a rare event that's it's not worth the cost/trouble of using more?


AWS makes it pretty easy to operate in multiple AZs within a region (each AZ is considered a separate datacenter but in real life each AZ is multiple datacenters that are really close to each other).

That being said, there is still an added cost and complexity to operate in multiple AZs, because you have to synchronize data across the AZs. Also you have to have enough reserved instances to move into when you lose an AZ, because if you're running lean and each zone is serving 33% of your traffic, suddenly the two that are left need to serve 50% each.

The bigger companies with overhead reservations will get all the instances before you can launch any on demand during an AZ failure.


> each AZ is considered a separate datacenter but in real life each AZ is multiple datacenters that are really close to each other

For AWS specifically, I’m fairly certain they maintain a minimum distance and are much more strict on requirements to be on different grids etc than other Cloud providers. A few years ago they were calling out Azure and Google Cloud on exactly what you describe (having data centers essentially on the same street almost).


A single AZ may have neighboring datacenters, but they are very strict on having datacenters for different AZs be at least 100km apart and on different flood plains and power grids.


This should be at most 100km. The range is typically 60km-100km.


100km? Oh really?


https://docs.aws.amazon.com/sap/latest/general/arch-guide-ar...

Each Availability Zone can be multiple data centers. At full scale, it can contain hundreds of thousands of servers. They are fully isolated partitions of the AWS global infrastructure. With its own powerful infrastructure, an Availability Zone is physically separated from any other zones. There is a distance of several kilometers, although all are within 100 km (60 miles) of each other.


So at most 100km, not at least 100km.


I think you may have slightly misread. I think what’s being said is that a single logical AZ may actually be multiple physical datacenters in close proximity.


At least in eu-north-1 the three AZs are located in different towns, about 50 km apart (Västerås, Eskilstuna and Katrineholm).


Some people care about it but not enough to justify the added downsides - multi-data center is expensive (you pay per data center) and it’s complex (data sharding/duplication/sync).

If you're Amazon, where every second is millions of $ in transactions, you care more than a startup that has 1 request per minute. Even if you accept the risk, you still care when your DC goes down.

Also, a large chunk of AWS is managed from a single data center so if that one goes down you may still have issues with your service in another data center.


I'd consider using it, but the biggest roadblock for me is that I work in a regulated industry in Australia, and until AWS finishes their Melbourne region (next year maybe?) I'm stuck in one region because all private data needs to stay in Australia.

Also, I think a lot, but not all of the services I use work okay with multiple regions.

On top of that, I was looking at the documentation for KMS keys yesterday, and a KMS key can be multiregion, but if you don't create it as multiregion from the start, you can't update the multiregion attribute. So you need to create a new KMS key and update everything to use the new multiregion key.
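For anyone hitting the same thing, the rough shape of it from the CLI (key ID and regions here are just placeholders) is to create the key as multi-Region up front and then replicate it:

  $ aws kms create-key --multi-region --description "primary multi-Region key"
  $ aws kms replicate-key \
      --key-id mrk-1234examplekeyid \
      --replica-region us-west-2

As I understand it, the replica ends up with the same key ID and key material, which is what lets you decrypt in the other region without re-encrypting everything.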


AWS works with multiple availability zones (AZ) per region, some products by default deploy in several ones at the same time, while others leave it up to you.


AWS makes it trivially easy to distribute across more than one datacenter... The only time that outages make the news is when they all fail in a region.


Jinxed for sure. I refuse to deploy resource into us-east-1 unless required by the service.


The fact that so many popular sites/services are experiencing issues due to a single AZ failure makes me think that there is a serious shortage of good cloud architects/engineers in the industry. It would be one thing if this was a Regional failure, but a single AZ failure should not have any noticeable effect.


For most businesses a little down time here and there is a calculated risk versus more complex infrastructure. You can’t assume all the cloud architects are idiots — they have to report their task list and cost of infrastructure to someone who can give feedback on various options based on comparative resource requirements and risks.

Zone downtime still falls under an AWS SLA so you know about how much downtime to accept and for a lot of businesses that downtime is acceptable.


This. People working in IT naturally think keeping IT systems up 100% of the time is most important. And depending on the business it often is, but it all costs money. Running a business is about managing costs and risks.

- Is it worth to spend 20% more on IT to keep our site up 99.99% vs 99%?

- Is it worth to have 3 suppliers for every part that our business depends, with each of them being contracted to be able to supply 2x more, in case other supplier has issues? And pay a big premium for that?

- Is it worth to have offices across the globe, fully staffed and trained to be able to take on any problem, in case there's big electrical outage/pandemic/etc in other part of the world?

I'm not saying that some of those outages aren't results of clowny/incompetent design. But "site sometimes goes down" can be often a very valid option.


I've had some interesting discussions about this with a bunch of representatives of our larger B2B customers. Interestingly enough, to them, a controlled downtime of 2-4 hours with an almost guaranteed success is preferable, compared to a more complex, probably working zero-downtime effort that might leave the system in a messed up - or not messed up - state.

To them it's much easier to communicate "Hey, our customer service is going to be degraded on the second Saturday in October, call on Monday" to their customers 1-2 months in advance, prepare to have the critical information available without our system, and have agents tell people just that.

This has really started to change my thoughts on how to approach, e.g., a major postgres update. In our case, it's probably the better way to just take a backup, shut down everything, do an offline upgrade, and rebuild & restore if things unexpectedly go wrong. We can totally test the happy case to death, and if the happy case works, we're done in 2 hours for the largest systems with minimal risk. 4 hours if we have to recover from nothing, also tested.

And you know, at that point, is it really economical to spend weeks to plan and weeks to test a zero downtime upgrade that's hard to test, because of load on the cluster?


At least in my experience, AWS downtime also only accounts for a minor share of the total downtime; the major sources are crashes and bugs in the application you're actually trying to host. Being completely HA and independent of AZ crashes/bugs is extremely hard and time intensive, and usually not worth it compared to investing that time to get your app to run smoothly.


Yes, but when someone else causes your downtime it's fun to sit around and snipe at them.


I think a good trade off, if your infra is in TF, is to be able to run your scripts with a parameterized AZ/region. That way you can reduce the downtime even more at a fraction of the cost. (assuming the services that are down are not the base layers of AWS, like the 2020 outage)
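Concretely that's just something like (assuming your config actually declares the variable and doesn't hardcode region-specific ARNs):

  $ terraform apply -var 'aws_region=us-west-2'

which is cheap to set up, though of course the data layer still has to exist in the target region for this to help.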


If you can get the data out of the downed AZ, don't have state you need to transfer and are not shot in the foot once the primary replica comes online again. I've rarely deployed an app where it was as easy as just to change a region variable.


Yeah the data stores are the ones that I would always keep multi AZ no matter what. Everything else is stateless and can be moved quickly.


Write an article on that because you make it sound simple. Or better yet, start a company that configures this for companies.


Nothing is inherently simple.

Depending on the size of the company it can be simple or hard. Most companies that need this are not huge. Things like RDS, ElastiCache, ECR and Secrets Manager have multi-AZ integrated, so it's not hard to do. If you operate on ECS or EKS it's pretty straightforward to boot up nodes and load balancers in another AZ.

Maybe you have a system that requires more hands on work and want to explain your point of view? I don't appreciate the snarky responses tho.


Yeah, makes sense if explicitly stated. Not everything is worth the money.

However, in my experience, the people doing the calculations on that risk have no incentive to cover it. Their bonus has no link to the uptime and they can blame $INFRA for the lost millions and still meet their targets and get promoted / crosshired.

The people who warned them and asked for funding are the ones working late and having conf calls with the true stakeholders.


This is true, but I think it would be more acceptable if the region were down vs the single AZ


Considering almost all of the services are multi-zone, it's not hard to add in a couple of lines to make them resilient against this.

People are just unaware, and probably making bad calls in the name of being "portable".


If your application and infra can magically utilize multiple zones with “a couple lines”… then I would say you are miles ahead of just about every other web company.


> you are miles ahead of just about every other web company.

I'm curious who these web companies are.

Use something like Lambda and you get multi-az for free.

https://docs.aws.amazon.com/lambda/latest/dg/security-resili...

Dynamo is another service that wouldn't be impacted as it is multi-az.

Getting postgres RDS multi-region would require the extra couple of lines in your CDK, but is fairly straightforward.
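If you'd rather see it without CDK, the rough CLI equivalent (identifiers and instance class here are made up) is a cross-region read replica created in the target region from the source instance's ARN:

  $ aws rds create-db-instance-read-replica \
      --region us-west-2 \
      --db-instance-identifier mydb-replica \
      --source-db-instance-identifier arn:aws:rds:us-east-2:123456789012:db:mydb \
      --db-instance-class db.t3.medium

Encrypted instances also need a --kms-key-id for the destination region, IIRC.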


Today, a SaaS I’m familiar with that runs ~10 Aurora clusters in us-east-2 with 2-3 nodes each (1 writer, 1-2 readers) in different AZs had prolonged issues.

At least 1 cluster had a node on “affected” hardware (per AWS). Aurora failed to failover properly and the cluster ended up in a weird error state, requiring intervention from AWS. Could not write to the db at all. This took several hours to resolve.

All that to say that it’s never straightforward. In today’s event, it was pure luck of the draw as to whether a multi-AZ Aurora cluster was going to have >60 seconds of pain.

That SaaS has been running Aurora for years and has never experienced anything similar. I was very surprised when I heard the cluster was in a non-customer-fixable state and required manual intervention. I’ve shilled Aurora hard. Now I’m unsure.

Thank goodness they had an enterprise support deal or who knows if they’d still have issues now.


It's that easy for a lot of managed services.

Want GKE to run multi-zone, or Spanner to run multi-region, just check a box (and insert coin).


Or how about "I'm fully aware, I've done the math taking into account both cost and complexity of implementation and cost of downtime, and I'm probably making fantastic calls based on my actual needs."


If you had "done the math" then you would have gone serverless and gained multi-az for free, as it is almost always the cheapest option.


This has quickly grown to more than adding in a couple of lines! Now I need to architect my legacy app so that I can deploy into lambdas, then I can get resiliency I don't really need!

Not all systems require high availability. Some systems are A-OK with downtime. Sometimes, I'm perfectly fine with eventual consistency. You really do have to look at the use-cases and requirements before making sweeping statements.


I thought we were talking about cloud architects making poor decisions when designing solutions.

Where did legacy apps come from?

> Some systems are A-OK with downtime.

And those ones would not have cared about this outage. Your point isn't that clear.


No, we were talking about architects making decisions that you characterised as poor. I was pointing out that your statement was over-general and that there are many instances where making the informed decision to ignore HA is a completely reasonable thing to do.

By your last sentence, it appears you agree with me.

If you meant to say that your statement only applies to cloud architects who are attempting to maintain an uptime SLA with multi-az/region redundancy, then sure, AWS has lots of levers you can pull and those complaining really should spend some time studying them.

As for legacy applications, I would not have brought them up at all if you hadn't suggested pushing things into lambdas as a solution to multi-az. Once again, there are many many situations where this is not appropriate. Not everything is greenfield, and re-architecting existing applications in an attempt to shoehorn them into a different deployment model seems a bit much. Unless I'm misunderstanding what you meant.


Right, because magically serverless is the right answer for every application.


It gives me a bad gut feeling when you imply that multiple instances of a service is more complex than a single instance which cannot be duplicated easily.

I also disagree that it is inherently more costly to run a service in multiple locations.


Of course it's more costly: you need to keep state in sync between locations, so by definition there's more infra to pay for.

It's not just a single instance too, there's generally a lot more infrastructure (db servers, app servers, logging and monitoring backends, message queues, auth servers... etc)


Also, people who can configure and maintain that infrastructure. It is more complicated, and it does require a different sort of person.

(And checkbox-easy is sweeping edge cases and failure modes under the rug)


also inter region replication costs bandwidth money


Lots and lots of money.


How do you NOT pay more for running double of everything + load balancers?


You do not need to pay double for everything, that might have been true with traditional VPS providers but it is not the way it works with cloud services. You decide on what kind of failure you're willing to tolerate and then architect based on those requirements (loss of multiple AZ's, loss of a region, etc..).

Let's say your website requires 4 application servers, you can then tolerate a single AZ failure by using 5 application servers and spreading them among 5 AZs.


If you already have 4 application servers you are probably already AZ tolerant; most people concerned about "doubling everything" are only running 1 instance.

Going by your example, if your website requires 1 application server, then to tolerate a single AZ failure you have to double the number of application servers.

Example - we have a service that used Kafka in the affected region that went down. Our primary kafka instance (R=3) survived but this auxiliary one failed and caused downtime. There's no way around this other than doubling the cost.


In most cases the elephant* in the room is your DB - it doesn't matter where your stateless application servers are, if your stateful DB goes down you're in trouble. It's also often 1) the hardest to replicate, as replication involves tradeoffs - see CAP theorem & co and 2) the most expensive, since it needs to be pretty beefy in terms of CPU, RAM and IO - all very expensive on AWS.

*: https://commons.wikimedia.org/wiki/File:Postgresql_elephant....


That's true, when only dealing with 1 server, you technically double the cost by adding a second server. My original comment was about "popular sites/services", that should be able to tolerate the costs and are most likely dealing with multiple servers.

For a single server deployment you can still reduce your downtime (with minimal costs) by having the ASG redeploy into another AZ on a failed health check.
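Concretely that's just an ASG with min/max/desired of 1, subnets in more than one AZ, and ELB health checks, something like this (launch template, subnets and target group here are placeholders):

  $ aws autoscaling create-auto-scaling-group \
      --auto-scaling-group-name web-asg \
      --launch-template LaunchTemplateName=web-template \
      --min-size 1 --max-size 1 --desired-capacity 1 \
      --vpc-zone-identifier "subnet-aaaa1111,subnet-bbbb2222" \
      --target-group-arns arn:aws:elasticloadbalancing:us-east-2:123456789012:targetgroup/web/abc123 \
      --health-check-type ELB \
      --health-check-grace-period 120

When the instance's AZ has problems and the health check fails, the ASG terminates it and launches a replacement in whichever of its subnets is still healthy.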


Those stateless app servers are the easy part. But you need to be replicating the data, with all the cost and complexity decisions that comes with it.


You should get into the database business. A lot of money to be made there if things are so trivial for you.


The sound of crickets is deafening!


I’m sorry about your feelings but you are wrong.

It's more expensive to have more things, and it's more expensive to have more complicated things. And things that can fail over are inherently more complicated.


A multi-az deployment is a checkbox in most AWS services, e.g. ASGs, RDS, load balancers, etc. Someone didn't check that box because they didn't know about it, there isn't much complexity in it.


Aren't multi-az deployments more expensive? That would be a valid reason not to check this checkbox, if your business can survive a bit of downtime here and there.


Most of that expense is just the cost of a hot failover, but there is some additional cost around inter-AZ data transfer. If someone is not checking the boxes for cost reasons, I would be surprised if they had failovers in the same AZ. It seems more likely they just don't have failovers.


A checkbox that might 3-4x the cost.


Multi-AZ brings extra complexity in terms of data duplication and consistency. If your app wasn't designed to handle those kinds of scenarios and high user loads, then you are in for a lot of problems.

Designing for those scenarios increases complexity and cost, constrains your architecture style, and most of the time brings you into microservices territory, where most companies lack experience and just follow best practices in a field where engineers are expensive and few.


RDS just has a button for multi-AZ primaries. No complexity or microservices.
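Or one flag from the CLI (instance identifier here is a placeholder); it kicks off the standby build in the background:

  $ aws rds modify-db-instance \
      --db-instance-identifier mydb \
      --multi-az \
      --apply-immediately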


ok lets say you have a master in one AZ and it dies. what happens?


Automatic failover within 60 seconds.


OK, then what happened to your transactions that were in the middle of a process? The ones that were committed only on one side?


> The fact that so many popular sites/services are experiencing issues due to a single AZ failure makes me think that there is a serious shortage of good cloud architects/engineers in the industry.

Not really.

What's more likely is that their companies have other priorities. Multi-AZ architectures are more expensive to run, but that's normally not the issue. What's really costly is testing their assumptions.

Sure, by deploying your system in a Kubernetes cluster spread across 3 AZs with an HA database you are supposedly covered against failures. Except that when it actually happened, it turned out your system couldn't really survive a sudden 30% capacity loss like you expected, and the ASG churning is now causing havoc with the pods that did survive.

Complex systems often fail in non-trivial ways. If you are not chaos-monkeying regularly, you won't know about those cases until they happen. At which time it's too late.


Or, the redundancy actually causes a failure, so not only have you spent more money but you’ve reduced your availability doing so.

(Or worse, the redundancy causes a subtle failure like data loss.)


Nail on the head. The number of times I've seen way overcomplicated redundancy setups which fail in weird and wonderful ways, causing way more downtime than a simpler setup, is pretty silly.


Don’t make the mistake of overromanticizing the simple solutions. They have nice, well understood failure conditions, and they come up relatively frequently.

When you start playing the HA game, the easy failures go off the table, and things break less often because “failures happen constantly and are auto-healed”. But when your virtual IP failover goes sideways or your cluster scheduler starts reaping systems because the metadata service is giving it useless data, you’re well into an infrequent, complex failure, and I hope you have a good ops team.

It’s always a trade off.


It's not so cut-and-dried. The AZ isolation guarantees are not quite at the maturity they need to be.

If you're using any managed services by AWS, you need to rely on their own services to be AZ fault-tolerant. In AWS speak, they may well be (just with elevated error rates for a few minutes while load balancing shifts traffic away from a bad AZ). But as an AWS customer, you still feel the impact. As an example, one of our CodePipelines failed the deployment step with an InternalError from CloudFormation. However, the actual underlying stack deployment succeeded. When we went to retry that stage, it wouldn't succeed because the changeset to apply is no more. It required pushing a dummy change to unblock that pipeline.

Similarly, many customers run Lambdas outside of VPCs that theoretically shouldn't be tied to an AZ. You're still reliant on the AWS Lambda team to shift traffic away from a failing AZ, and until they do that, you'll see "elevated error rates" as well.


I have 2 takes on this:

1) AWS is already really expensive, just on a single AZ. Replicating to a second AZ would almost double your costs. I can't help but bring up the point that an old-school bare-metal setup on something like Hetzner/OVH/etc becomes significantly more cost-effective since you're not using AWS's advantages in this area anyway (and as we've seen in practice, AWS is nowhere near more reliable - how many times have AWS' AZs gone down vs the bare-metal HN server which only had its single significant outage very recently? - it makes sense considering the AWS control plane is orders of magnitude more complex than an old-school bare-metal server which just needs power and a network port).

2) It is extremely hard to build reliable systems over time (since during non-outage periods, everything appears to work fine even if you've accidentally introduced a hard dependency on a single AZ), and even harder to account for second-order effects such as an inter-AZ link suddenly becoming saturated during the outage. I'm personally not confident at all in Amazon's (or frankly, any public cloud provider's) ability to actually guarantee seamless failover, since the only way to prove it works is to have a real outage that induces those second-order effects, and that is something AWS or any other cloud provider isn't going to do on purpose: an intentional, regularly-scheduled outage for testing would hurt anyone who deliberately doesn't use multiple AZs, essentially pricing them out of the market by forcing them to either commit to the cost increase of multi-AZ or move to a provider who doesn't do scheduled test outages.


Going bare-metal is a premature optimization. Most startups that go that route don't survive long enough to make use of this optimization.

Take advantage of AWS (or Azure, or DO) until you're big enough that bringing the action in-house is a financially and technically prudent option.


It’s premature when it’s premature. It’s late when it’s not.


As some others have alluded to, it seems common AWS services (the ones you rely on to manage multi-AZ traffic like ALBs and Route53) spike in error rate and nose dive in response time so it becomes difficult to fail things over. On top of that, services like RDS that run active hot standby then rely on those to fail over so it's difficult to get the DB to actually fail over.

I suspect, behind the scenes, AWS fails to absorb the massive influx in requests and network traffic as AZs shift around.

I would think regions with more AZs (like us-east-1) would handle an AZ failure better since there's more AZs to spread the load across

What's more surprising, imo, is the large apps like New Relic and Zoom that you'd expect to be resilient (multi region/cloud) taking a hit


Architect here. We had an outage and we have a very complete architecture. The issue is, the services were still reachable via internal health checks. So instead of taking the affected servers out of service they stayed in.

We had to resolve it by manually shutting down all the servers in the affected AZ. Which is normally not needed.

There are of course a lot of companies that aren't architected with multi-AZ at all (or choose not to be). Those companies are having an even worse time right now. But because the servers generally still appeared healthy, this can affect some well-architected apps also.

Only reason we knew to shut them down at all was because AWS told us the exact AZ in their status update. We were beginning the process of pinging each one individually to try to find them (because again, all the health checks were fine).


Yup, exact same here. All of the multi-AZ failover depends on AWS recognizing that their AZ is having an issue, and they never reported having an issue on any health-check so no failover ever happened. We started being able to make progress when AWS told us which AZ was having issues. It still took some time for us to manually shift away from that AZ (manually promoting ElastiCache replicas to primary, switching RDS clusters around, etc.) because all of the AWS failover functionality did not function as it should have and we were relying on that. Multi-region failover would have made us more fault tolerant but our infrastructure wasn't set up for that yet (besides an RDS failover in a separate region). Here's to hoping we never have a Route53 or global AWS API Gateway failure! Then even multi-region will not do us much good. Perhaps we should have some backup servers on the moon, then in case of nuclear warfare we can still be online via satellite.
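For anyone curious, "manually shift away" was mostly commands along these lines, from memory (identifiers are placeholders):

  $ aws rds failover-db-cluster \
      --db-cluster-identifier prod-cluster \
      --target-db-instance-identifier prod-cluster-reader-2
  $ aws elasticache modify-replication-group \
      --replication-group-id prod-redis \
      --primary-cluster-id prod-redis-002 \
      --apply-immediately

i.e. forcing the writer / primary onto nodes outside the bad AZ ourselves, since the automatic failover never triggered.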

P.S. AWS has said they have resolved the issue for almost 2 hours now and we are still having issues with us-east-2a.


Which internal health checks are you referring to?


Both the EC2 instance health and our HTTP health checks. If either of those failed the server would have been removed from the load balancer, but they didn't fail.

Only the external health checks that hit the system from an outside service were failing. And because those spread the load across the AZs, only a fraction of them were failing and there was no good way to tell the pattern of failure.

I did have some Kubernetes pods become unhealthy but only because they relied on making calls to servers that were in a different AZ.
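In hindsight, an external check wired into routing would have caught it sooner, e.g. a Route 53 health check per AZ-facing endpoint (domain and path here are hypothetical) attached to failover or weighted records:

  $ aws route53 create-health-check \
      --caller-reference az1-check-2022-07-28 \
      --health-check-config Type=HTTPS,FullyQualifiedDomainName=az1.example.com,ResourcePath=/healthz,RequestInterval=30,FailureThreshold=3

That way traffic gets pulled from the AZ even when the in-AZ checks still look green.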


that tracks with our experience as well


It's always more complicated than just deploying EC2 instances into multiple-az's. Here are some things I noticed from today's events.

First: RDS. I saw one of our RDS instances do a failover to the secondary zone because the primary was in the zone that had the power outage. RDS failovers are not free and have a small window of downtime (60-120s as claimed by AWS[1]).

Second: EKS (Kubernetes). One of our Kubernetes EC2 worker nodes (in EKS) went down because it was in the zone with the power outage. Kubernetes did a decent job at re-scheduling pods, but there were edge cases for sure. Mainly with Consul and Traefik running inside of the Kubernetes cluster. Finally, when the Kubernetes EC2 worker node came back up, nearly nothing got scheduled back to it. I had to manually re-deploy to get pod distribution even again. Though the last issue might be something I can improve on by using the new Kubernetes attribute topologySpreadConstraints[2].

[1] https://aws.amazon.com/premiumsupport/knowledge-center/rds-f... [2] https://kubernetes.io/docs/concepts/scheduling-eviction/topo...
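Checking and fixing the skew, by the way, was just (deployment name is a placeholder):

  $ kubectl get pods -o wide --sort-by=.spec.nodeName
  $ kubectl rollout restart deployment/my-app

i.e. eyeball which nodes the pods landed on, then restart the rollout so the scheduler spreads them out again; topologySpreadConstraints should make that second step unnecessary next time.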


It's a game theory thing. If everyone stays single AZ, everyone goes down at the same time so nobody gets blamed. Somehow the blame falls on AWS instead!


I think you're confusing availability zones with regions in this comment.

AWS AZs don't even have consistent naming across AWS accounts.


That's a feature, not a bug. If we all had the same zone "a", then things would not be loaded anything close to evenly. There is a command to find out the unique zone ID for your particular zones under your account's naming.


Clarification: 1/3 of sites will go down (those using the AZ that went offline), but my point is the same. Most companies aren't using multiple AZs, let alone multiple regions.


Best take lol


I don't think there's a shortage of people who can architect reliable services. I think companies simply put reliability on the back burner because it rarely bites them. It's the same reason technical debt is so rarely paid off.


> technical debt is so rarely paid off.

It's not debt if you don't have to pay for it -- and if the ongoing costs of whatever it is are relatively insignificant.


But technical debt bites you in every new feature by slowing new code addition.


There is a shortage of good cloud engineers, but even if there were more of them, the business doesn't give a crap about brief outages like this. Blame it on AWS and move on, business as usual. Even if they did care, the business is often too incompetent to understand that they could easily prevent these things. And even if they did realize it, they don't want to prioritize it over pushing out another half-baked feature, making sales, getting their bonus.


Multi-AZ architecture at least doubles the cost, and it tends to cost even more if the business is small. Good engineers find the balance between cost and availability.


No that is not correct, it is not double the cost, please see my reply above.


Salaries are a cost.


To an investor, a salary is a temporary cost, i.e. you pay the salary, get the TF scripts made, fire the employee; whereas checkbox-driven, managed resiliency is going to cost you forever with no hope of ever eliminating that cost.

At least that's what was recently told to me by my manager to explain why my employer prefers to hire people to self-manage the AWS infra.


Wait. Are you saying that while AWS maintains multiple AZs they can’t maintain reliability on the failover systems between them?


Did you, by chance, reply to the wrong comment? Don’t think I said anything about failovers etc.

The point made to me was that a devops role can be made to eventually automate their own job away to an extent. To an investor, having a devops role on staff is acceptable.

If you never had a devops role and used AWS managed services, you can’t automate that and trim costs.

I.e., devops roles look like surplus in the system if they’re doing a worse job than managed services but to certain audiences, that surplus is necessary. So, if you’re looking to fundraise and your business has tight margins, don’t be too hasty to move to managed services.


I did reply to the wrong comment, apologies!


Replicating a huge database between AZs, let alone regions, can be an enormously expensive ongoing operational cost. Not everyone can afford it.


Assuming you're using RDS then multi-AZ deployment is just a simple configuration option. If you're using Aurora then it is handled automatically and is even less expensive.


don't all the multi-AZ deployments imply at least 1 standby replica in a different AZ?


Aurora can replicate the data but doesn't have to keep a hot standby AFAIUI. You can then start a new instance in a different az but the process is semi manual.


Yes, my point was that it is not complex to setup and maintain, but it is not free.


I can tell you from experience that the cloud architects are world class and it's actually the data techs that are the problem. Amazon doesn't value data center techs, they don't pay competitively and hire techs that barely have enough skill so they can pay them nothing. Then they metric the fuck out of the teams so that everyone focuses on quick fixes instead of taking the time to troubleshoot long-term persistent issues. Couple this with the fact that management is only concerned with creating new capacity instead of fixing existing capacity.


Will need to see the post-mortem, when us-east-1 had its last big outage multiple AZs were working, but cross AZ functionality (lambda, event bridge) were impacted... which made recovery problematic.


Not looked into it too closely yet, but for us it looks like there were also issues connecting between the two remaining AZs in our 3-node cluster.


we definitely had issues with all of the AZs in east-2, and far more services impacted than just EC2 (e.g. RDS and elasticache were intermittently down for us)


Both RDS and elasticache run on EC2. But both of them have Multi-AZ options.


sure, just saying that only EC2 instances were impacted is disingenuous at best.

all of our production services are multi-az as well


My take is that so many sites are broken, maybe I shouldn't care either. The extra complexity of dealing with high availability is something that probably isn't worth it for my project. Spend more time on features instead.


Companies don’t want to pay for in house architecture/etc and developers are generally ultra hostile towards ops people.


Update from AWS: they lost power to (part of?) a single DC in the use2-az1 availability zone.

10:25 AM PDT We can confirm that some instances within a single Availability Zone (USE2-AZ1) in the US-EAST-2 Region have experienced a loss of power. The loss of power is affecting part of a single data center within the affected Availability Zone. Power has been restored to the affected facility and at this stage the majority of the affected EC2 instances have recovered. We expect to recover the vast majority of EC2 instances within the next hour. For customers that need immediate recovery, we recommend failing away from the affected Availability Zone as other Availability Zones are not affected by this issue.


Interesting to see it's been a loss of power that caused this. Usually the better datacenters have multiple levels of power redundancy including emergency backup generators.


It depends entirely on how AWS architected their power redundancy. Given that the outage affected a portion of one DC in one AZ, we can make some assumptions, but the truth is we just don't know.

It could be that their shared-fate scope is an entire data hall, or a set of rows, or even an entire building given that an AZ is made up of multiple datacenters. I don't know that AWS has ever published any kind of sub-AZ guarantees around reliability.

Datacenter power has all kinds of interesting failure modes. I've seen outages caused by a cat climbing into a substation, rats building a nest in a generator, fire-fighting in another part of the building causing flooding in the high-voltage switching room, etc.


Our best was a bird landing on a transformer up on a pole. Installed a fake Eagle after that.


Given the scope of the effort invested in attempting to prevent duck and goose crap on the world's docks, I'm skeptical that this tactic is effective.


Shrug... the datacenter is land locked (different animal species) and the problem hasn't happened again in multiple years.

I think you're taking the Eagle a bit too seriously though... if we didn't do anything how would we know? It isn't like this was an expensive thing to try out.


OK. It's just that I am one of those people who have tried to solve the duck/goose problem and would be delighted if a fake eagle or owl worked.


Insert clip of O'Brien explaining to the Cardassians why there are backups for backups


In case anyone is unaware of the reference, that’s taken from Star Trek Deep Space 9 https://youtu.be/UaPkSU8DNfY


Yeah, I am surprised by this too. Like, you have one job: keep the power on.


I thought that AWS availability zones were intentionally not canonically named to prevent everyone from adding stuff to AZ “A”. So my us-east-1 zone “A” might be your “B”.

But that system breaks down here when you need to know whether you are in an affected zone. Is there a way to map an account’s AZ name to the canonical one which apparently exists?


They gave up on that, now there's an extra "zone ID" you can read that maps to an absolute address. They used to be extremely cagey about giving out those mappings for your account.

Examples about how these relate: https://stackoverflow.com/questions/63283340/aws-map-between...

I'm also pretty sure that GCP's identifiers are absolute (and this time, throughout) as well, since their documentation (which renders the same in incognito mode or whatever) makes reference to what zones have what microarchitectures and instance types.


The mapping from AZ Name (account-specific) to AZ ID (global) shows up on the EC2 overview page in the dashboard.



This is true, at the AWS account level. us-east-2a for my account may map to the internal use2-az1, but in your account us-east-2a may map internally to use2-az2.


And I thought us-east-2 is the way to escape us-east-1's problems.


Haha literally had this same thought. us-east-2 is our default region for most stuff and so far that's been good. I think this is the first AWS downtime in last couple years that hit our systems directly


  $ dig news.ycombinator.com
  ;; ANSWER SECTION:
  news.ycombinator.com.   1       IN      A       50.112.136.166

  $ dig -x 50.112.136.166
  ;; ANSWER SECTION:
  166.136.112.50.in-addr.arpa. 300 IN     PTR     ec2-50-112-136-166.us-west-2.compute.amazonaws.com.
saving couple keypresses just in case


I find this spreadsheet handy for thinking about AWS region-wide outages and frequency. This seems to be the first major us-east-2 outage, indeed, vs us-east-1 and other regions.

https://docs.google.com/spreadsheets/d/1Gcq_h760CgINKjuwj7Wu... (from https://awsmaniac.com/aws-outages/)


On the bright side, this is certainly a good test for "exactly how resilient are our AWS-based systems to the loss of a single availability zone?"


Very resilient, provided that its not the AZ I'm using.


Ahh

Always check HN before trying to diagnose weird issues that shouldn't be connected


Living a bit more dangerously at the moment as HN is still running temporarily on AWS. (I'd link to the threads about this from a few weeks ago but am on my phone ATM.)


I did notice it being a little slow but I'm also on 4G at the moment (it got the blame)


And the reason that works is because HN is mostly hosted on its own stuff, without weird dependencies on anything beyond "the servers being up" and "TCP mostly working."


I believe it's on AWS after its two servers broke at the same time the other day.


Yep

  $ host news.ycombinator.com
  news.ycombinator.com has address 50.112.136.166
  
  $ host 50.112.136.166
  166.136.112.50.in-addr.arpa domain name pointer ec2-50-112-136-166.us-west-2.compute.amazonaws.com.


Temporarily, yes.


Too bad that is the best we have!


Just lost my email provider (https://status.postmarkapp.com/incidents/240161) to this and I'd bet my services are degraded/down. I know it's never a "good" time for an outage but this sure does suck for me right now. We've got an event this weekend and people can't sign up right now to buy tickets/etc.


AWS reported the outage here : https://health.aws.amazon.com/health/status


> Severity

> Informational

lol


Three things are certain: death, taxes, and useless cloud status dashboards.


Lies, damned lies, and self-reported statistics that affect a company's SLA refund liability.


And SE/SRE reviews


I dunno how else to put it. Having EVERYTHING on AWS is a national security threat.

This isn't good, and someone who can do something about it needs to.


Good thing we don't have EVERYTHING on AWS, so no threat detected.


Our own applications are hosted on Azure, but we had an outage today anyway. It was because apparently Netlify and Auth0 use AWS and went down, which took down our static sites and our authentication.

The nature of our business means it wasn't a big deal, but I could imagine lots of people were in the same boat.


There is no "someone" who could do anything about this.


Yeah, let's place everything in large colos instead. Those never fail, right?


But the colos aren't usually managed by a single control plane controlled by a single company, so while they can all fail, they will generally do so independently.


Having _everything_ on a single AZ of AWS is, indeed, a problem.

Having everything well-architected on AWS is...well, it's a problem for reasons of monopoly and cost, but it's not a problem for availability.


Certain "global" AWS services depend on a specific region (us-east-1 if I recall right) so if that goes down you could be in trouble.


Suppose it's time to set up multi-AZ and pay to insure against AWS' own failures. I don't know why I previously thought their EC2 uptime claims were sufficient. Lesson learned.


Are you sure you understand their uptime claims? They offer a 99.99% SLA for regional availability, but only 99.5% for individual instances (and even then, they only owe you a 10% service credit for affected instances)

https://aws.amazon.com/compute/sla/

99.5% availability allows up to about 3 and a half hours of downtime a month. 99.99% means around 4 minutes a month. So if you can't handle hours of downtime, you should definitely be multi-AZ.


What are you hosting? Multi-AZ seems like a bare minimum for basic reliability. That said it's not a panacea. There's all sorts of cascading/downstream "weirdness" that can result on AWS's own services through the loss of an AZ.


Multi-AZ is a requirement on production level loads if you cannot sustain prolonged downtime.

Datacenters do end up completely dying now and then, you really want to have a good strategy in that case. Or not, if that's not required.



[10:25 AM PDT] We can confirm that some instances within a single Availability Zone (USE2-AZ1) in the US-EAST-2 Region have experienced a loss of power. The loss of power is affecting part of a single data center within the affected Availability Zone. Power has been restored to the affected facility and at this stage the majority of the affected EC2 instances have recovered. We expect to recover the vast majority of EC2 instances within the next hour. For customers that need immediate recovery, we recommend failing away from the affected Availability Zone as other Availability Zones are not affected by this issue.


Funny I failed away from the zone and RDS still doesn't work, connections fail.

Edit: 2 minutes after I post this it starts working.


"[10:11 AM PDT] We are investigating network connectivity issues for some instances and increased error rates and latencies for the EC2 APIs within the US-EAST-2 Region."


Thanks for sharing. We've just spent the last hour debugging our website, thinking we had issues. This explains it.


Interestingly, we saw a bunch of other services degrade (Zoom, Zendesk, Datadog) before AWS services themselves degrade.


We have always been at war with us-east-2.


I'm running Terraform and it appears to be stuck now. What do I do??


Depends what it’s stuck doing, but you might ctrl-c it and later manually unlock the state file (by carefully coordinating with colleagues and deleting the dynamo DB lock object if you’re using the s3 backend) when the outage is over.
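There's also a supported path for the unlock part if you'd rather not touch DynamoDB by hand: take the lock ID from the error message and run

  $ terraform force-unlock LOCK_ID

once you're sure no other apply is actually still running (LOCK_ID here is whatever the error printed).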


Thanks, this comment made it very clear to me that I never want to touch a terraform system.


TF makes API calls to the underlying cloud. If those hang, you'll have to wait for them to time out.

Whether TF can update the state & release its locks would depend on where those were hosted. If they're in the downed AZ, then ofc. it can't do that, and manual intervention will be required afterwards. I forget if you can make those objects regional when stored in AWS or not. (You can in some other storages.)

… what would you expect to happen here?


Fun fact, for a lot of providers, it'll hang on any error, not just cloud ones. I presume it's due to the gRPC communication mechanism and the terraform binary blocking until the provider answers "yes or no" to the request


Nothing is perfect, there's probably a good reason for this behaviour … but it is rarely something that happens anyway. And you know, deleting a key for the state lock (one that explicitly tells you when and who created it) ain't that hard or that big of a deal.


I think any system is susceptible to problems like this if the underlying hardware becomes unavailable. Using DynamoDB to obtain locks on S3-stored state is a pretty common pattern in AWS development. This has more to do with AWS than Terraform.


Rarely is Terraform mentioned in any other context.


Control+C (once!) is usually enough to cause it to abort without any ill effects to the state file. If it really got stuck and you have to kill it, then sure, you might have to mess with it a bit.


Wait


us-east-2 customer here, having a variety of "strange" issues including inability to reach an RDS database, other users in my firm having VPN reconnect trouble to that region.


FWIW, most of our issues just resolved ~1 minute ago. We'll see if it remains stable.


Zoom is having connectivity issues. Nothing on https://status.zoom.us/ page yet.


Did anyone's ECR endpoints go out during the outage? We've had timeouts while pulling images onto our k8s cluster post-restart.


Wonder if this is why Zoom is down. Wasn't able to connect just now. The connection proxy/sites were giving 504s.


IIRC Zoom signed up with Oracle Cloud when COVID hit and they needed to scale like crazy.

https://www.oracle.com/customers/zoom/

I'm not sure if Zoom has any Critical infra in AWS though.


I interviewed there a few months ago for DevOps, and one of the people I interviewed with said that most of Zoom was in AWS (they liked that I had AWS stuff on my resume).


WebEx too


Okta is degraded as well.


I just set up a few small sites (not live yet) on us-east-2, because us-east-1 has a poor reputation. I wanted to avoid multi-region to keep things simple, but now I'm thinking I might have to spend the additional time on it. Not ideal when there's no dedicated ops.


Same. Our EC2 instances can't connect RDS and just got 500 errors on the dashboard.


Can confirm based on what Metrist is seeing. Looks to be a larger issue in the US East, also seeing Cloudflare and Datadog with issues.


I've somehow dodged region outages on AWS for years, and here's my first one. So many alerts firing off in unexpected ways.


Same here - we were finally able to log in to the console, but we're in us-east-2 and are having a ton of issues.


Tons of issues in us-east-2 here as well


Personally I only saw intermittent failures throughout. Rather minor production impact as far as AWS outages go.


No wonder my error logs were clean. I can get into my hosts, but my LB isn't routing. Sad face.


Lots of issues in us-east-2 for instances for us but also other regions when connecting to RDS


Lots of things intermittently unreachable in us-east-2 for us, across multiple AWS accounts.


Chaos monkey is on the move again.


Anyone else notice similar issues in US-West-2 a few hours before this issue in US-East-2?


Looks like a lot of services are impacted including Cloudflare, Ping, Zoom and Datadog.


Not sure why we're on that list. We run our own infrastructure and are not built on AWS.


Looks like Snap, Crackle and Pop are down as well.


> Looks like Snap, Crackle and Pop are down as well.

I don't work on cloud stuff, so I'm genuinely unsure if this is a joke.


It's a joke but I only knew that because Snap is/was (as of S1) hosted on GCP and not AWS. Crackle happens to be the name of a video on demand company.

It's a reference to the breakfast cereal of the same name.


Pop is also a screen sharing/pairing tool, so the joke was great.


Pedantic clarification for the unfamiliar: the breakfast cereal is named Rice Krispies while Snap, Crackle, and Pop are the names of the cartoon mascots on the box.


Hah, frequency illusion strikes again? I just learned about the derivatives past Jerk yesterday.


Cloudflare uses AWS? For what?


unfortunately one of the largest regions:

https://github.com/patmyron/cloud/#ip-addresses-per-region


I'm seeing issues in 2a but not 2b. Anyone having issues in 2b?


AWS availability zones are randomly shuffled for each AWS account – your us-east-2a won't (necessarily) be the same as another user's (or even another account in the same organization): https://docs.aws.amazon.com/ram/latest/userguide/working-wit...

You'll need to see which availability zone ID (e.g., use2-az3) corresponds to each zone in your account: https://aws.amazon.com/premiumsupport/knowledge-center/vpc-m...

edit: AWS identified this as a power loss in a single zone, use2-az1.


I wonder if this is done because people have a tendency or something to always create resources in 'A' (or some other AZ) and this helps spread things around.

And if I had read the page the link points to more carefully, I'd have seen that's exactly the reason.


Just FYI; availability zone names in AWS are randomized between accounts - your "a" can be someone else's "f"

Physical identifiers for availability zones are like "use2-az1", ref.

https://docs.aws.amazon.com/ram/latest/userguide/working-wit...


Availability zones are not guaranteed to have the same name across accounts (ie. us-east-2a in one account might be us-east-2d in another). You would need to use the AZ-ID to determine if they are the same.

https://docs.aws.amazon.com/prescriptive-guidance/latest/pat...


AWS availability zones (so like us-west-2b rather than us-west-2) are not the same between accounts. us-west-2b for you is something different than us-west-2b for everyone else.


Had Rackspace login issues earlier today. Hmmmmmm ...


us-east-2 customer here also having some issues


We have RDS and ECS issues in us-east-2


yeah looks like us-east-2 has networking issues


Can confirm....


Same


I understand that us-east is AWS's oldest and biggest facility, but Amazon seems to have more money than Croesus, why aren't they fixing/rebuilding/replacing us-east with something more modern?


us-east is a geographic distinction within which there are multiple regions. us-east-1 and us-east-2 are not the same. This outage occurred in us-east-2. Within an AWS region there are multiple data centers. They call their data centers availability zones. The availability zone AZ1 was the one impacted, and within that availability zone, most likely only a subset of servers.

us-east-1 is the region you're thinking of that has issues. Mostly due to being the largest region (I think?) and like you mentioned, the oldest.


> us-east-1 is the region you're thinking of that has issues. Mostly due to being the largest region (I think?) and like you mentioned, the oldest.

Also, because shared and global AWS resource are (or at least often behave as if they are) intimately tied to us-east-1.


My first instinct would be to guess that something like this happened because of some intentional and well-meaning effort to upgrade some critical part of their infrastructure. Just my hunch given that it happened during the middle of the week in the middle of the day, and came back relatively quickly. The quick but not instantaneous bounce back has the hallmark of someone following a carefully laid out worst case contingency plan. I look forward to the postmortem.


Because money can't fix everything? In fact sometimes having too much money makes it worse, as YC startup wisdom says.


I've avoided responding since my reply that started this was downvoted... But...

Agreed, money can't solve everything. BUT, this seems like an extremely solvable problem. That's why I'm so surprised.





