Complex systems are really really hard. I'm not a big fan of seeing all these folks bash AWS for this, and not really understanding the complexity or nastiness of situations like this. Running the kind of services they do for the kind of customers, this is a VERY hard problem.
We ran into a very similar issue at the database layer in our company literally two weeks ago: connections to our MySQL instances exploded, completely took down our data tier, and caused a multi-hour outage, compounded by retries and thundering herds. Understanding this kind of problem in a stressful scenario is extremely difficult and a harrowing experience. Anticipating this kind of issue is very, very tricky.
Naive responses to this include "better testing", "we should be able to do this", "why is there no observability", etc. The problem isn't testing. Complex systems behave in complex ways, and it's difficult to model and predict them, especially when the inputs to the system aren't entirely under your control. Individual components are easy to understand, but when you integrate them, things get out of whack. I can't stress how difficult it is to model or even think about these systems; they're very, very hard. Combined with this knowledge being distributed among many people, you're dealing with not only distributed systems but also distributed people, which makes it even harder to wrap your head around.
Outrage is the easy response. Empathy and learning is the valuable one. Hugs to the AWS team, and good learnings for everyone.
> Outrage is the easy response. Empathy and learning is the valuable one.
I'm outraged that AWS, as a company policy, continues to lie about the status of their systems during outages, making it hard for me to communicate to my stakeholders.
Empathy? For AWS? AWS is part of a mega corporation that is closing in on 2 TRILLION dollars in market cap. It's not a person. I can empathize with individuals who work for AWS, but it's weird to ask us to have empathy for a massive, faceless, ruthless, relentless, multinational juggernaut.
My reading of GP's comment is that the empathy should be directed towards AWS' team, the people who are building the system and handling the fallout, not AWS the corporate entity.
It seems obvious to me that they're specifically talking about having empathy for the people who work there, the people who designed and built these systems and yes, empathy even for the people who might not be sure what to put on their absolutely humongous status page until they're sure.
But I don’t see people attacking the AWS team, at worst the “VP” who has to approve changes to the dashboard. That’s management and that “VP” is paid a lot.
I think most of the outrage is not because "it happened" but because AWS is saying things like "S3 was unaffected" when the anecdotal experience of many in this thread suggests the opposite.
That and the apparent policy that a VP must sign off on changing status pages, which is... backwards to say the least.
> a VP must sign off on changing status pages, which is... backwards to say the least.
I think most people's experience with "VPs" makes them not realize what AWS VPs do.
VPs here are not sitting in an executive lounge wining and dining customers, chomping on cigars and telling minions to "Call me when the data center is back up and running again!"
They are on the tech call, working with the engineers, evaluating the problem, gathering the customer impact, and attempting to balance communicating too early with being precise.
Is there room for improvement? Yes. I wish we would just throw up a generic "Shit's Fucked Up. We Don't Know Why Yet, But We're Working On It" message.
But the reason we don't doesn't have anything to do with having to get VP approval to put that message up. The VPs are there in the trenches most of the time.
> I wish we would just throw up a generic "Shit's Fucked Up. We Don't Know Why Yet, But We're Working On It" message.
I gotta say, the implication that you can't register an outage until you know why it happened is pretty damning. The status page is where we look to see if services are affected; if that information can't be shared there until you understand the cause, that's very broken.
The AWS status page has become kind of a joke to customers.
I was encouraged to see the announcement in OP say that there is "a new version of our Service Health Dashboard" coming. I hope it can provide actual capabilities to display, well, service health.
From how people talk about it, it sounds like updates to the Service Health Dashboard are currently a purely manual process, rather than automated monitoring updating the dashboard in any way at all. I find that a surprising implementation for an organization of Amazon's competence and power. That alarms me more than who it is that has the power to manually update it; I agree that I don't have enough knowledge of AWS internal org structures to have an opinion on whether it's the "right" people or not.
I suspect AWS must have internal service health pages that are actually automatically updated in some way by monitoring, that is, that actually work to display service health. It seems like a business decision rather than a technical challenge if the public facing system has no inputs but manual human entry, but that's just how it seems from the outside, I may not have full information. We only have what Amazon shares with us of course.
Can you please help me understand why you, and everyone else, are so passionate about the status page?
I get that it not being updated is an annoyance, but I cannot figure out why it is the single most discussed thing about this whole event. I mean, entire services were out for almost an entire day, and if you read HN threads it would seem that nobody even cares about lost revenue/productivity, downtime, etc. The vast majority of comments in all of the outage threads are screaming about how the SHD lied.
In my entire career of consulting across many companies and many different technology platforms, never once have I seen or heard of anyone even looking at a status page outside of HN. I'm not exaggerating. Even over the last 5 years when I've been doing cloud consulting, nobody I've worked with has cared at all about the cloud provider's status pages. The only time I see it brought up is on HN, and when it gets brought up on HN it's discussed with more fervor than most other topics, even the outage itself.
In my real life (non-HN) experience, when an outage happens, teams ask each other "hey, you seeing problems with this service?" "yea, I am too, heard maybe it's an outage" "weird, guess I'll try again later" and go get a coffee. In particularly bad situations, they might check the news or ask me if I'm aware of any outage. Either way, we just... go on with our lives? I've never needed, nor have I ever seen people need, a status page to inform them that things aren't working correctly, but if you read HN you would get the impression that entire companies of developers are completely paralyzed unless the status page flips from green to red. Why? I would even go so far as to say that if you need a third party's SHD to tell you whether things aren't working right, then you're probably doing something wrong.
Seriously, what gives? Is all this just because people love hating on Amazon and the SHD is an easy target? Because that's what it seems like.
A status page gives you confidence that the problem indeed lies with Amazon and not your own software. I don't think it's very reasonable to notice issues, ask other teams if they are also having issues, and if so, just shrug it off and get a cup of coffee without more investigation. Just because it looks like the problem is with AWS, you can't be sure until you investigate further, especially if the status page says it's all working fine.
I think it goes without saying that having an outage is bad, but having an outage which is not confirmed by the service provider is even worse. People complain about that a lot because it's the least they could do.
I care about status pages because when something breaks upstream I need to know whether it's an issue I need to report, whether there are additional problems related to the outage I need to look out for, or whether there are workarounds I can deploy. If I find out anything that might help me narrow down the ETA for a fix, that's bonus fries.
I don't gripe about it on HN, but it is generally a disappointment to me when I stumble upon something that looks like a significant outage but a company is making no indication that they've seen it and are working on it (or waiting for something upstream of them, as sometimes happens).
It is extremely common for customers to care about being informed accurately about downtime, and not just for AWS. I think your experience of not caring and not knowing anyone who cares may be an outlier.
> Can you please help me understand why you, and everyone else, are so passionate about the status page?
I don't think people are "passionate about status page." I think people are unhappy with someone they are supposed to trust straight up lying to their face.
AWS isn't a hobby platform. Businesses are built on AWS and other cloud providers. Those businesses' customers have the expectation of knowing why they are not receiving the full value of their service.
It makes sense that, as part of marketing yourself as viable infrastructure upon which other businesses can operate, you'd provide more granular and refined communication to allow better communication up and down the chain, instead of forcing your customers to RCA your service in order to communicate to their customers.
> I wish we would just throw up a generic "Shit's Fucked Up. We Don't Know Why Yet, But We're Working On It" message.
I think that's the crux of the matter? AWS seems to now have a reputation for ignoring issues that are easily observable by customers, and by the time any update shows up, it's way too late. Whether VPs make this decision or not is irrelevant. If this becomes a known pattern (and I think it has), then the system is broken.
disclaimer: I have very little skin in this game. We use S3 for some static assets, and with layers of caching on top, I think we are rarely affected by outages. I'm still curious to observe major cloud outages and how they are handled, and the HN reaction from people on both side of the fence.
> disclaimer: I have very little skin in this game. We use S3 for some static assets, and with layers of caching on top, I think we are rarely affected by outages. I'm still curious to observe major cloud outages and how they are handled, and the HN reaction from people on both side of the fence.
I'd like to share my experience here. This outage definitely impacted my company. We make heavy use of autoscaling, we use AWS CodeArtifact for Python packages, and we recently adopted AWS Single Sign-On and EC2 Instance Connect.
So, you can guess what happened:
- No one could access the AWS Console.
- No one could access services authenticated with SAML.
- Very few CI/CD, training or data pipelines ran successfully.
- No one could install Python packages.
- No one could access their development VMs.
As you might imagine, we didn't do a whole lot that day.
With that said, this experience is unlikely to change our cloud strategy very much. In an ideal world, outages wouldn't happen, but the reason we use AWS and the cloud in general is so that, when they do happen, we aren't stuck holding the bag.
As others have said, these giant, complex systems are hard, and AWS resolved it in only a few hours! Far better to sit idle for a day rather than spend a few days scrambling, VP breathing down my neck, discovering that we have no disaster recovery mechanism, and we never practiced this, and hardware lead time is 3-5 weeks, and someone introduced a cyclical bootstrapping process, and and and...
Instead, I just took the morning off, trusted the situation would resolve itself, and it did. Can't complain. =P
I might be more unhappy if we had customer SLAs that were now broken, but if that was a concern, we probably should have invested in multi-region or even multi-cloud already. These things happen.
Saying "S3 is down" can mean anything. Our S3 buckets that served static web content stayed up no problem. The API was down though. But for the purposes of whether my organization cares I'm gonna say it was "up".
> We are currently experiencing some problems related to FOO service and are investigating.
A generic, utterly meaningless message, which is still a hell of a lot more than usually gets approved, and approved far too late.
It is also still better than "all green here, nothing to see" which has people looking at their own code, because they _expect_ that they will be the problem, not AWS.
Most of what they actually said via the manual human-language status updates was "Service X is seeing elevated error rates".
While there are still decisions to be made in how you monitor errors and what sorts of elevated rates merit an alert -- I would bet that AWS has internally-facing systems that can display service health in this way based on automated monitoring of error rates (as well as other things). Because they know it means something.
They apparently choose to make their public-facing service health page only show alerts via a manual process that often results in an update only several hours after lots of customers have noticed problems. This seems like a choice.
What's the point of a status page? To me, the point of it is, when I encounter a problem (perhaps noticed because of my own automated monitoring), one of the first thing I want to do is distinguish between a problem that's out of my control on the platform, and a problem that is under my control and I can fix.
A status page that does not support me in doing that is not fulfilling its purpose. The AWS status page fails to help customers do that, by regularly showing all green with no alerts hours after widespread problems occurred.
It doesn’t matter what the VPs are doing, that misses the point. Every minute you know there is a problem and you haven’t at least put up a “degraded” status, you’re lying to your customers.
It was on the top of HN for an hour before anything changed, and then it was still downplayed, which is insane.
I don't think the matter is whether or not VPs are involved, but the fact that human sign-off is required. Ideally the dashboard would accurately show what's working or not, regardless of whether the engineers know what's going on.
There's definitely miscommunication around this. I know I've miscommunicated impact, or my communication was misinterpreted across the 2 or 3 people it had to jump before hitting the status page.
For example, the meaning of "S3 was affected" is subject to a lot of interpretation. STS was down, which is a blocker for accessing S3. So the end result is that S3 is effectively down, but technically it is not. How does one convey this in a large org? You run S3, but not STS; it's not technically an S3 fault, but an integration fault across multiple services. If you say S3 is down, you're implying that the storage layer is down. But it's actually not. What's the best answer to make everyone happy here? I can't think of one.
"S3 is unavailable because X, Y, and Z services are unavailable."
A graph of dependencies between services is surely known to AWS; if not, they ought to create one post-haste.
Trying to externalize Amazon's internal AWS politicking over which service is down is unproductive to the customers who check the dashboard and see that their service ought to be up, but... well, it isn't?
Because those same customers have to explain to their clients and bosses why their systems are malfunctioning, yet it "shows green" on a dashboard somewhere that almost never shows red.
(And I can levy this complaint against Azure too, by the way.)
Yes, I can envision a (simplified) AWS X-Ray dashboard showing the relationships between the systems and the performance of each one. Then we could see at a glance what was going on. Almost anything is better than that wall of text, tiny status images, and RSS feeds.
Later on in the process, you could do something like this. When you know what else is impacted and how that looks to your customers. But by then the problem is most likely over or at least on the way to being fixed. And hours may have gone by before you get to that point.
Early in the process, when you’re flying blind because you don’t know what’s going on around you and you look at your own systems and they appear to be fine, you can’t really say anything useful.
These weird edge cases are hard to adjudicate because they’ve never happened before — otherwise fixes would already be in place to prevent them. And nothing quite like them has ever before happened at this scale.
I understand the frustration, but when everything you think you know turns out to be wrong, or at least you are unable to confirm whether it’s right or wrong, what do you do?
Read the RCA — When AWS got to that point, they did actually update the SHD with a banner across the top of the page, but that ended up actually causing even more problems. There’s a reason why you try to do these sorts of things safely, which may mean using manual methods in some cases. And sometimes even those safe manual methods have their own weird side effects.
Sometimes shit is hard. Sometimes you run into problems like no one else on the planet has ever experienced before, and you have to figure out what the laws of physics are in this new part of the world as you go about trying to fix whatever it was that broke or acted in an unexpected manner.
Disclaimer: my opinions are my own and are not necessarily shared or reflective of my employer.
I'm not all that angry over the situation, but more disappointed that we've all collectively handed the keys over to AWS because "servers are hard". Yeah, they are, but it's not like locking ourselves into one vendor with flaky docs and a black box of bugs is any better. At least when your own servers go down it's on you, and you don't take out half of North America.
If you aren't going to rely on external vendors, servers are really, really hard. Redundancy in: power, cooling, networking? Those get expensive fast. Drop your servers into a data center and you're in a similar situation to dropping it in AWS.
A couple years ago all our services at our data center just vanished. I call the data center and they start creating a ticket. "Can you tell me if there is a data center outage?" "We are currently investigating and I don't have any information I can give you." "Listen, if this is a problem isolated to our cabinet, I need to get in the car. I'm trying to decide if I need to drive 60 miles in a blizzard."
That facility has been pretty good to us over a decade, but they were frustratingly tight-lipped about an entire room of the facility losing power because one of their power feeder lines was down.
Could AWS improve? Yes. Does avoiding AWS solve these sorts of problems? No.
Servers are not hard if you have a dedicated person (long ago known as a system administrator), and fun fact: it's sometimes even much cheaper and more reliable than having everything in the "cloud".
Personally I am a believer in mixed environments: public web servers etc. in the "cloud", locally used systems and backups "in house" with a second location (both in data centers, or at least one). And no, I'm not talking about the next Google, but about the 99% of businesses.
You can either pay a dedicated team to manage your on prem solution, go multi cloud, or simply go multi region on aws.
My company was not affected by this outage because we are multi region. Cheapest and quickest option if you want to have at least some fault tolerance.
> ... multi region. Cheapest and quickest option if you want to have at least some fault tolerance.
That is simply not true. You have to adapt your application to be multi-region aware to start with, and if you do that on AWS you are basically locked in to one of the most expensive cloud providers out there.
You're saying it's not true, but do you have another example of a quick and cheap way to do this ?
I'm not saying this can be done in 1 day for 2 cents, I'm saying that it's quick and cheap compared to other options.
> adapt your application to be multi region aware
This vs. adapting your application to support multi-cloud deployments, or leaving the cloud to start doing on-prem with a dedicated team: you can place your bets.
On AWS you can set up Route 53 to point to multiple regions based on health checks or latency.
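To make that concrete, here is a minimal boto3 sketch of a primary/secondary failover setup: a health check on the primary region's endpoint plus a pair of failover records. The domain names, hosted zone ID, and thresholds are placeholders, not anything from AWS's write-up.

    import uuid
    import boto3

    route53 = boto3.client("route53")

    # Health check against the primary region's endpoint (hypothetical domain).
    hc = route53.create_health_check(
        CallerReference=str(uuid.uuid4()),
        HealthCheckConfig={
            "Type": "HTTPS",
            "FullyQualifiedDomainName": "api-us-east-1.example.com",
            "ResourcePath": "/health",
            "Port": 443,
            "RequestInterval": 30,
            "FailureThreshold": 3,
        },
    )

    def failover_record(identifier, role, target, health_check_id=None):
        """Build an UPSERT change for one leg of a failover routing policy."""
        record = {
            "Name": "api.example.com",
            "Type": "CNAME",
            "SetIdentifier": identifier,
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "TTL": 60,
            "ResourceRecords": [{"Value": target}],
        }
        if health_check_id:
            record["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": record}

    # Route 53 serves the PRIMARY record while its health check passes,
    # and fails over to the SECONDARY record when it doesn't.
    route53.change_resource_record_sets(
        HostedZoneId="Z_PLACEHOLDER",  # hypothetical hosted zone ID
        ChangeBatch={"Changes": [
            failover_record("primary", "PRIMARY",
                            "api-us-east-1.example.com", hc["HealthCheck"]["Id"]),
            failover_record("secondary", "SECONDARY",
                            "api-us-west-2.example.com"),
        ]},
    )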
Excuse me, do we need all that complexity? Is saying that it is "hard" a justification?
It is naive to assume people bashing AWS are incapable of running things better, cheaper, and faster across many other vendors, on-prem, colocation, or what not.
> Outrage is the easy response.
That is what made AWS get the market share it has now in the first place: the easy responses.
The main selling point of AWS in the beginning was "how easy it is to spin up a virtual machine". After basically every layman started recommending AWS and we flocked there, AWS started making things more complex than they should be. Was that to make it harder to get out? IDK.
> Empathy and learning is the valuable one.
When you run your own infrastructure and something fails and you are not transparent, your users will bash you, no matter who you are.
And that was another "easy response" used to drive companies towards AWS. We developers were echoing that "having an infrastructure team or person is not necessary", etc.
Now we are stuck in this learned helplessness where every outage is a complete disaster in terms of transparency, with multiple services failing even for multi-region and multi-AZ customers; we say "this service here is also not working" and AWS simply states that the service was fine, not affected, up and running.
If it was a sysadmin doing that, people would be coming for their neck with pitchforks.
> AWS started making things more complex than it should
I don’t think this is fair for a couple reasons:
1. AWS would have had to scale regardless just because of the number of customers. Even without adding features. This means many data centers, complex virtual networking, internal networks, etc. These are solving very real problems that happen when you have millions of virtual servers.
2. AWS hosts many large, complex systems like Netflix. Companies like Netflix are going to require more advanced features out of AWS, and this will result in more features being added. While this is added complexity, it’s also solving a customer problem.
My point is that complexity is inherent to the benefits of the platform.
Thanks for these thoughts. Resonated well with me. I feel we are sleepwalking into major fiascos when a simple doorbell needs to sit on top of this level of complexity. It's in our best interest not to tie every small thing into layers and layers of complexity. Mundane things like doorbells need to have their fallback at least done properly, so they can function locally without relying on complex cloud systems.
The problem isn't AWS per se. The problem is it's become too big to fail. Maybe in the past an outage might take down a few sites, or one hospital, or one government service. Now one outage takes out all the sites, all the hospitals and all the government services. Plus your coffee machine stops working.
> I'm not a big fan of seeing all these folks bash AWS for this,
The disdain I saw was towards those claiming that all you need is AWS, that AWS never goes down, and don't bother planning for what happens when AWS goes down.
AWS is an amazing accomplishment, but it's still a single point of failure. If you are a company relying on a single supplier and you don't have any backup plans for that supplier being unavailable, that is ridiculous and worthy of laughter.
But Amazon advertises that they DO understand the complexity of this, and that their understanding, knowledge and experience is so deep that they are a safe place to put your critical applications, and so you should pay them lots of money to do so.
Totally understand that complex systems behave in incomprehensible ways (hopefully only temporarily incomprehensible). But they're selling people on the idea of trading your complex system, for their far more complex system that they manage with such great expertise that it is more reliable.
Not sure why I got down voted for an honest question. Most start-ups are founders, developers, sales and marketing. Dedicated infrastructure, network and database specialists don't get factored in because "smart CS graduates can figure that stuff out". I've worked at companies who held onto that false notion way too long and almost lost everything as a result ("company extinction event", like losing a lot of customer data)
I am always amazed at how little my software dev spouse understands about infrastructure; basic networking troubleshooting is beyond her. She is a great dev, but terrible at ops. Fortunately she is at a large company with lots of devs, sysadmins and SREs.
> This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks. These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries. This led to persistent congestion and performance issues on the devices connecting the two networks.
I remember my first experience realizing the client retry logic we had implemented was making our lives way worse. Not sure if it's heartening or disheartening that this was part of the issue here.
Our mistake was resetting the exponential backoff delay whenever a client successfully connected and received a response. At the time a percentage but not all responses were degraded and extremely slow, and the request that checked the connection was not. So a client would time out, retry for a while, backing off exponentially, eventually successfully reconnect and then after a subsequent failure start aggressively trying again. System dynamics are hard.
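To make that concrete, here's a minimal Python sketch (parameter values are made up) of capped exponential backoff with full jitter where the failure count is only forgotten after a sustained healthy window, rather than after the first successful response, which was the mistake described above:

    import random
    import time

    class Backoff:
        """Capped exponential backoff with full jitter.

        Sketch only: the failure count is reset after the client has been
        healthy for `healthy_window` seconds, not after a single successful
        request, so one lucky response during a partial outage doesn't
        restart aggressive retries.
        """

        def __init__(self, base=0.5, cap=60.0, healthy_window=30.0):
            self.base = base
            self.cap = cap
            self.healthy_window = healthy_window
            self.failures = 0
            self.healthy_since = None

        def record_success(self):
            now = time.monotonic()
            if self.healthy_since is None:
                self.healthy_since = now
            elif now - self.healthy_since >= self.healthy_window:
                self.failures = 0  # sustained health: forget past failures

        def record_failure(self):
            self.healthy_since = None
            self.failures += 1

        def next_delay(self):
            # Full jitter: a random delay up to the capped exponential value.
            return random.uniform(0, min(self.cap, self.base * 2 ** self.failures))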
And they have to be actually tested. Most of them are designs based on nothing but uninformed intuition. There is an art to back pressure and keeping pipelines optimally utilized. Queueing doesn't work the way you think it does until you really know it.
Why is this hard, and can’t just be written down somewhere as part of the engineering discipline? This aspect of systems in 2021 really shouldn’t be an “art.”
It is, in itself, a separate engineering discipline, and one that cannot really be practiced analytically unless you understand really well the behavior of individual pieces which interact with each other. Most don't, and don't care to.
It is something which needs to be designed and tuned in place and evades design "getting it right" without real world feedback.
And you also simply have to reach a certain, somewhat large scale for it to matter at all: at smaller scales, the excess capacity you have because of the available granularity of capacity absorbs most of the need for it, and you can get away with wasting a bit of money on extra capacity to avoid the problem.
It is also sensitive to small changes so textbook examples might be implemented wrong with one small detail that won't show itself until a critical failure is happening.
It is usually the location of the highest complexity interaction in a business infrastructure which is not easily distilled to a formula. (and most people just aren't educationally prepared for nonlinear dynamics)
It absolutely is written down. The issue is that the results you get from modeling systems using queuing theory are often unintuitive and surprising. On top of that it's hard to account for all the seemingly minor implementation details in a real system.
During my studies we had a course where we built a distributed system and had to model its performance mathematically. It was really hard to get the model to match the reality and vice versa. So many details are hidden in a library, framework or network adapter somewhere (e.g. buffers or things like packet fragmentation).
We used the book "The Art of Computer Systems Performance Analysis" (R. Jain), but I don't recommend it. At least not the 1st edition which had a frustrating amount of serious, experiment-ruining errata.
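One classic example of how unintuitive the textbook results feel: for an M/M/1 queue, mean time in system is W = S / (1 - rho), so latency blows up non-linearly as utilization approaches 100%. A few illustrative lines of Python (numbers are made up):

    # Mean time in system for an M/M/1 queue: W = S / (1 - rho),
    # where S is mean service time and rho is utilization.
    service_time_ms = 10.0

    for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
        wait = service_time_ms / (1 - rho)
        print(f"utilization {rho:.0%}: mean time in system {wait:,.0f} ms")

Going from 90% to 99% utilization multiplies the mean latency by 10x, which is exactly the kind of behavior that surprises people when retries quietly push a system toward saturation.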
Think of other extremely complex systems and how we’ve managed to make them stable:
1) airplanes: they crashed, _a lot_. We used data recorders and stringent process to make air travel safety commonplace.
2) cars: so many accidents, and so much accident research. The solution comes after the disaster.
3) large buildings and structures: again, the master work of time, attempts, failures, research and solutions.
If we really want to get serious about this (and I think we do) we need to stop reinventing infrastructure every 10 years and start doubling down on stability. Cloud computing, in earnest, has only been around a short while. I’m not even convinced it’s the right path forward, just happens to align best with business interests, but it seems to be the devil we’re stuck with so now we need to really dig in and make it solid. I think we’re actually in that process right now.
But what's a good alternative then? What if the internet connection has recovered? And you were at the, for example, 4 minute retry loop. Would you just make your users stare at a spinning loader for 8 minutes?
Or tell them directly that "We have screwed up. The service is currently overloaded. Thank you for your patience. If you still haven't given up on us, try again a less busy time of day. We are very sorry."
There are several options, and finding the best one depends a bit on estimating the behaviour of your specific target audience.
I first learned about exponential backoff from TCP and TCP has a lot of other smart ways to manage congestion control. You don't need to implement all the ideas into client logic but you can also do a lot better than just basic exponential backoff.
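One concrete example of such an idea is a client-side retry budget: retries are only permitted up to some fraction of recent successful traffic, so during a broad outage clients fail fast instead of multiplying load. A rough Python sketch (the ratios are illustrative, not taken from any particular SDK):

    import threading

    class RetryBudget:
        """Allow retries only up to a fraction of recent successes, so
        retries can't amplify load during an outage. Sketch only."""

        def __init__(self, ratio=0.1, min_tokens=1.0, max_tokens=100.0):
            self.ratio = ratio          # each success earns a fraction of a retry
            self.tokens = max_tokens
            self.min_tokens = min_tokens
            self.max_tokens = max_tokens
            self.lock = threading.Lock()

        def record_success(self):
            with self.lock:
                self.tokens = min(self.max_tokens, self.tokens + self.ratio)

        def can_retry(self):
            with self.lock:
                if self.tokens >= self.min_tokens:
                    self.tokens -= 1.0
                    return True
                return False  # budget exhausted: fail fast instead of retrying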
The problem shows up at the central system while the peripheral device is causing it. And those systems belong to very different organizations with very different priorities. I still remember how difficult the discussion was with the 3G base station team, persuading them to implement exponential backoff with some random factor when connecting to the management system.
I guess you have to read these kinds of items for what's hidden in the careful language: running the instances had no problem; that they had limited connectivity is a different matter. From AWS's point of view they don't seem to see user impact, only services from their own point of view.
Perhaps that distinction has value if your workloads did not depend on external network connectivity, for example S3 access without a VPC, or compute-only DS/ML jobs.
Yeah, I know. This was based off instance store logging on these instances. For better or worse, they're very simple ports of pre-AWS on-prem servers, they don't speak AWS once they're up and running.
Do you use VPC endpoints for S3? The next sentence explained failures I observed with S3: "However, access to Amazon S3 buckets and DynamoDB tables via VPC Endpoints was impaired during this event."
I could not modify file properties in S3, uploading new or modified files was spotty, and AWS Console GUI access was broken as well. Was that because of VPC endpoints?
DAX, part of DynamoDB from how AWS groups things, was throwing internal server errors for us and eventually we had to reboot nodes manually. That's separate from the STS issues we had in terms of our EKS services connecting to DAX.
STS is the worst with this. Even for other internal teams, they seem to treat dropped requests (i.e., timeouts, which show up as 5xxs on the client side) as 'non-faults', and so don't include those data points in their graphs and alarms. It's really obnoxious.
AWS in general is trying hard to do the right thing for customers, and obviously has a long ways to go. But man, a few specific orgs have some frustrating holdover policies.
> AWS in general is trying hard to do the right thing for customers
You are responding to a comment that suggests they're misrepresenting the truth (which wouldn't be the first time, even in the last few days) in communication to their customers.
As always, they are doing the right thing for themselves only.
EDIT: I think that you should mention being an Engineer at Amazon AWS in your comment.
> ...an AWS person shit-talking other AWS teams [in public].
I remember a time when this would be an instant reprimand... Either amzn engs are bolder these days, or amzn hr is trying really hard for amzn to be "world's best employer", or both.
Gotta deanonymize the user to reprimand them. Maybe I am wrong here, but I don't see it as something an Amazon HR employee would actually waste their time on (exceptions apply for confidential info leaks and other blatantly illegal stuff, of course). Especially given that it might as well be impossible, unless the user incriminated themselves with identifiable info.
It's true that I shouldn't have posted it, was mostly just in a grumpy mood. It's still considered very bad form. I'm not actually there anymore, but the idea stands.
I suppose all outages are just elevated latency. Has anyone ever had an outage and said "fuck it, we're going out of business" and never came back up? That's the only true outage ;)
5xx errors are servers or proxies giving up on requests. Increased timeouts resulting in eventually successful requests may have been considered "elevated latency" (though that would rarely be a proper way to handle such an issue).
They treat 5xx errors as non-errors, but this is not the case with the rest of the world. "Increased timeouts" is Amazon's untruthful term for "not working at all".
So many lessons in this article. When your service goes down but eventually gets back up, it's not an outage. It's "elevated latency". Of a few hours, maybe days.
> This congestion immediately impacted the availability of real-time monitoring data for our internal operations teams, which impaired their ability to find the source of congestion and resolve it.
Disruption of the standard incident response mechanism seems to be a common element of longer lasting incidents.
It is. And to add, all automation that we rely on in peace time can often complicate cross cutting wartime incidents by raising the ambient complexity of an environment. Bainbridge for more: https://blog.acolyer.org/2020/01/08/ironies-of-automation/
Indeed - Even the recent facebook outage outlined how slow recovery can be if the primary investigation and recovery methods are directly impacted as well. Back in the old days some environments would have POTS dial-in connections to the consoles as backup for network problems. That of course doesn't scale, but it was an attempt to have an alternate path of getting to things. Regrettably if a backhoe takes out all of the telecom at once that plan doesn't work so well.
Yup. There was a GCP outage a couple of years ago like this.
I don’t remember the exact details, but it was something along the lines of a config change went out that caused systems to incorrectly assume there were huge bandwidth constraints. Load shedding kicked in to drop lower priority traffic which ironically included monitoring data rendering GCP responders blind and causing StackDriver to go blank for customers.
That was my take. Seems like boilerplate you could report for almost any incident. Last year's Kinesis outage and the S3 outage some years ago had some decent detail
Does anyone know how often an AZ experiences an issue as compared to an entire region? AWS sells the redundancy of AZs pretty heavily, but it seems like a lot of the issues that happen end up being region-wide. I'm struggling to understand whether I should be replicating our service across regions or whether the AZ redundancy within a region is sufficient.
I've been naively setting up our distributed databases in separate AZs for a couple years now, paying, sometimes, thousands of dollars per month in data replication bandwidth egress fees. As far as I can remember I've never seen an AZ go down, and the only region that has gone down has been us-east-1.
There was an AZ outage in Oregon a couple months back. You should definitely go multi AZ without hesitation for production workloads for systems that should be highly available. You can easily lose a system permanently in a single AZ setup if it’s not ephemeral.
The stuff that's exclusively hosted in us-east-1 is, to my knowledge, mostly things that maintain global uniqueness. CloudFront distributions, Route53, S3 bucket names, IAM roles and similar- i.e. singular control planes. Other than that, regions are about as isolated as it gets, except for specific features on top.
Availability zones are supposed to be another fault boundary, and things are generally pretty solid, but every so often problems spill over when they shouldn't.
The general impression I get is that us-east-1's issues tend to stem from it being singularly huge.
If I recall, there was a point in time where the control panel for all regions was in us-east-1. I seem to recall an outage where the other regions were up, but you couldn't change any resources because the management API was down in us-east-1.
Literally all our AWS resources are in EU/UK regions - and they all continued functioning just fine - but we couldn't sign in to our AWS console to manage said resources.
Thankfully the outage didn't impact our production systems at all, but our inability to access said console was quite alarming to say the least.
It would probably be clearer that they exist if the console redirected to the regional URL when you switched regions.
STS, S3, etc have regional endpoints too that have continued to work when us-east-1 has been broken in the past and the various AWS clients can be configured to use them, which they also sadly don't tend to do by default.
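For example, with boto3 you can opt into the regional STS endpoint either via the environment or by pinning the endpoint explicitly; whether the default is still the legacy global endpoint depends on your SDK version, so treat this as a sketch:

    import os
    import boto3

    # Ask the SDK to use sts.<region>.amazonaws.com instead of the global
    # endpoint in us-east-1 (some SDK versions still default to "legacy").
    os.environ["AWS_STS_REGIONAL_ENDPOINTS"] = "regional"

    # Or pin the regional endpoint explicitly:
    sts = boto3.client(
        "sts",
        region_name="eu-west-2",
        endpoint_url="https://sts.eu-west-2.amazonaws.com",
    )
    print(sts.get_caller_identity()["Arn"])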
AWS has been getting a pass on their stability issues in us-east-1 for years now because it’s their “oldest” zone. Maybe they should invest in fixing it instead of inventing new services to sell.
I certainly wouldn't describe it as “a pass” given how commonly people joke about things like “friends don't let friends use us-east-1”. There's also a reporting bias: because many places only use us-east-1, you're more likely to hear about it even if it only affects a fraction of customers, and many of those companies blame AWS publicly because that's easier than admitting that they were only using one AZ, etc.
These big outages are noteworthy because they _do_ affect people who correctly architected for reliability — and they're pretty rare. This one didn't affect one of my big sites at all; the other was affected by the S3 / Fargate issues but the last time that happened was 2017.
That certainly could be better but so far it hasn't been enough to be worth the massive cost increase of using multiple providers, especially if you can have some basic functionality provided by a CDN when the origin is down (true for the kinds of projects I work on). GCP and Azure have had their share of extended outages, too, so most of the major providers tend to be careful to cast stones about reliability, and it's _much_ better than the median IT department can offer.
I agree with you, but my services are actually in Canada (Central). There's only one region in Canada, so I don't really have an alternative. AWS justifies it by saying there are three AZs (distinct data centres) within Canada (Central), but I get scared when I see these region-wide issues. If the AZs were really distinct, you wouldn't really have region-wide issues.
Take DynamoDB as an example. The AWS managed service takes care of replicating everything to multiple AZs for you, that's great! You're very unlikely to lose your data. But, the DynamoDB team is running a mostly-regional service. If they push bad code or fall over it's likely going to be a regional issue. Probably only the storage nodes are truly zonal.
If you wanted to deploy something similar, like Cassandra across AZs, or even regions you're welcome to do that. But now you're on the hook for the availability of the system. Are you going to get higher availability running your own Cassandra implementation than the DynamoDB team? Maybe. DynamoDB had a pretty big outage in 2015 I think. But that's a lot more work than just using DynamoDB IMO.
> But, the DynamoDB team is running a mostly-regional service.
This is both more and less true than you might think. For most regional endpoints, teams leverage load balancers that are scoped zonally, such that ip0 will point at instances in zone a, ip1 will point at instances in zone b, and so on. Similarly, teams who operate "regional" endpoints will generally deploy "zonal" environments, such that in the event of a bad code deploy they can fail away that zone for customers.
That being said, these mitigations still don't stop regional poison pills or the like from infecting other AZs unless the service is architected to be zonal internally.
Yeah, teams go to a lot of effort to have zonal environments/fleets/deployments... but there are still many, many regional failure modes. For example, even in a foundational service like EC2 most of their APIs touch regional databases.
It can be a bit hard to know, since the AZ identifiers are randomized per account, so if you think you have problems in us-west-1a, I can't check on my side. You can get the AZ ID out of your account to de-randomize things, so we can compare notes, but people rarely bother, for whatever reason.
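For anyone who wants to de-randomize theirs, the EC2 API returns both the per-account zone name and the stable zone ID; a quick boto3 sketch:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-west-1")

    # Zone *names* (us-west-1a, ...) are shuffled per account; zone *IDs*
    # (usw1-az1, ...) are stable, so IDs are what to compare across accounts.
    for az in ec2.describe_availability_zones()["AvailabilityZones"]:
        print(az["ZoneName"], "->", az["ZoneId"])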
Over two years I think we'd see about 2-3 AZ issues, but only one that I would consider an outage.
Usually there would be high network error rates, which were enough to make RDS Postgres fail over if it was in the impacted AZ.
The only real "outage" was DNS having extremely high error rates in a single us-east-1 AZ, to the point that most things there were barely working.
Lack of instance capacity, especially spot, especially for the NVMe types, was common for CI (it used ASGs for builder nodes). It'd be pretty common for a single AZ to run out of spot instance types--especially the NVMe ([a-z]#d types).
I’ve been running platform teams on aws now for 10 years, and working in aws for 13. For anyone looking for guidance on how to avoid this, here’s the advice I give startups I advise.
First, if you can, avoid us-east-1. Yes, you’ll miss new features, but it’s also the least stable region.
Second, go multi AZ for production workloads. Safety of your customer’s data is your ethical responsibility. Protect it, back it up, keep it as generally available as is reasonable.
Third, you're gonna go down when the cloud goes down. Not much use getting overly bent out of shape. You can reduce your exposure by just using their core systems (EC2, S3, SQS, LBs, CloudFront, RDS, ElastiCache). The more systems you use, the less reliable things will be. However, running your own key value store, API gateway, event bus, etc., can also be way less reliable than using theirs. So, realize it's an operational trade-off.
Degradation of your app / platform is more likely to come from you than AWS. You’re gonna roll out bad code, break your own infra, overload your own system, way more often than Amazon is gonna go down. If reliability matters to you, start by examining your own practices first before thinking things like multi region or super durable highly replicated systems.
This stuff is hard. It’s hard for Amazon engineers. Hard for platform folks at small and mega companies. It’s just, hard. When your app goes down, and so does Disney plus, take some solace that Disney in all their buckets of cash also couldn’t avoid the issue.
And, finally, hold cloud providers accountable. If they’re unstable and not providing service you expect, leave. We’ve got tons of great options these days, especially if you don’t care about proprietary solutions.
Easy to say "leave"; the technical lock-in that cloud service providers choose to have, by design, makes it practically impossible to leave.
AWS (and others) make egress costs insanely expensive for any startup to consider leaving with their data, and there is a constant push to either not support open protocols or to extend/expand them in ways that make it hard to migrate a code base easily.
If the advice is to effectively use only managed open source components, then why AWS at all? Most competent mid-sized teams can do that much cheaper with colo providers like OVH/Hetzner.
The point of investing in AWS is not just to outsource running base infra; that point is lost if we have to stay away from leveraging the kind of cloud-native services us mere mortals cannot hope to build or maintain.
Also, this "avoid us-east-1" advice is a bit frustrating. AWS does not have to always experiment with new services in the same region; it is not marked as an experimental region and does not have reduced SLAs. If it is inferior/preview/beta, then call that out in the UI and the contract. And what about when there is no choice? If CloudFront is managed in us-east-1, should we now not use it? Why use the cloud then?
If your engineering only discovers scale problems in us-east-1 along with customers, perhaps something is wrong? AWS could limit new instances in that region and spread the load; playing with customers who are at your mercy like this, just because you can, is not nice.
Disney can afford to go down, or to build their own cloud; small companies don't have deep pockets to do either.
> AWS (and others) make egress costs insanely expensive for any startup to consider leaving with their data
I have seen this repeated many times, but don't understand it. Yes, egress is expensive, but not THAT expensive compared to storage. S3 egress per GB is no more than 3x the price of storage, i.e. moving out just costs 3 months of storage (there's also API cost, but that's not the one often mentioned).
Is egress pricing being a lock-in factor just a myth? Is there some other AWS cost I'm missing? Obviously there will be big architectural and engineering cost to move, but that's just part of life.
Often the other cloud vendors will assist in offering those migration costs as part of your contract negotiations.
But really, egress costs aren't locking you in. It's the hard-coded AWS APIs, Terraform scripts and technical debt. Having to change all of that and refactor and re-optimize for a different provider's infrastructure is a huge endeavor. That time spent might have a higher ROI being put elsewhere.
Three months is only if you use standard S3. However, Intelligent-Tiering, Infrequent Access, Reduced Redundancy or Glacier Instant Retrieval can be substantially cheaper without impacting retrieval time. [1]
At scale, when costs matter, you would have a lifecycle policy tuned to your needs taking advantage of these classes. Any typical production workload is hardly paying only the S3 base price for all/most of its storage needs; it will have a mix of all these too.
[1] If there is substantial data in regular Glacier, the costing completely blows through the roof; retrieval + egress makes it infeasible unless you actively hate AWS enough to spend that kind of money.
Lesson to build your services with Docker and Terraform. In this setup you can spin up a working clone of a decently sized stack in a different cloud provider in under an hour.
If the setup is that portable you probably don't need AWS at all in the first place.
If you use only services built and managed from your own Docker images, why use the cloud in the first place? It would be cheaper to host on a smaller vendor; the reliability of the big clouds is not substantially better than tier-two vendors, and the difference between, say, OVH and AWS is not valuable enough to most applications to be worth the premium.
IMO, if you don't leverage the cloud-native services offered by GCP or AWS, then the cloud is not adding much value to your stack.
This is just not true for Terraform at all, they do not aim to be multi cloud and it is a much more usable product because of it. Resource parameters do not swap out directly across providers (rightly so, the abstractions they choose are different!).
You've written up my thoughts better than I can express them myself - I think what people get really stuck on when something like this happens is the 'can I solve this myself?' aspect.
A "wait for provider X to fix it for you" situation is infinitely more stressful than an "I have played myself, I will now take action" situation.
Situations out of your (immediate) resolution control feel infinitely worse, even if the customer impact in practice of your fault vs cloud fault is the same.
For me it’s the opposite… aws outages are much less stressful than my own because I know there’s nothing I/we can do about it, they have smart people working on it, and it will be fixed when it’s fixed
I couldn't possibly disagree more strongly with this. I used to drive frantically to the office to work on servers in emergency situations, and if our small team couldn't solve it, there was nobody else to help us. The weight of the outage was entirely on our shoulders. Now I relax and refresh a status page.
> Third, you’re gonna go down when the cloud goes down.
Not necessarily. You just need to not be stuck with a single cloud provider. The likelihood of more than one availability zone going down on a single cloud provider is not that low in practice. Especially when the problem is a software bug.
The likelihood of AWS, Azure, and OVH going down at the same time is low. So if you need to stay online if AWS fail, don't put all your eggs in the AWS basket.
That means not using proprietary cloud solutions from a single cloud provider. It has a cost, so it's not always worth it.
> using proprietary cloud solutions from a single cloud provider, it has a cost so it's not always worth it.
but perhaps some software design choices could be made to alleviate these costs. For example, you could have a read-only replica on azure or whatever backup cloud provider, and design your software interfaces to allow the use of such read only replicas - at least you'd be degraded rather than unavailable. Ditto with web servers etc.
This has a cost, but it's lower than entirely replicating all of the proprietary features in a different cloud.
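A hand-wavy sketch of that degraded-mode idea, assuming both stores expose the same read interface (the class and method names here are hypothetical):

    class DegradableStore:
        """Serve reads from a read-only replica when the primary is down.

        Sketch only: `primary` and `replica` are assumed to expose the
        same get/put interface; writes stay on the primary, so the
        service degrades to read-only rather than going fully dark.
        """

        def __init__(self, primary, replica):
            self.primary = primary
            self.replica = replica

        def get(self, key):
            try:
                return self.primary.get(key)
            except ConnectionError:
                # Primary provider outage: serve possibly-stale data
                # from the replica instead of failing the request.
                return self.replica.get(key)

        def put(self, key, value):
            # Writes only go to the primary; while it is unreachable the
            # caller sees a write failure, not a total outage.
            return self.primary.put(key, value)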
Complex systems are expensive to operate, in many ways.
The more complexity you build into your own systems on top of the providers you depend on, the more likely you are to shoot yourself in the foot when you run into complexity issues that you’ve never seen before.
And the times that is most likely to happen is when one of your complex service providers goes down.
If the kind of thing you’re talking about could be feasibly done, then Netflix would have already done it. The fact that Netflix hasn’t solved this problem is a strong indicator that piling more proprietary complexity on top of all the vendor complexity you inherit from using a given service, well that’s a really hard problem in and of itself.
True multi-cloud redundancy is hard to test - because it’s everything from DNS on up and it’s hard to ask AWS to go offline so you can verify Azure picks up the slack.
And you will get 1/N of requests timing out or erroring out, and in the meanwhile you're paying 2x or 3x the costs. So it might be worth it in some cases, but you need to evaluate it very, very well.
> And, finally, hold cloud providers accountable. If they’re unstable and not providing service you expect, leave. We’ve got tons of great options these days, especially if you don’t care about proprietary solutions.
Easy to say, but difficult to do in practice (leaving a cloud provider)
Absolutely hard. But that doesn’t mean if you’re in a position to start a company from scratch that you can’t walk away. Or if you go to another company and are involved in their procurement of a new purchase, that you can’t sway it away from said provider.
Just because it takes years doesn’t meant it can’t happen.
> Third, you’re gonna go down when the cloud goes down. Not much use getting overly bent out of shape.
Ugh. I have a hard time with this one. Back in the day, EBS had some really awful failures and degradations. Building a greenfield stack that specifically avoided EBS and stayed up when everyone else was down during another mass EBS failure felt marvelous. It was an obvious avoidable hazard.
It doesn't mean "avoid EBS" is good advice for the decade to follow, but accepting failure fatalistically doesn't feel right either.
I hear you. I didn’t use EBS for five years after the great outage in, what was it, 2011?
At this point, it’s reliable enough that even if it were to go down, it’s more safe than not using it. I’d put EBS in the pantheon of “core” services I never mind using these days.
Respectfully disagree. No company in the world has 100% uptime. Whether it’s your server rack or their server rack going down means nothing to a customer.
We’re not discussing data loss in this thread specifically. This is about a couple of hours of downtime per year.
> The AWS container services, including Fargate, ECS and EKS, experienced increased API error rates and latencies during the event. While existing container instances (tasks or pods) continued to operate normally during the event, if a container instance was terminated or experienced a failure, it could not be restarted because of the impact to the EC2 control plane APIs described above.
This seems pretty obviously false to me. My company has several EKS clusters in us-east-1 with most of our workloads running on Fargate. All of our Fargate pods were killed and were unable to be restarted during this event.
Strong agree. We were using Fargate nodes in our us-east-1 EKS cluster and not all of our nodes dropped, but every coredns pod did. When they came back up their age was hours older than expected, so maybe a problem between Fargate and the scheduler rendered them “up” but unable to be reached?
Either way, was surprising to us that already provisioned compute was impacted.
Saw the same. The only cluster services I was running in Fargate were CoreDNS and cluster-autoscaler; thought it would help the clusters recover from anything happening to the node group where other core services run. Whoops.
Couldn't just delete the Fargate profile without a working EKS control plane. I lucked out in that the label selector the kube-dns Service used was disjoint from the one I'd set in the Fargate profile, so I just made a new "coredns-emergency" deployment and cluster networking came back. (cluster-autoscaler was moot since we couldn't launch instances anyway.)
I was hoping to see something about that in this announcement, since the loss of live pods is nasty. Not inclined to rely on Fargate going forward. It is curious that you saw those pod ages; maybe Fargate kubelets communicate with EKS over the AWS internal network?
Still doesn’t explain the cause of all the IAM permission denied requests we saw against policies which are again working fine without any intervention.
Obviously networking issues can cause any number of symptoms but it seems like an unusual detail to leave out to me. Unless it was another ongoing outage happening at the same time.
It’s so hard to know what was the state of the system when the monitoring was out. Wouldn’t be surprised if they don’t have the data to investigate it now.
There are a lot of comments in here that boil down to "could you do infrastructure better?"
No, absolutely not. That's why I'm on AWS.
But what we are all ACTUALLY complaining about is ongoing lack of transparent and honest communications during outages and, clearly, in their postmortems.
Honest communications? Yeah, I'm pretty sure I could do that much better than AWS.
Something they didn't mention is AWS billing alarms. These rely on metrics systems which were affected by this (and are missing some data). Crucially, billing alarms only exist in the us-east-1 region, so if you're using them, you're impacted no matter where your infrastructure is deployed. (That's just my reading of it.)
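For context on why everyone is exposed: the AWS/Billing metrics only exist in us-east-1, so the alarm has to be created there regardless of where your workloads run. A minimal boto3 sketch (the threshold and SNS topic ARN are placeholders):

    import boto3

    # Billing metrics, and therefore billing alarms, live only in us-east-1.
    cw = boto3.client("cloudwatch", region_name="us-east-1")

    cw.put_metric_alarm(
        AlarmName="estimated-charges-over-1000-usd",
        Namespace="AWS/Billing",
        MetricName="EstimatedCharges",
        Dimensions=[{"Name": "Currency", "Value": "USD"}],
        Statistic="Maximum",
        Period=21600,  # billing metrics only update a few times a day
        EvaluationPeriods=1,
        Threshold=1000.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:billing-alerts"],  # placeholder ARN
    )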
> Customers accessing Amazon S3 and DynamoDB were not impacted by this event. However, access to Amazon S3 buckets and DynamoDB tables via VPC Endpoints was impaired during this event.
What does this even mean? I bet most people use DynamoDB from within a VPC, in a Lambda or on EC2.
Your application can call DynamoDB via the public endpoint (dynamodb.us-east-1.amazonaws.com). But if you're in a VPC (i.e. practically all AWS workloads in 2021), you have to route out to the internet (via public subnets or a NAT gateway, I think) to make that call.
VPC Endpoints create a DynamoDB endpoint in your VPC, from the documentation:
"When you create a VPC endpoint for DynamoDB, any requests to a DynamoDB endpoint within the Region (for example, dynamodb.us-west-2.amazonaws.com) are routed to a private DynamoDB endpoint within the Amazon network. You don't need to modify your applications running on EC2 instances in your VPC. The endpoint name remains the same, but the route to DynamoDB stays entirely within the Amazon network, and does not access the public internet."
From within a VPC, you can either access DynamoDB via its public internet endpoints (eg, dynamodb.us-east-1.amazonaws.com, which routes through an Internet Gateway attachment in your VPC), or via a VPC endpoint for dynamodb that's directly attached to your VPC. The latter is useful in cases where you want a VPC to not be connected to the internet at all, for example.
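If it helps to see what wiring one up looks like, here's a minimal boto3 sketch (the VPC and route table IDs are placeholders; DynamoDB and S3 use Gateway-type endpoints):

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # A Gateway endpoint for DynamoDB: adds routes so requests to
    # dynamodb.us-east-1.amazonaws.com stay on the Amazon network
    # instead of going out through an internet/NAT gateway.
    resp = ec2.create_vpc_endpoint(
        VpcEndpointType="Gateway",
        VpcId="vpc-0123456789abcdef0",          # placeholder
        ServiceName="com.amazonaws.us-east-1.dynamodb",
        RouteTableIds=["rtb-0123456789abcdef0"],  # placeholder
    )
    print(resp["VpcEndpoint"]["VpcEndpointId"])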
I am not a fan of AWS due to their substantial market share on cloud computing. But as a software engineer I do appreciate their ability to provide fast turnarounds on root cause analyses and make them public.
This isn't a good example of an RCA - as other commenters have noted, it's outright lying about some issues during the incident, and using creative language to dance around other problems many people encountered.
If you want to dive into postmortems, there are some repos linking other examples
They posted on the status page to try using the alternate region endpoints like us-west.console.Amazon.com (I think) at the time, but not sure if it was a true fix.
I wonder if they could've designed better circuit breakers for situations like this. They're very common in electrical engineering, but I don't think they're as common in software design. Something we should try to design in, precisely for situations like this.
They’re a fairly common design pattern https://en.m.wikipedia.org/wiki/Circuit_breaker_design_patte.... However, they certainly aren’t implemented as often as they should be at service boundaries, which results in these sorts of cascading failures.
Netflix was talking a lot about circuit breakers a few years ago, and had the Hystrix project. Looks like Hystrix is discontinued, so I'm not sure if there are good library solutions that are easy to adopt. Overall I don't see it getting talked about that frequently... beyond just exponential backoff inside a retry loop.
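The core of the pattern is small enough that you don't strictly need a library. A toy sketch (thresholds and timings invented; a real implementation adds proper half-open probing, metrics, per-dependency breakers, etc.):

    import time

    class CircuitBreaker:
        """Toy circuit breaker: trip open after N consecutive failures,
        fail fast while open, and allow a trial call after a cooldown."""

        def __init__(self, max_failures=5, reset_after=30.0):
            self.max_failures = max_failures
            self.reset_after = reset_after
            self.failures = 0
            self.opened_at = None

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_after:
                    raise RuntimeError("circuit open, failing fast")
                self.opened_at = None  # half-open: let one trial call through
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.monotonic()
                raise
            self.failures = 0
            return result

The point being that an unhealthy dependency stops getting hammered and gets a chance to recover, instead of every caller retrying into it.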
One of the big issues mentioned was that one of the circuit breakers they did have (client back-off) didn't function properly. So they did have a circuit breaker in the design, but it was broken.
>At 7:30 AM PST, an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network.
Just curious, is this scaling an AWS job or a client job? Looks like an AWS one from the context. I'm wondering if they are deploying additional data centers or something else?
The reality is that there’s a handful of people in the world that can operate systems at this sheer scale and complexity and I have mad respect for those in that camp.
Some of us are in that camp and are looking at this outage and also pointing out that they continuously fail to accurately update their status dashboard in this and prior outages. Yes, doing what AWS does is hard, and yes, outages /will/ happen; it is no knock on them that this outage occurred. What is a knock is that they haven't communicated honestly while the outage was ongoing.
They address that in the post, and between Twitter, HN and other places there wasn’t anyone legit questioning if something was actually broken. Contacts at AWS also all were very clear that yes something was going on and being investigated. This narrative that AWS was pretending nothing was wrong just wasn’t true based on what we saw.
Isn't this the equivalent of responding to someone complaining about their meal in a restaurant with "I'd like to see you do better"?
The point of eating at a restaurant is that I can't/don't want to cook. Likewise, I use AWS because I want them to do the hard work and I'm willing to pay for it.
How does that abrogate my right to complain if it goes badly (regardless of whether I could/couldn't do it myself)?
I think the distinction is you can say "I pay good money for you to do it properly and how dare you go down on me" but you become an "armchair infrastructure engineer" when you try and explain how you would have avoided the outage because you don't have the whole picture (especially based on a very carefully worded PR approved blog post).
Simple: we were already patched. Company sizes and the number of attack surfaces vary; 22 hours is plenty of time for an input-string filter on a centrally controlled endpoint and a dependency bump with the right CI pipeline.
The attack surface is quite a bit larger than many realize. I recently had a conversation with a person who wasn't at a Java shop so wasn't worried... until he said "oh, wait, ElasticSearch is vulnerable too?"
My company uses AWS. We had significant degradation for many of their APIs for over six hours, having a substantive impact on our business. The entire time their outage board was solid green. We were in touch with their support people and knew it was bad but were under NDA not to discuss it with anyone.
Of course problems and outages are going to happen, but saying they have five nines (99.999) uptime as measured by their "green board" is meaningless. During the event they were late and reluctant to report it and its significance. My point is that they are wrongly incentivized to keep the board green at all costs.
I mean, not to defend them too strongly, but literally half of this post mortem is addressing the failure of the Service Dashboard. You can take it on bad faith, but they own up to the dashboard being completely useless during the incident.
Off the top of my head, this is the third time they've had a major outage where they've been unable to properly update the status page. First we had the S3 outage, where the yellow and red icons were hosted in S3 and unable to be accessed. Second we had the Kinesis outage, which snowballed into a Cognito outage, so they were unable to login into the status page CMS. Now this.
They "own up to it" in their postmortems, but after multiple failures they're still unwilling to implement the obvious solution and what is widely regarded as best practice: host the status page on a different platform.
Firmly agreed. I've heard AWS discuss making the status page better – but they get really quiet about actually doing it. In my experience the best/only way to check for problems is to search Twitter for your AWS region name.
Maybe AWS should host their status checks in Azure and vice versa ... Mutually Assured Monitoring :) Otherwise it becomes a problem of who will monitor the monitor
My company is quite well known for blameless post-mortems, but if someone failed to implement improvements after three subsequent outages, they would be moved to a position more appropriate for their skills.
That’s not what’s being asked though - in all 3 events, they couldn’t manually update it. It’s clearly not a priority to fix it for even manual alerts.
>Be capable of spinning up virtualized instances (including custom drive configurations, network stacks, complex routing schemes, even GPUs) with a simple API call
But,
>Be incapable of querying the status of such things
As others mention, you can do it manually. But it’s also not that hard to do automatically: literally just spin up a “client” of your service and make sure it works.
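i.e. a synthetic canary. A minimal sketch, assuming a hypothetical health endpoint (a real one should exercise an actual user path and feed a metrics/alerting system rather than print):

    import time
    import urllib.request

    # Hypothetical endpoint -- in practice, hit something a real customer hits.
    URL = "https://api.example.com/healthz"

    while True:
        try:
            with urllib.request.urlopen(URL, timeout=5) as resp:
                ok = resp.status == 200
        except Exception:
            ok = False
        # Feed this into whatever drives the status page / alerting.
        print(f"{time.strftime('%H:%M:%S')} canary {'PASS' if ok else 'FAIL'}")
        time.sleep(60)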
Eh, the colored icons not loading is not really the same thing as incorrectly reporting that nothing’s wrong. Putting the status page on separate infra would be good practice, though.
The AWS summary says: "As the impact to services during this event all stemmed from a single root cause, we opted to provide updates via a global banner on the Service Health Dashboard, which we have since learned makes it difficult for some customers to find information about this issue"
This seems like bad faith to me based on my experience when I worked for AWS. As they repeated many times at Re:Invent last week, they've been doing this for 15+ years. I distinctly remember seeing banners like "Don't update the dashboard without approval from <importantSVP>" on various service team runbooks. They tried not to say it out loud, but there was very much a top-down mandate for service teams to make the dashboard "look green" by:
1. Actually improving availability (this one is fair).
2. Using the "Green-I" icon rather than the blue, orange, or red icons whenever possible.
3. Building out the "Personal Health Dashboard" so they can post about many issues there, without having to acknowledge them publicly.
Eh, I mean at least when DeSantis was lower on the food chain than he is now, the normal directive was that EC2 status wasn't updated unless a certain X percent of hosts were affected. Which is reasonable, because a single rack going down isn't relevant enough to constitute a massive problem with EC2 as a whole.
Multiple AWS employees have acknowledged it takes VP approval to change the status color of the dashboard. That is absurd and it tells you everything you need to know. The status page isn't about accurate information, it's about plausible deniability and keeping AWS out of the news cycle.
When is the last time they had a single service outage in a single region? How about in a single AZ in a single region? Struggling to find a lot of headline stories? I'm willing to bet it's happened in the last 2 years and yet I don't see many news articles about it... so I'd say if the only thing that hits the front page is a complete region outage for 6+ hours, it's working out pretty well for them.
Um, so you think straight-up lying is good politics?
Any 7-year old knows that telling a lie when you broke something makes you look better superficially, especially if you get away with it.
That does not mean that we should think it is a good idea to tell lies when you break things.
It sure as hell isn't smart politics in my book. It is straight-up disqualifying to do business with them. If they are not honest about the status or amount of service they are providing, how is that different than lying about your prices?
Would you go to a petrol station that posted $x.00/gallon, but only delivered 3 quarts for each gallon shown on the pump?
We're being shortchanged and lied to. Fascinating that you think it is good politics on their part.
AWS spends a lot of time thinking about this problem in service to their customers.
How do you reduce the status of millions of machines, the software they run, and the interconnected-ness of those systems to a single graphical indicator?
It would be dumb and useless to turn something red every single time anything had a problem. Literally there are hundreds of things broken every minute of every day. On-call engineers are working around the clock on these problems. Most of the problems either don’t affect anyone due to redundancy or affect only a tiny number of customers- a failed memory module or top-of-rack switch or a random bit flip in one host for one service.
Would it help anyone to tell everyone about all these problems? People would quickly learn to ignore it as it had no bearing on their experience.
What you’re really arguing is that you don’t like the thresholds they’ve chosen. That’s fine, everyone has an opinion. The purpose of health dashboards like these is mostly so that customers can quickly get an answer to “is it them or me” when there’s a problem.
As others on this thread have pointed out, AWS has done a pretty good job of making the SHD align with the subjective experience of most customers. They also have personal health dashboards unique to each customer, but I assume thresholding is still involved.
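To make the threshold point concrete, the decision roughly reduces to something like this (numbers invented purely for illustration, not AWS's actual thresholds):

    def dashboard_color(error_rate, customers_impacted):
        # Invented thresholds just to show the shape of the decision.
        if error_rate > 0.20 or customers_impacted > 0.10:
            return "red"
        if error_rate > 0.02 or customers_impacted > 0.01:
            return "yellow"
        return "green"  # includes plenty of broken hosts nobody notices

    print(dashboard_color(error_rate=0.005, customers_impacted=0.0001))  # green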
>How do you reduce the status of millions of machines, the software they run, and the interconnected-ness of those systems to a single graphical indicator?
A good low-hanging fruit would be: when the outage is significant enough to have reached the media, turn the dot red.
Dishonesty is what we're talking about here. Not the gradient when you change colors. This is hardly the first major outage where the AWS status board was a bald-faced lie. This deserves calling out and shaming the responsible parties, nothing less, certainly not defense of blatantly deceptive practices that most companies not named Amazon don't dip into.
>>How do you reduce the status of millions of machines, the software they run, and the interconnected-ness of those systems to a single graphical indicator?
There's a limitless variety of options, and multiple books written about it. I can recommend the series "The Visual Display of Quantitative Information" by Edward Tufte, for starters.
>> Literally there are hundreds of things broken every minute of every day. On-call engineers are working around the clock...
Of course there are, so a single R/Y/G indicator is obviously a bad choice.
Again, they could at any time easily choose a better way to display this information, graphs, heatmaps, whatever.
More importantly, the one thing that should NOT be chosen is to have a human in the loop of displaying status, as this inserts both delay and errors.
Worse yet, to make it a VP-level decision, as if it were a $1 million+ purchase, and then to set the policy to keep it green when half a continent is down... ummm, that is WAAAYYY past any question of "threshold" - it is a premeditated, designed-in, systemic lie.
>>You don’t know what you’re talking about.
Look in the mirror, dude. While I haven't worked inside AWS, I have worked on complex networked software systems and I well understand the issues of thousands of HW/SW components in multiple states. More importantly, perhaps it's my philosophy degree, but I can sort out WHEN (e.g., here) the problem is at another level altogether. It is not the complexity of the system that is the problem, it is the MANAGEMENT decision to systematically lie about that complexity. Worse yet, it looks like those everyday lies are what goes into their claims of "99.99+% uptime!!", which are evidently false. The problem is at the forest level, and you don't even want to look at the trees because you're stuck in the underbrush telling everyone else they are clueless.
That's only useful when it's an entire region; there are minor issues in smaller services that cause problems for a lot of people which they don't reflect on their status board, and not everyone checks Twitter or HN all the time while at work.
it's a bullshit board used to fudge numbers when negotiating SLAs
like I don't care that much, hell my company does the same thing; but let's not get defensive over it
So -- ctrl-f "Dash" only produces four results, and it's hidden away at the bottom of the page. It's false to claim that even 20% of the post mortem is addressing the failure of the dashboard.
The problem is that the dashboard requires VP approval to be updated. Which is broken. The dashboard should be automatic. The dashboard should update before even a single member of the AWS team knows there's something wrong.
Is it typical for orgs (the whole spectrum: IT departments everywhere, telecom, SaaS, maybe even status of non-technical services) to have automatic downtime messaging that doesn't need a human set of eyes to approve it first?
> You can take it on bad faith, but they own up to the dashboard being completely useless during the incident.
Let's not act like this is the first time this has happened. It's bad faith that they do not change, when their promise is that they hire the best to handle infrastructure so you don't have to. It's clearly not the case. Between this and billing, we can easily lay blame and acknowledge lies.
AWS as a business has an enormous (multi-billion-dollar) moral hazard: they have a fantastically strong disincentive to update their status dashboard to accurately reflect the true nature of an ongoing outage. They use weasel words like "some customers may be seeing elevated errors", which we all know translates to "almost all customers are seeing 99.99% failure rates."
They have a strong incentive to lie, and they're doing it. This makes people dependent upon the truth for refunds understandably angry.
> Why are we doing this folks? What's making you so angry and contemptful?
Because Amazon kills industries. Takes jobs. They do this because they promise they hire the best people, who can do this better than you and for cheaper. And it's rarely true. And then they lie about it when things hit the fan. If you're going to be the best you need to act like the best, and execute like the best. Not build a walled garden that people can't see into, and that's hard to leave.
"Our Support Contact Center also relies on the internal AWS network, so the ability to create support cases was impacted from 7:33 AM until 2:25 PM PST. "
This to me is really bad. Even as a small company, we keep our support infrastructure separate. For a company of Amazon's size, this is a shitty excuse. If I cannot even reach you as a customer for almost 7 hours, that is just nuts. AWS must do better here.
Also, is it true that the outage/status pages are manually updated? If yes, there is no excuse why it was green for that long. If you are manually updating it, please update ASAP.
I know a few tiny ISPs that host their VoIP server and email server outside of their own ASN so that in the event of a catastrophic network event, communication with customers is still possible... Not saying Amazon should do the same, but the general principle isn't rocket science.
We moved our company's support call system to Microsoft Teams when lockdowns were happening, and even that was affected by the AWS outage (along with our SaaS product hosted on AWS).
It turned out our call center supplier had something running on AWS, and it took out our entire phone system. After this situation settles, I'm tempted to ask my supplier to see what they're doing to get around this in the future, but I doubt even they knew that AWS was used further downstream.
AWS operates a lot like Amazon.com, the marketplace, now--you can try to escape it, but it's near impossible. If you want to ban usage of Amazon's services, you're going to find some service (AWS) or even a Shopify site (FBA warehouse) that uses it.
Wasn't this the Bezos directive early on that created AWS? Anything that was created had to be a service with an API. Not allowed to reinvent the wheel. So AWS depends on AWS.
My favourite is when some company migrates their physical servers to virtual machines, including the AD domain controllers. Then the next step is to use AD LDAP authentication for the VM management software.
When there's a temporary outage and the VMs don't start up as expected, the admins can't log on and troubleshoot the platform because the logon system was running on it... but isn't now.
The loop is closed.
You see this all the time, especially with system-management software. They become dependent on the systems they're managing, and vice-versa.
If you care about availability at all, make sure to have physical servers providing basic services like DNS, NTP, LDAP, RADIUS, etc...
My company isn't big enough for us to have any pull but this communication is _significantly_ downplaying the impact of this issue.
One of our auxiliary services that's basically a pass through to AWS was offline nearly the entire day. Yet, this communication doesn't even mention that fact. In fact, it almost tries to suggest the opposite.
Likewise, AWS is reporting S3 didn't have issues. Yet, for a period of time, S3 was erroring out frequently because it was responding so slowly.
This. We're under NDA too on internal support. Our customers know we use AWS, and they go and check the AWS status dashboards and tell us there's nothing wrong, so the inevitable vitriol is always directed at us, which we then have to defend against.
> The entire time their outage board was solid green
Unless you're talking about some board other than the Service Health Dashboard, this isn't true. They dropped EC2 down to degraded pretty early on. I bemusedly noted in our corporate Slack that every time I refreshed the SHD, another service was listed as degraded. Then they added the giant banner at the top. Their slight delay in updating the SHD at the beginning of the outage is mentioned in the article. It was absolutely not all green for the duration of the outage.
That is not true. There were hours before they started annotating any kind of service issues. Maybe from when you noticed there was a problem it appeared to be quick, but the board remained green for a large portion of the outage.
No, it was about an hour. We were aware from the very moment EC2 API error rates began to elevate, around 10:30 Eastern. By 11:30 the dashboard was updating. This timing is mentioned in the article, and it all happened in the middle of our workday on the east coast. The outage then continued for about 7 hours with SHD updates. I suspect we actually both agree on how long it took them to start updating, but I conclude that 1 hour wasn't so bad.
At the large platform company where I work, our policy is if the customer reported the issue before our internal monitoring caught it, we have failed. Give 5 minutes for alerting lag, 10 minutes to evaluate the magnitude of impact, 10 minutes to craft the content and get it approved, 5 minutes to execute the update, adds up to 30 minutes end to end with healthy buffer at each step.
1 hour (52 minutes according to the article) sounds meh. I wonder what their error rate and latency graphs look like from that day.
We saw the timing described where the dashboard updates started about an hour after the problem began (which we noticed immediately since 7:30AM Pacific is in the middle of the day for those of us in Eastern time). I don't know if there was an issue with browser caching or similar but once the updates started everyone here had no trouble seeing them and my RSS feed monitor picked them up around that time as well.
Multiple services I use were totally skunked, and none were ever anything but green.
Sagemaker, for example, was down all day. I was dead in the water on a modeling project that required GPUs. It relied on EC2, but nobody there even thought to update the status? WTF. This is clearly executives incentivized to let a bug persist. This is because the bug is actually a feature for misleading customers and maximizing profits.
I worked at Amazon. While my boss was on vacation I took over for him in the "Launch readiness" meeting for our team's component of our project. Basically, you go to this meeting with the big decision makers and business people once a week and tell them what your status is on deliverables. You are supposed to sum up your status as "Green/Yellow/Red" and then write (or update last week's document) to explain your status.
My boss had not given me any special directions here so I assumed I was supposed to do this honestly. I set our status as "Red" and then listed out what were, I felt, quite compelling reasons to think we were Red. The gist of it was that our velocity was negative. More work items were getting created and assigned to us than we closed, and we still had high priority items open from previous dates. There was zero chance, in my estimation, that we would meet our deadlines, so I called us Red.
This did not go over well. Everyone at the Launch Readiness meeting got mad at me for declaring Red. Our VP scolded me in front of the entire meeting and lectured me about how I could not unilaterally declare our team red. Her logic was, if our team was Red, that meant the entire project was Red, and I was in no position to make that call. Other managers at the meeting got mad at me too because they felt my call made them look bad. For the rest of my manager's absence I had to first check in with a different manager and show him my Launch Readiness status and get him to approve my update before I was allowed to show it to the rest of the group.
For the rest of the time that I went to Launch Readiness I was forbidden from declaring Red regardless of what our metrics said. Our team was Yellow or Green, period.
Naturally, we wound up being over a year late on the deadlines, because, despite what they compelled us to say in those meetings, we weren't actually getting the needed work done. Constant "schedule slips" and adjustments. Endless wasted time in meetings trying to rework schedules that would instantly get blown up again. Hugely frustrating. Still slightly bitter about it.
Anyway, I guess all this is to say that it doesn't surprise me that Amazon is bad about declaring Red, Yellow, or Green in other places too. Probably there is a guy in charge of updating those dashboards who is forbidden from changing them unless they get approval from some high level person and that person will categorically refuse regardless of the evidence because they want the indicators to be Green.
I had a good chuckle reading your comment. This is not unique to Amazon. Unfortunately, status indicators are super political almost everywhere, precisely because they are what is being monitored as a proxy for the actual progress. I think your comment should be mandatory reading for any leader who is holding the kinds of meetings you describe and thinks they are getting an accurate picture of things.
I worked at AMZN and this perfectly captures my experience there with those weekly reviews. I once set a project I was managing as "Red" and had multiple SDMs excoriate me for apparently "throwing them under the bus" even though we had missed multiple timelines and were essentially not going to deliver anything of quality on time. I don't miss this aspect of AMZN!
We have something similar at my big corp company. I think the issue is you went from Green to Red at the flip of a switch. A more normal project goes Green... raise a red flag... if red flags aren't resolved in the next week or two, go to Yellow... In these meetings everyone collaborates on ways to keep you Green, or to get you back to Green if you went Yellow.
In essence - what you were saying is your boss lied the whole time, because how does one go from a presumed positive velocity to negative velocity in a week?
Additionally, assuming you're a dev lead, it's a little surprising that this was your first meeting of this sort. As a dev lead, I didn't always attend them, but my input was always sought on the status.
Sounds like you had a bad manager, and Amazon is filled with them.
Exactly this. If you take your team from green to red without raising flags and asking for help, you will be frowned upon. It’s like pulling the fire alarm at the smell of burning toast. It will piss off people.
1) If you keep status green for 5 years, while not delivering anything, the reality is the folks at the very top (who can come and go) just look at these colors and don't really get into the project UNLESS you say you are red :)
2) Within 1-2 years there is always going to be some excuse for WHY you are late (people changes, scope tweaks, new things to worry about, covid etc)
3) Finally you are 3 years late, but you are launching. Well, the launch overshadows the lateness. I.e., you were green, then you launched; that's all the VP really sees sometimes.
This explicitly supports what most of us assume is going on. I won't be surprised if someone with a (un)vested interest comes along shortly to say that their experience is the opposite and that on their team, making people look bad by telling the truth is expected and praised.
I once had the inverse happen. I showed up as an architect at a pretty huge e-commerce shop. They had a project that had just kicked off and onboarded me to help with planning. They had estimated two months by total finger in the air guessing. I ran them through a sizing and velocity estimation and the result came back as 10 months. I explained this to management and they said "ok". We delivered in about 10 months. It was actually pretty sad that they just didn't care. Especially since we quintupled the budget and no one was counting.
I worked at an Amazon air-shipping warehouse for a couple years, and hearing this confirms my suspicions about the management there. Lower management (supervisors, people actually in the building) were very aware of problems, but the people who ran the building lived out of state, so they only actually went to the building on very rare occasions.
Equipment was constantly breaking down, in ways that ranged from inconvenient to potentially dangerous. Seemingly basic design decisions, like the shape of chutes, were screwed up in mind-boggling ways (they put a right-angle corner partway down each chute, which caused packages to get stuck in the chutes constantly). We were short on equipment almost every day; things like poles to help us un-jam packages were in short supply, even though we could move hundreds of thousands of packages a day. On top of all this, the facility opened with half its sorting equipment, and despite promises that we'd be able to add the rest of the equipment in the summer, during Amazon's slow season...it took them two years to even get started.
And all the while, they demanded ever-increasing package quotas. At first, 120,000 packages/day was enough to raise eyebrows--we broke records on a daily basis in our first holiday rush--but then, they started wanting 200,000, then 400,000. Eventually it came out that the building wouldn't even be breaking even until it hit something like 500,000.
As we scaled up, things got even worse. None of the improvements that workers suggested to management were used, to my knowledge, even simple things like adding an indicator light to freight elevators.
Meanwhile, it eventually became clear that there wasn't enough space to store cargo containers in the building. 737s and the like store packages mostly in these giant curved cargo containers, and we needed them to be locked in place while working around/in them...except that, surprise, the people planning the building hadn't planned any holding areas for containers that weren't in use! We ended up sticking them in the middle of the work area.
Which pissed off the upper management when they visited. Their decision? Stop doing it. Are we getting more storage space for the cans? No. Are we getting more workers on the airplane ramp so we can put these cans outside faster? No. But we're not allowed to store those cans in the middle of the work area anymore, even if there aren't any open stations with working locks. Oh, by the way, the locking mechanisms that hold the cans in place started to break down, and to my knowledge they never actually fixed any of the locks. (A guy from their safety team claims they've fixed like 80 or 90 of the stations since the building opened, but none of the broken locks I've seen were fixed in the 2 years I worked there.)
The problem here sounds like lack of clarity over the meaning of the colours.
In organisations with 100s of in-flight projects, it’s understandable that red is reserved for projects that are causing extremely serious issues right now. Otherwise, so many projects would be red, that you’d need a new colour.
I'd be willing to believe they had some elite high level reason to schedule things this way if I thought they were good at scheduling. In my ~10 years there I never saw a major project go even close to schedule.
I think it's more like the planning people get rewarded for creating plans that look good and it doesn't bother them if the plans are unrealistic. Then, levels of middle management don't want to make themselves look bad by saying they're behind. And, ultimately, everyone figures they can play a kind of schedule-chicken where everyone says they're green or yellow until the last possible second, hoping that another group will raise a flag first and give you all more time while you can pretend you didn't need it.
You might be working at the wrong org? My colleagues routinely take weeks off at a time, sometimes more than a month to travel Europe, go scuba diving in French Polynesia, etc. Work to live, don’t live to work.
Yes, it's a conflict of interest. They have a guarantee on uptime and they decide what their actual uptime is. There's a lot of that now. Most insurances comes to mind.
Even in the post mortem, they are reluctant to admit it:
> While AWS customer workloads were not directly impacted from the internal networking issues described above, the networking issues caused impact to a number of AWS Services which in turn impacted customers using these service capabilities. Because the main AWS network was not affected, some customer applications which did not rely on these capabilities only experienced minimal impact from this event.
Honestly, they should host that status page on CloudFlare or some completely separate infrastructure that they maintain in colo datacenters or something. The only time it really needs to be up is when their stuff isn't working.
Second hand info but supposedly when an outage hits they go all hands on resolving it and no one who knows what's going on has time to update the status board which is why it's always behind.
Yes. Exactly. Pay double. That is what all the blogs say. But no, when a region goes down everything is hosed. Give it a shot! Next time an entire region is down, try out your APIs or give AWS support a call.
No. We don't have an active deployment in that region at all. It killed our build pipeline as ECR was down globally so we had nowhere to push images. Also there was a massive risk as our target environments are EKS so any node failures or scaling events had nowhere to pull images from while ECR was down.
Edit: not to mention APIGW and Cloudwatch APIs were down too.
> The entire time their outage board was solid green. We were in touch with their support people and knew it was bad but were under NDA not to discuss it with anyone.
    # Perl, roughly; the $pain and $gain values are made up -- measure your own.
    my ($pain, $gain) = (10, 1);

    if ($pain > $gain) {
        move_your_shit_and_exit_aws();
    }

    sub move_your_shit_and_exit_aws {
        printf("Dude. We have too much pain. Start moving\n");
        printf("Yeah. That won't happen, so who cares\n");
        exit(1);
    }
This was addressed at least 3 times during this post. I'm not defending them but you're just gaslighting. If you have something to add about the points they raised regarding the status page please do so.
Some orgs really do have lousy availability figures (such as my own, the Navy).
We have an environment we have access to for hosting webpages for one of the highest leaders in the whole Dept of Navy. This environment was DOWN (not "degraded availability" or "high latencies"), literally off the Internet entirely, for CONSECUTIVE WEEKS earlier this year.
Completely incommunicado as well. It just happened to start working again one day. We collectively shrugged our shoulders and resumed updating our part of it.
This is an outlier example but even our normal sites I would classify as 1 "nine" of availability at best.
Yep, things should be made clear to whoever cares about your service's SLA that our SLA is contingent upon AWS's SLA et al. AWS' SLA would be the lower bound :)
All I’m hearing is that you can make up your own availability numbers and get away with it. When you define what it means to be up or down then reality is whatever you say it is.
#gatekeep your real availability metrics
#gaslight your customers with increased error rates
It’s a meme; search it on Twitter. It’s a play on “live, laugh, love” that started as a way for young women to mock pandering displays of female empowerment but has grown in scope so that it can be used to mock anyone.
Did this outage only impact the us-east-1 region? I think I saw other regions affected in some HN comments, but this summary did not mention anything to suggest more than one region was impacted.
There are some AWS services, notably STS, that are hosted in us-east-1. I don’t have anything in us-east-1 but I was completely unable to log into the console to check on the health of my services.
"At 7:30 AM PST, an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network. This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks. These delays increased latency and errors for services communicating between these networks, resulting in even more connection attempts and retries."
So was this in service to something like DynamoDB or some other service?
As in, did some of those extra services that AWS offers for lock-in (and that undermine open source projects with embrace and extend) bomb the mainline EC2 service?
Because this kind of smacks of the "Microsoft hidden APIs" that Office got to use against competitors. Does AWS use "special hardware capabilities" to compete against other companies offering roughly the same service?
Yes, and other cloud providers (Google, Microsoft) probably have similar. Besides special network equipment, they use PCIe accelerators/coprocessors on their hypervisors to offload all non-VM activity (Nitro instances).
Idea: Network devices should be configured to automatically prioritize the same packet flows for the same clients as they served yesterday.
So many overload issues seem to be caused by a single client, in a case where the right prioritization or rate limit rule could have contained any outage, but such a rule either wasn't in place or wasn't the right one due to the difficulty of knowing how to prioritize hundreds of clients.
Using more bandwidth or requests than yesterday should then be handled as capacity allows, possibly with a manually configured priority list, cap, or ratio. But "what I used yesterday" should always be served first. That way, any outage is contained to clients acting differently from yesterday, even if the config isn't perfect.
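A crude sketch of that admission rule, assuming you already keep per-client counters (all names and structure invented):

    # yesterday[client] = requests served yesterday (the protected baseline)
    # today[client]     = requests served so far today
    # have_spare_capacity() = callable you'd supply from your load metrics

    def admit(client, yesterday, today, have_spare_capacity):
        """Serve traffic up to yesterday's level unconditionally; anything
        above that is best-effort, admitted only when there is slack."""
        used = today.get(client, 0)
        if used < yesterday.get(client, 0) or have_spare_capacity():
            today[client] = used + 1
            return True
        return False  # shed only the traffic that exceeds yesterday's pattern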
My favorite sentence: "Our networking clients have well tested request back-off behaviors that are designed to allow our systems to recover from these sorts of congestion events, but, a latent issue prevented these clients from adequately backing off during this event."
I saw pleeeeeenty of untested code at Amazon/AWS. Looking back it was almost like the most important services/code had the least amount of testing. While internal boondoggle projects (I worked on a couple) had complicated test plans and debates about coverage metrics.
It's gotta be a whole thing to even think about how to accurately test this kind of software. Simulating all kinds of hardware failures, network partitions, power failures, or the thousand other failure modes.
Then again they get like $100B in revenue that should buy some decent unit tests.
The most important services get the most attention from leaders who apply the most pressure, especially in the first ~2y of a fast-growing or high-potential product. So people skip tests.
In reality, most successful real-world projects are mostly untested because that's not actually a high-ROI endeavor. It kills me to realize that mediocre code you can hack all over to do unnatural things is generally higher value in phase I than the same code done well in twice the time.
I think the pendulum swing back is going to be designing code that is harder to make bad.
Typescript is a good example of trying to fix this. Rust is even better.
Deno, I think, takes things in a better direction as well.
Ultimately we're going to need systems that just don't let you do "unnatural" things but still maintain a great deal of forward mobility. I don't think that's an unreasonable ask of the future.
Interesting. I also work for a cloud provider. My team works on both internal infrastructure and product features. We take testing coverage very seriously and tie the metrics to the team's perf. Any product feature must have unit tests, integration tests at each layer of the stack, staging tests, production tests and continuous probers in production. But our reliability is still far from satisfactory. Now with your observation at AWS, I start wondering whether the coverage effort and different types of tests really help or not...
> Now with your observation at AWS, I start wondering whether the coverage effort and different types of tests really help or not...
Figuring out ROI for testing is a very tricky problem. I'm glad to hear your team invests in testing. I agree it's hard to know if you're wasting money or not doing enough!
My take is that the overwhelming majority of services insufficiently invest in making testing easy. The services that need to grow fast due to customer demand skip the tests, while the services that aren't going much of anywhere spend way too much time on testing.
I found myself wishing for a few code snippets here. It would be interesting. A lot of the time, code that handles "connection refused" or fast failures doesn't handle network slowness well. I've seen outages from "best effort" services (and the best-effort-ness worked when the services were hard down) because all of a sudden calls that were taking 50 ms were not failing but all taking 1500+ ms. Best effort, but no client-enforced SLAs that were low enough to matter.
Load shedding never kicked in, so things had to be shut down for a bit and then restarted.
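For that slow-but-not-down failure mode, the usual fix is a hard caller-side deadline on the best-effort call; a rough sketch (URL and budget are placeholders):

    import urllib.request

    def best_effort_lookup(url, budget_seconds=0.1):
        """Ask an optional dependency for data, but never let it hold up
        the main request path for longer than the latency budget."""
        try:
            with urllib.request.urlopen(url, timeout=budget_seconds) as resp:
                return resp.read()
        except Exception:
            return None  # degrade: a slow answer is treated like no answer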
Seems their normal operating state might be what is called "metastable": dynamically stable at high throughput, unless/until a brief glitch bumps the system into the low-work-being-finished state, which is also stable.
thundering herd and accidental synchronization for the win
I am sad to say, I find issues like this any time I look at retry logic written by anyone I have not interacted with previously on the topic. It is shockingly common even in companies where networking is their bread and butter.
> It absolutely is difficult. A challenge I have seen is when retries are stacked and callers time out subprocesses that are doing retries.
This is also a general problem with (presumably stateless) concurrent/distributed systems, which irked me when working on such a system, and I still haven’t found meaningful resources for it that aren’t extremely platform/stack/implementation specific:
A concurrent system has some global/network-wide/partitioned-subset-wide error or backoff condition. If that system is actually stateless and receives pushed work, communicating that state to the workers means either pushing the state management back to a less concurrent orchestrator to reprioritize (introducing a huge bottleneck and a fragile single point of failure) or accepting that a lot of failed work will be processed in pathological ways.
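One partial mitigation for the stacked-retry case is to hand every layer a single absolute deadline, so nested retry loops can't multiply the work. A sketch of the idea (not anyone's production client):

    import random
    import time

    def call_with_retries(fn, deadline, base=0.1, cap=5.0):
        """Retry with capped exponential backoff and full jitter, but stop
        as soon as the caller's overall deadline has passed, so retries
        nested inside other retries can't multiply unboundedly."""
        attempt = 0
        while True:
            try:
                return fn()
            except Exception:
                attempt += 1
                sleep = random.uniform(0, min(cap, base * 2 ** attempt))
                if time.monotonic() + sleep >= deadline:
                    raise  # out of budget: surface the failure instead of piling on
                time.sleep(sleep)

    # usage: call_with_retries(do_rpc, deadline=time.monotonic() + 2.0)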
The complexity that AWS has to deal with is astounding. Sure having your main production network and a management network is common. But making sure all of it scales and doesn't bring down the other is what I think they are dealing with here.
It must have been crazy hard to troubleshoot when you are flying blind because all your monitoring is unresponsive. Clearly more isolation, with clearly delineated information exchange points, is needed.
“But AWS has more operations staff than I would ever hope to hire” — a common mantra when talking about using the cloud overall.
I’m not saying I fully disagree. But consolidation of the world’s hosting necessitates a very complicated platform, and these things will happen, either due to that complexity, failures that can’t be foreseen, or good old fashioned Sod’s law.
I know AWS marketing wants you to believe it’s all magic and rainbows, but it’s still computers.
I work for one of the Big 3 cloud providers and it’s always interesting when giving RCAs to customers. The vast majority of our incidents are due to bugs in the “magic” components that allow us to operate at such a massive scale.
Hm. This post does not seem to acknowledge what I saw. Multiple hours of rate-limiting kicking in when trying to talk to S3 (eu-west-1). After the incident everything works fine without any remediations done on our end.
eu-west-1 was not impacted by this event. I’m assuming you saw 503 Slowdown responses, which are non-exceptional and happen for a multitude of reasons.
"Our networking clients have well tested request back-off behaviors that are designed to allow our systems to recover from these sorts of congestion events, but, a latent issue prevented these clients from adequately backing off during this event. "
That is an interesting way to phrase that. A 'well-tested' method, but 'latent issues'. That would imply the 'well-tested' part was not as well-tested as it needed to be. I guess 'latent issue' is the new 'bug'.
It's not transparent at all. A massive number of services were hard down for hours, like SNS, and were never acknowledged on the status page or in this write-up. This honestly reads like they don't truly understand the scope of things affected.
It sounded like the entire management plane was down and potentially part of the "data" plane too (management being config and data being get/put/poll to stateful resources)
I saw in the Reddit thread someone mentioned that all services that auth to other services on the backend were affected (not sure how truthful that is, but it certainly made sense).
I’m glad they published something, and so quickly at that. Ultimately these guys are running a business. There are other market alternatives, multibillion dollar contracts at play, SLAs, etc. It’s not as simple as people think.
"At 7:30 AM PST, an automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behavior from a large number of clients inside the internal network. "
It's basically used for service discovery. At a certain point, you have too many different devices which are potentially changing to identify them by IP. You want some abstraction layer to separate physical devices from services, and DNS lets you do things like advertise different IPs at different times in different network zones.
That "internal network" hosts an awful lot of stuff- it's not just network hardware, but services that mostly use DNS to find each other. Besides that, it's just plain useful for network devices to have names.
"... the networking congestion impaired our Service Health Dashboard tooling from appropriately failing over to our standby region. By 8:22 AM PST, we were successfully updating the Service Health Dashboard."
Sounds like they lost the ability to update the dashboard. HN comments at the time were theorizing it wasn't being updated due to bad policies (need CEO approval) etc. Didn't even occur to me that it might be stuck in green mode.
In the February 2017 S3 outage, AWS was unable to move status icons to the red icon because those images happened to be stored on the servers that went down.
Hasn't this exact thing (something in US-east-1 goes down, AWS loses ability to update dashboard) happened before? I vaguely remember it was one of the S3 outages, but I might be wrong.
In any case, AWS not updating their dashboard is almost a meme by now. Even for global service outages the best you will get is a yellow.
Yeah, probably. I haven't watched it this closely before during an outage. I have no idea if this happens in good faith, bad faith, or (probably) a mix.
Having an internal network like this that everything on the main AWS network so heavily depends on is just bad design. One does not build a stable high-tech spacecraft and then fuel it with coal.
Might seem that way because of what happened, but the main network is probably more likely to fail than the internal network. In those cases, running monitoring on a separate network is critical. EC2 control plane same story.
The entire value proposition for AWS is "migrate your internal network to us so it's more stable with less management." I buy that 100%, and I think you're wrong to assume their main network is more likely to fail than their internal one. They have every incentive to continuously improve it because it's not made just for one client.
It's 2006, you work for an 'online book store' that's experimenting with this cloud thing. Are you going to build a whole new network involving multi-million dollar networking appliances?
Hehe true! But it's real hard to "move fast and not break things" when you're talking about millions of servers, exabytes of data, and tens of thousands of engineers.
> Operators instead relied on logs to understand what was happening and initially identified elevated internal DNS errors. Because internal DNS is foundational for all services and this traffic was believed to be contributing to the congestion, the teams focused on moving the internal DNS traffic away from the congested network paths. At 9:28 AM PST, the team completed this work and DNS resolution errors fully recovered.
It’s quite a bit different… Facebook took themselves offline completely because of a bad BGP update, whereas AWS had network congestion due to a scaling event. DNS relies on the network, so of course it’ll be impacted if networking is also impacted.
No, it wasn't a "bad BGP update". BGP withdrawal of anycast addresses was a desired outcome of a region (serving location) getting disconnected from the backbone. If you'd like to trivialize it, you can say it was a configuration change to the software-defined backbone.