> 8:22 AM PST We are investigating increased error rates for the AWS Management Console.
> 8:26 AM PST We are experiencing API and console issues in the US-EAST-1 Region. We have identified root cause and we are actively working towards recovery. This issue is affecting the global console landing page, which is also hosted in US-EAST-1. Customers may be able to access region-specific consoles going to https://console.aws.amazon.com/. So, to access the US-WEST-2 console, try https://us-west-2.console.aws.amazon.com/
> This issue is affecting the global console landing page, which is also hosted in US-EAST-1
Even this little tidbit is a bit of a wtf for me. Why do they consider it ok to have anything hosted in a single region?
At a different (unnamed) FAANG, we considered it unacceptable to have anything depend on a single region. Even the dinky little volunteer-run thing which ran https://internal.site.example/~someEngineer was expected to be multi-region, and was, because there was enough infrastructure for making things multi-region that it was usually pretty easy.
Every damn Well-Architected Framework includes multi-AZ if not multi-region redundancy, and yet the single access point for their millions of customers is single-region. Facepalm in the form of $100Ms in service credits.
> Facepalm in the form of $100Ms in service credits.
It was also greatly affecting Amazon.com itself. I kept getting sporadic 404 pages, one of them during a purchase. Purchase history wasn't showing the product as purchased and I didn't receive an email, so I repurchased. Still no email; this time the purchase didn't end in a 404, but the product still didn't show up in my purchase history. I have no idea if I purchased anything or not. I have never had an issue purchasing before. Normally I get a confirmation email within 2 or so minutes and the sale is immediately reflected in purchase history. I was unaware of the greater problem at that moment or I would have steered clear at the first 404.
They're also unable to refund Kindle book orders via their website. The "Request a refund" page has a 500 error, so they fall back to letting you request a call from a customer service rep. Initiating this request also fails, so they then fall back to showing a 1-888 number that the customer can call. Of course, when I tried to call, I got "All circuits are busy".
>Facepalm in the form of $100Ms in service credits.
Part of me wonders how much they're actually going to pay out, given that their own status page has only indicated five services with moderate ("Increased API Error Rates") disruptions in service.
That public status page has no bearing on service credits; it's a statically hosted page updated when there's significant public impact. A lot of issues never make it there.
Every AWS customer has a Personal Health Dashboard in the console that is updated much faster and links issues to your affected resources. Additionally, requests for credits are handled by the customer service team, who have even more information.
This point is repeated often, and the incentives for Amazon to downplay the actual downtime are definitely there.
Wouldn't affected companies be incentivized to file a lawsuit over Amazon lying about status? It would be easy to prove and costly for AWS to defend.
I'm guessing Google, on the basis of the recently published (to the public) "I just want to serve 5TB"[1] video. If it isn't Google, then the broccoli man video is still a cogent reminder that unyielding multi-region rigor comes with costs.
It's salient that the video is from 2010. Where I was (not Google), the push to make everything multi-region only really started in, maybe, 2011 or 2012. And, for a long time, making services multi-region actually was a huge pain. (Exception: there was a way to have lambda-like code with access to a global eventually-consistent DB.)
The point is that we made it easier. By the time I left, things were basically just multi-region by default. (To be sure, there were still sharp edges. Services which needed to store data (like, databases) were a nightmare to manage. Services which needed to be in the same region as specific instances of other services, e.g. something which wanted to be running in the same region as wherever the master shard of its database was running, were another nasty case.)
The point was that every service was expected to be multi-region, which was enforced by regular fire drills, and if you didn't have a pretty darn good story about why regular announced downtime was fine, people would be asking serious questions.
And anything external going down for more than a minute or two (e.g. for a failover) would be inexcusable. Especially for something like a bloody login page.
YES! Why do they do that? It's so weird. I will deploy a whole config into us-west-1 or something; but then I need to create a new cert in us-east-1 JUST to let cloudfront answer an HTTPS call. So frustrating.
Apparently it's okay for static data (like a website hosted in S3 behind CloudFront) but seeing non-Australian items in AWS billing and overviews always makes us look twice.
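For what it's worth, the certificate quirk is a hard requirement: CloudFront only accepts ACM certificates issued in us-east-1, no matter where the rest of the stack lives. A minimal boto3 sketch (the domain name is a placeholder):

    import boto3

    # CloudFront can only use ACM certificates that live in us-east-1,
    # so the cert is requested there even if everything else is elsewhere.
    acm = boto3.client("acm", region_name="us-east-1")
    cert = acm.request_certificate(
        DomainName="www.example.com",   # placeholder domain
        ValidationMethod="DNS",
    )
    print(cert["CertificateArn"])       # attach this ARN to the CloudFront distribution

    # The rest of the stack (e.g. an S3 origin) can stay in another region.
    s3 = boto3.client("s3", region_name="us-west-1")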
> Even this little tidbit is a bit of a wtf for me. Why do they consider it ok to have anything hosted in a single region?
They're cheap. HA is something for their customers to pay extra for, not for Amazon itself, which often lies during major outages. They would lose money on HA and they would lose money on acknowledging downtime. They will lie as long as they benefit from it.
I think I know specifically what you are talking about. The actual files an engineer could upload to populate their folder were not multi-region for a long time. The servers were, because they were stateless and that was easy to make multi-region, but the actual data wasn't until we replaced the storage service.
I think the storage was replicated by 2013? Definitely by 2014. It didn't have automated failover, but failover could be done, and was done during the relevant drills for some time.
I think it only stopped when the storage services got to the "deprecated, and we're not bothering to do a failover because dependent teams who care should just use something else, because this one is being shut down any year now". (I don't agree with that decision, obviously ;) but I do have sympathy for the team stuck running a condemned service. Sigh.)
After stuff was migrated to the new storage service (probably somewhere in the 2017-2019 range but I have no idea when), I have no idea how DR/failover worked.
Thank you for the sympathy. If we are talking about the same product then it was most likely backed by 3 different storage services over its lifespan, 2013/2014 was a third party product that had some replication/fail-over baked in, 2016-2019 on my team with no failover plans due to "deprecated, dont bother putting anything important here", then 2019 onward with "fully replicated and automatic failover capable and also less cost-per-GB to replicate but less flexible for the existing use cases".
Yeah, but I still have a different understanding of what "Increased Error Rates" means.
IMHO it should mean that the rate of errors is increased but the service is still able to serve a substantial amount of traffic. If the rate of errors is bigger than, let's say, 90%, that's not an increased error rate, that's an outage.
Some big customers should get together and make an independent org to monitor cloud providers and force them to meet their SLA guarantees without being able to weasel out of the terms like this…
IAM is a "global" service for AWS, where "global" means "it lives in us-east-1".
STS at least has recently started supporting regional endpoints, but most things involving users, groups, roles, and authentication are completely dependent on us-east-1.
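As an illustration, here is a hedged boto3 sketch of opting into a regional STS endpoint instead of the default global one (the region choice is arbitrary):

    import boto3

    # Point STS at a regional endpoint so token calls stay in the region
    # you actually operate in, instead of the global sts.amazonaws.com.
    sts = boto3.client(
        "sts",
        region_name="us-west-2",
        endpoint_url="https://sts.us-west-2.amazonaws.com",
    )
    print(sts.get_caller_identity()["Account"])

The same opt-in can be made config-side with sts_regional_endpoints = regional in ~/.aws/config, which helps for token calls but not for the IAM control plane the parent describes.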
When I brought up the status page (because we're seeing failures trying to use Amazon Pay) it had EC2 and Mgmt Console with issues.
I opened it again just now (maybe 10 minutes later) and it now shows DynamoDB has issues.
If past incidents are anything to go by, it's going to get worse before it gets better. Rube Goldberg machines aren't known for their resilience to internal faults.
As a user of Sagemaker in us-east-1, I deeply fucking resent AWS claiming the service is normal. I have extremely sensitive data, so Sagemaker notebooks and certain studio tools make sense for me. Or DID. After this I'm going back to my previous formula of EC2 and hosting my own GPU boxes.
Sagemaker is not working, I can't get to my work (notebook instance is frozen upon launch, with zero way to stop it or restart it) and Sagemaker Studio is also broken right now.
You don't use AWS because it has better uptime. If you've been around the block enough times, this story has always rung hollow.
Rather, you use AWS because when it is down, it's down for everybody else as well. (Or at least they can nod their head in sympathy for the transient flakiness everybody experiences.) Then it comes back up and everybody forgets about the outage like it was just background noise. This is what's meant by "nobody ever got fired for buying (IBM|Microsoft)". The point is that when those products failed, you wouldn't get blamed for making that choice; in their time they were the one choice everybody excused even when it was an objectively poor choice.
As for me, I prefer hosting all my own stuff. My e-mail uptime is better than GMail, for example. However, when it is down or mail does bounce, I can't pass the buck.
Identify or to publicly acknowledge? Chances are technical teams knew about this and noticed it fairly quickly, they've been working on the issue for some time. It probably wasn't until they identified the root cause and had a handful of strategies to mitigate with confidence that they chose to publicly acknowledge the issue to save face.
I've broken things before and been aware of it, but didn't acknowledge them until I was confident I could fix them. It allows you to maintain an image of expertise to those outside who care about the broken things but aren't savvy to what or why it's broken. Meanwhile you spent hours, days, weeks addressing the issue and suddenly pull a magic solution out of your hat to look like someone impossible to replace. Sometimes you can break and fix things without anyone even knowing which is very valuable if breaking something had some real risk to you.
This sounds very self-blaming. Are you sure that's what's really going through your head? Personally, when I get avoidant like that, it's because of anticipation of the amount of process-related pain I'm going to have to endure as a result, and it's much easier to focus on a fix when I'm not also trying to coordinate escalation policies that I'm not familiar with.
:) I imagine it went like this theoretical Slack conversation:
> Dev1: Pushing code for branch "master" to "AWS API".
> <slackbot> Your deploy finished in 4 minutes
> Dev2: I can't reach the API in east-1
> Dev1: Works from my computer
It's acting odd for me. Shows all green in Firefox, but shows the error in Chrome even after some refreshes. Not sure what's caching where to cause that.
I asked my friend who's a senior dev if he ever uses recursion at work. He said whenever he sees recursion in a code review, he tells the junior dev to knock it off.
I worked at a company that hired an ex-Amazon engineer to work on some cloud projects.
Whenever his projects went down, he fought tooth and nail against any suggestion to update the status page. When forced to update the status page, he'd follow up with an extremely long "post-mortem" document that was really just a long winded explanation about why the outage was someone else's fault.
He later explained that in his department at Amazon, being at fault for an outage was one of the worst things that could happen to you. He wanted to avoid that mark any way possible.
YMMV, of course. Amazon is a big company and I've had other friends work there in different departments who said this wasn't common at all. I will always remember the look of sheer panic he had when we insisted that he update the status page to accurately reflect an outage, though.
It's popular to upvote this during outages, because it fits a narrative.
The truth (as always) is more complex:
* No, this isn't the broad culture. It's not even a blip. These are EXCEPTIONAL circumstances caused by extremely bad teams that - if and when found out - would face dramatic intervention.
* The broad culture is blameless post-mortems. Not whose fault is it. But what was the problem and how to fix it. And one of the internal "Ten commandments of AWS availability" is you own your dependencies. You don't blame others.
* Depending on the service one customer's experience is not the broad experience. Someone might be having a really bad day but 99.9% of the region is operating successfully, so there is no reason to update the overall status dashboard.
* Every AWS customer has a PERSONAL health dashboard in the console that should indicate their experience.
* Yes, VP approval is needed to make any updates on the status dashboard. But that's not as hard as it may seem. AWS executives are extremely operation-obsessed, and when there is an outage of any size are engaged with their service teams immediately.
Yeah no kidding. Is there a ratio of how many people it has to be working for to be in yellow rather than red? Some internal person going “it works on my machine” while 99% of customers are down.
I've always wondered why services are not counted down more often. Is there some sliver of customers who have access to the management console for example?
An increase in error rates - no biggie, any large system is going to have errors. But when 80%+ of customer loads in the region are impacted (across availability zones, for whatever good those do) - that counts as down, doesn't it? Error rates in one AZ - degraded. Multi-AZ failures - down?
The outage dashboard is normally only updated if a certain X percent of hosts per service is down. If the EC2 section were updated every time a rack in a datacenter went down, it would be red 24x7.
It's only updated when a large percentage of customers are impacted, and most of the time this number is less than what the HN echo chamber makes it appear to be.
I mean, sure, there are technical reasons why you would want to buffer issues so they're only visible if something big went down (although one would argue that's exactly what the "degraded" status means).
But if the official records say everything is green, a customer is going to have to push a lot harder to get the credits. There is a massive incentive to "stay green".
Yes, there were. I'm from central Europe and we were at least able to get some pages of the console in us-east-1 - but I assume this was more caching related. Even though the console loaded and worked for listing some entries, we weren't able to post a support case nor view SQS messages, etc.
So I agree that "degraded" is not the proper wording - but it's / was not completely vanished. So... hard to tell what a commonly acceptable wording is here.
From France, when I connect to "my personal health dashboard" in eu-west-3, it says several services are having "issues" in us-east-1.
To your point, for support center (which doesn't show a region) it says:
Description
Increased Error Rates
[09:01 AM PST] We are investigating increased error rates for the Support Center console and Support API in the US-EAST-1 Region.
[09:26 AM PST] We can confirm increased error rates for the Support Center console and Support API in the US-EAST-1 Region. We have identified the root cause of the issue and are working towards resolution.
I'm part of a large org with a large AWS footprint, and we've had a few hundred folks on a call nearly all day. We have only a few workloads that are completely down; most are only degraded. This isn't a total outage, we are still doing business in east-1. Is it "red"? Maybe! We're all scrambling to keep the services running well enough for our customers.
Well you know, like when a rocket explodes, it's a sudden and "unexpected rapid disassembly" or something...
And a cleaner is called a "floor technician".
Nothing really out of the ordinary for a service to be called degraded while "hey, the cache might still be working right?" ... or "Well you know, it works every other day except today, so it's just degradation" :-)
If your statement is true, then why is the AWS status page widely considered useless, and everyone congregates on HN and/or Twitter to actually know what's broken on AWS during an outage?
> Yes, VP approval is needed to make any updates on the status dashboard. But that's not as hard as it may seem. AWS executives are extremely operation-obsessed, and when there is an outage of any size are engaged with their service teams immediately.
My experience generally aligns with amzn-throw, but this right here is why. There's a manual step here and there's always drama surrounding it. The process to update the status page is fully automated on both sides of this step; if you removed VP approval, the page would update immediately. So if the page doesn't update, it is always a VP dragging their feet. Even worse is that lags in this step were never discussed in the postmortem reviews that I was a part of.
It's intentional plausible deniability. By creating the manual step you can shift blame away. It's just like the concept of personal health dashboards, which are designed to keep an asymmetry in reliability information between the host and the client, limiting the client to their personal anecdata. On top of all of this, the metrics are pretty arbitrary.
Let's not pretend businesses haven't been intentionally advertising in deceitful ways for decades if not hundreds of years. This just happens to be current strategy in tech of lying and deceiving customers to limit liability, responsibility, and recourse actions.
To be fair, it's not just Amazon; they just happen to be the largest and most-targeted whipping boy on the block. Few businesses will admit to liability under any circumstances. Liability has to always be assessed externally.
I have in the past directed users here on HN who were complaining about https://status.aws.amazon.com to the Personal Health Dashboard at https://phd.aws.amazon.com/ as well. Unfortunately, even though the account I was logged into this time only has a single S3 bucket in the EU, billed through the EU and with zero direct dependencies on the US, the Personal Health Dashboard was ALSO throwing "The request processing has failed because of an unknown error" messages. Whatever the problem was this time, it had global effects for the majority of users of the Console; the internet noticed for over 30 minutes before either the status page or the PHD were able to report it. There will be no explanation, and the official status page logs will say there were "increased API failure rates" for an hour.
Now I guess it's possible that the 1000s and 1000s of us who noticed and commented are some tiny fraction of the user base, but if that's so you could at least publish a follow-up like other vendors do that says something like: 0.00001% of API requests failed, affecting an estimated 0.001% of our users at the time.
I haven't asked AWS employees specifically about blameless postmortems, but several of them have personally corroborated that the culture tends towards being adversarial and "performance focused." That's a tough environment for blameless debugging and postmortems. Like if I heard that someone has a rain forest tree-frog living happily in their outdoor Arizona cactus garden, I have doubts.
When I was at Google I didn't have a lot of exposure to the public infra side. However I do remember back in 2008 when a colleague was working on routing side of YouTube, he made a change that cost millions of dollars in mere hours before noticing and reverting it. He mentioned this to the larger team which gave applause during a tech talk. I cannot possibly generalize the culture differences between Amazon and Google, but at least in that one moment, the Google culture seemed to support that errors happen, they get noticed, and fixed without harming the perceived performance of those responsible.
I was not informed of his performance reviews. However, given the reception, his work in general, and the attitudes of the team, I cannot imagine this even came up. More likely, the ability to improve routing and actually make YouTube cheaper in the end was, I'm sure, the ultimate positive result.
This was also towards the end of the golden age of Google, when the percentage of top talent was a lot higher.
The entire point of blameless postmortems is acknowledging that the mere existence of an outage does not inherently reflect on the performance of the people involved. This allows you to instead focus on building resilient systems that avoid the possibility of accidental outages in the first place.
I'll play devil's advocate here and say that sometimes these incidents deserve praise because they uncovered an issue that was otherwise unknown previously. Also if the incident had a large negative impact then it shows to leadership how critical normal operation of that service is. Even if you were the cause of the issue, the fact that you fixed it and kept the critical service operating the rest of the time, is worth something good.
Mistakes happen, and a culture that insists too hard that "mistakes shouldn't happen, and so we can't be seen making mistakes" is harmful toward engineering.
How should their performance be evaluated, if not by the rote number of mistakes that can be pinned onto the person, and their combined impact? (Was that the question?)
It’s standard. Career ladder [1] sets expectations for each level. Performance is measured against those expectations. Outages don’t negatively impact a single engineer.
The key difference is the perspective. If reliability is bad that’s an organizational problem and blaming or punishing one engineer won’t fix that.
We knew us-east-1 was unusable for our customers for 45 minutes before Amazon acknowledged anything was wrong _at all_. We made decisions _in the dark_ to serve our customers, because Amazon dragged their feet communicating with us. Our customers were notified after 2 minutes.
Can’t comment on most of your post but I know a lot of Amazon engineers who think of the CoE process (Correction of Error, what other companies would call a postmortem) as punitive
They aren't meant to be, but shitty teams are shitty. You can also create a COE and assign it to another team. When I was at AWS, I had a few COEs assigned to me by disgruntled teams just trying to make me suffer and I told them to pound sand. For my own team, I wrote COEs quite often and found it to be a really great process for surfacing systemic issues with our management chain and making real improvements, but it needs to be used correctly.
Absolutely! Anecdotally, out of all the teams I interacted with in seven years at AWS across multiple arms of the company, I saw only a handful of bad teams. But like online reviews, the unhappy people are typically the loudest. I'm happy they are though, it's always important to push to be better, but I don't believe that AWS is the hellish place to work that HN discourse would lead many to believe.
Because OTHERWISE people might think AMAZON is a DYSFUNCTIONAL company that is beginning to CRATER under its HORRIBLE work culture and constant H/FIRE cycle.
See, AWS is basically turning into a long standing utility that needs to be reliable.
Hey, do most institutions like that completely turn over their staff every three years? Yeah, no.
Great for building it out and grabbing market share.
Maybe not for being the basis of a reliable substrate of the modern internet.
If there are dozens of bespoke systems that keep AWS afloat (disclosure: I have friends who worked there, and there are, and also Conway's law), but the people who wrote them are three generations of HIRE/FIRE ago....
> Maybe not for being the basis of a reliable substrate of the modern internet.
Maybe THEY will go to a COMPETITOR and THINGS MOVE ON if it's THAT BAD. I wasn't sure what the pattern for all caps was, so just giving it a shot there. Apologies if it's incorrect.
I'm an ex-Amazon employee and approve of this response.
It reflects exactly my experience there.
Blameless post-mortem, stick to the facts and how the situation could be avoided/reduced/shortened/handled better for next time.
In fact, one of the guidelines for writing COE (Correction Of Error, Amazon's jargon for Post Mortem) is that you never mention names but use functions and if necessary teams involved:
1. Personal names don't mean anything except to the people who were there on the incident at the time. Someone reading the CoE on the other side of the world or 6 months from now won't understand who did what and why.
2. It stands in the way of honest accountability.
Everybody is very slow to update their outage pages because of SLAs. It's in a company's financial interest to deny outages and when they are undeniable to make them appear as short as possible. Status pages updating slowly is definitely by design.
There's no large dev platform I've used that this wasn't true of their status pages.
> ...you own your dependencies. You don't blame others.
Agreed, teams should invest resources in architecting their systems in a way that can withstand broken dependencies. How do AWS teams account for "core" dependencies (e.g. auth) that may not have alternatives?
> * Depending on the service one customer's experience is not the broad experience. Someone might be having a really bad day but 99.9% of the region is operating successfully, so there is no reason to update the overall status dashboard.
> you own your dependencies. You don't blame others.
I love that. Build your service to be robust. Never assume that dependencies are 100% reliable. Gracefully handle failures. Don't just go hard down, or worse, fail horribly in a way that you can't recover from automatically when your dependencies come back. I've seen a single database outage cause cascading failures across a whole site even though most services had no direct connection to the database. (And recovery had to be done in order of dependency, or else you're playing whack-a-mole for an hour.)
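A minimal sketch of what "gracefully handle failures" can look like in code - a toy circuit breaker (names and thresholds are made up):

    import time

    class CircuitBreaker:
        """Stop hammering a failing dependency, serve a fallback instead,
        and probe again after a cooldown so recovery is automatic once the
        dependency comes back."""

        def __init__(self, max_failures=5, reset_after=30.0):
            self.max_failures = max_failures
            self.reset_after = reset_after
            self.failures = 0
            self.opened_at = None

        def call(self, fn, fallback):
            if self.opened_at is not None:
                if time.monotonic() - self.opened_at < self.reset_after:
                    return fallback()      # circuit open: degrade gracefully
                self.opened_at = None      # cooldown elapsed: allow a probe
                self.failures = 0
            try:
                result = fn()
                self.failures = 0
                return result
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.monotonic()
                return fallback()

Wrap each dependency call in something like breaker.call(lambda: db.query(...), lambda: cached_value) and the service degrades instead of cascading, then heals itself once the dependency recovers.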
> VP approval is needed to make updates on the status board.
Isn't that normal? Updating the status has a cost (reparations to customers if you breach SLA). You don't want some on-call engineer stressing over the status page while trying to recover stuff.
"Yes, VP approval is needed to make any updates on the status dashboard."
If services are clearly down, why is this needed? I can understand the oversight required for a company like Amazon, but this sounds strange to me. If services are clearly down, I want that damn status update right away as a customer.
The person is unlikely to have been authorized as a spokesman for AWS. In many workplaces, doing that is grounds for disciplinary action. Hence, throwaway.
Well, when you talk about blameless post mortems and how they are valued at the company... A throw-away does make me doubt that the culture supports being blameless :)
Well, I understand that, but if you look at his account history it is only pro-Amazon comments. It feels like propaganda more than information, and all I am saying is that the throwaway does not add credibility or a feeling that his opinion are genuine.
Edit: also, could you please stop posting unsubstantive comments generally? You've unfortunately been doing that repeatedly, and we're trying for something else here.
That sounds like the exact opposite of human-factors engineering. No one likes taking blame. But when things go sideways, people are extra spicy and defensive, which makes them clam up and often withhold useful information, which can extend the outage.
No-blame analysis is a much better pattern. Everyone wins. It's about building the system that builds the system. Stuff broke; fix the stuff that broke, then fix the things that let stuff break.
I worked at Walmart Technology. I bravely wrote post mortem documents owning the fault of my team (100+ people), owning it both technically and also culturally as their leader. I put together a plan to fix it and executed it. Thought that was the right thing to do. This happened two times in my 10 year career there.
Both times I was called out as a failure in my performance eval. Second time, I resigned and told them to find a better leader.
That's shockingly stupid. I also worked for a major Walmart IT services vendor in another life, and we always had to be careful about how we handled them, because they didn't always show a lot of respect for vendors.
On another note, thanks for building some awesome stuff -- walmart.com is awesome. I have both Prime and whatever-they're-currently-calling Walmart's version and I love that Walmart doesn't appear to mix SKU's together in the same bin which seems to cause counterfeiting fraud at Amazon.
walmart.com user design sucks. My particular grudge right now is: I'm shopping to go pick up some stuff (and indicate "in store pickup"), and each time I search for the next item, it resets that filter, making me click on it again for each item on my list.
Almost every physical-store-chain company's website makes it way too hard to do the thing I nearly always want out of their interface, which is to search the inventory of the X nearest locations. They all want to push online orders or 3rd-party-seller crap, it seems.
Yes, I assume they intentionally make it difficult to push third-party sellers, where they get to earn bigger profit margins and/or hide their low inventory.
Although, Amazon is the worst, then Walmart (still much better than Amazon since you can at least filter). The others are not bad in my experience.
Walmart.com: am I the only one in the world who can't view their site on my phone? I tried it on a couple of devices and couldn't get it to work. Scaling is fubar. I assumed this would be costing them millions/billions since it's impossible to buy something from my phone right now. S21+ in portrait on multiple browsers.
I believe he means a literal bin. E.g. Amazon takes products from all their sellers and chucks them in the same physical space, so they have no idea who actually sold the product when it's picked. So you could have gotten something from a dodgy 3rd party seller that repackages broken returns, etc, and Amazon doesn't maintain oversight of this.
Ah, whew. That's what I thought. Thanks! I asked because we make warehouse and retail management systems and every vendor or customer seems to give every word their own meanings (e.g., we use "bin" in our discounts engine to be a collection of products eligible for discounts, and "barcode" has at least three meanings depending on to whom you're speaking).
Props to you; Walmart will never realize their loss. Unfortunately. But one day there will be a headline (or even a couple of them) and you will know that if you had been there it might not have happened, and that in the end it is Walmart's customers that will pay the price for that, not their shareholders.
Stories like this are why I'm really glad I stopped talking to that Walmart Technology recruiter a few years ago. I love working for places where senior leadership constantly repeat war stories about "that time I broke the flagship product" to reinforce the importance of blameless postmortems. You can't fix the process if the people who report to you feel the need to lie about why things go wrong.
I firmly believe in the dictum "if you ship it you own it". That means you own all outages. It's not just an operator flubbing a command, or a bit of code that passed review when it shouldn't. It's all your dependencies that make your service work. You own ALL of them.
People spend all this time threat modelling their stuff against malefactors, and yet so often spend no time thinking about the threat model of decay. They don't do it when adding new dependencies (build- or runtime), and therefore are unprepared to handle an outage.
There's a good reason for this, of course: modern software "best practices" encourage moving fast and breaking things, which includes "add this dependency we know nothing about, and which gives an unknown entity the power to poison our code or take down our service, arbitrarily, at runtime, but hey its a cool thing with lots of github stars and it's only one 'npm install' away".
Should I be penalized if an upstream dependency, owned by another team, fails? Did I lack due diligence in choosing to accept the risk that the other team couldn't deliver? These are real problems in the micro-services world, especially since I own UI and there are dozens of teams pumping out services, and I'm at the mercy of all of them. The best I can do is gracefully fail when services don't function in a healthy state.
You and many others here may be conflating two concepts which are actually quite separate.
Taking blame is a purely punitive action and solves nothing. Taking responsibility means it's your job to correct the problem.
I find that the more "political" the culture in the organization is, the more likely everyone is to search for a scapegoat to protect their own image when a mistake happens. The higher you go up in the management chain, the more important vanity becomes, and the more you see it happening.
I have made plenty of technical decisions that turned out to be the wrong call in retrospect. I took _responsibility_ for those by learning from the mistake and reversing or fixing whatever was implemented. However, I never willfully took _blame_ for those mistakes because I believed I was doing the best job I could at the time.
Likewise, the systems I manage sometimes fail because something that another team manages failed. Sometimes it's something dumb and could have easily been prevented. In these cases, it's easy point blame and say, "Not our fault! That team or that person is being a fuckup and causing our stuff to break!" It's harder but much more useful to reach out and say, "hey, I see x system isn't doing what we expect, can we work together to fix it?"
Every argument I have on the internet is between prescriptive and descriptive language.
People tend to believe that if you can describe a problem that means you can prescribe a solution. Often times, the only way to survive is to make it clear that the first thing you are doing is describing the problem.
After you do that, and it's clear that's all you are doing, then you follow up with a prescriptive description where you place clearly what could be done to manage a future scenario.
If you don't create this bright line, you create a confused interpretation.
My comment was made from the relatively simpler entrepreneurial perspective, not the corporate one. Corp ownership rests with people in the C-suite who are social/political lawyer types, not technical people. They delegate responsibility but not authority, because they can hire people, even smart people, to work under those conditions. This is an error mode where "blame" flows from those who control the money to those who control the technology. Luckily, not all money is stupid so some corps (and some parts of corps) manage to function even in the presence of risk and innovation failures. I mean the whole industry is effectively a distributed R&D budget that may or may not yield fruit. I suppose this is the market figuring out whether iterated R&D makes sense or not. (Based on history, I'd say it makes a lot of sense.)
I wish you wouldn't talk about "penalization" as if it was something that comes from a source of authority. Your customers are depending on you, and you've let them down, and the reason that's bad has nothing to do with what your boss will do to you in a review.
The injustice that can and does happen is that you're explicitly given a narrow responsibility during development, and then a much broader responsibility during operation. This is patently unfair, and very common. For something like a failed uService you want to blame "the architect" that didn't anticipate these system level failures. What is the solution? Have plan b (and plan c) ready to go. If these services don't exist, then you must build them. It also implies a level of indirection that most systems aren't comfortable with, because we want to consume services directly (and for good reason) but reliability requires that you never, ever consume a service directly, but instead from an in-process location that is failure aware.
This is why reliable software is hard, and engineers are expensive.
Oh, and it's also why you generally do NOT want to defer the last build step to runtime in the browser. If you start combining services on both the client and server, you're in for a world of hurt.
You get hit by a car and injured. The accident is the other driver's fault, but getting to the ER is your problem. The other driver may help and call an ambulance, but they might not even be able to help you if they also got hurt in the car crash.
Say during due diligence two options are uncovered: use an upstream dependency owned by another team, or use that plus a 3P vendor for redundancy. Implementing parallel systems costs 10x more than the former and takes 5x longer. You estimate a 0.01% chance of serious failure for the former, and 0.001% for the latter.
Now say you're a medium sized hyper-growth company in a competitive space. Does spending 10 times more and waiting 5 times longer for redundancy make business sense? You could argue that it'd be irresponsible to over-engineer the system in this case, since you delay getting your product out and potentially lose $ and ground to competitors.
I don't think a black and white "yes, you should be punished" view is productive here.
If it's a brand new RISC-V CPU that was just released 5 minutes ago, and nobody has really tested it, then yes.
If it's a standard CPU that everybody else uses, and it's not known to be bad, then no.
Same for software. Is it OK to have a dependency on AWS services? Their history shows yes. A dependency on a brand new SaaS product? Nothing mission critical.
Or npm/crates/pip packages. Packages that have been around and been steadily maintained for a few years, and have active users, are worth checking out. Some random project from a single developer? Consider vendoring (and owning, if necessary) it.
You choose the CPU and you choose what happens in a failure scenario. Part of engineering is making choices that meet the availability requirements of your service. And part of that is handling failures from dependencies.
That doesn't extend to ridiculous lengths but as a rule you should engineer around any single point of failure.
I think this is why we pay for support, with the expectation that if their product inadvertently causes losses for you they will work fast to fix it or cover the losses.
Yes? If you are worried about CPU microcode failing, then you do a NASA and have multiple CPU architectures doing calculations in a voting block. These are not unsolved problems.
JPL goes further and buys multiple copies of all hardware and software media used for ground systems, and keeps them in storage "just in case". It's a relatively cheap insurance policy against the decay of progress.
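The "voting block" idea is conceptually tiny; a sketch in Python, assuming the redundant units return comparable (hashable) values:

    from collections import Counter

    def vote(results):
        # Majority vote over redundant computations (e.g. the same calculation
        # run on independent hardware): a single faulty unit gets outvoted,
        # and the absence of a majority is surfaced as an error.
        value, count = Counter(results).most_common(1)[0]
        if count * 2 <= len(results):
            raise RuntimeError("no majority agreement among redundant units")
        return value

    # vote([unit_a(), unit_b(), unit_c()]) -> the agreed-upon result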
Ok, let's take an organization, let's call them, say Ammizzun. Totally not Amazon. Let's say you have a very aggressive hire/fire policy which worked really well in rapid scaling and growth of your company. Now you have a million odd customers highly dependent on systems that were built by people that are now one? two? three? four? hire/fire generations up-or-out or cashed-out cycles ago.
So.... who owns it if the people that wrote it are lllloooooonnnnggg gone? Like, not just long gone one or two cycles ago so some institutional memory exists. I mean, GONE.
A lot can go wrong as an organization grows, including loss of knowledge. At amazon "Ownership" officially rests with the non-technical money that owns voting shares. They control the board who controls the CEO. "Ownership" can be perverted to mean that you, a wage slave, are responsible for the mess that previous ICs left behind. The obvious thing to do in such a circumstance is quit (or don't apply). It is unfair and unpleasant to be treated in a way that gives you responsibility but no authority, and to participant in maintaining (and extending) that moral hazard, and as long as there are better companies you're better off working for them.
I worked on a project like this in government for my first job. I was the third butt in that seat in a year. Everyone associated with project that I knew there was gone by one year from my own departure date.
They are now on the 6th butt in that seat in 4 years. That poor fellow is entirely blameless for the mess that accumulated over time.
Having individuals own systems seems like a terrible practice. You're essentially creating a single point of failure if only one person understands how the system works.
if I were a black hat I would absolutely love GitHub and all the various language-specific package systems out there. giving me sooooo many ways to sneak arbitrary tailored malicious code into millions of installs around the world 24x7. sure, some of my attempts might get caught, or slip through but not lead to a valuable outcome for me. but that percentage that does? can make it worth it. it's about scale and a massive parallelization of infiltration attempts. logic similar to the folks blasting out phishing emails or scam calls.
I love the ubiquity of third-party software from strangers, and the lack of bureaucratic gatekeepers. but I also hate it in ways. and not enough people know about the dangers of this second thing.
And yet, oddly enough, the Earth continues to spin and the internet continues to work. I think the system we have now is necessarily the system that must exist (in this particular case, not in all cases). Something more centralized is destined to fail. And, while the open source nature of software introduces vulnerabilities, it also fixes them.
> And, while the open source nature of software introduces vulnerabilities it also fixes them.
dat gap tho... which was my point. smart black hats will be exploiting this gap, at scale. and the strategy will work because the majority of folks seem to be either lazy, ignorant or simply hurried for time.
and btw your 1st sentence was rude. constructive feedback for the future
When working on CloudFiles, we often had monitoring for our limited dependencies that was better than their own monitoring. Don't just know what your stuff is doing, but what your whole dependency ecosystem is doing, and know when it all goes south. It also helps to learn where and how you can mitigate some of those dependencies.
This. We found very big, serious issues with our anti-DDOS provider because their monitoring sucked compared to ours. It was a sobering reality check when we realized that.
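As a rough illustration of monitoring a dependency yourself rather than trusting its status page, a toy probe (the URL, window, and threshold are placeholders):

    import time
    import urllib.request

    DEPENDENCY_URL = "https://dependency.example.com/health"   # placeholder

    def probe(url, timeout=2.0):
        # True if the dependency answered with a 2xx within the timeout.
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return 200 <= resp.status < 300
        except Exception:
            return False

    def watch(window=60, error_threshold=0.2):
        # Alarm on the error rate *we* observe over the last `window` probes,
        # regardless of what the vendor's dashboard says.
        results = []
        while True:
            results.append(probe(DEPENDENCY_URL))
            results = results[-window:]
            error_rate = 1 - sum(results) / len(results)
            if error_rate > error_threshold:
                print(f"ALARM: dependency error rate {error_rate:.0%}")
            time.sleep(1)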
It's also a nightmare for software preservation. There's going to be a lot from this era that won't be usable 80 years from now because everything is so interdependent and impossible to archive. It's going to be as messy and irretrievable as the Web pre Internet Archive + Wayback are.
I don't think engineers can believe in no-blame analysis if they know it'll harm career growth. I can't unilaterally promote John Doe, I have to convince other leaders that John would do well the next level up. And in those discussions, they could bring up "but John has caused 3 incidents this year", and honestly, maybe they'd be right.
Would they? Having 3 outages in a year sounds like an organization problem. Not enough safeguards to prevent very routine human errors. But instead of worrying about that we just assign a guy to take the fall
If you work in a technical role and you _don't_ have the ability to break something, you're unlikely to be contributing in a significant way. Likely that would make you a junior developer whose every line of code is heavily scrutinized.
Engineers should be experts and you should be able to trust them to make reasonable choices about the management of their projects.
That doesn't mean there can't be some checks in place, and it doesn't mean that all engineers should be perfect.
But you also have to acknowledge that adding all of those safeties has a cost. You can be a competent person who requires fewer safeties or less competent with more safeties.
The tactical point is to remove sharp edges. E.g., there's a tool that optionally takes a region argument:
network_cli remove_routes [--region us-east-1]
Blaming the operator that they should have known that running
network_cli remove_routes
will take down all regions because the region wasn't specified is exactly the kind of thing being called out here.
All of the tools need to not default to breaking the world. That is the first and foremost thing being pushed. If an engineer is remotely afraid to come forwards (beyond self-shame/judgement) after an incident, and say "hey, I accidentally did this thing", then the situation will never get any better.
That doesn't mean that engineers don't have the ability to break things, but it means it's harder (and very intentionally so) for a stressed out human operator to do the wrong thing by accident. Accidents happen. Do you just plan on never getting into a car accident, or do you wear a seat belt?
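To make the "remove sharp edges" point concrete, a sketch of the safer default (network_cli here is hypothetical, as in the example above):

    import argparse
    import sys

    # The tool refuses to touch every region unless the operator explicitly
    # asks for that blast radius.
    parser = argparse.ArgumentParser(prog="network_cli")
    sub = parser.add_subparsers(dest="command", required=True)
    remove = sub.add_parser("remove_routes")
    remove.add_argument("--region", help="single region to act on, e.g. us-east-1")
    remove.add_argument("--all-regions", action="store_true",
                        help="required to intentionally act on every region")

    args = parser.parse_args()
    if args.command == "remove_routes":
        if not args.region and not args.all_regions:
            sys.exit("refusing to act on all regions by default; "
                     "pass --region <region> or --all-regions")
        # ... perform the (scoped) route removal here ...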
> Which one provides more value to an organization?
Neither, they both provide the same value in the long term.
Senior engineers cannot execute on everything they commit to without having a team of engineers they work with. If nobody trains junior engineers, the discipline would go extinct.
Senior engineers provide value by building guardrails to enable junior engineers to provide value by delivering with more confidence.
Well, if John caused 3 outages and his peers Sally and Mike each caused 0, it's worth taking a deeper look. There's a real possibility he's getting screwed by a messed up org; he could also be doing slapdash work, or he seriously might not understand the seriousness of an outage.
John’s team might also be taking more calculated risks and running circles around Sally and Mike’s teams with respect to innovation and execution. If your organization categorically punishes failures/outages, you end up with timid managers that are only playing defense, probably the opposite of what the leadership team wants.
Worth a look, certainly. Also very possible that this John is upfront about honest postmortems and like a good leader takes the blame, whereas Sally and Mike are out all day playing politics looking for how to shift blame so nothing has their name attached. Most larger companies that's how it goes.
You're not wrong, but it's possible that the organization is small enough that it's just not feasible to have enough safeguards that would prevent the outages John caused. And in that case, it's probably best that John not be promoted if he can't avoid those errors.
Current co is small. We are putting in the safeguards from Day 1. Well, okay technically like day 120, the first few months were a mad dash to MVP. But now that we have some breathing room, yeah, we put a lot of emphasis on preventing outages, detecting and diagnosing outages promptly, documenting them, doing the whole 5-why's thing, and preventing them in the future. We didn't have to, we could have kept mad dashing and growth hacking. But very fortunately, we have a great culture here (founders have lots of hindsight from past startups).
It's like a seed for crystal growth. Small company is exactly the best time to implement these things, because other employees will try to match the cultural norms and habits.
Well, I started at the small company I'm currently at around day 7300, where "source control" consisted of asking the one person who was in charge of all source code for a copy of the files you needed to work on, and then giving the updated files back. He'd write down the "checked out" files on a whiteboard to ensure that two people couldn't work on the same file at the same time.
The fact that I've gotten it to the point of using git with automated build and deployment is a small miracle in itself. Not everybody gets to start from a clean slate.
> I have to convince other leaders that John would do well the next level up.
"Yes, John has made mistakes and he's always copped to them immediately and worked to prevent them from happening again in the future. You know who doesn't make mistakes? People who don't do anything."
You can even make the same error twice, but you'd better have a much better explanation the second time around than you had the first time around, because you already knew that what you did was risky and/or failure prone.
But usually it isn't the same person making the same mistake, usually it is someone else making the same mistake and nobody thought of updating processes/documentation to the point that the error would have been caught in time. Maybe they'll fix that after the second time ;)
Yes. AAR process in the army was good at this up to the field grade level, but got hairy on G/J level staffs. I preferred being S-6 to G-6 for that reason.
There is no such thing as "no-blame" analysis. Even in the best organizations with the best effort to avoid it, there is always a subconscious "this person did it". It doesn't help that these incidents serve as convenient places for others to leverage to climb their own career ladder at your expense.
Cynical/realist take: Take responsibility and then hope your bosses already love you, you can immediately both come with a way to prevent it from happening again, and convince them to give you the resources to implement it. Otherwise your responsibility is, unfortunately, just blood in the water for someone else to do all of that to protect the company against you and springboard their reputation on the descent of yours. There were already senior people scheming to take over your department from your bosses, now they have an excuse.
Yes, and I personally have worked in environments that do just that. They said they didn't, but with management "personalities" plus stack ranking, you know damn well that they did.
The Gervais/Peter Principle is alive and well in many orgs. That doesn't mean that when you have the prerogative to change the culture, you just give up.
I realize that isn't an easy thing to do. Often the best bet is to just jump around till you find a company that isn't a cultural superfund site.
You can work an entire career and maybe enjoy life in one healthy organization in that entire time even if you work in a variety of companies. It just isn't that common, though of course voicing the _ideals_ is very, very common.
> No-blame analysis is a much better pattern. Everyone wins. It's about building the system that builds the system. Stuff broke; fix the stuff that broke, then fix the things that let stuff break.
Yea, except it doesn't work in practice. I work with a lot of people who come from places with "blameless" post-mortem 'culture' and they've evangelized such a thing extensively.
You know what all those people have proven themselves to really excel at? Blaming people.
Ok, and? I don't doubt it fails in places. That doesn't mean that it doesn't work in practice. Our company does it just fine. We have a high trust, high transparency system and it's wonderful.
It's like saying unit tests don't work in practice because bugs got through.
Have you ever considered that the “no-blame” postmortems you are giving credit for everything are just a side effect of living in a high trust, high transparency system?
In other words, “no-blame” should be an emergent property of a culture of trust. It’s not something you can prescribe.
Sometimes, these large companies tack on too much "necessary" incident "remediation" work with arbitrary due-date SLAs that throw a wrench into any ongoing work. And ongoing, strategically defined ""muh high impact"" projects are what get you promoted, not doing incident remediations.
When you get to the level you want, you get to not really give a shit and actually do The Right Thing. However, for all of the engineers clamoring to get out of the intermediate brick-laying trenches, opening an incident can have perverse incentives.
I've worked for Amazon for 4 years, including stints at AWS, and even in my current role my team is involved in LSE's. I've never seen this behavior, the general culture has been find the problem, fix it, and then do root cause analysis to avoid it again.
Jeff himself has said many times in All Hands and in public "Amazon is the best place to fail". Mainly because things will break, it's not that they break that's interesting, it's what you've learned and how you can avoid that problem in the future.
I guess the question is why can't you (AWS) fix the problem of the status page not reflecting an outage? Maybe acceptable if the console has a hiccup, but when www.amazon.com isn't working right, there should be some yellow and red dots out there.
With the size of your customer base, man-years were collectively spent confirming the outage after checking the status page.
Because there's a VP approval step for updating the status page and no repercussions for VPs who don't approve updates in a timely manner. Updating the status page is fully automated on both sides of VP approval. If the status page doesn't update, it's because a VP wouldn't do it.
Haha... This bring back memories. It really depends on the org.
I've had pushback on my postmortems before because of phrasing that could be construed as laying some of the blame on some person/team when it's supposed to be blameless.
And for a long time, it was fairly blameless. You would still be punished with the extra work of writing high quality postmortems, but I have seen people accidentally bring down critical tier-1 services and not be adversely affected in terms of promotion, etc.
But somewhere along the way, it became politicized. Things like the wheel of death, public grilling of teams on why they didn't follow one of the thousands of best practices, etc, etc. Some orgs are still pretty good at keeping it blameless at the individual level, but... being a big company, your mileage may vary.
We're in a situation where the balls of mud made people afraid to touch some things in the system. As experiences and processes have improved we've started to crack back into those things and guess what, when you are being groomed to own a process you're going to fuck it up from time to time. Objectively, we're still breaking production less often per year than other teams, but we are breaking it, and that's novel behavior, so we have to keep reminding people why.
The moment that affects promotions negatively, or your coworkers throw you under the bus, you should 1) be assertive and 2) proof-read your resume as a precursor to job hunting.
Or problems just persisting, because the fix is easy, but explaining it to others who do not work on the system is hard. Esp. justifying why it won't cause an issue, and being told that the fixes need to be done via scripts that will only ever be used once, but nevertheless need to be code reviewed and tested...
I wanted to be proactive and fix things before they became an issue, but such things just drained life out of me, to the point I just left.
The status page is as much a political tool as a technical one. Giving your service a non-green state makes your entire management chain responsible. You don't want to be one that upsets some VPs advancement plans.
When I worked for AMZN (2012-2015, Prime Video & Outbound Fulfillment), attempting to sweep issues under the rug was a clear path to termination. The Correction-Of-Error (COE) process can work wonders in a healthy, data-driven, growth-mindset culture. I wonder if the ex-Amazonian you're referring to did not leave AMZN of their own accord?
Blame deflection is a recipe for repeat outages and unhappy customers.
This is the exact opposite of my experience at AWS. Amazon is all about blameless fact finding when it comes to root cause analysis. Your company just hired a not so great engineer or misunderstood him.
Blameless, maybe, but not repercussion-less. A bad CoE was liable to upend the team's entire roadmap and put their existing goals at risk. To be fair, management was fairly receptive to "we need to throw out the roadmap and push our launch out to the following reinvent", but it wasn't an easy position for teams to be in.
Every incident review meeting I've ever been in starts out like, _"This meeting isn't to place blame..."_, then, 5 minutes later, it turns into the Blame Game.
Sorry. I'm probably to blame because I've posted this a couple times on HN before.
It strikes a nerve with me because it caused so much trouble for everyone around him. He had other personal issues, though, so I should probably clarify that I'm not entirely blaming Amazon for his habits. Though his time at Amazon clearly did exacerbate his personal issues.
I mean, it's true at every company I've ever worked at too. If you can lawyer incidents into not being an outage, you avoid like 15 meetings with the business stakeholders about all the things we "have to do" to prevent things like this in the future, which get canceled the moment they realize how much dev/infra time it will take to implement.
Perhaps the reward structure should be changed to incentivize post-mortems. There could be several flaws that run underreported otherwise.
We may run into the problem of everything being documented, including possible deliberate acts, but for a service that relies heavily on uptime, that's a small price to pay for a bulletproof operation.
I find post-mortems interesting to read through especially when it’s not my fault. Most of them would probably be routine to read through but there are occasional ones that make me cringe or laugh.
Post-mortems can sometimes be thought of like safety training. There is a big imbalance of time dedicated to learning proper safety handling just for those few small incidents.
Does Disney still play the "Instructional Videos" series starring Goofy where he's supposed to be teaching you how to do something and instead we learn how NOT to do something? Or did I just date myself badly?
On the retail/marketplace side this wasn't my experience, but we also didn't have any public dashboards. On Prime we occasionally had to refund in bulk, and when it was called for (internally or externally) we would write up a detailed post-mortem. This wasn't fun, but it was never about blaming a person and more about finding flaws in process or monitoring.
I don't think anecdotes like this are even worth sharing, honestly. There's so much context lost here, so much that can be lost in translation. No one should be drawing any conclusions from this post.
> explanation about why the outage was someone else's fault
In my experience, it's rarely clear who was at fault for any sort of non-trivial outage. The issue tends to be at interfaces and involve multiple owners.
Yep I can confirm that. The process when the outage is caused by you is called COE (correction of errors). I was oncall once for two teams because I was switching teams and I got 11 escalations in 2 hours. 10 of these were caused by an overly sensitive monitoring setting. The 11th was a real one. Guess which one I ignored. :)
This fits with everything I've heard about terrible code quality at Amazon and engineers working ridiculous hours to close tickets any way they can. Amazon as a corporate entity seems to be remarkably distrustful of and hostile to its labor force.
I don't know if you're exaggerating or not, but even if it's true, why would anyone show that much emotion about what is, at worst, losing a job?
You certainly have had a lot of relevant-to-today's-top-HN-post stories throughout your career. And I'm less and less surprised to continuously find PragmaticPulp among the top commenters, if not the top one, whose stories resonate with a good chunk of HN.
I am finding that I have a very bimodal response to "He did it". When I write an RCA or just talk about near misses, I may give you enough details to figure out that Tom was the one who broke it, but I'm not going to say Tom on the record anywhere, with one extremely obvious exception.
If I think Tom has a toxic combination of poor judgement, Dunning-Kruger syndrome, and a hint of narcissism (I'm not sure but I may be repeating myself here), such that he won't listen to reason and he actively steers others into bad situations (and especially if he then disappears when shit hits the fan), then I will nail him to a fucking cross every chance I get. Public shaming is only a tool for getting people to discount advice from a bad actor. If it comes down to a vote between my idea and his, then I'm going to make sure everyone knows that his bets keep biting us in the ass. This guy kinda sounds like the Toxic Tom.
What matters, when I turn out to be the cause of the issue, is a bit like some court cases: would a reasonable person in this situation have come to the same conclusion I did? If so, then I'm just the person who lost the lottery. Either way, fixing it for me might fix it for other people. Sometimes the answer is, "I was trying to juggle three things at once and a ball got dropped." If the process dictated those three things, then the process is wrong, or the tooling is wrong. If someone was asking me questions, we should think about being more proactive about deflecting them to someone else or asking them to come back in half an hour. Or maybe I shouldn't be trying to watch training videos while babysitting a deployment to production.
If you never say "my bad" then your advice starts to sound like a lecture, and people avoid lectures so then you never get the whole story. Also as an engineer you should know that owning a mistake early on lets you get to what most of us consider the interesting bit of solving the problem instead of talking about feelings for an hour and then using whatever is left of your brain afterward to fix the problem. In fact in some cases you can shut down someone who is about to start a rant (which is funny as hell because they look like their head is about to pop like a balloon when you say, "yep, I broke it, let's move on to how do we fix it?")
To me, the point of "blameless" PM is not to hide the identity of the person who was closest to the failure point. You can't understand what happened unless you know who did what, when.
"Blameless" to me means you acknowledge that the ultimate problem isn't that someone made a mistake that caused an outage. The problem is that you had a system in place where someone could make a single mistake and cause an outage.
If someone fat-fingers a SQL query and drops your database, the problem isn't that they need typing lessons! If you put a DBA in a position where they have to be typing SQL directly at a production DB to do their job, THAT is the cause of the outage, the actual DBA's error is almost irrelevant because it would have happened eventually to someone.
That's true if the direct cause is an actual mistake, which often is the case but not always.
It may also be that the cause is willful negligence, intentionally circumventing barriers for some personal reason.
And, of course, it may be that the cause is explicitly malicious (e.g. internal fraud, or the intent to sabotage someone) and at least part of the blame directly lies on the culprit, and not only on those who failed to notice and stop them.
Naming someone is how you discover that not everyone in the organization believes in Blamelessness. Once it's out it's out, you can't put it back in.
It's really easy for another developer to figure out who I'm talking about. Managers can't be arsed to figure it out, or at least pretend like they don't know.
And this is exactly why you can expect these headlines to hit with great regularity. These things are never a problem at the individual level, they are always at the level of culture and organization.
That's a real shame, one of the leadership principles used to be "be vocally self-critical" which I think was supposed to explicitly counteract this kind of behaviour.
This may not actually be that bad of a thing. If you think about it, if they're fighting tooth and nail to keep the status page green, that tells you they were probably doing the same at every step of the way before the failure became imminent. Gotta have respect for that.
I can't remember seeing problems be more strongly worded than "Increased Error Rates" or "high error rates with S3 in us-east-1" during the infamous S3 outage of 2017 - and that was after they struggled to even update their own status page because of S3 being down. :)
During the Facebook outage, FB wrote something along the lines of "We noticed that some users are experiencing issues with our apps", even though nothing worked anymore.
I love my Home Assistant setup for this reason. I can even get light bulbs pre-flashed with ESPHome now (my wife was bemused when I was updating the firmware on the lightbulbs).
This is an absolute requirement for all of my smart home devices. Not only in case of an outage but also in case the manufacturer decides to stop supporting my device in the future. My Roomba, litter box, washer/dryer, outlets, lights, and all the rest will keep working even if their internet functionality fails. I would like all of those devices to keep working for at least a decade, and I'd be surprised if all the manufacturers keep supporting tech that old.
My lighting automation uses Insteon currently. My primary req is that they are smart and connected without needing a central controller or a connection to the internet. My switches all understand lighting scenes and can manage those in a P2P manner, without a central controller. The central controller is primarily used when I want to add actual automations vs. scenes. Even the central controller aspect works 100% disconnected from the internet though. I can easily layer on top any automations I like. For instance, I have my exterior lights driven by the angle of the sun. Then, on top of that I can add internet-based triggers for automation as needed. This is where I add in voice assistant triggering of automation and scenes.
Edit:
Just to add, very simple binary automations are even possible without a central controller. Like, I have Insteon motion sensors that trigger a lighting scene when they detect motion. These are super simplistic though.
> Might also be a good idea to run AWS SSO in multiple regions if you're not already doing so.
Is this possible?
> AWS Organizations only supports one AWS SSO Region at a time. If you want to make AWS SSO available in a different Region, you must first delete your current AWS SSO configuration. Switching to a different Region also changes the URL for the user portal. [0]
This seems to indicate you can only have one region.
Good call. I just assumed you could for some reason. I guess the fallback is to devise your own SSO implementation using STS in another region if needed.
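For what it's worth, here is a rough sketch of what that STS fallback can look like from the CLI, assuming you keep a break-glass IAM user (not tied to SSO) and a pre-provisioned role; the account ID and role name below are placeholders, and the key bit is pointing at a regional STS endpoint instead of the global one, which sits behind us-east-1:

    # Prefer regional STS endpoints over the legacy global endpoint
    aws configure set sts_regional_endpoints regional
    export AWS_STS_REGIONAL_ENDPOINTS=regional

    # Assume a pre-provisioned break-glass role against a healthy region
    # (placeholder account ID and role name)
    aws sts assume-role \
      --role-arn arn:aws:iam::123456789012:role/BreakGlassAdmin \
      --role-session-name outage-fallback \
      --region us-west-2 \
      --endpoint-url https://sts.us-west-2.amazonaws.com

IAM is still a global service, so this only helps while the IAM data plane and the regional STS endpoint are healthy; it just takes the SSO portal and the global STS endpoint out of the critical path.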
I think now is a good time to reiterate the danger of companies just throwing all of their operational resilience and sustainability over the wall and trusting someone else with their entire existence. It's wild to me that so many high-performing businesses simply don't have a plan for when the cloud goes down. Some of my contacts are telling me that these outages have teams of thousands of people completely prevented from working, and tens of millions of dollars of profit have simply vanished since the start of the outage this morning. And now institutions like governments and banks are throwing their entire capability into the cloud with no recourse or recovery plan. It seems bad now but I wonder how much worse it might be when no one actually has access to money because all financial traffic is going through AWS and it goes down.
We are incredibly blind to trust just 3 cloud providers with the operational success of basically everything we do.
Why hasn't the industry come up with an alternative?
This seems like an insane stance to have, it's like saying businesses should ship their own stock, using their own drivers, and their in-house made cars and planes and in-house trained pilots.
Heck, why stop at having servers on-site? Cast your own silicon wafers; after all, you don't want Spectre exploits.
Because you are worse at it. If a specialist is this bad, and the market is fully open, then it's because the problem is hard.
AWS has fewer outages in one zone alone than the best self-hosted institutions, your facebooks and pentagons. In-house servers would lead to an insane amount of outage.
And guess what? AWS (and all other IAAS providers) will beg you to use multiple region because of this. The team/person that has millions of dollars a day staked on a single AWS region is an idiot and could not be entrusted to order a gaming PC from newegg, let alone run an in-house datacenter.
edit: I will add that AWS specifically is meh and I wouldn't use it myself; there's better IaaS. But it's insanity to even imagine self-hosted is more reliable than using even the shittiest of IaaS providers.
> This seems like an insane stance to have, it's like saying businesses should ship their own stock, using their own drivers, and their in-house made cars and planes and in-house trained pilots.
> Heck, why stop at having servers on-site? Cast your own silicon wafers; after all, you don't want Spectre exploits.
That's an overblown argument. Nobody is saying that, but it's clear that businesses that maintain their own infrastructure would've avoided today's AWS outage. So just avoiding a single level of abstraction would've kept your company running today.
> Because you are worse at it. If a specialist is this bad, and the market is fully open, then it's because the problem is hard.
The problem is hard mostly because of scale. If you're a small business running a few websites with a few million hits per month, it might be cheaper and easier to colocate a few servers and hire a few DevOps or old-school sysadmins to administer the infrastructure. The tooling is there, and is not much more difficult to manage than a hundred different AWS products. I'm actually more worried about the DevOps trend where engineers are trained purely on cloud infrastructure and don't understand low-level tooling these systems are built on.
> AWS has fewer outages in one zone alone than the best self-hosted institutions, your facebooks and pentagons. In-house servers would lead to an insane amount of outage.
That's anecdotal and would depend on the capability of your DevOps team and your in-house / colocation facility.
> And guess what? AWS (and all other IAAS providers) will beg you to use multiple region because of this. The team/person that has millions of dollars a day staked on a single AWS region is an idiot and could not be entrusted to order a gaming PC from newegg, let alone run an in-house datacenter.
Oh great, so the solution is to put even more of our eggs in a single provider's basket? The real solution would be having failover to a different cloud provider, and the infrastructure changes needed for that are _far_ from trivial. Even with that, there's only 3 major cloud providers you can pick from. Again, colocation in a trusted datacenter would've avoided all of this.
> ...but it's clear that businesses that maintain their own infrastructure would've avoided today's AWS outage.
When Netflix was running its own datacenters in 2008, they had a 3 day outage from a database corruption and couldn't ship DVDs to customers. That was the disaster that pushed CEO Reed Hastings to get out of managing his own datacenters and migrate to AWS.
The flaw in the reasoning that running your own hardware would avoid today's outage is that it doesn't also consider the extra unplanned outages on other days because your homegrown IT team (especially at non-tech companies) isn't as skilled as the engineers working at AWS/GCP/Azure.
> it's clear that businesses that maintain their own infrastructure would've avoided today's AWS outage.
Sure, that's trivially obvious. But how many other outages would they have had instead because they aren't as experienced at running this sort of infrastructure as AWS is?
You seem to be arguing from the a priori assumption that rolling your own is inherently more stable than renting infra from AWS, without actually providing any justification for that assumption.
You also seem to be under the assumption that any amount of downtime is always unacceptable, and worth spending large amounts of time and effort to avoid. For a lot of businesses, systems going down for a few hours every once in a while just isn't a big deal, and is much preferable to spending thousands more on cloud bills, or hiring more full-time staff to ensure X 9s of uptime.
You and GP are making the same assumption that my DevOps engineers _aren't_ as experienced as AWS' are. There are plenty of engineers capable of maintaining an in-house infrastructure running X 9s because, again, the complexity comes from the scale AWS operates at. So we're both arguing with an a priori assumption that the grass is greener on our side.
To be fair, I'm not saying never use cloud providers. If your systems require the complexity cloud providers simplify, and you operate at a scale where it would be prohibitively expensive to maintain yourself, by all means go with a cloud provider. But it's clear that not many companies are prepared for this type of failure, and protecting against it is not trivial to accomplish. Not to mention the conceptual overhead and knowledge required to deal with the provider's specific products, APIs, etc. Whereas maintaining these systems yourself is transferable across any datacenter.
This feels like a discussion that could sorely use some numbers.
What are good examples of
>a small business running a few websites with a few million hits per month, it might be cheaper and easier to colocate a few servers and hire a few DevOps or old-school sysadmins to administer the infrastructure.
Depends, I guess. I'm running an on-prem workstation for our DWH. In the past 2 years it has only gone down for minutes at a time, and only when I decided to take it down for hardware updates.
I have no idea where this narrative came from, but usually the hardware you own is very reliable and doesn't turn off every 15 minutes.
Heck, I use an old T430 for my home server and it still doesn't go down at completely random occasions (but that's a very simplified example, I know).
The one at work, yes, but it's for the internal network, as we are not exposed to the internet. To be honest, though, we are probably one of the few companies that make it a priority that there is always electricity and internet in the office (with a UPS, an electricity generator, and multiple internet providers).
No idea what are the standards for other companies.
There are at least 6 cloud providers I can name that I've used which run their own data centers with capabilities similar to AWS's core products (EC2, Route 53, S3, CloudWatch, RDS):
OVH, Scaleway, Online.net, Azure, GCP, AWS.
Those are the ones I've used in production; I've heard of a dozen more, including big names like HP and IBM, and I assume they can match AWS for the most part.
...
That being said I agree multi tenant is the way to go for reliability. But I was pointing out that in this case even the simple solution of multi region on one provider was not implemented by those affected.
...
As for running your own data center as a small company. I have done it, buying components building servers and all.
Expenses and ISP issues aside, I can't imagine using in house without at least a few outages a year for anywhere near the price of hiring a DevOps person to build a MT solution for you.
If you think you can you've either never tried doing it OR you are being severely underpaid for your job.
Teams competent to build and run reliable in-house infrastructure exist, and they can get you an SLA similar to multi-region AWS or GCP (aka 100% over the last 5 years)... but the price tag has 7 to 8 figures in it.
This is the right answer, I recall studying for the solutions architect professional certification and reading this countless times: outages will happen and you should plan for them by using multi-region if you care about downtime.
It's not AWS's fault here, it's the companies', which assume that it will never be down. In-house servers also have outages; it's a very naive assumption to think that it'd all be better if all of those services were using their own servers.
Facebook doesn't use AWS and they were down for several hours a couple weeks ago, and that's because they have way better engineers than the average company, working on their infrastructure, exclusively.
If all you wanted to do was vacuum the floor you would not have gotten that particular vacuum cleaner.
Clearly you wanted to do more than just vacuum the floor and something like this happening should be weighed with the purchase of the vacuum.
> AWS (and all other IAAS providers) will beg you to use multiple region
will they? because AWS still puts new stuff in us-east-1 before anywhere else, and there is often a LONG delay before those things go to other regions. there are many other examples of why people use us-east-1 so often, but it all boils down to this: AWS encourage everyone to use us-east-1 and discourage the use of other regions for the same reasons.
if they want to change how and where people deploy, they should change how they encourage their customers to deploy.
my employer uses multi-region deployments where possible, and we can't do that anywhere nearly as much as we'd like because of limitations that AWS has chosen to have.
so if cloud providers want to encourage multi-region adoption, they need to stop discouraging and outright preventing it, first.
> AWS still puts new stuff in us-east-1 before anywhere else, and there is often a LONG delay before those things go to other regions.
Come to think of it (far down the second page of comments): Why east?
Amazon is still mainly in Seattle, right? And Silicon Valley is in California. So one would have thought the high-tech hub both of Amazon and of the USA in general is still in the west, not east. So why us-east-1 before anywhere else, and not us-west-1?
Most features roll out to IAD second, third, or fourth. PDX and CMH are good candidates for earlier feature rollout, and usually it's tested in a small region first. I use PDX (us-west-2) for almost everything these days.
I also think that they've been making a lot of the default region dropdowns and such point to CMH (us-east-2) to get folks to migrate away from IAD. Your contention that they're encouraging people to use that region just doesn't ring true to me.
It works really well imo. All the people who want to use new stuff at the expense of stability choose us-east-1; those who want stability at the expense of new stuff run multi-region (usually not in us-east-1).
This argument seems rather contrived. Which feature available in only one region for a very long time has specifically impacted you? And what was the solution?
Quick follow up. I once used an IaaS provider (hyperstreet) that was terrible. Long story short, the provider ended up closing shop and the owner of the company now sells real estate in California.
It was a nightmare recovering data. Even when the service was operational, it was subpar.
Just saying perhaps the “shittiest” providers may not be more reliable.
> In-house servers would lead to an insane amount of outage.
That might be true, but the effects of any given outage would be felt much less widely. If Disney has an outage, I can just find a movie on Netflix to watch instead. But now if one provider goes down, it can take down everything. To me, the problem isn't the cloud per se, it's one player's dominance in the space. We've taken the inherently distributed structure of the internet and re-centralized it, losing some robustness along the way.
> That might be true, but the effects of any given outage would be felt much less widely.
If my system has an hour of downtime every year and the dozen other systems it interacts with and depends on each have an hour of downtime every year, it can be better that those tend to be correlated rather than independent.
I think you're missing the point of the comment. It's not "don't use cloud". It's "be prepared for when cloud goes down". Because it will, despite many companies either thinking it won't, or not planning for it.
> AWS has fewer outages in one zone alone than the best self-hosted institutions, your facebooks and pentagons. In-house servers would lead to an insane amount of outage.
> they usually beg you to use multiple availability zones though
Doesn't help you if what goes down is an AWS global service that you depend on directly, or that other AWS services depend on (and those tend to be tied to us-east-1).
Because the expected value of using AWS is greater than the expected value of self-hosting. It's not that nobody's ever heard of running on their own metal. Look back at what everyone did before AWS, and how fast they ran screaming away from it as soon as they could. Once you didn't have to do that any more, it's just so much better that the rare outages are worth it for the vast majority of startups.
Medical devices, banks, the military, etc. should generally run on their own hardware. The next photo-sharing app? It's just not worth it until they hit tremendous scale.
On second thought, at some point infrastructure like AWS is going to be more reliable than what many banks, medical device operators, etc. can provide themselves. Asking them to stay on their own hardware is asking for that industry to remain slow, bespoke, and expensive.
It is incredibly difficult for non-tech companies to hire quality software and infrastructure engineers - they usually pay less and the problems aren't as interesting.
Agreed, and it'll be a gradual switch rather than a single point, smeared across industries. Likely some operations won't ever go over, but it'll be a while before we know.
People are so quick to forget how things were before behemoths like AWS, Google Cloud, and Azure. Not all things are free and the outage the internet is experiencing is the risk users signed up for.
If you would like to go back to the days of managing your own machines, be my guest. Remember those machines also live somewhere and were/are subject to the same BGP and routing issues we've seen over the past couple of years.
Personally, I'll deal with outages a few times a year for the peace of mind that there's a group of really talented people looking into it for me.
>It seems bad now but I wonder how much worse it might be when no one actually has access to money because all financial traffic is going through AWS and it goes down.
Most financial institutions are implementing their own clouds, I can't think of any major one that is reliant on public cloud to the extent transactions would stop.
>Why hasn't the industry come up with an alternative?
You mean like building datacenters and hosting your own gear?
The agreement is more of a hybrid cloud arrangement with AWS Outposts.
FTA:
>Core to Nasdaq’s move to AWS will be AWS Outposts, which extend AWS infrastructure, services, APIs, and tools to virtually any datacenter, co-location space, or on-premises facility. Nasdaq plans to incorporate AWS Outposts directly into its core network to deliver ultra-low-latency edge compute capabilities from its primary data center in Carteret, NJ.
They are also starting small, with Nasdaq MRX
This is much less about moving NASDAQ (or other exchanges) to be fully owned/maintained by Amazon, and more about wanting to take advantage of development tooling and resources and services AWS provides, but within the confines of an owned/maintained data center. I'm sure as this partnership grows, racks and racks will be in Amazon's data centers too, but this is a hybrid approach.
I would also bet a significant amount of money that when NASDAQ does go full "cloud" (or hybrid, as it were), it won't be in the same US-east region co-mingling with the rest of the consumer web, but with its own redundant services and connections and networking stack.
NASDAQ wants to modernize its infrastructure but it absolutely doesn't want to offload it to a cloud provider. That's why it's a hybrid partnership.
Indeed I can think of several outages in the past decade in the UK of banks' own infrastructure which have led to transactions stopping for days at a time, with the predictable outcomes.
money people think of money very weirdly. when they predict they will get more than they actually get, they call it "loss" for some reason, and when they predict they will get less than they actually get, it's called ... well I don't know what that's called but everyone gets bonuses.
Well, if you're web-based, there's never really been any better alternative. Even before "the cloud", you had to be hosted in a datacenter somewhere if you wanted enough bandwidth to service customers, as well as have somebody who would make sure the power stayed on 24/7. The difference now is that there used to be thousands of ISP's so one outage wouldn't get as much news coverage, but it would also probably last longer because you wouldn't have a team of people who know what to look for like Amazon (probably?) does.
> Why hasn't the industry come up with an alternative?
The cloud is the solution to self managed data centers. Their value proposition is appealing: Focus on your core business and let us handle infrastructure for you.
This fits the needs of most small and medium sized businesses, there's no reason not to use the cloud and spend time and money on building and operating private data centers when the (perceived) chances of outages are so small.
Then companies grow to a certain size where the benefits of having a self-managed data center begin to outweigh not having one. But at this point it becomes more of a strategic/political decision than merely a technical one, so it's not an easy shift.
This appears to be a single region outage - us-east-1. AWS supports as much redundancy as you want. You can be redundant between multiple Availability Zones in a single Region or you can be redundant among 1, 2 or even 25 regions throughout the world.
Multiple-region redundancy costs more both in initial planning/setup as well as monthly fees so a lot of AWS customers choose to just not do it.
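One relatively cheap middle ground: since the Route 53 control plane (the API you would use to flip DNS by hand) is itself tied to us-east-1, the usual pattern is to create health-check-based failover records ahead of time, so the cutover happens in the DNS data plane with no API calls needed during the incident. A minimal sketch, with the hosted zone ID, health check ID, and hostnames all as placeholders:

    cat > failover.json <<'EOF'
    {
      "Comment": "Fail app.example.com over to us-west-2 when the primary health check fails",
      "Changes": [
        { "Action": "UPSERT",
          "ResourceRecordSet": {
            "Name": "app.example.com.", "Type": "CNAME", "TTL": 60,
            "SetIdentifier": "primary", "Failover": "PRIMARY",
            "HealthCheckId": "11111111-2222-3333-4444-555555555555",
            "ResourceRecords": [ { "Value": "app-use1.example.com" } ] } },
        { "Action": "UPSERT",
          "ResourceRecordSet": {
            "Name": "app.example.com.", "Type": "CNAME", "TTL": 60,
            "SetIdentifier": "secondary", "Failover": "SECONDARY",
            "ResourceRecords": [ { "Value": "app-usw2.example.com" } ] } }
      ]
    }
    EOF
    aws route53 change-resource-record-sets \
      --hosted-zone-id ZEXAMPLE123 \
      --change-batch file://failover.json

You still pay for the second region's standby capacity, which is exactly the cost trade-off a lot of customers decide to skip.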
We have or had alternatives - rackspace, linode, digital ocean, in the past there were many others, self hosting is still an option. But the big three just do it better. The alternatives are doomed to fail. If you use anything other than the big three you risk not just more outages, but your whole provider going out of business overnight.
If the companies at the scale you are talking about do not have multi-region and multi service (aws to azure for example) failover that's their fault, and nobody else's.
Companies do not change their whole strategy from a capex-driven traditional self-hosting environment to opex-driven cloud hosting because their IT people are lazy; it is typically an exec-level decision.
> Why hasn't the industry come up with an alternative?
We used to have that; some companies still have the capability and know-how to build and run infrastructure that is reliable and distributed across many hosting providers, from before "cloud" became the norm, but it goes along with "use it or lose it".
Do you think they'd manage their own infra better? Are you suggesting they pay for a fully redundant second implementation on another provider? How much extra cost would that be vs eating an outage very infrequently?
Because the majority of consumers don't know better / don't care and still buy products from companies with no backup plan. Because, really, how can any of us know better until we're burned many times over?
Not sure it helps, but got this update from someone inside AWS a few moments ago.
"We have identified the root cause of the issues in the US-EAST-1 Region, which is a network issue with some network devices in that Region which is affecting multiple services, including the console but also services like S3. We are actively working towards recovery."
That's a copy-paste, we got the same thing from our AWS contact. It's just enough info to confirm there's an issue, but not enough to give any indication on the scope or timeline to resolution.
Internally the rumor is that our CICD pipelines failed to stop bad commits to certain AWS services. This isn’t due to tests but due to actual pipelines infra failing.
We’ve been told to disable all pipelines even if we have time blockers or manual approval steps or failing tests
Ahh, good spot. It does seem that the AWS person I am speaking to has a few more bits than what is shown on the page; they just messaged me the same message, but added:
"All teams are engaged and continuing to work towards mitigation. We have confirmed the issue is due to multiple impaired network devices in the US-EAST-1 Region."
Doesn't sound like they are having a good day there!
It's funny that the first place I go to learn about the outage is Hacker News and not https://status.aws.amazon.com/ (which still reports everything to be "operating normally"...)
I made sure our incident response plan includes checking Hacker News and Twitter for actual updates and information.
As of right now, this thread and one update from a twitter user, https://twitter.com/SiteRelEnby/status/1468253604876333059 are all we have. I went into disaster recovery mode when I saw our traffic dropped to 0 suddenly at 10:30am ET. That was just the SQS/something else preventing our ELB logs from being extracted to DataDog though.
So as of the time you posted this comment, were other services actually down? The way the 500 shows up, and the AWS status page, makes it sound like "only" the main landing page/mgt console is unavailable, not AWS services.
Yes, they are still publishing lies on their status page. In this thread people are reporting issues with many services. I'm seeing periodic S3 PUT failures for the last 1.5 hours.
AWS services are all built against each other so one failing will take down a bunch more which take down more like dominos. Internally there’s a list of >20 “public facing” AWS services impacted.
I always got the impression that downdetector worked by logging the number of times they get a hit for a particular service and using that as a heuristic to determine if something is down. If so, that's brilliant.
When Facebook's properties all went down in October, people were saying that AT&T and other cell phone carriers were also down - because they couldn't connect to FB/Insta/etc. There were even some media reports that cited Downdetector, seemingly without understanding that it is basically crowdsourced and sometimes the crowd is wrong.
I think it's a bit simpler for AWS- there's a big red "I have a problem with AWS" button on that page. You click it, tell it what your problem is, and it logs a report. Unless that's what you were driving at and I missed it, it's early. Too early for AWS to be down :(
Some 3600 people have hit that button in the last ~15 minutes.
Anecdotally, we're seeing a small number of 500s from S3 and SQS, but mostly our service (which is at nontrivial scale, but mostly just uses EC2, S3, DynamoDB, and some basic network facilities including load balancers) seems fine, knock on wood. Either the problem is primarily in more complex services, or it is specific to certain AZs or shards or something.
Definitely not just the console. We had hundreds of thousands of websocket connections to us-east-1 drop at 15:40, and new websocket connections to that region are still failing. (Luckily not a huge impact on our service cause we run in 6 other regions, but still).
No idea, we don't use it. These were websocket connections to processes on ec2, via NLB and cloudfront. Not sure exactly what part of that chain was broken yet.
This whole time I've been seeing intermittent timeouts when checking a UDP service via NLB; I've been wondering if it's general networking trouble or something specifically with the NLB. EC2 hosts are all fine, as far as I can tell.
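One way to narrow that down is to probe the same port through the NLB and directly against a backend instance, and see which path is the one timing out; a rough sketch with made-up hostnames, IP, and port:

    # Through the NLB (placeholder DNS name and port)
    echo ping | nc -u -w2 my-nlb-1234.elb.us-east-1.amazonaws.com 9000
    # Directly against a backend instance (placeholder private IP)
    echo ping | nc -u -w2 10.0.1.23 9000

If only the first one stalls, the trouble is somewhere in the NLB path rather than on the hosts.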
One of my sites went offline an hour ago because the web server stopped responding. I can't SSH into it or get any type of response. The database server in the same region and zone is continuing to run fine though.
It's just a t3a.nano instance since it's a project under development. However, I have a high number of t3a.nano instances in the same region operating as expected. This particular server has been running for years, so although it could be a coincidence it just went offline within minutes of the outage starting, it seems unlikely. Hopefully no hardware failures or corruption, and it'll just need a reboot once I can get access to AWS again.
I can't access anything related to CloudFront, either through the CLI or the console:
$ aws cloudfront list-distributions
An error occurred (HttpTimeoutException) when calling the ListDistributions operation: Could not resolve DNS within remaining TTL of 4999 ms
I see that already-running EC2 instances are doing fine. However, starting stopped instances cannot be done through the AWS SDK due to the HTTP 500 errors, even for the EC2 service. The CLI should be getting the HTTP 500 errors too, since it likely hits the same API as the SDK.
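Not a fix, but when the errors are intermittent rather than total, cranking up retries sometimes gets a call through. A small sketch (the instance ID is a placeholder, and the same environment variables apply to most AWS SDKs, not just the CLI):

    # More attempts, with adaptive (client-side rate-limited) retries
    export AWS_RETRY_MODE=adaptive
    export AWS_MAX_ATTEMPTS=10

    # Still fails if the regional API returns 500 on every attempt
    aws ec2 start-instances \
      --instance-ids i-0123456789abcdef0 \
      --region us-east-1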
My ECS, EC2, Lambda, load balancer, and other services on us-east-1 still function. But these outages can sometimes propagate over time rather than instantly.
I can tell you that some processes are not running, possibly due to SQS or SWF problems. Previous outages of this scale were caused by Kinesis outages. Can't connect via aws login at the cli either since we use SSO and that seems to be down.
I wasn't able to load my amazon.com wishlist, nor the shopping page through the app. Not an aws service specifically, but an amazon service that I couldn't use.
Yeah, there are "global" services which are actually secretly us-east-1 services as that is the region they use for internal data storage and orchestration. I can't launch instances with OpsWorks (not a very widely used service, I'd imagine) even if those instances are in stacks outside of us-east-1. I suspect Route53 and CloudFront will also have issues.
Yeah, I can't log in with our external SAML SSO to our AWS dashboard to manage our us-east-2 resources... because our auth is apparently routed through us-east-1 STS.
You can pick the region for SSO — or even use multiple. Ours is in ap-southeast-1 and working fine — but then the console that it signs us into is only partially working presumably due to dependencies on us-east-1.
Virginia's actual motto is "Sic semper tyrannis". What's more tyrannical than an omnipotent being that will condemn you to eternal torment if you don't worship them and follow their laws.
Virginia and Massachusetts have surprisingly aggressive mottoes (MA is: "By the sword we seek peace, but peace only under liberty", which is really just a fancy way of saying "don't make me stab a tyrant," if you think about it). It probably makes sense, though, given that they came up with them during the revolutionary war.
This is funny, but true. I've been avoiding us-east-1 simply because that's where everyone else is. Spot instances are also less likely to be expensive in less utilized regions.
I live in Ohio and can confirm. If the Earth were destroyed by an asteroid Ohio would be left floating out there somehow holding onto an atmosphere for about ten years.
If your company is shoehorning you into using multiple clouds and learning a dozen products, IAM and CICD dialects simultaneously because "being cloud dependent is bad", I feel bad for you.
Doing one cloud correctly from a current DevSecOps perspective is a multi-year ask. I estimate it takes about 25 people working full time on managing and securing infrastructure per cloud, minimum. This does not include certain matrixed people from legacy network/IAM teams. If you have the people, go for it.
There are so many things that can go wrong with a single provider, regardless of how many availability zones you are leveraging, that you cannot depend on one cloud provider for your uptime if you require that level of availability.
Example: Payment/Administrative issues, rogue employee with access, deprecated service, inter-region routing issues, root certificate compromises... the list goes on and it is certainly not limited to single AZ.
A very good example is that regardless of which of the 85 AZs you are in at AWS, you are affected by this issue right now.
Multi-cloud with the right tooling is trivial. Investing in learning cloud-proprietary stacks is a waste of your investment. You're a clown if you think 25 people internally per cloud are required to "do it right".
There is no such thing as trivially setting up a secure, fully automated cloud stack, much less anything like a streamlined cloud agnostic toolset.
Deprecated services are not the discussion here. We're talking tactical availability, not strategic tools etc.
Rogue employees with access? You mean at the cloud provider or at your company? Still doesn't make sense. Cloud IAM is very difficult in large organizations, and each cloud does things differently.
I worked at fortune 100 finance on cloud security. Some things were quite dysfunctional, but the struggles and technical challenges are real and complex at a large organization. Perhaps you're working on a 50 employee greenfield startup. I'll hesitate to call you a clown as you did me, because that would be rude and dismissive of your experience (if any) in the field.
I advise many fintechs with engineering orgs from 5 to 5000, 2 in the top 100 - none are blindly single-cloud and none have 25 people dedicated to each of their public clouds. The largest is not on any public cloud due to regulation/compliance and has several colocation facilities for their mission-critical systems - they have fewer than 25 people dedicated in the entire netsec org. This is a company that maintains strict PCI-DSS 1 on thousands of servers and thousands of services. If you're employing 25 people per cloud for netsec in a multi-cloud environment, you have some seriously deficient DevOps practices, or your org is 5 figures deep and has been ignoring DevOps best practices while on cloud for half a decade. HashiCorp's entire business revolves around cloud-agnostic toolkits. All clouds speak Kubernetes at this point, and unless you have un-cloudable components in your stack (like root cert key-signing systems on a proprietary appliance), you really should never find yourself in a scenario where you have that many people overseeing infra security on a public cloud. It's been proven time and time again that having too many people managing security is inversely secure.
I meant at least 25 people in the DevSecOps role per cloud. Security experts, network/ops/systems experts, and automation (gitlab and container underlay) support.
K8s is one of a hundred technologies to learn and use, and just because each cloud is supported by terraform, you can't swap a GCP terraform writer over to Azure in a day.
And no bank is without their uncloudable components.
Also, going multi-cloud will introduce more complexity, which leads to more errors and more downtime. I'd rather sit this outage out than deal with a daily risk of downtime because my infrastructure is too smart for its own good.
Depends on the criticality of the service. I mean you're right about adding complexity. But sometimes you can just take your really critical services and make sure it can completely withstand any one cloud provider outage.
While my heroku apps are currently up, I am unable to push new versions.
Logging in to heroku dashboard (which does work), there is a message pointing to this heroku status incident for "Availability issues with upstream provider in the US region": https://status.heroku.com/incidents/2390
How can there be an outage severe enough to be affecting middleman customers like Heroku, but the AWS status page is still all green?!?!
If whoever runs the AWS status page isn't embarrassed, they really ought to be.
AWS management APIs in the us-east-1 region is what is affected. I'm guessing Heroku uses at least the S3 APIs when deploying new versions, and those are failing (intermittently?).
I advise not touching your Heroku setup right now. Even something like trying to restart a dyno might mean it doesn't come back since the slug is probably stored on S3 and that will fail.
The fun thing about these types of outages is seeing all of the people that depend upon these services with no graceful fallback. My Roomba app will not even launch because of the AWS outage. I understand that the app gets "updates" from the cloud. In this case "updates" is usually promotional crap, but whatevs. However, for this to prevent the app from launching at all, so that I can't control my local device, is total BS. If you can't connect to the cloud, fail, move on, and load the app so that local things are allowed to work.
I'm guessing other IoT things suffer from this same short-sightedness as well.
If you did that some clever person would set up their PiHole so that their device just always worked, and then you couldn't send them ads and surveil them. They'd tell their friends and then everyone would just use their local devices locally. Totally irresponsible what you're suggesting.
It's a little harder than blocking the DNS unfortunately. But nonetheless it always brings a smile to my face to see that there's a FOSS frontier for everything.
But this new little box would then be required to connect to the home server to receive updates. Guess what? No updates, no worky!! It's a vicious circle!!! Outages all the way down
>The fun thing about these types of outages is seeing all of the people that depend upon these services with no graceful fallback.
What's a graceful fallback? Switching to another hosting service when AWS goes down? Wouldn't that present another set of complications for a very small edge case, at huge cost?
Usually this refers to falling back to a different region in AWS. It's typical for systems to be deployed in multiple regions due to latency concerns, but it's also important for resiliency. What you call "a very small edge case" is occurring as we speak, and if you're vulnerable to it you could be losing millions of dollars.
AWS itself has a huge single point of failure on us-east-1 region. Usually, if us-east-1 goes down, others soon follow. At that point, it doesn't matter how many regions you're deploying to.
I don't want to minimize the impact from cognito and r53, but that's quite a different scale of failure than the OP implies. It's been a while since I used AWS, but we had multiple regions and saw no impact to our services in other regions the one time that us-east-1 had a major failure. And we used r53.
Perhaps I could've been more precise with my words. What I meant to say is IAM and r53 are two of many critical services that all depend on us-east-1. It goes without saying that if those services go down in us-east-1, the whole AWS is affected. This doesn't just happen "usually." When IAM goes down, AWS experiences major issues across all regions. If you were okay, perhaps you got lucky? Our team has 7 different prod regions and we see multiple regions go down every time a problem of this scale occurs.
If your product requires 100% uptime, you may need to look at backup options or design your product in such a way that it can handle temporary cloud failures.
I heard from someone at <big media company> that they couldn't switch to their fallback region because they couldn't update DNS on Route53. All the console commands and web interface were failing.
There is a company that delivers broadcast video ads to hundreds of TV stations on demand. The ad has to run and run now, so they cannot tolerate failure.
They write the videos to GCS storage in Google Cloud, and to S3 in AWS. Every point of their workflows is checkpointed and cross-referenced across GCP and AWS. If either side drops the ball, the other picks it up.
So yes, you can design a super fault-tolerant system. This company did it because failing to deliver a few ads would mean the loss of major contracts.
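For anyone curious what that looks like at its simplest, here is a sketch of the dual-write idea; the bucket names and object path are placeholders, and the real system presumably also checkpoints which copies landed so the reader side knows where to fall back:

    #!/bin/sh
    # Push the same asset to both clouds; succeed if at least one copy lands.
    FILE=ad-spot-1234.mp4
    KEY="spots/$(date +%Y%m%d)/$FILE"

    aws s3 cp "$FILE" "s3://example-primary-bucket/$KEY"
    aws_ok=$?

    gsutil cp "$FILE" "gs://example-secondary-bucket/$KEY"
    gcs_ok=$?

    if [ "$aws_ok" -ne 0 ] && [ "$gcs_ok" -ne 0 ]; then
      echo "both uploads failed for $KEY" >&2
      exit 1
    fi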
> Wouldn't that present another set of complications for a very small edge case at huge cost?
One has to crunch the numbers. What does a service outage cost your business every minute/hour/day/etc in terms of lost revenue, reputational damage, violated SLAs, and other factors? For some enterprises, it's well worth the added expense and trouble of having multi-site active-active setups that span clouds and on-prem.
Now think of how many assets of various governments' militaries are discreetly employed as normal operational staff by FAAMG in the USA and have access to cause such events from scratch. I would imagine that the US IC (CIA/NSA) already does some free consulting for these giant companies to this end, because they are invested in that Not Being Possible (indeed, it's their job).
There is a societal resilience benefit to not having unnecessary cloud dependencies beyond the privacy stuff. It makes your society and economy more robust if you can continue in the face of remote failures/errors.
> I would imagine that the US IC (CIA/NSA) already does some free consulting for these giant companies
This comment is how I know you don't work in the public sector. Those agencies' infrastructures are essentially run by contractors with a few GS personnel making bad decisions every chance they get and a few DoD personnel acting like their rank can fix technical problems.
I'm not talking about running infrastructure, I'm talking about working with HR and making sure that the people they've hired as sysadmins aren't meeting with FSB or PLA agents in a local park on weekends and accepting suitcases of USD cash to accidentally `rm -rf` all of the Zookeeper/etcd nodes at once on a Monday morning.
I doubt it. We’re shit at critical infrastructure defense and the military cares mostly about its own networks. And industry really doesn’t want to cooperate. I was in cyber and military IT. Can’t speak for the IC, but I really doubt it.
> I would imagine that the US IC (CIA/NSA) already does some free consulting for these giant companies to this end,
Haha, it would be funny if the IC reached out to BigTech when failures occur to let them know they need not be worried about data loss. They can just borrow a copy of the data the IC is siphoning off them. /s?
I wouldn't jump to say it's short-sightedness (it is shitty), but it could be a matter of being pragmatic... It's easier to maintain the code if it is loaded at run time (think thin-client browser style). This way your IoT device can load the latest code and even settings from the cloud... (an advantage when the cloud is available)... I think of this less as short-sightedness and more as a reasonable trade-off (with shitty side effects).
Then you could just keep a local copy available as a fallback in case the latest code cannot be fetched. Not doing the bare minimum and screwing the end user isn't acceptable IMHO. But I also understand that'd take some engineer hours and report virtually no benefits as these outages are rare (not sure how Roomba's reliability is in general on the other hand) so here we are.
there's a reason it is called the Internet of things, and not the "local network of things". Even if the latter is probably what most customers would prefer.
A fundamental property of a network is its volatility. Nodes may fail. Edges may fail. You may not. Or you may. But then you're delivering no reliability, just crap. Nice fair-weather crap, maybe.
In a traditional internet style architecture, your Roomba and phone would both have globally routable addresses and could directly communicate - your local network would provide hosts with internet connectivity.
"If you can't connect to the cloud, fail, move on and load the app so that local things are allowed to work."
Building fallbacks requires work. How much extra effort and overhead is needed to build something like this? Sometimes the cost vs. benefit says that it is OK not to do it. If AWS has an outage like this once a year, maybe we can deal with it (unless you are working with mission-critical apps).
Yes, it is a lot of work to test whether the response code is OK or not, or whether a timeout limit has been reached. So much so that I pretty much wrote the test in the first sentence. Phew. 10x coder right here!
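It really is about that much code. A sketch of the kind of check being described, with the URL and file paths as made-up placeholders:

    # Try to refresh config from the cloud with a short timeout...
    if curl -fsS --max-time 3 https://config.example-iot-vendor.com/app.json \
         -o /tmp/app.json.new; then
      mv /tmp/app.json.new /var/cache/app.json
    fi
    # ...then launch against the local copy no matter what happened above
    exec ./app --config /var/cache/app.json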
Remember the Signal outage? Which is to say, the testing is the problem, not the implementation itself. (Which isn’t to say I think it’s optional—I myself generally don’t run online-only local-by-their-nature things.)
I just walked into the Amazon Books store at our local mall. They are letting everyone know at the entrance that “some items aren’t available for purchase right now because our systems are down.”
So at least Amazon retail is feeling some of the pain from this outage!
My job (although only 50% of my time) at Azure is unit testing/monitoring services under different scenarios and flows to detect small failures that would be overlooked on the public status page. Our tests run multiple times daily and we have people constantly monitoring logs. It concerns me when I see all AWS services 100% green when I know there is an outage.
Kinda hard to believe after they were blasted for that very situation during/after the S3 outage way back.
If that's the case, it's 100% a feature. They want as little public proof of an outage after it's over and to put the burden on customers completely to prove they violated SLAs.
I wonder how often outages really happen. The official page is nonsense, of course, and we only collectively notice when the outage is big enough that lots of us are affected. On AWS, I see about a 3:1 ratio of "bump in the night" outages (quickly resolved, little corroboration) to mega too-big-to-hide outages. Does that mirror others' experiences?
If you count any time AWS is having a problem that impacts our production workloads, then I think it's about 5:1. Dealing with "AWS is down" outages is easy because I can just sit back and grab some popcorn; it's the "dammit, I know this is AWS's fault" outages that are a PITA, because you count yourself lucky to even get a report in your personalized dashboard.
It's hard to measure what five nines is because you have to wait around until a 0.00001 occurs. Incentivizing post-mortems is absolutely critical in this case.
From what I've heard it's mostly true. Not only the CEO but a few SVPs can approve it, but yes a human must approve the update and it must be a high level exec.
Part of the reason is because their SLAs are based on that dashboard, and that dashboard going red has a financial cost to AWS, so like any financial cost, it needs approval.
Taken literally, what you are saying is the service could be down and an executive could override that, preventing them from paying customers for a service outage, even if the service did have an outage and the customer could prove it (screenshots, metrics from other cloud providers, many different folks seeing it).
I'm sure there is some subtlety to this, but it does mean that large corps with influence should be talking to AWS to ensure that status information corresponds with actual service outages.
I have no inside knowledge or anything but it seems like there are a lot of scenarios with degraded performance where people could argue about whether it really constitutes an outage.
One time GCP argued that since they did return 404s on GCS for a few hours, that wasn't an uptime/latency SLA violation, so we were not entitled to a refund (though they refunded us anyway).
Opex > Capex. If companies thought about long term, yes they might consider it. But unless the cloud providers fuck up really badly, they're ok to take the heat occasionally and tolerate a bit of nonsense.
Yep. I was an SRE who worked at Google and also launched a product on Google Cloud. We had these arguments all the time, and the contract language often provides a way for the provider to weasel out.
This is no longer a partial outage. The status page reports elevated API error rates, DynamoDB issues, EC2 API error rates, and my company's monitoring is significantly affected (IE, our IT folks can't tell us what isn't working) and my AWS training class isn't working either.
If this needed a CEO to eventually get around to pressing a button that said "show users the actual information about a problem" that reflects poorly on amazon.
My friend works at a telemetry company for monitoring and they are working on alerting customers of cloud service outages before the cloud providers since the providers like to sit on their hands for a while (presumably to try and fix it before anyone notices).
It's not really dishonest though because there is nuance. Most everything in EC2 is still working it seems, just the console is down. So is it really down? It should probably be yellow but not red.
if you cannot access the control plane to create or destroy resources, it is down (partial availability). The jobs that are running are basically zombies.
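A quick way to see that distinction for yourself right now is to probe the control plane and your own data plane separately (the health-check URL is a placeholder):

    # Control plane: can we even list resources?
    aws ec2 describe-instances --max-items 1 --region us-east-1 \
      && echo "control plane reachable" \
      || echo "control plane NOT reachable"

    # Data plane: is already-running stuff still serving?
    curl -fsS --max-time 5 https://app.example.com/healthz \
      && echo "data plane serving" \
      || echo "data plane NOT serving"

If the first call returns errors while the second succeeds, you have exactly the "zombie" situation described above: running resources, no ability to change them.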
I'm right in the middle of an AWS-run training and we literally can't run the exercises because of this.
let me repeat that: my AWS training that is run by AWS, that I pay AWS for, isn't working, because AWS is having control plane (or other) issues. This is several hours after the initial incident. We're doing training in us-west-2, but the identity service and other components run in us-east-1.
I’m running EKS in us-west-2. My pods use a role ARN and identity token file to get temporary credentials via STS. STS can’t return credentials right now. So my EKS cluster is “down” in the sense that I can’t bring up new pods. I only noticed because an auto-scaling event failed.
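For context, the credential flow those pods rely on is roughly the sketch below. This is just an illustration: AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE are the env vars EKS normally injects for IRSA, while the region, endpoint, and session name here are my own placeholders.

```python
# Minimal sketch of the STS call an IRSA-enabled pod effectively makes.
# AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE are the env vars the EKS
# pod identity webhook normally injects; region/session name are placeholders.
import os
import boto3

def fetch_pod_credentials():
    role_arn = os.environ["AWS_ROLE_ARN"]
    with open(os.environ["AWS_WEB_IDENTITY_TOKEN_FILE"]) as f:
        token = f.read().strip()

    # Pinning the regional endpoint; when STS itself is impaired, this is
    # exactly the call that fails and new pods can't get credentials.
    sts = boto3.client(
        "sts",
        region_name="us-west-2",
        endpoint_url="https://sts.us-west-2.amazonaws.com",
    )
    resp = sts.assume_role_with_web_identity(
        RoleArn=role_arn,
        RoleSessionName="pod-session",   # placeholder name
        WebIdentityToken=token,
    )
    return resp["Credentials"]  # AccessKeyId, SecretAccessKey, SessionToken, Expiration
```

When that call fails, everything downstream (new pods, autoscaling events) fails with it, which matches what I saw.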
The API is NOT working -- it may not have been listed on the service health dashboard when you posted that, but it is now. We haven't been able to launch an instance at all, and we are continuously trying. We can't even start existing instances.
Heroku is currently having major problems. My stuff is still up, but I can't deploy any new versions. Heroku runs their stuff on AWS. I have heard reports of other companies who run on AWS also having degraded service and outages.
I'd say when other companies who run their infrastructure on AWS are going down, it's hard to argue it's not a real outage.
But AWS status _has_ changed to yellow at this point. Heroku could probably be completely down because of an AWS problem and AWS status would still not show red. But at least yellow tells us there's a problem. The distinction between yellow and red probably only matters at this point to lawyers arguing about the AWS SLA; the rest of us know yellow means "problems", red will never be seen, and green means "maybe problems anyway".
I believe us-east-1 could go entirely missing, and they'd still only put a yellow, not a red, on the status page. After all, the other regions are all fine, right?
Zero directly-attributable, calculable-at-time-of-decision cost. Of course there's a cost in terms of customers who leave because of the dishonest practice, but who knows how many people that will be? Of the customers who left after the outage, who knows whether they left because status wasn't communicated promptly and honestly, or for some other reason?
Versus, if a company has X SLA contracts signed, that point to Y reimbursement for being out for Z minutes, so it's easily calculable.
I wonder how well known this is. You'd think it would be hard to hire ethical engineers with such a scheme in place and yet they have tens of thousands.
There was a great response in r/relationship advice the other day where someone said that OP's partner forced a fight because they're planning to cheat on them, reconcile, and then will 'trickle out the truth' over the next 6 months. I'm stealing that phrase.
I don't see why they couldn't provide an error rate graph like Reddit[0] or simply make services yellow saying "increased error rate detected, investigating..."
An executive has an OKR around uptime, and an automated system would prevent him or her from having control over the messaging. Therefore any effort to create one is squashed, leaving the people requesting it confused as to why, and without any explanation. Oldest story in the book.
Because Amazon has $$$$$ riding on their SLOs, and it costs them through the nose for every minute they're down, in payments made to customers and fees refunded. I trust them and most companies not to be outright fraudulent (although I'm sure some are), but it's totally understandable that they'd be reluctant to push the "Downtime Alert/Cost Us a Ton of Money" button until they're sure something serious is happening.
Oh, you can get pretty weaselly about what “down” means. If there is “just” an S3 issue, are all the various services which are still “available” but throwing an elevated number of errors because of their own internal dependency on S3 actually down or just “degraded?” You have to spin up the hair-splitting apparatus early in the incident to try to keep clear of the post-mortem party. :D
This is an incentive to dishonesty, leading to fraudulent payments and false advertising of uptime to potential customers.
Hopefully it results in a class action lawsuit for enough money that Amazon decides that an automated system is better than trying to supply human judgement.
Can someone just have a site ping all the GET endpoints on the AWS API? That is very far from "automating [their entire] system" but it's better than what they're doing.
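You could get most of the way there with a dumb external prober. Here's a rough sketch of the idea; the regions and the particular read-only calls are arbitrary picks of mine (not an official health-check list), and it assumes you have AWS credentials configured for the prober.

```python
# Rough sketch of an external prober: hit a few cheap, read-only AWS API
# calls per region with retries disabled and record failures/latency.
import time
import boto3
from botocore.config import Config

REGIONS = ["us-east-1", "us-west-2"]

def probe(region):
    cfg = Config(region_name=region, retries={"max_attempts": 0},
                 connect_timeout=3, read_timeout=5)
    checks = {
        "sts": lambda: boto3.client("sts", config=cfg).get_caller_identity(),
        "ec2": lambda: boto3.client("ec2", config=cfg).describe_regions(),
        "dynamodb": lambda: boto3.client("dynamodb", config=cfg).list_tables(Limit=1),
    }
    results = {}
    for name, call in checks.items():
        start = time.monotonic()
        try:
            call()
            results[name] = ("ok", round(time.monotonic() - start, 3))
        except Exception as exc:  # any error counts as a failed probe
            results[name] = ("error", str(exc))
    return results

if __name__ == "__main__":
    for region in REGIONS:
        print(region, probe(region))
```

It wouldn't catch everything (write paths, console, cross-service dependencies), but it would have gone red long before the official dashboard did.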
It should be costing them trust not to push it when they should though. A trustworthy company will err on the side of pushing it. AWS is a near-monopoly, so their unprofessional business practices have still yet to cost them.
> It should be costing them trust not to push it when they should though.
This is what Amazon, the startup, understood.
Step 1: Always make it right and make the customer happy, even if it hurts in $.
Step 2: If you find you're losing too much money over a particular issue, fix the issue.
Amazon, one of the world's largest companies, seems to have forgotten that the real risk of not reporting accurately isn't money; it's breaking the feedback chain. Once you start gaming metrics, leaders no longer know what's really important to work on internally, because they no longer know what the actual issues are. It's the late Soviet Union in a nutshell: if everyone is gaming the system at all levels, the ability to execute objectively eventually decreases, because effort gets misallocated due to misunderstanding.
The more transparency you give, the harder it is to control the narrative. They have a general reputation for reliability, and exposing just how many actual errors/failures there are (ones that generally don't affect a large swath of users/use cases) would hurt that reputation for minimal gain.
Because "uptime" and "nines" became a marketing term. Simple as that. But the problem is that any public-facing measure of availability becomes a defacto marketing term.
The older I get the more I hate marketers. The whole field stands on the back of war-time propaganda research and it sure feels like it's the cause of so much rot in society.
Also, 4-5 nines is virtually impossible for complex systems, so the sort of responsible people who could make 3 nines true begin to check out, and now you're getting most of your info from the delusional, and you're lucky if you manage 2 objective nines.
No wonder IMDB <https://www.imdb.com/> is down (returning 503). Sad that Amazon engineers don't implement what they teach their customers -- designing fault-tolerant and highly available systems.
I think it's pretty unlikely that both Google and Facebook are affected by this minor AWS outage, whatever DownDetector says. I even did a spot check on some of the smaller websites they report as "down", like canva.com, and didn't see any issues.
When I worked there it required the signoff of both your VP-level executive and the comms team to update the status page. I do not believe I ever received said signoff before the issues were resolved.
When they have too much pride in an all-green dash, sure. Allowing any engineer to declare a problem when first detected? Not so hard, but it doesn't make you look good if you have an ultra-twitchy finger. They have the balance badly wrong at the moment though.
A trigger-happy status page gives realtime feedback for anyone doing a DoS attack. Even if you published that information publicly you would probably want it on a significant delay.
I wonder if the other parts of Amazon do this. Like their inventory system thinks something is in stock, but people can't find it in the warehouse, do they just simply not send it to you and hope you don't notice? AWS's culture sounds super broken.
My favorite status page, though, is Slack's. You can read an article in the New York Times about how Slack was down for most of a day, and the status page is just like "some percentage of users experienced minor connectivity issues". "Some percentage" is code for "100%" and "minor" is code for "total". Good try.
Makes you wonder if they have to manually update the page when outages occur. That'd be a pretty bad way to go, so I'd hope not. Maybe the code to automatically update the page is in us-east-1? :)
Something like that has impacted the status page in the past. There was a severe Kinesis outage last year (https://aws.amazon.com/message/11201/), and they couldn't update the service dashboard for quite a while because their tool for managing the service dashboard lives in us-east-1 and depends on Kinesis.
I assume each service has its own health check that checks the service is accessible from an internal location, thus most are green. However, when Service A requires Service B to do work, but Service B is down, a simple access check on Service A clearly doesn't give a good representation of uptime.
So what does a good health check actually report these days? Is it just about the service's own status, or should it include a breakdown of the status of external dependencies as part of its folded-up status?
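One common answer, sketched below with made-up dependency names and stubbed checks, is to report your own liveness plus a per-dependency breakdown, and fold them into ok/degraded/down rather than a single binary.

```python
# Sketch of a health endpoint that reports its own status plus a
# per-dependency breakdown, folded into "ok" / "degraded" / "down".
# The dependency names and checks are placeholders, stubbed to True here.
from typing import Callable, Dict

def check_database() -> bool:
    return True   # replace with a real ping/query

def check_object_store() -> bool:
    return True   # replace with a real HEAD request or similar

DEPENDENCIES: Dict[str, Callable[[], bool]] = {
    "database": check_database,
    "object_store": check_object_store,
}

def health() -> dict:
    deps = {}
    for name, check in DEPENDENCIES.items():
        try:
            deps[name] = "ok" if check() else "down"
        except Exception:
            deps[name] = "down"

    if all(status == "ok" for status in deps.values()):
        overall = "ok"
    elif any(status == "ok" for status in deps.values()):
        overall = "degraded"   # we're reachable, but some backends aren't
    else:
        overall = "down"

    # "service" is this process's own liveness; "overall" folds in dependencies.
    return {"service": "ok", "overall": overall, "dependencies": deps}

print(health())
```

The point is that "Service A answers pings" and "Service A can actually do its job" are different questions, and the second one is what a status page should reflect.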
That's why your status page should be completely independent from the services it is monitoring (minus maybe something that automatically updates it). We use a third party to host our status page specifically so that we can update it even if all our systems are down.
> AWS Management Console Home page is currently unavailable.
> You can monitor status on the AWS Service Health Dashboard.
"AWS Service Health Dashboard" is a link to status.aws.amazon.com... which is ALL GREEN. So... thanks for the suggestion?
At this point the AWS service health dashboard is kind of famous for always being green, isn't it? It's a joke to its users. Do the folks who work on the relevant AWS internal team(s) know this and just not have the resources to do anything about it, or what? If it's a harder problem than you'd think, for interesting technical reasons, that would be interesting to hear about.
Most of our services in us-east-1 are still responding, although we cannot log into the console. However, it looks like DynamoDB is timing out most requests for us.
Meanwhile, I moved one of my two internal DNS servers to a second site on 11 Nov 2020, and it's been up since then. One of my monitoring machines has been filling, rotating, and deleting logs for 1,712 days with a load average in the c. 40 range for that whole time; it just works.
If only there was a way to run stuff with an uptime of 364 days a year without using the cloud /s
> the point is that when it's down, bring it back up is someone else's problem.
When it's down, it's my problem, and I can't do anything about it other than explain that I have no idea why the system is broken and that I can't do anything about it.
"Why is my dohicky down? When will it be back?"
"Because it's raining, no idea"
It may be accurate, but it's also of no use.
But yes, Opex vs. Capex; of course that's why you can lease your servers. It's far easier to spend another $500 a month of company money on AWS than $500 a year for a new machine.
My light switch is used twice a day, yet it works every time. In the old days it would occasionally break (the bulb goes), and I would be empowered to fix it myself (change the bulb).
In the cloud you're at the mercy of someone who doesn't even know you exist to fix it, without the protections that say an electric company has with supplying domestic users.
This thread has people unable to turn their lights on[0]; it's hilarious how people tie their stuff to dependencies that aren't needed and have a history of constant failure.
If you want to host millions of people, then presumably your infrastructure can cope with the loss of a single AZ (and ideally the loss of Amazon as a whole). The vast majority of people will be far better off without their critical infrastructure going down in the middle of the day in the busiest sales season going.
Many B2B-type applications have a lot of usage during the workday and minimal usage outside of it. No reason to keep all that capacity running 24/7 when you only need most of it for ~8 hours per weekday. The cloud is perfect for that use case.
Scaling itself costs nothing, but saves money because you're not paying for unused capacity.
The main application I run operates in 7 countries globally, but the US is the only one that has enough usage to require additional capacity during the workday. So out of 720 hours in a 30 day month, cloud scaling allows me to pay for additional capacity for only the (roughly) 160 hours that it's actually needed. It's a significant cost saver.
And because the scaling is based on actual metrics, it won't scale up on a holiday when nobody is using the application. More cost savings.
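Back-of-the-envelope on the 160-vs-720-hours point, using a made-up $0.50/hour price for that burst capacity (not a real quote from any provider):

```python
# Back-of-the-envelope on paying for extra capacity ~160 of 720 hours a month,
# with a hypothetical $0.50/hour price for the burst capacity.
hourly_rate = 0.50                 # made-up cost of the extra capacity
always_on = 720 * hourly_rate      # keep it running the whole month
scheduled = 160 * hourly_rate      # only during the US workday
print(f"always on: ${always_on:.2f}, scaled: ${scheduled:.2f}, "
      f"savings: {1 - scheduled / always_on:.0%}")
# -> always on: $360.00, scaled: $80.00, savings: 78%
```

The exact numbers depend entirely on your pricing tier, but the shape of the saving is the same.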
Nice of you to assume that I don't understand the pricing of the services I use. I can assure you that I do, and I can also assure you that there is no such thing as provisioned vs on-demand pricing for Azure App Service until you get into the higher tiers. And even in those higher tiers, it's cheaper for me to use on-demand capacity.
Obviously what I'm saying will not apply to all use cases, but I'm only talking about mine.
So, we're getting failures (for customers) trying to use Amazon Pay from our site. AFAIK there is no status page for Amazon Pay, but the rest of Amazon's services seem to be a giant Rube Goldberg machine, so it's hard to imagine this one isn't too.
I have basically zero faith in Amazon at this point.
We first noticed failures because a tester happened to be testing in an env that uses the Amazon Pay sandbox.
I checked the prod site, and it wouldn't even ask me to login.
When I tried to log in to SellerCentral to file a ticket, it told me my password (from a pw manager) was wrong.
When I tried to reset, the OTP was ridiculously slow.
Clicking "resend OTP" gives a "the OTP is incorrect" error message.
When I finally got an OTP and put it in, the resulting page was a generic Amazon "404 page not found".
A while later, my original SellerCentral password, still unchanged because I never got another OTP to reset it, worked.
What the fuck kind of failure mode is "services are down, so the password must be wrong"?
In my experience, despite whatever is published, companies will privately acknowledge and pay out their SLA terms. (Which still only gets you, like, one day's worth of reimbursement if you're lucky.)
Retail SLAs are a small risk compared to the enterprise SLAs where an outage like this could cost Amazon tens of millions. I assume these contracts have discount tiers based on availability and anything below 99% would be a 100% discount for that bill cycle.
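No idea what the enterprise terms actually say, but purely as an illustration of that guess, a tier table like the hypothetical one below turns a few hours of region-wide downtime into a partial or full credit pretty quickly:

```python
# Hypothetical credit tiers, matching the guess above that anything under
# 99% monthly availability means the whole bill cycle is credited.
TIERS = [           # (minimum monthly availability, credit %)
    (0.999, 0),     # three nines or better: no credit
    (0.99, 10),     # some slippage: partial credit
    (0.0, 100),     # below 99%: 100% credit for the cycle
]

def credit_pct(availability: float) -> int:
    for floor, pct in TIERS:
        if availability >= floor:
            return pct
    return 100

hours_in_month = 720
for downtime_hours in (0.5, 7, 12):
    availability = 1 - downtime_hours / hours_in_month
    print(downtime_hours, f"{availability:.4%}", credit_pct(availability))
# 0.5h -> 99.9306% -> 0% credit
# 7h   -> 99.0278% -> 10% credit
# 12h  -> 98.3333% -> 100% credit
```

Multiply that by the monthly spend of every affected enterprise customer and the "tens of millions" figure doesn't seem far-fetched.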
"[4:35 PM PST] With the network device issues resolved, we are not working towards recovery of any impaired services. We will provide additional updates for impaired services within the appropriate entry in the Service Health Dashboard."
Especially if enough Amazon internal tools rely on it; it would be funny if there were a repeat of the FB debacle, where Amazon employees somehow couldn't communicate or get back into their offices because of the problem they were trying to fix.