AWS us-east-1 outage (amazon.com)
1658 points by judge2020 on Dec 7, 2021 | 957 comments



Looks like they've acknowledged it on the status page now. https://status.aws.amazon.com/

> 8:22 AM PST We are investigating increased error rates for the AWS Management Console.

> 8:26 AM PST We are experiencing API and console issues in the US-EAST-1 Region. We have identified root cause and we are actively working towards recovery. This issue is affecting the global console landing page, which is also hosted in US-EAST-1. Customers may be able to access region-specific consoles going to https://console.aws.amazon.com/. So, to access the US-WEST-2 console, try https://us-west-2.console.aws.amazon.com/


> This issue is affecting the global console landing page, which is also hosted in US-EAST-1

Even this little tidbit is a bit of a wtf for me. Why do they consider it ok to have anything hosted in a single region?

At a different (unnamed) FAANG, we considered it unacceptable to have anything depend on a single region. Even the dinky little volunteer-run thing which ran https://internal.site.example/~someEngineer was expected to be multi-region, and was, because there was enough infrastructure for making things multi-region that it was usually pretty easy.


Every damn Well-Architected Framework includes multi-AZ if not multi-region redundancy, and yet the single access point for their millions of customers is single-region. Facepalm in the form of $100Ms in service credits.


> Facepalm in the form of $100Ms in service credits.

It was also greatly affecting Amazon.com itself. I kept getting sporadic 404 pages, one of them during a purchase. Purchase history wasn't showing the product as purchased and I didn't receive an email, so I repurchased. Still no email; the second attempt didn't end in a 404, but the product still didn't show up in my purchase history. I have no idea if I purchased anything or not. I have never had an issue purchasing before. Normally I get a confirmation email within 2 or so minutes and the sale is immediately reflected in purchase history. I was unaware of the greater problem at that moment or I would have steered clear at the first 404.


Oh no... I think you may be in for a rough time, because I purchased something this morning and it only popped up in my orders list a few minutes ago.


They're also unable to refund Kindle book orders via their website. The "Request a refund" page has a 500 error, so they fall back to letting you request a call from a customer service rep. Initiating this request also fails, so they then fall back to showing a 1-888 number that the customer can call. Of course, when I tried to call, I got "All circuits are busy".


>Facepalm in the form of $100Ms in service credits.

Part of me wonders how much they're actually going to pay out, given that their own status page has only indicated five services with moderate ("Increased API Error Rates") disruptions in service.


That public status page has no bearing on service credits, it's a statically hosted page updated when there's significant public impact. A lot of issues never make it there.

Every AWS customer has a Personal Health Dashboard that is updated much faster and links issues to your affected resources. Additionally, requests for credits are handled by the customer service team, who have even more information.


Utter lies on that page. Multiple services listed as green aren't working for me or my team.


This point is repeated often, and the incentives for Amazon to downplay the actual downtime are definitely there.

Wouldn't affected companies be incentivized to file a lawsuit over AMZ lying about status? It would be easy to prove and costly to defend from AWS's standpoint.


Suggesting that when the status page sends a status request and hears no response, it defaults to green: hear no evil, see no evil, report no evil.

Either way—overt lies or engineering incompetence—it’s disappointing!


Pretty low chance that the status page is automated, especially via health checks. I imagine it's a static asset updated by hand.


Or the service that updates the status page runs out of us-east-1.


It has customer relationship implications. I guarantee you it is updated by a support agent.



Don't think there is an SLA for the console, so you would not be claiming anything for the console, at least.


I don't know if that should surprise us. AWS hosted their status page in S3 so it couldn't even reflect its own outage properly ~5 years ago. https://www.theregister.com/2017/03/01/aws_s3_outage/


I just want to serve 5 terabytes of data


Reference for those out of the loop: https://news.ycombinator.com/item?id=29082014


One region? I forgot how to count that low


It's like three regions - when two of them explode.

Two is one & one is none.


the obvious solution is to put all internet in one region so that when that one explodes nobody notices your little service


> At a different (unnamed) FAANG

I'm guessing Google, on the basis of the recently published (to the public) "I just want to serve 5TB"[1] video. If it isn't Google, then the broccoli man video is still a cogent reminder that unyielding multi-region rigor comes with costs.

1. https://www.youtube.com/watch?v=3t6L-FlfeaI


It's salient that the video is from 2010. Where I was (not Google), the push to make everything multi-region only really started in, maybe, 2011 or 2012. And, for a long time, making services multi-region actually was a huge pain. (Exception: there was a way to have lambda-like code with access to a global eventually-consistent DB.)

The point is that we made it easier. By the time I left, things were basically just multi-region by default. (To be sure, there were still sharp edges. Services which needed to store data (like databases) were a nightmare to manage. Services which needed to be in the same region as specific instances of other services, e.g. something which wanted to be running in the same region as wherever the master shard of its database was running, were another nasty case.)

The point was that every service was expected to be multi-region, which was enforced by regular fire drills, and if you didn't have a pretty darn good story about why regular announced downtime was fine, people would be asking serious questions.

And anything external going down for more than a minute or two (e.g. for a failover) would be inexcusable. Especially for something like a bloody login page.


Maybe has something to do with CloudFront mandating certs to be in us-east-1?


YES! Why do they do that? It's so weird. I will deploy a whole config into us-west-1 or something, but then I need to create a new cert in us-east-1 JUST to let CloudFront answer an HTTPS call. So frustrating.
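For anyone else hitting this, the workaround is just one extra call against us-east-1 for the certificate, while everything else stays in your own region. A rough sketch (example.com and the bucket name are placeholders):

    # CloudFront only accepts ACM certificates from us-east-1,
    # so the cert request has to target that region explicitly
    aws acm request-certificate \
      --domain-name example.com \
      --validation-method DNS \
      --region us-east-1

    # the rest of the stack (origin bucket, app, etc.) can live anywhere
    aws s3api create-bucket \
      --bucket example-origin-bucket \
      --region us-west-1 \
      --create-bucket-configuration LocationConstraint=us-west-1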


Agreed - in my line of work regulators want everything in the country we operate from but of course CloudFront has to be different.


Wouldn't using a global CDN for everything be off the table to begin with, in that case?


Apparently it's okay for static data (like a website hosted in S3 behind CloudFront) but seeing non-Australian items in AWS billing and overviews always makes us look twice.


Forget the number of regions. Monitoring for X shouldn't even be hosted on X at all...


Exactly. And I’m surprised AWS doesn’t have failover. That’s basic SOP for an SRE team.


> Even this little tidbit is a bit of a wtf for me. Why do they consider it ok to have anything hosted in a single region?

They're cheap. HA is for their customers to pay more for, not for Amazon, which often lies during major outages. They would lose money on HA and they would lose money on acknowledging downtimes. They will lie as long as they benefit from it.


I think I know specifically what you are talking about. The actual files an engineer could upload to populate their folder were not multi-region for a long time. The servers were, because they were stateless and that was easy to make multi-region, but the actual data wasn't until we replaced the storage service.


I think the storage was replicated by 2013? Definitely by 2014. It didn't have automated failover, but failover could be done, and was done during the relevant drills for some time.

I think it only stopped when the storage services got to the "deprecated, and we're not bothering to do a failover because dependent teams who care should just use something else, because this one is being shut down any year now". (I don't agree with that decision, obviously ;) but I do have sympathy for the team stuck running a condemned service. Sigh.)

After stuff was migrated to the new storage service (probably somewhere in the 2017-2019 range but I have no idea when), I have no idea how DR/failover worked.


Thank you for the sympathy. If we are talking about the same product then it was most likely backed by 3 different storage services over its lifespan: 2013/2014 was a third-party product that had some replication/failover baked in, 2016-2019 was on my team with no failover plans due to "deprecated, don't bother putting anything important here", then 2019 onward with "fully replicated and automatic failover capable and also less cost-per-GB to replicate, but less flexible for the existing use cases".


MAANG*

How long before Meta takes over for Facebook?


Well, alphabet needs to take over for Google first.


So it's MAAAAN? Seems disappointing


That's just, like, your opinion maaaan


I like MAGMA (Meta, Amazon, Google, Microsoft, Apple).

Especially when you are getting burned by an outage.


MANGA


Yeah, but I still have a different understanding of what "Increased Error Rates" means.

IMHO it should mean that the rate of errors is increased but the service is still able to serve a substantial amount of traffic. If the rate of errors is higher than, let's say, 90%, that's not an increased error rate, that's an outage.


They say that to try and avoid SLA commitments.


Some big customers should get together and make an independent org to monitor cloud providers and force them to meet their SLA guarantees without being able to weasel out of the terms like this…


They are still lying about it, the issues are not only affecting the console but also AWS operations such as S3 puts. S3 still shows green.


It's certainly affecting a wider range of stuff from what I've seen. I'm personally having issues with API Gateway, CloudFormation, S3, and SQS


Our corporate ForgeRock 2FA service is apparently broken. My services are behind distributed x509 certs so no problems there.


> We are experiencing API and console issues in the US-EAST-1 Region


I read it as console APIs. Each service API has its own indicator, and they are all green.


IAM is a "global" service for AWS, where "global" means "it lives in us-east-1".

STS at least has recently started supporting regional endpoints, but most things involving users, groups, roles, and authentication are completely dependent on us-east-1.
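For what it's worth, opting in to the regional STS endpoints is just configuration; newer SDK/CLI versions may already default to this, so treat it as a sketch rather than gospel:

    # route STS calls to the regional endpoint instead of the
    # global one that lives in us-east-1
    export AWS_STS_REGIONAL_ENDPOINTS=regional
    export AWS_DEFAULT_REGION=us-west-2
    aws sts get-caller-identity

    # or persistently, in ~/.aws/config:
    # [default]
    # sts_regional_endpoints = regional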


Yep, I am seeing failures on IAM as well:

   aws iam list-policies
  
  An error occurred (503) when calling the ListPolicies operation (reached max retries: 2): Service Unavailable
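Not a fix, but bumping the CLI's retry behaviour at least rides out some of the transient 503s (standard CLI v2 / botocore settings; the defaults give up after a couple of attempts):

    # let the CLI retry more aggressively instead of giving up quickly
    export AWS_RETRY_MODE=adaptive
    export AWS_MAX_ATTEMPTS=10
    aws iam list-policies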


I'm seeing errors for things that worked fine before: policies that had no issue are now returning "access denied".

I'm wondering if the cause of the outage has something to do with a change in the way IAM is interpreted?


Same here. Kubernetes pods running in EKS are (intermittently) failing to get IAM credentials via the ServiceAccount integration.
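If anyone else is triaging the same symptom, it's quick to rule out a wiring problem on your side versus the upstream STS/IAM issue (namespace, service account, and deployment names below are placeholders):

    # check that the IRSA annotation is still on the service account
    kubectl -n example-ns describe serviceaccount example-sa

    # check that the webhook injected the role/token env vars into the pod
    kubectl -n example-ns exec deploy/example-app -- env | grep -E 'AWS_ROLE_ARN|AWS_WEB_IDENTITY_TOKEN_FILE'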


I still can't create/destroy/etc CloudFront distros. They are stuck in "pending" indefinitely.


Ok, we've changed the URL to that from https://us-east-1.console.aws.amazon.com/console/home since the latter is still not responding.

There are also various media articles but I can't tell which ones have significant new information beyond "outage".


When I brought up the status page (because we're seeing failures trying to use Amazon Pay) it had EC2 and Mgmt Console with issues.

I opened it again just now (maybe 10 minutes later) and it now shows DynamoDB has issues.

If past incidents are anything to go by, it's going to get worse before it gets better. Rube Goldberg machines aren't known for their resilience to internal faults.


As a user of Sagemaker in us-east-1, I deeply fucking resent AWS claiming the service is normal. I have extremely sensitive data, so Sagemaker notebooks and certain studio tools make sense for me. Or DID. After this I'm going back to my previous formula of EC2 and hosting my own GPU boxes.

Sagemaker is not working, I can't get to my work (notebook instance is frozen upon launch, with zero way to stop it or restart it) and Sagemaker Studio is also broken right now.

The length of this outage has blown my mind.


You don't use AWS because it has better uptime. If you've been around the block enough times, this story has always rung hollow.

Rather, you use AWS because when it is down, it's down for everybody else as well. (Or at least they can nod their head in sympathy for the transient flakiness everybody experiences.) Then it comes back up and everybody forgets about the outage like it was just background noise. This is what's meant by "nobody ever got fired for buying (IBM|Microsoft)". The point is that when those products failed, you wouldn't get blamed for making that choice; in their time they were the one choice everybody excused even when it was an objectively poor choice.

As for me, I prefer hosting all my own stuff. My e-mail uptime is better than GMail, for example. However, when it is down or mail does bounce, I can't pass the buck.


Looks like they removed some 9s from availability in one day. I wonder if more are considering moving away from cloud.


Uh, four minutes to identify the root cause? Damn, those guys are on fire.


Identify, or publicly acknowledge? Chances are the technical teams noticed this fairly quickly and had been working on the issue for some time. It probably wasn't until they had identified the root cause and had a handful of strategies to mitigate with confidence that they chose to publicly acknowledge the issue, to save face.

I've broken things before and been aware of it, but didn't acknowledge them until I was confident I could fix them. It allows you to maintain an image of expertise to those outside who care about the broken things but aren't savvy about what's broken or why. Meanwhile you spend hours, days, weeks addressing the issue and suddenly pull a magic solution out of your hat to look like someone impossible to replace. Sometimes you can break and fix things without anyone even knowing, which is very valuable if breaking something carried some real risk to you.


This sounds very self-blaming. Are you sure that's what's really going through your head? Personally, when I get avoidant like that, it's because of anticipation of the amount of process-related pain I'm going to have to endure as a result, and it's much easier to focus on a fix when I'm not also trying to coordinate escalation policies that I'm not familiar with.


:) I imagine it went like this theoretical Slack conversation:

> Dev1: Pushing code for branch "master" to "AWS API".

> <slackbot> Your deploy finished in 4 minutes

> Dev2: I can't reach the API in east-1

> Dev1: Works from my computer


Outage started at 7:31 PST according to our monitoring. They are on fire, but not in a good way.


It was down as of 7:45am (we posted in our engineering channel), so that's a good 40 minutes of public errors before the root cause was figured out.


I'm trying to log in to the AWS Console from other regions but I'm getting HTTP 500. Has anyone managed to log in from other regions? Which ones?

Our backend is failing; it's on us-east-1 using AWS Lambda, API Gateway, and S3.


I like how 6 hours in: "Many services have already recovered".


Not even close for us.


https://status.aws.amazon.com/ still shows all green for me


It's acting odd for me. Shows all green in Firefox, but shows the error in Chrome even after some refreshes. Not sure what's caching where to cause that.


Firefox has more aggressive caching than other browsers, I think.


Haha, my developer called me in a panic telling me that he had crashed Amazon; he was doing some load tests with Lambda.


Postmortem: unbounded auto-scaling of Lambda combined with an oversight in internal rate limits caused an unforeseen internal DDoS.


Just wait for the medium article “How I ran up a $400 million AWS bill.”


"On how I learned 'recursion'"


I asked my friend who's a senior dev if he ever uses recursion at work. He said whenever he sees recursion in a code review, he tells the junior dev to knock it off.


He created a lambda function that spawned more lambda functions and the rest is history


If he actually knows how to crash Amazon, you have a new business opportunity, albeit not a very nice one...


It'd be hilarious if you kept that impression going for the duration of the outage.


that's cute xD


thank you for that big hearty laugh! :)


[flagged]


Don't get nitty about saying "my X". People say "my plumber" or "my hairstylist" or whatever all the time.


"my developer"


"my boss" "my QA team" "my peers" "my partner"


I worked at a company that hired an ex-Amazon engineer to work on some cloud projects.

Whenever his projects went down, he fought tooth and nail against any suggestion to update the status page. When forced to update the status page, he'd follow up with an extremely long "post-mortem" document that was really just a long-winded explanation of why the outage was someone else's fault.

He later explained that in his department at Amazon, being at fault for an outage was one of the worst things that could happen to you. He wanted to avoid that mark any way possible.

YMMV, of course. Amazon is a big company and I've had other friends work there in different departments who said this wasn't common at all. I will always remember the look of sheer panic he had when we insisted that he update the status page to accurately reflect an outage, though.


It's popular to upvote this during outages, because it fits a narrative.

The truth (as always) is more complex:

* No, this isn't the broad culture. It's not even a blip. These are EXCEPTIONAL circumstances caused by extremely bad teams that - if and when found out - would face dramatic intervention.

* The broad culture is blameless post-mortems. Not whose fault is it. But what was the problem and how to fix it. And one of the internal "Ten commandments of AWS availability" is you own your dependencies. You don't blame others.

* Depending on the service one customer's experience is not the broad experience. Someone might be having a really bad day but 99.9% of the region is operating successfully, so there is no reason to update the overall status dashboard.

* Every AWS customer has a PERSONAL health dashboard in the console that should indicate their experience.

* Yes, VP approval is needed to make any updates on the status dashboard. But that's not as hard as it may seem. AWS executives are extremely operation-obsessed, and when there is an outage of any size are engaged with their service teams immediately.


Well, the narrative is sort of what Amazon is asking for, heh?

The whole us-east-1 management console is gone, what is Amazon posting for the management console on their website?

"Service degradation"

It's not a degradation if it's outright down. Use the red status a little bit more often, this is a "disruption", not a "degradation".


Yeah no kidding. Is there a ratio of how many people it has to be working for to be in yellow rather than red? Some internal person going “it works on my machine” while 99% of customers are down.


I've always wondered why services are not counted as down more often. Is there some sliver of customers who still have access to the management console, for example?

An increase in error rates - no biggie, any large system is going to have errors. But when 80%+ of customer loads in the region are impacted (across availability zones, for whatever good those do) - that counts as down, doesn't it? Error rates in one AZ - degraded. Multi-AZ failures - down?


SLAs. Officially acknowledging an incident means that they now have to issue the SLA credits.


The outage dashboard is normally only updated if a certain $X percent of hosts / service is down. If the EC2 section were updated every time a rack in a datacenter went down, it would be red 24x7.

It's only updated when a large percentage of customers are impacted, and most of the time this number is less than what the HN echo chamber makes it appear to be.


I mean, sure, there are technical reasons why you would want to buffer issues so they're only visible if something big went down (although one would argue that's exactly what the "degraded" status means).

But if the official records say everything is green, a customer is going to have to push a lot harder to get the credits. There is a massive incentive to "stay green".


Yes, there were. I'm from central Europe and we were at least able to get some pages of the console in us-east-1, but I assume this was more caching-related. Even though the console loaded and worked for listing some entries, we weren't able to post a support case or view SQS messages, etc.

So I agree that "degraded" is not the proper wording, but it wasn't completely gone either. So... it's hard to tell what a commonly acceptable wording is here.


From France, when I connect to "my personal health dashboard" in eu-west-3, it says several services are having "issues" in us-east-1.

To your point, for support center (which doesn't show a region) it says:

Description

Increased Error Rates

[09:01 AM PST] We are investigating increased error rates for the Support Center console and Support API in the US-EAST-1 Region.

[09:26 AM PST] We can confirm increased error rates for the Support Center console and Support API in the US-EAST-1 Region. We have identified the root cause of the issue and are working towards resolution.


I'm part of a large org with a large AWS footprint, and we've had a few hundred folks on a call nearly all day. We have only a few workloads that are completely down; most are only degraded. This isn't a total outage, we are still doing business in east-1. Is it "red"? Maybe! We're all scrambling to keep the services running well enough for our customers.


Because the console works just fine in us-east-2, and the console entry on the status page does not distinguish regions.

If the console works 100% in us-east-2 and not in us-east-1, why would they mark the console as completely down?


Well you know, like when a rocket explodes, it's a sudden "unexpected rapid disassembly" or something...

And a cleaner is called a "floor technician".

Nothing really out of the ordinary for a service to be called degraded while "hey, the cache might still be working right?" ... or "Well you know, it works every other day except today, so it's just degradation" :-)


If your statement is true, then why is the AWS status page widely considered useless, and everyone congregates on HN and/or Twitter to actually know what's broken on AWS during an outage?


> Yes, VP approval is needed to make any updates on the status dashboard. But that's not as hard as it may seem. AWS executives are extremely operation-obsessed, and when there is an outage of any size are engaged with their service teams immediately.

My experience generally aligns with amzn-throw's, but this right here is why. There's a manual step here and there's always drama surrounding it. The process to update the status page is fully automated on both sides of this step; if you removed VP approval, the page would update immediately. So if the page doesn't update, it is always a VP dragging their feet. Even worse, lags in this step were never discussed in the postmortem reviews that I was a part of.


It's intentional plausible deniability. By creating the manual step you can shift blame away. It's just like the concept of personal health dashboards, which are designed to preserve an asymmetry in reliability information between the host and the client, limiting each client to their own personal anecdata. On top of all of this, the metrics are pretty arbitrary.

Let's not pretend businesses haven't been intentionally advertising in deceitful ways for decades if not hundreds of years. This just happens to be the current strategy in tech of lying to and deceiving customers to limit liability, responsibility, and recourse actions.

To be fair, it's not just Amazon; they just happen to be the largest and most targeted whipping boy on the block. Few businesses will admit to liability under any circumstances. Liability always has to be assessed externally.


Here's one business honest with their status page: https://news.ycombinator.com/item?id=29478679


I have in the past directed users here on HN who were complaining about https://status.aws.amazon.com to the Personal Health Dashboard at https://phd.aws.amazon.com/ as well. Unfortunately, even though the account I was logged into this time only has a single S3 bucket in the EU, billed through the EU and with zero direct dependencies on the US, the Personal Health Dashboard was ALSO throwing "The request processing has failed because of an unknown error" messages. Whatever the problem was this time, it had global effects for the majority of users of the Console, and the internet noticed for over 30 minutes before either the status page or the PHD were able to report it. There will be no explanation, and the official status page logs will say there were "increased API failure rates" for an hour.

Now I guess it's possible that the 1000s and 1000s of us who noticed and commented are some tiny fraction of the user base, but if that's so, you could at least publish a follow-up like other vendors do that says something like "0.00001% of API requests failed, affecting an estimated 0.001% of our users at the time".


I haven't asked AWS employees specifically about blameless postmortems, but several of them have personally corroborated that the culture tends towards being adversarial and "performance focused." That's a tough environment for blameless debugging and postmortems. It's like hearing that someone has a rain-forest tree frog living happily in their outdoor Arizona cactus garden: I have doubts.


When I was at Google I didn't have a lot of exposure to the public infra side. However, I do remember back in 2008 when a colleague was working on the routing side of YouTube: he made a change that cost millions of dollars in mere hours before he noticed and reverted it. He mentioned this to the larger team during a tech talk, which drew applause. I cannot possibly generalize the culture differences between Amazon and Google, but at least in that one moment, the Google culture seemed to accept that errors happen, and that they get noticed and fixed without harming the perceived performance of those responsible.


While I support that, how are the people involved evaluated?


I was not informed of his performance reviews. However, given the reception, his work in general, and the attitudes of the team, I cannot imagine this even came up. More likely, the routing improvements that ultimately made YouTube cheaper to run were, I'm sure, the lasting positive result.

This was also towards the end of the golden age of Google, when the percentage of top talent was a lot higher.


So on what basis is someone's performance reviewed, if such performance is omitted?


The entire point of blameless postmortems is acknowledging that the mere existence of an outage does not inherently reflect on the performance of the people involved. This allows you to instead focus on building resilient systems that avoid the possibility of accidental outages in the first place.


I know. That's not what I'm asking about, if you might read my question.


I'll play devil's advocate here and say that sometimes these incidents deserve praise because they uncovered an issue that was previously unknown. Also, if the incident had a large negative impact, then it shows leadership how critical normal operation of that service is. Even if you were the cause of the issue, the fact that you fixed it and kept the critical service operating the rest of the time is worth something.


I know; that's not what I'm asking about. I'm talking about a different issue.


Mistakes happen, and a culture that insists too hard that "mistakes shouldn't happen, and so we can't be seen making mistakes" is harmful toward engineering.

How should their performance be evaluated, if not by the rote number of mistakes that can be pinned onto the person, and their combined impact? (Was that the question?)


If an engineer causes an outage by mistake and then ensures that would never happen again, he has made a positive impact.


I understand that, but eventually they need to evaluate performance, for promotions, demotions, raises, cuts, hiring, firing, etc. How is that done?


It's standard. A career ladder [1] sets expectations for each level. Performance is measured against those expectations. Outages don't negatively impact a single engineer.

The key difference is the perspective. If reliability is bad that’s an organizational problem and blaming or punishing one engineer won’t fix that.

[1] An example ladder from Patreon: https://levels.patreon.com/


> The key difference

The key difference between what and what?


Your approach and their approach. It sounded like you have a different perspective about who is responsible for an outage.


We knew us-east-1 was unusable for our customers for 45 minutes before Amazon acknowledged anything was wrong _at all_. We made decisions _in the dark_ to serve our customers, because Amazon dragged their feet communicating with us. Our customers were notified after 2 minutes.

It's not acceptable.


Can’t comment on most of your post but I know a lot of Amazon engineers who think of the CoE process (Correction of Error, what other companies would call a postmortem) as punitive


They aren't meant to be, but shitty teams are shitty. You can also create a COE and assign it to another team. When I was at AWS, I had a few COEs assigned to me by disgruntled teams just trying to make me suffer and I told them to pound sand. For my own team, I wrote COEs quite often and found it to be a really great process for surfacing systemic issues with our management chain and making real improvements, but it needs to be used correctly.


At some point the number of people who were on shitty teams becomes an indictment on the wider culture at Amazon.


Absolutely! Anecdotally, out of all the teams I interacted with in seven years at AWS across multiple arms of the company, I saw only a handful of bad teams. But like online reviews, the unhappy people are typically the loudest. I'm happy they are though, it's always important to push to be better, but I don't believe that AWS is the hellish place to work that HN discourse would lead many to believe.


I don't know any, and I have written or reviewed about 20


Even in a medium decent culture, with a sample of 20? You know at least one, you just don't know it.


Because OTHERWISE people might think AMAZON is a DYSFUNCTIONAL company that is beginning to CRATER under its HORRIBLE work culture and constant H/FIRE cycle.

See, AWS is basically turning into a long standing utility that needs to be reliable.

Hey, do most institutions like that completely turn over their staff every three years? Yeah, no.

Great for building it out and grabbing market share.

Maybe not for being the basis of a reliable substrate of the modern internet.

There are dozens of bespoke systems that keep AWS afloat (disclosure: I have friends who worked there, and also Conway's law), but if the people who wrote them are three generations of HIRE/FIRE ago....

Not good.


> Maybe not for being the basis of a reliable substrate of the modern internet.

Maybe THEY will go to a COMPETITOR and THINGS MOVE ON if it's THAT BAD. I wasn't sure what the pattern for all caps was, so just giving it a shot there. Apologies if it's incorrect.


I was mocking the parent, who was doing that. Yes it's awful. Effective? Sigh, yes. But awful.


I'm an ex-Amazon employee and approve of this response.

It reflects exactly my experience there.

Blameless post-mortem, stick to the facts and how the situation could be avoided/reduced/shortened/handled better for next time.

In fact, one of the guidelines for writing COE (Correction Of Error, Amazon's jargon for Post Mortem) is that you never mention names but use functions and if necessary teams involved:

1. Personal names don't mean anything except to the people who were there on the incident at the time. Someone reading the CoE on the other side of the world or 6 months from now won't understand who did what and why.

2. It stands in the way of honest accountability.


>* Every AWS customer has a PERSONAL health dashboard in the console that should indicate their experience.

You mean the one that is down right now?


Seems like it's doing an exemplary job of indicating their experience, then.


What?!

Everybody is very slow to update their outage pages because of SLAs. It's in a company's financial interest to deny outages and when they are undeniable to make them appear as short as possible. Status pages updating slowly is definitely by design.

There's no large dev platform I've used that this wasn't true of their status pages.


> ...you own your dependencies. You don't blame others.

Agreed, teams should invest resources in architecting their systems in a way that can withstand broken dependencies. How do AWS teams account for "core" dependencies (e.g. auth) that may not have alternatives?


This is the irony of building a "reliable" system across multiple AZ's.


> * Depending on the service one customer's experience is not the broad experience. Someone might be having a really bad day but 99.9% of the region is operating successfully, so there is no reason to update the overall status dashboard.

https://rachelbythebay.com/w/2019/07/15/giant/


Oh, yes. Let me go look at the PERSONAL health dashboard and... oh, I need to sign into the console to view it... hmm


> you own your dependencies. You don't blame others.

I love that. Build your service to be robust. Never assume that dependencies are 100% reliable. Gracefully handle failures. Don't just go hard down, or worse, fail horribly in a way that you can't recover from automatically when your dependencies come back. (I've seen a single database outage cause cascading failures across a whole site even though most services had no direct connection to the database. And recovery had to be done in order of dependency, or else you're playing whack-a-mole for an hour.)
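Even at the dumbest script level the same mindset applies; a minimal sketch (bucket and paths are made up) that retries with backoff and degrades to a local queue instead of hard-failing:

    # retry the upload a few times with exponential backoff
    ok=0
    for i in 1 2 3 4 5; do
      if aws s3 cp report.csv s3://example-bucket/report.csv; then
        ok=1
        break
      fi
      sleep $((2 ** i))
    done

    # degrade gracefully: queue the file locally rather than taking
    # the whole job down with the dependency
    if [ "$ok" -ne 1 ]; then
      mv report.csv /var/spool/pending-uploads/
    fi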

> VP approval is needed to make updates on the status board.

Isn't that normal? Updating the status has a cost (reparations to customers if you breach SLA). You don't want some on-call engineer stressing over the status page while trying to recover stuff.


Come on, we all know managers don’t want to claim an outage till the last minute.


"Yes, VP approval is needed to make any updates on the status dashboard."

If services are clearly down, why is this needed? I can understand the oversight required for a company like Amazon, but this sounds strange to me. If services are clearly down, I want that damn status update right away as a customer.


Because "services down" also means SLA credits.


Hiding behind a throwaway account does not help your point.


The person is unlikely to have been authorized as a spokesman for AWS. In many workplaces, doing that is grounds for disciplinary action. Hence, throwaway.


Well, when you talk about blameless post mortems and how they are valued at the company... A throw-away does make me doubt that the culture supports being blameless :)


Well, I understand that, but if you look at his account history it is only pro-Amazon comments. It feels like propaganda more than information, and all I am saying is that the throwaway does not add credibility or a feeling that his opinions are genuine.


[flagged]


This post broke the site guidelines badly. If you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules in the future, we'd be grateful.

Edit: also, could you please stop posting unsubstantive comments generally? You've unfortunately been doing that repeatedly, and we're trying for something else here.


That sounds like the exact opposite of human-factors engineering. No one likes taking blame. But when things go sideways, people are extra spicy and defensive, which makes them clam up and often withhold useful information, which can extend the outage.

No-blame analysis is a much better pattern. Everyone wins. It's about building the system that builds the system. Stuff broke; fix the stuff that broke, then fix the things that let stuff break.


I worked at Walmart Technology. I bravely wrote post-mortem documents owning my team's (100+ people) fault, both technically and culturally, as their leader. I put together a plan to fix it and executed it. I thought that was the right thing to do. This happened two times in my 10-year career there.

Both times I was called out as a failure in my performance eval. Second time, I resigned and told them to find a better leader.

Happy now I am out of such shitty place.


That's shockingly stupid. I also worked for a major Walmart IT services vendor in another life, and we always had to be careful about how we handled them, because they didn't always show a lot of respect for vendors.

On another note, thanks for building some awesome stuff -- walmart.com is awesome. I have both Prime and whatever-they're-currently-calling Walmart's version, and I love that Walmart doesn't appear to mix SKUs together in the same bin, which seems to cause counterfeiting fraud at Amazon.


walmart.com's user design sucks. My particular grudge right now: I'm shopping to go pick up some stuff (and indicate "in-store pickup"), and each time I search for the next item, it resets that filter, making me click on it again for each item on my list.


Almost every physical-store-chain company's website makes it way too hard to do the thing I nearly always want out of their interface, which is to search the inventory of the X nearest locations. They all want to push online orders or 3rd-party-seller crap, it seems.


Yes, I assume they intentionally make it difficult in order to push third-party sellers, where they get to earn bigger profit margins, and/or to hide their low inventory.

Amazon is the worst, then Walmart (still much better than Amazon, since you can at least filter). The others are not bad in my experience.


Walmart.com: am I the only one in the world who can't view their site on my phone? I tried it on a couple of devices and couldn't get it to work. Scaling is fubar. I assumed this would be costing them millions/billions, since it's impossible to buy something from my phone right now. S21+ in portrait on multiple browsers.


What's a "bin" in this context?


I believe he means a literal bin. E.g. Amazon takes products from all their sellers and chucks them in the same physical space, so they have no idea who actually sold the product when it's picked. So you could have gotten something from a dodgy 3rd party seller that repackages broken returns, etc, and Amazon doesn't maintain oversight of this.


Literally just a bin in a fulfillment warehouse.

An amazon listing doesn't guarantee a particular SKU.


Ah, whew. That's what I thought. Thanks! I asked because we make warehouse and retail management systems and every vendor or customer seems to give every word their own meanings (e.g., we use "bin" in our discounts engine to be a collection of products eligible for discounts, and "barcode" has at least three meanings depending on to whom you're speaking).


Is WalMart.com awesome?


Props to you; Walmart will never realize their loss, unfortunately. But one day there will be a headline (or even a couple of them) and you will know that if you had been there it might not have happened, and that in the end it is Walmart's customers that will pay the price for it, not their shareholders.


Stories like this are why I'm really glad I stopped talking to that Walmart Technology recruiter a few years ago. I love working for places where senior leadership constantly repeat war stories about "that time I broke the flagship product" to reinforce the importance of blameless postmortems. You can't fix the process if the people who report to you feel the need to lie about why things go wrong.


But hope you found a better place?


that's awful. You should have been promoted for that.


is it just 'ceremony' to be called out on those things? (even if it is actually a positive sum total)


> Happy now I am out of such shitty place.

Doesn't sound like it.


I firmly believe in the dictum "if you ship it you own it". That means you own all outages. It's not just an operator flubbing a command, or a bit of code that passed review when it shouldn't. It's all your dependencies that make your service work. You own ALL of them.

People spend all this time threat modelling their stuff against malefactors, and yet so often spend no time thinking about the threat model of decay. They don't do it when adding new dependencies (build- or runtime), and therefore are unprepared to handle an outage.

There's a good reason for this, of course: modern software "best practices" encourage moving fast and breaking things, which includes "add this dependency we know nothing about, and which gives an unknown entity the power to poison our code or take down our service, arbitrarily, at runtime, but hey its a cool thing with lots of github stars and it's only one 'npm install' away".

Just want to end with this PSA: Dependencies bad.


Should I be penalized if an upstream dependency, owned by another team, fails? Did I lack due diligence in choosing to accept the risk that the other team couldn't deliver? These are real problems in the micro-services world, especially since I own UI and there are dozens of teams pumping out services, and I'm at the mercy of all of them. The best I can do is gracefully fail when services don't function in a healthy state.


You and many others here may be conflating two concepts which are actually quite separate.

Taking blame is a purely punitive action and solves nothing. Taking responsibility means it's your job to correct the problem.

I find that the more "political" the culture in the organization is, the more likely everyone is to search for a scapegoat to protect their own image when a mistake happens. The higher you go up in the management chain, the more important vanity becomes, and the more you see it happening.

I have made plenty of technical decisions that turned out to be the wrong call in retrospect. I took _responsibility_ for those by learning from the mistake and reversing or fixing whatever was implemented. However, I never willfully took _blame_ for those mistakes because I believed I was doing the best job I could at the time.

Likewise, the systems I manage sometimes fail because something that another team manages failed. Sometimes it's something dumb and could have easily been prevented. In these cases, it's easy point blame and say, "Not our fault! That team or that person is being a fuckup and causing our stuff to break!" It's harder but much more useful to reach out and say, "hey, I see x system isn't doing what we expect, can we work together to fix it?"


Every argument I have on the internet is between prescriptive and descriptive language.

People tend to believe that if you can describe a problem that means you can prescribe a solution. Often times, the only way to survive is to make it clear that the first thing you are doing is describing the problem.

After you do that, and it's clear that's all you are doing, then you follow up with a prescriptive description where you lay out clearly what could be done to manage a future scenario.

If you don't create this bright line, you create a confused interpretation.


My comment was made from the relatively simpler entrepreneurial perspective, not the corporate one. Corp ownership rests with people in the C-suite who are social/political lawyer types, not technical people. They delegate responsibility but not authority, because they can hire people, even smart people, to work under those conditions. This is an error mode where "blame" flows from those who control the money to those who control the technology. Luckily, not all money is stupid so some corps (and some parts of corps) manage to function even in the presence of risk and innovation failures. I mean the whole industry is effectively a distributed R&D budget that may or may not yield fruit. I suppose this is the market figuring out whether iterated R&D makes sense or not. (Based on history, I'd say it makes a lot of sense.)


I wish you wouldn't talk about "penalization" as if it was something that comes from a source of authority. Your customers are depending on you, and you've let them down, and the reason that's bad has nothing to do with what your boss will do to you in a review.

The injustice that can and does happen is that you're explicitly given a narrow responsibility during development, and then a much broader responsibility during operation. This is patently unfair, and very common. For something like a failed uService you want to blame "the architect" that didn't anticipate these system level failures. What is the solution? Have plan b (and plan c) ready to go. If these services don't exist, then you must build them. It also implies a level of indirection that most systems aren't comfortable with, because we want to consume services directly (and for good reason) but reliability requires that you never, ever consume a service directly, but instead from an in-process location that is failure aware.

This is why reliable software is hard, and engineers are expensive.

Oh, and it's also why you generally do NOT want to defer the last build step to runtime in the browser. If you start combining services on both the client and server, you're in for a world of hurt.


Not penalised no, but questioned as to how well your graceful failure worked in the end.

Remember: it may not be your fault, but it still is your problem.


An analogy for illustrating this:

You get hit by a car and injured. The accident is the other driver's fault, but getting to the ER is your problem. The other driver may help and call an ambulance, but they might not even be able to help you if they also got hurt in the car crash.


> Should I be penalized if an upstream dependency, owned by another team, fails?

Yes

> Did I lack due diligence in choosing to accept the risk that the other team couldn't deliver?

Yes


Say during due diligence two options are uncovered: use an upstream dependency owned by another team, or use that plus a 3P vendor for redundancy. Implementing parallel systems costs 10x more than the former and takes 5x longer. You estimate a 0.01% chance of serious failure for the former, and 0.001% for the latter.

Now say you're a medium sized hyper-growth company in a competitive space. Does spending 10 times more and waiting 5 times longer for redundancy make business sense? You could argue that it'd be irresponsible to over-engineer the system in this case, since you delay getting your product out and potentially lose $ and ground to competitors.

I don't think a black and white "yes, you should be punished" view is productive here.


Where does this mindset end? Do I lack due diligence by choosing to accept that the cpu microcode on the system I’m deploying to works correctly?


If it's a brand new RISC-V CPU that was just released 5 minutes ago, and nobody has really tested it, then yes.

If it's a standard CPU that everybody else uses, and it's not known to be bad, then no.

Same for software. Is it OK to have a dependency on AWS services? Their history shows yes. A dependency on a brand new SaaS product? Nothing mission critical.

Or npm/crates/pip packages. Packages that have been around and steadily maintained for a few years, with active users, are worth checking out. Some random project from a single developer? Consider vendoring (and owning it if necessary).


Why? Intel has Spectre/Meltdown which erased like half of everyone's capacity overnight.


You choose the CPU and you choose what happens in a failure scenario. Part of engineering is making choices that meet the availability requirements of your service. And part of that is handling failures from dependencies.

That doesn't extend to ridiculous lengths but as a rule you should engineer around any single point of failure.


I think this is why we pay for support, with the expectation that if their product inadvertently causes losses for you they will work fast to fix it or cover the losses.


Yes? If you are worried about CPU microcode failing, then you do a NASA and have multiple CPU arch's doing calculations in a voting block. These are not unsolved problems.


JPL goes further and buys multiple copies of all hardware and software media used for ground systems, and keeps them in storage "just in case". It's a relatively cheap insurance policy against the decay of progress.


That's a great philosophy.

Ok, let's take an organization, let's call them, say Ammizzun. Totally not Amazon. Let's say you have a very aggressive hire/fire policy which worked really well in rapid scaling and growth of your company. Now you have a million odd customers highly dependent on systems that were built by people that are now one? two? three? four? hire/fire generations up-or-out or cashed-out cycles ago.

So.... who owns it if the people that wrote it are lllloooooonnnnggg gone? Like, not just long gone one or two cycles ago so some institutional memory exists. I mean, GONE.


A lot can go wrong as an organization grows, including loss of knowledge. At Amazon, "ownership" officially rests with the non-technical money that owns voting shares. They control the board, who control the CEO. "Ownership" can be perverted to mean that you, a wage slave, are responsible for the mess that previous ICs left behind. The obvious thing to do in such a circumstance is quit (or don't apply). It is unfair and unpleasant to be treated in a way that gives you responsibility but no authority, and to participate in maintaining (and extending) that moral hazard, and as long as there are better companies you're better off working for them.


I worked on a project like this in government for my first job. I was the third butt in that seat in a year. Everyone associated with project that I knew there was gone by one year from my own departure date.

They are now on the 6th butt in that seat in 4 years. That poor fellow is entirely blameless for the mess that accumulated over time.


Having individuals own systems seems like a terrible practice. You're essentially creating a single point of failure if only one person understands how the system works.


if I were a black hat I would absolutely love GitHub and all the various language-specific package systems out there. they give me sooooo many ways to sneak arbitrary tailored malicious code into millions of installs around the world, 24x7. sure, some of my attempts might get caught, or might not lead to a valuable outcome for me. but the percentage that does? can make it worth it. it's about scale and a massive parallelization of infiltration attempts. logic similar to the folks blasting out phishing emails or scam calls.

I love the ubiquity of third-party software from strangers, and the lack of bureaucratic gatekeepers. but I also hate it in ways. and not enough people know about the dangers of this second thing.


And yet, oddly enough, the Earth continues to spin and the internet continues to work. I think the system we have now is necessarily the system that must exist (in this particular case, not in all cases). Something more centralized is destined to fail. And, while the open source nature of software introduces vulnerabilities it also fixes them.


> And, while the open source nature of software introduces vulnerabilities it also fixes them.

dat gap tho... which was my point. smart black hats will be exploiting this gap, at scale. and the strategy will work because the majority of folks seem to be either lazy, ignorant or simply hurried for time.

and btw your 1st sentence was rude. constructive feedback for the future


For my vote, I don't think it was rude, I think it was making a point.


When working on CloudFiles, we often had monitoring for our limited dependencies that was better than their own monitoring. Don't just know what your stuff is doing, but what your whole dependency ecosystem is doing, and know when it all goes south. It also helps to learn where and how you can mitigate some of those dependencies.
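The cheapest version of that is just probing the dependency from your side and keeping your own record of what you saw; the endpoint below is made up:

    # poll a dependency's health endpoint every 30s and log what *we* observe,
    # independent of whatever their status page claims
    while true; do
      code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 \
        https://dependency.example.com/health)
      echo "$(date -u +%FT%TZ) dependency_health=$code"
      sleep 30
    done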


This. We found very big, serious issues with our anti-DDOS provider because their monitoring sucked compared to ours. It was a sobering reality check when we realized that.


It's also a nightmare for software preservation. There's going to be a lot from this era that won't be usable 80 years from now because everything is so interdependent and impossible to archive. It's going to be as messy and irretrievable as the Web pre Internet Archive + Wayback are.


I don't think engineers can believe in no-blame analysis if they know it'll harm career growth. I can't unilaterally promote John Doe, I have to convince other leaders that John would do well the next level up. And in those discussions, they could bring up "but John has caused 3 incidents this year", and honestly, maybe they'd be right.


Would they? Having 3 outages in a year sounds like an organizational problem: not enough safeguards to prevent very routine human errors. But instead of worrying about that, we just assign a guy to take the fall.


If you work in a technical role and you _don't_ have the ability to break something, you're unlikely to be contributing in a significant way. Likely that would make you a junior developer whose every line of code is heavily scrutinized.

Engineers should be experts and you should be able to trust them to make reasonable choices about the management of their projects.

That doesn't mean there can't be some checks in place, and it doesn't mean that all engineers should be perfect.

But you also have to acknowledge that adding all of those safeties has a cost. You can be a competent person who requires fewer safeties or less competent with more safeties.

Which one provides more value to an organization?


The tactical point is to remove sharp edges, e.g. there's a tool that optionally takes a region argument:

    network_cli remove_routes [--region us-east-1]
Blaming the operator that they should have known that running

    network_cli remove_routes
will take down all regions because the region wasn't specified is the kind of thing that's being called out here.
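One cheap way to file down that particular edge (just a sketch; network_cli is the hypothetical tool from above):

    # wrapper that refuses to fan out to every region by accident;
    # the operator has to name a region explicitly
    remove_routes() {
      if [ -z "$1" ]; then
        echo "error: refusing to remove routes without an explicit region" >&2
        return 1
      fi
      network_cli remove_routes --region "$1"
    }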

All of the tools need to not default to breaking the world. That is the first and foremost thing being pushed. If an engineer is remotely afraid to come forward (beyond self-shame/judgement) after an incident and say "hey, I accidentally did this thing", then the situation will never get any better.

That doesn't mean that engineers don't have the ability to break things, but it means it's harder (and very intentionally so) for a stressed out human operator to do the wrong thing by accident. Accidents happen. Do you just plan on never getting into a car accident, or do you wear a seat belt?


> Which one provides more value to an organization?

Neither, they both provide the same value in the long term.

Senior engineers cannot execute on everything they commit to without having a team of engineers they work with. If nobody trains junior engineers, the discipline would go extinct.

Senior engineers provide value by building guardrails to enable junior engineers to provide value by delivering with more confidence.


Well, if John caused 3 outages and his peers Sally and Mike each caused 0, it's worth taking a deeper look. There's a real possibility he's getting screwed by a messed-up org; he could also be doing slapdash work, or he seriously might not understand the seriousness of an outage.


John’s team might also be taking more calculated risks and running circles around Sally and Mike’s teams with respect to innovation and execution. If your organization categorically punishes failures/outages, you end up with timid managers that are only playing defense, probably the opposite of what the leadership team wants.


Worth a look, certainly. Also very possible that this John is upfront about honest postmortems and like a good leader takes the blame, whereas Sally and Mike are out all day playing politics looking for how to shift blame so nothing has their name attached. At most larger companies, that's how it goes.


Or John's work is in frontline production use and Sally's and Mike's is not, so there's different exposure.


You're not wrong, but it's possible that the organization is small enough that it's just not feasible to have enough safeguards that would prevent the outages John caused. And in that case, it's probably best that John not be promoted if he can't avoid those errors.


Current co is small. We are putting in the safeguards from Day 1. Well, okay technically like day 120, the first few months were a mad dash to MVP. But now that we have some breathing room, yeah, we put a lot of emphasis on preventing outages, detecting and diagnosing outages promptly, documenting them, doing the whole 5-why's thing, and preventing them in the future. We didn't have to, we could have kept mad dashing and growth hacking. But very fortunately, we have a great culture here (founders have lots of hindsight from past startups).

It's like a seed for crystal growth. Small company is exactly the best time to implement these things, because other employees will try to match the cultural norms and habits.


Well, I started at the small company I'm currently at around day 7300, where "source control" consisted of asking the one person who was in charge of all source code for a copy of the files you needed to work on, and then giving the updated files back. He'd write down the "checked out" files on a whiteboard to ensure that two people couldn't work on the same file at the same time.

The fact that I've gotten it to the point of using git with automated build and deployment is a small miracle in itself. Not everybody gets to start from a clean slate.


> I have to convince other leaders that John would do well the next level up.

"Yes, John has made mistakes and he's always copped to them immediately and worked to prevent them from happening again in the future. You know who doesn't make mistakes? People who don't do anything."


You know why SO-teams, firefighters and military pilots are so successful?

-You don't hide anything

-Errors will be made

-After training/mission everyone talks about the errors (or potential ones) and how to prevent them

-You don't make the same error twice

Being afraid to make errors and learn from them creates a culture of hiding, a culture of denial and especially being afraid to take responsibility.


You can even make the same error twice, but you'd better have a much better explanation the second time around than you had the first, because you already knew that what you did was risky and/or failure-prone.

But usually it isn't the same person making the same mistake, usually it is someone else making the same mistake and nobody thought of updating processes/documentation to the point that the error would have been caught in time. Maybe they'll fix that after the second time ;)


Yes. AAR process in the army was good at this up to the field grade level, but got hairy on G/J level staffs. I preferred being S-6 to G-6 for that reason.


There is no such thing as "no-blame" analysis. Even in the best organizations with the best effort to avoid it, there is always a subconscious "this person did it". It doesn't help that these incidents serve as convenient places for others to leverage to climb their own career ladder at your expense.


Or just take responsibility. People will respect you for doing that and you will demonstrate leadership.


Cynical/realist take: Take responsibility, and then hope your bosses already love you, that you can immediately come up with a way to prevent it from happening again, and that you can convince them to give you the resources to implement it. Otherwise your responsibility is, unfortunately, just blood in the water for someone else to do all of that to protect the company against you and springboard their reputation on the descent of yours. There were already senior people scheming to take over your department from your bosses; now they have an excuse.


This seems like an absolutely horrid way of working or doing 'office politics'.


Yes, and I personally have worked in environments that do just that. They said they didn't, but with management "personalities" plus stack ranking, you know damn well that they did.


And the guy who doesn't take responsibility gets promoted. Employees are not responsible for failures of management to set a good culture.


The Gervais/Peter Principle is alive and well in many orgs. That doesn't mean that when you have the prerogative to change the culture, you just give up.

I realize that isn't an easy thing to do. Often the best bet is to just jump around till you find a company that isn't a cultural superfund site.


Not in healthy organizations, they don't.


You can work an entire career across a variety of companies and maybe enjoy life in one healthy organization in all that time. It just isn't that common, though of course voicing the _ideals_ is very, very common.


Once you reach a certain size there are surprisingly few healthy organization, most of them turn into externalization engines with 4 beats per year.


I love it when I share a mental model with someone in the wild.


Way more fun argument: Outages just, uh… uh… find a way.


> No-blame analysis is a much better pattern. Everyone wins. It's about building the system that builds the system. Stuff broke; fix the stuff that broke, then fix the things that let stuff break.

Yea, except it doesn't work in practice. I work with a lot of people who come from places with "blameless" post-mortem 'culture' and they've evangelized such a thing extensively.

You know what all those people have proven themselves to really excel at? Blaming people.


Ok, and? I don't doubt it fails in places. That doesn't mean that it doesn't work in practice. Our company does it just fine. We have a high trust, high transparency system and it's wonderful.

It's like saying unit tests don't work in practice because bugs got through.


Have you ever considered that the “no-blame” postmortems you are giving credit for everything are just a side effect of living in a high trust, high transparency system?

In other words, “no-blame” should be an emergent property of a culture of trust. It’s not something you can prescribe.


Yes, exactly. Culture of trust is the root. Many beneficial patterns emerge when you can have that: more critical PRs, blameless post-mortems, etc.


Sometimes, these large companies tack on too many "necessary" incident "remediation" actions with Arbitrary Due Date SLAs that throw a wrench into any ongoing work. And ongoing, strategically defined ""muh high impact"" projects are what get you promoted, not doing incident remediations.

When you get to the level you want, you get to not really give a shit and actually do The Right Thing. However, for all of the engineers clamoring to get out of the intermediate brick-laying trenches, opening an incident can create perverse incentives.


Politicized cloud meh.


In my experience this is the actual reason for fear of the formal error correction process.


I've worked for Amazon for 4 years, including stints at AWS, and even in my current role my team is involved in LSEs. I've never seen this behavior; the general culture has been: find the problem, fix it, and then do root cause analysis to avoid it again.

Jeff himself has said many times in All Hands and in public that "Amazon is the best place to fail", mainly because things will break; it's not that they break that's interesting, it's what you've learned and how you can avoid that problem in the future.


I guess the question is why can't you (AWS) fix the problem of the status page not reflecting an outage? Maybe acceptable if the console has a hiccup, but when www.amazon.com isn't working right, there should be some yellow and red dots out there.

With the size of your customer base, there were man-years spent confirming the outage after checking the status page.


Because there's a VP approval step for updating the status page and no repercussions for VPs who don't approve updates in a timely manner. Updating the status page is fully automated on both sides of VP approval. If the status page doesn't update, it's because a VP wouldn't do it.


LSE?


Large Scale Event


Haha... This bring back memories. It really depends on the org.

I've had pushback on my postmortems before because of phrasing that could be construed as laying some of the blame on some person/team when it's supposed to be blameless.

And for a long time, it was fairly blameless. You would still be punished with the extra work of writing high quality postmortems, but I have seen people accidentally bring down critical tier-1 services and not be adversely affected in terms of promotion, etc.

But somewhere along the way, it became politicized. Things like the wheel of death, public grilling of teams on why they didn't follow one of the thousands of best practices, etc, etc. Some orgs are still pretty good at keeping it blameless at the individual level, but... being a big company, your mileage may vary.


We're in a situation where the balls of mud made people afraid to touch some things in the system. As experiences and processes have improved we've started to crack back into those things and guess what, when you are being groomed to own a process you're going to fuck it up from time to time. Objectively, we're still breaking production less often per year than other teams, but we are breaking it, and that's novel behavior, so we have to keep reminding people why.

The moment that affects promotions negatively, or your coworkers throw you under the bus, you should 1) be assertive and 2) proof-read your resume as a precursor to job hunting.


Or problems just persist, because the fix is easy but explaining it to others who do not work on the system is hard. Especially justifying why it won't cause an issue, and being told that the fixes need to be done via scripts that will only ever be used once but nevertheless need to be code reviewed and tested...

I wanted to be proactive and fix things before they became an issue, but such things just drained life out of me, to the point I just left.


That’s idiotic, the service is down regardless. If you foster that kind of culture, why have a status page at all?

It makes AWS engineers look stupid, because it looks like they are not monitoring their services.


The status page is as much a political tool as a technical one. Giving your service a non-green state makes your entire management chain responsible. You don't want to be one that upsets some VPs advancement plans.


> It makes AWS engineers look stupid, because it looks like they are not monitoring their services.

Management.


Former AWSer. I can totally believe that happened and continues to happen in some teams. Officially, it's not supposed to be done that way.

Some AWS managers and engineers bring their corporate cultural baggage with them when they join AWS and it takes a few years to unlearn it.


Thanks for the perspective. I was beginning to regret posting this after so many people claiming this wouldn’t happen at AWS.

Amazon is a huge company so I have no doubt YMMV depending on your manager.


When I worked for AMZN (2012-2015, Prime Video & Outbound Fulfillment), attempting to sweep issues under the rug was a clear path to termination. The Correction-Of-Error (COE) process can work wonders in a healthy, data-driven, growth-mindset culture. I wonder if the ex-Amazonian you're referring to did not leave AMZN of their own accord?

Blame deflection is a recipe for repeat outages and unhappy customers.


> I wonder if the ex-Amazonian you're referring to did not leave AMZN of their own accord?

Entirely possible, and something I've always suspected.


This is the exact opposite of my experience at AWS. Amazon is all about blameless fact finding when it comes to root cause analysis. Your company just hired a not so great engineer or misunderstood him.


Adding my piece of anecdata to this.. the process is quite blameless. If a postmortem seems like it points blame, this is pointed out and removed.


Blameless, maybe, but not repercussion-less. A bad CoE was liable to upend the team's entire roadmap and put their existing goals at risk. To be fair, management was fairly receptive to "we need to throw out the roadmap and push our launch out to the following reinvent", but it wasn't an easy position for teams to be in.


Every incident review meeting I've ever been in starts out like, _"This meeting isn't to place blame..."_, then, 5 minutes later, it turns into the Blame Game.


Manually updated status pages are an anti-pattern to begin with. At that point, why not just call it a blog?


This gets posted every time there's an AWS outage. It might as well be a copypasta at this point.


Sorry. I'm probably to blame because I've posted this a couple times on HN before.

It strikes a nerve with me because it caused so much trouble for everyone around him. He had other personal issues, though, so I should probably clarify that I'm not entirely blaming Amazon for his habits. Though his time at Amazon clearly did exacerbate his personal issues.


well, this is the first time I've seen it, so I am glad it was posted this time.


First time I've seen it too. Definitely not my first "AWS us-east-1 is down but the status board is green" thread, either.


Ditto, it's always annoyed me that their status page is useless, but glad someone else mentioned it.


I had that deja vu feeling reading PragmaticPulp's comment, too.

And sure enough, PragmaticPulp did post a similar comment on a thread about Amazon India's alleged hire-to-fire policy 6 months back: https://news.ycombinator.com/item?id=27570411

You and I, we aren't among the 10000, but there are potentially 10000 others who might be: https://xkcd.com/1053/


I mean, it's true at every company I've ever worked at too. If you can lawyer incidents into not being an outage, you avoid like 15 meetings with the business stakeholders about all the things we "have to do" to prevent things like this in the future, which get canceled the moment they realize how much dev/infra time they will take to implement.


It's the "grandma got run over by a reindeer" of AWS outages. Really no outage thread would be complete without this anecdote.


Perhaps the reward structure should be changed to incentivize post-mortems. There could be several flaws that go underreported otherwise.

We may run into the problem of everything being documented, and of possible deliberate acts, but for a service that relies heavily on uptime, that's a small price to pay for a bulletproof operation.


Then we would drown in a sea of meetings and 'lessons learned' emails. There is a reason for post-mortems, but there has to be balance.


I find post-mortems interesting to read through especially when it’s not my fault. Most of them would probably be routine to read through but there are occasional ones that make me cringe or laugh.

Post-mortems can sometimes be thought of like safety training. There is a big imbalance of time dedicated to learning proper safety handling just for those rare incidents.


Does Disney still play the "Instructional Videos" series starring Goofy where he's supposed to be teaching you how to do something and instead we learn how NOT to do something? Or did I just date myself badly?


On the retail/marketplace side this wasn't my experience, but we also didn't have any public dashboards. On Prime we occasionally had to refund in bulk, and when it was called for (internally or externally) we would write up a detailed post-mortem. This wasn't fun, but it was never about blaming a person; it was more about finding flaws in process or monitoring.


I don't think anecdotes like this are even worth sharing, honestly. There's so much context lost here, so much that can be lost in translation. No one should be drawing any conclusions from this post.


> explanation about why the outage was someone else's fault

In my experience, it's rarely clear who was at fault for any sort of non-trivial outage. The issue tends to be at interfaces and involve multiple owners.


What if they just can't access the console to update the status page...


They could still go into the data center, open up the status page servers' physical...ah wait, what if their keyfobs don't work?


Yep I can confirm that. The process when the outage is caused by you is called COE (correction of errors). I was oncall once for two teams because I was switching teams and I got 11 escalations in 2 hours. 10 of these were caused by an overly sensitive monitoring setting. The 11th was a real one. Guess which one I ignored. :)


This fits with everything I've heard about terrible code quality at Amazon and engineers working ridiculous hours to close tickets any way they can. Amazon as a corporate entity seems to be remarkably distrustful of and hostile to its labor force.


> I will always remember the look of sheer panic

I don't know if you're exaggerating or not, but even if true why would anyone show that emotion about losing a job in the worst case?

You certainly had a lot of relevant-to-today's-top-HN-post stories throughout your career. And I'm less and less surprised to continuously find PragmaticPulp as one of the top commenters, if not the top, that resonates with a good chunk of HN.


This is weird, on my team it’s taken as a learning opportunity. I caused a pretty big outage and we just did a COE.


I am finding that I have a very bimodal response to "He did it". When I write an RCA or just talk about near misses, I may give you enough details to figure out that Tom was the one who broke it, but I'm not going to say Tom on the record anywhere, with one extremely obvious exception.

If I think Tom has a toxic combination of poor judgement, Dunning-Kruger syndrome, and a hint of narcissism (I'm not sure but I may be repeating myself here), such that he won't listen to reason and he actively steers others into bad situations (and especially if he then disappears when shit hits the fan), then I will nail him to a fucking cross every chance I get. Public shaming is only a tool for getting people to discount advice from a bad actor. If it comes down to a vote between my idea and his, then I'm going to make sure everyone knows that his bets keep biting us in the ass. This guy kinda sounds like the Toxic Tom.

What is important when I turned out to be the cause of the issue is a bit like some court cases. Would a reasonable person in this situation have come to the same conclusion I did? If so, then I'm just the person who lost the lottery. Either way, fixing it for me might fix it for other people. Sometimes the answer is, "I was trying to juggle three things at once and a ball got dropped." If the process dictated those three things then the process is wrong, or the tooling is wrong. If someone was asking me questions we should think about being more pro-active about deflecting them to someone else or asking them to come back in a half hour. Or maybe I shouldn't be trying to watch training videos while babysitting a deployment to production.

If you never say "my bad" then your advice starts to sound like a lecture, and people avoid lectures so then you never get the whole story. Also as an engineer you should know that owning a mistake early on lets you get to what most of us consider the interesting bit of solving the problem instead of talking about feelings for an hour and then using whatever is left of your brain afterward to fix the problem. In fact in some cases you can shut down someone who is about to start a rant (which is funny as hell because they look like their head is about to pop like a balloon when you say, "yep, I broke it, let's move on to how do we fix it?")


To me, the point of "blameless" PM is not to hide the identity of the person who was closest to the failure point. You can't understand what happened unless you know who did what, when.

"Blameless" to me means you acknowledge that the ultimate problem isn't that someone made a mistake that caused an outage. The problem is that you had a system in place where someone could make a single mistake and cause an outage.

If someone fat-fingers a SQL query and drops your database, the problem isn't that they need typing lessons! If you put a DBA in a position where they have to be typing SQL directly at a production DB to do their job, THAT is the cause of the outage, the actual DBA's error is almost irrelevant because it would have happened eventually to someone.


That's true if the direct cause is an actual mistake, which often is the case but not always.

It may also be that the cause is willful negligence, intentionally circumventing barriers for some personal reason.

And, of course, it may be that the cause is explicitly malicious (e.g. internal fraud, or the intent to sabotage someone) and at least part of the blame directly lies on the culprit, and not only on those who failed to notice and stop them.


Naming someone is how you discover that not everyone in the organization believes in Blamelessness. Once it's out it's out, you can't put it back in.

It's really easy for another developer to figure out who I'm talking about. Managers can't be arsed to figure it out, or at least pretend like they don't know.


(This was originally a reply to https://news.ycombinator.com/item?id=29473759 but I've pruned it to make the thread less top-heavy.)


And this is exactly why you can expect these headlines to hit with great regularity. These things are never a problem at the individual level, they are always at the level of culture and organization.


being at fault for an outage was one of the worst things that could happen to you

Imagine how stressful life would be thinking that you had to be perfect all the time.


That's been most of my life. Welcome to perfectionism.


Maybe it’s more telling that that engineer no longer works at Amazon.


That's a real shame, one of the leadership principles used to be "be vocally self-critical" which I think was supposed to explicitly counteract this kind of behaviour.

I think they got rid of it at some point though.


I can totally picture this. Poor guy.


Damn, he had serious PTSD lol


This may not actually be that bad of a thing. If you think about it, if they're fighting tooth and nail to keep the status page green, that tells you they were probably doing that at every step of the way before the failure became imminent. Gotta have respect for that.


After over 45 minutes https://status.aws.amazon.com/ now shows "AWS Management Console - Increased Error Rates"

I guess 100% is technically an increase.


I can't remember seeing problems be more strongly worded than "Increased Error Rates" or "high error rates with S3 in us-east-1" during the infamous S3 outage of 2017 - and that was after they struggled to even update their own status page because of S3 being down. :)


During the Facebook outage, FB wrote something along the lines of "We noticed that some users are experiencing issues with our apps" even though nothing worked anymore.


Their entire infrastructure was unroutable for a while if I remember correctly. That is a euphemism if I ever saw one.


"Fixed a bug that could cause [adverse behavior affecting 100% of the user base] for some users"


"some" as in "not all". I'm sure there are some tiny tiny sites that were unaffected because no one went to them during the outage.


"some" technically includes "all", doesn't it? It excludes "none", I suppose, but why should it exclude "all" (except if "all" equals "none")?


We make heavy usage of Kinesis Firehose in us-east-1.

Issues started ~1:24am ET and resolved around 7:31am ET.

Then really kicked in at a much larger scale at 10:32am ET.

We're now seeing failures with connections to RDS Postgres and other services.

Console is completely unavailable to me.


>Issues started ~1:24am ET and resolved around 7:31am ET.

First engineer found a clever hack using bubble gum

>Then really kicked in at a much larger scale at 10:32am ET.

Bubble gum dried out, and the connector lost connection again. Now, connector also fouled by the gum making a full replacement required.


Route53 is not updating new records. Console is also out.


Kinesis was the cause last Thanksgiving too iirc. It's the backbone of many services.


Funny, I just asked Alexa to set a timer and she said there was a problem doing that. Apparently timers require functioning us-east-1 now.


I can’t turn on my lights… the future is weird


And that is why my lighting automation has a baseline req that it works 100% without the internet and preferably without a central controller.


HomeKit compatibility is a useful proxy for local API, since it's a hard requirement for HomeKit certification.


I love my Home Assistant setup for this reason. I can even get light bulbs pre-flashed with ESPHome now (my wife was bemused when I was updating the firmware on the lightbulbs).


This is an absolute requirement for all of my smart home devices. Not only in case of an outage but also in case the manufacturer decides to stop supporting my device in the future. My Roomba, litter box, washer/dryer, outlets, lights, and all the rest will keep working even if their internet functionality fails. I would like all of those devices to keep working for at least a decade, and I'd be surprised if all the manufacturers keep supporting tech that old.


So basically it's just a toggle switch?


My lighting automation uses Insteon currently. My primary req is that they are smart and connected without needing a central controller or a connection to the internet. My switches all understand lighting scenes and can manage those in a P2P manner, without a central controller. The central controller is primarily used when I want to add actual automations vs. scenes. Even the central controller aspect works 100% disconnected from the internet though. I can easily layer on top any automations I like. For instance, I have my exterior lights driven by the angle of the sun. Then, on top of that I can add internet-based triggers for automation as needed. This is where I add in voice assistant triggering of automation and scenes.

Edit: Just to add, very simple binary automations are even possible without a central controller. Like, I have Insteon motion sensors that trigger a lighting scene when they detect motion. These are super simplistic though.


Yeah, my christmas tree was fugazi.


Some advice that may help:

* Visit the console directly from another region's URL (e.g., https://us-east-2.console.aws.amazon.com/console/home?region...). You can try this after you've successfully signed in but still see the console failing to load.

* If your AWS SSO app is hosted in a region other than us-east-1, you're probably fine to continue signing in with other accounts/roles.

Of course, if all your stuff is in us-east-1, you're out of luck.

EDIT: Removed incorrect advice about running AWS SSO in multiple regions.


> Might also be a good idea to run AWS SSO in multiple regions if you're not already doing so.

Is this possible?

> AWS Organizations only supports one AWS SSO Region at a time. If you want to make AWS SSO available in a different Region, you must first delete your current AWS SSO configuration. Switching to a different Region also changes the URL for the user portal. [0]

This seems to indicate you can only have one region.

[0] https://docs.aws.amazon.com/singlesignon/latest/userguide/re...


Good call. I just assumed you could for some reason. I guess the fallback is to devise your own SSO implementation using STS in another region if needed.
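As a rough sketch of that fallback (the role ARN and session name below are placeholders, and this assumes you still have working long-lived credentials somewhere), STS exposes regional endpoints you can target directly so the credential exchange doesn't ride through the impaired region:

    # Exchange credentials against a regional STS endpoint instead of the global one.
    # Depending on CLI configuration, --region alone may still resolve to the global
    # endpoint, hence the explicit --endpoint-url.
    aws sts assume-role \
      --role-arn arn:aws:iam::123456789012:role/BreakGlassAdmin \
      --role-session-name outage-fallback \
      --region us-west-2 \
      --endpoint-url https://sts.us-west-2.amazonaws.com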


I don't think you can run SSO in multiple regions on the same AWS account.


Thanks, corrected.


I think now is a good time to reiterate the danger of companies just throwing all of their operational resilience and sustainability over the wall and trusting someone else with their entire existence. It's wild to me that so many high-performing businesses simply don't have a plan for when the cloud goes down. Some of my contacts are telling me that these outages have teams of thousands of people completely prevented from working, and tens of millions of dollars of profit are simply vanishing since the start of the outage this morning. And now institutions like government and banks are throwing their entire capability into the cloud with no recourse or recovery plan. It seems bad now but I wonder how much worse it might be when no one actually has access to money because all financial traffic is going through AWS and it goes down.

We are incredibly blind to trust just 3 cloud providers with the operational success of basically everything we do.

Why hasn't the industry come up with an alternative?


This seems like an insane stance to have. It's like saying businesses should ship their own stock, using their own drivers, their in-house-made cars and planes, and in-house-trained pilots.

Heck, why stop at having servers on-site? Cast your own silicon wafers; after all, you don't want Spectre exploits.

Because you are worse at it. If a specialist is this bad, and the market is fully open, then it's because the problem is hard.

AWS has fewer outages in one zone alone than the best self-hosted institutions, your facebooks and pentagons. In-house servers would lead to an insane amount of outage.

And guess what? AWS (and all other IaaS providers) will beg you to use multiple regions because of this. The team/person that has millions of dollars a day staked on a single AWS region is an idiot and could not be entrusted to order a gaming PC from Newegg, let alone run an in-house datacenter.

edit: I will add that AWS specifically is meh and I wouldn't use it myself; there are better IaaS options. But it's insanity to even imagine self-hosting is more reliable than using even the shittiest of IaaS providers.


> This seems like an insane stance to have, it's like saying businesses should ship their own stock, using their own drivers, and their in-house made cars and planes and in-house trained pilots.

> Heck, why stop at having servers on-site? Cast your own silicon wafers; after all, you don't want Spectre exploits.

That's an overblown argument. Nobody is saying that, but it's clear that businesses that maintain their own infrastructure would've avoided today's AWS' outage. So just avoiding a single level of abstraction would've kept your company running today.

> Because you are worse at it. If a specialist is this bad, and the market is fully open, then it's because the problem is hard.

The problem is hard mostly because of scale. If you're a small business running a few websites with a few million hits per month, it might be cheaper and easier to colocate a few servers and hire a few DevOps or old-school sysadmins to administer the infrastructure. The tooling is there, and is not much more difficult to manage than a hundred different AWS products. I'm actually more worried about the DevOps trend where engineers are trained purely on cloud infrastructure and don't understand low-level tooling these systems are built on.

> AWS has fewer outages in one zone alone than the best self-hosted institutions, your facebooks and pentagons. In-house servers would lead to an insane amount of outage.

That's anecdotal and would depend on the capability of your DevOps team and your in-house / colocation facility.

> And guess what? AWS (and all other IaaS providers) will beg you to use multiple regions because of this. The team/person that has millions of dollars a day staked on a single AWS region is an idiot and could not be entrusted to order a gaming PC from Newegg, let alone run an in-house datacenter.

Oh great, so the solution is to put even more of our eggs in a single provider's basket? The real solution would be having failover to a different cloud provider, and the infrastructure changes needed for that are _far_ from trivial. Even with that, there's only 3 major cloud providers you can pick from. Again, colocation in a trusted datacenter would've avoided all of this.


>, but it's clear that businesses that maintain their own infrastructure would've avoided today's AWS' outage.

When Netflix was running its own datacenters in 2008, they had a 3 day outage from a database corruption and couldn't ship DVDs to customers. That was the disaster that pushed CEO Reed Hastings to get out of managing his own datacenters and migrate to AWS.

The flaw in the reasoning that running your own hardware would avoid today's outage is that it doesn't also consider the extra unplanned outages on other days because your homegrown IT team (especially at non-tech companies) isn't as skilled as the engineers working at AWS/GCP/Azure.


The flaw in your reasoning is assuming that the complexity of the problem is even remotely the same. Most AWS outages are control plane related.


> it's clear that businesses that maintain their own infrastructure would've avoided today's AWS' outage.

Sure, that's trivially obvious. But how many other outages would they have had instead because they aren't as experienced at running this sort of infrastructure as AWS is?

You seem to be arguing from the a priori assumption that rolling your own is inherently more stable than renting infra from AWS, without actually providing any justification for that assumption.

You also seem to be under the assumption that any amount of downtime is always unacceptable, and worth spending large amounts of time and effort to avoid. For a lot of businesses, systems going down for a few hours every once in a while just isn't a big deal, and is much preferable to spending thousands more on cloud bills, or hiring more full-time staff to ensure X 9s of uptime.


You and GP are making the same assumption that my DevOps engineers _aren't_ as experienced as AWS' are. There are plenty of engineers capable of maintaining an in-house infrastructure running X 9s because, again, the complexity comes from the scale AWS operates at. So we're both arguing with an a priori assumption that the grass is greener on our side.

To be fair, I'm not saying never use cloud providers. If your systems require the complexity cloud providers simplify, and you operate at a scale where it would be prohibitively expensive to maintain yourself, by all means go with a cloud provider. But it's clear that not many companies are prepared for this type of failure, and protecting against it is not trivial to accomplish. Not to mention the conceptual overhead and knowledge required when dealing with the provider's specific products, APIs, etc., whereas the knowledge from maintaining these systems yourself is transferable across any datacenter.


This feels like a discussion that could sorely use some numbers.

What are good examples of

>a small business running a few websites with a few million hits per month, it might be cheaper and easier to colocate a few servers and hire a few DevOps or old-school sysadmins to administer the infrastructure.

and how often do they go down?


Depends, I guess. I am running an on-prem workstation for our DWH. So far in 2 years it has only gone down for minutes at a time, when I decided to take it down for hardware updates. I have no idea where this narrative came from, but usually the hardware you have is very reliable and doesn't turn off every 15 minutes.

Heck, I use an old T430 for my home server and it still doesn't go down on completely random occasions (but that's a very simplified example, I know).


But was it always accessible from the internet, and serving requests in an acceptable amount of time?


The one at work, yes, but only on the internal network, as we are not exposed to the internet. But to be honest, we are probably one of the few companies that make it a priority that there is always electricity and internet in the office (with a UPS, a generator, and multiple internet providers).

No idea what are the standards for other companies.


There are at least 6 cloud providers I can name that I've used which run their own data centers with capabilities similar to AWS's core products (EC2, Route 53, S3, CloudWatch, RDS):

OVH, Scaleway, Online.net, Azure, GCP, AWS.

Those are the ones I've used in production; I've heard of a dozen more, including big names like HP and IBM, and I assume they can match AWS for the most part.

...

That being said, I agree multi-tenant is the way to go for reliability. But I was pointing out that in this case even the simple solution of multi-region on one provider was not implemented by those affected.

...

As for running your own data center as a small company: I have done it, buying components, building servers and all.

Expenses and ISP issues aside, I can't imagine running in-house without at least a few outages a year for anywhere near the price of hiring a DevOps person to build a multi-tenant solution for you.

If you think you can you've either never tried doing it OR you are being severely underpaid for your job.

Competent teams that can build and run reliable in-house infrastructure exist, and they can get you an SLA similar to multi-region AWS or GCP (aka 100% over the last 5 years)... but the price tag has 7 to 8 figures in it.


This is the right answer, I recall studying for the solutions architect professional certification and reading this countless times: outages will happen and you should plan for them by using multi-region if you care about downtime.

It's not AWS's fault here, it's the companies', which assume that it will never be down. In-house servers also have outages; it's a very naive assumption to think that it'd all be better if all of those services were using their own servers.

Facebook doesn't use AWS and they were down for several hours a couple weeks ago, and that's because they have way better engineers than the average company, working on their infrastructure, exclusively.


Apple created their own silicon. FedEx uses its own pilots. The USPS uses its own cars.

If you're a company relying upon AWS for your business, is it okay if you're down for a day or two while you wait for AWS to resolve its issue?


It’s bloody annoying when all I want to do is vacuum the floor and Roomba says nope, “active AWS incident”.


If all you wanted to do was vacuum the floor you would not have gotten that particular vacuum cleaner. Clearly you wanted to do more than just vacuum the floor, and the possibility of something like this happening should be weighed when purchasing the vacuum.


I’ll rephrase. I wanted the floor vacuumed and I didn’t want to do it.


When I bought my automated sprinkler system, I got one that would continue to work if the company or the cloud went belly up.


>Apple created their own silicon.

Apple designs the M1. But TSMC (and possibly Samsung) actually manufacture the chips.


I'm pretty sure that's a difference without a distinction.


> Apple created their own silicon

Apple designed their own silicon, a third party manufactures and packages it for them.


Pedantic, -1.


I suggest you review this before commenting again:

https://news.ycombinator.com/newsguidelines.html


> Please respond to the strongest plausible interpretation of what someone says, not a weaker one that's easier to criticize. Assume good faith.

You mean like that rule, pedant? It's not name calling if it's an accurate representation of one's behavior.

ped·ant /ˈpednt/: noun a person who is excessively concerned with minor details and rules or with displaying academic learning.


Most companies using AWS are tiny compared to the companies you mentioned.


> AWS (and all other IaaS providers) will beg you to use multiple regions

Will they? Because AWS still puts new stuff in us-east-1 before anywhere else, and there is often a LONG delay before those things go to other regions. There are many other examples of why people use us-east-1 so often, but it all boils down to this: AWS encourages everyone to use us-east-1 and discourages the use of other regions for the same reasons.

If they want to change how and where people deploy, they should change how they encourage their customers to deploy.

My employer uses multi-region deployments where possible, and we can't do that anywhere near as much as we'd like because of limitations that AWS has chosen to have.

So if cloud providers want to encourage multi-region adoption, they first need to stop discouraging and outright preventing it.


> AWS still puts new stuff in us-east-1 before anywhere else, and there is often a LONG delay before those things go to other regions.

Come to think of it (far down the second page of comments): Why east?

Amazon is still mainly in Seattle, right? And Silicon Valley is in California. So one would have thought the high-tech hub both of Amazon and of the USA in general is still in the west, not east. So why us-east-1 before anywhere else, and not us-west-1?


Most features roll out to IAD second, third, or fourth. PDX and CMH are good candidates for earlier feature rollout, and usually it's tested in a small region first. I use PDX (us-west-2) for almost everything these days.

I also think that they've been making a lot of the default region dropdowns and such point to CMH (us-east-2) to get folks to migrate away from IAD. Your contention that they're encouraging people to use that region just doesn't ring true to me.


It works really well imo. All the people who want to use new stuff at the expense of stability choose us-east-1; those who want stability at the expense of new stuff run multi-region (usually not in us-east-1).


This argument seems rather contrived. Which feature available in only one region for a very long time has specifically impacted you? And what was the solution?


Quick follow-up: I once used an IaaS provider (Hyperstreet) that was terrible. Long story short, the provider ended up closing shop and the owner of the company now sells real estate in California.

It was a nightmare recovering data. Even when the service was operational, it was subpar.

Just saying perhaps the “shittiest” providers may not be more reliable.


> In-house servers would lead to an insane amount of outage.

That might be true, but the effects of any given outage would be felt much less widely. If Disney has an outage, I can just find a movie on Netflix to watch instead. But now if one provider goes down, it can take down everything. To me, the problem isn't the cloud per se, it's one player's dominance in the space. We've taken the inherently distributed structure of the internet and re-centralized it, losing some robustness along the way.


> That might be true, but the effects of any given outage would be felt much less widely.

If my system has an hour of downtime every year and the dozen other systems it interacts with and depends on each have an hour of downtime every year, it can be better that those tend to be correlated rather than independent.


I think you're missing the point of the comment. It's not "don't use cloud". It's "be prepared for when cloud goes down". Because it will, despite many companies either thinking it won't, or not planning for it.


> AWS has fewer outages in one zone alone than the best self-hosted institutions, your facebooks and pentagons. In-house servers would lead to an insane amount of outage.

It’s had two in 13 months


In addition to fewer outages, _many_ products get a free pass on incidents because basically everyone is being impacted by the outage.


The benefit of self-hosting is that you are up while your competitors are down.

However, if you are on AWS, many of your competitors are down while you are down, so they can't take over your business.


"AWS has fewer outages in one zone alone than the best self-hosted institutions" sure you just call an outage "increased error rate"


they usually beg you to use multiple availability zones though

I'm not sure how many AWS services are easy to spawn in multiple regions.


> they usually beg you to use multiple availability zones though

Doesn't help you if what goes down is AWS global services on which you depend directly, or on which other AWS services depend (and which tend to be tied to us-east-1).


Which ones are difficult to deploy in multiple regions?


iirc elb+autoscaling


Because the expected value of using AWS is greater than the expected value of self-hosting. It's not that nobody's ever heard of running on their own metal. Look back at what everyone did before AWS, and how fast they ran screaming away from it as soon as they could. Once you didn't have to do that any more, it's just so much better that the rare outages are worth it for the vast majority of startups.

Medical devices, banks, the military, etc. should generally run on their own hardware. The next photo-sharing app? It's just not worth it until they hit tremendous scale.


Agree with your first point.

On the second though: at some point, infrastructure like AWS is going to be more reliable than what many banks, medical device operators, etc. can provide themselves. Asking them to stay on their own hardware is asking for those industries to remain slow, bespoke and expensive.


Hard agree with the second paragraph.

It is incredibly difficult for non-tech companies to hire quality software and infrastructure engineers - they usually pay less and the problems aren't as interesting.


Agreed, and it'll be a gradual switch rather than a single point, smeared across industries. Likely some operations won't ever go over, but it'll be a while before we know.


People are so quick to forget how things were before behemoths like AWS, Google Cloud, and Azure. Not all things are free and the outage the internet is experiencing is the risk users signed up for.

If you would like to go back to the days of managing your own machines, be my guest. Remember those machines also live somewhere and were/are subject to the same BGP and routing issues we've seen over the past couple of years.

Personally, I'll deal with outages a few times a year for the peace of mind that there's a group of really talented people looking into it for me.


>It seems bad now but I wonder how much worse it might be when no one actually has access to money because all financial traffic is going through AWS and it goes down.

Most financial institutions are implementing their own clouds, I can't think of any major one that is reliant on public cloud to the extent transactions would stop.

>Why hasn't the industry come up with an alternative?

You mean like building datacenters and hosting your own gear?


> Most financial institutions are implementing their own clouds

https://www.nasdaq.com/Nasdaq-AWS-cloud-announcement


That doesn't mean what you think it means.

The agreement is more of a hybrid cloud arrangement with AWS Outposts.

FTA:

>Core to Nasdaq’s move to AWS will be AWS Outposts, which extend AWS infrastructure, services, APIs, and tools to virtually any datacenter, co-location space, or on-premises facility. Nasdaq plans to incorporate AWS Outposts directly into its core network to deliver ultra-low-latency edge compute capabilities from its primary data center in Carteret, NJ.

They are also starting small, with Nasdaq MRX

This is much less about moving NASDAQ (or other exchanges) to be fully owned/maintained by Amazon, and more about wanting to take advantage of development tooling and resources and services AWS provides, but within the confines of an owned/maintained data center. I'm sure as this partnership grows, racks and racks will be in Amazon's data centers too, but this is a hybrid approach.

I would also bet a significant amount of money that when NASDAQ does go full "cloud" (or hybrid, as it were), it won't be in the same US-east region co-mingling with the rest of the consumer web, but with its own redundant services and connections and networking stack.

NASDAQ wants to modernize its infrastructure but it absolutely doesn't want to offload it to a cloud provider. That's why it's a hybrid partnership.


Indeed I can think of several outages in the past decade in the UK of banks' own infrastructure which have led to transactions stopping for days at a time, with the predictable outcomes.


> tens of millions of dollars of profit are simply vanishing

vanishing or delayed six hours? I mean


Money people think of money very weirdly. When they predict they will get more than they actually get, they call it a "loss" for some reason, and when they predict they will get less than they actually get, it's called... well, I don't know what that's called, but everyone gets bonuses.


6 hours of downtime often means 6 hours of paying employees to stand around which adds up rather quickly.


Well, if you're web-based, there's never really been any better alternative. Even before "the cloud", you had to be hosted in a datacenter somewhere if you wanted enough bandwidth to service customers, as well as have somebody who would make sure the power stayed on 24/7. The difference now is that there used to be thousands of ISP's so one outage wouldn't get as much news coverage, but it would also probably last longer because you wouldn't have a team of people who know what to look for like Amazon (probably?) does.


> Why hasn't the industry come up with an alternative?

The cloud is the alternative to self-managed data centers. Its value proposition is appealing: focus on your core business and let us handle infrastructure for you.

This fits the needs of most small and medium-sized businesses; there's no reason to forgo the cloud and spend time and money building and operating private data centers when the (perceived) chances of outages are so small.

Then companies grow to a certain size where the benefits of having a self-managed data center begin to outweigh not having one. But at this point it becomes more of a strategic/political decision than merely a technical one, so it's not an easy shift.


Why is the comparison always between public cloud and building your own data center from scratch?

There is a middle ground of bare-metal hosting. Abstract away the hardware and networking. Do the rest yourself.


We're too busy generating our own electricity and designing our own CPUs.


This appears to be a single-region outage - us-east-1. AWS supports as much redundancy as you want. You can be redundant between multiple Availability Zones in a single Region, or you can be redundant across 1, 2 or even 25 Regions throughout the world.

Multiple-region redundancy costs more both in initial planning/setup as well as monthly fees so a lot of AWS customers choose to just not do it.


We have, or had, alternatives: Rackspace, Linode, DigitalOcean; in the past there were many others, and self-hosting is still an option. But the big three just do it better. The alternatives are doomed to fail. If you use anything other than the big three, you risk not just more outages but your whole provider going out of business overnight.

If the companies at the scale you are talking about do not have multi-region and multi-provider (AWS to Azure, for example) failover, that's their fault and nobody else's.


In my opinion there is a lack of talent in these industries for building out their own resilient systems. IT people and engineers get lazy.


> IT people and engineers get lazy.

Companies do not change their whole strategy from a capex-driven traditional self-hosting environment to opex-driven cloud hosting because their IT people are lazy; it is typically an exec-level decision.


No lazier than anyone else, there's just not enough of us, in general and per company.


We're too busy in endless sprints to focus on things outside of our core business that don't make salespeople and executives excited.


> Why hasn't the industry come up with an alternative?

We used to have that before "cloud" became the "norm"; some companies still have the capability and know-how to build and run reliable infrastructure distributed across many hosting providers, but it goes along with "use it or lose it".


Because your own datacenters cant go down?


Many of those businesses wouldn’t have existed in the first place without simplicity offered by cloud.


Do you think they'd manage their own infra better? Are you suggesting they pay for a fully redundant second implementation on another provider? How much extra cost would that be vs eating an outage very infrequently?


There is an alternative: A true network cloud. This is what Cloudflare will eventually become.


So you're saying companies should start moving their infrastructure to the blockchain?


Ethereum has gone 5 years without a single minute of downtime, so if it's extreme reliability you're going for I don't think it can be beaten.


Easy to be reliable when nobody uses it


Because the majority of consumers don't know better / don't care and still buy products from companies with no backup plan. Because, really, how can any of us know better until we're burned many times over?


Not sure it helps, but got this update from someone inside AWS a few moments ago.

"We have identified the root cause of the issues in the US-EAST-1 Region, which is a network issue with some network devices in that Region which is affecting multiple services, including the console but also services like S3. We are actively working towards recovery."


That's a copy-paste, we got the same thing from our AWS contact. It's just enough info to confirm there's an issue, but not enough to give any indication on the scope or timeline to resolution.


Internally the rumor is that our CICD pipelines failed to stop bad commits to certain AWS services. This isn’t due to tests but due to actual pipelines infra failing.

We’ve been told to disable all pipelines even if we have time blockers or manual approval steps or failing tests


They finally updated the status page: https://status.aws.amazon.com/


Ahh, good spot. It does seem that the AWS person I am speaking to has a few more bits beyond what is shown on the page; they just messaged me the same message as there, but added:

"All teams are engaged and continuing to work towards mitigation. We have confirmed the issue is due to multiple impaired network devices in the US-EAST-1 Region."

Doesn't sound like they are having a good day there!


I love how they are sharing this stuff out to some clients, but it's technically under NDA.


Yeah, we got updates via NDA too lol. Such a joke that a status page update is considered privileged lol.


It's funny that the first place I go to learn about the outage is Hacker News and not https://status.aws.amazon.com/ (it still reports everything to be "operating normally"...)


I made sure our incident response plan includes checking Hacker News and Twitter for actual updates and information.

As of right now, this thread and one update from a twitter user, https://twitter.com/SiteRelEnby/status/1468253604876333059 are all we have. I went into disaster recovery mode when I saw our traffic dropped to 0 suddenly at 10:30am ET. That was just the SQS/something else preventing our ELB logs from being extracted to DataDog though.


So as of the time you posted this comment, were other services actually down? The way the 500 shows up, and the AWS status page, makes it sound like "only" the main landing page/mgt console is unavailable, not AWS services.


Yes, they are still publishing lies on their status page. In this thread people are reporting issues with many services. I'm seeing periodic S3 PUT failures for the last 1.5 hours.


AWS services are all built against each other so one failing will take down a bunch more which take down more like dominos. Internally there’s a list of >20 “public facing” AWS services impacted.



I always got the impression that downdetector worked by logging the number of times they get a hit for a particular service and using that as a heuristic to determine if something is down. If so, that's brilliant.


It's brilliant until the information is bad.

When Facebook's properties all went down in October, people were saying that AT&T and other cell phone carriers were also down - because they couldn't connect to FB/Insta/etc. There were even some media reports that cited Downdetector, seemingly without understanding that they are basically crowdsourced and sometimes the crowd is wrong.


I think it's a bit simpler for AWS: there's a big red "I have a problem with AWS" button on that page. You click it, tell it what your problem is, and it logs a report. Unless that's what you were driving at and I missed it, it's early. Too early for AWS to be down :(

Some 3600 people have hit that button in the last ~15 minutes.


Now 57 minutes later and it still reports everything as operating normally.


It shows errors now.


It doesn't show errors with Lambda and we clearly do experience them.


Community reporting > internal operations


I usually go on Twitter first for outages.


I think there should be some third-party status checker alliance.

It's a joke. Each time AWS/Azure/GCP is down their status page says all is fine.


Want to build a startup?


already running one.


Are the actual services down, or is it just the console and/or login page?

For example, the sign-up page appears to be working: https://portal.aws.amazon.com/billing/signup#/start

Are websites that run on AWS us-east up? Are the AWS CLIs working?


Anecdotally, we're seeing a small number of 500s from S3 and SQS, but mostly our service (which is at nontrivial scale, but mostly just uses EC2, S3, DynamoDB, and some basic network facilities including load balancers) seems fine, knock on wood. Either the problem is primarily in more complex services, or it is specific to certain AZs or shards or something.


Definitely not just the console. We had hundreds of thousands of websocket connections to us-east-1 drop at 15:40, and new websocket connections to that region are still failing. (Luckily not a huge impact on our service cause we run in 6 other regions, but still).


Side question: How happy are you with API Gateway's WebSocket service?


No idea, we don't use it. These were websocket connections to processes on ec2, via NLB and cloudfront. Not sure exactly what part of that chain was broken yet.


This whole time I've been seeing intermittent timeouts when checking a UDP service via NLB; I've been wondering if it's general networking trouble or something specifically with the NLB. EC2 hosts are all fine, as far as I can tell.


We're seeing issues with EventBridge, other folks are having trouble reaching S3.

Looks like actual services.


My website that runs on US-East-1 is up.

However, my Alexa (Echo) won't control my thermostat right now.

And my Ring app won't bring up my cameras.

Those services are run on AWS.


Now I'm imagining someone dying because they couldn't turn their heating on because AWS. The 21st Century is fucked up.


One of my sites went offline an hour ago because the web server stopped responding. I can't SSH into it or get any type of response. The database server in the same region and zone is continuing to run fine though.


Interesting, is the site on a particular type of EC2 instance, e.g. bare metal? I see c4.xlarge is doing fine in us-east-1.


It's just a t3a.nano instance since it's a project under development. However, I have a high number of t3a.nano instances in the same region operating as expected. This particular server has been running for years, so although it could be a coincidence it just went offline within minutes of the outage starting, it seems unlikely. Hopefully no hardware failures or corruption, and it'll just need a reboot once I can get access to AWS again.


I can't access anything related to Cloudfront, either through the CLI or console :

  $ aws cloudfront list-distributions

  An error occurred (HttpTimeoutException) when calling the ListDistributions operation: Could not resolve DNS within remaining TTL of 4999 ms
However, I can still access the distribution fine


I see that running EC2 instances are doing fine. However, starting stopped instances cannot be done through the AWS SDK due to the HTTP 500 error, even for the EC2 service. The CLI should be getting the HTTP 500 error too, since it likely calls the same API as the SDK.
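
For anyone scripting around this, a minimal retry sketch (boto3; the instance ID is hypothetical) of the call that's currently returning 500s:

  import time
  import boto3
  from botocore.exceptions import ClientError

  ec2 = boto3.client("ec2", region_name="us-east-1")

  def start_with_retry(instance_id, attempts=5):
      # Retry only on server-side (5xx) errors; anything else is a caller problem.
      for attempt in range(attempts):
          try:
              ec2.start_instances(InstanceIds=[instance_id])
              return True
          except ClientError as err:
              if err.response["ResponseMetadata"]["HTTPStatusCode"] < 500:
                  raise
              time.sleep(2 ** attempt)  # simple exponential backoff
      return False

  start_with_retry("i-0123456789abcdef0")  # hypothetical instance ID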


My ECS, EC2, Lambda, load balancer, and other services on us-east-1 still function. But these outages can sometimes propagate over time rather than instantly.

I cannot access the admin console.


I'm getting blank pages from Amazon.com itself.


We've had reports of some intermittent 500 errors from cloudfront, apart from that our sites are up.


I can tell you that some processes are not running, possibly due to SQS or SWF problems. Previous outages of this scale were caused by Kinesis outages. Can't log in via the CLI either, since we use SSO and that seems to be down.


I wasn't able to load my amazon.com wishlist, nor the shopping page through the app. Not an aws service specifically, but an amazon service that I couldn't use.


EventBridge, CloudWatch. I've just started getting session errors with the console, too.


Using cli to describe instances isn't working. Instances themselves seem fine so far.


Interactive Video Service (IVS) is down too


Friends tell friends to pick us-east-2.

Virginia is for lovers, Ohio is for availability.


Lots of services are only in us-east-1. The sso system isn't working 100% right now so that's where I assume it's hosted.


Yeah, there are "global" services which are actually secretly us-east-1 services as that is the region they use for internal data storage and orchestration. I can't launch instances with OpsWorks (not a very widely used service, I'd imagine) even if those instances are in stacks outside of us-east-1. I suspect Route53 and CloudFront will also have issues.


Yeah I can't log in with our external SAML SSO to our AWS dashboard to manage our us-east-2 resources. . . . Because our auth is apparently routed thru us-east-1 STS


You can pick the region for SSO — or even use multiple. Ours is in ap-southeast-1 and working fine — but then the console that it signs us into is only partially working presumably due to dependencies on us-east-1.


(Correction: you can choose the region but you can't use multiple for the same org.)


Ohio's actual motto funnily kind of fits here:

  With God, all things are possible


Does this imply Virginia is Godless?


Virginia's actual motto is "Sic semper tyrannis". What's more tyrannical than an omnipotent being that will condemn you to eternal torment if you don't worship them and follow their laws?


Virginia and Massachusetts have surprisingly aggressive mottoes (MA is: "By the sword we seek peace, but peace only under liberty", which is really just a fancy way of saying "don't make me stab a tyrant," if you think about it). It probably makes sense, though, given that they came up with them during the revolutionary war.


I think I should add state motto to my data center consideration matrix.


I mean, most people are okay with dogs and Seattle.


Maybe virginia is for lovers, ohio is for God/gods?


This is funny, but true. I've been avoiding us-east-1 simply because that's where everyone else is. Spot instances are also less likely to be expensive in less utilized regions.


I live in Ohio and can confirm. If the Earth were destroyed by an asteroid Ohio would be left floating out there somehow holding onto an atmosphere for about ten years.


Always has been.


Friends tell Friends not to use Rube Goldberg machines as their infrastructure layer.


If you're not multi-cloud in 2021 and are expecting 5-9's, I feel bad for you.


If you're having SLA problems I feel bad for you son

I got two 9 problems cuz of us-east-1


> ~~I got two 9 problems cuz of us-east-1~~

I left my two nines problems in us-east-1


If you're not multi-region, I feel bad for you.

If your company is shoehorning you into using multiple clouds and learning a dozen products, IAM and CICD dialects simultaneously because "being cloud dependent is bad", I feel bad for you.

Doing one cloud correctly from a current DevSecOps perspective is a multi-year ask. I estimate it takes about 25 people working full time on managing and securing infrastructure per cloud, minimum. This does not include certain matrixed people from legacy network/IAM teams. If you have the people, go for it.


There are so many things that can go wrong with a single provider, regardless of how many availability zones you are leveraging, that you cannot depend on 1 cloud provider for your uptime if you require that level of uptime.

Example: Payment/Administrative issues, rogue employee with access, deprecated service, inter-region routing issues, root certificate compromises... the list goes on and it is certainly not limited to single AZ.

A very good example is that regardless of which of the 85 AZs you are in at AWS, you are affected by this issue right now.

Multi-cloud with the right tooling is trivial. Investing in learning cloud-proprietary stacks is a waste of your investment. You're a clown if you think 25 people internally per cloud are required to "do it right".


All cloud tech is proprietary.

There is no such thing as trivially setting up a secure, fully automated cloud stack, much less anything like a streamlined cloud agnostic toolset.

Deprecated services are not the discussion here. We're talking tactical availability, not strategic tools etc.

Rogue employees with access? You mean at the cloud provider or at your company? Still doesn't make sense. Cloud IAM is very difficult in large organizations, and each cloud does things differently.

I worked at a Fortune 100 finance company on cloud security. Some things were quite dysfunctional, but the struggles and technical challenges are real and complex at a large organization. Perhaps you're working at a 50-employee greenfield startup. I'll hesitate to call you a clown as you did me, because that would be rude and dismissive of your experience (if any) in the field.


I advise many fintechs with engineering orgs from 5 to 5000, 2 in the top 100 - none are blindly single-cloud and none have 25 people dedicated to each of their public clouds. The largest is not on any public cloud due to regulation/compliance and has several colocation facilities for their mission-critical workloads - they have fewer than 25 people dedicated in the entire netsec org. This is a company that maintains strict PCI-DSS1 on thousands of servers and thousands of services.

If you're employing 25 people per cloud for netsec in a multi-cloud environment, you have some seriously deficient DevOps practices, or your org is 5 figures deep and has been ignoring DevOps best practices while on cloud for half a decade. HashiCorp's entire business revolves around cloud-agnostic toolkits. All clouds speak Kubernetes at this point, and unless you have un-cloudable components in your stack (like root cert key signing systems on a proprietary appliance), you really should never find yourself in a scenario where you have that many people overseeing infra security on a public cloud. It's been proven time and time again that too many people managing security is inversely secure.


I meant at least 25 people in the DevSecOps role per cloud. Security experts, network/ops/systems experts, and automation (gitlab and container underlay) support.

K8s is one of a hundred technologies to learn and use, and just because each cloud is supported by terraform, you can't swap a GCP terraform writer over to Azure in a day.

And no bank is without their uncloudable components.


Someone start devops as a service please


This.


I imagine there are very few businesses where the extra cost of going multi-cloud is smaller than the cost of being down during AWS outages.


Also, going multi-cloud will introduce more complexity, which leads to more errors and more downtime. I'd rather sit this outage out than deal with a daily risk of downtime because my infrastructure is too smart for its own good.


Depends on the criticality of the service. I mean, you're right about adding complexity. But sometimes you can just take your really critical services and make sure they can completely withstand any one cloud provider outage.


How do you become multi-cloud if your root domain is in Route53? Have Backup domains on the client side?


DNS records should be synced to a secondary provider, and that provider's nameservers added as secondary/tertiary NS records for your domain.

Multi-provider dns is a solved problem.
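
A rough sketch of the Route53 side of that sync (boto3; the zone ID is hypothetical, and pushing to the secondary provider is left out since every provider's API differs):

  import boto3

  route53 = boto3.client("route53")

  def export_zone(zone_id):
      # Walk the zone; the API pages via IsTruncated/NextRecordName.
      records, kwargs = [], {"HostedZoneId": zone_id}
      while True:
          page = route53.list_resource_record_sets(**kwargs)
          records.extend(page["ResourceRecordSets"])
          if not page["IsTruncated"]:
              return records
          kwargs["StartRecordName"] = page["NextRecordName"]
          kwargs["StartRecordType"] = page["NextRecordType"]

  # Push `records` to the secondary provider on a schedule, then list that
  # provider's nameservers alongside Route53's at your registrar.
  records = export_zone("Z0123456789ABCDEFGHIJ")  # hypothetical zone ID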


Sometimes you can't avoid us-east-1; an example is AWS ECR Public. It's a shame. Meanwhile, DockerHub is up and running even though it's hosted on EC2 itself.


us-east-1 is a cursed region. It's old, full of one-off patches to work and tends to be the first big region released to.


No one was in the room where it happened


This is also a clever play on the Hawthorne Heights song.



Didn't us-east-2 have an issue last week?


Can I get that on a license plate?


I wonder why AWS has Ohio and Virginia but no region in the northeast where a significant plurality of customers using east regions probably live.


This is affecting Heroku.

While my heroku apps are currently up, I am unable to push new versions.

Logging in to heroku dashboard (which does work), there is a message pointing to this heroku status incident for "Availability issues with upstream provider in the US region": https://status.heroku.com/incidents/2390

How can there be an outage severe enough to be affecting middleman customers like Heroku, but the AWS status page is still all green?!?!

If whoever runs the AWS status page isn't embarrassed, they really ought to be.


AWS management APIs in the us-east-1 region are what's affected. I'm guessing Heroku uses at least the S3 APIs when deploying new versions, and those are failing (intermittently?).

I advise not touching your Heroku setup right now. Even something like trying to restart a dyno might mean it doesn't come back since the slug is probably stored on S3 and that will fail.


The fun thing about these types of outages is seeing all of the people that depend upon these services with no graceful fallback. My roomba app will not even launch because of the AWS outage. I understand that the app gets "updates" from the cloud. In this case "updates" is usually promotional crap, but whatevs. However, for this to prevent the app from launching in a manner that lets me control my local device is total BS. If you can't connect to the cloud, fail, move on and load the app so that local things are allowed to work.

I'm guessing other IoT things suffer from this same short-sightedness as well.


If you did that some clever person would set up their PiHole so that their device just always worked, and then you couldn't send them ads and surveil them. They'd tell their friends and then everyone would just use their local devices locally. Totally irresponsible what you're suggesting.


A little off-topic, but there are people working on it: https://valetudo.cloud/

It's a little harder than blocking the DNS unfortunately. But nonetheless it always brings a smile to my face to see that there's a FOSS frontier for everything.


An even more clever person would package up this box, and sell it, along with a companion subscription service, to help busy folks like myself.


But this new little box would then be required to connect to the home server to receive updates. Guess what? No updates, no worky!! It's a vicious circle!!! Outages all the way down


this is why everyone runs piholes and no one sees ads on the internet anymore, which killed the internet ad industry


Dear person from the future, can you give me a hint on who wins the upcoming sporting events? I'm asking for a friend of course


Also, what's the verdict on John Titor?


>The fun thing about these types of outages is seeing all of the people that depend upon these services with no graceful fallback.

What's a graceful fallback? Switching to another hosting service when AWS goes down? Wouldn't that present another set of complications for a very small edge case at huge cost?


Usually this refers to falling back to a different region in AWS. It's typical for systems to be deployed in multiple regions due to latency concerns, but it's also important for resiliency. What you call "a very small edge case" is occurring as we speak, and if you're vulnerable to it you could be losing millions of dollars.


AWS itself has a huge single point of failure on us-east-1 region. Usually, if us-east-1 goes down, others soon follow. At that point, it doesn't matter how many regions you're deploying to.


My workloads on Sydney and London are unaffected. I can't speak for anywhere else.


"Usually"? When has that ever happened?



Thanks. That shows that OP's claim that "Usually, if us-east-1 goes down, others soon follow." is false.


Cognito (and r53) have hard dependencies in us-east-1. It’s mentioned up and down this thread.


I don't want to minimize the impact from cognito and r53, but that's quite a different scale of failure than the OP implies. It's been a while since I used AWS, but we had multiple regions and saw no impact to our services in other regions the one time that us-east-1 had a major failure. And we used r53.


Perhaps I could've been more precise with my words. What I meant to say is that IAM and r53 are two of many critical services that all depend on us-east-1. It goes without saying that if those services go down in us-east-1, all of AWS is affected. This doesn't just happen "usually." When IAM goes down, AWS experiences major issues across all regions. If you were okay, perhaps you got lucky? Our team has 7 different prod regions and we see multiple regions go down every time a problem of this scale occurs.

If your product requires 100% uptime, you may need to look at backup options or design your product in such a way that can handle temporary cloud failures.


I heard from someone at <big media company> that they couldn't switch to their fallback region because they couldn't update DNS on Route53. All the console commands and web interface were failing.


Probably not possible for a lot more services than you'd think, because AWS Cognito has no decent failover method.


There is a company that delivers broadcast video ads to hundreds of TV stations on demand. The ad has to run and run now, so they cannot tolerate failure.

They write the videos to GCS storage in Google Cloud, and to S3 in AWS. Every point of their workflow is checkpointed and cross-referenced across GCP and AWS. If either side drops the ball, the other picks it up.

So yes, you can design a super fault tolerant system. This company did it because failing to deliver a few ads would mean the loss of major contracts.
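
A stripped-down sketch of that dual-write pattern (the bucket names are made up; a real version would checkpoint the result to a store both workflows can read):

  import boto3
  from google.cloud import storage

  s3 = boto3.client("s3")
  gcs = storage.Client()

  def dual_write(key, payload):
      # Write to both clouds; succeed if at least one side takes the object,
      # and return which sides landed so the other workflow can reconcile.
      landed = {}
      try:
          s3.put_object(Bucket="ads-primary-aws", Key=key, Body=payload)
          landed["aws"] = True
      except Exception:
          landed["aws"] = False
      try:
          gcs.bucket("ads-primary-gcp").blob(key).upload_from_string(payload)
          landed["gcp"] = True
      except Exception:
          landed["gcp"] = False
      if not any(landed.values()):
          raise RuntimeError("both clouds rejected the write")
      return landed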


In this case, just connect over LAN.


Right -- I think I misread OP as meaning a graceful fallback, e.g. working offline.

Rather than implementing a dynamically switching backup in the event of AWS going down, which is not trivial.


Or BlueTooth.


Well, BT is a hell of a protocol. I wouldn't wish that on anyone.


> Wouldn't that present another set of complications for a very small edge case at huge cost?

One has to crunch the numbers. What does a service outage cost your business every minute/hour/day/etc in terms of lost revenue, reputational damage, violated SLAs, and other factors? For some enterprises, it's well worth the added expense and trouble of having multi-site active-active setups that span clouds and on-prem.
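
The back-of-the-envelope math is straightforward; every number below is made up for illustration:

  revenue_per_hour = 50_000                  # hypothetical
  expected_outage_hours_per_year = 8         # hypothetical: roughly one bad day
  multi_cloud_extra_cost_per_year = 600_000  # hypothetical: people + tooling

  expected_loss = revenue_per_hour * expected_outage_hours_per_year
  print(expected_loss)                                     # 400000
  print(expected_loss > multi_cloud_extra_cost_per_year)   # False for these numbers

Swap in real numbers (and fold in SLA penalties and reputational damage, which are harder to quantify) and the comparison flips for plenty of businesses.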


Now think of how many assets of various governments' militaries are discreetly employed as normal operational staff by FAAMG in the USA and have access to cause such events from scratch. I would imagine that the US IC (CIA/NSA) already does some free consulting for these giant companies to this end, because they are invested in that Not Being Possible (indeed, it's their job).

There is a societal resilience benefit to not having unnecessary cloud dependencies beyond the privacy stuff. It makes your society and economy more robust if you can continue in the face of remote failures/errors.

It is December 7th, after all.


> I would imagine that the US IC (CIA/NSA) already does some free consulting for these giant companies

This comment is how I know you don't work in the public sector. Those agencies' infrastructures are essentially run by contractors with a few GS personnel making bad decisions every chance they get and a few DoD personnel acting like their rank can fix technical problems.


I'm not talking about running infrastructure, I'm talking about working with HR and making sure that the people they've hired as sysadmins aren't meeting with FSB or PLA agents in a local park on weekends and accepting suitcases of USD cash to accidentally `rm -rf` all of the Zookeeper/etcd nodes at once on a Monday morning.


I doubt it. We’re shit at critical infrastructure defense and the military cares mostly about its own networks. And industry really doesn’t want to cooperate. I was in cyber and military IT. Can’t speak for the IC, but I really doubt it.


> I would imagine that the US IC (CIA/NSA) already does some free consulting for these giant companies to this end,

Haha, it would be funny if the IC reached out to BigTech when failures occur to let them know they need not worry about data losses. They can just borrow a copy of the data the IC is siphoning off them. /s?


I wouldn't jump to say it's short-sightedness (it is shitty), but it could be a matter of being pragmatic... It's easier to maintain the code if it is loaded at run time (think thin-client browser style). This way your IoT device can load the latest code and even settings from the cloud (an advantage when the cloud is available)... I think of this less as short-sightedness and more as a reasonable trade-off (with shitty side effects).


Then you could just keep a local copy available as a fallback in case the latest code cannot be fetched. Not doing the bare minimum and screwing the end user isn't acceptable IMHO. But I also understand that'd take some engineering hours and bring virtually no benefit since these outages are rare (not sure how Roomba's reliability is in general, on the other hand), so here we are.


I don't think that's ever a reasonable tradeoff! Network access goes down all the time, and should be a fundamental assumption of any software.

Maybe I'm too old, but I can't imagine a seasoned dev, much less a tech lead, omitting planning for that failure mode.


there's a reason it is called the Internet of things, and not the "local network of things". Even if the latter is probably what most customers would prefer.


There’s also no reason for an internet connected app to crash on load when there is no access to the internet services.


indeed.

A constitutional property of a network is its volatility. Nodes may fail. Edges may. You may not. Or you may. But then you're delivering not reliability but crap. Nice sunshine crap, maybe.


In a traditional internet style architecture, your Roomba and phone would both have globally routable addresses and could directly communicate - your local network would provide hosts with internet connectivity.


The internet was originally designed to still work if there was a nuclear strike on a node. Now we can't even cope with us-east-1 being down.


"If you can't connect to the cloud, fail, move on and load the app so that local things are allowed to work."

Building fallbacks requires work. How much extra effort and overhead is needed to build something like this? Sometimes the cost vs. benefit says that it is ok not to do it. If AWS has an outage like this once a year, maybe we can deal with it (unless you are working with mission-critical apps).


Yes, it is a lot of work to test if response code is OK or not, or if a timeout limit has been reached. So much so, I pretty much wrote the test in the first sentence. Phew. 10x coder right here!
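
In case it sounds like an exaggeration, a sketch of the whole "fallback" (the URL and cache path are made up):

  import json
  import requests

  CONFIG_URL = "https://api.example-iot.invalid/v1/config"  # made-up endpoint
  CACHE_PATH = "/var/lib/app/config-cache.json"

  def load_config():
      # Try the cloud briefly; on any failure, fall back to the last good copy
      # so local features keep working.
      try:
          resp = requests.get(CONFIG_URL, timeout=2)
          resp.raise_for_status()
          config = resp.json()
          with open(CACHE_PATH, "w") as fh:
              json.dump(config, fh)
          return config
      except (requests.RequestException, ValueError, OSError):
          with open(CACHE_PATH) as fh:
              return json.load(fh)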


Remember the Signal outage? Which is to say, the testing is the problem, not the implementation itself. (Which isn’t to say I think it’s optional—I myself generally don’t run online-only local-by-their-nature things.)


Someone posted this on our company slack: https://stop.lying.cloud/


I just walked into the Amazon Books store at our local mall. They are letting everyone know at the entrance that “some items aren’t available for purchase right now because our systems are down.”

So at least Amazon retail is feeling some of the pain from this outage!


My job at Azure (although only 50% of my time) is unit testing/monitoring services under different scenarios and flows to detect small failures that would be overlooked on a public status page. Our tests run multiple times daily and we have people constantly monitoring logs. It concerns me when I see all AWS services 100% green when I know there is an outage.


I don't know how accurate this information is, but I'm hearing that the monitor can't be updated because the service is in the region that is down.


Kinda hard to believe after they were blasted for that very situation during/after the S3 outage way back.

If that's the case, it's 100% a feature. They want as little public proof of an outage after it's over and to put the burden on customers completely to prove they violated SLAs.


I love that every time this happens, 100% of the services on https://status.aws.amazon.com are green.


Those five 9s don't come easy. Sometimes you have to prop them up :)


I wonder how often outages really happen. The official page is nonsense, of course, and we only collectively notice when the outage is big enough that lots of us are affected. On AWS, I see about a 3:1 ratio of "bump in the night" outages (quickly resolved, little corroboration) to mega too-big-to-hide outages. Does that mirror others' experiences?


If you count any time AWS is having a problem that impacts our production workloads, then I think it's about 5:1. Dealing with "AWS is down" outages is easy because I can just sit back and grab some popcorn; it's the "dammit, I know this is AWS's fault" outages that are a PITA, because you count yourself lucky to even get a report in your personalized dashboard.


Yep.

Random aside: any chance you are related to the Calculus on Manifolds Spivak?


I had to log in to say, that one of my favorite quotes of all time I found in Calculus on Manifolds.

He says that any good theorem is worth generalizing, and I've generalized that to any life rule.


Nope, just a fan. It was the book that pioneered my love of math.


https://aws.amazon.com/compute/sla/

looks like only four 9's


  > looks like only four 9's 
That's why the Germans are such good engineers.

  Did the drives fail? Nein.
  Did the CPU overheat? Nein.
  Did the power get cut? Nein.
  Did the network go down? Nein.
That's "four neins" right there.


Every time someone asks to update the status page, managers say "nein"


It’s hard to measure what five 9s is because you have to wait around until a 0.00001 occurs. Incentivizing post-mortems is absolutely critical in this case.


As a percentage it's 0.001%; the first two 9s make up the "99" before the decimal.

  5N  = 99.999%
  3N  = 99.9%
  1N5 = 95%
3N is <43m12s downtime per 30-day month; 5N is closer to 26 seconds.
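
To sanity-check the arithmetic, a quick sketch:

  def downtime_minutes_per_month(nines, month_minutes=30 * 24 * 60):
      # e.g. nines=5 -> 99.999% uptime -> allowed downtime per 30-day month
      return month_minutes * 10 ** -nines

  print(downtime_minutes_per_month(3))  # 43.2 minutes (43m12s)
  print(downtime_minutes_per_month(5))  # ~0.43 minutes (~26 seconds)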


I considered writing it as a percent but then decided against it, moving the decimal instead. But good info for clarification.


EC2 or S3 showing red in any region literally requires personal approval of the CEO of AWS.


Is this true or a joke? This sort of policy is how you destroy trust.


From what I've heard it's mostly true. Not only the CEO but a few SVPs can approve it, but yes a human must approve the update and it must be a high level exec.

Part of the reason is because their SLAs are based on that dashboard, and that dashboard going red has a financial cost to AWS, so like any financial cost, it needs approval.


Sure, but... that just raises more questions :)

Taken literally, what you are saying is the service could be down and an executive could override that, preventing them from paying customers for a service outage, even if the service did have an outage and the customer could prove it (screenshots, metrics from other cloud providers, many different folks seeing it).

I'm sure there is some subtlety to this, but it does mean that large corps with influence should be talking to AWS to ensure that status information corresponds with actual service outages.


Large corps with influence get what they want regardless. Status page goes red and the small corps start thinking they can get what they want too.


> Status page goes red and the small corps start thinking they can get what they want too.

I think you mean "start thinking they can get what they pay for"


I have no inside knowledge or anything but it seems like there are a lot of scenarios with degraded performance where people could argue about whether it really constitutes an outage.


One time GCP argued that since they returned 404s on GCS for a few hours, that wasn't an uptime/latency SLA violation, so we were not entitled to a refund (though they refunded us anyway)


Man, between costs and shenanigans like this, why don't more companies self-host?


1. Leadership prefers to blame cloud when things break rather than take responsibility.

2. Cost is not an issue (until it is but you’re already locked in so oh well)

3. Faang has drained the talent pool of people who know how


If you think that’s bad you should see the outages when you self host without a big enough team to really manage it.


Opex > Capex. If companies thought about long term, yes they might consider it. But unless the cloud providers fuck up really badly, they're ok to take the heat occasionally and tolerate a bit of nonsense.


You can lease equipment you know…


Yep. I was an SRE who worked at Google and also launched a product on Google Cloud. We had these arguments all the time, and the contract language often provides a way for the provider to weasel out.


Like I said I never worked there and this is all hearsay but there is a lot of nuance here being missed like partial outages.


This is no longer a partial outage. The status page reports elevated API error rates, DynamoDB issues, EC2 API error rates, and my company's monitoring is significantly affected (IE, our IT folks can't tell us what isn't working) and my AWS training class isn't working either.

If this needed a CEO to eventually get around to pressing a button that said "show users the actual information about a problem" that reflects poorly on amazon.


My friend works at a telemetry company for monitoring and they are working on alerting customers of cloud service outages before the cloud providers since the providers like to sit on their hands for a while (presumably to try and fix it before anyone notices).


Being dishonest about SLAs seems to bear zero cost in this case?


It's not really dishonest though because there is nuance. Most everything in EC2 is still working it seems, just the console is down. So is it really down? It should probably be yellow but not red.


if you cannot access the control plane to create or destroy resources, it is down (partial availability). The jobs that are running are basically zombies.


I'm right in the middle of an AWS-run training and we literally can't run the exercises because of this.

Let me repeat that: my AWS training that is run by AWS, that I pay AWS for, isn't working, because AWS is having control plane (or other) issues. This is several hours after the initial incident. We're doing training in us-west-2, but the identity service and other components run in us-east-1.


I’m running EKS in us-west-2. My pods use a role ARN and identity token file to get temporary credentials via STS. STS can’t return credentials right now. So my EKS cluster is “down” in the sense that I can’t bring up new pods. I only noticed because an auto-scaling event failed.
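
For anyone unfamiliar with the mechanism, this is roughly the exchange that's failing under the hood (a sketch; the env vars are the ones the EKS/IRSA webhook normally injects):

  import os
  import boto3

  sts = boto3.client("sts", region_name="us-west-2")

  # EKS mounts a projected service-account token and sets these env vars;
  # the SDK then exchanges the token for temporary credentials via STS.
  with open(os.environ["AWS_WEB_IDENTITY_TOKEN_FILE"]) as fh:
      token = fh.read()

  creds = sts.assume_role_with_web_identity(
      RoleArn=os.environ["AWS_ROLE_ARN"],
      RoleSessionName="pod-session",
      WebIdentityToken=token,
  )["Credentials"]  # the step that fails when STS is impaired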


We ran through the whole 4.5 hour training and the training app didn't work the entire time.


Seems like the API is still working and so is auto scaling. So they aren’t really zombies.

Partial availability isn’t the same as no availability.


The API is NOT working -- it may not have been listed on the service health dashboard when you posted that, but it is now. We haven't been able to launch an instance at all, and we are continuously trying. We can't even start existing instances.


Depending the workload being run users may or may not notice. Should be Yellow at a minimum.


Heroku is currently having major problems. My stuff is still up, but I can't deploy any new versions. Heroku runs their stuff on AWS. I have heard reports of other companies who run on AWS also having degraded service and outages.

I'd say when other companies who run their infrastructure on AWS are going down, it's hard to argue it's not a real outage.

But AWS status _has_ changed to yellow at this point. Probably heroku could be completely down because of an AWS problem, and AWS status would still not show red. But at least yellow tells us there's a problem, the distinction between yellow and red probably only matters at this point to lawyers arguing about the AWS SLA, the rest of us know yellow means "problems", red will never be seen, and green means "maybe problems anyway".

I believe the entire us-east-1 could be entirely missing, and they'd still only put a yellow not a red on status page. After all, the other regions are all fine, right?


"Good at finding excuses" is not the same thing as "honest."


SNS seems to be at least partially down as well


My company relies on DynamoDB, so we're totally down.

edit: partly down; it's sporadically failing


A little honesty from cloud providers would help ensure regulators don't step in.


Zero directly-attributable, calculable-at-time-of-decision cost. Of course there's a cost in terms of customers who leave because of the dishonest practice, but, who knows how many people that'll be? Out of the customers who left after the outage, who knows whether they left due to not communicating status promptly and honestly or whether it was for some other reason?

Versus, if a company has X SLA contracts signed, that point to Y reimbursement for being out for Z minutes, so it's easily calculable.


I wonder how well known this is. You'd think it would be hard to hire ethical engineers with such a scheme in place and yet they have tens of thousands.


maybe we gotta consider the publicly facing status pages as something other than a technical tool (e.g. marketing or PR or something like that, dunno)


If you trust them at this point, you have not been paying attention, and will probably continue to trust after this.


Well, no big deal, there's not really a lot of trust there to destroy...


Unfortunately, errors don't require his approval...


Uhhhhh... what if the monitoring said it was hard down? They'd still not show red?


Probably they cannot. They outsourced this dashboard and it runs on AWS now ;).


It's like trying to get the truth out of a kid that caused some trouble.

Mom: Alexa, did you break something?

Alexa: No.

M: Really? What's this? 500 Internal server error

A: ok maybe management console is down

M: Anything else?

A: ...

A: ... ok maybe cloudwatch logs

M: Ah hah. What else?

A: That's it, I swear!

M: 503 ClientError

A: ...well okay secretsmanager might be busted too...


Funny I literally just asked my Alexa.

Me: Alexa, is AWS down right now?

Alexa: I'd rather not answer that


Wise robot.

That's a bit like involving your kid in an argument between parents.


There was a great response in r/relationship advice the other day where someone said that OP's partner forced a fight because they're planning to cheat on them, reconcile, and then will 'trickle out the truth' over the next 6 months. I'm stealing that phrase.


The very expensive EC2 instance I started this morning still works. Of course now I can't shut it down.


I don't see why they couldn't provide an error rate graph like Reddit[0] or simply make services yellow saying "increased error rate detected, investigating..."

0: https://www.redditstatus.com/#system-metrics


An executive has an OKR around uptime, and an automated system prevents him or her from having control over the messaging. Therefore any effort to create one is squashed, leaving the people requesting it confused as to why, and without any explanation. Oldest story in the book.


Because Amazon has $$$$$ in their SLOs, and it costs them through the nose every minute they're down in payments made to customers and fees refunded. I trust them and most companies not to be outright fraudulent (although I'm sure some are), but it's totally understandable they'd be reticent to push the "Downtime Alert/Cost Us a Ton of Money" button until they're sure something serious is happening.


It literally is fraudulent though.

I don't think a region being down is something that you can be unsure about.


Oh, you can get pretty weaselly about what “down” means. If there is “just” an S3 issue, are all the various services which are still “available” but throwing an elevated number of errors because of their own internal dependency on S3 actually down or just “degraded?” You have to spin up the hair-splitting apparatus early in the incident to try to keep clear of the post-mortem party. :D


This is an incentive to dishonesty, leading to fraudulent payments and false advertising of uptime to potential customers.

Hopefully it results in a class action lawsuit for enough money that Amazon decides that an automated system is better than trying to supply human judgement.


Can someone just have a site ping all the GET endpoints on the AWS API? That is very far from "automating [their entire] system" but it's better than what they're doing.
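
Even something crude run from outside AWS would be an improvement. A sketch (the set of probes, credentials, and thresholds are up to you):

  import time
  import boto3

  def probe(region="us-east-1"):
      # Cheap read-only calls; success and latency are the whole signal.
      checks = {
          "sts": lambda: boto3.client("sts", region_name=region).get_caller_identity(),
          "s3":  lambda: boto3.client("s3", region_name=region).list_buckets(),
          "ec2": lambda: boto3.client("ec2", region_name=region).describe_regions(),
      }
      results = {}
      for name, call in checks.items():
          start = time.monotonic()
          try:
              call()
              results[name] = ("ok", round(time.monotonic() - start, 3))
          except Exception as err:
              results[name] = ("error", str(err))
      return results

  print(probe())  # run on a schedule from somewhere that isn't AWS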


Something like this? https://stop.lying.cloud/


Um, that's just a reskin of AWS's lying status page. No real new data there.


It should be costing them trust not to push it when they should though. A trustworthy company will err on the side of pushing it. AWS is a near-monopoly, so their unprofessional business practices have still yet to cost them.


> It should be costing them trust not to push it when they should though.

This is what Amazon, the startup, understood.

Step 1: Always make it right and make the customer happy, even if it hurts in $.

Step 2: If you find you're losing too much money over a particular issue, fix the issue.

Amazon, one of the world's largest companies, seems to have forgotten that the risk of not reporting accurately isn't money, but breaking the feedback chain. Once you start gaming metrics, no leaders know what's really important to work on internally, because no leaders know what the actual issues are. It's late Soviet Union in a nutshell. If everyone is gaming the system at all levels, then eventually the ability to objectively execute decreases, because effort is misallocated due to misunderstanding.


> It's late Soviet Union in a nutshell

How come an action of a private company in a capitalist country is like the Soviet Union?


Private companies are small centrally-planned economies within larger capitalist systems.


They talk about gaming the metrics. Something the Soviet Union was known for.


I can Google and see how many apps, games, or other services are down. So them not "pushing some buttons" to confirm it isn't fooling anyone.


The more transparency you give, the harder it is to control the narrative. They have a general reputation for reliability, and exposing just how many actual errors/failures there are (which generally don't affect a large swath of users/use cases) would hurt that reputation for minimal gain.


Wow. Kudos to the reddit engineering team. That's one of the nicest status pages I have seen.


because nobody cares when reddit is down. or at least, nobody is paying them to be up 99.999% of the time.


> Goodhart's Law is expressed simply as: “When a measure becomes a target, it ceases to be a good measure.”

It’s very frustrating. Why even have them?


Because "uptime" and "nines" became a marketing term. Simple as that. But the problem is that any public-facing measure of availability becomes a defacto marketing term.


The older I get the more I hate marketers. The whole field stands on the back of war-time propaganda research and it sure feels like it's the cause of so much rot in society.


Also, 4-5 nines is virtually impossible for complex systems, so the sort of responsible people who could make 3 nines true begin to check out, and now you're getting most of your info from the delusional, and you're lucky if you manage 2 objective nines.


On top of that, the "Personalized Health Dashboard" doesn't work because I can't seem to log in to the console.


I'm logged in; you're missing an error message.


We have federated login with MFA required (which was failing). It just started working again.

Scratch that... console is not loading at all now :)


No wonder IMDB <https://www.imdb.com/> is down (returning 503). Sad that Amazon engineers don't implement what they teach their customers -- designing fault-tolerant and highly available systems.


"some customers may experience a slight elevation in error rates" --> everything is on fire, absolutely nothing works


https://downdetector.com

Amazing and scary to see all the unrelated services down right now.


I think it's pretty unlikely that both Google and Facebook are affected by this minor AWS outage, whatever DownDetector says. I even did a spot check on some of the smaller websites they report as "down", like canva.com, and didn't see any issues.


You might be right about Google and Facebook, but this isn't minor at all. Impact is widespread.


When I worked there it required the signoff of both your VP-level executive and the comms team to update the status page. I do not believe I ever received said signoff before the issues were resolved.


I remember the time when S3 went down and took the status page down with it


Status pages are hard


When they have too much pride in an all-green dash, sure. Allowing any engineer to declare a problem when first detected? Not so hard, but it doesn't make you look good if you have an ultra-twitchy finger. They have the balance badly wrong at the moment though.


A trigger-happy status page gives realtime feedback for anyone doing a DoS attack. Even if you published that information publicly you would probably want it on a significant delay.


They sent their CEO into space, I am sure they have the resources to figure it out.


No they're not.

Step 1: deploy status checks to an external cloud.


I agree, but does come with increased challenges with false positives.

That being said, AWS status pages are up.


More like admitting failure is hard.


“Falsehoods Programmers Believe About Status Pages”


Not if you’re AWS. At this point I’m fairly sure their status page is just a static html that always show all green.


Well, it is.


It's widespread industry knowledge now that AWS is publicly dishonest about downtime.

When the biggest cloud provider in the world is famous for gaslighting, it sets expectations for our whole industry.

It's fucking disgraceful that they tolerate such a lack of integrity in their organization.


I wonder if the other parts of Amazon do this. Like their inventory system thinks something is in stock, but people can't find it in the warehouse, do they just simply not send it to you and hope you don't notice? AWS's culture sounds super broken.

My favorite status page, though, is Slack's. You can read an article in the New York Times about how Slack was down for most of a day, and the status page is just like "some percentage of users experienced minor connectivity issues". "Some percentage" is code for "100%" and "minor" is code for "total". Good try.


Makes you wonder if they have to manually update the page when outages occur. That'd be a pretty bad way to go, so I'd hope not. Maybe the code to automatically update the page is in us-east-1? :)


Something like that has impacted the status page in the past. There was a severe Kinesis outage last year (https://aws.amazon.com/message/11201/), and they couldn't update the service dashboard for quite a while because their tool for managing the service dashboard lives in us-east-1 and depends on Kinesis.


Word on the street is the status page is just a JPG


I assume each service has its own health check that checks the service is accessible from an internal location, thus most are green. However, when Service A requires Service B to do work, but Service B is down, a simple access check on Service A clearly doesn't give a good representation of uptime.

So what's a good health check actually report these days? Is it just about its own status, or should it include a breakdown of the status of external dependencies as part of its folded up status?
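
One pattern that answers this reasonably well (a sketch; the dependency probes are placeholders you'd wire to real checks): report your own status plus a per-dependency breakdown, and roll up to "degraded" rather than a binary up/down.

  from datetime import datetime, timezone

  def health_report(dependency_probes):
      # dependency_probes: {"s3": callable, "downstream-api": callable, ...}
      # Each probe returns True/False (or raises) for "can I do real work with it?"
      deps = {}
      for name, probe in dependency_probes.items():
          try:
              deps[name] = "ok" if probe() else "degraded"
          except Exception:
              deps[name] = "down"
      overall = "ok"
      if any(status != "ok" for status in deps.values()):
          overall = "degraded"  # the service itself is up, but some work will fail
      return {
          "status": overall,
          "dependencies": deps,
          "checked_at": datetime.now(timezone.utc).isoformat(),
      }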


It seems they updated it ~30 minutes after your comment.


Are they lying, or just prioritizing their own services?


https://music.amazon.co.uk is giving me an error since about 16:30 GMT

"We are experiencing an error. Our apologies – We will be back up soon."


amazon.com seems to be having problems too. I get anything from a "something went wrong" page to a new design/layout, which I assume is either new or a failsafe.


Looks okay right now to this UK user of amazon.co.uk


It depends on which Amazon region you are being served from.

It is very unlikely that Amazon would deliberately make your messages cross the Atlantic just to find an American region that is unable to serve you.


Willing to bet the status page gets updated by logic on us-east-1


Status service is probably hosted in us-east-1


Well yea, it's eventually consistent ;)


It's starting to show issues now. I agree that it took a bit long before we could get real visibility on the incident.


Wonder why almost all Amazon frontend looks like it was written in c++


That page is not loading for me... on which region is it hosted?


The problem being that often times you can't actually update the status page. Most internal systems are down.

We can't even update our product to say it's down, because accessing the product requires a process that is currently dead.


That's why your status page should be completely independent from the services it is monitoring (minus maybe something that automatically updates it). We use a third party to host our status page specifically so that we can update it even if all our systems are down.


I'm not saying you're wrong, or that the status page is architected properly. I'm just speaking to the current situation.


Not right now. I think they monitor if it appears on HN too.


Even better, when I try to go to console, I get:

> AWS Management Console Home page is currently unavailable.

> You can monitor status on the AWS Service Health Dashboard.

"AWS Service Health Dashboard" is a link to status.aws.amazon.com... which is ALL GREEN. So... thanks for the suggestion?

At this point the AWS service health dashboard is kind of famous for always being green, isn't it? It's a joke to its users. Do the folks who work on the relevant AWS internal team(s) know this and just not have the resources to do anything about it, or what? If it's a harder problem than you'd think for interesting technical reasons, that'd be interesting to hear about.


Most of our services in us-east-1 are still responding although we cannot log into the console. However, it looks like dynamodb is timing out most requests for us.


My raspberry pi is still working just fine


Yeah, strange, my self-hosted server isn't affected either.


Seems "the cloud" had a major outage less than a month ago, my laptop has a higher uptime.

$ 16:04 up 46 days, 7:02, 9 users, load averages: 3.68 3.56 3.18

US East 1 was down just over a year ago

https://www.theregister.com/2020/11/25/aws_down/

Meanwhile I moved one of my two internal DNS servers to a second site on 11 Nov 2020, and it's been up since then. One of my monitoring machines has been filling, rotating and deleting logs for 1,712 days with a load average in the c. 40 range for that whole time, just works.

If only there was a way to run stuff with an uptime of 364 days a year without using the cloud /s


I think the point of the cloud isn't increased uptime - the point is that when it's down, bringing it back up is someone else's problem.

(Also, OpEx vs CapEx financial shenanigans...)

All the same, I don't disagree with your point.


> the point is that when it's down, bringing it back up is someone else's problem.

When it's down, it's my problem, and I can't do anything about it other than explain why I have no idea the system is broken and can't do anything about it.

"Why is my dohicky down? When will it be back?"

"Because it's raining, no idea"

May be accurate, it's also of no use.

But yes, OpEx vs CapEx; of course that's why you can lease your servers. It's far easier to spend another $500 a month of company money on AWS than $500 a year on a new machine.


so does my toaster, oven and microwave. so what? they get used a few times a day, but my production level equipment serves millions in an hour.


My lightswitch is used twice a day, yet it works every time. In the old days it would occasionally break (bulb goes), I would be empowered to fix it myself (change the bulb).

In the cloud you're at the mercy of someone who doesn't even know you exist to fix it, without the protections that say an electric company has with supplying domestic users.

This thread has people unable to turn their lights on[0], it's hilarious how people tie their stuff to dependencies that aren't needed, with a history of constant failure.

If you want to host millions of people, then presumably your infrastructure can cope with the loss of a single AZ (and ideally the loss of Amazon as a whole). The vast majority of people will be far better off without their critical infrastructure going down in the middle of the day in the busiest sales season going.

[0] https://news.ycombinator.com/item?id=29475499


Cool. Now let's have a race to see who can triple their capacity the fastest. (Note: I don't use AWS, so I can actually do it)


Why would I want to triple my capacity?

Most people don't need to scale to a billion users overnight.


Many B2B-type applications have a lot of usage during the workday and minimal usage outside of it. No reason to keep all that capacity running 24/7 when you only need most of it for ~8 hours per weekday. The cloud is perfect for that use case.


idk man, idle hardware doesn't use all that much power.

https://www.thomas-krenn.com/en/wiki/Processor_P-states_and_...

Which is an implementation of:

https://web.eecs.umich.edu/~twenisch/papers/asplos09.pdf


Is it really? How much does that scaling actually cost?

And what's a workday anyway, surely you operate globally?


Scaling itself costs nothing, but saves money because you're not paying for unused capacity.

The main application I run operates in 7 countries globally, but the US is the only one that has enough usage to require additional capacity during the workday. So out of 720 hours in a 30 day month, cloud scaling allows me to pay for additional capacity for only the (roughly) 160 hours that it's actually needed. It's a significant cost saver.

And because the scaling is based on actual metrics, it won't scale up on a holiday when nobody is using the application. More cost savings.


You are (conveniently or not) incorrectly assuming that the unit price of provisioned vs on-demand capacity is the same. It's not.


Nice of you to assume that I don't understand the pricing of the services I use. I can assure you that I do, and I can also assure you that there is no such thing as provisioned vs on-demand pricing for Azure App Service until you get into the higher tiers. And even in those higher tiers, it's cheaper for me to use on-demand capacity.

Obviously what I'm saying will not apply to all use cases, but I'm only talking about mine.


So, we're getting failures (for customers) trying to use amazon pay from our site. AFAIK there is no "status page" for Amazon Pay, but the rest of Amazon's services seem to be a giant Rube Goldberg machine so it's hard to imagine this isn't too.



Thanks.. seems to be about as accurate as the regular AWS status board is..


They just added a banner. My guess is they don't know enough yet to update the respective service statuses.


I have basically zero faith in Amazon at this point.

We first noticed failures because a tester happened to be testing in an env that uses the Amazon Pay sandbox.

I checked the prod site, and it wouldn't even ask me to login.

When I tried to log in to SellerCentral to file a ticket, it told me my password (from a pw manager) was wrong. When I tried to reset it, the OTP was ridiculously slow. Clicking "resend OTP" gives a "the OTP is incorrect" error message. When I finally got an OTP and put it in, the resulting page was a generic Amazon "404 page not found".

A while later, my original SellerCentral password, still un-changed because I never got another OTP to reset it, worked.

What the fuck kind of failure mode is that "services are down, so password must be wrong".


Sorry to hear. If multi-cloud is the answer, I wouldn't be surprised to see folks go back to owning and operating their own gear.


Some sage advice I learned a while ago: "Avoid us-east-1 as much as possible".


But if you use CloudFront, there you go.


Have folks considered a class-action lawsuit against these blatantly fraudulent SLAs to recoup costs?


In my experience, despite whatever is published, companies will privately acknowledge and pay out their SLA terms. (Which still only gets you, like, one day's worth of reimbursement if you're lucky.)


Retail SLAs are a small risk compared to the enterprise SLAs where an outage like this could cost Amazon tens of millions. I assume these contracts have discount tiers based on availability and anything below 99% would be a 100% discount for that bill cycle.


But those enterprise SLAs are the ones they'll be paying out. Retail SLAs are the ones that you'll have to fight for.


Latest update:

"[4:35 PM PST] With the network device issues resolved, we are not working towards recovery of any impaired services. We will provide additional updates for impaired services within the appropriate entry in the Service Health Dashboard."

I guess they gave up.


It was a typo. "Not" has changed to "now".


Hahaha, corrected to “now working” now.


So reinvent is over. Time to deploy.


This got me thinking, are there any major chat services that would go down if a particular AWS/GCP/etc data centre went down?

You don't want your service to go down, plus your team's comms at the same time.


Especially if enough Amazon internal tools rely on it - would be funny if there were a repeat of the FB debacle where Amazon employees somehow couldn't communicate/get back into their offices because of the problem they were trying to fix


Last I knew, Amazon used all Microsoft stuff for business communication.