
It's popular to upvote this during outages, because it fits a narrative.

The truth (as always) is more complex:

* No, this isn't the broad culture. It's not even a blip. These are EXCEPTIONAL circumstances involving extremely bad teams that - if and when found out - would face dramatic intervention.

* The broad culture is blameless post-mortems. Not whose fault it was, but what the problem was and how to fix it. And one of the internal "Ten commandments of AWS availability" is that you own your dependencies. You don't blame others.

* Depending on the service one customer's experience is not the broad experience. Someone might be having a really bad day but 99.9% of the region is operating successfully, so there is no reason to update the overall status dashboard.

* Every AWS customer has a PERSONAL health dashboard in the console that should indicate their experience.

* Yes, VP approval is needed to make any updates on the status dashboard. But that's not as hard as it may seem. AWS executives are extremely operation-obsessed, and when there is an outage of any size are engaged with their service teams immediately.




Well, the narrative is sort of what Amazon is asking for, eh?

The whole us-east-1 management console is gone; what is Amazon posting for the management console on their website?

"Service degradation"

It's not a degradation if it's outright down. Use the red status a little bit more often, this is a "disruption", not a "degradation".


Yeah no kidding. Is there a ratio of how many people it has to be working for to be in yellow rather than red? Some internal person going “it works on my machine” while 99% of customers are down.


I've always wondered why services are not counted down more often. Is there some sliver of customers who have access to the management console for example?

An increase in error rates - no biggie; any large system is going to have errors. But when 80%+ of customer loads in the region are impacted (across availability zones, for whatever good those do) - that counts as down, doesn't it? Error rates in one AZ - degraded. Multi-AZ failures - down?
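
For illustration, a minimal sketch of the rule I have in mind (Python, with made-up thresholds; this is not AWS's actual logic):

    # Hypothetical classification: per-AZ error rates in, coarse region status out.
    # The 5% threshold is invented for the example; real criteria aren't public.
    def region_status(az_error_rates, degraded_threshold=0.05):
        """az_error_rates: dict mapping AZ name -> fraction of failed requests."""
        impaired = [az for az, rate in az_error_rates.items()
                    if rate >= degraded_threshold]
        if not impaired:
            return "operational"
        if len(impaired) == 1:
            return "degraded"  # trouble confined to a single AZ
        return "down"          # multi-AZ impact counts as an outage

    print(region_status({"use1-az1": 0.01, "use1-az2": 0.02}))  # operational
    print(region_status({"use1-az1": 0.40, "use1-az2": 0.01}))  # degraded
    print(region_status({"use1-az1": 0.80, "use1-az2": 0.75}))  # down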


SLAs. Officially acknowledging an incident means that they now have to issue the SLA credits.


The outage dashboard is normally only updated if a certain X percent of hosts for a service is down. If the EC2 section were updated every time a rack in a datacenter went down, it would be red 24/7.

It's only updated when a large percentage of customers are impacted, and most of the time this number is less than what the HN echo chamber makes it appear to be.


I mean, sure, there are technical reasons why you would want to buffer issues so they're only visible if something big went down (although one would argue that's exactly what the "degraded" status means).

But if the official records say everything is green, a customer is going to have to push a lot harder to get the credits. There is a massive incentive to “stay green”.


Yes, there were. I'm from central Europe and we were at least able to get some pages of the console in us-east-1 - but I assume this was more caching-related. Even though the console loaded and worked for listing some entries, we weren't able to post a support case or view SQS messages, etc.

So I agree that "degraded" is not the proper wording - but it was not completely gone either. So... hard to tell what a commonly acceptable wording would be here.


From France, when I connect to "my personal health dashboard" in eu-west-3, it says several services are having "issues" in us-east-1.

To your point, for support center (which doesn't show a region) it says:

Description

Increased Error Rates

[09:01 AM PST] We are investigating increased error rates for the Support Center console and Support API in the US-EAST-1 Region.

[09:26 AM PST] We can confirm increased error rates for the Support Center console and Support API in the US-EAST-1 Region. We have identified the root cause of the issue and are working towards resolution.


I'm part of a large org with a large AWS footprint, and we've had a few hundred folks on a call nearly all day. We have only a few workloads that are completely down; most are only degraded. This isn't a total outage; we are still doing business in east-1. Is it "red"? Maybe! We're all scrambling to keep the services running well enough for our customers.


Because the console works just fine in us-east-2, and the console on the status page does not display regions.

If the console works 100% in us-east-2 and not in us-east-1, why would they mark the console as completely down in us-east?


Well you know, like when a rocket explodes, it's a sudden and "unexpected rapid disassembly" or something...

And a cleaner is called a "floor technician".

Nothing really out of the ordinary for a service to be called degraded while "hey, the cache might still be working right?" ... or "Well you know, it works every other day except today, so it's just degradation" :-)


If your statement is true, then why is the AWS status page widely considered useless, and everyone congregates on HN and/or Twitter to actually know what's broken on AWS during an outage?


> Yes, VP approval is needed to make any updates on the status dashboard. But that's not as hard as it may seem. AWS executives are extremely operation-obsessed, and when there is an outage of any size are engaged with their service teams immediately.

My experience generally aligns with amzn-throw, but this right here is why. There's a manual step here and there's always drama surrounding it. The process to update the status page is fully automated on both sides of this step; if you removed VP approval, the page would update immediately. So if the page doesn't update, it is always a VP dragging their feet. Even worse is that lags in this step were never discussed in the postmortem reviews that I was a part of.


It's intentional plausible deniability. By creating the manual step you can shift blame away. It's just like the concept of personal health dashboards, which are designed to preserve an asymmetry in reliability information between the host and the client, limiting each customer to their own personal anecdata. On top of all of this, the metrics are pretty arbitrary.

Let's not pretend businesses haven't been intentionally advertising in deceitful ways for decades, if not centuries. This just happens to be the current strategy in tech of lying to and deceiving customers to limit liability, responsibility, and recourse.

To be fair, it's not just Amazon; they just happen to be the largest and most targeted whipping boy on the block. Few businesses will admit to liability under any circumstances. Liability has to always be assessed externally.


Here's one business honest with their status page: https://news.ycombinator.com/item?id=29478679


I have in the past directed users here on HN who were complaining about https://status.aws.amazon.com to the Personal Health Dashboard at https://phd.aws.amazon.com/ as well. Unfortunately, even though the account I was logged into this time only has a single S3 bucket in the EU, billed through the EU and with zero direct dependencies on the US, the personal health dashboard was ALSO throwing "The request processing has failed because of an unknown error" messages. Whatever the problem was this time, it had global effects for the majority of users of the Console, and the internet noticed for over 30 minutes before either the status page or the PHD reported it. There will be no explanation, and the official status page logs will say there were "increased API failure rates" for an hour.

Now I guess it's possible that the thousands and thousands of us who noticed and commented are some tiny fraction of the user base, but if that's so, you could at least publish a follow-up like other vendors do that says something like "0.00001% of API requests failed, affecting an estimated 0.001% of our users at the time."


I haven't asked AWS employees specifically about blameless postmortems, but several of them have personally corroborated that the culture tends towards being adversarial and "performance focused." That's a tough environment for blameless debugging and postmortems. Like if I heard that someone has a rainforest tree frog living happily in their outdoor Arizona cactus garden, I have doubts.


When I was at Google I didn't have a lot of exposure to the public infra side. However, I do remember back in 2008 when a colleague was working on the routing side of YouTube: he made a change that cost millions of dollars in mere hours before noticing and reverting it. He mentioned this to the larger team during a tech talk, and they applauded. I cannot possibly generalize the culture differences between Amazon and Google, but at least in that one moment, the Google culture seemed to support that errors happen, get noticed, and get fixed without harming the perceived performance of those responsible.


While I support that, how are the people involved evaluated?


I was not informed of his performance reviews. However, given the reception, his work in general, and the attitudes of the team, I cannot imagine this even came up. More likely, the ability to improve routing and actually make YouTube cheaper in the end was, I'm sure, the ultimate positive result.

This was also towards the end of the golden age of Google, when the percentage of top talent was a lot higher.


So on what basis is someone's performance reviewed, if such performance is omitted?


The entire point of blameless postmortems is acknowledging that the mere existence of an outage does not inherently reflect on the performance of the people involved. This allows you to instead focus on building resilient systems that avoid the possibility of accidental outages in the first place.


I know. That's not what I'm asking about, if you might read my question.


I'll play devil's advocate here and say that sometimes these incidents deserve praise because they uncovered an issue that was otherwise unknown previously. Also if the incident had a large negative impact then it shows to leadership how critical normal operation of that service is. Even if you were the cause of the issue, the fact that you fixed it and kept the critical service operating the rest of the time, is worth something good.


I know; that's not what I'm asking about. I'm talking about a different issue.


Mistakes happen, and a culture that insists too hard that "mistakes shouldn't happen, and so we can't be seen making mistakes" is harmful toward engineering.

How should their performance be evaluated, if not by the rote number of mistakes that can be pinned onto the person, and their combined impact? (Was that the question?)


If an engineer causes an outage by mistake and then ensures it can never happen again, he has made a positive impact.


I understand that, but eventually they need to evaluate performance, for promotions, demotions, raises, cuts, hiring, firing, etc. How is that done?


It’s standard. The career ladder [1] sets expectations for each level. Performance is measured against those expectations. Outages don't negatively impact a single engineer.

The key difference is the perspective. If reliability is bad that’s an organizational problem and blaming or punishing one engineer won’t fix that.

[1] An example ladder from Patreon: https://levels.patreon.com/


> The key difference

The key difference between what and what?


Your approach and their approach. It sounded like you have a different perspective about who is responsible for an outage.


We knew us-east-1 was unusable for our customers for 45 minutes before amazon acknowledged anything was wrong _at all_. We made decisions _in the dark_ to serve our customers, because amazon dragged their feet communicating with us. Our customers were notified after 2 minutes.

It's not acceptable.


Can’t comment on most of your post, but I know a lot of Amazon engineers who think of the CoE process (Correction of Error, what other companies would call a postmortem) as punitive.


They aren't meant to be, but shitty teams are shitty. You can also create a COE and assign it to another team. When I was at AWS, I had a few COEs assigned to me by disgruntled teams just trying to make me suffer and I told them to pound sand. For my own team, I wrote COEs quite often and found it to be a really great process for surfacing systemic issues with our management chain and making real improvements, but it needs to be used correctly.


At some point the number of people who were on shitty teams becomes an indictment on the wider culture at Amazon.


Absolutely! Anecdotally, out of all the teams I interacted with in seven years at AWS across multiple arms of the company, I saw only a handful of bad teams. But like online reviews, the unhappy people are typically the loudest. I'm happy they are though, it's always important to push to be better, but I don't believe that AWS is the hellish place to work that HN discourse would lead many to believe.


I don't know any, and I have written or reviewed about 20.


Even in a medium decent culture, with a sample of 20? You know at least one, you just don't know it.


Because OTHERWISE people might think AMAZON is a DYSFUNCTIONAL company that is beginning to CRATER under its HORRIBLE work culture and constant H/FIRE cycle.

See, AWS is basically turning into a long standing utility that needs to be reliable.

Hey, do most institutions like that completely turn over their staff every three years? Yeah, no.

Great for building it out and grabbing market share.

Maybe not for being the basis of a reliable substrate of the modern internet.

There are dozens of bespoke systems that keep AWS afloat (disclosure: I have friends who worked there, and there are, and also Conway's law). But if the people who wrote them are three generations of HIRE/FIRE ago....

Not good.


> Maybe not for being the basis of a reliable substrate of the modern internet.

Maybe THEY will go to a COMPETITOR and THINGS MOVE ON if it's THAT BAD. I wasn't sure what the pattern for all caps was, so just giving it a shot there. Apologies if it's incorrect.


I was mocking the parent, who was doing that. Yes it's awful. Effective? Sigh, yes. But awful.


I'm an ex-Amazon employee and approve of this response.

It reflects exactly my experience there.

Blameless post-mortem, stick to the facts and how the situation could be avoided/reduced/shortened/handled better for next time.

In fact, one of the guidelines for writing a COE (Correction of Error, Amazon's jargon for a postmortem) is that you never mention names but instead use job functions and, if necessary, the teams involved:

1. Personal names don't mean anything except to the people who were there on the incident at the time. Someone reading the CoE on the other side of the world or 6 months from now won't understand who did what and why.

2. It stands in the way of honest accountability.


>* Every AWS customer has a PERSONAL health dashboard in the console that should indicate their experience.

You mean the one that is down right now?


Seems like it's doing an exemplary job of indicating their experience, then.


What?!

Everybody is very slow to update their outage pages because of SLAs. It's in a company's financial interest to deny outages and, when they are undeniable, to make them appear as short as possible. Status pages updating slowly is definitely by design.

There's no large dev platform I've used for which this wasn't true of its status page.


> ...you own your dependencies. You don't blame others.

Agreed, teams should invest resources in architecting their systems in a way that can withstand broken dependencies. How do AWS teams account for "core" dependencies (e.g. auth) that may not have alternatives?


This is the irony of building a "reliable" system across multiple AZs.


> * Depending on the service one customer's experience is not the broad experience. Someone might be having a really bad day but 99.9% of the region is operating successfully, so there is no reason to update the overall status dashboard.

https://rachelbythebay.com/w/2019/07/15/giant/


Oh, yes. Let me go look at the PERSONAL health dashboard and... oh, I need to sign into the console to view it... hmm


> you own your dependencies. You don't blame others.

I love that. Build your service to be robust. Never assume that dependencies are 100% reliable. Gracefully handle failures. Don't just go hard down, or worse, fail horribly in a way that you can't recover from automatically when your dependencies come back. I've seen a single database outage cause cascading failures across a whole site even though most services had no direct connection to the database (and recovery had to be done in order of dependency, or else you're playing whack-a-mole for an hour).
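
Something like this minimal sketch is what I mean by owning a dependency (Python; the retry/backoff numbers and the cached fallback are made up for illustration):

    import random
    import time

    # Illustrative only: call a flaky dependency with bounded retries and a
    # fallback, so a dependency outage degrades this service instead of
    # taking it hard down.
    def call_with_fallback(fetch, fallback, attempts=3, base_delay=0.2):
        for attempt in range(attempts):
            try:
                return fetch()
            except Exception:
                if attempt == attempts - 1:
                    break
                # Exponential backoff with jitter so retries don't stampede
                # the dependency the moment it comes back.
                time.sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
        return fallback()  # serve stale/cached/default data rather than erroring out

    # Hypothetical usage: fall back to a cached profile while the dependency is down.
    def fetch_profile():
        raise TimeoutError("dependency unavailable")  # simulate the outage

    print(call_with_fallback(fetch_profile, lambda: {"name": "cached", "stale": True}))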

> VP approval is needed to make updates on the status board.

Isn't that normal? Updating the status has a cost (reparations to customers if you breach SLA). You don't want some on-call engineer stressing over the status page while trying to recover stuff.


Come on, we all know managers don’t want to claim an outage till the last minute.


"Yes, VP approval is needed to make any updates on the status dashboard."

If services are clearly down, why is this needed? I can understand the oversight required at a company like Amazon, but this sounds strange to me. If services are clearly down, I want that damn status update right away as a customer.


Because "services down" also means SLA credits.


Hiding behind a throw away account does not help your point.


The person is unlikely to have been authorized as a spokesman for AWS. In many workplaces, doing that is grounds for disciplinary action. Hence, throwaway.


Well, when you talk about blameless post mortems and how they are valued at the company... A throw-away does make me doubt that the culture supports being blameless :)


Well, I understand that, but if you look at his account history it is only pro-Amazon comments. It feels like propaganda more than information, and all I am saying is that the throwaway does not add credibility or a feeling that his opinions are genuine.


[flagged]


This post broke the site guidelines badly. If you'd please review https://news.ycombinator.com/newsguidelines.html and stick to the rules in the future, we'd be grateful.

Edit: also, could you please stop posting unsubstantive comments generally? You've unfortunately been doing that repeatedly, and we're trying for something else here.



