AWS's us-east-1 region is experiencing issues (amazon.com)
156 points by zaltekk on March 9, 2022 | 160 comments



I can't help but wonder, with the increases in attrition across the industry, are we hitting some kind of tipping point where the institutional knowledge in these massive tech corporations is disappearing?

Mistakes happen all the time but when all the people who intimately know how these systems work leave for other opportunities, disasters are bound to happen more and more.


Disc: Googler; opinions are my own.

I've been an SRE on a tier 1 GCP product for over three years, and this is not the case. In my experience, our systems have only gotten more reliable and easier to understand since I joined.

It's not like there are only a few key knowledge holders who single-handedly prevent outages from happening. In reality, you don't need to know shit about how a system works internally to prevent outages if things are set up correctly.

In theory, my dog should be able to sit on my laptop and submit a prod-breaking change without any fear that it will make it to prod and damage the system, because automated testing/canary should catch it and, if it does make it to prod, we should be able to detect and mitigate it before the change affects more users, using probes or whitebox monitoring.

This happens for 99.9% of potential issues and is completely invisible to users. However, it's what's not caught (the remaining 0.1%) that actually matters.
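A hedged sketch of the kind of canary gate being described above - Python, with made-up names and thresholds rather than GCP's actual tooling:

```python
# Illustrative only: a promotion gate that blocks a canaried change when its
# error rate regresses noticeably past the established prod baseline.
def canary_healthy(canary_errors: int, canary_requests: int,
                   baseline_error_rate: float,
                   max_regression: float = 0.001) -> bool:
    """True if the canary's error rate is within tolerance of the baseline."""
    if canary_requests == 0:
        return False  # no traffic observed yet; don't promote on zero evidence
    return canary_errors / canary_requests <= baseline_error_rate + max_regression

# e.g. a change erroring at 0.42% against a 0.05% baseline would be held back:
if not canary_healthy(canary_errors=42, canary_requests=10_000,
                      baseline_error_rate=0.0005):
    print("rolling back: canary regression detected")
```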


I’ll be honest. My external impression of Amazon and Google could not be more distant in this regard.

Google doesn’t have nearly as hard a time retaining good people as Amazon does.


From within AWS, it just feels like we push people too hard. Service teams are too small relative to their goals, sales teams have unrealistic growth targets and also double as support for needy/incompetent customers, and professional services have billable hour requirements which are as high as any major consultancy and with additional pre-sales support expectations.

Strict adherence to the "hiring bar" means we fail to bring in good people who aren't desperate enough to act out the cultish LP dance during their interview. Hiring new grads seems to be the only area where growth is not stalling - but that can obviously only help so much.

My team is hiring for 2-3 people and we are being buried alive without that growth happening sooner - but I can't in good conscience recommend this place to anyone I respect or like.



>"Strict adherence to the "hiring bar" means we fail to bring in good people who aren't desperate enough to act out the cultish LP dance during their interview."

What is the "cultish LP dance" here that is weeding good people out?

>"My team is hiring for 2-3 people and we are being buried alive without that growth happening sooner - but I can't in good conscience recommend this place to anyone I respect or like."

I appreciate your candor. Are you in a dev role or are you on the SRE side? Is your description true across pretty much all teams/services then?


> What is the "cultish LP dance" here that is weeding good people out?

The "culture fit" interview process focuses on leadership principles, so lots of questions like " tell me about a time when you went above and beyond for a customer". Being yourself will get you nowhere, you need to research the questions and the script that is expected of you.

> What service does your team work on?

I'm a partner-focused SA, so not a developer and not aligned to a particular service.


I interviewed with AWS about a year ago. Knocked the programming/system design interviews out of the park, but it was clear the LP interviewer simply didn’t believe I was being truthful about the example I gave (included a period where my team had no direct manager, which is abnormal). He also had a programming question for me but we didn’t have time for him to explain it.

No offer, recruiter emphasized that they were halving the “cool off” period for me (so I could interview again soon), and maybe they do this for everyone, but it’s clear there was one interview making the difference here. Interesting that this is apparently a common problem.


When I interviewed, the instructions I was given by the recruiter were very explicit. For the Leadership Principles, I did my hardest to come up with at least two items for each LP, and of course they had to be in STAR format -- Situation, Task, Action, Result. To that, I added a URL to back up my story for all of the cases where I could.

So, when I used the example of designing and starting to build the replacement e-mail system for ASML for literally zero additional cost [0], I pointed them at the URL for the Invited Talk that I gave based on that work. And when I used the example of when I broke all e-mail across the entire Internet, I pointed them at the URL for the article that was published in The Register. When I talked about what I had learned about Chef and DevOps, I pointed them at the invited talk I gave in Edinburgh entitled "From zero to cloud in 90 days" and the accompanying tutorial I taught called "Just enough Ruby for Chef". And so on.

I really feel that having the URLs to backstop my stories helped me sail through that part of the interview.

In my case, I wasn't being hired as an SDE, so there weren't many programming tests they wanted me to take -- one of their senior developers did connect me to a shared coding platform, which we really just used as a shared whiteboard. He asked me some questions on how I would solve some problems, and I used my 30+ years of experience with Bourne shell and bash to show him stupid first order solutions to those problems and then we talked about what some of the second and third order solutions might be.

The longer I work at AWS, the more convinced I am that everything depends on the people you're working with. There are good teams and bad teams. There are good managers and bad managers. And if you can find a good team with good managers, then you're golden.

In this respect, I don't think AWS is materially different from any other employer I've known.

[0] They had already bought all the hardware, including some stuff I scavenged from a closet where it had been sitting new-in-the-box for a couple of years; the OS was covered under their site license; all the application software was open source; and my time was free because I was already there under another contract.


Can I ask what role you were interviewing for where you weren't asked to solve Leetcode style problems on a whiteboard? I thought that was standard for AWS. You mentioned it wasn't SDE but I'm assuming it's still a DevOps/Technical role you have based on your STAR examples.


Who would interview with a company again after this nonsense though? I would hope candidates view of said "cool off" period is that it is a permanent one. The arrogance of these recruiters is astonishing.


It was unpleasant but at the end of the day it was just one guy.

What gets me is that every recruiter who reaches out (hundreds by now I estimate) wants me to complete their timed coding test to qualify to interview again.

I found Google’s interview process comparatively much more respectful (although more demanding), and have been happy working there instead.


Do you mean AWS recruiters are asking you to take timed coding tests? Or in general? This is a thing for people with experience? What value or skills do recruiters actually bring? It's a wonder anyone gets hired at all with these jokers as gate keepers. Good on you for getting a better gig. Congrats.


Yes it’s a standard part of the interview cycle at Amazon. Usually two questions in an hour or something, and you code solutions which the test platform grades. They’re on par with leetcode easy IME, but I just find it to be a pain so I won’t do it again.

I’m sure it’s helpful to weed out applicants who actually cannot code, but what’s the point in doing it twice?

Thank you!


As some dude from Baltimore who has just picked up a second gig there, these seemed to me like normal questions asked of you in most interviews. In my daytime position at the first gig I have interviewed technicians and have asked similar questions of them. It's not about culture fit. It's about finding their answer to basic customer service questions. It would make sense that a customer obsessed company would be focused on a set of principles they have found success with.

Speaking of the research, the recruiters email you the principles and specifically mention that you should review them and consider them during the interview process. They even send you a document about the STAR method of interviewing to help you have a smoother interview. To me, as a guy from Baltimore who doesn't know half of what the people here do, I don't think the interview process could have been smoother.


They are normal-ish questions and I can see where the STAR guidance is better than getting no help at all - I just think the process is too rigid overall. I've interviewed a handful of people for our team in the last year - all had the right background and technical expertise, but since they didn't have an example from their career which matches up with the LP questions they were asked, they were knocked back. All of them would have been "bar raising" from a technical standpoint, but since they were not demonstrably already "Amazonian" in their mindset, we couldn't hire them.


Wow, that just seems completely performative.

Is "SA" systems admin here?


Solutions Architect: a jack of all trades who is a sort of customer consultant and creator of solutions architecture (duh) for customers.


Customers and partners, the latter basically taking our ideas and selling them as a product - the flow of IP towards Amazon isn't always as clear-cut as people believe :-)

Broadly though it is a pre-sales role to help people get started, followed by ongoing guidance as the customer iterates (this is the part which often turns into free support).


There is nothing cultish about the way Amazon interviews. If you are a good engineer with a relevant background who can speak English, you will have no problem passing these interviews. I'm not sure why you are framing it as cultish.


I interviewed with AWS about 7 months ago and got an offer. I had multiple recruiters emphasize the LP stuff. I prepared, and there’s no way I would have passed without that preparation.

In my experience, they also give the toughest programming questions. It is a lot of prep overall.


Native English speaker here. I was denied for unspecified reasons a couple years ago. At the time it was perplexing, as I had pretty good answers for all their questions.

Now I work at one of the slightly more sane FANMAG (to include $ms) companies.

Pretty sure I dodged a bullet, maybe the engineering manager spared me because he liked me more than I realized.


I would be curious to know which one is considered more sane these days. I feel like I've heard enough negative things about the culture at all of them at this point.


Presently, Apple, Microsoft, Oracle can all be okay as an employee IME. Depends entirely on the division and team you're joining.

Ones I'd never consider, even for a cushy VP role- FB, Goggle.

Ymmv.


That does not match up with my own experience at all.

Maybe the service teams are less heavy on LPs due to not being customer-facing?


What doesn't match up?


> From within AWS, it just feels like we push people too hard.

Sounds like regular skeleton-crew enterprise IT.


"We'll fix it when it breaks."


I don't work anywhere as big as Amazon/AWS but it is still a "big tech engineering firm" and we're experiencing all the same problems.


Just like the tech priests in Warhammer 40k, keeping occult old engineering running that no one could build anymore.


If we want to normalize letting long term support people call themselves tech priests I'd very much appreciate it.

"What were your duties at your last position?" "Performing the daily ministrations and singing the praise of the machine god."


Walk down the rows of cubicles chanting "Nonne avertis et conare iterum?" (best I could translate "have you tried turning it off and on again" into Latin)


As an archivist/librarian-programmer, I'm totally here for calling myself a tech priestess.


Cold


Nothing makes me feel more like a wizard than learning some new command line skills. Just string together some short, seemingly indecipherable symbols, and... magic!


So today I find out my job title is tech priest. I was happy with necromancer before. Does it come with a pay rise?


Not at all, but a status increase for sure.


Not familiar with 40k. Was it a similar idea to nuclear-power-as-religion from Foundation?


Not far off. The “golden age” of humanity was shattered long ago, with the mortal wounding of the god emperor, and knowledge of most of the greatest technology was lost. Millennia later, a cult has grown up that both worships and maintains technology as having machine spirits, which are somehow linked to the machine god itself. That god may or may not be the same or related to the god emperor of mankind, depending on the interpretation.

Honestly the lore of w40k is quite fun to read, if you’re into dystopian and fantasy sci-fi.


FWIW hardline tech priests view the machine god as separate from the emperor. Hardline imperium officials view the emperor as separate from the machine god. The official party line is that the emperor and the machine god are the same, with the emperor perhaps being an avatar.

It seems like both sides are fine to have them be reconciled, but it's an important narrative gadget that can be used to get humanity to fight itself in-universe.

Also interesting is that humanity's "lost" technological progress seems to eclipse some of the other races in the W40K narrative, with even the Tau (space dwarves with robots) and Eldar (space elves with crystals) freaking out when humanity brings giant robots, because the sheer physical impracticality of a gigantic human shaped robot is noted, with nobody aware of how they continue to work.


> gigantic human shaped robot is noted, with nobody aware of how they continue to work

BattleTech had something similar: it was considered cheaper to keep replacing humans than to replace the mechs, because hardly anyone knew how to repair or build mechs.


Or how about Anathem, with the Ita class doing computer things and nuclear materials cared for by a select group?


That was probably one of Stephenson’s most insightful moments of science fiction.

Scientists only get to talk to the public every 100 years or something, wasn’t it?

IT was never allowed to talk to scientists.

Seemed like a modernist idea, even at the time of publishing.


Outside of the mega-fang industry, I’m wondering the same thing.

The Great Resignation had to have taken a huge toll on regular enterprises. There are probably going to be some unlucky (or lucky, depending on how hardcore they are) people in the position of maintaining aging legacy systems and retrofitting them into the future.

COBOL, for example, is becoming a lucrative language for people in the financial and insurance industries. Legacy Java is all over the place, I’m sure. Legacy .NET is in the middle of a huge industry retrofit, (.NET 5 was the official post-legacy rebrand and they’re on to .NET 6+ now).


>The Great Resignation had to have taken a huge toll on regular enterprises.

The Great Resignation was people leaving jobs they didn't want (front of house/service industry/gig) for jobs they did want (career-track jobs) once those jobs resumed hiring after the pandemic settled down.

Labor force participation went up not down due to 'the great resignation.'


We were discussing attrition, not labor force participation.


You're right, but that's been true since the beginning of the tech boom (and isn't exclusive to tech): hardly anyone works at one place for several decades anymore. Companies weather this in different ways, but attrition has always been around.

What's causing people to believe that the latest round of attrition is any different?


I'd speculate that perhaps more senior people are moving, and/or a greater overall rate of attrition combined with much more complex technologies and organisations. In other words, it might be harder to become good at jobs now, and fewer people stick with them. Just a hunch but definitely seems to be where the incentives point with loyalty penalties and tech bloat.


In my experience, education certainly seems to not have kept up with computing, at least in terms of having massive academic-industry partnerships like a doctor's residency or a trade apprenticeship.

I’d definitely agree that it is probably harder to become good in older organizations - the technologies are probably generations behind the current state of the art and the learning curves are high for those older technologies.

Just thinking through keyboard, but it’s probably reaching the point where enterprises need to evaluate acqui-hiring or outsourcing entire development departments, given the attrition and the incentives for regular employees to leave.


Promoting high employee turnover could actually be a very effective strategy for a company's long term sustainability.

If your company is hostile to people sticking around for decades, then it makes it that much less likely that you end up stuck with a machine that relies heavily on poorly documented tribal knowledge that's likely to start falling apart as your core people start cashing out.


Similarly, it makes it much easier to make the business case for switching to outsourcing and insourcing business models. That makes it much less likely that you have to worry about losing money to people who “work from home”.


> What's causing people to believe that the latest round of attrition is any different?

The Great Recession

The Great Resignation

The Great Dying (due to COVID-19)


> The Great Dying (due to COVID-19)

Repeating this wise comment: https://news.ycombinator.com/item?id=23769427

The COVID death counts are hopelessly over-counted. This is why there's a cottage industry of people pointing out things like "COVID deaths" which mysteriously also suffered from being murdered, or drug overdoses, or undiagnosed leukaemia.

Then you get into the problem of care homes being authorised to report COVID deaths without any testing or formally trained opinion at all.


It's shocking to see such a statement so far into the pandemic. This is solved and known already, and while complicated, we've figured it out for some time. We can easily see the massive amount of deaths when we look at excess death numbers. Covid deaths are, if anything, undercounted. To believe anything else at this point is to bury your head in the sand and avoid all scientific evidence and medical consensus.

https://www.medicalnewstoday.com/articles/how-are-covid-19-d...


I won't argue this topic since this thread is about AWS, but the article that you quote says "Experts calculate the excess death rate by comparing figures for a given period with the average for that same period over several previous years." That's not how experts calculate excess deaths, since that algorithm produces totally bogus answers. Here is how real experts compute excess deaths, compensating for seasonality and population changes: https://euromomo.eu/how-it-works/methods/


Excess deaths is a perfectly valid statistic to base policy off... if policy makers maintained a hands-off approach and didn't radically change society through out the pandemic.

Can you in good conscience say that the typical rate of death remained steady while the following happened:

* a majority of given populations remained at home (lock downs - meaning no travelling to work in multi-tonne death traps)

* practised increased hygiene protocols (masks, more frequent hand washing)

* did not visit elderly homes (at least, less than usual)

* many people were reluctant to get timely health care (due to fear of catching COVID from medical facilities)

* ate worse and exercised less

* infected elderly patients were sent back to their nursing homes (to typically infect the entire facility)

There are so many confounding factors that on the face of it, should result in a radically different death profile... and almost every country faced the above to different extents.

Anyone claiming to be able to work out the actual number of people who died from COVID-19 from excess deaths is being disingenuous at best.


> Anyone claiming to be able to work out the actual number of people who died from COVID-19 from excess deaths is being disingenous at best.

I think you have that backwards: disingenuous at worst, a scientist at best.


How about undercounting?

There’s very likely to be severe undercounting of COVID-19 due to the same reasons that crimes of victimization are underreported: shame.

Not to mention the swaths of homeless and disabled people that probably didn’t get counted.


Not to mention all the cases detected with home antigen tests and never reported to a lab.


Yes, absolutely. Within my own org of ~50 people, 15% have resigned/contracts ending during Q1 (after 15% in Q4). Of the remaining 85%, 20% have been around since before COVID and 65% joined during COVID.

Of our senior engineers & team leads, 70% have joined in last 6-9 months.

Only 3 full time senior engineers with 2 years or more tenure.

We've grown during COVID but we've also just burned through people.

Turnover has hit the point where we stopped doing going-away Zoom toasts... people just sort of disappeared.


Newest member of my team has been in the company for 6 years and on my team for 4. I was in the local pub the other day and there was a retirement do for someone who had been here for 35 years, which certainly isn't exceptional (40 years is more noteworthy, and I've known a couple of people who made it to 50 years)


Similar story in smaller orgs, from what I’ve heard - people don’t even bother with the going-away stuff any more.


It seems like that is more of an artifact of the pandemic and everyone working remote. Are you going to hold a farewell zoom party? I don't think so.


> Are you going to hold a farewell zoom party?

I would, and have. Granted not with an entire org, but it’s still good form to take a few moments to say goodbye.


Yup, it's not that hard to be decent.

If we can have meeting after meeting for working groups, agile kayfabe, status reports, etc for hours recurring weekly.. we can spend 15min on the phone saying thanks, good luck, and see you again a handful of times per year when a teammate leaves.


Decent people who respect each other as former/possibly future teammates can & do.


I think this is a transient issue. When you're in growth mode you make a huge series of hacks to just keep things running and then when you leave.... well, it's a problem. But if the business is robust, and lives beyond you, what replaces your work is better documented, better tested, and maintainable.

That's the dream. Obviously there are companies that sink between v1 and v2, but that's life.

Fundamentally I think the cloud business is robust, it's a fundamentally reasonable way of organising things (for enough people), which is why it attracts customers despite being arguably more expensive.

I've been in this situation at much smaller scales, and yes, you'll see a massive drop in productivity, but that's the cost of going from prototype to product.


if us-east-1 is "in growth mode", what that we rely on can we possibly expect to ever reach maturity?


No, what I'm saying is that AWS is transitioning out of early-stage growth, and so they're seeing all the issues you see where the original people who hacked stuff together are moving on and you have to start really focusing on making things stable - but the beginning of that process is always everything getting unreliable.


Yep. They literally need to start doubling pay to retain people. The attrition this year is going to be devastating.


That's the problem we're out to solve with robusta.dev.

We're slowly but surely converting the world's institutional technical knowledge into re-usable and automated runbooks.


I’m just going to have to spend all day fixing the runbooks as well as the technology ;)


I could make the same argument for why dockerfiles are bad and you should install infrastructure manually instead of automating it!


Resolved in 7 mins. Can you do better?


My monitoring doesn't remember the last time we had a service outage lasting 7 seconds, let alone 7 minutes.


Nice work, keep it up. It’s surely not easy managing that many companies' computer systems.


No


That’s good to hear. How wide is your scope of experience?


Based on our telemetry, this started as NXDOMAINs for sqs.us-east-1.amazonaws.com beginning in modest volumes at 20:43 UTC and becoming a total outage at 20:48 UTC. Naturally, it was completely resolved by 20:57, 5 minutes before anything was posted in the "Personal Health Dashboard" in the AWS console.

It takes a while to find a Vice President, I guess.
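For what it's worth, a minimal sketch (Python, using the dnspython package) of the kind of external probe that surfaces this exact failure mode; the alerting logic is illustrative, not the telemetry described above:

```python
import dns.resolver

def endpoint_resolves(hostname: str) -> bool:
    """Return False on NXDOMAIN, the failure mode observed for SQS above."""
    try:
        dns.resolver.resolve(hostname, "A")
        return True
    except dns.resolver.NXDOMAIN:
        return False

if not endpoint_resolves("sqs.us-east-1.amazonaws.com"):
    print("alert: sqs.us-east-1.amazonaws.com is returning NXDOMAIN")
```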


Or perhaps triaging, root-causing, and fixing the issue is the highest-order bit?


Different people have different responsibilities. At Amazon scale, the comms and people doing a deep dive to fix stuff will not be the same.


I'd be totally fine just having alerts and metrics driving the status page. Why involve a human at all? They just get emotional.

(I have a data-driven status page for my personal website. If Oh Dear decides my website is down, the status page gets automatically updated. Obviously nobody is ever going to visit status.jrock.us if they are trying to read an article on my blog and it doesn't load, but hey at least I can say I did it.)
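A hedged sketch of the "metrics drive the status page" idea: a probe hits the real site and writes a static status document with no human in the loop. The URL, file path, and state names are placeholders, not Oh Dear's actual integration:

```python
import json
import time
import urllib.request

def probe(url: str, timeout: float = 5.0) -> bool:
    """One synthetic check; a real setup would probe from several locations."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except Exception:
        return False

status = {
    "service": "blog",
    "state": "operational" if probe("https://example.com/") else "outage",
    "checked_at": int(time.time()),
}

# The status page simply renders whatever this file says.
with open("status.json", "w") as f:
    json.dump(status, f)
```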


> Why involve a human at all?

To make a judgement call on whether the issue is severe enough to warrant the legal/financial risk of admitting your service is broken, potentially breaking customer SLAs.


Also prevents monitoring flukes and planned/transient-but-no-impact issues from showing up in dashboards.


If you're just going to lie, why have an SLA at all? It's like doing a clinical trial for a drug; a bunch of your patients die and you say "well they were going to die anyway it has nothing to do with the drug." If it's one person, maybe you can get away with that. When it's everyone in the experimental group, people start to wonder.

I have two arguments in favor of honest SLAs. One is, if customers expect that something is down, it can give them a piece of data with which to make their mitigation decisions. "A lot of our services are returning errors", check the auto-updated status page, "there may be an issue with network routes between AZs A and C". Now you know to drain out of those zones. If the status page says "there are no problems", now you spend hours debugging the issue, costing yourself far more money in your time than you spend on your cloud infrastructure in the first place. If having an SLA is the cause of that, it would be financially in your favor to not have the SLA at all. The SLA bounds your losses to what you pay for cloud resources, but your losses can actually be much higher; lost revenue, lost time troubleshooting someone else's problem, etc.

The second is, SLA violations are what justify reliability engineering efforts. If you lose $1,000,000 a year to SLA violations, and you hire an SRE for $500,000 a year to reduce SLA violations by 75%, then you just made a profit of $250,000. If you say "nah there were no outages", then you're flushing that $500,000 a year down the toilet and should fire anyone working on reliability. That is obviously not healthy; the financial aspect keeps you honest and accountable.

All of this gets very difficult when you are planning your own SLAs. If everyone is lying to you, you have no choice but to lie to your customers. You can multiply together all the 99.5% SLAs of the services you depend on and give your customers a guarantee of 95%, but if the 99.5% you're quoted is actually 89.4%, then you can't actually meet your guarantees. AWS can afford to lie to their customers (and Congress apparently) without consequences. But you, small startup, can't. Your customers are going to notice, and they were already taking a chance going with you instead of some big company. This is a hard cycle to get out of. People don't want to lie, but they become liars because the rest of the industry is lying.
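The composite-availability arithmetic above, made concrete (the count of ten dependencies is an assumption, chosen so the numbers line up with the 99.5% -> ~95% example):

```python
from math import prod

advertised = [0.995] * 10          # ten dependencies, each promising 99.5%
print(round(prod(advertised), 3))  # ~0.951 -> the ~95% you might quote customers

actual = [0.995] * 9 + [0.894]     # one dependency really running at 89.4%
print(round(prod(actual), 3))      # ~0.855 -> well below the 95% you promised
```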

Finally, I just want to say I don't even care about the financial aspect, really. The 5 figures spent on cloud expenses are nothing compared to the nights and weekends your team loses to debugging someone else's problem. They could have spent the time with their families or hobbies if the cloud provider just said "yup, it's all broken, we'll fix it by tomorrow". Instead, they stay up late into the night looking for the root cause, finding it 4 hours later, and still not being able to do anything except open a support ticket answered by someone who has to lie to preserve the SLA. They'll never get those hours back! And they turned out to be a complete waste of time.

It's a disaster, and I don't care if it's a hard problem. We, as an industry, shouldn't stand for it.


It's not just a hard problem, it's impossible to go from metrics to automatic announcements. Give me a detected scenario and I'll give you an explanation for seeing errors which are unrelated to the service health as seen by customers. From basic "our tests have bugs", to "reflected DDoS on internal systems causes rate limit on test endpoint", to "deprecated instance type removal caused a spike in failed creations", to "bgp issues cause test endpoints failures, but not customer visible ones", to...

You can't go from a metric to diagnosis for a customer - there's just no 1:1 mapping possible, with errors going both ways. AWS sucks with their status delays, but it's better than seeing their internal monitoring.


I don't agree. I'm perfectly happy seeing their internal metrics.

I remember one night a long time ago while working at Google, I was trying a new approach for loading some data involving an external system. To get the performance I needed, I was going to be making a lot of requests to that service, and I was a little concerned about overloading it. I ran my job and opened up that service's monitoring console, poked around, and determined that I was in fact overloading it. I sent them an email and they added some additional replicas for me.

Now I live in the world of the cloud where I pay $100 per X requests to some service, and they won't even tell me if the errors it's returning are their fault or my fault? Not acceptable.


Corey Quinn has a great blog post[0] on why status pages are hard, especially for large organizations with lots of products.

[0] https://www.lastweekinaws.com/blog/status-paging-you/


> I'd be totally fine just having alerts and metrics driving the status page. Why involve a human at all?

What happens if the monitoring service goes down or has a bug that causes it to incorrectly report the status as okay? Obviously this won't usually happen at the same time as the actual service goes down, but if it did, would people really believe that?


It definitely is. For an issue like this, you will see relevant teams and delegates looped in very quickly. Getting approved wording about an outage requires some very senior people though. Often they have to be paged in as well.

Having worked at a few other large tech companies now -- Amazon's incident response process is honestly great. It's one of the things I miss about working there.


This. We have a 4-person team and posted our own incident about this 7 minutes before Amazon did. Surely they can aim a little higher.


IME, this actually becomes more challenging as a company gets larger, not less (but that doesn't mean it can't be done).


Separate teams. We have a tiny team and even _we_ appoint a group to fix and a group or individual to do nothing but communicate.


sure, but if those people are updating the status pages to say something isn't right and we're looking into it, we're doomed.


The truth assuaging usually takes 15-30 minutes.


From temuze last time:

"If you're having SLA problems I feel bad for you son

I got two 9 problems cuz of us-east-1"


What’s up with all of the multi-platform outages lately? Seems abnormal looking at historical data. Are there issues affecting the internet backbone or something? Or just a coincidence?


Important to keep in mind that AWS has 250 services in 84 Availability Zones in 26 regions.

This outage is reportedly impacting 5 services in 1 region.

For those impacted, pretty terrible. But as a heavy user of AWS, I’ve seen these notices posted multiple times on HN and haven’t been impacted by one yet.


Counterpoint:

us-east-1 is their largest region; someone told me it's 50% of their revenue.

multi-region failover is awfully, awfully expensive.

In my last 7 years I imagine we had ~1 outage a month on average from AWS failures, but who knows if my imagination is accurate.


For businesses with uptime guarantees and lots of boxes to spin up in failover scenario, this has been a very eventful 12 months. At least that's what I'm experiencing.


Absolutely no way to prove this but maybe Q1 deadlines coming up and people trying to launch things and make changes?


Increase in attrition across the industry.

A lot of institutional knowledge in these massive tech corporations is disappearing and we're starting to reach the tipping point.



But there's always been attrition. What is different now that is affecting attrition rates and their effects?


Probably increased salary and switch to permanent remote. Amazon is notorious for their frugality and they recently doubled their maximum salary cap to 350k. They would only have done this to stay competitive in the current job market. This implies that many of their existing employees are underpaid relative to their peers at comparable companies and they've likely seen a large uptick in attrition. Not to mention attrition begets more attrition, especially if it's "influential" employees who are leaving.


> Amazon is notorious for their frugality and they recently doubled their maximum salary cap to 350k

Does that salary cap apply to regular enterprise developers working for HR, Accounting, etc., too?

If so, kudos to Amazon.


Why wouldn’t the cap apply to those devs as well?


I can’t imagine why it wouldn’t.

It’s just a little amazing to imagine that people doing the same work in different places of the country have such huge gaps in salary caps. I think the national average for a high-level software engineer is less than $150k per year.


Things are just bigger there. My colleagues and I thought we were hot shit managing 5-7k applications and infrastructure. Amazon probably runs 20,000 orgs like mine.

Also, times are good and rates are crazy. Even at VARs, you can make a lot of cash. I have a buddy who went from $150k to $600k. The guy paid off his mortgage and is at a point where he could burn out and work at Home Depot if he needed to.


A handful of large-traffic sites have recently, and relatively suddenly, started blocking traffic from a large region. That's a major change in flow.


Could you be more specific?


Russia / sanctions, I'm guessing.


> What’s up with all of the multi-platform outages lately? Seems abnormal looking at historical data.

Source?


Russian war is another juicy possibility


told myself I'd click this submission's comments link, CTRL+F `Russia`, & quit HN for the day if anything came up, thanks for not disappointing


Haha, no problemo.


Elevated risk of cyberattacks due to foreign meddling.

https://www.cisa.gov/shields-up


This is a pretty big claim to make. Do you have any sources that back it up?


Indeed I do.

It is public information within America that we are to be at “Shields Up” readiness.


This is why us-east-1 is perfect for a chaos-testing, non-prod environment.


Yeah, if you're still running only in us-east-1 at this point, you kind of asked for it...


Maybe the reason AWS keeps going down is because they run all their stuff on-prem...


I'm not sure if GCloud or Azure would help. I run two servers on Hetzner, which is way cheaper than Azure/GCloud; they would be better off there.


They might benefit from migrating to the Azure cloud. I’ve heard that some of the Windows servers actually run faster than some of the Linux servers on Azure.


Protip to anyone building new infrastructure in AWS: If you're gonna only use one region in the US, make it us-east-2 or us-west-2. us-east-1 is their oldest and biggest region but also the least stable in the US (ok technically us-west-1 is worse but you can't get that one anymore).
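A minimal sketch of following that advice with boto3 - pinning the region explicitly instead of inheriting whatever the environment (very often us-east-1) defaults to; the service here is just an example:

```python
import boto3

# Pin the region once at the session level so every client inherits it.
session = boto3.session.Session(region_name="us-east-2")
sqs = session.client("sqs")
print(sqs.meta.region_name)  # -> us-east-2
```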


Somehow AWS managed to make their new status page more opaque than the old one. It's like they want you to scroll through their gigantic list so they can fix the issue before you find the right line.


This is why you are strongly urged not to rely on one region or AZ.


Given the total amount of money I've lost due to a single AZ being down, it was totally worth it to NOT go multi-AZ or multi-region so far.

Multi AZ isn't that hard, but generally requires extra costs (one nat gw per az, etc...)

But multi region in AWS is a royal pain in the ass. Many services (like SSO) do not play well with multi region setups, making things really complicated even if you IaCed your whole stack.
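To make the one-NAT-gateway-per-AZ cost point above concrete, a hedged boto3 sketch; the subnet IDs are placeholders, and this is an illustration of where the per-AZ charges come from, not a drop-in provisioning script:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Hypothetical: one public subnet per AZ you run workloads in.
public_subnets_by_az = {
    "us-east-1a": "subnet-aaaa1111",
    "us-east-1b": "subnet-bbbb2222",
    "us-east-1c": "subnet-cccc3333",
}

for az, subnet_id in public_subnets_by_az.items():
    eip = ec2.allocate_address(Domain="vpc")  # each EIP is billed
    ngw = ec2.create_nat_gateway(             # each gateway is billed hourly + per GB
        SubnetId=subnet_id,
        AllocationId=eip["AllocationId"],
    )
    print(az, ngw["NatGateway"]["NatGatewayId"])
```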


Those costs are the actual reason you are encouraged to go multi-AZ!

(I actually love that we have strategies and infrastructure for multi-region... it just tends to come up at scales and for applications where it is not justified.)


Seems like it would be a conflict of interest to increase the robustness of a single AZ (so it never goes down or has its own redundancy) vs. the increased revenues from multi-AZ deployment.

What's the point of cloud if we have to manage robustness of their own infrastructure. I can understand if that's due to natural disasters and earthquakes, but the idea should be that a single AZ should never go down barring extraordinary circumstances. AWS should be auto-balancing, handling downtimes of a single AZ without the customer ever noticing it.

It might not be a good analogy, but if a single Cloudflare edge datacenter goes down, it will automatically route traffic through others. Transparent and painless to the customer. I understand AWS is huge, and different services have different redundancy mechanisms, but just conceptually it feels like they're in a conflict of interest to increase robustness of their data centers - "We told you to have multi-AZ deployment, not our fault".

Another way to put this is make sure as an AWS customer, to 3x multiply all costs + management of multi-AZ deployment into your total costs.


> What's the point of cloud if we have to manage robustness of their own infrastructure.

Worth deliberating on. I’m curious as to what the lifetime cost of ownership for an on-prem data center is relative to lifetime cost of operating in the cloud.


They would simply charge for the privilege. An EC2 'always on' or whatever option that enabled your instance to live migrate between availability zones would be a nice and expensive option.


Definitely. Then I wonder why we need the cloud :) if not for services (not EC2). Lot of mid-sized companies are re-evaluating: https://www.economist.com/business/2021/07/03/do-the-costs-o...


I would strongly urge not using us-east-1 -- of all the regions we're in, it's by far the most problematic. Use us-east-2 if you need good latency to the East Coast.


Not sure if it's still the case, but when I was there us-east-1 was a SPOF for some services world wide. I think if dynamodb went down in the region it was a big, big issue.


The only SPOF I know of for us-east-1 today is the control plane for Route53 - it's distributed and DNS queries will continue to work when us-east-1 is down (including health check based failover), but you can't make any DNS changes when us-east-1 is down.


Might be true for running stuff in different regions/AZs, but if the provisioning region is down (e.g. deploying Lambda@Edge), one does not really have an alternative.


Good advice, though AWS still has some services that don't work completely independently. Cloudfront, because of certificates. Route53. The control API for IAM (adding/removing roles, etc). And I wish they didn't have global-looking endpoints (like https://sts.amazonaws.com) that aren't really global or resilient.


STS will let you use regional endpoints now, right?


Yes. It's just that the "global endpoint" is misleading. They don't repoint it if it fails. It really shouldn't exist given that's how it functions.
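A minimal sketch of what pinning STS to a regional endpoint looks like in boto3 (region chosen arbitrarily), so the legacy global endpoint stays out of your critical path:

```python
import boto3

sts = boto3.client(
    "sts",
    region_name="us-east-2",
    endpoint_url="https://sts.us-east-2.amazonaws.com",
)
print(sts.get_caller_identity()["Account"])
```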


Multi AZ is great and should be by default, but multi Region is expensive.


This. We have multi AZ in more than one region and I occasionally dream of Bezos wearing only a top hat and waistcoat laughing maniacally while diving into a large vat of gold coins.


Not always possible - Australia (currently) only has one region, and if you're in a regulated industry (banking or government stuff) they require data to be in Australia.


Does AWS have a plan to improve this region?

Do they acknowledge the problem?

It's been a joke for years how bad us-east-1 is.


Nuke the entire site from orbit

It's the only way to be sure


Just in case PT's Stasi as a Service company ($PLTR) has a hard time parsing this, I want to make clear that this is a joke based on a quote from Aliens the movie. I am not endorsing anything violent.

It is a joke.

I would delete my parent comment if I could.


SQS went down for us in us-east-1 and we lost health checks on instances there. Fully recovered now.


In our case (Apify.com) there was a complete outage of SQS (15 mins+), most likely DNS problems, plus EC2 instances got restarted, probably as a result of the SQS outage.

EDIT: Also AWS Lambda seems to be down, and AWS EC2 APIs are having a very high error rate and slow machine startup times.


Yep, I saw empty responses for sqs.us-east-1.amazonaws.com for a while. Seems okay now though.


It's a meme by this point that us-east-1 is not 'the cloud'--it's a snowflake, a pet, etc.


My Aruba Instant On APs are "offline" (orange) even though they work and I am online. My first thought is that some cloud went into a nirvana state.


us-east-1 again!


The URA target needs to be bumped up to 25%. Churn and burn.


Noticed issues with SQS for a couple of minutes. Errors from the Java SDK: `com.amazonaws.SdkClientException: Unable to execute HTTP request: sqs.us-east-1.amazonaws.com`
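A hedged sketch of softening this on the client side with botocore's retry configuration (Python shown; the Java SDK has equivalent retry policies) - it papers over brief blips like this one, not a sustained outage:

```python
import boto3
from botocore.config import Config

# Adaptive retries back off and retry transient connection/DNS failures
# instead of surfacing them on the first attempt.
sqs = boto3.client(
    "sqs",
    region_name="us-east-1",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)
```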


I am no longer surprised and this is worrying.


Is us-east cursed or what?!


It is just the one that everyone uses...


My schadenfreude is strong whenever us-east-1 goes down


Seems fine here


As usual?



