I can't help but wonder, with the increases in attrition across the industry, are we hitting some kind of tipping point where the institutional knowledge in these massive tech corporations is disappearing?
Mistakes happen all the time but when all the people who intimately know how these systems work leave for other opportunities, disasters are bound to happen more and more.
I've been an SRE on a tier 1 GCP product for over three years, and this is not the case. In my experience, our systems have only gotten more reliable and easier to understand since I joined.
It's not like there are only a few key knowledge holders who single-handedly prevent outages from happening. In reality, you don't need to know shit about how a system works internally to prevent outages if things are set up correctly.
In theory, my dog should be able to sit on my laptop and submit a prod-breaking change without any fear that it will make it to prod and damage the system, because automated testing/canary should catch it and, if it does make it to prod, we should be able to detect and mitigate it before the change affects more users, using probes or whitebox monitoring.
This happens for 99.9% of potential issues and is completely invisible to users. However, it's what's not caught (the remaining 0.1%) that actually matters.
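To make the probe/whitebox idea concrete, here's a minimal blackbox-probe sketch in Python. None of it reflects any real GCP internals -- the URL, window size, threshold, and "alert" are all hypothetical placeholders:

```python
import time
import urllib.request
import urllib.error

# Hypothetical values -- real probes would come from monitoring config.
PROBE_URL = "https://example.com/healthz"
WINDOW = 20            # number of recent probes to keep
ERROR_THRESHOLD = 0.2  # page if more than 20% of recent probes fail

def probe_once(url: str, timeout: float = 5.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, TimeoutError):
        return False

def alert(error_rate: float) -> None:
    # Stand-in for real paging; just prints here.
    print(f"ALERT: probe error rate {error_rate:.0%} over last {WINDOW} checks")

results = []  # rolling window of True (ok) / False (failed)
while True:
    results.append(probe_once(PROBE_URL))
    results[:] = results[-WINDOW:]
    error_rate = 1 - sum(results) / len(results)
    if len(results) == WINDOW and error_rate > ERROR_THRESHOLD:
        alert(error_rate)
    time.sleep(30)  # probe every 30 seconds
```

The point is that the probe doesn't need to know anything about the system's internals to catch a bad change before it spreads.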
From within AWS, it just feels like we push people too hard. Service teams are too small relative to their goals, sales teams have unrealistic growth targets and also double as support for needy/incompetent customers, and professional services have billable hour requirements which are as high as any major consultancy and with additional pre-sales support expectations.
Strict adherence to the "hiring bar" means we fail to bring in good people who aren't desperate enough to act out the cultish LP dance during their interview. Hiring new grads seems to be the only area where growth is not stalling - but that can obviously only help so much.
My team is hiring for 2-3 people and we are being buried alive without that growth happening sooner - but I can't in good conscience recommend this place to anyone I respect or like.
>"Strict adherence to the "hiring bar" means we fail to bring in good people who aren't desperate enough to act out the cultish LP dance during their interview."
What is the "cultish LP dance" here that is weeding good people out?
>"My team is hiring for 2-3 people and we are being buried alive without that growth happening sooner - but I can't in good conscience recommend this place to anyone I respect or like."
I appreciate your candor. Are you in a dev role or are you on the SRE side? Is your description true across pretty much all teams/services then?
> What is the "cultish LP dance" here that is weeding good people out?
The "culture fit" interview process focuses on leadership principles, so lots of questions like " tell me about a time when you went above and beyond for a customer". Being yourself will get you nowhere, you need to research the questions and the script that is expected of you.
> What service does your team work on?
I'm a partner-focused SA, so not a developer and not aligned to a particular service.
I interviewed with AWS about a year ago. Knocked the programming/system design interviews out of the park, but it was clear the LP interviewer simply didn’t believe I was being truthful about the example I gave (included a period where my team had no direct manager, which is abnormal). He also had a programming question for me but we didn’t have time for him to explain it.
No offer, recruiter emphasized that they were halving the “cool off” period for me (so I could interview again soon), and maybe they do this for everyone, but it’s clear there was one interview making the difference here. Interesting that this is apparently a common problem.
When I interviewed, the instructions I was given by the recruiter were very explicit. For the Leadership Principles, I tried my hardest to come up with at least two items for each LP, and of course they had to be in STAR format -- Situation, Task, Action, Result. To that, I added a URL to back up my story for all of the cases where I could.
So, when I used the example of designing and starting to build the replacement e-mail system for ASML for literally zero additional cost [0], I pointed them at the URL for the Invited Talk that I gave based on that work. And when I used the example of when I broke all e-mail across the entire Internet, I pointed them at the URL for the article that was published in The Register. When I talked about what I had learned about Chef and DevOps, I pointed them at the invited talk I gave in Edinburgh entitled "From zero to cloud in 90 days" and the accompanying tutorial I taught called "Just enough Ruby for Chef". And so on.
I really feel that having the URLs to backstop my stories helped me sail through that part of the interview.
In my case, I wasn't being hired as an SDE, so there weren't many programming tests they wanted me to take -- one of their senior developers did connect me to a shared coding platform, which we really just used as a shared whiteboard. He asked me some questions on how I would solve some problems, and I used my 30+ years of experience with Bourne shell and bash to show him stupid first-order solutions to those problems, and then we talked about what some of the second- and third-order solutions might be.
The longer I work at AWS, the more convinced I am that everything depends on the people you're working with. There are good teams and bad teams. There are good managers and bad managers. And if you can find a good team with good managers, then you're golden.
In this respect, I don't think AWS is materially different from any other employer I've known.
[0] They had already bought all the hardware, including some stuff I scavenged from a closet where it had been sitting new-in-the-box for a couple of years; the OS was covered under their site license; all the application software was open source; and my time was free because I was already there under another contract.
Can I ask what role you were interviewing for where you weren't asked to solve Leetcode-style problems on a whiteboard? I thought that was standard for AWS. You mentioned it wasn't SDE, but I'm assuming it's still a DevOps/technical role you have based on your STAR examples.
Who would interview with a company again after this nonsense, though? I would hope candidates' view of said "cool off" period is that it is a permanent one. The arrogance of these recruiters is astonishing.
It was unpleasant but at the end of the day it was just one guy.
What gets me is that every recruiter who reaches out (hundreds by now I estimate) wants me to complete their timed coding test to qualify to interview again.
I found Google’s interview process comparatively much more respectful (although more demanding), and have been happy working there instead.
Do you mean AWS recruiters are asking you to take timed coding tests? Or in general? This is a thing for people with experience? What value or skills do recruiters actually bring? It's a wonder anyone gets hired at all with these jokers as gate keepers. Good on you for getting a better gig. Congrats.
Yes it’s a standard part of the interview cycle at Amazon. Usually two questions in an hour or something, and you code solutions which the test platform grades. They’re on par with leetcode easy IME, but I just find it to be a pain so I won’t do it again.
I’m sure it’s helpful to weed out applicants who actually cannot code, but what’s the point in doing it twice?
As some dude from Baltimore who has just picked up a second gig there, it seemed to me that these are normal questions asked of you in most interviews. In my daytime position at the first gig I have interviewed technicians and have asked similar questions of them. It's not about culture fit. It's about finding their answer to basic customer service questions. It would make sense that a customer-obsessed company would be focused on a set of principles they have found success with.
Speaking of the research, the recruiters email you the principles and specifically mention that you should review them and consider them during the interview process. They even send you a document about the STAR method of interviewing to help you have a smoother interview. To me, as a guy from Baltimore who doesn't know half of what the people here do, I don't think the interview process could have been smoother.
They are normal-ish questions and I can see where the STAR guidance is better than getting no help at all - I just think the process is too rigid overall. I've interviewed a handful of people for our team in the last year - all had the right background and technical expertise, but since they didn't have an example from their career which matches up with the LP questions they were asked, they were knocked back. All of them would have been "bar raising" from a technical standpoint, but since they were not demonstrably already "Amazonian" in their mindset, we couldn't hire them.
Customers and partners, the latter basically taking our ideas and selling them as a product - the flow of IP towards Amazon isn't always as clear-cut as people believe :-)
Broadly though it is a pre-sales role to help people get started, followed by ongoing guidance as the customer iterates (this is the part which often turns into free support).
There is nothing cultish about the way Amazon interviews. If you are a good engineer with a relevant background who can speak English, you will have no problem passing these interviews. I'm not sure why you are framing it as cultish.
I interviewed with AWS about 7 months ago and got an offer. I had multiple recruiters emphasize the LP stuff. I prepared, and there’s no way I would have passed without that preparation.
In my experience, they also give the toughest programming questions. It is a lot of prep overall.
Native English speaker here. I was denied for unspecified reasons a couple years ago. At the time it was perplexing, as I had pretty good answers for all their questions.
Now I work at one of the slightly more sane FANMAG (to include $ms) companies.
Pretty sure I dodged a bullet, maybe the engineering manager spared me because he liked me more than I realized.
I would be curious to know which one is considered more sane these days. I feel like I've heard enough negative things about the culture at all of them at this point.
Walk down the rows of cubicles chanting "Nonne avertis et conare iterum?" (best I could translate "have you tried turning it off and on again" into Latin)
Nothing makes me feel more like a wizard than learning some new command line skills. Just string together some short, seemingly indecipherable symbols, and... magic!
Not far off. The “golden age” of humanity was shattered long ago, with the mortal wounding of the god emperor, and knowledge of most of the greatest technology was lost. Millennia later, a cult has grown up that both worships and maintains technology as having machine spirits, which are somehow linked to the machine god itself. That god may or may not be the same or related to the god emperor of mankind, depending on the interpretation.
Honestly the lore of w40k is quite fun to read, if you’re into dystopian and fantasy sci-fi.
FWIW hardline tech priests view the machine god as separate from the emperor. Hardline imperium officials view the emperor as separate from the machine god. The official party line is that the emperor and the machine god are the same, with the emperor perhaps being an avatar.
It seems like both sides are fine to have them be reconciled, but it's an important narrative gadget that can be used to get humanity to fight itself in-universe.
Also interesting is that humanity's "lost" technological progress seems to eclipse some of the other races in the W40K narrative, with even the Tau (space dwarves with robots) and Eldar (space elves with crystals) freaking out when humanity brings giant robots, because the sheer physical impracticality of a gigantic human shaped robot is noted, with nobody aware of how they continue to work.
> gigantic human shaped robot is noted, with nobody aware of how they continue to work
BattleTech had something similar: it was considered cheaper to keep replacing humans than to replace the mechs, because hardly anyone knew how to repair or build mechs.
Outside of the mega-fang industry, I’m wondering the same thing.
The Great Resignation had to have taken a huge toll on regular enterprises. There are probably going to be some unlucky (or lucky, depending on how hardcore they are) people in the position of maintaining aging legacy systems and retrofitting them into the future.
COBOL, for example, is becoming a lucrative language for people in the financial and insurance industries. Legacy Java is all over the place, I’m sure. Legacy .NET is in the middle of a huge industry retrofit (.NET 5 was the official post-legacy rebrand and they’re on to .NET 6+ now).
>The Great Resignation had to have taken a huge toll on regular enterprises.
The Great Resignation was people leaving jobs they didn't want (front of house/service industry/gig) for jobs they did want (career-track jobs) after those jobs resumed hiring again after the pandemic settled down.
Labor force participation went up not down due to 'the great resignation.'
You're right, but that's been true since the beginning of the tech boom (but isn't exclusive to tech) when no one works for a place for several decades. Companies weather this in different ways but attrition has always been around.
What's causing people to believe that the latest round of attrition is any different?
I'd speculate that perhaps more senior people are moving, and/or a greater overall rate of attrition combined with much more complex technologies and organisations. In other words, it might be harder to become good at jobs now, and fewer people stick with them. Just a hunch but definitely seems to be where the incentives point with loyalty penalties and tech bloat.
In my experience, education certainly seems to not have kept up with computing, at least in terms of having massive academic-industry partnerships like a doctor's residency or a trade apprenticeship.
I’d definitely agree that it is probably harder to become good in older organizations - the technologies are probably generations behind the current state of the art and the learning curves are high for those older technologies.
Just thinking through the keyboard, but it’s probably reaching the point where enterprises need to evaluate acqui-hiring or outsourcing entire development departments, given the attrition and the incentives regular employees have to leave.
Promoting high employee turnover could actually be a very effective strategy for a company's long term sustainability.
If your company is hostile to people sticking around for decades, then it makes it that much less likely that you end up stuck with a machine that relies heavily on poorly documented tribal knowledge that's likely to start falling apart as your core people start cashing out.
Similarly, it makes it much easier to make the business case for switching to outsourcing and insourcing business models. That makes it much less likely that you have to worry about losing money to people who “work from home”.
The COVID death counts are hopelessly over-counted. This is why there's a cottage industry of people pointing out things like "COVID deaths" which mysteriously also suffered from being murdered, or drug overdoses, or undiagnosed leukaemia.
Then you get into the problem of care homes being authorised to report COVID deaths without any testing or formally trained opinion at all.
It's shocking to see such a statement so far into the pandemic. This is solved and known already, and while complicated, we've figured it out for some time. We can easily see the massive amount of deaths when we look at excess death numbers. Covid deaths are, if anything, undercounted. To believe anything else at this point is to bury your head in the sand and avoid all scientific evidence and medical consensus.
I won't argue this topic since this thread is about AWS, but the article that you quote says "Experts calculate the excess death rate by comparing figures for a given period with the average for that same period over several previous years."
That's not how experts calculate excess deaths, since that algorithm produces totally bogus answers. Here is how real experts compute excess deaths, compensating for seasonality and population changes: https://euromomo.eu/how-it-works/methods/
Excess deaths is a perfectly valid statistic to base policy off... if policy makers maintained a hands-off approach and didn't radically change society through out the pandemic.
Can you in good conscience say that the typical rate of death remained steady while the following happened:
* a majority of given populations remained at home (lock downs - meaning no travelling to work in multi-tonne death traps)
* practised increased hygiene protocols (masks, more frequent hand washing)
* did not visit elderly homes (at least, less than usual)
* many people were reluctant to get timely health care (due to fear of catching COVID from medical facilities)
* ate worse and exercised less
* infected elderly patients were sent back to their nursing homes (to typically infect the entire facility)
There are so many confounding factors that on the face of it, should result in a radically different death profile... and almost every country faced the above to different extents.
Anyone claiming to be able to work out the actual number of people who died from COVID-19 from excess deaths is being disingenuous at best.
Yes, absolutely.
Within my own org of ~50 people, 15% have resigned/contracts ending during Q1 (after 15% in Q4).
Of the remaining 85%.. 20% have been around since before COVID / 65% joined during COVID.
Of our senior engineers & team leads, 70% have joined in last 6-9 months.
Only 3 full time senior engineers with 2 years or more tenure.
We've grown during COVID but we've also just burned through people.
Turnover has hit the point where we stopped doing going away zoom toasts.. people just sort of disappeared.
Newest member of my team has been in the company for 6 years and on my team for 4. I was in the local pub the other day and there was a retirement do for someone who had been here for 35 years, which certainly isn't exceptional (40 years is more noteworthy, and I've known a couple of people who made it to 50 years).
If we can have meeting after meeting for working groups, agile kayfabe, status reports, etc for hours recurring weekly.. we can spend 15min on the phone saying thanks, good luck, and see you again a handful of times per year when a teammate leaves.
I think this is a transient issue. When you're in growth mode you make a huge series of hacks to just keep things running and then when you leave.... well, it's a problem. But if the business is robust, and lives beyond you, what replaces your work is better documented, better tested, and maintainable.
That's the dream. Obviously there are companies that sink between v1 and v2, but that's life.
Fundamentally I think the cloud business is robust, it's a fundamentally reasonable way of organising things (for enough people), which is why it attracts customers despite being arguably more expensive.
I've been in this situation at much smaller scales, and yes, you'll see a massive drop in productivity, but that's the cost of going from prototype to product.
No, what I'm saying is that AWS is transitioning out of early-stage growth, and so they're seeing all the issues you see when the original people who hacked stuff together move on and you have to start really focusing on making things stable, but the beginning of that process is always everything getting unreliable.
Based on our telemetry, this started as NXDOMAINs for sqs.us-east-1.amazonaws.com beginning in modest volumes at 20:43 UTC and becoming a total outage at 20:48 UTC. Naturally, it was completely resolved by 20:57, 5 minutes before anything was posted in the "Personal Health Dashboard" in the AWS console.
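For anyone who wants to catch this class of failure in their own telemetry, here's a rough sketch of the kind of resolution check that would flag those NXDOMAINs, using Python's standard resolver. The hostname is the one from above; the alerting is just a print placeholder:

```python
import socket
import time

ENDPOINT = "sqs.us-east-1.amazonaws.com"

def resolves(hostname: str) -> bool:
    """Return True if the hostname currently resolves via the local resolver."""
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        # NXDOMAIN and other resolution failures land here.
        return False

while True:
    if not resolves(ENDPOINT):
        # Stand-in for real alerting/telemetry.
        stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
        print(f"{stamp} resolution failure for {ENDPOINT}")
    time.sleep(60)
```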
It takes a while to find a Vice President, I guess.
I'd be totally fine just having alerts and metrics driving the status page. Why involve a human at all? They just get emotional.
(I have a data-driven status page for my personal website. If Oh Dear decides my website is down, the status page gets automatically updated. Obviously nobody is ever going to visit status.jrock.us if they are trying to read an article on my blog and it doesn't load, but hey at least I can say I did it.)
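A minimal version of that wiring looks something like this -- a sketch only, not how Oh Dear or my setup actually works; the probe URL and status file path are placeholders:

```python
import json
import time
import urllib.request
import urllib.error

PROBE_URL = "https://example.com/"            # site to check (placeholder)
STATUS_FILE = "/var/www/status/status.json"   # file the status page serves (placeholder)

def site_is_up(url: str) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status < 400
    except (urllib.error.URLError, TimeoutError):
        return False

while True:
    status = {
        "state": "operational" if site_is_up(PROBE_URL) else "outage",
        "checked_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    with open(STATUS_FILE, "w") as f:
        json.dump(status, f)  # no human in the loop, no emotions
    time.sleep(60)
```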
To make a judgement call on whether the issue is severe enough to warrant the legal/financial risk of admitting your service is broken, potentially breaking customer SLAs.
If you're just going to lie, why have an SLA at all? It's like doing a clinical trial for a drug; a bunch of your patients die and you say "well they were going to die anyway it has nothing to do with the drug." If it's one person, maybe you can get away with that. When it's everyone in the experimental group, people start to wonder.
I have two arguments in favor of honest SLAs. One is, if customers expect that something is down, it can give them a piece of data with which to make their mitigation decisions. "A lot of our services are returning errors", check the auto-updated status page, "there may be an issue with network routes between AZs A and C". Now you know to drain out of those zones. If the status page says "there are no problems", now you spend hours debugging the issue, costing yourself far more money in your time than you spend on your cloud infrastructure in the first place. If having an SLA is the cause of that, it would be financially in your favor to not have the SLA at all. The SLA bounds your losses to what you pay for cloud resources, but your losses can actually be much higher; lost revenue, lost time troubleshooting someone else's problem, etc.
The second is, SLA violations are what justify reliability engineering efforts. If you lose $1,000,000 a year to SLA violations, and you hire an SRE for $500,000 a year to reduce SLA violations by 75%, then you just made a profit of $250,000. If you say "nah there were no outages", then you're flushing that $500,000 a year down the toilet and should fire anyone working on reliability. That is obviously not healthy; the financial aspect keeps you honest and accountable.
All of this gets very difficult when you are planning your own SLAs. If everyone is lying to you, you have no choice but to lie to your customers. You can multiply together all the 99.5% SLAs of the services you depend on and give your customers a guarantee of 95%, but if the 99.5% you're quoted is actually 89.4%, then you can't actually meet your guarantees. AWS can afford to lie to their customers (and Congress apparently) without consequences. But you, small startup, can't. Your customers are going to notice, and they were already taking a chance going with you instead of some big company. This is hard cycle to get out of. People don't want to lie, but they become liars because the rest of the industry is lying.
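To spell out that multiplication (the ten-dependency count is just illustrative):

```python
# If each of n dependencies independently meets a 99.5% SLA, the best
# composite guarantee you can honestly offer is their product.
n = 10            # number of dependencies (illustrative)
quoted = 0.995

print(f"composite from quoted SLAs: {quoted ** n:.2%}")        # ~95.11%

# If even one of those "99.5%" dependencies actually runs at 89.4%,
# the composite falls below the 95% you promised your own customers.
actual = (quoted ** (n - 1)) * 0.894
print(f"composite with one dishonest SLA: {actual:.2%}")       # ~85.46%
```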
Finally, I just want to say I don't even care about the financial aspect, really. The 5 figures spent on cloud expenses are nothing compared to the nights and weekends your team loses to debugging someone else's problem. They could have spent the time with their families or hobbies if the cloud provider just said "yup, it's all broken, we'll fix it by tomorrow". Instead, they stay up late into the night looking for the root cause, finding it 4 hours later, and still not being able to do anything except open a support ticket answered by someone who has to lie to preserve the SLA. They'll never get those hours back! And they turned out to be a complete waste of time.
It's a disaster, and I don't care if it's a hard problem. We, as an industry, shouldn't stand for it.
It's not just a hard problem, it's impossible to go from metrics to automatic announcements. Give me a detected scenario and I'll give you an explanation for seeing errors which are unrelated to the service health as seen by customers. From basic "our tests have bugs", to "reflected DDoS on internal systems causes rate limit on test endpoint", to "deprecated instance type removal caused a spike in failed creations", to "bgp issues cause test endpoints failures, but not customer visible ones", to...
You can't go from a metric to diagnosis for a customer - there's just no 1:1 mapping possible, with errors going both ways. AWS sucks with their status delays, but it's better than seeing their internal monitoring.
I don't agree. I'm perfectly happy seeing their internal metrics.
I remember one night a long time ago while working at Google, I was trying a new approach for loading some data involving an external system. To get the performance I needed, I was going to be making a lot of requests to that service, and I was a little concerned about overloading it. I ran my job and opened up that service's monitoring console, poked around, and determined that I was in fact overloading it. I sent them an email and they added some additional replicas for me.
Now I live in the world of the cloud where I pay $100 per X requests to some service, and they won't even tell me if the errors it's returning are their fault or my fault? Not acceptable.
> I'd be totally fine just having alerts and metrics driving the status page. Why involve a human at all?
What happens if the monitoring service goes down or has a bug that causes it to incorrectly report the status as okay? Obviously this won't usually happen at the same time as the actual service goes down, but if it did, would people really believe that?
It definitely is. For an issue like this, you will see relevant teams and delegates looped in very quickly. Getting approved wording about an outage requires some very senior people though. Often they have to be paged in as well.
Having worked at a few other large tech companies now -- Amazon's incident response process is honestly great. It's one of the things I miss about working there.
What’s up with all of the multi-platform outages lately? Seems abnormal looking at historical data. Are there issues affecting the internet backbone or something? Or just a coincidence?
Important to keep in mind that AWS has 250 services in 84 Availability Zones in 26 regions.
This outage is reportedly impacting 5 services in 1 region.
For those impacted, pretty terrible. But as a heavy user of AWS, I’ve seen these notices posted multiple times on HN and haven’t been impacted by one yet.
For businesses with uptime guarantees and lots of boxes to spin up in failover scenario, this has been a very eventful 12 months. At least that's what I'm experiencing.
Probably increased salary and switch to permanent remote. Amazon is notorious for their frugality and they recently doubled their maximum salary cap to 350k. They would only have done this to stay competitive in the current job market. This implies that many of their existing employees are underpaid relative to their peers at comparable companies and they've likely seen a large uptick in attrition. Not to mention attrition begets more attrition, especially if it's "influential" employees who are leaving.
It’s just a little amazing to imagine that people doing the same work in different places of the country have such huge gaps in salary caps. I think the national average for a high-level software engineer is less than $150k per year.
Things are just bigger there. My colleagues and I thought we were hot shit managing 5-7k applications and infrastructure. Amazon probably runs 20,000 orgs like mine.
Also, times are good and rates are crazy. Even at VARs, you can make a lot of cash. I have a buddy who went from $150k to $600k. The guy paid off his mortgage and is at a point where he could burn out and work at Home Depot if he needed to.
They might benefit from migrating to the Azure cloud. I’ve heard that some of the Windows servers actually run faster than some of the Linux servers on Azure.
Protip to anyone building new infrastructure in AWS: If you're gonna only use one region in the US, make it us-east-2 or us-west-2. us-east-1 is their oldest and biggest region but also the least stable in the US (ok technically us-west-1 is worse but you can't get that one anymore).
Somehow AWS managed to make their new status page more opaque than the old one. It's like they want you to scroll through their gigantic list so they can fix the issue before you find the right line.
Given the total amount of money I've lost due to a single AZ being down, it was totally worth it NOT to go multi-AZ or multi-region so far.
Multi AZ isn't that hard, but generally requires extra costs (one nat gw per az, etc...)
But multi region in AWS is a royal pain in the ass. Many services (like SSO) do not play well with multi region setups, making things really complicated even if you IaCed your whole stack.
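For a rough sense of what "extra costs" means, a back-of-envelope sketch -- the rates below are illustrative placeholders, not current AWS pricing:

```python
# Back-of-envelope for the "one NAT gateway per AZ" overhead.
# Rates below are illustrative placeholders -- check current AWS pricing.
azs = 3
hourly_rate = 0.045      # $/hour per NAT gateway (example figure)
data_rate = 0.045        # $/GB processed (example figure)
hours_per_month = 730
gb_processed = 500       # example monthly workload

monthly = azs * hourly_rate * hours_per_month + data_rate * gb_processed
print(f"~${monthly:.0f}/month before your compute does anything")  # ~$121
```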
Those costs are the actual reason you are encouraged to go multi-AZ!
(I actually love that we have strategies and infrastructure for multi-region... it just tends to come up at scales and for applications where it is not justified.)
Seems like there would be a conflict of interest between increasing the robustness of a single AZ (so it never goes down, or has its own redundancy) and the increased revenue from multi-AZ deployments.
What's the point of cloud if we have to manage robustness of their own infrastructure. I can understand if that's due to natural disasters and earthquakes, but the idea should be that a single AZ should never go down barring extraordinary circumstances. AWS should be auto-balancing, handling downtimes of a single AZ without the customer ever noticing it.
It might not be a good analogy, but if a single Cloudflare edge datacenter goes down, it will automatically route traffic through others. Transparent and painless to the customer. I understand AWS is huge, and different services have different redundancy mechanisms, but just conceptually it feels like they're in a conflict of interest to increase robustness of their data centers - "We told you to have multi-AZ deployment, not our fault".
Another way to put this: as an AWS customer, make sure to fold roughly 3x the costs, plus the management overhead of a multi-AZ deployment, into your total costs.
> What's the point of cloud if we have to manage robustness of their own infrastructure.
Worth deliberating on. I’m curious as to what the lifetime cost of ownership for an on-prem data center is relative to lifetime cost of operating in the cloud.
They would simply charge for the privilege. An EC2 'always on' or whatever option that enabled your instance to live migrate between availability zones would be a nice and expensive option.
I would strongly urge not using us-east-1 -- of all the regions we're in, it's by far the most problematic. Use us-east-2 if you need good latency to the East Coast.
Not sure if it's still the case, but when I was there us-east-1 was a SPOF for some services worldwide. I think if DynamoDB went down in that region it was a big, big issue.
The only SPOF I know of for us-east-1 today is the control plane for Route53 - it's distributed and DNS queries will continue to work when us-east-1 is down (including health-check-based failover), but you can't make any DNS changes while us-east-1 is down.
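To illustrate that data-plane/control-plane split: health-check failover gets configured once through the control plane, and afterwards works purely at query time. A hedged boto3 sketch of that setup (zone ID, hostnames, and IPs are placeholders):

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0PLACEHOLDER"   # placeholder
NAME = "api.example.com."

# Health check evaluated continuously by Route53's distributed data plane.
hc = route53.create_health_check(
    CallerReference="primary-api-hc-1",
    HealthCheckConfig={
        "Type": "HTTPS",
        "FullyQualifiedDomainName": "primary.example.com",
        "ResourcePath": "/health",
        "Port": 443,
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)

# Primary/secondary failover records. Once these exist, failover happens
# at query time -- no control-plane call (and no us-east-1) required.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={"Changes": [
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": NAME, "Type": "A",
            "SetIdentifier": "primary", "Failover": "PRIMARY",
            "TTL": 60, "ResourceRecords": [{"Value": "198.51.100.10"}],
            "HealthCheckId": hc["HealthCheck"]["Id"],
        }},
        {"Action": "UPSERT", "ResourceRecordSet": {
            "Name": NAME, "Type": "A",
            "SetIdentifier": "secondary", "Failover": "SECONDARY",
            "TTL": 60, "ResourceRecords": [{"Value": "203.0.113.20"}],
        }},
    ]},
)
```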
Might be true for running stuff in different regions/AZs, but if the provisioning region is down (e.g. when deploying Lambda@Edge), one does not really have an alternative.
Good advice, though AWS still has some services that don't work completely independently. Cloudfront, because of certificates. Route53. The control API for IAM (adding/removing roles, etc). And I wish they didn't have global-looking endpoints (like https://sts.amazonaws.com) that aren't really global or resilient.
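On the STS point specifically, the SDKs can be pointed at regional STS endpoints instead of the pseudo-global one; newer SDKs also honor an AWS_STS_REGIONAL_ENDPOINTS=regional setting, if I recall correctly. A quick boto3 sketch (region choice is just an example):

```python
import boto3

# Point the client at a regional STS endpoint instead of sts.amazonaws.com,
# so trouble with the "global" endpoint doesn't take out your auth path.
sts = boto3.client(
    "sts",
    region_name="us-east-2",
    endpoint_url="https://sts.us-east-2.amazonaws.com",
)

print(sts.get_caller_identity()["Account"])
```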
This. We have multi-AZ in more than one region, and I occasionally dream of Bezos wearing only a top hat and waistcoat, laughing maniacally while diving into a large vat of gold coins.
Not always possible - Australia (currently) only has one region, and if you're in a regulated industry (banking or government stuff) they require data to be in Australia.
Just in case PT's Stasi as a Service company ($PLTR) has a hard time parsing this, I want to make clear that this is a joke based on a quote from Aliens the movie. I am not endorsing anything violent.
In our case (Apify.com) there was a complete outage of SQS (15 mins+), most likely DNS problems, plus EC2 instances got restarted, probably as a result of the SQS outage.
EDIT: Also AWS Lambda seems to be down and AWS EC2 APIs having a very high error rate and machines slow startup times.
Noticed issues with SQS for a couple of minutes. Errors from the Java SDK: `com.amazonaws.SdkClientException: Unable to execute HTTP request: sqs.us-east-1.amazonaws.com`
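If anyone wants to harden against blips like that, transient resolution/connection failures are what the SDKs' retry settings are for. A boto3 sketch (the Java SDK has analogous retry configuration; the queue URL and retry numbers are placeholders), though a multi-minute outage still needs handling at the application level:

```python
import boto3
from botocore.config import Config

# Adaptive retries with a higher attempt cap smooth over short DNS /
# connection blips; queue URL and numbers are placeholders.
sqs = boto3.client(
    "sqs",
    region_name="us-east-1",
    config=Config(retries={"max_attempts": 10, "mode": "adaptive"}),
)

sqs.send_message(
    QueueUrl="https://sqs.us-east-1.amazonaws.com/123456789012/example-queue",
    MessageBody="hello",
)
```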