I'd be happy to share my experience with you. We've solved problems like this for a lot of folks with small or non-existent internal tech teams. I agree with the small team of experts approach, but there are pitfalls there too.
My understanding is that you are legally required to carry third-party liability (TPL) insurance in MX, but credit cards only provide collision/damage insurance.
> 8:22 AM PST We are investigating increased error rates for the AWS Management Console.
> 8:26 AM PST We are experiencing API and console issues in the US-EAST-1 Region. We have identified root cause and we are actively working towards recovery. This issue is affecting the global console landing page, which is also hosted in US-EAST-1. Customers may be able to access region-specific consoles going to https://console.aws.amazon.com/. So, to access the US-WEST-2 console, try https://us-west-2.console.aws.amazon.com/
> This issue is affecting the global console landing page, which is also hosted in US-EAST-1
Even this little tidbit is a bit of a wtf for me. Why do they consider it ok to have anything hosted in a single region?
At a different (unnamed) FAANG, we considered it unacceptable to have anything depend on a single region. Even the dinky little volunteer-run thing which ran https://internal.site.example/~someEngineer was expected to be multi-region, and was, because there was enough infrastructure for making things multi-region that it was usually pretty easy.
Every damn Well-Architected Framework includes multi-AZ if not multi-region redundancy, and yet the single access point for their millions of customers is single-region. Facepalm in the form of $100Ms in service credits.
> Facepalm in the form of $100Ms in service credits.
It was also greatly affecting Amazon.com itself. I kept getting sporadic 404 pages, and one hit during a purchase. Purchase history wasn't showing the product as purchased and I didn't receive an email, so I repurchased. Still no email; this time the purchase didn't end in a 404, but the product still didn't show up in my purchase history. I have no idea if I purchased anything or not. I have never had an issue purchasing before. Normally I get a confirmation email within 2 or so minutes and the sale is immediately reflected in purchase history. I was unaware of the greater problem at that moment or I would have steered clear at the first 404.
They're also unable to refund Kindle book orders via their website. The "Request a refund" page has a 500 error, so they fall back to letting you request a call from a customer service rep. Initiating this request also fails, so they then fall back to showing a 1-888 number that the customer can call. Of course, when I tried to call, I got "All circuits are busy".
> Facepalm in the form of $100Ms in service credits.
Part of me wonders how much they're actually going to pay out, given that their own status page has only indicated five services with moderate ("Increased API Error Rates") disruptions in service.
That public status page has no bearing on service credits; it's a statically hosted page that's only updated when there's significant public impact. A lot of issues never make it there.
Every AWS customer has a Personal Health Dashboard, which is updated much faster and links issues to your affected resources. Additionally, requests for credits are handled by the customer service team, who have even more information.
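For what it's worth, that dashboard is also queryable programmatically through the AWS Health API, so you don't have to rely on either status page during an incident. A minimal sketch with boto3, assuming a Business or Enterprise support plan (the API requires one) and purely illustrative filters:

```python
import boto3

# The AWS Health API is a "global" service whose endpoint has historically
# been served out of us-east-1, so it may itself be degraded during an
# outage like this one.
health = boto3.client("health", region_name="us-east-1")

response = health.describe_events(
    filter={
        "regions": ["us-east-1"],       # illustrative filter
        "eventStatusCodes": ["open"],   # only events still in progress
    }
)

for event in response["events"]:
    print(event["service"], event["eventTypeCode"], event["startTime"])
```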
This point is repeated often, and the incentives for Amazon to downplay the actual downtime are definitely there.
Wouldn't affected companies be incentivized to file a lawsuit over AMZ lying about status? It would be easy to prove and costly for AWS to defend.
I'm guessing Google, on the basis of the recently published (to the public) "I just want to serve 5TB"[1] video. If it isn't Google, then the broccoli man video is still a cogent reminder that unyielding multi-region rigor comes with costs.
It's salient that the video is from 2010. Where I was (not Google), the push to make everything multi-region only really started in, maybe, 2011 or 2012. And, for a long time, making services multi-region actually was a huge pain. (Exception: there was a way to have lambda-like code with access to a global eventually-consistent DB.)
The point is that we made it easier. By the time I left, things were basically just multi-region by default. (To be sure, there were still sharp edges. Services which needed to store data (like, databases) were a nightmare to manage. Services which needed to be in the same region as specific instances of other services, e.g. something which wanted to be running in the same region as wherever the master shard of its database was running, were another nasty case.)
The point was that every service was expected to be multi-region, which was enforced by regular fire drills, and if you didn't have a pretty darn good story about why regular announced downtime was fine, people would be asking serious questions.
And anything external going down for more than a minute or two (e.g. for a failover) would be inexcusable. Especially for something like a bloody login page.
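(For comparison, and not a description of the setup above: on AWS itself, the usual building block for failing a public endpoint over between regions within a minute or two is DNS failover routing with health checks. A rough boto3 sketch, with a hypothetical hosted zone, health check, and IPs; a real setup needs low TTLs and health checks that actually exercise the login path.)

```python
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z123EXAMPLE"            # hypothetical
PRIMARY_HEALTH_CHECK_ID = "abcd-1234"     # hypothetical

def failover_record(set_id, role, target_ip, health_check_id=None):
    record = {
        "Name": "login.example.com.",
        "Type": "A",
        "SetIdentifier": set_id,
        "Failover": role,                  # "PRIMARY" or "SECONDARY"
        "TTL": 60,                         # keep low so failover is fast
        "ResourceRecords": [{"Value": target_ip}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return {"Action": "UPSERT", "ResourceRecordSet": record}

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [
            # Primary answer, served only while its health check passes.
            failover_record("us-east-1", "PRIMARY", "192.0.2.10",
                            PRIMARY_HEALTH_CHECK_ID),
            # Secondary answer, served automatically when the primary is unhealthy.
            failover_record("us-west-2", "SECONDARY", "198.51.100.10"),
        ]
    },
)
```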
YES! Why do they do that? It's so weird. I will deploy a whole config into us-west-1 or something, but then I need to create a new cert in us-east-1 JUST to let CloudFront answer an HTTPS call. So frustrating.
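For anyone else tripping over this: CloudFront only accepts ACM certificates that live in us-east-1, no matter where the rest of the stack runs. A small boto3 sketch of the dance, with a hypothetical domain:

```python
import boto3

# Certificates used by CloudFront must be requested in us-east-1,
# even if every other resource lives in us-west-1 or elsewhere.
acm_us_east_1 = boto3.client("acm", region_name="us-east-1")

response = acm_us_east_1.request_certificate(
    DomainName="www.example.com",   # hypothetical domain
    ValidationMethod="DNS",
)

# This ARN is what gets wired into the CloudFront distribution's viewer
# certificate; the origin, app servers, etc. can stay wherever you deploy them.
print(response["CertificateArn"])
```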
Apparently it's okay for static data (like a website hosted in S3 behind CloudFront) but seeing non-Australian items in AWS billing and overviews always makes us look twice.
> Even this little tidbit is a bit of a wtf for me. Why do they consider it ok to have anything hosted in a single region?
They're cheap. HA is something their customers pay extra for, not something Amazon applies to itself, and Amazon often lies during major outages. They would lose money on HA and they would lose money by acknowledging downtime. They will lie as long as they benefit from it.
I think I know specifically what you are talking about. The actual files an engineer could upload to populate their folder weren't multi-region for a long time. The servers were, because they were stateless and that was easy to make multi-region, but the actual data wasn't until we replaced the storage service.
I think the storage was replicated by 2013? Definitely by 2014. It didn't have automated failover, but failover could be done, and was done during the relevant drills for some time.
I think it only stopped when the storage services got to the "deprecated, and we're not bothering to do a failover because dependent teams who care should just use something else, because this one is being shut down any year now". (I don't agree with that decision, obviously ;) but I do have sympathy for the team stuck running a condemned service. Sigh.)
After stuff was migrated to the new storage service (probably somewhere in the 2017-2019 range, though I'm not sure exactly when), I have no idea how DR/failover worked.
Thank you for the sympathy. If we are talking about the same product, then it was most likely backed by three different storage services over its lifespan: in 2013/2014 a third-party product that had some replication/failover baked in; 2016-2019 on my team with no failover plans due to "deprecated, don't bother putting anything important here"; then 2019 onward with "fully replicated, capable of automatic failover, and cheaper per GB to replicate, but less flexible for the existing use cases".
Yeah, but I still have a different understanding of what "Increased Error Rates" means.
IMHO it should mean that the rate of errors is increased but the service is still able to serve a substantial amount of traffic. If the error rate is above, let's say, 90%, that's not an increased error rate; that's an outage.
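As a strawman for where that line could sit (the thresholds below are made up, purely to illustrate the distinction):

```python
def classify(total_requests, failed_requests,
             outage_threshold=0.90, degraded_threshold=0.05):
    """Toy health classification from an error rate; thresholds are arbitrary."""
    error_rate = failed_requests / total_requests
    if error_rate >= outage_threshold:
        return "outage"
    if error_rate >= degraded_threshold:
        return "increased error rates"
    return "operating normally"

print(classify(1_000_000, 950_000))  # -> outage
print(classify(1_000_000, 80_000))   # -> increased error rates
```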
Some big customers should get together and make an independent org to monitor cloud providers and force them to meet their SLA guarantees without being able to weasel out of the terms like this…
IAM is a "global" service for AWS, where "global" means "it lives in us-east-1".
STS at least has recently started supporting regional endpoints, but most things involving users, groups, roles, and authentication are completely dependent on us-east-1.
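On a new enough SDK you can at least keep STS traffic inside your own region, so token vending doesn't ride on us-east-1. A boto3 sketch; the region is just an example:

```python
import boto3

# Older SDK defaults send STS calls to the global endpoint
# (https://sts.amazonaws.com), which lives in us-east-1. Forcing the
# regional endpoint keeps token vending local. The same effect can be had
# with AWS_STS_REGIONAL_ENDPOINTS=regional, or sts_regional_endpoints = regional
# in ~/.aws/config, on SDKs that support it.
sts = boto3.client(
    "sts",
    region_name="eu-west-1",
    endpoint_url="https://sts.eu-west-1.amazonaws.com",
)

print(sts.get_caller_identity()["Account"])
```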
When I brought up the status page (because we're seeing failures trying to use Amazon Pay) it had EC2 and Mgmt Console with issues.
I opened it again just now (maybe 10 minutes later) and it now shows DynamoDB has issues.
If past incidents are anything to go by, it's going to get worse before it gets better. Rube Goldberg machines aren't known for their resilience to internal faults.
As a user of Sagemaker in us-east-1, I deeply fucking resent AWS claiming the service is normal. I have extremely sensitive data, so Sagemaker notebooks and certain studio tools make sense for me. Or DID. After this I'm going back to my previous formula of EC2 and hosting my own GPU boxes.
Sagemaker is not working: I can't get to my work (the notebook instance is frozen upon launch, with zero way to stop or restart it), and Sagemaker Studio is also broken right now.
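In case it helps anyone in the same spot: when the console UI is wedged, the notebook lifecycle can sometimes still be driven through the API directly, assuming the control plane itself is answering (which today it may well not be). A boto3 sketch with a hypothetical instance name:

```python
import boto3

sagemaker = boto3.client("sagemaker", region_name="us-east-1")

NOTEBOOK = "my-notebook-instance"  # hypothetical name

# Ask what SageMaker thinks the instance is doing...
status = sagemaker.describe_notebook_instance(
    NotebookInstanceName=NOTEBOOK
)["NotebookInstanceStatus"]
print("current status:", status)

# ...and request a stop if it's stuck. During a regional outage this call
# may fail or hang just like the console does.
if status in ("InService", "Pending"):
    sagemaker.stop_notebook_instance(NotebookInstanceName=NOTEBOOK)
```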
You don't use AWS because it has better uptime. If you've been around the block enough times, this story has always rung hollow.
Rather, you use AWS because when it is down, it's down for everybody else as well. (Or at least they can nod their head in sympathy for the transient flakiness everybody experiences.) Then it comes back up and everybody forgets about the outage like it was just background noise. This is what's meant by "nobody ever got fired for buying (IBM|Microsoft)". The point is that when those products failed, you wouldn't get blamed for making that choice; in their time they were the one choice everybody excused even when it was an objectively poor choice.
As for me, I prefer hosting all my own stuff. My e-mail uptime is better than GMail, for example. However, when it is down or mail does bounce, I can't pass the buck.
Identify, or publicly acknowledge? Chances are the technical teams noticed this fairly quickly and had been working on the issue for some time. It probably wasn't until they identified the root cause and had a handful of strategies to mitigate it with confidence that they chose to publicly acknowledge the issue, to save face.
I've broken things before and been aware of it, but didn't acknowledge them until I was confident I could fix them. It allows you to maintain an image of expertise to those outside who care about the broken things but aren't savvy about what broke or why. Meanwhile you spend hours, days, or weeks addressing the issue and then suddenly pull a magic solution out of your hat, looking like someone impossible to replace. Sometimes you can break and fix things without anyone even knowing, which is very valuable if breaking something carried some real risk to you.
This sounds very self-blaming. Are you sure that's what's really going through your head? Personally, when I get avoidant like that, it's because of anticipation of the amount of process-related pain I'm going to have to endure as a result, and it's much easier to focus on a fix when I'm not also trying to coordinate escalation policies that I'm not familiar with.
:) I imagine it went like this theoretical Slack conversation:
> Dev1: Pushing code for branch "master" to "AWS API".
> <slackbot> Your deploy finished in 4 minutes
> Dev2: I can't reach the API in east-1
> Dev1: Works from my computer
It's acting odd for me. Shows all green in Firefox, but shows the error in Chrome even after some refreshes. Not sure what's caching where to cause that.
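One way to take the browsers out of the equation is to fetch the page directly and look at what caching the server asks for; something like this (the header names are just the usual suspects, and some may be absent):

```python
import requests

# Fetch the status page while bypassing local caches, then print the
# cache-related response headers to see how long intermediaries may
# legitimately serve a stale "all green" copy.
resp = requests.get(
    "https://status.aws.amazon.com/",
    headers={"Cache-Control": "no-cache"},
    timeout=10,
)

for header in ("Cache-Control", "Age", "Expires", "Last-Modified", "X-Cache"):
    print(f"{header}: {resp.headers.get(header)}")
```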
Cool project! I'm building something similar at https://apppack.io. It's not Terraform-based, but it also helps set up, manage, and orchestrate AWS resources for devs.
Using AWS managed services is a huge win for maintainability. A lot of host-your-own PaaS tools are spinning up EC2 instances that you're then responsible for maintaining/patching/securing.
Thanks for the link. Honestly, one of my goals is to avoid becoming too dependent on proprietary clouds. If I wanted something that only works on Amazon, there's always ElasticBeanstalk/Amplify & friends.
Feel free to reach out pete AT lincolnloop.com