I feel for the OP, I really do. Have been through this many times before.
But it is going too far to call this a "lie" on the part of AWS. If you talk to any AWS customer support representative, or read their docs, they are very clear that the unit of availability for their service is a region, not a zone.
Notice that the status update said "connectivity issues in a single Availability Zone".
If you are deploying in a single AZ, and your app is not tolerant to at least one AZ failure, then you should not be telling your customers/boss that your app is highly available. That's not the fault of AWS, it's how you're using AWS.
With that said, I do think that AWS could do two things: 1) maybe show a yellow warning sign instead of a blue "i", for something that borks an entire AZ, and 2) make it even more clear that each AZ is not high-availability.
Well, it's not that simple. We do run in multiple availability zones in every region. But if the connectivity between them is only partially working, which it was, and shared services from Amazon itself aren't fully reachable from every instance, you have a huge mess to contend with where cluster consensus cannot be formed. So in cases like this we did what we should have done and routed traffic away from a network that was unreliable and partly partitioned. The point for us was not that this availability zone went down at all. It was that Amazon claimed throughout that everything was operating normally for hours, when that was very far from the truth.
So one availability zone went down, not the region, which Amazon indicated on their status page (which is clearly set up to predominantly display region-level outages, not those of a single AZ within the region), and because your system was set up to depend on each AZ being operational and on networks not partitioning, it caused issues.
I get that having issues on AWS is irritating; I exist in that ecosystem too. But...I really can't fault them for this, or claim that they're lying. AWS says to not rely on any one AZ being up and/or reachable, and yet you did. And the fact it caused problems for you means you want them saying the entire region is down. Why? They make regions be fault tolerant by having multiple AZs; they guarantee reliability at the region level, not at the AZ level, and that's what the status page is intended to track.
Now, I can see wanting a clear status page per AZ, rather than just a blue 'i'. That's a valid request. But -request- that, don't claim that they're lying. You're being antagonistic despite them doing everything they've promised, and their status page being correct (just not using the colors you would like because they view severity differently than you).
I think the two suggestions I made are more reasonable than claiming AWS "lied". It's understandable that your customer would be confused seeing the blue "i" instead of something yellow. But that doesn't mean they claimed "everything was operating normally".
In terms of your cluster, it takes a lot of testing and tweaking to ensure your cluster can reach consensus during partial/sporadic partitions. But it can and should be done if you need high availability (e.g. take nodes out of the cluster if they keep disconnecting, until they have a stable connection for X minutes).
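As a rough sketch of what I mean (names and thresholds here are hypothetical, not any particular cluster's implementation), the gating logic can be as simple as tracking each node's most recent failure and only letting it participate again after a quiet period:

    import time

    STABLE_WINDOW = 10 * 60  # seconds a node must stay healthy before re-admission (assumed policy)

    class MembershipGate:
        # Decides which nodes may take part in consensus, based on connectivity history.

        def __init__(self):
            self.last_failure = {}  # node_id -> timestamp of the most recent failed health check
            self.eligible = set()   # nodes currently allowed to participate

        def record_check(self, node_id, ok, now=None):
            now = now if now is not None else time.time()
            if not ok:
                # Any failure evicts the node immediately and restarts its stability clock.
                self.last_failure[node_id] = now
                self.eligible.discard(node_id)
                return
            # Re-admit only after STABLE_WINDOW seconds without a single failure.
            if now - self.last_failure.get(node_id, 0) >= STABLE_WINDOW:
                self.eligible.add(node_id)

        def quorum_members(self):
            return set(self.eligible)

Flapping nodes then stay out of the quorum until they have genuinely settled, instead of repeatedly breaking consensus.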
I have worked at Amazon, and I can validate this. When I was on an AWS team, posting a "non-green" status to the status page was actually a manager's decision. That means it is in no way real time, and it's possible the status might say "everything is OK" when it really isn't, because a manager doesn't think it's a big enough deal.
There is also a status called green-i, which is the green icon with a little 'i' information bubble. Almost every time something was seriously wrong, our status was green-i, because the "level" of warning is ultimately a manager's decision, and they avoid yellow and red as much as possible. So the statuses aren't very accurate.
That being said, if you do see a yellow or red status on AWS, you know things are REALLY bad.
Devops here who has to deal with the AWS status page frequently. Are someone's bonus or performance metrics tied to how often a service goes "non-green" on the status page?
Not really. But the manager IS held responsible if they go green-i too often, as sometimes having a bad status can trigger AWS to give credits back to certain customers.
I mean, it's still a manager's decision, so it usually goes green-i only after you've already noticed the issue. I would say it's the same: still not very reliable. But do know that if you see the 'i' indicator, something is wrong.
Note that if they did go non-green, they might have to credit you due to SLAs. There's likely a strong disincentive internally to let a problem cause a non-green status, because it is a metric that will cost Amazon money: Amazon gives service credits for reduced uptime. And since Amazon employees are known for being judged very harshly on metrics, lying on the status page is thus incentivized.
You are looking at the consequences of Amazon's employee culture. "Metrics or die" means the metrics might need to be fudged to keep your job and your team's jobs.
It seems like they need to separate the status as information for customers from the status as it relates to SLAs. Just tell me what's going on as honestly as you can; we can figure out whether we need a refund later.
AWS is massively incentivized to lie on the status pages until the outages become egregious. It's literally a page for marketing to point to when describing how reliable the service is.
This. The people at Amazon need to restructure their teams' incentive structures so that whoever is responsible for status reporting is in no way beholden to marketing. Of course, Amazon itself has no incentive to do this, which is why it's important for developers like ourselves to raise a big enough stink.
I agree that AWS could improve their openness around service problems. That said, what impacts some customers doesn't impact all customers. Ably may be offline when AWS has certain types of trouble, but that doesn't mean everyone with services in that AZ or region is having similar problems.
You also have to keep in mind that AWS status reports aren't just informational. They have a direct impact on how people use their service. As soon as AWS puts up a yellow dot, people start changing DNS and spinning up failovers in other regions or AZs, which has the potential to cascade into a much bigger failure when resources are unnecessarily tied up in other regions for something that maybe isn't as big a crisis as it sounds at first. This has happened before, and is part of the reason why AWS is so circumspect about announcing problems.
Point being that there are ways to architect around AZ and region failure that don't rely on trusting AWS's own reporting. Anyone with a significant investment in or dependency on AWS or any cloud service should not rely on those services' own indicators.
All that said, the fact is that in a system as big as AWS there are minor problems going on all the time. It would be healthier for their customers, their PR image, and the response to bigger incidents to reveal more details showing that, yes, most of the time some tiny fraction of the system is broken, offline, or in some other failure state. 500 errors happen with the API sometimes; let's see streaming feeds of failure rates from however many endpoints there are. EC2 instances have hardware failures or need to reboot sometimes; let's see some indication of whether that's happening more or less often than normal. Network segments fail or get overloaded... Revealing some of the less pristine guts of the operation (without revealing sensitive details, of course) would go a long way toward being more honest, without the bigger risks of saying EC2 in us-east-1 is down!
I've long since learned that the AWS status page is useless: it usually only updates 30-40 minutes after the problem starts (by which time you've already noticed it), and they will always minimize the problem on the status page...
It's frustrating when there's an issue with AWS and my clients tell me that their status page reports that everything is OK.
I've had this happen to me multiple times. Luckily I use AWS for batch research jobs, so I'm not losing any money when things break.
But when researchers come to me with AWS problems even though I've not changed anything, I usually spend 10 minutes on one of the nodes for some cursory investigation and if it's not obvious what the problem is I tell them to wait a day or two before further investigation. 95% of the time, the problem clears up on its own.
I've learned the hard way that spending half a day tracing SQS messages, dynamo tables, spot bidding, etc is usually a waste of time.
That the AWS console flat out lies like this wastes so much of everyone's time. It's not even hard to fix: AWS could run internal test cases on every subnet group.
A while back, after some months of frustration with significant levels of S3 API failures in bulk that never made it to the status page, I ended up writing a tool that continually monitors S3 via making random requests, and alerts if too many fail per unit time. The outcome of that experiment is the finding that there are a lot of significant spikes in S3 API failure rates that go entirely unannounced by Amazon, though the situation has improved considerably in the past year.
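A minimal sketch of that approach, in case anyone wants to roll their own (the bucket name, probe interval, and alert threshold below are placeholders, not the real tool's values):

    import random
    import time

    import boto3
    from botocore.exceptions import ClientError, EndpointConnectionError

    BUCKET = "my-canary-bucket"   # placeholder bucket dedicated to probing
    SAMPLE_INTERVAL = 5           # seconds between probes (assumed)
    WINDOW = 60                   # rolling window, in seconds
    FAILURE_THRESHOLD = 0.2       # alert if more than 20% of probes in the window fail (assumed)

    s3 = boto3.client("s3")
    samples = []  # list of (timestamp, succeeded) tuples

    def probe():
        key = "canary/{}".format(random.randint(0, 9999))
        try:
            s3.put_object(Bucket=BUCKET, Key=key, Body=b"ping")
            s3.get_object(Bucket=BUCKET, Key=key)
            return True
        except (ClientError, EndpointConnectionError):
            return False

    while True:
        now = time.time()
        samples.append((now, probe()))
        # Keep only samples that are still inside the rolling window.
        samples = [(t, ok) for t, ok in samples if now - t <= WINDOW]
        failure_rate = sum(1 for _, ok in samples if not ok) / len(samples)
        if failure_rate > FAILURE_THRESHOLD:
            print("ALERT: S3 failure rate {:.0%} over the last {}s".format(failure_rate, WINDOW))
        time.sleep(SAMPLE_INTERVAL)

The real thing should page you rather than print, but the core loop is just: make random requests, keep a rolling window of outcomes, and alert when the failure rate spikes.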
If they're running a status page, you shouldn't have to contact support to get the status. It's pretty well known among heavy AWS users that their status page is bullshit.
Whether or not they contact AWS support, AWS knew about this issue and it should have been RED, not green with a 'note'.
I've had about 4 or 5 serious AWS issues in the last 4 years (an issue which impacts multiple instances or services), and literally never had anything other than a green tick. Usually they don't even bother with the note.
AWS status page is a point of contention within the support teams as well. I've contacted them for problems when the status page says "everything's fine" and they have known outages. There is ALWAYS a lag between them knowing and anything being reported, usually on the order of 15-30 minutes. Most of the time their response is "Yes, there's a problem. I don't know why the status page isn't updated yet, it would really help us (support) out if the page was accurate."
Twitter is usually the best place to watch if you think it's not you, search #aws or #{service_name} and view the "Live" tab to see if others are reporting the same trouble.
Calling attention to these failures via support ticket, via sales rep and even publicly via @ customer support heads has made zero difference. Here's hoping this blog has enough visibility to make a difference.
The "real AWS status" extension for Chrome [1] pokes a little bit of fun at this. Basically, it hides all the regular green checkmarks (unless they are a 'Resolved' issue), then it increments the status images (green with I becomes yellow, yellow becomes red). It also makes the legend at the bottom snarky.
Although it's really meant as a joke, I think, it's actually really useful. It removes the noise and, frankly, the incremented status images are usually quite accurate for us.
A lot of the time, service outages seem localized to specific datacenters and subnets, or some relatively small subset of nodes. I suspect even Amazon doesn't know there's a problem sometimes.
I would be happy if they just added a status indicator to each running node indicating that there's some local problem.
I meant you could deduce/monitor/alert on the behavior of any of the services relevant to you (I don't use SQS, but metrics for it are apparently there). These have been way more helpful to me, and I often show copy-pastes of graphs to clients to explain problems.
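As a concrete (and purely illustrative) example of what I mean, pulling one of those metrics out of CloudWatch for a graph takes only a few lines; the queue name here is a placeholder:

    from datetime import datetime, timedelta, timezone

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Last hour of the age of the oldest message on a queue (hypothetical queue name).
    end = datetime.now(timezone.utc)
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/SQS",
        MetricName="ApproximateAgeOfOldestMessage",
        Dimensions=[{"Name": "QueueName", "Value": "my-queue"}],
        StartTime=end - timedelta(hours=1),
        EndTime=end,
        Period=300,
        Statistics=["Maximum"],
    )
    for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
        print(point["Timestamp"], point["Maximum"])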
ckozlowski mentioned the Personal Health Dashboard, which I never knew about, as an alternative I should check out.
However, I agree with the fundamental premise that status.aws.amazon.com is useless except for national-news-scale events.
Other commenters have taught me that the status page is a region-level view and manually updated, which is something I hadn't considered and which makes me understand why it might be useless.
One other thing I tried was subscribing to the RSS feeds (on the status page) for the region/services I was interested in. I eventually turned them off, as they were way too noisy, often alerting on things that didn't actually affect my usage.
This is definitely not unique to AWS; in my own experience it applies to GCP as well. We sometimes had Sentry alerts for DB connection errors coming from the web app hundreds of times within a short 5-minute window, and the GCP dashboard would still show green the entire time. Disclaimer: I have worked with GCP for 2 years and AWS for 8+ years.
Damn lies about the number of 9s. I face similar issues with Google Cloud's APIs on a daily basis, and their standard response is to go and buy the next support level before they can look at the problem.
These companies claim to have state-of-the-art technology, whereas in reality their customers inform them about outages and performance problems.
The exhausting reality is that in the age of widely distributed services, everything is always broken enough that someone's use case is going to choke and die. When the best case involves total failure for a small percentage of people, what does green even mean? The author had their own monitoring; that's the only workable approach.
This literally happened to me for the first time this week, after years on AWS! My site was totally down for 6 hours, but AWS still reported everything as green and no notifications were sent.
Any suggestions for a very simple 3rd party "check" that will constantly monitor my site and alert me (email/text) when it's unresponsive?
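(In the meantime, while evaluating third-party options, even a tiny self-hosted probe running somewhere outside AWS works as a stopgap; the URL, recipient, and local mail relay below are assumptions:)

    import smtplib
    import time
    import urllib.request
    from email.message import EmailMessage

    SITE_URL = "https://example.com/health"   # placeholder endpoint to probe
    ALERT_TO = "me@example.com"               # placeholder alert recipient
    CHECK_INTERVAL = 60                       # seconds between checks (assumed)

    def site_is_up():
        try:
            with urllib.request.urlopen(SITE_URL, timeout=10) as resp:
                return resp.status < 500
        except Exception:
            return False

    def send_alert():
        msg = EmailMessage()
        msg["Subject"] = "ALERT: {} is unresponsive".format(SITE_URL)
        msg["From"] = ALERT_TO
        msg["To"] = ALERT_TO
        msg.set_content("The uptime probe could not reach the site.")
        with smtplib.SMTP("localhost") as smtp:  # assumes a local mail relay is available
            smtp.send_message(msg)

    while True:
        if not site_is_up():
            send_alert()
        time.sleep(CHECK_INTERVAL)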
I wish AWS had both high-level and drill-down status pages where it would post API success rates or equivalent detailed information, like GitHub's status page has [0].
Short version: AWS doesn't give any service status.
They have a status page, but everything is always "green". It would take half the internet going down before they move something to "not green (but not red either)".
Read my previous comment. We use multiple AZs. That's not the point of my blog post. I am not complaining about the availability of their services; I am complaining that their service status is not accurate.
Well, your system is not fault-tolerant if you have cross-dependencies between AZs. I find their reporting accurate: an AZ outage, not a regional one, as per your post.
Simple thought experiment: realtime messaging system; servers in multiple AZs; user A is connected to endpoint 1 in AZ 1, user B is connected to endpoint 2 in AZ 2; A publishes a message on a channel that B is subscribed to. Then it is an error for B to not receive it. This makes network partitions (especially partial, asymmetric, or sporadic ones) a nontrivial problem. Of course there are solutions, but it's hardly as simple as "just make your app multi-AZ". Not every app is a bunch of independent boxes serving web pages.
Well, what is your SLO around message delivery? If it's "a successful response means a message will be delivered at least once and in under X time", then you need to verify that the message has been durably committed to multiple machines across your availability zones. If it's just that the message is durably committed, or the SLO on delivery is long enough, then you can drop the multi-AZ bit.
Yep. Non-quorum members can even simply terminate existing connections and refuse new connections from clients, so clients are always either connected to a quorum node or not connected at all (CP, no A)
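A bare-bones sketch of that behavior, just to make the CP-over-A choice concrete (all names hypothetical): the node drops its clients and refuses new ones whenever it is outside the quorum.

    class NodeConnectionGate:
        # Client-facing gate on a single cluster node.

        def __init__(self):
            self.in_quorum = False
            self.clients = set()

        def on_quorum_change(self, in_quorum):
            self.in_quorum = in_quorum
            if not in_quorum:
                # Terminate existing connections; clients retry against healthy nodes.
                for client in list(self.clients):
                    client.close()
                self.clients.clear()

        def on_new_connection(self, client):
            if not self.in_quorum:
                client.close()  # refuse connections while out of quorum
                return False
            self.clients.add(client)
            return True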
The point is, if a health check fails or there is an outage in AZ 2, then the ELB should be scripted to route to AZ 1 only (and AZ 3, if it exists).
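With the classic ELB API, that script can be a single call, roughly like this (the load balancer name and AZ are placeholders):

    import boto3

    elb = boto3.client("elb")  # classic ELB API

    LOAD_BALANCER = "my-load-balancer"   # placeholder ELB name
    BAD_AZ = "us-east-1a"                # the AZ to drain traffic from (assumed)

    # Stop routing to instances in the affected AZ; traffic shifts to the remaining enabled AZs.
    elb.disable_availability_zones_for_load_balancer(
        LoadBalancerName=LOAD_BALANCER,
        AvailabilityZones=[BAD_AZ],
    )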
He could've skipped the noob comment at the end, but I hope his comment doesn't get flagged. The author could've read some AWS docs in the time it took to write this post.
It might sound condescending, but I am merely stating facts and my opinion, which forms the basis of my version of the 'truth', just as the OP is doing. At least I did not call anyone a liar.
In a perfect world, this makes sense. For us, we had an ElastiCache Redis multi-AZ replication group whose primary was in the affected AZ. It refused to fail over, even though automatic failover was configured. Apparently there is no way to force this failover, even though people have been asking for such a feature for years. I'm curious what you would do in this situation, besides not using ElastiCache?