But it is going too far to call this a "lie" on the part of AWS. If you talk to any AWS customer support representative, or read their docs, they are very clear that the unit of availability for their service is a region, not a zone.
Notice that the status update said "connectivity issues in a single Availability Zone".
If you are deploying in a single AZ, and your app cannot tolerate the failure of at least one AZ, then you should not be telling your customers or your boss that your app is highly available. That's not the fault of AWS, it's how you're using AWS.
With that said, I do think that AWS could do two things: 1) maybe show a yellow warning sign instead of a blue "i" for something that borks an entire AZ, and 2) make it even clearer that a single AZ is not highly available on its own.
I get that having issues on AWS is irritating; I exist in that ecosystem too. But I really can't fault them for this, or claim that they're lying. AWS says not to rely on any one AZ being up and/or reachable, and yet you did. And the fact that it caused problems for you means you want them to say the entire region is down. Why? They make regions fault tolerant by having multiple AZs; they guarantee reliability at the region level, not at the AZ level, and that's what the status page is intended to track.
Now, I can see wanting a clear status page per AZ, rather than just a blue 'i'. That's a valid request. But -request- that, don't claim that they're lying. You're being antagonistic despite them doing everything they've promised, and their status page being correct (just not using the colors you would like because they view severity differently than you).
In terms of your cluster, it takes a lot of testing and tweaking to ensure your cluster can reach consensus during partial/sporadic partitions. But it can and should be done if you need high availability (e.g. take nodes out of the cluster if they keep disconnecting, until they have a stable connection for X minutes).
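To make the "take nodes out until they have been stable for X minutes" idea concrete, here is a minimal sketch of flap damping for cluster membership. Everything in it is an assumption for illustration: the FlapDamper class, its thresholds, and the on_connect/on_disconnect hooks are hypothetical and not part of any particular cluster manager. Timestamps are passed in explicitly so the logic is easy to test.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class NodeState:
    connected: bool = False
    stable_since: Optional[float] = None  # time of the most recent (re)connect
    quarantined: bool = False
    recent_disconnects: List[float] = field(default_factory=list)


class FlapDamper:
    """Hypothetical helper: quarantine nodes that keep disconnecting."""

    def __init__(self, max_flaps: int = 3, flap_window: float = 60.0,
                 stable_for: float = 300.0) -> None:
        self.max_flaps = max_flaps      # disconnects tolerated within flap_window
        self.flap_window = flap_window  # seconds over which disconnects are counted
        self.stable_for = stable_for    # seconds of uptime required to leave quarantine
        self.nodes: Dict[str, NodeState] = {}

    def on_connect(self, node: str, now: float) -> None:
        st = self.nodes.setdefault(node, NodeState())
        st.connected = True
        st.stable_since = now

    def on_disconnect(self, node: str, now: float) -> None:
        st = self.nodes.setdefault(node, NodeState())
        st.connected = False
        st.stable_since = None
        # Keep only disconnects that are still inside the counting window.
        st.recent_disconnects = [t for t in st.recent_disconnects
                                 if now - t <= self.flap_window]
        st.recent_disconnects.append(now)
        if len(st.recent_disconnects) >= self.max_flaps:
            st.quarantined = True

    def is_member(self, node: str, now: float) -> bool:
        """True if the node should currently take part in consensus."""
        st = self.nodes.get(node)
        if st is None or not st.connected or st.stable_since is None:
            return False
        if st.quarantined:
            if now - st.stable_since < self.stable_for:
                return False
            # Stable for long enough: let the node back in.
            st.quarantined = False
            st.recent_disconnects.clear()
        return True
```

A node that drops three times within a minute is excluded from membership until it has stayed connected for five minutes. The right thresholds depend heavily on the cluster and would need the testing and tweaking mentioned above.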
You are looking at the consequences of Amazon's employee culture. Metrics or die means the metrics might need to be fudged to keep your and your team's jobs.
It's normally used in the context of economic planning, but it's a law that "data-driven" engineering organizations ignore at their peril.
You also have to keep in mind that AWS status reports aren't just informational. They have a direct impact on how people use their service. As soon as AWS puts up a yellow dot, people start changing DNS and spinning up failovers in other regions or AZs, which has the potential to cascade into a much bigger failure when resources are unnecessarily tied up in other regions for something that maybe isn't as big a crisis as it sounds at first. This has happened before, and is part of the reason why AWS is so circumspect about announcing problems.
Point being that there are ways to architect around AZ and region failure that don't rely on trusting AWS's own reporting. Anyone with a significant investment in or dependency on AWS or any cloud service should not rely on those services' own indicators.
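One way to avoid depending on the provider's indicators is to track the results of your own probes per zone and make failover decisions from those. The sketch below is only an illustration under stated assumptions: the ZoneHealth class, the window size, and the thresholds are hypothetical, and the probes themselves (HTTP checks, TCP connects, whatever fits your app) are left to the caller, who reports each result via record().

```python
from collections import defaultdict, deque
from typing import Deque, Dict, Iterable, List


class ZoneHealth:
    """Hypothetical tracker: sliding window of your own probe results per zone."""

    def __init__(self, window: int = 20, max_failure_rate: float = 0.5,
                 min_samples: int = 5) -> None:
        self.max_failure_rate = max_failure_rate
        self.min_samples = min_samples
        # Each zone keeps only its most recent `window` probe results.
        self.results: Dict[str, Deque[bool]] = defaultdict(
            lambda: deque(maxlen=window))

    def record(self, zone: str, ok: bool) -> None:
        """Store the outcome of one probe against a zone."""
        self.results[zone].append(bool(ok))

    def failure_rate(self, zone: str) -> float:
        samples = self.results.get(zone)
        if not samples:
            return 0.0
        return samples.count(False) / len(samples)

    def is_healthy(self, zone: str) -> bool:
        samples = self.results.get(zone)
        # With too little data, assume healthy rather than fail over early.
        if samples is None or len(samples) < self.min_samples:
            return True
        return self.failure_rate(zone) <= self.max_failure_rate

    def healthy_zones(self, zones: Iterable[str]) -> List[str]:
        return [z for z in zones if self.is_healthy(z)]
```

For example, after recording five failed probes for one zone and five successful probes for another, healthy_zones() returns only the second zone. Requiring a minimum number of samples and a sustained failure rate is a deliberate choice here: it avoids the kind of hasty, cascading failover described a couple of comments up.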
All that said, the fact is that in a system as big as AWS there are minor problems going on all the time. It would be healthier for their customers, their PR image, and the response to bigger incidents for them to reveal more details showing that yeah, most of the time some tiny fraction of the system is broken, offline, in a failure state, or hitting some other unknown issue. 500 errors happen with the API sometimes; let's see streaming feeds of failure rates from however many endpoints there are. EC2 instances have hardware failures or need to reboot sometimes; let's see some indication of whether that's happening more or less often than normal. Network segments fail or get overloaded... Revealing some of the less pristine guts of the operation (without revealing sensitive details, of course) would go a long way toward being more honest without the bigger risks of saying EC2 in us-east-1 is down!
Twitter is usually the best place to watch if you think it's not you: search #aws or #{service_name} and view the "Live" tab to see if others are reporting the same trouble.
Calling attention to these failures via support ticket, via sales rep and even publicly via @ customer support heads has made zero difference. Here's hoping this blog has enough visibility to make a difference.
But when researchers come to me with AWS problems even though I've not changed anything, I usually spend 10 minutes on one of the nodes for some cursory investigation and if it's not obvious what the problem is I tell them to wait a day or two before further investigation. 95% of the time, the problem clears up on its own.
I've learned the hard way that spending half a day tracing SQS messages, DynamoDB tables, spot bidding, etc. is usually a waste of time.
That the AWS console flat-out lies like this wastes so much of everyone's time. It's not even hard to fix: AWS could run internal test cases on every subnet group.
It's frustrating when there's an issue with AWS and my clients tell me that their status page reports that everything is OK.
I've had about 4 or 5 serious AWS issues in the last 4 years (an issue which impacts multiple instances or services), and literally never had anything other than a green tick. Usually they don't even bother with the note.
The AWS service status page is completely worthless.
I would be happy if they just added a status indicator to each running node indicating that there's some local problem.
I am almost always confident the node itself is functional from my own metrics.
TL;DR: Don’t trust AWS status reports, they’re lies
