
Honest status reporting and AWS service status “truth” - matt_oriordan
https://blog.ably.io/honest-status-reporting-and-aws-service-status-truth-in-a-post-truth-world-8b9a31c8cc90
======
apeace
I feel for the OP, I really do. Have been through this many times before.

But it is going too far to call this a "lie" on the part of AWS. If you talk
to any AWS customer support representative, or read their docs, they are very
clear that the unit of availability for their service is a _region_ , not a
_zone_.

Notice that the status update said "connectivity issues in a single
Availability Zone".

If you are deploying in a single AZ, and your app is not tolerant to at least
one AZ failure, then you should not be telling your customers/boss that your
app is highly available. That's not the fault of AWS, it's how you're using
AWS.

With that said, I do think that AWS could do two things: 1) maybe show a
yellow warning sign instead of a blue "i", for something that borks an entire
AZ, and 2) make it even more clear that each AZ is not high-availability.

~~~
franciskim
Yeah, I was about to mention multi-AZ load balancing too. Can't believe a real-
time messaging platform doesn't have that.

~~~
matt_oriordan
Well, it's not that simple. We do run in multiple availability zones in every
region. But if the connectivity between them is only partially working, which it
was, and shared services from Amazon itself aren't working fully from every
instance, you have a huge mess to contend with where cluster consensus
cannot be formed. So in cases like this we did what we should have done and
routed traffic away from a network that was unreliable and partly partitioned.
The point for us was not that this availability zone went down at all. It was
that Amazon throughout claimed everything was operating normally for hours
when this was very far from the truth.

~~~
lostcolony
So one availability zone went down, not the region, which Amazon indicated on
their status page (which is clearly set up to predominantly display outages of
the region, not of a single AZ within the region), and because of how your
system was set up to be dependent on each AZ being operational and networks to
not partition, it caused issues.

I get that having issues on AWS is irritating; I exist in that ecosystem too.
But...I really can't fault them for this, or claim that they're lying. AWS
says to not rely on any one AZ being up and/or reachable, and yet you did. And
the fact it caused problems for you means you want them saying the entire
region is down. Why? They make regions be fault tolerant by having multiple
AZs; they guarantee reliability at the region level, not at the AZ level, and
that's what the status page is intended to track.

Now, I can see wanting a clear status page per AZ, rather than just a blue
'i'. That's a valid request. But -request- that, don't claim that they're
lying. You're being antagonistic despite them doing everything they've
promised, and their status page being correct (just not using the colors you
would like because they view severity differently than you).

------
nhumrich
I have worked at Amazon, and I can validate this. When I was on an AWS team,
posting a "non-green" status to the status page was actually a manager
decision. That means that it is in no way real time, and it's possible the
status might say "everything is ok" when it's really not, because a manager
doesn't think it's a big enough deal. There is also a status called green-i,
which is the green icon with a little 'i' information bubble. Almost every time
something was seriously wrong, our status was green-i, because the "level" of
warning is ultimately a manager decision. So they avoid yellow and red as much
as possible, and the statuses aren't very accurate. That being said, if you do
see a yellow or red status on AWS, you know things are REALLY bad.

~~~
toomuchtodo
Devops here who has to deal with the AWS status page frequently. Is someone's
bonus or performance metrics tied to how often a service goes to "non-green"
on the status page?

~~~
nhumrich
Not really. But the manager IS held responsible if they go green-i too often,
as sometimes having a bad status can trigger AWS to give credits back to
certain customers.

------
kevin_b_er
Note that if they did show non-green, they might have to credit you under their
SLAs. There's likely a strong internal disincentive to let a problem cause a
non-green status, because it is a metric that costs Amazon money: Amazon
gives service credits for reduced uptime. As Amazon employees are known for
being very, very harshly judged on metrics, lying on the status page is thus
incentivized.

You are looking at the consequences of Amazon's employee culture. "Metrics or
die" means the metrics might need to be fudged to keep your and your team's
jobs.

~~~
Analemma_
"When a measure becomes a target, it ceases to be a good measure." \-
[https://en.wikipedia.org/wiki/Goodhart%27s_law](https://en.wikipedia.org/wiki/Goodhart%27s_law)

It's normally used in the context of economic planning, but it's a law that
"data-driven" engineering organizations ignore at their peril.

------
hueving
AWS is massively incentivized to lie on the status pages until the outages
become egregious. It's literally a page for marketing to point to when
describing how reliable the service is.

~~~
csallen
This. Amazon needs to restructure its teams' incentives so that whoever is
responsible for status reporting is in no way beholden to marketing. Of course,
Amazon itself has no incentive to do this, which is why it's important for
developers like ourselves to raise a big enough stink.

------
skywhopper
I agree that AWS could improve their openness around service problems. That
said, what impacts some customers doesn't impact all customers. Ably may be
offline when AWS has certain types of trouble, but that doesn't mean everyone
with services in that AZ or region is having similar problems.

You also have to keep in mind that AWS status reports aren't just
informational. They have a direct impact on how people use their service. As
soon as AWS puts up a yellow dot, people start changing DNS and spinning up
failovers in other regions or AZs, which has the potential to cascade into a
much bigger failure when resources are unnecessarily tied up in other regions
for something that maybe isn't as big a crisis as it sounds at first. This has
happened before, and is part of the reason why AWS is so circumspect about
announcing problems.

Point being that there are ways to architect around AZ and region failure that
don't rely on trusting AWS's own reporting. Anyone with a significant
investment in or dependency on AWS or any cloud service should not rely on
those services' own indicators.
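
That approach can be sketched in a few lines: probe your own service in each region and drive failover from first-party results rather than the provider's status page. This is only an illustrative sketch; the endpoint URLs below are placeholders, and a real deployment would tie this into DNS or load-balancer weights.

```python
import urllib.request

# Placeholder endpoints: one health-check URL per region you deploy to.
REGION_ENDPOINTS = {
    "us-east-1": "https://us-east-1.example.com/health",
    "us-west-2": "https://us-west-2.example.com/health",
    "eu-west-1": "https://eu-west-1.example.com/health",
}

def region_is_healthy(url, timeout=5):
    """A region counts as healthy only if our own service answers 200 there."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # DNS failure, connection refused, timeout, HTTP error
        return False

def healthy_regions(endpoints=REGION_ENDPOINTS):
    """Decide routing from first-party probes, not the provider's status page."""
    return [region for region, url in endpoints.items() if region_is_healthy(url)]
```

The key design point is that the health signal comes from your own stack end to end, so it reflects what your customers actually experience, whatever color the provider's dashboard shows.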

All that said, the fact is that in a system as big as AWS there are minor
problems going on all the time. It would be healthier for their customers,
their PR image, and the response to bigger incidents for them to reveal more
details showing that yeah, most of the time some tiny fraction of the system
is broken, offline, in a failure state, or has some other unknown issue. 500
errors happen with the API sometimes; let's see streaming feeds of failure
rates from however many endpoints there are. EC2 instances have hardware
failures or need to reboot sometimes; let's see some indication of whether
that's happening more or less often than normal. Network segments fail or get
overloaded... Revealing some of the less pristine guts of the operation
(without revealing sensitive details, of course) would go a long way toward
being more honest, without the bigger risks of saying EC2 in us-east-1 is down!

------
gommm
I learned long ago that the status page from AWS is useless: it usually only
updates 30-40 minutes after the problem starts (by which time you've already
noticed it), and they will always minimize the problem on the status page...

It's frustrating when there's an issue with AWS and my clients tell me that
their status page reports that everything is OK.

------
CoolGuySteve
I've had this happen to me multiple times. Luckily I use AWS for batch
research jobs, so I'm not losing any money when things break.

But when researchers come to me with AWS problems even though I've not changed
anything, I usually spend 10 minutes on one of the nodes for some cursory
investigation and if it's not obvious what the problem is I tell them to wait
a day or two before further investigation. 95% of the time, the problem clears
up on its own.

I've learned the hard way that spending half a day tracing SQS messages,
dynamo tables, spot bidding, etc is usually a waste of time.

That the AWS console flat-out lies like this wastes so much of everyone's
time. It's not even hard to fix: AWS could run internal test cases on every
subnet group.

------
exratione
A while back, after some months of frustration with significant levels of S3
API failures in bulk that never made it to the status page, I ended up writing
a tool that continually monitors S3 by making random requests and alerts if
too many fail per unit time. That experiment showed that there are a lot of
significant spikes in S3 API failure rates that go entirely unannounced by
Amazon, though the situation has improved considerably in the past year.
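
The core of such a canary fits in a few lines. Here is a rough stdlib-only sketch of the same idea; the canary URL, threshold, and window size are made-up placeholders (not from the original tool), and a real version would probe with authenticated S3 calls via boto3 rather than a public object:

```python
import time
import urllib.request

# Illustrative placeholders, not values from the original monitoring tool.
CANARY_URL = "https://my-canary-bucket.s3.amazonaws.com/canary.txt"
FAILURE_THRESHOLD = 0.05   # alert if more than 5% of probes fail in a window
WINDOW_SIZE = 60           # number of probes kept in the sliding window

def probe_once(url=CANARY_URL):
    """Fetch a small known object; treat any HTTP or network error as failure."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except OSError:  # covers URLError, HTTPError, timeouts
        return False

def failure_rate(results):
    """Fraction of failed probes in the current window."""
    return 1 - sum(results) / len(results)

def monitor(probe=probe_once, interval=1.0):
    """Probe continuously and alert when the windowed failure rate spikes."""
    results = []
    while True:
        results.append(probe())
        results = results[-WINDOW_SIZE:]   # keep only the sliding window
        if len(results) == WINDOW_SIZE and failure_rate(results) > FAILURE_THRESHOLD:
            print(f"ALERT: S3 failure rate {failure_rate(results):.0%}")
        time.sleep(interval)
```

The sliding-window rate is what makes this useful: a single failed request is noise, but a sustained spike above your own baseline is exactly the kind of incident the official status page tends not to show.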

------
jwildeboer
But AFAICS they didn't contact AWS support? That would be my very first step,
especially when running customer code/services ...

~~~
empath75
If they're running a status page, you shouldn't have to contact support to get
the status. It's pretty well known among heavy aws users that their status
page is bullshit.

------
web007
AWS status page is a point of contention within the support teams as well.
I've contacted them for problems when the status page says "everything's fine"
and they have known outages. There is ALWAYS a lag between them knowing and
anything being reported, usually on the order of 15-30 minutes. Most of the
time their response is "Yes, there's a problem. I don't know why the status
page isn't updated yet, it would really help us (support) out if the page was
accurate."

Twitter is usually the best place to watch if you think it's not you, search
#aws or #{service_name} and view the "Live" tab to see if others are reporting
the same trouble.

Calling attention to these failures via support ticket, via sales rep and even
publicly via @ customer support heads has made zero difference. Here's hoping
this blog has enough visibility to make a difference.

------
ReidZB
The "real AWS status" extension for Chrome [1] pokes a little bit of fun at
this. Basically, it hides all the regular green checkmarks (unless they are a
'Resolved' issue), then it increments the status images (green with I becomes
yellow, yellow becomes red). It also makes the legend at the bottom snarky.

Although it's really meant as a joke, I think, it's actually really useful. It
removes the noise and, frankly, the incremented status images are usually
quite accurate for us.

[1] [https://github.com/josegonzalez/real-aws-status](https://github.com/josegonzalez/real-aws-status)

------
throwaway2016a
Related to this:

My favorite is "We are experiencing an elevated level of errors," which almost
always translates to "This service is completely down!"

Although, with that said, they are not lying: 100% failure is "elevated" from
0%.

------
benmmurphy
they really need two dashboards. one for people that want something useful and
one for people that want to be lied to.

~~~
CoolGuySteve
A lot of the time, service outages seem localized to specific datacenters and
subnets, or some relatively small subset of nodes. I suspect even Amazon
doesn't know there's a problem sometimes.

I would be happy if they just added a status indicator to each running node
indicating that there's some local problem.

~~~
amichal
They do... it's called CloudWatch.
[https://aws.amazon.com/cloudwatch/details/#amazon-ec2-monitoring](https://aws.amazon.com/cloudwatch/details/#amazon-ec2-monitoring)

~~~
CoolGuySteve
That's not quite what I want though, what I want to know is if the
SQS/Dynamo/etc _server_ is functioning correctly for that node.

I am almost always confident the node itself is functional from my own
metrics.

~~~
ckozlowski
You might be interested in the Personal Health Dashboard then:

[https://aws.amazon.com/premiumsupport/phd/](https://aws.amazon.com/premiumsupport/phd/)

------
devy
This is definitely not unique to AWS; in my experience GCP has the same
problem. We sometimes had Sentry alerts for hundreds of DB connection errors
coming from the web app in a short window of 5 minutes, and the GCP dashboard
would still show green the entire time. Disclaimer: I have worked with GCP
for 2 years and AWS for 8+ years.

------
65827
Seen this a few times myself: weird network outages that AWS refused to
acknowledge.

------
rawat81
Damn lies about the number of 9s. I face similar issues with Google Cloud on a
daily basis with the APIs, and their standard response is to go and buy the
next support level before they can look at the problem.

These companies claim to have state-of-the-art technology, whereas in reality
their customers inform them about outages and performance problems.

------
patmcguire
The exhausting reality is that in the age of widely distributed services,
everything is always broken enough that someone's use case is going to choke
and die. When the best case involves total failure for a small percentage of
people, what does green even mean? The author had their own monitoring; that's
the only workable approach.

------
ccvannorman
This literally happened to me for the first time this week after years with
AWS! My site was totally down for _6 hours_, but AWS still reported everything
as green and no notifications were sent.

Any suggestions for a very simple 3rd party "check" that will constantly
monitor my site and alert me (email/text) when it's unresponsive?

~~~
amichal
[https://www.pingdom.com/](https://www.pingdom.com/) is often my backup
monitoring service (we use CloudWatch and Nagios for real things)
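
If you'd rather roll your own, a minimal external check is only a few lines, run from cron on a machine outside AWS. This is just a sketch: the site URL, SMTP host, and addresses below are all placeholders you'd replace with your own.

```python
import smtplib
import urllib.request
from email.message import EmailMessage

# All values below are placeholders: substitute your own site and SMTP details.
SITE_URL = "https://example.com/health"
SMTP_HOST = "localhost"
ALERT_FROM = "monitor@example.com"
ALERT_TO = "you@example.com"

def site_is_up(url=SITE_URL, timeout=10):
    """Return True only if the site answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # DNS failure, refused connection, timeout, HTTP error
        return False

def send_alert(smtp_host=SMTP_HOST):
    """Email an alert via an SMTP relay (swap in SMS or a pager as needed)."""
    msg = EmailMessage()
    msg["Subject"] = f"ALERT: {SITE_URL} is unresponsive"
    msg["From"] = ALERT_FROM
    msg["To"] = ALERT_TO
    msg.set_content("External health check failed: the site did not return 200.")
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)

def check_and_alert():
    """Run this from cron every minute or so, from outside AWS."""
    if not site_is_up():
        send_alert()
```

The important part is where it runs: a check hosted inside the same AWS region it's monitoring can go down with the site it's supposed to watch.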

------
billhathaway
I wish AWS had both high-level and drill-down status pages where it would post
API success rates or equivalent detailed information, like GitHub's status
page does [0].

[0] [https://status.github.com/](https://status.github.com/)

------
user5994461
Short version: AWS doesn't give any service status.

They have a status page, but everything is always "green". It would take half
the internet going down before they move something to "not green (but not red
either)".

------
logicallee
A while ago I came up with this handy status page.

[http://i.imgur.com/adl7Yc3.png](http://i.imgur.com/adl7Yc3.png)

Which should work in 100% of cases - you can just serve it statically.

------
alainv
This is unfortunately not a new issue: [https://github.com/josegonzalez/real-aws-status](https://github.com/josegonzalez/real-aws-status)

------
paulddraper
My coworker has worked a lot with AWS:

"When Armageddon strikes and all of humanity is extinct, the last one to
survive has to switch the AWS status flag to 'Red'"

------
Dinius

        TL;DR: Don’t trust AWS status reports, they’re lies
    

Or as some would say, alternative facts.

