
Honest status reporting and AWS service status “truth” in a post-truth world - kiyanwang
https://blog.ably.io/honest-status-reporting-and-aws-service-status-truth-in-a-post-truth-world-8b9a31c8cc90#.hy68p11y2
======
Twirrim
This still fundamentally misses a key point:

AWS operates on a _huge_ scale. Mindbogglingly huge, in fact.

What you see as an outage, AWS might know to be a _single_ rack in a
_single_ AZ. One single rack out of hundreds of thousands in a single AZ
doesn't constitute an outage, and affects such a tiny percentage of customers
that there is absolutely no logical reason why they'd update their service
status dashboard.

Despite all the redundancy built into the system and all the protections they
can manage, these incidents are a common event, just by nature of the sheer
scale that AWS is operating at.

If AWS were to start surfacing these on their status page it would essentially
never leave Green-I state, except to occasionally dip into Yellow or Red.

Events that you and I, with smaller infrastructures, would think of as
one-in-a-million or even one-in-a-billion odds are an absolute certainty at
their scale. To give context, 3 years ago S3 announced they'd passed the
2 _trillion_ object territory, after having hit the 1 trillion object mark
4 years ago.
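
To put rough numbers on that intuition, here is a minimal Python sketch. It
assumes events are independent with a fixed per-item probability, which real
hardware failures are not, and the smaller fleet sizes are made up for
illustration. Only the 2 trillion figure comes from the S3 object count above.

```python
import math


def prob_at_least_one(p: float, n: int) -> float:
    """Chance that an event with per-item probability p happens at least
    once across n independent items, i.e. 1 - (1 - p)**n."""
    # log1p/expm1 keep the result accurate when p is tiny and n is huge.
    return -math.expm1(n * math.log1p(-p))


def expected_events(p: float, n: int) -> float:
    """Expected number of occurrences across n items."""
    return p * n


if __name__ == "__main__":
    one_in_a_million = 1e-6  # illustrative odds, not an AWS figure
    for n in (1_000, 1_000_000, 2_000_000_000_000):
        print(
            f"items={n:>17,}  "
            f"P(at least one)={prob_at_least_one(one_in_a_million, n):.6f}  "
            f"expected={expected_events(one_in_a_million, n):,.1f}"
        )
```

At a thousand items the event is rare; at trillions of items it is not a
question of whether it happens but of how many times.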

~~~
hueving
>and affects such a tiny percentage of customers that there is absolutely no
logical reason why they'd update their service status dashboard.

Tell that to the customers that experienced the outage. Look, if you don't
want to show a global status page, at least give customers their own status
page that is sensitive to outages for the equipment holding their
data/instances.

When an airline experiences a single flight cancellation, they don't notify
everyone in the world, but they certainly notify the people that are on that
specific flight.

This is just a cop-out for hoping to upset as few customers as possible by
hiding the fact that their service went out.

~~~
Twirrim
That's what the Personal Health Dashboard is trying to do:
[https://phd.aws.amazon.com](https://phd.aws.amazon.com)

------
jedberg
The service was operating normally, you just weren't architected to handle a
common failure case. Don't blame Amazon, they warned you in advance. They
specifically say that they don't consider it a problem if the problem is
isolated to one zone, because your app should be able to work across all zones
in a region seamlessly.

I'll be the first to tell you that their dashboard underestimates impact, but
in this case it was totally accurate.

The system was working normally, i.e. you could still use us-west-1 without
issue if you were in another zone.

~~~
snewman
I don't think this is a fair use of the word "normally".

The service may have been working _within SLA_. Based on the data in this
article, it was not working _the way it works most of the time_. I think most
people would interpret "normally" to refer to the latter.

Or, let's drop the word "normally" and take a step back. A good dashboard
should be designed to inform. Amazon's dashboard was not providing the
information that Ably or their customers needed. And this isn't an isolated
complaint; if I had a nickel for every time someone had complained about the
green mockery of the AWS health dashboard, I'd have... well, several dollars,
anyway.

~~~
nostrademons
I might agree that "most people" out of the general population would go for
the latter, but "most engineers" would not. As an engineer, 90% of my job
description is handling the cases where it's not working the way it works most
of the time. The case where everything works when all dependencies are doing
what they normally do is a prototype or an MVP.

Now, as a founder, my job is to build that prototype & MVP, but I'd expect
that as a CEO it would be to hire people to handle the other 90% of cases.

------
et-al
[slight political rant given the title and meme]

Perhaps the writer and co-founder of Ably used "post-truth" because he knows
Bezos is not a supporter of our current administration, but I wish he wouldn't
because it waters down the bald-faced lies of Kellyanne Conway, Sean Spicer,
and The Donald.

The missteps of the American government should not become as trivial and
normalised as a bodega not stocking organic almond milk.

------
joelhaasnoot
I think the main discrepancy is that you see networking in a single zone not
working as a catastrophic event, but AWS doesn't.

In all the AWS documentation you are reminded time and time again that if you
want any guarantee of availability: choose two zones. If you want more
availability: choose two regions. If you're operating globally, you probably
should anyway.

~~~
ajmurmann
It is of course true that the author should have deployed to two zones. That
said, you'd still notice something is going wrong and would need to identify
what's happening and whether you need to take any corrective action. Being
easily able to see that half your infrastructure is down and that you didn't
break it would save a lot of time and nerves in a potentially critical
situation. Unless you significantly overprovision, you will likely also need
to scale up the other zone. For that decision it's again important to quickly
know what's happening. Amazon's reporting helps nobody but Amazon: its own
reporting, probably its internal metrics, and likely bonuses.
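
The scale-up point can be made concrete with a small Python sketch. It assumes
instances are spread evenly across zones and that capacity is simply an
instance count; the function name and the numbers are hypothetical.

```python
def instances_to_add(required: int, zones: int, per_zone: int,
                     lost_zones: int = 1) -> int:
    """Instances to launch in the surviving zones so that `required`
    instances are still running after `lost_zones` zones disappear."""
    surviving = zones - lost_zones
    if surviving <= 0:
        raise ValueError("no surviving zones to scale up")
    remaining = surviving * per_zone  # capacity left after the failure
    return max(0, required - remaining)


if __name__ == "__main__":
    # Two zones with 5 instances each, 10 needed: losing one zone means
    # the other has to double.
    print(instances_to_add(required=10, zones=2, per_zone=5))
    # Three zones with 5 each, 10 needed: already overprovisioned enough
    # to absorb the loss of one zone.
    print(instances_to_add(required=10, zones=3, per_zone=5))
```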

~~~
jedberg
It's probably not wise to rely on your provider's monitoring for such critical
things. As AWS has proven, their alerts are slow, and even if they weren't,
they aren't monitoring the very specific thing you need.

Your best bet is to monitor your own systems, and have enough monitoring in
place to tell you that one zone is unavailable without having to rely on AWS
to tell you.
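
As a rough illustration of what "enough monitoring" could look like, here is a
minimal Python sketch that groups your own health-check results by zone and
flags zones where most checks fail. The zone labels, thresholds, and function
name are assumptions made up for this example; the probes themselves are
whatever you already run against your own hosts.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def unhealthy_zones(results: Iterable[Tuple[str, bool]],
                    min_samples: int = 3,
                    failure_threshold: float = 0.8) -> List[str]:
    """Return zones where at least `failure_threshold` of the health
    checks failed, ignoring zones with fewer than `min_samples` results."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for zone, ok in results:
        totals[zone] += 1
        if not ok:
            failures[zone] += 1
    return sorted(
        zone for zone, total in totals.items()
        if total >= min_samples and failures[zone] / total >= failure_threshold
    )


if __name__ == "__main__":
    # (zone, check_passed) pairs; zone labels are illustrative only.
    checks = [("us-west-1a", False)] * 4 + [("us-west-1b", True)] * 4
    print(unhealthy_zones(checks))
```

Requiring a minimum number of samples and a high failure ratio is one way to
avoid paging on a single flaky host while still catching a zone-wide problem.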

Their dashboard has no effect on their bonuses BTW (at least it didn't the
last time I asked), but it is slow to update because it is purposely gated by
a human so as not to cause false positives, and that human has to manually
verify the problem before reporting it, which takes time.

~~~
hueving
>It's probably not wise to rely on your provider's monitoring for such
critical things. As AWS has proven, their alerts are slow, and even if they
weren't, they aren't monitoring the very specific thing you need.

In this particular case they are not monitoring that they lost an entire zone?
This seems like a cop-out to cover up for the fact that they didn't reveal a
major outage that they most definitely monitor internally.

------
Ysx
Frustration is understandable, though reliance on a single AZ is a known and
well-documented failure mode. You have to share the blame.

[http://d0.awsstatic.com/whitepapers/architecture/AWS_Well-Ar...](http://d0.awsstatic.com/whitepapers/architecture/AWS_Well-Architected_Framework.pdf)

"Best practices: Multi-AZ / Region. Distribute application load across
multiple Availability Zones / Regions"

~~~
duncan_bayne
No idea why this is being downvoted. If a single AZ goes down and your app
goes with it, then you haven't deployed your app in line with Amazon
guidelines.

~~~
GalacticDomin8r
Maybe because this has nothing to do with availability zones and it's
specifically to do with AWS mis/under-reporting status.

What AZ were they supposed to failover to? Another one reporting green?

~~~
andrewguenther
The dashboard shows region health, not AZ health. You don't even need cross-
region deployment, just multi-AZ in the same region. The issue was isolated to
a single AZ, not an entire region.

~~~
gkop
This is a strawman. We're not demanding red, but seeking yellow or orange or
some other indicator that captures the impact. Absolutely not green.

------
cjhanks
In my experience I have encountered 3 different types of outages:

1\. Multiple availability zones in a single region are experiencing reduced
performance/outage.

2\. A single availability zone is experiencing reduced performance/outage.

3\. A single service (bound to your account) is having reduced performance.

The dashboard displays item 1, and very rarely item 2. It never displays item
3, and that's the one most people actually care about.

Many AWS services appear to be siloed into user accounts, such as S3 or
DynamoDB. Sometimes the systems comprising your storage might be severely
degraded (due to poor access patterns or simply random statistical degradation).
In such cases users have _absolutely no way of being notified_ about degraded
performance unless they have built metrics to monitor their services.

For example, I once wrote a program whose access calls to S3 were essentially
destroying my performance and reliability. I had not instrumented my code to
report fairly low-level information (number of TCP-RST, authentication
negotiation time, packet loss). In fact, most of the low-level metrics are
hidden so deep in their SDKs that instrumenting logging is itself a major
engineering effort.

I think 99% of customers would be happy if there was a simple REST endpoint to
query how _my_ services are running. Fidelity could even be something like 15
minute accuracy.

    
    
        GET:  /MyStatus
        {
            "S3": [
                {
                    "BucketName": "...",
                    "Status": "DEGRADED",
                    "FailingOperations": [
                        "ListMultipartUpload",
                        "CreateMultipartUpload"
                    ]
                }
            ],
            "EC2": [],
            ...
        }
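
A client for such an endpoint could be very small. The Python sketch below
assumes the response is valid JSON shaped like the example above; the
`/MyStatus` endpoint does not exist, and the `"OK"` status value and the
bucket names are hypothetical.

```python
import json
from typing import Any, Dict, List


def degraded_resources(status_json: str) -> Dict[str, List[Dict[str, Any]]]:
    """Map each service name to its resources reporting Status == "DEGRADED".
    Services with nothing degraded are left out of the result."""
    status = json.loads(status_json)
    degraded = {}
    for service, resources in status.items():
        bad = [r for r in resources if r.get("Status") == "DEGRADED"]
        if bad:
            degraded[service] = bad
    return degraded


if __name__ == "__main__":
    # Hand-written sample payload standing in for the hypothetical response.
    sample = json.dumps({
        "S3": [
            {"BucketName": "example-bucket", "Status": "DEGRADED",
             "FailingOperations": ["ListMultipartUpload"]},
            {"BucketName": "other-bucket", "Status": "OK",
             "FailingOperations": []},
        ],
        "EC2": [],
    })
    for service, items in degraded_resources(sample).items():
        for item in items:
            print(service, item["BucketName"], item["FailingOperations"])
```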

~~~
cyberferret
CloudWatch?

~~~
cjhanks
CloudWatch only tells you a small part of the story. AWS has much richer
statistics internally, which they do not expose, that isolate _why_ issues
are happening.

------
CaliforniaKarl
This article was posted and discussed here around the time it came out.

Here's a link to the previous discussion:
[https://news.ycombinator.com/item?id=13615198](https://news.ycombinator.com/item?id=13615198)

------
dogecoinbase
This gets at the heart of the biggest problem with AWS status reporting -- if
you're going to build a platform for people to build atop, they need to be
able to pass through your errors to their customers in a reliable, visible,
honest way.

Though to be fair, most sufficiently popular projects don't even need a real
status page. One which simply reported its own traffic volume would suffice to
know if the service itself is down (crowdsource your status!).
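
A traffic-based status check along those lines might look like the following
Python sketch. The thresholds, status labels, and function name are
assumptions picked for illustration, and a real baseline would have to
account for daily and weekly traffic cycles.

```python
from statistics import median
from typing import Sequence


def traffic_status(baseline_rpm: Sequence[float], current_rpm: float,
                   degraded_ratio: float = 0.7,
                   down_ratio: float = 0.2) -> str:
    """Classify current requests-per-minute against the median of recent
    baseline samples. Thresholds are illustrative, not tuned values."""
    if not baseline_rpm:
        raise ValueError("need at least one baseline sample")
    typical = median(baseline_rpm)
    if typical <= 0:
        return "UNKNOWN"  # no meaningful baseline to compare against
    ratio = current_rpm / typical
    if ratio < down_ratio:
        return "LIKELY_DOWN"
    if ratio < degraded_ratio:
        return "DEGRADED"
    return "NORMAL"


if __name__ == "__main__":
    recent = [100.0, 110.0, 90.0]  # made-up baseline samples
    for now in (95.0, 50.0, 10.0):
        print(now, traffic_status(recent, now))
```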

~~~
salesguy222
I hypothesize that as with many things in the business world, anything that
would hurt stock price must be avoided, especially in the short term.

So, I picture some rich business executive, technical or not, who ultimately
would decide if there should be a status page and how it would look, saying to
themself, "Why would we tell the whole world our service is down? This would
cause panic among everyone, rather than just annoyance to those who truly
care."

Of course it becomes an even bigger problem at that point to blatantly say
"ALL SYSTEMS GO" with green checks, when that clearly isn't the case.

So, Amazon, like any other multi multi multi billion dollar business empire,
clouds the water and seeks to control the perspective. "This wasn't downtime,
as per your SLA. It was inability for some to access our servers, which were
powered on the whole time!"

Anything to shore up stock price is to be pursued. Anything to bring it down
is to be avoided.

------
logicallee
Someone needs to go take one for the team and go through the hassle of a petty
lawsuit (which will be an open and shut case in front of any judge), and the
judge will find that if you're a $400bn operations company and have a status
page, it needs to be accurate. This is simple, like a judge forcing you to
turn off your "we're always open! Yes, right now! Come on in" giant flashing
billboard, whenever you close.

It's ridiculous that people need to check a $7 server that shows if amazon is
down or not. Someone go through this petty lawsuit and force them to sort
their shit.

If you're amazon, sort your shit or you deserve the huge legal damages of
making someone take one for the team and actually go through with a lawsuit.
This isn't even a grey area.

It's trivial for you to report this accurately. If you have a page purporting
to do so...do so.

~~~
logicallee
Why is this being downvoted? There is a status page advertising "our services
are up", when they have the information that this is false.

I can't think of a great (tight) analogy, but McDonald's obviously can't put
"Only 100 Calories!" on their advertisement for Big Macs. Why not? Because
it's false...

Companies can't just say whatever they want when they have information to the
contrary. In this case other businesses are impacted.

I think it's not a hard argument to make.

------
gkop
Has anyone else noticed that in the last year, services have begun _naming_
upstream vendors on their status pages when service is disrupted? In other
words, saying "Our service is down because AWS is disrupted" rather than "Our
service is down because our upstream infrastructure vendor is disrupted".

Naming the vendor during an availability incident is un-classy, I'd even say
childish, yet it seems to have rapidly become a mainstream practice in our
field :/ In the case of naming Amazon, practices of theirs such as those
described in the article bring this on themselves, since we are only human,
and we resent Amazon's dishonesty. Still, we should switch away from dishonest
vendors rather than use them as scapegoats when our systems fail. (my point
does not apply to postmortems, where every detail counts and is useful,
including the specific vendors involved)

Example (singling out a vendor that I otherwise think is great, sorry
Filestack!):
[http://status.filestack.com/incidents/lzl06yzw1jq5](http://status.filestack.com/incidents/lzl06yzw1jq5)

What do we stand to gain by naming the upstream vendor? (genuinely curious)

