Honest status reporting and AWS service status “truth” in a post-truth world (ably.io)
79 points by kiyanwang 217 days ago | 35 comments

This still fundamentally misses a key point:

AWS operates on a huge scale. Mindbogglingly huge, in fact.

What you see as an outage, AWS might be aware of it being a single rack in a single AZ. One single rack out of hundreds of thousands in a single AZ doesn't constitute an outage, and affects such a tiny percentage of customers that there is absolutely no logical reason why they'd update their service status dashboard.

Despite all the redundancy built in to the system and all the protections they can manage, these incidents are a common event, just by nature of the sheer scale that AWS is operating at.

If AWS were to start surfacing these on their status page, it would essentially never leave the Green-I state, except to occasionally dip into Yellow or Red.

Things that you and I, with smaller infrastructures, would think of as one-in-a-million, or even one-in-a-billion, odds are an absolute certainty at their scale. For context, 3 years ago S3 announced they'd passed 2 trillion objects, after having hit the 1 trillion object mark 4 years ago.

>and affects such a tiny percentage of customers that there is absolutely no logical reason why they'd update their service status dashboard.

Tell that to the customers that experienced the outage. Look, if you don't want to show a global status page, at least give customers their own status page that is sensitive to outages for the equipment holding their data/instances.

When an airline experiences a single flight cancellation, they don't notify everyone in the world, but they certainly notify the people that are on that specific flight.

This is just a cop-out for hoping to upset as few customers as possible by hiding the fact that their service went out.

That's what the Personal Health Dashboard is trying to do: https://phd.aws.amazon.com

The service was operating normally, you just weren't architected to handle a common failure case. Don't blame Amazon, they warned you in advance. They specifically say that they don't consider it a problem if the problem is isolated to one zone, because your app should be able to work across all zones in a region seamlessly.

I'll be the first to tell you that their dashboard underestimates impact, but in this case it was totally accurate.

The system was working normally -- i.e., you could still use us-west-1 without issue if you were in another zone.

I don't think this is a fair use of the word "normally".

The service may have been working within SLA. Based on the data in this article, it was not working the way it works most of the time. I think most people would interpret "normally" to refer to the latter.

Or, let's drop the word "normally" and take a step back. A good dashboard should be designed to inform. Amazon's dashboard was not providing the information that Ably or their customers needed. And this isn't an isolated complaint; if I had a nickel for every time someone had complained about the green mockery of the AWS health dashboard, I'd have... well, several dollars, anyway.

I might agree that "most people" out of the general population would go for the latter, but "most engineers" would not. As an engineer, 90% of my job description is handling the cases where it's not working the way it works most of the time. The case where everything works when all dependencies are doing what they normally do is a prototype or an MVP.

Now, as a founder, my job is to build that prototype & MVP, but I'd expect that as a CEO it would be to hire people to handle the other 90% of cases.

I would tend to agree. This quote from the epilogue drove home that the author doesn't think that's reasonable though:

> However, I still do not believe that if one out of three availability zones in a region is practically offline that is reasonable to report that region as healthy.

I can agree that it should be easier to see and understand that there are zonal issues, but I can't agree that the region was unhealthy. Now if 2 out of 3 zones were down, that's a real concern (you can't run a highly-available service anymore).

Said another way: what if there were 20 zones in the region and only one were down?
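To make that judgment call concrete, here is a minimal sketch of a thresholded status policy. Everything here is hypothetical -- this is not AWS's actual dashboard logic, just one way a provider could map zone-level impairment to a region color so a 1-of-3 zone outage is at least visible:

    def region_status(total_zones: int, impaired_zones: int) -> str:
        """Hypothetical policy: map impaired-zone count to a dashboard color."""
        if impaired_zones == 0:
            return "green"
        # A minority of zones down: redundancy is degraded but the region works.
        if impaired_zones < total_zones / 2:
            return "yellow"
        # Half or more zones down: a highly-available deployment is no longer possible.
        return "red"

Under this policy, 1 zone down out of 3 (or out of 20) shows yellow rather than green, which is the indicator the comments above are asking for, while the region only goes red once HA deployment is genuinely impossible.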

Disclosure: I work on Google Cloud.

I've had this question for a while: do we know there are only 2-3 zones per region, or might there be 10-20 and, for example, my zone 'a' is your zone 'c', and neither of us has access to the other's zone 'b' (yours too old/loaded to be assigned to me; mine too new for you)?

Sort of. The "zones" are just logical constructs. They can be one building or many tied together with fat fiber links. And you're right that my "A" and your "A" may be different. They are randomly assigned the first time you launch a resource.

Also, there are decommissioned zones that new accounts can't see but old ones can if they've ever launched something there before (if they haven't, then that zone letter will be reassigned to a new zone with no one the wiser). Typically they won't let you launch new things in that zone, so it's in your best interest to move out of it, lest your resources get out of balance.

Edit: Oh and old accounts can always see new zones as far as I know. They just show up as new letters. But it's rare that they add new zones -- usually they just make old zones bigger.

Right. In some set of slides, James Hamilton has certainly said at least how many buildings there are for us-east-1, and it's a lot more than the up-to-four zones people see (IIRC, >26). Most customers will never notice the randomization of building => zone, but every once in a while you rely on a vendor like MongoLabs and it's important you are both speaking about the same us-east-1b. The AWS folks let you tie your accounts together in case you have multiple projects that need identical mappings.

us-east-1 actually has five zones, but one of them was decommissioned many years ago, so unless your account is very old, you can only see four.

As far as I know, that's the only zone that's ever been fully decommissioned.

[slight political rant given the title and meme]

Perhaps the writer and co-founder of Ably used "post-truth" because he knows Bezos is not a supporter of our current administration, but I wish he wouldn't, because it waters down the bald-faced lies of Kellyanne Conway, Sean Spicer, and The Donald.

The missteps of the American government should not become as trivial and normalised as a bodega not stocking organic almond milk.

I think the main discrepancy is that you see networking in a single zone not working as a catastrophic event, but AWS doesn't.

In all the AWS documentation you are reminded time and time again that if you want any guarantee of availability: choose two zones. If you want more availability: choose two regions. If you're operating globally, you probably should anyway.

It is of course true that the author should have deployed to two zones. That said, you'd still notice something going wrong and need to identify what's happening and whether to take any corrective action. Being able to see easily that half your infrastructure is down, and that you didn't break it, would save a lot of time and nerves in a potentially critical situation. Unless you significantly overprovision, you'll likely also need to scale up the other zone, and for that decision it's again important to quickly know what's happening. Amazon's reporting helps nobody but Amazon: it flatters their public uptime numbers and probably their internal metrics, and likely bonuses.

It's probably not wise to rely on your provider's monitoring for such critical things. As AWS has proven, their alerts are slow, and even if they weren't, they aren't monitoring the very specific thing you need.

Your best bet is to monitor your own systems, and have enough monitoring in place to tell you that one zone is unavailable without having to rely on AWS to tell you.
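As a sketch of that idea: aggregate your own health probes per zone and flag a zone as down when nearly all recent probes fail. The zone names, sample shapes, and 90% threshold below are illustrative assumptions, not anything AWS provides:

    def down_zones(probes: dict, threshold: float = 0.9) -> list:
        """Return zones where at least `threshold` of recent probes failed.

        `probes` maps a zone name to a list of booleans (True = probe succeeded).
        """
        down = []
        for zone, results in probes.items():
            if results and sum(1 for ok in results if not ok) / len(results) >= threshold:
                down.append(zone)
        return down

    # Probes against one instance per zone over the last few minutes:
    samples = {
        "us-west-1a": [False] * 10,           # every probe failed: zone looks down
        "us-west-1b": [True] * 9 + [False],   # one transient failure: zone is fine
    }

The point is that this signal comes from your own vantage point, so it fires regardless of whether the provider's dashboard has been updated yet.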

Their dashboard has no effect on their bonuses BTW (at least it didn't the last time I asked), but it is slow to update because it is purposely gated by a human so as not to cause false positives, and that human has to manually verify the problem before reporting it, which takes time.

>It's probably not wise to rely on your provider's monitoring for such critical things. As AWS has proven, their alerts are slow, and even if they weren't, they aren't monitoring the very specific thing you need.

In this particular case they are not monitoring that they lost an entire zone? This seems like a cop-out to cover up for the fact that they didn't reveal a major outage that they most definitely monitor internally.

We use three availability zones. Unfortunately that was not the point of the article. The point was one AZ was effectively completely down, yet AWS was reporting the entire region as healthy without any warnings.

Joel, the point of my article had nothing to do with architecture and using multiple AZs. Of course we use a minimum of two AZs and often three in each region. The problem is that AWS continued to report the entire region as completely healthy in spite of an entire AZ being down, and worse, continued to route traffic from an ELB in that AZ to that unhealthy AZ data center.

Do you have health checks implemented correctly? ELB isn't magic: if the ELB is running and your instance isn't, the health check needs to return that your application is unhealthy.
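A minimal sketch of such a "deep" health check, with hypothetical placeholder check callables: whatever endpoint the ELB polls should return non-200 whenever the application can't actually do useful work, not merely when the process is dead:

    def health_status(checks):
        """Run each dependency check; return (http_status, per-check results)."""
        results = {}
        for name, check in checks.items():
            try:
                results[name] = bool(check())
            except Exception:
                results[name] = False  # a crashing check counts as unhealthy
        return (200 if all(results.values()) else 503), results

    # Wire this to whatever path the ELB health check polls, e.g. /healthz.
    status, detail = health_status({
        "database": lambda: True,  # placeholder for a real DB ping
        "cache": lambda: True,     # placeholder for a real cache ping
    })

If any dependency check fails or raises, the endpoint returns 503 and the ELB takes the instance out of rotation, instead of routing traffic to an instance that can't serve it.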

Frustration is understandable, though reliance on a single AZ is a known and well-documented failure mode. You have to share the blame.


"Best practices: Multi-AZ / Region. Distribute application load across multiple Availability Zones / Regions"

This article was NOT about reliance on a single AZ. That was not the point, as we use three AZs in most AWS regions, and at least two in every region. The point was simply that an entire AZ was down and Amazon continued to report the entire region as completely healthy throughout.

So this is an AWS architecture guide that, on page 55 (of 77), has a bullet point pertaining to applications with high-availability requirements, mentioning that you can increase availability by going multi-AZ.

This is pretty far from AWS clearly communicating from the outset that a single AZ would be generally poor practice or have poorer availability than a server in a normal datacenter.

No idea why this is being downvoted. If a single AZ goes down and your app goes with it, then you haven't deployed your app in line with Amazon guidelines.

Maybe because this has nothing to do with availability zones and it's specifically to do with AWS mis/under-reporting status.

What AZ were they supposed to failover to? Another one reporting green?

The dashboard shows region health, not AZ health. You don't even need cross-region deployment, just multi-AZ in the same region. The issue was isolated to a single AZ, not an entire region.

This is a strawman. We're not demanding red, but seeking yellow or orange or some other indicator that captures the impact. Absolutely not green.

In my experience I have encountered 3 different types of outages:

1. Multiple availability zones in a single region are experiencing reduced performance/outage.

2. A single availability zone is experiencing reduced performance/outage.

3. A single service (bound to your account) is having reduced performance.

The dashboard displays item 1, and very rarely item 2. It never displays item 3, and that's the one most people actually care about.

Many AWS services appear to be siloed into user accounts, such as S3 or DynamoDB. Sometimes the systems comprising your storage might be severely degraded (due to poor access patterns or simply random statistical degradation). In such cases users have absolutely no way of being notified about degraded performance unless they have built metrics to monitor their services.

For example, I once wrote a program making access calls to S3 in a way that was essentially destroying my performance and reliability. I had not instrumented my code to report fairly low-level information (number of TCP RSTs, authentication negotiation time, packet loss). In fact, most of the low-level metrics are hidden so deep in the SDKs that instrumenting logging is itself a major engineering effort.
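Absent SDK support, a crude shim at the application boundary can at least capture per-call latency and error counts. A hedged sketch -- the wrapped lambda below stands in for a real SDK call such as an S3 `get_object`:

    import time
    from collections import Counter

    class InstrumentedCall:
        """Wrap any callable to record latencies and exception counts."""

        def __init__(self, fn):
            self.fn = fn
            self.latencies = []      # seconds per call
            self.errors = Counter()  # exception class name -> count

        def __call__(self, *args, **kwargs):
            start = time.perf_counter()
            try:
                return self.fn(*args, **kwargs)
            except Exception as exc:
                self.errors[type(exc).__name__] += 1
                raise
            finally:
                self.latencies.append(time.perf_counter() - start)

This still can't see TCP resets or auth negotiation time -- those genuinely require hooks below the SDK -- but it turns "S3 feels slow" into a latency distribution and error tally you can alert on.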

I think 99% of customers would be happy if there were a simple REST endpoint to query how their services are running. Fidelity could even be something like 15-minute accuracy.

    GET /MyStatus
    {
        "S3": [
            {
                "BucketName": "...",
                "Status": "DEGRADED",
                "FailingOperations": [...]
            }
        ],
        "EC2": []
    }


CloudWatch only tells you a small part of the story. AWS has much richer statistics internally that they do not expose isolating why issues are happening.

This article was posted and discussed here around the time it came out.

Here's a link to the previous discussion: https://news.ycombinator.com/item?id=13615198

This gets at the heart of the biggest problem with AWS status reporting -- if you're going to build a platform for people to build atop, they need to be able to pass through your errors to their customers in a reliable, visible, honest way.

Though to be fair, most sufficiently popular projects don't even need a real status page. One which simply reported its own traffic volume would suffice to know if the service itself is down (crowdsource your status!).
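A sketch of that crowdsourcing idea: compare current traffic to its recent baseline and treat a collapse as evidence of an outage. The 20% floor here is an arbitrary assumption, and `rpm` (requests per minute) is just an illustrative unit:

    def looks_down(current_rpm: float, baseline_rpm: float, floor: float = 0.2) -> bool:
        """True if traffic has collapsed far below its baseline rate."""
        if baseline_rpm <= 0:
            return False  # no history yet; can't judge
        return current_rpm < baseline_rpm * floor

The appeal is that traffic volume is hard to misreport: if real users can't reach you, the drop shows up whether or not anyone updates a status page by hand.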

I hypothesize that as with many things in the business world, anything that would hurt stock price must be avoided, especially in the short term.

So, I picture some rich business executive, technical or not, who ultimately would decide if there should be a status page and how it would look, saying to themself, "Why would we tell the whole world our service is down? This would cause panic among everyone, rather than just annoyance to those who truly care."

Of course it becomes an even bigger problem at that point to blatantly say "ALL SYSTEMS GO" with green checks, when it clearly isn't the case.

So, Amazon, like any other multi multi multi billion dollar business empire, clouds the water and seeks to control the perspective. "This wasn't downtime, as per your SLA. It was inability for some to access our servers, which were powered on the whole time!"

Anything to shore up stock price is to be pursued. Anything to bring it down is to be avoided.

Someone needs to go take one for the team and go through the hassle of a petty lawsuit (which will be an open and shut case in front of any judge), and the judge will find that if you're a $400bn operations company and have a status page, it needs to be accurate. This is simple, like a judge forcing you to turn off your "we're always open! Yes, right now! Come on in" giant flashing billboard, whenever you close.

It's ridiculous that people need to check a $7 server that shows if amazon is down or not. Someone go through this petty lawsuit and force them to sort their shit.

If you're amazon, sort your shit or you deserve the huge legal damages of making someone take one for the team and actually go through with a lawsuit. This isn't even a grey area.

It's trivial for you to report this accurately. If you have a page purporting to do so...do so.

Why is this being downvoted? There is a status page advertising "our services are up", when they have the information that this is false.

I can't think of a great (tight) analogy, but McDonald's obviously can't put "Only 100 Calories!" on their advertisement for Big Macs. Why not? Because it's false...

Companies can't just say whatever they want when they have information to the contrary. In this case other businesses are impacted.

I think it's not a hard argument to make.

Has anyone else noticed that in the last year, services have begun naming upstream vendors on their status pages when service is disrupted? In other words, saying "Our service is down because AWS is disrupted" rather than "Our service is down because our upstream infrastructure vendor is disrupted".

Naming the vendor during an availability incident is unclassy, I'd even say childish, yet it seems to have rapidly become a mainstream practice in our field :/ In the case of naming Amazon, practices of theirs such as those described in the article bring this on themselves, since we are only human, and we resent Amazon's dishonesty. Still, we should switch away from dishonest vendors rather than use them as scapegoats when our systems fail. (My point does not apply to postmortems, where every detail counts and is useful, including the specific vendors involved.)

Example (singling out a vendor that I otherwise think is great, sorry Filestack!): http://status.filestack.com/incidents/lzl06yzw1jq5

What do we stand to gain by naming the upstream vendor? (genuinely curious)
