Honest status reporting and AWS service status “truth” in a post-truth world (ably.io)
This still fundamentally misses a key point:

AWS operates on a huge scale. Mindbogglingly huge, in fact.

What you see as an outage, AWS might know to be a single rack in a single AZ. One single rack out of hundreds of thousands in a single AZ doesn't constitute an outage, and affects such a tiny percentage of customers that there is absolutely no logical reason why they'd update their service status dashboard.

Despite all the redundancy built in to the system and all the protections they can manage, these incidents are a common event, just by nature of the sheer scale that AWS is operating at.

If AWS were to start surfacing these on their status page, it would essentially never leave the Green-I state, except to occasionally dip into Yellow or Red.

Things that you and I, with smaller infrastructures, would think of as one-in-a-million or even one-in-a-billion events are an absolute certainty at their scale. To give context, 3 years ago S3 announced they'd passed the 2 trillion object mark, after having hit the 1 trillion object mark 4 years ago.
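
To put rough numbers on that, here is a minimal sketch. It assumes each object (or operation) independently has the same tiny chance of hitting a failure, which is a simplification, and the function names are made up for illustration:

```python
import math

def prob_at_least_one(p: float, n: int) -> float:
    """Probability that an event with per-trial probability p happens
    at least once in n independent trials: 1 - (1 - p)^n.
    Uses log1p/expm1 so tiny p and huge n do not lose precision."""
    return -math.expm1(n * math.log1p(-p))

def expected_events(p: float, n: int) -> float:
    """Expected number of occurrences across n independent trials."""
    return p * n

# Assumed inputs: "one in a billion" odds, 2 trillion objects.
one_in_a_billion = 1e-9
objects = 2 * 10**12

print(prob_at_least_one(one_in_a_billion, objects))
print(expected_events(one_in_a_billion, objects))
```

Under those assumptions, a one-in-a-billion event is expected around two thousand times across 2 trillion objects, so "will it happen" stops being the question.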

The service was operating normally, you just weren't architected to handle a common failure case. Don't blame Amazon, they warned you in advance. They specifically say that they don't consider it a problem if the problem is isolated to one zone, because your app should be able to work across all zones in a region seamlessly.

I'll be the first to tell you that their dashboard underestimates impact, but in this case it was totally accurate.

The system was working normally -- i.e. you could still use us-west-1 without issue if you were in another zone.

[slight political rant given the title and meme]

Perhaps the co-founder at Ably used "post-truth" because he knows Bezos is not a supporter of our current administration, but I wish he wouldn't because it waters down the bald-faced lies of Kellyanne Conway, Sean Spicer, and The Donald.

The missteps of our government should not become as trivial and normalised as the bodega not stocking organic almond milk.

I think the main discrepancy is that you see networking in a single zone not working as a catastrophic event, but AWS doesn't.

In all the AWS documentation you are reminded time and time again that if you want any guarantee of availability: choose two zones. If you want more availability: choose two regions. If you're operating globally, you probably should anyway.

It is of course true that the author should have deployed to two zones. That said, you'd still notice something is going wrong, and you'd need to identify what's happening and whether you need to take any corrective action. Being able to easily see that half your infrastructure is down, and that you didn't break it, would save a lot of time and nerves in a potentially critical situation. Unless you significantly overprovision, you will likely also need to scale up the other zone. For that decision it's again important to quickly know what's happening. Amazon's reporting helps nobody but Amazon: it protects their reported uptime, probably their internal metrics, and likely their bonuses.
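
The scale-up point is simple arithmetic. A rough sketch, assuming load is spread evenly across zones and simply redistributes to the survivors (the function name and numbers are hypothetical):

```python
def per_zone_load_after_failure(total_load: float, zones: int, lost: int = 1) -> float:
    """Load each surviving zone must carry if `lost` zones drop out,
    assuming traffic is spread evenly over whatever zones remain."""
    if zones <= 0 or lost < 0 or lost >= zones:
        raise ValueError("need at least one surviving zone")
    return total_load / (zones - lost)

# Two zones sharing 1000 req/s: each normally carries 500.
# Lose one and the survivor has to carry all 1000.
print(per_zone_load_after_failure(1000, zones=2))
# With three zones the survivors each go from ~333 to 500.
print(per_zone_load_after_failure(1000, zones=3))
```

So with two zones you either run each at under half capacity all the time, or you need to know quickly that a zone is gone so you can scale the other one.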

It's probably not wise to rely on your provider's monitoring for such critical things. As AWS has proven, their alerts are slow, and even if they weren't, they aren't monitoring the very specific thing you need.

Your best bet is to monitor your own systems, and have enough monitoring in place to tell you that one zone is unavailable without having to rely on AWS to tell you.
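
As a rough illustration of what "enough monitoring" could mean, here is a sketch that groups your own health-check results by zone and flags any zone where most probes fail. The probe format, zone names, and the 50% threshold are all assumptions for the example, not anything AWS provides:

```python
from collections import defaultdict

def zone_failure_rates(probes):
    """probes: iterable of (zone, ok) pairs from your own health checks.
    Returns {zone: fraction of probes that failed}."""
    totals = defaultdict(int)
    failures = defaultdict(int)
    for zone, ok in probes:
        totals[zone] += 1
        if not ok:
            failures[zone] += 1
    return {zone: failures[zone] / totals[zone] for zone in totals}

def unavailable_zones(probes, threshold=0.5):
    """Zones whose failure rate is at or above the (assumed) threshold."""
    rates = zone_failure_rates(probes)
    return sorted(zone for zone, rate in rates.items() if rate >= threshold)

# Hypothetical probe results: one zone mostly failing, one healthy.
probes = (
    [("us-west-1a", False)] * 3 + [("us-west-1a", True)]
    + [("us-west-1b", True)] * 4
)
print(unavailable_zones(probes))
```

If every failing host sits in the same zone, you have your answer without waiting for anyone's dashboard.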

Their dashboard has no effect on their bonuses BTW (at least it didn't the last time I asked), but it is slow to update because it is purposely gated by a human so as not to cause false positives, and that human has to manually verify the problem before reporting it, which takes time.

Frustration is understandable, though reliance on a single AZ is a known and well-documented failure mode. You have to share the blame.

http://d0.awsstatic.com/whitepapers/architecture/AWS_Well-Ar...

"Best practices: Multi-AZ / Region. Distribute application load across multiple Availability Zones / Regions"

No idea why this is being downvoted. If a single AZ goes down and your app goes with it, then you haven't deployed your app in line with Amazon guidelines.

Maybe because this has nothing to do with availability zones and it's specifically to do with AWS mis/under-reporting status.

What AZ were they supposed to failover to? Another one reporting green?

This gets at the heart of the biggest problem with AWS status reporting -- if you're going to build a platform for people to build atop, they need to be able to pass through your errors to their customers in a reliable, visible, honest way.

Though to be fair, most sufficiently popular projects don't even need a real status page. One which simply reported its own traffic volume would suffice to know if the service itself is down (crowdsource your status!).
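
A crude version of that idea, as a sketch: compare current traffic against a recent baseline and call it "down" if it falls off a cliff. The median baseline and the 50% drop threshold are assumptions picked for illustration:

```python
from statistics import median

def traffic_looks_down(history, current, drop_fraction=0.5):
    """Return True if current traffic is below drop_fraction of the
    median of recent samples. With no history, we cannot tell, so
    this conservatively returns False."""
    if not history:
        return False
    baseline = median(history)
    return current < baseline * drop_fraction

# Hypothetical requests-per-minute samples.
recent = [1000, 1100, 950, 1050]
print(traffic_looks_down(recent, 300))
print(traffic_looks_down(recent, 900))
```

A real version would have to account for daily and weekly traffic cycles, otherwise it would report an outage every night.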

I hypothesize that as with many things in the business world, anything that would hurt stock price must be avoided, especially in the short term.

So, I picture some rich business executive, technical or not, who would ultimately decide whether there should be a status page and how it would look, saying to themselves, "Why would we tell the whole world our service is down? This would cause panic among everyone, rather than just annoyance to those who truly care."

Of course it becomes an even bigger problem at that point to blatantly say "ALL SYSTEMS GO" with green checks, when it clearly isn't the case.

So, Amazon, like any other multi multi multi billion dollar business empire, clouds the water and seeks to control the perspective. "This wasn't downtime, as per your SLA. It was an inability for some to access our servers, which were powered on the whole time!"

Anything to shore up stock price is to be pursued. Anything to bring it down is to be avoided.

