AWS operates on a huge scale. Mindbogglingly huge, in fact.
What you see as an outage, AWS might see as a single rack failure in a single AZ. One rack out of hundreds of thousands in an AZ doesn't constitute an outage, and affects such a tiny percentage of customers that there's no logical reason for them to update their service status dashboard over it.
Despite all the redundancy built into the system and all the protections they can manage, these incidents are a common event, simply by nature of the sheer scale AWS operates at.
If AWS were to start surfacing these on their status page, it would essentially never leave the green-"i" state, except to occasionally dip into yellow or red.
Things that you and I, with smaller infrastructures, would think of as one-in-a-million or even one-in-a-billion events are an absolute certainty at their scale. For context: three years ago S3 announced it had passed the 2 trillion object mark, after having hit 1 trillion objects four years ago.
Tell that to the customers that experienced the outage. Look, if you don't want to show a global status page, at least give customers their own status page that is sensitive to outages for the equipment holding their data/instances.
When an airline experiences a single flight cancellation, they don't notify everyone in the world, but they certainly notify the people that are on that specific flight.
This is just a cop-out for hoping to upset as few customers as possible by hiding the fact that their service went out.
I'll be the first to tell you that their dashboard underestimates impact, but in this case it was totally accurate.
The system was working normally -- i.e., you could still use us-west-1 without issue if you were in another zone.
The service may have been working within SLA. Based on the data in this article, it was not working the way it works most of the time. I think most people would interpret "normally" to refer to the latter.
Or, let's drop the word "normally" and take a step back. A good dashboard should be designed to inform. Amazon's dashboard was not providing the information that Ably or their customers needed. And this isn't an isolated complaint; if I had a nickel for every time someone had complained about the green mockery of the AWS health dashboard, I'd have... well, several dollars, anyway.
Now, as a founder, my job is to build that prototype & MVP, but I'd expect that as a CEO it would be to hire people to handle the other 90% of cases.
> However, I still do not believe that if one out of three availability zones in a region is practically offline that is reasonable to report that region as healthy.
I can agree that it should be easier to see and understand that there are zonal issues, but I can't agree that the region was unhealthy. Now if 2 out of 3 zones were down, that's a real concern (you can't run a highly-available service anymore).
Said another way: what if there were 20 zones in the region and only one were down?
Disclosure: I work on Google Cloud.
Also, there are decommissioned zones that new accounts can't see but old ones can, if they've ever launched something there before (if they haven't, then that zone letter will be reassigned to a new zone with no one the wiser). Typically they won't let you launch new things in that zone, so it's in your best interest to move out of it, lest your resources get out of balance.
Edit: Oh and old accounts can always see new zones as far as I know. They just show up as new letters. But it's rare that they add new zones -- usually they just make old zones bigger.
As far as I know, that's the only zone that's ever been fully decommissioned.
Perhaps the writer and co-founder Ably used "post-truth" because he knows Bezos is not a supporter of our current administration, but I wish he wouldn't because it waters down the bald-faced lies of Kellyanne Conway, Sean Spicer, and The Donald.
The missteps of the American government should not become as trivial and normalised as a bodega not stocking organic almond milk.
In all the AWS documentation you are reminded time and time again that if you want any guarantee of availability: choose two zones. If you want more availability: choose two regions. If you're operating globally, you probably should anyway.
Your best bet is to monitor your own systems, and have enough monitoring in place to tell you that one zone is unavailable without having to rely on AWS to tell you.
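A minimal sketch of that idea, assuming each of your instances exposes a health endpoint you can poll. The hosts, the /healthz path, and the AZ mapping here are all made up for illustration; the point is just that you can flag a zone as down yourself when every instance in it stops answering:

```python
import urllib.request
from collections import defaultdict

# Hypothetical map of instance health endpoints to the AZ each runs in.
INSTANCES = {
    "http://10.0.1.10/healthz": "us-west-1a",
    "http://10.0.1.11/healthz": "us-west-1a",
    "http://10.0.2.10/healthz": "us-west-1b",
}

def check(url, timeout=2):
    """Return True if the instance answers its health check."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def summarize(results):
    """Given {url: healthy?}, flag any AZ where *every* instance failed."""
    by_az = defaultdict(list)
    for url, ok in results.items():
        by_az[INSTANCES[url]].append(ok)
    return {az: ("down" if not any(oks) else "up") for az, oks in by_az.items()}

if __name__ == "__main__":
    print(summarize({url: check(url) for url in INSTANCES}))
```

Run this from somewhere outside the zones it watches, or a zonal outage takes your monitor down with it.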
Their dashboard has no effect on their bonuses BTW (at least it didn't the last time I asked), but it is slow to update because it is purposely gated by a human so as not to cause false positives, and that human has to manually verify the problem before reporting it, which takes time.
So in this particular case they weren't monitoring the fact that they lost an entire zone? This seems like a cop-out to cover for the fact that they didn't reveal a major outage that they most certainly do monitor internally.
"Best practices: Multi-AZ / Region. Distribute application load across multiple Availability Zones / Regions"
This is pretty far from AWS clearly communicating from the outset that a single AZ would generally be poor practice, or have poorer availability than a server in a normal datacenter.
What AZ were they supposed to failover to? Another one reporting green?
1. Multiple availability zones in a single region are experiencing reduced performance/outage.
2. A single availability zone is experiencing reduced performance/outage.
3. A single service (bound to your account) is having reduced performance.
The dashboard displays item 1, and very rarely item 2. It never displays item 3, and that's the one most people actually care about.
Many AWS services appear to be siloed into user accounts, such as S3 or DynamoDB. Sometimes the systems comprising your storage might be severely degraded (due to poor access patterns or simple random statistical degradation). In such cases users have absolutely no way of being notified of degraded performance unless they have built their own metrics to monitor their services.
For example: I once wrote a program making access calls to S3 in a pattern that was essentially destroying my performance and reliability. I had not instrumented my code to report fairly low-level information (number of TCP RSTs, authentication negotiation time, packet loss). In fact, most of the low-level metrics are hidden so deep in their SDKs that instrumenting logging is itself a major engineering effort.
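A rough sketch of the kind of instrumentation I mean: wrapping each SDK call yourself to record wall-clock latency and failures, since the SDK won't surface them. The decorator and names below are illustrative, not part of any AWS SDK:

```python
import time
from functools import wraps

# Each entry: (call name, seconds elapsed, succeeded?)
call_timings = []

def instrumented(name):
    """Record latency and success/failure of every call to the wrapped function."""
    def decorate(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            try:
                result = fn(*args, **kwargs)
                call_timings.append((name, time.monotonic() - start, True))
                return result
            except Exception:
                call_timings.append((name, time.monotonic() - start, False))
                raise
        return wrapper
    return decorate

# Usage: wrap your real S3 call, e.g.
#   @instrumented("s3.get_object")
#   def fetch(key): ...
```

Crude, but a p99 on call_timings would have told me about my degradation weeks before I figured it out by hand.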
I think 99% of customers would be happy if there were a simple REST endpoint to query how their own services are running. Fidelity could even be something like 15-minute accuracy.
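Something like this is all it would take on the client side. The response shape below is invented for illustration (a real per-account API, like AWS's Personal Health Dashboard, would differ), but it shows how little is needed:

```python
import json

# Hypothetical response from a per-account status endpoint, refreshed every 15 min.
EXAMPLE = """{
  "as_of": "2017-02-14T10:15:00Z",
  "services": [
    {"name": "ec2",      "zone": "us-west-1a", "state": "degraded"},
    {"name": "dynamodb", "zone": "us-west-1b", "state": "ok"}
  ]
}"""

def unhealthy(payload):
    """Return only the services this account should actually worry about."""
    doc = json.loads(payload)
    return [s for s in doc["services"] if s["state"] != "ok"]
```

Poll it, alert on a non-empty list, done. No green checkmarks required.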
Here's a link to the previous discussion: https://news.ycombinator.com/item?id=13615198
Though to be fair, most sufficiently popular projects don't even need a real status page. One which simply reported its own traffic volume would suffice to know if the service itself is down (crowdsource your status!).
So, I picture some rich business executive, technical or not, who ultimately decides whether there should be a status page and how it would look, saying to themselves, "Why would we tell the whole world our service is down? This would cause panic among everyone, rather than just annoyance to those who truly care."
Of course it becomes an even bigger problem at that point to blatantly say "ALL SYSTEMS GO" with green checks, when it clearly isn't the case.
So, Amazon, like any other multi multi multi billion dollar business empire, clouds the water and seeks to control the perspective. "This wasn't downtime, as per your SLA. It was inability for some to access our servers, which were powered on the whole time!"
Anything to shore up stock price is to be pursued. Anything to bring it down is to be avoided.
It's ridiculous that people need to check a $7 server to see whether Amazon is down or not. Someone should go through with this petty lawsuit and force them to sort their shit.
If you're Amazon: sort your shit, or you deserve the huge legal damages when someone finally takes one for the team and actually goes through with a lawsuit. This isn't even a grey area.
It's trivial for you to report this accurately. If you have a page purporting to do so...do so.
I can't think of a great (tight) analogy, but McDonald's obviously can't put "Only 100 Calories!" on their advertisement for Big Macs. Why not? Because it's false...
Companies can't just say whatever they want when they have information to the contrary. In this case other businesses are impacted.
I think it's not a hard argument to make.
Naming the vendor during an availability incident is un-classy, I'd even say childish, yet it seems to have rapidly become a mainstream practice in our field :/ In the case of naming Amazon, practices of theirs such as those described in the article bring this on themselves; we are only human, and we resent Amazon's dishonesty. Still, we should switch away from dishonest vendors rather than use them as scapegoats when our own systems fail. (My point does not apply to postmortems, where every detail counts and is useful, including the specific vendors involved.)
Example (singling out a vendor that I otherwise think is great, sorry Filestack!): http://status.filestack.com/incidents/lzl06yzw1jq5
What do we stand to gain by naming the upstream vendor? (genuinely curious)