Based on our telemetry, this started as NXDOMAINs for sqs.us-east-1.amazonaws.com, appearing in modest volumes at 20:43 UTC and becoming a total outage at 20:48 UTC. Naturally, it was completely resolved by 20:57, 5 minutes before anything was posted to the "Personal Health Dashboard" in the AWS console.
It takes a while to find a Vice President, I guess.
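(For the curious: the detection side of that kind of telemetry isn't exotic. Here's a rough sketch of the sort of probe that could surface the pattern; dnspython, the polling interval, and the alert threshold are all my own choices for illustration, not a description of any real monitoring setup.)

```python
# Sketch: resolve the SQS endpoint on an interval and alert when the
# recent NXDOMAIN rate spikes. Interval and threshold are arbitrary.
import time

import dns.exception
import dns.resolver

TARGET = "sqs.us-east-1.amazonaws.com"

def probe_once() -> str:
    try:
        dns.resolver.resolve(TARGET, "A")
        return "ok"
    except dns.resolver.NXDOMAIN:
        return "nxdomain"
    except dns.exception.DNSException:
        return "error"

if __name__ == "__main__":
    window = []
    while True:
        window.append(probe_once())
        window = window[-60:]  # last 60 probes (~5 minutes at a 5s interval)
        nxdomain_rate = window.count("nxdomain") / len(window)
        if nxdomain_rate > 0.5:
            print(f"ALERT: {nxdomain_rate:.0%} of recent lookups are NXDOMAIN for {TARGET}")
        time.sleep(5)
```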
I'd be totally fine just having alerts and metrics driving the status page. Why involve a human at all? They just get emotional.
(I have a data-driven status page for my personal website. If Oh Dear decides my website is down, the status page gets automatically updated. Obviously nobody is ever going to visit status.jrock.us if they are trying to read an article on my blog and it doesn't load, but hey at least I can say I did it.)
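(The plumbing for that is tiny, by the way. Here's a hypothetical sketch of the "monitor drives the status page" idea; the webhook payload field names are invented for illustration and are not Oh Dear's actual schema.)

```python
# Sketch: an uptime monitor POSTs here when a check flips, and the status
# is published immediately, with no human in the loop.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

CURRENT_STATUS = {"state": "up"}  # what the status page would render

class MonitorWebhook(BaseHTTPRequestHandler):
    def do_POST(self):
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        event = json.loads(body or b"{}")
        # Assumed payload shape: {"check": "uptime", "result": "failed" | "recovered"}
        if event.get("result") == "failed":
            CURRENT_STATUS["state"] = "down"
        elif event.get("result") == "recovered":
            CURRENT_STATUS["state"] = "up"
        self.send_response(204)
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("", 8080), MonitorWebhook).serve_forever()
```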
To make a judgement call on whether the issue is severe enough to warrant the legal/financial risk of admitting your service is broken, potentially breaking customer SLAs.
If you're just going to lie, why have an SLA at all? It's like doing a clinical trial for a drug; a bunch of your patients die and you say "well, they were going to die anyway; it has nothing to do with the drug." If it's one person, maybe you can get away with that. When it's everyone in the experimental group, people start to wonder.
I have two arguments in favor of honest SLAs. The first is that when customers suspect something is down, an honest status page gives them a piece of data with which to make their mitigation decisions. "A lot of our services are returning errors"; check the auto-updated status page; "there may be an issue with network routes between AZs A and C". Now you know to drain out of those zones. If the status page says "there are no problems", you instead spend hours debugging the issue, costing yourself far more in your team's time than you spend on your cloud infrastructure in the first place. If having an SLA is the cause of that, it would be financially in your favor to not have the SLA at all. The SLA only bounds your compensation to what you pay for cloud resources; your actual losses can be much higher: lost revenue, lost time troubleshooting someone else's problem, etc.
The second is that SLA violations are what justify reliability engineering efforts. If you lose $1,000,000 a year to SLA violations, and you hire an SRE for $500,000 a year who reduces those violations by 75%, then you just made a profit of $250,000. If you say "nah, there were no outages", then you're flushing that $500,000 a year down the toilet and should fire anyone working on reliability. That is obviously not healthy; the financial aspect keeps you honest and accountable.
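To spell out that arithmetic (same illustrative numbers as above, nothing measured):

```python
# The reliability-hire math from the paragraph above, made explicit.
annual_sla_losses = 1_000_000   # what violations cost per year
sre_cost = 500_000              # fully loaded cost of the hire
reduction = 0.75                # fraction of violation losses eliminated

savings = annual_sla_losses * reduction   # 750_000
net = savings - sre_cost                  # 250_000
print(f"net benefit of the reliability hire: ${net:,.0f}")

# If you pretend there were no outages, measured savings are zero and the
# same hire looks like a pure $500,000/year cost.
```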
All of this gets very difficult when you are planning your own SLAs. If everyone is lying to you, you have no choice but to lie to your customers. You can multiply together all the 99.5% SLAs of the services you depend on and give your customers a guarantee of 95%, but if the 99.5% you're quoted is actually 89.4%, then you can't meet your guarantees. AWS can afford to lie to their customers (and Congress, apparently) without consequences. But you, small startup, can't. Your customers are going to notice, and they were already taking a chance going with you instead of some big company. This is a hard cycle to get out of. People don't want to lie, but they become liars because the rest of the industry is lying.
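That compounding is worth doing on paper at least once. A quick sketch, where the count of ten dependencies is my own assumption, picked so the quoted SLAs land near that 95% figure:

```python
# Compounding quoted vs. actual availability across dependencies.
# deps = 10 is an assumption: 0.995 ** 10 is roughly the 95% mentioned above.
quoted = 0.995
actual = 0.894
deps = 10

print(f"guarantee you can offer from quoted SLAs: {quoted ** deps:.1%}")  # ~95.1%
print(f"what you can actually deliver:            {actual ** deps:.1%}")  # ~32.6%
```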
Finally, I just want to say I don't even care about the financial aspect, really. The 5 figures spent on cloud expenses are nothing compared to the nights and weekends your team loses to debugging someone else's problem. They could have spent that time with their families or hobbies if the cloud provider just said "yup, it's all broken, we'll fix it by tomorrow". Instead, they stay up late into the night looking for the root cause, find it 4 hours later, and still can't do anything except open a support ticket answered by someone who has to lie to preserve the SLA. They'll never get those hours back! And they turned out to be a complete waste of time.
It's a disaster, and I don't care if it's a hard problem. We, as an industry, shouldn't stand for it.
It's not just a hard problem; it's impossible to go from metrics to automatic announcements. Give me a detected scenario and I'll give you an explanation for the errors that has nothing to do with the service health as seen by customers. From basic "our tests have bugs", to "a reflected DDoS on internal systems causes rate limiting on a test endpoint", to "deprecated instance type removal caused a spike in failed creations", to "BGP issues cause test endpoint failures, but not customer-visible ones", to...
You can't go from a metric to a diagnosis for a customer; there's just no 1:1 mapping possible, with errors going both ways. AWS sucks with their status delays, but it's better than seeing their internal monitoring.
I don't agree. I'm perfectly happy seeing their internal metrics.
I remember one night, a long time ago while working at Google, when I was trying a new approach for loading some data involving an external system. To get the performance I needed, I was going to be making a lot of requests to that service, and I was a little concerned about overloading it. I ran my job, opened up that service's monitoring console, poked around, and determined that I was in fact overloading it. I sent them an email and they added some additional replicas for me.
Now I live in the world of the cloud where I pay $100 per X requests to some service, and they won't even tell me if the errors it's returning are their fault or my fault? Not acceptable.
> I'd be totally fine just having alerts and metrics driving the status page. Why involve a human at all?
What happens if the monitoring service goes down, or has a bug that causes it to incorrectly report the status as okay? Obviously this won't usually happen at the same time the actual service goes down, but if it did, would people really believe the status page?
It definitely is. For an issue like this, you will see relevant teams and delegates looped in very quickly. Getting approved wording about an outage requires some very senior people though. Often they have to be paged in as well.
Having worked at a few other large tech companies now, I can say Amazon's incident response process is honestly great. It's one of the things I miss about working there.
It takes a while to find a Vice President, I guess.