They're already fairly hard to automate, but making them external makes it even harder. So they're generally not automated. And, hopefully, they're not exercised very often, so they're often forgotten. The best way to manage this, IMHO, is to enable customer service to update the status page and delegate the duty to them. Generally, updating the status page reduces the customer service demand, so CS gets the most direct and timely feedback and is best placed to manage it.
Source: I ran a status page, poorly. It wasn't fully external either, but it was far away from all the moving parts, so it would only have had a correlated failure if our hosting had a multi-site routing issue (which happened once, but not while I was there) or if DNS broke. We had mitigations in the clients for DNS not working, so not too bad.
That an entity the scale of Amazon and AWS just can't figure out how to report system status? Or they forgot? Or didn't have the demand to create it?
Yeah, that's literally what they've claimed in the past. When the 2017 S3 outage happened, the issue wasn't reflected on the status page, and they later claimed they were having trouble updating it because the status page relied on S3 somehow.
Maybe the 'optimistic' argument is more like the 'naive' argument...
They have some rudimentary automated checks that don't cover enough failure scenarios. After some time, once there are enough reports on Twitter, they check manually whether something is wrong, and a human operator writes a status update on the status page.
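For what it's worth, here is a rough Python sketch of that flow: automated probes decide the status unless a human has written something by hand. Every name here (`ProbeResult`, `page_status`, the status strings) is made up for illustration; this is a guess at the general shape, not anything AWS is known to run.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ProbeResult:
    """Outcome of one automated check (hypothetical type)."""
    name: str
    ok: bool


def page_status(probes: Sequence[ProbeResult],
                manual_override: Optional[str] = None) -> str:
    """Return the text a status page would show.

    A message written by a human operator always wins. Otherwise the
    status is derived only from the probes that exist, so a failure in
    a component nobody probes still reads as "operational".
    """
    if manual_override is not None:
        return manual_override
    failed = [p.name for p in probes if not p.ok]
    if not failed:
        return "operational"
    return "degraded: " + ", ".join(failed)
```

The gap described above shows up directly: if the broken component has no probe, `page_status` keeps returning "operational" until someone passes a `manual_override`.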
But, yeah, general infrastructure being down and affecting all kinds of disparate (travel and non-travel) sites makes sense and is boring.
Reminds me of an interesting study where researchers gave a list of numbers with some hidden property and let people ask if other numbers have that property, to try to figure out what it was. And people tend to generate numbers that agree with their hypothesis, to try to confirm it, rather than numbers that disagree, to disprove it. Wish I could find that study.
It's actually helpful to see a lot of unrelated sites affected by (likely) a similar cause. I've had issues with a couple of different sites, and it's good to see confirmation that it's not just me.
Particularly since a) the number of affected sites appears to be growing, and b) the downtime keeps getting longer.
Kinda depressing how fragile the Internet is.