Does this have to be posted every time some cloud service has issues?
We, as in people in this forum, know that status pages are worthless. They’re tools with the explicit purpose of reducing the burden on tier one customer support. That’s it. They are not a public monitoring platform.
As a counterexample, GitHub tends to have a pretty good status page in my experience. It'll usually be up-to-date within minutes of people chatting about issues on work Slack, and it gets updated at a regular cadence with details.
AWS on the other hand... We usually just reach out to our TAMs and say "Hey, our application monitoring is showing tons of errors interacting with service X--can you check your super secret internal dashboards and see what the deal is?"
It's nice to at least have a "Yeah the service is completely hosed and in a bad place" or a "Yeah, some changes went wrong and are being rolled back". The former usually requires some sort of mitigation while the latter can largely be ignored.
Having a fully up-to-date status page is what prevents useless repetitive cases during degradation or an outage. If I'm having issues but you show all green, I'm submitting a case.
For large incumbents, publishing a complete and accurate status page might not only be a recipe for bad press, but also lawsuits. There's significant downside and not much upside to telling the whole truth. It's entirely unsurprising that cloud providers like AWS or Azure would play definitional games w/ what constitutes an outage. Rolling 5m, 1% outage across your entire customer base? That's just a hiccup! If your staff has to field more confused questions and complaints, it's a relatively small price to pay.
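To put rough numbers on that "hiccup", here is a minimal sketch of the arithmetic. It assumes one reading of the example above: 1% of customers are down for 5 minutes within a 30-day month. The function names and the 30-day window are my own assumptions, not any provider's actual SLA formula.

```python
# Illustrative arithmetic only; not any provider's real SLA calculation.
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes in the assumed window


def fleet_availability(affected_fraction, outage_minutes,
                       period_minutes=MINUTES_PER_30_DAYS):
    """Availability averaged across the whole customer base."""
    lost_customer_minutes = affected_fraction * outage_minutes
    return 1 - lost_customer_minutes / period_minutes


def affected_customer_availability(outage_minutes,
                                   period_minutes=MINUTES_PER_30_DAYS):
    """Availability as seen by a customer who was actually hit."""
    return 1 - outage_minutes / period_minutes


if __name__ == "__main__":
    print(f"fleet-wide: {fleet_availability(0.01, 5):.7%}")
    print(f"affected:   {affected_customer_availability(5):.7%}")
```

Under those assumptions the fleet-wide figure comes out around 99.9999%, while an affected customer sees about 99.988%. Averaging over the whole customer base is what makes the incident nearly invisible.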
I was on support until a few minutes ago (SQL DB, not general Azure), and the status page was ironically my first indication something was up since said status page was having DNS issues.
AFAIK Actions (maybe Packages, too?) was always built on Azure. I think GitHub core is on bare metal.
I'm guessing they probably get the elasticity of cloud while paying wholesale or at-cost prices for the infrastructure (surely they get some discount over the advertised price, at least).
> Azure DNS is now being offered at a 100% availability SLA that's backed by our diverse, geo-redundant DNS infrastructure.
> With this update, Azure DNS guarantees that valid DNS requests will receive a response from at least one name server 100% of the time. For details, see the SLA definition.
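Read literally, that wording only requires one name server to answer. A minimal sketch of that reading follows. The name server labels are hypothetical, and this is only my interpretation of the quoted sentence, not the actual SLA definition it points to.

```python
# Sketch of the quoted condition: "a response from at least one name server".
# Name server labels below are hypothetical placeholders.

def sla_met(responses):
    """responses maps name server -> True if it answered a valid query."""
    return any(responses.values())


def healthy_fraction(responses):
    """Share of name servers that actually answered."""
    return sum(responses.values()) / len(responses)


if __name__ == "__main__":
    sample = {"ns1": False, "ns2": False, "ns3": False, "ns4": True}
    print("SLA met:", sla_met(sample))
    print("healthy fraction:", healthy_fraction(sample))
```

With three of four servers dark, the condition as sketched here is still satisfied, even though clients that retry against the dead servers first would see slow or failed lookups.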
The sad part is that even though that feels like a lot given the time span, it doesn't feel that bad given it's Azure. These network issues have been plaguing us ever since we started using AD.
These problems are actually good in that more organizations - especially big ones - realize that reliability is not something you can outsource to a single provider and have the problem magically disappear. Literally any service you depend on, from DNS to email, should be using redundancy so that when your basket gets squashed, you still have some eggs. Neither Amazon nor Microsoft will tell you that because vendor lock-in is in their best interest. You need to take care of it yourself, otherwise you are completely at their mercy.
> We are aware of an issue affecting the Azure Portal and Azure services, please visit our alternate Status Page here https://status2.azure.com for more information and updates.
There have been some internal debates about setting up a secondary DNS in case Route53 somehow went down. My reasoning has thus far been that if Route53 is down, there are probably other AWS services that we depend on that would also be down.
What do you guys think? Is secondary DNS in this case worth it?
We're moving towards dual DNS providers. We've been bitten by DNS hosting failures too many times, and they're always painful because everything, including monitoring and control systems, ends up dead or inaccessible. Not to mention the entire network being down is absurdly expensive if you're paying out on SLAs.
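For anyone wanting to sanity-check a dual-provider setup, here is a rough sketch that groups a zone's NS hostnames by provider and flags a single point of failure. The hostnames are hypothetical, and grouping by the last two labels is a naive assumption that breaks for providers who spread their name servers across several domains or TLDs.

```python
from collections import defaultdict

# Naive heuristic (an assumption): treat the last two labels of an NS
# hostname as the "provider". Real setups may need an explicit mapping.


def group_by_provider(ns_hosts):
    """Group NS hostnames by their last two DNS labels."""
    groups = defaultdict(list)
    for host in ns_hosts:
        labels = host.rstrip(".").lower().split(".")
        groups[".".join(labels[-2:])].append(host)
    return dict(groups)


def has_provider_redundancy(ns_hosts):
    """True if the NS set spans at least two distinct providers."""
    return len(group_by_provider(ns_hosts)) >= 2


if __name__ == "__main__":
    # Hypothetical NS records for illustration only.
    ns_records = ["ns1.dns-a.example.", "ns2.dns-a.example.",
                  "ns1.dns-b.example."]
    print(group_by_provider(ns_records))
    print("redundant:", has_provider_redundancy(ns_records))
```

The NS list itself would come from whatever you already use to query the zone (for example the output of `dig NS`); the sketch deliberately leaves the lookup out so it runs without network access.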
DNS responses for our app servers in Azure were failing and now they are taking ~4000ms. They have a really short TTL too, which really exacerbates the issue.
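A back-of-the-envelope sketch of why the short TTL hurts, assuming a single client whose cache blocks on every refresh. The 4-second lookup comes from the numbers above; the 10-second and 300-second TTLs are hypothetical values picked for comparison.

```python
# Toy model (an assumption): one client, one cached record, and each cache
# refresh blocks for the full lookup time before the record is valid again.


def dns_wait_fraction(ttl_seconds, lookup_seconds):
    """Fraction of wall-clock time spent waiting on DNS refreshes."""
    return lookup_seconds / (ttl_seconds + lookup_seconds)


def lookups_per_hour(ttl_seconds, lookup_seconds):
    """Number of refresh cycles that fit into one hour."""
    return 3600 / (ttl_seconds + lookup_seconds)


if __name__ == "__main__":
    for ttl in (10, 300):  # hypothetical TTLs
        frac = dns_wait_fraction(ttl, 4.0)
        count = lookups_per_hour(ttl, 4.0)
        print(f"TTL {ttl:>3}s: {frac:.1%} of time blocked, "
              f"{count:.0f} lookups/hour")
```

In this model a 10-second TTL leaves the client blocked on DNS for well over a quarter of the time, while a 300-second TTL keeps it to about one percent. Real resolvers with prefetching or serve-stale behavior would do better than this.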