Major Azure Outage

6ak74rfy · on May 2, 2019

Public clouds will have outages - that's not the point. What's most concerning about this outage is that it is across all regions. That violates the fundamental assumption, of developing for the cloud, that failures in every region are independent.

If regions fail independently and a failure in 1-2 regions brought down my system, that's my fault. But if region failures aren't independent and a global outage such as this is possible - well, that's pretty bad.

cheeze · on May 2, 2019

Yup, fully agreed. Nobody can keep sql instances or hosts up forever. They will go down. Further, humans work on this stuff. Humans make mistakes. Bad config push, bad code push, lightning hitting a data center, human vandalism, etc. will happen.

What shouldn't happen ever is that your entire cloud goes down because somebody pushed a bad config change to a service that serves literally your entire cloud.

Microsoft clearly hasn't architected everything to be region-independent. There are things that will always be somewhat global, but Azure seems pretty bad at this. This isn't the first time even in the past 12 months that they've had a global outage.

cheeze · on May 2, 2019

Man... Azure seems to be an order of magnitude worse than AWS and GCP when it comes to reliability.

Seems like they have tons of global dependencies within their services which cause these cascading failures rather often... Seems like only a few months ago we were reading about a global outage that affected auth?

Regardless: Godspeed to the engineers working to fix this.

sbr464 · on May 2, 2019

On a more serious note, how would your entire network, worldwide go down? Are there really no independent zones (that are unaware of each other)? That can’t be good.

cheeze · on May 2, 2019

Global dependencies. Something in DNS had a dependency on something central within Azure, and when that breaks, you're done for.

I'm not in the cloud provider game, but it seems like it would be important to audit and ensure that there are no critical cross-region dependencies. I assume GCP and AWS do this regularly?

It seems like there are some things that have to be somewhat global (IAM comes to mind), but minimization of that seems important.

IMO this is the most embarrassing non-security thing that can happen to you as a cloud provider.

popotamonga · on May 2, 2019

Paying thousands for replications and redundancies, multizone etc and bam, all down. Nice.

steveadoo · on May 2, 2019

When I tell my clients Azure had another outage they're going to demand we move to another cloud service. Looks like I'm in for a looong couple weeks.

lghh · on May 2, 2019

What are you going to tell them when the other cloud services have their inevitable outages?

PlanetLotus · on May 2, 2019

I'm not the parent, but I maintain services on both AWS and Azure, and in the last few years, I can definitely say the outages on Azure have been more frequent and more severe. The only AWS outages I recall are S3 and the Dyn DNS issue that brought many other providers down too.

user5994461 · on May 2, 2019

The Dyn DNS outage was easier to explain. Half of the internet is down, it's not just AWS.

PlanetLotus · on May 2, 2019

Exactly. As an AWS customer, that outage didn't really bother me because it was obvious it impacted many other unrelated services too.

cheeze · on May 2, 2019

I'd tell them "other providers have inevitable outages, however..."

1. The other providers have a better track record 2. The other providers don't go down globally multiple times a year

mattbillenstein · on May 2, 2019

How did you get on Azure to begin with?

ecaron · on May 2, 2019

I'm guessing he inherited it. I'm guessing most people on this thread are in that boat.

GFischer · on May 2, 2019

In my case, we actually chose it, as we´re a Microsoft shop. Hasn´t been that bad, but we did experience 2 major outages in 1 year.

Not sure how many outages are there on AWS.

mattbillenstein · on May 2, 2019

AWS has had a couple cascading EBS failures in us-east-1 years ago which affected a lot of services since it's a foundational building block of the whole system. It's been a reason to prefer instance storage for quite awhile imo.

I've run most everything in us-west-2 Oregon the last 5+ years and I can't remember a similar sort of outage there in that time-frame.

A widespread world-wide outage like is happening now on Azure is a red flag imo.

outworlder · on May 2, 2019

> It's been a reason to prefer instance storage for quite awhile imo.

Yuck. Please tell me you don't do that anymore (unless for specialized workloads where you don't care if an instance loses data due to a shutdown).

mattbillenstein · on May 4, 2019

I still do it all the time - but I consider each instance disposable and all data is replicated to at least two other hosts in different AZs.

cheeze · on May 2, 2019

Outages happen all the time on both of the providers - they just don't happen globally and usually it isn't a global outage.

One that does come to mind is the S3 outage a few years ago, which was essentially a global outage.

tracker1 · on May 2, 2019

AWS has one or two a year from what I've seen... IIRC gcp has had outages too... Unless you're designing with several safeguards in place across multiple regions and cloud providers, there's no getting around it.

Everyone has down time, it's just often coordinated to a lot of people when it does happen. It's still generally less than when you try to self host on a cloud provider, it's just not your mistake that did it.

goobynight · on May 2, 2019

"an" outage is different from a global outage though.

I'm not sure what the SLA is on a single region, but going down 0-2 times per year is a reasonable expectation, depending on the length of each one. If you want more, you have to have regional failover.

Azure is burning error budget in every region today and you would need to failover to a different cloud provider or your own datacenter.

If I'm interpreting this correctly, there was no plan you could implement solely in Azure that could have helped you today.

outworlder · on May 2, 2019

> AWS has one or two a year from what I've seen

Worldwide outage? That's really not the case. Over thousands of machines in 3 years I haven't seen a single failure that spanned more than one region. Or even a region going entirely offline. At most you'll see one service affected.

ecaron · on May 2, 2019

https://azure.microsoft.com/en-us/status/ was just updated, showing "Network Infrastructure" is down across the board.

Analemma_ · on May 2, 2019

Azure SQL is totally down for us, Storage (tables/blobs/queues) is mostly down. Seems to be a DNS issue, and this wouldn’t be the first time Microsoft has been brought down by DNS.

ljoshua · on May 2, 2019

Same. Name resolution is failing but intermittently, changing from second to second whether it resolves or not.

steveadoo · on May 2, 2019

Same here. East US2, West US and West Europe are all affected for me.

odorousrex · on May 2, 2019

Same here. On a positive note, all our apps are still working - apparently their internal DNS is still working, at least in the same datacenters.

I can't access anything from here though.

augustl · on May 2, 2019

Azure SQL is totally down for us as well. Unable to resolve the DNS for it, both from within azure (kubernetes pods) and from outside (my laptop).

EDIT (23:37 Oslo/Norway): connectivity is restored for us now

plasma · on May 2, 2019

I'm seeing some recovery of services in West US

ZanyProgrammer · on May 2, 2019

Things have been back up for several minutes.

zeven7 · on May 2, 2019

Things are not back up for me. And looking at Twitter, it looks like things are still broken for a lot of people.

PlanetLotus · on May 2, 2019

Same. West US here.

norad73 · on May 2, 2019

Recovery started in Europe (AMS)