
Major Azure Outage - klausjensen
https://twitter.com/AzureSupport/status/1124046510411460610
======
6ak74rfy
Public clouds _will_ have outages - that's not the point. What's most
concerning about this outage is that it spans all regions. That violates
the fundamental assumption of developing for the cloud: that failures in
different regions are independent.

If regions fail independently and a failure in 1-2 regions brought down my
system, that's my fault. But if region failures aren't independent and a
global outage such as this is possible - well, that's pretty bad.
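
To put rough numbers on why the independence assumption matters, here is a minimal sketch. The function names and every probability in it are made up for illustration; none of them are measured figures for Azure or any other provider.

```python
def p_all_regions_down(p_region: float, n_regions: int) -> float:
    """Chance that every region is down at once, assuming regions fail independently."""
    return p_region ** n_regions


def p_global_outage(p_region: float, n_regions: int, p_shared: float) -> float:
    """Same chance when a shared global dependency (hypothetical, failing with
    probability p_shared) takes every region down by itself."""
    independent = p_all_regions_down(p_region, n_regions)
    # Either the shared dependency fails, or it holds and all regions fail independently.
    return p_shared + (1 - p_shared) * independent


if __name__ == "__main__":
    # Illustrative inputs only: 1% per-region failure chance, 3 regions,
    # 0.1% failure chance for the shared dependency.
    print(p_all_regions_down(0.01, 3))
    print(p_global_outage(0.01, 3, 0.001))
```

With these made-up inputs, the shared dependency dominates the result: the chance of a global outage stays close to `p_shared` no matter how many regions you deploy to.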

~~~
cheeze
Yup, fully agreed. _Nobody_ can keep sql instances or hosts up forever. They
will go down. Further, humans work on this stuff. Humans make mistakes. Bad
config push, bad code push, lightning hitting a data center, human vandalism,
etc. will happen.

What _shouldn't_ happen ever is that your entire cloud goes down because
somebody pushed a bad config change to a service that serves _literally your
entire cloud._

Microsoft clearly hasn't architected everything to be region-independent.
There are things that will always be somewhat global, but Azure seems pretty
bad at this. This isn't the first time even in the past 12 months that they've
had a global outage.

------
cheeze
Man... Azure seems to be an order of magnitude worse than AWS and GCP when it
comes to reliability.

Seems like they have tons of global dependencies within their services which
cause these cascading failures rather often... Seems like only a few months
ago we were reading about a global outage that affected auth?

Regardless: Godspeed to the engineers working to fix this.

------
sbr464
On a more serious note, how would your entire network go down worldwide? Are
there really no independent zones (that are unaware of each other)? That can’t
be good.

~~~
cheeze
Global dependencies. Something in DNS had a dependency on something central
within Azure, and when that breaks, you're done for.

I'm not in the cloud provider game, but it seems like it would be important to
audit and ensure that there are _no_ critical cross-region dependencies. I
assume GCP and AWS do this regularly?

It seems like there are some things that _have_ to be somewhat global (IAM
comes to mind), but minimization of that seems important.

IMO this is the most embarrassing non-security thing that can happen to you as
a cloud provider.

------
steveadoo
When I tell my clients Azure had another outage they're going to demand we
move to another cloud service. Looks like I'm in for a looong couple weeks.

~~~
lghh
What are you going to tell them when the other cloud services have their
inevitable outages?

~~~
PlanetLotus
I'm not the parent, but I maintain services on both AWS and Azure, and in the
last few years, I can definitely say the outages on Azure have been more
frequent and more severe. The only AWS outages I recall are S3 and the Dyn DNS
issue that brought many other providers down too.

~~~
user5994461
The Dyn DNS outage was easier to explain. Half of the internet is down, it's
not just AWS.

~~~
PlanetLotus
Exactly. As an AWS customer, that outage didn't really bother me because it
was obvious it impacted many other unrelated services too.

------
ecaron
[https://azure.microsoft.com/en-us/status/](https://azure.microsoft.com/en-us/status/)
was just updated, showing "Network Infrastructure" is down across
the board.

------
Analemma_
Azure SQL is totally down for us, Storage (tables/blobs/queues) is mostly
down. Seems to be a DNS issue, and this wouldn’t be the first time Microsoft
has been brought down by DNS.

~~~
ljoshua
Same. Name resolution is failing but intermittently, changing from second to
second whether it resolves or not.
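
For resolution that flaps like this, a client-side retry with backoff can paper over some of the failures. A minimal sketch, assuming the failures surface as `socket.gaierror`; the helper name `resolve_with_retry` and its defaults are hypothetical, and it does nothing if the name stays unresolvable:

```python
import socket
import time


def resolve_with_retry(host, attempts=5, base_delay=0.5,
                       resolver=socket.getaddrinfo, sleep=time.sleep):
    """Resolve host, retrying with exponential backoff on DNS errors.

    resolver and sleep are injectable so the logic can be exercised
    without real network access or real waiting.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return resolver(host, None)
        except socket.gaierror as exc:
            last_error = exc
            # Back off before the next try, but not after the final one.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error
```

This only helps with intermittent failures; a cached last-known-good address is the usual fallback when the resolver is down outright.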

------
augustl
Azure SQL is totally down for us as well. Unable to resolve the DNS for it,
both from within Azure (Kubernetes pods) and from outside (my laptop).

EDIT (23:37 Oslo/Norway): connectivity is restored for us now

------
plasma
I'm seeing some recovery of services in West US

------
ZanyProgrammer
Things have been back up for several minutes.

~~~
zeven7
Things are not back up for me. And looking at Twitter, it looks like things
are still broken for a lot of people.

------
norad73
Recovery started in Europe (AMS)

