
Postmortem: Azure DevOps (VSTS) Outage of 4 Sep 2018 - wallflower
https://blogs.msdn.microsoft.com/vsoservice/?p=17485
======
a2tech
Long story short: they give no indication of why their data center cooling
systems were unable to handle the voltage changes caused by the storm, and
their systems are not designed for speedy restoration into another region.

~~~
chrisbolt
Seems like there's more information in the preliminary RCA on
[https://azure.microsoft.com/en-us/status/history/](https://azure.microsoft.com/en-us/status/history/)

~~~
bpicolo
> Initially, the datacenter was able to maintain its operational temperatures
> through a load dependent thermal buffer that was designed within the cooling
> system. However, once this thermal buffer was depleted the datacenter
> temperature exceeded safe operational thresholds, and an automated shutdown
> of devices was initiated

~~~
souterrain
Total failure of a data center's cooling apparatus seems to me to be a very
rare occurrence, perhaps limited to simultaneous failure of utility and
genset power (for example, electrical switchgear and fuel pumps underwater
due to flooding).

Anyone have any data around how frequently such a failure occurs?

~~~
beh9540
I had the same thought. The only explanation I could come up with is that it
wasn't a failure of the power supply, but that a surge took down enough
cooling systems that they couldn't maintain temperature. A lot of DCs I've
seen are N+1 on cooling (or even 2N), but all the units run at the same time
and are the same models. Or the control system went down and they weren't
able to get it back up and running, although I would think they would have
redundancy in that case.

------
outworlder
OK, it's understandable; freak events happen.

> The primary solution we are pursuing to improve handling datacenter failures
> is Availability Zones, and we are exploring the feasibility of asynchronous
> replication.

This I do not understand. I was also amazed when I saw that Azure AZs are not
available in all regions. In AWS, the bare minimum is 2 AZs (except for one
odd region). Same for Google Cloud.
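
To make that concrete, here's a minimal sketch (assuming boto3; the AMI ID
and instance type below are placeholders) of what using AZs looks like on
AWS: enumerate the zones a region exposes and spread replicas across them,
so an event like this one only takes out one copy:

    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    # Discover the zones this region actually exposes.
    zones = [
        z["ZoneName"]
        for z in ec2.describe_availability_zones()["AvailabilityZones"]
        if z["State"] == "available"
    ]

    # Place one replica in each zone; a single-DC failure leaves the rest up.
    for zone in zones:
        ec2.run_instances(
            ImageId="ami-00000000",  # placeholder AMI
            InstanceType="t3.micro",
            MinCount=1,
            MaxCount=1,
            Placement={"AvailabilityZone": zone},
        )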

~~~
scarface74
From what I understand (I can't find the reference anywhere), each AWS region
has at least three availability zones. Some regions only have two
user-selectable AZs.

For instance, S3 promises that it is replicated across 3 AZs in a region.
That guarantee is available even in regions that only have two publicly
available AZs.

------
romaniv
At the end of the day, Azure and AWS are monocultures with a considerable
amount of centralization and interdependency within their services. Their
scale undermines the original purpose behind the Internet.

It bothers me that an increasing number of large companies dump their own
data centers to jump into The Cloud. This means future outages (which will
undoubtedly happen) will have a wider and wider impact on end users.

For example, if your email is hosted on AWS and it goes down, you lose access
to your email. No big deal. However, if your email, VOIP, and IM/chat go down
at the same time, you may lose all ability to communicate electronically.
This can be a very big deal in certain situations.

~~~
otterley
The original purpose behind the Internet was to build a robust layer-3 network
based on packet switching technology. The designers weren't focused on the
application layer.

Source: [https://www.internetsociety.org/internet/history-internet/brief-history-internet/](https://www.internetsociety.org/internet/history-internet/brief-history-internet/)

Separately, I think we have enough history of working with the cloud at this
point to demonstrate that the major providers' availability is on par with,
or better than, the availability of the typical small entity. Sure, the
impact is potentially more widespread (although this can be mitigated with a
cellular architecture, which first-class providers do employ), but there's a
perverse advantage: when outages occur, they tend to get fixed a lot faster
because the complaint volume is much higher.
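
(For anyone unfamiliar: "cellular architecture" just means partitioning
customers across many independent copies of the whole stack, so a failure is
contained to one cell. A hypothetical sketch; the cell names and the hashing
scheme are made up:)

    import hashlib

    # Each "cell" is an independent deployment of the full stack.
    CELLS = ["cell-1", "cell-2", "cell-3", "cell-4"]

    def cell_for_customer(customer_id: str) -> str:
        # Stable hash, so a customer always lands in the same cell and an
        # outage in one cell only affects the customers assigned to it.
        digest = hashlib.sha256(customer_id.encode()).hexdigest()
        return CELLS[int(digest, 16) % len(CELLS)]

    print(cell_for_customer("acme-corp"))  # e.g. "cell-3"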

~~~
romaniv
_> when outages occur, they tend to get fixed a lot faster because the
complaint volume is much higher._

On the other hand, they can be much harder to fix, because of the sheer scale
of the failures and the complexity of the infrastructure. There is a higher
probability of complex systemic issues, as demonstrated by this very outage.

There are plenty of smaller providers that beat Azure VMs in uptime. Plus,
smaller websites/services can employ much simpler failure mitigation
strategies.
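
By "simpler strategies" I mean things like a dumb active/passive health check
that flips traffic to a standby when the primary stops answering. A
hypothetical sketch (the URLs are placeholders; a real setup would update DNS
or a load balancer rather than print):

    import time
    import urllib.request

    PRIMARY = "https://primary.example.com/health"
    STANDBY = "https://standby.example.com/health"

    def healthy(url: str) -> bool:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                return resp.status == 200
        except Exception:
            return False

    while True:
        # Route to the standby whenever the primary fails its check.
        active = PRIMARY if healthy(PRIMARY) else STANDBY
        print("routing traffic to", active)
        time.sleep(30)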

~~~
otterley
The "complex systemic issue" here is that Azure is only now rolling out
availability zones, and the product in question hasn't yet been able to take
advantage of them to mitigate a serious DC fault caused by an Act of God.

The necessity of low-latency-but-decoupled-physical-plant AZs is well known in
the art by now, and these issues will no doubt be addressed as Azure matures.
Remember, they're 5 years behind AWS.

~~~
romaniv
_> The "complex systemic issue" here is that Azure is only now rolling out
availability zones,_

Availability zones are a _mitigation_. The issue is the sequence of events
and dependencies described in the postmortem. The description has six
paragraphs.

~~~
otterley
I'm not sure exactly what you're referring to. Can you cite the precise
problem discussed in the postmortem, and how, specifically, you think it
could have been better designed?

And how could your perfect model, whatever that is, survive a similar
catastrophic DC failure without availability zones?

------
swebs
That's an extremely roundabout way of saying there was a lightning strike and
they had inadequate surge protection.

------
em0ney
Really not the worst postmortem I've ever seen.

------
sungju1203
just use AWS. simple.

~~~
Bhilai
Comments like this are counterproductive for the discussion. Competition is
always good, and some of us like that AWS has competition in the form of
Azure and GCP. AWS has had its own share of outages, so it's not perfect
either.

------
byte1918
> VSTS (now called Azure DevOps)

Not again.

~~~
herbderb
What's the point of even mentioning that when they just use the old name for
the entirety of the article anyway?

~~~
skrebbel
The point is that the headline should've been "Azure DevOps Outage…" but
they're afraid that other outlets will take that over as "Azure Outage..." and
they don't want a headline like that making the rounds. So they use the old
name for bad news and the new name for good news.

TBH I'd do the same.

~~~
freeone3000
But it _is_ an Azure outage. It took out a DC. VSTS is one of the services
affected, but other services were also affected.

------
indemnity
Is this what we have to look forward to as GitHub is forced onto Azure?

~~~
manigandham
No service is perfect, and GitHub has had plenty of outages. They will only
become more reliable with the resources of MS/Azure at their disposal.

~~~
tumetab1
Having worked at a big Azure customer, I wouldn't say resources equal
stability. The reality is more resources + quality engineering + failure
testing.

As this case shows, Azure isn't spending a lot on failure testing. Also,
having experience with being a big Azure customer, I can tell you that things
look better than they are.

~~~
eropple
Small Azure customer (five figures a month), but can co-sign all of this.

Azure looks shiny from the outside but we've had way, way more problems, from
uptime to bad APIs to _awful_ language SDKs to bad user interfaces to
licensing hell, than I've _ever_ had on AWS or GCP. It's so bad that I am
currently weighing whether or not to advocate for a migration off of it, at
nontrivial expense, because I cannot pretend to provide reliable services for
our customers.

