
Microsoft Azure suffers outage after cooling issue - pcunite
https://www.datacenterdynamics.com/news/microsoft-azure-suffers-outage-after-cooling-issue/
======
amaccuish
The Azure status page has more information. I suggest updating the link.

> A severe weather event, including lightning strikes, occurred near one of
> the South Central US datacenters. This resulted in a power voltage increase
> that impacted cooling systems. Automated datacenter procedures to ensure
> data and hardware integrity went into effect and critical hardware entered a
> structured power down process.

[https://azure.microsoft.com/en-us/status/](https://azure.microsoft.com/en-us/status/)

~~~
Rapzid
I wonder if that's one of their facilities down here in San Antonio. Was
getting flash flood alerts on my phone all night and morning.

~~~
mariojv
Wow, I didn't tie these two events together until reading this comment. The
flash floods last night were quite awful. mySA, a local news site (which I
don't necessarily trust), has said that the daily rainfall total was 3x its
historic record in the 1800s. [0]

It's always quite fascinating whenever cloud platforms like this have "leaky
abstractions." GCP had a very long storage service degradation today, as well.
[1] I don't know if it's related.

[0] [https://www.mysanantonio.com/news/weather/article/Several-re...](https://www.mysanantonio.com/news/weather/article/Several-rescued-from-high-water-on-North-Side-13202695.php)

[1]
[https://status.cloud.google.com/incident/storage/18003](https://status.cloud.google.com/incident/storage/18003)

edited for formatting

~~~
mirimir
Hmmm. That's quite the jump. Daily rainfall of 6.07" vs 1889 record of 1.76".
But hey, just a fluke, no doubt. Or evidence for iffy reporting.

> SAN ANTONIO - Heavy rain caused flash flooding Monday night in northern
> Bexar County and extending into Comal County. Some areas had up to 9 inches
> of rain, and water rose on some roadways, including Interstate 10.

> Tuesday morning, the National Weather Service said via social media that
> parts of San Antonio had up to 6.07 inches of rain, which "smashes" the
> daily rainfall record of 1.76 inches from 1889.

> Over 9 inches of rain was observed around Stone Oak Parkway, and more than 8
> inches between Shavano Park and Camp Bullis, according to the NWS website.

[https://www.mysanantonio.com/news/weather/article/mysananton...](https://www.mysanantonio.com/news/weather/article/mysanantonio.com/news/weather/article/Heavy-rainfall-smashes-daily-record-as-storms-13202276.php&ipid=artem)

------
Someone1234
Visual Studio Online has been offline all day. They say it is due to the same
Azure outage. This has had a productivity impact.

If Microsoft didn't own GitHub, this may have prompted a move, but since they
do, it seems a little redundant given that GitHub will likely be on Azure too
before long.

[https://blogs.msdn.microsoft.com/vsoservice/?p=17405](https://blogs.msdn.microsoft.com/vsoservice/?p=17405)

~~~
okcwarrior
Having VSTS down all day meant I got exactly 0 done today. Completely crazy to
me.

~~~
toomuchtodo
Not intended to be snarky: why is this crazy to you? All cloud providers have
had downtime incidents, major hosted VCS providers, SaaS products. Downtime is
a fact of life in tech.

~~~
asdfasgasdgasdg
Hrm. I've worked at a large tech firm for more than a decade and there has
never been a full day where VCS or the build farm were down all day. It's
notable when it's down for more than twenty minutes.

~~~
toomuchtodo
As counterpoint, I’ve seen banks down for extended periods of time, hours
occasionally stretching to a day or two (TSB, Lloyds, Bank of America,
BankSimple [BBVA]). Downtime is a fact of life. Google, Amazon, IBM, and
Microsoft have had major cloud outages. GitLab nuked their production DB.
Slack and Reddit are frequently down.

Unless it’s life critical (911, air traffic control), if it’s down it’s only
going to hamper productivity, but it’ll be back eventually. Time to stretch
and get a coffee, and if it’s all day, going home and we’ll start fresh
tomorrow.

We’re not saving lives, we’re just building websites. Downtime isn’t shameful,
it happens to all of us.

~~~
dustinmoris
If a single bank is down or Reddit (lol) then the impact is fairly limited,
but if one of the 3 major cloud providers, which powers large parts of the
internet is down for an entire day, then the impact is a little bit more
critical I would say ;).

There's a reason why Azure has a SLA and Reddit doesn't ;)

Also if you start comparing the big companies with GitLab then we don't have
to continue talking anymore. It's not ok to nuke your production db and that's
why everyone in the tech scene laughs about GitLab and comparing them to Azure
is like comparing a lego house to brick and mortar.

~~~
dsumenkovic
At GitLab we are always trying to iterate and we learned a lot from our
incident with the database.

The one thing we are proud of is our transparency. The community really
appreciates our openness and we are happy about it.

Here's one example: [https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-...](https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-incident/)

------
king_magic
Edit: <snipped> out my rant.

It's been a long day because of this. Just going to leave it at that.

~~~
ohthehugemanate
They have some services that are "global", ie not tied to a given region.
Those services' requests are actually processed all over the place, but south
central is a big datacenter. The 9th biggest in the world, apparently. When it
lost cooling and shut down, everything routed around it as planned... But it
caused so much extra traffic that it overwhelmed the connections to other
datacenters. The backlog of requests is tremendous of course, so even after
they got south central back up, all the other datacenters are way over their
traffic capacity. They've got the datacenter back up, and are now restoring
storage and storage dependent services.

Honestly it's hard to imagine a good mitigation for this. "Build more
datacenters" is already happening as fast as it can. "Keep enough spare
capacity around to handle losing one of the biggest datacenters in the world"
is pretty unreasonable.
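
To make the capacity argument concrete, here is a toy model of what happens when one site's load is pushed onto the others. It is only a sketch: the capacities and loads are made-up numbers, not Azure figures, and it assumes the failed site's traffic is spread over the survivors in proportion to their capacity.

```python
def utilization_after_loss(capacities, loads, failed):
    """Per-site utilization after site `failed` goes offline.

    Assumption (hypothetical model): the failed site's load is spread over
    the surviving sites in proportion to their capacity.
    """
    survivors = [i for i in range(len(capacities)) if i != failed]
    surviving_capacity = sum(capacities[i] for i in survivors)
    result = {}
    for i in survivors:
        extra = loads[failed] * capacities[i] / surviving_capacity
        result[i] = (loads[i] + extra) / capacities[i]
    return result


def max_safe_utilization(n_sites):
    """Highest steady-state utilization at which n equal-sized sites can
    absorb the loss of one of them without exceeding 100%."""
    return (n_sites - 1) / n_sites


# Illustrative numbers only: three sites of size 100 and one of size 200,
# all running at 70% before the big one fails.
capacities = [100, 100, 100, 200]
loads = [70, 70, 70, 140]
after = utilization_after_loss(capacities, loads, failed=3)
overloaded = [site for site, u in after.items() if u > 1.0]
```

In this example every surviving site ends up above 100% utilization, which is the overload described above. With equal-sized sites the break-even point is (n - 1) / n, so four sites would each have to stay at or below 75% to survive losing one.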

If you, as a customer, are uptime focused enough that it's worth paying extra,
then the sensible practice has always been cross-cloud
infrastructure/failovers. At least since the Amazon Easter failure of 2011.
That's what giants like Netflix do.

~~~
eric_b
> "Keep enough spare capacity around to handle losing one of the biggest
> datacenters in the world" is pretty unreasonable.

Err what?

It's entirely reasonable to expect Azure to handle the loss of a single DC and
not have a 14+ hour global outage. I don't care how big the DC is, losing one
should not take out the world, especially not for the length of time this one
has been going on.

~~~
cloakandswagger
Indeed. This article by AWS VP James Hamilton gives a unique insight into how
Amazon approaches the problem of sizing data centers for redundancy:

[https://perspectives.mvdirona.com/2017/04/how-many-data-cent...](https://perspectives.mvdirona.com/2017/04/how-many-data-centers-needed-world-wide/)

~~~
msandford
Here I was hoping this was a reference to James Mickens:
[https://blogs.microsoft.com/ai/james-mickens-the-funniest-ma...](https://blogs.microsoft.com/ai/james-mickens-the-funniest-man-in-microsoft-research/)

------
dmarlow
The worst part has been the poor communication. If they were to give clearer
insight from the get go, that'd give me more confidence and patience. Saying
"check back in 2 hours" isn't useful.

~~~
mrep
> Saying "check back in 2 hours" isn't useful.

Having worked for a cloud provider, the reason they are saying that is because
they are actively working to understand and fix the problem but haven't come
to a well resound solution and thus they cannot give you a decent time
estimate because you will probably get even more mad if they under/over
estimate the time it took to fix it.

~~~
dmarlow
If they said this, "because they are actively working to understand and fix
the problem but haven't come to a well resound solution and thus they cannot
give you a decent time estimate because you will probably get even more mad if
they under/over estimate the time it took to fix it." I would thank and
applaud them. Tell me what it is you're doing at least. Why don't you
understand the problem? What are you investigating? Some transparency goes a
long way for me.

~~~
Vinnl
I used to think like this too - e.g. I was happy when our national rail
started announcing the cause of delays. But then a friend of mine was
complaining that they did this, because he didn't want to be troubled with
their internal problems - "just tell me what to do".

When your customers' demands are so directly opposed, you're somewhat caught
between a rock and a hard place.

~~~
madmulita
You can easily reflect both positions in your status page.

Those who don't need to be bothered with the details can refrain from reading
them.

~~~
LeftTurnSignal
I really don't understand why everything has to be black or white with any of
this stuff.

All this would take to keep both sides happy is a little link with "more info"
below it.

Why things like this are so difficult, I will never understand.

------
GFischer
We're affected by this issue. And we had our alerts system in Azure as well,
so we didn't get alerts about the outage (welp).
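
One common way around this is a heartbeat ("dead man's switch") watchdog that runs somewhere outside the primary provider and raises an alarm when check-ins stop, rather than relying on the failing platform to send an alert. A minimal sketch, where the class name, method names and timeout are all hypothetical and nothing here reflects a specific monitoring product:

```python
import time


class HeartbeatWatchdog:
    """Tracks the last check-in time per service and reports silent ones.

    Intended to run on infrastructure that does not share a failure domain
    with the services it watches (an assumption of this sketch).
    """

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self.timeout_seconds = timeout_seconds
        self._clock = clock  # injectable so the logic can be tested
        self._last_seen = {}

    def beat(self, service):
        """Record a heartbeat from `service`."""
        self._last_seen[service] = self._clock()

    def silent_services(self):
        """Return the names of services whose last heartbeat is too old."""
        now = self._clock()
        return sorted(
            name
            for name, seen in self._last_seen.items()
            if now - seen > self.timeout_seconds
        )
```

The watched system calls `beat()` on a schedule, and the watchdog pages someone through an independent channel whenever `silent_services()` is non-empty. The point of the design is that silence itself triggers the alert.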

~~~
ZainRiz
That's why T-Mobile's on-call engineers carry around AT&T phones.

(source: friend who's an engineer at T-Mobile)

~~~
Theodores
That is a top 'did you know' factoid that I am sure I will tell others.

But do AT&T engineers carry T-Mobile phones?

If yes then they should put themselves together a deal so that none of the
on-call engineers have to worry about running up big bills using their phones.
When there are freak weather events they are all in it together.

~~~
i_have_to_speak
"factoid": I don't think it means what you think it means.

[https://www.merriam-webster.com/dictionary/factoid](https://www.merriam-webster.com/dictionary/factoid)

~~~
fphhotchips
The second definition there fits perfectly.

------
addicted
VSTS is still down for us. TFS hosted code repos along with our entire bug
system on VSTS means that no work is being done.

I suspect we are gonna have to wait at least one other day at best for this to
resolve. Meanwhile my local code goes even more out of sync.

I’m probably just gonna spin up a git repo on my local machine and use that to
share code with my team.

~~~
ethomson
PM for VSTS here. The final scale unit in South Central US was brought back
online a few hours ago, which means that the final accounts that were affected
should now be operational. We're still restoring package management to some
accounts, but otherwise, you should be back to working. Please feel free to
reach out to me if you're still having trouble. Email is my HN username @
microsoft.com. We're _very_ sorry for this very significant outage.

------
ledriveby
Kind of surprised by the lack of redundancy, especially for their first party
products. Shouldn't they be deploying to more than one failure zone?

~~~
Mythroat
Like any major service, I'm sure they do. But also like any other service, how
well all that resiliency is tested in the real world is a separate question.
And today we have an answer.

Anyway I have a good guess as to what most of the employees there are going to
be doing for the next six months.

------
Memosyne
The Visual Studio Marketplace is also down
[https://marketplace.visualstudio.com/](https://marketplace.visualstudio.com/).

~~~
whoisjuan
Just today I was having issues with the Prettier extension in VS Code, and I
uninstalled it to see if that would fix it (I read that usually fixes the
issues I was having). Then I realized that I couldn't install it again because
VS Marketplace was down. This was like 8 hours ago and still no signs of
recovery. Of course, all my builds are failing because of some stupid
formatting issue that Prettier usually would solve, so yeah..thanks MSFT.

~~~
TeMPOraL
> _Of course, all my builds are failing because of some stupid formatting
> issue that Prettier usually would solve, so yeah..thanks MSFT._

Is failing builds due to _formatting_ issues really a sound setup?

~~~
amaccuish
I assume their build system checks for formatting and will raise an error if
it doesn't conform. And this person would use the VS Code extension to
auto-conform their code.

~~~
TeMPOraL
I'm questioning the soundness of such a setup in general, and especially if it
means that losing connection to a third-party prettifier makes you unable to
work on your own codebase.

~~~
Vinnl
Hmm, the alternatives are not enforcing a similar code style, or enforcing it
earlier on (e.g. on commit). I can understand why they would not want the
former, and the latter is more annoying when experimenting, i.e. when code
style does not matter that much yet. Thus, in CI sounds like the right choice.

~~~
w0m
I'd go for precommit hook or similar, but not a huge deal.
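
For illustration, a style check does not have to depend on an editor extension being installable. Below is a minimal sketch of a checker that could be called from either a pre-commit hook or a CI step. The rules (trailing whitespace, a 100-character limit, a final newline) and the function names are hypothetical stand-ins, not Prettier's behavior.

```python
def find_format_violations(text, max_line_length=100):
    """Return a list of (line_number, message) tuples for simple style issues."""
    violations = []
    lines = text.splitlines()
    for number, line in enumerate(lines, start=1):
        if line != line.rstrip():
            violations.append((number, "trailing whitespace"))
        if len(line) > max_line_length:
            violations.append((number, "line too long"))
    # A non-empty file should end with a newline.
    if text and not text.endswith("\n"):
        violations.append((len(lines), "no newline at end of file"))
    return violations


def check_files(paths):
    """Print violations for each file and return a process exit code (0 or 1)."""
    failed = False
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            for number, message in find_format_violations(handle.read()):
                print(f"{path}:{number}: {message}")
                failed = True
    return 1 if failed else 0
```

A hook or CI job would pass the changed file paths to `check_files` and exit with its return value. Because the same script runs locally and in the build, a formatter outage does not block anyone from seeing what needs fixing.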

------
plasma
The outage appears to be ongoing, and it's having ripple effects in other
regions (management portal unresponsive, autoscale and other services not
firing in West US at least for me).

Also unable to lodge a support ticket because the portal fails to identify me
as having paid support (that API request appears to time out).

------
gameswithgo
Yeah our company's website was down all day due to this. We are looking at
ways to mitigate in the future.

------
pcunite
Here are some official links on the issue:

\- [https://azure.microsoft.com/en-us/status/](https://azure.microsoft.com/en-us/status/)

\- [https://twitter.com/AzureSupport](https://twitter.com/AzureSupport)

\- [https://twitter.com/Office365Status](https://twitter.com/Office365Status)

\- [https://status.office.com](https://status.office.com)

~~~
tialaramex
As of 07:15 UTC on the 5th of September 2018 the Azure status message
reported:

"NEXT UPDATE: The next update will be provided by 07:00 UTC 05 Sep 2018 or as
events warrant."

As I finished writing this they finally updated with essentially the same
message except stalling for an additional two hours.

So, if you're thinking "Well, surprises will happen". Yeah, and Microsoft is
not actually prepared for that at all, so, sucks to be their customer I guess?

------
karmicthreat
So AWS has had some big outages, as has Azure. Has GCP had any big outages
yet?

~~~
Someone1234
AWS and Azure have had "big" outages because people actually use them.
Rackspace and IBM are almost neck and neck with Google's best efforts (3%
market share vs. 30%/40% for Azure/AWS)[0].

[0] [https://www.skyhighnetworks.com/cloud-security-blog/microsof...](https://www.skyhighnetworks.com/cloud-security-blog/microsoft-azure-closes-iaas-adoption-gap-with-amazon-aws/)

~~~
bitmapbrother
The biggest and most popular services run on Google Cloud.

[https://cloud.google.com/customers/](https://cloud.google.com/customers/)

You know....services consumers actually use.

~~~
AaronFriel
I'm sure services I use run on all of the major cloud providers, but if that
page was supposed to be enlightening, I only recognized one brand from the
first page of customers.

I refreshed a couple times, and sure, I saw more (on average 1 or 2) that I
recognized on each page. But I don't think your response is particularly
persuasive. Are you suggesting that the services that I use that run on AWS
are in fact, not services I actually use?

Or am I not a consumer? I'm confused.

Edit: Do you hold any Alphabet/Google stock? I've noticed your comment history
trends toward dismissing criticism of Google, praising their products, and
taking opportunities to speak about the flaws of their top competitors.

~~~
bitmapbrother
>I'm sure services I use run on all of the major cloud providers, but if that
page was supposed to be enlightening, I only recognized one brand from the
first page of customers.

So the first page was supposed to be indicative of all of the popular consumer
facing services they host? Here, let me help you out: Spotify, eBay, Twitter,
Apple iCloud, Verizon, Vimeo, Netflix, etc

>I refreshed a couple times, and sure, I saw more (on average 1 or 2) that I
recognized on each page. But I don't think your response is particularly
persuasive. Are you suggesting that the services that I use that run on AWS
are in fact, not services I actually use?

What popular consumer services were on AWS again?

>Edit: Do you hold any Alphabet/Google stock? I've noticed your comment
history trends toward dismissing criticism of Google, praising their products,
and taking opportunities to speak about the flaws of their top competitors.

Do you own Microsoft stock? Because quite a few of your posts seem to praise
their products and services. Do you work for them?

~~~
dang
We've banned this account. All it has done is aggressively post pro-Google
comments and diss Google competitors.

Single-purpose accounts are not allowed here, especially not when pushing an
agenda, and most of all not when pushing corporate propaganda. Of all the
things that make HN users angry, that's at the top. And I agree with them.

Most of the time we tell HN users that they're not allowed to accuse each
other of astroturfing. When we do find a clear-cut case of abuse that's been
getting away with it for this long, I get pretty steamed.

You've also frequently broken the site guidelines by being uncivil, so much so
that we've warned you at least half a dozen times. That's more than enough
reason to ban you in its own right.

------
filomeno
We all ditched the Unix model of a central server with dumb terminals around
because Microsoft told us the future was everybody having a full OS in their
workstation. Now they tell us the future is going back 30 years, and having
all of our data and programs in somebody else's machines (theirs).

Sometimes I can't understand people.

------
partiallypro
You can use a VPN to view the status page if it's erroring out for you, I'd
also suggest trying to clear your local DNS cache.

------
p2t2p
And I was wondering why my App.Gateways were not deploying last night.

------
tracker1
The Extensions Marketplace for VS Code also seems offline, still.

