

Microsoft Azure Outage - wenbert
http://azure.microsoft.com/en-us/status/#current
Is Azure down?

Status: http://azure.microsoft.com/en-us/status/#current

Twitter: https://twitter.com/search?f=realtime&q=azure&src=typd
======
pkorzeniewski
Azure support is probably the worst I have ever dealt with. When my account
(and the service itself) stopped working, I didn't receive any email. When I
tried to sign in, all I got was some generic error saying "There's something
wrong with your account". My services of course were down and I couldn't do
ANYTHING. I contacted support, only to learn that my account had been blocked
(!) because there was some suspicious (!!) activity going on. What the... No,
they couldn't tell me what exactly it was. I exchanged emails back and
forth with support for several days and learned nothing new; my account and
services were still disabled and I was more than pissed off. Since that day I
have hated Azure and I advise everyone against using it, because such a
situation is absolutely unacceptable.

~~~
atmosx
Days? What do you mean, "days"???? Azure was down for like 9 hours.

~~~
tiagocesar
I love how deleted comments actually never disappear.

------
photorized
I feel like an idiot. MS featured my Azure startup today, quoting me about
overall stability etc (which has been the case for us, until today). They then
proceeded to go down, taking all our production systems with them.

(yes we do have AWS, too)

Sigh.

~~~
higherpurpose
Azure has had quite a few outages this year. I'd say it's already lower than
that 99.999 percent uptime or w/e they are advertising.

~~~
toyg
0.001% of 365 days is 8.76 hours. So yeah, shot for the year; but of course
they'll do some "hollywood timekeeping" (or just ignore the matter altogether)
and keep advertising...

~~~
kazoolist
0.001% of 365 days = 5.26 minutes
([http://www.wolframalpha.com/input/?i=0.001%25+of+365+days](http://www.wolframalpha.com/input/?i=0.001%25+of+365+days))
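
For anyone who wants to check the arithmetic without a calculator, here is a minimal Python sketch of the downtime budget implied by an uptime percentage. The function name `downtime_budget_minutes` is made up for illustration, and the percentages are just the figures floated in this thread, not a statement of what any provider's SLA actually promises.

```python
def downtime_budget_minutes(uptime_percent: float, period_days: float = 365.0) -> float:
    """Allowed downtime, in minutes, for an uptime percentage over a period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (100.0 - uptime_percent) / 100.0


# Percentages mentioned in this thread; the default period is a 365-day year.
for uptime in (99.9, 99.99, 99.999):
    minutes = downtime_budget_minutes(uptime)
    print(f"{uptime}% uptime allows {minutes:.2f} minutes ({minutes / 60:.2f} hours) per year")
```

By this arithmetic, 8.76 hours is the yearly budget for 99.9%, while 99.999% leaves only about 5.26 minutes.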

------
silverbax88
The idea of cloud storage being down is less of an issue - I don't like it,
but I understand it. What bothers me about this is:

1\. I was never notified of the outage. I noticed it myself when attempting to
log into one of my VMs and then started looking for status updates. Sadly, the
best status updates I got were here on Hacker News.

2\. When my servers did come back up, at least one of my IP addresses had
changed, which meant I had to update all of the relevant DNS entries (which,
as everyone here no doubt knows, can take up to 48 hours to propagate). I was
never notified of this change in any way.

~~~
Maarten88
2\. Azure does not guarantee that you keep your IP address by default. You
should configure a CNAME if you use Azure Websites, or get a reserved IP
address, available with Cloud Services.

~~~
silverbax88
Actually, I am supposed to have a fixed IP. And I have CNAMES configured. I
will review to determine if there is some other way I can set this up, but my
issue is that I would never allow my products to be offline for hours without
notifying my customers.

~~~
coreysa
First, I am really sorry about the impact the incident had on your service.
Specifically, for your changing IP, are you currently using the reserved IP
feature? You can reserve both your external IP and your internal IP.

External IP: [http://msdn.microsoft.com/en-us/library/azure/dn690120.aspx](http://msdn.microsoft.com/en-us/library/azure/dn690120.aspx)

Internal IP: [http://msdn.microsoft.com/en-us/library/azure/dn630228.aspx](http://msdn.microsoft.com/en-us/library/azure/dn630228.aspx)

------
patwhite
So, the worst part about this is that zero communication has come out of
Microsoft - we first started seeing issues on Sunday and filed a ticket, had
an open ticket while this larger outage happened, and haven't gotten a single
email saying there's an outage. I found out about it from, sigh, buzzfeed.

Question - are AWS or GCE better at proactively messaging when there's an
outage?

~~~
nkvoll
I've never ever received a message from AWS when they've had outages that have
affected us significantly. On the contrary, there have been multiple cases
where we've experienced issues, contacted them, and it's taken a few hours
before they realized they were actually having infrastructure problems. Many of
these don't even get an entry on their service status pages. So there's still
a lot of room for improvement on AWS's side of things as well.

~~~
bad_user
I can confirm this. I remember once when half of the Internet was down and the
status reported for EC2 was yellow - experiencing some minor issues :-)

And I found out about it by yelling at Heroku - they told me that Amazon was
having issues before Amazon's status turned yellow.

~~~
kalleboo
Usually when AWS has an outage they have a nice green circle but with a small
blue "i" next to it that you need a loupe to see. Extremely dishonest.

------
jread
I run a site that monitors cloud service availability. Based on VMs and Blob
storage containers I maintain and monitor, the outage affected every US Azure
region with 1-2 hours of downtime: [https://cloudharmony.com/status-for-
azure](https://cloudharmony.com/status-for-azure)

~~~
razzberryman
Wow, comparing that to AWS is staggering!

[https://cloudharmony.com/status-for-aws](https://cloudharmony.com/status-for-aws)

~~~
etha
It's not as if AWS has never gone down
([http://aws.amazon.com/message/65648/](http://aws.amazon.com/message/65648/)).
It just hasn't had a major outage in the last 30 days.

~~~
rbanffy
Comparing [https://cloudharmony.com/status-1year-for-
azure](https://cloudharmony.com/status-1year-for-azure) with
[https://cloudharmony.com/status-1year-for-
aws](https://cloudharmony.com/status-1year-for-aws) tells a different story.

------
zaroth
That status table with all the randomly located green checks is painful to
look at... I guess a green check in the 'Global' column implies a green check
in all location specific columns? But what about all the rows which have no
Global green check, but most columns are still empty? Are those regions where
the service is not deployed? Can we gray out those boxes or something if they
are 'N/A'?

Also, funny if you try to zoom out in Chrome to see the whole thing, the row
headers get out of alignment.

Why would I want to 'X' out specific rows/columns in the table? It was so
complicated to begin with, someone thought adding more complication through
end-user customization was a good idea? I just noticed, you can even expand
some of the rows...

Seriously, a status page should tell you either "It's up" or "What's down".
It's not even showing history over time, this is just a snapshot. The text at
the top directly contradicts the icons in the table, making the whole thing
even more ridiculous.

The footnote at the bottom is the best, "The Australia Regions are available
only to customers with billing addresses in Australia and New Zealand." Thanks
for that useful nugget! /s

~~~
unclesaamm
Running Chrome 38.0.2125.111 m, zoomed out row headers look fine

~~~
ballstothewalls
Are you confusing row headers and column headers? I have the same version as
you and the row headers got funky when zoomed out.

------
matthewking
The most damaging part to me is that "All good! Everything is running great."
message on the status page.

Mistakes happen, services go down, I can get over that. What matters is how
it's dealt with. At the moment I would not want to be an Azure customer dealing
with 9 hours+ downtime whilst MS are saying everything is great. At the very
least change it to "Having some issues" or similar!

------
teovall
The postmortem for this should make for a good read. How does storage go down
in eleven regions at once?

~~~
ohyesyodo
Just apply same buggy network patch to all DCs at once? They use software
networking so causing something like this should be easy. Or mess up network
routing for *.blob.core.windows.net which pretty much all of Azure relies on.

~~~
icebraining
Isn't applying the same patch everywhere at once a major anti-pattern?

~~~
ohyesyodo
Turns out this was exactly what happened - they applied a buggy patch to all
data centers at once by mistake.
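
For illustration, the alternative to patching everything at once is a staged rollout: patch a small group, check health, and only then move on. The Python sketch below is a toy under stated assumptions, not a description of how Microsoft deploys anything. The names `staged_rollout`, `apply_patch` and `is_healthy` and the region labels are all hypothetical.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    stages: Sequence[Sequence[str]],
    apply_patch: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> List[str]:
    """Patch one stage at a time and stop after the first unhealthy stage.

    Returns the targets that were patched before the rollout stopped.
    """
    patched: List[str] = []
    for stage in stages:
        for target in stage:
            apply_patch(target)
            patched.append(target)
        # Gate: do not touch the next stage unless this one is fully healthy.
        if not all(is_healthy(target) for target in stage):
            break
    return patched


# Toy run with made-up region names: the patch "breaks" region-b.
stages = [["canary"], ["region-a", "region-b"], ["region-c", "region-d"]]
applied: List[str] = []
result = staged_rollout(stages, applied.append, lambda target: target != "region-b")
print(result)  # region-c and region-d are never patched
```

The point of the gate is that a bad patch stops at the stage where it first shows up instead of reaching every data center.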

------
inglor
Our sites have been down for more than 3 hours now.

EDIT2: Now the databases are down, this is costing us a lot of money. EDIT:
Just went up again.

It would be great if anyone knows how to mitigate these in the future - what
can I do to protect myself against this in the future? (Except leave Azure)

~~~
joshuak
Major outages should absolutely weigh into your decisions as to what platform
to use. That being said you can mitigate the effect of instability by
engineering your app to failover to other availability zones or even to
another cloud platform (depending on your app) if the entire platform goes
down.

Obviously there is a significant cost associated with engineering this level
of cross-platform redundancy, which is why reliability is an important factor
in making your platform choices. If you can tolerate some downtime, you can be
more flexible; otherwise it will cost you one way or the other.

In any case you should consider having a user notification site set up on a
completely different service (or two) so that _when_ things go wrong you can
redirect everyone to that site to keep your customers informed. This is
especially important when you have partial outages that could create
inconsistencies in your database or application state if you were to continue
to allow users to interact with it in a degraded state.

~~~
inglor
Thanks! This is very helpful.

Our big site, which is hosted in Europe, is actually working, but our blogs and
a news website are both down. We offer a paid service at $600 a year and if
the main site was down it would be very bad for our reputation.

Our DNS points to Azure on all these domains and things are hosted as "Azure
Web Site" \- how would notifications work if Azure itself is failing? Would I
need to proxy the traffic through elsewhere?

Are there any services that solve this problem for me? I really don't mind
paying a few dollars every month and not worry about this.

~~~
joshuak
There are any number of uptime and ping services that you can google for.
These can raise the alarm in a timely fashion when your site, or parts of your
site, go down, and then you decide how to handle those issues.

You may also want to google for DNS failover services, to help you
automatically redirect traffic in more catastrophic failure cases. There are
offerings from Google[1], AWS[2], and others.

[1]: [https://cloud.google.com/dns/docs](https://cloud.google.com/dns/docs)

[2]: [http://aws.amazon.com/route53/](http://aws.amazon.com/route53/)
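
To make the idea concrete, here is a rough Python sketch of what an uptime check plus failover decision boils down to: probe each endpoint in priority order and send users to the first one that answers. It only uses the standard library, the URLs are placeholders, and `http_probe` and `first_healthy` are names invented for this example. A real DNS failover service does this from several locations and updates records for you.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable, Optional


def http_probe(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx/3xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (urllib.error.URLError, OSError, ValueError):
        # Covers HTTP errors, DNS failures, refused connections, timeouts, bad URLs.
        return False


def first_healthy(
    endpoints: Iterable[str],
    probe: Callable[[str], bool] = http_probe,
) -> Optional[str]:
    """Return the first endpoint (in priority order) that passes the probe."""
    for url in endpoints:
        if probe(url):
            return url
    return None


# Demo with a stubbed probe so the example runs without network access.
# Both URLs are placeholders, not real services.
fake_status = {
    "https://primary.example.com/health": False,   # pretend the main site is down
    "https://standby.example.net/health": True,    # notification site elsewhere
}
print(first_healthy(fake_status, probe=lambda url: fake_status[url]))
```

Keeping the probe injectable means the selection logic can be tested without touching the network, and the standby only helps if it lives on a different provider than the primary.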

------
sphildreth
So much for the idea of 99.999% uptime with the magical "cloud" buzzword. I
noticed during this downtime in North America that Word Online wasn't
functioning as my daughter tried to use it to do some homework.

~~~
matthewmacleod
_So much for the idea of 99.999% uptime with the magical "cloud" buzzword_

Who was selling that to you? Because I'm pretty sure it wasn't Microsoft…

~~~
sphildreth
Oh, it's 99.9% from Microsoft, see
[http://azure.microsoft.com/en-us/support/legal/sla/](http://azure.microsoft.com/en-us/support/legal/sla/),
which means they get about 9h a year and still hit the SLA, see
[http://uptime.is/](http://uptime.is/)

~~~
tankenmate
Except most people use monthly periods for their SLAs
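
To see why the measurement window matters, here is a rough Python sketch that turns an outage length into achieved uptime over a 30-day month versus a 365-day year. The 9-hour figure is the rough number quoted elsewhere in this thread, the 30-day month is an assumption, and `achieved_uptime_percent` is a hypothetical helper, not anything taken from Microsoft's SLA terms.

```python
def achieved_uptime_percent(outage_hours: float, period_days: float) -> float:
    """Uptime percentage for a period containing one outage of the given length."""
    period_hours = period_days * 24
    return 100.0 * (period_hours - outage_hours) / period_hours


OUTAGE_HOURS = 9.0  # rough figure quoted in this thread, not an official number

for label, days in (("30-day month", 30), ("365-day year", 365)):
    print(f"{label}: {achieved_uptime_percent(OUTAGE_HOURS, days):.3f}% uptime")
```

Either way a 9-hour outage lands below 99.9%, but measured over a single month the miss is far larger (98.75%) than when it is averaged over a year.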

------
kelvin0
You should try Google's App Engine (paid premium account) tech support when
your critical files disappear. Can't be any worse than this ... That's the
problem with these hosted cloud solutions, your systems are at the mercy of
the bad tech support. Try explaining that to your own customers ...

------
nnx
Actual link to status page: [http://azure.microsoft.com/en-
us/status/#current](http://azure.microsoft.com/en-us/status/#current)

(not that convenient to copy paste the OP link from a mobile device)

------
gwgwegewg
Microsoft are refusing to help us with our downed servers because we don't
have a support contract. The outage is their issue not ours!!

~~~
ownagefool
If your app is down, it sounds very much like it's your problem.

While you're obviously going to be unhappy with downtime, this is a genuine
part of the calculation you should have made when you decided to outsource all
your eggs into one basket.

------
codeshaman
As more and more services and apps depend on 'the cloud', I'm wondering, how
many of them would survive a major cloud outage: the cloud company going
bankrupt, stock market crash or economic meltdown, a malware exploiting a
major server-side bug (like heartbleed or shellshock, but worse) wiping or
encrypting the data on the infrastructure/user machines.

How much of the user's data would be forever lost in such an event ?

The other aspect is privacy - in theory, all user's data can be stored and
accessed forever, eg. 20 years from now, when the reincarnation of someone
like Stalin comes to power.

Anyway, the point I'm trying to make is that we should design our services or
apps with this in mind - the cloud can and will fail from time to time, maybe
forever. So, if possible, use the cloud as a 'bonus' feature, a means to back
up data, and store the user's data offline, so that when the dark day comes at
least the user still has his data.

~~~
maccard
> The other aspect is privacy - in theory, all user's data can be stored and
> accessed forever, eg. 20 years from now, when the reincarnation of someone
> like Stalin comes to power.

Is having your stuff stored locally any more secure in that situation? If
someone wants your data they'll knock on your door and beat you and your
family until you give it to them.

~~~
jacalata
If you have the only copy, you can destroy it.

~~~
rbanffy
They will still beat you up until you produce the data you have destroyed,
which is until they get tired of beating you up. You could keep some decoy
data you produce in such situations, preferably before the beating starts.

------
coldtea
Reality call: ANY and ALL Cloud services, be it Google, Azure, AWS etc, will
be down for hours at some point every few years.

~~~
ExpiredLink
Reality call: ANY and ALL services, be it local or remote, will be down for
hours at some point every few years.

~~~
Drakim
If it's my own fault, I can at least curse my own lack of knowledge and
expertise, and I can strive to do better in the future.

When the cloud is down, all we can do is fiddle our thumbs and hope it doesn't
happen again. Or maybe we could send an angry letter to Microsoft, and hope
somebody reads it.

~~~
cmdkeen
It's about abstracting away the cost of it being your own fault. Realistically
the cost of employing enough people, and buying enough hardware, to provide
anything close to 99.X% uptime is much more than punting that over to a Cloud
Provider.

~~~
vidarh
I've found very few cases where "punting that over to a cloud provider" has
been remotely cost effective for base load. It's gotten closer over the years,
but the gap is still massive for all but some very specific types of
workloads.

It's great for convenience, and it's great for managing without certain
skillsets that may be hard to obtain, and it's great for temporary capacity,
but it's not cheap.

~~~
cmdkeen
It's not cheap to have any confidence of any uptime realistically at all. The
thing is that most people either live without that guarantee, or just get
lucky enough not to care. It becomes problematic if you've made promises to
others about uptime that are built on a house of sand.

Unless your base load cloud costs are more than the cost of full time, ready
at a moment's notice, experienced ops people you don't get close to any
guarantee of uptime by non-managed hosting. The salary cost alone of that is
substantial, let alone hardware spread across multiple locations. My firm pays
at least 7 figures a year on IT ops and doesn't come close to 99.9% uptime
across everything.

------
jedgrant
Regretting the decision to go with Azure. Talk about terrible timing. We have
media outlets interested in our site, we send info and the site is dead. Talk
about a crap first impression.

~~~
craigvn
It is totally frustrating, but at the end of the day similar outages happen
with all cloud providers.

~~~
ZoF
Can you point to an example of this happening on AWS in multiple regions
simultaneously?

~~~
dangrossman
Yes. The day half the internet seemed to be down. It took Amazon more than 24
hours to recover, and having your services in multiple availability zones did
not shield you from the failure.

[http://aws.amazon.com/message/65648/](http://aws.amazon.com/message/65648/)

~~~
tarblog
This seems to have only affected one region. Am I missing something?

~~~
dangrossman
Yes. It started as a failure in one region, and propagated to others as it
overloaded the "control plane" \-- the stuff that runs "the cloud", and EBS
tried to replicate "failed" disks to the point that Amazon ran out of disk
space in the cluster. At the time, I was paying for RDS Multi-AZ which runs
your database in multiple availability zones at once with hot failover if the
primary goes offline. It failed to fail over despite that. Many large sites
went down for a very long time that day, and people couldn't spawn replacement
instances even in other AZs than the one the failure started in.

~~~
jedberg
You're confusing region with AZ. They've never had a multi-region outage
(yet).

------
toddgardner
Our VMs and websites on USEast are unreachable, however our storage seems to
be working fine. There is something very backwards with how they are
communicating this outage.

------
Beached
This may be greater than just West Europe. I personally have servers in US
East that are unreachable, and there are a few reports from others in the US
region of partial unavailability for their US-based servers.

I wonder how many customers Azure just lost due to their unexpected 2 day
fiasco

~~~
ceejayoz
> I wonder how many customers Azure just lost due to their unexpected 2 day
> fiasco

Amazon had a number of EBS fiascoes and survived just fine. I'd expect Azure
to do the same.

~~~
LamaOfRuin
Each of Amazon's high profile failures did have many people formulating
previously non-existent escape plans though, and there are now several
alternatives in this space that can offer the same scale.

It's obviously not going to destroy anyone's business, but there is a lot more
competition than there used to be.

------
inglor
We've been noticing ups and downs for the last few hours of our VM powering an
important database in West Europe.

Seriously considering another layer above azure to mitigate this in the
future. Very disappointing to see.

At least initially their status indicated they're handling the problem, but
lately it's just been "All Good", and they said they resolved it on Twitter, but
it's not at 100% yet: [http://azure.microsoft.com/en-us/status/](http://azure.microsoft.com/en-us/status/)

~~~
joshmlewis
Someone else mentioned routing to divert traffic to working data centers. That
might be an option for you.

------
Varcht
Oh no, did we break the status page too? Sorry Azure team, really didn't mean
to pile on!

~~~
bengali3
keeping the load light? <html><head></head><body>The page cannot be displayed
because an internal server error has occurred.</body></html>

------
elpool2
Yup! Azure websites and Storage are down in multiple regions.

~~~
andrea_s
VMs too... At least for Western Europe

------
syassami
Storage, Websites and Visual Studio Online - Multiple Regions - Partial
Service Interruption (5 mins ago)

Starting at 19 Nov 2014 00:52 UTC we are
experiencing a connectivity issue to Azure Services including Storage,
Websites and Visual Studio Online. The next update will be provided in 60
minutes.

------
wenbert
Well, their status page is telling lies.

------
bursteg
Storage is the source of the outage, and most of the services rely on it, so
they are all impacted.

------
plasma
Still down even 2 hours later, regardless of the status page saying it's OK.

------
jmnicolas
Judging by how cloud services "frequently" go down when everything is normal,
it makes me wonder what would happen in case of a real problem (volcano
eruption, social unrest, nuclear disaster, alien invasion ...). I still don't
get the cloud infatuation, and no you don't have to get off my lawn, I'm
"only" 36 (yeah I know, in IT I'm already a dinosaur).

~~~
freehunter
What would happen to your own datacenter in case of a similar disaster? Your
servers would go down and you would spin up from your disaster recovery site.
Cloud doesn't mean you don't need a DR plan anymore.

Put your servers in different regions, use Azure/Google, BlueMix/AWS, or even
hybrid cloud, do something. Have a DR plan.

~~~
jmnicolas
I'm thinking as the little guy here : not data center but personal computers.

If the disaster strikes my region, I probably have better things to do than IT
things (like running for my life :-).

But with the cloud the disaster could be thousands of kilometers away and still
affect me. That's the problem with the cloud: why should I stop working in my
remote French town because there's a landslide in Ireland (or wherever they
put the European cloud data centers)?

I don't say the cloud doesn't have its uses (especially as a redundant backup
far far away) but the all-cloud model has way more risks than people
think ... and vendors don't rush to explain that.

I'm one of those guys who think the future will be more and more harsh for
western civilization (think collapse of the Soviet Union). There will be less
money for everything, infrastructure in particular, things will fail and you
will have to deal with it locally and the DIY way.

------
scientist
See also
[https://news.ycombinator.com/item?id=8627630](https://news.ycombinator.com/item?id=8627630)

------
us0r
My VMs are down. This must be something major.

~~~
plasma
I think because the disks are backed by blob storage.

------
iancarroll
Seems to be back up now, my site ([https://ian.sh](https://ian.sh)) was down
for a while.

------
scoj
There really isn't anything I can do either. My VM isn't back up yet. I'd go
to sleep and just expect it to be online in the morning (when it really
matters), but I'm afraid a drive won't reattach or something like that.
Meanwhile, twiddling thumbs...hit F5...twiddle thumbs...

------
csbowe
_The page cannot be displayed because an internal server error has occurred._

Their error pages are less graceful than mine.

------
duedl0r
come on! give them some slack.. they probably aren't very experienced at
managing their linux servers! ;)

~~~
hyperliner
Clearly neither were the developers of these apps which are now down, who
thought of spending as few pennies as possible, saved a few more pennies by
skipping load balancing failover, and then expected magic!

------
jsudhams
This is why I keep a refurbished server-class machine handy as a working backup,
so that I can restore if the service is not restored within a few minutes. Or
keep another copy of the VM/DB with another provider, like Rackspace or something.

------
sspies
Do you run multi-region or maybe multi-provider setups? How do you migrate
your instances from failed regions to healthy ones? How do you route users to
the healthy regions? DNS? Do you think anycast could be an alternative?

------
NicoJuicy
My website, my web application for member management + my clients are down :s,
I really don't like this...

Didn't receive any calls yet, but I don't think that will take long.

------
silverbax88
This is nice. My web site server IP was changed when the server came back up.
So now I have to update all of the site DNS settings.

~~~
ohyesyodo
Hmm. You should be using CNAME records rather than IP addresses. Or are you
using the new fixed IP features?

------
NinjaTime
Disgusting Virtual service

Disgusting management interface

Abysmal support

Way to fuck up a mustard sandwich Microsoftie

We moved everything we had away from that Virus named Azure.

------
scoj
My VM is still down (US East). Is anyone else still experiencing issues?

~~~
photorized
My VMs appear to be up mostly, but they are primarily in North Central.

~~~
scoj
Thanks, I just restarted it and it took a while (5 minutes or so), but after
that, it appears fine.

------
Nmachine
"Everything is running great"

~~~
smoyer
It's obviously "All Good!"

~~~
ExpiredLink
Just a data dump for the NSA. Nothing serious!

------
Aoyagi
So did anyone receive a call "The cloud is down"? Or at least an e-mail?

------
superuser2
Every time this happens, ask yourself... Are you outage-proof? Do you have a
rational reason to believe that internally-managed infrastructure would never
have a problem like this?

------
damian2000
I'm guessing this is the reason the site I was trying to load is down ...
[http://www.dotnetrocks.com/](http://www.dotnetrocks.com/)

------
csbowe
Maybe they unknowingly upgraded to Intel's latest SSDs in their storage array.
[https://news.ycombinator.com/item?id=8626928](https://news.ycombinator.com/item?id=8626928)

------
BeeDunc
This outage exposes the clowns that actually chose Azure as their cloud
provider. If you use AMZN and it goes down, at least you're in good company,
with the likes of Netflix, Twitter, Instagram, and so on. It's like yeah, I'm
big like they are. So what, it went down, so did Netflix.

What does your client/customer think of you being on Azure? That you chose the
crappy solution because your low-tech infrastructure still uses Windows, which
does not carry a lot of tech cred.

~~~
keithwarren
Over 80% of the Fortune 500 run on Azure.

20% of Azure VMs are Linux.

You are not well informed.

~~~
brongondwana
"run on" \- I suspect you're being fed a unicode pile of poo here.

More likely they have _something_ which runs on Azure. Fortune 500s are, pretty
much by definition, quite large - and probably have tons of departments and
sub departments. And at least one of those departments probably has a task of
trying out new things, like Azure, by running something on it.

What surprises me is that nearly 20% of Fortune 500s _don't_ have something
running on Azure.

(I wonder what percentage "run on" Amazon)

~~~
curiousDog
Ignoring the fortune companies, most of Microsoft's own services like Office
365 run on Azure. That's a pretty big bet right there.

~~~
apapli
Actually I don't think they do - to the best of my knowledge Office 365 didn't
have any downtime as a result of this outage. And Yammer stayed up, so I
assume they haven't yet migrated from AWS...

