
Final Root Cause Analysis of Nov 18 Azure Service Interruption - asyncwords
http://azure.microsoft.com/blog/2014/12/17/final-root-cause-analysis-and-improvement-areas-nov-18-azure-storage-service-interruption/
======
ChuckMcM
Nice writeup. I hope that the engineer in question didn't get fired or
anything. One of the challenges in SRE/Ops type organizations is to be
responsible, take ownership, put in the extra time to fix things you break,
_but keep the nerve to push out changes._ Once an ops team loses its
willingness to push large changes, the infrastructure calcifies and you have a
much bigger problem on your hands.

~~~
qeorge
I agree, and am reminded of the quote from Thomas Watson (IBM) on this:

 _“Recently, I was asked if I was going to fire an employee who made a mistake
that cost the company $600,000. No, I replied, I just spent $600,000 training
him. Why would I want somebody to hire his experience?”_

~~~
tedunangst
Almost certainly apocryphal. There are multiple versions of the quote floating
around. Also: Who was Watson talking to? How did this story get from him to
us?

~~~
michaelcampbell
I heard the exact same story at a Wall St. trading firm I worked for - except
the story was about a trade gone wrong (a trade of X shares instead of $X worth
of shares, causing the Dow to slide and all sorts of havoc).

------
tytso
The really big missing piece in this post-mortem, for me, is this: if it only
took 30 minutes to revert the original change, why did it take over ten hours
to restart the Azure Blob storage servers? This was neatly elided in the last
sentence of this paragraph of their writeup:

".... We reverted the change globally within 30 minutes of the start of the
issue which protected many Azure Blob storage Front-Ends from experiencing the
issue. The Azure Blob storage Front-Ends which already entered the infinite
loop were unable to accept any configuration changes due to the infinite loop.
These required a restart after reverting the configuration change, extending
the time to recover."

The ten-plus-hour extension made up the vast majority of the outage time; why
wasn't the reason for this given? More importantly, what will be done to
prevent a similar extension in the time Azure spends belly up if, at some point
in the future, the Blob servers go insane and have to be restarted?
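
My reading of why a restart was needed at all (a purely illustrative toy
sketch, not Azure's actual code): if configuration changes are only picked up
between iterations of the serving loop, a worker wedged in the buggy path
never sees the revert.

    # Toy sketch: a front-end that re-reads its config at the top of each
    # loop iteration. All names here are invented for illustration.
    import time

    config = {"new_blob_feature": True}

    def apply_config_change(key, value):
        # A revert pushed from outside just mutates the shared config dict...
        config[key] = value

    def serve_one_request():
        pass  # stand-in for normal request handling

    def front_end_worker():
        while True:
            if config["new_blob_feature"]:
                # ...but the bug traps the worker in this inner loop, which
                # never re-reads the config, so the revert has no effect.
                # Only restarting the process gets it back to the outer loop.
                while True:
                    time.sleep(1)
            serve_one_request()

But that still doesn't explain why the restarts themselves took ten-plus
hours.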

~~~
nemothekid
Only a guess, but from how it's worded, it seems the storage Front-Ends that
had already entered the infinite loop may be what took the ten-plus hours to
restart.

~~~
CoreySanders
Mark, the Azure CTO, gives a good breakdown of the time taken for each portion
of the incident recovery in this video:
[http://channel9.msdn.com/posts/Inside-the-Azure-Storage-
Outa...](http://channel9.msdn.com/posts/Inside-the-Azure-Storage-Outage-of-
November-18th)

That may help address these questions. Just FYI, I am an engineer in the Azure
compute team.

------
mrb
_" These Virtual Machines were recreated by repeating the VM provisioning
step. Linux Virtual Machines were not affected."_

So Azure supports Linux VMs?! Microsoft does so little Azure advertising that
I had to learn this fact from their RCA. Apparently they have supported it since
2012: [http://www.techrepublic.com/blog/linux-and-open-
source/micro...](http://www.techrepublic.com/blog/linux-and-open-
source/microsoft-now-offering-linux-on-azure-what-does-this-mean/) but it is
likely that many non-users of Azure do not know this.

~~~
shanselman
In fact, something like 20% of Azure is running Linux VMs. It works great; you
can use Chef, Puppet, Vagrant, and the open-source, cross-platform CLI to
manage them. There are a thousand Linux VM images to choose from here:
[https://vmdepot.msopentech.com/List/Index](https://vmdepot.msopentech.com/List/Index)

~~~
Alupis
There's also the "Microsoft Loves Linux" campaign going on[1]. It's part of
their new attempt to embrace the FOSS world and generally be more "open".

[1]
[https://twitter.com/jniccolai/status/524281997632745472](https://twitter.com/jniccolai/status/524281997632745472)

------
sandis
> The engineer fixing the Azure Table storage performance issue believed that
> because the change had already been flighted on a portion of the production
> infrastructure for several weeks, enabling this across the infrastructure
> was low risk.

Ugh, I wouldn't want to be that guy (even if there were no direct
repercussions). That said, and as others have highlighted - kudos on the
writeup and the openness.

~~~
cheez
Forever known as the guy who knocked Azure offline.

~~~
click170
Probably less than you think. At least among his peers.

These kinds of "almost took X offline" incidents happen All The Time; it's just
that most of the time they get caught before they go too far. It's inevitable
that a few will squeak through the nets.

Mistakes can and will happen anywhere we allow them to. If you want to prevent
mistakes, write tools that reduce the "attack surface" (the areas where
mistakes can be made). E.g., don't want someone to be able to run "sudo reboot"
accidentally? Alias reboot to something else. It won't stop hackers, but it
might help fight fat fingers.
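
For instance, a tiny wrapper along these lines (hypothetical, just to sketch
the idea) makes the dangerous command demand explicit confirmation instead of
firing on a stray enter:

    #!/usr/bin/env python3
    # Hypothetical "safe-reboot" wrapper: alias reboot to this, so a fat
    # finger has to type the hostname before anything actually happens.
    import socket
    import subprocess
    import sys

    def main():
        host = socket.gethostname()
        answer = input(f"Really reboot {host}? Type the hostname to confirm: ")
        if answer.strip() != host:
            print("Confirmation mismatch; not rebooting.")
            sys.exit(1)
        subprocess.run(["sudo", "reboot"], check=True)

    if __name__ == "__main__":
        main()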

------
coldcode
Shit happens, and at this scale it happens big. I wish everyone would provide
details like this when the fan gets hit or their security fails. I'm glad I
never have to deal with scale like this; it's pretty scary.

~~~
jmartinpetersen
I've seen several companies where analysis like this would be for management
only. I guess it's just human nature to want to sweep mistakes and accidents
under the rug, but it does also speak volumes about the culture in such
companies. Kudos to Microsoft and every other big player that communicates
these things.

~~~
Someone1234
It reminds me of the NTSB's crash investigations. Instead of looking for a
scapegoat or someone to blame, they look for the cause, and then look even
deeper to find the root cause.

For example, they discover that a pilot made a mistake. But they don't end it
there; they then look at the airline's training materials, see if other pilots
would repeat the same mistake, and so on, until they reach a point where they
have a "this won't happen again" resolution (rather than simply discovering
what happened).

I feel like with Microsoft's breakdown they did the "this is what happened"
post-mortem but then went to the next level and said "here's why this
happened, and here is why it won't happen again."

~~~
epochwolf
Nitpick: Crash investigations are done by the NTSB, not the FAA.

The NTSB has no authority to enforce its recommendations; that's up to the
FAA. The idea behind that split is that the NTSB is more likely to be
impartial.

~~~
Someone1234
Valid correction. I've edited it in. But it did originally say FAA, not NTSB.

------
pfortuny
Impressive non-jargonized report. I would have "quantified" the "small
number", but kudos anyway to Microsoft for taking this path towards
transparency.

------
ha292
This is a good effort. I do have some concerns about it.

A true root cause analysis would go deeper and ask why it is that a single
engineer could decide to roll out to all slices.

The surface-level answer is that the Azure platform lacked tooling. Is that the
cause or an effect? I think it is an effect. There are deeper root causes.

Let's ask: why was it that the design allowed one engineer to effectively bring
down Azure?

We often stop these RCAs when they get uncomfortable and start to point
upwards.

I say this to the engineer who pressed the buttons: Bravo! You did something
that exposed a massive hole in Azure, which may very well have prevented a much
bigger embarrassment.

~~~
Lewisham
_A true root cause would go deeper and ask why is it that an engineer could
solely decide to roll out to all slices?_

Because writing code that contains a large number of checks and balances is
generally orders of magnitude more expensive than relying on human trust and
judgment on the Ops team. Reading the post-mortem makes me think that this sort
of failure could have happened to anyone, and no one really did anything wrong.
The mistake was the blob store config flag not getting flipped, which is just a
natural human error. The engineer who did the rollout could have been any of
us. Given what he/she knew, he/she thought they had a good soak test (and a
couple of weeks is a pretty good soak test) and made a call, similar to calls
he/she makes a number of times every day. This one didn't pan out.

I would hazard that most companies have a big red rollout button, reserved for
trusted engineers, that will do a rollout without all the checks you're
requesting.

~~~
smackfu
Just a second level of approval can be very useful, without adding orders of
magnitude of cost. In part that's because it usually requires that the change
be explained in writing to the second approver, and that alone can often reveal
issues.
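
Something like this sketch is all it takes (the names are made up, not any
real deployment system): the tool refuses to roll out until someone other
than the author has read a written explanation of the change and signed off.

    # Rough sketch of a two-person rule in deployment tooling.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class ChangeRequest:
        author: str
        description: str            # the written explanation of the change
        approver: Optional[str] = None

    def approve(change: ChangeRequest, approver: str) -> None:
        if not change.description.strip():
            raise ValueError("Change must be explained in writing first.")
        if approver == change.author:
            raise PermissionError("Author cannot approve their own change.")
        change.approver = approver

    def deploy(change: ChangeRequest) -> None:
        if change.approver is None:
            raise PermissionError("Second approval required before rollout.")
        print(f"Deploying change from {change.author}: {change.description}")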

~~~
Lewisham
It's not clear he/she didn't notify a secondary person, who would likely have
had the same knowledge he/she did. Given the same knowledge, the same push
might well have happened.

------
nchelluri
I'm pretty impressed with the openness of this statement.

~~~
sybhn
idem

------
jabanico
"Unfortunately, the configuration tooling did not have adequate enforcement of
this policy of incrementally deploying the change across the infrastructure."
They relied on tooling to review the last step of the process? I would have
thought there were a few layers of approval that go along with that final push
into mission-critical infrastructure.

~~~
NeutronBoy
That's exactly what they mean - the workflow tool they used didn't enforce
approvals from all the concerned parties.

------
mooneater
Pros:

- They are sharing info.
- They allowed some caustic comments to remain at the bottom of the page (so far).

Cons:

- This is almost 30 days after the incident.
- Look at the regions: it was global!
- This was a whole chain of issues. I count it as 5 separate issues. This goes
deep into how they operate, and it does not paint a picture of operational
maturity:

1: configuration change for the Blob Front-Ends exposed a bug in the Blob
Front-Ends

2: Blob Front-Ends infinite loop delayed the fix (I count this as a separate
issue though I expect some may not)

3: As part of a plan to improve performance of the Azure Storage Service, the
decision was made to push the configuration change to the entire production
service

4: Update was made across most regions in a short period of time due to
operational error

5: Azure infrastructure issue that impacted our ability to provide timely
updates via the Service Health Dashboard

That is quite a list. [Edit : formatting only]

~~~
markveronda
> -This is almost 30 days after the incident

What would have been the optimal response time? They fixed the immediate
problem as fast as they could and gave a preliminary RCA; then they did a
longer-term RCA and fix. I feel this shows maturity: not rushing to immediate
conclusions, and trying to do a 5-Whys drill-down to fix the underlying cause.
Furthermore, they took steps to actually fix the problem, pointing out that
they moved the human out of the loop in one aspect, and that's always a good
thing (unless the replacement software is itself faulty, of course).

Also, in response to the list, I believe [3&4] are actually the same thing, are
they not? The operator was the one who made the 'decision', by accidentally
ignoring the incremental config change policy that was in place and doing it
all at once. This was identified as a human error, and they fixed it by
_enforcing_ incremental changes.

------
Redsquare
Why are they so quiet about SLA credit? Not a word for a month, and for a year
I have been spending good money doubling up services to stay inside the SLA and
deploying cross-region to ensure zero downtime. What a joke. Surely Azure are
not hoping we will forget?

------
sybhn
TL;DR

>In summary, Microsoft Azure had clear operating guidelines but there was a
gap in the deployment tooling that relied on human decisions and protocol.

------
spudlyo
_After analysis was complete, we released an update to our deployment system
tooling to enforce compliance to the above testing and flighting policies for
standard updates, whether code or configuration._

Hopefully there is a way to disable this policy enforcement for when you really
need to push out a configuration or code change everywhere quickly.
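
In other words, something like a break-glass path in the tooling (a
hypothetical sketch, not Azure's actual system): enforce flighting by default,
but allow an explicit, audited emergency override for the rare change that
genuinely must go everywhere at once.

    # Sketch only: the function and slice names are invented.
    def plan_rollout(change_id, emergency=False, reason=""):
        if emergency:
            if not reason:
                raise ValueError("Emergency rollouts must record a reason.")
            print(f"AUDIT: emergency global rollout of {change_id}: {reason}")
            return ["all-slices-at-once"]
        # Default path: incremental flighting, one slice at a time.
        return ["test-slice", "slice-1", "slice-2", "remaining-slices"]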

------
markveronda
I cannot believe how many times I have seen a PROD (or new env X) deployment
go bad from configuration issues. At least they separate configuration
deployments from code deployments; that's a good sign. Why not take it a step
_further_ and, instead of doing config deployments, use a config server?

~~~
internetisthesh
In the end, you would still have config deployments, but to the config server.
And if you can push config to nodes needing it. you have one less point-of-
failure, right? I'm not too familiar with the concept of a config server.
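
From the little I've seen of the idea, the shape is roughly this (a rough
sketch; the endpoint and schema are invented): nodes poll a central config
service and validate what they get before applying it, rather than having
config files pushed at them as part of a deployment, so the server becomes the
one place you change and audit.

    # Minimal config-server client sketch; all names are hypothetical.
    import json
    import time
    import urllib.request

    CONFIG_URL = "http://config.internal.example/frontend/current"  # made up

    def fetch_config():
        with urllib.request.urlopen(CONFIG_URL, timeout=5) as resp:
            return json.loads(resp.read())

    def validate(cfg):
        # Refuse obviously malformed config rather than applying it blindly.
        return isinstance(cfg.get("version"), int) and "flags" in cfg

    def run_agent(apply_config):
        current_version = -1
        while True:
            try:
                cfg = fetch_config()
                if validate(cfg) and cfg["version"] > current_version:
                    apply_config(cfg)
                    current_version = cfg["version"]
            except OSError:
                pass  # keep the last known-good config if the fetch fails
            time.sleep(30)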

------
teyc
If utility computing is to be taken seriously, then it has to institute the
same kind of discipline that we see in the airline industry. Recent examples
come to mind: a pilot who let a songstress hold the controls and wear the
pilot's cap was fired; an airline executive who overruled a pilot over
macadamia nuts drew a million-dollar fine.

If we wish for a future where cloud computing will be considered reliable
enough for air traffic control systems, then managing this infrastructure
requires a level of dedication and commitment to process and training.

Failover zones need to be isolated not only physically, but also in terms of
command and control. A lone engineer should not have sufficient authority or
capability to operationally control more than one zone. It is extremely
unnerving for enterprises to see that a significant piece of infrastructure
like Azure has a root account that can take down the whole of Azure.

------
forgotAgain
I hope the engineer in question did not get fired.

I also hope that no one who recommended Azure to their employer got fired
either.

------
billarmstrong
Only one question: Will the engineer be fired?

~~~
billarmstrong
I want to know the reality. I don't want to see another PR show. Dear
Information diggers, please let us know whether the guy was fired!

------
larrystrange
OSS is alive and well on the Azure Platform www.microsoft.com/openness

------
runT1ME
Does anyone else see the missing piece in this post-mortem? An infinite loop
made its way onto a majority (all?) of production servers, and the immediate
response is more or less 'we shouldn't have deployed to as many customers;
failure should have only happened to a small subset'?

I agree that the improvements made to their deployment tooling were good and
necessary - they take the human temptation to skip steps out of the equation.

But this exemplifies a _major_ problem our industry suffers from, in that it is
just taken as a given that critical errors will sometimes make their way onto
production servers and the best we can do is reduce the impact.

I find this absolutely unacceptable. How about we short-circuit the process and
identify ways to stop that from happening? Were there enough code reviews? Did
automated testing fail here? Yes, I'm familiar with the halting problem and the
limitations of formal verification on Turing-complete languages, but I don't
believe that's an excuse.

This is tantamount to saying "yeah, sometimes our airplanes crash, so from now
on we'll just make sure we have fewer passengers ride in the newer models".

~~~
fishnchips
> Were there enough code reviews? Did automated testing fail here?

I really do love tests and all, but they only get you so far. In fact, you're
far more often bitten by things that are outside your frame of reference, and
those are therefore not the ones you take into account when designing a testing
pipeline.

~~~
eldavido
Ops engineer here. This is a particularly hard case because the problem
involved an interaction of components across the network, and was scale-
dependent. These kinds of problems are truly "emergent" in that they're
enormously hard to test for. Absent an exact copy of production, with the same
workload, I/O characteristics, network latencies, etc., there is always some
class of scale- and performance-related bugs you just won't catch until the
code hits production.

One defense is a "canary" deployment process (they used the term "flighting")
to ensure major changes are rolled out slowly enough to detect major
performance shifts. Had their deployment process worked correctly, they might
have been able to roll back the change without incident.
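
In code, the shape of that defense is roughly this (slice names, thresholds,
and the metric hook are all invented; a sketch of the idea, not anyone's real
pipeline):

    # Canary / "flighting" sketch: push one slice at a time and stop early.
    SLICES = ["canary", "region-1", "region-2", "rest-of-world"]

    def rollout(apply_change, rollback_change, error_rate, baseline=0.01):
        """All three function arguments are caller-supplied hooks."""
        for slice_name in SLICES:
            apply_change(slice_name)
            if error_rate(slice_name) > 10 * baseline:
                # Blast radius is limited to one slice; undo and investigate.
                rollback_change(slice_name)
                raise RuntimeError(f"Aborted rollout at slice {slice_name!r}")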

A second defense is proactively building "safeties" and "blowoff valves" into
your software. Example: if a client notices a huge spike in errors, _back off_
before retrying a connection request; otherwise you may put the system into a
positive feedback loop. Ethernet collision detection/avoidance is a great
example of a safety mechanism done well.
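
A minimal version of that blowoff valve, just to make it concrete (the
parameters here are arbitrary): exponential backoff with jitter, so a fleet of
clients doesn't retry in lockstep and feed the loop.

    # Sketch: retry with exponential backoff plus jitter on connection errors.
    import random
    import time

    def call_with_backoff(request, max_attempts=6, base_delay=0.5, max_delay=30.0):
        for attempt in range(max_attempts):
            try:
                return request()
            except ConnectionError:
                # Back off further on each failure; jitter spreads clients out.
                delay = min(max_delay, base_delay * (2 ** attempt))
                time.sleep(delay * random.uniform(0.5, 1.5))
        raise RuntimeError("Service still failing after retries; giving up.")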

Finally, every high-scale domain has its own problems, which experienced
engineers know to worry about. In my case, at an analytics provider, one of
the hardest problems we face is data retention: how much to store, at what
granularity, for how long, and how that interacts with our various plan tiers.
OTOH we have significant latitude to be "eventually correct" or "eventually
consistent" in a way a bank, stock exchange, or other transactional financial
system (e.g. credit approval) can't be. I imagine that in other areas, like ad
serving, video serving, and game backend development, there are similar
"gotchas", but I don't know what they are.

