

Update on Azure Storage Service Interruption - robert_nsu
http://azure.microsoft.com/blog/2014/11/19/update-on-azure-storage-service-interruption

======
taspeotis

> 404 - Article Not Found
> The article you were looking for was not found, but maybe try looking again!

You can get the text from the RSS feed [1].

---

Yesterday evening Pacific Standard Time, Azure storage services experienced a
service interruption across the United States, Europe and parts of Asia, which
impacted multiple cloud services in these regions. I want to first sincerely
apologize for the disruption this has caused. We know our customers put their
trust in us and we take that very seriously. I want to provide some background
on the issue that has occurred.

As part of a performance update to Azure Storage, an issue was discovered that
resulted in reduced capacity across services utilizing Azure Storage,
including Virtual Machines, Visual Studio Online, Websites, Search and other
Microsoft services. Prior to applying the performance update, it had been
tested over several weeks in a subset of our customer-facing storage service
for Azure Tables. We typically call this “flighting,” as we work to identify
issues before we broadly deploy any updates. The flighting test demonstrated a
notable performance improvement and we proceeded to deploy the update across
the storage service. During the rollout we discovered an issue that resulted
in storage blob front ends going into an infinite loop, which had gone
undetected during flighting. The net result was an inability for the front
ends to take on further traffic, which in turn caused other services built on
top to experience issues.

Once we detected this issue, the change was rolled back promptly, but a
restart of the storage front ends was required in order to fully undo the
update. Once the mitigation steps were deployed, most of our customers started
seeing the availability improvement across the affected regions. While
services are generally back online, a limited subset of customers are still
experiencing intermittent issues, and our engineering and support teams are
actively engaged to help customers through this time.

When we have an incident like this, our main focus is rapid time to recovery
for our customers, but we also work to closely examine what went wrong and
ensure it never happens again. We will continually work to improve our
customers’ experiences on our platform. We will update this blog with an RCA
(root cause analysis) to ensure customers understand how we have addressed the
issue and the improvements we will make going forward.

---

[1]
[http://sxp.microsoft.com/feeds/3.0/devblogs](http://sxp.microsoft.com/feeds/3.0/devblogs)
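The failure mode the post describes, where an update that behaved well on the Table front-ends hangs the Blob front-ends, can be illustrated with a minimal sketch. This is hypothetical code (the function, flag, and type names are invented), not Azure's implementation:

```python
# Hypothetical sketch of a config change that is safe on one front-end type
# but hangs another. Here the table path pre-populates the state the loop
# waits on, so only the blob path spins forever once the flag is enabled.

def handle_request(frontend_type: str, perf_flag: bool, max_spins: int = 1000) -> str:
    """Serve one request; 'hung' stands in for the real front-end's infinite loop."""
    entry_ready = (frontend_type == "table")  # the flighted (table) path is fine
    spins = 0
    while perf_flag and not entry_ready:
        spins += 1
        if spins >= max_spins:  # guard so this sketch terminates; the real code had none
            return "hung"
    return "ok"
```

Flighting only the table path (`handle_request("table", True)`) never exercises the hang, which is consistent with the bug going undetected until the broad rollout.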

------
je42
Uhm.

> The configuration change for the Blob Front-Ends exposed a bug in the Blob
> Front-Ends, which had been previously performing as expected for the Table
> Front-Ends. This bug resulted in the Blob Front-Ends to go into an infinite
> loop not allowing it to take traffic.

An infinite loop that was not discovered during the partial rollout. That is
clearly weird. It also doesn't inspire much trust in their current monitoring
scheme.

------
keithwarren
This video
[http://channel9.msdn.com/events/Build/2014/3-615](http://channel9.msdn.com/events/Build/2014/3-615);
start at 39 minutes. Mark Russinovich explains their update rollout procedure
(or at least part of it).

Willing to bet the rollout infrastructure depended on storage and so their
ability to control or stop the rollout was broken once the storage failures
began.

------
coreysa
Almost everyone has fully recovered at this point. If you are still seeing
problems with your Virtual Machine after the incident earlier this week, we
want to help you!! Please send mail to azcommsm@microsoft.com and email me
directly at corey.sanders@microsoft.com.

Please send with high importance so it pops in our inbox and we will dig in.

------
keithwarren
I hope to read more in the post-mortem RCA, but I am curious what their
flighting missed. Is flighting so limited that it does not see cross-region
scale or something? I also had the feeling from watching Mark
Russinovich discuss previous failures that their patch rollouts were much more
controlled.

~~~
coreysa
Keith, you can find more details in the RCA that is published here:
[http://azure.microsoft.com/blog/2014/11/19/update-on-azure-storage-service-interruption/](http://azure.microsoft.com/blog/2014/11/19/update-on-azure-storage-service-interruption/).
It has been updated with more details on the flighting and the issues we
encountered.

------
nnx
So they rolled out a performance update (not a critical security fix) to all
their datacenters at once?

This sounds incredibly amateur for a provider the size of Azure.

~~~
coreysa
Hey nnx, this is Corey from the Azure engineering team. We have a standard
protocol in the team of applying production changes in incremental batches.
Due to an operational error, this update was made across most regions in a
short period of time. I really apologize for the disruption.
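The batch-by-batch discipline described here can be sketched as a rollout loop that health-gates each batch before touching the next. The names and structure below are illustrative, not Azure's actual tooling:

```python
# Hypothetical sketch of an incrementally batched rollout: deploy one batch
# of regions, verify health, and halt before the remaining regions if the
# health gate fails. Bypassing this gate is the kind of operational error
# that turns a bad update into a near-global incident.

def staged_rollout(regions, deploy, healthy, batch_size=1):
    """Deploy in batches; return the list of regions actually updated."""
    updated = []
    for i in range(0, len(regions), batch_size):
        batch = regions[i:i + batch_size]
        for region in batch:
            deploy(region)
            updated.append(region)
        if not all(healthy(r) for r in batch):
            break  # stop the rollout instead of proceeding to further batches
    return updated
```

With a `healthy` check that fails inside the first batch, the rollout never reaches the remaining regions.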

~~~
keypusher
So, you had a bug in your code. That happens to everyone and I think we all
understand. However, there are a number of other issues here which seem
systemic and much more troubling. First, that your "flighting" did not catch
the problem. Why was that? If the bug caused an infinite loop on all the live
storage systems, that seems like it should have been fairly obvious on the
customer systems you tested on. Second, that the patch was rolled out to all
servers at the same time. You have admitted this was a mistake, but honestly
it looks like amateur hour. If you are running business critical distributed
cloud infrastructure, you just don't ever do this. Third, that there was
extended fallout from rolling the patch back. If there are still customers
experiencing downtime from this problem a full day later, that speaks to some
serious flaws in the ops architecture and process. If you guys want to compete
with AWS and similar platforms, it seems like you have a long way to go still.
This set of mistakes should haunt you for a long time, because it's going to
come up whenever someone is trying to convince their boss/colleague/team that
Azure is a solid solution.

~~~
coreysa
Thanks. We are continuing to investigate this and driving needed improvements
in our process and technology to avoid similar issues in the future.

~~~
ohyesyodo
The last two times there was a big issue the same thing happened with the
status dashboard (it became inaccessible). I remember the same issue when the
certs expired 1.5 years ago. I really like Microsoft and was convinced "you"
would somehow isolate the dashboard and host it separately, but it turns out I
was wrong. Do you happen to know the reasons for hosting the status dashboard
inside of Azure? It seems so counter-intuitive to me. Or is it actually hosted
externally but died due to the load when the issue started to appear?

The OP mentions that Microsoft representatives gave info via public forums.
When the issue appeared I looked in different places trying to find info, but
all I found was a statement saying "We are aware of issues." I looked at the
Azure twitter/blog, ScottGu's twitter/blog, Hanselman's, and the MSDN forums.
I also tried this forum and reddit. Do you know where I should have gone to
receive details?

~~~
coreysa
Thanks. The communications and the service health dashboard are two areas where
we are creating improvement plans based on the learnings from this event. For the
dashboard, we do expect it to continue to run even through outages like this
one, but we did encounter an issue with our fallback mechanism that we need to
understand more deeply.

For general communications, we did most of our early communication on the
event using Twitter, announcing the incident and giving updates. We need to
build a more formal, multi-pronged approach to communicating, including faster
responses in the MSDN forums and here on HN to make sure we are reaching as
many of our customers and partners as possible. Thanks again for the
feedback!!

------
ohyesyodo
How about not rolling out a patch to all data centers at once?

~~~
coreysa
Hi, this is Corey Sanders, an engineer on the Azure compute team. Yes, our
normal policy for updates is to roll them out in incremental batches. In this
case, due to an operational error, we did not apply the changes as per normal
policy.

------
bart3r
404 - Article Not Found

