
Windows Azure Service Disruption on Feb 29th, 2012 - FrancescoRizzi
http://blogs.msdn.com/b/windowsazure/archive/2012/03/09/summary-of-windows-azure-service-disruption-on-feb-29th-2012.aspx
======
panarky
tl;dr

1\. On February 29, 2012, new certificates were created with a one-year
expiration date by adding 1 to the year. Since February 29, 2013 is an invalid
date, VMs wouldn't start.

2\. After multiple attempts to restart failed VMs, physical hosts were marked
as failed and their VMs migrated to other physical machines, so the problem
propagated.

3\. Management services were disabled to prevent customers from starting more
VMs, compounding the problems.

4\. After the leap-day bug was fixed, secondary failures were caused by mixing
up incompatible versions of a networking plugin, so VMs had no network access.

5\. Total duration of outages: about 16 hours.

6\. 33% of a month's service to be credited to all customers, regardless of
who was affected.

~~~
tomjen3
Why is it that they think a single customer would be happy with 33% of a fee
which is likely to be only a very small part of what their downtime cost them?

Not to mention that 16 hours time to fix is insane, unless all your
datacenters had been blown up or war had broken out.

~~~
powertower
> Why is it that they think a single customer would be happy with 33% of a fee

Because most other providers would have refunded the customer 16 / (24*28) = 1
/ 42 = 2.4% of the bill.

Microsoft paid out 10x that amount.

The kind of SLA that you are talking about (one that pays out to cover all
loss of business) does not exist anywhere, and if it did, it would cost you
more than the $10-$100/month hosting account that you'd normally buy.
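
For reference, the pro-rata arithmetic above can be checked in a few lines. This is only a sketch: the variable names are made up for illustration, and it keeps the 28-day month used in the comment even though February 2012 had 29 days.

```python
from fractions import Fraction

outage_hours = 16
hours_in_month = 24 * 28  # figure from the comment; February 2012 had 29 days

# Share of the month lost to the outage, kept exact as a fraction.
pro_rata = Fraction(outage_hours, hours_in_month)
print(pro_rata)                  # 1/42
print(f"{float(pro_rata):.1%}")  # 2.4%

# How the 33% credit compares with the pro-rata share.
credit = Fraction(33, 100)
print(float(credit / pro_rata))  # 13.86
```

On these numbers the 33% credit works out to roughly 14 times the pro-rata share.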

~~~
panarky
Not exactly. For big outages like this, both Google and Amazon provide bigger
refunds.

When Amazon's elastic block store was down, they credited back 10 days of
service.

<http://aws.amazon.com/message/65648/>

And Google offers a 99.95% SLA for App Engine which refunds 10%, 25% or 50% of
the total monthly bill if uptime falls below 99.95%, 99.00% or 95.00%
respectively.

<http://code.google.com/appengine/sla.html>
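
The tiers quoted above could be written as a small lookup. This is a sketch only: `sla_credit_percent` is a hypothetical name, the strict "below" comparison at each boundary is an assumption, and applying these tiers to a 16-hour outage is purely illustrative.

```python
def sla_credit_percent(uptime_percent: float) -> int:
    """Credit tier for a month, using the thresholds quoted in the comment."""
    if uptime_percent < 95.00:
        return 50
    if uptime_percent < 99.00:
        return 25
    if uptime_percent < 99.95:
        return 10
    return 0

# Hypothetical case: 16 hours of downtime in a 29-day month.
uptime = 100 * (1 - 16 / (24 * 29))
print(round(uptime, 2), sla_credit_percent(uptime))  # 97.7 25
```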

------
pilif
_cough_ <http://thedailywtf.com/Articles/DATE_NOT_FOUND.aspx>

And this is why you always use your framework's or language's date arithmetic
library and never try to hack up a solution on your own. Date calculations
alone are hard enough with the basic irregularities of month lengths. Add the
leap years and it becomes even harder.

And don't get me started on times, especially once time zones and summertime
come into play.

Likely your particular hacked-together solution will fail at some point. And
if it doesn't: was it worth all the effort you put into making it perfect,
especially considering that somebody has already done it for your framework?

NIH at its finest.

~~~
hythloday
I think the problem is that it's not obvious that (in Python, where I saw this
first):

      datetime(now.year + 1, now.month, now.day)

_is_ a hacked-together solution. You have to design an API _very_ carefully to
suggest that this is a bad thing to do. (I guess you could make now.year + int
yield a type that datetime won't accept as the first argument, but I'm sure I
wouldn't think of that before the fact, and I consider myself a relatively
competent API designer.)

Not excusing MSFT here, as they have the resources and experience to get it
right, but in general I think that following the rule of "don't DIY" won't
solve the problem.
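
To make the failure concrete, here is a minimal sketch of that pattern hitting a leap day, next to one possible fallback. Both function names are hypothetical, and clamping to Feb 28 is just one policy choice (an assumption, not something the Azure post prescribes).

```python
from datetime import date

def naive_plus_one_year(d: date) -> date:
    # The pattern from the comment: bump the year, keep month and day.
    return date(d.year + 1, d.month, d.day)

def plus_one_year_clamped(d: date) -> date:
    # Assumed policy: fall back to Feb 28 when Feb 29 does not exist.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)

leap_day = date(2012, 2, 29)
try:
    naive_plus_one_year(leap_day)
except ValueError as exc:
    print("naive version failed:", exc)

print(plus_one_year_clamped(leap_day))  # 2013-02-28
```

The naive version works on every other day of the year, which is part of why it does not look like a hack until a leap day comes around.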

~~~
jbert
I guess that the fundamental problem is thinking of months and years as
numbers (and representing them as such).

If they were purely symbolic constants, then the expression "January + 1" is
meaningless and would throw an error.

So, with hindsight, I'd say that any datetime API which represents days,
months and years as numeric quantities (which is, probably, all of them)
encourages these kinds of bugs. (Or at least doesn't discourage them.)

Can anyone come up with a use case where you need numeric values for these
things? (Which doesn't suffer from the same kind of bugs as this?)
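
As a rough illustration of the symbolic-constant idea, Python's plain `Enum` already rejects arithmetic. The `Month` type and `next_month` helper below are hypothetical; this is a sketch of the idea, not how any existing datetime library works.

```python
from enum import Enum

# Plain Enum members are symbolic: they do not support + or -.
Month = Enum(
    "Month",
    "JANUARY FEBRUARY MARCH APRIL MAY JUNE "
    "JULY AUGUST SEPTEMBER OCTOBER NOVEMBER DECEMBER",
)

def next_month(m: Month) -> Month:
    # Stepping through months has to be spelled out, including wrap-around.
    members = list(Month)
    return members[(members.index(m) + 1) % len(members)]

try:
    Month.JANUARY + 1
except TypeError:
    print("Month.JANUARY + 1 is rejected")

print(next_month(Month.DECEMBER))  # Month.JANUARY
```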

------
cypherpunks01
How do you all generally handle leap days when doing time math? If you're
selling a service for one year, are you selling 365 days (02/28/12 - 02/26/13)
or do you just give away the leap day for free (02/28/12 - 02/27/13)? Do you
pay your salaried employees one day extra on a leap year?

What other leap year bugs have people run into? Generally the libraries I work
with (e.g. Python's timedelta) don't let you add months or years because of
their ambiguity.
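
The two service windows above can be checked with `timedelta`, which only counts fixed-length units. A small sketch, assuming the end dates in the question are meant as the last included day:

```python
from datetime import date, timedelta

start = date(2012, 2, 28)

# Last included day of the term, counting the start day as day 1.
last_day_365 = start + timedelta(days=364)  # strict 365-day term
last_day_366 = start + timedelta(days=365)  # leap day given away for free
print(last_day_365, last_day_366)  # 2013-02-26 2013-02-27

# timedelta deliberately has no month or year units.
try:
    timedelta(years=1)
except TypeError:
    print("timedelta has no 'years' argument")
```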

~~~
mef
The best approach is to treat Feb 29 as a non-day for purposes of adding
months and years to a date, for example in Ruby (with ActiveSupport):

    
    
      > t = Time.parse("feb 29, 2012")
       => 2012-02-29 00:00:00 -0500 
      > t + 1.year
       => 2013-02-28 00:00:00 -0500 
      > t + 1.year == (t - 1.day + 1.year)
       => true

~~~
ScottBurson
Ugh, that's terrible! A type that doesn't obey basic arithmetic identities --
that's almost certain to result in bugs like the one we're discussing.

~~~
mef
Different approaches, different bugs. The seconds from UTC approach results in
1 month from Aug 31st being Oct 1st, and 1 month from Jan 31st being Mar 3rd.

I much prefer spelling out the intention plainly.
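
A spill-over rule of this kind can be sketched as follows. `plus_one_month_overflow` is a hypothetical helper that moves to the next month and lets any excess days run into the month after it; it is an illustration of the behaviour, not a specific library's implementation.

```python
from datetime import date, timedelta

def plus_one_month_overflow(d: date) -> date:
    # Go to the 1st of the next month, then walk forward (day - 1) days,
    # so a day that does not exist spills into the following month.
    year, month = (d.year + 1, 1) if d.month == 12 else (d.year, d.month + 1)
    return date(year, month, 1) + timedelta(days=d.day - 1)

print(plus_one_month_overflow(date(2013, 8, 31)))  # 2013-10-01
print(plus_one_month_overflow(date(2013, 1, 31)))  # 2013-03-03
```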

~~~
ScottBurson
Well, a "month", unqualified, is not really a unit of time. You have to know
not only which month you're talking about, but in the case of February, what
year it's in, to know exactly how long it is. The same goes for years, if you
want them to consist of an integral number of days.

So I think as a matter of API design, Ruby has made a wrong choice by making
these operations look like arithmetic on known values. It's too tempting to
think that they'll obey the arithmetic identities, when they don't. If you
want to have an API function "same date N months later", I have no problem
with that at all; then it's much less tempting to think that it's just doing
addition.
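
A named function along those lines might look like this in Python. It is a sketch: `same_date_months_later` is a hypothetical name, and clamping to the last day of the target month is an assumed policy.

```python
import calendar
from datetime import date

def same_date_months_later(d: date, months: int) -> date:
    # Count months from year 0 so that year roll-over is plain integer math.
    index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(index, 12)
    # Clamp the day to the length of the target month (assumed policy).
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return date(year, month0 + 1, min(d.day, last_day))

jan31 = date(2012, 1, 31)
later = same_date_months_later(jan31, 1)
print(later)                              # 2012-02-29
print(same_date_months_later(later, -1))  # 2012-01-29, not Jan 31
```

The round trip does not return to the starting date, which is the point: the operation is not addition, and the function name no longer suggests that it is.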

------
rdcastro
I work at Microsoft (in Windows Azure), and this was the first outage since I
joined the org, so I did not know what to expect from the company in terms of
transparency. However, given how open other public presentations and papers on
the Windows Azure technology had been, I expected a good job here.

Bill Laing's post confirmed how transparent Microsoft wants to be with its
customers, which is really nice. And I appreciate how seriously Microsoft is
attempting to learn from these incidents and put measures in place.

------
kogir
The article really is worth a read if you build complex systems. My takeaway
from this is that you shouldn't schedule maintenance work during "weird"
times.

Had they not been deploying new code on leap day (UTC), the outage would have
been substantially less severe. Code that uses dates and times will have bugs,
because it's hard. Don't complicate things further.

So from now on, no more leap day, daylight saving time, or New Year's
maintenance. It's worth postponing a day just in case.

------
recoiledsnake
That seems to be incredibly well-detailed, much more than Amazon's or others'
responses to their outages so far.

