Windows Azure Service Disruption on Feb 29th, 2012 (msdn.com)
46 points by FrancescoRizzi 2139 days ago | 26 comments



tl;dr

1. On February 29, 2012, new certificates created with a one-year expiration date by adding 1 to the year. Since February 29, 2013 is an invalid date, VMs wouldn't start.

2. After multiple attempts to restart failed VMs, physical hosts marked as failed, and VMs migrated to other physical machines -- the problem propagates.

3. Management services disabled to prevent customers from starting more VMs, compounding the problems.

4. After leap-day bug fixed, secondary failures caused by mixing up incompatible versions of a networking plugin, so VMs had no network access.

5. Total duration of outages: about 16 hours.

6. 33% of a month's service to be credited to all customers, regardless of who was affected.


Why is it that they think a single customer would be happy with 33% of a fee which is likely to be only a very small part of what their downtime cost them?

Not to mention that 16 hours time to fix is insane, unless all your datacenters had been blown up or war had broken out.


> Why is it that they think a single customer would be happy with 33% of a fee

Because most other providers would have refunded the customer 16 / (24*28) = 1 / 42 = 2.4% of the bill.

Microsoft paid out 10x that amount.

The type of SLA that you are talking about (one that pays out to cover all loss of business) does not exist anywhere, and if it did, it would cost you far more than the $10-$100/month hosting account that you'd normally buy.
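
The proration arithmetic above is easy to check. A quick sketch, assuming the 28-day month and 16-hour outage used in the comment; the variable names are made up for illustration.

```python
# Prorated credit for a 16-hour outage, using the 28-day month assumed above.
outage_hours = 16
hours_in_month = 24 * 28

prorated_fraction = outage_hours / hours_in_month      # 16 / 672 = 1 / 42
prorated_percent = round(prorated_fraction * 100, 1)

# How many times larger the flat 33% credit is than the prorated one.
flat_credit_percent = 33
multiple = flat_credit_percent / (prorated_fraction * 100)

print(prorated_percent, round(multiple, 1))
```

With these inputs the flat credit works out to somewhat more than ten times the prorated figure.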


Not exactly. For big outages like this, both Google and Amazon provide bigger refunds.

When Amazon's elastic block store was down, they credited back 10 days of service.

http://aws.amazon.com/message/65648/

And Google offers a 99.95% SLA for App Engine which refunds 10%, 25% or 50% of the total monthly bill if uptime falls below 99.95%, 99.00% or 95.00% respectively.

http://code.google.com/appengine/sla.html
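
For comparison, here is a rough sketch of how those tiers would apply to a 16-hour outage. It encodes only the thresholds as quoted in this comment, not the actual SLA text, and `credit_percent` is a hypothetical helper name.

```python
def credit_percent(uptime_percent: float) -> int:
    """Return the credit tier for a monthly uptime, per the tiers quoted above."""
    if uptime_percent < 95.00:
        return 50
    if uptime_percent < 99.00:
        return 25
    if uptime_percent < 99.95:
        return 10
    return 0

# Assumed input: a 16-hour outage in a 29-day month (February 2012).
uptime = 100 * (1 - 16 / (24 * 29))
print(round(uptime, 2), credit_percent(uptime))
```

Under that reading, a 16-hour outage lands in the 25% tier.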


cough http://thedailywtf.com/Articles/DATE_NOT_FOUND.aspx

And this is why you always use your framework's or language's date arithmetic library and never try to hack up a solution on your own. Date calculations are hard enough with just the basic irregularities of month lengths. Add leap years and it becomes even harder.

And don't get me started on times, especially once time zones and summertime come into play.

Your particular hacked-together solution will likely fail at some point. And if it doesn't: was it worth all the effort you put into making it perfect, especially considering that somebody has already done it for your framework?

NIH at its finest.


I think the problem is that it's not obvious that (in Python, where I saw this first):

  datetime(now.year + 1, now.month, now.day)
is a hacked-together solution. You have to design an API very carefully to suggest that this is a bad thing to do. (I guess you could make now.year + int yield a type that datetime won't accept as the first argument, but I'm sure I wouldn't think of that before the fact, and I consider myself a relatively competent API designer.)

Not excusing MSFT here, as they have the resources and experience to get it right, but in general I think that following the rule of "don't DIY" won't solve the problem.
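
To make the failure concrete, here is a minimal sketch using Python's standard `datetime` module. `add_years` is a hypothetical helper, and falling back to Feb 28 is just one possible policy, not necessarily the right one for every application.

```python
from datetime import date

def add_years(d: date, years: int) -> date:
    """Same month and day `years` later; Feb 29 falls back to Feb 28."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # Only Feb 29 can fail here, when the target year is not a leap year.
        return d.replace(year=d.year + years, day=28)

leap_day = date(2012, 2, 29)

try:
    # The innocent-looking version: bump the year, keep month and day.
    date(leap_day.year + 1, leap_day.month, leap_day.day)
    naive_ok = True
except ValueError:
    naive_ok = False

print(naive_ok, add_years(leap_day, 1))
```

The naive constructor call only blows up when it is run on Feb 29, which is why it can sit unnoticed in code for years.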


I guess that the fundamental problem is thinking of months and years as numbers (and representing them as such).

If they were purely symbolic constants, then the expression "January + 1" is meaningless and would throw an error.

So, with hindsight, I'd say that any Datetime api which represents days, months and years as numeric quantities (which is, probably, all of them) encourages these kinds of bugs. (Or at least doesn't discourage them).

Can anyone come up with a use case where you need numeric values for these things? (Which doesn't suffer from the same kind of bugs as this?)
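
As a small illustration of the "purely symbolic" idea, Python's plain `Enum` (as opposed to `IntEnum`) already behaves this way. The `Month` class below is a hypothetical sketch, not part of any date library.

```python
from enum import Enum

class Month(Enum):
    JANUARY = "January"
    FEBRUARY = "February"
    # ... remaining months omitted for brevity

try:
    Month.JANUARY + 1          # plain Enum members define no arithmetic
    arithmetic_allowed = True
except TypeError:
    arithmetic_allowed = False

print(arithmetic_allowed)
```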


Yup, someone didn't use DateTime.AddYears http://msdn.microsoft.com/en-us/library/system.datetime.addy...


How do you all generally handle leap days when doing time math? If you're selling a service for one year, are you selling 365 days (02/28/12 - 02/26/13) or do you just give away the leap day for free (02/28/12 - 02/27/13)? Do you pay your salaried employees one day extra on a leap year?

What other leap year bugs have people run into? Generally the libraries I work with (e.g. python's timedelta) don't let you add months or years because of their ambiguity.
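
The `timedelta` limitation can be seen directly. A short sketch with the standard library only; the start date is the one from the question above.

```python
from datetime import date, timedelta

start = date(2012, 2, 28)

# timedelta only knows fixed-length units, so "one year" has to be spelled
# as a number of days, and 365 days later is not the same calendar date.
after_365_days = start + timedelta(days=365)

try:
    timedelta(years=1)         # "years" is not a supported keyword
    has_years = True
except TypeError:
    has_years = False

print(after_365_days, has_years)
```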


The best approach is to treat Feb 29 as a non-day for purposes of adding months and years to a date, for example in Ruby (with ActiveSupport, which provides 1.year and 1.day):

  > t = Time.parse("feb 29, 2012")
   => 2012-02-29 00:00:00 -0500 
  > t + 1.year
   => 2013-02-28 00:00:00 -0500 
  > t + 1.year == (t - 1.day + 1.year)
   => true


Ugh, that's terrible! A type that doesn't obey basic arithmetic identities -- that's almost certain to result in bugs like the one we're discussing.


While I agree, and I fully expected it to go to March 1st instead, the core of the problem is that dates don't obey basic arithmetic identities. Treating Feb 29 as a zero value is an elegant solution in some ways.

But say 2012-02-29 + 1 year went to 2013-03-01. Then what's 2012-03-01 - 1 year? Does it go to March 1st every time, sometimes ignoring that there's an extra day in between, or does it go back 365 days (March 2nd)?
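
The two readings of "minus one year" in that question can be put side by side. A minimal sketch using Python's standard `datetime` module; the variable names are only for illustration.

```python
from datetime import date, timedelta

d = date(2012, 3, 1)

calendar_year_back = d.replace(year=d.year - 1)   # same month and day
fixed_days_back = d - timedelta(days=365)         # fixed-length offset

print(calendar_year_back, fixed_days_back)
```

The two answers differ by one day because Feb 29, 2012 sits inside the interval.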

I suspect the only "solution" is to decide that calendar dates are for human consumption only. If you're doing any calculations, you do them on timestamps, where you can declare a 'year' to be one of 365.256363004, 365.24219, or 365.259636 days (sidereal, tropical, and anomalistic years)[1]. Given the importance of calendars and seasons, you'd probably want to use the middle one, as they all essentially share that length of time as a definition of a year. That way you can just screw the whole leap-year concept entirely.

Of course, then you're left with year-long agreements that expire at odd hours of the night.

[1]: http://en.wikipedia.org/wiki/Year#Sidereal.2C_tropical.2C_an...
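
A quick sketch of what the timestamp approach looks like in practice, assuming the tropical-year length quoted above and a made-up agreement signed at midnight on leap day:

```python
from datetime import datetime, timedelta

TROPICAL_YEAR = timedelta(days=365.24219)   # value quoted above

signed = datetime(2012, 2, 29, 0, 0)        # hypothetical signing time
expires = signed + TROPICAL_YEAR

print(expires)
```

The expiry lands on 2013-02-28 a little before 6 in the morning, which is the "odd hours" effect.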


Different approaches, different bugs. The seconds from UTC approach results in 1 month from Aug 31st being Oct 1st, and 1 month from Jan 31st being Mar 3rd.

I much prefer spelling out plainly the intention.
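
Here is what a fixed-length "month" does to end-of-month dates. This is a sketch that assumes a month is taken to be 31 days; the constant name is hypothetical.

```python
from datetime import date, timedelta

FIXED_MONTH = timedelta(days=31)   # assumption: "a month" is 31 days

from_aug_31 = date(2012, 8, 31) + FIXED_MONTH
from_jan_31 = date(2013, 1, 31) + FIXED_MONTH

print(from_aug_31, from_jan_31)
```

Both results skip the following month entirely.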


Well, a "month", unqualified, is not really a unit of time. You have to know not only which month you're talking about, but in the case of February, what year it's in, to know exactly how long it is. The same goes for years, if you want them to consist of an integral number of days.

So I think as a matter of API design, Ruby has made a wrong choice by making these operations look like arithmetic on known values. It's too tempting to think that they'll obey the arithmetic identities, when they don't. If you want to have an API function "same date N months later", I have no problem with that at all; then it's much less tempting to think that it's just doing addition.


In C#:

  DateTime expiry = DateTime.Now.AddYears(1);
No need to worry about leap days or anything. The framework takes a date/time such as 2012-02-29 15:00 and calculates that one year later is 2013-02-28 15:00. Similarly, 2012-01-31 15:00 called with .AddMonths(1) returns 2012-02-29 15:00 and calling that with .AddMonths(-1) returns 2012-01-29 15:00.

As long as you understand how your library handles these calculations, you're fine; trying to do it manually is likely to get you in trouble.


A gotcha to look out for in .NET's implementation is that because the library is adjusting invalid results down to the last day of the month, addition and subtraction operations are not commutative:

  new DateTime(2012, 3, 31)
    .AddMonths(-1)
    .AddMonths(1) ==> 2012-03-29

  new DateTime(2012, 3, 31)
    .AddMonths(1)
    .AddMonths(-1) ==> 2012-03-30
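
The same effect can be reproduced outside .NET. Below is a Python sketch of a month shift that clamps to the last day of the target month; `add_months` is a hypothetical helper that mimics the behaviour described here, not .NET's actual implementation.

```python
import calendar
from datetime import date

def add_months(d: date, months: int) -> date:
    """Shift by whole months, clamping the day to the target month's length."""
    index = d.year * 12 + (d.month - 1) + months
    year, month = divmod(index, 12)
    month += 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))

start = date(2012, 3, 31)
back_then_forward = add_months(add_months(start, -1), 1)
forward_then_back = add_months(add_months(start, 1), -1)

print(back_then_forward, forward_then_back)
```

Neither round trip returns to the 31st, and the two orders disagree with each other.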


Right, which is why an understanding of how .NET handles the calculations is important. If it's important that you preserve things like "last day of the month" there are libraries[0] which support this and will do date calculations accordingly.

[0] For example: http://sourceforge.net/projects/dday-ical/


Anyone who's been programming for more than a year knows, or should know, that you don't do time arithmetic directly on the date representation. You convert times to some form that is easy to do arithmetic on, like seconds from 1970-1-1 00:00:00 UTC (the Unix epoch) or 1900-1-1 00:00:00 UTC (the Common Lisp epoch), do all your arithmetic on that, then convert back.

So to add a year, all you have to do is add 365.25 * 24 * 60 * 60.

This was a stunningly stupid, n00b mistake.


Don't do that either. December 31, 2012 at 9:00 PM + 365.25 * 24 * 60 * 60 seconds = January 1, 2014 at 3:00 AM. I doubt many people would consider that to be one year later. Let a well-tested library handle the date calculations as you likely haven't thought enough about leap years, leap seconds, daylight savings, historical changes in daylight savings, and so on.
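
That example is easy to reproduce. A minimal sketch with naive (time-zone-free) datetimes from the standard library:

```python
from datetime import datetime, timedelta

start = datetime(2012, 12, 31, 21, 0)
average_year = timedelta(seconds=365.25 * 24 * 60 * 60)

result = start + average_year
print(result)   # 2014-01-01 03:00:00
```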


> Let a well-tested library handle the date calculations

Yes, that's my point exactly. The conversion from the familiar representation to linear seconds incorporates all of those complications (except leap seconds, which hardly anyone bothers with).

If you don't want to use 365.25, you can write code to figure out whether to use 365 or 366; but in neither case will you have a bug like the one Azure hit where a completely invalid date was generated.


Yes. Among all the tasks and problems that seem simple, you'll find time and date questions to be some of the trickiest. Deriding "noobs" doesn't do anyone any credit.


You're usually pretty safe if you allow well-tested date and time libraries to do all of the heavy lifting.


I'm keenly aware of the complications. My point is that every other professional programmer should be too.

I stand by my statement: this was a stunningly stupid, n00b mistake. If I had made it, I would say exactly the same thing.


Working at Microsoft (in Windows Azure), this was the first outage since I joined the org, so I did not know what to expect from the company in terms of transparency on this outage. However, given other presentations or papers on the Windows Azure technology and how open they were publicly, I expected a good job here.

Bill Laing's post confirmed how transparent Microsoft wants to be with its customers, which is really nice. And I appreciate how seriously Microsoft is attempting to learn from these incidents and put measures in place.


The article really is worth a read if you build complex systems. My takeaway from this is that you shouldn't schedule maintenance work during "weird" times.

Had they not been deploying new code on leap day (UTC), the outage would have been substantially less severe. Code that uses dates and times will have bugs, because it's hard. Don't complicate things further.

So from now on, no more leap day, daylight saving time, or New Year's maintenance. It's worth postponing a day just in case.
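
A rule like this could be enforced by a pre-deploy check. The following is only a sketch of one hypothetical policy: it flags leap day and the days around New Year, and leaves out daylight saving transitions because those depend on the time zone.

```python
from datetime import date

def is_risky_maintenance_day(d: date) -> bool:
    """True for leap day and the days around New Year (hypothetical policy)."""
    leap_day = (d.month, d.day) == (2, 29)
    new_year = (d.month, d.day) in {(12, 31), (1, 1)}
    return leap_day or new_year

print(is_risky_maintenance_day(date(2012, 2, 29)))
print(is_risky_maintenance_day(date(2012, 3, 1)))
```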


That seems to be incredibly well-detailed, much more than Amazon's or others' responses to their outages so far.



