1. On February 29, 2012, new certificates created with a one-year expiration date by adding 1 to the year. Since February 29, 2013 is an invalid date, VMs wouldn't start.
2. After multiple attempts to restart failed VMs, physical hosts marked as failed, and VMs migrated to other physical machines -- the problem propagates.
3. Management services disabled to prevent customers from starting more VMs, compounding the problems.
4. After leap-day bug fixed, secondary failures caused by mixing up incompatible versions of a networking plugin, so VMs had no network access.
5. Total duration of outages: about 16 hours.
6. 33% of a month's service to be credited to all customers, regardless of who was affected.
Not to mention that a 16-hour time to fix is insane, unless all your datacenters had been blown up or war had broken out.
Because most other providers would have refunded the customer 16 / (24*28) = 1 / 42 = 2.4% of the bill.
Microsoft paid out 10x that amount.
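The pro-rata arithmetic above is easy to check. A minimal Python sketch, using the 28-day month from the comment (February 2012 actually had 29 days, which barely changes the result); the function name is made up for illustration:

```python
def pro_rata_credit(outage_hours, days_in_month):
    """Fraction of the monthly bill covered by a strictly pro-rata refund."""
    return outage_hours / (24 * days_in_month)

credit_28 = pro_rata_credit(16, 28)   # the figure used above: 1/42
credit_29 = pro_rata_credit(16, 29)   # February 2012 had 29 days

print(f"{credit_28:.1%}")             # 2.4%
print(f"{credit_29:.1%}")             # 2.3%
print(f"{0.33 / credit_28:.1f}x")     # 13.9x, so "10x" is if anything an understatement
```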
The type of SLA that you are talking about (one that pays out to cover all loss of business) does not exist anywhere, and if it did, it would cost you more than the $10-$100/month hosting account that you'd normally buy.
When Amazon's elastic block store was down, they credited back 10 days of service.
And Google offers a 99.95% SLA for App Engine which refunds 10%, 25% or 50% of the total monthly bill if uptime falls below 99.95%, 99.00% or 95.00% respectively.
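A tiered schedule like the one quoted for App Engine is just a threshold lookup. This is a rough sketch of the tiers as stated in the comment, not Google's actual billing logic; the function name and the 16-hour, 29-day example input are assumptions for illustration:

```python
def tiered_credit_percent(uptime_percent):
    """Refund percentage for a month, using the tiers quoted above."""
    if uptime_percent < 95.00:
        return 50
    if uptime_percent < 99.00:
        return 25
    if uptime_percent < 99.95:
        return 10
    return 0

# Hypothetical input: a 16-hour outage in a 29-day month.
uptime = 100 * (1 - 16 / (24 * 29))
print(round(uptime, 2), tiered_credit_percent(uptime))  # about 97.7% uptime -> 25
```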
And this is why you always use your framework's or language's date arithmetic library and never try to hack up a solution on your own. Date calculations alone are hard enough with the basic irregularities of month lengths. Add the leap years and it becomes even harder.
And don't get me started on times, especially once time zones and summer time come into play.
Likely your particular hacked-together solution will fail at some point. And if it doesn't: was it worth all the effort you put into making it perfect, especially considering that somebody has already done it for your framework?
NIH at its finest.
datetime( now.year + 1, now.month, now.day )
Not excusing MSFT here, as they have the resources and experience to get it right, but in general I think that following the rule of "don't DIY" won't solve the problem.
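To make that concrete, here is a small Python sketch of the pattern above. It goes through the standard library's date constructor, yet still does the year arithmetic by hand, and it only fails for one start date in four years. The helper names are made up for illustration:

```python
from datetime import date, timedelta

def naive_one_year_later(d):
    # Uses the library constructor, but the "+ 1" on the year is still hand-rolled.
    return date(d.year + 1, d.month, d.day)

def failing_start_dates(first, last):
    """Return every start date in [first, last] for which the naive version raises."""
    failures = []
    d = first
    while d <= last:
        try:
            naive_one_year_later(d)
        except ValueError:
            failures.append(d)
        d += timedelta(days=1)
    return failures

print(failing_start_dates(date(2009, 1, 1), date(2012, 12, 31)))
# [datetime.date(2012, 2, 29)]
```

Sweeping a full four-year range of start dates like this is a cheap test, and it would have flagged the problem before leap day did.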
If they were purely symbolic constants, then the expression "January + 1" is meaningless and would throw an error.
So, with hindsight, I'd say that any datetime API which represents days, months and years as numeric quantities (which is, probably, all of them) encourages these kinds of bugs, or at least doesn't discourage them.
Can anyone come up with a use case where you need numeric values for these things? (Which doesn't suffer from the same kind of bugs as this?)
What other leap year bugs have people run into? Generally the libraries I work with (e.g. Python's timedelta) don't let you add months or years because of their ambiguity.
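For anyone who hasn't tried it, this is roughly what that looks like with Python's standard library. timedelta only accepts fixed-length units, so the closest you can get is a 365-day approximation, and that drifts across leap days:

```python
from datetime import date, timedelta

# timedelta has no "years" (or "months") keyword, so this raises TypeError.
try:
    timedelta(years=1)
    years_supported = True
except TypeError:
    years_supported = False
print(years_supported)                    # False

one_year_ish = timedelta(days=365)        # fixed-length stand-in for "a year"
print(date(2012, 2, 29) + one_year_ish)   # 2013-02-28
print(date(2011, 3, 1) + one_year_ish)    # 2012-02-29, not 2012-03-01
```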
> t = Time.parse("feb 29, 2012")
=> 2012-02-29 00:00:00 -0500
> t + 1.year
=> 2013-02-28 00:00:00 -0500
> t + 1.year == (t - 1.day + 1.year)
But say 2012-02-29 + 1 year went to 2013-03-01. Then what's 2012-03-01 - 1 year? Does it go to March 1st every time, sometimes ignoring that there's an extra day in between, or does it go back 365 days (March 2nd)?
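The same trade-off can be reproduced in Python. The helper below is a hypothetical stand-in that clamps Feb 29 to Feb 28, like the Ruby output above; it is a sketch of one possible policy, not any library's actual implementation:

```python
from datetime import date, timedelta

def add_years_clamped(d, years):
    """Hypothetical helper: same month/day N years later, Feb 29 clamps to Feb 28."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # Only reachable when d is Feb 29 and the target year is not a leap year.
        return d.replace(year=d.year + years, day=28)

leap_day = date(2012, 2, 29)
day_before = leap_day - timedelta(days=1)

# Two different start dates collapse onto the same result...
print(add_years_clamped(leap_day, 1))     # 2013-02-28
print(add_years_clamped(day_before, 1))   # 2013-02-28
# ...so adding a year and then subtracting it does not round-trip.
print(add_years_clamped(add_years_clamped(leap_day, 1), -1))  # 2012-02-28

# Going back a fixed 365 days from March 1st lands on March 2nd instead.
print(date(2012, 3, 1) - timedelta(days=365))  # 2011-03-02
```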
I suspect the only "solution" is to decide that calendar dates are for human consumption only. If you're doing any calculations, you do them on timestamps, where you can declare a 'year' to be one of 365.256363004, 365.24219, or 365.259636 days (sidereal, tropical, and anomalistic years, respectively). Given the importance of calendars and seasons, you'd probably want the middle one, since the tropical year is essentially the length that calendars use as their definition of a year. That way you can just screw the whole leap-year concept.
Of course, then you're left with year-long agreements that expire at odd hours of the night.
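A quick Python sketch of that idea, assuming the tropical-year figure quoted above and a made-up signing time of midnight UTC on leap day:

```python
from datetime import datetime, timedelta, timezone

# Assumption: "one year" is a fixed span of 365.24219 days (tropical year).
TROPICAL_YEAR = timedelta(days=365.24219)

signed = datetime(2012, 2, 29, 0, 0, tzinfo=timezone.utc)
expires = signed + TROPICAL_YEAR

# Always a valid instant, but it lands early on 2013-02-28 at
# around 05:48 UTC rather than at midnight.
print(expires.isoformat())
```

The result can never be an invalid date, because no calendar fields are ever assembled by hand. The price is exactly the odd expiry time mentioned above.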
I much prefer spelling out the intention plainly.
So I think as a matter of API design, Ruby has made a wrong choice by making these operations look like arithmetic on known values. It's too tempting to think that they'll obey the arithmetic identities, when they don't. If you want to have an API function "same date N months later", I have no problem with that at all; then it's much less tempting to think that it's just doing addition.
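Something like the following is what I have in mind, sketched in Python. The function name and the clamp-to-end-of-month policy are my own assumptions; the point is only that the name states the intent instead of borrowing the + operator:

```python
import calendar
from datetime import date

def same_day_n_months_later(d, n):
    """Same day-of-month n months later, clamped to the last day of that month."""
    # Work with a zero-based month index so negative n is handled by divmod.
    index = d.year * 12 + (d.month - 1) + n
    year, month0 = divmod(index, 12)
    month = month0 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))

start = date(2012, 3, 31)
print(same_day_n_months_later(start, 1))    # 2012-04-30
print(same_day_n_months_later(start, -1))   # 2012-02-29

# Not an arithmetic identity: forward one month and back again loses a day.
print(same_day_n_months_later(same_day_n_months_later(start, 1), -1))  # 2012-03-30
```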
DateTime expiry = DateTime.Now.AddYears(1);
As long as you understand how your library handles these calculations, you're fine. Trying to do it manually is likely to get you in trouble.
new DateTime(2012, 3, 31)
.AddMonths(1) ==> 2012-04-30
new DateTime(2012, 3, 31)
.AddMonths(-1) ==> 2012-02-29
 For example: http://sourceforge.net/projects/dday-ical/
So to add a year, all you have to do is add 365.25 * 24 * 60 * 60 seconds.
This was a stunningly stupid, n00b mistake.
Yes, that's my point exactly. The conversion from the familiar representation to linear seconds incorporates all of those complications (except leap seconds, which hardly anyone bothers with).
If you don't want to use 365.25, you can write code to figure out whether to use 365 or 366; but in neither case will you have a bug like the one Azure hit where a completely invalid date was generated.
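Here is a rough Python sketch of the "figure out whether to use 365 or 366" variant, working on Unix timestamps in UTC. The helper names are made up, and treating a start on Feb 29 itself as a 365-day year is an assumption you could reasonably make the other way:

```python
import calendar
from datetime import datetime, timezone

def days_in_next_year(start):
    """Return 366 if the 12 months after `start` contain a Feb 29, else 365."""
    # The February that falls inside the coming 12 months.
    feb_year = start.year if start.month <= 2 else start.year + 1
    if not calendar.isleap(feb_year):
        return 365
    # Assumption: starting on Feb 29 itself counts as already past the leap day.
    if (start.month, start.day) == (2, 29):
        return 365
    return 366

def add_one_year(ts):
    """Add one year to a Unix timestamp, as a whole number of days in seconds."""
    start = datetime.fromtimestamp(ts, tz=timezone.utc)
    return ts + days_in_next_year(start) * 24 * 60 * 60

leap_day_ts = datetime(2012, 2, 29, tzinfo=timezone.utc).timestamp()
print(datetime.fromtimestamp(add_one_year(leap_day_ts), tz=timezone.utc))
# 2013-02-28 00:00:00+00:00
```

Whichever day count you pick, the result is a timestamp, so converting it back always gives a real calendar date.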
Among all the tasks and problems that seem simple, you'll find time and date questions to be some of the trickiest. Deriding "noobs" doesn't do anyone any credit.
I stand by my statement: this was a stunningly stupid, n00b mistake. If I had made it, I would say exactly the same thing.
Bill Laing's post confirmed how transparent Microsoft wants to be with its customers, which is really nice. And I appreciate how seriously Microsoft is attempting to learn from these incidents and put measures in place.
Had they not been deploying new code on leap day (UTC), the outage would have been substantially less severe. Code that uses dates and times will have bugs, because it's hard. Don't complicate things further.
So from now on, no more leap day, daylight saving time, or New Year's maintenance. It's worth postponing a day just in case.