

Confirmed: Windows Azure downtime caused by leap-day bug - panarky
http://blogs.msdn.com/b/windowsazure/archive/2012/03/01/windows-azure-service-disruption-update.aspx

======
paulbaumgart
The last line is a little unfortunate. It made me think of:
<http://37signals.com/svn/posts/1528-the-bullshit-of-outage-language>

~~~
panarky
Contrast Microsoft's wishy washy "any inconvenience this may have caused" with
Amazon's sincere apology.

<http://aws.amazon.com/message/65648/>

(Scroll to last paragraph.)

~~~
Mythbusters
I was hoping for a more technical apology from Microsoft. What the VP posted
might make sense for a suit, but as a techie I really need to know that the
platform is mature and stable. A detailed description of the problem helps me
judge.

~~~
screwt
To be fair, the post does say they'll provide the technical explanation when
they have more info:

    
    
        We will post an update on this situation, including details on the root cause analysis at the end of this incident.
    

I think that's fair enough (assuming they don't just quietly forget about this
last part).

------
slardat01
Another strange time-related issue that just burned us: if your server is up
for 497 days, it will stop closing sockets:
<http://support.microsoft.com/kb/2553549>

~~~
InclinedPlane
On a hunch I converted 497 days to seconds, and it works out to about 42.9
million. A suspiciously familiar number, as it is almost exactly 2^32
hundredths of a second. Since 10 ms is a common clock resolution, that points
to an obvious cause: a 32-bit time counter rolling over and horfing the
relative age calculations, so all of the sockets that were open prior to the
rollover stay open forever.
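
The arithmetic behind that hunch can be checked with a short sketch. The function name and the tick rates passed to it are illustrative assumptions, not anything taken from the KB article:

```python
SECONDS_PER_DAY = 86_400
COUNTER_RANGE = 2**32  # number of distinct values in a 32-bit counter

def rollover_days(ticks_per_second: int) -> float:
    """Days of uptime before a 32-bit tick counter wraps around."""
    return COUNTER_RANGE / ticks_per_second / SECONDS_PER_DAY

# 100 ticks per second corresponds to a 10 ms clock resolution.
print(f"10 ms ticks: {rollover_days(100):.2f} days")   # 497.10 days
# 1000 ticks per second corresponds to a 1 ms resolution.
print(f" 1 ms ticks: {rollover_days(1000):.2f} days")  # 49.71 days
```

So 497 days is close to, but not exactly, the wrap point of a 32-bit counter of 10 ms ticks.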

~~~
dave1010uk
Windows 95 crashed after 49.7 days (2^32 ms) for similar reasons:
<http://news.cnet.com/2100-1040-222391.html>

~~~
InclinedPlane
There are two things which are a bit off-putting about this.

First, the fact that the same exact type of bug had been known in 1999 and yet
they either failed to fix it in the newer code base or they reimplemented the
exact same bug in new code.

Second, the reason these bugs weren't caught earlier is almost certainly that
it's unusual for Windows to have such long uptime (50 days for Win 9x is
impressive, and over a year for Windows Server equally so). Moreover, the
average user almost certainly has such low expectations of Windows reliability
that if they see the system become unstable or slow after a long period of
uptime they will, as a rule, merely reboot the system rather than investigate.

Edit: a thought occurs to me. Perhaps the "fix" for the older problem was to
simply change from using milliseconds since last boot for TCP/IP socket age to
using hundredths of a second. I really, really hope that wasn't the case.

------
rachelbythebay
They noticed it at 5:45 PM PST?

    
    
      $ date -d "Feb 28 2012 17:45 PST" -u 
      Wed Feb 29 01:45:00 UTC 2012
    

Does that mean it took them nearly two hours to spot the problem? Or are they
not running on UTC?

------
xelfer
How are these things happening? In Australia our healthcare system had a bug
too:
<http://www.itnews.com.au/News/292081,hicaps-bug-hits-health-payment-system.aspx>

I don't recall any of this occurring in 2008 or 2004.

~~~
RandallBrown
This happens all the time because of how confusing our calendar is. Lots of
dates are figured out by counting the number of seconds since 1970. It's
really easy to miss a little detail like a leap year, since they happen so
infrequently and they didn't always exist.

It's not just Microsoft that does stuff like this either. Apple regularly
messes up iPhone alarms during daylight savings.

~~~
crcastle
Divisible by 4 => leap year

 _Also_ divisible by 100 => not leap year _unless_ _also_ divisible by 400.

It's really not that complicated.

[edit: Oops. I messed it up. Irony. Fixed now.]
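
For reference, the rule above fits in one expression. This is only a sketch of the Gregorian rule as stated in this comment, cross-checked against Python's standard `calendar.isleap`; the function name `is_leap_year` is made up for illustration:

```python
import calendar

def is_leap_year(year: int) -> bool:
    """Gregorian rule: divisible by 4, except century years not divisible by 400."""
    return year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)

# Cross-check the hand-written rule against the standard library.
for year in range(1583, 2401):
    assert is_leap_year(year) == calendar.isleap(year)

print([y for y in (1900, 2000, 2008, 2012, 2100) if is_leap_year(y)])
# [2000, 2008, 2012]
```

The cross-check is the useful part: if a library function exists, comparing against it is cheaper than trusting a hand-rolled rule.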

~~~
mrb
It's not ironic that you messed it up. You provide the perfect example of
_why_ these bugs happen: not only did you not know the exact rules
(divisibility by 400), but you also thought it would be fine to implement them
yourself, when you should most likely use an existing library to handle date &
time calculations.

~~~
apaprocki
... and most people don't realize that knowing the correct calculation is
only half the battle. Once you know it, you can successfully navigate the
Gregorian calendar. But what happens when you need to work with dates prior to
the start of the Gregorian calendar? Does your Gregorian start happen in 1582
when the first countries adopted it, or in 1752 when the British adopted it?
Most people simply apply Gregorian rules indefinitely into the past, which is
not always correct for every situation:

<http://en.wikipedia.org/wiki/Proleptic_Gregorian_calendar>

------
nandemo
I don't understand. Your software works for months with 28 or 30 or 31 days.
Why does it break for months with 29 days?

I think unless you're messing with non-Gregorian calendars, this is a solved
problem. Am I missing something?

~~~
lysium
Although time is important data in most applications, library support for it
is poor. Most languages don't even have a data type for time!

For example, the system call gettimeofday(2) only returns the seconds (and
microseconds) since Jan 1, 1970 (the 'epoch'), the time zone and the daylight
saving time correction. No day, no month, no year. For those, you have to call
some other function, e.g. libc's localtime(3).

libc's time(3) function says it returns the number of seconds since the epoch,
but it ignores leap seconds, so it actually does not return the number of
seconds since the epoch (since there have been several leap seconds since the
epoch).

Even if you use localtime(3) to get the actual wall clock hour and minutes,
you are left on your own from there. Want to have a time point two hours from
now? Do your own math, but watch out for the end of the day, which might also
be the end of the month and/or the end of the year. One month from now? Do
your own math, but watch out for months that have fewer days than your current
month.

You may want to resort to doing calculations only in seconds since the epoch,
but how many seconds are in a month? Depends on the month (and the year, as
we've just learned!). In a year? Depends on the year (Is it a leap year? Was
there, or will there be, a leap second? Do you have to care about leap
seconds?).

I just picked the C language because it is so prevalent. Other languages have
their own issues or inherit them from C. In Python, for example, there is a
timedelta object, but it can only handle days, seconds, and microseconds, so
you still cannot calculate the date one month from now or one year ago.
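
To make the "one month from now" problem concrete, here is a minimal sketch using only Python's standard library. The helper `add_months` is hypothetical, and clamping to the last day of a shorter month is just one possible policy, assumed here for illustration:

```python
import calendar
from datetime import date

def add_months(d: date, months: int) -> date:
    """Shift d by whole months, clamping the day to the target month's length.

    Clamping (Jan 31 + 1 month -> last day of Feb) is an assumed policy;
    other applications may prefer to roll over into the next month instead.
    """
    month_index = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(month_index, 12)
    last_day = calendar.monthrange(year, month0 + 1)[1]
    return d.replace(year=year, month=month0 + 1, day=min(d.day, last_day))

print(add_months(date(2012, 1, 31), 1))    # 2012-02-29
print(add_months(date(2011, 1, 31), 1))    # 2011-02-28
print(add_months(date(2012, 2, 29), 12))   # 2013-02-28
print(add_months(date(2012, 3, 15), -12))  # 2011-03-15
```

Note that the result depends on a policy decision, which is part of why no single `timedelta`-style unit for "one month" exists.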

I find it unbelievably funny that it's 2012 and we still have to deal with
this 'solved problem'. Turns out, it is not solved at all.

~~~
thwarted
_libc's time(3) function says it returns the number of seconds since the
epoch, but it ignores leap seconds, so it actually does not return the number
of seconds since the epoch (since there have been several leap seconds since
the epoch)._

It returns the number of _actual_ seconds that have elapsed since the epoch;
the only way to do this is to ignore leap seconds. When leap seconds occur,
they don't actually exist, they just change the offset to keep things in sync.
Every second the time(3) function returns exists uniquely and none are
skipped, there are no ambiguous values, which you can't say if leap seconds
were not ignored.

~~~
obtu
Nope, Unix time is aligned on UTC days, which means it jumps back by one when
a leap second is inserted (and would jump forward when one is deleted):
<https://en.wikipedia.org/wiki/Unix_time>
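
A small sketch of what "aligned on UTC days" means in practice, using Python's `calendar.timegm`. The chosen date relies on the leap second inserted at the end of 31 December 2008:

```python
import calendar

# A leap second (23:59:60 UTC) was inserted at the end of 31 December 2008,
# so that UTC day actually lasted 86,401 SI seconds.
start = calendar.timegm((2008, 12, 31, 0, 0, 0))
end = calendar.timegm((2009, 1, 1, 0, 0, 0))

# Unix time still counts exactly 86,400 seconds for that day.
print(end - start)  # 86400

# Every UTC midnight is an exact multiple of 86,400.
print(start % 86_400, end % 86_400)  # 0 0
```

The missing second is why the Unix timestamp has to repeat or stall around an inserted leap second.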

~~~
thwarted
Ah, well I stand corrected. That being said, it seems that UNIX time doesn't
take leap seconds into account physically, but they are logically visible when
viewing the change over time.

------
mrb
Is there any study comparing the downtime of the different cloud platforms
over, say, the past few years? EC2, Azure, Google Apps, etc. That would be the
ultimate tool to shame substandard cloud vendors...

~~~
obtu
CloudHarmony collects these; here are stats for the past year:
<https://cloudharmony.com/status>

~~~
obtu
Correction: it doesn't seem to show the recent Azure outage, which would be on
Azure Compute regions.

------
weavejester
They should really be running some test servers with clocks set ahead of time
in order to get advance notice of problems like this. I seem to recall that
Amazon does this with its servers.
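
The same idea can be applied in miniature inside a test suite, by feeding simulated dates to date-handling code instead of reading the real clock. The sketch below is purely illustrative: `naive_one_year_later` is a hypothetical buggy helper, and nothing in the post says the Azure bug looked like this.

```python
from datetime import date, timedelta

def naive_one_year_later(today: date) -> date:
    """Hypothetical buggy helper: bump the year and keep month and day."""
    return today.replace(year=today.year + 1)

def find_failing_dates(func, start: date, days: int) -> list:
    """Call func with each simulated 'today' and collect the dates that raise."""
    failures = []
    for offset in range(days):
        simulated_today = start + timedelta(days=offset)
        try:
            func(simulated_today)
        except ValueError:
            failures.append(simulated_today)
    return failures

# Sweep the simulated clock across all of 2012, a leap year.
print(find_failing_dates(naive_one_year_later, date(2012, 1, 1), 366))
# [datetime.date(2012, 2, 29)]
```

Passing the date in as a parameter, rather than calling `date.today()` inside the function, is what makes this kind of sweep possible.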

------
pom
This sort of thing happens every leap year, and every time I remember that the
very first exercise in the very first CS class that I took in college was to
write an algorithm to decide whether or not a given year was a leap year.

------
wavephorm
I thought this sounded familiar:

<http://www.computerworld.com/s/article/9124638/Zune_chokes_on_leap_year_bug>

~~~
panarky
The Zune leap-year bug bricked the players on December 31.

Money quote: "Microsoft says it will issue a bug fix for the device so that
this problem won't occur again in 2012, the next leap year."

~~~
tazzy531
Maybe that's why they cancelled the Zune in 2011 to avoid having to fix the
leap year bug.

