
Leap second causing Linux server crashes? - sathyabhat
http://serverfault.com/q/403732/8453
======
__david__
It appears to be fixed in Linux 3.4 [1]. According to the original commit [2]
it's been broken since 7dffa3c673fbcf835cd7be80bb4aec8ad3f51168 [3], which
appeared in 2.6.26.

So, kernels between 2.6.26 and 3.3 (inclusive) are vulnerable.

[1]
[https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2....](https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=bcd550745fc54f789c14e7526e0633222c505faa)

[2]
[https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2....](https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d)

[3]
[https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2....](https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7dffa3c673fbcf835cd7be80bb4aec8ad3f51168)
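
As a rough illustration of the range stated above, here is a minimal Python sketch that checks whether an upstream-style kernel release string falls between 2.6.26 and 3.3 inclusive. The helper names (`parse_release`, `in_affected_range`) are made up for this example, and distro kernels often carry backported fixes, so the version number alone is only a first hint.

```python
import re

# Range quoted in the comment above: 2.6.26 through 3.3 inclusive,
# i.e. everything from 2.6.26 up to (but not including) 3.4.
FIRST_AFFECTED = (2, 6, 26)
FIRST_FIXED = (3, 4, 0)


def parse_release(release):
    """Turn a string such as '3.2.0-4-amd64' into a tuple like (3, 2, 0)."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError("unrecognised kernel release: %r" % release)
    # A missing third component (as in '3.3') is treated as 0.
    return tuple(int(part or 0) for part in match.groups())


def in_affected_range(release):
    """True if the upstream version number lies in the quoted range."""
    return FIRST_AFFECTED <= parse_release(release) < FIRST_FIXED
```

On a Linux box, `in_affected_range(platform.release())` (after `import platform`) applies the check to the running kernel.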

~~~
moe
Which, in summary, is pretty much every production kernel out there.

Spent the last two hours recovering servers, tomorrow will be another
interesting day.

Whoever figured it'd be a good idea to INSERT[1] the leap-second instead of
just slowing/accelerating time... <censored>

[1] Clock: inserting leap second 23:59:60 UTC

~~~
rplnt
Well, it was a known bug and you had six months to prepare (i.e. update your
kernel).

~~~
moe
Where was it published?

Almost all of my machines run the Debian stable kernel and were still
affected.

~~~
rplnt
The leap second was scheduled in January. That event is so unusual you might
get worried. So you do a simple google search and find out that there was a
critical bug[1] in Linux kernel last time leap second was inserted. People got
worried rightfully[2][3]. I don't know about debian, if it was known prior, if
it is the same bug as before, ... But I don't run Debian, you do.

[1] <https://bugzilla.redhat.com/show_bug.cgi?id=479765>

[2] [http://it.slashdot.org/story/12/06/30/2123248/the-leap-secon...](http://it.slashdot.org/story/12/06/30/2123248/the-leap-second-is-here-are-your-systems-ready)

[3] [http://serverfault.com/questions/402087/does-centos-5-4-prop...](http://serverfault.com/questions/402087/does-centos-5-4-properly-handle-leap-seconds)

~~~
moe
No need to be a smart-ass about it.

Even if I had googled (which I didn't) then I'd probably have assumed the
fixes for bugs from 2009 to have long made it into the current distro kernels.

I just didn't expect something so basic to be still (or again) broken.

~~~
rplnt
Don't get me wrong, I wouldn't either by default. But do you remember Azure
crashing on February 29th? And checking for that date is a matter of three
conditions. A leap second is much more complex. I'm not trying to be a
smart-ass. I'm just saying it's something I would worry about and would try to
read up on. And perhaps it wouldn't lead anywhere with Debian.

And still, something in your app stack could crash on this as well, leaving
the kernel patching pointless.

------
dfc
Google uses a "leap smear" and slowly accounts for the leap second before it
happens.[1] As long as you are not doing any astronomical calculations or
constrained by regulatory requirements I think google has the right idea.

[1] [http://googleblog.blogspot.com/2011/09/time-technology-and-l...](http://googleblog.blogspot.com/2011/09/time-technology-and-leaping-seconds.html)
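
To make the smearing idea concrete, here is a minimal Python sketch of a linear smear. The linear ramp, the 24-hour window, and the function name `smeared_offset` are all assumptions made for illustration; the linked post describes Google's own scheme, which may use a different curve and window.

```python
def smeared_offset(seconds_until_leap, window=86400.0):
    """Fraction of the extra second already absorbed, from 0.0 to 1.0.

    seconds_until_leap: real seconds remaining until the leap second.
    window: length of the smear in seconds (assumed, not Google's value).
    """
    if seconds_until_leap >= window:
        return 0.0  # smear has not started yet
    if seconds_until_leap <= 0:
        return 1.0  # the whole second has been absorbed
    # Linear ramp: each real second hides 1/window of the leap second.
    return 1.0 - seconds_until_leap / window
```

A time server using this would report `true_time - smeared_offset(...)` seconds, so clients never see a 23:59:60 or a repeated second.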

~~~
jbeda
As part of Google Compute Engine we provide an NTP server to the guest which
is based on Google Production time. As such our VMs get to take advantage of
this leap second smearing implementation. I was going to mention this at my
talk at IO but forgot.

~~~
ralph
So a VM on G's Compute Engine could in turn run an NTP server that exported
G's Production Time? Do I also see GPT on App Engine?

Any chance Google could just make a GPT NTP server available as a public
service anyway, just as 8.8.8.8 is their public ping responder. ;-)

~~~
enneff
8.8.8.8 and 8.8.4.4 are Google Public DNS, not a "ping responder."
<https://developers.google.com/speed/public-dns/>

Google does provide time servers, although I'm not sure whether they are
officially supported. The addresses are:

    
    
        time1.google.com
        time2.google.com
        time3.google.com
        time4.google.com

~~~
ralph
Yeah, sorry, I know they're the public DNS, but they're also jolly handy as
unforgettable IP addresses you expect to be able to ping, hence the smiley.

Good news about the time{1..4} NTP servers, I'll give them a try, thanks.

------
ChuckMcM
Not surprising. In spite of all the press saying that Y2K was just a silly
waste of money, it's events like these that make me suspect it would have been
a much bigger deal if everyone had ignored it and fixed it after things were
shown to break.

~~~
noselasd
Why does everyone always say Y2K wasn't an issue? I'm sure there were a lot
of consultants making too much money with little work. However, _a lot_ of bug
fixes were done that would otherwise have caused problems. Because it was taken
seriously, stuff was fixed and issues didn't happen.

Personally, I fixed 3 Y2K bugs back then. 2 of them would have caused a
rather critical business support system to simply crash every time new data
arrived.

~~~
jerf
"Why does everyone always say Y2K wasn't an issue?"

From the outsider's perspective, it is indistinguishable from any number of
other putative disasters that required lots of money to fix, yet didn't come
to pass... in some cases including putative disasters in which the money
wasn't spent and the disaster didn't happen anyway.

I have the insider's perspective and I agree that it is the more accurate,
that Y2K was, if not necessarily going to end the world, certainly a bad thing
and was largely averted through effective engineering. But I can still see how
from the outside it sure doesn't look that way.

~~~
ChuckMcM
This.

Many, if not a majority, of my non-technical friends and acquaintances have
expressed at one time or another a reference to the "Y2K disaster" and rolled
their eyes to suggest it was somehow not an issue. I was the 'Y2K compliance
officer' at my startup at the time (we even got certified, and that actually
may have been a scam (the certifying part)) but we identified and fixed a
number of issues our box would have suffered had we not done the work.

------
duiker101
It's 2012, and we still have problems keeping track of time. This is both
fascinating and scary.

P.S. for people wanting to know more this video is simple to understand but
really amazing <http://www.youtube.com/watch?v=xX96xng7sAE>

~~~
thaumasiotes
From discussion of this same issue in prior threads, my takeaway was

(a) it's really not at all difficult to handle leap seconds, but

(b) the POSIX standard specifically disallows them, by specifying that a day
must contain exactly 86400 seconds. (Analogously, imagine if leap days
occurred as normal, but a "year" by definition contained exactly 365 days.)

The existence of leap seconds means that it's not possible to simultaneously
have (1) system time representing the number of seconds since the epoch, and
(2) system time equal to (86400 * number_of_days_since_epoch) +
seconds_elapsed_today, and all the proposed methods of dealing with the
problem involve preserving (2), which seems worthless to me, and throwing away
(1), which I would have thought was a better model.

edit: actual system times may be in units other than seconds, but the point
remains
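
The arithmetic behind property (2) can be checked with a short Python sketch using only the standard library. It shows that POSIX timestamps treat every day as exactly 86400 seconds, so the timestamps on either side of the 2012 leap second differ by one even though two real seconds separate them.

```python
import calendar

SECONDS_PER_DAY = 86400

# calendar.timegm converts a UTC broken-down time to a POSIX timestamp.
before = calendar.timegm((2012, 6, 30, 23, 59, 59))
after = calendar.timegm((2012, 7, 1, 0, 0, 0))

print(after)                    # 1341100800
print(after % SECONDS_PER_DAY)  # 0: midnight is a whole multiple of 86400
print(after - before)           # 1, although 23:59:60 lies in between
```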

~~~
timr
It's harder than leap days, because leap seconds aren't inserted on a regular
schedule. Leap days follow a predictable pattern of insertion. Leap seconds
are inserted whenever the IERS decides to insert them.

The problem of leap seconds is therefore closer to that of time zone
definitions -- which are a total mess, because they depend on keeping rapidly
changing system tables up to date. I can see why people don't relish the idea
of requiring similar tables just to keep system time accurate.

~~~
thaumasiotes
How are systems being notified of the leap seconds now, that wouldn't
immediately enable them to update their hypothetical leap second table?

It seems like we already have a much bigger lead time for notification than we
could possibly need.

> I can see why people don't relish the idea of requiring similar tables just
> to keep system time accurate.

But the 'solution' we're using now is to make system time less accurate, not
more accurate. Accurate would be if leap seconds incremented the system clock
like normal seconds do. If the accuracy you're worried about is displaying a
clock time rather than time since the epoch, you already need a time zone to
do that.

~~~
timr
_"How are systems being notified of the leap seconds now, that wouldn't
immediately enable them to update their hypothetical leap second table?"_

I am not an expert, but as far as I know the most automated solutions are
doing it via NTP, which just resets the second, then relies on clock drift to
bring everything back into synch. Otherwise, I think your only option is to
keep the timezone packages up-to-date (which is a non-trivial task for large
deployments). A quick search found this:

<http://www.novell.com/support/kb/doc.php?id=7001865>

 _"But the 'solution' we're using now is to make system time less accurate,
not more accurate."_

Yeah, I'm not disputing this. I'm just saying that preserving the assumption
that _"day == 86400 seconds"_ probably breaks less code than the alternative.
NTP messes with the notion of seconds-since-epoch anyway, so we know that
single-second variations in unix time aren't automatically deadly to most unix
software.

~~~
obtu
NTP sends a special message to the kernel (using adjtimex), that boils down
to: today you will insert a leap second. This isn't the same as clock drift,
which gets smoothed out, it means a minute with a 60th second (in UTC) or with
the 59th second happening twice (in POSIX). NTP servers need a leap second
table
([http://support.ntp.org/bin/view/Support/ConfiguringNTP#Secti...](http://support.ntp.org/bin/view/Support/ConfiguringNTP#Section_6.14)),
but most other systems only need to know the _current_ delta between POSIX
and TAI, and manage without a leap table.
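
The "59th second happening twice" behaviour can be sketched as a toy model in Python. This is not kernel code; `ticks_around_leap` and `format_hms` are names invented for the illustration, and the model simply steps the counter back by one when it would otherwise reach midnight.

```python
def format_hms(posix):
    """Render the time-of-day part of a POSIX timestamp as HH:MM:SS."""
    secs = posix % 86400
    return "%02d:%02d:%02d" % (secs // 3600, secs % 3600 // 60, secs % 60)


def ticks_around_leap(midnight, leap_pending=True):
    """Return (UTC label, POSIX value) for four consecutive real seconds."""
    out = []
    posix = midnight - 2
    for _ in range(4):
        if leap_pending and posix == midnight:
            # Inserted second: step back so the previous value repeats.
            posix -= 1
            leap_pending = False
            label = "23:59:60"
        else:
            label = format_hms(posix)
        out.append((label, posix))
        posix += 1
    return out


# 1341100800 is 2012-07-01 00:00:00 UTC as a POSIX timestamp.
for label, value in ticks_around_leap(1341100800):
    print(label, value)
```

The third line printed carries the UTC label 23:59:60 but the same POSIX value as 23:59:59, which is the repetition described above.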

------
kabdib
Fear the Unix 32-bit time-becomes-negative bugs, in 2037.

We have 25 years to get ready. I still think we'll be patching at the last
minute.

(Yeah, lots of systems will be 64-bit by then, but there will still be a lot
of embedded crackerbox systems running 32-bit timestamps. It's all the
embedded stuff I'm worried about).

~~~
bcantrill
It's 2038, not 2037.[1] (Specifically, January 19th, 2038 at 3:14:08am.) And
while lots of systems will be 64-bit, many programs still won't be -- and it
seems highly likely that this will be a significantly more serious and
widespread problem than, say, Y2K or DST. (And certainly more serious than
leap seconds, which happen relatively frequently.) Then again, I might be
biased: perhaps I'm secretly hoping to spend the years leading up to 2038
paying for my retirement with high-priced consulting gigs to fix it...

[1] <http://en.wikipedia.org/wiki/Year_2038_problem>
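
The date quoted above follows directly from the size of a signed 32-bit counter. A minimal Python sketch, where `as_int32` is a helper written for this example to mimic 32-bit wraparound:

```python
import struct
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def as_int32(value):
    """Reinterpret the low 32 bits of value as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", value & 0xFFFFFFFF))[0]


last_ok = 2**31 - 1
print(EPOCH + timedelta(seconds=last_ok))  # 2038-01-19 03:14:07+00:00

# One second later the signed counter wraps to its most negative value.
wrapped = as_int32(last_ok + 1)
print(wrapped)                             # -2147483648
print(EPOCH + timedelta(seconds=wrapped))  # 1901-12-13 20:45:52+00:00
```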

~~~
el_presidente
<http://article.gmane.org/gmane.linux.kernel/1184914>

Less than a year ago there were already people thinking about your job
security. (It's a better explanation than "the glibc maintainers are insane".)

~~~
kabdib
But MUCH less than a year ago, many more people were still writing 32-bit-
dirty time_t based code.

It's gonna be a fun one.

------
kzk_mover
Now facing this issue... Using the 'adjtimex' command, you can clear the
problematic INS bit.

First, confirm the status flag like this.

    
    
        $ ./adjtimex --print | grep status
        status: 8209
    

8209's binary representation is shown below. It clearly has the INS bit set:
"100000000[1]0001" (5th LSB).

    
    
        $ ruby -e 'p 8209.to_s(2)'
        "10000000010001"
    

8193 is the value after clearing the INS bit.

    
    
        $ ruby -e 'p 8193.to_s(2)'
        "10000000000001"
    

Then set it as the current value. Please ensure your ntpd is not running.

    
    
        $ adjtimex --status 8193
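
The bit arithmetic in the steps above can be sketched in Python. The flag values below are the `STA_*` constants as I understand them from `<linux/timex.h>`; treat them as an assumption and check your own headers. `decode` and `clear_ins` are names made up for this example, and the code only computes the new value; it does not touch the kernel.

```python
# STA_* status bits (assumed to match <linux/timex.h>).
STA_FLAGS = {
    0x0001: "STA_PLL",
    0x0002: "STA_PPSFREQ",
    0x0004: "STA_PPSTIME",
    0x0008: "STA_FLL",
    0x0010: "STA_INS",      # insert a leap second at the end of the day
    0x0020: "STA_DEL",      # delete a leap second at the end of the day
    0x0040: "STA_UNSYNC",
    0x0080: "STA_FREQHOLD",
    0x2000: "STA_NANO",
}
STA_INS = 0x0010


def decode(status):
    """List the names of the known flags set in a status word."""
    return [name for bit, name in sorted(STA_FLAGS.items()) if status & bit]


def clear_ins(status):
    """Return the status word with the leap-insert bit cleared."""
    return status & ~STA_INS


print(decode(8209))     # ['STA_PLL', 'STA_INS', 'STA_NANO']
print(clear_ins(8209))  # 8193
```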

------
MrUnderhill
Novell kb: <http://www.novell.com/support/kb/doc.php?id=7001865>

    
    
      SLE9 (kernel 2.6.5-7.325): NOT AFFECTED
      SLE10-SP1 (kernel 2.6.16.54-0.2.12): NOT AFFECTED
      SLE10-SP2 (kernel 2.6.16.60-0.42.54.1): NOT AFFECTED
      SLE10-SP3 (kernel 2.6.16.60-0.83.2): NOT AFFECTED
      SLE10-SP4 (kernel 2.6.16.60-0.97.1): NOT AFFECTED
      SLE11-GA (kernel 2.6.27.54-0.2.1): VERY UNLIKELY
      SLE11-SP1 (kernel 2.6.32.59-0.3.1): VERY UNLIKELY
      SLE11-SP2 (kernel 3.0.31-0.9.1): VERY UNLIKELY
    
      Update (06/26/2012): after thorough code review -> SLE9 and SLE10 not affected at all.

------
brongondwana
FYI: I've updated the post with details of the workaround as implemented on
our servers.

------
shaggy
Pardon the ignorance if this is a stupid question. I've been looking at some
of my hosts and have noticed a message "Clock: inserting leap second 23:59:60
UTC" in dmesg output, but each of the hosts is in the EDT timezone, so I was
under the impression that the leap second hadn't been applied yet. So what
does that mean? That the systems have applied the leap second successfully or
have only received it from their NTP servers?

~~~
DEinspanjer
The leap second is applied at midnight UTC time, regardless of what timezone
the server is in.
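
A small Python sketch of what that means for local wall-clock time. It assumes fixed UTC offsets (EDT is UTC-4), and `local_wall_time` is a name chosen for this example.

```python
from datetime import datetime, timedelta, timezone

# The leap second 23:59:60 UTC sits immediately before this instant.
LEAP_BOUNDARY_UTC = datetime(2012, 7, 1, 0, 0, 0, tzinfo=timezone.utc)


def local_wall_time(utc_offset_hours):
    """Wall-clock time at the leap boundary for a fixed UTC offset."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return LEAP_BOUNDARY_UTC.astimezone(tz).strftime("%Y-%m-%d %H:%M")


print(local_wall_time(-4))  # 2012-06-30 20:00 (EDT)
print(local_wall_time(0))   # 2012-07-01 00:00 (UTC)
```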

~~~
shaggy
Okay, so does that mean that the various bugs that have been circulating can
still hit as it hasn't hit midnight in EDT yet or can I exhale?

------
piggity
We just had 100s of EC2 instances generate high (alleged) load. Instances had
load averages of 90+ but were responsive.

Running on a 3.2 kernel

Rebooted them all and they're fine.

~~~
sehugg
What he said.

------
kristopher
FYI: Our Debian servers did not kernel panic but system CPU load went through
the roof; A quick restart brought levels back to normal.

~~~
wiredfool
My Ubuntu 10.04 desktop went to 100% proc and load avg of 20, none of my 10.04
servers or Debian stable servers were affected.

This fixed it:

    
    
      date; sudo date `date +"%m%d%H%M%C%y.%S"`; date;
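
For readers puzzling over the format string: the inner `date` call prints the current time as `MMDDhhmmCCYY.ss`, which the outer `date` then takes as the time to set, so the clock is reset to (almost exactly) its current value. A minimal Python sketch of the same string, using `%Y` in place of `%C%y` (equivalent for four-digit years); `date_set_argument` is a name invented for this example and nothing here changes the clock.

```python
import time


def date_set_argument(t):
    """Format a struct_time as MMDDhhmmCCYY.ss, the form date(1) accepts."""
    return time.strftime("%m%d%H%M%Y.%S", t)


example = time.strptime("2012-07-01 12:34:56", "%Y-%m-%d %H:%M:%S")
print(date_set_argument(example))  # 070112342012.56
```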

~~~
ajays
You are a lifesaver. All morning my desktop's load has been pegged at 20. I
upgraded FF, Chrome, etc. and no impact. I was dreading a full re-start, as I
have lots of windows, tabs, etc. open. The above command knocked the load down
to almost nothing in seconds.

------
politician
After reading these tales of woe, all I can say is that I hope the criminal
element doesn't start assaulting NTP servers.

------
mootothemax
I was logged on to a couple of CentOS 6 servers when I saw this happen, and on
each one the Java processes went absolutely haywire. Everything else seemed to
work fine.

I attempted to fix with adjtimex and the script in the linked question, but to
no avail, in the end having to restart them all instead. After that, all was
good again.

~~~
cagenut
I just had the exact same experience.

------
glawatscheck
POSTMORTEM fix for CPU-eating ksoftirqd threads without rebooting:

stop ntpd, run ntpdate or sntp, start ntpd

    /etc/init.d/ntp stop; sntp -s <ntpserver>; /etc/init.d/ntp start

Unfortunately, the sntp / ntpdate wrapper is not shipped with squeeze, for
example. I've used the binary from SuSE 11.4 just fine on squeeze.

~~~
glawatscheck
OK this is how it works on squeeze etc.:

    apt-get install ntpdate; /etc/init.d/ntp stop; ntpdate pool.ntp.org; /etc/init.d/ntp start

~~~
glawatscheck
or easier still just date -s "`date`"

without ntpd restart

------
raverbashing
Ouch!

My Debian GNU/Linux 6.0 is still standing

Oh well, reading the issue, the machine date is Sat Jun 30 16:11:31 EDT 2012

Stopped ntpd just in case

~~~
rbanffy
Same here. Set ntp to restart in 12 hours.

~~~
raverbashing
With ntp stopped, no problem whatsoever

------
yaix
Two days ago while booting, the BIOS time on my eeepc was suddenly reset, with
an error message on boot to adjust the time manually. Was just thinking that
it may be related?

------
sayeed
Our Linux instances running on Amazon EC2 had no issues since we are not
running ntpd on these servers and adjtimex returns status as 64 (clock
unsynchronized).

I think the Xen host takes care of the synchronization and we need not do it
in the guest OS. (see [http://serverfault.com/questions/100978/do-i-need-to-run-ntp...](http://serverfault.com/questions/100978/do-i-need-to-run-ntpd-in-my-ec2-instance)).

Is this fine or should we run ntpd for better accuracy?

~~~
csarva
Yes. This issue notwithstanding, you should be running ntpd.

------
arohner
Stupid question: Why was this not caught? Seems pretty easy to test. Just set
the clock to today (or any day with a leap second), and watch what happens.

~~~
duskwuff
> Just set the clock to today (or any day with a leap second), and watch what
> happens.

That won't work. The bug is only triggered when an upstream NTP server reports
that a leap second was scheduled. Since leap seconds aren't predictable (and
aren't even scheduled very far in advance), just setting the time back to the
date of a previous leap second won't do anything.

~~~
eadvgf
True, but the question still stands, since you can still test it by just
telling the kernel to insert a (fake) leap second.

~~~
Someone
It also should not be that hard to provide your own upstream ntp server, and
have that generate leap seconds at will. Both machines could be VMs, too.

------
cullenking
On debian, I was able to fix the issue (fix the load issue specifically) with
this command

/etc/init.d/ntp stop; date; date `date +"%m%d%H%M%C%y.%S"`; date;

------
chmod775
If really all Linux systems were affected, more than half of the Internet
would still be down by now. It could be a specific combination of
kernel/userspace bugs that only exists in some systems.

What sucks a bit is that my VPN (openvpn) was affected too, causing my
computer to do a poweroff. I replaced the poweroff with

ip route add to 192.168.1.0/24 dev lo

hope that saves me when the next leap second occurs.

------
Monotoko
Pirate Bay has also been crashed by this: "TPB crashed just after midnight
June 30th GMT (5.5 hrs ago) The crash appears to have been caused by the leap
second that was issued at midnight."

<https://forum.suprbay.org/showthread.php?tid=125071>

------
bifrost
No burps from my BSD boxes either, although they're all in UTC so the leap
second hasn't happened for them yet.

~~~
MrUnderhill
The leap second is added at the same point in time regardless of the timezone
your server is configured to use. So if you're GMT+3, the leap second will be
inserted at 03:00 local time.

From the answer: "The reason this is occurring before the leap second is
actually scheduled to occur is that ntpd lets the kernel handle the leap
second at midnight, but needs to alert the kernel to insert the leap second
before midnight. ntpd therefore calls adjtimex sometime during the day of the
leap second, at which point this bug is triggered."

------
mkr-hn
Is this implementation-specific, or could the Windows equivalent to ntp cause
the same problem?

~~~
mjschultz
Implementation specific. It looks like it is a bug in the Linux kernel with
how it adjusts the time. It is possible that Windows, OS X, and other BSDs
will be affected by a similar bug, but that would be coincidental as the bug
is not due to ntpd but rather how the kernel handles a request that ntpd
generates.

More specifically, there is a condition in which the kernel tries to insert a
leap second and, in doing so, attempts to acquire the same lock twice causing
the spinlock lockup and (effectively) halting the kernel.
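
As a userspace analogy only (this is not the kernel code path), the same pattern can be shown in Python with a non-reentrant lock. The function names `outer` and `inner` are invented for the example, and the second acquisition is made non-blocking so the sketch reports the problem instead of hanging.

```python
import threading

# threading.Lock is non-reentrant: the holder cannot take it a second time.
lock = threading.Lock()


def inner():
    """Try to take the lock that the caller already holds."""
    got_it = lock.acquire(blocking=False)
    if got_it:
        lock.release()
    return got_it


def outer():
    """Hold the lock, then call code that wants the same lock."""
    with lock:
        return inner()


print(outer())  # False: a blocking acquire here would wait forever
```

A spinlock in the same situation keeps spinning rather than returning, which matches the lockup described above.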

------
ernestipark
My AWS EC2 instances got spun up to 100% cpu and have been like that for a
day. Basically saw a step function from 0 to 100 in the CPU graph. Just had to
reboot them.

------
x3c
Hey, I'm running Ubuntu 12.04. Could someone guide me through what I can do
to detect/prevent this from crippling my server? Thanks.

~~~
klodolph
Read the linked article.

> The work-around is to just turn off ntpd. If ntpd already issued the
> adjtimex(2) call, you may need to disable ntpd and reboot to be 100% safe.

------
icefox
Oddly netflix went down for me at 12:01 last night...

I assumed some cronjob or something similar was to blame.

~~~
ceol
Netflix had outages due to a huge storm on the east coast of the US. That was
probably the cause.

~~~
icefox
Probably...

------
drivebyacct2
My Ubuntu servers seem unaffected thus far.

~~~
rarrrrrr
Unfortunately I can confirm that Ubuntu 10.04 is vulnerable. We're proceeding
with the fixtime.pl workaround.

~~~
boyter
I can confirm it too, but didn't catch it in time. A reboot however and
everything is back to normal.

~~~
henrikschroder
We didn't catch it in time either. It was oh so much fun to wake up to our
service not working at all, all java and mysqld processes spinning like crazy,
and having to reboot all servers. :-/

------
sohn5
That wouldn't happen if servers were Macs

~~~
TazeTSchnitzel
Mac OS X is a BSD variant, so there's every chance.

~~~
daeken
How does that follow? OS X runs an odd hybrid kernel (XNU) which is Mach and
parts of BSD, but... this is a Linux kernel bug. There's an effectively zero
chance of this impacting anything but Linux.

~~~
TazeTSchnitzel
The kernel is not the only OS component relying on time that may have not
considered this.

~~~
chc
This is evidently a kernel bug. The fact that both operating systems rely on
time isn't particularly relevant. Could there be time bugs in OS X? Certainly.
But it wouldn't be this one. Windows relies on time too, so I don't see why
you bring up the fact that OS X is a BSD variant.

------
aidanbrandt
Read that as "high rates of cash."

