Critical Linux bug that leads 100% CPU (leap second) (wpkg.org)
136 points by yekmer 659 days ago | comments


pilif 659 days ago | link

I would love to see what's really causing this bug. We read so many times over the weekend to either reboot or just run that date command - but nobody is telling us what's causing the problem.

Also, seeing that other threaded applications had similar problems, I doubt this is a java issue - more likely a pthread, glibc or even kernel issue

-----

gaius 659 days ago | link

There is a good explanation here: http://serverfault.com/q/403732/58037

-----

agwa 659 days ago | link

That's predominantly about the kernel crash, not the high-CPU futex issue. One of the most maddening things about this is that there have been several different issues related to leap seconds on Linux, making it all the harder to get information.

-----

altxwally 659 days ago | link

The patch that was shared on the lkml shows some insight on what is causing the issue. https://lkml.org/lkml/2012/7/1/27

Apparently the issues might be due "to the leapsecond being added without calling clock_was_set() to notify the hrtimer subsystem of the change", a possible fix being to patch kernel/time/timekeeping.c to be leapsecond aware.
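
The quoted description can be illustrated with a toy model. This is only a sketch of my reading of it, not kernel code: `ToyClock`, `timer_offset` and `timer_fires_immediately` are hypothetical names, and the `notify` flag merely stands in for the call to clock_was_set(). It assumes the timer subsystem keeps its own cached copy of the wall-clock offset.

```python
# Toy model only: names are hypothetical and do not mirror kernel internals.

class ToyClock:
    def __init__(self, monotonic, offset):
        self.monotonic = monotonic    # seconds since boot
        self.offset = offset          # wall-clock time = monotonic + offset
        self.timer_offset = offset    # copy cached by the timer subsystem

    def realtime(self):
        return self.monotonic + self.offset

    def insert_leap_second(self, notify):
        # Inserting a leap second steps wall-clock time back by one second.
        self.offset -= 1.0
        if notify:
            # Stand-in for clock_was_set(): refresh the cached copy.
            self.timer_offset = self.offset

    def timer_fires_immediately(self, relative_timeout):
        # A caller builds an absolute wall-clock deadline from the current time...
        deadline = self.realtime() + relative_timeout
        # ...but the timer side compares it against its cached view of "now".
        timer_now = self.monotonic + self.timer_offset
        return timer_now >= deadline
```

With `notify=False` the cached view runs one second ahead, so in this model any wait shorter than a second returns at once, and a program that retries such waits in a loop would spin. With `notify=True` the same wait blocks as expected.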

-----

ajays 659 days ago | link

This seems like the best explanation I've found so far: https://lkml.org/lkml/2012/7/1/203

-----

pilif 659 days ago | link

Agreed. Also it clearly accounts for the futex related load issues and it even gives nice and readable C code to see the problem happening.

This explains it for me. Thanks a lot for the pointer.

-----

xxpor 659 days ago | link

A good explanation from Reddit: http://www.reddit.com/r/programming/comments/vxmf7/time_arit...

-----

ecopoesis 659 days ago | link

Hard to call this a Java bug when many other, non-Java things are affected. It's a critical Linux bug that causes futex waits to time out early, and anything that relies on them to behave incorrectly.

https://lkml.org/lkml/2012/7/1/11

-----

ww520 659 days ago | link

It's probably that Java heavily utilizes the multi-thread support and the kernel bug is showing up as a Java bug. It just means Java really exercises the system's concurrent support.

-----

tommi 659 days ago | link

ecopoesis, you are not the only one saying that it's a linux bug instead of a java bug even though the link title says "Critical Linux bug that leads 100% CPU (leap second)".

Did the link title change from a Java title, like the article, to a Linux title to match the actual root cause?

-----

davidw 659 days ago | link

> Did the link title change from a Java title, like the article, to a Linux title to match the actual root cause?

Yes, it did.

-----

pjmlp 659 days ago | link

This is a Linux kernel bug, not a JVM bug.

-----

mcescalante 659 days ago | link

Yeah, NTP is Linux kernel, but the JVM is what's eating the CPU after the clock leap.

-----

jbellis 659 days ago | link

no, it's the kernel livelocking in response to a call made by the jvm

-----

jhund 659 days ago | link

I saw what is likely a related issue on one of our AWS EC2 instances, where exactly at midnight UTC there was a high percentage of 'steal' CPU time in our server monitoring charts.

I wonder if this was caused by another VM on the same physical box being hit by the bug and as a result stole CPU time from our VM.

I resolved the issue by moving to a different VM (Rebooting didn't help), to get away from my greedy neighbor.

More info here: http://blog.thinrhino.net.in/cpu-steal-time
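
For anyone who wants to check steal time without a monitoring chart, here is a minimal sketch that computes it from two samples of the aggregate `cpu` line in /proc/stat. The helper names `parse_cpu_line` and `steal_percent` are made up for illustration, and the code assumes a kernel that reports at least eight counters on that line (user, nice, system, idle, iowait, irq, softirq, steal).

```python
def parse_cpu_line(line):
    """Return the integer counters from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def steal_percent(before, after):
    """Share of elapsed CPU time reported as steal between two samples."""
    old = parse_cpu_line(before)
    new = parse_cpu_line(after)
    # Counter order: user nice system idle iowait irq softirq steal [guest guest_nice]
    deltas = [b - a for a, b in zip(old, new)]
    total = sum(deltas[:8])  # guest time is already counted inside user/nice
    if total <= 0:
        return 0.0
    return 100.0 * deltas[7] / total
```

In practice the two strings would be the first line of /proc/stat read a few seconds apart; a persistently high result suggests the hypervisor is giving the CPU to someone else, though it cannot say why.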

-----

yekmer 659 days ago | link

Our company uses HBase, Elastic Search, GitBlit, SmartFox Server and Jetty, all of which have been hit by this bug. MySQL is said to be affected too: http://blog.mozilla.org/it/2012/06/30/mysql-and-the-leap-sec...

-----

davidw 659 days ago | link

Thank you for that link! I had been scratching my head about that server even though it wasn't mine to take care of (the other service I'm involved with here, that I helped plan, uses Postgres, which does not seem to have problems).

-----

j_col 659 days ago | link

So that explains why the 12 cores on my Fedora workstation were maxed-out when I came to work this morning!

-----

kzrdude 659 days ago | link

So if the leap second was handled in userspace instead of the kernel, just like a normal ntp time update, all would have been fine. Why not just do that?

-----

regularfry 659 days ago | link

The easier to type 'sudo date -s "`date`"' seemed to work for me.

-----

JVIDEL 659 days ago | link

Oh man so that was causing it!

My rig crashed all weekend because of this POS bug, I had to boot back to Windows to get anything done (oh cmd, I really didn't miss you at all you insufferable bitch...)

Any fixes?

-----

gcr 659 days ago | link

There's a fix in the article.

-----

geetee 659 days ago | link

Hey, remember that time I spent a couple hours frantically checking logs and restarting services?

-----

[deleted]
derpmeister 659 days ago | link

I hate tzdata updates with a passion, politicians should just get a grip and stop messing around with timezones. I'm all for ideas that create new jobs but this isn't one of them.

-----

streptomycin 659 days ago | link

If you come up with a way of predicting when leap seconds will be needed (hint: it's not a constant regular time interval), let us all know. Until then, there will need to be adjustments.

-----

wmf 659 days ago | link

Leap seconds aren't needed at all. I'd rather let them accumulate until there's a leap hour that can be rolled into DST (although DST may not exist that far in the future).

-----

MichaelGG 659 days ago | link

DST doesn't exist in UTC, so that's irrelevant. A one-hour UTC shift would totally, utterly, screw stuff up. But, at the current rate, a one-hour "leap" would happen in thousands of years, so maybe it's not such a bad idea after all. But I think the reason for leap seconds has to do with keeping UTC in sync with other clock systems, and that probably overrides any inconvenience to software.

-----

moe 659 days ago | link

I wonder why they don't implement the google solution on pool.ntp.org.

I.e. gradually slow/accelerate time over the course of a day, rather than stepping it hard at once.

I'd say this approach would be vastly preferable for about 100% of the systems relying on pool.ntp.org.

The remaining 0%, e.g. scientific applications that absolutely need the leap second to appear at exactly the right moment, most likely don't use pool.ntp.org anyways.

And for those who do they could create a second pool with the old behavior. Maybe call it science.ntp.org.
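
The idea of spreading the extra second over a day can be sketched as follows. This is a plain linear ramp chosen for simplicity, not a description of what Google or any NTP server actually runs. The function names, the one-day window, and the notion of a continuous `true_time` that counts the leap second are all assumptions made for the example.

```python
def smear_offset(seconds_until_leap_end, window=86400.0):
    """Fraction of the extra second already absorbed, from 0.0 to 1.0.

    seconds_until_leap_end: time remaining until the inserted leap second
    ends (hypothetical input; negative once it has passed).
    window: length of the smear in seconds (one day here, an assumption).
    """
    if seconds_until_leap_end >= window:
        return 0.0
    if seconds_until_leap_end <= 0:
        return 1.0
    return 1.0 - seconds_until_leap_end / window


def smeared_time(true_time, leap_end, window=86400.0):
    """Clock reading served to clients: true time minus the absorbed part."""
    return true_time - smear_offset(leap_end - true_time, window)
```

Clients following such a server would see a clock that runs very slightly slow for a day and never steps backwards, at the cost of being up to a second away from true UTC during the window.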

-----

Negitivefrags 659 days ago | link

The solution is not to change UTC, but just to rotate the time zones of each country every now and then. DST has proven that a country is able to change time zones twice a year. This would happen far less often.

-----

mikeash 659 days ago | link

A one-second UTC shift seems to be pretty good at totally screwing stuff up already. At least a one-hour shift would happen once every few centuries instead of once every few years.

-----

michaelt 659 days ago | link

You don't think the leap-hour will be a new millennium bug?

When you look at how much people crap their pants over the leap second - a relatively common thing - I dread to think how unprepared people would be for something 3,600 times less common.

-----

wmf 659 days ago | link

Yes, it would be similar to Y2K. Hopefully politicians could agree decades in advance so people would have plenty of time to prepare. Also, since you're only changing the tzdata and not UTC, much less would break and most of the breakage would be purely cosmetic. Right now we have several tzdata changes per year and they cause much less disruption than leap seconds.

-----

phaker 659 days ago | link

If you want something that looks like UTC minus the leap second trouble, then there is TAI. Right now TAI and UTC differ by about 30 seconds.
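
To put a number on the difference: TAI minus UTC was 34 seconds from the start of 2009 and became 35 seconds with the leap second at the end of June 2012. A minimal conversion sketch, with a deliberately tiny table (`TAI_MINUS_UTC` and `utc_to_tai` are names invented for this example):

```python
from datetime import datetime, timedelta

# (UTC instant from which the offset applies, TAI - UTC in whole seconds).
# Only two entries around the 2012 leap second are listed; a real table
# starts in 1972 and continues past these dates.
TAI_MINUS_UTC = [
    (datetime(2009, 1, 1), 34),
    (datetime(2012, 7, 1), 35),
]


def utc_to_tai(utc):
    """Convert a naive UTC datetime to TAI using the small table above."""
    offset = None
    for start, seconds in TAI_MINUS_UTC:
        if utc >= start:
            offset = seconds
    if offset is None:
        raise ValueError("date is earlier than the table covers")
    return utc + timedelta(seconds=offset)
```

Because TAI never inserts or removes seconds, the offset only ever changes in the UTC direction, which is why the table has to be kept up to date by hand.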

-----

wmf 659 days ago | link

I can't switch to TAI because then I'd be 30 seconds off from everybody else. And everybody can't switch to TAI because that disruption would be even larger than what we saw this weekend. IMO the solution is to leave the leap seconds that were already added but not add any more.

-----

e40 659 days ago | link

On Sunday I noticed that Gerrit (code review, written in Java) was chewing through CPU on one of our servers. I just applied this fix and it appears to have settled down.

-----

freestyler 659 days ago | link

There is a list of applications affected by this kernel bug http://blog.windfluechter.net/content/blog/2012/07/01/1481-1...

-----

abc_lisper 659 days ago | link

Does this happen on android too?

-----

coldskull 659 days ago | link

well, our hadoop cluster went bonkers because of this bug....luckily it was on stage...not production!

-----

danielhlockard 659 days ago | link

Yeah, I ended up rebooting our production hadoop cluster, it all came back up fine, and we don't have too many people using it yet.

-----

agentgt 659 days ago | link

What a PITA

-----



