This report by Gil Tene (their CTO according to Wikipedia) lends more support to that theory.
A shame, since those would allow some really sweet things for garbage collectors, which evidently isn't a goal.
Does anyone have a link to that discussion? I get the sense there is more to that discussion than the author let on in his brief response.
I googled for "kernel mailing list managed runtime initiative" and found this: https://lwn.net/Articles/392307/ Apparently it was never even proposed on the kernel mailing list! Then again, the code as written was apparently never meant to be merged upstream, but was a sort of PoC that might make a better starting point for eventual integration than starting fresh... There are lots of interesting details, and as I read through the comments, things are coming back; now I remember reading about this in 2010/2011...
In order to effectively sell the JVM, they need to help you understand and reduce the non-GC pauses so you can realize the full value of your investment.
(full disclosure: happy Zing customer)
Yeah, well, it's a DoS triggered by pretty particular conditions. While it's bad, starting a panic over it reminds me of the story of the boy who cried wolf.
But then again, feel free to slap a fancy site and a codename on it; all the cool kids do it anyway :)
Also, interestingly, the commit references this deadlocking on ARM64, so I'm guessing this probably broke on non-x86 architectures in strange ways too, unless I'm really missing something...
From the mailing list:
> In our case it's reproducing on 10 core haswells only which are different than 8 cores (dual vs single ring bus and more cache coherency options). It's probably a probability matter. [...]
> Pinning the JVM to a single cpu reduces the probability of occurrence drastically (from a few times a day to weeks) so I'm guessing latency distributions may have an effect.
Should we also alert the President? Maybe OP was only talking about the mailing list he posted on, but we're missing that context here on HN? The only affected systems in production seem to be RHEL 6.6 on Haswell.
Ubuntu 14.04/Debian 8: have had the fix for a long time
Ubuntu 12.04/Debian 7: were never affected. Newer enablement-stack kernels for Ubuntu have the same fix.
RHEL 7: OP only talks about 6.6, so I assume it either never had the regression backported or already has the fix.
SLES: don't know, don't care.
Also, I don't believe your queries are a sufficient check at all.
You can clearly find the missing default case in http://kernel.ubuntu.com/git/ubuntu/ubuntu-precise.git/tree/... . So I guess that means Ubuntu precise is/was also affected?
Same in your other link https://git.kernel.org/cgit/linux/kernel/git/stable/linux-st... .
Please check your facts next time.
> Not true, it is architecture independent (commit also mentions Android)
I saw it (and the arm64 comment in this thread) but didn't include them because I don't think it would be a production-serious issue there. Thanks for the clarification.
> Please check your facts next time.
Thanks to your post I've corrected the links, but there's no need for such an aggressive tone, is there? Maybe try to be more polite in your next rebuttal.
The patch also replaces atomic_inc() with futex_get_mm():
+ futex_get_mm(key); /* implies MB (B) */
> And your links are using atomic_inc().. What that means with regards to this bug? I don't know.
It means they are simply old versions that were never affected by the bug, if we believe the fix's commit message and OP.
A 30-line commit message for 2 lines of code. Kernel developers sure are disciplined.
Also, does anyone know if there is some RHEL errata about this bug?
edit: I just looked at the patches Red Hat applied for the RHEL 5.11 kernel (2.6.18-398), and this bug was also introduced in the RHEL 5.11 series (not sure if a subsequent kernel version fixes it)
I'm guessing that checking /proc/cpuinfo for a Xeon v3 model, or looking for the flags 'hle|rtm|tsx', would work, but something more definitive would help with mass auditing.
EDIT: Turns out 14.04.2 LTS can also optionally use 3.16: http://www.omgubuntu.co.uk/2015/02/ubuntu-14-04-2-lts-releas...
But I didn't check whether there are updated kernels for those versions in Ubuntu... At least for RHEL 5.11, it looks to me like the -404 kernel is the latest...
Code dealing with memory barriers on SMP systems is non-trivial to write, review, and test. Everything is hardware-specific, timing-dependent, and non-deterministic. Simple unit tests are useless for this kind of task; it needs stress testing on different hardware under a variety of workloads.
Great book, highly recommend it. It won the Pulitzer.
Unpopular features on less common architectures are frequently broken for large stretches of time, and go unnoticed until someone complains. Open source really exemplifies the squeaky wheel getting the grease, which is kind of sad.
Shops where Linux is popular undoubtedly have their own internal private test suites, especially for the less mainstream features on bleeding-edge kernels (e.g. S390 arch support or InfiniBand)
It would be hard to get any sort of good coverage with unit tests, too, but that shouldn't be a reason to avoid trying.
Could a large but spotty unit test suite inspire false confidence, leading to less care when signing off on changes and thus a decrease in overall quality?
Of course, kernel devs were already confident enough to merge breaking code without the added confidence of a partial unit test suite in place.
While there are many advantages to unit testing, kernels are typically tested from userspace.
Some tests are laughably simple. I panicked the OS X kernel a while back with a shell script that repeatedly loaded and unloaded my kernel extension. It only took a minute or two to hit the panic.
Apple fixed the panic but never told me how they screwed up.
EDIT: Of significant concern is how the kernel deals with the electrical circuitry. While the kernel is implemented in software, the reason we even have kernels is so that end-user code doesn't have to understand much about physics.
AMCC - since acquired by LSI - sold some high-end RAID Host Bus Adapters. We had quite a significant problem with motherboard support. We had to test our cards on a whole bunch of different motherboards as well as PCI expansion chassis.
One might protest that "PCI is a standard!" but what we have is what we can buy at Microcenter. :-/
While not all of the kernel is concerned with physical hardware, much of it is. It's not really possible to write unit tests for the parts that have to deal with edge cases in electrical circuitry.
Much more important is what the code in that new default block is doing: it's a memory fence.
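From memory (so treat the details as approximate, not authoritative), the switch in get_futex_key_refs() after the fix looks roughly like this; for a private futex neither FUT_OFF_INODE nor FUT_OFF_MMSHARED is set, so before the fix the default path had no barrier at all:

```c
switch (key->both.offset & (FUT_OFF_INODE | FUT_OFF_MMSHARED)) {
case FUT_OFF_INODE:
	ihold(key->shared.inode);	/* implies MB (B) */
	break;
case FUT_OFF_MMSHARED:
	futex_get_mm(key);		/* implies MB (B) */
	break;
default:
	smp_mb();			/* explicit MB (B): previously,
					   private futexes fell through
					   here with no barrier */
}
```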