
Finding the Four Month Bug: A Debugging Story (2015) - dodders
https://www.evanjones.ca/jvm-mmap-pause-finding.html
======
jorangreef
I had a similar experience with Node, which took months to track down.

I noticed something was blocking our event loop for between 200ms and 2 seconds
at a time. I assumed it was GC and optimized everything off-heap but the issue
remained. It turned out Node's async spawn() is not async, and blocks while it
copies the page table. For processes with large RSS, this adds up.

[https://github.com/nodejs/node/issues/14917](https://github.com/nodejs/node/issues/14917)
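
A minimal sketch of how one might check for this, assuming a stock Node install. The helper name `measureSpawnBlockMs` is hypothetical; it only times the synchronous part of the `spawn()` call, which is where the event loop would be held up. Running it in a process with a small RSS and again in one with a large RSS should show whether the time grows with memory.

```typescript
import { spawn } from "node:child_process";
import { performance } from "node:perf_hooks";

// Hypothetical helper: returns how many milliseconds the spawn() call itself
// held the calling thread. Nothing else on the event loop can run during
// that window, so this approximates the stall described above.
function measureSpawnBlockMs(command: string, args: string[]): number {
  const start = performance.now();
  const child = spawn(command, args, { stdio: "ignore" });
  const blockedMs = performance.now() - start;
  // Swallow launch errors (e.g. command not found) so the sketch does not crash.
  child.on("error", () => {});
  return blockedMs;
}

// Spawn a trivial child (the current Node binary running an empty script).
const ms = measureSpawnBlockMs(process.execPath, ["-e", ""]);
console.log(`spawn() held the event loop for ${ms.toFixed(1)} ms`);
```
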

------
isolli
It reminds me of a story that came out a few years ago, of how an OCaml
developer tracked a thorny issue all the way down to a processor bug [0]. It
took them 5 months - and a lucky break.

[0] I found a bug in Intel Skylake processors
([https://news.ycombinator.com/item?id=14686277](https://news.ycombinator.com/item?id=14686277))

------
hwj
In addition to the mentioned book "Debugging" by David Agans, there is also
"Why Programs Fail" by Andreas Zeller:

[http://www.whyprogramsfail.com/](http://www.whyprogramsfail.com/)

The author of the latter book is a professor from Saarbrücken and a former
maintainer of GNU DDD.

------
nullwasamistake
If the author is here, have you tried the new ZGC collector? I've tried it
experimentally and verified that pause times are indeed 10ms or less in every
case with a couple of our apps.

I also heard Twitter is a big adopter of Graal, but apparently not everywhere.
How is that going? I'm deeply upset that Oracle is segmenting the CE and EE
versions by performance (to the point where I think a fork is likely), so I'm
wondering what your experience with it has been.

------
choeger
Nice catch. Demonstrates how important a holistic view of the whole stack can
be for debugging. How many developers even know how I/O works with the kernel?
How the JVM implements GC?

------
giacaglia
Great blog post!

