Finding the Four Month Bug: A Debugging Story (2015)

jorangreef · on July 3, 2019

I had a similar experience with Node, which took months to track down.

I noticed something was blocking our event loop for anywhere from 200 ms to 2 seconds at a time. I assumed it was GC and moved everything off-heap, but the issue remained. It turned out Node's async spawn() is not actually async: it blocks while it copies the page table. For processes with a large RSS, this adds up.

https://github.com/nodejs/node/issues/14917
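
A rough way to check this on your own machine is to time how long the spawn() call itself takes to return, since that part runs on the main thread. This is only a sketch: the helper name timeSpawnMs is made up for illustration, and the numbers will depend on your OS, Node version, and process size.

```javascript
const { spawn } = require('node:child_process');

// Hypothetical helper: measures the wall-clock time spent inside the
// spawn() call itself, i.e. the part that runs synchronously on the
// event loop thread before control returns to the caller.
function timeSpawnMs(command, args) {
  const start = process.hrtime.bigint();
  const child = spawn(command, args, { stdio: 'ignore' });
  const elapsedNs = process.hrtime.bigint() - start;

  // Swallow spawn errors (e.g. command not found) so they do not crash
  // the process; this sketch only cares about the timing.
  child.on('error', () => {});

  return Number(elapsedNs) / 1e6; // nanoseconds -> milliseconds
}

// Example: spawn a trivial Node child and report how long the call blocked.
const blockedMs = timeSpawnMs(process.execPath, ['-e', '']);
console.log(`spawn() returned after ${blockedMs.toFixed(2)} ms`);
```

Running this once in a small process and once after allocating a lot of memory should show whether the time grows with RSS on your setup.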

isolli · on July 3, 2019

It reminds me of a story that came out a few years ago, of how an OCaml developer tracked a thorny issue all the way down to a processor bug [0]. It took them 5 months, and a lucky break.

[0] I found a bug in Intel Skylake processors (https://news.ycombinator.com/item?id=14686277)

hwj · on July 3, 2019

In addition to the book mentioned, "Debugging" by David Agans, there is also "Why Programs Fail" by Andreas Zeller:

http://www.whyprogramsfail.com/

The author of the latter book is a professor in Saarbrücken and a former maintainer of GNU DDD.

nullwasamistake · on July 3, 2019

If the author is here, have you tried the new ZGC collector? I've tried it experimentally with a couple of our apps and verified that pause times are indeed 10 ms or less in every case.

I also heard Twitter is a big adopter of Graal, but apparently not everywhere. How is that going? I'm deeply upset that Oracle is segmenting the CE and EE versions by performance (to the point where I think a fork is likely), so I'm wondering what your experience with it has been.

choeger · on July 3, 2019

Nice catch. Demonstrates how important a holistic view of the whole stack can be for debugging. How many developers even know how I/O works with the kernel? How the JVM implements GC?

giacaglia · on July 3, 2019

Great blog post!