Finding the Four Month Bug: A Debugging Story (2015) (evanjones.ca)
39 points by dodders on July 3, 2019 | hide | past | favorite | 6 comments



I had a similar experience with Node, which took months to track down.

I noticed something was blocking our event loop for between 200 ms and 2 seconds at a time. I assumed it was GC and optimized everything to be off-heap, but the issue remained. It turned out that Node's async spawn() is not fully async: it blocks while it copies the page table. For processes with a large RSS, this adds up.

https://github.com/nodejs/node/issues/14917
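
For anyone who wants to check this on their own service, here is a rough sketch (mine, not taken from the linked issue) that times how long the spawn() call itself holds the calling thread. The helper names `growRss` and `measureSpawnBlockMs` are made up for illustration. Whether the number actually grows with RSS will depend on your Node version and platform, so treat it as a measurement harness rather than a reproduction.

```typescript
import { spawn } from "node:child_process";
import { performance } from "node:perf_hooks";

// Hypothetical helper: keep buffers in a module-level array so they stay resident.
const ballast: Buffer[] = [];

function growRss(bytes: number): void {
  // Fill with a non-zero byte so every page is actually written to,
  // not just reserved.
  ballast.push(Buffer.alloc(bytes, 1));
}

// Hypothetical helper: returns how long the spawn() call blocks the caller, in ms.
// The child's own runtime is not included, since spawn() returns before it exits.
function measureSpawnBlockMs(): number {
  const start = performance.now();
  // Launch the current Node binary with an empty script: a cheap, portable child.
  const child = spawn(process.execPath, ["-e", ""], { stdio: "ignore" });
  const elapsedMs = performance.now() - start;

  child.on("error", () => {}); // avoid an unhandled 'error' event if the launch fails
  child.unref(); // do not keep the event loop alive waiting for the child
  return elapsedMs;
}
```

To compare, call `measureSpawnBlockMs()` once, then `growRss(512 * 1024 * 1024)` (or whatever matches your production RSS), then call `measureSpawnBlockMs()` again and compare the two values.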


It reminds me of a story that came out a few years ago, of how an OCaml developer tracked a thorny issue all the way down to a processor bug [0]. It took them 5 months - and a lucky break.

[0] I found a bug in Intel Skylake processors (https://news.ycombinator.com/item?id=14686277)


In addition to the book mentioned, "Debugging" by David Agans, there is also "Why Programs Fail" by Andreas Zeller:

http://www.whyprogramsfail.com/

The author of the latter book is a professor in Saarbrücken and a former maintainer of GNU DDD.


If the author is here, have you tried the new ZGC collector? I've tried it experimentally with a couple of our apps and verified that pause times are indeed 10ms or less in every case.

I also heard Twitter is a big adopter of Graal, but apparently not everywhere. How is that going? I'm deeply upset that Oracle is segmenting the CE and EE versions by performance (to the point where I think a fork is likely), and I'm wondering what your experience with it has been.


Nice catch. Demonstrates how important a holistic view of the whole stack can be for debugging. How many developers even know how I/O works with the kernel? How the JVM implements GC?


Great blog post!



