I had a similar experience with Node, which took months to track down.
I noticed something was blocking our event loop for between 200ms and 2 seconds at a time. I assumed it was GC and moved everything off-heap, but the issue remained. It turned out Node's async spawn() is not actually async: it blocks while it copies the page table. For processes with a large RSS, this adds up.
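For anyone who wants to check this on their own setup, here is a rough sketch of how I'd measure it. It times only the synchronous part of the spawn() call, i.e. how long the event loop is held before the call returns. The helper names (timeSpawnCall, allocateBallast) are made up for illustration, and how much the ballast actually matters will depend on your OS and Node version, so treat the numbers as indicative only.

```javascript
const { spawn } = require('node:child_process');
const { performance } = require('node:perf_hooks');

// Time only the synchronous part of spawn(). The call returns once the
// child process has been created, long before the child exits, so the
// measured interval is time during which the event loop could not run.
function timeSpawnCall(command, args) {
  const start = performance.now();
  const child = spawn(command, args, { stdio: 'ignore' });
  const blockedMs = performance.now() - start;
  return { child, blockedMs };
}

// Hypothetical helper: allocate and touch off-heap memory so the parent
// process has a larger RSS. Keep a reference to the returned array,
// otherwise the buffers can be garbage collected.
function allocateBallast(megabytes) {
  const chunks = [];
  for (let i = 0; i < megabytes; i++) {
    // Filling with a non-zero byte forces the pages to be written.
    chunks.push(Buffer.alloc(1024 * 1024, 1));
  }
  return chunks;
}
```

To use it, call timeSpawnCall(process.execPath, ['-e', '']) once as a baseline, then hold on to the result of allocateBallast(2048) and call it again. If the second blockedMs is much larger, you are probably seeing the same thing we did.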
It reminds me of a story that came out a few years ago, of how an OCaml developer tracked a thorny issue all the way down to a processor bug [0]. It took them 5 months - and a lucky break.
If the author is here, have you tried the new ZGC? I've tried it experimentally with a couple of our apps and verified that pause times are indeed 10ms or less in every case.
I also heard Twitter is a big adopter of Graal, but apparently not everywhere. How is that going? I'm deeply upset that Oracle is segmenting the CE and EE versions by performance (to the point where I think a fork is likely), and I'm wondering what your experience with it has been.
Nice catch. Demonstrates how important a holistic view of the whole stack can be for debugging. How many developers even know how I/O works with the kernel? How the JVM implements GC?
> I noticed something was blocking our event loop for between 200ms and 2 seconds at a time. I assumed it was GC and moved everything off-heap, but the issue remained. It turned out Node's async spawn() is not actually async: it blocks while it copies the page table. For processes with a large RSS, this adds up.
https://github.com/nodejs/node/issues/14917