The two main tricks are: it preprocesses all of the DWARF info at startup for faster lookups, and it dynamically patches the return addresses of functions on the stack, injecting the address of its own trampoline, which lets it avoid walking the whole stack every time it needs to dump a backtrace. For example, if you're running a function nested 100 stack frames deep and that function calls malloc 100 times, then Bytehound will only go through ~300 stack frames in total (~100 frames for the first call, then only ~2 frames for each successive call, if my math is right), while other similar tools will go through 10000 stack frames (going through all ~100 frames to the very bottom for every call).
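To sanity-check that arithmetic, here's a toy model in Python. It is not Bytehound's code, just a sketch of the counting under some assumptions of mine: the 100-frame depth includes the allocator's own frame, that frame is fresh on every call, and the walk stops at the first frame whose return address was already patched. The function name `count_frame_visits` and the `patched` list are made up for illustration.

```python
def count_frame_visits(depth=100, calls=100):
    """Toy model: frames visited by a naive vs. a caching unwinder.

    Assumption: `depth` counts the allocator frame too, so there are
    `depth - 1` long-lived caller frames underneath it.
    """
    # One flag per long-lived frame; True means "return address already
    # patched with the trampoline, so everything below it is cached".
    # The last element is the innermost caller.
    patched = [False] * (depth - 1)

    naive = 0
    cached = 0
    for _ in range(calls):
        # Naive unwinder: walks every frame down to the bottom each time.
        naive += depth

        # Caching unwinder: the allocator frame is new on each call,
        # so it always has to be visited.
        cached += 1
        # Then walk outward from the innermost caller frame.
        for i in range(len(patched) - 1, -1, -1):
            cached += 1
            if patched[i]:
                break  # rest of the backtrace is already known
            patched[i] = True  # patch it so the next walk can stop here

    return naive, cached


naive, cached = count_frame_visits()
print(naive, cached)
```

With the defaults the cached walk costs 100 visits on the first call and 2 on each of the remaining 99, so the total lands just under 300 against 10000 for the naive walk.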
Dynamic patching of return addresses is a very cool trick. I don't think I've seen this before. Have you run into any situations where this crashes programs or otherwise interferes with their execution?
If the program's already doing weird stuff with the stack/control flow/etc., yes, but that should be relatively rare, and for the majority of programs it should work fine.
It should support C++ exceptions. The trampolines have exception landing pads included to catch and rethrow any exceptions which are thrown through them.
For performance profiling I find that `perf`-like sampling profiling works well enough to find the hot spots, and then Valgrind's Callgrind is great for micro-optimizing the hot-spot code at the assembly level.
Of course, it would be cool to have a unified memory + performance analysis tool like this, but I don't think I can justify the time investment to write one in my spare time.
Yeah, I'm really happy that Gimli exists, considering the absolute insanity/complexity pit of DWARF.
For what it’s worth, Valgrind completely fails to run on the Glommio runtime (something about it causes some threading code on startup to deadlock), so I’ve been looking for an alternate profiler that can give me better insights than perf. Also a profiler that can give me deeper insights without all the overhead of Valgrind would be sweet.
I'm not sure about this implementation, but the Parca implementation only needs the .eh_frame section of the binary (which is part of, but not all of, "DWARF"), which still exists even in stripped binaries.
However, you then still need debug symbols of some kind to convert those addresses to names.
Yes, it should also work without any debugging info. You'll still need unwinding tables though (used for handling exceptions in C++/panics in Rust/etc.), which are technically DWARF too (except on 32-bit ARM, which is special).