> - 5% of system-wide cycles spent in function prologues/epilogues? That is wild, it can't be right.
TBH I wouldn't be surprised on x86. There are so many registers to be pushed and popped due to the ABI, so every time I profile stuff I get depressed… Aarch64 seems to be better, the prologues are generally shorter when I look at those. (There's probably a reason why Intel APX introduces push2/pop2 instructions.)
This sounds to me more like an inlining problem than an ABI problem. If the calls take as much time as the actual work, perhaps you just need a better language that doesn’t arbitrarily prevent inlining across compilation boundaries (e.g. basically any modern language that isn’t in the C/C++ family, before LTO).
I see this in LTO/PGO binaries as well. If a function is 20 instructions long, it's not like you can inline it uncritically, yet a five-cycle prologue and a five-cycle epilogue will hurt. (Also, recursive functions etc.)
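Back-of-envelope, to show why that hurts: assume the 20-instruction body retires at about 2 instructions per cycle (an assumption, not a measurement), so roughly 10 cycles. Then the 10 cycles of prologue plus epilogue are half of every call. A throwaway C helper for playing with the numbers (`frame_overhead` is a made-up name, and the model ignores any overlap an out-of-order core might get between the pushes/pops and the body):

```c
#include <assert.h>

/* Fraction of one call's cycles spent in prologue + epilogue.
   All inputs are hypothetical cycle counts, not profiler output. */
double frame_overhead(double prologue, double epilogue, double body)
{
    /* Guard against nonsensical inputs. */
    assert(prologue >= 0 && epilogue >= 0 && body > 0);
    return (prologue + epilogue) / (prologue + epilogue + body);
}
```

`frame_overhead(5, 5, 10)` gives 0.5 under those assumptions. With the same 5+5 cycles, the body has to reach about 190 cycles before the frame cost drops to 5% of the call.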