Very often people look at icache misses instead of something more precise when investigating perf effects of code size/layout, etc. That more precise thing is frontend stalls: you only care about misses when they cause stalls; otherwise they are overlapped with actual work being done by the execution units.
You can measure frontend stalls on many recent Intel chips via IDQ_UOPS_NOT_DELIVERED.CORE (see https://perfmon-events.intel.com/). Arm's Neoverse N1 has STALL_FRONTEND; see https://developer.arm.com/documentation/PJDOC-466751330-5476...
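For reference, a minimal sketch of reading that counter on Linux via perf_event_open(2); the raw 0x019c encoding (event 0x9C, umask 0x01) matches Skylake-era documentation and should be checked against the perfmon tables for your part:

```c
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size = sizeof attr;
    attr.type = PERF_TYPE_RAW;
    attr.config = 0x019c;  /* IDQ_UOPS_NOT_DELIVERED.CORE on Skylake-era
                            * parts: umask 0x01 << 8 | event 0x9C */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    /* Count on the calling thread, on any CPU. */
    int fd = (int)syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... run the code under test here ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    if (read(fd, &count, sizeof count) != sizeof count) return 1;
    printf("frontend slots with no uops delivered: %llu\n",
           (unsigned long long)count);
    close(fd);
    return 0;
}
```

The same thing is spelled `perf stat -e idq_uops_not_delivered.core` if your perf build knows the symbolic name.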
I agree with you that one can very often get distracted by single events; however, knowing that you are frontend/backend bound isn't all that much more helpful either.
For frontend you can guess that PGO, BOLT, or huge pages will probably help, but it's still a blind guess without knowing what to look at next.
Intel's TMA is the only really helpful thing here. It's a bit sad that AMD and ARM don't provide a way to calculate something TMA-like for their chips.
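For what it's worth, the top of the TMA hierarchy is simple enough to compute by hand from the counters mentioned above. A minimal sketch of the level-1 frontend-bound fraction, assuming a 4-wide (pre-Ice Lake) Intel core:

```c
#include <stdint.h>

/* TMA level 1: fraction of issue slots in which the frontend failed to
 * deliver uops. Assumes a 4-wide machine (SLOTS = 4 * unhalted cycles);
 * Ice Lake and later are wider, so the factor changes. */
double frontend_bound(uint64_t idq_uops_not_delivered_core,
                      uint64_t cpu_clk_unhalted_thread) {
    double slots = 4.0 * (double)cpu_clk_unhalted_thread;
    return (double)idq_uops_not_delivered_core / slots;
}
```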
Because of branch prediction and deep instruction windows, the instruction prefetcher is significantly more effective than the data prefetcher.
There are of course second-order effects of code bloat: L2, for example, is typically shared between code and data, and wasting more of it on code can have negative effects.
The point being that the lower-case i is an Apple-ism, while the correct casing is upper-case I. (I've also seen all lower case when used informally, e.g. icache, iram, etc.)
I have trouble distinguishing between uppercase i and lowercase L in my system fonts, and am too lazy to change them, so I appreciate the author using iCache for readability.
One optimization I've never seen is adjusting the stack pointer within the function, other than at the beginning and end. Because of this, inlining can significantly blow up stack usage.
What I would like to see is something like the generated code of `call_foo1` substituted verbatim into the generated code of `baz1` in place of the function call. That way, at the point of calling `bar`, much less stack space is allocated, minimizing stack usage.
But maybe this would pessimize other things, or for some weird reason is actually incorrect.
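To make the concern concrete, a hypothetical sketch (the foo1/baz1 names loosely echo the article's example; `consume` is just an opaque external so the buffer stays live in the generated code):

```c
void bar(void);
void consume(void *p, unsigned long n);   /* opaque externals for codegen */

static void foo1(void) {
    char scratch[4096];                   /* large, short-lived local */
    consume(scratch, sizeof scratch);
}

void baz1(void) {
    foo1();   /* once inlined, scratch's 4 KiB typically joins baz1's frame */
    bar();    /* ...and stays reserved here, where it is no longer live,
               * because compilers allocate the whole frame in the prologue */
}
```

Compiled at -O2, gcc and clang generally reserve the combined maximum frame up front, which is exactly the blow-up being described; releasing it mid-function would need extra stack-pointer adjustments and unwind info that can describe them.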
gcc at least deallocates the stack before tail-calling `baz`; I don't know if that is "shrink wrapping" or just plain TCO.
You can get into weird cases with DWARF-based unwinders: two paths through a function with different stack depths can make it impossible to reliably unwind.
LLVM has bugs in this regard with calls to variadic functions, since after a certain number of arguments have been passed in registers, you start pushing parameters onto the stack (ABI-dependent).
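A minimal self-contained instance of that register/stack split, assuming the SysV x86_64 calling convention (other ABIs draw the line elsewhere):

```c
#include <stdio.h>

int main(void) {
    int a = 1, b = 2, c = 3, d = 4, e = 5, f = 6, g = 7, h = 8;
    /* SysV x86_64 passes the format string plus the first five integer
     * variadic args in rdi, rsi, rdx, rcx, r8, r9; g and h spill to
     * caller-created stack slots, so this call changes the stack depth
     * around the call site. */
    printf("%d %d %d %d %d %d %d %d\n", a, b, c, d, e, f, g, h);
    return 0;
}
```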
Is the result different for this benchmark on an ARMv8 CPU with its ~31 general-purpose registers vs an x86_64 CPU with its ~15? For example, M1 vs Skylake?