It’s Not Always iCache (matklad.github.io)
66 points by tjalfi on July 12, 2021 | 19 comments



Very often people look at icache misses when investigating perf effects of code size/layout, when there is something more precise to measure: frontend stalls. You only care about misses when they cause stalls; otherwise they are overlapped with actual work being done by the execution units.

You can measure frontend stalls on many recent Intel chips via

IDQ_UOPS_NOT_DELIVERED.CORE

https://perfmon-events.intel.com/
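On Linux, `perf` exposes that event by name on supported Intel parts (a sketch; exact event availability varies by microarchitecture, and `./workload` is a placeholder):

```shell
# Count slots where the decode queue delivered no uops to the backend
# even though the backend could have accepted them, alongside the
# baseline cycle and instruction counts.
perf stat -e idq_uops_not_delivered.core,cycles,instructions ./workload
```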

Neoverse N1 from Arm has STALL_FRONTEND; see

https://developer.arm.com/documentation/PJDOC-466751330-5476...


I agree with you that one can very often get distracted by single events; however, knowing that you are frontend/backend bound isn't all that much more helpful either.

For the frontend you can guess that PGO, BOLT, or huge pages might help, but it's still a blind guess without knowing what to look at next.

Intel's TMA is the only really helpful thing here. It's a bit sad that AMD and Arm don't provide a way to calculate something TMA-like yourself.
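Linux `perf` can at least compute the TMA level-1 breakdown on recent Intel CPUs (a sketch; system-wide mode typically requires root, and older kernels/CPUs may not support the flag):

```shell
# Prints the four top-level TMA buckets for the whole system over 5s:
# retiring, bad speculation, frontend bound, backend bound.
perf stat --topdown -a -- sleep 5
```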


Interesting article.

Because of branch prediction and deep instruction windows, the instruction prefetcher is significantly more effective than the data prefetcher.

There are of course second order effects of code bloat: L2 for example is typically shared with data and wasting more of it for code can have negative effects.


Off topic, but shouldn't it be ICache or Icache, not iCache? The lowercase-i spelling seems like an Apple-ism picked up from too much exposure to Apple product names.

I could only find references to ICache or ICACHE via quick search.


I$ is the u̶s̶u̶a̶l̶ ̶c̶u̶t̶e̶ ̶n̶a̶m̶e canonical shorthand.


the point being that lower-case i is an Apple-ism, while the correct casing is upper-case I (I've also seen all lower case when used informally, e.g. icache, iram, etc.)


I have trouble distinguishing between uppercase i vs lowercase L in my system fonts, and am too lazy to change them, so I appreciate the author using iCache for readability.


Not really a legitimate reason to standardize such a weak idea though.


(submitter)

This is a follow up to Inline in Rust[0] which was submitted a couple times earlier this week.

[0] https://matklad.github.io//2021/07/09/inline-in-rust.html


Could a difference in alignment of the hot loop also have an effect here?


One optimization I've never seen is adjusting the stack pointer within the function body, rather than only in the prologue and epilogue. Because of this, inlining can significantly blow up stack usage.

https://godbolt.org/z/Gaa4MEMnK


The optimization exists in MSVC, LLVM, and GCC. It's called "shrink wrapping". MSVC may do it more aggressively when profile information is available.

See https://github.com/gcc-mirror/gcc/blob/master/gcc/shrink-wra... or https://llvm.org/doxygen/ShrinkWrap_8cpp_source.html.
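As a sketch of the pattern shrink wrapping targets (hypothetical function; whether a given compiler actually elides the frame depends on target and flags): a function whose fast path touches no locals can return before any frame is allocated.

```c
#include <string.h>

/* The fast path returns before any locals are live, so a
 * shrink-wrapping compiler can emit it with no frame setup at all;
 * the 256-byte buffer is only allocated on the slow path. */
static int checksum(const char *s) {
    if (s == 0)
        return -1;              /* no stack frame needed on this path */

    char buf[256];              /* frame set up only from here on */
    size_t n = strlen(s);
    if (n >= sizeof buf)
        n = sizeof buf - 1;
    memcpy(buf, s, n);

    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += (unsigned char)buf[i];
    return sum;
}
```

Without shrink wrapping, the prologue allocates the whole frame unconditionally, so even the `checksum(NULL)` fast path pays for the buffer.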


None of gcc, clang or msvc deallocate the frame or part of the frame in `baz2` before calling `bar` the first time here:

https://godbolt.org/z/rvx36sM74

What I would like to see is something like the generated code of `call_foo1` substituted verbatim into the generated code of `baz1` in the place of the function call. That way, at the point of calling `bar`, there is much less stack space allocated, minimizing stack usage.

But maybe this would pessimize other things, or for some weird reason is actually incorrect.

gcc at least deallocates the stack frame before tail-calling `baz`; I don't know if that is "shrink wrapping" or just plain TCO.


You can get into weird cases with DWARF-based unwinders, where two paths through a function with different stack depths make it impossible to reliably unwind.

LLVM has bugs in this regard with calls to variadic functions, since after a certain number of arguments have been passed in registers, you start pushing parameters on the stack (ABI dependent).


> They also agree that inline_always version executes less instructions.

The data shows that it's the other way around: inline_always executes 6,396,754,995 instructions and inline_never 5,597,215,493.
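For scale, plugging those numbers in (a trivial check; counts taken verbatim from the quoted data):

```c
/* Per-mille excess: how much larger `more` is than `less`. */
static long long excess_permille(long long more, long long less) {
    return (more - less) * 1000 / less;
}
/* excess_permille(6396754995LL, 5597215493LL) == 142, i.e.
 * inline_always retires roughly 14% more instructions. */
```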


Is the result different for this benchmark for an ARMv8 cpu with its ~31 registers vs an x86_64 CPU with its ~15 registers? For example, M1 vs Skylake?


x86_64 has 15 architecturally named registers, but many more physical ones in practice. Register renaming bypasses the juggling that code does between them.


I’m aware of that. But the compiler is limited to using the architectural register sets.


Right, but the modern CPU pretty much ignores that altogether and looks only at dependency chains to decide which and how many registers to use.
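Renaming isn't observable from C itself, but as a rough sketch of what's meant: the repeated writes to `t` below are write-after-write hazards on whatever architectural register the compiler picks, yet they don't serialize the loop, because each write gets a fresh physical register; only the additions into `sum` form a real dependency chain.

```c
/* The compiler will likely keep `t` in a single architectural
 * register across iterations; the CPU renames it per iteration,
 * so iteration i+1 need not wait for iteration i's `t`. */
static long doubled_sum(const int *a, int n) {
    long sum = 0;
    for (int i = 0; i < n; i++) {
        int t = a[i] * 2;   /* independent work, renamed each time */
        sum += t;           /* the only loop-carried dependency */
    }
    return sum;
}
```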



