
Avoiding Instruction Cache Misses - ingve
https://pdziepak.github.io/2019/06/21/avoiding-icache-misses/
======
hydroreadsstuff
The complexity of current hardware is madness.

The performance of software is influenced by the workings of tens of units
(plus local caches) connected (non-linearly) by FIFOs, out-of-order buffers,
and replay mechanisms. There are multiple versions of the same unit to save
chip space (like light and heavy integer/FP units). Units/domains have
different clock speeds. And from generation to generation, port connections
and the way instructions are distributed between pipelines can be reshuffled.

I suppose what keeps performance changes relatively straightforward for
developers is the set of benchmarks used to evaluate the hardware early on.

I appreciate the article focusing on the I-Cache, and the nice intro to
decoding.

I would have preferred a concrete example that actually improves something,
rather than an abstract discussion of problems, effects of code and
optimizations, and possible workarounds.

Tangent: I wonder whether we will someday see specialized instructions that
span multiple units and multiple cores in order to reduce data movement. Think
matrix-matrix multiplication. The potential improvements for power and speed
seem huge.

~~~
jandrewrogers
This is, in a nutshell, why high-performance systems engineering is a rare
skill set. It entails writing C++ (or whatever) with a full understanding of
the machine code it is likely to generate _and_ how that machine code will
interact with the incredibly complex internals of modern microarchitectures.
It essentially requires de-abstracting two levels of abstraction below the
programming language, which exist to reduce cognitive load, in your software
design and implementation.

It is unfortunate that this is still so useful in practice, given what it
implies about the magnitude of waste in typical software systems.

~~~
rmdashrfstar
Any advice to learn systems engineering properly?

~~~
loeg
Get a job where you do it some or full time, and acquire lots and lots of
practice.

~~~
jeffreygoesto
+1 took me too long to type, otherwise I would have just quoted you ;-).

------
AnanasAttack
I've been wondering about the impact of instruction cache misses and potential
downsides of executable bloat caused by C++ templates, but have never seen a
real case of this being a problem.

~~~
tempguy9999
How would you know it's happening? How would you know it's not? How have you
measured/looked?

I know little about this area, so I'm curious.

~~~
opencl
Several popular profiler tools (valgrind, visual studio profiler, dtrace,
perfmon) can track cache miss rates.

~~~
namibj
perf-tools are my favorite. The overhead is negligible, and thus any metrics
you gather are very accurate. Valgrind is rarely useful, considering the
execution time disadvantage.
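
To make the "how have you measured?" question concrete: the hardware counters that perf reads can also be sampled from inside a program. Below is a minimal C++ sketch, assuming Linux and a CPU/kernel combination that exposes an L1 instruction cache miss event. `l1i_read_miss_config` and `count_l1i_misses` are hypothetical helper names, not part of any library. The sketch uses the perf_event_open(2) system call and returns no value when the counter cannot be opened (for example, because of perf_event_paranoid restrictions or a VM/container without PMU access).

```cpp
#include <cstdint>
#include <cstring>
#include <optional>
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

// Encode "L1 instruction cache, read access, miss" for PERF_TYPE_HW_CACHE.
// Layout per perf_event_open(2): cache id | (op << 8) | (result << 16).
constexpr std::uint64_t l1i_read_miss_config() {
    return static_cast<std::uint64_t>(PERF_COUNT_HW_CACHE_L1I) |
           (static_cast<std::uint64_t>(PERF_COUNT_HW_CACHE_OP_READ) << 8) |
           (static_cast<std::uint64_t>(PERF_COUNT_HW_CACHE_RESULT_MISS) << 16);
}

// Hypothetical helper: runs `work` on the calling thread and returns the
// number of user-space L1I read misses counted while it ran, or
// std::nullopt if the counter is unavailable (in which case `work` is
// not run at all).
template <typename Work>
std::optional<std::uint64_t> count_l1i_misses(Work&& work) {
    perf_event_attr attr;
    std::memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HW_CACHE;
    attr.size = sizeof(attr);
    attr.config = l1i_read_miss_config();
    attr.disabled = 1;        // start stopped; enabled explicitly below
    attr.exclude_kernel = 1;  // count user-space code only
    attr.exclude_hv = 1;

    // glibc has no wrapper for perf_event_open, so go through syscall(2).
    // pid = 0: this thread, cpu = -1: any CPU, group_fd = -1: no group.
    const long ret = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (ret < 0) return std::nullopt;
    const int fd = static_cast<int>(ret);

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    work();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    std::uint64_t misses = 0;
    const bool ok = read(fd, &misses, sizeof(misses)) ==
                    static_cast<ssize_t>(sizeof(misses));
    close(fd);
    if (!ok) return std::nullopt;
    return misses;
}
```

Comparing the count for two builds of the same workload (say, with and without a heavily templated code path) gives a rough signal rather than a verdict, since the number also depends on what else ran on that core. From the command line, `perf stat -e L1-icache-load-misses ./app` reads the same kind of counter where the hardware supports that event.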

------
htfy96
Another big issue is iTLB misses. Today, if you profile a real-life
application, the iTLB miss rate can be 5%+ due to increased program image
size. Splitting code into hot and cold variants, as this article does, also
helps with this problem, but I wonder whether we could use transparent huge
pages for code while disabling them for mmapped data pages.
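
For readers who want to see what hot/cold splitting can look like in source form, here is a minimal C++ sketch, assuming GCC or Clang (the attributes and `__builtin_expect` are compiler extensions, not standard C++). `report_negative` and `sum_non_negative` are hypothetical names invented for illustration, not code from the article. The idea is only that the rarely taken path is moved out of line so the hot loop occupies fewer cache lines and pages.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical rare path. `cold` tells the compiler this function is
// unlikely to run; `noinline` keeps its body out of the hot loop. GCC
// typically emits such functions into a separate .text.unlikely section,
// away from the hot code.
__attribute__((cold, noinline))
static long report_negative(std::size_t index, int value) {
    std::fprintf(stderr, "negative value %d at index %zu\n", value, index);
    return -1;
}

// Hot path: sums the values. A negative value is treated as a rare error,
// so the branch is hinted as unlikely and its handling is a single call
// into the cold function.
long sum_non_negative(const std::vector<int>& values) {
    long total = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        if (__builtin_expect(values[i] < 0, 0)) {
            return report_negative(i, values[i]);
        }
        total += values[i];
    }
    return total;
}
```

Whether this helps in a given program is something to measure rather than assume; if the linker groups cold functions together, the hot code should also span fewer pages, which is the connection to iTLB pressure.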

------
shereadsthenews
I wish the article discussed huge pages for executable text and their
beneficial impact on the iTLB.
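
On Linux, the per-mapping control asked about in this thread is madvise(2) with `MADV_HUGEPAGE` / `MADV_NOHUGEPAGE`. The following is a minimal C++ sketch, assuming Linux with transparent huge pages compiled in; `PageSpan`, `page_span`, and `advise_huge_pages` are hypothetical names. It shows only how to apply the hint to an address range. Whether the kernel actually backs file-mapped executable text with huge pages depends on the kernel version and configuration (to my knowledge, read-only file-backed THP is a separate build option), so treat the code side as an assumption to verify on the target system.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>
#include <unistd.h>

struct PageSpan {
    std::uintptr_t begin;  // page-aligned start address
    std::size_t length;    // length in bytes, a multiple of the page size
};

// Expands [addr, addr + len) to whole pages, since madvise(2) requires a
// page-aligned start address. `page` must be a power of two.
PageSpan page_span(std::uintptr_t addr, std::size_t len, std::size_t page) {
    assert(page != 0 && (page & (page - 1)) == 0);
    const std::uintptr_t mask = ~(static_cast<std::uintptr_t>(page) - 1);
    const std::uintptr_t begin = addr & mask;
    const std::uintptr_t end = (addr + len + page - 1) & mask;
    return {begin, static_cast<std::size_t>(end - begin)};
}

// Hypothetical helper: enable == true asks for huge pages on the range
// (MADV_HUGEPAGE), enable == false opts the range out (MADV_NOHUGEPAGE).
// Returns 0 on success, otherwise the errno value set by madvise.
int advise_huge_pages(void* addr, std::size_t len, bool enable) {
    const auto page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    const PageSpan span =
        page_span(reinterpret_cast<std::uintptr_t>(addr), len, page);
    const int advice = enable ? MADV_HUGEPAGE : MADV_NOHUGEPAGE;
    if (madvise(reinterpret_cast<void*>(span.begin), span.length, advice) != 0) {
        return errno;
    }
    return 0;
}
```

With the system-wide THP mode set to `madvise`, only ranges that were given `MADV_HUGEPAGE` are eligible, so data mappings stay on small pages by default; with the mode set to `always`, calling `advise_huge_pages(ptr, len, false)` on an mmapped data region opts that region out.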

