
Precise timing of machine code with Linux perf - nkurz
https://easyperf.net/blog/2019/04/03/Precise-timing-of-machine-code-with-Linux-perf
======
A2017U1
This is great. Always love seeing what can be accomplished with just a shell.

Wish I could get more time to work on micro-optimisations; these days you
often get funny looks for proposing to dig down and optimise sluggish code.
Spin up more compute and call it a day, they say.

While that's completely understandable from a business perspective, it's
somewhat unsatisfying as a developer, and the expertise that comes from it
surely has lasting benefits too.

~~~
dendibakh
Thanks. I'm glad you like the article. :)

------
darawk
Would anyone who understands mind explaining the phrase "prefetch window"? I
don't quite get what's meant by that from context, and can't seem to find
anything from googling.

~~~
PowerfulWizard
I'm just going off the article, here is my understanding.

The sample program is executing 2^7 = 128 NOPs, 4 per cycle over 32 cycles,
and then it is doing a memory access.

The address of memory that is going to be accessed is known right before doing
the 32 cycles of "work", so a prefetch can be issued at that time.

The 'prefetch window' is the number of cycles between when you issue the
prefetch and when you issue the instruction that accesses the prefetched
address. So it is determined by the structure of the program being analyzed.

~~~
darawk
Hmm ya that explanation sounds right to me. Thanks for the help :)

------
vbernat
This is quite interesting. Is there anything preventing the collection of the
number of cycles spent in an arbitrary function? It seems this is just a
matter of identifying all branches.

~~~
dendibakh
Yes, that should be possible. However, you will probably get multiple cycle
counts for the same function depending on which path was taken. And it works
only if the number of taken branches in the function is small (fewer than
32); otherwise it will not fit into the LBR stack. For example, a loop with
more than 32 iterations will trash the LBR stack with backward jumps. But
yeah, for small functions it might work pretty well.

I would rather analyze not the whole function (all of its basic blocks) but
only the hyperblocks (the typical hot path through the function). Here is an
example of how to do it:
[https://lwn.net/Articles/680985/](https://lwn.net/Articles/680985/), chapter
"Hot-path analysis".
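The extraction step behind this kind of analysis is just parsing perf's LBR output. A small shell sketch, using a canned sample line as a stand-in for real `perf record -b` / `perf script -F brstack` output (the from/to/mispredict/in_tx/abort/cycles field layout is an assumption about the perf version), that sums the per-branch cycle counts:

```shell
# Stand-in for one LBR entry list as printed by 'perf script -F brstack':
# each entry is from/to/mispredict/in_tx/abort/cycles.
sample='0x400618/0x400600/P/-/-/3  0x40061f/0x400613/P/-/-/29'

# Split entries onto lines and sum the cycles field (6th '/'-separated field).
echo "$sample" | tr ' ' '\n' | awk -F/ 'NF==6 { sum += $6 } END { print sum }'
```

On real data you would feed `perf script -F brstack` into the same pipeline and group by the from/to address pair instead of summing everything.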

~~~
CalChris
Superblocks rather than hyperblocks. x86 doesn't have predication, except for
cmov, which is partial predication. But superblocks are probably what your
optimizer wants anyway.

