
Top-down performance analysis methodology - matt_d
https://dendibakh.github.io/blog/2019/02/09/Top-Down-performance-analysis-methodology
======
jacques_chester
A note here that "front end" and "back end" are meant in the CPU
microarchitecture sense, not the web development sense.

I think Brendan Gregg's _Systems Performance: Enterprise and the Cloud_ is
still the single best volume on this topic; I'd recommend Gunther's _Guerrilla
Capacity Planning_ as a useful followup to get a foothold on theory.

------
moab
I wish the author had explained how he went from identifying that the issue is
stalled references to DRAM to the CYCLE_ACTIVITY.STALLS_L3_MISS performance
event. Similarly from DRAM_Bound to MEM_LOAD_RETIRED.L3_MISS_PS. My complaint
is that there's still a lot of magic here that requires carefully reading the
Intel manuals that is elided in this post. That said, thanks to the author for
the post---it is still very useful.

Edit: the author has another post that covers this information
[https://dendibakh.github.io/blog/2018/06/01/PMU-counters-and...](https://dendibakh.github.io/blog/2018/06/01/PMU-counters-and-profiling-basics)

~~~
dendibakh
Hi, I'm glad you like the article. How I went from a TMAM metric to the
particular event used to calculate it is described in the TMAM metrics
table:
[https://download.01.org/perfmon/TMA_Metrics.xlsx](https://download.01.org/perfmon/TMA_Metrics.xlsx)

In the same row as the DRAM_Bound metric, the table specifies a precise (PEBS)
event that we can use to locate the issue. Sampling on that precise event lets
us find the exact places in the code with the most L3 misses.
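
As a rough sketch of that last step (an illustration, not necessarily the exact
invocation from the article), the helper below only builds a `perf record`
command line that samples on a precise event. The helper name
`perf_record_cmd`, the lowercase symbolic event name, the `./a.out` target and
the default precision level are all assumptions; whether the symbolic name
resolves depends on your CPU and perf version, so check `perf list` first.

```python
from typing import List


def perf_record_cmd(event: str, target: List[str], precise_level: int = 2) -> List[str]:
    """Build an argv list for `perf record` sampling on a single event.

    `precise_level` is the number of 'p' modifiers appended to the event
    (0 to 3), which asks perf for increasingly precise sample attribution.
    """
    if not 0 <= precise_level <= 3:
        raise ValueError("precise_level must be between 0 and 3")
    if not target:
        raise ValueError("target command must not be empty")
    # Hypothetical example: "mem_load_retired.l3_miss" -> "mem_load_retired.l3_miss:pp"
    modifier = ":" + "p" * precise_level if precise_level else ""
    return ["perf", "record", "-e", event + modifier, "--", *target]


# Assumed event name and binary; adjust both for your machine.
cmd = perf_record_cmd("mem_load_retired.l3_miss", ["./a.out"])
print(" ".join(cmd))
```

Running the printed command and then `perf report` or `perf annotate` should
show which instructions the sampled L3 misses are attributed to.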

Let me know if you have further questions!

~~~
moab
Thanks for the reply and the pointer!

------
SilverSurfer972
Would play nicely with the Scalability Prediction tools we built:
[https://stacktical.com/demo](https://stacktical.com/demo). As soon as a
regression is detected in the CI/CD pipeline, a top-down analysis can be done
according to the amount of Serialization and/or Synchronization penalties.

------
m0zg
Would love to see the exact same thing but for ARM, and in particular aarch64.

~~~
shereadsthenews
Unfortunately, aarch64 does not have anywhere near the number and variety of
events offered by the x86 PMU.

~~~
m0zg
But it does have some (usually), and it needs perf optimization much more,
especially with regard to cache misses and such, because RAM is usually pretty
slow on lower-end ARM.

