
Intel 64 and IA32 Architectures Performance Monitoring Events [pdf] - ingve
https://software.intel.com/sites/default/files/managed/8b/6e/335279_performance_monitoring_events_guide.pdf
======
strstr
Perf counters are super useful. On Linux, the perf tool (and the perf_event API)
makes these usable:
[https://perf.wiki.kernel.org/index.php/Main_Page](https://perf.wiki.kernel.org/index.php/Main_Page)

The counters vary per Intel CPU, though the most useful ones are universal
(e.g. cycle counts). AMD has similar counters.
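As a concrete sketch of the universal counters (assuming `perf` is installed and the kernel allows counter access, e.g. via `perf_event_paranoid`):

```shell
# Count a few near-universal hardware events for a short workload.
# Guarded so this degrades gracefully where perf isn't available.
if command -v perf >/dev/null 2>&1; then
  perf stat -e cycles,instructions,cache-misses,branch-misses -- sleep 1
  ran=yes
else
  echo "perf not installed"
  ran=no
fi
```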

~~~
soulbadguy
ocperf is a wrapper around perf provided by someone at Intel. On first run, it
downloads a list of counters specific to the detected CPU, which is pretty cool:
[https://github.com/andikleen/pmu-tools](https://github.com/andikleen/pmu-tools)
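A usage sketch, assuming pmu-tools is cloned locally (the event name here is one of the CPU-specific Intel names that ocperf resolves to raw event codes before handing off to perf):

```shell
# ocperf.py translates CPU-specific event names to raw perf codes.
# Guarded so this is a no-op where pmu-tools isn't checked out.
if [ -x pmu-tools/ocperf.py ]; then
  ./pmu-tools/ocperf.py stat -e br_misp_retired.all_branches -- sleep 1
  ran=yes
else
  ran=skipped   # pmu-tools not present in this environment
fi
```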

------
CalChris
If you're a low-level hacker, a reading knowledge of these events is useful, but
really we should be using VTune as our PME tool. Still, it's possible that a
particular event may shed light on a particular piece of code when driven
through an API like PCM.

[https://github.com/opcm/pcm](https://github.com/opcm/pcm)

~~~
kev009
vtune is first class, but most people will be using perf on Linux or pmcstat on
FreeBSD, so you do need to cross-reference a doc like this occasionally when you
want to probe a new counter to look for bottlenecks.

pcm is also quite nice for monitoring what an entire system is doing in terms of
memory bandwidth, NUMA link traffic, and other package-level concerns, but it
doesn't give any kernel- or application-level tracing like the other tools.
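A sketch of that system-wide view (the binary name and flags are from PCM's recent releases and may vary by version; MSR access generally requires root):

```shell
# System-wide memory bandwidth via Intel PCM's pcm-memory tool.
# -i=1 limits it to a single 1-second sample; guarded so this is a
# no-op where pcm isn't installed.
if command -v pcm-memory >/dev/null 2>&1; then
  pcm-memory 1 -i=1
  ran=yes
else
  ran=skipped   # pcm not installed in this environment
fi
```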

------
grandmczeb
Open question to other commenters: are there hardware performance
counters/features that you would like to see implemented but currently aren’t?

~~~
_chris_
As a RISC-V core implementer, I'm super interested in answers to this
question. Some of the things I've pondered are ways to figure out 1) which
branch am I constantly mispredicting and 2) which load is constantly cache-
missing. Not sure the best way to expose that to the programmer, particularly
in a way that's cheap for most cores.

~~~
strstr
1) Modern LBR might solve this. LWN has a summary (though I've only skimmed
this): [https://lwn.net/Articles/680996/](https://lwn.net/Articles/680996/)
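For 1) on Linux, a sketch of sampling LBR branch records with perf and sorting the report by misprediction (needs LBR-capable hardware and counter permissions; the sort keys are perf's branch-stack keys):

```shell
# Record branch stacks via the LBR (-b / --branch-any), then report
# branch edges with their mispredict flag. Guarded for environments
# without perf or PMU access.
if command -v perf >/dev/null 2>&1; then
  perf record -b -e cycles -- sleep 1 2>/dev/null && \
    perf report --sort symbol_from,symbol_to,mispredict --stdio | head -n 20
  ran=yes
else
  ran=no
fi
```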

2) Not sure for this, though I can think of some crappy hacks:

--A) Timed LBR mentioned in that LWN article (somewhat indirect, but might
get the job done)

--B) use perf counter overflow interrupts (for cache misses) and set the perf
counter initial value high (which should let you sample the cache miss
locations). This can only tell you if a particular load is making up a large
fraction of your overall cache misses (which is probably not super useful).
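Option B) is essentially what `perf record` automates on Linux: `-c` sets the sample period, i.e. how many events elapse between overflow interrupts, and each interrupt records where the counter overflowed. A sketch:

```shell
# Sample cache misses every 10000 events and report the hot locations.
# Guarded for environments without perf or PMU access.
if command -v perf >/dev/null 2>&1; then
  perf record -e cache-misses -c 10000 -- sleep 1 2>/dev/null && \
    perf report --stdio | head -n 20
  ran=yes
else
  ran=no
fi
```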

Edit: Forgot about PEBS, which is really what you want for 2).
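On Intel parts with PEBS, perf exposes precise memory sampling through `perf mem`, which attributes samples to individual loads rather than skidded instruction pointers (a sketch, assuming hardware and permission support):

```shell
# PEBS-backed load sampling; the report shows per-load data sources
# and latencies where the hardware supports it. Guarded as above.
if command -v perf >/dev/null 2>&1; then
  perf mem record -- sleep 1 2>/dev/null && \
    perf mem report --stdio | head -n 20
  ran=yes
else
  ran=no
fi
```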

