
Performance Analysis Guide for Xeon 5500 (2009, pdf) - dragontamer
https://software.intel.com/sites/default/files/m/0/8/8/performance_analysis_guide.pdf
======
dragontamer
While the Xeon 5500 / Nehalem processor is 9 years old at this point, this
document meticulously goes through its architecture, and offers a huge
number of tips on how to use hardware performance counters to guide
optimization of code.

If anyone knows of a similarly good document for modern architectures (i.e.
Intel's Skylake / 6th Gen i7 or later, or AMD Zen), I'm all ears. But this
document looks to be exceptionally good, despite its age.

-----------

For the uninitiated: modern systems have performance counters available
through tools like Linux's perf, which can count events such as branch
mispredictions or cache misses. That information can guide a low-level
programmer towards optimizations.

Figuring out "why" your program is slow and how to change your algorithms to
improve speed requires an understanding of computer architecture (e.g.
rearranging a for-loop to be L1 cache friendly can lead to 10x speed
increases, as you go from main-memory constrained to L1 cache constrained).
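
To make the for-loop point concrete, here is a rough sketch (mine, not from the PDF) of a toy cache model in Python. It counts misses for one pass over an n x n array of 8-byte elements stored row by row, walked either along rows or along columns. The geometry (32 KiB, 8-way, 64-byte lines, LRU) is an assumption picked to resemble an L1 data cache of that era, and `count_misses` is a made-up helper. Real hardware also has prefetchers, L2/L3 and a TLB, none of which are modeled, so treat the numbers as an illustration of what a cache-miss counter might hint at, not as a prediction.

```python
from collections import OrderedDict

# Assumed toy geometry: 64 sets * 8 ways * 64 bytes = 32 KiB.
LINE_BYTES = 64   # bytes per cache line
NUM_SETS = 64     # number of sets
WAYS = 8          # lines per set
ELEM_BYTES = 8    # one double-precision element


def count_misses(n, row_major):
    """Count misses in a toy LRU set-associative cache for one pass
    over an n x n array laid out row by row in memory."""
    sets = [OrderedDict() for _ in range(NUM_SETS)]
    misses = 0
    for outer in range(n):
        for inner in range(n):
            # Row-major walk: inner loop moves along a row (adjacent addresses).
            # Column-major walk: inner loop jumps a whole row each step.
            i, j = (outer, inner) if row_major else (inner, outer)
            line = (i * n + j) * ELEM_BYTES // LINE_BYTES
            cache_set = sets[line % NUM_SETS]
            if line in cache_set:
                cache_set.move_to_end(line)        # hit: mark most recently used
            else:
                misses += 1
                cache_set[line] = True
                if len(cache_set) > WAYS:
                    cache_set.popitem(last=False)  # evict least recently used
    return misses


if __name__ == "__main__":
    n = 256
    print("row-major misses:   ", count_misses(n, row_major=True))
    print("column-major misses:", count_misses(n, row_major=False))
```

In this model the row-order walk misses once per 64-byte line (one access in eight), while the column-order walk with n = 256 misses on every access, because the large stride keeps landing in the same couple of sets and evicts each line before it is reused. On a real machine you would check the same hypothesis by swapping the loop order and comparing cache-miss counts from the hardware counters.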

But in the real world, when dealing with real problems, it is difficult to
figure out performance issues. And that's where knowledge of hardware counters
can come in.

------
wyldfire
I did a lot of work with the 5600 Westmere. After working out the "easy" stuff
related to NUMA and IOH, the (relatively new IIRC) uncore was a big remaining
part of the recipe for performance when trying to saturate the PCIe devices.

~~~
dragontamer
Interesting story! Care to elaborate on what kind of PCIe devices you were
using?

Was it GPGPU compute? Or were you trying to saturate a router / network of
some kind? And do you remember what kind of problems you faced? (Latency or
bandwidth??)

~~~
wyldfire
They were Intel X520-SR2 Niantic 10GbE cards (later 82599ES). Bandwidth. It
was challenging to find a stable recipe on Linux to receive TCP streams at or
near the media limit (times four ports, simultaneously).

