
The Microarchitecture of Intel, AMD and VIA CPUs [pdf] - CalChris
http://www.agner.org/optimize/microarchitecture.pdf
======
CalChris
Less well known, but Torbjorn Granlund's _Instruction latencies and throughput
for AMD and Intel x86 processors_ has also been updated for Ryzen.

[https://gmplib.org/~tege/x86-timing.pdf](https://gmplib.org/~tege/x86-timing.pdf)

~~~
glangdale
I hadn't seen this before, but it looks like it's less well known largely
because it's quite incomplete (almost no SIMD) and redundant with Agner Fog's
work and InstLat.

~~~
CalChris
It concentrates on integer ops and leaves out branches altogether. The Intel
Optimization manual is pretty light on indirect branches as well.

What I like about it is that it makes it easy to compare how
microarchitectures handle a given instruction over time. For example, the BT
instructions were a little slow, 2 cycles, 1 port through Haswell. With
Broadwell that changed to 1 cycle and 2 ports. Similarly, CMOV improved with
Broadwell (remember Linus' rant about the evils of CMOV?) and is now 1 cycle
latency.

I don't know InstLat. Do you have a link?

~~~
mulvya
[http://instlatx64.atw.hu/](http://instlatx64.atw.hu/)

------
glangdale
I am very fond of this document, and am constantly amazed at how the
commentariat, here and elsewhere, frequently like to theorize about what
instructions "might be expensive" without bothering to look them up.

~~~
brianwawok
Many programmers do things "because they're faster" with zero work testing
those theories. A little bit funny and a little bit sad.

Thankfully, the engineer designing bridges doesn't make up which material to
use the way software devs pick algorithms.

~~~
jcranmer
> Thankfully, the engineer designing bridges doesn't make up which material to
> use the way software devs pick algorithms.

Except that one time where the builder said "this joint is hard to build, can
we modify it slightly?", the engineer looked at the change, said "sure", and
114 people died when said joint failed (the Hyatt Regency walkway collapse).

------
CalChris
Agner Fog's microarchitecture document has been updated for AMD Ryzen.

------
Coding_Cat
Does anyone know how Agner actually produces all this information? It can't be
easy to determine all these parameters.

~~~
CalChris
In a word: empirically. In a few more words: empirically, plus reading a ton
of Intel, AMD and VIA documentation and, I'd posit, some of the patent and
academic literature.

~~~
tmccrmck
He explains his method in Instruction Tables [1] in the 'How the values were
measured' section, and he even includes a zip of the code. I found this part
particularly interesting:

> It is not possible to measure the latency of a memory read or write
> instruction with software methods. It is only possible to measure the
> combined latency of a memory write followed by a memory read from the same
> address. What is measured here is not actually the cache access time,
> because in most cases the microprocessor is smart enough to make a "store
> forwarding" directly from the write unit to the read unit rather than
> waiting for the data to go to the cache and back again. The latency of this
> store forwarding process is arbitrarily divided into a write latency and a
> read latency in the tables. But in fact, the only value that makes sense to
> performance optimization is the sum of the write time and the read time.

[1]
[http://www.agner.org/optimize/instruction_tables.pdf](http://www.agner.org/optimize/instruction_tables.pdf)

------
throwaway-1209
Does anyone know of a similar document for ARM, and in particular for the
various flavors of aarch64?

------
DamonHD
Wow! This is a great doc! These days I'm targeting things other than x86 for
the day job, but this level of insight, when also armed with -O3 -S assembly
output from a compiler, is what really lets one go to town...

