
Uops.info: Characterizing Latency, Throughput, and Port Usage on Intel - nkurz
https://arxiv.org/abs/1810.04610
======
BeeOnRope
Finally a third independent source for x86 instruction timings, joining Agner
[1] and instlatx64 [2].

Is the testing code open source? There is no obvious link from the main page.

[1]
[https://www.agner.org/optimize/#manuals](https://www.agner.org/optimize/#manuals)

[2] [http://users.atw.hu/instlatx64/](http://users.atw.hu/instlatx64/)

~~~
andreas-a
The testing code is not open source yet, but we plan to make it available in
the near future. We are also working on a code analysis tool, similar to
Intel's IACA, that uses the instruction data from
[http://www.uops.info/](http://www.uops.info/). Finally, we also plan to add
data for AMD processors.

~~~
BeeOnRope
Cool. Is it the IACA-alike tool:

[https://github.com/RRZE-HPC/OSACA](https://github.com/RRZE-HPC/OSACA)

?

It would be nice to see something with a more open and responsive development
process compared to IACA.

~~~
andreas-a
No, this is a different IACA-like tool. Our tool is not public yet.

~~~
BeeOnRope
Oh cool, seems like we'll jump from zero to two direct competitors to IACA
soon. I look forward to your tool (hopefully it can use the same IACA marker
bytes so binaries can be compiled once for all these tools).

~~~
wallnuss
There is also llvm-mca, which is kinda neat since it uses the information that
LLVM has about the op costs. Contrasting that to IACA or OSACA is valuable as
well.

------
CalChris
Latency numbers will differ (a lot) because the definitions of latency these
sites use differ. Agner uses _the delay that the instruction generates in a
dependency chain_. uops.info uses _number of clock cycles that are required
for the execution core to complete the execution of all of the μops that form
an instruction_. instlatx64 defines latency as _the time that it takes for the
next dependent same-type instruction to start_. These are very different
definitions.

For example, Agner gives a latency of 2 for pop r while uops.info gives 6.

~~~
andreas-a
The "number of clock cycles that are required for the execution core to
complete the execution of all of the μops that form an instruction" is the
definition that Intel uses in its manuals to define latency.

It is actually _not_ the definition that uops.info uses. Instead, it uses a
definition that takes into account that different operands of an instruction
might be ready at different times (see section 4.1 of the paper that mentions
Intel's definition as a "common definition" and then continues to introduce
the new definition). Furthermore, uops.info also considers latency differences
that can occur if an instruction uses the same register for multiple operands
(the SHLD instruction is an example of this).

~~~
BeeOnRope
Is it even possible to measure the latency by Intel's definition? I think
"operational" definitions that correspond to something you can actually
measure (and by extension apply to measured performance in real code too) are
far preferable.

In theory "it is complicated" because an instruction might have N inputs and M
outputs and the latency matrix might have different values for every element
in that N x M matrix, but in practice instructions have one output and the
inputs are usually symmetric with regard to latency, so one figure is enough
for most instructions. It's worth calling out the exceptions though, and uops.info is
supposed to do that I think (an example would be great).
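
The N x M matrix idea above can be sketched in a few lines. This is a toy model
with illustrative numbers (the `shld_like` values echo the Nehalem SHLD case
discussed elsewhere in the thread, not fresh measurements), showing how the
commonly quoted single latency figure is just the worst case over the
per-operand matrix:

```python
# Sketch of the per-operand latency idea: an instruction with N inputs
# and M outputs has an N x M matrix of latencies, but the single number
# usually quoted is the worst case over all input->output pairs.
# The values below are illustrative, not measured data.

def worst_case_latency(lat_matrix):
    """Collapse a per-operand latency matrix to the usual single figure."""
    return max(lat_matrix.values())

# Hypothetical 2-input, 1-output instruction whose inputs are NOT
# symmetric with regard to latency:
shld_like = {
    ("op1", "op1"): 3,  # cycles from op1 ready to result ready
    ("op2", "op1"): 4,  # cycles from op2 ready to result ready
}

print(worst_case_latency(shld_like))  # 4
```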

~~~
andreas-a
Section 7.3 of the paper describes several such examples.

On Sandy Bridge, for example, the result of the "AESDEC XMM1, XMM2"
instruction is ready 8 cycles after XMM1 becomes available. If only XMM2 is on
the critical path, however, the result is already available after a bit more
than one cycle.

On Nehalem, the SHLD R1, R2, imm instruction has, according to Intel's manual,
[http://instlatx64.atw.hu/](http://instlatx64.atw.hu/), and IACA, a latency of
4 cycles. Agner Fog reports a latency of 3 cycles. The measurements on
uops.info show that the latency from R1 to R1 is 3 cycles, while the latency
from R2 to R1 is 4 cycles.
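
A toy critical-path model shows how such per-operand numbers surface in a
dependency-chain measurement: carry the dependency through one specific input
operand and look at the steady-state cycles each chain link adds. The numbers
mirror the Nehalem SHLD example (3 cycles op1->op1, 4 cycles op2->op1) and are
illustrative, not measured:

```python
# Latency from each input operand to the (single) output.
LAT = {"op1": 3, "op2": 4}

def measured_latency(carried, chain_len=1000):
    """Steady-state cycles per link of a chain where each result feeds
    the `carried` input of the next instruction; the other input is
    assumed ready at time 0."""
    other = "op2" if carried == "op1" else "op1"
    prev = ready = 0
    for _ in range(chain_len):
        # the result is ready once the slower of its inputs has resolved
        prev, ready = ready, max(ready + LAT[carried], LAT[other])
    return ready - prev

print(measured_latency("op1"))  # 3
print(measured_latency("op2"))  # 4
```

Chaining through op1 measures 3 cycles per link, chaining through op2 measures
4, even though it is the "same" instruction each time.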

~~~
BeeOnRope
Right, I read the paper and saw the examples, but I meant more an example of
if/how this information is surfaced in the results as collected at uops.info.

Anyways, it is all there, using your SHLD example:

[http://uops.info/html-instr/SHLD-1633.html](http://uops.info/html-instr/SHLD-1633.html)

You can see the varying latencies on Nehalem for op1->op1 versus op2->op1.

Great work!

------
nkurz
They mention in the "Limitations" section that "Except for the division
instructions, we do not consider performance differences that might be due to
different values in registers, or different immediate values."

Besides division, are there other instructions that are
known/expected/suspected to have different execution times based on the values
in the input registers?

~~~
BeeOnRope
I have found that load instructions take one less cycle (4 vs 5) when using an
index register, if the index register is zero _and the value was set to zero
via a zeroing idiom_. More details on RWT [1].

You didn't include immediates in your question, but they were included in the
quote you referred to, so it's worth mentioning adc with an immediate zero,
which is "twice as fast" [2] from Sandy Bridge through Haswell.

Of course many FP instructions have value-dependent performance, particularly
with denormals (although it is common even if denormals don't occur).
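
For anyone unfamiliar with denormals (subnormals): they are the tiny nonzero
values below the smallest normal double, and hardware often handles them on a
slow microcoded path. A quick way to see the values themselves in Python,
without timing anything:

```python
import sys

smallest_normal = sys.float_info.min   # smallest positive *normal* double
subnormal = smallest_normal / 2        # below the normal range, still nonzero

print(subnormal > 0)                    # True: representable...
print(subnormal < smallest_normal)      # True: ...but subnormal
print(subnormal * 2 == smallest_normal) # True: the division was exact
```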

FWIW recent AMD chips seem to have fixed-latency integer dividers.

[1]
[https://www.realworldtech.com/forum/?threadid=179004](https://www.realworldtech.com/forum/?threadid=179004)

[2] [https://github.com/travisdowns/uarch-bench/wiki/Intel-Perfor...](https://github.com/travisdowns/uarch-bench/wiki/Intel-Performance-Quirks#adc-with-a-zero-immediate-ie-adc-reg-0-is-twice-as-fast-as-with-any-other-immediate-or-register-source-on-haswell-ish-machines)

