
Understanding CPU port contention - matt_d
https://dendibakh.github.io/blog/2018/03/21/port-contention
======
wallnuss
IACA is awesome to understand the behaviour of your own hot loops. Sadly it
only works on Intel. Luckily LLVM has recently merged llvm-mca [1] (machine-
code analyser) which hopefully will in time bring all the features of IACA and
more to other platforms as well.

I am working on making these tools easily accessible from Julia [2] and one of
the things missing from llvm-mca is the ability to mark a region of assembly
code to be analysed instead of analysing the entire provided assembly.

[1] https://llvm.org/docs/CommandGuide/llvm-mca.html

[2] https://github.com/vchuravy/IACA.jl

------
mjw1007
This bit caught my eye:

« According to Agner’s instruction_tables.pdf load instruction that I use has
2 cycles latency »

and indeed that's what Agner Fog's tables say (and they further say this
figure is « the delay that the instruction generates in a dependency chain »).

But all other sources seem to agree that the L1 dcache latency on the main
Intel processor family is 4 cycles (since Nehalem).

Does anyone know what the distinction is here? If I do a simple pointer chase
through L1 dcache, I wouldn't make progress at 2 cycles per step, would I?

~~~
BeeOnRope
Yes, Agner's doc is wrong, or perhaps misleading, here. He mentions that for
loads and stores he tested a loop of a load and a store to the same location
(essentially testing store-forwarding latency) to get the latency of that
pair, and then "arbitrarily divided" the measured latency between the load and
the store.

He did that because, to measure the latency of a single instruction, the
"domain" (e.g. GP register, SIMD register, memory, etc.) of the input(s) has
to be the same as that of the output.

Stores have a memory output but register/address inputs, so it isn't really
possible to measure the latency of a store in isolation. That's why Agner
paired up loads and stores to get a latency figure and did the arbitrary
split.

All that out of the way, it _is_ possible to measure the latency of a load, at
least for the load-to-address path, as you mention, with a pointer-chasing
load. That latency is 4 cycles minimum on all modern Intel. Add an extra cycle
each for complex addressing modes or vector loads.

So yeah, a load latency of 2 is nonsense for modern Intel. In the article's
scenario, even talking about latency is beside the point, since all the loads
are independent: it's the throughput, not the latency, that matters.

IIRC some chips have had 2-cycle L1 accesses: some P4 designs and one of the
semi-recent POWER chips.

~~~
mjw1007
Thank you very much.

He does say he's reporting the minimum latency; I was thinking that no case
would be faster than reading from L1 cache, but it makes sense that
store-forwarding can be faster.

~~~
BeeOnRope
Right, but store forwarding is definitely not 2 cycles. 4 or 5 cycles is
common for store forwarding, although other values are possible. So I think
Agner measured 4 cycles and divided it as 2 for the load and 2 for the store.
For pure loads the latency is easy to measure and widely reported, and it's 4
cycles minimum for Intel.

------
jayd16
Is this contention mostly taken care of by out-of-order execution and
hyperthreading? These topics aren't mentioned in the article (or I missed
them).

~~~
titzer
Out-of-order execution doesn't help if you saturate the functional units.
Hyperthreading also doesn't help; in fact, it hurts even more, since now the
other hyperthread(s) are competing for the contended resource.

~~~
BeeOnRope
Well, hyperthreading can help in the sense that execution "slots" may not be
wasted if the other hyperthread can use ports that would otherwise sit idle
with only a single thread running. That mostly only works if the workloads of
the two threads have very different port usage.

Consider one workload which saturates p1 doing integer multiplication, and
another which is load/store heavy. Those two workloads, run on sibling
hyperthreads, could perhaps approach the best possible speedup of 2x, since
they use different ports.

------
mnw21cam
So, if instead of issuing a second bswap instruction, you were to issue an
instruction that could go to a different ALU port, you would get that one for
(almost) free?

~~~
blattimwind
If it fits (latency/throughput-wise), yes, unless you clog something else
(like the decoder or retirement) with that instruction.

------
alain94040
I believe what you describe are called pipes, not ports.

EDIT: I was commenting based on my day job, but what do I know :-) Reservation
stations have ports, but the multiple execution units are commonly referred to
as pipes.

~~~
blattimwind
No.

Basically, in one of Intel's execution engines you have a reservation station,
which is fed µops and allocates resources to each µop. An RS port is what
connects the ALUs, load/store units, vector units, etc. to the RS.

