
Performance Speed Limits - pimterry
https://travisdowns.github.io/blog/2019/06/11/speed-limits.html
======
haberman
I generally dislike the "bottleneck" metaphor when applied to performance,
because the metaphor only makes sense when the individual "parts" run in
parallel. In most scenarios, each bit of code adds its own latency that is
independent of the latency of other components. So every component is a
"bottleneck", so to speak; it's just that some are bigger bottlenecks than
others.

However this case is an unusual exception, because all of these hardware
components described in this article _do_ run in parallel. So the slowest
element does truly gate the throughput of all stages of the pipeline. You
could add more load to the non-bottlenecked resources and the overall latency
of the pipeline wouldn't change at all, roughly speaking.

~~~
BeeOnRope
Yes, exactly. I didn't describe this, but it's something I was thinking about:
if you don't hit the bottleneck, you don't pay any price. If the limit of uops
allocated per cycle is 4 but you are only doing 3.5, the limit might as well
be infinite.

It's for this reason I struggled to include, for example, branch prediction
failures. Let's say they take a 15 cycle bubble in the front-end. You can
create a "speed limit" there and say "you can't do more than 1/15 mispredicts
per cycle", which is indeed true. However, it then doesn't work like the other
limits: if you measure 1 mispredict every 30 cycles, you aren't at the limit,
but it still has a huge impact on performance, because the bubbles are likely
slowing down everything else you do in those 30 cycles too. Perhaps 15 cycles
of the 30 are solely dedicated to handling the mispredicts.

Therefore, to include mispredicts here, it can't be a simple "speed limit"
thing: you need to understand the exact effect of the mispredict on the
front end, and how it interacts with the other speed limits. Sometimes a
mispredict costs 15 cycles (the figure you hear), but sometimes they don't
slow down your code at all, and you can certainly design tests where they cost
1000s of cycles each.

------
rkunde
You mention there’s always only one scalar multiplication unit. Can you
elaborate why that is the case? What’s keeping AMD and Intel from adding more?

~~~
BeeOnRope
Multipliers that can do a 64x64->128 bit multiplication every cycle at 5 GHz
are large in area and power hungry. They are the most expensive of the
traditional hardwired integer units (divide is even more expensive, but it is
usually implemented as many operations that repeatedly use a divider unit and
other units to do a few digits at a time: on essentially all current x86 chips
large divides take 40-80 cycles!).

The limitation isn't that bad in practice: it's hard to get more than 1
multiplication per cycle in non-trivial code since you have only 3 more ops to
do all the other work - where are the multiplication inputs coming from, for
example?

So a second multiplier probably has a small payoff. Apple chips, for example,
are wider, and IIRC they have two multipliers since they can take more
advantage of them.

It's worth noting that 64-bit multipliers are expensive enough that even in
AVX-512 Intel still offers zero 64-bit input multiplies in their SIMD unit
(they offer a 53 bit one that reuses the FMA hardware though).

------
Asm2D
Thanks for the article!

I started AsmGrid project that provides X86 architecture overview and
instruction timings for people like me that often work with assembly. It's
basically a simple web application that displays data provided by AsmDB and
timings obtained through cult tool
([https://github.com/asmjit/cult](https://github.com/asmjit/cult)).

AsmGrid is available here, click on the X86 Perf tab to see instruction
timings:

      - https://asmjit.com/asmgrid

A bit older version that has more microarchitectures (but fewer instructions)
is here:

      - https://kobalicek.com/asmgrid

I'm also looking for some help with instruction timings. I would be thankful
to anyone who has an X86 CPU with a different microarchitecture than those
available at [https://asmjit.com/asmgrid/](https://asmjit.com/asmgrid/) and is
willing to compile & run the cult tool and provide me its output.

Instructions for cloning and compiling cult are here:

      - https://github.com/asmjit/cult

~~~
BeeOnRope
Thanks for the link, I didn't know about CULT (but I use asmjit, it's a great
project).

I do feel like there is a lot of redundancy in this area. You are probably
aware of Agner's instruction tables, and he makes the source available for his
latency testing tools. You are probably aware of uops.info which has
latency/throughput numbers for instructions online and in XML format. Now
there is CULT (or actually CULT came before uops.info, but I wasn't aware of
it), and AsmDB.

There is the AIDA64 instruction latency thing giving the tables at:

[http://instlatx64.atw.hu/](http://instlatx64.atw.hu/)

There is also uarch-bench which I wrote, which has slightly different goals
than just getting the latency/throughput of instructions, but still does a lot
of that and has a lot of overlap in other areas.

I have seen more as well, but I can't remember all of them.

Unless your real love is writing these kinds of tools (mine is not), it would
be nice if there was one winner here: one canonical source for instruction
timings, available in whatever format, one test application for new platforms,
etc. Then efforts like "please run this on your new hardware" won't be split
among various people: right now sometimes Agner gets new hardware first,
sometimes the uops.info guys, sometimes the Instlat guy, etc.

Let a thousand flowers bloom and all that, but personally I think this space
is due for some collaboration...

~~~
Asm2D
Yeah there is definitely some redundancy here, but there were reasons I wrote
these tools. There are multiple sources that provide instruction timings.
Agner's instruction tables are amazing, but they cover only a single
microarchitecture at a time, so it's really difficult to compare multiple
microarchitectures. Intel's online intrinsics guide only shows instruction
latencies for a few of the latest microarchitectures, etc. Basically, before
AsmGrid it was time consuming for me to verify whether the instructions I had
selected would perform well on a majority of the machines that I target.

I wrote AsmGrid initially for myself to explore AsmDB data and to spot
possible errors. Then I wrote CULT just for fun, as I wanted to see whether
it's possible to write such a tool based on AsmJit. CULT basically iterates
over the AsmJit instruction database, checks whether instructions (with
various signatures) can be executed, and then creates JIT code for executing
them either in parallel or sequentially. It works really well and the testing
can be easily done in user space.

In other words, I wanted a table of instructions and their timings where I can
compare several microarchitectures at the same time during development and
AsmGrid provides that. The only downside is that I cannot test all
microarchitectures myself and when I make changes to support more instructions
a re-run is necessary.

~~~
BeeOnRope
Right, so CULT is maybe not the redundant one. Maybe the uops.info guys should
use CULT, I dunno. Also, if it's for fun, anything goes, of course. I just
feel sometimes like there is a lot of division of effort: it would be nice to
have a standard single data source for instruction performance attributes.

I am surprised I never ran across this before, because when I found uops.info,
I was like "finally, this stuff is online and linkable" - but maybe that
already happened earlier with AsmGrid?

Anyways, I can run this on SKL, SKX, CNL for you.

~~~
Asm2D
Thanks! Let's continue on that issue page.

------
BeeOnRope
Author here, happy for any feedback and to answer any questions!

Some things are definitely missing - I realize I don't cover load or store
buffers in the OoO part, for example. I am working on fixing that.

~~~
haberman
> As an example, a load instruction takes a cache miss which means it cannot
> retire until the miss is complete. On an Haswell machine with a ROB size of
> 192, at most 191 additional instructions can execute while waiting for the
> load: at that point the ROB window is exhausted and the core stalls. This
> puts an upper bound on the maximum IPC of the region of 192 / 300 = 0.64.

Where does 300 come from? Is this the number of cycles for the cache miss?

~~~
nkurz
Yes, I think 300 is an estimate of the number of cycles for a load from RAM:
192 ops / 300 cycles == .64 ops / cycle.

I'm not sure whether the calculation is true, though. Why can't the post-load
µops be executed and retired, and the ROB refilled, and then more µops
executed? I'd think the calculation would only apply if everything was somehow
speculative and thus nothing could be retired, but I don't see this being a
common occurrence.

~~~
BeeOnRope
Because that's not how the ROB or any of the other "in order" structures like
the load buffers, etc, work. Fundamentally, retirement happens in order.

uops only leave the ROB when they retire, and they retire in the same order as
they appear in the dynamic instruction stream. The ROB is needed to preserve
the ability to roll back if any instruction faults, to handle interrupts, to
generate precise exceptions, etc.: think about what would happen if you had
retired a bunch of random instructions on either side of an un-retired
instruction that ended up faulting.

Everything _is_ speculative. The CPU always operates as if any unretired
instruction is speculative, and in practice it almost always is: probably at
least a quarter of all instructions can fault in some way, and as soon as
there is any such instruction in the unretired stream, all younger
instructions are "speculative". I'm putting it in quotes because
"speculative" as we're discussing it is just something we are making up: CPUs
don't have an "I'm speculating" flag; everything is treated as speculative all
the time (and if you consider interrupts, maybe even the cases where you have
a stream of cannot-fault instructions would also be "speculative").

There is a structure that works as you describe: the scheduler/reservation
station. That's the place where everything happens out of order and slow
instructions can stall and be passed by younger ones who leave the scheduler
and make room for others - but this sits within the in-order allocate + retire
machinery.

~~~
nkurz
Thanks. So because of in-order retirement, the ROB is indeed limited to
holding only consecutive µops. And since nothing can be executed until it is
put in the ROB, this means that there is a hard limit on the distance in µops
between the last unretired µop and what can possibly be executed. Does this
also mean that a load is only ever retired after it has been successfully
fulfilled? I haven't understood the role of retirement for loads and stores.

~~~
BeeOnRope
> Thanks. So because of in-order retirement, the ROB is indeed limited to
> holding only consecutive µops. And since nothing can be executed until it is
> put in the ROB, this means that there is a hard limit on the distance in
> µops between the last unretired µop and what can possibly be executed.

Yes. The ROB holds even things that will never be executed, such as nops and
zeroing idioms.

Note that the ROB is not really the bad guy here: if you invent some way to
make the ROB retire out-of-order to break that restriction, you'll just
immediately run into the PRF limit, since almost all pending instructions need
a destination register. Since these values are all live (from the CPU's point
of view) until all older instructions have retired, you'll get the same kind
of limit from the PRF.

The PRF is a big structure using lots of area and power, so the ROB size is
probably kind of determined from the PRF size: how big are we going to make
the PRF? X? OK, let's make the ROB size X * 1.5, since then the ROB will
rarely limit performance. Note: we know the ROB increased dramatically in size
from 224 to 354 in SNC, but we don't yet know how big the register files are!

There are good papers out there about super-high ILP designs if you are
interested in how this kind of stuff can be solved, but there are many
problems.

> Does this also mean that a load is only ever retired after it has been
> successfully fulfilled? I haven't understood the role of retirement for
> loads and stores.

Yes. Loads and stores both go in the ROB, and also in the load/store buffers,
which are in-order just like the ROB.

A load can't retire until it completes: only then are the physical resources,
like the register it writes to and the load buffer entry, able to be freed.

Note that this is a huge difference between loads and prefetches: prefetches
can retire immediately (well as soon as the load address has been calculated).
They just kick off the load in the memory subsystem and then their work is
done.

Stores are different: they can retire as soon as the store address and store
data are known (i.e., as soon as their inputs are available). At this point
they become so-called _senior_ stores: stores which have retired, and hence
are non-speculative, but haven't become visible to the rest of the system yet.
They must eventually become visible, which means that on an interrupt this
part of the store buffer is preserved or drained, never thrown away like the
rest of the OoO buffers. When the time is right (write?) they commit to L1.

