Performance Speed Limits (travisdowns.github.io)
256 points by pimterry on June 11, 2019 | 48 comments

Author here, happy for any feedback and to answer any questions!

Some things are definitely missing - I realize I don't cover load or store buffers in the OoO part, for example. I am working on fixing that.

Great article, thanks!

My main suggestion for improvement would be to flesh out the details of determining which limit you are hitting. For example, how do you (semi-automatically) determine that changing to a simpler addressing mode might let you fit in another load per cycle? How do you determine that you are being limited by the size of the ROB? Knowing that these bottleneck exists helps, but knowing which remedy to try helps even more.

As for errors, there are a bunch of small typos. Making a pass with a spell-checker would catch most of these, or I could send my usual email. In the early passage about vectorization, I think you mean "you could get [an additional] 3x, down to roughly 0.125". Factually, I wasn't sure that "One operation per port, per cycle" was quite true, especially since you later talk about address calculations.

As far as static site commenting options, I came across this nice summary of approaches when we were trying to fix up Daniel's blog: https://darekkay.com/blog/static-site-comments/.

Thanks nkurz!

I will look at installing a spell checker in the IDE where I wrote this. I recently removed it because it was too frustrating for code and technical documentation due to the large number of false positives, but I'll take another shot.

> My main suggestion for improvement would be to flesh out the details of determining which limit you are hitting.

Yeah. For the few limits I wrote about (which isn't in the order of the document) I included perf counters you can use to check if you are hitting the limit.

The basic idea though is that you go through the limits one by one, calculate the limit as it applies to your code, then compare that against the observed performance. Ideally, one limit will match the performance you are getting and then that's the limit (it could be tied of course, but it's still the limit in the sense that you can't go faster w/o solving that).
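As a toy illustration of that comparison step (hypothetical numbers, and `binding_limit` is just a name invented for this sketch): each limit implies a lower bound on cycles per iteration, and the largest bound is the one that currently binds.

```c
/* Each "speed limit" implies a lower bound in cycles per iteration; the
   largest of these bounds is the one that currently binds. Hypothetical
   example with two limits: pipeline width and load ports. */
double binding_limit(double uops_per_iter, double pipeline_width,
                     double loads_per_iter, double load_ports) {
    double limit_pipeline = uops_per_iter / pipeline_width;  /* width bound */
    double limit_loads = loads_per_iter / load_ports;        /* port bound  */
    return limit_pipeline > limit_loads ? limit_pipeline : limit_loads;
}
```

If the measured cycles per iteration roughly match the returned bound, that limit is the one to attack; if the measurement is well above every bound, some limit not on your list is in play.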

Of course, this is often difficult to do: not everyone wants to read assembly and go through the list of limits. That's where, say, llvm-mca comes into play, although I could write more about that.

I can also write more about performance counters which can detect the problems, but then it starts to overlap a lot with Intel's top-down methodology, on which a lot has already been written (BTW, topdown is good, I recommend it).

> Factually, I wasn't sure that "One operation per port, per cycle" was quite true, especially since you later talk about address calculations.

I didn't catch the issue, can you elaborate?

> "One operation per port, per cycle"

> I didn't catch the issue, can you elaborate?

I could just be wrong, but I thought that on Intel Ports 2 and 3 handle the address calculation separately from the fetch, and thus technically do two operations per cycle. Or at least, this is the impression I got from the fact that IACA splits the number of operations for each into two separate columns, and from some diagrams that split those ports into LD and STA.

I think this may be related to the simple vs complex address calculations, although I'm fuzzy on the details. When I said "I wasn't sure" I was if anything understating my uncertainty. Searching now, it looks like Peter and Percy talk about it here: https://stackoverflow.com/questions/47701898/store-forwardin...

Oh, and some niggling typos: determing, differnet, limtis, reoganization, eliminat, soure, Essnentially, doens’t, intersting, itration, particpate, somethign, implicity, transforamtion, architecturs, depeneding, StackOveflow, unecessary, differnet, independnet, adressing, adressing, transfomations, instructoins, instructon, broaded, regsiter, unecessary, multi-predictate, subsequet, analagous, limtiation, uncontional, unecessary, ephemism, posssible, endianess. To be fair, I only noticed about half of these while reading, but then pasted it in to TextEdit for a more complete list. Other than 'broaded', none affected clarity.

Loads are only a single uop, and this uop is responsible for doing the address calculation and starting the fetch (the results are usually passed directly to the operations that consume them, the "receive the fetch" part is their job). So logically there might be multiple things going on, but it is reflected in a single uop dispatched to either p2 or p3, not breaking any "one port, one op" rule at that level.

Stores are different: they are two uops, STA (store address) and STD (store data), and they need different ports, p237 for STA and p4 for STD, so they play within the rules as well. Basically the two uops are needed for the two separate inputs they use, the store address and the store data, and they can execute independently. Loads OTOH have 1 input (the address) and one output.

Note that loads _can_ actually end up dispatching multiple uops, in the case of a cache miss!

If the load uop misses in L1 when it executes, it gets replayed 7 cycles later, with the idea that the data will be arriving from L2 if there is an L2 hit. If that also misses, the load will be replayed a final time when the data arrives from L3 or DRAM (there is no additional replay for an L3 miss, because the latency is variable, so the load just goes to sleep waiting for the result, whether from L3, L4, DRAM or wherever). The replayed uops still have to play by the one port, one op rule.

Thank you for this - bookmarked for the future.

My uarch knowledge is a bit out of date; would you happen to know if the following can still cause performance problems in modern Intel/AMD?

- call/return address mismatch

- mixing int/float vector instructions with same register

- false dependencies between partial registers (ie. AH and AL)

Thanks again!

> - call/return address mismatch

Definitely, it's as important as ever.

If you have matched call/ret pairs but cause an address mismatch by, say, modifying the return address on the stack, the subsequent `ret` will mispredict.

OTOH, if you cause a mismatch by having unbalanced call/ret pairs, e.g., a `call` to a location that does a `jmp` back instead of a ret, the situation is much worse: the return prediction stack is mismatched, so you can mispredict on the next N returns, where N is 16 or 32 or whatever the size of the stack is (it doesn't mean that will happen: if the following code does many more calls before the corresponding rets, the bad predictions may fall off the stack and never be used).

The good news is that this one is really easy to avoid. Why would you end up with mismatched call/ret in the first place? It doesn't happen in normal code and even in hand-rolled assembly I can't see many reasons you'd end up like that.

There is one pattern that is used sometimes with mismatched calls: `call; pop eax` to get the current IP in eax (for example) - useful in 32-bit position independent code (in 64-bit code you can just use rip-relative addressing). This one is special cased though so it does not cause a misprediction.

Way more details available here (this page is also linked in my post):


> - mixing int/float vector instructions with same register

These are the so-called domain bypass penalties. They are much reduced and I haven't seen them in practice on modern chips, but let's check Agner...

Yeah, it's best just to read Agner. See 10.9 Data Bypass Delays, for example. Here's an excerpt:

> However, there are fewer such delays on Haswell and Broadwell than on previous processors. I found no such delays in the following cases:

> - when a floating point Boolean instruction, such as ORPS is used with integer data

> - when a wrong type of move instruction is used, e.g. MOVPS or MOVDQA

> - when a wrong type of shuffle instruction is used, e.g. SHUFPS or PSHUFD

The situation is similar for Skylake. Recommend checking out microarchitecture.pdf for the full scoop.

> - false dependencies between partial registers (i.e. AH and AL)

This one has the best answer of all, very complete:


I refer to it whenever I forget the details. The part I always remember: anything involving al always extracts/merges from/into the existing physical register: there are no uop or latency penalties, but there is always a dependency on the full register (rax in this case). So a write to al can't proceed until the rest of rax is available, and same for a read.

ah is very different: it renames to a separate register, so it acts decoupled from rax. The penalties are reversed from above: there is no false dependency, but there may be merging uops and other costs.

The details on old chips are different.

IMO this one has little impact, but there are some cases where it matters, like really trying to optimize scalar byte extraction.
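As an illustrative snippet (not from the article): the byte-extraction case is code like the following, where on x86 a compiler can read the second byte straight out of `ah`, which renames separately from the rest of `rax`.

```c
/* Extract byte 1 (bits 8..15) of a 32-bit value. Compilers targeting x86
   can implement this as a read of the AH register, which renames
   separately, rather than a shift plus mask on the full register. */
unsigned byte1(unsigned x) {
    return (x >> 8) & 0xff;
}
```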

The Intel Optimization Manual also lists bypass delays for Skylake (table 2-3), Broadwell (2-8) and Nehalem (2-29). It's only ever a single cycle per uop for Skylake.

Good catch!

It looks like the main remaining delay cases are integer mul instructions which get a bypass delay no matter what their source, and a few cases like FMAs fed by non-shuffle integer ops (not that common), or non-shuffle integer ops fed by FMA or integer mul (also not that common).

The key part is that shuffles have zero delays in any configuration, as producer or consumer, except when a shuffle feeds an integer mul. That's good because shuffles are very common as inputs to both integer and FP ops.

Wow - thank you once more for the answer and the links!

I tried a simple JiT once that did code threading by jmp/ret - seems like that would not be any better today :D

tl;dr: prefer trampolines over weird stack tricks.

Yes, jmp/ret will be slow.

call/ret has no particular advantage over e.g. storing the return address in a register (say r15) and then doing `jmp r15` - other than the prediction, so if you can't use the prediction, jmp/jmp indirect should work fine.

This is one of the best technical articles I've ever read, thanks.

It wasn't clear to me what "organize the likely path instead as untaken" means.

Yes, it is confusing: it should probably say "organize the code so that the likely branch outcome is the fall-through one".

Does that make sense?

Basically the "jump" should be the unusual case. You can always organize the code like that, but sometimes the compiler needs a little help.

Consider the following function, which just sets negative values to zero.

    void zero_threshold(int *data, size_t len) {
        for (size_t i = 0; i < len; i++) {
            int x = data[i];
            if (x < 0) data[i] = 0;
        }
    }
On gcc this compiles to something like:

        mov     edx, DWORD PTR [rdi]
        test    edx, edx
        jns     .L3
        mov     DWORD PTR [rdi], 0
        add     rdi, 4
        cmp     rdi, rax
        jne     .L4
The `jns` is the jump that reflects the if (x < 0) statement, and it jumps in the case where the number is >= 0 (i.e., non-negative). gcc organizes it like this because it involves only a single jump. However, if you expect the non-negative case to be the common one, this will limit the throughput of the code. It would be better to compile it like this:

        mov     edx, DWORD PTR [rdi]
        test    edx, edx
        js      .L10
        add     rdi, 4
        cmp     rdi, rax
        jne     .L4
        mov     DWORD PTR [rdi], 0
        jmp     .L3
This now involves two jumps when the number is negative: it has to jump out to the extra bit of code outside the main body of the function at .L10, and then jump back, but the common case is now "fall through" so this could execute at close to 1 per cycle, twice as fast as the other version.

You can get gcc to produce the second code by changing the condition to __builtin_expect(x < 0, 0):
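In source form, that is the same function as above with only the condition wrapped:

```c
#include <stddef.h>

void zero_threshold(int *data, size_t len) {
    for (size_t i = 0; i < len; i++) {
        int x = data[i];
        /* tell gcc the negative case is the unlikely, out-of-line one */
        if (__builtin_expect(x < 0, 0))
            data[i] = 0;
    }
}
```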


Of course, this code would be much better off vectorized, but it's just an example (and the same principle applies to jumps in vectorized code).

I’ve worked in 1 company where we had tools that would re-organize the code to make the common code path be the fall through as much as possible

This was done before linking, based on analysis of the common paths from a running program

What this also did - which was the important bit at the time - was move all the error code to pages that wouldn’t be loaded for most users

So it moved error handling out of common code pages and changed branch instructions to mostly fall through

This was many years ago - do similar tools exist in the OSS community?

All good compilers do this today. They use heuristics to guess the likely outcome of branches, but you can provide them specific information with profile guided optimization, or with manual branch hints (the __builtin_expect I used in the example above is one such hint).

Similarly, you can annotate certain functions or paths "hot" or "cold", and compilers even understand that functions like abort() are cold (they can be called at most once, after all) and hence paths leading up to them are also cold.
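For example (a sketch; `fail` and `parse_positive` are names made up for this illustration), with GCC/Clang you can mark a path cold explicitly via attributes:

```c
#include <stdio.h>
#include <stdlib.h>

/* cold + noreturn: the compiler treats any path reaching fail() as
   unlikely and lays that path out away from the hot fall-through code. */
__attribute__((cold, noreturn)) void fail(const char *msg) {
    fprintf(stderr, "%s\n", msg);
    exit(1);
}

int parse_positive(int value) {
    if (value < 0)
        fail("negative value");  /* cold path, moved out of line */
    return value * 2;            /* hot fall-through path        */
}
```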

Similarly, JIT compilers use the runtime observed branching behavior to organize branches so that fall-through is the common case.

Take a look at the FDO and AutoFDO sections from GCC's wiki:


You'll also find it's referred to as Profile Guided Optimization, there's docs here for clang: https://clang.llvm.org/docs/UsersManual.html#profile-guided-...

It makes complete sense now, appreciate the explanation.

> As an example, a load instruction takes a cache miss which means it cannot retire until the miss is complete. On an Haswell machine with a ROB size of 192, at most 191 additional instructions can execute while waiting for the load: at that point the ROB window is exhausted and the core stalls. This puts an upper bound on the maximum IPC of the region of 192 / 300 = 0.64.

Where does 300 come from? Is this the number of cycles for the cache miss?

Thanks, good catch - it is indeed the number of cycles. I had reorganized that sentence planning to move the number of cycles and instead it just got dropped.

Fixed and credited. Now it says:

> Let's say the load takes 300 cycles to finish, which is a typical latency. Then, on an Haswell machine with a ROB size of 192...
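The arithmetic generalizes to any ROB size and miss latency (a trivial helper, named here just for illustration):

```c
/* Upper bound on IPC when a single long-latency load blocks retirement:
   at most rob_size instructions can retire per miss_latency cycles. */
double rob_ipc_bound(double rob_size, double miss_latency) {
    return rob_size / miss_latency;
}
```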


Yes, I think 300 is an estimate of the number of cycles for a load from RAM: 192 ops / 300 cycles == .64 ops / cycle.

I'm not sure whether the calculation is true, though. Why can't the post-load µops be executed and retired, and the ROB refilled, and then more µops executed? I'd think the calculation would only apply if everything was somehow speculative and thus nothing could be retired, but I don't see this being a common occurrence.

Because that's not how the ROB or any of the other "in order" structures like the load buffers, etc, work. Fundamentally, retirement happens in order.

uops only leave the ROB when they retire, and they retire in the same order as they appear in the dynamic instruction stream. The ROB is needed to preserve the ability to roll back if any instruction faults, to handle interrupts, to generate precise exceptions, etc.: think about what would happen if you had retired a bunch of random instructions on either side of an un-retired instruction that ended up faulting.

Everything is speculative. The CPU always operates as if any unretired instruction is speculative. In reality, almost everything is: probably at least a quarter of all instructions can fault in some way, and as soon as there is any such instruction in the unretired stream, all younger instructions are "speculative". I'm putting it in quotes because "speculative" as we're discussing it is just something we are making up: CPUs don't have an "I'm speculating" flag: everything is treated as speculative all the time (and if you consider interrupts, maybe even the cases where you have a stream of cannot-fault instructions would also be "speculative").

There is a structure that works as you describe: the scheduler/reservation station. That's the place where everything happens out of order and slow instructions can stall and be passed by younger ones who leave the scheduler and make room for others - but this sits within the in-order allocate + retire machinery.

Thanks. So because of in-order retirement, the ROB is indeed limited to holding only consecutive µops. And since nothing can be executed until it is put in the ROB, this means that there is a hard limit on the distance in µops between the last unretired µop and what can possibly be executed. Does this also mean that a load is only ever retired after it has been successfully fulfilled? I haven't understood the role of retirement for loads and stores.

> Thanks. So because of in-order retirement, the ROB is indeed limited to holding only consecutive µops. And since nothing can be executed until it is put in the ROB, this means that there is a hard limit on the distance in µops between the last unretired µop and what can possibly be executed.

Yes. The ROB holds even things that will never be executed, such as nops and zeroing idioms.

Note that the ROB is not really the bad guy here: if you invent some way to make the ROB retire out-of-order to break that restriction, you'll just immediately run into the PRF limit, since almost all pending instructions need a destination register. Since these values are all live (from the CPU's point of view) until all older instructions have retired, you'll get the same kind of limit from the PRF.

The PRF is a big structure using lots of area and power, so probably the ROB size is kind of determined from the PRF size: how big are we going to make the PRF? X? OK, let's make the ROB size X * 1.5, since then the ROB will rarely limit performance. Note: we know the ROB increased dramatically in size from 224 to 354 in SNC, but we don't yet know how big the reg files are!

There are good papers out there about super-high ILP designs if you are interested about how this kind of stuff can be solved, but there are many problems.

> Does this also mean that a load is only ever retired after it has been successfully fulfilled? I haven't understood the role of retirement for loads and stores.

Yes. Loads and stores both go in the ROB, and also in the load/store buffers, which are in-order just like the ROB.

A load can't retire until it completes: only then are the physical resources, like the register it writes to and the load buffer entry, able to be freed.

Note that this is a huge difference between loads and prefetches: prefetches can retire immediately (well as soon as the load address has been calculated). They just kick off the load in the memory subsystem and then their work is done.
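In C you can see the contrast with `__builtin_prefetch` (illustrative loop; the distance of 16 elements is an arbitrary choice for the example):

```c
#include <stddef.h>

/* The prefetch produces no result, so it can retire as soon as its address
   is computed; the loads of a[i] must wait for their data before retiring. */
long sum_with_prefetch(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16]);  /* hint only, never faults */
        s += a[i];
    }
    return s;
}
```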

Stores are different: they can retire as soon as the store address and store data are known (i.e., as soon as their inputs are available). At this point they become so-called senior stores: stores which have retired, and hence are non-speculative, but haven't become visible to the rest of the system yet. They must eventually become visible, which means that on an interrupt this part of the store buffer is preserved or drained, never thrown away like the rest of the OoO buffers. When the time is right (write?) they commit to L1.

Huh -- when I read this:

> This pattern is hard to achieve in practice in a high level language, although you might have luck emulating it with gcc’s labels as values functionality.

it sounded familiar, and immediately made me wonder if it's the same as the 'direct threading' technique used in the BEAM VM under the hood that uses that gcc extension -- it sounds like it is, if I'm reading this right:


(A language based more or less on embracing context switches is of course funny company for an article about performance speed limits)

This is using the same GCC functionality, but for a different purpose.

The emulator is using it to replace a loop which selects one of several bodies on each iteration, with a direct jump from the body of the previous iteration to the correct body for the next iteration.

This article is suggesting using it to replace a call/return pair to a leaf function with what essentially amounts to a single-entry call stack (or a manually implemented link register).
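For the curious, the emulator-style use looks roughly like this (a minimal sketch of direct threading with GCC's labels-as-values extension; the two opcodes and all names are invented for the example):

```c
/* Tiny two-opcode "VM": opcode 0 adds the next word to the accumulator,
   opcode 1 halts. Each handler dispatches directly to the handler for the
   next opcode via GCC's &&label / goto *ptr extension, with no central
   dispatch loop. */
int run(const int *pc) {
    static void *handlers[] = { &&op_add, &&op_halt };
    int acc = 0;
    goto *handlers[*pc++];
op_add:
    acc += *pc++;
    goto *handlers[*pc++];
op_halt:
    return acc;
}
```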

Do 'ret' instructions count against the branch order buffer, since they're essentially indirect branches?

Good question, I am not sure. Henry mentions they do not count against the 14-15 "call limit", but they may be handled the same as a branch in the BOB as you suggest.

There is a small typo under the "Pipeline Width" Heading. Superscalar link is "Supercalar"

Thanks, fixed and credited.

Really nice article.

Near the end of the Carried Dependency Chains section, you have another typo: "transforamtion".

Thanks, fixed and credited.

I'm glad I'm not paying by the typo, I'd be broke by now (one person alone identified ~15 typos).

Under "Out of Order Limits", ...which all effect the... should be ...which all affect the...

Under "Calls in Flight", Only 14-15 to calls... should be Only 14-15 calls...

In footnote 17, ...compilers will often cannot.. should probably be just ...compilers often cannot....

In footnote 19, Node that LSD... should be Note that LSD...

Thanks! Fixed and credited.

Do you have books or lectures on the topic, or do you go from scratch and pure brain juice?

I learned from Agner's guides, the Intel and AMD optimization manuals and experimentation.

It didn't exist in its current form when I started, but the x86 tag wiki is a great source for x86 specific stuff:


I focus on x86 but I would say 75% of the knowledge is transferable to other architectures. 25% is weird x86 stuff that you aren't likely to find anywhere else.

Thanks a ton, both article and answer :)

Very good, you clearly got a very accurate grasp of the low level details. Thanks for writing such an informative guide!

This is very practical guide, thanks for sharing.

Nice work - thank you!

Thanks for the article!

I started the AsmGrid project that provides an X86 architecture overview and instruction timings for people like me that often work with assembly. It's basically a simple web application that displays data provided by AsmDB and timings obtained through the cult tool (https://github.com/asmjit/cult).

AsmGrid is available here, click on X86 Perf tab to see instruction timings:

  - https://asmjit.com/asmgrid
A bit older version that has more microarchitectures (but less instructions) is here:

  - https://kobalicek.com/asmgrid
I'm also looking for some help with instruction timings. I would be thankful to anyone who has an X86 CPU with a different microarchitecture than those available at https://asmjit.com/asmgrid/ and is willing to compile & run the cult tool and provide me its output.

Instructions for cloning and compiling cult are here:

  - https://github.com/asmjit/cult

Thanks for the link, I didn't know about CULT (but I use asmjit, it's a great project).

I do feel like there is a lot of redundancy in this area. You are probably aware of Agner's instruction tables, and he makes the source available for his latency testing tools. You are probably aware of uops.info which has latency/throughput numbers for instructions online and in XML format. Now there is CULT (or actually CULT came before uops.info but I wasn't aware of it), and ASMDB.

There is the AIDA64 instruction latency thing giving the tables at:


There is also uarch-bench which I wrote, which has slightly different goals than just getting the latency/throughput of instructions, but still does a lot of that and has a lot of overlap in other areas.

I have seen more as well, but I can't remember all of them.

Unless your love is really writing these kinds of tools (mine is not), it would be nice if there was one winner here: one canonical source for instruction timings, available in whatever format, one test application for new platforms, etc. Then efforts like "please run this on your new hardware" won't be split among various people: right now sometimes Agner gets new hardware first, sometimes the uops.info guys, sometimes the Instlat guy, etc.

Let a thousand flowers bloom and all that, but personally I think this space is due for some collaboration...

Yeah there is definitely some redundancy here, but there were reasons I wrote these tools. There are multiple sources that provide instruction timings. Agner's instruction tables are amazing, but they cover only a single microarchitecture at a time, so it's really difficult to compare multiple microarchitectures. Intel's online intrinsics guide only shows instruction latencies for a few of the latest microarchitectures, etc... Basically before AsmGrid it was time consuming for me to verify whether the instructions I have selected would perform well on a majority of machines that I target.

I wrote AsmGrid initially for myself to explore AsmDB data and to spot possible errors. Then I wrote CULT just for fun, as I wanted to see whether it's possible to write such a tool based on AsmJit. CULT basically iterates over the AsmJit instruction database, checks whether instructions (with various signatures) can be executed, and then creates JIT code for executing them either in parallel or sequentially. It works really well and the testing can be easily done in user space.

In other words, I wanted a table of instructions and their timings where I can compare several microarchitectures at the same time during development and AsmGrid provides that. The only downside is that I cannot test all microarchitectures myself and when I make changes to support more instructions a re-run is necessary.

Right, so CULT is maybe not the redundant one. Maybe the uops.info guys should use CULT, I dunno. Also, if it's for fun, anything goes, of course. I just feel sometimes like there is a lot of division of effort: it would be nice to have a standard single data source for instruction performance attributes.

I am surprised I never ran across this before, because when I found uops.info, I was like "finally, this stuff is online and linkable" - but maybe that already happened earlier with ASMGrid?

Anyways, I can run this on SKL, SKX, CNL for you.

Thanks! Let's continue on that issue page.

I generally dislike the "bottleneck" metaphor when applied to performance, because the metaphor only makes sense when the individual "parts" run in parallel. In most scenarios, each bit of code adds its own latency that is independent of the latency of other components. So every component is a "bottleneck", so-to-speak, it's just that some are bigger bottlenecks than others.

However this case is an unusual exception, because all of these hardware components described in this article do run in parallel. So the slowest element does truly gate the throughput of all stages of the pipeline. You could add more load to the non-bottlenecked resources and the overall latency of the pipeline wouldn't change at all, roughly speaking.

Yes, exactly. I didn't describe this but it's something I was thinking about: if you don't hit the bottleneck, you don't pay any price: if the limit on uops allocated is 4 but you are only doing 3.5, the limit might as well be infinite.

It's for this reason I struggled to include, for example, branch prediction failures. Let's say they take a 15 cycle bubble in the front-end. You can create a "speed limit" there and say "you can't do more than 1/15 mispredicts per cycle", which is indeed true. However, it then doesn't work like the other limits: if you measure 1 mispredict every 30 cycles, you aren't at the limit, but it still has a huge impact on performance, because the bubbles are likely slowing down everything else you do in those 30 cycles too. Perhaps 15 cycles of the 30 are solely dedicated to handling the mispredicts.

Therefore to include mispredicts here, it won't be a simple "speed limit" thing: you will need to understand the exact effect of the mispredict on the front end, and how it interacts with the other speed limits. Sometimes a mispredict costs 15 cycles (the figure you hear), but sometimes they don't slow down your code at all, and you can certainly design tests where they cost 1000s of cycles each.

You mention there’s always only one scalar multiplication unit. Can you elaborate why that is the case? What’s keeping AMD and Intel from adding more?

Multipliers that can do a 64x64->128 bit multiplication every cycle at 5 GHz are large in area and power hungry. They are the most expensive traditional hardwired integer unit (divide is even more expensive, but it is usually implemented as many operations that repeatedly use a divider unit and other units to do a few digits at a time: on essentially all current x86 chips large divides take 40-80 cycles!).
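That full-width product can be expressed in C with the GCC/Clang `__int128` extension, which compilers typically lower to a single scalar multiply instruction on x86-64:

```c
#include <stdint.h>

/* Full 64x64 -> 128-bit product, split into high and low halves; on
   x86-64 this typically compiles to one MUL (or MULX) instruction. */
void mul64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
    unsigned __int128 p = (unsigned __int128)a * b;
    *lo = (uint64_t)p;
    *hi = (uint64_t)(p >> 64);
}
```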

The limitation isn't that bad in practice: it's hard to get more than 1 multiplication per cycle in non-trivial code since you have only 3 more ops to do all the other work - where are the multiplication inputs coming from, for example?

So more than one multiplier probably has a small payoff. Apple chips, for example, are wider, and IIRC they have two multipliers, since they can take more advantage of them.

It's worth noting that 64-bit multipliers are expensive enough that even in AVX-512 Intel still offers zero 64-bit input multiplies in their SIMD unit (they offer a 53 bit one that reuses the FMA hardware though).

