The latency shows after how many cycles the result of an instruction can be consumed by another, while the throughput shows how many such instructions can be pipelined per cycle, i.e. in parallel.
I believe the throughput shown in those tables is the total throughput for the whole CPU core, so it isn't immediately obvious which instructions have high throughput due to pipelining within an execution unit and which have high throughput due just to the core having several execution units capable of handling that instruction.
That's true, but another part of the tables show how many "ports" the operation can be executed on, which is enough information to concluded an operation is pipelined.
For example, for many years Intel chips had a multiplier unit on a single port, with a latency of 3 cycles, but an inverse throughput of 1 cycle, so effectively pipelined across 3 stages.
In any case, I think uops.info [1] has replaced Agner for up-to-date and detailed information on instruction execution.
Yes. In the past new HW has been made available to the uops.info authors in order to run their benchmark suite and publish new numbers: I'm not sure if that just hasn't happened for the new stuff, or if they are not interested in updating it.
FWIW, there are two ideas of parallelism being conflated here. One is the parallel execution of the different sequential steps of an instruction (e.g. fetch, decode, operate, retire). That's "pipelining", and it's a different idea than decoding multiple instructions in a cycle and sending them to one of many execution units (which is usually just called "dispatch", though "out of order execution" tends to connote the same idea in practice).
The Fog tables try hard show the former, not the latter. You measure dispatch parallelism with benchmarks, not microscopes.
Also IIRC there are still some non-pipelined units in Intel chips, like the division engine, which show latency numbers ~= to their execution time.
I don't think anyone is talking about "fetch, decode, operate, retire" pipelining (though that is certainly called pipelinig): only pipelining within the execution of a instruction that takes multiple cycles just to execute (i.e., latency from input-ready to output-ready).
Pipelining in stages like fetch and decode are mostly hidden in these small benchmarks, but are visible when there are branch misprediction, other types of flushes, I$ misses and so on.
> I don't think anyone is talking about "fetch, decode, operate, retire" pipelining (though that is certainly called pipelinig): only pipelining within the execution of a instruction that takes multiple cycles just to execute (i.e., latency from input-ready to output-ready).
I'm curious what you think the distinction is? Those statements are equivalent. The circuit implementing "an instruction" can't work in a single cycle, so you break it up and overlap sequentially issued instructions. Exactly what they do will be different for different hardware, sure, clearly we've moved beyond the classic four stage Patterson pipeline. But that doesn't make it a different kind of pipelining!
We are interested in the software visible performance effects of pipelining. For small benchmarks that don't miss in the predictors or icache, this mostly means execution pipelining. That's the type of pipelining the article is discussing and the type of pipelining considered in instruction performance breakdowns considered by Agner, uops.info, simulated by LLVM-MCA, etc.
I.e., a lot of what you need to model for tight loops only depends on the execution latencies (as little as 1 cycle), and not on the full pipeline end-to-end latency (almost always more than 10 cycles on big OoO, maybe more than 20).
Adding to this: the distinction is that an entire "instruction pipeline" can be [and often is] decomposed into many different pipelined circuits. This article is specifically describing the fact that some execution units are pipelined.
Those are different notions of pipelining with different motivations: one is motivated by "instruction-level parallelism," and the other is motivated by "achieving higher clock rates." If 64-bit multiplication were not pipelined, the minimum achievable clock period would be constrained by "how long it takes for bits to propagate through your multiplier."
> one is motivated by "instruction-level parallelism," and the other is motivated by "achieving higher clock rates."
Which are exactly the same thing? For exactly the same reasons?
Sure, you can focus your investigation on one or the other but that doesn't change what they are or somehow change the motivations for why it is being done.
And you can have a shorter clock period than your non-pipelined multiplier just fine. Just that other uses of that multiplier would stall in the meantime.
An OoO design is qualitatively different from an in-order one because of renaming and dynamic scheduling, but the pipelining is essentially the same and for the same reasons.
> and it's a different idea than decoding multiple instructions in a cycle and sending them to one of many execution units (which is usually just called "dispatch", though "out of order execution
Being able to execute multiple instructions is more properly superscalar execution, right? In-order designs are also capable of doing it and the separate execution unit do not even need to run in lockstep (consider the original P5 U and V pipes).
Right; it's easy to forget that superscalar CPU cores don't actually have to be in-order, but most of them are out-of-order because that's usually necessary to make good use of a wide superscalar core.
(What's the best-performing in-order general purpose CPU core? POWER6 was notably in-order and ran at quite high clock speeds for the time. Intel's first-gen Atom cores were in-order and around the same time as POWER6 but at half the clock speed. SPARC T3 was ran at an even lower clock speed.)
The IBM Z10 came out a year later. It was co-designed with POWER6 as part of IBM's eClipz project, and shared a number of features / design choices, including in-order execution.
In-order parallel designs are "VLIW". The jargon indeed gets thick. :)
But as to OO: the whole idea of issuing sequential instructions in parallel means that the hardware needs to track dependencies between them so they can't race ahead of their inputs. And if you're going to do that anyway, allowing them to retire out of order is a big performance/transistor-count win as it allows the pipeline lengths to be different.
VLIW is again a different thing. It uses a single instruction that encodes multiple independent operations to simplify decoding and tracking, usually with exposed pipelines.
But you can have, for example, a classic in-order RISC design that allows for parallel execution. OoO renaming is not necessary for dependency tracking (in fact even scalar in order CPUs need dependency tracking to solve RAW and other hazards), it is "only" needed for executing around stalled instructions (while an in order design will stall the whole pipeline).
Again P5 (i.e the original Pentium) was a very traditional in order design, yet could execute up to two instructions per cycle.
No it isn't. I'm being very deliberate here with refusing pedantry. In practice, "multiple dispatch" means "OO" in the same way that "VLIW" means "parallel in order dispatch". Yes, you can imagine hypothetical CPUs that mix the distinction, but they'd be so weird that they'd never be built. Discussing the jargon without context only confuses things.
> you can have, for example, a classic in-order RISC design that allows for parallel execution.
Only by inventing VLIW, though, otherwise there's no way to tell the CPU how to order what it does. Which is my point; the ideas are joined at the hip. Note that the Pentium had two defined pipes with specific rules about how the pairing was encoded in the instruction stream. It was, in practice, a VLIW architecture (just one with a variable length encoding and where most of the available instruction bundles only filled one slot)! Pedantry hurts in this world, it doesn't help.
> Note that the Pentium had two defined pipes with specific rules about how the pairing was encoded in the instruction stream. It was, in practice, a VLIW architecture (just one with a variable length encoding and where most of the available instruction bundles only filled one slot)!
This is ridiculous. There are no nop-filled slots in the instruction stream, and you can't even be sure which instructions will issue together unless you trace backwards far enough to find a sequence of instructions that can only be executed on port 0 and thus provide a known synchronization point. The P5 only has one small thing in common with VLIW, and there's already a well-accepted name for that feature, and it isn't VLIW.
Meh. P5's decode algorithm looks very much like VLIW to me, and emphatically not like the 21264/P6 style of dispatch that came to dominate later. I find that notable, and in particular I find senseless adherence to jargon definitions[1] hurts and doesn't help in this sphere. Arguing about how to label technology instead of explaining what it does is a bad smell.
[1] That never really worked anyway. ia64, Transmeta's devices and Xtensa HiFi are all "VLIW" by your definition yet work nothing like each other.
The point was exactly that "VLIW" as a term has basically no meaning. What it "means" in practice is parallel in-order dispatch (and nothing about the instruction format), which is what I said upthread.
VLIW means Very Large Instruction Word. It is a property of the instruction set, not of the processor that implements it.
You could have a VLIW ISA that is implemented by a processor that "unrolls" each instruction word and mostly executes the constituent instructions serially.
> Also IIRC there are still some non-pipelined units in Intel chips, like the division engine, which show latency numbers ~= to their execution time
I don't think that's accurate. That latency exists because the execution unit is pipelined. If it were not pipelined, there would be no latency. The latency corresponds to the fact that "doing division" is distributed across multiple clock cycles.
Division is complicated by the fact that it is a complex micro-coded operation with many component micro-operations. Many or all of those micro-operations may in fact be pipelined (e.g., 3/1 lat/itput) , but the overall effect of executing a large number of them looks not very pipelined at all (e.g., 20 of them on a single EU would have 22/20 lat/itput, basically not pipelined when examined at that level).
Sorry, correcting myself here: it's cut across multiple cycles but not pipelined. Maybe I confused this with multiplication?
If it were pipelined, you'd expect to be able to schedule DIV every cycle, but I don't think that's the case. Plus, 99% of the time the pipeline would just be doing nothing because normal programs aren't doing 18 DIV instructions in a row :^)
These days CPUs are so complex and have so many interdependencies that the best way to simulate them is simply to run them!
In most real code the high throughput of these sorts of operations means that something else is the limiting factor. And if multiplier throughput is limiting performance then you should be using SIMD or a GPU.
> the best way to simulate them is simply to run them!
And it's quite sad because when you are faced with choosing between two ways to express something in the code, you can't predict how fast one or another option will run. You need to actually run both, preferrably in an environment close to the prod, and under similar load, to get accurate idea which one is more performant.
And the worst thing is, you most likely can't extract any useful general principle out of it, because any small perturbation in the problem will result in a code that is very similar yet has completely different latency/throughput characteristics.
The only saving grace is that modern computers are really incredibly fast, so layers upon layers of suboptimal code result in applications that mostly perform okay, with maybe some places where they perform egregiously slow.
My point is, it's impossible to test everything in a reasonable timeframe. It would be much, much more convenient to know (call it "having an accurate theory") beforehand which approach will be faster.
Imagine having to design electronics the way we design performant programs. Will this opamp survive the load? Who knows, let's build and try these five alternatives of the circuit and see which one of them will not blow. Oh, this one survived but it distorts the input signal horribly ("yeah, this one is fast, but it has multithreading correctness issues and reintroduction of locks makes it again about as slow"), what a shame. Back to the drawing board.
The amateurs usually run benchmarks (because they can't reason about it as they lack the relevant knowledge) and believe they got a useful result on some aspect when in the fact the benchmark usually depends on other arbitrary random factors (e.g. maybe they think they are measuring FMA throughput, but are in fact measuring whether the compiler autovectorizes or whether it fuses multiply and adds automatically).
A pro would generally only run benchmarks if it's the only way to find out (or if it's easy), but isn't going to trust it unless there's a good explanation for the effects, or unless they actually just want to compare two very specific configurations rather than coming up with a general finding.
Then again, 'reasoning about it' can easily go awry if your knowledge doesn't get updated alongside the CPU architectures. I've seen people confidently say all sorts of stuff about optimization that hasn't been relevant since the early 2000s. Or in the other direction, some people treat modern compilers/CPUs like they can perform magic, so that it makes no difference what you shovel into them (e.g., "this OOP language has a compiler that always knows when to store values on the stack"). Benchmarks can help dispel some of the most egregious myths, even if they are easy to misuse.
My favorite illustrations for the concepts discussed here (in an accessible form, not the processor optimization manuals) has long been [0].
For me, this really makes working with a modern microprocessor a science, as anyone who has written benchmarks knows -- it's difficult to reason about the complex behaviour and performance cliffs without testing.
Another excellent example of the weirdness has to be JVM anatomy quarks [1]
It was very good, I learned a lot. Actually more high-quality info than I can absorb in one sitting, but I bookmarked it to come back to.
I'd love to see an updated version. The article talks about how out-of-order execution has a high power overhead and produces relatively small performance gains, but the M-series chips from Apple have deep OOO and low power consumption; I'm curious to learn what they did differently.
Pipelining is the direct reason behind 1996 Quake running 30 fps on Intel Pentium 133 but requiring 233MHz from Cyrix and 166MHz from AMD K6 to reach same result. 20 fps needed 75MHz Intel Pentium, 133MHz Cyrix and 100MHz AMD K5.
Specifically, Carmack exploited the fact that on the Pentium, integer instructions could run in parallel with floating-point division[0]. This goes to show that an optimization that usually gets you ~10% might get you 200% depending on what software you're running. And that no implementation detail is safe from an ambitious software engineer.
No, thats not it as explained in my reply to that TerrifiedMouse comment in 2023 :)
On x86 Integer instructions _never_ waited for floating point opcode to retire. You can check it yourself by reading FPU status register busy flag - if FPU was blocking you could never catch it in BUSY state and FWAIT would be useless :-)
Abrash exploited the fact Pentium was the first time x87 FPU instructions could run pipelined overlapping one another. All other x86 vendors FPUs waited for previous FPU instruction to retire.
While floating point instructions in general cannot be paired, many can be pipelined, i.e. one
instruction can begin before the previous instruction has finished. Example:
fadd st1,st0 ; Clock cycle 1-3
fadd st2,st0 ; Clock cycle 2-4
fadd st3,st0 ; Clock cycle 3-5
fadd st4,st0 ; Clock cycle 4-6
Obviously, two instructions cannot overlap if the second instruction needs the result of the
first one. Since almost all floating point instructions involve the top of stack register, ST0,
there are seemingly not very many possibilities for making an instruction independent of the
result of previous instructions. The solution to this problem is register renaming. The FXCH
instruction does not in reality swap the contents of two registers; it only swaps their names.
Paradoxically often mentioned Texture Divide every 16 pixels overlaps just fine on all CPUs and doesnt explain performance discrepancies. Intel FDIV latency is 19 cycles compared to Cyrix 24, mere 20% yet you need almost double the MHz on Cyrix to match FPS numbers. Answer lies in rest of Quake heavily pipelined FPU code.
Pipelining did also play a role. The Quake inner rasterization loop has a decent amount of non-division math in it as well that leverages the Pentium's ability to execute FP add/multiplies at 1/cycle. The K6 and 6x86 FPUs were considerably slower -- 2 and 4 cyclesnon-pipelined (http://www.azillionmonkeys.com/qed/cpuwar.html).
Additionally, the FXCH instructions required to optimally schedule FPU instructions on the Pentium hurt 486/K6/6x86 performance even more since they cost additional cycles. Hard for the 6x86 to keep up when it takes 7 cycles to execute an FADD+FXCH pair vs. 1 for the Pentium.
Is this still the case if I have different ALU operations. Say I have a single ALU on a single x86 core. Would the ALU be able to interleave say ADD and MULs? or would I incur the latency measure for each operation switch?
I know that some ALU's have multiple ADD complexes, and I assume that would influence the answer, hence why I specified x86.
Yes, it applies to different operations. E.g. you could interleave two or three different operations with 3 cycle latency and 1 cycle inv throughput on the same port and get 1 cycle inv throughput in aggregate for all of them. There is no restriction that they must be same operation.
In some cases mixing operations with _different_ latencies on the same execution port will leave you with less throughput than you expect due to "writeback conflicts", i.e., two instructions finishing on the same cycle (e.g., a 2 cycle operation starting on cycle 0 and a 1 cycle operation on cycle 1, will both finish on cycle 2 and in some CPUs this will delay the results of one of the operations by 1 cycle due to a conflict).
A modern ALU has multiple pipelined data paths for its operations. So maybe three adders with a one-cycle latency, two multipliers with a three-cycle latency, and one divider with a 16-cycle latency. Sustained throughput depends on the operation. Maybe one per cycle for add and multiply, but only one every eight cycle for divide.
You can interleave most operations how you like, without any extra latency, each op starting as soon as all the results are ready, regardless of where they were computed.
There are some exceptions where "domain crossing", or using a very different operation costs an extra clock cycle. Notably, in vector registers using FP or integer operations on the result of the different type of op, as modern CPUs don't actually hold FP values in their IEEE754 transfer format inside registers, but instead registers have hidden extra bits and store all values in normal form, allowing fast operations on denormals. The downside of this is that if you alternate between FP and INT operations, the CPU has to insert extra conversion ops between them. Typical cost is 1-3 cycles per domain crossing.
Unrelated but for benchmarks like these you can just ask the kernel for how many clock cycles you’ve used using proc_pid_rusage rather than hardcoding it.
EDIT: Adrian_B’s sibling comment has even earlier historical examples that illustrate how architecture ideas are as old as the hills.
There are good historical hardware architecture textbooks: Peter Kogge’s “The Architecture of Pipelined Computers” (‘81) mentions that the UNIVAC 1 pipelined IO (from which dedicated IO units sprung out), the IBM 7094 similarly interleaved multiple-state memory access cycles between multiple memory units to increase throughput, and then the IBM STRETCH had a two-part fetch-execute without attempt to address hazards.
So pipelining came out the ad-hoc application of the patterns of interleaving and hiving off functionality to dedicated units. In general, a lot of hardware architecture ideas cropped up very early in big iron, with microcoding even co-existing with them, without all of the extra full ideas being worked out (e.g. forwarding, hazard resolution of all types, etc.)
There’s a whole fun universe of vintage hardware architecture textbooks that get through the essential concepts through historical big iron, if you’re willing to look beyond Hennessy and Patterson and Shen and Lipasti! I am thinking that the latter might have references to other historical textbooks on pipelining, though. “The Anatomy of a High-Performance Microprocessor”, a pedagogical deep-dive into the design of the AMD K6 uarch (up to detailed algorithms and HDL!), will have good references to historical textbooks too since these authors were trained before the crop of “modern” (microprocessor-era) textbooks.
1957 in a transistorized computer is far too late for the origin of pipelining.
Pipelining had already been used in computers with vacuum tubes a few years before and it had also been used already a decade earlier in computers with electromechanical relays, i.e. IBM SSEC, which had a 3-stage pipeline for the execution of its instructions (IBM SSEC had a Harvard architecture, with distinct kinds of memories for program and for data).
Before 1959, pipelining was named "overlapped execution".
The first use of the word "pipeline" was as a metaphor in the description of the IBM Stretch computer, in 1959. The word "pipeline" began to be used as a verb, with forms like "pipelining" and "pipelined" around 1965. The first paper where I have seen such a verbal use was from the US Army.
Outside computing, pipelining as a method of accelerating iterative processes had been used for centuries in the progressive assembly lines, e.g. for cars, and before that for ships or engines.
Pipelined execution and parallel execution are dual methods for accelerating an iterative process that transforms a stream of data. In the former the process is divided into subprocesses through which the stream of data passes in series, while in the latter the stream of data is divided into substreams that go in parallel through multiple instances of the process.
Wikipedia gives another much more ancient example of a "pipeline", the Venetian Arsenal, which had a progressive assembly line for ships, which had a throughput that could be greater than of one ship per day, and which has been built in the 12th century.
> Pipelined execution and parallel execution are dual methods for accelerating an iterative process that transforms a stream of data. In the former the process is divided into subprocesses through which the stream of data passes in series, while in the latter the stream of data is divided into substreams that go in parallel through multiple instances of the process.
With parallel do you mean superscalar execution i.e. having two ALUs or do you mean having multiple cores?
Parallel execution can exist at many different levels, e.g. at processor level you can have multiple processors executing in parallel some threads.
At instruction level, the execution of instructions is an iterative process, so like for any other iterative process parallelism or pipelining or both parallelism and pipelining may be used.
Most modern CPUs use both parallelism and pipelining in the execution of instructions. If you have e.g. a stream of multiply instructions, the CPU may have for example 2 multiply pipelines, where each multiply pipeline has 4 stages. The incoming multiply instruction stream is divided into 2 substreams, which are dispatched in parallel to the 2 multiply pipelines, so 2 multiply instructions are initiated in each clock cycle, which makes the example CPU superscalar. The multiply instructions are completed after 4 clock cycles, which is their latency, when they exit the pipeline. Thus, in the example CPU, 8 multiply instructions are simultaneously executed in each clock cycle, in various stages of the 2 parallel pipelines.
Whenever you have independent iterations, e.g. what looks like a "for" loop in the source program, where there are no dependencies between distinct executions of the loop body, the iterations can be executed in 4 different ways, sequentially, interleaved, pipelined or in parallel.
The last 3 ways can provide an acceleration in comparison with the sequential execution of the iteration. In the last 2 ways there is simultaneous execution of multiple iterations or multiple parts of an iteration.
For parallel execution of the iteration (like in OpenMP "parallel for" or like in NVIDIA CUDA), a thread must be created for each iteration execution and all threads are launched to be executed in parallel by multiple hardware execution units.
For pipelined execution of the iteration, the iteration body is partitioned in multiple consecutive blocks that use as input data the output data of the previous block (which may need adding additional storage variables, to separate output from input for each block), then a thread must be created for each such block that implements a part of the iteration, and then all such threads are launched to be executed in parallel by multiple hardware execution units.
These 2 ways of organizing simultaneous work, pipelined execution and parallel execution are applicable to any kind of iterative process. Two of the most important such iterative processes are the execution of a stream of instructions and the implementation of an array operation, which performs some operation on all the elements of an array. In both cases one can use a combination of pipelining and parallelism to achieve maximum speed. For these 2 cases, sometimes the terms "instruction-level parallelism and pipelining" and "data-level parallelism and pipelining" are used. The second of these 2 terms is misleading, because not data are executed in parallel or pipelined, but the iterations that process data are executed in parallel or pipelined. For any case where pipelining may be used, it is important to recognize which is the iterative process that can be implemented in this way. For unrelated tasks a.k.a. processes a.k.a. threads, only 3 ways of execution are available: sequential, interleaved and in parallel. The 4th way of execution, pipelined, is available only for iterations, where the difference between iterations and unrelated tasks is that each iteration executes the same program (i.e. the loop body, when the iteration is written with the sequential loop syntax).
You are right, but this can be make more precise: pipelining is a specific form of parallelism. After all the different stages of the pipeline are executing in parallel.
You use "parallelism" with a different meaning than me.
I use "parallel" with its original meaning "one besides the other", i.e. for spatial parallelism.
You use "parallel" with the meaning "simultaneous in time", because only with this meaning you can call pipelining as a form of parallelism.
You are not the only one who uses "parallel" for "simultaneous in time", but in my opinion this is a usage that must be discouraged, because it is not useful.
If you use "parallel" with your meaning, you must find new words to distinguish parallelism in space from parallelism in time. There are no such words in widespread use, so the best that you can do is to say "parallel in space" and "parallel in time", which is too cumbersome.
It is much more convenient to use "parallel" only with its original meaning, for parallelism in space, which in the case of parallel execution requires multiple equivalent execution units (unlike pipelined execution, which in most cases uses multiple non-equivalent execution units).
When "parallel" is restricted to parallelism in space, pipelining is not a form of parallelism. Both for pipelining and for parallelism there are multiple execution units that work simultaneously in time, but the stream of data passes in parallel through the parallel execution units and in series through the pipelined execution units.
With this meaning of "parallel", one can speak about "parallel execution" and "pipelined execution" without any ambiguity. It is extremely frequent to have the need to discuss about both "parallel execution" and "pipelined execution" in the same context or even in the same sentence, because these 2 techniques are normally combined in various ways.
When "parallel" is used for simultaneity in time it becomes hard to distinguish parallel in space execution from pipelined execution.
The pipeline stages (say: fetch, decode, execute, memory access, register write back), are organised "parallel in space" as transistors on chip. The point of having a pipeline is so the stages can execute "parallel in time".
More generally, parallel in space is interesting because it is a necessary precondition for parallel in time.
In its original meaning, which is still the meaning used in mathematics and physics, "parallel" provides more information than just saying that the parallel things are located in different places in space. Such an information can be provided by other words.
One should not say that the pipeline stages are parallel, when the intended meaning is that they are separate or distinct, which is the correct precondition for their ability to work simultaneous in time.
"Parallel" says about two things that they are located side-by-side, with their front-to-back axes aligned, which is true for parallel execution units where the executions of multiple operations are initiated simultaneously in all subunits, but it is false for pipelined execution units, where the executions of multiple operations are initiated sequentially, both in the first stage and in all following stages, but the initiation of an execution is done before the completion of the previous execution, leading to executions that are overlapped in time.
The difference between parallel execution and pipelined execution is the same as between parallel connections and series connections in any kind of networks, e.g. electrical circuits or networks describing fluid flow.
Therefore it is better if the terms used in computing remain consistent with the terms used in mathematics, physics and engineering, which have already been used for centuries before the creation of the computing terminology.
If this is true, then it resembles in this feature the IBM SSEC, another electromechanical computer with a three stage instruction pipeline, which however was built only later, after WWII, between 1944 and 1947.
The main inspiration for IBM SSEC has been Harvard Mark I, an earlier electromechanical computer built by IBM based on a design done mostly by Howard Aiken, but it would not have been impossible for some information about the Zuse computers to have reached IBM after WWII, contributing to the design of the SSEC.
Probably the first person to ask, 'how can I speed up this processor, maybe there's a way to do more than one processing step at a time for each instruction'
Slightly offtopic but did we go to uni together and team up on a couple of demo compos and VJ some events? If so hey it's been ages and whats up? :D If not, cool name and stuff. :)
Is it really that much different than e.g. a cook preparing carrots while onions are being cooked? Nobody needed to invent this as far as my knowledge of history goes; so, maybe there really is nothing new under the sun :)
Joking aside, of course, the ingenuity lies in finding a fitting solution to problem at hand, as well as knowing how to apply it.
Come on, this isn't like the invention of calculus. Pipelining is just assembly line processing that anyone who managed or designed any factory would have implemented as soon as the transistor real estate became available.
I don't think there is a generally optimal design. There are cons and pros to using the same homogeneous FMAs units for adds, multiplies and fmas, even at the cost of making adds slower (simpler design, and having all instructions of the same latency greatly simplifies scheduling). IIRC intel cycled through 4 cycles fma, add and mul, then to 4 cycles add and mul and 5 cycles fmas, then with a dedicated 3 cycles add.
The optimal design depends a lot on the rest of the microarchitecture, the loads the core is being optimized for, the target frequency, the memory latency, etc.
Adds are cheaper only for fixed-point computations. Floating point addition needs to denormalize one of its' arguments, perform an (integer) addition and then normalize the result.
Usually FP adds take a cycle or two longer than FP multiplication.
Depends on what you mean by ‘cheaper’. Multiplies are still more gates. The adds are slower due to longer dependency chains, not because they cost more gates.
If the two ALUs could feed results into each other, that would make things really interesting, especially when you consider the way it would affect reordering.
Imagine if the OP's benchmark had two of those multiplication chains on independent registers:
mul x1, x0, x0 // a
mul x2, x1, x1 // b
mul x3, x2, x2 // c
mul x4, x3, x3 // d
mul x6, x5, x5 // e
mul x7, x6, x6 // f
mul x8, x7, x7 // g
mul x9, x8, x8 // h
If you had two independent ALUs, my intuition is that you'd want to dispatch the instructions as AEBFCGDH, interleaving the two chains.
But, if the two ALUs could feed each other, perhaps you'd actually want to leave it in ABCDEFGH order? Hmm.
> If you had two independent ALUs, my intuition is that you'd want to dispatch the instructions as AEBFCGDH, interleaving the two chains. But, if the two ALUs could feed each other, perhaps you'd actually want to leave it in ABCDEFGH order? Hmm.
Once the frontend has determined what the dependencies are between instructions, you no longer have a linear sequence of instructions being dispatched to execution units but a DAG, and with multiple instructions starting on the same clock cycle the execution order would be more like (AE)(BF)(CG)(DH) without any ordering between instructions in the same parenthesized group.
The reorder buffer in the CPU will be large enough to handle the compiler emitting either ABCDEFGH or AEBFCGDH, but the interleaved ordering is less likely to run into limitations of the instruction decoder (if this chunk of code isn't already in a decoded µop cache).
Closest I can think of would be the original Pentium 4, which had double-speed integer ALUs where simple operations like addition would produce the low 16 bits in the first half cycle and the high 16 bits in the second half cycle. A dependent op could immediately start using the low result before the high result was done to give effective 0.5c latency and throughput and 4 adds/cycle. Intel dropped this in the 64-bit capable revision of the P4, though.
I haven't seen any cpu with a fully pipelined division, but division units are at least partially pipelined in recent-ish cpus. They usually can start a new division well before the previous one has stopped executing.
ALUs in Recent Apple cpus can actually start a new division every other cycle (in addition to having an abnormally low latency), which is very impressive.
I think my favorite introduction to just how insane the pipelines, predictors, and other insanity in "modern" (more than a decade ago now) was trying to improve `Math.sqrt()` performance in JSC. This was during the first generation of JIT JS engines (e.g. no one was inlining functions yet), and I was replacing the host implementation of Math.sqrt with a pure assembly version - essentially calling a host function was significantly more expensive another JS function - e.g. JIT JS function -> JIT JS function was significantly faster than JIT JS function -> host code (e.g. C/C++). As part of doing that I was going step by step through each instruction making sure it was the minimum overhead as each step, think something like (very approximate - again more than a decade ago):
v0:
1. if (input not a number)
fallback to C++; else
2. return tagged 0; // Just making sure the numeric check was optimal
v1:
1. As above
2. If integer
convert to float
3. return tagged 0
v2:
1-2. as above
3. If negative
return tagged nan
4. Return tagged 0
v3:
1-3. as above
4. use the sqrt instruction
5. return tagged 0
v4.
1-4. as above
5. move <4> back to an integer register
6. return tagged 0
v5.
1-5. as above
6. tag the result of sqrt
7. return tagged 0
v6.
1-6. as above
7. Actually return/store the result of <6>
Alas I cannot recall whether at this point return values were going into the heap allocated VM call stack, or whether the return was via rax, but that's not the bit that was eye opening to me.
I had a benchmark that was something like
for (var i = 0; i < large number; i++)
Math.sqrt(i)
Noting that while I was working on this there was no meaningful control flow analysis, inlining, etc so this was an "effective" benchmark for perf work at the time - it would not be so today.
The performance remained "really good" (read fake) until the `v6` version that actually store/returned the result of the work. It was incredibly eye opening to see just how much code could be "executed" before the CPU actually ended up doing any work, and significantly impacted my approach to dealing with codegen in future.
My perspective at the time was "I know there's a significant marshaling cost to calling host code, and I know the hardware sqrt is _very_ fast", so it seemed that it was possible that a 5-10x perf improvement seemed "plausible" to me at the time (because marshaling was legitimately very expensive) - and I can't recall where in the 5-10x range the perf improvement was - but then once the final store/return was done it dropped in perf to only 2x faster. Which was still a big win, but also seeing just how much work the CPU could just avoid doing while trying to build out the code was a significant learning experience.
The latency shows after how many cycles the result of an instruction can be consumed by another, while the throughput shows how many such instructions can be pipelined per cycle, i.e. in parallel.
reply