Hacker News
The surprising subtleties of zeroing a register (2012) (randomascii.wordpress.com)
112 points by gtirloni on Nov 4, 2021 | 59 comments



Author in the old thread (https://news.ycombinator.com/item?id=19262249) says

> An x86-64 CPU has sixteen integer registers, but 100-200 physical integer registers. Every time an instruction writes to, say, RAX the renamer chooses an available physical register and does the write to it, recording the fact that RAX is now physical-register #137. This allows the breaking of dependency chains, thus allowing execution parallelism.

I'm curious why they have so many more physical registers than... logical? registers. I have a couple of guesses:

* Physical registers are physically cheaper to add than logical registers.

* Adding logical registers breaks backwards compatibility, or at best means you get no speedup on things written (/compiled) for fewer logical registers. Adding physical registers lets you improve performance without recompiling.

* Adding logical registers increases complexity for people writing assembly and/or compilers. Adding physical registers moves that complexity to people designing CPUs.

Are some of these correct? Other reasons I'm missing?


Don't think of %eax as a real register. Think of it as a tag in a compressed dataflow graph. The compression is performed by the compiler's register allocator, and the decompression is performed by the CPU's register renaming.

A compiler's IR is a directed graph, where nodes are instructions and tagged by an assigned register. It would be pleasant to assign each node a distinct register, but then machine instructions would be unacceptably large. So the compiler's register allocator compresses the graph, by finding nodes that do not interfere and assigning them the same register.
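The compression step can be sketched as a toy allocator (purely illustrative; real allocators work on interference graphs with spilling and coalescing). Two values that are never live at the same time can share one register:

```python
# Toy register allocator: assign each virtual register a machine register
# so that no two simultaneously live values share one.
def allocate(live_ranges, num_regs):
    # live_ranges: {vreg: (start, end)} over instruction indices
    assignment = {}
    for v, (s, e) in sorted(live_ranges.items(), key=lambda kv: kv[1]):
        # registers already taken by values whose ranges overlap [s, e]
        taken = {assignment[u] for u, (us, ue) in live_ranges.items()
                 if u in assignment and us <= e and s <= ue}
        free = [r for r in range(num_regs) if r not in taken]
        if not free:
            raise RuntimeError(f"spill needed for {v}")
        assignment[v] = free[0]
    return assignment

ranges = {"a": (0, 3), "b": (1, 2), "c": (4, 6)}  # "a" and "c" never overlap
print(allocate(ranges, 2))  # {'a': 0, 'b': 1, 'c': 0} -- "c" reuses "a"'s register
```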

The CPU's register renamer then reinflates this graph, by inspecting the dataflow between instructions. If two instructions share a register tag, but the second instruction has no dependence on the first, then they may be assigned different physical registers.

`xor eax, eax` has no dependence on any instruction, so it can be specially recognized as allocating a new physical register. In this way of thinking, `xor eax, eax` doesn't zero anything, but is like malloc: it produces a fresh place to read/write, that doesn't alias anything else.


Adding more logical registers is compatibility-breaking. And since you have to encode the register specifier in the instruction, more registers mean larger instructions (hence the compatibility break), which makes fetching and decoding instructions slower.

And, regardless of how many logical registers you have you need to have more physical registers. These are needed for out-of-order (OOO) execution and speculative execution. An OOO super-scalar speculative CPU can have hundreds of instructions in flight and these instructions all need physical registers to work on. If you don't have excess physical registers you can't do OOO or speculative execution.


Adding more logical registers doesn't have to break application compatibility. Intel added x87, MMX ... extensions with their new register sets all without breaking compatibility. They even doubled the integer register set in x86_64 with the REX prefix. New programs could use these features and their register sets without existing programs being broken.

What register renaming allows is to increase the performance of both new and existing programs, which is no mean feat. It allows the CPU scheduler to search for more out-of-order parallelism rather than relying on the compiler to find in-order parallelism.

This binary compatibility may not seem very important now ("don't break userspace" excepted), but it was then. Compatibility made IBM and Intel hundreds of billions of dollars.


Adding those registers each time at the bare minimum broke OS compatibility in order to enable them, and required kernel changes to save and restore the new architectural registers. Adding to the physical register file lets existing code (including kernel space) take advantage of it. There have been some extensions lately, like xsave, that try to address that, but they're not fully embraced by major kernels AFAIK.


Tomasulo's register renaming algorithm (1967) comes from an era when you bought the computer and the operating system together, so OS compatibility was their problem. That's not our era, but the resulting in-order vs. out-of-order war is long over.

Register renaming enabled out-of-order execution. The 90s were a competition between in-order, compiler-based scheduling and out-of-order CPUs. The in-order proponents said that out-of-order was too power-hungry, too complex, and wouldn't scale. Well, it did. Even the Itanium, the great in-order hope, ended up with out-of-order execution in its last microarchitecture, Poulson.

Ultimately, out-of-order won the war but in-order survives for low power low complexity designs; the A53 is in-order. Skylake Server has 180 physical registers with an out-of-order search window of 224.

https://www.primeline-solutions.com/media/wysiwyg/news-press...


In the 1960s, OS compat was your problem as the end user too. There wasn't the strict divide between OS code and user code in the same way. The 360/91 that Tomasulo's algorithm originally shipped on didn't even have an MMU to separate your code from the kernel's.

Additionally, compiler tech wasn't anywhere near where it is today, and high-performance code was written in ASM. Existing code would therefore have had to be rewritten to use more registers, but the 360/91 ran existing code just fine (which was very important for the 360/91's main customers).


x87 and MMX's register encodings exist in (mostly) separate parts of the x86 operand encoding map. That's in contrast to GPRs, which have to squeeze into 3 (sometimes 4, with the REX prefix) bits.

That's where the incompatibility comes from -- x86-64 required an entirely new prefix to merely double the GPRs; adding a few hundred more would require some very substantial changes to the opcode map and all decoders already out there.


When x87+MMX were added, existing programs ran unchanged. When x86-64 doubled the register sets, many existing programs ran unchanged (some features were dropped). Compatibility was largely maintained. That compatibility was what AMD wagged in Intel's face when Intel was trying to pivot to Itanium. Intel had to then take the walk of shame and adopt AMD's approach.

Seriously, Intel took a long view towards this. x87 was a wart on the side of a mole, and yet its unholy marriage with MMX (they shared a register set) allowed existing programs to run while creating a compatibility barrier to competitors. Competitors had to be compatible, and bug-compatible. The guy tasked with doing this at Transmeta almost had a nervous breakdown, not from compatibility (easy) but from bug compatibility.

IBM 360 programs still run on the Z architecture.


The original 8087 prompted the IEEE 754 floating-point standard, which had a profound impact on the entire field of computer science.

https://news.ycombinator.com/item?id=17767925

https://news.ycombinator.com/item?id=23205225

https://news.ycombinator.com/item?id=23362673

https://news.ycombinator.com/item?id=18107165


I'm not saying it's impossible! They certainly have plenty of space in the EVEX scheme. But extending the GPRs is a much bigger lift, tooling-wise, than is adding a relatively disjoint ISA extension. Even if they can do it while preserving older encodings, it's just another speedbump at a time when Intel is probably anxious to make x86-64 as frictionless as possible.

Besides, register renaming seems to be working splendidly at the uarch level. Why complicate the architectural model when the gains are already present?


> Even if they can do it while preserving older encodings, it's just another speedbump at a time when Intel is probably anxious to make x86-64 as frictionless as possible.

Just wanted to throw out there that it was AMD that came up with x86-64's ISA rather than Intel. Intel was still pushing Itanium hard at the time.


x87 instructions weren't dropped, they are still available on x86_64.


Yeah, you're right - the manual marks them "64-Bit Mode: Valid". Thanks.


> I'm curious why they have so many more physical registers than... logical? registers

Register renaming allows instructions to be executed out-of-order [1] which allows for more instruction throughput.

This goes back to 1967 and to the IBM 360/91 with its 4 floating point registers. That's not many registers but Moore's law was making more transistors available. The problem was how to use these transistors to get more throughput from existing programs without changing the ISA and (potentially) breaking compatibility.

The solution was Tomasulo's algorithm [2] which allowed (few) architectural registers to be renamed to (many) physical registers.

  original         renamed           reordered
  mov RAX, 1       mov PHYS1, 1      mov PHYS1, 1; mov RAX, [RCX]
  add RBX, RAX     add RBX, PHYS1    add RBX, PHYS1
  mov RAX, [RCX]   mov RAX, [RCX]
The first and third instructions can be executed at the same time on independent functional units. The third is out-of-order with respect to the second.

[1] https://inst.eecs.berkeley.edu/~cs152/sp20/lectures/L10-Comp...

[2] https://en.wikipedia.org/wiki/Tomasulo_algorithm


I've not seen this in other responses, but one answer to your question is that every logical register adds overhead to context switching by the operating system.

The OS has to store and load all registers whenever it decides to switch which thread is processing. 100 more logical registers means 100 more locations the OS has to keep track of.

This is part of the reason why new SIMD instruction sets need OS support before you can start using them.


> Adding logical registers breaks backwards compatibility

This, plus adding logical registers increases instruction size and therefore decreases the number of instructions that can be fetched with a given memory bandwidth.


It can definitely be beneficial to add logical registers. When AMD designed x86-64 they doubled the number of general logical registers up to 16. As other commenters have said, unless you are already making breaking changes, increasing the number of logical registers is probably not worth it.


Having more physical registers than logical means the CPU can do optimizations, opcodes don't have to be as big (it takes more bits to encode more registers), compatibility with older binary code is maintained, and CPUs at different price points can have different numbers of physical registers while all being able to run the same binaries.

(I don't know if any manufacturer actually does that last thing, however.)


> * Adding logical registers breaks backwards compatibility, or at best means you get no speedup on things written (/compiled) for fewer logical registers. Adding physical registers lets you improve performance without recompiling.

> * Adding logical registers increases complexity for people writing assembly and/or compilers. Adding physical registers moves that complexity to people designing CPUs.

These two points are the same thing: compatibility, and compatibility is what Intel and AMD have lived on from day one with x86. It's why we still live with this really weird instruction set with all of its historical oddities. Certain features of real mode weren't removed until long into the 64-bit era. Adding things isn't any better: if you wanted to add more registers, you'd have to change the instruction encoding, and the instruction space is finite (actually limited to 15 bytes). That would be rather disruptive.


I think your second point hits it and is the primary benefit of hiding the microarchitecture layer - it can be improved and existing code will benefit from it.

Basically Intel is saying if you had 200 GPRs, you couldn't do better at using the free ones than the CPU scheduler/decoder.

> Adding *architecturally visible* registers increases complexity for people writing assembly and/or compilers.

More registers just make your code less likely to have to shuffle stuff to and from RAM - which is where values go when you run out of registers.

It's always faster for a CPU to access registers within itself than to talk over a bus to memory. Even when RAM was the same speed as CPUs (the 8-bit era) you would still save a cycle or two.


Having more logical/architectural registers is great except for a few costs:

1) More bits to encode register numbers in instructions. Doubling the number of logical registers costs another two or three bits depending on how many registers are referenced in an instruction

2) Logical registers have to be saved on context switches

3) Logical registers have to be saved around function calls. Either the caller or the callee has to save (or not use) registers, and most functions are both callers and callees. That is, if you are not a leaf-node function then every register you use you have to first save to the stack, or else assume that the functions you call will trash it. Thus, more registers have diminishing returns.

4) No matter how many logical registers you have you _always_ want to have an order of magnitude more physical registers, because otherwise you can't implement OOO or speculative execution.

Point #4 is probably the most critical because I think what people are really asking is why are there more physical than logical registers, and OOO/speculative execution is the answer.


I wonder how things like register renaming (or pipelining) are implemented. It would seem difficult even in a high-level language, but they do it inside the processor. Is this in microcode that runs on the "actual" processor? Or is it in hardware? Do they write the algorithm in a language like VHDL or Verilog?


Register renaming is implemented in hardware. Because it is used on every instruction it is on the critical path and is probably hand-optimized. Here is some more reading on this topic:

https://en.wikipedia.org/wiki/Register_renaming


Renaming isn't really done in microcode. Microcode is just another source for ops that get renamed. All of the renaming happens in hardware, and boils down to a handful of tables inside the processor.


Some cores are open source and you can see for yourself.

Rename logic from BOOM, a RISC-V core written in a DSL embedded in Scala:

https://github.com/riscv-boom/riscv-boom/blob/1ef2bc6f6c98e5...

From RSD, a core designed for FPGAs written in SystemVerilog:

https://github.com/rsd-devel/rsd/blob/master/Processor/Src/R...

And then there's Alibaba's recently open-sourced XuanTie C910, which contains this Verilog… which is completely unreadable. Seems like it was produced by some kind of code generator that they didn't open-source?

https://github.com/T-head-Semi/openc910/blob/d4a3b947ec9bb8f...


All of the above, I believe.


Beyond the reasons & limitations explained in sibling posts, having a large number of (logical / instruction-level) registers also inflates instruction size, and thus diminishes instruction density, and thus lowers performance - so there is a trade-off between instruction density and a large number of registers. Hear me out.

The CPU has limited memory bandwidth; the larger the instruction size, the more bytes need to be loaded from memory to execute the instruction. Same with cache size - the more space an instruction takes, the fewer instructions fit in the cache. Lastly, there's the complexity & latency of the instruction decoder. This possible performance loss is averted by keeping instructions short and the instruction set "dense".

Any instruction that refers to a register needs a certain number of bits in the operand portion to indicate which specific register(s) to use [1][2][3]. As an example, in the case of 8-register x86 the operand generally uses 3 bits just to indicate which register to use. In the case of 16-register x86_64, it takes 4 bits. If we wanted to address all 200 physical registers, that would require a whole 8 bits reserved in the instruction just to indicate the register to use. Certain instructions - data transfer, algebra & bitwise operations, comparisons, etc. - naturally use two or more registers, so multiply that accordingly.
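The operand-size arithmetic above is just ceil(log2(N)) bits per register field, times the number of fields; a quick sketch:

```python
import math

def operand_bits(num_regs, fields=2):
    # bits needed to name one register, times the register fields per instruction
    return math.ceil(math.log2(num_regs)) * fields

print(operand_bits(8))    # 8-register x86: 3 bits per field -> 6
print(operand_bits(16))   # 16-register x86_64: 4 bits per field -> 8
print(operand_bits(200))  # naming 200 physical registers: 8 bits per field -> 16
```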

Since using this many registers gives only diminishing returns in terms of performance (and also requires very heavy lifting on the compiler's part[4]), the trade-off selected is that the compiler uses the small, architecture-defined set of registers, and the processor at runtime is able to speed up some code by using the spare registers for instruction-level execution parallelism.

[Edit]

There's one more common circumstance where a large number of registers is undesirable: a change of execution context (thread switch; process switch; interrupt). Typically all architecturally visible registers are saved to memory on a change of context and a new set is loaded for the new context. The more registers there are, the more work has to be done. Since the physical registers are managed directly by the CPU and serve more as a cache than as directly addressed registers, they don't need to be stored to memory.

[1] Aside from certain specialized instructions that implicitly use a particular register; for example, in x86 many instructions implicitly use the FLAGS register, and DIV/IDIV integer division implicitly uses the AX and DX registers.

[2] Aside from certain instruction prefixes that influence which register is used; for example, in x86 that would be segment register overrides.

[3] Aside from certain architectures where registers were organized in a "file" and available only through a "window" - i.e., an implicit context, an implicit register-addressing base; instruction operands referred to registers relative to the current window, and the whole window could be shifted by specialized instructions, typically on function enter/leave and similar. This was, more or less, the whole set of hardware registers being exposed at the architecture level, albeit in a somewhat constrained / instruction-dense way.

[4] Arranging which registers to use, which to spill to memory, etc. is non-trivial work for the compiler, and the complexity grows super-linearly with the number of registers.


the ISA defines logical registers

the implementation defines physical registers

any implementation is permissible as long as it conforms to the ISA


(2012)

The most active previous discussion: https://news.ycombinator.com/item?id=19262249


Thanks! Expanded list:

The Surprising Subtleties of Zeroing a Register (2013) - https://news.ycombinator.com/item?id=19262249 - Feb 2019 (22 comments)

The Surprising Subtleties of Zeroing a Register (2012) - https://news.ycombinator.com/item?id=11057679 - Feb 2016 (3 comments)

The Surprising Subtleties of Zeroing a Register - https://news.ycombinator.com/item?id=6312266 - Sept 2013 (1 comment)


A plus for the benchmarks. One more complication of xor xx,xx goes unmentioned: xor also (partially) affects the flags register. That should get renamed as well.


What about FP registers? Are there special instructions to zero them?


For x87, you can use FLDZ to load `+0.0` onto the register stack. My guess is that neither Intel nor AMD has tried particularly hard to optimize that, given x87's legacy status and weird register semantics (see: them being a pseudo-stack).

For SIMD-based FP, you can use VZEROUPPER and VZEROALL if you want to clear all {X,Y,Z}MM state. For a single register, I believe PXOR is still the common idiom, mirroring `XOR EAX, EAX`.


> For x87

Should read: for x87, unless you are doing this for fun, shoot yourself in the face and instead use SSE because no one in their right mind should be writing x87 code in 2021!

Please help this architectural misfeature die, which should have happened decades ago.


For some operations 80 bit accumulators are still useful, so a mix of SSE and x87 might be optimal.


> Should read: for x87, unless you are doing this for fun, shoot yourself in the face and instead use SSE because no one in their right mind should be writing x87 code in 2021!

My information is pretty old, but IIRC x87 still has specialized instructions for sin/cos/tan that are sometimes more performant than their equivalent implementations in SSE. x87 instructions are also very small, so trig-heavy workloads where I$ is a measurable performance component might unfortunately still be a good fit for x87.


On recent Intel/AMD CPUs the x87 transcendental functions are usually much slower than their equivalents using SSE/AVX instructions.

For future CPUs, it is expected that the performance gap will increase.

The x87 trigonometric functions typically require between 100 and 200 clock cycles. During that time a recent CPU can execute 200 to 500 instructions, enough to compute many values of a trigonometric function (using a polynomial approximation). When SIMD instructions can be used, several tens of values of a function could be computed in the time of a single x87 instruction.
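For a sense of what "polynomial approximation" means here, a minimal sketch (a plain Taylor polynomial; production libms use minimax polynomials plus careful range reduction):

```python
import math

def sin_poly(x):
    # 7th-order Taylor polynomial for sin, evaluated with Horner's rule:
    # sin x ~= x - x^3/6 + x^5/120 - x^7/5040
    x2 = x * x
    return x * (1 + x2 * (-1/6 + x2 * (1/120 + x2 * (-1/5040))))

# a handful of mul/add operations, accurate to ~1e-8 for small arguments
print(abs(sin_poly(0.5) - math.sin(0.5)) < 1e-7)  # True
```

Each call is a few fused multiply-adds, and with SIMD the same polynomial evaluates for 4-16 inputs at once, which is where the throughput gap comes from.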


x87 trig instructions are pretty inaccurate even in the supported range mainly because of faulty range reduction [1] and can't be vectorized at all. If you have trig-heavy workloads nowadays you would want SIMD libm, not x87.

[1] http://notabs.org/fpuaccuracy/


Why can't the trig instructions be vectorized? I've never quite understood why SSE/AVX didn't add trig functions to finally kill off any argument for using x87.


Unlike simple operations like a floating-point multiplication, trigonometric functions and the other transcendental functions are too complex, so they must be split into many steps.

If a trigonometric function is encoded as a single instruction, then it must launch a microprogram, to execute the many required steps.

A microprogram cannot execute faster than the same execution steps encoded as separate instructions, so the only advantage of encoding a trigonometric function as a single instruction would be to reduce program size. Most programs contain few trigonometric functions, so the reduction in program size is not worthwhile.

While a microprogrammed trigonometric function could be as fast as the equivalent sequence of instructions, in reality it is usually much slower.

The reason is that the modern CPUs are optimized for the most frequent instructions and they dedicate a minimum of resources for the seldom used microprogrammed instructions, so these hit various limitations that do not exist for the simple instructions. The microprogrammed instructions usually have some phases whose execution cannot be overlapped in time with other instructions, which leads to lower performance.


There exist libraries that provide SIMD versions (on the fast path) of math functions, e.g. https://sourceware.org/glibc/wiki/libmvec


Gotcha, hence the reason they've added instructions like FMA. Those don't require microprograms but do make things like calculating a Taylor series faster. Right?


> My information is pretty old, but IIRC x87 still has specialized instructions for sin/cos/tan that are sometimes more performant than their equivalent implementations in SSE.

They absolutely are not. If you accept the same, crappy precision, you can do an estimate in just a few cycles instead of the 60+ that the instructions take.


Last time I checked, x87 (and MMX which uses the same register file) aren't renamed. There's only the architectural register file. At that point these sorts of concerns don't matter because you're not metaprogramming rename hardware through the veneer of the architectural registers.


The x87 register stack has been renamed on Intel desktop/server OoO CPUs since the original Pentium Pro, by special handling of the FXCH instruction.

If anything, AMD improved its handling on recent Zens.

Take a look at Agner manuals.

It is plausible it will be dropped at some point but we are not there yet.

[1] newer atoms, although a


I guess I didn't really consider the manual renaming via fxch the same kind of renaming, and that equally doesn't apply to these sorts of zeroing concerns.

That being said, it looks like my information is a little out of date, and they are renamed fully now and share a PRF with the AVX-512 K registers.


FXCH has always existed; before PPro, FXCH was a normal FP instruction that was needed to manipulate the physical architectural register stack (as most x87 instructions implicitly operated on the top of stack).

Since PPro there isn't really a register stack anymore. FXCH is not a "real" instruction any more, it has 0 latency and uses no execution resources, and it is resolved at the renaming stage and aliases the implicit stack positions to actual fp registers.


According to Agner, FXCH didn't actually move values around on Pentium and Pentium MMX processors either, despite them being in-order cores. There was 'rename hardware' sitting in front of the x87/MMX register file, but there wasn't this distinction of size between architectural and physical register files. Despite the renaming, there was still only one physical register for each architectural register at a time (except for a cute bypass technique combined with FXCH).

> The solution to this problem is register renaming. The FXCH instruction does not in reality swap the contents of two registers; it only swaps their names. Instructions that push or pop the register stack also work by renaming. Floating point register renaming has been highly optimized on the Pentiums so that a register may be renamed while in use. Register renaming never causes stalls - it is even possible to rename a register more than once in the same clock cycle, as for example when FLD or FCOMPP is paired with FXCH.

So, like I stated originally, the presence of FXCH and the x87 stack doesn't necessarily imply the same sorts of physical-register-allocation dependency-breaking concerns involved with "how do I zero a register properly". They do now, though, as it looks like there's true renaming going on.


If x87 and MMX registers don't support renaming then that means that they can't support OOO and speculative execution of these instructions. This is possible but seems unlikely to me.

That is, even though not a lot of x87/MMX code is executed these days (and even less is written) I would be surprised if the OOO/speculative-execution of the processor gets halted whenever these instructions are encountered.

Note that x87 instructions still get used in most 32-bit programs because the x87 registers are the defined way for functions to return floating-point results.

Which is to say, citation needed.


> That is, even though not a lot of x87/MMX code is executed these days (and even less is written) I would be surprised if the OOO/speculative-execution of the processor gets halted whenever these instructions are encountered.

The implications aren't that bad. Only instructions using the outputs of x87/MMX instructions would have to stall. And especially if you're only using them to shift values during function returns, then that wouldn't have to have a huge impact on performance.


It looks like my information was woefully out of date. They're fully renamed and executed OoO now, and on Intel even share a PRF with AVX-512 K registers interestingly enough.

Thanks for keeping me honest.


> 1: add eax, 1

> 2: mov ebx, eax

> 3: xor eax, eax

> 4: add eax, ecx

> Ideally we would like our awesome out-of-order processor to execute instructions 1 and 3 in parallel. There is a literal data dependency between them, but a sufficiently advanced processor could detect that this dependency is artificial.

This doesn't seem right... instruction 2 does need to run after instruction 1 and before instruction 3.


Not if it's renamed. Instruction 2 does need to run after instruction 1, but instruction 3 doesn't need to run after them. xor eax, eax doesn't depend on the previous value of eax (because xoring anything at all with itself is zero), so you can assign it a new physical register file slot and just set it to zero.


Yep, exactly. The rule of thumb is that whenever a register is written to a new register mapping is created. For instruction #1 eax might be assigned to physical register 103. It maintains that identity for instruction #2. For instruction #3 eax can be assigned to, let's say, physical register 67. Then for instruction #4 it could be assigned physical register 92. The only tricky thing is that instruction #3 also reads from eax, which normally would cause a dependency, but because XOR doesn't depend on the register contents this clever optimization applies.

Here's another variant:

> 1: add eax, 1

> 2: mov ebx, eax

> 3: mov eax, 42

> 4: add eax, ecx

Here instruction 3 writes the constant 42 to the register. This might make it clearer why instruction #3 can run in parallel with or before instruction #1.

Note that the results must be _retired_ in order, but execution can happen out of order.
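The walk-through above can be sketched as a toy renamer (the physical register numbers and the instruction representation here are illustrative, not how any real CPU encodes this):

```python
# Toy renamer: every write to an architectural register gets a fresh
# physical register; reads go through the current mapping. `xor r, r`
# is recognized as having no real reads, so it starts a new chain.
def rename(instrs):
    mapping, fresh, out = {}, iter(range(1, 100)), []
    for op, dst, reads in instrs:
        if op == "xor" and reads == [dst, dst]:
            reads = []                        # zeroing idiom: dependency broken
        phys_reads = [mapping.get(r, r) for r in reads]
        mapping[dst] = f"p{next(fresh)}"      # fresh register for the write
        out.append((op, mapping[dst], phys_reads))
    return out

prog = [("add", "eax", ["eax"]),           # add eax, 1
        ("mov", "ebx", ["eax"]),           # mov ebx, eax
        ("xor", "eax", ["eax", "eax"]),    # xor eax, eax
        ("add", "eax", ["eax", "ecx"])]    # add eax, ecx
for line in rename(prog):
    print(line)
```

The renamed `xor` ends up with an empty read list, so it can issue in parallel with instruction 1, while instruction 4 depends only on the `xor`'s fresh register and on ecx.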


After reading https://news.ycombinator.com/item?id=29104841 today and wondering why they zero one of registers using XOR on itself, it suddenly makes a lot of sense.


I asked why that was when I took assembler in college and the instructor didn't know (he said, "oh, um, it's faster"). I'm glad to finally find the actual correct answer to that question.


To be fair, the choice of xor eax, eax is 'faster' than mov eax, 0. The xor option is only two bytes versus the five of mov, which can have I$ and decode width concerns. And back in the day, five bytes was pretty awkward for up to 32 bit data buses, where you'd need at least two cycles to even read the mov instruction. That's why existing code tended to use xor eax, eax in the first place, even before Intel started adding rename hardware to their processors.


Same for xor ax,ax on 16-bit processors by the way (2 bytes vs 3), and even earlier for the 8-bit 8080 (1 byte vs 2). The xor or sub trick dates to the late 70s.
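For concreteness, the machine-code bytes behind the 32-bit size argument (these are the standard IA-32 encodings, which any assembler will confirm):

```python
# xor eax, eax: opcode 31 (xor r/m32, r32) + ModRM byte C0 (eax, eax)
xor_eax = bytes([0x31, 0xC0])
# mov eax, 0: opcode B8 (mov eax, imm32) + 4-byte immediate
mov_eax = bytes([0xB8, 0x00, 0x00, 0x00, 0x00])

print(len(xor_eax), len(mov_eax))  # 2 5
```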



