
The Surprising Subtleties of Zeroing a Register (2013) - ingve
https://randomascii.wordpress.com/2012/12/29/the-surprising-subtleties-of-zeroing-a-register/
======
brucedawson
Author here - I just want to drop in to reiterate how awesome register
renaming is. It's a bit weird to try and understand, but it is crucial to how
these out-of-order processors run, and it is the trick which lets the register
zeroing run so efficiently.

An x86-64 CPU has sixteen integer registers, but 100-200 _physical_ integer
registers. Every time an instruction writes to, say, RAX the renamer chooses
an available physical register and does the write to it, recording the fact
that RAX is now physical-register #137. This allows the breaking of dependency
chains, thus allowing execution parallelism.

In the case of the zeroing instructions the register renamer says "RAX now
maps to physical-register #0" where that is a register reserved for the
purpose of being zero. No execution is needed. It's beautiful.

For more details read up on out-of-order execution and register renaming.
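The rename-table idea can be sketched in a few lines of Go (golang comes up later in the thread). This is a toy model, not how hardware stores anything: the Renamer type, its method names, and the sequential allocator are all made up for illustration. The one real idea it captures is that zeroing is just a table update pointing at a reserved always-zero physical register.

```go
package main

import "fmt"

// Renamer is a hypothetical model of the rename stage: it maps
// architectural register names to physical register numbers.
// Physical register 0 is reserved as the always-zero register.
type Renamer struct {
	table    map[string]int // architectural name -> physical register
	nextFree int            // next unused physical register (toy allocator)
}

func NewRenamer() *Renamer {
	return &Renamer{table: map[string]int{}, nextFree: 1}
}

// Write allocates a fresh physical register for a destination,
// which is what breaks dependencies on the previous value.
func (r *Renamer) Write(arch string) int {
	p := r.nextFree
	r.nextFree++
	r.table[arch] = p
	return p
}

// Zero points the architectural register at the reserved zero
// register -- no execution unit involved, just a table update.
func (r *Renamer) Zero(arch string) {
	r.table[arch] = 0
}

func main() {
	r := NewRenamer()
	fmt.Println("write rax -> physical", r.Write("rax")) // physical 1
	r.Zero("rax")                                        // xor rax,rax: rename only
	fmt.Println("rax now maps to physical", r.table["rax"])
}
```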

~~~
theoh
This is a good place to start:
[https://en.wikipedia.org/wiki/Tomasulo_algorithm](https://en.wikipedia.org/wiki/Tomasulo_algorithm).

Butler Lampson mentions register renaming as one of a dozen paradigmatic neat
techniques in this 2015 presentation:

[http://bwlampson.site/Slides/Hints%20and%20principles%20(HLF...](http://bwlampson.site/Slides/Hints%20and%20principles%20\(HLF%202015\)%20abstract.htm)
Edit: corrected link

It's easy for software people not to ever encounter it!

~~~
brucedawson
> It's easy for software people not to ever encounter it!

Yep, because it does its magic without any intervention required.

One concrete example that I like is this:

    mov rax,[rsi+0]
    mov [rdi+0],rax
    mov rax,[rsi+8]
    mov [rdi+8],rax

This moves sixteen bytes of memory, all of it going through rax. Without
register renaming the third instruction can't run until the second instruction
completes, because they're both using rax. With register renaming the third
instruction uses a different physical register for rax, and the out-of-order
engine can do the loads in parallel and the stores in parallel.

Before anybody freaks out - register renaming doesn't change the semantics of
your program. It just makes your program run faster. It means you can write
code like the above - there's no need to rearrange the instructions and use
multiple temporary registers, because the CPU will do that for you, while
dirtying fewer architectural registers.

~~~
stcredzero
I wonder how often (golang, let's say)

    x, y = y, x
Just becomes register renaming? What about

    array[i], array[j] = array[j], array[i]

?

~~~
brucedawson
If the "x, y = y, x" is implemented using the x86 xchg ("exchange") instruction
then the CPU could implement the swap using register renaming, I think - with
no execution unit activity required. I don't know if it does.

But everything that golang (or C or C++ or whatever) does has to be expressed
in assembly language, and assembly language doesn't directly control register
renaming - it's a hidden implementation detail.

~~~
stcredzero
Thanks. That clarifies things.

------
richardhod
This is highly appreciated: the kind of advanced, specific knowledge that your
average hacker might not know, but which should improve their/our thought
processes, efficiency, reliability, security, and general understanding of
programming. This is one of the primary reasons HN is so valuable, and I do
wish there were some good repository of the best of these kinds of articles to
browse, apart from searching HN, which like anything can be noisy.

------
rayiner
As to the 4 instruction limit, see: [https://www.realworldtech.com/sandy-
bridge/5](https://www.realworldtech.com/sandy-bridge/5).

The register rename stage, which does the zeroing, can handle four
instructions per cycle. It’s not immediately obvious from the picture, but
after instructions are fetched from the uop cache, they are sent to a 28-entry
decoder queue, which can act as a small loop buffer. From there, four uops per
cycle are sent to the allocate/rename stage (which is part of the ROB block in
the first picture).

~~~
jabl
It's said in the article, under the "Update: January 7, 2013" heading.

------
jakeinspace
I recently had an interview at one of the big chip designers, and towards the
end they gave a few basic computer architecture and assembly brainteaser
questions. One of them was whether I was aware of a smaller instruction for
zeroing a register, and I did manage to guess xor after a little thought. It's
a trivial idea, but very satisfying to realize for the first time.
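The size win is easy to spell out: with a 32-bit operand, `xor eax,eax` encodes as the standard two bytes 31 C0, while `mov eax,0` needs five bytes, B8 plus a four-byte immediate. A quick Go sketch just to make the comparison concrete (the byte values are the standard x86 encodings; the snippet only prints their lengths):

```go
package main

import "fmt"

func main() {
	// Standard x86 machine-code encodings (32-bit operand size):
	xorEaxEax := []byte{0x31, 0xC0}                 // xor eax,eax
	movEax0 := []byte{0xB8, 0x00, 0x00, 0x00, 0x00} // mov eax,0
	fmt.Printf("xor eax,eax: %d bytes\n", len(xorEaxEax)) // 2 bytes
	fmt.Printf("mov eax,0:   %d bytes\n", len(movEax0))   // 5 bytes
}
```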

------
RyJones
(2013)

The article was written in 2012, but was last updated in 2013, so dealer's
choice?

~~~
plucas
Written December 29th and updated 9 days later. :)

------
saagarjha
> CPU designers could create a design where an xor instruction was faster than
> a sub instruction. However that would mean that sub would take at least two
> clock cycles, since instruction lengths are integer cycle counts. Unless
> this modification let them at least double the clock speed it would be a net
> loss. It’s better to make xor take just as long as sub. You can’t make an
> xor instruction take, say, 0.6 cycles. But feel free to try.

Earlier, the article says that the processor can perform four instructions per
cycle. Just to confirm, this isn't normally referred to as taking "0.25 cycles
per instruction", right?

~~~
rayiner
The Intel manuals sometimes use that as a shorthand, but no, it’s usually not.
The difference is between throughput and latency. You can process four
independent instructions per cycle in parallel (if none depend on the results
of the others), but the result of a given instruction will take at least one
cycle to be available to the next instruction that uses it as an input. (It
takes 9 months to make a baby even though you can make more than one at a
time.)
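In concrete numbers, using the thread's figures of 4-wide issue and one-cycle latency (real latencies vary by instruction): 12 independent one-cycle ops finish in 3 cycles, but a chain of 12 dependent ops takes 12. A back-of-the-envelope sketch in Go (the cycles helper is invented for illustration):

```go
package main

import "fmt"

// cycles estimates execution time for n one-cycle instructions on a
// machine that can issue `width` instructions per cycle.
func cycles(n, width int, dependent bool) int {
	if dependent {
		// Each instruction needs the previous result: latency-bound.
		return n
	}
	// Independent work is throughput-bound: ceil(n / width).
	return (n + width - 1) / width
}

func main() {
	fmt.Println("12 independent ops:", cycles(12, 4, false), "cycles") // 3
	fmt.Println("12 dependent ops:  ", cycles(12, 4, true), "cycles")  // 12
}
```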

~~~
Sean1708
> You can process four independent instructions per cycle in parallel

is this down to pipelining, or is that separate?

~~~
brucedawson
Pipelining is necessary as well, but super-scalar is the key ingredient here -
parallelism.

Pipelining means that instruction fetch, decode, execution, and retirement are
done in separate stages, each one taking at least one cycle.

Super-scalar/parallelism means that there are multiple units for all of these
stages so the CPU can fetch many instructions in parallel, decode many
instructions in parallel, etc.

The cool/critical thing here is that many Intel CPUs can execute three XOR
instructions in parallel but can retire _four_. Because the CPU recognizes XOR
of a register with itself as a special case it skips the execution stage
and can then process four per cycle.

Or, more likely in real code, the execution units are available for other
instructions to use.

~~~
a1369209993
Strictly speaking pipelining isn't actually _necessary_, it's just hard to
imagine why you'd bother with superscalar execution without first exhausting
the relatively-lower-hanging fruit of pipelining.

