
Loading an x64 register, how hard could it be? - mpu
http://c9x.me/art/notes.html?9/19/2015
======
tbirdz
Is this really that crazy? A 64-bit immediate value takes 8 bytes. So of that
10 byte instruction, 8 bytes of it are the value to load into the register.
Similarly, a 32-bit immediate value takes 4 bytes, so 4 bytes of the other
instructions are the 32-bit immediate values. Taking this into account, we see
that the non-immediate overhead is 1 byte for movl, 3 bytes for movq (REX
prefix, opcode, and ModRM), and 2 bytes for movabsq.

I don't really think this is as crazy as the article is implying.
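
For anyone who wants to check the byte counts, here is a minimal Python
sketch that assembles the three forms by hand. The helper names are
hypothetical and the destination is hard-coded to eax/rax; it illustrates the
standard encodings and is not code from the article. The sign-extended movq
form comes out at 7 bytes, which matches the correction further down the
thread.

```python
import struct

def mov_eax_imm32(value):
    # B8+rd id: opcode B8 selects eax, then a 32-bit little-endian immediate.
    return bytes([0xB8]) + struct.pack("<I", value & 0xFFFFFFFF)

def mov_rax_simm32(value):
    # REX.W C7 /0 id: the 32-bit immediate is sign-extended to 64 bits.
    return bytes([0x48, 0xC7, 0xC0]) + struct.pack("<i", value)

def movabs_rax_imm64(value):
    # REX.W B8+rd io: the mov form that carries a full 64-bit immediate.
    return bytes([0x48, 0xB8]) + struct.pack("<Q", value & 0xFFFFFFFFFFFFFFFF)

def overhead(encoding, imm_size):
    # Number of bytes that are not part of the immediate.
    return len(encoding) - imm_size

for name, enc, imm_size in [
    ("movl", mov_eax_imm32(1), 4),
    ("movq", mov_rax_simm32(-1), 4),
    ("movabsq", movabs_rax_imm64(1 << 33), 8),
]:
    print(name, enc.hex(), len(enc), "bytes,", overhead(enc, imm_size), "overhead")
```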

~~~
userbinator
I think the author is saying what's crazy is the lack of 64-bit immediates in
a machine that contains 64-bit registers. There's an instruction to add a
32-bit immediate (and I believe 8 and 16 as well), but not a 64-bit one.

~~~
devit
It's rare for programs to contain integer constants that don't fit in 32 bits,
and they aren't needed for 64-bit code/global memory references due to the
introduction of addressing relative to the instruction pointer.

Also, there is a 15-byte instruction length limit that would have to be
extended if 64-bit immediates were allowed on all instructions.

------
gsg
Populating x86-64 floating point registers is also an amusing subject.

The obvious instruction for loading a (64-bit) float into an xmm register is
movsd. With a memory source operand, the higher part of the register is
zeroed, which is what you want. No problem.

Now the fun part: if the source is not memory but another xmm register, the
higher part of the register is _not_ zeroed. This induces a false dependency
on the previous value of the destination register that can cause performance
issues. To avoid this problem, such register-register copies should be done
with a packed move instruction. (Or vmovsd, but that was added much later.)

The obvious packed move instruction for 64-bit floats is movapd, but we can do
better than that by using movaps - it is still a float domain instruction but
is a byte smaller.

So the optimal way to move a single double from one register to another is to
use a vector move of the wrong type.
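
A rough Python sketch of both points, for illustration only. The byte
sequences are the register-to-register encodings for xmm0 <- xmm1, and the
two small functions are a hypothetical model that treats an xmm register as
a (low, high) pair of 64-bit halves; they are not a real emulator.

```python
# Register-to-register encodings for "xmm0 <- xmm1" (ModRM byte C1).
ENCODINGS = {
    "movsd":  bytes([0xF2, 0x0F, 0x10, 0xC1]),  # scalar double, F2 prefix
    "movapd": bytes([0x66, 0x0F, 0x28, 0xC1]),  # packed double, 66 prefix
    "movaps": bytes([0x0F, 0x28, 0xC1]),        # packed single, no prefix
}

def movsd_reg(dst, src):
    # Only the low half is written; the high half of dst survives, so the
    # result depends on the old value of the destination.
    return (src[0], dst[1])

def movaps_reg(dst, src):
    # The whole register is overwritten; the old dst value is irrelevant.
    return src

for name, enc in ENCODINGS.items():
    print(name, enc.hex(), len(enc), "bytes")
```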

------
waterhouse
> "it is impossible to add 2^33 to rax using one instruction only."

This can in fact be done, with a memory operand. I'm not sure about the
performance compared to a 64-bit load immediate followed by an add, but this
will do it (NASM syntax):

            add rax, [rel the_constant]
            ...

    the_constant:
            dq 8589934592           ; 2^33
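
The reason the constant has to live in memory is that add only accepts an
immediate that fits in a sign-extended 32 bits. A small Python sketch of that
range check (fits_simm32 is a hypothetical helper name, not an assembler
API):

```python
def fits_simm32(value):
    # True if value can be encoded as a 32-bit immediate that the CPU
    # sign-extends to 64 bits.
    return -(1 << 31) <= value <= (1 << 31) - 1

THE_CONSTANT = 8589934592  # the value stored by "dq" in the snippet above

print(THE_CONSTANT == 1 << 33)      # the constant really is 2^33
print(fits_simm32(THE_CONSTANT))    # too large for an immediate operand
print(fits_simm32((1 << 31) - 1))   # largest positive value that still fits
```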

~~~
raverbashing
This is the right answer.

ARM does this a lot: its immediate operands only let you pick an 8-bit number
and rotate it by an even amount, so anything that doesn't fit gets loaded from
a constant stored near the code.

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0473l/dom1359731145835.html
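
To make the "8-bit number, rotated" rule concrete, here is a minimal Python
sketch. It assumes the classic 32-bit ARM data-processing encoding (an 8-bit
value rotated right by twice a 4-bit field); the function names are
hypothetical.

```python
def rotl32(x, r):
    # Rotate a 32-bit value left by r bits.
    r %= 32
    return ((x << r) | (x >> (32 - r))) & 0xFFFFFFFF

def arm_imm_encoding(value):
    # Return (imm8, rot) such that value == imm8 rotated right by 2*rot,
    # or None if no such pair exists.
    value &= 0xFFFFFFFF
    for rot in range(16):
        imm8 = rotl32(value, 2 * rot)  # undo a right rotation by 2*rot
        if imm8 <= 0xFF:
            return imm8, rot
    return None

for v in (0xFF, 0xFF000000, 0x3FC, 0x101, 0x1FE):
    print(hex(v), arm_imm_encoding(v))
```

Values such as 0x101 or 0x1FE have no encoding under this rule, which is the
case where the constant has to come from memory instead.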

------
justin_
For those wondering what he meant by the "data dependency" issue that zeroing
out the upper 32 bits avoids, the first answer to this SO question does a
decent job of explaining it:
https://stackoverflow.com/questions/11177137/why-do-most-x64-instructions-zero-the-upper-part-of-a-32-bit-register

------
WalterBright
It's actually 7 bytes, not 6, to load a sign extended value into a register.

    48 C7 C0 FF FF FF FF    mov     RAX,0FFFFFFFFh
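
Because the immediate is sign-extended, those seven bytes leave all 64 bits
of RAX set, not just the low 32. A small Python sketch of the decoding
(decode_mov_rax_simm32 is a hypothetical helper that only understands this
one instruction form):

```python
import struct

def decode_mov_rax_simm32(encoding):
    # Accept only REX.W C7 C0 followed by a 32-bit immediate.
    if len(encoding) != 7 or encoding[:3] != bytes([0x48, 0xC7, 0xC0]):
        raise ValueError("not a 'mov rax, simm32' encoding")
    (imm,) = struct.unpack("<i", encoding[3:])  # signed 32-bit, little-endian
    return imm & 0xFFFFFFFFFFFFFFFF             # value after sign extension

rax = decode_mov_rax_simm32(bytes.fromhex("48C7C0FFFFFFFF"))
print(hex(rax))
```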

~~~
mpu
Thank you, I fixed that. In my head I knew a REX prefix was necessary and
thought it was sufficient, but i386 has shortcuts to load an immediate into a
register that are not available in the 64-bit version!

Thanks for mentioning that.

------
transfire
Makes one miss the 6502.

~~~
tacos
A better example is the 6809. LDA #$FF takes two bytes and two cycles. LDX
#$FFFF takes three bytes and three cycles. "Isn't that crazy?" Sigh.

