
X86 Addressing Under the Hood - ScottWRobinson
http://paul.bone.id.au/2018/09/26/more-x86-addressing/
======
userbinator
x86 instruction encoding is best viewed in octal --- not just the ModRM, but
the primary opcode too:

[http://www.dabo.de/ccc99/www.camp.ccc.de/radio/help.txt](http://www.dabo.de/ccc99/www.camp.ccc.de/radio/help.txt)

You can mentally assemble/disassemble the bulk of the commonly encountered
instructions by memorising a few tables (in octal), the addressing modes being
one of them. In 16-bit the memory addressing modes can be described as "one or
more of {displacement}{BX,BP}{SI,DI}" and 32-bit "one or more of
{displacement}{register}{scaled register}" with (e)BP as a special case.
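
As a rough illustration of the octal view (a sketch, not taken from the linked text): splitting a byte into 2-3-3 bit groups gives its three octal digits, which line up with the mod/reg/rm fields of ModRM and, for the classic ALU opcodes, with the operation and the operand form. The helper names below are made up for this example, and only the register-direct `op r/m32, r32` form is handled.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]
ALU_OPS = ["add", "or", "adc", "sbb", "and", "sub", "xor", "cmp"]


def octal_digits(byte):
    """Split a byte into its 2-3-3 bit groups, i.e. its three octal digits."""
    return (byte >> 6) & 3, (byte >> 3) & 7, byte & 7


def describe_alu_reg_reg(opcode, modrm):
    """Disassemble 'op r/m32, r32' (opcodes 0o0X1) with a mod=3 ModRM byte."""
    top, op, form = octal_digits(opcode)
    mod, reg, rm = octal_digits(modrm)
    if top != 0 or form != 1 or mod != 3:
        raise ValueError("only opcodes 0o0X1 with mod=3 are handled in this sketch")
    return f"{ALU_OPS[op]} {REG32[rm]}, {REG32[reg]}"


# Bytes 01 D8 read as octal 001 330: op 0 (add), mod 3, reg 3 (ebx), rm 0 (eax).
print(f"{0x01:03o} {0xD8:03o}", describe_alu_reg_reg(0x01, 0xD8))  # 001 330 add eax, ebx
```

Read in hex, `01 D8` says very little; read in octal, the middle digit of the opcode picks the operation and the last two digits of the ModRM byte are simply the register numbers.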

------
vardump
A good homework exercise is to implement an x86-64 ModRM encoder.

There are quite a few corner cases. Sometimes you'll need to up-convert input
parameters: some combinations can't use an 8-bit offset, for example, and
require a 32-bit value instead. RSP can't be used as the scaled index, etc.
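
To make those corner cases concrete, here is a minimal sketch of such an encoder. It is a simplification under stated assumptions: the function name and signature are hypothetical, it only accepts register numbers 0-7 (so no REX prefix handling), it does not cover RIP-relative addressing, and it emits only the ModRM byte, the optional SIB byte, and the displacement.

```python
import struct

RSP, RBP = 4, 5  # register numbers that are special in the rm/base/index fields


def encode_modrm(reg, base=None, index=None, scale=1, disp=0):
    """Return ModRM [+ SIB] [+ displacement] for [base + index*scale + disp]."""
    if index == RSP:
        raise ValueError("rsp cannot be used as a scaled index")
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4 or 8")
    for r in (reg, base, index):
        if r is not None and not 0 <= r <= 7:
            raise ValueError("this sketch only handles registers 0-7 (no REX)")
    if not -2**31 <= disp < 2**31:
        raise ValueError("displacement must fit in 32 bits")

    if base is None:
        # No base register: the only encoding is mod=00 with a forced disp32,
        # even if the displacement would fit in 8 bits ("up-conversion").
        mod, disp_bytes = 0b00, struct.pack("<i", disp)
    elif disp == 0 and base != RBP:
        mod, disp_bytes = 0b00, b""
    elif -128 <= disp <= 127:
        # Also covers [rbp] with disp 0: mod=00 with rm/base=101 means
        # "no base", so a zero disp8 has to be emitted instead.
        mod, disp_bytes = 0b01, struct.pack("<b", disp)
    else:
        mod, disp_bytes = 0b10, struct.pack("<i", disp)

    # A SIB byte is needed for any index, for rsp as base (rm=100 is the SIB
    # escape), and for base-less addresses in 64-bit mode.
    need_sib = index is not None or base is None or base == RSP
    if not need_sib:
        return bytes([mod << 6 | reg << 3 | base]) + disp_bytes

    ss = {1: 0, 2: 1, 4: 2, 8: 3}[scale]
    sib_index = RSP if index is None else index  # index=100 means "no index"
    sib_base = RBP if base is None else base     # base=101 with mod=00: "no base"
    modrm = mod << 6 | reg << 3 | 0b100
    sib = ss << 6 | sib_index << 3 | sib_base
    return bytes([modrm, sib]) + disp_bytes
```

For example, `encode_modrm(6, base=6, index=6, scale=2).hex()` gives `3476`, which matches the last two bytes of `lea rsi, [rsi + rsi*2]` (`48 8D 34 76`).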

There's also plenty of redundancy:
[https://www.strchr.com/machine_code_redundancy](https://www.strchr.com/machine_code_redundancy)

------
dsamarin
I once had to multiply the `rsi` register by 3 so you bet I used `imul rsi,
3`, right? Nope. This does the exact same thing with an unknown or small
performance advantage:

    lea rsi, [rsi + rsi*2]

~~~
userbinator
Definitely an advantage because AFAIK that doesn't go through the ALU but
instead uses the dedicated address calculation unit, which is going to be
faster than a multiply instruction.

A related lea trick is combining a shift, an add, an add (or subtract) of an
immediate, and a mov into a single instruction that doesn't modify flags, e.g.

    lea eax, [1234+ebx+ecx*4]

performs the equivalent of:

    eax = ebx + ecx*4 + 1234
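
A small model of that arithmetic, as a sketch rather than anything authoritative: the function names are invented for illustration, registers are modelled as Python integers wrapped to the operand width, and flags are not modelled at all. It checks that the single lea matches the mov/shl/add/add sequence and that the multiply-by-3 trick from the parent comment holds modulo the register width.

```python
def lea(base, index, scale, disp, bits=32):
    """Model lea r, [disp + base + index*scale]; the result wraps to `bits` bits."""
    return (base + index * scale + disp) & ((1 << bits) - 1)


def separate_steps(ebx, ecx):
    """The same computation as four 32-bit instructions."""
    mask = 0xFFFFFFFF
    eax = ecx                    # mov eax, ecx
    eax = (eax << 2) & mask      # shl eax, 2
    eax = (eax + ebx) & mask     # add eax, ebx
    eax = (eax + 1234) & mask    # add eax, 1234
    return eax


# Includes values that overflow 32 bits to exercise the wrap-around.
for ebx, ecx in [(10, 3), (0xFFFFFFFF, 1), (0x80000000, 0x40000000)]:
    assert lea(ebx, ecx, 4, 1234) == separate_steps(ebx, ecx)

# lea rsi, [rsi + rsi*2] versus a multiply by 3, at 64-bit width.
for rsi in [0, 7, 0xFFFFFFFFFFFFFFFF]:
    assert lea(rsi, rsi, 2, 0, bits=64) == (rsi * 3) & 0xFFFFFFFFFFFFFFFF
```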

~~~
BeeOnRope
Not anymore, at least on modern Intel - lea instructions go through the ALU
like other ALU instructions.

It's still an advantage over imul in this case, because it is a so-called
"two-component lea" (just a base register and an index register), so it has 1
cycle of latency and can execute at a throughput of 2 per cycle, versus 3
cycles of latency and 1 per cycle for imul.

