
How x86_64 Addresses Memory - chmaynard
https://blog.yossarian.net/2020/06/13/How-x86_64-addresses-memory
======
haberman
> There’s one exception to this: x86_64 allows for a 64-bit displacement with
> the a* registers.

I don't think the example for this
([https://gcc.godbolt.org/z/35ytYW](https://gcc.godbolt.org/z/35ytYW)) is
correct.

The highlighted instruction ("movabs rax, offset x") is not a load, it's just
moving the address of x into rax. The "offset x" operand is a 64-bit immediate
(not a displacement). It will be resolved to the address of x with a
relocation.

Indeed, you can get the compiler to emit this form of "movabs" into other
registers, which contradicts the point that 64-bit displacements are a*
specific: [https://godbolt.org/z/3MYTUC](https://godbolt.org/z/3MYTUC)

To get a load or store to a 64-bit displacement, I think you want something
like this: [https://godbolt.org/z/4QMtpo](https://godbolt.org/z/4QMtpo)

~~~
woodruffw
Whoops! You’re absolutely right. I’ll fix that in a bit.

~~~
haberman
Cool. :) Nice article btw.

I noticed a few other things in the section about segments:

> The good news is that caring about them isn’t too bad: they essentially boil
> down to adding the value in the segment register to the rest of the address
> calculation.

Surprisingly, the segment register's value isn't added directly (my coworker
and I discovered this recently). The segment bases are stored in model-
specific registers, and I don't believe these are readable from user-space.
[https://en.wikipedia.org/wiki/X86_memory_segmentation#Later_...](https://en.wikipedia.org/wiki/X86_memory_segmentation#Later_developments)

If you try to read %fs directly, you'll get a completely unrelated value (an
offset into a table, I think?)

Another surprising and unfortunate thing is that segment-qualified addresses
don't work with lea. That means getting the address of a thread-local is not
as simple as "lea rax, fs:[var]". You actually have to do a load to get the
base address of the thread-local block (e.g. fs:0). The first
pointer in the thread-local block is reserved for this. That's why this
function has to do a load before the lea:
[https://godbolt.org/z/jkt28n](https://godbolt.org/z/jkt28n)

~~~
woodruffw
Thanks for the kind words!

And yep! I need to make the language around the segment registers more
precise: if I'm remembering right, the segment value itself gives you the GDT
index (maybe only in 32-bit mode?), which you can then pull from.

For 64-bit modes I think those values essentially become nonsense because of
the fsbase/gsbase MSRs, as you mentioned :-)

~~~
cesarb
If I recall correctly, at least on 32-bit mode the segment registers were a
13-bit offset into either the GDT or the LDT (so a maximum of 8192 entries on
each), 1 bit to choose between the GDT and the LDT, and 2 bits for the
protection ring. The base and limit (and other details) were loaded from the
GDT/LDT when the segment register was loaded; the undocumented SAVEALL/LOADALL
instructions could save and load the "hidden" base/limit/etc directly, leading
to tricks like the "unreal mode".

------
thornjm
I actually find the Intel manual on SIB bytes quite straightforward and
useful. Section 2.1.5, and specifically Tables 2-2 and 2-3, shows quite
simply all possible values of the ModRM byte and their operands [1].

It can be quite a good exercise to try to produce your own hex opcodes from
the tables using something like CyberChef [2].

[1]
[https://www.intel.com/content/dam/www/public/us/en/documents...](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf#page=38)

[2]
[https://gchq.github.io/CyberChef/#recipe=Disassemble_x86('32...](https://gchq.github.io/CyberChef/#recipe=Disassemble_x86\('32','Full%20x86%20architecture',16,0,true,true\)&input=OGIwMAo)

~~~
adito
You can append a #page=N directive to the URL to link directly to the
relevant page. Section 2.1.5 would be at this URL [1].

[1]
[https://www.intel.com/content/dam/www/public/us/en/documents...](https://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-instruction-set-reference-manual-325383.pdf#page=38)

------
jart
This is a good blog post about instruction encoding. I like to use this data
structure to describe how memory addressing works in long mode:

    
    
        register char (*(*(*(*ram)[512])[512])[512])[4096] asm("cr3");
    

Basically each modrm pointer thingy goes through four layers of page table
indirection for each memory access, in order to turn virtual memory addresses
into real memory addresses. But it's mostly an implementation detail where
access to the above data structure is restricted to the operating system only,
unfortunately, and the closest thing we have to using it is the mmap api.

------
pizlonator
I think it’s good to understand why compilers emit these.

The dominant reason is: it saves registers. X86 is register starved even in
64-bit mode. Just 16 regs means shit gets spilled. If you also needed a tmp
reg for each memory access, the way that makes things slower is that it causes
spills - usually elsewhere - that wouldn’t have been there if the address was
computed by the instruction.

It helps that CPUs make these things fast. But it can be hard to prove that
using the addressing mode is faster in the absence of the register pressure
scenario I described.

~~~
saagarjha
Well, if you care about register pressure, you'll only be saving at most one
register, since you can just do a shl/add sequence into a scratch register to
compute the address; and you can save even that if you're willing to clobber
one of your input registers when generating the address, or if you want a
load into a register from that address later. So I thought that the real
reason it existed was that it saves you from the data dependency you've just
created if you used the technique described above, plus you have that extra
lea execution port/low-uop unit or whatever modern Intel processors have
these days for doing this. Plus, I guess you save a bit on decoding and code
size too.

~~~
pizlonator
One register is a big deal on x86-64.

Even bigger deal on x86-32.

I recently wrote an x86-64 backend (as in, like 4 years ago) and I recall
that if address pattern matching (which is now very complete) was not there,
you'd lose >1% perf across the board.

I wrote a pretty good x86-32 backend - for a totally different compiler (a
whole program PTA-based AOT for JVM bytecode) a long ass time ago and vaguely
remember it being like 10-15% there.

Note I’m talking about macrobenchmarks in both cases. And not individual ones
- the average over many.

Also don’t forget those functions that sit on the ledge of caller saves.
Having any caller saves is more costly than having none. So if a function is
using all of the volatile regs, makes no calls (so no need to promote to
caller save), and you cause it to use one more register, then its
prologue/epilogue slows down.

Registers matter a lot. :-)

And about your point about freeing up ports or other things: that may be a
cool theory, but I’m just saying that it’s hard to show that using those
addressing modes is a speedup if register allocation doesn’t care either way
(i.e. the use of addressing modes doesn't help spills or prologues). Meaning,
most of the reason why compilers emit these is for regalloc, and it is the
only reason that I’ve been able to detect as being the one that changes perf
across two different back ends.

~~~
pizlonator
Stupid autocorrect of course turned most of my attempts to say callee save
with caller save.

~~~
SAI_Peregrinus
Can't you add "callee" to the autocorrect dictionary? I know I've done that on
Android with the AnySoftKeyboard app, not sure about others.

------
danharaj
Neat! I feel like wrapping my head around the 6502's indexed indirect
addressing mode a few weeks ago well-prepared me for this article :)

~~~
cesaref
Ah yes, takes me back. Page zero is kind of like a large bank of registers
with the fast access instructions.

The one that I remember puzzling over back in the day was 'SEI' - I mean, why
have an interrupt disable bit? Wouldn't it be more sensible to set and clear
interrupts, not set and clear the disabling of interrupts?

------
stephc_int13
> mov rbx, 0xacabacabacabacab

Am I the only one seeing this?

~~~
danharaj
all cats are beautiful

------
xelxebar
Have been digging into segmentation and paging in Linux as well as x86_64
instruction encoding lately. Almost all the technically detailed information I
know of elides discussion of historical context. Coming to these things for
the first time, there is _so much_ that feels counter to how one would want to
design things if starting from scratch.

Thus, I spend quite a bit of thought trying to infer the historical
constraints and motivations that give us x86's beauty, but I'd love to have
some resources that could flesh out my understanding in this regard.

~~~
spc476
See if you can find any books about assembly language programming for the 8086
(or 8088) and the 80286. Just those two should give enough context for why
things are the way they are on the x86 line.

~~~
guenthert
Not sure about that. Segmentation as it was used on Intel CPUs before the 386
("Real Mode", "Protected Mode") wasn't used in Linux (other than during boot,
as the PC BIOS was running in 8086-compatible Real Mode). Linus was quite
outspoken about the awkward programming model of those earlier CPUs, and
there's a reason Linux didn't start before he got a 386 (he was accustomed to
a 32-bit-wide flat address space from the 68008 in his earlier Sinclair QL).

Paging is an old OS concept and predates Intel CPUs and even Unix. How far
back in time do you want to go?

I think some concepts which were influential in the early days of Linux are
well covered in Minix (1.0) code and book (after all, Linus first experimented
with a 386 scheduler for Minix).

~~~
spc476
xelxebar was asking about this historical background of the x86 architecture,
not a historical background of Linux. Knowing how the x86 architecture evolved
over time helps explain the oddities.

------
sabas123
Nice blog!

Would it be fair to say that the following can be considered an encoding of
an instruction? ADD rax, rax

Since if we do, then we can actually say x86 has two forms of encoding: its
assembly form and its binary form, which is a bit interesting.

~~~
d_tr
You _could_ say this because it is true, but the term "encoding" is mostly
used to refer to the binary representation. The textual, human-friendly
representation of an instruction is often referred to as an "instruction
mnemonic".

------
chrisseaton
> x86_64

I don't know why people still use these crazy names. x86_64, x64, etc. The
people who designed it call it AMD64. Let's call it that.

~~~
woodruffw
Author here: I use AMD64 and x86_64 interchangeably (with a slight preference
for the latter when publishing something, since it has more Google results
than the former). I agree that the proliferation of names is an unfortunate
mess.

FWICT, "x64" is mostly limited to Microsoft. I wouldn't mind that one being
thrown out.

~~~
chrisseaton
> FWICT, "x64" is mostly limited to Microsoft. I wouldn't mind that one being
> thrown out.

Yes, that one particularly, since 86 and 64 don't even have anything to do
with each other. One is a product number and the other is a word width - why
did they replace one with the other?!

~~~
woodruffw
Complete shot in the dark, but I wouldn't be remotely surprised if there's a
`TCHAR arch[3]` somewhere deep in MSBuild.

"x86" fits nicely in there and they couldn't refactor it in time, so they just
decided to go with "x64".

------
jeffbee
Kinda weird that the author couldn't think of a use for base+index addressing.
Doesn't it seem like the obvious application?

Anyway, the tone of the article is unnecessary, IMHO. These addressing modes
are useful and easy to understand, and the address generation units do double-
duty as low-latency, high-throughput add-and-shift units, via the LEA
instruction. CISC is useful, after all.

~~~
nkurz
> These addressing modes are useful and easy to understand, and the address
> generation units do double-duty as low-latency, high-throughput add-and-
> shift units, via the LEA instruction.

While I think you are right that LEA exists because of the memory addressing
modes, at least on Intel (and I'm pretty sure AMD) it's been a _long_ time
since it's actually been executed on the address generation units. Instead,
it's [mostly] treated as just another arithmetic instruction and executed on
the same integer ALUs that execute all the other simple integer math.
According to Agner
([https://www.agner.org/optimize/instruction_tables.pdf](https://www.agner.org/optimize/instruction_tables.pdf))
it's been this way at least since the original Pentium.

Does anyone know what the last mainstream processor was that actually executed
LEA on the AGU? And whether there any less mainstream processors that still
do?

[mostly] I say mostly because there are a few odd addressing modes that have
longer-than-usual latencies, and because the 3-argument form also takes
longer than the usual 1-cycle latency. It's still executed on a standard
integer port, though, and not on the AGU.

~~~
jeffbee
Sure, but there's still something special/magic about LEA. For one, the ports
on which it dispatches are different from the ones available for ADD or SHL,
even though it is effectively capable of SHL with register and immediate
operands. And it's odd that 2- and 3-component addresses are 1 uop. Even
though it's in the kitchen sink with all the other arithmetic, it's still a
special case, rather than being decomposed into several adds, shifts, and
moves.

~~~
spc476
Generally speaking, the LEA instruction will never modify the flags, while ADD
and SHL will.

------
esmi
It's a nice tutorial on base-plus-index addressing, but from the title I
expected a tutorial on pointer tags, as x86_64 is what makes tags even
possible, i.e. we have a 64-bit address space but not 2^64 memory locations.

[https://www.mikeash.com/pyblog/friday-qa-2012-07-27-lets-bui...](https://www.mikeash.com/pyblog/friday-qa-2012-07-27-lets-build-tagged-pointers.html)

And for ARM.

[https://www.mikeash.com/pyblog/friday-qa-2013-09-27-arm64-an...](https://www.mikeash.com/pyblog/friday-qa-2013-09-27-arm64-and-you.html)

~~~
saagarjha
Actually, Objective-C's tagged pointers mostly rely on malloc's alignment
guarantees.

~~~
dan-robertson
This is the case for most tagged-pointer systems. Indeed, most of them come
from a time when 32-bit support was required.
