I don't think the example for this (https://gcc.godbolt.org/z/35ytYW) is correct.

The highlighted instruction ("movabs rax, offset x") is not a load; it just moves the address of x into rax. The "offset x" operand is a 64-bit immediate (not a displacement). It will be resolved to the address of x with a relocation.
Indeed, you can get the compiler to emit this form of "movabs" targeting other registers, which contradicts the point that 64-bit displacements are specific to the a* registers: https://godbolt.org/z/3MYTUC
To get a load or store to a 64-bit displacement, I think you want something like this: https://godbolt.org/z/4QMtpo
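For reference, here is roughly how the two cases differ (a sketch in Intel syntax; the 0x1122... address is made up, and the exact operand spelling of the absolute-load form varies by assembler):

    # imm64 form (opcode B8+r): works with any register; the "address" is just a
    # 64-bit immediate that gets filled in by a relocation
    movabs rax, offset x        # 48 B8 <8-byte address of x>
    movabs rdx, offset x        # 48 BA <8-byte address of x>

    # moffs64 form (opcodes A0-A3): an actual load/store through a full 64-bit
    # absolute address, and this one really is limited to the a* registers;
    # disassemblers tend to print it as something like
    #   movabs rax, ds:0x1122334455667788   # 48 A1 88 77 66 55 44 33 22 11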
I noticed a few other things in the section about segments:
> The good news is that caring about them isn’t too bad: they essentially boil down to adding the value in the segment register to the rest of the address calculation.
Surprisingly, the segment register's value isn't added directly (my coworker and I discovered this recently). The segment bases are stored in model-specific registers, and I don't believe these are readable from user-space. https://en.wikipedia.org/wiki/X86_memory_segmentation#Later_...
If you try to read %fs directly, you'll get a completely unrelated value (an offset into a table, I think?).
Another surprising and unfortunate thing is that segment-qualified addresses don't work with lea. That means getting the address of a thread-local is not as simple as "lea rax, fs:[var]"; you actually have to do a load to get the base address of the thread-local block (e.g. fs:0). The first pointer in the thread-local block is reserved for this. That's why this function has to do a load before the lea: https://godbolt.org/z/jkt28n
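For the local-exec TLS model that comes out looking roughly like this (a sketch in Intel syntax; "var" stands in for whatever the thread-local is actually called):

    mov rax, qword ptr fs:0       # load the TLS block's base; fs:0 holds a pointer to itself
    lea rax, [rax + var@tpoff]    # now lea can add the thread-local's offset -> &var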
And yep! I need to make the language around the segment registers more precise: if I'm remembering right, the segment value itself gives you the GDT index (maybe only in 32-bit mode?), which you can then pull from.
For 64-bit modes I think those values essentially become nonsense because of the fsbase/gsbase MSRs, as you mentioned :-)
If I recall correctly, at least on 32-bit mode the segment registers were a 13-bit offset into either the GDT or the LDT (so a maximum of 8192 entries on each), 1 bit to choose between the GDT and the LDT, and 2 bits for the protection ring. The base and limit (and other details) were loaded from the GDT/LDT when the segment register was loaded; the undocumented SAVEALL/LOADALL instructions could save and load the "hidden" base/limit/etc directly, leading to tricks like the "unreal mode".
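You can still poke at the selector value itself from user space, even though the base it once loaded is hidden (a sketch in Intel syntax; register choices are arbitrary):

    mov ax, cs        # the selector is readable; the base/limit it loaded are not
    mov dx, ax
    and dx, 3         # bits 1:0 - requested privilege level (ring)
    mov cx, ax
    and cx, 4         # bit 2 - table indicator: 0 = GDT, 1 = LDT
    shr ax, 3         # bits 15:3 - 13-bit index into the chosen table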
I actually find the Intel manual on SIB bytes quite straightforward and useful. Section 2.1.5, and specifically Tables 2-2 and 2-3, shows really quite simply all possible values of the ModRM byte and their operands [1].
It can be quite a good exercise to try and produce your own hex opcodes from the tables using something like CyberChef [2].
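As a worked example of the kind of thing those tables let you do by hand (register choices are arbitrary; Intel syntax):

    # mov eax, dword ptr [rbx + rcx*4 + 8]   assembles to: 8B 44 8B 08
    #   8B   opcode  MOV r32, r/m32
    #   44   ModRM   mod=01 (disp8 follows), reg=000 (eax), rm=100 (SIB follows)
    #   8B   SIB     scale=10 (*4), index=001 (rcx), base=011 (rbx)
    #   08   disp8   +8
    mov eax, dword ptr [rbx + rcx*4 + 8]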
Basically, each ModRM pointer thingy goes through four levels of page-table indirection on every memory access, in order to turn virtual addresses into physical addresses. But it's mostly an implementation detail: access to those data structures is restricted to the operating system, unfortunately, and the closest thing we have to using them from user space is the mmap API.
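For the curious, this is roughly how the hardware slices up a virtual address under 4-level, 4 KiB paging (a sketch; assume the virtual address is in rdi, and the register choices are arbitrary):

    mov rax, rdi
    shr rax, 39
    and eax, 0x1ff      # level-4 (PML4) index, bits 47:39
    mov rcx, rdi
    shr rcx, 30
    and ecx, 0x1ff      # level-3 index, bits 38:30 (levels 2 and 1 repeat the 9-bit pattern)
    mov rdx, rdi
    and edx, 0xfff      # byte offset within the final 4 KiB page, bits 11:0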
I think it’s good to understand why compilers emit these.
The dominant reason is: it saves registers. x86 is register-starved even in 64-bit mode. Just 16 regs means shit gets spilled. If you also needed a tmp reg for each memory access, the way that makes things slower is that it causes spills - usually elsewhere - that wouldn't have been there if the address were computed by the instruction.
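Concretely (register names are arbitrary):

    # with the addressing mode: one instruction, no scratch register needed
    mov rax, qword ptr [rdi + rsi*8 + 16]

    # without it: the same load, but now rcx is live just to hold the address
    mov rcx, rsi
    shl rcx, 3
    add rcx, rdi
    mov rax, qword ptr [rcx + 16]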
It helps that CPUs make these things fast. But it can be hard to prove that using the addressing mode is faster in the absence of the register pressure scenario I described.
Well, if you care about register pressure you'll only be saving at most one register, since you can just do a shl/add sequence into that register to grab the address. You can save even that if you're willing to clobber one of your input registers when generating the address, or if you want a load into a register from that address later. So I thought the real reason it exists was that it saves you from the data dependency you'd otherwise create with the technique described above, plus you get that extra lea execution port/low-uop unit or whatever modern Intel processors have these days for doing this. Plus I guess you save a bit on decoding and code size too.
I recently wrote an x86-64 backend (as in like 4 years ago) and I recall that if address pattern matching (which is now very complete) was not there, you'd lose >1% perf across the board.
I wrote a pretty good x86-32 backend - for a totally different compiler (a whole-program, PTA-based AOT compiler for JVM bytecode) - a long-ass time ago, and vaguely remember it being like 10-15% there.
Note I’m talking about macrobenchmarks in both cases. And not individual ones - the average over many.
Also don't forget those functions that sit on the ledge of needing callee saves. Having any callee saves is more costly than having none. So if a function is using all of the volatile regs, makes no calls (so nothing has to be promoted into a callee-saved register), and you cause it to use one more register, then its prologue/epilogue slows down.
Registers matter a lot. :-)
And about your point about freeing up ports or other things: that may be a cool theory, but I'm just saying that it's hard to show that using those addressing modes is a speedup if register allocation doesn't care either way (i.e. the use of addressing modes doesn't help spills or prologues). Meaning, most of the reason why compilers emit these is for regalloc, and it's the only reason I've been able to detect as the one that changes perf across two different back ends.
> it can be hard to prove that using the addressing mode is faster in the absence of the register pressure scenario I described.
Maybe you personally weren't able to prove it to yourself with some very small microbenchmarking, but I really believe that if you measure bigger code built with one approach or the other, it should be relatively straightforward to demonstrate the advantage of using the more compact instructions.
These were giant macrobenchmarks. And I said hard, not impossible. As in, most code doesn’t care if you shift or lea, unless it affects regalloc, which it almost always does.
It’s true that having smaller code is better regardless of perf - so if perf was neutral we would still use those instructions.
The benefit of the smaller instructions for perf is better register allocation, as I said. So, for macrobenchmarks, using the smaller instructions is a win every single time. And that win comes mostly from fewer spills. That’s the point of what I’m saying.
I suppose someone could do the experiment of turning on instruction selection patterns for address modes but still “pinning down” a register as if it was needed for the shift-add sequence you would have otherwise emitted. Feel free to do that if you want to prove me wrong. But just hand waving that I must not have run big benchmarks isn’t going to help you learn more about compilers.
> I suppose someone could do the experiment of turning on instruction selection patterns for address modes but still “pinning down” a register as if it was needed for the shift-add sequence you would have otherwise emitted.
Starting with that as a thought experiment, isn't it expected that the bigger code can be shown to execute slower, as soon as the performance isn't measured with a microbenchmarking test that avoids stressing the caches?
Or to be more specific, imagine starting with your proposed modification of the compiler and recompiling both an OS and all the applications. Would the resulting change in the performance be measurable or not? I honestly can't imagine how it wouldn't.
> Starting with that as a thought experiment, isn't it expected that the bigger code can be shown to execute slower, as soon as the performance isn't measured with a microbenchmarking test that avoids stressing the caches?
I agree that it “should” be so. I’m just saying that based on data I’ve seen so far, I don’t think it actually is. But only one way to find out, and that’s to run the experiment.
Note that some apps care about icache for perf and some kinda don’t. Sometimes the thing that the program is blocked on in the cpu is data or something else, not instruction fetching, so if you make the fetching slower it won’t affect the end-to-end perf. Basically run time is not a linear function of overheads but rather something much more complex since there is queueing and asynchrony going on. It’s possible for one element of a program to be worse but it doesn’t affect running time because the CPU is blocked elsewhere.
And let's stop talking about microbenchmarks, ok? I don't use those. None of my claims are based on them.
> Or to be more specific, imagine starting with your proposed modification of the compiler and recompiling both an OS and all the applications. Would the resulting change in the performance be measurable or not? I honestly can't imagine how it wouldn't.
Yeah, that’s an experiment someone could try.
I think probabilistically, so I don't want to say that I believe the experiment will definitely go one way or another. But I can give you my odds:
- 20% chance you see a speedup of any kind, and only a 5% chance it'll be uncontroversial (remember, when you test big shit, some things will be faster and others slower and you'll have noise - so there's a good chance you'll see data that makes you feel like something is just wrong).
- 80% chance that any such speedup is less than a quarter of the speedup you'd see if you hadn't pinned down the register.
Thinking about it even more - if you did see a speedup and you had the ability to analyze why it sped up, then I bet that the extra work in the CPU (the separate shift/add instructions), rather than icache pressure, would turn out to be the cause.
On x86, I think an instruction with a memory operand always takes a cycle to compute the address, no matter how simple or complex it is, so if you do the computation in a separate instruction you are likely to add cycles.
Ah yes, takes me back. Page zero is kind of like a large bank of registers with the fast access instructions.
The one that I remember puzzling over back in the day was 'SEI' - I mean, why have an interrupt disable bit? Wouldn't it be more sensible to set and clear interrupts, rather than set and clear the disabling of interrupts?
Have been digging into segmentation and paging in Linux as well as x86_64 instruction encoding lately. Almost all the technically detailed information I know of elides discussion of historical context. Coming to these things for the first time, there is so much that feels counter to how one would want to design things if starting from scratch.
Thus, I spend quite a bit of thought trying to infer the historical constraints and motivations that give us x86's beauty, but I'd love to have some resources that could flesh out my understanding in this regard.
See if you can find any books about assembly language programming for the 8086 (or 8088) and the 80286. Just those two should give enough context for why things are the way they are on the x86 line.
Not sure about that. Segmentation as it was used on Intel CPUs before the 386 ("Real Mode", "Protected Mode") wasn't used in Linux (other than during boot, since the PC BIOS was running in 8086-compatible Real Mode). Linus was quite outspoken about the awkward programming model of those earlier CPUs, and there's a reason Linux didn't start before he got a 386 (he was accustomed to a 32-bit-wide flat address space from the 68008 in his earlier Sinclair QL).
Paging is an old OS concept and predates Intel CPUs and even Unix. How far back in time do you want to go?
I think some concepts which were influential in the early days of Linux are well covered in Minix (1.0) code and book (after all, Linus first experimented with a 386 scheduler for Minix).
xelxebar was asking about this historical background of the x86 architecture, not a historical background of Linux. Knowing how the x86 architecture evolved over time helps explain the oddities.
You could say this because it is true, but the term "encoding" is mostly used to refer to the binary representation. The textual, human-friendly representation of an instruction is often referred to as an "instruction mnemonic".
The "x86-64" name is the original one, and came from AMD themselves: https://web.archive.org/web/20000817071303/http://www.amd.co... (and "x86_64" is obviously an alias for where a hyphen is not an allowed character, like identifiers on many programming languages).
The "x64" name came from Microsoft, probably due to file name length limitations (this was before Windows XP unified the Windows 9x and Windows NT lines).
IIRC, the "AMD64" name came later, probably to distinguish it better from Intel's IA-64 (Itanium).
This is a great example because that person is confused by these terrible names - they've seen the name x64, and they assume there must be an x32... but there wasn't, back in 2009. They mean 32-bit x86, which is not the same thing as x32.
Unfortunately names get baked into things, so it's not as easy to change as just saying "let's change". "x86_64" is what Linux 'uname' and the gcc triplet naming conventions use, so that's what I go with, because it's the closest there is to a "standard" name in the software ecosystem I spend most time in.
AMD64 and EM64T are actually not identical. Though it doesn’t usually matter much in usermode, they are by specification not the same architecture. x86_64 is an umbrella for the various almost identical 64 bit extensions to x86.
I encountered that unfortunate fact while doing my term homework for a graduate-level computer architecture course. Failure to detect, or misdetection of, which kind of x86-64 you are running on will cause type confusion and eventually crash your code.
I did not perform an applied experiment, more a literature and benchmarks review; but Mayhem and Nubok[0] say this:
> Near branches with the 0x66 (operand size) prefix behave differently. One type of CPU clears only the top 32 bits, while the other type clears the top 48 bits.
This is more than enough to send your code to a wrong address and possibly cause a crash.
Yes, and I think it had another code-name before then. But taking x86-64 and making it x86_64 is even more lunacy - why change the punctuation to give us another subtly different name?
Why does everyone want to invent their own name for this thing?
> But taking x86-64 and making it x86_64 is even more lunacy
There are contexts where hyphens are not allowed but underscores are, like identifiers in many programming languages. Replacing the hyphen with an underscore is an obvious workaround.
Author here: I use AMD64 and x86_64 interchangeably (with a slight preference for the latter when publishing something, since it has more Google results than the former). I agree that the proliferation of names is an unfortunate mess.
FWICT, "x64" is mostly limited to Microsoft. I wouldn't mind that one being thrown out.
> FWICT, "x64" is mostly limited to Microsoft. I wouldn't mind that one being thrown out.
Yes, that one particularly, since 86 and 64 don't even have anything to do with each other. One is a product number and the other is a word width - why did they replace one with the other?!
x86_64 clearly shows it to be an extension of x86. The people designing things are not necessarily the people we should be listening to when considering what to call something, too; sometimes they're kind of bad at naming.
It's only called x86_64 instead of amd64 because Intel was able to lobby the right people to not use AMD's name when implementing the standard AMD designed and published.
Intel calls it IA-32e, which I assume means Intel Architecture 32-bit w/ Extensions for 64-bit, 128-bit, 256-bit, and 512-bit computation. My preferred term for x86_64 is NexGen32e, due to how AMD bought NexGen, which sold so much better than IA-64 that Intel ditched it and licensed the IP.
The K6 was a very competent chip, but you can see how the high-level architecture starts to look much more like the EV6 with the Athlon series.
Kinda weird that the author couldn't think of a use for base+index addressing. Doesn't it seem like the obvious application?
Anyway, the tone of the article is unnecessary, IMHO. These addressing modes are useful and easy to understand, and the address generation units do double-duty as low-latency, high-throughput add-and-shift units, via the LEA instruction. CISC is useful, after all.
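For example (register names are arbitrary), one lea folds a shift and two adds into a single flag-preserving instruction:

    lea rax, [rdi + rsi*4 + 10]   # rax = rdi + 4*rsi + 10, no memory access, flags untouched

    # the plain-ALU equivalent is a dependent chain and clobbers flags:
    mov rax, rsi
    shl rax, 2
    add rax, rdi
    add rax, 10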
Author here: it's an extremely obvious application! 'saagarjha points out that I chastise myself later on for missing it while trying to contrive samples. I'll blame that one on writing this post in the middle of the night :-)
With respect to the tone: it's a little flippant, sure. I work professionally on research programs that involve binary translating x86 (and other CISCs) into various representations for program analysis; what you're seeing is some of my frustration there bubble up.
> These addressing modes are useful and easy to understand, and the address generation units do double-duty as low-latency, high-throughput add-and-shift units, via the LEA instruction.
While I think you are right that LEA exists because of the memory addressing modes, at least on Intel (and I'm pretty sure AMD) it's been a long time since it was actually executed on the address generation units. Instead, it's [mostly] treated as just another arithmetic instruction and executed on the same integer ALUs that execute all the other simple integer math. According to Agner (https://www.agner.org/optimize/instruction_tables.pdf) it's been this way at least since the original Pentium.
Does anyone know what the last mainstream processor was that actually executed LEA on the AGU? And whether there any less mainstream processors that still do?
[mostly] I say mostly because there are a few odd addressing modes that have longer than usual latencies, and because the 3-argument form also takes longer than the usual 1-cycle latency. It's still executed on a standard integer port, though, and not on the AGU.
Sure, but there's still something special/magic about LEA. For one, the ports on which it dispatches are different from the ones available for ADD or SHL, even though it is effectively capable of SHL with register and immediate operands. And it's odd that a 2- or 3-component address is 1 uop. Even though it's in the kitchen sink with all the other arithmetic, it's still a special case, rather than being decomposed into several adds, shifts, and moves.
It's a nice tutorial on base-plus-index addressing, but from the title I expected a tutorial on pointer tags, since x86_64 is what makes tags even possible, i.e. we have a 64b address space but not 2^64 memory locations.
> i.e. we have a 64b address space but not 2^64 memory locations.
Except the designers foresaw this and established Canonical Addresses[0] to prevent people from using that "unused" space for tags. The space is explicitly reserved. This is probably why LuaJIT uses NaN tagging of doubles instead of tagged pointers, even though that causes an issue of its own[1].
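That's also why any scheme that does smuggle a tag into the upper bits has to strip it before dereferencing, e.g. (a sketch, assuming 48-bit virtual addresses and a 16-bit tag in the top of rax):

    shl rax, 16                 # push the tag out
    sar rax, 16                 # sign-extend from bit 47 to restore the canonical form
    mov rdx, qword ptr [rax]    # now safe to dereference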
On Arm, PAC and MTE eat that space instead. (And you'll have Morello with 128-bit pointers soon; let's see if it ends up being considered production-worthy for future Arm designs.)
Sure. Some software has to exist to make use of this system - for example, something has to create the tag in the first place, and malloc is a part of that - but the large address space is what makes tags possible.