Let's see how things have changed since 2004 when this was published:
> The x86 has a small number (8) of general-purpose registers
x86-64 added more general-purpose registers.
> The x86 uses the stack to pass function parameters; the others use registers.
OS vendors switched to registers for x86-64.
> The x86 forgives access to unaligned data, silently fixing up the misalignment.
Now ubiquitous on application processors.
> The x86 has variable-sized instructions. The others use fixed-sized instructions.
ARM introduced Thumb-2, with a mix of 2-byte and 4-byte instructions, in 2003. PowerPC and RISC-V also added some form of variable-length instruction support. On the other hand, ARM turned around and dropped variable-length instructions with its 64-bit architecture released in 2011.
> The x86 has a strict memory model … The others have weak memory models
Still x86-only.
> The x86 supports atomic load-modify-store operations. None of the others do.
As opposed to load-linked/store-conditional, which is a different way to express the same basic idea? Or is he claiming that other processors didn't support any form of atomic instructions, which definitely isn't true?
At any rate, ARM previously had load-linked/store-conditional but recently added a native compare-and-swap instruction with ARMv8.1.
> The x86 passes function return addresses on the stack. The others use a link register.
Still x86-only.
Apple M1 supports optional x86-style memory event ordering, so that its x86 emulation could be made to work without penalty.
When SPARC got new microcode supporting unaligned access, it turned out to be a big performance win, as the alignment padding had made for a bigger cache footprint. That was an embarrassment for the whole RISC industry. Nobody today would field a chip that enforced alignment.
The alignment penalty might have been smaller back when clock rates were closer to memory latency, but caches were radically smaller then, too, so even more affected by inflated footprint.
> as the alignment padding had made for a bigger cache footprint
I argued with some of the Rust compiler members the other day about wanting to ditch almost all alignment restrictions, for exactly this reason. They laughed and basically told me I didn't know what I was talking about. About 15 years ago, when I worked at a market-making firm, we tested this and it was a great gain; we started packing almost all our structs after that.
Now, at another MM shop, we're trying to push the same thing but having to fight these arguments again (the only alignments I want to keep are for AVX and hardware-accessed buffers).
There are other things you need to take into account too: padding can make a struct divide evenly into cache lines, which puts every element of an array at the same offsets relative to lines and cache sets, and that can aggravate conflict misses and false-sharing-like effects. Shrinking a struct from 128 bytes to 120 or 122 bytes staggers elements against cache-line boundaries, and that can significantly improve performance.
The last time I worked on a btree-based data store, changing the nodes from ~1024 bytes to ~1000 delivered something like a 10% throughput improvement. This was done by reducing the number of entries in each node, and not by changing padding or packing.
True. Another reason to avoid too much aligning is to help reduce reliance on N-way cache collision avoidance.
Caches on modern chips can keep up to some small fixed number of objects, often 4 (the associativity), resident at once when their addresses share the same offset into a page; performance may collapse if that number is exceeded. It is quite hard to tune to avoid this, but by making things not line up on power-of-two boundaries, we can at least avoid outright inviting it.
FWIW, it's still better to lay out your critical structures carefully, so that padding isn't needed. That way, you win both the cache efficiency and the efficiencies for aligned accesses.
One of the forms of 'premature optimization' that's often worth doing: align everything you can to the biggest power-of-two data size you can. Also, always use fixed-size data types, e.g. (modular) uint32 or (signed two's-complement) sint32 rather than int.
It's definitely received wisdom that may once have been right and no longer is.
Most people are not used to facts having a half-life, but many facts do; or rather, much knowledge does.
We feel very secure in knowing what we know, and the reality is that we need to be willing to question a lot of things, like authority, including our very own. Now, we can't be questioning everything all the time because that way madness lies, but we can't never question anything we think we know either!
Epistemology is hard. I want a doll that says that when you pull the cord.
Fifty bucks says it isn't even about performance, but is instead about passing pointers to C code. Zero-overhead FFI has killed a lot of radical performance improvements that Rust could have otherwise made.
I don't know, because nobody's actually posting a link to it.
This strikes me as likely. Bitwise compatibility with machine ABI layout rules has powerful compatibility advantages even in places where it might make code slower. (And, for the large majority of code, slower doesn't matter anyway.)
Of course C and C++ themselves have to keep to machine ABI layout rules for backward compatibility to code built when those rules were (still) thought a good idea. Compilers offer annotations to dictate packing for specified types, and the Rust compiler certainly also offers such a choice. So, maybe such annotations should just be used a lot more in Rust, C, and C++.
This is not unlike the need to write "const" everywhere in C and C++ because the inherited default (from before it existed) was arguably wrong. We just need to get used to ignoring the annotation clutter.
But there is no doubt there are lots of people who think padding to alignment boundaries is faster. And, there can be other reasons to align even more strictly than the machine ABI says, even knowing what it costs.
The topic at hand is, specifically, that nobody makes cores that enforce alignment restrictions anymore. So, it doesn't matter where the pointer goes. All that matters is if your compiler lays out its structs the same way as whoever compiled the code a pointer to one ends up in.
There are embedded targets that still enforce alignment restrictions, but you are even less likely to pass pointers between code compiled with different compilers, there.
> The topic at hand is, specifically, that nobody makes cores that enforce alignment restrictions anymore. So, it doesn't matter where the pointer goes.
Compilers can rely on alignment even if the CPU doesn't. LLVM does, which is why older versions of rustc had segfaults when repr(packed) used to allow taking references. While it would be pretty easy to get rustc to stop emitting aligned loads, getting Clang and GCC to stop emitting aligned loads might be trickier. https://github.com/rust-lang/rust/issues/27060
People arguing against changing struct layout rules are probably mainly interested, then, in maintaining backward compatibility with older Rust, itself.
Anyway, same policy applies: annotate your structs "packed" anywhere performance matters and bitwise compatibility with other stuff doesn't.
Rust does not keep backward compatibility like that. In the absence of any statement forcing a specific ABI, the only guaranteed compatibility is with code that's part of the same build, or else part of the current Rust runtime and being linked into said build.
Even in Rust, mapping memory between processes, or via files between past and future processes, happens. Although structure-blasting is frowned upon in some circles, it is very fast where allowed.
But... when you are structure-blasting, you are probably also already paying generous attention to layout issues.
Yes, there are sound reasons for it to be optional. It is remarkable how little the penalty is, on M1 and on x86. Apparently it takes a really huge number of extra transistors in the cache system to keep the overhead tolerable.
> ARM introduced Thumb-2, with a mix of 2-byte and 4-byte instructions, in 2003. PowerPC and RISC-V also [...]
x86 is still the weirdo. Both Thumb-2 and the RISC-V C extension (I don't know about PowerPC) have only 2-byte and 4-byte instructions, aligned to 2 bytes; x86 instructions can vary from 1 to 15 bytes, with no alignment requirement.
Power10 has prefixed instructions. These are essentially 64-bit instructions in two pieces. They seem odd even (especially?) to those of us who have worked with the architecture for a long time, and not much else supports them yet. Their motivation is primarily to represent constants and offsets more efficiently.
I suspect variable-length instructions are a big gain because you get to pack instructions more tightly and so have fewer cache misses. Though, obviously, it's going to depend on having an instruction set that yields shorter text for typical assembly than fixed-sized instructions would. (In a way, opcodes need a bit of Huffman encoding!)
Any losses from having to do more decoding work are probably offset by having sufficiently deep pipelines and enough decoders.
The counterpoint is that variable-length decoding introduces sequential dependence in the decoding, i.e. you don’t know where instruction 2 starts until you’ve decoded instruction 1. This probably limits how many decoders you can have. If you know all your instructions are 4B you can basically decode as many as you want in parallel.
A larger problem is that they're bad for security; you can hide malicious instructions from static analysis by jumping into the middle of a cleaner one. Or use it to find more ROP gadgets, etc.
I can imagine ways to deal with this, but x86 doesn't have them.
I think what happens in practice is that the decoders still speculatively decode in parallel and then drop all the misdecoded instructions. That's easy when instructions come in only a couple of sizes; hard and wasteful for something like x86.
I vaguely recall that LL/SC solves the ABA problem whereas load-modify-store does not.
It's been a while, so I'm going to define my understanding of the ABA problem in case I misunderstood it:
x86 only supplies cmpxchg instructions, which update a value only if it matches the passed-in previous value. There's a class of concurrency bugs where the value is modified away from its initial value and then modified back again; cmpxchg can't detect that condition, so when the difference matters, the 128-bit cmpxchg is often used with a counter in the second 64 bits that is incremented on every write to catch this case.
LL/SC will trigger on any write, rather than comparing the value, providing the stronger guarantee.
(Please correct me if this is inaccurate; it's been a hot minute since I learned this and I'd love to be more current on it.)
AIUI, a cmpxchg loop is enough to implement read-modify-write of any atomically sized value. The ABA problem becomes relevant when trying to implement more complex lock-free data structures.
Thank you for writing this. I was going to cover quite a lot of these points and you have done it so very succinctly.
It may be obvious, but I think it bears repeating: this blog entry should not reflect badly on Raymond C, as he was reporting on the architecture as it was at the time.
The 2022 follow-up also said "And by x86 I mean specifically x86-32." Also, I don't think he was on the AMD64 team yet at that time (still Itanium), so that probably has something to do with it.
In regard to memory alignment, it's even worse. Most instructions work on unaligned data, but some require 8-, 16-, 32-, or 64-byte alignment, and I think there are even some 128- and 256-byte cases. It's one of the more common pitfalls when coding x86-64 asm.
There is still one big thing that hasn't changed, and it has been the subject of debate over whether x86-64 fundamentally bottlenecks CPU architecture: variable-length instructions mean decoder complexity scales quadratically with width rather than linearly. It's been speculated that this is one reason even the latest x86 microarchitectures stick with relatively narrow decode, while Arm cores at lower performance levels (e.g. Cortex-X1/X2) are already 5-wide and Apple is 8-wide.
"As opposed to load-linked/store-conditional, which is a different way to express the same basic idea? Or is he claiming that other processors didn't support any form of atomic instructions, which definitely isn't true?"
It refers specifically to things like fetch_and_add, supported by RISC-V and IA-64, for example.
>>The x86 has a strict memory model … The others have weak memory models
> Still x86-only.
SPARC supported (eventually exclusively) TSO well before x86 committed to it. For a while Intel claimed to support some form of Processor Ordering which I understand is slightly weaker than TSO, although no Intel CPU ever took advantage of the weakened constraints.