x32 binaries are different from x86 programs in that they get access to 64-bit x86 registers, while their pointers are 32-bit.
It would be really cool if some Linux distro provided both x64 and x32 packages for programs that are expected to rarely, if ever, use more than 4 GiB.
Similarly, it would be cool if Chrome and Firefox spawned x32 processes for most web pages, and only spun up an x64 process when a web page exceeded, say, 2 GiB of memory usage (transparently and quickly migrating the page over to the new process).
FWIW I recently tried this on https://www.oilshell.org, thinking the code would be faster due to better cache locality. There are many pointer-rich data structures, and they should pack much more tightly when pointers are 32-bit.
It definitely saved memory, but it was slightly slower or the same speed. My guess is that 32-bit performance isn't prioritized on x86-64 by either compiler writers or chip designers, i.e. they optimize for the common case of 64 bit pointers.
Definitely interested in other comments / opinions / data on this.
I think if it actually sped programs up, you would see many more distros compiling packages for 32-bit. It does save memory, but somehow that's not a big enough motivation. Probably because desktops/laptops have 8GB+ RAM these days (enough for most apps), and Linux systems with constrained memory are probably already 32-bit.
And because real memory hogs like browsers probably want to address more than 4GB. Though to be honest I think a multi-process browser with a 4GB-per-origin limit sounds like a very good idea!
Another issue is probably shared libs. Linux distros are pretty tightly coupled in this respect, and if core system programs like the compiler want to use 64-bit libc, etc., then it's economical for other programs to be 64-bit as well. Otherwise the dependency management problem for shared libs becomes more annoying.
> It definitely saved memory, but it was slightly slower or the same speed. My guess is that 32-bit performance isn't prioritized on x86-64 by either compiler writers or chip designers, i.e. they optimize for the common case of 64 bit pointers.
That may be partially to blame, but depending on your implementation (was this x32 or actual compressed pointers?) the overhead of “decompressing” each pointer dereference may have been a significant factor. For null-safe languages like Java or JavaScript you already have null checks and may have GC barrier checks on most dereferences, so the extra costs are already expected, but adding a shift and/or constant add to each access is going to be a more noticeable cost compared to plain accesses in native code.
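To make that decompression cost concrete, here's a rough generic sketch (not any particular VM's scheme; the names are made up) of a base+offset access:

#include <cstdint>

// Hypothetical scheme: every heap object lives within 4 GiB of a reserved
// base address, so each "pointer" field only stores a 32-bit offset.
struct Heap {
  char *base;  // start of the reserved region
};

inline void *Decompress(const Heap &heap, uint32_t offset) {
  // The per-dereference overhead is this add (plus a shift, if the scheme
  // also drops alignment bits) on top of loading the 32-bit field itself.
  return heap.base + offset;
}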
This is dirt simple native C++ code compiled with -m32, with structs and pointers (which happens to be generated, but that doesn't matter here)
I would expect that dereferencing a 32-bit pointer on x86-64 has zero overhead compared to dereferencing a 64-bit pointer (though I didn't test it). The only way I can see it being slower in the context of a real program is if the base address is an operand that has to be loaded from memory. But it shouldn't have to be, because it's a 32-bit address space and the mapping should be handled by the VMM, i.e. the base address is constant for the entire process.
-m32 doesn't target the x32 ABI though, it targets x86. Per the manual:
> The -m32 option sets int, long, and pointer types to 32 bits, and generates code that runs on any i386 system.
That restricts you to the smaller register set, removes all amd64 instructions, and uses the x86 ABI (on Linux, parameters are passed on the stack instead of in registers.) I suspect that this is why your program became slower, as opposed to using 32-bit pointers.
I think that the answer here is "not easily". To use the x32 ABI, you would need:
1. Kernel support to be enabled in order to make x32 system calls. Checking /proc/config.gz on Arch Linux, my kernel does not appear to have it enabled.
2. All of the libraries you link with to be recompiled for the x32 ABI.
x86 is definitely easier to build against since there's still legacy software using it, whereas almost nobody seems to use x32. However, over the long term I expect x86 compatibility will bit-rot too.
Yeah it sounds like x32 is really only practical for embedded systems, if anything
Limited use of registers in x86 does sound like a good explanation for it being a wash in terms of speed
It seems like there is very little language-level support for small pointers. Only the unreleased Jai language seems to have "relative pointers" (or maybe they took this out)
The Zig compiler recently replaced pointers with integers in their code representation for memory/speed, but lost some type safety in the process -- i.e. a T* isn't a U*, but indices into t[] and u[] are the same type. I'm not sure I like that tradeoff, and it seems like language-level support would help
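As a sketch of what even minimal library-level support could look like (C++ here, names made up), a thin typed wrapper keeps the stored value at 32 bits while making indices into different pools distinct types again:

#include <cstdint>
#include <vector>

// Index<T> and Index<U> are different types even though both are just a uint32_t.
template <class T>
struct Index {
  uint32_t value;
};

template <class T>
T &Deref(std::vector<T> &pool, Index<T> idx) {
  return pool[idx.value];  // passing an Index<U> here is a compile-time error
}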
I would definitely encourage you to give compiling Oilshell for x32 a shot.
It definitely _is_ a pain to compile for it.
The x32 ABI is not switched on by default on any distro that I'm aware of, so you would need to recompile your kernel to get support for it. The poor support is probably because almost no one is using it. There was even a kernel thread discussing removing support for it.
So, from the point of view of distributing to end users, the x32 ABI is pretty much a dead end (for now, until distros start turning the flag on in their standard default kernels).
But I would definitely be interested in your results. Most likely, you should see a smaller footprint + a speed boost.
For x86 you have a lot fewer registers, which makes it a bit of a wash. x32 is interesting because you get all the registers but keep the pointers small. Sadly it's not well supported.
I'm gonna go out on very thin ice and disagree with a Donald Knuth quote:
> It is absolutely idiotic to have 64-bit pointers when I compile a program that uses less than 4 gigabytes of RAM. When such pointer values appear inside a struct, they not only waste half the memory, they effectively throw away half of the cache.
In general if shrinking the pointer size is a huge optimization, your data structures have too many pointers in them. Learning Rust was a turning point for me here. The borrow checker encouraged me to get rid of references to parents or siblings. In general, I ended up passing parent references on the program stack and having sibling references (e.g. in graph structures) via indices into a slab, which can be u32 rather than pointer-sized. You could also write a SmallVec/SmallString that uses u32 for len/capacity rather than usize, which would be a nice savings if you have lots of them. And/or move the len/capacity inside the heap allocation, so empty/unallocated vecs/strings don't have those fields at all. Or do what Rust's smallvec crate does and C++'s std::string's "small string optimization" does: carve out the space for them but use it for actual object storage until it overflows and you need to move to the heap.
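For instance, a compact vec header along these lines (a rough sketch, not a drop-in replacement for std::vector or Vec) drops the per-instance overhead from 24 bytes to 16:

#include <cstdint>

// Sketch of a "compact" vector header: one 8-byte pointer plus two u32 fields
// is 16 bytes, versus 24 bytes for the usual pointer + two word-sized fields.
template <class T>
struct CompactVec {
  T *data;
  uint32_t len;
  uint32_t capacity;
};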
If you're working in a language that doesn't have value objects (that is, all user-defined structures are implicitly boxed), my sympathy, you will just end up with a bunch of extra pointers, which is why this kind of optimization is important for Java and JS. I consider lack of value objects to be a significant design flaw in a language.
> What you're describing is literally pointer compression. You're using offset indices that point to a compressed/sufficiently small address space.
True, but it's finer-grained than a global switch. I can decide that I will have at most 2^32-1 objects in this particular graph/slab without giving up the ability for my program to address more than 4 GiB of RAM in total as it would with the x32 ABI. Also, this trick is only a small part of reducing pointer storage. I mentioned passing stuff in via the stack vs keeping it in the data structure. I forgot to mention: use vecs rather than e.g. linked lists, open addressing hash maps rather than chaining, btrees rather than red/black trees, etc. This is also stuff that Rust's std library encourages.
btw: of course none of the techniques I mention are unknown to Donald Knuth. He likely either wants programs written by others to run more quickly, or he's optimizing well beyond the point I'd care to. I'd rather split the difference, basically: decrease pointer density a fair bit and keep the pointers at 64-bit.
You're basically screaming "don't use so many pointers!", but this is much easier said than done. You can only really do this if you write the entire stack (all the dependencies) yourself, just so you can optimize like this. This approach doesn't scale or compose well at all.
That's just like saying you can only optimize CPU usage if you've written the world perfectly from scratch. In reality, profiles aren't flat. A little effort in the right spot can go a long way for any optimization goal.
A big problem is that it requires an entire ABI and operating system support to handle this case, when the alternative is... just use a custom heap in your own program. You're describing an insane level of engineering for something that is mostly hidden and can be handled just as effectively by the applications themselves.
Like, if you're in the weird case where A) pointer chasing is killing your performance and blowing your cache and B) your working set fits in 4GB, you can just fix this yourself with your own custom allocator to handle that. You'll still be pointer chasing, mind you, just half as much (and there's the thing: you could very well be much better off reducing the chasing you're doing versus mitigating it this way.)
In reality, superscalar out-of-order processors hide miss latency very effectively, which mitigates A, and B isn't always clear from the start. So here we are.
Don't get me wrong, I want more efficient programs, and centralizing some complexity like ABI is a good thing. But in practice x32 was a massive technological investment for something that mostly didn't matter, that people mostly didn't need, and whose benefits can be gotten much more easily and directly in a lot of cases.
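To make the "custom heap" route concrete, here's a minimal sketch of the kind of thing I mean (reserve one region up front, hand out 32-bit offsets into it; error handling and freeing omitted, names made up):

#include <cstdint>
#include <sys/mman.h>

// Reserve a single 4 GiB region up front; objects are identified by 32-bit
// offsets from its base, so the "pointers" stored in data structures are 4 bytes.
struct Arena {
  char *base;
  uint64_t used;

  static Arena Create() {
    void *p = mmap(nullptr, uint64_t{1} << 32, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    return Arena{static_cast<char *>(p), 0};
  }

  uint32_t Alloc(uint64_t size) {       // bump allocation, no free()
    uint64_t offset = used;
    used += (size + 7) & ~uint64_t{7};  // keep everything 8-byte aligned
    return static_cast<uint32_t>(offset);
  }

  void *Resolve(uint32_t offset) const { return base + offset; }
};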
I agree that the x32 ABI is probably a technological dead end, but I don't agree with "just" use a custom heap. I tried in C++ and it's not easy. You lose a lot of type safety and you have to basically rewrite every line of your program.
Another program I remember that uses small integers instead of pointers is the Sorbet type checker (not coincidentally it also has a fine-grained AST code representation): https://blog.nelhage.com/post/why-sorbet-is-fast/
I peeked at the code a while back, and it's not really idiomatic C++. It's obviously doable, but you have to architect the program for it from the start -- it's not feasible to retrofit this technique onto existing software.
I wonder if you could apply a simpler version of this to ordinary C/C++ programs? If you allocate a chunk on the heap with 8-byte alignment, then you can throw away the lowest 3 bits of precision, and then you can address 32GB with a 32-bit pointer. You'd need the cooperation of the memory allocator to ensure it gives you addresses in the first 32GB of address space, and it might interfere with things like ASLR, but it doesn't seem impossible.
Something like this in C++:
#include <cstdint>

template<class T>
class CompressedPointer {
 public:
  T &operator*() const {
    // Shift back left by 3 to recover the original 8-byte-aligned address.
    return *reinterpret_cast<T *>(static_cast<uintptr_t>(ptr) << 3);
  }
  /* It's not safe to construct a CompressedPointer<T> from an arbitrary T*;
     you probably want to malloc and construct via the CompressedPointer itself */
  template<class... Args>
  static CompressedPointer malloc(Args&&... args) {
    /* allocate 8-byte aligned sizeof(T) in first 32GB, placement-new a
       T(std::forward<Args>(args)...) into it, and store the address >> 3 */
  }
 private:
  uint32_t ptr;
};
[Edit] Extra crazy idea: Take an existing large C++ codebase. For each struct T in the codebase, use static analysis to determine if T is ever stack-allocated (or otherwise incompatible with CompressedPointer). If not, automatically rewrite all "T*" to "CompressedPointer<T>" and all "new T(...)" with "CompressedPointer<T>::malloc(...)". I suspect this would work for 90% of the heap allocations in a typical C++ codebase...
> If you allocate a chunk on the heap with 8-byte alignment, then you can throw away the lowest 3 bits of precision, and then you can address 32GB with a 32-bit pointer.
For C++, it's a really interesting thought experiment, but in practice there are a few rough edges:
1. Your debugging tools will probably struggle to interpret compressed pointers.
2. You'll need to roll your own containers too. Otherwise, vectors, strings, maps, smart pointers, etc. will still use 64-bit pointers. In my experience, containers account for most pointers in a modern C++ code base.
3. _Technically_, the representation of null pointers is implementation-defined, so truncating to 32 bits and then zero-extending is not portable: it could convert a null pointer to a non-null pointer. In practice it doesn't matter though, since every modern target uses a zero bit pattern for null pointers.
ICC has an option along these lines:

  auto-p32
    Instructs the compiler to analyze the program to
    determine if there are 64-bit pointers that can be
    safely shrunk to 32-bit pointers.
I don't know how good it is in practice, having never tried it. GCC has -mx32, but that changes ABI completely, you have to recompile everything, and you need a new kernel with another system call table (which your favorite kernel might have disabled).
I don't think there is any way to retrofit existing software with this technique. In addition to the issues you mentioned, you would have to deal with pointer arithmetic (interior pointers), Base* and Derived* semantics (covariance, arrays, templates, etc.), composition with other pointer-like types, etc.
Not to mention dealing with external code that doesn't traffic in compressed pointers
Pointer compression is a pretty nifty and simple optimization.
While working on a high-performance wait-free MPSC queue, pointer compression allowed me to fit the tails of each ring buffer queue together into a single cache line, reducing cache pressure and cache coherency traffic. The tails were essentially 16-bit compressed indices, storing the offset from the backing array's base address.
Pointer compression lets you do so much cheating in lock-free and wait-free data structures. A 16-byte CAS can now operate on 4+ values instead of just ~2, and you can conceivably use an 8-byte object for both a pointer and a counter (an ABA counter, perhaps) so you can use things like XCHG and XADD instead of CAS for operations that don't actually need a CAS loop.
Do you have a pseudo-code example of this? It sounds neat:
> A 16-byte CAS can now operate on 4+ values instead of just ~2, and you can conceivably use an 8-byte object for both a pointer and a counter (an ABA counter, perhaps) so you can use things like XCHG and XADD instead of CAS for operations that don't actually need a CAS loop.
This is what it spit out, if anyone is curious (best of them, from what I could tell, I dunno much about concurrent programming but this looked the most reasonable to me):
I don't think this is correct since it doesn't use CAS/atomics but it's certainly interesting.
Comments that it generated for the classes:
// This is a lock-free stack that uses pointer compression to store a counter
// and a tag in the pointer. The tag is used to indicate whether the stack is
// empty or not, and the counter is used to detect ABA problems.
template <typename T>
class uint64_compressed_ptr_and_counter_stack
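FWIW, the pointer+counter packing itself can be sketched much more minimally (my own rough version, hypothetical names, not the parent's actual queue): put a 32-bit compressed pointer/index in the low half of one 64-bit atomic and the ABA/version counter in the high half, so a single exchange or fetch_add does the job where a CAS loop would otherwise be needed.

#include <atomic>
#include <cstdint>

// One 8-byte word: a 32-bit compressed pointer (e.g. an offset into a
// pre-reserved region) in the low half, a 32-bit version counter in the high half.
struct PtrAndCounter {
  std::atomic<uint64_t> word{0};

  static uint64_t Pack(uint32_t ptr, uint32_t counter) {
    return (static_cast<uint64_t>(counter) << 32) | ptr;
  }
  static uint32_t Ptr(uint64_t w) { return static_cast<uint32_t>(w); }
  static uint32_t Counter(uint64_t w) { return static_cast<uint32_t>(w >> 32); }

  // Bump the version counter with a single XADD-style fetch_add;
  // the compressed pointer in the low half is untouched.
  uint64_t BumpCounter() {
    return word.fetch_add(uint64_t{1} << 32, std::memory_order_acq_rel);
  }

  // Publish a new pointer + counter with a single XCHG-style exchange
  // instead of a CAS loop, when the old value doesn't need to be checked.
  uint64_t Publish(uint32_t new_ptr, uint32_t new_counter) {
    return word.exchange(Pack(new_ptr, new_counter), std::memory_order_acq_rel);
  }
};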
No, it's the memory allocated via Oilpan, before versus after pointer compression. Oilpan tracks C++ objects as part of Blink / Chrome's garbage collection.