Pointer Compression in Oilpan (v8.dev)
72 points by feross on Nov 29, 2022 | 39 comments



The x32 ABI is another opportunity for optimization: https://en.wikipedia.org/wiki/X32_ABI

x32 binaries are different from x86 programs in that they get access to 64-bit x86 registers, while their pointers are 32-bit.

It would be really cool if some Linux distro provided both x64 and x32 packages for programs that rarely, if ever, use more than 4 GiB.

Similarly, it would be cool if Chrome and Firefox spawned x32 processes for most web pages, and then only spun up an x64 process when a web page exceeded something like 2 GiB of memory usage (and transparently and quickly migrated the page to the new process).


FWIW I recently tried this on https://www.oilshell.org, thinking the code would be faster due to better cache locality. There are many pointer-rich data structures, and they should pack much more tightly when pointers are 32-bit.

It definitely saved memory, but it was slightly slower or the same speed. My guess is that 32-bit performance isn't prioritized on x86-64 by either compiler writers or chip designers, i.e. they optimize for the common case of 64-bit pointers.

Definitely interested in other comments / opinions / data on this.

I think if it actually sped programs up, you would see many more distros compiling packages for 32-bit. It does save memory, but somehow that's not a big enough motivation. Probably because desktops/laptops have 8GB+ RAM these days (enough for most apps), and Linux systems with constrained memory are probably already 32-bit.

And because real memory hogs like browsers probably want to address more than 4GB. Though to be honest I think a multi-process browser with a 4GB-per-origin limit sounds like a very good idea!

Another issue is probably shared libs. Linux distros are pretty tightly coupled in this respect, and if core system programs like the compiler want to use 64-bit libc, etc., then it's economical for other programs to be 64-bit as well. Otherwise the dependency management problem for shared libs becomes more annoying.


> It definitely saved memory, but it was slightly slower or the same speed. My guess is that 32-bit performance isn't prioritized on x86-64 by either compiler writers or chip designers, i.e. they optimize for the common case of 64 bit pointers.

That may be partially to blame, but depending on your implementation (was this x32 or actual compressed pointers?) the overhead of "decompressing" each pointer dereference may have been a significant factor. For null-safe languages like Java or JavaScript you already have null checks and may have GC barrier checks on most dereferences, so the extra costs are already expected, but adding a shift and/or constant add to each access is a more noticeable cost compared to plain accesses in native code.
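
For illustration, a minimal sketch of that per-dereference cost, assuming a process-wide heap base and 8-byte alignment (names hypothetical):

  #include <cstdint>

  // Hypothetical: every dereference of a compressed pointer pays for a
  // shift plus a base add before the actual load, where a native 64-bit
  // pointer is just a plain load.
  extern char *g_heap_base;  // base of the (at most 32 GiB) compressed heap

  inline void *decompress(uint32_t compressed) {
      return g_heap_base + (uintptr_t(compressed) << 3);  // 8-byte alignment
  }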


This is dirt-simple native C++ code compiled with -m32, with structs and pointers (which happens to be generated, but that doesn't matter here).

I would expect that dereferencing a 32-bit pointer on x86-64 has zero overhead compared to dereferencing a 64-bit pointer (though I didn't test it). The only way I can see it being slower in the context of a real program is if the base address is an operand that has to be loaded from memory. But it shouldn't be, because it's a 32-bit address space that should be handled by the VMM, i.e. the base address is constant for the entire process.


-m32 doesn't target the x32 ABI though, it targets x86. Per the manual:

> The -m32 option sets int, long, and pointer types to 32 bits, and generates code that runs on any i386 system.

That restricts you to the smaller register set, removes all amd64 instructions, and uses the x86 ABI (on Linux, parameters are passed on the stack instead of in registers). I suspect that this is why your program became slower, as opposed to using 32-bit pointers.
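
A quick way to check which ABI you actually got (a sketch; -m32 and -mx32 are real GCC flags, but -mx32 needs an x32-capable toolchain and libraries):

  // abi_check.cpp
  //   g++ -m32  abi_check.cpp   # i386 ABI: 32-bit pointers, x86 register set
  //   g++ -mx32 abi_check.cpp   # x32 ABI: 32-bit pointers, amd64 register set
  //   g++       abi_check.cpp   # x86-64 ABI: 64-bit pointers
  #include <cstdio>

  int main() {
      std::printf("sizeof(void*) = %zu\n", sizeof(void *));
  #ifdef __ILP32__
      std::printf("ILP32 data model (x86 or x32)\n");  // defined for -m32 and -mx32
  #endif
  #ifdef __x86_64__
      std::printf("amd64 instruction set\n");          // defined for -mx32 and x86-64
  #endif
      return 0;
  }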


Hm so then how do you compile for x32 and have 32-bit pointers without the other restrictions?

It sounds like x32 is poorly supported by Linux and x86 would be better supported.


I think that the answer here is "not easily". To use the x32 ABI, you would need:

1. Kernel support to be enabled in order to make x32 system calls. Checking /proc/config.gz on Arch Linux, my kernel does not appear to have it enabled.

2. An x32 toolchain to build against. For gcc and glibc, that would mean re-compiling as described here: https://sourceware.org/glibc/wiki/x32

3. All of the libraries you link with to be recompiled for the x32 ABI.

x86 is definitely easier to build against since there's still legacy software using it, whereas almost nobody seems to use x32. However, over the long term I expect x86 compatibility will bit-rot too.


Yeah it sounds like x32 is really only practical for embedded systems, if anything

The limited register set on x86 does sound like a good explanation for it being a wash in terms of speed.

It seems like there is very little language-level support for small pointers. Only the unreleased Jai language seems to have "relative pointers" (or maybe they took this out)

The Zig compiler recently replaced pointers with integers in their code representation for memory/speed, but lost some type safety in the process -- i.e. a T* isn't a U*, but indices into t[] and u[] are the same type. I'm not sure I like that tradeoff, and it seems like language-level support would help
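
As a sketch of what even minimal language-level support could look like, a thin typed-index wrapper (hypothetical names, C++ for concreteness) recovers the T*/U* distinction that raw integer indices lose:

  #include <cstdint>
  #include <vector>

  // A 32-bit index that remembers which pool it indexes into:
  // Index<T> and Index<U> are distinct types, unlike raw uint32_t.
  template <class T>
  struct Index {
      uint32_t i;
  };

  template <class T>
  T &deref(std::vector<T> &pool, Index<T> idx) {
      return pool[idx.i];  // passing an Index<U> here won't compile
  }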


I would definitely encourage you to give compiling Oilshell for x32 a shot.

It definitely _is_ a pain to compile for it.

The x32 ABI is not switched on by default on any distro that I'm aware of, so you would need to recompile your kernel to get support for it. The poor support is probably because almost no one is using it. There was even a kernel thread discussing removing support for it.

So, from the point of view of distributing it to end users, the x32 ABI is pretty much a dead end (for now, until distros start turning the flag on in their standard default kernels).

But I would definitely be interested in your results. Most likely, you should see a smaller footprint + a speed boost.


For x86 you have a lot fewer registers, which makes it a bit of a wash. x32 is interesting because you get all the registers but keep the pointers small. Sadly it's not well supported.


AFAICT, x32 is basically dead.

I'm gonna go out on very thin ice and disagree with a Donald Knuth quote:

> It is absolutely idiotic to have 64-bit pointers when I compile a program that uses less than 4 gigabytes of RAM. When such pointer values appear inside a struct, they not only waste half the memory, they effectively throw away half of the cache.

In general, if shrinking the pointer size is a huge optimization, your data structures have too many pointers in them. Learning Rust was a turning point for me here. The borrow checker encouraged me to get rid of references to parents or siblings. I ended up passing parent references on the program stack and having sibling references (e.g. in graph structures) via indices into a slab, which can be u32 rather than pointer-sized.

You could also write a SmallVec/SmallString that uses u32 for len/capacity rather than usize, which would be a nice savings if you have lots of them. And/or move the len/capacity inside the heap allocation, so empty/unallocated vecs/strings don't have those fields at all. Or do what Rust's smallvec crate does and what C++'s std::string "small string optimization" does: carve out the space for them but use it for actual object storage until it overflows and you need to move to the heap.
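
A sketch of the u32 len/capacity idea, in C++ for concreteness (hypothetical type): the header shrinks from three 64-bit words to 16 bytes, at the cost of a ~4-billion-element cap.

  #include <cstdint>

  // 8-byte data pointer + two 4-byte fields = 16 bytes per vector header,
  // vs. 24 bytes for the usual pointer/size_t/size_t layout. Worthwhile
  // if you hold millions of these.
  template <class T>
  struct SmallVec {
      T *data;
      uint32_t len;
      uint32_t capacity;
  };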

If you're working in a language that doesn't have value objects (that is, all user-defined structures are implicitly boxed), my sympathy: you will just end up with a bunch of extra pointers, which is why this kind of optimization is important for Java and JS. I consider the lack of value objects to be a significant design flaw in a language.


I'm not sure I understand your point. First you say:

> In general if shrinking the pointer size is a huge optimization, your data structures have too many pointers in them.

And then you go on to say

> having sibling references (e.g. in graph structures) via indices into a slab, which can be u32 rather than pointer-sized.

What you're describing is literally pointer compression. You're using offset indices that point to a compressed/sufficiently small address space.


> What you're describing is literally pointer compression. You're using offset indices that point to a compressed/sufficiently small address space.

True, but it's finer-grained than a global switch. I can decide that I will have at most 2^32-1 objects in this particular graph/slab without giving up the ability for my program to address more than 4 GiB of RAM in total, as it would with the x32 ABI. Also, this trick is only a small part of reducing pointer storage. I mentioned passing stuff in via the stack vs. keeping it in the data structure. I forgot to mention: use vecs rather than e.g. linked lists, open-addressing hash maps rather than chaining, btrees rather than red/black trees, etc. This is also stuff that Rust's std library encourages.

btw: of course none of the techniques I mention are unknown to Donald Knuth. He likely either wants to get programs written by others to run more quickly, or is trying to optimize well beyond what I'd care to do. I'd rather split the difference, basically: decrease pointer density a fair bit and keep the pointers at 64-bit.


Fair enough, I agree that a global x32 switch seems misguided, especially since efficient data structures are already written in cache-aware ways.


You're basically screaming "don't use so many pointers!", but this is much easier said than done. You can only really do this if you write the entire stack (all the dependencies) yourself, just so you can optimize like this. This approach doesn't scale or compose well at all.


That's just like saying you can only optimize CPU usage if you've written the world perfectly from scratch. In reality, profiles aren't flat. A little effort in the right spot can go a long way for any optimization goal.


You can still build a Gentoo system using the 'abi_x86_x32' flag.


A big problem is that it requires an entire ABI and operating-system support to handle this case, when the alternative is... just use a custom heap in your own program. You're describing an insane level of engineering for something that is mostly hidden and can be handled just as effectively by applications themselves.

Like, if you're in the weird case where A) pointer chasing is killing your performance and blowing your cache and B) your working set fits in 4GB, you can just fix this yourself with a custom allocator. You'll still be pointer chasing, mind you, just half as much (and there's the thing: you could very well be much better off reducing the chasing you're doing versus mitigating it this way).
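
A sketch of that do-it-yourself version: a bump arena that hands out 32-bit offsets instead of pointers (hypothetical, omits capacity checks, growth, and freeing):

  #include <cstdint>
  #include <cstdlib>
  #include <new>
  #include <utility>

  // Objects live in one contiguous block; "pointers" are 32-bit byte
  // offsets from the arena base, so in-heap links are half-size.
  class Arena {
      char *base_;
      uint32_t used_ = 0;

   public:
      explicit Arena(uint32_t capacity)
          : base_(static_cast<char *>(std::malloc(capacity))) {}
      ~Arena() { std::free(base_); }

      template <class T, class... Args>
      uint32_t make(Args &&...args) {
          uint32_t a = uint32_t(alignof(T));
          uint32_t off = (used_ + a - 1) & ~(a - 1);  // align the offset
          used_ = off + sizeof(T);                    // sketch: no overflow check
          new (base_ + off) T(std::forward<Args>(args)...);
          return off;                                 // the "compressed pointer"
      }

      template <class T>
      T *get(uint32_t off) { return reinterpret_cast<T *>(base_ + off); }
  };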

In reality, superscalar OOO processors hide miss latency very effectively, helping mitigate the problems of A, and B isn't always clear from the start. So here we are.

Don't get me wrong, I want more efficient programs, and centralizing some complexity like ABI is a good thing. But in practice x32 was a massive technological investment for something that mostly didn't matter, that people mostly didn't need, and whose benefits can be gotten much more easily and directly in a lot of cases.


I agree that the x32 ABI is probably a technological dead end, but I don't agree with "just" using a custom heap. I tried it in C++, and it's not easy. You lose a lot of type safety, and you have to basically rewrite every line of your program.

I mentioned the Zig compiler here: https://news.ycombinator.com/item?id=33797368

Another program I remember that uses small integers instead of pointers is the Sorbet type checker (not coincidentally it also has a fine-grained AST code representation): https://blog.nelhage.com/post/why-sorbet-is-fast/

I peeked at the code a while back, and it's not really idiomatic C++. It's obviously doable, but you have to architect the program for it from the start -- it's not feasible to retrofit this technique onto existing software.


> It would be really cool if some Linux distro provided both x64 and x32 packages for programs not expected to rarely if ever use more than 4 GiB.

How would one decide which version to use? Wait for the x32 version to crash and try again?


Use the same mechanism as for running i386 code under x64.


For those who are too lazy to RTFA: Oilpan pointer compression shipped with Chrome 106. This resulted in real memory savings for users:

    Blink memory    P50                 P99
    Windows       -21% (-1.37MB)    -33% (-59MB)
    Android        -6% (-0.1MB)      -8% (-3.9MB)


I wonder if you could apply a simpler version of this to ordinary C/C++ programs? If you allocate a chunk on the heap with 8-byte alignment, then you can throw away the lowest 3 bits of precision, and then you can address 32GB with a 32-bit pointer. You'd need the cooperation of the memory allocator to ensure it gives you addresses in the first 32GB of address space, and it might interfere with things like ASLR, but it doesn't seem impossible.

Something like this in C++:

  #include <cstdint>

  template<class T>
  class CompressedPointer {
  public:
    T &operator*() const {
      // integer-to-pointer conversion needs reinterpret_cast, not static_cast
      return *reinterpret_cast<T *>((uint64_t)ptr << 3);
    }

    /* It's not safe to construct a CompressedPointer<T> from an arbitrary T*;
       you probably want to malloc and construct via the CompressedPointer itself */
    template<class... Args>
    static CompressedPointer malloc(Args&&... args) {
      /* allocate 8-byte aligned sizeof(T) in first 32GB, then construct T in it */
    }

  private:
    uint32_t ptr;
  };
[Edit] Extra crazy idea: Take an existing large C++ codebase. For each struct T in the codebase, use static analysis to determine whether T is ever stack-allocated (or otherwise incompatible with CompressedPointer). If not, automatically rewrite all "T*" to "CompressedPointer<T>" and all "new T(...)" to "CompressedPointer<T>::malloc(...)". I suspect this would work for 90% of the heap allocations in a typical C++ codebase...


> If you allocate a chunk on the heap with 8-byte alignment, then you can throw away the lowest 3 bits of precision, and then you can address 32GB with a 32-bit pointer.

This part is totally doable, and in fact it's one of the pointer compression modes built into the JVM: https://shipilev.net/jvm/anatomy-quarks/23-compressed-refere...

For C++, it's a really interesting thought experiment, but in practice there are a few rough edges:

1. Your debugging tools will probably struggle to interpret compressed pointers.

2. You'll need to roll your own containers too. Otherwise, vectors, strings, maps, smart pointers, etc. will still use 64-bit pointers. In my experience, containers account for most pointers in a modern C++ code base.

3. _Technically_, the representation of null pointers is implementation-defined, so truncating to 32 bits and then zero-extending is not portable: it could convert a null pointer to a non-null pointer. In practice it doesn't matter though, since every modern target uses a zero bit pattern for null pointers.


ICC has an option that is documented as:

  auto-p32   Instructs the compiler to analyze the program to determine
             if there are 64-bit pointers that can be safely shrunk to
             32-bit pointers.

I don't know how good it is in practice, having never tried it. GCC has -mx32, but that changes ABI completely, you have to recompile everything, and you need a new kernel with another system call table (which your favorite kernel might have disabled).


I think this is a compelling idea, and I did some experiments with it a few years ago, but I don't think it works.

I think you have to rewrite all your C++ code to get smaller pointers, e.g. see my comments on Zig and Sorbet (C++) in this thread. Also some notes on this wiki page: https://github.com/oilshell/oil/wiki/Compact-AST-Representat...

I don't think there is any way to retrofit existing software with this technique. In addition to the issues you mentioned, you would have to deal with pointer arithmetic (interior pointers), Base* and Derived* semantics (covariance, arrays, templates, etc.), composition with other pointer-like types, etc.

Not to mention dealing with external code that doesn't traffic in compressed pointers.


Pointer compression is a pretty nifty and simple optimization.

While working on a high-performance wait-free MPSC queue, pointer compression allowed me to fit the tails of each ring buffer queue together into a single cache line, reducing cache pressure and cache coherency traffic. The tails were essentially 16-bit compressed indices, storing the offset from the backing array's base address.
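
A sketch of that layout (hypothetical names; assumes each ring buffer holds at most 2^16 slots, so a tail fits in 16 bits):

  #include <atomic>
  #include <cstdint>

  // Tails for many ring buffers packed into a single 64-byte cache line.
  // Each tail is a 16-bit slot offset from its queue's backing array
  // rather than a full 64-bit pointer, so 32 of them share one line.
  struct alignas(64) PackedTails {
      std::atomic<uint16_t> tails[32];
  };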

Cool post!


Pointer compression lets you do so much cheating in lock-free and wait-free data structures. A 16-byte CAS can now operate on 4+ values instead of just ~2, and you can conceivably use an 8-byte object for both a pointer and a counter (an ABA counter, perhaps) so you can use things like XCHG and XADD instead of CAS for operations that don't actually need a CAS loop.


Do you have a pseudo-code example of this? It sounds neat:

  > A 16-byte CAS can now operate on 4+ values instead of just ~2, and you can conceivably use an 8-byte object for both a pointer and a counter (an ABA counter, perhaps) so you can use things like XCHG and XADD instead of CAS for operations that don't actually need a CAS loop.


Not without breaking any NDAs.


Fair enough! I'd be interested if Github Copilot can take this as a prompt and spit out C/C++ for it...


This is what it spat out, if anyone is curious (the best of them, from what I could tell; I dunno much about concurrent programming, but this looked the most reasonable to me):

I started by typing this:

  template <typename T>
  struct uint64_compressed_ptr_and_counter
This is one of the implementations it generated:

https://godbolt.org/z/hbfY6GPzo

I don't think this is correct since it doesn't use CAS/atomics but it's certainly interesting.

Comments that it generated for the classes:

    // This is a lock-free stack that uses pointer compression to store a counter
    // and a tag in the pointer. The tag is used to indicate whether the stack is
    // empty or not, and the counter is used to detect ABA problems.
    template <typename T>
    class uint64_compressed_ptr_and_counter_stack


That's funny - it's far from correct, but it looks like a lot of lock-free data structures in an aesthetic sense.
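
The general shape isn't secret, though. A minimal sketch, assuming a sub-32GB 8-byte-aligned heap so the pointer compresses to 32 bits:

  #include <atomic>
  #include <cstdint>

  // A 32-bit compressed pointer and a 32-bit ABA counter packed into one
  // 64-bit word, so single-instruction atomics (XCHG, XADD) can update both.
  struct PtrAndCounter {
      std::atomic<uint64_t> word{0};

      static uint64_t pack(uint32_t ptr, uint32_t count) {
          return (uint64_t(count) << 32) | ptr;
      }
      static uint32_t ptr_of(uint64_t w)     { return uint32_t(w); }
      static uint32_t counter_of(uint64_t w) { return uint32_t(w >> 32); }

      // Bumping the counter is one atomic add (XADD), no CAS loop:
      void bump() { word.fetch_add(uint64_t(1) << 32, std::memory_order_acq_rel); }

      // Installing a new pointer+counter is one atomic exchange (XCHG):
      uint64_t swap(uint32_t ptr, uint32_t count) {
          return word.exchange(pack(ptr, count), std::memory_order_acq_rel);
      }
  };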


    Blink memory    P50                 P99
    Windows       -21% (-1.37MB)    -33% (-59MB)
    Android        -6% (-0.1MB)      -8% (-3.9MB)
The median memory usage in Blink on Windows is only ~6.5MB, and only ~1.7MB on Android? That seems very low…


No, it's the memory allocated via Oilpan, before versus after pointer compression. Oilpan tracks C++ objects as part of Blink/Chrome's garbage collection.


It's really too bad this stuff never worked in Node. It's not a command-line flag; it's determined at compile time.


Goodbye ASLR protection


This is about heap pointers, not code pointers.


It's a JIT. The heap and code are intertwined.



