
Pointer Compression in V8 - DyslexicAtheist
https://blog.infosectcbr.com.au/2020/02/pointer-compression-in-v8.html
======
svat
The immediate context for this is this post
[https://v8.dev/blog/v8-release-80](https://v8.dev/blog/v8-release-80) on the
v8 blog, where they announced that in v8 version 8, they saved an average of
40% memory, _and_ (unlike usual memory-time tradeoffs) also got good
performance improvements. (Design doc:
[https://docs.google.com/document/d/10qh2-b4C5OtSg-xLwyZpEI5Z...](https://docs.google.com/document/d/10qh2-b4C5OtSg-xLwyZpEI5ZihVBPtn1xwKBbQC26yI/edit)) So pointer compression is clearly a good
thing. I don't know much about the history of pointer compression, but I know
the following.

In 2008, Don Knuth posted on his then "news" page
([https://cs.stanford.edu/~knuth/news08.html](https://cs.stanford.edu/~knuth/news08.html)):

> _A Flame About 64-bit Pointers_

> _It is absolutely idiotic to have 64-bit pointers when I compile a program
> that uses less than 4 gigabytes of RAM. When such pointer values appear
> inside a struct, they not only waste half the memory, they effectively throw
> away half of the cache._

> _The gcc manpage advertises an option "-mlong32" that sounds like what I
> want. Namely, I think it would compile code for my x86-64 architecture,
> taking advantage of the extra registers etc., but it would also know that my
> program is going to live inside a 32-bit virtual address space._

> _Unfortunately, the -mlong32 option was introduced only for MIPS computers,
> years ago. Nobody has yet adopted such conventions for today's most popular
> architecture. Probably that happens because programs compiled with this
> convention will need to be loaded with a special version of libc._

> _Please, somebody, make that possible._

Presumably Knuth was not the only person asking for it, and in 2011 there was
work on this: see "Making Knuth's wish come true: the x32 ABI"
([http://blog.reverberate.org/2011/09/making-knuth-wish-come-t...](http://blog.reverberate.org/2011/09/making-knuth-wish-come-true-x32-abi.html)) and Wikipedia/LWN coverage
([https://en.wikipedia.org/w/index.php?title=X32_ABI&oldid=921...](https://en.wikipedia.org/w/index.php?title=X32_ABI&oldid=921754052)
[https://lwn.net/Articles/456731/](https://lwn.net/Articles/456731/))

Unfortunately, no one was using it (not sure why, maybe not many people who
care about performance write the kinds of programs that would hugely benefit
from this?), and the x32 ABI got sort of deprecated by late 2018
([https://www.phoronix.com/scan.php?page=news_item&px=Linux-Po...](https://www.phoronix.com/scan.php?page=news_item&px=Linux-Potentially-Drops-x32) etc).

Now, a personal story. Recently, while searching for something on Stack
Exchange, I came across a question related to Bentley's June 1986 _Programming
Pearls_ column that featured an invited literate program by Knuth and a review
by Doug McIlroy, about which there is a lot of misinformation and
misunderstanding on the internet (e.g. calling it an "interview question" and
what not!). Anyway, this question on codegolf.SE
([https://codegolf.stackexchange.com/questions/188133/bentleys...](https://codegolf.stackexchange.com/questions/188133/bentleys-coding-challenge-k-most-frequent-words)) was about implementing a fast
solution to the same problem, and the "winner" was an elegant Rust program. I
was curious about Knuth's original Pascal (WEB) program from 1986, so I
studied it, translated it to C++, and found to my surprise that it ran faster
than the fastest program that had been posted on the site! Looking closer into
why, experimenting with this and that, it turned out AFAICT that the probable
reason, ultimately, was that where the Rust program used (64-bit) pointers,
the translation of Knuth's program (which had been targeting "common
denominator" Pascal, without pointer types) used (32-bit) array indices, so it
was able to fit twice as many struct values in the cache.
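
A minimal sketch of the effect (hypothetical node layouts, not the actual
structures from either program): shrinking each link from 8 bytes to 4
roughly halves the node, so close to twice as many fit per 64-byte cache
line.

    #include <cstdint>
    #include <vector>

    struct PtrNode {            // pointer-based links: 8 bytes each on x86-64
        PtrNode* child;
        PtrNode* sibling;
        uint32_t count;
    };                          // 24 bytes after padding

    struct IdxNode {            // index-based links: 4 bytes each
        uint32_t child;         // index into `pool`; 0 can mean "none"
        uint32_t sibling;
        uint32_t count;
    };                          // 12 bytes, and no alignment padding

    std::vector<IdxNode> pool;  // all nodes live in one contiguous array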

In fact, taking just this one idea (cache-friendliness) and using a regular
trie data structure (as we're no longer operating under the memory or
language constraints Knuth was) gives something even faster.
([https://codegolf.stackexchange.com/a/197870](https://codegolf.stackexchange.com/a/197870))
I'd been planning to write a blog post explaining all this -- the clever data
structures used (tries, trie-packing, and hash tries), how they're used in the
TeX program for hyphenation, the context in 1986 and misunderstandings today,
and my experiments with the programs -- but I got distracted by other things.
This post has reminded me to try again. :-)

~~~
MisterAV
Doesn't having 32-bit pointers mean that ASLR becomes much less effective? I
suppose it's not a big problem for V8, because they use compressed pointers
where necessary, but if it were a compiler directive (like Knuth wanted) it
would affect the whole program. I would not use that option for any program
that has to process untrusted input.

~~~
verwaest
It does: pointers in the V8 heap all lie within the same 4gb region. But then
again, Spectre also makes ASLR much less effective.

------
tybit
“A downside to this is that the V8 heap can not be any greater than 4 GB as
that is the maximum limit of a 32-bit address space. This is fine for
browsers, as the heap doesn’t need to be greater than 4 GB anyway. It becomes
a problem with things like node.js that require larger heaps. Because of this,
pointer compression is disabled for node.js until a better solution can be
figured out.”

It will be interesting to see how this evolves. In the HotSpot JVM (not using
fancy GCs, at least), heaps up to ~32GB can use pointer compression, as it
compresses pointers to scaled object offsets instead of raw memory addresses.

[https://stackoverflow.com/questions/25120546/trick-behind-jv...](https://stackoverflow.com/questions/25120546/trick-behind-jvms-compressed-oops)
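
A minimal sketch of the decode described in that answer (names are
illustrative, not HotSpot's actual code): objects are 8-byte aligned, so the
low 3 bits of every address are zero, and a shifted 32-bit value can cover
2^32 * 8 = 32 GB.

    #include <cstdint>

    static char* heap_base;  // start of the managed heap, set at VM startup

    inline void* decode(uint32_t oop) {
        return heap_base + (static_cast<uint64_t>(oop) << 3);
    }

    inline uint32_t encode(void* p) {
        return static_cast<uint32_t>(
            (static_cast<char*>(p) - heap_base) >> 3);
    }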

~~~
verwaest
We're definitely interested in exploring this. Unfortunately it's likely a
little slower than the 4 GB pointer-compression scheme: compression right now
is basically a no-op, and decompression is simply a single add instruction.
And it'll fragment memory a little because of the alignment requirements. But
surely worth it where larger heaps are necessary.
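
(A sketch of what that means, assuming the whole heap sits in one region
whose base has its low 32 bits clear and is kept in a register; illustrative
only, not V8's actual code:)

    #include <cstdint>

    static uintptr_t heap_base;  // 4 GB-aligned start of the heap region

    inline uint32_t compress(uintptr_t p) {
        return static_cast<uint32_t>(p);  // truncation: effectively free
    }

    inline uintptr_t decompress(uint32_t c) {
        return heap_base + c;             // a single add
    }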

~~~
the8472
Don't fancy x86 addressing modes provide most of those multiplications and
offsets with very little IPC penalty?

~~~
cfallin
Yeah, this should be roughly the same overhead as an ADD:

    LEA rDest, [rBase + 8*rPtr]

(The "load effective address" instruction computes an effective address like a
load or store would, but just gives the address without doing a memory
access.)

~~~
the8472
AIUI mov supports these things directly[0] and, if I read the instruction
tables correctly, at least on Skylake the latency/throughput is the same for
all addressing modes[1].

[0]
[http://www.c-jump.com/CIS77/ASM/Addressing/lecture.html#R77_...](http://www.c-jump.com/CIS77/ASM/Addressing/lecture.html#R77_0060_scaling_factors)
[1]
[https://www.agner.org/optimize/instruction_tables.pdf](https://www.agner.org/optimize/instruction_tables.pdf)
(page 238)
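
For instance (compiler- and target-dependent, but typical gcc/clang output on
x86-64), the scaled index folds straight into the addressing mode of a single
mov:

    #include <cstdint>

    // Typically compiles to:  mov rax, [rdi + rsi*8]
    uint64_t load(const uint64_t* base, uint64_t idx) {
        return base[idx];
    }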

------
IvanK_net
"This is fine for browsers, as the heap doesn’t need to be greater than 4 GB
anyway."

I develop a photo editor, www.Photopea.com , where people often edit e.g.
100-megapixel photos. Chrome may then crash (because of the 4 GB limit) and
they lose their work. I have to recommend that users use Firefox for such
cases.

~~~
ygra
Is it really necessary to keep the whole uncompressed image in memory (as well
as all undo steps) at all times? I guess the browser environment makes trading
RAM for disk space difficult, but Photoshop ran fine with moderate amounts of
RAM and plenty of scratch space on disk a few decades ago. It's probably
_easier_ to just keep stuff in memory, but perhaps not strictly necessary as
not everything needs the same latency.

~~~
MrPatan
I have 32 GB of RAM, why shouldn't software use it to make things faster?

~~~
ygra
It definitely should, but keeping your working set smaller can also make
things faster (as seen by Chrome here). This depends very much on the workload
and what's being done to the data in memory. My example was just that a raster
image editing program probably doesn't require a huge memory footprint just to
be able to edit images well (as a lot of the memory use typically is not the
image you're seeing, but history and undo state, which is neither latency-
critical nor frequently accessed).

~~~
cormacrelf
Chrome's 32-bit address space/4GB limit is different from having 64-bit
machine/48-bit address space/4GB of RAM. In the latter, you can keep
allocating after 4GB, it will just get slower as the pager will start swapping
pages to and from the disk. But in V8 with pointer compression, you will just
hit a brick wall.

To swap memory to disk, you still need address space to map it to. Say the
editor has allocated 3.4GB, and then makes another 600MB layer, allocated at
roughly 0xD0000000. With no address space left, it asks for another layer, and
the allocator returns NULL. It can't give you a pointer to a 600MB region,
because there's no address space left. If you paged out the layer at
0x20000000, that would not help, because it wouldn't magically free up the
addresses 0x20000000-0x40000000. They would just refer to pages that are
currently on disk, and still be 'occupied' address space. You still need an
address for this new allocation, and there is no room to put it.

No slow degradation -- it won't page fault at all unless you otherwise fill
the RAM on the machine. So the image editor just falls over, with an
uncatchable OOM exception I presume, with no perceptible warning from page
fault slowdown just prior. It will go full speed into the brick wall. For your
account to be accurate, V8 would have had to implement their own virtual
address space, which they have not. VA basically requires a hardware TLB to be
fast, and V8's "TLB" here is just `mov eax, [whatever]; and rax, r13`.
Anything other than that would have completely defeated the speed gains from
locality.

This doesn't account for nuances like whether ArrayBuffers would be allocated
elsewhere and have no pointer compression applied, but it's definitely true of
general objects. For a regular JS program to fill 4GB with normal web app
things would be a miracle, and the image editors of the world can probably
still work if they make the big-allocation APIs use full-size pointers.

------
londons_explore
I'd be interested to see how far this can be taken...

Why not make pointers just 4 bits, giving them only 15 possible memory
locations they can point to? Then try to allocate the thing they'll be
pointing at in one of those 15 locations. Reserve the 16th location for some
kind of backup data structure which can point anywhere.

Clearly it isn't branchless, but I would guess that most codepaths would
either always be able to make use of one of the 15 locations, or would never
be able to, making branches predictable.

~~~
unwind
Four bits is probably too low, since you would need to fit other data into the
remaining bits in order not to waste them.

The x86 is (merely) byte-addressable so a byte is the smallest piece of data
that makes sense for an "object", as far as I know.

~~~
londons_explore
Nearly all objects are full of pointers. You can pack loads in if they're 4
bits each.

------
joe_the_user
Pointer compression seems like an amazing method for bytecode optimization in
general.

Bytecode optimization is one of my hobbies and I wish I could find a list of
all such methods that VMs like V8 use.

~~~
unlinked_dll
I think pointer compression is worthy of a masters/PhD thesis in compiler
design, if it hasn't been done already. It should be ubiquitous: if a program
will never need more memory than an N-bit pointer can address, the pointers
should probably be compressed and the arithmetic optimized. There are a lot of
edge cases to look out for there, however.

~~~
ndesaulniers
Or if vtables had relative offsets, rather than absolute addresses? Fuchsia is
designing their ABI with that change to the Itanium ABI.

~~~
vidarh
It's unclear what you mean here. Lots of uses of vtables will use offsets to
look up a specific slot in a vtable, but the total space used by vtables in
most systems will be a tiny proportion of the heap, so storing the pointers in
the vtables as offsets doesn't seem likely to make much difference.

~~~
ndesaulniers
What's in those slots? (Full pointers)

[https://itanium-cxx-abi.github.io/cxx-abi/cxx-vtable-ex.html](https://itanium-cxx-abi.github.io/cxx-abi/cxx-vtable-ex.html)

[https://bugs.llvm.org/show_bug.cgi?id=26723](https://bugs.llvm.org/show_bug.cgi?id=26723)

~~~
vidarh
Yes, but there's only one vtable per class. The latter does show a shockingly
high reduction in code size, which seems to imply a ludicrous number of tiny
classes with few call sites per method in the Chromium code base (relative to
what I've seen when doing compiler development), but it shouldn't translate
into much of a change in heap usage.

From reading those, my main takeaway is that the Chromium codebase sounds
awful.

------
andreareina
Getting a 40% heap reduction by halving pointer size implies that pointers
were 80% of the heap (if a fraction f of the heap is pointers, halving them
saves f/2 of the total, so f/2 = 0.4 gives f = 0.8)! What am I missing?

~~~
eru
Don't think you are missing anything. There are a lot of pointers.

------
boardwaalk
Is there a way to do this in a systems language? I know there was the X32 ABI
that’s essentially deprecated at this point. I suppose you could use your own
allocator and pointer wrapper types, but it’d be nice if there was just a
switch you could flip if you know you’re just not going to use >4GiB.
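
Something like this wrapper is what I have in mind (illustrative sketch,
assuming a single arena reserved up front that is known to fit in 4 GiB):

    #include <cassert>
    #include <cstdint>

    extern char* arena_base;  // one big region reserved up front

    template <typename T>
    struct Compressed {
        uint32_t off;  // byte offset from arena_base

        static Compressed from(T* p) {
            auto d = reinterpret_cast<char*>(p) - arena_base;
            assert(d >= 0 && d < (1LL << 32));
            return {static_cast<uint32_t>(d)};
        }
        T* get() const { return reinterpret_cast<T*>(arena_base + off); }
        T* operator->() const { return get(); }
    };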

~~~
Taniwha
Sure you can - 35 years or so back I ported Unix (v6/v7) to run in a virtual
system (under VMS) with relative pointers (so you could move swapped images
around without playing too many MMU games). I did this by hacking the C
compiler to use relative pointers, much the same as these people are doing.

~~~
winrid
Wow, how long did it take to port Unix to run under VMS?

~~~
Taniwha
It was a long time ago, probably ~6 months part time - bringing up the kernel
at the same time as a compiler is never easy - it ran in supervisor/user modes
in place of VMS's shell.

Porting V6 was harder: it was very PDP-11ish, and lots of stuff (especially
context switching) depended heavily on knowledge of the structure of PDP-11
stack frames.

------
codedokode
Wouldn't it be easier just to use the 32-bit x86 arch instead of 64-bit?
Pointer compression looks like a solution for a problem that should not exist
in the first place. I saw the arguments against this in [1] but they look very
weak ("We cannot use the 32-bit arch because Chrome has switched to 64-bit and
because there is an OS nobody is using that doesn't allow this").

Why not make 64-bit Chromium for those who have over 16 GB of RAM and 32-bit
for normal people?

[1] [https://docs.google.com/document/d/10qh2-b4C5OtSg-xLwyZpEI5Z...](https://docs.google.com/document/d/10qh2-b4C5OtSg-xLwyZpEI5ZihVBPtn1xwKBbQC26yI/edit)

~~~
anarazel
Realistically the limit for 32-bit apps isn't 4GB, but considerably lower. The
OS mapping takes out like 1GB at least. Then you have shared libraries mapped
in. You want space for ASLR. Mmapped files. Etc.

Leaving available memory aside, for a lot of optimizations it's useful to have
plenty of virtual address space - which is definitely not the case on 32-bit.

Basically, just because JS bytecode doesn't need more than 4GB doesn't mean no
other part of Chrome needs more.

~~~
beagle3
That's correct, although 32-bit processes on a 64-bit operating system (as are
still supported on Windows and Linux, and were supported on macOS until
Catalina) can effectively have 4GB.

Back when I was still using Windows, you could boot the 32-bit OS with "/3GB",
which would make the kernel/user split 1GB/3GB instead of the original 2GB/2GB
default - but it was optional and explicit, because quite a bit of software
failed; I would guess that changed with time, but likely a similar "/4GB"
switch for 32-bit apps on 64-bit OS would also expose assumptions about the
address space layout ...

~~~
anarazel
It'd also make syscalls a bit more expensive...

------
londons_explore
How about another approach...

Design a "pointer predictor", which for a given pointer predicts where it will
lead to. I would guess there are many arrays of identical structures, so
predicting any given pointer value ought to be doable. The predictor could be
as simple as "This object has patterns of pointers very similar to this other
object, so use those instead."

Then replace each pointer with a single bit saying "the predictor is right" or
"the predictor is wrong, use an alternative pointer stored in an external
table".

~~~
lifthrasiir
Similar ideas have been proposed for memory compression, exploiting the fact
that most allocations in typical applications are object-like. See for example [1]
(HN discussion: [2]).

[1]
[https://blog.acolyer.org/2019/05/24/zippads/](https://blog.acolyer.org/2019/05/24/zippads/)

[2]
[https://news.ycombinator.com/item?id=19998645](https://news.ycombinator.com/item?id=19998645)

------
kilowatt
At some point (not sure if it is still true) Jai had, or was going to have, a
language-level implementation of relative pointers, i.e. a 16- or 32-bit
relative offset from “this field's memory address.”
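
The same trick is easy to sketch in C++ (illustrative, not Jai's actual
design): the stored value is an offset from the field's own address.

    #include <cstdint>

    template <typename T>
    struct RelPtr {
        int32_t off;  // offset from this field's own address

        void set(T* p) {
            off = static_cast<int32_t>(
                reinterpret_cast<char*>(p) -
                reinterpret_cast<char*>(this));
        }
        T* get() const {
            return reinterpret_cast<T*>(const_cast<char*>(
                reinterpret_cast<const char*>(this)) + off);
        }
    };

Because intra-block offsets survive a move, structures built this way can be
memcpy'd or memory-mapped wholesale, which is the usual motivation.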

------
Santosh83
When is this rolling out for stable Chromium/Chrome? Any timeline?

~~~
woofwoofwoof
> This post provides a preview of some of the highlights in anticipation of
> the release in coordination with Chrome 80 Stable in several weeks.

[https://v8.dev/blog/v8-release-80](https://v8.dev/blog/v8-release-80)

------
Waterluvian
Can someone give some context on why pointer compression is worthwhile? Are
there actually so many pointers in use that indeed you save a significant
amount of memory?

~~~
NikolaeVarius
The article links to a design doc
[https://docs.google.com/document/d/10qh2-b4C5OtSg-xLwyZpEI5Z...](https://docs.google.com/document/d/10qh2-b4C5OtSg-xLwyZpEI5ZihVBPtn1xwKBbQC26yI/edit)

~~~
tybit
“We expect about ~35% of V8 heap reduction on 64-bit platforms on real-world
web sites.” Amazing.

