
Mesh: Compacting Memory Management for C/C++ Applications - matt_d
https://arxiv.org/abs/1902.04738
======
vanderZwan
Is the shuffle vector a novel data structure? I've never seen it before but it
looks so obvious that just glancing at figure three was enough to figure out
how it worked before I got to the description: it's basically a vector with a
Fisher-Yates[0] shuffle built in! I wonder what other uses they might have.

[0]
[https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle](https://en.wikipedia.org/wiki/Fisher%E2%80%93Yates_shuffle)

~~~
jsnell
It seems very closely related to the well known trick for deleting elements
from a middle of an array in constant time (swap last element with the element
to be removed, then pop the last element). The difference is just that here
the swap happens on insertion rather than on deletion.

~~~
vanderZwan
Yes, kind of, but that is more when you have an unsorted array and don't worry
about making it even less sorted.

Shuffle vectors guarantee randomized sampling of a known set of elements for
very cheap, with the ability to return elements to the set. That seems like it
could have some very interesting niche usages.

It kind of reminded me of modern Tetris implementations. Instead of picking
one tetronimo at random for the next piece, they generate random sequences of
each tetronimo. However, this context never requires returning a single
element to the set of options - you can just generate a new sequence as a
whole once you run out of pieces.

Although... you could "flip" the shuffle-vector by swapping with the elements
_already taken:_ make a small buffer of seven elements (one for each piece),
circularly iterate over it, and whenever you reach element _i_ you swap the
value with a random position in the (inclusive range from zero to _i_. This
would basically be distributing the Fisher-Yates shuffle over the iterations,
rather than running it when the pieces run out.

For some reason though, I doubt that tetris piece randomization was ever a
source of performance instability for these games.

~~~
jsnell
How is it cheaper? In both formulations, you end up with one swap and one RNG
call per operation on the vector. With the shuffle vector you pay those costs
up front, with swap-with-last you pay those costs lazily. Asymptotically
there's no difference there at all.

~~~
vanderZwan
It's not, that's what I meant with:

> _This would basically be distributing the Fisher-Yates shuffle over the
> iterations, rather than running it when the pieces run out._

This obviously makes no difference for a sequence of seven tetronimos, but
perhaps there are contexts imaginable where one would do similar operations
over very large sequences, in which case paying costs lazily could be nice.

EDIT: I just remembered that Go guarantees that iterating over the built-in
maps is guaranteed to be in random order. I wonder how that is implemented -
perhaps something like this might be useful there for the situations where one
often bails out of such loops early.

~~~
vanderZwan
Since it's too late to edit: nope, that won't work for Go maps, because it
only _shuffles elements already visited_ , so bailing out of the loop early
guarantees the first N elements will be the same N elements that have been
iterated over up until that point. Which breaks the randomness guarantee.

OTOH, that actually is kind of an interesting feature; there might be a few
situations where that is desired behavior.

------
masklinn
> In future work, we plan to explore integrating Mesh into language runtimes
> that do not currently support compaction, such as Go and Rust.

With respect to Rust, isn't it usable as-is through LD_PRELOAD or whatnot
since Rust now uses the system allocator by default? Though obviously a proper
GlobalAlloc implementation could be provided (and eventually an Alloc).

~~~
nteon
This sounds right - that line was written before Rust used the system
allocator by default. Will update, although I'm still itching to write a
GlobalAlloc implementation.

~~~
JoshTriplett
I'd love to see one, and benchmarks for the Rust compiler running atop it.
Impressive work!

------
mehrdadn
Does anyone have some kind of list of the most important results in the area
of memory management? I'd never heard of the Robson bounds but they seem quite
crucial to know about, and I'd always been looking for something like them.
Are there other such results?

~~~
ric129
>I'd never heard of the Robson bounds but they seem quite crucial to know
about

I'd argue it's not really important in modern systems due to the use of
virtual memory. With virtual memory, you can massive reduce fragmentation
since you are no longer managing a single continuous region of memory.

Fragmentation still exists, but it only applies to "small" allocations (think
smaller than a page) and in most scenarios it is a result of modern memory
allocators preferring performance over reduced memory usage. Memory is cheap
nowadays, and performance is king.

>Does anyone have some kind of list of the most important results in the area
of memory management?

There is a survey [1] on the subject of memory allocation that mentions
Robson's bounds and other results

[1]
[https://link.springer.com/chapter/10.1007/3-540-60368-9_19](https://link.springer.com/chapter/10.1007/3-540-60368-9_19)

~~~
mehrdadn
Thanks!!

------
lefty2
Source code is available here in case anyone's interested:
[https://github.com/plasma-umass/Mesh](https://github.com/plasma-umass/Mesh)

------
_wmd
Random allocation sounds like it could potentially hurt cache for some
workloads, but still, this is easily the coolest idea I've seen so far this
year.

It's beyond amazing we can still innovate on something like a memory allocator
using hardware tricks that have been available for nearly 40 years. Leave no
stone unturned!

~~~
nteon
Hi! I'm one of the authors -- thanks so much!

We thought similarly about cache performance, but weren't able to find
specific examples (most of the slow benchmarks we dug into were due to
unoptimized allocator performance rather than poorer caching).

Informally I think one of the reasons for this is that in long running
problems, allocations returned from malloc end up being pretty random rather
than contiguous -- as allocations are freed they are put in freelists, and
freelists are designed more for temporal locality rather than spatial.

~~~
liuliu
On free, other allocators are likely to return the last freed address but with
mesh it will return a random location (i.e. hurt temporal locality). But I
guess in modern CPUs it doesn't matter that much because that whole page
probably cached anyway, selecting any offsets from that page would be
similarly good.

BTW, this is really cool project. Do you envision any difficulties of merging
this into jemalloc?

~~~
_wmd
jemalloc is written in C :) That aside, I don't think merging with an existing
allocator, if it weren't obscenely unlikely due to the general conservatism in
such projects, would necessarily accomplish anything.

In terms of contrasts it'd be like trying to merge Ruby with Python.. this
project's approach stands independently. It may be better to ask what features
it can gain from merging with something like jemalloc

~~~
liuliu
Yeah, it is interesting to see whether Jason Evans is interested enough to
port some of the ideas over to jemalloc. My original wording is poor and
shouldn't be read as suggesting merging the codebase literally. Just from
reading the paper, the idea itself is simple enough to get ported to any
modern allocators (with the only sticky point, as discussed in this thread
earlier, about the random allocation algorithm). The overhead in finding /
merging pages would be interesting topic to see whether it is reasonable for
these modern allocators as well. Wondering whether there are other
difficulties the authors saw about porting the idea to other allocators.

------
saagarjha
Interesting! I had always considered in the back of my head that there might
be a way to reduce memory fragmentation by basically passing that cost on to
virtual memory, but the meshing approach used here is a pretty cool
implementation of this. Here’s the implementation, it looks like:
[https://github.com/plasma-umass/mesh](https://github.com/plasma-umass/mesh)

------
obl
Very clever use of virtual memory.

Unfortunately I would expect this technique not to scale well with increasing
page size, and I sure hope that we will be slowly moving to >4k pages as years
go by since the TLB is a very common bottleneck.

~~~
nteon
Well, there are 2 sides to increasing page sizes. If you can fit more objects
on a page, there is a higher chance of collision at a given occupancy %,
reducing the effectiveness of meshing. On the other hand, it means that very
low occupancy pages (like single allocations) would cause an even larger
amount of memory to be held onto by a process - a situation Mesh would be able
to handle well.

We didn't look at huge pages in the paper for the reason that today, if you
care most about heap footprint you can always use 4K pages rather than 2M
ones. (and lots of database-like applications, like Redis, explicitly suggest
disabling huge pages)

------
yxhuvud
That is a cool technique. I wonder if it can be translated into a garbage
collected language, to create a nonmoving and compacting GC.

~~~
Hydraulix989
This is how garbage collection works (CLR is one implementation):

[https://en.wikipedia.org/wiki/Mark-
compact_algorithm](https://en.wikipedia.org/wiki/Mark-compact_algorithm)

~~~
masklinn
That is very much a _moving_ garbage collector, marked (live) objects get
moved to the start of the heap for compaction. GP is specifically asking about
the applicability of Mesh/TFA's non-moving capabilities for GCd systems:
moving objects requires the ability to update in-system references somehow,
and makes FFI more complex as references either can't be shared/leaked or
specific pinning API have to be set up (or some updatable addressing scheme
has to be provided but then you don't get straightforward pointers anymore).

------
Someone
I have trouble understanding why this works. _Mesh works by finding pairs of
pages and merging them together physically but not virtually_ works, until a
program makes alloc calls that end up allocating some memory in one of of
those virtual pages and some of it in the other. What guarantees that those
two allocations do not have overlap in physical memory?

Also, what happens when such a page with multiple virtual mappings get paged
out and paged in again? Does it stay a merged page?

~~~
_wmd
Since programs are only permitted read and or modify addresses within
allocations handed out by malloc, they are oblivious to the merge process as
pieces of another page are copied to virtual addresses near an existing
allocation in a page they are accessing

So long as mprotect() is used to ensure the source page is stable for the
duration of the copy and until the mremap() is complete, it should not be
possible for the program to observe any difference in behaviour -- except
perhaps if it tries to modify an allocation on a page that is currently being
merged by malloc() running in another thread (presumably they have some kind
of SIGSEGV handler to catch this)

~~~
Someone
Memory allocators use memory for two reasons: to dole it out to application,
and to keep track of what memory is used and what is free.

From the pictures, I thought merging only merged the first, but rereading, I
guess the former is stored in the same blocks as the latter. They still need
to make sure that multi-threaded access through different virtual pages works,
but that seems doable, too.

------
aetherspawn
If it’s a plug-in replacement and has little or no performance penalty, then
could it just be mainlined?

~~~
masklinn
Mainlined where? There is no mainline for allocators, C and C++ rely on the
system-provided allocator as their baseline, possibly replaced by an
alternative / custom allocator (though the same allocator could be system-
provided in one place and custom on another e.g. jemalloc is the system
allocator for freebsd but commonly used as custom allocator on other
platforms), supplemented with bespoke allocators for task-specific schemes
(e.g. arena or linear allocators where that makes sense).

You could try to get the mesh methodology integrated into existing allocators
I guess, but there is no such thing as a "mainline allocator".

~~~
iainmerrick
_e.g. jemalloc is the system allocator for freebsd but commonly used as custom
allocator on other platforms_

I think they just mean mainlined on a particular platform. If FreeBSD were to
switch its system allocator from jemalloc to mesh, for example, I’d count that
as mainlining.

