
Scalable memory allocation using jemalloc (2011) - luu
https://www.facebook.com/notes/facebook-engineering/scalable-memory-allocation-using-jemalloc/480222803919
======
nkurz
Maybe this is the right place to ask a meta-question on malloc that I've been
wondering about.

Are there any malloc implementations that try to work more cooperatively with the
virtual memory system? The systems I've looked into seem to duplicate a lot of
what the kernel and hardware are already doing. They don't seem to really
distinguish between actively mapped memory and allocated but untouched memory.

With a 64-bit system, malloc could internally just reserve some vast logical
address space and let the kernel fault pages in as needed. Instead of having a
single frontier, you might have thousands of arenas used for different slab
sizes, each being automatically extended as needed. Each arena would waste at
most one 4KB page, but on modern systems this seems unlikely
to ever add up to a real burden.

You'd still need record-keeping for allocations within a slab, but the page
table itself would be the authoritative record for the arenas themselves. All
the pieces for this approach seem fairly straightforward. The only major thing
lacking (I think) is an efficient userspace way to request the unmapping of a
page that is no longer needed.

Is there research into this approach? What key terms should I be searching
for?

Edit:

I think this discussion may be about the same approach that I'm wondering
about:
[http://caml.inria.fr/mantis/view.php?id=6101](http://caml.inria.fr/mantis/view.php?id=6101)

And this paper touches on a lot of the same issues, although coming at it from
the direction of modifying the hardware:
[http://research.cs.wisc.edu/multifacet/papers/isca13_direct_...](http://research.cs.wisc.edu/multifacet/papers/isca13_direct_segment.pdf)

~~~
gosu
You don't need to have a single frontier, and IIRC no one does these days. A
stack of slabs per "arena" suffices, and the extra cost is only a read from a
cacheline you'll soon need anyway. It looks like jemalloc uses a tree of
"runs" for that.

I don't know exactly what you're imagining, but I don't see how you can pass
this off to the kernel. You need a slab which has free memory. How do you know
where to find it? You could write a nasty new syscall to ask the kernel for
ZFOD (zero-fill-on-demand) pages, and then cry because your program is slow.

Also, unless your slabs are huge and your allocator is being used to fuel some
kind of pre-caching, faulting pages is going to be suboptimal. You incur at
least twice as many mode switches: one for the mmap call, one for each
faulting page. I've measured this, and it gets pretty expensive. I'm kind of
dubious about the performance of schemes which stray from the principle of
"allocate what you actually need; free it when you don't."

The bottom line is that a slab allocator should be solving a different, mostly
less general problem than the kernel. It needs finer granularity and it's
optimizing for common block sizes that are small enough to fit into slabs of
known sizes. So it should optimize for allocating page chunks in a small
subset of possible sizes (namely, the slab size(s)), while the kernel should
be worried about unbounded, oddly-sized page chunks.

I toyed with scalable allocators for a course a few years back. It's a little
embarrassing, but if you want to see what the costs actually are:
[http://www.andrew.cmu.edu/user/apodolsk/418/finalreport.html](http://www.andrew.cmu.edu/user/apodolsk/418/finalreport.html)

You can do lockfree slab allocators pretty easily, so the game in scalable
malloc is really about avoiding inter-CPU cacheline sharing and fitting your
data structures into per-CPU caches. Slabs happen to be really good for that,
for non-obvious reasons.

~~~
nkurz
It certainly wouldn't be the first time that I've suggested something
ridiculous. One easy explanation for the lack of papers I can find is that the
idea is so obviously useless that no one has written about it!

That said, it just seems unlikely that the best solution for 32-bit systems
with limited virtual address space and slow interrupt-based system calls is
also the best solution for 64-bit systems with practically limitless virtual
address space and fast hardware-supported syscall/sysenter.

And I think what I'm suggesting is roughly analogous to what's being done in
Azul's (successful?) C4 garbage collector: provide fast userspace support for
per-process page table manipulation, and offload as much as possible to the
hardware:
[http://www.azulsystems.com/sites/default/files/images/c4_pap...](http://www.azulsystems.com/sites/default/files/images/c4_paper_acm.pdf)

But I'm definitely too fuzzy about some of the key concepts to suggest
anything particularly coherent at this point. I'll check out your writeup, and
try to learn more. Thanks!

------
jasone
Hey folks, jemalloc author here, willing to answer any questions you have
about it. I'll be AFK for much of today but will be around all day tomorrow.

------
martinpw
It would be interesting to see the performance of TBB malloc included in the
mix. For our application (parallel graph evaluation) we found the TBB
allocator outperformed jemalloc. The margin was not huge, but enough to make
it worthwhile using it. However a colleague at a company with somewhat similar
software found jemalloc outperformed TBB, so your mileage definitely varies
depending on application.

It can also happen that these allocators use more memory than regular malloc
since they hold onto local pools, so that may be a concern for some use cases.

------
killercup
Interesting. Just today I [read][1] some discussion about which platforms and
under which circumstances Rust should use jemalloc instead of libc's malloc.

[1]:
[http://smallcultfollowing.com/babysteps/blog/2014/11/14/allo...](http://smallcultfollowing.com/babysteps/blog/2014/11/14/allocators-in-rust/)

------
CyberDildonics
So a few questions on C and C++ memory allocation for the experts here.

Will typical C++ new (and C malloc) give thread contention issues?

Specifically, I think I have seen multithreading performance die when multiple
tasks are finishing and doing lots of freeing at the same time.

- Is it possible that is true, and if so, is that the sort of thing tcmalloc
(and jemalloc?) fix?

- If they fix it, what is the best way to get that into C++ programs that use
new instead of malloc (the vast majority, through STL structures)?

~~~
rayiner
Yes (to both--new usually calls malloc), but the degree to which depends on
the platform. In the old days, there was a single lock around the allocator
serializing malloc/free. These days, all the major platforms have a memory
allocator that tolerates some degree of concurrency, and you can also link in
tcmalloc or jemalloc. Even then, a lot will depend on your allocation
patterns. For example, many threaded malloc implementations depend on thread-
local caches to avoid having to take a lock on the fast path. This works great
for many workloads, but can also fall flat on others. Consider what happens
when a producer thread allocates memory, passes a pointer to a consumer thread
which frees it. The thread-local cache in the producer thread will never be
replenished by free() calls, and the thread-local cache in the consumer thread
will never be used by malloc() calls. Thus performance can become very
dependent on the mechanism the allocator uses to transfer memory to/from the
local caches and the global heap.

------
robmccoll
Does anyone know of a more recent comparison between tcmalloc and jemalloc?

Also of note, if you haven't played with this before, be aware of how you are
linking if you are in a more complex project. It is possible (though
difficult) to end up with multiple malloc implementations in one binary, so be
careful to keep from passing memory allocated in one to the free function of
the other :-)

------
based2
[http://osvdb.org/show/osvdb/110196](http://osvdb.org/show/osvdb/110196)

[http://osvdb.org/search?search[vuln_title]=jemalloc&search[t...](http://osvdb.org/search?search\[vuln_title\]=jemalloc&search\[text_type\]=alltext)

------
mschuster91
In case FB ever has problems with a mass migration of users to a successor
(unlikely in the next 10 years, given the billions of people who still have
neither internet access nor a FB account), they can still morph into a
software and hardware technology giant...

------
tyrion
I think it should be stated in the title that this is a post from 2011.

------
flatestcat
Wow, I've been looking for something like this. Much better than vanilla
malloc. Thanks!

