
Rpmalloc – A lock-free memory allocator - gbrown_
https://github.com/rampantpixels/rpmalloc
======
CyberDildonics
Looks like they are missing jemalloc in the benchmarks, which would be good to
see.

I think the simplicity of a 1300 line allocator is very alluring. At that
level it should avoid bloating small programs and is hopefully easy enough to
understand if digging in to it.

Also, it looks like it is split up into 3 files. Consolidating it into a single-header
C library would be very nice.

One oddity is the 16 byte alignment. 64 bytes is the false-sharing boundary
on x64, so I'm surprised to see this in a concurrent allocator, but maybe I've
misunderstood exactly what that means here.

~~~
maniccoder
I'm going to add jemalloc, lockless and scalloc in the benchmark comparisons
soon(ish).

Regarding the files, only rpmalloc.[h|c] is needed. The malloc.c file is just
for overriding the malloc family of entry points and providing automatic
init/fini glue.

The 16 byte alignment is just a note that all returned blocks are naturally
aligned to 16 byte boundaries due to the fact that all size buckets are a
multiple of 16 bytes internally. It is needed for SSE instructions, which is
why I mention it.
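
To make that concrete, a tiny sketch (the rpmalloc/rpfree/rpmalloc_initialize
entry-point names are taken from the repository's rpmalloc.h; treat this as
illustrative rather than canonical):

    #include <emmintrin.h>   /* SSE intrinsics */
    #include <assert.h>
    #include <stdint.h>
    #include "rpmalloc.h"

    int main(void) {
        rpmalloc_initialize();                 /* explicit init when not using the malloc.c glue */
        float *buf = rpmalloc(64 * sizeof(float));
        assert(((uintptr_t)buf & 15) == 0);    /* every returned block is 16 byte aligned */
        _mm_store_ps(buf, _mm_set1_ps(1.0f));  /* aligned SSE store is therefore safe */
        __m128 v = _mm_load_ps(buf);           /* ...and so is an aligned load */
        _mm_store_ps(buf, _mm_add_ps(v, v));
        rpfree(buf);
        rpmalloc_finalize();
        return 0;
    }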

~~~
rurban
A fair comparison would be single-threaded against ptmalloc2 (glibc). I assume
the slowdown will be ~5%.

And the size granularity is split into 3 (small, medium, large) which is fine,
but others have finer granularities. So benchmarks with object systems would
be nice, not just random data.

~~~
maniccoder
There already is a single-threaded benchmark against glibc: compile the
rpmalloc-benchmark repository linked in the readme and run the runall script.
Compare the rpmalloc 1 thread case against the crt 1 thread case (with crt
being the C runtime, i.e. glibc in this case).

Graphs and analysis for a Linux system will be added soon, but a spoiler:

rpmalloc: 22.7m memory ops/CPU second, 26 MiB peak, 15% overhead
jemalloc: 21.3m memory ops/CPU second, 30 MiB peak, 33% overhead
tcmalloc: 20.4m memory ops/CPU second, 30 MiB peak, 32% overhead
crt:      13.5m memory ops/CPU second, 24 MiB peak, 7% overhead

rpmalloc is faster and has less overhead than both jemalloc and tcmalloc, and
a lot faster than glibc at the cost of extra memory overhead.

The reasoning behind the granularities is that 16 bytes is a reasonable
granularity for smaller objects (below 2KiB in size). Medium sized requests
(between 2KiB and 32KiB) have a 512 byte granularity, while large requests (up
to 2MiB) have a 62KiB granularity.
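
As a rough sketch of that rounding (not the library's actual code, just the
granularities as stated above):

    #include <stddef.h>

    /* Round a request up to its size bucket, using the granularities
       described above; purely illustrative, not rpmalloc's implementation. */
    static size_t bucket_size(size_t size) {
        if (size <= 2048)                       /* small: 16 byte granularity */
            return (size + 15) & ~(size_t)15;
        if (size <= 32 * 1024)                  /* medium: 512 byte granularity */
            return (size + 511) & ~(size_t)511;
        const size_t g = 62 * 1024;             /* large: granularity as stated, up to 2MiB */
        return ((size + g - 1) / g) * g;
    }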

If you have a suggestion for additional benchmark models I would be happy to
add an implementation for it.

~~~
rurban
Ouch, so this looks like the final nail in the coffin for ptmalloc2, and esp. a
shout-out to Emacs, which still hasn't finished its portable dumper. But not
much is missing there, it should be ready soon. (I have a non-working branch on
my GitHub.) Hope glibc will adopt this soon.

------
hacknat
Remember kids, "lock-free" does not always mean faster, it just means lock-free.

Example: lock-free linked lists are great in spiky-write, consistent-read
situations, but in high-write situations they probably aren't any better than
locked linked lists.

------
tqh
They don't mention anything about memory fragmentation.

------
_wmd
Pretty graphs but insufficient analysis, not even a mention of fragmentation
in the readme.

~~~
maniccoder
What specifically would you like to know? I would be happy to elaborate more
in the readme if you can explain in more detail what you mean with
"insufficient analysis". The reason I posted about it on twitter in the first
place was that I wanted feedback and comments on what to improve.

(Btw, I'm the author of the library)

~~~
_wmd
Sorry for the terse (and let's face it, slightly unintentionally mean)
comment. I just meant that the most interesting aspect of an allocator is not
simply its performance, but how it copes with adverse conditions. Yours is
already lockless, so that sounds positive w.r.t. multithreaded performance,
but how does it behave when faced with classical allocation patterns likely to
cause fragmentation (alloc 2 KB, alloc 2 KB, free 2 KB, alloc 3 KB, rinse
repeat), and suchlike?
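
Something along these lines, say (just a sketch of the pattern against any
malloc-compatible allocator, not a rigorous fragmentation benchmark):

    #include <stdlib.h>

    /* Interleave live 2 KiB blocks with freed 2 KiB holes, then ask for 3 KiB,
       which cannot reuse those holes; a naive allocator keeps growing the heap. */
    static void churn(int iterations) {
        void **keep2k = malloc(iterations * sizeof(void *));
        void **keep3k = malloc(iterations * sizeof(void *));
        for (int i = 0; i < iterations; i++) {
            void *hole = malloc(2 * 1024);   /* will become a hole */
            keep2k[i]  = malloc(2 * 1024);   /* stays live, pins its neighbourhood */
            free(hole);
            keep3k[i]  = malloc(3 * 1024);   /* too big for the 2 KiB hole */
        }
        /* ...measure peak memory here, then release everything... */
        for (int i = 0; i < iterations; i++) { free(keep2k[i]); free(keep3k[i]); }
        free(keep2k); free(keep3k);
    }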

I've never had a desire to deviate from my underlying libc allocator, but the
principal reason for that (aside from the potential for mismatched malloc/free
calls due to integration errors) is that I know this stuff is a dark art to
nail correctly, and I'm opening myself up to all kinds of potential unknowns by
discarding the finely tweaked (although potentially not the most efficient)
libc allocator that has been patched to handle adversity for the past 20-30
years.

I'd expect documentation of this sort to appear in the readme, even if just to
mention another allocator this one is patterned on, where I could maybe go
digging for further information around failure cases.

This is just a conservative engineer talking, I've never actually written my
own general purpose heap allocator, although I understand the design of a few,
but perhaps not sufficiently to understand the motivation behind the design
(e.g. the kinds of failure cases the design is hoping to avoid). Just more
interested in how it's likely to break my app :)

~~~
maniccoder
Thank you for the clarifications, I will definitely improve the documentation
w.r.t. trouble cases, fragmentation and similar allocators.

The goal is of course not to replace the standard allocator, but rather to
offer an alternative for those cases where the standard allocator performs
poorly.

------
Asooka
I don't see it mentioned, so does anyone know: Can I use this in parallel with
malloc? Meaning, can one part of my program use this for allocating memory,
while another uses the standard library malloc/free?

~~~
frutiger
If you implement the pmr interface and use pmr container types[0] then yes.

[0] http://en.cppreference.com/w/cpp/memory/memory_resource

------
karavelov
Does it support allocating memory from HugePages? I couldn't find anything, but
it should be easy to add for *nix systems as it is already using mmap to
request memory from the kernel.

~~~
maniccoder
It does not, at the moment. Feel free to create an issue about it on the
github repo, or even submit a PR if you're feeling creative.

------
ckok
Seems interesting; but what happens when 1 thread allocates memory, passes it
to another thread and ends? There does not seem to be a way to change
ownership of a memory block?

~~~
maniccoder
Memory blocks are owned by the span (the page-aligned larger bucket of blocks)
they belong to. While a span is in use (at least one block is allocated), it is
indeed owned by the heap that allocated the block. When the span is completely
freed, ownership is released.

In your case, the block will be marked as free for allocation. If it is the
last used block in that span, the span will eventually be returned to the
global cache and reused by other heaps/threads when the owner thread ends. If
the span is still in use, it will remain in the heap until the heap is picked
up by another thread. Heaps are reused as threads are joined and created (i.e.
the number of allocated heaps is the maximum number of concurrent threads since
the process started).

The worst case is that one thread allocates blocks, hands them off to other
threads then ends, and that the process never creates additional threads. In
this case, the spans will remain mapped and unused. I have planned to allow
other threads to pick up and collect spans from unused heaps at some point to
solve this.
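
For readers wanting to see what that means in practice, a minimal handoff could
look like this (assuming the program links the malloc.c override mentioned
above, so plain malloc/free route through rpmalloc with the automatic init/fini
glue):

    #include <pthread.h>
    #include <stdlib.h>

    /* Producer allocates and exits; consumer frees later.  The block stays
       valid: its span remains owned by the producer's (now idle) heap until
       every block in that span has been freed, as described above. */
    static void *producer(void *out) {
        *(void **)out = malloc(256);   /* handed off to the main thread */
        return NULL;                   /* thread ends; its heap is kept for reuse */
    }

    int main(void) {
        void *block = NULL;
        pthread_t t;
        pthread_create(&t, NULL, producer, &block);
        pthread_join(t, NULL);
        free(block);   /* cross-thread free: the block is marked free in its
                          owning span; the span is recycled once fully drained */
        return 0;
    }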

~~~
ckok
That's pretty cool. Is there a way to track progress on reuse (like an issue I
can follow)?

~~~
maniccoder
https://github.com/rampantpixels/rpmalloc/issues/6

------
faragon
Has anyone measured what it takes to free a large number of small blocks at
once, e.g. freeing 10^8 blocks of 16 bytes?

~~~
CyberDildonics
Are you doing something where you are doing one hundred million small
allocations that all get freed at once?

If so, I'm going to make a bold claim that there are better ways to manage
that memory than a general purpose allocator.

~~~
faragon
It was just curiosity about that corner case. Obviously, for handling many
small memory elements, it is better to avoid calling the allocator as much as
possible, e.g. using one linear allocation for all the elements and growing it
with realloc() using a geometric resize heuristic (i.e. instead of
realloc(new_size), doing realloc(new_size * 1.50)).
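
For illustration, that heuristic is basically (hypothetical buf_t helper,
nothing allocator-specific):

    #include <stdlib.h>
    #include <string.h>

    typedef struct { char *data; size_t len, cap; } buf_t;

    /* Grow capacity by ~1.5x instead of to the exact requested size, so the
       number of realloc calls stays logarithmic in the final element count. */
    static int buf_append(buf_t *b, const void *src, size_t n) {
        if (b->len + n > b->cap) {
            size_t new_cap = b->cap ? b->cap : 64;
            while (new_cap < b->len + n)
                new_cap += new_cap / 2;        /* geometric growth */
            char *p = realloc(b->data, new_cap);
            if (!p) return -1;
            b->data = p;
            b->cap = new_cap;
        }
        memcpy(b->data + b->len, src, n);
        b->len += n;
        return 0;
    }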

~~~
CyberDildonics
That many allocations probably shouldn't even be allocations in the first
place. Realloc is not going to make things better; it copies the memory for
anything but very tiny allocations.

~~~
faragon
Realloc, if well implemented, does not copy (using virtual memory stuff).

~~~
hawski
In this case, would it work similarly to having a single big mmap and counting
on the kernel to allocate pages on first write? So for most cases you would
only have a single mmap call and multiple page faults (as would always be the
case). I assume that most allocators use mmap for big allocations, so it does
not need to be called directly.

I recently tried a very naive synthetic test to check when realloc would
return a different address (meaning that it had to copy). My system started
swapping, but it did not return a different address. I did not know that
realloc works like that, but why wouldn't it?

~~~
hoytech
> check when realloc will return different address (meaning that it had to
> copy)

What faragon is referring to is that this assumption is incorrect. The virtual
address changing does not imply the memory was copied. On Linux, for example,
it could have been remapped with mremap(2).
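
A Linux-only sketch of that mechanism, using mremap(2) directly (error handling
omitted): the kernel can relocate the mapping to a new virtual address by
rewriting page table entries, without physically copying the pages.

    #define _GNU_SOURCE
    #include <sys/mman.h>
    #include <stdio.h>

    int main(void) {
        size_t old_size = 4u << 20, new_size = 64u << 20;
        /* Anonymous mapping standing in for a large malloc'd block. */
        void *p = mmap(NULL, old_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        /* MREMAP_MAYMOVE lets the kernel move the mapping by updating
           page tables rather than copying page contents. */
        void *q = mremap(p, old_size, new_size, MREMAP_MAYMOVE);
        printf("address changed: %s (no page copy either way)\n",
               q == p ? "no" : "yes");
        munmap(q, new_size);
        return 0;
    }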

