Hacker News
Rpmalloc – A lock-free memory allocator (github.com)
95 points by gbrown_ on Mar 19, 2017 | hide | past | web | favorite | 36 comments

Looks like they are missing jemalloc in the benchmarks, which would be good to see.

I think the simplicity of a 1300 line allocator is very alluring. At that level it should avoid bloating small programs and is hopefully easy enough to understand if digging in to it.

Also, it looks like it is split up into 3 files. Consolidating it into a single-header C library would be very nice.

One oddity is the 16-byte alignment. 64 bytes is the false-sharing boundary on x64, so I'm surprised to see this in a concurrent allocator, but maybe I've misunderstood exactly what that means here.

I'm going to add jemalloc, lockless and scalloc in the benchmark comparisons soon(ish).

Regarding the files, only rpmalloc.[h|c] is needed. The malloc.c file is just for overriding the malloc family of entry points and providing automatic init/fini glue.

The 16-byte alignment is just a note that all returned blocks are naturally aligned to 16-byte boundaries, since all size buckets are multiples of 16 bytes internally. It is needed for SSE instructions, which is why I mention it.
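A hypothetical sketch (not the library's actual code) of why multiple-of-16 size classes give this alignment for free: if spans start 16-byte aligned and every class size is a multiple of 16, every block carved from a span lands on a 16-byte boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: round a request up to the next multiple of 16,
 * as an allocator whose size classes are all multiples of 16 would.
 * Consecutive blocks of this size in a 16-aligned span stay 16-aligned,
 * which is the alignment SSE loads/stores want. */
static size_t bucket_size(size_t size)
{
    return (size + 15) & ~(size_t)15; /* round up to multiple of 16 */
}
```

For example, bucket_size(24) yields 32, and a request of exactly 16 stays 16.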

A fair comparison would be single-threaded against ptmalloc2 (glibc). I assume the slowdown will be ~5%.

And the size granularity is split into 3 tiers (small, medium, large), which is fine, but others have finer granularities. So benchmarks with object systems would be nice, not just random data.

There already is a single-threaded benchmark against glibc: compile the rpmalloc-benchmark repository linked in the readme and run the runall script, then compare the rpmalloc 1-thread case against crt 1-thread (crt being the C runtime, i.e. glibc in this case).

Graphs and analysis for a Linux system will be added soon, but a spoiler:

rpmalloc: 22.7M memory ops/CPU-second, 26 MiB peak, 15% overhead

jemalloc: 21.3M memory ops/CPU-second, 30 MiB peak, 33% overhead

tcmalloc: 20.4M memory ops/CPU-second, 30 MiB peak, 32% overhead

crt: 13.5M memory ops/CPU-second, 24 MiB peak, 7% overhead

rpmalloc is faster and has less overhead than both jemalloc and tcmalloc, and is a lot faster than glibc at the cost of extra memory overhead.

The reasoning behind the granularities is that 16 bytes is a reasonable granularity for smaller objects (below 2KiB in size). Medium-sized requests (between 2KiB and 32KiB) have a 512-byte granularity, while large requests (up to 2MiB) have a 62KiB granularity.
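The three tiers can be sketched as a simple size-class function. This is illustrative only, not the actual rpmalloc implementation, and it uses the granularity figures exactly as quoted above:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative three-tier size-class scheme, mirroring the numbers
 * quoted above (not rpmalloc's real code): each request is rounded up
 * to the granularity of its tier. */
#define SMALL_LIMIT   (2 * 1024)   /* up to 2 KiB: 16-byte steps   */
#define MEDIUM_LIMIT  (32 * 1024)  /* up to 32 KiB: 512-byte steps */

static size_t round_up(size_t size, size_t granularity)
{
    return (size + granularity - 1) / granularity * granularity;
}

static size_t class_size(size_t size)
{
    if (size <= SMALL_LIMIT)
        return round_up(size, 16);
    if (size <= MEDIUM_LIMIT)
        return round_up(size, 512);
    return round_up(size, 62 * 1024); /* large tier, up to 2 MiB */
}
```

The trade-off is classic internal fragmentation vs. bookkeeping: coarser steps waste more slack per block but need far fewer size classes.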

If you have a suggestion for additional benchmark models I would be happy to add an implementation for it.

Ouch, so this looks like the final nail in the coffin for ptmalloc2, and especially a shout to Emacs, which still hasn't finished its portable dumper. But not much is missing there; it should be ready soon. (I have a non-working branch on my GitHub.) Hope glibc will adopt this soon.

Any plans to add the allocators used in modern browsers? bmalloc (WebKit), whichever tcmalloc variant Chrome currently uses, the current Firefox allocator; I have no idea whether Edge has an allocator that isn't the default Windows one.

I will look into it.

16 bytes is the required alignment for long double, which is the largest required alignment (at least on x86). If you need more, use memalign or something similar.
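For alignments beyond the default 16 bytes (say, 64-byte cache lines or AVX-512), the portable route is C11 aligned_alloc (or POSIX posix_memalign). A small sketch with a hypothetical helper name:

```c
#include <stdlib.h>

/* alloc_aligned is a hypothetical convenience wrapper. C11's
 * aligned_alloc requires size to be a multiple of the alignment, so
 * round it up first. Alignment must be a power of two for the mask. */
static void *alloc_aligned(size_t alignment, size_t size)
{
    size_t rounded = (size + alignment - 1) & ~(alignment - 1);
    return aligned_alloc(alignment, rounded);
}
```

Usage: alloc_aligned(64, 100) returns a 64-byte-aligned block of at least 100 bytes; free it with plain free().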

Indeed. jemalloc uses a lock-free radix tree at its core.

jemalloc doesn't support non-Unix platforms (e.g. Windows). You may or may not be able to find a Windows port of any particular jemalloc version, but the developers ignore Windows because they don't use it in their server software. We gave up using jemalloc on Windows because it was such a crap shoot.

That sounds like a good basis for a "portable" version project. Other very targeted software (such as OpenSSH[1]) does this. I'm not sure what it takes to make jemalloc Windows-compatible, but even if the jemalloc developers don't want to support Windows, it's possible they would be more amenable to working with a project dedicated to portability in general.

Also, my understanding is that Rust uses jemalloc, and Rust runs on Windows, so what does Rust do? Just use a known-good ported jemalloc version?

1: https://www.openssh.com/portable.html

We don't use jemalloc on Windows, we use the system allocator, aka HeapAlloc.

There are already versions out there on GitHub that have Windows support built in; I've used one on Windows. I also had problems with it, however.

jemalloc does support Windows, and stability has improved dramatically in recent releases, primarily because we now have continuous integration coverage for several Windows configurations.

Remember kids: "lock-free" does not always mean faster, it just means lock-free.

Example: lock-free linked lists are great in spiky-write, consistent-read situations, but in high-write situations they probably aren't any better than locked linked lists.
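To make the trade-off concrete, here is a minimal Treiber-style lock-free stack in C11 atomics. No thread ever blocks, but every failed compare-and-swap under write contention is retried work, which is why lock-free is not automatically faster than a lock. (Pop as written ignores the ABA problem, so this is a sketch, not production code.)

```c
#include <stdatomic.h>
#include <stdlib.h>

struct node { int value; struct node *next; };

/* Shared stack top, updated only via CAS. */
static _Atomic(struct node *) head = NULL;

static void push(int value)
{
    struct node *n = malloc(sizeof *n);
    n->value = value;
    n->next = atomic_load(&head);
    /* On failure, CAS refreshes n->next with the current head; retry. */
    while (!atomic_compare_exchange_weak(&head, &n->next, n))
        ;
}

static int pop(void) /* returns -1 on empty */
{
    struct node *n = atomic_load(&head);
    /* On failure, CAS refreshes n with the current head; retry. */
    while (n && !atomic_compare_exchange_weak(&head, &n, n->next))
        ;
    int v = n ? n->value : -1;
    free(n);
    return v;
}
```

Under heavy concurrent pushes, all threads hammer the same head word, so cache-line ping-pong and CAS retries can cost as much as a contended lock would.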

They don't mention anything about memory fragmentation.

Pretty graphs, but insufficient analysis; there's not even a mention of fragmentation in the readme.

What specifically would you like to know? I would be happy to elaborate more in the readme if you can explain in more detail what you mean with "insufficient analysis". The reason I posted about it on twitter in the first place was that I wanted feedback and comments on what to improve.

(Btw, I'm the author of the library)

Sorry for the terse (and, let's face it, slightly unintentionally mean) comment. I just meant that the most interesting aspect of an allocator is not simply its performance, but how it copes with adverse conditions. Yours is already lockless, so that sounds positive already w.r.t. multithreaded performance, but how about when faced with classical allocation patterns likely to cause fragmentation (alloc 2KB, alloc 2KB, free 2KB, alloc 3KB, rinse, repeat), and suchlike?
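The pattern described can be spelled out as a tiny torture loop. Each freed 2 KiB hole is too small for the following 3 KiB request, so a naive allocator keeps growing the heap even though live memory stays bounded; how gracefully an allocator handles this is exactly the fragmentation question:

```c
#include <stdlib.h>

/* Run the alloc-2K / alloc-2K / free-2K / alloc-3K pattern for a number
 * of rounds and return how many blocks stayed live. The freed 2 KiB
 * gaps cannot satisfy the 3 KiB requests, which is what stresses an
 * allocator's fragmentation handling. */
static size_t run_pattern(int rounds)
{
    size_t live = 0;
    void **keep = malloc(2 * (size_t)rounds * sizeof *keep);
    if (!keep)
        return 0;
    for (int i = 0; i < rounds; i++) {
        void *hole = malloc(2048);   /* alloc 2 KiB                 */
        keep[live++] = malloc(2048); /* alloc 2 KiB, stays live     */
        free(hole);                  /* free: leaves a 2 KiB gap    */
        keep[live++] = malloc(3072); /* 3 KiB cannot reuse the gap  */
    }
    for (size_t i = 0; i < live; i++)
        free(keep[i]);
    free(keep);
    return live;
}
```

A useful fragmentation metric is then peak mapped memory divided by peak live bytes over many rounds of such a loop.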

I've never had a desire to deviate from my underlying library's libc allocator. The principal reason for that (aside from the potential for mismatched malloc/free calls due to integration errors) is that I know this stuff is a dark art to nail correctly, and I'm opening myself up to all kinds of potential unknowns by discarding the finely tweaked (although potentially not the most efficient) libc allocator that's been getting patched to handle adversity for the past 20-30 years.

I'd expect documentation of this sort to appear in the readme, even if just to mention another allocator this one is patterned on, where I could maybe go digging for further information around failure cases.

This is just a conservative engineer talking: I've never actually written my own general-purpose heap allocator, although I understand the design of a few, perhaps not sufficiently to understand the motivation behind the design (e.g. the kinds of failure cases the design is hoping to avoid). Just more interested in how it's likely to break my app :)

Thank you for the clarifications, I will definitely improve the documentation w.r.t. trouble cases, fragmentation and similar allocators.

The goal is of course not to replace the standard allocator, but rather to offer an alternative for those cases where the standard allocator performs poorly.

I don't see it mentioned, so does anyone know: Can I use this in parallel with malloc? Meaning, can one part of my program use this for allocating memory, while another uses the standard library malloc/free?

If you implement the pmr interface and use pmr container types[1] then yes.

1: http://en.cppreference.com/w/cpp/memory/memory_resource

Yes, you can (unless you also include the provided malloc.c glue code, which overrides the malloc family of entry points; only use the rpmalloc.c implementation). However, read the caveat section in the readme if you're on an mmap-based system.

Does it support allocating memory from HugePages? I couldn't find anything, but it should be easy to add for *nix systems, as it is already using mmap to request memory from the kernel.

It does not, at the moment. Feel free to create an issue about it on the github repo, or even submit a PR if you're feeling creative.
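For the curious, a Linux-specific sketch of what such support could look like on top of an mmap-based allocator (this is an assumption about a possible implementation, not rpmalloc's code): try MAP_HUGETLB first, and fall back to regular pages if no huge pages are configured.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical mapping helper: request huge pages when available,
 * otherwise fall back to normal 4 KiB pages. size should be a multiple
 * of the huge page size (typically 2 MiB on x86-64) for the first call
 * to succeed. */
static void *map_memory(size_t size)
{
    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (ptr == MAP_FAILED) /* no huge pages reserved: fall back */
        ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return ptr == MAP_FAILED ? NULL : ptr;
}
```

Huge pages cut TLB pressure for allocation-heavy workloads, at the cost of requiring reserved pages (vm.nr_hugepages) or transparent huge page support.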

Seems interesting; but what happens when 1 thread allocates memory, passes it to another thread and ends? There does not seem to be a way to change ownership of a memory block?

Memory blocks are owned by the span (the page-aligned larger bucket of blocks) they belong to. While a span is in use (at least one block is allocated) it is indeed owned by the heap that allocated the block. When the span is completely freed, ownership is released.

In your case, the block will be marked as free for allocation. If it is the last used block in that span, the span will eventually be returned to the global cache and reused by other heaps/threads when the owner thread ends. If the span is still used, it will remain in the heap until the heap is picked up by another thread. Heaps are reused as threads are joined and created (i.e. the number of allocated heaps is the maximum number of concurrent threads since the process started).

The worst case is that one thread allocates blocks, hands them off to other threads then ends, and that the process never creates additional threads. In this case, the spans will remain mapped and unused. I have planned to allow other threads to pick up and collect spans from unused heaps at some point to solve this.

That's pretty cool. Is there a way to track progress on reuse (like an issue I can follow)?

Has anyone measured what it takes to free a huge number of small reserved blocks at once, e.g. freeing 10^8 blocks of 16 bytes?

Are you doing something where you are doing one hundred million small allocations that all get freed at once?

If so, I'm going to make a bold claim that there are better ways to manage that memory than a general purpose allocator.

It was just curiosity about that corner case. Obviously, for handling many small memory elements it is better to avoid calling the allocator as much as possible, e.g. using one linear allocation for all the elements and growing it with realloc() using a geometric resize heuristic (i.e. instead of realloc(new_size), doing realloc(new_size * 1.50)).
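The geometric-resize heuristic sketched above, as a minimal growable array (hypothetical helper names): growing capacity by ~1.5x means n appends cost O(n) amortized copies instead of O(n^2) with a realloc per element.

```c
#include <stdlib.h>

/* Minimal growable int array using a ~1.5x growth factor. Error
 * handling is elided for brevity; real code must check realloc's
 * return value before overwriting the old pointer. */
struct vec { int *data; size_t len, cap; };

static void vec_push(struct vec *v, int value)
{
    if (v->len == v->cap) {
        size_t new_cap = v->cap ? v->cap + v->cap / 2 : 16; /* *1.5 */
        v->data = realloc(v->data, new_cap * sizeof *v->data);
        v->cap = new_cap;
    }
    v->data[v->len++] = value;
}
```

With 10^8 16-byte elements this turns 10^8 allocator calls into roughly 40 reallocs of one linear buffer.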

That many allocations probably shouldn't even be allocations in the first place. Realloc is not going to make things better; it copies the memory for anything but very tiny allocations.

Realloc, if well implemented, does not copy (using virtual memory stuff).

In this case, would it work similarly to having a single big mmap and counting on the kernel allocating pages on first write? So in most cases you would only have a single mmap call and multiple page faults (as would always be the case anyway). I assume most allocators use mmap for big allocations, so it would not need to be called directly.

I recently tried a very naive synthetic test to check when realloc would return a different address (meaning that it had to copy). My system started swapping, but it did not return a different address. I did not know that realloc worked like that, but why wouldn't it?

> check when realloc will return different address (meaning that it had to copy)

What faragon is referring to is that this assumption is incorrect: the virtual address changing does not imply the memory was copied. On Linux, for example, it could have been remapped with mremap(2).
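A small demonstration of the guarantee realloc actually gives when growing a large block: the contents survive whether or not the address changes, and on Linux even a changed address may just mean the pages were moved with mremap(2) rather than physically copied.

```c
#include <stdlib.h>
#include <string.h>

/* Fill a large block, double it with realloc, and report whether the
 * original bytes survived (they always should). Whether the returned
 * address equals the old one is an implementation detail. */
static int grow_preserves(size_t n)
{
    unsigned char *p = malloc(n);
    if (!p)
        return 0;
    memset(p, 0xAB, n);
    unsigned char *q = realloc(p, 2 * n); /* address may change */
    if (!q) {
        free(p);
        return 0;
    }
    int ok = (q[0] == 0xAB && q[n - 1] == 0xAB);
    free(q);
    return ok;
}
```
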
