I think the simplicity of a 1300-line allocator is very alluring. At that size it should avoid bloating small programs and is hopefully easy enough to understand when digging into it.
Also it looks like it is split up into 3 files; consolidating it into a single-header C library would be very nice.
One oddity is the 16-byte alignment. 64 bytes is the false-sharing boundary on x64, so I'm surprised to see this in a concurrent allocator, but maybe I've misunderstood exactly what that means here.
Regarding the files, only rpmalloc.[h|c] is needed. The malloc.c file is just for overriding the malloc family of entry points and providing automatic init/fini glue.
The 16-byte alignment is just a note that all returned blocks are naturally aligned to 16-byte boundaries, since all size buckets are a multiple of 16 bytes internally. That alignment is needed for SSE instructions, which is why I mention it.
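For illustration, a minimal sketch showing that a returned block can be fed directly to an aligned SSE load (rpmalloc/rpfree and the init/finalize calls are the library's public entry points as I understand them; check rpmalloc.h):

    /* Sketch: verify the 16-byte natural alignment and use it
       for an aligned SSE load. */
    #include <stdint.h>
    #include <stdio.h>
    #include <emmintrin.h>
    #include "rpmalloc.h"

    int main(void) {
        rpmalloc_initialize();
        float *v = rpmalloc(4 * sizeof(float));
        printf("16-byte aligned: %s\n",
               ((uintptr_t)v & 15) == 0 ? "yes" : "no");
        __m128 x = _mm_load_ps(v);  /* aligned load; would fault if unaligned */
        _mm_store_ps(v, x);
        rpfree(v);
        rpmalloc_finalize();
        return 0;
    }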
And the size granularity is split into 3 classes (small, medium, large), which is fine, but others have finer granularities. So benchmarks with object systems would be nice, not just random data.
Graphs and analysis for a Linux system will be added soon, but a spoiler:
rpmalloc 22.7m memory ops/CPU second, 26 MiB peak, 15% overhead
jemalloc 21.3m memory ops/CPU second, 30 MiB peak, 33% overhead
tcmalloc 20.4m memory ops/CPU second, 30 MiB peak, 32% overhead
crt 13.5m memory ops/CPU second, 24 MiB peak, 7% overhead
rpmalloc is faster and has less overhead than both jemalloc and tcmalloc, and it is a lot faster than glibc, at the cost of extra memory overhead.
The reasoning behind the granularities is that 16 bytes is a reasonable granularity for smaller objects (below 2KiB in size). Medium-sized requests (between 2KiB and 32KiB) have a 512-byte granularity, while large requests (up to 2MiB) have a 62KiB granularity.
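To make the rounding concrete, here is a hypothetical sketch of how a request size would map to a bucket under those granularities (my own illustration, not rpmalloc's actual internal logic):

    #include <stddef.h>

    /* Round size up to the next multiple of the class granularity. */
    static size_t round_up(size_t size, size_t granularity) {
        return ((size + granularity - 1) / granularity) * granularity;
    }

    static size_t bucket_size(size_t size) {
        if (size <= 2 * 1024)
            return round_up(size, 16);        /* small: 16-byte steps   */
        if (size <= 32 * 1024)
            return round_up(size, 512);       /* medium: 512-byte steps */
        return round_up(size, 62 * 1024);     /* large: 62KiB steps     */
    }

So, for example, a 40-byte request would occupy a 48-byte block, giving at most 16 bytes of waste per small allocation.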
If you have a suggestion for additional benchmark models I would be happy to add an implementation for it.
Also, my understanding is that Rust uses jemalloc, and Rust runs on Windows, so what does Rust do there? Just use a known-good ported jemalloc version?
Example: lock-free linked lists are great in spiky-write, consistent-read situations, but in high-write situations they probably aren't any better than locked linked lists.
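As a concrete sketch of why (C11 atomics, a Treiber-style push; purely an illustration): every writer that loses the CAS race must loop and retry, so heavy write contention burns cycles on retries.

    #include <stdatomic.h>

    typedef struct node { struct node *next; int value; } node;
    static _Atomic(node *) head;

    /* Lock-free push: under high write contention, failed CAS
       attempts force retries, and throughput can drop to (or below)
       that of a mutex-protected list. */
    static void push(node *n) {
        node *old = atomic_load(&head);
        do {
            n->next = old;
        } while (!atomic_compare_exchange_weak(&head, &old, n));
    }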
(Btw, I'm the author of the library)
I've never had a desire to deviate from my underlying library's libc allocator. The principal reason for that (aside from the potential for mismatched malloc/free calls due to integration errors) is that I know this stuff is a dark art to nail correctly, and I'm opening myself up to all kinds of potential unknowns by discarding the finely tweaked (though perhaps not the most efficient) libc allocator that's been getting patched to handle adversity for the past 20-30 years.
I'd expect documentation of this sort to appear in the readme, even if just to mention another allocator this one is patterned on, where I could maybe go digging for further information about failure cases.
This is just a conservative engineer talking. I've never actually written my own general-purpose heap allocator, and although I understand the design of a few, perhaps not sufficiently to understand the motivation behind this one's design (e.g. the kinds of failure cases it is hoping to avoid). I'm just more interested in how it's likely to break my app :)
The goal is of course not to replace the standard allocator, but rather to offer an alternative for those cases where the standard allocator performs poorly.
In your case, the block will be marked as free for allocation. If it is the last used block in that span, the span will eventually be returned to the global cache and reused by other heaps/threads once the owner thread ends. If the span is still in use, it will remain in the heap until the heap is picked up by another thread. Heaps are reused as threads are joined and created (i.e. the number of allocated heaps equals the maximum number of concurrent threads since the process started).
The worst case is that one thread allocates blocks, hands them off to other threads, then ends, and the process never creates additional threads. In this case, the spans will remain mapped and unused. I plan to allow other threads to pick up and collect spans from unused heaps at some point to solve this.
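A hypothetical sketch of that worst case (the per-thread init/teardown calls are my assumption of the API's shape and may differ by version):

    #include <pthread.h>
    #include "rpmalloc.h"

    static void *shared_block;

    static void *producer(void *arg) {
        (void)arg;
        rpmalloc_thread_initialize();  /* assumed per-thread init      */
        shared_block = rpmalloc(256);  /* block owned by this heap     */
        rpmalloc_thread_finalize();    /* thread ends; spans stay with
                                          the now-orphaned heap        */
        return NULL;
    }

    int main(void) {
        rpmalloc_initialize();
        pthread_t t;
        pthread_create(&t, NULL, producer, NULL);
        pthread_join(t, NULL);
        rpfree(shared_block);  /* cross-thread free: the block is marked
                                  free, but if no new thread is ever
                                  created to reuse the orphaned heap,
                                  its spans remain mapped and unused */
        rpmalloc_finalize();
        return 0;
    }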
If so, I'm going to make the bold claim that there are better ways to manage that memory than a general-purpose allocator.
I recently tried a very naive synthetic test to check when realloc would return a different address (meaning it had to copy). My system started swapping, but realloc never returned a different address. I didn't know realloc worked like that, but why wouldn't it?
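Roughly the kind of test meant here (a naive sketch, not the exact code that was run): keep doubling a buffer and report whether realloc moved it.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void) {
        size_t size = 4096;
        void *p = malloc(size);
        for (int i = 0; i < 16 && p; ++i) {
            size *= 2;
            void *q = realloc(p, size);
            if (!q)
                break;  /* grow failed; old block still valid */
            printf("%zu bytes: %s\n", size,
                   q == p ? "same address" : "moved");
            p = q;
        }
        free(p);
        return 0;
    }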
What faragon is referring to is that this assumption is incorrect: the virtual address changing does not imply the memory was copied. On Linux, for example, it could have been remapped with mremap(2).
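A minimal Linux-only sketch of that: growing an anonymous mapping with mremap(2) may return a new virtual address while the kernel simply remaps the pages rather than copying their contents.

    /* Linux-only: MREMAP_MAYMOVE lets the kernel relocate the mapping;
       a changed address means the pages were remapped, not copied. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t old_size = 1 << 20, new_size = 1 << 24;
        void *p = mmap(NULL, old_size, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return 1;
        void *q = mremap(p, old_size, new_size, MREMAP_MAYMOVE);
        if (q == MAP_FAILED)
            return 1;
        printf("address changed: %s\n", q != p ? "yes" : "no");
        munmap(q, new_size);
        return 0;
    }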