
Benchmarking C++ Allocators - jeffbee
https://docs.google.com/document/d/e/2PACX-1vTJmRADDPyybMjBxQ5r-PHEdHQWoOW-Wk87IVoT_EvFv9B5Ks3Mjuk8IXIDYPKFvWW6ezsl9PSZ1JbF/pub
======
pizlonator
My data says that perf of malloc/free in _actual programs_ (not fake ass shit
calling malloc in a loop) is never affected by the call overhead. Not by
dynamic call overhead, not even if you write a wrapper that uses a switch or
indirect call to select mallocs. Reason is simple: malloc perf is all about
data dependencies in the malloc/free path and how many cache misses it causes
or prevents. Adding indirection on the malloc call path is handled gracefully
by the CPU’s ability to extract ILP, and malloc is a perfect candidate since
its bottlenecks are cache misses inside the malloc/free paths.

~~~
s_kanev
This was the case a few years back when the fastest pools were implemented
with recursive data structures (e.g. linked lists for the freelists in
gperftools).

In the new tcmalloc (and, I think, hoard?) the fastest pools are essentially
slabs with bump allocation, so the fastest (and by far, the most common) calls
are a grand total of 15 or so instructions, without many cache misses (size
class lookups tend to stay in the cache). Call overhead can be a substantial
chunk of that.
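
To make that concrete, here is a minimal sketch -- illustrative names only,
not tcmalloc's actual code -- of why such a fast path stays so short: a
size-class lookup plus a freelist pop, with no locks and no syscalls.

    #include <cstddef>

    struct FreeList {
      void* head = nullptr;  // singly linked list threaded through free blocks
    };

    // Hypothetical per-thread state; real allocators hang this off TLS.
    thread_local FreeList size_classes[64];

    int SizeToClass(std::size_t n);  // assumed: table lookup, stays cached
    void* SlowPathRefill(int cls);   // assumed: fetches a fresh slab span

    inline void* FastAlloc(std::size_t n) {
      const int cls = SizeToClass(n);
      FreeList& fl = size_classes[cls];
      if (fl.head == nullptr) return SlowPathRefill(cls);  // rare slow path
      void* p = fl.head;
      fl.head = *static_cast<void**>(p);  // pop: one dependent load
      return p;
    }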

~~~
pizlonator
First of all, let’s be clear: for a lot of algorithms adding call overhead
around them is free because the cpu predicts the call, predicts the return,
and uses register renaming to handle arguments and return.

But that’s not the whole story. Malloc perf is about what happens on free and
what happens when the program accesses the memory that malloc gave it.

When you factor all that in, it doesn’t matter how many instructions the
malloc has. It matters whether those instructions form a bad dependency chain,
whether they miss cache, whether the memory we return is in the “best” place, and
how much work happens in free (using the same metrics - dependency chain
length and misses, not total number of instructions or whether there’s a
call).

~~~
jlebar
> First of all, let’s be clear: for a lot of algorithms adding call overhead
> around them is free because the cpu predicts the call, predicts the return,
> and uses register renaming to handle arguments and return.

This is only thinking at the hardware layer. There are also effects at the
software layer.

Function calls not known to LTO prevent all sorts of compiler optimizations as
well. In addition, as Jeff says on this thread, not being able to inline free
means you get no benefit from sized-delete, which is a substantial improvement
on some workloads. (I'd cite the exact number, but I'm not sure Google ever
released it.)
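
For reference, sized delete is the C++14 overload sketched below (the
forwarding body is just a placeholder): when the compiler can see the object's
size at the delete site, it calls this overload and the allocator is spared a
size lookup. An opaque, dynamically resolved free drops that information.

    #include <cstddef>
    #include <new>

    // C++14 sized deallocation: the compiler may call this instead of the
    // unsized form when it statically knows the allocation size.
    void operator delete(void* p, std::size_t size) noexcept {
      // A build-time-visible allocator could route straight to the right
      // size class here; forwarding to the unsized form is a placeholder.
      ::operator delete(p);
    }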

Source: I literally worked on this.

~~~
ahh
LTO I will grant you. I don't think you need to meaningfully inline the body
of delete to get sized-delete to work effectively. You do want to statically
link your sized operator delete body though, obviously. The last time I tried
this, inlining the actual body was difficult to make effective.

(parent: read username :))

~~~
jlebar
Hi, Andrew. :)

Agree, I shouldn't have been loose wrt the distinction between inlining and
LTO. I think I also ran that experiment with inlining sized-deletes and came
to the same (surprising, to me) conclusion.

------
favorited
A bit unrelated, but if anyone is interested, I recently watched Andrei
Alexandrescu's 2015 talk about a composable allocator design in C++ and it was
really interesting. And Andrei's presentation style was fantastic, per usual.

Note that this talk is about a std::allocator-like type, potentially _on top
of_ one of malloc/jemalloc/etc.

[https://www.youtube.com/watch?v=LIb3L4vKZ7U](https://www.youtube.com/watch?v=LIb3L4vKZ7U)

------
mehrdadn
> For C++ programs, replacing malloc and free at runtime is the worst choice.
> When the compiler can see the definition of new and delete at build time it
> can generate far better programs. When it can’t see them, it generates out-
> of-line function calls to malloc for every operator new, which is bananas.

I feel like whenever I've dealt with general-purpose memory allocation (as
opposed to special-purpose like from a stack buffer without freeing) this kind
of overhead has been dwarfed by the actual overhead of the allocation and
pointless to worry about. Is this not the case in others' experience?

~~~
jeffbee
There are a wide variety of outcomes available when the implementation of
global operator new is visible to the compiler. It may inline the entire new,
and having done so it may also apply any other optimization that would apply
to any other function. Without an allocator at build time, the only thing it
can do is generate a call to malloc. Leaving malloc resolution to runtime also
negates all benefits of sized delete.

~~~
the8472
My understanding is that compilers are allowed to treat malloc specially, i.e.
they can assume that it returns non-aliasing memory, does not mutate other
memory and can be elided if its result is unused.
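
As a quick illustration, with optimization enabled both GCC and Clang compile
the function below to just `return 0`:

    #include <cstdlib>

    // The allocation is unused and malloc has no other visible effects, so
    // the optimizer deletes the malloc/free pair entirely.
    int no_op() {
      void* p = std::malloc(64);
      std::free(p);
      return 0;
    }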

~~~
mehrdadn
I'm not aware of that being a blanket assumption they can make, is it? (Is it
in the C++ standard?) That kind of thing is generally dictated by compiler-
specific annotations (noalias etc.) which you can put on any function, and
which shouldn't apply unless you write them explicitly...

~~~
gpderetta
on gcc you can use __attribute__((malloc)) to tell the compiler that your
function is malloc-like (i.e. its result type won't alias any other pointer),
but any call to the litterall 'malloc' function itself is implicitly treated
as such even if you do not include the appropriate header and declare it
yourself (you can use -fno-builtin-functions or -free-standing or something
like that to disable this treatment).
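
In code, roughly (my_pool_alloc is a made-up name):

    #include <cstddef>

    // Tells GCC/Clang that the returned pointer does not alias any other
    // live pointer, granting a custom allocator the same aliasing
    // assumptions the builtin malloc gets automatically.
    __attribute__((malloc)) void* my_pool_alloc(std::size_t n);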

------
AshamedCaptain
> For C++ programs, replacing malloc and free at runtime is the worst choice.
> When the compiler can see the definition of new and delete at build time it
> can generate far better programs. When it can’t see them, it generates out-
> of-line function calls to malloc for every operator new, which is bananas.

This argument smells like bullshit, sorry, and it's totally not justified in
the article. Giving just numbers without even explaining a potential causality
points to benchmark setup failure more than anything.

Unless you are talking about statically linking your malloc implementation,
your entire STL, and all dependent libraries, AND then performing LTO on it --
and even then I'd be surprised if there's any noticeable improvement.

Not to mention I don't know of _any_ real-life executable that does this...

~~~
banachtarski
> This argument smells like bullshit, sorry, and it's totally not justified in
> the article. Giving just numbers without even explaining a potential
> causality points to benchmark setup failure more than anything.

Why? This argument sounds perfectly reasonable and is true in many other
contexts as well.

------
nteon
Can you compare `tcmalloc` to e.g. tcmalloc dynamically loaded, or jemalloc
specified with Bazel's malloc option? It is unclear from the post whether the
wins come from Google's internal improvements to tcmalloc post-gperftools, or
from the reduced overhead of bypassing the PLT.

------
gok
The big problem with malloc/new/free/delete is that the interface is way too
flexible. Allowing objects to be allocated on one thread and freed on another
fundamentally requires a bunch of complicated bookkeeping. Particularly in C++
it's usually more productive to change your allocation strategy entirely if
malloc/free is really a hot spot in your code.

~~~
saagarjha
Eh, on the fast path many allocators will just pull the block from what’s
essentially a bump allocator on the allocating thread and then stick the block
in the freeing thread’s tcache when it’s done with it, doing rebalancing once
in a while when necessary. It’s not too horrible.
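
Roughly this shape (an illustrative sketch with invented names; real
allocators are considerably more careful):

    #include <cstddef>

    struct ThreadCache {
      char* bump;        // bump pointer into a slab owned by this thread
      char* limit;       // end of the slab
      void* local_free;  // blocks freed on this thread, pending rebalance
    };
    thread_local ThreadCache cache;

    void* RefillSlab(std::size_t n);      // assumed slow path
    void MaybeRebalance(ThreadCache& c);  // assumed periodic drain

    void* Alloc(std::size_t n) {
      if (cache.bump + n > cache.limit) return RefillSlab(n);
      void* p = cache.bump;
      cache.bump += n;
      return p;
    }

    void Free(void* p) {
      // Push onto the freeing thread's cache regardless of which thread
      // allocated p; rebalancing reconciles ownership once in a while.
      *static_cast<void**>(p) = cache.local_free;
      cache.local_free = p;
      MaybeRebalance(cache);
    }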

------
moonchild
Note that there are other considerations besides absolute allocation speed,
like memory fragmentation, which isn't affected by inlining. In the ithare
post they link, jemalloc thrashes all other allocators[1] on that front. It
would be interesting to see if the newer version of tcmalloc is improved at
all in that respect.

1: [http://ithare.com/wp-content/uploads/malloc-overhead.png](http://ithare.com/wp-content/uploads/malloc-overhead.png)

~~~
jeffbee
In this test jemalloc actually uses the most memory of all. With default
tunings it uses 700k pages at peak, compared to 170k for tcmalloc and 131k for
glibc. I have no idea if that is relevant in real life and I have never cared,
so I didn't mention it.

~~~
riking
That's due to differences in your workload - in e.g. Ruby on Rails, the total
uncertainty of which objects are going to survive a request means that
jemalloc does a much better job at packing.

------
benibela
Something that could help a lot with performance would be functions to handle
memory for many small blocks at once.

Like a malloc extension function that allocates many small blocks at once.
Rather than returning one block of size n, it would return k blocks of size n.

Of course the user could do something similar by allocating one big block of
size k · n ordinarily. But then the user needs to keep track of which small
blocks belong to which big block, and the big block might stay too big when
some small blocks are not needed anymore. Say the function needs k small
blocks temporarily for processing, but only returns k/10 of them. Then 90% of
the memory would be wasted.

But if you could allocate k small blocks at once, the allocator could allocate
one big array of k·(n + sizeof metadata). Then each of the small blocks could
be freed like a normally allocated block of size n, but they would all have
the advantages of quick allocation and cache locality of a big array.

Or, alternatively, there could be a partial free that does not free the entire
malloced block, but only a part of it. Then you could allocate a block of size
k·n and, when most of it is no longer needed, free the 90% of the block that
you do not need but keep the 10% that you do.
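
A hypothetical signature for the batched version might look like this
(malloc_many is an invented name; no standard allocator exposes this
interface):

    #include <cstddef>

    // Invented interface, for illustration only: fills out[0..k-1] with k
    // blocks of size n, each of which may later be passed to free() on its
    // own. Returns the number of blocks actually allocated.
    std::size_t malloc_many(std::size_t n, std::size_t k, void** out);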

~~~
munchbunny
Object pools are a pretty common pattern for that kind of usage. Is this any
different, or are you just asking to push the "allocation" into the memory
allocator instead?

~~~
benibela
The difference would be that each object can be freed individually when it is
no longer needed.

Especially when the code that creates the objects and the code that
uses/discards the objects are in two different projects.

Like a library that can load a JSON file as some kind of tree structure. Then
someone uses the library, loads a file of a million objects, but then only
needs a single object. All the other objects should be freed, but if they are
in a pool, they cannot be. The library does not know which object the user
wants, and a user of the library should not need to know how the library
allocates the objects.

~~~
munchbunny
I needed some time to think about your point. :)

I'm not sure I understand the distinction. Why couldn't the parser maintain
that pool? If the two different projects (which I assume translates into two
different object files or DLLs) belong in the same process, it's pretty
straightforward for the parser to keep a singleton pool somewhere. If the two
projects are in different processes, then it's probably better in terms of
security to disallow sharing of that memory - and virtual memory spaces should
abstract the space efficiency issues well enough for those purposes.

~~~
benibela
The problem is that the pool becomes too big and wastes memory.

It is fine when the user keeps opening files and the parser can reuse the
objects.

But eventually, the user has opened all the files he needs. Then the process
will never open another file again, and none of the objects can be reused. But
they also cannot be freed while some objects are still in use, e.g. when the
user did not close all the files.

------
The_rationalist
They should include mimalloc. It is gaining momentum; for example, it is the
default allocator for Kotlin/Native.

~~~
jeffbee
Picking the best allocator was not my goal. I was writing about how to
correctly use any allocator. I agree that mimalloc is worth evaluating. Note
that Microsoft says this in their docs, which agrees with what I am driving
at:

""" For best performance in C++ programs, it is also recommended to override
the global new and delete operators. For convience, mimalloc provides
mimalloc-new-delete.h which does this for you -- just include it in a
single(!) source file in your project. """
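
In code, the override the docs describe amounts to this (the header ships
with mimalloc):

    // In exactly one source file of the project, per the mimalloc docs:
    #include <mimalloc-new-delete.h>  // defines global operator new/delete

    int main() {
      int* p = new int(42);  // now served by mimalloc
      delete p;
    }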

------
inquirrer
How does tcmalloc optimize away new/deletes at compile time exactly?

~~~
scott_s
It doesn't optimize them away entirely, but it can eliminate the actual
function call through inlining. User code can implement its own new and
delete. If those implementations are available during compilation, then they
are candidates for inlining just like any other function.
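
For instance, a sketch like this (my_alloc/my_free stand in for a statically
linked allocator; std::bad_alloc handling is omitted for brevity):

    #include <cstddef>
    #include <new>

    void* my_alloc(std::size_t n);         // assumed statically linked
    void my_free(void* p, std::size_t n);  // assumed statically linked

    // Visible at build time, these definitions become inlining candidates
    // like any other function.
    void* operator new(std::size_t n) { return my_alloc(n); }
    void operator delete(void* p) noexcept { my_free(p, 0); }
    void operator delete(void* p, std::size_t n) noexcept { my_free(p, n); }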

------
ape4
In my experience an application-specific pool works the best. Let's say a part
of your program does a bunch of malloc()s then frees... put it into a pool,
then delete the pool when it's done.

~~~
haberman
We often call this "arena allocation". Wikipedia calls it "Region-based memory
management":
[https://en.wikipedia.org/wiki/Region-based_memory_management](https://en.wikipedia.org/wiki/Region-based_memory_management)
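
A minimal sketch of the idea (alignment and chunk chaining omitted):

    #include <cstddef>
    #include <cstdlib>

    // Toy arena: individual frees are no-ops; the whole region is released
    // at once when the arena is destroyed.
    class Arena {
      char* buf_;
      std::size_t cap_;
      std::size_t used_ = 0;
     public:
      explicit Arena(std::size_t cap)
          : buf_(static_cast<char*>(std::malloc(cap))), cap_(cap) {}
      ~Arena() { std::free(buf_); }  // one free for all allocations
      void* alloc(std::size_t n) {
        if (used_ + n > cap_) return nullptr;  // real arenas chain chunks
        void* p = buf_ + used_;
        used_ += n;
        return p;
      }
    };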

------
mitchs
When I backtrace from my malloc wrapper, injected with LD_PRELOAD, I see
malloc, std::new, then the call site that involved new. Unless std::new is
doing some funky shit (like detecting dlsym("malloc") isn't glibc's malloc,
and changing implementations) I have my doubts about the claim that dynamic
replacement is costly. It appears new is already dynamically linking malloc.

~~~
jeffbee
That is the entire point of the post. If you leave the resolution of malloc to
run time, the best you can hope for is that new calls malloc, out of line. If
you build with an implementation of new, the outcomes may be dramatically
better.

By the way there is no such thing as std::new. We are discussing ::new.

~~~
mitchs
Heh, std::new was some cruft I got out of backtrace(3) with some wacky
embedded tool chain.

Are these gains supposed to be from inlining the top level function of malloc
into new? Compared to the costs of what can go on inside malloc... Is that the
biggest problem?

~~~
ckennelly
Dynamic linking requires calling through the PLT to get to the implementation,
so there's a data dependency on determining where the code for it is.

Independent of inlining (which requires LTO, since the C++ language rules
inhibit optimizing out "operator new"), the static call is far simpler.

------
CppCoder
Let’s not forget a benchmark is not a typical application. Just because it
improves without dynamic linking does not mean it would in every application.
But I definitely would not dynamically link a custom allocator. For me, the
draw is the improved memory usage and performance, which I do not want to risk
losing.

------
quotemstr
The C++ allocator interface really needs to support realloc. Vector could then
avoid copying elements in many cases, particularly where the move constructor
is trivial --- e.g., vector<int>.
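
A sketch of what such an extension could look like (invented members --
std::allocator has nothing like this today):

    #include <cstddef>

    template <class T>
    struct reallocating_allocator {
      T* allocate(std::size_t n);
      void deallocate(T* p, std::size_t n);
      // Hypothetical: try to grow p from old_n to new_n elements in place;
      // returns false when the underlying allocator cannot, and the caller
      // falls back to allocate + move + deallocate.
      bool try_expand(T* p, std::size_t old_n, std::size_t new_n);
    };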

~~~
leni536
It's a shame we don't have "destructive move"/relocation yet. Many more
objects are trivially relocatable than trivially movable.

------
the8472
The post has no date and does not mention the versions used or allocator
tunables. Considering that this is a shifting landscape, that makes it
difficult to judge how relevant those results are today.

~~~
jeffbee
True. I added a date at the top. I wrote it this morning with head revisions
of both jemalloc and tcmalloc.

------
balls187
Thank you for sharing.

Would you consider linking your source code, including make files?

