
Testing Memory Allocators: ptmalloc2 vs. tcmalloc vs. hoard vs. jemalloc - ingve
http://ithare.com/testing-memory-allocators-ptmalloc2-tcmalloc-hoard-jemalloc-while-trying-to-simulate-real-world-loads/
======
davidtgoldblatt
(As background, I'm a jemalloc developer; reposting my twitter comment on the
same article): This is quite a bad way of doing a malloc benchmark -- getting
realistic activity patterns is critical (see e.g. Wilson et al.'s survey). It
doesn't meaningfully test inter-thread interactions, and randomizes in a way
that hurts the effectiveness of thread-local caching.

For the large majority of server workloads on Linux, jemalloc or tcmalloc is
probably the right choice of allocator. Trying these out (and spending a few
additional test runs tuning their configuration) will often yield significant
wins compared to the glibc allocator.

~~~
MichaelMoser123
if your application needs to stay up for long periods, then you will also
care about memory fragmentation - as far as I know most benchmarks do not
focus on such problems. You still must test it under real live conditions! No
way around that.

~~~
mdcallag
jemalloc and tcmalloc have been much better than glibc for me, especially when
it comes to avoiding fragmentation with some server workloads:

[http://smalldatum.blogspot.com/2017/11/concurrent-large-allocations-glibc.html](http://smalldatum.blogspot.com/2017/11/concurrent-large-allocations-glibc.html)

[http://smalldatum.blogspot.com/2018/04/myrocks-malloc-and-fragmentation-strong.html](http://smalldatum.blogspot.com/2018/04/myrocks-malloc-and-fragmentation-strong.html)

[http://smalldatum.blogspot.com/2015/10/myrocks-versus-allocators-glibc.html](http://smalldatum.blogspot.com/2015/10/myrocks-versus-allocators-glibc.html)

------
senozhatsky
The actual allocation happens in memset() - because that is where the
application page-faults and jumps into the kernel. The kernel populates the
application's address space and returns to user land. This can surely add a
lot of noise to the test results. free() is also not very simple anymore,
considering all the madvise() magic it does.

It probably would be interesting to see how many page faults the application
generates with tcmalloc/jemalloc/glibc/etc allocators.

-ss

------
sebcat
For more jemalloc information/context/history/trivia, I would recommend
reading the jemalloc paper.

"A Scalable Concurrent malloc(3) Implementation for FreeBSD", Jason Evans,
April 16, 2006

[https://www.bsdcan.org/2006/papers/jemalloc.pdf](https://www.bsdcan.org/2006/papers/jemalloc.pdf)

------
JoshTriplett
This benchmark tested glibc 2.24. glibc 2.26 introduced a newer, much more
scalable implementation of malloc, with a per-thread cache. I'd love to see
how that competes with jemalloc and the other allocators.

------
archi42
If you want to consider more allocators: we use tbbmalloc from the Intel
Threading Building Blocks for our Linux and Windows builds. I'm not sure if
the dev in charge benchmarked other allocators, but the results are good (we
spend less time in the allocator, and the program runs faster overall). Our
workload varies between 2 and 8 GB of RAM on average, with <15 minutes of
runtime for internal, reduced tests (outliers or customer runs can reach
>100 GB and hours or days of runtime; those improved too).

~~~
rurban
ptmalloc3 would also be better with threads. It should even be the glibc
default.

------
pcstl
Seems like Rust has made a solid choice in making jemalloc their default
allocator.

~~~
robin_reala
I’d assumed it was the Mozilla legacy - Gecko uses jemalloc, and there’ll be a
lot of institutional knowledge around tuning it.

~~~
sanxiyn
This is correct.

------
bdarnell
Performance isn't the only thing to consider. For example, tcmalloc and
jemalloc both have good profiling/debugging tools, and these are the biggest
reasons why I choose one of them for any large C/C++ project. I've also found
that jemalloc is easier to integrate into complex build systems than tcmalloc,
so jemalloc is my first choice in most cases.

------
berti
It would help to generalise the conclusions if a few different types of
workloads were tested. Even something as simple as an application with a
heavy skew toward lots of small allocations, and the same for large
allocations.

~~~
kev009
I think this is a really hard thing to generalize: the broader the attempt,
the more likely it is to make mistakes, and casual readers would overlook a
lot of context-sensitive information (OS, OS version, CPU ISA and particular
uarch, configuration of policy things like superpages, NUMA, the scheduler)
that limits applicability outside of the benchmark and the run itself. I like
the article's conclusion and call to action.

~~~
berti
Agreed. The only thing I think you can take from the benchmarks given is that
for 99% of cases it's not going to be worth the effort to do your own
benchmarks and choose a different allocator; the wins simply aren't going to
be big enough.

------
pletnes
What happens if you replace the glibc malloc used in implementations of higher
level languages like python/ruby/js? Aren’t these languages big on allocating
small objects very often?

~~~
inopinatus
Large Rails apps can see significant improvements in memory use efficiency
with jemalloc, e.g. as in
[https://www.levups.com/en/blog/2017/optimize_ruby_memory_usage_jemalloc_heroku_scalingo.html](https://www.levups.com/en/blog/2017/optimize_ruby_memory_usage_jemalloc_heroku_scalingo.html)

However I’ve also seen an improvement from simply setting MALLOC_ARENA_MAX=2.

------
ebikelaw
tcmalloc has a lot of parameters, and the defaults are not suited to any
workload I've ever seen, so I'd be interested to know if these knobs were
turned for this benchmark. Things like total thread cache size make a huge
difference. Also note that the tcmalloc being tested here, from the Debian
package, lacks the fast-path improvements released in gperftools 2.6, which
represented a roll-up of years of internal Google improvements to tcmalloc.

~~~
nobugs
Good point, but tuning allocators away from their defaults is a separate
task - one better done by fans of the respective allocator. If somebody is
able to tune their-favorite-allocator so that the results with
this-publicly-available-test improve, we'll be happy to publish the results
with such tuning :-).

------
mabynogy
Try a bitmap allocator. Each word of memory has a free/used bit attached in a
separate map. That allows quick resizing before or after a block without
moving it. It can also reduce fragmentation (find the best-fitting block for
a given size in the map), and it can be optimized with bit-counting
intrinsics (like popcntl()).

~~~
JoshTriplett
That works well if you allocate many blocks of the same size, or you have some
other data structure that tracks which blocks belong to a specific object
(such as in a filesystem). For a malloc/free implementation, however, you'd
need some way to track the size of the allocation to free it later.

~~~
mabynogy
No. The purpose is the opposite: a map describing the memory instead of a
list of blocks. The granularity can be 1 bit describing 1 byte (or 2/4/8...).

Recent x64 instructions that count bits from the left or right can help a lot
in making that fast.

~~~
gliptic
1 bit per byte would require an awful lot of memory access to find a free
block. You would need something like a hierarchy of bitmaps to cut down on
scanning.

~~~
mabynogy
No, because you'll end up with the same problem elsewhere (handling blocks).
If I'm correct, tcmalloc's memory overhead is around 4%. More overhead can be
acceptable if the memory is less fragmented (less fragmentation, fewer cache
misses, better compactness and better performance).

~~~
gliptic
Well, you're welcome to try, but it's not like free space bitmaps are unknown
to allocator writers.

------
bonzini
glibc malloc has been optimized a lot over the last couple of years, mostly
to improve multithreaded performance. The work was done by DJ Delorie (of
DJGPP fame).

------
olliej
It would be nice to see bmalloc in here, but I guess it’s not used anywhere
outside of WebKit so maybe it doesn’t matter :-/

------
dingo_bat
I was trying to implement my own malloc and free, but I couldn't figure out
how to validate the correctness of my implementation. Does anybody know of any
test suite for malloc/free?

Edit: I should have read TFA.

~~~
easytiger
LD_PRELOAD?

~~~
dullgiulio
Nah, I guess the parent question is more: what sort of allocation/freeing
patterns is worth measuring?

------
smallstepforman
Before the majority of operating systems adopted Bonwick's slab allocator,
custom allocators made sense. These days, it's best to let the OS worry about
the allocations, since it knows about available storage devices (e.g. NVM),
cache levels and threads, power management, etc, and it can return
overallocated (yet unused) memory which custom allocators cannot.

~~~
vidarh
On POSIX systems at least, the API provided to userspace to allocate memory is
too low level to remove the need for a higher level allocator, so you will be
using a separate allocator.

> and it can return overallocated (yet unused) memory which custom allocators
> cannot.

These allocators allocate memory out of buffers they've gotten from the
kernel, so overallocation is very much possible depending on your kernel
config.

~~~
kev009
Yep. This two-layer approach is also desirable since in userspace you aren't
context switching into the kernel all the time, and you can make some
assumptions to avoid certain locking and external fragmentation with the
kmem. Short-lived allocs are really cheap coming out of, e.g., jemalloc
arenas, and that is important for a lot of programming ergonomics.

