Testing Memory Allocators: ptmalloc2 vs. tcmalloc vs. hoard vs. jemalloc (ithare.com)
121 points by ingve on July 4, 2018 | 42 comments

(As background, I'm a jemalloc developer; reposting my twitter comment on the same article): This is quite a bad way of doing a malloc benchmark -- getting realistic activity patterns is critical (see e.g. Wilson et al.'s survey). It doesn't meaningfully test inter-thread interactions, and randomizes in a way that hurts the effectiveness of thread-local caching.

For the large majority of server workloads on Linux, jemalloc or tcmalloc is probably the right choice of allocator. Trying these out (and spending a few additional test runs tuning their configuration) will often yield significant wins compared to the glibc allocator.

If your application needs long uptimes, you will also care about memory fragmentation; as far as I know, most benchmarks do not focus on such problems. You still must test it under real live conditions - no way around that.

jemalloc and tcmalloc have been much better than glibc for me, especially when it comes to avoiding fragmentation with some server workloads http://smalldatum.blogspot.com/2017/11/concurrent-large-allo... http://smalldatum.blogspot.com/2018/04/myrocks-malloc-and-fr... http://smalldatum.blogspot.com/2015/10/myrocks-versus-alloca...

So true. Inside Google, it used to be the case that every N months someone had to figure out how to build their Python daemon with Blaze (Bazel) using tcmalloc. It was painful, and I hope it's easier nowadays. Without it, after a few weeks you'd see memory leaks whose actual root cause was fragmentation. I can't remember how they interacted with the allocator in Python, but I always suspected that heavy use of protocol buffers made the problem a lot more acute.

The actual allocation happens in memset() - because this is where the application page faults and jumps into the kernel. The kernel populates the application's address space and returns to user land. This surely can add a lot of noise to the test results. free() is also not very simple anymore, considering all the madvise() magic it does.

It probably would be interesting to see how many page faults the application generates with tcmalloc/jemalloc/glibc/etc allocators.


For more jemalloc information/context/history/trivia, I would recommend reading the jemalloc paper.

"A Scalable Concurrent malloc(3) Implementation for FreeBSD", Jason Evans, April 16, 2006


This benchmark tested glibc 2.24. glibc 2.26 introduced a newer, much more scalable implementation of malloc, with a per-thread cache. I'd love to see how that competes with jemalloc and the other allocators.
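If you want to isolate the effect of the new per-thread cache (tcache), glibc exposes it as a tunable; a hypothetical comparison run (the ./bench binary name is a placeholder, not the article's harness) might look like:

```shell
# Default: per-thread cache enabled (glibc >= 2.26)
./bench

# Same run with the per-thread cache effectively disabled
GLIBC_TUNABLES=glibc.malloc.tcache_count=0 ./bench
```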

Seems like Rust has made a solid choice in making jemalloc their default allocator.

I’d assumed it was the Mozilla legacy - Gecko uses jemalloc, and there’ll be a lot of institutional knowledge around tuning it.

This is correct.

If you want to consider more allocators: We use tbbmalloc from the Intel Threading Building Blocks for our Linux and Windows builds. I'm not sure if the dev in charge benchmarked other allocators, but results are good (we spend less time in the allocator, and the program runs faster overall). Our workload varies between 2 and 8 GB RAM on average with <15-minute runtime for internal, reduced tests (outliers or customer runs can reach >100 GB and hours or days of runtime; those improved too).

ptmalloc3 would also be better with threads. It should even be the glibc default.

Performance isn't the only thing to consider. For example, tcmalloc and jemalloc both have good profiling/debugging tools, and these are the biggest reasons why I choose one of them for any large C/C++ project. I've also found that jemalloc is easier to integrate into complex build systems than tcmalloc, so jemalloc is my first choice in most cases.
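For illustration, the profiling hooks mentioned above look roughly like this (application and library paths are placeholders; jemalloc must be built with --enable-prof):

```shell
# jemalloc heap profiling, inspected with jeprof
MALLOC_CONF=prof:true,prof_prefix:jeprof.out ./myapp
jeprof --text ./myapp jeprof.out.*.heap

# tcmalloc (gperftools) heap profiling, inspected with pprof
HEAPPROFILE=/tmp/myapp.hprof LD_PRELOAD=/usr/lib/libtcmalloc.so ./myapp
pprof --text ./myapp /tmp/myapp.hprof.0001.heap
```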

It would help to generalise the conclusions if a few different types of workloads were tested - even something as simple as an application with a heavy skew toward lots of small allocations, and the same for large allocations.

I think this is a really hard thing to generalize; the broader the attempt, the more likely it is to make mistakes, and casual readers would overlook a lot of context-sensitive information (OS, OS version, CPU ISA and particular uarch, configuration of policy things like superpages, NUMA, the scheduler) that limits applicability outside of the benchmark and run itself. I like the article's conclusion and call to action.

Agreed. The only thing I think you can take from the benchmarks given is that for 99% of cases it's not going to be worth the effort to do your own benchmarks and choose a different allocator; the wins simply aren't going to be big enough.

What happens if you replace the glibc malloc used in implementations of higher level languages like python/ruby/js? Aren’t these languages big on allocating small objects very often?

Large rails apps can see significant improvement in memory use efficiency with jemalloc e.g. as in https://www.levups.com/en/blog/2017/optimize_ruby_memory_usa...

However I’ve also seen an improvement from simply setting MALLOC_ARENA_MAX=2.
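For context, both tweaks can be tried without rebuilding Ruby; a sketch (library path and server command are distro-specific placeholders):

```shell
# Swap in jemalloc via the dynamic loader
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 bundle exec puma

# Or just cap glibc's per-thread arenas
MALLOC_ARENA_MAX=2 bundle exec puma
```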

There has been lots of discussion lately about Ruby and malloc. jemalloc gives the best results for server performance and memory usage [1], but is slightly slower in low-memory-usage scenarios, i.e. non-server usage, which is popular in Japan. jemalloc also sometimes (not sure if this is still the case) doesn't work well with musl libc, used in Alpine Linux.

The Ruby core team has decided against [2] shipping jemalloc by default, but will use it when it is available, which ultimately led to some work on SleepyGC [3] and the Transient heap [4].

[1] https://www.mikeperham.com/2018/04/25/taming-rails-memory-bl...

[2] https://bugs.ruby-lang.org/issues/14718

[3] https://bugs.ruby-lang.org/issues/14723

[4] https://bugs.ruby-lang.org/issues/14858

It works fine. jemalloc is the system malloc on FreeBSD.

Python's allocator makes use of the fact the GIL must be held to avoid any locking in the normal case for objects under 512 bytes. The introduction of pymalloc was a huge perf win at the time. For larger allocations, it's not so clear if a drop-in malloc would be such a win: many allocators just fall back to mmap() for big allocations and so I imagine their behaviour would be quite similar.

tcmalloc has a lot of parameters, and the defaults are not suited to any workload I've ever seen, so I'd be interested to know if these knobs were turned for this benchmark. Things like total thread cache size make a huge difference. Also note that the tcmalloc being tested here, from the Debian package, lacks the fast-path improvements released in gperftools 2.6, which represented a roll-up of years of internal Google improvements to tcmalloc.
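For reference, the thread-cache knob mentioned above is set via the environment in gperftools (library path is a placeholder):

```shell
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=$((256*1024*1024)) \
LD_PRELOAD=/usr/lib/libtcmalloc.so ./bench
```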

Good point, but adjusting allocators away from their defaults is a separate task - one better done by fans of the respective allocator. If somebody is able to tune their-favorite-allocator so that results with this-publicly-available-test become better, we'll be happy to publish results with such tuning :-).

Try a bitmap allocator: each word of memory has a free/used bit attached in a separate map. It allows quick resizing before or after a block without moving it. It can also reduce fragmentation (find the best-fitting block for a given size in the map). It can be optimized with bit-counting intrinsics (like popcntl()).

That works well if you allocate many blocks of the same size, or you have some other data structure that tracks which blocks belong to a specific object (such as in a filesystem). For a malloc/free implementation, however, you'd need some way to track the size of the allocation to free it later.

No. The purpose is the opposite. A map to describe the memory instead of a list of blocks. The granularity can be 1 bit to describe 1 byte (or 2/4/8...).

Recent x64 instructions to count bits from the left or right (lzcnt/tzcnt) can help a lot to make that fast.

1 bit per byte would require an awful lot of memory access to find a free block. You would need something like a hierarchy of bitmaps to cut down on scanning.

No, because you'll end up with the same problem elsewhere (handling blocks). If I'm correct, tcmalloc's memory overhead is around 4%. More overhead can be acceptable if the memory is less fragmented (less fragmentation means fewer cache misses, better compactness, and better performance).

Well, you're welcome to try, but it's not like free space bitmaps are unknown to allocator writers.

How would you implement the free function? You need some way to track how much memory to free.

By allocating from a different block for different sizes. This means the bitmap only needs one bit per element, not per byte, and it reduces fragmentation. You can keep the number of blocks small by limiting the allocation sizes to a number of pre-defined buckets.

From what I remember jemalloc uses bitmaps for small allocations (below the 4 KiB page size) but switches to different data structures for medium and large allocations.

That makes sense, yes. If you're allocating blocks of uniform size, a bitmap works just fine.

glibc malloc has been optimized a lot during the last couple years, mostly to improve multithreaded performance. The work was done by DJ Delorie (of DJGPP fame).

It would be nice to see bmalloc in here, but I guess it’s not used anywhere outside of WebKit so maybe it doesn’t matter :-/

I was trying to implement my own malloc and free, but I couldn't figure out how to validate the correctness of my implementation. Does anybody know of any test suite for malloc/free?

Edit: I should have read TFA.


Nah, I guess the parent question is more: what sort of allocation/freeing patterns is worth measuring?

Before the majority of operating systems adopted Bonwick's slab allocator, custom allocators made sense. These days, it's best to let the OS worry about the allocations, since it knows about available storage devices (e.g. NVM), cache levels and threads, power management, etc., and it can return overallocated (yet unused) memory, which custom allocators cannot.

FWIW: all allocators tested are user-space allocators - and moreover, to achieve this kind of performance they MUST be user-space (as a user-kernel switch with all the checks is waaaay too expensive). And as soon as we're user-space - it IS possible to compete with whatever-allocator-is-provided-by-your-user-space-lib (not that it makes sense for 99.9% of the projects, but it MIGHT be possible to write an allocator which is better-than-<whatever-allocator-we-pick>). In addition, it IS possible to write allocators MUCH-better-suited-for-specific-apps (see, for example, the talk on Arena Allocators at CppCon 2017 by John Lakos from Bloomberg).

On POSIX systems at least, the API provided to userspace to allocate memory is too low level to remove the need for a higher level allocator, so you will be using a separate allocator.

> and it can return overallocated (yet unused) memory which custom allocators cannot.

These allocators allocate memory out of buffers they've gotten from the kernel, so overallocation is very much possible depending on your kernel config.

Yep, this two-layer approach is also desirable since in userspace you aren't context switching to the kernel all the time, and you can make some assumptions to avoid certain locking and external fragmentation in the kernel memory allocator. Short-lived allocs are really cheap coming out of e.g. jemalloc arenas, and that is important for a lot of programming ergonomics.

OS allocators allocate pages, userspace allocators use the OS allocator to allocate bytes.
