For the large majority of server workloads on Linux, jemalloc or tcmalloc is probably the right choice of allocator. Trying these out (and spending a few additional test runs tuning their configuration) will often yield significant wins compared to the glibc allocator.
It would probably be interesting to see how many page faults the application generates with tcmalloc, jemalloc, glibc, etc.
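One rough way to get those numbers is to run the workload as a child process and read the fault counters the kernel keeps for it. This is a minimal sketch, not a benchmark harness: `count_page_faults` is a name made up for illustration, and the `LD_PRELOAD` path in the comment is a placeholder you would have to replace with wherever the allocator library lives on your system.

```python
import os
import resource
import subprocess
import sys


def count_page_faults(cmd, extra_env=None):
    """Run cmd to completion and return (minor, major) page faults it incurred.

    RUSAGE_CHILDREN only accounts for children that have been waited for,
    which subprocess.run() does before returning.
    """
    env = dict(os.environ)
    env.update(extra_env or {})
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(cmd, env=env, check=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    return (after.ru_minflt - before.ru_minflt,
            after.ru_majflt - before.ru_majflt)


if __name__ == "__main__":
    # Stand-in workload: a child interpreter that touches ~50 MB.
    workload = [sys.executable, "-c", "x = bytearray(50_000_000)"]
    print("default allocator:", count_page_faults(workload))
    # To compare allocators, pass something like (placeholder path):
    #   count_page_faults(workload, {"LD_PRELOAD": "/path/to/libjemalloc.so"})
```

The counts include process startup, so the difference between runs is more meaningful than the absolute values.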
"A Scalable Concurrent malloc(3) Implementation for FreeBSD", Jason Evans, April 16, 2006
However, I’ve also seen an improvement from simply setting MALLOC_ARENA_MAX=2.
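If you want to try that setting on a single process rather than exporting it globally, something like the following works. It is only a sketch: `run_with_arena_cap` is a hypothetical helper name, and the variable only matters when the child actually uses glibc's malloc, which reads it at startup.

```python
import os
import subprocess
import sys


def run_with_arena_cap(cmd, max_arenas=2):
    """Run cmd with MALLOC_ARENA_MAX set in the child's environment only."""
    env = dict(os.environ, MALLOC_ARENA_MAX=str(max_arenas))
    return subprocess.run(cmd, env=env, check=True,
                          capture_output=True, text=True)


if __name__ == "__main__":
    # The child just echoes the variable to show that it was passed through.
    child = [sys.executable, "-c",
             "import os; print(os.environ['MALLOC_ARENA_MAX'])"]
    print(run_with_arena_cap(child).stdout.strip())
```

The parent's environment is left untouched, so it is easy to run the same command with and without the cap and compare resident memory.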
The Ruby core team has decided against shipping jemalloc by default, but will use it when it is available. That decision ultimately led to some work on SleepyGC and Transient heap.
Recent x64 instructions that count leading or trailing zero bits (LZCNT and TZCNT) can help a lot to make that fast.
From what I remember, jemalloc uses bitmaps for small allocations (below the 4 KiB page size) but switches to different data structures for medium and large allocations.
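To make the bitmap idea concrete, here is a toy sketch of a slab of fixed-size slots tracked by one bit each. It is not jemalloc's actual code or layout; `BitmapSlab` and its methods are invented for illustration. The "find the lowest set bit" step in `alloc` is the part that a trailing-zero-count instruction does in a single operation in a native implementation.

```python
class BitmapSlab:
    """Fixed-size slots tracked by one bit each; a set bit means 'free'."""

    def __init__(self, slot_count):
        self.slot_count = slot_count
        self.free_bits = (1 << slot_count) - 1  # every slot starts out free

    def alloc(self):
        """Return the lowest free slot index, or None if the slab is full."""
        if self.free_bits == 0:
            return None
        lowest = self.free_bits & -self.free_bits  # isolate lowest set bit
        index = lowest.bit_length() - 1            # number of trailing zeros
        self.free_bits &= ~lowest                  # mark the slot as used
        return index

    def free(self, index):
        """Mark a previously allocated slot as free again."""
        if not 0 <= index < self.slot_count:
            raise ValueError("slot index out of range")
        bit = 1 << index
        if self.free_bits & bit:
            raise ValueError("slot is already free")
        self.free_bits |= bit


if __name__ == "__main__":
    slab = BitmapSlab(8)
    first, second = slab.alloc(), slab.alloc()
    slab.free(first)
    print(first, second, slab.alloc())
```

Because freed slots are handed out lowest-index first, allocations tend to cluster at the start of the slab, which is one reason this layout suits small objects.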
Edit: I should have read TFA.
> and it can return overallocated (yet unused) memory which custom allocators cannot.
These allocators allocate memory out of buffers they've gotten from the kernel, so overallocation is very much possible depending on your kernel config.