
SuperMalloc: A Super Fast Multithreaded Malloc for 64-bit Machines - ingve
http://conf.researchr.org/event/ismm-2015/ismm-2015-papers-supermalloc-a-super-fast-multithreaded-malloc-for-64-bit-machines
======
emeryberger
For what it's worth, in my tests, Hoard
([http://www.hoard.org](http://www.hoard.org),
[https://github.com/emeryberger/Hoard](https://github.com/emeryberger/Hoard))
significantly outperformed SuperMalloc on a machine without TSX. For example,
on a simple microbenchmark that tests scalability, Hoard is 2x faster with one
thread, and the gap widens with more threads (Hoard is 3.3x faster with 8
threads, 4x faster with 16 threads).

~~~
throwawayaway
Wow! Want to do a full disclosure?

Your papers were very valuable during my undergraduate studies. Is Hoard
neck and neck with jemalloc and Lockless's allocator nowadays?

[http://locklessinc.com/benchmarks_allocator.shtml](http://locklessinc.com/benchmarks_allocator.shtml)

------
Someone
Code at
[https://github.com/kuszmaul/SuperMalloc](https://github.com/kuszmaul/SuperMalloc).

Paper seems to be there, too, at
[https://github.com/kuszmaul/SuperMalloc/blob/master/paper/cp...](https://github.com/kuszmaul/SuperMalloc/blob/master/paper/cpc/onepage.tex)

~~~
wyldfire
"SuperMalloc is an implementation of malloc(3) designed for x86 Hardware
Transactional Memory (HTM). It turns out that the same design also makes it
fast even without HTM."

Does anyone know if there's a cpuid feature flag for HTM? Do all
Haswell-generation SKUs support the feature?

~~~
rayiner
TSX is disabled in most (all?) Haswell SKUs due to bugs.

~~~
wyldfire
Interesting. "In August 2014, Intel announced a bug in the TSX implementation
on current steppings of Haswell, Haswell-E, Haswell-EP and early Broadwell
CPUs, which resulted in disabling the TSX feature on affected CPUs via a
microcode update." -- So only (non-early) Broadwell CPUs support it now?

~~~
putterson
As well as Skylake CPUs now. Check the "TSX-NI" table entry:
[http://ark.intel.com/products/88191/Intel-Core-i5-6600K-Proc...](http://ark.intel.com/products/88191/Intel-Core-i5-6600K-Processor-6M-Cache-up-to-3_90-GHz)

~~~
webaholic
It shows that it is supported in Skylake. I guess they fixed the bugs.

------
scott_s
Neat, we have multiple allocator authors on this thread! I'm one of the
authors of reference [38], Streamflow
([http://www.scott-a-s.com/projects/#streamflow](http://www.scott-a-s.com/projects/#streamflow)).
Unlike the other allocators, our code has not been kept up to date over the
years, so it's reasonable that our allocator is not compared against in the
experiments. In any event, the allocator for TBBmalloc shares a lot of design
similarity. (It's based on McRT-malloc, which was presented as a paper during
the same session as our allocator at ISMM 2006.)

The author of this paper does not use lock-free techniques, which our
allocator used - I'm curious if using lock-free algorithms would have changed
the author's design, or improved the performance. I do think that despite the
similarities that Streamflow has to TBBmalloc, Streamflow is not susceptible
to the same kind of memory blowup. The problem, as described in the recent
paper:

"TBBmalloc can have an unbounded footprint. One case was documented by [40].
In this case, one thread allocates a large number of objects, and a second
thread then frees them, placing them into the first thread's foreign block. If
the first thread then does not call free(), then the memory will never be
removed from the foreign block to be reused. There appears to be no easy fix
to this problem in TBBmalloc, since the thread-local locking policy assumes,
deep in its design, that every thread calls free() periodically."

Streamflow avoids this problem by putting the remote-free block check on the
malloc path. That is, when allocating memory, you always check if other
threads have remotely freed memory for you. Basically, if you're continually
allocating memory, you're also continually checking whether you should reclaim
memory that other threads freed for you. You can see this in action:
[https://github.com/scotts/streamflow/blob/master/streamflow....](https://github.com/scotts/streamflow/blob/master/streamflow.c#L1211)

All of the above is just to add some background - I look forward to really
digging into this paper over the holidays.

(In case anyone actually follows my comments, they may know I currently do
research and development for IBM Streams. That has zero relationship to this
memory allocator I worked on early in grad school; it's just an odd
coincidence of project names.)

------
collector
Fast, multicore-scalable, low-fragmentation memory allocation through large
virtual memory and global data structures [1]

An allocator designed for scalability from the ground up. Its design is
similar to Streamflow's (based on so-called spans): it eagerly returns memory
(latency-aware), and its backend also makes use of fragmenting virtual
memory, which is plentifully available on 64-bit systems.

[1]:
[http://dl.acm.org/citation.cfm?doid=2814270.2814294](http://dl.acm.org/citation.cfm?doid=2814270.2814294)

~~~
scott_s
You all are giving me a lot of reading material.

------
jws
I think the significant design decision is that in a 64-bit world, virtual
address space is no longer scarce. This allows a simpler implementation,
roughly half the code.

The microbenchmarks are great; the whole-program benchmarks show nothing
earth-shaking. But I'll take OK performance and half the code any day.

~~~
thrownaway2424
Can you explain? It seems like the address space is getting more scarce, not
less. Several years ago when x86-64 adoption really took off, 48 bits was a
lot bigger than the actual amount of memory one might put in a system, but now
it isn't (48 bits virtual vs. 40 bits physical).

~~~
jws
The algorithm sparsely uses 512MB of virtual address space for its allocations
and caches. For each of about 40 size classes, it keeps an array of fixed-size
blocks, plus a bunch of per-thread and per-CPU caches.

In a 32-bit world, taking 1/4 to 1/8 of the address space would be impolite. In
a 64-bit (or 48-bit) world it doesn't matter.

------
RossBencina
SuperMalloc was critiqued and compared with Bonwick's slab allocator in this
excellent talk by Ryan Zezeski: "Memory by the Slab: The Tale of Bonwick's
Slab Allocator". Video, paper, etc.:
[http://paperswelove.org/2015/video/ryan-zezeski-memory-by-th...](http://paperswelove.org/2015/video/ryan-zezeski-memory-by-the-slab/)

------
lotsoflumens
"dual licensed" \- guaranteed to double your licensing confusion.

~~~
DiThi
Can't we just use the MIT part? What's the point of having two FOSS licenses,
one being non-viral?

~~~
jws
I don't think adding a GPLv3 dual to an MIT license changes much. You can
derive a work from MIT licensed code and license your work with a more
restrictive license. You would still need to reproduce the MIT license
alongside your GPLv3 license, which would be confusing.

The author is giving you the option to omit the MIT text and have a pure GPLv3
if you wish to restrict access to your software in that way.

~~~
DiThi
Don't all MIT-licensed projects have this possibility de facto? I mean, I can
take any other MIT-licensed software, add some GPL-licensed bits of my own, and
release the whole thing as just GPL.

~~~
justincormack
The whole thing is not GPL; someone can still replace your changes with more
MIT code and use the result under the MIT license. You cannot change the
license of someone else's code.

------
thrownaway2424
Comparison to tcmalloc suspiciously omitted.

~~~
halayli
Came here to say just that. tcmalloc performs better than jemalloc in various
cases.

~~~
scott_s
Interesting - I had noticed the omission, but I had assumed that meant that
jemalloc had emerged as the agreed-upon best allocator for production. Are you
aware of any recent published experiments comparing the two?

~~~
thrownaway2424
I'm not aware of any serious published works. Really the only way to choose an
allocator is to try them all on your benchmark workload. Any other benchmarks
are likely to be irrelevant.

I think tcmalloc vs. jemalloc is basically a wash. One is written by Google
people and the other is written by Facebook people. Facebook is much more
forward about their open source projects than Google is, so more people have
heard of jemalloc.

~~~
stass
jemalloc is unaffiliated with FB. It was developed and integrated into FreeBSD
and Firefox long before the author started working for Facebook.

~~~
thrownaway2424
What I meant about Facebook's advocacy is that they have published lots of
high-profile blog posts and whatnot about applications of jemalloc, which have
contributed greatly to its renown relative to tcmalloc, and I think it's fair
to say that jemalloc versions 2, 3, and 4 have been largely underwritten by
the Big Thumb.

And the other side of what I was trying to say is that for Google it's largely
an issue of personality. tcmalloc is Google's malloc. JE works for Facebook.
If JE worked at Google I'm sure Google would just use jemalloc.

