

Optimization Tricks used by the Lockless Memory Allocator - jcsalterego
http://locklessinc.com/articles/allocator_tricks/

======
scott_s
This is very similar to a multithreaded memory allocator I implemented many
years ago: <http://people.cs.vt.edu/~scschnei/streamflow/>

Full paper: <http://people.cs.vt.edu/~scschnei/papers/ismm06.pdf> I also have
an incomplete extension to that paper with more algorithmic details. I can
clean it up and send it to interested parties.

Source code: <http://github.com/scotts/streamflow/>

There's also an excellent paper recently published at PACT 2011 that
extends and improves on what we did, but the paper is not available
anywhere yet:
<http://aces.snu.ac.kr/Center_for_Manycore_Programming/SFMalloc.html>
I have a copy of the paper and can email it to interested people.

Edit after looking through his code: I think he has locks on the main
execution path for malloc. That violates one of our design principles, which
was to avoid synchronization on the main path entirely where possible, and
to use only lock-free synchronization for the remaining main-path
operations. (Allocations that fall through to the page-block allocator in
our work do take a lock. The SFMalloc allocator I mention above takes those
locks out and sees a performance improvement.)

~~~
sparky
Going lock-free has the benefit on any platform (well, any platform with
atomic primitives) that you don't have to worry about a thread getting
preempted while holding a lock and blocking others indefinitely. However, I
think the scalability advantages aren't as evident on today's systems as they
will be (or are, in the case of GPUs) on systems with high-tens to hundreds or
thousands of hardware threads. The main reason I've found is that most lock-
free algorithms I've implemented add significant constant factors over their
serial equivalents (often via retries on CAS failure), so you really have to
have high access concurrency to make it worth your while.
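
To make the constant-factor point concrete, here is roughly what the retry
pattern looks like in C11 atomics (my own minimal sketch, not code from the
article): even the simplest lock-free update pays a load plus a CAS loop,
where the serial equivalent is a single add.

    #include <stdatomic.h>

    /* Minimal sketch: a lock-free add. Under contention the CAS fails
     * and we retry, which is pure overhead a serial increment never
     * pays. */
    void lockfree_add(atomic_long *counter, long delta)
    {
        long old = atomic_load_explicit(counter, memory_order_relaxed);
        /* On failure, 'old' is reloaded with the current value, so we
         * just recompute and try again. */
        while (!atomic_compare_exchange_weak_explicit(
                   counter, &old, old + delta,
                   memory_order_relaxed, memory_order_relaxed))
            ;
    }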

Going lock-free also tends to force you to use simpler data structures, since
you have to figure out how to migrate the data structure from state A to state
B via a sequence of very "small" operations (e.g., a CAS or atomic add),
making sure that every intermediate state along the way is also valid. This is
in contrast to acquiring a (group of) lock(s) and going to town for a lock-
based data structure. As evidence of this, being able to figure out a lock-
free variant of something like a linked list or skip list --- things which are
not considered rocket science in the serial world --- will usually get you a
conference or journal paper.
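
For a concrete instance of "migrate via small operations with every
intermediate state valid", here's a Treiber-style lock-free stack push,
sketched in C11 atomics (not anyone's production code): the only shared
mutation is a single CAS on the head pointer, so a reader can never observe
a half-linked stack.

    #include <stdatomic.h>

    typedef struct node {
        struct node *next;
        /* payload would go here */
    } node_t;

    /* Push: link the new node to the current head with a private write,
     * then swing the head pointer in one CAS. If another thread pushed
     * or popped in between, the CAS fails, 'old' is refreshed, and we
     * retry. */
    void push(_Atomic(node_t *) *top, node_t *n)
    {
        node_t *old = atomic_load(top);
        do {
            n->next = old;  /* nobody can see n yet, so this is safe */
        } while (!atomic_compare_exchange_weak(top, &old, n));
    }

Pop is where it stops being easy (ABA, memory reclamation), which is a big
part of why even a linked list variant is publication-worthy.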

Another reason I don't think we've seen the best of lock-free yet is that most
current implementations besides GPUs ride on top of cache coherence; you incur
all the overhead of getting the cache line in your local cache, perform the
atomic op, then ship it somewhere else. Since the whole raison d'etre of lock-
free algorithms is that data operated on atomically tends to be getting
hammered by many threads, there's really not much point to shipping it all
around the chip; just put some functional units in the last-level cache and
keep the data there.

Haven't looked at the Lockless source yet, but locks on the fast path do
seem unusual; it seems fairly standard in high-performance allocators to have
a per-thread cache of freed blocks, which obviously doesn't need locks.
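
Roughly, the per-thread cache is just this (a sketch with made-up names):
the fast path reads and writes only thread-local state, so no atomics or
locks at all.

    #include <stddef.h>

    #define NUM_CLASSES 32

    /* One free list per size class, per thread. Freed blocks store the
     * 'next' pointer in their own first bytes. */
    static _Thread_local void *free_cache[NUM_CLASSES];

    void *cached_alloc(size_t size_class)
    {
        void *blk = free_cache[size_class];
        if (blk != NULL)
            free_cache[size_class] = *(void **)blk;  /* pop */
        return blk;  /* NULL means fall through to the slow path */
    }

    void cached_free(size_t size_class, void *blk)
    {
        *(void **)blk = free_cache[size_class];      /* push */
        free_cache[size_class] = blk;
    }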

I'd be interested in a copy of the SFMalloc paper, if it's alright with the
authors (email in profile); wasn't able to attend PACT this year to see the
talk, but I'm very interested in the work.

~~~
scott_s
Your points are true in general, but memory allocation is a special case.

The main data structures in memory allocators tend to be pretty simple. We got
away with using singly-linked lists (FIFO queues) as the only shared data
structure that needed to be lock-free. In that case, you trade one lock for
one compare-and-swap, so it's an even trade. The main path of the
allocation was free of synchronization.

I also implemented a lock-free radix tree (the original was borrowed from
TCMalloc), and you can see multiple compare-and-swap operations in the main
routine
(<https://github.com/scotts/streamflow/blob/master/streamflow.c#L395>),
but they will be hit only rarely.
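
For anyone skimming: the core trick, in a heavily simplified sketch that is
not the actual streamflow code, is to install missing children by CAS-ing a
NULL slot; once the tree is warm, descending is plain loads, which is why
those compare-and-swaps are hit only rarely.

    #include <stdatomic.h>
    #include <stdlib.h>

    #define FANOUT 256

    typedef struct rt_node {
        _Atomic(struct rt_node *) child[FANOUT];
    } rt_node;

    /* Descend one level, creating the child if it doesn't exist yet.
     * If two threads race, exactly one CAS succeeds; the loser frees
     * its node and uses the winner's. */
    static rt_node *descend(rt_node *n, unsigned idx)
    {
        rt_node *next = atomic_load(&n->child[idx]);
        if (next == NULL) {
            rt_node *fresh = calloc(1, sizeof *fresh);
            if (fresh == NULL)
                return NULL;  /* out of memory */
            if (atomic_compare_exchange_strong(&n->child[idx], &next, fresh))
                return fresh;
            free(fresh);  /* 'next' now holds the winner's pointer */
        }
        return next;
    }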

I applied your reasoning, though. The fast path for allocation and freeing
small objects was usually synchronization free, and always lock-free. But I
protected my page manager with spin locks. The page manager was responsible
for grabbing pages from the OS, giving them back to the OS, and carving out
pages for individual threads to use as their backing store for their small
object allocations. My rationale was that we would only hit the page manager
rarely, and in that case, it would probably be _faster_ to use a spin lock,
since I anticipated getting pages from the page manager would usually be
uncontended.
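
For completeness, the spin lock in question is nothing exotic; a C11
test-and-set version looks something like this (a sketch, not the streamflow
source): when the lock is uncontended, acquiring it is one atomic exchange.

    #include <stdatomic.h>

    typedef struct {
        atomic_flag held;  /* initialize with ATOMIC_FLAG_INIT */
    } spinlock_t;

    static void spin_lock(spinlock_t *l)
    {
        /* Rarely contended by assumption: the page manager is only hit
         * when a thread's local store runs dry. */
        while (atomic_flag_test_and_set_explicit(&l->held,
                                                 memory_order_acquire))
            ;  /* spin */
    }

    static void spin_unlock(spinlock_t *l)
    {
        atomic_flag_clear_explicit(&l->held, memory_order_release);
    }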

But that was back in 2005 and 2006. We only had access to 8-thread machines.
The SFMalloc paper makes it lock-free all the way up the stack, and their
experiments on up to 48 cores show significant improvement.

------
huhtenberg
A lock-free slab allocator for _smaller blocks_ is frequently all that's
needed, especially in STL-heavy C++ code.

There are typically a heck of a lot of allocations of 0-32 byte blocks,
about half as many of 33-64, half again of 65-128, etc. - meaning that if one
optimizes allocation of blocks smaller than 512 bytes, it yields 80-90% of
the speed gain achievable from a better allocator.

In practical terms this translates into a simpler implementation - just set
up a bunch of slab buckets - one for 0-16 byte blocks, the next for up to 32,
the next for up to 64, etc. - and pass larger allocation requests to the
default malloc. Exact size ranges are easy to determine by running the app
with a proxy allocator and looking at the block size histogram. I did this on
several projects, and in all cases the histograms stabilized very quickly.
The tricky part is the lock-free management of the slabs, especially the
disposal, but it is not rocket science by any means.
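
A sketch of the bucket scheme (all names made up, and the lock-free slab pop
itself stubbed out, since that's the tricky part):

    #include <stdlib.h>

    #define NUM_BUCKETS 6  /* 16, 32, 64, 128, 256, 512 */

    /* Map a request size to its bucket; -1 means "too big for slabs". */
    static int bucket_for(size_t size)
    {
        size_t cap = 16;
        for (int i = 0; i < NUM_BUCKETS; i++, cap <<= 1)
            if (size <= cap)
                return i;
        return -1;
    }

    void *slab_malloc(size_t size)
    {
        int b = bucket_for(size);
        if (b < 0)
            return malloc(size);  /* pass-through for large requests */
        /* Stub: a real version pops a block from bucket b's lock-free
         * slab here; this is where the CAS loop (and the ABA care on
         * disposal) lives. */
        return malloc((size_t)16 << b);
    }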

A real-world example - adding a lock-free slab allocator to a Windows app
that did some multi-threaded data crunching yielded a 400% speedup compared
to the native HeapAlloc().

