

Ask HN: What is your favorite hand-rolled malloc? - leif

In a similar vein to http://graphics.stanford.edu/~seander/bithacks.html, I'm curious to hear the people's choice for memory pool/slab allocators.  My question is a three-parter:

1) Allocation size function: Do you allocate fixed-size chunks of N objects at a time?  Exponential increase in chunk size (I believe valgrind does this for its memcheck tool)?  If exponential, do you also back off exponentially, or linearly?  Something else?

2) Used/free record keeping: Bitfield?  Linked list?  Other?

3) New allocations: realloc, or an array/linked list of pointers (I bet I know the answer to this one)?

Some justification would be nice, if you've got the time.
======
tptacek
I have two go-to solutions for malloc. In order of preference:

I use arena allocators when I can get away with it. Arenas are absolutely
trivial to code; there's no per-object "free", so allocation is just a pointer
bump. You will be surprised at how often you can get away with this; lots of
allocation-heavy code paths map to a single operation, transaction, request,
file, or what-have-you, and all you're doing is deferring free to the end of
it. Using arenas in this scenario basically erases the cost of allocation from
your program. It's also dramatically harder to fuck up.
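The arena idea above can be sketched in a few lines of C. This is a minimal, illustrative version (all names and the alignment choice are mine, not from the comment): allocation is just a pointer bump, and the only "free" releases the whole arena at the end of the request or transaction.

```c
#include <stdlib.h>
#include <stddef.h>

/* Minimal pointer-bump arena: no per-object free. */
typedef struct {
    char  *base;
    size_t used;
    size_t cap;
} arena_t;

static int arena_init(arena_t *a, size_t cap) {
    a->base = malloc(cap);
    a->used = 0;
    a->cap  = cap;
    return a->base != NULL;
}

/* Allocation is just a bump; returns NULL when the arena is full. */
static void *arena_alloc(arena_t *a, size_t n) {
    n = (n + 15) & ~(size_t)15;           /* keep 16-byte alignment */
    if (a->used + n > a->cap) return NULL;
    void *p = a->base + a->used;
    a->used += n;
    return p;
}

/* One call frees everything allocated for the operation. */
static void arena_release(arena_t *a) {
    free(a->base);
    a->base = NULL;
    a->used = a->cap = 0;
}
```

A real version would grow by chaining extra blocks instead of returning NULL, but the fast path stays two or three instructions.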

For anything else, when malloc hits the top of my profile, I have a simple
freelist library for pools of homogeneous allocations, with an embedded linked
list of free items.
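A freelist pool with an embedded list might look like this (a sketch, not the library in question; the trick is that each free slot's own first word stores the next-pointer, so the freelist costs no extra memory):

```c
#include <stdlib.h>
#include <stddef.h>

/* Fixed-size pool: free objects are chained through their own
 * first word, so bookkeeping is free. Names are illustrative. */
typedef struct {
    void  *free_head;   /* embedded singly linked list of free slots */
    char  *slab;
    size_t obj_size;    /* rounded up to at least sizeof(void *)     */
    size_t nobjs;
} pool_t;

static int pool_init(pool_t *p, size_t obj_size, size_t nobjs) {
    if (obj_size < sizeof(void *)) obj_size = sizeof(void *);
    p->slab = malloc(obj_size * nobjs);
    if (!p->slab) return 0;
    p->obj_size  = obj_size;
    p->nobjs     = nobjs;
    p->free_head = NULL;
    /* thread every slot onto the freelist */
    for (size_t i = 0; i < nobjs; i++) {
        void *slot = p->slab + i * obj_size;
        *(void **)slot = p->free_head;
        p->free_head = slot;
    }
    return 1;
}

static void *pool_alloc(pool_t *p) {
    void *slot = p->free_head;
    if (slot) p->free_head = *(void **)slot;  /* pop */
    return slot;
}

static void pool_free(pool_t *p, void *slot) {
    *(void **)slot = p->free_head;            /* push */
    p->free_head = slot;
}
```

Both alloc and free are a single pointer swap, which is why this wins when one object size dominates the profile.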

You could get smarter than an arena, a generic freelist, and malloc, but why?

~~~
neilc
Yeah, arena allocators are great. I wouldn't immediately go and write my own
from scratch, though: talloc (<http://talloc.samba.org/>), Postgres, and APR
all implement variants on the arena idea that you can copy from liberally.

~~~
huhtenberg
And talloc was derived from halloc (<http://swapped.cc/halloc>), which is a
hierarchical allocator. Or rather, a _wrapper_ for an existing allocation API.

~~~
neilc
Thanks for mentioning that link -- that's a nice library.

------
adamt
At various times I've profiled some CPU-bound code and thought malloc was
taking too much time. This has led me to read up on, and try to develop,
various usage-optimised allocators. The one lesson I've learnt is that in most
cases the standard libc allocator is actually extremely good.

Where I have had big advantages over libc malloc is where the memory
allocations have been specific to a thread context. For example, consider a
multi-threaded server where you could guarantee that only one thread was
active in a connection at any one time. It's quite easy for a single connection
to spend 5-10% of its time in malloc, and there are often big global-ish locks
around general-purpose allocators, so it's easy to get mutex contention
which is far worse than the overhead of a less efficient allocator.

In such cases you often need small bits of memory within the context of a
connection. Here I found big benefits by mallocing memory in big chunks
and then simply splitting those up into small sections within a
connection-specific allocator, avoiding any malloc mutex locking.

What I've found is that in order to beat normal malloc, you really have to
have a specific usage pattern that you can exploit to do better.

------
ori_b
Slab allocators, as described by Bonwick
(<http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.29.4759>),
if I'm allocating lots of same-size objects. Other than that, I mainly stick
to the system malloc(), or something specifically optimized for the task at
hand.

------
lacker
At Google we use tcmalloc which is now open sourced.

A paper about it:

<http://goog-perftools.sourceforge.net/doc/tcmalloc.html>

The code:

<http://code.google.com/p/google-perftools/>

------
jey
It really depends on the application. I just use the default system allocator
until it's inadequate for some task, and then I switch to an allocation scheme
or heap allocator that's well-suited for that problem.

Can you tell us more about your specific problem? Then we'd be able to give
suggestions and help evaluate the options.

~~~
leif
The current issue is that we allocate and free small structures very, very
rapidly. They have very short lifespan, and there are never (in the average
case) more than a few around at any one time. After profiling, we saw lots of
time spent in malloc, and wrote a fixed-size-slab bitfield-based allocator
which gave fantastic performance.
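For illustration, a fixed-size-slab bitfield allocator of the kind described might look like this (a sketch under my own assumptions: 64 slots of 32 bytes, a single 64-bit word as the used/free bitfield; none of these numbers come from the original code):

```c
#include <stdint.h>
#include <stdlib.h>

/* Fixed-size slab with a bitfield of used slots. */
#define SLOT_SIZE 32
#define SLOTS     64
typedef struct {
    uint64_t used;                   /* bit i set => slot i allocated */
    char     mem[SLOTS * SLOT_SIZE]; /* the slab itself               */
} bitslab_t;

static void *bitslab_alloc(bitslab_t *s) {
    if (s->used == UINT64_MAX) return NULL;      /* slab full */
    /* linear scan for the first zero bit; a real version
     * might use __builtin_ctzll on ~s->used instead */
    for (int i = 0; i < SLOTS; i++) {
        uint64_t bit = (uint64_t)1 << i;
        if (!(s->used & bit)) {
            s->used |= bit;
            return s->mem + i * SLOT_SIZE;
        }
    }
    return NULL;
}

static void bitslab_free(bitslab_t *s, void *p) {
    int i = (int)(((char *)p - s->mem) / SLOT_SIZE);
    s->used &= ~((uint64_t)1 << i);
}
```

With only a handful of live objects at a time, the scan almost always terminates within the first few bits, which fits the short-lifespan workload described.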

There are some other issues with the same piece of code that need re-working,
and we'll end up with a totally different approach by the end, so I don't
really have a problem that needs a crowd solution; I was just curious what
everyone here thinks.

~~~
huhtenberg
If it's a fixed-size slab, why bother with the bitfield? It is hard to beat
the performance of allocating a block from a free (doubly-linked) list. Start
with an empty list and a full slab. Try allocating from the list first, and from
the bottom of the slab second. Free into the free list only. If your memory
utilization is stable in the long run, then at some point you will be
allocating/freeing from the list only.
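The scheme described above can be sketched as follows (my own illustrative names; the comment suggests a doubly-linked list, but a singly-linked one is enough for a sketch since we only push and pop at the head):

```c
#include <stdlib.h>
#include <stddef.h>

/* Empty free list + full slab: allocate from the list first,
 * from the bottom of the slab second; free into the list only. */
typedef struct {
    void  *free_head;   /* freed slots, chained through their first word */
    char  *slab;
    size_t bottom;      /* index of the next never-used slot             */
    size_t obj_size, nobjs;
} hybrid_t;

static int hybrid_init(hybrid_t *h, size_t obj_size, size_t nobjs) {
    if (obj_size < sizeof(void *)) obj_size = sizeof(void *);
    h->slab = malloc(obj_size * nobjs);
    if (!h->slab) return 0;
    h->free_head = NULL;
    h->bottom    = 0;
    h->obj_size  = obj_size;
    h->nobjs     = nobjs;
    return 1;
}

static void *hybrid_alloc(hybrid_t *h) {
    if (h->free_head) {                       /* 1) free list first */
        void *p = h->free_head;
        h->free_head = *(void **)p;
        return p;
    }
    if (h->bottom < h->nobjs)                 /* 2) bottom of slab second */
        return h->slab + h->bottom++ * h->obj_size;
    return NULL;                              /* slab exhausted */
}

static void hybrid_free(hybrid_t *h, void *p) {
    *(void **)p = h->free_head;               /* free into the list only */
    h->free_head = p;
}
```

Once utilization stabilizes, `bottom` stops moving and every alloc/free is a single list operation, as the comment predicts.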

------
kvs
My recent allocator design is an attempt to address memory fragmentation on
Windows XP & Vista. The unique problem I faced was that there was a significant
amount of code already in place which used C++ new/delete (about 300 KLOC), and
a lot of coding was ongoing at breakneck speed. The allocator is written in C++.

My story:

\- Profiled the code to figure out allocation sizes and number of allocations:

  * Found most objects are 32, 64, or 128 bytes, and we allocate millions of them.

  * Allocation rate is linear in the sense that the rate of allocation is pretty steady within a time window.

  * Lifetimes of these objects are long; the majority of them (99.9%) last as long as the application itself.

  * Found a couple of large allocations in the range of 64MB to 128MB.

Can you tell us yours in similar detail? Then perhaps we can figure out what's
best for you. :-)

~~~
slackerIII
Did you try the Windows Low-Fragmentation Heap? I've seen it used very
successfully.

~~~
kvs
Yes. That didn't work to our satisfaction.

------
scott_s
Mine: <http://www.cs.vt.edu/~scschnei/streamflow>

We allocated fixed-size chunks of N objects, although we didn't have a smooth
function for the chunk size. The inflection points where we changed
our strategy were cache-line size and page size.

Used/free record keeping was a simple table to map an address to the data
structure for that address, and we re-used the free memory for an internal
list.

That's what we did to get good sequential performance - it was the
multithreaded performance we were interested in. Aiming for good multithreaded
performance influenced the data structure design to be simpler to reduce
synchronization.

Details are in the paper and source code.

------
jacquesm
I'm using jemalloc in places where we ran into trouble with the regular
malloc.

It's been a long, long time since I rolled my own, so I don't think I'm
qualified to answer the rest of your questions, except for the second: a
linked list of pointers to free(d) memory would be the best choice, I think,
plus, as an extra, an array of such lists grouped by size to speed up the
search for a block of the right size. (I hope that makes sense...)

~~~
leif
I think the regular (GNU, at least) malloc uses bucketed linked lists.

------
dejb
The best malloc for me is one I don't have to know anything about:
whatever is built into the scripting language I happen to be using. But I
know you aren't playing around with memory allocation for the sheer geekery of
it. There are still some problems that can't be solved efficiently with
higher-level languages. This class of problems I'll gladly leave to you guys.

------
Locke1689
I usually just use the standard GCC implementation. I remember implementing a
pretty quick red-black tree version for my systems class. It got me an A, but
I don't think I'd use it unless I find myself having serious problems with
GCC.

------
spot
tcmalloc: <http://goog-perftools.sourceforge.net/doc/tcmalloc.html>

<http://mituzas.lt/2006/12/06/google-perftools-tcmalloc-squid/>

