

Memory – Part 4: Intersec’s custom allocators - fruneau
https://techtalk.intersec.com/2013/10/memory-part-4-intersecs-custom-allocators/

======
rayiner
The thread test is janky. Most multithreaded allocators optimize for the
(common) case that objects are freed by the same thread that creates them.
When objects are freed by different threads than the ones that allocated them,
typically some sort of slow-path is invoked.

Older allocators with per-thread caches used to behave very badly with cross-
thread frees, accumulating tons of freed objects in threads that didn't
necessarily allocate a lot of objects. Tcmalloc uses a garbage collection
process to move those objects back to the central free list.

The test in the article, where one thread does all the allocations and another
does all the frees, basically subverts the thread-caching in tcmalloc, and
just tests how quickly the garbage collection process can move freed objects
from the free()-thread's cache back to the central heap where they can be
reused by the malloc()-thread.

~~~
fruneau
I admit that test stresses some corner cases (at least, cases that the
allocator designers consider corner cases). That said, malloc has no choice
but to support that use case.

A use case for such a pattern is message passing with workers: you queue some
messages that are later dequeued and processed by a different thread. This is
an increasingly common pattern in modern programs. In that pattern the message
is allocated in one thread (let's say the main one), then processed and
deallocated by another thread.

If your implementation of message allocation is malloc-based, then you will
stress the exact same code paths the benchmark is stressing.
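
A minimal sketch of that pattern (pthread-based; the queue and all names are
made up, not Intersec's code): every message is malloc()ed on the main thread
and free()d on a worker, which is exactly the cross-thread path the benchmark
exercises.

        #include <pthread.h>
        #include <stdlib.h>

        struct msg { struct msg *next; int payload; };

        static struct msg *queue_head;
        static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

        static void *worker(void *arg)
        {
            (void)arg;
            for (;;) {
                pthread_mutex_lock(&queue_lock);
                struct msg *m = queue_head;
                if (m)
                    queue_head = m->next;
                pthread_mutex_unlock(&queue_lock);
                if (!m)
                    break;       /* queue drained: stop */
                /* ... process the message ... */
                free(m);         /* freed by a different thread */
            }
            return NULL;
        }

        int main(void)
        {
            for (int i = 0; i < 1000; i++) {
                struct msg *m = malloc(sizeof(*m));  /* allocated here */
                m->payload = i;
                pthread_mutex_lock(&queue_lock);
                m->next = queue_head;
                queue_head = m;
                pthread_mutex_unlock(&queue_lock);
            }
            pthread_t t;
            pthread_create(&t, NULL, worker, NULL);
            pthread_join(t, NULL);
            return 0;
        }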

~~~
JoachimSchipper
You're not wrong that malloc-based message passing causes that load on malloc,
but if performance of the message-passing code is important, you'd want to use
a ring buffer anyway - cross-CPU or not, malloc is pretty slow.
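
For illustration, a sketch of a single-producer/single-consumer ring buffer
(C11 atomics; names and sizing are hypothetical): the slots are allocated once
up front, so the per-message hot path touches no allocator at all.

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdint.h>

        #define RING_SIZE 1024      /* must be a power of two */

        struct ring {
            _Atomic uint32_t head;  /* advanced by the consumer */
            _Atomic uint32_t tail;  /* advanced by the producer */
            intptr_t slots[RING_SIZE];
        };

        static bool ring_push(struct ring *r, intptr_t v)
        {
            uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
            uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
            if (tail - head == RING_SIZE)
                return false;       /* full */
            r->slots[tail & (RING_SIZE - 1)] = v;
            atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
            return true;
        }

        static bool ring_pop(struct ring *r, intptr_t *v)
        {
            uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
            uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
            if (head == tail)
                return false;       /* empty */
            *v = r->slots[head & (RING_SIZE - 1)];
            atomic_store_explicit(&r->head, head + 1, memory_order_release);
            return true;
        }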

~~~
fruneau
Clearly, we go back to the initial statement: for specific use cases, we need
specific allocators.

------
exDM69
It should be obvious that a lot of 8-byte mallocs will give bad performance
and horrible memory use. This article, and in particular the benchmarks in it,
would be a lot more informative if the test case were more realistic.

Please add at least 32 or 64 bytes of payload to the linked-list structure and
re-run the benchmarks. Even that is a very small allocation, but it is on the
lower end of realistic allocation sizes, although still not good practice.

~~~
barrkel
8-byte mallocs are expensive because most mallocs have per-allocation memory
overhead to track things. This is exactly why you may want to use a different
allocator.

IOW, your angle is that instead of finding a solution to the problem, you'd
rather choose a different problem. You don't always have that luxury.

My background on this problem is compilers. Compilers allocate lots of little
structures that represent tree nodes, values, tokens, etc. Forcing them all to
be a minimum of 32 or 64 bytes, just so that using malloc for them looks
justified, would be more than a little bizarre. Arena allocation - both per
module (for structures that need to persist for the whole compilation) and
stack-based (for structures that are discarded after e.g. evaluation or
codegen) - makes far more sense than contorting the problem so that malloc
makes sense.
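
A rough sketch of the kind of arena (bump-pointer) allocator meant here, with
made-up names: allocation is just a pointer increment, and a whole module's
nodes go away with a single free().

        #include <stdlib.h>

        struct arena {
            char  *base;
            size_t used;
            size_t cap;
        };

        static int arena_init(struct arena *a, size_t cap)
        {
            a->base = malloc(cap);
            a->used = 0;
            a->cap  = cap;
            return a->base ? 0 : -1;
        }

        static void *arena_alloc(struct arena *a, size_t n)
        {
            n = (n + 7) & ~(size_t)7;   /* keep 8-byte alignment */
            if (a->used + n > a->cap)
                return NULL;            /* real code would chain a new block */
            void *p = a->base + a->used;
            a->used += n;
            return p;
        }

        static void arena_release(struct arena *a)
        {
            free(a->base);              /* frees every node at once */
            a->base = NULL;
            a->used = a->cap = 0;
        }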

~~~
exDM69
> 8-byte mallocs are expensive because most mallocs have per-allocation memory
> overhead to track things. This is exactly why you may want to use a
> different allocator.

I guess if the objective of the article is to point out the obvious fact that
mallocing 8 bytes at a time is a bad idea, then showing some actual numbers
from actual malloc implementations is a good idea. However, even small objects
in practical problems are usually bigger than 8 bytes, so making the
allocation size a bit bigger would give more realistic figures.

Overall I think the article was informative and well written, but a more
realistic test case would better point out when to write a custom allocator
and which allocator to choose for a particular usage pattern.

~~~
jbooth
I kinda see it more as "we're using an unrealistic workload to emphasize the
per-malloc overhead". A program that was written to minimize and batch
allocations wouldn't be as demonstrative.

~~~
tptacek
Contra "exDM69", there's nothing unreasonable or "unrealistic" about
allocating 8 bytes at a time; it's in fact extremely convenient to be able to
do that. I think it's telling that someone would call that workload
unrealistic; it indicates to me that they've never even really considered
alternatives to malloc.

It's a little like those C programmers who try to use 256 byte static arrays
for all their strings because they don't quite grok malloc/realloc.

~~~
jbooth
Well, personally, if I'm using C, it's typically for a very narrowly defined,
performance-sensitive problem. Both times I've done this in the last couple
years, I found myself doing a couple big mallocs at the start and then running
some tight loops over that memory with no further mallocs -- for use cases
with more allocation, I'll just use whatever other language is more
convenient, preferably one with a GC.

But if I were writing a bigger application entirely in C/C++, I could totally
see lots of small allocations happening at different points (and getting into
wackiness with custom pooled allocators and auto_ptr).

~~~
tptacek
That's essentially what I do in much of my C code: implement pool allocators.
And that's essentially what the post we're commenting on is talking about.

Incidentally, a good example of 8-byte allocations that is actually common in
real-world C code: 32-bit linked list nodes (4 bytes for the next pointer, 4
bytes for the data pointer).
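
An illustrative fixed-size pool for exactly such 8-byte nodes (not Intersec's
implementation): a slab is carved into slots, and free slots are threaded onto
a free list through the slots themselves, so there is no per-allocation header
at all.

        #include <stdlib.h>

        #define POOL_SLOTS 4096

        union slot {
            union slot *next_free;  /* valid while the slot is free */
            char        mem[8];     /* the 8-byte node while in use */
        };

        struct pool {
            union slot *free_list;
            union slot  slots[POOL_SLOTS];
        };

        static struct pool *pool_new(void)
        {
            struct pool *p = malloc(sizeof(*p));
            if (!p)
                return NULL;
            for (int i = 0; i < POOL_SLOTS - 1; i++)
                p->slots[i].next_free = &p->slots[i + 1];
            p->slots[POOL_SLOTS - 1].next_free = NULL;
            p->free_list = &p->slots[0];
            return p;
        }

        static void *pool_alloc(struct pool *p)
        {
            union slot *s = p->free_list;
            if (s)
                p->free_list = s->next_free;
            return s;   /* NULL when the pool is exhausted */
        }

        static void pool_free(struct pool *p, void *ptr)
        {
            union slot *s = ptr;
            s->next_free = p->free_list;
            p->free_list = s;
        }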

~~~
cdman
Hmm, are you sure that this (linked lists with small elements) is a good
example to tell people about?

Linked lists are bad because they have big overhead (8 bytes for every element
in your case, which dominates by a large margin if you store, say, an int in
each node) and they are really bad for CPU caches (they have very little
spatial locality), thus slowing down your code.

I much prefer the "growing array" approach.
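
For comparison, a minimal growing array of ints (names are made up): storage
is contiguous, so iteration is cache-friendly and there is no per-element
pointer overhead.

        #include <stdlib.h>

        struct ivec {
            int   *data;
            size_t len;
            size_t cap;
        };

        static int ivec_push(struct ivec *v, int x)
        {
            if (v->len == v->cap) {
                size_t cap = v->cap ? v->cap * 2 : 16;  /* geometric growth */
                int *p = realloc(v->data, cap * sizeof(*p));
                if (!p)
                    return -1;
                v->data = p;
                v->cap  = cap;
            }
            v->data[v->len++] = x;
            return 0;
        }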

~~~
jbooth
Yeah, but they have some unique characteristics related to inserting/deleting
from the middle or beginning of the list. They're actually used in a few
places in the linux kernel, for performance reasons, in spite of cache
locality issues.

------
cs648
Very interesting article, especially the t_scope allocator - I never knew you
could get GCC to perform that cleanup automagically. One minor grammar point:
isn't a lock under contention a _contended_ lock, not a contented lock?
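
For readers who haven't seen it: the "automagic" cleanup is presumably built
on GCC/Clang's cleanup attribute, which calls a function with a pointer to the
variable when it goes out of scope. A minimal illustration (not Intersec's
actual t_scope code):

        #include <stdlib.h>

        static void free_ptr(char **p)
        {
            free(*p);   /* runs automatically at scope exit */
        }

        int main(void)
        {
            {
                __attribute__((cleanup(free_ptr))) char *buf = malloc(64);
                /* ... use buf within the scope ... */
                (void)buf;
            }   /* free_ptr(&buf) has been called here */
            return 0;
        }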

~~~
fruneau
Wording issue fixed.

~~~
dman
Is the source for the allocators freely available? Would love to study those.

~~~
fruneau
Unfortunately, not for the moment.

~~~
aktau
I'd also like to put in a request for either open-sourcing or a more detailed
overview of the implementations, they sound really interesting.

~~~
fruneau
I'll consider writing a more detailed article on the subject. Open-sourcing
the code will not be possible in the short-term.

------
blue11
Sorry to nitpick, but I believe the time difference code has an error of 1
millisecond 25% of the time:

    
    
        int64_t delta = tv2->tv_sec - tv1->tv_sec;
        return delta * 1000 + (tv2->tv_usec - tv1->tv_usec) / 1000;
    

One way to fix it is:

    
    
        /* Borrow one second so the microsecond difference is always
         * non-negative; the division then truncates consistently toward 0. */
        int64_t deltasec = tv2->tv_sec - tv1->tv_sec - 1;
        int64_t deltausec = tv2->tv_usec - tv1->tv_usec + 1000000;
        return deltasec * 1000 + deltausec / 1000;

~~~
fruneau
The diff is a truncation. The actual average error is 0.5ms. By rounding
instead of truncating, we can reduce the average error to 0.25ms.
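
For instance, one way to round instead (a sketch, assuming tv2 >= tv1):

        /* Compute the whole difference in microseconds, then round
         * to the nearest millisecond. */
        int64_t us = (tv2->tv_sec - tv1->tv_sec) * 1000000LL
                   + (tv2->tv_usec - tv1->tv_usec);
        return (us + 500) / 1000;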

~~~
blue11
Well, that much is obvious. But if you are going to truncate, you should be
consistent. Always truncate towards 0, not sometimes towards 0 and sometimes
towards infinity.

------
eeadc
The fact that returning memory to the Kernel is hard is supported by the
circumstance, that most allocators will use brk/sbrk to resize the data
segment of the executing process to allocate memory, at least if they shall
allocate few memory.

The other fact, that allocators have to lock global data structures is also
not true. Most modern operating systems supports thread-local storage and
therefore you don't need locking because you can keep much per-threads
allocators, and only if you want to release memory of a foreign thread you
have to lock (but that's also bad practice in most cases).
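
A sketch of that per-thread scheme (using GCC's __thread; all names are made
up): each thread keeps its own free list in thread-local storage, so the
common path takes no lock; only a cross-thread free would need a locked
hand-back path.

        #include <stdlib.h>

        struct node { struct node *next; };

        static __thread struct node *tl_free_list;  /* one list per thread */

        static void *tl_alloc(size_t size)
        {
            if (size <= sizeof(struct node) && tl_free_list) {
                struct node *n = tl_free_list;      /* no lock: ours alone */
                tl_free_list = n->next;
                return n;
            }
            if (size < sizeof(struct node))
                size = sizeof(struct node);         /* keep slots recyclable */
            return malloc(size);
        }

        static void tl_free(void *p)
        {
            /* Correct only if p was allocated by this same thread; freeing
             * another thread's memory is the case that needs locking. */
            struct node *n = p;
            n->next = tl_free_list;
            tl_free_list = n;
        }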

Therefore, this article is great if your horizon ends at the default
allocators tcmalloc, ptmalloc and jemalloc, but the reality is much more
complex. That such an allocator doesn't exist is not because it's hard to
implement; it's because there is no need for it, since most well-written
software allocates large chunks of memory.

~~~
dexen
_> (...) most allocators use brk/sbrk to resize the data segment of the
running process, at least when allocating small amounts of memory._

The other commonly used backend for malloc() is mmap() without an underlying
file:

    
    
      void *chunk = mmap(NULL, length, PROT_READ|PROT_WRITE,
                         MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    

Handy both when allocating large chunks of memory and when allocating pools
for smaller suballocations. It has the additional benefit of being zeroed out
at low cost (or no cost at all -- for example via hardware DMA), and it also
plays nice on systems with a constrained / fragmented address space, as the
kernel is free to allocate at any address visible from userspace.

------
ArbitraryLimits
Not about heap allocators but I followed the "About" link to this text:

> At Intersec, technology matters…Because it’s the core of our business, we
> aim to provide our clients with the most innovative and disruptive
> technological solutions. We do not believe in the benefits of reusing and
> staking external software bricks when developing our products. Our software
> is built in C language under Linux, with PHP/JavaScript for the web
> interfaces and it is continuously improved ...

So now I'm wondering whether PHP is actually perceived as being hard-core?

Also, how would one stake a brick?

~~~
chris_wot
I think the really hard stuff isn't done via PHP. Seems to me they use PHP as
the front end because they want to focus on their really valuable technology -
their C code.

~~~
fruneau
PHP is used as a small layer that lets our JavaScript talk to our C code. We
have a custom (Protocol Buffers-like) protocol for our RPCs; the PHP embeds a
native module that implements that protocol and exposes a web service our
JavaScript code can talk to, in order to provide a good user experience on top
of our C-written technologies.

Nowadays there is so little intelligence in the PHP that we consider it a mere
pass-through layer.

------
bd_at_rivenhill
In thinking about the t_stack allocator, I think I can see some cases for
which you might want to use this instead of the alloca function, but there is
not enough information to be sure if I am going down the correct mental path.
Can you please explain when/why I should use t_stack instead of alloca?

~~~
professorTuring
In fact, I was thinking of _alloca_ throughout the whole article, and I really
don't see the benefit of implementing a "custom solution" to the detriment of
a well-working existing one.

It would be great to compare their results against alloca =)

~~~
fruneau
alloca has its drawbacks. See the previous article in the series:
https://techtalk.intersec.com/2013/08/memory-part-3-managing-memory/#Stack

------
zwieback
Really interesting and well written, thanks for that. If you wrote some more
about heap allocation strategies (best-fit, worst-fit, first-fit, etc.) to
round out the discussion I'd love to read that as well, especially if you add
varying allocation sizes to your benchmark.

------
dkhenry
I cannot for the life of me find this t_stack allocator he talks about. Anyone
have a link?

~~~
vineel
It's a custom allocator internal to Intersec.

------
deweerdt
Which version of jemalloc was used in the benchmark?

~~~
fruneau
jemalloc 3.4.0 (the current package in Debian sid)

