
How to allocate memory - ksherlock
http://geocar.sdf1.org/alloc.html
======
obi1kenobi
If you'd like to learn more, here's an MIT research paper on fast memory
allocation that has some really clever ideas:
[http://supertech.csail.mit.edu/papers/Kuszmaul15.pdf](http://supertech.csail.mit.edu/papers/Kuszmaul15.pdf)

Abstract:

SuperMalloc is an implementation of malloc(3) originally designed for X86
Hardware Transactional Memory (HTM). It turns out that the same design
decisions also make it fast even without HTM. For the malloc-test benchmark,
which is one of the most difficult workloads for an allocator, with one thread
SuperMalloc is about 2.1 times faster than the best of DLmalloc, JEmalloc,
Hoard, and TBBmalloc; with 8 threads and HTM, SuperMalloc is 2.75 times
faster; and on 32 threads without HTM SuperMalloc is 3.4 times faster.
SuperMalloc generally compares favorably with the other allocators on speed,
scalability, speed variance, memory footprint, and code size.

SuperMalloc achieves these performance advantages using less than half as much
code as the alternatives. SuperMalloc exploits the fact that although physical
memory is always precious, virtual address space on a 64-bit machine is
relatively cheap. It allocates 2 MiB chunks which contain objects all the same
size. To translate chunk numbers to chunk metadata, SuperMalloc uses a simple
array (most of which is uncommitted to physical memory). SuperMalloc takes
care to avoid associativity conflicts in the cache: most of the size classes
are a prime number of cache lines, and nonaligned huge accesses are randomly
aligned within a page. Objects are allocated from the fullest non-full page in
the appropriate size class. For each size class, SuperMalloc employs a
10-object per-thread cache, a per-CPU cache that holds about a level-2-cache
worth of objects per size class, and a global cache that is organized to allow
the movement of many objects between a per-CPU cache and the global cache
using O(1) instructions. SuperMalloc prefetches everything it can before
starting a critical section, which makes the critical sections run fast, and
for HTM improves the odds that the transaction will commit.

~~~
eatbitseveryday
While it has some new material, it's also not very well-written, and some of
the benchmarks could be argued to be unrealistic. For example, Brad should
have also included results using applications such as Redis (which by default
uses jemalloc), or MongoDB, or many others. In many other scenarios not
detailed in the paper, the allocator can use much more memory compared to
tcmalloc.

The "Vyukov" benchmark is an invented (and, I would argue, contrived) scenario
that Vyukov himself devised to cause memory bloat in Intel's TBB just by
examining the code. Whether it actually occurs in a real application is
debatable.

As you may have noticed, nowhere in the paper is tcmalloc mentioned :) It is
still a viable alternative today.

There are many other papers I would recommend in addition to this one, e.g.,
"Dynamic Storage Allocation: A Survey and Critical Review" by Wilson et al.,
despite it being quite old.

This might get me downvotes, but it is wise to remember that just because work
has the MIT stamp on it doesn't mean it is top quality.

~~~
tptacek
If you haven't spent much time reading about allocation strategies and are
looking for a good place to start, _Dynamic Storage Allocation: A Survey and
Critical Review_ is a fantastic one. One of my all-time favorite survey
papers.

[http://www.cs.northwestern.edu/~pdinda/ics-s05/doc/dsa.pdf](http://www.cs.northwestern.edu/~pdinda/ics-s05/doc/dsa.pdf)

~~~
pcwalton
That's a good paper, but it's worth noting that most of the interesting work
these days is in _multithreaded_ memory allocators, which weren't as important
in 1995, when that survey was written. Scaling well under multithreading
(i.e. not taking a global malloc
lock) changes the design space considerably: you need to have per-thread heaps
and rebalance them from time to time, which is itself a very interesting
problem.

~~~
sitkack
For anyone interested in this area, take a look at the memory allocator (and
GC) for HotSpot. Specifically, "Hierarchical PLABs, CLABs, TLABs in Hotspot"
[0]

[0] [http://cs.uni-salzburg.at/~hpayer/](http://cs.uni-salzburg.at/~hpayer/)

------
mallaco
> One major limitation of malloc (and even the best implementations like
> jemalloc and dlmalloc) is that they try to use a single allocator for each
> data structure. This is a mistake: A huge performance gain can be had by
> using a separate allocator for each of your data structures — or rather, for
> each of your data usage patterns.

I stopped reading the article at this point. This statement has been disproven
repeatedly over the years by Hoard and jemalloc. It is counter-intuitive, but
the data backs it up.

Custom per-data-structure allocators can fragment the global memory arena and
cause more CPU cache misses as a result of the extra code involved. The
latest and greatest malloc/free implementations use a myriad of optimizations
to achieve speed improvements that a custom allocator implementation would
rarely use.

It's not an accident that jemalloc is so widely used in major applications -
it works extremely well.

[https://github.com/jemalloc/jemalloc/wiki/History](https://github.com/jemalloc/jemalloc/wiki/History)

[https://github.com/jemalloc/jemalloc/wiki/Adoption](https://github.com/jemalloc/jemalloc/wiki/Adoption)

~~~
ksk
> This statement has been disproven repeatedly over the years with Hoard and
> jemalloc. It is counter-intuitive but the data backs it up.

What data is that? I would honestly love to see it. TBH, it would not change
my mind, since I have practical experience of custom allocators working better
for me than general-purpose ones. But then again, I work in a severely
memory-constrained environment.

>Custom per-data-structure allocators can fragment the global memory arena and
cause more CPU cache misses as result of the extra code involved.

Well, pretty much all modern allocators have techniques to avoid
fragmentation. But that is beside the point.

Your comment did not make sense to me, though perhaps it does to others. As
the developer, you know the general sizes of your data structures, their usage
patterns, their lifetimes and the frequency/rate at which they are allocated
(all of which a general allocator has to guess, or handle with some sort of
rudimentary strategy that adds needless metadata). That is where you can
design your custom allocator to actually reduce fragmentation. It may well be
that the gain from a custom allocator isn't worth it, but I don't quite
understand how it can get _worse_ as you claim.
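To make the argument concrete, here is a minimal sketch of the kind of custom allocator this describes: a fixed-size pool where the developer already knows the object size and count (the 64-byte/1024-object numbers are illustrative). Allocation is one pointer pop, with no size lookup and no per-object header:

```c
#include <assert.h>
#include <stddef.h>

/* Fixed-size pool: every block is the same size, so the free list needs
 * no metadata beyond the next pointer stored in the free slot itself. */
enum { OBJ_SIZE = 64, POOL_OBJS = 1024 };   /* illustrative numbers */

static union slot { union slot *next; char bytes[OBJ_SIZE]; } pool[POOL_OBJS];
static union slot *free_list;

/* Thread every slot onto the free list. */
void pool_init(void) {
    for (size_t i = 0; i + 1 < POOL_OBJS; i++)
        pool[i].next = &pool[i + 1];
    pool[POOL_OBJS - 1].next = NULL;
    free_list = pool;
}

/* Pop one slot, or NULL when the pool is exhausted. */
void *pool_alloc(void) {
    union slot *s = free_list;
    if (s) free_list = s->next;
    return s;
}

/* Push a slot back; LIFO order keeps recently used memory cache-hot. */
void pool_free(void *p) {
    union slot *s = p;
    s->next = free_list;
    free_list = s;
}
```

The LIFO free list also gives the cache-friendliness ksk alludes to: the next allocation reuses the block that was freed most recently.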

>The latest/greatest malloc/free implementations use a myriad of optimizations
to achieve speed improvements that a custom allocator implementation would
rarely use.

So what exactly stops you from building those optimizations into a custom
allocator? In certain domains, like game programming, people have been doing
this for decades with excellent results.

>It's not an accident that jemalloc is so widely used in major applications -
it works extremely well.

Sorry, you have no possible way of knowing _why_ everybody else uses
something. That by itself is not an argument for anything.

~~~
mallaco
Custom allocators have the disadvantage of not knowing how the rest of the
heap is managed. That puts them at a disadvantage to a modern, well-designed
memory allocator like Hoard or jemalloc.

Feel free to examine the many papers put out by Emery Berger on the subject of
memory allocation as well as the design documents of jemalloc.

~~~
ksk
>Custom allocators have the disadvantage of not knowing how the rest of the
heap is managed.

Sure, but they can work out of a fixed region statically allocated on startup.
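A minimal sketch of that approach: a bump allocator over a statically allocated region. It never calls into the system allocator, so it cannot interact badly with (or fragment) the global heap; the arena size and alignment here are illustrative:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Statically allocated region, fixed at startup; size is illustrative.
 * There is no per-object free() -- the whole arena is reset at once. */
static _Alignas(16) char arena[64 * 1024];
static size_t arena_used;

/* Bump-pointer allocation: round up to keep 16-byte alignment. */
void *arena_alloc(size_t n) {
    size_t aligned = (n + 15) & ~(size_t)15;
    if (aligned < n || arena_used + aligned > sizeof arena)
        return NULL;                 /* overflow or arena exhausted */
    void *p = arena + arena_used;
    arena_used += aligned;
    return p;
}

/* Free everything at once, e.g. at the end of a frame or request. */
void arena_reset(void) { arena_used = 0; }
```

This is the classic pattern for memory-constrained environments: the total footprint is known at link time, and allocation is two arithmetic operations.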

>Feel free to examine the many papers put out by Emery Berger on the subject
of memory allocation as well as the design documents of jemalloc.

[https://people.cs.umass.edu/~emery/pubs/berger-oopsla2002.pdf](https://people.cs.umass.edu/~emery/pubs/berger-oopsla2002.pdf)

Do you have links to more? I only found one paper, from 14 years ago. It is
interesting, but it comes with a huge YMMV, since they only examined specific
projects which might not have a pattern that _requires_ a custom strategy
anyway. A distribution of the object sizes, along with their lifetime
distribution, would have been nice to include.

------
colanderman
Not to be a downer, but this is poor and outdated C programming style.

Don't use sbrk(2) unless you're completely reimplementing the standard
library.

Don't use alloca(3) unless you want weird headaches. Just use variable-length
arrays, which have been part of C for nearly two decades (since C99). They are
block-scoped and easier to reason about.

None of the magic numbers are explained. `(p->si_addr + (16LL<<22)) & ~4095`,
where do those come from? I'm guessing 4095 is the page size less one, but
DON'T hardcode that!
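For instance, instead of hardcoding `~4095`, the page size can be queried at runtime with sysconf(3). A small sketch (the helper name is my own):

```c
#include <assert.h>
#include <stdint.h>
#include <unistd.h>

/* Round an address down to a page boundary without hardcoding 4096.
 * The mask trick works because page sizes are always powers of two. */
uintptr_t page_align_down(uintptr_t addr) {
    uintptr_t page = (uintptr_t)sysconf(_SC_PAGESIZE);
    return addr & ~(page - 1);
}
```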

Don't assign to lvalues in conditionals! (e.g. `while(j<31 &&
!(h=free_table[j]))`) It's hard to read and bug-prone.

Don't use `&h[1]` to refer to the space after a structure. I'm pretty sure
that's undefined behavior, and it often gives the wrong alignment for whatever
you want to put after h. Rather, add a final element to the structure of
flexible array type (say `int data[]` if you're storing ints after the
structure). That is guaranteed to have the semantics you're looking for.

Here's my advice:

1) Just use malloc(3). It's already fast and tuned for many allocation
scenarios (including all four in this article!), and your application will
continue to reap whatever performance improvements its maintainers make. Use
aligned_alloc(3) if you need page-aligned memory.

2) When working in the embedded world, it's often preferable to preallocate
pools as large as your app will ever need for each kind of object you have
collections of, and pin them to RAM (not necessarily for performance, but so
you know you have the space). If you do so, you can reference the objects by
index rather than pointer. It's much faster and more space-efficient,
especially on 64-bit architectures.
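The space savings from indices are easy to see: a 16-bit index addresses up to 65535 pool slots and costs 2 bytes, where a pointer costs 8 on a 64-bit machine. A minimal sketch of an index-linked list over a static pool (sizes and field names are illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Objects live in a preallocated pool and refer to each other by index
 * rather than pointer. NIL plays the role of NULL. */
enum { NPOOL = 1024, NIL = 0xFFFF };

struct node { uint32_t value; uint16_t next; };  /* 8 bytes per node */
static struct node nodes[NPOOL];

/* Push node `i` onto the list headed by *head, linking by index. */
void list_push(uint16_t *head, uint16_t i) {
    nodes[i].next = *head;
    *head = i;
}
```

With pointer links the same node would be 16 bytes, doubling the cache footprint of a traversal.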

3) Don't be afraid to realloc(3) to grow arrays. It incurs no asymptotic
penalty.
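The standard way to get that amortized O(1) append is to double the capacity on each realloc, so each element is copied O(1) times on average. A minimal sketch (the `vec` type is my own, not from the article):

```c
#include <assert.h>
#include <stdlib.h>

/* Growable int array: doubling the capacity makes n appends cost O(n)
 * total, i.e. no asymptotic penalty versus preallocating. */
struct vec { int *data; size_t len, cap; };

int vec_push(struct vec *v, int x) {
    if (v->len == v->cap) {
        size_t ncap = v->cap ? v->cap * 2 : 8;
        int *nd = realloc(v->data, ncap * sizeof *nd);
        if (!nd) return -1;          /* original buffer is still valid */
        v->data = nd;
        v->cap = ncap;
    }
    v->data[v->len++] = x;
    return 0;
}
```

Note the assignment through a temporary: assigning `realloc`'s result directly to `v->data` would leak the buffer on failure.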

4) Pay attention to data layout, to make sure that commonly-accessed stuff is
not interspersed with rarely-accessed stuff. E.g. separate indexes from data
where possible, if you tend to traverse the index rather than the data.

(Caveat: all my comments above apply to glibc. YMWV on other systems.)

~~~
cperciva
_Don't use `&h[1]` to refer to the space after a structure. I'm pretty sure
that's undefined behavior_

This is legal. '&h[1]' is a synonym for 'h + 1', thanks to C99 section 6.5.3.2
paragraph 3:

    
    
        ... if the operand is the result of a [] operator, neither the & operator
        nor the unary * that is implied by the [] is evaluated and the result is
        as if the & operator were removed and the [] operator were changed to a
        + operator.)
    

and that is valid thanks to C99 section 6.5.6, paragraphs 7 and 8:

    
    
        7. For the purposes of these operators, a pointer to an object that is
        not an element of an array behaves the same as a pointer to the first
        element of an array of length one with the type of the object as its
        element type.
        8. When an expression that has integer type is added to or subtracted
        from a pointer, the result has the type of the pointer operand. [...]
        Moreover, if the expression P points to the last element of an array
        object, the expression (P)+1 points one past the last element of the
        array object [...] If both the pointer operand and the result point
        to elements of the same array object, or one past the last element of
        the array object, the evaluation shall not produce an overflow; otherwise,
        the behavior is undefined. If the result points one past the last element
        of the array object, it shall not be used as the operand of a unary *
        operator that is evaluated.
    

So if 'h' points to an object, it is entirely legal to compute '&h[1]';
whether you can dereference it is a separate question (and will depend on
memory layout et cetera).

~~~
colanderman
Ah, I forgot that "one past the last element" is excepted. I stand corrected.
(Similar tricks, such as &h[-1], are indeed undefined behavior.)

Regardless, my point stands that it can give you the wrong alignment, if the
elements you're putting _after_ the header have greater alignment requirements
than the header itself. E.g.:

    
    
        #include <stdint.h>
        struct foo { uint16_t a,b,c; };
        uint64_t *bar(struct foo *x) { return (uint64_t *) &x[1]; }
    

Gives you a misaligned pointer to an 8-byte object.

    
    
        #include <stdint.h>
        struct foo { uint16_t a,b,c; uint64_t d[]; };
        uint64_t *bar(struct foo *x) { return x->d; }
    

That gives you a properly aligned pointer.

------
FroshKiller
I don't mean to be spammy about this, but I feel like I should repeat it. If
you like this kind of content, consider joining the SDF. It's a free public
access computing community with lots of artists, hackers, and grognards of all
stripes. Your support is appreciated: [https://sdf.org/](https://sdf.org/)

------
_RPM
sizeof yields size_t, so I'm not sure why he's using `long long`. And to be
precise, sizeof isn't a function, so it doesn't "return" anything; it "yields"
a size_t value.

