In my hobby project, I started always passing an allocator argument to every function or object which requires allocation (inspired by Zig) and I love it so far. Often I can just pass a bump-pointer allocator or a stack-based allocator and not care about deallocation of individual objects. I also wrote a simple unit testing framework to test out-of-memory conditions, because that's easy to do when you're in control of allocators. Basically, I inject an allocator which counts how many allocations a unit test performs, and then the test is rerun repeatedly, injecting an OOM failure at each known allocation point in turn. A lot of bugs and crashes surface when an OOM is encountered, because such paths are rarely tested. The idea of my pet project is a very resilient HTTP server with request-scoped allocations and recovery from OOM without crashes.
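A rough sketch of that counting/failing test allocator in C, for illustration only; all names here are invented, and the real framework passes the allocator explicitly, Zig-style:

    /* Pass 1: run the test with fail_at = -1 to count allocations.
       Pass 2..N: rerun with fail_at = 0, 1, 2, ... and assert that the
       code under test reports OOM without crashing or leaking. */
    #include <stddef.h>
    #include <stdlib.h>

    typedef struct {
        long count;     /* allocations seen so far             */
        long fail_at;   /* fail the Nth allocation; -1 = never */
    } test_allocator;

    static void *test_alloc(test_allocator *a, size_t size)
    {
        long n = a->count++;
        if (a->fail_at >= 0 && n == a->fail_at) {
            return NULL;            /* injected OOM at this allocation point */
        }
        return malloc(size);
    }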
Adding to that: when many Linux distributions face OOM, a killer daemon steps in and might kill your service even if you were handling the situation properly.
Interestingly (confusingly), Linux's OOM killer is invoked for a different notion of OOM than a null return from malloc / bad_alloc exception. On a 64-bit machine, the latter will pretty much only ever happen if you set a vsize ulimit or you pass an absurd size into malloc. The OOM killer is the only response when you actually run out of memory.
If you want to avoid your program triggering the OOM killer all on its own, you need to set a vsize limit such that you'll get an application-level error before actually exhausting memory. Even that isn't completely foolproof (obviously anyone with a shell can allocate a large amount of RAM), but in practice -- if your program is the only significant thing on the system -- you can get it to be very reliable this way (a sketch of setting the limit follows below).
Add in some cgroup settings and you should be able to keep your program from being OOM killed at all, though that step is a bit more complex.
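For the vsize approach above, a minimal sketch of setting the limit from inside the process with setrlimit (the 1 GiB figure is arbitrary):

    /* Cap the address space so allocations fail with NULL/ENOMEM at the
       application level instead of triggering the OOM killer later. */
    #include <sys/resource.h>
    #include <stdio.h>

    int main(void)
    {
        struct rlimit rl = { .rlim_cur = 1ull << 30, .rlim_max = 1ull << 30 };
        if (setrlimit(RLIMIT_AS, &rl) != 0) {
            perror("setrlimit");
            return 1;
        }
        /* ... run the actual program; malloc beyond ~1 GiB of address
           space now returns NULL instead of overcommitting ... */
        return 0;
    }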
I wonder if it is possible to avoid OOM by making sure that all allocations are done from a named (on disk, not shm) memory-mapped file. This way it is, in principle, always possible to swap to disk and never overcommit.
I guess in practice the kernel might be in such dire straits that it is not able to even swap to disk and might need to kill indiscriminately.
That would have to be all allocations in all processes (and the kernel and drivers)
In extreme circumstances, the OOM killer can decide to kill your process even if it barely uses any memory (a simple way to get there is by fork-bombing copies of such processes)
You would also need to prevent overcommit of disk; you'd typically mmap to a sparse file, and then you've got the same problem of overcommit on disk as you did in memory.
If you're going to do drastic things, you can configure Linux's memory overcommit behavior, although strictly avoiding overcommit usually results in trouble from software not written with that in mind.
The idea is that the server must have a known allocation budget, similar to Java's max heap size. There's a tree of allocators, i.e. a temporarily created arena allocator needs initial memory for its arena, so it can grab it from the root allocator. And the root allocator ultimately must be fixed-size and deterministic. Sure if there are other processes in the system allocating without concern for other apps, then the OOM killer can kill the server. But if there's no such process, I think it should be pretty stable.
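A minimal sketch of that allocator tree in C (names invented; the root is one fixed, up-front reservation and children are slices carved out of it):

    #include <stddef.h>

    typedef struct { char *beg, *end; } arena;

    /* Root arena: backed by one fixed buffer, sized to the budget. */
    static arena arena_init(void *buf, ptrdiff_t cap)
    {
        return (arena){ .beg = buf, .end = (char *)buf + cap };
    }

    /* Child arena (e.g. per request): a slice carved from its parent.
       When the request ends the slice is simply abandoned (or handed
       back in LIFO order). */
    static arena arena_child(arena *parent, ptrdiff_t cap)
    {
        arena child = {0};
        if (parent->end - parent->beg >= cap) {
            child.beg    = parent->beg;
            child.end    = parent->beg + cap;
            parent->beg += cap;
        }
        return child;   /* child.beg == NULL means the budget is exhausted */
    }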
Oh wow, that is a really interesting test solution. That would be an interesting thing to add to all Zig tests (I know they already have the testing allocator and good Valgrind support, but I don't think that tests/simulates OOM).
I love things like this that take existing tests and expand them to test further things in already-covered flows. We have done similar things at my work, where we test expansion of data models against old models to check that we cover upgrade scenarios.
This is a clear sign of a badly designed language. You should never see a fixed-size (less than page size) allocation fail, simply because there's nothing you can reasonably do if it does fail. Either you should crash or it should block until it is possible again.
(Where crash means a worker process or something limited to something less than the entire system. See Erlang for the logical extension of this.)
I realize this implies Windows and Java are badly designed and my answer to that is "yes".
So if my browser tries to allocate memory for a tab and the allocation fails it should just crash or block instead of handling the failure gracefully by not creating a new tab and telling me the system is running low on memory?
As long as the syscalls don't themselves fail, it's no problem until you run out of address space. At which point you should crash because you probably need to allocate to handle any reasonable errors.
I've been using this helper: https://github.com/judofyr/zini/blob/ea91f645b7dc061adcedc91.... It starts by making the first allocation fail, then the second, then the third, and so on. As long as it returns OutOfMemory (without leaking memory) then everything is fine.
A very old trick for running Lua in your PlayStation 2 game (where the PS2 is a machine with 32MB of RAM and no memory paging) is to hook Lua’s realloc function into the venerable Doug Lea’s Malloc (https://gee.cs.oswego.edu/dl/html/malloc.html) set up to run in arena mode (ONLY_MSPACES? It’s been a decade or two…). That way Lua can fragment the arena all it wants without making a mess of the rest of the tiny address space.
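Roughly how that hookup looks, assuming dlmalloc is built with ONLY_MSPACES (the mspace_* names are dlmalloc's API; everything else is invented for illustration):

    #include <stddef.h>
    #include "malloc.h"   /* Doug Lea's allocator, mspace API */
    #include "lua.h"

    /* Lua's allocator callback, routed into one mspace. */
    static void *lua_mspace_alloc(void *ud, void *ptr, size_t osize, size_t nsize)
    {
        mspace ms = ud;
        (void)osize;
        if (nsize == 0) {                        /* Lua asks us to free */
            mspace_free(ms, ptr);
            return NULL;
        }
        if (ptr == NULL) {
            return mspace_malloc(ms, nsize);     /* fresh allocation */
        }
        return mspace_realloc(ms, ptr, nsize);   /* grow or shrink */
    }

    lua_State *new_lua_vm(void *arena_base, size_t arena_size)
    {
        /* All Lua fragmentation stays inside this fixed region. */
        mspace ms = create_mspace_with_base(arena_base, arena_size, 0);
        return lua_newstate(lua_mspace_alloc, ms);
    }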
> (where the PS2 is a machine with 32MB of RAM and no memory paging)
The PS2 hardware does have full support for memory paging (at least on the main cpu core). PS2 Linux makes full use of it.
But the default TLB configuration from the BIOS is just a single static 31MB page (the other 1MB is reserved for the BIOS) and the SDK doesn't provide any tooling for dynamic pages.
And this is MIPS, so it's a software managed TLB with 48 entries. I wouldn't be surprised if some games did have dynamic paging, but they would need to provide their own TLB exception handler.
I recently needed to write a memory allocator and, being lazy, asked ChatGPT for help. Interestingly, it came up with an implementation eerily similar to what is described in that document.
Nonetheless everything worked like a charm from the start.
Writing a memory allocator is a standard homework assignment in any CS program. The solutions are mostly similar, and ChatGPT has probably learned hundreds of examples.
There is also the double-sided arena allocator, which uses one contiguous buffer but grows in both directions (front-to-back and back-to-front). When allocating memory from it you need to indicate whether you want memory from the front or the back. The allocator is out of memory when front and back meet.
The double-sided approach is useful for various purposes. For instance you can allocate short-lived data from the front and long-lived data from the back. It also makes better use of free space: with two separate arena allocators one could be out-of-memory but the other might have free space. With the double-sided approach all memory is fair game.
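A minimal sketch of such a double-ended arena, assuming power-of-two alignments (names invented):

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        char *beg;   /* next free byte at the front             */
        char *end;   /* one past the last free byte at the back */
    } arena2;

    static void *alloc_front(arena2 *a, ptrdiff_t size, ptrdiff_t align)
    {
        ptrdiff_t pad = -(uintptr_t)a->beg & (align - 1);
        if (size > (a->end - a->beg) - pad) {
            return NULL;                       /* front met back: OOM */
        }
        void *p = a->beg + pad;
        a->beg += pad + size;
        return p;
    }

    static void *alloc_back(arena2 *a, ptrdiff_t size, ptrdiff_t align)
    {
        if (size > a->end - a->beg) {
            return NULL;
        }
        char *p = a->end - size;
        p -= (uintptr_t)p & (align - 1);       /* round down to alignment */
        if (p < a->beg) {
            return NULL;                       /* back met front: OOM */
        }
        a->end = p;
        return p;
    }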
Fantastic article! I have a project with a similar arena allocator, so I'll definitely be taking some of these tricks. One thing my allocator does do is organize arenas into a linked list, so that the arena can grow dynamically.
However I really like the article's point that you're always going to be living within _some_ memory budget, so you might as well allocate everything up front into a giant arena, and then divide the giant arena up into smaller arenas.
Also I've heard that you can save an instruction when checking if your allocator is full by subtracting from the top, and checking the zero flag. It seems to complicate alignment logic. Does that ever end up mattering?
> However I really like the article's point that you're always going to be living within _some_ memory budget, so you might as well allocate everything up front into a giant arena, and then divide the giant arena up into smaller arenas.
That depends. If you’re running on e.g. a video game console where you’re the sole user of a block of pretty much all memory, go ahead. On a system with other things running, you generally don’t want to assume you can just take some amount of memory, even if it’s “just the free memory”, or even “I probably won’t use it so it will be overcommitted”. Changing system conditions and other system pressure are outside of your control and your reservation may prevent the system from doing its job effectively and prioritizing your application appropriately.
Yeah, profiling is your friend. I forget if it's called a sharded slab or a buddy allocator, but it's the one where you have different preallocated buffers chunked at different sizes. Any time you allocate, you are given the smallest chunk that will hold what you asked for. Profiling gives you optimal size boundaries as well as the number of each. Add a safety margin and off you go. Super fast allocation and guaranteed no fragmentation. In a C++ codebase, overloading operator new to do this is probably the easiest way to get your allocation performance back and avoid fragmentation.
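A sketch of that kind of size-class pool (class sizes and counts are placeholders you would derive from profiling; chunks are assumed to be at least pointer-sized and pointer-aligned, classes sorted by size, and freeing back onto a class's list is the reverse operation, omitted here):

    #include <stddef.h>
    #include <stdlib.h>

    typedef struct chunk { struct chunk *next; } chunk;

    typedef struct {
        size_t size;   /* chunk size for this class */
        chunk *free;   /* intrusive free list       */
    } pool_class;

    /* Preallocate one slab per class and thread its chunks onto a free list. */
    static int pool_init(pool_class *pc, size_t size, size_t count)
    {
        char *slab = malloc(size * count);
        if (!slab) return 0;
        pc->size = size;
        pc->free = NULL;
        for (size_t i = 0; i < count; i++) {
            chunk *c = (chunk *)(slab + i * size);
            c->next  = pc->free;
            pc->free = c;
        }
        return 1;
    }

    /* Hand out the smallest free chunk that fits the request. */
    static void *pool_alloc(pool_class *classes, size_t nclasses, size_t n)
    {
        for (size_t i = 0; i < nclasses; i++) {
            if (classes[i].size >= n && classes[i].free) {
                chunk *c = classes[i].free;
                classes[i].free = c->next;
                return c;
            }
        }
        return NULL;   /* out of chunks: profiling said this shouldn't happen */
    }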
> If you’re running on e.g. a video game console where you’re the sole user of a block of pretty much all memory
Games consoles haven't been that for a long time. PS5 and XSS are full blown multi-user multi-application systems. PS4 and Xbox One were multi user systems with reserved blocks for the OS, but still very close to a modern OS.
> so you might as well allocate everything up front into a giant arena, and then divide the giant arena up into smaller arenas
However, if you do this, note how the article hints at this strategy needing a bit more code on Windows: Windows doesn't do overcommit by default. If you do one big malloc, Windows will grow the page file to ensure it can back that much memory if you start writing to it. That's fine if you allocate a couple of megabytes, but if your arena is gigabytes in size you want to call VirtualAlloc with MEM_RESERVE to get a big contiguous memory area, then call VirtualAlloc with MEM_COMMIT as needed on the chunks you actually want to use.
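A sketch of that reserve-then-commit pattern (the 1 GiB reservation size is arbitrary):

    #include <windows.h>
    #include <stddef.h>

    enum { RESERVE_SIZE = 1 << 30 };   /* address space only, no commit charge */

    /* Reserve a large contiguous range; nothing is charged to the pagefile. */
    static void *arena_reserve(void)
    {
        return VirtualAlloc(NULL, RESERVE_SIZE, MEM_RESERVE, PAGE_NOACCESS);
    }

    /* Commit the next chunk of the reservation as the arena actually grows. */
    static int arena_commit(void *base, size_t used, size_t extra)
    {
        void *p = VirtualAlloc((char *)base + used, extra,
                               MEM_COMMIT, PAGE_READWRITE);
        return p != NULL;
    }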
> While you could make a big, global char[] array to back your arena, it’s technically not permitted (strict aliasing).
Aren't char pointers/arrays allowed to alias everything?
I used that technique in my programming language and its allocator. It's freestanding so I couldn't use malloc. I had to get memory from somewhere so I just statically allocated a big array of bytes. It worked perfectly but I do disable strict aliasing as a matter of course since in systems programming there's aliasing everywhere.
char can alias everything, i.e. you can dereference a char pointer with impunity, even if the actual dynamic type[1] of an object is a different type. The reverse is not true: if the dynamic type of an object is char, then just by the aliasing rules you can't dereference it as an object of a different type.
In C++ you can just use placement new to change the dynamic type of (part of) a char array (but beware of pointer provenance). In C it is more complex: I don't claim to understand this fully, but my understanding is that you can't change the type of a named object, though you should be able to change the type of anonymous memory (for example, what is allocated with malloc) by simply writing into it.
In practice at least GCC considers the full alias rules unimplementable and my understanding is that it uses a conservative model where every store can change the type of an object, and uses purely structural equivalence.
[1] of course most C implementations don't actually track dynamic types at runtime.
> The reverse is not true: if the dynamic type of an object is char, then just by the aliasing rules you can't dereference it as an object of a different type.
A limitation like that simply makes no sense to me. Everything is a valid char array but I can't place structs on top of one? Oh well, nothing I can do about it. I'll just keep strict aliasing disabled. If we're writing C, it's because we want to do stuff like that without the compiler getting clever about it.
> you should be able to change the type of anonymous memory (for example, what is allocated with malloc) by simply writing into it
Well, in my case, I'm the one writing the malloc and the buffer is the anonymous memory. I remember months ago I scoured the GCC documentation for some kind of builtin that would allow me to mark the memory as such but there was nothing. I did add some malloc attributes to my allocation function just like TFA suggested but apparently its main purpose is to optimize based on aliasing nonsense which I disabled anyway.
At some point you need to get the memory for your pool from somewhere, for example from malloc [1], hence you can safely set the type of the raw memory by writing into it. You can also change that type to some other type by writing other stuff (so you can reuse the memory). You can also write metadata to it between uses. What you cannot do is have overlapping lifetimes for different types.
Making sure that you respect all the underspecified, obscure, and often contradictory rules is not easy, so if you prefer to disable strict aliasing, you have my sympathy. For the most part it is useful for high-performance numerical code, and less advantageous for typical pointer-chasing stuff.
From the practical point of view, the safest way to implement a custom allocator is to make sure that the compiler can't see through it, so separate compilation and no LTO, and/or launder your pointers through appropriate inline asm (see the sketch below).
[1] but other 'anonymous' sources, like mmap, would also work in practice.
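The inline-asm laundering mentioned above is usually just an empty asm with the pointer as an input/output operand, e.g. (GCC/Clang extended asm; function name invented):

    /* The optimizer must assume p may have been changed arbitrarily,
       which severs whatever type/provenance assumptions it was carrying. */
    static inline void *launder_ptr(void *p)
    {
        __asm__ volatile ("" : "+r"(p));
        return p;
    }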
> Everything is a valid char array but I can't place structs on top of one? Oh well, nothing I can do about it. I'll just keep strict aliasing disabled.
Well, yeah. Strict aliasing is less about the incidental values of memory addresses and more about the actual semantics of what you're doing. Writing a struct into the middle of a char array makes no sense because you have no guarantee in the type system that the array is properly sized or aligned to contain that struct.
Strict aliasing doesn't allow the compiler to magically decide that someplace you are writing happens to be in a statically allocated buffer. Strict aliasing says you have a pointer of some type and what you do with it has to agree with the type of that pointer.
I feel like there should be a builtin that takes a void* and returns a void* which basically marks the input as aliased by the returned pointer regardless of the TBAA rules.
Turns out there is! __builtin_launder seems mostly equivalent to what the author proposes but probably more reliable. Facebook’s folly library appears to use both, preferring the builtin.
If you overlay a struct on your (char) buffer and dereference a pointer to it, you would be accessing something with a different type than its declared type (char* as struct something*), which is a strict aliasing violation.
To stay within the rules you could set up a void* to a suitable region in a linker script.
Processors doing out-of-order execution doesn't change the semantics of the code. That's very different from the example where gcc just throws away the assignment.
The idea that he just needs to accommodate the compiler people is silly. Compilers exist to serve programmers, not the other way around. It's entirely reasonable to disagree with the compiler developers and use a flag to disable behaviour you don't want.
Clearly he doesn't. He just disabled harmful features and called it a day. Stuff like strict aliasing and undefined signed integer overflow apparently do nothing but serve as an excuse for the optimizer to screw up perfectly reasonable code.
That's not how they work, but I doubt getting into an argument about how undefined behavior works with someone who is dead-set on it being harmful is going to be productive.
Some other good cases for arenas are rendering a frame in a video game and handling an HTTP request. The memory is contained within that context and is short-lived.
That's the lifetime of the objects in the arena, but wouldn't you recycle the arena itself across frames?
For requests it might make sense to have low and high water marks so that additional arenas are created during request peaks and destroyed after if you want to limit long term memory usage of your application.
Typically you would have different arenas for different lifetimes ('per frame', 'per level', 'per game session' - or maybe even more specialized, like the duration of an IO operation), and 'reset' the arenas at the end of their respective lifetimes (which is basically just setting the current 'top offset' to zero). This sort-of expects that object destruction is a no-op (e.g. no destructors need to be called).
The general idea being that you don't need to track granular 'per-object lifetimes', but only a handful of 'arena lifetimes', and all objects share the lifetime of the arena they've been allocated in.
Of course it's also possible to manually call a destructor on an object in the arena without recycling its memory, but I heavily prefer using plain POD structs without owning pointers, which can be safely 'abandoned' at any time without leaking memory.
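A minimal sketch of that per-lifetime reset, assuming a plain bump arena (alignment handling omitted; names invented):

    #include <stddef.h>

    typedef struct {
        char     *base;
        ptrdiff_t cap, used;
    } lifetime_arena;

    static void *la_alloc(lifetime_arena *a, ptrdiff_t size)
    {
        if (size > a->cap - a->used) return NULL;
        void *p = a->base + a->used;
        a->used += size;
        return p;
    }

    /* "Destroying" every object with this lifetime is a single store,
       e.g. la_reset(&per_frame) at the end of each frame, or
       la_reset(&per_level) when a level is unloaded. */
    static void la_reset(lifetime_arena *a) { a->used = 0; }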
I had never heard the term "arena allocation" before; I always thought it was called "bump pointer allocation", since all you do is add to (bump) a pointer. One useful trick when designing a bump allocator is to allocate an extra word (8 bytes on 64-bit) to store an object header in. For example, if you store objects' sizes in the headers, you can iterate all allocated objects, and you can often also reallocate objects in the middle of the arena without having to free every object.
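A sketch of that header trick, assuming everything is padded to 8 bytes so each header lands at an aligned address (names invented):

    #include <stddef.h>
    #include <stdint.h>

    #define HDR 8   /* one 64-bit word of header per object */

    /* Bump-allocate `size` bytes, preceded by a header holding the size. */
    static void *bump_alloc(char **p, char *end, size_t size)
    {
        size_t total = HDR + ((size + 7) & ~(size_t)7);
        if ((size_t)(end - *p) < total) return NULL;
        *(uint64_t *)*p = size;          /* record the object's size */
        void *obj = *p + HDR;
        *p += total;
        return obj;
    }

    /* Iterate every object allocated between base and top. */
    static void walk(char *base, char *top,
                     void (*visit)(void *obj, size_t size))
    {
        while (base < top) {
            size_t size = *(uint64_t *)base;
            visit(base + HDR, size);
            base += HDR + ((size + 7) & ~(size_t)7);
        }
    }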
I really like arena / bump allocators, they're really useful and powerful tools. I've been playing around lately with a Julia package that makes it relatively easy and safe to manage an arena https://github.com/MasonProtter/Bumper.jl
The thread-safety and dynamic extent is something I'm particularly pleased about.
> Unsigned sizes are another historically common source of defects, and offer no practical advantages in return. Case in point exercise for the reader: Change each ptrdiff_t to size_t in alloc, find the defect that results, then fix it.
I know that it’s a different “kind” of defect, but none of the code has overflow checks even with ptrdiff_t…
also called a "bump" allocator... because all it does is bump a pointer.
nice to use when you have a well-ordered flow of execution where you are guaranteed to always come back to a known position where you can free the entire heap/arena at once (i.e. a typical main message-handling loop).
It first allocates objects into an arena-like structure. In a second step, it moves (evacuates) long-lived objects into a compact region. The first region then gets deallocated all at once.
Roughly speaking, this leans on the heuristic that most objects are short-lived. So it has arena-like characteristics, but is of course managed/dynamic.
This might be one reason why managed languages like Java/C# get such good out of the box performance. You really need insight in your program and how it executes to beat this.
This is correct. In .NET Gen 0 heap, if there is sufficient space, the allocation is just getting a heap address from threadlocal, bumping an offset, writing object header and methodtable, and returning the pointer (reference).
> A minority of programs inherently require general purpose allocation, at least in part, that linear allocation cannot fulfill. This includes, for example, most programming language runtimes
Almost every program written in some programming language depends on the runtime provided by that programming language. Very few programming languages even let you opt out of using the runtime. Which means that if the runtime needs something, then your program also needs it. Complexity doesn't magically go away when you put it into a library.
It’s not so much about using the code as it is about writing the code (or at least, providing utilities to the code being written). Yes every program uses the runtime, but very few people write the runtime. That is, perhaps the default provided to end users should be arena allocators, keeping a general malloc for special cases.
I tend to avoid working with data types which require a stable virtual address, namely a dynamic base with an offset. Because: mremap.
That said, it really depends on the data usage semantics, and one could write a much less costly allocator for such specific data. Address-stable generic allocators have a tendency to be technical abominations based on statistical usage assumptions; namely, their cost may not be worth the improvement, even if there is a significant one.
Are there security issues with not zeroing out the previously used memory when "releasing" a buffer (moving the offset)? I'm not a systems programmer, so I guess I just assumed that most malloc implementations also zeroed out memory when freeing it, but a quick Google suggests that's not actually the case (with typical malloc implementations "getting their pages from /dev/zero" [0], effectively zeroing memory at allocation time rather than when freeing it).
> Are there security issues with not zeroing out the previously used memory
Yes, there can be. Security-critical software often does this explicitly, and it's been a bug when compilers have removed the zeroing by reasoning that unreachable memory is unreachable...leading to crypto secrets floating in memory unnecessarily.
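One common way to keep such zeroing from being optimized away (besides platform helpers like explicit_bzero or C11's optional memset_s) is a plain memset followed by a compiler barrier that pretends to read the buffer, e.g. (function name invented):

    #include <string.h>

    static void secure_zero(void *p, size_t n)
    {
        memset(p, 0, n);
        /* Empty asm that "uses" p and clobbers memory: the compiler can
           no longer prove the memset is dead, so it must keep it. */
        __asm__ volatile ("" : : "r"(p) : "memory");
    }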
For languages like Java and Go where objects are at least zero-initialized before the constructor(s) run, usually the allocator just zeroes the entire TLAB before allocation.
"Without individual lifetimes, you don’t need to write destructors, nor do your programs need to walk data structures at run time to take them apart."
Destructors are a very bad idea if you are using any form of garbage collection other than reference counting. The destructors won't run until some arbitrary time after the last access to the object, and in the case of arenas, what would be a very fast deallocation becomes proportional to the number of objects. Further, if destructors can revive objects, everything gets very complicated.
If alloc() changes .beg, how do you free (release to the OS, not mark as unused) the whole arena when it's not needed anymore? Wouldn't changing .end be better, so that you can implement a destroyarena(a) which calls free(a.beg)?
Typical use: set up a specific arena just to read in the config file. Once it's done, release the whole arena.
I think this is not accurate. Destructors are not deallocators; they are supposed to put the object's fields into an invalid state. Now, the truth is that both are often called together, e.g. `delete`.
> Typically arena lifetime is the whole program, so you don’t need to worry about freeing it
A technique I use is to increment a counter on `arena.alloc` and decrement it on `arena.dealloc`, and then free the memory (if it's on the heap) accordingly.
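A tiny sketch of that counting scheme (names invented; the backing heap block is released only when the last allocation is returned):

    #include <stdlib.h>

    typedef struct {
        char *base, *top, *end;   /* backing buffer and bump pointer */
        long  live;               /* outstanding allocations         */
    } counted_arena;

    static void *carena_alloc(counted_arena *a, size_t n)
    {
        if (n > (size_t)(a->end - a->top)) return NULL;
        void *p = a->top;
        a->top += n;
        a->live++;
        return p;
    }

    static void carena_dealloc(counted_arena *a)
    {
        if (--a->live == 0) {     /* last one out frees the heap block */
            free(a->base);
            a->base = a->top = a->end = NULL;
        }
    }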
> I think this is not accurate. Destructors are not deallocators; they are supposed to put the object's fields into an invalid state. Now, the truth is that both are often called together, e.g. `delete`.
If the object manages some resource other than memory, and if the object's lifetime is intended to guard the resource, then a destructor is needed.
But if the object manages memory only, as is often the case, and all of that memory came from the arena, then you really don't need to call any destructors.
This is the approach taken in one C++ [library][1] I've worked with, where objects represented scalar values to be used en masse for spreadsheet-like applications. In those applications (especially in 32-bit mode), being able to omit an allocator pointer and neglect a destructor call made things smaller and faster.
> Destructors are not deallocators; they are supposed to put the object's fields into an invalid state.
A typical arena allocator would just reset an offset to zero when the arena is 'freed' without calling any object destructors, and the allocator wouldn't actually have any type information about the objects allocated in the arena (of course you could also write an allocator which registers a destructor function with each allocation, and which would be called before the arena is reset):
bla_t* bla = arena_alloc(arena, alignment, size, destructor_func);
For C++ style RAII it probably makes more sense to use placement-new, and call the destructor manually to 'invalidate' the object without recycling its memory (the memory would only be recycled once the arena allocator is reset).
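A compact sketch of the register-a-destructor variant in C (names invented; sizes are rounded up so the records stay pointer-aligned):

    #include <stddef.h>

    typedef void (*dtor_fn)(void *obj);

    typedef struct dtor_rec {
        struct dtor_rec *prev;
        dtor_fn          fn;
        void            *obj;
    } dtor_rec;

    typedef struct {
        char     *base;
        ptrdiff_t cap, used;
        dtor_rec *dtors;          /* most recent registration first */
    } darena;

    static void *darena_alloc(darena *a, ptrdiff_t size, dtor_fn fn)
    {
        size = (size + 7) & ~(ptrdiff_t)7;
        ptrdiff_t need = size + (fn ? (ptrdiff_t)sizeof(dtor_rec) : 0);
        if (need > a->cap - a->used) return NULL;
        void *obj = a->base + a->used;
        a->used += size;
        if (fn) {                                   /* record the destructor */
            dtor_rec *r = (dtor_rec *)(a->base + a->used);
            a->used += sizeof(dtor_rec);
            *r = (dtor_rec){ .prev = a->dtors, .fn = fn, .obj = obj };
            a->dtors = r;
        }
        return obj;
    }

    /* Run destructors in reverse order, then drop all the memory at once. */
    static void darena_reset(darena *a)
    {
        for (dtor_rec *r = a->dtors; r; r = r->prev) r->fn(r->obj);
        a->dtors = NULL;
        a->used  = 0;
    }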
Coding a MUD as a hobby project using memory arenas, and I am straight up not having a good time with strings. Right now I'm giving the players fixed-size 4K buffers to send commands to, instead of using dynamically sized strings. Everything else is golden: just slap it onto the frame/temp arena and reset the marker back to 0 after an update.
Either use a full-fledged library like tcmalloc or jemalloc, with profiler and debugger capabilities built in, or use tool-generated code that's ensured to be bug-free and usually tied to the lifecycle of a request/response or connection, etc. (or game level, scene, session, etc.). Hand-rolling like this in 2023 is neither necessary nor optimal.
Having to decide ahead of time how much memory to allocate to the arena is... crap...
The vast majority of programmers don't want arbitrary 'out of memory in arena' errors just because the user inputted slightly more things than expected. Yes, I know that modern OS's don't actually allocate memory till you use it, but when you make widespread use of that functionality, typically your reuse of address space is poor and free'd stuff will be neither reused nor returned to the OS.
Likewise, not being able to free things within the arena is also crap - I'm sure there will be plenty of times when the system is running low on RAM, but hundreds of applications have thousands of arenas, all half full of never-to-be-used-again items that the OS can't reuse.
The actual point to understand is that a general-purpose allocator is usually "mostly crap" because it requires incredible internal complexity to meet all requirements. It's often better to use specialized local allocators, and to write code which expects an allocator to be provided instead of calling into global allocator functions. Such specialized allocators can often be a lot simpler, while being at least as fast as general-purpose allocators like jemalloc or mimalloc.
Very often you don't actually need to track or manage the lifetime of individual objects, since related objects are often created and destroyed at the same time. For instance, when parsing a JSON file you might end up with many individual nodes which can all be discarded at the same time once the parsing result has been consumed. With an arena allocator, you just throw away the memory for all those nodes at once, instead of calling a free/deallocate/delete function tens or hundreds of thousands of times.
It's straightforward to make this slightly more dynamic by keeping a linked list of arenas and just adding a new one when you run out of space. You get most of the benefit with only a tiny increase in complexity (a sketch follows at the end of this comment).
That doesn't fix your second objection, but where you want to use this tends to be where you know object lifetimes are similar anyway.
E.g. way back we loaded fonts for an embedded device using t1lib, which on loading a font made hundreds of tiny malloc calls, all of which were freed at the exact same time when the font was freed. Adding an arena allocator both sped it up, reduced memory use (less malloc overhead), and it didn't matter at all that we couldn't free things within the arena because they'd always be freed at the same time anyway.
So the takeaway from that might be that arena allocators aren't always right, but you'd be surprised how often you can predictably group allocations into sets with similar enough lifetimes it doesn't matter much. A key to this is often the trick showed in the linked article: Don't just lump everything into the same arena; use different arenas for different lifetimes. You might well find you're left with so few allocations that don't seem to fit that you can afford to just keep those around in a single long-lived arena as well.
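A sketch of the linked-list ("chained") growth mentioned at the top of this comment (block size is arbitrary; alignment handling omitted; names invented):

    #include <stddef.h>
    #include <stdlib.h>

    typedef struct blk {
        struct blk *next;
        ptrdiff_t   cap, used;
        char        data[];
    } blk;

    typedef struct {
        blk      *head;
        ptrdiff_t blocksize;   /* default size of each new block */
    } chained_arena;

    static void *chain_alloc(chained_arena *a, ptrdiff_t size)
    {
        blk *b = a->head;
        if (!b || size > b->cap - b->used) {
            ptrdiff_t cap = size > a->blocksize ? size : a->blocksize;
            b = malloc(sizeof(blk) + cap);        /* grow: add a new block */
            if (!b) return NULL;
            b->next = a->head;
            b->cap  = cap;
            b->used = 0;
            a->head = b;
        }
        void *p = b->data + b->used;
        b->used += size;
        return p;
    }

    /* Free the whole arena at once, as usual. */
    static void chain_free_all(chained_arena *a)
    {
        for (blk *b = a->head, *next; b; b = next) {
            next = b->next;
            free(b);
        }
        a->head = NULL;
    }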
This...depends on your application. With a lot of small memory/embedded applications, deciding how much memory you need ahead of time is how you do things and normally you don't malloc at all. Because you absolutely need to know the memory bounds of your app. In that sort of environment, this approach can be useful.
You aren't the target audience, I think? The main two reasons to use custom allocators, or the two I run into at least, are when memory allocation profiles as a significant portion of your flamegraph, and/or when there is a real probability of memory fragmentation preventing new allocation even when there is plenty of space left. The latter is often the case with embedded. The former can be, and is also pretty common, any time you are CPU bound, as in gaming or signal processing.
> Having to decide ahead of time how much memory to allocate to the arena is... crap...
A nice solution in many cases is to just keep a bunch of std::vector<T> (or equivalent) around for the types you need, and .clear() them all at the start of each request/frame/message/whatever being processed. Calling push_back on a vector will only allocate when the vector's already reached its capacity, which will happen very rarely after the first few runs, so the hot loop will usually be allocation-free, but without needing to allocate a fixed amount of memory ahead of time.
Generally, std::vector is not going to work as an allocator backend.
The issue being, std::vector relocates all elements every time it increases the capacity. Therefore, adding an element to std::vector may invalidate addresses of all previously added vector elements.
You're going to have to adjust your higher-level data structures which use that allocator, replacing pointers with offsets relative to the start of that std::vector. This introduces another level of indirection, adds complexity, and in some edge cases may even ruin the performance.
std::vector is fine if you access elements by index, instead of pointer, which can be a perfectly fine approach. Of course this kills the general purpose allocator aspect, since you're not getting back pointers, only offsets.
However, if you combine that with generation numbers, you can make yourself a very handy container with stable and safe references. slotmap [1] comes to mind.
Yes, that's true if you need higher-level datastructures, but if your functions are just taking vectors/spans of stuff as input then it works out fine.
Your second paragraph doesn't make sense because the whole point of this type of allocation is to guarantee that you will reuse the same address space you just freed.