I semi-recently implemented a malloc tester (am TAing an OS course). Students write their own implementation of malloc and free, and then I "intercept" their calls to brk and sbrk by using macros to re-define those symbols as my own versions of those functions.
It is quite nice: I have knobs to "fuzz" the alignment of addresses returned from sbrk, as well as to inject arbitrary ENOMEM failures, to see if they're following the man pages. Another thing I looked for was re-entrancy. This was a bit hand-wavy, but it ultimately took the form of a try-lock in my sbrk implementation: if the try-lock fails, their invocations of sbrk are not being synchronized.
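A minimal sketch of what such an interposed sbrk might look like (the names, knobs, and simulated arena are illustrative, not the actual harness; per the man page, sbrk reports failure by returning (void *)-1 with errno set to ENOMEM):

```c
/* Hypothetical harness sketch: students' code is compiled with
   -Dsbrk=test_sbrk so their sbrk calls land here instead of in libc. */
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

static char heap[1 << 20];          /* simulated sbrk arena                */
static size_t heap_top = 0;
static unsigned misalign = 0;       /* knob: skew returned addresses       */
static int fail_every = 0;          /* knob: inject ENOMEM every Nth call  */
static int call_count = 0;
static pthread_mutex_t sbrk_lock = PTHREAD_MUTEX_INITIALIZER;

void *test_sbrk(intptr_t increment) {
    /* Re-entrancy check: if the try-lock fails, the student's malloc is
       calling sbrk concurrently without synchronization. */
    if (pthread_mutex_trylock(&sbrk_lock) != 0)
        abort();                    /* flag unsynchronized sbrk use */

    if (fail_every && ++call_count % fail_every == 0) {
        pthread_mutex_unlock(&sbrk_lock);
        errno = ENOMEM;             /* injected failure */
        return (void *)-1;
    }
    if (heap_top + (size_t)increment + misalign > sizeof heap) {
        pthread_mutex_unlock(&sbrk_lock);
        errno = ENOMEM;             /* genuinely out of arena space */
        return (void *)-1;
    }
    heap_top += misalign;           /* fuzz the alignment of the break */
    void *old_break = heap + heap_top;
    heap_top += (size_t)increment;
    pthread_mutex_unlock(&sbrk_lock);
    return old_break;
}
```

With `misalign` nonzero, consecutive breaks stop being nicely aligned, which quickly exposes implementations that assume sbrk returns aligned addresses.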
Another thing I did to evaluate their implementations was to poison the entire sbrk space (allocated and unallocated), poison all memory blocks my test runner allocates, and poison all freed blocks just before the runner frees them. This let me estimate the metadata overhead of their malloc implementation at byte granularity: I search the sbrk space for the various poisons and report on their relative frequencies. If the frequencies sum to under 100%, I attribute the remainder to metadata overhead.
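The poison-scan estimate can be sketched like this (the poison values and names are made up for illustration):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define POISON_UNALLOCATED 0xA5   /* written over fresh sbrk space     */
#define POISON_LIVE        0xC3   /* written into blocks we malloc'd   */
#define POISON_FREED       0x5A   /* written just before we free       */

/* Returns the estimated metadata overhead as a fraction of the heap:
   any byte that matches no known poison is presumed to be metadata. */
double metadata_overhead(const unsigned char *heap, size_t len) {
    size_t recognized = 0;
    for (size_t i = 0; i < len; i++)
        if (heap[i] == POISON_UNALLOCATED ||
            heap[i] == POISON_LIVE ||
            heap[i] == POISON_FREED)
            recognized++;
    return (double)(len - recognized) / (double)len;
}
```

The same scan doubles as the overrun detector: a poison byte that changed value without being handed out is evidence of a stray write.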
The poisoning also allowed me to detect simple bugs: if they write beyond the pointer returned by sbrk, I can potentially catch it because the values in memory no longer match the expected poison.
Overall, it was fun to do and I could present some nice stats. For example, I produced a histogram of malloc call sizes and sbrk call sizes. This allowed one to quickly see if they're invoking sbrk unnecessarily often.
This sounds like a generally useful thing; I would wager that any non-toy malloc implementation has something similar. Perhaps it's a good idea to open-source it so we can all benefit from this spec-testing, fuzzing, and histogramming?
Great tutorial, Dan! Since it sounds like you are planning to continue the series, I have a few thoughts on potential directions to go with it.
First, a link to Doug Lea's classic malloc page might be a good addition to the resources section. His dlmalloc() is the basis for glibc's current ptmalloc. His code is wonderfully clear and commented, both for the implementation and the rationale behind it: http://g.oswego.edu/dl/html/malloc.html
Second, I wonder if it would make sense to jump straight to using mmap() instead of the classic brk()/sbrk(). I think it's no more complicated, has more uses elsewhere, is conceptually more portable, and allows multiple arenas to be added in a straightforward way. Are there advantages I'm not seeing to sticking to the ancient ways?
Last, on the debugging side, I think you might want to start with an introduction to Valgrind rather than gdb. It has a much gentler learning curve, and even for an expert it's often the better tool for the memory-allocation bugs that will be most common here. Alternatively (or additionally), some examples of the more modern AddressSanitizer that's now in GCC and Clang would be slick: https://code.google.com/p/address-sanitizer/wiki/AddressSani...
> Second, I wonder if it would make sense to jump straight to using mmap() instead of the classic brk()/sbrk().
Yes, it would. brk/sbrk give you a single growable region, which makes them a terrible way to allocate anything but stack-like data. Use mmap instead, both for more control over the layout in memory and for more portability.
Perhaps you know: on Linux, how do mmap() and brk() differ as far as page initialization? Does brk() conceptually include the effects of MAP_POPULATE? Do both of them cause virtual to physical memory mapping at the time of the call?
I'm not sure as to the specifics, beyond that brk is a strict subset of mmap; I would be surprised if it doesn't use the full VM substructure to set up the mapping.
I would amend my statement before to say that, if you're looking into memory allocation, conceptually, mmap is where everything happens these days. brk is kept for backwards compatibility.
There is a much better discussion of this in vol. I of Knuth.
(Amusingly, in the original edition Knuth uses a simple algorithm (linking the allocated spaces) in the text and leaves the better algorithm (linking the free spaces) to an exercise. In Unix V6 (yes, this really dates me), the "malloc" in the C library used the simple algorithm from Knuth, variable names and all, causing O(N^2) performance problems.)
The HN title doesn't match the article title "A Quick Tutorial on Implementing and Debugging Malloc, Free, Calloc, and Realloc" but from the article it is obviously C.
The article mentions sbrk only in passing; nothing describes what it actually is. I would spend some time on it.
I would also address virtual memory, and at least let the reader know that the underlying operating system is ultimately responsible for allocating memory. On a modern OS with virtual memory, the pointer you get back is a virtual address, which is generally different from the physical address.
Linux, in particular, uses a strategy referred to as optimistic memory allocation (overcommit) to honor malloc requests.
I mention all this because I think anyone interested in these lower level details would also be interested in how the OS and hardware are involved.
I have "a thing" for basic system services like malloc, linker, etc.
Writing a memory allocator is a nice exercise but it's also fundamental research: by toying around, because of sheer interest, with what everyone else takes for granted you might come up with something that changes things for all, big time.
For example, memory allocators were considered a "well enough solved" problem for a long time until suddenly we had a rush of new, experimental, and/or optimized allocators like jemalloc, tcmalloc etc. While they might not be revolutionary they still blast the old 80's/90's implementations like no tomorrow.
Basically, a wheel is perfect; nothing we have developed in software so far is a wheel. We should not stop learning how to write these low-level things, because the people who do are the ones who might come up with something better in the future.
Linkers don't get as much attention as they should. The linker has the last chance to look at the entire program before it runs. It's a good place for final checks and optimizations.
In Modula, each module had an "init" section. The Modula linker traced the dependencies of the "init" sections and arranged for them to be executed in a valid order. If there was a dependency loop, that was an error. That's way ahead of the C/C++ rules on order of initialization.
Elimination of duplicate code at link time is possible. For template and generic instantiations, this can be a big win, especially for C++.
Some linkers support "weak links" - if A has a weak link to B, and there are no strong links to B, B is not loaded and the link is null. This is useful for optional components which need some kind of startup.
Vegedor, it looks like an early negative comment got your account auto-killed.
Quoting dead comment:
vegedor 17 hours ago | link [dead]
GCC with -flto, literally link-time optimization, is new to me and seems to be good. It turns everything it can into constants, and even inlines functions called through pointers and resolves them when the arguments are also constant. That didn't work for me without it. Dunno if that's standard-compliant.
[quote]While they might not be revolutionary they still blast the old 80's/90's implementations like no tomorrow.[/quote]
But will they do so on 80's and 90's hardware?
There are quite a few extra issues that come up when you implement a usable allocator. I'm surprised the article didn't mention them. Here are just a few:
Alignment: different systems have different minimum alignment requirements. Most require that all allocations be 8- or 16-byte aligned.
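The standard trick for power-of-two alignment is to round the request up with a mask; a quick sketch:

```c
#include <assert.h>
#include <stddef.h>

#define ALIGNMENT 16

/* Round n up to the next multiple of ALIGNMENT.
   Works only because ALIGNMENT is a power of two. */
#define ALIGN_UP(n) (((n) + (ALIGNMENT - 1)) & ~(size_t)(ALIGNMENT - 1))
```

Every request size (and any header size) gets pushed through this before the allocator carves out a block, so payloads always start on a 16-byte boundary.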
Buffer overruns: using object headers is risky, since a buffer overrun can corrupt your heap metadata. You'll either need to validate heap metadata before trusting it (e.g. keep a checksum) or store it elsewhere. This also wastes quite a bit of space for small objects.
Size segregation: this isn't absolutely essential, but most real allocators serve allocations for each size class from a different block. This is nice for locality (similarly-sized objects are more likely to be the same type, accessed together, etc). You can also use per-page or per-block bitmaps to track which objects are free/allocated. This eliminates the need for per-object headers.
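A per-block bitmap for one size class might look like this (illustrative layout, not any particular allocator; __builtin_ctzll is a GCC/Clang builtin):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SLOTS 64  /* a block holds 64 objects of one size class */

typedef struct {
    uint64_t used;      /* bit i set => slot i is allocated */
    char     *base;     /* start of the object area         */
    size_t   obj_size;  /* the block's size class           */
} block_t;

/* Returns a free slot index, or -1 if the block is full. */
static int block_alloc(block_t *b) {
    if (b->used == UINT64_MAX)
        return -1;
    int slot = __builtin_ctzll(~b->used);  /* lowest clear bit */
    b->used |= (uint64_t)1 << slot;
    return slot;
}

static void block_free(block_t *b, int slot) {
    b->used &= ~((uint64_t)1 << slot);
}
```

One word of bitmap replaces 64 per-object headers, and the slot index maps to an address as `base + slot * obj_size`.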
Internal frees: many programs will free a pointer that is internal to an allocated object. This is especially likely with C++, because of how classes using inheritance are represented.
Double/invalid frees: you'll need a way of detecting double and invalid frees, or these will quickly lead to heap corruption. Aborting on the first invalid free isn't a great idea, though: depending on how you insert your custom allocator, the dynamic linker may allocate objects from its own private heap and then free them through your allocator.
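One common scheme is a magic value in the block header: free() checks and clears it, so a second free of the same pointer is reported rather than corrupting the free list. This is an illustrative sketch only (and, being header-based, it inherits the overrun risk mentioned above):

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAGIC_LIVE 0xA110CA7Eu
#define MAGIC_FREE 0xDEADBEEFu

typedef struct {
    uint32_t magic;  /* MAGIC_LIVE while the block is allocated */
    uint32_t size;   /* payload size */
} header_t;

/* Returns 0 on success, -1 for a double or invalid free. */
static int checked_free(void *p) {
    header_t *h = (header_t *)p - 1;
    if (h->magic != MAGIC_LIVE)
        return -1;             /* double free, or not our pointer */
    h->magic = MAGIC_FREE;
    /* ...a real allocator would put the block on a free list here... */
    return 0;
}
```

Returning an error instead of aborting lets the allocator silently ignore frees of objects it never handed out, such as the dynamic-linker case above.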
Thread safety: at the very least, you need to lock the heap when satisfying an allocation. If you want good performance, you need to allocate objects to separate threads from separate cache lines, or you'll end up with false sharing. Thread segregated heaps also reduce contention, but you need a way of dealing with cross-thread frees (thread A allocates p, passes it to B, which frees it).
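The baseline version of thread safety is just a global lock around the heap; a sketch, with malloc/free standing in for the allocator's real internals:

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

static pthread_mutex_t heap_lock = PTHREAD_MUTEX_INITIALIZER;

void *locked_malloc(size_t n) {
    pthread_mutex_lock(&heap_lock);
    void *p = malloc(n);   /* stand-in for the real allocation path */
    pthread_mutex_unlock(&heap_lock);
    return p;
}

void locked_free(void *p) {
    pthread_mutex_lock(&heap_lock);
    free(p);               /* stand-in for the real free path */
    pthread_mutex_unlock(&heap_lock);
}
```

This is correct but heavily contended under parallel load, which is exactly the motivation for the per-thread heaps and cache-line-aware placement described above.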
The HeapLayers library is very useful for building portable custom allocators: https://github.com/emeryberger/Heap-Layers. The library includes easily-reusable components (like freelists, size classes, bitmaps, etc.) for building stable, fast allocators. HeapLayers is used to implement the Hoard memory allocator, a high performance allocator optimized for parallel programs.