Tinyalloc: replacement for malloc/free in unmanaged, linear memory situations (github.com)
199 points by ingve 3 months ago | 56 comments



If I read the code correctly, ta_free() is O(n), where n is the number of allocated blocks. That's not very good.

PS. Also, the block alignment is aligning the wrong thing. It should align the start of the public (returned) part, not the block size, and that's done by padding the Block struct as required (if required)... and not to a hardcoded 8 bytes. Otherwise you'll get SIGBUS on RISC boxes.
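In other words, round the address you hand back up to the alignment boundary, along these lines (a hypothetical helper, not tinyalloc's code):

    #include <stdint.h>

    /* Round p up to the next multiple of align (a power of two), so the
       user-visible pointer itself is aligned, not just the block size. */
    static void *align_up(void *p, uintptr_t align) {
        uintptr_t a = (uintptr_t)p;
        return (void *)((a + (align - 1)) & ~(align - 1));
    }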

PPS. At the risk of stating the obvious, this is nothing more than a toy allocator, with lots of loose ends. A lab exercise in the second or third year of an average CS course, maybe, but this is not something suitable for production. Perhaps it works for some specific cases, but these aren't specified. So seeing this massively upvoted and at the top of HN is a bit strange.


And this: "Also, allocation will fail when all blocks in the fixed size block array are used, even though there might still be ample space in the heap memory region..."

Basically, you can make allocation fail with a lot of small allocations that each use only part of a block. This bit me when using some C++ code on a Cortex-M class processor, where the C++ code ended up creating a bunch of objects with very small allocations in them.

That said, the code is pretty clean and easy to read, so it's a pretty understandable allocator. Some malloc implementations are so opaque you have to study them for a week to tease out how they work.


Hey, author here - this post/thread completely took me by surprise, so apologies for the late reply:

1) ta_free() is only O(n) if block compaction is enabled; otherwise it's O(1), i.e. a simple cons of the freed block addr onto the free list (https://github.com/thi-ng/tinyalloc/blob/master/tinyalloc.c#...)
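For reference, that O(1) path amounts to something like this (illustrative names, not the actual tinyalloc source):

    #include <stddef.h>

    /* A freed block is simply consed onto the head of a singly linked
       free list: constant time, no search, no compaction. */
    typedef struct Block {
        void *addr;
        struct Block *next;
    } Block;

    static Block *free_list = NULL;

    static void free_block(Block *b) {
        b->next = free_list;
        free_list = b;
    }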

2) there's no hardcoded alignment (it's configurable via the TA_ALIGN macro), and I'm confused by your other comment about not aligning the right address in general. That's a wrong observation, and I think you misunderstood the diagram in the readme.

3) it's clearly stated in the readme that this is largely meant for memory-constrained situations (or where large numbers of blocks, realloc, or multithreading aren't needed/available), so it's not meant to be a replacement for dlmalloc/jemalloc-caliber allocators etc. I needed something small & lightweight, and it did its job very well in several production projects on the ARM STM32 platform (where I have up to around 8MB managed). I also found it very useful for some ongoing JS/WASM interop projects (i.e. there's a JS version here: https://github.com/thi-ng/umbrella/blob/master/packages/mall...)


Looks nifty and simple. Last time I had to work in a low-memory environment (a 64K heap for Lua running on MIPS), I ended up implementing something close to TLSF [1], which let us choose some trade-offs regarding performance/allocation efficiency.

In the end we still had to add one layer of hack^H^H engineering tricks to work around fragmentation issues in the worst case :-)

[1] http://www.gii.upv.es/tlsf/files/ecrts04_tlsf.pdf


That's cool. I've written small but less flexible things before, based on arrays as used/freed block maps rather than lists, for embedded devices with 128K of RAM.

This seems like it has some nice features, though a 3K overhead could be significant in such a constrained environment (I guess if it's only managing 128K, that could shrink).


I did a similar implementation 15 years ago. I needed an implementation that worked in an SMP environment without using locks. I used an atomic increment to allocate a block index and an atomic decrement to free it. It worked very well, with good performance in an SMP environment. It was implemented as a shared library, and the same code could be linked statically into the kernel module. The Linux kernel could allocate a block and pass the pointer (an index to the block) as a parameter to user space, and user space could dereference it directly.

The library was implemented in C++. We also ported it to QNX in addition to the Linux kernel.

I remember it could do >1 million message blocks per second between user-space programs, with similar performance between user and kernel space.

It did waste some amount of space; that could be configured at system startup.


Cool. How did you deal with the case where a block gets freed out of order?

For example,

    Thread1 allocate (block index++)
    Thread2 allocate (block index++)
    Thread1 free     (block index--)
    Thread3 allocate (block index++)

Now thread 3 has a block that is actually still owned by thread 2.


The blocks were actually fixed-size and pre-allocated.

When a thread, process, or kernel thread "allocated" a block, it actually acquired and owned the "index" via an atomic inc from the free list.

When it freed the block, it put the block's index back on the "free" list.

The "free" list is actually a array. Thus the list index can be allocated and free with atomic inc, dec. Thus it can do SMP safe alloc, free operations without any locks and guarantee the correctness.

This was actually an IPC/DIPC library to support TCP session redundancy in a network router product. Other programmers called this library from user space, kernel space, the TCP stack, etc.

Since the blocks and indices are all in shared memory, there is huge potential for shared-memory corruption. In debug mode, there is guard space before and after each block, and every allocate and free is checked for ownership, double free, and guard-space corruption. There were regression tests covering every aspect of the API that ran 24/7, checking every commit. When not compiled in debug mode, the operations were much faster, but there were still quite a few assertion checks in place.

There was a debug trace system that tracked all system activity (including Linux, QNX, and VxWorks context switches). The log lived in a special region of memory preserved over warm boot. If the system hung completely, in 99% of cases I could recover the memory, do post-mortem analysis, and know exactly which user- and kernel-space threads were running before the crash. It was more useful for VxWorks, less useful for Linux, as the majority of the code was in user space and didn't take down the system.

The regression test coverage code and the debug-mode validations were 80% of the code. The actual implementation was very small.


Freeing is usually easy (for free() and memory pools) because the pointer can be used to compute the index into the bool array or bitfield. Allocating is more difficult since you have to search. You can use a linear search that starts at the last block that was allocated (for memory pools that's usually good enough). You can also use a bitmap to accelerate the search. Example:

256 blocks of memory: you use 8 words of 32 bits to indicate used/free status, 1 meaning free, 0 meaning occupied. To quickly find a free block, you compare each word against zero and skip the fully occupied ones.
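A minimal sketch of that search (assuming GCC/Clang for __builtin_ctz; ffs() from <strings.h> would also work):

    #include <stdint.h>

    #define NWORDS 8u                  /* 8 x 32-bit words = 256 blocks  */
    static uint32_t free_map[NWORDS];  /* bit set = free, clear = in use */

    static void map_init(void) {
        for (unsigned w = 0; w < NWORDS; w++)
            free_map[w] = 0xffffffffu;      /* everything starts free */
    }

    /* Find a free block, mark it used, return its index (or -1 if full). */
    static int find_free_block(void) {
        for (unsigned w = 0; w < NWORDS; w++) {
            if (free_map[w] != 0) {         /* skip fully occupied words */
                int bit = __builtin_ctz(free_map[w]);
                free_map[w] &= ~(UINT32_C(1) << bit);
                return (int)(w * 32 + bit);
            }
        }
        return -1;
    }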


3K of overhead when you're trying to get smaller than malloc is definitely a lot. One of my products has 8K total; I'm not going to waste 3K of that replacing malloc.

I'm fairly happy that MISRA C forbids malloc. I can't use it even if I wanted to, so I need to be more careful about how memory is used. Annoying at first, but it makes a lot of sense later on.


It is quite possible (and it can be reasonable) to deviate from that rule. For example you could write a wrapper for malloc that only allows dynamic allocation in the initialization phase of your program. This can be useful to allocate memory for opaque-pointer-style types.
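A sketch of such a wrapper (hypothetical names; the assert trips on any allocation after the init phase is declared over):

    #include <assert.h>
    #include <stdbool.h>
    #include <stdlib.h>

    static bool init_done = false;

    /* Call once, at the end of the initialization phase. */
    void init_complete(void) { init_done = true; }

    /* The only malloc the rest of the program is allowed to use. */
    void *init_malloc(size_t size) {
        assert(!init_done && "dynamic allocation after init phase");
        return malloc(size);
    }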


There are other ways to get chunks of memory separate from the stack - static uninitialized byte arrays.


Yeah, but it's nicer to define that kind of stuff in your linker script IMO, where sizes of large memory regions can be thought of holistically.


The parent post was talking about using malloc once at the start of the program, are you talking about defining that with your linker script?

Also if the goal is to get one chunk of memory with a single call, why use malloc at all? Why not use the direct memory mapping command?


I don't think he's talking about using malloc "once"; he's talking about only using it during init. In practice you end up calling it a bunch in that model; the last design I shipped like that made a few hundred malloc calls (well, operator new, because it was C++) at init.

Under that system malloc/new was just a pointer bump, with an assert if there wasn't enough room. free() wasn't linked in, and delete asserted. That heap region (mine actually had several heaps living in different memory with different characteristics, selectable with a new overload) is way easier to define in your linker script.
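The whole allocator then boils down to something like this (sizes and placement are illustrative; in a real build the region would come from the linker script):

    #include <assert.h>
    #include <stddef.h>
    #include <stdint.h>

    #define HEAP_SIZE (64u * 1024u)
    static uint8_t heap[HEAP_SIZE];
    static size_t  heap_used;

    void *bump_alloc(size_t size) {
        size = (size + 7u) & ~(size_t)7u;       /* keep pointers 8-aligned */
        assert(heap_used + size <= HEAP_SIZE);  /* out of room: fail at init */
        void *p = &heap[heap_used];
        heap_used += size;
        return p;
    }

    void bump_free(void *p) {
        (void)p;
        assert(!"free is not supported in this model");
    }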


Of course, but then we are back to memory pools.


What is the difference between allocating memory right when the program starts with malloc and using the same size in static uninitialized byte arrays?

There are differences of course, but that memory can be used in the same way, and so memory pools are orthogonal to how the memory is acquired.


If you use modules with opaque pointers, the types are incomplete and therefore their size is not known outside of the module. You can either make the type visible (against the idea of opaque pointers), modify the module to allocate a fixed set of instances (somewhat against the open/closed principle), or you let the module use malloc to allocate memory for new instances.
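For concreteness, a minimal opaque-pointer module (illustrative names):

    /* foo.h - only an incomplete type is exposed to clients */
    typedef struct Foo Foo;
    Foo *foo_new(void);
    void foo_destroy(Foo *f);

    /* foo.c - the one place that knows sizeof(Foo), hence the one
       natural place to call malloc */
    #include <stdlib.h>

    struct Foo { int state; };

    Foo *foo_new(void) {
        Foo *f = malloc(sizeof *f);
        if (f) f->state = 0;
        return f;
    }

    void foo_destroy(Foo *f) { free(f); }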


Why would someone be doing this while trying to fit within the boundaries of the MISRA standard? Also what do you mean by module?


The rule you are referring to is a "required" (not a "mandatory") rule. You can deviate from it and still be MISRA-C compliant. People write deviations for all kinds of reasons. Here are some rationales for deviations: https://lars-lab.jpl.nasa.gov/JPL_Coding_Standard_C.pdf (Rule 5 on page 10 on malloc) https://www.keil.com/pack/doc/CMSIS/Core/html/coreMISRA_Exce...

The unit of reuse that defines the type, example: http://c-faq.com/struct/sd1.html


So your solution, in a thread about using a fixed amount of memory, is to do nothing different and let anything use the heap at any time, because it might be possible in some cases to get it past MISRA standards? I'm not really sure what you are trying to say; it seems you keep skewing further from the original discussion.


> I'm fairly happy that MISRA C forbids malloc

We wrote the system I mentioned above (128K of RAM, 256K of flash) without it. I agree, you can do everything you need without malloc when you're careful.

Unfortunately we had a third party dependency that required a malloc implementation to be able to run, so we had to implement a simple one.


>Unfortunately we had a third party dependency that required a malloc implementation to be able to run, so we had to implement a simple one.

We ran into the same thing. Some open source compression code I wanted to use had malloc in it, so we dumped it for something else that didn't. Luckily we found a protobuf implementation that didn't use malloc; that saved some time on a different project.

First thing I do now when looking at example or source we might want to use is a quick scan for malloc.


Unfortunately for us, this was a certified/approved EMV payment kernel, and about the only one we could get for our architecture, so we had to live with it.

Sometimes you don't get to choose or do it yourself, more's the pity!


That concept is called a memory pool or storage pool. Very useful, but its greatest disadvantage compared to malloc() is that you waste a lot of memory if your allocation sizes vary greatly (for example a network protocol where packet sizes can range from a few bytes to several KB).


In my experience, the best way to implement a network protocol on a small embedded chip is a fixed-length, statically allocated circular buffer, not malloc.
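Something along these lines (sizes are illustrative; this relies on packets being retired in FIFO order):

    #include <stdint.h>

    #define SLOTS    16u    /* power of two so the wrap math stays exact */
    #define SLOT_LEN 256u

    typedef struct { uint16_t len; uint8_t data[SLOT_LEN]; } Packet;

    static Packet   ring[SLOTS];
    static unsigned head, tail;   /* free-running counters; wrap is benign */

    /* Claim the next slot to fill, or NULL if the ring is full. */
    static Packet *ring_push(void) {
        if (head - tail == SLOTS) return NULL;
        return &ring[head++ % SLOTS];
    }

    /* Retire the oldest slot, or NULL if the ring is empty. */
    static Packet *ring_pop(void) {
        if (head == tail) return NULL;
        return &ring[tail++ % SLOTS];
    }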


That would be like a "first-fit-or-die" malloc, wouldn't it? Which mechanism to choose always depends on your (de-)allocation profile and available memory. If you can free out-of-order, the circular buffer will fragment quickly.


> If you can free out-of-order, the circular buffer will fragment quickly.

Right; it's just that when coding a network protocol for embedded, you can often free in the same order, i.e. it's often a good idea to limit parallelism to a small number of FIFO streams (ideally a single one).

That's only for small chips. When they aren't small, e.g. you have a quad-core ARM with a gig of RAM, you usually have an OS anyway and the programming is totally different: much higher level, and you're guaranteed to have a fast malloc and lots more.


The overhead is not fixed; the 3KB is only for the default config and depends on the max number of blocks one might have in flight (256 by default).


Since some comments point out that this code might not be production ready: can anyone recommend other solutions that are able to manage a given block of memory? My understanding of libraries like jemalloc is that they replace malloc (using the system calls brk and mmap [0]), but I can't hand them an already allocated block of memory to manage, right?

Something like

    memory_manager m(some_buffer);
    allocation a = m.allocate(512);
would be nice. :-)

[0] https://medium.com/iskakaushik/eli5-jemalloc-e9bd412abd70


https://android.googlesource.com/platform/external/qemu/+/em...

https://android.googlesource.com/platform/external/qemu/+/em...

https://android.googlesource.com/platform/external/qemu/+/em...

These 3 files should be a fairly self-contained, minimal way to allocate into existing buffers. No attempt is made at thread safety, though (we assume the user synchronizes that themselves).


This TLSF implementation should be able to do that: https://github.com/mattconte/tlsf
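If I recall that repo's API correctly (worth double-checking its tlsf.h), usage is close to the interface you sketched:

    #include "tlsf.h"

    static char buffer[1 << 20];   /* the memory region to be managed */

    int main(void) {
        tlsf_t t = tlsf_create_with_pool(buffer, sizeof buffer);
        void *a = tlsf_malloc(t, 512);
        /* ... */
        tlsf_free(t, a);
        tlsf_destroy(t);
        return 0;
    }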


FreeRTOS features some nice and simple allocators on fixed buffers [0]. I have used heap_4 to implement memory management for a fully portable test suite requiring malloc. Really, any toy allocator would have done the job for that use case, since performance wasn't critical, but heap_4 was fairly easy to integrate.

[0] https://www.freertos.org/a00111.html
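With heap_4.c compiled in, the allocator manages a static array of configTOTAL_HEAP_SIZE bytes (heap_5.c instead takes explicit regions); a usage sketch, with the config assumed rather than quoted from the docs:

    #include "FreeRTOS.h"

    void example(void) {
        /* heap_4 is a first-fit allocator that coalesces free blocks */
        void *p = pvPortMalloc(512);
        if (p != NULL) {
            /* ... use the buffer ... */
            vPortFree(p);
        }
    }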


Depending on what exactly you mean by "hand it a block of memory to manage", you can do this with jemalloc's extent hook functionality. Though that's meant for "I have some long-lived data structures, I want all their memory stored on hugepages" more than "I have this small-ish data structure, I want to batch its allocations together".



Consider using sequential instead of linear.


If handles and setters/getters can be used, then it opens up the possibility of defragging or 'garbage collecting' the memory.

A neat trick could be the ability to 'trim' the data used by a handle; then when it comes to defrag time, just deal with the remaining part and free up the trimmed space.
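A hypothetical shape for that handle table (illustrative only):

    #include <stddef.h>
    #include <stdint.h>

    typedef uint16_t handle_t;

    typedef struct { uint8_t *ptr; size_t len; } Slot;
    static Slot table[64];

    /* Callers always go through the getter, so a defrag pass may move
       the backing storage and just patch the table entries. */
    static void *handle_get(handle_t h) { return table[h].ptr; }

    /* "Trim": shrink the live length; the next compaction pass only
       preserves the remaining prefix and reclaims the tail. */
    static void handle_trim(handle_t h, size_t new_len) {
        if (new_len < table[h].len)
            table[h].len = new_len;
    }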


Is this missing code? I can’t find a definition of Heap anywhere.



Thanks. In my defense, GitHub search is garbage: https://github.com/thi-ng/tinyalloc/search?q=Heap&unscoped_q...


github search is optimized for finding accidentally committed credentials, not useful code.


How does it stack up to dlmalloc?


It's a tiny allocator designed for tiny embedded systems; it's not intended for large desktop machines like dlmalloc() is. dlmalloc() can't run on tiny embedded systems because it has relatively high overhead. This provides far fewer features but has a very small memory footprint. It's really intended for cases where your embedded program mostly allocates memory once on startup and doesn't usually free() much at all.


It's a nifty tool, but I can't help feeling a little disappointed that we're still having to deal directly with issues such as memory allocation in 2019.


Well, 'we' deal with it so that you don't have to. What do you suggest? Somewhere, memory needs to be allocated; this stuff isn't magic.


Somebody's got to deal with it so that you may have the luxury of not dealing with it.


How do you think we should be dealing with memory allocation instead?


    addr = getrandom(8);
    mem = mmap(addr, len, PROT_RWX, MAP_FIXED|MAP_ANON, -1, 0);


It's not realistic to use a per-allocation mapping like that in production - all your allocations will be scaled up to multiples of pages.


Whoosh!


> PROT_RWX

Hmm…


The X is for fleXibility.


How else will you put the self-modifying code there?


You don't necessarily.

Don't forget that there are memory-managed languages that compile to C. They still use malloc under the hood.


That's a good point.


Right? Just use JavaScript already! /s




