Hacker News new | past | comments | ask | show | jobs | submit login
Mimalloc – A compact general-purpose allocator (github.com/microsoft)
356 points by dmit on June 22, 2019 | hide | past | favorite | 66 comments

We tried mimalloc in ClickHouse and it is two times slower than jemalloc in our common use case https://github.com/microsoft/mimalloc/issues/11

To be clear, your program ran at half speed, right? That's far worse than doubling the time spent in memory-management functions.

Yes, our program ran at half speed.

It is quite bad if changing the allocator made your program run at half speed. It doesn't matter if the library itself consumed more CPU or caused more wait (guessing that's the case here) due to concurrency or excessive syscalls.

The developer helped us a lot to improve the speed and now it is almost at the same level as jemalloc for us

Are there functions available with which I can at run-time query how much OS memory is used, how much handed out in allocations, how many mmap()ed pools are used, and so on?

I find that one of the most important features of a malloc library to debug memory usage.

glibc has these functions (like malloc_info()) -- they are very bugged in that they return wrong results, but after patching them to be correct, they are super useful.

Any more info/link on how they are incorrect and the patches you need to fix them?

Sure, my patches are linked in the bugs I filed:

https://sourceware.org/bugzilla/show_bug.cgi?id=24026 - "malloc_info() returns wrong numbers"

https://sourceware.org/bugzilla/show_bug.cgi?id=21556 - "malloc_stats printing size_t fields as unsigned int"

If you're interested in this topic:

After finding one bug after the other due to 32-bit integer overflow in malloc.c, I just searched for "unsigned int" in that file for fun, and 30 seconds later found what I consider a security vulnerability in realloc():


In certain situations, if you realloc() e.g. 32G + 5 bytes (reallocs this large can happen in large programs e.g. for data analysis), it'll copy only 5 bytes, and leave the rest as memory garbage.

That experience taught me that open source code being old doesn't mean anybody ever read it.

pt2malloc is only maintained by glibc, but not really. Since they cannot maintain it, they want to get rid of it. The upstream maintainer released a better version pt3malloc, which glibc refused to adopt. It needs one more word per alloc.

On Linux, there's a bit of the info you're looking for under /proc/self/*. (Not a great answer; the info under /proc/ isn't exactly well structured, but doesn't change super often).

/proc doesn't really tell you much about userspace concerns such as "how much memory has the malloc implementation handed out to the program.

It describes only the kernel-userspace parts of memory usage. For example, malloc fragmentation is not inspectable in it.

(If you enable the mallopt() setting that turns each malloc() into a separate mmap(), then it tells you a lot, but that is not a default setting.)

Looks like the same idea as Konstantin Knizhnik's thread_alloc:


At least the same architecture of allocated chunks management.

I always find comparisons with tcmalloc hard to parse, since it has a million knobs and the defaults are terrible. If they are running with 16 threads I would normally advise increasing the thread cache size far above the default 3MiB. also interesting would be jemalloc in per-CPU mode.

As always the thing to do is build and run your own workload and see the results.

The tricky part with allocators is always the multi-threaded setups.

Even something as simple as a bunch of threads doing malloc-free in a loop will drop performance of a lot of allocators to the floor, due to some sort of central locking or excessive cache thrashing. This is typically solved by adding per-thread block pools, free lists or some such.

If you go further down the rabbit hole, there's a case when blocks are allocated in one thread and freed in another, your very typical producer-consumer setup. This too further complicates things with the pool/freelist setup and requires periodic rebalancing of freelists and pools.

So once all this is accommodated, a well-tuned allocator inevitably converges to a model with central slabs/pools/freelists and per-thread caches of the same, which are periodically flushed into the former. Then it all comes down to routine code optimization to make fastpaths fast, through lock-free data structures, some clever tricks and what not.

In other words, it's always nice to read through someone's allocator code, but in the end this is a very well-explored area and there's basically a single stable point once all common scenarios are considered.

This allocator appears to have some genuinely interesting things in its multi thread support:

- it looks like it avoids atomics on common cases of malloc and free. That’s a big deal and not all malloc a accomplish that.

- it looks like it has cleverness specifically for the case that one thread frees an object into another thread’s heap. It seems like this case was given some special consideration - in particular avoiding the need for the thread that owns the heap to stop or synchronize at the moment that happens even though that thread is non-atomically allocating and freeing in that same heap.

So, it’s always easy to sound smart by saying that a field is well explored - but it looks like this thing actually has some cool new ideas in it.

>it looks like it avoids atomics on common cases of malloc and free. That’s a big deal and not all malloc a accomplish that.

Many mallocs implementations try to avoid one cache per thread because it can waste a huge amount of memory in applications with thousands of threads opting for per cpu pools or similar solutions. These solutions help with contention but still require atomics (unless using something like restartable sequences).

It is a trade-off, but for applications that use one thread per cpu, completely private free lists can branch win of course.

All modern allocator implementations I am aware of (including Hoard and Mesh) use per-thread caches (of a limited size), and thus avoid the use of atomic operations in the common case while also avoiding blowup (wasted memory caused by using a memory allocator that "leaks" -- see the Hoard paper (ASPLOS 2000, https://people.cs.umass.edu/~emery/pubs/berger-asplos2000.pd...) for a full discussion).

Does hoard avoid atomic ops on the common case of both malloc and free?

This is quite possibly an uninformed question, but is it possible to use multiple allocator implementations simultaneously in the same program?

Obviously this would introduce additional creative ways in which to mess up memory management, but at least theoretically it seems like it ought to be doable. When I tried to search for information on the topic just now, I found only a few items about using multiple allocators in C++ that involved the STL and templates, but nothing about low level code or best practices.

Yeah you can but then you pay a tax on free() because you have to do some dispatch to decide what algorithm you are using. The more algorithms the higher the tax.

Also, you want to make sure that the mallocs can share free pages with each other directly. Otherwise you will likely het higher peak memory overhead or worse perf or both.

The systwm allocator on macOS supports multiple "zones", which can each be backed by different allocators.

They claim to perform well on at least some multi-thread workloads (read https://github.com/microsoft/mimalloc#performance):

"The larson server workload allocates and frees objects between many threads. Larson and Krishnan [2] observe this behavior (which they call bleeding) in actual server applications, and the benchmark simulates this. Here, mimalloc is more than 2.5× faster than tcmalloc and jemalloc due to the object migration between different threads. This is a difficult benchmark for other allocators too where mimalloc is still 48% faster than the next fastest (snmalloc)."

There are a lot of ways to write the same story; Mimalloc has a few innovations of its own, has good benchmarks to justify them, and the paper is a very good read. In particular, Mimalloc works to increase data locality, by using per-‘page’ (64kB memory blocks) free queues. All this with a codebase almost a tenth the size of jemalloc, and it's looking very promising.

Andrei Alexandrescu had a great talk[0] on allocators and this convinced me that if you really care about allocator performance[1] what you need is a tunable and composable system. The whole idea of how and when (runtime/compile-time) to choose the appropriate allocator is also interesting.

[0] https://www.youtube.com/watch?v=LIb3L4vKZ7U

[1] I suppose there's a digression to be had here about allocation being trivial and RAII-style deallocation not really being deterministic, but...

To the point you really need specialized knowledge.

I remember DrDobbs and The C/C++ Users Journal having ads for companies whose sole product was a memory allocator for different kinds of deployment scenarios.

And those products really did help. Microsoft's implementation of malloc performed terribly for many workloads - dropping in a 3rd party allocator could speed up your software considerably.

It wasn't until well into the 2000s until the standard allocators started to become good enough to make the 3rd party libraries less essential. Many of them you can still buy today and they probably still help with certain types of software.

You may want to look at snmalloc, another allocator we did at Microsoft Research. It specifically targets these producer / consumer scenarios.


And the paper, which I’ll be presenting tomorrow at ISMM: https://github.com/microsoft/snmalloc/blob/master/snmalloc.p...

It's a little hidden, but this is out of MS research, and seems to be initially designed for ref-counted scripting languages. I'm not really sure how much they focused on multi-threaded stuff.

>We present mimalloc, a memory allocator that effectively balances these demands, shows significant performance advantages over existing allocators, and is tailored to support languages that rely on the memory allocator as a backend for reference counting.


That's not what the quote means, really. They have best-in-class performance on both single- and multi-threaded workloads. It is just saying that the allocator supports a deferred_free callback, which is zero-overhead in the fast path and negligible overhead in the slow path.

You should actually read the technical report: https://www.microsoft.com/en-us/research/uploads/prod/2019/0...

This has some quite clever and very carefully considered details around concurrency.

> in the end this is a very well-explored area and there's basically a single stable point once all common scenarios are considered.

I think it's absurd to call allocation a solved problem where there's one known optimal design.

but in the end this is a very well-explored area

With a lot of interesting new generic allocators that beat the competition hands down popping up in the last 10 years, I'd dare to claim the area is not very well explored. It has indeed been explored a lot but it seems that there's still low-hanging fruit. Every few years someone writes an allocator that makes significant improvements over the previous generation implementations. I love to see such action on "solved problems".

What if I'm writing code that's strictly single-threaded? Presumably malloc could be simpler and/or faster in this special case?

Is there a production-ready allocator that's optimized for single-threaded use?

It is already faster because there is no lock congestion.

It's very hard to not accidentally pull in a dependency that makes your code multithreaded.

You guys could argue this all day and both be right. Depending on what type of system you're working on, you are either very likely to pull in a multithreaded dependency or you are not. Systems programming is different from game programming is different from web client programming is different from kernel development. In most of my career, I would have never accidentally pulled in a multithreaded dependency, but I could see how that could be easy to do in some cases.

For this use case a good copying collector would be best. Beats any malloc by miles.

The benchmarks are very impressive! I am excited to read through this code and think on it.

Edit: They do mention they're all from AMD's EPYC chip, which is a little idiosyncratic. Speculation: perhaps page locality is more important on this architecture.

The benchmark repo contains results with a Intel Xeon.

Looks roughly similar: https://github.com/daanx/mimalloc-bench

Just a general question in regards to using memory allocators, in the consideration of a C only application.

The problems I encounter with allocator and heap manager are almost never solved by these types of frameworks. These problems include:

1. Improper usage of the memory returned that contradict implementation. 2. Pool allocators that don't have separation between individual blocks (performance reasons). 3. Specifying the lifetime of the memory to a thread or until specific events happen. 4. Difficult to diagnose corruption, with any tool available.

Here's a specific scenario I deal with very often: There are N persistent worker threads. These worker threads have their own pool of memory, and prior to getting work we know this pool is clean. After the work is finished and before more work is recieved the memory is cleaned. Any excess requested memory is returned to the global-pool, and any memory that is "unmanaged" is dealt with properly.

This means that people can do whatever heap management call you use (void * obtainMemory(size_t);) in the scope of business logic without having to worry about infrastructure concerns.

Having a faster malloc/calloc doesn't benefit me as much as making the usage of memory easier, and the understanding of what happens easier.

Is anyone aware of a good/fast single threaded allocator for cases where you don't need/want to pay for thread safety?

Mutexes in Linux are implemented on top of futex, and on Windows on top of WaitForAddress. Uncontested futex (as well as WaitForAddress) never goes to kernel, pure userspace code and in hot path it is quite fast


If you're single threaded then you'll never have mutex contention so they'll always be fast-path. I'd suggest you actually prove that the memory barriers are actually a problem for you via profiling, since it's somewhat unlikely it is.

> If you're single threaded then you'll never have mutex contention so they'll always be fast-path.

Concurrent algorithms (to which multi-thread capable allocators belong) typically necessitate design and performance compromises which aren't nullified by creating only a single thread.

> I'd suggest you actually prove that the memory barriers are actually a problem for you via profiling, since it's somewhat unlikely it is.

civility's reply to your post is somewhat rude, but they're right. It's rather arrogant to assume that the poster doesn't know what they're doing, without knowing anything about their specific problem. Modern allocators are complex pieces of software, typically with a lot of knobs and dials, and it is entirely plausible that an allocator tuned for single-threaded programs is more performant for a specific use case than a generic multi-thread allocator.

You can get large gains tuning an allocator for a specific use case, yes, but that's very different from tuning it for thread safety. Particularly since most major allocators use a first-level allocator that's thread-local and is therefore not paying any thread-safety tax in the first place.

> It's rather arrogant to assume that the poster doesn't know what they're doing

Given the question they asked I don't believe it was arrogant at all to assume they don't really know what they were doing. Their response seems to justify the push to start from the basics as well.

Nice job - you've succeeded in being just another condescending asshole.

It's true that shallow comments are frustrating, but aggressive ones are worse. If you keep breaking the site guidelines like you did repeatedly in this thread, we're going to have to ban you. Can I persuade you, instead, to review https://news.ycombinator.com/newsguidelines.html and use HN as intended? It sounds like you know a lot, and we want knowledgeable users. It's just that we want not to have a tirefire wreck of a website more—so we have no choice but to put out flamewars.

Dan, go fuck yourself. By letting people get away with snarky condescending tones and banning people who call them out on it, you're encouraging shitty behavior. "Don't be snarky" is right up top in the comments section of the guidelines, but it's easier for you to chastise me because I called the snarky guy an asshole. That's just lazy on your part.

Aggressive comments aren't worse, they're just easier for you to recognize.

You're right that "Don't be snarky" is high up in the comment guidelines, but even before that comes "Be kind." Being unkind is worse than being snarky. I don't mean morally worse, but worse in the long-term effect it has on the forum. The guidelines aren't moral edicts, they're heuristics designed to prevent the system from burning out. Snark is bad, but aggression is worse: it leads to flamewar and eventually, as people keep upping the ante, to scorched earth. It's in that sense that what you did was worse than what you were reacting to. Snark is like vandalism—it eventually wrecks a neighborhood. But aggression is like arson, or gun battles in the streets.

We care as much as you do about snarky and condescending comments. I personally share your feeling of being even more averse to those than to the cruder abuses. But for a bunch of reasons, they're harder to moderate. Here's one: people's interpretations of what counts as snark or condescension vary widely. There is little community consensus around this, and moderation can never get too far ahead of the community view. We have to choose our battles wisely if we don't want to spark backlashes, protests, and off-topic distractions that render the cure worse than the disease.

I can't moderate based on my personal views. That's the reason why your advice wouldn't work. We couldn't moderate based on your personal views either. It isn't laziness—it's that you can't impose an individual interpretation set on the community as a whole. People assume we do, but that's only because they haven't learned about moderation the hard way. Moderation is extremely different from extrapolating one's likes and dislikes into site rules and then using power to enforce them.

Does that mean doing nothing about snarky and condescending comments? Hardly, and if you read my comment history (not that I recommend it), you'll find plenty of examples of asking people not to do that. But they're ad hoc and I try to be careful not to demand too big a leap from the reader.

The plan is, over time, to raise the bar for comments so that gradually the snark and shallow dismissals stand out more prominently as abusive, the way that "you're an asshole" comments do now. Then it will be possible for moderators to do more about them. But this needs to happen slowly— frog-boilingly slowly. If we push too hard, the community balks and moderation loses power. We can only do what the community will support. We can lead—but only a bit at a time. Even if https://news.ycombinator.com/item?id=20252539 was trite and unhelpful, I guarantee you that the hivemind is not yet ready to support moderator intervention at that level. It needs to get more refined before that is possible. That's the long-term hope, but it will take years if not decades to get there.

I appreciate your reply, but I don't believe the vandalism and arson metaphor fits. There are just people you let get away with crap and ones you don't. His goal was to antagonize me while flying under the radar, and he succeeded. My goal was to antagonize him in retribution, and I failed.

So where are we now? I'll go back to reading the headlines instead of the comments because I'm not clever enough to keep people like that from getting under my skin, and he'll go on to piss off someone else the next chance that comes up.

Anyways, I regret snapping at you. Take care.

> There are just people you let get away with crap and ones you don't.

It's really a comment-by-comment thing rather than a people thing. If you're seeing cases where we're failing to moderate a post that cries out for it, the likeliest explanation is that we didn't see it. We'd appreciate links, because we can't come close to seeing everything here. Or of course you can flag the comment (described at https://news.ycombinator.com/newsfaq.html). In egregious cases, emailing hn@ycombinator.com is best because then we're guaranteed to see it sooner.

> His goal was to antagonize me while flying under the radar

Really, are you sure? Can you point to the comment that demonstrates this? Because either I missed something obvious or you're reading perhaps a bit too much into what was posted. Intent is notoriously difficult to read accurately in these posts, as I'm sure you know.

> Really, are you sure?

Pretty sure, and user adwn commented on it too. It's a pretty common pattern to not answer the question and then indicate it was a dumb question in the first place. Then he doubled down on the acceptable insults in the follow-on response.

I'm sure I'm oversensitive to crap like this, but I don't think I'm wrong to notice it.

I wonder if HN would ever add a feature to block/hide users others don't want to see, something per reader. I don't see a browser addon for it, and while I'm sure it could be done purely client-side, you might find stats about who is being blocked and how often interesting if it was done on the server. Heh, I'm sure I would end up on a few people's list, but it doesn't seem likely that we're all just going to get along anytime soon.

Anyways, thank you again for the polite reply.

You're welcome! and likewise.

I feel like we probably shouldn't add a block/hide/killfile feature because it would be a step back into the siloed style of forum (https://hn.algolia.com/?sort=byDate&dateRange=all&type=comme...). That would feel easier in the short term but would perhaps be a retreat from the hard problem of building a community that actually works. The thing about that task is that it's always deeply unsatisfying and frustrating. One can only fall short. Yet it seems like the right task to be working on. Perhaps in the end, as Bob Dylan put it, we win the war after losing every battle. Or perhaps it just falls back into the swamp. The ambition of HN has always been to maybe stave off doom a while longer (https://hn.algolia.com/?sort=byDate&dateRange=all&type=comme...), or as pg put it years ago, "make a conscious effort to resist decline" (https://news.ycombinator.com/newswelcome.html).

> civility's reply to your post is somewhat rude

Not as rude as I wanted to be. I expect answers like that from Stack Overflow or Reddit - just another condescending ego play without actually addressing the question.

Regardless, thank you for your reply.

I don't suppose you know a good single-threaded (or share-nothing across threads) allocator worth looking at?

Your answer is trite and unhelpful.

The important thing about all this is to measure perf on realistic workloads before and after. I don't really believe in allocators that have "excellent performance" on everything.

The Dev at Discourse also try it with Ruby, the result aren't as good as jemalloc. [1]

[1] https://twitter.com/samsaffron/status/1143048590555697152

the redis benchmark is interesting! Maybe antirez can make sth out of it

I never like names that require a “pronounced like” note, but cool project regardless.

Given the rather unpredictable nature of English pronunciation that's basically required for any made up word.

Particularly since it would not be unreasonable to assume that the "mi" in mimalloc is incorrectly pronounced like the "mi" in Microsoft.

That's what happens when languages like English adopt the foreign pronunciation of words instead of the English pronunciation. Now everything is an edge case and you can't be sure of the pronunciation even if it's an English word.

Cool fact - I'd bet that no one pronounces MySQL like its author does.

they should not have tested on AWS

I should create memealloc which converts all memory allocations to base64 encoded gifs

with cats and unicorns inside

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact