167 points by matt_d on Feb 17, 2019 | hide | past | favorite | 36 comments

 Is the shuffle vector a novel data structure? I've never seen it before but it looks so obvious that just glancing at figure three was enough to figure out how it worked before I got to the description: it's basically a vector with a Fisher-Yates[0] shuffle built in! I wonder what other uses they might have.
 It seems very closely related to the well known trick for deleting elements from a middle of an array in constant time (swap last element with the element to be removed, then pop the last element). The difference is just that here the swap happens on insertion rather than on deletion.
 Yes, kind of, but that is more when you have an unsorted array and don't worry about making it even less sorted.Shuffle vectors guarantee randomized sampling of a known set of elements for very cheap, with the ability to return elements to the set. That seems like it could have some very interesting niche usages.It kind of reminded me of modern Tetris implementations. Instead of picking one tetronimo at random for the next piece, they generate random sequences of each tetronimo. However, this context never requires returning a single element to the set of options - you can just generate a new sequence as a whole once you run out of pieces.Although... you could "flip" the shuffle-vector by swapping with the elements already taken: make a small buffer of seven elements (one for each piece), circularly iterate over it, and whenever you reach element i you swap the value with a random position in the (inclusive range from zero to i. This would basically be distributing the Fisher-Yates shuffle over the iterations, rather than running it when the pieces run out.For some reason though, I doubt that tetris piece randomization was ever a source of performance instability for these games.
 How is it cheaper? In both formulations, you end up with one swap and one RNG call per operation on the vector. With the shuffle vector you pay those costs up front, with swap-with-last you pay those costs lazily. Asymptotically there's no difference there at all.
 It's not, that's what I meant with:> This would basically be distributing the Fisher-Yates shuffle over the iterations, rather than running it when the pieces run out.This obviously makes no difference for a sequence of seven tetronimos, but perhaps there are contexts imaginable where one would do similar operations over very large sequences, in which case paying costs lazily could be nice.EDIT: I just remembered that Go guarantees that iterating over the built-in maps is guaranteed to be in random order. I wonder how that is implemented - perhaps something like this might be useful there for the situations where one often bails out of such loops early.
 Since it's too late to edit: nope, that won't work for Go maps, because it only shuffles elements already visited, so bailing out of the loop early guarantees the first N elements will be the same N elements that have been iterated over up until that point. Which breaks the randomness guarantee.OTOH, that actually is kind of an interesting feature; there might be a few situations where that is desired behavior.
 tyingq on Feb 17, 2019 That's roughly why a typical stream cipher like rc4 is fast isn't it...as opposed to CBC?
 > In future work, we plan to explore integrating Mesh into language runtimes that do not currently support compaction, such as Go and Rust.With respect to Rust, isn't it usable as-is through LD_PRELOAD or whatnot since Rust now uses the system allocator by default? Though obviously a proper GlobalAlloc implementation could be provided (and eventually an Alloc).
 This sounds right - that line was written before Rust used the system allocator by default. Will update, although I'm still itching to write a GlobalAlloc implementation.
 I'd love to see one, and benchmarks for the Rust compiler running atop it. Impressive work!
 Does anyone have some kind of list of the most important results in the area of memory management? I'd never heard of the Robson bounds but they seem quite crucial to know about, and I'd always been looking for something like them. Are there other such results?
 >I'd never heard of the Robson bounds but they seem quite crucial to know aboutI'd argue it's not really important in modern systems due to the use of virtual memory. With virtual memory, you can massive reduce fragmentation since you are no longer managing a single continuous region of memory.Fragmentation still exists, but it only applies to "small" allocations (think smaller than a page) and in most scenarios it is a result of modern memory allocators preferring performance over reduced memory usage. Memory is cheap nowadays, and performance is king.>Does anyone have some kind of list of the most important results in the area of memory management?There is a survey [1] on the subject of memory allocation that mentions Robson's bounds and other results
 Thanks!!
 Source code is available here in case anyone's interested: https://github.com/plasma-umass/Mesh
 Random allocation sounds like it could potentially hurt cache for some workloads, but still, this is easily the coolest idea I've seen so far this year.It's beyond amazing we can still innovate on something like a memory allocator using hardware tricks that have been available for nearly 40 years. Leave no stone unturned!
 Hi! I'm one of the authors -- thanks so much!We thought similarly about cache performance, but weren't able to find specific examples (most of the slow benchmarks we dug into were due to unoptimized allocator performance rather than poorer caching).Informally I think one of the reasons for this is that in long running problems, allocations returned from malloc end up being pretty random rather than contiguous -- as allocations are freed they are put in freelists, and freelists are designed more for temporal locality rather than spatial.
 On free, other allocators are likely to return the last freed address but with mesh it will return a random location (i.e. hurt temporal locality). But I guess in modern CPUs it doesn't matter that much because that whole page probably cached anyway, selecting any offsets from that page would be similarly good.BTW, this is really cool project. Do you envision any difficulties of merging this into jemalloc?
 _wmd on Feb 17, 2019 jemalloc is written in C :) That aside, I don't think merging with an existing allocator, if it weren't obscenely unlikely due to the general conservatism in such projects, would necessarily accomplish anything.In terms of contrasts it'd be like trying to merge Ruby with Python.. this project's approach stands independently. It may be better to ask what features it can gain from merging with something like jemalloc
 Yeah, it is interesting to see whether Jason Evans is interested enough to port some of the ideas over to jemalloc. My original wording is poor and shouldn't be read as suggesting merging the codebase literally. Just from reading the paper, the idea itself is simple enough to get ported to any modern allocators (with the only sticky point, as discussed in this thread earlier, about the random allocation algorithm). The overhead in finding / merging pages would be interesting topic to see whether it is reasonable for these modern allocators as well. Wondering whether there are other difficulties the authors saw about porting the idea to other allocators.
 Couldn't the same algorithm with different parameters be used to increase cache utilization?
 Interesting! I had always considered in the back of my head that there might be a way to reduce memory fragmentation by basically passing that cost on to virtual memory, but the meshing approach used here is a pretty cool implementation of this. Here’s the implementation, it looks like: https://github.com/plasma-umass/mesh
 Very clever use of virtual memory.Unfortunately I would expect this technique not to scale well with increasing page size, and I sure hope that we will be slowly moving to >4k pages as years go by since the TLB is a very common bottleneck.
 Well, there are 2 sides to increasing page sizes. If you can fit more objects on a page, there is a higher chance of collision at a given occupancy %, reducing the effectiveness of meshing. On the other hand, it means that very low occupancy pages (like single allocations) would cause an even larger amount of memory to be held onto by a process - a situation Mesh would be able to handle well.We didn't look at huge pages in the paper for the reason that today, if you care most about heap footprint you can always use 4K pages rather than 2M ones. (and lots of database-like applications, like Redis, explicitly suggest disabling huge pages)
 Wait, are you talking about page size, or nr of pages? Either way, that all depends on how scaling either of them up would affect the distributions of these randomized algorithms they refer to, no?> Mesh makes meshing both efficient and provably effective (with high probability) by combining it with two novel randomized algorithms. First, Mesh uses a space-efficient randomized allocation strategy that effectively scatters objects within each virtual page, making the above scenario provably exceedingly unlikely. Second, Mesh incorporates an efficient randomized algorithm that is guaranteed with high probability to quickly find candidate pages that are likely to mesh.> Meshing is only possible when no objects on the pages occupy the same offsets. A key observation is that as fragmentation increases (that is, as there are more free objects),the likelihood of successfully finding pairs of pages that mesh also increases.
 Yeah, I agree, I wonder how the overall costs will work out, with the increased TLB pressure
 That is a cool technique. I wonder if it can be translated into a garbage collected language, to create a nonmoving and compacting GC.
 There's no reason it could not, at least the authors think so in principle:> Compacting garbage collection in managed languages:> Compacting garbage collection has long been a feature of languages like LISP and Java [12, 14]. Contemporary run- times like the Hotspot JVM [20], the .NET VM [19], and the SpiderMonkey JavaScript VM [7] all implement compaction as part of their garbage collection algorithms. Mesh brings the bene ts of compaction to C/C++; in principle, it could also be used to automatically enable compaction for language implementations that rely on non-compacting collectors.
 This is how garbage collection works (CLR is one implementation):https://en.wikipedia.org/wiki/Mark-compact_algorithm
 That is very much a moving garbage collector, marked (live) objects get moved to the start of the heap for compaction. GP is specifically asking about the applicability of Mesh/TFA's non-moving capabilities for GCd systems: moving objects requires the ability to update in-system references somehow, and makes FFI more complex as references either can't be shared/leaked or specific pinning API have to be set up (or some updatable addressing scheme has to be provided but then you don't get straightforward pointers anymore).
 yxhuvud on Feb 17, 2019 No, that clearly updates all the pointers after moving the allocations. Mesh is different as it works on the page level and use the virtual memory to achieve compaction without changing the pointers.
 I have trouble understanding why this works. Mesh works by finding pairs of pages and merging them together physically but not virtually works, until a program makes alloc calls that end up allocating some memory in one of of those virtual pages and some of it in the other. What guarantees that those two allocations do not have overlap in physical memory?Also, what happens when such a page with multiple virtual mappings get paged out and paged in again? Does it stay a merged page?
 Since programs are only permitted read and or modify addresses within allocations handed out by malloc, they are oblivious to the merge process as pieces of another page are copied to virtual addresses near an existing allocation in a page they are accessingSo long as mprotect() is used to ensure the source page is stable for the duration of the copy and until the mremap() is complete, it should not be possible for the program to observe any difference in behaviour -- except perhaps if it tries to modify an allocation on a page that is currently being merged by malloc() running in another thread (presumably they have some kind of SIGSEGV handler to catch this)
 Memory allocators use memory for two reasons: to dole it out to application, and to keep track of what memory is used and what is free.From the pictures, I thought merging only merged the first, but rereading, I guess the former is stored in the same blocks as the latter. They still need to make sure that multi-threaded access through different virtual pages works, but that seems doable, too.
 If it’s a plug-in replacement and has little or no performance penalty, then could it just be mainlined?
 Mainlined where? There is no mainline for allocators, C and C++ rely on the system-provided allocator as their baseline, possibly replaced by an alternative / custom allocator (though the same allocator could be system-provided in one place and custom on another e.g. jemalloc is the system allocator for freebsd but commonly used as custom allocator on other platforms), supplemented with bespoke allocators for task-specific schemes (e.g. arena or linear allocators where that makes sense).You could try to get the mesh methodology integrated into existing allocators I guess, but there is no such thing as a "mainline allocator".
 e.g. jemalloc is the system allocator for freebsd but commonly used as custom allocator on other platformsI think they just mean mainlined on a particular platform. If FreeBSD were to switch its system allocator from jemalloc to mesh, for example, I’d count that as mainlining.

Search: