This project presents a new data structure for storing ordered integer maps. The proposed data structure is a compressed, cache-friendly radix tree with performance comparable to an unordered map (`std::unordered_map`) and an order of magnitude faster than an ordered map (`std::map`).
It's similar, but not quite the same. IntMap in Haskell uses a bit as the prefix unit, with a span factor of 2; this uses a nibble as the prefix unit, with a span factor of 16. Also, IntMap uses a bitmask for the range of prefix units in a node, while this uses a subnet-mask style computation to get the prefix of a node.
I'm not sure what you mean by "necessarily must be persistent" here, but under my interpretation (that only persistent implementations of this data structure are possible), that's not true.
The IntMap paper by Chris Okasaki acknowledges that the data structure is not a new one (it's a PATRICIA tree): it was already around 30 years old when his paper was published and, he argued, deserved to be better known. The functional/persistent implementation is what's novel about Okasaki's paper.
Edit: The data structure this submission is about looks like great work! Excited to see it and will probably attempt my own ports of it to other languages.
Sorry I wasn't clear. What I meant is that an idiomatic Haskell data structure needs to be persistent: an insertion needs to produce a new version of the data structure containing the inserted element while keeping the original intact. Almost all Haskell data structures need to satisfy this requirement; otherwise using them is a lot of PITA, even though the ST monad makes mutation fairly easy. However, the original submission is not persistent, so the two aren't comparable.
This really doesn't seem to be comparing to comparable data structures. For int map specializations like this, the optimized alternatives are things like Judy (which is looking quite aged these days) or roaring bitmaps, not to mention that any C++ developer using "ordinary" maps will be using absl's SwissTable (flat_hash_map) or folly's F14 (F14FastMap) or perhaps absl::btree_map if order is important. Comparisons to std::map and std::unordered_map are simply too naive to make the case for this data structure.
Oh come on, don't be too harsh. This is an ordered map; most of the ones mentioned are unordered maps. They might be fast, but they are unordered. The only real comparables are absl::btree_map and Judy arrays. Without benchmarking, my gut feeling is this will beat absl::btree_map: tries usually beat B-trees.
I've written a lot of high performance/scale C++ code using a lot of data structures over the years, and ordered iteration has been very rarely needed; unordered data structures still rule the day in performance the vast majority of the time, and their lower constant factors very frequently outperform more specialized data structures. They're absolutely worth benchmarking against if the goal is actual uptake in the actual world.
In my experience, practically every single time I've used absl::btree_map, I've ended up reverting for performance reasons to either a flat hash map or, in some relatively rare cases, a sorted vector map (despite its O(n) insert/erase), because the constant factors are _so doggone low_. The pattern holds: btree_map (or skip lists, or whatever) has, in my experience, essentially never remained in the code across over half a million lines of C++.
Also, I presume (based on the implementation details, not based on actual use) that roaring bitmaps have some reasonable iteration API that would make them relevant even to the ordered comparison.
The important thing here is that if anyone wants to contend that someone should use their new data structure because it's better or faster or more optimal for some particular use case, it's important for them to demonstrate that they've thoroughly investigated the data structure space around their proposal. Comparison against std::map and std::unordered_map simply doesn't demonstrate the kind of comprehensive knowledge that I would expect from someone who claims that I should use their optimized integer map in my own code.
That just means you've never needed to implement query or search functionality. Range queries and searches are used everywhere, and they need ordered maps.
I'm not saying we should rush to adopt it right away, but at least don't dismiss it hastily. It has some novel ideas that advance the state of the art.
What novel ideas did you see here? To me this looks like a standard 16-way compressed trie. The node encoding is quite natural if you consider the limitation of 64M entries (which is really not a lot). Did I miss something?
We're using a similar trie structure as the main document (node) index in SirixDB[1]. Lately I've gotten some inspiration from ART and HAMT for using different page sizes, mainly for the rightmost inner pages: since the node IDs are generated by a simple sequence generator, all inner pages (we call them IndirectPage) except the rightmost ones are fully occupied, and the tree height adapts dynamically to the size of the stored data. Currently, each inner page always stores 1024 references to indirect child pages, but I'll experiment with smaller sizes, as the inner pages are simply copied for each new revision, whereas the leaf pages storing the actual data are versioned themselves with a novel sliding-snapshot algorithm.
From the unique 64-bit nodeId assigned to each data record, you can simply compute the page and the reference to traverse at each level of the trie through some bit shifting.
This looks very good. The idea of using a subnet-mask style computation for the prefix of a node is pretty novel; I haven't seen anything like it. The choice of a span factor of 16 is a good compromise between node size and tree depth. The node slot packing is amazing. Actually, if you relax the restriction from 64-byte nodes to 128-byte nodes, you get 64 bits per slot and a much higher limit on the item count. Newer CPUs are starting to use 128-byte cache lines.
I believe it's quite different. If I'm not mistaken, flat_hash_map is a typical hash table, with a hash function mapping the key to a bucket. Its novel part is using 7 bits of the hash value as a tag matched against the control bytes to check for the presence of an item. It's like a mini Bloom filter: a negative match means the item is definitely not in the bucket, while a positive match means the item might be there, and a subsequent search through the bucket confirms its existence.
It is mentioned prominently enough in the readme, I would say. 2²⁶ entries is plenty for many applications. I currently use a persistent intmap to label continuations in a toy CPS compiler, and with my current approach there is no chance whatsoever that any conceivable program would compile to use that many continuation labels.