Show HN: Integer Map Data Structure (github.com/billziss-gh)
91 points by billziss 10 months ago | 27 comments
This project presents a new data structure for storing ordered integer maps. The proposed data structure is a compressive, cache-friendly radix tree that has performance comparable to an unordered map (`std::unordered_map`) and is an order of magnitude faster than an ordered map (`std::map`).



FWIW there is prior art here. e.g. see IntMap in Haskell: https://hackage.haskell.org/package/containers-0.7/docs/Data...


It's similar but not quite the same. IntMap in Haskell uses a bit as the prefix unit, with a span factor of 2; this uses a nibble as the prefix unit, with a span factor of 16. Also, IntMap uses a bitmask for the range of prefix units in a node, while this uses a subnet-mask style to get the prefix of a node.
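
A rough sketch of the difference (hypothetical helper functions, not the library's actual API): bit-granular descent consumes one bit of the key per level, nibble-granular descent consumes four, and the subnet-mask style recovers a node's prefix by masking the key, much like applying an IP netmask.

    #include <cstdint>

    // With a span factor of 16, each trie level consumes one nibble
    // of the key (here a 32-bit key; level 0 is the topmost nibble).
    static unsigned nibble_at(uint32_t key, unsigned level) {
        return (key >> (28 - 4 * level)) & 0xF;
    }

    // Subnet-mask style: a node stores its prefix length; the prefix
    // itself is recovered by masking the key, IP-netmask fashion.
    static uint32_t prefix_of(uint32_t key, unsigned prefix_bits) {
        uint32_t mask =
            prefix_bits ? ~uint32_t(0) << (32 - prefix_bits) : 0;
        return key & mask;
    }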


It hardly seems comparable given that this Haskell data structure must necessarily be persistent.


I'm not sure what you mean by "must necessarily be persistent" here, but under my interpretation (that only persistent implementations of this data structure are possible) that's not true.

The IntMap paper by Chris Okasaki acknowledges that the data structure is not a new one (it's a PATRICIA tree), but one that was already around 30 years old at the time his paper was published and deserved to be better known. The functional/persistent implementation is what's novel about Okasaki's paper.

Edit: The data structure this submission is about looks like great work! Excited to see it and will probably attempt my own ports of it to other languages.


Sorry, I wasn't clear. What I meant is that an idiomatic Haskell data structure needs to be persistent: an insertion needs to produce a new version of the data structure containing the inserted element while keeping the original. Almost all Haskell data structures need to satisfy this requirement, otherwise using them is a real PITA, even though the ST monad makes mutation fairly easy. The original submission is not persistent, so the two aren't comparable.


Oh, that makes sense and I agree. Appreciate the clarification!


This really doesn't seem to be benchmarking against comparable data structures. For int-map specializations like this, the optimized alternatives are things like Judy (which is looking quite aged these days) or Roaring bitmaps, not to mention that any C++ developer using "ordinary" maps will be using absl's SwissTable (flat_hash_map), folly's F14 (F14FastMap), or perhaps absl::btree_map if order is important. Comparisons to std::map and std::unordered_map are simply too naive to make the case for this data structure.


Oh come on, don't be too harsh. This is an ordered map; most of the ones you mention are unordered. They might be fast, but they are unordered. The only comparables are absl::btree_map and Judy arrays. Without benchmarking, my gut feeling is this will beat absl::btree_map: tries usually beat B-trees.


I've written a lot of high-performance/scale C++ code using a lot of data structures over the years, and ordered iteration has been very rarely needed; unordered data structures still rule the day in performance the vast majority of the time, and their lower constant factors very frequently outperform more specialized data structures. They're absolutely worth benchmarking against if the goal is actual uptake in the real world.

In my experience, practically every single time I've used absl::btree_map, I've ended up reverting for performance reasons to either a flat hash map or, in some relatively rare cases, a sorted vector map (despite its O(n) insert/erase) because the constant factors are _so doggone low_. The experience remains: btree_map (or skip lists, or whatever) has essentially never, in over half a million lines of C++, actually remained in the code.
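
For the curious, a minimal sketch of the sorted-vector-map idea mentioned above (an illustration, not any particular library's implementation): O(log n) lookup via binary search, O(n) insert, but contiguous storage and very low constant factors.

    #include <algorithm>
    #include <cstdint>
    #include <utility>
    #include <vector>

    struct SortedVecMap {
        std::vector<std::pair<uint64_t, uint64_t>> v;  // sorted by key

        uint64_t* find(uint64_t key) {
            auto it = std::lower_bound(v.begin(), v.end(), key,
                [](const auto& p, uint64_t k) { return p.first < k; });
            return (it != v.end() && it->first == key) ? &it->second
                                                       : nullptr;
        }

        void insert(uint64_t key, uint64_t val) {
            auto it = std::lower_bound(v.begin(), v.end(), key,
                [](const auto& p, uint64_t k) { return p.first < k; });
            if (it != v.end() && it->first == key)
                it->second = val;             // overwrite existing key
            else
                v.insert(it, {key, val});     // O(n) shift, still fast
        }
    };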

Also, I presume (based on the implementation details, not based on actual use) that roaring bitmaps have some reasonable iteration API that would make them relevant even to the ordered comparison.

The important thing here is that if anyone contends that someone should use their new data structure because it's better, faster, or more optimal for some particular use case, they need to demonstrate that they've thoroughly investigated the data-structure space around their proposal. Comparison against std::map and std::unordered_map simply doesn't demonstrate the kind of comprehensive knowledge I would expect from someone claiming I should use their optimized integer map in my own code.


That just means you've never had to implement query or search functionality. Range queries and searches are used everywhere, and they need ordered maps.

I'm not saying we should rush to adopt it right away, but at least don't dismiss it hastily. It has some novel ideas that advance the art.


What novel ideas did you see here? To me this looks like a standard 16-way compressed trie. The node encoding is quite natural if you consider the limitation of 64M entries (which is really not a lot). Did I miss something?


Subnet mask style prefix construction is pretty novel.


We're using a similar trie structure as the main document (node) index in SirixDB [1]. Lately I've taken inspiration from ART and HAMT for different page sizes, mainly for the rightmost inner pages: node IDs are generated by a simple sequence generator, so all inner pages (we call them IndirectPage) except the rightmost are fully occupied, and the tree height adapts dynamically to the size of the stored data. Currently each inner page stores 1024 references to child pages, but I'll experiment with smaller sizes, since the inner pages are simply copied for each new revision, whereas the leaf pages storing the actual data are versioned themselves with a novel sliding-snapshot algorithm.

Each data record is assigned a unique 64-bit nodeId, and from it you can simply compute the page and reference to traverse at each level of the trie with some bit shifting.

[1] https://github.com/sirixdb/sirix
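
A hypothetical sketch of that scheme (in C++ for consistency with the other examples here, though SirixDB itself is Java): with 1024 references per inner page, each trie level consumes 10 bits of the 64-bit nodeId.

    #include <cstdint>

    constexpr unsigned kBitsPerLevel = 10;  // log2(1024)

    // Index of the reference to follow at a given level (0 = root),
    // assuming a tree of `height` inner levels above the leaf pages.
    static unsigned referenceIndex(uint64_t nodeId, unsigned level,
                                   unsigned height) {
        unsigned shift = kBitsPerLevel * (height - 1 - level);
        return (nodeId >> shift) & ((1u << kBitsPerLevel) - 1);
    }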


Also, to throw into the mix, in C#:

https://julesjacobs.com/2014/11/11/immutable-vectors-csharp....

His implementation uses buffers of capacity 32, generics, and bit shifting to do lookups.
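
The core of such a lookup, sketched in C++ (the linked post is C#): with buffers of capacity 32, each trie level consumes 5 bits of the index, since 32 = 2^5.

    #include <cstdint>

    // Slot to follow at a given level; level 0 is the leaf level and
    // higher levels are closer to the root.
    static unsigned childSlot(uint32_t index, unsigned level) {
        return (index >> (5 * level)) & 31;
    }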


Neat, thank you! I'd love to see how it compares to the libgdx IntMap[0].

[0] https://github.com/libgdx/libgdx/blob/master/gdx/src/com/bad...


This is a radix tree (ordered, does more allocations); that is a hash table. Also, TFA is C/C++, while libgdx is Java.


yeah, just thought it'd be fun to compare :) ordered is a big difference.


Interesting! Reminds me a great deal of Judy Arrays: https://en.m.wikipedia.org/wiki/Judy_array

Judy arrays are a radix trie with wide branching and a few node types designed to be optimized for cache-line width.


Looks rad. I was going to look into some B-trees for a use case where I need an ordered map of integer-like things, and this might be better.

I couldn't immediately see: is there any mention of whether insertions invalidate iterators? Maybe not strictly needed for my use case, but good to know.


This looks very good. The idea of using a subnet-mask style to compute the prefix of a node is pretty novel; I haven't seen anything like it. The choice of a span factor of 16 is a good compromise between node size and tree depth. The node slot packing is amazing. Actually, if you relax the restriction from 64-byte nodes to 128-byte nodes, you get 64 bits per slot and a much higher limit on the item count. Newer CPUs are starting to use 128-byte cache lines.


If I understood it correctly, it's not much different from what `absl::flat_hash_map` does.

Look for "Metadata information"/control bits at https://abseil.io/about/design/swisstables


I believe it's quite different. If I'm not mistaken, flat_hash_map is a typical hash table, with a hash function mapping the key to a bucket. Its novel part is using the lower 7 bits of the hash value as a tag matched against the control bytes to check for the presence of an item. It's like a mini Bloom filter: a negative query means the item is definitely not in the bucket, while a positive query means the item might be in the bucket. A subsequent search through the bucket confirms the existence of the item.
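
A simplified sketch of that metadata idea (not Abseil's actual code): split the hash into a bucket index (H1) and a 7-bit tag (H2); probing compares the one-byte tags first, ruling slots out without touching the full keys.

    #include <cstddef>
    #include <cstdint>

    static size_t h1(uint64_t hash, size_t num_buckets) {
        // num_buckets assumed to be a power of two
        return (hash >> 7) & (num_buckets - 1);
    }
    static uint8_t h2(uint64_t hash) {
        return hash & 0x7F;  // low 7 bits -> control-byte tag
    }

    // bool maybe_present = (ctrl[slot] == h2(hash));
    // Only on a tag match is the stored key itself compared.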

This int map is a trie with an integer-based key.


Interesting, but the summary does not mention an important fact: the data structure can contain at most 67108864 items, which is quite a low limit.


It is mentioned prominently enough in the readme, I would say. 2²⁶ is plenty for many applications. I currently use a persistent intmap to label continuations in a toy CPS compiler, and with my current approach there is no chance whatsoever that any conceivable program would compile to use that many continuation labels.


This limit comes from using int32s. Does anyone have an idea why it isn't making use of int64s on modern machines?
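
One plausible back-of-the-envelope reading (an assumption, not taken from the repo): a 64-byte node with a span of 16 gives 16 slots of 32 bits each, and if each slot reserves some bits for tags/metadata, what remains addresses nodes.

    #include <cstdint>

    constexpr unsigned kSlotBits = 32;  // 16 x 32 bits = 64-byte node
    constexpr unsigned kTagBits  = 6;   // assumed tag/metadata bits
    constexpr uint64_t kMaxItems =
        1ull << (kSlotBits - kTagBits); // 2^26 = 67108864

Widening slots to 64 bits (128-byte nodes) would lift the limit, at the cost of a node no longer fitting in a 64-byte cache line.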


Appears to be a 16-way branching trie, which completely misses both advantages of tree structures over hashes:

1/ This tree is mutable: insert doesn't give you a new tree via path copying (see the sketch after this list)

2/ Union/intersection-style operations can be sublinear, yet none of the batch operations are implemented
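
A minimal sketch of what point 1/ asks for, path copying in a 16-way trie (hypothetical code, not the submission's): insert copies only the nodes along the root-to-leaf path (at most 8 for a 32-bit key with 4-bit spans) and shares every untouched subtree between the old and new versions.

    #include <cstdint>
    #include <memory>

    struct Node {
        std::shared_ptr<Node> child[16];
        bool leaf = false;
        uint64_t value = 0;
    };

    static std::shared_ptr<Node> insert(const std::shared_ptr<Node>& n,
                                        uint32_t key, uint64_t val,
                                        unsigned level = 0) {
        // Copy this node; children are shared_ptrs, so untouched
        // subtrees are shared rather than copied.
        auto copy = n ? std::make_shared<Node>(*n)
                      : std::make_shared<Node>();
        if (level == 8) {  // consumed all 8 nibbles of the key
            copy->leaf = true;
            copy->value = val;
        } else {
            unsigned slot = (key >> (28 - 4 * level)) & 0xF;
            copy->child[slot] =
                insert(copy->child[slot], key, val, level + 1);
        }
        return copy;  // the old root and its versions remain intact
    }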


It'd be nice to include djb's crit-bit tree implementation (linked to from the intro) in the benchmarks ("Performance Test" section).




