Hacker News new | comments | show | ask | jobs | submit login
Roaring bitmaps: A better compressed bitset (roaringbitmap.org)
76 points by SwellJoe 850 days ago | hide | past | web | 11 comments | favorite

We use EWAH (another compressed bitset from Daniel Lemire) heavily at work to store inverted indices in a compressed form. Apart from lowering the memory foot print (vs storing sorted integers), AND/OR operations are much faster when you want to do stuff like intersecting 2 bitsets to find the overlapping members etc.

You should checkout some of the other impressive compression related stuff that Daniel has open sourced: https://github.com/lemire

This is quite nice

There are many flavours of bitsets. Sometimes it's not the largest element that's a limit but the cardinality. You should also take note of the cost of clearing a bitset: This cost is almost never included in performance considerations. For instance, when you're making small random subsets (of size k) of a set (size n) you might be tempted to use a bitset and recycle it between subsets. you typically end up with a performance of k.O(1) for set but O(n) for a clear.

pick your poison ;)

How do these compare space and performance wise with Judy arrays, which are 256-ary trees whose nodes also distinguish between sparse and dense subsets? https://en.wikipedia.org/wiki/Judy_array

Good question, since the patent on them won't expire until Nov 29, 2020.

Aha, this is interesting =).

So this is basically just an array of booleans, and each element has two states.

But are there equivalent fancy compressed data structures for more than two states? (i.e. like a compressed array of enumerations) Or does that not make sense for some reason?

Yes, it's a research area called succinct data structures. E.g. given work on DNA in bioinformatics, a lot of work is on elements with 4 stats.

How would I use this data-structure in an application? Is it basically a replacement for a byte array with improved space characteristics?

Semantically, it's a boolean array. Or, to look at it another way, a set of integers, where a[n] is true if n is in the set.

This representation, and the various others the paper compares it to, are useful when the array is fairly sparse - the introduction suggests the range 1/10000 to 1/64 of the bits being set. They're a lot more compact, but still support key operations - ORing, ANDing, testing a particular bit, enumerating all set bits - fairly efficiently.

The major motivating use of structures like this is for bitmap indexes in databases. Bitmap indexes store a single bit of data per record - it could be "is this person a current employee?", "has this user ever bought anything?", etc. If you can encode these compactly, you can use them to very quickly winnow vast amounts of data down to a small number of interesting records, which you can then retrieve one by one.

It would be nice to have a breakdown of the trade offs that this data structure involves in terms of insertion time and storage space.

It's covered pretty well in the paper they link to:

"On the Java platform we used for our experiments, we estimate that we can compute and write bitwise ORs at 700 million 64-bit words per second. If we further compute the cardinality of the result as we produce it, our estimated speed falls to about 500 million words per second. That is, we suffer a speed penalty of about 30% because we maintain the cardinality. (We find that this penalty effectively disappears when using an equivalent optimized implementation written in C.) In contrast, competing methods like WAH and Concise must spend time to decode the word type before performing a single bitwise operation. These checks may cause expensive branch mispredictions or impair superscalar execution."

They also discuss why this is so (specialized instructions in modern processors, and modern processors and superscaler allowing execution of multiple such operations per cycle). All of this is in the section that begins "Bitmap vs. Bitmap".

One restriction is that random insertion into the bitmap is a O(N) operation. You typically expect to construct the bitmap in a sequential fashion and once constructed, treat it as immutable.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact