
Roaring bitmaps: A better compressed bitset - SwellJoe
http://roaringbitmap.org/
======
karterk
We use EWAH (another compressed bitset from Daniel Lemire) heavily at work to
store inverted indices in a compressed form. Apart from lowering the memory
foot print (vs storing sorted integers), AND/OR operations are much faster
when you want to do stuff like intersecting 2 bitsets to find the overlapping
members etc.

You should checkout some of the other impressive compression related stuff
that Daniel has open sourced:
[https://github.com/lemire](https://github.com/lemire)

------
toolslive
This is quite nice

There are many flavours of bitsets. Sometimes it's not the largest element
that's a limit but the cardinality. You should also take note of the cost of
clearing a bitset: This cost is almost never included in performance
considerations. For instance, when you're making small random subsets (of size
k) of a set (size n) you might be tempted to use a bitset and recycle it
between subsets. you typically end up with a performance of k.O(1) for set but
O(n) for a clear.

pick your poison ;)

------
tromp
How do these compare space and performance wise with Judy arrays, which are
256-ary trees whose nodes also distinguish between sparse and dense subsets?
[https://en.wikipedia.org/wiki/Judy_array](https://en.wikipedia.org/wiki/Judy_array)

~~~
MrBuddyCasino
Good question, since the patent on them won't expire until Nov 29, 2020.

------
victorhooi
Aha, this is interesting =).

So this is basically just an array of booleans, and each element has two
states.

But are there equivalent fancy compressed data structures for more than two
states? (i.e. like a compressed array of enumerations) Or does that not make
sense for some reason?

~~~
andreasvc
Yes, it's a research area called succinct data structures. E.g. given work on
DNA in bioinformatics, a lot of work is on elements with 4 stats.

------
javajosh
How would I use this data-structure in an application? Is it basically a
replacement for a byte array with improved space characteristics?

~~~
twic
Semantically, it's a boolean array. Or, to look at it another way, a set of
integers, where a[n] is true if n is in the set.

This representation, and the various others the paper compares it to, are
useful when the array is fairly sparse - the introduction suggests the range
1/10000 to 1/64 of the bits being set. They're a lot more compact, but still
support key operations - ORing, ANDing, testing a particular bit, enumerating
all set bits - fairly efficiently.

The major motivating use of structures like this is for bitmap indexes in
databases. Bitmap indexes store a single bit of data per record - it could be
"is this person a current employee?", "has this user ever bought anything?",
etc. If you can encode these compactly, you can use them to very quickly
winnow vast amounts of data down to a small number of interesting records,
which you can then retrieve one by one.

------
joe_the_user
It would be nice to have a breakdown of the trade offs that this data
structure involves in terms of insertion time and storage space.

~~~
SwellJoe
It's covered pretty well in the paper they link to:

"On the Java platform we used for our experiments, we estimate that we can
compute and write bitwise ORs at 700 million 64-bit words per second. If we
further compute the cardinality of the result as we produce it, our estimated
speed falls to about 500 million words per second. That is, we suffer a speed
penalty of about 30% because we maintain the cardinality. (We find that this
penalty effectively disappears when using an equivalent optimized
implementation written in C.) In contrast, competing methods like WAH and
Concise must spend time to decode the word type before performing a single
bitwise operation. These checks may cause expensive branch mispredictions or
impair superscalar execution."

They also discuss _why_ this is so (specialized instructions in modern
processors, and modern processors and superscaler allowing execution of
multiple such operations per cycle). All of this is in the section that begins
"Bitmap vs. Bitmap".

