
On-the-fly encoding and decoding of bitmaps - amallia
https://www.antoniomallia.it/on-the-fly-encoding-and-decoding-of-bitmaps.html
======
willglynn
There are a lot of tools available in the "compressed bitmap" space. When last
I needed something like this, I evaluated several approaches and ended up
selecting Roaring Bitmaps:

[https://arxiv.org/abs/1402.6407](https://arxiv.org/abs/1402.6407)

[http://roaringbitmap.org](http://roaringbitmap.org)

That application had some particular wins relating to large sparse bitmaps,
especially the ability to efficiently perform binary operations directly on
their compressed representations. I didn't need random insertions, so this
ended up being a win across the board: faster runtime, lower memory footprint,
and lower I/O requirements versus a normal bitset.

~~~
amallia
Roaring Bitmaps is for sure another big thing out there. I will definitely
write about it soon.

------
leni536
> boost::math::binomial_coefficient<double>

Using double here seems to be unsuitable for this kind of algorithm. The boost
algorithm seems to be tuned for floating point arithmetic for large n and k
values. An integral binomial_coefficient can be written rather easily. Such a
function can easily be declared constexpr and can be used to create a lookup
table at compile time.

Also I don't see how your algorithm is constant time, unless you can write a
bitmap class with a constant time operator[]. I expected the lookup to be
constant time. Decoding a known block to be constant time is not such a strong
statement.

I can see that it can compress well bitmaps where there are large chunk of 0s
or 1s. I wonder what is the benefit compared to RLE. jooster's comment seems
to be valid too, you need 3 bits to store the popcount. That makes your
"compressed" example 27 bits, which is 3 bits _more_ than the original, making
it not a too compelling example for your compression algorithm.

~~~
joosters
I can't see how an algorithm to lookup bit 'i' in a compressed bitmap of size
'n' can ever be constant time.

No matter how the bitmap is compressed, the block storage space for 'b' bits
of the bitmap _cannot_ be constant (since it's impossible to always compress
'b' bits of arbitrary data into fewer than 'b' bits).

So, to find bit 'i' in the compressed form, you are going to have to parse the
compressed data. A naive approach would just linearly read through the
compressed data, giving O(n). Alternatively, the data structure could contain
indexes, pointing to various places in the bitmap. You could get the cost down
to O(log n) by making your index contain pointers to a binary subdivision of
the bitmap. But we're still not constant time.

If you index every 'x' bits, the cost drops to O(x). To reach constant time,
you would need to index every bit in the bitmap, which obviously cannot be
done without creating a data structure bigger than the original bitmap.

(Edit: fixed my misuse of linear/constant time!)

~~~
amallia

      I can't see how an algorithm to lookup bit 'i' in a compressed bitmap of size 'n' can ever be constant time.
    

What you do is decoding a block in constant time, as the block size is
constant and you need exactly block size iterations.

Accessing the i-th bit cannot be done in constant time as I already mentioned
unless you also store all the offsets (including all-zeros/ones). In that
case, all you have to do is moving to the i/b block (where is is the position
you want to access and b is the block size) by skipping (len(C)+len(O)) * i/b
bits.

~~~
joosters
But to decode a block of size 'b' takes O(b), i.e. larger blocks take longer
to decode. You can pick whatever size block you like, but smaller blocks lead
to less compression. It's just another form of indexing.

------
specialist
Ages ago, I tried to create a sparse bitmap for ANTLR’s storage of Unicode
match sets. The paper’s algorithm I tried to implement was like compressed
b-trees. I didn’t get very far.

I’ve been wondering how parser generators now Unicode bitsets.

------
joosters
_To store the cardinality of each block we would need 2 bits per block, for a
total of 12 bits._

I don’t follow this. Each block, or 4 bits, can have a cardinality of 0-4. Yet
2 bits can only store 0-3. What am I missing?

~~~
amallia
You are absolutely right! Will fix. Thanks for the feedback

