
Benchmarking 32bit integer key compression for a C++ Btree - cruppstahl
https://upscaledb.com/0012-32bit-integer-compression-algorithms-part2.html
======
mtanski
We do something similar with compressing integers. Unlike the article we pick
from 3 different compression schemes: SIMD FastPFOR, Bitpack (max value) and
delta encoding.

Which one we pick depends on the distribution of the integers. FastPFOR we use
when the values are mostly smallish but there's an occasional large value.
Works great. Bitpack we use when there is a tight band (ex. all values are ~5
-> ~7 bits). Delta works well for (mono) increasing values.

Also, we've adapted most of these to 64bit integers since we work with those
more commonly.

~~~
cruppstahl
That's interesting - are the 64bit versions available for download?

~~~
mtanski
My junior (at the time) coworker created the 64bit implementations by reading
the original papers. But you should be able to find other implementations on
github now, since the paper has been out for a while. Our code is pretty tied
to software (our allocation / buffer handling).

In hindsight, it might not have been the best idea to say. Hey you just
started working with C++ and never worked with SIMD, so yeah... I need to
extend this SIMD Fast PFOR scheme to 64bit. But he ended up doing quite well.

------
ddorian43
How about using a bitlist/bitmap/bitset/bitstring/bitarray, maybe even a
compressed one, like roaring-bitsets ?

Edit: question: is there any type of index better than compressed bitsets when
you have per-column index and you want to AND,OR, basically combine them ?

~~~
nkurz
_is there any type of index better than compressed bitsets when you have per-
column index and you want to AND,OR, basically combine them ?_

If your index is denser than ~.1 and sparser than ~.9, a simple bitmap is
almost certainly the fastest solution for union and intersections, and likely
the most compact as well. SIMD is so simple and fast over these that they are
hard to beat.

At higher or lower densities than this, "something else" is probably better.
The idea of a compressed bitset (like Roaring) is to use a different encoding
depending on the density. So in that sense, a compressed bitset is the best
you can do.

That said, there is still lots of room for improvement among compressed
bitsets. Among other things, I don't think there are any that yet deal well
with "inverted selection" for very high densities. Most papers just avoid the
issue by assuming "stop lists", which might not be possible in a real world
database. Intersections between different density columns using different
encodings still has lots of optimization potential too.

