
Bitpacking and Compression of Sparse Datasets - luu
http://moderndescartes.com/essays/bitpacking_compression
======
Opteron67
[http://roaringbitmap.org/](http://roaringbitmap.org/)

~~~
79d697i6fdif
+1 to this. OP is using the wrong flavor of compression for the task at hand.
OP, look up sparse bitmaps.
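
For anyone who wants to poke at this, here's a minimal sketch of a sparse
bitmap using the third-party `pyroaring` package (my own assumption, not
anything from the post): occupied board positions are stored as a compact set
of integer indices rather than one value per intersection.

```python
from pyroaring import BitMap  # assumption: the third-party `pyroaring` package

# A Roaring bitmap stores a sparse set of integers compactly.
# Here, board positions are flattened to indices 0..360 on a 19x19 grid.
occupied = BitMap()
occupied.add(3 * 19 + 4)     # a stone at row 3, column 4
occupied.add(15 * 19 + 16)   # a stone at row 15, column 16

print(len(occupied))             # -> 2
print((3 * 19 + 4) in occupied)  # -> True
```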

~~~
gwern
Are these really sparse? He is working with Go boards. Go games are usually
hundreds of moves and by the end of the game, most of the board will be
populated with stones, so on average somewhere under half of all recorded
intersections will have stones (the crowded late game positions balance the
empty early game positions). I was under the impression that sparse matrix
representations usually require a lot of sparsity to shine.

------
derefr
Regarding the benefits of knowing your data on compression: rearranging the
dimensions of a dataset can also yield interesting increases in
compressibility.

For example, you can store your position records arr[T][X,Y] as arr[X,Y][T]:
that is, sequences of bits representing how each point on the grid evolves
over time, rather than sequences of bits representing boards. These single-
position time-series should both be more internally predictable (lower
entropy), and more isotropic (having similarity with other features, allowing
secondary compressibility) than the original ones, and you should be able to
repopulate the in-memory 3D array from the serialized representations in
roughly the same time (ignoring caching effects, because a 19x19x28 array is a
very small array.)
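
A minimal sketch of that rearrangement in numpy (my own illustration, not the
post's code): stones, once placed, tend to stay put, so each point's time
series is mostly constant, and zlib sees long runs after the transpose.

```python
import zlib
import numpy as np

# Hypothetical data: 28 snapshots of a 19x19 Go grid, where stones stay put
# once placed, so each point's time series is mostly constant.
rng = np.random.default_rng(0)
boards = np.zeros((28, 19, 19), dtype=np.uint8)   # arr[T][X, Y]
filled = np.zeros((19, 19), dtype=np.uint8)
for t in range(28):
    filled |= (rng.random((19, 19)) < 0.05).astype(np.uint8)  # a few new stones
    boards[t] = filled

by_time  = boards.tobytes()                      # board after board
by_point = boards.transpose(1, 2, 0).tobytes()   # arr[X, Y][T]: per-point series

print("by time: ", len(zlib.compress(by_time)))
print("by point:", len(zlib.compress(by_point)))
```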

Of course, this doesn't always work: the same magic _could_ be applied to
videos... but decompressed videos _do not_ fit easily into memory, let alone
are they small enough to ignore cache-coherency during reads/writes, so
actually _doing_ the dimensional transforms is a bit implausible. (But if we
_really needed_ a video squeezed 10x more than we do today, and were willing
to spend hours on both ends doing so, it'd certainly be possible.)

------
ragle
> If you know the structure of your data, you can easily do a better and
> faster job of compressing than a generic compression algorithm.

I'd change this to: the ease of doing a better and faster job of compressing
than a generic compression algorithm will be a function of the data's
Kolmogorov Complexity.

You can easily know your data's structure (e.g. strings no larger than N bytes
containing natural language in a line-separated file) and not be able to
(easily) do better than a generic compressor due to the complexity of the
required compression / reconstruction operations.

~~~
brilee
Not quite; I would describe it as moving the complexity into the algorithm.
For example, even though the bitpacking from the post achieves fast good
compression, it is unable to compress floats in general; merely 1.0 and 0.0.
The knowledge of that correspondence has been moved into the algorithm.
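
For concreteness, a minimal sketch of that kind of bitpacking (my own, using
numpy; the post's actual code may differ): the knowledge that only 0.0 and 1.0
occur lives entirely in the pack/unpack pair.

```python
import numpy as np

def pack(floats: np.ndarray) -> bytes:
    # The array is known to contain only 0.0 and 1.0, so each 8-byte float
    # can be stored as a single bit (8 values per byte).
    return np.packbits(floats.astype(np.uint8)).tobytes()

def unpack(packed: bytes, n: int) -> np.ndarray:
    bits = np.unpackbits(np.frombuffer(packed, dtype=np.uint8), count=n)
    return bits.astype(np.float64)   # the 1.0/0.0 correspondence lives here

data = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 1.0])
assert np.array_equal(unpack(pack(data), len(data)), data)
```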

~~~
gopalv
> For example, even though the bitpacking from the post achieves fast good
> compression, it is unable to compress floats in general; merely 1.0 and 0.0.

Fewer bits does not always mean better compression, particularly if the data
has other patterns which are destroyed by the packing.

With Apache ORC, I found out that if you bit-pack data to, say, 7 bits instead
of leaving it at 8 bits, the 8-bit version compresses much better with Zlib
than the 7-bit version.

This had to do with each value getting bit-shifted relative to the previous
byte sequence, until what had been a sequence of repeating bytes turned into a
pattern that repeats far less often.

Leaving the extra bit in place helped Zlib's dictionary encoding and Huffman
coding work much better than trying to save a bit.

The final kicker was that the 24-bit sequence was faster to read than a 23- or
21-bit sequence, purely because word-aligned data can be decoded with SIMD.

I'm no better at guessing what would work - "whatever works ... works", so try
them.

~~~
delhanty
Your comment, particularly the degradation in byte alignment leading to worse
subsequent compression, reminds me of the following Hacker News comment by
NelsonMinar from the end of November 2016:

[https://news.ycombinator.com/item?id=13049894](https://news.ycombinator.com/item?id=13049894)

> It's crucial to evaluate encoding space usage in the context of compression.
> For instance gzip(base16(data)) is often smaller than gzip(base64(data)) for
> practical data. Even though base64 is more efficient than base16, it breaks
> up data across byte boundaries which then makes gzip significantly less
> efficient.
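
If you want to try that comparison yourself, here's a quick sketch (my own,
with a made-up JSON-ish payload standing in for "practical data"):

```python
import base64
import gzip

# Made-up sample data; swap in your own payload to test the claim.
data = "\n".join('{"user": %d, "event": "click"}' % i
                 for i in range(20_000)).encode()

b16 = base64.b16encode(data)  # 2 chars per byte, repeats stay byte-aligned
b64 = base64.b64encode(data)  # 4 chars per 3 bytes, repeats shift across bytes

print("raw:   ", len(data), "->", len(gzip.compress(data)))
print("base16:", len(b16), "->", len(gzip.compress(b16)))
print("base64:", len(b64), "->", len(gzip.compress(b64)))
```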

------
ramzyo
Informative read, seems like the author learned some things and empirically
discovered that his/her solution was ultimately a case of balancing trade-offs
in generality vs specificity. As stated in the "Conclusions" section toward
the bottom of the page, "If you know the structure of your data, you can
easily do a better and faster job of compressing than a generic compression
algorithm."

------
Const-me
I recently used LZ4 compression to save RAM used by infrequently accessed data
in my app.

LZ4 is 1.5 times faster than Snappy, and its compression is better:

[https://www.percona.com/blog/2016/04/13/evaluating-database-compression-methods-update/](https://www.percona.com/blog/2016/04/13/evaluating-database-compression-methods-update/)
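
As a sketch of that pattern (assuming the third-party `lz4` Python package,
not whatever Const-me's app actually uses): keep cold data compressed in
memory and decompress it on access.

```python
import lz4.frame  # assumption: the third-party `lz4` package (pip install lz4)

class ColdBlob:
    """Holds rarely-touched data compressed in RAM."""

    def __init__(self, data: bytes):
        self._packed = lz4.frame.compress(data)

    def get(self) -> bytes:
        return lz4.frame.decompress(self._packed)

blob = ColdBlob(b"rarely accessed payload " * 10_000)
assert blob.get().startswith(b"rarely accessed")
```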

------
jakozaur
Try Zstandard. You can tune compression speed vs. compression ratio.
Decompression is always fast.

It was recently released:
[https://github.com/facebook/zstd](https://github.com/facebook/zstd)
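
A minimal sketch with the third-party `zstandard` Python bindings (my
assumption; zstd itself is a C library): the level knob trades compression
speed for ratio, while decompression speed stays roughly constant.

```python
import zstandard as zstd  # assumption: the third-party `zstandard` package

data = b"some repetitive payload " * 10_000

# Level 1 is fast, higher levels trade speed for a tighter result.
fast = zstd.ZstdCompressor(level=1).compress(data)
tight = zstd.ZstdCompressor(level=19).compress(data)
print(len(fast), len(tight))

# Decompression does not depend on the level used to compress.
assert zstd.ZstdDecompressor().decompress(tight) == data
```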

------
oever
RDF HDT (Header Dictionary Triples) is a clever storage format for linked
data.

[http://www.rdfhdt.org/hdt-internals/](http://www.rdfhdt.org/hdt-internals/)

