
Blosc – A high performance compressor optimized for binary data - tonteldoos
https://blosc.org/pages/blosc-in-depth/
======
buybackoff
It's basically a byte- or bit-shuffling filter (very fast, SIMD-optimized) in
front of several modern compressors (lz4, zstd, their own) with a
self-describing header. So if you have an array of 100 8-byte values, the
result of shuffling is the 100 first bytes, followed by the 100 second bytes,
and so on.
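A minimal sketch of that byte shuffle in pure Python (the function names are my own; real Blosc does this with SIMD):

```python
import struct
import zlib

def byte_shuffle(buf: bytes, typesize: int) -> bytes:
    """Regroup bytes: all 1st bytes of each value, then all 2nd bytes, etc."""
    n = len(buf) // typesize
    return bytes(buf[j * typesize + i] for i in range(typesize) for j in range(n))

def byte_unshuffle(buf: bytes, typesize: int) -> bytes:
    """Inverse of byte_shuffle."""
    n = len(buf) // typesize
    return bytes(buf[i * n + j] for j in range(n) for i in range(typesize))

# 100 small 8-byte values: little-endian, so bytes 2..7 of each are zero.
values = list(range(100))
raw = struct.pack("<100Q", *values)
shuffled = byte_shuffle(raw, 8)

assert byte_unshuffle(shuffled, 8) == raw
# After shuffling, the zero bytes coalesce into one long run,
# which a generic compressor handles far better.
print(len(zlib.compress(raw)), len(zlib.compress(shuffled)))
```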

It shines when values are of fixed size with lots of similar bits, e.g.
positive integers of the same magnitude. It's not so good for doubles, where
bits change a lot. Also, if storing diffs, it helps to take the diff from the
initial value in a chunk rather than from the previous value, so that the
deltas change sign less often (a sign change flips most of the high bits).
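A quick sketch of that anchoring trick (function names are mine): on a noisy but trending series, diffs from the chunk's first value keep one sign, while consecutive diffs flip sign on every dip.

```python
def deltas_from_first(vals):
    """Anchor every delta to the chunk's first value."""
    base = vals[0]
    return [base] + [v - base for v in vals[1:]]

def deltas_from_previous(vals):
    """Classic consecutive deltas."""
    return [vals[0]] + [b - a for a, b in zip(vals, vals[1:])]

chunk = [1000, 1003, 1001, 1007, 1004, 1010]
print(deltas_from_first(chunk))     # → [1000, 3, 1, 7, 4, 10]   one sign
print(deltas_from_previous(chunk))  # → [1000, 3, -2, 6, -3, 6]  flips sign
```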

From my own use case: for the same data, C# decimal (a 16-byte struct)
compresses much better than doubles (in final absolute blob size), even though
decimal takes 2x more memory uncompressed.

If data items share few similar bits/bytes, then it's the underlying
compressor that matters.

------
Xcelerate
Back when I did HPC work, I used Blosc to compress information about atoms for
molecular dynamics simulations before transferring this data over the
Infiniband interconnects. Despite the high speed of the interconnects, it was
actually faster to compress, transmit, and decompress using Blosc than to
transmit the raw data directly.

~~~
gtt
btw, I'm currently tasked with Kolmogorov complexity estimation, so could
someone recommend the best compressors (from a ratio point of view)?

~~~
Faint
[http://mattmahoney.net/dc/text.html](http://mattmahoney.net/dc/text.html), is
pretty much the scoreboard of
[https://en.m.wikipedia.org/wiki/Hutter_Prize](https://en.m.wikipedia.org/wiki/Hutter_Prize)

------
lrm242
Blosc is an outstanding project. I have used it with great success in finance
and general data science in production with very large total datasets (one
custom binary format and one leveraging protobufs).

It really shines first and foremost as a meta-compressor, giving the developer
a clean block-based API. Once integrated (which really is quite easy), you can
experiment easily with different compressors and preconditioners to see what
works best with your dataset. These can be changed at runtime, which gives you
great flexibility.
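That swap-at-runtime idea can be sketched with stdlib codecs standing in for Blosc's backends (in python-blosc proper the selection is just the `cname` argument to `blosc.compress`); the registry below is my own illustration, not Blosc's API:

```python
import bz2
import lzma
import zlib

# Hypothetical registry: codec name -> (compress, decompress), picked at runtime.
CODECS = {
    "zlib": (zlib.compress, zlib.decompress),
    "bz2": (bz2.compress, bz2.decompress),
    "lzma": (lzma.compress, lzma.decompress),
}

def pack(block: bytes, cname: str) -> bytes:
    return CODECS[cname][0](block)

def unpack(blob: bytes, cname: str) -> bytes:
    return CODECS[cname][1](blob)

# Try every codec on the same block and compare sizes.
block = bytes(range(256)) * 64
for cname in CODECS:
    blob = pack(block, cname)
    assert unpack(blob, cname) == block
    print(cname, len(blob))
```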

Francesc has been advancing blosc consistently with a steady vision for years
and years. It is one of the most underrated tools around IMO.

------
devit
Apparently they have several benchmarks where they claim that decompression is
faster than memcpy (!).

However, this is only the case because several of their Intel x86_64
benchmarks report memcpy performance of 5-10 GB/s, while even a basic
dual-channel DDR3 setup has 20 GB/s of memory bandwidth, a modern quad-channel
DDR4 system can reach 76.8 GB/s, and there is no reason for a properly
implemented memcpy to be substantially slower than memory bandwidth (AVX can
issue two 256-bit reads and one 256-bit write per cycle, i.e. a 128 GB/s
memcpy at 4 GHz).
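A rough way to sanity-check the quoted memcpy numbers yourself (this measures whatever bulk-copy routine the interpreter uses, not a hand-tuned AVX loop):

```python
import time

N = 64 * 1024 * 1024           # 64 MiB buffer
src = bytearray(N)
dst = bytearray(N)

t0 = time.perf_counter()
dst[:] = src                   # slice assignment boils down to a bulk memcpy
t1 = time.perf_counter()

print(f"{N / (t1 - t0) / 1e9:.1f} GB/s")
```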

Am I missing something or is this another case of "implausible claims = they
screwed the benchmark = they are incompetent/malicious"?

~~~
stagger87
The absolute numbers don't seem far-fetched. An AVX-optimized memcpy on my
high-end machine (DDR4) has a throughput of 30 GB/s.

As long as they are using the same memcpy routine in both the decompression
case and the memcpy-only case, that seems reasonable. Obviously, the quicker
memcpy becomes, the faster the decompression has to be to maintain the same
performance ratios, but things like faster clock speeds or multithreading can
make that issue moot.

------
xiaodai
It's very good! I have used Blosc in developing JDF.jl, a serialization
format for dataframes.

[https://github.com/xiaodaigh/JDF.jl](https://github.com/xiaodaigh/JDF.jl)

~~~
doublesCs
Could you tell us more? Is this meant to be an alternative to parquet?

In fact, now that I think about it, parquet supports compression. Shouldn't
this just be an option when saving to the parquet format?

~~~
pletnes
Parquet’s snappy and brotli compressors are quite ok. Not sure if blosc is
even faster though.

------
gigatexal
Would be cool to see this in ZFS to make compressing binaries even more
efficient

------
nisa
The shuffle techniques used before compression might be useful for squashfs?
We play around with a mesh network (freifunk.net), and there are tons of cheap
4 MB flash devices that need every KB of storage :)

------
axegon_
Blosc is an excellent choice if speed is what you are after. Five or so years
ago I had to use compression to transport a lot of data over zmq, and Blosc
ran circles around all the other compressors.

~~~
w0utert
Yes, it's apparently so fast that in some scenarios it's even usable for
compressing RAM. A framework I'm using does that to be able to process much
bigger data sets than would otherwise fit in RAM.

~~~
ddorian43
Can you be more specific about the framework, data type, and access patterns?

~~~
w0utert
It's a framework called OpenVDB [1], which we use to represent and manipulate
volumetric data (level sets). It stores the data as a sparse hierarchical grid
with (from a practical perspective) infinite dimensions, and allows very
efficient iteration and local manipulations of the grid.

I'm not an expert on how it is implemented exactly, but I believe the way it
uses Blosc is by saving the leaves of the VDB grids in Blosc-compressed
chunks, which are loaded into memory directly and only decompressed on demand
when the data is accessed, then re-compressed after the leaves are processed.

[1] [https://www.openvdb.org](https://www.openvdb.org)
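That decompress-on-access pattern can be sketched like this (zlib stands in for Blosc here, and the class is my own illustration, not OpenVDB's actual code):

```python
import zlib

class CompressedLeafStore:
    """Keep leaf chunks compressed in RAM; decompress only on access."""

    def __init__(self):
        self._chunks = {}          # leaf id -> compressed bytes

    def store(self, leaf_id, data: bytes):
        self._chunks[leaf_id] = zlib.compress(data)

    def load(self, leaf_id) -> bytes:
        return zlib.decompress(self._chunks[leaf_id])

store = CompressedLeafStore()
leaf = bytes(1024)                 # a highly compressible all-zero leaf
store.store((0, 0, 0), leaf)
assert store.load((0, 0, 0)) == leaf
# The resident copy stays compressed, so the in-RAM footprint is far smaller.
assert len(store._chunks[(0, 0, 0)]) < len(leaf)
```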

------
requin246
Can someone with Blosc 2 experience tell me what the proper conditions are for
using superchunks or frames? When does it become advantageous to use one over
the other?

This is a really interesting library.

------
js8
This would be an excellent candidate to put on an FPGA directly next to the
CPU. (Assuming such a thing existed and was open enough to be usable by the
general public.)

------
waatels
This looks amazing. The applications look so diverse! Does anyone know if it
can be applied to msgpack?

~~~
profquail
Not generally, no. blosc is geared towards “rectangular” data — that is, a
C-style array of int, double, or some struct type.

------
any1
Can blosc be used to compress/decompress regular zlib streams?

