
File Compression in the Multi-Core Era - ajbatac
http://www.codinghorror.com/blog/archives/001231.html
======
jrockway
This is interesting, but not for the reasons Jeff suggests. bzip2 uses all 8
cores for 21 minutes to produce a 986M file (or 2 minutes for 1092M), while
7zip doesn't use all 8 cores, and produces a file smaller than anything bzip2
can produce, in 5 minutes.

So it looks like 7zip is not just slightly better than bzip2; it's _much_
better. Ideally you can utilize all your cores by piping data from the DB
directly into the compressor -- the compressor will use 2 cores (or whatever),
and your database will use the rest.
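
A rough sketch of that piping idea in Python, assuming the dump tool writes to
stdout and the compressor reads from stdin. The helper name `pipe_through` and
its arguments are made up for illustration; nothing here comes from the
article.

```python
import subprocess


def pipe_through(producer_cmd, compressor_cmd, out_path):
    """Stream producer stdout straight into the compressor's stdin.

    Both commands run concurrently, so the OS can schedule the producer
    (e.g. a database dump) and the compressor on different cores, and no
    uncompressed intermediate file ever touches the disk.
    """
    with open(out_path, "wb") as out:
        producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
        compressor = subprocess.Popen(
            compressor_cmd, stdin=producer.stdout, stdout=out
        )
        # Drop our copy of the pipe so the producer sees a broken pipe
        # if the compressor exits early.
        producer.stdout.close()
        compressor_rc = compressor.wait()
        producer_rc = producer.wait()
    return producer_rc, compressor_rc
```

Something like `pipe_through(["pg_dump", "mydb"], ["xz", "-c"], "dump.sql.xz")`
would be one way to call it, where `mydb` is a placeholder database name and
the choice of `pg_dump` and `xz` is only an example.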

~~~
DarkShikari
Bzip2 is a very slow compression algorithm, mostly (from what I recall) because
of the Burrows-Wheeler transform that lies at its core. LZMA is pretty much
superior in every respect; it is (though quite slowly) on the road to
replacing the aging Bzip2.

And as usual the comments on CodingHorror (at least the initial dozen or two)
show a relative ignorance about the topic. 7zip (like any compressor) can be
trivially parallelized just by running it simultaneously on each solid block.
The compression cost of a smaller solid block size is generally near-zero for
the case where dictionary size << input data size.

The included Windows interface doesn't allow this kind of threading AFAIK, but
it would be relatively simple to implement in an app using the LZMA libraries.
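
For what it's worth, a minimal sketch of the per-block idea using Python's
standard `lzma` module rather than the 7zip code itself. The function name,
block size and worker count are arbitrary assumptions. Each block becomes its
own .xz stream, and concatenated .xz streams decompress back to the original
data.

```python
import lzma
from concurrent.futures import ThreadPoolExecutor

# Hypothetical block size; the point above is that the ratio loss is small
# as long as the block is much larger than the dictionary.
BLOCK_SIZE = 4 * 1024 * 1024


def compress_blocks(data: bytes, block_size: int = BLOCK_SIZE, workers: int = 4) -> bytes:
    """Compress independent blocks in parallel and concatenate the results."""
    # Split the input into fixed-size blocks; keep one empty block for empty
    # input so the output is still a valid stream.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)] or [b""]
    # CPython's lzma releases the GIL while compressing, so threads can
    # use several cores here.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        compressed = list(pool.map(lzma.compress, blocks))  # order is preserved
    return b"".join(compressed)
```

`lzma.decompress(compress_blocks(data))` should give back `data`, since
`lzma.decompress` handles concatenated streams.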

~~~
Andys
This is the exact approach taken by pigz (Parallel GZIP) -
<http://www.zlib.net/pigz/>

~~~
alecco
> and uses the zlib and pthread libraries

Er, no, thanks. What about a good STL C++ implementation with OpenMP
(automagic on STL)?

That's great for us, developers. We'll never run out of things to do :)

~~~
Andys
What's the problem? pigz works as advertised for me. Almost a 4x speed increase
on a 4 core system, assuming your storage I/O can keep up.

------
ruslan
Poor guy discovered multi-threading on SMP too late! I would fire a system
administrator who does not understand the essentials of the system. Also one
who does not respect the history and does not understand the fact that old unix
utilities are not (and in most cases cannot be!) multi-threaded.

------
artificer
Note that the bzip2 implementation he uses is 7zip's; the classic unix
implementation does not make use of multiple cores. But, there is also pbzip2,
which supposedly uses all available cores:

<http://compression.ca/pbzip2/>

