

Many times faster (de)compression using multiple processors. - igorhvr
http://www.iasylum.net/writings/parallel-compression.html

======
alecco
It's good that people get interested in the subject. But this is very odd and
has some errors. For example, xz requires a lot more memory than bzip2 (see
the benchmarks below, Mem column).

<http://mattmahoney.net/dc/text.html>

<http://mattmahoney.net/dc/uiq/>

Matt Mahoney maintains the best benchmarks on text and generic compression.
Some of the best in the field (like Matt) usually hang out at encode.ru.

~~~
igorhvr
By resources I meant mostly wall-time - that's what I am optimizing for (I
move >200GB zfs snapshots around a lot). I did not pay specific attention to
memory - in that light I updated the side note.

What else did you find odd/wrong?

~~~
alecco
You don't mention ratios. Better compression becomes
asymptotically/exponentially harder when it gets close to the optimal code
size.

Thanks for the shout out. BTW Matt's benchmarks are for all major compression
engines, not only bzip2.

------
bmm6o
Is this decompressing a single stream on multiple processors? My knowledge of
gzip is very limited, but I would have thought sequential processing was
required. What's the trick here? (TFA doesn't explain anything, and e.g. pigz
homepage doesn't either).

~~~
igorhvr
You are right. The PDF version of the pigz user manual (at
<http://zlib.net/pigz/pigz.pdf>) has a few details; the last paragraph below
answers your question:

"Pigz compresses using threads to make use of multiple processors and cores.
The input is broken up into 128 KB chunks with each compressed in parallel.
The individual check value for each chunk is also calculated in parallel. The
compressed data is written in order to the output, and a combined check value
is calculated from the individual check values.

The compressed data format generated is in the gzip, zlib, or single-entry zip
format using the deflate compression method. The compression produces partial
raw deflate streams which are concatenated by a single write thread and wrapped
with the appropriate header and trailer, where the trailer contains the
combined check value.

Each partial raw deflate stream is terminated by an empty stored block (using
the Z_SYNC_FLUSH option of zlib), in order to end that partial bit stream at a
byte boundary. That allows the partial streams to be concatenated simply as
sequences of bytes. This adds a very small four to five byte overhead to the
output for each input chunk. The default input block size is 128K, but can be
changed with the -b option. The number of compress threads is set by default
to the number of online processors, which can be changed using the -p option.
Specifying -p 1 avoids the use of threads entirely. The input blocks, while
compressed independently, have the last 32K of the previous block loaded as a
preset dictionary to preserve the compression effectiveness of deflating in a
single thread. This can be turned off using the -i or --independent option, so
that the blocks can be decompressed independently for partial error recovery
or for random access.

Decompression can’t be parallelized, at least not without specially prepared
deflate streams for that purpose. As a result, pigz uses a single thread (the
main thread) for decompression, but will create three other threads for
reading, writing, and check calculation, which can speed up decompression
under some circumstances. Parallel decompression can be turned off by
specifying one process ( -dp 1 or -tp 1 )."

I also just updated the article with this info.
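
A minimal Python sketch of the concatenation trick described above - each
chunk deflated independently as a raw stream and terminated with Z_SYNC_FLUSH
so the byte-aligned pieces can simply be glued together. (This omits pigz's
preset-dictionary trick and the gzip header/trailer; the sample data is just a
placeholder.)

    
    
      import zlib
      
      CHUNK = 128 * 1024  # pigz's default input block size
      
      def chunked_deflate(data):
          # Compress each chunk with its own deflate state, as one worker thread would.
          chunks = [data[i:i + CHUNK] for i in range(0, len(data), CHUNK)]
          pieces = []
          for n, chunk in enumerate(chunks):
              co = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate, no zlib header
              piece = co.compress(chunk)
              if n < len(chunks) - 1:
                  # An empty stored block ends this partial stream on a byte boundary.
                  piece += co.flush(zlib.Z_SYNC_FLUSH)
              else:
                  # Only the last chunk emits a final block to close the whole stream.
                  piece += co.flush(zlib.Z_FINISH)
              pieces.append(piece)
          # The writer thread's job: plain byte concatenation, in order.
          return b"".join(pieces)
      
      data = bytes(range(256)) * 4096           # ~1 MiB of sample data
      out = chunked_deflate(data)
      assert zlib.decompress(out, -15) == data  # round-trips as one raw deflate stream
    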

~~~
ComputerGuru
Interesting. How is this capable of extracting LZMA "solid" archives where
_all_ files are part of a single stream to achieve better (and very much
slower) compression?

------
joshbaptiste
Had to try this on my quad-core laptop, as I had never heard of these tools.

    
    
      josh@snoopy:~/Downloads $ grep -m2 -i intel  /proc/cpuinfo 
      vendor_id       : GenuineIntel
      model name      : Intel(R) Core(TM) i5-2410M CPU @ 2.30GHz
    
      josh@snoopy:~/Downloads $ ls -l test 
      -rw-r--r-- 1 josh josh 1073741824 2012-03-07 20:06 test
    
      josh@snoopy:~/Downloads $ time gzip test
      real    0m16.430s
      user    0m10.210s
      sys     0m0.490s
    
      josh@snoopy:~/Downloads $ time pigz test 
      real    0m5.028s
      user    0m16.040s
      sys     0m0.620s
    

Looks good, although the man page describes it as "an almost compatible
replacement for the gzip program".

~~~
alecco
But what's the compression ratio and what's the test data?

~~~
joshbaptiste
Test data

    
    
      josh@snoopy:~/Downloads $ dd if=/dev/zero of=test bs=1024K count=1000
      
      josh@snoopy:~/Downloads $ gzip -l test.gz 
      compressed        uncompressed  ratio   uncompressed_name
      1176024          1048576000  99.9% test.txt
    

with -p2 it runs ~2 sec slower

~~~
alecco
For that kind of data you can just use RLE instead. It's the trivial case (an
edge case in fact).
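
For illustration, a toy run-length encoder in Python (not any particular
tool's format) shows why the all-zeros file above is the trivial case:

    
    
      def rle_encode(data):
          # Collapse each run of identical bytes into a (value, length) pair.
          runs = []
          i = 0
          while i < len(data):
              j = i
              while j < len(data) and data[j] == data[i]:
                  j += 1
              runs.append((data[i], j - i))
              i = j
          return runs
      
      # A buffer of zeros like the dd test file reduces to a single pair.
      print(rle_encode(b"\x00" * 1000000))  # [(0, 1000000)]
    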

------
isocpprar
Is xz less resource intensive than bzip2? My testing (admittedly two years ago
or so) showed significant differences: a better compression ratio with xz, but
significantly longer run times and/or more memory used.

~~~
igorhvr
In the tests I did (including compression of an entire filesystem image), xz
always compresses better for the same effort (amount of time).

One example (using the standard tools, not the multithreaded ones):

    
    
      igorhvr:tmp/ $ ls -la co_2011-08_import.mdb*
      -rw-r--r-- 1 igorhvr igorhvr 261832704 2011-10-11 07:19 co_2011-08_import.mdb
      igorhvr:tmp/ $ sudo time nice -n -20 bzip2 -k co_2011-08_import.mdb
      42.85user 0.14system 0:43.16elapsed 99%CPU (0avgtext+0avgdata 31968maxresident)k 112inputs+0outputs (3major+2268minor)pagefaults 0swaps
      igorhvr:tmp/ $ sudo time nice -n -20 xz -3 -k co_2011-08_import.mdb
      30.99user 0.19system 0:31.48elapsed 99%CPU (0avgtext+0avgdata 130640maxresident)k 96inputs+0outputs (2major+8389minor)pagefaults 0swaps
      igorhvr:tmp/ $ ls -la co_2011-08_import.mdb*
      -rw-r--r-- 1 igorhvr igorhvr 261832704 2011-10-11 07:19 co_2011-08_import.mdb
      -rw-r--r-- 1 igorhvr igorhvr  24020243 2011-10-11 07:19 co_2011-08_import.mdb.bz2
      -rw-r--r-- 1 igorhvr igorhvr  23949408 2011-10-11 07:19 co_2011-08_import.mdb.xz
    

In this case xz compressed to roughly the same size (slightly better than
bzip2, actually) using only 72% of the time. This is often the case.
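
Just reading those numbers back (sizes from ls -la and user CPU times from the
run above):

    
    
      bz2_size, xz_size = 24020243, 23949408   # bytes
      bz2_user, xz_user = 42.85, 30.99         # seconds of user CPU time
      print("xz/bzip2 size: %.1f%%" % (100.0 * xz_size / bz2_size))  # ~99.7%
      print("xz/bzip2 time: %.0f%%" % (100.0 * xz_user / bz2_user))  # ~72%
    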

~~~
getsat
Put two spaces before your lines of terminal output to prevent HN from munging
them.

~~~
igorhvr
Done - thanks

------
PaulHoule
If you're handling a lot of data it makes sense to hash-partition it on some
key and spread it out to a large number of files.

In that case you might have, say, 512 partitions and you can farm out
compression, decompression and other tasks to as many CPUs as you want, even
other machines in a cluster.
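
A rough Python sketch of that layout (the file names, the 512-partition count,
and the SHA-1 choice are just placeholders, and the partition files are
assumed to already exist; any stable hash works):

    
    
      import gzip, hashlib, shutil
      from concurrent.futures import ProcessPoolExecutor
      
      NUM_PARTITIONS = 512
      
      def partition_of(key):
          # A stable hash so the same key always lands in the same partition file.
          digest = hashlib.sha1(key.encode()).digest()
          return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS
      
      def compress_partition(path):
          # Partitions are independent, so a process pool (or a cluster
          # scheduler) can fan compression out over as many CPUs as you like.
          with open(path, "rb") as src, gzip.open(path + ".gz", "wb") as dst:
              shutil.copyfileobj(src, dst)
          return path + ".gz"
      
      if __name__ == "__main__":
          paths = ["partition-%03d.dat" % i for i in range(NUM_PARTITIONS)]
          with ProcessPoolExecutor() as pool:
              done = list(pool.map(compress_partition, paths))
    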

------
mappu
I like to use PPMd (via 7zip) for large volumes of text, but it seems to only
be single-threaded, which is a shame. It cuts a good 30% again off the size of
the .xml.bz2's that Wikipedia provides.

------
dhruvbird
This is awesome since compression in parallel has been largely neglected in
practice.

~~~
alecco
It's not neglected; it's just hard, and a lot of projects have failed. But
there have been many papers on the subject in the last 10 years. Also,
parallel is a vague term: it can mean multicore, SIMD, GPU, or clusters. All
different, each with its own set of tradeoffs.

