
Xz - golwengaud
http://en.wikipedia.org/wiki/Xz
======
spicyj
Can someone explain to me why this Wikipedia article is at the top of the
front page?

~~~
matthavener
LZMA2 has a better compression ratio than bzip2, according to this article
[http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Mark...](http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Markov_chain_algorithm)
. Seems like it would be a good idea to switch away from bzip2/gzip to xz.

~~~
wheels
The main reason to continue to use gzip is speed: it's an order of magnitude
faster than bzip2 for compression / decompression, for a relatively small
compression ratio hit. It's fast enough to use inline in web servers, SSH
sessions, etc., and if you're just moving data around a single time, the total
time to compress / transfer / decompress is often much lower than with
bzip2.
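
Concretely, DEFLATE (the algorithm behind gzip) can be run incrementally over
a stream, which is what makes it practical inline. A minimal sketch using
Python's stdlib `zlib`; the chunking and function names are illustrative:

```python
import zlib

# Incremental DEFLATE over a stream of chunks, roughly the way a web
# server or SSH session compresses data inline as it is produced.
def compress_stream(chunks, level=6):
    co = zlib.compressobj(level)
    for chunk in chunks:
        yield co.compress(chunk)
    yield co.flush()  # emit whatever is still buffered

def decompress_stream(chunks):
    do = zlib.decompressobj()
    for chunk in chunks:
        yield do.decompress(chunk)
    yield do.flush()

payload = [b"GET /index.html HTTP/1.1\r\n" * 100] * 10
compressed = b"".join(compress_stream(payload))
restored = b"".join(decompress_stream([compressed]))
assert restored == b"".join(payload)
```

Each chunk is compressed as it arrives, so nothing has to be spooled to disk
first.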

~~~
gxti
I've found that on many types of data, xz -2 is twice as fast _and_ compresses
better than gzip -9. Of course gzip -1 beats the pants off xz, but there's
absolutely no reason to use gzip at its higher levels, nor to use bzip2 at
all. And if you want fast, LZO does it better -- the lzop tool provides a
typical Unix compress interface.

Using 50MiB from a tarball I had lying around, best of 3:

         prog    time(s)   size(%)
        gzip  -1  1.703    38.7
        gzip  -9 14.670    35.1    (worse than xz -1)
        bzip2 -1  7.714    34.0
        bzip2 -9  8.035    31.4    (worse than xz -1)
        xz    -1  6.278    31.3
        xz    -9 42.445    22.7    (best compression)
        lzop  -1  0.670    46.1    (fastest)
        lzop  -9 18.144    38.0
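
For anyone who wants to reproduce this on their own data, a quick harness
using Python's stdlib bindings works too. The sample data here is synthetic;
absolute numbers, and even the ranking, depend heavily on the input:

```python
import bz2, gzip, lzma, time

# Micro-benchmark in the spirit of the table above, via the stdlib
# bindings for DEFLATE, bzip2, and xz/LZMA2.
def bench(name, compress, data):
    t0 = time.perf_counter()
    out = compress(data)
    dt = time.perf_counter() - t0
    print(f"{name:9s} {dt:7.3f}s {100 * len(out) / len(data):5.1f}%")
    return out

data = b"".join(b"%d: the quick brown fox\n" % i for i in range(50_000))

sizes = {}
for name, fn in [
    ("gzip -1",  lambda d: gzip.compress(d, compresslevel=1)),
    ("gzip -9",  lambda d: gzip.compress(d, compresslevel=9)),
    ("bzip2 -9", lambda d: bz2.compress(d, compresslevel=9)),
    ("xz -1",    lambda d: lzma.compress(d, preset=1)),
    ("xz -9",    lambda d: lzma.compress(d, preset=9)),
]:
    sizes[name] = len(bench(name, fn, data))
```

Run it on a real tarball instead of the synthetic data before drawing any
conclusions.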

~~~
huhtenberg
Out of curiosity, can you run this tarball through paq (say, paq8hp12)?

<http://en.wikipedia.org/wiki/PAQ>

~~~
gxti
It segfaulted at the end, after taking 7m28s. I don't know how big the result
was because it got truncated to 35 bytes.

~~~
huhtenberg
Ugh. Can you try with a lighter compression level? There is a command line
switch for that, and it should also show the memory requirements for each
compression level.

I'm insisting because PAQ is a really incredible collection of compressors.
Far more intelligence in it than in the LZ* bunch.

------
vasi
I wrote a tool, pixz, that does xz compression in parallel to take advantage
of multiple cores. It also indexes xz-compressed tarballs, so you can extract
an individual file very quickly instead of needing to decompress the entire
tarball. The parallel-compressed, self-contained tarball+index is fully
compatible with regular tar/xz, you don't need any special tools to extract
it.

<http://github.com/vasi/pixz>

The interface is still very rough, but it works. The xz utility comes with a
very nice library and API, which made this a lot easier--thanks, Lasse!
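
This isn't pixz's actual format or index, but the core trick can be sketched
in a few lines: split the input into chunks, compress each chunk as an
independent .xz stream in parallel, and concatenate the results. The xz
format permits concatenated streams, so a stock decompressor reads the
output. Chunk size and worker count below are arbitrary; CPython's lzma
module releases the GIL while compressing, so threads get real parallelism:

```python
import lzma
from concurrent.futures import ThreadPoolExecutor

# Compress each chunk as an independent .xz stream on its own thread,
# then concatenate; xz allows back-to-back streams in one file.
def parallel_xz(data, chunk_size=1 << 20, workers=4):
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(lzma.compress, chunks))

blob = b"some highly repetitive payload " * 100_000
packed = parallel_xz(blob)
assert lzma.decompress(packed) == blob  # reads all streams in one pass
```

The price is a slightly worse ratio, since matches can't cross chunk
boundaries; pixz additionally writes an index so individual files can be
located without decompressing everything.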

------
coderdude
While we're talking about compression algorithms, here is a nice little gem
called BMZ:

<http://bitbucket.org/mattsta/bmz/src>

It's a fast compression scheme implemented using BMDiff and a Google Zippy
clone (based on LZO).

~~~
dschoon
I built and tried this out. I'm rather shocked to see speed and compression
ratio better than gzip -9.

The results of my unscientific test (compression only):

       Compressor  Size  Ratio  Time
       gzip -1     23MB  88%     1.18s
       gzip -2     23MB  87%     1.38s
       bzip2       23MB  87%     5.57s
       xz -1       23MB  87%     5.35s
       xz -9       11MB  43%    10.58s
       bmz         13MB  45%     0.95s

~~~
vicaya
It really depends on the data. For highly redundant data like web pages with
lots of boilerplate header/footer, it can compress better because bm_pack
(the first pass of bmz) looks for large common patterns over the whole input.
For typical text, it should be a little worse than gzip but faster.

BMZ = bmpack + lzo by default and can be combined with lzma if necessary. It's
not really a BMDiff and Zippy clone, as I've never had a chance to see
Google's implementation. It's based on the original Bentley & McIlroy paper:
"Data Compression Using Long Common Strings", 1999. Even the two pass idea is
from that paper. It was really a wacky experimental implementation (with a lot
of room for improvement) to satisfy my curiosity. I'm a little surprised that
the 0.1 version has been stable for quite a few people compressing TBs of data
through it.
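
For the curious, the first pass is easy to sketch. Here is a toy version of
the long-common-strings idea, using fixed-stride block fingerprints instead
of the paper's rolling hash; all names are mine, not bmz's:

```python
# Toy Bentley-McIlroy pass: fingerprint every b-th block of the input,
# then replace later repeats of a block with an (offset, length)
# reference, extending each match forward as far as it goes.
def bm_pack(data, b=16):
    seen = {}  # block content -> earliest offset
    for i in range(0, len(data) - b + 1, b):
        seen.setdefault(data[i:i + b], i)
    ops, i = [], 0
    while i < len(data):
        j = seen.get(data[i:i + b])
        if j is not None and j < i:
            length = b
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            ops.append((j, length))
            i += length
        else:
            ops.append(data[i:i + 1])  # literal byte
            i += 1
    return ops

def bm_unpack(ops):
    out = bytearray()
    for op in ops:
        if isinstance(op, tuple):
            j, length = op
            for k in range(length):   # byte-wise: copies may overlap
                out.append(out[j + k])
        else:
            out += op
    return bytes(out)

sample = b"abcdefgh" * 200 + b"plus a unique tail"
ops = bm_pack(sample)
assert bm_unpack(ops) == sample
assert any(isinstance(op, tuple) for op in ops)  # repeats became references
```

In the real scheme the output of this pass is then fed to a conventional
second-stage compressor (LZO in bmz's default).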

------
marcinw
The only reason I stick with gzip over bz2 or 7z (lzma) is that gzip is
everywhere. In reality, nobody has a file compression utility that can handle
LZMA (even though I do, I never use it).

~~~
swolchok
Interestingly, file(1) can't identify lzma(1)-compressed files (they show up
as data), but unlzma(1) reassuringly knows the difference between LZMA and
random data. This came up once or twice in the DEFCON CTF quals
([http://www.vnsecurity.net/2010/05/defcon-18-quals-writeups-c...](http://www.vnsecurity.net/2010/05/defcon-18-quals-writeups-collection/)).

~~~
wmf
That's why you should use xz instead of lzma (they're not the same thing) —
the XZ format has a magic number.
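
This is easy to see from Python, for instance: the .xz container starts with
a fixed 6-byte magic number, while the legacy "alone" format starts directly
with encoder properties, leaving file(1) nothing to latch onto:

```python
import lzma

# .xz streams begin with this fixed 6-byte magic; legacy .lzma has none.
XZ_MAGIC = b"\xfd7zXZ\x00"

payload = b"hello, compression" * 10
as_xz = lzma.compress(payload, format=lzma.FORMAT_XZ)
as_lzma = lzma.compress(payload, format=lzma.FORMAT_ALONE)

assert as_xz.startswith(XZ_MAGIC)
assert not as_lzma.startswith(XZ_MAGIC)  # nothing to identify it by
```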

------
acg
The specification ( <http://tukaani.org/xz/xz-file-format.txt> ) reads like
this format allows the application of a chain of compression "filters".
Perhaps the most interesting, and least portable, is the filter that modifies
executable code to make compression easier (Section 5.2.3). Is this mainly for
Intel architectures?
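
Not only Intel: the spec defines branch-conversion (BCJ) filters for x86,
ARM, ARM-Thumb, PowerPC, SPARC, and IA-64. The idea is that a BCJ filter
rewrites relative branch targets into absolute ones, so repeated call sites
become identical byte strings that LZMA2 can then match. A sketch of such a
chain through Python's lzma bindings; the sample data is a synthetic
stand-in for real machine code:

```python
import lzma

# An xz filter chain: x86 BCJ preprocessing, then LZMA2 for the real
# compression. FORMAT_XZ requires the chain to end with LZMA2.
filters = [
    {"id": lzma.FILTER_X86},
    {"id": lzma.FILTER_LZMA2, "preset": 6},
]

data = bytes(range(256)) * 1000  # stand-in for an executable's bytes
packed = lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)

# The chain parameters travel in the stream headers, so a plain
# decompressor needs no extra information.
assert lzma.decompress(packed) == data
```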

------
dmn001
I use 7-zip 9.15 beta - it supports LZMA2 compression and is available for
download here:

<http://www.7-zip.org/>

------
india
Also notably, Slackware has shifted its packaging format to xz-compressed
tarballs over the last two releases. That validates xz as a good replacement
for gz/bz2.

------
tman
Xz appears to use the same algorithm as 7zip.

Here are some comparisons between the big 3 compression algorithms (taken from
<http://blogs.reucon.com/srt/tags/compression/> -- he used a 163 MB MySQL
dump file for the tests):

      Compressor  Size   Ratio  Compression  Decompression
      gzip        89 MB  54 %   0m 13s       0m 05s
      bzip2       81 MB  49 %   1m 30s       0m 20s
      7-zip       61 MB  37 %   1m 48s       0m 11s

~~~
swombat
So bzip2 and 7-zip are way, way slower than gzip, then?

Bandwidth is cheap. Stick to gzip.

~~~
w1ntermute
_So bzip2 and 7-zip are way, way slower than gzip, then?

Bandwidth is cheap. Stick to gzip._

It's not as simple as that. Which one is better depends on the use case. If
you're sending a one-off file to somebody, sure, gzip is better. But if you
want to distribute a file to a large number of people (like Linux
distributions do with their packages), the extra CPU time is insignificant
compared to the bandwidth saved over the course of thousands of downloads.

~~~
alecco
It's very annoying to wait minutes to decompress big files. In particular
installation times.

~~~
barrkel
Decompression is more often limited by disk I/O, in my experience,
particularly when the source and destination are the same disk. I can often
get large improvements in decompression and installation speed by putting the
source file and / or temporary installation files on a different disk.

~~~
alecco
It's not always I/O speed. You can notice that while installing, CPU usage
goes to 100% (or the fans kick in) for BWT/LZM* but not for DEFLATE (unless
you use -9 or something like that). While you install something, at least one
of your cores is unavailable for anything else.

This affects energy consumption, too.

And think about both mobile and servers. Those systems are usually more
sensitive to high CPU load.

I have a draft blog post with an analysis of different protocols under
valgrind and other tools. But there is so much data to present and graph that
I never get around to finishing it :(

------
gubatron
Tried to apt-get it (on Ubuntu 9.10) but got scared when I got a warning
about lzma getting uninstalled and possibly breaking my dpkg.

Looking good on Ubuntu 10.04.

Same folder compressed to 1/2 the size as with tar cvfz, sick.

~~~
ars
xz is backward compatible with lzma: it can both create and decompress lzma
archives, and it ships a tool with the same command-line options.

lzma is deprecated by its author in favor of xz.
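
The compatibility shows up in libraries built on liblzma too. For instance,
Python's `lzma.decompress` defaults to auto-detection and accepts both
containers (a small sketch):

```python
import lzma

# FORMAT_AUTO (the default for decompression) sniffs .xz by its magic
# number and falls back to the legacy .lzma "alone" container.
payload = b"legacy data" * 50
legacy = lzma.compress(payload, format=lzma.FORMAT_ALONE)  # old .lzma
modern = lzma.compress(payload, format=lzma.FORMAT_XZ)     # new .xz

assert lzma.decompress(legacy) == payload  # auto-detected
assert lzma.decompress(modern) == payload
```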

