Xz (wikipedia.org)
232 points by golwengaud on June 24, 2010 | 64 comments

Can someone explain to me why this Wikipedia article is at the top of the front page?

Most hackers don't know about XZ yet, but they should. I still prefer not to see Wikipedia articles on HN since they lack context. A "7 reasons you should ditch bzip2 and use XZ" blog post would probably be better.

Oh god please not more "X reasons for Y"

Give me Z reasons why you don't like "X reasons for Y" posts.

9 Reasons Why You Should Hate "X Reasons for Y" posts:

1. They're overdone

2. They're uncreative

3. They're unoriginal

4. They're a crutch

5. They're assembly-line writing

6. They're used by tabloids to appeal to supermarket zombies

7. They've helped turn respectable mags into tabloids (or are a symptom of it, not sure which came first)

8. They're psychologically manipulative (I don't know how, I just know it)

9. They work


You forgot to put these across 11 individual comments so your 8pts would have been multiplied to 88!

I do not mind lists, but I cannot stand lists that are separated into pages for inflated page counts!

He was going to put it across 11 comments, but as the Wikipedia article clearly points out, xz compression conventionally bundles a single file.

The HN Guidelines explicitly frown upon "X reasons..." titles because they're almost always linkbait.

If the original title begins with a number or number + gratuitous adjective, we'd appreciate it if you'd crop it. E.g. translate "10 Ways To Do X" to "How To Do X," and "14 Amazing Ys" to "Ys." Exception: when the number is meaningful, e.g. "The 5 Platonic Solids."


How about a "Y is the new X" or "X is dead, long live Y"?

I'd prefer it to a two paragraph Wikipedia article.

"X reasons for Z": there, I fixed it for you.

Agree on the wikipedia articles with no context point. Especially since the article in question is short with not very much real information on why it matters.

LZMA2 has a better compression ratio than bzip2, according to this article http://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Mark... . Seems like it would be a good idea to switch away from bzip2/gzip to xz

The main reason to continue to use gzip is speed: it's an order of magnitude faster than bzip2 for compression / decompression, for a relatively small compression ratio hit. It's fast enough to use inline in web servers, SSH sessions, etc, and if you're just moving data around a single time, often the amount of time for compress / transfer / decompress is much shorter than with bzip2.

I've found that on many types of data, xz -2 is twice as fast and compresses better than gzip -9. Of course gzip -1 kicks the pants off xz but there's absolutely no reason to use gzip at its higher levels, nor to use bzip2 at all. And if you want fast, LZO does it better -- the lzop tool provides a typical unix compress interface.

Using 50MiB from a tarball I had lying around, best of 3:

    prog    time(s)   size(%)
    gzip  -1  1.703    38.7
    gzip  -9 14.670    35.1    (worse than xz -1)
    bzip2 -1  7.714    34.0
    bzip2 -9  8.035    31.4    (worse than xz -1)
    xz    -1  6.278    31.3
    xz    -9 42.445    22.7    (best compression)
    lzop  -1  0.670    46.1    (fastest)
    lzop  -9 18.144    38.0
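
If anyone wants to repeat this kind of comparison on their own data, here is a rough sketch using Python's standard library bindings rather than the command line tools, so absolute timings will not match the table above. The function name `benchmark` and the `levels` parameter are my own; lzop is left out because the standard library has no LZO module.

```python
import bz2
import gzip
import lzma
import time


def benchmark(data: bytes, levels=(1, 9)):
    """Return a list of (label, seconds, size_percent) rows.

    Each compressor is run once per level on the same in-memory input.
    This is a single-shot timing, not a best-of-3 like the table above.
    """
    compressors = [
        ("gzip", lambda d, lvl: gzip.compress(d, compresslevel=lvl)),
        ("bzip2", lambda d, lvl: bz2.compress(d, compresslevel=lvl)),
        ("xz", lambda d, lvl: lzma.compress(d, preset=lvl)),
    ]
    rows = []
    for name, compress in compressors:
        for level in levels:
            start = time.perf_counter()
            out = compress(data, level)
            elapsed = time.perf_counter() - start
            rows.append((f"{name} -{level}", elapsed, 100.0 * len(out) / len(data)))
    return rows
```

Feed it the bytes of whatever tarball you have lying around, e.g. `benchmark(open(path, "rb").read())`. Note that xz at preset 9 wants a lot of memory, so trim `levels` on small machines.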

If anyone is still reading this thread, I made a chart overnight using a better corpus:



Out of curiosity, can you run this tarball through paq (say, paq8hp12)?


It segfaulted at the end, after taking 7m28s. I don't know how big the result was because it got truncated to 35 bytes.

Ugh. Can you try with a lighter compression level? There is a command line switch for that, and it should also show the memory requirements for each compression level.

I'm insisting because PAQ is a really incredible collection of compressors. Far more intelligence in it than in the LZ* bunch.

Yes, the new compression algorithms (BWT and LZ77-based) are slower and require a lot more memory than DEFLATE (*zip) in most cases.

This guy did a not so rigorous analysis but it is mostly OK: http://changelog.complete.org/archives/931-how-to-think-abou...

A little bird told me there's a better algorithm in speed/compression rate coming soon ;)

Another feature of gzip which may be your life-saver is that it keeps a record of the original size so you can know it without decompression.
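
For the curious, that record is the ISIZE field: the last four bytes of a gzip member, little-endian, holding the uncompressed size modulo 2^32. A minimal sketch of reading it (the helper name `gzip_original_size` is made up for this example):

```python
import gzip
import struct


def gzip_original_size(blob: bytes) -> int:
    """Read ISIZE from the gzip trailer without decompressing anything.

    Caveats: the value is the size modulo 2**32, and for a file made of
    several concatenated members it only describes the last member.
    """
    (isize,) = struct.unpack("<I", blob[-4:])
    return isize


if __name__ == "__main__":
    payload = b"x" * 12345
    print(gzip_original_size(gzip.compress(payload)))
```

So it is a handy hint, but not something to trust for inputs over 4 GiB or for multi-member files.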

xz format seems to have this feature, but:

  $ xz --list
  xz: --list is not implemented yet.

More importantly, xz decompresses quickly. Essentially, you get the compression ratio of bzip2 at the decompression speed of gzip (and, unfortunately, the compression speed of bzip2).

EDIT: also, like gzip but unlike bzip2, you can stream data through xz (at an insignificant penalty in compression ratio).
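
To make the streaming point concrete, here is a small sketch using Python's `lzma` module (not the xz command line tool): the compressor is fed chunk by chunk and never needs the whole input in memory. `xz_stream` is a name I picked for the example.

```python
import lzma


def xz_stream(chunks):
    """Yield .xz output incrementally as input chunks arrive."""
    compressor = lzma.LZMACompressor()
    for chunk in chunks:
        out = compressor.compress(chunk)
        if out:  # the compressor may buffer and return nothing yet
            yield out
    yield compressor.flush()  # writes the remaining data and the stream footer


if __name__ == "__main__":
    pieces = [b"abc" * 100, b"def" * 100]
    blob = b"".join(xz_stream(pieces))
    print(lzma.decompress(blob) == b"".join(pieces))
```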

LZMA is much, much slower. It is simply a matter of the rate at which you generate data and the compression ratio and the rate at which you add more storage/move the files off to tape. Can you "keep up" essentially.

Right now, CPUs are fast enough that LZMA is a realistic prospect. It boils down to what the extra CPUs will cost vs the cost of the storage saved (incl. the "cost" of space in the datacentre, the administrative overhead of more hardware, etc).

I wonder if it would be possible to have page templates delivered separately from their dynamic content? The templating of the dynamic data would happen on the browser-side machine. This way, we could use really expensive compression algorithms with high CPU cost for compression, but then cache the compression result.

This could be achieved by using Javascript and fetching data as JSON, but it seems to be something that would be very useful and beneficial if it were standardized.

There is the SDCH spec: Shared Dictionary Compression over HTTP. Google Chrome uses it, and the Google Toolbar adds support to IE.

The server supplies a pre-created dictionary (template) which gets cached, and further requests get a diff to that dictionary as a response.
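
This is not SDCH itself (which, as far as I know, encodes responses as VCDIFF deltas), but the same shared-dictionary idea can be sketched with zlib's preset dictionary support. Both sides hold the dictionary in advance; only the compressed response travels. The helper names and the sample template are made up for illustration.

```python
import zlib


def compress_with_dict(payload: bytes, dictionary: bytes) -> bytes:
    """DEFLATE the payload, letting it reference bytes in the shared dictionary."""
    c = zlib.compressobj(level=9, zdict=dictionary)
    return c.compress(payload) + c.flush()


def decompress_with_dict(blob: bytes, dictionary: bytes) -> bytes:
    """Inverse of compress_with_dict; needs the exact same dictionary."""
    d = zlib.decompressobj(zdict=dictionary)
    return d.decompress(blob) + d.flush()


if __name__ == "__main__":
    template = b"<html><head><title>Search results</title></head><body><ul class='results'>"
    page = template + b"<li>xz</li>" + b"</ul></body></html>"
    with_dict = compress_with_dict(page, template)
    print(len(zlib.compress(page, 9)), len(with_dict))
```

The win comes entirely from the boilerplate already being on the client, which is why it suits pages that are mostly template.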

I've only found Google search using it on the server side, and the browser penetration is fairly low, but it seems promising. It makes requests very fast.

I've also noticed that Google search serves image thumbnails as data:// URLs directly in the original response. They are going a long way to ensuring each page loads entirely in one request.

There is client-side XSLT already. The problem is that the transformed template is usually the same size as or smaller than your source data.

Really expensive algorithms aren't so much better that they offset the cost of sending template logic to the client.

You could probably save some processing time by integrating compression directly into server-side templating - take advantage of the fact that some parts never change and keep them pre-compressed or at least cache some statistics about them to aid compression.
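
One cheap way to try the "keep them pre-compressed" part: the gzip format allows a file to be a series of members, and a conforming decompressor returns the concatenation of their contents. So static fragments can be compressed once at a high level and glued around a quickly compressed dynamic part. A sketch, with made-up fragment contents and function name:

```python
import gzip

# Static fragments, compressed once at startup with the expensive setting.
HEADER = b"<html><body><h1>My site</h1>"
FOOTER = b"<p>footer boilerplate</p></body></html>"
HEADER_GZ = gzip.compress(HEADER, compresslevel=9)
FOOTER_GZ = gzip.compress(FOOTER, compresslevel=9)


def render_gzipped(dynamic: bytes) -> bytes:
    """Build a multi-member gzip body; only the dynamic part is compressed per request."""
    return HEADER_GZ + gzip.compress(dynamic, compresslevel=1) + FOOTER_GZ


if __name__ == "__main__":
    body = render_gzipped(b"<p>hello</p>")
    print(gzip.decompress(body))
```

The catch is that each member has its own header and trailer and cannot reference the others, so tiny fragments lose more than they gain. I also would not assume every HTTP client handles multi-member bodies correctly without testing.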

In retrospect, I probably should have submitted the XZ Utils project page: http://tukaani.org/xz/ .

However, that page does not give the same picture of widespread adoption that the Wikipedia article does. That use was what interested me: I'm surprised that it only came to my attention when I was trying to recall the name of the 7z command line utility p7zip.

I upvoted it because, on top of being interesting, it fights the passively accepted wisdom that "newsworthy" is synonymous with "worthy".

I'm guessing because people voted it up. Additionally, it's not a peak time right now.

When is peak time?

It's apparently something new that has been floating around. Some hackers are probably curious about the "new thing" that popped up lately.

Software that was released 9 months ago isn't particularly new.

First, I think that is why it's in quotes.

Second, I am reading Of Mice and Men for the first time this summer. It is new to me!

I wrote a tool, pixz, that does xz compression in parallel to take advantage of multiple cores. It also indexes xz-compressed tarballs, so you can extract an individual file very quickly instead of needing to decompress the entire tarball. The parallel-compressed, self-contained tarball+index is fully compatible with regular tar/xz, you don't need any special tools to extract it.


The interface is still very rough, but it works. The xz utility comes with a very nice library and API, which made this a lot easier--thanks, Lasse!
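
To be clear, the following is not how pixz is implemented and it builds no index. It is just the simplest version of the parallel idea I could sketch: split the input, compress each chunk as an independent .xz stream on a thread pool, and concatenate. Concatenated streams are valid .xz, so a regular decompressor reads the result. The function name and defaults are my own.

```python
import lzma
from concurrent.futures import ThreadPoolExecutor


def parallel_xz(data: bytes, chunk_size: int = 1 << 20, workers: int = 4) -> bytes:
    """Compress fixed-size chunks concurrently and join the resulting .xz streams.

    Threads are used on the assumption that the lzma module releases the
    GIL while compressing; swap in a process pool if that does not hold.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the streams line up with the chunks.
        return b"".join(pool.map(lzma.compress, chunks))


if __name__ == "__main__":
    data = bytes(range(256)) * 2000
    blob = parallel_xz(data, chunk_size=100_000)
    print(lzma.decompress(blob) == data)
```

The cost is ratio: chunks cannot share history, so smaller chunks compress worse.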

While we're talking about compression algorithms, here is a nice little gem called BMZ:


It's a fast compression scheme implemented using BMDiff and a Google Zippy clone (based off LZO).

I built and tried this out. I'm rather shocked to see speed and compression ratio better than gzip -9.

The results of my unscientific test (compression only):

   Compressor  Size  Ratio  Time
   gzip -1     23MB  88%     1.18s
   gzip -2     23MB  87%     1.38s
   bzip2       23MB  87%     5.57s
   xz -1       23MB  87%     5.35s
   xz -9       11MB  43%    10.58s
   bmz         13MB  45%     0.95s

It really depends on the data. For highly redundant data like web pages with lots of boilerplate header/footer, it can compress better because the bm_pack (first pass of bmz) looks for large common patterns over all the input. For typical text, it should be a little worse than gzip but faster.

BMZ = bmpack + lzo by default and can be combined with lzma if necessary. It's not really a BMDiff and Zippy clone, as I've never had a chance to see Google's implementation. It's based on the original Bentley & McIlroy paper: "Data Compression Using Long Common Strings", 1999. Even the two pass idea is from that paper. It was really a wacky experimental implementation (with a lot of room for improvement) to satisfy my curiosity. I'm a little surprised that the 0.1 version has been stable for quite a few people compressing TBs of data through it.

The only reason I stick with gzip over bz2 or 7z (lzma) is that gzip is everywhere. In reality, nobody has a file compression utility that can handle LZMA (even though I do, I never use it).

Interestingly, file(1) can't identify lzma(1)-compressed files (they show up as data), but unlzma(1) reassuringly knows the difference between LZMA and random data. This came up once or twice in DEFCON CTF quals (http://www.vnsecurity.net/2010/05/defcon-18-quals-writeups-c...).

That's why you should use xz instead of lzma (they're not the same thing) — the XZ format has a magic number.
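
For reference, the magic is the six bytes FD 37 7A 58 5A 00 at the start of every .xz stream. A quick check in Python, also showing that the legacy .lzma container has no such signature (`looks_like_xz` is a name I made up):

```python
import lzma

XZ_MAGIC = b"\xfd7zXZ\x00"  # FD 37 7A 58 5A 00


def looks_like_xz(blob: bytes) -> bool:
    """True if the data starts with the .xz stream header magic."""
    return blob[:6] == XZ_MAGIC


if __name__ == "__main__":
    data = b"hello hello hello"
    print(looks_like_xz(lzma.compress(data)))                            # .xz container
    print(looks_like_xz(lzma.compress(data, format=lzma.FORMAT_ALONE)))  # legacy .lzma
```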

Arch Linux already switched to xz as the default compression for packages

7-Zip utility (for compressing and decompressing different file types) doesn't?

The specification ( http://tukaani.org/xz/xz-file-format.txt ) reads like this format allows the application of a chain of compression "filters". Perhaps the most interesting, and least portable, is the filter that modifies code to make compression easier. (Section 5.2.3) Is this mainly for Intel architectures?
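
On the architecture question: as far as I can tell it is not x86-only. Python's `lzma` module, which wraps liblzma, exposes branch/call/jump filter IDs for x86, PowerPC, IA-64, ARM, ARM-Thumb and SPARC. Here is a sketch of a two-filter chain, x86 BCJ followed by LZMA2; the function name is mine, and whether the filter helps depends on the input actually being x86 machine code.

```python
import lzma


def xz_with_x86_filter(data: bytes, preset: int = 6) -> bytes:
    """Compress with a filter chain: x86 BCJ first, then LZMA2.

    The chain is recorded in the .xz block header, so the decompressor
    needs no extra arguments to undo it.
    """
    filters = [
        {"id": lzma.FILTER_X86},
        {"id": lzma.FILTER_LZMA2, "preset": preset},
    ]
    return lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters)


if __name__ == "__main__":
    sample = bytes(range(256)) * 64  # stand-in bytes, not real machine code
    blob = xz_with_x86_filter(sample)
    print(lzma.decompress(blob) == sample)
```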

I use 7-zip 9.15 beta - it supports LZMA2 compression and is available to download here:


Also notably, Slackware has shifted its packaging format to an xz tar since the last two releases. This validates xz as a good replacement for gz/bz2.

Xz appears to use the same algorithm as 7zip.

Here are some comparisons between the big 3 compression algorithms (taken from http://blogs.reucon.com/srt/tags/compression/ -- he used a 163 MB Mysql dump file for the tests):

  Compressor  Size   Ratio  Compression  Decompression
  gzip        89 MB  54 %   0m 13s       0m 05s
  bzip2       81 MB  49 %   1m 30s       0m 20s
  7-zip       61 MB  37 %   1m 48s       0m 11s

I tried it on a 943M tarball that contains miscellaneous Git repositories and their checked out trees that I had lying around:

    Compressor  Size  Ratio  Compression  Decompression
    gzip -9     555M  59%    2m39.840s    0m16.495s
    bzip2 -9    531M  56%    4m10.541s    1m27.720s
    xz -9       457M  48%    13m55.730s   0m53.290s

Testing them all on -9 is perhaps not a fair comparison. For example, according to http://changelog.complete.org/archives/931-how-to-think-abou..., gzip -9 saved a tiny amount of space compared to gzip's default, but took significantly longer.

So bzip2 and 7-zip are way, way slower than gzip, then?

Bandwidth is cheap. Stick to gzip.

It's not as simple as that. Which one is better depends on the use case. If you're sending a one-off file to somebody, sure, gzip is better. But if you want to distribute a file to a large number of people (like Linux distributions do with their packages), the extra CPU time is insignificant compared to the bandwidth saved over the course of thousands of downloads.

It's very annoying to wait minutes to decompress big files. In particular installation times.

Decompression is more often limited by disk I/O, in my experience, particularly when the source and destination are the same disk. I can often get large improvements in decompression and installation speed by putting the source file and / or temporary installation files on a different disk.

It's not always I/O speed. When installing, you can see CPU usage go to 100% (or fans kicking in) for BWT/LZM* and not for DEFLATE (unless you use -9 or something like that). While you install something, at least one of your cores is unavailable for anything else.

This affects energy consumption, too.

And think about both mobile and servers. Those systems are usually more sensitive to high CPU load.

I have a draft blog post with analysis of different protocols with valgrind and other tools. But it is so much data to present and graph that I never get around to finishing it :(

If you look at some of the stats people are posting, it's the compression that takes the most time, not the decompression. gzip has fast compression and decompression, which is why it's used for things like compressing network streams (HTTP, SSH, etc.). But when you want to package up large files for distribution to a large audience, then it makes more sense to throw some extra CPU time at the compression to get a smaller package (so long as the decompression time on the other end is reasonable).

  > If you look at some of the stats people are posting, it's the
  > compression that takes the most time, not the decompression
5 vs. 11 seconds. Worse than 2x slower decompression:


If you have to wait minutes to download the files it doesn't matter, but if you already have the file locally it is very annoying.

Also, if this is used extensively on projects with a large server deployment, this matters even more for latency and energy consumption. That's why Google has their own compression algorithms derived from BMDiff and LZW (Zippy). Think about it. Speed matters.

Are you willing to donate money to your favorite open source software so they can afford the bandwidth? If not, don't complain about having to spend a few more seconds decompressing the latest release (which you're getting for free).

As a programmer, I would rather work on a patent-unencumbered and open source compression algorithm solving exactly this problem. Perhaps investing months of my own unpaid time on it. HINT

... such as BitTorrent distribution.

Mileage may vary. If the time it takes to compress costs much less than the time it takes to transfer over the network, you might not want to use gzip. For example, say you're transferring a large file (1GB? 1TB?) to a remote person to deal with: is it cheaper to gzip (lower compression ratio), take longer for the network transfer (most likely the slowest step), and have the other person unzip, or to use a better compressor and have the file transferred quicker?
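
That trade-off is easy to put numbers on. A back-of-the-envelope sketch using the gzip -9 and xz -9 figures from the 943M tarball test posted in this thread; I am reading the sizes as megabytes, and the link speeds are hypothetical.

```python
def one_off_seconds(size_mb, compress_s, decompress_s, link_mb_per_s):
    """Total wall time to compress once, send once, and decompress once."""
    return compress_s + size_mb / link_mb_per_s + decompress_s


# Timings and sizes from the 943M tarball comparison posted in this thread.
GZIP_9 = dict(size_mb=555, compress_s=159.840, decompress_s=16.495)
XZ_9 = dict(size_mb=457, compress_s=835.730, decompress_s=53.290)


def faster_choice(link_mb_per_s):
    """Return which of the two gives the lower end-to-end time at this link speed."""
    g = one_off_seconds(link_mb_per_s=link_mb_per_s, **GZIP_9)
    x = one_off_seconds(link_mb_per_s=link_mb_per_s, **XZ_9)
    return "gzip" if g <= x else "xz"


if __name__ == "__main__":
    for speed in (0.1, 1.0, 10.0):  # hypothetical link speeds in MB/s
        print(speed, faster_choice(speed))
```

With these particular numbers the crossover is around 0.14 MB/s: xz -9 only wins a single transfer on links slower than that. Lower xz presets or many downloads of the same file shift it a lot.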

The question in my mind is whether 7-zip has a good multicore implementation for compression/decompression. I recall that the multicore implementation by the original gzip author increased the speed on a 4-core machine by something like 3.75x.

Given the increasing CPU/network gap, it's only a matter of time before the bandwidth (and thus time) saved by XZ more than compensates for the slower compression.

Tried to apt-get it (on Ubuntu 9.10) but got scared when I got a warning about lzma getting uninstalled and possibly breaking my dpkg.

Looking good on Ubuntu 10.04

Same folder compressed in 1/2 the size as with tar cvfz, sick.

xz is backward compatible with lzma. It can both create and decompress lzma archives, and it has a tool with the same command line options.
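
The same holds at the library level, at least through Python's `lzma` bindings to liblzma: you can write the legacy .lzma container explicitly, and the default auto-detecting decoder reads both containers. The helper name is hypothetical.

```python
import lzma


def to_legacy_lzma(data: bytes) -> bytes:
    """Compress into the legacy .lzma ("alone") container instead of .xz."""
    return lzma.compress(data, format=lzma.FORMAT_ALONE)


if __name__ == "__main__":
    data = b"backward compatible " * 50
    legacy = to_legacy_lzma(data)
    modern = lzma.compress(data)  # default is the .xz container
    # lzma.decompress defaults to auto-detecting the container format.
    print(lzma.decompress(legacy) == data, lzma.decompress(modern) == data)
```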

lzma is deprecated by its author in favor of xz.

xz-utils deprecates lzma.

  $ dpkg -s xz-utils|grep Replaces
  Replaces: lzma
  $ dpkg -S $(which lzma)
  xz-utils: /usr/bin/lzma
