Happy new year and goodbye bzip2 (kernel.org)
158 points by Morgawr on Dec 28, 2013 | 31 comments



Good riddance; for text files, gzip is mediocre compression and fast, lzma (xz) is high compression and slow, and bzip2 is medium-high compression and _really_ slow [0].

[0] http://tukaani.org/lzma/benchmarks.html


Your link shows bzip2 having a more balanced compression time than lzma, which is often 3-4 times slower to compress. So for a tarball xz does sound better, since you compress once and decompress many times. For compressing HTTP responses and the like bzip2 might be better, though for those cases gzip is probably best already.


You are right; I was speaking to the compression ratios and decompression times. For cases where something is compressed once and decompressed many times, I don't care much how long the initial compression takes (within reason, of course).
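To make the tradeoff concrete, here's a toy sketch using Python's stdlib bindings for the same three algorithms (gzip's deflate via zlib, bzip2, and xz's lzma); the exact ratios depend heavily on the input data, so treat the numbers as illustrative only:

```python
import bz2
import lzma
import zlib

# Highly redundant sample text; real-world inputs compress differently.
data = b"The quick brown fox jumps over the lazy dog. " * 2000

for name, packed in [("gzip/zlib", zlib.compress(data, 9)),
                     ("bzip2", bz2.compress(data, 9)),
                     ("xz/lzma", lzma.compress(data, preset=9))]:
    # Print compressed size and ratio relative to the original payload.
    print(f"{name:10s} {len(packed):6d} bytes "
          f"({100 * len(packed) / len(data):.2f}% of original)")
```

All three shrink this input dramatically; the interesting differences only show up on large, realistic corpora like the kernel tarball.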


Here is another benchmark with lz4 and lzo as well as lzma1, although those aren't very relevant for the tarball test. Gzip -9 is on par with bzip2 -1 while almost twice as fast. And it uses practically no memory(!).

http://pokecraft.first-world.info/wiki/Quick_Benchmark:_Gzip...


-1 is an edge case: if you're memory-constrained, bzip2 isn't a good fit anyway. Otherwise, you pay an extremely small percentage increase in CPU time to use -9, where bzip2 will massively outperform gzip on most workloads.
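The reason the level matters so much for bzip2 is that it sets the BWT block size (100 kB times the level), so -9 can exploit long-range redundancy that -1's 100 kB window never sees. A contrived sketch with Python's bz2 module, using a doubled random blob specifically chosen so the repeat is invisible to small blocks:

```python
import bz2
import os

block = os.urandom(300_000)
data = block + block  # a 300 kB-offset repeat; no 100 kB window can see it

small_blocks = bz2.compress(data, compresslevel=1)  # 100 kB BWT blocks
big_blocks = bz2.compress(data, compresslevel=9)    # 900 kB BWT blocks

# With -1, each block is pure random noise and incompressible; with -9,
# both copies land in one block and the duplication gets squeezed out.
print(len(small_blocks), len(big_blocks))
```

Real files rarely look like this, but tarballs full of similar files do benefit from the bigger window in the same way.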


Note that LZMA decompressed at 20 MB/s in that benchmark, while bzip2 managed about 6 MB/s.

LZMA tends to be faster and more efficient than bzip2 in all respects.


pedrocr said that LZMA decompression is faster, but that sometimes LZMA compression is slower.


I don't know how well lzma parallelises, but bzip2 is plenty fast for me with a multi-core version like pbzip2, as measured in wall-time on an otherwise idle system. Different considerations in different environments and all that.


Unfortunately there's no multithreaded option in the standard xz decompression tools (xzcat, unxz). However, I wrote (as a quick hack) a parallel xzcat:

http://git.annexia.org/?p=pxzcat.git;a=summary

I also wrote a random access program for xz files:

https://github.com/libguestfs/nbdkit

(in the plugins/xz subdirectory)


The version of the tools that I have installed (5.1.2alpha in Fedora) do support a -T option for parallel compression which I use regularly.

However, it is true that the same option does not seem to work everywhere. At least, it didn't seem to a few weeks ago on a different system (Ubuntu): the option is accepted, but does not seem to have any effect.


lzma parallelizes very well! My implementation of parallel xz ( https://github.com/vasi/pixz ) works quite well. There are others as well, including pxz and the alpha version of the standard xz. All these tools produce compressed files that conform to the xz format and can be decompressed by any xz utility.
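The trick these tools rely on is that the xz container allows multiple concatenated streams, and every conforming decoder reads them back to back. Here's a hypothetical helper (`parallel_xz` is my name, not how pixz works internally) sketching the idea with Python's lzma module; threads suffice here because CPython releases the GIL while compressing:

```python
import lzma
from concurrent.futures import ThreadPoolExecutor

def parallel_xz(data, chunk_size=1 << 20, workers=4):
    """Compress fixed-size chunks as independent .xz streams, in parallel,
    and concatenate them. xz tools (and lzma.decompress) transparently
    read concatenated streams, so the result is a valid .xz file."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return b"".join(pool.map(lzma.compress, chunks))

blob = b"linux-3.12.tar stand-in data\n" * 100_000
packed = parallel_xz(blob)
assert lzma.decompress(packed) == blob  # round-trips through plain lzma
```

The cost is a slightly worse ratio than one big stream, since matches can't cross chunk boundaries; that's the same tradeoff pbzip2 makes with bzip2 blocks.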


I'll throw out a vote for lbzip2. It threads strongly and doesn't cannibalize your memory like threaded xz does. (You can still use gzip where every CPU cycle counts.)


I was always fascinated by compression. It feels like no big progress was ever made in lossless compression. How would today's best compression algorithms compare to the ones used in the days of the C64? And compared to the first zip algorithm implemented in PKZIP?


Well, there are clear information theoretic limits on how much you can compress something, so any improvements are necessarily going to be incremental and converging to the theoretical limit. The more special-purpose the compression algorithm, the more assumptions it can make, the better it can perform on the sort of data it's designed for, at the expense of behaving poorly in the general case. Indeed, it can be shown that any possible lossless compression algorithm, on average, will yield an output larger than its input; it's just that most "interesting" data contains plenty of redundancy that's reliably compressible.
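You can see both halves of that claim with a two-line experiment: high-entropy input has no redundancy to remove and actually grows slightly (format overhead), while redundant input collapses. A sketch using Python's zlib (gzip's deflate):

```python
import os
import zlib

random_blob = os.urandom(100_000)       # ~incompressible by construction
redundant_blob = b"abcabcabc" * 11_112  # ~100 kB of pure repetition

# The random blob comes out slightly LARGER than 100000 bytes; the
# redundant one collapses to a tiny fraction of its original size.
print(len(zlib.compress(random_blob, 9)))
print(len(zlib.compress(redundant_blob, 9)))
```

That small expansion on random input is exactly the "on average, output larger than input" counting argument showing up in practice.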


Also, it often is better to measure an improvement as relative to the maximum improvement possible, rather than as a percentage of the file size.

For compression, let's say that the best possible compressor compresses a file to 30% of its size, and the current compressor reaches 50%. Then, an improvement to 45% should not be seen as 'only 5%', or as '10% smaller', but as '25% of the maximum possible improvement'. A follow-on step that gets you to 40% would be a larger improvement of 33%.

That, IMO, is a reasonable way to somewhat compensate for the fact that the low hanging fruit gets plucked by those who come first.
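The arithmetic above can be sketched in a few lines (the function name is mine, just for illustration; sizes are expressed as fractions of the original file):

```python
def share_of_possible_gain(old, new, best):
    """Improvement measured against the best achievable compressed size,
    rather than against the original file size."""
    return (old - new) / (old - best)

# Figures from the example above: best possible 30%, starting at 50%.
print(share_of_possible_gain(0.50, 0.45, 0.30))  # ~0.25: a quarter of the gap
print(share_of_possible_gain(0.45, 0.40, 0.30))  # ~0.33: a third of what's left
```

By this metric each later step looks bigger, which is the point: the remaining headroom keeps shrinking.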

And yes, there is a problem there. That 'best possible compressor', theoretically, can produce extremely small files. Maybe your Wikipedia dump happens to be equal to the binary expansion of sin(1/e + 34/sqrt(PI)) to a few billion digits, but how are you going to find out? So, for most files, we don't really know what that best compression is.


Quite a bit. For an excellent introduction to the topic, read this: http://mattmahoney.net/dc/dce.html

For fast lossless compressors, see this article using an Apple II to compare today's best (LZ4) with what was available when the hardware was popular/new.


One new thing is that the patents on arithmetic coding have expired. Libjpeg will use it now, but too much software won't understand the result.


The low-hanging fruit was picked long ago, but there has been some interesting recent work like Snappy, PAQ, and Intel's improvements to gzip ( http://mail.madler.net/pipermail/zlib-devel_madler.net/2013-... ).


Compression algorithms are improving. There are compression programs out there that will beat pretty much anything famous like xz, but 99% of the time at the expense of either compression/decompression time or memory.

PAQ8PX is a classic example of this - amazing compression ratios, but ridiculously slow.

http://www.maximumcompression.com/data/summary_mf.php

http://www.maximumcompression.com/data/summary_mf3.php

There are programs that can shrink JPG losslessly by 24%:

http://www.maximumcompression.com/data/jpg.php

NanoZip keeps a very good balance of speed and compression ratio, but it's so unpopular that it's not very practical.


With typical files, XZ Utils creates 30% smaller output than gzip and 15% smaller output than bzip2.

http://tukaani.org/xz/



The most appropriate compression algorithm to use is highly case-dependent (data structure, compression vs. decompression CPU/wall-clock time, compression ratio, system memory, bandwidth, etc.). I'm sure the people at kernel.org have their reasons to convert to xz.

The following link has some interesting (albeit non-exhaustive) benchmarks, comparing various compression algorithms (both serial and parallel implementations). http://vbtechsupport.com/1614/


Why ditch bzip2 and not gzip? It was clearly for storage space.


For compatibility and speed, gzip is a winner. It's fast and it's everywhere (plus it compresses text well enough that it makes sense to use in place of non-compressed files).

For strong compression, there are better options than bzip2.

In a sense, bzip2 is the "worst of both worlds".


Tl;dr, Thx for the summary


Try to extract an xz-compressed Linux kernel on a host with 128 MB RAM and a 500 MHz CPU. Then try to extract the bzip2 one - it will be very slow. IMHO gzip is the best option for old hardware or small virtual machines.
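The memory part is the real killer: an .xz file made with `xz -9` embeds a 64 MiB dictionary, and the decoder has to allocate it regardless of how small the host is, while gzip never needs more than a 32 KiB window. A sketch of this using the memlimit knob on Python's lzma decompressor (the payload is a stand-in, but the refusal is real liblzma behavior):

```python
import lzma

# Compressed with preset 9, so the headers declare a 64 MiB dictionary.
packed = lzma.compress(b"kernel-tarball stand-in\n" * 10_000, preset=9)

# A decoder capped at 1 MiB (think: tiny VM) refuses to even start,
# because the declared dictionary alone exceeds the limit.
try:
    lzma.LZMADecompressor(memlimit=1 << 20).decompress(packed)
except lzma.LZMAError as err:
    print("decoder gave up:", err)
```

Lower xz presets declare smaller dictionaries, which is the usual workaround on constrained hosts.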


XZ sounds interesting. Does anybody know if work is in progress to add xz support in web browsers as an alternative to gzip compression? Is any web server already supporting xz?


I used to create WinRAR backup archives with recovery record or recovery volumes for DVD backups. Are there other (free and open) archive formats with such technologies? Or maybe stand-alone tools?


If I understand your question correctly, you should look into PAR2.


did someone say WebP? :)


Did the bzip2 maintainer steal someone's girlfriend or something?



