Large Text Compression Benchmark (mattmahoney.net)
69 points by Ashuu 1357 days ago | 11 comments



That's a good classic, but don't forget to also read his excellent writings on the connections between data compression and artificial intelligence:

http://mattmahoney.net/dc/rationale.html

Also, his book "Data Compression Explained" was a great help and an eye-opener for me:

http://mattmahoney.net/dc/dce.html


So, most interesting for me are the well-known compression utilities:

                Compression                      Compressed size      Decompresser  Total size   Time (ns/byte)
  Program           Options                       enwik8      enwik9     size (zip)   enwik9+prog  Comp Decomp  Mem Alg Note
  -------           -------                     ----------  -----------  -----------  -----------  ----- -----  --- --- ----
  
  7zip 4.46a        -m0=ppmd:mem=1630m:o=10 ... 21,197,559  178,965,454          0 xd 178,965,454    503   546 1630 PPM  23
  WinRAR 3.60b3     -mc7:128t+ -sfxWinCon.sfx   22,713,569  198,454,545          0 xd 198,454,545    506   415  128 PPM
  xz 5.0.1          -9 -e                       24,831,648  211,776,220    103,692 x  211,879,912   2482    36  660 LZ77 26
  bzip2 1.0.2       -9                          29,008,736  253,977,839     30,036 x  254,007,875    379   129    8 BWT
  gzip 1.3.5        -9                          36,445,248  322,591,995     38,801 x  322,630,796    101    17  1.6 LZ77
  compress 4.3d                                 45,763,941  424,588,663     16,473 x  424,605,136    103    70  1.8 LZW


I think bzip2 is still a very good compromise between speed and size, so I will stick with bzip2. It seems to sit in the middle on compression time, decompression time, memory use, and compressed size.


Note that there is a huge difference between compression levels in xz (LZMA), while there is almost no difference between bzip2 levels. This benchmark uses the slowest level, -9. If you try lower levels, e.g. xz -3, you may get a better compression ratio and faster compression than bzip2, and, xz being LZ77-based, much faster decompression. Here's an example with different levels: http://pokecraft.first-world.info/wiki/Quick_Benchmark:_Gzip...
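
A rough way to check this on your own machine, sketched with the benchmark's enwik8 file standing in for whatever you actually compress (-k just keeps the input so both tools see the same file; assumes stock bzip2 and xz):

  time bzip2 -9 -k enwik8                # bzip2 at its usual level
  time xz -3 -k enwik8                   # a fast xz level
  ls -l enwik8.bz2 enwik8.xz             # compare output sizes
  time bzip2 -dc enwik8.bz2 >/dev/null   # decompression speed
  time xz -dc enwik8.xz >/dev/null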

Also, I recommend using the parallel versions (pbzip2 or pxz) if you have more than one CPU core: compression is highly parallelizable, so with 2 cores they can roughly halve the compression time.
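
A minimal sketch with pbzip2 (assuming it's installed; big.xml is a made-up file name, and -p4 caps it at 4 cores, by default it uses all of them):

  time bzip2  -9 -c big.xml > big-1.bz2        # single-threaded
  time pbzip2 -9 -c -p4 big.xml > big-4.bz2    # same format, 4 threads

The output is still ordinary bzip2 data, so plain bzip2 can decompress it.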


I didn't know about the parallel versions, thanks!


I don't know whether it makes a big difference, but you should also try to get data for your representative load (and, frankly, so should I; I tend to use bzip2 or 7z without much testing). How often do you compress 100-megabyte or even one-gigabyte XML files?
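
Something like the following is usually enough to settle it, with sample.xml standing in for whatever you actually ship around (the -k flag needs reasonably recent versions of gzip/bzip2/xz):

  for c in gzip bzip2 xz; do
    echo "== $c =="
    time $c -9 -k sample.xml
  done
  ls -l sample.xml sample.xml.*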

Also, if you are compressing lots of small files and won't often need to extract them individually, you could use what rar and 7z call solid archives: all files concatenated and then compressed (equivalent to .tar.gz or .tar.bz2).
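
Roughly like this (logs/ is just a placeholder directory; -ms=on is 7-Zip's solid-mode switch and already the default for .7z, -s is RAR's):

  tar -cf - logs/ | xz -6 > logs.tar.xz   # the tar-pipe equivalent
  7z a -ms=on logs.7z logs/               # solid 7z archive
  rar a -s logs.rar logs/                 # solid rar archive

The trade-off is the one already implied: extracting a single file means decompressing everything stored before it in the solid block.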


It would be interesting to put those industry-standard compressors on the graphs to see how far behind the frontier they are.


The charts are horrible:

1) The y-axis (time) is logarithmic, while the x-axis (size) is linear.

2) The x-axis begins at 1000, which makes it look as if there were a huge size difference between the slowest programs, when there isn't.


1) What is so bad about a log-linear graph? My rule of thumb is to use a scale where the graph looks like a line. Lines are easy to understand, and it's easy to apply the inverse transformation to recover the true form of the curve and an approximate formula. On a graph it's difficult to distinguish the positive parts of 1/x, 1/x^2, ..., e^{-x}, ...


Because we don't perceive time logarithmically. A 100 sec compression is 10 times worse than a 10 sec compression.

So looking at the first chart, the lpaq9i and durilca4linux points are right next to each other, while in reality durilca4linux is 530 sec slower.


I don't see the problem. It's obvious what the chart conveys if you stop to think for a moment (at least speaking for myself, I'm so used to looking at graphs with a logarithmic time scale that I didn't even think about it). In fact, it's reasonable that a linear improvement in compressed size should require an exponential increase in time and memory.
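
One way to make that intuition concrete (my own back-of-the-envelope model, not anything from the benchmark page): assume each further byte shaved off the output costs a constant multiplicative factor c > 1 in running time. Then

  t(s) \approx t_0 \cdot c^{s_0 - s}
  \quad\Longrightarrow\quad
  \log t(s) \approx \log t_0 + (s_0 - s) \log c

which is linear in s, so a logarithmic time axis against a linear size axis is exactly the scale that turns this kind of diminishing-returns behaviour into a straight line.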


If the size axis were logarithmic, then the differences would become invisible. There are sharply diminishing returns as size improves.



