
Large Text Compression Benchmark - Ashuu
http://mattmahoney.net/dc/text.html
======
vog
That's a good classic, but don't forget to also read his excellent writings on
the connections between data compression and artificial intelligence:

[http://mattmahoney.net/dc/rationale.html](http://mattmahoney.net/dc/rationale.html)

Also, his book "Data Compression Explained" was a great help and an eye-opener
for me:

[http://mattmahoney.net/dc/dce.html](http://mattmahoney.net/dc/dce.html)

------
buster
So, most interesting for me are the well-known compression utilities:

    
    
                    Compression                      Compressed size      Decompresser  Total size   Time (ns/byte)
      Program           Options                       enwik8      enwik9     size (zip)   enwik9+prog  Comp Decomp  Mem Alg Note
      -------           -------                     ----------  -----------  -----------  -----------  ----- -----  --- --- ----
      7zip 4.46a        -m0=ppmd:mem=1630m:o=10 ... 21,197,559  178,965,454          0 xd 178,965,454    503   546 1630 PPM  23
      WinRAR 3.60b3     -mc7:128t+ -sfxWinCon.sfx   22,713,569  198,454,545          0 xd 198,454,545    506   415  128 PPM
      xz 5.0.1          -9 -e                       24,831,648  211,776,220    103,692 x  211,879,912   2482    36  660 LZ77 26
      bzip2 1.0.2       -9                          29,008,736  253,977,839     30,036 x  254,007,875    379   129    8 BWT
      gzip 1.3.5        -9                          36,445,248  322,591,995     38,801 x  322,630,796    101    17  1.6 LZ77
      compress 4.3d                                 45,763,941  424,588,663     16,473 x  424,605,136    103    70  1.8 LZW

(Sizes are in bytes; Mem is in MB.)

I think bzip2 still offers a very good compromise between speed and size, so I
will stick with it. It sits in the middle on compression time, decompression
time, memory use, and compressed size.

~~~
dchest
Note that there is a huge difference between compression levels in xz (LZMA),
while there is almost no difference between bzip2 levels. This benchmark uses
the slowest, 9th level. If you try lower levels, e.g. xz -3, it may give a
better compression ratio and better compression speed than bzip2, and, being
an LZ77-family algorithm, much better decompression speed. Here's an example
with different levels: [http://pokecraft.first-world.info/wiki/Quick_Benchmark:_Gzip...](http://pokecraft.first-world.info/wiki/Quick_Benchmark:_Gzip_vs_Bzip2_vs_LZMA_vs_XZ_vs_LZ4_vs_LZO)
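
If you want to check this on your own data, here is a minimal sketch using
Python's standard-library bz2 and lzma modules (the input file name is a
placeholder; lzma presets 0-9 correspond to the xz command-line levels):

    import bz2
    import lzma
    import time

    def bench(name, compress, data):
        t0 = time.perf_counter()
        out = compress(data)
        dt = time.perf_counter() - t0
        print(f"{name:10s} {len(out):>12,d} bytes  {dt:7.2f} s")

    data = open("enwik8", "rb").read()  # any large test file

    bench("bzip2 -9", lambda d: bz2.compress(d, compresslevel=9), data)
    for preset in (1, 3, 6, 9):
        bench(f"xz -{preset}", lambda d, p=preset: lzma.compress(d, preset=p), data)

Decompression can be timed the same way with bz2.decompress and lzma.decompress.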

Also, I recommend using the parallel versions (pbzip2 or pxz) if you have more
than one CPU core: compression is highly parallelizable, so with 2 cores you
can roughly halve the compression time.
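
The reason this parallelizes so well: bzip2 compresses the input in
independent blocks, and concatenated bzip2 streams are still a valid .bz2
file. Here's a rough sketch of the pbzip2 idea in Python (block size and
worker count are arbitrary choices, and the whole file is read into memory
for simplicity):

    import bz2
    from concurrent.futures import ProcessPoolExecutor

    BLOCK = 900 * 1024  # bzip2 -9 works in ~900 kB blocks internally anyway

    def compress_block(block):
        return bz2.compress(block, compresslevel=9)

    def parallel_bzip2(src, dst, workers=4):
        with open(src, "rb") as f:
            blocks = []
            while True:
                chunk = f.read(BLOCK)
                if not chunk:
                    break
                blocks.append(chunk)
        with ProcessPoolExecutor(max_workers=workers) as pool, \
             open(dst, "wb") as out:
            # map() preserves block order, so the output is a valid
            # multi-stream .bz2 that plain bunzip2 can decompress
            for compressed in pool.map(compress_block, blocks):
                out.write(compressed)

    if __name__ == "__main__":
        parallel_bzip2("enwik8", "enwik8.bz2")  # placeholder file names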

~~~
buster
I didn't know about the parallel versions, thanks!

------
rorrr2
The charts are horrible

1) The y-axis (time) is logarithmic, while the x-axis (size) is linear.

2) The x-axis begins at 1000, which makes it seem like there's a huge size
difference between the slowest programs, which isn't true.

~~~
gus_massa
1) What is so bad about a log-linear graph? My rule of thumb is to use a scale
where the graph looks like a line. Lines are easy to understand, and it's easy
to apply the inverse transformation to recover the true form of the curve and
an approximate formula. On a linear graph it's difficult to distinguish the
positive parts of 1/x, 1/x^2, ..., e^{-x}, ...
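
To see the point concretely: on a semilog-y plot an exponential becomes a
straight line, while on a log-log plot power laws become straight lines whose
slope is the exponent, so the three curves are easy to tell apart. A small
sketch with matplotlib (the sample curves are arbitrary):

    import numpy as np
    import matplotlib.pyplot as plt

    x = np.linspace(1, 10, 200)
    curves = {"1/x": 1 / x, "1/x^2": 1 / x**2, "e^{-x}": np.exp(-x)}

    fig, axes = plt.subplots(1, 3, figsize=(12, 3))
    for ax, scale in zip(axes, ("linear", "semilog-y", "log-log")):
        for label, y in curves.items():
            ax.plot(x, y, label=label)
        if scale != "linear":
            ax.set_yscale("log")   # e^{-x} becomes a straight line
        if scale == "log-log":
            ax.set_xscale("log")   # 1/x and 1/x^2 become straight lines
        ax.set_title(scale)
        ax.legend()
    plt.show()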

~~~
rorrr2
Because we don't perceive time logarithmically. A 100 sec compression is 10
times worse than a 10 sec one.

So looking at the first chart, the lpaq9i and durilca4linux points are right
next to each other, while in reality durilca4linux is 530 seconds slower.

~~~
fdej
I don't see the problem. It's obvious what the chart conveys if you stop to
think for a moment (at least speaking for myself, I'm used to looking at
graphs with a logarithmic time scale, and didn't think about it). In fact,
it's reasonable that a linear improvement in compressed size should require an
exponential increase in time and memory.
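
To spell that intuition out (my own sketch, not something from the benchmark
page): if the achievable compressed size falls linearly in the log of the
time budget,

    s(t) = a - b ln(t)    =>    t(s) = e^{(a - s)/b}

then every further fixed reduction Δs in compressed size multiplies the
required time by the constant factor e^{Δs/b}, which is exactly why evenly
spaced points on a log time axis correspond to roughly equal gains in size.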

