
Compression Benchmarks: brotli, gzip, xz, bz2 - datajeroen
https://www.opencpu.org/posts/brotli-benchmarks/
======
hthh
"It's weird how there are like two alternate realities of data compression.

"There's people who actually know WTF they're doing. (eg. encode.ru & the
DCC's and so on). In that world you have compressors like PAQ, Nanozip,
lzturbo, tornado, CCM, ZCM, Zstd, LZ4, BMF, BCIF, gralic, etc. etc.

"Then there's the mainstream world, where people still think gzip is state of
the art (or ooh real modern bzip2), and they invent new things like brotli and
webp and seem to not pay any attention to that alternate reality where the
experts live. They sometimes do good work, but it's just a little odd."

\- Charles Bloom, 10-02-15
([http://www.cbloom.com/rambles.html](http://www.cbloom.com/rambles.html))

------
jmspring
There are standard corpuses for compression / decompression. A test against a
specific README file isn't interesting.

Give me the full results against the standard corpuses (in different sizes).

The biggest fallacy of "compression results" is non-standard data sets.

Edit: spent years in image/video compression under Langdon, involved with
JPEG2000, etc.

~~~
danieltillett
Do you know a current good comparison?

~~~
ahoge
[https://quixdb.github.io/squash-benchmark/](https://quixdb.github.io/squash-benchmark/)

------
pengy
The published results aren't consistent with the general understanding that xz
is faster than bzip2 while compressing better even at the lowest settings.

Xz beyond the lowest compression setting quickly enters the realm of
diminishing returns, but when even the lowest setting compresses better than
bzip2 and is faster in both compression and decompression, there is no reason
to use bzip2.

The opencpu post completely ignores the compression-setting dimension,
presenting an incomplete picture. Xz is shown as consistently the slowest by a
wide margin, even relative to bzip2. This is unexpected: xz will be faster than
bzip2 in both compression and decompression when using a compression setting
appropriate for comparison with bzip2's compression ratios.

~~~
rwmj
Also, xz upstream defaults to single-threaded operation. However, xz
decompression scales pretty much linearly at least up to 4 cores - I wrote a
parallel pxzcat to prove that:

[http://git.annexia.org/?p=pxzcat.git;a=summary](http://git.annexia.org/?p=pxzcat.git;a=summary)

------
ot
> Hence we can conclude

By testing on _one_ file. Furthermore, Brotli "cheats" by having a static
dictionary that includes English words, so it doesn't make sense to compare it
against general-purpose compressors on the COPYING file.

~~~
Dylan16807
Its job is to compress html, so at the very least the html-related dictionary
contents make a lot of sense.

------
joosters
They are testing compression on a 6849-byte file full of English text (the
GPL).

This is a terrible comparison test - first of all, it's biased because of
Brotli's built-in dictionary, and the file is so small that the measured
performance might well be dominated by the startup time of each compressor /
decompressor. Consider that the compression level of bzip2 (the 1-9 parameter)
controls a search block size of 100k-900k - so the input file is far, far
smaller than even the smallest, worst-compressing setting.

There are several good example file sets to try compression with; they should
be used if you want to do any kind of comparative general testing.
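The block-size point is easy to demonstrate with Python's stdlib `bz2` module
(a stand-in for the bzip2 tool; the input below is illustrative): for an input
far below the 100k block of level 1, every level sees the whole file in one
block and produces the same-sized output.

```python
# bzip2's 1-9 setting selects a block size of 100k-900k. For an input
# well under 100k, levels 1 and 9 compress identically apart from the
# header byte that records the level, so the sizes come out equal.
import bz2

data = b"A small English text file, well under bzip2's 100k block size. " * 100

low = bz2.compress(data, compresslevel=1)
high = bz2.compress(data, compresslevel=9)
print(len(data), "bytes in;", len(low), "bytes at -1;", len(high), "bytes at -9")
```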

~~~
kijin
The article explicitly states that it is trying to find out which algorithm is
optimal for small text-based documents that are common on the web, such as
CSS, JS, and HTML.

The best compression algorithm for small text files might be very different
from the best algorithm for large amounts of general data. If your browser
needs to decompress 100 separate files in a fraction of a second, startup time
does matter.

~~~
joosters
Fair enough, but if you want to compress CSS, JS and HTML then why not test it
on those?

It also depends upon how you are supplying the compressed data. For static
files, compression time might not even matter: the web server can cache all
the compressed versions, so the cost of compressing is paid only once.

For dynamic data (e.g. the output of a PHP script or the like), you might also
want to test how well the compressors handle streaming data, e.g. feeding
parts of a page piecemeal into the compressor and directly sending the output
to the client packet by packet. Some compression systems handle this extremely
badly.
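The piecemeal case can be sketched with Python's stdlib `zlib` (the DEFLATE
library behind gzip); the chunk contents are made up. A sync flush after each
chunk forces out everything buffered so far, so the client can decode the
chunk immediately, at some cost in compression ratio:

```python
# Streaming compression: flush after each chunk so the receiver can
# decompress it right away instead of waiting for the end of stream.
import zlib

compressor = zlib.compressobj(level=6)
decompressor = zlib.decompressobj()

chunks = [b"<html><head>stub</head>", b"<body>part one, ", b"part two.</body></html>"]
received = b""
for chunk in chunks:
    # Z_SYNC_FLUSH pushes all buffered output onto the wire now.
    wire = compressor.compress(chunk) + compressor.flush(zlib.Z_SYNC_FLUSH)
    received += decompressor.decompress(wire)
    assert received.endswith(chunk)  # chunk fully visible to the client

received += decompressor.decompress(compressor.flush())  # finish the stream
assert received == b"".join(chunks)
```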

------
nailer
> Brotli decompression is at least as fast as for gzip while significantly
> improving the compression ratio. The price we pay is that compression is
> much slower than gzip.

Is it as fast as gzip or slower?

~~~
myle
Fast decompression, slower compression.

------
threeseed
Would have been useful to do this against Snappy and LZ4.

Both are quite popular in the Big Data space.

~~~
wmf
Brotli is a "strong" compression algorithm so it's obviously going to be much
slower with much better compression than "light" algorithms like Snappy or
LZ4.

------
kuprel
What's the Weissman score?

