Compress data more densely with Zopfli (googledevelopers.blogspot.com)
132 points by speeder 1606 days ago | 42 comments

A quick summary of the difference between this and existing implementations of deflate such as gzip or zlib:

Deflate combines two compression techniques: Huffman coding to replace each value with a string of bits whose length depends on the frequency of that value, and backreferences of the form "go back N bytes and repeat M bytes from there". (Glossing over a pile of fun interactions between those two, such as Huffman-coding both literals and backreferences, and Huffman-coding the Huffman tables. Not to mention the ability to have M > N, repeating the earlier parts of the thing you backreferenced. See RFC1951, https://www.ietf.org/rfc/rfc1951 , for the full details.)
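To make the backreference mechanics concrete, here's a toy sketch (mine, not Zopfli's code) of how a decoder expands one, including the M > N overlap case:

```python
def copy_backreference(out: bytearray, distance: int, length: int) -> None:
    # "Go back `distance` bytes and repeat `length` bytes from there."
    # Copying byte-by-byte is what makes length > distance legal:
    # later iterations read bytes this same reference just produced.
    start = len(out) - distance
    for i in range(length):
        out.append(out[start + i])

buf = bytearray(b"ab")
copy_backreference(buf, 2, 6)   # M=6 > N=2: the overlap repeats "ab"
```

After the call, `buf` holds `b"abababab"`; the single (2, 6) token acted like run-length encoding of the two-byte pattern.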

Existing deflate implementations use various heuristics to decide whether to encode the next chunk of data as a literal symbol or a backreference, such as a hash table of candidate strings to backreference; the compression level (-1 through -9) tunes those heuristics to take more or less time, most notably by changing the lengths of strings stored in and searched for in the hash table.
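You can watch those heuristics being dialed up and down with stock zlib; a quick sketch (my own example, nothing to do with Zopfli's internals):

```python
import zlib

# Mildly repetitive input, so the match-finding effort actually matters.
data = b"the quick brown fox jumps over the lazy dog " * 2000

fast = zlib.compress(data, 1)   # short hash-chain searches, greedy matching
best = zlib.compress(data, 9)   # much deeper searches for longer matches
print(len(data), len(fast), len(best))
```

Same format out of both calls; the level only changes how hard the encoder looks for backreferences.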

Zopfli says it uses "more exhaustive compression techniques" based on "iterating entropy modeling and a shortest path search algorithm to find a low bit cost path through the graph of all possible deflate representations." The entropy modeling allows Zopfli to estimate the effectiveness of a given approach to deflate. The path-search algorithm effectively treats the space of all possible deflate representations as a single-player game, with "moves" that change the representation in some incremental way, and then uses standard search algorithms to find a near-optimal representation.
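If it helps, here's a toy version of that shortest-path framing, with made-up fixed bit costs standing in for the entropy model (the function name and costs are invented for illustration; real Zopfli re-estimates the costs from the model and iterates):

```python
def cheapest_parse(data, literal_cost=9, match_cost=20, min_match=3):
    # best[i] = cheapest bit cost found so far to encode data[:i].
    # Nodes are byte positions; edges are "emit a literal" (advances
    # 1 byte) or "emit a backreference" (advances min_match..length
    # bytes). Finding the parse is then shortest path over that DAG.
    n = len(data)
    INF = float("inf")
    best = [INF] * (n + 1)
    best[0] = 0
    for i in range(n):
        if best[i] == INF:
            continue
        # edge: encode data[i] as a literal
        best[i + 1] = min(best[i + 1], best[i] + literal_cost)
        # edges: backreference to any earlier occurrence (naive search)
        for j in range(i):
            length = 0
            while i + length < n and data[j + length] == data[i + length]:
                length += 1
            for m in range(min_match, length + 1):
                best[i + m] = min(best[i + m], best[i] + match_cost)
    return best[n]

print(cheapest_parse(b"abcabcabc"))  # 3 literals + one overlapping match
```

With these toy costs the search correctly prefers one long (overlapping) match over two short ones.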

(Without reading the source in detail, I don't know whether the search space includes the choice of Huffman tables, the set of possible backreferences, or both. I can see how either one could map onto a search space, though.)

In case you're wondering how to try it: the site lacks a simple getting-started guide. Sure, it only takes a few minutes to figure out, but for anyone who wants to try it with under 30 seconds of work, do this:

  git clone https://code.google.com/p/zopfli/
  cd zopfli
  make
  ./zopfli <FILENAME>
  Optional: copy the binary to /usr/local/bin (or your favorite path location)

Not a scientific experiment or anything like that, I just tried it on uncompressed jQuery for fun:

Original file: 268381 bytes

Zopfli (with -i1000) 75730 bytes in 950ms

Gzip (with -9) 79388 bytes in 30ms

It's really slow but the difference seems significant enough. (They have more data here: https://zopfli.googlecode.com/files/Data_compression_using_Z...)

Importantly, though, the decompression speed is the same as deflate (e.g, gzip). It's only slower during compression.
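That follows from the format being fixed: however hard the encoder worked, the decoder does the same job. A quick illustration with stock zlib, using compression levels as a stand-in for encoder effort:

```python
import zlib

data = b"compress me " * 10_000

for level in (1, 6, 9):
    blob = zlib.compress(data, level)
    # One decoder; it needs no knowledge of how much effort the
    # encoder spent choosing the representation.
    assert zlib.decompress(blob) == data
```

A Zopfli-produced stream is just another valid deflate stream to that same decoder.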

As I suspected, XZ performs better at least for jquery. I was disappointed it was missing from benchmarks.

For comparison on my machine:

zopfli -i1000 takes [real, sys]=[1069ms, 21ms], 75730 bytes.

xz -9 takes [148ms, 15ms], 69476 bytes.

(Though as mentioned as a pro of zopfli, xz would lose on decompression speed for a larger file.)

Yeah, but the point here is to produce a .gz file which is compatible with standard tools like gunzip, PNG decoders, your browser's handling of "Content-Encoding: gzip", etc.

If you control the compressor and decompressor then applying new tricks to producing a Deflate stream is far less interesting, just use a better compressor.

Can we get browsers to support LZMA? :) (Hey, a bug report! https://bugzilla.mozilla.org/show_bug.cgi?id=366559) But yeah, zopfli does seem like an excellent fit when you can't force a particular decompressor.

and bzip2 gets it down to 65195 bytes.

I've not used xz, I'm sure it's good, but the point isn't really about which compression algorithm is the best in the world, but which works best in the (depressingly limited) real world conditions.

If it is any indicator, current versions of FreeBSD and some Linux distros have "xz" as part of the base system, along with their gzip and bzip2 counterparts, including xzcat/xzless/etc., newsyslog/logrotate compression options for xz, xz-compressed tarballs for source distributions (tar.xz / txz), as well as binary packages for FreeBSD compressed with the format.

LZMA compression has gained a lot of traction over the past few years.

Thanks - I meant to update my comment after I went and looked up xz, I've used 7zip and LZMA before, but I've never directly come across a .xz file, odd!

Tried compressing the minified jQuery, got:

  32653 gzip -9
  31763 7z -tgzip -si -mx=9 -mfb=258 -mpass=16
  31691 zopfli -i1000
IOW, it yields a much smaller benefit as data gets harder to compress. Of course this is not at all surprising.

For my tests I used the default compression levels, so 6 (out of 1-9) for gzip, 6 (out of 1-9) for xz, and 15 (out of 5-1000) for zopfli. My input was the linux kernel 3.8.1 source tarball. This was run on a MacBook Pro with OS X 10.8.2 and a 2.2 GHz i7 processor.

  time    size    format
  n/a     987640  uncompressed
  18.44   210896  gzip
  281.44  143128  xz
  3604.21 200640  zopfli


Zopfli can be used with nginx without code changes with the static gzip module [1].

[1] http://wiki.nginx.org/NginxHttpGzipStaticModule

Absolutely love this Nginx module - use it all the time in production. For me, dropping Zopfli into the "pre-compress asset" phase will be trivial.

You can already get 3% more compression out of zlib by giving it a bigger dictionary like 7zip can optionally do.


  original  268381

  gzip       76070      (32KB dictionary, 128 word size)

  Zopfli     75730

340 bytes is not worth a 20x time increase for compression.

340 bytes is not worth a 20x time increase, but a 0.4% decrease in file size IS worth the time. If you can measure the time it takes to download those last 340 bytes, then you can start to count how many downloads it takes before it's worth it. (If download speed were constant, it would be about 224 downloads.)
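For the curious, here's one way to back out a figure in that ballpark; the link speed is an assumption I picked for illustration, and the break-even count scales linearly with it:

```python
# Break-even: how many downloads before the saved transfer time pays
# back the extra one-time compression time? Timings are from the
# jQuery test above; the link speed is an assumed, illustrative value.
extra_compress_s = 0.950 - 0.030          # zopfli -i1000 vs gzip -9
bytes_saved = 340
link_speed = 100_000                      # bytes/second (assumed)

seconds_saved_per_download = bytes_saved / link_speed
break_even_downloads = extra_compress_s / seconds_saved_per_download
print(round(break_even_downloads))        # about 271 at this assumed speed
```

Slower links break even sooner, since each download saves more wall-clock time; a figure near 224 falls out of a link speed around 83 KB/s.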

PNG seems like a good candidate for this style compression. A constant time initial cost for recurring savings in bytes transmitted seems like an excellent use of CPU time to me.

As an aside, I ran the Zopfli program with 10000 iterations on jquery-1.9.1.js:

  268381 jquery-1.9.1.js
   75622 jquery-1.9.1.js.gz

"A constant time initial cost for recurring savings in bytes transmitted seems like an excellent use of CPU time to me."

Wish more people understood this aspect of web scale! Well said :)

A fairer (in my opinion) comparison, with minimised jquery:

    92629 jquery-1.9.1.min.js
    32666 (35.26%) gzip -9
    31691 (34.22%) zopfli with standard compress
    31688 (34.20%) zopfli with max compress

     31783 gzip with 32KB dictionary and 128 word size
     31764 the above with the embedded filename stripped
(I was too lazy to actually strip out the filename, which costs a few bytes, so the real figure is more like 31764.)

So in this case it's 76 bytes.

I guess the real question is what is the decode penalty time if any?

It is if you serve the file a hundred billion times.

If that's your logic and proprietary compression engines are okay despite not being in any browsers, then 7zip ultra can get it to 69531 bytes. That's 5000+ bytes smaller than Zopfli.

bzip2 can get it to 65536 bytes, which is a whopping 10k smaller than zopfli.

Oh wait, I see that zopfli is deflate compatible so that means browsers can still decode it. Fascinating.

I wonder if there's any chance of browsers supporting xz? Maybe?

There is a compression-time, decompression-time trade-off whenever you use compression. Xz (or LZMA) is interesting in that its decode time is quite small compared to its encode time, unlike others like bz2. For large, static content the trade-off is quite good. However, with smaller files, like css and javascript sources, you really need to do some benchmarking to be sure.
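The encode/decode asymmetry is easy to sanity-check with Python's lzma module (a rough check, not a proper benchmark):

```python
import lzma
import time

data = b"moderately compressible text payload, repeated a lot. " * 40_000

t0 = time.perf_counter()
blob = lzma.compress(data, preset=6)   # xz's container format is lzma/LZMA2
t1 = time.perf_counter()
out = lzma.decompress(blob)
t2 = time.perf_counter()

assert out == data
print(f"encode {t1 - t0:.3f}s  decode {t2 - t1:.3f}s  "
      f"ratio {len(blob) / len(data):.4f}")
```

On typical hardware the decode time comes out far below the encode time, which is why pre-compressing static assets once and serving them many times is where xz shines.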

On-demand xz will probably never be worth it, however; the encode time just takes too long. You would have already transferred the data by the time it had finished compressing.

I was just talking about this with some other Firefox developers today, actually. So...yes, there's a chance. Not a very big one, mind you. But a guy can dream!

Apparently only gzip and zlib compatibility is included in the code.

Does anyone know if there are programs out there that allow recompressing zip or png format files with this library?

The "PNG crunch" programs out there - eg OptiPNG - actually do a bit more work. They'll try palette-ising a truecolor image, applying all the PNG filters (http://www.w3.org/TR/PNG-Filters.html), etc.

They also tend to use forked versions of zlib, often to expose some more internal knobs to trial-and-error on, so a zlib-compatible wrapper around zopfli wouldn't necessarily just drop in.

But I'm sure support for zopfli will appear sooner rather than later :)

Ideally, you'd want a zlib-compatible zopfli library that you could just substitute for libz.so.1 when running an appropriate format-specific compressor that uses zlib.

It would be good to see some comparison with bzip2, which typically has better compression ratio than gzip.

FYI, bzip2 has been obsoleted by xz which is both faster and gives better compression. But as others have said, Google doesn't care because browsers do gzip.

In the future, browsers could also do xz; they'd just need to advertise it in their Accept-Encoding header. Google could put this in Chrome any time they like, and nothing would break.

As Jach points out above, Mozilla has had a bug[1] open about it for 6 years.

[1]: https://bugzilla.mozilla.org/show_bug.cgi?id=366559

There's also sdch.

Importantly, gzip is twice as fast as xz at decompression.

xz is essentially LZMA2. It definitely beats bzip2.

As a Swiss, I'm curious how the name came to be. Is the team related to ETH Zürich?

Lode Vandevenne wrote the blog post and the code. LinkedIn says he works at Google Zürich.

Thanks a lot. I'm still reluctant to use LinkedIn since much of it feels like extortion to me.

Btw, I'm surprised that you found the ü character. Not even US customs/border protection let me enter my German name in the ESTA form.

It's actually right in the post that he's based in Zurich.

I could be wrong, but I think it was edited in. Otherwise scaevolus would probably have pointed it out instead of going to hunt LinkedIn.

I guess Zopfli might be Zöpfli, which is a kind of braided Swiss bread: http://en.wikipedia.org/wiki/Zopf

Zopfli sounds like a pharmaceutical drug.
