1. They're overdone
2. They're uncreative
3. They're unoriginal
4. They're a crutch
5. They're assembly-line writing
6. They're used by tabloids to appeal to supermarket zombies
7. They've helped turn respectable mags into tabloids (or are a symptom of it, not sure which came first)
8. They're psychologically manipulative (I don't know how, I just know it)
9. They work
I do not mind lists, but I cannot stand lists that are separated into pages to inflate the page count!
If the original title begins with a number or number + gratuitous adjective, we'd appreciate it if you'd crop it. E.g. translate "10 Ways To Do X" to "How To Do X," and "14 Amazing Ys" to "Ys." Exception: when the number is meaningful, e.g. "The 5 Platonic Solids."
Using 50MiB from a tarball I had lying around, best of 3:
prog       time(s)   size(%)
gzip -1      1.703     38.7
gzip -9     14.670     35.1   (worse than xz -1)
bzip2 -1     7.714     34.0
bzip2 -9     8.035     31.4   (worse than xz -1)
xz -1        6.278     31.3
xz -9       42.445     22.7   (best compression)
lzop -1      0.670     46.1   (fastest)
lzop -9     18.144     38.0
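If anyone wants to run the same kind of test on their own data, something along these lines does the trick (test.tar is just a placeholder name, and the -f format flag assumes GNU time):

# time each compressor on the same tarball and report the output size
for c in "gzip -1" "gzip -9" "bzip2 -1" "bzip2 -9" "xz -1" "xz -9" "lzop -1" "lzop -9"; do
    /usr/bin/time -f "$c: %e s" $c -c test.tar > test.out
    ls -l test.out | awk -v c="$c" '{print c": "$5" bytes"}'
done

Run it a couple of times and take the best, as above.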
I'm insisting because PAQ is a really incredible collection of compressors. There's far more intelligence in it than in the LZ* bunch.
This guy did a not-so-rigorous analysis, but it's mostly OK:
A little bird told me there's a better algorithm in speed/compression rate coming soon ;)
xz format seems to have this feature, but:
$ xz --list
xz: --list is not implemented yet.
EDIT: also, like gzip but unlike bzip2, you can stream data through xz (at an insignificant penalty in compression ratio).
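For example (file names made up), you can pipe tar straight through xz and back without any temporary file:

$ tar cf - somedir | xz > somedir.tar.xz     # compress a stream on the way out
$ xz -dc somedir.tar.xz | tar xf -           # stream it back in on extraction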
Right now, CPUs are fast enough that LZMA is a realistic prospect. It boils down to what the extra CPUs will cost vs the cost of the storage saved (incl. the "cost" of space in the datacentre, the administrative overhead of more hardware, etc).
The server supplies a pre-created dictionary (template) which gets cached, and further requests get a diff to that dictionary as a response.
I've only found Google search using it on the server side, and the browser penetration is fairly low, but it seems promising. It makes requests very fast.
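Roughly, the exchange looks like this (header names as I remember them from the SDCH proposal; the dictionary path and hash below are invented):

GET /search?q=foo HTTP/1.1
Accept-Encoding: sdch, gzip
  -> response carries Get-Dictionary: /dict/search.dict, which the client fetches and caches

GET /search?q=bar HTTP/1.1
Accept-Encoding: sdch, gzip
Avail-Dictionary: AbCd1234
  -> response comes back Content-Encoding: sdch, as a VCDIFF delta against the cached dictionary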
I've also noticed that Google search serves image thumbnails as data: URLs directly in the original response. They are going a long way to ensure each page loads entirely in one request.
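If you've never built one, a data: URL is just the image's bytes base64-encoded and inlined into the markup; with hypothetical file names, something like:

$ echo "<img src=\"data:image/png;base64,$(base64 -w0 thumb.png)\">" >> page.html   # -w0: no line wrapping (GNU coreutils)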
Really expensive algorithms aren't enough better to offset the cost of sending the template logic to the client.
You could probably save some processing time by integrating compression directly into server-side templating: take advantage of the fact that some parts never change and keep them pre-compressed, or at least cache some statistics about them to aid compression.
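A minimal sketch of the pre-compressed-fragments idea, leaning on the fact that gzip members can be concatenated (file names are hypothetical): compress the static fragments once at deploy time, compress only the dynamic part per request, and splice the members together.

$ gzip -9 -c header.html > header.html.gz    # static, compressed once at deploy time
$ gzip -9 -c footer.html > footer.html.gz
$ gzip -1 -c body.html > body.html.gz        # dynamic part, compressed per request
$ cat header.html.gz body.html.gz footer.html.gz > page.html.gz
$ zcat page.html.gz                          # concatenated members decompress as one stream

Whether every HTTP client copes with multi-member gzip is another matter, so treat this as an illustration of the caching idea rather than production advice; it also doesn't exploit redundancy across fragments the way a preset-dictionary approach would.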
However, that page does not give the same picture of widespread adoption that the Wikipedia article does. That use was what interested me: I'm surprised it only came to my attention when I was trying to recall the name of the 7z command-line utility, p7zip.
Second, I am reading Of Mice and Men for the first time this summer. It is new to me!
The interface is still very rough, but it works. The xz utility comes with a very nice library and API, which made this a lot easier--thanks, Lasse!
It's a fast compression scheme implemented using BMDiff and a Google Zippy clone (based on LZO).
The results of my unscientific test (compression only):
Compressor   Size   Ratio    Time
gzip -1      23MB    88%    1.18s
gzip -2      23MB    87%    1.38s
bzip2        23MB    87%    5.57s
xz -1        23MB    87%    5.35s
xz -9        11MB    43%   10.58s
bmz          13MB    45%    0.95s
BMZ = bmpack + lzo by default and can be combined with lzma if necessary. It's not really a BMDiff and Zippy clone, as I've never had a chance to see Google's implementation. It's based on the original Bentley & McIlroy paper: "Data Compression Using Long Common Strings", 1999. Even the two pass idea is from that paper. It was really a wacky experimental implementation (with a lot of room for improvement) to satisfy my curiosity. I'm a little surprised that the 0.1 version has been stable for quite a few people compressing TBs of data through it.
Here are some comparisons between the big 3 compression algorithms (taken from http://blogs.reucon.com/srt/tags/compression/ -- he used a 163 MB Mysql dump file for the tests):
Compressor   Size    Ratio   Compression   Decompression
gzip         89 MB   54 %    0m 13s        0m 05s
bzip2        81 MB   49 %    1m 30s        0m 20s
7-zip        61 MB   37 %    1m 48s        0m 11s
Compressor   Size   Ratio   Compression   Decompression
gzip -9      555M   59%     2m39.840s     0m16.495s
bzip2 -9     531M   56%     4m10.541s     1m27.720s
xz -9        457M   48%     13m55.730s    0m53.290s
Bandwidth is cheap. Stick to gzip.
It's not as simple as that. Which one is better depends on the use case. If you're sending a one-off file to somebody, sure, gzip is better. But if you want to distribute a file to a large number of people (like Linux distributions do with their packages), the extra CPU time is insignificant compared to the bandwidth saved over the course of thousands of downloads.
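To put rough numbers on it with the MySQL dump figures above: gzip -9 left 555M and xz -9 left 457M, so each download saves roughly 98 MB. Across, say, 10,000 downloads that is close to 1 TB of bandwidth avoided, bought with about 11 extra minutes of compression time, paid once.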
This affects energy consumption, too.
And think about both mobile and servers. Those systems are usually more sensitive to high CPU load.
I have a draft blog post with an analysis of different protocols using valgrind and other tools, but it is so much data to present and graph that I never get around to finishing it :(
> If you look at some of the stats people are posting, it's the
> compression that takes the most time, not the decompression
If you have to wait minutes to download the files, it doesn't matter; but if you already have the file locally, it is very annoying.
Also, if this is used extensively on projects with large server deployments, it matters even more for latency and energy consumption. That's why Google has their own compression algorithms derived from BMDiff and LZ77 (Zippy). Think about it: speed matters.
Looking good on Ubuntu 10.04.
The same folder compressed to half the size it gets with tar cvfz. Sick.
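For anyone who wants to try it, recent GNU tar (1.22+) has -J for xz, so the comparison is just (archive names made up):

$ tar cJf folder.tar.xz folder/    # xz
$ tar czf folder.tar.gz folder/    # gzip
$ ls -lh folder.tar.xz folder.tar.gz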
lzma is deprecated by its author in favor of xz.
$ dpkg -s xz-utils|grep Replaces
$ dpkg -S $(which lzma)