Stop using gzip (imoverclocked.blogspot.com)
432 points by imoverclocked on Dec 12, 2015 | 255 comments

The trouble with this is that, as a software author, it doesn't really matter if it takes 70 seconds instead of 33 to install my software. 70 seconds is fast enough, for someone who's already decided to start downloading something as involved as Meteor; even if it took one second it wouldn't get me more users. And it would have to take over 5-10 minutes before I start losing users.

On the other hand, having to deal with support requests from users who don't have any decompressor other than gzip will cost me both users and my time. Some complicated "download this one if you have xz" or "here's how to install xz-utils on Debian, on RHEL, on ..." will definitely cost me users, compared to "if you're on a UNIXish system, run this command".

From a pure programming point of view, sure, xz is better. But there's nothing convincing me to make the engineering decision to adopt it. The practical benefits are unnoticeable, and the practical downsides are concrete.

There are downsides which most young fellas around here don't really appreciate. Hands up, how many of you remember bzip? Not bz2, but bzip, the original one, mostly seen as "version 0.21"?

Well, at the time it was released, people were making much the same arguments (with kittens). It compressed so much better than gzip, no reason to use the obsolete gzip format and tools, etc. And some of us jumped on the hype bandwagon and started recompressing our data, only to find out afterwards that bzip2 is now the new thing and the format is not only obsolete, but also patent-encumbered and in general needs to be phased out.

From a long-term perspective I'm fine with gzip. At least I know that I'll be able to open my data in 10 years time, which is not the case with bzip-0.21. The jury is still out on "xz", in my opinion.

Agreed, and I think people don't fully appreciate it because it doesn't play out as a safety question the way you describe it. The question is not "is X safe and Y unsafe"; each has to be judged on its own. Is X safe, and will it still work on old systems or in ten years? Yes, 100%. Is Y safe, and will it still work in ten years? Well, some old systems might have issues, and the patents might still exist, and...

If you're a developer, gzip is simply the best option. It's not the best compressor, but it's good enough and it's safe.

I had to explain the same thing to an engineering team the other day. There was a push to switch from a "fast but ok" compression algorithm to a "faster and better" compression algorithm. This seemed like win-win, but I explained that:

* The faster compression made a difference of about 100 milliseconds to a user experience lasting minutes.

* The better compression made a difference of about 1 second for most users.

* The change of compression algorithm would take time away from engineering teams, and ultimately introduce bugs.

So in the end it was sidelined until other fundamental changes (file format etc) made it able to be coat-tailed into production.

Same story as yours: don't fix what ain't broke. Tallest working radio antenna in the world -> longest lying broken antenna in the world.

We started packaging things as .tar.xz for a while, but a large number of users were having terrible trouble opening the file.

Mostly Mac, but iirc some Linux users were confused as well. We switched back to gzip because everyone knows how to use it.

Interestingly, it's opened by the same command on both OSes -

    tar -xf some_file.tar.xz 
    tar -xf some_file.tar.gz

I guess a lot of people are or were used to:

  tar -xzf file.tar.gz
To be honest, I only started doing -xf a year ago. I was used to -xjf and -xzf.

tar flags must be a prime example of cargo culting. Hands up everyone who's done or seen someone do

  tar -xzvf file.tgz > /dev/null

Never seen that in my life until now. What does it do? Just unzip the file to /dev/null? What's the purpose? Does the verbose flag show you what's inside, while /dev/null means it's not written to disk while unzipping?

Actually, the files are decompressed to the current directory; it's just the output of the verbose flag that goes to /dev/null. Which makes it even more senseless.

Exactly. I've seen people who always do `tar xzvf` and have no idea removing the `v` is the correct way to make it not print the name of every file in the archive.

I use the -v switch since I want to see what I decompress; however, I didn't know that I could omit the z switch.

You didn't use to be able to omit the 'z' switch. You had to specify 'z' or 'j' depending on whether you wanted gzip or bzip2 decompression. It's a somewhat recent change to tar (sometime in the last 15 years, I think) to make it just detect the compression algorithm.

Isn't it better though to omit the -v switch and do `ls *` and/or `tree` afterwards? That gives you the same information but structured so it's much easier to understand.

The advantage of -v is that you can see what is being extracted as it happens. This is useful if you have a tarball with thousands of small files, as otherwise it's hard to tell whether tar has got stuck or there are just a lot of files.
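For that case there's also a middle ground, at least in GNU tar (a sketch; not portable to other tar implementations): the --checkpoint option prints coarse progress without the per-file spam of -v.

```shell
# build a throwaway archive with many files (GNU tar assumed)
mkdir -p src
for i in $(seq 1 50); do touch "src/f$i"; done
tar -cf many.tar src
rm -rf src
# print one dot per 10 archive records processed, instead of every filename
tar --checkpoint=10 --checkpoint-action=dot -xf many.tar
echo   # newline after the dots
```

With thousands of files you still see that tar is alive, but the terminal isn't flooded.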

I don't understand why you'd do that. tar does not extract to stdout.

Redirecting stdout is to cancel the -v flag (verbose, lists every file extracted).

If you don't explicitly specify (de)compression method (z for gzip, J for xz), tar will try to guess it.

That's new. Tar didn't use to behave that way, and plenty of people got used to specifying flags.

And many implementations don't behave that way.

That's gnu tar. I don't think that works on BSD.

I'm fairly sure bsdtar had automatic compression detection before GNU tar did. Been the default on FreeBSD since version 5.3 (2004).

Also works with non-tar formats like Zip, RAR and 7z: https://www.freebsd.org/cgi/man.cgi?query=libarchive-formats...

Correct, on OpenBSD:

    $ tar -xf foo.tar.xz                                      
    tar: End of archive volume 1 reached
    tar: input compressed with xz

    $ tar -xf foo.tar.gz                                      
    tar: End of archive volume 1 reached
    tar: input compressed with gzip; use the -z option to decompress it

Does that _tell_ you what it's compressed with but then not decompress it? That's the absolute worst way to do it!

It tells you how it's compressed and how to decompress it if it knows how. OpenBSD's tar doesn't support xz so it can't help there, but does support gzip so it suggests using -z.

Not letting untrusted input automatically increase the attack surface it's exposed to is a feature.

>Not letting untrusted input automatically increase the attack surface it's exposed to is a feature.

How is that a feature? The user's explicitly asking for this.

This feature reminds me of vim, which suggests closing with ":quit" when you press C-x C-c (i.e. the keychord to close emacs). It knows full well what you want to do and even has special code to handle it, but then insists on handing you more work.

Vim suggests closing with ":quit" when you hit C-c; the C-x is irrelevant.

Upon receiving a C-c, it does not know full well what the user wants to do.

When vim receives a C-c from you (or someone who just stumbled into vim and doesn't know how to exit) the user wants to exit.

When vim receives a C-c from me, it's because I meant to kill the process I spawned from vim, and it ended before the key was pressed. I very much do not want it to quit on me at that point.

Showing a message seems the best compromise.

`tar -xf` is not "explicitly asking" for gzip. `tar -zxf` is "explicitly asking" for gzip.

I don't really care what vim does, that's a different argument. There have been many vulnerabilities in gzip, and in tar implementations that let untrusted input choose how it gets parsed, those vulnerabilities might as well be in tar itself.
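For what it's worth, you can keep the parser choice in your own hands with a pipe (a sketch; the filenames are illustrative): the decompressor is named explicitly and tar only ever sees a plain uncompressed stream.

```shell
set -e
# make a sample gzip tarball to extract
mkdir -p payload && echo "data" > payload/a.txt
tar -czf some_file.tar.gz payload
rm -rf payload
# you choose gunzip; the archive never picks its own decompressor
gunzip -c some_file.tar.gz | tar -xf -
ls payload/a.txt
```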

The same applies to OGG audio. Better than MP3 but the average user is unable to play it. So everybody just sticks to MP3.

MP3s with a decent bit rate are as good as it gets. Of course something like Ogg back in the Napster days would have been fantastic, but MP3 at 320 kbps is fine for anyone who doesn't pay $1000 a meter for speaker wire.

But MP3 was patent encumbered and so a bunch of music creation software had weird work-arounds.

I am not sure what your point is? Yes, MP3 was not the right choice years ago, but today, who cares?

If I recall correctly, there are some patents in the U.S. that remain alive until 2017 covering MP3 encoders, requiring purchasing a license per copy distributed.

The workaround that the parent is talking about is usually "get LAME from a different distributor", which is still done by Audacity and others.

That's the thing. With better formats, you don't need 320 kbps for transparency.

For those of us that sample from songs that we buy, WAVs are a bit easier to work with because the DAW doesn't have to spend time converting it. That said, since most of my tracks these days are using either the 48k or 96k sample rate, it still needs to be converted from 44.1 :)

But can you tell the difference between a 320 kbps VBR MP3 from a modern encoder and a wav file? I have some reasonable equipment and I most definitely can't.

When I DJed I used a mix of FLAC files and 320 Kbps CBR and high level VBR. On performance equipment I could tell VBR was not holding up. There is also some quality loss that you encounter when slowing down MP3s that is not present for FLAC or WAV, especially when kept in key, but for the most part that is only audible beyond the 8-15% range, and it was not common to alter the tempo that much for me. I ended up settling mostly on FLAC when I can get it and 320 CBR otherwise. I don't think I ever heard the difference.

Can you hear the difference between 320 Kbps CBR and VBR? I have to say I have never tried slowing down the music to try and hear the difference so it is possible under these conditions that it might make a difference.

I would say yes, but again, playing music at very amplified volumes and tweaking its tempo is not a common use-case.

Most people can't. They say they can, but under scrutiny it generally falls apart. At some point, you're listening to sound and not music anyway.

The degree of difference depends on the kind of music you listen to. Live recordings of acoustic ensembles in airy cathedrals -- in that case you can tell the difference. On tracks that have a highly produced studio sound, where everything is an electronic instrument -- not going to be much of a difference.

I tried doing tests like this and I could not find any recordings where I could tell the difference at 160 kbps VBR. I'm not saying that it is impossible, but the conditions must be pretty rare and the difference very minor; compared to the massive degradation that comes from room effects, it amounts to nothing.

> compared to the massive degradation that comes from room effects it amounts to nothing.


Chamber music in an echo-y cathedral. With bad encoding, you can hear a noticeable difference in the length of time the reverberations are audible, and the timbre of those reverberations can be quite different. With lots of acoustic music, the "accidental beauty" produced by such effects can be quite important.

Finding this convinced me to re-encode my music collection in 320kbps MP3 for anything high quality, and algorithmically chosen variable bitrates for lower quality recordings -- usually around 160 kbps. That was quite a number of years ago, though. I'd probably use another format today.

That's not true. MP3 simply never becomes transparent, and you can notice with $5 in-ears. And people in general notice. This leads to absurdities such as bitrates of 320 kbps, even though these do not sound significantly better than 128 kbps and are still not transparent.

On the other hand, 128 kbps AAC is transparent for almost any input. AAC is supported just about everywhere mp3 is. The quality alone should be convincing; the smaller size makes the use of mp3, IMHO, insane.

OTOH "the scene" still does MPEG-2 releases I think.

I have listened to a lot of MP3 at different bit rates, and with modern encoders and variable bit rates I can't tell the difference between anything above 160 kbps; most of the time it is hard to tell the difference between 128 kbps and anything higher. Really, at 320 kbps you are entering the realm of fantasy if you think you can hear any difference.

I absolutely heard a difference between 320 and everything below. You can tell me I didn't, but I did. There is a world of difference between 160Kbps and 256, and 128 is a lot worse. If you can't hear it, I understand, but the blame isn't the algorithm -- it is your equipment, your song selection, or your ears.

This is not true. It is trivial for almost anyone to distinguish 320kbps mp3 from uncompressed audio, with built-in DACs and $5 headphones, with as little as 5 minutes of training.

Like the parent comment, more bold statements about perceivable sound quality, with no evidence.

How can it not be true as I am describing my experience. Are you really telling me that I actually can tell when I say I can't?

You're also describing everyone else's experience:

> Really, at 320 kbps you are entering the realm of fantasy if you think you can hear any difference.

It depends on the encoder, the track, your equipment, and how good you are at picking out artifacts. Some people do surprisingly well in double-blind tests, though I doubt anyone can do it all the time on every sample.

There is no scientific evidence that anyone can do it at all above 192 kbps.

This is why ABX testing is so big in lossy audio circles. People can and do demonstrate their ability to distinguish between lossy and lossless encodings with certain samples in double-blind tests, at all sorts of bitrates. I've done it myself occasionally.

That people have been doing this for many years is one of the big reasons modern encoders are so good - they've needed tonnes of careful tuning to get to this point.

[citation needed]

You are making some bold statements about the general transparency of different audio formats that contradict pretty much everything I've read about this topic so far. Hence, I'd like to learn more, do you have any links that you would recommend?

Well, try it yourself :). Make sure to make it a blind test with the help of somebody else. Ideally such things would be subject to scientific studies, but those are kind of expensive and nobody cares about mp3 anyway. I'm not aware of any recent ones.

Hydrogenaudio listening tests [1] are studies by volunteers, but they focus on non-transparent compression. Anyway, they also illustrate how bad mp3 is.

[1] http://wiki.hydrogenaud.io/index.php?title=Hydrogenaudio_Lis...

This page says at 128 Kbps all the encoders were the same (a 5 way tie). Their wiki says at 192 Kbps MP3 is transparent [1].

1. http://wiki.hydrogenaud.io/index.php?title=Transparency

You realize that that test doesn't test anywhere near 320kbps mp3, right?

Yes, that's what the last two sentences are about.

We have OGG opus these days which is even better than OGG vorbis.

I actually tried this last year, and found out the hard way, after re-encoding my mp3 collection to VBR Opus at around half the bitrate (I did some light quality testing to make sure it was of similar fidelity; of course you lose some quality going lossy -> lossy), that either opusenc or gstreamer at the time would produce choppy, broken audio.

And it was reproducible on all my computers. I couldn't use my opus collection at all because either the encoder was broken or the playback was broken.

I need to do it again at some point, when I have 8 hours to transcode everything and try again. See if they've fixed it.

I'd expect similarly awful results for any lossy -> lossy transcoding.

Don't ever do that. If you didn't rip CDs to lossless or buy lossless, you are stuck with the format you've got.

Using xz for Linux builds of your software might make sense, though, or will in the future. Recent releases of Fedora and RHEL already use xz to compress their RPM packages.

Debian/Ubuntu dpkg supports compressing with xz too -- and it's a hard dependency of dpkg at least as far back as the precise (12.04) LTS release. So I'd say the majority of Linux users already have access to xz.

I just started typing 'xz compression' into Google to learn more about it, and it offered 'xz compression not available', plus some more queries that indicate it's not quite ubiquitous.

It's good to be informed about the capabilities of xz. I will keep using gzip, but consider xz in situations where the size or time matters. I might not care about 100 megs versus 50 very much, but I will about two gigabytes versus one.
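A quick way to see whether the ratio difference matters for your own data is just to compress a sample both ways and compare (a rough sketch; generated text stands in for real data here):

```shell
set -e
# ~1.3 MB of moderately compressible stand-in data
seq 1 200000 > data.txt
gzip -9c data.txt > data.txt.gz
xz -9c data.txt > data.txt.xz
# compare the results; on redundant text like this, xz wins by a wide margin
ls -l data.txt data.txt.gz data.txt.xz
```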

Is it likely that a user has gzip on a system but not tar itself? From the article:

What about tooling? OSX: tar -xf some.tar.xz (WORKS!) Linux: tar -xf some.tar.xz (WORKS!) Windows: ? (No idea, I haven't touched the platform in a while... should WORK!)

Tar does not implement decompression. If you don't have xz installed it won't work.

That doesn't seem correct; on OS X 10.11 `tar xf` can extract `.tar.xz` yet doesn't fork an xz process. AFAIK 10.11 doesn't even come with xz.

tar can link the xz lib without forking.

So tar does "implement decompression" (and compression, by delegating the work to libarchive) and it can work even "if you don't have xz installed".

It would require liblzma, but you are correct that the library is a separate thing from the executable xz.

> It would require liblzma

Yep, in the same way it requires libz and libbz2.

IIRC bsdtar (e.g. on OS X) includes xz.

On the vast majority of Linux distributions, you can pretty much guarantee that both tar and zlib will be installed.

Both tend to be part of an essential core of packages required to install a system.

Pretty sure tar -xf does not actually work on osx unless you download a recent tar.

I have tar that came with the system (latest OSX) and tar -xf works just fine. And it did work fine for as long as I can remember.

It does, and should have since at least 10.6 (I can find references to 10.6's tar being built on libarchive and that's one of libarchive's headline features; 10.5 predates libarchive so it may not have supported that)

I'd argue that bzip2 is a better example of a compression algorithm which no one needs anymore.

Considering these features:

  * Compression ratio
  * Compression speed
  * Decompression speed
  * Ubiquity
And considering these methods:

  * lzop
  * gzip
  * bzip2
  * xz
You get spectrums like this:

  * Ratio:    (worse) lzop  gzip bzip2  xz  (better)
  * C.Speed:  (worse) bzip2  xz  gzip  lzop (better)
  * D.Speed:  (worse) bzip2  xz  gzip  lzop (better)
  * Ubiquity: (worse) lzop   xz  bzip2 gzip (better)
So, xz, lzop, and gzip are all the "best" at something. Bzip2 isn't the best at anything anymore.

You can easily apply the same argument to xz here, by introducing something rarer with an even better compression ratio (e.g. zpaq6+). Now xz isn't the best at anything either.

But despite zpaq being public domain, few people have heard of it and the debian package is ancient, and so the ubiquity argument really does count for something after all.

"This package has been orphaned, but someone intends to maintain it. Please see bug number #777123 for more information"

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=777123 - get in touch with the new owner of the package if you're interested. It's probably on their Never Ending Open Source To Do List.

No, xz (on a particular level setting) is both faster than bzip2 and provides better compression ratio, but zpaq is just slower.

>No, xz (on a particular level setting) is both faster than bzip2 and provides better compression ratio, but zpaq is just slower.

Are you implying that xz out-compresses zpaq? Can you supply a benchmark?

Here's one from me - http://mattmahoney.net/dc/text.html - showing a very significant compression ratio advantage to zpaq.

No, where did you find this implication in my comment? What I meant is that:

* xz is faster than bzip2 and provides better compression ratio [than bzip2]

* zpaq is slower [than bzip2 and provides better compression ratio than bzip2]

But it looks like I'm mistaken? It seems like it can be faster and give a better compression ratio than bzip2, can't it?

> So, xz, lzop, and gzip are all the "best" at something. Bzip2 isn't the best at anything anymore.

Two points:

(1) It's very, very easy for the best solution to a problem not to simultaneously be the best along any single dimension. If you see a spectrum where each dimension has a unique #1 and all the #2s are the same thing, that #2 solution is pretty likely to be the best of all the solutions. Your hypothetical example does actually make a compelling argument that bzip2 is useless, but that's not because it doesn't come in #1 anywhere; it's because it comes in behind xz everywhere. (Except ubiquity, but that's likely to change pretty quickly in the face of total obsolescence.)

(2) lzop, in your example, is technically "the best at something". But that something is compression and decompression speed, and if your only goal is to optimize those you can do much better by not using lzop (0 milliseconds to compress and decompress!). So that's actually a terrible hypothetical result for lzop.

Heck, zero compression easily wins three of your four categories.

Zero compression is very often the correct choice these days.

No, even when speed matters, sometimes lz4 is the best answer. I wrote a data sync that worked over a 100 Mbps WAN, and using lz4 on the serialised data transferred far faster than the raw data. And not just on the network: you can often process data faster too (especially on spinning disk), since the reduction in disk I/O can in some cases actually make the processing faster.

Being second-best on ratio and ubiquity is still pretty handy for serving files. It's compress-once, decompress on somebody else's machine, so neither of those matter. Ratio saves you money and ubiquity means people can actually use the file.

> It's compress-once, decompress on somebody else's machine, so neither of those matter.

Last week, there was a drive mount that was filling up, rate was roughly 30Gb/hr. The contents of that mount was used by the web application. Deletion was not an option. Something that compressed quickly was needed. And on the retrieval end, when the web app needs to do decompression, seconds matter.

I found lz4 to be the best for general purpose analysis, it increased the throughput of my processing 10x compared to bz2. Then if you're working with very large files you can use the splittable version of lz4, 4mc, which also works as a Hadoop InputFormat. I just wish they would switch the Common Crawl archives to lz4.

I should probably mention the compression ratio was slightly worse than bz2 (maybe 15% larger archive) but for the 10x increase in throughput I didn't really mind that much. I could actually analyze my data from my laptop!

If I'm actually doing something with my data, gzip -1 beats out lz4 for streaming, as gzip -1 can usually keep up with the slower of the in/out sides, and gzip -1 is higher compression ratio than lz4 and faster compression (but not decompression) than lz4hc.

I just tested this on my laptop, I used the first 5 million JSONLines of /u/stuck_in_the_matrix reddit dataset (~4.6GB).

For compression lz4 took ~22 seconds (~210 MB/s) and I got ~30% compression, gzip -1 took ~56 seconds (~80 MB/s) and I got ~22% compression.

For decompression lz4 gave me 500MB/s while gunzip gave me 300MB/s.

Commands used:

    lz4 -cd RS_full_corpus.lz4 | pv | head -5000000 | gzip -1 > test.gz

    gunzip -c test.gz | pv > /dev/null

    lz4 -cd RS_full_corpus.lz4 | pv | head -5000000 | lz4

    lz4 -cd stdin.lz4 | pv >/dev/null

Interesting; on a mix of source and binaries (archived fully-built checkouts) gzip -1 outperformed lz4 in compression ratio.

No you're correct, gzip -1 outperformed lz4 in my test in compression ratio. I don't know why I typed "30% compression" instead of "compression ratio of 30%." Sorry about that.

FYI, this cool little project — https://code.google.com/p/miniz/ — implements faster gzip compression at level 1.

Last time I checked, lz4 did not have streaming decompression support in its Python lib. That will be a problem for larger files like Common Crawl if you are not planning to pre-decompress before processing.

It's not a problem for me since I mostly use Java. However, you can probably just pipe in your data from the lz4 CLI then use that InputStream for whatever python parser you're using and you should be fine.

The biggest problem is using a parser that can do 600MB/s streaming parsing. If you use a command line parser don't try jq even with gnu parallel.

Being the best at something does not necessarily make it the best choice for most situations. This is trivially shown through an example. Assume that for each of the four measured aspects there is a program that is the best in that aspect, but in the other three it is orders of magnitude worse than the best. Now consider another program which is best at nothing, but is 95% of the way to best in every aspect. It's never best in any aspect, but it's clearly a good choice for many, if not most, situations.

Doesn't bzip2 have a concurrent mode that those others don't?

bzip2 can take advantage of any number of CPU cores when compressing.

Bzip2 doesn't handle multiple cores as far as I'm aware, but tools such as pbzip2 can. I wrote about this some time ago: https://hackercodex.com/guide/parallel-bzip-compression/

That said, parallel XZ is even better: https://github.com/vasi/pixz

I don't believe that's true, though there are multiple projects that offer that feature. lbzip2.org and compression.ca/pbzip2 to name a couple.

Really? How? My bzip2 has no option for it and when tested it stuck to one CPU. xz on the other hand has

  -T, --threads=NUM   use at most NUM threads; the default is 1; set to 0
                      to use the number of processor cores
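With a new enough xz (5.2 or later), usage is as simple as this sketch; earlier builds accept -T but silently ignore it:

```shell
set -e
seq 1 100000 > big.txt
# -T0 = one thread per core on xz >= 5.2; -k keeps the original file
xz -T0 -k big.txt
# verify the archive's integrity
xz -t big.txt.xz && echo "archive OK"
```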

Side note: `xz` only got -T option in stable releases less than 12 months ago (5.2.0 in 2014-12-21), so it hasn't made it into every distro yet.

The xz installed on my system carries a rather promising -T option, but then this text below it

> Multithreaded compression and decompression are not implemented yet, so this option has no effect for now.

I believe you're looking for pbzip2, a parallel bzip2 file compressor. It's become my go-to compression tool.

lbzip2 is pretty good

pbzip2 output isn't universally readable by third-party bz2 decompressors (Hadoop, for example).

If you'd included the "zip" format in your analysis, gzip would not be the best at something anymore.

I use bzip2 purely for sentimental reasons.

One of the great things about gz archives is that the --rsyncable flag can be used to create archives that can be rsynced efficiently if they change only slightly, such as sqldumps and logfiles. Basically the file is cut into a bunch of chunks, and each chunk is compressed independently of the other chunks. xz doesn't seem to have an equivalent feature because the standard implementation isn't deterministic[1].

Changing from one compression format to another seems harmless, but it always pays to think carefully about the implications.

[1]: https://www.freebsd.org/cgi/man.cgi?query=xz&sektion=1&manpa...
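A sketch of the workflow, hedged because --rsyncable availability depends on your gzip build (originally a Debian patch; newer upstream gzip has it too), so this falls back to plain gzip:

```shell
set -e
seq 1 100000 > dump.sql   # stand-in for a sqldump
# use --rsyncable if this gzip supports it; plain gzip otherwise
gzip -c --rsyncable dump.sql > dump.sql.gz 2>/dev/null \
  || gzip -c dump.sql > dump.sql.gz
# either way the output is a normal gzip stream any gunzip can read
gunzip -c dump.sql.gz | cmp - dump.sql && echo "round-trip OK"
```

The point is that --rsyncable only changes where chunk boundaries fall; readers don't need to know it was used.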

The `--rsyncable` patch never got upstreamed, and in recent debian the feature is totally broken (rsync needs to transmit ~100% of the file again).

`pigz` has a similar flag that works reliably, though.

Are you referring to https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=708423 ? If so, it was fixed in December 2013 and doesn't affect the current Debian release (Jessie), although it does affect the previous release (Wheezy) and there is an open request to backport the fix https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=781496

Yep. I wrote a post about it, also doing some comparisons: http://www.aktau.be/2014/10/23/pg-dump-and-pigz-easy-rsyncab...

There are many more concerns to address than just compression ratio. Even the ratio one is questionable, because some people have really fast networks but we all have basically the same speed of computers. So a 4x CPU time and memory pressure penalty may be much worse on a system than a 2x stream size increase. Another use case is a tiny VM instance: half a gigabyte of RAM is not actually present in every machine today. Embedded, too.

Another way compression formats can win you much more than a 2x space reduction is by supporting random access within their contained files. Gzip sort of supports this if you work hard at it. Xz and bzip2 appear similar (though the details are different). I achieved a 50x speedup with this in real applications, and discussed it a bit here: http://stackoverflow.com/questions/429987/compression-format...
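The "work hard at it" version for gzip, sketched: compress chunks as independent gzip members, concatenate them into one file, and keep an offset index on the side; you can then decompress one chunk without touching the rest. (The filenames and chunk sizes here are illustrative.)

```shell
set -e
# two chunks of sample data
printf 'A%.0s' $(seq 1 50000) > part0
printf 'B%.0s' $(seq 1 50000) > part1
# each chunk becomes an independent gzip member in one concatenated file
gzip -c part0 > archive.gz
off=$(wc -c < archive.gz)        # byte offset where member 2 starts
gzip -c part1 >> archive.gz
# random access: skip member 1 entirely and decompress only member 2
tail -c +"$((off + 1))" archive.gz | gunzip > recovered
cmp recovered part1 && echo "chunk 2 read without touching chunk 1"
```

A plain `gunzip archive.gz` still works on the whole thing, since gzip decompressors accept concatenated members.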

Agreed on usefulness of random access. Here's a couple more links on seekable gzip compression:



Thanks for the random access discussion!

And you are right for embedded! .xz just doesn't work there.

I've also found that on the faster systems, for different uses of mine, when I want the compression to last as little as possible and the total round trip time matters (compression and decompression), gzip -1 gives the best resulting size for the reasonably short time I want to spend.

I've come across quite a lot of firmware on embedded Linux devices that uses LZMA (the xz compression algorithm) to compress the kernel, u-boot, and/or filesystems. One memory optimisation for these, as they are typically being decompressed straight into RAM, is for the decompressor to refer to its output as the dictionary rather than building a separate one, as would be the case in decompressing to the network or disk.

> So a 4x CPU time and memory pressure penalty may be much worse on a system than a 2x stream size increase.

If it's being downloaded once

Even then it depends.

If it takes you 60 seconds to download as gz and 50 as xz, the decompression needs to take less than 10 seconds more for it to be comparable, and you've got to be sure that your end users have enough memory and sufficient processing power to throw at the task.

He didn't mention the biggest difference between gzip and xz - ram usage. At maximum compression, you need 674 MiB free to make a .xz file, and 65 MiB to decompress it again. That's not much on most modern systems, but it's quite a lot on smaller embedded systems.

Admittedly, in most cases, that isn't much excuse though.

It can also lead to disaster on a web server when linux decides to OOM kill a critical part of the infrastructure like the database server or memcached. Then you can get a cascading problem of services failing, all because of a careless unzip statement. (I've been there.)

You can set exclusions for the OOM killer to prevent this. See:
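The knob involved is /proc/&lt;pid&gt;/oom_score_adj (Linux-specific; a sketch, and the database pid below is a hypothetical placeholder):

```shell
# raising a process's score (more likely to be killed first) needs no
# privileges; this marks the current shell as a preferred OOM victim:
echo 300 > "/proc/$$/oom_score_adj"
cat "/proc/$$/oom_score_adj"
# protecting a process instead takes root / CAP_SYS_RESOURCE, e.g.:
#   echo -1000 > /proc/<database-pid>/oom_score_adj   # exempt entirely
```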


...or just use gzip and get 90% of the value, minus the high probability that that setting will be fubar at an inconvenient time.

This is why gzip rocks, the relatively low memory usage, in particular for compressing things on the fly.

Yeah, I use gzip for log processing tasks: zcat input.gz | log-processor | gzip > output.gz.

Summary: compatibility and decompression speed are more important than compression ratio for many use cases. Gzip is nearly universal, where lz4, xz, and parallel bzip2 are not.

The challenge of sharing internet-wide scan data has unearthed a few issues with creating and processing large datasets.

The IC12 project[1] used zpaq, which ended up compressing to almost half the size of gzip. The downside is that it took nearly two weeks and 16 cores to convert the zpaq data to a format other tools could use.

The Critical.IO project[2] used pbzip2, which worked amazingly well, except when processing the data with Java-based tool chains (Hadoop, etc). The Java BZ2 libraries had trouble with the parallel version of bzip2.

We chose gzip with Project Sonar[3], and although the compression isn't great, it was widely compatible with the tools people used to crunch the data, and we get parallel compression/decompression via pigz.

In the latest example, the Censys.io[4] project switched to LZ4 and threw data processing compatibility to the wind (in favor of bandwidth and a hosted search engine).


1. http://internetcensus2012.bitbucket.org/images.html
2. https://scans.io/study/sonar.cio
3. https://sonar.labs.rapid7.com/
4. https://censys.io/

Me, I wish people would stop using RAR. It's proprietary and doesn't have a real compression advantage vs. e.g., 7-Zip, bzip2, or xz.

For anyone looking to stop making compromises, I recommend pixz. It's binary compatible with xz, and is better at compression speed, decompression speed, and ratio than both gzip and xz on multicore systems. I've adopted it in production to great benefit.

Totally agree with this. As someone with a commit bit to the project, as well as a long-time user, I'd like to second the recommendation. Pixz is a terrific parallel XZ compression/expansion tool. I find it indispensable for logs and database backups. Link: https://github.com/vasi/pixz

Fish shell users can take advantage of the Extract and Compress plugins I wrote, which utilize Pixz if installed: https://github.com/justinmayer/tackle/tree/master/plugins/ex...

How is the memory consumption?

Being a Windows user these days, I am getting kind of frustrated with how little effort everyone puts into even googling for 20 seconds to find the Windows solution.

7zip is the program you want to handle most everything, with both gui and command line options: http://www.7-zip.org/

Given how radically MS is trying to reform itself to be an open-source friendly company and how ineffectually inoffensive they've been the last 5 years, can we at least try and throw them a bone or two?

The article is not talking about Windows, so folks here aren't either. Why are you surprised that folks are uninterested in Windows?

I've preferred Mac systems for longer than most of the HN crowd has been alive, so I understand what it's like to feel ignored and in the minority. For years, Mac users were treated as pariahs. The tables have turned, and as someone who has been in your situation, I should have great empathy for your predicament.

And I do, but your tone in general -- and your last sentence in particular -- makes it very hard to empathize. Microsoft used its near-monopoly status to stifle innovation for years, and many of us have figurative scars that will never heal. You seem to think that they are making great strides (while I see them as half-hearted overtures), but either way I'm not about to "throw them a bone." Their decades of misdeeds, in my eyes, will not be expiated so easily.

Perhaps Microsoft will someday be worthy of forgiveness, either from the perspective of morality (e.g., Mozilla) or product excellence (e.g., Apple). Until that day, Microsoft will continue to reap what they have sown, given no more attention than they have earned.

> And I do, but your tone in general -- and your last sentence in particular -- makes it very hard to empathize. Microsoft used its near-monopoly status to stifle innovation for years, and many of us have figurative scars that will never heal. You seem to think that they are making great strides (while I see them as half-hearted overtures), but either way I'm not about to "throw them a bone." Their decades of misdeeds, in my eyes, will not be expiated so easily.

I feel like Apple has forgotten what was important, created then ruined a market, and lost everything that made it interesting (long before Steve Jobs passed away, by the way). Which cuts all the more deeply because back in the early 2000s they were walking the walk and taking a lot from NeXT's culture of developer friendliness. I grew up deeply invested in Macs and NeXT, which makes the realization painful, but... Apple wants to annihilate maker culture as it monetizes its platform. It's also stopped caring about design on a grand scale, instead appealing to very shallow notions of "visual simplicity".

That's all gone now, and they're consequently useless to me. I'd rather patronize a company currently doing the right thing after a troubled past than pretend a previously aligned company was still there.

It should be very telling that Apple AND Google's flagship hardware announcement of 2015 was something that Microsoft has been doing for years.

And if Microsoft suddenly goes evil again? Fuck them, I'll drop them and move somewhere else. Not Linux, unless the distros pull their act together, but I'm sure a competitor will emerge. Or I'll make one.

You make some good points here, which I completely understand. Hopefully the future will bring better options for us all.

I know your feeling too, thanks for accepting I feel differently. And sometimes I ask myself "What the hell am I doing with this Surface book?" I won't pretend I don't have doubts.

It's sort of a rough time for devs right now even as we enjoy unprecedented prosperity and recognition. Big businesses are attempting to monetize and control every aspect of developers.

"...decades of misdeeds..." Which they benefited handsomely from and were never sufficiently punished. I'm with you, my trust of MS is still pretty close to nil.

GP seems to be talking about this line in the article:

> Windows: ? (No idea, I haven't touched the platform in a while... should WORK!)

I do think the author missed the ball in not doing basic research for the Windows platform. His point is that people should switch from one compression tool to another. If people on Windows were unable to compress or decompress such files, that would be a huge problem for his argument.

That said, I don't agree with the GP's tangent.

How about Microsoft offering basic utility software with their OS releases, e.g. curl/ssh/awk/grep/sed/tar etc.

They do, although that doesn't include tar. Powershell does all of that.

Powershell is very capable tool but unfortunately not very interoperable.

What do you mean? It seems to me like an extremely interoperable tool, as it can invoke arbitrary .net code.

... not very interoperable outside MS/Windows ecosystem.

It can invoke arbitrary executables as well, which makes it exactly as interoperable as Bash.

Powershell can run on Linux, too. I've even met a few people who quietly prefer it.

So what, exactly, were you referring to? Shell choices are like editor choices: arbitrary and largely equivalent and without any real meaning or impact on a developer's productivity.

The context was: someone had difficulties to find "tar -xf" equivalent for Windows. I pointed out that it would be nice if Microsoft included tar etc. basic utilities in their OS releases. With out-of-the-box Windows machine, you cannot ssh, you cannot untar, etc. Windows-way of doing things is totally different than *nix culture (OS X, Linux, etc). In that context powershell is not "interoperable" (maybe bad wording from me).

> The context was: someone had difficulties to find "tar -xf" equivalent for Windows.

Right and putting that in google, "tar equivalent for windows", immediately nets 5 useful results. You can use tar, or a windows command line variant of 7z or tar, or a gui.

> With out-of-the-box Windows machine, you cannot ssh, you cannot untar, etc.

On an out-of-the-box Linux machine, you generally can't do a lot of things either. It seems particularly ironic that in a discussion about how we shouldn't be using old UNIX tools just because they're entrenched, you then call for compatibility.

> Windows-way of doing things is totally different than *nix culture (OS X, Linux, etc).

Stupid legacy path limits not included, Powershell is in my experience just a superior way to do things. I should maybe restart my blog to talk about that.

But even if we ignore Powershell and windows, your statement is divisive within the Linux community. MANY people prefer shells on Linux that don't adhere to the bash legacy. TCSH and CSH are very popular, to this day. Are they 'not interoperable'?

Everyone's got a big chip on their shoulder about how development tooling "should be." One of the things I've come to realize is how arbitrary, unnecessary, and useless these mores are. They just hold us back.

> On an out-of-the-box Linux machine, you generally can't do a lot of things either.

Minimal distros aside, you must be kidding.

No. I'm not. But minimal distros are the primary surface area Linux exposes for many people these days. The desktop userbase is (justifiably) almost non-existent, and most core cloud distros don't even come loaded with curl by default. It's even more extreme as you work with docker.

I really don't see what that has to do with Microsoft. 7-zip is GPL and cross platform. In fact - I just checked - a command line 7-zip is available by default on my Linux Mint install.

Right, but an xz-supporting compression tool is shipped with many Linux distros, whereas you do need to grab a tool to support xz on any version of Windows.

But you're right, 7z is an under-appreciated format.

I have had awful experiences with 7-zip. Recently I was unable to decompress a multi-file rar, it kept saying it was corrupted. I downloaded it several times until I realised it was actually fine, because winrar could decompress it.

I sorta view RAR files as the problem. Only WinRAR ever gets them really right. I dunno why that is, but I also dunno why anyone keeps using RAR. It's something of a joke in the windows dev community.

Don't you have to seek through the entire tar file to find a file? You see, that is a joke.

When I owned an Amiga they kept on changing the archive format to find a better one that saved space.

They had arc, pak, zip, zoo, warp, lharc, and every Amiga BBS I got on used a different archive format. Everyone had a different opinion on which archive format compressed things in the best way.

I think eventually they decided on lharc when they started to put PD and shareware files on the Internet.

Tar.gz is used because there are instructions for it everywhere and it seems like a majority of free and open source projects archive in it. It is a more popular format than the others right now. Might be because it is an older format and had more ports of it done.

But I really like 7zip, it seems to compress smaller archives, before 7Zip I used to use RAR but WinRAR wasn't open source and 7Zip is so I switched.

With high-speed Internet it doesn't seem to matter much anymore unless the file is over a gigabyte in size. Even then BitTorrent can be used to download the large files. I think BitTorrent has some sort of compression included with it, if I am not mistaken: compressing packets to smaller sizes over the torrent network and restoring them when the client downloads them. That is, if compression is turned on and both clients support it.

> When I owned an Amiga they kept on changing the archive format to find a better one that saved space.

It happened on DOS too: ZIP, ARJ, RAR, ...

That was back on the days of floppy disks (which usually had at most 1440 KiB) and small hard disks (a few tens of megabytes). Even a few kilobytes could make a huge difference.

As storage and transfer speeds grew, "wasting" a few kilobytes is no longer that much of an issue, and other considerations like compatibility become more important. Furthermore, many new file formats have their own internal compression, so compressing them again gains almost nothing regardless of the compression algorithm.

The reason both ZIP and GZIP became ubiquitous is, IMO, that the compression algorithm both use (DEFLATE) was released as guaranteed to be patent-free, back in a time where IIRC most of the alternatives were either patented or compressed worse. As a consequence, everything that needed a lossless compression method chose DEFLATE (examples: the HTTP protocol, the PNG file format, and so on).

LHA and LZX were the popular pair when my Amiga days came to an end. Being a commercial product the latter kind of occupied a similar position RAR does on Windows.

Microsoft ended up adopting LZX for things like CAB and CHM files.

> So, who does use xz?

Arch Linux started using lzma2 compression for their packages nearly 6 years ago!


It's very common to see xz files in Gentoo as well.

    ls /usr/portage/distfiles/      \
      | sed 's/.*[.]//g'            \
      | sort | uniq -c | sort -n -r \
      | head -n 6

       3377 gz
       3051 xz
       1656 bz2
        295 zip
        194 tgz
        107 jar

Slackware's official packages are compressed with XZ, and it's been that way for a while, too. :)

* MikTeX (Windows TeX/LaTeX system) started using it circa 2007.
* TeX Live switched, IIRC, circa 2008.
* Arch Linux started circa 2010.
* Linux kernel (2.6.38), IIRC, circa 2011(?)
* GNOME switched circa 2011.

There are others but I can't remember. It's fairly common now.

    OSX: tar -xf some.tar.xz (WORKS!)
    Linux: tar -xf some.tar.xz (WORKS!)
I had no idea tar could autodetect compression when extracting. (I wonder if this is GNU tar only, or whether the OSX default tar can do it too?) I've been typing `tar zx` or `tar jx` for too long.
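A quick round trip makes the autodetection easy to verify (a throwaway sketch; `demo` is just a scratch directory name):

```shell
# Verify tar's compression autodetection with a throwaway archive
mkdir -p demo && echo "hello" > demo/file.txt
tar -czf demo.tar.gz demo      # create a gzip-compressed tarball
rm -rf demo
tar -xf demo.tar.gz            # no -z flag: tar sniffs the gzip magic bytes
cat demo/file.txt              # prints "hello"
```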

I highly recommend using atool[1], and never worrying about extracting archives again. It's a wrapper around basically every compression/archive tool in remotely common use.

Bonus: it decompresses to a safely-named subdirectory, but moves the contents of that subdirectory back to the current directory if the archive contained exactly one file. Highly convenient without any risk of accidentally expanding 1000 files into the current directory.

After creating this macro, I've basically never had to care about how to decompress/unarchive anything.

    # 'x' for 'eXpand'
    alias xx='command atool -x'

    # use
    % cd $UNPACK_DIR    # (optional) (can be the PARENT dir)
    % xx foo.zip        # or .tar.{gz,xz} or whatever
    foo.zip: extracted to `foo' (multiple files in root)
    % cd foo/
    % ls | wc -l 
atool actually has many other useful features, but it's worth it just for the extractor.

[1] http://www.nongnu.org/atool/

For OS X and MacPorts I use the p7zip package, which installs the 7z command. It understands .zip, .rar, of course .7z, and probably other formats. I wrote a simple Automator script to use that command from Finder and it works just fine (actually .zip and .tar.gz formats are supported by the OS X archive tool, but .rar is not, and I often have to deal with that format).

Looks awesome by description but last release is from 2012 which makes me worried slightly.

Old doesn't mean bad. Sometimes, it means "finished".

About the only thing a tool this small would need updated at this point is support for a new compressor (that 2012 release mainly added support for plzip).

There are usually compatibility issues between the Linux zip utility and the Mac one. It has to do with zip not being backward compatible.

How does atool deal with cases where there are two versions of the same extractor?

I'm not familiar with that issue, but you can set the path to any extractor in /etc/atool.conf or ~/.atoolrc

    # ~/.atoolrc
    path_zip /path/to/preferred/bin/zip
See atool(1) for details. http://linux.die.net/man/1/atool

As far as I know, different zip formats are not auto-detected. However, it does (optionally) use file(1) to detect the file format, which can be overridden with the 'path_file' option, so a hack may be possible?
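Since file(1) does the sniffing, you can see what atool would have to work with (a sketch that assumes the gzip and xz binaries are available; exact MIME strings vary slightly between file versions):

```shell
# What file(1) reports for the two container formats discussed here
printf 'some data\n' | gzip > sample.gz
printf 'some data\n' | xz   > sample.xz
file -b --mime-type sample.gz   # application/gzip (application/x-gzip on older file versions)
file -b --mime-type sample.xz   # application/x-xz
```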

Wow, really? Honestly I've never used the z or j flags, I could not even tell you what they do. I use `tar caf whatever.(txz|tgz|tar.lz|tar.lzo) /path/to/files` to create, and just `tar xf whatever` to extract.

Some of us have been using Unix tools a very long time. For example, I'm pretty sure I started using tar and gzip in 1995.

I think it's libarchive used by BSD tar that allows this auto detection as well

Best thing about libarchive/bsdtar is, it also handles zip, rar, cpio, iso files and many others. So basically bsdtar xf is what I'm using to extract almost every archive.

bsdtar (the version in OS X) can do it too. Older versions of OS X used to come with "gnutar" as a separate binary, but not recent versions.

I wish lzma (xz) was integrated into the browser and curl as an Accept-Encoding. Would be amazing for us (clara.io), and I am sure a lot of others.

Browsers are getting Brotli which is comparable to xz: https://groups.google.com/a/chromium.org/forum/#!msg/blink-d...

My tests with brotli suggest it is overrated - it is slow and has poor compression ratios compared to xz. It confuses me why it is being pushed so hard...


Eh, Google gave one example where Brotli does well and you gave one where it does poorly; we're not exactly in science territory here.

Brotli does not work well for bhoustons use case, so his original wish stands and your helpful suggestion that he should be able to use Brotli in the near future does unfortunately not fulfill his wish.

Yeah, the example given by GP involves large binary streams. Brotli was designed for small text documents with lots of English words in them, as we often see on the web.

Where does it say that Brotli is for small English text documents? I didn't see anything like that in the draft spec or the Google blog post.

The spec doesn't say much on the subject but has this item in the Purpose section: "Compresses data with a compression ratio comparable to the best currently available general-purpose compression methods and in particular considerably better than the gzip program"

Brotli includes a built-in dictionary that contains a lot of English words, HTML tags, etc. so it will give better compression for that kind of input.

Yes, it has that optimization for short data (though it's not restricted to English), but the PR and specs say it's still meant to be a general-purpose compressor. And it does very well on most types of large data.

It's a real problem in a compressor proposed for general-purpose use when it's shown that a naturally occurring major class of data has this bad performance.

Decompression of xz is slow and quite memory intense. I'd argue that the light memory footprint of gz decompression is better suited to the web, particularly mobile where you need to balance battery v bandwidth.

Compressing xz is relatively slow, and would be expensive for web servers. Maybe something like LZO would be better? (Which OpenVPN uses to compress data in transit.)

I just want to serve large static data assets from AWS CloudFront. :)

I think this is one of those things where the author is pretty much 100% right and it just won't happen. Habits are hard to break and in many cases, the negatives just don't impose a high enough cost to matter.

There are times when I do seriously look for the optimum way to do things like this and then there's most of the time I just want to spend brain cycles on more important problems.

The author is not 100% right, as is always the case with this, it depends on the data you are compressing. Here is a stackexchange with some relevant experiments: https://superuser.com/questions/581035/between-xz-gzip-and-b...

I believe that the biggest driver of using old-school ZIP or GZIP is the fact that everyone knows that everything can decompress these formats. And in a modern world of terabyte disks in every laptop, multicore multi-Ghz CPUs, and megabit bandwidth, it isn't worth the effort of using a format that saves an additional 20% on compressed size at the cost of someone not being able to decompress it.

On typical source trees and mesh data, xz is in the range of twice as good at compressing as gzip. That is very significant, IMO.

That's only really much good if you're in the business of archiving things like that. For most people, source trees are ad-hoc downloads for patch fixes, oddball platform compiles, etc. And then the universality of gzip is better than any marginal space savings from xz.

I mentioned it when it came up on another thread. Compare apples and apples -- use one of the standard corpuses when running benchmarks.

Ian Witten put together the Calgary corpus - https://en.m.wikipedia.org/wiki/Calgary_corpus

Windows users: 7-zip can extract .xz files should you need to (article didn't mention a Windows solution).

Although 7-Zip can't "look through" a .tar.{x,g}z file - browsing a .tar.xz will require fully decompressing the .tar to a temporary location.

Tar isn't great about any of those files either, it's just building a list while decompressing to /dev/null


Isn't that mostly just because of how tar is designed? It's a concatenation of individual files with headers, so you have to decompress the whole thing to get a file list anyway. At which point you might as well save the decompressed tar in temporary.

It's also quite a bit faster than Windows at opening and creating zip (edit: not gzip) files.

You're confusing gzip (.gz) and PKZip (.zip). Windows has no native support for gzip, only PKZip.

Which version of RHEL does "Linux" include? The world isn't all Ubuntu recent releases.

RHEL 6 doesn't include it. So that's most of enterprise deployments...

It's such a shame that so many slow-moving 'enterprises' still have RHEL 6 servers; it's so incredibly outdated. Not only does it limit what they can do, but it negatively affects people's impressions of Linux.

And this is the reason I am still supporting Python 2.6, even though I wish to drop it, drop it hard.

For me, as a Python user, I've found that gzip is currently the only compression format that allows streaming compression/decompression. I don't want to have to store hundreds of gigabytes of data and THEN compress it, rather than compressing it right during file generation. I haven't found any other compression lib that supports this out of the box.

I generally use gzip for everything because it's everywhere and good enough, but xz and bzip also support streaming, in fact anything that tar compresses does afaik.

    > dd if=/dev/urandom bs=1M count=5 | gzip > test
    5+0 records in
    5+0 records out
    5242880 bytes (5.2 MB) copied, 0.438033 s, 12.0 MB/s

    > dd if=/dev/urandom bs=1M count=5 | xz > test
    5+0 records in
    5+0 records out
    5242880 bytes (5.2 MB) copied, 1.52744 s, 3.4 MB/s

    > dd if=/dev/urandom bs=1M count=5 | bzip2 > test
    5+0 records in
    5+0 records out
    5242880 bytes (5.2 MB) copied, 0.804324 s, 6.5 MB/s

The algorithms support streaming, but that doesn't mean implementations in (in this case) Python libraries necessarily do. Although I can't see why they wouldn't: presumably they just wrap the same C libraries everybody else uses, and a streaming interface wrapping another stream (i.e. a file or socket) should feel very natural and easy.

I prefer compressing with gzip because it's on more systems, works well even with low RAM, and enables fast rsync for updates.

Decompressing takes 4 times as long? I wonder if that is slow enough to create a bottleneck in processing. Not everyone uses compression for purely archival purposes. In the genomics field, most sequencing data are gzipped to save disk space. And most programs used to process the sequencing data can take in the gzipped files directly.

There might still be use cases left for gzip, but this article is specifically about software tarballs. And in that case, I have to agree with it.

geezip is fun to say. Until there's a catchy name for "crosszip"/xz/whatever, I think we're preaching to the wrong choir. There's a human element in toolchains. Address it.

it's a shame that algorithm improvements would necessitate a shift away from the name "gzip". It would be better if the intent to compress/decompress was orthogonal to the features of the implementation (compression ratio, speed, split-ability, etc...)

The article misses (but the comments here touch on) that all compression algorithms have a built in obsolescence, even the fancy shiny xz.

It's not the algorithms per se that go obsolete, but their use in specific cases, until all are diminished. Whether lossy or lossless, eventually other technological advancements render them unnecessary.

And it seems that strongest algorithm is usually the earliest to be widely adopted; these are almost never toppled.

Just like .gz, look at MP3 or JPEG -- 'better' alternatives exist, but the next widely adopted step will be to eliminate that compression entirely. The first radio station playout systems were hardware MPEG audio compression, and the next most widespread step was to uncompressed WAVs. Even video pipelines based on uncompressed frames are becoming more widespread. Eventually the complexity and unpredictability of compression is shunned for simplicity.

Read the gzip docs and the focus is around compression of text source code, a key use case at the time but barely considered these days -- tar.gz source archives exist almost only out of habit; they could just as well be tar.

Media codecs are a little different because there is a significant cost to replacing hardware that only supports the old standard. It seems to me that AAC is becoming pretty ubiquitous and probably will be the go-to standard for the next 10-20 years. We are also at a point where the vast majority of users aren't going to notice the difference between a 100kbps MP3 vs AAC stream, which is going to be less than 1% of an HE stream, so there is little incentive to innovate by the players with the deepest pockets. Until network capacity is free (or at least cheap compared to the cost of CPE), pipelines based on uncompressed media are not going to be a thing outside of the production end of things.

For source tarballs, though, there is basically zero cost to switching, since you can download a new compressor in under a minute and can assume that your users are pretty sophisticated. The incentives are similar to media in that the cost of transfer has to be weighed against the cost of CPE, except that since users supply the CPE, the cost is effectively zero and compression will probably always make sense.

gzip is fast, gzip -1 is even faster, gzip has low memory requirements, and gzip is widely adopted. Those are the reasons gzip is still being used, and why gzip has a future. I.e., the gzip "ecosystem" is rich and useful, despite not being the best compressor in terms of compressed size.

P.S. There are gzip-compatible implementations with tiny per-connection encoding memory requirements (< 1KB).
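The speed/ratio dial is easy to see on repetitive input (an illustrative sketch only; exact sizes will vary with the data):

```shell
# Compare gzip's fastest and best settings on repetitive input
yes "the quick brown fox jumps over the lazy dog" | head -n 100000 > input.txt
gzip -1 -c input.txt > fast.gz    # optimize for speed
gzip -9 -c input.txt > small.gz   # optimize for ratio
wc -c fast.gz small.gz            # -9 should produce the smaller file here
```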

tl;dr: xz compresses better but is significantly slower. This isn't the deepest analysis of potential tradeoffs you might be able to find.

A few reasons why gzip is still useful to have around:

* Speed is critical for many applications, and so size can take a backseat when performance is critical or resources are low.

* gzip is basically guaranteed to be available everywhere in utility and library forms.

* Download speeds vary and so the faster your pipe, the less the archive size factor will matter, and the faster-worse compression might win out in other comparisons.

* xz doesn't compress every type of data this much better than gzip. I've dealt with scenarios where the difference is consistently less than 2%, and the extra time xz spends is actually a tremendous waste.

Sure, for package downloads where xz files will be significantly smaller it makes sense to save the bandwidth, time and storage space. But it's not 100% cut and dry.

Ahh, kids. "Let's all start adopting this new thing that has existed for a few years that's slower and uses more ram because we're wasting literally tens of megabytes all the time!"

Tens of megabytes, times the number of people downloading it, can get pretty significant pretty fast. And costly.

Fortunately, the same is not true of time or memory.

On the server, it would increase the amount of time and memory required.

Oh, right, that seems obvious. Wonder why it's not obvious to everyone...

Very obvious.

I don't see any reason the author made the jump from "xz is better than gzip" to "stop using gzip!".

What is the compelling thing here that makes him feel that this is a moral imperative?

Can node-tar untar .tar.xz? This is what Meteor uses to parse tarballs.


Correct me if I'm wrong but there doesn't look to be any kind of compression supported.

Amusingly, "news.ycombinator.com" serves its pages with .gz compression. Even if you send an HTTP header that demands plain text only.

    $ ncat -C --ssl news.ycombinator.com 443 <<EOF
    GET / HTTP/1.1
    host: news.ycombinator.com

    EOF
Works for me, no compression. Maybe you messed something up or maybe there's a non-compliant proxy between you and the rest of the internet?

What if you put in:

  accept-encoding: identity

Uncompressed still.

I asked this question on StackExchange a few months ago http://unix.stackexchange.com/questions/183465/fastest-way-o...

That's my problem with GZIP, in this particular use case anyway.

If I'm sending contents of a website to client in .xz format, will browsers be able to decompress it?


There's also lzip, which apparently uses a similar compression algorithm as xz but is apparently built with partial recovery of corrupted archives in mind (so more useful for long-term archival or backup storage). It's made by the same guy who made ddrescue.

The only people who care about compression ratios are:

(1) People who still use 56k modems to download content

(2) People who host extremely popular downloads and who want to minimize their outbound bandwidth bills

If you're not one of these two, you almost certainly care more about compatibility and compression time than compression ratio. gzip continues to win on both those fronts, and it explains why it's still the most popular compression format other than ZIP (which is a better choice than gzip if you frequently need to extract a single file from a compressed archive).

Until there's a compression tool released that can compress at wire speed (like gzip) and has a significantly better compression ratio, don't expect the landscape to change much.

Aha! Thanks for sharing.

Is xz greppable?

No. Neither is gz. In fact, how could any compression algorithm possibly be greppable?

One example is "Fast and Flexible Word Searching on Compressed Text". A copy is at http://www.cs.uml.edu/~haim/teaching/iws/tirsaa/sources/ACM_... .

> We present a fast compression and decompression technique for natural language texts. The novelties are that (1) decompression of arbitrary portions of the text can be done very efficiently, (2) exact search for words and phrases can be done on the compressed text directly, using any known sequential pattern-matching algorithm, and (3) word-based approximate and extended search can also be done efficiently without any decoding. The compression scheme uses a semistatic word-based model and a Huffman code where the coding alphabet is byte-oriented rather than bit-oriented.

Apparently it is possible, as zgrep can search gzipped files.

The 'xzgrep' script (and related xzdiff, xzless, xzmore scripts) are part of the standard xz package, though they are an optional feature so YMMV between distros.

    ~ $ xx /usr/portage/distfiles/xz-5.0.8.tar.gz
    ~ $ cd xz-5.0.8/
    ~/xz-5.0.8 $ ./configure --help | grep -A1 scripts
      --disable-scripts       do not install the scripts xzdiff, xzgrep, xzless,
                              xzmore, and their symlinks

zgrep simply decompresses all the data and feeds it into regular grep. If the data is indexed in some way, it is possible to do better by not having to look at all the data exhaustively.

It's possible to compress and index a file at the same time, gaining both a size and speed advantage over the original. For example: https://en.wikipedia.org/wiki/FM-index

Probably something like this:

  $ cat zgrep
  #!/bin/sh
  # usage: zgrep FILE GREP_ARGS...
  file=$1; shift
  gzip -cd "$file" | grep "$@"


tar ... tape archive

What about availability? I often find myself having to download and compile (de)compression software because the authors of some other software I need decided to ship it in something other than the standard (.tar.gz), which is available on basically all *nix boxes.


The Weissman score is less than worthless, it "isn't even wrong". This comes up in most HN discussions about compression algorithms and I'm waiting for it to go away.


I think the smiley at the end indicates that it is a joke. After all it was created as a fictional plot device for a comedy.

^ Yes. Clearly people can't take jokes here, already downvoted.

Edit: Even the above got downvoted, and I get reminded again that HN is full of stuck-up, overly sensitive, and humorless engineers. With the exception of this being a great place to find interesting things to read, the community aspect is uninviting and unforgiving.

The same joke is posted on literally every submission about compression. It adds nothing to the discussion, so it gets downvoted.

Comments about being downvoted are also almost always downvoted. The comment guidelines actually warn you that it's a bad idea.


But trivia is still worth knowing about, however tired and overdone, or dated and played-out. Just like, if it's XKCD today, it was The Far Side twenty years ago.



ick, the invocation is `tar tJvf`? Granted, I can alias it but a capital J is just about the worst option letter I can think of.

When unpacking you can leave the compression method flag out, tar will figure out which compression method is used by itself.

And when packing you can use -a, then tar selects which compression method to use based on the filename.
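Both behaviors can be seen in one round trip. A minimal sketch, assuming GNU tar (which provides -a/--auto-compress); gzip is used here only because it's installed everywhere, but a .tar.xz suffix works the same way if xz is present:

```shell
# Pack: -a picks the compressor from the archive name's suffix.
mkdir -p demo && echo hello > demo/file.txt
tar -caf demo.tar.gz demo

# Unpack: no compression flag needed; tar sniffs the gzip magic bytes.
mkdir -p out && tar -xf demo.tar.gz -C out
cat out/demo/file.txt   # -> hello
```

Swap the suffix to .tar.xz (or .tar.bz2) and the same two commands pick the matching tool.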

Just do "tar -xf". It works on pretty much anything.

I can never remember the shortcuts, so I always do:

tar -cvf - "folder" | xz > "out.tar.xz"

GNU tar also has easy-to-remember long options for all the major compression programs: --gzip, --xz, --lzop, --lzma, --lzip.
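For example, spelled out entirely in long options (a sketch assuming GNU tar; --gzip is used here only because gzip is universally installed):

```shell
# --create/--file/--gzip are the long spellings of -c/-f/-z.
mkdir -p src && echo data > src/a.txt
tar --gzip --create --file=src.tar.gz src
tar --list --file=src.tar.gz   # -> src/  src/a.txt
```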

Using lzop got me into that habit. Often, the fastest way to transfer a directory from one machine to another is:

  hostA$ tar c mydir | lzop | socat - tcp-listen:1234
  hostB$ socat tcp:hostA:1234 - | lzop -d | tar x

As the network gets slower and the CPUs get faster, you substitute a more heavyweight compression command.
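Dropping the socat legs makes the swap easy to see locally; a sketch using gzip (chosen only for ubiquity — lzop, xz, etc. slot into the same position in the pipeline):

```shell
# Same pipeline shape, network legs removed, compressor swapped.
mkdir -p mydir restore && echo payload > mydir/data.txt
tar cf - mydir | gzip -9 | gzip -d | tar xf - -C restore
cat restore/mydir/data.txt   # -> payload
```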

Not many single-letter options were available. Tar is kind of like ls that way. At least it's easy to remember for those of us who already learned to use lowercase-j for bzip2.
