
An ode to pack: gzip’s forgotten decompressor - ingve
http://www.vidarholen.net/contents/blog/?p=691
======
kccqzy
I have to say hats off to Paul Eggert, the maintainer who fixed the obscure
bug _and_ added a test case _and_ responded within just a few hours[0]. I
admire his dedication to free software. The good news, at least, is that he
isn’t some poor bloke but rather well-respected and well-paid[1].

[0]: see timestamp in the mailing list message
[http://lists.gnu.org/archive/html/bug-gzip/2017-10/msg00000....](http://lists.gnu.org/archive/html/bug-gzip/2017-10/msg00000.html)

[1]:
[https://transparentcalifornia.com/salaries/2016/university-o...](https://transparentcalifornia.com/salaries/2016/university-of-california/-paul-richard-eggert-/)

------
adrianN
Wouldn't it be a good idea to remove pack support from gzip if it obviously
doesn't get enough eyeballs to detect bugs in a timely manner? Poorly tested
obscure format support is a goldmine for hackers looking for exploitable code.

~~~
gumby
It's valuable to keep legacy decompression around for opening up old files,
lest the data be lost forever.

~~~
awirth
A nice middle ground would be to require explicit invocation of this
unexpected code path via a command-line flag, or enablement through an
environment variable. This would reduce the attack surface for the vast
majority of users while still retaining support (albeit not entirely
backwards compatible with existing scripts) for the folks who want it.
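
Roughly what I have in mind; the flag and variable names below are purely
hypothetical, nothing gzip actually supports today:

    
    
      # hypothetical interface, for illustration only
      $ gzip -d oldfile.z
      gzip: oldfile.z is in pack format; use --enable-pack (or set GZIP_ENABLE_PACK=1) to decompress it
      $ GZIP_ENABLE_PACK=1 gzip -d oldfile.z
    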

> Poorly tested obscure format support is a goldmine for hackers looking for
> exploitable code.

IMO the issue is when these code paths are automatically invoked. This is how
you get gstreamer running 6502 opcodes for a codec no one has ever heard of
when you download a file from the internet.

~~~
vanderZwan
> This is how you get gstreamer running 6502 opcodes for a codec no one has
> ever heard of when you download a file from the internet.

Ok, I'll bite: context?

~~~
lmm
[https://scarybeastsecurity.blogspot.co.uk/2016/11/0day-explo...](https://scarybeastsecurity.blogspot.co.uk/2016/11/0day-exploit-compromising-linux-desktop.html)

------
bbanyc
There's an even more obscure Unix compression tool of the same vintage as
pack. While pack originated in USG Unix (System V and its predecessors), the
early BSD releases contained a program called "compact" which used adaptive
Huffman coding. According to the creator of compress [0], compact had a
comparable compression ratio to pack but took six times as long to run.

The original versions of both can be found on the Unix Tree [1], which I've
generally found to be an incredible resource/time-sink.

[0]
[https://groups.google.com/d/msg/net.sources/fonve4JCDpQ/39Vk...](https://groups.google.com/d/msg/net.sources/fonve4JCDpQ/39VkTFQV7ScJ)

[1] [http://www.tuhs.org/cgi-bin/utree.pl](http://www.tuhs.org/cgi-bin/utree.pl)

~~~
tempay
The article mentions compress but states that pack is its predecessor. It
even gives a comparison of their compression ratios on a kernel release,
showing compress is significantly better:

    
    
      compress: 302 MiB (-61%)
      pack:     548 MiB (-30%)
    

~~~
bbanyc
compact != compress

The easily confused names don't help with the obscurity.

------
flavio81
I loved it.

- A retrospective on how compression methods have advanced; it's quite
interesting to see how much smaller the same file gets across the survey of
the distinct compression methods.

- a nice explanation of Huffman coding

- a mention of code golfing

A quintessential HN article. I think each of the 3 items above is worth 10
hacking nerd bonus points (collect 50 points to get a Devo Energy Dome).

If code golfing is done using Befunge or Brainf***, multiply score by 3.

EDIT: BTW, Brainf____ source code probably compresses nicely with Pack.

~~~
elsurudo
It's okay, we're all adults. You can say Brainfuck :)

------
mnx
I feel like code golfers are always just slightly insane in the best possible
way.

~~~
lucb1e
That resonates with what I think of people making demos. Like, demoscene
demos. Never realized that this is actually a modern variant of the scene.

------
zmix
Exactly the type of post I come here for. Great read!

------
LeoPanthera
It's unfortunate that xz has become the "default" compressor. It has some
issues that make it ill-suited for long term storage. lzip is a similar format
that corrects these issues:

[http://www.nongnu.org/lzip/xz_inadequate.html](http://www.nongnu.org/lzip/xz_inadequate.html)

~~~
comex
Where the “issues” in question consist of things like “some field is 64 bits
wide even though there won’t be 2^64 values in practice”, and “very slightly
less likely to detect corrupted data” (but still extremely likely; anyway,
since neither xz nor lzip can _correct_ errors, I hope your long term storage
ensures integrity at a different level).

Meanwhile, xz is seekable and lzip is not. Both formats use the same
compression algorithm under the hood, so size differences are minimal. And
lzip is licensed under GPLv3, which means it cannot become a universally
adopted standard, which is surely important for long-term storage.

~~~
LeoPanthera
Version 2 or later, not version 3. And to defend its long-term viability:

"The lzip format is as simple as possible (but not simpler). The lzip manual
provides the source code of a simple decompressor along with a detailed
explanation of how it works, so that with the only help of the lzip manual it
would be possible for a digital archaeologist to extract the data from a lzip
file long after quantum computers eventually render LZMA obsolete."

Additionally there is a separate public domain version for those allergic to
GPL:
[http://www.nongnu.org/lzip/pdlzip.html](http://www.nongnu.org/lzip/pdlzip.html)

lzip is included in every Linux distro's package repository, and the three
major BSDs.

~~~
comex
> Version 2 or later, not version 3.

Ah. I wasn't misremembering, though: it seems it was previously GPLv3, but
changed to v2+ with lzip 1.16 from late 2014.

In any case, lzip has already lost the popularity war. If XZ were actually
"inadequate" and "inadvisable" as the lzip author claims, I might consider
that a shame and work to counter it. In reality, the blog post comes off more
like sour grapes. Some other things:

- The post spends several paragraphs complaining that some things in XZ are
padded to a multiple of four bytes. Seriously? It may be unnecessary, but it's
also very common in binary formats and makes very little difference.

- I was mistaken when I interpreted the blog post as claiming that XZ is
"very slightly less likely to detect corrupted data". Actually, it's probably
more likely, because it uses CRC64 whereas other formats (including lzip) use
CRC32. (Although supposedly lzip makes up for this by being more likely to
detect errors through the decoding process itself.) The post notes this, but
claims that reducing what it calls "false negatives" (i.e. thinking a
corrupted file is not corrupted) is not important "if the number of false
negatives is already insignificant".

The post does make a decent case that XZ should have some kind of check or
redundancy for length fields - though arguably the index serves as that, and
there's a case to be made that lzip is worse because it doesn't have one.

But where the post spends most of its time claiming lzip is superior is in
reducing _false positives_ - that is, thinking a file is corrupted when it is
really intact! How is that even possible? Well, because the checksum itself
could get corrupted while the actual data in the file remains intact. The post
assumes that each bit written has a given chance of being corrupted, so a
larger file will, on average, have more corruption than a smaller file.
Because a longer checksum is _slightly_ larger (even though checksums are
still a tiny fraction of the overall file size), it's supposedly a greater
risk.
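
(To put the post's model in illustrative numbers of my own, not taken from
the post: if every bit flips independently with probability p, a 64-bit
checksum has about a 64p chance per member of taking a hit, only 32p more
than a 32-bit one, while a multi-megabyte member's data has on the order of
10^7 bits of exposure. The extra "risk" being argued about is a rounding
error on a rounding error.)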

Which is such a load of bunk, _especially_ when it comes to long-term
archiving. In practice, the probability of error is not independent per bit.
Rather, you should have a setup that uses error-correcting codes (whether in
the form of RAID parity, built-in ECC on a NAND chip, etc.) to make the
probability negligible that you will _ever_ see _any_ errors. Which is not to
say that there aren't bad setups that can encounter errors… that there aren't
people who stick their treasured photos on some $99 USB hard disk that
frequently takes nosedives to the floor. But even then, one error is not just
one error; it's often a precursor to the entire drive becoming unreadable. In
any case, the solution is not "store very slightly less data to reduce what's
exposed to damage", it's "fix your setup"! On the other hand, for anyone who
thinks they have a good setup, it's very, very important to detect if an error
has occurred, as an error would indicate a problem that needs to be fixed to
protect the rest of the data. (Or in other cases, such as corruption during
transmission, it could indicate the need to re-copy that file from elsewhere.)
So "false negatives" are extremely harmful. Meanwhile, "false positives"
aren't necessarily bad at all, because if a given file is only detected as
corrupted because the checksum itself was corrupted, it still demonstrates
that there was an error! I'd go so far as to say you can never have too many
checksums - but at any rate, it's absurd to treat false positives as more
important than false negatives.

- Oh, this is rich. Another part of the post makes a big deal about lzip
being able to handle arbitrary data appended to the end of the file (because
apparently not supporting that means "telling you what you can't do with your
files"). But lzip _also_ supports concatenating two lzip files together to get
something that decodes to the concatenation, and this is how "multi-member"
streams work (which you get e.g. from parallel lzip, or by using an option to
limit member size). There's nothing to indicate whether or not a "member" is
the last one in the file; the decompressor just checks whether the data that
follows starts with the magic bytes "LZIP".

But what if there's a bitflip in one of the magic bytes? How do you know that
what follows is a corrupted member, and not some random binary data that
someone decided to append, which can be safely ignored?

You don't. I tried it. If you make any change to the magic bytes of a member
after the first one, decompressing with `lzip -d` silently truncates the
output at that point, without reporting any error. That's right - the
supposedly safer-for-archiving lzip can have silent data corruption with a
_single bit flip_.
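
For anyone who wants to reproduce it, this is roughly what I did (filenames
are placeholders; behavior may differ with other lzip versions):

    
    
      $ head -c 1M /dev/urandom > a
      $ head -c 1M /dev/urandom > b
      $ lzip -k a && lzip -k b        # produces a.lz and b.lz
      $ cat a.lz b.lz > ab.lz         # multi-member file by concatenation
      $ lzip -t ab.lz                 # intact file passes the integrity test
      $ cp ab.lz bad.lz
      $ # damage the second member's magic: overwrite its 'L' with an 'X'
      $ printf 'X' | dd of=bad.lz bs=1 seek=$(stat -c%s a.lz) conv=notrunc
      $ lzip -dk bad.lz               # exits without complaint...
      $ wc -c a b bad                 # ...but 'bad' only contains a's data
    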

The manual actually mentions this, but brushes it off:

> In very rare cases, trailing data could be the corrupt header of another
> member. In multimember or concatenated files the probability of corruption
> happening in the magic bytes is 5 times smaller than the probability of
> getting a false positive caused by the corruption of the integrity
> information itself. Therefore it can be considered to be below the noise
> level.

So it's not just a question of "the false negative rate is already low, and we
should also get the false positive rate as low as possible". Rather, the
manual _directly compares the probabilities_ of a bit flip either (a) causing
an error when theoretically the data could be fully recovered, or (b) silently
corrupting data. Since the chance of (b) is a whole 5 times smaller than (a),
its reasoning goes, it can be disregarded. That makes no sense at all! Those
are not comparable scenarios! If you have a backup, then _any_ detected error
is recoverable no matter what bytes it affects - but an undetected error could
be fatal, as backups are rotated over time.

I was curious: if you flip a random bit in an .lz file, what's the chance of
it being unprotected? In other words, what fraction of the file is made up of
'LZIP' headers? Using plzip (parallel lzip) with default settings - which
makes one member per 16MB of uncompressed data - I measured this as ranging
anywhere from ~2 in 10 million for mostly uncompressible data, to over 1 in
1000 (!) for extremely compressible data (all zeroes).
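
A rough way to check that yourself: count the 'LZIP' member magics in the
compressed file and divide by its size (this slightly over-counts, since the
four bytes can also show up by chance inside compressed data):

    
    
      $ head -c 1G /dev/zero > zeros          # pathologically compressible input
      $ plzip -k zeros
      $ grep -ao 'LZIP' zeros.lz | wc -l      # number of member headers, roughly
      $ wc -c zeros.lz
      # fraction of unprotected bytes ≈ 4 * (header count) / (file size)
    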

To be fair, I am not sure whether xz has any comparable vulnerabilities. But
if you're going to talk about how superior your format is in the face of
bitflips, your format had better be actually resilient to bitflips!

------
vmp
The next default compressor might be lrzip [1] by Con Kolivas; I've only seen
it a couple of times in the wild so far, but for certain files it can increase
the compression ratio quite a bit.

[1] [https://github.com/ckolivas/lrzip](https://github.com/ckolivas/lrzip)

    
    
      # 151M    linux-4.14-rc6.tar.gz
      # GZIP decompression
      ~$ time gzip -dk linux-4.14-rc6.tar.gz
    
      real    0m4.518s
      user    0m3.328s
      sys     0m13.422s
    
      # 787M    linux-4.14-rc6.tar
      # LRZIP compression
      ~$ time lrzip -v linux-4.14-rc6.tar
      [...]
      linux-4.14-rc6.tar - Compression Ratio: 7.718. Average Compression Speed: 13.789MB/s.
      Total time: 00:00:56.37
    
      real    0m56.533s
      user    5m35.484s
      sys     0m9.422s
    
      # 137M    linux-4.14-rc6.tar.lrz
      # LRZIP decompression
      ~$ time lrzip -dv linux-4.14-rc6.tar.lrz
      [...]
      100%     786.16 /    786.16 MB
      Average DeCompression Speed: 131.000MB/s
      Output filename is: linux-4.14-rc6.tar: [OK] - 824350720 bytes
      Total time: 00:00:06.35
    
      real    0m6.524s
      user    0m8.031s
      sys     0m1.766s
    
      # Results
      ~$ du -hs linux* | sort -h
      137M    linux-4.14-rc6.tar.lrz
      151M    linux-4.14-rc6.tar.gz
      787M    linux-4.14-rc6.tar
    
    

tested on WSL (Ubuntu BASH for Windows 10)

edit:

    
    
      ~$ time xz -vk linux-4.14-rc6.tar
      linux-4.14-rc6.tar (1/1)
        100 %        98.9 MiB / 786.2 MiB = 0.126   3.0 MiB/s       4:25
    
      real    4m25.189s
      user    4m23.828s
      sys     0m1.094s
      
      ~$ du -hs linux* | sort -h
      99M     linux-4.14-rc6.tar.xz
      137M    linux-4.14-rc6.tar.lrz
      151M    linux-4.14-rc6.tar.gz
      787M    linux-4.14-rc6.tar
    

It looks like XZ still has the best compression ratio, but it also took the
longest (real) time.

~~~
terrelln
lrzip is a preprocessor that finds matches in the distant past that the
backend compressor (xz) couldn't normally find. zstd has a new long range
matcher mode inspired by the ideas behind rzip/lrzip with some extra tricks.
It produces data in the standard zstd format, so it can be decompressed by
the normal zstd decompressor. There is a short article about it in the latest
release notes:
[https://github.com/facebook/zstd/releases/tag/v1.3.2](https://github.com/facebook/zstd/releases/tag/v1.3.2)
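
If you want to try it, long range matching is enabled with the `--long` flag
(the value is the window log; exact defaults can vary between versions):

    
    
      $ zstd --long=27 -19 linux-4.14-rc6.tar
      # --long is only needed on decompression if the window exceeds the
      # decompressor's default limit
      $ zstd -d --long=27 linux-4.14-rc6.tar.zst
    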

~~~
Twirrim
I tried a few times with zstd at various levels of compression with the linux
kernel sources. While I've been impressed with zstd, and have some projects
lined up to use it, it seems in the case of the linux kernel sources, it's not
a great fit. xz handily beats it, and not by a small margin either. I had to
really ratchet up the compression levels (20+) before I could get close to
100 MB.

~~~
Iburinoc
In general, xz beats zstd in compression ratio, as xz is very committed to
providing the strongest compression, at the expense of speed, while zstd
provides a range of compression ratio vs speed tradeoffs [0]. At the lower
levels, zstd isn't approaching xz's compression level, but it's doing it much
much faster. Additionally, zstd generally massively outperforms xz in
decompression speed:

    
    
      $ time xz linux-4.14-rc6.tar
    
      real    4m26.009s
      user    4m24.828s
      sys     0m0.724s
    
      $ wc -c linux-4.14-rc6.tar.xz
      103705148 linux-4.14-rc6.tar.xz
    
      $ time zstd --ultra -20 linux-4.14-rc6.tar
      linux-4.14-rc6.tar   : 12.81%   (824350720 => 105564246 bytes, linux-4.14-rc6.tar.zst)
    
      real    4m34.129s
      user    4m33.484s
      sys     0m0.432s
    
      $ time cat linux-4.14-rc6.tar.xz | xz -d > out1                                                                                                                                           
    
      real    0m9.677s
      user    0m6.608s
      sys     0m0.704s
    
      $ time cat linux-4.14-rc6.tar.zst | zstd -d > out2
    
      real    0m1.702s
      user    0m1.220s
      sys     0m0.520s
    

[0]:
[https://github.com/facebook/zstd/blob/dev/doc/images/DCspeed...](https://github.com/facebook/zstd/blob/dev/doc/images/DCspeed5.png)

------
jzelinskie
Beyond code golf, I believe pack is still used on many platforms for creating
.a archives of .o object files.

~~~
dfox
I'm almost certain that you are confusing pack with ar (which is an
uncompressed archive format; in contrast to tar, it is designed so that a
single file can be extracted without reading the whole archive).

ar is used as the format of .a/.lib files by almost all C compilers (IIRC
even by MSVC). Libraries are simply archives of .o files, on some platforms
with an additional symbol->file index, which is what gets added by running
the .a through ranlib.
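
For example, with the standard binutils-style tools (the 's' modifier writes
the symbol index, so a separate ranlib step isn't needed):

    
    
      $ cc -c foo.c bar.c
      $ ar rcs libfoo.a foo.o bar.o        # create the archive plus symbol index
      $ ar t libfoo.a                      # list the members
      foo.o
      bar.o
      $ nm --print-armap libfoo.a | head   # dump the archive index
    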

~~~
jzelinskie
Yup! You are correct. I was confusing it because I was recently exposed to
`pack`, which is a binary in the Go toolchain that happens to interact with
ar archives.

------
moonbug
Code golf is not a use case.

