
Micro-Optimizing .tar.gz Archives by Changing File Order - zdw
https://justinblank.com/experiments/optimizingtar.html
======
mannschott
Yeah, I used to do this with a little script. The strategy I used, which worked
well when I was compressing and archiving workspaces (which might often
contain checkouts of different branches of the same project), was essentially
this:

    
    
        find * -print | rev | sort | rev |
        tar --create --no-recursion --files-from - |
        gzip
    

This clusters files of the same type together and, within each type, places
files with the same base name close together.
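
A quick illustration with some made-up paths: sorting on the reversed path
groups extensions first, then base names:

    
    
        $ printf '%s\n' src/main.c lib/util.c src/main.h lib/util.h | rev | sort | rev
        lib/util.c
        src/main.c
        lib/util.h
        src/main.h
    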

This worked surprisingly well for my use cases, though you can imagine that
packing and unpacking times were impacted by the additional head seeks caused
by the rather arbitrary order in which this accesses files.

~~~
david_draco
Awesome! I will use this. I would like this even more if it sorted on
filenames alone (ignoring paths) and, when those are equal, sorted by file
size.

~~~
mannschott
This will already sort equal file names together. If I wanted to combine that
with file sizes, I'd probably do some kind of

    
    
        decorate | sort | undecorate
    

dance on each line produced by find, where decorate would prepend to each line
the keys you want to sort by, and undecorate would remove them again.
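
A minimal sketch of that dance (assuming bash, GNU stat, and paths free of
tabs and newlines):

    
    
        find * -type f -print |
        while IFS= read -r f; do
            # decorate: prefix each path with its basename and its size
            printf '%s\t%s\t%s\n' "$(basename "$f")" "$(stat -c %s "$f")" "$f"
        done |
        sort -t$'\t' -k1,1 -k2,2n |   # sort by basename, then numerically by size
        cut -f3- |                    # undecorate: keep only the path
        tar --create --no-recursion --files-from - | gzip > archive.tar.gz
    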

------
wolf550e
Because DEFLATE uses a small 32KB window, this task mostly reduces to matching
the last 32KB of one file against the first 32KB of the next for redundancy. A
tool could compute those estimates to find the best-matching file to put after
the current one in the archive, without doing the Huffman coding while
estimating.
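
A crude stand-in for such an estimate (just a sketch - it runs gzip outright
rather than skipping the Huffman step as a real tool would; smaller output
means a better match):

    
    
        # score candidate B as the successor of A: compress A's tail plus B's head
        pair_score() {
            { tail -c 32768 "$1"; head -c 32768 "$2"; } | gzip -c | wc -c
        }
    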

For cases like multiple checkouts of the same repo in a single archive, you
want real long range compression like lrzip or zstd --long.
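
For example (a sketch; zstd's --long defaults to a 128MB match window):

    
    
        tar -cf - checkouts/ | zstd --long > checkouts.tar.zst
        zstd -dc --long checkouts.tar.zst | tar -xf -
    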

And in the end, using zstd would be a win over DEFLATE every time. Just stop
using DEFLATE unless you have to for compatibility with something that can't
do zstd.

~~~
vanderZwan
The fact that most of us use DEFLATE without ever realizing that it's
optimized for hardware specs from many decades ago, and especially this 32KB
window part, is pretty much the "grandma's cooking secret"[0] of compression,
isn't it?

[0] [https://www.snopes.com/fact-check/grandmas-cooking-secret/](https://www.snopes.com/fact-check/grandmas-cooking-secret/)

~~~
zxcvbn4038
That is the beauty of deflate! It was fast on hardware many decades ago and it
does well on today's hardware as well. It has found a real sweet spot between
execution time and memory consumption. You can pick up a few extra percent
here and there, but often only by requiring massive increases in one and/or
the other, or specific knowledge of what you're compressing (images, audio,
assembler output, etc.).

~~~
wolf550e
No, compression technology has advanced a lot since DEFLATE. Today's general
purpose compressors beat DEFLATE at all points on the Pareto frontier (any
network/disk speed, any compression ratio vs CPU time trade-off). Use zstd. If
you need even more speed, use lz4. If you need even more compression, use LZMA
(or if it's natural language text use ppmd).
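
Roughly, the spectrum looks like this (a sketch, piping tar through each tool):

    
    
        tar -cf - dir | lz4 > dir.tar.lz4         # fastest
        tar -cf - dir | zstd -19 > dir.tar.zst    # strong ratio, still fast to decompress
        tar -cf - dir | xz -9 > dir.tar.xz        # densest (LZMA)
    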

~~~
zxcvbn4038
Agree that zstd has performance, but memory requirements are potentially
orders of magnitude larger, there is no browser support, and you can't count
on people having a decompressor installed. It's good if you're compressing
data for your own use, but it doesn't have the reach of deflate. Maybe that
will change in the future.

~~~
vanderZwan
How many orders of magnitude larger, and larger than what? The 32KB? A
megabyte is two orders larger than that, a gigabyte five orders larger. We've
got quite a bit of leeway here...

------
qwerty456127
I can't stop wondering (sincerely, not for "holy war" purposes or anything)
why we still use tar.whatever today, when ~99% of us have never seen a tape
drive.

.tar.gz felt like such a wild step backwards after 7z when I switched from
Windows to Linux. Why can't we just introduce links and access-rights metadata
to 7z or something and use a modern, feature-rich archive format that wouldn't
require you to decompress everything just to find out whether a particular
file name is in there?

~~~
ekimekim
The simplicity of the tar format has certain advantages, such as making it
easy to implement in a variety of languages and on a variety of platforms.

The separation of "group files into an archive" and "compress these bytes" can
also be very useful. For example, it means I can play around with the latest
and greatest compression formats without needing to wait for a version of tar
that has that format built-in.
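
For instance, any compressor that reads stdin and writes stdout slots right in
(a sketch using zstd):

    
    
        tar -cf - somedir | zstd -19 > somedir.tar.zst
        zstd -dc somedir.tar.zst | tar -xf -
    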

Overall, I think it appeals to the "unix philosophy" - composing two
independent tools (tar and gzip) instead of having one integrated solution.

The "tape drive" part is legacy cruft, certainly, but in practice its removal
doesn't improve things enough for people to find it worth the compatibility
break.

I'm not going to claim this is better than the more modern feature-rich
archive formats, but hopefully this helps you understand the other side of the
argument a little better.

~~~
pornel
There's no simplicity in tar. Even reading the file size is guesswork. It's a
baroque format with weird short-sighted decisions, patched and extended in
ad-hoc ways by multiple subtly incompatible implementations.

~~~
klodolph
Unless you're using PaX, isn't the file size just a fixed-size field in the
header? Is there something I'm missing?
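
For reference, in the classic ustar layout the name occupies bytes 0-99 and
the size is a 12-byte octal field at offset 124 (a quick peek, assuming some
local archive.tar):

    
    
        dd if=archive.tar bs=1 count=100 2>/dev/null          # name field, NUL-padded
        dd if=archive.tar bs=1 skip=124 count=12 2>/dev/null  # size field, octal
    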

~~~
Hello71
fixed... ish... with two sizes and two formats:
[https://dev.gentoo.org/~mgorny/articles/portability-of-tar-features.html#large-file-sizes](https://dev.gentoo.org/~mgorny/articles/portability-of-tar-features.html#large-file-sizes)

------
metafunctor
The biggest "whoa" effect I've had in a while with gz archives was using pigz
--rsyncable. It's so much faster than single-threaded gzip, and borgbackup can
do block-level deduplication on the result. Perfect for database dumps, for
example.
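
For example (a sketch - the Postgres dump is just a stand-in for any large,
mostly-stable stream):

    
    
        # parallel gzip with rsync/borg-friendly block boundaries
        pg_dump mydb | pigz --rsyncable > mydb.sql.gz
    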

~~~
diroussel
Links for the lazy: [https://zlib.net/pigz/](https://zlib.net/pigz/)

Made by:
[https://en.wikipedia.org/wiki/Mark_Adler](https://en.wikipedia.org/wiki/Mark_Adler)

------
nisa
You can get even better results by sorting the files by content similarity:
[http://neoscientists.org/~tmueller/binsort/](http://neoscientists.org/~tmueller/binsort/)

~~~
andruby
Very cool!

> binsort <dir> | tar -T- --no-recursion -czf out.tar.gz

> Binsorting the distribution of abiword 2.8.6 (44029103 bytes in 3391 files)
> on a quadcore CPU takes approx. 12s, and produces a tar.gz more than 14%
> smaller than without processing.

14% is an amazing improvement.

------
gwern
I had some archives which are heavily redundant (web mirrors & scrape mirrors
by date), and so I looked into file-order compression:
[https://www.gwern.net/Archiving-URLs#sort---key-compression-trick](https://www.gwern.net/Archiving-URLs#sort---key-compression-trick)
On my particular use case, the compression gains are _enormous_. It's
particularly impressive because as far as any end user is concerned, it's just
like any other compressed XZ tarball.

------
mnw21cam
If I recall correctly, the LZX [0] compression program on the Amiga, way back
when, used to do this. It re-ordered the list of files according to some
detected type or other before compressing in groups. It called these groups
"merge groups" [1].

[0] [https://en.wikipedia.org/wiki/LZX](https://en.wikipedia.org/wiki/LZX) [1]
[http://xavprods.free.fr/lzx/optsmmgs.html](http://xavprods.free.fr/lzx/optsmmgs.html)

~~~
formerly_proven
"Merge groups" sound like solid compression, i.e. compressing multiple files
together. Note that .tar.XYZ is always fully solid, while something like .zip
compresses each file individually, and something like .7z has a solid block
size. That's the main reason why .tar.gz usually compresses better than .zip,
despite zip usually using deflate, just like gz.
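
A quick way to see the solid effect (a sketch; the duplicate only deduplicates
when it fits inside the shared compression window):

    
    
        head -c 16384 /dev/urandom > blob    # small enough for DEFLATE's 32KB window
        cp blob copy1; cp blob copy2
        zip -q per-file.zip copy1 copy2      # ~32KB: each copy compressed separately
        tar -czf solid.tar.gz copy1 copy2    # ~16KB: the second copy matches the first
        ls -l per-file.zip solid.tar.gz
    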

~~~
mnw21cam
Yes, that's right. The useful bit is making sure that similar files are put in
the same group (or otherwise within the search window of the compression
algorithm, which for LZX was 64kB, and for deflate is 32kB).

------
xuhu
Unpacking already packed files before adding them to the archive could also
improve compression, if the packed files contain common parts (or were packed
in a less sophisticated format).

~~~
Hello71
[https://github.com/schnaader/precomp-
cpp](https://github.com/schnaader/precomp-cpp)

~~~
xuhu
Some results here, impressive:
[http://schnaader.info/precomp_results.php](http://schnaader.info/precomp_results.php)

------
m463
This is an interesting take on things.

Decompressing using a different algorithm like .bz2 or .xz is noticeably
slower, so sticking with gz and shuffling the files might split the
difference.

~~~
mr__y
This would come with a cost of longer compression times - either multiple
attempts with random shuffling or pre-compression file ordering optimization
process. For resources that are compressed once and then distributed and
decompressed multiple times this would be quite interesting solution

~~~
m463
I got curious and looked it up. On this page [1], decompression times in
seconds for gzip vs bzip2 vs xz, by compression level, are:

    
    
      level   gzip    bzip2   xz
      1       6.771   24.23   13.251
      2       6.581   24.101  12.407
      3       6.39    23.955  11.975
      4       6.313   24.204  11.801
      5       6.153   24.513  11.08
      6       6.078   24.768  10.911
      7       6.057   23.199  10.781
      8       6.033   25.426  10.676
      9       6.026   23.486  10.623
    

so gzip has the fastest decompression.

That said, xz is in the ballpark and can be _significantly_ smaller.

[1] [https://www.rootusers.com/gzip-vs-bzip2-vs-xz-performance-comparison/](https://www.rootusers.com/gzip-vs-bzip2-vs-xz-performance-comparison/)

~~~
JoshTriplett
zstd is faster _and_ smaller. If you can choose the format, zstd beats deflate
across the board, on every front except for compatibility with things that
only understand deflate.

Also, if you need to use deflate for compatibility, use
[https://github.com/zlib-ng/zlib-ng](https://github.com/zlib-ng/zlib-ng) ,
which is substantially faster than either zlib or gzip.

~~~
dependenttypes
Is it though? In [https://quixdb.github.io/squash-benchmark/unstable/](https://quixdb.github.io/squash-benchmark/unstable/)
deflate beats zstd at least in some tests.

~~~
joseluisq
Yeah, but zstd beats deflate in others. For example, in the latest MySQL 8:
[https://dev.mysql.com/worklog/task/?id=13442](https://dev.mysql.com/worklog/task/?id=13442)

------
jakozaur
Zstd is superior to gz. It also has a larger window.

I wonder why adoption of better compression formats is so slow.

------
siscia
Maybe I am missing something, but this approach seems doomed to fail / produce
only small gains.

A tar file is a concatenation of 512-byte blocks. The first block contains a
file's metadata, and the following blocks contain its content.

Of those first 512 bytes, only the first 100 contain the file name text.

Playing with the order of files so that 100 bytes of similar text land near
each other inside 512-byte blocks does not seem like a successful approach.

gzip uses a 32KB window, which can cover up to 64 tar headers. At the very
best we save 100 bytes * 64 = 6400 bytes, call it 7KB out of every 32KB, for
an optimal ratio of (32 - 7) / 32 => 78%.

The author says that standard gzip already compressed to 45768 / 337920 =>
13%.

I believe there are better approaches...

~~~
Kuinox
Yeah, like stop using tar, and use a better format like Zip Archives.

~~~
throwaway2048
Zip archives do not support unix permissions.

~~~
Kuinox
One more reason to use zip archives, then!

------
david_draco
A big help could be sorting files by extension, and then by filename. The same
file name in many sub-directories is probably going to have similar content.
For example:

    
    
        a/index.txt
        b/index.txt
        a/img/index.txt
        a/img/logo.png
        a/img/test.css
        a/foo.css
    

Edit: this is what mannschott's script does.

------
zomg
Always remember the xkcd comic "tar", which will never get old! :)
[https://xkcd.com/1168/](https://xkcd.com/1168/)

