Micro-Optimizing .tar.gz Archives by Changing File Order (justinblank.com)
48 points by zdw on Aug 20, 2020 | 57 comments



Yeah, I used to do this with a little script. The strategy I used, which worked well when I was compressing and archiving workspaces (which often contained checkouts of different branches of the same project), was essentially this:

    find * -print | rev | sort | rev |
    tar --create --no-recursion --files-from - |
    gzip
This clusters file types together, and within each file type it puts files with the same base name close together.

This worked surprisingly well for my use cases, though you can imagine that packing and unpacking times were impacted by the additional head seeks caused by the rather arbitrary order in which this accesses files.


A small experiment with a 143M directory.

  $ tar -zcvf directory.tar.gz directory/
  $ du -sh directory.tar.gz
  57M directory.tar.gz

  $ find directory/* -print | rev | sort | rev | \
      tar --create --no-recursion --files-from - | \
      gzip -c > directory.tar.gz
  $ du -sh directory.tar.gz
  55M directory.tar.gz
A 3.51% (2 MB) reduction makes sense here.


Small nitpick: * will miss hidden files.

-print is also unnecessary.

Simply use

  find |


Awesome! I will use this. I would like it even more if it sorted by filename only (ignoring paths) and, when equal, by file size.


This will already sort equal file names together. If I wanted to combine that with file sizes, I'd probably do some kind of

    decorate | sort | undecorate
dance on each line produced by find, where decorate would prepend to each line the things you want to sort by, and undecorate would strip them off again.
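
Something like this, as a rough sketch (untested; it assumes GNU find's -printf and bash, and the sort keys are just illustrative):

    # decorate each path with its base name and size, sort on those
    # fields, then undecorate by cutting the prefix back off
    find . -type f -printf '%f\t%s\t%p\n' |
    sort -t $'\t' -k1,1 -k2,2n |
    cut -f3- |
    tar --create --no-recursion --files-from - |
    gzip > archive.tar.gz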


On the other hand SSDs have cut at least 20% off head seek times. ;)


Because DEFLATE uses a small 32 KB window, this task mostly amounts to matching the last 32 KB of one file against the first 32 KB of the next file for redundancy. A tool could do that estimation to find the best-matching file to place after the current one, without having to do the Huffman coding along the way.

For cases like multiple checkouts of the same repo in a single archive, you want real long range compression like lrzip or zstd --long.

And in the end, using zstd would be a win over DEFLATE every time. Just stop using DEFLATE unless you have to for compatibility with something that can't do zstd.
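
For the multiple-checkouts case above, something along these lines (the window size is just an example; even the default long-distance window is far bigger than DEFLATE's 32 KB):

    # long-range matching with a 128 MiB window (2^27 bytes)
    tar -cf - workspace/ | zstd -T0 --long=27 -o workspace.tar.zst
    # decompression takes the same window hint
    zstd -d --long=27 workspace.tar.zst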


The fact that most of us use DEFLATE without ever realizing that it's optimized for hardware specs from many decades ago, and especially this 32KB window part, is pretty much the "grandma's cooking secret"[0] of compression, isn't it?

[0] https://www.snopes.com/fact-check/grandmas-cooking-secret/


That is the beauty of deflate! It was fast on hardware many decades ago and it does well on today's hardware as well. It has found a real sweet spot between execution time and memory consumption. You can pick up a few extra percent here and there, but often only by requiring massive increases in one and/or the other, or specific knowledge of what you're compressing (images, audio, assembler output, etc.).


No, compression technology has advanced a lot since DEFLATE. Today's general purpose compressors beat DEFLATE at all points on the Pareto frontier (any network/disk speed, any compression ratio vs CPU time trade-off). Use zstd. If you need even more speed, use lz4. If you need even more compression, use LZMA (or if it's natural language text use ppmd).


Agree that zstd performs well, but its memory requirements are potentially orders of magnitude larger, there is no browser support, and you can't count on people having the decompressor installed. It's good if you're compressing data for your own use, but it doesn't have the reach of deflate. Maybe that will change in the future.


How many orders of magnitude larger, and larger than what? The 32 KB? A megabyte is about two orders of magnitude larger than that, a gigabyte five. We've got quite a bit of leeway here...


zstd and lz4 can run with small memory footprints and without heap allocation, and I bet zstd would beat DEFLATE with a 32 KB window too (just not by as much).

It's not as popular yet, yes, but for internal use, including using HTTP content encoding if both your server and client support it, it's great.


I can't stop wondering (sincerely, not for the sake of a "holy war" or anything) why we still use tar.whatever today, when ~99% of us have never seen a tape drive.

.tar.gz felt like such a wild step backwards after 7z when I switched from Windows to Linux. Why can't we just add links and access-rights metadata to 7z or something and use a modern, feature-rich archive format that doesn't require you to decompress everything just to find out whether a particular file is in there?


The simplicity of the tar format has certain advantages, such as ensuring an implementation in a variety of languages and platforms.

The separation of "group files into an archive" and "compress these bytes" can also be very useful. For example, it means I can play around with the latest and greatest compression formats without needing to wait for a version of tar that has that format built-in.

Overall, I think it appeals to the "unix philosophy" - composing two independent tools (tar and gzip) instead of having one integrated solution.

The "tape drive" part is legacy cruft, certainly, but in practice its removal doesn't improve things enough that people find it worth the compatibility break.

I'm not going to claim this is better than the more modern feature-rich archive formats, but hopefully this helps you understand the other side of the argument a little better.


There's no simplicity in tar. Even reading the file size is guesswork. It's a baroque format with weird short-sighted decisions, patched and extended in ad-hoc ways by multiple subtly incompatible implementations.


Unless you're using PaX, isn't the file size just a fixed-size field in the header? Is there something I'm missing?


fixed... ish... with two sizes and two formats: https://dev.gentoo.org/~mgorny/articles/portability-of-tar-f...
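
For the common octal case it really is just a fixed field, though; a quick sketch (archive.tar is a placeholder, and this ignores the base-256 encoding GNU tar switches to for files >= 8 GiB):

    # the size field is 12 bytes at offset 124 of the first header block
    size_octal=$(dd if=archive.tar bs=1 skip=124 count=12 2>/dev/null | tr -d '\0 ')
    echo $((8#$size_octal))   # size of the first member, in bytes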


It's unix philosophy alright, but it's the other unix philosophy -- worse is better. tar has a poorly designed base format with lots of incompatible extensions, but for most people most of the time it works fine. So why change?


Is tar "worse" in the context?


> Overall, I think it appeals to the "unix philosophy" - composing two independent tools (tar and gzip) instead of having one integrated solution.

The unix philosophy in this case breaks down when you need to access individual files. One possible "unixy" solution would be .gz.tar as opposed to .tar.gz, but it is not as universally supported as .tar.gz is.
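
A rough sketch of what that would look like (gzip replaces the files in place here, so do it on a copy; the names are just examples):

    # compress every file individually first, then archive the results;
    # members can later be extracted and gunzipped without touching the rest
    find project/ -type f -exec gzip -9 {} +
    tar -cf project.gz.tar project/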


You can have the first file in the tar archive be the "index" file.


Because tar.gz works on any unix, while 7z requires installing additional software. Whether you want that or not depends on your use case. It's a good example of "good enough" software which is hard to phase out.

That said, I recently found that CentOS 8 minimal did not have tar installed. So maybe it's worth starting to use something more modern like zip or even 7z.


I use Peazip on Linux https://peazip.github.io/

tar.gz is ancient


Your question is essentially one of "general composability vs specialized tool". There are two (opposing) philosophical answers, and three (opposing) practical answers, and who am I to say which is better for you?

For me, Tar + compression is optimal for most of my use cases - transfer over a network, software package distribution, long-term storage. For transferring over a fast network, uncompressed Tar is pretty good [0], and Lz4-compressed is pretty fast too. For distributing software, XZ compression is optimal (expensive, good level for small projects). For long-term storage, Rzip compression is optimal (very expensive, good level for large projects).
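
Roughly the kind of thing I mean (file names are just examples):

    # fast network transfer: cheap lz4 on the fly
    tar -cf - project/ | lz4 > project.tar.lz4
    # software distribution: xz squeezes harder at a higher CPU cost
    tar -cJf project-1.0.tar.xz project-1.0/
    # long-term storage of big, redundant trees: lrzip over an uncompressed tar
    tar -cf project.tar project/ && lrzip project.tar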

Besides the pragmatic considerations, as a software developer & maintainer, I honestly enjoy composability.

Funnily enough, just yesterday I read the specification of SQLite Archive Files[1]. A pretty cool format, and yet my first question was, "Why is there no support for other compression methods? Like Lz4 for high speed, XZ for high compression?" Please note that, unlike with Tar, extending this standard is pretty much a no-go. Meanwhile a .tar.gz / .tar.xz / .tar.rz / .tar.lz4 is a self-documenting file. You know what to do, and it's easy to write a tool that does the right thing automatically.

I've been implementing a toy project recently, and one of the network communication protocols in it has grown into a poor man's re-implementation of Tar. Having realized that just now, I'll probably yank out several dozen lines of code and replace them with one call to tar on each side. However crufty Tar might have gotten over the decades, it's certainly less buggy than my hand-rolled code.

Lastly, Tar is a much richer format than one might ordinarily expect. For example, it supports multiple versions of the same file[2] - and you can add newer version(s) incrementally (append-only; a much safer operation than re-writing/replacing files). Think a poor man's Git/SQLite/Wiki/whatever. Self-contained, self-documenting, all very nice.
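
For example, with GNU tar (appending only works on an uncompressed archive; notes.tar and notes.txt are placeholders):

    # append a newer copy of the same path; the older one stays in the archive
    tar -rf notes.tar notes.txt
    # plain extraction gives you the newest copy; --occurrence picks an older one
    tar -xf notes.tar --occurrence=1 notes.txt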

--

[0] uncompressed Tar supports fast seeking. If you expect to work with an archive for a while, just create an uncompressed copy. Or a text file index, we don't judge.

[1] https://www.sqlite.org/sqlar.html

[2] https://www.gnu.org/software/tar/manual/html_node/multiple.h...


I recently implemented an extension to sqlite - https://github.com/itroot/sqlite-brotli-compress - that adds the ability to compress blobs in sqlite via Brotli, and I think it would be trivial to add any compression format to sqlar (with some format extension, of course). But sqlar, I believe, is not very popular, and is just an example of how sqlite can be used.


The biggest "whoa" effect I've had in a while with gz archives was using pigz --rsyncable. It's so much faster than single-threaded gzip, and borgbackup can do block-level deduplication on the result. Perfect for database dumps, for example.
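
Something like this for the database-dump case (pg_dump and mydb are just examples):

    # parallel gzip; --rsyncable resets the stream periodically so unchanged
    # regions produce identical compressed blocks that dedup/rsync can reuse
    pg_dump mydb | pigz --rsyncable -p 8 > mydb.sql.gz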



You can get even better results by sorting the files by content similarity: http://neoscientists.org/~tmueller/binsort/


Very cool!

> binsort <dir> | tar -T- --no-recursion -czf out.tar.gz

> Binsorting the distribution of abiword 2.8.6 (44029103 bytes in 3391 files) on a quadcore CPU takes approx. 12s, and produces a tar.gz more than 14% smaller than without processing.

14% is an amazing improvement.


I had some archives which are heavily redundant (web mirrors & scrape mirrors by date), and so I looked into file order compression: https://www.gwern.net/Archiving-URLs#sort---key-compression-... On my particular use-case, the compression gains are enormous. It's particularly impressive because as far as any enduser is concerned, it's just like any other compressed XZ tarball.


If I recall, the LZX [0] compression program on the Amiga, way back when, used to do this. It re-ordered the list of files according to some detected type or other before compressing in groups. It called these groups "merge groups" [1].

[0] https://en.wikipedia.org/wiki/LZX [1] http://xavprods.free.fr/lzx/optsmmgs.html


"Merge groups" sound like solid compression, i.e. compressing multiple files together. Note that .tar.XYZ is always fully solid, while something like .zip compresses each file individually, and something like .7z has a solid block size. That's the main reason why .tar.gz usually compresses better than .zip, despite zip usually using deflate, just like gz.


Yes, that's right. The useful bit is making sure that similar files are put in the same group (or otherwise within the search window of the compression algorithm, which for LZX was 64kB, and for deflate is 32kB).
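
For comparison, the per-file vs. solid difference in practice looks roughly like this (the 7z solid block size is just an example; site/ is a placeholder):

    zip -r -q site.zip site/          # each member compressed on its own
    7z a -ms=64m site.7z site/        # solid, in 64 MB blocks
    tar -czf site.tar.gz site/        # one solid stream over everything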


Unpacking already packed files before adding them to the archive could also improve compression, if the packed files contain common parts (or were packed in a less sophisticated format).


True. In the Java world it was common to put a jar (zipfile) in a war (zipfile). If you modify the jar so every file in it is stored instead of deflated, and then package it in a war, the resulting file can be a lot smaller.
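
A rough sketch of that repacking (zip's -0 means "store only"; the names are made up):

    # unpack the jar, then re-zip it with its entries stored rather than deflated
    unzip -q library.jar -d library/
    ( cd library && zip -q -r -0 ../library-stored.jar . )
    # the war's deflate now runs over the whole stored jar as one stream,
    # instead of over already-deflated (essentially incompressible) entries
    zip -q -r app.war WEB-INF/ library-stored.jar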



Some results here, impressive: http://schnaader.info/precomp_results.php


This is an interesting take on things.

Decompressing using a different algorithm like .bz2 or .xz is noticeably slower, so sticking with gz and shuffling the files might split the difference.


This would come at the cost of longer compression times - either multiple attempts with random shuffling or a pre-compression file-ordering optimization pass. For resources that are compressed once and then distributed and decompressed many times, this would be quite an interesting solution.


I got curious and looked it up. On this page [1], decompression for gzip vs bzip2 vs xz at each compression level looks like this:

  level   gzip    bzip2   xz
  1       6.771   24.23   13.251
  2       6.581   24.101  12.407
  3       6.39    23.955  11.975
  4       6.313   24.204  11.801
  5       6.153   24.513  11.08
  6       6.078   24.768  10.911
  7       6.057   23.199  10.781
  8       6.033   25.426  10.676
  9       6.026   23.486  10.623
so gzip has the fastest decompression.

That said, xz is in the ballpark and can be significantly smaller.

[1] https://www.rootusers.com/gzip-vs-bzip2-vs-xz-performance-co...


zstd is faster and smaller. If you can choose the format, zstd beats deflate across the board, on every front except for compatibility with things that only understand deflate.

Also, if you need to use deflate for compatibility, use https://github.com/zlib-ng/zlib-ng , which is substantially faster than either zlib or gzip.
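
With a reasonably recent GNU tar that's a drop-in change (src/ is just an example):

    tar --zstd -cf src.tar.zst src/                  # tar invokes zstd itself
    tar -cf - src/ | zstd -T0 -19 -o src.tar.zst     # or pipe it explicitly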


I've never heard of zstd (thanks!), but it seems it's right under my nose.

"Arch Linux added support for zstd as a package compression method in October 2019 with the release of the pacman 5.2 package manager, and in January 2020 switched from xz to zstd for the packages in the official repository. Arch uses zstd -c -T0 --ultra -20 -, the size of all compressed packages combined increased by 0.8% (compared to xz), the decompression speed is 1300% faster, decompression memory increased by 50 MiB when using multiple threads, compression memory increases but scales with the number of threads used."

https://en.wikipedia.org/wiki/Zstandard


Is it though? In https://quixdb.github.io/squash-benchmark/unstable/ deflate beats zstd at least in some tests.


Yeah, but zstd beats deflate in others. For example the latest MySQL 8 https://dev.mysql.com/worklog/task/?id=13442


It's universally faster if you compress threaded, and otherwise it's generally still a win.


Yeah, and it has been an IETF standard since 2018: https://tools.ietf.org/html/rfc8478


I wonder how zstd would compare.


This is my go-to comparison of different compression algorithms [0].

TL;DR zstd is faster, smaller and has a wider range than gzip. For maximum compression, plzip seems to be the best.

[0] https://community.centminmod.com/threads/round-4-compression...


Zstd is superior to gz. It also has a larger window.

I wonder why adoption of better compression formats is so slow.


Maybe I am missing something, but this approach is doomed to fail / produce small gains.

A tar file is a concatenation of 512-byte blocks. For each file, the first block contains its metadata, and the following blocks contain its content.

Of those first 512 bytes, only the first 100 bytes contain the file name.

Trying to play with the order of files so that those 100 bytes of data end up next to each other within 512-byte blocks does not seem like a successful approach.

gzip uses a 32 KB window, which can include up to 64 tar headers. At the very best we are saving 100 bytes * 64 => 6400 bytes, or roughly 7 KB, every 32 KB, for an optimal compression ratio of (32 - 7) / 32 => 78%.

The author says that standard gzip already compressed to 45768 / 337920 => ~13% of the original size.

I believe there are better approaches...


Yeah, like stop using tar, and use a better format like Zip Archives.


zip seems to be the only widely supported compressed archive format that supports extracting a file's content by jumping straight to its data, and easy deletion of individual files.


Zip archives do not support unix permissions


One more reason to use zip archives then !


A big help could be sorting files by extension, and then by filename. The same file name in many sub-directories is probably going to have similar content. For example:

  a/index.txt
  b/index.txt
  a/img/index.txt
  a/img/logo.png
  a/img/test.css
  a/foo.css

Edit: this is what mannschott's script does.


Always remember the xkcd comic "tar", which will never get old! :) https://xkcd.com/1168/



