Yeah, I used to do this with a little script. The strategy I used, which worked well when I was compressing and archiving workspaces (which might often contain checkouts of different branches of the same project), was essentially to sort the file list before handing it to tar, so that file types cluster together and, within each file type, files with the same base name end up close together.
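Roughly like this (reconstructed from memory, so treat the exact pipeline as illustrative rather than the original script):

    # reverse each path, sort, reverse back: this groups files by extension
    # first and then by the tail of the file name, so every Makefile and
    # every foo.c from the different checkouts ends up adjacent in the stream
    find . -type f | rev | sort | rev \
      | tar -czf workspace.tar.gz --no-recursion -T -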
This worked surprisingly well for my use cases, though you can imagine that packing and unpacking times were impacted by the additional head seeks caused by the rather arbitrary order in which this accesses files.
This will already sort equal file names together. If I wanted to combine that with file sizes, I'd probably do some kind of
decorate | sort | undecorate
dance on each line produced by find, where decorate would add to the start of each line the things you want to sort by, and undecorate would remove them again.
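Roughly like this, with base name and size as the decoration (the keys here are just an example, pick whatever you actually want to sort by; the -t$'\t' is bash syntax for a tab separator):

    # decorate: prefix each path with the sort keys (base name, then size),
    # sort on those keys, then undecorate by cutting them off again
    find . -type f -printf '%f\t%s\t%p\n' \
      | sort -t$'\t' -k1,1 -k2,2n \
      | cut -f3- \
      | tar -czf out.tar.gz --no-recursion -T -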
Because DEFLATE uses a small 32KB window, this task mostly comes down to matching the last 32KB of one file against the first 32KB of the next; that's where most of the cross-file redundancy is. A tool can do that estimation to find the best matching file to put in the archive after the current one, without doing the Huffman coding while it searches.
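A crude way to eyeball that match from the shell (this runs the full gzip pipeline, Huffman coding and all, so it's only a proxy for the cheaper estimate a dedicated tool could do; the file names are made up):

    # how well does fileB follow fileA? gzip the last 32KB of A glued to the
    # first 32KB of B; the smaller the output, the better the window matches
    { tail -c 32768 fileA; head -c 32768 fileB; } | gzip -9 | wc -c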
For cases like multiple checkouts of the same repo in a single archive, you want real long range compression like lrzip or zstd --long.
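For example (the 27 means a 2^27 = 128MB window, which is also what a bare zstd --long defaults to; directory name made up):

    # long-distance matching finds duplicate files between whole checkouts,
    # far beyond DEFLATE's 32KB window
    tar -cf - workspaces/ | zstd -T0 --long=27 -o workspaces.tar.zst
    # windows up to 2^27 decompress with a plain `zstd -d`; go bigger
    # (e.g. --long=30) and you need to pass --long again when decompressing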
And in the end, using zstd would be a win over DEFLATE every time. Just stop using DEFLATE unless you have to for compatibility with something that can't do zstd.
The fact that most of us use DEFLATE without ever realizing that it's optimized for hardware specs from many decades ago, and especially this 32KB window part, is pretty much the "grandma's cooking secret"[0] of compression, isn't it?
That is the beauty of deflate! It was fast on hardware many decades ago and it does well on today's hardware as well. It has found a real sweet spot between execution time and memory consumption. You can pick up a few extra percent here and there, but often only by requiring massive increases in one and/or the other, or specific knowledge of what you're compressing (images, audio, assembler output, etc.).
No, compression technology has advanced a lot since DEFLATE. Today's general purpose compressors beat DEFLATE at all points on the Pareto frontier (any network/disk speed, any compression ratio vs CPU time trade-off). Use zstd. If you need even more speed, use lz4. If you need even more compression, use LZMA (or if it's natural language text use ppmd).
Agree that zstd has the performance, but memory requirements are potentially orders of magnitude larger, there is no browser support, and you can't count on people having a decompressor installed. It's good if you're compressing data for your own use, but it doesn't have the reach of deflate. Maybe that will change in the future.
How many orders larger, and larger than what? The 32KB window? A megabyte is roughly two orders of magnitude larger than that, a gigabyte roughly five. We've got quite a bit of leeway here...
zstd and lz4 can be run in small memory footprints and without heap memory allocation, and I bet zstd would beat DEFLATE in a 32KB window too (just not by as much).
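That 32KB bet is easy to try, since zstd lets you cap the window through its advanced parameters (wlog=15 means a 2^15 = 32KB window; directory name made up):

    # rough apples-to-apples: both compressors limited to a 32KB window
    tar -cf - somedir/ | gzip -9 | wc -c
    tar -cf - somedir/ | zstd -19 --zstd=wlog=15 | wc -c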
It's not as popular yet, yes, but for internal use, including using HTTP content encoding if both your server and client support it, it's great.
I can't stop wondering (sincerely, not for "holy war" sake or anything) why we still use tar.whatever today, when ~99% of us have never seen a tape drive.
.tar.gz felt like such a wild step backwards after 7z when I switched from Windows to Linux. Why can't we just introduce links and access-rights metadata to 7z or something and use a modern, feature-rich archive format that doesn't require you to decompress everything just to find out whether a particular file name is in there?
The simplicity of the tar format has certain advantages, such as ensuring an implementation in a variety of languages and platforms.
The separation of "group files into an archive" and "compress these bytes" can also be very useful. For example, it means I can play around with the latest and greatest compression formats without needing to wait for a version of tar that has that format built in.
Overall, I think it appeals to the "unix philosophy" - composing two independent tools (tar and gzip) instead of having one integrated solution.
The "tape drive" part is legacy cruft, certainly, but in practice its removal doesn't improve things enough that people find it worth the compatability break.
I'm not going to claim this is better than the more modern feature-rich archive formats, but hopefully this helps you understand the other side of the argument a little better.
There's no simplicity in tar. Even reading the file size is guesswork. It's a baroque format with weird short-sighted decisions patched and extended in ad-hoc ways by multiple subtly incompatible implementations.
It's unix philosophy alright, but it's the other unix philosophy -- worse is better. tar has a poorly designed base format with lots of incompatible extensions, but for most people most of the time it works fine. So why change?
> Overall, I think it appeals to the "unix philosophy" - composing two independent tools (tar and gzip) instead of having one integrated solution.
The unix philosophy in this case breaks down when you need to access individual files. One possible "unixy" solution would be .gz.tar as opposed to .tar.gz, but it is not as universally supported as .tar.gz is.
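The .gz.tar route is easy enough to do by hand, something like this (paths made up; note that gzip replaces the originals with .gz files):

    # compress each file on its own, then archive the already-compressed files
    find project/ -type f -exec gzip -9 {} +
    tar -cf project.gz.tar project/
    # later: pull out and decompress one member without touching the rest
    tar -xf project.gz.tar project/docs/readme.txt.gz
    gunzip project/docs/readme.txt.gz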
Because tar + gz works on any Unix, while 7z requires installing additional software. Whether you want that or not depends on your use case. It's a good example of "good enough" software which is hard to phase out.
That said, recently I found that CentOS 8 minimal did not have tar installed. So maybe it's worth starting to use something modern like zip or even 7z.
Your question is essentially one of "general composability vs specialized tool". There are two (opposing) philosophical answers and three (opposing) practical answers, and who am I to say which is better for you?
For me, Tar + compression is optimal for most of my use cases - transfer over a network, software package distribution, long-term storage. In the case of transferring over a fast network, uncompressed Tar is pretty good [0], and Lz4-compressed is pretty fast too. In the case of distributing software, XZ compression is optimal (expensive, good level for small projects). In the case of long-term storage, Rzip compression is optimal (very expensive, good level for large projects).
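With GNU tar that composition is a one-liner per use case, e.g. (names made up):

    # fast path for network transfer
    tar -I lz4 -cf transfer.tar.lz4 data/
    # small path for distribution (tar's built-in xz support, level via XZ_OPT)
    XZ_OPT=-9 tar -cJf release.tar.xz project/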
Besides the pragmatic considerations, as a software developer & maintainer, I honestly enjoy composability.
Funnily enough, just yesterday I read the specification of SQLite Archive Files[1]. A pretty cool format, and yet my first question was: "Why is there no support for other compression methods? Like Lz4 for high speed, XZ for high compression?" Please note that, unlike with Tar, extending this standard is pretty much a no-go. Meanwhile a .tar.gz / .tar.xz / .tar.rz / .tar.lz4 is a self-documenting file. You know what to do, and it's easy to write a tool that does the right thing automatically.
I've been implementing a toy project recently, and one of the network communication protocols in it has grown into a poor man's re-implementation of Tar. Having realized that just now, I'll probably yank out several dozen lines of code and replace them with one call to tar on each side. However crufty Tar might have gotten over the decades, it's certainly less buggy than my hand-rolled code.
Lastly, Tar is a much richer format than one might ordinarily expect. For example it supports multiple versions of the same file[2] - and you can add newer version(s) incrementally (append-only file; a much safer operation than re-writing/replacing files). Think a poor man's Git/SQLite/Wiki/whatever. Self-contained, self-documenting, all very nice.
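For instance, with GNU tar (file name made up):

    tar -cf notes.tar notes.txt                  # first version
    # ...edit notes.txt...
    tar -rf notes.tar notes.txt                  # append a second version, old data untouched
    tar -xf notes.tar notes.txt                  # a plain extract gives you the newest copy
    tar --occurrence=1 -xf notes.tar notes.txt   # or reach back to the first one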
--
[0] uncompressed Tar supports fast seeking. If you expect working with an archive for a while, just create an uncompressed copy. Or a text file index, we don't judge.
I recently implemented an extension to sqlite - https://github.com/itroot/sqlite-brotli-compress - that adds the ability to compress blobs in sqlite via brotli, and I think it would be trivial to add any compression format to sqlar (with some format extension, of course). But sqlar, I believe, is not very popular, and is just an example of how sqlite can be used.
The biggest "whoa" effect I've had in a while with gz archives was using pigz --rsyncable. It's so much faster than single-threaded gzip, and borgbackup can do block-level deduplication on the result. Perfect for database dumps, for example.
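E.g. something like this (database name made up):

    # multi-core gzip with rsync/dedup-friendly block boundaries
    pg_dump mydb | pigz --rsyncable -p 8 > mydb.sql.gz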
> binsort <dir> | tar -T- --no-recursion -czf out.tar.gz
> Binsorting the distribution of abiword 2.8.6 (44029103 bytes in 3391 files) on a quadcore CPU takes approx. 12s, and produces a tar.gz more than 14% smaller than without processing.
I had some archives which are heavily redundant (web mirrors & scrape mirrors by date), and so I looked into file order compression: https://www.gwern.net/Archiving-URLs#sort---key-compression-... On my particular use-case, the compression gains are enormous. It's particularly impressive because as far as any enduser is concerned, it's just like any other compressed XZ tarball.
If I recall, the LZX [0] compression program on the Amiga, way back when, used to do this. It re-ordered the list of files according to some detected type or other before compressing in groups. It called these groups "merge groups" [1].
"Merge groups" sound like solid compression, i.e. compressing multiple files together. Note that .tar.XYZ is always fully solid, while something like .zip compresses each file individually, and something like .7z has a solid block size. That's the main reason why .tar.gz usually compresses better than .zip, despite zip usually using deflate, just like gz.
Yes, that's right. The useful bit is making sure that similar files are put in the same group (or otherwise within the search window of the compression algorithm, which for LZX was 64kB, and for deflate is 32kB).
Unpacking already packed files before adding them to the archive could also improve compression, if the packed files contain common parts (or were packed in a less sophisticated format).
True. In the Java world it was common to put a jar (zip file) inside a war (zip file). If you modify the jar so every file in it is stored instead of deflated, then package it in the war, the resulting file can be a lot smaller.
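A sketch of that repacking with the zip CLI (paths made up):

    # rebuild the inner jar with -0 (store, i.e. no DEFLATE inside the jar)
    cd extracted-jar && zip -0 -q -r ../library.jar . && cd ..
    # the outer archive's DEFLATE now sees raw class files instead of
    # already-compressed data, so it can match across them
    zip -9 -q -r app.war library.jar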
Decompressing using a different algorithm like .bz2 or .xz is noticeably slower, so sticking with gz and shuffling the files might split the difference.
This would come with the cost of longer compression times - either multiple attempts with random shuffling, or a pre-compression file-ordering optimization pass. For resources that are compressed once and then distributed and decompressed many times, this would be quite an interesting solution.
zstd is faster and smaller. If you can choose the format, zstd beats deflate across the board, on every front except for compatibility with things that only understand deflate.
Also, if you need to use deflate for compatibility, use https://github.com/zlib-ng/zlib-ng , which is substantially faster than either zlib or gzip.
I've never heard of zstd (thanks!), but it seems it's right under my nose.
"Arch Linux added support for zstd as a package compression method in October 2019 with the release of the pacman 5.2 package manager, and in January 2020 switched from xz to zstd for the packages in the official repository. Arch uses zstd -c -T0 --ultra -20 -, the size of all compressed packages combined increased by 0.8% (compared to xz), the decompression speed is 1300% faster, decompression memory increased by 50 MiB when using multiple threads, compression memory increases but scales with the number of threads used."
Maybe I am missing something, but this approach seems doomed to fail, or at best to produce only small gains.
A tar file is a concatenation of 512-byte blocks. For each member, the first block contains the metadata of the file, and the next blocks contain the content.
Of those first 512 bytes, only the first 100 bytes contain text (the file name).
Trying to play with the order of files just so that 100 bytes of data sit next to each other inside 512-byte blocks does not seem like a successful approach.
gzip uses a 32KB window, which can include up to 64 tar headers. At the very best we are saving 100 bytes * 64 => 6400 bytes, call it 7KB, every 32KB. That gives an optimal compression of (32-7)/32 => 78%.
The author says that the standard gzip compressed to 45768 / 337920 => 13%
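For the curious, those header fields are easy to peek at from the shell (archive name made up): the name is the first 100 bytes of the header, and the size is 12 bytes of ASCII octal at offset 124:

    # file name of the first member
    dd if=some.tar bs=1 count=100 2>/dev/null | tr -d '\0'; echo
    # its size, as octal text
    dd if=some.tar bs=1 skip=124 count=12 2>/dev/null; echo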
zip seems to be the only widely supported compressed archive format that supports extracting a file's content by jumping straight to its data, and easy deletion of individual files.
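E.g. (archive and member names made up):

    unzip -l archive.zip                    # list via the central directory, no decompression
    unzip -p archive.zip docs/readme.txt    # extract a single member to stdout
    zip -d archive.zip docs/old-notes.txt   # delete a member in place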
A big help could be sorting files by extension, and then by file name. The same file name in many sub-directories is probably going to have similar content. For example:
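A sketch of that ordering (the keys are just what I'd pick):

    # decorate each path with its base name, sort on that, then strip it:
    # every Makefile, config.h and README from the different sub-directories
    # ends up next to its near-identical twins in the tar stream
    find . -type f -printf '%f\t%p\n' | sort | cut -f2- \
      | tar -czf out.tar.gz --no-recursion -T -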