The tar archive format, and why GNU tar extracts in quadratic time (mort.coffee)
317 points by mort96 on July 23, 2022 | 133 comments

> Tar is pretty unusual for an archive file format. There's no archive header, no index of files to facilitate seeking, no magic bytes to help file and its ilk detect whether a file is a tar archive, no footer, no archive-wide metadata. The only kind of thing in a tar file is a file object.

tar = tape archive. It was designed around streaming files to a tape drive in a serial fashion. Tape bandwidth has almost always been slow compared to source data devices. You need to start writing to the target device as fast as possible. Taking time to construct a full index just delays starting the slowest stage of things, while providing only relatively minimal benefit.

Generating an index is fairly straightforward: the file headers give you the information you need, including what you need to know to get to the next file header. Your only bottleneck is just how fast random reads are. If the file is on disk, it should be relatively trivial.
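To make that concrete, here is a minimal sketch of such an index pass in Python. It assumes an uncompressed, plain ustar archive and only reads the 100-byte name field and the octal size field of each 512-byte header; it ignores the ustar prefix field, pax/GNU long names and base-256 sizes. `build_index` and `make_demo_tar` are names made up for this illustration.

```python
import io
import tarfile

BLOCK = 512  # tar headers and file data are padded to 512-byte blocks


def build_index(f):
    """Walk the headers of an uncompressed tar; map name -> (data offset, size)."""
    index = {}
    offset = 0
    while True:
        f.seek(offset)
        header = f.read(BLOCK)
        # The archive ends with zero-filled blocks (or simply runs out).
        if len(header) < BLOCK or header == b"\0" * BLOCK:
            break
        name = header[0:100].split(b"\0", 1)[0].decode("utf-8", "replace")
        # The size is octal text, terminated by NUL or space.
        digits = header[124:136].strip(b"\0 ") or b"0"
        size = int(digits, 8)
        index[name] = (offset + BLOCK, size)
        # Jump over the data, rounded up to a whole number of blocks.
        offset += BLOCK + (size + BLOCK - 1) // BLOCK * BLOCK
    return index


def make_demo_tar():
    """Build a tiny in-memory ustar archive to try the indexer on."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, data in [("a.txt", b"hello"), ("b.txt", b"x" * 1000)]:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf
```

With the index in hand, reading one member is a single seek to the stored offset followed by a read of `size` bytes. Each step of the scan only touches one header block, which is why the cost is dominated by random-read latency rather than archive size.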

It's even possible to do this relatively cheaply with Range headers if you're looking at a tar file stored on an HTTP endpoint, though you'll likely want to think about some smart pre-fetching behaviour.

The issue with tapes is not the bandwidth but the seek latency, which could be around 2 minutes end to end (with the correct tape already loaded). So tapes are not used as random-access devices, and thus there is no central index in a tape archive. The backup software systems keep their catalogs elsewhere.

When I was a beginner I once accidentally created an Amazon RDS Postgres database backed by tape. I was so flustered by how slow it was until I saw the setting.

I'm not sure how this is possible. You can only choose EBS backed storage for RDS Postgres. Do you mean 'magnetic' storage? That's not tape.

It's not possible.

GP was likely confused by the different types of hard drives (standard, sc1, st1), and assumed it was tape or was incorrectly told by someone else that it was tape.

If I squint hard enough, the only viable explanation I could come up with is GP said "I created the database using the st1 storage type" and someone responded "st!? As in a UNIX SCSI tape drive[0]?"

0: https://man7.org/linux/man-pages/man4/st.4.html

It could be tape. But perhaps you’re right that it’s HDDs I guess? I assumed most of their normal storage is hard disk and not SSD. They’ve got a lot.

Edit: I checked. You’re 100% right. Well now my anecdote sucks! :) And now, seven years later, I’m back to wondering why it was so brutally slow.

Heh. It's not tape. :)

If you provisioned a really small amount of GP2, you can run out of I/O credits pretty quickly and get throttled to baseline performance (and it used to be 7 times the storage provisioned, or some such, and now it's 100 IOPS). That's brutally slow for a working database.

When I was in school, the professor of our operating systems class told us about a virtual memory project implementation that had the swapfile on tape. It worked, but it wasn't anything you'd actually ever use.

When I was 9 or 10 I once set virtual memory on my Macintosh to live in a 44MB SCSI cartridge drive. It was brutally slow.

Lol! I'm curious how slow it got... How slow are we talking about here?

A non-cached GET for a single database record was like 7 seconds.

It was especially scary because I was in the “I have no idea what I’m doing, and I’m at a tech startup and am a team of one” situation.

Still too fast for it to have been tape hah. If it was on tape, it would have taken minutes

Latency is absolutely killer with tapes, but bandwidth is still a major issue. The current generation LTO-9 is only capable of 400MB/s in throughput. That's still slower than a SATA3 drive.

Last time I used tape in earnest was a while back though (2005ish), LTO-2 ruled the roost at a whopping 40MB/s, with LTO-3 just making an appearance. Around the time I left the company they were building out a disk storage array as a backup target, and shifting tape further and further away from the main backup path. Data production and storage needs across the industry had long been growing at a greater rate than tape capacity and performance.

> The current generation LTO-9 is only capable of 400MB/s in throughput. That's still slower than a SATA3 drive.

True, but that's the speed of the SATA3 interface, or of SSDs. A single spinning-rust drive can realistically achieve 150-250 MB/s.

I wouldn't view this as being about write speed... keep in mind that the common ZIP implements a very simple optimization to address this exact issue, which is writing the index at the end of the file instead of at the beginning. Actually there were tools that did this when writing to tape as well, but tar wasn't one of them. In fact a lot of common tape archiving solutions would just put the index on its own dedicated tape, which of course makes sense when you consider that a computer operator would probably need to find out which tape a record of interest was on. Similarly it is possible to build an "out-of-band" index for a tar file but I'm not aware of a tool that does so... probably mostly because this is way less practical if the tarball has been compressed (compressing the files in the tarball instead fixes this issue and I've written tools that do it that way before, but it doesn't seem to be common either).

The issue is much more about seek speed. Putting the master header at the end of ZIP files requires initially reading them backwards to locate the beginning of that data structure. This works fine on random access devices but is of course pretty untenable on tape. Even if the tape drive could actually read backwards without seeking to read each successive block in the normal direction (I'm sure there's at least one weird tape drive out there that could but not the common models, although if you read very small blocks you could get pretty good performance doing this as long as you kept the seeks within the buffer columns), the seek time to get to the end of the tape and then back to the beginning could be minutes.

Backup programs (at least as early as the 90's) would get around this by creating index or catalog files for the backup tape, and keeping those locally. You could re-generate the catalog file by re-indexing the backup tape (for example, if you were restoring to a new system and no longer had the catalog files), but of course that took a considerable amount of time.

Another benefit was that you could quickly browse the contents of your backups without having to read or even insert the tape.

It's only really seekable if you don't compress it further. That's because a tar file gets compressed after the fact. Even with compression formats that support starting decompression somewhere in the middle, you now either need an additional file index that tells you which files there are and where in the compressed stream to start reading, or you have to try and scan the whole archive as you would with an uncompressed tar by seeking along the headers and hope the files are large enough that you can actually skip large enough chunks to matter.

It is indeed really odd that the *nix world never came up with their own, agreed upon, indexed or at least efficiently seekable archive format. I think if one of the extensions had added per-file compression, that would have had the highest chance of success.

> Tape bandwidth has almost always been slow compared to source data devices

This is not the case.

> Your only bottleneck is how fast random reads are. If the file is on disk, it should be relatively trivial.

Ah yes, how fast is a random read to the middle of a 1TB file... that has been compressed as a single gzip stream...


You don't loathe tar files due to a deficiency in the tar format. You loathe tar files because your operating system vendor doesn't build in support for one of the most widely used archive formats in the world. Complain to Microsoft, not to people who use tar.

Tar might've been designed for use with tape, but no part of its design makes the HDD/SSD use-case worse just to work better with tape.

I exclusively use linux, and I wish there was something better than tar that was more widely used. In particular, that didn't have the problem of several subtly incompatible sets of extensions, and that had better support of using an index, especially in tandem with compression.

But that "something better" isn't zip. It's probably something that doesn't currently exist.

7z doesn't support unix file permissions, which means it isn't suitable as a tar replacement.

I believe pixz adds a file index when compressing tar files. It’s definitely much, much faster at extracting a single file from a large archive compared to a tar file compressed with gzip or zstd.

For what it's worth there is the SQLite Archive, but it's definitely not popular and there is no support for metadata like ownership, permissions or extended attributes (ACLs, SELinux). Also compression is missing on my Fedora system.

https://www.sqlite.org/sqlar.html / https://news.ycombinator.com/item?id=28668615

> there is no support for metadata like ownership, permissions or extended attributes

That makes it unsuitable as a tar replacement for many use cases.

Well, the lack of index for one

It seems fairly trivial to scan the whole file and build an in memory index to speed things up, and I'm sure most viewers do. I can't say I've ever had significant speed issues with most archives to see this as a problem.

It's actually pretty terrible in practice. Keep in mind that most common tar files have also been run through a stream compression algorithm, which means scanning the archive for file headers requires decompressing the entire thing. Some software does index tar files automatically (and of course the typical CLI tools will if you want) but for a large archive, say 20gb, I have had to stare at a progress bar for 30s+ while the tool indexed the archive. Extracting a single file required another 30s+ wait while it scanned the archive again to get back to that file, since with gzip compression it's not trivial to store enough state to be able to decompress at a point without having to decompress everything up to there.

I believe pixz was created to solve the exact problem you have just described. For xz/lzma compression at least.

rar files are the same if you want to extract them.

I see tar files used with compression pretty often, usually gzip or xz. That can mess up the ability to seek around in the tar archive and build a good index.

I use tar.gz as my go-to archive format, don't get me wrong. It's easy, well-supported, handles POSIX concepts well, and is pretty universal. But seekability isn't its strong suit.

Good luck doing that with a multi-terabyte compressed tar file!

What makes tar files better than zip files? Why should anyone prefer .tar.xz to an even more commonly used format that has even broader support?

There’s a very good reason to prefer .tar.gz (or xz or whatever) to .zip: tar.gz files deliver better compression (ranging from “marginally better” to “significantly better” depending on what you’re compressing).

In a .zip, the files are each compressed individually using DEFLATE and then concatenated to create an archive, whereas in a .tar.gz the files are first concatenated into a .tar archive and then DEFLATEd all together.

Because of this, a .tar.gz often achieves much better compression on archives containing many small files, because the compression algorithm can eliminate redundancies across files. The downside is that you can’t decompress an individual file without decompressing every preceding file in the stream, because DEFLATE does not support random access. (And so tar’s lack of index is an advantage here; an index would not be useful if you can’t seek.)
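A quick way to see the effect is to compress the same set of small files both ways. The sketch below uses Python's `zlib` (DEFLATE in a zlib wrapper rather than real zip or gzip framing, and without the tar headers), and the "source files" are made-up stand-ins, so treat the numbers as an illustration only.

```python
import zlib

# Hypothetical stand-ins for many small, similar source files.
files = [
    f"int get_value_{i}(struct widget *w) {{ return w->values[{i}]; }}\n".encode()
    for i in range(200)
]

# zip-style: every file becomes its own DEFLATE stream.
per_file = sum(len(zlib.compress(data)) for data in files)

# tar.gz-style: concatenate first, then one DEFLATE stream over everything.
solid = len(zlib.compress(b"".join(files)))

print(f"per-file: {per_file} bytes, solid: {solid} bytes")
```

In the per-file case each stream starts with an empty history, so the repeated boilerplate has to be encoded again for every file; in the solid case later files can simply point back at earlier ones.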

This is why e.g. open source software downloads often use .tar.gz. A source code archive has hundreds or thousands of tiny text files with a ton of redundancy between files in variable & function names, keywords, common code snippets/patterns, etc., so tar.gz delivers significantly better compression than zip. And there’s little use for random access of individual files, since all the files need to be extracted in order to compile the program anyway.

The abbreviation “tape archive” may be anachronistic nowadays, but the performance characteristics of a tape drive (namely, fast sequential access but absolutely awful random access) coincide with the performance characteristics of compression algorithms. So an archive format designed for packaging files up to be stored on a tape is perfect for packaging files up to be compressed.

The trade-off is that it takes a long time to extract a single file from a solid archive. The 7z format supports a "solid block size" parameter for this reason (for all supported compression algorithms AFAICT) which can be set to anything from "compress all files individually" to "size of whole archive"

I wasn't saying tar is better than zip (though if you want to read about issues with the zip file format, have a look at this post: https://games.greggman.com/game/zip-rant/ -- HN discussion: https://news.ycombinator.com/item?id=27925393). I'm just saying there's nothing about the tar file format itself that makes you loathe it, you just loathe that Windows doesn't support it. That's not a problem with tar.

I don't even know if it's true that tar is less common than zip. I know zip is incredibly common, but truly, so is tar. In the UNIX world, _everything_ happens through tar. And as someone who almost exclusively operates in the UNIX world, I interact with zip files very rarely, while I work with tarballs all the time.

Just don't blame the format for a deficiency in your operating system, that's all I'm saying.

To be fair, I almost never come across tar files. Most crossplatform software provides .tar.gz for Linux/macOS and .zip for Windows.

Should Windows have native support for tar.gz files? Maybe! Maybe not. I dunno. So when I come across something using that format for Windows, what it really comes across as is half-assed Windows support. Which isn’t the end of the world. But it’s rarely a good sign.

Ah, yeah, that's completely fair. If someone is making an archive _for Windows users_, that archive should absolutely be a zip. A tarball definitely sends the signal that Windows users aren't the primary audience. Sometimes that's okay, sometimes it's a sign of a really shoddy port.

Sort of. I wouldn't make a separate source archive for Windows users: anyone who can compile stuff will manage installing 7-Zip or another archiver that handles .tar, and solid compression wastes less space and bandwidth. For Windows-specific archives .zip is a no-brainer though.

7-Zip for windows is always something I go for on a fresh install. Then I also have rar support. But screw rar.

Also on a fresh install I install WSL, so tar is always available that way too.

.tar.gz is a tar file, it’s just gzipped afterwards.

Technically yes, but if you care about user experience you should treat it as a single compressed archive, not as one archive in another like e.g. 7-Zip does.

For unixy systems zip doesn't have sufficient metadata. I suspect that is almost the entire reason zip isn't used more on such systems. It is still used quite a bit when that doesn't matter.

Why should you couple your archive format to a compression algorithm?

Well, there are reasons. If your archive format handles compression, it can be designed in such a way that you can seek and extract only parts of the archive. If the archive format doesn't handle compression, you're dependent on reading through the archive sequentially from start to finish.

That's not to say tar is wrong to not have native compression, it's just one reason why it's not crazy for archive formats to natively support compression.

I’m semi-sure that this is possible with .tar.gz files already. I’ve used vim to view a text file within a few different rather large archives without noticing the machine choke up on extracting several gigs before showing content. Certainly nothing was written to disk in those cases.

.tar.gz files can only be read sequentially, but there are optimizations in place on common tools that make this surprisingly fast as long as there's enough memory available to essentially mmap the decompressed form. The problem is bigger with archives in the tens of GB (actually pretty common for tarballs since it's popular as a lowest-common-denominator backup format) or resource-constrained systems where the swapping becomes untenable.

There are extensions for gzip that can make it coarsely seekable, I wouldn't be surprised if some archive tools used that.

The beauty of tar is using it as a stream of files. That lets you pipe a tar stream from a program (e.g. git) into another program (e.g. docker context) without loading the entire contents into RAM.
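For anyone who wants to try the streaming side, Python's `tarfile` has a stream mode (`"r|"`) that never seeks and hands you one member at a time. A small sketch follows; `OneWayStream` is a made-up wrapper that imitates a pipe by refusing to seek, and `list_stream` is just an illustrative consumer.

```python
import io
import tarfile


class OneWayStream(io.RawIOBase):
    """Wrap bytes as a forward-only stream, like a pipe: it cannot seek."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def readable(self):
        return True

    def seekable(self):
        return False

    def readinto(self, b):
        return self._buf.readinto(b)


def list_stream(stream):
    """Consume a tar stream member by member, in archive order."""
    seen = []
    # "r|" means: read an uncompressed tar as a pure stream of blocks.
    with tarfile.open(fileobj=stream, mode="r|") as tar:
        for member in tar:
            if member.isfile():
                # In stream mode the data must be read before moving on.
                data = tar.extractfile(member).read()
                seen.append((member.name, len(data)))
    return seen
```

Only the member currently being read needs to be in memory, which is what makes piping one program's tar output straight into another workable for archives larger than RAM.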

Install 7Zip. It is $CurrentYear after all.

> Just use a zip like a normal person.

zip files are an ancient format, not without flaws too.


> My tarball happened to contain over 800 000 links with ".." as a path component. It also happened to contain over 5.4 million hard links.

The behavior is more precisely described not as quadratic, but as O(n*m), where n is the number of symlinks pointing outside of the extraction directory, and m is the number of hardlinks. It happens because of the security precautions GNU tar performs to make sure a malicious archive can't set up a symlink pointing to an arbitrary place in your filesystem, and then extract files to it. The strategy they use has an unoptimized corner case when the archive also has a lot of hardlinks.

The mechanism could probably be improved by remembering the list of deferred symlinks in a hash map rather than a linked list, but also, both symlinks pointing outside the extraction directory, and hardlinks, are uncommon.
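To illustrate the difference in shape (this is not GNU tar's actual code, and the function names are made up), here is the same membership question asked against a plain list and against a hash set:

```python
def count_hits_list(deferred, hardlink_targets):
    """For every hard link, scan the whole list of deferred symlinks: O(n*m)."""
    hits = 0
    for target in hardlink_targets:
        for path in deferred:  # linear scan, repeated for each hard link
            if path == target:
                hits += 1
                break
    return hits


def count_hits_set(deferred, hardlink_targets):
    """Same question answered with a hash set: O(n + m) on average."""
    deferred_set = set(deferred)  # built once
    return sum(1 for target in hardlink_targets if target in deferred_set)
```

Note that the list version pays for the full scan even when nothing matches, which fits the observation that the slowdown shows up whether or not the hard links have anything to do with the deferred symlinks.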

(Edited to correct: The problem exists even if the hardlinks do not actually point at the symlinks.)

To clarify, the problem doesn't only exist with symlinks pointing outside of the extraction directory. The problem exists if the symlink's target path is a relative directory which contains "..". That means a symlink at "somedir/subdir1/foo" which points to "../bar" will also use the delayed link mechanism, even though the symlink doesn't point to outside of the extraction directory.

Relative symlinks which contain ".." are incredibly common in some contexts.

Yup, and that's worst-case quadratic in number of links (symlinks and hardlinks).

It's still weird they didn't do a hashmap.

It's C and they probably never thought of anyone having a file with that many links.

> You may think that the numeric values (file_mode, file_size, file_mtime, ...) would be encoded in base 10, or maybe in hex, or using plain binary numbers ("base 256"). But no, they're actually encoded as octal strings (with a NUL terminator, or sometimes a space terminator). Tar is the only file format I know of which uses base 8 to encode numbers.
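For reference, decoding and encoding such a field takes only a couple of lines. A sketch in Python, with made-up helper names, covering only the plain octal case (not the base-256 extension some tar implementations use for very large values):

```python
def parse_octal_field(field: bytes) -> int:
    """Decode a tar numeric field: ASCII octal digits ended by NUL or space."""
    digits = field.strip(b"\0 ")
    return int(digits, 8) if digits else 0


def format_octal_field(value: int, width: int) -> bytes:
    """Encode a number as zero-padded octal plus a trailing NUL, `width` bytes total."""
    return format(value, "o").zfill(width - 1).encode("ascii") + b"\0"
```

A mode field of `b"0000644\0"` decodes to 0o644, and a 12-byte size field leaves room for 11 octal digits, which is where the classic 8 GiB per-file limit comes from.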

Tar’s “competitor”, cpio, does this as well (at least in one of the popular implementations). The Xcode XIP file, if you’re familiar with that particular format, is a couple layers of wrapping on top of a cpio archive, so deep inside my tool to expand these there’s a spot where I read these all out in a similar fashion: https://github.com/saagarjha/unxip/blob/5dcfe2c0f9578bc43070...

More generally, though, neither tar nor cpio has an index because they don’t need one: they’re meant to be decompressed in a streaming fashion where this information isn’t necessary. It’s kind of inconvenient if you want to extract a single file, but for large archives you really want to compress files together and maintaining an index for this is both expensive and nontrivial, so I guess these formats are just “good enough”.

Git also uses octal to encode entry modes in tree objects. That’s despite tree objects not being ascii-readable (object-ids are stored in binary form rather than the hex used in most other locations).

TBF the only common numeric format I’ve yet to see is base64, git uses octal, decimal, hex, binary (BE), at least two completely different VLEs, plus a bitmap scheme.

I always used to think there was something magic about these formats (like, only really smart people would design a new format, and my custom csv was somehow ‘less’ than other formats), but now I’m older and I find that it’s just a bunch of files concatenated together to make a new one.

The magic is lost, but I feel a lot better about myself :)

That's the beauty! It's just a bunch of files concatenated, and when you throw compression algorithm X over it, it works pretty well. Super neat IMO.

Well it's neat for a 10 minute hack on an internal project. Not so great that it's become the standard way of distributing files in Linux.

Although in this case I'd say it's the fault of symlinks daring to exist and spew their flaws over everyone.

Plus maybe slightly C's fault for not coming with a data structure library so everyone uses linked lists for everything. Presumably this wouldn't have happened in any other language (e.g. C++) because they would have used a set or hashmap or whatever.

A stone age language dealing with a misguided filesystem feature for a brain-dead archive format. Surprising that it works at all really!

Considering there are other file formats that are used and occupy the same space as tarballs, the fact that tarballs are common for quite a bit of archiving should represent a win. Other formats (like zip, 7z, rar, etc.) have come along, and while they're in use and zips specifically are very common, tarballs are still used. That must say something for them?

I believe they're popular in Linux because they come preinstalled.

Zip also can't store symlinks but I doubt that is very important in most situations.

Tar is definitely very simplistic (and pretty crummy). Things get quite a bit more complicated when you get into the compression side of archiving though. And a lot less fun (I hate bitstreams).

If you want to see more examples of accidentally quadratic behaviors: https://accidentallyquadratic.tumblr.com/

Truly Tumblr was created for disgusting smut.

(Most recent post Aug 2019)

One advantage of tar is that, because the format has no built-in support for compression or random access, the entire archive is compressed together. Similarities in adjacent files will improve the compression ratio.

To support random access, the ZIP format must compress each file separately.

With tar.gz, there is not much of a benefit: the window size is only 32KB. Now with LZMA, the situation is much different.

That can still be a significant benefit if the files are small, which is not uncommon.

The 512-byte header is a bit of an annoyance there (especially compared to zip’s 30+file name) but not that much.

And then, obviously, one of the advantages of tar is you can compress with whatever you want.

Then just compress every 10MB or whatever independently. Probably a good idea for parallelism anyway.

What are you talking about?

You two were debating the benefits of `tar.gz` style solid compression vs zip style independent compression of files.

Zip allows random access while solid compression will give slightly better compression for small similar files.

I was pointing out that you can have the best of both worlds by catting all the files together like `tar` but then compressing it in chunks instead of as one big stream. Plus compressing it in chunks allows you to decompress it in parallel.
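A rough sketch of the chunked idea in Python, using `zlib` and a much smaller chunk size than the 10MB mentioned above so it is easy to play with. The function names and the in-memory list of chunks are assumptions for illustration; a real format would also store an offset table.

```python
import zlib

CHUNK = 64 * 1024  # hypothetical chunk size, far smaller than 10MB for the demo


def compress_chunks(data, chunk_size=CHUNK):
    """Compress fixed-size slices independently; return the compressed chunks."""
    return [
        zlib.compress(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]


def read_range(chunks, start, length, chunk_size=CHUNK):
    """Return data[start:start+length], inflating only the chunks that overlap it."""
    if length <= 0:
        return b""
    first = start // chunk_size
    last = (start + length - 1) // chunk_size
    plain = b"".join(zlib.decompress(c) for c in chunks[first:last + 1])
    skip = start - first * chunk_size
    return plain[skip:skip + length]
```

Because no chunk depends on another, the same list could be handed to a pool of workers to decompress in parallel; the price is that matches cannot reach back across a chunk boundary.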

That's just not true, at least theoretically speaking (I have no idea about ZIP's actual internals). One could do a two-pass algorithm to find common substrings to build a dictionary, store this shared dictionary once, and then compress the files individually referring to this shared dictionary for the common substrings.
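DEFLATE itself has a hook for something along these lines: a preset dictionary. Here is a small sketch using Python's `zlib` and its `zdict` parameter. This is not the two-pass scheme described above (the dictionary here is written by hand rather than derived from the data), and the record format is made up.

```python
import zlib

# Hand-written shared dictionary: text we expect most records to contain.
SHARED = b'{"user_id": , "status": "active", "region": "eu-west-1"}'


def compress_with_dict(data, shared=SHARED):
    """Compress one record on its own, but primed with the shared dictionary."""
    c = zlib.compressobj(zdict=shared)
    return c.compress(data) + c.flush()


def decompress_with_dict(blob, shared=SHARED):
    """Inflate one record; the same dictionary must be supplied again."""
    d = zlib.decompressobj(zdict=shared)
    return d.decompress(blob) + d.flush()


record = b'{"user_id": 4217, "status": "active", "region": "eu-west-1"}'
plain_size = len(zlib.compress(record))
dict_size = len(compress_with_dict(record))
```

Every record stays independently decompressible, so random access is preserved, while the redundancy that is shared between records lives once in the dictionary instead of in each stream.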

You definitely could. I did it for a proprietary format once. Each item was quite small (typically 100-ish bytes IIRC), and I needed random access.

One approach would have been to split it into pages of a few kilobytes and separately compress those. But on this platform, I/O was primitive and very slow, and the CPU was also extremely slow. So it was advantageous to decompress only the necessary data.

So I came up with a rudimentary custom compression algorithm. I scanned all entries for common strings and made a shared dictionary. Each entry was then stored as a sequence of indexes into the shared dictionary. Decompressing an entry was dead simple: read the sequence of dictionary indices and copy from the dictionary to the output.

On a side note, the unexpected challenge of this method was how to choose which strings to put in the shared dictionary. I tried a few things and ended up with a heuristic that tries to balance favoring longer strings with favoring frequently occurring strings. I'm sure there are much better ways than that.

If I had it to do over, I probably would have gone with the compressed pages approach simply because the implementation would've been done so much quicker. But I think I did achieve faster decompression and a better compression ratio.

As long as you have a pointer index in the header for the start/end of every individual file I don't see why that wouldn't work.

Well, except no zip decompression library I know supports anything like that.

One disadvantage of tar is that because it has no built-in support for compression it has to give up the very important feature of random access for a trivial improvement in compression ratio.

Anyway, 7z can optionally do solid compression. I've no idea why it isn't more popular in the Linux world. I guess the only real reason is the CLI tools don't come installed by default.

I've been going through the first edition of Art of Intel x86 Assembly and there was a mention that DEC used octal numbers (PDP-11 uses octal) and a cross reference of the tar wiki page indicates that the first tar was written for Version 7 Unix, which was made for the PDP-11.

GNU tar most likely inherited the octal system in order to retain compatibility with the original tar utility.

Fun fact[1]: the x86 instruction encoding (pre Pentium or so) is itself octal in organization. This is immediately apparent in the 2+3+3-bit structure of the ModR/M and SIB bytes, but looking at the opcode table that way is also helpful. The manuals stubbornly describe it in hex, though; you have to go way, way back to the Datapoint 2200 “programmable terminal” to find a manual that actually talks in octal, and that wasn’t even made by Intel! (The 8008 was commissioned from Intel by the original manufacturer as a “2200 on a chip”.)

Weird Intel history aside, everybody used octal in the old days. I don’t actually know why, but I suspect this goes back to how early binary computers used 36-bit words (later 18- or 12-bit ones) because that’s how many bits you need to represent 10 decimal digits (and a sign), the standard for (electro)mechanical arithmometers which (among other devices) they were trying to displace. So three-bit groupings made more sense than four-bit ones. Besides, as far as instruction encodings go, an 8-way mux seems like a reasonable size if a 16-way one is too expensive.

(Octal on the 16-bit PDP-11 is probably a holdover from the 12-bit PDP-8? Looking at the encoding table, it does seem to be using three- and six-bit groups a lot.)

[1] https://news.ycombinator.com/item?id=30409889

> but I suspect this goes back to how early binary computers used 36-bit words (later 18- or 12-bit ones)

Actually the PDP-1 was an 18 bit machine; its successor the PDP-6 had 36 bit words (and 18 bit addresses -- yes it was explicitly designed to be a Lisp machine in 1963). Other DEC machines like the PDP-7 (on which Unix was developed) were 18-bit machines (Multics used an 18-bit non-DEC architecture).

Probably the most popular minicomputer ever, the PDP-8, used 12 bits, as did a bunch of DEC industrial control machines.

There's a story about Grace Hopper, who learned octal for the BINAC. She later had problems balancing her checkbook before she realized she was doing so in octal.

That led her to conclude it would be better to teach the computer to handle decimal than to force everyone to use octal.

A lot of the old systems used 6-bit character codes (https://en.wikipedia.org/wiki/Six-bit_character_code ), including BCD six-bit codes, making 6*n word sizes more appropriate.

My dad used to give me maths problems when I was little but tell me to solve them in specific bases (typically decimal, octal or hex, but not always). Doing long division by hand in hex gives you a feel for how the numbers relate.

The anecdote about decimal makes sense — she was the key to the design of COBOL.

I used quite a few machines with six bit characters into the mid 1980s.

While GNU tar inherits this from the original tar, it doesn't seem to be something intrinsic to the PDP-11 or DEC given that the earlier Unix 'tap' archival system uses 'plain binary numbers ("base 256")', not octals.

Here's the V7 tar.c: https://github.com/dspinellis/unix-history-repo/blob/Researc...

and its documentation: https://github.com/dspinellis/unix-history-repo/blob/Researc...

The C code clearly shows the octal (I've never used "%o" myself!):

  sscanf(dblock.dbuf.mode, "%o", &i);
  sp->st_mode = i;
  sscanf(dblock.dbuf.uid, "%o", &i);
  sp->st_uid = i;
  sscanf(dblock.dbuf.gid, "%o", &i);
  sp->st_gid = i;
  sscanf(dblock.dbuf.size, "%lo", &sp->st_size);
  sscanf(dblock.dbuf.mtime, "%lo", &sp->st_mtime);
  sscanf(dblock.dbuf.chksum, "%o", &chksum);

Now, older versions of Unix supported DECTape, going back to the first manual from 1971 (see the "tap" command at http://www.bitsavers.org/pdf/bellLabs/unix/UNIX_ProgrammersM... , page "5_06", which is 94 of 211 - the command-line interface is clearly related to tar).

Here's the tp.5 format description from V6 at https://github.com/dspinellis/unix-history-repo/blob/Researc...

  DEC/mag tape formats
  Each entry has the following format:
    path name 32 bytes
    mode 2 bytes
    uid 1 byte
    gid 1 byte
    unused 1 byte
    size 3 bytes
    time modified 4 bytes
    tape address 2 bytes
    unused 16 bytes
    check sum 2 bytes
These are in essentially the same order as the "struct file_header" for tar shown in the linked-to essay.

But you can see at https://github.com/dspinellis/unix-history-repo/blob/Researc... that mtime is an int[2], so 4 bytes (yes, 16-bit integers; confirm with https://github.com/dspinellis/unix-history-repo/blob/Researc... showing i_mode is an integer).

Which means this older "tap" DECTape support uses 'plain binary numbers ("base 256")', not octal.

The effect of restricting writes to only a directory and its descendants seems like the perfect use-case for chroot(), except that comes with some other limitations too.

I started out computing in the PC world with the DOS family (then moving on to Windows) before working with any of the Unixes, so my perspective may be different from those who started with Unix/Linux, but in my experience it seems links (both soft and especially hard ones when applied to directories) cause a rather large number of problems in comparison to their usefulness; they make the filesystem a graph instead of a tree, adding complexity to all applications that need to traverse it.

You're not wrong about symlinks being problematic; there's an LWN post about it from yesterday: https://lwn.net/Articles/902247/

Luckily, most systems don't support directory hard links. (IIRC, macOS does though, through some extremely hacky mechanism.)

I think you meant to link https://lwn.net/Articles/899543/ (and HN discussion at https://news.ycombinator.com/item?id=32190032 ) instead of that single comment. It takes the security angle (which is frankly oversold these days, but that's another rant...), but also makes some very good points about the complexity they introduce, which does, among other things, create security issues. FWIW, I don't think hardlinks are a good idea either. (In FAT filesystems, it is possible to have "hardlinks" where the same data is referred to by multiple names, but those are actually errors and referred to as "crosslinked files".)

Without links, there's a 1-to-1 mapping between (absolute) paths and file contents. When operating on a path, there's no need to consider whether the API will act on the link or what it points to.

APFS does not support directory hardlinks, although HFS+ did.

HN crowd might have some interesting takes on this: what is your preferred archive (and/or compression) format, and why?

I’ve been using .tar.xz for archiving, but haven’t looked into what the “best” option really is.

If I will need to access individual files or it needs to be usable by others on arbitrary operating systems, zip is best. Random access is greatly helped by the directory at the end of a zip file. .tar.*z is particularly hostile to random access.
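
To illustrate the random-access point, here is a small sketch using Python's standard zipfile module. The archive is built in memory and the member names are made up; the point is only that a single member can be read through the directory at the end of the file, without walking the earlier members.

```python
import io
import zipfile

# Build a small archive in memory with three members.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for name in ("a.txt", "b.txt", "c.txt"):
        zf.writestr(name, f"contents of {name}\n")

# Opening the archive reads the central directory at the end of the file;
# each entry records where its member's local header starts.
with zipfile.ZipFile(buf) as zf:
    info = zf.getinfo("c.txt")
    offset = info.header_offset  # position of c.txt's local header
    data = zf.read("c.txt")      # seeks there and decompresses only c.txt
```

A compressed tarball has no equivalent lookup: reaching the last member means decompressing everything before it.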

If I just need an archive for easy schlepping around, I use tar if the content is not likely compressible or tar.gz if it is compressible. The slowness of better compression or the likelihood that I will struggle with missing utilities tends to make me shy away from other compressors.

I would only highly optimize for size and suffer the slow compression if it will be downloaded a lot of times and that makes a difference for user experience or bandwidth costs.

For what it's worth, the guy who maintains GNU ed and wrote lzip says you should not do that[0].

0. https://www.nongnu.org/lzip/xz_inadequate.html

I remember some discussion about that a while back on HN. The TL;DR is a) he does say that and b) he would say that, since lzip is effectively a competitor to xz.

That said, I've been using pixz (which is compatible with xz but can do parallel decompression and a few other things) for many years on many dozens of terabytes of compressed data and have never had any problems.

.tar.zst - I have found Zstd to be enormously faster than gzip, bzip2, or xz. Its compression ratio is good and it's available in most Linux distros.

All the traditional UNIX/POSIX archive formats are inadequate for archiving modern file systems, because they lose a part of the file metadata, e.g. they may lose the extended attributes or access-control lists, or they may truncate the time stamps.

Most archivers have various extensions for the tar or pax file formats, to deal with modern metadata, but the format extensions may differ between various "tar" or "pax" implementations and not all such extensions actually succeed in preserving all metadata.

One archiver available on Linux/*BSD systems, for which I have checked that it does the job right and it is able to archive files from a filesystem like XFS, without losing metadata like most other tar/pax/cpio programs, is the "bsdtar" program from the "libarchive" package, when used as a "pax" program, i.e. when invoked with the options "bsdtar --create --format=pax".
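
The timestamp truncation is easy to see in a small experiment. This sketch uses Python's standard tarfile module rather than bsdtar, so it says nothing about bsdtar's handling of extended attributes or ACLs; it only compares how the ustar and pax formats round-trip a sub-second mtime. The helper name `roundtrip_mtime` is made up for illustration.

```python
import io
import tarfile


def roundtrip_mtime(fmt: int, mtime: float) -> float:
    """Write one empty member with the given mtime and read the mtime back."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
        info = tarfile.TarInfo(name="example.txt")
        info.mtime = mtime
        tar.addfile(info)
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r") as tar:
        return tar.getmember("example.txt").mtime


original = 1234567890.5
# ustar has only an octal whole-seconds field for mtime.
ustar_mtime = roundtrip_mtime(tarfile.USTAR_FORMAT, original)
# pax can carry the fractional value in an extended header record.
pax_mtime = roundtrip_mtime(tarfile.PAX_FORMAT, original)
```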

For many years I have used this archiver exclusively, to avoid losing information when making archive files. It is also very versatile, allowing archiving to be combined with many finely configurable compression or encryption algorithms.

Some years ago, the most widely available tar program, the GNU tar, was not able to store XFS or UFS files without data loss.

It is possible that the GNU tar has been improved meanwhile and now it can do its job right, but I had no reason to go back to it, so I have never checked it again for changes.

Of course, Windows file formats, e.g. zip or 7z, are even less appropriate for archiving UNIX/POSIX file systems than the traditional tar/cpio/pax.

Maybe a bit unorthodox, but I've been using SquashFS with xz for compression for long-term archival (I generally prefer zstd, but for long-term archival I don't mind waiting longer for better compression with xz).

SquashFS files have file-based deduplication, fast random access, and mountability, all of which are lacking from .tar.* archives without resorting to other indexing tools. And they're mountable on any Linux without installing anything.

Only downside is they're readonly (or more accurately append-only), but for my uses, that's totally fine.

I use windows, so .zip. Maybe on rare occasion .7z if I really want better compression.

.tar.xz files are a gross pain on windows and I’m always annoyed when people use them.

Why are they a pain?

How are tarballs any more of a pain than .7z?

I'll have a guess: because 7-zip gui doesn't decompress tar.gz files in a single step (ie. first you extract the tar, then you extract it again)

IIRC 7zip actually does it in one step. But many tools such as WinZip do not. I forget the exact behavior of built-in, WinZip, and WinRar.

For personal stuff I haven't compressed an archive in a while. I just copy the containing folder to an external drive and a cloud storage if it's really important. Everything I really really want archived is less than 1TB already. That fits on an external drive and cheap cloud storage subscription.

If I need to send something I'll use zip for personal stuff or tar.gz at work because I know everyone is using some kind of Linux and it's the only terminal zip command I have memorized.

Just to think out of the box: How about not creating archives, just leave the files as they are in a directory. For transferring, use a client that preserves directory structure.

The space of such clients and their features or popularity is limited though. Rsync has millions of flags that add complexity. Then you have BitTorrent. From what I know it doesn't compress. Others: git, nfs, ftp.

Usually you want some kind of compression though.

That's actually the reason why I used tar instead of rsync; I only had around 750 GiB of space. The 519GiB tar.gz would fit, the 1.1TiB directory structure wouldn't.

I have always been partial to .uha, the files produced by Uwe Herklotz' UHARC archive utility. I mostly ran into them on pirated game rips downloaded from DALnet IRC in the 90s and early aughts. It's extremely slow, however.

If I was going to be packing lots of data I'd probably use mopaq (.mpq) which has support for LZMA and Huffman coding.

I use tar.lz (Lzip) for stuff that only I will care about.

For sending stuff to other people I just use zip, because it is the lowest common denominator, but I make sure to always use info-zip so that I don't use any weird proprietary extensions, and get proper zip64 support.

> I decided that learning the tar file format and making my own tar extractor would probably be faster than waiting for tar.

If you want something done right, you gotta do it yourself!

“I'm absolutely certain that it's possible to make GNU tar extract in O(n) without --absolute-paths by replacing the linked list with a hash map. But that's an adventure for another time.” (from the article). Maybe someone could put in a PR.

Pretty sure there's tons of C libraries that do not use hash maps because it requires implementing a hash map.

The fun thing is, GNU tar already has a hash map implementation, and it uses that hash map to make sense of hard links in the archive creation code. It's just not used in the extractor for whatever reason.

Probably because whoever wrote the extractor didn't want to learn how to use some custom hashmap written by the guy who wrote the compressor. Or maybe didn't even know it existed.

It uses the hash table from gnulib: https://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=l... -- so that's not the reason, gnulib's hash table should be pretty general and robust.

I have also written a tar format extractor, which is still in use for the backup system at Fastmail:


It's capable of reading and writing tar files, and the big benefit over extracting is that you can repack a tar by (as the name says) streaming it, with a decision callback which allows you to discard, keep, or "edit" a file - where edit does unpack to a temporary file at which point you can arbitrarily change the file or the header info and then keep or discard once you're done.
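
The general shape of such a streaming repack can be sketched with Python's standard tarfile module. This is not the Fastmail tool, just an illustration of the idea; the names `repack` and `drop_logs` are made up, and the "edit" step here only rewrites header fields rather than unpacking to a temporary file.

```python
import io
import tarfile


def repack(src, dst, decide):
    """Stream members from src to dst, keeping only those decide() accepts.

    decide(member) returns a (possibly modified) TarInfo to keep,
    or None to discard the member.
    """
    with tarfile.open(fileobj=src, mode="r|") as tin, \
         tarfile.open(fileobj=dst, mode="w|") as tout:
        for member in tin:
            kept = decide(member)
            if kept is None:
                continue  # skipped data is read past on the next iteration
            data = tin.extractfile(member) if member.isreg() else None
            tout.addfile(kept, data)


def drop_logs(member):
    """Example callback: discard *.log, move everything else under kept/."""
    if member.name.endswith(".log"):
        return None
    member.name = "kept/" + member.name
    return member


# Build a two-member source archive in memory.
src = io.BytesIO()
with tarfile.open(fileobj=src, mode="w") as tar:
    for name, payload in (("a.txt", b"hello"), ("b.log", b"noise")):
        info = tarfile.TarInfo(name=name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
src.seek(0)

dst = io.BytesIO()
repack(src, dst, drop_logs)

dst.seek(0)
with tarfile.open(fileobj=dst, mode="r") as tar:
    names = tar.getnames()
    content = tar.extractfile("kept/a.txt").read()
```

Because both ends are opened in stream mode (`r|` and `w|`), neither side needs to seek, so the same function should work on pipes or sockets.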

I wonder if Jörg Schilling's (RIP) star* handles this better. I recall he was pretty scathing about GNU tar back when I used Solaris.

* http://cdrtools.sourceforge.net/private/star.html

> Jörg Schilling's (RIP)

As someone who burned a lot of CDs on Linux in the 90s, you bet I recognize that name, and I hadn't heard this news.


There were some interesting personal anecdotes shared here: https://news.ycombinator.com/item?id=28827388

I didn't mention it there, but I was actually responsible for some pax format related work many years back and reached out to him with a question pertaining to his star implementation.

He had strong feelings about his star being the star and an unrelated star archiving tool needing to be renamed. Something I was in no position to do or advocate for. I no longer remember if we actually discussed the technical matter or not at any point.

I was wondering if mention of him would pop up in the comments since he wrote a tar (and possibly pax, I no longer remember) implementation.

> I decided that learning the tar file format and making my own tar extractor would probably be faster than waiting for tar. And I was right; before the day was over, I had a working tar extractor, and I had successfully extracted my 1.1TiB tarball.

That's awesome!

I once did the same thing (something else, not a tar). I was on discord at the time and typed in messages about the stupid stuff I noticed while writing my own code. A guy told me I'm an asshole and a shit programmer/personality because I wasn't willing to stick with/figure out standard tools + blocked me. Which is weird, because why wouldn't you want more tools available? I gave away my source at the end.

Don't be afraid of writing your own stuff!

I wonder if the non-portable linux openat2 with RESOLVE_BENEATH could help with this problem. The documentation suggests that it should prevent escaping the intended location via ".." components literally or in symlinks.

Author of openat2(2) here, yes it would (as well as protecting against races where the target directory is having path components changed to symlinks during extraction). RESOLVE_IN_ROOT would be more akin to extracting in a chroot(2). But you would need to be careful how you deal with path operations in the rest of the program in a way that is fairly easy to mess up.

I've been working on a userspace library[1] which would make writing userspace programs that interact with these kinds of dangerous paths more safe (though it's been on the backburner recently).

[1]: https://github.com/openSUSE/libpathrs

"In any case, the result is that a tar extractor which wants to support both pax tar files and GNU tar files needs to support 5 different ways of reading the file path, 5 different ways of reading the link path, and 3 different ways of reading the file size."


I've never had a big enough n to notice this.

In general, it is preferable now to use squashfs for most tar use cases these days. Random access, high compression levels, and completely operable with userland tools and a FUSE file system. tar does have some niche uses (streaming) but if you don’t need streaming, you can get massively faster compression into a format much more ergonomically pleasant.

tar totally burned me very very recently. I wrote a Python script to create a tar file from a bunch of different ML model artifacts. The tar file untarred fine to my computer, but I could not open it with an existing Go program. The Go program would try to create files and folders if they weren’t available, but it seemed the way I had made the tar file, it didn’t “create an empty directory” first before adding the files in that directory.

When I untarred it in the terminal with xvf I would see the files in the directories inside the tarball being created. But when I untarred a previous working version of this tarball, it would create the directory before untarring it.

Googling all of this is impossible by the way. I couldn’t figure out how to get Python to create a “blank directory” first inside the tarball so eventually I gave up and made my script take in a path to an already existing folder instead of individual files.
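
For anyone who hits the same thing: one way to get an explicit directory entry into the tarball with Python's standard tarfile module is to add a TarInfo of type DIRTYPE before the files inside it. The names `models` and `models/model.bin` are placeholders, and whether this would have satisfied the Go program in question is an assumption.

```python
import io
import tarfile

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    # Explicit directory entry, written before anything inside it.
    dir_info = tarfile.TarInfo(name="models")
    dir_info.type = tarfile.DIRTYPE
    dir_info.mode = 0o755
    tar.addfile(dir_info)

    # A regular file inside that directory.
    payload = b"weights"
    file_info = tarfile.TarInfo(name="models/model.bin")
    file_info.size = len(payload)
    tar.addfile(file_info, io.BytesIO(payload))

# Read the archive back and list what it contains, in order.
buf.seek(0)
with tarfile.open(fileobj=buf) as tar:
    members = [(m.name, m.isdir()) for m in tar.getmembers()]
```

When the files already exist on disk, calling `tar.add()` on the directory itself also writes the directory entry and then recurses into it.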

I have to say that using an archive format with this feature set for source distribution or really anything but backups makes me uneasy. When I unzip some file from the internet I don't want it putting files owned by some random uid in ../../.

Like what if dogpictures.tar.gz contains a ../.bashrc. Untar from $USER/Downloads and whoops..

wait, is that a thing with tar???

EDIT: nevermind, I read the actual article. This only happens if you explicitly tell it to with the -P flag.

You'll be happy to learn that Python's builtin tarfile module extracts and overwrites `../parent/paths` and `/absolute/paths` at those locations. Never use Python.
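
If you do have to use tarfile on untrusted input, a pre-flight check along these lines can flag the offending names before anything is extracted. This is a minimal sketch: `escaping_members` is a made-up helper, it only looks at member names, and it does not protect against symlink or hard link members that point outside the destination. Recent Python versions also accept a `filter="data"` argument to `extractall()`, which is probably the better option where available.

```python
import io
import os
import tarfile


def escaping_members(tar: tarfile.TarFile, dest: str) -> list:
    """Return member names that would land outside dest if extracted by name."""
    root = os.path.realpath(dest)
    bad = []
    for member in tar.getmembers():
        # os.path.join discards root when member.name is absolute.
        target = os.path.realpath(os.path.join(root, member.name))
        if os.path.commonpath([root, target]) != root:
            bad.append(member.name)
    return bad


# Build an archive with one harmless name and two hostile ones.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name in ("ok/file.txt", "../parent/paths", "/absolute/paths"):
        tar.addfile(tarfile.TarInfo(name=name))

buf.seek(0)
with tarfile.open(fileobj=buf) as tar:
    flagged = escaping_members(tar, "/tmp/extract-here")
```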

Ah, you are right. I can't believe my eyes skipped over that.

If you like the tar format then you may be interested in reading https://blog.habets.se/2019/11/CVE-2019-14866-gnu-cpio.html

This is not a problem with the file format, it is a problem with the program that is interpreting it.

bsdtar (libarchive) - I wonder if it avoids these issues?

TL;DR, IIUIC: The hardlink creation code in tar checks whether a hardlink has already been created by going over a list of previously created hardlink placeholders, and it does that every time it needs to create a hardlink.
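
A toy model of the difference, in Python. This is not GNU tar's code; the `entries` format and both function names are invented for illustration. The first version scans a list for every hard link, so n links over n files cost O(n^2) comparisons; the second keeps the same information in a dict and does one lookup per link.

```python
def resolve_with_list(entries):
    """Quadratic: every hard link scans all files seen so far."""
    seen = []  # (path, data) pairs, standing in for the linked list
    out = {}
    for kind, path, value in entries:
        if kind == "file":
            seen.append((path, value))
            out[path] = value
        else:  # kind == "link": value is the path being linked to
            out[path] = next((d for p, d in seen if p == value), None)
    return out


def resolve_with_dict(entries):
    """Linear: the same bookkeeping, but keyed by path in a hash map."""
    by_path = {}
    out = {}
    for kind, path, value in entries:
        if kind == "file":
            by_path[path] = value
            out[path] = value
        else:
            out[path] = by_path.get(value)
    return out


entries = [
    ("file", "a", b"1"),
    ("link", "b", "a"),
    ("file", "c", b"2"),
    ("link", "d", "c"),
    ("link", "e", "missing"),
]
list_result = resolve_with_list(entries)
dict_result = resolve_with_dict(entries)
```

Both functions return the same mapping; only the cost per link differs.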

I wonder if there could be better design. Maybe on packing side. Why wouldn't you order files and links in such way that single pass is enough.

Better design is not to use Tar
