The tar archive format, and why GNU tar extracts in quadratic time

Twirrim · on July 23, 2022

> Tar is pretty unusual for an archive file format. There's no archive header, no index of files to facilitate seeking, no magic bytes to help file and its ilk detect whether a file is a tar archive, no footer, no archive-wide metadata. The only kind of thing in a tar file is a file object.

tar = tape archive. It was designed around streaming files to a tape drive in a serial fashion. Tape bandwidth has almost always been slow compared to source data devices. You need to start writing to the target device as fast as possible. Taking time to construct a full index just delays starting the slowest stage of things, while providing only relatively minimal benefit.

Generating an index is fairly straightforward: the file headers give you the information you need, including what you need to know to get to the next file header. Your only bottleneck is how fast random reads are. If the file is on disk, it should be relatively trivial.
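
As a rough illustration of the header-hopping described above, here is a minimal Python sketch. It assumes plain ustar-style headers (name in bytes 0-99, octal size in bytes 124-135, everything padded to 512-byte blocks) and ignores the ustar prefix field, pax/GNU long-name headers and the GNU base-256 size extension. The function name `build_index` and the sample file names are made up for the example.

```python
import io
import tarfile

BLOCK = 512  # tar headers and data are laid out in 512-byte blocks


def build_index(f):
    """Map member name -> (data offset, size) by hopping from header to header."""
    index = {}
    offset = 0
    while True:
        f.seek(offset)
        header = f.read(BLOCK)
        # A short read or an all-zero block marks the end of the archive.
        if len(header) < BLOCK or header == b"\0" * BLOCK:
            break
        name = header[0:100].split(b"\0", 1)[0].decode("utf-8")
        # The size field is an octal string terminated by NUL or space.
        size = int(header[124:136].strip(b" \0") or b"0", 8)
        index[name] = (offset + BLOCK, size)
        # File data is padded up to a whole number of blocks.
        offset += BLOCK + (size + BLOCK - 1) // BLOCK * BLOCK
    return index


# Build a small archive in memory to try it on.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    for name, data in [("a.txt", b"hello"), ("b.txt", b"x" * 1000)]:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

index = build_index(buf)

# Random access to one member: seek straight to its data.
offset, size = index["b.txt"]
buf.seek(offset)
content = buf.read(size)
```

Only the 512-byte headers are read while indexing; the file data in between is skipped with seeks, which is why this is cheap on disk and painful on anything without fast random reads.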

It's even possible to do this relatively cheaply with Range headers if you're looking at a tar file stored on an HTTP endpoint, though you'll likely want to think about some smart pre-fetching behaviour.

vetinari · on July 23, 2022

The issue with tapes is not the bandwidth, it is the seek latency, which can be around 2 minutes end to end (with the correct tape already loaded). So tapes are not used as random-access devices, and thus there is no central index on a tape archive. The backup software systems keep their catalogs elsewhere.

Waterluvian · on July 23, 2022

When I was a beginner I once accidentally created an Amazon RDS Postgres database backed by tape. I was so flustered by how slow it was until I saw the setting.

emmjay_ · on July 23, 2022

I'm not sure how this is possible. You can only choose EBS backed storage for RDS Postgres. Do you mean 'magnetic' storage? That's not tape.

oogali · on July 23, 2022

It's not possible.

GP was likely confused by the different types of hard drives (standard, sc1, st1), and assumed it was tape or was incorrectly told by someone else that it was tape.

If I squint hard enough, the only viable explanation I could come up with is GP said "I created the database using the st1 storage type" and someone responded "st!? As in a UNIX SCSI tape drive[0]?"

0: https://man7.org/linux/man-pages/man4/st.4.html

Waterluvian · on July 23, 2022

It could be tape. But perhaps you’re right that it’s HDDs I guess? I assumed most of their normal storage is hard disk and not SSD. They’ve got a lot.

Edit: I checked. You’re 100% right. Well now my anecdote sucks! :) And now, seven years later, I’m back to start on wondering why it was so brutally slow.

emmjay_ · on July 23, 2022

Heh. It's not tape. :)

If you provisioned a really small amount of GP2, you can run out of I/O credits pretty quickly and get throttled to baseline performance (and it used to be 7 times the storage provisioned, or some such, and now it's 100 IOPS). That's brutally slow for a working database.

SoftTalker · on July 23, 2022

When I was in school, the professor of our operating systems class told us about a virtual memory project implementation that had the swapfile on tape. It worked, but it wasn't anything you'd actually ever use.

Waterluvian · on July 23, 2022

When I was 9 or 10 I once set virtual memory on my Macintosh to live in a 44MB SCSI cartridge drive. It was brutally slow.

ufo · on July 23, 2022

Lol! I'm curious how slow it got... How slow are we talking about here?

Waterluvian · on July 23, 2022

A non-cached GET for a single database record was like 7 seconds.

It was especially scary because I was in the “I have no idea what I’m doing, and I’m at a tech startup and am a team of one” phase.

lazide · on July 24, 2022

Still too fast for it to have been tape hah. If it was on tape, it would have taken minutes

Twirrim · on July 24, 2022

Latency is absolutely killer with tapes, but bandwidth is still a major issue. The current generation LTO-9 is only capable of 400MB/s in throughput. That's still slower than a SATA3 drive.

Last time I used tape in earnest was a while back though (2005ish), when LTO-2 ruled the roost at a whopping 40MB/s, with LTO-3 just making an appearance. Around the time I left the company they were building out a disk storage array as a backup target, and shifting tape further and further away from the main backup path. Data production and storage needs across the industry had long been growing at a greater rate than tape capacity and performance.

vetinari · on July 24, 2022

> The current generation LTO-9 is only capable of 400MB/s in throughput. That's still slower than a SATA3 drive.

True, but that's the speed of the SATA3 interface, or of SSDs. A single spinning-rust drive can realistically achieve 150-250 MB/s.

jcrawfordor · on July 24, 2022

I wouldn't view this as being about write speed... keep in mind that the common ZIP implements a very simple optimization to address this exact issue, which is writing the index at the end of the file instead of at the beginning. Actually there were tools that did this when writing to tape as well, but tar wasn't one of them. In fact a lot of common tape archiving solutions would just put the index on its own dedicated tape, which of course makes sense when you consider that a computer operator would probably need to find out which tape a record of interest was on. Similarly it is possible to build an "out-of-band" index for a tar file but I'm not aware of a tool that does so... probably mostly because this is way less practical if the tarball has been compressed (compressing the files in the tarball instead fixes this issue and I've written tools that do it that way before, but it doesn't seem to be common either).

The issue is much more about seek speed. Putting the master header at the end of ZIP files requires initially reading them backwards to locate the beginning of that data structure. This works fine on random access devices but is of course pretty untenable on tape. Even if the tape drive could actually read backwards without seeking to read each successive block in the normal direction (I'm sure there's at least one weird tape drive out there that could but not the common models, although if you read very small blocks you could get pretty good performance doing this as long as you kept the seeks within the buffer columns), the seek time to get to the end of the tape and then back to the beginning could be minutes.
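
To make the "read it backwards" step concrete, here is a small Python sketch that locates ZIP's end-of-central-directory record by searching from the end of the file. It is an illustration only: it ignores Zip64 and multi-disk archives, and the helper name `find_central_directory` is made up for this example.

```python
import io
import struct
import zipfile

EOCD_SIG = b"PK\x05\x06"  # end-of-central-directory signature


def find_central_directory(data):
    """Return (entry count, directory size, directory offset) from the EOCD."""
    # The EOCD sits at the very end of the file, followed only by an
    # optional comment, so a reader has to search backwards for it.
    pos = data.rfind(EOCD_SIG)
    if pos < 0:
        raise ValueError("no end-of-central-directory record found")
    # Fixed 22-byte layout: signature, two disk numbers, two entry counts,
    # directory size, directory offset, comment length.
    (_sig, _disk, _cd_disk, _disk_entries, total_entries,
     cd_size, cd_offset, _comment_len) = struct.unpack_from("<4s4H2IH", data, pos)
    return total_entries, cd_size, cd_offset


# Build a small archive in memory to try it on.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    for name in ("a.txt", "b.txt", "c.txt"):
        zf.writestr(name, name * 10)

data = buf.getvalue()
entries, cd_size, cd_offset = find_central_directory(data)
first_sig = data[cd_offset:cd_offset + 4]  # signature of the first directory entry
```

On a disk this costs one seek to the end and one seek back to `cd_offset`; on tape those are the two long seeks the comment is talking about.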

bluedino · on July 25, 2022

Backup programs (at least as early as the 90's) would get around this by creating index or catalog files for the backup tape, and keeping those locally. You could re-generate the catalog file by re-indexing the backup tape, but of course that took a considerable amount of time. If you were restoring to a new system and no longer had the catalog files, for example.

Another benefit was that you could quickly browse the contents of your backups without having to read or even insert the tape.

iforgotpassword · on July 24, 2022

It's only really seekable if you don't compress it further. That's because a tar file gets compressed after the fact. Even with compression formats that support starting decompression somewhere in the middle, you now either need an additional file index that tells you which files there are and where in the compressed stream to start reading, or you have to try and scan the whole archive as you would with an uncompressed tar by seeking along the headers and hope the files are large enough that you can actually skip large enough chunks to matter.

It is indeed really odd that the *nix world never came up with their own, agreed upon, indexed or at least efficiently seekable archive format. I think if one of the extensions had added per-file compression, that would have had the highest chance of success.

wglb · on July 24, 2022

> Tape bandwidth has almost always been slow compared to source data devices

This is not the case.

IshKebab · on July 24, 2022

> Your only bottleneck is how fast random reads are. If the file is on disk, it should be relatively trivial.

Ah yes, how fast is a random read to the middle of a 1TB file... that has been compressed as a single gzip stream...

forrestthewoods · on July 23, 2022

[flagged]

mort96 · on July 23, 2022

You don't loathe tar files due to a deficiency in the tar format. You loathe tar files because your operating system vendor doesn't build in support for one of the most widely used archive formats in the world. Complain to Microsoft, not to people who use tar.

Tar might've been designed for use with tape, but no part of its design makes the HDD/SSD use-case worse just to work better with tape.

thayne · on July 24, 2022

I exclusively use linux, and I wish there was something better than tar that was more widely used. In particular, that didn't have the problem of several subtly incompatible sets of extensions, and that had better support of using an index, especially in tandem with compression.

But that "something better" isn't zip. It's probably something that doesn't currently exist.

jakogut · on July 24, 2022

What about 7z?

https://en.m.wikipedia.org/wiki/7z

thayne · on July 24, 2022

7z doesn't support unix file permissions, which means it isn't suitable as a tar replacement.

usefulcat · on July 24, 2022

I believe pixz adds a file index when compressing tar files. It’s definitely much, much faster at extracting a single file from a large archive compared to a tar file compressed with gzip or zstd.

ciupicri · on July 24, 2022

For what it's worth there is the SQLite Archive, but it's definitely not popular and there is no support for metadata like ownership, permissions or extended attributes (ACLs, SELinux). Also compression is missing on my Fedora system.

https://www.sqlite.org/sqlar.html / https://news.ycombinator.com/item?id=28668615

thayne · on July 24, 2022

> there is no support for metadata like ownership, permissions or extended attributes

That makes it unsuitable as a tar replacement for many use cases.

guipsp · on July 24, 2022

Well, the lack of index for one

spicybright · on July 24, 2022

It seems fairly trivial to scan the whole file and build an in memory index to speed things up, and I'm sure most viewers do. I can't say I've ever had significant speed issues with most archives to see this as a problem.

jcrawfordor · on July 24, 2022

It's actually pretty terrible in practice. Keep in mind that most common tar files have also been run through a stream compression algorithm, which means scanning the archive for file headers requires decompressing the entire thing. Some software does index tar files automatically (and of course the typical CLI tools will if you want) but for a large archive, say 20gb, I have had to stare at a progress bar for 30s+ while the tool indexed the archive. Extracting a single file required another 30s+ wait while it scanned the archive again to get back to that file, since with gzip compression it's not trivial to store enough state to be able to decompress at a point without having to decompress everything up to there.

usefulcat · on July 24, 2022

I believe pixz was created to solve the exact problem you have just described. For xz/lzma compression at least.

LtWorf · on July 24, 2022

rar files are the same if you want to extract them.

nrclark · on July 24, 2022

I see tar files used with compression pretty often, usually gzip or xz. That can mess up the ability to seek around in the tar archive and build a good index.

I use tar.gz as my go-to archive format, don't get me wrong. It's easy, well-supported, handles POSIX concepts well, and is pretty universal. But seekability isn't its strong suit.

lazide · on July 24, 2022

Good luck doing that with a multi-terabyte compressed tar file!

forrestthewoods · on July 24, 2022

What makes tar files better than zip files? Why should anyone prefer .tar.xz to an even more commonly used format that has even broader support?

NobodyNada · on July 24, 2022

There’s a very good reason to prefer .tar.gz (or xz or whatever) to .zip: tar.gz files deliver better compression (ranging from “marginally better” to “significantly better” depending on what you’re compressing).

In a .zip, the files are each compressed individually using DEFLATE and then concatenated to create an archive, whereas in a .tar.gz the files are first concatenated into a .tar archive and then DEFLATEd all together.

Because of this, a .tar.gz often achieves much better compression on archives containing many small files, because the compression algorithm can eliminate redundancies across files. The downside is that you can’t decompress an individual file without decompressing every preceding file in the stream, because DEFLATE does not support random access. (And so tar’s lack of index is an advantage here; an index would not be useful if you can’t seek.)
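
A quick way to see the effect is to pack the same set of small, similar files both ways and compare sizes. The sketch below uses Python's standard `zipfile` and `tarfile` modules; the 200 file names and their contents are invented for the illustration, and the exact sizes will vary with the data.

```python
import io
import tarfile
import zipfile

# 200 small, similar "source files" (made-up content for illustration).
files = {
    f"src/module_{i:03d}.py": (
        f"def handler_{i}(request):\n"
        f"    return render(request, 'page_{i}.html')\n"
    ).encode()
    for i in range(200)
}

# zip: every member is DEFLATEd on its own.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for name, data in files.items():
        zf.writestr(name, data)

# tar.gz: members are concatenated first, then DEFLATEd as one stream.
tgz_buf = io.BytesIO()
with tarfile.open(fileobj=tgz_buf, mode="w:gz") as tar:
    for name, data in files.items():
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

zip_size = len(zip_buf.getvalue())
tgz_size = len(tgz_buf.getvalue())
print(f"zip: {zip_size} bytes, tar.gz: {tgz_size} bytes")
```

With inputs like these the tar.gz should come out well below the zip, even though the uncompressed tar spends at least 1024 bytes on every tiny file.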

This is why e.g. open source software downloads often use .tar.gz. A source code archive has hundreds or thousands of tiny text files with a ton of redundancy between files in variable & function names, keywords, common code snippets/patterns, etc., so tar.gz delivers significantly better compression than zip. And there’s little use for random access of individual files, since all the files need to be extracted in order to compile the program anyway.

The abbreviation “tape archive” may be anachronistic nowadays, but the performance characteristics of a tape drive (namely, fast sequential access but absolutely awful random access) coincide with the performance characteristics of compression algorithms. So an archive format designed for packaging files up to be stored on a tape is perfect for packaging files up to be compressed.

WithinReason · on July 24, 2022

The trade-off is that it takes a long time to extract a single file from a solid archive. The 7z format supports a "solid block size" parameter for this reason (for all supported compression algorithms AFAICT) which can be set to anything from "compress all files individually" to "size of whole archive"

mort96 · on July 24, 2022

I wasn't saying tar is better than zip (though if you want to read about issues with the zip file format, have a look at this post: https://games.greggman.com/game/zip-rant/ -- HN discussion: https://news.ycombinator.com/item?id=27925393). I'm just saying there's nothing about the tar file format itself that makes you loathe it, you just loathe that Windows doesn't support it. That's not a problem with tar.

I don't even know if it's true that tar is less common than zip. I know zip is incredibly common, but truly, so is tar. In the UNIX world, _everything_ happens through tar. And as someone who almost exclusively operates in the UNIX world, I interact with zip files very rarely, while I work with tarballs all the time.

Just don't blame the format for a deficiency in your operating system, that's all I'm saying.

forrestthewoods · on July 24, 2022

To be fair, I almost never come across tar files. Most crossplatform software provides .tar.gz for Linux/macOS and .zip for Windows.

Should windows have native support for tar.gz files? Maybe! Maybe not. I dunno. So when I come across something using that format for windows what it really comes across is half-ass Windows support. Which isn’t the end of the world. But it’s rarely a good sign.

mort96 · on July 24, 2022

Ah, yeah, that's completely fair. If someone is making an archive _for Windows users_, that archive should absolutely be a zip. A tarball definitely sends the signal that Windows users aren't the primary audience. Sometimes that's okay, sometimes it's a sign of a really shoddy port.

account42 · on July 26, 2022

Sort of. I wouldn't make a separate source archive for Windows users: anyone who can compile stuff will manage installing 7-Zip or another archiver that handles .tar, and solid compression wastes less space and bandwidth. For Windows-specific archives .zip is a no-brainer though.

justbaker · on July 24, 2022

7-Zip for windows is always something I go for on a fresh install. Then I also have rar support. But screw rar.

Also on a fresh install I install WSL, so tar is always available that way too.

Aeolun · on July 24, 2022

.tar.gz is a tar file, it’s just gzipped afterwards.

account42 · on July 26, 2022

Technically yes, but if you care about user experience you should treat it as a single compressed archive, not as one archive inside another like e.g. 7-Zip does.

joveian · on July 24, 2022

For unixy systems zip doesn't have sufficient metadata. I suspect that is almost the entire reason zip isn't used more on such systems. It is still used quite a bit when that doesn't matter.

tomsthumb · on July 24, 2022

Why should you couple your archive format to a compression algorithm?

mort96 · on July 24, 2022

Well, there are reasons. If your archive format handles compression, it can be designed in such a way that you can seek and extract only parts of the archive. If the archive format doesn't handle compression, you're dependent on reading through the archive sequentially from start to finish.

That's not to say tar is wrong to not have native compression, it's just one reason why it's not crazy for archive formats to natively support compression.

tomsthumb · on July 24, 2022

I’m semi-sure that this is possible with .tar.gz files already. I’ve used vim to view a text file within a few different rather large archives without noticing the machine choke up on extracting several gigs before showing content. Certainly nothing was written to disk in those cases.

jcrawfordor · on July 24, 2022

.tar.gz files can only be read sequentially, but there are optimizations in place on common tools that make this surprisingly fast as long as there's enough memory available to essentially mmap the decompressed form. The problem is bigger with archives in the tens of GB (actually pretty common for tarballs since it's popular as a lowest-common-denominator backup format) or resource-constrained systems where the swapping becomes untenable.

seba_dos1 · on July 24, 2022

There are extensions for gzip that can make it coarsely seekable, I wouldn't be surprised if some archive tools used that.

koolba · on July 24, 2022

The beauty of tar is using it as a stream of files. That lets you pipe a tar stream from a program (e.g. git) into another program (e.g. docker context) without loading the entire contents into RAM.
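
For illustration, Python's `tarfile` module exposes this directly: opening with mode `"r|"` treats the input as a forward-only stream. In the sketch below the `PipeLike` class is a made-up stand-in for a pipe or socket; it offers `read()` and nothing else, so no seeking is possible.

```python
import io
import tarfile


class PipeLike:
    """Forward-only byte source: it offers read() and nothing else."""

    def __init__(self, data):
        self._src = io.BytesIO(data)

    def read(self, n):
        return self._src.read(n)


# Producer side: build a small archive in memory (a stand-in for e.g. git).
out = io.BytesIO()
with tarfile.open(fileobj=out, mode="w") as tar:
    for name, data in [("a.txt", b"hello"), ("b.txt", b"x" * 1000)]:
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

# Consumer side: "r|" tells tarfile the input is a non-seekable stream.
seen = []
with tarfile.open(fileobj=PipeLike(out.getvalue()), mode="r|") as tar:
    for member in tar:
        # In stream mode each member has to be consumed before moving on.
        payload = tar.extractfile(member).read()
        seen.append((member.name, len(payload)))
```

The consumer handles one member at a time, so memory use is bounded by the largest single file it chooses to read in one go, not by the size of the archive.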

seized · on July 24, 2022

Install 7Zip. It is $CurrentYear after all.

ekianjo · on July 24, 2022

> Just use a zip like a normal person.

zip files are an ancient format, not without flaws too.

kentonv · on July 24, 2022

Punchline:

> My tarball happened to contain over 800 000 links with ".." as a path component. It also happened to contain over 5.4 million hard links.

The behavior is more precisely described not as quadratic, but O(n*m), where n is the number of symlinks pointing outside of the extraction directory, and m is the number of hardlinks. It happens because of the security precautions gnu tar performs to make sure a malicious archive can't set up a symlink pointing to an arbitrary place in your filesystem, and then extract files to it. The strategy they use has an unoptimized corner case when the archive also has a lot of hardlinks.

The mechanism could probably be improved by remembering the list of deferred symlinks in a hash map rather than a linked list, but also, both symlinks pointing outside the extraction directory, and hardlinks, are uncommon.

(Edited to correct: The problem exists even if the hardlinks do not actually point at the symlinks.)
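
The cost pattern can be sketched in a few lines of Python. This is not GNU tar's code, just a model of "scan a list of n delayed symlinks once per hard link" versus "look each hard link up in a hash-based set"; the names and the scaled-down counts are made up for the example.

```python
def probes_with_list(delayed, hardlink_targets):
    """For every hard link, walk the list of delayed symlinks."""
    probes = 0
    for target in hardlink_targets:
        for name in delayed:  # O(n) scan per hard link
            probes += 1
            if name == target:
                break
    return probes


def probes_with_set(delayed, hardlink_targets):
    """Same check, but with the delayed symlinks in a hash-based set."""
    lookup = set(delayed)  # built once
    probes = 0
    for target in hardlink_targets:
        probes += 1  # one O(1) average-case lookup per hard link
        _ = target in lookup
    return probes


# Scaled-down stand-ins for the 800 000 symlinks and 5.4 million hard links.
delayed = [f"dir/link_{i}" for i in range(200)]
hardlinks = [f"dir/file_{i}" for i in range(1000)]

list_probes = probes_with_list(delayed, hardlinks)
set_probes = probes_with_set(delayed, hardlinks)
```

When no hard link matches a delayed symlink, the list version does n*m comparisons and the set version does m lookups, which is the gap the comment describes.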

mort96 · on July 24, 2022

To clarify, the problem doesn't only exist with symlinks pointing outside of the extraction directory. The problem exists if the symlink's target path is a relative directory which contains "..". That means a symlink at "somedir/subdir1/foo" which points to "../bar" will also use the delayed link mechanism, even though the symlink doesn't point to outside of the extraction directory.

Relative symlinks which contain ".." are incredibly common in some contexts.
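
The distinction can be shown with a small Python sketch. It encodes only the rule as described in this comment (a target with a ".." component counts, whether or not it escapes); it is not taken from GNU tar's source, and both helper names are made up.

```python
import posixpath


def has_dotdot(target):
    """True if any path component of the symlink target is '..'."""
    return ".." in target.split("/")


def escapes_root(link_path, target):
    """True if the target, resolved lexically, lands above the extraction root."""
    resolved = posixpath.normpath(
        posixpath.join(posixpath.dirname(link_path), target)
    )
    return (
        posixpath.isabs(resolved)
        or resolved == ".."
        or resolved.startswith("../")
    )


inside = ("somedir/subdir1/foo", "../bar")   # the example from the comment
outside = ("foo", "../../etc/passwd")        # hypothetical escaping link

inside_flags = (has_dotdot(inside[1]), escapes_root(*inside))
outside_flags = (has_dotdot(outside[1]), escapes_root(*outside))
```

Note that `normpath` is purely lexical. It cannot account for other symlinks along the path, which is part of why a cautious extractor may prefer the simple ".." test over trying to resolve the target.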

romwell · on July 24, 2022

Yup, and that's worst-case quadratic in number of links (symlinks and hardlinks).

It's still weird they didn't do a hashmap.

account42 · on July 26, 2022

It's C and they probably never thought of anyone having a file with that many links.

saagarjha · on July 23, 2022

> You may think that the numeric values (file_mode, file_size, file_mtime, ...) would be encoded in base 10, or maybe in hex, or using plain binary numbers ("base 256"). But no, they're actually encoded as octal strings (with a NUL terminator, or sometimes a space terminator). Tar is the only file format I know of which uses base 8 to encode numbers.

Tar’s “competitor”, cpio, does this as well (at least in one of the popular implementations). The Xcode XIP file, if you’re familiar with that particular format, is a couple layers of wrapping on top of a cpio archive, so deep inside my tool to expand these there’s a spot where I read these all out in a similar fashion: https://github.com/saagarjha/unxip/blob/5dcfe2c0f9578bc43070...

More generally, though, neither tar nor cpio has an index because they don’t need one; they’re meant to be decompressed in a streaming fashion where this information isn’t necessary. It’s kind of inconvenient if you want to extract a single file, but for large archives you really want to compress files together and maintaining an index for this is both expensive and nontrivial, so I guess these formats are just “good enough”.
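
To see the octal strings from the quoted passage in an actual header, here is a short Python sketch. The field offsets (mode at 100, size at 124, mtime at 136) are the standard ustar ones; the helper name `parse_octal` and the sample values are made up.

```python
import io
import tarfile


def parse_octal(field):
    """Decode a tar numeric field: octal digits ended by NUL or space."""
    return int(field.strip(b" \0") or b"0", 8)


# Write one member with known metadata, then look at the raw header bytes.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
    info = tarfile.TarInfo("hello.txt")
    info.size = 5
    info.mode = 0o644
    info.mtime = 1658534400
    tar.addfile(info, io.BytesIO(b"hello"))

header = buf.getvalue()[:512]
raw_mode = header[100:108]            # the bytes as stored on disk
mode = parse_octal(header[100:108])
size = parse_octal(header[124:136])
mtime = parse_octal(header[136:148])
```

Here `raw_mode` is the ASCII text `0000644` followed by a NUL, not a binary integer.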

masklinn · on July 24, 2022

Git also uses octal to encode entry modes in tree objects. That’s despite tree objects not being ascii-readable (object-ids are stored in binary form rather than the hex used in most other locations).

TBF the only common numeric format I’ve yet to see is base64, git uses octal, decimal, hex, binary (BE), at least two completely different VLEs, plus a bitmap scheme.
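
For reference, a tree entry is the mode as ASCII octal digits, a space, the name, a NUL, and then the raw object ID. The Python sketch below parses a hand-built tree body; it assumes 20-byte SHA-1 object IDs, and the IDs used are dummy bytes rather than real objects.

```python
def parse_tree(body):
    """Parse the body of a git tree object into (mode, name, hex oid) tuples."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = int(body[pos:space], 8)           # ASCII octal, e.g. b"100644"
        name = body[space + 1:nul].decode("utf-8")
        oid = body[nul + 1:nul + 21].hex()       # 20 raw bytes, not hex text
        entries.append((mode, name, oid))
        pos = nul + 21
    return entries


# Dummy object IDs, only for illustration.
oid_a = bytes(range(20))
oid_b = bytes(range(20, 40))
tree_body = b"100644 README\0" + oid_a + b"40000 src\0" + oid_b

entries = parse_tree(tree_body)
```

The mode is text and the ID is binary inside the same record, which is the mix the comment points out.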

Aeolun · on July 24, 2022

I always used to think there was something magic about these formats (like, only really smart people would design a new format, and my custom csv was somehow ‘less’ than other formats), but now I’m older and I find that it’s just a bunch of files concatenated together to make a new one.

The magic is lost, but I feel a lot better about myself :)

jjice · on July 24, 2022

That's the beauty! It's just a bunch of files concated and when you throw compression algorithm X over it, it works pretty well. Super neat IMO.

IshKebab · on July 24, 2022

Well it's neat for a 10 minute hack on an internal project. Not so great that it's become the standard way of distributing files in Linux.

Although in this case I'd say it's the fault of symlinks daring to exist and spew their flaws over everyone.

Plus maybe slightly C's fault for not coming with a data structure library so everyone uses linked lists for everything. Presumably this wouldn't have happened in any other language (e.g. C++) because they would have used a set or hashmap or whatever.

A stone age language dealing with a misguided filesystem feature for a brain-dead archive format. Surprising that it works at all really!

jjice · on July 25, 2022

Considering there are other file formats that are used and occupy the same space as tar balls, the fact that tar balls are common for quite a bit of archiving should represent a win. Other formats (like zip, 7z, rar, etc) have come along and while they're in use and zips specifically are very common, tar balls are still used. Must speak something to them?

IshKebab · on July 25, 2022

I believe they're popular in Linux because they come preinstalled.

Zip also can't store symlinks but I doubt that is very important in most situations.

masklinn · on July 24, 2022

Tar is definitely very simplistic (and pretty crummy). Things get quite a bit more complicated when you get into the compression side of archiving though. And a lot less fun (I hate bitstreams).

vjeux · on July 23, 2022

If you want to see more examples of accidentally quadratic behaviors: https://accidentallyquadratic.tumblr.com/

atomlib · on July 24, 2022

Truly Tumblr was created for disgusting smut.

exikyut · on July 24, 2022

(Most recent post Aug 2019)

mdavidn · on July 23, 2022

One advantage of tar is that, because the format has no built-in support for compression or random access, the entire archive is compressed together. Similarities in adjacent files will improve the compression ratio.

To support random access, the ZIP format must compress each file separately.

xyzzyz · on July 23, 2022

With tar.gz, there is not much of a benefit: the window size is only 32KB. Now with LZMA, the situation is much different.
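
The window limit is easy to demonstrate with Python's `zlib` and `lzma` modules: repeat a block that is larger than 32KB and see which compressor notices. The 100 000-byte block size is an arbitrary choice for the example.

```python
import lzma
import random
import zlib

# An incompressible 100 000-byte block, repeated twice. The second copy is
# redundant, but it starts well outside DEFLATE's 32KB look-back window.
rng = random.Random(0)
block = bytes(rng.getrandbits(8) for _ in range(100_000))
data = block + block

deflated = zlib.compress(data, 9)
xz = lzma.compress(data)

print(len(data), len(deflated), len(xz))
```

DEFLATE cannot refer back far enough to reuse the first copy, so its output stays close to the input size, while LZMA's much larger dictionary lets it encode the second copy as a reference to the first.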

masklinn · on July 24, 2022

That can still be a significant benefit if the files are small, which is not uncommon.

The 512-byte header is a bit of an annoyance there (especially compared to zip’s 30 bytes plus file name) but not that much.

And then, obviously, one of the advantages of tar is you can compress with whatever you want.

IshKebab · on July 24, 2022

Then just compress every 10MB or whatever independently. Probably a good idea for parallelism anyway.

masklinn · on July 24, 2022

What are you talking about?

IshKebab · on July 24, 2022

You two were debating the benefits of `tar.gz` style solid compression vs zip style independent compression of files.

Zip allows random access while solid compression will give slightly better compression for small similar files.

I was pointing out that you can have the best of both worlds by catting all the files together like `tar` but then compressing it in chunks instead of as one big stream. Plus compressing it in chunks allows you to decompress it in parallel.
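
A minimal Python sketch of that idea, assuming fixed-size chunks and a side index of compressed offsets. The 64KB chunk size and the function names are made up for the example; this is not an existing archive format.

```python
import zlib

CHUNK = 64 * 1024  # hypothetical chunk size; the comment suggests ~10MB


def compress_chunks(data, chunk_size=CHUNK):
    """Compress fixed-size chunks independently.

    Returns (blob, index) where index[i] is the (offset, length) of
    compressed chunk i inside blob.
    """
    parts, index, offset = [], [], 0
    for start in range(0, len(data), chunk_size):
        comp = zlib.compress(data[start:start + chunk_size])
        index.append((offset, len(comp)))
        parts.append(comp)
        offset += len(comp)
    return b"".join(parts), index


def read_range(blob, index, start, length, chunk_size=CHUNK):
    """Read `length` (> 0) uncompressed bytes at `start`, inflating only
    the chunks that overlap the requested range."""
    first = start // chunk_size
    last = (start + length - 1) // chunk_size
    out = bytearray()
    for i in range(first, last + 1):
        off, n = index[i]
        out += zlib.decompress(blob[off:off + n])
    skip = start - first * chunk_size
    return bytes(out[skip:skip + length])


data = bytes(i * 7 % 251 for i in range(300_000))
blob, index = compress_chunks(data)
piece = read_range(blob, index, 100_000, 50_000)
```

Since every chunk is an independent zlib stream, the chunks could also be handed to separate workers for parallel compression or decompression.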

orlp · on July 23, 2022

That's just not true, at least theoretically speaking (I have no idea about ZIP's actual internals). One could do a two-pass algorithm to find common substrings to build a dictionary, store this shared dictionary once, and then compress the files individually referring to this shared dictionary for the common substrings.
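
One existing mechanism along these lines is zlib's preset dictionary, sketched below in Python. To be clear about the assumptions: this is a zlib feature, not something ZIP archives do, and the dictionary here is written by hand instead of being derived by a first pass over the data. The records and helper names are made up.

```python
import zlib

# Hand-written shared dictionary containing the common substrings.
shared_dict = b'{"user": "", "role": "admin", "active": true, "team": "storage"}'

records = [
    f'{{"user": "user{i}", "role": "admin", "active": true, "team": "storage"}}'.encode()
    for i in range(5)
]


def compress_with_dict(data):
    c = zlib.compressobj(level=9, zdict=shared_dict)
    return c.compress(data) + c.flush()


def decompress_with_dict(blob):
    d = zlib.decompressobj(zdict=shared_dict)
    return d.decompress(blob) + d.flush()


plain = [zlib.compress(r, 9) for r in records]
with_dict = [compress_with_dict(r) for r in records]
roundtrip = [decompress_with_dict(b) for b in with_dict]
```

Each record is still compressed on its own, so any one of them can be decompressed without touching the others, as long as the reader has the same dictionary.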

adrianmonk · on July 24, 2022

You definitely could. I did it for a proprietary format once. Each item was quite small (typically 100-ish bytes IIRC), and I needed random access.

One approach would have been to split it into pages of a few kilobytes and separately compress those. But on this platform, I/O was primitive and very slow, and the CPU was also extremely slow. So it was advantageous to decompress only the necessary data.

So I came up with a rudimentary custom compression algorithm. I scanned all entries for common strings and made a shared dictionary. Each entry was then stored as a sequence of indexes into the shared dictionary. Decompressing an entry was dead simple: read the sequence of dictionary indices and copy from the dictionary to the output.

On a side note, the unexpected challenge of this method was how to choose which strings to put in the shared dictionary. I tried a few things and ended up with a heuristic that tries to balance favoring longer strings with favoring frequently occurring strings. I'm sure there are much better ways than that.

If I had it to do over, I probably would have gone with the compressed pages approach simply because the implementation would've been done so much quicker. But I think I did achieve faster decompression and a better compression ratio.

bick_nyers · on July 23, 2022

As long as you have a pointer index in the header for the start/end of every individual file I don't see why that wouldn't work.

lazide · on July 24, 2022

Well, except no zip decompression library I know supports anything like that.

IshKebab · on July 24, 2022

One disadvantage of tar is that because it has no built-in support for compression it has to give up the very important feature of random access for a trivial improvement in compression ratio.

7z can optionally do solid compression anyway. I've no idea why it isn't more popular in the Linux world. I guess the only real reason is that the CLI tools don't come installed by default.

barkingcat · on July 23, 2022

I've been going through the first edition of Art of Intel x86 Assembly and there was a mention that DEC used octal numbers (PDP-11 uses octal) and a cross reference of the tar wiki page indicates that the first tar was written for Version 7 Unix, which was made for the PDP-11.

Gnu tar most likely inherited the octal system in order to retain compatibility with the original tar utility.

mananaysiempre · on July 23, 2022

Fun fact[1]: the x86 instruction encoding (pre Pentium or so) is itself octal in organization. This is immediately apparent in the 2+3+3-bit structure of the ModR/M and SIB bytes, but looking at the opcode table that way is also helpful. The manuals stubbornly describe it in hex, though; you have to go way, way back to the Datapoint 2200 “programmable terminal” to find a manual that actually talks in octal, and that wasn’t even made by Intel! (The 8008 was commissioned from Intel by the original manufacturer as a “2200 on a chip”.)
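
The 2+3+3 split lines up exactly with the three digits of an octal byte, as this small Python sketch shows. The byte value 0xD8 is just an arbitrary example, and `split_modrm` is a made-up helper name.

```python
def split_modrm(byte):
    """Split a ModR/M byte into its 2-, 3- and 3-bit fields (mod, reg, rm)."""
    return byte >> 6, (byte >> 3) & 0o7, byte & 0o7


modrm = 0xD8
fields = split_modrm(modrm)
as_octal = oct(modrm)
print(as_octal, fields)  # prints: 0o330 (3, 3, 0)
```

In hex the fields straddle the two digits; in octal each digit is one field.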

Weird Intel history aside, everybody used octal in the old days. I don’t actually know why, but I suspect this goes back to how early binary computers used 36-bit words (later 18- or 12-bit ones) because that’s how many bits you need to represent 10 decimal digits (and a sign), the standard for (electro)mechanical arithmometers which (among other devices) they were trying to displace. So three-bit groupings made more sense than four-bit ones. Besides, as far as instruction encodings go, an 8-way mux seems like a reasonable size if a 16-way one is too expensive.

(Octal on the 16-bit PDP-11 is probably a holdover from the 12-bit PDP-8? Looking at the encoding table, it does seem to be using three- and six-bit groups a lot.)

[1] https://news.ycombinator.com/item?id=30409889

gumby · on July 24, 2022

> but I suspect this goes back to how early binary computers used 36-bit words (later 18- or 12-bit ones)

Actually the PDP-1 was an 18 bit machine; its successor the PDP-6 had 36 bit words (and 18 bit addresses -- yes it was explicitly designed to be a Lisp machine in 1963). Other DEC machines like the PDP-7 (on which Unix was developed) were 18-bit machines (Multics used an 18-bit non-DEC architecture).

Probably the most popular minicomputer ever, the PDP-8, used 12 bits, as did a bunch of DEC industrial control machines.

eesmith · on July 24, 2022

There's a story about Grace Hopper, who learned octal for the BINAC. She later had problems balancing her checkbook before she realized she was doing so in octal.

That led her to conclude it would be better to teach the computer to handle decimal than to force everyone to use octal.

A lot of the old systems used 6-bit character codes (https://en.wikipedia.org/wiki/Six-bit_character_code ), including BCD six-bit codes, making 6*n word sizes more appropriate.

gumby · on July 24, 2022

My dad used to give me maths problems when I was little but tell me to solve them in specific bases (typically decimal, octal or hex, but not always). Doing long division by hand in hex gives you a feel for how the numbers relate.

The anecdote about decimal makes sense — she was the key to the design of COBOL.

I used quite a few machines with six bit characters into the mid 1980s.

eesmith · on July 23, 2022

While GNU tar inherits this from the original tar, it doesn't seem to be something intrinsic to the PDP-11 or DEC given that the earlier Unix 'tap' archival system uses 'plain binary numbers ("base 256")', not octals.

Here's the V7 tar.c: https://github.com/dspinellis/unix-history-repo/blob/Researc...

and its documentation: https://github.com/dspinellis/unix-history-repo/blob/Researc...

The C code clearly shows the octal (I've never used "%o" myself!):

  sscanf(dblock.dbuf.mode, "%o", &i);
  sp->st_mode = i;
  sscanf(dblock.dbuf.uid, "%o", &i);
  sp->st_uid = i;
  sscanf(dblock.dbuf.gid, "%o", &i);
  sp->st_gid = i;
  sscanf(dblock.dbuf.size, "%lo", &sp->st_size);
  sscanf(dblock.dbuf.mtime, "%lo", &sp->st_mtime);
  sscanf(dblock.dbuf.chksum, "%o", &chksum);

Now, older versions of Unix supported DECTape, going back to the first manual from 1971 (see the "tap" command at http://www.bitsavers.org/pdf/bellLabs/unix/UNIX_ProgrammersM... , page "5_06", which is 94 of 211 - the command-line interface is clearly related to tar).

Here's the tp.5 format description from V6 at https://github.com/dspinellis/unix-history-repo/blob/Researc...

  DEC/mag tape formats
   ...
  Each entry has the following format:
    path name 32 bytes
    mode 2 bytes
    uid 1 byte
    gid 1 byte
    unused 1 byte
    size 3 bytes
    time modified 4 bytes
    tape address 2 bytes
    unused 16 bytes
    check sum 2 bytes

These are in essentially the same order as the "struct file_header" for tar shown in the linked-to essay.

But you can see at https://github.com/dspinellis/unix-history-repo/blob/Researc... that mtime is an int[2], so 4 bytes (yes, 16-bit integers; confirm with https://github.com/dspinellis/unix-history-repo/blob/Researc... showing i_mode is an integer).

Which means this older "tap" DECTape support uses 'plain binary numbers ("base 256")', not octal.
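To make the contrast concrete, here is a sketch of slicing one such 64-byte entry in Python, using the field widths from the listing above. `TP_ENTRY` and `parse_tp_entry` are names I made up. Reading the 2-byte fields as little-endian is an assumption based on the PDP-11 word layout, and the 3-byte size and 4-byte time are left as raw bytes because I have not checked their byte order.

```python
import struct

# Field widths follow the entry listing above. "<" means little-endian with
# no alignment padding; "x" skips the single unused byte.
TP_ENTRY = struct.Struct("<32sHBBx3s4sH16sH")


def parse_tp_entry(raw: bytes) -> dict:
    """Split one directory entry into its binary fields (illustrative only)."""
    name, mode, uid, gid, size, mtime, addr, _unused, checksum = TP_ENTRY.unpack(raw)
    return {
        "name": name.rstrip(b"\0").decode("ascii"),
        "mode": mode,          # a plain 16-bit binary number, no octal text
        "uid": uid,
        "gid": gid,
        "size_raw": size,      # 3 raw bytes, byte order not decoded here
        "mtime_raw": mtime,    # 4 raw bytes, byte order not decoded here
        "tape_address": addr,
        "checksum": checksum,
    }
```

Nothing here needs a text-to-number conversion, which is the difference from the tar header.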

userbinator · on July 23, 2022

The effect of restricting writes to only a directory and its descendants seems like the perfect use-case for chroot(), except that comes with some other limitations too.

I started out computing in the PC world with the DOS family (then moving on to Windows) before working with any of the Unixes, so my perspective may be different from those who started with Unix/Linux, but in my experience it seems links (both soft and especially hard ones when applied to directories) cause a rather large number of problems in comparison to their usefulness; they make the filesystem a graph instead of a tree, adding complexity to all applications that need to traverse it.

mort96 · on July 23, 2022

You're not wrong about symlinks being problematic; there's an LWN post about it from yesterday: https://lwn.net/Articles/902247/

Luckily, most systems don't support directory hard links. (IIRC, macOS does though, through some extremely hacky mechanism.)

userbinator · on July 23, 2022

I think you meant to link https://lwn.net/Articles/899543/ (and HN discussion at https://news.ycombinator.com/item?id=32190032 ) instead of that single comment. It takes the security angle (which is frankly oversold these days, but that's another rant...), but also makes some very good points about the complexity they introduce --- which does, among other things, create security issues. FWIW, I don't think hardlinks are a good idea either. (In FAT filesystems, it is possible to have "hardlinks" where the same data is referred to by multiple names, but those are actually errors and referred to as "crosslinked files".)

Without links, there's a 1-to-1 mapping between (absolute) paths and file contents. When operating on a path, there's no need to consider whether the API will act on the link or what it points to.

saagarjha · on July 23, 2022

APFS does not support directory hardlinks, although HFS+ did.

jonpalmisc · on July 23, 2022

HN crowd might have some interesting takes on this: what is your preferred archive (and/or compression) format, and why?

I’ve been using .tar.xz for archiving, but haven’t looked into what the “best” option really is.

mgerdts · on July 24, 2022

If I will need to access individual files or it needs to be usable by others on arbitrary operating systems, zip is best. Random access is greatly helped by the directory at the end of a zip file. .tar.*z is particularly hostile to random access.
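A small sketch of what that directory buys you, using Python's standard zipfile module on an in-memory archive. `build_zip` and `read_one` are illustrative helper names, not part of any library.

```python
import io
import zipfile


def build_zip(files: dict) -> bytes:
    """Create a zip archive in memory from a {name: bytes} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for name, data in files.items():
            zf.writestr(name, data)
    return buf.getvalue()


def read_one(archive: bytes, name: str) -> bytes:
    """Read a single member without decompressing the others."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        # getinfo() consults the directory read from the end of the file,
        # which records where each member's data starts.
        info = zf.getinfo(name)
        return zf.read(info)
```

With a .tar.gz you would have to decompress everything in front of the member you want before you could even find its header.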

If I just need an archive for easy schlepping around, I use tar if the content is not likely compressible or tar.gz if it is compressible. The slowness of better compression or the likelihood that I will struggle with missing utilities tends to make me shy away from other compressors.

I would only highly optimize for size and suffer the slow compression if it will be downloaded a lot of times and that makes a difference for user experience or bandwidth costs.

bejelentkezni · on July 24, 2022

For what it's worth, the guy who maintains GNU ed and wrote lzip says you should not do that[0].

0. https://www.nongnu.org/lzip/xz_inadequate.html

usefulcat · on July 24, 2022

I remember some discussion about that a while back on HN. The TL;DR is a) he does say that and b) he would say that, since lzip is effectively a competitor to xz.

That said, I've been using pixz (which is compatible with xz but can do parallel decompression and a few other things) for many years on many dozens of terabytes of compressed data and have never had any problems.

johnwfinigan · on July 24, 2022

.tar.zst - I have found Zstd to be enormously faster than gzip, bzip2, or xz. Its compression ratio is good and it's available in most Linux distros.

adrian_b · on July 24, 2022

All the traditional UNIX/POSIX archive formats are inadequate for archiving modern file systems, because they lose a part of the file metadata, e.g. they may lose the extended attributes or access-control lists, or they may truncate the time stamps.

Most archivers have various extensions for the tar or pax file formats, to deal with modern metadata, but the format extensions may differ between various "tar" or "pax" implementations and not all such extensions really succeed in preserving all the metadata.

One archiver available on Linux/*BSD systems, for which I have checked that it does the job right and it is able to archive files from a filesystem like XFS, without losing metadata like most other tar/pax/cpio programs, is the "bsdtar" program from the "libarchive" package, when used as a "pax" program, i.e. when invoked with the options "bsdtar --create --format=pax".
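The time stamp part is easy to see for yourself. The sketch below uses Python's tarfile module rather than bsdtar, so it says nothing about bsdtar or GNU tar specifically; it only shows the difference between the ustar and pax formats as that one library writes them. `roundtrip_mtime` is a made-up helper.

```python
import io
import tarfile


def roundtrip_mtime(mtime: float, fmt: int):
    """Write one empty member with the given mtime, then read the mtime back."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=fmt) as tar:
        info = tarfile.TarInfo("file.txt")
        info.size = 0
        info.mtime = mtime
        tar.addfile(info)
    buf.seek(0)
    with tarfile.open(fileobj=buf, mode="r") as tar:
        return tar.getmembers()[0].mtime


# Compare a sub-second time stamp under both formats.
ustar_mtime = roundtrip_mtime(1658620800.25, tarfile.USTAR_FORMAT)
pax_mtime = roundtrip_mtime(1658620800.25, tarfile.PAX_FORMAT)
```

As far as I can tell the ustar header only has room for whole seconds in its octal field, while the pax format carries the fractional value in an extended header record. Extended attributes and ACLs are a separate matter that this does not exercise.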

Since many years ago, I have been using exclusively this archiver, to avoid losing information when making archive files. It is also very versatile, allowing to combine archiving with many finely configurable compression or encryption algorithms.

Some years ago, the most widely available tar program, the GNU tar, was not able to store XFS or UFS files without data loss.

It is possible that the GNU tar has been improved meanwhile and now it can do its job right, but I had no reason to go back to it, so I have never checked it again for changes.

Of course, Windows file formats, e.g. zip or 7z, are even less appropriate for archiving UNIX/POSIX file systems than the traditional tar/cpio/pax.

dkulchenko · on July 24, 2022

Maybe a bit unorthodox, but I've been using SquashFS with xz for compression for long-term archival (I generally prefer zstd, but for long-term archival I don't mind waiting longer for better compression with xz).

SquashFS files have file-based deduplication, fast random access, and mountability, all of which are lacking from .tar.* archives without resorting to other indexing tools. And they're mountable on any Linux without installing anything.

Only downside is they're readonly (or more accurately append-only), but for my uses, that's totally fine.

forrestthewoods · on July 23, 2022

I use windows, so .zip. Maybe on rare occasion .7z if I really want better compression.

.tar.xz files are a gross pain on windows and I’m always annoyed when people use them.

Affric · on July 24, 2022

Why are they a pain?

auscompgeek · on July 24, 2022

How are tarballs any more of a pain than .7z?

kasabali · on July 24, 2022

I'll have a guess: because 7-zip gui doesn't decompress tar.gz files in a single step (ie. first you extract the tar, then you extract it again)

forrestthewoods · on July 24, 2022

IIRC 7zip actually does it in one step. But many tools such as WinZip do not. I forget the exact behavior of built-in, WinZip, and WinRar.

squeaky-clean · on July 24, 2022

For personal stuff I haven't compressed an archive in a while. I just copy the containing folder to an external drive and a cloud storage if it's really important. Everything I really really want archived is less than 1TB already. That fits on an external drive and cheap cloud storage subscription.

If I need to send something I'll use zip for personal stuff or tar.gz at work because I know everyone is using some kind of Linux and it's the only terminal zip command I have memorized.

Too · on July 24, 2022

Just to think out of the box: How about not creating archives, just leave the files as they are in a directory. For transferring, use a client that preserves directory structure.

The space of such clients and their features or popularity is limited though. Rsync has millions of flags that add complexity. Then you have BitTorrent. From what I know it doesn't compress. Others: git, nfs, ftp.

mort96 · on July 24, 2022

Usually you want some kind of compression though.

That's actually the reason why I used tar instead of rsync; I only had around 750 GiB of space. The 519GiB tar.gz would fit, the 1.1TiB directory structure wouldn't.

_ofdw · on July 23, 2022

I have always been partial to .uha, the files produced by Uwe Herklotz' UHARC archive utility. I mostly ran into them on pirated game rips downloaded from DALnet IRC in the 90s and early aughts. It's extremely slow, however.

If I was going to be packing lots of data I'd probably use mopaq (.mpq) which has support for LZMA and Huffman coding.

LeoPanthera · on July 24, 2022

I use tar.lz (Lzip) for stuff that only I will care about.

For sending stuff to other people I just use zip, because it is the lowest common denominator, but I make sure to always use info-zip so that I don't use any weird proprietary extensions, and get proper zip64 support.

game-of-throws · on July 23, 2022

> I decided that learning the tar file format and making my own tar extractor would probably be faster than waiting for tar.

If you want something done right, you gotta do it yourself!

brian_herman · on July 23, 2022

“I'm absolutely certain that it's possible to make GNU tar extract in O(n) without --absolute-paths by replacing the linked list with a hash map. But that's an adventure for another time.“ from the article maybe someone could put in a pr
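To put numbers on the difference, here is a toy model in Python. It is not GNU tar's code, only the shape of the two approaches: checking each new entry against a plain list of earlier ones versus a hash-based lookup. Both function names are made up.

```python
def comparisons_with_list(paths):
    """Count comparisons when every new path is checked against a plain list."""
    seen, comparisons = [], 0
    for path in paths:
        for other in seen:
            comparisons += 1
            if other == path:
                break
        else:
            # Not found after scanning everything seen so far.
            seen.append(path)
    return comparisons


def lookups_with_dict(paths):
    """Count lookups when the same check goes through a hash map."""
    seen, lookups = {}, 0
    for path in paths:
        lookups += 1
        if path not in seen:
            seen[path] = True
    return lookups
```

For n distinct paths the list version does n(n-1)/2 comparisons and the dict version does n lookups, so the gap grows with the size of the archive.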

stabbles · on July 23, 2022

Pretty sure there's tons of C libraries that do not use hash maps because it requires implementing a hash map.

mort96 · on July 23, 2022

The fun thing is, GNU tar already has a hash map implementation, and it uses that hash map to make sense of hard links in the archive creation code. It's just not used in the extractor for whatever reason.

IshKebab · on July 24, 2022

Probably because whoever wrote the extractor didn't want to learn how to use some custom hashmap written by the guy who wrote the compressor. Or maybe didn't even know it existed.

mort96 · on July 24, 2022

It uses the hash table from gnulib: https://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=l... -- so that's not the reason, gnulib's hash table should be pretty general and robust.

brongondwana · on July 24, 2022

I have also written a tar format extractor, which is still in use for the backup system at Fastmail:

https://metacpan.org/pod/Archive::Tar::Stream

It's capable of reading and writing tar files, and the big benefit over extracting is that you can repack a tar by (as the name says) streaming it, with a decision callback which allows you to discard, keep, or "edit" a file - where edit does unpack to a temporary file at which point you can arbitrarily change the file or the header info and then keep or discard once you're done.
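For anyone who wants to try the streaming-repack idea without Perl, here is a rough sketch with Python's tarfile module in its stream modes. It is not how Archive::Tar::Stream works internally, and it only covers the keep/discard half of the callback, not "edit". `repack` and `decide` are names I picked.

```python
import io
import tarfile


def repack(src, dst, decide):
    """Copy members from src to dst in one pass, keeping those decide() accepts.

    src and dst are binary file objects; neither needs to be seekable,
    because "r|" and "w|" treat them as pure streams.
    """
    with tarfile.open(fileobj=src, mode="r|") as tin, \
         tarfile.open(fileobj=dst, mode="w|") as tout:
        for member in tin:
            if not decide(member):
                continue  # skipped data is read past, never stored
            data = tin.extractfile(member) if member.isfile() else None
            tout.addfile(member, data)
```

Usage would look like `repack(old, new, lambda m: not m.name.endswith(".log"))`, with `old` and `new` being open files or pipes.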

komadori · on July 23, 2022

I wonder if Jörg Schilling's (RIP) star* handles this better. I recall he was pretty scathing about GNU tar back when I used Solaris.

* http://cdrtools.sourceforge.net/private/star.html

asveikau · on July 23, 2022

> Jörg Schilling's (RIP)

As someone who burned a lot of CDs on Linux in the 90s, you bet I recognize that name, and I hadn't heard this news.

https://lwn.net/Articles/872489/

macintux · on July 23, 2022

There were some interesting personal anecdotes shared here: https://news.ycombinator.com/item?id=28827388

mauvehaus · on July 24, 2022

I didn't mention it there, but I was actually responsible for some pax format related work many years back and reached out to him with a question pertaining to his star implementation.

He had strong feelings about his star being the star and an unrelated star archiving tool needing to be renamed. Something I was in no position to do or advocate for. I no longer remember if we actually discussed the technical matter or not at any point.

I was wondering if mention of him would pop up in the comments since he wrote a tar (and possibly pax, I no longer remember) implementation.

ArrayBoundCheck · on July 24, 2022

> I decided that learning the tar file format and making my own tar extractor would probably be faster than waiting for tar. And I was right; before the day was over, I had a working tar extractor, and I had successfully extracted my 1.1TiB tarball.

That's awesome!

I once did the same thing (something else, not a tar). I was on discord at the time and typed in messages about the stupid stuff I noticed while writing my own code. A guy told me I'm an asshole and a shit programmer/personality because I wasn't willing to stick with/figure out standard tools + blocked me. Which is weird because why wouldn't you want more tools available. I gave away my source at the end

Don't be afraid of writing your own stuff!

jepler · on July 24, 2022

I wonder if the non-portable linux openat2 with RESOLVE_BENEATH could help with this problem. The documentation suggests that it should prevent escaping the intended location via ".." components literally or in symlinks.

cyphar · on July 25, 2022

Author of openat2(2) here, yes it would (as well as protecting against races where the target directory is having path components changed to symlinks during extraction). RESOLVE_IN_ROOT would be more akin to extracting in a chroot(2). But you would need to be careful how you deal with path operations in the rest of the program in a way that is fairly easy to mess up.

I've been working on a userspace library[1] which would make writing userspace programs that interact with these kinds of dangerous paths more safe (though it's been on the backburner recently).

[1]: https://github.com/openSUSE/libpathrs

1vuio0pswjnm7 · on July 23, 2022

"In any case, the result is that a tar extractor which wants to support both pax tar files and GNU tar files needs to support 5 different ways of reading the file path, 5 different ways of reading the link path, and 3 different ways of reading the file size."

https://github.com/libarchive/libarchive

thanatos519 · on July 23, 2022

I've never had a big enough n to notice this.

ctur · on July 24, 2022

In general, it is preferable now to use squashfs for most tar use cases these days. Random access, high compression levels, and completely operable with userland tools and a FUSE file system. tar does have some niche uses (streaming) but if you don’t need streaming, you can get massively faster compression into a format much more ergonomically pleasant.

cdelsolar · on July 24, 2022

tar totally burned me very very recently. I wrote a Python script to create a tar file from a bunch of different ML model artifacts. The tar file untarred fine to my computer, but I could not open it with an existing Go program. The Go program would try to create files and folders if they weren’t available, but it seemed the way I had made the tar file, it didn’t “create an empty directory” first before adding the files in that directory.

When I untarred it in the terminal with xvf I would see the files in the directories inside the tarball being created. But when I untarred a previous working version of this tarball, it would create the directory before untarring it.

Googling all of this is impossible by the way. I couldn’t figure out how to get Python to create a “blank directory” first inside the tarball so eventually I gave up and made my script take in a path to an already existing folder instead of individual files.
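For what it's worth, here is a sketch of one way to get explicit directory entries out of Python's tarfile module: build a TarInfo by hand with its type set to DIRTYPE and add it before the files. I have not seen the script or the Go program in question, so whether this matches what that extractor expected is a guess. `build_tar_with_dirs` is a hypothetical helper.

```python
import io
import tarfile


def build_tar_with_dirs(files: dict) -> bytes:
    """Build a tar from {path: bytes}, emitting each parent directory first."""
    buf = io.BytesIO()
    emitted = set()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in files.items():
            parents = name.split("/")[:-1]
            for depth in range(1, len(parents) + 1):
                directory = "/".join(parents[:depth])
                if directory in emitted:
                    continue
                # A directory is just a header with no data blocks.
                info = tarfile.TarInfo(directory)
                info.type = tarfile.DIRTYPE
                info.mode = 0o755
                tar.addfile(info)
                emitted.add(directory)
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()
```

Listing the result with `tar tvf` should then show the directories as their own lines ahead of the files inside them.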

im3w1l · on July 23, 2022

I have to say that using an archive format with this feature set for source distribution or really anything but backups makes me uneasy. When I unzip some file from the internet I don't want it putting files owned by some random uid in ../../.

Like what if dogpictures.tar.gz contains a ../.bashrc. Untar from $USER/Downloads and whoops..

tpoacher · on July 23, 2022

wait, is that a thing with tar???

EDIT: nevermind, I read the actual article. This only happens if you explicitly tell it to with the -P flag.

stabbles · on July 23, 2022

You'll be happy to learn that Python's builtin tarfile module extracts and overwrites `../parent/paths` and `/absolute/paths` at those locations. Never use Python.
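If you do have to extract untrusted archives from a script, a check along these lines can be run on each member name first. This is a minimal sketch with a made-up function name, and it is purely lexical: it does nothing about symlinks already on disk or inside the archive, which is the harder part of the problem.

```python
import posixpath


def is_safe_member(name: str) -> bool:
    """Reject absolute paths and paths that climb out of the extraction root."""
    if name.startswith("/"):
        return False
    # normpath collapses "a/../b" to "b"; any ".." that survives
    # must point above the starting directory.
    parts = posixpath.normpath(name).split("/")
    return ".." not in parts
```

A caller would skip or refuse any member where this returns False before handing it to the extractor.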

im3w1l · on July 23, 2022

Ah, you are right. I can't believe my eyes skipped over that.

thomashabets2 · on July 25, 2022

If you like the tar format then you may be interested in reading https://blog.habets.se/2019/11/CVE-2019-14866-gnu-cpio.html

mgerdts · on July 24, 2022

This is not a problem with the file format, it is a problem with the program that is interpreting it.

mjevans · on July 23, 2022

bsdtar (libarchive) - I wonder if it avoids these issues?

sedatk · on July 24, 2022

TL;DR, IIUIC: Hardlink creation code in tar checks whether a hardlink has already been created by going over a list of previously created hardlink placeholders, and does that every time it needs to create a hardlink.

Ekaros · on July 23, 2022

I wonder if there could be better design. Maybe on packing side. Why wouldn't you order files and links in such way that single pass is enough.

killingtime74 · on July 23, 2022

Better design is not to use Tar