tar = tape archive. It was designed around streaming files to a tape drive in a serial fashion. Tape bandwidth has almost always been slow compared to source data devices. You need to start writing to the target device as fast as possible. Taking time to construct a full index just delays starting the slowest stage of things, while providing only relatively minimal benefit.
Generating an index is fairly straightforward: the file headers give you the information you need, including what you need to know to get to the next file header. Your only bottleneck is how fast random reads are. If the file is on disk, it should be relatively trivial.
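For what it's worth, a minimal sketch of that walk in Python (plain ustar members only; pax/GNU extension headers and base-256 size encodings are ignored, and the archive path is a placeholder):

    def build_index(path):
        """Walk the 512-byte tar headers, recording (data offset, size) per member name."""
        index = {}
        with open(path, "rb") as f:
            while True:
                header = f.read(512)
                if len(header) < 512 or header == b"\0" * 512:
                    break                                   # end-of-archive marker
                name = header[0:100].rstrip(b"\0").decode()
                size = int(header[124:136].rstrip(b" \0") or b"0", 8)  # size field is octal ASCII
                index[name] = (f.tell(), size)
                f.seek((size + 511) // 512 * 512, 1)        # skip the data, padded to 512-byte blocks
        return index

Once you have that, pulling out any one member is a single seek plus a single read.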
It's even possible to do this relatively cheaply with Range headers if you're looking at a tar file stored on an HTTP endpoint, though you'll likely want to think about some smart pre-fetching behaviour.
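Roughly the same loop works over HTTP, assuming the server honours Range requests (placeholder URL, and none of the prefetching you'd want in practice):

    import urllib.request

    def read_range(url, offset, length):
        req = urllib.request.Request(url, headers={"Range": f"bytes={offset}-{offset + length - 1}"})
        with urllib.request.urlopen(req) as resp:
            return resp.read()

    offset = 0
    while True:
        header = read_range("https://example.com/big.tar", offset, 512)   # placeholder URL
        if header == b"\0" * 512:
            break                                        # end-of-archive marker
        size = int(header[124:136].rstrip(b" \0") or b"0", 8)
        # record (name, offset + 512, size) here, then hop to the next header
        offset += 512 + (size + 511) // 512 * 512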
GP was likely confused by the different types of hard drives (standard, sc1, st1), and assumed it was tape or was incorrectly told by someone else that it was tape.
If I squint hard enough, the only viable explanation I could come up with is GP said "I created the database using the st1 storage type" and someone responded "st!? As in a UNIX SCSI tape drive?"
Edit: I checked. You’re 100% right. Well now my anecdote sucks! :) And now, seven years later, I’m back to square one, wondering why it was so brutally slow.
If you provision a really small amount of GP2, you can run out of I/O credits pretty quickly and get throttled to baseline performance (it used to be 7 times the storage provisioned, or some such; now it's 100 IOPS). That's brutally slow for a working database.
It was especially scary because I was in the “I have no idea what I’m doing, and I’m at a tech startup and am a team of one” phase.
The last time I used tape in earnest was a while back though (2005ish), when LTO-2 ruled the roost at a whopping 40 MB/s and LTO-3 was just making an appearance. Around the time I left the company they were building out a disk storage array as a backup target, and shifting tape further and further away from the main backup path. Data production and storage needs across the industry had long been growing faster than tape capacity and performance.
True, but that's the speed of the SATA3 interface, or of SSDs. A single spinning-rust drive can realistically achieve 150-250 MB/s.
The issue is much more about seek speed. Putting the master header at the end of ZIP files requires initially reading them backwards to locate the beginning of that data structure. This works fine on random-access devices but is of course pretty untenable on tape. Common tape drives can't actually read backwards without seeking back and then reading each successive block in the normal direction (I'm sure there's at least one weird tape drive out there that can, and if you read very small blocks you could get pretty good performance doing this as long as you kept the seeks within the drive's buffer). But even if the drive could, the seek time to get to the end of the tape and then back to the beginning could be minutes.
Another benefit was that you could quickly browse the contents of your backups without having to read or even insert the tape.
It is indeed really odd that the *nix world never came up with its own, agreed-upon, indexed (or at least efficiently seekable) archive format. I think if one of the extensions had added per-file compression, that would have had the highest chance of success.
This is not the case.
Ah yes, how fast is a random read to the middle of a 1TB file... that has been compressed as a single gzip stream...
> My tarball happened to contain over 800 000 links with ".." as a path component. It also happened to contain over 5.4 million hard links.
The behavior is more precisely described not as quadratic, but as O(n*m), where n is the number of symlinks pointing outside of the extraction directory and m is the number of hardlinks. It happens because of the security precautions GNU tar takes to make sure a malicious archive can't set up a symlink pointing to an arbitrary place in your filesystem and then extract files to it. The strategy they use has an unoptimized corner case when the archive also has a lot of hardlinks.
The mechanism could probably be improved by remembering the deferred symlinks in a hash map rather than a linked list, but then again, both symlinks pointing outside the extraction directory and hardlinks are uncommon.
(Edited to correct: The problem exists even if the hardlinks do not actually point at the symlinks.)
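A toy illustration of the difference (made-up data, nothing like GNU tar's actual code):

    # deferred symlink destinations, i.e. links whose creation tar postpones for safety
    deferred_list = [f"link-{i}" for i in range(500_000)]   # linked-list-style storage
    deferred_set = set(deferred_list)                       # hash-based alternative

    def check_with_list(hardlink_target):
        return hardlink_target in deferred_list   # O(n) scan per hard link -> O(n*m) overall

    def check_with_set(hardlink_target):
        return hardlink_target in deferred_set    # O(1) average per hard link -> roughly O(m) overall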
Relative symlinks which contain ".." are incredibly common in some contexts.
It's still weird they didn't do a hashmap.
Tar’s “competitor”, cpio, does this as well (at least in one of the popular implementations). The Xcode XIP file, if you’re familiar with that particular format, is a couple layers of wrapping on top of a cpio archive, so deep inside my tool to expand these there’s a spot where I read these all out in a similar fashion: https://github.com/saagarjha/unxip/blob/5dcfe2c0f9578bc43070...
More generally, though, neither tar nor cpio has an index because they don't need one: they’re meant to be decompressed in a streaming fashion where this information isn’t necessary. It’s kind of inconvenient if you want to extract a single file, but for large archives you really want to compress files together, and maintaining an index for that is both expensive and nontrivial, so I guess these formats are just “good enough”.
TBF the only common numeric format I’ve yet to see is base64, git uses octal, decimal, hex, binary (BE), at least two completely different VLEs, plus a bitmap scheme.
The magic is lost, but I feel a lot better about myself :)
Although in this case I'd say it's the fault of symlinks daring to exist and spew their flaws over everyone.
Plus maybe slightly C's fault for not coming with a data structure library so everyone uses linked lists for everything. Presumably this wouldn't have happened in any other language (e.g. C++) because they would have used a set or hashmap or whatever.
A stone age language dealing with a misguided filesystem feature for a brain-dead archive format. Surprising that it works at all really!
Zip also can't store symlinks but I doubt that is very important in most situations.
To support random access, the ZIP format must compress each file separately.
The 512-byte header is a bit of an annoyance there (especially compared to zip’s 30 bytes plus the file name), but not that much.
And then, obviously, one of the advantages of tar is you can compress with whatever you want.
Zip allows random access while solid compression will give slightly better compression for small similar files.
I was pointing out that you can have the best of both worlds by catting all the files together like `tar` but then compressing it in chunks instead of as one big stream. Plus compressing it in chunks allows you to decompress it in parallel.
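Something like this, as a rough Python sketch (chunk size and paths are made up). Since concatenated gzip members are still a valid .gz stream, ordinary tools can still read the whole thing end to end:

    import gzip

    CHUNK = 4 * 1024 * 1024   # assumed 4 MiB of uncompressed data per chunk

    def compress_chunked(src_path, dst_path):
        """Write independently compressed gzip members and remember where each one starts."""
        index = []                     # (uncompressed_offset, compressed_offset, compressed_len)
        comp_off = uncomp_off = 0
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            while True:
                block = src.read(CHUNK)
                if not block:
                    break
                member = gzip.compress(block)          # each chunk is a standalone gzip member
                dst.write(member)
                index.append((uncomp_off, comp_off, len(member)))
                uncomp_off += len(block)
                comp_off += len(member)
        return index

    def read_chunk(dst_path, index, i):
        """Random access: decompress only the i-th chunk."""
        _, comp_off, comp_len = index[i]
        with open(dst_path, "rb") as f:
            f.seek(comp_off)
            return gzip.decompress(f.read(comp_len))

Parallel decompression is then just a matter of handing different index entries to different workers.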
One approach would have been to split it into pages of a few kilobytes and separately compress those. But on this platform, I/O was primitive and very slow, and the CPU was also extremely slow. So it was advantageous to decompress only the necessary data.
So I came up with a rudimentary custom compression algorithm. I scanned all entries for common strings and made a shared dictionary. Each entry was then stored as a sequence of indices into the shared dictionary. Decompressing an entry was dead simple: read the sequence of dictionary indices and copy from the dictionary to the output.
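A toy sketch of roughly that shape (made-up dictionary, greedy matching; the dictionary-building heuristic, which was the actual hard part, is left out):

    dictionary = ["the ", "quick ", "brown ", "fox"]        # made-up shared dictionary

    def encode(entry, dictionary):
        """Greedily turn an entry into a sequence of dictionary indices."""
        indices, pos = [], 0
        while pos < len(entry):
            match = max((s for s in dictionary if entry.startswith(s, pos)),
                        key=len, default=None)
            if match is None:
                raise ValueError("entry not coverable by this toy dictionary")
            indices.append(dictionary.index(match))
            pos += len(match)
        return indices

    def decode(indices, dictionary):
        """Decompression really is just: look up each index and copy it to the output."""
        return "".join(dictionary[i] for i in indices)

    codes = encode("the quick brown fox", dictionary)        # -> [0, 1, 2, 3]
    assert decode(codes, dictionary) == "the quick brown fox"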
On a side note, the unexpected challenge of this method was how to choose which strings to put in the shared dictionary. I tried a few things and ended up with a heuristic that tries to balance favoring longer strings with favoring frequently occurring strings. I'm sure there are much better ways than that.
If I had it to do over, I probably would have gone with the compressed pages approach simply because the implementation would've been done so much quicker. But I think I did achieve faster decompression and a better compression ratio.
Anyway, 7z can optionally do solid compression too. I've no idea why it isn't more popular in the Linux world. I guess the only real reason is that the CLI tools don't come installed by default.
GNU tar most likely inherited the octal format in order to retain compatibility with the original tar utility.
Weird Intel history aside, everybody used octal in the old days. I don’t actually know why, but I suspect this goes back to how early binary computers used 36-bit words (later 18- or 12-bit ones) because that’s how many bits you need to represent 10 decimal digits (and a sign), the standard for (electro)mechanical arithmometers which (among other devices) they were trying to displace. So three-bit groupings made more sense than four-bit ones. Besides, as far as instruction encodings go, an 8-way mux seems like a reasonable size if a 16-way one is too expensive.
(Octal on the 16-bit PDP-11 is probably a holdover from the 12-bit PDP-8? Looking at the encoding table, it does seem to be using three- and six-bit groups a lot.)
Actually the PDP-1 was an 18-bit machine; its successor the PDP-6 had 36-bit words (and 18-bit addresses -- yes, it was explicitly designed to be a Lisp machine in 1963). Other DEC machines like the PDP-7 (on which Unix was developed) were 18-bit machines (Multics used an 18-bit non-DEC architecture).
Probably the most popular minicomputer ever, the PDP-8, used 12 bits, as did a bunch of DEC industrial control machines.
That led her to conclude it would be better to teach the computer to handle decimal than to force everyone to use octal.
A lot of the old systems used 6-bit character codes (https://en.wikipedia.org/wiki/Six-bit_character_code ), including BCD six-bit codes, making 6*n word sizes more appropriate.
The anecdote about decimal makes sense — she was the key to the design of COBOL.
I used quite a few machines with six bit characters into the mid 1980s.
Here's the V7 tar.c: https://github.com/dspinellis/unix-history-repo/blob/Researc...
and its documentation: https://github.com/dspinellis/unix-history-repo/blob/Researc...
The C code clearly shows the octal (I've never used "%o" myself!):
sscanf(dblock.dbuf.mode, "%o", &i);
sp->st_mode = i;
sscanf(dblock.dbuf.uid, "%o", &i);
sp->st_uid = i;
sscanf(dblock.dbuf.gid, "%o", &i);
sp->st_gid = i;
sscanf(dblock.dbuf.size, "%lo", &sp->st_size);
sscanf(dblock.dbuf.mtime, "%lo", &sp->st_mtime);
sscanf(dblock.dbuf.chksum, "%o", &chksum);
Now, older versions of Unix supported DECTape, going back to the first manual from 1971 (see the "tap" command at http://www.bitsavers.org/pdf/bellLabs/unix/UNIX_ProgrammersM... , page "5_06", which is 94 of 211 - the command-line interface is clearly related to tar).
Here's the tp.5 format description from V6 at https://github.com/dspinellis/unix-history-repo/blob/Researc...
DEC/mag tape formats
Each entry has the following format:
path name 32 bytes
mode 2 bytes
uid 1 byte
gid 1 byte
unused 1 byte
size 3 bytes
time modified 4 bytes
tape address 2 bytes
unused 16 bytes
check sum 2 bytes
But you can see at https://github.com/dspinellis/unix-history-repo/blob/Researc... that mtime is an int, so 4 bytes (yes, 16-bit integers; confirm with https://github.com/dspinellis/unix-history-repo/blob/Researc... showing i_mode is an integer).
Which means this older "tap" DECTape support used plain binary numbers ("base 256"), not octal.
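For contrast with tar's octal ASCII, reading one of those 64-byte tp entries as raw binary would look roughly like this (the little-endian byte order is an assumption for illustration only; the PDP-11 actually stored 32-bit values as two 16-bit words with the high word first, so treat this purely as a sketch of "binary, not octal"):

    import struct

    # 64 bytes total, following the field layout quoted above
    TP_ENTRY = struct.Struct("<32sHBBB3sIH16sH")

    def parse_tp_entry(raw):
        (pathname, mode, uid, gid, _unused, size3, mtime,
         tape_addr, _pad, checksum) = TP_ENTRY.unpack(raw)
        return {
            "path": pathname.rstrip(b"\0").decode(errors="replace"),
            "mode": mode, "uid": uid, "gid": gid,
            "size": int.from_bytes(size3, "little"),   # the 3-byte size, again assuming byte order
            "mtime": mtime, "tape_address": tape_addr, "checksum": checksum,
        }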
I started out computing in the PC world with the DOS family (then moving on to Windows) before working with any of the Unixes, so my perspective may be different from those who started with Unix/Linux, but in my experience links (both soft, and especially hard links applied to directories) cause a rather large number of problems in comparison to their usefulness: they make the filesystem a graph instead of a tree, adding complexity to all applications that need to traverse it.
Luckily, most systems don't support directory hard links. (IIRC, macOS does though, through some extremely hacky mechanism.)
Without links, there's a 1-to-1 mapping between (absolute) paths and file contents. When operating on a path, there's no need to consider whether the API will act on the link or what it points to.
I’ve been using .tar.xz for archiving, but haven’t looked into what the “best” option really is.
If I just need an archive for easy schlepping around, I use tar if the content is not likely compressible or tar.gz if it is compressible. The slowness of better compression or the likelihood that I will struggle with missing utilities tends to make me shy away from other compressors.
I would only highly optimize for size and suffer the slow compression if it will be downloaded a lot of times and that makes a difference for user experience or bandwidth costs.
That said, I've been using pixz (which is compatible with xz but can do parallel decompression and a few other things) for many years on many dozens of terabytes of compressed data and have never had any problems.
Most archivers have various extensions for the tar or pax file formats to deal with modern metadata, but these extensions may differ between the various "tar" or "pax" implementations, and not all of them actually manage to avoid losing metadata.
One archiver available on Linux/*BSD systems that I have verified does the job right, and can archive files from a filesystem like XFS without losing metadata the way most other tar/pax/cpio programs do, is the "bsdtar" program from the "libarchive" package when used as a "pax" program, i.e. when invoked with the options "bsdtar --create --format=pax".
For many years now I have been using this archiver exclusively, to avoid losing information when making archive files. It is also very versatile, letting you combine archiving with many finely configurable compression or encryption algorithms.
Some years ago, the most widely available tar program, GNU tar, was not able to store XFS or UFS files without data loss.
It is possible that GNU tar has been improved in the meantime and can now do the job right, but I had no reason to go back to it, so I have never checked it again for changes.
Of course, Windows file formats, e.g. zip or 7z, are even less appropriate for archiving UNIX/POSIX file systems than the traditional tar/cpio/pax.
SquashFS files have file-based deduplication, fast random access, and mountability, all of which are lacking from .tar.* archives without resorting to other indexing tools. And they're mountable on any Linux without installing anything.
Only downside is they're readonly (or more accurately append-only), but for my uses, that's totally fine.
.tar.xz files are a gross pain on windows and I’m always annoyed when people use them.
If I need to send something I'll use zip for personal stuff or tar.gz at work because I know everyone is using some kind of Linux and it's the only terminal zip command I have memorized.
The space of such clients and their features or popularity is limited though. Rsync has a million flags that add complexity. Then you have BitTorrent, which as far as I know doesn't compress. Others: git, NFS, FTP.
That's actually the reason why I used tar instead of rsync; I only had around 750 GiB of space. The 519GiB tar.gz would fit, the 1.1TiB directory structure wouldn't.
If I was going to be packing lots of data I'd probably use mopaq (.mpq) which has support for LZMA and Huffman coding.
For sending stuff to other people I just use zip, because it is the lowest common denominator, but I make sure to always use info-zip so that I don't use any weird proprietary extensions, and get proper zip64 support.
If you want something done right, you gotta do it yourself!
It's capable of reading and writing tar files, and the big benefit over extracting is that you can repack a tar by (as the name says) streaming it, with a decision callback which allows you to discard, keep, or "edit" a file - where edit does unpack to a temporary file at which point you can arbitrarily change the file or the header info and then keep or discard once you're done.
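Not that tool, but a rough sketch of the same streaming-repack idea using Python's tarfile (the decide() callback and its signature are made up):

    import tarfile

    def repack(src_path, dst_path, decide):
        """Stream-repack a tar: decide(member, data) returns None to drop an entry,
        or a (TarInfo, fileobj-or-None) pair to keep or replace it."""
        with tarfile.open(src_path, "r|*") as src, tarfile.open(dst_path, "w") as dst:
            for member in src:
                data = src.extractfile(member) if member.isreg() else None
                kept = decide(member, data)
                if kept is None:
                    continue                       # discard this entry
                new_member, new_data = kept
                dst.addfile(new_member, new_data)  # if you "edit", the size must match the header

    # e.g. keep only regular files under 1 MiB:
    # repack("in.tar.gz", "out.tar", lambda m, d: (m, d) if m.isreg() and m.size < 2**20 else None)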
As someone who burned a lot of CDs on Linux in the 90s, you bet I recognize that name, and I hadn't heard this news.
He had strong feelings about his star being the star and an unrelated star archiving tool needing to be renamed. Something I was in no position to do or advocate for. I no longer remember if we actually discussed the technical matter or not at any point.
I was wondering if mention of him would pop up in the comments since he wrote a tar (and possibly pax, I no longer remember) implementation.
I once did the same thing (something else, not a tar). I was on Discord at the time and typed in messages about the stupid stuff I noticed while writing my own code. A guy told me I'm an asshole and a shit programmer/personality because I wasn't willing to stick with/figure out the standard tools, and then blocked me. Which is weird, because why wouldn't you want more tools available? I gave away my source in the end.
Don't be afraid of writing your own stuff!
I've been working on a userspace library which would make writing userspace programs that interact with these kinds of dangerous paths more safe (though it's been on the backburner recently).
When I untarred it in the terminal with xvf I would see the files inside the tarball's directories being created directly. But when I untarred a previous, working version of this tarball, it would create the directories first and then extract the files into them.
Googling all of this is impossible by the way. I couldn’t figure out how to get Python to create a “blank directory” first inside the tarball so eventually I gave up and made my script take in a path to an already existing folder instead of individual files.
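For what it's worth, one way to add an explicit directory entry with Python's tarfile looks roughly like this (all names here are made up):

    import tarfile

    with tarfile.open("backup.tar.gz", "w:gz") as tf:
        info = tarfile.TarInfo(name="photos/")   # trailing slash marks a directory
        info.type = tarfile.DIRTYPE
        info.mode = 0o755
        tf.addfile(info)                         # directory entries carry no file data
        # tf.add("photos/cat.jpg")               # files added afterwards would land inside it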
Like what if dogpictures.tar.gz contains a ../.bashrc. Untar from $USER/Downloads and whoops..
EDIT: nevermind, I read the actual article. This only happens if you explicitly tell it to with the -P flag.
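For anyone extracting untrusted tarballs with Python, a rough belt-and-braces check before extraction (paths are placeholders; recent Python versions have also grown built-in extraction filters for exactly this):

    import os
    import tarfile

    def is_within(directory, target):
        """Reject members whose final destination escapes the extraction directory."""
        abs_dir = os.path.realpath(directory)
        abs_target = os.path.realpath(os.path.join(directory, target))
        return os.path.commonpath([abs_dir, abs_target]) == abs_dir

    with tarfile.open("dogpictures.tar.gz") as tf:      # placeholder archive name
        for member in tf.getmembers():
            if not is_within("extract_dir", member.name):
                raise ValueError(f"blocked path traversal: {member.name}")
        tf.extractall("extract_dir")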