To me, most of the claims are arguable.
To say 3 levels of headers is "unsafe complexity"... I don't agree. Indirection is fundamental to design.
To say padding is "useless"... I don't understand why padding and byte-alignment are given so much vitriol. Look at how much padding the tar format has. And tar is a good example of how "useless padding" was used to extend the format to support larger files. So this supposed "flaw" has been in tar for decades, with no disastrous effects at all.
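For a sense of scale (file name made up; the exact size depends on your tar's default blocking factor):

printf 'x' > tiny.txt
tar cf tiny.tar tiny.txt
ls -l tiny.tar    # typically 10240 bytes: one 512-byte header block, one data block, the rest zero padding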
The xz decision was not made "blindly". There was thought behind the decision.
And it's pure FUD to say "Xz implementations may choose what subset of the format they support. They may even choose to not support integrity checking at all. Safe interoperability among xz implementations is not guaranteed". You could say this about any software - "oh no, someone might make a bad implementation!" Format fragmentation is essentially a social problem more than a technical problem.
I'll leave it at this for now, but there's more I could write.
3 individual headers for one file format is unnecessary complexity.
> To say padding is "useless"
Padding in general is not useless, but padding in a compression format is very counterproductive.
> And it's pure FUD to say "Xz implementations may choose what subset of the format they support. They may even choose to not support integrity checking at all. Safe interoperability among xz implementations is not guaranteed". You could say this about any software - "oh no, someone might make a bad implementation!" Format fragmentation is essentially a social problem more than a technical problem.
This isn't about "someone making a bad implementation!", it's about crucial features being optional. That is, completely compliant implementations may or may not be able to decompress a given XZ archive, and may or may not be able to validate the archive.
XZ may not have been chosen blindly, but it certainly does not seem like a sensible format. There is no benefit to this complexity. We do not need or benefit from a format that is flexible, as we can just swap format and tool if we want to swap algorithms, like we have done so many times before (a proper compression format is just a tiny algorithm-specific header + trailing checksum, so it is not worth generalizing away).
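For reference, gzip already fits that description - a short header, the DEFLATE stream, then a CRC-32 + size trailer - which is easy to see if you have xxd around:

printf 'hello\n' | gzip -n | xxd
# 1f 8b magic + ~10-byte header, the DEFLATE stream,
# then an 8-byte trailer: CRC-32 and uncompressed size mod 2^32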
Any and all benefits of XZ lie in LZMA2. We could have lzip2 and avoid all of these problems.
(I have no opinion as to whether LZIP should supersede GZIP/BZIP2, but XZ certainly seems like a poor choice.)
So all these file formats are unnecessarily complex?
- all OpenDocument formats
- all MS office formats
- all multimedia container formats
- deb/rpm packages
Multimedia containers, while too complicated, don't really qualify for a position on that list. These containers are basically just special purpose file containers, and thus the headers of the "files" within should not contribute to the header count.
deb/rpm is also a good example for old and quite obnoxious formats. Deb is an AR archive of two GZIP compressed TAR archives (control and data) and a single file (debian-binary). TAR replaced AR for all but a few ancient tasks long ago, but for some reason, Deb uses both. A tar.gz with 3 files/folders that were not tar'd or compressed would have been much simpler. I believe RPM goes that route, but rather than TAR they use CPIO, and rather than embedding the metadata inside the archive, the RPM package has its own header.
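If you want to see that layering for yourself, something like this works (package name made up; newer debs may use .xz or .zst instead of .gz inside):

ar t example.deb        # lists: debian-binary, control.tar.gz, data.tar.gz
ar x example.deb
tar tzf data.tar.gz     # the actual files that get installed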
Both RPM and DEB have added support for a bunch of compression formats, meaning that not only does the content of a DEB/RPM package have dependencies, but each package can now basically end up having its own dependencies that need to be satisfied before you can even read the package in the first place. Oh, and one of the supported compression formats is now XZ, adding an extra dependency, as your version of XZ might not support the contained XZ archive at all.
I recall an article posted here detailing how incredibly bloated and crufty the RPM format was.
Just because it's in tar doesn't mean that the design is flawless. tar was created a long time ago, when a lot of things we are concerned with now weren't even thought of.
Deterministic, bit-reproducible archives are one thing that tar has recently struggled with, because the archive format was not originally designed with that in mind. With more foresight and a better archive format, this need not have been an issue at all.
 - https://lists.gnu.org/archive/html/help-tar/2015-05/msg00005...
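With a recent GNU tar the usual workaround looks something like this - the flags and the pinned timestamp are the commonly suggested ones, not something the format itself guarantees:

tar --sort=name --owner=0 --group=0 --numeric-owner \
    --mtime='UTC 2015-01-01' \
    -cf archive.tar somedir/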
While I think he made a case, I somewhat doubt that the other formats are flawless, and the real answer would lie in a more open analysis of all of them.
Like some other compressed formats, an lzip file is just a series of compressed blocks concatenated together, each block starting with a magic number and containing a certain amount of compressed data. There’s no overall file header, nor any marker that a particular block is the last one. This structure has the advantage that you can simply concatenate two lzip files, and the result is a valid lzip file that decompresses to the concatenation of what the inputs decompress to.
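That property is easy to see with the command-line tool (file names invented):

lzip -k part1 part2              # produces part1.lz and part2.lz, keeping the originals
cat part1.lz part2.lz > both.lz
lzip -d both.lz                  # "both" now contains part1 followed by part2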
Thus, when the decompressor has finished reading a block and sees there’s more input data left in the file, there are two possibilities for what that data could contain. It could be another lzip block corresponding to additional compressed data. Or it could be any other random binary data, if the user is taking advantage of the “trailing data” feature, in which case the rest of the file should be silently ignored.
How do you tell the difference? Simply enough, by checking if the data starts with the 4-byte lzip magic number. If the magic number itself is corrupted in any way? Then the entire rest of the file is treated as “trailing data” and ignored. I hope the user notices their data is missing before they delete the compressed original…
It might be possible to identify an lzip block that has its magic number corrupted, e.g. by checking whether the trailing CRC is valid. However, at least at the time I discovered this, lzip’s decompressor made no attempt to do so. It’s possible the behavior has improved in later releases; I haven’t checked.
But at least at the time this article was written: pot, meet kettle.
Their advocacy in this thread was so good that I removed lzip from my system.
To add to that, if you need parity to recover from errors, you need to calculate how much based on your storage medium durability and projected life span. It's not the file format's concern. The xz crc should be irrelevant.
So you've archived two or more copies of each file? That means you're using at least twice as much space (and if you're keeping the original as well, more than twice).
For the likely corruption of the occasional single bit flip here and there, you could do a lot better by using something like par2 and/or dvdisaster (depending on what media you're archiving to).
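The par2 workflow is simple enough; the 10% redundancy here is just an example:

par2 create -r10 archive.tar.xz      # writes archive.tar.xz.par2 + volume files (~10% extra)
par2 verify archive.tar.xz.par2      # check for damage
par2 repair archive.tar.xz.par2      # reconstruct damaged blocks, if any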
It took me just one minor "data loss incident" ~20 years ago to very quickly convince me to become a lifetime member of the "backup all the things to a few different locations" club.
> That means you're using at least twice as much space (and if you're keeping the original as well, more than twice).
"Storage is cheap."
99% of the digital data I'm keeping for the long term is family photos and videos. All my photos go to Dropbox (easy copy-from-device and access anywhere) and are then backed up to multiple locations by CrashPlan.
It'll be a while yet, but in the next few years I'll be hitting the 1TB Dropbox limit. I'm hoping that Dropbox make a >1TB 'consumer' plan in the next couple of years. There's no way I'm assuming my backups are fine, deleting from Dropbox to make space, then finding out in a few years that some set of photos is missing.
I also sync up to Google Drive - but again, there's a 1TB limit (or a large cost).
In the future, I might have to create a new Dropbox account and keep the old one running. Storage might be cheap, but keeping it cheap is tricky.
If it's really for pure backup, not continuous sync, Glacier is $4 per TB.
Depending on how price conscious you are, I agree with the GP's "keeping it cheap is tricky". And with things like backup, even if you do it yourself, the time spent maintaining it should be negligible: Occasionally kick off a format shift or failed drive replacement, have scripts running everything else.
Yes. But what you get in return is not having that data at home. It doesn't matter how many copies you have locally if your home gets robbed, flooded, or burns down.
No I don't, because it's a waste of space and money. Using par2 and/or dvdisaster I can archive a lot more files on to the same archival media and still get enough redundancy to feel secure.
Cheap is relative. Are you really buying an extra 2 TB of storage to archive 1 TB of data? Because that's what you'd need to do to archive 2 copies of each file. That's a huge waste of space and money that adds up when you're archiving a lot of data.
If your needs are small or your pockets deep, you can afford to do what you're proposing, but for the rest of us who aren't made out of money it's just not practical.
On the other hand, when I have worked at places which could afford to have multiple archives at various locations, I've made sure each of those archives were protected with par2 or dvdisaster, so I could recover from both rather than have one of the archives fail because of a bit flip error.
$ sudo zpool get size zdata
NAME   PROPERTY  VALUE  SOURCE
zdata  size      21.8T  -
It's fine that you "feel secure" with your current backup regimen -- and I certainly hope you never lose any important data.
After losing data once, though, I promised myself I'd do my best to make sure that it never happened again. The "primary copy" of all my data lives on the individual machines (my workstation, primarily, but there's a bit on my main laptop too) but there's also a copy of it all on a server out in the garage as well as yet another server (see above) that I have in an ISP's facility nearby. There's yet another copy of a small fraction of my files (the "really, really, really important stuff") that's sitting in AWS (via tarsnap) as well.
Some folks are satisfied with a copy of their family photos copied onto a flash drive and tossed into a drawer or an external USB drive permanently sitting on the desk next to their computer. I know of several small companies in my area that thought they were safe with an external USB drive connected to their server... until they got hit with ransomware.
My laptop has a pair of mirrored SSDs, my workstation has a pair of mirrored SSDs and a pair of mirrored "spinners". The server in the garage (my "first backup") has RAID10. That box at the ISP has mirrored SSDs plus a "raidz2" that the backups live on. Some of us just want a little bit more reassurance than others. :-)
Which gets back to the original point. If you use a format that can be more easily recovered, then with the same number of copies your data is more secure.
You've probably been downvoted because it's perceived as showing off, but it is a nice setup.
I've also spent more time than I'm willing to admit planning, researching, configuring, and maintaining different backup strategies, and just wanted to say that I regret some of that. It's easy to become a data hoarder, and it's easy to spend more time preserving the data than it is actually worth. I mean, think about how much this data is worth to people other than you, i.e. what happens to it when you die. Life's short and there are so many things that are more exciting than backups.
Don't get me wrong though. Backups are important. Just know exactly how important they are to you.
5% parity archives is an easy sell, on top of 200% for off site copies.
But my troll aside, I agree. If losing the data would cause you harm or make you sad (losing photos of your kids for example), you definitely need to have multiple backups in multiple locations, ideally controlled by different parties (so one bug on your cloud provider's side doesn't wipe out both of the copies they store for you). I've been burned by this with personal data a few times. The stakes get even higher when you are responsible for someone else's data. If they don't want to pay for the extra storage, make sure they understand the risk involved.
If your data is not in three different places it might as well not exist.
...relative to ... ? Is it better than lzip? lzip sounds like it would also use LZMA-based compression, right? This  sounds like an interesting and more detailed/up-to-date comparison. Also by the same author BTW.
People began using xz mostly because they (e.g. distro maintainers like Debian) had started seeing 7z files floating around, thought they were cool, and so wanted a format that did what 7z did but was an open standard rather than being dictated by some company. xz was that format, so they leapt on it.
As it turns out, lzip had already been around for a year (though I'm not sure in what state of usability) before the xz project was started, but the people who created xz weren't looking for just anything that compressed better, they were looking for something that compressed better the way 7z does, and xz is that.
(Meanwhile, what 7z/xz is actually better at, AFAIK, is long-range identical-run deduplication; this is what makes it the tool of choice in the video-game archival community for making archives of every variation of a ROM file. Stick 100 slight variations of a 5MB file together into one .7z (or .tar.xz) file, and they'll compress down to roughly 1.2x the size of a single variant of the file.)
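A rough way to see the effect, if you have a pile of near-duplicate files around (names invented, and the variants have to fit within xz's dictionary at the chosen level):

xz -9 -k rom-v01.bin                     # one variant on its own
tar cf - rom-v*.bin | xz -9 > all.xz     # all variants in one solid stream
ls -l rom-v01.bin.xz all.xz              # all.xz stays close to the size of a single variant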
7z is an open standard, and the SDK is public domain:
Thus, the proposed benefits of the compression ratio apply equally to lzip as they do to 7z and xz. When the article talks about shortcomings of the xz file format compared to the lzip file format, it's talking about file structure and metadata, not compression algorithm. Just running some informal comparisons on my machine, an empty (zero byte) file results in a 36 byte lzip file and a 32 byte xz file, while my hosts file of 1346 bytes compresses to a 738 byte lzip file and a 772 byte xz file. An mtree file listing my home directory comes to 268 Mbytes uncompressed, resulting in an 81M lzip file and an 80M xz file (a difference of 720 Kbytes, less than 0.9% overhead). Suffice it to say, the compression of the two files is comparable. Yet, the lzip file format also has the advantages discussed in the article.
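The zero-byte comparison is trivial to reproduce with default settings, if anyone wants to check; exact sizes vary a bit by version:

: > empty
xz -k empty && lzip -k empty
ls -l empty.xz empty.lz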
That said, for long-term archival I wouldn't use any of the above: I prefer zpaq, which offers better compression than straight LZMA, along with dedup, journaling, incremental backup, append-only archives, and some other desirable features for archival. Together with an mtree listing (to capture some metadata that zpaq doesn't record) and some error recovery files (par2 or zfec), this makes a good archival solution, though I hesitate to call it perfect.
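Roughly, that workflow looks like this (paths are illustrative, and the mtree flags are the BSD ones):

mtree -c -K sha256digest -p ~/photos > photos.mtree   # capture ownership, modes, hashes
zpaq add photos.zpaq ~/photos photos.mtree            # deduplicated, journaled archive
par2 create -r10 photos.zpaq                          # recovery data on top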
401728 freebsd-11.0-release-amd64-disc1.iso.xz 5m0.606s
406440 freebsd-11.0-release-amd64-disc1.iso.lz 5m43.375s
430872 freebsd-11.0-release-amd64-disc1.iso.bz2 1m38.654s
440400 freebsd-11.0-release-amd64-disc1.iso.gz 0m27.073s
431740 freebsd-11.0-release-amd64-disc1.iso.zst 0m3.424s
Previously discussed here on HN back then:
The author has made some minor revisions since then. Here are the main differences to the page compared to when it was first discussed here:
And here's the full page history:
I'm not so sure; using tools suitable for long-term archiving by default might not be a bad practice. The thing about archiving is that it's often hard to know in advance what exactly you want to keep long-term. Using more robust formats probably won't cost much in the short term, but could pay off in the long term.
The fork compiled for me this week, when the official 0.3 version on Sourceforge wouldn't. I vaguely remembered par3 being discussed, but couldn't find anything usable. And that's an example of why to be wary of new formats, I guess?
The program does pretty much the same thing as it did a decade ago.
Unfortunately, mainstream tooling is largely fire-and-forget and never includes verification (e.g. copying succeeds even if the written data is getting garbled), so one is forced to use multi-step workflows to get around this. It's pretty discouraging that no strong abstractions exist in this space.
(People underestimate how frequently memory corruption can actually occur. Almost two years ago, when Overwatch first came out, the game kept crashing - it took me forever to find that the cause was a faulty DIMM. Hell, right now the R320 I have in my rack at home has an error indicator because one of my 2-year-old Crucial RDIMMs has an excessive amount of correctable errors.)
Same goes for the bit about JPEG. Sure, it might not be ideal technically, but recommending JPEG2000 (presumably as there is no JPEG2) with its ridiculously poor software support seems weak too. What use is a robust file that you can't open?
If you can't control the underlying storage, then ditto. Keeping and maintaining explicit parity chunks is somewhat inconvenient, but it works.
But if you just want to avoid bitrot of your own files, sitting on your own HDD, I'd recommend using a reliable storage system instead. ZFS or, at higher and more complicated levels, Ceph/Rook and its kin. That still offers a posix interface (unlike parity files), while being just as safe.
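Even on a single disk, ZFS gives you detection (checksums + scrub) and, if you ask for it, some self-healing. Roughly (pool and device names made up):

zpool create tank /dev/sdb      # single-disk pool, still checksums everything
zfs set copies=2 tank           # keep two copies of each block (for data written from now on)
zpool scrub tank                # periodically verify, and repair from the extra copy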
Do any other file systems other than ZFS support adding parity in a single HDD config? Last I checked getting ZFS in Linux required lots of side band steps due to licensing issues.
ZFSOnLinux is developed outside the Linux tree for 2 reasons. One, it is easier that way, and two, Linus does not want it in the main tree. Consequently, you need to install it in addition to the kernel as if it were entirely userspace software. That does not add any more difficulty than, say, installing Google Chrome. :/
Parity is just ECC with a single bit. ECC here usually means Reed-Solomon, which is just a fancy name for building a bigger set of equations than you have data chunks; the extra equations are the redundancy. Usually you should aim for +20-40% redundancy.
Ceph, HDFS and other distributed storage systems implement erasure coding (which is subtly different from error correction coding), which I would recommend for handling backups.
I think for backup (as in small-scale, "fits on one disk") error-correcting codes are not a really good approach, because IME hard disks with one error you notice usually have made many more errors - or will do so shortly. In that case no ECC will help you. If, on the other hand, you're looking at an isolated error, then only very little data is affected (on average).
For example, a bit error in a chunk in a tool like borg/restic will only break that chunk; a piece of a file or perhaps part of a directory listing.
So for these kinds of scenarios "just use multiple backup drives with fully independent backups" is better and simpler.
For large-scale in-house things: Ceph regularly scrubs the data (compares checksums), and DreamHost has DreamObjects.
Thanks for mentioning borg/restic, I have never heard of them. (rsnapshot [rsync] works well, but it's not so shiny) Deduplication sounds nice. (rsnapshot uses hardlinks.)
That made me look for something btrfs based, and here's this https://github.com/digint/btrbk seems useful (send btrfs snapshots to a remote somewhere, also can be encrypted), could be useful for small setups.
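Under the hood that's basically just btrfs snapshots plus send/receive, which you can also drive by hand (paths/hosts invented):

btrfs subvolume snapshot -r /home /home/.snapshots/home-20180601   # read-only snapshot (required for send)
btrfs send /home/.snapshots/home-20180601 | ssh backuphost btrfs receive /backup/
# incremental: send only the difference, assuming the parent snapshot already exists on the other end
btrfs send -p /home/.snapshots/home-20180601 /home/.snapshots/home-20180608 | ssh backuphost btrfs receive /backup/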
(1) They need full support for all FS oddities (xattrs, rforks, acls etc.) wherever you move the data
(2) They don't checksum the data at all.
The newer tools don't have either problem that much: For (1) they pack/unpack these in their own format which doesn't need anything special, so if you move your data twice in a circle you won't lose any (but their support for strange things might not be as polished as e.g. rsync's or GNU coreutils). And for deduplication they have to do (2) with cryptographic hashes.
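E.g. a minimal borg session looks roughly like this (repository path is made up):

borg init --encryption=repokey /backup/repo       # deduplicating, content-addressed repository
borg create --stats /backup/repo::home-{now} ~/   # chunks identified by cryptographic hashes
borg check /backup/repo                           # verify repository and archive consistency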
However (as an ex-dev of one of these) they all have one or the other problem/limitations that won't go away. (Borg has its cache and weak encryption, restic iirc has difficult-to-avoid performance problems with large trees etc.)
Something that nowadays might also need to be discussed is if and how vulnerable your on-line backup is against BREACH-like attacks. E.g. .tar.gz is pretty bad there.
Yeah, crypto is something that doesn't play well with dedupe, especially if you don't trust the target backup server.
Uh, BREACH was a beast (he-he). I'm a bit still uneasy after thinking about how long these bugs were lurking in OpenSSL. Thankfully the splendid work of Intel engineers quickly diverted the nexus of our bad feels away from such high level matters :|
That's something that btrfs/ZFS/Ceph should/could fix. (And btrfs supports incremental mode for send+receive.)
xz can be amazing. It can also bite you.
I've had payloads that compress to 0.16 with gzip then compress to 0.016 with xz. Hurray! Then I've had payloads where xz compression is par, or worse. However, with "best or extreme" compression, xz can peg your CPU for much longer. gzip and bzip2 will take minutes and xz -9 is taking hours at 100% CPU.
As annoying as that is, getting an order of magnitude better in many circumstances is hard to give up.
My compromise is "xz -1". It usually delivers pretty good results, in reasonable time, with manageable CPU/Memory usage.
FYI. The datasets are largely text-ish. Usually in 250MB-1GB chunks. So talking JSON data, webpages, and the like.
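For anyone tuning this themselves, the knobs are basically just the level and (on xz 5.2+) the thread count; file names below are only examples:

xz -1 -T0 chunk.json      # fast, modest ratio, modest memory
xz -9e -T0 chunk.json     # "extreme": sometimes dramatically smaller, sometimes just dramatically slower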
If you store enough of the same type of data, invest in redesigning the application. There's a reason we all use jpegs over zipped bitmaps...
It's because it's an appropriate compression - just like xz can be? Not sure what you're actually suggesting here.
When I last looked into this issue, it seemed that erasure codes, as with Parchive/par/par2, were the way to go. (As others have mentioned here.) I haven't tried it out as I haven't needed that level of robustness.
Then why use the default settings?
I tend to use the maximum settings, which are much more of a memory hog, but I have enough memory where that's not an issue.
Just use the settings that are right for you.
I think he saw "'best' compression" and stopped looking there.
It's not like xz is unable to be lighter on memory, if that's what you want. It's an option setting away.
This article is likely more relevant to tape archives than anything most people use today.
When I burn data (including xz archives) on to DVD for archival storage, I use dvdisaster for the same purpose.
I've tested both by damaging archives and scratching DVDs, and these tools work great for recovery. The amount of redundancy (with a tradeoff for space) is also tuneable for both.
 - https://github.com/Parchive/par2cmdline
 - http://dvdisaster.net/
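If you want to convince yourself the recovery works, the test is easy to reproduce (file name, offset, and damage size are arbitrary):

par2 create -r10 archive.tar.xz
dd if=/dev/urandom of=archive.tar.xz bs=1 count=512 seek=100000 conv=notrunc   # simulate corruption
par2 verify archive.tar.xz.par2    # reports damaged blocks
par2 repair archive.tar.xz.par2    # restores the original file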
The author seems to think the xz container file format should do that.
When you remove this requirement, nearly all his arguments become moot.
On the contrary. People archive files to save space, exchange them with each other over unreliable networks that can corrupt data, and store them in RAM and on disks that can corrupt data, even if only temporarily. Compression formats are there to help with that; this is their main purpose. This is why fast and proper checksumming is expected - but not a cryptographic hash like SHA-256, which adds nothing to this goal but overhead.
I can understand the concerns about versioning and fragmented extension implementations though.
Actually, one uses the tape archive utility, tar, to write directly to the tape. (-:
renice 19 -p $$ > /dev/null 2>&1
Use tar + xz to save extra metadata about the file(s), even if it is only 1 file.
tar cf - ~/test_files/* | xz -9ec -T0 > ./test.tar.xz
sha256sum ~/test_files/* | sort -n > ~/test_files/.sha256
23725264 zig-linux-x86_64-0.2.0.cc35f085.tar.xz 63.05 seconds
23627771 zig-linux-x86_64-0.2.0.cc35f085.tar.lz 83.42 seconds
Perhaps folks are trying to stick with packages that are in their base repo. p7zip is usually outside of the standard base repos.
Packing a bunch of files together as .tgz is a quite universal format and compresses most of the redundancy out. It has some pathological cases but those are rare, and for general files it's still in the same ballpark with other compressors.
I remember using .tbz2 at the turn of the millennium because at the time download/upload times did matter, and in some cases it was actually faster to compress with bzip2 and then send less data over the wire.
But DSL broadband pretty much made it not matter any longer: transfers were fast enough that I don't think I've specifically downloaded or specifically created a .tbz2 archive for years. Good old .tgz is more than enough. Files are usually copied in seconds instead of minutes, and really big files still take hours and hours.
None of the compressors really turn a 15-minute download into a 5-minute download consistently. And the download is likely to be fast enough anyway. Disk space is cheap enough that you haven't needed the best compression methods for ages in order to stuff as much data on portable or backup media.
Ditto for p7zip. It has more features and compresses faster and better, but for all practical purposes zip is just as good. Even though it's slower, it won't take more than a breeze to create and transfer, and it unzips virtually everywhere.
The only issue I've had with xz is that it doesn't notice when it isn't actually compressing the file and fall back to storing it uncompressed, like other utilities do, so if you try to xz a tar file with a bunch of already highly compressed media files, it both takes forever and you end up with a nontrivially larger file than you started with.
Also, I like that, unlike gzip, xz can sha256 the uncompressed data if you use the -C sha256 option, providing a good integrity check. Yes, I would really like to use a format that doesn't silently decompress incorrect data and I can't understand why the author of this article thinks that is a bad thing. For backups I keep an mtree file inside the tar file with sha512 of each file and then the -C sha256 option to be able to easily test the compressed tar file without needing another file. In some cases I encrypt the txz with the scrypt utility (which stores HMAC-SHA256 of the encrypted data).
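In practice that's just a couple of commands (file names invented):

tar cf backup.tar ~/stuff
xz -C sha256 backup.tar                       # embed a SHA-256 of the uncompressed data
xz -t backup.tar.xz                           # later: test integrity without extracting
scrypt enc backup.tar.xz backup.tar.xz.enc    # optional: encrypt; scrypt stores an HMAC-SHA256 as well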
A major problem with zip is the "codepage hell" (it has been almost eradicated in browsers but still lives on in zip archives, e-mails, and non-.Net Windows programs). With 7z you just always know nobody is going to have problems decoding the names of the files inside it, whatever languages those are in, regardless of the system locale.
If your storage fails, maybe you'll have a problem, but you'd have a problem anyway.
Sometimes I feel like genuine technical concerns are buried by the authors being jerks and blowing things way out of proportion. I, for one, tend to lose interest when I hear hyperbolic mudslinging.
What is the probability of a complete HD failure in a year?
tar c foo | gzip > foo.tar.gz
tar c foo | bzip2 > foo.tar.bz2
document.body.style['max-width'] = '550px'; document.body.style.margin = '0 auto'
"What medium should be used for long term, high volume, data storage (archival)?"
It mostly focuses on the media instead of formats though.
Amazon Glacier runs on BDXL disc libraries (like a tape library). There's nothing truly expensive about producing BDXL media, there just isn't enough volume in the consumer market to make it worthwhile. If you contract directly with suppliers for a few million discs at a time, that's not an issue (you did say high-volume, right?).
For medium-scale users, tape libraries are still the way to go. You can have petabytes of near-line storage in a rack. Storage conditions are not really a concern in a datacenter, which is where they should live.
(CERN has about 200 petabytes of tapes for their long-term storage.)
If you mean "high-volume for a small business", probably also tapes, or BD discs with 20% parity encoding to guard against bitrot.
Small users should also consider dumping it in Glacier as a fallback - make it Amazon's problem. If you have a significant stream of data it'll get expensive over time, but if it's business-critical data then you don't really have a choice, do you?
This has been a rumor I've heard for quite a while (probably since shortly after Glacier was announced) but has it ever been confirmed?
You can also use xz on top of something that can correct errors, such as par2.
> According to [Koopman] (p. 50), one of the "Seven Deadly Sins" (i.e., bad ideas) of CRC and checksum use is failing to protect a message length field. This causes vulnerabilities due to framing errors. Note that the effects of a framing error in a data stream are more serious than what Figure 1 suggests. Not only data at a random position are interpreted as the CRC. Whatever data that follow the bogus CRC will be interpreted as the beginning of the following field, preventing the successful decoding of any remaining data in the stream.
> Except the 'Backward Size' field in the stream footer, none of the many length fields in the xz format is protected by a check sequence of any kind. Not even a parity bit. All of them suffer from the framing vulnerability illustrated in the picture above.
Wow ... that is inexcusably idiotic. Whoever designed that shouldn't be programming. Out of professional disdain, I pledge never to use this garbage.
We certainly should have environments where we can tell someone their code is shit; it's just silly and counterproductive to then leap to attacks on the abilities of the person behind it.