Xz format inadequate for long-term archiving (nongnu.org)
209 points by martianh on Oct 22, 2016 | 91 comments



OK, I'll bite. This article is overblown in its criticisms and has some issues of its own.

1. The article itself recommends that "a tool is supposed to do one thing and do it well". Well, xz is for compression, nothing else. If you want to check integrity, use a hashing tool such as sha256sum. If you want to recover from errors, generate parity files using par2 or zfec (see the sketch after this list).

2. The article claims xz was chosen by prominent open source projects due to hype. No, it was chosen because of its favorable size/speed tradeoffs.

3. It's not clear what the author means by "long-term archiving". Archives professionals (as in, people who are actually employed by memory institutions) will tell you of many factors (all unmentioned here) that have a bearing on whether a file format (particularly a compressed one) is suitable for use in data preservation.

4. The section on trailing data is particularly bizarre. By claiming that xz is "telling you what you can't do with your files", it seems that the author considers it perfectly reasonable to append arbitrary data to the end of a file and expect it to continue functioning. Is my normality detector off today or is this just a wacky thing to want to do?

5. The article is written by the author of lzip (which it compares favorably to xz), but does not disclose this. Admittedly it is hosted on the lzip site, but overall the article comes across as an opinionated hit-piece designed to sow doubt in a competitor.
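
A quick sketch of what point 1 looks like in practice, with hypothetical filenames; each tool does exactly one job:

  xz -9 -k data.tar                              # compression, nothing else (-k keeps the original)
  sha256sum data.tar.xz > data.tar.xz.sha256     # integrity check, stored separately
  par2 create -r10 data.tar.xz.par2 data.tar.xz  # ~10% parity data for recovery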


> The article claims xz was chosen by prominent open source projects due to hype. No, it was chosen because of its favorable size/speed tradeoffs.

No: that's why the LZMA compression algorithm was chosen, and why the .lzma container format temporarily became popular as people started using lzma-tools. The move to xz, which results in slightly larger files for seemingly arbitrary reasons, and which technically provides one benefit (seekability) that almost everyone defeats (by compressing a tar file) while relying on the same underlying compression algorithm, is less well motivated.


Were there many prominent open source projects using the LZMA format before the creation of xz? Debian, for example (which is where this particular flamefest originates), adopted[0] xz the year following the stable v5.0.0 release of XZ Utils[1].

My impression was that the whole 7zip->p7zip->lzma_alone->xz journey was concluded before any "xz hype" began.

[0] https://lists.debian.org/debian-devel-announce/2011/08/msg00...

[1] http://git.tukaani.org/?p=xz.git;a=blob;f=NEWS;hb=HEAD


I don't quite understand the timeline you are drawing here but the Debian Package Manager is a perfect example of a prominent open source project which used the LZMA format before the creation of xz. They recently shifted to "deprecate" the LZMA format, a change which I have removed from the negligible fork of dpkg I ship to the tens of millions of people running Cydia.


Regarding #4: trailing data after an archive is common because of tape drives. You'll typically write the archive raw to an appropriately sized tape, without a filesystem (which is just another format you have to maintain a parser for). So because the file will be smaller than the tape, there will be garbage at the end of the tape. Because we wrote the file to the tape directly, there is no way of knowing where it ends, and a tool reading the file will need to just deal with the garbage data.
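
Roughly what that looks like, assuming the tape shows up as /dev/st0 and hand-waving the block size:

  dd if=backup.tar.xz of=/dev/st0 bs=64k                        # write the archive straight to tape, no filesystem
  dd if=/dev/st0 bs=64k | xz -dc --single-stream > backup.tar   # reading it back drags along whatever followed the archive on tape;
                                                                # xz needs --single-stream to ignore that trailing junk, gzip merely warns about it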


Doesn't one typically write to tape drives with tar, rather than writing a raw file?


Indeed. This is why the tar format explicitly allows garbage data at the end. So then people started pondering all of the nifty or clever things they could do with tar files. And they didn't want to give it up when they started compressing the tar files.


Couldn't you just reverse the order then, and create an xz.tar? Maybe I don't understand the benefit of tarring the data first.


Tar is just bundling into a single file AFAIK. There is a slight benefit, depending on your compression tool, to tarring and then compressing, because (AFAIK again) some tools compress files individually and then write them into a hierarchical file (I guess this is what xz does as well, since it's seekable?). If you tar first, these tools will work better, since they encode patterns found across all files instead of doing it per file (which means e.g. if there is a header once per file, that redundancy gets compressed away in the tar-then-compress case, not in the compress-per-file case).
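
The two orderings side by side, with a hypothetical flat directory of files:

  tar -cf - dir/ | xz -9 > dir.tar.xz              # compress the bundle: cross-file redundancy improves the ratio
  xz -9 -k dir/* && tar -cf dir.xz.tar dir/*.xz    # compress each file, then bundle: worse ratio, but damage stays per-file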


I've created sortedtar because of this assumption, and while it is correct, the benefit is mostly negligible except for some edge cases.


Old man moment!

Kids these days don't appreciate having random addressable storage for archive/backup data!


Garbage data is not a problem since the length is known.


I do not understand the xz format enough to evaluate that claim myself, but TFA explicitly claims that garbage data is a problem.


Hum, no. When you tar a set of files directly into a tape, you don't know the resulting tar size beforehand. Even less if you compress the result.


I would think you could just write the size at the end of the tape?


When you're reading the data later, how would you know where to find "the end" if you don't already know the length?


You don't know the end of the data, but presumably(?) you know the end of the tape.


We are talking past each other here...

Making a backup with tar is done by typing something like this in bash:

> tar -cf - dir1 dir2 dir3 > /dev/tape

That will (hopefully, I doubt I got the tar switches right) backup those dirs into the tape (that will actually have a weird name, not '/dev/tape').

Now, in practice Linux doesn't always know the size of a tape you inserted. But this is not the issue; if you accept the seeks needed for that, you'd be better off writing the size at the beginning anyway.


Considering neither lzip nor xz is a commercial product, I don't think your final point stands. To me, it only makes sense that the person who wrote a "competing" file format would have a strong negative opinion of xz.


When I use xz for archival purposes I always use par2[1] to provide redundancy and recoverability in case of errors.

When I burn data (including xz archives) on to DVD for archival storage, I use dvdisaster[2] for the same purpose.

I've tested both by damaging archives and scratching DVDs, and these tools work great for recovery. The amount of redundancy (with a tradeoff for space) is also tuneable for both.

[1] - https://github.com/BlackIkeEagle/par2cmdline

[2] - http://dvdisaster.net/
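
For anyone wanting to reproduce the damage test, this is roughly the workflow (names and offsets made up):

  par2 create -r20 photos.tar.xz.par2 photos.tar.xz                         # 20% redundancy
  dd if=/dev/urandom of=photos.tar.xz bs=1 count=16 seek=4096 conv=notrunc  # clobber 16 bytes in place
  par2 verify photos.tar.xz.par2                                            # reports the file as damaged
  par2 repair photos.tar.xz.par2                                            # rebuilds it from the parity volumes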


I recently did the same, but used tar without compression. For starters most of the data I was archiving didn't compress well (like photos), but mainly if par2 fails, I can still recover at least some of the data.

With compressed files you usually need the archive to be perfect to recover any data, but you can use fixtar [0] to recover some data from a corrupted tar archive.

I used BluRay as the media cost is about the same as DVDs per GB here (at least for 25GB discs). Also there are HTL discs available, which supposedly have greater longevity than DVDs.

[0] - http://riaschissl.bestsolution.at/2015/03/repair-corrupt-tar...


How much parity data (%) do you have?

What's the oldest disc you've tested?

What do you think about PAR3?


I'm afraid my archiving and testing are not very systematic. I don't usually go back and test my burnt DVDs, except when I need to get data off of them, which is not very often. But so far I haven't run in to issues, and have been pretty lucky with the DVDs I've used. I don't think I can recall a situation where dvdisaster actually had to recover from errors on any old DVD. Of course, it's a good idea to transfer valuable data from old DVDs to new ones every now and then, as they won't last forever.

I just tried a DVD from 2013 (Verbatim DVD-R, 16x, 4.7 GB) and it read fine, without any errors, and dvdisaster found that the checksums on it matched.

I usually try to use at least 20% redundancy, but will settle for less if the data's not very important. If the data's really important I'll max out the amount of redundancy (up to dvdisaster's limit, which I don't remember off-hand). Sometimes I'll even burn an extra DVD with the same data, both with dvdisaster error correction on them.

As for par3, I only heard about it for the first time today in this thread. So I have no opinions on it except to say that if it's an improvement over par2, I'm all for it. Backwards compatibility with par2 also would be nice.


Thanks for the info

Shame about the status of the dvdisaster project - as of 2015 the Mac OS X and Windows ports are discontinued.


That's just for the new versions. You should still be able to get the old versions and those should still work just fine.


The article's right that you probably should not use xz (or even any similar formats) for archival purposes.

However what's totally awesome about xz as a container format is seekable random access to compressed files. I use it to store infrequently used disk images, which I can boot up without even decompressing them. (https://rwmj.wordpress.com/2013/06/24/xz-plugin-for-nbdkit/)
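
For reference, something like this is what makes the random access work (the block size is a tradeoff I'm guessing at; see the linked post for the exact nbdkit invocation):

  xz -9 -k --block-size=16MiB disk.img    # multiple independent blocks = seek points
  nbdkit xz file=disk.img.xz              # serve the image over NBD without decompressing it to disk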


I agree completely, in fact it seems odd that the article is really complaining about the format being more complicated than bzip2 or gzip, when the comparison is not really apples-to-apples. Comparing it with zip would seem more sensible to me.

A compressed container format that allows quick access to individual files is very useful and actually improves data security - corruption in one part of the data will be far less likely to ruin the entire collection of files, whereas corruption in a compressed tar file may lead to the loss of everything (I know recovery tools exist, but they cannot prevent one file in the container being dependent upon data from another file.)


You are confused. xz does not provide the ability to find individual files inside of the format: it is not at all comparable to zip files, and like the other formats mentioned it compresses one file. The "seekable" property is that it lets you somewhat efficiently decompress arbitrary byte ranges inside of the file, which is why a compressed disk image (as used by your parent commenter) benefits from this property but the industry-standard usage of compressing a tar file (which is a file format which inherently makes it impossible to find individual files without reading the whole thing) totally throws away this benefit.


A tar file doesn't have a central directory like a zip, so you do have to search all the file headers. However, each file header contains the length of the file it describes, which lets you seek past the content of any files you don't care about if the tar isn't stored in a way that would prevent seeking.


That would work if your tar was uncompressed. The lengths in the tar are meaningless in a compressed file.


That's not true, xz allows efficient decompression of arbitrary byte ranges inside the file.


Does it let you seek by a decompressed byte count, or does it let you decode arbitrary compressed byte ranges?


Really? I thought that's what the page was telling me when it said .xz was a container format. In that case, its file format is extra odd!


Yeah: what they mean by "container format" here is similar to the usage in video compression file formats, where ".avi" doesn't imply any particular compression algorithms. An xz file is a container format which is designed for use to store another container format.


The article's point is, in part, that xz is more like zip or rar than gzip or bzip2, and use of it like those is incongruous (e.g. .tar.xz files). (Its other points are that xz has a number of systematic flaws, in its error detection and some "misfeatures".)


Ah - I presumed that the debian xz-based packages were using it as an archive, instead of the old 'ar'. But I just looked and you are right, the xz support is for the data.tar.xz that itself is still inside the 'ar' archive format. That is odd. I guess they chose this way for simplicity - far less code needs changing in all the .deb file processing utilities.


I can't speak for Debian, but I personally did not realize that xz was as I stated above before reading this article, so I wouldn't be surprised if some people simply didn't realize it.


> Unsafe: 1) Likely to cause severe data loss even in case of the smallest corruption (a single bit flip). 2) Likely to produce false negatives.

It does seem like xz is somewhat overengineered, but I don't think that's a characteristic unique to it; any other algorithm with similar compression performance will yield similar behaviour on corrupted data, since the whole point and why compression works is to remove redundancy. I say use something like Reed-Solomon on the compressed data, thus introducing a little redundancy, if you really want error correction.


I haven't read the whole article yet, but gzip and bzip2 recover fairly well from errors. As they are divided into blocks, flipping a bit will make you lose, at most, one block. This is not related to error correction though, but a way of mitigating losses.

>> Just one bit flip in the msb of any byte causes the remaining records to be read incorrectly. It also causes the size of the index to be calculated incorrectly, losing the position of the CRC32 and the stream footer.

This sounds severe enough to me.


As they are divided into blocks, flipping a bit will make you lose, at most, one block.

What happens if bits in the block header are corrupted? If it can't find the start of the next block, the same thing will happen.

Also, breaking up the data into blocks will decrease compression, since each block starts with a fresh state. It is ultimately a tradeoff between compression ratio and error resistance.


> If it can't find the start of the next block, the same thing will happen.

This has been addressed in the article. "Bzip2 is affected by this defect to a lesser extent; it contains two unprotected length fields in each block header. Gzip may be considered free from this defect because its only top-level unprotected length field (XLEN) can be validated using the LEN fields in the extra subfields. Lzip is free from this defect."

> Also, breaking up the data into blocks will decrease compression

This has been tested very thoroughly. Larger block sizes give rapidly diminishing marginal returns (man bzip2). Now, the largest you can go with bzip2 is 900KB.


gzip usually references the data from the previous window in the current window. If you lose one block, you're likely to lose the following one. And the one following it. etc. until the end of the compressed data.


Yeah, seems like a trivial fix to me. Use Par2.


Par2 has inherent limitations that make it look distinctly historical (no Unicode etc.). There is a preliminary Par3, but the main reason why Par doesn't enjoy the popularity it should have is simple: Abysmal tooling.

I've tried all the usual implementations that are available on Windows recently, and they were all unusable. Like in "try to give it a hundred files, each in a single-megabyte range, and it will crash hard".


Why would par2 need to handle anything related to encoding? It's acting on the data independently of the actual contents of the data.

I'd agree about the tooling problems with it, it's "acceptable" on a linux/unix command line but I've never seen anything elsewhere that even looked halfway usable.


File names.


> It does seem like xz is somewhat overengineered, but [...] any other algorithm with similar compression performance will yield similar behaviour on corrupted data

You seem to confuse the format and the compression algorithm. From what the article says, the format seems bad. Once any data is damaged, parsing the rest of the file (as opposed to merely decompressing the data in it) goes out the window. There is no way to re-synchronize with the data blocks after corruption.


Of course, any archival strategy that leaves you vulnerable to the problems discussed in the article is defective. You ought to be backing up to multiple physically distant locations if you're really concerned about archival.

The argument makes more sense for the casual user who isn't paying for a professional backup service. But how many of those are using xz? They're more likely to be using zip files; and they'll be in a better place for it.

The zip (concatenation of compressed streams) vs tar.gz (compression of concatenated streams) distinction is useful in an archival context, if one is genuinely worried about bitrot and the risk of needing to recover partial data where the container is damaged. A concatenation of compressed streams will have worse compression especially for lots of small files, but it is far easier to recover from an error.
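
Concretely, the two shapes, with a hypothetical docs/ directory:

  zip -r docs.zip docs/        # each member compressed separately: a damaged member costs you one file
  tar -czf docs.tar.gz docs/   # one compressed stream: better ratio, but damage can take out everything after it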


Another option, should you need better compression than zip with the same damage tolerance, is xz files in an uncompressed tar.


How does TAR mitigate the issues mentioned in the paper? Sincerely curious, my knowledge is limited. AFAICS, XZ can still "corrupt" the TAR stream and you'll need quite a bit of extra error correction to recover from that.


OP means xz all individual files, then tar them without compression.
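
I.e. something along these lines, with a hypothetical photos/ tree:

  find photos/ -type f -exec xz -9 {} +   # compress every file in place (each becomes file.xz)
  tar -cf photos-xz.tar photos/           # bundle the already-compressed files without further compression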


Not as silly a suggestion as it sounds


All of the arguments here seem to assume corruption (even extensibility is a form of it), and to treat it as a thing that can be fixed by the decompressor. It seems xz isn't ideal in that case, but none of the others are great either. As mentioned elsewhere in this thread, if integrity is your primary concern you should use PAR or some other error-correction mechanism. If you just need a binary integrity indicator you should have separate hashes published somewhere.

xz wasn't chosen for its archival integrity AFAIK, it was chosen for its size. Package distribution ought to spend as much resource as practical to reduce bytes over the wire. Size is why bzip2 "won" over gz despite being many times slower and requiring much more RAM. lrzip would be my first choice for maximum compression, but its downstream resource requirements are far too high vs any of the others, so from my experience xz would be the answer anyway.


> xz wasn't chosen for its archival integrity AFAIK, it was chosen for its size.

No: that's why the LZMA compression algorithm is being used. The switch from the original LZMA container format to the "improved" xz container format had nothing to do with size, as the old container generated slightly smaller files in addition to not having this combination of negative properties.


I remember the hype around LZMA2 and xz being that, because it uses blocks, you can do the compression and decompression on multiple threads. LZMA was generally considered too slow to use for regular usage (only for archival, basically) and multicore was the new way to scale so this was perfect.
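
That pitch boils down to one flag; -T0 uses all cores and splits the input into independent blocks (which is also part of why the output ends up slightly larger):

  xz -T0 -9 -k big.tar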


The time for complaining about xz has well and truly passed. One might infer that the problems with xz are related to the attitude of other developers...

But I digress. When you create archival backups you must use error-correcting systems. Par is a good place to start. https://multipar.eu/ https://github.com/Parchive/par2cmdline


While I agree that you should use PAR-like error correction systems when doing archival, one of the points of the article is that xz has poor error detection and recovery, compared to any of the other formats available.

If it offered none, and explicitly stated that, it would be one thing, but it offers some that functions quite poorly in practice, which is rather another.


It's only a problem if you are relying on it to detect and correct errors. I've got SHA checksums and FEC and ZFS and cascading backups for that, which deliberately don't rely on one code provider not having critical bugs. Also, I may have crypto, and I need error detection and FEC on the outside of that. So it seems like a storm in a teacup really.

I think there is scope in xz to embed some FEC in the archive as a backwards-compatible extension. Not sure if the decoding scheme could be changed to reduce error cascades etc.


> This article started with a series of posts to the debian-devel mailing list [Debian], where it became clear that nobody had analyzed xz in any depth before adopting it in the Debian package format.

Debian source packages are cryptographically signed. Debian archives are also cryptographically signed. There's no error-correction, but you don't want that. You just want to verify integrity every time you make a copy, and re-copy if it didn't copy right. And all the tools do that.

I remember this discussion on debian-devel, and there were a lot of people pointing out that the author's guesses at Debian's threat model had little to do with Debian's actual threat model, at which point the author became frustrated.


The Debian file format also doesn't take advantage of any supposed benefit of xz over lzma, and so is simply taking on the cost of xz's larger file size for what I consider to be "absolutely no good reason" (and as a major downstream of dpkg I pointed this out to its author a month or two ago when I re-asserted non-deprecated status for .lzma Debian packages, now one of only two noticeable changes I maintain to dpkg).


I very strongly disagree with the author's complaint that "Xz is unreasonably extensible". The history of computing is littered with restrictive design choices that ended up biting us in the ass later. Let's design FAT12. Oops, a 32MB file size limit is no longer sufficient? Let's make FAT16, err I mean FAT32, err no really NTFS... How about 7-bit ASCII, err 8-bit ISO-8859, err I mean 16-bit Unicode, err no I really mean 32-bit Unicode, now that should be sufficient, right?

So it was the right choice for the Xz format to use variable-length integer encoding. Not all integer fields will need to represent values up to 2^63, but it doesn't matter.


I don't think the variable-length encoding itself is the author's concern, it's adding additional filters later without an overall version number. Unlike long file names in FAT32, adding additional filters is not backwards compatible. With a version number, it's super clear what can decode a file. Without, your xz file is only decodeable if your decoder implements the same set of filters as the encoder, rather than the much simpler approach of gzip where any gzip decoder can decode any gzip file.
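
A small illustration of the concern, with a made-up input file: both commands produce a .xz, but only decoders that implement the delta filter can read the second one, and no version field tells you that up front.

  xz -9 -c samples.raw > plain.xz                               # plain LZMA2: any xz decoder handles it
  xz --delta=dist=4 --lzma2=preset=9 -c samples.raw > delta.xz  # delta + LZMA2: a decoder without the delta filter refuses it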


That's certainly an interesting read but if I really care about something I'll usually generate PAR files. At that point, does it really matter what archiver I'm using at the time or should I still be worried?

https://github.com/Parchive/par2cmdline


People who professionally care about long term archival usually don't care about heavy compression, as storage costs are always dwarfed by data governance and other costs.

Individuals may care about space more because the storage or medium is expensive for them. Even then, bulk data like photos and video doesn't compress well, so this argument is still pointless.


Does anyone know how zpaq compares? Lrzip [0] shows it as having desirable archival performance.

[0] http://ck.kolivas.org/apps/lrzip/README.benchmarks


ZPAQ is a defined standard format which embeds the decompression algorithm for each block as bytecode into the archive. One can define a transformation (e.g. LZ77 or color transformations of a picture) and compress this data through context mixing and arithmetic coding. It uses checksums, but I can't judge right now whether there is some issue implied concerning bit flips.


ZPAQ is a defined standard format which embeds the decompression algorithm for each block as bytecode into the archive.

Is that a Turing-complete bytecode? If so, I envision some interesting applications with regard to procedurally-generated content...


Yes, there are two examples for pi here: http://mattmahoney.net/dc/zpaqutil.html and one where the input does not exactly have to be pi, but can contain it: https://github.com/pothos/zpaqlpy/blob/master/test/mixedpi2....

But of course for real world use cases one would choose a general compression algorithm or write a specific one for a certain type of data you want to handle (because it will perform better than a general algorithm).


Is there any actionable advice here for a Windows user? I mean let's say one is using 7zip which defaults to .7z format. Does that correspond to any of the formats discussed in the article? Would it be better to use a different format or a different program?


It uses LZMA/LZMA2.


"the right way of implementing binary filters is to write a preprocessor that applies the filter to the data before feeding them to the compressor. (See for example mince)."

So, what does mince do? https://github.com/Kingsford-Group/mince/blob/master/README.... says:

"Mince is a technique for encoding collections of short reads."

That doesn't tell me much.

Also, http://bioinformatics.oxfordjournals.org/content/early/2015/...:

"We present a novel technique to boost the compression of sequencing that is based on the concept of bucketing similar reads so that they appear nearby in the file"

I'm still not sure I understand that fully, but guesstimate that it shuffles file content in such a way that it compresses so much better that it more than offsets the space needed to store the information on how to deshuffle the decompressed data. Is that correct?

Or does it require you to specify the parts and discard the reshuffling information, assuming you aren't interested in the order?
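
Either way, a crude way to see why the reordering helps, with nothing but shell, independent of mince (reads.txt is hypothetical, one read per line):

  xz -9 -c reads.txt | wc -c          # baseline compressed size
  sort reads.txt | xz -9 -c | wc -c   # grouping similar lines usually shrinks this, at the cost of the original order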


I think the latter. According to the decompression instructions in README.md, "The reads will NOT be in the same order as in the original file".

It sounds like it's meant for the results of DNA sequencing processes that produce sequences for random small pieces of the molecule.


Sure sounds a lot like the Burrows-Wheeler transform...


> everybody else will blindly use whatever formats we choose

I'm a developer and I was LITERALLY in the process of backing up my iPhoto library (200GB) using Xz just now.

welp



I love rzip but it is important to note that it is not a stream compressor. So it is unsuitable for most of the use cases of LZMA2 which can compress data on the fly or have data appended on the end of a file at a later date.

So rzip is probably excellent for archival situations (not counting any flaws the algorithm may have) but not so great for things like compressed tarballs (which are meant to be appendable).


Here's a description of the format: http://tukaani.org/xz/format.html


> The only reliable way of knowing if a given version of a xz decompressor can decompress a given file is by trial and error. The 'file' utility does not provide any help:

  $ file COPYING.*
  COPYING.lz: lzip compressed data, version: 1
  COPYING.xz: XZ compressed data
To me, this indicates a lack of data in the magic database used by file, more than something inherently bad with xz.
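
For what it's worth, xz's own --list mode does report more than file(1), e.g. the integrity check in use:

  xz -l COPYING.xz     # streams, blocks, sizes, and the check type (CRC32/CRC64/SHA-256/none)
  xz -lvv COPYING.xz   # adds per-block detail, if memory serves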


I just put files in tars because everything is already compressed. Really, 99% of formats already have their own compression - you might get a couple % by recompressing them. Maybe a few text files that I have are not compressed but I can probably live with them taking up space.


Tar is not a compressed format.

This is why it's usually paired with gzip or xz.


I know it's not a compression format. My point is most formats that you put in tars are already compressed! Jpgs - compressed, RAWs - already compressed, any text format that isn't a plain-text file - already compressed. Compressing compressed formats is hardly worth it.
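
Easy to check the claim on your own data (photo.jpg is whatever already-compressed file you have lying around):

  xz -9 -k photo.jpg && ls -l photo.jpg photo.jpg.xz   # the .xz is usually barely smaller, occasionally larger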


I just use ZIP, that way I can still open it in 10 years.


Not really related to the article, but would you look at that web page. This is what all of them should look like! Nothing fancy, just good content, 64.91KB in total.


And yet I can't read it on my phone without scrolling side to side.


That's because your browser sucks. Try Opera Mobile or something similar. Good browsers can reflow text to fit whatever screen size they display at. This website has zero CSS related to layout, so it is your browser's duty to render it nicely.

It boggles my mind why reflow is not one of the most important and most highly regarded features of any mobile browser.


The problem is quickly solved on Firefox thanks to its "Fit text to width" addon [1].

>it is your browser's duty to render it nicely

I think browsers should have the ultimate say on how to display websites, according to the user's preferences (think of text size and colors). Sadly, with websites gradually transforming into "web apps", users are losing control.

[1]: https://addons.mozilla.org/En-us/firefox/addon/fit-text-to-w...


Thanks for pointing to that add-on. Works nicely.


even after 'zooming in'? it didn't create a new viewport for me after doing that.

on opera mobile: are you truly comfortable to use a 'default on man-in-the-middle-ware' from china? (it proxies everything and compresses it, to save data bandwidth)

i stopped using it in August because of that. might've been too paranoid though.


Just turn off the "VPN" proxying if that makes you uncomfortable.

Personally I am less concerned about the Boogiemen from China/Russia than about the pervasive ruin of privacy by US companies.


Yeah, they should add the meta/viewport with scale=1 tag to the head, and it would improve readability on phones. I mean it's annoying that the web is in a state where that is necessary, but until it's solved, it's better (and not much more difficult) than nothing.


The only archival formats that are known to work over the long term are stone tablets, cave paintings and parchment.



