
Xz format inadequate for long-term archiving - martianh
http://www.nongnu.org/lzip/xz_inadequate.html
======
jl6
OK, I'll bite. This article is overblown in its criticisms and has some issues
of its own.

1. The article itself recommends that "a tool is supposed to do one thing and
do it well". Well, xz is for compression, nothing else. If you want to check
integrity, use a hashing tool such as sha256sum. If you want to recover from
errors, generate parity files using par2 or zfec (see the sketch at the end of
this comment).

2. The article claims xz was chosen by prominent open source projects due to
hype. No, it was chosen because of its favorable size/speed tradeoffs.

3. It's not clear what the author means by "long-term archiving". Archives
professionals (as in, people who are actually employed by memory institutions)
will tell you of many factors (all unmentioned here) that have a bearing on
whether a file format (particularly a compressed one) is suitable for use in
data preservation.

4. The section on trailing data is particularly bizarre. By claiming that xz
is "telling you what you can't do with your files", it seems that the author
considers it perfectly reasonable to append arbitrary data to the end of a
file and expect it to continue functioning. Is my normality detector off today
or is this just a wacky thing to want to do?

5. The article is written by the author of lzip (which it compares favorably
to xz), but does not disclose this. Admittedly it is hosted on the lzip site,
but overall the article comes across as an opinionated hit-piece designed to
sow doubt in a competitor.
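
A concrete sketch of the division of labor in point 1 (file names are
arbitrary; a par2 workflow is sketched elsewhere in the thread):

    $ sha256sum backup.tar.xz > backup.tar.xz.sha256    # record an integrity hash
    $ sha256sum -c backup.tar.xz.sha256                 # verify it after every copy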

~~~
LukeShu
Regarding #4: Padding out a file is popular because of tape drives. You'll
typically write the archive to an appropriately sized tape raw, without a
filesystem (which is just another format you have to maintain a parser for).
So because the file will be smaller than the tape, there will be garbage at
the end of the tape. Because we wrote the file to the tape directly, there is
no way of knowing where it ends, and a tool reading the file will need to just
deal with the garbage data.
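
A sketch of that workflow (device names vary; --single-stream is xz's flag
for silently ignoring input that follows the first compressed stream):

    $ tar -cf - dir1 dir2 | xz > /dev/nst0              # raw write to the tape
    $ xz -dc --single-stream < /dev/nst0 | tar -xf -    # read back, ignore garbage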

~~~
the_mitsuhiko
Garbage data is not a problem since the length is known.

~~~
marcosdumay
Hum, no. When you tar a set of files directly into a tape, you don't know the
resulting tar size beforehand. Even less if you compress the result.

~~~
mnarayan01
I would think you could just write the size at the end of the tape?

~~~
ikawe
When you're reading the data later, how would you know where to find "the end"
if you don't already know the length?

~~~
mnarayan01
You don't know the end of the _data_ , but presumably(?) you know the end of
the _tape_.

~~~
marcosdumay
We are talking past each other here...

Making a backup with tar is done by typing something like that on bash:

> tar -cf - dir1 dir2 dir3 > /dev/tape

That will back up those dirs onto the tape (which will actually have a weird
name, not '/dev/tape').

Now, in practice Linux doesn't always know the size of a tape you've
inserted. But that's beside the point: if you accept the seeks needed to read
a size stored at the end, you might as well write it at the beginning anyway.

------
pmoriarty
When I use xz for archival purposes I always use par2[1] to provide redundancy
and recoverability in case of errors.

When I burn data (including xz archives) on to DVD for archival storage, I use
dvdisaster[2] for the same purpose.

I've tested both by damaging archives and scratching DVDs, and these tools
work great for recovery. The amount of redundancy (with a tradeoff for space)
is also tuneable for both.

[1] -
[https://github.com/BlackIkeEagle/par2cmdline](https://github.com/BlackIkeEagle/par2cmdline)

[2] - [http://dvdisaster.net/](http://dvdisaster.net/)
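
For reference, the par2cmdline workflow looks roughly like this (file names
are arbitrary; -r sets the redundancy percentage):

    $ par2 create -r20 backup.tar.xz.par2 backup.tar.xz    # 20% redundancy
    $ par2 verify backup.tar.xz.par2                       # check for damage
    $ par2 repair backup.tar.xz.par2                       # reconstruct if needed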

~~~
voltagex_
How much parity data (%) do you have?

What's the oldest disc you've tested?

What do you think about PAR3?

~~~
pmoriarty
I'm afraid my archiving and testing are not very systematic. I don't usually
go back and test my burnt DVDs, except when I need to get data off of them,
which is not very often. But so far I haven't run into issues, and have been
pretty lucky with the DVDs I've used. I don't think I can recall a situation
where dvdisaster actually had to recover from errors on any old DVD. Of
course, it's a good idea to transfer valuable data from old DVDs to new ones
every now and then, as they won't last forever.

I just tried a DVD from 2013 (Verbatim DVD-R, 16x, 4.7 GB) and it read fine,
without any errors, and dvdisaster found that the checksums on it matched.

I usually try to use at least 20% redundancy, but will settle for less if the
data's not very important. If the data's really important I'll max out the
amount of redundancy (up to dvdisaster's limit, which I don't remember off-
hand). Sometimes I'll even burn an extra DVD with the same data, both with
dvdisaster error correction on them.

As for par3, I only heard about it for the first time today in this thread. So
I have no opinions on it except to say that if it's an improvement over par2,
I'm all for it. Backwards compatibility with par2 also would be nice.

~~~
voltagex_
Thanks for the info

Shame about the status of the dvdisaster project - as of 2015 the Mac OS X and
Windows ports are discontinued.

~~~
pmoriarty
That's just for the new versions. You should still be able to get the old
versions and those should still work just fine.

------
rwmj
The article's right that you probably should not use xz (or any similar
format) for archival purposes.

However what's totally awesome about xz as a container format is seekable
random access to compressed files. I use it to store infrequently used disk
images, which I can boot up without even decompressing them.
([https://rwmj.wordpress.com/2013/06/24/xz-plugin-for-nbdkit/](https://rwmj.wordpress.com/2013/06/24/xz-plugin-for-nbdkit/))
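
For seeking to work, the .xz has to be written as multiple blocks, which xz
only does when asked (the block size here is an arbitrary choice):

    $ xz --block-size=16MiB disk.img    # multi-block .xz with an index, seekable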

~~~
joosters
I agree completely, in fact it seems odd that the article is really
complaining about the format being more complicated than bzip2 or gzip, when
the comparison is not really apples-to-apples. Comparing it with zip would
seem more sensible to me.

A compressed container format that allows quick access to individual files is
very useful and actually improves data security - corruption in one part of
the data will be far less likely to ruin the entire collection of files,
whereas corruption in a compressed tar file may lead to the loss of everything
(I know recovery tools exist, but they cannot prevent one file in the
container being dependent upon data from another file.)

~~~
saurik
You are confused. xz does not provide the ability to find individual files
inside the format: it is not at all comparable to zip files, and like the
other formats mentioned it compresses a single file. The "seekable" property
is that it lets you somewhat efficiently decompress arbitrary byte ranges
inside the file. That is why a compressed disk image (as used by your parent
commenter) benefits from it, while the industry-standard usage of compressing
a tar file (a format which inherently makes it impossible to find individual
files without reading the whole thing) totally throws this benefit away.

~~~
yuubi
A tar file doesn't have a central directory like a zip, so you do have to
search all the file headers. However, each file header contains the length of
the file it describes, which lets you seek past the content of any files you
don't care about if the tar isn't stored in a way that would prevent seeking.
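
You can see the difference in practice (a sketch; GNU tar can seek past
unwanted members in a plain tar on a seekable medium, but a .tar.xz forces it
to decompress everything up to the member it wants):

    $ tar -xf big.tar some/one/file       # walks headers, seeks past content
    $ tar -xf big.tar.xz some/one/file    # decompresses the stream sequentially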

~~~
Skunkleton
That would work if your tar was uncompressed. The lengths in the tar are
meaningless in a compressed file.

~~~
shawnz
That's not true, xz allows efficient decompression of arbitrary byte ranges
inside the file.

~~~
Skunkleton
Does it let you seek by a decompressed byte count, or does it let you decode
arbitrary compressed byte ranges?

------
userbinator
_Unsafe: (1) likely to cause severe data loss even in case of the smallest
corruption (a single bit flip); (2) likely to produce false negatives._

It does seem like xz is somewhat overengineered, but I don't think that's a
characteristic unique to it; any other algorithm with similar compression
performance will yield similar behaviour on corrupted data, since the whole
point and why compression works is to remove redundancy. I say use something
like Reed-Solomon on the _compressed_ data, thus introducing a little
redundancy, if you really want error correction.

~~~
lake99
I haven't read the whole article yet, but gzip and bzip2 recover fairly well
from errors. As they are divided into blocks, flipping a bit will make you
lose, at most, one block. This is not related to error correction though, but
a way of mitigating losses.

>> Just one bit flip in the msb of any byte causes the remaining records to be
read incorrectly. It also causes the size of the index to be calculated
incorrectly, losing the position of the CRC32 and the stream footer.

This sounds severe enough to me.

~~~
userbinator
_As they are divided into blocks, flipping a bit will make you lose, at most,
one block._

What happens if bits in the block header are corrupted? If it can't find the
start of the next block, the same thing will happen.

Also, breaking up the data into blocks will decrease compression, since each
block starts with a fresh state. It is ultimately a tradeoff between
compression ratio and error resistance.

~~~
lake99
> If it can't find the start of the next block, the same thing will happen.

This has been addressed in the article. "Bzip2 is affected by this defect to a
lesser extent; it contains two unprotected length fields in each block header.
Gzip may be considered free from this defect because its only top-level
unprotected length field (XLEN) can be validated using the LEN fields in the
extra subfields. Lzip is free from this defect."

> Also, breaking up the data into blocks will decrease compression

This has been tested very thoroughly. Larger block sizes give rapidly
diminishing marginal returns (man bzip2). Now, the largest you can go with
bzip2 is 900KB.
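
That tradeoff is exactly what bzip2's numeric flags select (per man bzip2, -1
through -9 choose 100 kB to 900 kB blocks):

    $ bzip2 -1 big.log    # 100 kB blocks: slightly worse ratio, less lost per flip
    $ bzip2 -9 big.log    # 900 kB blocks: better ratio, a larger block at risk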

------
barrkel
Of course, any archival strategy that leaves you vulnerable to the problems
discussed in the article is defective. You ought to be backing up to multiple
physically distant locations if you're really concerned about archival.

The argument makes more sense for the casual user who isn't paying for a
professional backup service. But how many of those are using xz? They're more
likely to be using zip files; and they'll be in a better place for it.

The zip (concatenation of compressed streams) vs tar.gz (compression of
concatenated streams) distinction is useful in an archival context, if one is
genuinely worried about bitrot and the risk of needing to recover partial data
where the container is damaged. A concatenation of compressed streams will
have worse compression especially for lots of small files, but it is far
easier to recover from an error.
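
The two shapes, side by side (a sketch):

    $ tar -czf all.tar.gz dir/    # one gzip stream over the concatenated files
    $ zip -r all.zip dir/         # each file compressed as its own stream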

~~~
akx
Another option, should you need better compression than zip with the same
damage tolerance, is xz files in an uncompressed tar.

~~~
fbender
How does TAR mitigate the issues mentioned in the paper? Sincerely curious, my
knowledge is limited. AFAICS, XZ can still "corrupt" the TAR stream and you'll
need quite a bit of extra error correction to recover from that.

~~~
nothrabannosir
OP means: xz each file individually, then tar the results without compression.
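
Something like this (a sketch; -k keeps the original files around):

    $ xz -k dir/*                    # compress each file as its own .xz stream
    $ tar -cf backup.tar dir/*.xz    # plain tar as the uncompressed container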

~~~
lathiat
Not as silly a suggestion as it sounds

------
web007
All of the arguments here seem to assume corruption (even extensibility is a
form of it, from an old decoder's point of view), and to treat it as a thing
that can be fixed by the decompressor.
It seems xz isn't ideal in that case, but none of the others are great either.
As mentioned elsewhere in this thread, if integrity is your primary concern
you should use PAR or some other error-correction mechanism. If you just need
a binary integrity indicator you should have separate hashes published
somewhere.

xz wasn't chosen for its archival integrity AFAIK, it was chosen for its size.
Package distribution ought to spend as much effort as practical to reduce
bytes over the wire. Size is why bzip2 "won" over gz despite being many times
slower and requiring much more RAM. lrzip would be my first choice for maximum
compression, but its downstream resource requirements are far too high vs any
of the others, so from my experience xz would be the answer anyway.

~~~
saurik
> xz wasn't chosen for its archival integrity AFAIK, it was chosen for its
> size.

No: that's why the LZMA compression algorithm is being used. The switch from
the original LZMA container format to the "improved" xz container format had
nothing to do with size, as the old container generated slightly smaller files
in addition to not having this combination of negative properties.

~~~
amaranth
I remember the hype around LZMA2 and xz being that, because it uses blocks,
you can do the compression and decompression on multiple threads. LZMA was
generally considered too slow to use for regular usage (only for archival,
basically) and multicore was the new way to scale so this was perfect.

------
angry_octet
The time for complaining about xz has well and truly passed. One might infer
that the problems with xz are related to the attitude of other developers...

But I digress. When you create archival backups you must use error-correcting
systems. Par is a good place to start.
[https://multipar.eu/](https://multipar.eu/)
[https://github.com/Parchive/par2cmdline](https://github.com/Parchive/par2cmdline)

~~~
rincebrain
While I agree that you should use PAR-like error correction systems when doing
archival, one of the points of the article is that xz has poor error detection
and recovery, compared to any of the other formats available.

If it offered none, and explicitly stated that, it would be one thing, but it
offers some that functions quite poorly in practice, which is rather another.

~~~
angry_octet
It's only a problem if you are relying on it to detect and correct errors.
I've got SHA checksums and FEC and ZFS and cascading backups for that, which
deliberately don't rely on one code provider not having critical bugs. Also,
I may have encryption, and then I need error detection and FEC on the outside
of that. So it seems like a storm in a teacup really.

I think there is scope in xz to embed some FEC in the archive as a backwards
compatible extension. Not sure if the decoding scheme could be changed to
reduce error cascades, etc.

------
geofft
> _This article started with a series of posts to the debian-devel mailing
> list [Debian], where it became clear that nobody had analyzed xz in any
> depth before adopting it in the Debian package format._

Debian source packages are cryptographically signed. Debian archives are also
cryptographically signed. There's no error-_correction_, but you don't want
that. You just want to verify integrity every time you make a copy, and re-
copy if it didn't copy right. And all the tools do that.

I remember this discussion on debian-devel, and there were a lot of people
pointing out that the author's guesses at Debian's threat model had little to
do with Debian's actual threat model, at which point the author became
frustrated.

~~~
saurik
The Debian file format also doesn't take advantage of any supposed benefit of
xz over lzma, and so is simply taking on the cost of xz's larger file size for
what I consider to be "absolutely no good reason" (and as a major downstream
of dpkg I pointed this out to its author a month or two ago when I re-asserted
non-deprecated status for .lzma Debian packages as now one of only two
noticeable changes I maintain to dpkg).

------
mrb
I very strongly disagree with the author's complaint that _"Xz is
unreasonably extensible"_. The history of computers is _littered_ with
restrictive design choices that ended up biting us in the ass later. Let's
design FAT12. Oops, a 32MB volume size limit is no longer sufficient? Let's
make FAT16, err I mean FAT32, err no really NTFS... How about 7-bit ASCII,
err 8-bit ISO-8859, err I mean 16-bit Unicode, err no I really mean 32-bit
Unicode, now that should be sufficient, right?

So it was the right choice for the Xz format to use variable-length integer
encoding. Not all integer fields will need to represent values up to 2^63, but
it doesn't matter.

~~~
TD-Linux
I don't think the variable-length encoding itself is the author's concern,
it's adding additional filters later without an overall version number. Unlike
long file names in FAT32, adding additional filters is not backwards
compatible. With a version number, it's super clear what can decode a file.
Without one, your xz file is only decodable if your decoder implements the
same set of filters as the encoder, rather than the much simpler approach of
gzip where any gzip decoder can decode any gzip file.
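
For example, xz will happily write a file whose block headers call for the
x86 BCJ filter in the chain (a sketch):

    $ xz --x86 --lzma2=preset=6 program.bin    # BCJ filter + LZMA2 chain

A minimal decoder built without the BCJ filters will refuse that file, and no
overall version field tells you in advance which decoders can handle it.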

------
alyandon
That's certainly an interesting read but if I really care about something I'll
usually generate PAR files. At that point, does it really matter what archiver
I'm using at the time or should I still be worried?

[https://github.com/Parchive/par2cmdline](https://github.com/Parchive/par2cmdline)

------
Spooky23
People who professionally care about long term archival usually don't care
about heavy compression, as storage costs are always dwarfed by data
governance and other costs.

Individuals may care about space more because the storage or medium is
expensive for them. Even then, bulk data like photos and video doesn't
compress well, so this argument is still pointless.

------
2bluesc
Does anyone know how zpaq compares? Lrzip [0] shows it as having desirable
archival performance.

[0]
[http://ck.kolivas.org/apps/lrzip/README.benchmarks](http://ck.kolivas.org/apps/lrzip/README.benchmarks)

~~~
Klasiaster
ZPAQ is a defined standard format which embeds the decompression algorithm
for each block as bytecode into the archive. One can define a transformation
(e.g. LZ77 or color transformations of a picture) and compress this data
through context mixing and arithmetic coding. It uses checksums, but I can't
judge right now whether that implies any issue with bit flips.

~~~
userbinator
_ZPAQ is a defined standard format which embedds the decompression algorithm
for each block as bytecode into the archive._

Is that a Turing-complete bytecode? If so, I envision some interesting
applications with regard to procedurally-generated content...

~~~
Klasiaster
Yes, there are two examples for pi here:
[http://mattmahoney.net/dc/zpaqutil.html](http://mattmahoney.net/dc/zpaqutil.html)
and one where the input does not exactly have to be pi, but can contain it:
[https://github.com/pothos/zpaqlpy/blob/master/test/mixedpi2....](https://github.com/pothos/zpaqlpy/blob/master/test/mixedpi2.cfg)

But of course for real world use cases one would choose a general compression
algorithm or write a specific one for a certain type of data you want to
handle (because it will perform better than a general algorithm).

------
rwallace
Is there any actionable advice here for a Windows user? I mean let's say one
is using 7zip which defaults to .7z format. Does that correspond to any of the
formats discussed in the article? Would it be better to use a different format
or a different program?

~~~
gruez
It uses LZMA/LZMA2 (the same algorithms as xz, but in 7-Zip's own .7z
container format, not the xz container discussed in the article).

------
Someone
_" the right way of implementing binary filters is to write a preprocessor
that applies the filter to the data before feeding them to the compressor.
(See for example mince)."_

So, what does _mince_ do? [https://github.com/Kingsford-Group/mince/blob/master/README....](https://github.com/Kingsford-Group/mince/blob/master/README.md)
says:

 _" Mince is a technique for encoding collections of short reads."_

That doesn't tell me much.

Also,
[http://bioinformatics.oxfordjournals.org/content/early/2015/...](http://bioinformatics.oxfordjournals.org/content/early/2015/05/25/bioinformatics.btv248):

 _" We present a novel technique to boost the compression of sequencing that
is based on the concept of bucketing similar reads so that they appear nearby
in the file"_

I'm still not sure I understand that fully, but guesstimate that it shuffles
file content in such a way that it compresses so much better that it more than
offsets the space needed to store the information on how to deshuffle the
decompressed data. Is that correct?

Or does it require you to specify the parts, and discard the reshuffling
information, assuming you aren't interested in the order?

~~~
yuubi
I think the latter. According to the decompression instructions in README.md,
"The reads will NOT be in the same order as in the original file".

It sounds like it's meant for the results of DNA sequencing processes that
produce sequences for random small pieces of the molecule.

------
0xCMP
> everybody else will blindly use whatever formats we choose

I'm a developer and I was LITERALLY in the process of backing up my iPhoto
library (200GB) using Xz just now.

 _welp_

------
snissn
[https://rzip.samba.org/](https://rzip.samba.org/) is pretty cool

~~~
riskable
I love rzip but it is important to note that it is not a stream compressor. So
it is unsuitable for most of the use cases of LZMA2 which can compress data on
the fly or have data appended on the end of a file at a later date.

So rzip is probably _excellent_ for archival situations (not counting any
flaws the algorithm may have) but not so great for things like compressed
tarballs (which are meant to be appendable).

------
ape4
Here's a description of the format
[http://tukaani.org/xz/format.html](http://tukaani.org/xz/format.html)

------
glandium
> The only reliable way of knowing if a given version of a xz decompressor can
> decompress a given file is by trial and error. The 'file' utility does not
> provide any help:
    
    
      $ file COPYING.*
      COPYING.lz: lzip compressed data, version: 1
      COPYING.xz: XZ compressed data
    

To me, this indicates a lack of data in the magic database used by file, more
than something inherently bad with xz.

------
Paul_S
I just put files in tars because everything is already compressed. Really, 99%
of formats already have their own compression - you might get a couple % by
recompressing them. Maybe a few text files that I have are not compressed but
I can probably live with them taking up space.

~~~
Crespyl
Tar is not a compressed format.

This is why it's usually paired with gzip or xz.

~~~
Paul_S
I know it's not a compression format. My point is most formats that you put in
tars are already compressed! Jpgs - compressed, RAWs - already compressed, any
text format that isn't a plain-text file - already compressed. Compressing
compressed formats is hardly worth it.
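
Easy to check on any JPEG (a sketch; sizes will vary, but the gain is
typically negligible or even negative):

    $ xz -k9 photo.jpg && ls -l photo.jpg photo.jpg.xz    # compare the sizes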

------
kayamon
I just use ZIP, that way I can still open it in 10 years.

------
da4c30ff
Not really related to the article, but would you look at that web page. This
is what all of them should look like! Nothing fancy, just good content,
64.91KB in total.

~~~
cgvgffyv
And yet I can't read it on my phone without scrolling side to side.

~~~
anc84
That's because your browser sucks. Try Opera Mobile or something similar. Good
browsers can reflow text to fit whatever screen size they display at. This
website has zero CSS related to layout, so it is your browser's duty to render
it nicely.

It boggles my mind why reflow is not one of the most important and most highly
regarded features of any mobile browser.

~~~
noisem4ker
The problem is quickly solved on Firefox thanks to its "Fit text to width"
addon [1].

>it is your browser's duty to render it nicely

I think browsers should have the ultimate say on how to display websites,
according to the user's preferences (think of text size and colors). Sadly,
with websites gradually transforming into "web apps", users are losing
control.

[1]: [https://addons.mozilla.org/En-us/firefox/addon/fit-text-to-w...](https://addons.mozilla.org/En-us/firefox/addon/fit-text-to-width/)

~~~
sverige
Thanks for pointing to that add-on. Works nicely.

------
gaius
The only archival formats that are known to work over the long term are stone
tablets, cave paintings and parchment.

