
Xz format inadequate for long-term archiving (2016) - UkiahSmith
http://lzip.nongnu.org/xz_inadequate.html
======
jordigh
This guy used to go around GNU mailing lists (and others) trying to get us to
use lzip.

[https://gcc.gnu.org/ml/gcc/2017-06/msg00044.html](https://gcc.gnu.org/ml/gcc/2017-06/msg00044.html)

[https://lists.debian.org/debian-devel/2017/06/msg00433.html](https://lists.debian.org/debian-devel/2017/06/msg00433.html)

It was a bit bizarre when he hit the Octave mailing list.

Eventually, people just wanted xz back:

[http://octave.1599824.n4.nabble.com/opinion-bring-back-Octav...](http://octave.1599824.n4.nabble.com/opinion-bring-back-Octave-xz-source-release-td4683705.html)

~~~
AdmiralAsshat
Everyone's gotta have their white whale, I guess.

------
brianpgordon
Previous discussions:

[https://news.ycombinator.com/item?id=12768425](https://news.ycombinator.com/item?id=12768425)

[https://news.ycombinator.com/item?id=16884832](https://news.ycombinator.com/item?id=16884832)

------
esaym
Interestingly, since "recovery" is mentioned several times, I decided to test
it myself.

I took a copy of a jpeg image, compressed it separately with gzip and with
bzip2, then modified one byte with a hex editor.

The recovery instruction for gzip is to simply do "zcat corrupt_file.gz >
corrupt_file", while for bzip2 it is to use the bzip2recover command, which
just dumps the blocks out individually (corrupt ones and all).
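
Roughly, the procedure looked like this (a sketch using dd in place of the
hex editor; the filenames and the byte offset are just examples):

    # compress a copy of the image with each tool (-k keeps the original;
    # needs a reasonably recent gzip)
    gzip -k image.jpg                      # -> image.jpg.gz
    bzip2 -k image.jpg                     # -> image.jpg.bz2
    # clobber one byte somewhere in the middle of each compressed file
    printf '\000' | dd of=image.jpg.gz  bs=1 seek=5000 count=1 conv=notrunc
    printf '\000' | dd of=image.jpg.bz2 bs=1 seek=5000 count=1 conv=notrunc
    # attempt recovery as described above
    zcat image.jpg.gz > recovered.jpg      # gzip: dump whatever still decodes
    bzip2recover image.jpg.bz2             # bzip2: split into per-block files
    bzcat rec*.bz2 > recovered2.jpg        # concatenate the recovered blocks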

Uncompressing the corrupted gzip'd jpeg via zcat always resulted in an image
file the same size as the original, which could be opened with any image
viewer, although the colors were clearly off.

I never could recover the image compressed with bzip2. Trying to extract all
the recovered blocks made by bzip2recover via bzcat would just choke on the
single corrupted block. And the smallest you can make a block is 100K (vs 32K
for gzip?). Obviously pulling 100K out of a jpeg will not work.

Though I'm still confused as to how the corrupted gzip file extracted to a
file of the same size as the original. I guess gzip writes out the corrupted
data as well instead of choking on it? I guess gzip is the winner here. Having
a file with a corrupted byte is much better than having a file with 100K of
data missing...

~~~
gmueckl
Your method is clearly flawed. Altering a single byte once is insufficient as
a test unless you analyzed the structure of the compressed file first to see
where the really important information is stored. It may well be that you just
modified a verbatim string from the source data in the gzip case, but
corrupted a bit of metadata about how the compressed data is structured in the
bzip2 case. If you tried different random bytes, the results might be
reversed.

The proper test would be to iterate over every bit in the compressed file,
flip it, and try to recover. Then compute the number of successful recoveries
against the number of bits tested. Compression algorithms that perform
similarly should have similar likelihoods that a single bit flip corrupts the
entirety of the data.
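
Something along these lines is what I mean (an untested bash sketch for the
gzip case; it assumes GNU coreutils, only counts whether decompression
finishes without an error, and is slow since it re-runs zcat for every bit):

    ok=0; total=0
    size=$(stat -c%s file.gz)
    for ((byte = 0; byte < size; byte++)); do
        val=$(od -An -tu1 -j "$byte" -N 1 file.gz)   # current byte value
        for bit in 1 2 4 8 16 32 64 128; do
            cp file.gz probe.gz
            # write the byte back with one bit flipped
            printf "$(printf '\\%03o' $((val ^ bit)))" |
                dd of=probe.gz bs=1 seek="$byte" conv=notrunc 2>/dev/null
            zcat probe.gz >/dev/null 2>&1 && ok=$((ok + 1))
            total=$((total + 1))
        done
    done
    echo "$ok of $total single-bit flips still decompressed without error"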

~~~
esaym
I thought about that as well. I tried it three different times all with the
same results.

~~~
dual_basis
Three? Well then, case closed!

~~~
fao_
Did the poster imply that their test was the be-all and end-all of error
tolerance in common-use compression systems? No. Then why did you assume that
they did say that, and then write such a useless comment?

------
xoa
Not that many of the complaints aren't reasonable, but I thought that in
general compression/format was orthogonal to parity, which is what I assume is
actually wanted for long-term archiving? I always figured that the goal should
normally be to be able to get back out a bit-perfect copy of whatever went in,
using something like Parchive at the file level or ZFS for online storage at
the fs level. I guess on the principle of layers and graceful failure modes
it's better if even sub-archives can handle some level of corruption without
total failure, and from a long-term perspective of implementation independence
simpler/better specified is preferable. But that still doesn't seem to be a
substitute for just having enough parity built in to both notice corruption
and fully recover from it, up to fairly extreme levels.
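
For instance, something along these lines with par2 (the 10% redundancy level
and filenames are just examples):

    # create recovery data alongside the archive
    par2 create -r10 backup.tar.xz.par2 backup.tar.xz
    # later: check integrity, and repair from the recovery blocks if needed
    par2 verify backup.tar.xz.par2
    par2 repair backup.tar.xz.par2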

~~~
matt-attack
I think with archiving it’s more than that. Sure you can guarantee that the
actual tool you just compressed with can restore the original perfectly. But
with long term digital archiving I think you need the assurance that the
“spec” called “xz” could be perfectly reimplemented by an expert in the
future. Based solely on documentation. And on a platform that doesn’t exist
today. That is, you must assume the original executable is either not
available or not able to be executed.

~~~
ltbarcly3
Why would they need to recreate it solely based on 'documentation'? It is open
source; the source code is the documentation. It seems just as likely that the
source would survive as that some complete technical documentation would.
Maybe they wouldn't be able to compile it (probably they would be able to; I
don't see why they wouldn't have some kind of computer emulator available),
but it's better than any other kind of documentation you could provide.

The src/ tree of xz is 335k (compressed with gzip). If you are worried future
digital historians won't be able to figure out the xz format, throw a copy of
the gzip'd source onto every drive you store archives on; it would basically
be free and would almost guarantee they would have a complete copy of exactly
what they need to decompress the files.

~~~
matt-attack
You're exhibiting shortsightedness when it comes to "source". If I give you
some RPG [1] or maybe some ALGOL 58 [2] source code, are you going to just
compile and run it, no problem? How about some FLOW-MATIC [3]?

Point being that computer languages come and go.

[1]
[https://en.wikipedia.org/wiki/IBM_RPG](https://en.wikipedia.org/wiki/IBM_RPG)

[2]
[https://en.wikipedia.org/wiki/ALGOL_58](https://en.wikipedia.org/wiki/ALGOL_58)

[3] [https://en.wikipedia.org/wiki/FLOW-MATIC](https://en.wikipedia.org/wiki/FLOW-MATIC)

~~~
ltbarcly3
Yes, programming languages come and go, but I don't see how that matters. Some
future historian will either have access to a working copy of xz or they will
not. If they don't, and they want to implement it, having a copy of the source
code is far better than anything else you could give them. Sure, future
programming languages will be quite different, but humans will certainly be
able to read and understand C code. If humanity has forgotten how to read C
code (and lost all knowledge of it), how are they going to read this
documentation you seem to prefer? Human languages come and go also.

[https://en.wikipedia.org/wiki/Egyptian_hieroglyphs](https://en.wikipedia.org/wiki/Egyptian_hieroglyphs)

[https://en.wikipedia.org/wiki/Judaeo-Aragonese](https://en.wikipedia.org/wiki/Judaeo-Aragonese)

[https://en.wikipedia.org/wiki/Latin](https://en.wikipedia.org/wiki/Latin)

Any argument you can make about historians being able to recover dead
languages applies just as well to their ability to recover dead computer
languages, and there is no better or more accurate specification than the
actual code.

So let me add to my recommendation: in addition to a copy of the xz source
code, include a plain-text copy of any 'how to program in C' book, or just the
Wikipedia page for the C language. That is more than enough for them to
construct a program that can decompress xz files, once they relearn how to
read whatever long-dead language the book is written in (Ancient Pre-Cataclysm
Earth English, for example).

~~~
fao_
> If humanity has forgotten how to read C code (and lost all knowledge of it),
> how are they going to read this documentation you seem to prefer?

Sure, but are they going to remember things like weird precedence rules (see:
&), undefined behaviour, etc.? Just because they want to reimplement a
specific, small program does not mean they want to relearn several languages.
What you're saying could easily blow up from 'how to code in C' to 'reading
the GCC / Clang compiler source code to figure out how a specific piece of
undefined behaviour that this program happens to rely on was implemented',
which I'm sure nobody wants to spend their weekend doing. Implementing
something like `xz` could simply be a midpoint on the way to their actual
goal; they don't want to spend weeks digging up COBOL. Have at least _some_
consideration for the human element, jeez.

Documentation, specifically _mathematical_ documentation, is more fault
tolerant than either pseudocode or actual code.

At any other time, I would agree with you, but where archivism is concerned, I
do not.

~~~
ltbarcly3
>> What you're saying could easily blow up from 'how to code C' to 'reading
the GCC / Clang compiler source code to figure out how a specific UB was
implemented, which the program in this specific case falls into', which I'm
sure nobody wants to spend their weekend doing

There will be many, many people who will gladly dig into the minutiae and
technical details of arcane hardware, especially when it means making progress
towards filling in the historical record. This is already the case today:
there is a working
[https://en.wikipedia.org/wiki/Colossus_computer](https://en.wikipedia.org/wiki/Colossus_computer)
reconstruction, built just because it was historically significant.

~~~
fao_
I think you missed the implication that I didn't state explicitly, but figured
was pretty clear:

> which I'm sure nobody wants to spend their weekend doing [if their original
> goal was to simply reconstruct xz].

------
microcolonel
> _" 3 Then, why some free software projects use xz?"_

Because the files are usually smaller than with gzip, decompression is faster
than with bzip2, and the library is available on most systems.

~~~
h1d
Archiving for distribution and archiving for backups are very different
things. You don't care if some app's compressed distribution file gets
corrupted; you just compress it again. But your compressed backup files
usually don't have a source left to fall back on.

I wouldn't use any unreliable format for backups. I picked bzip2 for stability
and compression ratio.

~~~
microcolonel
In my opinion, the compressor is not the right place to add data integrity
mechanisms, especially since data integrity mechanisms only really apply to
particular media. Data on hard drives doesn't get corrupted in the same way as
data on TLC SSDs, and on the latter you're generally better off with
redundancy and diversification than with inline error-correcting codes.

Honestly, I don't see why xz should have _any_ of its own data integrity
mechanisms whatsoever, except maybe a whole-archive CRC32 or similar.
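
For what it's worth, xz does let you pick (or drop) its built-in check per
file, roughly like this:

    xz --check=crc32 file    # plain whole-stream CRC32
    xz --check=none  file    # rely entirely on an outer integrity layer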

~~~
duskwuff
Right. The purpose of an archiver/compressor is to store a bunch of files
together, and use as little space as possible to do it. Data integrity / error
correction / redundancy all lie in the opposite direction of that goal.

------
Adamantcheese
How about something like ZPAQ instead for archiving? Especially if you're
doing backups and not a lot of the information is changing.

------
ltbarcly3
No file format is perfect. I've been using xz for years and I can't think of a
single issue I have had. The compression ratio is dramatically better than
gzip or bzip2 for many types of archives (especially when there is a lot of
redundancy; for example, when compressing spidered web pages from the same
site you can get well over 99% size reduction compared to a 70% reduction for
gzip, which means using less than one thirtieth of the disk space).

Lately I have been using zstd for some things since it gives good compression
and is much faster than xz.

This criticism of xz just seems nitpicky and impractical, especially if you
are compressing tar archives and/or storing the archives on some kind of RAID
that can correct some read errors (such as RAID 5).

------
asveikau
I remember seeing this article before. This time my reaction is: if you want
long-term archiving but don't assume redundant storage, it's not going to go
well. Put your long-term archives on ZFS.

~~~
StavrosK
Why are you assuming they aren't assuming redundant storage? Redundant storage
isn't a cure-all; there's still a chance two blocks on two disks will fail in
the exact same spot.

~~~
kipari
I reckon that the chance of the same two blocks on two different disks failing
between ZFS scrubs would be incredibly small.

~~~
cmurf
It is incredibly small if you don't consider either drive failing. But if one
drive fails, it happens with some regularity that a sector on the good drive
is bad. In actuality, only one sector is bad, but in effect the dead drive
means its mirror is also bad.

This comes up on the Linux RAID list with some frequency whenever there are
drive failures with raid56 and the RAID subsequently trips over a single bad
sector.

But it's true that lack of scrubbing contributes to this scenario, as does the
terrible combination of consumer drives with very high bad-sector recovery
times and the Linux SCSI command timer default of 30 seconds. That combination
masks bad sectors so that they never get repaired, and as a user you may not
realize that the link resets are not normal and suggest a bad sector as the
cause.

~~~
zaphirplane
Are you saying that a failure happens which isn't detected, and when the 2nd
failure occurs we notice because the data is inaccessible?

Which RAID software does this?

~~~
cmurf
Correct. All of them that depend on the SCSI block layer, which includes
libata and thus common consumer SATA drives. A NAS-class or better drive will
come out of the box with short error timeouts, typically 70 deciseconds, and
quickly issue a read error with the LBA of the offending bad sector; the RAID
can then obtain a copy or reconstruct from parity and write the good data back
to the bad sector, fixing it. Either the write works, or if it fails the drive
firmware is responsible for remapping that LBA to a reserve physical sector.

In the case where the drive's error timeout is longer than the SCSI block
layer's, it just results in a link reset. The actual problem with the drive is
obscured by the reset, including the bad sector, so it never gets repaired.

Btrfs, mdadm, and LVM are affected, and I'm pretty sure ZFS on Linux is as
well, assuming it hasn't totally reimplemented its own block layer outside of
the SCSI subsystem.

It's a super irritating problem. The kernel developers know all about it, but
thus far it's considered something distributions should change for the use
cases that need it. What that means so far is that distros don't change it,
and users with consumer drives that have high error recovery times get bitten.

[https://raid.wiki.kernel.org/index.php/Timeout_Mismatch](https://raid.wiki.kernel.org/index.php/Timeout_Mismatch)
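
The usual workaround from that page, roughly: either cap the drive's error
recovery time (if it supports SCT ERC) or raise the kernel's command timer
well above the drive's worst-case recovery time:

    # drives that support SCT ERC: limit error recovery to 7 seconds (70 ds)
    smartctl -l scterc,70,70 /dev/sdX
    # drives that don't: raise the SCSI command timer instead (in seconds)
    echo 180 > /sys/block/sdX/device/timeout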

~~~
zaphirplane
The link you posted talks about the RAID software kicking a whole disk out of
the array when the disk takes too long to respond (basically, if not exactly)
due to a mismatch between two timeout variables.

The post I was responding to implied a RAID array could be degraded and you
wouldn't know until it completely failed.

Interesting nevertheless.

------
mkj
A bit of speculation here, but perhaps xz won over lzip because it has a real
manpage?

lzip has the usual infuriating short summary of options with a "run info lzip
for the complete manual". Also, the source code repository doesn't even seem
to be linked directly from the lzip homepage. Technical considerations aren't
the only thing that determines whether software is "better"; it also has to be
well presented.

------
shmerl
xz-utils should implement parallel decompression already. pixz is doing it,
but stock xz is not. Most end users benefit from faster decompression.

------
SEJeff
This should have (2016) in the title.

~~~
Thoreandan
and "from the author of lzip, a competing lzma library that never went viral".

Welcome to the Better Technology that Shoulda Made It bench. Your seat's over
there next to OS/2, BeOS, and OpenGenera.

~~~
nkoren
Amiga forever!!!!!!

~~~
zmix
Ha! I just wanted to add this. But you did it first! :-)

------
LinuxBender
If you first use tar to preserve xattrs/etc., then you can use anything to
compress: xz, bz2, 7z, even arj if you are feeling nostalgic.

    
    
        tar --xattrs -cvJf ./files.tar.xz /some/dir

~~~
zamalek
You've missed the point of the article entirely. A single bit flip (which is
almost guaranteed over the _long term_) can easily render the _entire_ xz file
corrupt.

This has nothing to do with xattrs/etc.

~~~
LinuxBender
Yes, I am totally on auto-pilot today. I'm used to a different article about
xz that gets re-posted often, and my browser blocks non-https sites, so I
assumed it was that other article.

That said, I use xz in automation that compresses files on one end and
decompresses them on the other. I've not had any file corruption thus far;
checksums always match. Hopefully the author has submitted bug reports and
ways to reproduce.

