
File Integrity 6: Which image format is most resilient? - zdw
https://eclecticlight.co/2020/04/21/file-integrity-6-which-image-format-is-most-resilient/
======
nneonneo
In general, most compressed file formats don't even try to be resilient
against errors, since the redundancy error correction needs runs counter to
the point of compression (making files as small as possible). Some file
formats do have error detection - for example, PNG stores a CRC-32 for every
chunk, and compressed formats like ZIP and gzip carry similar checksums.
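
To make the PNG case concrete, here is a minimal sketch (mine, not from the article) that walks a PNG file and verifies the CRC-32 stored after each chunk; it assumes a well-formed file and skips real error handling:

```python
import struct
import zlib

def check_png_chunks(path: str) -> None:
    """Verify the CRC-32 stored after every chunk of a PNG file."""
    with open(path, "rb") as f:
        assert f.read(8) == b"\x89PNG\r\n\x1a\n"  # PNG signature
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            length, ctype = struct.unpack(">I4s", header)
            data = f.read(length)
            (stored,) = struct.unpack(">I", f.read(4))
            # Each chunk's CRC covers its type and data, not the length field.
            status = "OK" if zlib.crc32(ctype + data) == stored else "CORRUPT"
            print(ctype.decode("ascii", "replace"), status)
            if ctype == b"IEND":
                break
```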

If you happen to know (or guess) that a compressed file with a checksum has
only 1-2 bytes in error, it is definitely possible to recover the original
file through brute force. You can probably even do this to a file without a
checksum, just by testing which possibilities parse correctly.
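
A rough sketch of that brute-force idea, assuming you know the original CRC-32 and that exactly one byte is wrong (two-byte errors extend the same loop, just much more expensively):

```python
import zlib

def recover_single_byte(data: bytes, expected_crc: int) -> bytes | None:
    """Try every position and value for one wrong byte until the CRC matches."""
    # Worst case is O(len(data) * 256) CRC computations; fine for small files.
    buf = bytearray(data)
    for i in range(len(buf)):
        saved = buf[i]
        for candidate in range(256):
            if candidate == saved:
                continue
            buf[i] = candidate
            if zlib.crc32(buf) == expected_crc:
                return bytes(buf)
        buf[i] = saved
    return None  # no single-byte fix produces the expected checksum
```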

However, corruption on filesystems rarely takes the form of a small number of
bytes; oftentimes the file's metadata gets corrupted such that entire blocks
of the file (anywhere from 512 bytes to 64KB) are swapped, dropped or lost.
Instead of worrying about corruption-resistant image formats (storing
everything as uncompressed TIFF/BMP is incredibly wasteful), it's probably
worth using a more corruption-resistant filesystem (i.e. not FAT!).

~~~
elcritch
What’s surprising to me is that no file system (even ZFS) and no database
utilizes error correction controls, aside from 1-bit/byte ECC-style codes or
RAID. There are several forms of forward error correction [1] that are
already widely used in hardware [2]. There are some projects aiming at
long-term fixity of storage [3], but nothing default or built in, afaict.

I assumed RocksDB or some other modern KV engine would include FEC, at least
for the metadata. Using DBs on embedded devices reveals how easily (and
regularly) a whole DB of data can be corrupted: a few bytes of error and an
entire IoT DB can be lost. FECs can be CPU-expensive, but not that expensive
nowadays.

1: https://en.wikipedia.org/wiki/Forward_error_correction
2: https://www.electronicdesign.com/technologies/communications/article/21789969/use-forward-error-correction-to-improve-data-communications
3: https://github.com/lrq3000/pyFileFixity
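
To give a concrete taste of what file-level FEC looks like, here is a hedged sketch using the reedsolo Python package (by the pyFileFixity author); note that the return shape of decode() has varied between versions:

```python
from reedsolo import RSCodec  # pip install reedsolo

rsc = RSCodec(32)  # 32 parity bytes per codeword; corrects up to 16 byte errors

payload = b"some metadata worth protecting" * 4
protected = rsc.encode(payload)

# Corrupt a handful of bytes, as a flaky flash page might.
damaged = bytearray(protected)
for i in (3, 40, 77):
    damaged[i] ^= 0xFF

# Recent reedsolo versions return (message, message+ecc, errata positions).
repaired = rsc.decode(bytes(damaged))[0]
assert bytes(repaired) == payload
```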

~~~
variaga
I would expect that filesystems do not integrate conventional FEC (e.g. block
codes), because FEC is poorly suited to the kind of errors that affect
filesystems.

FEC works against errors that have a predictable expected error rate, where
errors are evenly distributed among the data. This applies to block codes
(where FEC will have a maximum decodable error rate per block) and
convolutional codes which don't have 'blocks', but still have a maximum local
error density they can reliably correct. For media where burst errors are a
possibility, FEC is almost always paired with interleaving to increase the
likelihood of a ~uniform distribution of errors.
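
A toy sketch of byte interleaving (assuming equal-length codewords): written out column-wise, a burst of up to N consecutive bad bytes on the medium touches each of the N codewords at most once, which per-codeword FEC can then correct:

```python
def interleave(codewords: list[bytes]) -> bytes:
    """Lay N codewords out as matrix rows, emit the matrix column by column."""
    return bytes(b for column in zip(*codewords) for b in column)

def deinterleave(stream: bytes, n: int) -> list[bytes]:
    """Inverse: every n-th byte, starting at offset i, is codeword i."""
    return [stream[i::n] for i in range(n)]
```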

FEC works pretty well when the problem is "I wrote some data to disk that I
can't read back correctly, but I can read the other data back just fine". At
the media level, hard drives have used FEC to protect individual bits/sectors
for a long time. Likewise at the disk level, RAID 4/5/6 do use FEC to protect
against individual disk failures.
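
The RAID 4/5 parity part is just bytewise XOR; a toy illustration of losing and reconstructing one block:

```python
def xor_blocks(blocks: list[bytes]) -> bytes:
    """Bytewise XOR of equal-length blocks (the RAID 4/5 parity operation)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

stripes = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(stripes)

# Lose any one stripe: XOR of the survivors plus the parity restores it.
assert xor_blocks([stripes[0], stripes[2], parity]) == stripes[1]
```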

Errors experienced at a filesystem level are not usually due to 'random'
corruption of the data, but to invalid operations on the data structure e.g. a
crash while flushing the data to disk which leaves the filesystem in an
inconsistent state. In other words, the filesystem-specific errors are more
likely to be "I wrote the wrong data to disk" or "I only wrote some of the
data I meant to".

FEC will not defend against that well, because any FEC must be written to the
disk along with the data to prevent the FEC+data from getting into an
inconsistent state. For example, the 'RAID 5 write hole' (which also happens
with RAID 1 and RAID 6) exists because physical disks will not commit data to
the disk at exactly the same time even when the write commands are issued at
the same time (which mostly won't happen anyway, since write commands tend to
be serialized on your SCSI/Fiberchannel/whatever interface to multiple
drives). That means there exists a time where the data has been updated but
the parity has not (or vice versa), and if a drive fails at that time, the
RAID array may not be cleanly recoverable.

So for practical reasons, if you write the wrong data to disk, you'll probably
write the wrong FEC too, and FEC won't help recover much. If your allocation
table gets corrupted and you write part of one file on top of another, you'll
be overwriting the FEC too - at least with conventional FEC.

BUT... error correction isn't just "conventional FEC". Error correction in
general is any time you add redundancy to data to make it more tolerant to
errors. In the filesystem space, this happens, but it isn't usually done with
parity matrices or convolutional state machines. Instead, redundancy is added
by things like log-structured file systems. The 'write what you plan to do,
then write the actual data, then write that you successfully did it (in that
order)' behavior of log structured file systems does add redundancy and does
make error recovery more possible.
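
A minimal sketch of that ordering discipline - not a real journal, just the same write-then-commit pattern, using an atomic rename as the commit record:

```python
import os

def journaled_write(path: str, data: bytes) -> None:
    """Crash-safe replace: either the old or new contents survive, never a mix."""
    tmp = path + ".journal"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())  # the data must be durable before the commit
    os.replace(tmp, path)     # atomic on POSIX: this rename is the commit record
    # (A production version would also fsync the containing directory.)
```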

In summary, "error correction" in general is used at several levels for file
storage:

- block FEC at the bit level

- block FEC at the drive level, if RAID is used

- robust CRC error detection + automatic repeat request over your PCIe/NVMe link or over a network connection

- redundant metadata in your filesystem journal

but the key is to use the right kind of error correction for the expected
error conditions.

~~~
emilfihlman
I suggest you read a bit more about reliable data transmission and FEC.
Burst-error FECs are absolutely available and used (mostly FEC + interleaving,
in that order).

------
vitovito
JPEG can be made almost as resilient as an uncompressed format by adding
restart markers.

Restart markers are resynchronization points inserted into the compressed
bitstream, at however many intervals you like, so the bitstream can be, well,
restarted, limiting the effect of any corruption to a single interval.

`jpegtran` can insert them losslessly.
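
For example, `jpegtran -restart 1 -outfile out.jpg in.jpg` emits a restart marker every MCU row (append B, as in `-restart 1B`, for every MCU block) without recompressing the image data.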

~~~
sllabres
Interesting, didn't know. For arbitrary data I found "rsbep".

------
jwr
I wrote my little consistency checker
([https://github.com/jwr/ccheck](https://github.com/jwr/ccheck)) specifically
for checking my image archives against corruption. I am appalled that in this
day and age we still (mostly) use tools and filesystems that have no way of
knowing if data has been corrupted.

Before I wrote the tool, I searched online and couldn't find anything that
would be small, simple, easily maintained, and would last for at least 10
years.

I now use it with all my archives, which means I know if a given replica is
correct.

------
mark-r
I like the idea of a utility that can add ECC to any arbitrary file. You'd
need to ensure you have an uncorrupted copy of the utility to use in the
future, though, or the images become unreadable even if they themselves are
fine - unless the ECC is contained in a sidecar file.

~~~
zamadatix
PAR2 or most file compression formats can fit the bill. Of course, if the
data is going to stay put, a filesystem that does this might be more
convenient.

~~~
mark-r
Yes, it might be more convenient when built into a filesystem. But now you
must keep alive an OS that can handle that filesystem. That seems like a much
greater burden for the long haul.

------
cerberusss
If you're afraid of file corruption, install par2 with homebrew.

My procedure for pictures: monthly, I copy them off my phone onto my storage.
Then I run par2 in that directory to add a parity file. That all goes onto
the backup.
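
With par2cmdline that's something like `par2 create -r10 photos.par2 *.jpg` for roughly 10% redundancy, then later `par2 verify photos.par2` to check and `par2 repair photos.par2` if anything went bad.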

~~~
brnt
I do the same. To help with 'maintenance' (adding par to newly added files,
checking and handling integrity for existing ones) I wrote a small utility you
may like:
[https://github.com/brenthuisman/par2deep](https://github.com/brenthuisman/par2deep)

~~~
cerberusss
Very nice, thanks!

------
BiteCode_dev
Genuine question: why would you want that? Isn't that the role of the file
system or the data transport?

------
arthurfm
This might be a stupid question, but with a container format such as HEIF
could you simply store multiple identical copies of an image inside the
container? If one image gets corrupted you have the other to fall back on.

Two HEIF or AVIF images should be about the same size as a single JPEG too.

~~~
exmadscientist
It's both much less wasteful and much more useful to add an error correction
(FEC) block instead. With random errors spread across two copies, how do you
know which chunks of each are correct? Sure, you could use a Merkle tree or
similar, but FEC is a better solution. For size penalties on the order of 10%
in practical use, FEC blocks can detect and correct errors not just in the
data but in the FEC blocks themselves too.

Their biggest downside is that their computational complexity can be high.
This is only a problem during creation and decoding on error; if the data is
checksummed, you only need to take the slow path and touch the FEC block if an
error is encountered. This is a great tradeoff for archival uses, and good for
many other uses.
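
A sketch of that fast-path/slow-path split, assuming a layout of data followed by 32 Reed-Solomon parity bytes (reedsolo again, payload small enough for one codeword) and a separately stored CRC-32:

```python
import zlib
from reedsolo import RSCodec  # pip install reedsolo

rsc = RSCodec(32)  # 32 parity bytes appended per codeword

def read_protected(blob: bytes, stored_crc: int) -> bytes:
    """Checksum on the fast path; full Reed-Solomon decode only on a mismatch."""
    payload = blob[:-32]               # assumes one codeword: data then parity
    if zlib.crc32(payload) == stored_crc:
        return payload                 # common case: no FEC work at all
    return bytes(rsc.decode(blob)[0])  # rare case: repair, then return the data
```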

------
social_quotient
Is “vandal” by chance an open source utility? I’d like to test some other
files and formats.

~~~
mark-r
It doesn't sound like it would be hard to write your own. Not sure how useful
it is though, since it doesn't make any attempt to replicate the kind of
errors you would see in the wild.

------
renewiltord
Would you outperform uncompressed TIFF with sufficient copies of a compressed
file? Or compressed file plus parity file?

------
mark-r
There used to be a format that was highly resilient to errors. It was called
film.

~~~
avmich
Of course, point defects - or linear ones, say from a bad projection device -
are rather final in film, and the format has a host of inconveniences: fire
hazard, inability to be transferred over networks, comparatively huge copying
costs, large physical storage demands, etc.

We can perhaps use a convenient digital format for images and add to it, ahem,
anti-corruption measures. A single file is still going to be vulnerable -
though to a lesser extent - even if you add, say, some Reed-Solomon
protection; but now you can use distributed networks the size of a planet, or
even larger. Organizing such storage could be an interesting problem - one at
least partially solved already.

~~~
mark-r
I'm not saying film is perfect. But it's an often overlooked choice that has
worked well for the last 50 years.

