
How Libcorrect Corrects Errors, Part I - brian-armstrong
http://quiet.github.io/quiet-blog/2018/09/16/How-Libcorrect-Does-Forward-Error-Correction.html
======
FooBarWidget
So there's the phenomenon of bit rot. Filesystems like ZFS add data checksums,
and they recommend that you scrub the data once in a while, and then when
errors are detected you are supposed to restore from a replica/backup (whether
restoration happens automatically or manually is beside the point; the point
is that a replica/backup is required in order to restore).

Why can't error correction codes be used instead of replicas/backups, as the
primary means to recover from bitrot? (NOTE: I am talking about the _primary_
means; obviously a backup is still necessary for more serious forms of
disaster recovery. I am not advocating abolishing backups altogether.) That
makes it a lot easier for single-disk use cases, e.g. laptops or consumer
devices. It also saves space, which matters because laptop SSDs and phone SD
cards aren't that big. A replica doubles the space requirements, while error
correction data takes up less than that.
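
To put rough numbers on the space argument, here is a small back-of-the-envelope sketch. It compares a full replica against the classic Reed-Solomon RS(255,223) layout (223 data bytes plus 32 parity bytes per block, able to correct up to 16 corrupted bytes per block). The choice of that particular code and the helper name `overhead` are assumptions made for illustration; nothing in this thread says a filesystem would use exactly this.

```python
def overhead(data_units: int, redundancy_units: int) -> float:
    """Extra space needed, as a fraction of the protected data."""
    return redundancy_units / data_units

# A full replica stores one extra byte for every data byte.
replica_overhead = overhead(1, 1)

# RS(255,223): 32 parity bytes protect 223 data bytes.
# (Illustrative choice of code, not something the thread prescribes.)
rs_overhead = overhead(223, 32)

print(f"replica:     {replica_overhead:.1%} extra space")
print(f"RS(255,223): {rs_overhead:.1%} extra space")
```

The trade-off is that the parity only helps while the damage stays within what the code can correct, whereas a replica survives losing the whole original copy.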

~~~
guitarbill
> and then when errors are detected you are supposed to restore from a
> replica/backup.

ZFS will recover/self-heal from errors if it can definitively figure out which
data is correct.
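
As a rough sketch of what "figuring out which data is correct" can look like: if a checksum of the block is stored separately and more than one copy of the block exists, the copy that still matches the checksum is the good one, and it can be used to overwrite the bad copies. This is not ZFS code; the function name `self_heal` and the use of SHA-256 here are assumptions made for illustration.

```python
import hashlib

def self_heal(copies: list[bytes], expected_digest: bytes) -> list[bytes]:
    """Return repaired copies, using whichever copy still matches the checksum.

    Hypothetical helper for illustration only, not how ZFS is implemented.
    """
    good = next(
        (c for c in copies if hashlib.sha256(c).digest() == expected_digest),
        None,
    )
    if good is None:
        # Every copy is damaged: a checksum alone can detect this,
        # but it cannot repair anything.
        raise IOError("no copy matches the checksum; restore from backup")
    return [good for _ in copies]

block = b"important data"
digest = hashlib.sha256(block).digest()
rotted = b"importent data"          # one copy has silently changed
healed = self_heal([rotted, block], digest)
```

With only a single copy on disk, the same checksum can still tell you the block is bad, but there is nothing to heal it from.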

many storage technologies do protect against errors, CD/DVD/HDD all have ECC
in the physical layer. but without control/knowledge of the physical medium
(so at filesystem level), how do you distribute the ECC sensibly? you can't.

another issue is if the other hardware doesn't do ECC, for example with
non-ECC RAM. RAM is arguably more susceptible to bitflips due to its dynamic
nature. can you recover from errors if the information in your RAM can't be
trusted? it's a hard problem, and ZFS pretty much requires ECC RAM for any
data integrity guarantees to work.

it also doesn't necessarily matter. a single bit flip in e.g. a jpg or mp4
file doesn't necessarily render it unusable, so people don't care.

finally, ECC is a bit useless if your whole drive or device fails, which is a
much more common failure mode.

storage is cheap nowadays, and even double or triple redundancy is cheaper and
more straightforward than trying to be clever.

~~~
FooBarWidget
> how do you distribute the ECC sensibly? you can't.

I am not an expert on ECC. Are you saying that you can't store the ECC just
anywhere? Just storing it alongside the data is not good enough? Why does ECC
need special treatment?

> storage is cheap nowadays, and even double or triple redundancy is cheaper
> and more straight-forward than trying to be clever.

Tell that to Apple, who charges $500+ for a 1 TB SSD upgrade in a Macbook
Pro. :-( "Cheap" is relative. I am worried about bitrot on my laptop but I
also don't want to halve my disk space in order to protect against that.

~~~
pwg
> > how do you distribute the ECC sensibly? you can't.

> I am not an expert on ECC. Are you saying that you can't store the ECC just
> anywhere?

Well, you can put it "just anywhere", but /where/ you put it determines /what
failure types/ you can recover from.

> Just storing it alongside the data is not good enough? Why does ECC need
> special treatment?

If you want to recover from bitrot, then putting the ECC data for a sector
alongside the data in the same sector is sufficient (you'll have fewer bytes
stored per sector, but if a bit flips, you can recover the original data).
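
A minimal sketch of the "a bit flips, you recover the original" case, using a textbook Hamming(7,4) code: 4 data bits are stored as 7 bits, and any single flipped bit among those 7 can be located and fixed. Real drives and filesystems use far stronger codes over much larger blocks, so treat the function names and the tiny block size as assumptions chosen to keep the example readable.

```python
def hamming74_encode(d: list[int]) -> list[int]:
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4   # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4   # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]

def hamming74_decode(c: list[int]) -> list[int]:
    """Correct at most one flipped bit and return the 4 data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    pos = s1 + 2 * s2 + 4 * s4   # 1-based position of the bad bit, 0 if none
    if pos:
        c[pos - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]

data = [1, 0, 1, 1]
stored = hamming74_encode(data)
stored[5] ^= 1                     # simulate one bit of rot
recovered = hamming74_decode(stored)
```

Note that the parity bits live right next to the data here, which is exactly why this kind of scheme handles a flipped bit but not the loss of the whole block.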

But storing the ECC in the same sector as the data it protects will not
protect against losing the entire sector (a "drive can't read the whole
sector" error). In this instance both the data and the ECC are lost
simultaneously, so the ECC cannot help. So if you want to protect against
loss of a sector you need your ECC stored somewhere else (i.e., on a
different sector that is unlikely to be correlated to the lost one in a
failure situation) so that you still have the ECC available when the sector
you are protecting goes away.

Likewise, if you are protecting against loss of an entire physical drive, then the
ECC for the drive needs to be on another physical drive (same reasons apply as
for a "sector", just at the level of a whole physical disk).

It is all tradeoffs. You /can/ put it anywhere, but where you choose to store
it determines which failure types you can recover from.

~~~
blattimwind
Disk drives use sector-level ECC, so silent sector corruption should be (and
IME is) rarer than unrecoverable sectors.

