
ZFS won’t save you: fancy filesystem fanatics need to get a clue about bit rot - gphreak
https://nctritech.wordpress.com/2017/03/07/zfs-wont-save-you-fancy-filesystem-fanatics-need-to-get-a-clue-about-bit-rot-and-raid-5/
======
kabdib
> While it is true that keeping a hash of a chunk of data will tell you if
> that data is damaged or not, the filesystem CRCs are an unnecessary and
> redundant waste of space ...

A few years ago, when I was on a game console team, a hardware engineer came
to my desk and said, "Can you find out what's wrong with this disk drive?" It
had come from a customer whose complaint was that games sometimes failed to
download and game saves became unreadable.

I spent a fun afternoon tracking down what turned out to be a stuck-at-zero
bit on that drive's cache. Just above the drive's ECC-it-to-death block
storage was this flaky bit of RAM that was going totally unchecked. The
console had a Merkle-tree-based file system and easily detected the failure,
but without that additional checking the corruption would have been very
subtle most of the time.
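The principle behind that console's check is easy to sketch. Here's a toy Merkle-style verifier in Python (the chunk size, hash choice, and layout are illustrative, not the console's actual format):

```python
import hashlib

def chunk_hashes(data, chunk_size=4096):
    """Hash each fixed-size chunk of the data."""
    return [hashlib.sha256(data[i:i + chunk_size]).digest()
            for i in range(0, len(data), chunk_size)]

def merkle_root(hashes):
    """Pairwise-combine chunk hashes until a single root hash remains."""
    while len(hashes) > 1:
        if len(hashes) % 2:                      # odd count: duplicate the last
            hashes = hashes + [hashes[-1]]
        hashes = [hashlib.sha256(hashes[i] + hashes[i + 1]).digest()
                  for i in range(0, len(hashes), 2)]
    return hashes[0]

data = b"x" * 16384                              # four 4 KiB chunks
root = merkle_root(chunk_hashes(data))

# A stuck-at-zero bit anywhere in the data changes the root hash,
# so the filesystem notices the mismatch the moment it re-reads the file.
flipped = bytearray(data)
flipped[5000] &= ~0x08                           # force one set bit to zero
assert merkle_root(chunk_hashes(bytes(flipped))) != root
```

Because every chunk hash feeds into the root, no single-bit corruption can slip through unnoticed, which is exactly how the flaky cache bit was caught.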

Okay, so that's just one system out of millions, right? What are the chances?
Well, at the scale of millions, pretty much any hole in data integrity is
going to be found out and affect real, live customers at _some_ not
insignificant rate. You really shouldn't be amazed at the number of single-bit
memory errors happening on consumer hardware (from consoles to PCs -- and I
assume phones). You should expect these failures and determine in advance if
they are important to you and your customers.

Just asserting "CRCs are useless" is putting a lot of trust on stuff that has
real-world failure modes.

~~~
rgbrenner
_Just asserting "CRCs are useless" is putting a lot of trust on stuff that has
real-world failure modes._

Yes, and he does this over and over again throughout the article. I have
personally experienced at least 3 scenarios that he has determined won't
happen.

If this guy wrote a filesystem (something that he pretends to have enough
experience to critique), it would be an unreliable, unusable piece of crap.

~~~
AstralStorm
You have worse problems that a filesystem won't catch if RAM gets randomly
corrupted, including the CRC check itself getting corrupted, or the code
writing data structures out to disk being wrong. Neither of those is caught by
a CRC any better than by a dirty bit. It so happens that journaling file
systems already have a degree of redundancy for writes built into them, unless
you defeat it.

~~~
wyoung2
The trivial case is that the data is corrupted in RAM prior to being written.
If we take the simple case of a 2-disk mirror, the same wrong data is going to
be written to both disks, the checksums will match, and the filesystem and
underlying disks will be oblivious to the problem. ZFS can't help here, but
neither can RAID-5.

The far more risky situations involve reading back data.

A properly-optimized RAID or RAID-like system will read half the blocks from
one disk and half from the other when dealing with a 2-disk mirror.

With RAID-1, if the data blocks read cleanly from one disk — that is, the hard
disk's ECC does its thing, as the author expects — but the data bytes are then
corrupted in RAM during the DMA transfer, RAID won't detect the problem. Your
application will simply have errors in those blocks, and it'll be oblivious to
the problem unless there is some corruption detection ability in the data
format.

With a ZFS mirror, things are different. If the blocks are cleanly read from
the disk (again according to those in-drive ECC checks) but the bytes are
corrupted during the DMA transfer to RAM, ZFS _will_ detect it, because it
always double-checks the hashes — cryptographically strong _hashes_, mind,
not CRCs, as the author misstates — after reading the data in from disk. This
will cause ZFS to attempt a second read from the corresponding block in the
other side of the mirror. Assuming you don't get a second RAM corruption, the
checksum will match this time, so ZFS will re-write the clean block to the
first disk. ZFS is incorrectly assuming it was the drive that corrupted the
block, but it doesn't matter because all that happens is a correct block is
overwritten with the same correct block.

Now let's take a trickier case. What if your RAM is so flaky that it re-
corrupts the clean block on its way back out to the first disk during this
unnecessary re-write? ZFS will write the correct checksum along with that
block's data, so that when it comes time to re-read that block, the checksum
won't match the data. It doesn't matter whether the RAM corrupts the
checksummed data or the checksum itself, because the odds are astronomically
against both being corrupted in a way that causes the two to match. When ZFS
is told to re-read that corrupted block, either by the application or by a
background scrub, it will again decide it needs to overwrite the first disk's
copy of the block with the copy from the second disk, which this time is in
fact corrupted on-disk. Unless your RAM corrupts the data a third time, _this_
time it will write the correct data to disk.
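The whole read-verify-repair sequence described above can be modeled in a few lines (a toy two-disk mirror with per-block checksums; the names and layout are made up for illustration and are not ZFS internals):

```python
import hashlib

def checksum(block):
    return hashlib.sha256(block).digest()

def read_with_self_heal(disks, idx):
    """Read block idx from a mirror: verify each copy against its stored
    checksum, return the first good one, and overwrite any bad siblings."""
    for src, disk in enumerate(disks):
        data, stored = disk[idx]
        if checksum(data) == stored:
            for dst, other in enumerate(disks):      # heal the bad copies
                if dst != src and checksum(other[idx][0]) != other[idx][1]:
                    other[idx] = (data, stored)
            return data
    raise IOError("all copies of block %d failed their checksum" % idx)

block = b"important data"
good = (block, checksum(block))
bad = (b"important dxta", checksum(block))   # bytes corrupted after checksumming
mirror = [[bad], [good]]                     # disk 0 holds the corrupted copy

assert read_with_self_heal(mirror, 0) == block   # the read still succeeds
assert mirror[0][0] == good                      # and disk 0 was healed
```

The key property is that the checksum travels separately from the data it protects, so a copy that doesn't match its own checksum is known-bad, with no guessing between mirror sides.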

RAID can't do any of that. All RAID can do is say, "These two blocks don't
match each other, but both have good on-disk ECC, so _PANIC_." Different RAID
implementations do different things here. Some will just mark the array as
degraded and force the operator to choose one disk to mirror back onto the
other. If the operator guesses wrong, you've got two copies of the bad data
now.

ZFS doesn't have to guess: it _knows_ which copy is wrong with astronomical
odds in favor of being correct.

------
asveikau
A few years ago I had a drive at home that was flipping bits, randomly
corrupting my files. It inspired me to build a ZFS disk server and introduce
redundancy in my home setup.

A bunch of this article reads as if this scenario, which I in fact hit, won't
happen, or that drives handle it better, etc. But it happens. It happened to
me. The drive
did not "magically fix itself", and instead got worse over time. With ZFS, if
it happens again, I can be told where it happened, exactly what files are
affected, etc., and that's already better than what I got with that other disk
which didn't have ZFS.

Plus the ZFS tools like snapshotting, send/receive, scrub being able to check
integrity while the system is running... Those are great features.

~~~
wyoung2
I've got a ZFS server here that regularly detects some small number of megs of
incorrect data on each week's scrub. This week, it happens to be 4.28M. Every
week, ZFS finds the correct copy and fixes it.

I have no idea what the problem is with this server. There are no SMART
failures or kernel messages indicating hardware failure, and the system
doesn't hard-crash. The thing is, I don't actually _have to_ care, because ZFS
is actively taking care of the problem. Until one of the disks goes so bad
that SMART or the kernel's SATA layer or ZFS can point me at it, I can just
passively let ZFS continue protecting me.

If this were a RAID, the first risk is that the RAID system wouldn't have a
scrub command at all. Some do, but not all. Without such a command, those on-
disk ECCs the author heaps so much praise on won't help him. I've got the same
ECCs backing my ZFS, and clearly the data is getting corrupted anyway,
somehow.

Let's keep the author's context in mind, which is apparently that we're going
to use motherboard or software RAID, since he's budgeted $0 for a hardware
RAID card, so the chances are higher that there is no scrub or verify command.

If our RAID implementation _does_ happen to have a scrub or verify command, it
might be forced to just kick one of the disks out or mark the whole array as
degraded, depending on where in the chain the corruption happened. If it does
that, it'll take a whole lot longer to rewrite one of the author's cheap 3 TB
disks than it took ZFS on my file server to fix the few megs of corrupted
blocks.

And that's not all. I have a second anecdote, the plural of which is "data,"
right? :)

Another ZFS-based system I manage had a disk die outright in it. SMART errors,
I/O timeouts, the whole bit. Very easy to diagnose. So, I attached a third
disk in an external hard disk enclosure to the ailing ZFS mirror, which caused
ZFS to start resilvering it.

Before I go on, I want to point out that this shows another case where ZFS has
a clear advantage. In a typical hardware RAID setup, a 2-disk mirror is more
likely to be done with a 2-port RAID card, because they're cheaper than 4-port
and 8-port cards. That means there is a very real chance that you couldn't set
up a 3-disk mirror at all, which means you're temporarily reduced to no
redundancy during the resilver operation. Even if you've got a spare RAID port
on the RAID card or motherboard, you might not have another internal disk slot
to put the disk in. With ZFS, I don't need either: ZFS doesn't care if two of
a pool's disks are in a high-end RAID enclosure configured for JBOD and the
third is in a cheap USB enclosure.

The point of having a temporary 3-disk mirror is that the dying disk wasn't
quite dead yet. That means it was still useful for maintaining redundancy
during the resilvering operation. With the RAID setup, you might be forced to
replace the dying disk with the new disk, which means you lose all your
redundancy during the resilver.

Now as it happens, sometime during the resilver operation, `zpool status` began
showing corruptions. ZFS was actively fixing them like a trooper, but this was
still very bad. It turned out that the cheap USB external disk enclosure I was
using for the third disk was flaky, so that when resilvering the new disk, it
wasn't always able to write reliably. I exported the ZFS pool, moved the new
disk to a different external USB disk enclosure, re-imported the pool, and
watched it pick the resilvering process right back up. Once that was done, I
detached the dying disk from the mirror and did a scrub pass to clear the data
errors, and I was back in business having lost no data, despite the hardware
actively trying to kill my data _twice over_.

There are still cases where I'll use RAID over ZFS, but I'm under no illusion
that ZFS lacks real advantages over RAID. I've seen plenty of evidence that it
has them.

~~~
rubatuga
By the way, are you running ZFS on a linux server? Or BSD? Just want to set
one up for myself too.

~~~
wyoung2
No.

The first anecdote is about a TrueOS box — which previews what will become
FreeBSD 12 — and the second is about a macOS Sierra box running OpenZFS on OS
X.

Since TrueOS, O3X and ZoL are all based on OpenZFS, I expect that you will
have the opportunity to replicate my experiences should you have disks that
die. Now I don't know whether to wish you good luck or not. :)

------
Mindless2112
As someone who has lost some files to a silently malfunctioning hard disk in
the past, I think I'll stick with ZFS. Checksumming, RAID-Z, and periodic
scrubbing would have saved my files. Even having backups did not -- after all,
what good is a bit-for-bit copy of a corrupted file?

(On a side note, ZFS -- at least OpenZFS -- doesn't support any _CRC_
algorithms for use as its checksum.)

~~~
AstralStorm
Mostly periodic scrubbing and patrol reads, I reckon. Which are just as
required with RAID without ZFS.

~~~
wyoung2
Scrub, verify, patrol reads, whatever you want to call it: with RAID, all it
can do is say, "Well shit, these two copies don't match. What do you want me
to do about it, boss?"

ZFS doesn't have to guess which copy is wrong. It _knows_ , and it will
automatically replace it.

More, ZFS will even do this on a ZFS mirror when reading half the data blocks
from one disk and half from the other, because it reads the cryptographically-
strong checksums in with each data block and checks them before delivering the
data to the application. If the checksum doesn't match, it rewrites that block
from the redundant copy on the other disk(s).

RAID can't do that. If one of a mirror's data blocks is corrupted on disk but
with a correct ECC, so that the two blocks don't match but both read cleanly,
RAID can't tell which one is correct, so it'll typically just force the system
administrator to choose one disk to overwrite the other with. That exchanges
astronomical odds against incorrect data for coin flip odds against.

------
rgbrenner
For an article with that tone, you would think the author would have more
experience. It's filled with flawed, uninformed, and inexperienced thinking.

From the idea that SMART reliably detects hard drive failures... to dismissing
data protection for no reason other than it sounds unlikely to the author
(which in several cases I know personally to be false... because I've
experienced those failures).

ZFS is a very well designed filesystem. Things weren't added haphazardly or
because they sounded cool. The author would do well to try to understand why
those protections were added.

~~~
AstralStorm
Almost all of the protections are also afforded by plain old RAID without
ZFS. Why waste space on a CRC when you still get to run a redundancy check? If
the FS structure is corrupted, a CRC won't save you anyway; an fsck might.

------
DiabloD3
This entire article can be summarized as the following: RAID is not a
replacement for backups.

Sun/Oracle, and a lot of popular third-party documentation, have said as much
very openly, and commands like zfs send/recv exist to easily automate ZFS
cloning (to back up from one ZFS filesystem to another, for example, if you
choose to do it that way).

I suspect whoever wrote this missed the boat on why zfs works.

------
notacoward
Totally off base, on several points. Any kind of checksum on the disk only
protects what gets to the disk. Filesystem-level CRCs can protect the _entire
data path_. If you have a defect in your RAID card or HBA, or anywhere in the
software stack below the filesystem, on-disk CRCs will happily "validate" the
already-corrupted data while filesystem-level CRCs are likely to detect the
corruption. The author dismisses it as a "remotely likely scenario" but I've
seen it happen for real many times. Maybe that's because I have about 3.5x as
many years of experience as the author, across what's probably thousands of
times as many machines or drives (I've worked on some big systems).

The same "I've never seen it so it's not real" fallacy appears again in the
discussion of RAID 5. He says that losing a second drive during a rebuild is
"statistically very unlikely" but that's not so. Not only have I seen it many
times, but the simple math of disk capacities and interface speeds shows that
it's not really all that unlikely. I've seen _RAID 6_ fail because of
overlapping rebuild times, leading people to push for more powerful erasure-
coding schemes. Over the lifetime of even a medium-sized system, concurrent
failures on RAID 5 are likely enough to justify using something stronger.
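The back-of-the-envelope math alluded to here is easy to run. Consumer drives are commonly specced at one unrecoverable read error (URE) per 10^14 bits; a RAID-5 rebuild must read every surviving bit. The figures below (drive size, URE rate, array width) are assumed for illustration:

```python
import math

bits_per_drive = 3e12 * 8     # one 3 TB drive, in bits
ure_rate = 1e-14              # 1 unrecoverable read error per 1e14 bits read
surviving_drives = 3          # rebuilding a 4-drive RAID-5 reads 3 full drives

bits_read = bits_per_drive * surviving_drives
p_clean = math.exp(-ure_rate * bits_read)   # Poisson approximation
print(f"chance of at least one URE during rebuild: {1 - p_clean:.0%}")
```

With those assumptions the rebuild has roughly even odds of hitting a URE before it finishes, which is why "statistically very unlikely" does not survive contact with the arithmetic.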

I was one of the earliest and most outspoken critics of ZFS hype and FUD when
it came out. It was and is no panacea, but that doesn't justify more FUD in
the other direction to sell backup products or services.

------
Veratyr
While he's right that it's not as big an issue as ZFS fanatics make it out to
be, it _is_ a real issue and they're not just pulling it out of their asses.
There are a number of studies that actually measured the error rate, some of
the bigger ones being done by CERN [0], NetApp [1] and IA (I think there's
meant to be a talk or something to go with this one) [2].

ZFS certainly isn't a magic wand you should wave at anything and everything,
and it doesn't replace backups, but it does make the chance of something going
wrong undetected much smaller. Even though the chances are small to begin
with, there are times when you just can't accept them at all.

[0]:
[https://www.nsc.liu.se/lcsc2007/presentations/LCSC_2007-kele...](https://www.nsc.liu.se/lcsc2007/presentations/LCSC_2007-kelemen.pdf)

[1]:
[https://www.usenix.org/legacy/events/fast08/tech/full_papers...](https://www.usenix.org/legacy/events/fast08/tech/full_papers/bairavasundaram/bairavasundaram_html/index.html)

[2]:
[http://storageconference.us/2006/Presentations/39rWFlagg.pdf](http://storageconference.us/2006/Presentations/39rWFlagg.pdf)

~~~
X86BSD
Actually it does replace backups with replication and/or cloning.

~~~
bbatha
It's not backed up until it's _at least_ on an external system, ideally in
triplicate: off-box, off-site, and in cold storage. Cloning and replication
make it easier to back up but are no substitute.

~~~
X86BSD
ZFS send/recv to an offsite ZFS box is a backup, replacing tape systems. It's
incremental. It's compressed. It's faster.

------
ATsch
>Snapshots may help, but they depend on the damage being caught before the
snapshot of the good data is removed. If you save something and come back six
months later and find it’s damaged, your snapshots might just contain a few
months with the damaged file and the good copy was lost a long time ago.

The author seems to misunderstand the purpose of snapshots. As frequently [1]
pointed out, snapshots are not in fact backups and should not be used for
longer term storage.

Also the same argument can be used on Backups: "Backups may help, but they
depend on the damage being caught before the backup of the good data is
removed. If you save something and come back six months later and find it’s
damaged, your backups might just contain a few months with the damaged file
and the good copy was lost a long time ago."

[1] [http://www.cobaltiron.com/2014/01/06/blog-snapshots-are-
not-...](http://www.cobaltiron.com/2014/01/06/blog-snapshots-are-not-backups/)

------
OpenZFSonLinux
This blog post was deleted hours after I posted the following comment
rebutting most of what was said:

I don’t know much about btrfs so I’ll stick to ZFS-related comments. ZFS does
not use CRC; by default it uses the fletcher4 checksum. Fletcher’s checksum is
designed to approach CRC properties without the computational overhead usually
associated with CRC.
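A minimal sketch of what fletcher4 computes, heavily simplified from the optimized in-kernel version; the point is the running-sums structure, plain additions rather than CRC's polynomial arithmetic:

```python
import struct

def fletcher4(data):
    """ZFS-style Fletcher-4: four 64-bit running sums over 32-bit words."""
    a = b = c = d = 0
    mask = (1 << 64) - 1
    # Pad to a multiple of 4 bytes for illustration; ZFS blocks already are.
    data = data + b"\0" * (-len(data) % 4)
    for (word,) in struct.iter_unpack("<I", data):
        a = (a + word) & mask     # simple sums, no polynomial division
        b = (b + a) & mask
        c = (c + b) & mask
        d = (d + c) & mask
    return (a, b, c, d)

clean = b"some block of file data" * 100
damaged = bytearray(clean)
damaged[17] ^= 0x04                       # one flipped bit
assert fletcher4(clean) != fletcher4(bytes(damaged))
```

Any single flipped bit changes the first sum and therefore all four, so bit rot in a block can't produce the same checksum.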

Without a checksum, there is no way to tell if the data you read back is
different from what you wrote down. As you said, corruption can happen for a
variety of reasons – due to bugs or HW failure anywhere in the storage stack.
Just like other filesystems, not all types of corruption will be caught even
by ZFS, especially on the write-to-disk side. However, ZFS will catch bit rot
and a host of other corruptions, while non-checksumming filesystems will just
pass the corrupted data back to the application. Hard drives don’t do it
better: they have no idea if they’ve bit-rotted over time, and there are many
other components that may and do corrupt data; it’s not as rare as you think.
The longer you hold data and the more data you have, the higher the chance you
will see corruption at some point.

I want to do my best to avoid corrupting data and then giving it back to my
users, so I would like to know if my data has been corrupted (not to mention
I’d like it to self-heal as well, which is what ZFS will do if there is a good
copy available). If you care about your data, use a checksumming filesystem,
period. Ideally, a checksumming filesystem that doesn’t keep the checksum next
to the data. A typical checksum is less than 0.14 KB while the block it’s
protecting is 128 KB by default. I’ll take that 0.1% “waste of space” to
detect corruption all day, any day. Now let’s remember ZFS can also do in-line
compression which will easily save you 3-50% of storage space (depending on
the data you’re storing), and calling a checksum a “waste of space” is even
more laughable.
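The space arithmetic is easy to check; assuming a 256-bit checksum against ZFS's default 128 KiB record size:

```python
checksum_bytes = 32                 # a 256-bit checksum (e.g. SHA-256 output)
block_bytes = 128 * 1024            # ZFS's default 128 KiB recordsize
overhead = checksum_bytes / block_bytes

print(f"{overhead:.4%} overhead")   # prints "0.0244% overhead"
assert overhead < 0.001             # comfortably under 0.1%
```

Even for a full 256-bit checksum per record, the overhead is a few hundredths of a percent, well below what in-line compression typically wins back.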

I do want to say that I wholeheartedly agree with “Nothing replaces backups”
no matter what filesystem you’re using. Backing up between two OpenZFS
machines in different physical locations is super easy using ZFS snapshotting
and send/receive functionality.

------
zlynx
He missed all the history of ZFS too. Sun had actual customers with bit rot.
Even though they were running the highest grade of server hardware Sun
provided, they had invisible data errors which were only noticed when the
files were used, and analysis showed ECC passing bit errors.

ZFS was created to solve actual business problems.

------
random_comment
This entire article can be summarised as 'guy who has never used ZFS and has
no idea whatsoever about how it works writes a critique that exposes their
ignorance publicly'.

Here's a quote:

\- _“ZFS has CRCs for data integrity_

 _A certain category of people are terrified of the techno-bogeyman named “bit
rot.” These people think that a movie file not playing back or a picture
getting mangled is caused by data on hard drives “rotting” over time without
any warning. The magical remedy they use to combat this today is the holy CRC,
or “cyclic redundancy check.” It’s a certain family of hash algorithms that
produce a magic number that will always be the same if the data used to
generate it is the same every time._

 _This is, by far, the number one pain in the ass statement out of the classic
ZFS fanboy’s mouth... "_

Meanwhile in reality...

ZFS does not use CRCs for checksums.

It's very hard to take someone's view seriously when they are making mistakes
at this level.

ZFS allows a range of checksum algorithms, including SHA256, and you can even
specify per dataset the strength of checksum you want.
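Choosing one is a single property set per dataset; a quick sketch (the dataset name is a placeholder):

```shell
# Switch this dataset from the default fletcher4 checksum to SHA-256
# ("tank/photos" is a placeholder dataset name).
zfs set checksum=sha256 tank/photos
zfs get checksum tank/photos    # confirm the property took effect
```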

\- _" Hard drives already do it better"_

No, they don't, or Oracle/Sun/OpenZFS developers wouldn't have spent time and
money making it.

It makes a bit of a difference when your disk says "whoops, sorry, CRC fail,
that block's gone" and it was holding your whole filesystem together. Or when
a power surge or bad component fries the whole drive at once.

ZFS allows optional duplication of metadata or data blocks automatically; as
well as multiple levels of RAID-equivalency for automatic, transparent
rebuilding of data/metadata in the presence of multiple unreliable or failed
devices. Hard drives... don't do that.

Even ZFS running on a single disk can automatically keep 2 (or more) copies
on disk of whatever datasets you think are especially important - just set the
flag. Regular hard drives don't offer that.

\- _What about the very unlikely scenario where several bits flip in a
specific way that thwarts the hard drive’s ECC? This is the only scenario
where the hard drive would lose data silently, therefore it’s also the only
bit rot scenario that ZFS CRCs can help with._

Well, that and entire disk failures.

And power failures leading to inconsistency on the drive.

And cable faults leading to the wrong data being sent to the drive to be
written.

And drive firmware bugs.

And faulty cache memory or faulty controllers on the hard drive.

And poorly connected drives with intermittent glitches / timeouts in
communication.

You get the idea.

I could also point out that ZFS allows you to backup quickly and precisely
(via snapshots, and incremental snapshot diffs).

It allows you to detect errors as they appear (via scrubs) rather than find
out years later when your photos are filled with vomit coloured blocks.

It also tells you every time it opens a file if it has found an error, and
corrected it in the background for you - thank god! This 'passive warning'
feature alone lets you quickly realise you have a bad disk or cable so you can
do something about it. Consider the same situation with a hard drive over a
period of years...

ZFS is a copy-on-write filesystem, so if something naughty happens like a
power-cut during an update to a file, your original data is still there.
Unlike a hard disk (or RAID).

It's trivial to set up automatic snapshots, which as well as allowing known-
point-in-time recovery, are an exceptionally effective way to prevent viruses,
user errors etc from wrecking your data. You can always wind back the clock.

Where is the author losing his data (that he knows of, and in his very limited
experience...): _All of my data loss tends to come from poorly typed ‘rm’
commands._ ... so, exactly the kind of situation that ZFS snapshots allow
instant, certain, trouble-free recovery from in the space of seconds? [either
by rolling back the filesystem, or by conveniently 'dipping into' past
snapshots as though they were present-day directories as needed]
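For the curious, that recovery looks something like this (dataset and snapshot names are placeholders):

```shell
# A snapshot taken before the accident makes a botched rm a non-event
# ("tank/home" and "@hourly" are placeholder names).
zfs rollback tank/home@hourly

# Or, without rolling back the whole dataset, copy individual files
# out of the read-only hidden snapshot directory:
cp /tank/home/.zfs/snapshot/hourly/thesis.tex ~/thesis.tex
```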

Anyway, I do hope Mr/Ms nctritech learns to read the beginner's guide for
technologies they critique in future, and maybe even to try them once or
twice, before writing the critique.

What next?

 _" Why even use C? Everything you can do in C, you can do in PHP anyway!"_

~~~
Veratyr
> No, they don't, or Oracle wouldn't have spent money making it.

Tiny nitpick but though Oracle now owns and develops ZFS, Sun Microsystems was
the company that initially designed and implemented it. They worked on it for
5 years after they released it, before Oracle acquired them.

~~~
random_comment
Whoops, thanks for the catch. Have updated and also added OpenZFS to that
sentence.

------
Jaepa
I think one of the universal truths in tech is that those for it and those
annoyed by it both kind of miss the point.

------
X86BSD
I think what bothers me most is this person owns a computer related business.
He is actively endangering people's data out of willful ignorance. It's highly
unethical.

