Just asserting "CRCs are useless" is putting a lot of trust on stuff that has real-world failure modes.

Yes, and he does this over and over again throughout the article. I have personally experienced at least three of the scenarios he has determined won't happen.

If this guy wrote a filesystem (something that he pretends to have enough experience to critique), it would be an unreliable unusable piece of crap.

If RAM gets randomly corrupted, you have worse problems that a filesystem won't catch, including the CRC check itself getting corrupted, or the code writing out data structures to disk being wrong. Neither of those is caught by a CRC any better than by a dirty bit. It so happens that journaling file systems already have a degree of redundancy for writes built into them, unless you defeat it.

The trivial case is that the data is corrupted in RAM prior to being written. If we take the simple case of a 2-disk mirror, the same wrong data is going to be written to both disks, the checksums will match, and the filesystems and underlying disks will be oblivious to the problem. ZFS can't help here, but neither can RAID-5.
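
To make that concrete, here is a toy illustration (not real filesystem code; all names are made up) of why no checksum scheme can catch corruption that happens before the checksum is computed: both mirrors get the same wrong bytes, and verification passes.

```python
import hashlib

# Toy illustration of the write-path limit: if the data is corrupted in
# RAM *before* the checksum is computed, the checksum describes the bad
# data, both sides of the mirror store the same bad bytes, and every
# later verification happily succeeds.

intended = b"what the application meant to write"
corrupted = b"what the application meant to wri+e"  # bit flip before checksumming

digest = hashlib.sha256(corrupted).digest()  # checksum of the already-bad data
mirror_a = mirror_b = corrupted              # same bad data on both disks

# Verification passes on read, so nothing downstream can flag the damage:
assert hashlib.sha256(mirror_a).digest() == digest
assert mirror_a == mirror_b != intended
```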

The far more risky situations involve reading back data.

A properly-optimized RAID or RAID-like system will read half the blocks from one disk and half from the other when dealing with a 2-disk mirror.

With RAID-1, if the data blocks read cleanly from one disk — that is, the hard disk's ECC does its thing, as the author expects — but the data bytes are then corrupted in RAM during the DMA transfer, RAID won't detect the problem. Your application will simply have errors in those blocks, and it'll be oblivious to the problem unless there is some corruption detection ability in the data format.

With a ZFS mirror, things are different. If the blocks are read cleanly from the disk (again, according to those in-drive ECC checks) but the bytes are corrupted during the DMA transfer to RAM, ZFS will detect it, because it always double-checks the hashes (cryptographically strong hashes, mind, not CRCs, as the author misstates) after reading the data in from disk. This causes ZFS to attempt a second read from the corresponding block on the other side of the mirror. Assuming you don't get a second RAM corruption, the checksum will match this time, so ZFS will re-write the clean block to the first disk. ZFS is incorrectly assuming it was the drive that corrupted the block, but it doesn't matter, because all that happens is that a correct block is overwritten with the same correct block.
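
The verify-then-fall-back-then-heal logic can be sketched like this. This is a minimal in-memory model, not ZFS's actual implementation; `MirrorDisk` and `read_verified` are invented names, and the checksum lives apart from the data, as in a parent block pointer.

```python
import hashlib

class MirrorDisk:
    """Toy in-memory 'disk' standing in for one side of a mirror."""
    def __init__(self):
        self.blocks = {}
    def read(self, block_id):
        return self.blocks[block_id]
    def write(self, block_id, data):
        self.blocks[block_id] = data

def read_verified(mirrors, block_id, expected_digest):
    """Return the block, trying each mirror until the data matches the
    checksum stored separately from it; then re-write the good copy over
    any mirror whose copy failed verification (self-healing)."""
    for i, disk in enumerate(mirrors):
        data = disk.read(block_id)  # in reality this could be corrupted in RAM
        if hashlib.sha256(data).digest() == expected_digest:
            for bad in mirrors[:i]:      # heal the copies that failed earlier
                bad.write(block_id, data)
            return data
    raise IOError("all mirror copies failed checksum verification")
```

Note that the repair step cannot distinguish "the drive returned bad bytes" from "the bytes went bad in RAM after a clean read"; it just re-writes a verified copy, which is harmless either way.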

Now let's take a trickier case. What if your RAM is so flaky that it re-corrupts the clean block on its way back out to the first disk during this unnecessary re-write? ZFS will still write the correct checksum along with that block's data, so when it comes time to re-read that block, the checksum won't match the data. It doesn't matter whether the RAM corrupts the checksummed data or the checksum itself, because the odds are astronomically against both being corrupted in a way that makes the two match. When ZFS is told to re-read that corrupted block, either by the application or by a background scrub, it will again decide it needs to overwrite the first disk's copy of the block, which this time really is corrupted on-disk, with the copy from the second disk. Unless your RAM corrupts the data a third time, the correct data finally lands on disk.
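
The "either side can be corrupted, the mismatch is caught regardless" point is easy to demonstrate with any strong hash; the bytes below are arbitrary examples.

```python
import hashlib

# Corrupting either the data or the stored checksum produces a detectable
# mismatch; only a coordinated corruption of both that makes them agree
# (astronomically unlikely with a strong hash) would go unnoticed.

data = b"block contents"
digest = hashlib.sha256(data).digest()

corrupt_data = b"block c0ntents"                       # one bad byte in the data
corrupt_digest = bytes([digest[0] ^ 1]) + digest[1:]   # one flipped bit in the checksum

assert hashlib.sha256(corrupt_data).digest() != digest        # bad data caught
assert hashlib.sha256(data).digest() != corrupt_digest        # bad checksum caught
```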

RAID can't do any of that. All RAID can do is say, "These two blocks don't match each other, but both have good on-disk ECC, so PANIC." Different RAID implementations do different things here. Some will just mark the array as degraded and force the operator to choose one disk to mirror back onto the other. If the operator guesses wrong, you've got two copies of the bad data now.
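
The RAID-1 dilemma boils down to a comparison with no tiebreaker. This sketch is illustrative, not a real RAID driver; `compare_mirror` is an invented name.

```python
# With no independent checksum, a mirror comparison can only report
# *that* the two copies diverge, never *which* one is right.

def compare_mirror(copy_a, copy_b):
    """Return the data if the mirrors agree; otherwise all the array can
    do is go degraded and make the operator pick a side."""
    if copy_a == copy_b:
        return copy_a
    raise RuntimeError("mirror copies diverge; cannot tell which is correct")
```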

ZFS doesn't have to guess: it knows which copy is wrong with astronomical odds in favor of being correct.
