Hacker News new | comments | show | ask | jobs | submit login

> While it is true that keeping a hash of a chunk of data will tell you if that data is damaged or not, the filesystem CRCs are an unnecessary and redundant waste of space ...

A few years ago I, when I was on a game console team, a hardware engineer came to my desk and said, "Can you find out what's wrong with this disk drive?" It had come from a customer whose complaint was that games sometimes failed to download and game saves became unreadable.

I spent a fun afternoon tracking down what turned out to be a stuck-at-zero bit on that drive's cache. Just above the drive's ECC-it-to-death block storage was this flaky bit of RAM that was going totally unchecked. The console had a Merkle-tree based file system and easily detected the failure, but without that addition checking the corruption would have been very subtle, most of the time.

Okay, so that's just one system out of millions, right? What are the chances? Well, at the scale of millions, pretty much any hole in data integrity is going to be found out and affect real, live customers at some not insignificant rate. You really shouldn't be amazed at the number of single-bit memory errors happening on consumer hardware (from consoles to PCs -- and I assume phones). You should expect these failures and determine in advance if they are important to you and your customers.

Just asserting "CRCs are useless" is putting a lot of trust on stuff that has real-world failure modes.

Just asserting "CRCs are useless" is putting a lot of trust on stuff that has real-world failure modes.

Yes, and he does this over and over again throughout the article. I have personally experienced at least 3 scenarios that he has determined won't happen.

If this guy wrote a filesystem (something that he pretends to have enough experience to critique), it would be an unreliable unusable piece of crap.

You have worse problems that a filesystem won't catch if RAM gets randomly corrupted. Including said CRC check itself getting corrupted or code writing putt data structures to disk being wrong. Neither of those is caught by CRC better than by a dirty bit. It so happens that journaling file systems already have a degree of redundancy for writes built into them unless you defeat it.

The trivial case is that the data is corrupted in RAM prior to being written. If we take the simple case of a 2-disk mirror, the same wrong data is going to be written to both disks, the checksums will match, and the filesystems and underlying disks will be oblivious to the problem. ZFS can't help here, but neither can RAID-5.

The far more risky situations involve reading back data.

A properly-optimized RAID or RAID-like system will read half the blocks from one disk and half from the other when dealing with a 2-disk mirror.

With RAID-1, if the data blocks read cleanly from one disk — that is, the hard disk's ECC does its thing, as the author expects — but the data bytes are then corrupted in RAM during the DMA transfer, RAID won't detect the problem. Your application will simply have errors in those blocks, and it'll be oblivious to the problem unless there is some corruption detection ability in the data format.

With a ZFS mirror, things are different. If the blocks are cleanly read from the disk (again according to those in-drive ECC checks) but the bytes are corrupted during the DMA transfer to RAM, ZFS will detect it, because it always double-checks the hashes — cryptographycally-strong hashes, mind, not CRCs, as the author misstates — after reading the data in from disk. This will cause ZFS to attempt a second read from the corresponding block in the other side of the mirror. Assuming you don't get a second RAM corruption, the checksum will match this time, so ZFS will re-write the clean block to the first disk. ZFS is incorrectly assuming it was the drive that corrupted the block, but it doesn't matter because all that happens is a correct block is overwritten with the same correct block.

Now let's take a trickier case. What if your RAM is so flaky that it re-corrupts the clean block on its way back out to the first disk during this unnecessary re-write? ZFS will write the correct checksum along with that block's data, so that when it comes time to re-read that block, the checksum won't match the data. It doesn't matter whether the RAM corrupts the checksummed data or the checksum itself, because the odds are astronomically against both being corrupted in a way that causes the two to match. When ZFS is told to re-read that corrupted block, either by the application or by a background scrub, it will again decide it needs to overwrite the first disk's copy of the block with the copy from the second disk, which this time is in fact corrupted on-disk. Unless your RAM corrupts the data a third time, this time it will write the correct data to disk.

RAID can't do any of that. All RAID can do is say, "These two blocks don't match each other, but both have good on-disk ECC, so PANIC." Different RAID implementations do different things here. Some will just mark the array as degraded and force the operator to choose one disk to mirror back onto the other. If the operator guesses wrong, you've got two copies of the bad data now.

ZFS doesn't have to guess: it knows which copy is wrong with astronomical odds in favor of being correct.

Consumer hardware is notriously busted. Even most of the enterprise hardware isn't flawless. Firmware bugs etc. "Your hardware is actively trying to kill your data and ZFS job is to prevent it." To paraphrase Allan Jude.

But did the console's software checking help in that case? Either way you're going to have a customer complaining about problems.

Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | DMCA | Apply to YC | Contact