
"And that is because the designers assumed that the disk would be fine or just fail, and not be in some state in-between."

Interesting, if you make that assumption you will get burned. I have personally experienced disks that, through firmware errors, returned what was essentially a freed memory block from their cache as the sector data, returned success status on a write that never actually happened, and flipped bits in the data they actually returned (without error). The whole point of RAID for data reliability is to catch these things. (RAID for performance is different, and people tolerate errors in exchange for faster performance.)

We used to start new hire training at NetApp with the question, "How many people here think a disk drive is a storage system?" and then proceed to demolish that idea with real world data that showed just how crappy disks actually were. You don't see these things when you look at one drive, but when you have a few thousand to a couple of million out there spinning and reporting on their situation you see these things happen every day.

The issue of course is that margins on drives are razor thin and manufacturers are always trying to find ways to squeeze another penny out here and there. Generally this expresses itself as dips in drive reliability, a few basis points at a time, on a manufacturing-cohort basis. You can make a more reliable drive; you just have a hard time justifying it to someone who is only ever going to see one in a million 'incorrect' operations.

This is the same reason ECC RAM is so hard to find on 'regular' PCs: the chance of you being screwed is low enough that you'll either assume it was something else or be unwilling to pay a premium for the extra chip you need on your DIMMs for ECC.




Firstly, I should say that I'm talking about consumer-available RAID systems here, since that's what the article discusses.

>> "And that is because the designers assumed that the disk would be fine or just fail, and not be in some state in-between."

> Interesting, if you make that assumption you will get burned.

That is probably why the designers of RAID have moved on to other storage systems. Yet RAID is currently the only viable option for consumers trying to get "reliable" storage, which is why the article was written.

> The whole point of RAID for data reliability is to catch these things. (RAID for performance is difference and people tolerate errors in exchange for faster performance).

No, RAID is for withstanding disk failures, where the disk fails in a predicted manner. This is a common misunderstanding. Nothing in the RAID specifications says they should detect silent bit flipping. If you find a system that does this, it goes beyond the RAID specification (which is a good thing). You seem to be under the impression that most systems do, but that is not true. Most RAID systems that consumers can get their hands on won't detect a single bit flip. The author demonstrated it on software RAID, and I am fairly certain the same thing will happen if you try it on a hardware RAID controller from any of LSI, 3ware, Areca, Promise, etc.

For instance, Google has multiple layers of checksumming, at both the filesystem and application levels, to guard against bitrot. If RAID protected against it, why wouldn't they trust it?
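
To make the distinction concrete, here is a rough sketch of that kind of application-level check (Python; the block size, the digest choice, and where the checksums live are all invented for illustration, not a description of what Google actually does). The point is simply that a digest stored alongside the data catches a flipped bit that the disk and the RAID layer both report as a successful read:

    import hashlib

    BLOCK = 4096  # hypothetical application block size

    def put(f, checksums, idx, data):
        # Keep a digest per block; here just an in-memory dict for illustration.
        f.seek(idx * BLOCK)
        f.write(data)
        checksums[idx] = hashlib.sha256(data).hexdigest()

    def get(f, checksums, idx):
        f.seek(idx * BLOCK)
        data = f.read(BLOCK)
        # The disk and the RAID layer both report success here; only the
        # digest comparison notices a flipped bit.
        if hashlib.sha256(data).hexdigest() != checksums[idx]:
            raise IOError("silent corruption in block %d" % idx)
        return data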


Handling this particular case (bad data from the disk, without an error indication from the disk) is pretty straightforward programming. If the device doesn't allow for additional check bits in the sectors, then reserving one sector out of 16 (or one out of 9, say, if you're doing 4K blocks) for check data on the other sectors will let you catch misbehaving drives and, if necessary, reconstruct the bad data with the RAID parity.

If you use a 15/16ths scheme you only "lose" a bit over 6% of your drive to check data, and you gain the ability to avoid really nasty silent corruption. I'm not sure why even a "consumer" RAID solution wouldn't do that. Even RAID-1 (mirroring) setups can use this for a bit more protection, although the mean-bit-error spec still bites them when trying to do a resilver.
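
Very roughly, the check-sector bookkeeping could look like this (a sketch in Python; the sector size, group size, and use of CRC32 are my assumptions, just to illustrate the 15-data-plus-1-check layout). A mismatch on read tells you which sector is lying, and the RAID parity on the other drives gives you the material to rebuild it:

    import zlib

    SECTOR = 512   # assumed sector size
    GROUP = 15     # 15 data sectors + 1 check sector per group

    def build_check_sector(data_sectors):
        # One CRC32 per data sector, packed into the 16th sector of the
        # group: 15 * 4 = 60 bytes of check data, padded to a full sector.
        check = b"".join(zlib.crc32(s).to_bytes(4, "little")
                         for s in data_sectors)
        return check.ljust(SECTOR, b"\x00")

    def verify_group(data_sectors, check_sector):
        # Return the indices of sectors whose contents no longer match
        # their stored CRC; those are the candidates for reconstruction
        # from the RAID parity on the other drives.
        bad = []
        for i, s in enumerate(data_sectors):
            stored = int.from_bytes(check_sector[i * 4:(i + 1) * 4], "little")
            if zlib.crc32(s) != stored:
                bad.append(i)
        return bad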

As for Google, if you'd like to understand the choice they made (and may un-make) you have to look at the cost of adding an available disk. The 'magic' thing they figured out was that they were adding lots and lots of machines, and most of those machines had a small kernel, a few apps, some memory, and an unused IDE (later SATA) port or ports. So adding another disk to the pizza box was "free" (if you look at the server in the Computer History Museum you will see they just wrapped a piece of velcro around the disk and stuck it down next to the motherboard on the 'pizza pan'). In that model an R3 system, which takes no computation (it's just copying, no ECC math), with data "chunks" (in the GFS sense) that had built-in check bits, gives you, voila, "free" storage.

And that really is very cost effective if you have work for the CPUs to do; it breaks down when the amount of storage you need exceeds what you can acquire by either adding a drive to a machine with a spare port, or replacing all the drives with their denser next-generation model. Steve Kleiman, the former CTO of NetApp, used to model storage with what he called a 'slot tax': the marginal cost of adding a drive to the network (a fraction of a power supply, chassis, cabling, I/O card, and carrier if there was one). Installing into pre-existing machines at Google, the slot tax was as close to zero as you can reasonably make it.
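
To put some shape on the slot tax idea, a back-of-the-envelope sketch (every number here is invented, purely to show the calculation, not real costs):

    # All numbers invented: the "slot tax" is the marginal cost of giving
    # a drive somewhere to live, beyond the price of the drive itself.
    drive            = 100.0  # the bare drive
    share_of_psu     = 15.0   # fraction of a power supply
    share_of_chassis = 20.0   # chassis, backplane, cabling
    share_of_hba     = 10.0   # fraction of an I/O card
    carrier          = 5.0    # drive sled, if there is one

    slot_tax = share_of_psu + share_of_chassis + share_of_hba + carrier
    print("slot tax: $%.2f, or %.0f%% on top of the drive itself"
          % (slot_tax, 100 * slot_tax / drive))
    # In the early Google pizza-box model the port, power, and chassis were
    # already paid for, so the slot tax was close to zero.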

That said, as Google's storage requirements outgrew their CPU requirements it became clear that the cost was going to be an issue. I left right about that time, but since then there has been some interesting work at Amazon and Facebook on so-called "cold" storage: ways to have the drive live in a datacenter but be powered off most of the time.

I can't agree with the statement "No, RAID is for withstanding disk failures, where the disk fails in a predicted manner", mostly because disks have never failed in a "predicted manner"; that is why Garth Gibson invented RAID in the first place. He noted all the money DEC and IBM were spending trying to make an inherently unreliable system reliable, and observed that if you were willing to give up some of the disk capacity (through redundancy, check bits) you could take inexpensive, unreliable drives and turn them into something as reliable as the expensive drives from those manufacturers. It is a very powerful concept, and it changed how storage was delivered. The paper is a great read, even today.



