A batch of disks in a RAID array that were all from the same manufacturing lot. ...

jerf · on April 13, 2022

There's a couple of other well-known ways to screw up redundancy too, though I hope they're in far enough to know them.

One of my favorite is, if you want to store three copies of something on a lot of disks, randomly select which three disks you want to send the copies to so that each bit of content is stored on three random disks. It superficially seems very appealing; I have 3 copies of everything, nothing is likely to go wrong!

However, what you've actually created is a situation where if any three disks go down, you're guaranteed to lose data because as you scale up, the probability that those three disks were the three randomly-chosen disks for some set of data goes to 1. The probability of three disks going down at some point also likewise trends to 1 as you scale up.

Not saying that was the case here, just sharing it as one of the well-known pitfalls I find particularly interesting.

GauntletWizard · on April 13, 2022

I keep hearing this pitfall, and then I keep seeing the results of idiots who don't understand that by not picking three totally at random, what they have created is a giant mess should any of those three fail, instead of a quick and easy to clean up mass that you are constantly raid rebuilding from.

jl6 · on April 13, 2022

How could any approach involving n copies of data mitigate against loss of n devices? What is the superior approach that avoids this pitfall?

toast0 · on April 13, 2022

If you store three copies of all your files, with copies uniformly distributed over the disks, and three disks fail simultaneously [1], you are guaranteed to lose data.

If instead, you break your disks into N/3 groups of three, and store each file three times in one group, you only lose data when all three disks in one group fail simultaneously. If you're lucky, you can lose 2/3rds of your disks before you lose data.

Edit to add:

If you're using mirroring, it's low cost to do mirroring by groups although uniform distribution can help with hotspots. With uniform distribution, even if one device has two popular files, the other copies of those files are likely to be on devices with only one popular file. With grouping, if one group gets two popular files, all devices in the group have two popular files and therefore elevated use.

If you're using a parity scheme (raid6/raidz2/some other erasure coding), the cost to do grouping is more apparent. Each group has the +X parity cost; If you did raidz2, and had 45 drives in a case, you could do one set of 45, and get 43 drives worth of storage, or three sets of 15 and get 39 drives worth of storage, or five sets of 9 to get 35 drives worth of storage, etc. In each of the smaller sets, you still can lose data with the first three failures, but you need an increasingly large number of failures to guarantee data loss. Also, the size of the data lost when there's a loss is smaller.

[1] or at least, the third disk fails before either of the first two failures is repaired

jl6 · on April 13, 2022

Intuitively, it feels as though grouping is a tradeoff that lowers the probability of a data loss event, at the expense of increasing the amount of data lost in a data loss event?

Edit: “proof” by example…

9 disks, replication factor of 3. 9 choose 3 = 84 ways to allocate a file’s replicas across the disks. If 3 disks fail, on average 1/84 of the total data is lost (assuming >>84 files).

Versus… 3 groups of 3 disks. If 3 disks fail, the probability of the second disk failure being in the same group as the first disk failure is 2/8, and then 1/7 that the third failure is also in the same group. (2/8)*(1/7)=1/28.

So a 100% chance of losing 1/84 of the data versus a 1/28 chance of losing 1/3 of the data. Same total expected amount of data lost.

jerf · on April 14, 2022

You begin to see why this is a subtle problem, and why I think it catches so many people. Naive statistics, and perhaps "intuitive" statistics, say it's no different. In reality it's hugely different.

It is generally preferable in reality to take the second choice because additional realities intervene to give you time to deal with the problem, allowing the existence of fairly highly-reliable systems, whereas those same additional realities mean that in practice, you will lose data with some frequency if you select the uniformly random result.

Sometimes the better systems still catastrophically lose data, that is indeed inevitable as you point out, but what it takes to knock out that system would have result in lost data in the bad uniformly-random system, too, so in practice it's not much of a tradeoff.

hoofhearted · on April 13, 2022

I believe the more likely scenario was that they lost a disk and replaced it.. While the disk array was restriping, another disk in the raid array failed because of the restriping load, and all data would be lost at that point in a raid setup as they described.

toast0 · on April 13, 2022

I dunno, HPE had two issues with disks failing at specific power-on hours[1], and that wasn't the first time I've heard of similar things, just the one I can remember the vendor for. It's a lot more work to mix batches and stagger power on times than to build a bunch of disks systems at one time and turn them all on at once.

[1] https://www.hpe.com/us/en/services/sas-ssd-advisory.html