> If you have, say, a 10-drive wide RAID6 you would need to source drives from 5 manufacturers/batches/models in order to be resilient to that kind of failure. Even if that was feasible that seems horrible to maintain long-term.

If anything, it's easier to maintain: all you need to ensure when replacing a drive is that you don't unintentionally end up with too many drives of one model in the array. In practice, it means you just regularly cycle which model you buy for your spares, instead of the often totally counter-productive practice of going to extra effort to find a supply of the exact same model.
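To make the bookkeeping concrete, here's a rough sketch of the kind of check I mean (purely illustrative - placeholder model strings, and you'd feed it whatever your own inventory reports, e.g. the "Device Model" line from smartctl -i):

    from collections import Counter

    # Hypothetical example: the model string for each member of a 10-drive RAID6.
    # Replace with however you inventory your own drives.
    array_models = [
        "Vendor-A Model-1", "Vendor-A Model-1",
        "Vendor-A Model-2", "Vendor-A Model-2",
        "Vendor-B Model-1", "Vendor-B Model-1",
        "Vendor-B Model-2", "Vendor-B Model-2",
        "Vendor-C Model-1", "Vendor-C Model-1",
    ]

    PARITY_DRIVES = 2  # RAID6 tolerates two simultaneous drive failures

    for model, count in Counter(array_models).items():
        if count > PARITY_DRIVES:
            print(f"WARNING: {count}x {model} - a bad batch or firmware bug in "
                  f"this one model could take out more drives than the array can lose")
        else:
            print(f"ok: {count}x {model}")

Run something like that before you swap in a spare, and it's hard to drift back into a homogeneous array by accident.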

In effect, at most places I've done this, it has simply translated into refilling our spares with the currently most cost-effective model or two and cycling manufacturers, instead of continuing to buy the same model.

The point is not to religiously prevent every potentially unfortunate combination, because these failures are fairly rare, but to reduce a very real risk using very simple means.

Over the 20+ years I've been doing this, I've seen at least 4-5 cases where homogeneous RAID arrays have been a major liability. The first one, which taught me to avoid this, was the infamous IBM "Death Star" (Deskstar), where the film on the platters was almost totally scraped off. We had an array we thankfully didn't lose data from, thanks to backups and careful management once the drives started failing once a week - only for the array to then take 4-5 days to rebuild... We didn't lose data, but we lost a lot of time babysitting that system and, as a precaution, working around our dependency on it.

I started mixing manufacturers after having had near misses with several arrays of OCZ drives, where the culprit appears to have been firmware problems shared across drive models.

> Doing a red/blue setup where your red systems use one type of drive and your blue systems use another type of drive seems like it could be reasonably accomplished.

You need multiple systems too, but the point is that every hour a system is down because of an easily avoidable problem is an hour where you're running with reduced resilience and capacity. It's trivial to prevent these kinds of failures from taking down a RAID array, so it's pretty pointless not to.



