RAID's Days May Be Numbered

wmf · on Sept 18, 2009

You can see the solution to this in something like XIV's chunk RAID (I suppose this is an extreme form of declustered RAID) where a 1 TB disk can be rebuilt in 30 minutes because all disks in the system perform rebuild in parallel. I think Xiotech and Drobo also use chunk RAID.

I am concerned that no one in the open source world appears to be working on chunk RAID. We may end up in a situation where it's fundamentally unsafe to run Linux/BSD software RAID on our 4 TB disks.

JulianMorrison · on Sept 18, 2009

I'm surprised file systems don't offer the ability to save data to disk as an error-correcting code.

logicalmind · on Sept 18, 2009

That's what RAID5 does. It protects from single drive failure by placing an "error-correcting code" on an alternate disk. Unless you're talking about error-correcting codes on a single disk, eg. within the filesystem. But this doesn't solve the same problem as RAID is intended to solve.

JulianMorrison · on Sept 18, 2009

Parity is a very weak ECC, basically. A more complex one of the sort used in outer-space networking can detect and rescue large swathes of destruction, or even tolerate it and prioritize relocation of the most damaged data. It doesn't protect against outright drive failure, but straight-up mirroring will do that. This is of course at the cost of tunable space wastage.

aplusbi · on Sept 18, 2009

Technically that's RAID 4, 5 distributes the parity across all disks.

signa11 · on Sept 18, 2009

isn't that exactly what zfs supports ?

JulianMorrison · on Sept 18, 2009

No, ZFS has checksums. It detects errors but can only repair them if the data is mirrored.

rbanffy · on Sept 18, 2009

ECC would work if you could read both the ECC and enough of the data out of the defective block. I am not sure how frequent this situation is, but I suspect full-block failures are more common.

Considering the trends in storage capacity, I think the mirroring approach done with ZFS is a clever solution. One could even conceive a file system that would position multiple copies of blocks of frequently accessed files all over the disks in order to maximize throughput, keeping only a minimum redundancy condition declared by file.

I am happy file systems are no longer boring ;-)

rbanffy · on Sept 19, 2009

Ah... The difference a day makes.

The ECC idea could work, but it would be best to build it into the drive's firmware. This way the drive would get read errors less often, could attempt to rewrite the block or to relocate it somewhere else in case of a "bad enough" error.

jasonwatkinspdx · on Sept 20, 2009

I think in the longer term we'll see storage that does read repairs continuously and so can tolerate out of sync redundant data. Where exactly to place this in the file/volume/block hierarchy isn't clear.

ars · on Sept 18, 2009

Very good points, but I think his conclusion is wrong.

I think the fix will come from the filesystem. A filesystem that automatically mirrors (or distributes ecc code for) files, and checks checksums is the future.

wmf · on Sept 18, 2009

Moving the same functionality from one layer to another won't help much. For example, rebuilding a full disk in ZFS takes just as long as in traditional RAID because the same work is being done. ZFS does help if your filesystem is not full since it doesn't bother rebuilding empty space, but this doesn't provide the order of magnitude improvement that we are looking for.

jasonwatkinspdx · on Sept 20, 2009

That's not entirely true. If we're at the block level, it's more difficult for us to optimize recovery. If we're at a higher level where we have access to some sort of summary information, like the checksummed tree of blocks in zfs, then we can quickly isolate mismatched data and copy only that from the available to the recovering drive.

If we want the same capability at a layer below the file system, say the raid controller, than we end up needing a much more complicated implementation that is in effect a layering of two filesystems. Given general trends towards clustered commodity hardware I'd expect the software to keep getting smarter and the block level devices to keep getting dumber.

It seems like there's a role that's missing in the storage hierarchy. Something like an extent manager that doesn't maintain all the actual metadata, leaving that to the file system above, but that does have a concept of a tree of indirectly referencing extents.

Something similar to how allocation and garbage collection can be provided by libraries in c++.

rbanffy · on Sept 18, 2009

Except that a filesystem can tune the redundancy of the data to the amount of free space in the storage pool. It could be considered full when a minimum of redundancy was reached (like every block being stored at least on two different disks).

As soon as a disk fails and is taken out of the pool, the file system could (conceivably, at least) start replicating all data on the remaining disks maintaining the redundancy minimum and shrinking the window of vulnerability to multi-drive failures.

By the time a second disk fails, all data on it could be replicated elsewhere and all you would observe would be a shrinking file system (and urgent messages from the server).

docmach · on Sept 18, 2009

That is exactly what ZFS does.

yardie · on Sept 18, 2009

RAID is not a backup solution. Anyone that seriously uses RAID knows what it is and when to use it. If you are worried about the integrity of your data than you need a real backup solution. RAIDn won't restore your data incase of shock, fire, or hurricanes.

A thought that occurred to me is I don't know anyone that replaces a hard drive when it gets too old. They replace them when SMART says failure is imminent or the drive has already crashed. Hard drives should be treated like tapes. No one replaces tape when they are too ragged to be used. Tapes are replaced on a scheduled (like 30-40 writes). Replace drives after a duty cycle of 5,000 hours and you can minimize your exposure.

e40 · on Sept 18, 2009

"RAID is not a backup solution. Anyone that seriously uses RAID knows what it is and when to use it. If you are worried about the integrity of your data than you need a real backup solution. RAIDn won't restore your data incase of shock, fire, or hurricanes."

You say that like it has something to do with the article. It doesn't and it is not implied in any way.

mmt · on Sept 18, 2009

Why 5000?

Personally, I would quadruple that:

http://usenix.org/events/fast07/tech/schroeder/schroeder_htm... http://usenix.org/events/fast07/tech/full_papers/pinheiro/pi...

yardie · on Sept 18, 2009

5000 is just something I made up as an example. What should be done is HD manufacturers should rate their drives up to a certain amount of hours they have tested themselves. I'm not saying the MTBF of drives. But the chances of data integrity along the life of the drive. Like after 5000 hours you will have 99% integrity, 10000 hours is 95% integrity, 20000 hours is 80%, and so on.

They same way tire manufacturers rate their tires for so many miles or kilometers. No logical person waits for a car tire to self destruct before replacement. You note the odometer and make a record of when to change them. So you're not the unfortunate schmuck stuck in snow storm with shredded radials.

Data storage is the antithesis of good maintenance. Instead of proaction its reaction. You buy and install a drive, wait for it to die, then replace it.

Tuna-Fish · on Sept 18, 2009

Actually, I'm not so sure your advice is very good.

The likelihood of a drive failing does not linearly increase with it's age. For relatively young drives, the opposite is true -- if the drive has survived a few weeks of heavy use, it is much LESS likely to fail in the future than a fresh drive off a shelf. Tires are changed routinely because they wear -- hard drives don't so much wear as spontaneously fail. Changing them "before failure" costs you money and increases the amount of drives that blow up in your face.

yardie · on Sept 18, 2009

I think this has more to do with manufacturing consistency than anything else. If a hard drive fails in a few weeks than it's quite likely the lot it came from is also likely to fail.

Unless something fundamental has changed about hard drives I think my point is still correct. Hard drives are electromechanical devices that also wear; bearings wear, fluids evaporate, and heads hit the platter. They spontaneously fail like light bulbs spontaneously fail. It's not spontaneous at all. If you know a light bulb is about to blow or is near the end of it's life cycle you change it.

In the greater scheme of things the cost of the hardware is miniscule compared to the data on it. If drives start randomly blowing up in your face it's time to get a different model.

What I'm getting at is we put more care into the maintenance of our cars than we do our data. If you think hard drives are expensive than what do you think of the cost of productivity while an office full of people wait for a RAID to rebuild. Instead of waiting for the inevitable failure wouldn't it be better to cycle old drives on a friday evening when usage is low.

RAID6 is just a bandaid on a bandaid. It came about because drives are of the same age when a RAID is built. And start failing around the same time. RAID 5 can recover from one failed drive, RAID6 can do 2. But you are still vulnerable if a 3rd. But this is still reactive thinking, "I will replace them as they fail", rather than using proactive solutions.

mmt · on Sept 18, 2009

For relatively young drives, the opposite is true -- if the drive has survived a few weeks of heavy use, it is much LESS likely to fail in the future than a fresh drive off a shelf

References, please.

Tuna-Fish · on Sept 19, 2009

(pdf) http://labs.google.com/papers/disk_failures.pdf results start on page 4.

tocomment · on Sept 18, 2009

Dumb question here, how do I get at the smart readings? How will I know if it's warning me? I'd like to know for Ubuntu and for Windows XP.

spydez · on Sept 18, 2009

In Google's whitepaper on hard drives a while back, they showed that, basically, SMART is useless.

So, take SMART your data with grains of salt and all that.

thwarted · on Sept 18, 2009

"man smartctl" is the place to start.

I don't know how to do it on Windows. There's probably a GUI. Although it seems that smartmontools has Windows support.

Confusion · on Sept 18, 2009

Of course RAID is a backup solution. What is a redundancy if not an immediately available backup? Having a RAID 1 or 5 array ensures that you can continue to serve your data when one of your disks crashes. Saying RAID is not a backup solution is just playing wordgames by redefining 'backup'.

sp332 · on Sept 18, 2009

It's not a backup because if you delete a file from the "main" filesystem, it is immediately deleted from the "backup" as well.

decode · on Sept 18, 2009

I think the important thing here is that the term "backup" only makes sense within the context of what kind of thing you're protecting against.

RAID is a useful backup for drive failures, if you're able to replace failed drives within a certain amount of time, but not for manual deletions. DVDs and tapes are useful backups for manual deletions, but not for fires. DVDs and tapes, stored at an offsite location, are useful backups for manual deletions and fires, but possibly not for earthquakes.

Since all of the different backup and archival methods have major pros and cons, the only way to say something meaningful about how useful they are is to talk about them within context.

sjs · on Sept 19, 2009

We've solved this semantic issue. Redundancy is used to refer to availability while backup is used to refer to recovery from data loss. Hence the R in RAID.

sjs · on Sept 18, 2009

"I know this sounds like a semantic quibble, but words mean things."

RAID is a redundancy mechanism. It's there in the name. rdiff-backup is a backup tool.

theBobMcCormick · on Sept 18, 2009

RAID is not a backup solution, it's an availability solution. RAID (with the exception or RAID 0 of course) helps you stay available and running in the event of a hardware failure. Raid is no more a "backup" mechanism then load balancing!