How multi-disk failures happen (sysadmin1138.net)
113 points by Serplat 1784 days ago | 91 comments

My favorite RAID 5 failure mode is when an old hardware card fails, you have no spare in stock, the maintenance contract was not renewed years ago, the cost of a new card is too much to be expensed, and buying one as a capital replacement will take a week or two minimum to be approved. And/or the card has been discontinued, so you have to buy a used one from a shady foreign surplus dealer. I've seen too much of this type of thing... I can tolerate the (minimal) cost of software RAID but I can't survive the possible downtime of hardware RAID, so it's been software RAID for me for pretty much the last decade.

Another fun one was the quad redundant power supply with all four plugs going into the same power strip.

And there was the power supply that blew out every drive in the box simultaneously. I suppose it's not any worse than a lightning strike, except that the onsite tech assumed it was a cooling failure, so he replaced the dusty fans and all the drives, thus destroying an entire set of drives upon powerup (and fans, I would guess).

The poison drive tray that bent the backplane pins of every slot you jammed it into. That turned into a huge expensive disaster.

You don't need to replace an old or discontinued RAID card with a used one (assuming you mean the same make and model). A new card from the same manufacturer will work. The RAID card writes DCBs on disks that are read by the new RAID card upon replacement.

"The RAID card writes DCBs on disks that are read by the new RAID card upon replacement."

Which also makes for very nice RAID failures, like this one that has happened to me on an HP controller:

A drive fails because of some SCSI electronics problem and when you replace it, the controller gives it a different SCSI ID. Now, the controller maps RAID arrays to drives, and it is now impossible to add the replacement drive to the degraded array because SCSI IDs in these controllers aren't user defined and the controller doesn't allow the degraded array to be modified.

And since the controller has now happily written its configuration onto the drives, it doesn't matter that you shuffle the drives around to try to force the controller into giving up its internal configuration.

Oh, and the controller is an onboard controller, so you can't just replace it with another one (which would also read the configuration on disk and put itself in the same stupid state, I suppose).

Using RAID-5 is the primary error here. RAID-5 (or single parity RAID of any kind) is obsolete, period. The story here doesn't ring true to me to be honest; I'm currently herding several hundred multi-terabyte servers, and multiple drive failures appear in one case only: when using Seagate Barracuda ES2 1TB or WD desktop class drives. These are the two very problematic setups. In all other cases, use RAID-6 and all will be well.

I'd add that current "Advanced Format" (4K-sector) drives are tremendously better than most older drives. If your drive is an old 1 or 2 TB (512B sectors) drive, use it only for backup or some other menial use for unimportant data.

To add some extra emphasis to wazoox's point, RAID-6 is always a better choice than RAID-5.

If you're willing to take a capacity hit for improved write performance, RAID-1+0 is great. Though you can only survive two disks failing if they are in different pairs.
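To put a rough number on that "different pairs" caveat, here is a small Python sketch. It assumes two-way mirror pairs and that the second failure is equally likely to hit any surviving disk, which real failures (same batch, same heat) often are not. The function name is made up for illustration.

```python
from fractions import Fraction


def raid10_second_failure_fatal(n_disks: int) -> Fraction:
    """Chance that a second disk failure kills a RAID-1+0 array.

    Assumes two-way mirror pairs and that each surviving disk is
    equally likely to be the next one to fail.
    """
    if n_disks < 4 or n_disks % 2:
        raise ValueError("RAID-1+0 needs an even number of disks, at least 4")
    # After one failure, n_disks - 1 disks survive; only the failed
    # disk's mirror partner is fatal to lose.
    return Fraction(1, n_disks - 1)


for n in (4, 8, 16):
    p = raid10_second_failure_fatal(n)
    print(f"{n} disks: second failure is fatal with probability {p} ({float(p):.1%})")
```

So under those assumptions the wider the array, the less likely the second failure lands on the one disk that matters.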

You should also not look at RAID as infallible, if the data is important it should be mirrored in multiple locations.

> You should also not look at RAID as infallible, if the data is important it should be mirrored in multiple locations.

Right. Because it bears repeating still: RAID is not for backup, RAID is for high availability.

The simple way to support this is that RAID doesn't protect against a slip of the fingers on the commandline.

And of course, depending on the application, you should be using 1+0 for performance reasons over 5 anyway, if it's a database server.

Yes. The decision to use RAID5 instead of RAIDs 1, 10 or 0+1 shows a decision to cut costs. With mirroring, the mirror drives would take over when a disk failed.

Also, only one hot spare is in the set. Another cost saving.

Yet another decision: RAID5 across 7 disks plus a hot spare, instead of, say, across 6 or 5 disks. That gives you two more chances for a disk to go bust and have to be rebuilt from parity.
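A back-of-the-envelope Python sketch of that "two more chances" point. It assumes disks fail independently with the same probability during the rebuild window, which is optimistic for drives from one batch, and the 1% figure is an arbitrary illustration, not a measurement.

```python
def p_another_failure(n_disks: int, p_disk: float) -> float:
    """Probability that at least one surviving disk fails during a rebuild.

    p_disk is the (assumed) chance that any single disk fails within the
    rebuild window; failures are treated as independent.
    """
    survivors = n_disks - 1
    return 1.0 - (1.0 - p_disk) ** survivors


# p_disk = 1% is an arbitrary illustrative figure, not a measured one.
for n in (5, 6, 7):
    print(f"{n}-disk RAID5: {p_another_failure(n, 0.01):.2%}")
```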

What if the disks are OK but the server host adapter card gets fried? Or the cable between the server and array? Some disk arrays allow for redundant access to the array, and some OS's can handle that failover.

Before I read the article, I thought it might discuss heat. Excessive heat is usually the cause when disk arrays start melting down one after another. Usually the meltdown happens in an on-site server closet/room which was never properly set up for having servers running 24/7. Usually the straw that breaks the camel's back is a combination of more equipment being added and hot summer days. Then portable ACs are purchased to temporarily mitigate things, but if their condensation water tanks are not regularly emptied, they stop helping. This situation occurs more often than you would imagine; luckily I have not been the one who had to deal with it every time I have seen it (although sometimes I have). Usually the servers are non-production ones which don't make the cut to go into the data center.

The heat problem happens in data centers as well, believe it or not. A cheap thermometer is worth buying if you sense too much heat around your servers. Usually the heat problem is less bad, but the general data center temperature is a few degrees higher than what it should be, and this leads to more equipment failure.

Hard drives are pretty resilient to high temperatures. Google did a reliability analysis of thousands of hard drives and found:

"Overall our experiments can confirm previously reported temperature effects only for the high end of our temperature range and especially for older drives. In the lower and middle temperature ranges, higher temperatures are not associated with higher failure rates. This is a fairly surprising result, which could indicate that datacenter or server designers have more freedom than previously thought when setting operating temperatures for equipment that contains disk drives. We can conclude that at moderate temperature ranges it is likely that there are other effects which affect failure rates much more strongly than temperatures do."


I'd love to see the same research for SSDs.

Also, parent post does talk about higher end temps, not middle range temps.

Personally I've not had much luck with hot-spares, I'd prefer to have the spare in the array (in the case of RAID-6) so I can find out if there's a problem with that drive before it's the only thing standing in the way of total failure.

I'm a fan of the temperature@lert USB sensors; $130 gets you peace of mind: http://www.temperaturealert.com/

>With mirroring, the mirror drives would take over when a disk failed.

The mirror drives that also have bad sectors? Then you go even longer without noticing you have problems.

Maybe for the enterprise. What about home use?

About a month ago, I had 4 drives scattered around the house, each in its own enclosure, and I wanted to consolidate them into one unit. Money was an issue, so I wanted to recycle as many of them as possible instead of buying new ones. A Synology NAS along with a single extra drive allowed me near-optimal use of space with 1-drive redundancy. Of course, I have weekly backups to an external drive, so even if the array fails during a drive swap, I'll still have all my important files.

Any other solution would either require me to buy more drives (a significant expense at $100+ a pop), sacrifice redundancy, or build my own NAS with ZFS (which would have significant administration overhead, cost more, and be larger than my Synology unit).

Backing up to an external drive isn't enough if you're really worried about the data. If your house burns down, or the single backup drive fails, you're out of luck.

Synology's devices support automatic backup to S3, use it.

EMC uses RAID5 as the default for storage arrays, and then has some number of global hot spares. Netapp uses RAID6 by default, and then also has some number of hot spares. I've never had data loss from either system as a result of multi-drive failure. RAID5 is perfectly fine in most instances.

Desktop drives will drop out of RAID arrays frequently, so you have to use RAID6 if you choose to go that route. If a disk drops into deep checking mode for physical errors, then it won't respond to the RAID controller fast enough, and will be considered a dead drive. It will subsequently be re-detected, and then the array has to be rebuilt.

> RAID5 is perfectly fine in most instances.

If you're using low capacity, enterprise class SAS drives, mostly, yes. However when using large capacity SATA drives, it most definitely isn't.

SATA drives (even "enterprise" SATA drives) have an official unrecoverable error rate of 1/10^14. From my experience, the truth is more like 1/10^13.

10^13 bits is roughly 1.25 terabytes (and 10^14 bits about 12.5 TB). That means that for every 1.25 TB you read at the rate I actually see, you should expect one unrecoverable bit error on average. In the case of rebuilding a 10 TB array (only a few 3 or 4 TB drives) using RAID-5, that means that you're almost sure to have an ECC error that will prevent you from ever rebuilding properly without corruption.
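For anyone who wants to play with the numbers, a rough Python sketch of the arithmetic. It counts 8 bits per byte, uses decimal terabytes, and assumes bit errors are independent, which is a simplification. The function name is just for illustration.

```python
import math


def p_unrecoverable_read(bytes_read: float, ber: float) -> float:
    """Probability of at least one unrecoverable bit error while reading.

    ber is the per-bit unrecoverable error rate (e.g. 1e-14). Bit errors
    are assumed independent, which is a simplification.
    """
    bits = bytes_read * 8
    # 1 - (1 - ber)**bits, computed in a numerically stable way
    return -math.expm1(bits * math.log1p(-ber))


TB = 1e12  # decimal terabyte, as drive vendors count
for ber in (1e-14, 1e-13):
    p = p_unrecoverable_read(10 * TB, ber)
    print(f"BER {ber:g}: expected errors {10 * TB * 8 * ber:.1f}, P(at least one) {p:.2%}")
```

Reading 10 TB works out to roughly a coin flip at the official 1/10^14 rate and a near certainty at 1/10^13.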

I have an 8 bay Drobo Pro with eight 2 TB drives, and I have a drive go out every few months. Of course this is because we went cheap with WD Green drives. However the Drobo Pro offers to use two drives "as protection" so you can have two drives go out and it keeps going.

You could be having the TLER problem. WD Green drives can take a long time to do error correction, and this causes the RAID to drop the drive as failed. The solution is for the drive to respond with an error quickly and let the RAID fix it with parity... at least then it doesn't drop the drive. Or buy more expensive drives. Wait, isn't that the idea behind RAID: Redundant Array of Inexpensive Disks...

You might want to look at this... http://hardforum.com/showthread.php?t=1285254

btw, I am currently working on recovering an 8 bay drobopro that's in a reboot loop...

The reboot loop may be a bad unit. I just had a first edition 8 bay replaced free even though it was way out of warranty. Call them and ask nicely.

TLER really isn't a problem with software raid which the drobo and all other home NAS's that I know of use.

It may very well be the issue: if a disk takes a long time to recover from errors, even a software RAID will throw the disk out, or risk becoming the problem itself. TLER will limit the damage.

Of course, if the software were smart enough to handle this properly and recover automatically, it would not need to drop the disk completely from the array.

> RAID-5 (or single parity RAID of any kind) is obsolete, period.

RAID-6 offers different compromises relative to RAID-5 (for one, twice the parity space), so it isn't quite like one is the successor of the other. And once you're talking about multiple disk failures, you're at the existential point where you should probably be talking about whole array failures (e.g. your controller has quietly been writing junk for the last hour), and how to deal with that scenario.
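How much that extra parity space costs depends a lot on how wide the set is. A quick Python sketch, assuming equal-size disks and ignoring hot spares and metadata (the function name is just for illustration):

```python
def usable_fraction(n_disks: int, parity_disks: int) -> float:
    """Fraction of raw capacity left for data in a parity RAID set."""
    if n_disks <= parity_disks:
        raise ValueError("need more disks than parity disks")
    return (n_disks - parity_disks) / n_disks


for n in (4, 8, 12):
    r5 = usable_fraction(n, 1)  # RAID-5: one disk's worth of parity
    r6 = usable_fraction(n, 2)  # RAID-6: two disks' worth of parity
    print(f"{n} disks: RAID-5 {r5:.0%} usable, RAID-6 {r6:.0%} usable")
```

On a 4-disk set the second parity disk costs a quarter of the raw capacity; on a 12-disk set it is closer to a twelfth.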

>it isn't quite like one is the successor of the other.

Given the current price of hard drives, I don't get how "twice the parity space" can even matter. Furthermore, modern RAID controllers perform almost exactly the same using RAID-5 or RAID-6 (verified on most 3Ware, LSI, Adaptec and Areca controllers).

So yes, RAID-6 definitely is RAID-5 successor.

> how to deal with that scenario.

RAID is not an alternative to backup and never was. You deal with that scenario through proper backup or replication.

RAID-6 is in no universe a RAID-5 successor. Simplifying enterprise needs, risk tolerances and compromise acceptance is sophistry. It is telling enough that despite the bluster of some on Hacker News, major storage vendors (ergo, people who know much more than you) still make RAID-5 the default. Maybe they just haven't read the news.

Regarding the backup -- yeah, no kidding. That was the point. If the argument is "this is better because it can accept one more of countless possible failure modes", then "better" can continue indefinitely (why not 10 parity copies?) In the real world of compromise considerations there is a benefit return assessment that draws a line at a probability point.

It also sounds like many on here think you buy a box of disks and then make one universal logical volume on it (e.g. "if you have a spare why not just make it RAID-6?"). Because the spare(s) are usually universal, and you have many logical volumes encompassing RAID-10, 0, 5, 6, whatever the situation calls for.

Apparently you're new to the web forums and aren't aware that different posts from different people may display only partial views of varying opinions, conflating a couple of my answers to different questions as one happily, and putting in someone else's answers with it for good measure. I, for instance, made no comment in this thread about proper hot spare policy.

About RAID-5: some major storage vendors still use it for smaller arrays. Some others don't use it anymore (NetApp, DDN come to mind). The notion that RAID-5 isn't fit for arrays of large capacity drives doesn't come from me and is hardly new. You don't need any links as obviously you know all about this already.

I set up my first terabyte SAN in the 90s, back when it used to fill a whole rack, 9 GB Micropolis drives were hot and SSA was the new interconnect, but I probably know less about storage than you.

Ah, the grizzled vet angle. The bit about me being new to the web is particularly adorable, especially given that I prefaced it by referring to other people (ergo, there was no confusion). When you have many logical volumes suddenly you can make choices like "does this necessitate the extra protection of RAID-6, given the compromises"? And people are making that choice to this day, and no one is saying "Oh look, there's RAID-6 which is the newer version of RAID-5 so it's my default choice".

I'm going to be blunt for a moment. If you are not using ZFS, you deserve what you get.

As the author realizes, hardware RAID, or naive software RAID, is becoming more and more useless given the size of volumes and the bit densities (and thus error rates) of those drives.

The only solution to this is a proper file system and volume manager that can proactively discover bit rot and give you time to do something about it. At the moment, the only real solution is ZFS.

ZFS is great, possibly the closest to perfection available at any price today.

But there's a word beginning with "O" and ending with "racle": they are so focused on the short term buck that they are massacring their potential revenues with their short sighted approach of keeping Solaris out of everyone's hands.

... which is why you should be using FreeBSD.

For several reasons (now including, cough, Unity, cough), I wish I had built my in-basement cluster with FreeBSD rather than Ubuntu. Likely I will move the boxes over one-by-one.

The Linux answer to ZFS is btrfs, and it's almost ready to go.

What's it waiting on, and who other than Val is working on it?

Answering my own question: http://git.kernel.org/?p=linux/kernel/git/mason/btrfs-progs....

... and Valerie Aurora's less active on that than I'd thought. Hrm.

Call me when btrfs has parity.

That's only true for open source implementations (if you can call ZFS that, the only really decent implementation is still locked up inside Oracle). Netapp's WAFL (which ZFS copied) is the original implementation of a RAID system that is resilient to bit rot and the other array vendors (EMC, HDS etc) all have similar systems.

What features are open source ZFS implementations (ZFS v28) lacking compared to the one on Solaris, aside from encryption? I run a 6-drive raidz2 array on my FreeBSD workstation, and I put it through its paces, and it's pretty awesome. Rock solid stable, in my experience.

I hear TRIM support for ZFS is now in the HEAD branch and in the works for FreeBSD 10.

ZFS is beautiful and wonderful, but my perhaps outdated understanding is that the Linux port isn't as stable or mature as it is elsewhere. Is that not true? Or is it that the advantages of ZFS are so great that people should switch to illumos or some BSD?

I haven't used the Linux version in a while, but as it's not a first class citizen of the kernel, I don't think it will ever be the same as a BSD implementation. In my opinion, if you're doing anything with storage, you'd be crazy not to use ZFS, even if that means learning a new OS/kernel.

> the Linux port isn't as stable or mature as it is elsewhere.

Which one: the FUSE one, or the native ZFS on Linux under CDDL?

I haven't used the FUSE one extensively but the native one kept leaking memory until the system was exhausted. (Specifically it did this whenever I messed with a large number of files, I'm sure it works fine for media storage as StavrosK says.)

I'm using zfsonlinux on Ubuntu, and so far (months) it has been solid. Granted, it's just for media storage, but it's fine.

How does ZFS discover bit rot if you don't read the bits? I have mine scrubbing once a week, is that unnecessary?

ZFS auto-heals on read so the files you touch regularly are fine. You do need to scrub regularly to prevent bit-rot across all your files. I go with once-a-week as well.

so how is that better than raid? it sounds identical to what is described in the article - the problem was not scrubbing (afaict).

ZFS is at the file level, not the disk level. So recovering from errors (i.e. rebuilds) is MUCH faster.

Only when the array is mostly empty. For several years hardware RAID controllers have been rebuilding only used space, too. Really, ZFS isn't that much of the miracle some want it to be.

How do they know what space is used or unused?

Apparently they keep a block list somewhere.

but that's not what we were talking about! no-one was saying "zfs sucks as much as raid but at least it rebuilds faster afterwards". the implication was that zfs avoided the problem in the article (when, it seems, both zfs and raid need to be scrubbed, and both avoid the problem when that is done).

If you follow the advice in this paper[1] you will be measuring media errors in your drives. That means re-reading all data every N days, even archived data. Without periodically re-reading and validating (checksumming) the data you can't tell if it has rotted in place. Since the distribution of errors over drives is very exponential you should then pro-actively remove the worst drives in your system. That will avoid an accumulation of errors and sudden multiple drive failure as described here.

Durability is like a diamond: it is forever.

1. http://research.google.com/pubs/pub32774.html
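The "re-read and validate" idea can be sketched in a few lines of Python. This is only a toy illustration of checksum scrubbing on an in-memory buffer, not how any particular filesystem or array does it; the function names and the 4 KiB block size are assumptions.

```python
import hashlib


def checksum_blocks(data: bytes, block_size: int = 4096) -> list[str]:
    """Return a SHA-256 digest for each fixed-size block of data."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def scrub(data: bytes, expected: list[str], block_size: int = 4096) -> list[int]:
    """Re-read every block and return the indexes whose checksum changed."""
    current = checksum_blocks(data, block_size)
    return [i for i, (now, then) in enumerate(zip(current, expected)) if now != then]


# Simulate archived data, record checksums, then flip one bit "in place".
original = bytes(range(256)) * 64          # 16 KiB, four 4 KiB blocks
recorded = checksum_blocks(original)
rotted = bytearray(original)
rotted[5000] ^= 0x01                        # byte 5000 lives in block 1
print(scrub(bytes(rotted), recorded))       # -> [1]
```

The point is that the flipped bit is only found because every block gets read again; data nobody touches stays unverified until a scrub reaches it.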

Diamonds burn just as well as any other carbon does, once you get them hot enough.

De Beers probably like this usage of diamonds.

I have had a multi-disk failure occur with a RAID-1 setup. Server was pre-built from a large vendor and worked fine until both disks failed at the exact same time (within minutes).

Took the disks out to find that they had sequential serial numbers.

Called vendor for replacement only to have them tell me that they had issues with that batch, yet did not make any attempt to inform me.

Spent the day restoring from tape backup.

TLDR: If you buy a pre-built server check that the disks aren't all from the same batch.

This is a problem even if you don't buy pre-built. You're going to be buying similarly specced drives at similar times and you're probably buying from vendors from the same rough geographical area so chances are you're buying drives from the same batch anyway.

It used to be worse: all the drives in a RAID setup had to have the exact same specifications or the thing wouldn't work, which pretty much guaranteed near simultaneous failure of multiple drives, but even today, with somewhat more flexible software raid setups, it's still a problem.

At a place I used to work we used to joke that a drive failure warning from a RAID controller was nothing more than a signal to get out the backup tapes and start building a new server.

I also had a RAID-5 array fail due to 3/4 drives failing near-simultaneously. All of the drives were from the same batch. Some months later, a friend came to me with a computer problem. Her drive had failed. I was able to take a look at the drive and, to my amazement, the drive was from the same batch! Based off of a wild hunch, I swapped the controller board from my one remaining good drive into my friend's drive. The drive worked fine and I was able to recover all her data! It is interesting that the drive failures were likely due to the fact that the drives were from the same batch, but that same fact probably also allowed me to seamlessly swap the controller boards!

One thing I learnt early on in my career as a sysadmin is that disk quality is very important, and so is the quality of the RAID controller or software RAID subsystem. After you have a multiple drive failure on a supposedly safe RAID-1, and get forced into stitching it back into operation with a combination of "badblocks" and "dd", you'll quickly understand why...

A good RAID controller won't let a drive with bad sectors continue to operate silently in an array. Once an unreadable sector is detected, the drive is failed immediately, period.

The problem is in the detection, but good RAID controllers "scrub" the whole array periodically. If they don't, or if you are paranoid like me, the same can be accomplished by having "smartd" initiate a long "SMART" self-test on the drives every week.

Good controllers will even fail drives when a transient error happens, one which triggers a bad block reallocation by the drive, for example. This is what makes some people fix failed drives by taking them out and putting them back in. After a rebuild the drive will operate normally without any errors, but you are putting yourself at a serious risk of it failing during a rebuild if another drive happens to fail, so DON'T do this.

Some others will react differently to these transient errors. EMC arrays, for instance, will copy a suspicious drive to the hot-spare and call home for a replacement. This is much faster than a full rebuild, but also much safer because it doesn't increase the risk of a second drive failing while doing it.

Oh, and did I mention that cheap drives lie?

Avoid using desktop drives on production servers for important data, even in a RAID configuration, if you don't have some kind of replicated storage layer above your RAID layer (meaning you can afford recovering one node from backup for speed and resync with the master to make it current).

Your advice is OK for someone who is willing to take no risks and to spend the money on that. It is not strictly correct for all situations. In fact storage arrays are not likely to drop a disk on the first medium error, since medium errors are a fact of life and do not necessarily indicate a bad disk. Of course, a medium error warrants longer-term monitoring to make sure such errors are not recurring too often on a specific drive; that would be a cause for concern, but a single medium error is of no real significance.

I also found that higher-end drives lie. I used SAS nearline drives that failed easily and often, and standard SATA drives that were more resilient. It depends on the vendor and make. It may also depend on the batch, but I never found proof of that in my work.

Maybe I was wrong in using the term "transient error"...

A bad block reallocation can be seen as a transient error from the controller's perspective, but it isn't silent provided the drive doesn't lie about it (and one would expect that a particular storage system vendor doesn't choose - and brand - drives that lie to their own controllers).

The storage system may ignore medium errors that force a repeated read (below a certain threshold), but they shouldn't ignore a medium error where the bad sector reallocation count increases afterwards (which is just another medium error threshold being hit, this time by the drive itself).

I'm not saying that higher-end drives are more reliable or not. Given that most standard SATA errors go undetected for longer, one could even argue that higher-end drives seem to fail much more frequently... I've had more FC drives replaced in a single EMC storage array than in the rest of the servers (which have a mix of internal 2.5in SAS and older 3.5in SCSI320 drives), and we certainly replace more drives in servers than desktops.

But that's another topic entirely.

Having worked at NetApp for a few years at the turn of the century this is "old" news :-) But it is always important to internalize this stuff. M3 (or 3x mirrors) is still computationally the most efficient (no compute just I/O to three drives), R6 is the most space efficient at the expense of some CPU. Erasure codes are great for read only data sets (they can be relatively cheap and achieve good bandwidth) but they suffer a fairly high I/O burden during write (n data + m code blocks for one read-modify-write).

Bottom line is that reliably storing data is more complicated than just writing it on to a disk.

The author is wrong that "enterprise quality disks just plain last longer". CMU did a study on this topic on a population of 100 thousand drives, and found that enterprise-grade drives do not seem to be more reliable than consumer-grade drives. See the conclusion in: http://www.cs.cmu.edu/~bianca/fast07.pdf This legend must die.

The author is also wrong when saying "a non-recoverable read error [is] a function of disk electronics not a bad block". An NRE can happen for different reasons, one of them being when (data and error-correction) bits in the block get corrupted in a way that prevents the error-correction logic from detecting this error. So the block is technically bad, just not bad enough to cause the drive logic to declare it as a read failure.

The CMU study is most probably flawed, as it looked at hardware replacement records and didn't take into consideration the different usage and replacement thresholds of enterprise and consumer drives. Most enterprise drives are used in enterprise servers and storage systems that monitor drive errors closely using SMART. The threshold for drive errors is much lower with such systems and drives are replaced quickly. My company (a storage system vendor) replaces a disk as soon as SMART warns of impending failure and doesn't wait for actual failure. This will come across in the hardware replacement logs as more frequent replacement of enterprise disks. Consumer disks are used in consumer systems. Consumers don't proactively replace disks; they wait until a disk actually fails. This will show up in the hardware replacement logs as less frequent replacement of consumer disks.

I have actually used 'defective' enterprise disks in consumer systems for years after they were labeled defective by storage system vendors. About a decade ago, I used to buy such defective enterprise disks in bulk at auction from server and storage manufacturers and sold them as refurbished disks to consumers after testing.

I fail to see your point about the threshold of replacement. Assuming that enterprise-class drives get replaced sooner because sysadmins monitor SMART, it is still widely acknowledged that SMART errors are a strong indicator that the drive will fail soon. For example the Google study on drive reliability showed this correlation on consumer-class drives [1]. There is no reason to believe this correlation doesn't exist with enterprise-class drives (or else, what would be the point of SMART?). Therefore the replacement threshold is mostly irrelevant as the enterprise drive replaced due to SMART would have failed soon anyway.

I really don't understand this skewed perception of consumer- vs enterprise-grade harddrives. Do you believe that enterprise CPUs are more reliable than consumer CPUs? How about enterprise NICs vs consumer NICs?

Consumer-grade drives are sold in volumes so much larger than enterprise-grade drives, that vendors have strong incentives to make them as reliable as possible. I would even say they have incentives to make them more reliable than enterprise-grade drives. Because a single percentage point improvement in their reliability will drastically reduce the costs associated to warranty claims and repairs.

My own experience confirms the CMU study. I have worked at 2 companies, each selling about 2-5 thousand drives as part of appliances to customers across the world. One company was using SCSI drives, the other IDE/SATA. And the replacement rates were similar.

I can see your point about the usage being different, which could invalidate the CMU findings about consumer vs enterprise drive reliability. But I don't personally believe it explains it. The CMU study + my anecdotal evidence on 2-5 thousand drives + the fact that no study has ever shown data suggesting enterprise drives are more reliable, makes me think that they are not.

[1] http://static.googleusercontent.com/external_content/untrust...

I have been doing RAID data recovery for many years. A very common scenario is "Two Drives Failed at Once". This is usually not the case. What usually happens is 1 drive fails. The RAID then goes into degraded mode and continues to function. Nobody notices the warnings. Some time later... months or years even, a second drive fails, the RAID goes down, and now they notice. They call in the techs who declare 'two drives failed'. This is when your data is most at risk. People start swapping boards and drives, repower the system, rebuild drives, rebuild parity, force stale drives online etc. I have seen a lot of RAIDs that would have been recoverable had they followed the proper steps. Then they hand it over and say they found it this way... didn't touch a thing... www.alandata.com

Would have upvoted if not for the web link at the end of the comment.
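On the "nobody notices the warnings" scenario above: for Linux software RAID, a degraded array is visible in /proc/mdstat, so a check can be scripted. A minimal Python sketch follows; the sample text is a made-up but typical mdstat layout, the function name is hypothetical, and hardware controllers need their vendor's tools instead.

```python
import re

ARRAY_RE = re.compile(r"^(md\w+)\s*:")
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text: str) -> dict[str, str]:
    """Map each degraded md array to its member status string, e.g. 'UU_U'."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        m = ARRAY_RE.match(line)
        if m:
            current = m.group(1)
            continue
        s = STATUS_RE.search(line)
        if s and current:
            if "_" in s.group(3):  # "_" marks a missing or failed member
                degraded[current] = s.group(3)
            current = None
    return degraded


SAMPLE = """\
Personalities : [raid1] [raid5]
md0 : active raid1 sdb1[1] sda1[0]
      976630336 blocks super 1.2 [2/2] [UU]

md1 : active raid5 sdf1[3] sde1[2] sdc1[0]
      2929890816 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [U_UU]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))  # on a real box: open("/proc/mdstat").read()
```

Run from cron and mail yourself whenever the result is non-empty, and the first failure stops being silent.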

Let us not forget the little considered other cause which usually takes out your tape infrastructure as well: fire.

One outfit I worked at decided to stick a brand new APC UPS in the bottom of the rack as it was in their office. It promptly caught fire and burned the entire rack out. The fire protection system did fuck all as well other than scare the shit out of the staff. Scraping molten cat5 cables off with a paint scraper was not fun.

Fortunately it was all covered by the DR procedure. Tip: write one and test it. That's more important than anything.

Do you know what caused the fire? Improper installation? Overheating? Defective equipment?

Defective APC charge regulator apparently. To be honest they were good and their insurance paid out very quickly to both us and the site owners.

And this is why, everywhere I've ever worked, I've had to say RAID is NOT backup. There are varying degrees of receptiveness to this, because actual backups are a giant pain in the ass, and have a very annoying lifecycle.

There are better reasons for RAID not being backup than this obscure rare fault condition.

Another way to think about this problem is to turn it on its head and assume that hard drives will fail and make it a strength and not a weakness. Something like OpenStack Storage[1] is built around the idea that hard drives are transient and replacing them should be painless. In fact, the more drives you have the less problems you have.

Basically, you keep multiple copies of the same data across different clusters of hardware. If a drive or two (or ten) go bad, just replace them; there is no rebuild time. Sure, it costs some disk space to keep n copies of data, but drives are just getting cheaper and there are de-duplication schemes being developed to help with this. It's not like RAID-6 is super efficient either.

Just my two cents...

[1] http://www.openstack.org/software/openstack-storage/

isn't this what raid scrubbing is for? http://en.gentoo-wiki.com/wiki/Software_RAID_Install#Data_Sc...

    for raid in /sys/block/md*/md/sync_action; do
        echo "check" > "${raid}"
    done
does that fix the issue? i run that once a week. i thought i was ok. am i not? if i am, isn't this old news?

yes, you should be fine. AFAIK most distros set up cronjobs for this automatically (for example, Debian/Ubuntu runs that once a month).
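A "check" pass only helps if someone looks at the result afterwards. As a rough sketch, assuming Linux md: the kernel exposes a mismatch_cnt file next to sync_action, which can be read once a check has run. The helper name report_mismatches and its optional root argument are just for illustration:

```sh
# Print "<array> <mismatch count>" for every md array.
# The optional argument overrides the sysfs root (handy for trying it on a copy).
report_mismatches() {
    root="${1:-/sys/block}"
    for f in "$root"/md*/md/mismatch_cnt; do
        [ -e "$f" ] || continue          # no md arrays: the glob stays unexpanded
        dev=${f#"$root"/}                # strip the root prefix -> md0/md/mismatch_cnt
        dev=${dev%%/*}                   # keep the first path component -> md0
        printf '%s %s\n' "$dev" "$(cat "$f")"
    done
}
```

A count that stays non-zero or keeps growing between checks is probably worth a closer look at the member disks.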

I am going to ask an honestly stupid question. What is going to happen to ZFS? Sorry if this is slightly off topic, although the comments have already started discussing it.

The open source version of it, or the BSD version of it, is only up to v28. It seems that after that Oracle is no longer putting out updates as open source, so what will happen? Disparity between the Oracle version and the BSD version? And are features still being developed? Most of the limitations listed on the wiki haven't changed at all over the past years and are still listed as under development.


Raid Is Not A Backup!

Oh, and your raid controller should monitor for SMART errors, and you should seek to replace disks when you start seeing sector rewrites.

Not very efficient but I try to avoid placing all volumes on a single raid set.

I'm not a Un*x sysadmin at all and don't know much about hard drives: I'm just a software dev.

But from the beginning of TFA, after reading this:

"Bad blocks. Two of them. However, as the blocks aren't anywhere near the active volumes they go largely undetected."

The FIRST thing that came to my mind was: "What!? Isn't that a long-solved problem!? Aren't disks / controllers / RAID setups much better now at detecting such problems right away?"

I've got a huge issue with the "largely undetected". I may, at some point, need storage for a gig I'm working on. And I certainly don't want problems like that to go "largely undetected".

So quickly skipping most of the article and going to the comments:

"It's worth pointing out that many hardware RAID controller support a periodic "scrubbing" operation ("Patrol Read" on Dell PERC controllers, "background scrub" on HP Smart Array controllers), and some software RAID implementations can do something similar (the "check" functionality in Linux md-style software RAID, for example). Running these kinds of tasks periodically will help maintain the "health" of your RAID arrays by forcing disks to perform block-level relocations for media defects and to accurately report uncorrectable errors up to the RAID controller or software RAID in a timely fashion."

To which the author of TFA himself replies:

"Yes, that is something I should have made clearer. This is the very reason that RAID systems have background processes that scan all the blocks."

Which leaves me all a bit confused about TFA, despite all the shiny graphs.

Basically, I don't really understand the premise of "bad blocks going largely undetected" in 2013...

I dealt with this exact problem for a number of years. Background scrubbing takes away I/O resources and can be a disaster for your workload if you rely on sequential reads/writes. For that reason, most controllers are configured by default to only scrub when the disk is totally idle, which is never. Even if the controller had a better definition of idle, scrubbing an entire disk to find those rotten bits would take a long, long time; a disk would almost certainly fail before that.

I use the built in SMART full disk check. It's quite good at only reading when the disk is idle, and it checks the entire disk.

A quick self test every day for all disks, and a long (i.e. full read) self test once a week.

The RAID is then checked on top of that once a month (although that slows things down a bit).
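The daily-short, weekly-long schedule can be sketched as a small script run once a day from cron. This assumes smartmontools is installed (smartctl -t short and -t long start the drive's own self-tests); the function names and the choice of Sunday for the long test are arbitrary picks for the example:

```sh
# Pick the SMART self-test type for a given day of the week (1=Mon .. 7=Sun).
# Sunday as the "long test" day is an arbitrary choice for this sketch.
smart_test_type() {
    if [ "$1" -eq 7 ]; then echo long; else echo short; fi
}

# Start the chosen self-test on every disk passed as an argument.
run_smart_tests() {
    kind=$(smart_test_type "$(date +%u)")
    for disk in "$@"; do
        smartctl -t "$kind" "$disk"
    done
}

# Example (from a daily cron job): run_smart_tests /dev/sda /dev/sdb
```

smartd can also schedule self-tests itself through its configuration file, which may be the tidier route if it is already running.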

With sufficient redundancy available, could you temporarily take a drive out of the RAID for scrubbing, and then add it back in when you're done, to avoid conflicting with ongoing work and destroying linear access patterns?

The rebuild would be worse than the scrubbing.

A better plan is to light up your disaster recovery plan weekly, and while the DR system is handling the load, scrub to your heart's content on the down system.

Depending on the cost of your hardware vs the cost of your labor vs the cost of downtime, dual servers, one flagged as production and one flagged as development, with the flags swapped every weekend, might work out. You'll hear lots of bragging about that not being possible because the hardware is too expensive, not so much bragging about labor cost and downtime cost. I worked at a financial services corp about two decades ago where downtime supposedly cost in excess of $1M/hr. They had triple mainframes set up, basically three machine rooms inside the machine room.

The disks themselves (new ones at least) have a background scan process (called BMS or BGMS) which helps considerably. The one thing the disk can't do by itself is correct unrecoverable errors, since by definition the disk can't recover from them :-)

The combination of BMS and disk scrubbing at the RAID level should handle almost all of the issues that are pointed out by the original post.

RAID scrubs can and do take a long time to complete, though. Depending on the performance impact that you are willing to suffer on a continuous basis, it can take a week or two to perform proper scrubbing.

Proper scrubbing would include not just reading the RAID chunk on a disk but also reading the associated chunks from the other disks and verifying that the parity is still intact. In RAID5 you will not be able to recover if the parity check fails without any disk reporting an error, as you won't know which chunk has gone bad.
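The "you won't know which chunk" point follows from how single parity works: the parity chunk is the XOR of the data chunks, so XOR-ing the whole stripe gives zero when it is consistent. A toy sketch with one-byte "chunks" (the values and the function name stripe_syndrome are made up for illustration):

```sh
# XOR all arguments together; for a healthy RAID5 stripe (data chunks + parity)
# the result is 0.
stripe_syndrome() {
    acc=0
    for v in "$@"; do
        acc=$(( acc ^ v ))
    done
    echo "$acc"
}

# Three data "chunks" (one byte each, to keep it readable) and their parity.
d1=90; d2=60; d3=240
p=$(stripe_syndrome "$d1" "$d2" "$d3")      # parity = d1 ^ d2 ^ d3

stripe_syndrome "$d1" "$d2" "$d3" "$p"      # healthy stripe
stripe_syndrome "$d1" 61    "$d3" "$p"      # one bit flipped in a data chunk
stripe_syndrome "$d1" "$d2" "$d3" 151       # same bit flipped in the parity chunk
```

The last two calls print the same non-zero value: the check proves the stripe is inconsistent but says nothing about which chunk is wrong. When a disk reports a read error it identifies itself, which is why that case can be rebuilt from the remaining chunks.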

I've been coding such systems for a while now and, as a shameless plug, would point to http://disksurvey.com/blog/. If there are things of interest, I'd be happy to take requests and write about them as well.

I have a home server with three disks and ZFS, for my photos and things, so I'm not an expert. However, Ubuntu's md-raid includes scrubbing once a week by default, and I added scrubbing to my ZFS setup via crontab, again once a week (I'm not sure if ZFS does it automatically, but I don't think it does. I would appreciate a correction, if someone knows for sure).

The article assumes no scrubbing, which is a stupid thing to run without, as detailed in the article. So it's basically "why pointing a gun at your foot and pulling the trigger is bad", "because you're going to shoot yourself in the foot".

The article describes why scrubs don't happen often enough: it's slow and disruptive. I have a 3-way RAID-1 /home partition (long story) and it's checked on the first Sunday of the month. I always remember this because I can tell from the performance of my workstation that something is up with the disk. This is with operations like a single thread running "ls". If you're running a production service, you're also going to notice, and you're also going to have more than 3TB of drive to scan. That makes running regular scans rather difficult.
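For a sense of scale, here is a back-of-the-envelope sketch of how long one full read takes. The read rates are assumptions picked for illustration, not measurements, and the function name scrub_minutes is made up:

```sh
# Minutes needed to read a whole array once.
#   $1 = capacity in MB, $2 = sustained read rate in MB/s (an assumed figure)
scrub_minutes() {
    echo $(( $1 / $2 / 60 ))
}

scrub_minutes 3000000 100    # 3 TB at 100 MB/s, with the array otherwise idle
scrub_minutes 3000000 20     # same array, scrub throttled to 20 MB/s
```

Under those assumptions the first case is a bit over 8 hours and the throttled one is well over a day and a half, which is roughly the trade-off between hurting the workload and never finishing.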

You may add something like this to /etc/periodic.conf:

    daily_status_zfs_enable="YES"
    daily_scrub_zfs_enable="YES"
    daily_scrub_zfs_default_threshold="6" # in days

and it will scrub the pools every 6 days (and send you a report in the daily run output).

Very nice, thank you, I will try that. I am rather dismayed, however, by learning in this thread that my disks have 4K sector sizes and ZFS autodetected 512 bytes, which means I'll have to destroy the pool and recreate it...

It happens all the time...

If you run "camcontrol identify ada0" (or whatever your device is) you can find out before it is too late:

sector size logical 512, physical 512, offset 0

This is from a lucky drive of course :)

Hmm, there's no such command in Ubuntu, maybe it's from BSD?

camcontrol is from FreeBSD.

I don't have a Linux box available right now but maybe "hdparm -I" does something similar: "request identification info directly from the drive".

Yep, that works:

    Logical  Sector size:                   512 bytes
    Physical Sector size:                  4096 bytes
I'm guessing that's not very nice. My ZFS pool was created with ashift 9 (this is 2^9=512 bytes), when it should be 12 (2^12=4096). I will have to copy everything off and back on again.

For everyone who wants to check, and because I couldn't find info on it, run:

    zdb | grep ashift
And see if it's 9 or 12.
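To know what value to expect from zdb, the ashift can be derived from the drive's physical sector size. A small sketch, assuming hdparm output in the format shown above; the function names and /dev/sda are placeholders:

```sh
# Pull the physical sector size (in bytes) out of `hdparm -I` output on stdin.
physical_sector_size() {
    awk '/Physical Sector size/ { print $(NF-1) }'
}

# ashift is log2 of the sector size: 512 -> 9, 4096 -> 12.
ashift_for() {
    size=$1
    shift_bits=0
    while [ "$size" -gt 1 ]; do
        size=$(( size / 2 ))
        shift_bits=$(( shift_bits + 1 ))
    done
    echo "$shift_bits"
}

# Example: ashift_for "$(hdparm -I /dev/sda | physical_sector_size)"
```

If the number printed here is larger than the one zdb reports, the pool was created with sectors smaller than the drive's physical ones.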

ZFS doesn't do it automatically, you have to crontab it. I have it crontabbed for the 1st or the 15th.

On 3ware (LSI) controllers you can set up a 'verify' task to run on a schedule. I do believe it is on by default, but I could be wrong. It is good to tell admins about this, since an uninformed one may turn the setting off without realizing what can happen if it is disabled.
