
How multi-disk failures happen - Serplat
http://sysadmin1138.net/mt/blog/2012/12/how-multi-disk-failures-happen.shtml
======
VLM
My favorite RAID 5 failure mode is when an old hardware card fails and you have
no stock spare, the maintenance contract was not renewed years ago, the cost of
a new card is too much to be expensed, and buying one as a capital replacement
will take a week or two minimum to be approved. And/or the card has been
discontinued, so you have to buy a used one from a shady foreign surplus
dealer. I've seen too much of this type of thing... I can tolerate the
(minimal) cost of software RAID, but I can't survive the possible downtime of
hardware RAID, so it's been software RAID for me for pretty much the last
decade.

Another fun one was the quad redundant power supply with all four plugs going
into the same power strip.

And there was the power supply that blew out every drive in the box
simultaneously. I suppose it's not any worse than a lightning strike, except
that the onsite tech assumed it was a cooling failure, so he replaced the
dusty fans and all the drives, thereby destroying an entire new set of drives
(and fans, I would guess) on powerup.

Then there was the poison drive tray that bent the backplane pins in every slot
you jammed it into. That turned into a huge, expensive disaster.

~~~
akg_67
You don't need to replace an old/discontinued RAID card with a used one
(assuming you mean the same make and model). A new card from the same
manufacturer will work: the RAID card writes DCBs to the disks, and the
replacement card reads them when it is swapped in.

~~~
CrLf
"The RAID card writes DCBs on disks that are read by the new RAID card upon
replacement."

Which also makes for very nice RAID failures, like this one that has happened
to me on an HP controller:

A drive fails because of some SCSI electronics problem, and when you replace
it, the controller gives the new drive a different SCSI ID. The controller maps
RAID arrays to drives by SCSI ID, so it is now impossible to add the
replacement drive to the degraded array: SCSI IDs on these controllers aren't
user-definable, and the controller doesn't allow a degraded array to be
modified.

And since the controller has by now happily written its configuration onto the
drives, it doesn't matter how much you shuffle the drives around to try to
force the controller into giving up its internal configuration.

Oh, and the controller is an onboard controller, so you can't just replace it
with another one (which would also read the configuration on disk and put
itself in the same stupid state, I suppose).

------
wazoox
Using RAID-5 is the primary error here. RAID-5 (or single-parity RAID of any
kind) is _obsolete_ , period. The story here doesn't ring true to me, to be
honest; I'm currently herding several hundred multi-terabyte servers, and
multiple drive failures appear in one and only one case: when using Seagate
Barracuda ES.2 1TB or WD desktop-class drives. Those are the two really
problematic setups. In all other cases, use RAID-6 and all will be well.
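
If you're on Linux software RAID, creating a RAID-6 array is a one-liner; a
minimal sketch with mdadm (device names and member count are placeholders,
adjust for your hardware):

    
    
        # 8-member RAID-6 array: any two drives can fail without data loss
        mdadm --create /dev/md0 --level=6 --raid-devices=8 /dev/sd[b-i]
        # watch the initial sync progress
        cat /proc/mdstat
    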

I'd add that current "Advanced Format" drives are tremendously better than most
older drives. If your drive is an older 1 or 2 TB (512-byte sector) drive, use
it only for backups or other menial duty with unimportant data.

~~~
Ologn
Yes. Using RAID5 instead of RAID 1, 10, or 0+1 was a decision to cut costs.
With mirroring, the mirror drives would simply take over when a disk failed.

Also, there is only one hot spare in the set. Another cost saving.

Yet another decision: RAID5 across 7 disks plus a hot spare, instead of across,
say, 5 or 6 disks. That's two more chances for a disk to go bust and have to be
rebuilt from parity.
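
To put rough numbers on that last point (back-of-the-envelope, assuming 1 TB
drives and the commonly quoted consumer-class unrecoverable-read-error rate of
1 per 10^14 bits read): rebuilding a degraded 7-drive RAID5 means reading the 6
surviving terabytes in full, about 4.8 x 10^13 bits, so the expected number of
UREs hit during the rebuild is about 0.5 and the chance of hitting at least one
is roughly 1 - e^-0.48, close to 40%. The same arithmetic for a 5-drive set
gives closer to 27%, so every extra member adds measurable rebuild risk.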

What if the disks are OK but the server's host adapter card gets fried? Or the
cable between the server and the array? Some disk arrays allow redundant access
paths to the array, and some OSes can handle that failover.

Before I read the article, I thought it might discuss heat. Excessive heat is
usually the cause when disk arrays start melting down one after another.
Usually the meltdown happens in an on-site server closet/room that was never
properly set up for running servers 24/7. The straw that breaks the camel's
back is usually a combination of more equipment being added and hot summer
days. Then portable ACs are purchased to mitigate things temporarily, but if
their condensate reservoirs are not emptied regularly, they stop helping. This
situation occurs more often than you would imagine; luckily I have not been the
one who had to deal with it every time I have seen it (although sometimes I
have). Usually the servers involved are non-production ones which don't make
the cut to go into the data center.

The heat problem happens in data centers as well, believe it or not. A cheap
thermometer is worth buying if you suspect too much heat around your servers.
There the problem is usually less severe, but if the general data center
temperature runs a few degrees higher than it should, it still leads to more
equipment failure.

~~~
mprovost
Hard drives are pretty resilient to high temperatures. Google did a
reliability analysis of thousands of hard drives and found:

"Overall our experiments can conﬁrm previously reported temperature effects
only for the high end of our temperature range and especially for older
drives. In the lower and middle temperature ranges, higher temperatures are
not associated with higher failure rates. This is a fairly surprising result,
which could indicate that datacenter or server designers have more freedom
than previously thought when setting operating temperatures for equipment that
contains disk drives. We can conclude that at moderate temperature ranges it
is likely that there are other effects which affect failure rates much more
strongly than temperatures do."

<http://research.google.com/archive/disk_failures.pdf>

~~~
DanBC
I'd love to see the same research for SSDs.

Also, the parent post is talking about the higher end of the temperature range,
not the middle.

------
insaneirish
I'm going to be blunt for a moment. If you are not using ZFS, you deserve what
you get.

As the author realizes, hardware RAID, or naive software RAID, is becoming
more and more useless given the size of volumes and the bit densities (and
thus error rates) of those drives.

The only way out is a proper file system and volume manager that can
proactively discover bit rot and give you time to do something about it. At the
moment, the only real option is ZFS.
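
For the curious, a quick sketch of what that looks like in practice with ZFS
(the pool name "tank" is just an example): a scrub re-reads and verifies every
block against its checksum, and status lists anything it found damaged.

    
    
        # re-read and checksum-verify everything in the pool
        zpool scrub tank
        # check progress and any detected corruption (CKSUM column, affected files)
        zpool status -v tank
    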

~~~
scarmig
ZFS is beautiful and wonderful, but my perhaps outdated understanding is that
the Linux port isn't as stable or mature as it is elsewhere. Is that not true?
Or is it that the advantages of ZFS are so great that people should switch to
illumos or some BSD?

~~~
lloeki
> _the Linux port isn't as stable or mature as it is elsewhere._

Which one: the FUSE one, or the native ZFS on Linux under CDDL?

~~~
Dylan16807
I haven't used the FUSE one extensively, but the native one kept leaking memory
until the system was exhausted. (Specifically, it did this whenever I messed
with a large number of files; I'm sure it works fine for media storage, as
StavrosK says.)

------
SpikeGronim
If you follow the advice in this paper [1], you will be measuring media errors
on your drives. That means re-reading all data every N days, even archived
data. Without periodically re-reading and validating (checksumming) the data,
you can't tell whether it has rotted in place. Since the distribution of errors
across drives is heavily skewed (roughly exponential), you should then
proactively remove the worst drives in your system. That avoids an accumulation
of errors and the kind of sudden multiple-drive failure described here.
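
A crude file-level version of that re-read-and-verify loop, assuming the data
under /data is archival and rarely changes (all paths here are illustrative):

    
    
        # build a checksum manifest once
        find /data -type f -print0 | xargs -0 sha256sum > /var/lib/scrub/manifest.sha256
        # every N days, re-read everything and report any file that no longer matches
        sha256sum --quiet -c /var/lib/scrub/manifest.sha256
    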

Durability is like a diamond: it is forever.

1\. <http://research.google.com/pubs/pub32774.html>

~~~
wiredfool
Diamonds burn just as well as any other carbon does, once you get them hot
enough.

~~~
lostlogin
De Beers probably likes this usage of diamonds.

------
sschueller
I have had a multi-disk failure occur with a RAID-1 setup. The server was pre-
built by a large vendor and worked fine until both disks failed at the exact
same time (within minutes of each other).

I took the disks out, only to find that they had sequential serial numbers.

I called the vendor for a replacement, only to have them tell me that they had
known issues with that batch, yet had made no attempt to inform me.

Spent the day restoring from tape backup.

TL;DR: If you buy a pre-built server, check that the disks aren't all from the
same batch.

~~~
mtts
This is a problem even if you don't buy pre-built. You're going to be buying
similarly specced drives at similar times, probably from vendors in the same
rough geographical area, so chances are you're getting drives from the same
batch anyway.

It used to be worse: all the drives in a RAID set had to have the _exact_ same
specifications or the thing wouldn't work, which pretty much guaranteed
near-simultaneous failure of multiple drives. Even today, with somewhat more
flexible software RAID setups, it's still a problem.

At a place I used to work, we joked that a drive-failure warning from a RAID
controller was nothing more than a signal to get out the backup tapes and start
building a new server.

------
CrLf
One thing I learnt early in my career as a sysadmin is that disk quality is
_very_ important, and so is the quality of the RAID controller or software RAID
subsystem. After you have a multiple-drive failure on a supposedly safe RAID-1
and get forced into stitching it back into operation with a combination of
"badblocks" and "dd", you'll quickly understand why...

A good RAID controller won't let a drive with bad sectors continue to operate
silently in an array. Once an unreadable sector is detected, the drive is
failed immediately, period.

The problem is in the detection, but good RAID controllers "scrub" the whole
array periodically. If they don't, or if you are paranoid like me, the same
can be accomplished by having "smartd" initiate a long "SMART" self-test on
the drives every week.
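
For example, a smartd.conf directive along these lines (the device name and
schedule are just an example) enables the default monitoring and kicks off a
long self-test every Sunday at 03:00:

    
    
        # /etc/smartd.conf -- schedule format is Type/MM/DD/d/HH (L = long self-test)
        /dev/sda -a -s L/../../7/03
    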

Good controllers will even fail drives when a transient error happens, one that
triggers a bad-block reallocation by the drive, for example. This is why some
people "fix" failed drives by taking them out and putting them back in. After a
rebuild the drive will operate normally without any errors, but you are putting
yourself at serious risk of it failing during a rebuild if another drive
happens to fail, so DON'T do this.

Some others will react differently to these transient errors. EMC arrays, for
instance, will copy a suspicious drive to the hot-spare and call home for a
replacement. This is much faster than a full rebuild, but also much safer
because it doesn't increase the risk of a second drive failing while doing it.

Oh, and did I mention that cheap drives lie?

Avoid using desktop drives on production servers for important data, even in a
RAID configuration, unless you have some kind of replicated storage layer above
your RAID layer (meaning you can afford to recover one node from backup for
speed and then resync it with the master to make it current).

~~~
baruch
Your advice is fine for someone who is willing to take no risks and to spend
the money on that, but it is not strictly correct for all situations. In fact,
storage arrays are not likely to drop a disk on the first medium error, since
medium errors are a fact of life and do not necessarily indicate a bad disk. Of
course, a medium error does warrant longer-term inspection to make sure such
errors are not recurring, or happening too often, on a specific drive; that is
a cause for concern, but a single medium error is of no real significance.

I have also found that higher-end drives lie. I have used nearline SAS drives
that failed easily and often, and standard SATA drives that were more
resilient. It depends on the vendor and the model. It may also depend on the
batch, but I never found proof of that in my work.

~~~
CrLf
Maybe I was wrong in using the term "transient error"...

A bad-block reallocation can be seen as a transient error from the controller's
perspective, but it isn't silent, provided the drive doesn't lie about it (and
one would expect that a storage system vendor doesn't choose, and brand, drives
that lie to its own controllers).

The storage system may ignore medium errors that force a repeated read (below a
certain threshold), but it shouldn't ignore a medium error after which the
drive's reallocated-sector count increases (which is just another medium-error
threshold being hit, this time by the drive itself).

I'm not saying that higher-end drives are more reliable or not. Given that most
standard SATA errors go undetected for longer, one could even argue that
higher-end drives seem to fail much more frequently... I've had more FC drives
replaced in a single EMC storage array than in all the rest of the servers
combined (which have a mix of internal 2.5in SAS and older 3.5in SCSI-320
drives), and we certainly replace more drives in servers than in desktops.

But that's another topic entirely.

------
ChuckMcM
Having worked at NetApp for a few years at the turn of the century, this is
"old" news :-) But it is always important to internalize this stuff. M3 (or 3x
mirrors) is still computationally the most efficient (no computation, just I/O
to three drives); R6 is the most space-efficient, at the expense of some CPU.
Erasure codes are great for read-only data sets (they can be relatively cheap
and achieve good bandwidth) but they carry a fairly high I/O burden on writes
(n data + m code blocks touched for one read-modify-write).
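
Restricting to the single-parity case for illustration, the small-write penalty
falls straight out of the parity update rule: to overwrite one data block you
must first read the old data and the old parity, then write both back. A rough
sketch (the values are made-up placeholders):

    
    
        # RAID-5 small write: new parity = old parity XOR old data XOR new data,
        # i.e. two reads plus two writes for a single-block update
        old_data=0xA5; new_data=0x3C; old_parity=0x5A
        new_parity=$(( old_parity ^ old_data ^ new_data ))
        printf 'new parity: 0x%02X\n' "$new_parity"
    

With n data + m code blocks the fan-out per small write is correspondingly
larger.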

Bottom line is that reliably storing data is more complicated than just
writing it on to a disk.

------
mrb
The author is wrong that "enterprise quality disks just plain last longer". CMU
did a study on this topic with a population of 100,000 drives and found that
enterprise-grade drives do not appear to be more reliable than consumer-grade
drives. See the conclusion in <http://www.cs.cmu.edu/~bianca/fast07.pdf>. This
legend must die.

The author is also wrong when saying "a non-recoverable read error [is] a
function of disk electronics not a bad block". An NRE can happen for different
reasons; one of them is when the (data and error-correction) bits in the block
_get corrupted_ in a way that prevents the error-correction logic from
correcting the error. So the block is technically bad, just not bad enough to
cause the drive logic to declare it as a read failure.

~~~
akg_67
The CMU study is most probably flawed: it looked at hardware replacement
records and didn't take into consideration the different usage patterns and
replacement thresholds for enterprise versus consumer drives. Most enterprise
drives are used in enterprise servers and storage systems that monitor drive
errors closely using SMART. The threshold for drive errors is much lower on
such systems, and drives are replaced quickly. My company (a storage system
vendor) replaces a disk as soon as SMART warns of impending failure and doesn't
wait for an actual failure. That shows up in a hardware replacement log as more
frequent replacement of enterprise disks. Consumer disks are used in consumer
systems, and consumers don't proactively replace disks; they wait until a disk
actually fails. That shows up in a hardware replacement log as less frequent
replacement of consumer disks.

I have actually used 'defective' enterprise disks in consumer systems for years
after they were labelled defective by storage system vendors. About a decade
ago, I used to buy such disks in bulk at auction from server and storage
manufacturers and, after testing, sold them to consumers as refurbished disks.

~~~
mrb
I fail to see your point about the replacement threshold. Assuming that
enterprise-class drives get replaced sooner because sysadmins monitor SMART, it
is still widely acknowledged that SMART errors are a strong indicator that the
drive _will fail_ soon. For example, the Google study on drive reliability
showed this correlation on consumer-class drives [1]. There is no reason to
believe this correlation doesn't exist for enterprise-class drives (or else,
what would be the point of SMART?). Therefore the replacement threshold is
mostly irrelevant: an enterprise drive replaced because of SMART would have
failed soon anyway.

I really don't understand this skewed perception of consumer- vs
enterprise-grade hard drives. Do you believe that enterprise CPUs are more
reliable than consumer CPUs? How about enterprise NICs vs consumer NICs?

Consumer-grade drives are sold in volumes so much larger than enterprise-grade
drives that vendors have strong incentives to make them as reliable as
possible. I would even say they have incentives to make them _more_ reliable
than enterprise-grade drives, because a single percentage point improvement in
reliability will drastically reduce the costs associated with warranty claims
and repairs.

My own experience confirms the CMU study. I have worked at two companies, each
selling about 2-5 thousand drives as part of appliances to customers across the
world. One company used SCSI drives, the other IDE/SATA, and the replacement
rates were similar.

I can see your point that the different usage could invalidate the CMU findings
about consumer vs enterprise drive reliability, but I don't personally believe
it explains them. The CMU study, plus my anecdotal evidence on 2-5 thousand
drives, plus the fact that no study has ever _shown data_ suggesting enterprise
drives are more reliable, makes me think that they are not.

[1]
<http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf>

------
wayne_h
I have been doing RAID data recovery for many years. A very common scenario is
"two drives failed at once". That is usually not what happened. What usually
happens is that one drive fails, the RAID goes into degraded mode and continues
to function, and nobody notices the warnings. Some time later, months or even
years, a second drive fails and the RAID goes down; now they notice. They call
in the techs, who declare that 'two drives failed'. This is when your data is
most at risk: people start swapping boards and drives, re-powering the system,
rebuilding drives, rebuilding parity, forcing stale drives online, etc. I have
seen a lot of RAIDs that would have been recoverable had the proper steps been
taken. Then they hand it over and say they found it this way... didn't touch a
thing... www.alandata.com

~~~
kristianp
Would have upvoted if not for the web link at the end of the comment.

------
meaty
Let us not forget the little-considered other cause, which usually takes out
your tape infrastructure as well: fire.

One outfit I worked for decided to stick a brand new APC UPS in the bottom of
the rack, as it was in their office. It promptly caught fire and burned out the
entire rack. The fire protection system did fuck all as well, other than scare
the shit out of the staff. Scraping molten Cat5 cables off with a paint scraper
was not fun.

Fortunately it was all covered by the DR procedure. Tip: write one and test it.
That's more important than anything.

~~~
jzwinck
Do you know what caused the fire? Improper installation? Overheating?
Defective equipment?

~~~
meaty
A defective APC charge regulator, apparently. To be honest, they were good
about it and their insurance paid out very quickly, to both us and the site
owners.

------
andrewvc
And this is why, everywhere I've ever worked, I've had to keep saying that RAID
is _NOT_ backup. There are varying degrees of receptiveness to this, because
actual backups are a giant pain in the ass and have a very annoying lifecycle.

~~~
DanBC
There are better reasons why RAID is not backup than this obscure, rare fault
condition.

------
Corrado
Another way to think about this problem is to turn it on its head: assume that
hard drives _will_ fail, and make that a strength rather than a weakness.
Something like OpenStack Storage[1] is built around the idea that hard drives
are transient and replacing them should be painless. In fact, the more drives
you have, the fewer problems you have.

Basically, you keep multiple copies of the same data across different clusters
of hardware. If a drive or two (or ten) goes bad, you just replace them; there
is no rebuild time. Sure, it costs some disk space to keep n copies of the
data, but drives keep getting cheaper, and de-duplication schemes are being
developed to help with this. It's not like RAID-6 is super efficient either.

Just my two cents...

[1] <http://www.openstack.org/software/openstack-storage/>

------
andrewcooke
isn't this what raid scrubbing is for?
<http://en.gentoo-wiki.com/wiki/Software_RAID_Install#Data_Scrubbing>

    
    
        for raid in /sys/block/md*/md/sync_action; do
            echo "check" > "${raid}"
        done
    

does that fix the issue? i run that once a week. i thought i was ok. am i not?
if i am, isn't this old news?

~~~
pfg
yes, you should be fine. AFAIK most distros set up cron jobs for this
automatically (Debian/Ubuntu, for example, runs it once a month).
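
If you want to sanity-check the result after a run, md exposes a mismatch
counter in sysfs (md0 here is just an example device):

    
    
        # non-zero after a "check" pass may indicate silent inconsistencies
        cat /sys/block/md0/md/mismatch_cnt
    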

------
ksec
I am going to ask an honestly stupid question: what is going to happen to ZFS?
Sorry if this is slightly off topic, although the comments have already started
discussing it.

The open-source version of it, i.e. the BSD version, is only up to pool version
28, and it seems that after that Oracle is no longer putting out updates as
open source. What happens then? A divergence between the Oracle version and the
BSD version? And are features still being developed? Most of the limitations
listed on the wiki haven't changed at all over the past years and are still
listed as under development.

------
pixl97
R.I.N.A.B.

RAID Is Not A Backup!

Oh, and your RAID controller should monitor for SMART errors, and you should
look to replace disks when you start seeing sector rewrites (reallocations).
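
A quick way to eyeball that on an individual drive (the device name is just an
example) is to watch the relevant SMART attributes:

    
    
        # rising reallocated/pending/uncorrectable counts are a good reason to swap the disk
        smartctl -A /dev/sda | grep -Ei 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'
    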

------
ksadeghi
It's not very efficient, but I try to avoid placing all volumes on a single
RAID set.

------
martinced
I'm not a Un*x sysadmin at all and don't know much about hard drives: I'm just
a software dev.

But from the beginning of TFA, after reading this:

"Bad blocks. Two of them. However, as the blocks aren't anywhere near the
active volumes they go largely undetected."

The _first_ thing that came to my mind was: "What!? Isn't that a long-solved
problem!? Aren't disks / controllers / RAID setups much better now at detecting
such problems right away?"

I've got a huge issue with the "largely undetected" part. I may, at one point,
need storage for a gig I'm working on, and I certainly don't want problems like
that to go "largely undetected".

So, quickly skipping most of the article and going to the comments:

"It's worth pointing out that many hardware RAID controller support a periodic
"scrubbing" operation ("Patrol Read" on Dell PERC controllers, "background
scrub" on HP Smart Array controllers), and some software RAID implementations
can do something similar (the "check" functionality in Linux md-style software
RAID, for example). Running these kinds of tasks periodically will help
maintain the "health" of your RAID arrays by forcing disks to perform block-
level relocations for media defects and to accurately report uncorrectable
errors up to the RAID controller or software RAID in a timely fashion."

To which the author of TFA himself replies:

"Yes, that is something I should have made clearer. This is the very reason
that RAID systems have background processes that scan all the blocks."

Which leaves me a bit confused about TFA, despite all the shiny graphs.

Basically, I don't really understand the premise of "bad blocks going largely
undetected" in 2013...

~~~
StavrosK
I have a home server with three disks and ZFS, for my photos and things, so
I'm not an expert. However, Ubuntu's md-raid includes scrubbing once a week by
default, and I added scrubbing to my ZFS setup via crontab, again once a week
(I'm not sure if ZFS does it automatically, but I don't think it does. I would
appreciate a correction, if someone knows for sure).
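
For reference, the crontab entry is just something along these lines (the pool
name "tank" and the zpool path are whatever your system uses):

    
    
        # m h dom mon dow   command  -- scrub the pool every Sunday at 03:00
        0 3 * * 0   /sbin/zpool scrub tank
    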

The article assumes no scrubbing, which is a stupid thing to run without, as
the article itself makes clear. So it's basically "why pointing a gun at your
foot and pulling the trigger is bad": "because you're going to shoot yourself
in the foot".

~~~
olgeni
You may add something like this to /etc/periodic.conf:

    
    
        daily_status_zfs_enable="YES"
        daily_scrub_zfs_enable="YES"
        daily_scrub_zfs_default_threshold="6"  # in days
    

and it will scrub the pools every 6 days (and send you a report in the daily
run output).

~~~
StavrosK
Very nice, thank you, I will try that. I am rather dismayed, however, to learn
in this thread that my disks have 4K sectors while ZFS autodetected 512 bytes,
which means I'll have to destroy the pool and recreate it...

~~~
olgeni
It happens all the time...

If you run "camcontrol identify ada0" (or whatever your device is) you can
find out before it is too late:

sector size logical 512, physical 512, offset 0

This is from a lucky drive of course :)

~~~
StavrosK
Hmm, there's no such command in Ubuntu; maybe it's from BSD?

~~~
olgeni
camcontrol is from FreeBSD.

I don't have a Linux box available right now but maybe "hdparm -I" does
something similar: "request identification info directly from the drive".

~~~
StavrosK
Yep, that works:

    
    
        Logical  Sector size:                   512 bytes
        Physical Sector size:                  4096 bytes
    

I'm guessing that's not very nice. My ZFS pool was created with ashift=9 (that
is, 2^9 = 512 bytes) when it should be 12 (2^12 = 4096). I will have to copy
everything off and back on again.

For everyone who wants to check, and because I couldn't find info on it, run:

    
    
        zdb | grep ashift
    

And see if it's 9 or 12.
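
If anyone else ends up recreating a pool for 4K-sector drives, ZFS on Linux
lets you force the alignment at creation time (the pool layout and device names
below are just an example):

    
    
        # force 4K alignment (ashift=12) when recreating the pool
        zpool create -o ashift=12 tank mirror /dev/sdb /dev/sdc
    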

