
Why RAID 5 stops working in 2009 (2007) - pmoriarty
http://www.zdnet.com/blog/storage/why-raid-5-stops-working-in-2009/162
======
ars
Note: This article is from 2007 and is quite prescient.

It's completely shameful how bad the specified read error rates are now.

It's to the point that if you read an entire 4TB disk you have a 32% chance of
one bit being wrong!

That means hard disks can no longer be considered reliable devices that return
the data written to them; you now need a second layer in software checking
checksums.

[http://www.wolframalpha.com/input/?i=4TB+%2F+10^14+bit++in+%...](http://www.wolframalpha.com/input/?i=4TB+%2F+10^14+bit++in+%25)

For extra money they sell hard disks rated at 10^15 instead of 10^14 - this
should be standard!
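
For the curious, here is the arithmetic behind those numbers as a minimal
Python sketch (the 32% is the expected number of errors per full-disk read;
treating bit errors as independent, which real disks aren't, the chance of at
least one error comes out a little lower):

    # Expected unrecoverable read errors (UREs) when reading a full
    # 4TB disk, at the two commonly specified ratings.
    disk_bits = 4e12 * 8                    # 4 TB in bits

    for ure_rate in (1e-14, 1e-15):         # 1 error per 10^14 vs 10^15 bits
        expected = disk_bits * ure_rate     # expected UREs per full read
        p_any = 1 - (1 - ure_rate) ** disk_bits
        print(f"1 in {1/ure_rate:.0e} bits: {expected:.2f} expected errors, "
              f"P(at least one) = {p_any:.0%}")
    # 1e-14: 0.32 expected errors, ~27% chance of at least one
    # 1e-15: 0.03 expected errors,  ~3% chance of at least one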

~~~
vacri
I think we have very different definitions of 'completely shameful'. One bit
wrong in four _trillion_, on a device that only costs a couple of hundred
dollars?

~~~
ars
It sounds good, doesn't it? But then you compare it to the size of the disk
and realize that it's not actually that good.

The shameful part is that they actually sell the 10^15 drives - and not for a
ton more money either. They simply bin drives and the better ones are rated as
10^15. Instead of doing that they should figure out what's different in the
manufacturing of those and do it across the board.

~~~
ted_dunning
Yeah... shameful. I am sure that there is something simple that all the drive
manufacturers are just overlooking in their process that would take them to
10^-17 BER.

When was the last time _you_ created a physical artifact for sale that had
even part-per-billion malfunction rates?

It isn't as easy as these guys make it look, not by a long stretch.

------
keypusher
This is already well known in enterprise storage, and one of the solutions is
erasure coding. Some products implementing this exist today and others are
currently in the works. Basically you go to something like 20+5, instead of
4+1. Of course this only makes sense when you have 50+ drives in a tray.

[http://www.networkcomputing.com/storage/what-comes-after-rai...](http://www.networkcomputing.com/storage/what-comes-after-raid-erasure-codes/a/d-id/1232357)

~~~
mprovost
Practically, yes, erasure coding solutions do tend to have a lot of hard
drives. That's probably because their efficiency really makes them much
cheaper for people who are buying lots of storage; for a small setup the cost
of the drives themselves isn't that significant. But mathematically it works
with smaller numbers: you could have an 8-of-12 setup and be able to sustain
the loss of 4 drives while only storing an extra 50%.
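
A quick sketch of the trade-off both comments describe, with k data shards
and m parity shards (shard counts taken from the comments above):

    # Overhead and fault tolerance of a k+m erasure-coded layout:
    # the array survives the loss of any m shards at m/k extra storage.
    for k, m in [(4, 1), (20, 5), (8, 4)]:
        print(f"{k}+{m}: survives {m} lost drives, "
              f"{m/k:.0%} extra storage, "
              f"{k/(k+m):.0%} of raw capacity usable")
    # 4+1:  survives 1, 25% extra, 80% usable
    # 20+5: survives 5, 25% extra, 80% usable
    # 8+4:  survives 4, 50% extra, 67% usable

Note that 20+5 pays the same 25% overhead as 4+1 but survives five lost
drives instead of one, which is the whole appeal.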

------
e12e
There was an update in 2013:

[http://www.zdnet.com/has-raid5-stopped-working-7000019939/](http://www.zdnet.com/has-raid5-stopped-working-7000019939/)

And based on a sample size of 1, it looks like desktop 4TB drives still have
a URE rate of 1 in 10^14:

[http://www.seagate.com/www-content/product-content/barracuda...](http://www.seagate.com/www-content/product-content/barracuda-fam/desktop-hdd/barracuda-7200-14/en-gb/docs/desktop-hdd-data-sheet-ds1770-1-1212gb.pdf)

Then there's of course SSDs:

[https://www.youtube.com/watch?v=eULFf6F5Ri8](https://www.youtube.com/watch?v=eULFf6F5Ri8)

Yeah, no. I don't think it's fair to compare RAID 0 and RAID 5 for durability
-- but SSDs seem to have very low URE rates, and you can now (almost)
reasonably get 1TB SSDs. Not sure how long they can realistically be expected
to last, though. Long enough (2-3 years) to hold out until you'll probably
want to replace those 14 1TB SSDs in RAID 6 with three 16TB SSDs in RAID 1
with one hot-spare?

~~~
zanny
You could get a bunch of 840 Evo drives, say 7 of them, for around $3500, for
a 6TB RAID 5 that would probably get at least a gigabyte a second even with
the parity-calculation overhead.

Though ZFS and Btrfs solve the unreadable-bit problem with extent checksums,
so you really don't need to worry about mechanical disks yet. You would need
two UREs in the same block to kill the checksumming.
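
A minimal sketch of what that extent checksumming buys you (illustrative
only: ZFS and Btrfs use stronger hashes per extent, and the redundant copy
comes from mirrors or parity rather than a second argument):

    import zlib

    def write_block(data):
        # Store a checksum alongside every block at write time.
        return (data, zlib.crc32(data))

    def read_block(primary, mirror):
        data, stored = primary
        if zlib.crc32(data) == stored:
            return data                       # checksum matches: data is good
        data, stored = mirror                 # silent corruption detected,
        if zlib.crc32(data) == stored:        # so fall back to the redundancy
            return data
        raise IOError("both copies failed their checksums")

    good = write_block(b"important bits")
    bad = (b"importent bits", good[1])        # simulate a flipped bit
    assert read_block(bad, good) == b"important bits"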

------
lmm
I lost data to this kind of problem[1]. Linux dm-raid handles these kinds of
failures extremely poorly (or did at the time), even when following all
available tutorials. (When I reported my experience, one developer said I
should have set up a cronjob to recursively md5sum my / every week or so -
not exactly user-friendly, and not mentioned in any of the tutorials.) When
you attempt to rebuild a dmraid array, even a RAID 6 one, expect to lose all
your data.

Now I use ZFS (on FreeBSD), which handles these kinds of errors much more
gracefully; if there's an isolated URE you might lose data in that particular
file, but it won't destroy the whole array.

[1] Yeah yeah, RAID is not a backup. I'm talking about data I didn't consider
worth the cost of backing up, as a poor student at the time.

~~~
ars
> I should have set a cronjob to recursively md5sum my / every week or so

If you use Debian, install the debsums program, which will do that for you
for non-user data and report any errors.

You should also install mdadm and set it to check the array every month.

And finally, install smartmontools and have it do a short self-test every day
and a long one (i.e. a full-disk read) every week.
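
On a Debian-family system those pieces look roughly like this (a sketch:
the device and array names are placeholders, and Debian's mdadm package
already ships a monthly checkarray cron job that does the md scrub for you):

    # /etc/smartd.conf -- short self-test daily at 02:00,
    # long self-test (full surface read) every Saturday at 03:00
    /dev/sda -a -s (S/../.././02|L/../../6/03)

    # manual monthly md array scrub (what Debian's cron job triggers)
    echo check > /sys/block/md0/md/sync_action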

------
topbanana
I don't use any storage-level redundancy at all. I found that
misconfiguration (my fault, on two occasions) made it more unreliable than
just a single disk. I rely on cloud backup, and I'll take the hit if/when I
need to rebuild my machines.

~~~
fulafel
Raid is no replacement for backups in any case!

For most people it's better to make a nightly copy to another HD. RAID is for
avoiding downtime on HD failures; it doesn't save you from an accidental rm
-r, a word processor corrupting your thesis, a GPU driver crash corrupting
your filesystem, your box getting owned, etc.
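
One hedged way to do that nightly copy (a sketch: the paths are placeholders,
and --link-dest hard-links unchanged files against the previous snapshot, so
yesterday's copy survives tonight's accidental rm -r):

    # crontab: snapshot /home to the backup disk every night at 03:00
    0 3 * * * rsync -a --link-dest=/backup/last /home/ /backup/$(date +\%F)/ && ln -nsf /backup/$(date +\%F) /backup/last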

~~~
autokad
i think what he means is, down time isn't a big deal since recovering from
back up is quick enough in his case. I never found raid 5 particularly helpful
because: #1 hard disks tend to fail at the same time #2 systems fail to give
warnings on bad disks ie everything is working, no red flashing lights, things
go boom, restart, just kidding, lol you got 2 bad disks. #3 redundant servers
> redundant disks for uptime #4 cloud

but thats just me and my use cases

edit: when i use raid, i use raid 10 or raid 0

------
barrkel
The storage efficiency gains from RAID5 are not worth the risks, and when you
go to RAID6 you lose even more efficiency.

You're better off with RAID10 (you only need to read one drive to rebuild,
not all the drives). Better performance all round, too.

~~~
chadnickbok
I'm not sure I fully understand why RAID10 is better than RAID5. Reading the
Wikipedia RAID article, it seems to imply that a typical RAID10 setup uses
n*2 disks, where each block of data is written to two drives.

But how does that help in the case of drive failure? If a drive fails, then
as drive size increases, won't the exposure to a URE also increase? Is it
better than RAID5?

~~~
wtallis
When a drive in a RAID10 array dies, the data from that drive is still
directly available on its mirror, which is now the one drive not protected by
redundancy; the other half of the data is still protected. Rebuilding the
array requires reading a single drive without error. Rebuilding a RAID5
requires reading _n-1_ drives and computing parity without error.
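
To put rough numbers on that, a sketch assuming 2TB drives, a 1-in-10^14 URE
rate, and independent bit errors (a simplification):

    # P(reading `bits` bits with no URE), at 1 error per 1e14 bits
    ure_rate = 1e-14
    drive_bits = 2e12 * 8                        # one 2TB drive, in bits

    def p_clean_read(bits):
        return (1 - ure_rate) ** bits

    n = 4                                        # drives in the array
    raid5 = p_clean_read((n - 1) * drive_bits)   # must re-read every survivor
    raid10 = p_clean_read(drive_bits)            # only the surviving mirror
    print(f"RAID5 rebuild reads clean:  {raid5:.0%}")    # ~62%
    print(f"RAID10 rebuild reads clean: {raid10:.0%}")   # ~85%

And the RAID5 number gets worse as n grows, while the RAID10 number stays
put.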

------
cybojanek
Can't wait for btrfs. Benchmark from today:

[https://docs.google.com/spreadsheets/d/1L5bVGU95D0Cu1gJoQhBh...](https://docs.google.com/spreadsheets/d/1L5bVGU95D0Cu1gJoQhBhCSTd0uVATi05-UYKcORfTiI/)

~~~
GhotiFish
Is BTRFS really still at the "do not use in production" phase? I'm surprised
it's still considered unstable. Seems like a case of the "Google betas".

~~~
thristian
Russell Coker's reports of his experiences with BTRFS give me the screaming
heebie-jeebies, no matter how upbeat and positive he stays about it:
[http://etbe.coker.com.au/tag/btrfs/](http://etbe.coker.com.au/tag/btrfs/)

~~~
pmoriarty
What about ZFS?

~~~
MarkSweep
ZFS has been around a bit longer than Btrfs, which ZFS proponents claim has
allowed it to become more stable. Watching the commit logs of Illumos (the
open-source derivative of Solaris), most commits seem to be related to adding
features or reducing IO latency variance; problems with lost data are few and
far between.

As an appeal to authority, a number of companies currently trust[1] their
data to ZFS, Joyent probably being the most well known of them. I store my
personal data[2] on ZFS, though my needs are modest.

[1]: [http://open-zfs.org/wiki/Companies](http://open-zfs.org/wiki/Companies)
[2]: [http://www.awise.us/2013/03/10/smartos-home-server.html](http://www.awise.us/2013/03/10/smartos-home-server.html)

~~~
hga
And the list at [1] doesn't include rsync.net, whose business is providing
reliable off-site disk storage.

------
brunorsini
Never used anything but RAID 1 on my Synology NAS. I had a drive fail there
once, and rebuilding the array was as simple as swapping it for a new one (it
took a few hours, but it was truly "plug and play" and no data was lost).

Do the things mentioned in the article imply that a system like the Promise
Pegasus2 R6 12TB (6 by 2TB) Thunderbolt 2 RAID System
([http://store.apple.com/us/product/HE152VC/A/promise-pegasus2...](http://store.apple.com/us/product/HE152VC/A/promise-pegasus2-r6-12tb-6-by-2tb-thunderbolt-2-raid-system))
is actually not guaranteed to survive one HD crash when configured in RAID 5?
I'm a bit confused now; I would appreciate any help there...

------
click170
So my understanding is that the problem is that the RAID driver throws an
error when it fails to read a bit, asserting that it can't read the whole
array.

Does this mean that since it's the RAID driver itself claiming "He's dead,
Jim", most filesystem-level protections will be ineffective, since the array
itself is "dead"? Of course, if your filesystem leveraged multiple RAID
arrays as independent disks, it would have a chance.

For software RAID at least, could it not leverage some kind of Hamming or
Reed-Solomon code so that it doesn't fail hard like this?
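
On that last question: RAID5/6 parity is itself a (very simple) erasure code,
so the hard-fail behavior is more a driver policy problem than a
coding-theory one. Still, for illustration, here is a minimal Hamming(7,4)
sketch of how a code can locate and fix a single flipped bit on read (real
systems use Reed-Solomon over whole sectors, not per-nibble Hamming):

    def hamming74_encode(d):          # d: four data bits
        p1 = d[0] ^ d[1] ^ d[3]       # parity over positions 1,3,5,7
        p2 = d[0] ^ d[2] ^ d[3]       # parity over positions 2,3,6,7
        p3 = d[1] ^ d[2] ^ d[3]       # parity over positions 4,5,6,7
        return [p1, p2, d[0], p3, d[1], d[2], d[3]]

    def hamming74_correct(c):         # c: seven received bits
        s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
        s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
        s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
        syndrome = s1 + 2*s2 + 4*s3   # 0 = clean, else 1-based error position
        if syndrome:
            c[syndrome - 1] ^= 1      # flip the bad bit back
        return [c[2], c[4], c[5], c[6]]

    word = [1, 0, 1, 1]
    code = hamming74_encode(word)
    code[5] ^= 1                      # simulate a single-bit read error
    assert hamming74_correct(code) == word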

------
js2
Shrink the RAID group size? 7 x 1TB disks in RAID5 give you 6TB of usable
space. With 2TB drives, use a 4-disk RAID group. You're still protecting the
same 6TB (at a small loss of efficiency), but eventually cost catches up (if
4 x 2TB isn't immediately cheaper than 7 x 1TB, it won't take long until it
is). This more than covers the reliability decrease of the higher-capacity
drives.

------
Nican
I am not an expert on the topic, but nobody seems to have mentioned RAID-Z.
[https://blogs.oracle.com/bonwick/entry/raid_z](https://blogs.oracle.com/bonwick/entry/raid_z)
Can anyone comment on this?

------
gnopgnip
This is just looking at the physical layer for consumer drives. Any SAN is
going to be using something with fewer errors. Also, storage costs continue
to drop every year, so double parity, or eventually triple parity, is cheaper
than ever to implement.

------
spullara
When I put together my home 8-disk server back in 2009, it was pretty clear
to me that I needed to protect against 2 failures, even if you just look at
the 12-hour recovery times.

How much does this change if you are using modern SSDs?

~~~
mikevm
I believe that SSDs fail differently, in some respects. If your SSD "wears
out", you may not be able to reprogram it (write to it), but you may still be
able to read from it.

Edit: Apparently this is wrong. Read the comments below.

~~~
tfigment
My experience, as we use SSDs for our embedded solutions and sometimes have
very frequent writes, is that when an SSD goes, the whole partition is pretty
much hosed. I'm not sure what exactly fails, but it seems like the file
system index or journal goes and then the whole partition is mostly useless.
We use a write filter to avoid writes to the OS partition (writes are
redirected to memory) so it is mostly read-only, but it can still fail, and
when it does the system is unrecoverable. Other partitions may still be
seemingly okay, but we just replace the drive once a failure is spotted.
Disk-based drives usually don't fail as catastrophically in my experience,
but they do have more errors.

Having said that, we have been using them for the past 5 years and have only
had a dozen failures in that time over probably 500 deployments. And I think
several of those were service replacements where no root cause could be
determined. Fortunately, we don't really have to try to recover anything from
the drives, but it does cause downtime when they fail.

~~~
autokad
What's your gut feeling on the failure rate vs standard disks?

~~~
yuhong
I think it probably depends on the quality of the SSD firmware.

------
maerF0x0
I wonder if Amazon has countermeasures for this? Or should I expect to see
this kind of (un)reliability replicated in a VPS / "cloud" environment?

------
pat2man
For now we can get away with other tricks like ZFS's ditto blocks, but at
some point the whole redundant storage system will need to be re-thought.

------
le_meta
This article brought to you by the cloud. ZDNet..

