
Understanding RAID: How performance scales from one disk to eight - feross
https://arstechnica.com/information-technology/2020/04/understanding-raid-how-performance-scales-from-one-disk-to-eight/
======
zxcvgm
> _One final warning about hardware RAID controllers: It's difficult to
> predict whether a hardware RAID array created under one controller will
> import successfully to a different model of controller later._

> _We find that with hardware RAID, it's frequently difficult to tell whether
> you're nuking your array or importing it safely. So in the event of a
> controller failure and replacement, you may end up sweating bullets,
> YOLOing, and hoping._

This was the main reason I decided to switch to Linux software RAID from a
3ware card years ago.

A thunderstorm caused a power trip, and after powering the system back on, the
3ware controller had lost its EEPROM contents, which held the PCI vid/pid.
Fortunately, it was just a RAID 1 mirror. We used dd to skip past the metadata
header and managed to mount the ext2 filesystem.
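
For anyone curious, here's a rough sketch of that kind of recovery: scanning the
raw device for an ext2/3/4 superblock signature to find where the filesystem
actually starts behind the controller's metadata. The device path, scan window,
and 4 KiB step are assumptions for illustration, not what we actually ran.

```python
#!/usr/bin/env python3
# Hypothetical sketch: scan a raw device (or dd image) for an ext2/3/4
# superblock so the filesystem hidden behind a RAID controller's metadata
# header can be located, then mounted read-only via a loop device, e.g.
#   losetup -r -o <offset> /dev/loop0 /dev/sdb && mount /dev/loop0 /mnt
import sys

EXT_MAGIC = b"\x53\xef"        # s_magic = 0xEF53, stored little-endian
SB_OFFSET = 1024               # superblock sits 1024 bytes into the filesystem
MAGIC_OFFSET = SB_OFFSET + 56  # s_magic lives 56 bytes into the superblock
STEP = 4096                    # scan granularity (assumed 4 KiB alignment)

def find_ext_offsets(path, limit=64 * 1024 * 1024):
    """Yield byte offsets where an ext2/3/4 filesystem appears to begin."""
    with open(path, "rb") as dev:
        for off in range(0, limit, STEP):
            dev.seek(off + MAGIC_OFFSET)
            if dev.read(2) == EXT_MAGIC:
                yield off

if __name__ == "__main__":
    for offset in find_ext_offsets(sys.argv[1]):
        print(f"possible filesystem start at byte offset {offset}")
```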

Using software RAID, I’m confident that we can take the drives to another
machine in a pinch and still be able to mount it without much hassle.

~~~
vardump
People still use hardware RAID? Windows exempted, as there's unfortunately
nothing better available for Windows. (ReFS SW RAID might be ok, haven't
checked its state in a while. I know Microsoft is definitely working hard to
fix this.)

Hardware RAID is just data corruption waiting to happen, simply because there
are more vulnerable components and interconnects along the data path.

It's much better to checksum data while it's still protected by ECC RAM and
CPU ECC caches – which is something only software RAID can do. Although even
that is not a total guarantee.

There's no longer any reason to assume a slow SoC on a RAID adapter (even with
XOR acceleration) would perform better than an Intel or AMD server CPU, as long
as the I/O path is not bandwidth-limited.

~~~
e12e
> Windows exempted, as there's unfortunately nothing better available for
> Windows.

Is there something wrong with dynamic disks for software RAID for the OS in Windows?

[https://docs.microsoft.com/en-us/windows/win32/fileio/basic-...](https://docs.microsoft.com/en-us/windows/win32/fileio/basic-and-dynamic-disks)

Or Storage Spaces? (The latter I have not used.)

[https://support.microsoft.com/en-gb/help/12438/windows-10-st...](https://support.microsoft.com/en-gb/help/12438/windows-10-storage-spaces)

~~~
vardump
Create a mirrored volume out of two disks, once with ReFS (or ZFS) and once
with dynamic-disk software RAID.

Remember to enable ReFS integrity streams! ZFS protects data integrity fully by
default. NTFS only protects metadata; file data integrity is not protected.

Fill the volumes with known data. Randomly corrupt some blocks on the disks
(but not on both disks at the same offsets!).

Observe which solution recovers and returns correct file data and which does
not.

Hint: you might want to stop using dynamic disk RAID after this experiment...
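
A rough, filesystem-agnostic sketch of the corruption half of that experiment,
using two plain image files standing in for the mirror members (file names,
block size, and the number of corrupted blocks are assumptions):

```python
#!/usr/bin/env python3
# Hypothetical helper for the experiment above: after the mirrored volume has
# been created on two backing files (or disks) and filled with known data
# through the filesystem, damage random blocks on each member at
# NON-overlapping offsets, so one intact copy of every block always survives.
import os
import random

BLOCK = 4096
HITS = 256                                 # blocks to damage per member
MEMBERS = ["mirror-a.img", "mirror-b.img"]

def corrupt(path, avoid):
    """Stomp on HITS random blocks, skipping offsets already damaged elsewhere."""
    blocks = os.path.getsize(path) // BLOCK
    hit = set()
    with open(path, "r+b") as f:
        while len(hit) < HITS:
            blk = random.randrange(blocks)
            if blk in avoid or blk in hit:
                continue
            hit.add(blk)
            f.seek(blk * BLOCK)
            f.write(os.urandom(64))        # clobber the start of the block
    return hit

damaged = corrupt(MEMBERS[0], avoid=set())
corrupt(MEMBERS[1], avoid=damaged)         # never damage both copies of a block
print("members corrupted; now scrub/read back and see which setup self-heals")
```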

Storage Spaces operates at a higher level, building on lower-level volumes.
I'd feel much more comfortable using it on ReFS, at least once ReFS can be
considered sufficiently mature. See:
[https://docs.microsoft.com/en-us/windows-server/storage/refs...](https://docs.microsoft.com/en-us/windows-server/storage/refs/refs-overview#storage-spaces).

No qualms about trusting ZFS; it's _very_ solid on good hardware. Of course,
running ZFS on Windows in production _might_ be a Very Bad Idea, so do it on
FreeBSD, Solaris derivatives, or Linux instead!

------
louwrentius
I have a passion for storage and I really liked this article. I'm also very
interested in the follow-up about ZFS.

Please note that the author, Jim Salter, also has a nice podcast:
[https://techsnap.systems](https://techsnap.systems)

Also interesting: [https://arstechnica.com/gadgets/2020/02/how-fast-are-your-di...](https://arstechnica.com/gadgets/2020/02/how-fast-are-your-disks-find-out-the-open-source-way-with-fio/)

My main gripe with the article is that the topic of latency seems
underdeveloped.

The testing I've done using some of the parameters discussed in the article
shows extreme, probably unrealistic, latencies on the storage.
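
For reference, this is the kind of quick-and-dirty latency probe I mean: a
queue-depth-1 random-read loop with O_DIRECT so the page cache doesn't hide
device latency. It's only a sanity check against the fio numbers, not a
replacement for fio; the target path, 4K read size, and sample count are
assumptions.

```python
#!/usr/bin/env python3
# Minimal 4K random-read latency probe at queue depth 1 (Linux only).
# O_DIRECT bypasses the page cache; the buffer comes from mmap so it is
# page-aligned, as O_DIRECT requires. Target path and sample count are
# placeholders.
import mmap
import os
import random
import statistics
import time

PATH = "/dev/sdb"          # block device or a large pre-existing file
READ_SIZE = 4096
SAMPLES = 2000

fd = os.open(PATH, os.O_RDONLY | os.O_DIRECT)
size = os.lseek(fd, 0, os.SEEK_END)
buf = mmap.mmap(-1, READ_SIZE)             # page-aligned scratch buffer

lat_ms = []
for _ in range(SAMPLES):
    offset = random.randrange(size // READ_SIZE) * READ_SIZE
    t0 = time.perf_counter()
    os.preadv(fd, [buf], offset)
    lat_ms.append((time.perf_counter() - t0) * 1000.0)
os.close(fd)

lat_ms.sort()
print(f"median {statistics.median(lat_ms):.2f} ms, "
      f"p99 {lat_ms[int(len(lat_ms) * 0.99) - 1]:.2f} ms")
```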

~~~
louwrentius
A note about RAID arrays: yes, it is true that drives have become bigger but
not faster.

This causes RAID rebuilds to take longer. But I am not aware of any hard
evidence that this increases the risk of a failed rebuild.

There are more sectors that could be bad, but you need to do a patrol read /
scrub at least once a month to detect them in advance.

I think there are a lot of scares around RAID, and I really wonder how much of
it is rooted in anecdote and folklore.

~~~
seized
If you have a set of N drives all purchased at the same time, they are going
to all have the same approximate bathtub curve and could start failing at the
same time.

I had five Seagate 7200.11s (terrible, shitty drives in many ways) in a RAIDZ1
pool. One failed, I started a resilver. Another started failing during that
time. The rebuild slowed significantly due to the second failing drive. ZFS
resilvered eventually then the second drive failed completely. That was at
only 26k hours or so. I think a third started acting up shortly after but I
was already replacing them all.

These were only 500GB drives, and still the odds caught up with two of them at once....

I only use RAIDZ2 now (and I don't use Seagates).

~~~
zepearl
I also had some Seagate drives fail... (I'm now using only HGST and WD
drives; no experience with Toshiba.)

Question about ZFS and its RAIDZ: do you have any recommendation (personal
experience, links, books, ...) concerning parameters/setup to be used when
setting up a RAIDZ(1 and 2)?

I'm new to ZFS, and I already had to do a lot of testing with ZFS on just one
HDD before I finally managed to get good performance out of it (it's used by a
"Clickhouse" database, which itself writes data CoW-style, so I had to raise
the "recordsize" to 2MiB), and I imagine that with RAIDZ it can get even more
complicated?

I would like to set up a RAIDZ for the database (again, "Clickhouse", which
generates multi-GB files) and two more to be used as simple NAS (a main one
and a backup, storing files of all sizes).

I searched a lot and found some websites that were "ok", and also bought two
tiny books ( [https://www.amazon.com/Introducing-ZFS-Linux-Understand-Stor...](https://www.amazon.com/Introducing-ZFS-Linux-Understand-Storage-ebook/dp/B077QTFLY8/ref=sr_1_1?dchild=1&keywords=introducing+zfs+on+linux&qid=1587147735&sr=8-1)
and [https://www.amazon.com/ZFS-Linux-Administration-William-Spei...](https://www.amazon.com/ZFS-Linux-Administration-William-Speirs/dp/154462204X/ref=sr_1_1?dchild=1&keywords=zfs+internals+and+administration&qid=1587147772&sr=8-1)
), but the books were mediocre and the information I found on the web was
sparse and a bit dated.

Cheers

~~~
seized
Servethehome.com has some subforums with a lot of info, particularly the
Napp-it one (Napp-it is a management UI for ZFS, like FreeNAS is). Also:
Reddit.com/r/ZFS and Reddit.com/r/datahoarder

I'll try to think of some others.

Honestly the best way is to start tinkering and see how it works for you as it
can depend heavily on what you're using it for.

But yes there is a lot of FUD out there.

One thing to beware of: RAIDZ1/2 is limited to the IOPS of one drive per
vdev... That can be quite limiting for some things.
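
Back-of-the-envelope illustration of that rule of thumb (the per-drive IOPS
figure is a typical 7200 rpm guess, not a measurement):

```python
# Small random-write IOPS scale with the number of vdevs, not the number of
# disks. Assumed figure: ~150 random IOPS for one 7200 rpm HDD.
drive_iops = 150

layouts = [("1 x 12-wide RAIDZ2", 1), ("2 x 6-wide RAIDZ2", 2), ("6 x 2-way mirrors", 6)]
for name, vdevs in layouts:
    print(f"{name}: ~{vdevs * drive_iops} small random IOPS")
```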

~~~
zepearl
Thank you!

Didn't know that website - it has a lot of interesting stuff.

Ok, I'll then do some tests and will see how the RAIDZ behaves.

~~~
seized
Read up on IOPS vs. vdevs and on ashift; ashift has some important implications
for space consumption, performance, and future drive replacement. You almost
always want ashift=12.
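
A tiny, Linux-only sketch of where that number comes from: ashift is the log2
of the sector size ZFS will assume (2^12 = 4096 bytes), and many drives report
512-byte sectors even when the physical sectors are 4K, which is why forcing at
least 12 is the usual advice. The sysfs path is real on Linux; the device name
is a placeholder.

```python
#!/usr/bin/env python3
# Suggest an ashift from a drive's reported physical sector size (Linux sysfs).
# ashift is log2(sector size), so 4096-byte sectors -> ashift=12. Following the
# advice above, never suggest anything below 12, even for drives claiming 512.
import math

def suggested_ashift(dev="sda"):
    with open(f"/sys/block/{dev}/queue/physical_block_size") as f:
        physical = int(f.read().strip())
    return max(12, int(math.log2(physical)))

if __name__ == "__main__":
    print("suggested ashift:", suggested_ashift("sda"))
```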

------
ptha
Following the link to _libeatmydata_ that some hardware controllers use for
asynchronous writes:

 _libeatmydata is a small LD_PRELOAD library designed to (transparently)
disable fsync (and friends, like open(O_SYNC)). This has two side-effects:
making software that writes data safely to disk a lot quicker and making this
software no longer crash safe.

DO NOT use libeatmydata on software where you care about what it stores. It's
called libEAT-MY-DATA for a reason._
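
To make the warning concrete, this is the classic crash-safe write pattern that
relies on fsync, and therefore quietly stops being crash safe under
libeatmydata (the filename here is arbitrary):

```python
#!/usr/bin/env python3
# "Write temp file, fsync, rename" is the standard way to replace a file so
# that a crash leaves either the old or the new contents, never a torn file.
# With fsync turned into a no-op (libeatmydata, or a write cache with no
# battery), the rename can reach disk before the data does.
import os

def atomic_write(path, data: bytes):
    tmp = path + ".tmp"
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)              # no-op under libeatmydata
    finally:
        os.close(fd)
    os.rename(tmp, path)          # atomic replace on POSIX filesystems
    dfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dfd)             # persist the rename itself; also neutered there
    finally:
        os.close(dfd)

atomic_write("state.json", b'{"counter": 42}\n')
```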

Also, just some Latin pedantry: the article uses the phrase _Caveat imperator_
a couple of times. I assume it is meant as _Let the buyer beware_ , which as
far as I know is _Caveat emptor_.

Google Latin translation:
[https://translate.google.com/#view=home&op=translate&sl=la&t...](https://translate.google.com/#view=home&op=translate&sl=la&tl=en&text=caveat%20emptor%0Acaveat%20imperator)

Does caveat imperator have another meaning (perhaps the emperor is
infallible)?

~~~
ezzaf
I understood "caveat imperator" to mean operator (commander) beware

[https://en.wikipedia.org/wiki/Imperator](https://en.wikipedia.org/wiki/Imperator)

~~~
avianlyric
Yeah the author clarifies this in the comments.

[https://arstechnica.com/information-technology/2020/04/under...](https://arstechnica.com/information-technology/2020/04/understanding-raid-how-performance-scales-from-one-disk-to-eight/?comments=1&post=38815324#comment-38815324)

------
411111111111111
Great write-up. Personally, I'd have liked a more explicit warning about using
RAID 5/6: the bigger the array gets, the more likely you are to lose the whole
thing on a failure, because the remaining disks will be under heavy load during
recovery and are likely close to failing as well.

He did mention it as a side note, but it deserves more attention I think.

~~~
bluegreyred
Yes, there's also the topic of unrecoverable read errors (URE) and their
effect on successful RAID rebuilds. [1][2]

Most consumer drives are still rated at one unrecoverable read error per 10^14
bits read. That's one error per 12.5 terabytes read, so in the worst case you
could end up in situations where, on average, you are unable to read a full
16TB drive without an error. Needless to say, this is less than optimal for
rebuilding a failed RAID array.
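
The arithmetic behind that, under the naive assumption that UREs are
independent and occur at exactly the rated frequency:

```python
import math

bits_per_error = 1e14                          # rated: 1 URE per 10^14 bits read
print(bits_per_error / 8 / 1e12)               # -> 12.5 TB read per error, on average

drive_bits = 16e12 * 8                         # one full read of a 16 TB drive
p_clean = math.exp(-drive_bits / bits_per_error)   # Poisson approximation
print(f"chance of reading the whole drive without a URE: {p_clean:.0%}")  # ~28%
```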

Anecdotal evidence (i.e. very low error rates during ZFS scrubs) suggests that
manufacturers underrate their drives and they are much more reliable than
that, but it is something to keep in mind.

Fortunately, drive capacity has completely outpaced my needs for personal data
storage in recent years, so I am happy with JBOD or RAID1, with backups of
course.

[1][http://www.raidtips.com/raid5-ure.aspx](http://www.raidtips.com/raid5-ure.aspx)
[2][https://magj.github.io/raid-failure/](https://magj.github.io/raid-failure/)

~~~
vardump
> Anecdotal evidence (i.e. very low error rates during ZFS scrubs) suggests
> that manufacturers underrate their drives and they are much more reliable
> than that, but it is something to keep in mind.

I've seen plenty of drives returning invalid data with a correct CRC over the
years, on reliable server-grade Xeon + ECC hardware. Of course, the vast
majority of drives never do it; there's just no way to know which ones will
until it happens.

Firmware bugs in weird corner cases? Cosmic rays? Perhaps, but I think it's
more reasonable to just consider it one of those weird things that occasionally
Just Happen (TM) and need to be protected against at a higher level.

All drives produced in the last 30 years or so run ever more complicated
software stacks. For example, they all have features that move at-risk data to
safer locations without the host system requesting it or even knowing about it.
Their physical (bits on spinning rust or in NAND flash blocks) and logical
(what the host sees) data representations can be _completely_ different.

Plain old CRC errors, though, are _way_ more frequent. I feel much more
comfortable about those errors; at least the drive knows the data is
corrupted.

------
rconti
It seems every post here saying negative things about "hardware RAID" should
prefix the term with "inexpensive consumer" or at least "single-system
integrated".

Which, to be fair, makes sense with the subject of the article limiting us to
a discussion of 8 drives.

~~~
SlowRobotAhead
Right. I struggle to see how anyone in this thread is seriously comparing
their PC with ZFS to my server with 12x 3.84TB SAS SSDs and a PERC 740
controller. Yet they are, with absolutes like “Hardware RAID is dead”.

~~~
ecpottinger
Wait a few years till something happens to your controller and you try to get
a compatible version.

~~~
SlowRobotAhead
Yea, I think Dell makes more than 100 PERC740s.

Plus, you’re talking about RAID as a backup, which it is not. RAID is
redundancy, not disaster recovery. If your backup solution is based on your
RAID controller series, you’ve already lost.

------
Rafuino
This is really well written and I learned a bunch. I wonder how much money
vendors make by selling hardware RAID when customers could get away with
kernel RAID. Also, thank you for this bit! "All tests performed here are
random access, because nearly any real-world storage workload is random
access."

Question for you smart people... is there a reason to use such an old Linux
kernel version? They're testing on 4.15, which was released over 20 months
ago. Would there be any appreciable difference on 5.6.4 or something in the
5.x range? Perhaps not since hard drives can't really improve much, but I'm
curious.

~~~
Arnt
Hint: You haven't read about any performance breakthrough in the past several
years.

~~~
Rafuino
Not in HDDs... there's been quite a bit of improvement in SSDs and new memory
media in the past several years.

~~~
Arnt
Not in the linux kernel's block layer, which is what GP had in mind.

The block layer has been able to deliver very close to 100% of the drives'
stated bandwidth for many years now.

~~~
Rafuino
Who is GP?

------
Diederich
> In our experience, administrators are overwhelmingly likely not to notice
> when a hardware controller's cache batteries fail. Frequently, those
> administrators will still be operating their systems at reduced performance
> and reliability levels for years afterward.

Can anyone else weigh in on this claim? In the two jobs where I've been
responsible for such equipment, monitoring both the performance and battery
status was pretty critical and done with due care and focus.

------
juskrey
I have avoided RAID for the past 20 years and will avoid it in the future. The
whole technology and its benefits amount to picking up pennies in front of a
steamroller, with explosive complexity in the failure modes. I much prefer
simpler and more understandable replication and backup solutions; there are
dozens of them.

~~~
bouncycastle
It's very important to understand that RAID is not a backup solution. You
still need backups even if you have RAID, no matter what level.

What it offers (in parity mode) is improved reliability and a better chance
that your server will stay online if a drive dies. In the majority of cases, it
saves you from restoring from backup. Just replace the bad drive, and away you
go! However, never assume that it will protect you from data loss. Backup.
Backup. Backup!

~~~
prussian
Yes, but what the parent comment is saying is that simple replication can
replace the need for parity arrays entirely.

~~~
Rychard
In the sense that your backups are a surviving copy of your data, sure.

That said, an argument such as "system is down while we restore from backup;
some recent data may be lost" will not go over as well as "the system remains
available, but performance will be negatively impacted while the array
rebuilds; no data has been lost at this time".

In the former, you're in crisis mode. In the latter, you'll probably be under
some stress, but you're still in the clear.

Edit: But we're all on the same page when it comes to backups. They're a
necessity.

~~~
411111111111111
I'm pretty sure he's thinking more about distributed filesystems vs. RAID than
what you're talking about.

I.e. Ceph with pods distributed over a bunch of disks, without RAID.

------
montjoy
I wish articles like this could afford production-grade hardware to test on.
When you’ve run large Fibre or SAS arrays in the past on enterprise hardware
it’s hard not to see gaps in the testing methods/setup. Sequential write
metrics ARE important when doing database backups. SATA (even nearline) and
SAS disks have vastly different performance profiles so I would expect optimal
configuration would be different. What about SSD vs. spinning? What about
different rotational speeds? Were the disks aligned properly? How does
performance scale with increasing the number of threads requesting I/O? The
setup in the article really only applies to certain prosumer/small business
requirements. I guess I’m not the right audience for the article.

------
tobyhinloopen
The conclusion is unclear to me. Is the author saying RAID is a backup?

~~~
WantonQuantum
Yes. And RAID10 is the best backup strategy because 10 is bigger than 5 or 6.

~~~
hans_castorp
So how do I create a RAID42 then? Wouldn't that be the ultimate solution to
all RAID problems?

------
mohammedhdotio
Great writing. However, a question for the geeks: what is wrong with using
RAID0 for performance along with an external backup/deduplication solution like
Borg Backup, Duplicati, etc.?

~~~
oofabz
There are three drawbacks to the setup you describe:

\- If a drive fails, the system is offline until you restore from backup. Fine
for personal use, but not okay if customers are paying you for a service.

\- If a drive fails, any changes since the last backup are lost. Fine for
long-term archival, not so great for a bank ledger.

\- No capability to detect data corruption. RAID with parity can run a scrub
to find and repair bit rot (see the toy sketch below). If data is corrupted on
your RAID 0, you probably won't notice, and you'll back up the corrupted data.
If your filesystem has checksums, you will be able to detect the corruption but
not repair it. Your filesystem probably doesn't have checksums.
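
A toy sketch of the difference, combining per-block checksums (to locate the
bad block) with single parity (to rebuild it). This is closer to how a
ZFS-style scrub repairs things than to a plain md parity check; block sizes and
contents are made up.

```python
import hashlib
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

data = [b"A" * 16, b"B" * 16, b"C" * 16]       # data blocks in one stripe
parity = reduce(xor, data)                     # parity block = XOR of data blocks
checksums = [hashlib.sha256(d).digest() for d in data]

data[1] = b"!" * 16                            # silent corruption of block 1

# Scrub: a checksum mismatch pinpoints the bad block...
bad = next(i for i, d in enumerate(data) if hashlib.sha256(d).digest() != checksums[i])
# ...and XOR-ing parity with the surviving blocks reconstructs it.
survivors = [d for i, d in enumerate(data) if i != bad]
data[bad] = reduce(xor, survivors, parity)

assert data[1] == b"B" * 16                    # repaired; RAID 0 has nothing to rebuild from
print("block", bad, "rebuilt from parity")
```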

If you frequently run I/O-intensive workloads, the extra performance might be
worth the tradeoffs. If you're looking for your PC to feel slightly faster, it
seems foolish.

~~~
mohammedhdotio
Hello and thanks for your time.

\- I had one HDD failure on a RAID10 of 4 x 10TB HDDs, and I had to take the
system down to do the RAID10 rebuild, because keeping the server running would
have slowed the rebuild with its heavy I/O, and I feared a second HDD failure.
(I had to take the system down anyway.) RAID didn't help in my usage scenario.

\- I guess deduplication already fixes this.

\- Any advice on ways to fix this while preserving my RAID0 usage scenario?

~~~
oofabz
Broadly speaking, you have two options. One is to make the server faster so
that it can keep operating during a rebuild; you could do this by adding more
drives. The other is to add a second server to handle the load while you fix
the first.

------
yardie
This is a great article on the basics of RAID. Outside of consumer grade RAID
systems it's a bit out of date. There are some perceptible performance
differences between enterprise and consumer storage systems.

I manage a small (<100TB) storage system for our office. We have servers using
RAID10 SSDs and we have other servers using RAID10 10k SAS HDDs. The read
performance on the SATA SSDs is great but the writes are atrocious.

~~~
theevilsharpie
> The read performance on the SATA SSDs is great but the writes are atrocious.

If a 10K SAS HDD array is outperforming even a single SATA SSD, either your
I/O access pattern is extremely sequential, or there's something wrong with
how your storage is configured.

~~~
yardie
Our environment is almost entirely virtualized, so yes, extremely sequential.
Each host has 2 or more 10Gbps NICs. So starting, suspending, moving,
snapshotting and cloning a VM shows the limits in both. Our SAN is 4 years old
and has tiered storage (10k SAS mainline HDDs and high-write SAS SSDs). We use
cheaper SSDs in some instances. But there is a performance difference of
15%-25% between SAS and SATA and that can't be ignored.

------
ngcc_hk
Got my QNAP 8-drive unit to back up my Synology 8-drive unit. ZFS will come
soon, once those are in place. Hope that makes for a good mutual backup.

------
skyde
Would have loved to see RAID-Z in this test.

~~~
kondro
The conclusion says that it’s coming.

