
Unbalanced reads from SSDs in software RAID mirrors in Linux - zdw
https://utcc.utoronto.ca/~cks/space/blog/linux/UnbalancedSSDMirrorReads
======
rkeene2
I tried to get a patch in to do round-robin reads in Linux RAID1 a few years
ago because of this, but the demand for extensive benchmarks to show it was
actually useful was great enough that I abandoned getting it upstream.

[http://www.rkeene.org/viewer/tmp/linux-2.6.35.4-2rrrr.diff.h...](http://www.rkeene.org/viewer/tmp/linux-2.6.35.4-2rrrr.diff.htm)

~~~
z3t4
My guess is that using the same disk performs better than round robin ... the
disk could, for example, already have the data in cache, while the other disk
does not. When it comes to optimization, the results are often unexpected. You
would have to do a lot of benchmarks on different disks, loads, use cases,
etc. to draw a conclusion, and you might find outcomes you did not anticipate,
like both disks bricking themselves at the same time because they have almost
the same read/write count.

~~~
digikata
One could do delayed round-robin. For one day (or some other time period), the
first-choice disk is disk A; the next time period, disk B, and so on. For
intentional non-uniformity, you can vary the time period: one day for disk A,
half a day for disk B, etc.
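
A rough sketch of the idea, assuming a fixed table of per-disk windows
(hypothetical, not actual md code):

    /* Hypothetical sketch: pick the preferred mirror for reads based on a
       rotating time window, with different window lengths per disk so the
       wear stays intentionally non-uniform. Not actual md code. */
    #include <time.h>

    #define NUM_MIRRORS 2

    /* window[i] = how long (seconds) mirror i stays the preferred target */
    static const time_t window[NUM_MIRRORS] = { 86400, 43200 }; /* A: 1 day, B: half */

    int preferred_mirror(time_t now)
    {
        time_t cycle = 0, offset;
        int i;

        for (i = 0; i < NUM_MIRRORS; i++)
            cycle += window[i];

        offset = now % cycle;
        for (i = 0; i < NUM_MIRRORS; i++) {
            if (offset < window[i])
                return i;
            offset -= window[i];
        }
        return 0; /* not reached */
    }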

~~~
kpil
What is the gain?

~~~
digikata
Possibly pushing the likelihood of first failure out, preferably beyond
capital lifetime assumptions? Possibly increasing system performance over the
lifetime of the SSDs.

But, it's possible that this solution is not particularly useful depending on
what mixes of equipment and workloads one expects to operate.

------
c0l0
Pro tip: use RAID10 instead of RAID1 with Linux md to have the same level of
redundancy for arrays with only two legs. RAID10 will properly stripe/balance
reads across them, effectively doubling linear read throughput. The only real
downside is that you will have to find and actively choose a sensible chunk
size that fits both your hardware and workload.

~~~
ryan-c
I think you have to specify layout=far2 with a two-drive RAID10 on Linux if
you want that.
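
For example, something like this should do it (the device names and chunk
size here are just placeholders to adjust for your hardware; f2 is the "far"
layout with 2 copies):

    mdadm --create /dev/md0 --level=10 --layout=f2 --raid-devices=2 \
          --chunk=512 /dev/sdX1 /dev/sdY1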

------
barrkel
I wonder if there are cache effects involved as well - continuously using the
same SSD most of the time may reuse cached metadata relevant to the SSD
innards that the kernel FS cache can't help with.

~~~
Avernar
The FS cache sits above the raid driver.

~~~
barrkel
Well, it could hardly be below, could it. Was there an insight you wanted to
mention?

~~~
Avernar
Let me clarify. All the VFS caches are above the md driver. The cache we care
about, the buffer cache, is not below the md driver.

You'd need some kind of cache below the md driver to cache "metadata relevant
to the SSD innards". Since there isn't one, there can't be a caching penalty
for switching between SSDs in a mirror. That answers your question.

But since you want an insight, here's one for you. There is no "metadata
relevant to the SSD innards" available; the md drivers do not have this
information. I wish they did. When I was still playing with the md code I
would have loved to know the block and page sizes of the drives. But that
information is two bytes, not really something that would need a cache.

~~~
barrkel
Let me clarify what I was saying: I was redundantly (for clarity) talking
about cache on the SSD, trying to make it extra clear that I wasn't talking
about any kernel cache.

I'm trying to figure out how, exactly, you've read my attempt at excessive
clarity; it's not clear to me yet what you misread, since it seems like you're
trying to correct some misunderstanding, but it hasn't worked because you
haven't said anything that isn't already blindingly obvious.

 _There is no "metadata relevant to the SSD innards" available. The md drivers
do not have this information_

This is not an insight.

The cache I'm interested in is tied up in how the SSD presents as a block
device, but isn't implemented as a block device, as block devices are normally
understood (wear levelling and remapping, hidden parallelism / striping,
etc.). Yes, the FS cache can't help with this. Yes, the RAID implementation
can't help with this. Of course. It's internal to the SSD. Its innards. I
don't see what information you added.

~~~
Avernar
I may have misread your original post, but your hostility isn't helping to
clarify things. Instead of replying with "I was talking about the SSD cache",
you replied with "Well, it could hardly be below, could it. Was there an
insight you wanted to mention?" So instead of clarifying things you decided to
be rude.

Now that I know what you're talking about, let's try to answer your question.
There are SSD controllers that don't use a RAM cache on the drive at all; it's
not necessary for performance and, where present, it doesn't do as much as it
does on an HDD. The main benefit of a drive cache while reading is read-ahead,
which an SSD doesn't need because it doesn't have to wait for the sectors to
show up under a head.

The only thing that matters here is reading file system blocks from the same
SSD page/block, and the md code already handles that: if the next block
requested follows the last blocks read, it will use the same SSD.

If the block is from somewhere else, it doesn't matter which SSD it comes
from; the access time will be the same.
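
Simplified, the heuristic looks something like this (a sketch of the
behaviour described above, not the actual drivers/md/raid1.c code):

    /* Sketch of the sequential-read heuristic: stay on the mirror whose
       last read ended exactly where this request begins, otherwise any
       mirror is as good as any other. Not actual raid1.c code. */
    #define NUM_MIRRORS 2

    struct mirror {
        unsigned long long head_position; /* sector right after last read */
    };

    int choose_read_mirror(struct mirror *m, unsigned long long sector)
    {
        int i;

        for (i = 0; i < NUM_MIRRORS; i++) {
            if (m[i].head_position == sector)
                return i; /* sequential continuation: stay on this disk */
        }
        return (int)(sector % NUM_MIRRORS); /* non-sequential: either disk */
    }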

------
wolf550e
AFAIK, there is no need to do "wear leveling" for reads; SSDs have effectively
unlimited read endurance, so the first SSD seeing ten times as many reads will
not fail sooner.

~~~
rcthompson
Even if reads did wear out an SSD, wouldn't it be better to wear them out one
at a time rather than wear them all out equally, in which case they all fail
in rapid succession once they reach the end of their life cycles?

~~~
xorgar831
I would say no; you'd want the whole disk to fail at that point so you can
just replace it, vs. it getting smaller over time, which would make capacity
planning pretty complex.

~~~
rcthompson
We're talking about RAID1 here, so multiple SSDs. I'm saying you don't want
multiple disks to fail in rapid succession, which is what you'd maybe get if
you exactly equalized the wear on them.

~~~
xorgar831
That makes sense, and you actually do see RAIDs fail like that in the real
world sometimes even on non-SSDs.

------
Avernar
Back when I still ran Linux on my personal servers I noticed this. I was
adding TRIM support to the md drivers before the maintainers finally got
around to it. That's when I noticed the balancing code tended to favour one
disk.

I added round-robin code and experimented with splitting large sequential
requests between idle disks.
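
The splitting part, very roughly, looked something like this (a simplified
sketch from memory, not the actual experimental code):

    /* Simplified sketch of splitting one large sequential read across two
       idle mirrors so both disks stream part of the data in parallel.
       Illustrative only. */
    #define NUM_MIRRORS 2

    struct read_chunk {
        int mirror;                    /* which disk serves this piece */
        unsigned long long start;      /* first sector of the piece */
        unsigned long long nr_sectors; /* length of the piece in sectors */
    };

    void split_read(unsigned long long start, unsigned long long nr_sectors,
                    struct read_chunk out[NUM_MIRRORS])
    {
        unsigned long long share = nr_sectors / NUM_MIRRORS;
        int i;

        for (i = 0; i < NUM_MIRRORS; i++) {
            out[i].mirror = i;
            out[i].start = start + i * share;
            out[i].nr_sectors = (i == NUM_MIRRORS - 1)
                                    ? nr_sectors - i * share
                                    : share;
        }
    }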

------
smegel
Does it matter? Do reads affect the lifetime of SSDs? It can't be affecting
performance: if the drives _were_ busy, the reads would be balanced much more
evenly.

------
bhouston
Balancing reads may lead to improved performance if an individual SSD is near
its IOPS or data-rate limits.

~~~
pantalaimon
But if this were the case the disk would not be idle, right?

