
The Sorry State of Copy-On-Write File Systems - pjotrligthart
http://louwrentius.com/the-sorry-state-of-cow-file-systems.html
======
gnoway
Discussion from 6 months ago:

[https://news.ycombinator.com/item?id=9128404](https://news.ycombinator.com/item?id=9128404)

------
transfire
Someone should write an article on the sorry state of file systems in general.
ZFS and BTRFS are improvements, though still not quite there yet. But the
distance between user and storage still seems vast and primitive. Perhaps
Seagate's Kinetic drives are the future we need? ([http://www.seagate.com/tech-
insights/kinetic-vision-how-seag...](http://www.seagate.com/tech-
insights/kinetic-vision-how-seagate-new-developer-tools-meets-the-needs-of-
cloud-storage-platforms-master-ti/))

------
yellowapple
Translation: "ZFS and btrfs are 'incomplete' because their handling of striped
RAID is incomplete."

Disregarding the fact that things like mdadm still exist, and _further_
disregarding the fact that the vast majority of filesystems out there don't
bother with trying to implement RAID at all (probably because - again - there
are things like mdadm that do that already), RAID5/6 are generally a bad idea
compared to RAID1 or RAID10. Both RAID5 and RAID6 make several _very_
incorrect assumptions:

* Failure of multiple drives in a short period of time is rare (in reality, if one drive fails (excluding bathtub-curve-related infant mortality), the likelihood of subsequent drive failures increases significantly)

* The failed drive can be replaced and repopulated quickly (in reality, this is becoming less and less true as drives get bigger and bigger, thus taking more and more time to rebuild the failed array member; SSDs buy some time here, but that's not sustainable)

* Bit rot / "cosmic rays" are rare (in reality, silent data errors happen all the time, as has been demonstrated [0] as an example of why RAID5/6 is woefully insufficient for even the most basic protection against data corruption)
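The rebuild-window concern in the second and third points can be made concrete. As a rough sketch (assuming the commonly quoted spec figure of one unrecoverable read error per 10^14 bits for consumer drives, and treating errors as independent, which real drives may not match), the chance of hitting at least one URE while reading a whole drive during a rebuild grows quickly with capacity:

```python
import math

def p_ure_during_rebuild(capacity_tb, ure_rate_per_bit=1e-14):
    """Probability of at least one unrecoverable read error while
    reading an entire drive once (e.g. during a RAID5 rebuild)."""
    bits = capacity_tb * 1e12 * 8  # drive capacity in bits
    # P(no error) = (1 - rate)^bits; use the exp/log1p form for stability
    return 1.0 - math.exp(bits * math.log1p(-ure_rate_per_bit))

for tb in (2, 6, 12):
    print(f"{tb:>2} TB drive: {p_ure_during_rebuild(tb):.0%} chance of a URE per full read")
```

Under this naive model, a full read of a 12 TB drive has roughly a 60% chance of tripping a URE, which is exactly why single-parity rebuilds on large drives are considered risky.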

Basically, if you want something striping, and care _at all_ about data
integrity, go with RAID10. Only use RAID5/6 if you don't care about data loss
(whether because of a comprehensive backup policy or a comprehensive redundancy
policy at the machine level), though in that case you might as well just cut to
the chase and use RAID0.

I really wish this article would've dug into some of the _real_ shortcomings
of these filesystems; btrfs in particular would be incredibly useful compared
to a more traditional LVM approach if it supported file encryption and swap
subvolumes (for either of these things, LVM (+ LUKS for encryption) is
necessary with or without btrfs). Instead they get criticized for not
supporting things that a sane sysadmin wouldn't touch with a ten-foot pole in
this day and age.

[0]:
[http://www.miracleas.com/BAARF/Why_RAID5_is_bad_news.pdf](http://www.miracleas.com/BAARF/Why_RAID5_is_bad_news.pdf)

~~~
pedrocr
_> Basically, if you want something striping, and care at all about data
integrity, go with RAID10. Only use RAID5/6 if you don't care about data loss
(whether due to a comprehensive backup policy or a comprehensive redundancy
policy on a machine-level), though in that case you might as well just cut to
the chase and use RAID0._

Your points are valid, but this conclusion doesn't follow. RAID10 is strictly
worse than RAID6 for redundancy. In a four-drive array, any two drives can fail
under RAID6, whereas with RAID10 two drives may be enough to blow away half
your data. RAID10 is really a lower-redundancy, higher-performance tradeoff
compared to RAID6.
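The four-disk case is small enough to check exhaustively. A quick sketch (the layouts are hypothetical: RAID6 modeled as "any two disks can fail", RAID10 as two 2-disk mirrors striped together) enumerating every possible pair of failed disks:

```python
from itertools import combinations

DISKS = range(4)
MIRRORS = [{0, 1}, {2, 3}]  # hypothetical RAID10 layout: two 2-disk mirrors

def raid6_survives(failed):
    return len(failed) <= 2  # RAID6: any two disks may fail

def raid10_survives(failed):
    # the array dies as soon as every disk in some mirror has failed
    return not any(m <= failed for m in MIRRORS)

pairs = [set(c) for c in combinations(DISKS, 2)]
r6 = sum(raid6_survives(f) for f in pairs)
r10 = sum(raid10_survives(f) for f in pairs)
print(f"RAID6 survives {r6}/{len(pairs)} two-disk failures")   # 6/6
print(f"RAID10 survives {r10}/{len(pairs)} two-disk failures")  # 4/6
```

RAID6 survives all 6 possible two-disk failures; the two-mirror RAID10 dies in the 2 cases where both halves of one mirror fail.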

~~~
yellowapple
I should've clarified that RAID1 is the _absolute_ best if data integrity is
your sole concern. Regardless...

> RAID10 is strictly worse than RAID6 for redundancy.

Only if you use two drives per mirror; RAID10's worst-case matches RAID6's
when using three disks per mirror.

RAID6 also lacks a "best case" that's distinct from its worst case. You lose
three disks, you're hosed, period. You lose three disks in a RAID10 (even one
with two disks per mirror), it's much more probable that your data will be
intact. You can mitigate this further by using a bigger array (RAID10 total
failure probability is affected by array size, unlike RAID6) or by using
different vendors for each disk in the mirrors (example: each mirror has one
Intel and one Samsung SSD of the same capacity) - which, while having some
performance and capacity implications, actually mitigates the "oh no I bought
a bad batch of hard drives and now my RAID is kaput" failure case (unlike with
RAID6, where mixing/matching vendors won't help you unless you mix/match them
for every disk).

There are also some recoverability benefits of RAID10 v. RAID6; notably,
recovering from a fault in RAID6 requires recomputing parity, while no such
thing is necessary for RAID10. This mitigates the "oh no my array took too
long to fix itself and now my RAID is kaput" failure case.
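The claim that RAID10's survival odds improve with array size can be checked by brute force. A small sketch (assuming two-disk mirrors and equally likely failure sets, which ignores the correlated-failure effects discussed above) counting how many 3-disk failure combinations a striped-mirror array survives:

```python
from itertools import combinations
from fractions import Fraction

def raid10_survival(n_mirrors, failures=3, mirror_width=2):
    """Fraction of equally likely failure sets a striped-mirror array survives."""
    disks = range(n_mirrors * mirror_width)
    mirrors = [set(disks[i * mirror_width:(i + 1) * mirror_width])
               for i in range(n_mirrors)]
    combos = [set(c) for c in combinations(disks, failures)]
    # the array survives unless some mirror has lost every one of its disks
    ok = sum(not any(m <= f for m in mirrors) for f in combos)
    return Fraction(ok, len(combos))

for n in (3, 4, 8):
    print(f"{n} mirror pairs: survives {raid10_survival(n)} of 3-disk failure sets")
```

With 3 mirror pairs, only 2/5 of the possible 3-disk failure sets are survivable; with 8 pairs it rises to 4/5, so a bigger RAID10 is indeed less likely to lose a whole mirror to the same batch of failures.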

~~~
pedrocr
_> Only if you use two drives per mirror; RAID10's worst-case matches RAID6's
when using three disks per mirror._

That's not a proper comparison as you are now using more disks in the RAID10
array than the RAID6 to get the same space. If you're willing to have 6 disks
to get the space of 2 you could use an array with 3 parity disks (what ZFS
calls Z3) and now you can fail any 3 disks just like RAID10 and you've only
used 5 disks instead of 6. If you have the option for 4 parity disks (I don't
think ZFS provides a Z4 level) you could fail any 4 drives of 6 and get the
space of 2, so 1 better than RAID10. N-Parity setups are strictly more
redundant than RAID10 for the same number of disks and required capacity.

 _> actually mitigates the "oh no I bought a bad batch of hard drives and now
my RAID is kaput" failure case (unlike with RAID6, where mixing/matching
vendors won't help you unless you mix/match them for every disk)._

This makes no sense to me. I mix and match vendors in RAID6 arrays just fine,
and it lowers the correlation of failures between disks just as well. In fact,
parity RAID is again strictly better at this. Since the redundancy is spread
across the whole array instead of split into clusters, whenever you add more
drives from different manufacturers you're adding that redundancy to the whole
array. With RAID10, if you have 3 drives per mirror, your maximum redundancy
before losing data is 3 manufacturers.

 _> There are also some recoverability benefits of RAID10 v. RAID6; notably,
recovering from a fault in RAID6 requires recomputing parity, while no such
thing is necessary for RAID10. This mitigates the "oh no my array took too
long to fix itself and now my RAID is kaput" failure case._

Yeah, performance is the real advantage of RAID10, as I mentioned, and that
performance can matter during rebuilds, so it does have some durability
implications.

~~~
yellowapple
> That's not a proper comparison as you are now using more disks in the RAID10
> array than the RAID6 to get the same space.

You already have to use more disks for a RAID10 than in a RAID6, so I don't
see why this is that big of a deal. Anything involving mirroring will involve
at least a 50% cut in capacity (or more, such as with a 3-way or even 4-way
mirror). RAID1 and RAID10 don't prioritize capacity; they prioritize
robustness.

> N-Parity setups are strictly more redundant than RAID10 for the same number
> of disks and required capacity.

But _not_ more robust against failure, as I've already described (hopefully)
rather clearly. If we want to play this game, I can keep adding more disks per
mirror, and now my worst-case will match that of your n-parity array while
having a _vastly_ better best-case (and, additionally, dragging the
probability of that worst-case happening closer and closer to zero). Remember:
if you're a RAID1 or RAID10 user who cares about disk capacity, you're doing
something wrong.

> I mix and match vendors in RAID6 arrays just fine and it lowers the
> correlation of failures between disks just as well.

Do you mix and match all disks, though? Because if you have at least three (in
the case of RAID6) disks, no matter where they are in the array, that have the
same failure curve (i.e. are the same model/manufacturer), then you're gonna
have a bad time when the first one of them dies.

That's my point. In order to get the same inherent benefit as RAID10 here,
you'd have to have a different vendor/model for every single drive in your
RAID6 array.

> whenever you add more drives from different manufacturers you're adding that
> redundancy to the whole array.

Only if they're all different. You'll eventually get to the point where there
aren't enough manufacturers in the world (or you'll be compromising on
manufacturer diversity - which is, granted, what _most_ people do).

> With RAID10 if you have 3 drives per mirror your maximum redundancy before
> losing data is 3 manufacturers.

And that's fine, because if one of those manufacturers sold me a defective
batch, the problem's isolated to that side of the mirror. That's my point;
with RAID10, it's no longer a game of sheer numbers of different models and
manufacturers, but instead the more manageable strategy of giving each side of
the mirror a different bathtub curve to ride, thus mitigating the chance of
multiple hard drive batches failing at the same time.

~~~
pedrocr
_> You already have to use more disks for a RAID10 than in a RAID6, so I don't
see why this is that big of a deal._

Not really. RAID6 and RAID10 are only directly comparable at 4 disks as they
have the same capacity with the same number of disks. And in that case RAID6
is strictly better than RAID10 (any 2 disks can fail vs at best 2 disks can
fail).

 _> But not more robust against failure, as I've already described (hopefully)
rather clearly. If we want to play this game, I can keep adding more disks per
mirror, and now my worst-case will match that of your n-parity array while
having a vastly better best-case_

This is simply not true. An n-parity array will always be strictly (and I mean
strictly) more redundant than the equivalent RAID10 array. For 6 disks this
means using 4 parity disks, which allows failing any 4 disks: RAID10's best
case, and better than its worst case (any 2 disks can fail). If you go to 8
disks, RAID10 allows a best case where 6 disks can fail, which with a 6-parity
RAID is your normal case, and so on. Parity is strictly better than mirroring
at redundancy; it's just usually not worth it for performance reasons. At the
file level it may make sense, though (Backblaze does file-level 3-parity, for
example), as you can choose to take the hit only on certain files. Btrfs also
allows per-file RAID levels, so it's too bad they're not pursuing the external
patch for n-parity RAID levels that the original article mentions.

(skipped some parts as the core of the issue is below)

 _> And that's fine, because if one of those manufacturers sold me a defective
batch, the problem's isolated to that side of the mirror. That's my point;
with RAID10, it's no longer a game of sheer numbers of different models and
manufacturers, but instead the more manageable strategy of giving each side of
the mirror a different bathtub curve to ride, thus mitigating the chance of
multiple hard drive batches failing at the same time._

This is not a RAID10 advantage it's a way to mitigate the RAID10 disadvantage
compared to parity. Parity RAID with the same capacity/number of disks can
survive more disk failures than RAID10 (as I explained above). In both cases
you want the pool of N disks to be as diversified as possible to avoid
correlation between failures. Parity exploits that lack of correlation
completely (any disk failure is just like any other, it doesn't matter which
disk is which) whereas RAID10 will blow up earlier if all disks in one side
fail at once so you mitigate that by making the two sides different.

~~~
yellowapple
> RAID6 and RAID10 are only directly comparable at 4 disks

They're really not _ever_ directly comparable, because RAID10 isn't a hard-
and-fast RAID level like RAID6, but rather the combination of two (really, one
plus a quasi-level, but whatever). RAID10 encompasses any situation where
mirrors are striped together.

> For 6 disks this means using 4 parity disks which allows failing any 4 disks
> which is RAID10's best case but better than it's worst case (any 2 disks can
> fail).

You're getting caught up on redundancy for a given total number of disks, and
in the process missing the point entirely. If we _really_ want to play that
game, then I'll just roll a RAID1 (a.k.a. a RAID10 with only one set of
mirrors) and call it a day, because with an n-disk array, a RAID1 will _always_
be at least as redundant as the equivalent parity-based RAID (really, it'll
always be able to survive one more disk failure than an n-1 parity-block
array; you can see this rather easily when comparing RAID5 with a three-way
RAID1).

From there, my point should be more clear. Just like how you can constantly
add more and more parity blocks (and disks to handle them), I can constantly
widen the mirror. Eventually we'll both get to the point where we realize
"what the hell are we doing; there's 100 copies of every block; we should
start adding some non-redundant disks", at which point RAID10's
advantages become more clear, since each mirror group added to the striped
whole instantly bumps up capacity _and_ the best-case without compromising the
worst-case. Meanwhile, you _do_ have the advantage of having more flexibility
in the quantity of disks added, but doing so doesn't help your best-case
unless you feel like resuming the whole "let's make every disk a parity or
mirror disk" arms race.

And before you mention it: yes, I know RAID1 isn't _traditionally_ counted
under RAID10, but a RAID10 with only one mirror group is possible to configure
using tools like `mdadm`, and while this wouldn't _look_ different from a
RAID1 at first glance, it would certainly be different as soon as more mirror
groups are added.

> Parity is strictly better than mirroring at redundancy

I think this remark is the core of your misunderstanding here. Are you really
trying to claim that an n-sized parity array with n-1 parity blocks is
_better_ than a RAID1? You might want to clarify your position there if that's
not what you meant to suggest :)

> Parity exploits that lack of correlation completely (any disk failure is
> just like any other, it doesn't matter which disk is which)

Which is exactly why you're incorrect. Since any disk failure is like any
other, every single disk has to be unique in order for a parity-based array to
benefit from any diversity. With mirroring, you don't have to go nearly as far
for that same benefit; you just need _each mirror_ to have a different bathtub
curve. That's my point.

\--

This has been a fun discussion, and there is certainly lots of room for debate
on the merits of mirroring v. parity, but I'm starting to think that we should
just agree to disagree on this one.

~~~
pedrocr
_> You're getting caught up on redundancy for a given total number of disks,
and in the process missing the point entirely._

That is the only point. Let me state it like this: you have N disks from which
you want to get M capacity (M < N); what's the RAID configuration that
maximizes redundancy? My assertion is that parity RAID is strictly better than
RAID10 at doing that, and you haven't really refuted it.

 _> From there, my point should be more clear. Just like how you can
constantly add more and more parity blocks (and disks to handle them), I can
constantly widen the mirror. _

Of course you can, and for any given number of disks you'll always be behind
on redundancy. You can of course just throw more hardware at the problem and
if your choices are between RAID6 (2-parity) and RAID10 then RAID10 eventually
wins because you need more parity disks than RAID6 provides to take advantage
of the extra hardware. Since most block RAID implementations only do 2 or at
most 3-parity, RAID10 ends up being the only practical solution for lots of
disks. But for a 4 or 5 disk setup (most NAS applications for example) RAID6
or ZFS 3-parity support is a better choice if the performance tradeoff is
workable (as it is in most NAS).
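This fixed-budget comparison can be sketched numerically (a hypothetical helper; it assumes the mirror width divides the disk count evenly and counts only guaranteed, worst-case survivable failures, not RAID10's luckier cases):

```python
def guaranteed_failures(n_disks, capacity_disks):
    """Worst-case disk losses each layout is guaranteed to survive,
    given n_disks physical disks yielding capacity_disks of usable space."""
    assert n_disks % capacity_disks == 0, "assume mirrors divide evenly"
    parity = n_disks - capacity_disks         # n-parity: any (N - M) can fail
    mirror_width = n_disks // capacity_disks  # RAID10: stripes of w-way mirrors
    raid10 = mirror_width - 1                 # dies once any one mirror is gone
    return parity, raid10

for n, m in [(4, 2), (6, 2), (8, 2)]:
    p, r = guaranteed_failures(n, m)
    print(f"{n} disks / capacity of {m}: parity guarantees {p}, RAID10 guarantees {r}")
```

For 6 disks at the capacity of 2, n-parity guarantees surviving any 4 failures while 3-way striped mirrors guarantee only 2; RAID10's better-than-worst cases are what the two sides of this thread weigh differently.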

 _> because with a n-disk array, a RAID1 will always be at least as redundant
as the equivalent parity-based RAID _

This is a red herring. Yes, there is the trivial bound that no configuration of
N disks can survive the loss of all N disks, and if you only want the space of
a single disk, a RAID1 of N disks is the obvious choice.

 _(really, it'll always be able to survive one more disk failure than an n-1
parity-block array; you can see this rather easily when comparing RAID5 with a
three-way RAID1)._

This is again wrong. A 3-disk RAID1 has the size of 1 disk and a 3-disk RAID5
has the size of 2 disks, so they're not the same. To make them comparable you
need to make the array RAID6, at which point you have the same space and
redundancy. You'd still do a RAID1, though, as for the 1-disk-size case there
is no added benefit to parity.

 _> RAID10's advantages become more clear, since each mirror group added to
the striped whole instantly bumps up capacity and the best-case without
compromising the worst-case. Meanwhile, you do have the advantage of having
more flexibility in the quantity of disks added, but doing so doesn't help
your best-case unless you feel like resuming the whole "let's make every disk
a parity or mirror disk" arms race._

This is missing the point entirely. The parity normal case is the same as the
RAID10 best case and better than the worst case. There's no real way around
that.

 _> Which is exactly why you're incorrect. Since any disk failure is like any
other, every single disk has to be unique in order for a parity-based array to
benefit from any diversity_

This is just not true. Having diversity means that failures become
uncorrelated. Parity RAID exploits that even better than RAID10 does. The
property you keep bringing up basically says "since RAID10 has worse redundancy
than n-parity, it benefits more from uncorrelated disk failures," which is
true, but only by partially mitigating a problem that parity RAID doesn't have.

 _> This has been a fun discussion, and there is certainly lots of room for
debate on the merits of mirroring v. parity, but I'm starting to think that we
should just agree to disagree on this one._

I think that by now I've explained my point clearly enough and even repeated
myself a bit so yeah, I'll drop it here.

~~~
yellowapple
Sorry, I promised I'd let us agree to disagree, and I will after this, but I
just can't help myself on this one point:

> This is again wrong. A 3 disk RAID1 has the size of 1 disk and a 3 disk
> RAID5 has the size of 2 disks so they're not the same. To make it comparable
> you need to make the array RAID6 at which point you have the same space and
> redundancy. You'd still do a RAID1 though as for the 1 disk size case there
> is no added benefit of parity.

Pardon, but _what_?

Think about this for a second.

For one, you're now moving the goalposts by shifting to disk capacity alone
(which, as I've mentioned repeatedly, is a non-factor if you're considering
anything mirror-based). _With the same number of disks_, mirroring always
beats parity in a RAID. There's no getting around that.

Don't believe me? Take those four disks (you need four at minimum) in your
RAID6. Your array can survive two disk failures. Now I build a RAID1 with four
disks. Mine can survive three disk failures. Yes, there's a capacity hit, but
it's been _repeatedly_ established that it's not _capacity_ that matters.

This still holds true when comparing a three-disk RAID5 and a three-disk
RAID1. This still holds true when comparing _any_ n-disk n-1-parity-block
array with _any_ n-disk RAID1 with the same value of n.
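The same-disk-count comparison above can be tabulated directly (a trivial sketch; capacity is in units of one disk, and failures are the guaranteed worst case for each layout):

```python
def mirror_vs_parity(n):
    """For n disks: (layout, usable capacity in disks, guaranteed survivable failures)."""
    return [("n-way RAID1 mirror", 1,     n - 1),
            ("RAID6 (2 parity)",   n - 2, 2)]

for name, cap, surv in mirror_vs_parity(4):
    print(f"{name}: capacity {cap} disk(s), survives any {surv} failure(s)")
```

With four disks, the mirror survives any three failures but stores one disk's worth of data, while RAID6 stores two disks' worth but survives only two; that capacity-vs-robustness tradeoff is exactly what the two commenters keep weighing differently.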

Your points are much more valid when they're targeted at RAID10 instead of the
concept of mirroring in general.

~~~
pedrocr
I've explained all this so there's nothing new here at all. I'll try again if
it helps.

 _> Don't believe me? Take those four disks (you need four at minimum) in your
RAID6. Your array can survive two disk failures. Now I build a RAID1 with four
disks. Mine can survive three disk failures._

Apples and oranges again...

 _> Yes, there's a capacity hit, but it's been repeatedly established that
it's not capacity that matters._

This is not established. It's something you keep saying, but it makes no sense.
My NAS has 4 2TB disks; I've configured it as RAID6 for 4TB of space where any
two disks can fail. RAID10 is strictly inferior to that, and RAID1 would only
give me 2TB of space, which is not enough for me. This result generalizes.
For a given number of disks and a given target array usable size the n-parity
RAID configuration always has superior redundancy to the RAID10 configuration,
there's no way around that.

 _> Your points are much more valid when they're targeted at RAID10 instead of
the concept of mirroring in general._

My points have always been about RAID10 and not RAID1. If RAID1 is enough
that's always the better option as obviously it's not possible to improve on
"you have N disks and N-1 of them can fail".

Note that's also the reason RAID5 implementations require a 3-disk minimum and
RAID6 a 4-disk minimum. In theory you could configure a RAID5 with 2 disks
(1 + 1 parity) and a RAID6 with 3 (1 + 2 parity), but since it doesn't make
sense to calculate parity over a single disk, that just ends up meaning RAID1.

------
oconnore
People who use filesystems like this seem to completely misunderstand what
RAID is for. They seem to care about their data somewhat, which of course
means they have off-site backups.

Given that they have reliable backups, redundancy is only useful to the extent
that it improves uptime. Uptime is best maintained by eliminating single
points of failure. RAID is a good first step, but at some point, creating a
massive disk array on a single motherboard/CPU/disk controller is anything but
that. And they don't seem to be complaining about downtime.

Given that they have already achieved data safety, and don't care about
uptime, the only reasonable explanation I can surmise is that they're
confused.

~~~
gnoway
In your opinion, what is the best way to address the use case of spreading I/O
across multiple spindles for non-uptime-related performance reasons?

~~~
toomuchtodo
Who even cares about spindles anymore?

Need fast? SSD on the PCI bus. Need lots of storage? Spinning disk where you
care not about performance.

~~~
simoncion
> Who even cares about spindles anymore?

Folks who have a _shitload_ of data to store, but want more perf than a single
drive can give them?

------
transfire
Wonder if anyone has ever thought about building a SQL server directly into a
hard drive?

~~~
zokier
I think a more interesting question is whether anyone has built a practical
userland on top of SQL (or any reasonably rich database, for that matter).

Here is some discussion about SQL on raw devices:
[http://dba.stackexchange.com/questions/80036/is-there-a-
way-...](http://dba.stackexchange.com/questions/80036/is-there-a-way-to-store-
a-postgresql-database-directly-on-a-block-device-not-fi)

~~~
protomyth
The Newton skipped SQL but had an object database with queries.

------
jsprogrammer
Isn't COW a fundamentally hard problem? How does one expect a complete,
comprehensive solution?

------
Vegemeister
s/tripple/triple/g

