
Btrfs RAID 5/6 Code Found to Be Very Unsafe and Will Likely Require a Rewrite - pantalaimon
http://phoronix.com/scan.php?page=news_item&px=Btrfs-RAID-56-Is-Bad
======
gbrown_
Bryan Cantrill expresses this[1] in a manner I certainly agree with. Software
that is designed and intended to be reliable should not go through long
periods of instability only to have that instability written off as
"prepubescence".

[1][http://www.youtube.com/watch?v=79fvDDPaIoY&t=18m28s](http://www.youtube.com/watch?v=79fvDDPaIoY&t=18m28s)

~~~
cpeterso
Bryan Cantrill is such a great storyteller. "My therapist says I need to talk
about the firmware revision numbers, just to let it out so it doesn't corrode
me." He then shares a story about a firmware bug where the polarity of the
disk read/write head would be reversed. :)

~~~
rdtsc
Completely agree.

I often search for videos of him and just listen to him present a topic or
talk about anything. I've learned about things I normally wouldn't go and read
about otherwise -- the history of Sun, ZFS, DTrace, containerization (Joyent,
the zones legacy, KVM, comparisons with Docker).

In that respect he is like David Beazley (of Python fame). I could just listen
to David talk all day and not get tired of it.

------
Philipp__
It is amazing how stubborn Linux people are. They constantly reject superior
engineering done outside their realm. (Yeah, I'm pointing at ZFS, but that's
just one of many.)

~~~
X86BSD
This is Linux's legacy: rejecting already-solved problems only to try to
reinvent the wheel their own way, and often their wheel is broken and horribly
engineered, if engineered at all. See btrfs vs. ZFS, kqueue vs. their current
poll-of-the-month method, DTrace vs. SystemTap; the list goes on for a mile.

It's so frustrating to watch this: to see all these projects being developed
on Linux where it's easier for others to simply reimplement the idea from
scratch than to try to remove all the Linux-only code from them.

The Linux world only thinks about itself. In a spectacularly selfish way.

~~~
emmelaich
Linux's 'legacy' is a policy of accepting code before it is known to be
perfect. I don't know anyone who runs BTRFS, and I certainly would not do so
myself. Getting accepted into the kernel tree is not an imprimatur of some
sort.

I'd much rather this than the *BSD policies.

Oh, and speaking of FreeBSD, it's hardly bug-free. Look at these Golang
issues and tell me they don't indicate something wrong in the FreeBSD kernel
threads/fork/exec code:

1. [https://golang.org/issue/16136](https://golang.org/issue/16136)
2. [https://golang.org/issue/15658](https://golang.org/issue/15658)
3. [https://golang.org/issue/16396](https://golang.org/issue/16396)

~~~
X86BSD
I'm not sure it indicates anything. They are all incredibly vague and no one
seems to have any idea what's going on.

------
teraflop
I've never really understood the rationale behind integrating RAID into the
filesystem. It seems like a giant layering violation, and unnecessary given
that mdadm already exists and has many years of testing behind it.

The btrfs FAQ [1] says that "unlike MD-RAID, btrfs knows what blocks are
actually used by data/metadata, and can use that information in a
rebuild/recovery situation", but is that really a good enough reason to
reimplement the entire RAID subsystem from scratch? And couldn't TRIM/discard
provide the same benefits?

For me, the clincher is that mdadm has a very stable, well-defined on-disk
format, which is a huge bonus if you hit a bug or make a mistake. I once
almost lost a personal RAID10 array by recklessly trying to add an extra drive
with no backup, and without fully understanding what I was doing. I was able
to recover all of my data by hacking together a Python script to reassemble
all of the blocks in the correct order. I can't imagine how much effort that
would have taken if I had to build something that understood the full details
of btrfs's data structures.

[1]:
[https://btrfs.wiki.kernel.org/index.php/FAQ#Case_study:_btrf...](https://btrfs.wiki.kernel.org/index.php/FAQ#Case_study:_btrfs-raid_5.2F6_versus_MD-RAID_5.2F6)
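
For the curious, here's a minimal sketch of that kind of reassembly script,
assuming mdadm RAID10 in the default near=2 layout with four members and a
512 KiB chunk size, and ignoring the mdadm superblock/data offset. Device
names and constants are made up; this is not the original script:

    CHUNK = 512 * 1024  # assumed chunk size; check mdadm --examine
    members = ["/dev/sdb1", "/dev/sdc1", "/dev/sdd1", "/dev/sde1"]

    def read_chunk(path, offset):
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(CHUNK)

    # In the near=2 layout, stripe row r holds logical chunk 2r mirrored on
    # members 0/1 and logical chunk 2r+1 mirrored on members 2/3.
    with open("recovered.img", "wb") as out:
        row = 0
        while True:
            a = read_chunk(members[0], row * CHUNK)  # copy also on members[1]
            b = read_chunk(members[2], row * CHUNK)  # copy also on members[3]
            if not a and not b:
                break  # past the end of all members
            out.write(a + b)
            row += 1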

~~~
snuxoll
Integrating RAID into the filesystem solves a lot of problems, one big one
being data integrity. A dumb RAID layer that is unaware of the data on disk
just replicates or shards it between disks. When you request that data back,
it has no way to verify that what it returns is correct; if one replica is
bad but the other is good, your filesystem has no way to correct for it,
since it just sees one big block device.
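
To make that concrete, here's a minimal sketch (illustrative names, not any
real filesystem's code) of how a checksum-aware layer can pick the good
replica and heal the bad one:

    import hashlib

    def read_with_self_heal(replicas, expected_csum):
        # The filesystem knows the checksum the data *should* have, so it
        # can identify the good copy...
        good = next(r for r in replicas
                    if hashlib.sha256(r).digest() == expected_csum)
        # ...and rewrite any corrupt copies from it (self-healing).
        for i, r in enumerate(replicas):
            if r != good:
                replicas[i] = good
        return good

A mirror that only sees opaque blocks has no expected_csum to consult; it
just returns whichever copy it happens to read first.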

Moving storage-pooling functionality into the filesystem is the right call
for anything with server workloads; ZFS's self-healing functionality wouldn't
work without it (and this is precisely why I use FreeNAS instead of Unraid or
other alternatives: I care about the long-term integrity of my data,
especially my VMware images and photo collection).

~~~
teraflop
Enforcing data integrity is great, but I don't see why it has to happen at the
filesystem level. There are any number of ways you could solve it while still
keeping the RAID functionality at the level of unstructured block devices. I
can think of three just off the top of my head:

#1: Do checksumming at the block level. Within each block, the RAID driver
reserves a few bytes at the end for a checksum, and verifies them before
returning data to the FS. (This would only work if the filesystem supports
non-power-of-two block sizes.)

#2: Similar to #1, but pack the checksums for multiple data blocks into a
separate dedicated checksum block. (This adds some extra read latency in the
worst case, but caching could mitigate it.)

#3: Let the filesystem handle checksumming, but extend the block device API to
provide feedback to the kernel if the checksum is invalid. I think you only
really need two API calls -- one which says "read this block as fast as
possible and I'll verify it", and one which says "the block you returned looks
bad, try to reconstruct it and give me all available candidates".
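
As a rough sketch of what #3's two calls could look like (class and method
names are hypothetical, not a real kernel API; mirrors are any objects with a
read(lba, nbytes) method):

    class ChecksumAwareBlockDevice:
        def __init__(self, mirrors):
            self.mirrors = mirrors  # underlying devices holding replicas

        def read_fast(self, lba, nbytes):
            # "Read this block as fast as possible and I'll verify it":
            # no verification here, just the cheapest copy.
            return self.mirrors[0].read(lba, nbytes)

        def read_candidates(self, lba, nbytes):
            # "The block you returned looks bad, give me all available
            # candidates": the filesystem then keeps whichever candidate
            # matches its checksum.
            return [m.read(lba, nbytes) for m in self.mirrors]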

~~~
dap
> Do checksumming at the block level.

This doesn't handle phantom or misdirected writes. You can't really do that
unless you keep the checksums elsewhere, and specifically in the place where
you intend to reference the data.
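
A toy illustration of why (sha256 standing in for whatever checksum the
filesystem uses; purely a sketch): a phantom write leaves the old block on
disk, and the old block still matches its own embedded checksum, so a
block-local scheme passes. Only a checksum stored with the reference catches
it:

    import hashlib

    sha = lambda b: hashlib.sha256(b).digest()

    old = b"old contents"
    on_disk = old + sha(old)   # device ACKed the new write but never did it

    # Block-local checksum: the stale block is self-consistent, so it
    # verifies fine -- the phantom write goes undetected.
    data, csum = on_disk[:-32], on_disk[-32:]
    assert sha(data) == csum

    # Checksum kept in the referencing (parent) block, as ZFS does: the
    # parent recorded the checksum of the NEW contents, so the stale read
    # fails verification.
    parent_csum = sha(b"new contents")
    assert sha(data) != parent_csum  # phantom write detected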

> Let the filesystem handle checksumming, but extend the block device API to
> provide feedback to the kernel if the checksum is invalid.

I think by the time you get this API right, it's going to seem like just as
bad a layering violation.

ZFS has a coherent set of layers built into it. They're just not the ones that
existed before it. But I don't think those previous layers were designed with
many of the important failure modes of real hardware devices in mind. That's
understandable given when they were designed, but we've come a long way since
then in the understanding of those failure modes and our expectations of
system integrity.

------
sp332
Btrfs people have been telling everyone that the RAID5/6 code is incomplete
for years now, so hopefully people weren't depending on it for important
things. The wiki link is broken; here's the page
[https://btrfs.wiki.kernel.org/index.php/RAID56](https://btrfs.wiki.kernel.org/index.php/RAID56)
and you can see in the history that the very first version of the page said,
in its first sentence, that "the recovery and rebuild code isn't bulletproof
or complete yet."

~~~
pantalaimon
That warning was only added recently:

[https://btrfs.wiki.kernel.org/index.php?title=RAID56&diff=30...](https://btrfs.wiki.kernel.org/index.php?title=RAID56&diff=30445&oldid=29769)

~~~
sp332
Oh, so that's ~7 months of people maybe using it for important things. That
could be really bad then.

~~~
pantalaimon
Well, that line about it stabilizing over the next couple of kernel releases
was added in February 2015, so when I installed Ubuntu 16.04 on my NAS I
thought things had probably calmed down sufficiently.

As I learned the hard way, I was wrong and that warning is very much
justified.

[https://btrfs.wiki.kernel.org/index.php?title=RAID56&diff=29...](https://btrfs.wiki.kernel.org/index.php?title=RAID56&diff=29294&oldid=25681)

~~~
sp332
Well, I have a problem with a btrfs RAID1 (which isn't really RAID1, but
whatever) where reading certain files crashes my kernel. So that's fun too.

~~~
rleigh
Better that than the unrecoverable dataloss I suffered (it trashed both
disks... and also panicked the kernel). In fact, they were so screwed it
panicked the system on every boot until I booted with a non-Btrfs-supporting
kernel and blanked them with dd...

------
PaulHoule
Fancy file systems are for people who think their life isn't exciting enough.

~~~
lmm
Depends how you're defining "fancy". I use ZFS because I don't want any drama,
and it has delivered that (Linux md was more complicated to set up, to the
extent that I lost data). But it's a mature, solid filesystem in a way that
BTRFS isn't yet.

~~~
click170
Downvoted because I frequently hear this sentiment from _both sides_ of the
argument but rarely is anything more than an anecdote offered to support it.
Feels more and more like a mud slinging competition instead of an assessment
of merit.

~~~
rleigh
Many people use both filesystems, and you'll see many claims of how both work
really well and that the user hasn't had any problems. The problem with these
people's experiences is not that they are untrue, but that they are primarily
from people who haven't had hardware failure/glitches, and who have never had
their systems run the failure-case codepaths. Everything's fine and dandy up
until the point you lose all your data.

I've run Btrfs on many systems since just after it started to be usable, and
written software with Btrfs-specific support which hammers it (and LVM) like
nothing else creating and destroying tens of thousands of transient snapshots.
I've also now run ZFS on several systems, admittedly over a smaller timeframe
(3 years vs 7-8ish).

I've had Btrfs totally trash a RAID1 mirror from a transient SATA cable
connector glitch. On this test system, I had half the disk using Btrfs, half
using mdraid/LVM. The mdraid half recovered and resynced transparently as soon
as I reseated the connector; no service interruption or dataloss. Btrfs ceased
to function, and on reboot toasted both mirrors resulting in total
unrecoverable dataloss and repeated kernel panics. That's been fixed for a while,
but right here we're seeing the same thing. The failure codepaths, which are
of critical importance, are untested and buggy. And even non-failure codepaths
are still bad. Take the snapshotting case above, I had to take the system
offline and do a full manual rebalance every 18 hours. The time from fresh new
filesystem to read-only unbalanced disaster was just 18 hours when thrashed
continuously, at most using 10% of the total space. And lastly, the
performance of some things such as fsync are truly abysmal, to the extent that
we had to use "eatmydata" to completely disable it for apt/dpkg operations!
When under heavy parallel workloads, it could take many tens of minutes or
hours(!) to complete writes which ext4 would complete in a minute or so.

I've yet to experience any problems at all with ZFS. Now that might have been
luck on my part, but it might also be down to better design and quality of
implementation. It's certainly been battle tested in high end installations.
That's not to say that Btrfs doesn't have some neat features; it does a few
things ZFS doesn't, like rebalancing data over its devices while ZFS only does
that on write. But Btrfs has let me down badly every time I've used it in
anger, and those few neat features don't make up for its lack of robustness--
the primary purpose of the filesystem is to reliably store data, and it fails
at that.

I don't like to see "mud slinging", since such fanboyism is unobjective and
uninformed. I've reached my opinion based upon several years of practical
intensive use of Btrfs for various things, the most demanding of which was
repeated whole-archive rebuilds of Debian when I was maintaining
the Debian build tools, and wrote btrfs snapshot support specifically for
them, doing over 30 parallel builds on a single system using independent
snapshots per build with over 20000 snapshots per run, creating and destroying
several per second. The experiment was disastrous, and showed Btrfs to be
unsuitable for such intensive workloads. When your filesystem is guaranteed to be
turned read-only at some unpredictable and unknown point in the future, you
can't rely on it. Regular rebalancing mitigates but doesn't solve this, and
has a terrible performance impact. Not dataloss per se (unless it makes you
lose writes when it turns read-only), but it's a serious design or
implementation flaw. I did all this testing and adding of Btrfs support to
various tools because I had high hopes for its potential; unfortunately they
exposed serious shortcomings, many of which exist to this day. Today I'm using
ZFS, not because of any irrational prejudice against Btrfs, but because Btrfs
has never managed to deliver a robust and well tested filesystem!

~~~
nisa
I don't have nearly your experience, but I just want to say I agree 100% with
your conclusions, based on my own. At uni we ran a Hadoop cluster that had
disks slowly dying (some bad sectors every few days, but otherwise fine) and
lacked the money to replace them. We replaced ext4 with ZFS (no RAID, just
plain zpools with failmode=continue), and ZFS ran mostly fine; scrubs kept
the metadata sane. We never had data loss (HDFS has its own replication and
checksumming; we just needed sane metadata for Hadoop to run, and
intermediate MapReduce outputs were sent to directories with zfs set
copies=2), and we only replaced the botched disks that had longer scrub times
or couldn't survive a scrub. I'm still surprised ZFS managed to pull that
off. The only bugs I found were related to ZoL at the time, but they could be
worked around.
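
For reference, that setup boils down to two real ZFS properties (the pool and
dataset names here are made up):

    zpool set failmode=continue tank    # keep serving I/O despite pool errors
    zfs set copies=2 tank/mapred-tmp    # store two copies of each block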

btrfs switched to readonly as fast as ext4 (which is probably the correct
thing to do) but was useless for this problem.

On new hardware with new enterprise disks we chose btrfs, and we had a
painful tour of crashes, data corruption, metadata corruption (undeletable
files), and deadlocks until kernel 4.4, where things got a little bit better.
Here the disks and server were enterprise class and fully working; these were
just btrfs bugs. Also no RAID. I'm not doing that anymore, so I don't know if
any new bugs have appeared, but the whole experience will keep me from ever
using btrfs. This was ~2 years ago, and you could easily find lots of slides
from Fujitsu or SUSE saying that btrfs is stable and you can use it (around
kernel 3.13-3.16).

It's probably fine for your notebook or even your backup HDD, but don't think
you can stress it without experiencing pain (be it corruption, hangups, or
data loss) or just abysmal performance.

That being said, ZFS on Linux is also a far cry from rock solid, but I'm
optimistic that they'll iron out the problems and tackle them in a solid way.

As a Linux fanboy of many years, this gave me some solid appreciation for
Solaris engineering.

------
koverstreet
If anyone wants to help out with bcachefs, Reed-Solomon might be a fun place
to jump in. And bcachefs so far has a better track record with people's data
than btrfs :)
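
For anyone curious what the underlying problem looks like, the degenerate
single-parity case (RAID5-style XOR, which Reed-Solomon generalizes to
multiple parity shards) fits in a few lines of Python:

    # XOR parity: any single lost shard can be rebuilt from the others.
    shards = [b"\x01\x02", b"\x0a\x0b", b"\x10\x20"]
    parity = bytes(a ^ b ^ c for a, b, c in zip(*shards))

    lost = shards[1]  # pretend this shard's device died
    rebuilt = bytes(a ^ b ^ c
                    for a, b, c in zip(shards[0], shards[2], parity))
    assert rebuilt == lost

Reed-Solomon does the same thing over a Galois field so that two or more
missing shards can be recovered.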

------
moogly
Does this impact Synology's Hybrid RAID running on Btrfs? Btrfs is nowadays
the default fs on Synology boxes.

~~~
Calms
No, there is no impact to Synology's Hybrid RAID running on btrfs. Synology
uses mdadm/lvm for RAID.

------
LeoPanthera
Is it possible to install Linux on HAMMER?

~~~
X86BSD
You mean, is it possible to port and run HAMMER on Linux?

Anything is possible, but the work required would be a big challenge.

