
Battle testing data integrity verification with ZFS and Btrfs - iio7
http://www.unixsheikh.com/articles/battle-testing-data-integrity-verification-with-zfs-btrfs-and-mdadm-dm-integrity.html
======
kissgyorgy
Last year, when I wanted to build a 10-disk ZFS server in RAIDZ2 and researched
the data-integrity and fault-tolerance aspects, I found this video of
guys literally inducing hardware failure by conducting electricity through the
motherboard attached to a RAIDZ2 array:

[https://www.youtube.com/watch?v=vxFNBZIAClc](https://www.youtube.com/watch?v=vxFNBZIAClc)

and they could not cause any errors. This was pretty brutal. When I saw this
video, I decided I never want to use any filesystem other than ZFS ever again.

~~~
the8472
I think irradiating the system would be a better test, since it would induce
random bitflips anywhere in the system while it keeps running and
reading/writing data, instead of inducing a massive fault that will almost
immediately stop any IO operations.

Overclocking some components might work too.

~~~
myrandomcomment
Worked at a major switch vendor as one of the 50 original staff. We put our
gear through a ton of tests - four corners - low power/high heat, high power/low
heat, etc. One of the most interesting was taking the device to a government
lab here in NorCal in order to shoot radioactive particles into the board while
it was passing packets. It was pretty cool. The goal was to test the TCAM and
switch OS RAM to see how they handled bit flips. It passed. The board itself,
well, we could not have it back for a few weeks to allow it to "cool". The other
thing of note: winning a US gov. contract required us to enable commands to
override any heat-related auto shutdown. We were told "look, if the device is
in a situation where this is an issue, burn it up but keep it working as long
as you can". This was for navy ships.

~~~
shabble
Sometimes referred to as "Battleshort"[1] mode

[1]
[https://en.wikipedia.org/wiki/Battleshort](https://en.wikipedia.org/wiki/Battleshort)

------
bronco21016
I’m really confused about what the author is trying to accomplish here. He sets
up a scenario that Btrfs openly documents as known to cause issues. He doesn’t
necessarily come to the conclusion that one needs to trash Btrfs, but I’m not
sure why you would go through this exercise prior to deployment if the
exercise is pitting an undeployable configuration against something that has
already been heavily battle-tested. Until Btrfs development marks the RAID5/6
write-hole issue as fixed, this test is pointless.

I’m a little disheartened to read all of the negative comments about Btrfs in
this thread as well. I’ve spent a ton of time researching Btrfs in RAID 10 for
deployment on my home lab (99% Linux environment) for when it’s time to expand
storage, and from everything I read it seemed like it was going to be a good
idea. Now I’m back to wondering if I should research ZFS again.

~~~
cmurf
I think you're best off using what you're familiar with, and keeping multiple
backup copies of the important stuff. I've used Btrfs for 9-10 years and
haven't ever had unplanned data loss. Planned, due to testing, including
intentional sabotage, for bug reporting, yes. And in the vast majority of
those cases, I could still get data off by mounting read-only. I use it for
sysroot on all of my Linux computers, even the RPi, and for primary network
storage and three backup copies. A fourth copy is ZFS.

If you've had a negative experience, it can leave a bad taste in the mouth.
People do this with food too, "I once got violently sick off chicken soup,
I'll never eat it again." I wouldn't be surprised if there's an xkcd to the
effect of how filesystem data loss is like food poisoning.

There is a gotcha with Btrfs raid10: it does not really scale like a strict
raid 1+0. In that traditional case, you specify drive pairs to be mirrors,
sometimes with drives on different controllers, so if a whole controller dies
you still have all the other mirrors on the other controller and the array
lives on. You just can't lose two of any mirrored pair. Btrfs raid10 is not a
raid at the block level, it's done at the block group level. The only guarantee
you have with any size Btrfs raid10 is surviving the loss of one drive.
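
For reference, a rough sketch of creating one (device names are placeholders;
any number of devices from four up works, and you don't get to assign pairs):

    # raid10 profile for both data and metadata; Btrfs decides how block
    # groups stripe/mirror across devices - pairs are not user-assignable
    mkfs.btrfs -d raid10 -m raid10 /dev/sdb /dev/sdc /dev/sdd /dev/sde
    mount /dev/sdb /mnt
    # shows how block groups are distributed across the devices
    btrfs filesystem usage /mnt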

~~~
cmurf
Correction: Btrfs raid is not at the _device level_.

------
funkaster
Really good article. I also agree with the author: ZFS is light years ahead of
Btrfs. I currently use it on my home backup server (it used to be an RPi with 2
HDDs; I moved to a RockPro64 with 4 HDDs and a SATA controller) and it's just
great: super easy to maintain and fix, and even swapping disks is not a huge
endeavor.
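
For anyone curious, a disk swap is roughly this (pool and device names are
placeholders):

    # tell ZFS to replace the old disk with the new one; it resilvers
    # only the blocks that are actually in use
    zpool replace tank /dev/sda /dev/sdb
    # watch the resilver progress
    zpool status tank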

~~~
h1d
Is btrfs still actively developed? Red Hat ditched it, and I never heard that
btrfs got widely popular, so I take it as abandonware after all these years
of slow progress.

~~~
dralley
It's not abandonware. SUSE uses it as the default, and Facebook uses it too, I
believe.

[https://news.ycombinator.com/item?id=14907771](https://news.ycombinator.com/item?id=14907771)

~~~
ulzeraj
And they use it to deliver some beadm-alike solution to SuSE. Its integrated
into their package management so tou can easily boot and rollback stuff from
grub in case of a botched upgrade. Its even more interesting with the new
transactional-update servers in which the rootfs is read only and upgrades are
applied to a new clone which will be promoted to main file system during the
next boot.
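
Roughly, the workflow looks like this (a sketch; exact invocations vary by
SUSE release):

    # list the root filesystem snapshots taken around package operations
    snapper list
    # roll the system back to snapshot 42 (takes effect on the next boot)
    snapper rollback 42
    # on transactional-update systems: apply updates into a new snapshot,
    # which becomes the root filesystem on the next boot
    transactional-update up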

------
FullyFunctional
Reading through most comments and I still think a few points are worth
mentioning:

* btrfs has a few features that are Really Nice and missing in ZFS: the ability to have a filesystem of mixed drives and to add and remove drives at will, with rebalancing (see the sketch after this list). With ZFS, growing a filesystem is painful and shrinking impossible(? still?). There has been work on it recently though.

* ZFS has an IMO MUCH cleaner design and concepts (pool & filesystems), mirrored by a much cleaner and clearer set of commands. Working with btrfs still feels like an unfinished hack. As human error is still a major concern, this is not a trivial issue.
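
To illustrate the first point, a rough sketch of the mixed-drive workflow
(mount point and device names are assumptions):

    # grow the filesystem by adding a drive of any size, then rebalance
    # data across all devices
    btrfs device add /dev/sdf /mnt
    btrfs balance start /mnt
    # shrink by removing a drive; its data migrates off automatically
    btrfs device remove /dev/sdb /mnt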

I _have_ lost data to MD, and had scary issues with BTRFS, but never had issues
with ZFS in 8+ years. (The fact that FreeNAS is FreeBSD-based, which I'm less
inclined to mess with, also means that I mostly leave my appliance alone.)

~~~
agapon
> With ZFS growing a file system is painful and shrinking impossible

You probably mean something else rather than a file system. In ZFS you do not
need to grow or shrink file systems at all.
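
A quick sketch of why (names are placeholders): datasets all draw from the
pool's shared free space, so there is nothing to resize.

    # one pool, many filesystems, all sharing the pool's free space
    zpool create tank mirror /dev/sdb /dev/sdc
    zfs create tank/home
    zfs create tank/photos
    # a size limit is just a property, not a resize operation
    zfs set quota=100G tank/photos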

~~~
FullyFunctional
Sorry, I did mean a pool, but as is pointed out, it's being worked on.

------
pnutjam
I've been using btrfs for years with no problems. I currently use a btrfs
volume for my backup drive. It mounts, accepts the backup, takes a snapshot,
and unmounts. Has anyone seen how btrfs handles send/receive? It's pretty
awesome: like rsync, but it only sends changed blocks.
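
For the curious, the incremental flow is roughly this (paths and snapshot
names are placeholders):

    # take a read-only snapshot of the subvolume to back up
    btrfs subvolume snapshot -r /data /data/snap-new
    # send only the blocks that changed since the previous snapshot
    btrfs send -p /data/snap-old /data/snap-new | btrfs receive /backup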

------
bakul
I first started using zfs in 2005 when my hardware raid failed. Since then
I’ve moved the disks to a new server in 2009 and replaced all the disks twice
(one at a time, resilvering each). Finally I built a new server this year. This
time I’m using zfs send/recv to copy data to the new disks. The old server is
still working 10 years later & its latest disks have been in use 24x7 for over
5 years now. A zpool scrub on the old server takes days now (compared to one
hour on the copied zpool on the new server).
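
That kind of migration is roughly this (pool names are placeholders):

    # recursive snapshot of the old pool, then stream everything,
    # snapshots included, to the new pool
    zfs snapshot -r tank@migrate
    zfs send -R tank@migrate | zfs recv -F newtank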

Even back in 2009 I heard some Linux enthusiasts tell me how btrfs was going
to be better than zfs!

~~~
yjftsjthsd-h
> Even back in 2009 I heard some Linux enthusiasts tell me how btrfs was going
> to be better than zfs!

What's sad is that it should have been; the CDDL situation is really
unfortunate. Honestly, even if BTRFS performance were worse, it would be worth
it in order to have a fully-supported mainlined FS... but instead its
reputation is for data loss, so it's dead (yes, I know it works if you're
careful, but that's a terrible quality in a filesystem).

------
O_H_E
Related: take a look at bcachefs

[https://en.m.wikipedia.org/wiki/Bcachefs](https://en.m.wikipedia.org/wiki/Bcachefs)

~~~
zaarn
I run bcachefs on all my personal machines and it's been quite a joy. As fast
as ext4 with most of the features of btrfs. I can't wait until it's mainlined.

~~~
O_H_E
Huh, so it is already usable (though maybe not production ready).

Good to know.

~~~
zaarn
Personally, I wouldn't trust it for actual production yet, though Kent does
claim it is ready. I'd want it mainlined before deploying it on my servers.

But in my experience, bcachefs is incredibly stable. The only issues I had
were when mismatching the tools and the kernel, leading to fsck getting
confused when it was outdated relative to the kernel. But even with that, I
have never lost any data or had it so much as hiccup.

------
aidenn0
ZFS is the only filesystem I've ever had completely crap out on me without any
indication of a hardware issue[1]. I don't currently recall the error message
I got, but asking around about it on the various ZFS IRC channels, the answer
was invariably "I hope you have backups". This was probably a fluke, but it
did sour me a bit.

1: Btrfs refused to mount at one point due to a bug; the helpful folks on
#btrfs walked me through the process of downgrading my Linux kernel to get it
into a working state again. At this point I switched away from btrfs.

------
myrandomcomment
I use a FreeNAS box at home, from iXsystems. Good stuff. I paid a bit more for
ECC. Why? My family's history, all their pictures, is kept on the NAS (backed
up to Backblaze) and a local drive. In the last 8 years, in Photos I have
encountered maybe 11 issues where the picture was corrupted. Each time I looked
at the NAS copy, snapshot, etc., and was able to recover the correct photo.
The cost difference over time is a few cups of coffee. It is worth it. If you
can afford the NAS, I do not understand how you cannot afford the ECC.

------
tomxor
> Myth: ZFS requires tons of memory [...] The only situation in which ZFS
> requires lots of memory is if you specifically use de-duplication

It's also totally worth tons of memory when you use that feature with intent.
If you use dedup in combination with automated snapshots, you get the most
space-efficient, fast, and reliable incremental backup solution in existence -
yes, it will consume your whole server; that's the cost (it works best as a
dedicated backup server).
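
A minimal sketch of that combination (pool/dataset names are placeholders):

    # dedup is a per-dataset property; its table is what eats the RAM
    zfs set dedup=on tank/backups
    # snapshots on top give the incremental history (automate with cron
    # or a snapshot tool)
    zfs snapshot tank/backups@2021-06-01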

------
vasili111
What is your personal experience with Btrfs?

~~~
JustFinishedBSG
BTRFS is the _only_ filesystem that has ever failed on me and been rendered
unrecoverable in all my life (after a hard reset).

That's an anecdote, sure, but it's enough for me to never use it again.

In my opinion BTRFS is rotten at the core; I'm more interested in Bcachefs's
future.

~~~
aidenn0
I mentioned this elsewhere, but as a counter-anecdote, ZFS is the only
filesystem that failed and was rendered unrecoverable in all of _my_ life.
That being said, btrfs was quite janky, and I had to recover it on more than
one occasion.

------
cmurf
This is a really good write-up.

The performance differences found between ZFS and Btrfs are curious, as I've
always had the reverse experience, with ZFS being slower by maybe 15%.
Scrubbing on md raid does take a while: every block must be checked, since md
has no idea which blocks are in use. A write-intent bitmap would at least
avoid a complete resync after an unclean shutdown.
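
For md raid, the relevant knobs are roughly these (array name is a
placeholder):

    # add a write-intent bitmap so an unclean shutdown needs only a
    # partial resync instead of a full one
    mdadm --grow --bitmap=internal /dev/md0
    # kick off a scrub; md checks every block, in use or not
    echo check > /sys/block/md0/md/sync_action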

~~~
StavrosK
I have a joke NAS and I use ZFS for the disks. Unfortunately, at some point I
started having problems where deleting a file takes half a minute (each
file is only around a gigabyte). I have no idea why performance is killed like
this, and nobody I asked on IRC seems to know why it is happening.

~~~
craigyk
My guess is that ZFS is already pretty slow at deletes, and that your pool was
pretty close to capacity and fragmented.
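
Both are easy to check, if the pool is still around (pool name is a
placeholder):

    # capacity and fragmentation are visible straight from zpool
    zpool list -o name,capacity,fragmentation tank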

~~~
yjftsjthsd-h
Oh, yeah: ZFS at least used to degrade catastrophically in terms of
performance once you got above... I think 80%? Granted, that was literally
Solaris, so I don't know if it's been fixed since then.

~~~
cmurf
One of my long-time Btrfs raid1 backups is 99% full. Writes still go at full
speed (the much-reduced speed you expect when writing to the interior tracks
of spinning disks). But these drives avoid the nastiest form of fragmentation,
because the filesystem only receives snapshots from another filesystem, which
makes it act like a tape backup until it's full. Deleting snapshots produces
large contiguous holes of space for sequential writes.

This form of snapshotting also scales up well. I've had hundreds of snapshots
with no performance reduction, and deleting them is fast too. With many changes
between snapshots, the metadata gets much more complicated; making a snapshot
is still cheap and fast, but deleting older ones becomes more expensive due to
the backref walk.

------
walrus01
RAID5 has been excessively risky and obsolete for a long time - not enough
parity, and too much risk of data loss from an unrecoverable read error during
a multi-drive rebuild with large disks (like an eight-drive 8TB RAID5). A
better test for production use would be RAIDZ2.
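
That is, something like this (pool and device names are placeholders):

    # eight-drive RAIDZ2: any two drives can fail, including during a
    # rebuild, without data loss
    zpool create tank raidz2 /dev/sd[b-i]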

~~~
bscphil
I agree about RAID5, but personally I don't think RAIDZ2 really works as an
alternative. Everyone should use RAID10.

The reason for this actually has nothing to do with data protection (although
not having to read your entire disc set to rebuild is nice). The reason is
that it's hard to figure out who RAID5/6 are actually for. The enterprise is
all on RAID10 (hence no one fixing the BTRFS write hole). So you'd think it
would be for enthusiasts who don't want to purchase as many discs, right?
Well, in my case at least, parity disc modes are useless for me because it
means I would have to buy discs of exactly the same size into the indefinite
future!

I started with 4TB discs, then added 8TB discs, and now I'm looking at the
newer Western Digital 10TB helium discs, which actually draw much less power
than the 8TB ones. But none of those 4TB or 8TB discs have failed! So while in
theory, if you stick to the drive size you start out with, you can save money
with RAIDZ by using parity discs and adding a drive when you need more storage
(at the cost of a decreased parity percentage), practically speaking a lot of
that money is wasted, since you either have to forgo larger, more efficient
drives or replace working older drives when you want to upgrade.

