
Five Years of Btrfs - vordoo
https://markmcb.com/2020/01/07/five-years-of-btrfs/
======
InTheArena
I went on a quest a few years ago, thinking it would be good for the industry
to standardize on a single next-generation filesystem for UNIX. I started with
ZFS on Linux, since that seemed to have the most vocal advocates. That lasted
about half a year, until a bug in the code resulted in a completely corrupt
disk and I spent a month restoring 4TB of data from offsite backups. That,
plus the licensing confusion around ZFS, has made it impossible for ZFS to be
the de facto choice.

I went down the BTRFS path, despite its dodgy reputation, when Netgear
announced their little embedded NASes, and switched my server over to it. The
experience was solid enough that I bought a high-end Synology and have had
zero problems with it.

~~~
clSTophEjUdRanu
I really don't understand the insane hype around ZFS. You can't read any
thread that touches on filesystems without the ZFS zealots coming out.

~~~
tjoff
The hype is quite easy to understand. Snapshots and checksums are two complete
game-changers. ZFS has them both. And there are no real alternatives in many
cases.

I've personally waited for BTRFS for longer than a decade, but my use-cases
are yet to be considered stable (not something you really mess with when it
comes to filesystems).

Honestly, as sure as I once was of BTRFS's success, I now consider BTRFS dead
on arrival - if it ever even arrives. The pace of development is slower than
the world around it. That might be too harsh, but really - no RAID6 yet? A
decade ago the impression I got was "soon", and now 2-drive parity is becoming
obsolete.

ZFS has tons of warts for home-use, I agree. So, for a home-user with high
demands I don't see anything exciting in the future.

~~~
TurningCanadian
There were a bunch of btrfs raid56 patches last year. I think the known bugs
have been addressed and it's just that the wiki page hasn't been updated.

Re obsolete, are you referring to RAID1C3?

~~~
tjoff
I'm thinking of this:

[https://www.zdnet.com/article/why-raid-5-stops-working-in-2009/](https://www.zdnet.com/article/why-raid-5-stops-working-in-2009/)

I'd much prefer something like raidz3 compared to the author's setup.

RAID1C3 is nice but very expensive for use in bulk storage at home.

------
derefr
A question for HN: what filesystem and/or block-device abstraction layer would
you use on a database server, if you wanted to perform scheduled incremental
backups using filesystem-level consistent snapshotting and differential
snapshot shipping to object storage, _instead of_ using the DBMS’s own
replication layer to achieve this effect? (I.e. you want disaster recovery,
not high availability.)

Or, to put that another way: what are AWS and GCP using in their SANs (EBS;
GCE PD) that allows them to take on-demand incremental snapshots of SAN
volumes, and then ship those snapshots away from the origin node into safer
out-of-cluster replicated storage (e.g. object storage)? Is it proprietary, or
is it just several FOSS technologies glued together?

My naive guess would be that the cloud hosts are either using ZFS volumes, or
LVM LVs (which _do_ have incremental snapshot capability, if the disk is
created in a thin pool) under iSCSI. (Or they’re relying on whatever point-
solution VMware et al sold them.)
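
(For reference, my understanding of the LVM thin-pool flow - a sketch with
hypothetical VG/LV names, not a recipe:)

    
    
      # Create a thin pool and a thin volume inside it
      lvcreate --type thin-pool -L 100G -n tpool vg0
      lvcreate --thin -V 50G -n dbvol vg0/tpool
      # Thin snapshots need no preallocated size; they share blocks with the origin
      lvcreate --snapshot --name dbvol_snap vg0/dbvol
      # Thin snapshots are created deactivated and skip activation by default
      lvchange -ay -Ky vg0/dbvol_snap
    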

If you control the filesystem layer (i.e. you don’t need to be filesystem-
agnostic), would Btrfs snapshots be better for this same use-case?
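
(The kind of pipeline I have in mind - a sketch with hypothetical dataset,
snapshot, and bucket names:)

    
    
      # ZFS: incremental stream between two snapshots, piped to object storage
      zfs send -i tank/db@snap1 tank/db@snap2 | zstd | aws s3 cp - s3://backups/db-snap2.zst
      # Btrfs equivalent; -p names the parent snapshot of the differential stream
      btrfs send -p /snaps/snap1 /snaps/snap2 | zstd | aws s3 cp - s3://backups/db-snap2.btrfs.zst
    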

~~~
muxator
I do not think it would be a good idea to use filesystem-level snapshotting
for backing up a database. The database "knows better" about its internals,
and can give more guarantees about the consistency of its data. I would trust
a filesystem-level backup only as a last resort.

~~~
iracic
It is possible to put a database in a state that is "ready" for a snapshot,
pushing changes to disk and essentially freezing I/O while the snapshot is
taken.
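
Roughly like this - a sketch assuming an LVM-backed data directory with
hypothetical names (lvcreate freezes the filesystem itself, but the explicit
fsfreeze shows the mechanism):

    
    
      fsfreeze -f /var/lib/mysql    # flush dirty pages, block new writes
      lvcreate --snapshot -L 10G -n mysql_snap vg0/mysql    # point-in-time copy
      fsfreeze -u /var/lib/mysql    # resume I/O; the pause is typically sub-second
    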

~~~
jstrong
This is generally not a matter of concern for a copy-on-write filesystem like
ZFS, since it's not possible for the file to be in an "in between" state. If a
write were in progress, the filesystem would still be pointing to the previous
state. Only when the data is written to disk is the pointer moved to the new
location.

~~~
tjoff
It very much is a concern. ZFS has no knowledge about the internals of a
database, which parts of a file are related to each other etc.

~~~
RX14
DBMSes always keep their database in the file system in a consistent state to
be able to recover from system crashes. Taking a file system snapshot is
equivalent to pulling the power on the database server in terms of data
recovery, but databases are designed to support this.

~~~
tjoff
As do filesystems. Yet I've never seen anyone argue that cutting the power is
the recommended way of doing backups.

In fact the opposite: make sure to use a UPS just so that you can shut down
cleanly in the unfortunate event.

For example: [https://blogs.oracle.com/paulie/backing-up-mysql-using-zfs-snapshots-and-clones](https://blogs.oracle.com/paulie/backing-up-mysql-using-zfs-snapshots-and-clones)
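
The pattern in that post boils down to holding a read lock while the snapshot
is taken - roughly this, with hypothetical dataset names:

    
    
      mysql -u root <<EOF
      FLUSH TABLES WITH READ LOCK;
      system zfs snapshot tank/mysql@backup
      UNLOCK TABLES;
      EOF
    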

~~~
wmf
Some people came up with the idea of "crash-only software", arguing that it's
better to maintain one code path (recovering from a crash) than two (clean
start and recovery), but it hasn't caught on that much.
[https://www.usenix.org/legacy/events/hotos03/tech/full_paper...](https://www.usenix.org/legacy/events/hotos03/tech/full_papers/candea/candea_html/index.html)

~~~
paulddraper
Google follows this, I believe

------
gravypod
I've seen a lot of the hacker community focusing on btrfs and zfs but very
little focusing on ceph. I think ceph has a lot of the features that we want
in a file system, and some things that aren't even possible on traditional
file systems (per-file redundancy settings), with very few downsides. The
setup is a little more complex, involving a few daemons to manage disks,
balance, monitor, etc. I wish there were something similar to FreeNAS for ceph
that focused only on making the experience seamless, because I think if it
became more popular in the home lab space we'd see lots of cool tools pop up
for it.

~~~
louwrentius
I love Ceph, I even wrote an intro about it for those who are not familiar
with it.

[https://louwrentius.com/understanding-ceph-open-source-scalable-storage.html](https://louwrentius.com/understanding-ceph-open-source-scalable-storage.html)

But Ceph is not designed to be a competitor to BTRFS or ZFS. The core vision
of Ceph is scalability. If you need petabytes of storage and the performance
to scale with it, take a look at Ceph.

I may be totally wrong here, but from what I understand about Ceph, it's not
meant as a file system for a single computer. I don't understand the idea of
running Ceph on your laptop/desktop. It's possible to run it that way, but it
defeats its purpose.

I've built a small lab setup with Ceph:

[https://louwrentius.com/my-ceph-test-cluster-based-on-raspberry-pis-and-hp-microservers.html](https://louwrentius.com/my-ceph-test-cluster-based-on-raspberry-pis-and-hp-microservers.html)

Also, there's the issue of performance, in particular latency. That's a bit of
a weak spot of Ceph, from what I can tell. Again, I may be wrong, but I found
these notes interesting:

[https://yourcmc.ru/wiki/Ceph_performance](https://yourcmc.ru/wiki/Ceph_performance)

~~~
seabrookmx
This.

In fact, it's really common to use a ZFS array on single nodes, and then
create a SAN using multiple such machines by layering Ceph on top.

~~~
louwrentius
That's interesting, but it's layers upon layers... (RIP latency), I think.
Unless it's about just bandwidth and volume, then latency is not that big of a
deal.

~~~
seabrookmx
You don't have to use ZFS snapshots. I haven't run a system like this in
production but presumably you choose ZFS because it's flexible in how you
configure the arrays (as is say, LVM) and because it supports checksumming.

------
pojntfx
Love using Btrfs; there is no better filesystem now that its reliability
issues have been fixed.

~~~
pantalaimon
> now that its reliability issues have been fixed

Is this also true for RAID5/6?

~~~
jxcl
This issue has its own wiki page on the BTRFS wiki:

[https://btrfs.wiki.kernel.org/index.php/RAID56](https://btrfs.wiki.kernel.org/index.php/RAID56)

So, no, that particular issue hasn't been fixed.

~~~
3fe9a03ccd14ca5
> _For data, it should be safe as long as a scrub is run immediately after
> any unclean shutdown._

That’s unfortunate. Does the scrub run automatically in those situations?
Consumer hardware will be the most prone to intermittent power failure.

------
tezzer
I've had one issue with btrfs that took it off my radar completely. A customer
had a runaway process that filled a btrfs device with unimportant data. We
found the errant process and killed it, but apparently if a btrfs device is
completely full, you can't delete anything to free up space: being
copy-on-write, file removal itself requires some amount of free space. Bricked
the device, annoyed a customer, back to ext4.

~~~
takeda
ZFS had this issue too (I believe it's fixed now). The workaround was to pick
one large file that you wanted to delete and do `echo -n >
/the/unimportant/file`; once the file was truncated to zero bytes, rm started
to work again.

Not sure if that workaround would work in btrfs, but it worked on ZFS.

~~~
rcthompson
What happens if the file has already found its way into a snapshot? Then
presumably that command will not free any space.

~~~
takeda
Well, rm wouldn't free the space either, so you'd either remove the snapshot
or choose a different file.

------
kiney
I've used BTRFS on several devices for years. The tooling is a bit rough, but
no major problems. Just recently data checksumming saved me: in December I
replaced an old 2TB drive in my RAID1 (2+4+4+4) with an 8TB drive. The new
drive developed checksum errors after a few weeks, which BTRFS handled
gracefully. With "classical" RAID I might only have noticed when it was too
late. (I RMAed the bad drive.)
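
(That's the output of `btrfs device stats`, one block of counters per
device:)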

    
    
      [/dev/mapper/h4_crypt].write_io_errs    0
      [/dev/mapper/h4_crypt].read_io_errs     0
      [/dev/mapper/h4_crypt].flush_io_errs    0
      [/dev/mapper/h4_crypt].corruption_errs  0
      [/dev/mapper/h4_crypt].generation_errs  0
      [/dev/mapper/h2_crypt].write_io_errs    0
      [/dev/mapper/h2_crypt].read_io_errs     30
      [/dev/mapper/h2_crypt].flush_io_errs    0
      [/dev/mapper/h2_crypt].corruption_errs  0
      [/dev/mapper/h2_crypt].generation_errs  0
      [/dev/mapper/h1_crypt].write_io_errs    0
      [/dev/mapper/h1_crypt].read_io_errs     0
      [/dev/mapper/h1_crypt].flush_io_errs    0
      [/dev/mapper/h1_crypt].corruption_errs  0
      [/dev/mapper/h1_crypt].generation_errs  0
      [/dev/mapper/h3_crypt].write_io_errs    0
      [/dev/mapper/h3_crypt].read_io_errs     0
      [/dev/mapper/h3_crypt].flush_io_errs    0
      [/dev/mapper/h3_crypt].corruption_errs  0
      [/dev/mapper/h3_crypt].generation_errs  0
      [/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].write_io_errs    0
      [/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].read_io_errs     16
      [/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].flush_io_errs    0
      [/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].corruption_errs  20619
      [/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].generation_errs  0
    

edit: formatting

------
epx
I have been using btrfs in my "NAS"/personal server for 3 years. I've changed
the disk configuration a couple of times, I take snapshots every hour and
prune them using a Fibonacci-like timeline, and no problems yet.

~~~
Teknoman117
My experience has been the same. Admittedly, I've not tried native BTRFS
parity raid (I'm sitting the volume on top of mdraid). But I ran "mkfs.btrfs"
5 years ago at this point for my desktop, and no data loss yet. I back things
up religiously, so I'm not too worried about the volume failing, but it'll be
nice if btrfs parity raid gets stabilized, because then I could replace my
current NAS storage config.

I used to use ZFS on my NAS, but after running it for a year and fiddling with
it, I wasn't able to tune it in a way I liked. I always had random performance
problems, and zvols were super slow. It's now dm-integrity on all disks, an
mdraid raid6 volume over those, LVM2 on top of that, and mirrored NVMe disks
as a read and write cache.
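
For anyone curious, that stack goes together roughly like this (a sketch with
hypothetical device names; `integritysetup format` wipes the disk):

    
    
      # Per-sector checksums underneath the RAID layer, one device at a time
      integritysetup format /dev/sda
      integritysetup open /dev/sda int_sda
      # ...repeat for each disk, then raid6 across the integrity devices
      mdadm --create /dev/md0 --level=6 --raid-devices=4 \
          /dev/mapper/int_sda /dev/mapper/int_sdb /dev/mapper/int_sdc /dev/mapper/int_sdd
      # LVM on top
      pvcreate /dev/md0
      vgcreate bulk /dev/md0
    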

I also wish BTRFS would add extents at some point so you could run virtual
machine images from it without weird performance issues from time to time
(although I imagine this is less of an issue on SSDs because they're
"fragmented" inside anyways).

------
alyandon
I use btrfs in raid1 mode and the ability to shrink/grow/add/remove devices at
will without data loss or extended downtime led me to choose btrfs over zfs on
my home servers.

~~~
cyphar
You can grow and add/remove raid1 devices (mirror vdevs) in ZFS without any
significant work or downtime. Shrinking does require a bit more work, but
depending on your setup it can be done fairly painlessly with send/recv (and
shrinking is usually not something which is a very common administrative
operation).
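
Concretely, with hypothetical pool and device names:

    
    
      # Grow in place: swap each disk of a mirror for a bigger one
      zpool set autoexpand=on tank
      zpool replace tank sda sdc    # wait for resilver, then do the other disk
      zpool replace tank sdb sdd
      # Add another mirror vdev to stripe across
      zpool add tank mirror sde sdf
      # Remove a top-level mirror vdev again (ZoL 0.8+)
      zpool remove tank mirror-1
    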

~~~
3fe9a03ccd14ca5
How? My understanding is that you create a new vdev and add the old vdev as a
device, basically recursively creating volumes with each new device you add.

~~~
cyphar
Which operation are you asking about? [1] is a sister comment which I posted
that outlines how to do most of the operations I mentioned.

[1]:
[https://news.ycombinator.com/item?id=22168494](https://news.ycombinator.com/item?id=22168494)

------
Shalle135
Are there any specific reasons to run btrfs over, for example, ext4? You can
create/shrink/grow pools, create encrypted volumes, etc. by using LVM.

It all depends on the application, but in the majority of cases the IO
performance of btrfs is worse than the alternatives.

Red Hat, for example, chose to deprecate btrfs for unknown reasons, while SUSE
made it its default. Its future seems uncertain, which may cause a lot of
headaches in major environments if it's implemented there.

~~~
derefr
Redhat and SUSE (SLES) are both enterprise environments, so at every level,
they have to choose one tech stack to go all-in on (i.e. to train their
support staffs on), and then discourage their customers from using the others.
(“Deprecating” a component, for such orgs, means that some of their customers
are now stuck with it, and they’ll continue to support _those_ customers in
their use of it, but they certainly won’t support _new_ customers using it.)

The fact that one enterprise-support provider went all-in on Btrfs, while
another didn’t, basically tells you that the choice is pretty arbitrary. If
_no_ enterprise-support provider used Btrfs, _then_ I’d be concerned.

~~~
Arnavion
The enterprise provider that actually develops btrfs continues to support
btrfs, and one enterprise provider that doesn't stopped supporting it.

People treat RH stopping support of btrfs as some sort of death knell for it.
Meanwhile all the btrfs users are confused why RH's opinion should matter at
all when they weren't that involved with developing it in the first place.

As an opensuse user, btrfs has saved multiple machines from botched updates by
letting me revert to the snapshot from right before the update was applied
(opensuse's update tool automatically takes snapshots before and after
updates).

~~~
Conan_Kudo
Red Hat _used_ to be heavily involved in Btrfs development. In fact, they were
behind a huge chunk of its development in the first few years. But their
developers were hired away by Facebook, leaving Red Hat with nobody who works
on Btrfs regularly. That's the underlying cause of why they stopped supporting
it. Hiring someone to work on Btrfs takes time and effort that they don't have
a reason to spend right now.

------
zielmicha
fsync is still a bit slow on BTRFS (on ZFS too, but to a smaller degree). For
example, I just did a quick benchmark on Linux 5.3.0: installing Emacs in a
fresh Ubuntu 18.04 chroot (dpkg calls fsync after every installed package).

ext4 - 33s, ZFS - 50s, btrfs - 74s

(The test was run on a Vultr.com 2GB virtual machine; the backing disk was
allocated using "fallocate --length 10G" on an ext4 filesystem; the results
are very consistent.)
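
The setup was roughly this (hypothetical paths; the ext4 and ZFS runs swap in
the corresponding mkfs/zpool step):

    
    
      fallocate --length 10G /var/tmp/disk.img
      mkfs.btrfs /var/tmp/disk.img
      mount -o loop /var/tmp/disk.img /mnt/test
      debootstrap bionic /mnt/test http://archive.ubuntu.com/ubuntu
      chroot /mnt/test apt-get update
      time chroot /mnt/test apt-get install -y emacs
    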

~~~
rossmohax
Is the ext4 result also from a fallocated file on top of ext4?

~~~
zielmicha
Yes.

------
lousken
Has anyone had the courage to use btrfs in production? Any stories to share?

~~~
jhalstead
Seems like Facebook uses it:

"Btrfs has played a role in increasing efficiency and resource utilization in
Facebook’s data centers in a number of different applications. Recently, Btrfs
helped eliminate priority inversions caused by the journaling behavior of the
previous filesystem, when used for I/O control with cgroup2 (described below).
Btrfs is the only filesystem implementation that currently works with resource
isolation, and it’s now deployed on millions of servers, driving significant
efficiency gains."

[https://engineering.fb.com/open-source/linux/](https://engineering.fb.com/open-source/linux/)

~~~
alexgartrell
Yeah, there's a remarkable set of container runtime tasks (package downloads,
rootfs creation and management, etc.) that are way easier with btrfs. It
wasn't always smooth sailing, but luckily Chris, Josef, Omar and others are
awesome, and now (and for the last while) we are asking for features rather
than fixes.

------
pQd
I've been using BTRFS since 2014 to store backups. There is a noticeable
performance penalty when rsync'ing hundreds of thousands of files to a
spinning-rust disk in a USB-SATA dock when BTRFS is used instead of EXT4. I
accept it in exchange for the ability to run scheduled scrubs of the data to
detect potential bitrot.

Since 2017 I've also been using BTRFS to host MySQL replication slaves. Every
15 min, 1h and 12h, crash-consistent snapshots of the running database files
are taken and kept for a couple of days. There's consensus that - due to its
COW nature - BTRFS is not well suited to hosting VMs, databases, or any other
type of files that change frequently. Performance is significantly worse
compared to EXT4, which can lead to slave lag, but slave lag can be mitigated
by using NVMe drives and relaxing the durability settings of MySQL's InnoDB
engine. I've used those snapshots a few times each year, and it has worked
fine so far. Snapshots should never be the main backup strategy; independently
of them, there's a full database backup done daily from the masters using
mysqldump. Snapshots are useful whenever you need very quick access to the
state of the production data from a few minutes or hours ago - for instance,
after fat-fingering some live data.
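
The snapshot side of that is essentially a one-liner from cron - a sketch with
hypothetical paths, assuming the datadir is its own subvolume:

    
    
      # */15 * * * * - crash-consistent, read-only snapshot of the datadir
      btrfs subvolume snapshot -r /data/mysql "/data/snaps/mysql-$(date +%Y%m%d-%H%M)"
      # keep the newest 48, delete the rest (names sort chronologically)
      ls -d /data/snaps/mysql-* | head -n -48 | xargs -r -n1 btrfs subvolume delete
    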

During those years I've seen kernel crashes most likely due to BTRFS, but I
did not lose data as long as the underlying drives were healthy.

------
izacus
It's also worth noting that Synology uses btrfs as an option to do
checksumming and snapshots on their NAS devices.

They're still using their own RAID layer though.

~~~
ValentineC
> _They're still using their own RAID layer though._

Synology's RAID implementation is largely mdadm + LVM.

------
cmurf
Kernel 5.5 was released Sunday. Btrfs now has raid1c3 and raid1c4 profiles for
3- and 4-copy raid1, and adds new checksum algorithms: xxhash, blake2b,
sha256.

Async discards are coming in 5.6: [https://lore.kernel.org/linux-btrfs/cover.1580142284.git.dsterba@suse.com/T/#u](https://lore.kernel.org/linux-btrfs/cover.1580142284.git.dsterba@suse.com/T/#u)
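
With btrfs-progs 5.5 both can be selected at mkfs time (hypothetical devices):

    
    
      # three copies of data and metadata across three or more devices
      mkfs.btrfs -d raid1c3 -m raid1c3 /dev/sda /dev/sdb /dev/sdc
      # pick a checksum algorithm other than the default crc32c
      mkfs.btrfs --csum xxhash /dev/sdd
    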

------
abotsis
It’s worth noting that much of the premise of the article (wanting
flexibility) is outdated. ZFS has support for removing top-level raid 0/1
vdevs now, so you can take a raid10 pool and remove a top-level mirror vdev
completely. Note that this doesn't work for raid5/6 vdevs, but as the author
points out, those are becoming less and less used because of rebuild time and
performance.

In addition to the slew of other features Btrfs is missing (send/recv, dedup,
etc.), ZFS allows you to dedicate something like an Intel Optane (or another
similar high-write-endurance, low-latency SSD) to act as stable storage for
sync writes, and a different device (typically MLC or TLC flash) to extend the
read cache.
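
Attaching those is a one-liner each (hypothetical pool and device names):

    
    
      # dedicated SLOG device for sync writes
      zpool add tank log /dev/nvme0n1
      # separate L2ARC device to extend the read cache
      zpool add tank cache /dev/nvme1n1
    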

~~~
kstrauser
I think there's a selection bias here: people using RAID 5/6 may not be using
ZFS as much because it's not well supported. I'd bet money that those levels
are much more common in SOHO settings than RAID 10 is, because they're still
the sweet spot between "I need lots of storage" and "...and am willing to
spend a drive's worth of storage on availability". For instance, anyone using
a NAS primarily as a backup target for desktops and small servers may love
RAID 5, but be unwilling to throw money at a "better" RAID 10 setup.

------
geophertz
Is using btrfs on a personal machine a reasonable thing to do? It seems that
all the comments, as well as the articles about it, just assume you're running
it on a server.

The ability to add and remove disks on a desktop machine is very tempting.

~~~
wtfrmyinitials
I've been running it on my desktop for a while and it's been wonderful. I have
a cron job set to take a snapshot of the filesystem hourly, so if I ever blow
a file away or a package upgrade goes wonky I'm back up and running in
minutes.
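
The job is essentially this (a sketch; hypothetical snapshot path):

    
    
      #!/bin/sh
      # /etc/cron.hourly/snap: read-only snapshot of the root subvolume
      btrfs subvolume snapshot -r / "/.snapshots/root-$(date +%Y%m%d-%H)"
    

Restoring a file is then just a cp out of the snapshot directory.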

------
mdip
I've been a `btrfs` user for the better part of 4 years despite, at the time,
a very vocal group providing advice against it[0].

I'll be the first to say that it isn't a silver bullet for everything. But
then, what filesystem really is? Filesystems are such a critical part of a
running OS that we expect perfection for every use case; filesystem bugs or
quirks[1] result in data loss which is usually _Really Bad_ (tm).

That said, for the last two years, I've been running Linux on a Thinkpad with
a Windows 10 VM in KVM/qemu -- both are running all the time. When I first
configured my Windows 10 VM, performance was _brutal_; there were times when
writes would stall the mouse cursor and the issue was directly related to
`btrfs`. I didn't ditch the file-system, I switched to a raw volume for my VM
and adjusted some settings that affected how `btrfs` interacted with it. I
discovered similar things happened when running a `balance` on the filesystem
and after a bit of research, found that changing the IO scheduler to one more
commonly used on spindle HDDs made everything more stable.

So why use something that requires so much grief to get working? Because those
settings changes are a minor inconvenience compared with the things I no
longer have to mess with for a bigger problem that I frequently encountered:
OS recovery. An out-of-the-box OpenSUSE Tumbleweed installation
uses `btrfs` on root. Every time software is added/modified, or `yast` (the
user-friendly administrative tool) is run, a snapshot is taken automatically.
When I or my OS screws something up, I have a boot menu that lets me "go back"
to prior to the modification. It Just Works(tm). In the last two years, I've
had around 4-5 cases where my OS was wrecked by keeping things up to date, or
tweaking configuration. In the past, I'd be re-installing. Now, I reboot after
applying updates and if things are messed up, I reboot again, restore from a
read-only snapshot and I'm back. I have no use for RAID or much else[2] which
is one of the oft-repeated "issues" people identify with `btrfs`.

It fits for my use-case, along with many of the other use-cases I encounter
frequently. It's not perfect, but neither is _any_ filesystem. I won't even
argue that other people with the _same use case_ will come to the same
conclusion. But as far as I'm concerned, _damn_ it works well.

[0] I want to say that an installation of openSUSE ended up causing me to
switch to `btrfs`, but I can't remember for sure -- that's all I run,
personally, and it is a default for a new installation's root drive.

[1] Bug: a specific feature (i.e. RAID) just doesn't work. Quirk: the
filesystem has multiple concepts of "free space" that don't necessarily line
up with what running applications understand.

[2] My servers all have LSI or other hardware RAID controllers and present the
array as a single disk to the OS; I'm not relying on my filesystem to manage
that. My laptop has a single SSD.

------
nickik
Being 'The Dude' of filesystems is literally the opposite of what I want. When
I look at ZFS talks and the incredible complexity of some of the operations
that Btrfs treats as 'no big deal', I simply won't trust that. Especially
because it has been proven over and over again that Btrfs claims it's 'stable'
and then a new series of issues shows up. Or it's 'stable' but not if you use
feature XY, or if the disk is 'too full', or whatever.

I remember using it after I had heard it was 'stable', and it ate my data not
long after (and I wasn't using crazy features or anything). I certainly will
not use it again. A FS should be stable from the beginning - a stable core
that you can then build features around - rather than a system with lots of
features that promises to be stable in a couple of years (and then wasn't,
years after already being in the kernel).

Using ZFS, for me, has been nothing but joy in comparison. Growing the ZFS
pool has been no issue at all, and I never saw a reason to reconfigure my
pool. I went from 4TB to 16TB+ so far in multiple iterations.

Overall, not having ZFS in Linux is a huge failure of the Linux world. I think
it's much more NIMBY than a license issue.

~~~
loudmax
> I think it's much more NIMBY than a license issue

How do you propose that ZFS be brought into Linux? When Sun released ZFS as
open source, they made a deliberate decision to use a license that prevented
it from being integrated into the Linux kernel. This was no accident. At the
time, Sun was still pushing OpenSolaris, which was losing ground to Linux. The
ZFS on Linux project gets around this restriction by shipping as a separate
out-of-tree kernel module, but this is not optimal.

You can make a legitimate argument that Linux should have been released under
a BSD style license (I think that would be wrong, but it's plausible). I don't
see how you can argue that ZFS's license is somehow the fault of the Linux
world.

~~~
trasz
On the other hand, choosing to use GPL would prevent it from being integrated
anywhere else. You'd also lose the patent protection granted by CDDL.

~~~
vetinari
It's not like they couldn't have used dual licensing, like Mozilla did at the
time, for example.

------
curt15
BTRFS is well known for being ill-suited to VMs or databases. How come ZFS
doesn't have that reputation?

~~~
xorcist
There is an attribute called NOCOW that can be set on specific files that
should not be copy-on-write, which is what messes with databases, filesystem
images, and other things that need fast in-place updates.

It can also be set as a flag on subvolumes.

~~~
jolmg
You can also set such an attribute on files and subvolumes in btrfs.

[https://wiki.archlinux.org/index.php/Btrfs#Disabling_CoW](https://wiki.archlinux.org/index.php/Btrfs#Disabling_CoW)
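
On btrfs it's the `C` file attribute. It only takes effect on files created
(or still empty) after the flag is set, so the usual trick is to set it on the
directory (hypothetical path):

    
    
      mkdir /var/lib/libvirt/images
      chattr +C /var/lib/libvirt/images    # new files here inherit NOCOW
      lsattr -d /var/lib/libvirt/images    # shows the 'C' attribute
    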

------
c0ffe
I have a small Nextcloud instance at home that uses BTRFS (on an HDD, with the
noatime option) for file storage, and XFS (on an SSD) for the database.

I started it just for testing, and it has been running for about two years
with no problems so far.

------
shmerl
I'm using Btrfs currently, but I'm waiting for Bcachefs to replace it.

~~~
mekster
How far has it come to replace any of the production ready filesystems?

It was said to be feature complete in 2015 and aimed for kernel mainlining in
2018, but I don't see much about anyone using it in production.

~~~
shmerl
It's definitely not production ready so far. I haven't really followed all the
details on that.

------
e40
I've heard a lot of people say they won't use Btrfs due to reliability. Would
have been nice to see that addressed.

~~~
ailideex
What about the reliability? Are many people losing data with Btrfs?

~~~
jbotz
As best I can tell, reports of data loss on btrfs are all from the early
20-teens; after about 2014 or so I can't find anyone who claims to have lost
data due to a btrfs bug on an up-to-date system.

~~~
Fnoord
RAID5 on btrfs still had a write hole last time I checked. The bug has been
around forever, and was certainly around in 2014.

Phoronix has some thorough performance comparisons between ext4, Btrfs, XFS,
and ZFS.

~~~
vetinari
That one is here to stay; it's a property of software-based RAID. If it
bothers you, use a UPS.

~~~
zielmicha
ZFS-based RAID-5 (called raidz1) doesn't have a write hole:

[https://blogs.oracle.com/ahl/what-is-raid-z](https://blogs.oracle.com/ahl/what-is-raid-z)

~~~
vetinari
Because ZFS raidz1 is not raid5, it's even labelled differently. Yes, it is a
parity-based raid, but has slightly different semantics.

------
cyphar
This article makes a few mistakes with regard to ZFS. Some are understandable
(the author presumably last looked at the state of ZFS 5 years ago), but some
were not true even 5 years ago:

> If you want to grow the pool, you basically have two recommended options:
> _add a new identical vdev_ , or replace both devices in the existing vdev
> with higher capacity devices.

You can add vdevs to a pool which are different types or have different
parities. It's not really recommended because it means that you're making it
harder to know how many failures your pool can survive, but it's definitely
something you can do -- and it's just as easy as adding any other vdev to your
pool:

    
    
      % zpool add <pool> <vdev> <devices...>
    

This has always been possible with ZFS, as far as I'm aware.

> So let’s say you had no writes for a month and continual reads. Those two
> new disks would go 100% unused. Only when you started writing data would
> they start to see utilization

This part is accurate...

> and only for the newly written files.

... but this part is not. Modifying an existing file will almost certainly
result in data being copied to the newer vdev -- because ZFS will send more
writes to drives that are less utilised (and if most of the data is on the
older vdevs, then most reads are to the older vdevs, and thus the newer vdevs
get more writes).

> It’s likely that for the life of that pool, you’d always have a heavier load
> on your oldest vdevs. Not the end of the world, but it definitely kills some
> performance advantages of striping data.

This is also half-true -- it's definitely not ideal that ZFS doesn't have a
defrag feature, but the above-mentioned characteristic means that eventually
your pool will not be so unbalanced.

> Want to break a pool into smaller pools? Can’t do it. So let’s say you built
> your 2x8 + 2x8 pool. Then a few years from now 40 TB disks are available and
> you want to go back to a simple two disk mirror. There’s no way to shrink to
> just 2x40.

This is now possible. ZoL 0.8 and later support top-level mirror vdev removal.

> Got a 4-disk raidz2 pool and want to add a disk? Can’t do it.

It is true that this is not possible at the moment, but in the interest of
fairness I'd like to mention that it is currently being worked on[1].

> For most fundamental changes, the answer is simple: start over. To be fair,
> that’s not always a terrible idea, but it does require some maintenance down
> time.

This is true, but I believe that the author makes it sound much harder than it
actually is (it does have some maintenance downtime, but because you can
snapshot the filesystem the downtime can be as little as a minute):

    
    
        # Assuming you've already created the new pool $new_pool.
        % zfs snapshot -r $old_pool/ROOT@base_snapshot
        % zfs send $old_pool/ROOT@base_snapshot | zfs recv $new_pool/ROOT
    
        # The base copy is done -- no downtime. Now we take some downtime by stopping all use of the pool.
        % take_offline $old_pool # or do whatever it takes for your particular system
        % zfs set readonly=on $old_pool/ROOT # optional
        % zfs snapshot -r $old_pool/ROOT@last_snapshot
        % zfs send -i @base_snapshot $old_pool/ROOT@last_snapshot | zfs recv -F $new_pool/ROOT # -F: the target may have been touched since the base copy
    
        # Finally, get rid of the old pool and add our new pool.
        % zpool export $old_pool
        % zpool export $new_pool
        % zpool import $new_pool $old_pool # re-import the new pool under the old name
        % zfs mount -a # probably optional
    

[1]:
[https://www.youtube.com/watch?v=Njt82e_3qVo](https://www.youtube.com/watch?v=Njt82e_3qVo)

------
lazylizard
¯\\_(ツ)_/¯

Raidz2 + spares, compression, snapshots, and send/receive are very useful. And
ZIL and cache devices are easier than lvmcache.

------
zozbot234
I'm so sorry teacher, Btrfs ate my homework.

------
gitgudnubs
Storage Spaces is probably the best software RAID available today.
Unfortunately, it comes with Windows.

It supports heterogeneous drives, safe rebalancing (create a third copy, THEN
delete the old copy), fault domains (3-way mirror, but no 2 copies can be on
the same disk/enclosure/server/whatever), erasure coding, hierarchical storage
based on disk type (e.g., use NVMe for the log, SSD for the cache), and
clustering (Paxos, probably). Then you toss ReFS on top, and you're done.

The only compelling reasons to buy Windows Server are to run third-party
software or a Storage Spaces/ReFS file share.

