I was also surprised how good the discussion was on fedora-devel. It seems that people didn't break into camps of "over my dead body", and "if we do not do this I'll leave", etc. It seemed more like a "this is my really thought out proposal, but is it actually feasible or did I forget anything". Then people raised problems, sometimes the proposer(s) already had a solution, sometimes it was something they didn't think of. What's nice is that despite seeing various problems everyone seemed to understand that people are trying to improve things.
I didn't read fedora-devel for many years so that was a welcome change over the past.
The concept of an application 'supporting a (specific) file system' sounds slightly ridiculous to me.
Why is Btrfs so special that apps should be rewritten for it? The always-compared ZFS seems to work just fine without alterations being necessary, so what makes Btrfs worthy of extra consideration?
At most, copy-on-write (COW) file systems as a general class could perhaps be considered, as that may help with things like shingled drives and SSDs (i.e., zoned drives may align well with COW filesystems).
This concept is actually way less ridiculous than you might think, for any application which needs any guarantees about data durability, locking, etc (which includes everything from the obvious ones like postgres to things like Dropbox). I found https://danluu.com/deconstruct-files/ a fascinating read diving into this.
ZFS has no RAID5 write hole... no traditional RAID problems at all (RAID-Z1/2/3).
>leaving files/directories/volumes without CoW
No...just NO, just for a special Cases like a Database.
The thread was about where CoW is problematic; both ZFS and Btrfs are equally problematic in this sense.
> No...just NO, just for a special Cases like a Database.
Like... for what we were talking about.
This can go too far—I still remember web development a decade ago—and I would prefer to have perfect abstractions that never need piercing, but absent that specialized logic seems _fine_.
Clearly you weren’t around in the days when NFS was a thing.
I was and am. I currently help admin an HPC environment with several petabytes of NFS storage.
And the main problems I've experienced with NFS have generally been with Linux locking. I've had a lot less drama with Solaris and BSD NFS (though I did uncover two Solaris NFS server bugs over the years).
Whenever I have NFS issues I generally start with the assumption that there's a bug in the NFS client code.
The fact you’re not having trouble just means either the application writers put in the effort to properly support it, or your use case is simple enough not to run into problems.
Given the hiccups that we've had with MySQL and locking, I'm not sure that's the case. (Less so with Postgres.)
I have only a moderate amount of knowledge of the applications that the dozen research groups use on the cluster, I just make sure the bits flow and haven't heard too many complaints over the years.
You should disable CoW and caching on the filesystem/dataset where VM images and database files reside, but that counts for ZFS as well... for all CoW filesystems, in fact.
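Concretely (paths and the dataset name here are illustrative, not from the thread): on Btrfs you disable CoW per directory with chattr, while ZFS cannot switch off CoW at all, so there the usual advice is to tune caching and record size on the dataset instead:

```shell
# Btrfs: disable CoW for files created in this directory from now on
# (only affects newly created files; set it while the directory is empty)
chattr +C /var/lib/mysql

# ZFS tuning on a hypothetical dataset "tank/db":
# cache only metadata, and match recordsize to the database page size
zfs set primarycache=metadata tank/db
zfs set recordsize=16K tank/db
```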
https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Serv... Relevant parameters: effective_cache_size/shared_buffers/work_mem
"The number one rule for setting up an Oracle database on ZFS is to set ZFS recordsize equal to the database block size for the file systems that contain the Oracle data files."
"Starting with Oracle Solaris 11.1, blocks that are referenced from either the snapshot or from the database image actually share ZFS cache buffers for a reduced memory footprint."
"Taking a snapshot of a database can involve placing the database in hot-backup mode just for the instant necessary to capture the ZFS snapshot. The database can then be returned to normal operation."
"A clone is a full read/write accessible fork of the original database. Therefore, by using either the primary or a replicated storage system, you can cheaply create hundreds of copies of your up-to-date database to run test and development workloads under real data conditions."
They also allow compression of the archived (intent) logs.
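As a sketch (the dataset name is hypothetical), enabling that compression is a one-liner on the dataset holding the archived logs:

```shell
# Compress archived redo/intent logs transparently; lz4 is cheap enough
# that it rarely hurts, and log files tend to compress well
zfs set compression=lz4 tank/oracle/archive
```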
At least for the default libvirt locations, Fedora disables CoW (chattr +C). No change in software needed.
The wrong parity and unrecoverable errors have been confirmed by multiple parties. The Btrfs RAID 5/6 code has been called as much as fatally flawed.
It would be interesting to know if this has been addressed already.
The big fat red warning on btrfs' own wiki page is not inspiring trust either: https://btrfs.wiki.kernel.org/index.php/RAID56
For data, it should be safe as long as a scrub is run immediately after any unclean shutdown.
Anyway a current write-up on btrfs raid 5 here:
Includes this observation:
"btrfs raid5 is quantitatively more robust against data corruption than ext4+mdadm (which cannot self-repair corruption at all), but not as reliable as btrfs raid1 (which can self-repair all single-disk corruptions detectable by csum check)."
But worth reading the whole thing.
I must say that (fingers crossed) until now I haven't had any issues with it. There have been a couple of unclean shutdowns that haven't led to any corruptions. Scrub runs every couple of weeks and on top of that the data has an offline (external disk) and off-site (rsync.net through borgbackup) backup. I'm not expecting BTRFS to let me down, but if it does I have recovery options. I tend to be very careful with my 17+ years of photo archives, especially since I'm generating hundreds of megabytes of new content every week since my daughter was born.
It definitely doesn't look good that RAID 5/6 is broken, but I feel very safe using the other stable RAID modes.
Sometimes I dream of setting up ZFS; the downside, however, is that it doesn't live in the kernel and you're forced to work with out-of-tree kernel modules. It's certainly doable, but since I had a kernel module issue with WireGuard a few months ago, I'll be waiting to see which way the cat jumps for a bit longer before I decide if I actually want to make the move. For now BTRFS feels stable to me, so until that changes, or until the benefits of ZFS increase by a fair amount, I'll probably stay on this.
I'm sure some would be interested in this, but Synology seems to not take GPLv2 too seriously and nobody seems to care about this.
RAID 5/6 in btrfs is relatively neglected because the big users that pay developers to work on btrfs (and put the work up-stream) don't care that much about it. E.g. Facebook is probably using it for their HDFS storage, so HDFS gives them the RAID functionality.
There was a recent discussion on LKML where Chris Mason (I think) described all this.
I bought an old workstation, slapped FreeBSD with ZFS (RAID-Z1 or 2) on it, and I am much happier.
Now I know there are proper reasons for it. Good to know.
Note that btrfs-scrub (according to the docs) looks for on-disk block errors, comparing the data to its checksum (CRC32C or whichever algorithm is in use) to see if it matches. If it can recover the original data, the bad block is rewritten with correct contents and checksum.
So btrfs-scrub does not find or fix any errors with the filesystem structure.
For that you need to run btrfs-check, and that can only be done offline.
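A rough sketch of the two tools' usage (device and mount point are illustrative): scrub works on a mounted filesystem, check only on an unmounted one:

```shell
# Online: verify checksums of data and metadata; with redundant profiles,
# bad copies are rewritten from a good one
btrfs scrub start /mnt/data
btrfs scrub status /mnt/data

# Offline only: verify the filesystem structures themselves
umount /mnt/data
btrfs check /dev/sdb1      # read-only by default; --repair is a last resort
```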
You almost never want to run btrfs-check. In itself, Btrfs is fairly "self healing", and only if you run into "weird things" in the log like invalid index files should you run check, and almost certainly never with the --repair option.
The thing about Btrfs (and ZFS) is that it is Copy-on-Write, and as such, new data is written to an unused part of the filesystem. It doesn't need to modify structures of existing data, thereby reducing the risk of bad metadata.
A file in Btrfs is either written or it is not. If a power outage occurs after a file has been written, but before the metadata is updated, the file is not written, and in case you're updating a file, the old file is still preserved.
If a crash occurs while writing metadata, Btrfs has multiple copies of metadata, and is able to repair itself online.
In the most typical case where you have a crash or power failure, you just reboot normally. No file system check happens nor is it needed. It just continues from the most recent successfully committed transaction. Since it's copy-on-write, there's no inconsistency in the file system, even if the most recent changes might be lost.
If you're not using RAID5 or RAID6 for metadata, the write hole won't affect the filesystem structure and there shouldn't be any issues for btrfs-check to find.
The relative benefits are discussed in "man 5 btrfs" in the "CHECKSUM ALGORITHM" section.
man 5 btrfs | col -b | sed -n '/^CHECKSUM/,/^FILESYSTEM/p'
With one of the two, the filesystem was allowed to fill to 99% capacity (no automatic monitoring), so obviously that is operator error, and I'm not aware of any filesystem that can handle such situations gracefully. The system became unresponsive, with btrfs-transaction taking up an increasing percentage of CPU time. Removing files and snapshots did not increase free space.
So what is super-curious is that the other server, which only ever got up to about 30% full, also started exhibiting the same symptoms: unresponsive, high load from btrfs-transaction.
I was able to mount the filesystems in read-only mode, and recover the files, and checked it against the offsite backup. So no data loss, only a service loss.
Both systems had 10 or so subvolumes, and read-only snapshots were taken three times per day for each subvolume. After 4 years, that's close to 15000 snapshots, maybe more.
I searched around, but didn't find anything particularly relevant to this issue.
Btrfs requires a regularly run administrative task: "btrfs balance". Balancing consolidates data between partially filled block groups, which can reclaim space even when no block group is entirely empty. Typically, one runs balance with a set of gradually increasing usage parameters to control the records affected (i.e., a bash script that starts with block groups that are mostly empty and proceeds toward ones with higher utilization).
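A minimal sketch of such a script (mount point and thresholds are illustrative): each pass only touches block groups at or below the given usage percentage, so the cheap passes run first:

```shell
#!/bin/bash
# Filtered balance with gradually increasing usage thresholds.
# -dusage/-musage=N relocates only block groups that are at most N% full.
MNT=/mnt/data
for pct in 5 10 25 50; do
    btrfs balance start -dusage=$pct -musage=$pct "$MNT"
done
```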
In my experience with btrfs a few years ago, running a balance in some situations would result in incredibly high system-wide latency.
The man page does not appear to reference the free-space issue directly, so I'm not sure if they've removed this need.
No file system should require this kind of regular hand-holding. If it does, it's a bug, and needs to be reported and fixed, not papered over with user-initiated maintenance tasks. It is acceptable to have utilities for optimizing the layout from time to time, but they should be considered optimizations, not requirements. If using them becomes required as a workaround, that's suboptimal, and there's still a bug that needs a proper fix.
In the case a workaround is needed, it's recommended to do targeted filtered balances, not a full one. A full balance isn't going to solve a problem that a proper filtered balance can't.
openSUSE (a distro which notably defaults to btrfs) packages these btrfsmaintenance scripts, and it appears they may have included them in their default install (I can't find a list of packages). Their wiki page on disabling btrfsmaintenance implies that if disabled, manual maintenance (presumably via running, among other commands, some form of balance) is needed.
There's also an entry in the btrfs wiki that indicates running `btrfs balance` will recover unused space in some cases. It helpfully notes that prior to "at least 3.14", balance was "sometimes" needed to recover free space in a filesystem-full state. (The lack of precision here doesn't inspire confidence.)
Another btrfs wiki page indicates that running balance may be needed to recover space "after removing lots of files or deleting snapshots".
Kernel 3.14 is ancient. I can't bring myself to worry about it.
I think your fourth paragraph cherry pick is disingenuous. A more complete excerpt, "There’s a special case when the block groups are completely unused, possibly left after removing lots of files or deleting snapshots. Removing empty block groups is automatic since 3.18."
Fedora 33 users will be getting kernel 5.8 from day one, and 5.9 soon after release.
I doubt my case is atypical. I haven't balanced my years old non-test real world used Btrfs file systems.
I think you could switch your concern and criticism to the fact wikis get stale.
Ah, so Fedora is limiting its use of snapshots to avoid the need for balances? Do you have some info on what level snapshot usage has to rise to before balances are needed on a regular basis? Is Fedora using snapshots at all?
There's no direct correlation between having many snapshots, and needing to balance. The once per month balance used in openSUSE is to preempt or reduce the chances of out of space error in one type of chunk (block group of extents), while there remains significant free space in another type of chunk.
Chunks are created dynamically, and are either type metadata or data (also system but it can be ignored). Different workloads have different data/metadata ratio demands, hence dynamic allocation. Snapshots are almost entirely metadata. More snapshotting means more usage of metadata.
If the pattern dramatically changes, the ratio also changes, and the dynamic allocation can alter course. Except when the disk is fully allocated. In that case, heavy metadata writes will completely fill metadata chunks, and ENOSPC even though there's still unused space in data chunks.
A filtered balance can move extents from one chunk to another, and once a chunk is empty, it can be deallocated. That unallocated space can now be allocated into a different type of chunk, thus avoiding ENOSPC. Or at least when ENOSPC happens, there's essentially no free space in either chunk type, at the same time. A "true" ENOSPC.
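You can see this allocation state directly (mount point illustrative); the interesting numbers are the per-type "Size" vs "Used" and the device-level unallocated space:

```shell
# Per-chunk-type allocation: Data, Metadata, System, plus unallocated space.
# A fully allocated device with full metadata chunks is the premature-ENOSPC
# case described above, even while data chunks still have slack.
btrfs filesystem usage /mnt/data
```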
There are all kinds of mitigations for the problem in newer kernels. And in my opinion it's better not to paper over problems, but to fix the remaining edge cases.
This hasn't been my experience within the past few years, though I do remember it being somewhat necessary in the past. Both systems I'm running btrfs on have their free space within a few percent of the unallocated space (indicating most used blocks are fairly full and little space is wasted).
I would assume that this is not a factor with RAID-1, where all system, metadata, and data chunks are duplicated. I could see this being very important for the higher RAID levels.
We have since stopped deploying any sort of btrfs RAID (even RAID-1), and have gone back to using Linux MD.
If I could mount the failed btrfs filesystem read-write for an appreciable length of time, I could try the balance operation.
I'll need to set up a cron job on the other systems to run balance on a regular basis alongside scrub (obviously not at the exact same time).
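For the record, a sketch of such crontab entries (times and mount point are illustrative), offset so the two jobs never coincide:

```shell
# m h dom mon dow  command
0 2 1 * *  /usr/bin/btrfs scrub start -B /mnt/data            # monthly scrub
0 4 * * 0  /usr/bin/btrfs balance start -dusage=50 /mnt/data  # weekly filtered balance
```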
On a slightly unrelated note, I once suffered a complete failure of BTRFS where after a shutdown it just wouldn't mount anything again. Interestingly, on IRC, I was told that this is because the firmware on my Samsung NVMe SSD was buggy. It might be, but ext4 has not failed me once in that regard.
An inability to mount Btrfs means metadata has been hit by some kind of corruption, and Btrfs has more structures that are critical; if any of them is hit, you see a mount failure. Repair can be harder on Btrfs as well.
But it also has a lot more opportunities for recovery that aren't very well understood or discussed. In part because very serious problems like this aren't that common. And also, quite a lot of people give up. Maybe they have backups and just start over with a new file system and restore. Or through no fault of their own they don't persevere with 'btrfs restore' - which is a very capable tool but requires specialized knowledge right now to use it effectively.
One of the things that'll take a mindset shift is the idea of emphasizing recoveries over repairs. One improvement coming soonish (hopefully end of the year) is more tolerant read-only rescue mount option, making it possible for users to recover with normal tools rather than 'btrfs restore'.
Which would be RAID 5/6 and 1.
"When 'btrfs scrub' is used for a raid5 array, it still runs a thread for each disk, but each thread reads data blocks from all disks in order to compute parity. This is a performance disaster, as every disk is read and written competitively by each thread"
You know what, thinking about it, I do actually have some rather choice words to describe the situation:
This boggles the mind to a level that requires further explanation, because the casual observer would likely fail to grasp the enormity of the failure that has occurred here. This isn't like, "oops, I forgot to up-shift gears in my car when going on the onramp", this is more like "the pilot forgot about the flaps after takeoff and the plane ran out of fuel.". There's a fundamental difference in the expectation of quality between, say, a random command line utility and a RAID filesystem.
To give some context: BTRFS was developed largely concurrent with, and in direct competition to Sun's ZFS. Unlike all previous SAN arrays, RAID cards, and filesystems, ZFS was explicitly designed for reliability. Sun famously had a 'test rig' where they abused each new build to death. Physically pulling disks. Randomly corrupting blocks. Running multiple operation types in parallel, while pulling disks. That kind of thing.
When I read ZFS whitepapers, I was amazed at how many fundamental flaws in RAID integrity protection they discovered, and then solved. Rigorously.
Meanwhile, BTRFS literally says, in 2020: Don't trust it, especially not for metadata, or data, or while scrubbing, which you had better babysit, otherwise say goodbye to your production environment!
More fun quotes:
- plan for the filesystem to be unusable during recovery.
- be prepared to reboot multiple times during disk replacement.
- btrfs raid5 does not provide as complete protection against on-disk data corruption as btrfs raid1 does.
- scrub and dev stats report data corruption on wrong devices
- scrub sometimes counts a csum error as a read error instead
- errors during readahead operations are repaired without incrementing dev stats, discarding critical failure information. This is not just a raid5 bug, it affects all btrfs profiles.
PS: To the people downvoting this, please explain how you like people to be uninformed about catastrophic data corruption going ignored for 4 years below in the comments.
You're describing this like it's supposed to work. Btrfs-raid5 is clearly labelled as a bad idea in pretty much every relevant doc. It's not an issue that the tool is broken for this use case because you're not expected to actually try that configuration. mkfs.btrfs specifically says: "RAID5/6 is still considered experimental and shouldn’t be employed for production use."
It's really not that big of an issue in that context. "How was running with scissors never tested for safety?" - just don't try it in the first place.
Personally, I would understand if some developers tried a new approach to this RAID 5/6 and possibly came up with new and better solutions. I am wondering why it seems to be available in mainline. Why wasn't the patch/fork/branch rejected, or why wasn't the option disabled or the code removed entirely? I don't have btrfs installed, but it sounds like this option is available in every btrfs release, although users are warned NOT to use it. Why is it available in the first place?!
ZFS started development in 2001, and released on Solaris 10 in 2006. Btrfs started development in 2007, and the disk format was stabilized in 2013. ZFS has continued to gain features over the years, but I would say btrfs was largely developed after ZFS and still hasn't reached feature parity with the original release of ZFS due to the RAID5 issue.
Btrfs is slower than lvm+ext4, doesn't like working with large files, requires more ongoing maintenance (scrub, rebalancing), and is more prone to data corruption. Given that lvm can do snapshots under ext4, the only real benefit of btrfs is btrfs send, but for most use cases that doesn't seem like a large enough benefit to be worth the rest of btrfs' drawbacks.
Subvolumes: "Lightweight partitions" allow you more easily to define the scope of a snapshot. With ext4+lvm you have to create a new partition and logical volume.
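As a sketch (paths illustrative), the Btrfs side is two commands, with no resizing or new mount points required:

```shell
# Create a "lightweight partition" and snapshot just that scope
btrfs subvolume create /mnt/data/projects
btrfs subvolume snapshot -r /mnt/data/projects /mnt/data/.snapshots/projects-2020-07-01
```

The lvm+ext4 equivalent would be lvcreate, mkfs, an fstab entry, and then lvcreate --snapshot against a volume whose size you had to guess in advance.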
I don't think it is a big game-changer, but then I don't think the other things are such dealbreakers as many people think, either. I consider "is more prone to data corruption" largely historical/anecdotal.
Btrfs send sounds appealing, but personally, I prefer my backup solution to be independent of the filesystem.
If there's no LVM, most point-and-click tools allow for resizing a partition and a contained filesystem in a fairly simple process. (Having a LUKS encrypted volume might complicate that, but if the encryption is on the filesystem level, that could also be avoided.)
People who dual-boot might also need to resize their partitions if their needs regarding the operating systems change, or if they made a mistake in their original allocations.
That's probably not enough of a reason to make a major switch, but it could be one task that could be simplified.
Looking at my disk in gparted now, while it shows the LVM physical volume (and would presumably allow resizing it if it weren't in use), it doesn't seem to show the logical volumes and their filesystems inside of the physical volume at all. The partition information window for the LVM PV does list the logical volumes, but not the main window where you can plan the changes.
I don't know if that's a general limitation of gparted or if it's a peculiarity with my setup, but that's how it appears to me.
I'd challenge that. The main reason for using btrfs on the desktop is precisely that your data is less likely to be corrupted. Specifically, unlike lvm+ext4, btrfs does full data checksums and is therefore capable of detecting silent data corruption.
I'm not sure that's a good description. Scrubbing is an option that you get extra. You don't need to use it and the behaviour won't be different than for example ext4 with regards to bad data detection. It's purely an extra feature.
If rebalancing is useful for desktop users (wasn't really in my experience), I'm sure it will get a system-provided job that balances the resource use and amount of reclaimed space.
There is some debate on this question:
I'm curious if the situation has improved for Btrfs.
Instant, unlimited, zero-weight snapshots. You'd be surprised how often you use snapshots when you have them. A couple weeks ago I screwed up a save point in a game. Instead of just living with my mistake, I rolled my Steam directory back an hour.
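Assuming the Steam library lives on its own subvolume with read-only hourly snapshots (a hypothetical layout, not what the commenter described in detail), the rollback amounts to a swap:

```shell
# Put the broken state aside, then promote a writable copy of the
# last good read-only snapshot into its place
mv /home/user/steam /home/user/steam.broken
btrfs subvolume snapshot /mnt/.snapshots/steam-hourly /home/user/steam
```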
If I recall correctly, this was because they didn't have the developers able to work on the software, preferring to use XFS + LVM to accomplish some of the goals of BTRFS in their Stratis project.
I wonder what this means for RHEL going forward in RHEL 9?
I wouldn't say it means nothing because Fedora is looked at as a proving ground for inclusion in RHEL, but I would agree that one shouldn't read much into it.
There are plenty of software packaged/supported on Fedora that isn't and won't be shipped in RHEL. BTRFS may or may not just be yet another one like that. I've heard/seen more excitement about Stratis (which does seem awesome so far) than I have btrfs.
The thing to understand about RHEL is they backport... everything.... to ancient kernels.
Code that has lots of churn can be very difficult to backport, particularly to such an old codebase.
I can't see needing BTRFS on my granddaddy boxes, but we've definitely made use of backported code.
- No this doesn't affect RHEL.
- It's only for Fedora Desktop spin (which for various reasons including this, but also others, you shouldn't use even on a Desktop - I install Fedora Server on my laptop).
- Only a subset of btrfs features will be used, especially avoiding the ones which are known to be problematic.
I use Fedora Workstation on several laptops and desktops (without issue). I'm curious if I'm missing some problem/opportunity.
(These spins are all about defaults - so installing Server doesn't make it any less useful for desktops, but you may have to dnf install a few things the first time you use it.)
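For instance (the group name is an assumption based on current Fedora comps groups; verify with `dnf group list`), turning a Server install into a GNOME desktop is roughly:

```shell
# Pull in the Workstation desktop environment on top of a Server install
dnf install @workstation-product-environment
systemctl set-default graphical.target
```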
If by default it doesn't install the email client and music players I don't use, that might be nice, though.
The "workstation" version uses GNOME.
$ systemctl status firewalld
● firewalld.service - firewalld - dynamic firewall daemon
Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)
- on one hand, nothing listens on these ports by default; and they have a point that it does break user applications;
- on the other hand, they should be closed, and users should open them explicitly once they install an app that needs it. However, I'm afraid we would see exactly the same thing we see with SELinux: many tutorials about installing apps or services start with "Disable SELinux" instead of how to enable exactly what is needed. We would see the same thing with firewall and "Disable firewall" as step 1.
- as a sidenote to the second point: many apps do not ship with a firewalld profile, so it is not a matter of `firewall-cmd --add-service $appname` either.
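For apps that do ship a profile it really is a one-liner; otherwise you fall back to raw ports (service name and port here are just examples):

```shell
# With a shipped firewalld profile:
firewall-cmd --permanent --add-service=https

# Without one, open the port directly:
firewall-cmd --permanent --add-port=8080/tcp
firewall-cmd --reload
```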
I don't mind it becoming popular, but please for the love of $Deity can we please make the docs as good as ZFS. I will contribute cash if needs be.
Facebook are utterly shit at documenting things, so yes it might work for their usecase, but they essentially store knowledge through Shamanism, which is terrible unless you're inducted into the world of the spirits.
Absolutely; the best one is on the Arch Wiki, and even that one is meh.
The btrfs-convert tool hypothetically leaves the ext filesystem all but untouched, and COW's the needed filesystem metadata onto the end of it, with data modifications coming thereafter (or intelligently stored within the free space of the ext system). You wait until you feel comfortable with BTRFS, then delete the preserved ext system and run a balance, which rewrites all data to disk in the usual structure. Alternatively, the preserved system can be restored, although I don't actually see instructions for that.
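For what it's worth, btrfs-convert's man page does document a rollback flag; a sketch of the full cycle (device name illustrative):

```shell
# Convert in place (back up first regardless)
btrfs-convert /dev/sdb1

# The original filesystem is preserved as the 'ext2_saved' subvolume.
# Roll back to the original ext filesystem:
btrfs-convert -r /dev/sdb1

# Or commit to Btrfs: drop the saved image, then rewrite everything
mount /dev/sdb1 /mnt
btrfs subvolume delete /mnt/ext2_saved
btrfs balance start /mnt
```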
I love btrfs and would trust it but am glad to have backups of irreplaceable data.
The wiki also doesn't mention how well this process plays with LUKS, so, given how sensitive headers and such can be, I'll probably wait to hear from some F33 early-adopters on the conversion tool to see how it turns out. I'm reasonably happy with my ext4 system as it is, and most of my core data is on backup disks anyway, so I'm not sure how much the integrity checking of BTRFS would help. Seems like it would be more critical to have on the backup disks to make sure I'm not propagating the bit-rot.
Good call: subvolumes came to the rescue of the /home partitioning. It's useful, but the inflexibility was in properly splitting the available disk space: novice users then saw low-space warnings for /home when there was still plenty of space on the root partition.
If you want to get a better picture, also about ZFS and Fedora, read the previous Btrfs threads where the developers took the time to discuss it and to kill some FUD.
I have no association with Btrfs or Fedora but I'd like to have a modern FS in-tree as battle tested as it can be.
It can take multiple seconds (like 5-10) on my fileserver to delete just one ~10GB file.
laughs in zfs
I have one CentOS machine where I can't update ZFS from 0.7 to 0.8, if I want to boot again (zfs#8885).
This seems to be more of a udev and device-discovery problem than something inherent to ZFS/ZoL.
I'm pretty sure I can't replicate this bug on Debian or Ubuntu.
The point is, you chose your distro for a reason and one little piece of software is far from enough to override that.