Btrfs Coming to Fedora 33 (fedoramagazine.org)
107 points by caution 6 months ago | 137 comments



I've quickly skimmed the discussion on fedora-devel regarding btrfs. I wondered mainly how they'd handle the various cases where btrfs does not work well, e.g. files that change often in place (databases, VMs, etc). Apparently an application can ask for those files to be treated differently. So it's basically a matter of fixing various software to work nicely with btrfs as well as any similar filesystem. As mentioned in the thread, openSUSE has already used btrfs for years. I do wonder why it isn't more supported by upstream software (postgres, VM things, etc).

I was also surprised how good the discussion was on fedora-devel. It seems that people didn't break into camps of "over my dead body", and "if we do not do this I'll leave", etc. It seemed more like a "this is my really thought out proposal, but is it actually feasible or did I forget anything". Then people raised problems, sometimes the proposer(s) already had a solution, sometimes it was something they didn't think of. What's nice is that despite seeing various problems everyone seemed to understand that people are trying to improve things.

I didn't read fedora-devel for many years so that was a welcome change over the past.


> So it's basically a matter of fixing various software to work nicely with btrfs as well as any similar filesystem. I do wonder why it isn't more supported by upstream software (postgres, VM things, etc).

The concept of an application 'supporting a (specific) file system' sounds slightly ridiculous to me.

Why is Btrfs so special that apps should be rewritten for it? The always-compared ZFS seems to work just fine without alterations being necessary, so what makes Btrfs worthy of extra considerations?

At most, copy-on-write (COW) file systems as a general class could perhaps be considered, as that may help with things like shingled drives and SSDs (i.e., zoned drives may align well with COW FSes).


Historically programs are "rewritten" (really, adapted) all the time to support changes to the underlying system. E.g., read-ahead, mmap() support, fadvise() support, file locking, different file locking semantics and failure modes in network filesystems, sparse files, hashed directory entries, asynchronous I/O, changes to underlying media characteristics like SSD seek times, etc.


> The concept of an application 'supporting a (specific) file system' sounds slightly ridiculous to me.

This concept is actually way less ridiculous than you might think, for any application which needs any guarantees about data durability, locking, etc (which includes everything from the obvious ones like postgres to things like Dropbox). I found https://danluu.com/deconstruct-files/ a fascinating read diving into this.


Because Fedora, among other distributions, will not include ZFS for licensing reasons. (Leaving aside any other discussion of the relative merits of ZFS and Btrfs.)


ZFS has a good share, if not 100%, of the same problems as Btrfs, and at the end of the day it's always a matter of leaving files/directories/volumes without CoW, which is hardly a software rewrite as the grandparent insinuated, and more of a packaging issue.


>ZFS has a good share, if not 100%

ZFS has no Raid5 write-hole...no traditional Raid problems at all (z1/2/3)

>leaving files/directories/volumes without CoW

No... just NO; only for special cases like a database.


> ZFS has no Raid5 write-hole...no traditional Raid problems at all (z1/2/3)

The thread was about where CoW is problematic; both ZFS and Btrfs are identically problematic in this sense.

> No... just NO; only for special cases like a database.

Like... for what we were talking about.


To me, it doesn't sound any more ridiculous than applications having specialized support on different architectures. Most applications get away without it, but when performance really matters and general-purpose abstractions aren't sufficient, having special cases seems okay—especially when there's only a small number of architectures (or filesystems) to handle.

This can go too far—I still remember web development a decade ago—and I would prefer to have perfect abstractions that never need piercing, but absent that specialized logic seems _fine_.


> The concept of an application 'supporting a (specific) file system' sounds slightly ridiculous to me.

Clearly you weren’t around in the days when NFS was a thing.


> Clearly you weren’t around in the days when NFS was a thing.

I was and am. I currently help admin an HPC environment with several petabytes of NFS storage.

And the main problems I've experienced with NFS have generally been with Linux locking. I've had a lot less drama with Solaris and BSD NFS (though I did uncover two Solaris NFS server bugs over the years).

Whenever I have NFS issues I generally start with the assumption that there's a bug in the NFS client code.


If you think applications work on NFS by just expecting general Unix filesystem semantics you’re in for a surprise.

The fact you’re not having trouble just means either the application writers put in the effort to properly support it, or your use case is simple enough not to run into problems.


> either the application writers put in the effort to properly support it

Given the hiccups that we've had with MySQL and locking, I'm not sure that's the case. (Less so with Postgres.)

I have only a moderate amount of knowledge of the applications that the dozen research groups use on the cluster, I just make sure the bits flow and haven't heard too many complaints over the years.


Clearly packages like MySQL and PostgreSQL made sure they work on NFS, or they would very clearly tell you not to use it because they'd know it's a recipe for disaster. The issues I am referring to are typically with things like homegrown shell scripts or research software.


Wait, NFS isn't a thing anymore?


>databases, VMs, etc

You should disable CoW and caching on the filesystem/dataset where VM and database files reside, but that goes for ZFS as well... well, for all CoW filesystems in fact.
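
For example, something like this minimal sketch (paths and dataset names are made up; on Btrfs the attribute only applies to files created after it is set):

    # Btrfs: disable copy-on-write on a directory that will hold VM images
    mkdir -p /srv/vm-images
    chattr +C /srv/vm-images

    # ZFS: CoW itself cannot be turned off, but you can stop caching file
    # data in the ARC for a dataset that holds database files
    zfs set primarycache=metadata tank/db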


I understand disabling CoW for databases, as the DB itself will manage it, but why disable caching?


Depends on the RDBMS, but for big ones it's common practice to tune[0] them to allocate a big chunk of memory (or all of it) and let them manage caching by themselves. After all, they know better about the structure of the db files than the OS.

[0]https://wiki.postgresql.org/wiki/Tuning_Your_PostgreSQL_Serv... Relevant parameters: effective_cache_size/shared_buffers/work_mem
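
As a rough illustration of the knobs mentioned there (values are placeholders, not recommendations; the wiki page above is the real reference), a postgresql.conf tweak might look like:

    # postgresql.conf sketch: let PostgreSQL manage a large buffer pool itself
    shared_buffers = '8GB'          # PostgreSQL's own data cache
    effective_cache_size = '24GB'   # planner hint about OS cache, not an allocation
    work_mem = '64MB'               # per-sort/per-hash working memory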


Oracle offers a document for running their database on ZFS.

https://www.oracle.com/technetwork/server-storage/solaris10/...

"The number one rule for setting up an Oracle database on ZFS is to set ZFS recordsize equal to the database block size for the file systems that contain the Oracle data files."

"Starting with Oracle Solaris 11.1, blocks that are referenced from either the snapshot or from the database image actually share ZFS cache buffers for a reduced memory footprint."

"Taking a snapshot of a database can involve placing the database in hot-backup mode just for the instant necessary to capture the ZFS snapshot. The database can then be returned to normal operation."

"A clone is a full read/write accessible fork of the original database. Therefore, by using either the primary or a replicated storage system, you can cheaply create hundreds of copies of your up-to-date database to run test and development workloads under real data conditions."

They also allow compression of the archived (intent) logs.


In the case of Postgres, it knows better what to cache; you can let ZFS cache on top of that, but it's a bit of a waste. Better to give that RAM to Postgres than to the ARC.


I think PostgreSQL actually is a bit of an odd one in that it relies on the OS to do proper caching of whatever data is read from disk. It has its own buffers for various things, but I'm pretty sure it doesn't (yet) implement its own disk cache.


No, with ZFS you just tune the recordsize of the dataset to match your database record size.


I don't think CoW can be disabled on ZFS... which is fine, because VMs and databases perform pretty well on ZFS, even with CoW (a block size adjustment might be needed, though).


Yes, true, but with databases you should change the recordsize to 8k or 16k (Postgres or MySQL, respectively), set full_page_writes in Postgres to off (ZFS takes care of that), and cache just the metadata, because the Postgres cache is better informed about what to cache.
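
A minimal sketch of that tuning for a hypothetical PostgreSQL dataset (the dataset name is made up):

    # Match ZFS recordsize to PostgreSQL's 8 KiB block size, cache only metadata
    zfs set recordsize=8k tank/pgdata
    zfs set primarycache=metadata tank/pgdata

    # full_page_writes can be turned off because ZFS never writes torn pages
    psql -c "ALTER SYSTEM SET full_page_writes = off"
    psql -c "SELECT pg_reload_conf()"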


How does ZFS avoid the performance pitfalls of btrfs's CoW implementation for VMs and databases?


Can you explain the reasoning for this?


On a copy-on-write filesystem, modification of part of a file means fragmenting it, which carries a large performance penalty. Disabling the CoW behavior means you either cannot take snapshots of the file, or that snapshots need to duplicate the file in its entirety, but it allows for quick modification of the file in-place without fragmenting.


Is that assuming HDDs not SSDs ?


> I wondered mainly how they'd handle the various cases where btrfs does not work well, e.g. files that change often in place (databases, VMs, etc).

At least for the default libvirt locations, Fedora disables CoW (chattr +C). No change in software needed.
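
If you want to check whether a directory got the attribute (path as on a default Fedora/libvirt install), lsattr shows a 'C' flag for No_COW:

    # The 'C' attribute in the output means copy-on-write is disabled
    lsattr -d /var/lib/libvirt/images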


This is what I was thinking as well. It just means packagers have to be mindful of what kind of files their maintained software makes and to appropriately carry these metadata changes where they are needed. I don't really see an issue with this mindset.


btrfs has made some headlines in the past about its severely broken checksum computation in RAID modes, which rules them out for production use, e.g. https://phoronix.com/scan.php?page=news_item&px=Btrfs-RAID-5...

The wrong parity and unrecoverable errors have been confirmed by multiple parties. The Btrfs RAID 5/6 code has been called nothing short of fatally flawed.

It would be interesting to know if this has been addressed already.

The big fat red warning on btrfs' own wiki page is not inspiring trust either: https://btrfs.wiki.kernel.org/index.php/RAID56

For data, it should be safe as long as a scrub is run immediately after any unclean shutdown


RAID 5/6 isn't exactly relevant to this article, given it's about the laptop/desktop default file system. And the installer won't let you do RAID 5/6.

Anyway a current write-up on btrfs raid 5 here:

https://lore.kernel.org/linux-btrfs/20200627032414.GX10769@h...

Includes this observation:

"btrfs raid5 is quantitatively more robust against data corruption than ext4+mdadm (which cannot self-repair corruption at all), but not as reliable as btrfs raid1 (which can self-repair all single-disk corruptions detectable by csum check)."

But worth reading the whole thing.


I configured BTRFS for my data a couple of years ago on my Debian machine. It's using RAID10 - and the RAID 5/6 issue was widely known back then so I did not dare to touch that.

I must say that (fingers crossed) until now I haven't had any issues with it. There have been a couple of unclean shutdowns that haven't led to any corruptions. Scrub runs every couple of weeks and on top of that the data has an offline (external disk) and off-site (rsync.net through borgbackup) backup. I'm not expecting BTRFS to let me down, but if it does I have recovery options. I tend to be very careful with my 17+ years of photo archives, especially since I'm generating hundreds of megabytes of new content every week since my daughter was born.

It definitely doesn't look good that RAID 5/6 is broken, but I feel very safe using the other stable RAID modes.

Sometimes I'm dreaming of setting up ZFS - however, the downside of that is that it doesn't live in the mainline kernel and you're forced to work with out-of-tree kernel modules. It's certainly doable, but since I had a kernel module issue with Wireguard a few months ago I'll be "looking the cat out of the tree" for a bit longer before I decide if I actually want to make the move. For now BTRFS feels stable for me, so until that changes - or the benefits of ZFS increase by a fair amount - I'll probably stay on this.


It's interesting to see how Synology "solves" this problem: https://www.synology.com/en-global/knowledgebase/DSM/tutoria...


Synology has solved it by running btrfs on top of mdraid, then patching it so that btrfs reports errors to mdraid and putting repair code into mdraid. This reliably works.

I'm sure some would be interested in this, but Synology seems to not take GPLv2 too seriously and nobody seems to care about this.

RAID 5/6 in btrfs is relatively neglected because the big users that pay developers to work on btrfs (and put the work up-stream) don't care that much about it. E.g. Facebook is probably using it for their HDFS storage, so HDFS gives them the RAID functionality.


They use it for everything except MySQL databases, which are stored on XFS.

There was a recent discussion on LKML where Chris Mason (I think) described all this.


Does XFS have better performance than ext4 for MySQL DBs?


I really don't trust Synology anymore; I had so many destroyed RAIDs in the past, mainly with the RS407.

I buy an old workstation, slap FreeBSD with ZFS (RAID-Z1 or 2) on it, and I am much happier.


Exactly the same way ReadyNAS solves it...


I observed that on my ReadyNAS too and always thought it was a weird setup.

Now I know there are proper reasons for it. Good to know.


> For data, it should be safe as long as a scrub is run immediately after any unclean shutdown

Note that btrfs-scrub (according to the docs) is looking for on-disk block errors, comparing the data to the CRC (or whatever algo) to see if that matches. If it can recover the original data, the bad block is re-written with the correct CRC.

So btrfs-scrub does not find or fix any errors with the filesystem structure.

For that you need to run btrfs-check, and that can only be done offline.
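
Roughly, the two operations look like this (mount point and device are placeholders):

    # Online: verify checksums of data and metadata, repairing from a good
    # copy where the RAID profile provides one
    btrfs scrub start -B /mnt/data
    btrfs scrub status /mnt/data

    # Offline only: check the consistency of the filesystem structures
    umount /mnt/data
    btrfs check /dev/sdb1    # read-only by default; --repair is a last resort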


> For that you need to run btrfs-check, and that can only be done offline.

You almost never want to run btrfs-check. In itself, Btrfs is fairly "self healing", and only if you run into "weird things" in the log like invalid index files should you run check, and almost certainly never with the --repair option.

The thing about Btrfs (and ZFS) is that it is Copy-on-Write, and as such, new data is written to an unused part of the filesystem. It doesn't need to modify structures of existing data, thereby reducing the risk of bad metadata.

A file in Btrfs is either written or it is not. If a power outage occurs after a file has been written, but before the metadata is updated, the file is not written, and in case you're updating a file, the old file is still preserved.

If a crash occurs while writing metadata, Btrfs has multiple copies of metadata, and is able to repair itself online.


This is true, although since kernel 5.2 there is the tree checker, and it continues to be enhanced. There's read-time and write-time tree checking that can catch specific errors in the file system and prevent further confusion. This usually means the filesystem goes read-only.

In the most typical case where you have a crash or power failure, you just reboot normally. No file system check happens nor is it needed. It just continues from the most recent successfully committed transaction. Since it's copy-on-write, there's no inconsistency in the file system, even if the most recent changes might be lost.


The reasoning for running a scrub after an unclean shutdown is not to catch corruption of the filesystem structure, but to catch data corruption resulting from the RAID5 write hole.

If you're not using RAID5 or RAID6 for metadata, the write hole won't affect the filesystem structure and there shouldn't be any issues for btrfs-check to find.


The latest mkfs.btrfs on my Oracle Linux 7 supports "crc32c, xxhash, sha256 or blake2."

The relative benefits are discussed in "man 5 btrfs" in the "CHECKSUM ALGORITHM" section.

    man 5 btrfs | col -b | sed -n '/^CHECKSUM/,/^FILESYSTEM/p'
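
The algorithm is picked when the filesystem is created, e.g. (device name is a placeholder):

    # Use xxhash instead of the default crc32c
    mkfs.btrfs --csum xxhash /dev/sdX1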


There are some modes that are horribly broken in btrfs (e.g. most of RAID except RAID1), but the common options are safe and in continuous use.


We recently suffered a catastrophic failure with btrfs RAID-1 on two fileservers. They were both RAID-1 with two 4TB drives, stock Ubuntu 16.04.

With one of the two, the filesystem was allowed to fill to 99% capacity (no automatic monitoring), so obviously that is operator error, and I'm not aware of any filesystem that can handle such situations gracefully. The system became unresponsive, with btrfs-transaction taking up an increasing percentage of CPU time. Removing files and snapshots did not see an increase in free space.

So what is super-curious is that the other server, which only ever got up to about 30% full, also started exhibiting the same symptoms: unresponsive, high load from btrfs-transaction.

I was able to mount the filesystems in read-only mode, and recover the files, and checked it against the offsite backup. So no data loss, only a service loss.

Both systems had 10 or so subvolumes, and read-only snapshots were taken three times per day for each subvolume. After 4 years, that's close to 15000 snapshots, maybe more.

I searched around, but didn't find anything particularly relevant to this issue.


My experience with btrfs is from a few years ago, so it's possible they've changed the semantics here since then, but this might be applicable:

Btrfs requires a regularly run administrative task via "btrfs balance". Balancing consolidates data between partially filled block groups, which will restore consumed space even if a block group is not entirely empty. Typically, one runs balance with a set of gradually increasing parameters to control the records affected (ie: bash script that starts with block groups that are more empty and proceeds towards ones with a higher percentage utilization).
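
Something like the following, with arbitrary usage thresholds (mount point is a placeholder):

    # Filtered balance: start with nearly-empty block groups and work upward,
    # so each pass moves as little data as possible
    for pct in 10 25 50 75; do
        btrfs balance start -dusage=$pct -musage=$pct /mountpoint
    done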

In my experience with btrfs a few years ago, running a balance in some situations would result in incredibly high system-wide latency.

The man page [1] does not appear to reference the free space issue directly, so I'm not sure if they've removed this need.

1: https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-balanc...


What is the source of "Btrfs requires a regularly run administrative task...".

No file system should require this kind of regular hand-holding. If it does, it's a bug, and needs to be reported and fixed, not papered over with user-initiated maintenance tasks. It is acceptable to have utilities for optimizing the layout from time to time, but they should be considered optimizations, not requirements. If it becomes necessary to use them as a workaround, that's suboptimal at best. But there's still a bug that needs a proper fix.

In the case a workaround is needed, it's recommended to do targeted filtered balances, not a full one. A full balance isn't going to solve a problem that a proper filtered balance can't.


There are whole repos[1] dedicated to scripts to perform the delicate work of running `btrfs balance` in various ways to ensure space is recovered.

openSUSE (a distro which notably defaults to btrfs) packages these btrfsmaintenance scripts [2], and it appears they may have included them in their default install (I can't find a list of packages). Their wiki page on disabling btrfsmaintenance [5] implies that if disabled, manual maintenance (presumably via running, among other commands, some form of balance) is needed.

There's also an entry in the btrfs wiki [3] that indicates running `btrfs balance` will recover unused space in some cases. It helpfully notes that prior to "at least 3.14", balance was "sometimes" needed to recover free space in a file-system full state. (The lack of precision here doesn't inspire confidence)

Another btrfs wiki page [4] indicates that running balance may be needed to recover space "after removing lots of files or deleting snapshots".

1: https://github.com/kdave/btrfsmaintenance
2: https://software.opensuse.org/package/btrfsmaintenance
3: https://btrfs.wiki.kernel.org/index.php/FAQ#What_does_.22bal...
4: https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs-balanc...
5: https://en.opensuse.org/SDB:Disable_btrfsmaintenance


openSUSE has automatic snapshots with a fairly extensive retention policy.

Kernel 3.14 is ancient. I can't bring myself to worry about it.

I think your fourth paragraph cherry pick is disingenuous. A more complete excerpt, "There’s a special case when the block groups are completely unused, possibly left after removing lots of files or deleting snapshots. Removing empty block groups is automatic since 3.18."

Fedora 33 users will be getting kernel 5.8 from day one, and 5.9 soon after release.

I doubt my case is atypical. I haven't balanced my years old non-test real world used Btrfs file systems.

I think you could switch your concern and criticism to the fact wikis get stale.


> openSUSE has automatic snapshots with a fairly extensive retention policy.

Ah, so Fedora is limiting its use of snapshots to avoid the need to have balances occur? Do you have some info on what level snapshot usage has to rise to before balances are needed on a regular basis? Is Fedora using snapshots at all?


There is no automatic snapshotting regime on Fedora.

There's no direct correlation between having many snapshots, and needing to balance. The once per month balance used in openSUSE is to preempt or reduce the chances of out of space error in one type of chunk (block group of extents), while there remains significant free space in another type of chunk.

Chunks are created dynamically, and are either type metadata or data (also system but it can be ignored). Different workloads have different data/metadata ratio demands, hence dynamic allocation. Snapshots are almost entirely metadata. More snapshotting means more usage of metadata.

If the pattern dramatically changes, the ratio also changes, and the dynamic allocation can alter course. Except when the disk is fully allocated. In that case, heavy metadata writes will completely fill the metadata chunks, and you get ENOSPC even though there's still unused space in data chunks.

A filtered balance can move extents from one chunk to another, and once a chunk is empty, it can be deallocated. That unallocated space can now be allocated into a different type of chunk, thus avoiding ENOSPC. Or at least when ENOSPC happens, there's essentially no free space in either chunk type, at the same time. A "true" ENOSPC.
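
A quick way to see how close you are to that state is to compare allocated vs. used space per chunk type (mount point is a placeholder):

    # Shows how much space is allocated to data vs. metadata chunks,
    # and how much of each is actually used
    btrfs filesystem usage /mountpoint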

There are all kinds of mitigations for the problem in newer kernels. And in my opinion it's better not to paper over problems, but to fix the remaining edge cases.


> Btrfs requires a regularly run administrative task via "btrfs balance"

This hasn't been my experience within the past few years, though I do remember it being somewhat necessary in the past. Both systems I'm running btrfs on have their free space within a few percent of the unallocated space (indicating most used blocks are fairly full and little space is wasted).


Are you using snapshots in btrfs? Perhaps trying to use this feature was what caused it to be needed for my use case.


Interesting.

I would assume that this is not a factor with RAID-1, where all system, metadata, and data chunks are duplicated. I could see this as being very important for the higher RAID levels.

We have since stopped deploying any sort of btrfs RAID (even RAID-1), and have gone back to using Linux MD.


Free space and the space taken up by metadata and snapshots can take a little attention to figure out

https://ohthehugemanatee.org/blog/2019/02/11/btrfs-out-of-sp...


Now that is interesting.

If I could mount the failed btrfs filesystem read-write for an appreciable length of time, I could try the balance operation.

I'll need to set up a cron job on the other systems to run balance on a regular basis alongside scrub (obviously not at the exact same time).


To be clear: I was running a non-raid (single disk) setup where I found `btrfs balance` to be needed.


I suffered a similar breakdown with no raid. But this was about 5 years ago. The filesystem filled up, had plenty of snapshots, but removing snapshots did not actually clean up any space. Had this happen twice. So I don't think this is related to RAID. Just regular old butter filesystem.

On a slightly unrelated note, I once suffered a complete failure of BTRFS where after a shutdown it just wouldn't mount anything again. Interestingly, on IRC, I was told that this is because the firmware on my Samsung NVMe SSD was buggy. It might be, but ext4 has not failed me once in that regard.


With SSDs it's common that they start returning zeros or garbage, sometimes transiently, before they fail. In the case of single-device Btrfs, it basically acts as an early warning detection system, and it will be a lot more sensitive to this than other file systems because both metadata (the fs itself) and your data are checksummed. And the data is a much bigger portion of the payload on your drive. It's way more likely to be hit by hardware-related issues, including even memory bit flips.

An inability to mount Btrfs means metadata has been hit by some kind of corruption, and there are more structures in Btrfs that are critical. If they're hit, you see mount failure. And repair can be harder on Btrfs as well.

But it also has a lot more opportunities for recovery that aren't very well understood or discussed. In part because very serious problems like this aren't that common. And also, quite a lot of people give up. Maybe they have backups and just start over with a new file system and restore. Or through no fault of their own they don't persevere with 'btrfs restore' - which is a very capable tool but requires specialized knowledge right now to use it effectively.

One of the things that'll take a mindset shift is the idea of emphasizing recoveries over repairs. One improvement coming soonish (hopefully end of the year) is more tolerant read-only rescue mount option, making it possible for users to recover with normal tools rather than 'btrfs restore'.


That SSD seemed healthy in every other respect and it's still chugging along today just fine. On ZFS no less. The kicker in the catastrophic failure was that none of the recovery tools worked and I'd need to write my own recovery tools to fill in the gaps where the metadata got corrupted. I also did not have much time to faff about - I imaged the disk and reinstalled the machine. As for the metadata, I was under the impression that there were no checksums for metadata in BTRFS?


Superblocks, and every node and leaf in every tree is checksummed. While it's possible to disable checksumming and copy-on-write for data, it's not possible for metadata (the fs itself).


>but the common options are safe and in continuous use

Which would be Raid5/6 and 1


I'm reading through the links, and at first I thought this was a historical problem that was last reported in 2016/2017, but then gems like this from mid 2020 are popping up:

    "When 'btrfs scrub' is used for a raid5
    array, it still runs a thread for each disk, but each thread reads data
    blocks from all disks in order to compute parity.  This is a performance
    disaster, as every disk is read and written competitively by each thread"
I have no words. Why would anyone ever think that this is a good idea? Who sat down at their computer and typed the code that does this!? How was this never tested?

You know what, thinking about it, I do actually have some rather choice words to describe the situation:

This boggles the mind to a level that requires further explanation, because the casual observer would likely fail to grasp the enormity of the failure that has occurred here. This isn't like, "oops, I forgot to up-shift gears in my car when going on the onramp", this is more like "the pilot forgot about the flaps after takeoff and the plane ran out of fuel.". There's a fundamental difference in the expectation of quality between, say, a random command line utility and a RAID filesystem.

To give some context: BTRFS was developed largely concurrent with, and in direct competition to Sun's ZFS. Unlike all previous SAN arrays, RAID cards, and filesystems, ZFS was explicitly designed for reliability. Sun famously had a 'test rig' where they abused each new build to death. Physically pulling disks. Randomly corrupting blocks. Running multiple operation types in parallel, while pulling disks. That kind of thing.

When I read ZFS whitepapers, I was amazed at how many fundamental flaws in RAID integrity protection they discovered, and then solved. Rigorously.

Meanwhile, BTRFS literally says, in 2020: Don't trust it, especially not for metadata, or data, or while scrubbing, which you had better baby-sit, otherwise say goodbye to your production environment!

More fun quotes:

    - plan for the filesystem to be unusable during recovery.
    - be prepared to reboot multiple times during disk replacement.
    - btrfs raid5 does not provide as complete protection against 
      on-disk data corruption as btrfs raid1 does.
    - scrub and dev stats report data corruption on wrong devices
      in raid5.
    - scrub sometimes counts a csum error as a read error instead
      on raid5.
    - errors during readahead operations are repaired without
      incrementing dev stats, discarding critical failure information.
      This is not just a raid5 bug, it affects all btrfs profiles.
You'd have to be nuts to use BTRFS for RAID 5 or 6, and I would question its use for any form of RAID.

PS: To the people downvoting this, please explain how you like people to be uninformed about catastrophic data corruption going ignored for 4 years below in the comments.


> How was this never tested?

You're describing this like it's supposed to work. Btrfs-raid5 is clearly labelled as a bad idea in pretty much every relevant doc. It's not an issue that the tool is broken for this use case because you're not expected to actually try that configuration. mkfs.btrfs specifically says: "RAID5/6 is still considered experimental and shouldn’t be employed for production use."

It's really not that big of an issue in that context. "How was running with scissors never tested for safety?" - just don't try it in the first place.


Probably yes, but I think the comment was also about why it was implemented that way in the first place, questioning the ability and quality of the corresponding developers, or how it was possible for this to be introduced into the filesystem code when everybody seems to know that it is horribly broken.

Personally, I would understand if some developers tried a new approach to RAID 5/6 and possibly came up with new and better solutions. I am wondering why it seems to be available in mainline. Why wasn't the patch/fork/branch rejected, or why wasn't the option disabled or the code completely removed? I don't have btrfs installed, but it sounds like this option is available in every btrfs release, although users are warned to NOT USE IT. Why is it available in the first place?!


> BTRFS was developed largely concurrent with, and in direct competition to Sun's ZFS

ZFS started development in 2001, and released on Solaris 10 in 2006. Btrfs started development in 2007, and the disk format was stabilized in 2013. ZFS has continued to gain features over the years, but I would say btrfs was largely developed after ZFS and still hasn't reached feature parity with the original release of ZFS due to the RAID5 issue.


Synology and SUSE are using it in production. I've been avoiding it even though we use openSUSE and it selects btrfs by default.


Synology uses a combination of btrfs and md to avoid btrfs's flaws


I don't understand why you were downvoted. That is huge and I did not see this detailed in the original article and not even heard about it before.


Fanboys... there was even a time when people defended Windows as the best server OS, and claimed no one needs checksumming in the FS because HW RAID is the "professional" way.


For desktop use, what problem does btrfs solve better than lvm+ext4?

Btrfs is slower than lvm+ext4, doesn't like working with large files, requires more ongoing maintenance (scrub, rebalancing), and is more prone to data corruption. Given that lvm can do snapshots under ext4, the only real benefit of btrfs is btrfs send, but for most use cases that doesn't seem like a large enough benefit to be worth the rest of btrfs' drawbacks.


Transparent compression.

Subvolumes: "Lightweight partitions" allow you more easily to define the scope of a snapshot. With ext4+lvm you have to create a new partition and logical volume.

I don't think it is a big game-changer, but then I don't think the other things are such dealbreakers as many people claim. I consider "is more prone to data corruption" to be largely historical/anecdotal.

Btrfs send sounds appealing, but personally, I prefer my backup solution to be independent of the filesystem.


When I replaced the SSD in my laptop with a larger one and needed to resize the partition (and filesystem, etc.) to use the capacity of the larger SSD, I had to do that at least on the filesystem, LVM and partition table levels. Or at least I don't think I found any tools that would have done the entire job for me, because I remember doing the layers step-by-step.

If there's no LVM, most point-and-click tools allow for resizing a partition and a contained filesystem in a fairly simple process. (Having a LUKS encrypted volume might complicate that, but if the encryption is on the filesystem level, that could also be avoided.)
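
For comparison, the layered resize typically goes something like this (device and volume group names are assumptions about a LUKS-on-LVM setup; each layer has to be grown in order):

    # 1. Grow the partition
    parted /dev/nvme0n1 resizepart 3 100%
    # 2. Grow the LUKS mapping that sits on it
    cryptsetup resize luks-root
    # 3. Grow the LVM physical volume, then the logical volume
    pvresize /dev/mapper/luks-root
    lvextend -l +100%FREE /dev/fedora/root
    # 4. Finally grow the filesystem itself
    resize2fs /dev/fedora/root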

People who dual-boot might also need to resize their partitions if their needs regarding the operating systems change, or if they made a mistake in their original allocations.

That's probably not enough of a reason to make a major switch, but it could be one task that could be simplified.


Doesn't gparted do this for you? It will automatically figure out even the most complex operations, including ones that require a filesystem copy.


I can't remember exactly, as it's been a few years. I'm pretty sure gparted would have been one of the first things I looked into, though.

Looking at my disk in gparted now, while it shows the LVM physical volume (and would presumably allow resizing it if it weren't in use), it doesn't seem to show the logical volumes and their filesystems inside of the physical volume at all. The partition information window for the LVM PV does list the logical volumes, but not the main window where you can plan the changes.

I don't know if that's a general limitation of gparted or if it's a peculiarity with my setup, but that's how it appears to me.


> is more prone to data corruption.

I'd challenge that. The main reason for using btrfs for desktop use is precisely that your data is less likely to be corrupted. Specifically because, unlike lvm+ext4, btrfs does full data checksums and therefore is capable of detecting silent data corruption.


It depends what type of corruption is more likely - silent bit flips in hardware, or a bug in btrfs code.


> requires more ongoing maintenance (scrub, rebalancing)

I'm not sure that's a good description. Scrubbing is an extra option you get. You don't need to use it, and the behaviour won't be different from, for example, ext4 with regard to bad data detection. It's purely an extra feature.

If rebalancing is useful for desktop users (wasn't really in my experience), I'm sure it will get a system-provided job that balances the resource use and amount of reclaimed space.


There are concerns with ZFS of the "scrub of death" on a system lacking ECC ram:

https://jrs-s.net/2015/02/03/will-zfs-and-non-ecc-ram-kill-y...

There is some debate on this question:

https://arstechnica.com/civis/viewtopic.php?f=2&t=1235679&p=...

I'm curious if the situation is improved for BtrFS.


The first link actually explains why ZFS is no more dangerous than any other filesystem.


Honestly, I've only ever needed to rebalance on a desktop system when statfs() returned f_bavail=0 and some program decided to take this information seriously and refused to write at all. There are still quirks with statfs info coming from btrfs volumes that are only solved with intense rebalancing.


btrfs supports zero-copy file and block cloning which is very good for some applications, e.g. rr.


I can't speak for everyone, but in terms of administration, it's much easier to deal with than having to sort through the filesystem sundae you end up making with device mapper frameworks and the filesystem cherry on top of it all. In a humorous twist, subvolumes are basically thin-provisioned logical volumes too, and btrfs has quotas to make them behave as such.


I use the `lv`, `pv`, and `e2fsprogs` families of commands pretty routinely in my day job (we set up minimal logical volumes and filesystems on hosts and leave it up to the application/deployment/user to decide what to do with the bulk of the storage) and haven't found this to be a real issue. After a couple of minutes thinking about what you're doing, you never really need to revisit this. You should probably spend a few minutes thinking about what you're going to be doing when dealing with your filesystems anyway.


Checksums so I don't backup corrupted data and never notice.

Instant, unlimited, zero-weight snapshots. You'd be surprised how often you use snapshots when you have them. A couple weeks ago I screwed up a save point in a game. Instead of just living with my mistake, I rolled my Steam directory back an hour.

Transparent compression.


Time to resilver a failed disk on btrfs depends on the amount of data on the partition, while on lvm it is always proportional to the full size of the disk.


Interesting - As RHEL is downstream of Fedora, I thought BTRFS was not being explored further by Red Hat, on account of their deprecation of it downstream in RHEL.

If I recall correctly, this was because they didn't have developers able to work on the software, preferring to use XFS + LVM to accomplish some of the goals of BTRFS via their Stratis project.

I wonder what this means for RHEL going forward in RHEL 9?


It means nothing. This is only for the Fedora Desktop spin, and it's led by the Fedora community, not Red Hat.


Disclaimer: I work for Red Hat but I have zero internal insight into this. I'm just a happy Fedora desktop user

I wouldn't say it means nothing because Fedora is looked at as a proving ground for inclusion in RHEL, but I would agree that one shouldn't read much into it.

There is plenty of software packaged/supported on Fedora that isn't and won't be shipped in RHEL. BTRFS may or may not just be yet another one like that. I've heard/seen more excitement about Stratis (which does seem awesome so far) than I have about btrfs.


I think RHEL currently uses XFS and Fedora uses ext4, so they are already not using the same file system. It's possible Fedora will switch to btrfs and RHEL will remain on XFS.


They stopped supporting it on RHEL (even experimentally) because the code churn was (is?) too high.

The thing to understand about RHEL is they backport... everything.... to ancient kernels.

Code that has lots of churn can be very difficult to backport, particularly to such an old codebase.


As a full-time RHEL admin, there is a reason: we have some old, like ooooooooold, kernels and servers floating around.

I can't see needing BTRFS on my granddaddy boxes, but we've definitely made use of backported code.


A few points:

- No this doesn't affect RHEL.

- It's only for Fedora Desktop spin (which for various reasons including this, but also others, you shouldn't use even on a Desktop - I install Fedora Server on my laptop).

- Only a subset of btrfs features will be used, especially avoiding the ones which are known to be problematic.


Can you elaborate on the reasons for using Fedora server instead?

I use Fedora Workstation on several laptops and desktops (without issue). I'm curious if I'm missing some problem/opportunity.


Defaults to GNOME, firewall disabled, ext4 (and now btrfs) instead of XFS.

(These spins are all about defaults - so installing Server doesn't make it any less useful for desktops, but you may have to dnf install a few things the first time you use it.)


What do you mean "defaults to GNOME"? Server defaults to no DE at all I thought (been a year or two since I've used it) and Fedora defaults to GNOME anyway.

If it by default doesn't install email client and music players I don't use that might be nice though.


> What do you mean "defaults to GNOME"? Server defaults to no DE at all I thought (been a year or two since I've used it) and Fedora defaults to GNOME anyway.

The "workstation" version uses GNOME.


These points seem to be about why fedora workstation is not a great fit for him/her. Workstation defaults to gnome, other spins or server don't.


I think the defaults they’re referring to are for fedora desktop, not server


Yes you're right - thank you. On a re-read I see that I had the assertions the wrong way round.


I'm pretty sure Fedora Workstation comes with firewalld enabled:

    $ systemctl status firewalld
    ● firewalld.service - firewalld - dynamic firewall daemon
         Loaded: loaded (/usr/lib/systemd/system/firewalld.service; enabled; vendor preset: enabled)


But with all the high ports open by default: https://lists.fedoraproject.org/archives/list/devel@lists.fe... The justification was that blocking firewall ports breaks user applications.


I have mixed feelings on this:

- on one hand, nothing listens on these ports by default; and they have a point that it does break user applications;

- on the other hand, they should be closed, and users should open them explicitly once they install an app that needs it. However, I'm afraid we would see exactly the same thing we see with SELinux: many tutorials about installing apps or services start with "Disable SELinux" instead of how to enable exactly what is needed. We would see the same thing with firewall and "Disable firewall" as step 1.

- as a sidenote to the second point: many apps do not ship with a firewalld profile, so it is not a matter of `firewall-cmd --add-service $appname` either.
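
When a service definition does exist, opening it is at least a one-liner (the service name here is just an example):

    # Permanently allow an existing firewalld service, then reload the rules
    firewall-cmd --permanent --add-service=http
    firewall-cmd --reload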


I have no issues with BTRFS, apart from the fact that the documentation is really poor.

I don't mind it becoming popular, but for the love of $Deity can we please make the docs as good as ZFS's. I will contribute cash if need be.

Facebook are utterly shit at documenting things, so yes it might work for their usecase, but they essentially store knowledge through Shamanism, which is terrible unless you're inducted into the world of the spirits.


I for one support Facebook's push to make more problems solvable solely by entering a mud hut and ingesting hallucinogenics.


That won't make them less evil, just less coherent.


It's only less coherent if you're not off your tits.


>documentation is really poor

Absolutely; the best one is the Arch Wiki, and even that one is meh.


It’s funny that fedora is moving to supporting btrfs as the default FS, when red hat has stopped supporting it altogether. I’m a big fan. In a world where we can’t have GPL compatible ZFS, btrfs is the next best thing.


There's probably no way to convert the legacy ext4+LUKS filesystem over, so, I guess I'll decide when F33 comes out whether I want BTRFS enough to do a clean install rather than an upgrade.


You can convert an ext4 filesystem to btrfs using btrfs-convert.


Is that stable?


Alright, so, disclaimer, I have lost data doing this but it was purely operator error! Closed my SSH session at the worst possible time. One of the podcast hosts at Linux Unplugged made the same mistake.

The btrfs-convert tool hypothetically leaves the ext filesystem all but untouched, and COW's the needed filesystem metadata onto the end of it, with data modifications coming thereafter (or intelligently stored within the free space of the ext system). You wait until you feel comfortable with BTRFS, then delete the preserved ext system and run a balance, which rewrites all data to disk in the usual structure. Alternatively, the preserved system can be restored, although I don't actually see instructions for that.

https://btrfs.wiki.kernel.org/index.php/Conversion_from_Ext3

I love btrfs and would trust it but am glad to have backups of irreplaceable data.
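
The workflow, as I understand it (device path is a placeholder; back up first):

    # Convert in place; the original ext4 image is kept in the ext2_saved subvolume
    btrfs-convert /dev/sdXN

    # Once satisfied: drop the saved image and rewrite data into the normal layout
    btrfs subvolume delete /ext2_saved
    btrfs balance start /

    # Or roll back to the original ext4 filesystem instead
    # (only possible before deleting ext2_saved or balancing)
    btrfs-convert -r /dev/sdXN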


I suppose I wouldn't need to worry about an SSH disconnect while working on localhost, but I'd still be a little concerned about trying it. I recently had a Pop!_OS/Ubuntu system irreparably damaged during an upgrade because the lock screen came on during the five minutes I had stepped away from monitoring to use the bathroom.

The wiki also doesn't mention how well this process plays with LUKS, so, given how sensitive headers and such can be, I'll probably wait to hear from some F33 early-adopters on the conversion tool to see how it turns out. I'm reasonably happy with my ext4 system as it is, and most of my core data is on backup disks anyway, so I'm not sure how much the integrity checking of BTRFS would help. Seems like it would be more critical to have on the backup disks to make sure I'm not propagating the bit-rot.


> The switch to Btrfs will use a single-partition disk layout, and Btrfs’ built-in volume management. The previous default layout placed constraints on disk usage that can be a difficult adjustment for novice users. Btrfs solves this problem by avoiding it.

Good call; subvolumes came to the rescue of the separate /home partition. It's useful, but the inflexibility was in getting the split of available disk space right. Novice users then saw warnings for /home when there was still plenty of space on the root partition.


It's painful to see how much disinformation (some of it subtle, some completely off the rails) some topics are getting in here.

If you want to get a better picture, also about ZFS and Fedora, read the previous Btrfs threads where the developers took the time to discuss it and to kill some FUD.

I have no association with Btrfs or Fedora but I'd like to have a modern FS in-tree as battle tested as it can be.


Surprising, I thought RH had drawn pretty clear lines in the sand in favor of evolving XFS to be the universal disk FS for Linux.


Is btrfs still super slow at deleting large files?

It can take multiple seconds (like 5-10) on my fileserver to delete just one ~10GB file.


The 'rm' command should complete pretty quickly. Actually freeing up space takes time since a delete is subject to delayed allocation. The default transaction commit time is 30 seconds. And if there are snapshots or reflink copies, a backref walk is needed before extents can be freed.
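
If you want the space accounting to catch up sooner, you can force a commit (mount point is a placeholder):

    # Force a transaction commit so pending frees are applied now
    btrfs filesystem sync /mountpoint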


I'm talking about the time it takes for the `rm` command to finish. I've started the habit of running all `rm` commands in the background on btrfs.


If it's reproducible, my suggestion is to strace the rm command and find out what it's doing that's taking so long; and post it to the mailing list:

https://btrfs.wiki.kernel.org/index.php/Btrfs_mailing_list
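
Something like this would at least show whether the time really is spent inside the unlink call (-T prints per-syscall latency; the path is a placeholder):

    # Trace just the unlink syscalls that rm issues and time them
    strace -T -e trace=unlink,unlinkat rm /path/to/bigfile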


Recording the kernel profile with perf would be more useful, I think. strace would probably only show long "unlink" syscall.


Yeah, and if perf is unrevealing, sysrq+t.


I tested removing a 10GiB file of random data from a btrfs filesystem, it took 254ms.


Do you have snapshots enabled by any chance, or any other btrfs features?


I'm not aware of using any features. Where should I check this?


So how long until systemd-somethingmajor requires btrfs?


I just heard my roommate saying "I don't want to play too much with my AMD CPU overclocking, because if I crash my system too much btrfs might get corrupted again".

laughs in zfs


Don't worry, ZFS has its own share of problems.

I have one CentOS machine where I can't update ZFS from 0.7 to 0.8, if I want to boot again (zfs#8885).


* https://github.com/openzfs/zfs/issues/8885

This seems to be more of a udev and 'device' discovery problem than something inherent to ZFS/ZoL.


*zfs on Linux


Or switch distros.

I'm pretty sure I can't replicate this bug on Debian or Ubuntu.


That is like asking someone to move to a different house because they don't like the color of the walls.


To use that analogy: no, it's more like moving to a different house because lead paint remediation is too costly and error-prone.


That seems like an entirely different analogy and basically just says this apartment is bad, you should move. Then why did you move in in the first place?

The point is, you chose your distro for a reason and one little piece of software is far from enough to override that.



