Five Years of Btrfs (markmcb.com)
214 points by vordoo 20 days ago | 237 comments

I went on a quest a few years ago, thinking it would be good for the industry to standardize on a single next generation filesystem for UNIX. I started with ZFS on Linux since that seemed to have the most vocal advocates. That lasted about a half year, until a bug in the code resulted in a completely corrupt disk, and I had to restore 4TB of data over a month from offsite backups. That plus the licensing confusion around ZFS has made it impossible for ZFS to be the de facto choice.

I went down the BTRFS path, despite its dodgy reputation, when Netgear announced their little embedded NASes, and switched my server over to it. The experience was solid enough that I bought a high-end Synology and have had zero problems with it.

Btrfs is the only FS I used that resulted in complete FS corruption losing nearly all data on disk, not once, but 3 times.

After that, none of the features like compression, snapshots, COW or checksums meant anything to me. I'm much happier with ext4 and xfs on lvm.

It seems a lot of people have these stories, and then people like me and OP who have had btrfs survive the most fucked up situations (I've had a btrfs nas built on "random drives I've had lying around" and abused it for 5 years and had 0 bugs at all).

I'm not sure what causes it, but there seems to be an effect where btrfs loves you or hates you and few people with mixed experiences regarding data loss. One possible cause is that distro choice tends to be per person, along with how up to date said distro keeps its kernel. But I'm not sure.

> It seems a lot of people have these stories, and then people like me and OP who have had btrfs survive the most fucked up situations (I've had a btrfs nas built on "random drives I've had lying around" and abused it for 5 years and had 0 bugs at all).

Why wouldn't you expect it to survive that? Is there a particular reason to believe those drives are broken? I.e., are they older consumer drives known to lie about cache flushes? Do they have bad sectors? How have you abused it? What kind of load? Did you fill the filesystem (which another commenter mentioned seems to be a common element of most sad btrfs stories)? Did your system frequently lose power while under write load?

Lacking more details, I'd just say one user experiencing 0 bugs in 5 years should be completely unremarkable. I expect filesystems to be very reliable, so a lot of people having stories of corruption means stay away from btrfs. Having some people with stories of no corruption doesn't really move the needle. Together, these stories still mean stay away from btrfs!

That's hyperbole, it can't be taken seriously. OpenSUSE uses Btrfs by default; if there were more problems than what's expected with md+LVM+ext4 (or XFS), whose feature set Btrfs covers and then some, they wouldn't have made the ongoing investments they have. Facebook has been using it in production with thousands of installations for years.

You want details from people experiencing zero problems, but you don't ask for details from people who are? That's a weird way to go about conducting the necessary autopsies, to discover and fix bugs.

Anyway, I monitor the upstream filesystem lists, and they all have bugs. They're all fixing bugs. They're all adding new features. And that introduces bugs that need fixing. It's not that remarkable, until of course someone suggests only one file system is to be avoided, while providing no details and relying on conjecture.

I asked RX14 why they called out their lack of problems as remarkable ("survive the most fucked up situations"). It sounds strange, as I mentioned.

I don't need to ask people who've had problems because I've had them myself, in unremarkable circumstances, a while back. I'm sure I could find reports on the mailing list as well, in which others have already asked for details.

In my experience, btrfs is very fragile in power loss or kernel crash/panic scenarios. It very consistently causes soft lockups on file reads/writes after power loss until you run a `btrfs check --repair` on it. My experience is mostly on Arch, so it's not a case where it's out of date and missing patches.

Sounds like hardware problems in the storage stack. Btrfs developers contributed the dm-log-writes target to the kernel, expressly for conducting power loss tests on file systems. All the file systems benefit from this work. https://www.kernel.org/doc/Documentation/device-mapper/log-w... And Btrfs is doing the right thing these days.

I recently conducted system resource starvation tests where a compile process spun off enough threads to soak the system to the point it becomes unresponsive. I did over 100 forced power off tests while the Btrfs file system was being written to as part of that compile. Zero complaints: not on mount, not on scrubs, not with btrfs check, and not any while in normal operation following those power offs.

If you want to complain about Btrfs, complain about the man page warning to not use --repair without advice from a developer. You did know about that warning, right?

100% was not a hardware problem. Works fine on other filesystems.

That's an inadequate answer because it rests on other file systems assuming the hardware is working reliably. Btrfs and ZFS don't make such assumptions, that's why everything is checksummed. They are canaries for hardware, firmware, and software problems in the storage stack that other filesystems ignore.
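To make the "canary" point concrete, here is a minimal sketch of per-block checksumming. It is not how Btrfs or ZFS lay anything out on disk: `ChecksummedStore` is a made-up class, and zlib's CRC32 is only a stand-in for whatever checksum the filesystem actually uses. The idea is just that a checksum recorded at write time lets a later read notice a bit that changed underneath the filesystem.

```python
import zlib


class ChecksummedStore:
    """Toy block store that records a CRC32 next to every block (hypothetical)."""

    def __init__(self):
        self.blocks = {}     # block number -> bytes as they sit "on disk"
        self.checksums = {}  # block number -> CRC32 recorded at write time

    def write(self, n, data):
        self.blocks[n] = data
        self.checksums[n] = zlib.crc32(data)

    def read(self, n):
        data = self.blocks[n]
        # Verify on every read; a mismatch means something below us changed the data.
        if zlib.crc32(data) != self.checksums[n]:
            raise IOError(f"checksum mismatch on block {n}")
        return data


store = ChecksummedStore()
store.write(0, b"important data")
store.write(1, b"other data")

# Simulate a flaky cable or firmware bug flipping one bit behind the filesystem's back.
corrupted = bytearray(store.blocks[0])
corrupted[0] ^= 0x01
store.blocks[0] = bytes(corrupted)

try:
    store.read(0)
    detected = False
except IOError:
    detected = True

print("corruption detected:", detected)
```

A filesystem without data checksums would have handed back the flipped byte without complaint, which is roughly the difference being argued about here.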

This was my experience. We had a brief power outage at work and my btrfs (root) partition was toast. Spent a whole day rebuilding my system afterwards. Will definitely not go that route again.

The only difference is that none of the repair tools were able to recover the filesystem, but I was able to dump the files themselves to a new disk to recover them. Really not sure why, it was very strange.

I ran btrfs on a laptop. 2 things.

Once I ended up with a bunch of zero length files (presumably metadata was written before content?).

I also, multiple times, ended up with errors related to full drives despite my drive not being full. Deleting snapshots seemed to help.

Then I went to a zfs fs on root and never had another problem.

For a year now I have literally turned off my machine daily by pulling the plug (home automation turns off all plugs at midnight to make me go to bed ;).

My quite large 1 TB multivolume, multisnapshot BTRFS fs never had any problems.

And it's a quite aggressive config (big fs commit).

P.S. I do have backups though.

Ugh. You are testing your home the Netflix way [1] :-)

Why not put poweroff in a cron task a bit before midnight so you don't needlessly risk hosing your file system? You can always restore your backup but it takes time!

[1] https://arstechnica.com/information-technology/2012/07/netfl...

I think the probable cause is that it's not common bugs that cause the corruption but uncommon ones. Most of the time, they work fine. But you really want a stronger guarantee than that out of your filesystem.

Historically, the biggest bugs in btrfs were when you came close to filling up the filesystem. For the longest time, you'd get -ENOSPC (no space left) even when you had many GB of space left, due to really bad metadata and block level space usage.
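A toy model of how that can happen, as I understand the commonly described mechanism: space is first carved into chunks that are dedicated to either data or metadata, and a chunk does not go back to the unallocated pool just because the files in it were deleted. This is only a sketch; `ToyChunkAllocator`, the chunk size, and the numbers are invented for illustration and are not btrfs's real allocator.

```python
import errno


class ToyChunkAllocator:
    """Two-level allocation: device -> typed chunks -> bytes (illustrative only)."""

    def __init__(self, total_chunks, chunk_size):
        self.unallocated = total_chunks
        self.chunk_size = chunk_size
        self.capacity = {"data": 0, "metadata": 0}
        self.used = {"data": 0, "metadata": 0}

    def allocate(self, kind, nbytes):
        # Grab whole chunks for this kind until the request fits.
        while self.used[kind] + nbytes > self.capacity[kind]:
            if self.unallocated == 0:
                raise OSError(errno.ENOSPC, "No space left on device")
            self.unallocated -= 1
            self.capacity[kind] += self.chunk_size
        self.used[kind] += nbytes

    def free(self, kind, nbytes):
        # Space returns to the chunk, but the chunk stays assigned to `kind`.
        self.used[kind] -= nbytes

    def free_bytes(self):
        total_capacity = sum(self.capacity.values())
        total_used = sum(self.used.values())
        return self.unallocated * self.chunk_size + total_capacity - total_used


fs = ToyChunkAllocator(total_chunks=10, chunk_size=1024)  # units: MiB
fs.allocate("metadata", 512)    # takes 1 chunk
fs.allocate("data", 9 * 1024)   # takes the remaining 9 chunks
fs.free("data", 8 * 1024)       # delete most of the data again

try:
    fs.allocate("metadata", 1024)  # metadata needs a new chunk, none are left
    hit_enospc = False
except OSError as exc:
    hit_enospc = exc.errno == errno.ENOSPC

print("free MiB:", fs.free_bytes(), "ENOSPC:", hit_enospc)
```

In this model the filesystem reports plenty of free space, yet a metadata allocation fails because every chunk is still pinned to data. Repacking the half-empty data chunks is what would release them again.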

I'm a huge Mac fanboy, but APFS really kicks me in the teeth sometimes. Aside from things like snapshots, clones, etc. not being accessible to users (well, not really), or being able to create subvolumes at specific mount points which forget those mount points next reboot, it had an extremely strange behavior (possibly relating to snapshots/CoW?) where once it was full, it stayed full forever until you rebooted.

Basically, any time a runaway process filled my disk, I just had to hard-reboot and hope I didn't have any unsaved work or state that I needed to preserve.

Really makes me hope that Apple is going to further extend APFS to not just be baby's first CoW volume-management filesystem.

> it had an extremely strange behavior (possibly relating to snapshots/CoW?) where once it was full, it stayed full forever until you rebooted.

Do you have Time Machine enabled? I think it uses snapshots, which explains why the filesystem stays full. I've hit this myself and was initially surprised to see rm not improving matters (possibly even making it worse) but it makes sense with snapshots. That it worked after a reboot was a surprise. I'd put off fixing the machine for at least a week, and when I went to actually fix it, it was quite anticlimactic to just reboot and have it work. Maybe it checks for this condition on reboot and dumps Time Machine snapshots if so.

That was the less scary part of my macOS filesystem integrity worries. My full disk started when it was staging a full Time Machine backup after I got a dialog saying:

> Time Machine completed a verification of your backups on "my.nas.address". To improve reliability, Time Machine must create a new backup for you.

...for the Nth time. I don't know for certain if the problem is with Apple's software or with my NAS's (Synology) but these backups are clearly not as reliable as one would hope...

Let's not forget about various performance issues which were exacerbated by "low free space" conditions (i.e. after you filled the volume beyond 80 % these started to pop up). A file system that will sometimes go down well into the fractional IOPS range is not very useful.

Some of these are fixed by now, though.

"the biggest bugs in btrfs were when you came close to filling up the filesystem" :)

I used to read every email on btrfs-devel for a year or so.

This is my experience too. Works great with lots of free space; as soon as space gets tight, performance deteriorates really fast. Nevertheless, for me it has been worth it.

There's a good Bryan Cantrill talk about that.[1] The gist is that eventually, when you throw enough resources at a problem, all that's left are the really uncommon problems and bugs, and this is specifically what you get in the data path (including drive firmware) where things get harder and harder to figure out as the code gets more hidden and obscure.

As with all his talks, you can expect it to be quite entertaining as well as informative and historical (if from his POV).

1: https://www.youtube.com/watch?v=fE2KDzZaxvE

Personally I think that in the case of a CoW filesystem, bugs which cause corruption should be very uncommon because of the very nature of the CoW mechanism, especially if coupled together with data checksums as publicized in the case of BTRFS.

If things still get trashed then I tend to think that the very foundation of the FS is bad.

But maybe I'm just naive :)

One anecdote of a filesystem working fine and one anecdote of it becoming a disaster don't cancel each other out.

I wouldn't buy a $5 USB thumb drive if half the people said it lost their data and half said it worked fine.

I'd buy it, but only for short-term use to sneakernet shit I already had backed up reliably somewhere else.

Of course, where we run into problems is that btrfs is meant to be the reliable backup. Oops.

You realize that there are $5 thumb drives that work, just like there are filesystems that actually work right? There isn't any benefit to using something broken, these problems have been solved.

> I'm not sure what causes it, but there seems to be an effect where btrfs loves you or hates you and few people with mixed experiences regarding data loss.

I tried, I really tried to like btrfs.

On the servers/workstations I’ve had few serious issues, but a few “gotchas” you need to know to keep things running smoothly.

On every laptop I’ve had, I’ve had btrfs fail on me. Repeatedly.

So I gave up on it. ZFS for me these days.

> btrfs loves you or hates you

This is how superstitious traditions start, and ritualistic sacrifice in particular, I'd think.

>and ritualistic sacrifice in particular, I'd think.

Does data loss count as a sacrifice in this instance?

If it does, I think the ZFS "rebuild the pool from scratch" should as well, since that seems far more ritualistic.

>but there seems to be an effect where btrfs loves you or hates you

Surely it depends on the btrfs implementation. e.g. Arch Linux getting daily kernel updates vs an enterprise distro

Just as unstable on Arch as of a month or two ago.

Same and same. Never saw any problems with btrfs. Really like the memory consumption of btrfs!

> an effect where btrfs loves you or hates you

Same thing happens with operating systems.

Sorry, but this is anecdata.

About 2/3 of the way down this Hacker News discussion (if you are patient enough to get there) you can see questions about production deployment of btrfs, with some VERY interesting answers about BIG deployments of btrfs. Read: success confirmed with data. My takeaway from reading the whole discussion:

* lots of people (individuals) praise btrfs

* lots of people (ind.) tell about problems

* quite nice features/btrfs usage patterns mentioned, not matched even by zfs

* still, for VM/DB you should consider a different approach (thin LVM + xfs or ext4) and a slave machine WITH btrfs and snapshots on it

* quite a few problems/deficiencies of ZFS mentioned (apart from the typical license/kernel inclusion ones)

* lots of new features on the way in recent kernels for btrfs

* btrfs is not dead

p.s. worth mentioning that kernel 5.6 just received another huge batch of new features for btrfs (async discard!)

ZFS is the only FS I used that resulted in complete FS corruption, losing nearly all data on disk (only once though).

Legitimately curious what the ZFS bug was. I’ve not heard of a TFDL bug in zfs for a Loooong time.

The reason Synology btrfs is mostly solid is because they refused to ever use the btrfs raid layer. But the second you move to btrfs on LVM you lose a large portion of the supposed benefits.

Having used both, never lost data on zfs and I’ve been using it since it was released and have had it save me from silent data corruption. BTRFS hasn’t ever lost me an entire file system, but I’ve definitely lost files.

I really don't understand the insane hype around ZFS. You can't read any thread that touches on filesystems without the ZFS zealots coming out.

ZFS is mature/stable, its feature set is basically unmatched (data checksums, compression, atomic snapshots, RAID(0,1,10,5,6), send/receive) by any other option on Linux, and what competition it does have is unstable in some configurations (BTRFS), essentially dead in the water (reiserfs), in early development (bcachefs), or far more complex to manage (gluster, ceph, LVM+XFS). Other than the licensing issue, ZFS is basically a silver bullet.

I agree and I'd like to add to the list of feature-set the adaptive cache (which does not only take into account the last time a block was used but as well how frequently it was used) and the SSD-cache ("ARC" respectively "L2ARC" in ZFS jargon).
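For anyone wondering why "recency plus frequency" matters, here is a much simplified sketch. It is explicitly not the real ARC (no ghost lists, no adaptive sizing between the two lists); `TwoListCache` is a made-up name. It only shows the basic effect: a block that has been hit more than once survives a one-off scan that would flush it out of a plain LRU.

```python
from collections import OrderedDict


class TwoListCache:
    """Simplified recency + frequency cache (NOT the actual ARC algorithm)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.recent = OrderedDict()    # entries seen once
        self.frequent = OrderedDict()  # entries seen at least twice

    def get(self, key):
        if key in self.frequent:
            self.frequent.move_to_end(key)
            return self.frequent[key]
        if key in self.recent:
            # Second access: promote from "recent" to "frequent".
            value = self.recent.pop(key)
            self.frequent[key] = value
            return value
        return None

    def put(self, key, value):
        if key in self.frequent:
            self.frequent[key] = value
            self.frequent.move_to_end(key)
            return
        if key in self.recent:
            self.recent[key] = value
            self.recent.move_to_end(key)
            return
        if len(self.recent) + len(self.frequent) >= self.capacity:
            self._evict()
        self.recent[key] = value

    def _evict(self):
        # Prefer evicting once-seen entries so a scan cannot flush the hot set.
        victims = self.recent if self.recent else self.frequent
        victims.popitem(last=False)


cache = TwoListCache(capacity=3)
cache.put("a", "hot")
cache.get("a")                 # "a" is now in the frequent list

for key in ["b", "c", "d", "e"]:
    cache.put(key, "scan")     # a one-off sequential scan

print("a still cached:", cache.get("a") is not None)
print("b still cached:", cache.get("b") is not None)
```

With a plain 3-entry LRU, "a" would have been pushed out by the time "d" arrived.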

Also don't underrate good documentation and easy to use tooling.

This was the killer feature for me.

I had been wanting to try ZFS on my home NAS for a while (for snapshotting/redundancy/data integrity) and finally got enough disks that it made sense. I wasn't looking forward to learning what I presumed to be a very complicated system though. About 15 minutes into my research for setting up and maintaining a ZFS filesystem, I just went: wait, that's it? So incredibly simple and well documented, it has been a joy to use. It is very rare that complicated operations on complicated systems use such simple and easy to understand commands. It just does what I expect!

ZFS is incredibly easy to learn to use, whereas btrfs is quite complicated to learn/use, and even more so if you've used ZFS since a lot of things are either just different enough to be weird, or so different that it makes no sense.

Examples: ZFS snapshots can be recursive (-r) or not, whereas on btrfs they cannot be recursive; in discussions I've seen, this is mentioned as "a feature", since you can create a subvolume for data that you don't want to be part of the snapshot, but it also prevents you from dividing up a logical hierarchy into multiple behaviours (compression vs. not, block size, etc.).

> but it also prevents you from dividing up a logical hierarchy into multiple behaviours (compression vs. not, block size, etc.).

Bind mounts can get around most of the limitations here, at the cost of polluting one directory with the canonical locations of all your special-purpose subvolumes. I think it's still awkward to simultaneously snapshot every subvolume that is mounted under a particular tree for incremental backup purposes.

The hype is quite easy to understand. Snapshots and checksums are two complete game-changers. ZFS has them both. And there are no real alternatives in many cases.

I've personally waited for BTRFS longer than a decade but my use-cases are yet to be considered stable (not something you really mess with in regard to filesystems).

Honestly, as sure as I have been on the success of BTRFS I now consider BTRFS dead on arrival - if it will ever even arrive. The pace of development is slower than the universe around it, that might be too harsh but really - no RAID6 yet? A decade ago the impression I got was "soon". And now 2-drive parity is becoming obsolete.

ZFS has tons of warts for home-use, I agree. So, for a home-user with high demands I don't see anything exciting in the future.

There were a bunch of btrfs raid56 patches last year. I think the known bugs have been addressed and it's just that the wiki page hasn't been updated.

Re obsolete, are you referring to RAID1C3?

I'm thinking of this:


I'd much prefer something like raidz3 compared to the author's setup.

RAID1C3 is nice but very expensive for use in bulk storage at home.

What warts do you speak of?

No defragmentation, and as far as I'm aware all copy-on-write filesystems suffer greatly from fragmentation once utilization goes too high. ZFS will never recover unless you restart from scratch.

No way to rebalance a pool. Also increasing a pool always results in less reliability (in terms of drive losses that results in the whole pool going down).

No proper recovery tools if something goes wrong.

Then the lack of flexibility talked about in the article. This means the up-front cost and total cost is vastly more than a more typical setup where you can buy drives spread out over many years and take advantage of falling prices, less power consumption and noise (in part because you typically start such an array with higher density drives, since the low cost and longevity allows you to).

Probably forgot some other reasons.

That said I still use zfs (freenas) at home. But because of the above it is quite hard to blindly recommend it.

lvm and hence ext4 etc have had snapshots for ages.

As does NTFS. But they are not really comparable to "real" filesystem snapshotting, at least not in my opinion.

I don't think I am a zealot, nor a heavy user, but I use it on 1 machine at home (an NFS server running FreeBSD, which I have clients for elsewhere in my house). I came to this idea when I saw some data loss on some magnetic disks in my house, and repairing or even assessing the level of damage was difficult.

My experience is that it's pretty good. The tooling does what it says without a lot of drama. I can scrub while the system is in use and don't notice it mostly. I have seen some small corruptions that it was able to flag for me with specific filenames and fix. Snapshotting and send/receive is also very handy.

I heard some people say they don't like to use it under heavy load. That seems reasonable to me. You're paying costs to get the integrity piece. So it's not for every use or every user. It is very good at what it does, however.

Same with me. I just figured out at some point, 10 years ago, that it is nice to have snapshots on the root disk. And figured out FreeBSD supports ZFS. Tried it, loved it, used it. ZFS on Linux was destabilized in the latest versions (`ls /.zfs/snapshots`) and they blew it considerably by adding it to systemd (I have needed to reboot Fedora multiple times before it boots ever since), but at least I know that my data are not lost (unlike btrfs, where I had two major crashes in two years). Quite frankly I'd rather wait for Reiser to get out of jail than use btrfs again. Anyway, I bet on Hammer2.

ZFS is like really good snow tires in the winter. You can tell people with other tires how great it is to have really good tires, but they don't believe you until they experience the benefits for themselves. Or put another way, no one "needs" ZFS until they really need it, then they won't live without it ever again.

I switched to it after the 7200.11 firmware mess, where the drives reported successful writes but didn't write anything. ZFS would have caught that; my Adaptec card certainly couldn't have and didn't.

ZFS to the rescue again a while later when those (now firmware updated) 7200.11 drives started dropping after 15k hours of service. ZFS saved my data when two drives started failing in my RAID5 set at the same time.

All the weird minor problems that would cause random issues or performance issues for other file systems like flaky SATA cables, intermittent HBA/backplane ports, etc. ZFS catches them all and informs you.

Having been hit by bit rot, corrupted files, corrupted file systems, etc etc before switching, ZFS is fantastic. And there is something great about watching it scrub at >1GB/sec, verifying every single bit of your data.

I had to check the dictionary for the meaning of "hype"

a situation in which something is advertised and discussed in newspapers, on television, etc. a lot in order to attract everyone's interest:

Maybe it's just me, because modern-day usage of "hype" seems to involve and imply a negative meaning, especially in tech. Similar to false advertising. And no one was actively promoting ZFS, they were only very "responsive".

And then zealots, I had to reread 226 comments, ran to Cambridge dictionary

a person who has very strong opinions about something, and tries to make other people have them too

I don't see anyone having strong opinions and forcing others to have the same. If anything, a lot of people are showing up not because they love ZFS, but because they have been burnt by btrfs.

Eh, I'm waiting for them to rewrite it in Rust.

Poe's law

This is such a meta-comment that I actually LOLd!

ZFS is the worst filesystem/volume manager, except all others.

Have you tried it? I went from having never used ZFS to loving it (and I guess being one of those zealots) very quickly after setting it up. So simple yet so powerful!

>You can't read any thread that touches on filesystems without the ZFS zealots coming out.

Agreed 100%. That's particularly annoying to us desktop users. It took me years to figure out that no, FreeBSD aside, it doesn't bring anything to the table outside of enterprise storage use cases. At least it doesn't bring anything that's worth the hassles (I don't have to export ntfs filesystems before using them on another computer; same for ext4 -and then there's performance).

I have a synology NAS on btrfs. One of the best computer purchases I've ever made.

I’ll second this, it’s fantastic. The time it takes to expand when adding a second 16TB is deeply average (8 days) but that’s about it for downsides. It’s the best computer I’ve owned.

Hard to standardize on something that can't be maintained in the same place all your other filesystems are in (in the Kernel) for licensing reasons.

Only the boot file system drivers need to be in the kernel. As long as there is a stable ABI, it's fine for everything else to be someplace else.

> As long as there is a stable ABI, it's fine for everything else to be someplace else.

Mainline Linux has a policy against in-kernel ABI stability guarantees. User-space is given ABI stability guarantees, in-kernel code by intention is not. That includes filesystems.

Do you have a link to the bug issue? ZFS purportedly never had any corruption issues on release versions, so that makes it a really interesting case.

A question for HN: what filesystem and/or block-device abstraction layer would you use on a database server, if you wanted to perform scheduled incremental backups using filesystem-level consistent snapshotting and differential snapshot shipping to object storage, instead of using the DBMS’s own replication layer to achieve this effect? (I.e. you want disaster recovery, not high availability.)

Or, to put that another way: what are AWS and GCP using in their SANs (EBS; GCE PD) that allows them to take on-demand incremental snapshots of SAN volumes, and then ship those snapshots away from the origin node into safer out-of-cluster replicated storage (e.g. object storage)? Is it proprietary, or is it just several FOSS technologies glued together?

My naive guess would be that the cloud hosts are either using ZFS volumes, or LVM LVs (which do have incremental snapshot capability, if the disk is created in a thin pool) under iSCSI. (Or they’re relying on whatever point-solution VMware et al sold them.)

If you control the filesystem layer (i.e. you don’t need to be filesystem-agnostic), would Btrfs snapshots be better for this same use-case?

AWS and GCP most likely use their own proprietary stuff. Various storage systems (such as Netapp) are able to provide snapshots at the storage system level, and if you're interested in something open source, a Ceph cluster can also provide you snapshottable block devices; whether it's a good idea for a database is another question.

Filesystem snapshots are a legitimate way of backing up databases, but it's not quite as simple as just taking a snapshot. For PostgreSQL for example you will still need to call pg_start_backup() and ensure your WAL archives are properly stored in your object storage system for point-in-time recovery. Without the database-specific precautions, your snapshots will still be crash-consistent and most likely usable in some manner, but not quite proper backups.
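Roughly, the ordering looks like the sketch below. It only builds the command list and runs nothing. Assumptions: PostgreSQL 14 or older in exclusive backup mode (the function was renamed in later releases), a ZFS dataset as the snapshot mechanism, and the dataset name `tank/pgdata` and label `nightly` are placeholders I made up. WAL archiving has to be configured separately and is not shown.

```python
def snapshot_backup_steps(dataset="tank/pgdata", label="nightly"):
    """Ordered commands for a snapshot-based base backup (nothing is executed).

    Assumes PostgreSQL <= 14 exclusive-mode backups; dataset and label
    are hypothetical placeholders.
    """
    return [
        # 1. Tell PostgreSQL a base backup is starting (this forces a checkpoint).
        ["psql", "-c", f"SELECT pg_start_backup('{label}');"],
        # 2. Take the filesystem snapshot while the backup is "open".
        ["zfs", "snapshot", f"{dataset}@{label}"],
        # 3. Tell PostgreSQL the base backup is finished.
        ["psql", "-c", "SELECT pg_stop_backup();"],
    ]


steps = snapshot_backup_steps()
for step in steps:
    print(" ".join(step))
```

The WAL segments written between steps 1 and 3 still have to reach the archive, otherwise point-in-time recovery from that snapshot will not work.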

Using BTRFS or ZFS as the database filesystem has its own footguns. For example, the default record size of ZFS datasets doesn't match the block size of most databases, so if you forget to take that into account, you'll very likely see rather terrible performance.
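Back-of-the-envelope numbers for that mismatch, as a sketch. The 128 KiB figure is the ZFS default recordsize; 8 KiB and 16 KiB are the usual PostgreSQL and InnoDB page sizes. The function is my own simplification: it models only the worst case where changing one database page rewrites a whole record, and ignores compression, caching, and small files.

```python
def write_amplification(record_size_kib, db_page_kib):
    """Worst-case bytes rewritten per byte the database asked to change."""
    if record_size_kib <= db_page_kib:
        # Records no larger than a page: no extra data gets dragged along.
        return 1.0
    return record_size_kib / db_page_kib


for name, page_kib in [("PostgreSQL", 8), ("InnoDB", 16)]:
    default = write_amplification(128, page_kib)
    tuned = write_amplification(page_kib, page_kib)
    print(f"{name}: default recordsize {default:.0f}x, matched recordsize {tuned:.0f}x")
```

So with the defaults, a random 8 KiB page update can turn into 128 KiB of copy-on-write work, which is why matching the record size to the database page size is the first tuning step people usually mention.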

If the database is PostgreSQL, I would strongly advise forgetting about filesystem snapshots and instead using streamed backups (if on premises use barman, if in cloud WAL-E or WAL-G (never used it but it looks like an improvement over WAL-E)).

This gives you backup with a replay value, so you can restore at any point in time. You can also use such backup for setting up replication. There's still a daily backup which is there to speed up recovery and increase resiliency. Those backups don't really put much load on the database, but if that's a concern you can back up the replica (which is what cloud providers or at least AWS is doing).

As for ZFS, out of the box ZFS is not a good file system for databases, although you can get good performance after tuning. For example, you want to configure it to have block sizes aligned with database blocks, configure the ZIL, and perhaps change the block hashing algorithm (although I think the current default should be fast).

As for your question of how cloud providers are doing it, most of us can only speculate. To me it looks like standard RDS instances are simply on EBS (which utilizes S3). In Aurora they skipped EBS and implemented another database storage directly.

It seems like the backups are performed in traditional way though.

I do not think it would be a good idea to use file system level snapshotting for backing up a database. The database "knows better" about its internals, and can give more guarantees about the consistency of its data. I would trust a filesystem-level backup only as a last resort.

It is possible to put database in state that is "ready" for snapshot, pushing changes to disk and sort of freezing I/O during snapshot.

This is generally not a matter of concern for a copy-on-write filesystem like zfs, since it's not possible for the file to be in an "in between" state. If a write were in progress, the filesystem would still be pointing to the previous state. Only when the data is written to disk is the pointer moved to the new location.
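The pointer-flip idea in toy form. `ToyCowFile` is invented for illustration and has nothing to do with ZFS's real on-disk format. It shows a single write either fully landing or not at all, and a snapshot continuing to see the old version. It says nothing about whether several related writes by an application end up on the same side of a snapshot.

```python
class ToyCowFile:
    """Copy-on-write in miniature: new data goes elsewhere, then a pointer moves."""

    def __init__(self, data):
        self._blocks = {0: bytes(data)}  # block id -> contents, never overwritten
        self._next_id = 1
        self.root = 0                    # "pointer" to the current version

    def write(self, new_data, crash_before_commit=False):
        # Step 1: write the new copy to a fresh location.
        block_id = self._next_id
        self._next_id += 1
        self._blocks[block_id] = bytes(new_data)
        if crash_before_commit:
            return  # the pointer never moved, so readers still see the old data
        # Step 2: flip the pointer to the new location.
        self.root = block_id

    def snapshot(self):
        # A snapshot is just a remembered pointer.
        return self.root

    def read(self, root=None):
        return self._blocks[self.root if root is None else root]


f = ToyCowFile(b"v1")
snap = f.snapshot()

f.write(b"v2 half-written", crash_before_commit=True)
after_crash = f.read()   # still the old contents

f.write(b"v2")
current = f.read()       # new contents
from_snapshot = f.read(snap)  # snapshot still sees the old contents

print(after_crash, current, from_snapshot)
```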

It very much is a concern. ZFS has no knowledge about the internals of a database, which parts of a file are related to each other etc.

DBMSes always keep their database in the file system in a consistent state to be able to recover from system crashes. Taking a file system snapshot is equivalent to pulling the power on the database server in terms of data recovery, but databases are designed to support this.

As do filesystems. Yet I've never seen anyone argue that cutting the power is the recommended way of doing backups.

In fact the opposite: make sure to use a UPS just so that you can shut down cleanly in the unfortunate event.

For example: https://blogs.oracle.com/paulie/backing-up-mysql-using-zfs-s...

Some people came up with the idea of "crash-only software", arguing that it's better to maintain one code path (recovering from a crash) than two (clean start and recovery), but it hasn't caught on that much. https://www.usenix.org/legacy/events/hotos03/tech/full_paper...

Google follows this, I believe

It matters when various writes to files need to be ordered, and the DB process is keeping some of the consistency part in memory.

It is a matter of concern if said database system leaves its filesystem contents in an inconsistent state at any point. ZFS, BTRFS, and others can only keep consistent what they have control over.

I also think database specific backup makes more sense.

Some people recommend filesystem snapshotting, but wouldn't that make recovery a slow process, because you have to load up the entire database even if you just wanted to look up data in a small table?

Maybe backing up only small tables as SQL dumps while keeping a file system snapshot would be a good compromise.

Create a LVM thinpool, create a thin LV in that, format with XFS, put the database on top.

Each time you want to do a backup, create a CoW snapshot of the thin LV, then mount it somewhere and run the backup.

The "main" thin LV should be happily chugging along independently when you are doing that.

And all this is stable, proven technology available just about anywhere (e.g. RHEL 7+).
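Spelled out as a dry-run sketch that only prints the commands. All names (`vg0`, `dbpool`, `dbdata`, the sizes, the mount point, the tar archive path) are placeholders I made up, and the flags are from my reading of lvmthin(7), so check them against your distro's man pages before running anything. The `tar` step stands in for whatever backup tool you actually use.

```python
import shlex


def setup_commands(vg="vg0", pool="dbpool", lv="dbdata",
                   pool_size="100G", lv_size="50G"):
    """One-time setup: thin pool, thin LV inside it, XFS on top."""
    return [
        ["lvcreate", "--type", "thin-pool", "-L", pool_size, "-n", pool, vg],
        ["lvcreate", "--type", "thin", "-V", lv_size, "-n", lv,
         "--thinpool", pool, vg],
        ["mkfs.xfs", f"/dev/{vg}/{lv}"],
    ]


def backup_cycle_commands(vg="vg0", lv="dbdata", snap="dbdata_snap",
                          mountpoint="/mnt/dbsnap",
                          archive="/backup/db.tar.gz"):
    """One backup cycle: snapshot, activate, mount, copy, clean up."""
    return [
        # Thin snapshots need no size; they share the pool with the origin.
        ["lvcreate", "-s", "-n", snap, f"{vg}/{lv}"],
        # Thin snapshots are skipped at activation by default; -K overrides that.
        ["lvchange", "-ay", "-K", f"{vg}/{snap}"],
        # nouuid: the snapshot carries the same XFS UUID as the origin.
        ["mount", "-o", "ro,nouuid", f"/dev/{vg}/{snap}", mountpoint],
        ["tar", "-C", mountpoint, "-czf", archive, "."],
        ["umount", mountpoint],
        ["lvremove", "-y", f"{vg}/{snap}"],
    ]


setup = setup_commands()
cycle = backup_cycle_commands()
for cmd in setup + cycle:
    print(shlex.join(cmd))
```

Note that a snapshot taken this way is only crash-consistent from the database's point of view, same as with any other block-level snapshot.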

My understanding is that performance on LVM drops dramatically after the first snapshot due to the way it handles CoW (synchronous writes on top of your async write). Is that no longer true, or only in certain circumstances?

It seems as though the way to go would be to take a 'snapshot', back it up, and then delete it immediately; is that right?


I think this complaint applies to the original LVM2 snapshots, not the new thin ones.

>> Or, to put that another way: what are AWS and GCP using in their SANs (EBS; GCE PD) that allows them to take on-demand incremental snapshots of SAN volumes, and then ship those snapshots away from the origin node into safer out-of-cluster replicated storage (e.g. object storage)?

As far as I know AWS does not use SANs because they consider it an anti-pattern. Most backups land on S3 because of reliability and price.

SAN and s3 are different beasts.

EBS is very much a SAN, if you read the docs, the Nitro HBA Controllers have dedicated bandwidth allocation for doing just EBS.

As there is a dedicated network for just servicing block storage, that sounds suspiciously like a Storage Area Network to me.

S3 for backup makes lots of sense: it's ubiquitous, reliable and smeared over lots of regions. It also works well with large files. It's also orders of magnitude cheaper than EBS to run.

Sure thing. I was referring to the lack of SAN in the context of backups. Yes, EBS is a SAN in that sense.

So how is S3 implemented? Does it reuse any publicly available open source component?

I don’t think they’ve published anything specifically on S3’s architecture (someone please correct me if I’m wrong, I last looked into this a long time ago), but

1. they came out with S3 soon after coming out with their Dynamo paper (before releasing DynamoDB, even); and

2. there’s a good constructive proof, as a studyable FOSS system, for how to build object storage on top of a Dynamo architecture, in the form of Riak CS (object storage) which is built atop Riak KV (a Dynamo impl.) Riak CS seems to make pretty much the same set of guarantees (in terms of time/space complexity of operations, possible durability numbers per scaled number of copies, etc.) that S3 does, so it’s a fair guess that they’re similarly-architected systems.

It is a closed source project that has many components. I am not aware if any of those are opensource.

Assuming you can afford 2 machines this setup works pretty well for me.

Primary DB-Server -> XFS on LVM with a LVM caching SSD

Secondary (write-only) mirror DB-Server -> ZFS

The DB replicates automatically to the secondary server via the database's internal replication features. On the secondary I can then lock the DB temporarily, take a ZFS snapshot, and optionally do a ZFS send/receive afterwards, all without affecting the primary server.

ZFS is great for snapshots and for archiving huge amounts of data, but it's very, very bad for production databases in terms of performance. Most databases aren't designed to deal with the CoW feature of ZFS, which leads to very bad write performance and database fragmentation in the end.



See various "Private Cloud" Linux distributions for implementation examples of this. Such as Proxmox which does it out of the box on ZFS and soon on btrfs too.

xfs + dm-snapshot / lvm snapshots. Very fast and very reliable.

AWS, and possibly GCP, only allow 1:1 mapping of volumes (publicly; I know AWS allows it under the hood).

Which makes synchronising snapshots a lot easier (and caching too, but that's another thing entirely).

They are treated as block storage, so on the outside they don't have to worry about what filesystem is running on it (in practice they have to be a bit aware, so that they don't snapshot unbootable or dirty images, but I assume that's mostly handled by an OS plugin).


AWS et al snapshots are at the block level. Linux has poorly documented primitives for this.

If you put your VM images on a Filesystem provided by ZFS or BTRFS then you can snapshot your images, without having to buy a SAN, or expensive controller.

ZFS has by far the best documentation. BTRFS's documentation has improved, but the tools are still difficult to use.

I've seen a lot of the hacker community focusing on btrfs and zfs but very little focusing on ceph. I think ceph has a lot of the features that we want in a file system and some things that aren't even possible on traditional file systems (per-file redundancy settings) with very few downsides. The setup is a little more complex, involving a few daemons to manage disks, balance, monitor, etc. I wish there was something similar to FreeNAS for ceph that only focused on making the experience seamless, because I think if it became more popular in the home lab space we'd see lots of cool tools pop up for it.

I love Ceph, I even wrote an intro about it for those who are not familiar with it.


But Ceph is not designed to be a competitor to BTRFS or ZFS. The core vision of Ceph is scalability. If you need petabytes of storage and the performance to scale with it, take a look at Ceph.

I may be totally wrong here, but from what I understand about Ceph, it's not meant as a file system for a single computer. I don't understand the idea of running Ceph on your laptop/desktop. It's possible to run it that way but it defeats its purpose.

I've built a small lab setup with Ceph:


Also, there's the issue of performance, in particular latency. That's a bit of a weak spot of Ceph, from what I can tell. Again, may be wrong. But I found these notes interesting.



In fact, it's really common to use a ZFS array on single nodes, and then create a SAN using multiple such machines by layering Ceph on top.

That's interesting, but it's layers upon layers... (RIP latency), I think. Unless it's about just bandwidth and volume, then latency is not that big of a deal.

You don't have to use ZFS snapshots. I haven't run a system like this in production but presumably you choose ZFS because it's flexible in how you configure the arrays (as is say, LVM) and because it supports checksumming.

I never really store data on my local machines anymore. All of my data is either hosted only on, or backed up to, my storage server. I think the selling point of ceph is that every server in my apartment can be part of my storage cluster and data I really want to avoid losing can be persisted across all of them.

For me latency isn't really a large issue. I read and write everything locally on my SSD-backed desktop/laptop and then sync my files to my storage node via git or rsync or something. For me data integrity and availability are important.

Love using Btrfs; there is no better filesystem than it, nowadays that its reliability issues have been fixed.

I have to beg to differ here as I had a different experience that I literally just posted about to Reddit yesterday


tl;dr Unbeknownst to me I had a bad drive cable for an external NVMe enclosure that was causing intermittent I/O errors (only during high drive utilization) that went undetected by BTRFS and slowly corrupted my drive, eventually leading to an unbootable and unrepairable system (and to be fair, I should have scrubbed instead of attempting btrfsck --repair from another booted drive, but I don't care what you say, a --repair function should NOT potentially cause FURTHER corruption if it is at all available in the tooling! Like, just fucking rip it out if it can potentially make things worse, or recode the damn thing to just act defensively... jeez)

Wiped the drive and started over with Ubuntu 19.10 and its new integrated ZFS on Root support... ZFS detected the IO issue pretty much instantly and prevented further errors by freezing I/O. Swapped the cable out during my troubleshooting and the issue went away. Also, drive is plenty fast, read test at 800MB/s

I'll throw in my own anecdote. ZFS on root caused me a significant amount of headache when the proxmox node I was using it on just randomly decided it wasn't going to boot anymore. The ZFS pools were fine, no data was lost, but no amount of messing with it fixed the zfsonroot and it was quite difficult to find quality search results for.

And of course it was a weekend where my parents and siblings and in-laws were visiting, so I had the joy of going around messing with DNS settings wherever someone had a device that only paid attention to the first two DNS servers in the DHCP settings.

(I've since changed my DNS setup- now I only have a primary self-hosted one that's on an RPi in my networking cabinet, and the second entry is Google. I figure if I only get two servers that are respected for real, I'm making sure one of them is google.)

> I only have a primary self-hosted one that's on an RPi in my networking cabinet, and the second entry is Google.

I was under the impression that there was no such thing as primary and secondary for DNS, just ‘here is one’ and ‘here is another’, with someone going for a terrible naming system of ‘primary’ and ‘secondary’. I’m no expert and my knowledge comes from messing about with Pihole and reading their documentation.

The first nameserver listed in resolv.conf is kind of a primary as it will always be consulted first, unless you add "options rotate". The next nameserver only comes into play if the first doesn't respond (default 5 seconds, also tunable with options). They're not named primary/secondary in the file but could be considered that way.
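To make the ordering concrete, here is a small sketch that reads resolv.conf-style text and reports the nameservers in the order a glibc-style stub resolver would try them, plus any `options` that change that. The sample addresses are made up, and only the `nameserver` and `options` keywords are handled.

```python
def parse_resolv_conf(text):
    """Return (nameservers in listed order, options dict) from resolv.conf text."""
    nameservers, options = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        # Lines starting with '#' or ';' are comments.
        if not line or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) > 1:
            nameservers.append(fields[1])
        elif fields[0] == "options":
            for opt in fields[1:]:
                # Options look like "rotate" or "timeout:2".
                name, _, value = opt.partition(":")
                options[name] = value or True
    return nameservers, options


if __name__ == "__main__":
    # Hypothetical file: a self-hosted resolver first, a public one second.
    sample = "nameserver 192.168.1.2\nnameserver 8.8.8.8\noptions timeout:2\n"
    servers, opts = parse_resolv_conf(sample)
    order = "round-robin" if opts.get("rotate") else "first listed, then fallback"
    print(servers, "timeout:", opts.get("timeout", "5 (default)"), "order:", order)
```

This only describes the file; applications that ship their own DNS library may ignore the ordering entirely.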

Don't rely on this behaviour; many DNS libraries will query all or n of them to save on latency.

I suspect that both BTRFS and ZFS are currently good enough under most configurations that most users don't have a problem with whichever they choose, and it's only a tiny fraction that has a really good or bad experience and becomes a rabid advocate based on their anecdotes.

This is an obvious truism. Of course they appear to work correctly under ideal conditions.

The real question is how they behave under less than ideal conditions. It is these conditions where Btrfs has performed poorly, and where ZFS has performed very well. I lost several Btrfs filesystems due to its poorly-tested and broken error handling trashing the filesystem beyond recovery.

The selling point of both of these filesystems is their robustness, fault-tolerance and ability to self-heal. Only one of them actually delivers.

That's really a packaging issue, not a ZFS issue, but I feel your pain.

The best suggestion I can offer is to use a distribution that treats it like a first-class citizen, such as... well, the Ubuntu support is still beta level, so only NixOS for now.

> when the proxmox node I was using it on just randomly decided it wasn't going to boot anymore

could this possibly be proxmox's fault more than ZFS's fault? You even said the pools were fine

That's why FS integration into the kernel would have been so important for the whole software ecosystem.

I tend to agree with you here -- reliability has been a non-issue for me, though I've never configured `btrfs` in its RAID configuration.

Performance becomes an issue in certain cases, but in every one that I've encountered, adjusting configuration has resolved the problems to my satisfaction.

Would my Windows 10 VM run better under a different filesystem, rather than `btrfs` with various tweaks applied? Reading relatively recent articles on the subject would suggest that it would, however, I'd rather work with a single filesystem type and understand its strengths/weaknesses than manage two different filesystems as long as I can get performance to a usable state.

We have run btrfs in RAID configuration, but that has usability issues, even just doing RAID-1.

We've switched back to using MD (mdadm) for RAID-1 setup, and then using btrfs on top of that for the snapshots, send / receive, block-level CRC and such.

Dealing with failed drives isn't as easy with btrfs as it is with Linux MD.

Does that include performance reliability?

It wasn't very long ago that I had BTRFS drives on two separate systems develop crippling performance issues, with random delays increasing up to seconds, and the filesystem going unresponsive for even longer when I deleted snapshots. I think something about the performance was degrading every time an hourly snapshot was made, even though the system only kept a couple dozen of them at a time.

> nowadays that its reliability issues have been fixed

Is this also true for RAID5/6?

This issue has its own wiki page on the BTRFS wiki:


So, no, that particular issue hasn't been fixed.

> *For data, it should be safe as long as a scrub is run immediately after any unclean shutdown.*

That’s unfortunate. Does the scrub run automatically in those situations? Consumer hardware will be the most prone to intermittent power failure.

There are some big caveats there. https://btrfs.wiki.kernel.org/index.php/RAID56

is it considered better than ZFS?

I've had one issue with btrfs that took it off my radar completely. A customer had a runaway issue that filled a btrfs device with unimportant things. We found the errant process and killed it, but apparently if a btrfs device is completely full, you can't delete anything to free up space. File removal requires some amount of free space. Bricked the device, annoyed a customer, back to ext4.

ZFS had this issue (I believe it's fixed). The workaround was to pick one large file that you wanted to delete and do `echo -n > /the/unimportant/file`; once the file was reduced to size 0, rm started to work again.

Not sure if that workaround would work in btrfs, but it worked on ZFS.

ZFS reserves 1/64 of every disk precisely so it can't be truly fully allocated. It leaves enough room to delete snapshots, truncate files, and so forth.

Mind that everything is copy-on-write: you can't do anything, even metadata changes, without allocating new blocks. It needs the reserve space.

I had a ZFS bug once where they increased the amount reserved in a new release, which caused my file system to be 100% full and left me unable to delete anything until I went back to the previous release.

Btrfs uses the disk completely. This is harder to do (also compared to e.g. ext4 reserving a fixed amount of inode space which may be unused when the disk is full). At some point they added an in-memory "global reserve" metadata space which allows you to delete stuff even if the file system is full.
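As a rough illustration of the 1/64 figure quoted above, the reserve for a given pool size is just a right shift. This is only the arithmetic from the comment; real ZFS releases may use a different fraction and clamp the result, so treat the numbers as illustrative.

```python
TIB = 1024 ** 4
GIB = 1024 ** 3


def reserved_bytes(pool_bytes, shift=6):
    """Reserve pool_bytes / 2**shift; shift=6 is the 1/64 quoted in the thread."""
    return pool_bytes >> shift


if __name__ == "__main__":
    for size_tib in (1, 4, 8):
        reserve = reserved_bytes(size_tib * TIB)
        print(f"{size_tib} TiB pool -> {reserve / GIB:.0f} GiB held back")
```

For a 4 TiB pool that works out to 64 GiB that can never be allocated to user data, which is what leaves room for the copy-on-write metadata updates a delete needs.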

What happens if the file has already found its way into a snapshot? Then presumably that command will not free any space.

Well, rm wouldn't free the space either, so you would either remove the snapshot or choose a different file.

True. For me, I freed up space by nuking old snapshots, when I ran into this on btrfs.

See also: 'truncate -s 0 /the/file'
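The same truncate-before-delete step can be done from a script. A minimal sketch, using a throwaway temporary file as a stand-in for the unimportant file; it shows the mechanics only and cannot guarantee that a genuinely full copy-on-write filesystem will accept the truncate.

```python
import os
import tempfile


def truncate_then_unlink(path):
    """Shrink the file to zero bytes first, then remove it; return the size after truncation."""
    os.truncate(path, 0)  # same effect as `truncate -s 0 path`
    size_after = os.path.getsize(path)
    os.unlink(path)
    return size_after


if __name__ == "__main__":
    # Throwaway stand-in for /the/file.
    with tempfile.NamedTemporaryFile(delete=False) as fh:
        fh.write(b"x" * 1_000_000)
    print(truncate_then_unlink(fh.name))
```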

Yep. I had this happen a few weeks ago (I'm not sure how much maintenance the server has had since it was set up 2-3 years previous). Thankfully, after seeing whatever the error was ("No space left on device" or something) and furrowing my brow it seemed obvious enough to try without having to search for a solution. It seemed just dumb enough to work.

The trick was to insert a USB drive, tell BTRFS it's block storage and delete away. Once you're done you tell it not to use the disk and you're good.

This is why I'm back on ext4 now, too.

I do a similar thing with my laptop's swap partition. swapoff, add it to btrfs, then remove it and mkswap again. Always seemed safer than a potentially dodgy USB drive.

Or, if you don't care about your data, add a RAM disk :P

Good trick, I'll save that one. Thanks!

I'm surprised they wouldn't reserve space.

IIRC ext4 reserves a (configurable) portion of the disk for system management; it seems like btrfs could easily do the same.

Ext4 reserves space that can be used only by root, it is so system services can continue to work when users take all the space. It doesn't have issues like this if you exhaust all of that space.

In ZFS, and I'm sure in btrfs, you can set up quotas and reserved space, globally and/or per user, but by default it is set to 0. I actually set my quota to 80% because apparently if you fill ZFS more than that it causes heavy fragmentation.

You're 100% correct; ext4 reserves this for a different purpose. I'm just saying it's not an entirely novel idea (albeit for a different reason).

Ext4 reserved space also helps with fragmentation.

To be more specific, reserved space on ext3 helps the fs so that it can be more flexible during allocation and avoid fragmentation.

Ext4 has a delayed allocation mount option for that purpose, so reserved space is not as important for that, but it'd still help if you turn off delayed allocation.

Won't that only work for non-root processes?

A copy on write file system has this potential problem because nothing is overwritten. To delete anything requires space to write the metadata change reflecting the deletion, and before the data extents can be freed the change must be committed to stable media.

It's been years since Btrfs introduced "global reserve" which reserves enough metadata space to ensure it's possible to delete files on full file systems. But an old workaround for this is to add a small device to the Btrfs volume, making it a 2-device volume. It could be a USB stick, a zram device (ramdisk), partition, or even a loop mounted file on some other file system. Delete the files, and now you can remove the temporary 2nd device.

Wasn't this fixed years ago? I had a Btrfs partition filled by a rogue process and it was not bricked. It allowed me to remove the junk files without any issue. And I'm talking about an Ubuntu 14.04 LTS server.
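The add-a-device workaround boils down to three steps, sketched here as a dry run that only builds the command lines. The mountpoint, spare device and file paths are hypothetical, and nothing is executed.

```python
import shlex


def rescue_full_btrfs(mountpoint, spare_device, junk_paths):
    """Return the commands for the temporary-second-device workaround (dry run only)."""
    return (
        # 1. Give the filesystem somewhere to write new metadata.
        [["btrfs", "device", "add", spare_device, mountpoint]]
        # 2. Deleting now has room for its copy-on-write updates.
        + [["rm", "--", path] for path in junk_paths]
        # 3. Migrate anything that landed on the spare device back, then drop it.
        + [["btrfs", "device", "remove", spare_device, mountpoint]]
    )


if __name__ == "__main__":
    # Hypothetical names, printed for review rather than executed.
    for cmd in rescue_full_btrfs("/mnt/data", "/dev/sdz1", ["/mnt/data/junk.bin"]):
        print(shlex.join(cmd))
```

The spare device must stay attached until the final remove completes, which is why a RAM-backed device is only an option if losing the data is acceptable.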

This was 14.04 as well, on a Tegra (armhf) system.

eta: looked up the ticket, customer reported "when trying to delete any files, even as root, btrfs says "cannot remove"", field engineering observed the same.

I've used BTRFS on several devices for years. The tooling is a bit rough, but no major problems. Just recently data checksumming saved me: In December I replaced an old 2TB drive in my RAID1 (2+4+4+4) with an 8TB drive. The new drive had checksum errors after a few weeks, which BTRFS handled gracefully. With "classical" RAID I might only have noticed when it's too late. (I RMAed the bad drive)

  [/dev/mapper/h4_crypt].write_io_errs    0
  [/dev/mapper/h4_crypt].read_io_errs     0
  [/dev/mapper/h4_crypt].flush_io_errs    0
  [/dev/mapper/h4_crypt].corruption_errs  0
  [/dev/mapper/h4_crypt].generation_errs  0
  [/dev/mapper/h2_crypt].write_io_errs    0
  [/dev/mapper/h2_crypt].read_io_errs     30
  [/dev/mapper/h2_crypt].flush_io_errs    0
  [/dev/mapper/h2_crypt].corruption_errs  0
  [/dev/mapper/h2_crypt].generation_errs  0
  [/dev/mapper/h1_crypt].write_io_errs    0
  [/dev/mapper/h1_crypt].read_io_errs     0
  [/dev/mapper/h1_crypt].flush_io_errs    0
  [/dev/mapper/h1_crypt].corruption_errs  0
  [/dev/mapper/h1_crypt].generation_errs  0
  [/dev/mapper/h3_crypt].write_io_errs    0
  [/dev/mapper/h3_crypt].read_io_errs     0
  [/dev/mapper/h3_crypt].flush_io_errs    0
  [/dev/mapper/h3_crypt].corruption_errs  0
  [/dev/mapper/h3_crypt].generation_errs  0
  [/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].write_io_errs    0
  [/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].read_io_errs     16
  [/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].flush_io_errs    0
  [/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].corruption_errs  20619
  [/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].generation_errs  0
edit: formatting
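Counters in the format shown above are easy to check from a script. A small sketch that parses that text and returns only the devices with non-zero counters; the sample is an excerpt of the output above, and how you obtain the text (for instance by capturing the stats command's output) is left out.

```python
import re
from collections import defaultdict

# Matches lines such as "[/dev/mapper/h2_crypt].read_io_errs     30"
STAT_LINE = re.compile(r"^\s*\[(?P<dev>[^\]]+)\]\.(?P<counter>\w+)\s+(?P<value>\d+)\s*$")

SAMPLE = """\
[/dev/mapper/h2_crypt].write_io_errs    0
[/dev/mapper/h2_crypt].read_io_errs     30
[/dev/mapper/h1_crypt].read_io_errs     0
[/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].read_io_errs     16
[/dev/mapper/luks-e120f41e-9c8a-4808-876f-fa6665ee8bb8].corruption_errs  20619
"""


def failing_devices(stats_text):
    """Map each device to its non-zero error counters; healthy devices are omitted."""
    errors = defaultdict(dict)
    for line in stats_text.splitlines():
        match = STAT_LINE.match(line)
        if match and int(match["value"]) > 0:
            errors[match["dev"]][match["counter"]] = int(match["value"])
    return dict(errors)


if __name__ == "__main__":
    for dev, counters in failing_devices(SAMPLE).items():
        print(dev, counters)
```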

I have been using btrfs in my "NAS"/personal server for 3 years, changed disk configuration a couple times, I do snapshots every hour and prune them using a Fibonacci-like timeline, no problems yet.
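One possible reading of a "Fibonacci-like timeline" is sketched below; the commenter's actual scheme isn't given, so the bucket layout is an assumption. Snapshot ages (in hours) are grouped into buckets whose edges follow 0, 1, 2, 3, 5, 8, 13, ..., and only the oldest snapshot in each bucket is kept, so retention gets sparser the further back you go.

```python
from bisect import bisect_right


def snapshots_to_keep(ages_hours):
    """Keep the oldest snapshot per Fibonacci-sized age bucket (ages must be >= 0)."""
    if not ages_hours:
        return []
    # Bucket edges: 0, 1, 2, 3, 5, 8, 13, ... extended past the oldest snapshot.
    bounds = [0, 1]
    a, b = 1, 2
    while bounds[-1] <= max(ages_hours):
        bounds.append(b)
        a, b = b, a + b
    keep = {}
    for age in ages_hours:
        bucket = bisect_right(bounds, age) - 1  # index of the edge at or below age
        keep[bucket] = max(keep.get(bucket, age), age)
    return sorted(keep.values())


if __name__ == "__main__":
    hourly = list(range(24))  # one snapshot per hour for the last day
    kept = snapshots_to_keep(hourly)
    print("keep:", kept)
    print("prune:", [age for age in hourly if age not in kept])
```

With 24 hourly snapshots this keeps 8 and prunes 16; the pruning list is what a real script would pass to the snapshot delete command.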

My experience has been the same. Admittedly, I've not tried native BTRFS parity raid (I'm sitting the volume on top of mdraid). But, I ran the "mkfs.btrfs" 5 years ago at this point for my desktop and no data loss yet. I back things up religiously, so I'm not too worried about the volume failing, but it'll be nice if btrfs parity raid gets stabilized, because I could replace my current NAS storage config.

I used to use ZFS on my NAS, but after running it for a year and fiddling with it, I wasn't able to tune it in a way I liked. I always had random performance problems and zvols were super slow. It's now dm-integrity on all disks, an mdraid raid6 volume over those, with LVM2 on top of that and mirrored NVMe disks as a read and write cache.

I also wish BTRFS would add extents at some point so you could run virtual machine images from it without weird performance issues from time to time (although I imagine this is less of an issue on SSDs because they're "fragmented" inside anyways).

I use btrfs in raid1 mode and the ability to shrink/grow/add/remove devices at will without data loss or extended downtime led me to choose btrfs over zfs on my home servers.

You can grow and add/remove raid1 devices (mirror vdevs) in ZFS without any significant work or downtime. Shrinking does require a bit more work, but depending on your setup it can be done fairly painlessly with send/recv (and shrinking is usually not something which is a very common administrative operation).

"fairly painlessly" and "without significant work or downtime" doesn't sound like it lines up with btrfs's, which I would describe as "one command and zero downtime (just some io load if you rebalance immediately)" for both operations. btrfs is also mainline, which increases how painless it is to use.

BTRFS does have some scary stories from earlier in its development, and true raid5 seems like it's unlikely to be safe for quite a while, but raid1 and "normal" fs usage has been rock solid in my experience. The only time I've ever had an issue was probably 4 years ago at this point, and it was solved by just booting an Arch live iso and running a btrfs command that was basically "fix exactly the bug that your error message indicates". I don't remember exactly what it is, something about two sizes not matching, but googling the text it showed at boot led me directly to the command to fix it. Certainly dramatically less trouble than I've ever had when hardware RAID goes south.

I do agree that modern lvm does probably compete with btrfs, but again you're trading how dang simple btrfs raid1 is to manage for monkeying with partitions in lvm in exchange for ~some? performance.

IMO ZFS is in a weird spot where I don't know where I'd use it. It's too complicated/annoying to admin for me to want to run it in my basement for myself/my family, and for anything bigger or more professional I'd use ceph or a problem-domain-specific storage system (HDFS, clickhouse, aws, etc).

The first operations I mentioned (adding or removing a device from a vdev, or adding a new vdev) are one command with no downtime:

  % zpool attach <pool> <vdev> <device> // add to a vdev
  % zpool detach <pool> <vdev> <device> // remove from a vdev
  % zpool add <pool> <vdev> <devices...> // add a new vdev
In the newest ZFS versions, you can also remove mirror and singleton vdevs (this does require some time -- because the data needs to be copied from the drives) but it's all done in the background:

  % zpool remove <pool> <vdev>
Shrinking a pool "the old way" (which is still sometimes necessary depending on what you're doing) is definitely more involved -- you have to create a new pool with the layout you need and then do a zfs send/recv from your old pool to the new one. This does only take a handful of commands but I would definitely consider it to be a much more complicated affair than the operations I mentioned above.

I would not (nor did I) compare LVM (or md-raid) to btrfs or ZFS -- those technologies have fundamental limitations regarding the integrity of your data that ZFS (and btrfs) don't have. And don't get me wrong -- I don't have a problem with btrfs (I run btrfs on all of my machines except my home server -- which runs ZFS), I just disagree with GP's point that ease of use is an argument for btrfs over ZFS. There are many arguments for either technology.

> btrfs is also mainline, which increases how painless it is to use.

I agree that this is one argument to pick btrfs over ZFS (though on most distributions it isn't really that hard to install ZFS, the fact that btrfs requires zero extra work to use on Linux is a benefit).

How? My understanding is that you create a new vdev and add the old vdev as a device, basically recursively creating volumes with each new device you add.

Which operation are you asking about? [1] is a sister comment which I posted that outlines how to do most of the operations I mentioned.

[1]: https://news.ycombinator.com/item?id=22168494

Is there any specific reasons to run btrfs over for example ext4? You can create/shrink/grow pools, create encrypted volumes etc by using LVM.

It all depends on the application but in the majority of cases the io performance of btrfs is worse than the alternatives.

Redhat for example chose to deprecate btrfs for unknown reasons while SUSE made it its default. The future of it seems uncertain, which may cause a lot of headaches in major environments if implemented there.

Redhat and SUSE (SLES) are both enterprise environments, so at every level, they have to choose one tech stack to go all-in on (i.e. to train their support staffs on), and then discourage their customers from using the others. (“Deprecating” a component, for such orgs, means that some of their customers are now stuck with it, and they’ll continue to support those customers in their use of it, but they certainly won’t support new customers using it.)

The fact that one enterprise-support provider went all-in on Btrfs, while another didn’t, basically tells you that the choice is pretty arbitrary. If no enterprise-support provider used Btrfs, then I’d be concerned.

The enterprise provider that actually develops btrfs continues to support btrfs, and one enterprise provider that doesn't stopped supporting it.

People treat RH stopping support of btrfs as some sort of death knell for it. Meanwhile all the btrfs users are confused why RH's opinion should matter at all when they weren't that involved with developing it in the first place.

As an opensuse user, btrfs has saved multiple machines from botched updates by letting me revert to the snapshot from right before the update was applied (opensuse's update tool automatically takes snapshots before and after updates).

Red Hat used to be heavily involved in Btrfs development. In fact, they are present in a huge chunk of its development in the first few years. But their developers were hired away by Facebook, leaving Red Hat with nobody who works on Btrfs regularly. That's the underlying cause for why they stopped supporting it. Hiring someone to work on Btrfs takes time and effort that they don't have a reason to spend right now.

I've also been running btrfs (on CentOS7) for about five years on my home NAS.

One advantage is it detects bit rot -- and you can scrub the disks once a week looking for the bad blocks.

I also like the inline compression.

I run at RAID1 and the only issue I had was several years ago there was a bug about freeing allocations so occasionally the filesystem would be full but not full.

> Redhat for example chose to deprecate btrfs for unknown reasons

According to Josef Bacik, RH deprecated btrfs because he was the engineer in charge for Btrfs and had left the company.


Docker on btrfs can benefit from fs supported layers. It's fantastic!

I avoid LVM because trying to install with it always seems to break things, either making the install fail or breaking later during an upgrade. And I mean on normal-ass distros like Debian and Ubuntu, not anything odd, and not even with "fancy" features like disk encryption involved (I can only imagine the mess that'd introduce).

Then again I avoided Grub for years because I found it fiddlier and more breakage-prone than LILO, so possibly I'm just an idiot and/or jinxed when it comes to new things in Linux.

For me, killer feature is transparent compression, I work with a lot of numerical data in Postgres, and running it over btrfs is the only viable way to compress it.

Also, who uses btrfs in production? I only heard Facebook is using it somewhere but never read about others. Why is Facebook using btrfs yet there seems to be no publicity to make it more popular for external contributions.

fsync is still a bit slow on BTRFS (on ZFS too, but to a smaller degree). For example, I just did a quick benchmark on Linux 5.3.0 - installing Emacs on fresh Ubuntu 18.04 chroot (dpkg calls fsync after every installed package).

ext4 - 33s, ZFS - 50s, btrfs - 74s

(test was run on a Vultr.com 2GB virtual machine, backing disk was allocated using "fallocate --length 10G" on an ext4 filesystem, the results are very consistent)

Was the ext4 test also run from a fallocated file on top of the underlying ext4?
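For anyone wanting to compare filesystems themselves, a much cruder stand-in for the dpkg test is to time a loop of small writes that are each followed by an fsync. This is a sketch only: the file count and size are arbitrary, it is not the benchmark used above, and absolute numbers depend heavily on the backing device.

```python
import os
import tempfile
import time


def time_fsync_writes(directory, files=50, size=4096):
    """Write `files` small files into `directory`, fsync each, return elapsed seconds."""
    payload = b"x" * size
    start = time.perf_counter()
    for i in range(files):
        path = os.path.join(directory, f"pkg-{i}.bin")
        with open(path, "wb") as fh:
            fh.write(payload)
            fh.flush()             # push Python's buffer to the kernel
            os.fsync(fh.fileno())  # ask the kernel to push it to stable storage
    return time.perf_counter() - start


if __name__ == "__main__":
    # Point `dir=` at a mount of the filesystem under test; the default is the system temp dir.
    with tempfile.TemporaryDirectory() as workdir:
        print(f"{time_fsync_writes(workdir):.3f}s for 50 fsync'd files")
```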


Did anyone have the courage to use btrfs in production? Any stories to share?

Seems like Facebook uses it:

"Btrfs has played a role in increasing efficiency and resource utilization in Facebook’s data centers in a number of different applications. Recently, Btrfs helped eliminate priority inversions caused by the journaling behavior of the previous filesystem, when used for I/O control with cgroup2 (described below). Btrfs is the only filesystem implementation that currently works with resource isolation, and it’s now deployed on millions of servers, driving significant efficiency gains."


Yeah there are a remarkable set of container runtime tasks (package downloads, rootfs creation and management, etc) that are way easier with btrfs. It wasn’t always smooth sailing but luckily Chris, Josef, Omar and others are awesome and now (and for the last while) we are asking for features rather than fixes.

At a previous job, I deployed btrfs to production in a system that continuously spins up and shuts down thousands of VMs. A key feature that I was able to leverage to make this easy is seed devices. This btrfs feature works similarly to overlay filesystems.

If I were doing that today, I would do a bake-off of OverlayFS vs. btrfs for this feature. Btrfs has many other compelling features that may make it worth using, although it's always been slower than ext4/xfs so I'd also need to check how it does with modern ultra high performance NVMe drives.

Btrfs never lost our data, although there was a kernel panic in the journal writing code in the Linux 3.2/Ubuntu 12.04 timeframe. The panic would not cause data loss but it did wedge VMs. Since that was fixed, it's had a 100% reliable run in that system, to my knowledge.

I heard people get stable btrfs when certain features are turned off, so it may be helpful to say what you have turned on or off with its features when saying how it has been stable or not.

It's the default on recent Synology NAS, in my experience. No particular issue in my limited experience. Mostly transparent for the user.

(also a happy syno user here, been using it on several NAS's quite happily).

My rough understanding is synology did some pretty heavy modifications to btrfs in their implementation though... (a quick google finds me nothing to back this up, but i remember reading about it somewhere...)

Not modifications per se, but it doesn't quite do the "normal" setup. Encryption is a mess (you can't export encrypted volumes via NFS), and the caching layer on top of it seems prone to corruption on the SSD (I've had my NVMe mirror cache drop twice over the last year and a half).

I'd like to see them move to full disk encryption rather than their current approach.

They do encryption/compression on subvolume level; each share you create is a separate subvolume.

For RAID5, they are using it on top of LVM, but with some modification - the synology implementation hooks LVM and btrfs together, so it gets ZFS-like properties.

There's a guy on the internet, who was playing around with it: https://daltondur.st/syno_btrfs_1/

So they have fixed the last big hurdle to btrfs adoption in the small (single node) NAS space and are just sitting on it (violating the GPL). I urge any Synology user to write them to send you the Linux kernel source then upload it somewhere... though, their last Linux kernel drop seems to have been in 2017, so not much hope there...

CZ.NIC (the .cz tld administrator) is shipping OpenWRT-based wifi routers Turris Omnia and Turris Mox. Both of them use btrfs as their filesystem.

We (the build2 project) use it in our CI infrastructure for VM storage. For every build we make a snapshot of a VM, boot it, build, drop the snapshot, repeat. So we are talking about making/dropping snapshots every couple of minutes 24x7 for months without a reboot. We haven't had a single issue.

No problems with unbalancing?

When I was doing whole rebuilds of Debian, using e.g. 8 parallel builds of >18000 packages, it was creating and destroying a snapshot once every few seconds to minutes, at most 8 snapshots in existence at once. It got unbalanced and went write only every 36 hours. A clean brand new filesystem which never had more than 10% space utilisation and was typically around 1%.

At home: I'm running a RAID1 btrfs on my 12 disk cold storage (rackmount, SAS backplane, JBOD SAS controller). It has two new 4TB 24/7-rated SATA disks I got for that NAS, the rest is mostly salvaged from work (old drives, 500GB to 1TB). I had exactly the same selling point on btrfs as the author - I see a "huge" 7.8TB RAID1, and once it fills up I just swap an old disk (or two) for another 24/7 disk with decent TB/$.

At work: I was told our OpenSUSEs had some failures/data-loss, so we're not using the default btrfs on these. Though I don't know with what version that was (we migrated to OpenSUSE about 3 years ago).

I think some of SUSE's customers certainly must? Otherwise it would not make any sense for them to support it and stand behind it by now...?

i've been using BTRFS since 2014 to store backups. there is a noticeable performance penalty when rsync'ing hundreds of thousands of files to a spinning-rust disk connected to a USB-SATA dock when BTRFS is used instead of EXT4. i'm accepting it in exchange for the ability to run a scheduled scrub of the data to detect potential bitrot.

since 2017 i'm also using BTRFS to host mysql replication slaves. every 15 min, 1h and 12h, crash-consistent snapshots of the running database files are taken and kept for a couple of days. there's consensus that - due to its COW nature - BTRFS is not well suited for hosting vms, databases or any other type of files that change frequently. performance is significantly worse compared to EXT4 - this can lead to slave lag. but slave lag can be mitigated by using NVMe drives and relaxing the durability of the MySQL innodb engine. i've used those snapshots a few times each year - it has worked fine so far. snapshots should never be the main backup strategy; independently of them there's a full database backup done daily from the masters using mysqldump. snapshots are useful whenever you need to very quickly access the state of the production data from a few minutes or hours ago - for instance after fat fingering some live data.
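The "kept for a couple of days" part is the piece that usually needs a small script. A minimal sketch, assuming snapshot creation times are known and that "a couple of days" means two; the function name and the two-day default are illustrative, not anything from the setup described above.

```python
from datetime import datetime, timedelta


def expired_snapshots(snapshots, now, keep_for=timedelta(days=2)):
    """Return names of snapshots older than keep_for, oldest first.

    snapshots maps a snapshot name to its creation time.
    """
    cutoff = now - keep_for
    old = [(taken, name) for name, taken in snapshots.items() if taken < cutoff]
    return [name for taken, name in sorted(old)]


now = datetime(2020, 2, 1, 12, 0)
snaps = {
    "db-a": now - timedelta(days=3),
    "db-b": now - timedelta(hours=1),
    "db-c": now - timedelta(days=5),
}
print(expired_snapshots(snaps, now))
```

Each returned name would then be handed to whatever deletes the snapshot on the real system.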

during those years i've seen kernel crashes most likely due to BTRFS but i did not lose data as long as the underlying drives were healthy.

It’s worth noting that much of the premise of the article (wanting flexibility) is outdated. Zfs has support for removing top-level raid 0/1 vdevs now. So you can take a raid10 pool, and remove a top level mirror vdev completely. Note that this doesn’t work for raid5/6 vdevs, but as the author points out, those are becoming less and less used because of rebuild time and performance.

In addition to the slew of other features Btrfs is missing (send/recv, dedup, etc) zfs allows you to dedicate something like an Intel optane (or other similar high write endurance, low latency ssd) to act as stable storage for sync writes, and a different device (typically mlc or tlc flash) to extend the read cache.

I think there's a selection bias here: people using RAID 5/6 may not be using ZFS as much because it's not well supported. I'd bet money that those levels are much more common in SOHO settings than RAID 10 is, because it's still the sweet spot for "I need lots of storage" vs "...and am willing to spend a drive's worth of storage on availability". For instance, anyone using a NAS primarily as a backup target for desktops and small servers may love RAID 5, but be unwilling to throw money at a "better" RAID 10 setup.

btrfs has send/recv. And dedup, which is more efficient than ZFS's since it can be performed offline, on select parts of the filesystem, and doesn't have to keep gigabytes of dedup tables in memory.
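The reason an offline pass avoids the permanent in-memory table is that it only has to hash the data it is asked to scan, once, after the fact. A toy sketch of that scan on a byte string follows; it is not how btrfs or any real dedup tool is implemented (real tools ask the kernel to share the matching extents), and the 4096-byte block size is just an assumption.

```python
import hashlib
from collections import defaultdict


def duplicate_blocks(data, block_size=4096):
    """Group the offsets of identical fixed-size blocks in a byte string."""
    seen = defaultdict(list)
    for offset in range(0, len(data), block_size):
        block = data[offset:offset + block_size]
        seen[hashlib.sha256(block).digest()].append(offset)
    # Only blocks that occur more than once are dedup candidates.
    return [offsets for offsets in seen.values() if len(offsets) > 1]


sample = b"A" * 4096 + b"B" * 4096 + b"A" * 4096
print(duplicate_blocks(sample))
```

The hash table here lives only for the duration of the scan and only covers the data passed in, which is the "select parts of the filesystem" point.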

zfs remove is not a very good implementation - it keeps the old blocks around (as a virtual device) and redirects them to new locations. This is fine for "oops I accidentally added a device" but not great otherwise.

*Just kidding on send/recv, looks like it’s there now. Substitute with encryption if you need another example.

Is using btrfs on a personal machine a reasonable thing to do? All the comments, as well as the articles about it, seem to assume you're running it on a server.

The ability to add and remove disks on a desktop machine is very tempting.

I've been running it on my desktop for a while and it's been wonderful. I have a cron job set to take a snapshot of the filesystem hourly so if I ever blow a file away or a package upgrade goes wonky I'm back up and running in minutes.
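For anyone wanting to copy this, the hourly job boils down to one `btrfs subvolume snapshot -r` call with a timestamped destination. A small sketch that only builds the command; the paths `/` and `/.snapshots` and the naming scheme are assumptions, not the poster's actual setup.

```python
from datetime import datetime


def snapshot_command(source, dest_dir, now):
    """Build the argv for a read-only, hour-stamped btrfs snapshot."""
    name = now.strftime("%Y-%m-%d_%H00")
    return ["btrfs", "subvolume", "snapshot", "-r", source, f"{dest_dir}/{name}"]


cmd = snapshot_command("/", "/.snapshots", datetime(2020, 1, 27, 14, 5))
print(" ".join(cmd))
```

A cron entry would run a script that passes this list to `subprocess.run(cmd, check=True)` as root; the destination directory has to live on the same btrfs filesystem as the source.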

I've been a `btrfs` user for the better part of 4 years despite, at the time, a very vocal group providing advice against it[0].

I'll be the first to say that it isn't a silver bullet for everything. But then, what filesystem really is? Filesystems are such a critical part of a running OS that we expect perfection for every use case; filesystem bugs or quirks[1] result in data loss which is usually Really Bad(tm).

That said, for the last two years, I've been running Linux on a Thinkpad with a Windows 10 VM in KVM/qemu -- both are running all the time. When I first configured my Windows 10 VM, performance was brutal; there were times when writes would stall the mouse cursor and the issue was directly related to `btrfs`. I didn't ditch the file-system, I switched to a raw volume for my VM and adjusted some settings that affected how `btrfs` interacted with it. I discovered similar things happened when running a `balance` on the filesystem and after a bit of research, found that changing the IO scheduler to one more commonly used on spindle HDDs made everything more stable.

So why use something that requires so much grief to get it working? Because those settings changes are a minor inconvenience compared against the things "I don't have to mess with" to cover a bigger problem that I frequently encountered: OS recovery. An out-of-the-box OpenSUSE Tumbleweed installation uses `btrfs` on root. Every time software is added/modified, or `yast` (the user-friendly administrative tool) is run, a snapshot is taken automatically. When I or my OS screws something up, I have a boot menu that lets me "go back" to prior to the modification. It Just Works(tm). In the last two years, I've had around 4-5 cases where my OS was wrecked by keeping things up to date, or tweaking configuration. In the past, I'd be re-installing. Now, I reboot after applying updates and if things are messed up, I reboot again, restore from a read-only snapshot and I'm back. I have no use for RAID or much else[2] which is one of the oft-repeated "issues" people identify with `btrfs`.

It fits for my use-case, along with many of the other use-cases I encounter frequently. It's not perfect, but neither is any filesystem. I won't even argue that other people with the same use case will come to the same conclusion. But as far as I'm concerned, damn it works well.

[0] I want to say that an installation of openSUSE ended up causing me to switch to `btrfs`, but I can't remember for sure -- that's all I run, personally, and it is a default for a new installation's root drive.

[1] Bug: a specific feature (i.e. RAID) just doesn't work. Quirk: the filesystem has multiple concepts of "free space" that don't necessarily line up with what running applications understand.

[2] My servers all have LSI or other hardware RAID controllers and present the array as a single disk to the OS; I'm not relying on my filesystem to manage that. My laptop has a single SSD.

It's also worth noting that Synology uses btrfs as an option to do checksumming and snapshots on their NAS devices.

They're still using their own RAID layer though.

> They're still using their own RAID layer though.

Synology's RAID implementation is largely mdadm + LVM.

kernel 5.5 released Sunday. Btrfs now has raid1c3, raid1c4 profiles for 3 and 4 copy raid1. Adds new checksum algorithms: xxhash, blake2b, sha256.

Async discards coming in 5.6. https://lore.kernel.org/linux-btrfs/cover.1580142284.git.dst...

Being 'The Dude' of file systems is literally the opposite of what I want. When looking at ZFS talks and the incredible complexity of some of those operations that Btrfs seems to think are 'no big deal', I will simply not trust that. Especially because it has been proven over and over again that Btrfs claims it's 'stable' and then a new series of issues shows up. Or it's 'stable' but not if you use 'XY feature', or if the disk is 'too full' or whatever.

I remember using it after I had heard it was 'stable' and it ate my data not long after (not using crazy features or anything). I certainly will not use it again. A FS should be stable from the beginning, a stable core that you can then build features around, rather than a system with lots of features that promises to be stable in a couple of years (and then wasn't, years after being in the kernel already).

Using ZFS for me has been nothing but joy in comparison. Growing the ZFS pool for me has been no issue at all, I never saw a reason why I would want to reconfigure my pool. I went from 4TB to 16TB+ so far in multiple iterations.

Overall not having ZFS in Linux is a huge failure of the Linux world. I think it's much more NIMBY than a license issue.

> I think it's much more NIMBY than a license issue

How do you propose that ZFS be brought into Linux? When Sun released ZFS as open source, they made a deliberate decision to use a license that prevented it from being integrated into the Linux kernel. This was no accident. At the time, Sun was still pushing OpenSolaris which was losing ground to Linux. The ZFS on Linux project gets around this restriction by running ZFS in user space, but this is not optimal.

You can make a legitimate argument that Linux should have been released under a BSD style license (I think that would be wrong, but it's plausible). I don't see how you can argue that ZFS's license is somehow the fault of the Linux world.

> The ZFS on Linux project gets around this restriction by running ZFS in user space

ZFS on Linux is a kernel module. You may be thinking of ZFS-FUSE which runs in user space using FUSE, but I'm not sure if it's being maintained any more.

On the other hand, choosing to use GPL would prevent it from being integrated anywhere else. You'd also lose the patent protection granted by CDDL.

It's not like they couldn't use dual licenses, like Mozilla did at the time, for example.

Any license more liberal than the GPL would also be fine. For example: MIT/X11, 2-clause BSD, or Ruby's license.

> When Sun released ZFS as open source, they made a deliberate decision to use a license that prevented it from being integrated into the Linux kernel

This is simply totally false no matter how many times people repeat it. It's pure FUD.

Sun picked the licence because they had to allow linking with closed code for their products; going with the GPL was simply not viable given the situation with drivers on their platforms. Their licence is actually built on the Mozilla licence without forcing resolution in California. Sun actually spent quite a bit of time and resources to develop a really good licence and made it as open as they could given their constraints.

Also, Sun very aggressively pushed their technologies to other systems, and Linux would have been no exception. Sun helped Apple integrate DTrace, and at the same time they hatched an evil plan to not give it to Linux? They helped upstream things to the BSDs as well.

That's simply conspiracy nonsense that was typical of the 'it's actually GNU/Linux' crowd and was pushed in the 2000s. Sun was seen as an evil corporation trying to stamp on the 'real open source' community; looking back on this now, the absurdity of that sentiment should be clear. Sun made mistakes, but their overall track record was stellar.

The idea that the function of the GPL is to block other Open Source code from integrating into an Open Source project is an absolutely insane concept and a total perversion of the idea of Open Source. Literally using the supposed 'most free' GPL to actively block and exclude other Open Source code from people.

Sun's motivation for choosing the CDDL is beside the point. Unless ZFS is released under a license that allows it to be redistributed under the GPL, ZFS cannot be legally built into Linux as a filesystem.

If you have reason to believe that Linux developers can go ahead and simply integrate ZFS into Linux without worrying about the license, I'm sure lawyers from the FSF, IBM, Canonical, etc. would love to hear your explanation.

Uh, isn't Canonical already shipping ZFS in the latest Ubuntu? https://www.techrepublic.com/article/something-exciting-is-c...

Their argument (as I understand it) is that loadable kernel modules are separate discrete pieces of software that do not become "part of" the kernel and do not have to care about kernel licensing. They can be any license, including proprietary, like nvidia drivers.

There is about zero need for "integration" as in static linking / inclusion in the linux repo btw. Nothing wrong with dkms.

This is not a new thought and the lawyers actually understand this. The GPL was not designed to protect users from open source, and using it that way is an idiotic misapplication. Oracle themselves deliver ZFS with Linux, and so do many, many others.

The only 'argument' is that we can't do it because 'big bad oracle' will sue you but that really doesn't hold up.

> Sun's motivation for choosing the CDDL is beside the point.

GP made a claim about Sun's motivation; rebutting that claim seems reasonable.

> Sun picked the licence because they had to allow linking with closed code for their products

Wasn't Sun always the copyright holder? Licenses only apply to licensees, not licensors - or am I missing something (e.g. collaborators not needing to reassign copyright back to Sun, etc.)?

IIRC, Sun was not the copyright holder of everything in Solaris.

I mostly agree, and that's largely why I use ZFS a lot. But:

> A FS should be stable from the beginning,

If this is your standard, I don't think there's a file system out there that meets it. ZFS has had data-loss bugs. I doubt there is any non-toy file system that hasn't.

I've thought about what standard should apply to this - it is a prove-a-negative problem, that filesystem-X in combination with whatever recent kernel will not lose data. I don't have a good answer, but the one I came up with is "multiple years without a dataloss bug, of quick turnaround to other bug fixes, and a warm-fuzzy feeling about the developers."

ZFS was designed primarily to not lose data from the very beginning. That was at the very core of every design choice. Maybe there were a few such bugs but I have not read of any, while for Btrfs I have read about a whole lot of them.

Compare how bcachefs/zfs approaches these challenges and then go back to the early years of Btrfs. There is really no comparison.

I don't disagree that maintainer-goals and practices make me trust ZFS more. They do. But there have been bugs. Last major one I remember in the core code is this one:


But there was a Linux-specific data loss bug with ZFS in 2018:


Of course you should use what you like. And I agree that ZFS is safer. But again, I don't know of any file system that can say it has "been stable from the beginning", if stable means no data loss.

BTRFS is well known for being ill-suited to VMs or databases. How come ZFS doesn't have that reputation?

There is an attribute called NOCOW that can be set on specific files that should not be copy-on-write, which is what messes with databases, filesystem images and other things that needs fast in-place updates.

It can also be set as a flag on subvolumes.

You can also set such an attribute on files and subvolumes in btrfs.
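On btrfs the usual way to set it is `chattr +C`, and it only reliably takes effect on new or empty files, so the common pattern is to flag an empty directory and let files created inside it inherit the attribute. A hedged sketch of that pattern; the helper name, the `dry_run` switch and the example path are illustrative.

```python
import os
import subprocess


def make_nocow_dir(path, dry_run=False):
    """Create a directory and mark it No_COW so new files inherit the flag.

    Returns the chattr command so callers can log or inspect it.
    """
    cmd = ["chattr", "+C", path]
    if not dry_run:
        os.makedirs(path, exist_ok=True)
        subprocess.run(cmd, check=True)
    return cmd


# dry_run avoids touching the filesystem; the path is a made-up example.
print(make_nocow_dir("/var/lib/vm-images", dry_run=True))
```

Worth knowing before using it: on btrfs, files without copy-on-write also lose data checksumming, so this trades away bitrot detection for in-place updates.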


I think ZFS was just ahead. We considered btrfs many years ago to serve up VM's for ESX via NFS, but it just wasn't as performant unless you ran in async mode which jeopardized data integrity. ZFS let you introduce SSD-based ZIL and L2ARC caching which made performance totally fine in sync mode.

We're mostly NetApp AFF these days, but early on had close to a petabyte of ZFS-based storage powering VMs on SuperMicro or Dell gear. Definitely was higher touch than NetApp but far less expensive.

ZFS on Solaris 10 was ill suited for almost everything out of the box and was not feasible for MongoDB due to the half-arsed integration between ZFS and the rest of Solaris.

I'm using Btrfs currently, but I'm waiting for Bcachefs to replace it.

How far has it come to replace any of the production ready filesystems?

It says it's feature complete in 2015 and was trying to put itself into kernel mainline in 2018 but I don't see much about anyone using it in production.

It's not production ready so far for sure. I didn't really follow all details on that.

I have a small Nextcloud instance at home that uses BTRFS (on HDD, with noatime option) for file storage, and XFS (on SSD) for database.

I started it just for testing; it has been running for up to two years and has had no problems so far.

I've heard a lot of people say they won't use Btrfs due to reliability. Would have been nice to see that addressed.

What about the reliability? Are many people losing data with Btrfs?

Caveat: I don't use RAID[0].

In about 4 years of running it on a couple of servers and countless virtuals/desktops, I've never had a reliability issue that was directly related to btrfs. I do not have my servers plugged in to UPSes, I have the occasional "shutdown due to power loss". The only time I've lost data has been due to cable disconnection in my hardware RAID array, and even then I was able to recover a substantial amount of its `btrfs` stored files.

[0] Well, not filesystem-provided RAID; I have LSI controllers that provide the array to the OS as a single disk.

As best I can tell, reports of data loss on btrfs are all from the early 20-teens; after about 2014 or so I can't find anyone who claims to have lost data due to a btrfs bug on an up-to-date system.

RAID5 on btrfs has a write hole last time I checked. Bug has been around forever, and was around in 2014 for sure.

Phoronix has some thorough performance comparisons between Ext4fs, Btrfs, XFS, and ZFS.

The write-hole problem is a rare case, wherever it happens. https://lwn.net/Articles/665299/

On Btrfs, in case of bad parity being used to reconstruct a stripe, the resulting bad reconstruction is still subject to data checksumming, and will EIO. Corrupt data won't be sent to user space.
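A toy illustration of both halves of that claim: an interrupted stripe update leaves stale parity, a later rebuild from that parity produces garbage, and a per-block checksum turns the garbage into a read error instead of silently returning it. This is a simulation on byte strings, not btrfs code; `zlib.crc32` stands in for the filesystem's checksum, and the block contents are made up.

```python
import errno
import zlib


def xor(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


# One stripe: two data blocks plus parity, with a checksum kept per data block.
d0, d1 = b"AAAA", b"BBBB"
parity = xor(d0, d1)
checksums = {"d0": zlib.crc32(d0), "d1": zlib.crc32(d1)}

# Interrupted update: the new d0 reaches disk, but power is lost
# before the matching parity is rewritten (the write hole).
d0 = b"CCCC"

# Later the disk holding d1 dies, so d1 is rebuilt from d0 and stale parity.
rebuilt_d1 = xor(d0, parity)


def read_block(name, data):
    """Return data only if it matches the stored checksum, else fail with EIO."""
    if zlib.crc32(data) != checksums[name]:
        raise OSError(errno.EIO, f"checksum mismatch on {name}")
    return data


try:
    read_block("d1", rebuilt_d1)
except OSError as err:
    print("read failed:", err)
```

Without the checksum step, `rebuilt_d1` would simply be handed back to the application as if it were the original block.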

That one is here to stay, it is a property of software-based RAID. If it bothers you, use a UPS.

ZFS-based RAID-5 (called raidz1) doesn't have a write hole.


Because ZFS raidz1 is not raid5, it's even labelled differently. Yes, it is a parity-based raid, but has slightly different semantics.

I think in Linux, if you're using mdadm there is the ability to specify a write journal; all data (i.e. blocks+parity) gets written to the journal first, and then gets cleaned up after everything gets completed successfully, and the journal is replayed after a power failure.
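The journal idea described here can be modelled in a few lines: record data plus parity in the journal first, then write to the array, and after a crash replay whatever is still in the journal. This is a toy model of the general technique, not mdadm's implementation; the class and its `crash_after` knob are invented for the demonstration.

```python
def xor(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


class JournaledStripe:
    """Toy two-data-block parity stripe with a write journal in front of it."""

    def __init__(self, d0, d1):
        self.blocks = {"d0": d0, "d1": d1, "p": xor(d0, d1)}
        self.journal = []  # stands in for the dedicated journal device

    def write(self, name, data, crash_after=None):
        """Update block "d0" or "d1"; crash_after caps how many array writes land."""
        other = "d1" if name == "d0" else "d0"
        pending = {name: data, "p": xor(data, self.blocks[other])}
        self.journal.append(pending)      # step 1: data and parity hit the journal
        for i, (key, value) in enumerate(pending.items()):
            if crash_after is not None and i >= crash_after:
                return                    # simulated power loss mid-stripe
            self.blocks[key] = value      # step 2: write to the array itself
        self.journal.clear()              # step 3: entry is no longer needed

    def replay(self):
        """After a restart, reapply every journaled write in full."""
        for pending in self.journal:
            self.blocks.update(pending)
        self.journal.clear()

    def consistent(self):
        """True when parity matches the XOR of the data blocks."""
        return self.blocks["p"] == xor(self.blocks["d0"], self.blocks["d1"])


stripe = JournaledStripe(b"AAAA", b"BBBB")
stripe.write("d0", b"CCCC", crash_after=1)   # data lands, parity does not
print(stripe.consistent())
stripe.replay()
print(stripe.consistent())
```

Every write goes to the journal before the array, which is why the journal device's write speed bounds the array's.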

Mind you, for that to work well you'd want a victim SSD with a write speed at least that of the array...

Hardware RAID can also suffer from this indeed but does ZFS suffer from it as well? With exactly the same impact? AFAIK the filesystem stays consistent on ZFS.

raidz1 is not raid5.

From https://pthree.org/2012/12/05/zfs-administration-part-ii-rai... :

> Rather than the stripe width be statically set at creation, the stripe width is dynamic. Every block transactionally flushed to disk is its own stripe width. Every RAIDZ write is a full stripe write. Further, the parity bit is flushed with the stripe simultaneously, completely eliminating the RAID-5 write hole. So, in the event of a power failure, you either have the latest flush of data, or you don't. But, your disks will not be inconsistent.

> There's a catch however. With standardized parity-based RAID, the logic is as simple as "every disk XORs to zero". With dynamic variable stripe width, such as RAIDZ, this doesn't work. Instead, we must pull up the ZFS metadata to determine RAIDZ geometry on every read. If you're paying attention, you'll notice the impossibility of such if the filesystem and the RAID are separate products; your RAID card knows nothing of your filesystem, and vice-versa. This is what makes ZFS win.

Raidz1 isn't raid5, but if it mostly solves the same problem for users, without running into the write hole issue, isn't that suggesting we use raidz1 on zfs instead of raid5 on btrfs if we're concerned about unclean shutdowns?

What would we be missing in terms of capabilities by having raidz1 instead of raid5? (Just from the redundancy and performance point of view; let's assume everything else on btrfs and zfs is equal)

That is just nonsense. Btrfs had well known and well published issues for years. And since when are the data of 20 somethings not important?

"Early 20-teens" means the years 2010-2014 or so in this case. Nothing to do with people or their age.

It's the default for new Synology devices, and has been for a while. I suspect others are using it in a similar situation for home-grade NAS and up into the prosumer end of the market.

I feel like Btrfs is probably going to be well tested here, but I wonder how many of these users are diagnosing Btrfs problems when they occur? It's going to be more evident to some people, and you have to assume that some of the vendors are competent, but this is against a backdrop of people throwing this kit away or starting from scratch versus performing a root cause analysis.

I've personally been running this since it was stable on my DS1515+. I haven't had filesystem issues yet, but I make sure my important stuff is backed up elsewhere. A local backup like this is convenient for faster recovery in a lot of situations though which is why I keep it. I've SSH'd to the device and played around a little, but I fear I'd hit something proprietary, if the worst recovery situation occurred and I had to get everything from the DS1515+. If it was just an Ubuntu box I wouldn't have those fears, but the Syno NAS package is compelling.

My understanding is most bugs are ironed out of btrfs itself, but tooling is still weak. For example, if you have a disk drive go bad on you and you manage to recover ~ half of the sectors with a disk imaging tool, you won't be able to extract files from the image without extreme effort.

Why hasn't this caught up? Is it the case that data recovery companies are hoarding this after investing in their own tools, or something fundamental to the community?

Reliability does not only mean data loss. It may not be losing data but crashing every few hours, or locking up the system, or requiring constant monitoring and maintenance etc.

This article makes a few mistakes with regards to ZFS. Some are understandable (the author presumably last looked at the state of ZFS 5 years ago), but some were not true even 5 years ago:

> If you want to grow the pool, you basically have two recommended options: add a new identical vdev, or replace both devices in the existing vdev with higher capacity devices.

You can add vdevs to a pool which are different types or have different parities. It's not really recommended because it means that you're making it harder to know how many failures your pool can survive, but it's definitely something you can do -- and it's just as easy as adding any other vdev to your pool:

  % zpool add <pool> <vdev> <devices...>

This has always been possible with ZFS, as far as I'm aware.

> So let’s say you had no writes for a month and continual reads. Those two new disks would go 100% unused. Only when you started writing data would they start to see utilization

This part is accurate...

> and only for the newly written files.

... but this part is not. Modifying an existing file will almost certainly result in data being copied to the newer vdev -- because ZFS will send more writes to drives that are less utilised (and if most of the data is on the older vdevs, then most reads are to the older vdevs, and thus the newer vdevs get more writes).
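To see why rewrites drift toward the emptier vdev, here is a deliberately crude model: each new block goes to whichever vdev currently has the most free space. ZFS's real allocator is weighted rather than strictly greedy, so treat this only as an illustration of the direction of the effect; the vdev names and sizes are made up.

```python
def allocate(free_by_vdev, blocks):
    """Place blocks one at a time on whichever vdev has the most free space."""
    free = dict(free_by_vdev)
    placed = {name: 0 for name in free}
    for _ in range(blocks):
        target = max(free, key=free.get)
        free[target] -= 1
        placed[target] += 1
    return placed


# An old, mostly full vdev next to a freshly added, empty one.
print(allocate({"old": 20, "new": 100}, 50))
```

Because copy-on-write means a modified block is written out as a new block, every rewrite of old data is another allocation that this kind of policy steers toward the new vdev.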

> It’s likely that for the life of that pool, you’d always have a heavier load on your oldest vdevs. Not the end of the world, but it definitely kills some performance advantages of striping data.

This is also half-true -- it's definitely not ideal that ZFS doesn't have a defrag feature, but the above-mentioned characteristic means that eventually your pool will not be so unbalanced.

> Want to break a pool into smaller pools? Can’t do it. So let’s say you built your 2x8 + 2x8 pool. Then a few years from now 40 TB disks are available and you want to go back to a simple two disk mirror. There’s no way to shrink to just 2x40.

This is now possible. ZoL 0.8 and later support top-level mirror vdev removal.

> Got a 4-disk raidz2 pool and want to add a disk? Can’t do it.

It is true that this is not possible at the moment, but in the interest of fairness I'd like to mention that it is currently being worked on[1].

> For most fundamental changes, the answer is simple: start over. To be fair, that’s not always a terrible idea, but it does require some maintenance down time.

This is true, but I believe that the author makes it sound much harder than it actually is (it does have some maintenance downtime, but because you can snapshot the filesystem the downtime can be as little as a minute):

    # Assuming you've already created the new pool $new_pool.
    # Note: a plain `zfs send` transfers only ROOT itself; if ROOT has child
    # datasets, `zfs send -R` is needed to include them.
    % zfs snapshot -r $old_pool/ROOT@base_snapshot
    % zfs send $old_pool/ROOT@base_snapshot | zfs recv $new_pool/ROOT

    # The base copy is done -- no downtime. Now we take some downtime by stopping all use of the pool.
    % take_offline $old_pool # or do whatever it takes for your particular system
    % zfs mount -o ro $old_pool/ROOT # optional
    % zfs snapshot -r $old_pool/ROOT@last_snapshot
    % zfs send -i @base_snapshot $old_pool/ROOT@last_snapshot | zfs recv $new_pool/ROOT

    # Finally, get rid of the old pool and add our new pool.
    # A pool can only be renamed on import, so the new pool is exported first too.
    % zpool export $old_pool
    % zpool export $new_pool
    % zpool import $new_pool $old_pool
    % zfs mount -a # probably optional

[1]: https://www.youtube.com/watch?v=Njt82e_3qVo


Raidz2+spares, compression, snapshots and send/receive are very useful. And zil and cache are easier than lvmcache..

I'm so sorry teacher, Btrfs ate my homework.

Storage spaces is probably the best software raid available today. Unfortunately, it comes with windows.

It supports heterogeneous drives, safe rebalancing (create a third copy, THEN delete the old copy), fault domains (3-way mirror, but no 2 copies can be on the same disk/enclosure/server/whatever), erasure coding, hierarchical storage based on disk type (e.g., use NVMe for the log, SSD for the cache), clustering (paxos, probably). Then you toss ReFS on top, and you're done.

The only compelling reasons to buy windows server are to run third party software or a storage spaces/ReFS file share.
