I went down the BTRFS path, despite its dodgy reputation, when Netgear announced their little embedded NASes, and switched my server over to it. The experience was solid enough that I bought a high-end Synology and have had zero problems with it.
After that, none of the features like compression, snapshots, COW or checksums meant anything to me. I'm much happier with ext4 and xfs on lvm.
I'm not sure what causes it, but there seems to be an effect where btrfs either loves you or hates you, with few people reporting mixed experiences regarding data loss. One possible cause is that distro choice tends to be per person, along with how up to date said distro keeps its kernel. But I'm not sure.
Why wouldn't you expect it to survive that? Is there a particular reason to believe those drives are broken? I.e., are they older consumer drives known to lie about cache flushes? Do they have bad sectors? How have you abused it? What kind of load? Did you fill the filesystem (which another commenter mentioned seems to be a common element of most sad btrfs stories)? Did your system frequently lose power while under write load?
Lacking more details, I'd just say one user experiencing 0 bugs in 5 years should be completely unremarkable. I expect filesystems to be very reliable, so a lot of people having stories of corruption means stay away from btrfs. Having some people with stories of no corruption doesn't really move the needle. Together, these stories still mean stay away from btrfs!
You want details from people experiencing zero problems, but you don't ask for details from people who are having them? That's a weird way to conduct the necessary autopsies to discover and fix bugs.
Anyway, I monitor the upstream filesystem lists, and they all have bugs. They're all fixing bugs. They're all adding new features. And that introduces bugs that need fixing. It's not that remarkable, until of course someone suggests that only one file system is to be avoided, while providing no details and relying on conjecture.
I don't need to ask people who've had problems because I've had them myself, in unremarkable circumstances, a while back. I'm sure I could find reports on the mailing list as well, in which others have already asked for details.
I recently conducted system resource starvation tests where a compile process spun off enough threads to soak the system to the point it becomes unresponsive. I did over 100 forced power off tests while the Btrfs file system was being written to as part of that compile. Zero complaints: not on mount, not on scrubs, not with btrfs check, and not any while in normal operation following those power offs.
If you want to complain about Btrfs, complain about the man page warning to not use --repair without advice from a developer. You did know about that warning, right?
The only difference is that none of the repair tools were able to recover the filesystem, but I was able to dump the files themselves to a new disk to recover them. Really not sure why, it was very strange.
Once I ended up with a bunch of zero length files (presumably metadata was written before content?).
I also, multiple times, ended up with errors about full drives despite my drive not being full. Deleting snapshots seemed to help.
Then I went to a zfs fs on root and never had another problem.
My quite large 1tb multivolume, multisnapshot BTRFS fs never had any problems.
And that's with quite an aggressive config (a long fs commit interval).
P.S. I do have backups though.
Why not put poweroff in a cron task a bit before midnight, so you don't needlessly risk hosing your file system? You can always restore your backup, but it takes time!
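A minimal sketch of what that could look like (the time and path are arbitrary examples; the poweroff binary's location varies by distro):

```shell
# /etc/crontab entry: shut the machine down cleanly at 23:50 every night.
# On systemd distros, a systemd timer invoking 'systemctl poweroff'
# would be the more idiomatic equivalent.
50 23 * * * root /sbin/poweroff
```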
Basically, any time a runaway process filled my disk, I just had to hard-reboot and hope I didn't have any unsaved work or state that I needed to preserve.
Really makes me hope that Apple is going to further extend APFS to not just be baby's first CoW volume-management filesystem.
Do you have Time Machine enabled? I think it uses snapshots, which explains why the filesystem stays full. I've hit this myself and was initially surprised to see rm not improving matters (possibly even making it worse), but it makes sense with snapshots. That it worked after a reboot was a surprise. I'd put off fixing the machine for at least a week, and when I went to actually fix it, it was quite anticlimactic to just reboot and have it work. Maybe it checks for this condition on reboot and dumps Time Machine snapshots if so.
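If it is local Time Machine snapshots eating the space, they can also be inspected and pruned by hand with tmutil (the date stamp below is a placeholder):

```shell
# List local APFS snapshots on the root volume
tmutil listlocalsnapshots /

# Delete a specific snapshot by its date stamp
sudo tmutil deletelocalsnapshots 2020-03-01-120000

# Or ask Time Machine to thin snapshots until roughly 10 GB
# is reclaimable (last argument is the purge urgency, 1-4)
sudo tmutil thinlocalsnapshots / 10000000000 4
```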
That was the less scary part of my macOS filesystem integrity worries. My full disk started when it was staging a full Time Machine backup after I got a dialog saying:
> Time Machine completed a verification of your backups on "my.nas.address". To improve reliability, Time Machine must create a new backup for you.
...for the Nth time. I don't know for certain if the problem is with Apple's software or with my NAS's (Synology) but these backups are clearly not as reliable as one would hope...
Some of these are fixed by now, though.
I used to read every email on btrfs-devel for a year or so.
As with all his talks, you can expect it to be quite entertaining as well as informative and historical (if from his POV).
If things still get trashed then I tend to think that the very foundation of the FS is bad.
But maybe I'm just naive :)
I wouldn't buy a $5 USB thumb drive if half the people said it lost their data and half said it worked fine.
Of course, where we run into problems is that btrfs is meant to be the reliable backup. Oops.
I tried, I really tried to like btrfs.
On the servers/workstations I’ve had few serious issues, but a few “gotchas” you need to know to keep things running smoothly.
On every laptop I’ve had, I’ve had btrfs fail on me. Repeatedly.
So I gave up on it. ZFS for me these days.
This is how superstitious traditions start, and ritualistic sacrifice in particular, I'd think.
Does data loss count as a sacrifice in this instance?
Surely it depends on the btrfs implementation.
e.g. Arch Linux getting daily kernel updates vs an enterprise distro
Same thing happens with operating systems.
Two-thirds of the way down this Hacker News discussion (if you're patient enough to get there) you can see questions about production deployment of btrfs, with some VERY interesting answers describing BIG deployments of btrfs. Real success, confirmed with data.
My takeaway from reading the whole discussion:
* a lot of people (individuals) praise btrfs
* a lot of people (ind.) tell of problems
* quite a few nice features/btrfs usage patterns were mentioned, not matched even by zfs
* still, for VM/DB workloads you should consider a different approach (thin LVM + xfs or ext4), with a slave machine WITH btrfs and snapshots on it
* quite a few problems/deficiencies of ZFS were mentioned (apart from the typical license/kernel-inclusion issue)
* a lot of new features are on the way in recent kernels for btrfs
* btrfs is not dead
p.s. worth noting that kernel 5.6 just received another huge batch of new btrfs features (async discard!)
The reason Synology btrfs is mostly solid is because they refused to ever use the btrfs raid layer. But the second you move to btrfs on LVM you lose a large portion of the supposed benefits.
Having used both, never lost data on zfs and I’ve been using it since it was released and have had it save me from silent data corruption. BTRFS hasn’t ever lost me an entire file system, but I’ve definitely lost files.
I had been wanting to try ZFS on my home NAS for a while (for snapshotting/redundancy/data integrity) and finally got enough disks that it made sense. I wasn't looking forward to learning what I presumed to be a very complicated system, though. About 15 minutes into my research on setting up and maintaining a ZFS filesystem, I just went: wait, that's it? So incredibly simple and well documented; it has been a joy to use. It is very rare that complicated operations on complicated systems use such simple and easy-to-understand commands. It just does what I expect!
Examples: ZFS snapshots can be recursive (-r) or not, whereas on btrfs they cannot be recursive; in discussions I've seen, this is mentioned as "a feature", since you can create a subvolume for data that you don't want to be part of the snapshot, but it also prevents you from dividing up a logical hierarchy into multiple behaviours (compression vs. not, block size, etc.).
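For illustration, a recursive snapshot plus the kind of per-dataset tuning described above might look like this (pool and dataset names are made up):

```shell
# Snapshot a dataset and all its descendants atomically
zfs snapshot -r tank/home@nightly

# Inspect what was just created
zfs list -t snapshot -r tank/home

# Per-dataset behaviours still differ within the same hierarchy
zfs set compression=lz4 tank/home/photos
zfs set recordsize=16K  tank/home/db
```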
Bind mounts can get around most of the limitations here, at the cost of polluting one directory with the canonical locations of all your special-purpose subvolumes. I think it's still awkward to simultaneously snapshot every subvolume that is mounted under a particular tree for incremental backup purposes.
I've personally waited for BTRFS longer than a decade but my use-cases are yet to be considered stable (not something you really mess with in regard to filesystems).
Honestly, as sure as I once was of the success of BTRFS, I now consider BTRFS dead on arrival, if it ever even arrives. The pace of development is slower than the universe around it; that might be too harsh, but really: no RAID6 yet? A decade ago the impression I got was "soon". And now 2-drive parity is becoming obsolete.
ZFS has tons of warts for home-use, I agree. So, for a home-user with high demands I don't see anything exciting in the future.
Re obsolete, are you referring to RAID1C3?
I'd much prefer something like raidz3 compared to the author's setup.
RAID1C3 is nice but very expensive for use in bulk storage at home.
No way to rebalance a pool. Also, growing a pool always results in less reliability (in terms of how many drive losses it takes for the whole pool to go down).
No proper recovery tools if something goes wrong.
Then there's the lack of flexibility discussed in the article. This means the up-front cost and total cost are vastly more than with a more typical setup, where you can buy drives spread out over many years and take advantage of falling prices, lower power consumption and less noise (in part because you typically start such an array with higher-density drives, since the lower cost and longevity allow you to).
Probably forgot some other reasons.
That said, I still use zfs (FreeNAS) at home. But because of the above, it is quite hard to blindly recommend it.
My experience is that it's pretty good. The tooling does what it says without a lot of drama. I can scrub while the system is in use and don't notice it mostly. I have seen some small corruptions that it was able to flag for me with specific filenames and fix. Snapshotting and send/receive is also very handy.
I heard some people say they don't like to use it under heavy load. That seems reasonable to me. You're paying costs to get the integrity piece. So it's not for every use or every user. It is very good at what it does, however.
I switched to it after the 7200.11 firmware mess, where the drives reported successful writes but didn't write anything. ZFS would have caught that; my Adaptec card certainly couldn't have, and didn't.
ZFS to the rescue again a while later when those (now firmware updated) 7200.11 drives started dropping after 15k hours of service. ZFS saved my data when two drives started failing in my RAID5 set at the same time.
All the weird minor problems that would cause random issues or performance issues for other file systems like flaky SATA cables, intermittent HBA/backplane ports, etc. ZFS catches them all and informs you.
Having been hit by bit rot, corrupted files, corrupted file systems, etc etc before switching, ZFS is fantastic. And there is something great about watching it scrub at >1GB/sec, verifying every single bit of your data.
> a situation in which something is advertised and discussed in newspapers, on television, etc. a lot in order to attract everyone's interest

Maybe it's just me, but modern-day usage of "hype" seems to carry a negative meaning, especially in tech, similar to false advertising. And no one was actively promoting ZFS; they were only very "responsive".

And then "zealots". I had to reread 226 comments, then ran to the Cambridge dictionary:

> a person who has very strong opinions about something, and tries to make other people have them too

I don't see anyone having strong opinions and forcing others to have the same. If anything, a lot of people are showing up not because they love ZFS, but because they have been burnt by btrfs.
Agreed 100%. That's particularly annoying to us desktop users. It took me years to figure out that no, FreeBSD aside, it doesn't bring anything to the table outside of enterprise storage use cases. At least, it doesn't bring anything that's worth the hassles (I don't have to export ntfs filesystems before using them on another computer; same for ext4; and then there's performance).
Mainline Linux has a policy against in-kernel ABI stability guarantees. User-space is given ABI stability guarantees, in-kernel code by intention is not. That includes filesystems.
Or, to put that another way: what are AWS and GCP using in their SANs (EBS; GCE PD) that allows them to take on-demand incremental snapshots of SAN volumes, and then ship those snapshots away from the origin node into safer out-of-cluster replicated storage (e.g. object storage)? Is it proprietary, or is it just several FOSS technologies glued together?
My naive guess would be that the cloud hosts are either using ZFS volumes, or LVM LVs (which do have incremental snapshot capability, if the disk is created in a thin pool) under iSCSI. (Or they’re relying on whatever point-solution VMware et al sold them.)
If you control the filesystem layer (i.e. you don’t need to be filesystem-agnostic), would Btrfs snapshots be better for this same use-case?
Filesystem snapshots are a legitimate way of backing up databases, but it's not quite as simple as just taking a snapshot. For PostgreSQL for example you will still need to call pg_start_backup() and ensure your WAL archives are properly stored in your object storage system for point-in-time recovery. Without the database-specific precautions, your snapshots will still be crash-consistent and most likely usable in some manner, but not quite proper backups.
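A rough sketch of that sequence on ZFS, with hypothetical pool/dataset names (note that PostgreSQL 15 renamed these functions to pg_backup_start/pg_backup_stop, and WAL archiving must already be configured for point-in-time recovery):

```shell
# Put PostgreSQL into base-backup mode (true = take a fast checkpoint)
psql -c "SELECT pg_start_backup('fs_snapshot', true);"

# Take the atomic filesystem snapshot of the data directory's dataset
zfs snapshot tank/pgdata@base-$(date +%Y%m%d)

# End backup mode; the WAL segments needed for recovery get archived
psql -c "SELECT pg_stop_backup();"
```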
Using BTRFS or ZFS as the database filesystem has its own footguns. For example, the default record size of ZFS datasets doesn't match the block size of most databases, so if you forget to take that into account, you'll very likely see rather terrible performance.
This gives you backup with a replay value, so you can restore at any point in time. You can also use such backup for setting up replication. There's still a daily backup which is there to speed up recovery and increase resiliency. Those backups don't really put much load on the database, but if that's a concern you can back up the replica (which is what cloud providers or at least AWS is doing).
As for ZFS, out of the box it is not a good file system for databases, although you can get good performance after tuning. For example, you want to configure block sizes aligned with the database's blocks, configure the ZIL, and perhaps change the block hashing algorithm (although I think the current default should be fast).
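The usual tuning knobs look something like this (the dataset name is hypothetical; 8K matches PostgreSQL's page size, while InnoDB would want 16K):

```shell
# Align recordsize with the database's page size
zfs set recordsize=8K tank/pgdata

# The DB's WAL already orders writes; bias the ZIL for throughput
zfs set logbias=throughput tank/pgdata

# The DB does its own data caching; keep only metadata in the ARC
zfs set primarycache=metadata tank/pgdata

# Access-time updates are pure overhead for a database workload
zfs set atime=off tank/pgdata
```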
As for your question of how cloud providers are doing it, most of us can only speculate. To me it looks like standard RDS instances are simply on EBS (which utilizes S3). In Aurora they skipped EBS and implemented their own database storage directly.
It seems like the backups are performed in the traditional way, though.
In fact the opposite: make sure to use a UPS just so that you can shut down cleanly in the unfortunate event.
For example: https://blogs.oracle.com/paulie/backing-up-mysql-using-zfs-s...
Some people recommend filesystem snapshotting, but wouldn't that make recovery a slow process, because you have to load up the entire database even if you just wanted to look up data from a small table?
Maybe backing up only small tables as SQL dumps while keeping a file system snapshot would be a good compromise.
Each time you want a backup, create a CoW snapshot of the thin LV, then mount it somewhere and run the backup.
The "main" thin LV should be happily chugging along independently when you are doing that.
And all of this is stable, proven technology available nearly everywhere (e.g. RHEL 7+).
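The steps above can be sketched as follows (volume group and LV names are made up):

```shell
# Create a CoW snapshot of the thin LV (near-instant, space-efficient)
lvcreate -s -n db_snap vg0/db_thin

# Thin snapshots are created inactive; -K activates despite the skip flag
lvchange -ay -K vg0/db_snap

# Mount read-only, run the backup, then throw the snapshot away
mount -o ro /dev/vg0/db_snap /mnt/db_snap
rsync -a /mnt/db_snap/ /backup/db/
umount /mnt/db_snap
lvremove -y vg0/db_snap
```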
It seems as though the way to go would be to take a 'snapshot', back it up, and then delete it immediately; is that right?
As far as I know AWS does not use SANs because they consider them an anti-pattern. Most backups land on S3 because of reliability and price.
EBS is very much a SAN; if you read the docs, the Nitro HBA controllers have dedicated bandwidth allocation just for EBS.
As there is a dedicated network for just servicing block storage, that sounds suspiciously like a Storage Area Network to me.
S3 for backup makes lots of sense: it's ubiquitous, reliable and smeared over lots of regions. It also works well with large files. It's also orders of magnitude cheaper than EBS to run.
1. they came out with S3 soon after coming out with their Dynamo paper (before releasing DynamoDB, even); and
2. there’s a good constructive proof, as a studyable FOSS system, for how to build object storage on top of a Dynamo architecture, in the form of Riak CS (object storage) which is built atop Riak KV (a Dynamo impl.) Riak CS seems to make pretty much the same set of guarantees (in terms of time/space complexity of operations, possible durability numbers per scaled number of copies, etc.) that S3 does, so it’s a fair guess that they’re similarly-architected systems.
Primary DB-Server -> XFS on LVM with a LVM caching SSD
Secondary (write-only) mirror DB-Server -> ZFS
The DB replicates automatically to the secondary server via the database's internal replication features. On the secondary I am then able to lock the DB temporarily, take a ZFS snapshot, and maybe do a ZFS send/receive afterwards, without affecting the primary server.
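That last step might look roughly like this (pool, dataset, snapshot and host names are all invented):

```shell
# On the secondary, after briefly quiescing the DB: take a snapshot
zfs snapshot tank/db@2020-03-02

# Ship it incrementally (relative to the previous snapshot) off-host,
# without touching the primary at all
zfs send -i tank/db@2020-03-01 tank/db@2020-03-02 | \
    ssh backuphost zfs receive backup/db
```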
ZFS is great for snapshots and archiving huge amounts of data, but it's very, very bad for production databases in terms of performance. Most databases aren't designed to deal with the CoW nature of ZFS, which leads to very bad write performance and database fragmentation in the end.
See various "Private Cloud" Linux distributions for implementation examples of this. Such as Proxmox which does it out of the box on ZFS and soon on btrfs too.
Which makes synchronising snapshots a lot easier (and caching too, but that's another thing entirely).
They are treated as block storage, so from the outside they don't have to worry about what filesystem is running on it. (In practice they have to be a bit aware, so that they don't snapshot unbootable or dirty images, but I assume that's mostly handled by an OS plugin.)
AWS et al snapshots are at the block level. Linux has poorly documented primitives for this.
If you put your VM images on a Filesystem provided by ZFS or BTRFS then you can snapshot your images, without having to buy a SAN, or expensive controller.
ZFS has by far the best documentation. BTRFS's documentation has improved, but the tools are still difficult to use.
But Ceph is not designed to be a competitor to BTRFS or ZFS. The core vision of Ceph is scalability. If you need petabytes of storage and the performance to scale with it, take a look at Ceph.
I may be totally wrong here, but from what I understand about Ceph, it's not meant as a file system for a single computer. I don't understand the idea of running Ceph on your laptop/desktop.
It's possible to run it that way, but it defeats its purpose.
I've built a small lab setup with Ceph:
Also, there's the issue of performance, in particular latency. That's a bit of a weak spot of Ceph, from what I can tell. Again, may be wrong. But I found these notes interesting.
In fact, it's really common to use a ZFS array on single nodes, and then create a SAN using multiple such machines by layering Ceph on top.
For me latency isn't really a large issue. I read and write everything locally on my SSD-backed desktop/laptop and then sync my files to my storage node via git or rsync or something. For me data integrity and availability are important.
tl;dr: Unbeknownst to me, I had a bad drive cable on an external NVMe enclosure that was causing intermittent I/O errors (only during high drive utilization). They went undetected by BTRFS and slowly corrupted my drive, eventually leading to an unbootable and unrepairable system. And to be fair, I should have scrubbed instead of attempting btrfsck --repair from another booted drive, but I don't care what you say: a --repair function should NOT potentially cause FURTHER corruption if it is available in the tooling at all! Like, just fucking rip it out if it can potentially make things worse, or recode the damn thing to act defensively... jeez
Wiped the drive and started over with Ubuntu 19.10 and its new integrated ZFS-on-root support... ZFS detected the I/O issue pretty much instantly and prevented further errors by freezing I/O. Swapped the cable out during my troubleshooting and the issue went away. Also, the drive is plenty fast: it read-tested at 800MB/s.
And of course it was a weekend where my parents and siblings and in-laws were visiting, so I had the joy of going around messing with DNS settings wherever someone had a device that only paid attention to the first two DNS servers in the DHCP settings.
(I've since changed my DNS setup- now I only have a primary self-hosted one that's on an RPi in my networking cabinet, and the second entry is Google. I figure if I only get two servers that are respected for real, I'm making sure one of them is google.)
I was under the impression that there was no such thing as primary and secondary for DNS, just "here is one" and "here is another", with someone choosing the terrible naming scheme of "primary" and "secondary".
I’m no expert, and my knowledge comes from messing about with Pi-hole and reading their documentation.
The real question is how they behave under less than ideal conditions. It is these conditions where Btrfs has performed poorly, and where ZFS has performed very well. I lost several Btrfs filesystems due to its poorly-tested and broken error handling trashing the filesystem beyond recovery.
The selling point of both of these filesystems is their robustness, fault-tolerance and ability to self-heal. Only one of them actually delivers.
The best suggestion I can offer is to use a distribution that treats it like a first-class citizen, such as... well, the Ubuntu support is still beta level, so only NixOS for now.
could this possibly be proxmox's fault more than ZFS's fault? You even said the pools were fine
Performance becomes an issue in certain cases, but in every one that I've encountered, adjusting configuration has resolved the problems to my satisfaction.
Would my Windows 10 VM run better under a different filesystem, rather than `btrfs` with various tweaks applied? Reading relatively recent articles on the subject would suggest that it would, however, I'd rather work with a single filesystem type and understand its strengths/weaknesses than manage two different filesystems as long as I can get performance to a usable state.
We've switched back to using MD (mdadm) for RAID-1 setup, and then using btrfs on top of that for the snapshots, send / receive, block-level CRC and such.
Dealing with failed drives isn't as easy with btrfs as it is with Linux MD.
It wasn't very long ago that I had BTRFS drives on two separate systems develop crippling performance issues, with random delays increasing up to seconds, and the filesystem going unresponsive for even longer when I deleted snapshots. I think something about the performance was degrading every time an hourly snapshot was made, even though the system only kept a couple dozen of them at a time.
Is this also true for RAID5/6?
So, no, that particular issue hasn't been fixed.
That’s unfortunate. Does the scrub run automatically in those situations? Consumer hardware will be the most prone to intermittent power failure.
Not sure if that workaround would work in btrfs, but it worked on ZFS.
Mind that everything is copy-on-write, you can't do anything, even metadata changes, without allocating new blocks. It needs the reserve space.
Btrfs uses the disk completely. This is harder to do (compared to e.g. ext4 reserving a fixed amount of inode space, which may go unused when the disk is full). At some point they added an in-memory "global reserve" metadata space which allows you to delete stuff even if the file system is full.
This is why I'm back on ext4 now, too.
IIRC ext4 reserves a (configurable) portion of the disk for system management; it seems like btrfs could easily do the same.
In ZFS, and I'm sure in btrfs, you can set up quotas and reserved space, globally and/or per user, but by default they are set to 0. I actually set my quota to 80% because apparently if you fill ZFS beyond that it causes heavy fragmentation.
To be more specific, reserved space on ext3 helps the fs be more flexible during allocation and avoid fragmentation.
Ext4 has a delayed-allocation mount option for that purpose, so reserved space is not as important there, but it would still help if you turn off delayed allocation.
It's been years since Btrfs introduced the "global reserve", which holds back enough metadata space to ensure it's possible to delete files on full file systems. But an old workaround for this is to add a small device to the Btrfs volume, making it a 2-device volume. It could be a USB stick, a zram device (ramdisk), a partition, or even a loop-mounted file on some other file system. Delete the files, and then you can remove the temporary 2nd device.
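The workaround goes something like this (the loop device, paths and balance threshold are placeholders):

```shell
# Loop-mount a small file from another filesystem as a temporary device
truncate -s 1G /other-fs/spill.img
losetup /dev/loop7 /other-fs/spill.img
btrfs device add /dev/loop7 /mnt/full-volume

# Deletions now have room to allocate metadata; free up space,
# compact mostly-empty chunks, then detach the helper device
rm -rf /mnt/full-volume/some-big-files
btrfs balance start -dusage=10 /mnt/full-volume
btrfs device remove /dev/loop7 /mnt/full-volume
losetup -d /dev/loop7
```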
eta: looked up the ticket; the customer reported that "when trying to delete any files, even as root, btrfs says 'cannot remove'", and field engineering observed the same.
I used to use ZFS on my NAS, but after running it for a year and fiddling with it, I wasn't able to tune it in a way I liked. I always had random performance problems and zvols were super slow. It's now dm-integrity on all disks, an mdraid raid6 volume over those, with LVM2 on top of that and mirrored NVMe disks as a read and write cache.
I also wish BTRFS would add extents at some point so you could run virtual machine images from it without weird performance issues from time to time (although I imagine this is less of an issue on SSDs because they're "fragmented" inside anyways).
BTRFS does have some scary stories from earlier in its development, and true raid5 seems like it's unlikely to be safe for quite a while, but raid1 and "normal" fs usage has been rock solid in my experience. The only time I've ever had an issue was probably 4 years ago at this point, and it was solved by just booting an Arch live iso and running a btrfs command that was basically "fix exactly the bug that your error message indicates". I don't remember exactly what it is, something about two sizes not matching, but googling the text it showed at boot led me directly to the command to fix it. Certainly dramatically less trouble than I've ever had when hardware RAID goes south.
I do agree that modern lvm does probably compete with btrfs, but again you're trading how dang simple btrfs raid1 is to manage for monkeying with partitions in lvm in exchange for ~some? performance.
IMO ZFS is in a weird spot where I don't know where I'd use it. It's too complicated/annoying to admin for me to want to run it in my basement for myself/my family, and for anything bigger or more professional I'd use ceph or a problem-domain-specific storage system (HDFS, clickhouse, aws, etc).
% zpool attach <pool> <existing-device> <new-device>  # grow a mirror vdev
% zpool detach <pool> <device>                        # shrink a mirror vdev
% zpool add <pool> [<vdev-type>] <devices...>         # add a new vdev
% zpool remove <pool> <vdev>                          # remove a vdev
I would not (nor did I) compare LVM (or md-raid) to btrfs or ZFS -- those technologies have fundamental limitations regarding the integrity of your data that ZFS (and btrfs) don't have. And don't get me wrong -- I don't have a problem with btrfs (I run btrfs on all of my machines except my home server -- which runs ZFS), I just disagree with GP's point that ease of use is an argument for btrfs over ZFS. There are many arguments for either technology.
> btrfs is also mainline, which increases how painless it is to use.
I agree that this is one argument to pick btrfs over ZFS (though on most distributions it isn't really that hard to install ZFS, the fact that btrfs requires zero extra work to use on Linux is a benefit).
It all depends on the application but in the majority of cases the io performance of btrfs is worse than the alternatives.
Red Hat, for example, chose to deprecate btrfs for unknown reasons, while SUSE made it its default. Its future seems uncertain, which may cause a lot of headaches in major environments if it is implemented there.
The fact that one enterprise-support provider went all-in on Btrfs, while another didn’t, basically tells you that the choice is pretty arbitrary. If no enterprise-support provider used Btrfs, then I’d be concerned.
People treat RH stopping support of btrfs as some sort of death knell for it. Meanwhile all the btrfs users are confused why RH's opinion should matter at all when they weren't that involved with developing it in the first place.
As an opensuse user, btrfs has saved multiple machines from botched updates by letting me revert to the snapshot from right before the update was applied (opensuse's update tool automatically takes snapshots before and after updates).
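On openSUSE this is driven by snapper; a typical recovery session looks roughly like the following (the snapshot numbers are examples):

```shell
# Show snapshots, including the pre/post pairs zypper creates around updates
snapper list

# See what changed between the pre- and post-update snapshots
snapper status 42..43

# Roll the system back to the pre-update state (takes effect on reboot)
snapper rollback 42
```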
One advantage is it detects bit rot -- and you can scrub the disks once a week looking for the bad blocks.
I also like the inline compression.
I run RAID1, and the only issue I had was several years ago: there was a bug about freeing allocations, so occasionally the filesystem would report as full when it wasn't.
According to Josef Bacik, RH deprecated btrfs because he was the engineer in charge of Btrfs there and had left the company.
Then again I avoided Grub for years because I found it fiddlier and more breakage-prone than LILO, so possibly I'm just an idiot and/or jinxed when it comes to new things in Linux.
ext4 - 33s,
ZFS - 50s,
btrfs - 74s
(the test was run on a Vultr.com 2GB virtual machine; the backing disk was allocated using "fallocate --length 10G" on an ext4 filesystem; the results are very consistent)
"Btrfs has played a role in increasing efficiency and resource utilization in Facebook’s data centers in a number of different applications. Recently, Btrfs helped eliminate priority inversions caused by the journaling behavior of the previous filesystem, when used for I/O control with cgroup2 (described below). Btrfs is the only filesystem implementation that currently works with resource isolation, and it’s now deployed on millions of servers, driving significant efficiency gains."
If I were doing that today, I would do a bake-off of OverlayFS vs. btrfs for this feature. Btrfs has many other compelling features that may make it worth using, although it's always been slower than ext4/xfs so I'd also need to check how it does with modern ultra high performance NVMe drives.
Btrfs never lost our data, although there was a kernel panic in the journal writing code in the Linux 3.2/Ubuntu 12.04 timeframe. The panic would not cause data loss but it did wedge VMs. Since that was fixed, it's had a 100% reliable run in that system, to my knowledge.
My rough understanding is that Synology did some pretty heavy modifications to btrfs in their implementation, though... (a quick google finds me nothing to back this up, but I remember reading about it somewhere...)
I'd like to see them move to full disk encryption rather than their current approach.
For RAID5, they are using it on top of LVM, but with some modification - the synology implementation hooks LVM and btrfs together, so it gets ZFS-like properties.
There's a guy on the internet, who was playing around with it: https://daltondur.st/syno_btrfs_1/
When I was doing whole rebuilds of Debian, using e.g. 8 parallel builds of >18000 packages, it was creating and destroying a snapshot once every few seconds to minutes, with at most 8 snapshots in existence at once. It got unbalanced and went read-only every 36 hours. A clean, brand-new filesystem which never had more than 10% space utilisation and was typically around 1%.
At work: I was told our OpenSUSE machines had some failures/data loss, so we're not using the default btrfs on them. Though I don't know which version that was on (we migrated to OpenSUSE about 3 years ago).
Since 2017 I've also been using BTRFS to host MySQL replication slaves. Crash-consistent snapshots of the running database files are taken every 15 min, 1 h and 12 h, and kept for a couple of days. There's a consensus that, due to its COW nature, BTRFS is not well suited to hosting VMs, databases, or any other files that change frequently. Performance is significantly worse than EXT4, which can lead to slave lag, but slave lag can be mitigated by using NVMe drives and relaxing the durability settings of MySQL's InnoDB engine. I've used those snapshots a few times each year and it has worked fine so far. Snapshots should never be the main backup strategy; independently of them, a full database backup is done daily from the masters using mysqldump. Snapshots are useful whenever you need very quick access to the state of the production data from a few minutes or hours ago, for instance after fat-fingering some live data.
During those years I've seen kernel crashes most likely due to BTRFS, but I did not lose data as long as the underlying drives were healthy.
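A minimal sketch of that snapshot rotation, assuming the datadir sits on its own subvolume (the paths and the three-day retention are my assumptions, not the commenter's exact setup):

```shell
# Take a read-only, crash-consistent snapshot of the running replica.
# On restore, InnoDB replays its redo log exactly as after a power loss.
ts=$(date +%Y%m%d-%H%M)
btrfs subvolume snapshot -r /var/lib/mysql "/snapshots/mysql-$ts"

# Expire snapshots older than three days (hypothetical retention policy).
find /snapshots -maxdepth 1 -name 'mysql-*' -mtime +3 \
    -exec btrfs subvolume delete {} \;
```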
In addition to the slew of other features Btrfs is missing (send/recv, dedup, etc), zfs allows you to dedicate something like an Intel Optane (or other similar high-write-endurance, low-latency SSD) to act as stable storage for sync writes (the SLOG), and a different device (typically MLC or TLC flash) to extend the read cache (L2ARC).
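Attaching those two device classes is a one-liner each; the pool and device names below are assumptions:

```shell
# Dedicate a low-latency, high-endurance device to the intent log (SLOG),
# so synchronous writes are acknowledged as soon as they land on it...
zpool add tank log /dev/nvme0n1
# ...and a larger, cheaper SSD to extend the read cache (L2ARC).
zpool add tank cache /dev/sdb
```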
The ability to add and remove disks on a desktop machine is very tempting.
I'll be the first to say that it isn't a silver bullet for everything. But then, what filesystem really is? Filesystems are such a critical part of a running OS that we expect perfection for every use case; filesystem bugs or quirks result in data loss which is usually Really Bad(tm).
That said, for the last two years, I've been running Linux on a Thinkpad with a Windows 10 VM in KVM/qemu -- both are running all the time. When I first configured my Windows 10 VM, performance was brutal; there were times when writes would stall the mouse cursor and the issue was directly related to `btrfs`. I didn't ditch the file-system, I switched to a raw volume for my VM and adjusted some settings that affected how `btrfs` interacted with it. I discovered similar things happened when running a `balance` on the filesystem and after a bit of research, found that changing the IO scheduler to one more commonly used on spindle HDDs made everything more stable.
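For reference, the two mitigations described, switching the I/O scheduler and taking the VM image out of copy-on-write, look roughly like this (the device name and paths are assumptions):

```shell
# Show the schedulers the device offers, then pick a deadline-style one
# more typical of spinning disks.
cat /sys/block/sda/queue/scheduler
echo mq-deadline > /sys/block/sda/queue/scheduler

# Mark the VM image directory NOCOW so qemu writes in place; the flag only
# affects files created after it is set, so apply it to an empty directory.
chattr +C /var/lib/libvirt/images
```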
So why use something that requires so much grief to get it working? Because those settings changes are a minor inconvenience compared against the things "I don't have to mess with" to cover a bigger problem that I frequently encountered: OS recovery. An out-of-the-box OpenSUSE Tumbleweed installation uses `btrfs` on root. Every time software is added/modified, or `yast` (the user-friendly administrative tool) is run, a snapshot is taken automatically. When I or my OS screws something up, I have a boot menu that lets me "go back" to prior to the modification. It Just Works(tm). In the last two years, I've had around 4-5 cases where my OS was wrecked by keeping things up to date, or tweaking configuration. In the past, I'd be re-installing. Now, I reboot after applying updates and if things are messed up, I reboot again, restore from a read-only snapshot and I'm back. I have no use for RAID or much else which is one of the oft-repeated "issues" people identify with `btrfs`.
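The "go back" flow on Tumbleweed is driven by snapper; a rough sketch (the snapshot number here is illustrative):

```shell
snapper list            # pre/post snapshots taken around every zypper/yast run
snapper rollback 42     # clone snapshot #42 and make it the default subvolume
reboot                  # come back up in the known-good state
```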
It fits for my use-case, along with many of the other use-cases I encounter frequently. It's not perfect, but neither is any filesystem. I won't even argue that other people with the same use case will come to the same conclusion. But as far as I'm concerned, damn it works well.
 I want to say that an installation of openSUSE ended up causing me to switch to `btrfs`, but I can't remember for sure -- that's all I run, personally, and it is a default for a new installation's root drive.
 Bug: a specific feature (i.e. RAID) just doesn't work. Quirk: the filesystem has multiple concepts of "free space" that don't necessarily line up with what running applications understand.
 My servers all have LSI or other hardware RAID controllers and present the array as a single disk to the OS; I'm not relying on my filesystem to manage that. My laptop has a single SSD.
They're still using their own RAID layer though.
Synology's RAID implementation is largely mdadm + LVM.
Async discards coming in 5.6.
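On 5.6+ kernels that should be usable as a btrfs mount option; a sketch, with the device and mountpoint as placeholders:

```shell
# discard=async batches TRIM in the background instead of issuing it
# inline with every delete.
mount -o discard=async /dev/sda2 /mnt
# or persistently via /etc/fstab:
# /dev/sda2  /mnt  btrfs  discard=async  0 0
```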
I remember using it after I had heard it was 'stable', and it ate my data not long after (I wasn't using crazy features or anything). I certainly will not use it again. A FS should be stable from the beginning: a stable core that you then build features around, rather than a system with lots of features that promises to be stable in a couple of years (and then still wasn't, years after being in the kernel).
Using ZFS for me has been nothing but joy in comparison. Growing the ZFS pool for me has been no issue at all, I never saw a reason why I would want to reconfigure my pool. I went from 4TB to 16TB+ so far in multiple iterations.
Overall, not having ZFS in Linux is a huge failure of the Linux world. I think it's much more NIMBY than a license issue.
How do you propose that ZFS be brought into Linux? When Sun released ZFS as open source, they made a deliberate decision to use a license that prevented it from being integrated into the Linux kernel. This was no accident. At the time, Sun was still pushing OpenSolaris which was losing ground to Linux. The ZFS on Linux project gets around this restriction by running ZFS in user space, but this is not optimal.
You can make a legitimate argument that Linux should have been released under a BSD style license (I think that would be wrong, but it's plausible). I don't see how you can argue that ZFS's license is somehow the fault of the Linux world.
ZFS on Linux is a kernel module. You may be thinking of ZFS-FUSE which runs in user space using FUSE, but I'm not sure if it's being maintained any more.
This is simply false, no matter how many times people repeat it. It's pure FUD.
Sun picked the licence because they had to allow linking with closed code for their products; going with the GPL was simply not viable given the situation with drivers on their platforms. Their licence is actually built on the Mozilla licence, without forcing dispute resolution in California. Sun spent quite a bit of time and resources developing a really good licence and made it as open as they could, given their constraints.
Also, Sun very aggressively pushed their technologies to other systems, and Linux would have been no exception. Sun helped Apple integrate DTrace; are we to believe they were simultaneously hatching an evil plan to withhold it from Linux? They helped upstream things to the BSDs as well.
That's simply conspiracy nonsense of the kind typical of the "it's actually GNU/Linux" crowd of the 2000s. Sun was seen as an evil corporation trying to stamp on the "real open source" community; looking back now, the absurdity of that sentiment should be clear. Sun made mistakes, but their overall track record was stellar.
The idea that the function of the GPL is to block other open source code from being integrated into an open source project is an absolutely insane concept and a total perversion of the idea of open source: literally using the supposedly "most free" GPL to actively block and exclude other open source code from people.
If you have reason to believe that Linux developers can go ahead and simply integrate ZFS into Linux without worrying about the license, I'm sure lawyers from the FSF, IBM, Canonical, etc. would love to hear your explanation.
Their argument (as I understand it) is that loadable kernel modules are separate discrete pieces of software that do not become "part of" the kernel and do not have to care about kernel licensing. They can be any license, including proprietary, like nvidia drivers.
There is about zero need for "integration" as in static linking / inclusion in the linux repo btw. Nothing wrong with dkms.
The only 'argument' is that we can't do it because 'big bad Oracle' will sue you, but that really doesn't hold up.
GP made a claim about Sun's motivation; rebutting that claim seems reasonable.
Wasn't Sun always the copyright holder? Licenses only apply to licensees, not licensors - or am I missing something (e.g. collaborators not needing to reassign copyright back to Sun, etc.)?
> A FS should be stable from the beginning,
If this is your standard, I don't think there's a file system out there that meets it. ZFS has had data-loss bugs. I doubt there is any non-toy file system that hasn't.
I've thought about what standard should apply to this - it is a prove-a-negative problem, that filesystem-X in combination with whatever recent kernel will not lose data. I don't have a good answer, but the one I came up with is "multiple years without a dataloss bug, of quick turnaround to other bug fixes, and a warm-fuzzy feeling about the developers."
Compare how bcachefs/zfs approaches these challenges and then go back to the early years of Btrfs. There is really no comparison.
But there was a Linux-specific data loss bug with ZFS in 2018:
Of course you should use what you like. And I agree that ZFS is safer. But again, I don't know of any file system that can say it has "been stable from the beginning", if stable means no data loss.
It can also be set as a flag on subvolumes.
We're mostly NetApp AFF these days, but early on had close to a petabyte of ZFS-based storage power VM's on SuperMicro or Dell gear. Definitely was higher touch than NetApp but far less expensive.
It says it was feature complete in 2015 and was trying to get into the mainline kernel in 2018, but I don't see much about anyone using it in production.
I started it just for testing, and it has been running for about two years now with no problems so far.
In about 4 years of running it on a couple of servers and countless virtuals/desktops, I've never had a reliability issue that was directly related to btrfs. I do not have my servers plugged into UPSes, so I get the occasional "shutdown due to power loss". The only time I've lost data was due to a cable disconnection in my hardware RAID array, and even then I was able to recover a substantial amount of its `btrfs`-stored files.
 Well, not filesystem-provided RAID; I have LSI controllers that provide the array to the OS as a single disk.
Phoronix has some thorough performance comparisons between ext4, Btrfs, XFS, and ZFS.
On Btrfs, in case of bad parity being used to reconstruct a stripe, the resulting bad reconstruction is still subject to data checksumming, and will EIO. Corrupt data won't be sent to user space.
Mind you, for that to work well you'd want a victim SSD with a write speed at least that of the array...
From https://pthree.org/2012/12/05/zfs-administration-part-ii-rai... :
> Rather than the stripe width be statically set at creation, the stripe width is dynamic. Every block transactionally flushed to disk is its own stripe width. Every RAIDZ write is a full stripe write. Further, the parity bit is flushed with the stripe simultaneously, completely eliminating the RAID-5 write hole. So, in the event of a power failure, you either have the latest flush of data, or you don't. But, your disks will not be inconsistent.
> There's a catch however. With standardized parity-based RAID, the logic is as simple as "every disk XORs to zero". With dynamic variable stripe width, such as RAIDZ, this doesn't work. Instead, we must pull up the ZFS metadata to determine RAIDZ geometry on every read. If you're paying attention, you'll notice the impossibility of such if the filesystem and the RAID are separate products; your RAID card knows nothing of your filesystem, and vice-versa. This is what makes ZFS win.
What would we be missing in terms of capabilities by having raidz1 instead of raid5? (Just from the redundancy and performance point of view; let's assume everything else on btrfs and zfs is equal)
I feel like Btrfs is probably going to be well tested here, but I wonder how many of these users are diagnosing Btrfs problems when they occur? It's going to be more evident to some people, and you have to assume that some of the vendors are competent, but this is against a backdrop of people throwing this kit away or starting from scratch versus performing a root cause analysis.
I've personally been running this since it was stable on my DS1515+. I haven't had filesystem issues yet, but I make sure my important stuff is backed up elsewhere. A local backup like this is convenient for faster recovery in a lot of situations though which is why I keep it. I've SSH'd to the device and played around a little, but I fear I'd hit something proprietary, if the worst recovery situation occurred and I had to get everything from the DS1515+. If it was just an Ubuntu box I wouldn't have those fears, but the Syno NAS package is compelling.
> If you want to grow the pool, you basically have two recommended options: add a new identical vdev, or replace both devices in the existing vdev with higher capacity devices.
You can add vdevs to a pool which are different types or have different parities. It's not really recommended because it means that you're making it harder to know how many failures your pool can survive, but it's definitely something you can do -- and it's just as easy as adding any other vdev to your pool:
% zpool add <pool> <vdev> <devices...>
> So let’s say you had no writes for a month and continual reads. Those two new disks would go 100% unused. Only when you started writing data would they start to see utilization
This part is accurate...
> and only for the newly written files.
... but this part is not. Modifying an existing file will almost certainly result in data being copied to the newer vdev -- because ZFS will send more writes to drives that are less utilised (and if most of the data is on the older vdevs, then most reads are to the older vdevs, and thus the newer vdevs get more writes).
> It’s likely that for the life of that pool, you’d always have a heavier load on your oldest vdevs. Not the end of the world, but it definitely kills some performance advantages of striping data.
This is also half-true -- it's definitely not ideal that ZFS doesn't have a defrag feature, but the above-mentioned characteristic means that eventually your pool will not be so unbalanced.
> Want to break a pool into smaller pools? Can’t do it. So let’s say you built your 2x8 + 2x8 pool. Then a few years from now 40 TB disks are available and you want to go back to a simple two disk mirror. There’s no way to shrink to just 2x40.
This is now possible. ZoL 0.8 and later support top-level mirror vdev removal.
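With device evacuation in ZoL 0.8+, shrinking down to the new mirror is a matter of removing the old top-level vdevs; the pool and vdev names here are assumptions:

```shell
# Copies all data off the named vdev, then detaches it from the pool.
zpool remove tank mirror-1
zpool status tank       # reports the evacuation's progress until it completes
```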
> Got a 4-disk raidz2 pool and want to add a disk? Can’t do it.
It is true that this is not possible at the moment, but in the interest of fairness I'd like to mention that it is currently being worked on.
> For most fundamental changes, the answer is simple: start over. To be fair, that’s not always a terrible idea, but it does require some maintenance down time.
This is true, but I believe that the author makes it sound much harder than it actually is (it does have some maintenance downtime, but because you can snapshot the filesystem the downtime can be as little as a minute):
# Assuming you've already created the new pool $new_pool.
% zfs snapshot -r $old_pool/ROOT@base_snapshot
% zfs send $old_pool/ROOT@base_snapshot | zfs recv $new_pool/ROOT
# The base copy is done -- no downtime. Now we take some downtime by stopping all use of the pool.
% take_offline $old_pool # or do whatever it takes for your particular system
% zfs set readonly=on $old_pool/ROOT # optional: prevent further writes
% zfs snapshot -r $old_pool/ROOT@last_snapshot
% zfs send -i @base_snapshot $old_pool/ROOT@last_snapshot | zfs recv -F $new_pool/ROOT
# Finally, get rid of the old pool and add our new pool.
% zpool export $old_pool
% zpool import $new_pool $old_pool
% zfs mount -a # probably optional
Raidz2 + spares, compression, snapshots, and send/receive are very useful. And ZIL and cache devices are easier to set up than lvmcache.
It supports heterogenous drives, safe rebalancing (create a third copy, THEN delete the old copy), fault domains (3-way mirror, but no 2 copies can be on the same disk/enclosure/server/whatever), erasure coding, hierarchical storage based on disk type (e.g., use NVMe for the log, SSD for the cache), clustering (paxos, probably). Then you toss ReFS on top, and you're done.
The only compelling reasons to buy windows server are to run third party software or a storage spaces/ReFS file share.