I used to be in the "why BTRFS" camp and religiously installed plain ext4 without LVM on Fedora for my laptops, desktops, and servers. When I saw release after release keep offering BTRFS by default, I decided to try it for a recent laptop install. Honestly, given the appeal of deduplication, checksumming, snapshotting, and so many other features that modern filesystems (e.g., ZFS) generally come with, I just took the plunge and installed it.
I can safely say it has not presented any problems for me thus far, and I am at the stage of my life where I realize that I don't have the time to fiddle as much with settings. If the distributions are willing to take that maintenance on their shoulders, I'm willing to trust them and deal with the consequences – at least I know I'm not alone.
It's obviously not there as a NAS filesystem, ZFS drop-in replacement, etc. But if what you take away from that is that BTRFS is no good as a filesystem on a single drive system, you're missing out. Just a few weeks ago I used a snapshot to get myself out of some horrible rebase issue that lost half my changes. Could I have gone to the reflog and done other magic? Probably. But browsing my .snapshots directory was infinitely easier!
Snapshots are the best thing in the world for me on Arch. I'm specifically using it because I like to tinker with exotic hardware and it has the most sane defaults for most of the things I care about. Pacman is great, but the AUR can be a bit sketchy sometimes with the choices package authors make. Having a snapshot taken every time packages change, which I can roll back to from my boot loader, is _awesome_. If you've ever used a distro that keeps kernel backups in your boot loader, it's like that, except it's whole package sets at a time! And being able to use subvolumes to control which parts of a snapshot to restore is awesome. I can roll back my system without touching my home directory, even on a single-drive setup!
> But browsing my .snapshots directory was infinitely easier!
I second this, though I don't use the filesystem to get this functionality. I most often use XFS and have a cron job that calls an old Perl script called "rsnapshot" [1], which makes use of hardlinks to deal with duplicate content and save space. It can create both local and remote snapshots. Similar to your situation, I have used this to fix corrupted git repos, which I could have done within git itself, but rsnapshot was many times easier and I am lazy.
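A minimal sketch of that setup, assuming rsnapshot is installed and "hourly"/"daily" intervals are defined in rsnapshot.conf (paths and schedule are illustrative, not the poster's actual config):

    # /etc/cron.d/rsnapshot (illustrative)
    0 */4 * * *  root  /usr/bin/rsnapshot hourly   # rotate hardlinked snapshots every 4 hours
    30 3 * * *   root  /usr/bin/rsnapshot daily    # promote one snapshot per day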
For me personally it was a matter of benchmarks. In the past I have seen higher performance numbers than ext4, especially when creating/deleting in directories that contain a very large number of files. I've not tested since the 5.11 kernel time-frame, however. But to your point, it would be nice to see XFS gain an option to grow inode limits dynamically like btrfs, instead of having to adjust them manually. I've honestly never even tried out btrfs, but it looks like I should.
By the way, "git reflog" can usually get you out of horribly botched rebases without using special filesystem features: git reset --hard <sha1 of last good state from reflog>
The reflog can be a PITA to walk through. A less well known thing is that you can spelunk through the reflog by saying `<ref>@{N}` which means whatever commit `<ref>` was pointing at N changes to the ref ago. Super handy to double-check that the rebase squashing commits didn't screw things up if there were merge conflicts in the fixups.
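For anyone who hasn't used that syntax, a rough sketch of the recovery workflow on a hypothetical `main` branch:

    git reflog main              # list where 'main' has pointed, newest first
    git diff main@{3} main       # compare the current tip against 3 ref-updates ago
    git reset --hard main@{3}    # restore the branch to that earlier state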
Synology is running on top of mdraid, but does not use dm-integrity. Since it can be enabled per share, and each share creates a btrfs subvolume, that would be kind of difficult.
For scrubbing, plain-old btrfs-scrub is being used.
Does BTRFS actually need to be augmented like that? I've been afraid of using it for a NAS because it doesn't sound like it's as trustworthy as ZFS when it comes to handling bitrot. But I don't know if that's actually true. When I tried to find info on it a couple weeks ago, a lot of people were trying to claim that bitrot isn't a thing if you have ECC RAM.
BtrFS was previously restricted to crc32c checksums. This was enhanced to allow several more, including xxhash (which promises fewer collisions, for safer deduplication) and sha256. When configured for sha256, BtrFS uses a checksum that is as strong as ZFS's.
However, the checksum must be chosen at the time of filesystem creation. I don't know of any way to upgrade an existing BtrFS filesystem.
Contrast this to ZFS, which allows the checksum to be changed per dataset, with the new setting applying to data written afterward.
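For reference, roughly how each is selected (device and dataset names are placeholders; the btrfs flag needs a reasonably recent btrfs-progs):

    mkfs.btrfs --csum sha256 /dev/sdX1      # btrfs: checksum algorithm fixed at mkfs time
    zfs set checksum=sha256 tank/archive    # ZFS: set per dataset, applies to newly written data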
Hmm well that's good to know. I like the idea of being able to have mismatched drive capacities and easy expandability which is why I had been looking at Btrfs. I will have to look more into the checksum options.
Its RAID 5/6 comes with a warning from the developers not to use it, and its RAID 1 is a weird arrangement that actually keeps 2 copies across however many disks and can lose data if a disk comes and goes, for example with a bad cable.
Bitrot is a thing, getting random bit flips/etc past all the ECC and data integrity checks in the storage path is much harder.
Having filesystem-level protection is good, but it is like going from 80% to 85% protected. That is because the most critical part remaining unprotected in a traditional RAID/etc. system is actually the application-to-filesystem interface. POSIX and Linux are largely to blame here, because the default IO model should be an async interface where the completion only fires when the data is persisted, and things like read()/write()/close() should be fully serialized with the persistence layer. Otherwise, even with btrfs, the easiest way to lose data is simply to write it to disk, close the file, and pull the power plug.
For example, if your use case is a file archive (think raw photo or video), then the filesystem interface does not matter - if the computer crashes soon after a copy, you re-copy from the original media. But bit flips are very real and can ruin your day.
I'm really curious what your storage stack is that you're getting undetected bit flips. This stuff was my day job for ~10 years. I've seen every kind of error you can imagine, and I can't actually remember "bit flips" showing up in end-user data that weren't eventually attributable to something stupid like a lack of ECC RAM, or software bugs. Random bit flips tend to show up in two ways: the interface/storage mechanism gets "slow" (due to retries/etc.), or you get flat-out read errors. This isn't the 1980s, where you could actually read data back from your storage medium and get flipped bits; there are too many layers of ECC on the storage media for that to go undetected. Combined with media scrubbing, failures tend to be all or nothing: the drive goes from working to 100% dead, or the RAID kicks it when the relocated sector counts start trending up, or the device goes into a read-only mode. What most people don't understand is that IO interfaces these days aren't designed to be 100% electrically perfect. The interface performance/capacity is pushed until the bit error rate (BER) is not insignificant, and then error correction is applied to assure that the end result is basically perfect.
But as I mentioned, these days I'm pretty sure nearly all the storage loss that isn't physical damage is actually software bugs. Just a couple of months ago I uploaded a multi-GB file (from my ECC-protected workstation) to a major hyperscaler's cloud storage/sharing option. I sent the link to a colleague halfway around the globe and they reported an unusual crash. So I asked them to md5sum the file and they got a different result from what I got, so I downloaded the file myself and diffed it against the original, and right in the middle there was a ~50k block of garbage. Uploaded it again, and it was fine. Blame it on my browser, or whatever you will, but the end result was quite disturbing because I'm fairly certain my local storage stack was fine. These days I'm really quite skeptical of "advanced" filesystems/etc. What I want is a dumb one where the number 1 priority is data consistency. I'm not sure that is an accurate reflection of many of them, where winning the storage perf benchmark, or the feature wars, seems to be a higher priority.
The last time I saw data damage was around 2005; this was multiple SATA drives connected to a regular consumer motherboard running Linux (sorry, don't remember the brands). If I remember right, I think there was an 8-byte block damaged every few gigabytes transferred or so? So a very high number of damaged files, given I had a few terabytes of data.
I never found the cause, because I just switched to a completely different system to copy the data. I know it was not disk-specific, because this was happening on multiple hard drives, nor was it physical damage, as SMART/syslog were silent and reading the disk again gave correct data. Memory was fine – not ECC, but I did run a lot of memtests on it.
Later on, I found some blog posts which mentioned a similar problem and claimed it was the result of a bad SATA card, or a bad cable, or even a bad power supply. I remember there was an original one written by Jeff Bonwick on his ZFS blog, but I cannot find it anymore. Here is a more modern link instead: https://changelog.complete.org/archives/9769-silent-data-cor...
I now have a homegrown checksumming solution which I use after each major file transfer, and I have not seen any data corruption yet (knock on wood).
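Something in that spirit can be as simple as a checksum manifest with GNU coreutils (paths are illustrative; this is a sketch, not the poster's actual script):

    cd /archive && find . -type f -print0 | xargs -0 sha256sum > ~/archive.sha256   # after the transfer
    cd /archive && sha256sum --check --quiet ~/archive.sha256                       # later, to verify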
Link-level CRCs don't protect against bit flips that happen during processing or while transitioning between links with different CRCs. For maximum end-to-end integrity you want to calculate the check value over the data in (ECC) RAM before writing it and the check value to storage, and verify the check value after reading all the data back in—this ensures that a bit-flip will be detected no matter where it occurs in the pipeline.
No it isn't. At some point data is in transit in an IC without any error handling. CRCs aren't 100% reliable. Bit flips are guaranteed to happen with non-zero probability in all digital electronics. Bad data eventually makes it to persistent storage.
I don't know what parts of the storage stack you guys are working on, but on the enterprise storage systems I worked on we had 100% coverage in one form or another from the moment data left the RAM, via the Intel interconnects, PCIe, and the FC adapters, which were either ECC-protected or used some form of data protection wrapping the entire transaction. So random bit flips in serdes/etc. didn't cause problems, because the higher-level link protocols protected the data between the onboard RAM and the endpoint, and then the higher-level protocols also provided their own data integrity.
Looking at even SATA 1, if it is implemented _CORRECTLY_ you get the packet CRC protecting the link data from the point it's formed to the point the endpoint verifies the result. So, like Ethernet, sometimes it doesn't matter if some random piece of junk in the path doesn't do its own ECC validation, because it's covered by a higher level of the stack.
If your adapters are "desktop" grade then I might consider seeking another vendor if you care about data integrity, some vendors are definitely shipping crap, but there are vendors I can assure you will detect link/etc failures.
And as a side note, I've seen a lot of bad data, and a huge percentage of it was kernel/filesystem errors. We added a bunch of out-of-band extra metrics to track when and where writes were going, plus our own metadata layers/etc., and it uncovered a whole bunch of software errors.
>>it is designed with a focus on data integrity by protecting the user's data on disk against silent data corruption caused by data degradation, power surges (voltage spikes), bugs in disk firmware, phantom writes (the previous write did not make it to disk), misdirected reads/writes (the disk accesses the wrong block), DMA parity errors between the array and server memory or from the driver (since the checksum validates data inside the array), driver errors (data winds up in the wrong buffer inside the kernel), accidental overwrites (such as swapping to a live file system), etc.
You can typically make a branch and push it up to the server, and I usually do this before any rebase. Will you be able to do that reliably in future development scenarios (Mac, Windows)?
That sounds extremely unlikely. Did you look into what the reflog actually is? And a tag is literally just a pointer to a commit hash. Sure, you can tag it, but don't embarrass yourself by pushing it to a remote.
Pro tip: Don't pretend to know better than everyone else, then go "no no no" like an 8 year old when someone asks for info ;) It's ok to admit you're wrong.
BTRFS works fine. I use it on my everyday laptop without problems. Compression can help on devices without a lot of disk, and so can copy-on-write. However, BTRFS has its drawbacks; for example, it's tricky to have a swapfile on it (it's now possible with some special attributes).
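The special-attribute dance usually looks something like this (the classic manual method; I believe newer btrfs-progs also provide `btrfs filesystem mkswapfile` for the same purpose):

    truncate -s 0 /swapfile
    chattr +C /swapfile                            # +C disables copy-on-write for this file
    dd if=/dev/zero of=/swapfile bs=1M count=4096  # a swapfile must be fully allocated, no holes
    chmod 600 /swapfile
    mkswap /swapfile
    swapon /swapfile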
Also, I wouldn't trust BTRFS for the purpose of data archiving, because ext4 is a proven (and simpler) filesystem, thus it's less likely to become corrupt, and it's more likely that you'll be able to recover data from it if it does become corrupt (or the disk has some bad sectors and that sort of stuff).
> Also, I wouldn't trust BTRFS for the purpose of data archiving, because ext4 is a proven (and simpler) filesystem, thus it's less likely to become corrupt, and it's more likely that you'll be able to recover data from it if it does become corrupt (or the disk has some bad sectors and that sort of stuff).
On the contrary; I'm using btrfs and not ext4 on NAS (Synology) specifically, because the former does checksumming and bitrot detection and the latter does not.
I was using urbackup with ext4 and was having issues that caused corruption and couldn't figure out why; I saw a recommendation to use urbackup with BTRFS and have had no corruption since. I have used ext4 in every other use case and had no issue, so I'm not saying ext4 is at fault, but so far BTRFS has worked great for me.
A NAS is not a backup; it's something you use to store data.
The backup I was referring to is the offline one. If I need to back up data – something that unfortunately I don't do as often as I should – I need a filesystem that is reliable and proven (ext4 has been around for more than a decade, and counting its predecessors even longer, so in 20 years I'm confident I would be able to mount an ext4 hard drive that I forgot in the garage on a modern system; with BTRFS, who knows), and one for which a lot of tools exist in case something goes wrong (there are a ton of tools to recover data from damaged ext4 drives; are we sure it's as easy with BTRFS? If I have a filesystem with compression, I don't think recovering data is as simple as running photorec...)
Also, the filesystem of a backup drive is not something you can change easily. I still have an old 1 TB drive that I formatted a long time ago as NTFS, and I have never changed the filesystem, since backing up all the data to another drive (finding another 1 TB drive first), formatting the drive, and copying the data back would take a day. Not that there is anything super important on that drive – mostly stuff I downloaded from the internet years ago – but it's still an example of why, for a backup drive, I don't want the cutting-edge choice that then creates problems in the future.
Ext4 is ubiquitous, so it's my filesystem of choice for any purpose that requires the data to be archived for more than 2 years.
I'm not sure how you prove that ext4 is less likely to become corrupt. But it is easily shown that it's less likely to inform you that there's corruption.
Quite a lot of the assumption behind earlier file systems is that the hardware either returns correct data or reports a problem, e.g. an uncorrectable read error or media error. That's been shown to be untrue even with enterprise-class hardware, largely by the ZFS developers, hence why ZFS exists. It's also why ZFS has had quite a lot less "bad press": Btrfs wasn't developed in a kind of skunkworks, it was developed out in the open, where quite a lot of early users were running it on ordinary everyday hardware.
And as it turns out, we see most hardware, by make/model, doing mostly the right things, while a small number of makes/models, making up a significant minority of usage volume, don't do the right things. Hence, Btrfs has always had full checksumming of data and metadata. Both XFS and ext4 were running into the same kinds of problems Btrfs (and ZFS before it) revealed – torn writes, misdirected writes, bit rot, memory bit flips, and even SSDs exhibiting pre-fail behavior by returning either zeros or garbage instead of data (or metadata). XFS and ext4 subsequently added metadata checksums, which further reinforced the understanding that devices sometimes do the wrong thing and also lie about it.
It is true that overwriting filesystems have a better chance of repairing metadata inconsistencies. A big reason why is locality. They have fixed locations on disk for different kinds of metadata, thus a lot of correct assumptions can be made about what should be in that location. Btrfs doesn't have that at all, it has very few fixed locations for metadata (pretty much just the super blocks). Since no assumptions can be made about what's been found in metadata areas, it's harder to fix.
So the strategy is different with Btrfs (and probably ZFS too, since it has a fairly nascent fsck even compared to Btrfs's): cheap and fast replication of data via snapshots and send/receive, which requires no deep traversal of either the source or destination. And equally cheap and fast restore (replication in reverse) using the same method. Conversely, conventional backup and restore are meaningfully different operations when reversed, so you have to test both the backup and the restore to really understand whether your backup method is reliable. That replication is going to be your disaster go-to rather than trying to fix the filesystem. Fixing is almost certainly going to take much longer than restoring. If you don't have current backups, at least Btrfs now has various rescue mount options to make mounting more tolerant of a broken file system, though as a consequence you have to mount read-only. There's a pretty good chance you can still get your data out, even if it's inconvenient to have to wipe the file system and create a new one. It'll still be faster than mucking with repair.
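The replication in question is just snapshot-and-pipe; a rough sketch with illustrative paths:

    btrfs subvolume snapshot -r /home /home/.snapshots/home-week1        # read-only snapshot
    btrfs send /home/.snapshots/home-week1 | btrfs receive /mnt/backup   # full replication
    # next time, send only the differences relative to the previous snapshot:
    btrfs send -p /home/.snapshots/home-week1 /home/.snapshots/home-week2 | btrfs receive /mnt/backup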
Also, Btrfs since kernel 5.3 has both read time and write time tree checkers, that verify certain trees for consistency, not just blindly accepting checksums. Various problems are exposed and stopped before they can cause worse problems, and even helps find memory bitflips and btrfs bugs. Btrfs doesn't just complain about hardware related issues, it'll rat itself out if it's to blame for the problem - which at this point isn't happening any more often than ext4 or XFS in very large deployments (millions of instances).
> I'm not sure how you prove that ext4 is less likely to become corrupt. But it is easily shown that it's less likely to inform you that there's corruption.
I wasn't talking only about corruption of the filesystem itself (I don't know if that's more or less likely with BTRFS; someone says that BTRFS is more likely to become corrupt with power failures, I don't know if it's true), but also about hardware failures. In the case of a disk with damaged sectors (I know we should have 3 backups with one offsite, but you always have that one disk with important data on it that you've been promising for a year to back up tomorrow, right until it breaks), I think that a filesystem with a simpler structure will give a higher probability of recovering the data, while with BTRFS, or any filesystem that is CoW and uses compression, volumes, etc., that is more difficult, because files are not stored as plain blocks on the disk but in a more complex structure that needs to be decoded.
Also, BTRFS is a fairly new filesystem, and that has two disadvantages: there aren't all the tools that were developed over the years for ext4, and the BTRFS driver is still evolving. While I can be pretty confident that if I format a hard disk today with ext4, in 20 years I will find a driver in a modern Linux (or whatever OS replaces it) to mount it, can we have the same assurance with BTRFS? I don't know.
So for the purpose of making backups and archiving data, I think that I will stick with ext4 for a while. While on my laptop, and systems that I use, I use BTRFS without any problems.
>someone says that BTRFS is more likely to become corrupt with power failures
No. If the drive honors flush/FUA, Btrfs is less likely to corrupt data or metadata than overwriting file systems, because the interruption won't result in incomplete overwrites. So this would hold true for any copy-on-write vs overwriting file system (and probably also log-based file systems). The trouble is if the drive is transiently lying about flush/FUA success, and then there's an ill-timed crash. There's the chance the super blocks written point to trees that don't exist yet because the write order hasn't been honored due to flush/FUA being ignored. There are backup trees, so it might be possible to work around this defect with the `rescue=usebackuproot` mount option, but sometimes the defect is so bad that you get all kinds of write reordering such that Btrfs only finds trees with the wrong generation, and it fails to mount. Often it's still possible to get your data out with the offline scrape tool, `btrfs restore`. But it's a difficult problem to deal with. In theory it's similar on ZFS, but I know nothing about its on-disk format, so maybe its metadata has some locality, in which case certain assumptions could be made to allow it to better work around such a drive firmware defect? I'm not sure. On a power failure, it is possible Btrfs loses the most recently written data if the writes were in progress and thus not yet fully committed to stable media. How much data really depends on the application doing the writes.
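In practice, the recovery attempts described above look roughly like this (the device path is a placeholder):

    mount -o ro,rescue=usebackuproot /dev/sdX1 /mnt   # try mounting from an older backup root tree
    btrfs restore /dev/sdX1 /recovered                # offline scrape of whatever is still readable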
>In case of a disk with damaged sectors
Btrfs by default keeps two copies of metadata and it automatically deals with this problem, while also self-healing when such problems are encountered.
>a filesystem with a simpler structure will give a higher probability of recovering the data, while with BTRFS, or any filesystem that is CoW and uses compression, volumes, etc., that is more difficult, because files are not stored as plain blocks on the disk but in a more complex structure that needs to be decoded.
The on-disk format is fairly simple and extensible. Metadata isn't subject to compression. In the case of bad sectors with compressed (user) data, you'll certainly lose more data than if it weren't compressed. There's an expected trade-off here; it's not really a Btrfs issue, just the way all compression algorithms work: you get some small corruption and it has a bigger effect.
>So for the purpose of making backups and archiving data, I think that I will stick with ext4 for a while.
I used to hedge my bets by having multiple copies of data on different file systems (including ZFS) but haven't done that in years. I've seen too many cases of (hardware induced) data corruption being replicated into backups and archives without any warning it was happening until it was too late - and only corrupt copies remained.
I didn't know about the swapfile thing... but TIL. I had been wondering how to make a non-snapshotted volume for some other reasons, though, so that's a two-birds-with-one-stone thing, thank you!
I have not used BTRFS for years, but I remember at some point a BTRFS regression prevented me from booting my system. It is hard to regain trust after such a meltdown from such a fundamental component. That said, I believe my Synology NAS uses btrfs and it has never had an issue.
I've been in the same boat. Around 2012 or 2013 I put BTRFS on my DIY NAS/media server. For some reason, totally unprovoked the array just went belly up with no warning or logs. I tried and tried without success and couldn't recover it. Fortunately I had good, recent backups and restored to ext4+LVM and I'm still there 10 years later.
BTRFS sounded cool with all its new features, but the reality is that ext4+LVM does absolutely everything I need and has never given me any issues.
I'm sure BTRFS is much more robust these days, but I'm still gun shy!
In 2019 I was setting up my new computer at my new job, and the Ubuntu installer had btrfs as an option. Figuring that it had been ages since the last time I'd heard about btrfs issues, I opted for that.
A week later, the power failed in the office, and my filesystem got corrupted. In the middle of repairing, the power dipped again, and my entire fs was unrecoverable after that. I managed to get the data off by booting off a separate drive and running a command to extract everything, but it would never mount no matter what I did.
I've never had an issue with ext4, xfs, or zfs no matter how much I've abused them over the past 10+ years, but if losing power twice can wipe out my filesystem then no thanks, I'm out.
Are you sure you never had an issue, or did you just not notice? Ext4 and xfs can mangle your files and never let you know because they don't checksum. This comes up often in comparisons and I wish people paid attention to the difference.
Yeah... I used to live in an old neighborhood with above-ground lines and huuuge trees. Every strong gust of wind would knock the power out for a second. Some days a few times a day, some days never.
EXT4 has never once failed me, and I personally battle tested it by working while losing power probably a total of 200 times. I probably should have bought a UPS come to think of it.
Not the OP, but I deployed btrfs on tens of embedded systems that had their power consistently cut on a daily basis. Ended up with unmountable filesystems repeatedly. Switched to ext4 and never had an unmountable filesystem again.
Around the timeframe you mentioned, I lost a BTRFS filesystem when I filled it up. Probably could have recovered it if I had known more, but oh well. I definitely feel the gun-shyness!
However, I'd add that at a previous job I had an ext4 fs go belly-up in a similar way. One day it just died without warning. Maybe we could have recovered it, but like others have mentioned, we'd have no guarantees about the data.
Moral of the story is, of course, always have backups :)
I sort of had the same experience: dropped it for a decade, and came back around. It's a lot more robust these days. Also the btrfsmaintenance tools take a lot of the load off of an admin. I just use the default settings and don't have any issues.
It certainly used to be the case that BTRFS had some nasty behaviour patterns when it started to run low on space. It could well be that it has not presented any problem for you yet.
On the other hand, those days might be behind it. I haven't kept track.
There are still edge cases but I can count on one hand the number of users who have run into it since Fedora switched to Btrfs by default (on desktops) two years ago (and Cloud edition in the most recent Fedora release late last year).
The upstream maintainer of btrfs-progs also maintains btrfs maintenance https://github.com/kdave/btrfsmaintenance which is a set of scripts for doing various tasks, one of which is a periodic filtered balance focusing on data block groups. I don't run it myself, because well (a) I haven't run into such a bug in years myself (b) I want to run into such a bug so I can report it and get it fixed. The maintenance script basically papers over any remaining issues, for those folks who are more interested in avoiding issues than bug reporting (a completely reasonable position).
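For context, the filtered balance in that script boils down to something like this (the threshold and mount point are illustrative, not the script's exact defaults):

    btrfs balance start -dusage=50 /srv   # rewrite only data block groups that are at most 50% full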
There's been a metric ton of work on ENOSPC bugs over the years, but a large pile were set free with the ticketed ENOSPC system quite a few years ago now (circa 2015? 2016?)
bcachefs is probably the other big name here, but since distros still seem to pick btrfs, I don't think it's considered "production ready" yet. The bcachefs website still labels it as "beta" quality.
bcachefs is written by the same person who wrote bcache... And that has a lot of subtle eat-your-data pitfalls.[1] I don't really trust bcachefs not to do the same.
But... any of the other stuff? Checksums, deduplication, compression, etc. Citations, please.
LVM VDO wasn't really a thing until a couple of years ago, and I've never actually heard anyone recommend using it. Definitely not within its "traditional domain".
Perhaps you mean on other operating systems... when this discussion is all about a Linux filesystem and whether other Linux filesystems provided these features, including the comment you replied to?
> But... any of the other stuff? Checksums, deduplication, compression, etc. Citations, please.
Yes. The Linux LVM logical volume manager for example can do those things.
Traditional domain meaning this kind of data management traditionally came under the purview of LVMs. Not that an implementation has had particular features for a length of time, but that you might (also) look there rather than the filesystem for those features.
Deduplication and compression have traditionally been file system features not volume management features:
As far as I know, the only solution for deduplication in LVM is VDO, and that was created after ZFS and isn't technically part of LVM (it's a kernel module that sits between the filesystem and LVM).
The same is true for compression, except that compression has been a file system feature since the days of NTFS on NT 3.51 (released 1995) – I can't recall when UNIX file systems first saw compression, but I'd wager it was before ZFS, which we've already acknowledged predates VDO.
I also think it makes sense that the above should be part of the file system domain, because it's an integral part of how the file system organizes the data (much like how we don't argue that the journal isn't part of the file system domain).
As for checksums, they have existed in all parts of the stack, from the application level (for example rsync), to the volume level (RAID controller cards) to the file system. I wouldn't say that's something that has ever had a "traditional domain".
The things I'd say ZFS and its ilk do that aren't part of their traditional domain are RAIDing (i.e., management of the physical devices) and cache control. Both make a lot of sense being managed by the file system in modern file systems, but I can also sympathize with those who think that's a step too far. In the case of RAIDing, you can still run ZFS on LVM or a hardware controller if you wish – I wouldn't suggest you do so, because it gives you additional complications for zero benefit, but the option is still there if you want it. But with regard to cache, the only way to opt out of ZFS managing its own cache is not to use ZFS.
Data-plane management has indeed traditionally been done with logical volume managers, particularly on Unix, which introduced them around the late '80s. I'm not sure exactly when deduplication became available for Linux, probably around the same time as it was for ZFS on Solaris with VxVM. Various other vendors had their own LVM-based dedup (NetApp comes to mind).
Anyway, it wasn't really a statement about what came first, but rather that just because something may not exist in a filesystem does not mean it does not exist at all, i.e., look at other layers for such functionality.
Are they going to add zfs to the TUI installer for Ubuntu Server? I've tried several times over the last year, and I have yet to find a good method for getting Ubuntu Server running on zfs root.
Step 1 ("Prepare the Install Environment") starts by assuming you have a Live CD, which implies the desktop distro. I've been to that guide several times while searching fruitlessly for a method to install Ubuntu server onto a zfs root. I think you would basically have to spin your own ISO at this point.
Are you using a UPS on the desktop? A recent HN thread highlighted BTRFS issues, especially with regard to data loss on power loss. Also, there's a "write hole issue" on some RAID configurations, RAID 5 or 6 I think.
That said, I'm thinking about leaving a BTRFS partition unmounted and mounting it only to perform backups, taking advantage of the snapshotting features.
This would seem to suggest that ANY raid configuration is unacceptable.
Device replace
--------------
>Device replace and device delete insist on being able to read or reconstruct all data. If any read fails due to an IO error, the delete/replace operation is aborted and the administrator must remove or replace the damaged data before trying again.
Device replace isn't something where "mostly ok" is a good enough status.
I interpreted "being able to read or reconstruct all data" as meaning that there must exist a good valid copy of each chunk of data somewhere in the array, but not necessarily on the device that you're trying to remove or replace. That interpretation matches my experience, which is that btrfs can correctly handle replacing a drive that is dead or dying. You should certainly expect errors if eg. both copies of a chunk in a RAID1 are lost/corrupted.
I was an early adopter, and some bad experiences early on made it a bitter pill. I swore it off for a decade, and about a year and a half ago I came back around. It's MUCH MUCH better now. With automated "sensible" settings for the btrfsmaintenance tools, it's actually just fine now.
I honestly don't see the point of Btrfs. ZFS is more mature, more stable, better defaults, etc. The only reason Btrfs existed was because ZFS wasn't GPL. But now we have a way of running ZFS on Linux Btrfs seems utterly redundant.
I mean, I'm all for choice on Linux, but when it comes to file systems I'd rather have fewer choices and have those choices be absolutely rock solid. Btrfs might have gotten to that stage now, but it's eaten enough people's data (including mine) over the years that I can't help wondering why people bothered to persist with it when a better option was available.
Anecdata: Once upon a time I installed a SuSE with their default choice of ReiserFS (v3) as root partition. Couple of months later that filesystem was dead beyond repair. I don't know whether I did something wrong, but I've been very wary of "defaults" ever since. That said, that was a different time and I tend to see a ZFS or a BTRFS in my near future.
One of the slightly weird things about btrfs is that some software (e.g. OBS) seems to have a hard time getting the free space on disk. Maybe because they assume the usual ways of getting free space work (which they don't on btrfs).
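A quick way to see the discrepancy on a btrfs system (the df output is the generic statfs view most software relies on):

    btrfs filesystem usage /   # btrfs-aware breakdown of allocated vs. used space
    df -h /                    # generic view, which can be misleading on btrfs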
> If the distributions are willing to take that maintenance on their shoulders, I'm willing to trust them and deal with the consequences – at least I know I'm not alone.
But then they make changes that add an insane amount of complexity, and suddenly you're running into random errors and googling all the time to try to find the magical fixes to all the problems you didn't have before.
Although this would be an interesting way to drag some of my old NTFS filesystems kicking & screaming into the 21st century, I'd never do one of these in-place conversions again. I tried to go from ext3 to btrfs several years ago - and it would catastrophically fail after light usage. (We're talking less than a few hours of desktop-class usage. In retrospect I think it was `autodefrag/defragment` that would kill it.) I tried that conversion a few times and it never worked, I think I even tried to go from ext3->ext4->btrfs. This was on an Arch install so (presumably) it was the latest and greatest kernel & userspace available at the time.
I eventually gave up (/ got sick of doing restores) and just copied the data into a fresh btrfs volume. That worked "great" up until I realized (a) I had to turn off CoW for a bunch of things I wanted to snapshot, (b) you can't actually defrag in practice because it unlinks shared extents and (c) btrfs on a multi-drive array has a failure mode that will leave your root filesystem readonly; which is just a footgun that shouldn't exist in a production-facing filesystem. - I should add that these were not particularly huge filesystems: the ext3 conversion fiasco was ~64G, and my servers were like ~200G and ~100G respectively. I also was doing "raid1"/"raid10" style setups, and not exercising the supposedly broken raid5/raid6 code in any way.
I think I probably lost three or four filesystems which were supposed to be "redundant" before I gave up and switched to ZFS. Between (a) & (b) above btrfs just has very few advantages compared to ZFS. Really the only thing going for it was being available in mainline kernel builds. (Which, frankly, I don't consider that to be an advantage the way the GPL zealots on the LKML seem to think it is.)
> ...btrfs just has very few advantages compared to ZFS. Really the only thing going for it was being available in mainline kernel builds.
ZFS doesn't have defrag, and BtrFS does.
There was a paper recently on purposefully introducing fragmentation, and the approach could drastically reduce performance on any filesystem that was tested.
This can be fixed in BtrFS. I don't see how to recover from this on ZFS, apart from a massive resilver.
I'm pretty dependent on the ability to deduplicate files in place without massive overhead. The built-in defrag on BTRFS is unfortunate, but I think you can defragment and then re-deduplicate.
I don't know, I'm just hoping for a filesystem that can get these features right to come along...
In-place conversion of NTFS? You either still believe in a god or need to google the price of harddrives these days. Honest question tho, why would anybody do in-place conversion of partitions?
>You either still believe in a god or need to google the price of harddrives these days.
That was pretty funny, and I agree a thousand times over. When I was younger (read: had shallower pockets) I was willing to spend time on these hacks to avoid the need for intermediate storage. Now that I'm wiser, crankier, and surrounded by cheap storage: I would rather just have Bezos send me a pair of drives in <24h to chuck in a sled. They can populate while I'm occupied and/or sleeping.
My time spent troubleshooting this crap when it inevitably explodes is just not worth the price of a couple of drives; and if I still manage to cock everything up at least the latter approach leaves me with one or more backup copies. If everything goes according to plan well hey the usable storage on my NAS just went up ;-). I feel bad for the people that will inevitably run this command on the only copy of their data. (Though I would hope the userland tool ships w/ plenty of warnings to the contrary.)
Maybe if your NTFS drive is less than half full, at least I assume this is a limitation of this project since it mentions keeping an original copy... Still, belief in god seems about right, or you have good backups. I had about 2.5 TB on a 3 TB NTFS drive I decided to move over to ZFS, just rsynced to a few various drives since I didn't have that much contiguous space elsewhere (building a NAS later...), learned I had a file whose name is too long for either ZFS or ext4 and had to rename it, and after making a zpool out of the drive I just rsynced everything back. Doing it in place would have saved hours.. but only hours, on something not high urgency that doesn't require babysitting.
The backup is a reflink copy as per readme - that means data blocks are shared with live filesystem and don't occupy extra space but there's probably quite a bit of metadata.
Just because something is cheap doesn't mean I'm fine with buying it for a one-shot use.
Buying an extra disk for just the conversion is wasteful, and then you need space to keep it stashed forever when you never use it. Not at all sustainable, I'd rather leave the hardware on the market for people who _actually_ need it.
So you buy an external 1TB drive just for the sake of the conversion, then create a new partition, then copy your 1TB of data over, then... what? Wipe your PC, boot into a live CD, then copy the partition over? Do you find this easier/more worthwhile than an in-place conversion? How/why?
From the same person that made WinBtrfs and Quibble, a Windows NT Btrfs installable filesystem and bootloader. And yes, with all of that one can boot and run Windows natively on Btrfs, at least in theory.
That's in common with the conversion from ext[234] and reiserfs, too. It makes it easy both to undo the conversion and to inspect the original image in case the btrfs metadata goes wrong somehow.
In a former life I ran a web site with a co-founder. We needed to upgrade our main system (we only had 2), and had mirrored RAID1 hard drives, some backup but not great. We tested the new system, it appeared to work fine, so the plan was to take it to the colo, rsync the old system to the new one, make sure everything ran okay, then bring the old system home.
We did the rsync, started the new system, it seemed to be working okay, but then we started seeing some weird errors. After some investigation, it looked like the rsync didn't work right. We were tired, it was getting late, so we decided to put one of the original mirrors in the new system since we knew it worked.
Started up the new system with the old mirror; it ran for a while, then started acting weird too. At that point we only had 1 mirror left, we were beat, and we decided to pack the old and new systems up, bring it all back to the office (my co-founder's house!), and figure out what was going on. We couldn't afford to lose the last mirror.
After making another mirror in the old system, we started testing the new system. It seemed to work fine with 1 disk in either bay (it had 2). But when we put them in together and started doing I/O from A to B, it corrupted drive A. We weren't even writing to drive A!
For the next test, I put both drives on 1 IDE controller instead of each on its own controller. (Motherboards had 2 IDE controllers, each supported 2 drives). That worked fine.
It turns out there was a defect on the MB and if both IDE ports were active, it got confused and sent data to the wrong drive. We needed the CPU upgrade so ended up running both drives on 1 IDE port and it worked fine until we replaced it a year later.
But we learned a valuable lesson: never ever use your production data when doing any kind of upgrade. Make copies, trash them, but don't use the originals. I think that lesson applies to the idea of doing an inplace conversion from NTFS to Btrfs, even if it says it keeps a backup. Do yourself a favor and copy the whole drive first, then mess around with the copy.
I used btrfs on an EC2 instance with two local SSDs that were mirrored, for a CI pipeline running Concourse. It would botch up every few months, and I got around to automating the setup so that it was easy to recreate. I never did find the actual source of the botch-ups, though. It was either the local PostgreSQL instance running on btrfs, btrfs itself, or the Concourse software. I pretty much ruled out PostgreSQL as the originating source of the issue, but didn't get further than that. I don't know if anyone would suspect mdadm.
Other than whatever that instability was, I can say that the performance was exceptional and would use that setup again, with more investigation into causes of the instability.
What I really want: ext4 performance with instant snapshots plus optional transparent compression when it can improve performance. There is only one promise to deliver this AFAIK: bcachefs, but it still isn't mature yet.
You can actually use a sparse zvol pretty decently for this too. You don't get the file level checksumming or some of the other features but you can still snapshot the volume and get pretty good performance out of it. I've got a few programs that don't get along too well with zfs that I use that way.
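A rough sketch of that setup, with made-up pool and volume names:

    zfs create -s -V 100G tank/vols/scratch      # -s makes the zvol sparse (thin-provisioned)
    mkfs.ext4 /dev/zvol/tank/vols/scratch        # put a conventional filesystem on top
    zfs snapshot tank/vols/scratch@pre-upgrade   # instant whole-volume snapshot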
Personal anecdote: I've been using BTRFS on my laptop running Manjaro for the past year with no issues. Originally I had it running in an encrypted LUKS partition on a single Samsung NVMe, but for the past month I've been running two NVMe drives in RAID 0 with a LUKS volume on top of that and BTRFS inside of that. In both cases I've had no performance issues, no reliability issues or data loss (even when having to force shutdown the laptop due to unrelated freezes), and have been able to save and restore from snapshots with zero issues.
butter-fs[1] would not be the destination FS I would have chosen but such an effort deserves kudos.
[1] ...given how broken it seemingly is, see features that are half baked like raid-5. But I am a ZFS snob so don't mind me, my fs of choice has its own issues.
BTRFS has been stable for years now as long as you don't use unsupported features like the aforementioned RAID5. A properly set up btrfs system is fine for production use, though note the "properly set up" bit, as a good number of distros still don't set it up right. I suspect the latter bit is why people continue to have issues with it (which is definitely a big downside compared to something like ZFS's "no admin intervention required" policy).
Regardless, in-place conversion is specifically a feature of btrfs due to how it's designed. Since it doesn't require a lot of fixed metadata, you can convert a fs in place by putting the btrfs metadata into unallocated space and pointing it at the same blocks as the original fs. I think it even keeps a copy of the original fs's metadata too, so you can mount the filesystem as either the original or btrfs for a while.
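For the ext4 case that btrfs-progs handles natively, the flow is roughly as follows (the device is a placeholder; the NTFS project discussed here is a separate tool):

    btrfs-convert /dev/sdX1                  # convert the unmounted ext4 filesystem in place
    btrfs-convert -r /dev/sdX1               # ...or roll back to ext4, while the saved image still exists
    mount /dev/sdX1 /mnt
    btrfs subvolume delete /mnt/ext2_saved   # once satisfied, drop the saved image to reclaim space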
I've just lost about a week of my life fighting with a bog-simple device replace on a RAID1 btrfs filesystem. I had the misfortune of receiving a replacement disk which was also bad, and somehow the process of attempting to rebuild the array hosed the filesystem on the good disk. I asked for help on the IRC channel and the only advice I got was "upgrade to the latest btrfs-tools" (because apparently the "stable" version shipped with Debian stable isn't really stable when a problem arises?).
I was also misled by btrfs-check, which happily scanned my unmountable filesystem with no complaints, and learned on IRC that btrfs-check is "an odd duck," in that it doesn't actually check all the things the documentation says it does.
This experience, and the fact that simple things like the "can only mount a degraded filesystem rw one time" bug remain after years of complaints, simply because the devs adopt an arrogant "you're doing it wrong if you don't fix your filesystem the first time you mount it" (despite the tools giving you no indication that's required) attitude, have convinced me to never touch btrfs again.
So keep parroting "it's stable!" all you want, my experience has shown btrfs is "stable" until you have a problem.
> (because apparently the "stable" version shipped with debian stable isn't really stable when a problem arises?).
"Stable" here means "unchanging". It doesn't mean bug-free. My personal experience is, you'll generally encounter less bugs on a rolling release distro (like arch) or distros with frequent updates (like fedora). The upside of stable distros (like debian) is that new bugs (or other breaking changes) won't be introduced during the distro's release lifetime.
Debian's actually one of the distros I thought of when I said "properly set up". Their tools packages are very out of date, they don't install the proper maintenance setup by default, and the installer doesn't support subvolumes. Going through the man page, yeah I see it does mention using btrfs-check for various parts when generally that is not recommended (see the Arch Wiki[0] or OpenSUSE docs[1] to see how they warn against it).
> So keep parroting "it's stable!" all you want, my experience has shown btrfs is "stable" until you have a problem.
I've been running it on multiple production machines for years now, as well as my home machine. Facebook has been using it in production for I think over a decade now, and it's used by Google and Synology on some of their products.
I'm not saying it doesn't have problems (I've certainly faced a few), but it is tiresome reading the same cracks against it because they set it up without reading the docs. You never see the same thing against someone running ZFS without regular scrubs or in RAIDZ1.
It just seems weird to me to still be seeing RTFM more than a decade after I last received an actual FM for me to R.
A well-designed, general-audience technology product doesn't require one to be initiated into the mysteries before using it. The phone in my pocket is literally a million times more complicated than my first computer and I haven't read a bit of documentation. It works fine.
If btrfs wants to be something people use, its promoters need to stop blaming users for bad outcomes and start making it so that the default setup is what people need to get results at least as good as the competition. I have never read an extfs manual, but when I've had problems nobody has ever blamed bad outcomes on me not reading an extfs manual.
Particularly as in this case I did read the fucking manual. I read every relevant page of the btrfs-wiki and the man-pages before building this filesystem. What I've found is there are still relevant implementation details documented only in the dev's head or the endless mailing list archives.
Btrfs has not been, is not currently, and is unlikely in the future to become a file system usable by the general population, which is a shame, as 10 years ago it looked like a promising move forward.
Its window was when setting up ZFS involved lots of hand-waving. That window has now closed. ZFS is stable, does not eat data, does not have a cult of wizards spelling "RTFM", and can be installed on major distributions using an easy-to-follow procedure. In a year or two I expect that procedure to be fully automated, to the point where one could do root on ZFS.
I haven't tried this yet but supposedly the Ubuntu installer can setup ZFS on root for a very basic[1] install. (i.e: No redundancy, and no encryption. The former one could trivially add after the fact by attaching a mirror & doing a scrub. The latter you could also do post-install w/ some zfs send+recv shenanigans, and maybe some initramfs changes.)
I do use the Ubuntu live image pretty regularly when I need to import zpools in a preboot environment and it works great. In general it's not my favorite distro - but I'm happy to see they're doing some of the leg work to bring ZFS to a wider audience.
> In a year or two I expect that procedure to be fully automated, to a point where one could do a root on ZFS.
Ubuntu has been able to install directly to root-on-ZFS automatically since 20.04. I don't think any other major distros are as aggressive about supporting ZFS due to the licensing problem, but the software is already there.
> You never see the same thing against someone running ZFS without regular scrubs or in RAIDZ1.
ZFS doesn't have these kinds of hidden gotchas, and that's the key difference. Yeah ok somebody's being dumb if they never scrub and find out they have uncorrectable bad data come from two drives on a raidz1. That's exactly the advertised limitation of raidz1: it can survive a single complete drive failure, and can't repair data that has been corrupted on two (or more) drives at once.
If you are in the scenario, as the GP was, that you have a two-disk mirror and regular scrubs have assured that one of the disks has only good data, and the other dies, ZFS won't corrupt the data on its own. If you try replacing the bad drive with another bad drive, eventually the bad drive will fail or produce so many errors that ZFS stops trying to use it, and you'll know. The pool will continue on with the good drive and tell you about it. Then you buy another replacement and hope that one is good. No surprises.
>ZFS doesn't have these kinds of hidden gotchas, and that's the key difference. Yeah ok somebody's being dumb if they never scrub and find out they have uncorrectable bad data come from two drives on a raidz1. That's exactly the advertised limitation of raidz1: it can survive a single complete drive failure, and can't repair data that has been corrupted on two (or more) drives at once.
Why is ZFS requiring scrubs and understanding the limitations of its RAID implementations okay, but btrfs requiring scrubs and understanding the limitations of its RAID implementations "hidden gotchas"?
> If you are in the scenario, as the GP was, that you have a two-disk mirror and regular scrubs have assured that one of the disks has only good data, and the other dies, ZFS won't corrupt the data on its own.
Honestly, I don't know enough about GP's situation to really comment on what happened there. It could have been btrfs, or perhaps they were using hardware RAID and the controller screwed up. ZFS is definitely very good in that regard, and I want to be clear that I'm not saying ZFS is bad or that btrfs is better than it; I've been using ZFS much longer than I have btrfs, back before ZoL was a thing.
> Why is ZFS requiring scrubs and understanding the limitations of it's RAID implementations okay, but btrfs requiring scrubs and understanding the limitations of its RAID implementations "hidden gotchas"?
It read more like btrfs corrupting data despite good scrubbing practice; hosing the file system on the good drive instead of letting it remain good, for instance. If that's a misreading, that is where my position came from.
Regular scrubs and understanding the limitations of redundancy models is good on both systems, yes.
My own anecdotal evidence though: btrfs really does snag itself into surprising and disastrous situations at alarming frequency. Between being unable to reshape a pool (eg, removing a disk when plenty of free space exists) and not being safe with unclean shutdowns, it's hard to ever trust it. It even went a few good years where it seemed to be abandoned, but I guess since 2018 or so it's been picked up again.
Ah, I understand what you're saying now. Yeah, that's fair assuming it was btrfs's fault for the data loss.
>Between being unable to reshape a pool (eg, removing a disk when plenty of free space exists) and not being safe with unclean shutdowns, it's hard to ever trust it. It even went a few good years where it seemed to be abandoned, but I guess since 2018 or so it's been picked up again.
FYI, btrfs does support reshaping a pool with the btrfs device commands.
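Roughly, assuming a filesystem mounted at /mnt/pool (paths are illustrative):

    btrfs device add /dev/sdc /mnt/pool                              # grow the pool online
    btrfs device remove /dev/sdb /mnt/pool                           # shrink it; data migrates off first
    btrfs balance start -dconvert=raid1 -mconvert=raid1 /mnt/pool    # change redundancy profiles in place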
for reference: this is referring to a systemd service script (systemd-udev-trigger/systemd-udev-settle) with a race condition where the pool may not be mounted by the time systemd tries to use it.
that's (a) not really a bug in ZFS, and (b) "fails to boot sometimes" is pretty different from btrfs shitting the bed and corrupting its pool. There was one of those recently with ZFS iirc (and specifically only ZFS-on-Linux) but they are fairly rare and notable when they occur!
(As a general statement, ZoL is less mature than ZFS-on-FreeBSD and likely (perhaps) to continue to be so given the licensing issues. I've also run into some problems where I can't send a dataset from FreeBSD to a ZoL pool (but rsync works fine). But again, generally bugs that actually lead to data loss are exceedingly rare.)
It just kind of sucks, when it happens, you are thrown into recovery console at boot and there is no solution.
From what I saw, it happened if the disk initialization took longer and ZoL looked for its pools before all disks were found. It hints at improper dependencies in ZoL startup.
btrfs is the default on two distros (that I know of): openSUSE and Fedora. If you're using one of those two distros, don't read the documentation, and your data then gets eaten, that's fair and you can rightfully be upset. I would be too.
But if you're using it in some other setup, then that means you went out of your way to try a more complicated filesystem. I would think it's reasonable to do at least a quick scan of the btrfs wiki or your distro's documentation before continuing with that, the same way I'd expect someone would do the same for ZFS.
> I would think it's reasonable to do at least a quick scan of the btrfs wiki or your distro's documentation before continuing with that, the same way I'd expect someone would do the same for ZFS.
I would argue that the defaults should not be dangerous. If a filesystem is released in to the world as stable and ready for use, it's not absurd to expect that running mkfs.<whatever> /dev/<disk> will get you something that's not going to eat your data. It might not be optimized, but it shouldn't be dangerous.
Dangerous defaults are loaded footguns and should be treated as bugs.
If there is no safe option, there should not be a default and the user should be required to explicitly make a choice. At that point you can blame them for not reading the documentation.
What dangerous defaults do you think may exist with btrfs? A simple mkfs.btrfs won't set you up for later trouble unless you go out of your way to give it multiple devices and ask for one of the parity RAID modes.
There are some arguable footguns in the btrfs-tools workflows for repairing a damaged filesystem, but that's exactly the fragile situation where asking the user to RTFM before making more changes is perfectly reasonable.
> What dangerous defaults do you think may exist with btrfs?
The ones being discussed in the parent posts in this thread. I am not myself particularly familiar with btrfs internals, having only used it once, but I have heard of there being issues along those lines.
> There are some arguable footguns in the btrfs-tools workflows for repairing a damaged filesystem, but that's exactly the fragile situation where asking the user to RTFM before making more changes is perfectly reasonable.
My point is that in the cases where RTFM is a requirement there should not be a default behavior. Doing the dangerous thing should always require an explicit request and not be something that one can autocomplete their way to. If there is a "doing X is only safe when you also fill in Y parameter and put Z in mode W" then it either shouldn't let me do X without those other things at all or should require a "--yes-really-do-the-stupid-thing" type flag.
The thing mentioned upthread where it's possible to mount a damaged filesystem RW, but only once. If that's true, then attempting to do so through a normal command someone who knows Unixy systems might just perform without thinking should scream bloody murder to make sure the user knows what's going on, and then should require some explicit confirmation of intent to move forward with the dangerous and/or irreversible operation.
You're being too vague. It's silly to opine on what btrfs should do without checking whether that's already the case. You should actually point out any specific default behaviors that are questionable rather than simply speculating ones that may or may not exist. The only specific thing you've mentioned so far is one that you could trivially have found out requires a non-standard mount option.
To be clear, I'm not saying distros are setting it up "wrong" just not..correctly? As in, if I setup a filesystem I would assume most distros would set it up with best practices in mind.
What I'm trying to convey is that most of the time when I help someone out with btrfs issues, it turns out they haven't been running regular scrubs or balances, and their distro didn't set anything up, so they didn't know. Or they don't understand how much disk space they have because their distro's docs still have them running du instead of "btrfs filesystem usage".
For specific distros, Arch and Arch derivatives had autodefrag as a default mount option, which caused some thrashing a while back, and IIRC Debian still doesn't support subvolumes properly, as well as not installing btrfsmaintenance.
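The scrubs in question are a one-liner that btrfsmaintenance would otherwise schedule for you (the mount point is illustrative):

    btrfs scrub start -B /    # -B runs in the foreground; verifies checksums, repairs from good copies
    btrfs scrub status /      # progress/results if it was started in the background instead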
As far as I can tell, that is true. But given the number of years it spent broken, the amount of time I spent recovering broken arrays, and the (unknown but >0) number of my files that it ate, you'll forgive me for not being enthusiastic now.