Btrfs RAID 5/6 Code Found to Be Very Unsafe and Will Likely Require a Rewrite (phoronix.com)
87 points by pantalaimon on Aug 5, 2016 | 89 comments



Bryan Cantrill expresses this[1] in a manner I certainly agree with. Software that is designed/intended to be reliable should not go through long periods of instability that get written off as "prepubescence".

[1] http://www.youtube.com/watch?v=79fvDDPaIoY&t=18m28s


Bryan Cantrill is such a great storyteller. "My therapist says I need to talk about the firmware revision numbers, just to let it out so it doesn't corrode me." He then shares a story about a firmware bug where the polarity of the disk read/write head would be reversed. :)


Completely agree.

I often search for videos of him and just listen to him present a topic or talk about anything. I learned about things I normally wouldn't go and read about otherwise -- the history of Sun, ZFS, DTrace, containerization (Joyent, the zones legacy, KVM, comparison with Docker).

In that respect he is like David Beazley (of Python fame). I could just listen to David talk all day and not get tired of it.


Ooh, I like that analogy. Pure gold:

(Paraphrasing):

---

ZFS never went through a period where people lost data using it. ... amazing how people are willing to write that off like 'Oh well, it kind of lost my data but Omg! A million eyeballs'.

This is like a 12-year-old who is stealing cars -- this is a serious issue. It's not like "Well... he's only 12"; it's more like "He stole a car at gunpoint, having just robbed a liquor store". You can't say let's just wait till he's 18. Would you want to know what he's like when he's 18?

---


Technically, ZFS never went through a period where people lost data using it. It just kernel panicked when they tried to access their data, and besides, it was probably their fault for using hardware that was slightly flaky. (Early ZFS adopters had a really interesting time of it, judging from the complaints people had back then.)


I'd guess that losing data (overwriting it, dropping it on the floor) is generally worse than crashing the system.



About the first video talking about epoll design flaws: doesn't EPOLLONESHOT kind of solve this? Unless I'm missing something...


Perhaps it does. But the point was that the Linux world never seems to get things right. They implement horrendous solutions that are not engineered or even well thought out and wind up reimplementing said solution over and over and over again to solve the problem instead of learning anything from others who have solved these problems already. They refuse to look outside their echo chamber and double down on NIH and just implement piss poor solutions to problems. And it's so frustrating for the rest of us to watch.


> Perhaps it does. But the point was that the Linux world never seems to get things right.

Well, in this particular case it seems this feature was added in Linux 2.6.2 (2004). epoll itself was added in Linux 2.5.44 (2002), so yep, you're correct that they didn't get this right immediately. Now, I think it is rather easy to play this game of pinpointing flaws this way, and it could be done the other way around, probably in an equally unconstructive manner (e.g. how long did FreeBSD wait for robust mutexes?).

That said, amazing speaker, thanks for the links.


Waiting to implement a feature while you actually engineer it well and robustly is not the same as throwing poop at a wall and seeing what sticks, then scraping the poop off, forming it into another shape and throwing it at the wall again.


Seems like you are no fan of Linux's robust mutex implementation. Sincerely, you probably know more about the implementation than I do, and I'd be curious to hear your take on it, as I have not looked at how (and why so late) it was implemented in FreeBSD.



Edge-triggered is the correct way to use epoll, and it has been there since the beginning.

The reason epoll has a stupid mode is that it was also designed to be a drop-in replacement for select.
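
For what it's worth, here's a minimal sketch of how the edge-triggered and one-shot flags are meant to be used, using Python's stdlib select.epoll wrapper and a made-up little echo server (the port and buffer sizes are arbitrary):

    # Minimal illustrative sketch: an edge-triggered epoll echo server.
    # With EPOLLET you must drain the socket until EAGAIN, because you will
    # not be re-notified for data that was already readable when you slept.
    import select, socket

    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", 8080))   # arbitrary port for the example
    srv.listen(16)
    srv.setblocking(False)

    ep = select.epoll()
    ep.register(srv.fileno(), select.EPOLLIN)   # listener: level-triggered is fine
    conns = {}

    while True:
        for fd, ev in ep.poll():
            if fd == srv.fileno():
                c, _ = srv.accept()
                c.setblocking(False)
                conns[c.fileno()] = c
                # EPOLLET = edge-triggered; adding EPOLLONESHOT would also
                # disarm the fd after one event until re-armed with ep.modify().
                ep.register(c.fileno(), select.EPOLLIN | select.EPOLLET)
            else:
                c = conns[fd]
                while True:                      # drain until EAGAIN
                    try:
                        data = c.recv(4096)
                    except BlockingIOError:
                        break
                    if not data:                 # peer closed the connection
                        ep.unregister(fd)
                        c.close()
                        del conns[fd]
                        break
                    c.sendall(data)              # (real code would handle EPOLLOUT too)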


It is amazing how stubborn Linux people are. They constantly refuse and reject superior engineering that was done outside of their realm. (yeah I was pointing at ZFS, but that would be just one of many)


This is Linux's legacy: rejecting already-solved problems only to try to reinvent the wheel their way, and often their wheel is broken and horribly engineered, if engineered at all. See Btrfs vs. ZFS, kqueue vs. their current poll-of-the-month method, DTrace vs. SystemTap; the list goes on a mile long.

It's so frustrating to watch. To see all these projects being developed on Linux, where it's easier for others to simply reimplement the idea from scratch than to try and remove all the Linux-only code from them.

The Linux world only thinks about itself. In a spectacularly selfish way.


I spent a couple of decades being a die-hard Linux fanatic and a contributor to many parts. The amount of work being done is incredible. But I'd have to say that I've become increasingly annoyed over time by the vast duplication of effort, pointless wheel reinvention and pointless scrapping of perfectly good code just because someone wanted to rewrite it rather than carefully refactor it. Looking at the long-term effects over the last 20 years, a lot of that work was utterly pointless and detrimental; my patches to fix the GNOME canvas sat ignored in the bug tracker for over a decade before the bug was closed--meanwhile they wrote over six replacements in parallel, all of which are now dead and none of which improved the canonical (and now also dead) implementation! All the power management and hardware abstraction stuff has gone through multiple rewrites from scratch, with little thought to backward compatibility.

This is not to denigrate the effort people put in, but effort needs to be directed effectively or else it is ultimately wasted. The success of Linux has led to a disturbing amount of NIH attitude--we're so big we don't need to care about compatibility and interoperability--foolishly forgetting that these things are what caused the original adoption of Linux; being compatible with everything under the sun was its big selling point.

In the case of Btrfs it's a laudable effort, but it has yet to approach production readiness despite an awfully long development time. Even if the licences are incompatible, the on-disk structures of ZFS could have been used to create a compatible implementation.

So I'm afraid I'm increasingly using other systems such as FreeBSD (with ZFS) and leaving Linux behind somewhat. Ultimately open systems, open standards and free software and software freedom are more important than a single implementation.


Well, I think the problem is somewhat more subtle, but with an end effect that's the same. That is, it's useful to have people trying new things, and even throwing shit at the wall and seeing what sticks; interesting new ideas come out of that. The problem is that you don't want that to necessarily be the de facto strategy for your accepted solutions to problems, and definitely not how a major player in your ecosystem routinely progresses.

If Linux were a research experiment with exposure but little use compared to the other Unix systems, we would probably all look at it much more favorably, not because they are doing things right, but because they are willing to try crazy shit that we would all get to see the post-mortems for. The problem is that Linux is not some semi-obscure OS, so instead of these experiments being interesting reading for the weekend, we're often forced to become users of them, whether we like it or not.


Linux's 'legacy' is a policy of accepting code before it is known to be perfect. I don't know anyone who runs BTRFS and I certainly would not do so myself. Getting accepted as part of the kernel code is not an imprimatur of some sort.

I'd much rather this than the *BSD policies.

Oh, and talking about FreeBSD, it's hardly bug-free. Look at these Golang issues and tell me they don't indicate something wrong in the FreeBSD kernel's threads/fork/exec code.

1. https://golang.org/issue/16136

2. https://golang.org/issue/15658

3. https://golang.org/issue/16396


I'm not sure it indicates anything. They are all incredibly vague and no one seems to have any idea what's going on.


... and even if I did use BTRFS, I wouldn't use raid5/6.

I'd never use raid5/6 on anything. Hardware is cheap.


This is why you need to update the phrase "GNU's not UNIX" to "Linux is not UNIX." I use systemd and it works fine but it's a great example of a complete departure from the idea of small single-task tools.


>..."Linux is not UNIX." I use systemd and it works fine but it's a great example of a complete departure from the idea of small single-task tools.

And how does this approach compare to other UNIXes? And I don't mean UNIXes from the 1970s, I mean ones which are currently in use: FreeBSD, OpenBSD, Solaris, Mac OS X. Last I heard, they all had init systems roughly equivalent to systemd.


Solaris and MacOS perhaps, if we are talking strictly about init, but both of the BSDs use BSD init. There is a push by Jordan Hubbard to get FreeBSD to adopt launchd (the MacOS init), but so far he has been rebuffed and has instead set out to fork FreeBSD into NextBSD.

Frankly this is what should have happened with the likes of Debian regarding systemd as well. A parallel release offering systemd as the default, alongside sysv. Then it would be up to the admins to pick, rather than have to apply contortions during or right after install to switch from systemd to sysv.

Instead Debian adopted systemd, and we got Devuan as a sysv fork...


Well I can't speak to other systemd competitors but BSD and System V's scripts never had the concept of bundling in firewall configuration, cgroup config, mount points, etc into PID 1.


ZFS is more stable, but the architecture it forces you into is IMO far from ideal in many cases.

I've played with BTRFS (in RAID-1 mode) instead of ZFS because of the latter's limitations about adding drives. With BTRFS, you can drop in a new drive at any time and rebalance. This is how things ought to work. ZFS's concept of VDEVs might make sense for huge corporate storage arrays, but they're just a pain if you're running a small server or NAS, where growing the array drive-by-drive is a very common scenario.

Due to the complexity of dealing with VDEVs, I've personally never seen much use for ZFS on Linux vs. just biting the bullet and dealing with LVM over mdadm. BTRFS is the closest thing to a compelling argument to moving away from that setup I've seen on Linux.


That and being able to trivially resize/reshape BTRFS storage volumes after adding/removing devices makes it more suitable for my personal home use.


The lack of ZFS is due to licensing.


Ubuntu has it


In violation of the license; probably.


AFAIK it's DKMS'ed just like Nvidia's kernel driver, so it's a user decision, just like media codec installation. Also, OpenZFS doesn't reuse kernel interfaces it's not allowed to, and that's why they carry their own crypto code now.


It's actually not DKMS in 16.04 - crypto problem still applies though.


Not necessarily. It's a run-time loaded module (as e.g. NVIDIA's graphics driver is) and 'taints' the kernel with CDDL, but since you -- the user -- aren't redistributing anything you are not breaking any copyright laws.

EDIT: ... and neither are Ubuntu since what they're doing is "mere aggregation" (see the GPL).


If it's a run-time loaded module with an incompatible license, that means it isn't really something you want as your primary filesystem. For a storage volume, sure, but not your root filesystem.

Honestly, this whole debacle could have been avoided if ZFS had been relicensed under some kind of compatible license like one of the BSD licenses.


There's no difference between the two types of filesystem; that's an arbitrary judgement.

Both the GPL and CDDL are distribution licenses; using a ZFS module with Linux isn't a problem, you aren't violating any of the licences by using them together on your computer system. There's plenty of precedent for non-GPL kernel modules. It's not a problem in practice.


I agree completely[1]. I'm not sure Ubuntu Server actually even offers ZFS during the install on any volume, never mind on the root FS.

EDIT: [1] Well, now that I think about it... I don't agree that the skepticism should be based on licensing. I just think that ZFS-on-Linux is perhaps not mature enough quite yet and we do have off-site backups anyway, so we might as well try it if we're not doing anything absolutely 24/365-critical.


It's not supported by the installer (for 16.04), but you can do it by hand. It's poorly documented--there are several guides with contradictory advice--and it took me several attempts, but all the bits are there from GRUB to the initramfs and the main init. I got my workstation EFI booting with GRUB direct to root on ZFS a few weeks back; I can now "zfs send" whole system snapshots directly to my NAS.

Once this is a single option in the installer like it is for LVM, it will become properly usable. Right now, it's for masochists who want to experiment only (for the rootfs).

Regarding maturity, it's certainly not as well tested as on e.g. FreeBSD or Solaris, but it's been around for a good while at this point and I know a few people who have been running it for several years trouble free. I would certainly place it above Btrfs in the reliability stakes. The main limitations I've seen are cosmetic or Linux-specific; the zfs command is setuid root, which prevents delegating dataset admin to users (snapshot/send/recv your data, create your own subsidiary datasets etc.), Linux doesn't do NFSv4 ACLs in its VFS, and you can't transparently create NFS exports via zfs properties. None of these affect data integrity though. Not holding my breath waiting for NFSv4 ACLs, though it would certainly be a massively beneficial feature.


> Honestly, this whole debacle could have been avoided if ZFS had been relicensed under some kind of compatible license like one of the BSD licenses.

Or if Linux had been similarly relicensed. I think that's actually more reasonable, since it's the Linux license that may take issue with CDDL code, not the reverse.

Besides that, ZFS is covered by several patents, and the CDDL grants a patent license. If ZFS had been BSD-licensed, you'd have a much worse debacle.


It's impossible to relicense Linux. Linus has talked about this in the past: he'd have to track down literally thousands of contributors to get their permission. Many of them are probably impossible to track down, some are surely dead. There's no feasible way to relicense a project like Linux.

ZFS, OTOH, is owned by a single entity, so relicensing it would be simple. It should be completely possible to make a new license that grants a patent license but is also GPL compatible.


Like Ubuntu or not, they went on record saying that they had consulted with legal counsel who was heavily involved with open source and that they signed off.

I'll take their word for it in good faith.


How is the license being violated?


I think HAMMER might be a good candidate for use with Linux. It's not as mature as ZFS and still lacks features that I actually use ZFS for (namely the self-healing and data integrity features), but HAMMER2 is well on its way and should soon be a viable alternative (read: not replacement) for many use cases where ZFS is of benefit.


The problem with HAMMER is that it reminds me of Stark Industries' inept competition in Iron Man 2.


To be fair, if there weren't licensing issues around ZFS, I'm pretty sure it would have been adopted without much resistance. As it is, there are STILL concerns about the licensing and Canonical went ahead and added it anyway! I don't see how you can classify that as Not Invented Here syndrome.


> I was pointing at ZFS, but that would be just one of many

Taking close to two decades to make XFS an equal citizen would be another example.


The thing that needs 8/16 GB of RAM to run? Creating a solid FS takes time; ZFS is good because it has been around for 10 years.


ZFS runs fine with just 1 GB of RAM, even with several terabytes of storage. When it comes to knowledge about ZFS, avoid just about anything that comes from the FreeNAS community. It is quackery at its best.


The problem is that there's so much quackery around ZFS when you Google for answers to any major problem. It's frequently hard to get good information.


I just want to point out that the official documentation is absolutely great.

http://docs.oracle.com/cd/E19253-01/819-5461/


Doesn't ZFS dedup eat lots of RAM? Btrfs offers on-demand deduplication instead of it being always-on.


Yes, it does. But dedup is rarely that useful in my experience; it sounds nice on paper, but it often doesn't provide benefits worth the CPU and IOPS demands it generates. Compression is "good enough" 98% of the time, and ZFS does a fine job with it.


If you take snapshotted backups from ext4 etc. filesystems (no send/recv), where users regularly reorganise photo folders, dedup is invaluable :-)


dedup is quite useful when you have snapshots that somehow have lost their space-sharing. I guess it's due to files being replaced with identical files.


The Btrfs deduplication... also needs vast quantities of RAM. That's just the nature of the beast. Don't use it, and both systems have relatively modest requirements.


But it's on-demand and can use a scratch file, which makes it far more manageable, e.g. you can let it run over the weekend.


If you need filesystem dedup... Perhaps handle it in userspace instead


Since ZoL does not have reflink copies or the range-dedup ioctl, you cannot do it from userspace like you can with btrfs.
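
For what it's worth, the scanning half is easy enough in userspace either way; here's a rough sketch that only reports duplicate files (actually merging them safely is exactly the part that needs reflinks or the range-dedup ioctl):

    # Illustrative only: find groups of byte-identical files under a directory.
    # Merging the duplicates safely needs filesystem help (reflink copies or a
    # range-dedup ioctl), which is the ZoL limitation described above.
    import hashlib, os, sys
    from collections import defaultdict

    def file_hash(path, chunk=1 << 20):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    def find_duplicates(root):
        by_hash = defaultdict(list)
        for dirpath, _, names in os.walk(root):
            for name in names:
                p = os.path.join(dirpath, name)
                if os.path.isfile(p) and not os.path.islink(p):
                    by_hash[file_hash(p)].append(p)
        return [paths for paths in by_hash.values() if len(paths) > 1]

    for group in find_duplicates(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(group)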


I run ZFS on a VM with 512MB RAM. It does web/mail/file serving and it has worked great.


Isn't that just for the default dedup implementation where it's suggested to have 2-5GB per TB?

HAMMER can dedup with little RAM, but the ZFS devs aren't ignorant or uninformed, so I would expect them to have a plan to remedy the inefficiency of OpenZFS's dedup implementation. I know there are commercially available alternative dedup implementations, but I don't have any experience with those.
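
To make the arithmetic behind that rule of thumb concrete, a back-of-the-envelope sketch, assuming the commonly quoted figure of roughly 320 bytes of RAM per dedup-table entry (one entry per unique block):

    # Rough sketch of where the "2-5 GB of RAM per TB" rule of thumb comes from.
    # Assumes ~320 bytes of in-core dedup-table space per unique block; the
    # average block size is the big variable.
    DDT_ENTRY_BYTES = 320

    def ddt_ram_gib(pool_tib, avg_block_kib):
        blocks = pool_tib * 2**40 / (avg_block_kib * 2**10)
        return blocks * DDT_ENTRY_BYTES / 2**30

    for block_kib in (128, 64, 32):
        print(f"1 TiB at {block_kib} KiB blocks -> ~{ddt_ram_gib(1, block_kib):.1f} GiB of DDT")
    # 128 KiB -> ~2.5 GiB, 64 KiB -> ~5.0 GiB, 32 KiB -> ~10.0 GiB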


Btrfs entered the Linux kernel over 7 years ago. (I've used both btrfs and ZFS a lot in production and would never touch btrfs again. I really like ZFS though.)


"The development of Btrfs began in 2007, and by August 2014, the file system's on-disk format has been marked as stable."


The on-disk format is the least of the problems with Btrfs. The quality of the implementation of the filesystem which uses that format is the bigger issue. It's still awful, and I've been using it since the start.

Two years ago, I migrated everything to ZFS and haven't looked back. It works. It hasn't lost any data, and it's easy to set up and administer. I can't give it any higher praise than that. It does its stuff with no fuss or drama.


I've never really understood the rationale behind integrating RAID into the filesystem. It seems like a giant layering violation, and unnecessary given that mdadm already exists and has many years of testing behind it.

The btrfs FAQ [1] says that "unlike MD-RAID, btrfs knows what blocks are actually used by data/metadata, and can use that information in a rebuild/recovery situation", but is that really a good enough reason to reimplement the entire RAID subsystem from scratch? And couldn't TRIM/discard provide the same benefits?

For me, the clincher is that mdadm has a very stable, well-defined on-disk format, which is a huge bonus if you hit a bug or make a mistake. I once almost lost a personal RAID10 array by recklessly trying to add an extra drive with no backup, and without fully understanding what I was doing. I was able to recover all of my data by hacking together a Python script to reassemble all of the blocks in the correct order. I can't imagine how much effort that would have taken if I had to build something that understood the full details of btrfs's data structures.

[1]: https://btrfs.wiki.kernel.org/index.php/FAQ#Case_study:_btrf...
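
For anyone curious, here's a toy sketch of the kind of thing such a script has to do, assuming mdadm's default near-2 layout (two copies of each chunk on adjacent members) and made-up chunk size, data offset and member images; the real geometry has to come from mdadm --examine:

    # Toy sketch: stream the logical contents of an mdadm RAID10 "near-2"
    # array back out of its member devices. Every constant below is an
    # assumption for illustration; read the real values from the superblocks.
    CHUNK = 512 * 1024           # chunk size (assumed)
    DATA_OFFSET = 2048 * 512     # where data starts on each member (assumed)
    MEMBERS = ["sda1.img", "sdb1.img", "sdc1.img", "sdd1.img"]  # hypothetical
    COPIES = 2                   # "near-2": two adjacent copies per chunk

    def read_chunk(path, row):
        with open(path, "rb") as f:
            f.seek(DATA_OFFSET + row * CHUNK)
            return f.read(CHUNK)

    def reassemble(out_path, rows):
        groups = len(MEMBERS) // COPIES        # distinct data chunks per row
        with open(out_path, "wb") as out:
            for k in range(rows * groups):     # k-th logical data chunk
                row, slot = divmod(k, groups)
                primary = MEMBERS[slot * COPIES]        # first copy
                # (a recovery script would fall back to MEMBERS[slot*COPIES+1]
                # if the first copy reads back bad)
                out.write(read_chunk(primary, row))

    reassemble("recovered.img", rows=4)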


Integrating RAID into the filesystem solves a lot of problems, one big one being data integrity. When you have a dumb RAID layer unaware of the data on disk, it will just replicate or shard it between disks. When you request that data back, it has no way to verify that the data is correct; if one replica is bad but the other is good, your filesystem has no way to correct for it, since it just sees one big block device.

Moving storage-pooling functionality into the filesystem is the right call for anything with server workloads; ZFS's self-healing functionality wouldn't work without it (and this is precisely why I use FreeNAS instead of Unraid or other alternatives: I care about the long-term integrity of my data, especially my VMware images and photo collection).


Enforcing data integrity is great, but I don't see why it has to happen at the filesystem level. There are any number of ways you could solve it while still keeping the RAID functionality at the level of unstructured block devices. I can think of three just off the top of my head:

#1: Do checksumming at the block level. Within each block, the RAID driver reserves a few bytes at the end for a checksum, and verifies them before returning data to the FS. (This would only work if the filesystem supports non-power-of-two block sizes.)

#2: Similar to #1, but pack the checksums for multiple data blocks into a separate dedicated checksum block; see the sketch after this list. (This adds some extra read latency in the worst case, but caching could mitigate it.)

#3: Let the filesystem handle checksumming, but extend the block device API to provide feedback to the kernel if the checksum is invalid. I think you only really need two API calls -- one which says "read this block as fast as possible and I'll verify it", and one which says "the block you returned looks bad, try to reconstruct it and give me all available candidates".
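
To make #2 slightly more concrete, here's a toy sketch of a checksummed block layer (illustrative only; it ignores write atomicity, caching, and everything a real RAID driver would need in order to locate a good replica):

    # Toy sketch of option #2: one CRC32 per 4 KiB data block, packed into a
    # dedicated checksum block after every 1024 data blocks. Illustrative only.
    import zlib

    BLOCK = 4096
    CSUM_SIZE = 4
    CSUMS_PER_BLOCK = BLOCK // CSUM_SIZE      # 1024 data blocks per checksum block

    class ChecksummedDevice:
        def __init__(self, path):
            self.f = open(path, "r+b")

        def _locate(self, lba):
            group, idx = divmod(lba, CSUMS_PER_BLOCK)
            data_block = group * (CSUMS_PER_BLOCK + 1) + idx
            csum_block = group * (CSUMS_PER_BLOCK + 1) + CSUMS_PER_BLOCK
            return data_block, csum_block, idx

        def write_block(self, lba, data):
            data_block, csum_block, idx = self._locate(lba)
            self.f.seek(data_block * BLOCK)
            self.f.write(data)
            self.f.seek(csum_block * BLOCK + idx * CSUM_SIZE)
            self.f.write(zlib.crc32(data).to_bytes(CSUM_SIZE, "little"))

        def read_block(self, lba):
            data_block, csum_block, idx = self._locate(lba)
            self.f.seek(data_block * BLOCK)
            data = self.f.read(BLOCK)
            self.f.seek(csum_block * BLOCK + idx * CSUM_SIZE)
            stored = int.from_bytes(self.f.read(CSUM_SIZE), "little")
            if zlib.crc32(data) != stored:
                raise IOError(f"checksum mismatch at block {lba}; try another replica")
            return data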


> Do checksumming at the block level.

This doesn't handle phantom or misdirected writes. You can't really do that unless you keep the checksums elsewhere, and specifically in the place where you intend to reference the data.

> Let the filesystem handle checksumming, but extend the block device API to provide feedback to the kernel if the checksum is invalid.

I think by the time you get this API right, it's going to seem like just as bad a layering violation.

ZFS has a coherent set of layers built into it. They're just not the ones that existed before it. But I don't think those previous layers were designed with many of the important failure modes of real hardware devices in mind. That's understandable given when they were designed, but we've come a long way since then in the understanding of those failure modes and our expectations of system integrity.


You're right that it doesn't have to happen at the filesystem level, but going down to block or device level is the wrong direction. Almost all useful metadata is lost by the time you get to the block level.

At the filesystem level, you can specify different redundancy policies for different types of data. You can mirror metadata but not file contents, for example, which is a pretty good policy available in BTRFS that you can't achieve at the block level. You can also do sensible things like mirror across three drives by putting each file on two of the three drives. But there's a better solution.

The better solution is to handle this at a higher level. It's common to want data replicated across machines, at which point block-level redundancy really only sucks up disk space for no benefit. This is how cloud storage works, and it's how you set up an Exchange server these days, et cetera.

RAID and block-level redundancy will still be around (somebody will always need it), but it's a dying technology, at least in the sense that new deployments are using it less and less frequently.


Sometimes collapsing arbitrary layers gives you more context to work with. Judging from ZFS's history, this is the correct path. (And if you've ever had to rebuild a mostly empty, large RAID 5/6 array, you will appreciate filesystem-aware rebuilds.)


I used to think this was a horrible layering violation as well. I'd been using various filesystems on LVM on RAID for a good 15 years, and from a certain point of view these layers make a great deal of sense.

But these layers also impede a number of important things needed for data integrity checking and efficient rebuilding on failure, as well as complicating administration when you need to rebuild bits of an array after failure or fiddle with logical volumes. From the point of view of a sysadmin, ZFS is a revelation. Datasets are simple; creating, resizing (quota), deleting are trivial and safe. Operations on the pool are logical and (relatively) safer than with mdraid or hardware RAID. Snapshots are safer--you are doing it at the filesystem level rather than the block device level, so it can never be inconsistent or get corrupted when your snapshot device is invalidated at some arbitrary future point when the block delta exceeds the device size.

The layering is still there. It's just subtly different. If you're used to LVM, then a "pool" is basically a volume group. The "vdev"s making up the pool are the physical volumes, which might be RAID sets. The "datasets" or "zvols" are logical volumes. It's a reinvention of what we already had, but it's more powerful, more flexible, and vastly easier to administer.

I was sceptical, but after learning about it (and there is a learning curve), doing some simple test installs (often repeated as I realised I'd not set up the pool with the right blocksize or partition alignment, or the optimal dataset structure) and then some actual deployments, I'm a convert. What it offers is currently unmatched.


Btrfs people have been telling everyone for years that the RAID5/6 code is incomplete, so hopefully people weren't depending on it for important things. The wiki link is broken; here's the page https://btrfs.wiki.kernel.org/index.php/RAID56 and you can see in the history that the very first version of the page says, in its first sentence, that "the recovery and rebuild code isn't bulletproof or complete yet."


This warning was only added recently:

https://btrfs.wiki.kernel.org/index.php?title=RAID56&diff=30...


Oh, so that's ~7 months of people maybe using it for important things. That could be really bad then.


Well that line about it stabilizing over the next couple of kernel releases was added in February 2015, so when I installed Ubuntu 16.04 on my NAS I thought things probably had calmed down sufficiently.

As I learned the hard way, I was wrong and that warning is very much justified.

https://btrfs.wiki.kernel.org/index.php?title=RAID56&diff=29...


Well, I have a problem with a btrfs RAID1 (which isn't really RAID1 but whatever) where if I read certain files my kernel crashes. So that's fun too.


Better that than the unrecoverable data loss I suffered (it trashed both disks... and also panicked the kernel). In fact, they were so screwed that it panicked the system on every boot until I booted with a non-Btrfs-supporting kernel and blanked them with dd...


Fancy file systems are for people who think their life isn't exciting enough.


Depends how you're defining "fancy". I use ZFS because I don't want any drama, and it has delivered that (Linux md was more complicated to set up, to the extent that I lost data). But it's a mature, solid filesystem in a way that BTRFS isn't yet.


Downvoted because I frequently hear this sentiment from both sides of the argument but rarely is anything more than an anecdote offered to support it. Feels more and more like a mud slinging competition instead of an assessment of merit.


You may wish to read the article we are discussing, then.


Many people use both filesystems, and you'll see many claims of how both work really well and that the user hasn't had any problems. The problem with these people's experiences is not that they are untrue, but that they are primarily from people who haven't had hardware failure/glitches, and who have never had their systems run the failure-case codepaths. Everything's fine and dandy up until the point you lose all your data.

I've run Btrfs on many systems since just after it started to be usable, and written software with Btrfs-specific support which hammers it (and LVM) like nothing else creating and destroying tens of thousands of transient snapshots. I've also now run ZFS on several systems, admittedly over a smaller timeframe (3 years vs 7-8ish).

I've had Btrfs totally trash a RAID1 mirror from a transient SATA cable connector glitch. On this test system, I had half the disk using Btrfs, half using mdraid/LVM. The mdraid half recovered and resynced transparently as soon as I reseated the connector; no service interruption or data loss. Btrfs ceased to function, and on reboot toasted both mirrors, resulting in total unrecoverable data loss and repeated kernel panics. That's been fixed for a while, but right here we're seeing the same thing. The failure codepaths, which are of critical importance, are untested and buggy. And even non-failure codepaths are still bad. Take the snapshotting case above: I had to take the system offline and do a full manual rebalance every 18 hours. The time from fresh new filesystem to read-only unbalanced disaster was just 18 hours when thrashed continuously, at most using 10% of the total space. And lastly, the performance of some things such as fsync is truly abysmal, to the extent that we had to use "eatmydata" to completely disable it for apt/dpkg operations! Under heavy parallel workloads, it could take many tens of minutes or hours(!) to complete writes which ext4 would complete in a minute or so.

I've yet to experience any problems at all with ZFS. Now that might have been luck on my part, but it might also be down to better design and quality of implementation. It's certainly been battle tested in high end installations. That's not to say that Btrfs doesn't have some neat features; it does a few things ZFS doesn't, like rebalancing data over its devices while ZFS only does that on write. But Btrfs has let me down badly every time I've used it in anger, and those few neat features don't make up for its lack of robustness--the primary purpose of the filesystem is to reliably store data, and it fails at that.

I don't like to see "mud-slinging", since such fanboyism is unobjective and uninformed. I've reached my opinion based upon several years of practical, intensive use of Btrfs for various things, the most demanding of which was repeated whole-archive rebuilds of all of Debian when I was maintaining the Debian build tools and wrote btrfs snapshot support specifically for them: over 30 parallel builds on a single system, independent snapshots per build, over 20000 snapshots per run, creating and destroying several per second. The experiment was disastrous, and showed Btrfs to be unsuitable for such intensive workloads. When your filesystem is guaranteed to be turned read-only at some unpredictable and unknown point in the future, you can't rely on it. Regular rebalancing mitigates but doesn't solve this, and has a terrible performance impact. Not data loss per se (unless it makes you lose writes when it turns read-only), but it's a serious design or implementation flaw. I did all this testing and adding of Btrfs support to various tools because I had high hopes for its potential; unfortunately they exposed serious shortcomings, many of which exist to this day. Today I'm using ZFS, not because of any irrational prejudice against Btrfs, but because Btrfs has never managed to deliver a robust and well-tested filesystem!


I don't have nearly your experience, but I just want to say I agree 100% with your conclusions based on my own. At uni we ran a Hadoop cluster that had disks slowly dying (some bad sectors every few days, but otherwise fine) and we lacked the money to replace them. We replaced ext4 with ZFS (no RAID, just plain zpools with failmode=continue), and ZFS ran mostly fine and scrubs kept the metadata sane. We never had data loss (HDFS has its own replication and checksumming; we just needed sane metadata for Hadoop to run, and intermediate MapReduce outputs were sent to directories with zfs set copies=2), and we only replaced the botched disks that had longer scrub times or couldn't survive a scrub. I'm still surprised ZFS managed to pull that off. The only bugs I found were related to ZoL at the time, but they could be worked around.

btrfs switched to readonly as fast as ext4 (which is probably the correct thing to do) but was useless for this problem.

On new hardware with new enterprise disks we chose btrfs, and we had a painful tour of crashes, data corruption, metadata corruption (undeletable files) and deadlocks until kernel 4.4, where things got a little bit better. Here the disks and server were enterprise-class and fully working; these were just btrfs bugs. Also no RAID. I'm not doing that anymore, so I don't know whether any new bugs have appeared, but the whole experience will keep me from ever using btrfs. This was ~2 years ago, and you could easily find lots of slides from Fujitsu or SUSE saying that btrfs is stable and you can use it (around kernel 3.13-3.16).

It's probably fine for your notebook or even your backup HDD, but don't think you can stress it without experiencing pain (be it corruption, hangups or data loss) or just abysmal performance.

That being said, ZFS on Linux is also a far cry from rock-solid, but I'm optimistic that they'll work out the problems and tackle them in a solid way.

As a Linux fanboy for years this gave me some solid appreciation for Solaris engineering.


Unstable file systems are for people who think their life isn't exciting enough.

If you've been following along at all in btrfs development, this doesn't really come as a huge surprise.


Had nothing but great results from ZFS on a FreeNAS server.


If anyone wants to help out with bcachefs, Reed-Solomon might be a fun place to jump in. And bcachefs so far has a better track record with people's data than btrfs :)
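
For anyone curious what that entails, here's a toy sketch of the RAID-6-style P/Q parity math over GF(2^8) (the textbook construction, not bcachefs's actual code):

    # Toy sketch of RAID-6-style parity: P is plain XOR, Q is a Reed-Solomon
    # syndrome over GF(2^8) with generator 2 (polynomial 0x11d). With P and Q
    # you can rebuild any two lost blocks in a stripe. Illustrative only.

    def gf_mul(a, b):
        p = 0
        for _ in range(8):
            if b & 1:
                p ^= a
            carry = a & 0x80
            a = (a << 1) & 0xFF
            if carry:
                a ^= 0x1D            # reduce modulo x^8 + x^4 + x^3 + x^2 + 1
            b >>= 1
        return p

    def pq_parity(data_blocks):
        """data_blocks: equal-length byte strings, one per data disk."""
        size = len(data_blocks[0])
        P, Q = bytearray(size), bytearray(size)
        g = 1                        # g = 2**i for data disk i
        for d in data_blocks:
            for j, byte in enumerate(d):
                P[j] ^= byte
                Q[j] ^= gf_mul(g, byte)
            g = gf_mul(g, 2)
        return bytes(P), bytes(Q)

    P, Q = pq_parity([b"disk0....", b"disk1....", b"disk2...."])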


Does this impact Synology's Hybrid RAID running on Btrfs? Btrfs is nowadays the default fs on Synology boxes.


No, there is no impact to Synology's Hybrid RAID running on btrfs. Synology uses mdadm/lvm for RAID.


Is it possible to install Linux on HAMMER?


You mean, is it possible to port and run HAMMER on Linux?

Anything is possible but the work required would be a big challenge.


I'm interested to know as well. I read up on it yesterday but couldn't find anything but a student project that only provided read access.



