Btrfs has been deprecated in RHEL (redhat.com)
368 points by alrs on Aug 2, 2017 | 338 comments

People are making a bigger deal of this than it is. Since I left Red Hat in 2012 there hasn't been another engineer to pick up the work, and it is _a lot_ of work.

For RHEL you are stuck on one kernel for an entire release. Every fix has to be backported from upstream, and the further from upstream you get the harder it is to do that work.

Btrfs has to be rebased _every_ release. It moves too fast and there is so much work being done that you can't just cherry-pick individual fixes. This makes it a huge pain in the ass.

Then you have RHEL's "if we ship it we support it" mantra. Every release you have something that is more Frankenstein-y than it was before, and you run more of a risk of shit going horribly wrong. That's a huge liability for an engineering team that has 0 upstream btrfs contributors.

The entire local file system group consists of xfs developers. Nobody has done serious btrfs work at Red Hat since I left (with the slight exception of Zach Brown for a little while).

Suse uses it as their default and has a lot of inhouse expertise. We use it in a variety of ways inside Facebook. It's getting faster and more stable, admittedly slower than I'd like, but we are getting there. This announcement from Red Hat is purely a reflection of Red Hat's engineering expertise and the way they ship kernels, and not an indictment of Btrfs itself.

I think a natural follow-up question is "Why Red Hat does not have engineers to support btrfs?" That is, if the lack of engineers is a symptom, what is the cause?

I'm pretty sure that, had RH wanted to, they could either hire or assign engineers to maintain the btrfs code, take care of patches from upstream, etc. So why didn't that happen? I wonder what your opinion on that is.

I see a bunch of possibilities (not necessarily independent ones):

1) Politics. Perhaps RH wants to kill btrfs for some reason?

I see this as rather unlikely, as RH does not have a competing solution (unlike in the Jigsaw controversy, where they have incentives to kill it in favor of the JBoss module system).

2) Inability to hire enough engineers familiar with btrfs, or assign existing engineers.

Perhaps the number of engineers needed would be too high, increasing costs. Especially if the goal is not only to maintain the RHEL kernels, but also to contribute to btrfs and move it forward.

Or maybe there's a pushback from the current filesystems team, where most people are xfs developers?

3) Incompatible development models.

If each release requires a rebase, perhaps supporting btrfs would require too much work / too many engineers, increasing costs? I wonder what Suse and others are doing differently, except for having in-house btrfs developers.

4) Lack of trust btrfs will get mature enough for RHEL soon.

It may work for certain deployments, but for RHEL customers that may not be sufficient. That probably requires a filesystem performing well for a wider range of workloads.

5) Lack of interest from paying RHEL customers.

Many of our customers have RHEL systems (or CentOS / Scientific Linux), and I don't remember a single one of them using btrfs or planning to do so. We only deal with database servers, which is a very narrow segment of the market, and fairly conservative one when it comes to filesystems.

But overall, if customers are not interested in a feature, it's merely a pragmatic business decision not to spend money on it.

6) Better alternatives available.

I'm not aware of one, although "ZFS on Linux" is getting much better.

So I tend to see this as a pragmatic business decision, based on customer interest in btrfs on RHEL vs. costs of supporting it.

All this talk about Oracle is just plain stupid. Oracle doesn't control anything, the community does. One core developer still works on Btrfs from Oracle, the vast majority of the contributions come from outside Oracle.

Now as to

> "Why Red Hat does not have engineers to support btrfs?"

You have to understand how most kernel teams work across all companies. Kernel engineers work on what they want to work on, and companies hire the people working on the thing the company cares about to make sure they get their changes in.

This means that the engineers have 95% of the power. Sure you can tell your kernel developer to go work on something else, but if they don't want to do that they'll just go to a different company that will let them work on what they care about.

This gives Red Hat 2 options. One is they hire existing Btrfs developers to come help do the work. That's unlikely to happen unless they get one of the new contributors, as all of the seasoned developers are not likely to move. The second is to develop the talent in-house. But again we're back at that "it's hard to tell kernel engineers what to do" problem. If nobody wants to work on it then there's not going to be anybody that will do it.

And then there's the fact that Red Hat really does rely on the community to do the bulk of the heavy lifting for a lot of areas. BPF is a great example of this, cgroups is another good example.

Btrfs isn't ready for Red Hat's customer base, nobody who works on Btrfs will deny that fact. Does it make sense for Red Hat to pay a bunch of people to make things go faster when the community is doing the work at no cost to Red Hat?

Oracle certainly controls the license for ZFS.

Release under a compatible license would likely see a ZFS kernel module appear in EPEL immediately; Red Hat would likely replace XFS with ZFS as the default in RHEL8 were this legally possible.

Oracle supports BtrFS in their Linux clone of RHEL. It certainly appears that Red Hat is swallowing a "poison pill" to increase Oracle's support costs (and I'm surprised that they have not swallowed more).


With these new added costs, Oracle might find it cheaper to simply support the code for the whole ecosystem (CentOS and Scientific Linux included). Given the adversarial relationship that has developed between the two protagonists, an enforceable legal agreement would likely be Red Hat's precondition.

Otherwise, BtrFS has been mortally wounded.

> Oracle doesn't control anything

Oracle owns a lot of patents and I suspect both ZFS and BtrFS rely on some.

> All this talk about Oracle is just plain stupid. Oracle doesn't control anything, the community does. One core developer still works on Btrfs from Oracle, the vast majority of the contributions come from outside Oracle.

FWIW I haven't said anything about Oracle & btrfs ...

>> "Why Red Hat does not have engineers to support btrfs?"

> You have to understand how most kernel teams work across all companies. Kernel engineers work on what they want to work on, and companies hire the people working on the thing the company cares about to make sure they get their changes in.

> This means that the engineers have 95% of the power. Sure you can tell your kernel developer to go work on something else, but if they don't want to do that they'll just go to a different company that will let them work on what they care about.

> This gives Red Hat 2 options. One is they hire existing Btrfs developers to come help do the work. That's unlikely to happen unless they get one of the new contributors, as all of the seasoned developers are not likely to move. The second is to develop the talent in-house. But again we're back at that "it's hard to tell kernel engineers what to do" problem. If nobody wants to work on it then there's not going to be anybody that will do it.

Sure, I understand many developers have their favorite area of development, and move to companies that will allow them to work on it. But surely some developers are willing to switch fields and start working on new challenges, and then there are new developers, of course. So it's not like the number of btrfs developers can't grow. It may take time to build the team, but they had several years to do that. Yet it didn't happen.

> And then there's the fact that Red Hat really does rely on the community to do the bulk of the heavy lifting for a lot of areas. BPF is a great example of this, cgroups is another good example.

I tend to see deprecation as the last state before removal of a feature. If that's the case, I don't see how the community doing the heavy lifting makes any difference for btrfs in RHEL.

Or are you suggesting they may add it back once it gets ready for them? That's possible, but the truth is if btrfs is missing in RHEL (and derived distributions), that's a lot of users.

I don't know what the development statistics are, but if the majority of btrfs developers work for Facebook (for example), I suppose they are working on improving the areas important to Facebook. Some of that will overlap with the use cases of RHEL users, some of it will be specific. So it likely means a slower pace of improvements relevant to RHEL users.

> Btrfs isn't ready for Red Hat's customer base, nobody who works on Btrfs will deny that fact. Does it make sense for Red Hat to pay a bunch of people to make things go faster when the community is doing the work at no cost to Red Hat?

The question is, how could it get ready for Red Hat's customer base, when there are no RH engineers working on it? Also, I assume the in-house developers are not there only to work on btrfs improvements, but also to investigate issues reported by customers. That's something you can't offload to the community.

I still think RH simply made a business decision, along the lines:

1) Btrfs matters to maybe X% of our paying customers, and some of them might leave if we deprecate it, costing us $Y.

2) An in-house team of btrfs developers who would work on it and provide support to customers would cost $Z.

If $Y < $Z, deprecate btrfs.
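That back-of-the-envelope comparison can be sketched in a few lines; all dollar figures below are hypothetical, not anything Red Hat has published:

```python
# Hedged sketch of the cost comparison above; all figures are made up.

def should_deprecate(churn_cost_y: float, support_cost_z: float) -> bool:
    """Deprecate when the revenue lost to churn ($Y) is below the
    cost of running an in-house support team ($Z)."""
    return churn_cost_y < support_cost_z

# Hypothetical numbers: $2M of churn risk vs. a $5M/year btrfs team.
print(should_deprecate(2_000_000, 5_000_000))  # prints True
```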

Wholeheartedly agree that btrfs isn't ready for a real customer base. I wish SuSE would have learned that lesson before they pushed it as default.

It feels like reiserfs all over again.

Well, I am guilty of using reiserfs selectively (and with research + design) and having a really good experience with it. Maybe I should have done the same with btrfs but I took btrfs on faith and was burned.

Oracle has essential control of both "nextgen" filesystems that should be used in Linux - as Sun, they developed and licensed ZFS, and they are the chief contributors of BtrFS. Their refusal to release ZFS under a license that is compatible with the GPL is keeping it out of Red Hat's distribution.

This move by Red Hat must be seen as a provocation of Oracle, to force either greater cooperation and compliance in producing a stable BtrFS for RHEL, or the release of ZFS under a compatible license. Red Hat has put an end to BtrFS for now, and Oracle will have to go to greater lengths to use it in their clone. Customers also will not want it if it does not run equally well between RHEL and Oracle Linux.

It is obvious that Oracle will have to assume higher costs and support if they want BtrFS in RHEL. Red Hat is certainly justified in bringing Oracle to heel.

Oracle recently committed preliminary dedup support for XFS, so they must be intimately aware of the technical and legal issues behind Red Hat's move.


Oracle is not the "chief contributors" of Btrfs. If anyone is, it's Facebook. Chris Mason (the btrfs creator) worked for Oracle. He left in 2012.

> This move by Red Hat must be seen as a provocation of Oracle

I doubt it.

A thousand pardons - I am mistaking "initially designed" for current control.


"... initially designed at Oracle Corporation for use in Linux."

Oracle bases their Unbreakable Linux [0] distribution on RHEL. The least they could do is open up ZFS for the Linux community.

[0] - https://linux.oracle.com/

Seconded. I am in absolute agreement.

> open up ZFS for the Linux community.

More likely they'll support it only on their Linux.

This will particularly impact the "Red Hat Compatible Kernel" (RHCK) that is shipped by Oracle Linux.


Assuming that RHEL v8 strips BtrFS, Oracle's RHCK will have to add support back in, and thus no longer be "compatible." Without that support, some filesystems will fail to mount at boot. In-place upgrades from v7 to v8 will be problematic.

Oracle has worked very hard to maintain "compatibility" with Red Hat, even going so far as to accept MariaDB over MySQL. Their reaction to the latest "poison pill" will be interesting.

Why would Oracle have to add Btrfs support back into the RHCK? It's exactly the point of this kernel to be 100% identical to upstream RHEL. If an Oracle Linux user needs Btrfs support, it will still be included in the "Unbreakable Enterprise Kernel" (UEK), which Oracle provides as an alternative.

Any BtrFS filesystems in /etc/fstab won't mount if/when an RHCK boots that lacks the filesystem driver.

An in-place upgrade from v7 to v8 could easily get hosed.
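For illustration, an /etc/fstab entry like the following (UUID hypothetical) would block boot if the btrfs driver is missing from the kernel; adding the `nofail` option would at least let the system come up without that volume:

```
# Hypothetical btrfs mount; fails at boot if the kernel lacks btrfs support.
# "nofail" lets boot continue even when this mount cannot be performed.
UUID=deadbeef-0000-0000-0000-000000000000  /data  btrfs  defaults,nofail  0 0
```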

Does Oracle support Btrfs (as opposed to making it just a tech preview) with the compatible kernel? I don't think so, since it's the same code as RHEL. And if not, hosing in-place upgrades is acceptable. RHEL 7 is supported until 2024.

>>> This announcement from Red Hat is purely a reflection of Red Hat's engineering expertise and the way they ship kernels, and not an indictment of Btrfs itself.

It's a clear indicator that RedHat doesn't want to, or can't, support btrfs.

Which is a reflection of btrfs AND RedHat: the effort required to maintain it, the lack of usage in RHEL paying customers, the immaturity/fast development of the filesystem.

Thanks. Any indication why RH didn't hire btrfs devs? It looks like a decision was made that it wasn't strategic (obviously xfs on Linux has a much longer history).

They brought on Zach right before I left specifically to help with the effort, but he left as well. I can't really speak to Red Hat's overall strategic decisions, but really they have a large local file system team, and a lot of them are xfs developers. You aren't going to convince Dave Chinner he should go work on Btrfs instead of XFS. Unless there's somebody internally that actually wants to work on Btrfs the work simply isn't going to get done. All of the other Btrfs developers work for other companies, and none of them are interested in working for Red Hat.

I'm feeling a subtext here that maybe RH isn't a desired place to work, when I've always imagined the opposite. Is this the case?

One of Red Hat's superpowers is hiring relatively unknown developers and helping them become strong participants in the open source world. But their compensation isn't super high, and when you travel on Red Hat's nickel you have to share a room with someone else --- assuming you get travel approval to go at all. Among people who help organize conferences, Red Hat is rather infamous for having its full-time employees ask for travel scholarships, which were originally established to support hobbyist developers.

As a result, it is not at all surprising that Red Hat ends up functioning as somewhat like a baseball farm team for companies like Facebook, Google, etc. who are willing to pay more and have more liberal travel policies than Red Hat. If someone can become a strong open source contributor while working at Red Hat, they can probably get a pay raise going somewhere else.

There is a trade off --- companies that pay you much more also tend to expect that you will add a corresponding amount of value to the company's bottom line. So you might have slightly more control over what you choose to work on at Red Hat.

Nope I love Red Hat and loved working for Red Hat and still interact with most of my colleagues there on a day to day basis. I shouldn't be speaking for everybody, but from what I can tell we're all pretty happy where we are, so no real reason to switch companies.

I'd read the subtext as "there are only a handful of filesystem developers in the world and the 10 of them are already settled in a good big company".

I think there are some ways that RH would be less desirable for many people than a BigCo. When I was interested in working for them they had offices in inconvenient locations and a requirement that you (or at least, I) work in one of them -- e.g. their "Boston" office is 30 miles away in Westford, and their headquarters are in North Carolina. That's disqualifying for many people.

I imagine they pay significantly less than the other companies (e.g. Facebook) who want to hire Btrfs devs can afford to, too.

Isn't FB's internal distro Fedora-based? I wonder if FB has a solid RH-based, production-ready btrfs kernel floating about.

The Fedora kernel is based on upstream. The RHEL kernel is a 3.10 fork with key subsystems currently having at least 4.5-ish features.

Fair enough. I imagined if btrfs was a high enough priority they'd hire new staff specifically for it, but if they've tried, and money/good employment conditions don't work, that's all they can do.

XFS does not support transparent compression, error detection and recovery, and (as yet) deduplication.

Fragmentation is also an issue, and xfs_fsr should be run at regular intervals to "defrag" an XFS file system. (I assume that) BtrFS handles this more intelligently.
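As a sketch of the periodic-defrag suggestion (mount point hypothetical), a cron entry could run xfs_fsr for a bounded time each night:

```
# Run xfs_fsr for up to 2 hours (-t is in seconds) against a
# hypothetical /data XFS mount, every night at 03:00.
0 3 * * * root /usr/sbin/xfs_fsr -t 7200 /data
```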

I'd love to see XFS get some or all of these features.

thanks for this context; i read this thread previously and had no idea of the "why" behind the news item. great comment to understand better.

The problem is that Redhat and others are refusing to challenge the norm and break away from the "freeze the release; backport fixes" mantra.

Stop backporting fixes. You're forking the codebase.

Ship exactly what upstream provides.

Teach upstream projects how to do better release engineering if they're abandoning major releases too early or breaking API/ABI in a minor release.

Stop backporting fixes. You're forking the codebase.

edit: also stop incorrectly backporting security fixes and creating new CVEs. Seriously. Stop it.

I think you're underestimating the stability that such practices provide for enterprise. This is what people pay Redhat for.

Not all upstreams are interested in doing release engineering. There are non-zero costs to doing it. It can eat up time that could be spent on bug fixes and features, or even make it too costly to change direction if a certain approach to implementation is proving more difficult than it should be.

Look at the Linux kernel. The only reason there is a stable kernel series is because Greg K-H decided it was important enough. He was unable to convince any other developers to go along with it, and eventually the decision was "if you want to support it, then you can do it."

Do you consider the stable kernel series a fork of the codebase? Should everyone be running the newest kernel every release despite the plenty of regressions that appear?

Kernel developers are not interested in making every change in such a slow and controlled manner as to avoid any regressions. And it works for them. They get a lot of stuff done, and come back and fix the regressions later.

There are real tradeoffs between development velocity, stability, scope (wide/narrow applicability), and headcount.

If you don't care that much about development velocity, it's really easy to make something that is super stable.

If you only care about making things work on a very narrow use cases (to support the back end of a particular company's web servers, or just to support a single embedded device), life also gets much easier.

If you want to "move fast and break things", that's also viable.

Finally, if you have unlimited amounts of head count, life also becomes simpler.

Different parts of the Linux ecosystem have different weights on all of these issues. Some environments care about stability, but they really don't care about advanced features, at least if stability/security might be threatened. Others are interested in adding new features into the kernel because that's how they add differentiators against their competitors. Still others care about making a kernel that only works on a particular ARM SOC, and to hell if the kernel even builds for any other architecture. And Red Hat does not have infinite amounts of cash, so they have to prioritize what they support.

So a statement such as "Teach upstream projects how to do better release engineering" is positively Trumpian in its naivete. Who do you think is going to staff all of this release engineering effort? Who is going to pay for it? Upstream projects consist of some number of hobbyists, and some number of engineers from companies that have their own agendas. Some of those engineers might only care about making things better for Qualcomm SoCs, and to hell with everyone else. Others might be primarily interested in how Linux works on IBM mainframes. If there are no tradeoffs, then people might not mind work that doesn't hurt their interests but helps someone else. They might even contribute a bit to helping others, in the hope that they will help their use case. That's the whole basis of the open source methodology.

But at the same time you can't assume that someone will spend vast amounts of release engineering effort if it doesn't benefit them or their company. Things just don't work that way. And an API/ABI that must be stable might get in the way of adding some new feature which is critically important to some startup which is funding another kernel engineer.

There is a reason why the Linux ecosystem is the way it is. Saying "stop it" is about as intelligent as saying that someone who is working two 25 hour part-time jobs should be given a "choice" about her healthcare plan, when none of the "choices" are affordable.

I get the feeling you've never had to provide support for a distribution before. There are many guarantees that Red Hat or SUSE provide that are not provided by upstream projects. Freezing the release is the only sane way of doing it, and backporting fixes is necessary. There are exceptions to this, such as stable kernels (which was started by GregKH out of frustration of the backporting problem while at SUSE).

Upstreams don't have the resources to do proper release engineering, they're busy working on new features. The fact that SUSE and Red Hat spawned from a requirement for release engineering that upstreams were not able to provide should show that it takes a lot more work than you might think.

Also, can we please all agree as a community that writing patches and forking of codebases is literally the whole point of free software? If nobody should ever fork a codebase then why do we even have freedom #1 and #2? The trend of free software projects to have an anti-backport stance is getting ridiculous. If you don't want us to backport stuff, stop forcing us to do your release engineering for you.

Sadly, more and more upstreams want to have their cake and eat it too. Just look at Flatpak, which is all about moving updating and distribution from distros to upstream.

I think Flatpak won't end up solving the problem though. Mainly because it still requires distributions to exist and provide system updates, but also because it just makes the static binary problem (that distributions were made to fix) even worse.

Honestly what I think we need is to have containers that actually overlay on the host system and only include whatever specialised stuff they need on top of the host. So updates to the host do propagate into containers -- and for bonus points the container metadata can still be understood by the host.

In the end i don't see it as a technical problem, but a mentality problem.

Again and again we see that without any financial incentive, developers are loath to put any effort into backwards compatibility and interface stability.

At the same time they all want people to be running their latest and shiniest.

So in the end, what will happen is that each "app" will bundle the world, or at least as much as they feel they need to.

I don't get why this would be a positive thing for Red Hat's customers (or Red Hat, since stability/predictability is what Red Hat customers are paying for). There is a Red Hat-maintained Linux that is very close to upstream (Fedora). But the people who pay for RHEL don't want upstream and surprises, they want predictable for seven years and they're willing to pay a lot of money for that. Why would that be a negative for you or me? RHEL isn't breaking upstream with this practice, even if they are making mistakes in their own backports.

"edit: also stop incorrectly backporting security fixes and creating new CVEs. Seriously. Stop it."

Can you give some examples of cases where Red Hat introduced bugs in their backported patches? I follow RHEL CVEs relatively closely (because some of my packages are derived from their packages), and I can't think of an example of that happening. Debian has done so, but very rarely, that I can recall. (And, Ubuntu, too, since they just copy Debian for huge swaths of the OS.)

I for one am very grateful Red Hat does not do that. We have a kernel driver for custom hardware, there's around 80 of these devices in the world, split roughly 50/50 between Windows and Red Hat users. While these devices are not cheap, we could not recoup the cost of maintaining it if we had to track the upstream kernel all the time - we tried, and could not justify the cost.

The number of times the APIs change under your feet is astounding. Even just keeping up with Red Hat, we spend around 4-6 times the engineering time on the driver compared to the Windows version of the driver; tracking upstream gave us almost an order of magnitude more work. (And keep in mind that /only/ supporting the most recent upstream kernel is rarely an option; several versions need to be supported concurrently.)

Red Hat provides a stable ABI for a pretty large set of symbols. Unless you are doing strange things in the driver, a module built for RHEL 7.0 should be fine until 8 comes out.

That is exactly what I am saying - which is in stark contrast to what would happen if Red Hat did not provide that stable ABI, but instead "Ship exactly what upstream provides" as the original comment suggest they should do.

If upstream releases were doing better release engineering in the way you mean then there would be no money to be made shipping RHEL as a product.

> the "freeze the release; backport fixes" mantra.

For many customers of Red Hat, that mantra is the very reason they use RHEL in the first place.

Indeed. Or they would have stuck to using Windows, or some commercial Unix.

Sadly i feel that more and more upstream wants it both ways, be able to push their latest and shiniest, and keep ignoring any need for interface stability etc.

Frankly i suspect the end result of the likes of Flatpak will be that upstream push whole distros worth of bundled libs, just so they don't have to consider interface stability as they pound out their shinies in their best "move fast and break things" manner.

Taking this to its most ludicrous extreme, everyone should use Arch, and anyone who can't should... what? Not use Linux?

The Fedora Project focuses, as much as possible, on not deviating from upstream in the software it includes in the repository.

[0]: https://fedoraproject.org/wiki/Staying_close_to_upstream_pro...

Right, "as much as possible" implying that there are cases where this is not possible. Which is more upstream-compliant than RHEL, but not 100% "stop doing this", which is the opinion of the comment I was replying to.

We use both RHEL and Oracle Linux as peers to VAX VMS and Unisys (Univac) OS2200 (a COBOL mainframe).

From the perspective of legacy systems, Red Hat's approach is more comfortable.

You can't teach anything to anybody. An unrelated proof: I'm having to write a bot with phantomjs to scrape my uni's announcements page and turn it into an RSS feed so that I won't have to check it periodically, because they decided WordPress wouldn't cut it and they needed some blumming angular.js stuff, breaking all the urls and removing any sort of RSS feeds on the way. And all it is is a blog, basically, nothing more. Mailed them telling them that I used to use that stuff; no replies in weeks. At least I'm learning phantomjs, which seems to be a very useful tool.

I am as happy as anyone that XFS is finally getting the position of honor it deserves on enterprise Linux (something like 15 years later than it should have, grumble grumble) but it doesn't really take the place of what btrfs was trying to do. Only ZFS is in a position to do that. I wonder if there are any plans for supporting the native port on RHEL.

Anecdote time: Last month I had an XFS volume fail on me. It got some sort of internal inconsistency and refused to work (all fs calls returned errors until I unmounted). This is where I discovered that XFS still has extremely poor recovery tools.

xfs_repair will complain if there is a journal present and tell you to mount the fs to replay the journal. But mount would refuse, saying the fs was inconsistent. So the only option was to xfs_repair -L to just throw out the journal.

Then, xfs_repair sucked up something like 30GB or more of RAM, so I had to make a huge swapfile so that the kernel wouldn't OOM-kill the repair.

Then, after roughly 20 to 30 hours of repair it would exit with an error. At that point it would actually mount, but hitting certain areas of the filesystem would trigger the inconsistency again and start the entire process over.

In the end I couldn't fix it and sadly had to reformat. I chose ext4 when I did—I've had lots of experience with ext3 and 4 and I've never had a filesystem that I couldn't at least make consistent again (even if it loses some data).
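The dead-end described above looks roughly like this (device name hypothetical; note that -L zeroes the log and can lose the most recent metadata updates):

```shell
# A normal repair refuses to run while a dirty log is present and
# tells you to mount the filesystem to replay the journal...
xfs_repair /dev/sdb1

# ...but mounting fails because the filesystem is marked inconsistent...
mount /dev/sdb1 /mnt

# ...leaving only the destructive option: zero the log and repair without it.
xfs_repair -L /dev/sdb1
```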

Yikes. That's unnerving to read. Were you able to create a bug report? Was there any other related issue like an underlying storage controller messing up?

Redhat hasn't been on the best of terms with Oracle, so I suspect that they want to stay clear of ZFS. It does however leave Redhat without a more modern feature rich filesystem.

Perhaps Redhat could help to develop snapshots on XFS. It's not the only feature XFS is missing, but it's a start.

> Perhaps Redhat could help to develop snapshots on XFS. It's not the only feature XFS is missing, but it's a start.

Upstream has been working on adding more btrfs-like features to XFS, but I believe that RHEL encourages using devicemapper snapshots (which you then format with XFS).

> Upstream has been working on adding more btrfs-like features to XFS, but I believe that RHEL encourages using devicemapper snapshots (which you then format with XFS).

Exactly, mountable and mergable snapshots have been supported by a LVM/devicemapper stack for a long time.
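A minimal sketch of that LVM workflow, assuming a volume group `vg0` with a logical volume `data` (names hypothetical):

```shell
# Create a 1G copy-on-write snapshot of the origin volume.
lvcreate --size 1G --snapshot --name data-snap /dev/vg0/data

# The snapshot is mountable like any block device.
mount /dev/vg0/data-snap /mnt/snap

# Merging rolls the origin back to the snapshot's state (the merge
# completes once the origin volume is unmounted or next activated).
umount /mnt/snap
lvconvert --merge /dev/vg0/data-snap
```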

It's still a bit more involved than the transparent snapshots ZFS and, to a lesser degree, BtrFS offer. I'm not happy with this and sincerely hope this position changes in the future.

>Redhat hasn't been on the best of terms with Oracle, so I suspect that they want to stay clear of ZFS

The only connection Oracle has to ZFS on Linux is ownership of some patents that the license allows you to use, their reluctance is based on distribution issues between the GPL and CDDL.

… for which Oracle owns the copyright.

I can't see that the copyright matters here. What matters is the license. Oracle can't un-license the Sun code that is already part of OpenZFS.

Correct, Oracle would be in a position to do exactly that. Whether they do or not is another story, but that's a pretty big liability.

Given they are discontinuing Solaris and going all-in on Red Hat Enterprise Linux, I can't help but wonder why they don't do more with ZFS on Linux, and therefore wonder if the NetApp patent suits or some other patent suit is preventing them from doing anything in the background.

Many people don't realise that these crappy patent suits in the background prevent all sorts of really basic stuff, like the fact that most things now bounce through a cloud server (like FaceTime) because there's a patent troll for peer-to-peer communications. And it's causing total waste as a result :( It also seems likely that this prevented FaceTime from becoming an open standard as Apple originally promised. This is only one example though.

> Given they are discontinuing Solaris

Oracle is NOT discontinuing Solaris. This FUD must die.


This may be true, but when Oracle killed OpenSolaris, my non-Sun/Oracle friends wrote off Solaris and moved off it.

Killing OpenSolaris, and talking up SPARC so much, made people think that a) Larry just wants to vendor lock them, b) doesn't care about x86 support because it makes vendor lock-in harder for Oracle, c) the OpenSolaris derivative community will not be able to compete with Linux. So everyone has grudgingly accepted that Linux is it for the enterprise Unix market.

I hate this as much as you do. I <3 Solaris/Illumos. Illumos derivatives have their niches, no doubt, and I want to be able to use them much more. But that's not how business people think.

I'm not sure that Oracle could turn this impression around at this point. To begin with it would have to restart OpenSolaris, and that might not be enough. OpenSolaris greatly helped Sun overcome resistance to Solaris, but it only went so far, so Oracle will have to do even more work to make Solaris' future bright.

This blog post is as relevant today as ever: https://blogs.oracle.com/bmc/the-economics-of-software

(And yes, it's STILL hosted at blogs.oracle.com. I'm almost afraid of mentioning it: who knows, it might get removed if Oracle execs notice it.)

Update: Apparently Oracle actually wouldn't be the liability on a CDDL basis, because the violation is of the GPL, not the CDDL. Fair point.

Hrm, thanks, seems you're right. They cancelled Solaris "12", but 11 is still in development. Thanks for the correction.


But presumably, as copyright holders, Oracle is the entity that could try to enforce the CDDL in court, in particular over breaking the CDDL by mixing in GPL code in the same (i.e. OS) distribution? Oracle goes: we bought Sun and hold copyright to ZFS (also at the point of the OpenZFS fork). Red Hat could respond: we got a license, the CDDL. And Oracle could respond: sure, but the CDDL isn't compatible with the GPL, so you're in breach of the CDDL?

Oracle is not the one that could sue. There is nothing in the CDDL that prevents it from being used elsewhere.

The GPL on the other hand is a strong copy left. If you link against GPL code, your code must also be licensed as GPL.

This means the Linux copyright owners could sue the distributors of ZoL binaries, but Oracle could not.

Oracle has the power to allow their ZFS code to be relicensed as GPL, removing this road block, but they have no incentive to do so.

>But presumably, as copyright holders, Oracle is the entity that could try to enforce CDDL in court, in particular breaking the CDDL by mixing in GPL code in the same (ie: OS) distribution?

Any Linux contributor could also try to enforce it, which is why the license incompatibility is the issue stopping them. Oracle holds no special power.

True - but the incentives are a bit different. How many other Linux contributors[1] are selling a commercial operating system in direct competition with Linux as a general purpose Unix-like OS, with ZFS as one of the differentiating features?

Most Linux contributors want Linux to succeed. I don't think it's at all clear that corporate Oracle prefers Linux to succeed - at least not if higher adoption of Solaris is an alternative.

[1] (I guess IBM and Microsoft come to mind... but they don't have any special investment in ZFS)

There are thousands of Linux contributors; I don't care enough to check, but I have to imagine Oracle has employed at least one. Any one of them could sue, and several have mentioned they're considering the option.

The license issue is what's keeping Red Hat from using ZFS, not some rivalry with Oracle.

How does mixing CDDL and GPL violate CDDL?

The only issue I'm aware is that mixing the two would violate GPL.

CDDL says: Source code must be licensed under CDDL.

GPL says: Source code must be licensed under GPL.

If you follow the conditions of the GPL, you are violating the conditions of the CDDL. If you follow the conditions of the CDDL, you are violating the GPL. Basic binary logic.

To add: "the engineers who had written the Solaris kernel requested that the license of OpenSolaris be GPL-incompatible". A license is really just a written statement of the author's intention about the conditions under which copyright law restrictions may be legally ignored. In this case, those wishes had a very explicit intention. However, those using the license today have had a general change of heart, and those with GPL interests have a general stance that no FOSS project will ever sue another FOSS project over license incompatibility. As such, the risk of lawsuit is really just one company suing another company on the technicality of incompatibility.

Naturally some organizations won't intentionally break copyright law just because no one will sue.

Neither license forbids mixing with other licenses. As long as the demands of both are met, they can apply to the same source code.

>If you are following the conditions of CDDL, you are violating the GPL. Basic binary logic.

The relationship between licenses can be transitive but not commutative.

As far as I know, the CDDL allows mixing with code under the GPL, but the GPL does not allow using code under the CDDL. CDDL copyright owners have no case; GPL copyright owners have one.

The question is: If I'm incorrect, what in CDDL prevents using with GPL?

If CDDL has no issue with GPL conditions, then follow the GPL and everything is fine.

CDDL has this text: "Any Covered Software that You distribute or otherwise make available in Executable form must also be made available in Source Code form and that Source Code form must be distributed only under the terms of this License"

So you take some CDDL code and some GPL code, and you put that whole new source code tree under the GPL in order to fulfill the GPL license condition. Are you then in compliance with the CDDL code? My conclusion is that you are not, as that would be in conflict with the above condition of the CDDL. The source code tree would not be "distributed only under the terms of this License".

CDDL is per-file.

How does this change this?

I take a CDDL licensed source code file. I take a GPL licensed source code file. I add inline the GPL licensed code to the CDDL licensed file, and release an executable form of the result. In order to comply with the GPL I then give out a single source code file under the GPL license terms with the code from the two files.

Is this in compliance with the CDDL terms and conditions?

This stackoverflow question has some good points (notably the accepted answer, and the bit about limitations: CDDL section 6.2, for example, revokes the CDDL in case of patent infringement, something that might be considered an "extra limitation" under the GPL (you're not allowed to add additional limitations to either the GPL or the CDDL). As such, the CDDL might be incompatible with the GPL up to and including v2, while GPLv3 might also be incompatible with the CDDL):



Also, the SO answer mentions consumer protection laws - but AFAIK they generally only apply to consumers - not businesses. So the GPL 0 clause might be void in many jurisdictions for individuals but still valid for businesses.

Oracle can relicense their codebase anytime they want. They _cannot_ be constrained by OpenZFS/Illumos unless they accept patches from them without copyright assignment.

What about just using LVM for snapshots? Considering that their default partition schemes include LVM, maybe that's what they bank on in most use cases?

I have one machine running XFS but if that one is representative then I won't be installing XFS anywhere else and would happily discourage others from using it. It is terribly slow when doing some fairly common operations when you have a large number of small files.

You're joking, surely.

XFS has outperformed EXT4 in almost all "high" use-cases in my experience and testing: large files (~500GB) or many small files (2,400,000 or so files at ~128k each). EXT4 under those loads is comically bad.

BTRFS is also terrible at this, only XFS and ZFS are good at handling it.

On the other hand on database workloads, for example PostgreSQL, XFS and EXT4 are about equal these days. ZFS (at least on Linux) and Btrfs are both clearly slower on those workloads.

Here is one benchmark, but I have seen plenty of similar benchmark results for PostgreSQL showing the same thing: https://blog.pgaddict.com/posts/postgresql-performance-on-ex...

There's a fairly simple reason for that, which is that ZFS (and btrfs to some extent) are almost literally "ACID" databases. They do a lot of the same double-writes and other safe behaviour the database is also doing. Those have a penalty, and you're doubled up.

There are various guides around for tuning ZFS and database servers to try to reduce that duplication. For example, you can disable the InnoDB doublewrite buffer because ZFS guarantees you don't need it. You also need to tune the recordsize to match the database page size so that you don't accidentally create large multi-page blocks.
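The usual recommendations look something like this (a sketch for a PostgreSQL dataset; "tank/pgdata" is a placeholder name, and the right values depend on your database and workload):

```shell
# Dedicated dataset for the database files:
zfs create tank/pgdata

# Match the record size to PostgreSQL's 8K page size so a page write
# doesn't turn into a read-modify-write of a large multi-page record:
zfs set recordsize=8K tank/pgdata

# Optionally cache only metadata in the ARC and let the database's own
# buffer cache handle data caching (workload-dependent):
zfs set primarycache=metadata tank/pgdata
```

For MySQL/InnoDB, the equivalent of the doublewrite advice above is setting innodb_doublewrite=0 in my.cnf, since ZFS's copy-on-write makes torn pages impossible.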

I partially disagree with the claim that ZFS is slower than ext4/xfs. It is, but only as long as you don't use ext4/xfs on top of LVM to get similar snapshot capabilities etc. Then ZFS starts to win.

So if you only need a plain filesystem, ext4/xfs are great and you will get better performance.

If you need/want snapshots, e.g. to do backups that way, it makes sense to look at ZFS.

I wish I was. It's pathetic. Creating a new directory entry on an idle machine with plenty of CPU and memory takes seconds, ditto deletions.

Also: I love how that comment sits at -4, as if downvoting it will somehow discredit the data point.

> I have one machine running XFS but if that one is representative

> Creating a new directory entry on an idle machine with plenty of CPU and memory takes seconds, ditto deletions.

I think your answer lies in your premise then. It's not representative.

Tens of millions of files should have been the ideal use case for XFS; that's why I installed it in the first place. This was for the 'reocities.com' project, and by the time I realized what the problem was, most of the import had already been done, so I let it run to completion. But it makes updating the project a real PITA.

There's so much that can go wrong setting up a Linux server that it's impossible to give much advice with something like this.

I guess the general stuff is: the easy default partitioning setup you get from a Linux distro is total bs, you need more RAM than you think you do, the way you're serving files or accessing the system (NFS!) has plenty of ways to screw things up as well, and tens or hundreds of millions of files is not any filesystem's ideal use case. The classic IRIX workload would be guaranteed-rate streaming of large media files, and the Linux port of the filesystem obviously inherited a lot of that system's traits (without the GRIO).

XFS has received some very serious performance improvements in the past couple of years to address indexing, large volumes of metadata, and so on, so that'd be one very relevant thing. Dave Chinner's talks are worth the time to watch if you're interested. You would be giving bad advice if you steered people one way or the other with regard to filesystems based on a seven-year-old project (unless you've refreshed that system much more recently, of course).

> XFS has received some very serious performance improvements in the past couple of years to address indexing, large volumes of metadata, and so on, so that'd be one very relevant thing.

That's probably the difference right there. Thanks for pointing that out.

Sure, but the issue could be configuration, drive, interface, etc. It's impossible to speculate in, but what we know is you have trouble with one machine, and it's the only one that has used XFS. It's unfortunate, but likely a coincidence, or at least unrelated to XFS at its core.

I've been using XFS for 10 years without the issues you seem to be having.

Your performance problem reminds me of this dentry cache performance failure https://sysdig.com/blog/container-isolation-gone-wrong/

You have to tune the parameters at filesystem creation time if you care about small file performance. It was designed for large files.

What parameters? This guide[1] only mentions two things in relation to number of files. The first is inode count, for which performance is binary. The other is files in a single directory, and it says that the default setting is fine for a million. There's no explanation for the performance jacquesm describes.

[1] https://access.redhat.com/documentation/en-US/Red_Hat_Enterp...

XFS is about as solid as one can get, so I would love to know why you would bash it. Here is a link to benchmarks for the file systems: http://www.phoronix.com/scan.php?page=article&item=linux-44-...

Considering the size of disks nowadays, the chance of bit rot is high. And (I don't have the original source) on SSDs, the probability of bit rot is higher still. So... ZFS and Btrfs have metadata as well as data checksumming. From what I've read, XFS may have metadata checksumming, but not on the data side of things.

I consider checksumming important. Do others? What is the solution? What other file systems offer that sort of capability?

Snapshotting is a second go-to function, particularly when it is integrated into the LXC container creation process. (There was a comment elsewhere here which said LXC is on its way out... huh? what?)

There are many different ways that storage can be layered, and depending on your use case, you can put various advanced features (snapshots, checksums/data integrity, encryption, etc.) in different places in the storage stack. You can put functionality in the block device layer (e.g., lvm, dm-thin, dm-verity), into the file system, into the cluster filesystem (if you have such a thing), or in at the application level.

Depending on the requirements of your use case different choices will make more sense. It's important to remember that RHEL is used for enterprise customers, and what might be common in the enterprise world might not be common for yours, and vice versa. Certainly, if you are using a cluster file system, it makes no sense to do checksum protections at the disk file system level, because you will be using some kind of erasure coding (e.g., Reed Solomon error correcting codes) to protect against node failure. This will also take care of bit flips.

If you are using cloud VM's, or if you are using Docker / Kubernetes, then LXC won't make sense. It all depends on your technology choices, and so it's important to look at the big picture, not just at the individual file system's features.

Given a stock (or additional packages?) RHEL 7.4 install on non-clustered storage, what would be the best combination to detect & correct bitrot at the filesystem and lower level?

Linux 4.12 introduced dm-integrity, which adds integrity checking at the block device level, so it will work with any file system:
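Setup is a sketch along these lines, using the integritysetup tool shipped with the cryptsetup project (the device name is a placeholder, and the format step wipes the device):

```shell
# Write integrity metadata to the device (destroys existing data):
integritysetup format /dev/sdX

# Map it; reads through the mapped device are checksum-verified:
integritysetup open /dev/sdX integr0

# Format the mapped device with any file system:
mkfs.xfs /dev/mapper/integr0
```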


One good thing about ZFS integrity checking is that when it finds an error it can repair the bit rot from another disk if you have parity or mirroring. Can dm-integrity do that?

dm-integrity will only operate on a single disk so no, not on its own.

It does however return an error if the integrity check fails, so if you put mdadm on top, mdadm can repair the erroneous block. I've tested this and am currently running it on a 32TB array.

Not so far, it seems. https://www.spinics.net/lists/dm-devel/msg31482.html

> this target do not provide error correction, only detection of error (such a tool could be written on top of dm-integrity though)

Or multiple copies of the data (copies=n property).
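That is, something like this (the dataset name is a placeholder):

```shell
# Store two copies of every block in this dataset, even on a single
# disk; doubles the space used by that data:
zfs set copies=2 tank/important
```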

Use mirror raid and have mdadm do a full disk compare/check every month (this is the default on Debian).

Additionally use smartmontools and configure it to do a short self test each night, and a long self test (i.e. full disk read) each week.

This will catch/flag errors early, which mdadm will then detect.
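Concretely, that amounts to something like the following (the md device and disk names are placeholders; Debian's mdadm package ships the monthly cron job):

```shell
# Manually trigger a full compare of all mirror members on md0:
echo check > /sys/block/md0/md/sync_action

# After the check completes, any inconsistencies show up here:
cat /sys/block/md0/md/mismatch_cnt

# Example smartd.conf line: short self-test nightly at 02:00,
# long self-test Saturdays at 03:00:
# /dev/sda -a -s (S/../.././02|L/../../6/03)
```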

Yes it can detect errors, but it can't continue to function correctly (read: return the correct data) because it doesn't know which copy of the differing data is damaged because it doesn't have checksums.

Moreover, if it doesn't always read both copies of the data (which it may well not, for performance reasons), then you have the possibility of silently propagating damaged data to all mirrors in the case that damaged data is returned to an application and the application then rewrites said data.

Compare that to a filesystem with checksums, which, in addition to being able to detect such a problem, could also continue to function completely correctly in the face of it.

Yep. "What happens if you read all the disks successfully but the redundancy doesn't agree?" is a great question.

Mirrors and RAID5: there's obviously no way that `md` software RAID can help, since it doesn't know which is correct. What about RAID6 though? Double parity means `md` would have enough information to determine which disk has provided incorrect data. Surely it does this, right?

Wrong. In the event of any parity mismatch, `md` assumes the data disks are correct and rewrites the parity to match. See "Scrubbing and Mismatches" section in `man 4 md`:


If you scrub a RAID 6 array with a disk that returns bad data, `md` helpfully overwrites your two disks of redundancy in order to agree with the one disk that's wrong. Array consistent, job done, data... eaten.

That's incredible! Thanks for the insight.

Any recommendations for detecting/correcting bitrot with RHEL 7.4 at the filesystem or lower levels?

Disk A and disk B both contain file SomeFile.

On disk B this file has rotted.

When reading the file SomeFile into memory, the read will be distributed among the disks (for performance reasons) (and it will probably need to span a multiple of the stripe size).

Ok, file is read into memory, including the bitrotted part from disk B. Now we write the file blocks back - as one does.

Voila! Both disks now contain the bitrot. And mdadm will not complain - disk A and B are identical for the area of file SomeFile.

Moreover, even if you don't read the file, and the bit rot is discovered during the monthly compare, at least on Linux the disk that is considered correct will be chosen at random. So you need at least three disks to have some semblance of protection. Have you guys seen many laptops that come with three or more drives?

Just use ZFS. Even on a single disk setup you will at least not get silent bit rot.

"Never go to sea with two chronometers; take one or three."

- adage cited in the Mythical Man Month

Or just do raidz2 in ZFS and call it a day.

Actually it's better to just do mirrors. Avoid RAIDZ at all costs if you care about performance and the ability to resilver in a reasonable amount of time.

Sure I agree. But nested mirrors still suffer from the same issue of losing a drive and you lose everything.

> But nested mirrors still suffer from the same issue of losing a drive and you lose everything.

Are you referring to mirroring a volume or dataset on a single disk? Why would you want to do that instead of mirroring among multiple drives?

how would you set up a large pool?

Two sets of, say, 5 disks mirrored in raidz1 would still fail if a disk in one set failed and a disk in the other set failed. I guess you could do a striped setup of 5 sets of 2 disks in mirrors. Still, it seems wicked risky to me. I do agree, though, that mirroring has been the best for speed, but a lot of that changes with nicer SSDs, especially NVMe ones.

I was curious about what a "nested" mirror is really. What exactly is nested?

I'd setup a large pool with mirror vdevs, i.e. n sets of 2 disks per mirror.

My half-remembered reasoning was that backups manage the risk you'll lose data. But replacing a disk in a mirror vdev is much easier, and faster, than doing so with RAIDZ.

The risk of RAIDZ is that resilvering impacts multiple vdevs, is much more intensive than a simple mirror resilvering, and thus the probability that additional drives will fail is much higher.
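A pool built that way looks like this (disk names are placeholders):

```shell
# Three two-way mirror vdevs; ZFS stripes writes across the vdevs:
zpool create tank \
  mirror /dev/sda /dev/sdb \
  mirror /dev/sdc /dev/sdd \
  mirror /dev/sde /dev/sdf

# Replacing a failed disk resilvers only from its mirror partner,
# rather than touching every vdev the way a RAIDZ rebuild does:
zpool replace tank /dev/sdc /dev/sdg
```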

Here's a blog post that I definitely read the last time I was reading up on this:

- [ZFS: You should use mirror vdevs, not RAIDZ. – JRS Systems: the blog](http://jrs-s.net/2015/02/06/zfs-you-should-use-mirror-vdevs-...)

A Reddit post about that blog post in my other reply:

- [You should use mirror vdevs, not RAIDZ. : DataHoarder](https://www.reddit.com/r/DataHoarder/comments/2v0quc/you_sho...)

I wonder if resilvering is still an issue with SSDs. But I cede your point: nested vdevs of two disks making mirrors make sense. It still doesn't sit well, but it makes sense.

According to OpenZFS's changelog for v0.7, resilvering is smarter now: https://github.com/zfsonlinux/zfs/releases/tag/zfs-0.7.0

> on SSD, bit rots probability is higher still

Do you have a source for this? So far I believed that bit-rot rates are pretty similar.

A google search turns up a number of sources. Another claim that could use justification is whether SSDs bit-rot more over the long term than spinning disks do. I heard that somewhere as well.

Using google was the first thing I did. I couldn't find something substantial.

Well, I used btrfs as a root filesystem for quite a while, until I realized it was pig slow for sync(); it would take AGES to do an apt-get upgrade, for example. I ended up having to run some tasks under 'eatmydata' [0] to make it all better, risking filesystem corruption in trade for speed. Also, at the time, there was no functioning fsck.

So I moved back safely to ext4 and never looked back!

[0]: https://www.flamingspork.com/projects/libeatmydata/

Over the recent years on every new laptop install i switched between filesystems, so i had ext3/4, btrfs and (currently) xfs on my system. I have to say, btrfs had the most glitches (a few years back), although it worked ok'ish (no data loss).

Nowadays, i must say that i very much prefer a stable filesystem with as little complicated logic as possible. I actually never use snapshots or subtrees! I never put another disk in my laptop (where would that go?!), so i don't need to do dynamic resizing (while online, of course!). All this makes the filesystem a lot more complex than it has to be. I've also run into problems using ZFS on Solaris some years back which took ~2 weeks of dtracing to figure out what the hell was going on. Of course it was related to CoW.

My lessons learned: Check your requirements. Will you really need and use subtrees/snapshots/XYZ on your system? Will you really need to do online-resizing? If not, just use a stable, simple filesystem. There are perfect usecases for ZFS or btrfs. But not everyone needs the advanced features.

I'm using ZFS on my FreeBSD laptop. Snapshots not only make backups safer (by making sure the complete backup is taken at the same time, and by zfs sending and receiving them), the boot environments feature also make upgrading safer.

I also really like that I don't need to partition my disk, if it turns out that /tmp needs > 10% of the disk for whatever reason: no problem!

And as I like my data, I appreciate checksumming and copy-on-write.

I haven't noticed any bad slowdowns compared to ext4 on my Debian laptop I used before.

"I'm using ZFS on my FreeBSD laptop. Snapshots not only make backups safer..."

Did you know you can 'zfs send' snapshots to rsync.net ?[1][2]

[1] https://arstechnica.com/information-technology/2015/12/rsync...

[2] http://www.rsync.net/products/zfsintro.html
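The send/receive workflow is roughly this (host, pool, and dataset names are placeholders):

```shell
# Take today's snapshot:
zfs snapshot tank/home@2017-08-02

# Send only the blocks changed since the previous snapshot; the backup
# host receives them into its own pool:
zfs send -i tank/home@2017-08-01 tank/home@2017-08-02 | \
  ssh user@backup.example.com zfs receive backuppool/home
```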

> Will you really need and use subtrees/snapshots/XYZ on your system?

It's a valid question, but not the best one. Almost nobody needs snapshots. But they make things easier. You most likely don't need a journaled fs in your laptop either (battery level notification should take care of the issues). But it does make life better.

"Need" is not the threshold I'm interested in. Most features, I'd like. One feature I think I do need most is scrubbing, which is still absent from most filesystems :(

"Need" as in "will you use it?". I played around with snapshots once and never really used them. So i clearly don't have a need for them on my laptop. Journaling on the other hand helps data safety a lot and i think it's not overly complex. I've had data loss happing in the past before journaling, but never again since then. So, wouldn't i need CoW for even better "data safety"? Maybe, but since i've never experienced data loss for so many years, i don't feel like the added complexity is worth it. On my laptop, for my usecase.

But that's only me. Your experience may differ very much :)

> "Need" as in "will you use it?".

Well, many of us have experienced a botched system package upgrade or two. If the file system supports snapshots, then the package manager could automatically ensure fully atomic package upgrades.

That should be reason enough, I should think.

Re: the data loss issue: Yes, I've actually had XFS completely throw away a file system upon a hard power-off + boot-up cycle. (This was ages ago; I'm sure it's improved heaps since then.)

Good point. I'm using Debian as my Desktop for many years. I don't remember a "botched" system package upgrade in the last 5 years, but i've probably learned over years how to handle dpkg/apt.

The atomic updating is a very interesting topic and the reason why i find ostree/guix/nixos very appealing. Note that neither ostree nor guix or nixos make use of filesystem snapshots, afaik. OSTree even documents why it won't use filesystem snapshots: https://ostree.readthedocs.io/en/latest/manual/related-proje... Debians dpkg does not use snapshots as well.

So it's definitely a nice-to-have, but not something i need, because i can handle dpkg/apt much better than i could handle filesystem internals.

That's sort of the point: I don't want the complexity in the filesystem, but i am fine with it in userspace. I can use snapshots on filesystem level. Or i can use other backup tools in userspace. While it's certainly neat that the filesystem can do that, i'm perfectly fine with handling backups on another level.

Another example: It's certainly neat that there are a bunch of distributed filesystems (which by the way have A LOT of complexity and often can't handle all workloads you would expect from a filesystem). But i'd rather use either an S3-like network storage or build a system that scales well without relying on Ceph/Gluster/Quobyte/etc.

For example, in a hypothetical distributed system i'd rather use Cassandra and distribute data over commodity hardware than use Ceph. I'd rather handle problems with data persistence/replication at the Cassandra level than debug at the file system level. Especially since, when Cassandra has a problem, i'll most likely still be able to access all data at least at the filesystem level. When my filesystem is borked, i'm in a much worse situation.

> So, it's a definitely a nice-to-have, but not something i need, because i can handle dpkg/apt much better then i could handle filesystem internals.

I don't think you're seeing my point. You wouldn't have do anything -- it would all be done automatically as long as your file system supports snapshots.

BTW, to your "I know how to use dpkg/apt": It's not about knowledge. I could well be said to be at an "advanced" level of expertise in system maintenance, but "system upgrade" fuckups had nothing to do with me and everything to do with bad packaging and/or weird circumstances, such as a dist-upgrade failing midway through because some idiot cut a cable somewhere in my neighborhood.

While Nix and the like are nice and all, they're currently suffering from a distinct lack of manpower relative to the major distributions. They also don't quite fully solve the "atomic update" problem, but that's a tangent. Then, OTOH, some of them have other advantages, such as the ease of maintaining your full system config in e.g. Git. Swings and roundabouts on that front. FS support for snapshots would help everybody.

You're absolutely right. Still, dpkg doesn't support snapshots out of the box. I could fiddle around with it, and i suppose i could make a snapshot before running "apt upgrade", but since that has never failed for me, i'd be touching something very stable for little apparent benefit. Let's say Debian 10 will support btrfs snapshots on updates: i'll consider using btrfs for the next installation, but not before.

Did you read the link from the ostree people? Let's pretend Debian 10 offers to choose between OStree-like updates and btrfs snapshots: I'd probably choose OStree and stick to ext4/xfs.

> Still, dpkg doesn't support snapshots out of the box.

Yes, but it SHOULD, just because ALL REASONABLE FILE SYSTEMS SHOULD SUPPORT SNAPSHOTS. Therefore dpkg should assume that such support is available, or at the very least take advantage of it when available.

Just to reiterate: You (impersonal!), the "ignorant user", shouldn't have to even have to think about it.

Does this make my point clear?

(I'm only being this obtuse because you're saying "you're absolutely right", but apparently not seeing my point. I'm assuming it's some form of miscommunication, but it's difficult to tell.)

EDIT: Hehe, I'm sorry, that sounded much more aggressive than I intended. I just think that us software developers could and should(!) do much better by our users than we(!) currently do. My excuse is that most of my stuff is web-only, so at least I can't do the accidental equivalent of "rm -rf /", but...

Still, i think that my filesystem shouldn't accumulate that much complexity. Maybe a layered approach similar to Red Hat's Stratis is a better way.

> If the file system supports snapshots, then the package manager could automatically ensure fully atomic package upgrades.

That's exactly what openSUSE / SLE do with snapper. Every upgrade or package install with YaST/zypper creates two snapshots (before/after) and you can easily rollback to an older snapshot (even doing so from GRUB). This has been enabled by default for years.
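For example (snapshot numbers are illustrative):

```shell
# Show the before/after snapshot pairs zypper created:
snapper list

# Compare a pre/post pair to see what a package operation changed:
snapper status 41..42

# Roll the system back to snapshot 41 (also selectable from GRUB):
snapper rollback 41
```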

I have a wrapper script for apt-get to do that on several machines that run Debian unstable.
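A minimal sketch of such a wrapper, here assuming a ZFS root dataset named "rpool/ROOT" (the poster's actual script may differ; a btrfs variant would use `btrfs subvolume snapshot` instead):

```shell
#!/bin/sh
set -e
# Snapshot the root dataset before letting apt-get touch anything:
snap="rpool/ROOT@apt-$(date +%Y%m%d-%H%M%S)"
zfs snapshot "$snap"
echo "snapshot $snap taken; roll back with: zfs rollback $snap"
# Hand all arguments through to the real apt-get:
exec apt-get "$@"
```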

Another use for snapshots is backups. I love `zfs send` - it makes backup braindead-simple.

If you're on an SSD/MTD/NVMe, you have TRIM, and scrubbing is a no-op no matter what approach you try. You need a spinning HDD for scrubbing to be useful.

Here is one way to do simple, secure scrubbing on Linux without any intrusive system changes. It is mildly restrictive, but works.

First, you need a small, dedicated partition, but it only needs to be around 16MB or so. Resizing an existing partition down (resize2fs can shrink an ext4 filesystem, though not while it's mounted, and you'll probably still need to reboot to reload the partition table once you've resized that too) will give you a bit of space.

Now you have a small area of the disk that occupies a known range of sectors, and because you have no TRIM, writes to this area will be properly deterministic. Good.

Create and mount a new filesystem without a journal on the new partition. ext2 could work here (:D), you could `mkfs.ext{3,4} -O ^has_journal`, or you could use filesystem defaults and simply overwrite the entire partition with /dev/urandom later.

Make a sparse file with fallocate (make sure the file system you create the file on can handle sparse files) that is big enough to handle the biggest file.

Create a LUKS volume with a detached header inside the new sparse file, and store the detached header metadata into a file in the new journal-less partition.

Create an ordinary filesystem inside the LUKS volume.

Now you have a Rube Goldberg sparse file. You've moved the deterministic-writing/journal-less stage into a tiny key, which is a lot easier to manage than a whole gigabytes+-large partition.

As an alternative you could drop the key onto a flash drive, and nuke the flash drive when you wanted to kill the data. That's kind of wasteful though (and it carries the same flash-drive-quality risks as copying the only copy of the data itself onto the flash drive).

LUKS was designed such that if you lose the key(s) or the detached header, all that's left is statistically random garbage.

You seem to be using a different definition of scrubbing than people talking about filesystems usually use.

Scrubbing means to read all the data off a filesystem and compare it against its checksums, so that you are confident nothing has happened to the data (hardware failures, cosmic rays, whatever).

ZFS and btrfs have specific scrub commands that do that.

There's no scrubbing available for a system which does not keep some form of checksum/crc/hash of the data.

I think that you are talking about secure delete procedures.

OH. You're right. I got the terminology confused with, uh... shredding. Heh.

I actually tried to delete this comment for unrelated reasons shortly after posting it, but was unable to. Now I feel doubly stupid.

It's a good practice to use LVM there between disk partitions and volumes. It has negligible performance implications but makes things very flexible when you need to resize volumes or add space. You also gain reliable snapshotting from device mapper, although that does have some performance effects.

My requirements: transparent compression. I'm left with: btrfs or, if possible, ZFS.

It's ok to play with the OS in order to learn.

We are using XFS for most of our production workloads; it turned out to be an excellent choice for most data-heavy use cases. Btrfs was never an option: it is a bad idea to gamble with beta technology for data storage that a production system relies on. ext4 vs XFS is a much more interesting argument, but I haven't had time to follow up on this.

We use XFS for sparql.uniprot.org (basically a columnar database with a semantic graph), where we recently retested it by accident. We use 2×4 TB consumer SSDs (OK, rich consumer ;)). With XFS we see 10-13% faster linear writes of one big file (1.3TB) and about 20% more reads serviced per minute than with EXT4. EXT4 had been selected by accident instead of XFS on one of the two otherwise equal machines when upgrading them with the new 4TB SSDs instead of the 1TB ones they had before.

In general I feel that the more you go into "enterprise" storage levels, the more XFS pulls ahead from the EXT family. i.e. laptops and small servers are not where the difference lies.

apt-get upgrade syncs so often that 'eatmydata' gives a noticeable speedup pretty much everywhere (I got into the habit of using it for ext3/4)
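For context: eatmydata is an LD_PRELOAD wrapper (libeatmydata) that turns fsync() and friends into no-ops, so dpkg's frequent syncs stop stalling on the disk, at the cost of crash safety. Shown with --simulate so the sketch is harmless to run:

```shell
# Guarded no-op if eatmydata isn't installed; drop --simulate for a real upgrade.
if command -v eatmydata >/dev/null 2>&1; then
    eatmydata apt-get --simulate upgrade
fi
```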

I don't get why apt syncs so often - isn't the main point of journaling file systems their ability to recover after a crash or power loss? If so, why should you need to sync more than once every ten seconds or so?

apt doesn't assume that you have a reliable filesystem. It assumes that you might crash at any moment, and it would be really important for you to have a consistent view of what packages are installed when you reboot.

But ext3 and more advanced filesystems have been around for almost twenty years now... it seems an odd assumption that your filesystem is unreliable on any machine that isn't completely ancient (is anyone still using ext2, for instance?)

It's not about the filesystem being "unreliable". It's about having the package manager's state checkpointed so that it can recover and resume if there is e.g. power loss or any other form of interruption at any point during package installation, upgrade, removal etc. This means having all of the updated files synched on disc plus the database state which describes it.

When you move to a more advanced setup such as ZFS clones, you could do the full upgrade with a cloned snapshot, and swap it with the original once the changes were complete. This would avoid the need for all intermediate syncs--if there's a problem, you can simply restart from the starting point and throw all the intermediate state away.
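A sketch of that clone-based flow with ZFS (dataset names are hypothetical, and the block is guarded so it's a no-op without ZFS):

```shell
if command -v zfs >/dev/null 2>&1; then
    zfs snapshot rpool/ROOT@pre-upgrade
    zfs clone rpool/ROOT@pre-upgrade rpool/ROOT-upgrade
    # ... run the package upgrade against the clone (chroot, container, etc.) ...
    zfs promote rpool/ROOT-upgrade   # on success, the clone becomes the head of history
    # on failure, just destroy the clone and you're back at the starting point
fi
```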

Debian calls itself "the universal operating system" and officially supports not only multiple init systems, but multiple kernels (kfreebsd); somehow I don't see it relying on specific filesystems.

Though true I have literally never had my system get corrupt or inconsistent during failed dpkg/apt from power loss, hang, filesystem going ro, etc. It's very reliable.

I've had older rpm & yum/dnf failures multiple times leaving me in weird inconsistent states from crashes or power losses etc. not conclusive but anecdotal experience - It's also possible it's been improved.

Meanwhile you can disable the file syncing with the apt preference dpkg::unsafe-io (google will be required for the exact syntax and file in /etc/apt - fairly sure you can cmdline it also)
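If memory serves, the dpkg option is spelled force-unsafe-io, and one way to persist it is a dpkg config drop-in (the filename here is arbitrary):

```
# /etc/dpkg/dpkg.cfg.d/unsafe-io   (hypothetical filename)
force-unsafe-io
```

For a one-off run it can also go on the command line as `apt-get -o Dpkg::Options::="--force-unsafe-io" upgrade`.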

I think this is a political move disguised as technical move

Oracle pays the developers of btrfs [0].

Red Hat hates the guts of Oracle, since Oracle released Oracle Linux, which is a clone of Red Hat Enterprise Linux (based on CentOS).

So, Red Hat wants to cripple btrfs and hurt Oracle.

However, btrfs is my favorite FS; I've been using it on my home computer and backup drives for at least 6 years, since before it was included in the kernel. I love the subvolumes, snapshots, and compression, and have never had issues with it.

[0] https://oss.oracle.com/~mason/

[Update] Chris Mason has not been at Oracle since 2012.

I think this is a political move disguised as technical move

I think there are solid technical reasons to discourage Btrfs use, just to quote from the official wiki [0]:

> The parity RAID code has multiple serious data-loss bugs in it. It should not be used for anything other than testing purposes.

Now I don't know if this issue has been addressed already, or which kernels are affected, but the fact that there is a prominent warning on the wiki speaks for itself.

Personally, I'm a happy btrfs user deploying a mixed-disk-size array without parity, with the hope to add redundancy some time in the future. Currently, btrfs is the only FS allowing to mix disks of any size and to run an optimal configuration on top of them [1].

[0] https://btrfs.wiki.kernel.org/index.php/RAID56

[1] http://carfax.org.uk/btrfs-usage/

The particular bug that sparked that warning was fixed a while ago, but as a precaution against "btrfs ate my data" stories they've removed the ability to create btrfs-raid from the CLI tools (you can still use md RAID with btrfs but you lose most of the benefits of btrfs that way).

Who's they? Upstream have not removed Btrfs raid creation capability in btrfs-progs, I'm not aware of any distro that has patched it this way.

Oh, I must've misunderstood this mail[1] and thought they had actually gone through with #ifdef-ing out the raid56 creation code in btrfs-progs.

My bad.

[1]: https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg...

Which benefits are those? Both synology and qnap have the ability to detect and correct bitrot doing btrfs on top of mdraid.

The design of btrfs allows for mismatched disks to be used in an array, and the btrfs RAID will keep the right level of redundancy while using the maximal amount of space, e.g. with a 4 TB, 2 TB, and 1 TB drive:

mdadm will give you 1 TB in RAID 1, or 1.5 TB in RAID 10 (constrained by the smallest drive).

btrfs will give you 3 TB in RAID 1 (constrained by the sum of the smallest drives).
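The btrfs RAID 1 number falls out of a simple back-of-the-envelope formula (my own sketch, not official btrfs math): every chunk needs a mirror on a *different* device, so usable space is capped both by half the total and by what the other drives can mirror against the largest one.

```shell
# usable = min(total/2, total - largest); sizes in whole TB for this example.
disks="4 2 1"
total=0; max=0
for d in $disks; do
    total=$((total + d))
    if [ "$d" -gt "$max" ]; then max=$d; fi
done
rest=$((total - max))    # what the remaining drives can mirror against the largest
half=$((total / 2))      # integer TB is fine here
if [ "$half" -lt "$rest" ]; then usable=$half; else usable=$rest; fi
echo "btrfs raid1 usable: ${usable} TB"   # 3 TB for 4+2+1
```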

btrfs also allows per-subvolume raid policies. So you could, for example, give users an "archive" subvolume in their home directory. You could then mark this as RAID 1 or RAID 5 (because you don't care so much about performance) while the main /home filesystem is RAID 10.

Unfortunately the RAID code is all horribly broken.

Being able to have non-symmetric disk topologies with redundancy. I believe that md raid does not support that, while btrfs multi-device does (which is what I think of as one of the really unique features of btrfs -- not even ZFS can handle the sort of disk topologies that btrfs can).

md-raid absolutely supports that, and Synology has for a long time. They call it "SHR". You simply do RAID over disk partitions to enable disks of disparate sizes.

The reason ZFS doesn't support it, and absolutely 0 enterprise storage devices support this is because as the disks fill up, you sacrifice both performance and redundancy. Synology won't even support it on their high-end devices for this very reason. They'll only do it on their devices targeted at home use.

Even when it gets resolved, it's far from being the only major problem with Btrfs. It's merely the current high profile one.

Fixed in 4.12.

Chris Mason, the principal Btrfs author, left Oracle over 5 years ago:


"In June 2012, Chris Mason left Oracle for Fusion-io, which he left a year later with Josef Bacik to join Facebook; while at both companies, Mason continued his work on Btrfs."

> However, btrfs is my favorite FS, been using it on my home computer and backup drives for at least 6 years, before it was included in the kernel, love the subvolumes, snapshots, and compression; never had issues with it .

Slightly off topic. I chose btrfs as my main filesystem recently on a system running Ubuntu/Xubuntu. I have done some research on backing up (with the advantage of snapshots) but it looks like there aren't (m)any graphical tools (this gets a little more confusing with /@ and /@home subvolumes on the same partition being treated separately for snapshots, AFAIK).

Do you manage it all from the command line and/or do you have any suggestions for graphical tools to do "as-is clones of entire partitions" (and also incremental backups) to local external drives (not over the network)? Or if you could point to any great documentation or blog posts on this topic, that'd be helpful too (I have read some bits of the btrfs wiki and the btrfs parts in the Arch wiki).

Currently I'm doing a plain rsync using Grsync, and not really taking advantage of btrfs features like snapshots.

The main reason I'm looking at avoiding the command line is to make it easier for others around me to use it.

My one-liner for snapshots:

btrfs subvolume snapshot /source/drive/folder/ /source/drive/folder/.snapshots/snapshot-`date +"%Y-%m-%d-at-%I-%M%P"`

This will create a snapshot with the date and time attached to the snapshot name.

You can find more info here.


Make sure to delete old snapshots, otherwise you'll run out of disk space and not know where it went.
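A hedged pruning sketch to go with that (the path is the hypothetical one from above; note that lexical sorting only matches chronological order if the name's timestamp sorts correctly - the 12-hour %I/%P stamp above doesn't across noon, so %H-%M is safer):

```shell
snapdir=/source/drive/folder/.snapshots   # hypothetical path from the snippet above
keep=10                                   # how many newest snapshots to retain
if [ -d "$snapdir" ]; then
    ls -d "$snapdir"/snapshot-* 2>/dev/null | sort | head -n -"$keep" |   # GNU head
    while read -r snap; do
        btrfs subvolume delete "$snap"
    done
fi
```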

performance during snapshot creation/deletes (presumably mostly the deletes) is one of the reasons I personally stopped using btrfs on my desktop. Now using ZFS root (with Ubuntu devel).

I had auto hourly snapshots and sometimes when it deleted one my entire system would hang for a few seconds and occasionally 10s of seconds.

Having said that I do suspect that might be partially related to also using ecryptfs on top, but still.

If anything, the political move would've been Oracle's sponsorship of btrfs in the first place. They want to push people to enterprise OS/storage systems, so they've told everyone "Oh yeah, btrfs is coming soon, it'll be great" ... and it isn't great. It sucks.

I've finally broken and installed ZOL (ZFS on Linux) after trying btrfs repeatedly over the last three years. ZFS is already a breath of fresh air and I've only been using it a couple of months. For whatever reason, btrfs came together as a messy hodge-podge, and it shows in bad performance for many use cases (e.g. "omg I forgot nodatacow"), buggy implementations, difficult user interfaces, kernel bugs, etc.

btrfs needs a reboot (I hear bcache? is trying). Meanwhile, everyone should stop getting hung up on the arcane licensing details and just use ZFS directly. It can't be distributed as part of the kernel, but that's why we have distributions, isn't it? They bundle all that crap together for us. There shouldn't even be the normal OSS infighting because this isn't a proprietary blob or something, it's just using a license that's GPL-incompatible.

The best thing Linus could do for the community at large would be to fork and start committing to ZOL, giving it a tacit endorsement.

Chris Mason left Oracle for Fusion-io back in 2012 and then from there to Facebook in 2013

Oracle still has a couple developers on their payroll that do BtrFS development:

* Liu Bo
* Anand Jain

> I think this is a political move disguised as technical move

You literally have no idea what you're talking about, and I doubt you've used btrfs seriously, or you wouldn't talk this shit. The fact it's been upvoted so heavily just shows what absolute technically-false nonsense will draw support at HN.

There was no reason given for the deprecation of BTRFS.

Has anyone here tried bcachefs (http://bcachefs.org/) for some of the same use cases as btrfs? What do people think of its current state?

You'd need to read up on it, but the TL;DR I got from previous comments was that it was previously funded by a company that wasn't 100% aware of the entire situation around it: they used it in their commercial product, but management didn't fully understand it was being released as open source (though I also understand they didn't own the rights to the entire code base, so releasing it was effectively necessary). The person developing it left that company and is seeking his own funding for it, but development was significantly stunted as a result.

It seems a bit here-say-ish though, so please don't assume I'm entirely correct on all fronts there and I'd encourage you to research it further!

>I'd encourage you to research it further!

On that note, their Patreon is a better primer than the website is: https://www.patreon.com/bcachefs

Happy to see his Patreon is looking "healthier". I mean $1500/m isn't really that much but I'm sure last I looked it was much lower.

In other news: consider supporting the people that support you! Personally, I spend over $100/month on Patreon. Most of those are creators rather than open source people, but there are a couple of open source ones, such as Ondřej Surý, who works on PHP packaging in Debian/Ubuntu.

Joey Hess, a formerly prominent Debian developer and author of git-annex, etckeeper and a bunch of other open source projects is also on Patreon: https://www.patreon.com/joeyh

Unfortunately, he's only getting $500/m, which isn't much even for someone with a very "off-grid" life: https://joeyh.name/blog/entry/notes_for_a_caretaker/

Thanks for the link! I've added him to my patreon-roll (what do you call that?)

Discoverability is a real problem for Patreon, outside of the "most successful"

It's $1050, not $1500. I'm interested in two projects on Patreon (bcachefs and Matrix) and both projects are not getting nearly enough to be self-funded, so this raises the question if this model even works for Open Source. So far any advanced technology seems to be funded by some big corporation and it's not very good. But, I guess, users just don't care about good inner workings, they care about things they personally enjoy (cartoons, etc), so funding those inner workings is still an open question.

I think it's an issue with the marketing of it as well. I recently noticed that some of the most important open source projects I use have a Patreon page they depend on, yet don't show it on GitHub, only on their home page under "donate". Same as with Kickstarter: if you want funds, you have to properly and visibly ask for it!

I donated a lot to git-annex because he did it right, showing everywhere that you could sponsor the development.

Oh, they knew it was being released as open source. There was a clear boundary between the open source and the proprietary code - I worked on both. The open source thing was just a convenient excuse for some political bullshit - either that or they thought they could buy out a GPL project and take it proprietary, which is just insane. I mean, half the copyright was Google.

OT, but I think you meant "hearsay-ish"

I've looked into BCacheFS extensively. Two problems I have -- First, it's not there in features yet. It doesn't have quotas, nor snapshots.

Second, it lacks a formally published design for these features. Because the author is the only person with knowledge of how these things might work, it makes it really difficult to mitigate the bus factor while the project is still in heavy development.

Looks neat. Nobody uses it though and LKML has zero discussion on the announcements.

However, HN has a previous thread on it here: https://news.ycombinator.com/item?id=12410798

I actually use bcachefs on my main data drive, and so far it has been running solid for ~3 months. That said, it looks like the author has gone camping/moving/re-evaluating life. Hoping the development picks back up, or that it gets rebased onto 4.12 (or maybe made into a set of kernel patches?). I wanted a CoW fs - I tried btrfs and kept running into issues where the drive would fall into ro mode; so far I'm not regretting bcachefs.


Unsurprising. Red Hat has not hired upstream Btrfs developers for years, where SUSE has hired bunches. Meanwhile Red Hat has upstream ext4, XFS and LVM developers.

If you're going to support a code base for ~10 years, you're going to need upstream people to support it. And realistically Red Hat's comfortable putting their eggs all in the device-mapper, LVM, and XFS basket.

But, there's more: https://github.com/stratis-storage/stratisd

"Btrfs has no licensing issues, but after many years of work it still has significant technical issues that may never be resolved." (page 4)

"Stratis version 3.0: Rough ZFS feature parity. New DM features needed." (page 22) https://stratis-storage.github.io/StratisSoftwareDesign.pdf

Both of those are unqualified statements, so fair or unfair my inclination is to take the project with a grain of salt.

As for the significant technical issues, one thing is the core decision to make it a CoW system, which has fundamental performance issues with many workloads that are exactly those used in the server space. You can disable CoW, but you lose many reasons to use btrfs in the first place if you do.

When I gave up on it there were also fundamental issues with metadata vs data balancing, not-really-working RAID support, and so on...

I find the suggestion that the technical issues are caused by the CoW design a bit strange.

Sure, making the filesystem CoW-based means there are some inherent costs, but it allows the filesystem to implement some interesting features (e.g. snapshots) in a more efficient way. For example if you want to do snapshots with ext4/xfs, you'll probably do that using LVM (which you can see as turning the stack into a CoW). In my experience the performance impact of creating a snapshot on ext4/LVM is about 50%, so you cut the performance in half. While on ZFS the impact is mostly negligible, due to the filesystem is designed as CoW in the first place.

And thanks to ZFS we know that it's possible to implement a CoW filesystem that provides extremely stable and balanced performance. I've done a number of database-related tests (which is the workload that I do care about) and it did ~70-80% TPS compared to ext4/xfs (without snapshots). And once you create a snapshot on ext4/xfs, the performance tanks, while ZFS works just like before, thanks to the CoW design.

Unfortunately, BTRFS so far hasn't reached this level of maturity and stable performance (at least not in the workloads that I personally care about). But that has nothing to do with the filesystem being CoW, except perhaps that CoW maybe makes the design more complicated.

Didn't one of your benchmarks show that nodatacow on Btrfs resulted in a major performance improvement? But that might just show an issue with Btrfs's CoW implementation rather than CoW in general.

Yes, I've done some tests on BTRFS with nodatacow, and it improved the performance and behavior in general. Still slower than XFS/EXT4, but better than ZFS (with "full" CoW).

But as you mention, that does not say anything about CoW filesystems in general. It merely hints that the BTRFS implementation is not really optimized.

FWIW while I do a lot of benchmarks (both out of curiosity and as part of my job, when evaluating customer systems), I've learned to value stability and predictability over performance. That is, if the system is 20% slower, but provides stable and predictable behavior, it's probably OK. If you really need the extra 20% you can probably get that by adding a bit more hardware, and it's cheaper than switching filesystems etc. (Sure, if you have more such systems, that changes the formula.)

With EXT4/XFS/ZFS you can get that - predictable, stable performance. With BTRFS not so much, unfortunately.

>it allows the filesystem to implement some interesting features (e.g. snapshots) in a more efficient way.

Interesting features are worthless when reading and writing data is prohibitively slow. Or when there are documented cases where updating a file in random-access manner can cause its storage requirement to balloon to blocks^2.

There's a write magnification effect when using CoW. The ZIL helps with this because the ZIL itself is not CoW'ed, and it allows deferring writes, which allows more transactions to share interior metadata blocks, thus reducing the write magnification multiplier. I don't get where you get O(N^2) from.

As to snapshots, who cares, they cost nothing to create and they do not slow down writes -- they only slow down things like zfs send (linearly) and they cost storage over time, but not much more.

You are confusing storage requirements with write amplification (which is another downside). They're totally different.

Are you suggesting that's a problem with CoW in general, or with BTRFS implementation specifically?

I would say ZFS works extremely well (at least for the workloads I care about, i.e. PostgreSQL databases, both OLTP and OLAP). I know about companies that actually migrated to FreeBSD to benefit from this, back when "ZFS on Linux" was not as good as it's today.

Unconvincing. LVM's snapshots are CoW whether thick or thinly provisioned. And while not yet merged in mainline, XFS devs are working on CoW as well which is used when modifying reflinked files (shared extents).

Btrfs behaves basically like that with 'nodatacow' today. It will overwrite extents if there's no reflink/snapshot. If there is, CoW happens for new writes and any subsequent modifications are overwrites until there's a reflink/snapshot in which case CoW happens.

The 'nodatacow' flag can be used as either a mount option, or selectively with an xattr per subvolume, directory, or file. And in all cases, metadata writes (the file system itself) are still CoW.
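For reference, the per-directory/per-file form is the C file attribute via chattr (a sketch; on btrfs it only takes effect on empty files, so it's usually set on a directory and inherited by files created there):

```shell
d=$(mktemp -d)                 # scratch dir; on a non-btrfs /tmp this is a no-op
if chattr +C "$d" 2>/dev/null; then
    lsattr -d "$d"             # a 'C' in the flags = nodatacow for new files here
fi
```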

This is specific to RHEL7, notably that they won't backport any further kernel updates and won't move it from Technology Preview to release. Red Hat wasn't really driving btrfs development at all from what I am aware of.

⁠Btrfs has been deprecated

The Btrfs file system has been in Technology Preview state since the initial release of Red Hat Enterprise Linux 6. Red Hat will not be moving Btrfs to a fully supported feature and it will be removed in a future major release of Red Hat Enterprise Linux.

The Btrfs file system did receive numerous updates from the upstream in Red Hat Enterprise Linux 7.4 and will remain available in the Red Hat Enterprise Linux 7 series. However, this is the last planned update to this feature.

Red Hat will continue to invest in future technologies to address the use cases of our customers, specifically those related to snapshots, compression, NVRAM, and ease of use. We encourage feedback through your Red Hat representative on features and requirements you have for file systems and storage technology.

More importantly Red Hat has deprecated FCoE in RHEL, which is big news, because at a previous $JOB they went all in on FCoE because it was supposed to be the future.

FCoE died a death before 2013.

Expensive hardware, with little gain sadly. It was a nice idea, however at its very core is a fairly large problem: converged network adaptors are problematic.

Unless you have lots of bandwidth in said adaptor (i.e. 56-gigabit InfiniBand) you are going to get contention between network and disk IO.

I agree, not only that, but at least with Cisco gear FCoE was a real pain in the behind to manage and configure. So much duplication in configuration and settings across a lot of gear, and it never was as smooth as they made it out to be.

We've been big Cisco customers for years, and ever since I saw FCoE I thought it was a disaster. All of the DCB extensions that had to go into Ethernet to get it to work were such an ugly mess. It was just too complicated compared to alternatives, and iSCSI got a free ride on all that work (Ethernet pause, flow control, etc.) and was far simpler to implement. And of course, with lots of 10Gb options with iSCSI offload, it was getting harder and harder to find any advantage LARGE ENOUGH in FCoE to justify its cost and complexity.

Completely agreed, but it is something that Cisco was still pushing fairly heavily alongside their partners. Various different vendors bought into it, and it was deployed heavily in various different MSP's.

The complexity of FCoE is staggering, and the configuration required across all the different moving pieces to make it a success made things even more difficult!


What's the replacement then? ISCSI, or going back to FC? Or is everything cloud something these days? :)

NVMe over fabrics, I guess; the general idea being to run NVMe over FC, RoCE, or InfiniBand.

Given the existing install base of FC, I'm guessing that as people upgrade to the 32/128Gbit adapters they will start to purchase disks that can support FC-NVMe as well, which will bootstrap the market there.

Although it could go to InfiniBand as well, if people buy into the converged InfiniBand/Ethernet adapter route.

Too soon to tell, but a lot of it will depend on which technology does a better job avoiding the "forklift" upgrade problem that FCoE required.

Yeah iSCSI and NFS. Also affecting this is the huge growth in "Hyperconverged Infrastructure" (HCI).

Hyperconverged infrastructure is definitely eating a lot of the traditional storage vendors' lunches. Using Ubuntu with Juju/Ceph/OpenStack all on the same servers provides plenty of power while reducing costs.

Even VMware has come on board with vSAN, which pushes out vendors like EMC/NetApp because you no longer need them when you can just create it against your existing hypervisors. Sure, you can run one or two fewer VMs on it, but you have less cost overall.

iSCSI off-load that is available on a variety of network cards, along with network policies that are similar to FCoE (never drop a packet) allow you to get the same speed/reliability as FCoE for a lot less.

FCoE requires that the networking gear drops only one packet in 10 million or something like it, if you can make the same guarantees for iSCSI, it is for all intents and purposes the same thing. With iSCSI off-load it is even better.

iSCSI also runs across your existing network stack, doesn't require purchasing special equipment, and is better supported across a variety of different vendors, thereby making it easier to find the gear with the features you need rather than settling for something that supports FCoE.

I've done many bad things to BTRFS: used it on multiple drives of differing sizes, used it on drives connected over the cheapest USB-to-SATA adapters I could find, used it on disks with consistent corruption for over a year, and it's handled it all gracefully.

I've also been using btrfs as the backend to docker for a long time on my desktop PC and never noticed any problems. BTRFS has been rock solid for me. I don't doubt it is more unstable than other filesystems, however it seems i haven't been unlucky enough to experience any issues.

When using BTRFS, i've always stuck to the latest kernel releases, and run a scrub + balance every month. This is the advice I heard from people who used btrfs, and I wonder how many of the people who complain about data corruption do these steps. Perhaps their corruption bugs are solved in a newer kernel version. I've had multiple scrubs pick up data corruption, which other filesystems wouldn't have found.

The only time btrfs corrupted my data was when I used the ext4 to btrfs conversion tool, it created an unmountable FS and then I just migrated my data manually.

You shouldn't have to do these steps.

Manual balancing is a workaround for a critical flaw in the implementation.

In my last major use of Btrfs, whole archive rebuilds of Debian, it would take less than 48 hours to completely unbalance a brand new Btrfs filesystem. ~25k snapshots continuously created and deleted over the period in 20 parallel jobs absolutely toasted the filesystem, even though it was 1% utilised for the most part, 10% at peak usage.

The point I want to make is that a Btrfs filesystem can become unbalanced at some indeterminate point in the future, which makes it impossible to rely on if you want to guarantee continued service.

I've also suffered from a number of dataloss incidents which likely are fixed now, but despite lots of bugfixing, there are still major flaws to address.

btrfs requires far too much maintenance to be a good general-purpose FS, and it absolutely crumbles under certain not-too-uncommon use cases (specifically, databases and VMs). I really wanted to like btrfs and used it extensively over the last couple of years, but I've finally given up and moved on to ZFS, and I'm grateful that I did. Give it a try.

Strange decision, as systemd-nspawn[1] specifically mentions and supports btrfs as a CoW filesystem for its containers. And as far as I understand, systemd is primarily developed by Red Hat employees. So either they'll add support for CoW alternatives, or they'll remove btrfs support from systemd-nspawn altogether.

[1]: https://www.freedesktop.org/software/systemd/man/systemd-nsp...

There's nothing which prevents you from creating an empty block device file, putting btrfs on that, and mounting it.
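Something like this sketch (paths are the ones machinectl conventionally uses, but treat the details as assumptions):

```shell
# Run as root; guarded no-op if mkfs.btrfs isn't available or we're not root.
if [ "$(id -u)" -eq 0 ] && command -v mkfs.btrfs >/dev/null 2>&1; then
    truncate -s 10G /var/lib/machines.raw        # sparse image file
    mkfs.btrfs /var/lib/machines.raw
    mkdir -p /var/lib/machines
    mount -o loop /var/lib/machines.raw /var/lib/machines
fi
```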

nspawn uses btrfs snapshotting natively for its templating and no other FS (ZFS was explicitly rejected by Mr. Poettering because "it's not in the kernel") so yeah either they are going to have to do something about this or there will be a significant step down in functionality for nspawn. I can't see how this isn't really bad news for nspawn.

I guess one option is to pull btrfs tree into Systemd :-)

Deprecated? In favour of what?

Will Redhat too (like Ubuntu) start shipping ZFS?

My best guess is XFS coupled with Permabit (which they just bought and will open source) compression and dedupe services. Probably layered on an enhanced mdraid too.

I'm guessing that the ZFS licensing hairball is a bridge too far for even Red Hat, so they'll cobble together equivalent-ish functionality - even if it's not anywhere near as elegant as ZFS's integral data protection and reduction.

XFS is the default FS on RHEL now, so likely that.

Do you know if XFS can shrink volumes yet? As far as I'm aware, that's the only limitation it has compared to other filesystems of that era.

It can't, but you're probably better off using trim/discard/virt-sparsify rather than shrinking filesystems. Even on filesystems like ext4 that support it, shrinking can cause strange fs performance problems.

Nope, still can't. Not online, not offline.

The following will sound snarky but I would personally prefer ext2 if the other choice is btrfs. "ButterFs because your data melts away"

XFS - the workhorse that keeps on running.

I am curious as to the same. The document says

> Red Hat will continue to invest in future technologies to address the use cases of our customers, specifically those related to snapshots, compression, NVRAM, and ease of use.

but it's unclear what this means exactly.

Red Hat will never ship ZFS, because it's an entity that exists in the US and a probable target for lawsuits / license-violation claims if ZFS is included.

What's the risk involved for Red Hat in shipping ZFS? OpenZFS and ZFS-on-Linux are under the CDDL, a legitimate open-source license that some feel may be GPL-incompatible. Red Hat distributes non-GPL programs as a matter of routine, and I'm sure this includes other CDDL programs, especially considering Red Hat's enthusiastic involvement in the Java ecosystem.

The only potential risk is that the GPL is so virulently infectious that any driver is automatically GPL'd by virtue of its own existence as a compiled kernel module, but that possibility seems fairly remote, and it hasn't seemed to affect the distribution of other purportedly-non-GPL kernel modules.

I'm not a lawyer so maybe I'm missing something.

The risk is FUDdy, but not entirely imaginary. In particular there's the threat of patent lawsuits from a variety of players in the industry. Of course, that's always the case in this industry, so I don't buy it. But RH might, and that's their call.

My guess too is that if Canonical manages to go a few years without a lawsuit from kernel copyright holders then we might see more of what it is doing. But RH would -I guess!- still suffer from patent FUD and so stay away from ZFS.

That's all fine by me. The better for RH's competition. More competition, mo' betta.

I suppose their upgrade path is...

* Make fresh backups

* Verify the backups

* Re-install and use the backups

Basically. Those steps should be done regardless of FS. I heard this on a podcast recently: if you are not doing that, you end up with "Schrödinger backups".

That's too bad. The subvolume [0] features were an interesting paradigm. Kind of let you have a virtual filesystem-within-a-filesystem.

[0] https://en.wikipedia.org/wiki/Btrfs#Subvolumes_and_snapshots

It's certainly interesting, but if you look at the ZFS design they were inspired by, they got a lot wrong. Some points to consider:

With ZFS, you have a hierarchy of datasets. These inherit properties from their parents, and while the mountpoints can also mimic this hierarchy, the mountpoint property can be set independently. Btrfs couples the two concepts, forcing subvolumes to be in a specific place in the actual filesystem; zfs datasets in comparison are purely metadata and are for organisation and administration, not direct use in the filesystem hierarchy.

ZFS snapshots are read-only, and clones of these snapshots are datasets in the hierarchy. Btrfs snapshots are read-write by default, which in some ways defeats the point of a point-in-time snapshot. You can also make changes to a ZFS clone and later promote it to replace the original dataset. Likewise rollbacks. Btrfs makes no provision for doing either; you have to delete the original and then rename the snapshot, which isn't atomic. ZFS' metadata preserves all relations between datasets, snapshots and clones.
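
The snapshot/clone/promote workflow described here looks roughly like this on the command line (pool and dataset names are made up for illustration):

```shell
# Take a read-only, point-in-time snapshot of a dataset
zfs snapshot tank/www@before-upgrade

# Create a writable clone of that snapshot and test changes there
zfs clone tank/www@before-upgrade tank/www-test

# If the changes work out, swap the clone in for the original:
# promote reverses the parent/clone relationship atomically
zfs promote tank/www-test

# Or, for an in-place undo, roll the original dataset back
zfs rollback tank/www@before-upgrade
```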

The ZFS way of doing things makes things safe and accessible for system administration. There's no way to confuse the origin of a snapshot because it's tied to a parent dataset. Likewise clones of snapshots, unless you deliberately choose to break the link. The Btrfs way looks superficially nicer, but in practice is much less flexible, and potentially more dangerous since you don't have the ability to audit what came from where and when. Btrfs snapshot performance is also abysmal. ZFS handles snapshots simply by recording the transaction ID, which makes them really lightweight (and it also provides "bookmarks" which are even lighter weight). ZFS keeps the referenced blocks in deadlists, and its performance is excellent (compare how fast snapshot deletion is between the two). ZFS also allows delegating permissions to perform snapshot, clone, rollback etc. to normal users; I'm unaware of Btrfs allowing such delegation--some operations, like snapshotting, can be performed, but not deletion, while ZFS permits this all to be configured transparently.

Are you trying to say that BTRFS is supposed to compete feature-to-feature with ZFS? It's not. https://lwn.net/Articles/342892/

>I had a unique opportunity to take a detailed look at the features missing from Linux, and felt that Btrfs was the best way to solve them.

>From other points of view, they are wildly different: file system architecture, development model, maturity, license, and host operating system, among other things


>Btrfs snapshots are read-write by default, which in some ways defeats the point of a point-in-time snapshot.

Yes, and they have the option of being read-only for your temporal "in place" snapshots. But if I want to clone a container for instant use (as LXC or Docker does), then the RW snapshots make sense. Btrfs doesn't make a distinction between a Clone and Snapshot; they are one and the same with a flag.
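
For reference, the flag in question (paths here are just examples):

```shell
# Writable snapshot (the default) -- what you'd clone a container from
btrfs subvolume snapshot /var/lib/machines/base /var/lib/machines/web1

# Read-only, point-in-time snapshot -- add -r
btrfs subvolume snapshot -r /home /home/.snapshots/home-2017-08-02
```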

> but in practice is much less flexible

Tell me more how I can mix disks of differing size in RAID on ZFS

> There's no way to confuse the origin of a snapshot because it's tied to a parent dataset

There's no confusing the origin of my snapshots. `btrfs subvolume list -q` shows the ancestral parent as well as the subvolume it's located in, example:

  ID 6442 gen 50527 top level 751 parent_uuid 0f4442f8-6363-6944-be8d-e2b45d809352 path .snapshots/321/snapshot

> some operations can be performed like snapshotting, but not deletion

See user_subvol_rm_allowed mount option, available since Kernel 3.0
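
In practice (device and paths are placeholders):

```shell
# Mount with the option (or put it in /etc/fstab)
mount -o user_subvol_rm_allowed /dev/sdb1 /data

# An unprivileged user who owns a subvolume can then delete it
btrfs subvolume delete /data/alice/scratch
```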

It's like comparing a car and a truck, they both have four wheels, transport passengers and cargo, and have an engine. Just because a truck runs on diesel does not make the fact that the car running on gas "wrong". Due to its fundamentally different implementation, the way the filesystem works is also different.

Yes, ZFS has many more features, has been in development longer, and is probably more "production ready" than BTRFS. But ZFS is not GPL compatible. And BTRFS doesn't require its own cache apart from the normal filesystem cache.

Yes, that's what rleigh is saying. It's what I'm saying.

ZFS sets a very very high bar indeed. There are things that could be done better (I've talked about some of those on HN). But pound for pound, it's the best storage stack today and has been for over a decade. ZFS is the benchmark against which all others are to be stacked. There will be applications for which you will find a more performant solution, maybe, but altogether, ZFS has been the last word in filesystems for a long time now.

The most interesting competition, IMO, is from HAMMER. We'll see how that progresses.

> Are you trying to say that BTRFS is supposed to compete feature-to-feature with ZFS?

Not entirely. Btrfs was designed with the benefit of hindsight, so one would expect that the features they did choose to implement would be superior in both design and implementation. Sadly, neither is the case, with a few minor exceptions.

> Btrfs doesn't make a distinction between a Clone and Snapshot, they are one and the same with a flag.

Yep, and this is one design choice which on the face of it is straightforward and convenient, but has the side effect of being very inefficient. Because ZFS snapshots are owned by the dataset, AFAIK there's little refcounting overhead; you're just moving blocks to deadlists based on simple transaction ID number comparisons. If you modify a block and its transaction ID is greater than the latest snapshot, you can dispose of it, otherwise you add it to the snapshot deadlist (and also add the new updated block). If you delete a snapshot, you do the same thing: for each block, if the block transaction ID is later than the transaction ID of the previous snapshot, you dispose of it, else you move it to the previous snapshot's deadlist. No refcounting changes except to decrement for disposal. You only start paying the overhead when you create a clone. This makes ZFS snapshots very cheap, and clones a bit more expensive. Btrfs is always expensive as far as I understand.

Your particular uses might not take advantage of this, but it's something to bear in mind.

> Tell me more how I can mix disks of differing size in RAID on ZFS

You can have pools with vdevs of different sizes (I have one right here). It doesn't make sense to have different sizes within a vdev.

The need for cobbling together different sized discs appears to mainly be something needed for tinkering and testing. No one is going to care about this for production systems. It's a neat feature which few people care about in practice. I'd rather they had spent the time on making the basic featureset reliable.
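
For what it's worth, mixing vdev sizes within a pool is straightforward (device names and capacities here are illustrative):

```shell
# One pool striped across two mirror vdevs of different sizes --
# say a pair of 4TB disks and a pair of 8TB disks. Within each
# mirror, the smaller disk sets that vdev's usable size.
zpool create tank mirror sda sdb mirror sdc sdd

zpool status tank
```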

> > some operations can be performed like snapshotting, but not deletion

> See user_subvol_rm_allowed mount option, available since Kernel 3.0

Nice to see some option for this. It's better than nothing, but it's not really equivalent. ZFS has a fine-grained permissions delegation system which is inherited through dataset relationships, rather than coarse capabilities.

> And BTRFS doesn't require it's own separate cache that is apart from the normal filesystem cache.

Not a particular concern for me; it's well integrated on FreeBSD, and it's not a problem in practice on Linux nowadays IME. Do you have a specific problem with the ARC?

I have used Btrfs in production and I would say it's great. It's super easy to just add an extra EBS volume and attach to a Btrfs volume and now you have more disk space. Performance is good enough for me as well, I used it as storage for InfluxDB and Docker.
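
Growing a btrfs volume after attaching a new EBS volume is a couple of commands (the device name depends on your instance type; these paths are examples):

```shell
# Add the newly attached volume to the mounted btrfs filesystem
btrfs device add /dev/xvdf /var/lib/influxdb

# Optionally rebalance so existing data spreads across both devices
btrfs balance start -dusage=75 /var/lib/influxdb
```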

Luckily this is only Redhat, not Btrfs itself.

How much did you deal with snapshots? Because I've had crippling performance problems stemming from them. Even something as simple as having 90 daily snapshots and deleting the oldest few can cause trouble, where the filesystem does not respond to any requests for multiple seconds. And that's on an SSD. I don't remember if it was deleting snapshots or running balance, but I've had Btrfs on a hard drive not respond to I/O requests for two minutes. A light-use server that had snapper running for a few months fragmented so badly that it regularly hitched up even after snapshots were paused. I had to migrate the entire filesystem.

I'm still using Btrfs on my backup system, but that's only because I like the dedup enough to overlook the brief hangs.

Not only that but in conjunction with journald you can achieve amazing disk space leaks that cannot be repaired easily without losing data.

Unfortunately, RedHat tends to be the trendsetter in the Linux world. Once it's gone from RHEL, it'll be gone from most other RPM based distros soon enough. It's kind of a shame that one company has grown to dominate the Linux software ecosystem, often to the detriment of all involved.

On the contrary, RHEL ships a relatively small subset of packages, and is especially stingy on kernel features in order to make the distribution supportable.

Btrfs is definitely not gone from Fedora, for example.

(Disclaimer: I'm on the virtualization team at Red Hat).

Red Hat has a lot of core Linux contributors on their payroll. Their status is, AFAIK, not undeserved.

RPM-based distros make up maybe 10 % of the Linux installations these days, so we should not overstate RedHat's influence.

RHEL makes up maybe 99.9% of the Linux enterprise installations these days, so we can't overstate RedHat's influence.

I don't think that number is even close to accurate. SLE almost certainly makes up more than 15% of the market alone. And that's ignoring all of the other enterprise distributions.

I believe that 2015 estimates from the IDC[1] had RHEL at ~60%, SLE at ~20%, Oracle Linux at ~12% and "Other" at ~8%. But I can't access the document at the moment.

[1]: http://www.idc.com/getdoc.jsp?containerId=US41360517

Number of installations is not also necessarily a good metric. Red Hat is a very large company with a lot of money and and a lot of employees working on upstream Linux projects and driving their direction.

There are not that many companies doing the same. Which is why they have a lot of influence over the direction of things.

Concourse (which is a CI/CD system that orchestrates Docker containers) recently switched from btrfs to overlay to fix performance and stability issues.

For those with morbid curiosity on the many stability issues with btrfs as a container file system, this is chronicled in Github: https://github.com/concourse/concourse/issues/1045

I tried btrfs and got bitten by bugs; not doing that again. Judging from this move by RH, it looks like I wasn't the only one.

What bugs?

I recently setup a software mirroring raid with btrfs and I'm loving features like checksumming. It makes me feel my data is quite safe and can't bit rot anymore. So far it is working fine.


Did you notice that the official description of RAID-1 is "Mostly working"? Are you aware that if one of your drives fails, you have one chance to re-mirror it before the remaining drive can no longer be mounted read-write and you need to dump the filesystem and re-create from scratch?

What do you mean by one chance to re-mirror? Does this mean that if the resilver fails you can't try it again? Is this documented? With ZFS or regular RAID as long as you have one good disk in the mirror you can resilver, is this not the case for BTRFS? If so this is quite disappointing.

There's a reason my server is running ZFS on FreeBSD. I also love jails, which let me have as many virtual servers as I want without any virtualization overhead.

> Does this mean that if the resilver fails you can't try it again? Is this documented?

It's documented on the status Wiki. RAID 1 is "mostly working". If a mirror drops to having one disk, you can mount it once as a read-write volume (required for resilvering); after that, you have to trash it and start again.
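
The failure mode being described, roughly (device names and the device id are placeholders; on affected kernels the read-write degraded mount only works once):

```shell
# Mount the surviving device with the dead one missing
mount -o degraded /dev/sda2 /mnt

# Replace the missing device (devid 2 here) in this same session;
# if you unmount without doing so, further rw mounts are refused
btrfs replace start 2 /dev/sdc2 /mnt
```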

Tbh I don't care if I can't mount it read-write as long as I can mount RO and get the data to a new filesystem. The drives are new so nothing to worry about for years (probably). And if I get a chance ("one chance") to re-mirror, even better.

It might be an inconvenient restriction, but when a drive fails I'm already happy that there will be no data loss.

RAID1 meaning you get one chance. Just move to RAID10 if you're worried. (/s)

It has exactly the same bug with RAID10.

That's just how raid mirroring works. Not sure why you mention it specifically for btrfs?

No, it isn't. You can run a ZFS VDEV or a Linux mdraid as a single-disk RAID 1 unit until the remaining disk fails. You have an arbitrary number of reboots/remounts to fix the problem.
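
With mdraid, for example, a degraded mirror keeps running across any number of reboots until you get around to re-adding a disk (array and device names are placeholders):

```shell
# Mark the failed disk faulty and remove it from the array
mdadm /dev/md0 --fail /dev/sdb1 --remove /dev/sdb1

# The array keeps running degraded; whenever a replacement arrives:
mdadm /dev/md0 --add /dev/sdc1

# Watch the rebuild progress
cat /proc/mdstat
```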

So it turns out there's some nuance here. You can remount it multiple times, as long as you don't write to it, which I was familiar with. The moment you write again and don't explicitly remove the extra volume, you'll lose the ability to mount r/w, which is a pretty bad bug :(

I tried xfs recently and I hated it. I have a thin laptop which would just lose power if I grabbed it wrong, and very quickly my xfs partition got corrupted and I was unable to fix it. So I had to re-install my OS and went back to ext4, which, even with the same power-offs, has had no issues in several months. This is largely anecdotal, but a coworker has had the same issue too...

I also had issues with VMs using xfs.

But I do use xfs on SSD RAIDs (in our servers, used for testing) and have never had an issue there.

We lost an entire openstack host due to an xfs glitch last year. It needed rebuilding from scratch.
