ZFS on Linux: Unlistable and disappearing files (github.com)
380 points by heinrichhartman 11 months ago | 161 comments

We are working on it. We know what patch introduced the regression and 0.7.8 is going out soon to revert it. Until then, users should downgrade to 0.7.6 if they have not already. The Gentoo and EPEL maintainers have pulled the affected releases from the repositories (technically masked on Gentoo). Ubuntu was never affected.

The regression makes it so that creating a new file could fail with ENOSPC after which files created in that directory could become orphaned. Existing files seem okay, but I have yet to confirm that myself and I cannot speak for what others know. It is incredibly difficult to reproduce on systems running coreutils 8.23 or later. So far, reports have only come from people using coreutils 8.22 or older. The directory size actually gets incremented for each orphaned file, which makes it wrong after orphan files happen.

We will likely have some way to recover the orphaned files (like ext4’s lost+found) and fix the directory sizes in the very near future. Snapshots of the damaged datasets are problematic though. Until we have a subcommand to fix it (not including the snapshots, which we would have to list), the damage can be removed from a system that has it either by rolling back to a snapshot before it happened or creating a new dataset with 0.7.6 (or another release other than 0.7.7), moving everything to the new dataset and destroying the old. That will restore things to pristine condition.

It should also be possible to check for pools that are affected, but I have yet to finish my analysis to be certain that no false negatives occur when checking, so I will avoid saying how for now.

> We will likely have some way to recover the orphaned files (like ext4’s lost+found) and fix the directory sizes in the very near future.

How should people behave right now?

Will normal usage of production filesystems erase data, or will read/write activity leave the potentially-orphaned files in place?

You've also mentioned snapshots being tricky in the thread. Should people stop creating snapshots in case orphaned files are not included in the snapshots?


> It is incredibly difficult to reproduce on systems running coreutils 8.23 or later.


- This is specifically due to the fact that `cp` in 8.23 is optimized (8.22 created files in {0..2000} order, 8.23+ randomized the order (I don't quite understand why))

- The script in https://gist.github.com/trisk/9966159914d9d5cd5772e44885112d... uses `touch` to create files in random order and some people reported this triggered the bug
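For readers who want to see the shape of such a test without the gist at hand, here is a minimal sketch in Python. It is not the linked script: the function names, the file count of 2000 and the use of a temporary directory are all assumptions made for illustration. It creates empty files in shuffled order, records any creation errors (ENOSPC being the symptom discussed here), and then reports files that were created successfully but do not appear in the directory listing.

```python
import errno
import os
import random
import tempfile


def create_in_random_order(directory, count=2000, seed=None):
    """Create `count` empty files in shuffled order; return (created, failed)."""
    names = [str(i) for i in range(count)]
    random.Random(seed).shuffle(names)
    created, failed = set(), {}
    for name in names:
        path = os.path.join(directory, name)
        try:
            # Similar in spirit to `touch`: create the file if it is missing.
            with open(path, "a"):
                pass
            created.add(name)
        except OSError as exc:
            # Record the symbolic errno name, e.g. "ENOSPC".
            failed[name] = errno.errorcode.get(exc.errno, str(exc.errno))
    return created, failed


def unlisted_files(directory, created):
    """Return names created without error that are absent from the listing."""
    return sorted(created - set(os.listdir(directory)))


if __name__ == "__main__":
    # Hypothetical usage: a temporary directory stands in for a scratch
    # directory on a disposable test dataset.
    with tempfile.TemporaryDirectory() as scratch:
        created, failed = create_in_random_order(scratch, count=2000, seed=0)
        print("created:", len(created), "failed:", len(failed))
        print("unlisted:", unlisted_files(scratch, created))
```

On a healthy filesystem both the failure map and the unlisted list should come back empty. Anyone trying this against an affected release should do so only on a throwaway dataset, since the thread describes the resulting damage as awkward to clean up.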

I should have been more comprehensive in my previous reply. Normal usage is ill defined, but reads and writes of existing files have no chance of triggering the issue. Any snapshots that contain the orphaned files will need to be destroyed in order to fully purge the damage from affected pools, but it is fine to make them in the interim. Orphaned files should not be overwritten by further usage because they are still using space allocated from the space maps.

You are right about the touch script. Albert Lee designed it after studying syscall traces from a CentOS cp binary. Other things can definitely trigger it. I read on reddit that rclone triggered it. Extracting tar archives has been suggested to also be able to trigger it. I do not expect many systems running 0.7.7 to have actually triggered the bug though. We had a hard time trying to reproduce this on systems without an old enough version of coreutils’ cp.

In any case, instructions on what to do to detect and repair the damage will be made available after we finish our analysis and make the tool to fix this. That tool will likely be a subcommand in 0.7.9, which I expect Brian to push out fairly quickly once we have finalized the complete solution. Reverting the patch is a stopgap measure both to stop the population of affected systems from growing and stop the orphaned file counts on affected systems from growing.

> 8.23+ randomized the order (I don't quite understand why)

It's using inode order, which speeds up things significantly on some filesystems:
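As a rough illustration of what "inode order" means, the sketch below lists a directory sorted by inode number rather than by name. It is written in Python and is not how coreutils implements it; the helper name `names_in_inode_order` is made up for this example.

```python
import os
import tempfile


def names_in_inode_order(directory):
    """Return entry names sorted by inode number instead of by name."""
    with os.scandir(directory) as entries:
        pairs = [(entry.inode(), entry.name) for entry in entries]
    return [name for _, name in sorted(pairs)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        for i in range(10):
            open(os.path.join(scratch, str(i)), "w").close()
        # The resulting order depends on how the filesystem hands out inodes.
        print(names_in_inode_order(scratch))
```

Seen from the destination directory, a copy that walks its source this way creates files in an order that has nothing to do with their names, which fits the "randomized" behaviour described above.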


Ah, thanks! I knew there was a rationale but hadn't quite gotten it.


Existing data is unlikely to be lost, but I strongly suggest that existing systems get off 0.7.7 if they are on it. Having to figure out where the orphaned files belonged would be very inconvenient.

I have a fairly specific question - I'm in the process of migrating a large array onto another disk using zfs send and receive (6TB). It's been at it for about 2 days now. After the migration, I plan to destroy and recreate a larger array and send it back. If the pool is not being read/written to by anything else, am I safe? Or should I stop the send and receive and downgrade? Thank you for the hard work.

You should be fine. The DMU that handles send/recv is unaffected as long as the snapshots at the source were not taken after 0.7.7 experienced an ENOSPC on file creation.

As far as I know, it has never experienced an ENOSPC error. It was mentioned elsewhere that a procedure is being developed to detect orphaned files that arise from this bug. Would it be a good idea for me to keep the original array around long enough to run through that procedure just in case?

I am the developer who volunteered to develop the code to fix things. Anyway, there is no reason to avoid send/recv with 0.7.7. Send/recv will neither trigger this bug nor make it worse if it were triggered.

I understand - what I meant to say is that I cannot be sure that the array never ran into an ENOSPC issue prior to the snapshot. Given that, should I keep the old array around until I can do whatever is recommended to detect potential orphans resulting from the bug?

Edit: And thank you, again.

There is no reason to do that here. Send/recv makes a perfect copy as far as this is concerned.

You are welcome. Also, the proper term is pool.

I come from many years of mdadm use, sorry :-)

Just so I understand, I should be able to run through whatever the final recommendation is for detecting orphans on the new pool with identical results as if it were run on the original pool, then?


This is a good reminder for everyone that snapshots are not backups.

Also, thank you for all of the hard work on ZoL!

You are welcome.

I should clarify the snapshots remark. The problem with how this interacts with snapshots is that the snapshots containing orphaned files cannot be repaired by software without BPR. They can only be listed by software for deletion by the administrator. Also, for the dataset’s tip, the future tool to repair it can only put the orphaned files into a lost+found directory without the original file names.

Seriously you and the other maintainers are my heroes. Keep it up! And thanks for the awesome write up.

lost+found is an incorrect approach for ZFS. If you feel you have to regress to what was 40 years ago and was a bad solution, then the misfeature causing it needs to be engineered instead of hacked together, or better yet not implemented at all. If you have to break from mainstream ZFS and regress that violently, it’s the wrong approach.

Where do you suggest recovered file data (when metadata has been lost) should be put in the event of repair after a file system driver bug?

I suggest completely backing out the code in question and using a FreeBSD or an illumos based system like SmartOS. That's what I suggest. metadata are not files so they have no business going into lost+found.

> I suggest completely backing out the code in question

This makes sense and we did that. It does not fix the fact that datasets exist where directories have incorrect sizes and files are orphaned because of the bad patch.

> and using a FreeBSD or an illumos based system like SmartOS. That's what I suggest.

While those are fine choices, they are not immune to bugs either. We had a space map corruption bug that we inherited from Illumos several years ago. We also inherited an issue in send/recv from illumos several years ago when hole_birth was introduced. FreeBSD’s TRIM implementation was far from perfect too and likely messed some things up given the long history of bug fixes that it required.

I do not mean to discourage people from using those systems, but it is silly to adopt them because a bug got past joint OpenZFS review and landed in the Linux port first. While they do a good job too, they are not infallible. The major OpenZFS platforms are roughly at parity in terms of their risk for bugs and regressions.

> metadata are not files so they have no business going into lost+found.

This makes no sense. How are orphaned files “metadata”? Why should they not be placed in a lost+found directory?

> I do not mean to discourage people from using those systems, but it is silly to adopt them because a bug got past joint OpenZFS review and landed in the Linux port first.

That’s your opinion and while I extremely disagree with it, it is what it is. It might even be true in the case of the ZFS on Linux team, but on the whole of GNU/Linux, my professional experience is that the amount of bugs and breakage is much higher than on BSD or illumos based operating systems, and to me, quality and stability are the most important things. For me a computer is a tool to make money and so reliability of hardware and especially the software trumps everything else. In fact, apart from reliability and simplicity nothing else is important to me when it comes to software.

Your postings reminded me of illumOS's existence, so I went looking for an arm build to no avail. You wouldn't happen to know if that is something that is available would you?


that's the closest you're going to get right now, unless you jump in and help out.

Thank you

[citation needed]

How can one cite one's professional experience? But okay: bugzilla.redhat.com. Take a look at priority 1 bugs for epic breakage.

So what exactly is your suggestion? "lost+found is ugly so don't recover the files at all"? He said meta data is lost so I'm really curious to hear your elegant modern solution.

This is a tool that might be needed on a few systems that suffered from this relatively short lived bug and then never again after that, so your solution better not require ten times the effort.

If it's a one time throwaway tool, then it doesn't need lost+found!

I was talking more about the idea of a lost+found directory. Brian suggested that we make it ‘.zfs/lost+found‘, which might be what we do. I am leaning toward extending scrub to add the function when a flag for an erratum check is passed. That implies that the tool would be a part of the driver so that it can be done online for minimal downtime.

But something is majorly wrong with this approach if you can even have orphaned files! This should be impossible on ZFS!

That's why it's a bug?

also a good time to remind people that backups are not backups unless they're geographically diverse. back up to an off site location in case your (home|office) burns down or suffers some other total catastrophe.

> also a good time to remind people that backups are not backups unless they're geographically diverse.

... and backups are not backups until you can reliably restore from them.

...and a DR plan isn't worth the paper it's written on unless you test it regularly (people leave, media gets corrupted... just like the military; do drills!)

Paper? You write on paper? My DR plan is in a file and backed up locally, and ... um ... oh oh.

... and a DR plan isn't worth anything unless you test the "people leave" part without any knowledge in their heads. your internal documentation and credential storage needs to be solid in case four people flying together all die abruptly in the same plane crash. the traditional "bus problem".

And that you regularly test that property.

But how do you test that reliably if the filesystem makes files disappear?

I.e., you'll have an original that misses some files, and a backup that misses the same files. A diff will show nothing.

Compare an older snapshot to current for unexpected changes? Though that is difficult (probably not practically possible in the general case) to automate because telling the difference between rare but expected changes and unexpected ones due to corruption could require quite some intelligence/intuition.
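A minimal sketch of that comparison, in Python, might look like the following. It assumes the older snapshot is reachable as an ordinary directory tree (ZFS exposes snapshots under `.zfs/snapshot/<name>`); the function names and the example paths in the usage note are hypothetical.

```python
import os
import tempfile


def relative_files(root):
    """Collect every non-directory entry under `root`, relative to `root`."""
    found = set()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def missing_since(old_root, new_root):
    """Return paths present under `old_root` but absent under `new_root`."""
    return sorted(relative_files(old_root) - relative_files(new_root))
```

A call such as `missing_since("/tank/data/.zfs/snapshot/monday", "/tank/data")` would list everything that has vanished from the listing since that snapshot. It cannot tell an intentional deletion from corruption, and it only surfaces files that were visible in the older snapshot, so the output still needs a human to judge it.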

No backup+test strategy can catch everything. You just have to be sure to catch everything that you practically can (and need to factor your time & effort, and the importance of the data, into the judgement about what is practical to do and what is justifiable or unavoidable risk).

Unlikely to be the same files

In the general case, yes, but in this case it isn't unlikely at all: you have a ZFS pool that suffers from this bug, you back it up, and the missing files show up neither in the original nor in the backup.

And they are not backups unless DD is somehow involved.

>also a good time to remind people that backups are not backups unless they're geographically diverse

I don't mean to jump on you here, but seriously I don't think this is fair, and additionally an attitude that I think can sometimes be at least mildly harmful. Data redundancy is a spectrum, not binary. Each additional option one might use helps against additional threat scenarios, but also at additional cost, complexity, and usage requirements. Some of the other replies to your comment are correct on things like "testing" and such, but I'd argue the ultimate Rule 0 of "backups are not backups if..." would be "backups are not backups if they're not actually used". All the possible media and geographic redundancy in the world, all the possible verification and regular testing, none of it matters if the result is too inconvenient or just plain too expensive for the operation/users to bother with. Remember the kinds of situation diversity people face; huge portions of the world, even including business operations in America, have poor net access period, and a much higher slice than that are dealing with highly asymmetrical links even if the download is ok. The kinds of natural disasters, crime threats, and so forth all vary from place to place also. And even if their data is reasonably valuable, for some places even just an extra few hundred or few thousand dollars/euros a year isn't nothing.

What is binary is that anything is better than zero. I have genuinely dealt with people, not just regular users but folks wearing IT hats in rural SOHO scenarios, who got discouraged by being told that they didn't have "real backups" (implying naturally that they didn't "really care" about their data) because they failed to check all the boxes an enterprise in SV easily could, and ended up just kind of giving up on most anything. In my experience any sort of orderly backup process at all still isn't always the rule, so I worry about mental and implementation friction there, even though it's true there's a need to push on folks a bit to have at least a minimal quality solution too.

Still, if someone regularly plugs in a USB drive to their desktop and runs an rsync script or TM or something every evening when closing up? Yeah, that's a backup. No it doesn't cover the place burning down, but it does cover some primary hardware failure, users accidentally deleting something, possibly some ransomware (if used with care), etc. Maybe they have 2 drives, and stick one in a firesafe in a sealed bag, now maybe they've got a bit of protection from certain fires too if they're caught fast enough. Maybe they add on a bit of light net backup, just of key low size accounting documents and the like, what their 5/1 ADSL link can reasonably handle, ok that's better still. Etc. If they've got the discipline (and money) to have a bunch of drives or get a tape system and then regularly rotate a week or month's worth into a safe deposit box at their bank or something? Great. But there is no set number of 9s that needs to be hit before it's a real backup. 90% is real, and really better than 0%, even if it's worse than 99% or 99.9[99]%. Get in the habit, get in the habit of yearly reviews too at least, then keep improving as importance, opportunity and budget allows.

“This is a good reminder for everyone that snapshots are not backups.”

Were this not ZFS, I would agree with you, however, since zfs rollback will revert to previous state and if the snapshot is on a redundant vdev, there is no difference between that and losing your backup due to a damaged tape.

In fact, zfs snapshots are exactly how time machine on illumos based operating systems is implemented.

Backups have three characteristics: redundancy, versioning and distribution. ZFS only fulfils two.

If ZFS is your backup target, then you're making more sense.

ZFS has distribution via zfs send subcommand.

Using ZoL extensively in commercial environment - just upgraded to RHEL 7.5 with ZFS - lost a huge 32TB pool with exact same issue as in 0.7.7 - Not an expert but we think that didn't regress the issue properly. Issue still exists.

Upgraded from 0.7.3

Love the product and your work.

Just wanna give an update to our desperate experience: Rebuilt the server on RHEL 7.5 with kernel 3.10.0-862.el7.x86_64 (#1 SMP Wed Mar 21 18:14:51 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux) and ZoL 0.7.7, because we have many other systems using 0.7.7 that show no issues, but on a lower kernel (3.10.0-693).

On the new box, out of the wrapper with a brand new 32TB zpool, directory listings are messed up:

>ls /newzpoolmount
>ls: reading directory .: Not a directory
>total 0

Although a newly created file can be edited and read, it is just not listed. No hard links nor anything else in the zfs.

If this is the original issue, then we are bewildered as to why our other boxes with 0.7.7 and a lower kernel don't show the issue with massive rsyncs running, and why upgrading to 0.7.8 makes no difference. (Haven't tried 0.7.6 yet as we came off 0.7.3)

Just putting it out there in case this is a useful test scenario.

Can you confirm whether 0.7.8 is actually functional under RHEL (CentOS) 7.5 (released last week), as this may be a complicating factor?

We are experiencing exactly the same issues with 0.7.8 under RHEL 7.5. So far no known remedy.

While the 2.6.32-696.23.1.el6.x86_64 is an old kernel, it is actually the current kernel for CentOS/RHEL 6. Red Hat has backported a lot of fixes to it and CentOS/RHEL 6 is a distribution that is still under maintenance.

It is still out there in a lot of places and has been in production for a while. That it is old, is a feature, it also means that it is incredibly stable.

Also note that the problem is also reported on a CentOS 7.4 kernel a few posts down.

We used to support Linux 2.6.16, but support for that was dropped a long time ago. Then we had support for 2.6.26, but support was dropped for that too. 2.6.32 support will be maintained until at least 2020 if I recall correctly. I could see us maintaining it well past that too.

Indeed, but we are just end consumers of RHEL's release cycles. In large commercial environment (banking in this case) we are bound to use the supported versions from commercial vendors by law.

My scenario is simple - lots of very stable zfs 0.7.7 running on RHEL 7.4. We saw the 0.7.7 media reports of the bug and upgraded to 0.7.8 in fear of a catastrophe - but all hell broke loose after upgrading. Downgraded to 0.7.7 but got the new 7.5 RHEL kernel and still everything is a mess. Rebuilt some test systems with RHEL release 7.5 and 0.7.7 and still cannot even list a mount point of a brand new zpool without even creating any files on it.

Now we are seriously worried. Can't go back to 0.7.6 easily and any way zfs 0.7.6 modules don't load into RHEL 7.5 release - not compatible. We may have to go back to RHEL 7.4 and 0.7.7 which is still stable on many systems.

You guys are all heroes for your phenomenal work. Please keep going...

My thoughts exactly, while the RHEL kernel has lots of backported patches - it's by no means complete or near current.

Of course, it also isn't intended to be.

Probably obvious, but hooray for open source software! What a fantastic response to the bug.

Yep, this is one of the big reasons why I love the open source linux world. Something breaks -- and in this case, the initial situation looks quite terrifying. And then a whole bunch of people pile in with information, reproduction, systems they can use for testing, ways to handle this in production, ...

Imagine if this had happened to Apple, everyone and his dog would be rolling over each other vilifying them.

Apple's file systems HAVE had some serious bugs, as you probably know. I haven't noticed that they've been getting all that much press. What press they've been getting is very negative, though, and quite right too.

I paid good money to Apple so something like that would not happen, so yeah I’d be vilifying them. You bet!

As opposed to the free RedHat Enterprise Linux operating system

No, I love to hate redhat because they are the most incompetent company in the computer industry’s history; just take a look at their priority 1 bugs at bugzilla.redhat.com and the picture becomes crystal clear. (I’m forced to perform system engineering on RHEL every day and a lot, so my hate for it grows daily.)

Like storing the passphrase in the hint field...?

Shit happens. Software is hard.

Since when is introducing a major regression into something as critical as a filing system a hooray for open source? Only on GNU/Linux... and nobody thinks twice about it or bats an eyelash; instead, the scrambling to contain the regression is greeted with a hooray. It’s mentality like this which makes me want to never touch a computer again.

Introducing major bugs happens in software all the time. Finding it, working out a solution, and fixing it quickly is a hooray for open source.

How quickly a problem is understood and fixed has nothing whatsoever to do with open source. That is a fallacy.

Is this ZOL tip or Linux specific? FWIW, the bug does not seem to reproduce on FreeBSD-current (r332158).

However, it also does not reproduce on Ubuntu 4.4.0-116-generic running the ZFS stuff from Ubuntu.

See the test-case furthest down in the bug report. They were able to get it to reproduce on current linux distributions. The differences between older and newer distributions had to do with coreutils changes to the "cp" implementation between 8.22 and 8.23

Reverting this commit causes the bug to stop reproducing:



An inspection of the following makes me think this is ZoL specific.


Ubuntu was never affected. The regression started in 0.7.7 and Ubuntu is on 0.7.5. HEAD was affected until earlier today when the patch was reverted.

I am not sure if the bad patch was ported to the other OpenZFS platforms.

Debian Stable(jessie) is on and Debian unstable(sid) is on 0.7.6 so they shouldn't be affected.

Debian stable is stretch, released in june 2017. It's on 0.7.5 currently.

correction, I hadn't updated my server in a while, stretch itself is on also, but stretch-backports is on 0.7.6 (not sure when that happened exactly)

> but stretch-backports is on 0.7.6 (not sure when that happened exactly)

[2018-03-09] Accepted zfs-linux 0.7.6-1~bpo9+1 (source amd64 all) into stretch-backports (Aron Xu)

source: https://tracker.debian.org/pkg/zfs-linux

Looks like they're on top of the causes and solution. But there's a caveat: possible orphans.


We will work out a solution for people affected so that they can get those orphaned files back.

Regression introduced by https://github.com/zfsonlinux/zfs/commit/cc63068e95ee725cce0..., will be reverted and a new version tagged.

From title first thought was - "Wow, this is awesome security/privacy feature of the file system!" and then from reading comments I realise it is a bug

Great to hear the ZoL guys are right on this. Bravo.

It also reminds me why my NAS runs Debian..

It also reminds me why my NAS runs FreeBSD.

The bad patch passed review by developer(s) from other platforms. Matthew Ahrens was a reviewer. This was not merged based on unilateral review by the ZFSOnLinux developers. There was also nothing Linux specific about it.

That said, bugs happen. We should be putting new test cases in place to help catch such regressions in the future. If we find more ways to harden the code against regressions of this nature as we continue our analysis, we will certainly do them too.

Reviewers will help catch bugs, but the engineer who writes the code and seeks to integrate it is ultimately responsible for sufficient testing to avoid issues like this one.

When multiple people have reviewed and approved, all of those people are jointly and equally responsible for any fallout. A reviewer needs to take their responsibility as seriously as the coder. If they don’t, it diminishes the value of having the code review.

The coder, the reviewers, the approvers, they’re all in this together. It’s unfortunate that this happened, but no single person should be held accountable when there’s a process in place designed to protect against individual mistakes.

It’s a shame that this slipped through anyway, but that’s part of the nature of the resource limitation, especially with F/LOSS. There’s always a risk of this happening. The only way to reduce the risk is to contribute more resources. Blaming the coder is more likely to result in a reduction of resources, as less code gets done.

The author is a fairly new contributor:


What do you suggest that we should have done?

I'm not sure, but I'm not deeply familiar with this part of the code. In general, I think it's good to be able to induce all of the failure cases for all of the error handling code that's being added.

This can be time-consuming work, but I would argue that this is the file system -- anything less is an unacceptable risk. This quote from my boss comes to mind:

Remember: you are (or should be!) always empowered as an engineer to take more time to test your work. -- http://dtrace.org/blogs/bmc/2015/09/03/software-immaculate-f...

au contraire, engineers should not be testing their changes. tunnel vision is a real thing - you need to be seriously experienced to be able to sidestep that.

i stick with old good ext4

So this is not a "ZFS bug", but a "ZFS on Linux" bug. The actual ZFS on systems which have had ZFS for decades is not affected at all.

It is a recent regression from 2 months ago that originated in the Linux port. I do not know the status of the other platforms. I and the others involved with the Linux port are still busy handling the issue. I suggest asking the developers of the other platforms whether they had adopted the bad patch or not.

I noticed someone has fixed the title, so my comment has lost its point now. Awesome.

> The actual ZFS on systems which have had ZFS for decades is not affected at all.

ZFS hasn't existed "for decades".

One and a half decades can probably be seen as "decades", but you are right, this was some kind of an exaggeration.

It has existed since 2002, and was officially put back into onnv on 31.10.2005, and made available in Solaris 10 update 2 at the end of June 2006. So, it has existed for at least one decade and more.

It hasn't existed since 10 April 1998, therefore "decades" is inaccurate. Furthermore, since it wasn't publicly available from 10 April 2003 (who cares how long it was in development), if you round fairly, then it's accurate to say "more than a decade" but "decades" is inaccurate. Because "decades" means "more than 1 decade". It denotes "at least 20 years".

That's regardless of the subject we discuss or how good it is etc.

Anyway, the original parent admitted their mistake (thank you) so I'm done with that discussion.

At least it's no btrfs. What a disaster that filesystem's been.

A disaster for enterprise perhaps. I use only a subset of features (such as subvolumes and snapshots and a little RAID 1) and have never had problems. With the way some people talk about it, it sure sounds like it never worked at all.

I believe the most recent issues I've seen documented with it are around the built in RAID5/6 implementation not being fully stable. I haven't used any of it myself so I can't comment on the rest of it. (I did use it on an external drive like 5-6 years ago and had issues that I know have been fixed, but haven't retried it).

They should have kept RAID behind yellow tape for years to come. There's just too many problems that can bite you once a physical disk fails and you really need that RAID to recover.

Yeah, but the distributions are a bit at fault as well. If you use btrfs on SuSE you cannot use RAID5/6 because it is not supported.

You've never had problems? You're lucky. I guarantee you will. It took me less than a month to lose significant data with it, under normal usage patterns.

It took me less than 3 days for btrfs to fail me, but that was in the 2.6.33_rc4 days. Had it gone better, I might have become a btrfs developer rather than a ZFS developer. In hindsight, I am happy with how things turned out.

I've been using it for a good 5 years. OK, I had one time. When I removed a disk from a mirror, forced the remaining disk to operate writable as a single, then attempted to migrate it to single disk duplication, then shut it down before it finished migrating.

That's not a sane way to handle data really, I was playing around with unimportant stuff. If it was real data, I would have mounted it read only, and copied it away.

Well, that is very much the intended use case for a mirror...

I was not replacing the disk, instead I was attempting to switch to a single disk

We use Suse with BTRFS at work. I don't know of any instance where we lost data, but the weekly filesystem rebalance that runs during business hours and causes the machine to not respond to anything except pings is a little frustrating.

I have not had good luck with BTRFS on systems running on battery. It's been solid on my workstation for a while though. Still, it is disappointing that it hasn't stabilized faster.

I run BTRFS on my laptop. Is yours an ultraportable? Mine is a fairly chunky system with a 100Wh battery, about 3 years old.

In my experience BTRFS has issues when the disk is allowed to go into low power mode.

Same here. I've been using btrfs for years so far on my personal laptop and never had any problems.

I would never trust btrfs for anything. You might as well just use /dev/null for storage

i have been using btrfs in production for years now and it has never failed me, and i am doing hundreds of snapshots and send/receiving, i've reconfigured raids on the fly and went from 6 disk raid 10 of mixed size to raid 10 of same size. we have in some cases had many power failures with no data loss. We also have some set up using md raid and some using hardware raid...

when i hear people dogging btrfs it just speaks to their inexperience imo.

that said i know it's not perfect, but what is?

I use OpenSuse Tumbleweed with BTRFS on a laptop. This is nothing special - a 512GB SSD, no RAID, about 350GB of data and a bunch of snapshots. It was basically the out of the box configuration plus some snapper config to regularly snapshot /home.

One day I shut it down and it wouldn’t boot - the BTRFS filesystem had gotten itself into a state where it would mount ok read-only but hang the system when mounting read/write. My best guess is that I shut it down (meaning a normal shutdown through the GUI) while it was running its weekly rebalance in the background. I’ve since recovered the data and rebuilt the machine, but it’s the first time in a long time that I’ve had a filesystem (any filesystem) fail me.

Huh. I had that same problem on two machines running Tumbleweed a couple of months ago. In both cases, the only working approach I found was to wipe the root partition and install from scratch.

/home was in a separate partition, so it was not a tragedy, but annoying nevertheless. This has never happened to me before, unless the underlying hardware was about to retire.

I wasn't so lucky since I had everything as one big BTRFS filesystem.

I did recover it by booting the installer in recovery mode, mounting the fs read-only, and doing a backup. Then I blew away the partition table and reinstalled.

I also learned a lesson about backups, mainly that I should have them.

Tumbleweed is a rolling edge release. I get wonky desktops and other problems all the time when I update.

That is true.

But I have used Gentoo, another rolling release distro, before that (way before that, actually), and I never had such a problem with Gentoo, even on the unstable branch.

From what I hear Arch users tell about their distro of choice, Arch does not give them this kind of headache, either.

And last but not least, the reason I took the plunge and went for Tumbleweed was that the project uses extensive automatic testing to ensure they do not break anything.

Do not get me wrong, I still use Tumbleweed on both machines and do not see that changing for the foreseeable future. I know what I signed up for. ;-)

Are you using btrfs native RAID support? The docs say it's really unstable so I was scared to use it

I run single with hardware or md raid, or btrfs raid 1 or 10. Usually the latter.

Raid5 and 6 are unstable in the "avoid at all costs" kind of way.

This may be of use if you'd like to know more: https://btrfs.wiki.kernel.org/index.php/Status

I've been running a two-disk RAID1 setup, alongside a single disk root drive (for snapshots) for almost two years without a single issue. I think there's still a lot of FUD being spread on account of the RAID 5/6 write hole still existing, for which this is the latest update:

> The write hole is the last missing part, preliminary patches have been posted but needed to be reworked.

Is there a stable (loaded word, I know) versioning file system available for Linux?

The simple, unqualified answer is no.

BTRFS is official Linux and apparently "ok" for simple use cases, but there are no end of reports of failures when non-trivial RAID modes are used, demanding work loads are applied, device replacement is attempted, etc. and there are performance problems under a variety of conditions. There are enough qualifications on the BTRFS status page[1] that I, for one, do not consider it 'stable.' ZFS is a thing on Linux, but it's not in the kernel and _when_ it breaks the kernel developers don't officially care, except when they happen to have a foot in both camps. This situation naturally limits the size of the ZFS on Linux user base; you're one of the few if you're doing it and that's not where most production users want to be. LVM can snapshot logical volumes and produce 'crash consistent' volumes independent of the type of file system. That's been my go-to solution given no other alternative.

The fact is Linux has trailed far behind its contemporaries in advanced file systems for... 10+ years now? Not terribly flattering.

I suspect the reason is that most production use of Linux occurs in environments that provide many enterprise storage functions independent of the operating system, so there isn't much pressure to, for instance, harden BTRFS until there aren't major deficiencies. I can snapshot/clone/restore/whatever my EBS volumes any time I wish and I can do similar with my private cloud powervault volumes as well. I trust either of these mechanisms far more than _anything_ Linux has ever provided, including LVM.

[1] https://btrfs.wiki.kernel.org/index.php/Status

Do you mean snapshotting filesystems? If you really meant snapshotting, that would be ZFS. If you really meant a versioning filesystem, Wikipedia claims NILFS is stable:


LVM can do snapshots and Red Hat is working on its own hybrid storage management solution based on xfs called stratis[0]. Red Hat deprecated btrfs recently. I don't think that means anything more than Red Hat has lots of xfs devs and no btrfs devs.

[0] https://www.phoronix.com/scan.php?page=news_item&px=Stratis-...

The most stable solution right now is taking your favorite stable fs like ext4 or xfs and using LVM thin pools for snapshots.

There is hope though: https://www.patreon.com/bcachefs


Agree. Good old NIH.

Why is the thread title hyphenated? It made me expect there's a utility called "zfs-bug" causing data loss.

The OP appears to speak German natively. In German they are much more hyphen happy than English. It's probably the most common error I see among German speakers typing English.

This. German merges words together as their modifier mechanism, whereas English just uses word order to indicate modification. Since it's technically _not_ illegal in English to hyphenate to do the same thing, German speakers tend to do this for English.

It's a very hard habit for them to break. :)

One could argue that technically nothing's illegal in English, since there's no governing body, just what's in common use.

Kind of like civil vs common law too, now that I think about it :)

Too many Germans use "idiot spaces" ("Deppenleerzeichen") though, e.g. "Adler Straße" instead of the correct "Adlerstraße" or "Adler-Straße" ("Eagle Street"). Anglophony hasn't done us good.

Exactly this. In German the headline would just have been:

"Linuxzfsdatenverlustfehler gedfunden!"

Getting this stuff right in English is hard. (Pun intended)


> gedfunden

One of those days, eh?

(You spent all the attention on that first word ;-)

Same thing in Norway wrt hyphenation; we use hyphens a lot. And I am well aware that native English speakers use them much less frequently but I intentionally use hyphens in places where an English speaker would not when there is ambiguity that can be resolved by using them. I also use hyphens for aesthetic reasons sometimes just based on what looks better to me.

Also like a sibling commenter said, it’s not necessarily an error to hyphenate even though people that grew up in the US or the UK wouldn’t.

In the case of OP title I would not have hyphenated though.

basically the reason I like my filesystems a few decades old and mature, I would not trust zfs or btrfs with anything critical.

I have been running ZFS on macOS (OS X), illumos and Solaris for a good 7 years or so now. A major part of the reason I switched over fully, despite some warts, was that I experienced actual and significant data rot from stuff I was carrying forward under XFS (IRIX), HFS, etc. I don't consider my personal stuff to go back that long, but I still have things from 1993 or so that matter to me. I did a review around 2009ish, and found that a number of old files had at some point or another become corrupted, including ones I know for sure weren't in 2004. I'd been following ZFS somewhat since Sun had demoed it, including a depressing spell after Apple almost moved to it then quit, but it was my own personal actual losses that pushed me to move over.

End to end checksumming and other integrity features just plain should be universal at this point. Should have been a decade ago or more in fact. We have so much incredibly important data now that is digital only and nowhere else, and memory and storage have both become very cheap at the general population level. It's shameful that anyone should still be losing data or experiencing anxiety years or even decades down the line. Integrity, basic levels of security/privacy, and flexible, high integrity replication should all be native level features of any data store system. "Decades old" filesystems just plain absolutely do not cut it, nor does anything newer that doesn't include those promises at least as options. Bugs are unfortunate, and I hope this prompts ZoL and associated projects under the OpenZFS umbrella to double check their automated unit and stress tests. Sun rightly made a big deal of that on release. Even so, I wholeheartedly believe that ZFS or the like are far better primitives for a data storage scheme than older FS (or many newer ones for that matter).

I relate to your usage scenario, and my requirements are similar, and so are my frustrations with older filesystems: little or placebo protections against silent data corruption, lack of CoW semantics, pain of deduplication.

But on the other hand, over the years, I've heard so much FUD -- or underwhelming rumours that may-or-may-not be FUD -- about both ZFS and btrfs that have made me very reluctant to jump ship, and instead lean on an extensive suite of homemade processes and procedures that badly try to replicate subsets of features I'd expect a modern filesystem to provide. This is likely a Bad Idea, but my trust model right now can better stomach me losing data due to my bug, than someone else's.

I think what I'd like to see is widespread deployments, deployments-by-default, and better, persistent press and/or marketing surrounding *ZFS or btrfs, such that I wouldn't feel like an early adopter when using these filesystems. I'm actually glad Apple secretly wrote a filesystem with some of the same ideas, and deployed it to production on millions of (users') devices in the wild, because it raises the profile of modern filesystems and increases the likelihood that comparable alternatives will see more attention.

I am still somewhat surprised that there are still no relatively simple filesystems that do this. Make it the one feature it does and does properly. Adding enterprise features will just increase development time (and cost) and add bugs. Basically an online archival filesystem, that's how most people use their home computers.


1. Checksumming on all files

2. Minimise assumptions of ram correctness

3. Disc replication (soft raid 1)

Out of scope:

* Snapshotting

* Subvolumes

* Deduplication

As much as I love subvolumes and snapshotting, I feel like CoW probably adds too much complexity to make it (easily) reliable. Honestly I don't know why turning off CoW on BTRFS disables checksumming, so if anyone can shed some light as to why this is, feel free to point out that my requirements are far more complex than I think.

In-place writes invalidate checksums. It is theoretically possible for an in-place filesystem to have data checksums, but maintaining them would require reading the entire block to regenerate the checksum on every write (I consider reads from cache to be reads). That largely defeats the purpose of doing operations in-place, which is to avoid that overhead. Maybe it would make more sense on btrfs (because its metadata structures can become horribly unbalanced) than it usually would on a conventional in-place filesystem. However, the code would need to do gymnastics to achieve that without risking subtle reliability issues. That is something that I would not expect many filesystem developers to want to maintain. The btrfs codebase is complex enough (more so than ZFS despite having fewer features) without adding that.
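To make the overhead concrete, here is a minimal Python sketch of a block that keeps a checksum over its whole contents. The `ChecksummedBlock` class, the 4096-byte block size and the use of CRC32 are all assumptions for illustration, not how btrfs or any real filesystem lays things out. The point is only that a 5-byte write in place still forces the full block to be read to regenerate the checksum.

```python
import zlib

BLOCK_SIZE = 4096  # assumed block size, for illustration only


class ChecksummedBlock:
    """Hypothetical in-place block that stores a CRC32 over its full contents."""

    def __init__(self):
        self.data = bytearray(BLOCK_SIZE)
        self.checksum = zlib.crc32(self.data)
        self.bytes_read_for_checksum = 0  # counts the extra read traffic

    def write_in_place(self, offset, payload):
        if offset < 0 or offset + len(payload) > BLOCK_SIZE:
            raise ValueError("write does not fit inside the block")
        # The small write itself touches only len(payload) bytes...
        self.data[offset:offset + len(payload)] = payload
        # ...but the checksum covers the whole block, so the whole block
        # has to be read again to regenerate it.
        self.bytes_read_for_checksum += len(self.data)
        self.checksum = zlib.crc32(self.data)

    def verify(self):
        return zlib.crc32(self.data) == self.checksum
```

Writing `b"hello"` at offset 100 changes 5 bytes but adds 4096 bytes of read work, which is the cost an in-place design is normally trying to avoid.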

By the way, partial writes from power failures are impossible to correct on in place filesystem designs without writing everything twice. CoW has performance penalties, but not so much as in-place designs when it comes to ensuring integrity. The complexity of doing CoW is also not that bad. It is how virtual memory works in every modern computer system, minus some rare embedded systems and ancient/unikernel designs that lack it. Doing CoW in storage is not a particularly strange thing.

> I don't know why turning off CoW on BTRFS disables checksumming

Because in-place updates means you need to journal data writes, otherwise you'll get checksum errors for aborted writes. (Which classic file systems don't catch)
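A toy Python model of that failure, with the usual caveats: the dictionaries standing in for a disk, the function names and the CRC32 checksum are all hypothetical, and no journaling is implemented. It only contrasts an aborted in-place overwrite, which leaves data that matches neither the old nor the new checksum, with a CoW update, where the old block stays valid until the root pointer is switched.

```python
import zlib


def in_place_update(disk, new_data, bytes_before_crash):
    """Overwrite the block in place; the crash hits before the write finishes."""
    block = bytearray(disk["data"])
    block[:bytes_before_crash] = new_data[:bytes_before_crash]
    disk["data"] = bytes(block)
    # Crash here: disk["checksum"] still describes the old contents.


def in_place_ok(disk):
    return zlib.crc32(disk["data"]) == disk["checksum"]


def cow_update(disk, new_data, crash_before_commit):
    """Write the new block elsewhere, then switch the root pointer to it."""
    disk["blocks"].append(new_data)
    if crash_before_commit:
        return  # the root still points at the old, intact block
    disk["root"] = {"index": len(disk["blocks"]) - 1,
                    "checksum": zlib.crc32(new_data)}


def cow_read(disk):
    root = disk["root"]
    data = disk["blocks"][root["index"]]
    if zlib.crc32(data) != root["checksum"]:
        raise IOError("checksum mismatch")
    return data
```

In this model the in-place block fails verification after the aborted write, while the CoW read keeps returning the old data, because the half-finished update was never referenced.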

The recent (~5 years old) batch of block-deduplicating backup tools (attic, bup, borg, restic etc.) would fit the bill to some extent, but all of them were/are more bug-laden than the file systems discussed so far.

Checksumming is like a myth on HN already. No, it's not useful on its own to the user or to the filesystem, but it is useful to the filesystem if it can transparently heal the data. Whole disk replication doesn't address your reliability concerns as much as you think either, doesn't work all that well in general and is a burden to maintain. Doing it properly requires treating the filesystem as a distributed system where disks can join and leave, where everything is sharded, where rebalancing, self-healing, syncing, resyncing is all automatic and blazing fast. And so on.
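As a rough Python sketch of what "transparently heal" means in the simplest mirrored case: the `self_healing_read` function, the in-memory `bytearray` mirrors and the SHA-256 digest are assumptions made for illustration, and nothing here reflects how ZFS or btrfs implement it. The read picks a copy whose checksum matches and rewrites any copy that does not.

```python
import hashlib


def digest(data):
    return hashlib.sha256(bytes(data)).hexdigest()


def self_healing_read(mirrors, expected_digest):
    """Return verified data from a list of bytearray copies and repair bad ones.

    Returns a (data, repaired_count) tuple.
    """
    good = None
    for copy in mirrors:
        if digest(copy) == expected_digest:
            good = bytes(copy)
            break
    if good is None:
        # A checksum alone can only report the damage at this point.
        raise IOError("all copies failed checksum verification")

    repaired = 0
    for copy in mirrors:
        if bytes(copy) != good:
            copy[:] = good  # overwrite the damaged copy with verified data
            repaired += 1
    return good, repaired
```

With a single copy the function can only raise, which is the sense in which a checksum without redundancy detects corruption but cannot fix it.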

Nowadays most of the work in storage systems is in distributed systems, and local filesystems are treated as just another unreliable layer.

Hope this gives you some ideas why there are no filesystems like that.

Not particularly. I'm thinking for the home user who just copies the photos from their camera to their home computer. Distributed systems would be great, were it not for the fact that by and large, home connections have abysmal upload speeds.

>As much as I love subvolumes and snapshotting, I feel like CoW probably adds too much complexity to make it (easily) reliable.

Despite this bug, ZFS still seems reliable and does everything you ask. And any reliability you feel is gained by dropping CoW is offset by the reliability lost by being susceptible to power outages corrupting data.

I feel similarly. I've had basically all my data on ZFS/ZoL for 8 years at this point and have not had any problems except occasional system lockups when scrubbing. This has not happened for the last year or so.

For a while it seemed like ZoL was working hard to catch up with the "official" ZFS feature set, but at this point I hope they take it slowly and carefully and really beat on new upgrades.

The more stable you are, the more important it becomes for a FS to not screw it up, since the consequences get larger with a larger user base...

It sounds like you are suggesting that checksums and other integrity features obviate the need to back up your important files.

They're part of the same process, you need the checksums to know your backup is reliable and to tell you when your main files become corrupt. Backups alone aren't enough either.

No, it doesn't work like that, backup is not redundancy. To detect a bad block you first need to read it, you can't just know when a file becomes corrupt. And reading every block is very expensive and doesn't provide predictable guarantees. Instead proper backup has built in redundancy to account for lost blocks and so does proper data storage. Checksums are not visible to users in either case.

And backup has to be incremental to prevent disasters.

I don't understand how you could have possibly gotten that, particularly given that I very explicitly finished with "that ZFS or the like are far better primitives for a data storage scheme than older FS". Primitives. Other components are still important. But backups in no way whatsoever inherently provide integrity, that's not the threat scenario they deal with. A non-verifying backup (quite typical, especially historically) will cheerfully and unquestioningly back up corrupted data, and can become corrupted itself over time. When you're considering data maintenance over the course of not just years but decades, "oh just backup" isn't at all enough without an integrity story to go with it.

Now of course it is possible to build levels of verification higher up the stack, but the problem with a lot of the ones available to general user is that they either leave holes in various places or they're just plain too much of a PITA (and "PITA" here includes a significant enough level of slow down) or both. It's much easier to do a good job if the FS itself is reasonably trustable and has verification and basic repair built in. FWIW I do use tools like par2 for cold or cool backups of ZFS dumps to targets like tape or Glacier now, but it's still a big help to my sanity to have as much of that dealt with at a low, automated level for hot/warm systems where convenience is critical.
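For a sense of what verification "higher up the stack" looks like, here is a small Python sketch that records a SHA-256 per file and later reports files whose contents no longer match. The function names and the manifest layout are made up for this example. It is not par2: it has no parity data, so it can detect damage but not repair it.

```python
import hashlib
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its digest."""
    root = Path(root)
    return {
        str(p.relative_to(root)): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root, manifest):
    """Return the relative paths that are missing or no longer match."""
    root = Path(root)
    bad = []
    for rel, expected in manifest.items():
        p = root / rel
        if not p.is_file() or file_digest(p) != expected:
            bad.append(rel)
    return sorted(bad)
```

This also shows the holes mentioned above: every verification pass has to read every byte, and the tool cannot tell bitrot from a legitimate edit unless the manifest is rebuilt after each intended change.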

Of course not. But with filesystems lacking these features, the backups end up 'faithfully' (and uselessly) propagating bitrotted files.

Are you still using ext2 then? Ext3 was 2001, and ZFS was 2005. JFS2 is only 19 years old so doesn't quite meet the "decades" requirement.

not parent, but -

ext3 is improved ext2 which is improved minix fs which is improved fat, and all use the 'traditional' static-map-of-blocks approach, with gradual incremental improvements on top of this.. so the general technology 'family' is decades old even if any one implementation is not..

zfs/btrfs are log-structured, which wasn't really feasible from a hardware/performance standpoint until much later (see also LFS), and so represent implementations from a later 'generation' of technologies

I'll give you ext3, certainly as ext2 and ext3 share a codebase. They are wholly unrelated to FAT though, ext and minix both derive from the UFS/FFS branch of file system development which has little, if anything, in common with FAT filesystems.

[edit] The original 8-bit FAT and FFS were both implemented at about the same time, if you consider FFS's beginning to be when I-nodes were moved into cylinder groups. If you consider the Unix "FS" to be the beginning, then it predates FAT by a lot. In any event inode and cluster-chain systems have fairly significant differences.

Also, log-structured/CoW systems have been used as backends for databases for about 40 years now.

ZFS and btrfs are not log structured, but CoW. There are some aspects of ZFS like ZIL and L2ARC that are essentially log structured, but the filesystem itself is not a log structured filesystem.

Using CoW in storage is an idea that occurred after in-place and log structured filesystems had been made as far as I know though.

He's likely referring to XFS, which meets the requirement.

Considering that XFS is currently getting new features en masse, this doesn't sound logical. Except when they mean they are using a still maintained kernel before Linux 4.x.

And just to round it out, UFS2 was first in a release in 2003 in FreeBSD 5.0. So on the BSDs you're stuck with UFS1 at "decades."

Alternatively you can just avoid the bleeding edge versions of any file system.

Yes. I don't think anyone would contend that it's not bad for this to have landed in a stable release, but in fairness, that stable release was about 3 weeks old and ZoL is not yet 1.0. Mistakes like this are unfortunate but they do happen, and that's why many people wait several months before they incorporate new versions.

From the bug report, sounds like ZoL is integrating some new tests, learning how to retrieve orphaned files, and possibly strengthening its hashing mechanisms, so overall this will strengthen ZFS's robustness. Little consolation for those with damaged filesystems, but that's how the cookie crumbles sometimes.
