Tell HN: ZFS silent data corruption bugfix – my research results
109 points by sandreas on Dec 7, 2023 | 88 comments
Hey HN,

Since I'm personally affected by this bug [1] and it scared me somewhat, I thought I'd post my research results to start a discussion about the details.

I must say I lost a bit of trust in ZFS, since it was 'sold' as a rock-solid, unbreakable file system, and now this... The sheer complexity of ZFS leads me to think that this was not the only bug of its kind sitting undetected in the source code.

However, the communication was clear and precise, other filesystems have had similar problems, and I just love the feature set of zfs. So I'll give it another shot, hoping that future revisions will be safe(r).

Here are my conclusions so far:

  1. ZFS had a silent data corruption bug recently discussed on HN [6][7]
  2. `Silent` means you can't do anything about it without upgrading zfs (scrubs, checks, etc. don't help)
  3. There is no tool to check if your data has been corrupted
  4. Only once written and untouched data should be relatively safe (e.g. backups), the bug mainly affects reading data [4]
  5. Setting `zfs_dmu_offset_next_sync=0` reduces the probability of being affected, but is no guarantee [4] (see the sketch after this list)
  6. The bug affected any version before 2.2.2 and 2.1.14 [2], but was more likely to be hit between 2.1.4 and 2.2.1 because of a behaviour change in coreutils
  7. The underlying problem has existed much longer (since 2006?) - this is still a question mark [4]
  8. Do a `zfs version` to see the version number
  9. If you're scared (I was), please read this comment: https://news.ycombinator.com/item?id=38553342
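For point 5, a rough sketch of how that tunable can be set on Linux (the /sys path is where OpenZFS exposes its module parameters; adjust for your distro):

    # runtime change (lost on reboot)
    echo 0 | sudo tee /sys/module/zfs/parameters/zfs_dmu_offset_next_sync
    # persistent: module option picked up when the zfs module loads
    echo "options zfs zfs_dmu_offset_next_sync=0" | sudo tee -a /etc/modprobe.d/zfs.conf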

So the only way to ENSURE your data is ok is to checksum all files at file level and compare against a copy on another filesystem, or to re-transfer all data from another filesystem after upgrading zfs to the latest version.
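For the checksum-and-compare route, something like this is what I have in mind (paths are placeholders; assumes both copies are mounted locally):

    # checksum manifest of the data on ZFS...
    (cd /tank/data && find . -type f -print0 | xargs -0 sha256sum | sort -k2) > /tmp/zfs.sums
    # ...and of the reference copy on the other filesystem, then compare
    (cd /backup/data && find . -type f -print0 | xargs -0 sha256sum | sort -k2) > /tmp/ref.sums
    diff /tmp/zfs.sums /tmp/ref.sums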

Please let me know if I missed something or got something wrong.

Sources:

[1]: https://www.phoronix.com/news/OpenZFS-Data-Corruption-Battle

[2]: https://www.phoronix.com/news/OpenZFS-2.2.2-Released

[3]: https://www.theregister.com/2023/12/04/two_new_versions_of_o...

[4]: https://www.reddit.com/r/zfs/comments/1826lgs/psa_its_not_bl...

[5]: https://news.ycombinator.com/item?id=38519382

[6]: https://news.ycombinator.com/item?id=38405731

[7]: https://news.ycombinator.com/item?id=38380240




This bug shouldn't really scare people. It requires such an incredibly specific workload to hit.

Here's a post by RobN (the dev who wrote the fix) on the ZFS On Linux mailing list:

> There's a really important subtlety that a lot of people are missing in this. The bug is _not_ in reads. If you read data, it's there. The bug is that sometimes, asking the filesystem "is there data here?" it says "no" when it should say "yes". This distinction is important, because the vast majority of programs do not ask this - they just read.

> Further, the answer only comes back "no" when it should be "yes" if there has been a write on that part of the file, where there was no data before (so overwriting data will not trip it), at the same moment from another thread, and at a time where the file is being synced out already, which means it had a change in the previous transaction and in this one.

> And then, the gap you have to hit is in the tens of machine instructions.

> This makes it very hard to suggest an actual probability, because this is a sequence and timing of events that basically doesn't happen in real workloads, save for certain kinds of parallel build systems, which combine generated object files into a larger compiled program in very short amounts of time.

> And even _then_, all this supposes that you do all this stuff, and don't then use the destination file, because if you did, you would have noticed that it's incomplete.

> So while I would never say that no one has ever hit the problem unknowingly, I feel pretty confident that they haven't. And if you're not sure, ask yourself if you've ever had highly parallel workloads that involve writing and seeking the same files at the same moment.

https://zfsonlinux.topicbox.com/groups/zfs-discuss/Tcf27ae8f...
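For the curious, the reported reproducers boiled down to roughly this shape of workload (a simplified sketch, not the actual script from the issue tracker; it relies on cp from coreutils >= 9.0, which as I understand it probes the source with lseek(SEEK_DATA/SEEK_HOLE)):

    # write small files and copy each one right away, many in flight at once
    for i in $(seq 1 1000); do
        dd if=/dev/urandom of=src.$i bs=128k count=1 status=none
        cp src.$i dst.$i &
    done
    wait
    # a hit would show up as a copy that silently differs (runs of zeros)
    for i in $(seq 1 1000); do
        cmp -s src.$i dst.$i || echo "mismatch: dst.$i"
    done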

Here's another writeup by another ZFS dev Rincebrain https://gist.github.com/rincebrain/e23b4a39aba3fadc04db18574...

I think the only reason this has gotten so much attention is because it came up as a block cloning bug (which it's not) and that being a new feature created a massive scare that it's widespread. This isn't the first or the last bug ZFS has had - it's software.


I feel like there has been kind of a weird concerted effort to push that zfs is bad due to this bug and how trust has been lost etcetera etcetera - super annoying when most other filesystems just corrupt your data and nobody will ever know it happened. I've experienced bad data corruption on xfs, btrfs, ext2, and ext4. So far zfs has been nothing but perfect.


My experience has been aggravated by being simultaneously affected by https://github.com/openzfs/zfs/issues/11893, which results in all zfs operations hanging (kill -9 is ineffective) until a power cycle. This round of upgrades was not fun.

Debian bookworm users: You need to enable backports for zfs. Both bugs are fixed in backports but still present in stable.
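In case it saves someone a search, roughly what that looks like (contrib is needed for the zfs packages; double-check the details against the Debian wiki):

    echo "deb http://deb.debian.org/debian bookworm-backports main contrib" | sudo tee /etc/apt/sources.list.d/backports.list
    sudo apt update
    # pull zfs from backports instead of stable
    sudo apt install -t bookworm-backports zfs-dkms zfsutils-linux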

Let's do away with the tribalism in both directions? Calling it perfect is quite a stretch. I'm not saying "don't use zfs", just "you want to be prepared for the worst case if your data matters". 3-2-1 and "validate your backups" still apply.


Perfect for me is not perfect for others of course. I have not run into the issue you linked, that sounds frustrating to deal with.

Backups are the only (near) 100% way to ensure your data is safe, I agree.


As someone currently rewriting Ansible plays to build Bookworm QEMU images with ZFS support, thank you for this incredibly niche and useful bit of information.


Np! Actually, in case you're able to share some of those plays in any form, that could be very useful here, however niche :)


I have an older version [0] for Bullseye that you're welcome to take a look at. It's actually combined Ansible + Packer, so if you point it at a Proxmox host, you get an image in ~10-15 minutes.

It's a fork of other work [1], modified and updated for Debian and Debian only. It also in absolutely no way has any best practices, which is why I'm doing a complete re-write with proper role dependencies, modularity, etc. When that's done, I'll be marking [0] as archived and linking to the new repo.

The new version also replaces the incredibly hacky shell script builder (hooray dialog) with Python, and updates Packer's JSON with HCL2.

[0]: https://github.com/stephanGarland/packer-proxmox-templates

[1]: https://github.com/chriswayg/packer-proxmox-templates


Thank you!


No, you're just hearing about ZFS corruption like you've heard about corruption in other filesystems; this won't be the first or the last.

It's easy to feel like it's being targeted with some kind of campaign; however, the truth is rarely that exciting. The software is being used by more people, which means it will expose more bugs.

No software is perfect; it's just now starting to be abused enough to be important enough to be talked about. This stage is entirely normal.


But it's totally different than corruption in other filesystems. People are acting like upgrading to ZFS 2.2.0 ate all their data like XFS used to back in the bad old days. I remember once the power went out at my house, and the entire XFS filesystem was irreparably damaged.

This bug is super hard to trigger, and has roots back to 2006 and only surfaced with a recent coreutils update. Yet there are numerous posts in the past few days verging on hysteria. I myself was very worried until I saw posts like the grand parent comment or the linked GitHub issues.

I don't believe there's actually a campaign of ZFS haters sitting around in a storeroom somewhere posting to phoronix and HN, but it does feel like a lot of strongly held opinions are being unleashed now that ZFS had this bug.

I'm pretty sure ZFS has been abused a lot - ten years ago I was helping test petabyte scale ZFS clusters, doing tons of concurrent IO and such. I'm pretty sure Sun and Oracle tested the crap out of it.


> But it's totally different than corruption in other filesystems. People are acting like upgrading to ZFS 2.2.0 ate all their data like XFS used to back in the bad old days. I remember once the power went out at my house, and the entire XFS filesystem was irreparably damaged.

Different bugs manifest differently depending on the code; this is just the same.

ZFS wouldn't have even a hundredth of the real-world testing hours of ext3/ext4. The large-scale deployments of ext3/ext4 dwarf the usage of ZFS.


ext4 released: 2008

https://archive.ph/20120529150649/http://git.kernel.org/?p=l...

ZFS released: 2005

https://web.archive.org/web/20130619165135/https://blogs.ora...

This is Linux local bias. There is more to Unix than Linux.

Back around 2001 when work on ZFS began, I very strongly suspect that there were several orders of magnitude more storage on ZFS on Solaris in datacentres around the world than there were on all Linux filesystems combined.

Just because you are more familiar with the FOSS tool doesn't mean that it's more mature or more widely used or more deployed in production.


I would -very- strongly imagine it the other way. I guess we'll never know.


I worked at a company that shipped petabytes of ZFS storage for years and years. Think 42u racks with 60 drive JBODs. Hundreds of these. Hundreds. And they were a very small company. Can you imagine the scale Sun and Oracle shipped hardware at? Was there any linux based hardware company operating at the same scale?

I don’t think we had any customers request ext4 bulk data storage. OS drives sure. Everything else was ZFS. Or weirder proprietary filesystems!


Yeah, that GPFS was a real PITA. I'm not talking about size comparison issues, I'm talking about runtime in the field.

You simply can't begin to understand the number of shitty IoT and Android devices that exist.


See the earlier comment to you.

I think you have this 100% backwards. I heard similar comments to yours inside some large Linux vendors, too, and I think they betray a profound misunderstanding about the nature of large-scale Unix deployments over the years.

FOSS Unix is recent and mostly small-scale and transient deployments of containers, not long-term storage... but before that, there were decades of vast scale commercial Unix systems, which dwarf the FOSS stuff.


How many ZFS systems do you think are running right now, all hours combined, compared to Linux ext filesystems?


"Bananas are better than passion fruit because there are more bananas being grown!"

ZFS is an enterprise storage-management tool. It's not just another hard disk format, like ext4.

Not many people are running Linux boxes with root on ZFS because few distros support that.

Single-disk setups are not its strength so the number of single-disk setups of ZFS compared to ext4, which only supports single-disk setups, is irrelevant.

Even saying that, though...

I bet all FreeBSD deployments are on ZFS, and it's got a solid enterprise presence. Anything streamed off Netflix anywhere is being served by FreeBSD.

Oracle still supports Solaris. All Solaris, OpenSolaris, Illumos, Nexenta, OmniOS, SmartOS, Tribblix... all on ZFS.

And it's still not its core area.

If you want to do a fair comparison let's hear about companies doing petascale to exascale arrays with double-digit to quadruple-digit numbers of drives.

How many of them do you reckon use ext4?


I'm not the one who moved the goalposts here though, was I?

Here, let me quote myself.

> The software is being used by more people, which means it will expose more bugs.


ZFS and ext4 are so completely different on a fundamental level that there is zero reason to compare them; it just shows your "expertise" on the topic...


I don't think it takes expertise to do basic math. But there ain't no value in ego measuring online.


To me it does feel like this has been on HN a bit more than other filesystem corruption bugs. This bug can basically only be triggered by using lseek to search for a hole. Ext4 has a very similar bug, but that requires enabling inline_data (and possibly a non-default blocksize, but that doesn't always seem to be the case): https://forums.gentoo.org/viewtopic-t-1166006.html

I don't think I ever heard about this, apart from in the context of the ZFS bug. And although inline_data is niche, ext4 as a whole I would argue is not.

Actually lseek seems to have been broken on most filesystems at some point: https://bugs.gentoo.org/891125 https://github.com/gluster/glusterfs/issues/894

And apparently apart from modern coreutils using that, it is mostly gentoo users hitting the bugs in lseek.


I'm unreasonably annoyed you used etcetera instead of et cetera.

'Et' is Latin for 'and'. The term means "and other things", not "andotherthings".

No, I have nothing important to add to the ZFS discussion.


etcetera is found in ~500-year-old books, and et cetera itself comes from et caetera, from the Greek καὶ τὰ ἕτερα. Which one should we use then? Should we only use the Latin from 1 BCE/CE or the Vulgar Latin that evolved after? :-)

Maybe I just don't see it as a big issue, since many Latin languages use it as a single word (such as Italian with eccetera, Spanish, and Portuguese), due to the evolution of the language.


I guess even ~500 year old books can be wrong :-)

I'm just joking. This is just a pet peeve of mine. Thank you for the fun lesson in etymology.


Did not know this; I don't mind learning something. You'd think though that autocorrect would pick up on that…


The reason, and the difference, is that all these other filesystems have check and repair (and sometimes multiple) tools.

Please correct me, but ZFS has none.


What good would ZFS be without that?

The key differences are in two places:

1. Every administration task for ZFS is done online. You don't need to take your pool offline just to check for errors and repair them (if possible). (Mind that on FreeBSD, UFS can usually have a fsck done in the background while the file system is in-use. Just about zero Linux-native file systems have this capability)

2. "other filesytems" can only hope to detect and repair inconsistencies in their metadata structures. If your files are corrupted, they can't know and won't tell you that they are. ZFS checksums everything, including regular file data. It will repair regular file data too, if possible.


You're completely wrong.

ZFS's scrub is both a check and a repair tool. It's already saved some of my data.


You are partly right. Zfs scrub is a repair tool when it has parity / mirrored copy of data to recreate it.


It's safe to make the assumption that the tool isn't magic.


A scrub can also repair data when using a `copies` value greater than one, even outside any mirrors or parity.


ZFS has checksumming and scrubs, which can catch lots of data corruption that (most?) other filesystems can't catch at all.


I guess I meant it as having a separate repair tool, but yes, you're of course correct.

To me it was always weird to not have a separate tool for being able to do an offline ZFS repair.

With regards to data corruption, I mean, this is exactly why I moved to btrfs, because it's able to catch bit rot same as ZFS.

I think ultimately this is why ZFS data corruption bugs always are such a big deal... It's because in a lot of tech circles ZFS is put on this infallible pedestal where it can never do any harm to your data.


> To me it was always weird to not have a separate tool for being able to do an offline ZFS repair.

Have you ever asked yourself why those operations need to be done offline?

It's not that ZFS doesn't have these capabilities, it's that ZFS is doing these similar checks and repairs as part of its normal operation and more. If you can't mount a ZFS filesystem, odds are a similarly borked ext4 filesystem isn't going to be repaired by fsck.

I got bit by some pretty bad footguns in BTRFS, which has meant I steer pretty far away, but my understanding is these have been somewhat mitigated. I hope it continues to work well for you.


Could you please explain what you mean by a separate repair tool kit?


I think he means an equivalent for `fsck`.


`zpool scrub` is equivalent to `fsck`.

It just has a different name, and it can fix more kinds of corruption (including corrupted data, which `fsck` for other file systems can never do), and it can do so online. Of course zfs can fix those same errors online while doing ordinary reads, so really all `zpool scrub` does is read everything.

`zpool scrub` is better than `fsck`.


It is not.

`zpool scrub` walks every block in the pool, and implicitly verifies checksums and other things while doing that. That's it.

It's not doing any sort of logic bug repairs or cleanup or anything else.

It's also not checking that you can, say, decrypt things, since that would mean you needed the keys to scrub.


It really is better. The key difference to understand is that in ext4 any form of metadata corruption which can be automatically fixed requires you to take the filesystem offline (to unmount it), and then run fsck. Meanwhile with zfs, any form of metadata corruption which can automatically be fixed is simply fixed right on the spot, transparently.

In truth fsck is a wart, a kludge, a bag on the file system design.


I am passingly acquainted with ZFS.

The key thing to realize is that you can't, actually, automatically fix every problem, sometimes you have found a logic problem which results in an impossible outcome and you need someone to manually clean it up.

In a world without flaws, it would be great to never need that. But the thing about theory and practice is that in theory, they never differ, but in practice...


I know that. But fsck only fixes the ones that it can fix automatically. With anything else, you are completely on your own. Get a hex editor and go to town. With zfs, if there is some kind of problem that cannot be automatically fixed then at least you have one more tool available: zdb. It’s a _debugger_ for zfs filesystems. It will show you everything, more than you ever wanted to know. It is way better for fixing problems than any hex editor.


zdb is read-only. It's not fixing anything, just telling you what's going on.


Don’t be an idiot. You can fix more with zdb and a hex editor than you can with the hex editor alone.


That seems rather rude.

The discussion was about the need for tools to make it easier to handle cases where you couldn't automatically handle repairing them, and your statement was that zdb is very useful, which is true, but it doesn't fix anything.

fsck for various filesystems has a bunch of common cases like "this is an orphaned file, should I save it or mark it free?", and something similar would indeed be useful for a number of failure cases in ZFS which require more explicit instruction on what to do about it because you can't easily automatically resolve it.

`zpool scrub` is very useful, but ZFS could still benefit from automated tooling to handle some common failure modes, not just let you write bespoke tooling every time.


> fsck for various filesystems has a bunch of common cases like "this is an orphaned file, should I save it or mark it free?"

This is an example of an automated fix that zfs just handles transparently, without needing to prompt the user. Or it would, if zfs could even have orphaned files, which it cannot.

I don’t know why you think this is such a win for fsck, which doesn’t even bother to give you any idea what the file was. It doesn’t try to show you the contents, and it probably doesn’t know what the file was called, or why it was deleted. Or even if it _was_ deleted; a file could be orphaned merely because the data was written but the write to the directory entry got lost. The user has nothing to go on and just guesses, or says `y` for everything. Useless.

> `zpool scrub` is very useful, but ZFS could still benefit from automated tooling to handle some common failure modes

This is precisely and exactly what scrubbing does! All failure modes that can be automatically fixed, whether they are common or not, are transparently fixed without even needing to unmount anything.

zdb is there for the really rare cases where there is so much damage to the filesystem that zfs cannot even mount it safely. Other filesystems don’t have anything like it.


> Or it would, if zfs could even have orphaned files, which it cannot.

If you use the zfs_recover parameter, then in a couple of cases, it will just permanently mark space as unallocatable forever because it can't figure out what owns it due to some errors, and you decided that was a better outcome than whatever error it was encountering. (That's what the "leaked" zpool property counts.)

Conceivably, you could write something to carefully figure out what, if anything, owns it, and allow it to be freed, or grab the contents of that region and drop it somewhere and then throw it out after the same safeguards, but by definition if you triggered that handling, something has gone wrong and you don't have a better automated intervention.

I wasn't arguing that ZFS has a case requiring orphaned files, but my point was that that was an example of "I don't know what to do with this, but I know enough to realize I can't decide what to do about this or just throw it out, so here, you do it."

Or if you, for example, wanted to do a destructive rewind on a pool because of some horrible edge case, then you might want some automated way to extract everything you're about to throw out, and that might require more work than just `zdb -R` if the pool won't import in the first place.

One example of something that would be useful to be able to do, particularly offline, would be when pools have some issue like spacemap corruption, you could conceivably walk the non-spacemap metadata to re-synthesize the spacemaps from whole cloth and write them out again. (And it's not an all-or-nothing thing, a number of versions many years ago would have very minor errors in how they computed spacemaps, which just mean zdb whines at you if you ask it to verify them even when the pool is offline, but don't interfere with the running otherwise.)

Could you convince the kernel code to do that for you on import if it's blocking import? Probably, though you'd probably want some out of band communication method to force it sometimes because I would bet there's any number of ways it might go awry and get too far into the woods of thinking it's "fine" before tripping some assertion.

Is it going to be faster to iterate on it from userland, particularly if you can simulate whether the import works from something like zdb with those conceptual spacemaps written out somewhere, versus rebooting on kernel panic? Absolutely.

But that's an example of a case you might have something so disgruntled it could be nice to reconstruct it outside of the normal import flow.

> All failure modes that can be automatically fixed, whether they are common or not, are transparently fixed without even needing to unmount anything.

`zpool scrub` doesn't fix anything except checksum errors. That's it. Any other class of flaw, it doesn't handle. I don't know why you think it does anything else, but I promise you, it absolutely does not.


> ask yourself if you've ever had highly parallel workloads that involve writing and seeking the same files at the same moment.

Uhhhh, databases?


Databases don't involve using SEEK_HOLE to find gaps in a sparse file, usually, so it wouldn't come up here.


So you needn't do so: SEEK_HOLE does not occur in the github.com repos for PostgreSQL, MariaDB, or SQLite. Are there system libraries they incorporate which use SEEK_HOLE?


>> So while I would never say that no one has ever hit the problem unknowingly, I feel pretty confident that they haven't. And if you're not sure, ask yourself if you've ever had highly parallel workloads that involve writing and seeking the same files at the same moment.

It makes it sound unlikely, but I have a couple of VMs in datasets (all formatted as ext4 internally, some running DBs inside them); each is one big `raw` file which is getting a lot of reads and writes, I assume.

How are they at risk?

Also, what about ZVOLs mounted as ext4 drives in these VMs?


This is a great explanation, thank you.


I hope you never look at the history of ext, or ntfs, or btrfs, or ufs, or xfs, or reiserfs, or… (stop me when you get the point).

One corner case data loss bug in 20 years? Throw that baby out, the bath water is bad!


https://github.com/openzfs/zfs/pull/15529#pullrequestreview-...

Honestly, ZFS is the best thing on the (Free)BSDs only... On Linux it doesn't even use the page cache, and you conflict severely with L2ARC. I know there's a variety of people who don't care, but still for real users it's not an actual option.


For me it was the previous data corruption bug [1] that killed any enthusiasm for ZoL. After that annoyances like the caching issues you mention and the constant kernel upgrades breaking DKMS on Fedora just stopped being worth it for me.

I finally moved to btrfs earlier this year, and so far I'm glad I did.

I run raid1 on my primary array, and raid5 on my off-site backup array at my mom's apartment connecting with Tailscale.

Replication is with rsync and borg, not snapshots.

Yes, it's more painful to replace disks after a failure, but once you get a bit used to it, it's really no big deal.

On my main workstation/laptop the dedupe and compression work much better than my experience with ZFS.

[1] https://news.ycombinator.com/item?id=16797644


Isn't raid5 on btrfs perpetually broken and unsafe since inception?

Also, a complicating factor with kernel upgrades is that while the zfs release notes clearly delineate which kernel versions are supported, that information doesn't appear to be meaningfully encoded in package metadata, so if you use new enough kernels relative to the zfs version for your distro, it is possible to front-run support. For instance, 2.2.2 supports up to 6.6, but you could very well install 6.7 and it might not work.

The somewhat broken thing is the packaging system not encoding known data (like which kernels are supported) so it can automatically do the right thing, not the filesystem. The lazy fix is to just manually handle kernel updates. The lazier one is to grab the release notes and only update if the latest kernel is <= supported.
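The 'lazy fix' can be as simple as holding the kernel back until the module catches up; a rough sketch, and package names vary by distro:

    # Debian/Ubuntu: keep the kernel metapackage from jumping ahead
    sudo apt-mark hold linux-image-amd64
    # Fedora/RHEL: needs the dnf versionlock plugin
    sudo dnf versionlock add kernel kernel-core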


The packaging was a major reason for switching to btrfs. I run sudo dnf upgrade and that's it, my system is upgraded. Zero issues ever. With ZFS I had to pin to older kernel versions, not to mention a bunch of manual steps and cleanup after any major version upgrade (every 6-12 months).

Re btrfs raid 5/6, yes, everyone knows about this, and this is why I have it only on my backup system. My primary data, which holds 15TB of family photos and videos, is on raid 1. The offsite is there only in case my house burns down.[1]

[1] I watched my neighbor's house go up in flames 2 years ago, and it finally got me going on setting up remote backups. The fire spread to 3 other houses, and everything happened very, very quickly. No one got hurt, but multiple families got displaced for more than a year. Besides having backups, it's also a good reminder to have adequate insurance. One neighbor did not.


I wonder if a metapackage that always depends on kernel <= supported would resolve the issue by ensuring you don't need to pin a specific version manually.


That's the way Void Linux does it. That ensures that the default kernel series always works with ZFS and NVidia modules. If you want to go off-reservation, you can do so but you're on your own, then.


Yep, same on NixOS; just set

    boot.kernelPackages = config.boot.zfs.package.latestCompatibleLinuxPackages;
And you're done.


The problem is that RHEL, for example, loves backporting breaking changes, so you can't know a priori that RHEL's "2.6.18" or whatever is going to keep working, and otherwise you need to push a new metapackage every time they ship an update.


> Replication is with rsync and borg, not snapshots.

I back up my files with borg, but I still use snapshots during the Borg backup to ensure files are not modified during the process.
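In case it helps anyone, the snapshot-then-borg pattern is roughly this on btrfs (paths and repo are placeholders; a ZFS snapshot would serve the same purpose):

    # take a read-only snapshot so borg sees a frozen view of the files
    sudo btrfs subvolume snapshot -r /data /data/.borg-snap
    borg create /path/to/repo::'{hostname}-{now}' /data/.borg-snap
    # drop the snapshot once the archive is done
    sudo btrfs subvolume delete /data/.borg-snap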


Yeah, of course you should be doing that. I was just trying to say that people shouldn't pretend like having snapshot-based, filesystem-level replication alone (which can be a very efficient way of replicating data changes) is a good backup strategy.


If you don't rebuild the zfs volume on the backup computer aren't you left with a) unlimited incremental backups, or b) doing a full backup every x days?


Does btrfs support native encryption?


Btrfs does not have native encryption.

I use LUKS / dm-crypt on entire drive partitions. Did that with ZFS too, and by the time ZFS got this feature I was already planning my migration to btrfs.

I like encrypting with LUKS, because I can have a drive configured with mine and my wife's password. Either one of us can take that drive, plug it in, and Gnome will put a nice graphical prompt asking for a password, then decrypt the drive and mount the filesystem.

If I get hit by a bus, LUKS makes it much easier for my family to get access to important data without having to have that data sit around somewhere in plaintext.
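A rough sketch of that setup, with the device path as a placeholder (each passphrase goes into its own LUKS keyslot):

    sudo cryptsetup luksFormat /dev/sdX1     # set the first passphrase
    sudo cryptsetup luksAddKey /dev/sdX1     # add the second person's passphrase to another keyslot
    sudo cryptsetup open /dev/sdX1 backup    # unlock; the filesystem then lives on /dev/mapper/backup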


It is coming: https://lwn.net/ml/linux-fsdevel/cover.1701468305.git.josef@...

The kernel fscrypt layer just needed a lot of refactoring first to support cow filesystems.


It's also had many data loss / corruption issues, and still looks to be scary for years to come.


I've never had data loss with either ZFS or btrfs.

For me on Fedora, uptime is better and maintenance overhead is lower on btrfs.


Right, those 0.7.x releases scared the hell out of me. It is like every other critical release is problematic. Those were the darkest hours of ZoL.


NixOS has seamless ZFS integration -- it's the problem of the package manager/distro, not of the tool


If my data is in ARC, why would I also want it to be in page cache?

L2ARC isn't typically recommended for many use-cases. Anyway, what's the conflict you're referencing?


> On Linux it doesn't even use the page cache

This is the case on FreeBSD as well, ZFS uses its own cache, but UFS and I guess other filesystems use the kernel page cache.



I come from the GPFS, Lustre, and Panasas world of HPCC.

Personally, I used (past tense) ZoL in 2014-2017 on Ubuntu. The issue is that the array eventually entered an unrecoverable state where it could no longer be mounted RW. That wasn't the end of the world, but the support from ZoL was to shrug at it. That was the end of that because without support and without pride, something that appears shiny is effectively useless.

Been running many XFS volumes over mdadm RAID10 near-2 arrays. Zero major problems in 5+ years with over 400 TiB online. SGI's, Redhat's, and more contributions to various Linux storage components are excellent.
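For reference, that kind of array is roughly this shape (device names and counts are placeholders):

    # four disks in RAID10 with the near-2 layout, XFS on top
    sudo mdadm --create /dev/md0 --level=10 --layout=n2 --raid-devices=4 /dev/sd[b-e]
    sudo mkfs.xfs /dev/md0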

My conclusion is that ZoL != Solaris ZFS. Once it left enterprise with controlled hardware and dedicated engineering & support teams, it regressed and devolved. Beware of fanboys where passion and tribalism exceeds evidence and reliability.


If you had a Solaris support contract then they might have had an engineer poke around in your dead filesystem with `zdb` to see if they could salvage anything. `zdb` still exists, so you could have done that yourself. Why didn’t you?


I'd be curious to know what you mean by "shrugged at it", with a link to the relevant bug.

I would suggest that you should probably have a support contract with _somebody_ if you want a guaranteed turnaround time on responses. Plenty of people are paid to work on OpenZFS, but if you're not the one paying them, then it's always going to be a best effort thing.

Since the incorrect code dates back to Sun, I don't know that you should be using this specific bug to cast stones.



> 8. On linux, do a `sudo modinfo zfs | grep version` to see the version number

Even easier than that, type "zfs version"; it will report both the loaded kernel module and the userland version.


thx, corrected :-)


Is there any good beginner-friendly documentation for zfs? I've started using it as a testing/learning NAS with a Raspberry Pi (cloning my Google library to Immich). It has not been a clearly easy process, and the errors are very much not clear. I recently extended a single drive to two, and now I can't import due to corrupted metadata; it reports a bad disk, but smartctl reports all fine. Stack Overflow is all over the place and reddit is... reddit. So is there a good go-to place for these kinds of issues? I suspect this will be far from my last one.


The TrueNAS forum. The Level 1 Techs YouTube channel. The Lawrence Systems YouTube channel. The man pages, of course.

First thing I'd suggest is really getting the terminology under your belt. Everything makes so much more sense when you use the correct terminology. For example, you don't extend a drive. That doesn't really make sense. I suspect what you mean is that you added a vdev to a zpool. Don't think about things in terms of disks and extending them, think about things in terms of one or more disks making up a vdev, and a zpool is, well, a pool of those vdevs. Zfs then works out how best to write data across the vdevs in the pool. Not disks.
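To make that concrete, a rough example (device names are placeholders, and zpool create will wipe them):

    # a pool named "tank" made of one mirror vdev (two disks)
    sudo zpool create tank mirror /dev/sda /dev/sdb
    # "extending" usually means adding another vdev to the pool
    sudo zpool add tank mirror /dev/sdc /dev/sdd
    # show how the pool is laid out: vdevs, disks, and their health
    zpool status tank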


Thanks. I'll look into these. Seems like a rough entry, but I can totally get it being "obvious" after you get through. I'm a vim user so I can do this hahaha


I find klarasystems articles and Mr Salter's articles on Ars Technica a good starting point.


Thanks!


Checksumming the files on ZFS is fine, even before the fix, unless the checksum program is attempting an optimization that seems insane to me.

Since the problem is in the read path, having copied them off ZFS might result in them being incorrect on the destination, modulo all the caveats about this being quite rare and hard to hit unless you're very deliberately trying.

You can reproduce this back on 0.6.5 and pre-OpenZFS merge FreeBSD and illumos, if you really want to.


Old geezer take on the situation:

Back in (say) the 1980's, the code for a serious OS's filesystem was small enough that formal verification was at least imaginable. Similar for writing test code for ~all of the corner cases.

Not now.

When and where you value reliability - try to keep things simple, stick to code paths well-tested by the passage of the masses, do at least some testing of your own, and have a Plan B.


What type of workload triggered the bug for you?



