XFS metadata corruption after upgrade to 6.3.3 kernel (redhat.com)
130 points by itvision on May 26, 2023 | 48 comments



Apparently the 6.3 kernel made significant changes to some relevant APIs -- I noticed OpenZFS hasn't been released with 6.3 support yet. https://github.com/openzfs/zfs/issues/14622


> So far, it looks like this will be a major undertaking. Whoever was responsible for the namespace support is going to have to rebase the entire dang thing with alternate code paths for everything, because as of Linux 6.3, they replaced a whole mess of APIs so they either newly accept struct mnt_idmap * instead of struct user_namespace *, or accept the former in addition to the previous parameters. This has far reaching implications, I think. Good luck to whoever attempts the task.

ouch


Ouch indeed. I've been wondering why it seems to be taking longer than usual for ZFS to build against the new kernel. Guess I'll be on LTS kernels for a while yet.

For once at least I feel like these changes aren't just the Linux devs breaking ZFS for funsies.


Support is already merged in git tip; a stable release just hasn't been cut with those changes backported yet.

I don't think it'd be that problematic to backport, from what I saw, just nobody's done it.


The struct mnt_idmap stuff was badly needed; before, struct user_namespace was being passed around for two different purposes - it was just asking for confusion and bugs.
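
For anyone curious what that churn looks like from an out-of-tree module's side, here's a rough sketch (the HAVE_IOPS_CREATE_IDMAP define and myfs_create are illustrative names, not taken from any particular tree; the real prototypes live in include/linux/fs.h). In 6.3 the inode_operations hooks that used to take a struct user_namespace * take a struct mnt_idmap * instead, so compat code ends up looking roughly like:

    /* Illustrative compat shim of the kind out-of-tree filesystems carry.
     * HAVE_IOPS_CREATE_IDMAP would be set by a configure-time check against
     * the kernel headers being built against. */
    #ifdef HAVE_IOPS_CREATE_IDMAP
    /* Linux 6.3+: the hook takes struct mnt_idmap * */
    static int myfs_create(struct mnt_idmap *idmap, struct inode *dir,
                           struct dentry *dentry, umode_t mode, bool excl)
    #else
    /* Linux 5.12 - 6.2: the same hook took struct user_namespace * */
    static int myfs_create(struct user_namespace *mnt_userns, struct inode *dir,
                           struct dentry *dentry, umode_t mode, bool excl)
    #endif
    {
            /* filesystem-specific create logic goes here */
            return 0;
    }

Multiply that by every hook that grew the new argument and you get the "far reaching implications" mentioned above.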


I keep having this kinky fantasy of a GPL'd Rust ZFS re-implementation. Yeah, I know.


You'd be more likely to get one for bcachefs. Kent has expressed some interest in Rust although he is (correctly) just focusing on finishing the filesystem first.


I'm curious which would stabilize more quickly: an all-new filesystem, or a reimplementation of an existing one in a different language. The reimpl has the advantage of having the original to compare and test against, but the fresh design can use a staged approach. Yet more hypothetical Friday-style questions.


Incremental file by file rewrite, for sure.


There’s a lot of unexplored stuff w.r.t. the GPL license, but I doubt that would legally work for a rewrite of CDDL-licensed ZFS to a GPL license.


An implementation in a different language should be pretty clear of license constraints? No code shared is no code shared.


koverstreet wrote “Incremental file by file rewrite, for sure.”

That implies intermediate versions that have both CDDL-licensed ZFS code and new Rust code. That Rust code, inside the ZFS code base, can't be GPL-licensed because of the incompatibility of the two licenses.

As the writer of that code, you can separately license your code as GPL if you want, but not in a product containing any CDDL-licensed code (at least in the FSF's reading; see https://en.wikipedia.org/wiki/Common_Development_and_Distrib...)

So, you could argue that you can license the full product as GPL once all CDDL-licensed code has been replaced, and then continue working on that version.

IMO, it is at the least unclear whether that’s legally true.

(All personal opinion, not legal advice, etc)


Even if you finish your rewrite and then release it, the final product might still be a derivative work of the original, because "translations" are legally codified as derivative works. Of course, this is bounded by the merger doctrine and various court cases regarding software compatibility. If you can show that everything in your version has to be structured the same way as the original in order for it to read and write ZFS volumes, thus making it a sort of 'external interface', then you probably have good ground to stand on against Oracle. But things like source file structure or internal call chains are unlikely to constitute an interface in a filesystem driver.

The closest legal example of a "file-by-file rewrite" would be BSD; but the lawsuit they got from UNIX System Laboratories was settled, so it creates no precedent.

The reason why the reverse engineering crowd talks a lot about clean-room is because you naturally discover the interface boundaries as part of the development process. Legally speaking, you don't need two separate and isolated groups of people, but starting from a document of how the system 'should' work and then adding in implementation details that you actually need is a good way to avoid accidentally copying things you don't need to.


Only a clean-room reimplementation: not looking at the code, just the functionality.


Could one do this alone if they had confirmed split personality disorder?


yep, just waiting on gcc-rust support to land and then we'll be moving our Rust interfaces into the kernel


gcc-rs will take a few years; rustc_codegen_gcc is pretty close to being ready to go.


Interesting that ElRepo is on 6.3.4.

http://elrepo.org/tiki/kernel-ml


"-ml" is always the latest upstream. "-lt" [1] will have the latest LTS kernel if that is what you were looking for.

Using the kernel-ml may expose your system to security, performance and/or data corruption issues.

[1] - https://elrepo.org/tiki/kernel-lt


And this is why I was never a fan of rolling release distros


Ignorant take


I think the title is somewhat unnecessarily editorialized from:

> Bug 2208553 - xfs metadata corruption after upgrade to 6.3.3 kernel [NEEDINFO]

There are reported problems, but also:

> I note that all these reported failures seem to be using hardware raid and with stripe (whether either is relevant is unknown).

> I am also using XFS (without issue) on a number of other systems with a 6.3 kernel, but none of them are using hardware raid or stripes.

Ed: That said - it's quite worrying to see corruption on XFS, as the filesystem should be really rock solid by now.


> I note that all these reported failures seem to be using hardware raid and with stripe (whether either is relevant is unknown).

That's not true. Later in the comments there is a person without HW RAID who has also been affected.

There's nothing indicating this is limited to 6.3.3 specifically. In fact, in the same discussion, kernel 6.3.1 was reported to have had this issue.

Also, a developer chimed in and said it probably affects the entire 6.3.x series, because the patches for 6.3.1-6.3.4 did not introduce any serious XFS-specific changes.


Damn, here I was hoping I was safe on openSUSE TW with kernel 6.3.2.

Thankfully I should be able to roll back with openSUSE's snapper. This is the key advantage of openSUSE TW in my opinion: the Btrfs filesystem alongside snapper helps maintain a reliable system while on the bleeding edge.


There is also LVM2 in the picture, but I'm not pointing any fingers. My gut feeling is this smells like some cache-type thing waking up to sync/flush buffers, and the pointer it follows lands it in the middle of some unwanted region.

https://en.wikipedia.org/wiki/Dm-cache

But I didn't see any evidence of caching in the BZ, so it's a potential red herring.

The data spanning the corrupted regions was an RSA key, which is fine... some folks might get their hackles up, but meh... I'm not sure why a key file would be cached, unless it's a read cache, and then the normal VFS cache would be where it lives.


So it only affects users who care about their data? Heck of a take.


It might not be xfs - it could be the hw raid drivers.

I get that the poster wanted to get the word out - but this (for now) only affects opt-in testers; some due diligence is expected.

Additionally, that doesn't change the fact that HN frowns on editorializing titles.


> It might not be xfs - it could be the hw raid drivers.

It could be that XFS users see the corruption first because of its CRC-verified metadata; you'd expect ZFS to see this quickly too, but another commenter mentioned that OpenZFS isn't even out yet for 6.3. Or perhaps because XFS happens to put its metadata where the corruption occurs. Unclear at this point.


At least one person in there reported it with no HW RAID controller in the mix.


> only affects opt-in testers

That's a falsehood. Kernel releases on kernel.org are _not_ considered "unstable". That's a myth open source aficionados have to drop ASAP.


I'm sorry - in the context of RHEL (ed: Fedora) - this affects opt-in testers.

Tbh, I'd say anyone compiling their kernel from kernel.org releases is also doing opt-in testing in a certain sense (you're more likely to end up with a concrete "sum" of compiled/configured kernel/user space that is unique to your setup). Contrast this with a (stable) distro release, where you can at least hope what you get has been tested (like by the people reporting this bug).

I certainly agree that any stable kernel.org release is expected to work - but also somewhat expected to receive fixes for as-yet unknown bugs (hence point releases).

TFA is a bug report against RHEL (ed: Fedora) - I can't find anything (recent) on LKML?


The "opt-in testers" is referring to the context of this Bugzilla report against Fedora. The only Fedora users who will have the 6.3 kernel series on their systems are those who willingly and intentionally enable the updates-testing repository. It is not a reference to the Linux ecosystem at large.

Putting that aside... this probably should have been caught by upstream before releasing, though. I find it a bit hard to believe that no kernel developer, particularly those working on the storage subsystems, has some kind of system using XFS within a hardware or software RAID configuration.


> The only Fedora users who will have the 6.3 kernel series on their systems are those who willingly and intentionally enable the updates-testing repository.

1. People compile released kernels found on kernel.org. I do. 2. Multiple distros ship kernel.org releases quite fast, including Arch and Gentoo. There's also an Ubuntu PPA with the mainline kernel that many people I personally know use. 3. Where's the line between "stable" and unstable/beta/whatever? Who draws it? What about "stable" LTS kernel regressions, of which there have been plenty so far?

Also please refer to this comment https://www.phoronix.com/forums/forum/phoronix/latest-phoron...

because this argument that "LTS or distro kernels are stable" has never been true. It's been nothing but a very bad myth.

> a hardware or software RAID configuration.

Again, there's a person without RAID in the comments. Could you please drop it? You're making it look like those people are somehow extremely unlucky and everyone else is safe.


> 1. People compile released kernels found on kernel.org. I do. 2. Multiple distros ship kernel.org releases quite fast, including Arch and Gentoo. There's also an Ubuntu PPA with the mainline kernel that many people I personally know use.

To repeat myself, this is the _Red Hat Bugzilla bug tracker for Fedora_. While what you say is true, that's outside the context of what the OP and I are saying, and of the statement in the ticket. As seen in the ticket, users of other distros and kernel configs have made comments reporting their experience with the 6.3 series, but that is beside the point I was making above.

Anyone performing their own builds (even on Fedora) is more than welcome to contribute and test, but that's outside the bounds of the Fedora QA process.

> Again, there's a person without RAID in the comments. Could you please drop it? You're making it look like those people are somehow extremely unlucky and everyone else is safe.

I'm not the original thread starter, but I read through the ticket earlier and unless I misinterpreted one of the comments or missed something in the wall of text that is RHBZ, I didn't see a RAID-less reporter.


> I didn't see a RAID-less reporter.

https://bugzilla.redhat.com/show_bug.cgi?id=2208553#c24 says "no hardware RAID", which was the original context of the claims about RAID versus no RAID.

Also reportedly https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin... fixes it, with the hypothesis being that sometimes, instead of livelocking, they return an incorrect mapping to be written out.


> says "no hardware RAID", which was the original context of the claims about RAID versus no RAID.

That's fair, but also the reason why I specified hardware and software RAID in my prior comment. Though at this point I think we're getting to be overly pedantic :D

Looks like 6.3.4 was published with that fix in place. Hopefully this issue can be moved on from!


> this argument that "LTS or distro kernels are stable" has never been true. It's been nothing but a very bad myth.

It might have some mythical aspects, but if 10,000 people have run a certain kernel/FS combination without issue for a few months, you can reasonably expect that particular set of binary artifacts to work well if you install/run them in the same way.

Now, if you change the sources, adjust the configs and compile flags, you can be less certain everything will work. That doesn't mean kernel.org kernels are unstable - it just means change drives risk (often justifiable risk).


Waht "myth" ? Most people just dont compile kernel directly from there but instead takes it from distro


Are you implying that hardware RAID is safer than software RAID?


If it's battery-backed, yes.


It shouldn't be safer. However, using a battery-backed write cache in hardware RAID allows the OS to just dump the writes into the cache, get the message that they're safe very quickly, and then move on to the next thing. So, a battery-backed write cache allows it to be faster while still being safe. You can still get the same level of safety without it, and you can get very close to the same speed by switching off all the protections (don't do that), or by logging writes to a separate fast storage device (like an SSD or battery-backed RAM).
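
To make the ordering concrete, here's a minimal userspace sketch in plain POSIX C ("testfile" is just a made-up path): fsync() is the point where the OS waits for that "they're safe" message, and a battery-backed write cache is what lets the answer come back quickly instead of waiting on the actual media.

    /* Minimal sketch: write, then fsync. With a battery-backed controller
     * cache, fsync() can return as soon as the data sits in the cache;
     * without one, the device has to commit the blocks to stable media
     * before reporting success. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        const char buf[] = "important data\n";
        if (write(fd, buf, sizeof(buf) - 1) != (ssize_t)(sizeof(buf) - 1)) {
            perror("write");
            return 1;
        }

        /* Blocks until the storage stack reports the data durable. */
        if (fsync(fd) != 0) { perror("fsync"); return 1; }

        return close(fd);
    }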


I was under the impression that hardware RAID is generally considered unreliable.


It's complicated. For consumer-level users I'd recommend software RAID, because knowing what you're getting with hardware RAID is tricky. If your motherboard offers built-in RAID it's almost certainly crap and you're better off disabling it first thing. In general these aren't actually hardware RAID anyway; it's implemented in the drivers.

Low-end RAID cards are sometimes better, but mdraid or Windows RAID will still probably be more reliable.

It's really only expensive enterprise-grade RAID cards that meaningfully compete with software RAID, and you'll want to look at the tradeoffs of each.


Hardware RAID is pretty much a niche at this stage. And unreliable and unsafe in general. Using it is like having an open fire pit inside your home because it's nice and cozy, and assuming that your house not having burned down yet means it's a good idea.

So far the only deployments where I've still seen it somewhat relevant are single-node Windows machines without virtualisation (so the balance between trusting Windows with your block storage vs. trusting a RAID controller swings towards the controller), and large proprietary SANs where you don't really have a choice.

Everything else is either local software RAID (or ZFS), or a cluster filesystem with block storage support for SAN consumption.

And then there is the worst of all worlds: firmware-enabled software RAID like the ones you get with consumer mainboards. As far as I know, that only exists because of Windows and the legacy inability to boot from anything except a single-disk magic method on MBR, or single-ESP-partition booting in UEFI. It's all just crutches so that bad systems don't have to do it correctly.


> So far the only deployments where I've still seen it somewhat relevant

I'm glad that all your deployments have competent sysadmins supervising them. Sadly, this is not how it works in real life, so I doubt your experience covers 100% of use cases.


Hardware RAID is now almost universal; we just call it "SSDs". With an SSD, the relationship between the kernel's block device abstraction and the reality of how the blocks are stored has been interdicted by an opaque controller that decides which blocks to write, how to exploit device parallelism, and what mathematical codes to use for redundancy.


Uh, no. The main purpose of RAID is protecting against device failure; SSDs do nothing to address that, as the controller managing the storage chips can still fail, or go "well, one of the chips is dead, you're not getting your data back".


It's generally a PITA, as you need to use a piece of proprietary software to manage it, and it often has more limitations than mdadm when it comes to what is possible and migration between RAID levels.



