Inode non-uniqueness is a hell of a problem to debug: most of the time it isn't a problem at all, and then you run into a corner case where something depends on uniqueness.
E.g., we had a big issue with this back around 2008, when we accidentally ended up with two AFS volumes with overlapping inode ranges (AFS was represented as a single device, even if there were multiple volumes mounted).
Linux's dynamic linker had an optimization that cached device/inode for DSOs that had already been loaded, and wouldn't attempt to load them again.
If you had a binary that depended on linking two DSOs from different volumes that had different names but the same inode number, the first one would get linked OK, but the linker would then ignore the second one and you'd get missing dynamic linker dependencies out of nowhere.
That was the first, and hopefully last, time I've read the source to ld.so during a production outage.
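The optimization boils down to keying already-loaded objects by (st_dev, st_ino). A rough sketch of that logic (illustrative only, not the actual ld.so code; the AFS paths in the comment are made up):

    import os
    import sys

    # Cache of already-loaded objects, keyed the way the linker keyed them:
    # by (device, inode) rather than by pathname.
    loaded = {}  # (st_dev, st_ino) -> path of the DSO already mapped

    def load_dso(path):
        st = os.stat(path)
        key = (st.st_dev, st.st_ino)
        if key in loaded:
            # On that AFS setup, two different libraries on two volumes could
            # collide here: one fake device, overlapping inode ranges, so the
            # second library is wrongly treated as "already loaded" and skipped.
            print(f"skipping {path}: assumed identical to {loaded[key]}")
            return
        loaded[key] = path
        print(f"mapping {path}")

    # e.g. load_dso("/afs/vol1/lib/libfoo.so"); load_dso("/afs/vol2/lib/libbar.so")
    for path in sys.argv[1:]:
        load_dso(path)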
The Linux file interface was not designed for subvolumes to masquerade as directories.
The correct course of action would have been for Linus to not allow btrfs into mainline with such an interface. Each subvolume should have been separately mounted, or the stat interface should have been extended to make subvolumes first-class.
If I understand the second part of this article correctly, the first proposed solution was to replace those magic subvolume directories with automounted directories, but some standard uses of btrfs involve so many subvolumes that the kernel code managing mounts ran into scaling issues. That would probably be a good problem to address eventually, but it's a big effort in itself.
An ugly workaround would be to embed the subvolume id into the inode number, but it would all but require 64-bit inode numbers, and some tools make daring assumptions about the stability of inode numbers across reboots.
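For illustration, the embedding would look something like the sketch below; the field split is invented, not anything btrfs actually does:

    SUBVOL_BITS = 24                    # made-up split: 24 bits of subvolume id,
    LOCAL_BITS = 64 - SUBVOL_BITS       # 40 bits of per-subvolume inode number

    def pack_ino(subvol_id: int, local_ino: int) -> int:
        """Embed a subvolume id in the high bits of a 64-bit inode number."""
        assert subvol_id < (1 << SUBVOL_BITS)
        assert local_ino < (1 << LOCAL_BITS)
        return (subvol_id << LOCAL_BITS) | local_ino

    def unpack_ino(ino: int) -> tuple[int, int]:
        return ino >> LOCAL_BITS, ino & ((1 << LOCAL_BITS) - 1)

    # Either field overflowing its slice silently breaks uniqueness again, and
    # the packed number changes if subvolume ids are ever reassigned, which is
    # exactly the kind of stability some tools assume inode numbers have.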
It seems like there is no agreement on whether inode numbers should be unique, or on what information is required to uniquely identify a file or directory. It does not surprise me that Linux is still looking for a solution while that question remains open.
From the first part alone, much of the 'sin' is likely non-unique inode numbers across a pool of BTRFS devices. I could understand partitioning ranges of new numbers for performance; the non-uniqueness seems just silly.
Ranges aren't possible because there aren't enough bits in a 64-bit inode.
ZFS doesn't have this problem because each dataset (the equivalent of subvolume) is treated as a separate mountpoint. This can be annoying with NFS because you need to export and mount each dataset separately.
> This can be annoying with NFS because you need to export and mount each dataset separately.
I use ZFS on multiple servers, and appreciate this approach. The problem is -- you're going to feel pain one way or the other. In that case, make sure that the pain is obvious and predictable.
If you're using subvolumes/datasets, you will have to deal with a problem at some point. Either you have to manually export multiple NFS volumes (ZFS), or potentially have inode uniqueness issues (Btrfs).
I'd much rather have the problems be excruciatingly obvious. I can script generating a config file for exporting many datasets (and I have done so). I can't really deal with non-unique inodes in the same manner.
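That generator really can be a few lines. A minimal sketch, assuming a hypothetical tank/home dataset prefix and placeholder client/export options rather than any real config:

    import subprocess

    # Hypothetical settings: which datasets to export and to whom.
    PREFIX = "tank/home"
    CLIENTS = "10.0.0.0/24"
    OPTIONS = "rw,no_subtree_check,root_squash"

    def list_datasets():
        # `zfs list -H -o name,mountpoint -r tank/home` prints tab-separated
        # name/mountpoint pairs with no header line.
        out = subprocess.run(
            ["zfs", "list", "-H", "-o", "name,mountpoint", "-r", PREFIX],
            check=True, capture_output=True, text=True,
        ).stdout
        for line in out.splitlines():
            name, mountpoint = line.split("\t")
            if mountpoint not in ("none", "legacy"):
                yield name, mountpoint

    def exports_file():
        # One /etc/exports-style line per mounted dataset.
        return "".join(
            f"{mountpoint}\t{CLIENTS}({OPTIONS})\n"
            for _, mountpoint in list_datasets()
        )

    if __name__ == "__main__":
        print(exports_file(), end="")

Write the output to the exports file and run exportfs -ra (or whatever your distro's equivalent is) and the per-dataset pain becomes a cron job.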
Overall I agree, but on our build server the person who can add new branches is not the same person who can add new entries to the autofs map, so we don't use one dataset per branch, even though that would make other things a lot easier.
That’s true. For me, I made a dataset for each user as a home directory. Each home directory was then exported over NFS.
In order to automate the export, I have a single script that can be executed with sudo on that system to manage both the creation of the dataset and the export.
So, there are still ways, if you’re able to use sudo.
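A sketch of what such a sudo-able script could look like, with a made-up pool name and sharenfs setting standing in for the real ones:

    #!/usr/bin/env python3
    """Create a per-user home dataset and export it over NFS.

    Meant to be the single entry point run via sudo (or a restricted SSH
    command); the pool name and share options below are placeholders.
    """
    import subprocess
    import sys

    POOL = "tank/home"          # hypothetical parent dataset
    SHARENFS = "on"             # or platform-specific export options

    def run(*cmd):
        subprocess.run(cmd, check=True)

    def create_home(user: str):
        dataset = f"{POOL}/{user}"
        # Create the dataset mounted at /home/<user> ...
        run("zfs", "create", "-o", f"mountpoint=/home/{user}", dataset)
        # ... hand it to the user ...
        run("chown", f"{user}:{user}", f"/home/{user}")
        # ... and let ZFS manage the NFS export for it.
        run("zfs", "set", f"sharenfs={SHARENFS}", dataset)

    if __name__ == "__main__":
        create_home(sys.argv[1])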
My LDAP and NFS servers are separate too. I control both, so it’s a bit different, but I still have it set up so that account creation and home directory creation/export are handled in one script.
The account creation script (on the LDAP server) makes an SSH call to the NFS server to run the home directory creation/export update script. That specific SSH certificate is password-less and restricted to running only that single command. It could be root or another user that calls sudo.
So, some of what you’re looking for can be done with some SSH tricks… but only if both sides are comfortable with that setup. The benefit is that each side can manage their own scripts: the NIS admin writes the script on their side, and the NFS admin on theirs. You just need to establish the workflow that works best for you.
Our cluster is completely self-contained, so with sudo and SSH restrictions, I’m not concerned about security issues with this setup.
So how does ZFS handle the same problem? (I would guess that it just goes ahead and makes each dataset a full filesystem and deals with the overhead for NFS, but I don't know that.)
To me it seems that ZFS snapshots represent a similar challenge[1], given that the snapshots can be accessed through the .zfs directory, and that ZFS also plays tricks[2] with the inodes.
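One way to see what those tricks look like from userspace is to compare the stat identity of a live file with the same file under .zfs/snapshot; a small sketch (the paths in the usage comment are hypothetical):

    import os
    import sys

    # Usage: compare_identity.py <live-file> <same-file-in-a-snapshot>
    # e.g. /tank/data/file.txt  /tank/data/.zfs/snapshot/yesterday/file.txt
    live, snap = sys.argv[1], sys.argv[2]

    for path in (live, snap):
        st = os.stat(path)
        print(f"{path}: dev={st.st_dev:#x} ino={st.st_ino}")

    # If the two lines show the same inode number but different device numbers,
    # tools that key purely on st_ino will conflate the files; tools that key
    # on the full (st_dev, st_ino) pair will not.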
BTRFS has volumes/subvolumes while ZFS has pools/filesystems. I suspect they have considerably different implementations but the userspace implications are similar.
Over the years I have run into so many idiosyncrasies in btrfs that I have come to the conclusion that you should either (a) understand its internals _very_ well, or (b) only use it with small volumes. I have gone with option B because I don't read their mailing list on a regular basis.
I've run a service for 10 years and during that time, there have only been 2 frightening outages (that were difficult to understand and debug) - both of them were due to btrfs.
Don't hard links have the same inode, since they are essentially the same file included in a different directory? Is this also why we don't have hard links to directories?
Yes, so a client program that sees two files with the same inode will correctly assume that they are the same file. However, that assumption breaks with snapshots, since the same inode number may refer to the same file but in different snapshots, and those are no longer the same thing.
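A quick demonstration of the identity check hard links rely on (the paths are just temporary scratch files):

    import os
    import tempfile

    # Create a file and a hard link to it, then compare their stat identity.
    d = tempfile.mkdtemp()
    a = os.path.join(d, "a")
    b = os.path.join(d, "b")
    open(a, "w").close()
    os.link(a, b)                      # b is a hard link to a

    sa, sb = os.stat(a), os.stat(b)
    same = (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)
    print(f"a: dev={sa.st_dev} ino={sa.st_ino} nlink={sa.st_nlink}")
    print(f"b: dev={sb.st_dev} ino={sb.st_ino} nlink={sb.st_nlink}")
    print("same file:", same)          # True for hard links

    # The snapshot problem is the reverse failure mode: two genuinely different
    # files (a live file and its snapshot copy) can report the same st_ino, so
    # the check above only stays honest if st_dev differs between them.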
Similarly, after I got stuck with a corrupted XFS volume that couldn't be repaired automatically or manually (the repair process would take literal days and then fail in some unhelpful way), I gave up and went back to using ext4 for everything.
The original research Unix systems had a filesystem that resolved filenames to inode numbers (which were basically indices into a lookup table of inode structures). Kinda sorta like DNS where you piece-by-piece resolve a hierarchically structured string into a number that you can use to directly address a target, until you reach the last one.
If two paths resolve to the same inode number, and thus to the same inode and data, that's called a hard link. The whole concept of inodes and inode numbers is deeply ingrained into the Unix userspace API and Unix filesystem semantics.
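You can make that piece-by-piece resolution visible from userspace by stat-ing each prefix of a path and printing the inode number it resolves to; a small sketch (the example path is arbitrary):

    import os

    def resolve_chain(path):
        """Print the inode number each path component resolves to."""
        parts = path.strip("/").split("/")
        prefix = "/"
        print(f"{prefix!r:24} -> inode {os.stat(prefix).st_ino}")
        for part in parts:
            prefix = os.path.join(prefix, part)
            print(f"{prefix!r:24} -> inode {os.stat(prefix).st_ino}")

    resolve_chain("/usr/bin/env")   # any existing path works here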
The whole purpose of inode numbers is to not deal with strings.
What you are suggesting is essentially a fundamental redesign of Unix filesystem semantics that requires each file to have at least two paths: a hierarchical, user-generated one, and a non-hierarchical, auto-generated one.
That would create a whole lot of other problems and probably some performance penalty, because strings are much, much more difficult to work with in the context of a kernel filesystem driver.