Inode non-uniqueness is a hell of a problem to debug: most of the time it isn't a problem at all, and then you run into a corner case where something depends on uniqueness.
E.g., we had a big issue with this back around 2008, when we accidentally ended up with two AFS volumes with overlapping inode ranges (AFS was represented as a single device, even if there were multiple volumes mounted).
Linux's dynamic linker had an optimization that cached device/inode for DSOs that had already been loaded, and wouldn't attempt to load them again.
If you had a binary that depended on linking two DSOs from different volumes that had different names but the same inode number, the first one would get linked OK, but the linker would then ignore the second one and you'd get missing dynamic linker dependencies out of nowhere.
That was the first, and hopefully last, time I've read the source to ld.so during a production outage.
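The optimization boils down to keying already-loaded objects by (st_dev, st_ino). A rough sketch of that logic (illustrative only, not the actual ld.so code; the AFS paths in the comment are made up):

    import os
    import sys

    # Cache of already-loaded objects, keyed the way the linker keyed them:
    # by (device, inode) rather than by pathname.
    loaded = {}  # (st_dev, st_ino) -> path of the DSO already mapped

    def load_dso(path):
        st = os.stat(path)
        key = (st.st_dev, st.st_ino)
        if key in loaded:
            # On that AFS setup, two different libraries on two volumes could
            # collide here: one fake device, overlapping inode ranges, so the
            # second library is wrongly treated as "already loaded" and skipped.
            print(f"skipping {path}: assumed identical to {loaded[key]}")
            return
        loaded[key] = path
        print(f"mapping {path}")

    # e.g. load_dso("/afs/vol1/lib/libfoo.so"); load_dso("/afs/vol2/lib/libbar.so")
    for path in sys.argv[1:]:
        load_dso(path)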
The Linux file interface was not designed for subvolumes to masquerade as directories.
The correct course of action would have been for Linus to not allow btrfs into mainline with such an interface. Each subvolume should have been separately mounted, or the stat interface should have been extended to make subvolumes first-class.
If I understand the second part of this article correctly, the first proposed solution was to replace those magic subvolume directories with automounted directories, but some standard uses of btrfs involve so many subvolumes that the kernel code managing mounts ran into scaling issues. That would probably be a good problem to address eventually, but it's a big effort in itself.
An ugly workaround would be to embed the subvolume id into the inode number, but it would all but require 64-bit inode numbers, and some tools make daring assumptions about the stability of inode numbers across reboots.
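For illustration, the embedding would look something like the sketch below; the field split is invented, not anything btrfs actually does:

    SUBVOL_BITS = 24                    # made-up split: 24 bits of subvolume id,
    LOCAL_BITS = 64 - SUBVOL_BITS       # 40 bits of per-subvolume inode number

    def pack_ino(subvol_id: int, local_ino: int) -> int:
        """Embed a subvolume id in the high bits of a 64-bit inode number."""
        assert subvol_id < (1 << SUBVOL_BITS)
        assert local_ino < (1 << LOCAL_BITS)
        return (subvol_id << LOCAL_BITS) | local_ino

    def unpack_ino(ino: int) -> tuple[int, int]:
        return ino >> LOCAL_BITS, ino & ((1 << LOCAL_BITS) - 1)

    # Either field overflowing its slice silently breaks uniqueness again, and
    # the packed number changes if subvolume ids are ever reassigned, which is
    # exactly the kind of stability some tools assume inode numbers have.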
It seems like there is no agreement on whether inode numbers should be unique, or on what information is required to uniquely identify a file or directory. It does not surprise me that Linux is still looking for a solution while that question remains open.
From the first part alone, much of the 'sin' is likely non-unique inode numbers across a pool of BTRFS devices. I could understand partitioning ranges of new numbers for performance; the non-uniqueness seems just silly.
Ranges aren't possible because there aren't enough bits in a 64-bit inode.
ZFS doesn't have this problem because each dataset (the equivalent of subvolume) is treated as a separate mountpoint. This can be annoying with NFS because you need to export and mount each dataset separately.
> This can be annoying with NFS because you need to export and mount each dataset separately.
I use ZFS on multiple servers, and appreciate this approach. The problem is -- you're going to feel pain one way or the other. In that case, make sure that the pain is obvious and predictable.
If you're using subvolumes/datasets, you will have to deal with a problem at some point. Either you have to manually export multiple NFS volumes (ZFS), or potentially have inode uniqueness issues (Btrfs).
I'd much rather have the problems be excruciatingly obvious. I can script generating a config file for exporting many datasets (and I have done so). I can't really deal with non-unique inodes in the same manner.
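That generator really can be a few lines. A minimal sketch, assuming a hypothetical tank/home dataset prefix and placeholder client/export options rather than any real config:

    import subprocess

    # Hypothetical settings: which datasets to export and to whom.
    PREFIX = "tank/home"
    CLIENTS = "10.0.0.0/24"
    OPTIONS = "rw,no_subtree_check,root_squash"

    def list_datasets():
        # `zfs list -H -o name,mountpoint -r tank/home` prints tab-separated
        # name/mountpoint pairs with no header line.
        out = subprocess.run(
            ["zfs", "list", "-H", "-o", "name,mountpoint", "-r", PREFIX],
            check=True, capture_output=True, text=True,
        ).stdout
        for line in out.splitlines():
            name, mountpoint = line.split("\t")
            if mountpoint not in ("none", "legacy"):
                yield name, mountpoint

    def exports_file():
        # One /etc/exports-style line per mounted dataset.
        return "".join(
            f"{mountpoint}\t{CLIENTS}({OPTIONS})\n"
            for _, mountpoint in list_datasets()
        )

    if __name__ == "__main__":
        print(exports_file(), end="")

Write the output to the exports file and run exportfs -ra (or whatever your distro's equivalent is) and the per-dataset pain becomes a cron job.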
Overall I agree, but on our build server the person who can add new branches is not the same person who can add new entries to the autofs map, so we don't use one dataset per branch, even though that would make other things a lot easier.
That’s true. For me, I made a dataset for each user as a home directory. Each home directory was then exported over NFS.
In order to automate the export, I have a single script that can be executed with sudo on that system to manage both the creation of the dataset and the export.
So, there are still ways, if you’re able to use sudo.
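A sketch of what such a sudo-able script could look like, with a made-up pool name and sharenfs setting standing in for the real ones:

    #!/usr/bin/env python3
    """Create a per-user home dataset and export it over NFS.

    Meant to be the single entry point run via sudo (or a restricted SSH
    command); the pool name and share options below are placeholders.
    """
    import subprocess
    import sys

    POOL = "tank/home"          # hypothetical parent dataset
    SHARENFS = "on"             # or platform-specific export options

    def run(*cmd):
        subprocess.run(cmd, check=True)

    def create_home(user: str):
        dataset = f"{POOL}/{user}"
        # Create the dataset mounted at /home/<user> ...
        run("zfs", "create", "-o", f"mountpoint=/home/{user}", dataset)
        # ... hand it to the user ...
        run("chown", f"{user}:{user}", f"/home/{user}")
        # ... and let ZFS manage the NFS export for it.
        run("zfs", "set", f"sharenfs={SHARENFS}", dataset)

    if __name__ == "__main__":
        create_home(sys.argv[1])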
My LDAP and NFS servers are separate too. I control both, so it’s a bit different, but I still have it set up so that account creation and home directory creation/export are handled in one script.
The account creation script (on the LDAP server) makes an SSH call to the NFS server to run the home directory creation/export update script. That specific SSH certificate is password-less and restricted to running only that single command. It could be root or another user that calls sudo.
So, some of what you’re looking for can be done with some SSH tricks… but only if both sides are comfortable with that setup. The benefit is that each side can manage their own scripts: the NIS admin writes the script on their side, and the NFS admin on theirs. You just need to establish the workflow that works best for you.
Our cluster is completely self-contained, so with sudo and SSH restrictions, I’m not concerned about security issues with this setup.
So how does ZFS handle the same problem? (I would guess that it just goes ahead and makes each dataset a full filesystem and deals with the overhead for NFS, but I don't know that.)
To me it seems that ZFS snapshots represent a similar challenge[1], given that the snapshots can be accessed through the .zfs directory, and that ZFS also plays tricks[2] with the inodes.
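One way to see what those tricks look like from userspace is to compare the stat identity of a live file with the same file under .zfs/snapshot; a small sketch (the paths in the usage comment are hypothetical):

    import os
    import sys

    # Usage: compare_identity.py <live-file> <same-file-in-a-snapshot>
    # e.g. /tank/data/file.txt  /tank/data/.zfs/snapshot/yesterday/file.txt
    live, snap = sys.argv[1], sys.argv[2]

    for path in (live, snap):
        st = os.stat(path)
        print(f"{path}: dev={st.st_dev:#x} ino={st.st_ino}")

    # If the two lines show the same inode number but different device numbers,
    # tools that key purely on st_ino will conflate the files; tools that key
    # on the full (st_dev, st_ino) pair will not.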
BTRFS has volumes/subvolumes while ZFS has pools/filesystems. I suspect they have considerably different implementations but the userspace implications are similar.
Over the years I have run into so many idiosyncrasies in btrfs that I have come to the conclusion that you should either (a) understand its internals _very_ well, or (b) only use it with small volumes. I have gone with option B because I don't read their mailing list on a regular basis.
I've run a service for 10 years and during that time, there have only been 2 frightening outages (that were difficult to understand and debug) - both of them were due to btrfs.
Don't hard links have the same inode, since they are essentially the same file included in a different directory? Is this also why we don't have hard links to directories?
Yes, so a client program that sees two files with the same inode will correctly assume that they are the same file. However, that assumption breaks with snapshots, since the same inode number may refer to the same file but in different snapshots, and those are no longer the same thing.
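A quick demonstration of the identity check hard links rely on (the paths are just temporary scratch files):

    import os
    import tempfile

    # Create a file and a hard link to it, then compare their stat identity.
    d = tempfile.mkdtemp()
    a = os.path.join(d, "a")
    b = os.path.join(d, "b")
    open(a, "w").close()
    os.link(a, b)                      # b is a hard link to a

    sa, sb = os.stat(a), os.stat(b)
    same = (sa.st_dev, sa.st_ino) == (sb.st_dev, sb.st_ino)
    print(f"a: dev={sa.st_dev} ino={sa.st_ino} nlink={sa.st_nlink}")
    print(f"b: dev={sb.st_dev} ino={sb.st_ino} nlink={sb.st_nlink}")
    print("same file:", same)          # True for hard links

    # The snapshot problem is the reverse failure mode: two genuinely different
    # files (a live file and its snapshot copy) can report the same st_ino, so
    # the check above only stays honest if st_dev differs between them.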
Similarly, after I got stuck with a corrupted XFS volume that couldn't be repaired automatically or manually (the repair process would take literal days and then fail in some unhelpful way), I gave up and went back to using ext4 for everything.
The original research Unix systems had a filesystem that resolved filenames to inode numbers (which were basically indices into a lookup table of inode structures). Kinda sorta like DNS where you piece-by-piece resolve a hierarchically structured string into a number that you can use to directly address a target, until you reach the last one.
If two paths resolve to the same inode number, and thus to the same inode and data, that's called a hard link. The whole concept of inodes and inode numbers is deeply ingrained into the Unix userspace API and Unix filesystem semantics.
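You can make that piece-by-piece resolution visible from userspace by stat-ing each prefix of a path and printing the inode number it resolves to; a small sketch (the example path is arbitrary):

    import os

    def resolve_chain(path):
        """Print the inode number each path component resolves to."""
        parts = path.strip("/").split("/")
        prefix = "/"
        print(f"{prefix!r:24} -> inode {os.stat(prefix).st_ino}")
        for part in parts:
            prefix = os.path.join(prefix, part)
            print(f"{prefix!r:24} -> inode {os.stat(prefix).st_ino}")

    resolve_chain("/usr/bin/env")   # any existing path works here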
The whole purpose of inode numbers is to not deal with strings.
What you are suggesting is essentially a fundamental redesign of Unix filesystem semantics that requires each file to have at least two paths: a hierarchical, user-generated one, and a non-hierarchical, auto-generated one.
That would create a whole lot of other problems and probably some performance penalty, because strings are much, much more difficult to work with in the context of a kernel filesystem driver.