The Btrfs inode-number epic (part 1: the problem) (lwn.net)
99 points by Deeg9rie9usi on Aug 24, 2021 | 39 comments



Inode non-uniqueness is a hell of a problem to debug: it's not necessarily a problem most of the time, but then you'll run into a corner case where something depends on it.

E.g. we had a big issue with this back around 2008, when we accidentally had two AFS volumes with overlapping inode ranges (AFS was represented as a single device, even if multiple volumes were mounted).

Linux's dynamic linker had an optimization that cached device/inode for DSOs that had already been loaded, and wouldn't attempt to load them again.

If you had a binary that depended on linking two DSOs from different volumes that had different names but the same inode number, the first one would get linked OK, but the linker would then ignore the second one and you'd get missing dynamic-linker dependencies out of nowhere.

This was the first, and hopefully last, time I read the source of ld.so during a production outage.
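
For flavor, here's a minimal sketch of the kind of device/inode cache described above (not glibc's actual ld.so code, just an assumed shape of the optimization). When several volumes masquerade as one device, two different libraries can collide on the (st_dev, st_ino) key and the second one is silently skipped:

    /* Hypothetical sketch of a loader-side cache keyed on (st_dev, st_ino). */
    #include <stdbool.h>
    #include <stdlib.h>
    #include <sys/stat.h>

    struct loaded_dso {
        dev_t dev;
        ino_t ino;
        struct loaded_dso *next;
    };

    static struct loaded_dso *loaded_list;

    /* Returns true if an object with this device/inode pair was already
     * mapped; with overlapping inode ranges on one device this wrongly
     * reports a *different* DSO as "already loaded". */
    static bool already_loaded(const struct stat *st)
    {
        for (struct loaded_dso *l = loaded_list; l; l = l->next)
            if (l->dev == st->st_dev && l->ino == st->st_ino)
                return true;
        return false;
    }

    static void record_loaded(const struct stat *st)
    {
        struct loaded_dso *l = malloc(sizeof(*l));
        if (!l)
            return;
        l->dev = st->st_dev;
        l->ino = st->st_ino;
        l->next = loaded_list;
        loaded_list = l;
    }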


Here you can find part 2 (the solutions): https://lwn.net/SubscriberLink/866709/671690ea60c1cb37/


Don't miss Dave Chinner's interesting comment comparing btrfs, bcachefs and others: https://lwn.net/Articles/867427/


Thank you


The Linux file interface was not designed for subvolumes to masquerade as directories.

The correct course of action would have been for Linus to not allow btrfs into mainline with such an interface. Each subvolume should have been separately mounted, or the stat interface should have been extended to make subvolumes first-class.


If I understand the second part of this article correctly, the first proposed solution was to replace those magic subvolume directories with automounted directories, but some standard uses of btrfs have so many subvolumes that the kernel code managing mounts was running into scaling issues. That would probably be a good problem to address eventually, but it's a big effort in itself.


An ugly workaround would be to embed the subvolume id into the inode number, but it would all but require 64-bit inode numbers, and some tools make daring assumptions about the stability of inode numbers across reboots.
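
Purely as an illustration (the bit split below is made up, not anything btrfs actually does), embedding the subvolume id would amount to partitioning the 64-bit inode space like this:

    #include <stdint.h>

    /* Hypothetical split: 24 bits of subvolume id, 40 bits of per-subvolume
     * inode number. Anything wider than its field would still collide. */
    #define SUBVOL_BITS 24
    #define INODE_BITS  40

    static uint64_t pack_ino(uint64_t subvol_id, uint64_t local_ino)
    {
        return ((subvol_id & ((1ULL << SUBVOL_BITS) - 1)) << INODE_BITS)
             | (local_ino & ((1ULL << INODE_BITS) - 1));
    }

    static uint64_t subvol_of(uint64_t ino) { return ino >> INODE_BITS; }
    static uint64_t local_of(uint64_t ino)  { return ino & ((1ULL << INODE_BITS) - 1); }

Whatever split you pick, one side eventually runs out of bits, which is the objection raised further down the thread.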


As far as I can see, subvolume ids are stable across reboots.


It seems like there is no agreement on whether inode numbers should be unique or not, or on what information is required to uniquely identify a file or directory. It does not surprise me that Linux is still looking for a solution while this question is open.


From the first part alone, much of the 'sin' is likely non-unique inode numbers across a pool of BTRFS devices. I could understand partitioning ranges of new numbers for performance; the non-uniqueness seems just silly.


Ranges aren't possible because there aren't enough bits in a 64-bit inode.

ZFS doesn't have this problem because each dataset (the equivalent of subvolume) is treated as a separate mountpoint. This can be annoying with NFS because you need to export and mount each dataset separately.


> This can be annoying with NFS because you need to export and mount each dataset separately.

I use ZFS on multiple servers, and appreciate this approach. The problem is -- you're going to feel pain one way or the other. In that case, make sure that the pain is obvious and predictable.

If you're using subvolumes/datasets, you will have to deal with a problem at some point. Either you have to manually export multiple NFS volumes (ZFS), or potentially have inode uniqueness issues (Btrfs).

I'd much rather have the problems be excruciatingly obvious. I can script generating a config file for exporting many datasets (and I have done so). I can't really deal with non-unique inodes in the same manner.


I overall agree, but on our build server the person who can add new branches is not the same person who can add new entries to the autofs map, so we don't use one dataset per branch, even though that would make other things a lot easier.


That’s true. For me, I made a dataset for each user as a home directory. Each home directory was then exported over NFS.

In order to automate the export, I have a single script that can be executed with sudo on that system to manage both the creation of the dataset and the export.

So, there are still ways, if you’re able to use sudo.


Yeah, I can sudo on the NAS, but not on the NIS server.


If it helps…

My LDAP and NFS servers are separate too. I control both, so it’s a bit different, but I still have it setup so that account creation and home directory creation/export are handled in one script.

The account creation script (on the LDAP server) makes an SSH call to the NFS server to run the home directory creation/export update script. That specific SSH certificate is password-less and restricted to running only that single command. It could be root or another user that calls sudo.

So, some of what you’re looking for can be done with some SSH tricks… but only if both sides are comfortable with that setup. The benefit is that each side can manage their own scripts. The NIS admin can write the script on their side and NFS on their side. You just need to establish the workflow that works best for you.

Our cluster is completely self contained, so with sudo and SSH restrictions, I’m not concerned about security issues with this setup.


At least when you use the built-in ZFS NFS sharing, it also exports "subvolumes".


Well, on fairly recent TrueNAS at least, you need to export each dataset separately, and they show up in /proc/mounts.


I have the zpool "data" with ZFS volumes like this:

  data/media/video/movies
  data/media/video/series

I set sharenfs with read/write access for

  data/media/video

which gives me access via NFS to all the subvolumes.


So how does ZFS handle the same problem? (I would guess that it just goes through with making each dataset a full filesystem and just deals with the overhead for NFS, but I don't know that)


Since ZFS does not really have the concept of subvolumes like btrfs, it does not suffer from this.


I'm no expert so maybe I'm missing something.

To me it seems that ZFS snapshots represent a similar challenge[1], given that the snapshots can be accessed through the .zfs directory, and that ZFS also plays tricks[2] with the inodes.

[1]: https://docs.oracle.com/cd/E19253-01/819-5461/gbiqe/index.ht...

[2]: http://mikeboers.com/blog/2019/02/21/zfs-inode-generations


They're always explicit mounts on ZFS. (Even the .zfs directory triggers mounting on access of things under snapshot/.)


As a tangent, what's interesting is whether `mount` shows the mountpoint.

If you cd into a snapshot on FreeBSD it does not get listed in calls to `mount` but does on ZoL.

edit: I have not tested this since FreeBSD rebased on ZoL.


BTRFS has volumes/subvolumes while ZFS has pools/filesystems. I suspect they have considerably different implementations but the userspace implications are similar.


How is a BTRFS sub volume different from a ZFS child dataset/filesystem?


Over the years I have run into so many idiosyncrasies in btrfs that I have come to the conclusion that either (a) you'd better understand its internals _very_ well, or (b) you should only use it with small volumes. I have gone with option B because I don't read their mailing list on a regular basis.


In our current setup, we only NFS export a btrfs subvolume, with no subvolumes inside that one. So that should be OK, if I'm reading this right.

I'm a big fan of btrfs snapshots, but with some other recent issues, I'm wondering if we should migrate away from btrfs.


I've run a service for 10 years and during that time, there have only been 2 frightening outages (that were difficult to understand and debug) - both of them were due to btrfs.


Don't hardlinks have the same inode, as they are essentially the same file included in a different directory? Is this also why we don't have hardlinks to directories?


Yes, so a client program that sees two files with the same inode number will correctly assume that both files are the same. However, the assumption breaks when using snapshots, since the same inode number may refer to the same file as seen in differing snapshots, and those are no longer actually the same file.
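
For concreteness, this is roughly the identity check such programs perform; the assumption is that a matching (st_dev, st_ino) pair means "same file", which yields false positives once two distinct files are presented behind one device with the same inode number (as in the AFS anecdote above, or a btrfs filesystem exported over NFS):

    #include <stdio.h>
    #include <sys/stat.h>

    /* Sketch of the usual "are these the same file?" test. */
    static int same_file(const char *a, const char *b)
    {
        struct stat sa, sb;

        if (stat(a, &sa) != 0 || stat(b, &sb) != 0)
            return 0;  /* treat unstat-able paths as "not the same" */

        return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
    }

    int main(int argc, char **argv)
    {
        if (argc == 3)
            printf("%s\n", same_file(argv[1], argv[2]) ? "same file" : "different files");
        return 0;
    }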


They do have the same inode. Directories have inodes too, so that's not an obstacle.

We don't have hardlinks to directories because that is a terrible idea functionality wise.


I have seen too many posts with users who have had their btrfs pools corrupted for me to try and dabble with it. (Using Unraid specifically).

I'll stick with xfs.


Similarly, after I got stuck with a corrupted xfs volume that couldn't be self-fixed or manually fixed (the process would take literal days and then fail in some unhelpful way), I gave up and went back to using ext4 for everything.



> I’ll stick with xfs

Why not ZFS or ZoL (https://zfsonlinux.org/)?

B/c it’s not built into the kernel, or some other reason?


Perhaps inode numbers should never have been numbers in the first place. If they were strings it would be trivial to prepend a volume ID.


The original research Unix systems had a filesystem that resolved filenames to inode numbers (which were basically indices into a lookup table of inode structures). Kinda sorta like DNS where you piece-by-piece resolve a hierarchically structured string into a number that you can use to directly address a target, until you reach the last one.

If two paths resolve to the same inode number, and thus to the same inode and data, that's called a hard link. The whole concept of inodes and inode numbers is deeply ingrained into the Unix userspace API and Unix filesystem semantics.

The whole purpose of inode numbers is to not deal with strings.

What you are suggesting is essentially a fundamental redesign of Unix filesystem semantics that requires each file to have at least two paths: a hierarchical, user-generated one, and a non-hierarchical, auto-generated one.
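
For concreteness, here is a rough sketch of the classic scheme being described; the field sizes are the historical V7 ones and are only illustrative. A directory is just a table mapping names to inode numbers, and two entries pointing at the same number are a hard link:

    #include <stdint.h>
    #include <string.h>

    /* Roughly the historical V7 "struct direct": a 16-bit inode number
     * plus a fixed-width, NUL-padded name. */
    struct dirent_v7 {
        uint16_t d_ino;
        char     d_name[14];
    };

    /* Resolve one path component against a directory's entry table;
     * 0 means "no such entry" (inode 0 marks a free slot). */
    static uint16_t lookup(const struct dirent_v7 *entries, int n, const char *name)
    {
        for (int i = 0; i < n; i++)
            if (entries[i].d_ino != 0 && strncmp(entries[i].d_name, name, 14) == 0)
                return entries[i].d_ino;
        return 0;
    }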


That would create a whole lot of other problems and probably some performance penalty, because strings are much, much more difficult to work with in the context of a kernel filesystem driver.



