Interesting. I tried to follow the discussion in the linked thread, and the only takeaway I got was "something to do with RCU". What id the simplified explanation?
In Linux, the file descriptor table (fdtable) of a process starts with a minimum of 256 slots. Two threads creating 256 sockets each, which uses 512 fds on top of the three already present (for stdin, stdout and stderr), requires that the fdtable be expanded about halfway through when the capacity is doubled from 256 to 512, and again near the end when resizing from 512 to 1024.
This is done by expand_fdtable() in the kernel. It contains the following code:
if (atomic_read(&files->count) > 1)
synchronize_rcu();
The field files->count is a reference counter. As there are two threads, which share a set of open files between them, the value of this is 2, meaning that synchronize_rcu() is called here during fdtable expansion. This waits until a full RCU grace period has elapsed, causing a delay in acquiring a new fd for the socket currently being created.
If the fdtable is expanded prior to creating a new thread, as the test program optionally will do by calling dup(0, 666) if supplied a command line argument, this avoids the synchronize_rcu() call because at this point files->count == 1. Therefore, if this is done, there will be no delay later on when creating all the sockets as the fdtable will have sufficient capacity.
By contrast, the OpenBSD kernel doesn't have anything like RCU and just uses a rwlock when the file descriptor table of the process is being modified, avoiding the long delay during expansion that may be observed in Linux.
> allocating kernel objects from proper virtual memory makes this easier. Linux currently just allocates kernel objects straight out of the linear mapping of all physical memory
I found this to be a key takeaway of reading the full thread: this is, in part, a benchmark of kernel memory allocation approaches, that surfaces an unforeseen difference in FD performance at a mere 256 x 2 allocs. Presumably we’re seeing a test case distilled down from a real world scenario where this slowdown was traced for some reason?
That’s how they’re designed; they are intended to complete at some point that’s not soon. There’s an “expedited RCU” which to my understanding tries to get everyone past the barrier as fast as possible by yelling at them but I don’t know if that would be appropriate here.
Reading this code for the first time, this seems to be a consequence of the separation between allocating and fd and "installing" a pointer to a file there. Allocating the fd already needs to acquire a lock. So if the install happens together with allocation, there wouldn't be a need to use synchronize_rcu to kick out other threads. The lock would do that.
When 2 threads are allocating sockets sequentially, they fight for the locks. If you preallocate a bigger table by creating fd 666 first, the lock contention goes away.
It's something that has always been interesting about Windows NT, which has a multi-level object handle table, and does not have the rule about re-using the lowest numbered available table index. There's scope for reducing contention amongst threads in such an architecture.
Although:
1. back in application-mode code the language runtime libraries make things look like a POSIX API and maintain their own table mapping object handles to POSIX-like file descriptors, where there is the old contention over the lowest free entries; and
1. in practice the object handle table seems to mostly append, so multiple object-opening threads all contend over the end of the table.
lowest-number-reuse is also a robustness issue. If multi-threaded programs UAF or double-free a file descriptor they likely end up touching FDs owned by other parts of the program which can result in various kinds of corruption and memory-unsafety.
Assigning numbers from a large domain, either randomly or from a permutation sequence, would massively reduce that probability and manifest in prompt errors instead.
I want an alternative unix ABI that doesn't guarantee lowest-reuse for this exact reason. I suppose you could (almost) just hijack the close call and replace it with a dup2 of /dev/null or something (but then you'd miss close errors).
It could be emulated in userspace with F_DUPFD. But that's costly because the kernel table is optimized for dense data, not sparse.
The rust standard library aborts the program in debug builds when it detects a double-close; at that point corruption may already have occurred but better than nothing.