Interesting. I tried to follow the discussion in the linked thread, and the only...

bobby_big_balls · 2025-08-15T21:03:48 1755291828

In Linux, the file descriptor table (fdtable) of a process starts with a minimum of 256 slots. Two threads creating 256 sockets each, which uses 512 fds on top of the three already present (for stdin, stdout and stderr), requires that the fdtable be expanded about halfway through when the capacity is doubled from 256 to 512, and again near the end when resizing from 512 to 1024.

This is done by expand_fdtable() in the kernel. It contains the following code:

        if (atomic_read(&files->count) > 1)
          synchronize_rcu();

The field files->count is a reference counter. As there are two threads, which share a set of open files between them, the value of this is 2, meaning that synchronize_rcu() is called here during fdtable expansion. This waits until a full RCU grace period has elapsed, causing a delay in acquiring a new fd for the socket currently being created.

If the fdtable is expanded prior to creating a new thread, as the test program optionally will do by calling dup(0, 666) if supplied a command line argument, this avoids the synchronize_rcu() call because at this point files->count == 1. Therefore, if this is done, there will be no delay later on when creating all the sockets as the fdtable will have sufficient capacity.

By contrast, the OpenBSD kernel doesn't have anything like RCU and just uses a rwlock when the file descriptor table of the process is being modified, avoiding the long delay during expansion that may be observed in Linux.

tptacek · 2025-08-15T21:41:33 1755294093

RCUs are super interesting; here's (I think I've got the right link) a good talk on how they work and why they work that way:

https://www.youtube.com/watch?v=9rNVyyPjoC4

dekhn · 2025-08-15T21:43:57 1755294237

Thanks for the explanation. I confirmed the performance timing different by enabling the dup call.

I guess my question is why would synchronize_rcu take many milliseconds (20+) to run. I would expect that to be in the very low milliseconds or less.

altairprime · 2025-08-15T21:59:34 1755295174

> allocating kernel objects from proper virtual memory makes this easier. Linux currently just allocates kernel objects straight out of the linear mapping of all physical memory

I found this to be a key takeaway of reading the full thread: this is, in part, a benchmark of kernel memory allocation approaches, that surfaces an unforeseen difference in FD performance at a mere 256 x 2 allocs. Presumably we’re seeing a test case distilled down from a real world scenario where this slowdown was traced for some reason?

saagarjha · 2025-08-15T22:10:43 1755295843

That’s how they’re designed; they are intended to complete at some point that’s not soon. There’s an “expedited RCU” which to my understanding tries to get everyone past the barrier as fast as possible by yelling at them but I don’t know if that would be appropriate here.

loeg · 2025-08-16T01:56:10 1755309370

> This is done by expand_fdtable() in the kernel. It contains the following code:

Context: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...

neerajsi · 2025-08-16T05:54:46 1755323686

Reading this code for the first time, this seems to be a consequence of the separation between allocating and fd and "installing" a pointer to a file there. Allocating the fd already needs to acquire a lock. So if the install happens together with allocation, there wouldn't be a need to use synchronize_rcu to kick out other threads. The lock would do that.

viraptor · 2025-08-15T20:40:52 1755290452

When 2 threads are allocating sockets sequentially, they fight for the locks. If you preallocate a bigger table by creating fd 666 first, the lock contention goes away.

JdeBP · 2025-08-15T21:48:47 1755294527

It's something that has always been interesting about Windows NT, which has a multi-level object handle table, and does not have the rule about re-using the lowest numbered available table index. There's scope for reducing contention amongst threads in such an architecture.

Although:

1. back in application-mode code the language runtime libraries make things look like a POSIX API and maintain their own table mapping object handles to POSIX-like file descriptors, where there is the old contention over the lowest free entries; and

1. in practice the object handle table seems to mostly append, so multiple object-opening threads all contend over the end of the table.

the8472 · 2025-08-16T12:39:20 1755347960

lowest-number-reuse is also a robustness issue. If multi-threaded programs UAF or double-free a file descriptor they likely end up touching FDs owned by other parts of the program which can result in various kinds of corruption and memory-unsafety. Assigning numbers from a large domain, either randomly or from a permutation sequence, would massively reduce that probability and manifest in prompt errors instead.

loeg · 2025-08-16T16:37:23 1755362243

I want an alternative unix ABI that doesn't guarantee lowest-reuse for this exact reason. I suppose you could (almost) just hijack the close call and replace it with a dup2 of /dev/null or something (but then you'd miss close errors).

the8472 · 2025-08-16T18:55:30 1755370530

It could be emulated in userspace with F_DUPFD. But that's costly because the kernel table is optimized for dense data, not sparse.

The rust standard library aborts the program in debug builds when it detects a double-close; at that point corruption may already have occurred but better than nothing.

loeg · 2025-08-16T21:48:48 1755380928

> The rust standard library aborts the program in debug builds when it detects a double-close

On EBADF? Neat.

saagarjha · 2025-08-15T21:55:46 1755294946

RCU is very explicitly a lockless synchronization strategy.

loeg · 2025-08-16T01:57:15 1755309435

This isn't why it's slow. 2x256 just isn't a lot of locks in CPU time.

ginko · 2025-08-16T03:18:08 1755314288

The problem with locking isn’t the overhead of the locking mechanism itself but the necessary serialization.

loeg · 2025-08-16T03:23:23 1755314603

Again, mutex contention / serialization is not why this is slow.