I don't think you understand the point of the constraints ZFS has put in place here. They're not there for transactional isolation per se. They're there because these ZFS maintenance operations run as synchronous system calls that take per-filesystem global mutex locks. Blocking within one, with those locks acquired, and thunking back down to userspace, would mean that 1. a CPU core, and 2. the filesystem itself, would both be tied up indefinitely until that callback thunk returns. If it ever does!
Picture a database server where, once you begin a transaction, the whole DB server goes single-threaded and doesn't serve any other clients until the TX completes. Fundamentally, at least for its more arcane global operations, that's exactly what a filesystem is. Usually this doesn't matter, because even these arcane filesystem operations take on the order of microseconds to complete (just with lots of non-locked pre/post execution overhead that inflates the total syscall time). But a userspace program can do an arbitrary amount of work in a callback. It can even just get into an infinite loop and never get back to the kernel. This is why there isn't such a thing as a system call with userspace callbacks! (Or any API equivalent to one: e.g. a pair of tx_begin + tx_commit calls, where tx_begin acquires a global kernel lock before returning to userspace, under the expectation that userspace will later call tx_commit to release the kernel lock.)
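To make that concrete, here's a tiny self-contained userspace model of why that hypothetical tx_begin/tx_commit design fails (all names here are illustrative, not real syscalls; a pthread mutex stands in for the per-filesystem kernel lock):

```c
/* Userspace model of the pathological design: a global "filesystem" lock
 * held across arbitrary user code between tx_begin and tx_commit.
 * Build with: cc -pthread model.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t fs_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical API: tx_begin returns to "userspace" with the lock held. */
static void tx_begin(void)  { pthread_mutex_lock(&fs_lock); }
static void tx_commit(void) { pthread_mutex_unlock(&fs_lock); }

static void *other_client(void *arg) {
    (void)arg;
    pthread_mutex_lock(&fs_lock);   /* every other "syscall" blocks here */
    puts("other client finally ran");
    pthread_mutex_unlock(&fs_lock);
    return NULL;
}

int main(void) {
    pthread_t t;
    tx_begin();                     /* lock acquired, control back in userspace */
    pthread_create(&t, NULL, other_client, NULL);
    sleep(5);                       /* stand-in for arbitrary user code; an
                                       infinite loop here starves everyone forever */
    tx_commit();
    pthread_join(&t, NULL);
    return 0;
}
```

Nothing forces the code between tx_begin and tx_commit to ever finish, and that's the whole problem.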
On the other hand, submitting a "callback program" as a whole to the kernel allows the kernel to statically analyze the program's behavior in advance; and only if the program is determined to be "simple" (e.g. if it predictably halts after a short time, as a static property) will the kernel issue a capability to actually run it.
This is how compute shaders work. This is how eBPF works. And this is (apparently) how this ZFS Lua thing works. They all do it for similar reasons: to ensure that the program is a complete, practical soft-realtime unit of work that can execute in some bounded-time context.
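For a concrete instance of the pattern, here's a minimal eBPF sketch (the tracepoint choice and build invocation are just illustrative). The kernel's verifier statically proves properties like bounded termination before it will load the program at all:

```c
/* Minimal eBPF sketch: the whole program is submitted to the kernel, and
 * the in-kernel verifier must prove it halts (no unbounded loops, bounded
 * stack, checked memory access) before granting it the right to run.
 * Build with: clang -O2 -target bpf -c count_writes.bpf.c */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("tracepoint/syscalls/sys_enter_write")
int count_writes(void *ctx)
{
    /* Only bounded, verifiable work is possible here; anything the
       verifier can't prove terminates is rejected at load time. */
    bpf_printk("write() entered");
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```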
You do get atomicity from this, but it's not MVCC atomicity. It's more like Redis's atomicity: during the execution of a ZFS Lua program, all other clients making system calls against the filesystem will wait on acquiring the filesystem mutex, and so will block until the program completes. There's never a case where other activity might "interleave" with the Lua program's execution and cause it to retry. There is no other activity. The only case in which the Lua program can actually fail (and so get rolled back) is if it dies/is killed in the middle of execution, due to e.g. a hardware power cut. That's the kind of atomicity that ZFS is concerned about: guaranteeing that changes that make it into the filesystem journal are complete, valid change units. In this case, one Lua program is one change unit in the journal.
> Blocking within one, with those locks acquired, and thunking back down to userspace, would mean that 1. a CPU core, and 2. the filesystem itself, would both be tied up indefinitely until that callback thunk returns. If it ever does!
In this theoretical design, you could just block all other administrative modifications. I don't think you need to tie up an entire CPU core, and I'm fairly sure that these ZFS operations don't block regular reads and writes.
I think you had it right in your initial comment. There's no good way to express branching with an implementation that incrementally submits operations to be committed as a batch. You'd have to take an admin lock on an entire zpool.
EDIT: talked to a ZFS dev, who said this would take the txg sync lock.
While there are ways to deschedule both userspace and kernel threads, there is no mechanism to deschedule a userspace thread while it's executing in the middle of kernel mode because of a blocking syscall.
Think of it like trying to deschedule a userspace thread in the middle of it having jumped to kernelspace to handle an interrupt. It just wouldn't work; that's not a pre-emptible state, not one that can be cleanly represented during a context switch with a PUSHA, not one where pre-emption would leave the kernel in a known state, etc.
So the CPU core is tied up because the original thread can't be descheduled, and instead would still be "stuck" in the middle of the system call, doing a busy-wait on the result of the callback. To make the callback actually happen in this hypothetical design, the execution of the callback would need to be scheduled onto another CPU core, using some system-global callback-scheduler like Apple's libdispatch.
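As a sketch of that scheduling shape, using the libdispatch API mentioned above (this only illustrates running a callback on a system-global pool thread while the original thread waits; it's not how any real syscall behaves):

```c
/* Build on macOS (or Linux with libdispatch): clang -fblocks demo.c */
#include <dispatch/dispatch.h>
#include <stdio.h>

int main(void) {
    dispatch_semaphore_t done = dispatch_semaphore_create(0);

    /* The "callback" runs on a worker thread from a system-global pool,
       i.e. on some other core, not on the thread "stuck" in the syscall. */
    dispatch_async(dispatch_get_global_queue(QOS_CLASS_DEFAULT, 0), ^{
        puts("callback ran on a pool thread");
        dispatch_semaphore_signal(done);
    });

    /* Stand-in for the original thread busy-waiting on the callback's
       result from inside the blocking syscall. */
    dispatch_semaphore_wait(done, DISPATCH_TIME_FOREVER);
    return 0;
}
```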
Note that this is also why, in Linux, processes stuck in the D state are unkillable. They're stuck "inside" a blocking system call, and so cannot be descheduled, even by the process manager trying to hard-kill them (which, in the end, requires the system call to at least return to the kernel so that the kernel resources involved can reach a known postcondition state.)
And this is why innovations like io_uring make so much sense in Linux — they allow a userspace process to 1. make a long-running blocking syscall, while also 2. spawning a worker subprocess to communicate asynchronously with the logic inside the running syscall, by queuing messages back and forth through the kernel rings. (Picture, say, sendfile(2) messaging your worker to let you observe the progress of the operation, and/or to signal it on a channel to gracefully cancel the operation-in-progress.)
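Here's the basic asynchronous-submission shape of io_uring via liburing, as a minimal sketch (the file path and queue depth are arbitrary); requests and completions flow through the shared rings rather than a thread sitting blocked inside a syscall:

```c
/* Minimal liburing sketch. Build with: cc demo.c -luring */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

int main(void) {
    struct io_uring ring;
    char buf[4096];

    if (io_uring_queue_init(8, &ring, 0) < 0) return 1;

    int fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0) return 1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring);            /* queue the op; returns immediately */

    /* ...the process is free to do other work, or queue more ops, here... */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);    /* reap the completion */
    printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}
```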
I'm not following what you're saying. Why do we need a callback?
In this imaginary design, the syscalls you make would look something like:
- BeginChannelTx -> return ChannelTxID
- ReadZFSProperties(ChannelTxID, params) -> return data
- DestroySomeDatasets(ChannelTxID, params) -> ok
- CommitChannelTx(ChannelTxID)
Notably, DestroySomeDatasets doesn't actually do any work. It merely records which datasets you want to destroy. There are no callbacks as far as I can see: there's no kernel thread waiting on a user thread to do something. This way also lets you express branching.
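A toy userspace model of that record-then-commit shape (every name mirrors the list above and is made up; nothing here is a real ZFS interface):

```c
/* Self-contained model of the imaginary API: stubs record intent into a
 * list, and commit "applies" the whole batch at once. */
#include <stdio.h>

#define MAX_OPS 16
static const char *pending[MAX_OPS];
static int n_pending;

static int BeginChannelTx(void) { n_pending = 0; return 1; /* fake tx id */ }

static int DestroySomeDatasets(int tx, const char *ds) {
    (void)tx;
    pending[n_pending++] = ds;      /* record intent only; no work yet */
    return 0;
}

static int CommitChannelTx(int tx) {
    (void)tx;
    for (int i = 0; i < n_pending; i++)
        printf("destroying %s\n", pending[i]);  /* all-or-nothing in a real FS */
    return 0;
}

int main(void) {
    int tx = BeginChannelTx();
    int expired = 1;                /* stand-in for a ReadZFSProperties() result */
    if (expired)                    /* branching is ordinary C control flow */
        DestroySomeDatasets(tx, "tank/scratch@old");
    return CommitChannelTx(tx);
}
```

Branching costs nothing here because it happens before commit, in ordinary control flow; the kernel only ever sees the finished batch.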
The drawback of this approach is you need a lock on all mutating administrative commands when you call BeginChannelTx. I talked to a ZFS dev, and he said that with ZFS's design, that's actually the txg sync lock. This means that while reads will proceed, writes will only proceed for a short period of time, and nothing will make it to disk. The overhead of making all these syscalls was also judged to be problematic.