I don't think you understand the point of the constraints ZFS has put in place here. They're not there for transactional isolation per se. They're there because these ZFS maintenance operations run as synchronous system calls that take per-filesystem global mutex locks. Blocking within one, with those locks acquired, and thunking back down to userspace, would mean that 1. a CPU core, and 2. the filesystem itself, would both be tied up indefinitely until that callback thunk returns. If it ever does!
Picture a database server where, once you begin a transaction, the whole DB server goes single-threaded and doesn't serve any other clients until the TX completes. Fundamentally, at least for its more arcane global operations, that's exactly what a filesystem is. Usually this doesn't matter, because even these arcane filesystem operations take on the order of microseconds to complete (just with lots of non-locked pre/post execution overhead that inflates the total syscall time). But a userspace program can do an arbitrary amount of work in a callback. It can even just get into an infinite loop and never get back to the kernel. This is why there isn't such a thing as a system call with userspace callbacks! (Or any API equivalent to one: e.g. a pair of tx_begin + tx_commit calls, where tx_begin acquires a global kernel lock before returning to userspace, under the expectation that userspace will later call tx_commit to release the kernel lock.)
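To make that concrete, here's a tiny self-contained userspace model of why that hypothetical tx_begin/tx_commit design fails (all names here are illustrative, not real syscalls; a pthread mutex stands in for the per-filesystem kernel lock):

```c
/* Userspace model of the pathological design: a global "filesystem" lock
 * held across arbitrary user code between tx_begin and tx_commit.
 * Build with: cc -pthread model.c */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t fs_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical API: tx_begin returns to "userspace" with the lock held. */
static void tx_begin(void)  { pthread_mutex_lock(&fs_lock); }
static void tx_commit(void) { pthread_mutex_unlock(&fs_lock); }

static void *other_client(void *arg) {
    (void)arg;
    pthread_mutex_lock(&fs_lock);   /* every other "syscall" blocks here */
    puts("other client finally ran");
    pthread_mutex_unlock(&fs_lock);
    return NULL;
}

int main(void) {
    pthread_t t;
    tx_begin();                     /* lock acquired, control back in userspace */
    pthread_create(&t, NULL, other_client, NULL);
    sleep(5);                       /* stand-in for arbitrary user code; an
                                       infinite loop here starves everyone forever */
    tx_commit();
    pthread_join(&t, NULL);
    return 0;
}
```

Nothing forces the code between tx_begin and tx_commit to ever finish, and that's the whole problem.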
On the other hand, submitting a "callback program" as a whole to the kernel allows the kernel to statically analyze the program's behavior in advance; and only if the program is determined to be "simple" (e.g. if it predictably halts after a short time, as a static property) will the kernel issue a capability to actually run it.
This is how compute shaders work. This is how eBPF works. And this is (apparently) how this ZFS Lua thing works. They all do it for similar reasons: to ensure that the program is a complete, practical soft-realtime unit of work that can execute in some bounded-time context.
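For a concrete instance of the pattern, here's a minimal eBPF sketch (the tracepoint choice and build invocation are just illustrative). The kernel's verifier statically proves properties like bounded termination before it will load the program at all:

```c
/* Minimal eBPF sketch: the whole program is submitted to the kernel, and
 * the in-kernel verifier must prove it halts (no unbounded loops, bounded
 * stack, checked memory access) before granting it the right to run.
 * Build with: clang -O2 -target bpf -c count_writes.bpf.c */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("tracepoint/syscalls/sys_enter_write")
int count_writes(void *ctx)
{
    /* Only bounded, verifiable work is possible here; anything the
       verifier can't prove terminates is rejected at load time. */
    bpf_printk("write() entered");
    return 0;
}

char LICENSE[] SEC("license") = "GPL";
```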
You do get atomicity from this, but it's not MVCC atomicity. It's more like Redis's atomicity: during the execution of a ZFS Lua program, all other clients making system calls against the filesystem will wait on acquiring the filesystem mutex, and so will block until the program completes. There's never a case where other activity might "interleave" with the Lua program's execution and cause it to retry. There is no other activity. The only case in which the Lua program can actually fail (and so get rolled back) is if it dies/is killed in the middle of execution, due to e.g. a hardware power cut. That's the kind of atomicity that ZFS is concerned about: guaranteeing that changes that make it into the filesystem journal are complete, valid change units. In this case, one Lua program is one change unit in the journal.
> Blocking within one, with those locks acquired, and thunking back down to userspace, would mean that 1. a CPU core, and 2. the filesystem itself, would both be tied up indefinitely until that callback thunk returns. If it ever does!
In this theoretical design, you could just block all other administrative modifications. I don't think you need to tie up an entire CPU core, and I'm fairly sure that these ZFS operations don't block regular reads and writes.
I think you had it right in your initial comment. There's no good way to express branching with an implementation that incrementally submits operations to be committed as a batch. You'd have to take an admin lock on an entire zpool.
EDIT: talked to a ZFS dev, who said this would take the txg sync lock.
While there are ways to deschedule both userspace and kernel threads, there is no mechanism to deschedule a userspace thread while it's executing in the middle of kernel mode because of a blocking syscall.
Think of it like trying to deschedule a userspace thread in the middle of it having jumped to kernelspace to handle an interrupt. It just wouldn't work; that's not a pre-emptible state, not one that can be cleanly represented during a context switch with a PUSHA, not one where pre-emption would leave the kernel in a known state, etc.
So the CPU core is tied up because the original thread can't be descheduled, and instead would still be "stuck" in the middle of the system call, doing a busy-wait on the result of the callback. To make the callback actually happen in this hypothetical design, the execution of the callback would need to be scheduled onto another CPU core, using some system-global callback-scheduler like Apple's libdispatch.
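As a sketch of that scheduling shape, using the libdispatch API mentioned above (this only illustrates running a callback on a system-global pool thread while the original thread waits; it's not how any real syscall behaves):

```c
/* Build on macOS (or Linux with libdispatch): clang -fblocks demo.c */
#include <dispatch/dispatch.h>
#include <stdio.h>

int main(void) {
    dispatch_semaphore_t done = dispatch_semaphore_create(0);

    /* The "callback" runs on a worker thread from a system-global pool,
       i.e. on some other core, not on the thread "stuck" in the syscall. */
    dispatch_async(dispatch_get_global_queue(QOS_CLASS_DEFAULT, 0), ^{
        puts("callback ran on a pool thread");
        dispatch_semaphore_signal(done);
    });

    /* Stand-in for the original thread busy-waiting on the callback's
       result from inside the blocking syscall. */
    dispatch_semaphore_wait(done, DISPATCH_TIME_FOREVER);
    return 0;
}
```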
Note that this is also why, in Linux, processes stuck in the D state are unkillable. They're stuck "inside" a blocking system call, and so cannot be descheduled, even by the process manager trying to hard-kill them (which, in the end, requires the system call to at least return to the kernel so that the kernel resources involved can reach a known postcondition state.)
And this is why innovations like io_uring make so much sense in Linux — they allow a userspace process to 1. make a long-running blocking syscall, while also 2. spawning a worker subprocess to communicate asynchronously with the logic inside the running syscall, by queuing messages back and forth through the kernel rings. (Picture, say, sendfile(2) messaging your worker to let you observe the progress of the operation, and/or to signal it on a channel to gracefully cancel the operation-in-progress.)
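Here's the basic asynchronous-submission shape of io_uring via liburing, as a minimal sketch (the file path and queue depth are arbitrary); requests and completions flow through the shared rings rather than a thread sitting blocked inside a syscall:

```c
/* Minimal liburing sketch. Build with: cc demo.c -luring */
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>

int main(void) {
    struct io_uring ring;
    char buf[4096];

    if (io_uring_queue_init(8, &ring, 0) < 0) return 1;

    int fd = open("/etc/hostname", O_RDONLY);
    if (fd < 0) return 1;

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring);            /* queue the op; returns immediately */

    /* ...the process is free to do other work, or queue more ops, here... */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);    /* reap the completion */
    printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    return 0;
}
```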
I'm not following what you're saying. Why do we need a callback?
In this imaginary design, the syscalls you make would look something like:
- BeginChannelTx -> return ChannelTxID
- ReadZFSProperties(ChannelTxID, params) -> return data
- DestroySomeDatasets(ChannelTxID, params) -> ok
- CommitChannelTx(ChannelTxID)
Notably, DestroySomeDatasets doesn't actually do any work. It merely records which datasets you want to destroy. There are no callbacks as far as I can see: there's no kernel thread waiting on a user thread to do something. This way also lets you express branching.
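A toy userspace model of that record-then-commit shape (every name mirrors the list above and is made up; nothing here is a real ZFS interface):

```c
/* Self-contained model of the imaginary API: stubs record intent into a
 * list, and commit "applies" the whole batch at once. */
#include <stdio.h>

#define MAX_OPS 16
static const char *pending[MAX_OPS];
static int n_pending;

static int BeginChannelTx(void) { n_pending = 0; return 1; /* fake tx id */ }

static int DestroySomeDatasets(int tx, const char *ds) {
    (void)tx;
    pending[n_pending++] = ds;      /* record intent only; no work yet */
    return 0;
}

static int CommitChannelTx(int tx) {
    (void)tx;
    for (int i = 0; i < n_pending; i++)
        printf("destroying %s\n", pending[i]);  /* all-or-nothing in a real FS */
    return 0;
}

int main(void) {
    int tx = BeginChannelTx();
    int expired = 1;                /* stand-in for a ReadZFSProperties() result */
    if (expired)                    /* branching is ordinary C control flow */
        DestroySomeDatasets(tx, "tank/scratch@old");
    return CommitChannelTx(tx);
}
```

Branching costs nothing here because it happens before commit, in ordinary control flow; the kernel only ever sees the finished batch.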
The drawback of this approach is you need a lock on all mutating administrative commands when you call BeginChannelTx. I talked to a ZFS dev, and he said that with ZFS's design, that's actually the txg sync lock. This means that while reads will proceed, writes will only proceed for a short period of time, and nothing will make it to disk. The overhead of making all these syscalls was also judged to be problematic.