The limitation of "0x10000 operations" (64k) surprises me. Is that the maximum that can be submitted to the kernel at once, or also a limit on the number of operations in flight?
There are certainly a lot of servers out there with more than 100k connections and low activity (keep-alive between HTTP requests, running messaging protocols, etc), where one read() operation would need to be enqueued per connection in case new data arrives or the connection closes. Is this supported or not?
My understanding of io_uring on Linux is that the submission queue only holds the recently-submitted entries that the kernel has not started processing. Once the kernel reads a Submission Queue Entry, it consumes that SQE and keeps track of command status using private memory, and a Completion Queue entry will not appear until the kernel is done handling that IO operation. So the Submission Queue size is not a limitation on how many in-flight/pending IOs there can be, just on how big a batch you can dump on the kernel at once. Of course, the kernel's internal limits on operation state tracking might be smaller or larger than the Submission Queue size.
It's unclear if there's a global limitation, but my reading is that the limit of 64k operations is per IORING_OBJECT. There's nothing that says you can't have multiple submission queues; for example, the NVMe spec allows 64k commands per queue on up to 64k queues. (On the other hand, there's nothing that says that multiple submission queues are a good match for any given application either, but in the storage world they often are.)
In any case, given how popular NVMe is, I assume IORING_OBJECT would support this model.
That is correct - the 64K limitation is per IORING_OBJECT. Nothing stops a process from creating multiple objects - as many as its handle table allows, actually - so it's very similar to NVMe.
Thank goodness. That should minimize porting friction, and help both APIs catch on. I want to see lots of languages offering cross-platform abstractions over these in their standard libraries.
I'm not familiar with the gory details on the Windows side, so can anyone say whether the file handles this API interacts with are general enough that it can do (or be extended to do) network IO, the way io_uring can?
Also, has there been any official confirmation about how this relates to DirectStorage? I've been expecting DirectStorage to include a clone of io_uring, and this seems to fit the bill.
The handles are file handles, so they're as generic as it gets on Windows - those can be handles to files, sockets, pipes...
This means it can be pretty easily extended to support other I/O operations, so I hope that will be implemented by the time 21H2 is released. It does look like the plan is to implement io_uring on Windows, so I'm optimistic.
The APIs were actually originally implemented in the Storage DLL and not in KernelBase, so it looks like DirectStorage is going to be using this pretty soon.
One of the reasons to have multiple nodes in a cluster is to leverage multiple kernels for processing. Even VMs are bound to whatever linear limits are in place in the hypervisor. I’ve always thought it would be possible to run a multiple kernel OS in VRAM on a gpu for better processing.
The GPU side of things might not help you on current hardware, because there are still some serial hardware queues, but the general concept of running a separate kernel per core to guarantee a lack of software serialization points is a valid and active area of research.