Benchmarks at  and prototype repo at 
It'll be interesting to see how long it takes Tokio in the Rust community to add support for it, and how long it takes Go to support it transparently instead of using a pool of file I/O worker threads.
It comes with some downsides too: the IO operations then only accept certain specialized buffer types, nothing generic like plain byte slices, which might make them less interoperable. The other issue is that while refcounting at least leaves the memory in a non-corrupted state, the IO source will be in a somewhat undefined state after cancellation. E.g. if a write had been issued on a socket, the API user can make no assumption about whether any bytes were written, or what impact a subsequent write might have. The only reasonable thing to do is to close the IO source (where the close might itself need to wait until the still-pending background task has completed).
If possible, the future should take ownership of the buffer and have it returned on completion.
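To make that concrete, here's a minimal C rendering of the ownership idea (all names hypothetical): the in-flight request owns its buffer, and cancellation merely marks the request, so the memory is reclaimed only once the kernel's completion arrives.

    /* Sketch only: the in-flight request owns its buffer; cancel just
     * marks the request, so memory is freed only after completion. */
    #include <stdlib.h>

    struct request {
        void  *buf;        /* owned by the in-flight operation */
        size_t len;
        int    cancelled;  /* the caller gave up waiting       */
    };

    struct request *submit_read(size_t len)
    {
        struct request *r = malloc(sizeof *r);
        r->buf = malloc(len);       /* ownership moves into the request */
        r->len = len;
        r->cancelled = 0;
        /* ... hand r->buf to the kernel, with r as completion cookie ... */
        return r;
    }

    void cancel(struct request *r)
    {
        r->cancelled = 1;           /* do NOT free r->buf here */
    }

    void on_completion(struct request *r)   /* kernel is done with buf */
    {
        if (r->cancelled) {
            free(r->buf);           /* buffer safely returns to us now */
            free(r);
        }
        /* otherwise: hand r->buf (and ownership) back to the caller */
    }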
Note that io_uring in Linux 5.1 has various bugs; in particular, READ* and WRITE* on sockets/pipes don't work very well (afaik they either block submission or return -EAGAIN with O_NONBLOCK), so the example code in my repo isn't that useful on Linux 5.1 (POLL_ADD works fine though).
For proper support in tokio this would have to be done in mio afaict (or tokio would have to replace mio with something else).
The idea is that during submission an operation is started with IOCB_NOWAIT; an operation shouldn't block with that. If it returns EAGAIN it will be retried in another kernel thread without IOCB_NOWAIT. Afaik IOCB_NOWAIT support should be announced through FMODE_NOWAIT.
Right now, operations that don't support IOCB_NOWAIT (the "magic check" in io_file_supports_async) are still run during submission, but without the IOCB_NOWAIT flag.
Sockets/pipes don't handle IOCB_NOWAIT, they only care about O_NONBLOCK right now (which is a bug imho).
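A schematic of the policy those three paragraphs describe, in pseudo-C rather than actual kernel source (every helper below is a hypothetical stand-in):

    #include <errno.h>
    #include <stdbool.h>

    struct req;                                /* opaque request          */
    bool file_has_fmode_nowait(struct req *);  /* FMODE_NOWAIT announced  */
    int  do_io(struct req *, bool nowait);     /* -EAGAIN if would block  */
    void complete(struct req *, int res);      /* post the completion     */
    void punt_to_worker(struct req *);         /* retry without NOWAIT    */

    void submit(struct req *r)
    {
        if (file_has_fmode_nowait(r)) {
            int res = do_io(r, true);          /* IOCB_NOWAIT attempt     */
            if (res != -EAGAIN) {
                complete(r, res);              /* finished inline         */
                return;
            }
            punt_to_worker(r);                 /* another kernel thread,
                                                  this time blocking      */
            return;
        }
        /* Files that don't support IOCB_NOWAIT are, per the comment
         * above, currently still run inline without the flag. */
        complete(r, do_io(r, false));          /* may block submission    */
    }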
Circular buffers were shared between user space and kernel. (Allocated by user space, mapped into kernel with get_user_pages.)
User space threads just added messages to their circular buffer by copying in the data, issuing a memory barrier, and updating a head pointer.
Like io_uring, I avoided too many system calls by having a kernel thread poll for changes. User space would only make a system call when the ring would near a "high water mark", like 85% full or whatever.
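A hedged sketch of what that producer side might have looked like (the struct layout, the 85% trigger, and wakeup_kernel() are all illustrative, not the real interface):

    #include <stdatomic.h>
    #include <stdint.h>

    extern void wakeup_kernel(void);  /* hypothetical syscall wrapper */

    struct ring_hdr {
        _Atomic uint32_t head;   /* written by user space (producer) */
        _Atomic uint32_t tail;   /* written by the kernel thread     */
        uint32_t size;           /* capacity in bytes, power of two  */
        uint8_t  data[];
    };

    int ring_push(struct ring_hdr *r, const uint8_t *msg, uint32_t len)
    {
        uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        uint32_t used = head - tail;  /* free-running counters wrap */

        if (used + len > r->size)
            return -1;                /* full: caller backs off or traps */

        for (uint32_t i = 0; i < len; i++)  /* copy in, handling wrap */
            r->data[(head + i) & (r->size - 1)] = msg[i];

        /* Publish the bytes before moving the head pointer: this
         * release store is the "hitting a barrier" step. */
        atomic_store_explicit(&r->head, head + len, memory_order_release);

        if (used + len > r->size / 100 * 85)  /* near the high-water mark */
            wakeup_kernel();                  /* only now make a syscall  */
        return 0;
    }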
The circular buffers could outlive the process which created them and attached them. A process could crash and go away; the kernel thread would still process the last messages written right up to the crash.
Not everyone around was keen on the idea. What do you mean just stick it into a buffer and the kernel will pick it up. Polling? Bah.
(The kernel thread was actually provided by a system call from user space. So there was a daemon as a user space program, but it just called into the kernel, where it then sat in a loop. This is nice: you have something in user space that you can kill with SIGTERM or whatever and restart by running some /usr/bin/executable.)
If the daemon terminated, that didn't destroy any of the buffers; they would just sit there getting backlogged. New ones could be created, too (via ioctl calls on a character device in /dev).
One aspect was that get_user_pages doesn't give you a linear block of memory. I wrote the grotty C code to dequeue messages from the non-linearly-mapped circular buffer. The message headers could straddle page boundaries, so I copied those out into temporaries before working with them. I think there was some attempt to avoid making unnecessary copies of the bulk data.
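Roughly the shape that dequeue code takes; a sketch, with msg_hdr, the page-array layout, and a 4 KiB page size all assumed:

    #include <stdint.h>
    #include <string.h>

    #define PAGE_SIZE 4096u

    struct msg_hdr {
        uint32_t len;
        uint32_t type;
    };

    /* pages[] holds the kernel-side mappings of the pinned user pages;
     * ring_size is a multiple of PAGE_SIZE. */
    static void ring_copy_out(uint8_t **pages, uint32_t ring_size,
                              uint32_t off, void *dst, uint32_t n)
    {
        uint8_t *out = dst;
        while (n) {
            uint32_t pos   = off % ring_size;    /* wrap within the ring */
            uint32_t pgoff = pos % PAGE_SIZE;
            uint32_t chunk = PAGE_SIZE - pgoff;  /* left in this page    */
            if (chunk > n)
                chunk = n;
            memcpy(out, pages[pos / PAGE_SIZE] + pgoff, chunk);
            out += chunk; off += chunk; n -= chunk;
        }
    }

    struct msg_hdr read_hdr(uint8_t **pages, uint32_t ring_size, uint32_t off)
    {
        struct msg_hdr h;  /* temporary: safe even across a page boundary */
        ring_copy_out(pages, ring_size, off, &h, sizeof h);
        return h;
    }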
Open question: would it have helped to have a CMA region managed by a kernel module, mapping that memory (or regions of it) into user space with remap_pfn_range instead?
CMA documentation: https://lwn.net/Articles/396707/
I can definitely recommend reading through the first half of Genode Foundations (the second half is just an API reference). It's really interesting and informative, and it's a surprisingly easy read as well. However, I'd wait till the 19.05 edition gets released, since that should be less than a month's wait if they manage to release it on time.
 https://genode.org/documentation/genode-foundations-18-05.pd... (page 92)
 (pages 94 & 116)
For non-blocking sockets and the like you have what grew out of the readiness model from select and poll. Today on Linux that's epoll.
This model never worked for files on disk. There is the POSIX aio API for that. On Linux, glibc implements it, and it has always been based on synchronous I/O in thread pools because the kernel support isn't good. And there have been Linux-specific things like io_submit(2), which, as the article states, only worked with O_DIRECT and sometimes fell back to synchronous I/O.
It says the new mechanism works with both files and sockets, but I imagine it's still reasonable to use epoll for sockets and only use this new stuff for files on disk. One good thing about the readiness model for sockets is that you don't need to allocate memory upfront for every potential read the remote host might trigger.
The amount of work that goes into supporting, debugging, developing, improving, and designing the sheer number of ways we do things, at literally hundreds of layers of abstraction, all of which are slightly wrong and each of which does things its own way with its own tradeoffs.
I'm damn glad there are people a lot smarter and more dedicated than I am that keep this stuff running so that I can store a value in localstorage in a browser and have it punch through all of those layers in a fraction of a moment down to writing bits on one of like a dozen storage mediums on any number of OSs via multiple interfaces in multiple browsers.
The model for the various FD polling mechanisms is "wake me up when read(2) won't return EAGAIN when I call it later". That is very natural for sockets, because what read(2) does is drain a kernel-side receive buffer, so the question is "wake me up when the buffer isn't empty". If the FD were a file on disk, knowing ahead of time whether read(2) is going to return quickly is kind of a clumsy question.
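As a minimal epoll sketch of that readiness loop (assuming the socket is already non-blocking); note that the receive buffer only materializes once data is known to be waiting:

    #include <sys/epoll.h>
    #include <unistd.h>

    void drain_when_ready(int sock)   /* sock assumed O_NONBLOCK */
    {
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = sock };
        epoll_ctl(epfd, EPOLL_CTL_ADD, sock, &ev);

        struct epoll_event ready[16];
        int n = epoll_wait(epfd, ready, 16, -1);  /* block until readable */
        for (int i = 0; i < n; i++) {
            char buf[4096];                       /* allocated only now   */
            ssize_t r;
            while ((r = read(ready[i].data.fd, buf, sizeof buf)) > 0) {
                /* process r bytes */
            }
            /* r == -1 with EAGAIN: buffer drained, go back to waiting */
        }
        close(epfd);
    }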
But isn't this exactly what happens? A process running in kernel mode calls schedule() and sleeps waiting for I/O. The reason for calling schedule() is that it already does know read isn't going to return quickly. But maybe I'm completely misunderstanding your explanation about why current AIO on Linux is clunky?
For all of these cases I suppose you could make the I/O fail with EAGAIN, but this seems like a lot of potential for spurious wakeups.
 - https://news.ycombinator.com/item?id=19818899
The ring buffer is definitely an innovation.
kqueue is neither epoll nor aio but a common queueing interface for events, which includes IO readiness and aio completion as separate event types. Linux prefers to split these across different types of fds. FWIW, this wasn't due to obliviousness to the existing design: https://yarchive.net/comp/linux/event_queues.html.
Honestly, your comment sounds a lot like a regurgitated, old, and somewhat outdated Bryan Cantrill rant.
Luckily, epoll at least ended up supporting multiple queues.
I do believe that ET (edge-triggered notification) is good though.
> (parts snipped for brevity)
> The biggest limitation is undoubtedly that it only supports async IO for O_DIRECT (or un-buffered) accesses. Due to the restrictions of O_DIRECT (cache bypassing and size/alignment restraints), this makes the native aio interface a no-go for most use cases.
> There are a number of ways async-io submission can end up blocking - if meta data is required to perform IO, the submission will block waiting for that. For storage devices, there are a fixed number of request slots available. If those slots are currently all in use, submission will block waiting for one to become available.
> The API isn't great. Each IO submission ends up needing to copy 64 + 8 bytes and each completion copies 32 bytes. That's 104 bytes of memory copy. IO always requires at least two system calls (submit + wait-for-completion), which in these post spectre/meltdown days is a serious slowdown.
The io_uring solution:
> With a shared ring buffer, we could eliminate the need to have shared locking between the application and the kernel, getting away with some clever use of memory ordering and barriers instead. There are two fundamental operations associated with an async interface: the act of submitting a request, and the event that is associated with the completion of said request. For submitting IO, the application is the producer and the kernel is the consumer. The opposite is true for completions - here the kernel produces completion events and the application consumes them. Hence, we need a pair of rings to provide an effective communication channel between an application and the kernel. That pair of rings is at the core of the new interface, io_uring. They are suitably named submission queue (SQ), and completion queue (CQ), and form the foundation of the new interface.
> The cqes are organized into an array, with the memory backing the array being visible and modifiable by both the kernel and the application. However, since the cqe's are produced by the kernel, only the kernel is actually modifying the cqe entries. The communication is managed by a ring buffer. Whenever a new event is posted by the kernel to the CQ ring, it updates the tail associated with it. When the application consumes an entry, it updates the head. Hence, if the tail is different than the head, the application knows that it has one or more events available for consumption. The ring counters themselves are free flowing 32-bit integers, and rely on natural wrapping when the number of completed events exceed the capacity of the ring.
> For the submission side, the roles are reversed. The application is the one updating the tail, and the kernel consumes entries (and updates) the head. One important difference is that while the CQ ring is directly indexing the shared array of cqes, the submission side has an indirection array between them. Hence the submission side ring buffer is an index into this array, which in turn contains the index into the sqes. This might initially seem odd and confusing, but there's some reasoning behind it. Some applications may embed request units inside internal data structures, and this allows them the flexibility to do so while retaining the ability to submit multiple sqes in one.
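A sketch of the CQ-side consumption described above, modeled loosely on what liburing does internally (the cqe field names match the kernel's; everything else is illustrative):

    #include <stdatomic.h>
    #include <stdint.h>

    struct io_uring_cqe { uint64_t user_data; int32_t res; uint32_t flags; };

    struct cq_ring {
        _Atomic uint32_t *head;     /* advanced by the application  */
        _Atomic uint32_t *tail;     /* advanced by the kernel       */
        uint32_t ring_mask;         /* entries - 1 (power of two)   */
        struct io_uring_cqe *cqes;  /* shared, kernel-written array */
    };

    /* Returns 1 and fills *out if an event was available, else 0. */
    int cq_next(struct cq_ring *cq, struct io_uring_cqe *out)
    {
        uint32_t head = atomic_load_explicit(cq->head, memory_order_relaxed);
        uint32_t tail = atomic_load_explicit(cq->tail, memory_order_acquire);

        if (head == tail)
            return 0;   /* head == tail: nothing new from the kernel */

        /* Free-running 32-bit counters: only the array index is masked,
         * so the counters wrap naturally, as the quoted text describes. */
        *out = cq->cqes[head & cq->ring_mask];

        /* Copy out first; the release store hands the slot back. */
        atomic_store_explicit(cq->head, head + 1, memory_order_release);
        return 1;
    }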
I've seen the ring-buffer design pattern employed here by io_uring before: in log4j via the Disruptor, and in the queue structure of the libfabric interface for InfiniBand, for both command/control and data.
Happy to see Jens Axboe committing his time to improving the story around aio.
However, one thing about section 8.3:
> This is done through a kernel thread, specific to that io_uring.
Does it mean there will be 100k native kernel threads for an app that has 100k open sockets? Won't this ruin scalability?
Even if there's a definite answer to that, I still don't understand why threads are necessary. While the CPU waits for IO to complete, it doesn't poll anything; it fires a DMA request at the hardware and forgets about it until it gets the response.
Ideally the CPU never waits for completed IO; it's busy building the next set of responses or writing out new state to the rest of the application. If you're ever "waiting", then you should probably be putting another/bigger NIC in there or rethinking what your application actually needs to do:
10Gb Ethernet is only about 10x slower than main memory, so 10 threads could be enough if you have a single memory fetch to service each response; if you need two fetches, then 20 threads. If you need more than six, you should definitely rearchitect your application, since at that point you're out of cores and time-slicing again.
Server-class CPUs have much higher bandwidth nowadays:
> The total memory bandwidth available to a single CPU clocks in at 170 GB/s.
(i.e. at that bandwidth, 10Gb Ethernet at ~1.25 GB/s is ~136x slower than main memory, not 10x)
But you're raising a good point: if you can arrange your data set across the physical DIMMs, you can get another 10x, and maybe the contortions you need to go through to get it are worth it (if you're actually memory bound).
Better link, since the one you posted is bust: https://developer.amd.com/wp-content/resources/56301_1.0.pdf
There's a huge space between 'never waiting' and 'always waiting'.
That section discusses an optional mechanism for how the kernel gets notified about new IO requests submitted by the application. It's not for IO completion polling. For that use case, having a kernel thread per uring makes perfect sense.
But I'm not sure it needs one. It has kqueue, and unlike epoll, kqueue is a general-purpose way to get events from the kernel. It's compatible not just with sockets, but also with files and pipes, directory changes, processes, signals, timers, devices, and more. Not unlike IOCP on Windows.
On paper the new uring is faster, but kqueue is not particularly slow either. And it's been available for a couple of decades now, i.e. it's stable and really well tested. I'm happy Linux is finally getting fast general-purpose async IO, but I'll only benefit from it after a year or so, once kernel updates are out and higher-level runtimes like golang and .NET Core have been updated accordingly.
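As a minimal sketch of that one-queue-for-everything model: a readable socket and writes to a file, watched through the same kevent(2) loop.

    #include <sys/types.h>
    #include <sys/event.h>
    #include <sys/time.h>

    void watch(int sock, int file_fd)
    {
        int kq = kqueue();
        struct kevent changes[2];
        EV_SET(&changes[0], sock, EVFILT_READ, EV_ADD, 0, 0, NULL);
        EV_SET(&changes[1], file_fd, EVFILT_VNODE, EV_ADD | EV_CLEAR,
               NOTE_WRITE, 0, NULL);
        kevent(kq, changes, 2, NULL, 0, NULL);   /* register both filters */

        struct kevent ev;
        for (;;) {
            if (kevent(kq, NULL, 0, &ev, 1, NULL) < 1)
                break;
            if (ev.filter == EVFILT_READ) {
                /* socket readable: drain it with read(2) */
            } else if (ev.filter == EVFILT_VNODE) {
                /* someone wrote to the file */
            }
        }
    }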
I find kqueue to be a great and well designed API, so that can only be a good thing IMHO.
How is this one different?
The specific point to make is that while this can work with socket handles, it's primarily aimed at allowing more efficient work with files (where AIO has failed to provide much benefit beyond batching operations into fewer syscalls).
io_uring is also one of the first post-Spectre/Meltdown syscall designs to be merged into the Linux kernel. It will be interesting to see whether the ring-buffer style used in this API becomes a pattern.
Because to my knowledge, there's just one asynchronous IO interface in Linux, and it's Linux-AIO as used via io_setup()/io_submit()/io_getevents()/io_destroy() with O_DIRECT.
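For reference, a minimal sketch of that interface using the libaio wrappers (link with -laio); note the O_DIRECT alignment requirement and the two syscalls per IO complained about earlier:

    #define _GNU_SOURCE
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    long read_4k(const char *path)
    {
        int fd = open(path, O_RDONLY | O_DIRECT);
        if (fd < 0)
            return -1;

        void *buf;
        posix_memalign(&buf, 4096, 4096);      /* O_DIRECT wants alignment */

        io_context_t ctx = 0;
        io_setup(1, &ctx);                     /* room for 1 in-flight op */

        struct iocb cb, *cbs[1] = { &cb };
        io_prep_pread(&cb, fd, buf, 4096, 0);  /* 4 KiB at offset 0   */
        io_submit(ctx, 1, cbs);                /* syscall #1: submit  */

        struct io_event ev;
        io_getevents(ctx, 1, 1, &ev, NULL);    /* syscall #2: wait    */

        io_destroy(ctx);
        close(fd);
        long res = (long)ev.res;               /* bytes read, or -errno */
        free(buf);
        return res;
    }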
There's glibc support for POSIX AIO, but that's not some kernel feature, it's just POSIX support and not especially great.
I'm inclined to suspect you're confusing asynchronous IO with the myriad file descriptor event multiplexing interfaces: select()/poll()/epoll_wait(), which are orthogonal to this.
So I can't really speak to the API details, but the general concept seems like a good idea to me.
> In terms of benchmarks, I ran some numbers comparing io_uring to libaio and spdk. The tldr is that io_uring is pretty close to spdk, in some cases faster. Latencies over spdk are generally better. The areas where we are still missing a bit of performance all lie in the block layer, and I'll be working on that to close the gap some more.
From a status update in the article it looks like it's less than 2% slower than spdk now. Trading 2% throughput for lower latency sounds like a useful thing to me, as someone who gets judged by response times.