
The Rapid Growth of Io_uring - signa11
https://lwn.net/Articles/810414/
======
rayiner
One of the comments to the article is really interesting. The recent Meltdown
stuff has really blown up the cost of privilege transitions, because now
people expect non-architectural data not to leak across privilege boundaries.
System calls are about twice as slow as they used to be. Meanwhile, I/O is
faster than ever, with PCIE and NVME. Io_uring offers the opportunity to avoid
privilege transitions through asynchronous calls based on writing to a memory
buffer shared between user space and the kernel. That has the potential for
fundamentally changing the basic system call interface. As the article hints
at, the trick is designing the API so you can construct as large a block of
work to be done asynchronously as possible. At the limit, you could really
push down the core of your I/O loop into the kernel, hence the suggestions
that BPF programs could be submitted through the ring to chain operations all
within the kernel.

(Incidentally, this is a good illustration of the flexibility of UNIX’s model
of describing everything with a file descriptor. The same interface meant for
asynchronous file I/O was easily extended to network I/O.)

~~~
stefan_
> hence the suggestions that BPF programs could be submitted through the ring
> to chain operations all within the kernel

That sounds unbelievably annoying.

No, the limit has been found, and it is not in the kernel. You push the I/O
loop up into userspace, becoming more specific, not more general. They call it
SPDK [1] (or DPDK for networking), and as far as I can tell, the principle is
essentially having a dummy driver in the kernel that maps the entire PCIe
peripheral memory space into your chosen process, and everything flows from
there.

At the I/O limit, _asynchronous_ isn't feasible because interrupts introduce
latency and waste cycles not doing work. All userspace I/O frameworks work
only through polling.

1: [https://spdk.io/](https://spdk.io/)

~~~
rayiner
The problem with mechanisms like DPDK is that they bypass all the
infrastructure in the kernel and make it hard to play well with others using
the same hardware or services. DPDK, for example, bypasses the TCP/IP stack.
SPDK bypasses the VFS. You can write your own TCP/IP stack or filesystem on
top of those things, but then you can't play well with other processes using
those services. While some GPUs can directly multiplex command streams from
different processes, most hardware cannot.

~~~
saber6
That's the point of DPDK: to get the kernel out of the way of packet
processing.

Userland packet processing (in a network context) is much more flexible and
less brittle than forcing certain functionality to exist solely in the kernel
layer. However, things do exist that allow you to (mostly) transparently
re-jigger a standard app's TCP/IP calls. One such example is using LD_PRELOAD
to "hijack" the syscalls for certain things and route them to your (super-
high-performance) userspace app!

There's a lot of exciting stuff happening in the open source networking world
(DPDK, VPP/FDio, Network Service Mesh, etc). I really recommend digging into
it!

------
willvarfar
So I started googling io_uring, and came across this excellent talk at FOSDEM
[https://fosdem.org/2020/schedule/event/rust_techniques_sled/](https://fosdem.org/2020/schedule/event/rust_techniques_sled/)

~~~
jen20
This was indeed a great talk, and the Rio library it describes is also
incredibly usable.

------
blattimwind
Turns out, if you give people a sane API, they actually want to use it! And
this is the first relatively sane way on Linux/BSD to do asynchronous IO. :)

(I say "relatively sane" because this isn't really asynchronous; it's
make-believe with a kernel-managed thread pool, because fully synchronous I/O
is ingrained far too deeply into both the Linux and BSD I/O stacks.)

~~~
wtallis
> "relatively sane" because this isn't really asynchronous

What's your criteria for "really asynchronous"?

> it's just make-believe with a kernel-managed thread pool

To the extent that io_uring uses anything resembling a thread pool, it seems
to me that it is used _completely differently_ from how a userspace AIO thread
pool operates. When a userspace AIO implementation submits IO to the kernel,
it does so with a blocking syscall and that thread stalls until that IO is
complete. That means the number of outstanding IOs is limited by the number of
threads in the pool. I don't see any such limitation in using io_uring to
deliver IO requests to the block layer.

~~~
tyingq
> _What's your criteria for "really asynchronous"?_

One example might be that related operations, like stat(), opendir(),
readdir(), getpeername(), and so on, remain synchronous. And that the async
functionality is mostly a bolt-on to very established things: file
descriptors, Berkeley sockets, etc.

Also, every improvement is a pretty big patchset with code to ensure the
traditional synchronous operations don't get unintended side effects.

Just generally the idea that a "clean start", non-POSIX bound OS might design
things differently, I imagine. Google's Fuchsia seems to hit some middle
ground, where async is more foundational, for example.

I don't think that observation detracts from the improvements.

~~~
chriswarbo
> Just generally the idea that a "clean start", non-POSIX bound OS might
> design things differently, I imagine. Google's Fuchsia seems to hit some
> middle ground, where async is more foundational, for example.

Microsoft's (research, discontinued) Midori OS was heavily async:

[http://joeduffyblog.com/2015/11/19/asynchronous-
everything](http://joeduffyblog.com/2015/11/19/asynchronous-everything)

~~~
tyingq
That's exactly where I was going. Criticism of Linux async shouldn't be
discouraged because of inherent limitations; that would hamper innovation. I
wouldn't be surprised by a new web server, API gateway, load balancer, etc.,
that mandated _"forget about POSIX, adhere to this"_. The whole cloud
abstraction movement seems to enable it.

------
matheusmoreira
Linux finally has something comparable to Windows I/O Completion Ports. Looks
like it is flexible enough to be a generic asynchronous system call interface.
Awesome!

------
willvarfar
io_uring puts me in mind of a 2010 prototype of async and batched syscalls
that sped up MySQL by 40%!
[https://www.usenix.org/legacy/events/osdi10/tech/full_papers...](https://www.usenix.org/legacy/events/osdi10/tech/full_papers/Soares.pdf)

The cost of syscalls has gone up massively due to all the mitigations against
recent side channel attacks. Batching and async syscalls would be an even
bigger win than ever.

Wouldn't it be great if glib or something could adopt new mechanisms and
everyone got faster?

~~~
api
In the long term I think OS APIs need to be redesigned (or at least the
calling mechanism) around something more like io_uring for a lot of things.
Shared memory, lock-free data structures, no syscalls.

~~~
cesarb
> Shared memory, lock-free data structures, no syscalls.

You'd still need a syscall to sleep when the ring is empty, and a syscall to
wake up the mechanism when the first request is put on an empty ring. But
yeah, other than that (and perhaps a "yield" syscall), you could do
everything through the ring.

~~~
yxhuvud
You can actually go totally syscall-free with io_uring, but that requires
privileged mode, as the setup call will otherwise fail when the SQPOLL flag is
requested. But yeah, I suppose most people don't like running things as root.

~~~
couchand
Well, not totally free. If you don't submit an event before the SQ poll
timeout, the kernel thread will go to sleep, and you need to call
io_uring_enter() again to wake it up.

------
mavam
For users who need macOS and FreeBSD (kqueue) support as well, is there any
unified standard for async I/O that covers both file and network? Or is the
only choice to go with a library like libuv, which will always pick the
optimal native implementation under the hood?

~~~
earenndil
Once freebsd's linux implementation supports io_uring, you'll be able to use
that there too. Nothing for macos, though, I'm afraid.

~~~
cpach
I’m afraid I don’t follow. What do you mean by “freebsd’s linux
implementation”?

~~~
jacobush
FreeBSD can run linux binaries:

[https://www.freebsd.org/doc/handbook/linuxemu.html](https://www.freebsd.org/doc/handbook/linuxemu.html)

I wonder though... wouldn't a safe first way of implementing these new
syscalls be to make them actually _synchronous_?

That way you'd be able to run these Linux binaries but without any of the
performance benefits.

~~~
cesarb
> I wonder though... wouldn't a safe first way of implementing these new
> syscalls be to make them actually synchronous?

No, because it visibly changes the semantics. Consider for instance
IORING_OP_ACCEPT; if you make it synchronous, and nothing connects to your
program, it would wait forever, instead of returning immediately and allowing
the program to continue. The file-related opcodes are safer (when used with
actual files, instead of network sockets), but still would behave differently
for instance with a hanging NFS mount.

------
kashyapc
From last Sunday, at FOSDEM 2020, a talk[1] by QEMU developer Julia Suvorova,
on integrating _io_uring_ into QEMU (the open source machine emulator and
virtualizer), and related performance implications.

From the abstract:

 _"io_uring enhances the existing Linux AIO API and provides QEMU a flexible
interface, allowing you to use the desired set of features: submission
polling, completion polling, fd and memory buffer registration. By explaining
these features we will come to examples of how and when you need to use them
to get the most out of io_uring. Expect many benchmarks with different QEMU
I/O engines and userspace storage solutions (SPDK).

You will get a brief overview of the new kernel feature, how we used it in
QEMU, combined its capabilities to speed up storage in VMs and what
performance we achieved. Should io_uring be the new default AIO engine in
QEMU? Come and find out!"_

[1] io_uring in QEMU: high-performance disk I/O for Linux —
[https://fosdem.org/2020/schedule/event/vai_io_uring_in_qemu/](https://fosdem.org/2020/schedule/event/vai_io_uring_in_qemu/)

------
willvarfar
I used to do asynchronous I/O with epoll, AIO, etc., and spent time
benchmarking my servers against completion ports, kqueue, and so on.

Eventually libraries like libuv turned up and made my life a lot easier.

Are there any good stats on how io_uring compares to all that older async I/O
stuff?

~~~
khc
epoll doesn't allow you to do async io to local storage, and io_uring doesn't
target the network use case, so they are not quite comparable.

~~~
topspin
> io_uring doesn't target the network use case

There are several io_uring opcodes intended specifically for the network use
case, including: IORING_OP_ACCEPT, IORING_OP_CONNECT, IORING_OP_SENDMSG and
IORING_OP_RECVMSG.

Did you mean something else by 'target'?

~~~
khc
iirc it wasn't created for network IO, but I could be wrong

~~~
yxhuvud
The reason it was created and what it targets are different things. io_uring
is evolving into a very general way to do async interaction with the kernel,
and the way it is built enables high performance for things that aren't disk
I/O as well.

------
cytzol
Question: is io_uring only worth using for _writing_ to files, or does it
provide performance benefits when reading from them, as well?

I've seen database programmers rave about the speed benefits of asynchronous
I/O, because they have to store a lot of data on disk. But the majority of
programs I write only have to deal with _reading_ files. I'd love to try using
io_uring, but only when it's appropriate.

~~~
thristian
Asynchronous I/O is only useful if you have something else you can be doing
while you wait for file-data to arrive. For example, if you can do
calculations on the part of the file you've already read while you wait for
the rest to arrive, or if you can process data for one client while data from
a different client is in-flight. Databases are a good example - usually
they're handling queries for multiple clients at once, and even within a query
they can be reading from multiple places simultaneously (like the individual
tables involved in a JOIN).

If your program needs to read in an entire JSON blob (for example) before it
can do anything, or if it only does light processing on each individual part
(like adding up a column of a CSV file) then async I/O probably isn't going to
help.

~~~
machinecoffee
Which would actually work nicely with a fibres library, so you can yield while
you're waiting for the io to complete.

------
executesorder66
Is there any userspace software that is actually making use of io_uring?

~~~
saghul
There is an ongoing effort to bring it into libuv:
[https://github.com/libuv/libuv/pull/2322](https://github.com/libuv/libuv/pull/2322)
Looks like using liburing might be the way to go, since it's MIT licensed.

