Hacker News new | past | comments | ask | show | jobs | submit login

epoll: tell me when any of these descriptors are ready, then I'll issue another syscall to actually read from that descriptor into a buffer.

io_uring: when any of these descriptors are ready, read into any one of these buffers I've preallocated for you, then let me know when it is done.

Instead of waking up a process just so it can do the work of calling back into the kernel to have the kernel fill a buffer, io_uring skips that extra syscall altogether.

Taking things to the next level, io_uring allows you to chain operations together. You can tell it to read from one socket and write the results into a different socket or directly to a file, and it can do that without waking your process pointlessly at any intermediate stage.

A nearby comment also mentioned opening files, and that's cool too. You could issue an entire command sequence to io_uring, then your program can work on other stuff and check on it later, or just go to sleep until everything is done. You could tell the kernel that you want it to open a connection, write a particular buffer that you prepared for it into that connection, then open a specific file on disk, read the response into that file, close the file, then send a prepared buffer as a response to the connection, close the connection, then let you know that it is all done. You just have to prepare two buffers on the frontend, issue the commands (which could require either 1 or 0 syscalls, depending on how you're using io_uring), then do whatever you want.

You can even have numerous command sequences under kernel control in parallel; you don't have to issue them one at a time and wait for each to finish before you can issue the next one.

With epoll, you have to do every individual step along the way yourself, which involves syscalls, context switches, and potentially more code complexity. Then you realize that epoll doesn't even support regular file I/O, so you have to mix multiple approaches together to even approximate what io_uring is doing.

(Note: I've been looking for an excuse to use io_uring, so I've read a ton about it, but I don't have any practical experience with it yet. But everything I wrote above should be accurate.)




If you're looking for an excuse to work on io_uring, please consider helping get it implemented and tested in your favorite event loop or I/O abstraction library. Here are some open issues and PRs:

https://github.com/golang/go/issues/31908

https://github.com/libuv/libuv/pull/2322

https://github.com/tokio-rs/mio/issues/923

https://gitlab.gnome.org/GNOME/glib/-/issues/2084

https://github.com/libevent/libevent/issues/1019


io_uring has been a game-changer for Samba IO speed.

Check out Stefan Metzmacher's talk at SambaXP 2021 (online event) for details:

https://www.youtube.com/watch?v=eYxp8yJHpik


The performance comparisons start here: https://youtu.be/eYxp8yJHpik?t=1421

Looks like the bandwidth went from 3.8 GB/s to 22 GB/s, with the client being the bottleneck.


Oh, trust me… that Go issue is top of mind for me. I have the fifth comment on that issue, along with several other comments in there, and I’d love to implement it… I’m just not familiar enough with working on Go runtime internals, and motivation for volunteer work is sometimes hard to come by for the past couple of years.

Maybe someday I’ll get it done :)


Haha nice, I just noticed that :) I think supporting someone else who's working on it, or even just offering to test and review a PR, is a great and useful thing to do.


Being able to open files with io_uring is important because there is no other way to do it without an unpredictable delay. Some systems like Erlang end up using separate OS threads just to be able to open files without blocking the main interpreter thread.


All async systems use threads for files. There is no other option. Until io_uring.


That's not true. There is io_submit(2) on Linux, which is used by some database systems. It's hard to use and requires O_DIRECT, which means you can't use the kernel page cache, and reads and writes have to be aligned and a multiple of the alignment size (sector? page?).

io_uring is a huge improvement here.


AIO has been around forever and it gives a crappy way to do async i/o on disk files that are already open, but it doesn't give a way to asynchronously open a file. If you issue an open(2) call, your thread blocks until the open finishes, which can involve disk and network waits or whatever. To avoid the block, you have to put the open in another thread or process, or use io_uring, as far as I can tell.


io_submit is so terrible and so seldom used that you're basically nitpicking by even bringing it up. It only actually works on XFS.


It is terrible, but it's worth pointing out that there has been an API for real async file I/O for a while. I believe it works with ext4 as well, although maybe there are edge cases. io_uring is a big improvement.


Threads are the simplest way, but I think you can also use ancillary messages on Unix domain sockets: open the files in a separate process, then send the file descriptors across.


There's one more aspect that I don't think gets enough consideration - you can batch many operations on different file descriptors into a single syscall. For example, if epoll is tracking N sockets and tells you M need to be read, you'd make M + 1 syscalls to do it. With io_uring, you'd make 1 or 2 depending on the mode. Note how it is entirely independent of either N or M.


What you're describing sounds awesome; I hadn't thought about being able to string syscall commands together like that. I wonder how well that will work in practice. Is there a way to be notified if one of the commands in the sequence fails, for instance because the buffer wasn't large enough to hold all the incoming data?


According to a relevant manpage[0]:

> Only members inside the chain are serialized. A chain of SQEs will be broken, if any request in that chain ends in error. io_uring considers any unexpected result an error. This means that, eg, a short read will also terminate the remainder of the chain. If a chain of SQE links is broken, the remaining unstarted part of the chain will be terminated and completed with -ECANCELED as the error code.

So it sounds like you need to decide on a strategy. You can inspect the step in the sequence that had the error, learn what the error was, and decide whether you want to re-issue the failed command along with the remainder of the sequence. For a short read, you should still have access to the bytes that were read, so you're not losing information due to the error.

There is an alternative "hardlink" concept (the IOSQE_IO_HARDLINK flag) that will continue the command sequence even if the previous step ends in an error such as a short read, as long as the previous step was correctly submitted.

Error handling gets in the way of some of the fun, as usual, but it is important to think about.

[0]: https://manpages.debian.org/unstable/liburing-dev/io_uring_e...


I'm looking at the evolution in the chaining capabilities of io_uring. Right now it's a bit basic, but I'm guessing that within 5 or 6 kernel versions people will have built a microkernel or a web server just by chaining things in io_uring, maybe with some custom chaining/decision blocks in eBPF :-)


BPF, you say? https://lwn.net/Articles/847951/

> The obvious place where BPF can add value is making decisions based on the outcome of previous operations in the ring. Currently, these decisions must be made in user space, which involves potential delays as the relevant process is scheduled and run. Instead, when an operation completes, a BPF program might be able to decide what to do next without ever leaving the kernel. "What to do next" could include submitting more I/O operations, moving on to the next in a series of files to process, or aborting a series of commands if something unexpected happens.


Yes, exactly what I had in mind. I'm also thinking of a particular chain of syscalls [0][1][2][3] (send netlink message, setsockopt, ioctls, getsockopts, reads, then setsockopt, then send netlink message) grouped so as to be done in one sequence without ever surfacing up to userland (just fill these buffers here, who's a good boy!). Right now I'm missing ioctls and getsockopts, but all in good time!

[0] https://github.com/checkpoint-restore/criu/blob/7686b939d155...

[1] https://github.com/checkpoint-restore/criu/blob/7686b939d155...

[2] https://github.com/checkpoint-restore/criu/blob/7686b939d155...

[3] https://www.infradead.org/~tgr/libnl/doc/api/group__qdisc__p...


BPF is going to change so many things... At the moment I'm having lots of trouble with the tooling but hey, let's just write BPF bytecode by hand or with a macro-asm. Reduce the ambitions...


Also wondering whether we should rethink language runtimes for this. Like write everything in SPARK (so all specs are checked), target bpf bytecode through gnatllvm. OK you've written the equivalent of a cuda kernel or tbb::flow block. Now for the chaining y'all have this toolbox of task-chainers (barriers, priority queues, routers...) and you'll never even enter userland? I'm thinking /many/ programs could be described as such.


One obvious limit is that chaining the open op is usually not very straightforward, since the following commands tend to need the fd it returns.

But to answer your actual question, you can link requests and either abort or continue on failure.


Yes, check the documentation for the IOSQE_IO_LINK flag to see exactly how this works.


This makes it much clearer. Thanks!



