Hacker News
Io_uring By Example: cat, cp and a web server with io_uring (unixism.net)
185 points by shuss on April 6, 2020 | 37 comments

> the readv() call will block until all iovec buffers are filled with file data. Once it returns, we should be able to access the file data from the iovecs and print them on the console.

This is wrong. readv() can return as soon as a single byte has been read, just like read(). If you need to read all the bytes, you have to use a loop. As written, this program can print uninitialized memory.

I think the same applies to the uring examples, which also don't seem to check the actual processed bytes.

I'm also actually not sure whether requests in the uring version are guaranteed to be processed in order or not. I would have assumed not - in which case the printed result could have come out of order.

Obviously the demos also omit all the required memory management - but I guess that was intentional. If we added it, though, the uring versions would grow in complexity more than the synchronous version.

==> IO is hard. uring is exciting for high-performance use cases, but it makes things harder rather than easier. Ideally, though, most end users would not use it directly (nor liburing), but would instead use an async/await/coroutine framework on top of it that makes the asynchrony transparent to the application and avoids most of the pitfalls.

https://docs.rs/rio/ is a pretty credible attempt at providing a misuse-resistant API for io_uring, both synchronous and async.

There's still some nonzero complexity in using it instead of read/write, but it doesn't seem particularly burdensome to me.

I've got a lot of hope for higher-level misuse-resistant libraries on top of io_uring.

Unfortunately rio is unsound (and the maintainer doesn’t seem to care) [1].

It’s also GPL-3.0, which limits its use in many projects.

But I too am hopeful. Just not for rio’s API model specifically.

[1] https://github.com/spacejam/rio/issues/11

This doesn't read to me as "the maintainer doesn't seem to care" so much as they feel that the unsafety is likely an edge case and, in another issue, there's some more discussion around fixing it.


`forgetting` is not an edge case. In the context of Rust, having an API that invokes unsafe behavior without being marked unsafe is not acceptable. Holding ourselves to this rule is what makes the concept of `unsafe` useful: it lets us rely on the behavior/API spec of other code.

Failing to ensure correct behavior in the presence of `forget` means that once the program gets complex enough, it can blow up in unpredictable ways, because one piece of code somewhere wasn't aware of a restriction in another piece of code that is potentially very far away.

Failing to isolate this type of unsafety is a pitfall waiting for some future developer to be trapped in.

I've never personally seen or used mem::forget. I'm sure it happens, of course, and I'm in agreement that the API should be unsafe. I'm just saying it's not clear that they "don't care".

How about https://github.com/quininer/io-uring ? It is licensed under either of Apache2/MIT.

First off, you're right that readv(), like read(), can indeed return less data than was requested. Thanks. I'll fix that.

> I'm also actually not sure if in the uring version requests are guaranteed to be processed in order or not. I would have assumed the latter - and thereby the printed result could have been out of order.

io_uring may not make completions available in the same order as submissions. The article goes into some detail explaining this. However, the "cat" example only deals with one request at a time, and readv()'s iovecs are filled in order, so this is not a problem. The "cp" example does deal with multiple submissions and completions, but it issues a writev() on the completion of each readv(), writing at the correct offsets. So it wouldn't matter if completions arrive out of order. But this is indeed something to look out for.

> I'm also actually not sure if in the uring version requests are guaranteed to be processed in order or not. I would have assumed the latter - and thereby the printed result could have been out of order.

You assume correctly. There is no ordering guarantee, unless events are chained (which is a special flag).
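For reference, chaining with liburing's IOSQE_IO_LINK flag looks roughly like this untested sketch (it assumes an already-initialized `ring`, open `infd`/`outfd`, and a prepared `iov`):

```c
/* Sketch: the write will not start until the linked read completes. */
struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_readv(sqe, infd, &iov, 1, 0);
sqe->flags |= IOSQE_IO_LINK;   /* chain this SQE to the next one */

sqe = io_uring_get_sqe(&ring);
io_uring_prep_writev(sqe, outfd, &iov, 1, 0);

io_uring_submit(&ring);        /* kernel honors the submission order */
```

Without the flag, the kernel is free to complete the two requests in either order.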

Here is an async/await/coroutine (lib)uring binding written in C++, plus performance suggestions, benchmarks, and demo usage. Be sure to check it out: https://github.com/CarterLi/liburing4cpp

Here is an echo server that uses some of the 5.7 features:


One of the 5.7 features, IORING_FEAT_FAST_POLL, gives a (free) performance boost of up to 68% compared to epoll:


I maintain a NGINX fork with io_uring support (instead of AIO) here https://github.com/lazerl0rd/nginx. io_uring has been providing promising results, and recent updates to Linux have just been constantly improving it.

Glad to see that my patch is used somewhere :)

Nice set of articles, thank you! io_uring is actually pretty easy to use with liburing, and seeing more people adopt it is exciting.

The problem with io_uring, really, is that every kernel release has now become much, much more exciting than the last one due to all the improvements (cf. 5.7 now having buffer selection primops, big deal!). I'm constantly, anxiously, ever-awaiting the next kernel release... :(

I probably don't really know what I'm talking about, since I've only ever used async IO through Boost.Asio:

Isn't it a bit crazy that there are so many different ways of doing async IO on Linux alone? You would think this would be a more or less solved problem by now.

No, I don't think so.

The terminology here is a bit confusing because different people have defined "async" differently, but in the strictest sense of the word, before io_uring the only other way to do async IO is POSIX AIO. See aio(7) in your man pages: http://man7.org/linux/man-pages/man7/aio.7.html

On the other hand, using select(2), poll(2), and epoll(2) is not really async IO.

The difference is really simple: for non-async IO, whenever you use read(2) or write(2) and the call returns without an error, that operation is decidedly complete from the perspective of user space. The buffer might not actually make it to disk (say, because of caching) or the network (say, because of the Nagle algorithm). But from the user space perspective, it is done.

What about "async" libraries in user space? All the bells and whistles added by these libraries in user space merely give you support for knowing when a file descriptor is ready. It doesn't make the actual reading or writing asynchronous. So with a such library, your code might appear to call read(), but the framework knows that the file descriptor isn't actually ready, suspends your greenthread/fiber/coroutine or whatever it uses and does something else. The actual read(2) call doesn't happen. The kernel doesn't know you want to read.

With io_uring, you actually deal with read or write requests sent to the kernel that are incomplete. You actually tell the kernel you want to read (by writing to a ring buffer, hence the name). The kernel knows you want to read. That's the difference.

Now don't get me wrong but I think for the majority of applications you don't need true async IO. All you really need is efficient notification of whether a file descriptor is ready. So AIO and io_uring are both niche topics that likely won't affect your app.

Really importantly, the behavior of poll/select/epoll on files is to always return "available." But a read from that file may still sleep waiting on disk — available does not mean the result is already in memory. They're only really effective primitives for networking sockets and artificial fds like signalfd().

Prior to io_uring, libraries that provide async file access in userspace (by necessity) use some kind of threadpool, with each thread processing an operation synchronously and providing an async result via self-pipe or other user-driven event.

Both completion notification and readiness notification are forms of async IO. Buffering in userspace instead of in kernel space is not a meaningful difference. Calling the latter not true async IO seems incorrect to me.

It is true that readiness notification does not lend itself well to disk IO, though.

This introduction to io_uring has some history of the interface that may answer your question.


This might be slightly off-topic, but I'm curious:

> The core of this program is a loop which calculates the number of blocks required to hold the data of file we’re reading by first finding its size. Memory for all the required iovec structures is allocated. We iterate for a count equal to the number of blocks the file size is, allocate block-sized memory to hold the actual data and finally call readv() to read in the data.

My reading of this is that the first readv(2) implementation of cat uses space in O(n) where n is file size. With a very small constant, yes, but still linear in the number of blocks.

Before reading this article, I would have assumed one wanted to write cat to be constant space, and stream one block at a time through user space – i.e. in O(n) number of system calls instead.

When I'm reading this implementation, I'm starting to realise this is the good old space-time tradeoff, where one can either use O(n) space or O(n) system calls. For very small files, the preallocated references are probably preferable, and for bigger files, the streaming approach is better.

Does anyone know of any rules of thumb or approaches to estimate where the cutoff is? Or is it just a matter of benchmarking on various platforms?

It might be a newbie question, but why use fputc and output character by character instead of using liburing to write directly to fd 1? (Or writev?)

Yes, you could do that. But I wanted to keep the example programs at the beginning very simple. In the same vein, they only deal with one request at a time, for instance.

Makes sense, thanks!

I almost certainly have no idea what I'm talking about but

Can I write a device driver using io_uring? As in talk to i2c endpoints instead of normal files?

From what I understand and looking at the supported opcodes [0] io_uring allows submitting certain "system calls" (and some other auxiliary operations you would normally expect in an async API) asynchronously. So, if you can interact with your device driver using those system calls, then you should be able to do so. I am personally not familiar with Linux, so I am unsure if those system calls are sufficient for normal operation, but it lacks ioctl, so it is not sufficient for total replacement in all cases.

[0]: https://github.com/torvalds/linux/blob/master/include/uapi/l... search IORING_OP_

I don't know the answer to that question, but if you are developing an embedded project (and especially if it's a Raspberry Pi) you can mmap() the hardware I2C registers into userspace memory and use them directly. You will basically have to write the I2C driver in userspace and it won't be portable, but it's not that hard and it makes for very low-latency I2C communication. I had to do this once for latency reasons.

i2c and high performance async IO are pretty rarely overlapping requirements. I’m curious what your use case is?

I don't know anything either, but I assume not? With a device driver you'd be handling interrupts directly by registering an interrupt service routine and all that good stuff. io_uring is a user->kernel API because userspace doesn't handle those interrupts directly. Instead, device driver ISRs fill buffers and what have you, then kernel notifies user via io_uring that data is ready.

Nope. io_uring is an API that helps with asynchronous I/O.

That's kind of my question. I want async I/O with a device, can io_uring help?

If you have a character/block device that does read/write ops, you'd probably get io_uring support for free.

As the neighbour comment says, io_uring is for really high-performance (as in high-throughput) stuff, so… not I2C!

But you can still want async I/O with an I2C device, of course, as in you don't want to block a thread to wait on a message. And for that I believe you can still use good old select/epoll as usual on your I2C device file, and as a consequence also just use your favourite async I/O framework of the day (libevent, node.js, what have you).

Does your device have a file descriptor, as is traditional in Unix? Seems like the theoretical answer should then be yes.

This seems to assume there's always space in the ring during submission at various places in the code.

Does it support timers / timeouts?

Thanks for the article, it's very appealing to start to play around with this.

Yes. Please see io_uring_wait_cqe_timeout() in liburing.

Yes, assuming you have a fresh enough kernel.
