An io_uring-based user-space block driver (lwn.net)
102 points by veddan on Aug 18, 2022 | 38 comments



I wonder if this could replace most uses of NBD (network block devices), and/or help get iSCSI into userspace where more flexible load-balancing policy can be implemented.

It also reminds me of attempts to define BUSE[0][1][2], which would have been a block device equivalent of FUSE. IIRC attempts to get BUSE into the Linux kernel have been blocked for performance reasons -- the FUSE protocol isn't well designed and is only barely acceptable for VFS.

If io_uring (+ careful use of zero-copy) has fixed the performance issues with userspace block devices, maybe it would be applicable to FUSE (or FUSE-v2)? I've tried using io_uring with the current FUSE protocol to reduce syscall overhead and it kinda works, but a protocol designed to operate in that mode from the beginning would be even better.

[0] https://github.com/acozzette/BUSE

[1] https://dspace.cuni.cz/bitstream/handle/20.500.11956/148791/...

[2] https://dl.acm.org/doi/10.1145/3456727.3463768
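
To make the io_uring-driven mode I mentioned concrete, here's a minimal sketch (assuming liburing, a ring already set up with io_uring_queue_init(), and an already-opened /dev/fuse descriptor; read_one_request and fuse_fd are my names) that pulls the next FUSE request through the ring instead of a blocking read(2):

  #include <liburing.h>
  #include <linux/fuse.h>

  void read_one_request(struct io_uring *ring, int fuse_fd) {
          static char buf[FUSE_MIN_READ_BUFFER];

          /* queue a read on the /dev/fuse fd; error handling omitted */
          struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
          io_uring_prep_read(sqe, fuse_fd, buf, sizeof(buf), 0);
          io_uring_submit(ring);

          struct io_uring_cqe *cqe;
          io_uring_wait_cqe(ring, &cqe);   /* cqe->res = request length */
          struct fuse_in_header *hdr = (struct fuse_in_header *)buf;
          (void)hdr;                       /* dispatch on hdr->opcode here */
          io_uring_cqe_seen(ring, cqe);
  }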


The SPDK project is certainly looking to use this to replace our limited use of NBD, as well as present SPDK block devices as kernel block devices, including devices backed by userspace implementations of iSCSI, NVMe-oF, and various other network protocols.


I had the same thought re: FUSE. I'm tentatively thinking of getting back into programming by working some on sshfs (because I'm bored and think it's important, while it's maintainer-less and very squarely within my specialty). Not until early September, though, since right now life is consumed by end-of-summer stuff and then getting my daughter off to college. Anyhow, within that context I've also thought about FUSE (which I also have some experience with since I added SELinux tag support) plus io_uring. Certainly nothing's likely to happen right away, but it will be on my personal roadmap.


I'd be careful about giving up the network capabilities. With NBD it's very useful to move the client and server apart, either having them both run in userspace on the same machine over a Unix domain socket or talking remotely over the network. For our case this is by far the most common use of NBD, we hardly use nbd.ko at all.


Is BUSE significantly different from CUSE (“character device”)?

https://lwn.net/Articles/308445/


Yep! Character devices are much closer to "stream of bytes", and from the FUSE perspective they look like a single file with limited operations (open, close, read, write). Think of something like a mouse (sending a stream of motion/click events) or a webcam (sending a stream of frames, receiving basic control commands). If you've written even the most basic FUSE layer, you've got all the necessary handlers to implement CUSE too.
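
A rough sketch with libfuse's CUSE API to show what I mean -- dev_open/dev_read are hypothetical handlers and error handling is omitted, but note the callbacks have the same shape as the low-level FUSE ones:

  #include <cuse_lowlevel.h>

  static void dev_open(fuse_req_t req, struct fuse_file_info *fi) {
          fuse_reply_open(req, fi);        /* accept the open */
  }

  static void dev_read(fuse_req_t req, size_t size, off_t off,
                       struct fuse_file_info *fi) {
          static const char frame[] = "fake webcam frame";
          /* stream bytes out, ignoring the offset like a char device would */
          fuse_reply_buf(req, frame, size < sizeof(frame) ? size : sizeof(frame));
  }

  static const struct cuse_lowlevel_ops ops = {
          .open = dev_open,   /* same signatures as FUSE's low-level ops */
          .read = dev_read,
  };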

Block devices operate on blocks of data identified by offset. Hard disks, CD-ROM drives, USB sticks, basically anything where it'd make sense to say "read (or write) these 1024 bytes at offset 0x10000".
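
In syscall terms (blk_fd here is an assumed open descriptor on something like /dev/sdb):

  #include <unistd.h>

  /* "read these 1024 bytes at offset 0x10000" */
  char buf[1024];
  ssize_t n = pread(blk_fd, buf, sizeof(buf), 0x10000);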

You can in principle implement a block device-ish API in FUSE by disabling open/close and requiring all reads/writes to be at given offsets -- IIRC this is how the "fuseblk" mode added for ntfs-3g works -- but the protocol is too chatty to be fast enough for things people want block devices for.

I've also heard the kernel's block layer error handling doesn't interact well with the FUSE protocol, but I don't know the details on that.


> Block devices operate on blocks of data identified by offset.

Sure, although CUSE read and write operations take offsets, too. The kernel could just send block-sized IOs to a CUSE driver and it wouldn't be all that different.

> You can in principle implement a block device-ish API in FUSE by disabling open/close and requiring all reads/writes to be at given offsets

Right, ok.

I think the historical distinction between block and character devices is largely that -- historical. Nowadays the distinction is mostly whether or not the kernel puts a block cache in front of the device. FreeBSD eliminated the distinction entirely.


There may be kernels that have simplified their device model to unify character and block devices, but Linux has not. FUSE/CUSE (and now ublk) are Linux-oriented protocols from the beginning, with relatively little thought given to cross-platform compatibility.

If you use FreeBSD then you're likely familiar with the challenges they've faced adapting FUSE to their VFS, and last time I checked they don't have plans to support CUSE at all.

You might also be interested in <https://lwn.net/Articles/343514/> (from 2009!), which discusses some of the challenges with using something like the FUSE protocol to back a block device in Linux. That message also describes a better solution which, to my eyes, looks a lot like ublk.


FreeBSD does have CUSE, for what it’s worth.


FreeBSD has a /dev/cuse device, and a libcuse reimplemented on top of it, but it uses a different protocol from Linux's CUSE. You can see the FreeBSD implementation at https://github.com/freebsd/freebsd-src/blob/release/13.1.0/s... -- note how cuse_server_read() and cuse_server_write() are stubs.

I am somewhat familiar with this because I wrote a FUSE/CUSE server library in Rust, and tried porting it to FreeBSD. The FUSE bits worked with only minor issues[0][1], but the CUSE bits were completely different so I had to turn off that part of the library for FreeBSD targets.

[0] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=253411

[1] https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=253500


Somehow this reminded me of this post on LKML:

- <odd>.x.x: Linus went crazy, broke absolutely _everything_, and rewrote the kernel to be a microkernel using a special message-passing version of Visual Basic. (timeframe: "we expect that he will be released from the mental institution in a decade or two").

[*] https://lkml.org/lkml/2005/3/2/247

It's really interesting to see Linux getting more and more micro-kernel like features throughout the years.


Notably, the post is from a “decade or two” ago, so the timeline matches.


For me, the killer use case for this is presenting logical volumes to containers. Until this, there just has not been an efficient mechanism for a local storage service in one container to serve logical volumes to another container on the same system. For VMs there is virtio/vfio-user, but for containers the highest-performing option until this was NVMe-oF/TCP loopback.

Basically, you can implement a virtual SAN for containers efficiently with this.


I'd like a built-in iSCSI volume driver for docker, podman, et al. There are third-party things (netapp trident[1], etc.) but no generic driver. One would think -- given the ubiquity of SAN boxes populating racks outside of cloud operators -- you could "-v iscsi:<rfc-4173-iscsi-uri>:/mountpoint" a network block device into a container out of the box. I suppose it's difficult to deal with in a cross-platform way. When you read the golang source for trident you see they're just exec-ing iscsiadm on linux container hosts.

[1] https://github.com/NetApp/trident


Very appealing. Do you think this solution would be comparable in performance with an in-kernel storage driver?


That reminds me quite strongly of VirtIO (block) devices... and yet the actual command format is, of course, different. Why can't we stop re-inventing things over and over?


Because different use cases require different designs? If you try to create a protocol that can work for all purposes, it'll be a poor fit for any of them and will be out-competed by more specialized alternatives.

There's a reason emulators design their virtual devices to resemble real hardware (PCI, SCSI, USB) -- there's already going to be a bunch of code in the hypervisor to create fake hardware. It's also more practical to piggy-back on PCI (etc) when the spec needs to be implemented by competing vendors, since there's no kernel and no OS idioms involved. Not to mention various pre-kernel code such as EFI and bootloaders.

Conversely, userspace developers really do not want to be coding up a fake PCI device with registers and interrupts and so on just to get some bytes into the kernel. They want to invoke system calls (ioctl, mmap, io_uring) and let the OS handle the details.


The basis of virtio is literally just a ring queue, with its request descriptors looking almost exactly like structs suitable for passing to readv(2) or writev(2); the PCI shim is built on top of that and is completely optional (you can have a purely MMIO virtio device, after all). It was built this way so that KVM would not have to mimic the idiosyncrasies of real hardware: passing data as-is to the physical devices can't work for obvious reasons (even if you disregard security completely); instead it can, after minimal processing, shove it into the Linux kernel and let it take care of the rest.
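
For comparison, the split-ring descriptor as given in the virtio spec, next to struct iovec from <sys/uio.h>:

  struct virtq_desc {         /* from the virtio spec */
          le64 addr;          /* guest-physical buffer address */
          le32 len;
          le16 flags;         /* NEXT / WRITE / INDIRECT */
          le16 next;          /* descriptor chaining, a la an iovec array */
  };

  struct iovec {              /* <sys/uio.h> */
          void  *iov_base;
          size_t iov_len;
  };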


The VirtIO specification for its use with MMIO[0] contains the following example device description:

  // EXAMPLE: virtio_block device taking 512 bytes at 0x1e000, interrupt 42. 
  virtio_block@1e000 { 
          compatible = "virtio,mmio"; 
          reg = <0x1e000 0x200>; 
          interrupts = <42>; 
  }

The next sub-section of the MMIO section is a datasheet of control registers.

My point about PCI isn't strictly about PCI; it applies equally to VirtIO over MMIO. I do not ever want to have my userspace code poke at memory-mapped registers or do interrupt handling just to do the equivalent of an ioctl.

---

OK, fine, maybe PCI and MMIO are irrelevant but there could be opportunities to share struct layouts. In the section describing block devices[1], there's some code listings for the request packets. A representative example is the request struct:

  struct virtio_blk_req { 
          le32 type; 
          le32 reserved; 
          le64 sector; 
          u8 data[][512]; 
          u8 status; 
  };

Take a look at that request, then look at struct ublksrv_ctrl_cmd in ublk_cmd.h[2]. There is very little the two protocols have in common. Yes, they're both doing some sort of packetized data transfer, but all of the details are different.

Also, just ... just look at the size of the VirtIO specification. There is a lot there. I haven't run a `wc -l` but it would not surprise me if just the spec for VirtIO is longer than the entire patch series for ublk. Out of all that, the sum total of the virtio-blk struct layouts is something like 100-200 lines.

Is it worth going through the trouble of trying to unify these two unrelated specs just so that we can satisfy some bizarre philosophical goal of carefully avoiding new ideas?

Like, if you're going to go that far, why does virtio-blk need to exist instead of continuing to emulate SCSI? Or using iSCSI for host <-> device transfer? The obvious answer is, again, because different use cases have different requirements. It's silly to cook two soups in the same bowl.

[0] http://docs.oasis-open.org/virtio/virtio/v1.0/cs04/virtio-v1...

[1] https://docs.oasis-open.org/virtio/virtio/v1.1/csprd01/virti...

[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/lin...


Is anything like this for networking done or in the works?


For networking, the closest equivalent would be TUN/TAP, which lets userspace route either IP packets (TUN) or Ethernet frames (TAP).
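
A minimal sketch of the userspace side (error handling omitted): after this, read(2) on the returned fd yields raw IP packets and write(2) injects them back into the kernel's stack.

  #include <fcntl.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/if.h>
  #include <linux/if_tun.h>

  int open_tun(const char *name) {
          int fd = open("/dev/net/tun", O_RDWR);
          struct ifreq ifr;
          memset(&ifr, 0, sizeof(ifr));
          ifr.ifr_flags = IFF_TUN | IFF_NO_PI;   /* IP packets, no extra header */
          strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
          ioctl(fd, TUNSETIFF, &ifr);
          return fd;
  }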


If I'm understanding ublk and your question correctly, then yes, there are a lot of kernel-bypass networking options out there, such as openonload, dpdk, mellanox (though they seem to have been absorbed into nvidia). You'll likely need a special/particular network card, an external kernel module, and at least an LD_PRELOAD to use them though.


Is there no way to avoid the kernel copying all network data?

I understand the frustration of having the network driver crash, but couldn't it be run in a way that doesn't bring down the OS?

It seems to me Java would get a no-brainer advantage from a user-space networking option, since you're already in a VM!?

When I saturate my HTTP server, the kernel takes 30% of the CPU just copying data for no good reason?!



Yes, and also just to note, zero-copy and kernel-bypass are independent. Traditional Berkeley socket syscalls are copy+kernel, io_uring has/will have zerocopy+kernel, openonload provides both APIs for copy+kernelbypass and zerocopy+kernelbypass.
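
To make the zerocopy+kernel case concrete, here's roughly what Linux's MSG_ZEROCOPY (4.14+) looks like; fd, buf, and len are assumed to already exist:

  #include <sys/socket.h>

  int one = 1;
  setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));
  send(fd, buf, len, MSG_ZEROCOPY);   /* kernel pins the pages instead of copying */
  /* a completion arrives later on the socket error queue (MSG_ERRQUEUE),
     signalling when buf may be reused */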


There is XDP.


XDP is kind of the opposite of this, right? It's moving userland code into the kernel.


XDP is a lot of stuff, but I think I know someone using AF_XDP to bypass the kernel network stack: selected packets (the filtering and the decision of which streams get redirected is done through some eBPF, IIRC) are delivered directly into userland buffer queues. DPDK also has an AF_XDP backend to bridge your classical DPDK app and AF_XDP sockets.


Ah, that's true. AF_XDP is definitely similar to userland block device offload.


Yes, I think there is example code where io_uring is used to get blocks into and out of XDP/the kernel.


[flagged]


Such as? You seem very confident about it, so why not just show the problems instead of dropping vague hints?


So it’s essentially like userspace iSCSI server, but proprietary?


Not proprietary, but not iSCSI-specific either. The whole idea is that you can use any protocol you like. Could be iSCSI, could be NBD, could be AoE, could be something proprietary but that's less likely than open/standard alternatives.


It’s obviously proprietary: it’s non-standard and specific to a single vendor.

What is the whole idea, though? Serving things to the kernel from userland is decades old and commonly used with both NFS and iSCSI. The fact that this particular implementation uses io_uring instead of something non-proprietary like RDMA is just an implementation detail.


  > It’s obviously proprietary: it’s non-standard and specific to a
  > single vendor.

That's not how people typically use "proprietary" when referring to open-source code developed collectively by multiple vendors, universities, and thousands of independent contributors.

  > What is the whole idea, though? [...] this particular implementation
  > uses io_uring instead of something non-proprietary like RDMA, is just
  > an implementation detail.

When performance matters, sometimes implementation details are the whole idea.

According to the patch's author at <https://lwn.net/Articles/904638/>, ublk has about twice the throughput of NBD.


>That's not how people typically use "proprietary" when referring to open-source code

Indeed, many people believe that source code being available somehow magically makes things non-proprietary. Not sure where that belief came from. An API is proprietary when it (1) doesn't comply with existing standards, (2) isn't interoperable, and (3) is controlled by a single entity.

>multiple vendors

It comes from IBM/RedHat - a single commercial entity, not “thousands of independent contributors”. But yes, if it was a proper community project then it of course wouldn’t be proprietary.

>twice the throughput

Compared to NBD, which can't use RDMA at all. Again: the idea is old and not bad, it's just that this particular implementation looks like another case of NIH.


> It’s obviously proprietary: it’s non-standard and specific to a single vendor.

I just re-read the LWN article and there's no suggestion of any such thing. Where are you seeing it?

> Serving things to kernel from userland is decades old and commonly used with both NFS and iSCSI.

...and regular FC/SCSI too. I worked on that for four years. Yawn. What's your point?

> The fact that this particular implementation uses io_uring instead of something non-proprietary like RDMA

RDMA implementations are even more likely to be bound up with proprietary bits than vanilla-TCP implementations. RDMA over IB and other, even lesser-known, interconnects (both open/standard and proprietary) existed long before RoCE. Do you even know what "open" vs. "proprietary" mean? There were open network-disk protocols before iSCSI. I worked on such things as early as 1989, and developed my own (along with my own user-space block device for Linux and Windows) in 2000. It seems like you've only been exposed to a small set of open/standards-based technologies, and deride anything else as "proprietary" even though that's far from accurate. Is there some undisclosed interest at play here?

> is just an implementation detail

It's a very important detail, considering the performance difference.


Why do you say proprietary?



