Rethinking splice() (lwn.net)
133 points by mfiguiere on March 3, 2023 | 22 comments



Back in 2019 I wrote about "splicing" in the context of networking:

https://blog.cloudflare.com/sockmap-tcp-splicing-of-the-futu...

Many of the mentioned bugs are fixed by now, though admittedly that post focuses mostly on network-to-network splicing, not disk-to-network transfers.

There are large benefits to be had if one combines sendfile() with kTLS:

https://news.ycombinator.com/item?id=30903146

https://people.freebsd.org/~gallatin/talks/euro2021.pdf
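The in-kernel copy half of that pipeline is easy to try. A minimal sketch using Python's os.sendfile() wrapper (assumes Linux; a socketpair stands in for a real TCP connection, and the kTLS/NIC-offload part needs kernel and driver support, so it isn't shown):

```python
import os
import socket
import tempfile

payload = b"hello from the page cache\n" * 100

with tempfile.TemporaryFile() as f:
    f.write(payload)
    f.flush()

    a, b = socket.socketpair()
    try:
        # The kernel copies straight from the page cache to the socket;
        # the bytes never pass through a userspace buffer.
        sent = os.sendfile(a.fileno(), f.fileno(), 0, len(payload))
        received = b.recv(len(payload), socket.MSG_WAITALL)
        assert sent == len(payload)
        assert received == payload
    finally:
        a.close()
        b.close()
```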

With traditional splice(), the lack of a completion mechanism was always a usability problem, as with other Linux zero-copy APIs. This is usually not a problem for kernel-to-kernel data transfers, but it is when doing userspace-to-kernel zero-copy.

io_uring is promising.


This is not a problem with splice() per se; it is yet another problem with treating the network as a file, per the classic Unix idea that "everything is a file". It is crazy to consider that writing to a TCP socket over Wi-Fi, with 1001 reasons for failure, is similar to writing to local memory or a hard drive.

Unix began when "networked computers" sat on one local network, stably wired together with patch cords. Back then, the latency of communication over the wire was not much different from writing to a local magnetic tape.

Today those latencies have diverged dramatically. Your machines talk over radio, with multiple hops between cities and different providers, while local storage pushes terabytes in seconds. Trying to "splice" those two for latency gains is kinda crazy. And so are networked filesystems.

Git / Dropbox are much better approaches: you have your stuff on both ends and sync content explicitly, without trying to provide the interactive abstraction of having the data locally when it actually isn't.


Sorry for the meta comment, but splice() in the title should not be capitalized; it is not on the original page. It is a part of the API and it would be clearer to have it in small caps. The fact that it's on lwn.net suggests it's a Linux syscall, but it could be made clearer.


I agree.

> It is a part of the API and it would be clearer to have it in small caps.

Very small nitpick (perhaps useful for you to be clearer in future discussion): I think you mean 'lower case letters'. Small caps are something different, and not what you'd want here: https://en.wikipedia.org/wiki/Small_caps


> Neither splice() call knows when the data it passes through has reached its final destination; the network layer may still be working with the file data even after both splice() calls have completed. There is no easy way to know that the data has been transmitted and that it is safe to modify the file again.

Am confused. Normally you'd modify buffers after they've been acked at the application layer. Do samba clients not send acknowledgements?

> the problem could be simplified by taking the pipes out of the picture and allowing one non-pipe file descriptor to be connected directly to another. The pipes, Axboe said, "do get in the way" sometimes

> a new "send file" io_uring operation could be a good solution to this problem; it could be designed from the beginning with asynchronous operation in mind and without the use of pipes

ooh I like that purely from a programmer convenience PoV


> Am confused. Normally you'd modify buffers after they've been acked at the application layer. Do samba clients not send acknowledgements?

If the file is being locked against changes while the network transmission is in progress (e.g. the "stable pages" idea mentioned in the article), then under normal circumstances (unless the networking subsystem is very overloaded) the transmission will complete quickly, so the file will only be briefly locked against changes. By contrast, waiting for the client to acknowledge can take a lot longer, and will result in the file being locked for much longer, which is far more likely to have a negative impact on applications, other SMB clients, etc. Consider a database file (bad practice, I know, but some people will do it anyway) being read and written over SMB: blocking all writes until any outstanding reads are acknowledged could have a big negative performance impact.

Hence, they are okay with locking things in place until the network transmission is complete, but not until an acknowledgement happens. But, the splice() API doesn't provide any way to notify that the destination has finished processing all the writes the splice call sent to it.


The splice system call could simply not return until the acknowledgment is in. That doesn’t completely solve the problem, though: if a packet is dropped it has to be retransmitted, but the original data might have been modified in the meantime, which is probably something nobody expects.


And thus become useless, since people using splice for performance likely use some form of non blocking IO.

edit: it also works poorly for blocking IO: one doesn’t splice from a file to a socket — one splices from a file to a pipe and then to a socket. Which splice() call is supposed to block?


The call is only supposed to return when all the data has been sent, so both. This whole discussion is just about retransmits.

This does not make the call useless, as the performance gain is supposed to come from avoiding copying the data between the kernel and userspace, and that saving still happens.


> The call is only supposed to return when all the data has been sent, so both.

Good luck. Splice from a file to a socket doesn’t send any data — it just references it. So if you mmap a file writably and then splice from it to a pipe, splice must either copy the data, write-protect the mapping to enable copy-on-write, or take a reference that will potentially propagate future changes from the mapping to the pipe. Write-protecting the mapping is slow, so pretty much none of the options work. Blocking until the reference is gone makes no sense, because the reference won’t go away until the caller splices to the network, which will never happen. Deadlock.


I’m not sure why you’re trying to convince me it can’t work like the documentation says it does. Is the documentation wrong?


Depending on how you interpret the man page, it’s either vague or wrong.

But that’s not what I’m trying to convince you of. I’m trying to convince you that the semantics you seem to want cannot be implemented efficiently. The kernel isn’t a magic box that can do anything you want. It’s a computer program that happens to run at a high privilege level and gets to use some fun hardware features. There is no fun hardware feature that magically and efficiently snapshots memory without copying it. There is a hardware feature that “write-protects” a memory mapping and notifies the kernel when someone tries to write to it, but file data in the page cache may have an arbitrarily large number of mappings, write-protecting a mapping is slow (especially on x86, which is, in many respects, showing its age as an architecture), and the kind of fancy software bookkeeping needed to maintain the fiction of copy-on-write is neither straightforward nor particularly efficient.

If you rent x86 servers and instruct them to fiddle with the MMU for every few kB of data sent, AWS will happily take all your money while sending a remarkably small data rate per core. You will not get Netflix-style line rate performance like this.


Sendfile() works fine, is pretty fast, and avoids this whole problem, so you’re not going to convince me it’s impossible, or that having the kernel read a file into a buffer and then send it somewhere is fast while taking a copy-on-write snapshot of pages that are already in memory is slow.

It’s just that Linus doesn’t want to change splice() to do what the Samba people want, which is fine but doesn’t mean it’d be impossible.


The problem is that if one part of Samba is sending a file while another part is writing to it, there’s no good way to coordinate that, because the writer doesn’t know when the reader is finished. Thus you end up potentially changing the page cache partway through a send. Opening the file with O_DIRECT for writing, though, would probably mitigate that concern, provided you combined it with io_uring so that your performance didn’t crater.


> Am confused. Normally you'd modify buffers after they've been acked at the application layer. Do samba clients not send acknowledgements?

It's the other processes that make changes.

splice() connects and moves data between file descriptors as if they were pipes, but files are not pipes in the purest sense -- they can be seeked randomly and modified concurrently by other processes.

The abstraction breaks down quickly once we mix in zero-copy, caches, network packet retransmission, and all sorts of filesystem differences.


I think the principal reason for requiring a pipe is so that splice only has to understand N combinations of source/sink pairs, rather than N^2 combinations. Some of the mechanics, like page borrowing and gifting, follow naturally and deliberately from using the pipe as a pivot.

I think the concept remains relatively elegant in the Unix tradition (90+% value by exposing an extremely simple kernel primitive to user space), especially as compared to sendfile, but also io_uring. splice falls apart for non-blocking I/O, though. And at least sendfile has an offset parameter for block-like objects (e.g. files).

io_uring solves all of this by implementing a Go-like work scheduler in the kernel and then exposing a command queue to user space. (Both Go and io_uring use a pool of worker threads that can perform synchronous blocking task, but opportunistically make use of readiness polling or asynchronous completion for specific source/sink objects to minimize the number of active worker threads.) That's elegant in its own way, but it's totally at odds with the traditional Unix kernel/user space architecture.


How does it fall apart for non-blocking I/O?

Also splice supports offsets (to me your comment implied it doesn't).


You're right. It's been a while since I played around with it, and I confused some of the discussion here and on LWN with the splice prototype.

Regarding non-blocking I/O, the biggest issue is that splicing to or from a regular file can still block, as regular-file I/O in Linux is not asynchronous. For buffered I/O, io_uring gets around this by first checking whether an operation would block, and if so scheduling it for a worker thread to perform. User space can't implement this performance hack without running into TOCTTOU issues, so you have to choose between possibly blocking and pessimistically always pushing the operation to a thread pool.

Then there are other issues, IIRC, like the fact that readiness polling on sockets doesn't obey low water marks. Specifically, you can't set SO_SNDLOWAT on Linux; it's fixed at 1 byte. For something like splice this makes it more complex to efficiently track through polling when borrowed/gifted pages have been retired from the send queue. I think the proper method now is polling on timestamp events in the socket error queue, but this is also an indirect method. In short, it's just very difficult to track the state of pages that have been spliced, which can be problematic for splicing from a regular file (e.g. the Samba issues), but especially also for managing vmsplice buffers. In synchronous contexts, and especially simple I/O chains like a command utility implements, these aren't likely to be huge stumbling blocks; but when concurrently juggling many I/O contexts and trying to use your resources as efficiently as possible, they stick out.
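The SO_SNDLOWAT point is easy to check (assuming Linux): getsockopt reports a fixed low-water mark of 1 byte, and setsockopt refuses to change it with ENOPROTOOPT:

```python
import errno
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    # Linux hard-codes the send low-water mark at one byte.
    lowat = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDLOWAT)
    assert lowat == 1
    try:
        s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDLOWAT, 4096)
        raised = None
    except OSError as e:
        raised = e.errno
    # Not changeable on Linux: setsockopt fails with ENOPROTOOPT.
    assert raised == errno.ENOPROTOOPT
finally:
    s.close()
```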

I think these and other issues could be fixed or addressed, either with kernel changes or a user space library (or both), but in well over a decade nobody has; and now we have io_uring, all but guaranteeing they will never get fixed.


Thanks for the detailed response!


> Normally you'd modify buffers after they've been acked at the application layer.

“Normally”, with select/poll/epoll/kqueue IO, it is entirely valid to allocate a buffer, put some data in, call send(), and immediately free it. With IOCP or io_uring or another completion mechanism, you ought to leave the buffer alone until the network stack reports completion. None of this has anything to do with SMB or other application layers — it’s how the network stack works.

With splice(), you don’t know when to discard the buffer. Even if you add application-layer acks, you are potentially vulnerable to a misbehaving peer that acks too early. For example, if you spliced from memory (using an mmapped, file-based heap, although I don’t know why one would do this):

malloc() (from mmapped space)

fill in buffer

splice to pipe then splice to socket

wait for application layer ack

free

If the ack is a lie, you have a use-after-free.


With all communication being encrypted, this optimization becomes less and less useful anyway.


There's kTLS NIC offload. So in principle you could splice a file into a pipe and the pipe into a kTLS socket, have the NVMe drive DMA the data to the NIC, and have the NIC encrypt it -- the data never touches main memory or the CPU.



