
Rob Pike discovers sendfile(2) - self
http://groups.google.com/group/golang-nuts/msg/623dd6d97a9156b0
======
luckydude
Hi, I'm the guy who came up with the splice idea. It's based on what I learned
doing this:

www.connectathon.org/talks96/bds.pdf

which was for the EIS (Earth Imaging System) project, a government effort to
image the earth about 15 years ago. That project eventually had 200MHz MIPS
SMP boxes moving data through NFS at close to 1Gbyte/sec 24x7. So far as I
know, nobody else has ever come close to that even with 10x faster CPUs.

Most of the people in this thread pretty clearly don't understand the issues
involved, Rob included (sorry, Rob, go talk to Greg). Moving lots and lots of
data very quickly precludes looking at each byte by the CPU. The only thing
that should look at each byte is a DMA engine.

Sendfile(2) is a hack, that's true. It is a subset of what I imagined
splice(2) could be (actually splice(3), the syscalls are pull(2) and push(2)).
But it's a necessary hack.

Jens' splice() implementation was a start but wasn't really what I imagined
for splice(); to really go there you need to rethink how the OS thinks about
moving data. Unless the buy-in is pervasive, splice() is sort of a wart.

~~~
aliguori
I'm assuming that the real objection to an interface like sendfile() is that
you shouldn't need to copy_from_user() just to construct an SKB if the user
address being referred to is pointing at a file-backed VMA. If we could do a
zero-copy send() from userspace, there would be no need for sendfile() in the
first place.

IOW, { data = mmap(file_fd, ...); ...; send(sock_fd, data + offset, length); }
could be just as fast as sendfile() if we were smarter.

We're starting to build infrastructure to do zero-copy networking in userspace
specifically for KVM. So far, it's not at the socket level but instead at the
macvtap level but that's arguably another problem with the networking stack--
macvtap/tap should just be an address family :-)

~~~
caf
I'm not so sure. The problem is that when send() returns, the application is
free to modify the buffer and expect that the modifications won't be visible
to the other side of the link - the data has already been "sent" as far as the
app is concerned.

So this implies that if you wished to send by DMAing directly from the page
cache to the NIC, send() would have to block - not just until the data (and
all previous buffered written data on the connection) has been passed to the
NIC, but until the TCP ACK from the other side has been received (or RST).
(Non-blocking sockets could obviously never use this method - they would have
to perform the copy).

sendfile() obviously has this limitation already, but if you were to do it for
send() you'd at least have to hide it behind a setsockopt(). Otherwise the
weird behaviour is likely to upset existing applications (eg. select() says
that write() will not block - but it does anyway, because the buffer you sent
from was mmap()ed?)

~~~
aliguori
No, TCP ACK has nothing to do with this. All that matters is when the NIC has
completed DMA'ing the packet. At that point, userspace can be unblocked.

Non-blocking sockets only make sense when there's an intermediate buffer.

~~~
mad
What happens if you need to retransmit a packet?

~~~
caf
Exactly right. With SOCK_STREAM sockets, the kernel can't unblock the process
until it knows it won't need that buffer again for a retransmission.

------
ithkuil
The unnecessary data copying problem, as Rob Pike suggests, can also be
solved by a more generic zero-copy approach, instead of adding a specific
single-purpose system call.

<http://www.linuxjournal.com/article/6345> <http://kerneltrap.org/node/294>
<http://www.cs.duke.edu/ari/trapeze/freenix/node6.html>

It has to be noted however that often the term Zero-Copy is used to describe a
technique which avoids memory copying by employing virtual memory remapping.

VM tricks are also expensive because, depending on the architecture, they
might require flushing the TLBs and can slow subsequent memory accesses. The
advantage of this zero-copy approach thus depends on several factors, such as
the amount of data being transferred to the kernel's buffers.

I don't have any recent data regarding real-world performance; any references
are welcome. However, it's far from self-evident that VM tricks can rival the
performance of a dedicated sendfile-like system call.

~~~
tptacek
For the uninitiated, the TLB is the thing that keeps your MMU hardware from
having to trawl through the page directory in memory every time it accesses a
virtual address; it's a cache, and you generally want to avoid flushing it.

~~~
sovande
Oh, that made it so much clearer :)

~~~
groks
<http://lwn.net/Articles/253361/>

------
unwind
For the uninitiated, sendfile() is a system call that transfers data between
two file descriptors. The intent is to make the kernel do the read/write cycle
instead of the application (user-level) code, thereby cutting down the number
of times the data needs to be copied between kernel and userspace memory.

The manual page: <http://linux.die.net/man/2/sendfile>. A related Linux
Journal article: <http://www.linuxjournal.com/article/6345>.

~~~
jrockway
Yeah, it's all about context switching, which is just unnecessary work.
Doesn't matter if you are serving 10 dynamic HTTP requests a day on a 16-core
super machine. Does matter if you are serving the same file 100,000 times a
second from your phone :)

(I do wish it worked for any fd to any other fd, because I have to write that
code myself rather frequently. Example: copying data from a pty to the real
terminal.)

~~~
tonfa
Would splice help? <http://en.wikipedia.org/wiki/Splice_(system_call)>

~~~
FooBarWidget
Unfortunately not: splice() requires that one of the fds is a pipe. I want to
forward data from one socket to another.

~~~
fragmede
In 2.6.31, support was added to

> Allow splice(2) to work when both the input and the output is a pipe.

[http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6...](http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=7c77f0b3f9208c339a4b40737bb2cb0f0319bb8d)

~~~
alexgartrell
This goes halfway toward solving the problem. What most people want is a pure
socket->socket splice. As evidenced by haproxy, you must still use a pipe
intermediary, requiring multiple data copies (on the plus side, it's still
fast).

    ret = splice(fd, NULL, b->pipe->prod, NULL, max,
                 SPLICE_F_MOVE|SPLICE_F_NONBLOCK);
edited to add (because I couldn't reply): it almost certainly is faster than
read()/write(), because it's straight memcpys (which are pretty fast on modern
hardware), instead of memcpys plus context switching, special read()/write()
logic, and other overhead.

~~~
caf
The pipe intermediary doesn't necessarily add a copy. You SPLICE_F_MOVE the
data from the first fd to the pipe, then SPLICE_F_MOVE it from the pipe to the
second fd. If either or both of those can be done zero-copy, they will be. The
pipe intermediary is just a way of holding onto the reference-counted pages.

------
mfukar
Rob Pike should read about D-Bus in the kernel[1], next. Maybe he'll have some
comments. No, I misspoke; I think he'll definitely have some comments.

[1]
[http://git.collabora.co.uk/?p=user/alban/linux-2.6.35.y/.git...](http://git.collabora.co.uk/?p=user/alban/linux-2.6.35.y/.git;a=summary)

~~~
wmf
There is a legitimate debate about the performance vs. elegance tradeoff that
sendfile represents. Putting D-Bus in the kernel is just a bad idea.

~~~
mfukar
I disagree. I think it's nothing but a good idea compared to its current
state, for the following reasons:

- IPC in userspace is insecure, and currently easily subverted by rootkits.

- When a system is under heavy load, the dbus daemon becomes a scheduling
choke point. "Spammy" processes can prevent messages from reaching
higher-priority processes.

- It is a _lifesaver_ for embedded devices with poor multitasking
capabilities. Think about uClinux.

- It's not for everyone. It's like PF_RING for packet capturing.

- It makes sense not only for performance, but also for reliability.

- Even though D-Bus can be improved in several other aspects, I think it's
nice that someone sat down and wrote this instead of bikeshedding on some
website.

2c

------
loewenskind
>It can be written in a few lines of efficient user code.

I'm not sure he understands what this is. There is no copying needed here at
all. The kernel could make the hard drive write to a place in memory, have the
NIC read from that place, and just manage interrupts between the two. The
kernel wouldn't have to touch the data at all. This may not be how Linux does
it (given the requirement for an mmap'able file descriptor), but it would at
least be possible. I don't think you could do anything near this in user code.

~~~
kmavm
Rob Pike deserves the benefit of your doubt when discussing UNIX
implementations.

<http://en.wikipedia.org/wiki/Rob_Pike>

The read(2)/write(2) solution from userspace would involve no copying as well,
assuming a competent kernel implementation. The only "overheads" to speak of
relative to what's possible with sendfile(2) would be those associated with
writing one contiguous page table entry for every 4KB of data to set up the
mappings; since sendfile(2) requires that you use mmap'able input, it probably
incurs the same overheads.

Fine, let's say doing the right thing with read(2)/write(2) is hard, even
really hard, and sendfile(2) is faster today. Making expedient shortcuts to
performance in _the system call API_ has historically not turned out well in
UNIX. People write software which depends on this interface, and that software
may well outlive any existing hardware and its quirks.

~~~
rlpb
> The read(2)/write(2) solution from userspace would involve no copying as
> well, assuming a competent kernel implementation.

Wouldn't the userspace application have to ensure that the buffers being
passed to read(2) and write(2) are page-aligned?

~~~
wmf
I think the kernel could copy the un-page-aligned head and tail while
remapping the body.

~~~
jedbrown
Which kernels actually do this?

Also, if I read(2), it's supposed to be putting stuff in my buffer (obtained
with malloc or whatever). It would seem very complicated to change the mapping
for addresses in this buffer. And if I write to the buffer, it must be
copy-on-write (in fact, it has to copy the page if I dereference any pointer
into it, because most hardware doesn't distinguish between read-only and
read-write at the instruction level).

It just seems that if the kernel really did this level of mapping tricks, MPI
implementations could do something better without this patch.
<http://lkml.org/lkml/2010/9/14/468>

~~~
luckydude
IRIX did page flipping on the way out to users (read(2)) and copy-on-write
page borrowing on the way in from users (write(2)). That's for networking.

For file I/O, if you used O_DIRECT, then the data was DMA'ed to/from your
pages directly; it never went through the kernel's file cache at all.

All the file performance came from getting many disk DMA engines running in
parallel.

~~~
caf
The "copy-on-write borrowing" on write() is where the problem lies. To turn
the writable pages into COWs you have to update the process's page tables to
remove write permission, which requires a TLB flush. These kinds of games are
a net loss, at least on modern hardware.

(And then the app likely reuses the buffer for the next `read()` anyway,
requiring either replacing the page or faulting in a fresh one and doing the C
in COW).

~~~
luckydude
Huh. Like to see some numbers on that. Not arguing, just a little surprised.
So you have data that shows bcopy() to be faster than the copy-on-write stuff?

As for the buffers, known problem, all the apps that used this sort of thing
cycled through N buffers for exactly that reason. I think N=4 was way more
than enough.

------
crux_
I was just doing research on using sendfile's modern zero-copy replacement(s),
splice & tee: <http://kerneltrap.org/node/6505>

They were linked elsewhere here, but worth repeating in a top-level comment.
:) Possibly a handy tool for the rare times you need to squeeze blood out of a
stone.

