Does NetBSD/OpenBSD actually have a zero-copy sosend()?
I wrote one once for FreeBSD back in the 90s for some research work I was doing. FreeBSD, even at that time, had the primitives needed for a userspace application to loan pages to the kernel for transmit. However, unless the application knew it was using a zero-copy socket, it was a bit of a mess in practice: most applications would not benefit, because they took COW page faults by re-writing memory that was loaned to the kernel (and was marked read-only). The other big problem was handling the mapping changes (e.g., marking the memory read-only and then restoring RW access). The whole thing was crying out for a better interface. I probably should have required applications to use aio_write(), or something similar. That would have removed a lot of the overhead.
Author: ad <email@example.com>
Date: Wed May 28 21:01:42 2008 +0000
Disable zero copy if MULTIPROCESSOR, until it is fixed:
- The TLB coherency overhead on MP systems is really expensive.
- It triggers a race in the VM system (grep kpause uvm/).*
It isn't available in NetBSD yet.
This reminds me that CoW filesystems (think ZFS) and writing through mmap() don't play well. You end up having to use msync(2), which many apps assume there is no need for, and msync(2) is often terribly slow (ISTR at least one system ended up doing page-at-a-time sync writes!).
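For reference, the pattern in question looks roughly like this: write through a shared mapping, then flush explicitly with msync(2) instead of assuming the data reaches the file on its own. This is a minimal sketch; `store_via_mmap` is a hypothetical helper name, and whether the explicit flush is strictly required depends on the filesystem.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: store len bytes at the start of a file through a
 * shared mapping. Returns 0 on success, -1 on failure. */
int store_via_mmap(const char *path, const void *data, size_t len)
{
    assert(len > 0); /* mmap() rejects zero-length mappings */

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* Size the file so the whole mapping is backed by it. */
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    void *map = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return -1;
    }

    memcpy(map, data, len);            /* the "write" is just a store */
    int rc = msync(map, len, MS_SYNC); /* the explicit, possibly slow, flush */

    munmap(map, len);
    close(fd);
    return rc;
}
```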
Now, what happens if you have a sosend(2) that requires the caller leave the memory alone, and the caller touches it anyways? Undefined behavior. Possible outcomes include: some CRC/hash/MAC will fail to verify, the mod will have come too late and not been included, the mod will have come soon enough to be included.
EDIT: Oops, no, the memory would not be copied to the send buffers, and since this is for IPC, we don't even need to account for buffering in this path. Also, for IPC, just sharing with the receiver doesn't work: you can't tell when the receiver will be done with the memory. You'd need the receiver to munmap() the memory when done, else you'll never know when it's done. Though presumably even sosend(2) as-is requires a munmap() on the receive side... but the docs I can find don't mention it, e.g., https://www.freebsd.org/cgi/man.cgi?query=sosend&sektion=9&m...
The mechanism for completion notification (when the data is ack'ed in the non-IPC case, or when the data is unmapped in the IPC case) would be kqueue- or epoll- or whatever-based.
In any case, I'm skeptical of CoW for this. The problem is that on today's CPUs any manipulation of memory mappings is just expensive. I'm even more skeptical of loaning w/o CoW -- its semantics rub me very much the wrong way (see ZFS experience with writing through mmaps). So I'm inclined to suspect this path is just not worthwhile for sockets. (sendfile(2) is different because there is no loaning there as the blocks to [read, if not already read, and then] send are in the same address space already.) I'll be very happy to be very wrong about this because indeed, it feels wrong that copying should be faster than loaning, and CoW feels so right.
Then those busybody CPU manufacturers gave us lots of cores, and each core its own TLB, and then we had to trash everybody's TLB whenever somebody re-mapped something. That made working by flapping page mappings slow everything else down, and UVM became a specialist technique for embedded systems small enough to have just one core, but big enough to have mapped memory.
Meanwhile, memory systems have got better and better at copying -- mostly just by adding bandwidth, but also by having lots of registers to slurp bytes into and spew out elsewhere -- to the point that gymnastics to avoid copying are often slower than just copying.