
SOCKMAP – TCP splicing of the future - jgrahamc
https://blog.cloudflare.com/sockmap-tcp-splicing-of-the-future/
======
jaytaylor
Articles like this are my favorite kind: a scientific approach combined with
the never-ending quest for more efficient, higher-performance software!

I love everything about it!

<3

------
eloycoto
Thomas Graf, a Cilium founder, made some comments on Twitter around this:

[https://twitter.com/majek04/status/1097485987054346240](https://twitter.com/majek04/status/1097485987054346240)

~~~
castlec
For those who don't want to jump to the Twitter thread: we should expect a
follow-up, as there should be a patch addressing the data loss and allowing
speed enhancements to be attempted without it.

Good, wholesome, internet collaboration right here.....

------
Matthias247
How is backpressure handled in the SOCKMAP solution? The description sounds
like it just attaches all incoming buffers to the outgoing side of the socket,
irrespective of how much data is already buffered. That could be bad if it
means memory usage is unbounded.

The other solutions definitely all would block if the send buffer is full.
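That blocking behavior is easy to demonstrate from userspace. A minimal sketch (not from the article; uses a non-blocking socket so the full buffer surfaces as an error instead of a stall, and note SO_SNDBUF/SO_RCVBUF values are only hints the kernel may round up):

```python
import socket

# Create a connected pair and shrink the buffers so they fill quickly.
a, b = socket.socketpair()
a.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)
b.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)

a.setblocking(False)  # a blocking send() would simply stall here instead
total = 0
try:
    while True:
        total += a.send(b"x" * 4096)  # eventually the buffers fill up...
except BlockingIOError:
    pass  # ...and the kernel pushes back: this is the backpressure signal
```

With a blocking socket the final send() would park the writer until the reader drains data, which is exactly the bounded-memory behavior in question.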

Apart from that question it was interesting to learn that io_submit actually
works on sockets too. I definitely need to read more about this one.

~~~
majke
There are a couple of interesting questions:

- backpressure

- what if both userspace and SOCKMAP are doing read() on a socket?

- can one use SOCKMAP to splice only a selected amount of data, and pick up
the rest with read()?

- what are the parser/verdict program semantics, and what can they do?

- what is the SK_MSG abstraction, and how does one benefit from it?

and more... some of this is discussed:

[http://vger.kernel.org/lpc_net2018_talks/ktls_bpf_paper.pdf](http://vger.kernel.org/lpc_net2018_talks/ktls_bpf_paper.pdf)

On io_submit, we wrote about it here:

[https://blog.cloudflare.com/io_submit-the-epoll-alternative-youve-never-heard-about/](https://blog.cloudflare.com/io_submit-the-epoll-alternative-youve-never-heard-about/)

TLDR: it allows for batching, and with IOCB_CMD_POLL it can be used as an
epoll alternative.

~~~
Matthias247
Thanks for all the articles! I read the io_submit one in the meantime. I had
hoped it would be as async as the API suggests, so it seems a bit
disappointing for my use case. Do you know whether, if I schedule multiple
writes via io_submit, it blocks until all sockets have written something, or
only until at least one has?

The fact that one has to guess at and test how these APIs behave makes me
nervous and would prevent me from using them. An async I/O system should be
very well-behaved, and not block at arbitrary points.

------
grandinj
Feels like we're heading to a future where half of userspace will recompile
itself as BPF and upload itself into kernel space to be executed there

~~~
diegocg
Jokes aside, I wonder how much time will pass until people start writing
kernel drivers in BPF

~~~
pas
XDP is sort of that for the network subsystem. (Though it doesn't let you
implement drivers, "just" protocols.) The next step could be filesystem stuff,
then libvirt drivers, etc.

But BPF is not that versatile yet; at least it has a stable ABI.

------
xmichael999
Any reason this article completely skipped over userspace tools like DPDK
and Netmap (there are many others)? Especially considering Cloudflare uses a
custom version of nginx on the DPDK stack?

~~~
majke
Apples and oranges. The most important problem with DPDK and Netmap is that
they require a dedicated network card, and we don't have a network card to
spare for each of our applications. Read more:

[https://blog.cloudflare.com/kernel-bypass/](https://blog.cloudflare.com/kernel-bypass/)

[https://blog.cloudflare.com/single-rx-queue-kernel-bypass-with-netmap/](https://blog.cloudflare.com/single-rx-queue-kernel-bypass-with-netmap/)

Also, using DPDK and Netmap kills the usual tooling (starting with basics
like tcpdump). We very much like iptables, XDP, conntrack, SYN cookies and
other technologies deeply embedded in the Linux kernel. Doing DDoS mitigation
once, on the Linux kernel, is hard enough. We don't want to redo the logic for
each of the possible kernel bypass technologies:

[https://blog.cloudflare.com/why-we-use-the-linux-kernels-tcp-stack/](https://blog.cloudflare.com/why-we-use-the-linux-kernels-tcp-stack/)

[https://blog.cloudflare.com/syn-packet-handling-in-the-wild/](https://blog.cloudflare.com/syn-packet-handling-in-the-wild/)

------
jandrese
This article had me all excited until it got to the benchmarks. Hopefully this
is just some design deficiency and not something terrible, like running BPF
bytecode simply being slower than a couple of context switches.

~~~
majke
I really really wanted to prove SOCKMAP rocks. Oh, well. Not just yet. In one
of the linked videos the SOCKMAP authors say they didn't optimize part of the
code. So there definitely is plenty of room.

I don't see any fundamental reason why SOCKMAP wouldn't be fastest. It's just
a matter of putting an effort into it.

For example, we noticed the poor splice(2) performance is due to a spinlock.
We need to figure out just why this happens, but it doesn't seem like a major
problem. Probably just a trivial regression.

~~~
jandrese
I really appreciate the effort you put into the benchmarks. Reading the first
half of the article I was excited, but also slightly dreading that I'd have to
go to the trouble of creating and running the benchmarks myself to see if it
was worth the effort.

------
amluto
> The naive TCP echo server would look like:

    while data:
        data = read(sd, 4096)
        write(sd, data)

Not if that’s the write syscall, since it can return with fewer bytes than
requested having been written. The full C code detects this case and crashes,
which looks totally wrong.

The Python-ish example should, at the very least, use sendall. But even that
is potentially suboptimal.
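A version of the loop that handles short writes explicitly (a sketch, not from the article; this is essentially what sendall() does under the hood):

```python
import socket

def echo_loop(sd):
    # Echo until the peer closes its side of the connection.
    while True:
        data = sd.recv(4096)
        if not data:
            break
        # send() may accept fewer bytes than requested; loop until the
        # whole chunk has been handed to the kernel.
        view = memoryview(data)
        while view:
            view = view[sd.send(view):]
```

Calling sd.sendall(data) instead of the inner loop gives the same behavior; the explicit loop just makes the partial-write case visible.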

------
debatem1
I'm surprised that for something this simple there isn't an easy-to-use
approach through netmap and friends.

~~~
majke
We specifically want to splice userspace TCP sockets. The Unix way. I'm also
not a fan of custom TCP/IP stacks.

[https://blog.cloudflare.com/why-we-use-the-linux-kernels-tcp-stack/](https://blog.cloudflare.com/why-we-use-the-linux-kernels-tcp-stack/)

~~~
debatem1
I see the opinion, but the kernel developer in me thinks that perhaps it would
be better for everyone if the answer to such narrow, performance-focused use
cases was not "let's develop an exotic kernel interface that doesn't really
work" but rather "go buy hardware or do it in userland".

