
Kernel-Bypass Networking - bbowen
https://www.godaddy.com/engineering/2019/12/10/Kernel-Bypass-Networking/
======
k_sze
Couldn't read that in Hong Kong because GoDaddy automatically redirects me to
the hk.godaddy.com domain, which doesn't have that article.

Update: If you are like me and can't see the article at your localised
GoDaddy.com website, you can (hopefully) select United States at the bottom of
the page to force GoDaddy to serve you the US site.

------
londons_explore
How important is it to bypass the kernel if the kernel doesn't get to
see/handle each packet individually?

As soon as you get the hardware to handle TCP reassembly and just wake the
kernel up once per few megabytes of data sent/received, things scale well
again.

There's work to do though - there are no systems around today that I'm aware
of which can send data from an SSD to a TCP socket (a common use case for
cache servers) without the data itself going through the CPU (despite most
chipsets allowing the network card to be fed data directly from a
PCIe-connected SSD).

~~~
nine_k
I suppose the CPU would schedule a DMA transfer from the SSD to RAM, and then
from RAM to a NIC.

For NICs that have memory-mapped buffers, it could be just one transfer.

What am I missing?

~~~
londons_explore
Nothing. That's how it would work. But today's Linux kernel doesn't do that
(even when you use the sendfile() API).

~~~
navaati
> But today's Linux kernel doesn't do that (even when you use the sendfile()
> API)

Oh? I'm quite disappointed, I thought that was the whole point of sendfile()!
What does it do, if not that?

~~~
toolslive
A simple data transfer is: SSD -> kernel -> user-space -> kernel -> NIC

sendfile() allows for: SSD -> kernel -> NIC

which is already a major improvement.
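
For the curious, a minimal sketch in C of that second path (assuming file and
conn are already-open file and TCP-socket descriptors; error handling kept
minimal):

    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Stream a whole file to a socket without the bytes ever entering
       user space: the kernel copies from the page cache straight into
       socket buffers. */
    int send_file_over_socket(int file, int conn) {
        struct stat st;
        if (fstat(file, &st) < 0)
            return -1;

        off_t off = 0;
        while (off < st.st_size) {
            /* sendfile(out_fd, in_fd, &offset, count) advances off
               by however many bytes the kernel managed to push. */
            ssize_t n = sendfile(conn, file, &off, st.st_size - off);
            if (n < 0)
                return -1;
        }
        return 0;
    }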

~~~
navaati
Right, silly me, I had forgotten userspace :). Thanks!

------
the8472
> Typically, an application using BSD uses system calls to read and write data
> to a socket. Those system calls have overhead due to context switching and
> other impacts.

On the other hand, the kernel isn't standing still; overhead reductions have
been trickling in for decades: sendfile, epoll, recv-/sendmmsg, all the
multi-queue stuff, kTLS with hardware offload, io_uring, p2p-dma. The C10k
problem was tackled back in 1999; userspace APIs can get you much further
today.
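
As one concrete example of cutting per-packet syscall overhead, a rough
sketch of batched receive with recvmmsg(2) (the port number and batch size
here are arbitrary choices; error handling trimmed):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>
    #include <netinet/in.h>

    #define BATCH 64

    int main(void) {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in addr = { .sin_family = AF_INET,
                                    .sin_port = htons(9000) };
        bind(fd, (struct sockaddr *)&addr, sizeof addr);

        static char bufs[BATCH][2048];
        struct iovec iov[BATCH];
        struct mmsghdr msgs[BATCH];
        memset(msgs, 0, sizeof msgs);
        for (int i = 0; i < BATCH; i++) {
            iov[i].iov_base = bufs[i];
            iov[i].iov_len  = sizeof bufs[i];
            msgs[i].msg_hdr.msg_iov    = &iov[i];
            msgs[i].msg_hdr.msg_iovlen = 1;
        }

        for (;;) {
            /* One syscall drains up to 64 datagrams instead of 64
               separate recvfrom() round trips into the kernel. */
            int n = recvmmsg(fd, msgs, BATCH, 0, NULL);
            if (n < 0)
                return 1;
            for (int i = 0; i < n; i++)
                printf("datagram %d: %u bytes\n", i, msgs[i].msg_len);
        }
    }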

~~~
ra1n85
The current approach is fundamentally not going to work in the long term.
100Gbps at line rate means single digits[1] of nanoseconds between frames. At
that rate, a cache miss is pretty bad.

This is all not to mention locks, or the competing functions running in most
distros (turn off irqbalance completely and watch your forwarding rate
increase).

The low-hanging fruit seems to have been picked as well - NAPI polling,
interrupt coalescing, RSS + multi-queue NICs + SMP, etc. are already out
there, and we're still struggling to do 10G line rate in the kernel...and
data centers are moving quickly to 25/100G.

[1] Edited for terrible math - 10Gbps at line rate is 67ns per packet, 100Gbps
is 6.7ns
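
If it helps, here is the footnote's arithmetic spelled out as a tiny C
program (assuming the 64B minimum frame plus 20B of preamble/SFD and
interframe gap, i.e. 84B per frame on the wire):

    #include <stdio.h>

    int main(void) {
        /* Smallest Ethernet frame on the wire:
           64B frame + 8B preamble/SFD + 12B interframe gap = 84B. */
        const double wire_bits = 84.0 * 8;            /* 672 bits   */
        const double rates[]   = { 10e9, 100e9 };
        for (int i = 0; i < 2; i++) {
            double pps = rates[i] / wire_bits;
            /* Prints ~14.88 Mpps / 67.2 ns for 10G,
               ~148.81 Mpps / 6.7 ns for 100G. */
            printf("%3.0f Gbps: %6.2f Mpps, %4.1f ns between frames\n",
                   rates[i] / 1e9, pps / 1e6, 1e9 / pps);
        }
        return 0;
    }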

~~~
danceparty
We are not struggling to do line-rate 10G in the kernel. Modern 100GbE NICs
(Mellanox, Solarflare) have happily done line rate with a stock upstream
kernel for a while now (definitely since 4.x); you only need to tune your IRQ
balancing, and you can probably get away without even doing that. If you are
buying 100GbE NICs you are also buying server-class (Xeon, Rome) processors
that can keep up.

Source: I operate a CDN with thousands of 100GbE NICs on a stock upstream LTS
kernel, with minimal kernel tuning.

~~~
ra1n85
You're saying you can forward 100Gbps at line rate (148 Mpps) through a stock
kernel?

~~~
danceparty
You can get within a few percentage points, yes.

I just tested this with two hosts running the 4.14.127 upstream kernel with
the upstream mlx5 driver and Mellanox ConnectX-5 cards, using 16 iperf
threads:

[SUM] 0.0-10.0 sec 85.1 Gbits/sec

That's pretty close with no tuning, and well beyond the 10Gb/s we mentioned
earlier.

~~~
ra1n85
16 iperf threads...sending at what packet size? Do you understand the notion
of line rate? 85Gbps at 1500B is only ~7 Mpps, which is half of 10Gbps at
line rate.

~~~
Dylan16807
Where are you getting your definitions? I have never seen "line rate" used to
refer to packets per second.

~~~
ra1n85
It implies it. Ethernet at 84B per frame is the smallest you can go, thus the
line rate - some examples:

https://events19.linuxfoundation.org/wp-content/uploads/2017/12/jim-Thompson.pdf

https://www.redhat.com/en/blog/pushing-limits-kernel-networking

https://kernel-recipes.org/en/2014/ndiv-a-low-overhead-network-traffic-diverter/

~~~
Dylan16807
"How do you fill a 100GBps pipe with small packets?"

"achieve 10 Gbps line rate at 60B frames"

"reaching line rate on all packet sizes"

Line rate is just bits per second. You have to add in a qualifier about packet
size before you're talking about packets per second.

~~~
ra1n85
Nope, I'm sorry you're not quite getting it here. The minimum Ethernet frame
is 84B on the wire - it's simple enough from there.

~~~
big_chungus
I've never heard this weird qualification for the definition of "line rate"
that it somehow requires minimum packet size, so I looked it up. The first
three sources for a quoted big-g search all imply or directly state that it's
the same as bandwidth:

https://blog.ipspace.net/2009/03/line-rate-and-bit-rate.html

https://www.reddit.com/r/networking/comments/4tk2to/bandwidth_vs_line_rate_vs_throughput_vs/

https://www.fmad.io/blog-what-is-10g-line-rate.html

Also, for gigabit networks, Ethernet packets are padded to at least 512 bytes
because of a bigger slot size:
https://www.cse.wustl.edu/~jain/cis788-97/ftp/gigabit_ethernet/index.html

~~~
Hikikomori
Line rate does imply pps at the smallest frame size in the context of
networking equipment performance. Vendors use it extensively in their docs.

64B is the minimum frame size in Ethernet; add the 8B preamble/SFD and the
12B interframe gap and it's 84B on the wire. That is the same for Ethernet,
Gigabit Ethernet, and even 100Gbit Ethernet, so that source is not correct.

https://kb.juniper.net/InfoCenter/index?page=content&id=KB14737

~~~
bogomipz
No, line rate does not "imply pps at the smallest sized frames."

Network hardware vendors always quote PPS using the smallest sizes, and that
makes sense for things like route and switch processors. Perhaps that is what
you are confusing it with.

You should reread your link a little more carefully:

">However it is also important to make sure that the device has the capacity
or the ability to switch/route as many packets as required to achieve wire
rate performance."

The key phrase there is "as required." Almost nobody needs to sustain
forwarding Ethernet frames with empty TCP segments or empty UDP datagrams in
them. In fact, many vendors will spec for an average size. Since packet size
x PPS gives you your throughput, a larger average packet size means you need
far fewer PPS to achieve line rate: at 10Gbps, 64B frames take ~14.9Mpps,
while 512B frames take only ~2.3Mpps.

------
benou
Disclaimer: I work on VPP.

The typical use cases are virtual network functions: think virtual
switches/routers used to interconnect VMs or containers, or containerized VPN
gateways, etc. It is also used for high-performance L3-L4 load balancers.

As pointed out by others, what is hard is moving small packets. TCP with
iperf is not relevant for this kind of workload. It is easy to max out 100GbE
with 1500-byte packets (roughly 8Mpps on the wire), but with 200-byte packets
(50+ Mpps) not so much. This is why they communicate about PPS, not
bandwidth.

The results seem low, but it is hard to tell without knowing the platform or
configuration. VPP can sustain 20+ Mpps per core (2 hyperthreads) on Skylake
@2.5GHz (no turbo boost).

~~~
GhettoMaestro
VPP is amazing - you made my work life much easier... for free :-). Great
work. More people should dig into high-perf open source networking.

Thank you!!

------
Thorentis
I wonder what motivated GoDaddy to research this at all, since at the end they
say that it isn't necessary to pursue their research any further. Driving
tech-minded traffic with blog posts?

~~~
angry_octet
Have you seen how much a router costs? And probably they don't need to route
any faster.

------
pjmlp
This was already an issue back at the beginning of the century at CERN, for
handling high data rates.

Here is a relatively recent paper on the kind of work being done in this
area:

https://iopscience.iop.org/article/10.1088/1748-0221/8/12/C12039/pdf

------
anonymousDan
What are the disadvantages of kernel bypass?

~~~
parliament32
You don't get kernel features anymore, and have to re-implement the ones you
need yourself.

~~~
ra1n85
Correct - things you take for granted, like ARP and TCP/IP, are completely up
to you to take care of. Further, the kernel has little visibility into most
kernel-bypass stacks, so /proc or iproute are often blind to what's
happening.

DPDK does have a "kernel NIC interface" (KNI), so you can direct packets to
the kernel.
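
For a flavour of what that looks like, a rough sketch of a DPDK poll loop
(single port, single queue, DPDK 19.x-era APIs assumed; setup error checks
omitted):

    #include <rte_eal.h>
    #include <rte_lcore.h>
    #include <rte_ethdev.h>
    #include <rte_mbuf.h>

    #define BURST 32

    int main(int argc, char **argv) {
        rte_eal_init(argc, argv);   /* EAL takes the NIC away from the kernel */

        struct rte_mempool *pool = rte_pktmbuf_pool_create(
            "mbufs", 8192, 256, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());

        struct rte_eth_conf conf = { 0 };        /* default port config */
        rte_eth_dev_configure(0, 1, 1, &conf);   /* port 0: 1 RX + 1 TX queue */
        rte_eth_rx_queue_setup(0, 0, 512, rte_socket_id(), NULL, pool);
        rte_eth_tx_queue_setup(0, 0, 512, rte_socket_id(), NULL);
        rte_eth_dev_start(0);

        struct rte_mbuf *pkts[BURST];
        for (;;) {
            /* Busy-poll the NIC: no interrupts, no syscalls, no kernel
               stack. ARP, TCP, counters - all up to the application now. */
            uint16_t n = rte_eth_rx_burst(0, 0, pkts, BURST);
            for (uint16_t i = 0; i < n; i++)
                rte_pktmbuf_free(pkts[i]);  /* a real app would parse/forward */
        }
    }

Everything above the loop is one-time setup; the loop itself never enters the
kernel, which is exactly why /proc and friends go blind.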

------
gnufx
I wonder why this sort of thing seems to be thought radical. InfiniBand RDMA
is well established, with low latency and high bandwidth. (There was DMA
between the micro-kernel-ish systems we used in the 1980s which got basically
Ethernet line speed on <1 MIPS systems, as I recall; I assume it wasn't a new
idea then.)

~~~
StillBored
And Fibre Channel, and various other protocols too. The point being that
Ethernet+TCP/IP is uniquely poor/difficult to offload, and the minimum packet
sizes are tiny.

TCP is genius for a WAN, but unlike most things designed in the past 25 or so
years, robustness precedes performance.

~~~
gnufx
Kernel bypass isn't the same thing as offload. I don't understand "minimum
packet sizes are tiny". The 1980s system was driving Ethernet, just not with
Unix/sockets.

------
brian_herman__
Berkeley not Berkely

~~~
sidpatil
https://www.youtube.com/watch?v=pKoK9znaPSw

