
Why we use the Linux kernel's TCP stack - majke
https://blog.cloudflare.com/why-we-use-the-linux-kernels-tcp-stack/
======
raphaelj
I implemented a highly scalable user-space TCP stack as part of my Master's
thesis [1] last year.

One doesn't use a user-space network stack because the Linux network stack
is slow (it's fast), but _because it doesn't scale well to a high number
of CPUs_ (> 8 cores) [2]. This is because the kernel suffers from lock
contention when accessing the table containing the socket descriptors.

A user-space stack can be significantly faster when the application layer is
really simple (e.g. doing some very simple filtering or routing) and when it
does not share any mutable state. As soon as your application layer starts
sharing mutable state between connections (e.g. a database), you'll
start having contention issues similar to those experienced by the kernel, and
you'll gain nothing from using a user-space stack. Very often,
_applications that can benefit from such a stack can also be scaled easily
across multiple machines, and it's usually easier to keep using the Linux
stack and add more servers_.

\--

[1] [https://github.com/RaphaelJ/rusty](https://github.com/RaphaelJ/rusty)

[2]
[https://github.com/RaphaelJ/rusty/blob/master/doc/img/perfor...](https://github.com/RaphaelJ/rusty/blob/master/doc/img/performances.png)

~~~
majke
> This is because the kernel suffers from some lock contention when accessing
> the table containing the socket descriptors.

This would indicate the problem is with packet delivery to application. From
my experience even packet delivery to "filter" iptables chain is "slow".

But let's assume you are right, can you elaborate? Do you think SO_REUSEPORT
on TCP sockets can solve the contention of accept()?

[https://lwn.net/Articles/542629/](https://lwn.net/Articles/542629/)

There are some initiatives to improve SO_REUSEPORT CPU affinity, hopefully
making it even faster.

Update: I misread. Ok, so "table containing the sockets", but this is just a
large hash table, nothing too fancy... aRFS for greater locality?
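For reference, SO_REUSEPORT lets several sockets bind the same address and port, with the kernel load-balancing incoming connections across them. A minimal sketch (Linux 3.9+ only; the loopback address and backlog are illustrative):

```python
import socket

def make_listener(port):
    """Create a TCP listener whose port can be shared via SO_REUSEPORT.

    Each worker opens its own listener on the same port; the kernel then
    spreads incoming connections across them, so the workers never
    contend on a single accept queue."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    s.bind(("127.0.0.1", port))
    s.listen(128)
    return s

# Two listeners on the same port -- without SO_REUSEPORT the second
# bind() would fail with EADDRINUSE.
a = make_listener(0)                 # port 0: kernel picks a free port
port = a.getsockname()[1]
b = make_listener(port)
```

In a real server each worker process would create its own listener and then call accept() independently.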

~~~
raphaelj
A single socket can be shared by multiple cores. That means the kernel must
both protect the socket descriptor from concurrent writes and cannot
guarantee that a TCP connection is handled by a single, fixed core (CPU
affinity).

~~~
majke
> A single socket can be shared by multiple cores

Absolutely, it _can_. But everybody sane avoids that, pinning worker processes
to specific CPUs and not sharing sockets between them. The rule of thumb is
that the spinlocks on the hot path of socket access become a bunch of no-ops
if there is no lock contention.
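The pinning described above can be done from the application itself; a minimal sketch using the stdlib affinity API (Linux-only, and the one-CPU-per-worker layout is illustrative):

```python
import os

def pin_worker(cpu):
    """Restrict the calling process to a single CPU.

    With one SO_REUSEPORT listener per pinned worker, each socket's hot
    path stays on one core, so the kernel's per-socket spinlocks are
    taken uncontended (effectively no-ops)."""
    os.sched_setaffinity(0, {cpu})    # pid 0 = the calling process

# Pin this process to the first CPU it is currently allowed to run on.
first_cpu = min(os.sched_getaffinity(0))
pin_worker(first_cpu)
```

A master process would typically fork one such worker per core, passing a different `cpu` to each.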

~~~
joosters
I can't see why pinning is such an obvious choice. The kernel's scheduler may
decide that CPU 1 should be woken to handle some new packets, because CPU 2 is
busy. If you've pinned the socket to CPU 2, you may be losing out.

I get that there are trade-offs between the two modes: pinning can provide
better cache usage, you can avoid some locks (but indirectly make the kernel
do the work for you) and so on, but I don't see how you can confidently state
that pinning is the 'sane' choice.

In an ideal world, the kernel has a better overview of the network state and
CPU state, and therefore is best positioned to decide which CPU should handle
each packet.

~~~
felixgallo
The kernel scheduler has no idea which application thread handles which
socket. You can formulate an application level plan and then enforce your will
with socket and cpu pinning.

~~~
scott_s
Correct, but the difficulty is if your application must share the machine with
any other application - even short lived ones. That, I think, is what joosters
was alluding to. If the assumption that your application is the only consumer
of system resources is broken, then you may see pathological scheduling
behavior.

~~~
achamayou
You can set isolcpus to earmark some cores for your application, and let the
kernel manage the rest.
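A concrete sketch of that setup, where `isolcpus` goes on the kernel command line and the isolated cores are then handed to the application explicitly (the CPU range and application name here are illustrative):

```shell
# /etc/default/grub: remove CPUs 2-5 from the general scheduler.
# Only tasks explicitly pinned to them (taskset, sched_setaffinity)
# will run there; the kernel schedules everything else on CPUs 0-1.
GRUB_CMDLINE_LINUX="isolcpus=2-5"

# After update-grub and a reboot, start the latency-sensitive
# process on the isolated cores (application name is hypothetical):
taskset -c 2-5 ./my-network-app
```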

------
peterwwillis
Physical limitations of bandwidth on packets per second:
[https://www.cisco.com/c/en/us/about/security-center/network-...](https://www.cisco.com/c/en/us/about/security-center/network-performance-metrics.html)

Why the Linux kernel has a hard time processing more than 1-2M packets per
core per second, and patches/improvements for the kernel:
[https://lwn.net/Articles/629155/](https://lwn.net/Articles/629155/)

CloudFlare's kernel bypass blog post:
[https://blog.cloudflare.com/kernel-bypass/](https://blog.cloudflare.com/kernel-bypass/)

A paper from NTop on doing 10G line rate packet processing, the limitations,
and then-current options:
[http://luca.ntop.org/10g.pdf](http://luca.ntop.org/10g.pdf)

NetOptimizer kernel dev blog, where they've maxed out the throughput of a 10G
link using the kernel's stack, and details on latency, theoretical maximums
and how to test:
[https://netoptimizer.blogspot.com/search/label/10G](https://netoptimizer.blogspot.com/search/label/10G)

\--

The real answer to "Why do we use the Linux kernel's TCP stack?" is that
operating systems are designed to help users and programs. They are not
designed to be a custom tailored highest-performance cure-all for the highest
possible theoretical computing throughput. Using one tcp stack helps users and
programs more than each user or program using its own unique stack.

~~~
Annatar
_They are not designed to be a custom tailored highest-performance cure-all
for the highest possible theoretical computing throughput._

I beg to differ vehemently, as the FireEngine TCP/IP stack in illumos was
designed to be the highest possible performance cure-all for highest possible
throughput. I've posted the links above in another entry. That GNU/Linux's
TCP/IP stack is hitting the limits does not mean that nobody else is capable
of designing a high performance TCP/IP stack, and indeed, I have been able to
max out a 1 Gbit connection running Solaris 10 on a measly DELL R910. If the
network administrator hadn't come running to "turn it off! Turn the damn thing
off!", I would have maxed out a trunked 40 Gbit link too. I sure taught that
guy a lesson that day, never again have I heard a peep about "not ever seeing
anyone being able to max out a 1 Gbit connection with a single machine".

Only Solaris / illumos' _FireEngine_ makes it possible.

~~~
jsnell
Help me understand this. How is maxing out a 1Gbps connection with a 4U, 4
socket (so presumably 24-32 core) Xeon server supposed to be impressive?

~~~
Annatar
Considering I did it seven years ago, I think it's impressive in that context.

------
doomrobo
>With this scale of attack the Linux kernel is not enough for us. We must work
around it...we added a partial kernel bypass feature to Netmap: that's
described in this blog post. With this technique we can offload our anti-DDoS
iptables to a very fast userspace process.

What do they mean by "very fast userspace process"? If you're doing the same
thing the kernel would be doing, the userspace process should be strictly
slower due to context switches. What costs are they saving on here?

~~~
majke
The "very fast userspace process" has direct access to the hardware NIC RX
queue and does busy polling. It uses 100% CPU all the time. The process is
faster than the kernel because:

\- it does less, since it implements only a subset of iptables

\- it is single-threaded, with no locks

\- it doesn't implement TCP

\- it's small, so no iTLB misses

\- the working set size is small; we only deal with a couple of packets at a
time

\- there are no memory allocations on the hot path (no skb)

\- it does busy polling, saving the microseconds needed for an interrupt
context switch
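The busy-polling pattern in the last point can be illustrated in miniature with an ordinary non-blocking socket (a real kernel-bypass process polls the NIC RX ring directly via netmap, not a UDP socket; this sketch only shows the spin-instead-of-sleep idea, with an illustrative spin bound):

```python
import socket

def busy_poll_recv(sock, max_spins=10_000_000):
    """Spin on a non-blocking socket instead of sleeping in the kernel.

    Burning 100% of one core trades power for latency: no interrupt,
    no context switch, and the instruction/data caches stay hot."""
    for _ in range(max_spins):
        try:
            return sock.recv(2048)        # returns as soon as data lands
        except BlockingIOError:
            continue                      # nothing yet -- spin again
    return None                           # gave up (illustrative bound)

# Loopback demo: one UDP socket busy-polls while another sends to it.
rx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
rx.bind(("127.0.0.1", 0))                 # port 0: kernel picks a free port
rx.setblocking(False)

tx = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
tx.sendto(b"ping", rx.getsockname())

data = busy_poll_recv(rx)
```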

~~~
catern
It seems like the main thing making that process fast is the fact that it's
doing polling. But Linux does polling too, when there are enough packets
coming in. Likewise, if you really are dropping packets very early in the
networking stack, you wouldn't reach the TCP layer and would have a small
working set in the kernel too.

So, why not just implement this very fast userspace process as a kernel patch?
That seems much easier...

------
kieranelby
The argument about not being able to run SSH on the server seemed a bit weak;
surely you could just stick two NICs in there, one for management and one for
the user-space stack?

~~~
scott_s
I think that's less an argument and more of an example. You are correct; that
is certainly something one can do. But I think the related argument is that
your system configuration is now more complicated: you have tied your
hardware and software together, for example. For some that may not be
possible, and even if it is, some may not want to give up the abstraction that
the kernel provides.

------
Annatar
_With this scale of attack the Linux kernel is not enough for us. We must work
around it._

...Or you could just use an operating system substrate based on illumos which
utilizes the _FireEngine_ , like for instance SmartOS, instead of having to
invent workarounds or use one's own TCP/IP stack implementations:

[http://www.baylisa.org/library/slides/2005/august2005.pdf](http://www.baylisa.org/library/slides/2005/august2005.pdf)

[https://sunaytripathi.wordpress.com/2010/03/25/solaris-10-ne...](https://sunaytripathi.wordpress.com/2010/03/25/solaris-10-networking-the-magic-revealed/)

This technology has been available for over ten years now; it was designed
from the ground up to scale across multiple hardware threads for high
performance, by experts in the problem domain.

~~~
quitspamming
Quit spamming about stupid SmartOS, you try to shoehorn it in to every topic.
You're like a Mormon Missionary for SmartOS and it is super annoying.

~~~
oddsignals
I found his comment relevant and interesting enough, and judging by his
posting history SmartOS is far from the only thing he comments about. It
certainly added more to the discussion than yours did.

~~~
cyphar
An incredibly large percentage of Annatar's posting history is a
misunderstanding of something about GNU/Linux, followed by a pitch for
SmartOS. It's not the only thing they talk about, but those are the posts that
stick in my mind. While I find the history of free operating systems
fascinating, it's quite dismissive to pretend that all possible problems that
GNU/Linux faces today were solved "10+ years ago by experts in the problem
domain".

~~~
Annatar
Misunderstanding? I develop on Linux day in and day out. Care to qualify that
assertion?

 _It's quite dismissive to pretend that all possible problems that GNU/Linux
faces today were solved "10+ years ago by experts in the problem domain"._

As one of the principal kernel engineers of the FireEngine, yeah, I think
Sunay is the expert in the problem domain, having invented parallel enqueuing,
or what he terms "fanout"; and Radia Perlman, who I believe collaborated with
him on it, invented the spanning tree protocol. If that doesn't make them
subject matter experts in the TCP/IP stack domain, then I have nothing more to
add. And yes, some of the problems GNU/Linux is hitting today were solved on
Solaris more than ten years ago, others more than twenty. Solaris had large
enterprises as paying customers throughout the nineties of the past century,
and those customers both demanded and paid huge sums of money to have these
types of problems solved. So in some cases illumos has up to a 25-year head
start, and by the time GNU/Linux catches up, illumos will already be ahead, as
its development is not standing still and it has professional kernel engineers
working on the code base.

~~~
cyphar
> Care to qualify that assertion?

The most recent example I can think of is you posting about containers on
GNU/Linux[1], claiming that they were implemented primarily using cgroups (and
that the main purpose was resource restrictions). That is not true, and hasn't
been true for a long time (if ever). Yes, the very first upstream "container"
primitive was cgroups -- but that was very quickly replaced with namespaces
and cgroups took on the resource restriction role. What most people call
"containers" was always about virtualization (ie isolation), and the isolation
primitive in the Linux kernel is namespaces.

There are almost certainly more examples, but I don't feel like going through
any more of your comment history at the moment.

> And yes, some or the problems GNU/Linux is hitting today have been solved on
> Solaris more than ten, others more than twenty years ago.

Believe it or not, constraints have changed in the past 20 years. I'm not
saying that illumos doesn't have awesome technology (it does), but it is not a
panacea. I get it, you're an advocate for alternative free operating systems.
Good for you. Solaris does have a 25-year headstart -- on solving problems 25
years old. Modern computing has many more problems that weren't even conceived
25 years ago (cloud and distributed computing being the main ones, as well as
embedded devices, where Solaris can't hold a candle to GNU/Linux). So it's
very dismissive to claim that Solaris has solved all problems that may face
GNU/Linux. Both operating systems have problems they need to fix.

> and it has professional kernel engineers working on the code base

So does Linux, I'm missing your point here.

[1]
[https://news.ycombinator.com/item?id=11944847](https://news.ycombinator.com/item?id=11944847)

~~~
Annatar
> What most people call "containers" was always about virtualization (ie
> isolation), and the isolation primitive in the Linux kernel is namespaces.

There is no isolation with cgroups in Linux, that is the crux of the matter:

[https://www.youtube.com/watch?v=coFIEH3vXPw](https://www.youtube.com/watch?v=coFIEH3vXPw)

Since containers in Solaris existed before cgroups and before the entire Linux
hype, and you specifically address my "misunderstanding" (of hype), you compel
me to correct you on terminology:

containers are resource constraints, while technology like LXC and OpenVZ
provides the lightweight virtualization and _isolation_, a very important
distinction (full virtualization is achieved via Xen and KVM on GNU/Linux).
Conceptually, as a resource constraint, containers are in that sense the same
in Solaris as they are in Linux, with vastly different implementation
mechanisms, but neither provides _isolation_.

Again, and I corrected you on this before (this happens to be my problem
domain), what you think of as containers are lightweight virtual machines, as
zones in Solaris and LXC / OpenVZ in Linux, and equating cgroups and
namespaces with a lightweight virtual machine technology is conflating two
different things.

If you should have the inclination to point out my other "misunderstandings"
of Linux, an operating system I very heavily use, develop on, and engineer
for, I would be interested to learn of them.

> So does Linux, I'm missing your point here.

If they exist, I have not heard of them, read about them, or met them yet; at
any rate, since Linux has so many architectural and performance problems,
again I am compelled to conclude that those "Linux kernel engineers" are not
of the same caliber as the ones working on BSD and illumos kernels. That an
operating system, after almost twenty years of massive investment and
literally armies of programmers still cannot get basic things like startup
(init.d/systemd/other variants of startup), shutdown (trying to flush a
filesystem buffer to an unmounted filesystem), or even TCP/IP performance
right tells me it is missing kernel engineers. Enthusiasts and volunteers
tinkering with the kernel do not professional kernel engineers make, as is
evident by this entire topic of whether to bypass the kernel's TCP/IP stack
with one's own implementation, because the stack cannot deliver sufficient
performance. That is what one can call damning evidence, no matter how one
slices or dices it.

~~~
cyphar
> There is no isolation with cgroups in Linux

> containers are resource constraints

I'm going to say this one more time:

 _Linux containers use namespaces as the primary isolation mechanism -- NOT
cgroups_. You can create containers without cgroups. This happens to be my
problem space too, and you're not helping by spreading ignorance.

> equating cgroups and namespaces with a lightweight virtual machine
> technology is conflating two different things.

Finally you mention namespaces. Who mentioned "lightweight virtual machines"?
Namespaces are just tags for a process that are used to scope operations to
provide isolation. Cgroups are different tags used to provide resource
constraints. Just because people use containers in that way at the moment
doesn't make the underlying technology just about that.

> an operating system I very heavily use, develop on, and engineer for, I
> would be interested to learn of them.

Arrogance is not an endearing quality.

> If they exist, I have not heard of them

We can play that game all day. I don't care who you have and haven't heard of,
Linux has talented kernel engineers as evidenced by the fact that Linux is
widely used for production deployments. You might not agree with what has been
built, but you can't deny that it does exist and is being used to power
production systems. Please calm down on the saltiness, sodium is bad for your
health.

------
alberth
DragonflyBSD

I really wonder how DragonflyBSD compares given that they have a lockless
network stack implemented in the kernel.

[https://www.dragonflybsd.org/~aggelos/netmp-paper.pdf](https://www.dragonflybsd.org/~aggelos/netmp-paper.pdf)

------
__b__
Previous:
[https://news.ycombinator.com/item?id=12048709](https://news.ycombinator.com/item?id=12048709)

------
chmike
dpdk and netmap are really only for applications with cooperating I/O
processes, because the queue of received packets is shared between all
processes and any of them can delete any packet.

It may not be good for CloudFlare hosting multiple web servers on the same
host, but it could be good for a database or cache server, usually run on a
LAN with 10 Gbit/s network cards.

~~~
gpderetta
Don't modern high performance network cards have multiple tx/rx queues which
are virtualizable via IOMMU?

That's a genuine question BTW, I've only a bit of experience with userspace
networking with fully cooperating processes.

~~~
jsnell
SR-IOV is good for actual virtualization, but it's pretty clumsy for trying to
create isolation within a single VM. For example:

\- Every VF you create using SR-IOV will need to have a distinct MAC (and
thus, in practice, a different IP). But what you'd usually want for this use
case is to use the same IP for all apps and do the split by destination port.

\- Another consequence of the previous point is that all apps would need to
include their own support for ARP, DHCP, etc. Doing it centralized doesn't
really work.

\- No promiscuous mode (at least on Intel NICs); you only get traffic directed
to one specific MAC address.

Now, if you didn't try to use the virtualization support but just used the
separate RX/TX queues, with something like the Flow Director deciding which
traffic gets sent to which queue, you'd get rid of the above problems. But
then you run into the issue that DPDK makes it very hard for separate
applications to access the same NIC, even on different queues.
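For what it's worth, that kind of per-queue steering can be configured with ethtool's ntuple/Flow Director rules (the device name, ports, and queue numbers here are illustrative, and the DPDK single-owner caveat still applies):

```shell
# Enable hardware flow steering (requires driver/NIC support).
ethtool -K eth0 ntuple on

# Steer TCP traffic by destination port onto specific RX queues:
ethtool -N eth0 flow-type tcp4 dst-port 80 action 2    # port 80  -> queue 2
ethtool -N eth0 flow-type tcp4 dst-port 443 action 3   # port 443 -> queue 3

# Show the installed rules.
ethtool -n eth0
```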

~~~
peterwwillis
Use VPP for namespace-specific userland applications?
[https://wiki.fd.io/view/VPP/What_is_VPP%3F](https://wiki.fd.io/view/VPP/What_is_VPP%3F)

