
Bypassing the Linux kernel for high-performance packet filtering - jgrahamc
https://blog.cloudflare.com/kernel-bypass/
======
mavam
We've developed packet BRICKS for this: [https://github.com/bro/packet-
bricks](https://github.com/bro/packet-bricks)

Packet BRICKS is a Linux/FreeBSD daemon that is capable of receiving and
distributing ingress traffic to userland applications. Its main
responsibilities may include (i) load-balancing, (ii) duplicating and/or (iii)
filtering ingress traffic across all registered applications. The distribution
is flow-aware (i.e. packets of one connection will always end up in the same
application). At the moment, packet-bricks uses the netmap packet I/O framework
for receiving packets. It employs netmap pipes to forward packets to end-host
applications.

(Credit goes to Asim Jamshed, who pulled this off as part of an internship at
ICSI.)

~~~
tinco
So is the advantage that it's more performant than iptables + virtual
interfaces? If I use iptables to distribute traffic over virtual interfaces, do
IP headers get parsed twice by the kernel in some inefficient way?

~~~
mavam
Iptables sits in the kernel and is also not available on non-Linux platforms
like FreeBSD. With packet bricks you bypass the kernel and expose "virtual"
interfaces to your applications by means of a simple configuration. Here's an
example from the README:

    
    
    bricks> lb = Brick.new("LoadBalancer", 2)
    bricks> lb:connect_input("eth3")
    bricks> lb:connect_output("eth3{0", "eth3{1", "eth3{2", "eth3{3", "eth2")
    bricks> pe:link(lb)

This binds the packet engine "pe" to the LoadBalancer brick and asks the system
to read ingress packets from eth3 and split them flow-wise based on the 2-tuple
(src & dst IP addresses) of the packet header. The "lb:connect_output(...)"
command creates four netmap pipes named "netmap:eth3{x", where 0 <= x < 4, plus
an egress interface named "eth2". Traffic is split evenly across all five
channels based on that 2-tuple, and userland applications can then attach to
packet-bricks to get their fair share of ingress traffic. The final line links
the brick with the packet engine.
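
To illustrate what "flow-wise" splitting means in practice (this is only a
sketch of the general idea, not packet-bricks' actual hash function, and the
names are made up):

    #include <stdint.h>
    
    /* Map a packet's (src, dst) IPv4 address pair to one of n_outputs
     * channels.  Because the bucket depends only on the 2-tuple, every
     * packet of a given flow lands on the same output. */
    static unsigned pick_output(uint32_t src_ip, uint32_t dst_ip,
                                unsigned n_outputs)
    {
        uint64_t h = ((uint64_t)src_ip << 32) | dst_ip;
        h ^= h >> 33;                  /* cheap 64-bit avalanche mix */
        h *= 0xff51afd7ed558ccdULL;
        h ^= h >> 33;
        return (unsigned)(h % n_outputs);
    }

With the configuration above, n_outputs would be 5: the four netmap pipes plus
eth2.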

~~~
eikenberry
So the goals of packet bricks are portability and ease of configuration, not
performance gains?

~~~
mavam
The goal is to have both. In fact, the whole point of kernel bypass is
performance, so just having ease of configuration would defeat the point.

We're using packet bricks primarily for high-performance network monitoring in
environments with more than 10 Gbps aggregate upstream traffic.

------
awgn
In this essay the author forgot to mention PFQ, which at the time of writing
represents a performant and innovative approach to packet capture and in-
kernel functional processing of packets (running on top of vanilla drivers).
The software is available at www.pfq.io.

~~~
tuukkah
PFQ is impressive: "Rx and Tx line-rate on 10-Gbit links (14,8 Mpps), on-top-
of Intel ixgbe vanilla drivers." [http://www.pfq.io/](http://www.pfq.io/)

------
lukego
Great perspective.

Thinking of the future: Recent experience suggests that we are able to do
around 50-100 Mpps of traffic dispatching in Snabb Switch using one CPU core.
I suspect that dedicated software traffic dispatchers will displace hardware
(RSS, VMDq, etc) in the immediate future.

We are planning to explore this soon in the context of software dispatching
for 100Gbps ethernet ports.

~~~
wmf
So would you recommend giving Snabb the NIC, letting it drop the DDoS traffic,
and then injecting what's left into the kernel stack via a tap interface or
something?

~~~
lukego
Great question actually.

That is one solution: take a 10G port into Snabb Switch, filter and sort the
traffic, then feed a slice to the kernel e.g. via a tap device with multiqueue
/dev/vhost-net acceleration (same interface that QEMU/KVM uses).
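
For readers who haven't used the tap path before: a user-space process can open
a (multiqueue) tap device and write raw Ethernet frames into it, and the kernel
then handles them as if they had arrived on a NIC. A minimal sketch of just the
device setup (error handling trimmed, the device name is only an example; the
vhost-net acceleration mentioned above is an additional step not shown):

    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/if.h>
    #include <linux/if_tun.h>
    
    /* Open one queue of a multiqueue tap device.  Ethernet frames
     * written to the returned fd are injected into the kernel stack. */
    static int open_tap_queue(const char *name)
    {
        struct ifreq ifr;
        int fd = open("/dev/net/tun", O_RDWR);
        if (fd < 0)
            return -1;
    
        memset(&ifr, 0, sizeof(ifr));
        ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_MULTI_QUEUE;
        strncpy(ifr.ifr_name, name, IFNAMSIZ - 1);
        if (ioctl(fd, TUNSETIFF, &ifr) < 0) {
            close(fd);
            return -1;
        }
        return fd;
    }

Opening the same device name several times gives you one fd per queue, which is
what makes the multiqueue part work.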

The risk I see here is that maybe it is even harder to tune your kernel when
it is using a software interface for I/O instead of a hardware one. The kernel
can depend on so many things (multiqueue, TSO, LRO, checksum, encap offload,
etc) that can behave differently between hardware/software NICs and you would
need to be confident that this will work out well. Otherwise the risk is that
you take your hardest problem - tuning the kernel - and make it even harder.

If we are only talking about 2x10G ports per server then one alternative would
be to connect Snabb Switch to the kernel with a physical 10G port instead of a
software one. That is, separate the DDoS-protecting frontend (Snabb) from the
backend (kernel) with a network cable. You could still run the Snabb
application on the same server but with a dedicated network card cabled
directly to the kernel. (You could also run it on a different server if you
prefer.)

End of off-the-cuff braindump :-)

------
ck2
For those not at CloudFlare's level who don't want to reinvent the wheel, try
ipset instead of plain iptables rules

[http://daemonkeeper.net/781/mass-blocking-ip-addresses-
with-...](http://daemonkeeper.net/781/mass-blocking-ip-addresses-with-ipset/)

TLDR: [http://daemonkeeper.net/wp-
content/uploads/2012/05/ipset4.pn...](http://daemonkeeper.net/wp-
content/uploads/2012/05/ipset4.png)

------
Symmetry
This reminds me of the work the Arrakis folks are doing:

[https://www.usenix.org/system/files/conference/osdi14/osdi14...](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-
peter_simon.pdf)

[https://www.youtube.com/watch?v=WG3b2hE4i6U](https://www.youtube.com/watch?v=WG3b2hE4i6U)

------
s1m0n
The article appears to be inaccurate. Why? AFAIK it's possible to make use of
netmap on a box with a single NIC. I tried this out for myself about two years
ago on a VMware virtual machine. How does it work? A userland packet filter
can "forward" certain packets on to the kernel -- e.g. ssh packets in my case
-- while the others stay in shared memory for kernel bypass. This means that I
can ssh to the box and the ssh packets flow via the kernel, while the rest of
the packets bypass the kernel, but all packets flow over the same NIC. Nice :-)
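
For anyone who wants to try this: the mechanism is netmap's host rings. Besides
the hardware rings, netmap exposes a software ring attached to the kernel stack
(opened with a trailing "^"), so a userland filter can push selected packets
back up. A rough sketch using the netmap helper API (the interface name and the
filter are placeholders):

    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>
    
    /* Placeholder filter: a real one would parse the headers and match,
     * say, TCP port 22 so SSH keeps working through the kernel. */
    static int is_for_kernel(const unsigned char *buf, unsigned len)
    {
        (void)buf; (void)len;
        return 0;
    }
    
    static void filter_loop(void)
    {
        struct nm_desc *nic  = nm_open("netmap:eth0",  NULL, 0, NULL);
        struct nm_desc *host = nm_open("netmap:eth0^", NULL, 0, NULL);
        struct nm_pkthdr hdr;
        unsigned char *buf;
    
        for (;;) {
            while ((buf = nm_nextpkt(nic, &hdr)) != NULL) {
                if (is_for_kernel(buf, hdr.len))
                    nm_inject(host, buf, hdr.len);  /* hand to the kernel */
                /* else: handle in userspace, bypassing the kernel */
            }
            /* a real program would poll() both fds here to sync the rings */
        }
    }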

~~~
majke
Well, sure. You can use netmap and use a "host ring" to inject the packets
back into the kernel. While good for a toy app, this won't work for real
workloads. For example, the host ring doesn't support multiple RX queues. The
article goes into detail on how to work around that: leave most of the network
flows to be dealt with by the kernel, and opt into kernel bypass only for
selected flows.

~~~
s1m0n
This is the inaccurate sentence: "Snabbswitch, DPDK and netmap take over the
whole network card, not allowing any traffic on that NIC to reach the kernel."
Obviously with netmap traffic to the NIC may reach the kernel...

~~~
majke
There are many ways to inject packets back into the kernel: tuntap, a raw
socket on loopback, a "dummy" device, etc. So by that measure you can always
make packets reach the kernel.

There are two problems with the "take over the NIC" techniques:

1) I don't believe you can actually push, say, 2M pps back into the kernel with
any of these techniques. There is a reason RSS exists, and even if you can
process 10M pps on one CPU, that doesn't mean it's easy to insert those packets
back into the kernel.

2) I don't think putting a piece of custom code between CloudFlare's kernel and
the network card is feasible at the architectural level. Do you really want to
stand in the way and have to actively forward all of those packets?

~~~
s1m0n
The title of the article does not mention CloudFlare; only bypassing. The fact
that the CloudFlare architecture pushes a higher bandwidth of packets into the
network kernel and bypasses the rest does not make it a good technique or to
be recommended. If you are primarily interested in the best performance with a
single NIC solution then I believe it is suboptimal. Why? You are asking the
CPU to do two different types of work; optimized and unoptimized. Because of
cache line pollution then the "unoptimized" work via the network kernel will
pollute the other work. I may be wrong but I would bet you'd get better
performance by separating your CloudFlare specific workload onto two boxes,
each with one NIC. In this scenario then no cache line pollution can occur. Of
course, these two boxes might not be easily possible within three existing
CloudFlare architecture. But this has nothing to do with the general idea of
packets bypassing the kernel. After the bypass you want the CPU to process
those packets in the most efficient way...

------
ilanco
The article doesn't mention why the kernel is so slow at processing network
packets. I'm not a kernel programmer, so this may be utterly wrong, but
wouldn't it be possible to sacrifice some features for speed by disabling them
in the kernel code?

~~~
acdha
One thing to remember is that high-speed packet filtering is an unusual
workload and CloudFlare operates at a much greater scale than most of us see:
most Linux devices are not connected to 10G, much less 100G, networks, and
they're usually doing more work than looking at a packet to decide whether to
accept or reject it. The fact that the APIs and the kernel stack were designed
many years before those kinds of speeds were possible doesn't matter because
most sites don't have that much traffic and most server applications will
bottleneck on other work well before that point.

The example in the article found a single core handling 1.4M packets per
second. If you're running a web server shoveling data out to clients, those
packets are going to be close to the maximum size, which, if I haven't screwed
up the math, looks something like this:

1.4M * 1400 bytes (assuming a low MTU) * 8 (bytes -> bits) ≈ 15.7 Gbps

That's not to say that there isn't still plenty of room to improve and, as
lukego noted, there's a lot of work in progress (see e.g.
[https://lwn.net/Articles/615238/](https://lwn.net/Articles/615238/) on work
to batch operations to avoid paying some of the processing costs for every
packet) but for the average server you'd find bottlenecks on something like a
database, application logic, request handling, client network capacity, etc.
before the network stack overhead is your greatest challenge. The people who
encounter this tend to be CDN vendors like CloudFlare and security people who
need to filter, analyze, or generate traffic at levels which are at least the
scale of a large company (e.g.
[https://github.com/robertdavidgraham/masscan](https://github.com/robertdavidgraham/masscan)).

~~~
scurvy
Sorry if I sound mean, but this is just a long apologist post about how things
are just so hard. Really? Why? Why can't Linux match BSD's performance?

Also, 10G servers are not rare by any means. Take a look around the next time
you walk into a colo: 10G servers everywhere.

~~~
acdha
> Sorry if I sound mean, but this is just a long apologist post about how
> things are just so hard. Really? Why? Why can't Linux match BSD's
> performance?

Mostly I just wish you'd read it again: you appear to have missed the part
where I said that this is a real problem that needs working on.

The point I was making is that it's not a problem for most Linux users. The
Linux installed base includes millions of devices attached to sub-100Mb
networks, but even if you look solely at modern data centers, ask yourself how
many of them are running network-limited applications or providing services to
internet users over an uplink that is actually fast enough to stress modern
hardware. For all but the most demanding users, the Linux vs. BSD decision
will be made on other factors.

~~~
scurvy
Those millions of sub-100Mb network devices aren't pushing Linux forward,
though. Sure, they're using Linux, but they're not the ones pushing it
forward; it's colo and datacenter use cases that are. A basic MySQL OLTP load
with SSDs and a modicum of compute power will easily saturate a 10G NIC, and
performance starts to die shortly after that due to network/kernel overhead.
I'd really like to get a lot more out of my existing hardware if I could.

~~~
acdha
You mean “pushing it forward in this particular direction”. Again, it's great
to have people working on this, and better if the companies using Linux
support developers working on the problem, but it's only a deal-breaker for a
much smaller number of people.

Just to use an example which you see a lot today: how many of the developers
jumping on Docker care that much about network performance at this level? I
would argue that continued development of the container system has done more
to boost Linux usage than low-level networking performance, even though both
are entirely legitimate concerns worth developer sponsorship. (ZFS has pulled
in the opposite direction for people who care about storage.)

From the other direction, imagine if *BSD had gotten serious about package
management by the mid-to-late 90s, when it was obvious how much better the
experience was on Debian, so that a generation of developers wasn't trained to
favor Linux just to avoid getting sucked into dependency management. That
doesn't have anything to do with the kernel, but it mattered more for many,
many people. This would have been really interesting if kfreebsd had hit
critical mass and made the cost of switching that much lower.

------
acconsta
I feel like there's a disconnect between the HPC community, which has publicly
deployed these techniques for years, and the broader tech community. Even some
enterprise hardware uses InfiniBand (with kernel bypass) these days.

Yet you never hear about Google or AWS using kernel bypass in their load
balancers, for example (possibly a trade secret, possibly the result of Linux
monoculture).

~~~
lukego
I reckon that networking is transitioning from being a systems programming
problem (interrupt, switch to kernel, grab packet, process quickly, switch
back) to being an HPC problem (an infinite stream of packets arriving in
memory).

ISPs are where I expect to see the disruption, with HPC-oriented x86 servers
proving supremely capable of handling work previously done by specialized
hardware.

------
revelation
At this point, they could presumably just use FPGAs. There are plenty of dev
boards with 10Gb+ interfaces precisely because FPGAs are such a good fit for
this kind of processing.

~~~
majke
Please factor in development and support cost. There is a reason why people
prefer general purpose computers to dedicated hardware. I think only big
players like Google can afford dedicated hardware teams.

~~~
__d
Not to disagree about the need to factor in the dev cost, but there's plenty
of companies smaller than Google doing TCP stacks on FPGAs.

Pretty much all high-frequency trading today uses FPGAs, for instance, and
that's often with teams of fewer than ten people.

~~~
majke
This is very exciting. Can you give some examples?

~~~
corysama
[http://www.argondesign.com/case-studies/2013/sep/18/high-
per...](http://www.argondesign.com/case-studies/2013/sep/18/high-performance-
trading/)

[https://www.youtube.com/watch?v=uDy_8Q0GdTk](https://www.youtube.com/watch?v=uDy_8Q0GdTk)

IIRC, the FPGA has been incorporated into a switch. When a market data packet
starts to arrive, the system starts sending a response packet before the input
packet has completely arrived and before the system has actually made a
decision. While the input packet is read, the system decides whether or not it
will cancel the response market order by intentionally corrupting the checksum
of the output packet at the last possible instant.

------
stzup7
I've been following these posts for a while and it looks to me like they've
decided to go with Solarflare but haven't really explained why. I would be
interested to see a fair comparison between Solarflare & OpenOnload and their
competitors such as Dolphin & SuperSockets, Mellanox & VMA, and Chelsio &
RDMA. Also, if they're looking at pure PPS stats, a good FPGA with a built-in
hardware TCP/IP stack could be a very powerful filter.

~~~
masklinn
Have they? They just noted that Solarflare's proprietary library EF_VI has an
interesting approach to kernel bypass, then indicated that you can replicate
that approach on other NICs and showed how.

~~~
stzup7
Right. The fact that it's the only proprietary platform tested biased my mind
too quickly :)

~~~
__d
It's worth noting that OpenOnload is GPLv2.

The issue is that Solarflare has patents on some of the underlying techniques.
They were very open to licensing those patents for a reasonable figure when I
last talked with them about it, but that wouldn't work for a general-purpose
OSS project.

------
wslh
( _I wasn't expecting so many downvotes for this question_)

I am curious about packet filtering in Windows. Anyone on HN with experience?

Now, in my company we are doing tests using different methods: WinPcap, WFP,
NDIS. WinPcap is the winner in a VM, but we will start testing with real
10Gbps Ethernet cards next week.

~~~
Noxwizard
For my Master's work, I needed high-speed tx/rx on Windows and looked into the
same things you did. I can't find the statistics for the tests I ran, but
WinPcap's speeds weren't much better than Winsock's, which was fairly poor.
The solution I used was an NDIS kernel filter and protocol driver which pushed
the packets into user-space memory. Luigi Rizzo has recently added a Windows
port of netmap to his repository, so you might want to look into that:
[https://github.com/luigirizzo/netmap](https://github.com/luigirizzo/netmap)

~~~
trentnelson
Did you look into registered I/O?

[https://technet.microsoft.com/en-
us/library/Hh997032.aspx](https://technet.microsoft.com/en-
us/library/Hh997032.aspx)

------
karthick18
Good post.

And I have performance numbers with OVS-DPDK that make kernel bypass
compelling, since it's off the charts compared with the kernel datapath.

For those interested, my ovs-dpdk experiments, which also include patches, a
README, etc. so others can reproduce them, can be found here:

[https://www.dropbox.com/sh/nfe70cgksmy543k/AABD_0qsQ15e2GItX...](https://www.dropbox.com/sh/nfe70cgksmy543k/AABD_0qsQ15e2GItXDNCpxBaa?dl=0)

The perf results directory has the ovs-dpdk numbers for all the use-cases in
the dataplane performance PDF that you might be interested in.

It has use-cases covering up to 11 VMs or 11 containers with 11 IP flows to
measure dataplane performance.

Also, you really need a server with 1GB hugetlb support (and to enable that
for the guest as well) to extract maximum performance.

Expected I guess ...

------
s1m0n
Wouldn't it be more efficient to just have all or nearly all packets bypass
the network kernel? Why compromise?

~~~
mavam
That's exactly the idea behind packet bricks (see the other comment): you have
one single tool that takes all the packets directly from the NIC (say eth0)
and then exposes them, according to your bricks configuration, to a bunch of
other interfaces (eth0}0, eth0}1, etc.). Very similar to Click. This layer of
indirection obviates the need for shared NIC access, which is what CloudFlare
works around in a more cumbersome way.

~~~
s1m0n
Looks interesting. I'll take a look. You might also be interested in mTCP
([https://github.com/eunyoung14/mtcp](https://github.com/eunyoung14/mtcp)), or
possibly adding mTCP functionality to packet bricks?

~~~
mavam
Excellent follow-up, because the main developer of packet bricks is also a co-
author of mTCP :-).

~~~
s1m0n
:-)

------
amelius
> I do hope an open source kernel bypass API will emerge soon

Or how about a faster kernel? :)

~~~
acconsta
The cost of kernel I/O isn't just the direct cost of time spent in the kernel.
Even a really, really fast kernel will pollute the CPU caches and TLB.

------
inversionOf
Given the overhead of context switches, is it possible to take a general-
purpose application like nginx and use a user-mode TCP stack? For instance, if
I had a network adapter that is solely dedicated to nginx and didn't need any
of the kernel's TCP services. Is this even a viable consideration?

I've done high-performance nginx in the million-requests-per-second range
(there are situations that benefit from this, though unfortunately such
discussions always get waylaid by people insisting that performance doesn't
matter), but there is enormous system overhead at this rate that I'd like to
get around.

~~~
jsnell
It's possible; you'd need to override the relevant system calls with an
LD_PRELOADed library.
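
The interposition mechanism itself is simple enough: build a shared library
that defines the libc symbols you want to take over and resolves the originals
via dlsym(RTLD_NEXT, ...). A bare-bones sketch of the idea; is_bypass_fd() and
my_stack_recv() are made-up stand-ins for whatever the user-mode stack
provides:

    /* Build:  cc -shared -fPIC -o preload.so preload.c -ldl
     * Run:    LD_PRELOAD=./preload.so nginx ...                  */
    #define _GNU_SOURCE
    #include <dlfcn.h>
    #include <sys/socket.h>
    
    /* Hypothetical hooks provided by the user-mode TCP stack. */
    int     is_bypass_fd(int fd);
    ssize_t my_stack_recv(int fd, void *buf, size_t len, int flags);
    
    static ssize_t (*real_recv)(int, void *, size_t, int);
    
    ssize_t recv(int fd, void *buf, size_t len, int flags)
    {
        if (!real_recv)
            real_recv = (ssize_t (*)(int, void *, size_t, int))
                        dlsym(RTLD_NEXT, "recv");
    
        if (is_bypass_fd(fd))                       /* owned by the bypass stack */
            return my_stack_recv(fd, buf, len, flags);
    
        return real_recv(fd, buf, len, flags);      /* everything else -> kernel */
    }

In a real deployment you'd have to wrap the whole socket API surface (socket,
bind, epoll, fork behaviour, and so on), which is where most of the work is.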

I don't know if a complete drop-in solution is the right approach, though. If
your application is performance-sensitive enough to require embedding a full
networking stack, you might as well make use of better APIs. For example, it'd
be silly to indirect the event dispatching through something poll/select-like;
instead you'd much rather just have the core IO loop call the handlers
directly. Or, as another example, zero-copy will be impossible with a
recv()-like interface where the client provides the buffer that the data needs
to go to, but will be trivial with an API where the network stack gives the
client a buffer that already has the data in it.
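
To make that last point concrete, here is what the two interface shapes look
like side by side; both APIs are invented purely for illustration:

    #include <stddef.h>
    #include <sys/types.h>
    
    /* recv()-style: the caller owns the buffer, so the stack has to copy
     * the payload out of its packet buffers into `buf`. */
    ssize_t stack_recv(int conn, void *buf, size_t len);
    
    /* Zero-copy style: the stack lends the caller a buffer that already
     * contains the payload; the caller releases it when done, no copy. */
    struct rx_buf {
        const void *data;   /* points into the stack's own packet memory */
        size_t      len;
    };
    int  stack_borrow(int conn, struct rx_buf *out);
    void stack_release(int conn, struct rx_buf *b);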

If you want to experiment with this, mTCP
([http://shader.kaist.edu/mtcp/](http://shader.kaist.edu/mtcp/)) is probably
the right starting point.

~~~
blibble
this is exactly how openonload works, it's pretty impressive that they get
nearly all weird behaviour of the Linux socket API correct (correct behaviour
across fork, select/poll/epoll, multicast behaviour, etc).

presentation: [http://www.openonload.org/openonload-google-
talk.pdf](http://www.openonload.org/openonload-google-talk.pdf)

