
Filtering millions of packets per second on commodity NICs - jgrahamc
https://blog.cloudflare.com/single-rx-queue-kernel-bypass-with-netmap/
======
majke
This blog post nicely fits as a fourth part of the series we've been releasing
since June:

https://blog.cloudflare.com/how-to-receive-a-million-packets/

https://blog.cloudflare.com/how-to-achieve-low-latency/

https://blog.cloudflare.com/kernel-bypass/

https://blog.cloudflare.com/single-rx-queue-kernel-bypass-with-netmap/

I hope this gives a bit more context.

~~~
samstave
What is the aggregate PPS Cloudflare handles now? What's the goal (aside from
infinite)?

~~~
majke
We regularly see many millions of pps per server.

You might find this interesting:

https://youtu.be/UcAygzNSxlI?t=7980

https://indico.dns-oarc.net/event/21/contribution/5/material/slides/0.pdf

~~~
jlgaddis
Awesome links, thanks!

------
j_s
_We would also like to thank Luigi Rizzo, for his Netmap work and great
feedback on our patches._

Clearly a useful contribution! The linked pull request looks like more of a
finished product; I appreciate it even more when companies include the details
of the sausage making.

------
aexaey
This gives a very interesting data point to the "open-source" vs. "free
software" debate. Normally free/libre software zealots would portray
BSD/ISC/Apache licenses as a way to never get back any downstream changes. And
yet Cloudflare did contribute back nicely to a BSD-licensed project, in a
situation where they were absolutely _not_ under an obligation to do so.

In fact, even GPLv2 would not have imposed an obligation to publish changes
here, only a super-strict GPLv3 would.

One data point, of course, hardly warrants a far-reaching conclusion; still,
that is something very nice to see.

~~~
jgrahamc
We open source stuff because it's a virtuous circle. We think other people
will look at our code and make it better!

------
revelation
What exactly is the process() doing in the sample? Or is that also commented
out in the test?

Because if the only processing here is throw-away, this still screams for an
FPGA in front of the NIC. Someone mentioned higher R&D on an FPGA solution, but
clearly there is massive R&D here in just making sure _evil_ packets don't hit
a _slow_ code path.

~~~
pjc50
You can get FPGA-based switches, e.g. from Arista. They're not cheap, but you
can do whatever you like with the packets as the bytes arrive. But for most
applications you'd stick with commodity cards for the cost.

~~~
revelation
FPGA-based switches from Arista are a gimmick of that particular vendor. 10G
Ethernet and beyond is absolutely _commodity_ in the FPGA world; every dev kit
has one.

~~~
wmf
An FPGA dev kit probably costs more than a NIC and is harder to program.

------
ju-st
Does anyone know about the current state of IP routing on commodity NICs and
Linux? Is 14M pps on 500'000 routes possible?
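
For context on where a figure like 14M pps comes from: assuming it refers to 10 Gbit/s Ethernet line rate with minimum-size 64-byte frames (the question does not say so explicitly), the arithmetic can be sketched as below. The function name and defaults are illustrative only.

```python
# Per-frame overhead on the wire that is not part of the 64-byte frame:
# 8 bytes of preamble + start-of-frame delimiter, 12 bytes of inter-frame gap.
WIRE_OVERHEAD_BYTES = 8 + 12


def max_pps(link_bps: float, frame_bytes: int = 64) -> float:
    """Upper bound on packets per second for a given link speed and frame size."""
    bits_per_frame = (frame_bytes + WIRE_OVERHEAD_BYTES) * 8
    return link_bps / bits_per_frame


if __name__ == "__main__":
    pps = max_pps(10e9)  # assumption: a single 10GbE port
    print(f"{pps / 1e6:.2f} Mpps, {1e9 / pps:.1f} ns per packet")
```

Under those assumptions the bound is roughly 14.88 Mpps, which leaves about 67 ns per packet for a route lookup on a single core.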

~~~
acd
You want to check out Brocade, Intel DPDK and 6WIND. Brocade's Vyatta router
has DPDK support, as does Juniper vMX.

http://www.slideshare.net/shemminger/dpdk-performance

~~~
ju-st
Nice, DPDK even has a library for longest prefix matching [1], but sadly there
are no published performance results.

[1] http://dpdk.org/doc/guides/prog_guide/lpm_lib.html#lpm-api-overview
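
To make "longest prefix matching" concrete, here is a minimal sketch of the lookup itself, written as a plain binary trie over IPv4 prefixes. It is not DPDK's implementation or API; the class name `LpmTable`, its methods, and the next-hop strings are made up for illustration.

```python
import ipaddress


class LpmTable:
    """Toy IPv4 routing table: binary trie, deepest matching prefix wins."""

    def __init__(self):
        # Each node is a dict with optional children "0"/"1" and an optional "hop".
        self.root = {}

    def add(self, cidr: str, next_hop: str) -> None:
        net = ipaddress.ip_network(cidr)
        # Keep only the first prefixlen bits of the network address.
        bits = format(int(net.network_address), "032b")[: net.prefixlen]
        node = self.root
        for b in bits:
            node = node.setdefault(b, {})
        node["hop"] = next_hop

    def lookup(self, addr: str):
        bits = format(int(ipaddress.ip_address(addr)), "032b")
        node = self.root
        best = node.get("hop")  # a 0.0.0.0/0 default route, if any
        for b in bits:
            node = node.get(b)
            if node is None:
                break
            best = node.get("hop", best)  # remember the longest match so far
        return best


if __name__ == "__main__":
    table = LpmTable()
    table.add("0.0.0.0/0", "default-gw")
    table.add("10.0.0.0/8", "hop-a")
    table.add("10.1.0.0/16", "hop-b")
    for ip in ("10.1.2.3", "10.2.0.1", "192.0.2.1"):
        print(ip, "->", table.lookup(ip))
```

A trie like this walks up to 32 nodes per lookup, so it only illustrates the semantics; a table meant for millions of lookups per second would trade memory for far fewer memory accesses.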

------
SixSigma
When you have to bypass your operating system to get your hardware to perform,
perhaps it is time to re-assess your choice of operating system.

~~~
bwoj
This isn't quite so much bypassing the OS as it is redefining the boundary of
the privileged space to not include the network traffic. This lets your
filtering application get the network packets directly without having to copy
them out of kernel space and into user space. This is exactly the same
technique all high-performance network devices follow presently. The ones that
aren't doing it in userspace are doing it in some sort of RTOS that doesn't
even have protected memory spaces.

~~~
SixSigma
In HPC circles this scheme is called OS bypass:

http://blogs.cisco.com/performance/mpi-newbie-what-is-operating-system-bypass

~~~
lrizzo
(netmap author here) I prefer to define netmap as a "network stack bypass"
scheme because we use as much as possible of the OS -- all the things it does
well, we do not want to reinvent. Device drivers, system calls,
synchronization support etc. are part of the kernel. Native netmap support for
a NIC only involves 300-400 lines of code, or 10% of the typical device driver.

Processes do ioctl(), mmap() and poll() for I/O - all standard system calls
implemented by the OS, there is no NIC-specific code in the application. NICs
can be switched in and out of netmap mode without reloading modules (and with
the Cloudflare patch, even sharing the two modes). There are no custom memory
pools or hugepages to reserve. Device configuration relies on ethtool and
ifconfig etc.
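
The mmap() + poll() pattern described above can be illustrated with a toy model: a ring of packet slots in a memory mapping, filled by a producer and read in place by the application once poll() reports readiness. This is not netmap's API or ring layout. The `ToyRing` class, the slot format, and the pipe used as a wake-up signal are all assumptions made for the sketch.

```python
import mmap
import os
import select

SLOTS = 4        # hypothetical ring size
SLOT_SIZE = 64   # hypothetical slot size: 1 length byte + up to 63 payload bytes


class ToyRing:
    """Toy stand-in for a packet ring shared between a producer and an application."""

    def __init__(self):
        self.mem = mmap.mmap(-1, SLOTS * SLOT_SIZE)  # anonymous memory mapping
        self.view = memoryview(self.mem)
        self.head = 0  # next slot the application will read
        self.tail = 0  # next slot the producer will fill
        # The pipe stands in for the file descriptor the application polls.
        self.rfd, self.wfd = os.pipe()

    def produce(self, payload: bytes) -> bool:
        """Play the NIC/kernel role: fill one slot and signal readiness."""
        if self.tail - self.head == SLOTS or len(payload) > SLOT_SIZE - 1:
            return False  # ring full or payload too large: drop
        off = (self.tail % SLOTS) * SLOT_SIZE
        self.view[off] = len(payload)
        self.view[off + 1 : off + 1 + len(payload)] = payload
        self.tail += 1
        os.write(self.wfd, b"\x01")
        return True

    def receive(self, timeout_ms: int = 100):
        """Application side: wait with poll(), then read the slots in place."""
        poller = select.poll()
        poller.register(self.rfd, select.POLLIN)
        if not poller.poll(timeout_ms):
            return []
        os.read(self.rfd, SLOTS)  # drain pending wake-up bytes
        packets = []
        while self.head < self.tail:
            off = (self.head % SLOTS) * SLOT_SIZE
            length = self.view[off]
            # Slicing a memoryview does not copy the packet bytes.
            packets.append(self.view[off + 1 : off + 1 + length])
            self.head += 1  # slot may now be reused by the producer
        return packets


if __name__ == "__main__":
    ring = ToyRing()
    ring.produce(b"dns query")
    ring.produce(b"syn")
    print([bytes(p) for p in ring.receive()])
```

Because the returned views point into the shared mapping, a real consumer would finish with each packet before handing its slot back, which is the price of avoiding the copy.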

This approach is what let the Cloudflare folks implement their traffic
steering with zero new code, just a couple of ethtool lines; the change they
contributed back to support the split mode is completely agnostic of the
specific NIC being used.

