
How to receive a million packets per second - _jomo
https://blog.cloudflare.com/how-to-receive-a-million-packets/
======
adekok
Nice, except recvmmsg() is broken.

[http://man7.org/linux/man-pages/man2/recvmmsg.2.html](http://man7.org/linux/man-pages/man2/recvmmsg.2.html)

    
    
        The timeout argument does not work as intended.  The timeout is
        checked only after the receipt of each datagram, so that if up to
        vlen-1 datagrams are received before the timeout expires, but then no
        further datagrams are received, the call will block forever.
    

Which makes it useless for any application that wants to service data in a
short time frame. The only way around it is to use a "self clocking" method.
If you want to receive packets at least every 10ms, set a 10ms timeout... and
then be sure to send yourself a packet every 10ms.

I've done similar tests with UDP applications. It's possible to get 500K pps
on a multi-core system with a test application that isn't too complex, or uses
too many tricks. The problem is that the system spends 80% to 90% of its time
in the kernel doing IO. So you have no time left to run your application.

Another alternative is pcap and PF_RING, as seen here:
[https://github.com/robertdavidgraham/robdns](https://github.com/robertdavidgraham/robdns)

That might be useful. Previous discussion on robdns:
[https://news.ycombinator.com/item?id=8802425](https://news.ycombinator.com/item?id=8802425)

~~~
justincormack
The Snabb switch parallelisation experiments are getting to 35 million packets
per second doing real work (encap/decap) in Linux userspace [1].

[1] [https://groups.google.com/forum/m/#!topic/snabb-devel/_vKQgC...](https://groups.google.com/forum/m/#!topic/snabb-devel/_vKQgCU29_Q)

~~~
ajross
Tunnel encapsulation is real work, but not all real work can be mapped to
tunnel encapsulation.

The point of all that work to context switch into processes to handle small
amounts of network I/O is that very often THE CORRECT SOFTWARE ARCHITECTURE is
for multiple address-space-separated processes to be doing small amounts of
network I/O. That I/O "means something" to a larger data model being
implemented by the software.

It's true that for some tasks that "look like routing" there's no point to
having that kind of external data model. The packets _are_ the data being
operated on. So there's little value in process separation and you might as
well DMA them all streamwise into a single process to do it. And that's great
stuff, but AFAICT it's really not what the linked article is about.

Ultimately, all those packets are going to end up in conventional processes,
because that's where conventional processing needs to happen. There are very
good reasons why we like our page-protected address space separation in this
world!

------
edude03
Hmm, I might be missing something here, but don't most high-performance
network applications skip the kernel for this exact reason? (i.e.
[http://highscalability.com/blog/2014/2/13/snabb-switch-skip-...](http://highscalability.com/blog/2014/2/13/snabb-switch-skip-the-os-and-get-40-million-requests-per-sec.html))

Makes me wonder how often bypassing the kernel is used in production networked
applications.

~~~
pjc50
He's "cheating" by using sendmmsg() to send many messages per system call,
reducing the number of context switches.

~~~
amluto
Barely. On a reasonably configured kernel (you need both syscall auditing and
context tracking turned off, which is doable at compile time or runtime), a
modern CPU should be able to round-trip a syscall in under 40 ns. That only
eats 4% CPU at 1M syscalls per second.

(It's slightly worse than that due to extra cache and TLB pressure, but I
doubt that matters in this workload.)

------
danpalmer
> Last week during a casual conversation I overheard a colleague saying: "The
> Linux network stack is slow! You can't expect it to do more than 50 thousand
> packets per second per core!"

> They both have two six core 2GHz Xeon processors. With hyperthreading (HT)
> enabled that counts to 24 processors on each box.

24 * 50,000 = 1,200,000

> we had shown that it is technically possible to receive 1Mpps on a Linux
> machine

So the original proposition was correct.

~~~
Anderkent
AFAIK he's only using 3 of the 24 cores:

> two cores busy with handling RX queues, and third running the application,
> it's possible to get ~650k pps

That's ~200k pps per core, so 4x the initial bet.

------
shin_lao
It's an interesting post.

If you really want to squeeze out all the performance of your network card,
what you should use is something like DPDK.

[http://dpdk.org/](http://dpdk.org/)

~~~
LinuXY
Right, though the point of the post was to dispel the fallacy that the Linux
kernel can only handle 50k pps per core. Using RDMA effectively bypasses the
kernel. He's also testing with a SolarFlare card, which doesn't support DPDK,
though it does support RDMA with OpenOnload. What I've found is that RDMA is
the "easy" part to get right, as it's fairly simple. Not every network card is
created equal with respect to how many packets it can actually pass through to
the kernel, however, partly due to the kernel driver (whether it's using NAPI
or not, driver efficiency, MSI-X support, interrupt coalescing) and partly the
card itself (onboard buffer, latency characteristics, etc.). 10G cards max out
at around 14Mpps @ 60 bytes when the kernel is involved and everything is
perfectly tuned, which should be where the card he's using falls. A generic
onboard Intel card generally maxes out around 8-10Mpps, but both would most
likely be able to hit 16Mpps @ 60 bytes if using RDMA in any form.

~~~
shin_lao
DPDK is great because it means telling the customer "you keep your network,
just buy a server with this card, and we promise you unreal throughput".

If you really want maximum throughput with RDMA, I think the best option is to
go with InfiniBand.

~~~
Galanwe
...Except it's dead.

InfiniBand was the way to go 5-10 years ago, when 10G Ethernet wasn't there
yet.

Nowadays, most of the companies that invested in IB years ago are stuck with a
dead infrastructure. It costs a lot, there is very little knowledge of the
technology, and support for it is being dropped (e.g. glusterfs). The sad
truth is most of these old IB infrastructures are now used for IPoIB ...

Source: been working in HFT firms implementing IB RDMA, then GBEth RDMA, now
proprietary NIC RDMA

~~~
shin_lao
We've seen InfiniBand in HPC; it's true that none of our customers in finance
has it.

------
jedberg
A joke answer and a serious question:

A: "Use BSD"

Q: Why is there such a strong focus on trying to get Linux network performance
when (I think) everyone agrees BSD is better at networking? What does Linux
offer beyond the network that BSD doesn't when it comes to applications that
demand the fastest networks?

ps. I think the markdown filter is broken, I can't make a literal asterisk
with a backslash. Anyone know how HN lets you make an inline asterisk?

~~~
jemfinch
Who, in 2015, agrees that BSD is better at networking?

I remember these claims being made in the late 90s, and perhaps they were true
back then, but it's been 15 years, and I would be surprised if Linux hasn't
caught up by virtue of its faster development pace, greater mindshare, and
increased corporate/datacenter usage.

So, in all seriousness: what recent, well argued essays/papers can you refer
me to so I can understand the claim that BSD networking is still better than
Linux in 2015?

~~~
martin1975
Here's an interesting kqueue vs epoll benchmark I picked up somewhere when
this topic came up:
[http://daemonforums.org/showthread.php?t=2124](http://daemonforums.org/showthread.php?t=2124)

Time for an epoll-for-kqueue swap, to make this performance debate go away for
both Linux and FreeBSD once and for all. No reason for this pissing contest.

~~~
trentnelson
Registered I/O on Windows is about three to four decades ahead, conceptually.

(As in, the stuff that facilitates registered I/O is based on concepts that
can be traced back to VMS, released in 1977. Namely, the Irp, plus, a kernel
designed around waitable events, not runnable processes.)

------
chx
Perhaps because I am not really a low-level programmer, it strikes me as odd
that "receive packet" is a call. I would expect to pass a function pointer to
the driver and be called with the packet's address every time one arrives.

~~~
colanderman
For your own sanity, you want some control over which thread is processing a
packet, and at what point it does so. The easiest way to do this is to
explicitly notify the kernel when you are in fact interested in another
packet, and whether you wish your thread of control to be suspended until such
a packet arrives if there is no next packet.

The JavaScript model of "call some callback when the currently-executing
function returns" doesn't work, because C is not event-based. The embedded
model of "interrupt whatever's happening and invoke some handler on the
current thread of control" is just an absolute nightmare to deal with.

~~~
jerf
"The JavaScript model of "call some callback when the currently-executing
function returns" doesn't work,"

I can't prove it, but at this scale, I am guessing that the cost of
constructing call frames to call into the callback will start to matter, too.
Switching into an already-existing context is probably significantly cheaper.
(Well, it's _definitely_ significantly cheaper to switch into an existing C
context than construct a Javascript call, but, that's sort of cheating. Or a
low blow. Or something like that.)

~~~
TickleSteve
Not even remotely... Setting up a call to a C function is a matter of a
handful of instructions. Setting up a context switch is hundreds.

~~~
jerf
If you're talking about kernel context switches, the JS has them too, and then
has more instructions to set up the JS call stack to boot. Setting up a
million JS call stacks is not necessarily trivial. A million here, a million
there, soon you're talking real wall-clock time.

------
samstave
I was curious to see how many pps our servers are handling...

We have an app server that currently handles 40K concurrent users per node. I
see only ~63K pps on RX:

    
    
        TX eth0: 42780 pkts/s  RX eth0: 64676 pkts/s
        TX eth0: 41570 pkts/s  RX eth0: 63401 pkts/s
        TX eth0: 41867 pkts/s  RX eth0: 63697 pkts/s
        TX eth0: 41585 pkts/s  RX eth0: 63187 pkts/s
        TX eth0: 40408 pkts/s  RX eth0: 61912 pkts/s
        TX eth0: 41445 pkts/s  RX eth0: 63299 pkts/s
        TX eth0: 41119 pkts/s  RX eth0: 63186 pkts/s
        TX eth0: 41502 pkts/s  RX eth0: 63153 pkts/s
        TX eth0: 40465 pkts/s  RX eth0: 62118 pkts/s
        TX eth0: 42105 pkts/s  RX eth0: 63986 pkts/s
    

But this is utilizing 7 of 8 cores on each node... with CPU util very low.

------
brobinson
Why even have netfilter ("iptables") loaded in the kernel at all? Won't those
two rules still have to be evaluated for each packet even if the rules are
saying not to do anything?

There are additional things at play here, too, including what the NIC driver's
strategy for interrupt generation is and how interrupts are balanced across
the available cores, whether there are cores dedicated to interrupt handling
and otherwise isolated from the I/O scheduler, various sysctl settings, etc.

There are further gains here if you want to get really into it.

------
nikropht
Actually, the Linux kernel is rather fast if used right. The MikroTik
CCR1036-series routers have a 36-core Tile CPU, each core running at 1.2GHz,
and can cram out 15 million pps.
[https://www.youtube.com/watch?v=UNwxAjJ4V4A](https://www.youtube.com/watch?v=UNwxAjJ4V4A)

RouterOS is based on the Linux kernel see
[https://en.wikipedia.org/wiki/MikroTik](https://en.wikipedia.org/wiki/MikroTik)

------
netman
The Automattic guys did some testing a few years ago with better results on
SolarFlare. I wonder where their testing ultimately ended up.
[https://wpneteng.wordpress.com/2013/12/21/10g-nic-testing/](https://wpneteng.wordpress.com/2013/12/21/10g-nic-testing/)

------
zurn
Where does the funny 50 kpps per core idea in the lead-in come from? It would
mean falling far short of 1 GigE line rate with 1500-byte packets! This is
trivially disproven by the everyday experience of anyone who's run scp over
their home LAN or a crossover cable.

~~~
majke
TCP != UDP. In TCP recv() gives you a buffer. In UDP recv() gives you a
packet.

Good luck receiving 50k pps of UDP packets, doing some processing and
_sending_ 50k pps responses back. It actually is hard.

~~~
zurn
The lead-in wasn't talking about UDP. And the article specifically shows how
to use the multiple-packets-per-call APIs (sendmmsg et al.), hitting 350 kpps
with simple single-threaded code and UDP, then 1.4 Mpps with tuning and
parallelism.

And even if you were using the slow API, even with ping:

    
    
      [7 year old desktop]$ sudo ping -q -f -c 100000 localhost
      PING localhost.localdomain (127.0.0.1) 56(84) bytes of data.
      --- localhost.localdomain ping statistics ---
      100000 packets transmitted, 100000 received, 0% packet loss, time 954ms
      rtt min/avg/max/mdev = 0.004/0.004/0.195/0.002 ms, ipg/ewma 0.009/0.005 ms
    

Google around for some iperf or netperf results that people are generally
getting with Linux.

------
bitL
Excellent article! Thanks for sharing! I am glad to learn something new today!
;-)

------
known
man ethtool

------
floridaguy01
You know what is cooler than 1 million packets per second? 1 billion packets
per second.

