
How to receive a million packets per second (2015) - warpech
https://blog.cloudflare.com/how-to-receive-a-million-packets/
======
ncmncm
A thousand nanoseconds per packet? Luxury!

At 10 Gbps, with the hundred-odd-byte packets common on finance multicast
feeds, I get 73ns per packet. The amount that can be done in that time is
limited. Typically you just dump them in a big-ass ring buffer and let
processes on a bunch of other cores pick out the interesting ones to spend
more time on.

The way to get the packets is usually with some proprietary kernel-bypass
library (some have success with DPDK), and a very carefully isolated core that
the kernel is kept from interrupting, and just spins moving packets. NIC
hardware-level ring buffers are usually pretty small, so pauses make for
drops.

Many NICs can be persuaded to filter packets out to multiple ring buffers, and
then you can get multiple cores picking them off; that becomes necessary when
you get to 40 and 100 Gbps. Multiple rings, even with one core, multiply
headroom, making occasional housekeeping without drops possible.
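To make that concrete, here is a minimal sketch (in C, against the DPDK polling
API mentioned above; not anyone's production code, and the port/queue/ring
names are purely illustrative) of an isolated core that does nothing but
burst-receive packets and shove them onto a big software ring for other cores
to inspect:

    #include <rte_ethdev.h>
    #include <rte_mbuf.h>
    #include <rte_ring.h>

    #define BURST 32

    /* Runs forever on an isolated core the kernel is kept from interrupting. */
    static void rx_spin_loop(uint16_t port, uint16_t queue,
                             struct rte_ring *big_ring)
    {
        struct rte_mbuf *bufs[BURST];

        for (;;) {
            /* Poll the NIC's small hardware ring; returns at once if empty. */
            uint16_t n = rte_eth_rx_burst(port, queue, bufs, BURST);
            if (n == 0)
                continue;

            /* Hand packets to a large software ring; worker cores on other
             * CPUs dequeue and pick out the interesting ones. */
            unsigned int q = rte_ring_enqueue_burst(big_ring,
                                                    (void **)bufs, n, NULL);

            /* If the software ring is full, drop rather than stall: a stalled
             * poller overflows the small NIC ring instead. */
            while (q < n)
                rte_pktmbuf_free(bufs[q++]);
        }
    }

The loop itself is trivial; the work is in the surrounding setup (core
isolation so the kernel never preempts it, and NIC filters/RSS to spread load
across several such loops).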

~~~
jkulubya
Excuse my naïveté, but I’d love to know what kind of feeds you connect to to
pull in 40gbps of financial data. Are these basically feeds of all available
securities trading everywhere in the world?

Background is that I did some work connecting to a few smaller exchanges, but
we never ever thought we’d ever need to go past “just” 20mbps.

~~~
7532yahoogmail
Don't get too impressed here by finance. Almost everything we're talking about
here (Linux kernel developments, Solarflare or Mellanox NICs, core isolation
via cpusets) was invented elsewhere. Finance merely buys it; it's a great
client of all this tech. The only vaguely created-in-finance thing he/she
references, though not by name, is
[https://martinfowler.com/articles/lmax.html](https://martinfowler.com/articles/lmax.html),
and even that builds on ring buffers, which come from CS/EE networking itself.

~~~
ncmncm
Never heard of LMAX; looks like a Java thing.

Nobody said finance invented anything. But high performance demands produce
similar architectures everywhere.

~~~
7532yahoogmail
LMAX is a Java thing. It's a high-speed, low-latency messaging system built
for a betting operation in the UK, where in certain betting situations the
number of messages spikes explosively and needs to be handled well. Thus high
speed and reliably low latency were key.

------
tayo42
> Don't expect performance like that for any practical application without a
> lot more work.

Five years later I have servers that can do 2 million packets per second
(multi-process) without messing with NUMA or NIC settings.
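For anyone curious what "multi-process" looks like in practice, here is a rough
sketch of roughly the pattern the article describes: each worker process binds
its own UDP socket to the same port with SO_REUSEPORT and drains it in batches
with recvmmsg(). Port, batch size, and buffer sizes are illustrative only:

    #define _GNU_SOURCE            /* recvmmsg(), struct mmsghdr */
    #include <netinet/in.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/socket.h>

    #define BATCH 64
    #define PKT   2048

    /* Each worker process runs this loop on its own socket. */
    int rx_worker(uint16_t port)
    {
        int one = 1;
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

        struct sockaddr_in addr = {
            .sin_family      = AF_INET,
            .sin_port        = htons(port),
            .sin_addr.s_addr = htonl(INADDR_ANY),
        };
        if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            return -1;

        static char bufs[BATCH][PKT];
        struct iovec iov[BATCH];
        struct mmsghdr msgs[BATCH];

        for (;;) {
            memset(msgs, 0, sizeof(msgs));
            for (int i = 0; i < BATCH; i++) {
                iov[i].iov_base = bufs[i];
                iov[i].iov_len  = PKT;
                msgs[i].msg_hdr.msg_iov    = &iov[i];
                msgs[i].msg_hdr.msg_iovlen = 1;
            }
            /* Pull up to BATCH datagrams out of the kernel per syscall. */
            int n = recvmmsg(fd, msgs, BATCH, MSG_WAITFORONE, NULL);
            if (n < 0)
                return -1;
            /* ... count or process the n received packets here ... */
        }
    }

The kernel hashes incoming flows across all sockets bound with SO_REUSEPORT,
so the worker processes share the load without coordinating with each other.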

~~~
dpe82
At least for TX it's _much_ easier to get that performance today due to
relatively recent kernel changes. Unsure about RX though.

A few years ago I worked on the upload service within the YouTube Android app.
Shortly after YouTube started using QUIC for video streaming I decided it
would be interesting to experiment using QUIC for mobile uploads; my thesis
being that its BBR implementation might lead to better throughput on lossy
wireless connections and therefore faster mobile uploads. I discovered,
though, that when tethered to my dev workstation I wasn't able to break
~100 Mbit, while the native TCP stack was able to saturate the USB-C gigabit
ethernet adapter I was using. This problem got the attention of the QUIC
folks, which led to a more detailed investigation than I was capable of, and
eventually to the kernel patches described here:
[http://vger.kernel.org/lpc_net2018_talks/willemdebruijn-lpc2...](http://vger.kernel.org/lpc_net2018_talks/willemdebruijn-lpc2018-udpgso-paper-DRAFT-1.pdf)

I'd like to think my little project was the major motivator for that work but
I'm pretty sure the server efficiency gains described in that paper were a
much stronger motivation. :)
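For the curious, the user-visible piece of that work is the UDP_SEGMENT socket
option (Linux 4.18+): hand the kernel one large buffer and let it slice it into
MTU-sized UDP datagrams, instead of paying the full per-packet cost on every
sendmsg(). A minimal sketch, assuming a connected UDP socket; the fallback
defines and the 1400-byte segment size are illustrative:

    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    #ifndef UDP_SEGMENT
    #define UDP_SEGMENT 103        /* from <linux/udp.h>, kernel >= 4.18 */
    #endif
    #ifndef SOL_UDP
    #define SOL_UDP IPPROTO_UDP
    #endif

    /* Send one large buffer; the kernel segments it into gso_size-byte
     * UDP payloads, amortizing the per-packet stack traversal cost. */
    ssize_t send_udp_gso(int fd, const void *buf, size_t len)
    {
        int gso_size = 1400;       /* payload bytes per wire packet */

        if (setsockopt(fd, SOL_UDP, UDP_SEGMENT,
                       &gso_size, sizeof(gso_size)) < 0)
            return -1;

        /* One syscall, many packets (len is capped at roughly 64 KB per call). */
        return send(fd, buf, len, 0);
    }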

~~~
londons_explore
That paper is a real disappointment in the places it _doesn't_ go.

Specifically, if QUIC is to have high performance and high power efficiency,
just batching the packets isn't enough. Think of a use case: a CDN wants to be
able to send a block of data from disk to a client. It should be able to
direct that data from the SSD to the network card without the CPU touching the
data at all.

On the receive side, a mobile device should be able to download a big file to
ram or flash with the main CPU asleep most of the time.

That means the network device should be encrypting the packets, doing the
pacing, doing the retries, etc.

It may be that a QUIC-specific set of options or syscalls is best for that, or
it might be that the application should provide a "this is how to send my
data" set of bytecode which the network hardware executes repeatedly.

But just batching UDP packets is a halfway measure that isn't really a step in
the direction of that final goal.
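For contrast, plain TCP already has a kernel primitive for the CDN half of that
wish: sendfile(2) streams file data to a socket without the payload ever
entering userspace (and kTLS can keep the encryption in the kernel too). Here
is a minimal sketch of that existing TCP-side path, which QUIC currently has no
equivalent of; the function name is made up:

    #include <fcntl.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Stream a whole file down a connected TCP socket without copying
     * the payload through userspace buffers. */
    int serve_file(int sock_fd, const char *path)
    {
        int file_fd = open(path, O_RDONLY);
        if (file_fd < 0)
            return -1;

        struct stat st;
        if (fstat(file_fd, &st) < 0) {
            close(file_fd);
            return -1;
        }

        off_t off = 0;
        while (off < st.st_size) {
            ssize_t n = sendfile(sock_fd, file_fd, &off, st.st_size - off);
            if (n <= 0)
                break;             /* error, or the peer went away */
        }
        close(file_fd);
        return off == st.st_size ? 0 : -1;
    }

With QUIC, the payload has to come up into userspace to be encrypted and
packetized, which is exactly the gap being pointed at here.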

~~~
dpe82
One step at a time.

Right, that paper doesn't introduce any earth-shattering innovation; rather,
it just fixes a long-standing issue with UDP performance in the Linux kernel.
That ain't nothing.

A large motivator to develop and deploy QUIC is that by implementing much of
the stack as a userspace library the protocol can be iterated on at a pace
impossible with TCP. BBR congestion control experiments are a great example of
that; there's enormous value in quick iteration, even at the cost of compute
efficiency.

I agree with your sentiment though; I have to imagine eventually the protocol
will settle down and hardware acceleration will become a thing - but imho not
to the degree you propose.

------
mehrdadn
Curious, does anyone know how things look on Windows?

------
7532yahoogmail
OP has Solarflare NICs. So surely > 1MM pps is possible on that HW? But as you
opened with 1 million per second on the Linux stack, I gather that kernel by-
pass and such were by definition out of scope. The goal here was to always
pass through the kernel. But if so, why use Solarflare? Why not see if one can
do + 1MM pps on a nice, even beefy but not Solarflare/Mellonx NICs whose
motivation includes kernel by-pass.

------
frumiousirc
A 1 Gbps, unoptimized, commodity home network between an ancient ThinkPad T520
and a 4th-gen i7 workstation, using the ZeroMQ throughput perf test, gives more
than 1M messages per second:

        $ ./local_thr tcp://192.168.1.123:5678 100 1000000
        message size: 100 [B]
        message count: 1000000
        mean throughput: 1143708 [msg/s]
        mean throughput: 914.966 [Mb/s]

~~~
davidcuddeback
Assuming ZeroMQ batches messages like other message queue technologies, you're
comparing apples to oranges. The article is about receiving 1M _network
packets_ per second.

